# TiVo - Segmentation Analysis using _k_-means Clustering

**NOTE**: All cells that contain exercises are marked with `#EXERCISE-TODO-[NO OF EXERCISE]`

## Import General Packages

In [None]:
%pip install -r requirements.txt

In [1]:
import pandas as pd # type: ignore
import numpy as np # type: ignore
import matplotlib.pyplot as plt # type: ignore

## Read and get a feel for the data  (EXTRACT part of ETL pipeline)

### Read in from Excel file

In [2]:
dtivo_master = pd.read_excel('./data/tivo_subset.xlsx',
                             index_col=0)            #Read in the first column (ID) as the index
dtivo_master.head()

ID,Gender,Education,Income,Age,PurchasePoint,ElecSpend_Annual,Tvhours,TechAdopt,Fav_Feature
1,male,none,49,30,mass-consumer electronics,420,2,late,saving favorite shows to watch as a family
2,male,none,46,36,mass-consumer electronics,420,10,late,saving favorite shows to watch as a family
3,male,BA,58,66,specialty stores,768,0,early,time shifting
4,male,PhD,51,78,mass-consumer electronics,396,5,late,saving favorite shows to watch as a family
5,female,none,46,52,mass-consumer electronics,540,2,late,saving favorite shows to watch as a family


### Get a feel for the data

In [3]:
dtivo_master.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 1 to 1000
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Gender            1000 non-null   object
 1   Education         1000 non-null   object
 2   Income            1000 non-null   int64 
 3   Age               1000 non-null   int64 
 4   PurchasePoint     1000 non-null   object
 5   ElecSpend_Annual  1000 non-null   int64 
 6   Tvhours           1000 non-null   int64 
 7   TechAdopt         1000 non-null   object
 8   Fav_Feature       1000 non-null   object
dtypes: int64(4), object(5)
memory usage: 78.1+ KB


### Exploratory Analyses: explore possible bases for segmentation

In [4]:
dtivo_master.groupby('TechAdopt')['Fav_Feature'].value_counts()

TechAdopt  Fav_Feature                               
early      cool gadget                                   228
           schedule control                              221
           time shifting                                 221
           programming/interactive features              130
late       saving favorite shows to watch as a family    200
Name: count, dtype: int64

*Q: Did we learn anything about the product feature that is of interest to early vs. late adopters?  Which ones will you market to the premium market segment?*

Ans:

In [None]:
#EXERCISE-TODO-01: Obtain grouped value counts for 'PurchasePoint'
#and add commentary under it: what ordinal scale would be
#suitable?
dtivo_master.groupby('TechAdopt')[__________________].value_counts()


## Data Transformations  (TRANSFORM part of ETL pipeline)

In [6]:
#Make a copy of the master DataFrame to apply
#transformations to.  We may need the original master file
#so this is a good practice to apply transformations to a copy
dtivo = dtivo_master.copy()

### Factorizations

#### Fav_Feature

In [18]:
#Factorize 'Fav_Feature' into integer codes stored in 'Fav_Coded' (codes follow order of first appearance)
dtivo['Fav_Coded'], fav_uniques = dtivo['Fav_Feature'].factorize()
print(fav_uniques)


Index(['saving favorite shows to watch as a family', 'time shifting',
       'cool gadget', 'schedule control', 'programming/interactive features'],
      dtype='object')


In [19]:
dtivo.head(10)

ID,Gender,Education,Income,Age,PurchasePoint,ElecSpend_Annual,Tvhours,TechAdopt,Fav_Feature,Fav_Coded
1,male,none,49,30,mass-consumer electronics,420,2,late,saving favorite shows to watch as a family,0
2,male,none,46,36,mass-consumer electronics,420,10,late,saving favorite shows to watch as a family,0
3,male,BA,58,66,specialty stores,768,0,early,time shifting,1
4,male,PhD,51,78,mass-consumer electronics,396,5,late,saving favorite shows to watch as a family,0
5,female,none,46,52,mass-consumer electronics,540,2,late,saving favorite shows to watch as a family,0
6,female,BA,31,72,retail,168,1,early,time shifting,1
7,male,none,33,62,discount,216,0,early,cool gadget,2
8,male,none,29,30,retail,276,1,early,schedule control,3
9,male,none,57,60,specialty stores,888,0,early,schedule control,3
10,female,none,30,59,discount,192,0,early,schedule control,3


#### Education
Sometimes the ordering of the codes matters. For Education, for example, we want to preserve our own order (none < BA < MA < PhD), especially when we read the results that summarize clusters. So we "zip" the categories with our own level numbers and map them onto the column.

In [7]:
ed_types = dtivo['Education'].unique()
print(ed_types)

['none' 'BA' 'PhD' 'MA']


In [None]:
dtivo['Education'].factorize(sort=True)

In [8]:
ed_levels = [1,2,4,3]  #order numbers matched against ed_types list
dict_edtypes = dict(zip(ed_types,ed_levels))
print(dict_edtypes)

{'none': 1, 'BA': 2, 'PhD': 4, 'MA': 3}


In [None]:
dtivo['Ed_Coded'] = dtivo['Education'].map(dict_edtypes)
dtivo.head(10)
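
As an alternative sketch (not used in the rest of this notebook), pandas can carry the ordering itself through an ordered `Categorical`. The category order below is the one implied by `ed_levels`; the `toy` Series, the `ed_order` list, and the `+ 1` offset are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for dtivo['Education'] (illustrative values only)
toy = pd.Series(['none', 'BA', 'PhD', 'MA', 'none'])

# Declare the order explicitly; codes follow this order, not first appearance
ed_order = ['none', 'BA', 'MA', 'PhD']
toy_cat = pd.Categorical(toy, categories=ed_order, ordered=True)

# .codes start at 0, so add 1 to match the 1-4 levels used above
toy_coded = pd.Series(toy_cat.codes + 1, index=toy.index)
print(toy_coded.tolist())
```

Compared with the zipped dictionary, this keeps the order in one list, but any value missing from `ed_order` is coded `-1` (so `0` after the offset) and should be checked for.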

#### PurchasePoint 

In [21]:
#EXERCISE-TODO-02: Decide whether you would like to 
#code 'PurchasePoint' using simple factorization (as we showed for Fav_Feature) 
# OR by mapping your own 'zipped' dictionary.
#HINT: CONSIDER YOUR FINDINGS FROM TODO-01
# WHICHEVER WAY YOU CHOOSE, SAVE THE RESULT


#?
#?
#?
dtivo["PPoint_Coded"] = dtivo['PurchasePoint']._____________________________
dtivo.head(10)


#### Factorizing a binary - what to label the result?

In [23]:
dtivo['Gender'].head()

ID
1      male
2      male
3      male
4      male
5    female
Name: Gender, dtype: object

In [34]:
dtivo['Gender'].factorize(sort=True)

(array([1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1,
        0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
        0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
        0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1,
        0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0,
        1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0,
        1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,
        0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
        0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
        0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
        0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0,
        0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
        1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 

When factorizing a binary variable, you need to anticipate which category gets `0` and which gets `1`.  `factorize` can first sort the unique categories (`sort=True`).  Our dataset records only two genders (male and female), so this is a binary variable.  With sorting, 'female' comes before 'male' alphabetically, so females are assigned `0` and males are assigned `1`.  We therefore name the resulting column `Male`, because males are the rows coded `1`.

Without sorting, whichever category appears in the first row gets `0` and the next new category gets `1` (so, given the data above, male would get `0`).
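
A minimal sketch of the difference on a toy Series (the values are made up to mirror the first rows above, where 'male' appears first):

```python
import pandas as pd

# Toy stand-in for dtivo['Gender']; 'male' appears first, as in the data above
toy_gender = pd.Series(['male', 'male', 'female', 'male'])

codes_unsorted, uniques_unsorted = toy_gender.factorize()
codes_sorted, uniques_sorted = toy_gender.factorize(sort=True)

print(list(uniques_unsorted), codes_unsorted.tolist())  # order of first appearance
print(list(uniques_sorted), codes_sorted.tolist())      # alphabetical order
```

Printing the uniques alongside the codes is a cheap way to confirm which category received `1` before naming the column.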

In [35]:
dtivo['Male'], gender_uniques = dtivo['Gender'].factorize(sort=True)
print(gender_uniques)
dtivo.head()

Index(['female', 'male'], dtype='object')


ID,Gender,Education,Income,Age,PurchasePoint,ElecSpend_Annual,Tvhours,TechAdopt,Fav_Feature,Fav_Coded,Female,Male
1,male,none,49,30,mass-consumer electronics,420,2,late,saving favorite shows to watch as a family,0,0,1
2,male,none,46,36,mass-consumer electronics,420,10,late,saving favorite shows to watch as a family,0,0,1
3,male,BA,58,66,specialty stores,768,0,early,time shifting,1,0,1
4,male,PhD,51,78,mass-consumer electronics,396,5,late,saving favorite shows to watch as a family,0,0,1
5,female,none,46,52,mass-consumer electronics,540,2,late,saving favorite shows to watch as a family,0,1,0


### Rows to keep for clustering
At launch we are only interested in segments *within* early adopters, so we filter out the late adopters.


In [38]:
#Assuming we have done some segment targeting already - like focus on Early adopters only.
dtivo_early = dtivo[dtivo['TechAdopt']=='early'].copy()
len(dtivo_early)

800

In [39]:
outlierIDs = []   #WE CAN DROP ANY CASES THAT WE THINK ARE OUTLIERS AND WILL DISTORT OUR SEGMENTS
#EXERCISE-TODO-Nth: RETURN TO THIS AFTER RUNNING THE CLUSTER ANALYSIS
dtivo_early.drop(index=outlierIDs, inplace=True)

In [40]:
len(dtivo_early)

800

### Columns to keep for clustering

In [41]:
colsAll = list(dtivo_early.columns)
colsAll

['Gender',
 'Education',
 'Income',
 'Age',
 'PurchasePoint',
 'ElecSpend_Annual',
 'Tvhours',
 'TechAdopt',
 'Fav_Feature',
 'Fav_Coded',
 'Female',
 'Male']

We call the chosen variables 'descriptors': these are the variables that we believe will genuinely help us assign customers to segments.  Descriptors should be things that we can know about the customer before the marketing action takes place.  For example, before selling our product, will we know their purchase point?

In [44]:

colsToDrop = ['Education',      #factorized, only keeping its coded form
              'PurchasePoint',  #factorized, only keeping its coded form
              'Tvhours',        #Dropping to keep it simple
              'TechAdopt',      #Not needed after filtering out TechAdopt=='late'.
              'Fav_Feature',    #factorized, only keeping coded form
              'Gender',         #already have 'Male' after factorization
              'Fav_Coded',      #Dropping to keep it simple
#              'PPoint_Coded',   #Dropping to keep it simple in first run - uncomment after TODOS
                                #EXERCISE-TODO-TAKEHOME - try doing this again but with some of the variables we dropped for simplicity
             ]

dtivo_descriptors = dtivo_early.drop(columns=colsToDrop)
colsKept = list(dtivo_descriptors.columns)
print(colsKept)
dtivo_descriptors.head()

['Income', 'Age', 'ElecSpend_Annual', 'Male']


ID,Income,Age,ElecSpend_Annual,Male
3,58,66,768,1
6,31,72,168,0
7,33,62,216,1
8,29,30,276,1
9,57,60,888,1


## Normalize using Standardization

For clustering to give equal weight to each field (or 'feature' in ML terms),<br>
we must put all fields on the same scale (i.e. **normalize** the fields).

If we are interested in how our segments vary from the average (e.g. <br>
above average income, highly below average age, etc.), then we use<br>
**Standardization** for normalization.
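
As a minimal sketch of what standardization computes, the z-score can be written by hand on a toy table. The numbers are made up, and `ddof=0` (population standard deviation) is chosen here because that is what scikit-learn's `preprocessing.scale` uses.

```python
import numpy as np
import pandas as pd

# Toy descriptors (made-up numbers, not from the TiVo file)
toy = pd.DataFrame({'Income': [30.0, 50.0, 70.0],
                    'Age': [20.0, 40.0, 60.0]})

# z-score: subtract the column mean, divide by the column standard deviation.
# ddof=0 gives the population standard deviation.
toy_std = (toy - toy.mean()) / toy.std(ddof=0)

print(toy_std.round(2))
```

After this transformation every column has mean 0 and (population) standard deviation 1, so a value of `+1` reads as "one standard deviation above average" regardless of the original units. Note that `describe()` reports the sample standard deviation (`ddof=1`), which can sit very slightly above 1 before rounding.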


In [45]:
#IMPORT PACKAGE FOR STANDARDIZATION
from sklearn import preprocessing

In [46]:
#STANDARDIZED SCALING
dtivo_std = pd.DataFrame(preprocessing.scale(dtivo_descriptors),
                         columns=colsKept,                 #TODO: REVIEW in light of any changes to COLUMNS DROPPED
                         index=dtivo_descriptors.index
                        )

#only for checking if done
dtivo_std.describe().round(2)

,Income,Age,ElecSpend_Annual,Male
count,800.0,800.0,800.0,800.0
mean,0.0,-0.0,0.0,-0.0
std,1.0,1.0,1.0,1.0
min,-0.46,-1.59,-1.19,-1.11
25%,-0.26,-0.94,-0.69,-1.11
50%,-0.17,-0.0,-0.52,0.9
75%,-0.05,0.87,0.37,0.9
max,20.26,1.81,3.33,0.9


In [None]:
dtivo_std.head()

## K-means Clustering

In [62]:
#Import Required Packages
from sklearn.cluster import KMeans

### Learning Step: Start by making k=4 clusters

#### Fitting the k-means model for k=4

In [63]:
#TODO-16: Fit 4 clusters using KMeans to standardized data
k4_model = KMeans(n_clusters=4, 
                  random_state=111).fit(dtivo_std)

#### Tagging the segmentation bases dataset with cluster numbers

In [None]:
#TODO-17: Tag each row of a copy of the descriptors with its cluster number for cluster analysis
dtivo_k4 = dtivo_descriptors.copy()
dtivo_k4['k4'] = k4_model.labels_
dtivo_k4.head()

In [65]:
dtivo_k4.to_csv("dtivo_k4.csv")

### Learning Step: First Cluster Analysis

#### Basic summary table

In [None]:
#TODO-18: Select appropriate metrics to summarize the cluster features.
pivot_k4 = dtivo_k4.groupby('k4').agg(['mean', 'std']).round(1)
pivot_k4.insert(0, 'Size', dtivo_k4.groupby('k4').size())
pivot_k4

In [67]:
pivot_k4.to_excel('pivot_k4_tivo.xlsx')

In [68]:
#TODO-19: GO BACK TO 'Rows to keep for clustering' and use the following code _there_ to 
#drop the 2 records that are outliers (As k-means has some randomness in it, your cluster
#number may be different for the 2 records - please adjust code for that)

#outlierIDs = dtivo_k4[dtivo_k4['k4']==3].index
#print(outlierIDs)

#dtivo_early.drop(index=outlierIDs, inplace=True)
## Ignore the warning - we want the copy to differ from the original, so it's OK

#len(dtivo_early)
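
Because the cluster label of the outliers can change between runs, one option is to pick the least-populated cluster by size instead of hard-coding its number. This is a sketch on made-up labels, and it assumes the outliers really do form the smallest cluster; check the summary table before dropping anything.

```python
import pandas as pd

# Toy cluster labels indexed by customer ID (illustrative, not real output)
toy_k4 = pd.DataFrame({'k4': [0, 0, 1, 1, 1, 2, 2, 3]},
                      index=[3, 6, 7, 8, 9, 10, 11, 12])

sizes = toy_k4['k4'].value_counts()
smallest = sizes.idxmin()              # label of the least-populated cluster
toy_outlier_ids = toy_k4[toy_k4['k4'] == smallest].index

print(smallest, list(toy_outlier_ids))
```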

#### Summary table with Heatmap (k=4)

In [None]:
pk4mins = pivot_k4.min(0)
pk4maxs = pivot_k4.max(0)
mm_pivot = (pivot_k4 - pk4mins)/(pk4maxs-pk4mins)
mm_pivot.round(2)
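
The min-max step rescales each column of the pivot to the range 0 to 1, so that the heatmap colors are comparable across columns measured in different units. A minimal sketch on a made-up pivot (`toy_pivot` is an assumption for illustration):

```python
import pandas as pd

# Toy cluster summary (made-up numbers): one row per cluster
toy_pivot = pd.DataFrame({'Income': [30.0, 40.0, 50.0],
                          'Age': [60.0, 20.0, 40.0]})

# Column-wise min-max: (x - min) / (max - min), so every column spans 0 to 1
toy_mm = (toy_pivot - toy_pivot.min()) / (toy_pivot.max() - toy_pivot.min())
print(toy_mm)
```

A column whose values are all identical would produce `0/0`, i.e. `NaN`, so it is worth scanning the result for missing values before plotting.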

In [70]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#TODO-20: Use min-max pivot for data, 
#and actual pivot as annotations to the heatmap

fig = plt.figure(figsize=(10, 4))
pk4_map = sns.heatmap(
    mm_pivot,
    annot=pivot_k4,  #actual pivot becomes annotation
    cmap='coolwarm_r',
    cbar=False,
    fmt='.1f',
    linewidths=0.01)

pk4_map.set(xlabel='Segmentation Bases',
           ylabel = 'Cluster')
plt.show()

In [72]:
fig.savefig('heatmap_k4_tivo.png', dpi=300, bbox_inches='tight')

#### Defining clusters with 'persona' descriptions

In [None]:
pivot_k4_annot = pivot_k4.copy()
pivot_k4_annot['Persona'] = ['Budget-seeking Middle-Aged Woman',
                             'Educated Hi-Spender',
                             'Budget-seeking Middle-Aged Man',
                             'Top-buck Spending Senior'
                            ]
pivot_k4_annot

In [74]:
pivot_k4_annot.to_excel('pivot_k4_tivo.xlsx')

### Finding the best 'k'

#### Elbow Method

In [None]:
k4_model.inertia_
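
For reference, `inertia_` is the sum of squared distances from each point to the centroid of its assigned cluster; the elbow method looks at how this number falls as k grows. The sketch below recomputes it by hand on toy data (the blobs and the variable names are assumptions for illustration, not part of the TiVo analysis).

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two well-separated blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=111).fit(X)

# Inertia by hand: squared distance from each point to its assigned centroid
assigned_centres = model.cluster_centers_[model.labels_]
manual_inertia = ((X - assigned_centres) ** 2).sum()

print(round(manual_inertia, 3), round(model.inertia_, 3))
```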

In [None]:
%pip install yellowbrick

In [None]:
%pip install --upgrade matplotlib

In [81]:
# import matplotlib.pyplot as plt
# plt.rcParams['font.sans-serif'] = ['DejaVu Sans']  # or any other font that is available
# plt.rcParams['font.family'] = 'sans-serif'


In [82]:
#TODO-23: Which visualizer will we use?
from yellowbrick.cluster import KElbowVisualizer

In [None]:
#TODO-24: Set-up the KElbowVisualizer
kBegin = 3
kEnd = 15
km = KMeans(random_state=111)
visualizer = KElbowVisualizer(km, k=(kBegin,kEnd), timings=False)
visualizer.fit(dtivo_std)
visualizer.show()

#### Silhouette Score

##### Average Silhouette Score

In [None]:
#TODO-25: Change the metric to 'silhouette'
vizElbowSil = KElbowVisualizer(km,
                               metric='silhouette',
                               k=(kBegin, kEnd),
                               timings=False)
vizElbowSil.fit(dtivo_std)
vizElbowSil.show()
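
The same idea can be sketched without the visualizer by calling scikit-learn's `silhouette_score` in a loop. The toy blobs below are made up, and whether this reproduces the visualizer's curve exactly is not checked here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy 2-D data: three well-separated blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(40, 2)),
               rng.normal(5, 0.5, size=(40, 2)),
               rng.normal([0, 5], 0.5, size=(40, 2))])

# Average silhouette score for each candidate k (needs at least 2 clusters)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=111).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```

The average silhouette lies between -1 and 1; higher values mean points sit closer to their own cluster than to the nearest other cluster.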

##### Detailed Silhouette per Clustering

In [85]:
dtivo_std.to_csv("dtivo_reduced_std.csv")

In [86]:
#TODO-26: Import correct Visualizer
from yellowbrick.cluster import SilhouetteVisualizer

In [None]:
km = KMeans(n_clusters=6, random_state=111)
silViz = SilhouetteVisualizer(km, colors='yellowbrick')
silViz.fit(dtivo_std)
silViz.show()


In [None]:
#TODO-27: Setup and run the SilhouetteVisualizer
plt.clf()
kStart = 3
kStop = 7  #one past the last k for which you want a subplot (range() excludes kStop)
subx = 0  #starting row index of subplot on which to plot
fig, ax = plt.subplots(kStop - kStart, 1,
                       figsize=(7, 16), sharex=True)
fig.tight_layout(pad=5.0)
for i in range(kStart, kStop):
    ax[subx].set_title(f"Silhouette values for k= {i}")
    ax[subx].tick_params(labelbottom=True)
    km = KMeans(n_clusters=i, random_state=111)
    silViz = SilhouetteVisualizer(km, colors='yellowbrick', ax=ax[subx])
    silViz.fit(dtivo_std)
    silViz.finalize()
    subx += 1

### Clustering using best k in k=3 to k=6 range

#### Fitting the k-means model for k= 6

In [89]:
k6_model = KMeans(n_clusters=6, 
                  random_state=111).fit(dtivo_std)

#### Tagging the segmentation bases dataset with cluster numbers

In [None]:
dtivo_k6 = dtivo_descriptors.copy()
dtivo_k6['k6'] = k6_model.labels_
dtivo_k6.head()

#### Cluster Analysis k=6

#### Basic summary table

In [None]:
pivot_k6 = dtivo_k6.groupby('k6').agg(['mean', 'std']).round(1)
pivot_k6.insert(0, 'Size', dtivo_k6.groupby('k6').size())
pivot_k6

In [None]:
#TODO-27: Which fields are being multiplied to get the value of each segment?
k6_segValues = pivot_k6['ElecSpend_Annual']['mean']*pivot_k6['Size']
k6_segValues = k6_segValues.sort_values(ascending=False)
k6_segValues.map('${:,.0f}'.format)  #after the colon comes the format spec: ',' adds thousands separators and '.0f' means a float with 0 decimal places

In [93]:
pivot_k6.to_excel('pivot_k6_tivo.xlsx')

#### Summary table with Heatmap (k=6)

In [94]:
pk6mins = pivot_k6.min(0)
pk6maxs = pivot_k6.max(0)
mm_pivot = (pivot_k6 - pk6mins)/(pk6maxs-pk6mins)

In [None]:
fig = plt.figure(figsize=(10, 5))
pk6_map = sns.heatmap(
    mm_pivot,
    annot=pivot_k6,  #actual pivot becomes annotation
    cmap='coolwarm_r',
    cbar=False,
    fmt='.1f',
    linewidths=0.01)

pk6_map.set(xlabel='Segmentation Bases',
           ylabel = 'Cluster')
plt.show()

#### Box Plots of k=6 clusters by bases

In [None]:
#TODO-28: We have to use the standardized dataframe to show all features on one scale
dtivo_stdk6 = dtivo_std.copy()
dtivo_stdk6['k6'] = k6_model.labels_
dtivo_stdk6.head()

In [None]:
#TODO-29: Draw one boxplot panel per cluster from the tagged standardized dataframe (common scale)
medprops = dict(linewidth=3)
gpbxplot = dtivo_stdk6.groupby('k6').boxplot(column=colsKept,
                                             figsize=(12, 12),
                                             patch_artist=True,     #patches are the boxes - True allows us to change its colors
                                             medianprops=medprops)  #can change the median line properties using median props
plt.tight_layout(pad=2.0)

You can choose a colormap from https://matplotlib.org/stable/gallery/color/colormap_reference.html

In [None]:
#Have different color and titles for each boxplot
#TODO-30: Choose a matplotlib colormap profile
bxcmap = plt.get_cmap('plasma', len(colsKept))  #plt.cm.get_cmap is deprecated and removed in recent matplotlib releases
bxcmap

In [None]:
bxcolors = [bxcmap(i) for i in range(0,len(colsKept))]
bxcolors #in RGB-Alpha form

In [None]:
#TODO-31: How do we change the colors of each groupby boxplot subplot?
#f"..." are f-strings or formatted strings.  They allow using variables between curly brackets {}
for i, ax in enumerate(gpbxplot):
    ax.set_title(f"CLUSTER {i}: Size ={pivot_k6.loc[i,'Size'].iat[0]}")
    ax.tick_params(labelleft=True)
    for j, patch in enumerate(ax.patches):
        patch.set_facecolor(bxcolors[j])
plt.tight_layout(pad=3.0)
gpbxplot[1].get_figure()

In [101]:
#TODO-32: How do we save the plot?
fig_k6boxplot = gpbxplot[1].get_figure()
fig_k6boxplot.savefig('boxplot_k6.png', dpi=400)

## TODO-21: k=5

### TODO-22: Recommendations: quality of k=5 vs. k=4 solution and which segment to target?

<div class="alert alert-block alert-info">Compare your 5 cluster solution with the earlier 4 cluster one:<br>
- is the extra cluster identified worth marketing to?<br>
- Based on the 5 segments solution, which segment(s) should TiVo target<br>
  to fulfill its marketing objective?<br>
- What does it imply for the 4Ps?</div>