# TiVo - Segmentation Analysis using _k_-means Clustering

## Name & ID

**NAME: STUDENT Z**<br>
*roll no: m04000zzz*

## Import General Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Read and feel the data

### Read in from Excel file

In [2]:
!pip install openpyxl



In [3]:
#TODO-01>> Which command to use to read in the Excel file in data folder?
dtivo_master = pd.read_excel('./data/tivo_subset.xlsx',
                             index_col=0)            #Read in the first column (ID) as the index
dtivo_master.head()

ID,Gender,Education,Income,Age,PurchasePoint,ElecSpend_Annual,Tvhours,TechAdopt,Fav_Feature
1,male,none,49,30,mass-consumer electronics,420,2,late,saving favorite shows to watch as a family
2,male,none,46,36,mass-consumer electronics,420,10,late,saving favorite shows to watch as a family
3,male,BA,58,66,specialty stores,768,0,early,time shifting
4,male,PhD,51,78,mass-consumer electronics,396,5,late,saving favorite shows to watch as a family
5,female,none,46,52,mass-consumer electronics,540,2,late,saving favorite shows to watch as a family


### Get a feel of the data

In [4]:
dtivo_master.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 1 to 1000
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Gender            1000 non-null   object
 1   Education         1000 non-null   object
 2   Income            1000 non-null   int64 
 3   Age               1000 non-null   int64 
 4   PurchasePoint     1000 non-null   object
 5   ElecSpend_Annual  1000 non-null   int64 
 6   Tvhours           1000 non-null   int64 
 7   TechAdopt         1000 non-null   object
 8   Fav_Feature       1000 non-null   object
dtypes: int64(4), object(5)
memory usage: 78.1+ KB


In [5]:
dtivo_master.groupby('TechAdopt')['Fav_Feature'].value_counts()

TechAdopt  Fav_Feature                               
early      cool gadget                                   228
           schedule control                              221
           time shifting                                 221
           programming/interactive features              130
late       saving favorite shows to watch as a family    200
Name: count, dtype: int64

Given this output, we can assign numbers as follows:
1. --> saving favorite...
2. --> cool gadget (general statement, not specific about which feature)
3. --> schedule control (slightly advanced feature)
4. --> time shifting (more advanced feature)
5. --> programming/interactive features (most advanced feature)
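
The ordinal scale listed above can also be written as an explicit dictionary, which avoids relying on the order in which `unique()` happens to return the categories. This is a minimal sketch on a toy frame; the names `fav_scale` and `toy` are illustrative and not part of the notebook's data.

```python
import pandas as pd

# Explicit ordinal scale taken from the list above (1 = least advanced use).
fav_scale = {
    'saving favorite shows to watch as a family': 1,
    'cool gadget': 2,
    'schedule control': 3,
    'time shifting': 4,
    'programming/interactive features': 5,
}

# Toy rows standing in for dtivo['Fav_Feature'] (illustrative only).
toy = pd.DataFrame({'Fav_Feature': ['time shifting',
                                    'cool gadget',
                                    'saving favorite shows to watch as a family']})

# map() looks each category up in the dictionary; unmapped values become NaN.
toy['Fav_Coded'] = toy['Fav_Feature'].map(fav_scale)
print(toy)
```

Because the mapping is keyed by category name, it stays correct even if the rows of the source file are reordered.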

In [6]:
#TODO-03: Obtain grouped value counts for PurchasePoint
#and write in commentary under it: what ordinal scale would
#be suitable?
dtivo_master.groupby('TechAdopt')['PurchasePoint'].value_counts()


TechAdopt  PurchasePoint            
early      retail                       294
           discount                     293
           specialty stores             170
           web (ebay)                    43
late       mass-consumer electronics    200
Name: count, dtype: int64

Given this output, we can assign numbers as follows:
1. --> discount
2. --> retail
3. --> specialty stores
4. --> web (ebay)
5. --> mass-consumer electronics

## Data Transformations

In [6]:
#Make a copy of the master DataFrame to apply
#transformations to.
dtivo = dtivo_master.copy()

### Factorizations

**CONSIDER USING SIMPLE FACTORIZATION**

#### Education

In [None]:
#TODO-04: Get list/array of unique categories
ed_types = dtivo['Education'].unique()
print(ed_types)

In [None]:
#TODO-05-06: Provide a list of numbers in order required 
#e.g. we want numbers to increase with level of education - 
#so PhD is given 4

ed_levels = [1,2,4,3]  #order numbers matched against ed_types list
dict_edtypes = dict(zip(ed_types,ed_levels))
print(dict_edtypes)

In [None]:
#TODO-07: Map the zipped dictionary from 'Education' to 'Ed_Coded'
dtivo['Ed_Coded'] = dtivo['Education'].map(dict_edtypes)
dtivo.head(10)

#### Fav_Feature

In [None]:
fav_types = dtivo['Fav_Feature'].unique()
list(enumerate(fav_types))  #enumerate just to count number of types

In [None]:
#Assign the numerical levels for Fav_Feature justified earlier.
fav_levels = [1,4,2,3,5]  #follows order of fav_types
dict_favtypes = dict(zip(fav_types,fav_levels))
print(dict_favtypes)

In [None]:
#Map the zipped dictionary from 'Fav_Feature' to 'Fav_Coded'
dtivo['Fav_Coded'] = dtivo['Fav_Feature'].map(dict_favtypes)
dtivo.head(10)

#### PurchasePoint 

In [None]:
#TODO-08: Get list/array of unique categories
ppoint_types = dtivo['PurchasePoint'].unique()
list(enumerate(ppoint_types))

Given this output, we can assign numbers as follows:

1. --> discount
2. --> retail
3. --> specialty stores
4. --> web (ebay)
5. --> mass-consumer electronics

In [None]:
#TODO-09-10: Provide a list of numbers in order justified in TODO-03. 
ppoint_levels = [5,3,2,1,4]
dict_pptypes = dict(zip(ppoint_types,ppoint_levels))
print(dict_pptypes)

In [None]:
#TODO-11: Map the zipped dictionary from 'PurchasePoint' to 'PPoint_Coded'
dtivo['PPoint_Coded'] = dtivo['PurchasePoint'].map(dict_pptypes)
dtivo.head(10)

### Making Dummy Variables (or One-Hot Encoding)

When we have binary variables **OR** if each category needs its own column,<br>
we use `pd.get_dummies()` to create dummy variable columns.

This is also called **"One-Hot Encoding" (OHE)**, when a category becomes<br>
a binary (0,1) variable.  It is quite useful for performing regression with<br>
categorical explanatory variables.<br>

In our case, OHE can be applied to `TechAdopt` and `Gender`. They are binary, so<br>
we only need one of the resulting columns to define the category.
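
As a rough illustration of the point above, the sketch below one-hot encodes a toy `Gender` column and then keeps a single column for the binary category. The frame `toy` and its values are made up for the example; `drop_first=True` is one possible way to keep only one column and is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy data standing in for the TiVo frame (illustrative values only).
toy = pd.DataFrame({'Gender': ['male', 'female', 'male'],
                    'Income': [49, 46, 58]})

# One column per category: Gender_female and Gender_male.
ohe = pd.get_dummies(toy, columns=['Gender'], dtype=int)
print(list(ohe.columns))

# For a binary category one column is enough; drop_first removes the
# first category in sorted order (here 'female').
ohe_single = pd.get_dummies(toy, columns=['Gender'], dtype=int, drop_first=True)
print(ohe_single)
```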

In [None]:
dummCols = ['Gender']   #TODO-12: Add TechAdopt to this list
#TODO-13: apply get_dummies to all columns that need categories
#converted to binary
dtivo = pd.get_dummies(dtivo, columns = dummCols, dtype=int)
dtivo.head()

### Rows to keep for clustering
At launch we are only interested in segments *within* early adopters, so we filter out late adopters.


In [None]:
#Assuming we have done some segment targeting already - like focus on Early adopters only.
#.copy() makes an independent DataFrame, so later edits do not touch dtivo.
dtivo_early = dtivo[dtivo['TechAdopt']=='early'].copy()
len(dtivo_early)

In [None]:
outlierIDs = [441, 923]
dtivo_early.drop(index=outlierIDs, inplace=True)

In [None]:
len(dtivo_early)

### Columns to keep for clustering

In [None]:
colsAll = list(dtivo_early.columns)
colsAll

In [None]:
colsToDrop = ['Education',      #factorized
              'PurchasePoint',  #will be factorized, need to keep its coded form when factorized
              'Tvhours',        #Initially using as descriptor not basis so dropping here, also to keep it simple
              'TechAdopt',      #Not needed after filtering out TechAdopt == 'late'.
              'Fav_Feature',    #factorized
              'Gender_female',  #already have male who are more frequent so don't need female
              'Fav_Coded',      #Treating as descriptor to inform promotion, not as cluster basis.
              'PPoint_Coded',   #Treating as descriptor to inform promotion, not as cluster basis.
                                #TODO-14 - ADD MORE COLUMNS TO DROP AFTER COMPLETING EARLIER TODOS
             ]

dtivo_bases = dtivo_early.drop(columns=colsToDrop)
colsKept = list(dtivo_bases.columns)
print(colsKept)
dtivo_bases.head()

## Normalize using Standardization

If Clustering has to give equal weight to each field (or 'feature' in ML terms), then<br>
we must make sure they are on the same scale (i.e. **Normalize** the fields).

If we are interested in how our segments vary from the average (e.g. <br>
above average income, highly below average age, etc.), then we use<br>
**Standardization** for normalization.
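
To make the idea concrete, the sketch below standardizes a toy frame by hand (subtract the column mean, divide by the column standard deviation) and checks that it matches `preprocessing.scale`. The toy values are invented; note that `scale` uses the population standard deviation (`ddof=0`), whereas pandas' `std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Toy data on very different scales (illustrative values only).
toy = pd.DataFrame({'Income': [40.0, 50.0, 60.0, 70.0],
                    'Age': [20.0, 30.0, 40.0, 50.0]})

# Manual z-score: (value - column mean) / column std, with ddof=0.
z_manual = (toy - toy.mean()) / toy.std(ddof=0)

# The same transformation via scikit-learn (returns a NumPy array).
z_sklearn = preprocessing.scale(toy)

print(z_manual.round(2))
print(np.allclose(z_manual.values, z_sklearn))
```

After standardization each column has mean 0 and standard deviation 1, so a value of +1 reads as "one standard deviation above average" regardless of the original unit.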


In [22]:
#IMPORT PACKAGE FOR STANDARDIZATION
from sklearn import preprocessing

In [None]:
#STANDARDIZED SCALING
## TODO-23: need to reconvert to DataFrame as sklearn produces
## numerical array without headings
dtivo_std = pd.DataFrame(preprocessing.scale(dtivo_bases),
                         columns=colsKept,                 #TODO-15: REVIEW in light of TODO-14.
                         index=dtivo_bases.index
                        )

#only for checking if done
dtivo_std.describe().round(2)

In [None]:
dtivo_std.head()

## K-means Clustering

In [25]:
#Import Required Packages
from sklearn.cluster import KMeans

### Learning Step: Start by making k=4 clusters

#### Fitting the k-means model for k=4

In [26]:
#TODO-16: Fit 4 clusters using KMeans to standardized data
k4_model = KMeans(n_clusters=4, 
                  random_state=111).fit(dtivo_std)

#### Tagging the segmentation bases dataset with cluster numbers

In [None]:
#TODO-17: Tag each row with its k=4 cluster label for cluster analysis
dtivo_k4 = dtivo_bases.copy()
dtivo_k4['k4'] = k4_model.labels_
dtivo_k4.head()

In [28]:
dtivo_k4.to_csv("outputs/dtivo_k4.csv")

### Learning Step: First Cluster Analysis

#### Basic summary table

In [None]:
#TODO-18: Select appropriate metrics to summarize the cluster features.
pivot_k4 = dtivo_k4.groupby('k4').agg(['mean', 'std']).round(1)
pivot_k4.insert(0, 'Size', dtivo_k4.groupby('k4').size())
pivot_k4

In [30]:
pivot_k4.to_excel('outputs/pivot_k4_tivo.xlsx')

In [31]:
#TODO-19: GO BACK TO 'Rows to keep for clustering' and use the following code _there_ to 
#drop the 2 records that are outliers (As k-means has some randomness in it, your cluster
#number may be different for the 2 records - please adjust code for that)

#outlierIDs = dtivo_k4[dtivo_k4['k4']==3].index
#print(outlierIDs)

#dtivo_early.drop(index=outlierIDs, inplace=True)
## Ignore the warning - we want to make copy different from original so its ok

#len(dtivo_early)

#### Summary table with Heatmap (k=4)

In [None]:
pk4mins = pivot_k4.min(0)
pk4maxs = pivot_k4.max(0)
mm_pivot = (pivot_k4 - pk4mins)/(pk4maxs-pk4mins)
mm_pivot.round(2)
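
The min-max step above rescales every column of the summary table to the range 0 to 1, so that heatmap colors are comparable across columns with different units. A small sketch with made-up numbers (the frame `toy_pivot` is hypothetical, not the TiVo result):

```python
import pandas as pd

# Toy cluster summary (illustrative values, not the TiVo results).
toy_pivot = pd.DataFrame({'Income': [40.0, 55.0, 70.0],
                          'Age': [30.0, 60.0, 45.0]},
                         index=pd.Index([0, 1, 2], name='cluster'))

# Column-wise min-max: the smallest value in a column maps to 0, the largest to 1.
col_min = toy_pivot.min(axis=0)
col_max = toy_pivot.max(axis=0)
toy_mm = (toy_pivot - col_min) / (col_max - col_min)
print(toy_mm.round(2))
```

A column whose values are all identical would give a zero denominator, so such columns are worth checking for before plotting.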

In [33]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#TODO-20: Use min-max pivot for data, 
#and actual pivot as annotations to the heatmap

fig = plt.figure(figsize=(10, 4))
pk4_map = sns.heatmap(
    mm_pivot,
    annot=pivot_k4,  #actual pivot becomes annotation
    cmap='coolwarm_r',
    cbar=False,
    fmt='.1f',
    lw=0.01)

pk4_map.set(xlabel='Segmentation Bases',
           ylabel = 'Cluster')
plt.show()

In [35]:
fig.savefig('figs/heatmap_k4_tivo.png', dpi=300, bbox_inches='tight')

#### Defining clusters with 'persona' descriptions

In [None]:
pivot_k4_annot = pivot_k4.copy()
pivot_k4_annot['Persona'] = ['Budget-seeking Middle-Aged Woman',
                             'Educated Hi-Spender',
                             'Budget-seeking Middle-Aged Man',
                             'Top-buck Spending Senior'
                            ]
pivot_k4_annot

In [37]:
pivot_k4_annot.to_excel('outputs/pivot_k4_tivo.xlsx')

### Finding the best 'k'

#### Elbow Method

In [None]:
k4_model.inertia_
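
The elbow method compares `inertia_` (the within-cluster sum of squared distances) across several values of k and looks for the point where adding a cluster stops helping much. The sketch below computes that curve by hand on synthetic data; the three-blob array `X` is an assumption made for illustration and is not the TiVo data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs in 2-D (illustrative only).
rng = np.random.default_rng(111)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

# Fit k-means for k = 1..6 and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=111).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this the inertia falls steeply up to k=3 and only slightly afterwards, which is the "elbow" that the visualizer marks automatically.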

In [39]:
#TODO-23: Which visualizer will we use?
from yellowbrick.cluster import KElbowVisualizer

In [None]:
#TODO-24: Set-up the KElbowVisualizer
kBegin = 3
kEnd = 15
km = KMeans(random_state=111)
visualizer = KElbowVisualizer(km, k=(kBegin,kEnd), timings=False)
visualizer.fit(dtivo_std)
visualizer.show()

#### Silhouette Score

##### Average Silhouette Score

In [None]:
#TODO-25: Change the metric to 'silhouette'
vizElbowSil = KElbowVisualizer(km,
                               metric='silhouette',
                               k=(kBegin, kEnd),
                               timings=False)
vizElbowSil.fit(dtivo_std)
vizElbowSil.show()
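
The average silhouette score can also be computed directly with `sklearn.metrics.silhouette_score`, which may help when checking what the visualizer plots. A minimal sketch on synthetic data (the array `X` below is invented for the example, not the TiVo data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs in 2-D (illustrative only).
rng = np.random.default_rng(111)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

# The silhouette needs at least 2 clusters, so start at k=2.
sil_scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=111).fit_predict(X)
    sil_scores[k] = silhouette_score(X, labels)

# Higher is better; values close to 1 mean tight, well-separated clusters.
best_k = max(sil_scores, key=sil_scores.get)
print(best_k)
```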

##### Detailed Silhouette per Clustering

In [42]:
dtivo_std.to_csv("outputs/dtivo_reduced_std.csv")

In [43]:
#TODO-26: Import correct Visualizer
from yellowbrick.cluster import SilhouetteVisualizer

In [None]:
km = KMeans(n_clusters=6, random_state=111)
silViz = SilhouetteVisualizer(km, colors='yellowbrick')
silViz.fit(dtivo_std)
silViz.show()


In [None]:
#TODO-27: Setup and run the SilhouetteVisualizer
plt.clf()
kStart = 3
kStop = 7  #one past the last k to plot (range excludes kStop, so the last subplot is k=6)
subx = 0  #starting row index of subplot on which to plot
fig, ax = plt.subplots(kStop - kStart, 1,
                       figsize=(7, 16), sharex=True)
fig.tight_layout(pad=5.0)
for i in range(kStart, kStop):
    ax[subx].set_title(f"Silhouette values for k= {i}")
    ax[subx].tick_params(labelbottom=True)
    km = KMeans(n_clusters=i, random_state=111)
    silViz = SilhouetteVisualizer(km, colors='yellowbrick', ax=ax[subx])
    silViz.fit(dtivo_std)
    silViz.finalize()
    subx += 1

### Clustering using best k in k=3 to k=6 range

#### Fitting the k-means model for k= 6

In [49]:
k6_model = KMeans(n_clusters=6, 
                  random_state=111).fit(dtivo_std)

#### Tagging the segmentation bases dataset with cluster numbers

In [None]:
dtivo_k6 = dtivo_bases.copy()
dtivo_k6['k6'] = k6_model.labels_
dtivo_k6.head()

#### Cluster Analysis k=6

#### Basic summary table

In [None]:
pivot_k6 = dtivo_k6.groupby('k6').agg(['mean', 'std']).round(1)
pivot_k6.insert(0, 'Size', dtivo_k6.groupby('k6').size())
pivot_k6

In [None]:
#TODO-27: What fields are being multiplied to get the value of the segment?
k6_segValues = pivot_k6['ElecSpend_Annual']['mean']*pivot_k6['Size']
k6_segValues = k6_segValues.sort_values(ascending=False)
k6_segValues.map('${:,.0f}'.format)  #after : comes the formatting --> , for commas and .0f for a float with 0 decimal places
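
The segment value computed above is simply the mean annual spend of a cluster multiplied by the number of customers in it. A toy example with invented numbers (the frame `toy_seg` and its column names are hypothetical):

```python
import pandas as pd

# Toy cluster summary: size and mean annual spend per cluster (made-up values).
toy_seg = pd.DataFrame({'Size': [100, 250, 50],
                        'mean_spend': [600.0, 400.0, 900.0]},
                       index=pd.Index([0, 1, 2], name='cluster'))

# Segment value = average spend per customer x number of customers.
toy_values = (toy_seg['mean_spend'] * toy_seg['Size']).sort_values(ascending=False)

# Format as dollars with thousands separators and no decimals.
formatted = toy_values.map('${:,.0f}'.format)
print(formatted)
```

In this toy case the largest cluster is the most valuable even though its customers spend the least on average, which is why size and mean spend need to be read together.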

In [54]:
pivot_k6.to_excel('outputs/pivot_k6_tivo.xlsx')

#### Summary table with Heatmap (k=6)

In [55]:
pk6mins = pivot_k6.min(0)
pk6maxs = pivot_k6.max(0)
mm_pivot = (pivot_k6 - pk6mins)/(pk6maxs-pk6mins)

In [None]:
fig = plt.figure(figsize=(10, 5))
pk6_map = sns.heatmap(
    mm_pivot,
    annot=pivot_k6,  #actual pivot becomes annotation
    cmap='coolwarm_r',
    cbar=False,
    fmt='.1f',
    lw=0.01)

pk6_map.set(xlabel='Segmentation Bases',
           ylabel = 'Cluster')
plt.show()

#### Box Plots of k=6 clusters by bases

In [None]:
#TODO-28: We have to use the standardized dataframe to show all features on one scale
dtivo_stdk6 = dtivo_std.copy()
dtivo_stdk6['k6'] = k6_model.labels_
dtivo_stdk6.head()

In [None]:
#TODO-29: Tag the standardized dataframe for a common scale plot
medprops = dict(linewidth=3)
gpbxplot = dtivo_stdk6.groupby('k6').boxplot(column=colsKept,
                                             figsize=(12, 12),
                                             patch_artist=True,     #patches are the boxes - True allows us to change its colors
                                             medianprops=medprops)  #can change the median line properties using median props
plt.tight_layout(pad=2.0)

You can choose colormap options from https://matplotlib.org/stable/gallery/color/colormap_reference.html

In [None]:
#Have different color and titles for each boxplot
#TODO-30: Choose a matplotlib colormap profile
bxcmap = plt.get_cmap('plasma', len(colsKept))  #plt.cm.get_cmap was removed in newer matplotlib versions
bxcmap

In [None]:
bxcolors = [bxcmap(i) for i in range(0,len(colsKept))]
bxcolors #in RGB-Alpha form

In [None]:
#TODO-31: How do we change the colors of each groupby boxplot subplot?
#f"..." are f-strings or formatted strings.  They allow using variables between curly brackets {}
for i, ax in enumerate(gpbxplot):
    ax.set_title(f"CLUSTER {i}: Size ={pivot_k6.loc[i,'Size'].iat[0]}")
    ax.tick_params(labelleft=True)
    for j, patch in enumerate(ax.patches):
        patch.set_facecolor(bxcolors[j])
plt.tight_layout(pad=3.0)
gpbxplot[1].get_figure()

In [62]:
#TODO-32: How do we save the plot?
fig_k6boxplot = gpbxplot[1].get_figure()
fig_k6boxplot.savefig('figs/boxplot_k6.png', dpi=400)

## TODO-21: k=5

### TODO-22: Recommendations: quality of k=5 vs. k=4 solution and which segment to target?

<div class="alert alert-block alert-info">Compare your 5 cluster solution with the earlier 4 cluster one:<br>
- Is the extra cluster identified worth marketing to?<br>
- Based on the 5-segment solution, which segment(s) should TiVo target<br>
  to fulfill its marketing objective?<br>
- What does it imply for the 4Ps?</div>