# TiVo - ETL Pipeline - Preparing Data for Clustering

**NOTE**: All cells that contain exercises are marked with `#EXERCISE-TODO-[EXERCISE NUMBER]`

## Import General Packages

In [None]:
%pip install -r requirements.txt

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## E: Extract and explore the data  (EXTRACT part of ETL pipeline)
*Note: In a true ETL pipeline, you would only extract the data, not explore it :)*


### Read in from Excel file

In [None]:
dtivo_master = pd.read_excel('./data/tivo_subset.xlsx',
                             index_col=0)            #Read in the first column (ID) as the index
dtivo_master.head()

### Get a feel of the data

In [None]:
dtivo_master.info()

### Exploratory Analyses: explore possible bases for segmentation

In [None]:
dtivo_master.groupby('TechAdopt')['Fav_Feature'].value_counts()
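
Raw counts can be hard to compare when the adopter groups differ in size. As a rough sketch (toy data, with hypothetical feature labels `'a'` and `'b'` rather than values from the TiVo file), `pd.crosstab` with `normalize='index'` shows each feature's share within an adopter group:

```python
import pandas as pd

# Toy stand-in for dtivo_master; the feature labels 'a' and 'b' are made up
toy = pd.DataFrame({
    'TechAdopt':   ['early', 'early', 'late', 'late', 'late'],
    'Fav_Feature': ['a', 'b', 'a', 'a', 'b'],
})

# normalize='index' divides each row by its row total,
# so every adopter group sums to 1
shares = pd.crosstab(toy['TechAdopt'], toy['Fav_Feature'], normalize='index')
print(shares.round(2))
```

Reading across a row then answers "what fraction of this group prefers each feature?", which is usually what the segmentation question needs.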

*Q: Did we learn anything about which product features interest early vs. late adopters?  Which ones would you market to the premium market segment?*

Ans:

In [None]:
#EXERCISE-TODO-01: Obtain grouped value counts for 'PurchasePoint'
#and write commentary under it: what ordinal scale would be
#suitable?
dtivo_master.groupby('TechAdopt')[__________________].value_counts()


## T: Data Transformations  (TRANSFORM part of ETL pipeline)

In [7]:
#Make a copy of the master DataFrame to apply
#transformations to.  We may need the original master data later,
#so applying transformations to a copy is good practice
dtivo = dtivo_master.copy()

### Factorizations

#### Fav_Feature

In [None]:
#Factorize 'Fav_Feature' into integer codes stored in 'Fav_Coded';
#fav_uniques lists the categories in the order of their codes
dtivo['Fav_Coded'], fav_uniques = dtivo['Fav_Feature'].factorize()
print(fav_uniques)


In [None]:
dtivo.head(10)

#### Education
Sometimes the order of the codes matters. For Education, we want to preserve our own ordering of the levels, especially when we read the results that summarize clusters.  So we "zip" the categories with codes that we choose ourselves.

In [None]:
ed_types = dtivo['Education'].unique()
print(ed_types)

In [None]:
dtivo['Education'].factorize(sort=True)

In [None]:
ed_levels = [1,2,4,3]  #ranks listed in the same order as the ed_types printed above
dict_edtypes = dict(zip(ed_types,ed_levels))
print(dict_edtypes)

In [None]:
dtivo['Ed_Coded'] = dtivo['Education'].map(dict_edtypes)
dtivo.head(10)
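
To see why the explicit mapping matters, here is a minimal sketch on toy data. The education labels and their ranks below are assumptions made up for illustration; the real categories are whatever `dtivo['Education'].unique()` prints.

```python
import pandas as pd

# Hypothetical labels, not the actual categories in the TiVo data
toy = pd.Series(['BA', 'none', 'PhD', 'MA', 'BA'], name='Education')

# factorize() numbers categories by first appearance,
# which ignores the ranking we have in mind
codes_by_appearance, uniques = toy.factorize()

# An explicit dictionary pins each label to the code we choose,
# regardless of the order in which rows appear
rank = {'none': 1, 'BA': 2, 'MA': 3, 'PhD': 4}
codes_by_rank = toy.map(rank)

print(list(uniques), list(codes_by_appearance))
print(list(codes_by_rank))

# Any label missing from the dictionary would map to NaN, so check for gaps
print(codes_by_rank.notna().all())
```

Because `unique()` returns categories in order of first appearance, a hand-written list of ranks only lines up with it for the data it was written against; the `notna()` check is a cheap guard after the data changes.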

#### PurchasePoint 

In [21]:
#EXERCISE-TODO-02: Decide whether you would like to
#code 'PurchasePoint' using simple factorization (as we showed for 'Fav_Feature')
# OR by mapping your own 'zipped' dictionary.
#HINT: CONSIDER YOUR FINDINGS FROM TODO-01
# WHICHEVER WAY YOU CHOOSE, SAVE THE RESULT


#?
#?
#?
dtivo["PPoint_Coded"] = dtivo['PurchasePoint']._____________________________
dtivo.head(10)


#### Factorizing a binary - what to label the result?

In [None]:
dtivo['Gender'].head()

When factorizing a binary variable, you need to anticipate which category gets `0` and which gets `1`.  `factorize` can sort the unique categories first (`sort=True`).  For example, our dataset has only two genders (male and female), so this is a binary variable.  If we sort, 'F' comes before 'M', so females are assigned `0` and males are assigned `1`.  We therefore name the resulting Series `Male`, because males are the ones coded `1`.

Without sorting, whichever category appears in the first row gets `0` and the next category gets `1` (so given the above data, males would get `0` without sort).

In [None]:
dtivo['Male'], gender_uniques = dtivo['Gender'].factorize(sort=True)
print(gender_uniques)
dtivo.head()
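
The effect of `sort=True` is easy to check on a toy Series. This is only an illustrative sketch; the four values below are made up and simply start with `'M'`.

```python
import pandas as pd

# Toy gender column whose first row is 'M'
toy_gender = pd.Series(['M', 'F', 'F', 'M'])

# Without sort: codes follow order of first appearance, so 'M' becomes 0
codes_unsorted, uniques_unsorted = toy_gender.factorize()

# With sort: categories are sorted first, so 'F' becomes 0 and 'M' becomes 1
codes_sorted, uniques_sorted = toy_gender.factorize(sort=True)

print(list(uniques_unsorted), list(codes_unsorted))
print(list(uniques_sorted), list(codes_sorted))
```

With sorting, the column can safely be named `Male` no matter which gender happens to appear in the first row.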

### Rows to keep for clustering
At launch we are only interested in segments *within* early adopters, so we filter out late adopters.


In [None]:
#Assuming we have already done some segment targeting, e.g. a focus on early adopters only.
dtivo_early = dtivo[dtivo['TechAdopt']=='early'].copy()
len(dtivo_early)

In [25]:
outlierIDs = [441, 923]
#WE CAN DROP ANY CASES THAT WE THINK ARE OUTLIERS AND WILL DISTORT OUR SEGMENTS
#EXERCISE-TODO-Nth: RETURN TO THIS AFTER RUNNING THE CLUSTER ANALYSIS
dtivo_early.drop(index=outlierIDs, inplace=True)

In [None]:
len(dtivo_early)

### Columns to keep for clustering

In [None]:
colsAll = list(dtivo_early.columns)
colsAll

The word 'descriptors' signifies the variables we have chosen because we believe they will truly help us predict segment membership.  Descriptors should be things that we will know about the customer before the marketing action takes place.  For example, before selling our product, will we know their purchase point?

In [None]:

colsToDrop = ['Education',      #factorized, only keeping its coded form
              'PurchasePoint',  #factorized, only keeping its coded form
              'Tvhours',        #Dropping to keep it simple
              'TechAdopt',      #Not needed after filtering out TechAdopt == 'late'
              'Fav_Feature',    #factorized, only keeping coded form
              'Gender',         #already have 'Male' after factorization
              'Fav_Coded',      #Dropping to keep it simple
#              'PPoint_Coded',   #Dropping to keep it simple in first run - uncomment after TODOS
                                #EXERCISE-TODO-TAKEHOME - try doing this again but with some of the variables we dropped for simplicity
             ]

dtivo_descriptors = dtivo_early.drop(columns=colsToDrop)
colsKept = list(dtivo_descriptors.columns)
print(colsKept)
dtivo_descriptors.head()

### Normalize using Standardization

If clustering is to give equal weight to each field (or 'feature' in ML terms), then<br>
we must make sure the fields are on the same scale (i.e. **normalize** the fields).

If we are interested in how our segments vary from the average (e.g. <br>
above average income, well below average age, etc.), then we use<br>
**standardization** for normalization.


In [29]:
#IMPORT PACKAGE FOR STANDARDIZATION
from sklearn import preprocessing

In [None]:
#STANDARDIZED SCALING
dtivo_std = pd.DataFrame(preprocessing.scale(dtivo_descriptors),
                         columns=colsKept,                 #TODO: REVIEW in light of any changes to COLUMNS DROPPED
                         index=dtivo_descriptors.index
                        )

#Sanity check only: each column should show mean ~0 and std ~1
dtivo_std.describe().round(2)

In [None]:
dtivo_std.head()
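
For intuition, standardization replaces each value with its distance from the column mean, measured in standard deviations. The sketch below does this by hand on toy data; the column names and values are illustrative assumptions, not taken from the TiVo file.

```python
import numpy as np
import pandas as pd

# Two toy columns on very different scales (values are made up)
toy = pd.DataFrame({
    'Age':    [20, 30, 40, 50],
    'Income': [30000, 50000, 70000, 90000],
})

# z-score: subtract the column mean, divide by the column standard deviation.
# ddof=0 (population std) matches what preprocessing.scale uses.
z = (toy - toy.mean()) / toy.std(ddof=0)

print(z.round(3))
print(np.allclose(z.mean(), 0), np.allclose(z.std(ddof=0), 1))
```

After scaling, both toy columns carry identical z-scores even though income was originally a thousand times larger, which is exactly the "equal weight" property clustering needs. Note that `describe()` reports the sample standard deviation (`ddof=1`), so the `std` row for `dtivo_std` can come out slightly above 1.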

## L: Loading the transformed data (LOAD part of ETL pipeline)

In [32]:
dtivo_descriptors.to_csv("data/dtivo_descriptors.csv")

In [33]:
dtivo_std.to_csv("data/dtivo_standardized.csv")
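
When a later notebook loads these files, the ID index needs to be restored explicitly. A minimal round-trip sketch, using a temporary file and a hypothetical toy frame (the index name `ID` and column `Age` are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

# Toy frame with an ID-style index, standing in for dtivo_descriptors
toy = pd.DataFrame({'Age': [25, 40]},
                   index=pd.Index([101, 102], name='ID'))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')
    toy.to_csv(path)                            # the index is written as the first column
    restored = pd.read_csv(path, index_col=0)   # read it back as the index

print(restored.equals(toy))
```

Without `index_col=0`, the IDs would come back as an ordinary data column and would be treated as a feature by any clustering step that follows.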