# Introduction

![](https://i.pinimg.com/originals/db/4f/88/db4f88f155d22599f59765e14f4c5497.jpg)

## Agenda of this talk

**Below is a typical, simplified machine learning model development pipeline for tabular data.**

![](img/fig1.png)

In a typical model development pipeline, raw data lives across servers, schemas and so on, and is aggregated into the model development data: the exhaustive set of data that might be useful for solving the problem at hand. From this data we derive an outcome or target variable (for example sales, default or fraud) and the independent variables that might help predict it. A supervised machine learning algorithm then uses the independent predictors and the target to build a predictive model that produces estimates for the problem.

**Today's talk is about how raw variables (specifically categorical variables) should be transformed for use in model development, for better predictive accuracy and long-term maintenance.**

https://stats.idre.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-numerical-variables/

## A Quick segue into model development data types
<a id=data_types></a>

**Model development data**: The data captured for most problem statements you might be trying to solve should fall into one of the buckets below:

- **Categorical variables:** A variable that does not represent a numeric entity, or an entity that cannot be represented on a coordinate scale. These need to be transformed into a numeric format before they can be used in mathematical algorithms.
  - **Ordinal variables:** variables with an inherent ranking/ordering
    - Examples: academic grades (A++, A, A-, ...), age bracket (newborn, baby, toddler, ...)
  - **High cardinality:** variables with more than 15 unique values (**my own rule of thumb**)
    - Examples: zipcodes, product IDs, operating system version numbers, Email_domain_address
  - **Low cardinality:** variables with fewer than 15 unique values (**my own rule of thumb**)
    - Examples: credit_default_status (YES/NO), customer_status (Active/inactive/attrited)
  - **Variables that you might mistake for numeric variables:** variables whose values are numbers but have no inherent ordering
    - Examples: zipcodes, house numbers, OS version numbers
- **Numeric variables:** A variable that can be represented as a numeric entity or on a coordinate scale.
  - The values a numeric variable can take depend on the variable type and can be continuous, integer or binary. They can be used directly as predictors in mathematical algorithms.
    - Examples: distance, speed, income, credit score, Indicator_for_having_a_pet (1/0)
- **Alternate data types**
  - **Text**
  - **Images**
  - **Videos**
  - **Every other damn thing under the blue sky** 🙄

Let's pick an extremely popular dataset from Kaggle to get a feel for the variable types we just encountered.

In [266]:
# Titanic dataset: predict survival on the Titanic
# (An extremely popular, "Hello world"-style dataset in the competitive predictive modelling landscape)

import pandas as pd
pd.options.mode.chained_assignment = None
df=pd.read_csv("data/train.csv")

In [22]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In the dataset we have 12 variables, of which Survived (who survived the Titanic) is the binary outcome to be predicted. Let's classify each of the other features into one of the variable types above.


One quick way to identify a variable's type, other than business/domain knowledge, is to check the data type of each variable.

In [17]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

It's also worth checking the number of unique values for each variable.

In [20]:
df.shape

(891, 12)

In [19]:
df.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

We observe that five variables have the object dtype, which is a hint that they are categorical variables (Name, Sex, Ticket, Cabin, Embarked).
Further, two variables (PassengerId and Name) have as many unique values as there are passengers, indicating that they are primary keys.
Based on that, we can classify the 12 variables as:

1.  **PassengerId :** The primary key. Passenger ID is a high cardinality categorical variable. Although the variable is numeric in form, we are not classifying it as numeric because it cannot be used on a numeric scale; that is, comparing passenger ID 1 with passenger ID 2 has no meaning.

2. **Survived :** The outcome variable. As this is a classification problem, it is a binary numeric variable. (I am classifying it as numeric because it is already encoded as 1/0. Had it been survived/Not_survived, it would have been a low cardinality categorical variable that we would need to transform into numeric binary form before developing a classification algorithm.)

3. **Pclass :** A Low cardinality categorical variable

4. **Name :** A High cardinality categorical variable

5. **Sex :** A Low cardinality categorical variable

6. **Age :** A Numeric variable

7. **SibSp :** # of siblings / spouses aboard the Titanic, A Numeric variable

8. **Parch:** # of parents / children aboard the Titanic,A Numeric variable

9. **Ticket:** Ticket number,A High cardinality categorical variable

10. **Fare:** Passenger fare,A Numeric Variable

11. **Cabin:** Cabin number,A High cardinality categorical variable

12. **Embarked:** Port of Embarkation,A Low cardinality categorical variable
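
The dtype and unique-count checks used above can be folded into one helper. This is a minimal sketch, assuming the 15-unique-value rule of thumb from earlier; `classify_columns` and the toy frame are illustrative names, not part of the notebook. Note that it cannot spot numeric-looking identifiers or codes such as `PassengerId` or `Pclass`; those still need domain knowledge.

```python
import pandas as pd

def classify_columns(frame, threshold=15):
    """Label each column as numeric or low/high cardinality categorical.

    `threshold` is the rule-of-thumb cut-off; columns with fewer unique
    values than the threshold are treated as low cardinality (assumption).
    """
    labels = {}
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            labels[col] = "numeric"
        elif frame[col].nunique() < threshold:
            labels[col] = "low cardinality categorical"
        else:
            labels[col] = "high cardinality categorical"
    return labels

# Toy data standing in for a few Titanic-like columns
toy = pd.DataFrame({
    "Sex": ["male", "female"] * 10,
    "Fare": [7.25 + i for i in range(20)],
    "Ticket": [f"T{i}" for i in range(20)],
})
labels = classify_columns(toy)
print(labels)
```

Treat the result as a first pass only and review every numeric column by hand.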



# Transforming categorical variables

## Some guidelines for choosing a categorical variable transformation methodology

As we mentioned when looking at the typical data types we face in a predictive development task, categorical variables in their raw form are not usable in a mathematical predictive algorithm; they need to be transformed into a numeric form.

There are various methodologies for transforming categorical variables, but before we look at them, let's define a few guidelines for what our final product should be and how we want to evaluate the results of a categorical-to-numeric transformation. Below are three major questions that we would ask to evaluate any categorical variable transformation methodology we might find.

- **How much incremental improvement do we observe in the model's predictive strength?**
- **Will the categorical variable transformation methodology be supported by the technical infrastructure in place for model inference in production?**
- **How robust is the methodology against the domain shift that we might observe in the data, which will eventually happen in this ever-fluctuating world?**

## Methodologies for categorical variable transformation

Below is a list of the major variable transformation methodologies in use:

- **One Hot encoding**
- **Vanilla Count encoding**
- **Vanilla target encoding**
- **K-fold Cross validated target encoding**
- **Catboost encoding**

Let's delve into each of them using the Titanic dataset that we encountered in [Section-2](#data_types)


## One Hot encoding

In one hot encoding, each category within the variable gets its own binary indicator column. The example below will make this clear.

<u>Below is the Titanic dataset</u>

In [89]:
print("The size of the dataset is {} with {} columns".format(df.shape[0],df.shape[1]))

The size of the dataset is 891 with 12 columns


In [92]:
df['Survived'].value_counts(normalize=True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

<b>The survival rate is 38% as per the training data</b>

In [23]:

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


<b>Before we proceed, we will do some hygiene transformations on the data, such as removing redundant variables, imputing missing values and a few other sanity changes.</b>

<b>First, let's split the data into train and test.</b>

In [267]:
from sklearn.model_selection import train_test_split

df_train,df_test=train_test_split(df,test_size=0.2,random_state=2)

<b>Quick, basic missing value imputation for a few variables</b>

In [310]:
# Compute the imputation values on the training data only, then apply them to both splits
age_median=df_train['Age'].median()
embarked_mode=df_train['Embarked'].mode().iloc[0]
cabin_mode=df_train['Cabin'].mode().iloc[0]

# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is not guaranteed to modify the DataFrame itself
for frame in (df_train,df_test):
    frame['Age']=frame['Age'].fillna(age_median)
    frame['Embarked']=frame['Embarked'].fillna(embarked_mode)
    frame['Cabin']=frame['Cabin'].fillna(cabin_mode)

df_train['Pclass']=df_train['Pclass'].astype('object')
df_test['Pclass']=df_test['Pclass'].astype('object')

<b>Let's transform the sex of the passengers using one hot encoding, where we will have a separate binary representation for each gender type</b>

In [269]:
df_train['Sex'].unique()

array(['male', 'female'], dtype=object)

In [270]:
df_train.join(pd.get_dummies(df_train['Sex'],prefix='Sex'))[['Sex','Sex_female','Sex_male']].head()

Unnamed: 0,Sex,Sex_female,Sex_male
30,male,0,1
10,female,1,0
873,male,0,1
182,male,0,1
876,male,0,1


We actually need only n-1 categories to be binarized; that is, Sex_male by itself captures whether the sex is female or not. Hence:

In [271]:
df_train.join(pd.get_dummies(df_train['Sex'],prefix='Sex',drop_first=True))[['Sex','Sex_male']].head()

Unnamed: 0,Sex,Sex_male
30,male,1
10,female,0
873,male,1
182,male,1
876,male,1


We will now convert all predictive categorical variables in the dataset into one-hot encoded form and attempt to develop a quick ML model to predict survival.

<b>First, a list of all categorical variables in the Titanic dataset which heuristically could be predictors of a passenger's survival</b>

In [272]:
Cat_predictors=list(df.drop(['PassengerId','Survived','Ticket',"Name","Age","Fare","SibSp","Parch"],axis=1).columns)


In [306]:
df_train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [273]:
Cat_predictors

['Pclass', 'Sex', 'Cabin', 'Embarked']

<b>The snippet below uses category_encoders to one hot encode all variables in Cat_predictors</b>

In [339]:
import category_encoders as ce

# The columns to encode go in `cols`; the data itself is only passed to fit/transform
OHE=ce.OneHotEncoder(cols=Cat_predictors,use_cat_names=True)

OHE.fit(df_train[Cat_predictors])

df_train_OHE=df_train.join(OHE.transform(df_train[Cat_predictors]))
df_test_OHE=df_test.join(OHE.transform(df_test[Cat_predictors]))


In [340]:
df_train_OHE.head()[['Sex','Sex_male','Sex_female']]

Unnamed: 0,Sex,Sex_male,Sex_female
30,male,1,0
10,female,0,1
873,male,1,0
182,male,1,0
876,male,1,0


In [341]:
df_train_OHE.drop(Cat_predictors+['PassengerId','Name','Ticket'],axis=1,inplace=True)
df_test_OHE.drop(Cat_predictors+['PassengerId','Name','Ticket'],axis=1,inplace=True)

<b>Finally, below are the shapes of our mock modelling datasets to predict survival on the Titanic</b>

In [342]:
print(df_train_OHE.shape,df_test_OHE.shape)

(712, 142) (179, 142)


In [343]:
df_train_OHE['Survived'].value_counts(normalize=True)

0    0.630618
1    0.369382
Name: Survived, dtype: float64

In [344]:
df_test_OHE['Survived'].value_counts(normalize=True)

0    0.558659
1    0.441341
Name: Survived, dtype: float64

Time to develop a quick model and check its predictive quality. For absolute simplicity, we will use a logistic regression model for this classification problem. There will probably be lots of sighs and rolling of eyes 🙄 but come on folks, this is a toy problem, we are not trying to beat SOTA 😉

<b><font color='red'>Note: There is a lot of hand-waving in the model development steps: we ignore things like correlations, robust missing value imputation, hyperparameter tuning and many other fine factors which might influence the scientific quality of a predictive model. We do that to be able to capture the flavor of categorical encoding within the stipulated time. Model development is a very nuanced process, a combination of art and science.</font></b>

First, to benchmark the predictions: what if we predicted a random probability for each row of the test dataset? The model discrimination performance (AUC) we might get is:

In [345]:
import numpy as np
from sklearn.metrics import roc_auc_score
roc_auc_score(df_test_OHE['Survived'],np.random.random((df_test_OHE.shape[0])))

0.5120253164556963

In [346]:
from sklearn.linear_model import LogisticRegression
predictors=df_train_OHE.loc[:,df_train_OHE.columns!='Survived'].to_numpy()
outcome=df_train_OHE['Survived'].to_numpy()
clf = LogisticRegression(solver='liblinear')

clf.fit(predictors,outcome)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Now the model discrimination performance is:

In [347]:
roc_auc_score(df_test_OHE['Survived'],clf.predict_proba(df_test_OHE.loc[:,df_test_OHE.columns!='Survived'])[:,1])

0.8316455696202532

In [348]:
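# 0.5289... is a random-baseline AUC, presumably from an earlier run of the unseeded cell above (which now shows 0.5120)
(0.8316455696202532-0.5289873417721519)/0.5289873417721519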

0.5721464465183058
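
The arithmetic above is the relative improvement of the model's AUC over a baseline AUC. A small sketch of that calculation follows; `relative_lift` is an illustrative helper name, not something defined in the notebook. Because random scores have an expected AUC of 0.5, the default baseline here is 0.5 rather than one sampled value.

```python
def relative_lift(model_auc, baseline_auc=0.5):
    """Relative improvement of a model's AUC over a baseline AUC."""
    return (model_auc - baseline_auc) / baseline_auc

# Lift against the sampled random baseline used in the cell above
lift_vs_sampled = relative_lift(0.8316455696202532, 0.5289873417721519)

# Lift against the expected AUC of random scores (0.5)
lift_vs_expected = relative_lift(0.8316455696202532)

print(round(lift_vs_sampled, 4), round(lift_vs_expected, 4))
```

Using the fixed 0.5 baseline keeps the lift comparable between encoders and between reruns of the notebook.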

- **How much incremental improvement do we observe in the model's predictive strength?**
  - ~57% above random prediction
- **Will the categorical variable transformation methodology be supported by the technical infrastructure in place for model inference in production?**
  - The transformation is computationally quite simple, but the write cost of going from categorical to binary can be quite high for high cardinality variables. For example, in the Titanic dataset we have 147 cabin numbers, hence one variable gets transformed into 146 different binary variables.
- **How robust is the methodology against the domain shift that we might observe in the data, which will eventually happen in this ever-fluctuating world?**
  - If there is a domain shift in a variable, we might lose information. Example: suppose we apply the Titanic classification model to another ship disaster, let's call the ship Thanos. If Thanos has similar cabin numbers but 150 of them instead of 147, the one hot encoded variables will not be able to capture anything about the other three.
    <b>One way to make the encoding stable with respect to a domain shift is to include a catch-all binary variable that captures every other category that might pop up in future inference data.</b>
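
The catch-all idea can be sketched in plain pandas. This is an illustration under assumptions, not what the notebook runs: the `__other__` label and both helper names are made up here, and the category list is learned from the training column only.

```python
import pandas as pd

OTHER = "__other__"  # hypothetical label for the catch-all bucket

def fit_categories(train_col):
    """Remember the categories seen in training, plus the catch-all."""
    return sorted(train_col.dropna().unique()) + [OTHER]

def one_hot_with_other(col, categories, prefix):
    """One hot encode `col`; values not seen in training go to OTHER."""
    mapped = col.where(col.isin(set(categories)), OTHER)
    # A fixed category list keeps the output columns identical for any input
    fixed = mapped.astype(pd.CategoricalDtype(categories))
    return pd.get_dummies(fixed, prefix=prefix, dtype=int)

train_cabin = pd.Series(["C85", "C123", "E46", "C85"])
test_cabin = pd.Series(["C85", "A23", "F E69"], index=[630, 128, 185])

cats = fit_categories(train_cabin)
encoded_test = one_hot_with_other(test_cabin, cats, prefix="Cabin")
print(encoded_test)
```

Because the columns are fixed at fit time, the model always receives the same feature layout, and unseen cabins land in the single catch-all column instead of being dropped.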



## Vanilla Count encoding

In count encoding, every category within the variable is replaced with its count in the training data. The counts for each category are stored and used to encode variables during inference or in production.
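
Stripped of the library, the mechanics are a lookup table built on the training data. A minimal sketch with made-up toy values (the series and `count_lookup` are illustrative names):

```python
import pandas as pd

train_pclass = pd.Series([3, 1, 3, 3, 2, 1], name="Pclass")
new_pclass = pd.Series([1, 3, 2], name="Pclass")

# "Fit": count each category once, on the training data only
counts = train_pclass.value_counts()

# This plain dict is what would be stored for use at inference time
count_lookup = counts.to_dict()

# "Transform": replace each category with its stored training count
encoded = new_pclass.map(count_lookup)
print(encoded.tolist())
```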

<b>We will set up the count encoder and then fit it on the training data</b>

In [385]:
count_enc=ce.CountEncoder(cols=Cat_predictors)

count_enc.fit(df_train[Cat_predictors])

CountEncoder(cols=['Pclass', 'Sex', 'Cabin', 'Embarked'],
             combine_min_nan_groups=True, drop_invariant=False,
             handle_missing='count', handle_unknown=None, min_group_name=None,
             min_group_size=None, normalize=False, return_df=True, verbose=0)

<b>Let's have a look at the transformation of the variables</b>

In [351]:
df_train.join(count_enc.transform(df_train[Cat_predictors]).add_suffix('_count'))[['Pclass','Pclass_count']].head(10)

Unnamed: 0,Pclass,Pclass_count
30,1,175
10,3,389
873,3,389
182,3,389
876,3,389
213,2,148
157,3,389
780,3,389
572,1,175
77,3,389


In [352]:
df_train['Pclass'].value_counts()

3    389
1    175
2    148
Name: Pclass, dtype: int64

<b>As the results are what we want, we will transform the train and test data for use in model development</b>

In [363]:
df_train_vce=df_train.join(count_enc.transform(df_train[Cat_predictors]).add_suffix('_count'))
df_test_vce=df_test.join(count_enc.transform(df_test[Cat_predictors]).add_suffix('_count'))


<b>This looks fairly simple, but there are some finer points which will come back to haunt you at inference time or when you are productionizing the model.</b>

<b>What happens if there is a domain shift, that is, what if the model is applied to a ship whose cabin numbers differ? Luckily, we have that scenario simulated here in our test set: there are cabin numbers in the test set which are not in the train set.</b>

In [369]:
list(set(df_test['Cabin'].unique())-set(df_train['Cabin'].unique()))

['A32',
 'D9',
 'B42',
 'D7',
 'D20',
 'C148',
 'A16',
 'B73',
 'E36',
 'A23',
 'A31',
 'D45',
 'E17',
 'C30',
 'E50',
 'F4',
 'E63',
 'F E69']

<b>How are these cabin numbers encoded in the test set?</b>

In [None]:
df_test_vce[['Cabin','Cabin_count']].loc[df_test_vce['Cabin_count'].isna()]

In [376]:
df_test_vce[['Cabin','Cabin_count']].loc[df_test_vce['Cabin'].isin(list(set(df_test['Cabin'].unique())-set(df_train['Cabin'].unique())))].head()

Unnamed: 0,Cabin,Cabin_count
630,A23,
128,F E69,
185,A32,
209,A31,
520,B73,


<b>Yes, they are NaN, as the count encoder never saw these values in the training dataset it was fitted on. Now, what do we do?</b>

- The encoding code should have an additional handler for this, as we will always observe such values as the domain shifts and we move deeper into the inference period
- There are multiple ways to handle it; for simplicity we will impute the unknowns with -9999

In [379]:
df_test_vce.fillna(-9999,inplace=True)

df_test_vce[['Cabin','Cabin_count']].loc[df_test_vce['Cabin'].isin(list(set(df_test['Cabin'].unique())-set(df_train['Cabin'].unique())))].head()

Unnamed: 0,Cabin,Cabin_count
630,A23,-9999.0
128,F E69,-9999.0
185,A32,-9999.0
209,A31,-9999.0
520,B73,-9999.0
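
A sentinel such as -9999 is a very large magnitude for a linear model like logistic regression. One of the other ways to handle unknowns, sketched here as an assumption and not as what the notebook does, is to encode unseen categories as 0 (they occurred zero times in training) and add an indicator column so the model can still tell them apart. All names and values below are illustrative.

```python
import pandas as pd

train_cabin = pd.Series(["C85", "C85", "E46"])
test_cabin = pd.Series(["C85", "A23", "E46"], index=[10, 630, 20])

# Counts learned on the training data only
count_lookup = train_cabin.value_counts().to_dict()

# Categories unseen in training come back as NaN from the lookup
raw = test_cabin.map(count_lookup)

encoded = pd.DataFrame({
    "Cabin_count": raw.fillna(0).astype(int),  # unseen -> 0 training occurrences
    "Cabin_unseen": raw.isna().astype(int),    # 1 flags an unseen category
})
print(encoded)
```

Which option works better depends on the model and the data, so it is worth comparing both on a validation set.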


In [380]:
df_train_vce.drop(Cat_predictors+['Name','Ticket','PassengerId'],axis=1,inplace=True)
df_test_vce.drop(Cat_predictors+['Name','Ticket','PassengerId'],axis=1,inplace=True)

<b>So, we have transformed the train and test data into the format we wanted using count encoding. Let's quickly develop a model so that we can look at the results</b>

In [406]:
from sklearn.linear_model import LogisticRegression
predictors=df_train_vce.loc[:,df_train_vce.columns!='Survived'].to_numpy()
outcome=df_train_vce['Survived'].to_numpy()
clf = LogisticRegression(solver='liblinear')

clf.fit(predictors,outcome)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [407]:
roc_auc_score(df_train_vce['Survived'],clf.predict_proba(df_train_vce.loc[:,df_train_vce.columns!='Survived'])[:,1])

0.8657218830184525

In [383]:
roc_auc_score(df_test_vce['Survived'],clf.predict_proba(df_test_vce.loc[:,df_test_vce.columns!='Survived'])[:,1])

0.8384810126582278

In [384]:
(0.8384810126582278-0.5289873417721519)/0.5289873417721519

0.5850681981335247

- **How much incremental improvement do we observe in the model's predictive strength?**
  - ~58% above random prediction, and better than one hot encoding. In practice, in most cases you will find that count encoding gives better prediction numbers than a simple one hot encoding, but there is also a higher possibility of overfitting (there are ways around it, which we will discuss soon)
- **Will the categorical variable transformation methodology be supported by the technical infrastructure in place for model inference in production?**
  - The transformation is simple, with much lower write cost and variable maintenance cost than one hot encoding: each variable is represented by a single variable after transformation, whereas with OHE each variable is replaced by close to as many variables as it has categories
- **How robust is the methodology against the domain shift that we might observe in the data, which will eventually happen in this ever-fluctuating world?**
  - We have discussed above how to handle domain shift while doing count encoding
 

## Vanilla target encoding

In target encoding, each category within a variable is represented by a summary (here, the mean) of the target/outcome for that category.
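
In its simplest form this is a per-category mean of the target, learned on the training data. The sketch below uses toy values and plain pandas. It is not the library's exact algorithm, since `ce.TargetEncoder` also exposes `smoothing` and `min_samples_leaf` parameters that this sketch ignores.

```python
import pandas as pd

train = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male"],
    "Survived": [0, 1, 0, 1, 1],
})
test_sex = pd.Series(["female", "male", "unknown"])

# "Fit": mean of the target per category, plus the overall base rate
category_means = train.groupby("Sex")["Survived"].mean()
base_rate = train["Survived"].mean()

# "Transform": look up the category mean; unseen categories get the base rate
encoded = test_sex.map(category_means).fillna(base_rate)
print(encoded.tolist())
```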

<b>Let's set up a target encoder and check what the results look like</b>

In [387]:
Target_enc=ce.TargetEncoder(cols=Cat_predictors)

Target_enc.fit(df_train[Cat_predictors],df_train['Survived'])

TargetEncoder(cols=['Pclass', 'Sex', 'Cabin', 'Embarked'], drop_invariant=False,
              handle_missing='value', handle_unknown='value',
              min_samples_leaf=1, return_df=True, smoothing=1.0, verbose=0)

In [388]:
df_train_vte=df_train.join(Target_enc.transform(df_train[Cat_predictors]).add_suffix('_target'))
df_test_vte=df_test.join(Target_enc.transform(df_test[Cat_predictors]).add_suffix('_target'))


In [392]:
df_train_vte.groupby('Sex')['Survived'].agg('mean')

Sex
female    0.727273
male      0.172113
Name: Survived, dtype: float64

In [393]:
df_train_vte[['Sex','Sex_target']].head()

Unnamed: 0,Sex,Sex_target
30,male,0.172113
10,female,0.727273
873,male,0.172113
182,male,0.172113
876,male,0.172113


In [395]:
df_test_vte[['Sex','Sex_target']].head()

Unnamed: 0,Sex,Sex_target
707,male,0.172113
37,male,0.172113
615,female,0.727273
169,male,0.172113
68,female,0.727273


<b>Here, we have replaced each category of the variable Sex with the mean of the target over the rows in that category</b>

<b>By default, target encoding in category encoders replaces unknown categories in the test set with the base rate of the outcome, which again is a very rudimentary way of handling them. This can be handled in multiple different ways based on domain knowledge and EDA</b>

In [397]:
df_test_vte[['Cabin','Cabin_target']].loc[df_test_vte['Cabin'].isin(list(set(df_test['Cabin'].unique())-set(df_train['Cabin'].unique())))].head()

Unnamed: 0,Cabin,Cabin_target
630,A23,0.369382
128,F E69,0.369382
185,A32,0.369382
209,A31,0.369382
520,B73,0.369382


In [401]:
df_train_vte.drop(Cat_predictors+['Name','Ticket','PassengerId'],axis=1,inplace=True)
df_test_vte.drop(Cat_predictors+['Name','Ticket','PassengerId'],axis=1,inplace=True)

In [403]:
from sklearn.linear_model import LogisticRegression
predictors=df_train_vte.loc[:,df_train_vte.columns!='Survived'].to_numpy()
outcome=df_train_vte['Survived'].to_numpy()
clf = LogisticRegression(solver='liblinear')

clf.fit(predictors,outcome)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [405]:
roc_auc_score(df_train_vte['Survived'],clf.predict_proba(df_train_vte.loc[:,df_train_vte.columns!='Survived'])[:,1])

0.8711331475945701

In [404]:
roc_auc_score(df_test_vte['Survived'],clf.predict_proba(df_test_vte.loc[:,df_test_vte.columns!='Survived'])[:,1])

0.810379746835443

- **How much incremental improvement do we observe in the model's predictive strength?**
  - We observe heavy overfitting when we use target encoding (train AUC 0.871 vs. test AUC 0.810), hence the drop in test AUC compared to the previous methods. This can be handled well, as we will observe soon
- **Will the categorical variable transformation methodology be supported by the technical infrastructure in place for model inference in production?**
  - The transformation is simple, with much lower write cost and variable maintenance cost than one hot encoding: each variable is represented by a single variable after transformation, whereas with OHE each variable is replaced by close to as many variables as it has categories
- **How robust is the methodology against the domain shift that we might observe in the data, which will eventually happen in this ever-fluctuating world?**
  - You have to decide how to handle the categories introduced by domain shift; here we have handled them using the base rate of the outcome in the training data
 

## K-fold Cross validated target encoding
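
One common way to set this up, sketched here under assumptions rather than as a definitive implementation, is to compute each training row's encoding from the other folds only, so a row never sees its own target. The function name, the random fold assignment and the out-of-fold base-rate fallback are all illustrative choices, and `n_splits` should be at least 2.

```python
import numpy as np
import pandas as pd

def kfold_target_encode(frame, col, target, n_splits=5, seed=0):
    """Out-of-fold mean-target encoding for one categorical column."""
    rng = np.random.default_rng(seed)
    # Randomly assign every row to one of n_splits folds
    fold_ids = rng.permutation(len(frame)) % n_splits
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    for fold in range(n_splits):
        held_out = fold_ids == fold
        rest = frame.loc[~held_out]
        # Category means computed without the held-out rows
        means = rest.groupby(col)[target].mean()
        # Categories missing from the other folds get their base rate
        fallback = rest[target].mean()
        values = frame.loc[held_out, col].map(means).fillna(fallback)
        encoded.loc[held_out] = values.to_numpy()
    return encoded

toy = pd.DataFrame({
    "Cabin": ["A", "A", "B", "B", "C", "A", "B", "A"],
    "Survived": [1, 0, 1, 1, 0, 1, 0, 1],
})
toy["Cabin_target_oof"] = kfold_target_encode(toy, "Cabin", "Survived", n_splits=4)
print(toy)
```

At inference time there is no target to leak, so new data would typically be encoded with category means computed on the full training set.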

## Catboost encoding