## Comparison of Categorical Variable Encodings

In this lecture, we will compare the performance of the different categorical encoding techniques we have learned so far.

We will compare:

- One hot encoding
- Replacing labels by the count
- Ordering labels according to target
- Mean Encoding
- Probability ratio

We will use the Titanic dataset.

In [72]:
pip install feature_engine


In [136]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from pandas.api.types import is_numeric_dtype
from sklearn.metrics import roc_auc_score
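# import only the encoders used in this notebook
from feature_engine.encoding import (
    OneHotEncoder, CountFrequencyEncoder, OrdinalEncoder,
    MeanEncoder, PRatioEncoder)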

In [74]:
# let's load the titanic dataset

# we will only use these columns in the demo
cols = ['pclass', 'age', 'sibsp', 'parch', 'fare',
        'sex', 'cabin', 'embarked', 'survived']

data = pd.read_csv('titanic.csv', usecols=cols)

data.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked
0,1,1,female,29.0,0,0,211.3375,B5,S
1,1,1,male,0.9167,1,2,151.55,C22,S
2,1,0,female,2.0,1,2,151.55,C22,S
3,1,0,male,30.0,1,2,151.55,C22,S
4,1,0,female,25.0,1,2,151.55,C22,S


In [75]:
# let's check for missing data

data.isnull().sum()

pclass         0
survived       0
sex            0
age          263
sibsp          0
parch          0
fare           1
cabin       1014
embarked       2
dtype: int64

In [76]:
# Drop observations with NA in fare, embarked, age or cabin
data.dropna(axis=0, how='any', subset=['fare', 'embarked', 'age', 'cabin'], inplace=True)

In [77]:
data.isnull().sum()

pclass      0
survived    0
sex         0
age         0
sibsp       0
parch       0
fare        0
cabin       0
embarked    0
dtype: int64

In [78]:
# Now we extract the first letter of the cabin
data['cabin'] = data['cabin'].str[0]
data.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked
0,1,1,female,29.0,0,0,211.3375,B,S
1,1,1,male,0.9167,1,2,151.55,C,S
2,1,0,female,2.0,1,2,151.55,C,S
3,1,0,male,30.0,1,2,151.55,C,S
4,1,0,female,25.0,1,2,151.55,C,S


In [79]:
# drop observations with cabin = T, they are too few
data=data[data.cabin!='T']

In [80]:
# Let's divide into train and test set
X_train, X_test, Y_train, Y_test = train_test_split(
    data.drop(columns='survived'),  # predictors
    data[['survived']],             # target, kept as a one-column DataFrame
    test_size=0.3)
X_train.shape, X_test.shape

((188, 8), (81, 8))

In [81]:
# Let's replace null values in numerical variables by the mean of the train set
for col in X_train.columns:
    if is_numeric_dtype(X_train[col]):
        train_mean = X_train[col].mean()  # learned from the train set only
        X_train[col] = X_train[col].fillna(train_mean)
        X_test[col] = X_test[col].fillna(train_mean)

In [82]:
data['cabin'].unique()

array(['B', 'C', 'E', 'D', 'A', 'F', 'G'], dtype=object)

In [83]:
# let's check that the full dataset has no missing data (rows with NA were dropped above)
data.isnull().sum()

pclass      0
survived    0
sex         0
age         0
sibsp       0
parch       0
fare        0
cabin       0
embarked    0
dtype: int64

In [84]:
X_train

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin,embarked
110,1,male,30.0,0,0,27.7500,C,C
133,1,male,49.0,1,0,89.1042,C,C
294,1,male,49.0,1,1,110.8833,C,C
5,1,male,48.0,0,0,26.5500,E,S
10,1,male,47.0,1,0,227.5250,C,C
...,...,...,...,...,...,...,...,...
169,1,female,50.0,0,0,28.7125,C,C
8,1,female,53.0,2,0,51.4792,C,S
301,1,male,47.0,0,0,34.0208,D,S
233,1,female,56.0,0,1,83.1583,C,C


### One Hot Encoding

In [138]:
X_train_OHE=OneHotEncoder().fit(X_train,Y_train).transform(X_train)

X_train_OHE.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,sex_male,sex_female,cabin_C,cabin_E,cabin_A,cabin_G,cabin_F,cabin_B,cabin_D,embarked_C,embarked_S,embarked_Q
110,1,30.0,0,0,27.75,1,0,1,0,0,0,0,0,0,1,0,0
133,1,49.0,1,0,89.1042,1,0,1,0,0,0,0,0,0,1,0,0
294,1,49.0,1,1,110.8833,1,0,1,0,0,0,0,0,0,1,0,0
5,1,48.0,0,0,26.55,1,0,0,1,0,0,0,0,0,0,1,0
10,1,47.0,1,0,227.525,1,0,1,0,0,0,0,0,0,1,0,0


In [139]:
X_test_OHE=OneHotEncoder().fit(X_train,Y_train).transform(X_test)

X_test_OHE.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,sex_male,sex_female,cabin_C,cabin_E,cabin_A,cabin_G,cabin_F,cabin_B,cabin_D,embarked_C,embarked_S,embarked_Q
244,1,36.0,0,0,40.125,1,0,0,0,1,0,0,0,0,1,0,0
282,1,52.0,1,0,78.2667,0,1,0,0,0,0,0,0,1,1,0,0
232,1,47.0,0,0,52.0,1,0,1,0,0,0,0,0,0,0,1,0
113,1,23.0,3,2,263.0,0,1,1,0,0,0,0,0,0,0,1,0
229,1,17.0,1,0,108.9,0,1,1,0,0,0,0,0,0,1,0,0


### Count encoding

In [111]:
X_train_count=CountFrequencyEncoder().fit(X_train,Y_train).transform(X_train)

X_train_count.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin,embarked
110,1,94,30.0,0,0,27.75,58,77
133,1,94,49.0,1,0,89.1042,58,77
294,1,94,49.0,1,1,110.8833,58,77
5,1,94,48.0,0,0,26.55,27,109
10,1,94,47.0,1,0,227.525,58,77


In [113]:
X_test_count=CountFrequencyEncoder().fit(X_train,Y_train).transform(X_test)

X_test_count.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin,embarked
244,1,94,36.0,0,0,40.125,13,77
282,1,94,52.0,1,0,78.2667,31,77
232,1,94,47.0,0,0,52.0,58,109
113,1,94,23.0,3,2,263.0,58,109
229,1,94,17.0,1,0,108.9,58,77


### Ordered Integer Encoding

In [91]:
X_train_ordered=OrdinalEncoder().fit(X_train,Y_train).transform(X_train)

X_train_ordered.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin,embarked
110,1,0,30.0,0,0,27.75,2,2
133,1,0,49.0,1,0,89.1042,2,2
294,1,0,49.0,1,1,110.8833,2,2
5,1,0,48.0,0,0,26.55,6,1
10,1,0,47.0,1,0,227.525,2,2


In [116]:
X_test_ordered=OrdinalEncoder().fit(X_train,Y_train).transform(X_test)

X_test_ordered.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin,embarked
244,1,0,36.0,0,0,40.125,1,2
282,1,1,52.0,1,0,78.2667,3,2
232,1,0,47.0,0,0,52.0,2,1
113,1,1,23.0,3,2,263.0,2,1
229,1,1,17.0,1,0,108.9,2,2


### Mean Encoding

In [92]:
X_train_mean=MeanEncoder().fit(X_train,Y_train).transform(X_train)

X_train_mean.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin,embarked
110,1,0.414894,30.0,0,0,27.75,0.637931,0.714286
133,1,0.414894,49.0,1,0,89.1042,0.637931,0.714286
294,1,0.414894,49.0,1,1,110.8833,0.637931,0.714286
5,1,0.414894,48.0,0,0,26.55,0.740741,0.651376
10,1,0.414894,47.0,1,0,227.525,0.637931,0.714286


In [119]:
X_test_mean=MeanEncoder().fit(X_train,Y_train).transform(X_test)

X_test_mean.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin,embarked
244,1,0.414894,36.0,0,0,40.125,0.615385,0.714286
282,1,0.93617,52.0,1,0,78.2667,0.677419,0.714286
232,1,0.414894,47.0,0,0,52.0,0.637931,0.651376
113,1,0.93617,23.0,3,2,263.0,0.637931,0.651376
229,1,0.93617,17.0,1,0,108.9,0.637931,0.714286


### Probability Ratio

In [120]:
X_train_ratio=PRatioEncoder(encoding_method='ratio').fit(X_train,Y_train['survived']).transform(X_train)
X_test_ratio=PRatioEncoder(encoding_method='ratio').fit(X_train,Y_train['survived']).transform(X_test)
X_train_ratio.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin,embarked
110,1,0.709091,30.0,0,0,27.75,1.761905,2.5
133,1,0.709091,49.0,1,0,89.1042,1.761905,2.5
294,1,0.709091,49.0,1,1,110.8833,1.761905,2.5
5,1,0.709091,48.0,0,0,26.55,2.857143,1.868421
10,1,0.709091,47.0,1,0,227.525,1.761905,2.5


### Random Forest Performance

In [121]:
# create a function to build random forests and compare performance in train and test set
def run_randomForests(X_train, X_test, Y_train, Y_test):
    rfc = RandomForestClassifier(n_estimators=50, random_state=39, max_depth=3)
    rfc.fit(X_train, Y_train['survived'])
    # note: predict() returns class labels, so the roc-auc is computed from hard predictions
    print("Train set")
    print("Random Forests roc-auc:", roc_auc_score(Y_train, rfc.predict(X_train)))
    print("Test set")
    print("Random Forests roc-auc:", roc_auc_score(Y_test, rfc.predict(X_test)))
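
The function above scores the class labels returned by `predict`. ROC-AUC is usually computed from predicted probabilities, which rank the observations instead of splitting them at a single threshold. The sketch below shows both variants side by side on synthetic data; `X_demo` and `y_demo` are made up for the illustration, and the numbers it prints are unrelated to the Titanic results in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=39, max_depth=3)
rf.fit(X_demo, y_demo)

# roc-auc from hard class labels (as in the function above)
auc_labels = roc_auc_score(y_demo, rf.predict(X_demo))

# roc-auc from the predicted probability of the positive class
auc_proba = roc_auc_score(y_demo, rf.predict_proba(X_demo)[:, 1])

print("roc-auc from labels:", auc_labels)
print("roc-auc from probabilities:", auc_proba)
```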


In [122]:
X_train_OHE

Unnamed: 0,pclass,age,sibsp,parch,fare,sex_male,sex_female,cabin_A,cabin_D,cabin_C,cabin_B,cabin_E,cabin_F,cabin_G,embarked_C,embarked_S,embarked_Q
244,1,36.0,0,0,40.1250,1,0,1,0,0,0,0,0,0,1,0,0
282,1,52.0,1,0,78.2667,0,1,0,1,0,0,0,0,0,1,0,0
232,1,47.0,0,0,52.0000,1,0,0,0,1,0,0,0,0,0,1,0
113,1,23.0,3,2,263.0000,0,1,0,0,1,0,0,0,0,0,1,0
229,1,17.0,1,0,108.9000,0,1,0,0,1,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
220,1,22.0,0,1,61.9792,0,1,0,0,0,1,0,0,0,1,0,0
101,1,39.0,0,0,29.7000,1,0,1,0,0,0,0,0,0,1,0,0
91,1,31.0,1,0,57.0000,1,0,0,0,0,1,0,0,0,0,1,0
93,1,53.0,1,1,81.8583,1,0,1,0,0,0,0,0,0,0,1,0


In [140]:
# OHE
run_randomForests(X_train_OHE, X_test_OHE, Y_train, Y_test)

Train set
Random Forests roc-auc: 0.7937266038466503
Test set
Random Forests roc-auc: 0.7796495956873315



In [124]:
# counts
run_randomForests(X_train_count, X_test_count, Y_train, Y_test)

Train set
Random Forests roc-auc: 0.5413063121208209
Test set
Random Forests roc-auc: 0.49056603773584906


In [125]:
# ordered labels
run_randomForests(X_train_ordered, X_test_ordered, Y_train, Y_test)

Train set
Random Forests roc-auc: 0.8291596747127921
Test set
Random Forests roc-auc: 0.737533692722372


In [126]:
# mean encoding
run_randomForests(X_train_mean, X_test_mean, Y_train, Y_test)

Train set
Random Forests roc-auc: 0.8291596747127921
Test set
Random Forests roc-auc: 0.737533692722372


In [127]:
# ratio
run_randomForests(X_train_ratio, X_test_ratio, Y_train, Y_test)

Train set
Random Forests roc-auc: 0.8291596747127921
Test set
Random Forests roc-auc: 0.737533692722372


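Comparing the roc-auc values printed for the test sets, count encoding has the worst performance in this run (about 0.49, no better than chance). In this split both sexes occur 94 times in the train set, so count encoding maps them to the same value and the information in `sex` is lost. One hot encoding does not underperform here (about 0.78), although trees can struggle when one hot encoding greatly expands the feature space.

The ordered, mean and probability ratio encodings returned identical performances (about 0.74). This makes sense: all three sort the categories by the target in the same order, and trees are non-linear models that depend only on the order of the values, so one target guided encoding does not necessarily improve on another.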
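
One likely reason the ordered, mean and ratio runs above print identical scores is that a tree only uses the order of the values of a variable, and these three encodings sort the categories in the same order. The sketch below checks this on a single made-up variable (`cats` and `y_toy` are hypothetical): a decision tree fitted on each encoding returns the same predicted probabilities.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# hypothetical toy data: three categories with target means 0.25, 0.5 and 0.75
cats = np.array(["A"] * 4 + ["B"] * 4 + ["C"] * 4)
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0])

p = {"A": 0.25, "B": 0.5, "C": 0.75}
encodings = {
    "ordered": {"A": 0, "B": 1, "C": 2},                 # rank by target mean
    "mean": p,                                            # target mean
    "ratio": {k: v / (1 - v) for k, v in p.items()},      # p(1) / p(0)
}

probas = {}
for name, mapping in encodings.items():
    # encode the single variable and fit the same tree on it
    x = np.array([mapping[c] for c in cats], dtype=float).reshape(-1, 1)
    tree = DecisionTreeClassifier(max_depth=3, random_state=39).fit(x, y_toy)
    probas[name] = tree.predict_proba(x)[:, 1]

print(probas)
```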

### Logistic Regression Performance

In [129]:
# create a function for Logistic Regression
def run_logistic(X_train, X_test, Y_train, Y_test):
    logit = LogisticRegression()
    logit.fit(X_train, Y_train['survived'])
    # note: predict() returns class labels, so the roc-auc is computed from hard predictions
    print("Train set")
    print("Logistic Regression roc-auc:", roc_auc_score(Y_train, logit.predict(X_train)))
    print("Test set")
    print("Logistic Regression roc-auc:", roc_auc_score(Y_test, logit.predict(X_test)))


In [141]:
# OHE
run_logistic(X_train_OHE, X_test_OHE, Y_train, Y_test)

Train set
Logistic Regression roc-auc: 0.7937266038466503
Test set
Logistic Regression roc-auc: 0.7796495956873315


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [132]:
# counts
run_logistic(X_train_count, X_test_count, Y_train, Y_test)

Train set
Logistic Regression roc-auc: 0.5337550019362334
Test set
Logistic Regression roc-auc: 0.5525606469002695


In [133]:
# ordered labels
run_logistic(X_train_ordered, X_test_ordered, Y_train, Y_test)

Train set
Logistic Regression roc-auc: 0.7776558667871434
Test set
Logistic Regression roc-auc: 0.7995283018867925


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [134]:
# mean encoding
run_logistic(X_train_mean, X_test_mean, Y_train, Y_test)

Train set
Logistic Regression roc-auc: 0.6996256615464049
Test set
Logistic Regression roc-auc: 0.7995283018867925


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [135]:
# ratio
run_logistic(X_train_ratio, X_test_ratio, Y_train, Y_test)

Train set
Logistic Regression roc-auc: 0.7743642700400154
Test set
Logistic Regression roc-auc: 0.8342318059299192


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

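For logistic regression, the best test performance in this run is obtained with the probability ratio (about 0.83), followed by the ordered and mean encodings (about 0.80) and one hot encoding (about 0.78). One hot encoding preserves linear relationships between variables and target, and the target guided encodings create a monotonic relationship between the encoded variable and the target.

Note, however, that count encoding returns the worst performance (about 0.55), as it does not create a monotonic relationship between variables and target. Mean encoding scores lower on the train set (about 0.70) than on the test set, so this run does not show the over-fitting that mean target encoding can cause. Most of these models did not converge and the test set has only 81 observations, so small differences should be read with caution.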
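
The convergence warnings printed above suggest two remedies: raising `max_iter` or scaling the data. The sketch below combines both in a scikit-learn pipeline. It runs on synthetic data (`X_demo` and `y_demo` are made up), and the choice of `StandardScaler` and `max_iter=1000` is an assumption for illustration, not a setting taken from this notebook.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data with variables on different scales, for illustration only
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(30, 10, 200), rng.normal(80, 60, 200)])
y_demo = (X_demo[:, 1] + rng.normal(0, 40, 200) > 80).astype(int)

# scale the variables, then fit a logistic regression with a larger iteration budget
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

with warnings.catch_warnings():
    # turn a ConvergenceWarning into an error so it cannot go unnoticed
    warnings.simplefilter("error", ConvergenceWarning)
    model.fit(X_demo, y_demo)

n_iter = int(model.named_steps["logisticregression"].n_iter_[0])
print("iterations used:", n_iter)
```

Fitting the scaler inside the pipeline means it learns its parameters from the train set only when the pipeline is fitted on `X_train`.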