## Comparison of Categorical Variable Encodings

In this lecture, we will compare the performance of the categorical encoding techniques we have learned so far.

We will compare:

- One hot encoding
- Replacing labels by the count
- Ordering labels according to target
- Mean Encoding
- WoE

We will use the Titanic dataset.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import roc_auc_score

In [2]:
# let's load the titanic dataset

# we will only use these columns in the demo
cols = ['pclass', 'age', 'sibsp', 'parch', 'fare',
        'sex', 'cabin', 'embarked', 'survived']
import os
# change to the folder that contains titanic.csv
# (this path is specific to the author's machine)
os.chdir("/Users/ashishsrimal/Phase1Code/Feature Engineering/HandsOnPythonCode/titanic/")
data = pd.read_csv('titanic.csv', usecols=cols)

data.head()

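| | survived | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN | S |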


In [3]:
# let's check for missing data

data.isnull().sum()

survived      0
pclass        0
sex           0
age         177
sibsp         0
parch         0
fare          0
cabin       687
embarked      2
dtype: int64

In [4]:
# Drop observations with NA in Fare and embarked

data.dropna(subset=['fare', 'embarked'], inplace=True)

In [5]:
# Now we extract the first letter of the cabin

data['cabin'] = data['cabin'].astype(str).str[0]

data.head()

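| | survived | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | n | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | n | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | n | S |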


In [6]:
# drop observations with cabin = T, they are too few

data = data[data['cabin'] != 'T']

In [7]:
# Let's divide into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels='survived', axis=1),  # predictors
    data['survived'],  # target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((621, 8), (267, 8))

In [8]:
# Let's replace missing values in numerical variables by the mean


def impute_na(df, variable, value):
    # assign the result back instead of calling fillna(inplace=True) on a
    # column selection, which may not update df in recent pandas versions
    df[variable] = df[variable].fillna(value)


impute_na(X_test, 'age', X_train['age'].mean())
impute_na(X_train, 'age',  X_train['age'].mean())
# note how I impute the test set first; this way the value of
# the mean used will be the same for both train and test
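
The same idea, learning the fill value on the train set and reusing it on the test set, can also be expressed with scikit-learn's `SimpleImputer`. The following is a minimal sketch on toy data; the frames `train` and `test` are illustrative stand-ins and are not the Titanic split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy frames standing in for X_train / X_test (hypothetical values)
train = pd.DataFrame({'age': [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({'age': [np.nan, 50.0]})

imputer = SimpleImputer(strategy='mean')
imputer.fit(train[['age']])  # the mean is learned from the train set only

# the same learned mean fills the gaps in both sets
train['age'] = imputer.transform(train[['age']]).ravel()
test['age'] = imputer.transform(test[['age']]).ravel()

print(imputer.statistics_)  # the mean of 20, 30 and 40
```

Because the imputer stores the learned value, the order in which the two sets are transformed no longer matters.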

In [9]:
X_train.head()

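| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 352 | 3 | male | 15.0 | 1 | 1 | 7.2292 | n | C |
| 125 | 3 | male | 12.0 | 1 | 0 | 11.2417 | n | C |
| 579 | 3 | male | 32.0 | 0 | 0 | 7.925 | n | S |
| 424 | 3 | male | 18.0 | 1 | 1 | 20.2125 | n | S |
| 119 | 3 | female | 2.0 | 4 | 2 | 31.275 | n | S |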


In [10]:
# let's check that we have no missing data after NA imputation

X_train.isnull().sum(), X_test.isnull().sum()

(pclass      0
 sex         0
 age         0
 sibsp       0
 parch       0
 fare        0
 cabin       0
 embarked    0
 dtype: int64,
 pclass      0
 sex         0
 age         0
 sibsp       0
 parch       0
 fare        0
 cabin       0
 embarked    0
 dtype: int64)

### One Hot Encoding

In [11]:
def get_OHE(df):

    df_OHE = pd.concat(
        [df[['pclass', 'age', 'sibsp', 'parch', 'fare']],
         pd.get_dummies(df[['sex', 'cabin', 'embarked']], drop_first=True)],
        axis=1)

    return df_OHE


X_train_OHE = get_OHE(X_train)
X_test_OHE = get_OHE(X_test)

X_train_OHE.head()

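| | pclass | age | sibsp | parch | fare | sex_male | cabin_B | cabin_C | cabin_D | cabin_E | cabin_F | cabin_G | cabin_n | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 352 | 3 | 15.0 | 1 | 1 | 7.2292 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 125 | 3 | 12.0 | 1 | 0 | 11.2417 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 579 | 3 | 32.0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 424 | 3 | 18.0 | 1 | 1 | 20.2125 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 119 | 3 | 2.0 | 4 | 2 | 31.275 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |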


In [12]:
X_test_OHE.head()

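| | pclass | age | sibsp | parch | fare | sex_male | cabin_B | cabin_C | cabin_D | cabin_E | cabin_F | cabin_G | cabin_n | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 3 | 14.0 | 0 | 0 | 7.8542 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 159 | 3 | 29.696519 | 8 | 2 | 69.55 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 764 | 3 | 16.0 | 0 | 0 | 7.775 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 742 | 1 | 21.0 | 2 | 2 | 262.375 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 484 | 1 | 25.0 | 1 | 0 | 91.0792 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |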
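
Here the train and test sets happen to produce the same dummy columns, but calling `pd.get_dummies` separately on each set does not guarantee this: if a category is missing from one set, the columns (and the level removed by `drop_first`) can differ. One way to guard against that, sketched below on toy data, is to fix the category list from the train set before encoding. The frames, the `categories` list and the `encode` helper are hypothetical names used only for illustration.

```python
import pandas as pd

train = pd.DataFrame({'cabin': ['A', 'B', 'C', 'B']})
test = pd.DataFrame({'cabin': ['B', 'C', 'B']})  # 'A' never appears here

# encoding each set on its own gives different columns
naive_train = pd.get_dummies(train, drop_first=True)  # cabin_B, cabin_C
naive_test = pd.get_dummies(test, drop_first=True)    # cabin_C only

# learn the category list from the train set only
categories = sorted(train['cabin'].unique())


def encode(df):
    # a fixed category list makes get_dummies emit the same columns,
    # and drop the same first level, for every frame
    fixed = pd.Categorical(df['cabin'], categories=categories)
    return pd.get_dummies(df.assign(cabin=fixed), columns=['cabin'],
                          drop_first=True)


train_ohe = encode(train)
test_ohe = encode(test)

print(list(train_ohe.columns), list(test_ohe.columns))
```

With this approach, a category that appears only in the test set becomes a missing value in the categorical column and is encoded as a row of zeros.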


### Count Encoding

In [13]:
def categorical_to_counts(df_train, df_test):

    # make a temporary copy of the original dataframes
    df_train_temp = df_train.copy()
    df_test_temp = df_test.copy()

    for col in ['sex', 'cabin', 'embarked']:

        # make dictionary mapping category to counts
        counts_map = df_train_temp[col].value_counts().to_dict()

        # remap the labels to their counts
        df_train_temp[col] = df_train_temp[col].map(counts_map)
        df_test_temp[col] = df_test_temp[col].map(counts_map)

    return df_train_temp, df_test_temp


X_train_count, X_test_count = categorical_to_counts(X_train, X_test)

X_train_count.head()

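| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 352 | 3 | 402 | 15.0 | 1 | 1 | 7.2292 | 473 | 108 |
| 125 | 3 | 402 | 12.0 | 1 | 0 | 11.2417 | 473 | 108 |
| 579 | 3 | 402 | 32.0 | 0 | 0 | 7.925 | 473 | 457 |
| 424 | 3 | 402 | 18.0 | 1 | 1 | 20.2125 | 473 | 457 |
| 119 | 3 | 219 | 2.0 | 4 | 2 | 31.275 | 473 | 457 |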
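
Because the counts are learned from the train set, a category that appears only in the test set has no entry in `counts_map`, and `map` returns NaN for it. That does not happen in this split, but the sketch below shows one possible way to handle it on toy data. Treating an unseen category as a count of 0 is an assumption made for illustration, not something this notebook prescribes.

```python
import pandas as pd

train = pd.DataFrame({'embarked': ['S', 'S', 'C', 'S', 'C']})
test = pd.DataFrame({'embarked': ['S', 'Q', 'C']})  # 'Q' is unseen in train

# counts are learned from the train set only
counts_map = train['embarked'].value_counts().to_dict()

# 'Q' has no entry in the map, so it becomes NaN
test_encoded = test['embarked'].map(counts_map)
n_unseen = int(test_encoded.isnull().sum())

# assumption: an unseen category is treated as having been observed 0 times
test_encoded = test_encoded.fillna(0).astype(int)

print(n_unseen, test_encoded.tolist())
```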


### Ordered Integer Encoding

In [14]:
def categories_to_ordered(df_train, df_test, y_train, y_test):

    # make a temporary copy of the datasets
    df_train_temp = pd.concat([df_train, y_train], axis=1).copy()
    df_test_temp = pd.concat([df_test, y_test], axis=1).copy()

    for col in ['sex', 'cabin', 'embarked']:

        # order categories according to target mean
        ordered_labels = df_train_temp.groupby(
            [col])['survived'].mean().sort_values().index

        # create the dictionary to map the ordered labels to an ordinal number
        ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}

        # remap the categories  to these ordinal numbers
        df_train_temp[col] = df_train[col].map(ordinal_label)
        df_test_temp[col] = df_test[col].map(ordinal_label)

    # remove the target
    df_train_temp.drop(['survived'], axis=1, inplace=True)
    df_test_temp.drop(['survived'], axis=1, inplace=True)

    return df_train_temp, df_test_temp


X_train_ordered, X_test_ordered = categories_to_ordered(
    X_train, X_test, y_train, y_test)

X_train_ordered.head()

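| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 352 | 3 | 0 | 15.0 | 1 | 1 | 7.2292 | 0 | 2 |
| 125 | 3 | 0 | 12.0 | 1 | 0 | 11.2417 | 0 | 2 |
| 579 | 3 | 0 | 32.0 | 0 | 0 | 7.925 | 0 | 0 |
| 424 | 3 | 0 | 18.0 | 1 | 1 | 20.2125 | 0 | 0 |
| 119 | 3 | 1 | 2.0 | 4 | 2 | 31.275 | 0 | 0 |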


### Mean Encoding

In [15]:
def categories_to_mean(df_train, df_test, y_train, y_test):

    # make a temporary copy of the datasets
    df_train_temp = pd.concat([df_train, y_train], axis=1).copy()
    df_test_temp = pd.concat([df_test, y_test], axis=1).copy()

    for col in ['sex', 'cabin', 'embarked']:

        # calculate mean target per category
        ordered_labels = df_train_temp.groupby(
            [col])['survived'].mean().to_dict()

        # remap the categories to target mean
        df_train_temp[col] = df_train[col].map(ordered_labels)
        df_test_temp[col] = df_test[col].map(ordered_labels)

    # remove the target
    df_train_temp.drop(['survived'], axis=1, inplace=True)
    df_test_temp.drop(['survived'], axis=1, inplace=True)

    return df_train_temp, df_test_temp


X_train_mean, X_test_mean = categories_to_mean(
    X_train, X_test, y_train, y_test)

X_train_mean.head()

| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 352 | 3 | 0.199005 | 15.0 | 1 | 1 | 7.2292 | 0.308668 | 0.555556 |
| 125 | 3 | 0.199005 | 12.0 | 1 | 0 | 11.2417 | 0.308668 | 0.555556 |
| 579 | 3 | 0.199005 | 32.0 | 0 | 0 | 7.925 | 0.308668 | 0.347921 |
| 424 | 3 | 0.199005 | 18.0 | 1 | 1 | 20.2125 | 0.308668 | 0.347921 |
| 119 | 3 | 0.739726 | 2.0 | 4 | 2 | 31.275 | 0.308668 | 0.347921 |


### Weight of Evidence (Log of Probability Ratio)

In [16]:
def categories_to_ratio(df_train, df_test, y_train, y_test):

    # make a temporary copy of the datasets
    df_train_temp = pd.concat([df_train, y_train], axis=1).copy()
    df_test_temp = pd.concat([df_test, y_test], axis=1).copy()

    for col in ['sex', 'cabin', 'embarked']:

        # create df containing the different parts of the WoE equation
        # prob survived =1
        prob_df = pd.DataFrame(df_train_temp.groupby([col])['survived'].mean())

        # prob survived = 0
        prob_df['died'] = 1-prob_df.survived

        # calculate WoE
        prob_df['Ratio'] = np.log(prob_df.survived/prob_df.died)

        # capture woe in dictionary
        woe = prob_df['Ratio'].to_dict()

        # re-map the labels to WoE
        df_train_temp[col] = df_train[col].map(woe)
        df_test_temp[col] = df_test[col].map(woe)

    # drop the target
    df_train_temp.drop(['survived'], axis=1, inplace=True)
    df_test_temp.drop(['survived'], axis=1, inplace=True)

    return df_train_temp, df_test_temp


X_train_ratio, X_test_ratio = categories_to_ratio(X_train, X_test, y_train, y_test)

X_train_ratio.head()

| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 352 | 3 | -1.392525 | 15.0 | 1 | 1 | 7.2292 | -0.806354 | 0.223144 |
| 125 | 3 | -1.392525 | 12.0 | 1 | 0 | 11.2417 | -0.806354 | 0.223144 |
| 579 | 3 | -1.392525 | 32.0 | 0 | 0 | 7.925 | -0.806354 | -0.628189 |
| 424 | 3 | -1.392525 | 18.0 | 1 | 1 | 20.2125 | -0.806354 | -0.628189 |
| 119 | 3 | 1.044545 | 2.0 | 4 | 2 | 31.275 | -0.806354 | -0.628189 |
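
The function above computes the log of the ratio P(survived=1) / P(survived=0) within each category. A common definition of WoE instead compares each category's share of all positives with its share of all negatives (the sign convention varies between sources). The sketch below, on toy data with hypothetical values, suggests that the two quantities differ only by a constant that is the same for every category, so they carry the same information for a model.

```python
import numpy as np
import pandas as pd

# toy data: 4 males (1 survivor) and 3 females (2 survivors)
df = pd.DataFrame({
    'sex': ['m', 'm', 'm', 'm', 'f', 'f', 'f'],
    'survived': [1, 0, 0, 0, 1, 1, 0],
})

# log of the probability ratio, as in categories_to_ratio
p = df.groupby('sex')['survived'].mean()
log_ratio = np.log(p / (1 - p))

# WoE from the distribution of positives and negatives across categories
died = 1 - df['survived']
pos_share = df.groupby('sex')['survived'].sum() / df['survived'].sum()
neg_share = died.groupby(df['sex']).sum() / died.sum()
woe = np.log(pos_share / neg_share)

# the difference equals log(total positives / total negatives) everywhere
offset = log_ratio - woe
print(offset)
```

Note that both versions are undefined when a category contains only survivors or only non-survivors, since the ratio is then divided by zero or its logarithm is taken at zero.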


### Random Forest Performance

In [17]:
# create a function to build random forests and compare performance in train and test set


def run_randomForests(X_train, X_test, y_train, y_test):

    rf = RandomForestClassifier(n_estimators=50, random_state=39, max_depth=3)
    rf.fit(X_train, y_train)

    print('Train set')
    pred = rf.predict_proba(X_train)
    print(
        'Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

    print('Test set')
    pred = rf.predict_proba(X_test)
    print(
        'Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

In [18]:
# OHE
run_randomForests(X_train_OHE, X_test_OHE, y_train, y_test)

Train set
Random Forests roc-auc: 0.8604254344839617
Test set
Random Forests roc-auc: 0.8677394034536892


In [19]:
# counts
run_randomForests(X_train_count, X_test_count, y_train, y_test)

Train set
Random Forests roc-auc: 0.8760494123290956
Test set
Random Forests roc-auc: 0.8819587006400194


In [20]:
# ordered labels
run_randomForests(X_train_ordered, X_test_ordered, y_train, y_test)

Train set
Random Forests roc-auc: 0.8770960989118821
Test set
Random Forests roc-auc: 0.8806605482429659


In [21]:
# mean encoding
run_randomForests(X_train_mean, X_test_mean, y_train, y_test)

Train set
Random Forests roc-auc: 0.8770960989118821
Test set
Random Forests roc-auc: 0.8806605482429659


In [22]:
# ratio
run_randomForests(X_train_ratio, X_test_ratio, y_train, y_test)

Train set
Random Forests roc-auc: 0.8770960989118821
Test set
Random Forests roc-auc: 0.8806605482429659


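Comparing the roc-auc values on the test sets, we can see that one hot encoding has the worst performance (0.868, versus roughly 0.881 to 0.882 for the other encodings). This makes sense because one hot encoding spreads the information of each categorical variable over several sparse binary columns, and trees tend not to perform well on such enlarged feature spaces.

The remaining encodings returned similar performances. Ordered integer, mean and ratio encoding give exactly the same roc-auc, because each one is a monotonic transformation of the others and tree splits depend only on the order of the values, not on their magnitude. This also makes sense more generally: trees are non-linear models, so target guided encodings may not necessarily improve the model performance.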
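
The identical scores of the ordered, mean and ratio encodings in the outputs above can be reproduced on synthetic data. The sketch below uses a single decision tree and made-up per-category target means (both are assumptions for illustration) to show that three encodings which preserve the same category order lead to the same predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# hypothetical target means for three categories, sorted ascending
means = np.array([0.2, 0.35, 0.74])
codes = rng.integers(0, 3, size=200)  # ordinal codes 0, 1, 2
y = rng.binomial(1, means[codes])     # target drawn from those means

# three encodings of the same variable, all with the same ordering
X_ordinal = codes.reshape(-1, 1).astype(float)
X_mean = means[codes].reshape(-1, 1)
X_ratio = np.log(X_mean / (1 - X_mean))


def fit_predict(X):
    # same tree settings for every encoding
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    return tree.fit(X, y).predict_proba(X)[:, 1]


same = (np.allclose(fit_predict(X_ordinal), fit_predict(X_mean))
        and np.allclose(fit_predict(X_mean), fit_predict(X_ratio)))
print(same)
```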

### Logistic Regression Performance

In [23]:
def run_logistic(X_train, X_test, y_train, y_test):

    # function to train and test the performance of logistic regression
    logit = LogisticRegression(random_state=44, C=0.01, max_iter=100)
    logit.fit(X_train, y_train)

    print('Train set')
    pred = logit.predict_proba(X_train)
    print(
        'Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

    print('Test set')
    pred = logit.predict_proba(X_test)
    print(
        'Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

In [24]:
# OHE
run_logistic(X_train_OHE, X_test_OHE, y_train, y_test)

Train set
Logistic Regression roc-auc: 0.8299952026864956
Test set
Logistic Regression roc-auc: 0.8428329911846394


In [25]:
# counts
run_logistic(X_train_count, X_test_count, y_train, y_test)

Train set
Logistic Regression roc-auc: 0.8447796506683529
Test set
Logistic Regression roc-auc: 0.8584711991305398


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [26]:
# ordered labels
run_logistic(X_train_ordered, X_test_ordered, y_train, y_test)

Train set
Logistic Regression roc-auc: 0.8143385158856495
Test set
Logistic Regression roc-auc: 0.8309382924767539


In [27]:
# mean encoding
run_logistic(X_train_mean, X_test_mean, y_train, y_test)

Train set
Logistic Regression roc-auc: 0.7738339257288646
Test set
Logistic Regression roc-auc: 0.7892766574085255


In [28]:
# ratio
run_logistic(X_train_ratio, X_test_ratio, y_train, y_test)

Train set
Logistic Regression roc-auc: 0.8506236507555769
Test set
Logistic Regression roc-auc: 0.8726603067262407


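For logistic regression, the best test performance is obtained with the weight of evidence (ratio) encoding (0.873), which places each category on the log-odds scale that logistic regression models linearly. Count encoding (0.858), one hot encoding (0.843) and ordered encoding (0.831) follow.

Note, however, that the count encoding result should be read with caution: the solver stopped at the iteration limit before converging, and counts do not in general create a monotonic relationship between variables and target. Mean encoding returns the worst performance here (0.789). Its roc-auc on the train set (0.774) is lower than on the test set, so over-fitting is an unlikely explanation. A more plausible one is that the mean encoded variables span a narrow range between 0 and 1, so the strong regularisation (C=0.01) shrinks their contribution relative to unscaled variables such as age and fare.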