# A detailed guide to different encoding schemes to deal with categorical features in data and their implications on the model's performance.

This dataset contains only categorical features. Because handling categorical features is such a common task and an important skill to master, this kernel applies different encoding techniques to categorical data and examines their effect on the model's performance.

The dataset contains the following types of categorical features:

* binary features (bin_*)
* low- and high-cardinality nominal features (nom_*)
* low- and high-cardinality ordinal features (ord_*)
* (potentially) cyclical features (day (of the week) and month features)
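
This kernel does not give the cyclical features any special treatment, but one common option is a sine/cosine transform. The snippet below is only a minimal sketch on a toy frame: the helper name `add_cyclical` and the sample values are made up for illustration, and it assumes day runs from 1 to 7 and month from 1 to 12.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names mirror the dataset, the values are made up.
df = pd.DataFrame({"day": [1, 4, 7], "month": [1, 6, 12]})

def add_cyclical(frame, col, period):
    """Add sine/cosine columns for a 1-based cyclical feature (hypothetical helper)."""
    angle = 2 * np.pi * (frame[col] - 1) / period
    frame[col + "_sin"] = np.sin(angle)
    frame[col + "_cos"] = np.cos(angle)
    return frame

df = add_cyclical(df, "day", 7)
df = add_cyclical(df, "month", 12)
print(df.round(3))
```

With this transform, month 12 ends up close to month 1 on the unit circle, which a plain integer encoding cannot express.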

We will predict the probability (a value in [0, 1]) of the binary target column.


## Nominal vs. ordinal data

Nominal data has no inherent order among its values. For example, the results of a test could each be classified nominally as "pass" or "fail".

Ordinal data, on the other hand, is grouped according to some ranking system, that is, its values are ordered. For example, test results could be grouped in descending order by grade: A, B, C, D, E and F.
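
The distinction can be sketched with pandas' `Categorical`, which has an `ordered` flag. The sample values below are made up, and the grade order (F lowest, A highest) is an assumption made for the example.

```python
import pandas as pd

# Nominal: categories without any order.
result = pd.Categorical(["pass", "fail", "pass"],
                        categories=["fail", "pass"], ordered=False)

# Ordinal: grades listed from lowest (F) to highest (A).
grades = pd.Categorical(["B", "F", "A"],
                        categories=["F", "E", "D", "C", "B", "A"], ordered=True)

print(result.ordered)              # False
print(grades.ordered)              # True
print(grades.min(), grades.max())  # F A
print(list(grades.codes))          # [4, 0, 5]
```

Only the ordered categorical supports comparisons such as `min()` and `max()`; for the nominal one, the integer codes carry no meaning beyond identity.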

## Example of nominal data

![](https://miro.medium.com/max/1185/0*iKsDex5fUBQoYTju.png)

## Example of ordinal data

![](https://miro.medium.com/max/1038/1*ychLO4DAe5cvD1UwUuvjZw.png)

## Let's discuss the two most popular techniques for encoding categorical features:

* Label Encoding

Label encoding assigns each unique value a different integer, which makes it a good fit for ordinal features.

This approach assumes an ordering of the categories. For example, frequency answers could be ordered as "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3).

For tree-based models (like decision trees and random forests), you can expect label encoding to work well with ordinal variables.
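
As a minimal sketch of the idea (the series name and values are made up), an explicit mapping keeps the intended ranking. By contrast, sklearn's `LabelEncoder` assigns integers in sorted (alphabetical) order, which generally does not match the ranking.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical survey answers; the column name is illustrative.
answers = pd.Series(["Rarely", "Every day", "Never", "Most days"], name="frequency")

# Explicit ordering: the integer codes follow the intended ranking.
order = ["Never", "Rarely", "Most days", "Every day"]
mapping = {label: code for code, label in enumerate(order)}
encoded = answers.map(mapping)
print(encoded.tolist())     # [1, 3, 0, 2]

# LabelEncoder sorts the labels alphabetically, so the ranking is lost.
alphabetical = LabelEncoder().fit_transform(answers)
print(alphabetical.tolist())  # [3, 0, 2, 1]
```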

* One-Hot Encoding

One-hot encoding creates new columns indicating the presence (or absence) of each possible value in the original data.

In contrast to label encoding, one-hot encoding does not assume an ordering of the categories. Thus, you can expect this approach to work particularly well if there is no clear ordering in the categorical data (e.g., "Red" is neither more nor less than "Yellow").

One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (i.e., you generally won't use it for variables taking more than 15 different values).

![](https://chrisalbon.com/images/machine_learning_flashcards/One-Hot_Encoding_print.png)
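
A minimal sketch of one-hot encoding on a toy column (the values are made up) using pandas' `get_dummies`:

```python
import pandas as pd

# Toy nominal column; there is no meaningful order between the colors.
colors = pd.DataFrame({"color": ["Red", "Yellow", "Green", "Red"]})

# One new 0/1 column per distinct value.
one_hot = pd.get_dummies(colors, columns=["color"], dtype=int)
print(one_hot)
```

Each row has exactly one 1, and the number of new columns equals the number of distinct values, which is why this approach becomes unwieldy for high-cardinality features.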

## Yet another efficient way of encoding ordinal features: sklearn's OrdinalEncoder
* OrdinalEncoder belongs to the same family of sklearn encoders as LabelEncoder and OneHotEncoder.
* It is a fairly straightforward way to encode ordinal features.
* This technique is applied to the ordinal feature ord_5 later in this kernel.
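
As a small sketch, OrdinalEncoder can also be given an explicit category order through its `categories` parameter. The toy frame below is made up; the order reuses the temperature labels that appear in ord_2.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column with made-up rows.
temps = pd.DataFrame({"ord_2": ["Hot", "Freezing", "Lava Hot", "Cold"]})
order = ["Freezing", "Cold", "Warm", "Hot", "Boiling Hot", "Lava Hot"]

# One list of categories per input column; codes follow the list positions.
encoder = OrdinalEncoder(categories=[order])
codes = encoder.fit_transform(temps[["ord_2"]]).ravel()
print(codes)  # [3. 0. 5. 1.]
```

Note that the codes start at 0 and are returned as floats.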

## Let's dive into the competition dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load data
train = pd.read_csv('../input/cat-in-the-dat/train.csv')
test = pd.read_csv('../input/cat-in-the-dat/test.csv')
sub = pd.read_csv('../input/cat-in-the-dat/sample_submission.csv')
print(train.shape)
print(test.shape)

In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
train.head()

# Let's take a step-by-step approach for each feature variable
## 1. Binary variables (bin_0 through bin_4)

In [None]:
bin_cols = ['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4']
# Draw one count plot per binary column, split by target
for n, col in enumerate(bin_cols):
    plt.figure(n)
    sns.countplot(x=col, data=train, hue='target', palette='husl')

Converting bin_3 from True/False (T/F) and bin_4 from Yes/No (Y/N) to binary (1/0):

In [None]:
# Map the string flags straight to integers; the same mapping is applied to train and test
train['bin_3'] = train['bin_3'].map({'F': 0, 'T': 1}).astype(int)
train['bin_4'] = train['bin_4'].map({'N': 0, 'Y': 1}).astype(int)
test['bin_3'] = test['bin_3'].map({'F': 0, 'T': 1}).astype(int)
test['bin_4'] = test['bin_4'].map({'N': 0, 'Y': 1}).astype(int)

Checking the dataframe after transformation:

In [None]:
train.head(3)

Let's also drop the ID column, which carries no information for us, and separate the target column.

In [None]:
# Drop ID and separate target variable
target = train['target']
train_id = train['id']
test_id = test['id']
train.drop(['target', 'id'], axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

print(train.shape)
print(test.shape)

# Transformation of ordinal features (ord_0 through ord_5)

### Finding the number of unique values for each variable and taking a glimpse of the values

In [None]:
ord_cols = ['ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5']

for i in ord_cols:
    print("The number of unique values in {} column is : {}".format(i, train[i].nunique()))
    print("The most frequent values in {} column are : \n {}".format(i, train[i].value_counts()[:7]))
    print('\n')

## Logical ordering for ordinal features seems to be as follows:

ord_0 : [1, 2, 3]

ord_1 : ['Novice', 'Contributor','Expert', 'Master', 'Grandmaster']

ord_2 : ['Freezing', 'Cold', 'Warm', 'Hot', 'Boiling Hot', 'Lava Hot']

ord_3 : ['a', 'b', 'c', 'd', 'e', 'f', 'g','h', 'i', 'j', 'k', 'l', 'm', 'n', 'o']

ord_4 : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I','J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', ... 'Z']

Let's encode ord_1 to ord_4 with explicit mappings, since the number of unique values in each is small.

In [None]:
# credits for mapper code : https://www.kaggle.com/gogo827jz/catboost-baseline-with-feature-importance

mapper_ord_1 = {'Novice': 1, 'Contributor': 2, 'Expert': 3, 'Master': 4, 'Grandmaster': 5}

mapper_ord_2 = {'Freezing': 1, 'Cold': 2, 'Warm': 3, 'Hot': 4,'Boiling Hot': 5, 'Lava Hot': 6}

mapper_ord_3 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 
                'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15}

mapper_ord_4 = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'H': 8, 
                'I': 9, 'J': 10, 'K': 11, 'L': 12, 'M': 13, 'N': 14, 'O': 15,
                'P': 16, 'Q': 17, 'R': 18, 'S': 19, 'T': 20, 'U': 21, 'V': 22, 
                'W': 23, 'X': 24, 'Y': 25, 'Z': 26}

for col, mapper in zip(['ord_1', 'ord_2', 'ord_3', 'ord_4'], [mapper_ord_1, mapper_ord_2, mapper_ord_3, mapper_ord_4]):
    train[col+'_oe'] = train[col].replace(mapper)
    test[col+'_oe'] = test[col].replace(mapper)
    train.drop(col, axis=1, inplace=True)
    test.drop(col, axis=1, inplace=True)

For ord_5, we have high cardinality. Let's apply OrdinalEncoder with categories='auto' to it.
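
With categories='auto', the encoder derives the categories from the data it is fitted on and numbers them in sorted order. The sketch below uses made-up two-letter values in the style of ord_5. The `handle_unknown` and `unknown_value` arguments are not used in this kernel; they are shown only as one option (available in scikit-learn 0.24 and later) for values that never appear in the training data.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up values; "xx" never appears in the training values.
train_vals = np.array([["od"], ["Zq"], ["bF"], ["od"]])
test_vals = np.array([["bF"], ["xx"]])

enc = OrdinalEncoder(categories="auto",
                     handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train_vals)

print(enc.categories_[0])                # sorted order: uppercase before lowercase
print(enc.transform(test_vals).ravel())  # unseen value is encoded as -1
```

The sorted order is lexicographic, so it is worth checking whether it matches the intended ranking of the feature.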

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Fit on the training column only, then reuse the fitted encoder for the test column
encoder = OrdinalEncoder(categories='auto')
encoder.fit(train[['ord_5']])
train['ord_5'] = encoder.transform(train[['ord_5']]).ravel()
test['ord_5'] = encoder.transform(test[['ord_5']]).ravel()

Let's take a look at ord_5 after transformation

In [None]:
train.ord_5[:5]

Looks Good!!

### Now let's check the dtypes of the ordinal variables after the transformations

In [None]:
train[['ord_1_oe','ord_2_oe','ord_3_oe','ord_4_oe','ord_5','ord_0']].info()

# Now let's get on with the nominal features (nom_0 through nom_9)

In [None]:
nom_cols = ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9']

for i in nom_cols:
    print("The number of unique values in {} column is : {}".format(i, train[i].nunique()) )
        

### For nominal features, we will perform two types of encoding (viz., label encoding and one-hot encoding).

### Moreover, we will compare the two encodings by fitting a couple of models (LightGBM and logistic regression).


In [None]:
%%time
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on train only and reuse it for test, so both matrices share the same columns.
# handle_unknown='ignore' encodes categories unseen during fit as all zeros instead of raising an error.
one = OneHotEncoder(handle_unknown='ignore')
train_ohe1 = one.fit_transform(train)
test_ohe1 = one.transform(test)

print(train_ohe1.shape)
print(train_ohe1.dtype)
print(test_ohe1.shape)
print(test_ohe1.dtype)

In [None]:
# %%time
# nom_col = ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8','nom_9']

# traintest = pd.concat([train, test])
# traintest_ohe = pd.get_dummies(traintest, columns=nom_col, drop_first=True, sparse=True)
# train_ohe = traintest_ohe.iloc[:train.shape[0], :]
# test_ohe = traintest_ohe.iloc[train.shape[0]:, :]

# print(train_ohe.shape)
# print(test_ohe.shape)

### Reason for commenting out the code above:

* I tried one-hot encoding with both pandas' get_dummies() and sklearn's OneHotEncoder.
* The wall time for sklearn's OneHotEncoder was remarkably lower, at a couple of seconds, compared to pandas' get_dummies(), which took north of 4 minutes.
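
For reference, here is a minimal sketch of the usual fit-on-train, transform-on-test pattern with sklearn's OneHotEncoder. The toy frames and their values are made up; `handle_unknown="ignore"` is an assumption about how unseen categories should be treated.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames; "Purple" appears only in the test data.
toy_train = pd.DataFrame({"nom_0": ["Red", "Green", "Blue"]})
toy_test = pd.DataFrame({"nom_0": ["Green", "Purple"]})

ohe = OneHotEncoder(handle_unknown="ignore")
train_mat = ohe.fit_transform(toy_train)  # learns the categories from train
test_mat = ohe.transform(toy_test)        # reuses them, so the columns line up

print(train_mat.shape, test_mat.shape)    # (3, 3) (2, 3)
print(test_mat.toarray())                 # the unseen "Purple" row is all zeros
```

Both results are SciPy sparse matrices, which is a large part of why this route is cheap in time and memory on high-cardinality columns.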

## Applying a logistic model on OneHotEncoded dataset and checking the performance

In [None]:
def logistic(X, y):
    # Hold out 20% of the rows for validation
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_pre = lr.predict(X_test)
    print('Accuracy : ', accuracy_score(y_test, y_pre))

In [None]:
logistic(train_ohe1,target)

## Time to replace OneHotEncoder with LabelEncoder and check the performance

In [None]:
%%time
nom_col = ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8','nom_9']
from sklearn import model_selection, preprocessing, metrics
le = preprocessing.LabelEncoder()
traintest = pd.concat([train, test])

for col in nom_col:
    traintest[col] = le.fit_transform(traintest[col])

train_le = traintest.iloc[:train.shape[0], :]
test_le = traintest.iloc[train.shape[0]:, :]

print(train_le.shape)
print(test_le.shape)

In [None]:
train_le.head()

In [None]:
logistic(train_le,target)

As far as the accuracy of the logistic model is concerned, OHE has slightly edged out LE.
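
Since the task is to predict a probability, it can also be useful to score the predicted probabilities with AUC rather than only the hard labels with accuracy. The snippet below is a sketch on synthetic data from `make_classification`, not on the competition data, and `max_iter=1000` is an arbitrary choice for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced binary problem (stand-in for the real data).
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.7, 0.3], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Accuracy scores the hard 0/1 predictions; AUC scores the ranking of probabilities.
proba = model.predict_proba(X_va)[:, 1]
acc = accuracy_score(y_va, model.predict(X_va))
auc = roc_auc_score(y_va, proba)
print(f"accuracy={acc:.3f}  auc={auc:.3f}")
```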

## Now let's fit an LGBM model on both the LabelEncoded and the OneHotEncoded datasets and check their respective performance

In [None]:
# LGBM on LabelEncoded data
import lightgbm as lgb
num_round = 50000

param = {'num_leaves': 64,
         'min_data_in_leaf': 30, 
         'objective':'binary',
         'max_depth': 5,
         'learning_rate': 0.001,
         "min_child_samples": 20,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 44,
         "metric": 'auc',
         "verbosity": -1}

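X_train, X_test, y_train, y_test = train_test_split(train_le, target, random_state=42, test_size=0.2)

# Use separate names so the train/test DataFrames are not overwritten
lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_valid = lgb.Dataset(X_test, label=y_test)

# LightGBM 4.0 removed the verbose_eval and early_stopping_rounds arguments of lgb.train;
# the equivalent callbacks are used here instead
clf = lgb.train(param, lgb_train, num_round, valid_sets=[lgb_train, lgb_valid],
                callbacks=[lgb.early_stopping(stopping_rounds=500), lgb.log_evaluation(period=50)])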


Expand the output of the above cell to view the model's training and validation AUC scores.

### The validation set's AUC using LabelEncoder with LGBM comes to approximately 0.778382

## Applying LGBM on OneHotEncoded data and checking its performance

In [None]:
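X_train_ohe, X_test_ohe, y_train_ohe, y_test_ohe = train_test_split(train_ohe1, target, random_state=42, test_size=0.2)

# Use separate names so the train/test DataFrames are not overwritten
lgb_train_ohe = lgb.Dataset(X_train_ohe, label=y_train_ohe)
lgb_valid_ohe = lgb.Dataset(X_test_ohe, label=y_test_ohe)

# LightGBM 4.0 removed the verbose_eval and early_stopping_rounds arguments of lgb.train;
# the equivalent callbacks are used here instead
clf_ohe = lgb.train(param, lgb_train_ohe, num_round, valid_sets=[lgb_train_ohe, lgb_valid_ohe],
                    callbacks=[lgb.early_stopping(stopping_rounds=500), lgb.log_evaluation(period=50)])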


Expand the output of the above cell to view the model's training and validation AUC scores.

### Submitting LGBM + LabelEncoder results!!

In [None]:
y_preds = clf.predict(test_le)

In [None]:
sub['target'] = y_preds
sub.to_csv('lgb_model.csv', index=False)

**I will try some more permutations and combinations of encoding techniques and model architectures in further versions.**
