# **Handling Categorical Variables**

Handling categorical (qualitative) variables is an important step in data preprocessing. Many machine learning algorithms cannot handle categorical variables directly, so we must first convert them to numerical values.<br>
The performance of an ML algorithm depends on how the categorical variables are encoded: the same model can produce different results with different encoding techniques.

Categorical variables can be divided into two categories:<br>
1. Nominal (no particular order)
2. Ordinal (ordered)

<img src="Screenshots/Categorical_variables.png">

<div class="alert alert-block alert-warning">  
<b>You can use the below cheat-sheet as a guiding tool. </b> 
</div>

<img src="Screenshots/Categorical_Encoding.png">

There are many ways we can encode these categorical variables.


1. One Hot Encoding
2. Label Encoding
3. Ordinal Encoding
4. Frequency or Count Encoding
5. Binary Encoding
6. Base-N Encoding
7. Helmert Encoding 
8. Mean Encoding or Target Encoding
9. Weight of Evidence Encoding
10. Sum Encoder (Deviation Encoding or Effect Encoding)
11. Leave One Out Encoding 
12. CatBoost Encoding
13. James-Stein Encoding
14. M-estimator Encoding
15. Hashing Encoding
16. Backward Difference Encoding
17. Polynomial Encoding
18. MultiLabelBinarizer

In [None]:
# We will use the following libraries to perform the encoding.
!pip install scikit-learn
!pip install category-encoders

In [1]:
import pandas as pd
import numpy as np
import category_encoders as ce

In [80]:
data = {'Temperature':['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
        'Color':['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
        'Target':[1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(data)
df

Unnamed: 0,Temperature,Color,Target
0,Hot,Red,1
1,Cold,Yellow,1
2,Very Hot,Blue,1
3,Warm,Blue,0
4,Hot,Red,1
5,Warm,Yellow,0
6,Warm,Red,1
7,Hot,Yellow,0
8,Hot,Yellow,1
9,Cold,Yellow,1


# 1. One Hot Encoding

This technique creates a new column (feature) for each category in the categorical variable and fills it with 1 (the category is present) or 0 (the category is absent). The number of new columns equals the number of categories in the variable. This method can slow down learning significantly when the number of categories is very high.

In [3]:
# Using get_dummies method in pandas
df_ohe = df.copy()
one_hot_1 = pd.get_dummies(df_ohe,prefix = 'Temp' ,columns=['Temperature'],drop_first=False)
one_hot_1.insert(loc=2, column='Temperature', value=df.Temperature.values)
one_hot_1

Unnamed: 0,Color,Target,Temperature,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,Red,1,Hot,0,1,0,0
1,Yellow,1,Cold,1,0,0,0
2,Blue,1,Very Hot,0,0,1,0
3,Blue,0,Warm,0,0,0,1
4,Red,1,Hot,0,1,0,0
5,Yellow,0,Warm,0,0,0,1
6,Red,1,Warm,0,0,0,1
7,Yellow,0,Hot,0,1,0,0
8,Yellow,1,Hot,0,1,0,0
9,Yellow,1,Cold,1,0,0,0


In [4]:
# Using OneHotEncoder in sklearn
from sklearn.preprocessing import OneHotEncoder
# ohe = OneHotEncoder(drop='first')
ohe = OneHotEncoder()
oh_array = ohe.fit_transform(df['Temperature'].values.reshape(-1, 1)).toarray()
oh_df = pd.DataFrame(oh_array,columns=['Temp_Cold','Temp_Hot','Temp_Very_Hot','Temp_Warm'])
pd.concat([df,oh_df],axis=1)

Unnamed: 0,Temperature,Color,Target,Temp_Cold,Temp_Hot,Temp_Very_Hot,Temp_Warm
0,Hot,Red,1,0.0,1.0,0.0,0.0
1,Cold,Yellow,1,1.0,0.0,0.0,0.0
2,Very Hot,Blue,1,0.0,0.0,1.0,0.0
3,Warm,Blue,0,0.0,0.0,0.0,1.0
4,Hot,Red,1,0.0,1.0,0.0,0.0
5,Warm,Yellow,0,0.0,0.0,0.0,1.0
6,Warm,Red,1,0.0,0.0,0.0,1.0
7,Hot,Yellow,0,0.0,1.0,0.0,0.0
8,Hot,Yellow,1,0.0,1.0,0.0,0.0
9,Cold,Yellow,1,1.0,0.0,0.0,0.0


In [5]:
# Using category_encoders OneHotEncoder
import category_encoders as ce
ohe = ce.OneHotEncoder(cols=['Temperature'])
ce_ohe = ohe.fit_transform(df.iloc[:,0], df.iloc[:,-1])
ce_ohe.columns = ['Temp_Hot','Temp_Cold','Temp_Very_Hot','Temp_Warm']
pd.concat([df,ce_ohe],axis=1)

Unnamed: 0,Temperature,Color,Target,Temp_Hot,Temp_Cold,Temp_Very_Hot,Temp_Warm
0,Hot,Red,1,1,0,0,0
1,Cold,Yellow,1,0,1,0,0
2,Very Hot,Blue,1,0,0,1,0
3,Warm,Blue,0,0,0,0,1
4,Hot,Red,1,1,0,0,0
5,Warm,Yellow,0,0,0,0,1
6,Warm,Red,1,0,0,0,1
7,Hot,Yellow,0,1,0,0,0
8,Hot,Yellow,1,1,0,0,0
9,Cold,Yellow,1,0,1,0,0


1. For regression, we can use N-1 columns (drop the first or last one-hot column), since the dropped column is implied by the others.
2. For classification, the recommendation is to keep all N columns, as most tree-based algorithms build a tree based on all available variables.
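As a minimal sketch of the N-1 option above, `pd.get_dummies` accepts `drop_first=True`, which drops the first category column (in sorted order). The small `temps` series is a hypothetical stand-in for the `Temperature` column.

```python
import pandas as pd

# Hypothetical sample mirroring the Temperature column used in this notebook
temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot'], name='Temperature')

# All N columns (one per category)
full = pd.get_dummies(temps, prefix='Temp')

# N-1 columns: the first category in sorted order ('Cold') is dropped
reduced = pd.get_dummies(temps, prefix='Temp', drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A row for the dropped category is then represented by zeros in all remaining columns.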

**Disadvantages:** 
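1. Tree-based algorithms tend to perform poorly on one-hot encoded data with many categories, because the resulting matrix is sparse and each dummy column carries little information for a split.
2. When the feature contains many unique values, an equal number of new features is created, which may result in overfitting.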

# 2. Label Encoding

1. In this encoding, each label/category is assigned a unique integer.<br>
2. One major issue with sklearn's `LabelEncoder` is that it assigns values based on the alphabetical order of the labels.<br>
Ex: Cold < Hot < Very Hot < Warm is encoded as 0 < 1 < 2 < 3

In [6]:
# Using sklearn LabelEncoder()
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_ohe['Temperature_encoded'] = le.fit_transform(df.Temperature)
df_ohe

Unnamed: 0,Temperature,Color,Target,Temperature_encoded
0,Hot,Red,1,1
1,Cold,Yellow,1,0
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,1
5,Warm,Yellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,1
8,Hot,Yellow,1,1
9,Cold,Yellow,1,0


In [7]:
# Using Pandas factorize()
fact = df.copy()
fact['Temperature_factor'] = pd.factorize(df.Temperature)[0]
fact

Unnamed: 0,Temperature,Color,Target,Temperature_factor
0,Hot,Red,1,0
1,Cold,Yellow,1,1
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,0
5,Warm,Yellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,0
8,Hot,Yellow,1,0
9,Cold,Yellow,1,1


**Disadvantages:** 
1. It can mislead the model by assigning values based on alphabetical order instead of the actual label order.

# 3. Ordinal Encoding

In [8]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe_val = oe.fit_transform(df['Temperature'].values.reshape(-1, 1))
pd.concat([df,pd.DataFrame(oe_val,columns=['Temperature_Oe'])],axis=1)

Unnamed: 0,Temperature,Color,Target,Temperature_Oe
0,Hot,Red,1,1.0
1,Cold,Yellow,1,0.0
2,Very Hot,Blue,1,2.0
3,Warm,Blue,0,3.0
4,Hot,Red,1,1.0
5,Warm,Yellow,0,3.0
6,Warm,Red,1,3.0
7,Hot,Yellow,0,1.0
8,Hot,Yellow,1,1.0
9,Cold,Yellow,1,0.0


In [9]:
# Using category_encoders OrdinalEncoder
import category_encoders as ce
ce_oe = ce.OrdinalEncoder(cols=['Temperature'])
df['Temp_ce_oe'] = ce_oe.fit_transform(df.iloc[:,0], df.iloc[:,-1])
df

Unnamed: 0,Temperature,Color,Target,Temp_ce_oe
0,Hot,Red,1,1
1,Cold,Yellow,1,2
2,Very Hot,Blue,1,3
3,Warm,Blue,0,4
4,Hot,Red,1,1
5,Warm,Yellow,0,4
6,Warm,Red,1,4
7,Hot,Yellow,0,1
8,Hot,Yellow,1,1
9,Cold,Yellow,1,2


In [12]:
# The most reliable way is to map the categories to their actual order
# Ex: Cold < Warm < Hot < Very Hot maps to 1 < 2 < 3 < 4
Temp_order = {'Cold' : 1 , 'Warm' : 2 , 'Hot' : 3 , 'Very Hot' : 4}
df['Temperature_Order'] = df.Temperature.map(Temp_order)
df

Unnamed: 0,Temperature,Color,Target,Temperature_Order
0,Hot,Red,1,3
1,Cold,Yellow,1,1
2,Very Hot,Blue,1,4
3,Warm,Blue,0,2
4,Hot,Red,1,3
5,Warm,Yellow,0,2
6,Warm,Red,1,2
7,Hot,Yellow,0,3
8,Hot,Yellow,1,3
9,Cold,Yellow,1,1
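An alternative to the manual dictionary is an ordered pandas categorical. The following is a sketch under the assumption that Cold < Warm < Hot < Very Hot is the intended order; the `temps` series is a hypothetical sample, and the codes start at 0 rather than 1.

```python
import pandas as pd

# Hypothetical sample of the Temperature column
temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm'])

# Explicit category order, lowest to highest
order = ['Cold', 'Warm', 'Hot', 'Very Hot']
ordered = pd.Categorical(temps, categories=order, ordered=True)

# Integer codes follow the declared order (0-based)
codes = pd.Series(ordered.codes, index=temps.index)
print(codes.tolist())
```

Values that are not listed in `order` receive the code -1, which makes unexpected categories easy to spot.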


# 4. Frequency or Count Encoder

In frequency encoding, each category in the feature is replaced with the frequency of that category.<br>
When the frequency of a category is related to the target variable, this helps the model assign weights in direct or inverse proportion, depending on the nature of the data.


<img src="Screenshots/frequency_encoding.png">

Category refers to each of the unique values in a feature.
1. **Frequency(category)** = Number of values in that category
2. **Size(data)** = Size of the entire dataset.

In [14]:
# Using Pandas groupby()
cat_freq = df.groupby('Temperature').size() / len(df)
df['Temp_Freq_Enc'] = df.Temperature.map(cat_freq)
df

Unnamed: 0,Temperature,Color,Target,Temp_Freq_Enc
0,Hot,Red,1,0.4
1,Cold,Yellow,1,0.2
2,Very Hot,Blue,1,0.1
3,Warm,Blue,0,0.3
4,Hot,Red,1,0.4
5,Warm,Yellow,0,0.3
6,Warm,Red,1,0.3
7,Hot,Yellow,0,0.4
8,Hot,Yellow,1,0.4
9,Cold,Yellow,1,0.2


In [16]:
# Using category_encoders CountEncoder
import category_encoders as ce
# Use a separate name so the `ce` module alias is not overwritten
count_enc = ce.CountEncoder(cols=['Temperature'])
df['Temp_Count_Enc'] = count_enc.fit_transform(df.iloc[:,0], df.iloc[:,-1])
df

Unnamed: 0,Temperature,Color,Target,Temp_Count_Enc
0,Hot,Red,1,4
1,Cold,Yellow,1,2
2,Very Hot,Blue,1,1
3,Warm,Blue,0,3
4,Hot,Red,1,4
5,Warm,Yellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,4
8,Hot,Yellow,1,4
9,Cold,Yellow,1,2


**Disadvantage**:
1. If two categories have the same frequency then it is hard to distinguish between them.
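The collision described above can be seen in a small sketch. The `colors` series is a hypothetical example in which two categories occur equally often.

```python
import pandas as pd

# Hypothetical feature: 'Red' and 'Blue' both appear twice
colors = pd.Series(['Red', 'Red', 'Blue', 'Blue', 'Green'])

# Relative frequency of each category
freq = colors.value_counts(normalize=True)
encoded = colors.map(freq)

print(encoded.tolist())
print(colors.nunique(), 'categories ->', encoded.nunique(), 'distinct encoded values')
```

Three categories collapse into two encoded values, so the model can no longer tell 'Red' from 'Blue'.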

# 5. Binary Encoding

1. It is similar to one-hot encoding, but stores categories as binary bitstrings, i.e., each bit of the bitstring becomes one feature column.
2. Compared to one-hot encoding, this requires fewer feature columns (for 100 categories, one-hot encoding produces 100 features, while binary encoding needs just seven).<br>

**Feature -> ordinal encoding -> binary code -> digits of the binary code to separate columns.**

<img src="Screenshots/Binary_encoding.png">
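The pipeline above (feature, ordinal code, binary code, one column per digit) can be sketched by hand with pandas. Numbering the categories from 1 in order of first appearance is an assumption made here for illustration; `temps` is a hypothetical sample.

```python
import pandas as pd

# Hypothetical sample of the Temperature column
temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot'], name='Temperature')

# Step 1: ordinal codes starting at 1, in order of first appearance (assumption)
codes = pd.factorize(temps)[0] + 1

# Step 2: number of bits needed to represent the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant bit first
bits = pd.DataFrame(
    {f'Temperature_{i}': (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)},
    index=temps.index,
)
print(bits)
```

The same bit-length calculation gives seven columns for 100 categories, since `(100).bit_length()` is 7.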

In [76]:
import category_encoders as ce
be = ce.BinaryEncoder(cols=['Temperature'])
be_df = be.fit_transform(df['Temperature'])
pd.concat([df,be_df],axis=1)

Unnamed: 0,Temperature,Color,Target,Temperature_0,Temperature_1,Temperature_2
0,Hot,Red,1,0,0,1
1,Cold,Yellow,1,0,1,0
2,Very Hot,Blue,1,0,1,1
3,Warm,Blue,0,1,0,0
4,Hot,Red,1,0,0,1
5,Warm,Yellow,0,1,0,0
6,Warm,Red,1,1,0,0
7,Hot,Yellow,0,0,0,1
8,Hot,Yellow,1,0,0,1
9,Cold,Yellow,1,0,1,0


# 6. Base-N encoder

1. Base-N encoder encodes the categories into arrays of their base-N representation. 
2. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), and a base of 2 is equivalent to binary encoding. A base equal to the number of categories is equivalent to plain ordinal encoding.
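To illustrate the base-N representation, here is a small sketch that splits an ordinal code into its base-N digits. The helper `base_n_digits` is hypothetical and is not part of category_encoders; it only covers bases of 2 or more.

```python
def base_n_digits(code, base, width):
    """Return `code` as a list of `width` base-`base` digits, most significant first."""
    if base < 2:
        raise ValueError("this sketch only handles base >= 2")
    digits = []
    for _ in range(width):
        digits.append(code % base)
        code //= base
    return digits[::-1]

print(base_n_digits(4, base=2, width=3))  # binary digits of the code 4
print(base_n_digits(4, base=3, width=2))  # base-3 digits of the code 4
print(base_n_digits(3, base=4, width=1))  # a single digit: plain ordinal code
```

With four categories coded 0 to 3 and a base of 4, every code fits in one digit, which is why that setting behaves like ordinal encoding.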

In [74]:
bne = ce.BaseNEncoder(cols=['Temperature'],base=2)
bne_df = bne.fit_transform(df['Temperature'],df.Target)
pd.concat([df,bne_df],axis=1)

Unnamed: 0,Temperature,Color,Target,Temperature_0,Temperature_1,Temperature_2
0,Hot,Red,1,0,0,1
1,Cold,Yellow,1,0,1,0
2,Very Hot,Blue,1,0,1,1
3,Warm,Blue,0,1,0,0
4,Hot,Red,1,0,0,1
5,Warm,Yellow,0,1,0,0
6,Warm,Red,1,1,0,0
7,Hot,Yellow,0,0,0,1
8,Hot,Yellow,1,0,0,1
9,Cold,Yellow,1,0,1,0


# 7. Helmert Encoding

1. Helmert coding is a third commonly used type of categorical encoding for regression along with OHE and Sum Encoding.
2. Forward Helmert coding compares each level of a categorical variable to the mean of the subsequent levels.
3. The version in category_encoders is sometimes referred to as **Reverse Helmert Coding**, because it compares each level to the mean of the previous levels instead.
4. It is useful in certain situations where levels of the categorical variable are ordered, say, from lowest to highest, or from smallest to largest.

In [20]:
import category_encoders as ce
he = ce.HelmertEncoder(cols=['Temperature'])
he_df = he.fit_transform(df['Temperature'])
pd.concat([df,he_df],axis=1)

Unnamed: 0,Temperature,Color,Target,intercept,Temperature_0,Temperature_1,Temperature_2
0,Hot,Red,1,1,-1.0,-1.0,-1.0
1,Cold,Yellow,1,1,1.0,-1.0,-1.0
2,Very Hot,Blue,1,1,0.0,2.0,-1.0
3,Warm,Blue,0,1,0.0,0.0,3.0
4,Hot,Red,1,1,-1.0,-1.0,-1.0
5,Warm,Yellow,0,1,0.0,0.0,3.0
6,Warm,Red,1,1,0.0,0.0,3.0
7,Hot,Yellow,0,1,-1.0,-1.0,-1.0
8,Hot,Yellow,1,1,-1.0,-1.0,-1.0
9,Cold,Yellow,1,1,1.0,-1.0,-1.0


# 8. Mean Encoding or Target Encoding

1. It became one of the most popular encoding types largely through Kaggle competitions.
2. It uses information about the target to encode categories, which makes it extremely powerful.
3. In target encoding, labels are correlated directly with the target, i.e., each category in the feature is replaced with the mean value of the target variable for that category on the training data.<br>
<img src="Screenshots/Target_Encoding.png"><br>
Here, mdl — min data (samples) in leaf,<br> 
a — smoothing parameter, representing the power of regularization.<br>
<img src="Screenshots/Mean_Encoding.png">
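The smoothing idea can be sketched in pandas. This is an illustration under stated assumptions, not the library's implementation: it assumes a sigmoid weight built from `mdl` and `a`, both set to 1, and it assumes that a category seen only once falls back to the global mean (the prior). With these choices it reproduces the values in the output table of the next cell; other library versions or settings may give different numbers.

```python
import numpy as np
import pandas as pd

df_te = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

mdl, a = 1, 1.0  # assumed values: min samples in leaf and smoothing
prior = df_te['Target'].mean()
stats = df_te.groupby('Temperature')['Target'].agg(['mean', 'count'])

# Sigmoid weight: close to 1 for frequent categories, smaller for rare ones
weight = 1.0 / (1.0 + np.exp(-(stats['count'] - mdl) / a))
smoothed = prior * (1.0 - weight) + stats['mean'] * weight

# Assumption: a category seen only once falls back to the prior
smoothed[stats['count'] == 1] = prior

encoded = df_te['Temperature'].map(smoothed)
print(smoothed.round(6))
```

Rare categories are pulled toward the global mean, which is what limits overfitting to a handful of rows.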

In [32]:
# Using category_encoders TargetEncoder
import category_encoders as ce
te = ce.TargetEncoder(cols=['Temperature'])
df['Temperature_ce_TarEnc'] = te.fit_transform(df['Temperature'],df.Target)
df

Unnamed: 0,Temperature,Color,Target,Temperature_ce_TarEnc
0,Hot,Red,1,0.747629
1,Cold,Yellow,1,0.919318
2,Very Hot,Blue,1,0.7
3,Warm,Blue,0,0.377041
4,Hot,Red,1,0.747629
5,Warm,Yellow,0,0.377041
6,Warm,Red,1,0.377041
7,Hot,Yellow,0,0.747629
8,Hot,Yellow,1,0.747629
9,Cold,Yellow,1,0.919318


In [37]:
# Using Pandas groupby()
tar_enc = df.groupby('Temperature')['Target'].mean()
# print(tar_enc)
df['Temperature_tar_enc'] = df['Temperature'].map(tar_enc)
df

Unnamed: 0,Temperature,Color,Target,Temperature_tar_enc
0,Hot,Red,1,0.75
1,Cold,Yellow,1,1.0
2,Very Hot,Blue,1,1.0
3,Warm,Blue,0,0.333333
4,Hot,Red,1,0.75
5,Warm,Yellow,0,0.333333
6,Warm,Red,1,0.333333
7,Hot,Yellow,0,0.75
8,Hot,Yellow,1,0.75
9,Cold,Yellow,1,1.0


Advantage :
1. It does not increase the number of columns in the data, which helps in faster learning.

Disadvantage :
1. Target leakage: it uses information about the target. Because of target leakage, the model overfits the training data, which results in unreliable validation and lower test scores.

**To reduce the effect of target leakage:**
1. Increase regularization.
2. Add random noise to the encoded value of the category in the training dataset (a form of augmentation).
3. Use double validation (encode using another validation split).


# 9. Weight of Evidence Encoding

1. This method was developed primarily to build a predictive model to evaluate the risk of loan default in the credit and financial industry.
2. It is a measure of the “strength” of a grouping for separating good and bad risk (default).
3. Weight of evidence (WOE) is a measure of how much the evidence supports or undermines a hypothesis.<br>
<img src="Screenshots/WoE.PNG">
Distr Goods -> Distribution of Good Credit Outcomes<br>
Distr Bads -> Distribution of Bad Credit Outcomes<br>
However, the above formula might lead to target leakage and overfitting.<br>
To avoid that, a regularization parameter a is introduced and WoE is calculated in the following way:<br>
<img src="Screenshots/WoE1.png">
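A regularized WoE can be sketched with pandas as follows. This is an illustration under stated assumptions: `Target == 1` is treated as the "good" outcome, `a` is set to 1, it is added to each count with `2 * a` added to each total, and a category seen only once is given a WoE of 0. With these choices the numbers match the output table of the next cell; the library's actual behavior may differ between versions.

```python
import numpy as np
import pandas as pd

df_woe = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

a = 1.0  # assumed regularization value
total_good = df_woe['Target'].sum()
total_bad = len(df_woe) - total_good
stats = df_woe.groupby('Temperature')['Target'].agg(['sum', 'count'])

# Regularized share of goods and bads that fall into each category
distr_good = (stats['sum'] + a) / (total_good + 2 * a)
distr_bad = (stats['count'] - stats['sum'] + a) / (total_bad + 2 * a)
woe = np.log(distr_good / distr_bad)

# Assumption: categories seen only once are treated as uninformative
woe[stats['count'] == 1] = 0.0

print(woe.round(6))
```

Positive values mark categories where good outcomes are over-represented, negative values the opposite.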

In [40]:
woe = ce.WOEEncoder(cols=['Temperature'])
df['Temperature_ce_WOE'] = woe.fit_transform(df['Temperature'],df.Target)
df

Unnamed: 0,Temperature,Color,Target,Temperature_ce_WOE
0,Hot,Red,1,0.105361
1,Cold,Yellow,1,0.510826
2,Very Hot,Blue,1,0.0
3,Warm,Blue,0,-0.993252
4,Hot,Red,1,0.105361
5,Warm,Yellow,0,-0.993252
6,Warm,Red,1,-0.993252
7,Hot,Yellow,0,0.105361
8,Hot,Yellow,1,0.105361
9,Cold,Yellow,1,0.510826


# 10. Sum Encoder (Deviation Encoding or Effect Encoding)

1. It compares the mean of the dependent variable (target) for a given level of a categorical column to the overall mean of the target.
2. Sum Encoding is very similar to OHE, and both are commonly used in Linear Regression (LR) types of models.
3. However, the difference between them is the interpretation of the LR coefficients: in an OHE model the intercept represents the mean for the baseline condition and the coefficients represent simple effects (the difference between one particular condition and the baseline), whereas in a Sum Encoder model the intercept represents the grand mean (across all conditions) and the coefficients can be interpreted directly as the main effects.
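The claim about the intercept can be checked numerically. The sketch below builds a sum-coded design matrix by hand (levels in order of first appearance, last level coded as -1 in every column, which is an assumption chosen to mirror the output table that follows) and fits ordinary least squares with NumPy.

```python
import numpy as np
import pandas as pd

df_sum = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

# Levels in order of first appearance; the last level is coded -1 everywhere
levels = list(pd.unique(df_sum['Temperature']))
k = len(levels)
contrast = np.vstack([np.eye(k - 1), -np.ones((1, k - 1))])

# Design matrix: intercept column followed by the sum-coded columns
codes = df_sum['Temperature'].map({lvl: i for i, lvl in enumerate(levels)}).to_numpy()
X = np.hstack([np.ones((len(df_sum), 1)), contrast[codes]])

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, df_sum['Target'].to_numpy(dtype=float), rcond=None)

level_means = df_sum.groupby('Temperature')['Target'].mean()
print('intercept:', coef[0])
print('mean of category means:', level_means.mean())
```

Note that the "grand mean" here is the unweighted mean of the per-category means, which differs from the overall target mean (0.7) when categories have different sizes.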

In [44]:
se = ce.SumEncoder(cols=['Temperature'])
se_df = se.fit_transform(df['Temperature'],df.Target)
pd.concat([df,se_df],axis=1)

Unnamed: 0,Temperature,Color,Target,intercept,Temperature_0,Temperature_1,Temperature_2
0,Hot,Red,1,1,1.0,0.0,0.0
1,Cold,Yellow,1,1,0.0,1.0,0.0
2,Very Hot,Blue,1,1,0.0,0.0,1.0
3,Warm,Blue,0,1,-1.0,-1.0,-1.0
4,Hot,Red,1,1,1.0,0.0,0.0
5,Warm,Yellow,0,1,-1.0,-1.0,-1.0
6,Warm,Red,1,1,-1.0,-1.0,-1.0
7,Hot,Yellow,0,1,1.0,0.0,0.0
8,Hot,Yellow,1,1,1.0,0.0,0.0
9,Cold,Yellow,1,1,0.0,1.0,0.0


# 11. Leave-one-out Encoder (LOO or LOOE)

1. It is another example of target-based encoders.
2. This encoder calculates the mean target of category k for observation j as if observation j were removed from the dataset:<br>
<img src="Screenshots/LOO.png"><br>
While encoding the test dataset, a category is replaced with the mean target of the category k in the train dataset:
<img src="Screenshots/LOO1.png"><br>
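The training-time calculation can be sketched with pandas group transforms. One assumption is made for illustration: a category with a single row has no other rows to average, so it falls back to the global target mean. With that choice the values agree with the output table of the next cell.

```python
import pandas as pd

df_loo = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

grp = df_loo.groupby('Temperature')['Target']
cat_sum = grp.transform('sum')      # target sum of the row's category
cat_count = grp.transform('count')  # number of rows in the row's category
prior = df_loo['Target'].mean()

# Mean target of the other rows in the same category
loo = (cat_sum - df_loo['Target']) / (cat_count - 1)

# Assumption: single-row categories fall back to the global mean
loo = loo.where(cat_count > 1, prior)

print(loo.round(6).tolist())
```

Rows of the same category can receive different values, because each row excludes its own target.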

In [52]:
loue = ce.LeaveOneOutEncoder(cols=['Temperature'])
df['Temperature_ce_LOO'] = loue.fit_transform(df['Temperature'],df.Target)
df

Unnamed: 0,Temperature,Color,Target,Temperature_ce_LOO
0,Hot,Red,1,0.666667
1,Cold,Yellow,1,1.0
2,Very Hot,Blue,1,0.7
3,Warm,Blue,0,0.5
4,Hot,Red,1,0.666667
5,Warm,Yellow,0,0.5
6,Warm,Red,1,0.0
7,Hot,Yellow,0,1.0
8,Hot,Yellow,1,0.666667
9,Cold,Yellow,1,1.0


Disadvantage :
1. As with all other target-based encoders, the problem with LOO is target leakage.

# 12. CatBoost Encoder

1. CatBoost is a recently created target-based categorical encoder.
2. It is intended to overcome the target leakage problems inherent in LOO.
3. To prevent overfitting, the process of target encoding for the training dataset is repeated several times on shuffled versions of the dataset and the results are averaged.
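The core idea, an ordered target statistic, can be sketched in pandas: each row is encoded using only the rows of the same category that come before it. This sketch does a single pass over the rows in their given order, without the shuffling and averaging mentioned above, and assumes a smoothing strength `a` of 1 with the global mean as prior. Under those assumptions it reproduces the output table of the next cell.

```python
import pandas as pd

df_cb = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

a = 1.0  # assumed smoothing strength
prior = df_cb['Target'].mean()
grp = df_cb.groupby('Temperature')['Target']

# Statistics over earlier rows of the same category only
prev_sum = grp.cumsum() - df_cb['Target']
prev_count = grp.cumcount()

ordered_te = (prev_sum + prior * a) / (prev_count + a)
print(ordered_te.round(6).tolist())
```

The first occurrence of every category gets the prior, since no earlier rows exist for it.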

In [50]:
cbe = ce.CatBoostEncoder(cols=['Temperature'])
df['Temperature_ce_CBE'] = cbe.fit_transform(df['Temperature'],df.Target)
df

Unnamed: 0,Temperature,Color,Target,Temperature_ce_CBE
0,Hot,Red,1,0.7
1,Cold,Yellow,1,0.7
2,Very Hot,Blue,1,0.7
3,Warm,Blue,0,0.7
4,Hot,Red,1,0.85
5,Warm,Yellow,0,0.35
6,Warm,Red,1,0.233333
7,Hot,Yellow,0,0.9
8,Hot,Yellow,1,0.675
9,Cold,Yellow,1,0.85


# 13. James-Stein Encoding

1. James-Stein Encoder is a target-based encoder.
2. The idea behind the James-Stein Encoder is simple. The mean target for category k can be estimated with the following formula:<br>
<img src="Screenshots/James_Stein_Encoder.png"><br>
One way to select B is to tune it like a hyperparameter via cross-validation, but Charles Stein came up with another solution to the problem:<br>
<img src="Screenshots/James_Stein_Encoder.png"><br>

In [54]:
jse = ce.JamesSteinEncoder(cols=['Temperature'])
df['Temperature_ce_JSE'] = jse.fit_transform(df['Temperature'],df.Target)
df

Unnamed: 0,Temperature,Color,Target,Temperature_ce_JSE
0,Hot,Red,1,0.741379
1,Cold,Yellow,1,1.0
2,Very Hot,Blue,1,1.0
3,Warm,Blue,0,0.405229
4,Hot,Red,1,0.741379
5,Warm,Yellow,0,0.405229
6,Warm,Red,1,0.405229
7,Hot,Yellow,0,0.741379
8,Hot,Yellow,1,0.741379
9,Cold,Yellow,1,1.0


Disadvantage :
1. It is defined only for normally distributed targets (which is often not the case in real-world data).
2. **To work around that**, we can either convert binary targets with a log-odds ratio, as is done in the WoE Encoder (this is the default because it is simple), or use a beta distribution.

# 14. M-estimator Encoding

1. M-Estimate Encoder is a simplified version of Target Encoder.
2. It has only one hyperparameter, m, which represents the power of regularization.
3. A higher value of m results in stronger shrinking.
4. Recommended values for m are in the range of 1 to 100.<br>
<img src="Screenshots/M-Estimate_Encoder.png"><br>
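The shrinkage can be sketched in a few lines of pandas, assuming the usual m-estimate form (category target sum plus `m` times the global mean, divided by category count plus `m`). With `m = 1` this reproduces the values in the output table of the next cell.

```python
import pandas as pd

df_m = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

m = 1.0  # regularization strength
prior = df_m['Target'].mean()
stats = df_m.groupby('Temperature')['Target'].agg(['sum', 'count'])

# Category mean shrunk toward the global mean by m pseudo-observations
m_estimate = (stats['sum'] + prior * m) / (stats['count'] + m)
print(m_estimate.round(6))
```

Raising `m` moves every category closer to the global mean of 0.7.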

In [56]:
mee = ce.MEstimateEncoder(cols=['Temperature'],m=1.0)
df['Temperature_ce_MEE'] = mee.fit_transform(df['Temperature'],df.Target)
df

Unnamed: 0,Temperature,Color,Target,Temperature_ce_MEE
0,Hot,Red,1,0.74
1,Cold,Yellow,1,0.9
2,Very Hot,Blue,1,0.85
3,Warm,Blue,0,0.425
4,Hot,Red,1,0.74
5,Warm,Yellow,0,0.425
6,Warm,Red,1,0.425
7,Hot,Yellow,0,0.74
8,Hot,Yellow,1,0.74
9,Cold,Yellow,1,0.9


# 15. Hashing Encoding

1. Hashing converts categorical variables to a higher-dimensional space of integers, where the distance between two vectors of categorical variables is approximately maintained in the transformed numerical space.
2. With hashing, the number of dimensions can be far less than the number of dimensions produced by an encoding like One Hot Encoding.
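The hashing trick itself can be sketched with the standard library. The helper `hash_encode` is hypothetical, and the choice of MD5 and of taking the digest modulo `n_components` is an assumption for illustration; the columns it selects are not claimed to match the library's output.

```python
import hashlib

def hash_encode(value, n_components=8):
    """Return a row of zeros with a single 1 in the column chosen by hashing `value`."""
    digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
    column = int(digest, 16) % n_components
    row = [0] * n_components
    row[column] = 1
    return row

for temp in ['Hot', 'Cold', 'Very Hot', 'Warm']:
    print(temp, hash_encode(temp))
```

The width of the output is fixed by `n_components` no matter how many categories exist, at the price of possible collisions where two categories share a column.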

In [66]:
hash_enc = ce.HashingEncoder(cols=['Temperature'],n_components=8)
hash_df = hash_enc.fit_transform(df['Temperature'],df.Target)
pd.concat([df,hash_df],axis=1)

Unnamed: 0,Temperature,Color,Target,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7
0,Hot,Red,1,1,0,0,0,0,0,0,0
1,Cold,Yellow,1,0,0,1,0,0,0,0,0
2,Very Hot,Blue,1,0,1,0,0,0,0,0,0
3,Warm,Blue,0,0,1,0,0,0,0,0,0
4,Hot,Red,1,1,0,0,0,0,0,0,0
5,Warm,Yellow,0,0,1,0,0,0,0,0,0
6,Warm,Red,1,0,1,0,0,0,0,0,0
7,Hot,Yellow,0,1,0,0,0,0,0,0,0
8,Hot,Yellow,1,1,0,0,0,0,0,0,0
9,Cold,Yellow,1,0,0,1,0,0,0,0,0


Advantage :
1. This method is advantageous when the cardinality of the categorical variable is very high, since the number of output columns is fixed by the **n_components** parameter.<br>

Disadvantage :
1. It is slow compared to other encoders.

# 16. Backward Difference Encoding

1. In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. 
2. This technique falls under the contrast coding system for categorical features. A feature of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables.
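The K-1 contrast columns can be written down directly. The function below is a sketch of the standard backward difference contrast matrix; the name `backward_difference_matrix` is hypothetical. For K = 4 its rows agree with the four distinct rows in the output table that follows.

```python
import numpy as np

def backward_difference_matrix(k):
    """Contrast matrix with k rows (levels) and k-1 columns (contrasts)."""
    matrix = np.empty((k, k - 1))
    for j in range(1, k):
        matrix[:j, j - 1] = -(k - j) / k  # levels 1..j
        matrix[j:, j - 1] = j / k         # levels j+1..k
    return matrix

print(backward_difference_matrix(4))
```

Each column sums to zero, and column j contrasts level j+1 with level j.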

In [70]:
bde = ce.BackwardDifferenceEncoder(cols=['Temperature'])
bde_df = bde.fit_transform(df['Temperature'],df.Target)
pd.concat([df,bde_df],axis=1)

Unnamed: 0,Temperature,Color,Target,intercept,Temperature_0,Temperature_1,Temperature_2
0,Hot,Red,1,1,-0.75,-0.5,-0.25
1,Cold,Yellow,1,1,0.25,-0.5,-0.25
2,Very Hot,Blue,1,1,0.25,0.5,-0.25
3,Warm,Blue,0,1,0.25,0.5,0.75
4,Hot,Red,1,1,-0.75,-0.5,-0.25
5,Warm,Yellow,0,1,0.25,0.5,0.75
6,Warm,Red,1,1,0.25,0.5,0.75
7,Hot,Yellow,0,1,-0.75,-0.5,-0.25
8,Hot,Yellow,1,1,-0.75,-0.5,-0.25
9,Cold,Yellow,1,1,0.25,-0.5,-0.25


# 17. Polynomial Encoding

In [81]:
pe = ce.PolynomialEncoder(cols=['Temperature'])
pe_df = pe.fit_transform(df['Temperature'],df.Target)
pd.concat([df,pe_df],axis=1)

Unnamed: 0,Temperature,Color,Target,intercept,Temperature_0,Temperature_1,Temperature_2
0,Hot,Red,1,1,-0.67082,0.5,-0.223607
1,Cold,Yellow,1,1,-0.223607,-0.5,0.67082
2,Very Hot,Blue,1,1,0.223607,-0.5,-0.67082
3,Warm,Blue,0,1,0.67082,0.5,0.223607
4,Hot,Red,1,1,-0.67082,0.5,-0.223607
5,Warm,Yellow,0,1,0.67082,0.5,0.223607
6,Warm,Red,1,1,0.67082,0.5,0.223607
7,Hot,Yellow,0,1,-0.67082,0.5,-0.223607
8,Hot,Yellow,1,1,-0.67082,0.5,-0.223607
9,Cold,Yellow,1,1,-0.223607,-0.5,0.67082


# 18. MultiLabelBinarizer

MultiLabelBinarizer is used when a column holds multiple labels per row.

In [12]:
data = {'Type':[['fruits','vegitables'],['animals','vegitables'],['animals','fruits'],['vehicals','fruits']]}
df = pd.DataFrame(data)
df

Unnamed: 0,Type
0,"[fruits, vegitables]"
1,"[animals, vegitables]"
2,"[animals, fruits]"
3,"[vehicals, fruits]"


In [13]:
# importing MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer

# instantiating MultiLabelBinarizer
mlb = MultiLabelBinarizer()
types_encoded = pd.DataFrame(mlb.fit_transform(df['Type']),columns=mlb.classes_)
types_encoded.head()

Unnamed: 0,animals,fruits,vegitables,vehicals
0,0,1,1,0
1,1,0,1,0
2,1,1,0,0
3,0,1,0,1


<div class="alert alert-block alert-danger"> 
<b>It is essential to understand that, for any machine learning model, no single encoding works well in every situation or for every dataset. <br>
Data scientists still need to experiment to find out which works best for their specific case.  </b>
</div>
