# Learning Objectives

By the end of this lesson, you should be able to:
- investigate different categorical variable encoding approaches
    - One Hot Encoding
    - Label Encoding
    - Ordinal Encoding
    - Binary Encoding
    - Frequency Encoding
    - Mean Encoding
    - Weight of Evidence Encoding
    - Probability Ratio Encoding

https://pypi.org/project/category-encoders/

https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = {'Temperature': ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'], 
        'Color': ['Red','Yellow','Blue','Blue','Red','Vellow','Red','Yellow','Yellow','Yellow'], 
        'Target': [1,1,1,0,1,0,1,0,1,1]} 

df = pd.DataFrame(data,columns = ['Temperature','Color','Target']) 

df

Unnamed: 0,Temperature,Color,Target
0,Hot,Red,1
1,Cold,Yellow,1
2,Very Hot,Blue,1
3,Warm,Blue,0
4,Hot,Red,1
5,Warm,Vellow,0
6,Warm,Red,1
7,Hot,Yellow,0
8,Hot,Yellow,1
9,Cold,Yellow,1


# One-Hot Encoding

- For **tree-based learning algorithms**, it is good practice to encode a categorical variable with N categories into N binary variables and not drop any of them.

- For **regression**, we use N-1 binary variables (drop the first or last one-hot column). Linear regression has access to all of the features during training and therefore examines the whole set of dummy variables together. This means that **N-1** binary variables already represent the original categorical variable completely: the dropped category is implied when all the others are 0.
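
A minimal sketch of the two conventions, assuming the same toy `Temperature` values used throughout this lesson. The names `temps`, `full` and `reduced` are illustrative only; `drop_first=True` is the `pd.get_dummies` option that yields N-1 columns.

```python
import pandas as pd

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                   'Warm', 'Warm', 'Hot', 'Hot', 'Cold'], name='Temperature')

# N columns: one indicator per category (kept in full for tree-based models)
full = pd.get_dummies(temps, prefix='Temp')

# N-1 columns: the first category in sorted order ('Cold') is dropped and
# becomes the implicit baseline for a linear model
reduced = pd.get_dummies(temps, prefix='Temp', drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A row of all zeros in `reduced` then stands for the dropped category.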



### Using Pandas get_dummies approach

In [3]:
df_oneHot = pd.get_dummies(df, prefix='Temp', columns=['Temperature'])
df_oneHot

Unnamed: 0,Color,Target,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,Red,1,0,1,0,0
1,Yellow,1,1,0,0,0
2,Blue,1,0,0,1,0
3,Blue,0,0,0,0,1
4,Red,1,0,1,0,0
5,Vellow,0,0,0,0,1
6,Red,1,0,0,0,1
7,Yellow,0,0,1,0,0
8,Yellow,1,0,1,0,0
9,Yellow,1,1,0,0,0


### Using sklearn approach

In [4]:
from sklearn.preprocessing import OneHotEncoder

ohc = OneHotEncoder()
ohe = ohc.fit_transform(df.Temperature.values.reshape(-1,1)).toarray()
# Name each column after its own category (index i rather than a fixed index)
dfOneHot = pd.DataFrame(ohe, columns = ["Temp_" + str(ohc.categories_[0][i]) for i in range(len(ohc.categories_[0]))])
dfOneHot = dfOneHot.astype(int)
dfh = pd.concat([df, dfOneHot], axis=1)

dfh

Unnamed: 0,Temperature,Color,Target,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,Hot,Red,1,0,1,0,0
1,Cold,Yellow,1,1,0,0,0
2,Very Hot,Blue,1,0,0,1,0
3,Warm,Blue,0,0,0,0,1
4,Hot,Red,1,0,1,0,0
5,Warm,Vellow,0,0,0,0,1
6,Warm,Red,1,0,0,0,1
7,Hot,Yellow,0,0,1,0,0
8,Hot,Yellow,1,0,1,0,0
9,Cold,Yellow,1,1,0,0,0


# Label Encoding

### Using sklearn LabelEncoder()

In [5]:
from sklearn.preprocessing import LabelEncoder

df_le = df.copy()
df_le['Temp_le'] = LabelEncoder().fit_transform(df.Temperature)
df_le

Unnamed: 0,Temperature,Color,Target,Temp_le
0,Hot,Red,1,1
1,Cold,Yellow,1,0
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,1
5,Warm,Vellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,1
8,Hot,Yellow,1,1
9,Cold,Yellow,1,0


### Using Pandas factorize method

In [6]:
df_fac = df.copy()
# factorize() already returns a 1-D array of codes, so no reshape is needed
df_fac['Temp_fac'] = pd.factorize(df['Temperature'])[0]
df_fac

Unnamed: 0,Temperature,Color,Target,Temp_fac
0,Hot,Red,1,0
1,Cold,Yellow,1,1
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,0
5,Warm,Vellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,0
8,Hot,Yellow,1,0
9,Cold,Yellow,1,1
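
The two label-encoding outputs above differ because `LabelEncoder` numbers categories in sorted order, whereas `pd.factorize` numbers them in order of first appearance. The following sketch, with illustrative variable names, shows how `sort=True` makes `factorize` reproduce the sorted numbering and how to keep the lookup table for later use.

```python
import pandas as pd

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                   'Warm', 'Warm', 'Hot', 'Hot', 'Cold'], name='Temperature')

# sort=True numbers the categories in sorted order instead of
# order of first appearance
codes, uniques = pd.factorize(temps, sort=True)

# Keep the category -> code lookup so the same codes can be reapplied later
mapping = {label: code for code, label in enumerate(uniques)}
print(mapping)
```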


# Ordinal Encoding

With ordinal encoding, you assign a number to each category according to its order.

If we take the temperature scale as the order, the ordinal values should run from "Cold" to "Very Hot". Ordinal encoding will assign the values Cold (1) < Warm (2) < Hot (3) < Very Hot (4). Usually, ordinal encoding starts from 1.

In [7]:
df_oe = df.copy()

Temp_dict = { 'Cold' : 1, 'Warm' : 2, 'Hot' : 3, 'Very Hot' :4} 
df_oe['Temp_Ordinal']= df_oe.Temperature.map(Temp_dict)  
df_oe


Unnamed: 0,Temperature,Color,Target,Temp_Ordinal
0,Hot,Red,1,3
1,Cold,Yellow,1,1
2,Very Hot,Blue,1,4
3,Warm,Blue,0,2
4,Hot,Red,1,3
5,Warm,Vellow,0,2
6,Warm,Red,1,2
7,Hot,Yellow,0,3
8,Hot,Yellow,1,3
9,Cold,Yellow,1,1
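
As an alternative to a hand-written dictionary, the same ordering can be declared once with an ordered pandas `Categorical`. This is a sketch under the assumption that the order Cold < Warm < Hot < Very Hot is the one intended; `order` and `temp_ordinal` are illustrative names.

```python
import pandas as pd

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                   'Warm', 'Warm', 'Hot', 'Hot', 'Cold'], name='Temperature')

# Declare the category order explicitly
order = ['Cold', 'Warm', 'Hot', 'Very Hot']
cat = pd.Categorical(temps, categories=order, ordered=True)

# .codes is zero-based (and -1 for values not listed in `order`);
# add 1 to match the 1-based mapping used in this section
temp_ordinal = pd.Series(cat.codes + 1, index=temps.index)
print(temp_ordinal.tolist())
```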


# Binary Encoding

Binary encoding converts a category into binary digits. Each binary digit creates one feature column. If there are n unique categories, binary encoding results in only about log₂(n) features. In this example, we have four categories; because they are numbered from 1 to 4 and 4 is 100 in binary, the total number of binary encoded features will be three. Compared to One-Hot Encoding, this requires fewer feature columns (for 100 categories, One-Hot Encoding will have 100 features, while Binary Encoding needs just seven).

Binary encoding follows these steps:
- The categories are first converted to integers starting from 1 (the numbering follows the order in which categories appear in the dataset and does not imply any ordinal meaning)
- Then those integers are converted into binary code, so for example 3 becomes 011, 4 becomes 100
- Then the digits of the binary number form separate columns.
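
The steps above can be sketched by hand with pandas and integer bit operations. This mimics the steps as described rather than the `category_encoders` implementation; the `Temperature_{i}` column naming is chosen here only for readability.

```python
import pandas as pd

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                   'Warm', 'Warm', 'Hot', 'Hot', 'Cold'], name='Temperature')

# Step 1: number the categories 1..n in order of first appearance
codes = pd.factorize(temps)[0] + 1

# Step 2: find how many binary digits the largest code needs
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first
bits = {
    f'Temperature_{i}': (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
}
binary_df = pd.DataFrame(bits, index=temps.index)
print(binary_df)
```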


<img src="binary_encoding.png" style="height: 280px;" align=left> 

In [8]:
import category_encoders as ce

# Select the column to encode via `cols` rather than passing the data to the constructor
encoder = ce.BinaryEncoder(cols=['Temperature'])
dfbin = encoder.fit_transform(df[['Temperature']])
df_final = pd.concat([df, dfbin], axis=1)
df_final

Unnamed: 0,Temperature,Color,Target,Temperature_0,Temperature_1,Temperature_2
0,Hot,Red,1,0,0,1
1,Cold,Yellow,1,0,1,0
2,Very Hot,Blue,1,0,1,1
3,Warm,Blue,0,1,0,0
4,Hot,Red,1,0,0,1
5,Warm,Vellow,0,1,0,0
6,Warm,Red,1,1,0,0
7,Hot,Yellow,0,0,0,1
8,Hot,Yellow,1,0,0,1
9,Cold,Yellow,1,0,1,0


# Frequency Encoding

Frequency encoding uses the frequency of each category as its label. When the frequency is related to the target variable, this helps the model weight the categories in direct or inverse proportion to how often they occur, depending on the nature of the data. It takes three steps:

- Select a categorical variable you would like to transform
- Group by the categorical variable and obtain counts of each category
- Join it back with the training dataset

In [9]:
df_fq = df.copy()

fe= df_fq.groupby('Temperature').size()/len(df_fq) 
df_fq['Temp_freq_encode'] = df_fq['Temperature'].map(fe)
df_fq

Unnamed: 0,Temperature,Color,Target,Temp_freq_encode
0,Hot,Red,1,0.4
1,Cold,Yellow,1,0.2
2,Very Hot,Blue,1,0.1
3,Warm,Blue,0,0.3
4,Hot,Red,1,0.4
5,Warm,Vellow,0,0.3
6,Warm,Red,1,0.3
7,Hot,Yellow,0,0.4
8,Hot,Yellow,1,0.4
9,Cold,Yellow,1,0.2


# Mean Encoding

Mean encoding is similar to label encoding, except that here the labels are **correlated directly with the target**. In mean target encoding, the label for each category of the feature is the mean value of the target variable on the training data. This encoding method brings out the relation between similar categories, but the relation is bounded within the categories and the target itself.

The mean encoding approach is as follows:
1. Select a categorical variable you would like to transform.
2. Group by the categorical variable and obtain the aggregated sum over the "Target" variable (the total number of 1s for each category in 'Temperature').
3. Group by the categorical variable and obtain the aggregated count over the "Target" variable.
4. Divide the step 2 result (sum of Target) by the step 3 result (count of Target) and join it back with the training data.
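
The four steps can be written out explicitly as follows. This is a sketch on the lesson's toy data; `df_steps`, `stats` and `mean_enc` are illustrative names.

```python
import pandas as pd

df_steps = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

# Steps 2 and 3: per-category sum and count of the target
stats = df_steps.groupby('Temperature')['Target'].agg(['sum', 'count'])

# Step 4: sum / count is the per-category mean of the target
stats['mean_enc'] = stats['sum'] / stats['count']

# Join the encoded value back onto the training rows
df_steps['Temperature_mean'] = df_steps['Temperature'].map(stats['mean_enc'])
print(stats)
```

Dividing the sum by the count is the same as calling `.mean()` directly on the grouped target.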

<img src="mean_encoding.png" style="height: 350px;" align=left> 



In [10]:
df_mn = df.copy()

mean_encode = df_mn.groupby('Temperature')['Target'].mean() 
print(mean_encode) 
df_mn['Temperature_mean'] = df_mn.Temperature.map(mean_encode) 
df_mn 

Temperature
Cold        1.000000
Hot         0.750000
Very Hot    1.000000
Warm        0.333333
Name: Target, dtype: float64


Unnamed: 0,Temperature,Color,Target,Temperature_mean
0,Hot,Red,1,0.75
1,Cold,Yellow,1,1.0
2,Very Hot,Blue,1,1.0
3,Warm,Blue,0,0.333333
4,Hot,Red,1,0.75
5,Warm,Vellow,0,0.333333
6,Warm,Red,1,0.333333
7,Hot,Yellow,0,0.75
8,Hot,Yellow,1,0.75
9,Cold,Yellow,1,1.0


#### A variant: smoothed mean encoding
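
The cell below blends each category mean with the global mean of the target. Read off the code, and using notation introduced here only for illustration ($n_c$ is the category count, $\bar{y}_c$ the category mean, $\bar{y}$ the global mean, and $w$ the `weight`), the smoothed value is:

$$
\text{smooth}_c = \frac{n_c \, \bar{y}_c + w \, \bar{y}}{n_c + w}
$$

With $w = 100$ and only ten rows, every category is pulled close to the global mean of 0.7. For Warm, for example, $(3 \times \tfrac{1}{3} + 100 \times 0.7) / (3 + 100) = 71 / 103 \approx 0.6893$.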

In [11]:
df_m = df.copy()
mean = df_m['Target'].mean()
agg = df_m.groupby('Temperature')['Target'].agg(['count', 'mean'])

counts = agg['count'] 
means = agg['mean'] 
weight = 100 

smooth = (counts  * means + weight * mean) / (counts + weight) 
print(smooth)

df_m['Temperature_smean_enc'] = df_m['Temperature'].map(smooth) 
df_m

Temperature
Cold        0.705882
Hot         0.701923
Very Hot    0.702970
Warm        0.689320
dtype: float64


Unnamed: 0,Temperature,Color,Target,Temperature_smean_enc
0,Hot,Red,1,0.701923
1,Cold,Yellow,1,0.705882
2,Very Hot,Blue,1,0.70297
3,Warm,Blue,0,0.68932
4,Hot,Red,1,0.701923
5,Warm,Vellow,0,0.68932
6,Warm,Red,1,0.68932
7,Hot,Yellow,0,0.701923
8,Hot,Yellow,1,0.701923
9,Cold,Yellow,1,0.705882


# Weight of Evidence Encoding

Weight of Evidence (WoE) measures the "strength" of a grouping for separating good and bad outcomes. More generally, it measures how much the **evidence supports or undermines a hypothesis**.

$$
\mathrm{WoE} = \ln\left(\frac{\text{Distr Goods}}{\text{Distr Bads}}\right) \times 100
$$

WoE will be 0 if P(Goods) / P(Bads) = 1, which indicates that the group separates goods from bads no better than **random**. If P(Bads) > P(Goods), the odds ratio will be < 1 and the WoE will be < 0; if, on the other hand, P(Goods) > P(Bads) in a group, then WoE > 0.

WoE is well suited for Logistic Regression because the Logit transformation is simply the log of the odds, i.e., ln(P(Goods)/P(Bads)). Therefore, by using WoE-coded predictors in Logistic Regression, the predictors are all prepared and coded to the same scale. The parameters in the linear logistic regression equation can be directly compared.

**Advantages of WoE**:
1. It can transform an independent variable so that it establishes a monotonic relationship to the dependent variable. It does more than this: to secure a monotonic relationship it would be enough to "recode" the variable to any ordered measure (for example 1, 2, 3, 4, ...), but the WoE transformation orders the categories on a "logistic" scale, which is natural for Logistic Regression. [**Note:** A monotonic relationship is a relationship that does one of the following: (1) as the value of one variable increases, so does the value of the other variable; or (2) as the value of one variable increases, the other variable value decreases.]
2. For variables with too many (sparsely populated) discrete values, these can be grouped into categories (densely populated), and the WoE can be used to express information for the whole category
3. The (univariate) effect of each category on the dependent variable can be compared across categories and variables because WoE is a standardized value (for example you can compare WoE of married people to WoE of manual workers)

**Disadvantages of WoE**:
1. Loss of information (variation) due to binning to a few categories
2. It is a “univariate” measure, so it does not take into account the correlation between independent variables
3. It is easy to manipulate (over-fit) the effect of variables according to how categories are created

In [12]:
# Calculate the probability of target = 1 that is Good = 1 for each category

df1 = df.copy()

woe = df1.groupby('Temperature')['Target'].mean() 
woe_df = pd.DataFrame(woe) 

# Rename the column name to 'Good' to keep it consistent with formula for easy understanding

woe_df= woe_df.rename(columns = {'Target':'Good'})
woe_df['Bad'] = 1-woe_df.Good 

# need to add a small value to avoid divide by zero in denominator 
woe_df['Bad'] = np.where(woe_df['Bad'] == 0, 0.000001, woe_df.Bad) 

# Compute WoE (the factor of 100 from the formula is not applied here)
woe_df['WoE'] = np.log(woe_df.Good/woe_df.Bad)

# Map back to df1

df1['WoE_Encode'] = df1.Temperature.map(woe_df.WoE)

df1

Unnamed: 0,Temperature,Color,Target,WoE_Encode
0,Hot,Red,1,1.098612
1,Cold,Yellow,1,13.815511
2,Very Hot,Blue,1,13.815511
3,Warm,Blue,0,-0.693147
4,Hot,Red,1,1.098612
5,Warm,Vellow,0,-0.693147
6,Warm,Red,1,-0.693147
7,Hot,Yellow,0,1.098612
8,Hot,Yellow,1,1.098612
9,Cold,Yellow,1,13.815511
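
The cell above works with the per-category probabilities of good and bad. The formula at the top of this section is instead written in terms of distributions, that is, each category's share of all goods and of all bads. A minimal sketch of that reading follows. The smoothing constant `eps = 0.5` added to every count to avoid taking the log of zero is an assumption made here, not something taken from the lesson, and the factor of 100 is again omitted.

```python
import numpy as np
import pandas as pd

df_woe = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

# Count goods (Target == 1) and bads (Target == 0) in each category
counts = pd.crosstab(df_woe['Temperature'], df_woe['Target'])

eps = 0.5  # assumed smoothing constant so that no count is exactly zero
good = counts[1] + eps
bad = counts[0] + eps

# Share of all goods / all bads that falls into each category
distr_good = good / good.sum()
distr_bad = bad / bad.sum()

woe = np.log(distr_good / distr_bad)
df_woe['WoE_distr'] = df_woe['Temperature'].map(woe)
print(woe)
```

With smoothing, categories that contain no bads get a finite, moderate WoE instead of the extreme value produced by substituting a tiny denominator.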


# Probability Ratio Encoding
Probability Ratio Encoding is similar to Weight of Evidence (WoE); the only difference is that the plain ratio of good and bad probabilities is used, without the logarithm. For each label, we calculate the mean of target=1, that is, the probability of the target being **1** (i.e. **P(1)**), and also the probability of the target being 0 (i.e. P(0)). We then calculate the ratio P(1)/P(0) and replace the labels with that ratio. Wherever a category has no target=0, we substitute a small value for P(0) to avoid division by zero.

In [13]:
# We calculate the probability of target = 1, i.e. Good = 1, for each category

df2 = df.copy()

pr_df = df2.groupby('Temperature')['Target'].mean()
pr_df = pd.DataFrame(pr_df)

# Rename the column to 'Good' to keep it consistent with the formula for easy understanding

pr_df = pr_df.rename(columns = {'Target':'Good'})

# Calculate the Bad probability, which is 1 - Good probability

pr_df['Bad'] = 1-pr_df.Good

# We need to add a small value to avoid divide by zero in the denominator

pr_df['Bad'] = np.where(pr_df['Bad'] == 0, 0.000001, pr_df['Bad']) 

# compute the Probability Ratio 

pr_df['PR'] = pr_df.Good/pr_df.Bad 

pr_df 

Temperature,Good,Bad,PR
Cold,1.0,1e-06,1000000.0
Hot,0.75,0.25,3.0
Very Hot,1.0,1e-06,1000000.0
Warm,0.333333,0.666667,0.5
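
The table above holds the ratio per category but is not yet joined back to the rows. The following sketch completes that last step in the same way the other encodings do; `df_pr` and `PR_Encode` are illustrative names.

```python
import pandas as pd

df_pr = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

# P(1) per category, and P(0) with zeros replaced by a small value
good = df_pr.groupby('Temperature')['Target'].mean()
bad = (1 - good).replace(0, 0.000001)

# Probability ratio per category, mapped back onto each row
ratio = good / bad
df_pr['PR_Encode'] = df_pr['Temperature'].map(ratio)
print(df_pr)
```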


# Implementation in an actual production environment

### Training time

In [14]:
data = {'Temperature': ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'], 
        'Color': ['Red','Yellow','Blue','Blue','Red','Vellow','Red','Yellow','Yellow','Yellow'], 
        'Target': [1,1,1,0,1,0,1,0,1,1]} 

df = pd.DataFrame(data,columns = ['Temperature','Color','Target']) 

df

Unnamed: 0,Temperature,Color,Target
0,Hot,Red,1
1,Cold,Yellow,1
2,Very Hot,Blue,1
3,Warm,Blue,0
4,Hot,Red,1
5,Warm,Vellow,0
6,Warm,Red,1
7,Hot,Yellow,0
8,Hot,Yellow,1
9,Cold,Yellow,1


In [15]:
df_mn = df.copy()

mean_encode = df_mn.groupby('Temperature')['Target'].mean() 
print(mean_encode) 
df_mn['Temperature_mean'] = df_mn.Temperature.map(mean_encode) 
df_mn 

Temperature
Cold        1.000000
Hot         0.750000
Very Hot    1.000000
Warm        0.333333
Name: Target, dtype: float64


Unnamed: 0,Temperature,Color,Target,Temperature_mean
0,Hot,Red,1,0.75
1,Cold,Yellow,1,1.0
2,Very Hot,Blue,1,1.0
3,Warm,Blue,0,0.333333
4,Hot,Red,1,0.75
5,Warm,Vellow,0,0.333333
6,Warm,Red,1,0.333333
7,Hot,Yellow,0,0.75
8,Hot,Yellow,1,0.75
9,Cold,Yellow,1,1.0


### Testing time

In [16]:
# encoded values from mean encoding 
print(mean_encode) 

#test data without the target 

test_data = {'Temperature': ['Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold', 'Cold']}
dft = pd.DataFrame(test_data, columns = ['Temperature']) 
dft['Temperature_mean'] = dft.Temperature.map(mean_encode)  

dft

Temperature
Cold        1.000000
Hot         0.750000
Very Hot    1.000000
Warm        0.333333
Name: Target, dtype: float64


Unnamed: 0,Temperature,Temperature_mean
0,Cold,1.0
1,Very Hot,1.0
2,Warm,0.333333
3,Hot,0.75
4,Warm,0.333333
5,Warm,0.333333
6,Hot,0.75
7,Hot,0.75
8,Cold,1.0
9,Cold,1.0
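
At test time a category may appear that was never seen during training, in which case `map` returns `NaN`. One common fallback, shown here as an assumption rather than as part of the lesson's pipeline, is to fill such values with the global mean of the training target. `'Freezing'` is a hypothetical unseen category.

```python
import pandas as pd

train = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

# Learn the encoding and the fallback value from the training data only
mean_enc = train.groupby('Temperature')['Target'].mean()
global_mean = train['Target'].mean()

# 'Freezing' never appeared in training (hypothetical example)
test = pd.DataFrame({'Temperature': ['Cold', 'Freezing', 'Hot']})

# map() yields NaN for unseen categories; fall back to the global mean
test['Temperature_mean'] = test['Temperature'].map(mean_enc).fillna(global_mean)
print(test)
```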


# Final Remark

<img src="label_encoding_cheatSheet.png" style="height: 900px;" align=left>  