### Categorical Variable Encoding

- Most machine learning algorithms cannot handle categorical variables unless they are converted to numerical values.
- The performance of many algorithms varies with how the categorical variables are encoded.

### Categorical Variable
- Categorical variables can be divided into two types: nominal (no particular order) and ordinal (ordered).
- A `nominal` variable has no intrinsic ordering to its categories. For example, gender is a categorical variable with two categories (male and female) and no intrinsic ordering between them.
- An `ordinal` variable has a clear ordering. For example, temperature can be a variable with three ordered categories (low, medium and high).
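
The distinction can be made explicit in pandas. The following is a minimal sketch on toy values (the series names and values are illustrative and are not taken from the dataset used later): an ordered `Categorical` supports comparisons, while an unordered one does not.

```python
import pandas as pd

# Nominal: categories carry no order
gender = pd.Series(["male", "female", "female"], dtype="category")

# Ordinal: declare the order explicitly so pandas can compare and sort
temperature = pd.Series(
    pd.Categorical(
        ["low", "high", "medium"],
        categories=["low", "medium", "high"],
        ordered=True,
    )
)

print(gender.cat.ordered)               # False
print(temperature.cat.ordered)          # True
print(temperature.cat.codes.tolist())   # [0, 2, 1]
print((temperature > "low").tolist())   # [False, True, True]
```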


There are many ways to encode these categorical variables:

### LabelEncoding

In this encoding, each category is assigned an integer from 0 through N-1, where N is the number of categories for the feature (scikit-learn's `LabelEncoder` sorts the categories and numbers them starting at 0). One major issue with this approach is that the integers imply an order even when the categories have none, and an algorithm may treat that order as meaningful.
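
As a small illustration on toy values (not the dataset loaded below), `LabelEncoder` sorts the classes and numbers them from 0, so the resulting codes reflect alphabetical order rather than any property of the colors:

```python
from sklearn.preprocessing import LabelEncoder

colors = ["White", "Blue", "Black", "Green", "Red"]  # toy values

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ is sorted, so the codes follow alphabetical order
print(list(encoder.classes_))  # ['Black', 'Blue', 'Green', 'Red', 'White']
print(codes.tolist())          # [4, 1, 0, 2, 3]
```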

In [1]:
import numpy as np
import pandas as pd

In [2]:
raw_data = pd.read_csv(r'./data/Bags_sales.csv')
data = raw_data.copy()

In [3]:
data.shape

(39, 7)

In [4]:
data.head(3)

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased
0,SRI-1,Srilanka,26,White,Female,Short,No
1,PAK-2,Pakistan,23,White,Female,Medium,Yes
2,AUS-3,Australia,35,Blue,Female,Short,Yes


In [5]:
data.dtypes

CustId         object
Country        object
Age             int64
FavColor       object
Gender         object
Height_Type    object
Purchased      object
dtype: object

In [6]:
from sklearn.preprocessing import LabelEncoder

In [7]:
data['FavColor_LabelEncoded'] = LabelEncoder().fit_transform(data['FavColor'])

In [8]:
data.head()

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased,FavColor_LabelEncoded
0,SRI-1,Srilanka,26,White,Female,Short,No,4
1,PAK-2,Pakistan,23,White,Female,Medium,Yes,4
2,AUS-3,Australia,35,Blue,Female,Short,Yes,1
3,WES-4,West Indies,32,Black,Female,Tall,No,0
4,SRI-5,Srilanka,35,Green,Female,Medium,No,2


In [9]:
data.drop(columns=['FavColor_LabelEncoded'], inplace=True)

In [10]:
data['Height_Type_LabelEncoded'] = LabelEncoder().fit_transform(data['Height_Type'])
data.head()

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased,Height_Type_LabelEncoded
0,SRI-1,Srilanka,26,White,Female,Short,No,1
1,PAK-2,Pakistan,23,White,Female,Medium,Yes,0
2,AUS-3,Australia,35,Blue,Female,Short,Yes,1
3,WES-4,West Indies,32,Black,Female,Tall,No,2
4,SRI-5,Srilanka,35,Green,Female,Medium,No,0


In [11]:
data.drop(columns=['Height_Type_LabelEncoded'], inplace=True)

#### Ordinal Encoding

In [12]:
height_dict = {'Short': 1, 'Medium': 2, 'Tall': 3}

data['HeightType_Ordinal'] = data['Height_Type'].map(height_dict)

data.head()

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased,HeightType_Ordinal
0,SRI-1,Srilanka,26,White,Female,Short,No,1
1,PAK-2,Pakistan,23,White,Female,Medium,Yes,2
2,AUS-3,Australia,35,Blue,Female,Short,Yes,1
3,WES-4,West Indies,32,Black,Female,Tall,No,3
4,SRI-5,Srilanka,35,Green,Female,Medium,No,2
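
The dictionary mapping above can also be sketched with scikit-learn's `OrdinalEncoder`, assuming the intended order is passed through `categories`. The toy frame below is illustrative; note that the encoder numbers the categories from 0 and returns floats, unlike the 1-to-3 mapping used above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame standing in for the Height_Type column
toy = pd.DataFrame({"Height_Type": ["Short", "Medium", "Tall", "Medium"]})

# Pass the categories in their intended order; codes then start at 0
encoder = OrdinalEncoder(categories=[["Short", "Medium", "Tall"]])
toy["HeightType_Ordinal"] = encoder.fit_transform(toy[["Height_Type"]]).ravel()

print(toy["HeightType_Ordinal"].tolist())  # [0.0, 1.0, 2.0, 1.0]
```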


### One Hot Encoding

- One hot encoding replaces the categorical variable with a set of boolean variables, one per category, each taking the value 0 or 1 to indicate whether that category / label was present for the observation.
- This method produces one column per category, which slows down learning significantly if the feature has a very large number of categories.

In [13]:
data = raw_data.copy()

sorted(data['FavColor'].unique())

['Black', 'Blue', 'Green', 'Red', 'White']

In [14]:
pd.get_dummies(data, prefix=['FavColor'], columns=['FavColor']).head()

Unnamed: 0,CustId,Country,Age,Gender,Height_Type,Purchased,FavColor_Black,FavColor_Blue,FavColor_Green,FavColor_Red,FavColor_White
0,SRI-1,Srilanka,26,Female,Short,No,0,0,0,0,1
1,PAK-2,Pakistan,23,Female,Medium,Yes,0,0,0,0,1
2,AUS-3,Australia,35,Female,Short,Yes,0,1,0,0,0
3,WES-4,West Indies,32,Female,Tall,No,1,0,0,0,0
4,SRI-5,Srilanka,35,Female,Medium,No,0,0,1,0,0


**Scikit-learn has OneHotEncoder for this purpose**

In [15]:
from sklearn.preprocessing import OneHotEncoder

In [16]:
ohc = OneHotEncoder()
ohe = ohc.fit_transform(data.FavColor.values.reshape(-1,1)).toarray()
ohe[0:5]

array([[0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.]])

In [17]:
ohc.categories_

[array(['Black', 'Blue', 'Green', 'Red', 'White'], dtype=object)]

In [18]:
favCol_ohe_df = pd.DataFrame(ohe, 
                 columns = ["FavColor_" + ohc.categories_[0][i] for i in range(len(ohc.categories_[0]))],
                            dtype=int)

favCol_ohe_df.head()

Unnamed: 0,FavColor_Black,FavColor_Blue,FavColor_Green,FavColor_Red,FavColor_White
0,0,0,0,0,1
1,0,0,0,0,1
2,0,1,0,0,0
3,1,0,0,0,0
4,0,0,1,0,0


In [19]:
data = pd.concat([data, favCol_ohe_df], axis=1)
data.head()

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased,FavColor_Black,FavColor_Blue,FavColor_Green,FavColor_Red,FavColor_White
0,SRI-1,Srilanka,26,White,Female,Short,No,0,0,0,0,1
1,PAK-2,Pakistan,23,White,Female,Medium,Yes,0,0,0,0,1
2,AUS-3,Australia,35,Blue,Female,Short,Yes,0,1,0,0,0
3,WES-4,West Indies,32,Black,Female,Tall,No,1,0,0,0,0
4,SRI-5,Srilanka,35,Green,Female,Medium,No,0,0,1,0,0


**To encode a categorical variable with k labels, k-1 dummy variables are enough: the dropped label is implied when all the others are 0.**

In [20]:
data = raw_data.copy()

pd.get_dummies(data, prefix=['FavColor'], columns=['FavColor'], drop_first = True).head()

Unnamed: 0,CustId,Country,Age,Gender,Height_Type,Purchased,FavColor_Blue,FavColor_Green,FavColor_Red,FavColor_White
0,SRI-1,Srilanka,26,Female,Short,No,0,0,0,1
1,PAK-2,Pakistan,23,Female,Medium,Yes,0,0,0,1
2,AUS-3,Australia,35,Female,Short,Yes,1,0,0,0
3,WES-4,West Indies,32,Female,Tall,No,0,0,0,0
4,SRI-5,Srilanka,35,Female,Medium,No,0,1,0,0


**When should you use k and when k-1 dummy variables**

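Usually, for regression we use N-1 columns (dropping the first or last column of the one-hot encoded feature), because the N columns always sum to 1 and are therefore perfectly collinear with the intercept. For classification, the usual recommendation is to keep all N columns, particularly with tree-based or regularized models, which are not affected by this collinearity in the same way.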
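
The reason for dropping a column in linear models can be checked numerically. This is a minimal sketch on toy values, assuming a design matrix with an explicit intercept column: with all k dummies the matrix is rank deficient, and with k-1 it has full column rank.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["Black", "Blue", "Green", "Red", "White", "Red"])

full = pd.get_dummies(colors, dtype=int)                      # k columns
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)  # k-1 columns

intercept = np.ones((len(colors), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# Each row of the full dummy block sums to 1, duplicating the intercept
print(full.sum(axis=1).tolist())                             # [1, 1, 1, 1, 1, 1]
print(X_full.shape[1], np.linalg.matrix_rank(X_full))        # 6 5
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))  # 5 5
```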

### Frequency Encoding

- One approach is to replace each label of the categorical variable with its count, that is, the number of times the label appears in the dataset.
- Another is to use the frequency, that is, the fraction of observations in that category. The two differ only by a constant factor (the number of rows), so they carry the same information.
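
A short sketch on toy values shows both variants side by side; `value_counts(normalize=True)` gives the fraction of rows instead of the raw count.

```python
import pandas as pd

colors = pd.Series(["Red", "Red", "Blue", "Green", "Red", "Blue"])

count_map = colors.value_counts().to_dict()               # occurrences
freq_map = colors.value_counts(normalize=True).to_dict()  # share of rows

count_encoded = colors.map(count_map)
freq_encoded = colors.map(freq_map)

print(count_encoded.tolist())  # [3, 3, 2, 1, 3, 2]
# The two encodings differ only by the constant factor len(colors)
print((freq_encoded * len(colors)).round(6).tolist())
```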

In [21]:
data = raw_data.copy()

In [22]:
from sklearn.model_selection import train_test_split

In [23]:
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, 0:-1], data.iloc[:,-1],
                                                   test_size=0.3, random_state=0)

In [24]:
favcolor_dict = X_train['FavColor'].value_counts().to_dict()
favcolor_dict

{'Red': 10, 'Green': 5, 'Black': 4, 'White': 4, 'Blue': 4}

In [25]:
X_train['FavColor_freq'] = X_train['FavColor'].map(favcolor_dict)

In [26]:
X_test['FavColor_freq'] = X_test['FavColor'].map(favcolor_dict)

In [27]:
X_train.head()

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,FavColor_freq
2,AUS-3,Australia,35,Blue,Female,Short,4
38,ENG-39,England,21,Green,Female,Short,5
20,AUS-21,Australia,18,Red,Female,Medium,10
36,PAK-37,Pakistan,36,Green,Male,Medium,5
16,AUS-17,Australia,36,Green,Male,Short,5


In [28]:
X_test.head()

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,FavColor_freq
4,SRI-5,Srilanka,35,Green,Female,Medium,5
28,SRI-29,Srilanka,22,Blue,Male,Medium,4
29,AUS-30,Australia,25,Green,Male,Tall,5
33,NEW-34,New Zealand,19,White,Male,Short,4
34,WES-35,West Indies,27,Black,Female,Tall,4


- If a category is present in the test set but was not present in the train set, this method generates missing values (NaN) in the test set.
- To handle this, we can combine rare-label replacement with count encoding, for example:
    - replace the 10 most frequent labels with their counts, and group all the other labels under a single label
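
One way to sketch this combination is shown below. The toy series, the helper `group_rare`, the placeholder label `'Rare'` and the cut-off `top_n = 2` are all illustrative assumptions; the cut-off of 10 mentioned above would be used the same way.

```python
import pandas as pd

train = pd.Series(["Red", "Red", "Red", "Blue", "Blue", "Green", "Pink"])
test = pd.Series(["Red", "Green", "Purple"])  # 'Purple' never seen in train

top_n = 2  # hypothetical cut-off for "most frequent" labels
counts = train.value_counts()
frequent = counts.index[:top_n]

def group_rare(s):
    # Keep frequent labels, send everything else (rare or unseen) to 'Rare'
    return s.where(s.isin(frequent), "Rare")

train_grouped = group_rare(train)
test_grouped = group_rare(test)

# Counts are learned from the grouped training data only
count_map = train_grouped.value_counts().to_dict()
train_encoded = train_grouped.map(count_map)
test_encoded = test_grouped.map(count_map)

print(count_map)
print(test_encoded.tolist())  # [3, 2, 2]
```

Because unseen labels are folded into the same group as rare ones, the test set no longer contains missing values, provided at least one rare label exists in the training data.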

### Mean Encoding

- `Mean encoding` is similar to label encoding, except that each category is replaced by a value derived directly from the target: the mean of the target for that category.
- Mean encoding is notorious for over-fitting, so some form of regularization, such as computing the means with cross-validation, is needed in most cases.
- The mean encoding approach is as follows:
    - Step 1: For each category of the categorical variable, compute the sum of the target variable
    - Step 2: For each category, compute the count of observations
    - Step 3: For each category, divide the sum from Step 1 by the count from Step 2
    - Step 4: Replace the category with the value obtained in Step 3
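
The cross-validation regularization mentioned above can be sketched as an out-of-fold encoding: each row is encoded with means computed on the other folds, so a row's own target never contributes to its encoded value. This is one possible scheme, not necessarily the one intended by the text; the toy frame, the number of folds and the fallback to the global mean are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "FavColor": ["Red", "Red", "Blue", "Blue", "Red", "Blue", "Green", "Green"],
    "Purchased": [1, 0, 1, 1, 0, 0, 1, 0],
})

global_mean = df["Purchased"].mean()
df["FavColor_mean"] = np.nan

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    # Steps 1-3 on the fitting folds only: sum / count == mean per category
    fold_means = df.iloc[fit_idx].groupby("FavColor")["Purchased"].mean()
    # Step 4 on the held-out fold; unseen categories fall back to global mean
    encoded = df.iloc[enc_idx]["FavColor"].map(fold_means).fillna(global_mean)
    df.loc[df.index[enc_idx], "FavColor_mean"] = encoded.to_numpy()

print(df)
```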

In [29]:
data = raw_data.copy()

In [31]:
data.head(3)

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased
0,SRI-1,Srilanka,26,White,Female,Short,No
1,PAK-2,Pakistan,23,White,Female,Medium,Yes
2,AUS-3,Australia,35,Blue,Female,Short,Yes


In [33]:
data['Purchased'] = data['Purchased'].map({'Yes':1, 'No':0})

In [34]:
data.head(3)

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased
0,SRI-1,Srilanka,26,White,Female,Short,0
1,PAK-2,Pakistan,23,White,Female,Medium,1
2,AUS-3,Australia,35,Blue,Female,Short,1


In [44]:
color_mean_encode_dict = data.groupby('FavColor')['Purchased'].mean().round(2).to_dict()
color_mean_encode_dict

{'Black': 0.43, 'Blue': 0.71, 'Green': 0.44, 'Red': 0.5, 'White': 0.67}

In [46]:
data['FavColor'] = data['FavColor'].map(color_mean_encode_dict)

In [47]:
data.head(3)

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased
0,SRI-1,Srilanka,26,0.67,Female,Short,0
1,PAK-2,Pakistan,23,0.67,Female,Medium,1
2,AUS-3,Australia,35,0.71,Female,Short,1


#### Smooth mean

In [48]:
data = raw_data.copy()
data['Purchased'] = data['Purchased'].map({'Yes':1, 'No':0})
data.head()

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased
0,SRI-1,Srilanka,26,White,Female,Short,0
1,PAK-2,Pakistan,23,White,Female,Medium,1
2,AUS-3,Australia,35,Blue,Female,Short,1
3,WES-4,West Indies,32,Black,Female,Tall,0
4,SRI-5,Srilanka,35,Green,Female,Medium,0


In [52]:
# global mean
global_mean = data['Purchased'].mean()

In [54]:
agg = data.groupby('FavColor')['Purchased'].agg(['count', 'mean'])
agg

FavColor,count,mean
Black,7,0.428571
Blue,7,0.714286
Green,9,0.444444
Red,10,0.5
White,6,0.666667


In [57]:
counts = agg['count']
means = agg['mean']
weight = 100

smooth = (counts * means + weight * global_mean) / (counts + weight)
smooth

FavColor
Black    0.531272
Blue     0.549964
Green    0.530699
Red      0.534965
White    0.545718
dtype: float64
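
Written out, the smoothed value computed in the cell above for a category $c$ is a weighted average of the category mean and the global mean. The symbols $n_c$ (category count), $\bar{y}_c$ (category mean), $\bar{y}$ (global mean) and $m$ (the `weight`) are notation introduced here for illustration:

$$
\text{smooth}_c = \frac{n_c \, \bar{y}_c + m \, \bar{y}}{n_c + m}
$$

For `Black`, with $n_c = 7$, $\bar{y}_c = 3/7$, $\bar{y} = 21/39 \approx 0.538462$ and $m = 100$:

$$
\text{smooth}_{\text{Black}} = \frac{7 \cdot \tfrac{3}{7} + 100 \cdot 0.538462}{7 + 100} = \frac{56.8462}{107} \approx 0.531272
$$

Since $m = 100$ is much larger than any category count in this dataset, every smoothed value stays close to the global mean.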

In [59]:
data['FavColor'] = data['FavColor'].map(smooth)
data.head()

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased
0,SRI-1,Srilanka,26,0.545718,Female,Short,0
1,PAK-2,Pakistan,23,0.545718,Female,Medium,1
2,AUS-3,Australia,35,0.549964,Female,Short,1
3,WES-4,West Indies,32,0.531272,Female,Tall,0
4,SRI-5,Srilanka,35,0.530699,Female,Medium,0


### Weight of Evidence Encoding

In [60]:
data = raw_data.copy()
data['Purchased'] = data['Purchased'].map({'Yes':1, 'No':0})
data.head(3)

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased
0,SRI-1,Srilanka,26,White,Female,Short,0
1,PAK-2,Pakistan,23,White,Female,Medium,1
2,AUS-3,Australia,35,Blue,Female,Short,1


- Weight of Evidence (WoE) measures how well a grouping separates "good" outcomes from "bad" ones.
- The method was developed primarily in the credit and financial industry to build predictive models of loan-default risk.
- More generally, WoE measures how much the evidence supports or undermines a hypothesis.

**WoE = [ln(Distribution of Goods / Distribution of Bads)] * 100**

- WoE is 0 if P(Goods) / P(Bads) = 1, that is, if the group does not separate goods from bads.
- If P(Bads) > P(Goods), the ratio is < 1 and WoE < 0.
- If, on the other hand, P(Goods) > P(Bads) in a group, then WoE > 0.

WoE is well suited to logistic regression because it is already on a log-odds scale. The cells below take P(Goods) and P(Bads) as the within-category probabilities of `Purchased` being 1 and 0, and leave out the factor of 100.
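
In credit scoring, "distribution of goods" is commonly read as the share of all goods that fall in a category (and likewise for bads). The sketch below follows that reading on a toy frame; the data and column names are illustrative, and categories with zero goods or zero bads would need extra handling to avoid `log(0)`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "FavColor": ["Red", "Red", "Red", "Blue", "Blue", "Blue", "Blue", "Green", "Green"],
    "Purchased": [1, 0, 0, 1, 1, 1, 0, 1, 0],
})

grouped = df.groupby("FavColor")["Purchased"].agg(goods="sum", total="count")
grouped["bads"] = grouped["total"] - grouped["goods"]

# Share of all goods / all bads that fall in each category
dist_goods = grouped["goods"] / grouped["goods"].sum()
dist_bads = grouped["bads"] / grouped["bads"].sum()

grouped["WOE"] = np.log(dist_goods / dist_bads) * 100
print(grouped)
```

Here `Blue` holds 3 of the 5 goods but only 1 of the 4 bads, so its WoE is positive, while `Red` holds 1 of the 5 goods and 2 of the 4 bads, so its WoE is negative.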

In [63]:
# calculate probability of target =1 for each category

woe = data.groupby('FavColor')['Purchased'].mean()
woe_df = pd.DataFrame(woe)
woe_df

FavColor,Purchased
Black,0.428571
Blue,0.714286
Green,0.444444
Red,0.5
White,0.666667


In [64]:
# rename column to Good to keep it consistent with formula for ease of understanding
woe_df.rename(columns = {'Purchased': 'Good'}, inplace=True)

In [66]:
# calculate Bad probability which is 1 - Good probability
woe_df['Bad'] = 1 - woe_df['Good']

In [67]:
woe_df

FavColor,Good,Bad
Black,0.428571,0.571429
Blue,0.714286,0.285714
Green,0.444444,0.555556
Red,0.5,0.5
White,0.666667,0.333333


In [68]:
# make sure Bad is not zero
woe_df['Bad'] = np.where(woe_df['Bad'] == 0, 0.000001, woe_df['Bad'])

In [69]:
woe_df['WOE'] = np.log(woe_df.Good / woe_df.Bad)

In [70]:
woe_df

FavColor,Good,Bad,WOE
Black,0.428571,0.571429,-0.287682
Blue,0.714286,0.285714,0.916291
Green,0.444444,0.555556,-0.223144
Red,0.5,0.5,0.0
White,0.666667,0.333333,0.693147


In [71]:
# assign the column directly so it takes the float dtype of the WOE values
data['FavColor'] = data['FavColor'].map(woe_df['WOE'])

In [72]:
data.head()

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased
0,SRI-1,Srilanka,26,0.693147,Female,Short,0
1,PAK-2,Pakistan,23,0.693147,Female,Medium,1
2,AUS-3,Australia,35,0.916291,Female,Short,1
3,WES-4,West Indies,32,-0.287682,Female,Tall,0
4,SRI-5,Srilanka,35,-0.223144,Female,Medium,0


### Probability Ratio Encoding

Probability Ratio Encoding is similar to Weight of Evidence (WoE); the only difference is that the ratio of the good and bad probabilities is used directly, without taking the logarithm.
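
The relationship can be checked with the per-category probabilities from the Weight of Evidence section (entered by hand here as fractions): the probability ratio is the exponential of the log-ratio computed there.

```python
import numpy as np
import pandas as pd

# Within-category probability of Purchased == 1 (values from the WoE section)
good = pd.Series({"Black": 3 / 7, "Blue": 5 / 7, "Green": 4 / 9,
                  "Red": 5 / 10, "White": 4 / 6})
bad = 1 - good

pr = good / bad            # probability ratio
woe = np.log(good / bad)   # log-ratio, as computed in the WoE cells

print(pr.round(2).to_dict())
print(np.allclose(np.exp(woe), pr))  # True
```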

In [73]:
data = raw_data.copy()
data['Purchased'] = data['Purchased'].map({'Yes':1, 'No':0})
data.head(3)

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased
0,SRI-1,Srilanka,26,White,Female,Short,0
1,PAK-2,Pakistan,23,White,Female,Medium,1
2,AUS-3,Australia,35,Blue,Female,Short,1


In [77]:
pr_data = pd.DataFrame(data.groupby('FavColor')['Purchased'].\
                       mean()).\
                       rename(columns = {'Purchased': 'Good'})
pr_data['Bad'] = 1 - pr_data['Good']
pr_data['Bad'] = np.where(pr_data['Bad'] == 0, .000001, pr_data['Bad'])
pr_data['PR'] = pr_data['Good'] / pr_data['Bad']
pr_data

FavColor,Good,Bad,PR
Black,0.428571,0.571429,0.75
Blue,0.714286,0.285714,2.5
Green,0.444444,0.555556,0.8
Red,0.5,0.5,1.0
White,0.666667,0.333333,2.0


In [78]:
data.FavColor = data.FavColor.map(pr_data.PR)
data.head()

Unnamed: 0,CustId,Country,Age,FavColor,Gender,Height_Type,Purchased
0,SRI-1,Srilanka,26,2.0,Female,Short,0
1,PAK-2,Pakistan,23,2.0,Female,Medium,1
2,AUS-3,Australia,35,2.5,Female,Short,1
3,WES-4,West Indies,32,0.75,Female,Tall,0
4,SRI-5,Srilanka,35,0.8,Female,Medium,0
