# Smarter Ways to Encode Categorical Data for Machine Learning

# Table of Contents
1. [One Hot Encoding](#section1)<br>
2. [Label Encoding](#section2)<br>
3. [Ordinal Encoding](#section3)<br>
4. [Binary Encoding](#section4)<br>
5. [Frequency Encoding](#section5)<br>
6. [Mean / Target Encoding](#section6)<br>
7. [Weight of Evidence Encoding](#section7)<br>
8. [Probability Ratio Encoding](#section8)<br>

Use Category Encoders to improve model performance when you have nominal or ordinal data that may provide value.

For **nominal columns** try:
- OneHot
- Hashing
- LeaveOneOut
- Target

Avoid OneHot for high cardinality columns and decision tree-based algorithms.

For **ordinal columns** try:
- Ordinal (Integer)
- Binary
- OneHot
- LeaveOneOut
- Target

Helmert, Sum, BackwardDifference and Polynomial are less likely to be helpful, but if you have time or a theoretical reason you might want to try them.

For regression tasks, Target and LeaveOneOut probably won’t work well.

In [1]:
import pandas as pd

# Toy dataset: two categorical features and a binary target
data = {
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Color': ['Red', 'Yellow', 'Blue', 'Blue', 'Red', 'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
}
df = pd.DataFrame(data, columns=['Temperature', 'Color', 'Target'])
df.head()

| | Temperature | Color | Target |
|---|---|---|---|
| 0 | Hot | Red | 1 |
| 1 | Cold | Yellow | 1 |
| 2 | Very Hot | Blue | 1 |
| 3 | Warm | Blue | 0 |
| 4 | Hot | Red | 1 |


<a id=section1></a> 
# 1. One Hot Encoding

One hot encoding maps each category to a vector of 1s and 0s that marks the presence or absence of that category. The number of vectors equals the number of categories in the feature. When the feature has very many categories, this produces a large number of columns, which can slow learning significantly.

In [4]:
# Pandas Get Dummies
print(df.Temperature.nunique())
pd.get_dummies(df, prefix='Temp', columns=['Temperature'])


4


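| | Color | Target | Temp_Cold | Temp_Hot | Temp_Very Hot | Temp_Warm |
|---|---|---|---|---|---|---|
| 0 | Red | 1 | 0 | 1 | 0 | 0 |
| 1 | Yellow | 1 | 1 | 0 | 0 | 0 |
| 2 | Blue | 1 | 0 | 0 | 1 | 0 |
| 3 | Blue | 0 | 0 | 0 | 0 | 1 |
| 4 | Red | 1 | 0 | 1 | 0 | 0 |
| 5 | Yellow | 0 | 0 | 0 | 0 | 1 |
| 6 | Red | 1 | 0 | 0 | 0 | 1 |
| 7 | Yellow | 0 | 0 | 1 | 0 | 0 |
| 8 | Yellow | 1 | 0 | 1 | 0 | 0 |
| 9 | Yellow | 1 | 1 | 0 | 0 | 0 |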


In [7]:
# Scikit-learn's OneHotEncoder does the same job, but it returns an array instead of
# adding columns to the DataFrame, so the new columns are built and joined by hand.
from sklearn.preprocessing import OneHotEncoder
ohc = OneHotEncoder()
ohe = ohc.fit_transform(df.Temperature.values.reshape(-1, 1)).toarray()
# categories_[0] holds the sorted category names for the single input column
df_oh = pd.DataFrame(ohe, columns=['Temp_' + c for c in ohc.categories_[0]])
dfh = pd.concat([df, df_oh], axis=1)
dfh

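| | Temperature | Color | Target | Temp_Cold | Temp_Hot | Temp_Very Hot | Temp_Warm |
|---|---|---|---|---|---|---|---|
| 0 | Hot | Red | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | Cold | Yellow | 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | Very Hot | Blue | 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | Warm | Blue | 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | Hot | Red | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 5 | Warm | Yellow | 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 6 | Warm | Red | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 7 | Hot | Yellow | 0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 8 | Hot | Yellow | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 9 | Cold | Yellow | 1 | 1.0 | 0.0 | 0.0 | 0.0 |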


One hot encoding is very popular. A feature with N categories can be fully represented by N-1 columns, because a row with 0 in all of them must belong to the one category that was left out.

**For Regression**
Use N-1 columns (drop the first or last column of the one-hot encoded feature).

**For classification**
Use all N columns without dropping any, as most tree-based algorithms build the tree from all available variables.

One hot encoding with N-1 binary variables should be used in linear regression to ensure the correct number of degrees of freedom (N-1). Linear regression has access to all of the features while it is being trained, and therefore examines the whole set of dummy variables together. This means that N-1 binary variables give the model complete information about the original categorical variable. The same approach can be adopted for any machine learning algorithm that looks at all the features at the same time during training, for example support vector machines, neural networks, and clustering algorithms.
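The degrees-of-freedom argument can be checked numerically. The following is a minimal sketch, assuming the toy `Temperature` column from this notebook and a linear model with an intercept; the helper `with_intercept` and the variable names are illustrative and not part of any library. It compares the rank of the design matrix built from all N dummy columns with the one built from N-1.

```python
import numpy as np
import pandas as pd

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                   'Warm', 'Warm', 'Hot', 'Hot', 'Cold'], name='Temperature')

# All N = 4 dummy columns versus N-1 = 3 (first category dropped)
full = pd.get_dummies(temps, prefix='Temp', dtype=float)
reduced = pd.get_dummies(temps, prefix='Temp', drop_first=True, dtype=float)

def with_intercept(dummies):
    """Prepend a column of ones, as a linear model with an intercept would."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

X_full = with_intercept(full)
X_reduced = with_intercept(reduced)

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape, rank_full)        # (10, 5) 4 -> one column is redundant
print(X_reduced.shape, rank_reduced)  # (10, 4) 4 -> full column rank
```

With all four dummies plus an intercept there are five columns but only four independent ones, since the dummies always sum to the intercept column. Dropping one dummy removes that redundancy without losing information.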

In tree-based methods, a category that has been dropped is never considered as a candidate for a split. Thus, if we use the categorical variable in a tree-based learning algorithm, it is good practice to encode it into N binary variables and not drop any.

In [8]:
import category_encoders as ce
one_enc = ce.OneHotEncoder(cols=['Temperature'])
one_df = one_enc.fit_transform(df.Temperature)
one_df = pd.concat([df, one_df], axis=1)
one_df

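| | Temperature | Color | Target | Temperature_1 | Temperature_2 | Temperature_3 | Temperature_4 |
|---|---|---|---|---|---|---|---|
| 0 | Hot | Red | 1 | 1 | 0 | 0 | 0 |
| 1 | Cold | Yellow | 1 | 0 | 1 | 0 | 0 |
| 2 | Very Hot | Blue | 1 | 0 | 0 | 1 | 0 |
| 3 | Warm | Blue | 0 | 0 | 0 | 0 | 1 |
| 4 | Hot | Red | 1 | 1 | 0 | 0 | 0 |
| 5 | Warm | Yellow | 0 | 0 | 0 | 0 | 1 |
| 6 | Warm | Red | 1 | 0 | 0 | 0 | 1 |
| 7 | Hot | Yellow | 0 | 1 | 0 | 0 | 0 |
| 8 | Hot | Yellow | 1 | 1 | 0 | 0 | 0 |
| 9 | Cold | Yellow | 1 | 0 | 1 | 0 | 0 |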


<a id=section2></a>
# 2. Label Encoding

In this encoding, each category is assigned an integer; the examples below use 0 through N-1, where N is the number of categories for the feature. One major issue with this approach is that there is no relation or order between these classes, but the algorithm might treat the integers as ordered or otherwise related. In the example below it may look like Cold < Hot < Very Hot < Warm (0 < 1 < 2 < 3).

In [12]:
# Scikit-learn code for the data-frame as follows:
from sklearn.preprocessing import LabelEncoder
le_df = df.copy()
le_df['Temp_LE_Temp'] = LabelEncoder().fit_transform(le_df.Temperature)
le_df.head()

| | Temperature | Color | Target | Temp_LE_Temp |
|---|---|---|---|---|
| 0 | Hot | Red | 1 | 1 |
| 1 | Cold | Yellow | 1 | 0 |
| 2 | Very Hot | Blue | 1 | 2 |
| 3 | Warm | Blue | 0 | 3 |
| 4 | Hot | Red | 1 | 1 |


In [11]:
# pandas factorize does the same job, but it numbers categories in order of first
# appearance (Hot=0, Cold=1, ...), whereas LabelEncoder sorts them first.
le_df['Temp_factorize'] = pd.factorize(le_df.Temperature)[0]
le_df.head()

| | Temperature | Color | Target | Temp_LE_Temp | Temp_factorize |
|---|---|---|---|---|---|
| 0 | Hot | Red | 1 | 1 | 0 |
| 1 | Cold | Yellow | 1 | 0 | 1 |
| 2 | Very Hot | Blue | 1 | 2 | 2 |
| 3 | Warm | Blue | 0 | 3 | 3 |
| 4 | Hot | Red | 1 | 1 | 0 |
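
To make the ordering issue concrete, here is a minimal sketch using only pandas. `pd.factorize(..., sort=True)` reproduces the alphabetical numbering seen in the `Temp_LE_Temp` column, while the list `temp_order` is an assumed domain ordering chosen for illustration; it is not given in the original data.

```python
import pandas as pd

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                   'Warm', 'Warm', 'Hot', 'Hot', 'Cold'], name='Temperature')

# Alphabetical numbering: the integers carry no physical meaning
alpha_codes, alpha_levels = pd.factorize(temps, sort=True)

# Assumed (hypothetical) domain ordering from coldest to hottest
temp_order = ['Cold', 'Warm', 'Hot', 'Very Hot']
ordered = pd.Categorical(temps, categories=temp_order, ordered=True)
ordered_codes = ordered.codes

print(list(alpha_levels))   # ['Cold', 'Hot', 'Very Hot', 'Warm']
print(list(alpha_codes))    # [1, 0, 2, 3, 1, 3, 3, 1, 1, 0]
print(list(ordered_codes))  # [2, 0, 3, 1, 2, 1, 1, 2, 2, 0]
```

In the alphabetical version, Warm (3) ranks above Very Hot (2), which is exactly the kind of spurious order a model may pick up. Supplying the order explicitly is only appropriate when the feature really is ordinal.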


<a id=section3></a>
# 3. Ordinal Encoding

<a id=section4></a>
# 4. Binary Encoding

<a id=section5></a>
# 5. Frequency Encoding

<a id=section6></a>
# 6. Mean / Target Encoding

<a id=section7></a>
# 7. Weight of Evidence Encoding

<a id=section8></a>
# 8. Probability Ratio Encoding
