# **Categorical Encoding Methods**

**Import required libraries**

In [1]:
import pandas as pd
import sklearn

**What is Categorical Data?**

* Categorical data groups observations into a limited set of labels that share a characteristic, whereas numerical data expresses information as numbers.

* Example of categorical data: gender

**Why do we need encoding?**
* Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values.
* The performance of many algorithms also varies with how the categorical variables are encoded.

**Categorical variables can be divided into two categories:**
* Nominal: no particular order
* Ordinal: there is some order between values

**Create Dataframe**

In [2]:
df = pd.DataFrame({ 'gender' : ['Male', 'Female', 'Male', 'Female', 'Female','Male'],
                    'marital_status' : ['Married','Single','Married','Single','MArried', 'Single'],
                    'city' : ['Delhi','Mumbai','Kolkata','Delhi','Kolkata', 'Chennai'],
                    'temperature' :['very cold', 'cold', 'warm', 'hot', 'very hot','hot'],
                    'target' : [1, 0, 0, 1, 1, 1] })
df.head()

Unnamed: 0,gender,marital_status,city,temperature,target
0,Male,Married,Delhi,very cold,1
1,Female,Single,Mumbai,cold,0
2,Male,Married,Kolkata,warm,0
3,Female,Single,Delhi,hot,1
4,Female,MArried,Kolkata,very hot,1


## **1. One-Hot Encoding**
* In this method, each category is mapped to a vector of 1s and 0s that denotes the presence or absence of that category. The number of new columns equals the number of distinct categories in the feature.

In [3]:
import category_encoders as ce
ce_OHE = ce.OneHotEncoder(cols=['gender','city','marital_status'])
ce_OHE_df = ce_OHE.fit_transform(df)
ce_OHE_df.head()

Unnamed: 0,gender_1,gender_2,marital_status_1,marital_status_2,marital_status_3,city_1,city_2,city_3,city_4,temperature,target
0,1,0,1,0,0,1,0,0,0,very cold,1
1,0,1,0,1,0,0,1,0,0,cold,0
2,1,0,1,0,0,0,0,1,0,warm,0
3,0,1,0,1,0,1,0,0,0,hot,1
4,0,1,0,0,1,0,0,1,0,very hot,1
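
For comparison, the same idea can be sketched with plain pandas. This is a minimal sketch, assuming `pd.get_dummies` is an acceptable stand-in for `ce.OneHotEncoder`; the `cities` frame is a small illustrative copy of the `city` column, and the column names it produces (`city_Delhi`, ...) differ from the `city_1`, ... names shown above.

```python
import pandas as pd

# Small illustrative frame with the same city values as the DataFrame above.
cities = pd.DataFrame({"city": ["Delhi", "Mumbai", "Kolkata", "Delhi", "Kolkata", "Chennai"]})

# One indicator column per distinct category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(cities, columns=["city"], dtype=int)
print(one_hot)
```

Each row has exactly one 1 across the four `city_*` columns, which is the defining property of a one-hot vector.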


## **2. Binary Encoding**

* Binary encoding converts a category into binary digits. Each binary digit creates one feature column.

In [4]:
ce_be = ce.BinaryEncoder(cols=['city'])
df_binary = ce_be.fit_transform(df['city'])
df_binary

Unnamed: 0,city_0,city_1,city_2
0,0,0,1
1,0,1,0
2,0,1,1
3,0,0,1
4,0,1,1
5,1,0,0
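
The two steps behind binary encoding (integer code first, binary digits second) can be sketched by hand. This is an illustrative sketch rather than the library's implementation; it assumes codes are assigned in order of first appearance starting at 1, which happens to reproduce the table above for this data.

```python
import pandas as pd

city = pd.Series(["Delhi", "Mumbai", "Kolkata", "Delhi", "Kolkata", "Chennai"], name="city")

# Step 1: give each category an integer code in order of first appearance, starting at 1.
codes = pd.Series(pd.factorize(city)[0] + 1, index=city.index)

# Step 2: write each code in binary, one column per bit (most significant bit first).
n_bits = int(codes.max()).bit_length()
binary = pd.DataFrame(
    {f"city_{i}": (codes // 2 ** (n_bits - 1 - i)) % 2 for i in range(n_bits)}
)
print(binary)
```

Four categories need only three columns here, whereas one-hot encoding would need four; the saving grows with the number of categories.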


## **3. Label Encoding**
* In label encoding, each category is assigned an integer from 0 through N-1, where N is the number of categories for the feature (scikit-learn's `LabelEncoder` numbers the sorted categories this way). The integers carry no meaningful relation or order between categories.

In [5]:
from sklearn.preprocessing import LabelEncoder
# LabelEncoder takes no arguments
le = LabelEncoder()
for col in df.columns:
    encoded = le.fit_transform(df[col])
    print(f"{col} : {encoded}")

gender : [1 0 1 0 0 1]
marital_status : [1 2 1 2 0 2]
city : [1 3 2 1 2 0]
temperature : [2 0 4 1 3 1]
target : [1 0 0 1 1 1]
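
To see which integer belongs to which category, the fitted encoder can be inspected. A minimal sketch, assuming only the `city` values from the DataFrame above; `mapping` and `decoded` are names introduced here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

city = ["Delhi", "Mumbai", "Kolkata", "Delhi", "Kolkata", "Chennai"]

le = LabelEncoder()
encoded = le.fit_transform(city)

# classes_ holds the sorted categories; a category's position is its integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

The codes follow alphabetical order of the labels, which is why they should not be read as a ranking.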


## **4. Ordinal Encoding**
 
* Ordinal encoding retains the ordinal (ordered) nature of the variable in the encoded values. It looks similar to label encoding; the difference is that label encoding assigns a sequence of integers without considering whether the variable is ordinal.

* Example: Ordinal encoding will assign values as Very Good(1) < Good(2) < Bad(3) < Worse(4)

* First, we need to specify the order of the variable's values through a dictionary.

In [6]:
temp_dict = {'very cold': 1,'cold': 2,'warm': 3,'hot': 4,"very hot":5}
df["temp_ordinal"] = df.temperature.map(temp_dict)
df

Unnamed: 0,gender,marital_status,city,temperature,target,temp_ordinal
0,Male,Married,Delhi,very cold,1,1
1,Female,Single,Mumbai,cold,0,2
2,Male,Married,Kolkata,warm,0,3
3,Female,Single,Delhi,hot,1,4
4,Female,MArried,Kolkata,very hot,1,5
5,Male,Single,Chennai,hot,1,4
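
The same ordering can also be expressed with an ordered pandas categorical instead of a dictionary. This is a sketch of an alternative, not the method used in the cell above; the `order` list is written out by hand and must match the intended ranking.

```python
import pandas as pd

temperature = pd.Series(["very cold", "cold", "warm", "hot", "very hot", "hot"])

# Declare the order explicitly, from lowest to highest.
order = ["very cold", "cold", "warm", "hot", "very hot"]
temp_cat = pd.Categorical(temperature, categories=order, ordered=True)

# .codes is 0-based, so add 1 to match the dictionary mapping above.
temp_ordinal = pd.Series(temp_cat.codes + 1, index=temperature.index)
print(temp_ordinal.tolist())
```

An ordered categorical also supports comparisons such as `temp_cat < "warm"`, which a plain integer column does not document on its own.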


## **5. Frequency Encoding**
* Each category is replaced by its relative frequency, i.e. the share of rows in which that category appears.

In [7]:
df["temperature_fe"] = df["temperature"].map(df.groupby("temperature").size()/len(df)).round(2)
df[['temperature', 'temperature_fe']]

Unnamed: 0,temperature,temperature_fe
0,very cold,0.17
1,cold,0.17
2,warm,0.17
3,hot,0.33
4,very hot,0.17
5,hot,0.33
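
An equivalent way to get the same frequencies is `value_counts(normalize=True)`. A minimal sketch on a standalone copy of the `temperature` values; `freq` and `temperature_fe` are illustrative names.

```python
import pandas as pd

temperature = pd.Series(["very cold", "cold", "warm", "hot", "very hot", "hot"])

# Relative frequency of each category (count divided by the number of rows).
freq = temperature.value_counts(normalize=True)

# Replace every value with the frequency of its category.
temperature_fe = temperature.map(freq).round(2)
print(temperature_fe.tolist())
```

Note that categories with equal frequency ("very cold", "cold", "warm", "very hot") become indistinguishable after this encoding.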


## **6. Helmert Encoding**
* Helmert encoding is a third commonly used type of categorical encoding for regression, along with one-hot encoding (OHE) and sum encoding.

* As implemented in category_encoders (a variant sometimes called reverse Helmert coding), it compares the mean of the dependent variable for each level with the mean over all previous levels.

* This type of encoding can be useful when the levels of the categorical variable are ordered, which is not the case for `city` in this dataset.

In [8]:
from category_encoders.helmert import HelmertEncoder
HE_encoder = HelmertEncoder(cols=['city'])
df_he = HE_encoder.fit_transform(df['city'])
df_he



Unnamed: 0,intercept,city_0,city_1,city_2
0,1,-1.0,-1.0,-1.0
1,1,1.0,-1.0,-1.0
2,1,0.0,2.0,-1.0
3,1,-1.0,-1.0,-1.0
4,1,0.0,2.0,-1.0
5,1,0.0,0.0,3.0
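
The contrast columns in the table above can be rebuilt by hand to see the pattern. This is a sketch under the assumption that levels are numbered in order of first appearance; `reverse_helmert_matrix` is a hypothetical helper written for this example, not a category_encoders function, and it omits the `intercept` column.

```python
import numpy as np
import pandas as pd

def reverse_helmert_matrix(n_levels):
    """Contrast matrix with one row per level and n_levels - 1 columns."""
    contrasts = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        contrasts[: j + 1, j] = -1   # all earlier levels get -1
        contrasts[j + 1, j] = j + 1  # the level being compared gets j + 1
    return contrasts

city = pd.Series(["Delhi", "Mumbai", "Kolkata", "Delhi", "Kolkata", "Chennai"])
codes, levels = pd.factorize(city)  # levels in order of first appearance
matrix = reverse_helmert_matrix(len(levels))

# Look up each row's contrast vector by its level code.
encoded = pd.DataFrame(matrix[codes], columns=[f"city_{j}" for j in range(len(levels) - 1)])
print(encoded)
```

Every column sums to zero over the levels, so each one contrasts a single level against the levels that came before it.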


In [9]:
df

Unnamed: 0,gender,marital_status,city,temperature,target,temp_ordinal,temperature_fe
0,Male,Married,Delhi,very cold,1,1,0.17
1,Female,Single,Mumbai,cold,0,2,0.17
2,Male,Married,Kolkata,warm,0,3,0.17
3,Female,Single,Delhi,hot,1,4,0.33
4,Female,MArried,Kolkata,very hot,1,5,0.17
5,Male,Single,Chennai,hot,1,4,0.33


## **7. Mean Encoding**
* Each category is replaced by the mean of the target variable for that category; for a binary target this is the share of rows with the positive class.
* Example: the value "Kolkata" occurs 2 times, and 1 of those rows has the positive label, so the mean encoding for "Kolkata" is 1/2 = 0.5.

In [10]:
df.groupby(['city'])['target'].mean()

city
Chennai    1.0
Delhi      1.0
Kolkata    0.5
Mumbai     0.0
Name: target, dtype: float64
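
To turn these per-city means into an encoded column, they can be broadcast back to the rows. A minimal sketch on a standalone frame named `data` (an illustrative copy of two columns); in practice the means would normally be computed on training data only, since they are derived from the target.

```python
import pandas as pd

data = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Kolkata", "Delhi", "Kolkata", "Chennai"],
    "target": [1, 0, 0, 1, 1, 1],
})

# Mean of the target within each city, broadcast back to every row.
data["city_mean"] = data.groupby("city")["target"].transform("mean")
print(data)
```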

## **8. Weight of Evidence Encoding**
* Weight of evidence (WoE) encoding replaces each category with its weight of evidence, a measure of how strongly the category separates the positive class from the negative class. WoE has been used primarily in the financial sector to create credit risk scorecards. The cell below uses `ce.WOEEncoder` from category_encoders.

* The encoder encodes only categorical variables (object type). A list of variables can be passed as an argument (`cols`); if none is passed, the encoder finds and encodes all categorical (object-type) variables.

* During `fit`, the encoder learns the weight of evidence of each category of each variable. During `transform`, it replaces the categories with those learned values.

In [11]:
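# Encode only the categorical (object-type) feature columns; the target is passed separately as y
columns = [col for col in df.columns if df[col].dtype == 'object']
woe_encoder = ce.WOEEncoder(cols=columns)
woe_encoded_train = woe_encoder.fit_transform(df[columns], df['target']).add_suffix('_woe')
# Attach the encoded columns next to the original ones
train_features = df.join(woe_encoded_train)
woe_encoded_cols = woe_encoded_train.columns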
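
To make the quantity concrete, WoE for one column can be computed by hand as the log of the ratio between a category's share of positives and its share of negatives. This is a sketch only: the additive `smoothing` term and its value are assumptions of this example, added so that categories with no positives or no negatives stay finite, and the library's exact formula may differ.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Kolkata", "Delhi", "Kolkata", "Chennai"],
    "target": [1, 0, 0, 1, 1, 1],
})

smoothing = 1.0  # assumed value; avoids log(0) and division by zero

stats = data.groupby("city")["target"].agg(["sum", "count"])
pos = stats["sum"]                   # positives per category
neg = stats["count"] - stats["sum"]  # negatives per category

# Share of all positives (negatives) that fall in each category, smoothed.
pos_share = (pos + smoothing) / (pos.sum() + 2 * smoothing)
neg_share = (neg + smoothing) / (neg.sum() + 2 * smoothing)

woe = np.log(pos_share / neg_share)
data["city_woe"] = data["city"].map(woe)
print(woe)
```

A positive value means the category is over-represented among positive rows, and a negative value means it is over-represented among negative rows.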

## **9. Probability Ratio Encoding**
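* Probability ratio encoding measures the predictive power of an independent variable in relation to the dependent variable using only the ratio of the good and bad probabilities: each category is replaced by P(target = 1) / P(target = 0) within that category. Categories with no negative rows get `inf`, as in the output below.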

In [12]:
from sunbird.categorical_encoding import probability_ratio_encoding
df_prc = df.copy()
probability_ratio_encoding(df_prc, 'city', 'target')
df_prc

Unnamed: 0,gender,marital_status,city,temperature,target,temp_ordinal,temperature_fe
0,Male,Married,inf,very cold,1,1,0.17
1,Female,Single,0.0,cold,0,2,0.17
2,Male,Married,1.0,warm,0,3,0.17
3,Female,Single,inf,hot,1,4,0.33
4,Female,MArried,1.0,very hot,1,5,0.17
5,Male,Single,inf,hot,1,4,0.33
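
The values in the `city` column above can be reproduced with plain pandas. This is a sketch of the calculation, not the `sunbird` implementation; `data`, `p1`, `p0`, and `ratio` are illustrative names.

```python
import pandas as pd

data = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Kolkata", "Delhi", "Kolkata", "Chennai"],
    "target": [1, 0, 0, 1, 1, 1],
})

# P(target = 1) within each city, and its complement P(target = 0).
p1 = data.groupby("city")["target"].mean()
p0 = 1 - p1

# Ratio of the two probabilities; pandas returns inf where p0 is 0.
ratio = p1 / p0
data["city_ratio"] = data["city"].map(ratio)
print(data)
```

The infinite values would need handling (for example by smoothing or capping) before most models could use this column.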


## **10. Backward Difference Encoding**
* In backward difference encoding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of encoding may be useful for a nominal or an ordinal variable.

In [13]:
import category_encoders as ce
encoder = ce.BackwardDifferenceEncoder(cols=["temperature"])
encoder.fit_transform(df['temperature'], verbose=1)
encoder



BackwardDifferenceEncoder(cols=['temperature'],
                          mapping=[{'col': 'temperature',
                                    'mapping':     temperature_0  temperature_1  temperature_2  temperature_3
 1           -0.8           -0.6           -0.4           -0.2
 2            0.2           -0.6           -0.4           -0.2
 3            0.2            0.4           -0.4           -0.2
 4            0.2            0.4            0.6           -0.2
 5            0.2            0.4            0.6            0.8
-1            0.0            0.0            0.0            0.0
-2            0.0            0.0            0.0            0.0}])
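
The five-level mapping printed above follows a regular pattern that can be generated directly. A minimal sketch; `backward_difference_matrix` is a hypothetical helper written for this example, not part of category_encoders, and it leaves out the all-zero rows for the `-1` and `-2` codes.

```python
import numpy as np

def backward_difference_matrix(n_levels):
    """Contrast matrix with one row per level and n_levels - 1 columns."""
    contrasts = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        k = j + 1  # column j contrasts level k + 1 with level k (1-based)
        contrasts[:k, j] = -(n_levels - k) / n_levels  # levels up to k
        contrasts[k:, j] = k / n_levels                # levels after k
    return contrasts

matrix = backward_difference_matrix(5)
print(matrix)
```

Within each column the values step up by exactly 1 between level k and level k + 1, which is what makes the fitted coefficient a difference between adjacent levels.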

## **References:**
* https://www.kdnuggets.com/2021/05/deal-with-categorical-data-machine-learning.html
* Image link : https://www.kdnuggets.com/wp-content/uploads/garg_cat_variables_15.jpg