# Categorical Feature Transformation

Machine learning models have to handle different types of data: some features are continuous and others are categorical. The distinction is loosely comparable to the one
between regression and classification algorithms, except that it concerns the inputs rather than the output. Categorical data must be processed correctly, because a poor
encoding can degrade the performance of the machine learning models. We do not really have a choice here: categorical data, often text, has to be transformed into numeric
values that a model can consume. Most of the time, categorical data falls into three major classes: binary, nominal, and ordinal.

It is preferable to split the data into train and test sets first and then fit the continuous and categorical transformations on the training set only, so that no information
from the test set leaks into the encoding. We can choose different encoders for different features. The encoded output can be merged with the rescaled continuous data to train
the models. The transformations fitted on the training data are then applied to the test data before applying the trained models.
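
As a minimal sketch of this workflow, the snippet below learns a simple frequency mapping on the training rows only and then reuses it on the test rows. The toy data, the `Color_freq` column name and the choice of 0.0 for categories never seen during training are assumptions made for illustration.

```python
import pandas as pd

train = pd.DataFrame({'Color': ['red', 'green', 'red', 'black', 'red', 'green']})
test = pd.DataFrame({'Color': ['green', 'red', 'white']})  # 'white' never appears in train

# Learn the encoding from the training data only
color_frequency = train['Color'].value_counts(normalize=True)

# Apply the same mapping to both sets
train['Color_freq'] = train['Color'].map(color_frequency)
# Categories unseen during training have no entry in the mapping;
# filling them with 0.0 is an arbitrary choice for this sketch
test['Color_freq'] = test['Color'].map(color_frequency).fillna(0.0)

print(test)
```

The point of the design is that the test set never influences the mapping, so the encoder also has to decide what to do with categories it has not seen.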

There are different coding systems for categorical variables: the classic encoders, which are well known and widely used (ordinal, one hot, binary, frequency, hashing); the
contrast encoders, which encode data by looking at the different categories (or levels) of a feature, such as Helmert or backward difference; and the Bayesian encoders, which use
the target as a foundation for encoding. Target, leave one out, weight of evidence, James-Stein and m-estimator are Bayesian encoders. Even though this is already a good list of
encoders to explore, there are many more! The important thing is to master a couple of them and then explore further.

In [1]:
# import necessary libraries
import os
import numpy as np
import pandas as pd

# Ordinal Encoding

The easiest way to encode ordinal data is to assign each category an integer value (integer encoding). For example, if we have a variable “size”, we can assign 0 to “small”, 1 to “medium” and 2 to “large”. Integer encoding is easily reversible. Ordinal encoding should be applied only when there is a known order between the categories. With pandas, we can define the order of the variable in a dictionary and then map each row of the variable through that dictionary.

In [2]:
# with pandas
data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black'],
        'Class': [1, 1, 1, 0, 1, 0, 0, 1]}

df = pd.DataFrame(data, columns = ['Size', 'Color', 'Class'])
df

Unnamed: 0,Size,Color,Class
0,small,red,1
1,small,green,1
2,large,black,1
3,medium,white,0
4,large,blue,1
5,large,red,0
6,small,green,0
7,medium,black,1
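
The dictionary-based mapping described above can be sketched as follows. The `size_order` dictionary and the `Size_encoded` and `Size_decoded` column names are illustrative choices, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'Size': ['small', 'small', 'large', 'medium',
                            'large', 'large', 'small', 'medium']})

# Explicit order of the categories: small < medium < large
size_order = {'small': 0, 'medium': 1, 'large': 2}
df['Size_encoded'] = df['Size'].map(size_order)

# Integer encoding is reversible: invert the dictionary to decode
inverse_order = {code: label for label, code in size_order.items()}
df['Size_decoded'] = df['Size_encoded'].map(inverse_order)

print(df)
```

Writing the order by hand keeps the integers meaningful, which an automatically derived order does not guarantee.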


In [3]:
# with scikit-learn
from sklearn.preprocessing import OrdinalEncoder

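# Creating an instance of OrdinalEncoder
# Note: by default the categories are sorted alphabetically (large=0, medium=1, small=2);
# pass the categories parameter (one ordered list per column) to impose a meaningful order
enc = OrdinalEncoder()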

# Assigning numerical value and storing it
enc.fit(df[["Size","Color"]])
df[["Size","Color"]] = enc.transform(df[["Size","Color"]])
df

Unnamed: 0,Size,Color,Class
0,2.0,3.0,1
1,2.0,2.0,1
2,0.0,0.0,1
3,1.0,4.0,0
4,0.0,1.0,1
5,0.0,3.0,0
6,2.0,2.0,0
7,1.0,0.0,1


# One Hot Encoding

One Hot Encoding is very popular. It maps each category to a vector containing 1 (presence) and 0 (absence), and it is applied when no order relationship exists between the categories.
For each category of a feature, we create a new binary column (sometimes called a dummy variable) where 1 denotes that a row belongs to this category and 0 that it does not. This method can be challenging if the categorical variable takes many values, and it is preferable to avoid it for variables with more than about 15 distinct values. The drawback is the size of the encoded variable in memory: it uses as many columns as there are categories, so the required memory grows linearly with the number of categories. Creating many columns can also slow down the learning significantly.

In [4]:
data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black'],
        'Class': [1, 1, 1, 0, 1, 0, 0, 1]}

df = pd.DataFrame(data, columns = ['Size', 'Color', 'Class'])
df

Unnamed: 0,Size,Color,Class
0,small,red,1
1,small,green,1
2,large,black,1
3,medium,white,0
4,large,blue,1
5,large,red,0
6,small,green,0
7,medium,black,1


In [5]:
# with pandas
df = pd.get_dummies(df, prefix="One",columns=['Size', 'Color'])
df

Unnamed: 0,Class,One_large,One_medium,One_small,One_black,One_blue,One_green,One_red,One_white
0,1,0,0,1,0,0,0,1,0
1,1,0,0,1,0,0,1,0,0
2,1,1,0,0,1,0,0,0,0
3,0,0,1,0,0,0,0,0,1
4,1,1,0,0,0,1,0,0,0
5,0,1,0,0,0,0,0,1,0
6,0,0,0,1,0,0,1,0,0
7,1,0,1,0,1,0,0,0,0


In [6]:
# with scikit-learn

from sklearn.preprocessing import OneHotEncoder

data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black'],
        'Class': [1, 1, 1, 0, 1, 0, 0, 1]}

df = pd.DataFrame(data, columns = ['Size', 'Color', 'Class'])
df

enc = OneHotEncoder(handle_unknown='ignore')
enc_df = pd.DataFrame(enc.fit_transform(df[['Size','Color']]).toarray())
df = df.join(enc_df)
df




Unnamed: 0,Size,Color,Class,0,1,2,3,4,5,6,7
0,small,red,1,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,small,green,1,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,large,black,1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,medium,white,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,large,blue,1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
5,large,red,0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
6,small,green,0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
7,medium,black,1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0


# Label Encoding

With Label Encoding, we replace each category with a numeric value from 0 to N-1, where N is the number of categories of the feature. If the feature contains 5 categories, we will use 0, 1, 2, 3, and 4. This approach can cause a major issue: even if there is no relation or order between the categories, the algorithm might interpret the integers as an order or relationship.

In [7]:
from sklearn.preprocessing import LabelEncoder

data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black'],
        'Class': [1, 1, 1, 0, 1, 0, 0, 1]}

df = pd.DataFrame(data, columns = ['Size', 'Color', 'Class'])
df

# Creating an instance of Labelencoder
enc = LabelEncoder()

# Assigning numerical value and storing it
df[["Size","Color"]] = df[["Size","Color"]].apply(enc.fit_transform)

df

Unnamed: 0,Size,Color,Class
0,2,3,1
1,2,2,1
2,0,0,1
3,1,4,0
4,0,1,1
5,0,3,0
6,2,2,0
7,1,0,1


# Helmert Encoding

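Helmert Encoding compares each level of a categorical variable to the mean of the subsequent levels. The HelmertEncoder from the category_encoders library used below implements the reverse form, in which each level is compared to the mean of the previous levels.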

In [8]:
data = {'Size': ['small', 'small', 'small', 'small', 'medium', 'medium', 'medium', 'large','large', 'x-large']}
df = pd.DataFrame(data, columns = ['Size'])
df

Unnamed: 0,Size
0,small
1,small
2,small
3,small
4,medium
5,medium
6,medium
7,large
8,large
9,x-large


In [9]:
import category_encoders as ce
enc = ce.HelmertEncoder()
df = enc.fit_transform(df['Size'])
df

Unnamed: 0,intercept,Size_0,Size_1,Size_2
0,1,-1.0,-1.0,-1.0
1,1,-1.0,-1.0,-1.0
2,1,-1.0,-1.0,-1.0
3,1,-1.0,-1.0,-1.0
4,1,1.0,-1.0,-1.0
5,1,1.0,-1.0,-1.0
6,1,1.0,-1.0,-1.0
7,1,0.0,2.0,-1.0
8,1,0.0,2.0,-1.0
9,1,0.0,0.0,3.0
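
The pattern in the output above can be reproduced by hand. The sketch below builds the contrast matrix that appears to sit behind these columns (the reverse Helmert scheme). `reverse_helmert_contrasts` is a hypothetical helper, and the level order (small, medium, large, x-large, in order of first appearance) is inferred from the output rather than taken from the library's documentation.

```python
import numpy as np

def reverse_helmert_contrasts(n_levels):
    """Return one row per level and one column per contrast."""
    contrasts = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        contrasts[: j + 1, j] = -1.0  # the levels that come before level j + 1
        contrasts[j + 1, j] = j + 1   # level j + 1 itself
    return contrasts

levels = ['small', 'medium', 'large', 'x-large']
contrasts = reverse_helmert_contrasts(len(levels))

for level, row in zip(levels, contrasts):
    print(level, row)
```

Each column of this matrix sums to zero and the columns are mutually orthogonal, which is what makes the scheme a set of contrasts rather than plain dummy variables.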


# Binary Encoding

The Binary Encoding method consists of three operations: the categories are encoded as ordinal integers, the resulting integers are converted into binary code, and finally the digits of that binary code are split into separate columns. This process results in fewer dimensions than One Hot Encoding. As with Helmert Encoding, we can use the category_encoders library. We instantiate BinaryEncoder, specifying the columns we want to encode, and then call its .fit_transform() method with the DataFrame as the argument.

In [10]:
import category_encoders as ce

data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black'],
        'Class': [1, 1, 1, 0, 1, 0, 0, 1]}


df = pd.DataFrame(data, columns = ['Size', 'Color', 'Class'])

enc = ce.BinaryEncoder(cols=['Color','Size'])
df_binary = enc.fit_transform(df)

df_binary

Unnamed: 0,Size_0,Size_1,Color_0,Color_1,Color_2,Class
0,0,1,0,0,1,1
1,0,1,0,1,0,1
2,1,0,0,1,1,1
3,1,1,1,0,0,0
4,1,0,1,0,1,1
5,1,0,0,0,1,0
6,0,1,0,1,0,0
7,1,1,0,1,1,1
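
The three steps can also be written out by hand. The sketch below is an illustration, not the library's implementation: `binary_encode` is a hypothetical helper, and numbering the categories from 1 in order of first appearance is an assumption inferred from the output above.

```python
import pandas as pd

def binary_encode(series):
    # Step 1: ordinal codes, in order of first appearance, starting at 1
    levels = list(dict.fromkeys(series))
    codes = {level: i + 1 for i, level in enumerate(levels)}
    # Step 2: number of binary digits needed for the largest code
    width = max(codes.values()).bit_length()
    # Step 3: one column per binary digit, most significant digit first
    rows = [[int(bit) for bit in format(codes[value], f"0{width}b")]
            for value in series]
    columns = [f"{series.name}_{i}" for i in range(width)]
    return pd.DataFrame(rows, columns=columns, index=series.index)

colors = pd.Series(['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black'],
                   name='Color')
encoded = binary_encode(colors)
print(encoded)
```

Five colors need only three columns here, whereas One Hot Encoding needs five.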


# Frequency Encoding

The Frequency Encoding method replaces each category with how often it occurs in the data: we create a new feature holding the count of each category or, as in the example below, its relative frequency (the count divided by the number of rows).

In [11]:
data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black'],
        'Class': [1, 1, 1, 0, 1, 0, 0, 1]}


df = pd.DataFrame(data, columns = ['Size', 'Color', 'Class'])

frequency = df.groupby('Color').size()/len(df)
df.loc[:,'Frequency'] = df['Color'].map(frequency)
df

Unnamed: 0,Size,Color,Class,Frequency
0,small,red,1,0.25
1,small,green,1,0.25
2,large,black,1,0.25
3,medium,white,0,0.125
4,large,blue,1,0.125
5,large,red,0,0.25
6,small,green,0,0.25
7,medium,black,1,0.25


# Mean Encoding

In this method, each unique value of the categorical feature is replaced by the mean of the target for that value. With a binary target, this is the proportion of rows in the category that belong to the positive class.

In [12]:
data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black', 'red', 'red'],
        'Target': [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]}


df = pd.DataFrame(data, columns = ['Size', 'Color', 'Target'])
df

Unnamed: 0,Size,Color,Target
0,small,red,1
1,small,green,1
2,large,black,1
3,medium,white,0
4,large,blue,1
5,large,red,0
6,small,green,0
7,medium,black,1
8,small,red,1
9,medium,red,0


In [13]:
from category_encoders.target_encoder import TargetEncoder

#TE_encoder = TargetEncoder(drop_invariant=True)
#df = TE_encoder.fit_transform(df['Color'], df['Target'])
#df

mean_encoding = df.groupby('Color')['Target'].mean()
print(mean_encoding)
df.loc[:,'Mean_encoding'] = df['Color'].map(mean_encoding)
df



Color
black    1.0
blue     1.0
green    0.5
red      0.5
white    0.0
Name: Target, dtype: float64


Unnamed: 0,Size,Color,Target,Mean_encoding
0,small,red,1,0.5
1,small,green,1,0.5
2,large,black,1,1.0
3,medium,white,0,0.0
4,large,blue,1,1.0
5,large,red,0,0.5
6,small,green,0,0.5
7,medium,black,1,1.0
8,small,red,1,0.5
9,medium,red,0,0.5


# Sum Encoding

Sum encoding, also called effect or deviation encoding, compares the mean of the target (dependent variable) for a given level of a categorical column to the overall mean of the target. It’s like One Hot Encoding, with the difference that we use the values 1, 0 and -1 to encode the data. It can be used in Linear Regression types of models, and it can be coded with the category_encoders library.

In [14]:
from category_encoders import SumEncoder

data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black', 'red', 'red'],
        'Target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]}


df = pd.DataFrame(data, columns = ['Size', 'Color', 'Target'])

sum_encoder = SumEncoder()
df_encoded = sum_encoder.fit_transform(df['Size'], df['Target'])

df_encoded

Unnamed: 0,intercept,Size_0,Size_1
0,1,1.0,0.0
1,1,1.0,0.0
2,1,0.0,1.0
3,1,-1.0,-1.0
4,1,0.0,1.0
5,1,0.0,1.0
6,1,1.0,0.0
7,1,-1.0,-1.0
8,1,1.0,0.0
9,1,-1.0,-1.0
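
The 1, 0 and -1 pattern in the output above corresponds to a small contrast matrix, sketched below. `sum_contrasts` is a hypothetical helper, and the level order (small, large, medium, in order of first appearance) is inferred from the output.

```python
import numpy as np

def sum_contrasts(n_levels):
    """Identity rows for the first n_levels - 1 levels, a row of -1 for the last."""
    identity = np.eye(n_levels - 1)
    last_level = -np.ones((1, n_levels - 1))
    return np.vstack([identity, last_level])

levels = ['small', 'large', 'medium']
contrasts = sum_contrasts(len(levels))
encoding = dict(zip(levels, contrasts.tolist()))

print(encoding)
```

The last level gets -1 in every column, so each column sums to zero across the levels.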


# Weight of Evidence

The Weight of Evidence (WoE) comes from the credit scoring world, where it measures the “strength” of a grouping technique at separating good customers from bad customers, that is, customers who repaid a loan from those who defaulted. In the context of machine learning, WoE is also used to replace categorical values. With One Hot Encoding, a column containing 5 unique labels produces 5 new columns; here, we instead replace each value with its WoE in a single column. This method is particularly well suited for subsequent modeling with Logistic Regression, because the WoE transformation orders the categories on a “logistic” scale, which is natural for logistic regression.

In [15]:
from category_encoders import WOEEncoder

data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black', 'red', 'red'],
        'Target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]}


df = pd.DataFrame(data, columns = ['Size', 'Color', 'Target'])

# regularization is mostly to prevent division by zero.
woe = WOEEncoder(random_state=42, regularization=0)
df_encoded = woe.fit_transform(df['Size'], df['Target'])
df_encoded


Unnamed: 0,Size
0,0.0
1,0.0
2,0.693147
3,-0.693147
4,0.693147
5,0.693147
6,0.0
7,-0.693147
8,0.0
9,-0.693147
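
The values above can be checked by hand. The sketch below assumes the usual definition of WoE, the logarithm of the ratio between a category's share of all positive rows and its share of all negative rows, with no regularization. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Size': ['small', 'small', 'large', 'medium', 'large',
             'large', 'small', 'medium', 'small', 'medium'],
    'Target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
})

# Number of positive and negative rows in each category
positives = df.groupby('Size')['Target'].sum()
negatives = df.groupby('Size')['Target'].count() - positives

# Share of all positives (resp. negatives) that falls in each category
share_of_positives = positives / positives.sum()
share_of_negatives = negatives / negatives.sum()

woe = np.log(share_of_positives / share_of_negatives)
print(woe)
```

A category with no positive or no negative rows would make this ratio zero or infinite, which is the situation the regularization parameter is meant to avoid.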


# Probability Ratio Encoding

Probability Ratio Encoding is similar to WoE, but we keep only the ratio, not its logarithm. For each category, we calculate the mean of the target, which is the probability p(1) of the target being 1, and the probability p(0) = 1 - p(1) of the target being 0. The ratio of happening to not happening is simply p(1)/p(0). All the categorical values are replaced with this ratio. When p(0) is 0, the code below substitutes a small value (0.00001) to avoid dividing by zero, which explains the very large ratios in the output.

In [16]:
data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black', 'red', 'red'],
        'Target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]}


df = pd.DataFrame(data, columns = ['Size', 'Color', 'Target'])

# Calculation of the probability of target being 1
probability_encoding_1 = df.groupby('Color')['Target'].mean()
# Calculation of the probability of target not being 1
probability_encoding_0 = 1 - probability_encoding_1
probability_encoding_0 = np.where(probability_encoding_0 == 0, 0.00001, probability_encoding_0)
# Probability ratio calculation
df_encoded = probability_encoding_1 / probability_encoding_0
# Map the probability ratio into the data
df.loc[:,'Proba_Ratio'] = df['Color'].map(df_encoded)
df

Unnamed: 0,Size,Color,Target,Proba_Ratio
0,small,red,1,1.0
1,small,green,0,0.0
2,large,black,1,100000.0
3,medium,white,0,0.0
4,large,blue,1,100000.0
5,large,red,0,1.0
6,small,green,0,0.0
7,medium,black,1,100000.0
8,small,red,1,1.0
9,medium,red,0,1.0


# Hashing Encoding

Hashing encoding is similar to One Hot Encoding in that it converts the categories into binary values spread over new variables. The difference is that we can fix the number of new variables we want. Hashing encoding maps each category to an integer within a pre-defined range with the help of a hash function. We can use different hashing methods through the hash_method option; any method from hashlib works (import hashlib). This is defined in inputs.py (hash_method). We also need to choose the number of components (n_components in inputs.py).
If we want 4 binary features, we can write the hash output in binary and select the last 4 bits.

In [17]:
data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black', 'red', 'red'],
        'Target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]}


df = pd.DataFrame(data, columns = ['Size', 'Color', 'Target'])

import category_encoders as ce
# n_components contains the number of bits you want in your hash value.
encoder_purpose = ce.HashingEncoder(n_components=3, hash_method="sha256")
# Converting the feature "Size"
df_encoded = encoder_purpose.fit_transform(df['Size'])
df_encoded

Unnamed: 0,col_0,col_1,col_2
0,0,1,0
1,0,1,0
2,1,0,0
3,0,1,0
4,1,0,0
5,1,0,0
6,0,1,0
7,0,1,0
8,0,1,0
9,0,1,0
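
The idea of mapping a category to a fixed number of columns through a hash function can be sketched with hashlib alone. This is an illustration of the general hashing trick under simple assumptions (one input column, the hash value taken modulo the number of columns), not a reproduction of the library's exact behavior; `hash_encode` is a hypothetical helper.

```python
import hashlib

def hash_encode(values, n_components=3, hash_method="sha256"):
    rows = []
    for value in values:
        # Hash the category as text and read the digest as one large integer
        digest = hashlib.new(hash_method, str(value).encode("utf-8")).hexdigest()
        # Fold that integer into the range [0, n_components)
        column = int(digest, 16) % n_components
        row = [0] * n_components
        row[column] = 1
        rows.append(row)
    return rows

sizes = ['small', 'small', 'large', 'medium', 'large']
encoded = hash_encode(sizes)
for size, row in zip(sizes, encoded):
    print(size, row)
```

Because the number of columns is fixed in advance, distinct categories can land in the same column, as small and medium do in the output above. That collision is the price paid for the bounded dimensionality.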


# Backward Difference Encoding

In the backward difference encoding method, which is similar to Helmert encoding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. Backward difference encoding belongs to the contrast encoders for categorical features. It may be useful for both nominal and ordinal variables. In addition, contrary to the dummy encoding examples, the outputs are continuous contrast values rather than only 0 and 1.

In [18]:
data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black', 'red', 'red'],
        'Target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]}


df = pd.DataFrame(data, columns = ['Size', 'Color', 'Target'])

import category_encoders as ce
encoder = ce.BackwardDifferenceEncoder(cols=['Size'])
df_encoded = encoder.fit_transform(df['Size'])
df_encoded

Unnamed: 0,intercept,Size_0,Size_1
0,1,-0.666667,-0.333333
1,1,-0.666667,-0.333333
2,1,0.333333,-0.333333
3,1,0.333333,0.666667
4,1,0.333333,-0.333333
5,1,0.333333,-0.333333
6,1,-0.666667,-0.333333
7,1,0.333333,0.666667
8,1,-0.666667,-0.333333
9,1,0.333333,0.666667
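
The fractional values in the output above come from a fixed contrast matrix that depends only on the number of levels. The sketch below rebuilds it; `backward_difference_contrasts` is a hypothetical helper, and the level order (small, large, medium, in order of first appearance) is inferred from the output.

```python
import numpy as np

def backward_difference_contrasts(n_levels):
    """Return one row per level and one column per comparison with the prior level."""
    contrasts = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        # Levels up to and including level j sit on the "prior" side of the comparison
        contrasts[: j + 1, j] = -(n_levels - j - 1) / n_levels
        # The remaining levels sit on the other side
        contrasts[j + 1 :, j] = (j + 1) / n_levels
    return contrasts

levels = ['small', 'large', 'medium']
contrasts = backward_difference_contrasts(len(levels))

for level, row in zip(levels, contrasts):
    print(level, np.round(row, 6))
```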


# Leave One Out Encoder

The target-based Leave One Out encoder excludes the current row’s target when calculating the mean target for a level, in order to reduce the effect of outliers. In other words, each row is encoded with the mean target value of all the other data points in the same category.

In [19]:
data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black', 'red', 'red'],
        'Target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]}


df = pd.DataFrame(data, columns = ['Size', 'Color', 'Target'])

import category_encoders as ce
encoder = ce.LeaveOneOutEncoder(cols=['Color'])
df_encoded = encoder.fit_transform(df['Color'], df['Target'])
df_encoded

Unnamed: 0,Color
0,0.333333
1,0.0
2,1.0
3,0.5
4,0.5
5,0.666667
6,0.0
7,1.0
8,0.333333
9,0.666667
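
The same numbers can be obtained with pandas alone. In the sketch below, falling back to the overall target mean for a category with a single row is an assumption inferred from the output above (white and blue both receive 0.5). The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['red', 'green', 'black', 'white', 'blue',
              'red', 'green', 'black', 'red', 'red'],
    'Target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
})

group = df.groupby('Color')['Target']
category_sum = group.transform('sum')      # target sum of the row's category
category_count = group.transform('count')  # number of rows in the row's category
global_mean = df['Target'].mean()

# Mean target of the other rows in the same category
leave_one_out = (category_sum - df['Target']) / (category_count - 1)
# A category with a single row has no other rows: use the overall mean instead
leave_one_out = leave_one_out.where(category_count > 1, global_mean)

print(leave_one_out)
```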


# James-Stein Encoder

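The target-based James-Stein encoder, which is only defined for normal distributions, is inspired by the James-Stein estimator. For the feature value i, the James-Stein estimator returns a weighted average of the mean target value for the observed feature value i and the mean target value regardless of the feature value.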

In [20]:
data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black', 'red', 'red'],
        'Target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]}


df = pd.DataFrame(data, columns = ['Size', 'Color', 'Target'])

import category_encoders as ce
encoder = ce.JamesSteinEncoder(cols=['Color'])
df_encoded = encoder.fit_transform(df['Color'], df['Target'])

df_encoded


Unnamed: 0,Color
0,0.5
1,0.0
2,1.0
3,0.0
4,1.0
5,0.5
6,0.0
7,1.0
8,0.5
9,0.5


# M-Estimator Encoding

M-estimator encoding, a more general Bayesian approach, has only one hyperparameter (m), which represents the strength of the regularization. It is generally good for high cardinality data.
The default value of m is 1. The recommended values are in the range of 1 to 100: the higher m is, the stronger the shrinking.

In [21]:
data = {'Size': ['small', 'small', 'large', 'medium', 'large', 'large', 'small', 'medium', 'small', 'medium'],
        'Color': ['red', 'green', 'black', 'white', 'blue', 'red', 'green', 'black', 'red', 'red'],
        'Target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]}


df = pd.DataFrame(data, columns = ['Size', 'Color', 'Target'])

import category_encoders as ce
encoder = ce.MEstimateEncoder(cols=['Color'])
df_encoded = encoder.fit_transform(df['Color'], df['Target'])
df_encoded

Unnamed: 0,Color
0,0.5
1,0.166667
2,0.833333
3,0.25
4,0.75
5,0.5
6,0.166667
7,0.833333
8,0.5
9,0.5
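
The output above is consistent with the usual m-estimate formula, in which each category mean is pulled toward the overall target mean with a weight of m pseudo-observations. The sketch below assumes that formula with m = 1. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['red', 'green', 'black', 'white', 'blue',
              'red', 'green', 'black', 'red', 'red'],
    'Target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
})

m = 1.0                       # strength of the shrinkage
prior = df['Target'].mean()   # overall mean of the target

# Sum of the target and number of rows for each category
stats = df.groupby('Color')['Target'].agg(['sum', 'count'])

# Category mean shrunk toward the prior
m_estimate = (stats['sum'] + m * prior) / (stats['count'] + m)
encoded = df['Color'].map(m_estimate)

print(encoded)
```

With a larger m, every category moves closer to the prior, and categories with few rows move the most.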
