In [27]:
import pandas as pd

# Presentation sections
### 1. One-Hot Encoding
#### 1.1 One-Hot Encoding
#### 1.2 One-Hot Encoding for Top K Categories

### 2. Probability-Statistics Encoding
#### 2.1 Ordinal / Label / Integer Encoding
#### 2.2 Count/Frequency Encoding

### 3. Target Encoding
#### 3.1 Ordered Ordinal Encoding
#### 3.2 Mean Encoding

# 1. One-Hot Encoding
One-hot encoding represents each category (level) of a categorical variable as its own binary column (0 or 1). A 1 in a column indicates that the observation belongs to that category; a 0 indicates that it does not.

Example:

For the categorical variable "color" with values 'red', 'blue' and 'green':
+ We can create 3 new variables: "red", "blue" and "green".
+ Each of these variables takes the value 1 if the observation has that color and 0 otherwise.

#### Advantages of one-hot encoding
+ Easy to implement.
+ Makes no assumptions about the categories or the distribution of the categorical variable.
+ Keeps all the information of the categorical variable.
+ Suitable for linear models.
#### Limitations
+ Expands the feature space.
+ Does not add any information while encoding.
+ Many dummy variables may be identical, introducing redundant information.
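
To make the "expands the feature space" limitation concrete, here is a minimal sketch. The column names `color` and `shirt_size` are hypothetical and only serve the illustration. It shows that the number of dummy columns equals the sum of the number of distinct categories in each encoded variable.

```python
import pandas as pd

# Hypothetical data: two categorical variables with 3 and 4 distinct values
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red'],
    'shirt_size': ['S', 'M', 'L', 'XL'],
})

# One dummy column is created per distinct category of each variable
encoded = pd.get_dummies(df).astype(int)

# Expected width: sum of the cardinalities of the encoded variables
n_expected = df['color'].nunique() + df['shirt_size'].nunique()

print(df.shape[1], '->', encoded.shape[1])
print(encoded.shape[1] == n_expected)
```

With many high-cardinality variables this growth can become substantial, which is the motivation for the top-categories variant in section 1.2.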

### 1.1 One-Hot Encoding

### Encode as k-1 dummy variables
For example, suppose you have a variable "Color" with three categories: "Green", "Red", and "Blue". If you encode it as $k-1$ dummy variables, you will have two dummy variables: "Color_Green" and "Color_Red". The variable "Color_Blue" is not included in the dataset because its value can be inferred from the values of the other two dummy variables.
+ If the observation is green, it will be captured by the variable "green" (green = 1, red = 0).
+ If the observation is red, it will be captured by the variable "red" (green = 0, red = 1).
+ If the observation is blue, it will be recorded as a combination of "green" and "red" (green = 0, red = 0).

Most machine learning algorithms consider the entire data set while being fitted. Therefore, encoding a categorical variable into $k-1$ binary variables is preferable because it avoids introducing redundant information.
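
The redundancy argument can be checked directly. The following is a minimal sketch, assuming the same "Color" data used elsewhere in this notebook: every row of the full $k$-column encoding sums to 1, so the dropped column can be rebuilt from the remaining $k-1$ columns.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Blue', 'Green', 'Red', 'Green', 'Blue']})

# Full encoding: one column per category (k = 3 columns)
full = pd.get_dummies(df['Color'], prefix='Color').astype(int)

# Reduced encoding: the first category in sorted order ('Blue') is dropped
reduced = pd.get_dummies(df['Color'], prefix='Color', drop_first=True).astype(int)

# Each row of the full encoding sums to 1, so the dropped column
# equals 1 minus the sum of the remaining k-1 columns.
recovered_blue = 1 - reduced.sum(axis=1)

print((full.sum(axis=1) == 1).all())
print((recovered_blue == full['Color_Blue']).all())
```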

In [28]:
# Example: one-hot encoding into k-1 dummy variables

# Create an example DataFrame
data = {'Color': ['Blue', 'Green', 'Red', 'Green', 'Blue']}
df = pd.DataFrame(data)

# Apply one-hot encoding with k-1 columns (drop the first category)
df_encoded = pd.get_dummies(df, drop_first=True).astype(int)

df = pd.concat((df, df_encoded), axis=1)
df

Unnamed: 0,Color,Color_Green,Color_Red
0,Blue,0,0
1,Green,1,0
2,Red,0,1
3,Green,1,0
4,Blue,0,0


### Encode as k dummy variables
In some cases it is better to encode a variable as $k$ dummy variables:
+ When building tree-based algorithms.
+ When performing feature selection with recursive algorithms.
+ When you want to determine the importance of each individual category.

Tree-based algorithms do not evaluate all features together during training; each split is chosen by looking at individual features, often from a subset of them. Therefore, if we want a tree algorithm to be able to consider every category, we need to encode the categorical variable into $k$ binary variables.

If we plan to perform feature selection by recursively removing or adding features, or if we want to evaluate the importance of each individual category of the variable, we also need the entire set of $k$ binary variables, so that the machine learning model can select the features with the best predictive ability.

In [29]:
# Example: one-hot encoding into k dummy variables

# Create an example DataFrame
data = {'Color': ['Blue', 'Green', 'Red', 'Green', 'Blue']}
df = pd.DataFrame(data)

# Apply one-hot encoding with k columns (keep every category)
df_encoded = pd.get_dummies(df).astype(int)

df = pd.concat((df, df_encoded), axis=1)
df

Unnamed: 0,Color,Color_Blue,Color_Green,Color_Red
0,Blue,1,0,0
1,Green,0,1,0
2,Red,0,0,1
3,Green,0,1,0
4,Blue,1,0,0


### 1.2 One-Hot Encoding for Top Categories
High cardinality and rare labels can result in some categories that:
+ Appear only in the training set, which can cause overfitting.
+ Appear only in the test set, in which case the model does not know how to score them.

**To avoid these problems, we can create dummy variables only for the most frequently occurring categories.**

#### Advantages of OHE top categories
+ Easy to deploy.
+ Doesn't require hours of variable exploration.
+ Does not expand the feature space much.
+ Suitable for linear models.
#### Limitations
+ Does not add any information that would make the variable more predictive.
+ Does not keep the information of the omitted labels.

In [30]:
import pandas as pd

# Create a sample DataFrame
data = {'Origin_Category': ['A', 'B', 'C', 'D', 'E', 'F', 'B', 'C', 'D']}
df = pd.DataFrame(data)

# Define a function to encode high cardinality categorical variable
def encode_high_cardinality(df, column, top_categories):
    top_categories_set = set(top_categories)
    df['Category'] = df[column].apply(lambda x: x if x in top_categories_set else 'Other')
    df_encoded = pd.get_dummies(df['Category'], prefix='Category')
    return df_encoded

# Identify the top categories based on frequency
top_categories = df['Origin_Category'].value_counts().nlargest(3).index.to_list()
print("Top 3 category:")
print(top_categories)

# Encode the high cardinality column
df_encoded = encode_high_cardinality(df, 'Origin_Category', top_categories).astype(int)

df = pd.concat((df, df_encoded), axis=1)
df

Top 3 category:
['B', 'C', 'D']


Unnamed: 0,Origin_Category,Category,Category_B,Category_C,Category_D,Category_Other
0,A,Other,0,0,0,1
1,B,B,1,0,0,0
2,C,C,0,1,0,0
3,D,D,0,0,1,0
4,E,Other,0,0,0,1
5,F,Other,0,0,0,1
6,B,B,1,0,0,0
7,C,C,0,1,0,0
8,D,D,0,0,1,0
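
The cell above picks the top categories from the full data set. As a minimal sketch of how this might be applied with separate training and test data (the `train`/`test` frames and the helper `encode_top` are hypothetical, introduced only for illustration), the top categories can be learned from the training set alone and reused on the test set, so that both end up with the same columns. Here a row of all zeros plays the role of the "Other" category.

```python
import pandas as pd

# Hypothetical split: 'Z' appears only in the test set
train = pd.DataFrame({'Origin_Category': ['A', 'B', 'C', 'D', 'B', 'C', 'B', 'C', 'B']})
test = pd.DataFrame({'Origin_Category': ['B', 'Z', 'C', 'A']})

# Learn the most frequent categories from the training data only
top_categories = train['Origin_Category'].value_counts().nlargest(2).index.tolist()

def encode_top(series, top_categories):
    """Create one 0/1 column per top category; all other labels get zeros."""
    return pd.DataFrame(
        {f'Category_{cat}': (series == cat).astype(int) for cat in top_categories},
        index=series.index,
    )

train_encoded = encode_top(train['Origin_Category'], top_categories)
test_encoded = encode_top(test['Origin_Category'], top_categories)

print(top_categories)
print(test_encoded)
```

Because the column list is fixed by `top_categories`, rare or unseen labels such as 'Z' never create new columns at prediction time.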


# 2. Probability-Statistics Encoding

### 2.1 Ordinal / Label / Integer Encoding
Integer encoding replaces the categories with integers from 1 to n (or 0 to n-1, depending on the implementation), where n is the number of distinct categories of the variable.

The numbers are assigned arbitrarily. This encoding method allows for rapid benchmarking of machine learning models.

#### Advantages:

+ Easy to deploy and does not increase the feature space.
+ Can work well with tree-based algorithms.
#### Limitations:
+ Does not add new information when encoding variables.
+ Cannot handle new categories in the test set that were not present in the training set.
+ Not suitable for linear models.

In [31]:
# Example: integer encoding

# Create an example DataFrame
data = {'Color': ['Blue', 'Green', 'Red', 'Green', 'Blue', 'White', 'Yellow']}
df = pd.DataFrame(data)

ordinal_mapping = {
    k: i
    for i, k in enumerate(df['Color'].unique(), 0)
}
print('Map category to integer:')
print(ordinal_mapping)

df['Color_encoding'] = df['Color'].map(ordinal_mapping)
df

Map category to integer:
{'Blue': 0, 'Green': 1, 'Red': 2, 'White': 3, 'Yellow': 4}


Unnamed: 0,Color,Color_encoding
0,Blue,0
1,Green,1
2,Red,2
3,Green,1
4,Blue,0
5,White,3
6,Yellow,4
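
The limitation about new categories can be seen with a small sketch. The `train`/`test` split is hypothetical, and reserving `-1` for unseen categories is an assumption made here for illustration, not something the method prescribes. `Series.map` returns `NaN` for any key missing from the mapping, so those rows need explicit handling.

```python
import pandas as pd

# Hypothetical split: 'Purple' is not in the training data
train = pd.DataFrame({'Color': ['Blue', 'Green', 'Red', 'Green']})
test = pd.DataFrame({'Color': ['Red', 'Purple']})

# Learn the mapping from the training data only
ordinal_mapping = {k: i for i, k in enumerate(train['Color'].unique())}

# Categories missing from the mapping become NaN
raw = test['Color'].map(ordinal_mapping)

# Assumption: reserve -1 as the code for unseen categories
test['Color_encoding'] = raw.fillna(-1).astype(int)

print(ordinal_mapping)
print(test)
```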


### 2.2 Count/Frequency Encoding
In count encoding, we replace each category with the number of observations that show that category in the data set.

Similarly, in frequency encoding we replace each category with the fraction (or percentage) of observations in the data set that show it.
#### Advantages
+ Simple.
+ Does not expand the feature space.
#### Limitations
+ May lead to overfitting.
+ If two different categories have the same number of observations in the data set, they are replaced by the same number, so valuable information can be lost.

In [32]:
# Example: count/frequency encoding

# Create an example DataFrame
data = {'Color': ['Blue', 'Green', 'Red', 'Green', 'Blue', 'White', 'Yellow']}
df = pd.DataFrame(data)

count_map = df['Color'].value_counts().to_dict()
frequency_map = (df['Color'].value_counts() / len(df)).to_dict()

df['Color_Count_Encoded'] = df['Color'].map(count_map)
df['Color_Frequency_Encoded'] = df['Color'].map(frequency_map)
df

Unnamed: 0,Color,Color_Count_Encoded,Color_Frequency_Encoded
0,Blue,2,0.285714
1,Green,2,0.285714
2,Red,1,0.142857
3,Green,2,0.285714
4,Blue,2,0.285714
5,White,1,0.142857
6,Yellow,1,0.142857
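
In the output above, 'Blue' and 'Green' both map to 2, and 'Red', 'White' and 'Yellow' all map to 1. A minimal sketch for spotting such collisions before relying on count encoding (the `collisions` dictionary is just an illustrative helper structure) could look like this:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Blue', 'Green', 'Red', 'Green', 'Blue', 'White', 'Yellow']})

counts = df['Color'].value_counts()

# Group category names by their count
groups = {}
for category, count in counts.items():
    groups.setdefault(int(count), []).append(category)

# Any count shared by more than one category is a collision:
# those categories become indistinguishable after encoding.
collisions = {c: sorted(cats) for c, cats in groups.items() if len(cats) > 1}

print(collisions)
```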


# 3. Target Encoding

### 3.1 Ordered Ordinal Encoding
Categories are replaced by integers from 1 to $k$ (or 0 to $k-1$), where $k$ is the number of distinct categories in the variable, but the numbering is informed by the mean of the target for each category: categories are ordered by their target mean before the integers are assigned.

#### Advantages
+ Straightforward to implement.
+ Does not expand the feature space.
+ Creates a monotonic relationship between the categories and the target.
#### Limitations
+ May lead to overfitting.

In [33]:
data = {'Color': ['Blue', 'Blue', 'Red', 'Green', 'Green'],
        'Target': [0, 1, 0, 1, 1]}
df = pd.DataFrame(data)

ordered_labels = df.groupby(['Color'])['Target'].mean().sort_values().to_dict()
ordinal_mapping = {k: i for i, k in enumerate(ordered_labels, 0)}

print("Ordinal map:")
print(ordinal_mapping)


df['Color_Ordinal_Encoded'] = df['Color'].map(ordinal_mapping)
df

Ordinal map:
{'Red': 0, 'Blue': 1, 'Green': 2}


Unnamed: 0,Color,Target,Color_Ordinal_Encoded
0,Blue,0,1
1,Blue,1,1
2,Red,0,0
3,Green,1,2
4,Green,1,2
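
The monotonic relationship mentioned in the advantages can be verified on the same toy data. This is a minimal sketch: it rebuilds the encoding and then checks that the mean of the target does not decrease as the encoded value increases.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Blue', 'Blue', 'Red', 'Green', 'Green'],
                   'Target': [0, 1, 0, 1, 1]})

# Order categories by their target mean, lowest first
ordered = df.groupby('Color')['Target'].mean().sort_values().index
ordinal_mapping = {k: i for i, k in enumerate(ordered)}
df['Color_Ordinal_Encoded'] = df['Color'].map(ordinal_mapping)

# Mean target per encoded value, sorted by the encoded value
mean_by_code = df.groupby('Color_Ordinal_Encoded')['Target'].mean().sort_index()

print(mean_by_code)
print(mean_by_code.is_monotonic_increasing)
```

The check holds by construction on the data used to build the mapping; on new data the relationship may not be preserved, which is one way the overfitting limitation shows up.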


### 3.2 Mean Encoding
Mean encoding replaces each category with the average target value for that category.
#### Advantages
+ Straightforward to implement.
+ Does not expand the feature space.
+ Creates a monotonic relationship between the categories and the target.
#### Limitations
+ May lead to overfitting.
+ If two categories have the same target mean, they are replaced by the same number, so valuable information can be lost.

In [34]:
data = {'Color': ['Blue', 'Blue', 'Red', 'Green', 'Green'],
        'Target': [0, 1, 0, 1, 1]}
df = pd.DataFrame(data)

mean_map = df.groupby(['Color'])['Target'].mean().to_dict()
print("Mean map:")
print(mean_map)

df['Color_Mean_Encoded'] = df['Color'].map(mean_map)
df

Mean map:
{'Blue': 0.5, 'Green': 1.0, 'Red': 0.0}


Unnamed: 0,Color,Target,Color_Mean_Encoded
0,Blue,0,0.5
1,Blue,1,0.5
2,Red,0,0.0
3,Green,1,1.0
4,Green,1,1.0
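
Since the encoding uses the target, the means are usually learned on training data only and then applied to other data. The sketch below assumes a hypothetical `train`/`test` split; falling back to the overall training mean for unseen categories is an assumption chosen for illustration, not the only possible choice.

```python
import pandas as pd

# Hypothetical split: 'Purple' is not in the training data
train = pd.DataFrame({'Color': ['Blue', 'Blue', 'Red', 'Green', 'Green'],
                      'Target': [0, 1, 0, 1, 1]})
test = pd.DataFrame({'Color': ['Green', 'Purple']})

# Learn per-category means and the overall mean from the training data only
mean_map = train.groupby('Color')['Target'].mean().to_dict()
global_mean = train['Target'].mean()

# Assumption: unseen categories fall back to the overall training mean
test['Color_Mean_Encoded'] = test['Color'].map(mean_map).fillna(global_mean)

print(mean_map)
print(test)
```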
