# Target Encoding

Target encoding is a method used in machine learning and data analysis for encoding categorical variables. It involves replacing categories or labels of a categorical variable with some meaningful numeric representation based on the target variable. The target variable refers to the variable that we want to predict or model.

The basic idea behind target encoding is to utilize the relationship between the categorical variable and the target variable to derive informative features. It is particularly useful when dealing with categorical variables with high cardinality, i.e., a large number of distinct categories.

Target encoding leverages the information contained in the target variable to generate meaningful representations for categorical variables. By doing so, it allows the model to capture potential relationships between the categorical variable and the target variable, potentially improving the predictive performance of the model.

It's worth noting that target encoding should be performed with caution, as it can lead to overfitting if not properly regularized or validated. Techniques such as cross-validation, smoothing, or incorporating other regularization methods can help mitigate this risk.

In [1]:
#!pip install category_encoders

In [2]:
import pandas as pd
import category_encoders as ce
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


We will first create a sample dataset with `color` and `label` columns. Later, we will split it into training and validation sets using the `train_test_split` function from scikit-learn.

In [3]:
# Create a sample dataset
data = {'color': ['red', 'blue', 'green', 'red', 'blue', 'blue'],
        'label': [1, 0, 1, 0, 1, 0]}
df = pd.DataFrame(data)


In [4]:
df.head()

Unnamed: 0,color,label
0,red,1
1,blue,0
2,green,1
3,red,0
4,blue,1


## With one hot encoding

In [5]:
# Perform one-hot encoding
encoded = pd.get_dummies(df['color'], prefix='color')

# Concatenate the original DataFrame with the encoded variables
df_encoded = pd.concat([df, encoded], axis=1)

In [6]:
df_encoded.drop('color', axis=1, inplace=True)
df_encoded

Unnamed: 0,label,color_blue,color_green,color_red
0,1,0,0,1
1,0,1,0,0
2,1,0,1,0
3,0,0,0,1
4,1,1,0,0
5,0,1,0,0


In [7]:
# check correlations with one hot encoding
df_encoded.corr()

Unnamed: 0,label,color_blue,color_green,color_red
label,1.0,-0.333333,0.447214,0.0
color_blue,-0.333333,1.0,-0.447214,-0.707107
color_green,0.447214,-0.447214,1.0,-0.316228
color_red,0.0,-0.707107,-0.316228,1.0


In [8]:
# Split the dataset into training and validation sets
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

We will now initialize a `TargetEncoder` from the `category_encoders` library and specify the `color` column as the one to be encoded.

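The `TargetEncoder` in `category_encoders` encodes each category with the mean of the target, and it exposes smoothing-related parameters (`smoothing`, `min_samples_leaf`) to handle rare categories and limit overfitting. Other aggregates such as sum, count, median, min, or max are not options of this class, as far as its documented parameters go; they can be computed manually, for example with a pandas `groupby`.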

In [9]:
# Initialize target encoder
encoder = ce.TargetEncoder(cols=['color'])

# Note: TargetEncoder encodes with a (smoothed) mean of the target.
# Other aggregates such as 'sum' have to be computed manually.


Let us fit the encoder on the training data using `encoder.fit()`. This calculates the average `label` value for each category in the `color` column; for categories with few rows, the library also shrinks that average toward the overall mean of the target.

For example, if the class `blue` occurs 2 times in the training data and the corresponding values in the `label` column are 0 and 1, the average value is (0+1)/2 = 0.5.

In [10]:
train_df[train_df.color=='blue']

Unnamed: 0,color,label
5,blue,0
4,blue,1


In [11]:
# Fit the encoder on the training data
encoder.fit(train_df['color'], train_df['label'])

TargetEncoder(cols=['color'])

In [12]:
# Transform the categorical variable in both training and validation sets
train_encoded = encoder.transform(train_df['color'])
val_encoded = encoder.transform(val_df['color'])

In [13]:
# Replace the original 'color' column with the encoded values
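# (assign returns new DataFrames, which avoids pandas' SettingWithCopyWarning)
train_df = train_df.assign(color=train_encoded['color'])
val_df = val_df.assign(color=val_encoded['color'])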


In [14]:
train_df

Unnamed: 0,color,label
5,0.5,0
2,0.565054,1
4,0.5,1
3,0.434946,0
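The encoded values above are not raw category means: `green` appears once with label 1, yet it is encoded as 0.565054 rather than 1.0. The numbers are consistent with a sigmoid-weighted blend of the category mean and the global mean, using `min_samples_leaf=20` and `smoothing=10`. The sketch below reproduces them in plain Python under that assumption; the function name `smoothed_target_mean` is hypothetical, and the defaults may differ in other versions of `category_encoders`.

```python
import math

def smoothed_target_mean(cat_mean, count, prior, min_samples_leaf=20, smoothing=10.0):
    """Blend a category mean with the global prior using a sigmoid weight.

    Assumed formula: weight = 1 / (1 + exp(-(count - min_samples_leaf) / smoothing)).
    """
    weight = 1.0 / (1.0 + math.exp(-(count - min_samples_leaf) / smoothing))
    return prior * (1.0 - weight) + cat_mean * weight

prior = 0.5  # mean of the four training labels (0, 1, 1, 0)

green = smoothed_target_mean(cat_mean=1.0, count=1, prior=prior)
red = smoothed_target_mean(cat_mean=0.0, count=1, prior=prior)
blue = smoothed_target_mean(cat_mean=0.5, count=2, prior=prior)

print(round(green, 6), round(red, 6), round(blue, 6))  # 0.565054 0.434946 0.5
```

With only one or two rows per category, the weight on the category mean is small (about 0.13 for a single row), so every encoding stays close to the global mean of 0.5.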


In [15]:
# Train a model on the encoded training set
model = LogisticRegression()
model.fit(train_df[['color']], train_df['label'])

# Make predictions on the encoded validation set
val_predictions = model.predict(val_df[['color']])

# Evaluate the model's performance
accuracy = accuracy_score(val_df['label'], val_predictions)
print("Validation accuracy:", accuracy)


Validation accuracy: 0.5


In [16]:
# check the correlation coefficient after target encoding
train_df.corr()

Unnamed: 0,color,label
color,1.0,0.707107
label,0.707107,1.0


### Using the 'smoothing' argument

Smoothing is a form of regularization used in target encoding to mitigate the risk of overfitting and to handle categories with few observations. It balances the category-specific mean (or other aggregate value) against the global mean of the target variable: each category is encoded as a weighted average of the two, with the weight controlled by a smoothing hyperparameter.

In [17]:
# Initialize target encoder with smoothing
smoothing_param = 0.5  # Adjust the smoothing parameter
encoder = ce.TargetEncoder(cols=['color'], smoothing=smoothing_param)

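In this example, we added the `smoothing` parameter when initializing the `TargetEncoder`. The value of `smoothing_param` is set to 0.5, but you can adjust it according to your requirements.

When smoothing is applied, the encoding for each category combines the category-specific mean and the global mean. The weight given to the category-specific mean grows with the number of observations in that category. In the common additive formulation, higher values of the smoothing parameter pull encodings more strongly toward the global mean, while lower values give more weight to the category-specific mean. The exact weighting is library-specific: `category_encoders` derives the weight from the category count through a sigmoid governed by `min_samples_leaf` and `smoothing`, so it is worth checking the documentation of the version you use.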

By incorporating smoothing, we reduce the impact of categories with sparse observations, preventing them from having extreme or unreliable target encodings. Smoothing helps to generalize the encoding and prevent overfitting by striking a balance between category-specific information and overall target variable distribution.
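To make the weighted-average idea concrete, here is a minimal sketch of additive (m-estimate style) smoothing. It is a generic formulation and is not claimed to be the formula `category_encoders` uses; the function name `m_estimate_encode` and the parameter `m` are hypothetical names chosen for illustration.

```python
def m_estimate_encode(cat_mean, count, global_mean, m):
    """Additive smoothing: (count * cat_mean + m * global_mean) / (count + m)."""
    return (count * cat_mean + m * global_mean) / (count + m)

global_mean = 0.5

# Compare a rare category (1 row) with a frequent one (100 rows),
# both with a raw category mean of 1.0, for increasing smoothing strength m.
for m in (0, 1, 10):
    rare = m_estimate_encode(1.0, 1, global_mean, m)
    frequent = m_estimate_encode(1.0, 100, global_mean, m)
    print(m, round(rare, 3), round(frequent, 3))
# 0 1.0 1.0
# 1 0.75 0.995
# 10 0.545 0.955
```

The rare category moves quickly toward the global mean as `m` grows, while the frequent category barely changes, which is the behavior smoothing is meant to provide.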

## How to choose your aggregate function?

The choice of aggregate function for target encoding depends on the nature of your dataset and the problem you are trying to solve. There is no one-size-fits-all answer, and it often requires experimentation and domain knowledge to determine the most appropriate function.

For example:

1. **Mean:** The mean is a commonly used aggregate function for target encoding. It provides a balanced representation of the target variable within each category. It is suitable when the distribution of the target variable is relatively symmetric and there are no extreme outliers.

2. **Sum:** Summing the target variable values within each category can be useful when the target variable represents a count or a cumulative measure. It may be suitable for problems where the cumulative effect or the total occurrence is important.

3. **Count:** Counting how many rows of the training data belong to each category can be valuable when you want to capture the frequency or prevalence of each category. This can be particularly useful for imbalanced datasets or when the frequency of a category is informative.

4. **Median:** The median can be a robust alternative to the mean, especially if the target variable is skewed or contains outliers. It provides a measure of the central tendency that is less influenced by extreme values.

5. **Min/Max:** Using the minimum or maximum value of the target variable within each category can be relevant when you want to capture the extreme values or boundaries of the target variable for each category.

These are just a few examples of aggregate functions you can consider. The choice also depends on the specific characteristics of your dataset and the problem at hand. You can also explore other aggregate functions such as `standard deviation`, `percentile`, or any custom function that captures meaningful information about the relationship between the categorical variable and the target variable.
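The aggregates listed above can be computed by hand with a pandas `groupby` and mapped back onto the rows. The sketch below uses the sample dataset from this notebook; the column name `color_median` is an illustrative choice. For brevity it computes the statistics on the full DataFrame, whereas in practice they should be computed on the training rows only and then applied to the validation rows.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue', 'blue'],
    'label': [1, 0, 1, 0, 1, 0],
})

# Compute several per-category aggregates of the target in one pass.
stats = df.groupby('color')['label'].agg(['mean', 'sum', 'count', 'median', 'min', 'max'])
print(stats)

# Map one chosen aggregate back onto the rows as a new feature.
df['color_median'] = df['color'].map(stats['median'])
print(df)
```

Swapping `'median'` for another column of `stats` is enough to try a different aggregate.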

To select the best aggregate function, you can try multiple options and evaluate the impact on your model's performance. You can use validation techniques, such as cross-validation, to compare the performance of different encoding methods and choose the one that improves your model's accuracy, precision, recall, or other relevant metrics based on your problem domain.