## 1. Label Encoding

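Assigns each unique category an integer value. scikit-learn's `LabelEncoder` assigns the integers in sorted (alphabetical) order and is designed for encoding target labels rather than input features.

- Use case: Ordinal categories (when the sorted order matches the natural order) or tree-based models.
- Note: Not suitable for nominal (unordered) data in linear or distance-based models, because the integers imply an order and spacing that do not exist.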

```python
from sklearn.preprocessing import LabelEncoder

data = ['red', 'green', 'blue', 'green', 'red']
le = LabelEncoder()
encoded = le.fit_transform(data)
print(encoded)  # Output: [2 1 0 1 2]
```

```
[2 1 0 1 2]
```
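To see which integer each category received, the fitted encoder can be inspected. The sketch below reuses the same toy data; the `mapping` and `decoded` names are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

data = ['red', 'green', 'blue', 'green', 'red']
le = LabelEncoder()
encoded = le.fit_transform(data)

# classes_ holds the categories in sorted order; the position is the integer code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
decoded = [str(label) for label in le.inverse_transform(encoded)]
print(decoded)
```

The mapping makes it clear that the codes follow alphabetical order, not any meaningful ranking of the colors.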


## 2. One-Hot Encoding

Creates one binary column per category.

- Use case: Nominal categories, especially with models that treat feature values as magnitudes (e.g., logistic regression).
- Note: Adds one column per category, so dimensionality grows quickly for high-cardinality features.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'red']})
one_hot = pd.get_dummies(df['color'])
print(one_hot)
```

```
    blue  green    red
0  False  False   True
1  False   True  False
2   True  False  False
3  False   True  False
4  False  False   True
```
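For linear models with an intercept, the full set of indicator columns is perfectly collinear, since every row sums to 1. One common remedy is to drop one column. The following is a minimal sketch using the `drop_first` and `dtype` options of `pd.get_dummies` on the same toy data; whether dropping a column is appropriate depends on the model.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'red']})

# drop_first=True removes the first category ('blue'), which becomes the
# implicit baseline; dtype=int yields 0/1 instead of False/True.
one_hot = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)
print(one_hot)
```

A row with zeros in both remaining columns represents the dropped category, so no information is lost.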


## 3. Ordinal Encoding

Replaces categories with integers preserving their natural order.

- Use case: Ordered categories (e.g., low < medium < high).
- Note: The category order must be specified explicitly; by default, `OrdinalEncoder` sorts categories alphabetically.

```python
from sklearn.preprocessing import OrdinalEncoder

data = [['low'], ['medium'], ['high'], ['medium'], ['low']]
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded = encoder.fit_transform(data)
print(encoded)
```

```
[[0.]
 [1.]
 [2.]
 [1.]
 [0.]]
```
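The effect of the order definition can be illustrated by comparing the default behavior with an explicit order. This is a small sketch on the same data; the variable names are illustrative.

```python
from sklearn.preprocessing import OrdinalEncoder

data = [['low'], ['medium'], ['high'], ['medium'], ['low']]

# Default: categories are inferred from the data and sorted alphabetically
# (high=0, low=1, medium=2), which does not reflect the natural order.
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(data).ravel().tolist()

# Explicit order: low=0, medium=1, high=2.
ordered_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordered_codes = ordered_encoder.fit_transform(data).ravel().tolist()

print(default_codes)  # [1.0, 2.0, 0.0, 2.0, 1.0]
print(ordered_codes)  # [0.0, 1.0, 2.0, 1.0, 0.0]
```

With the default, `high` receives the smallest code, so a model would see it as ranking below `low`.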


## 4. Target Encoding

Replaces categories with the mean of the target variable for each category.

- Use case: High-cardinality categories.
- Warning: Risk of target leakage. Compute the category means on training data only (ideally out-of-fold), never on the rows being encoded.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B'],
    'target': [10, 20, 15, 10, 25]
})
target_means = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(target_means)
print(df)
```

```
  category  target  category_encoded
0        A      10              12.5
1        B      20              22.5
2        A      15              12.5
3        C      10              10.0
4        B      25              22.5
```
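The example above computes each mean from all rows, including the row being encoded, which is exactly where leakage comes from (note that `C` is encoded as its own target value). One way to reduce this is out-of-fold encoding, sketched below with scikit-learn's `KFold`. The column name `category_oof` and the fallback to the mean of the fitting rows are assumptions made for illustration, not a prescribed method.

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B'],
    'target': [10, 20, 15, 10, 25]
})

df['category_oof'] = float('nan')

# With 5 rows and 5 unshuffled folds, each row is encoded using only the other 4 rows.
kf = KFold(n_splits=5)
for fit_idx, enc_idx in kf.split(df):
    fit_rows = df.iloc[fit_idx]
    fold_means = fit_rows.groupby('category')['target'].mean()
    # Categories missing from the fitting rows fall back to the fitting rows' overall mean.
    fallback = fit_rows['target'].mean()
    encoded = df['category'].iloc[enc_idx].map(fold_means).fillna(fallback)
    df.loc[df.index[enc_idx], 'category_oof'] = encoded.to_numpy()

print(df)
```

Each encoded value now depends only on other rows' targets. On real data, a smaller number of folds and some smoothing toward the overall mean are common choices.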


## 5. Frequency Encoding

Replaces categories with their frequency counts.

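- Use case: When how often a category occurs is itself informative for the model.
- Advantage: No increase in dimensionality.
- Note: Categories with equal counts receive the same value and become indistinguishable.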

```python
# Reuses df from the target encoding example above.
freq = df['category'].value_counts()
df['category_freq'] = df['category'].map(freq)
print(df)
```

```
  category  target  category_encoded  category_freq
0        A      10              12.5              2
1        B      20              22.5              2
2        A      15              12.5              2
3        C      10              10.0              1
4        B      25              22.5              2
```
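A variant is to use relative frequencies instead of raw counts, so that the values do not depend on dataset size. The sketch below also shows one possible way to handle categories that were not seen when the frequencies were computed. The `new_data` series and the choice of `0.0` for unseen categories are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'B']})

# normalize=True returns proportions (A=0.4, B=0.4, C=0.2) instead of counts.
rel_freq = df['category'].value_counts(normalize=True)
df['category_rel_freq'] = df['category'].map(rel_freq)
print(df)

# Hypothetical new data with an unseen category 'D', encoded here as 0.0.
new_data = pd.Series(['A', 'D'])
new_encoded = new_data.map(rel_freq).fillna(0.0)
print(new_encoded.tolist())
```

As with the counts, `A` and `B` end up with the same encoded value.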


## When to Use Each Encoding

| Encoding Type      | Use Case                    | Pros                               | Cons                              |
|--------------------|-----------------------------|------------------------------------|-----------------------------------|
| Label Encoding     | Ordinal data                | Simple                             | May mislead model on nominal data |
| One-Hot Encoding   | Nominal data                | No order implied                   | High dimensionality possible      |
| Ordinal Encoding   | Ordered categories          | Maintains order                    | Needs correct order definition    |
| Target Encoding    | High-cardinality categories | Reduces dimensionality             | Risk of leakage, needs care       |
| Frequency Encoding | When frequency matters      | Simple, no dimensionality increase | May lose category meaning         |


## Summary

Categorical encoding is a crucial part of feature engineering. The choice of method depends on the nature of the categorical variable (nominal or ordinal), the number of unique categories, and the model being used. Always consider the risk of leakage, especially with target encoding, and validate your approach carefully.