# Feature Engineering: Encoding Categorical Variables

Categorical variables need to be converted into numerical form before they can be used in most machine learning models. Below are common encoding techniques used in feature engineering.

---

## 1. One-Hot Encoding

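Converts each category into a separate binary column (1 if the row belongs to that category, 0 otherwise). Recent pandas versions print these columns as `True`/`False`.

**Use case**: Logistic regression, SVM, and other models that would treat integer codes as magnitudes  
**Note**: Produces one column per category, so high-cardinality features create very wide tables. The example passes `drop_first=True`, which drops the first category (`Blue`) because it is implied when every remaining column is `False`.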

In [2]:
import pandas as pd

df = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Green']})
encoded = pd.get_dummies(df, columns=['color'], drop_first=True)
print(encoded)

   color_Green  color_Red
0        False       True
1         True      False
2        False      False
3         True      False
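
The note about column count can be made concrete. The following is a minimal sketch, assuming a made-up `user_id` column with 500 distinct values (the column name and data are hypothetical, chosen only for illustration). It compares the number of distinct categories with the number of columns `pd.get_dummies` produces.

```python
import pandas as pd

# Hypothetical high-cardinality column: 1,000 rows, 500 distinct values.
ids = [f"user_{i % 500}" for i in range(1000)]
wide_df = pd.DataFrame({'user_id': ids})

n_categories = wide_df['user_id'].nunique()

# One output column per distinct category.
one_hot = pd.get_dummies(wide_df, columns=['user_id'])

# drop_first=True removes exactly one of those columns.
one_hot_dropped = pd.get_dummies(wide_df, columns=['user_id'], drop_first=True)

print(n_categories, one_hot.shape, one_hot_dropped.shape)
```

The width of the result grows linearly with the number of categories, which is the main reason to consider the more compact encodings later in this notebook.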


## 2. Label Encoding

Assigns each category a unique integer.

**Use case**: Tree-based models like decision trees or random forests  
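**Note**: The integers imply an order and spacing that unordered categories do not have, so avoid this for linear models unless the order is meaningful. scikit-learn's `LabelEncoder` assigns codes in sorted order and is documented as a transformer for target labels rather than input features.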

In [3]:
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Green']})
le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])
print(df)

   color  color_encoded
0    Red              2
1  Green              1
2   Blue              0
3  Green              1
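
To see why the integer codes carry no real meaning, it can help to inspect the mapping the encoder learned. The sketch below reuses the same color values and reads the fitted `classes_` attribute; the variable names `mapping` and `restored` are just illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(['Red', 'Green', 'Blue', 'Green'])
le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the categories in sorted order; the position is the code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
restored = list(le.inverse_transform(codes))
print(restored)
```

The resulting order (`Blue` < `Green` < `Red`) comes from sorting the strings, not from any property of the colors, so a model that interprets the codes numerically would be learning from an arbitrary ranking.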


## 3. Ordinal Encoding

Used when categories have a defined order (e.g., low < medium < high).

**Use case**: When category order is meaningful  
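**Note**: Should not be used if categories are unordered. Pass the order explicitly through `categories`; by default `OrdinalEncoder` sorts the values, which would rank these sizes as `large` < `medium` < `small`.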

In [4]:
from sklearn.preprocessing import OrdinalEncoder

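df = pd.DataFrame({'size': ['small', 'medium', 'large', 'small']})
# Categories are listed from lowest to highest; the list position becomes the code.
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
# fit_transform returns a 2-D array of shape (n_rows, 1); flatten it before assigning.
df['size_encoded'] = encoder.fit_transform(df[['size']]).ravel()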
print(df)

     size  size_encoded
0   small           0.0
1  medium           1.0
2   large           2.0
3   small           0.0
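
The same ordered mapping can also be sketched with pandas alone, which may be convenient when scikit-learn is not otherwise needed. This is an alternative illustration, not the approach used in the cell above; it assumes the same `small < medium < large` order.

```python
import pandas as pd

sizes = pd.Series(['small', 'medium', 'large', 'small'])
order = ['small', 'medium', 'large']  # explicit low-to-high order

# An ordered Categorical stores one integer code per value.
ordered = pd.Categorical(sizes, categories=order, ordered=True)
size_codes = ordered.codes.tolist()
print(size_codes)

# Values that are not in `order` receive the code -1.
unseen = pd.Categorical(['small', 'huge'], categories=order, ordered=True)
unseen_codes = unseen.codes.tolist()
print(unseen_codes)
```

Unlike `OrdinalEncoder`, this yields integer codes rather than floats, and the `-1` code for unlisted values needs to be handled explicitly before modeling.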


## 4. Binary Encoding

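Assigns each category an integer, writes that integer in binary, and stores each bit in its own column.

**Use case**: When you have many categories and want fewer columns than one-hot encoding (roughly log2(n) columns for n categories)  
**Note**: Harder to interpret, because a single column no longer corresponds to a single category.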

In [None]:
pip install category_encoders

In [6]:
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'city': ['Tokyo', 'Osaka', 'Nagoya', 'Osaka']})
encoder = ce.BinaryEncoder()
df_encoded = encoder.fit_transform(df)
print(df_encoded)

   city_0  city_1
0       0       1
1       1       0
2       1       1
3       1       0
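
The bit-spreading idea can be sketched by hand with pandas. The numbering scheme here (integers assigned in order of first appearance, starting at 1) is an assumption chosen so that the result matches the output shown above; it is not meant as a description of how `category_encoders` works internally.

```python
import pandas as pd

cities = pd.Series(['Tokyo', 'Osaka', 'Nagoya', 'Osaka'])

# Step 1: integer per category in order of first appearance, starting at 1
# (assumed convention: Tokyo=1, Osaka=2, Nagoya=3).
codes, uniques = pd.factorize(cities)
codes = codes + 1

# Step 2: number of bits needed to represent the largest code.
n_bits = int(codes.max()).bit_length()

# Step 3: one column per bit, most significant bit first.
bits = {
    f"city_{i}": (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
}
manual_binary = pd.DataFrame(bits)
print(manual_binary)
```

Three categories fit in two bit columns here, whereas one-hot encoding would need three; the saving becomes more noticeable as the number of categories grows.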


## 5. Target Encoding (Mean Encoding)

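Replaces each category with the mean of the target variable over the rows in that category.

**Use case**: High-cardinality categorical variables  
**Note**: Can lead to data leakage, because each row's own target value contributes to its encoding. Compute the means out-of-fold during cross-validation, or apply smoothing.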

In [7]:
df = pd.DataFrame({
    'city': ['Tokyo', 'Osaka', 'Nagoya', 'Osaka', 'Tokyo'],
    'purchased': [1, 0, 1, 0, 1]
})

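# Mean of the target per city, computed on all rows (including the row being encoded).
mean_map = df.groupby('city')['purchased'].mean()
# Replace each city with its mean purchase rate.
df['city_encoded'] = df['city'].map(mean_map)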
print(df)

     city  purchased  city_encoded
0   Tokyo          1           1.0
1   Osaka          0           0.0
2  Nagoya          1           1.0
3   Osaka          0           0.0
4   Tokyo          1           1.0
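
Smoothing, mentioned in the note, pulls each category mean toward the global mean, with rare categories pulled the most. Below is a minimal sketch of one common formulation, `(n * category_mean + m * global_mean) / (n + m)`, where `n` is the category count. The smoothing strength `m = 2` is a hypothetical value picked for illustration, not a recommendation.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Tokyo', 'Osaka', 'Nagoya', 'Osaka', 'Tokyo'],
    'purchased': [1, 0, 1, 0, 1]
})

m = 2  # hypothetical smoothing strength; larger values shrink harder
global_mean = df['purchased'].mean()

# Per-category mean and row count.
stats = df.groupby('city')['purchased'].agg(['mean', 'count'])

# Weighted blend of the category mean and the global mean.
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['city_smoothed'] = df['city'].map(smoothed)
print(df)
```

`Nagoya`, which appears only once, moves further from its raw mean of 1.0 than `Tokyo`, which appears twice. Smoothing reduces overfitting to small categories but does not remove leakage on its own.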


## Summary Table

| Encoding Method  | Description                                | Suitable For                                                |
|------------------|--------------------------------------------|-------------------------------------------------------------|
| One-Hot Encoding | Creates one binary column per category     | Linear models, SVM                                          |
| Label Encoding   | Integer encoding (unordered categories)    | Tree-based models                                           |
| Ordinal Encoding | Integer encoding (ordered categories)      | Ordered categorical features                                |
| Binary Encoding  | Compact representation using binary digits | Medium- to high-cardinality categories                      |
| Target Encoding  | Replaces category with target mean         | High-cardinality features; LightGBM, XGBoost (with caution) |


## Notes

- Use `One-Hot` if the model treats feature values as magnitudes (e.g., logistic regression).
- Use `Label` or `Ordinal` for tree-based models, which split on thresholds and do not rely on feature scale.
- Avoid target leakage with `Target Encoding` by applying it within each fold during cross-validation.
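
The per-fold approach in the last point can be sketched as follows. The data is made up for illustration, and the choice of three unshuffled folds is an assumption that keeps the result deterministic. Each row is encoded using means computed only from the other folds, so its own target never contributes to its encoding.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Made-up data for illustration.
df = pd.DataFrame({
    'city': ['Tokyo', 'Osaka', 'Nagoya', 'Osaka', 'Tokyo', 'Nagoya'],
    'purchased': [1, 0, 1, 1, 0, 0]
})

oof = pd.Series(np.nan, index=df.index, dtype=float)

kf = KFold(n_splits=3)  # no shuffling, so the folds are deterministic
for fit_idx, enc_idx in kf.split(df):
    fit_part = df.iloc[fit_idx]
    # Category means learned without the rows being encoded.
    fold_means = fit_part.groupby('city')['purchased'].mean()
    # Categories unseen in the fitting part fall back to that part's overall mean.
    fallback = fit_part['purchased'].mean()
    encoded = df['city'].iloc[enc_idx].map(fold_means).fillna(fallback)
    oof.iloc[enc_idx] = encoded.to_numpy()

df['city_oof'] = oof
print(df)
```

In this toy example the out-of-fold value for each row differs from that row's own target, which shows that the encoding no longer copies the label. With real data, more rows per category and shuffled folds would usually be preferable.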