# Categorical Encoding Techniques in Python
## Prof. Dnyanesh Khedekar, Big Data Analytics, GIM
This notebook demonstrates when and how to use different categorical encoding techniques.
Each section includes explanation and code examples.

## 1. Label Encoding
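Label encoding converts each category into a unique integer. scikit-learn's `LabelEncoder` assigns the integers in **sorted (alphabetical) order** of the category names, and its documentation describes it as intended for target labels.
Use it on an **ordinal** variable only when the alphabetical order matches the intended ranking, as with the grades below.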

In [None]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Grade': ['A','C','B','F','A','D']})
# Initialize label encoder for categorical variables
le = LabelEncoder()
# Fit and transform data in one step
df['Grade_encoded'] = le.fit_transform(df['Grade'])
df
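
To see which integer each grade received, the fitted encoder can be inspected. The sketch below rebuilds the same grades as a standalone Series; the names `grades`, `mapping`, and `decoded` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

grades = pd.Series(['A', 'C', 'B', 'F', 'A', 'D'])

le = LabelEncoder()
codes = le.fit_transform(grades)

# classes_ holds the sorted unique categories; a category's position is its code
mapping = {str(label): idx for idx, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
decoded = le.inverse_transform(codes)
print(decoded)
```

Because the codes follow sorted order, 'A' maps to 0 and 'F' to 4, which happens to agree with the grade ranking here.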

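Another case where label encoding works is a set of level tags that sort naturally: ['Level1','Level2','Level3','Level4']. The example below is a counterexample: 'Low', 'Medium', 'High' sort alphabetically as High=0, Low=1, Medium=2, which does not match the intended order.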

In [None]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example dataset
df = pd.DataFrame({'Quality': ['Low', 'Medium', 'High', 'Medium', 'Low']})

# Initialize encoder
le = LabelEncoder()
# Fit and transform data in one step
df['Quality_encoded'] = le.fit_transform(df['Quality'])
df
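
`LabelEncoder` sorts 'High', 'Low', 'Medium' alphabetically, so the codes above do not follow Low < Medium < High. One possible workaround, sketched here under the assumption that the intended order is Low < Medium < High, is an ordered pandas `Categorical`. The variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

quality = pd.Series(['Low', 'Medium', 'High', 'Medium', 'Low'])

# LabelEncoder: codes follow alphabetical order (High=0, Low=1, Medium=2)
alphabetical = LabelEncoder().fit_transform(quality)

# Ordered Categorical: codes follow the order given in `categories`
ordered = pd.Categorical(quality, categories=['Low', 'Medium', 'High'], ordered=True)
ranked = ordered.codes

print(alphabetical.tolist())  # [1, 2, 0, 2, 1]
print(ranked.tolist())        # [0, 1, 2, 1, 0]
```

Only the second set of codes preserves the ranking, which matters for models that treat the integers as magnitudes.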

## 2. One-Hot Encoding
Creates one binary column per category. Use when the variable is **nominal** (no inherent order). Drawback: the number of columns grows with the number of categories (the curse of dimensionality).

In [None]:
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# One-hot encoding using pandas
one_hot_df = pd.get_dummies(df, columns=['Color'])
one_hot_df
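
The column count can be trimmed slightly by dropping one redundant indicator. The sketch below compares the full encoding with `drop_first=True`; passing `dtype=int` is a choice made here so the output is 0/1 rather than booleans, and the names `full` and `reduced` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Full one-hot encoding: one indicator column per category
full = pd.get_dummies(df, columns=['Color'], dtype=int)

# drop_first=True removes the first category's column (Blue);
# a row of all zeros then stands for Blue
reduced = pd.get_dummies(df, columns=['Color'], drop_first=True, dtype=int)

print(list(full.columns))     # ['Color_Blue', 'Color_Green', 'Color_Red']
print(list(reduced.columns))  # ['Color_Green', 'Color_Red']
```

Dropping one column saves little when there are only three categories, so for genuinely high-cardinality variables a different encoding is usually needed.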

## 3. Ordinal Encoding
Maps categories to integers based on a predefined order that you supply.
Useful for ordered categorical data (e.g., 'Poor' < 'Average' < 'Good').

In [None]:
# Import required libraries
from sklearn.preprocessing import OrdinalEncoder

ordinal_map = [['Poor', 'Average', 'Good']]
df = pd.DataFrame({'Service': ['Good', 'Poor', 'Average', 'Good', 'Poor']})

enc = OrdinalEncoder(categories=ordinal_map)
# Fit and transform in one step; ravel() flattens the (n, 1) result to 1-D
df['Service_encoded'] = enc.fit_transform(df[['Service']]).ravel()
df
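
New data may contain a category that was not in the predefined order. A minimal sketch of one way to handle that, assuming scikit-learn 0.24 or later (where `handle_unknown='use_encoded_value'` is available); the 'Excellent' value and the `train`/`new` frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'Service': ['Good', 'Poor', 'Average', 'Good', 'Poor']})
new = pd.DataFrame({'Service': ['Average', 'Excellent']})  # 'Excellent' is unseen

enc = OrdinalEncoder(
    categories=[['Poor', 'Average', 'Good']],  # explicit order: Poor < Average < Good
    handle_unknown='use_encoded_value',
    unknown_value=-1,                          # code assigned to unseen categories
)

train_codes = enc.fit_transform(train[['Service']]).ravel()
new_codes = enc.transform(new[['Service']]).ravel()

print(train_codes.tolist())  # [2.0, 0.0, 1.0, 2.0, 0.0]
print(new_codes.tolist())    # [1.0, -1.0]
```

With the default `handle_unknown='error'`, the second `transform` call would raise instead of returning -1.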

## 4. Target Encoding
Replaces each category with the mean of the target variable for that category.
Use carefully to prevent **data leakage**: compute the means from training data only, not from rows used for evaluation.

In [None]:
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B', 'C', 'A'], 'Sales': [100, 200, 150, 300, 250, 400, 120]})

# Mean of the target (Sales) for each category
target_mean = df.groupby('Category')['Sales'].mean().to_dict()
df['Category_encoded'] = df['Category'].map(target_mean)
df
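
The cell above computes the means from every row, which is exactly where leakage comes from once a model is evaluated on those rows. A minimal sketch of the usual precaution, assuming a simple train/test split: learn the means on the training rows only and fall back to the global training mean for unseen categories. The split and the unseen category 'D' are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'C', 'B'],
    'Sales': [100, 200, 150, 300, 250],
})
test = pd.DataFrame({'Category': ['C', 'A', 'D']})  # 'D' never appears in train

# Learn the encoding from training rows only
global_mean = train['Sales'].mean()
category_means = train.groupby('Category')['Sales'].mean()

train['Category_encoded'] = train['Category'].map(category_means)

# Unseen categories map to NaN, so fall back to the global training mean
test['Category_encoded'] = test['Category'].map(category_means).fillna(global_mean)

print(test)
```

Each training row is still encoded with a mean that includes its own target, so this only protects the test rows. Out-of-fold encoding or smoothing are common further steps.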

## 5. Frequency Encoding
Each category is replaced by its frequency (the number of times it occurs) in the dataset.
Helps when frequency conveys useful information.

In [None]:
df = pd.DataFrame({'Brand': ['A', 'B', 'A', 'C', 'B', 'A', 'B']})
freq_enc = df['Brand'].value_counts().to_dict()
df['Brand_encoded'] = df['Brand'].map(freq_enc)
df
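
Raw counts depend on the dataset size, so relative frequencies are sometimes preferred. The sketch below computes both on the same brands; the column names `Brand_count` and `Brand_share` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Brand': ['A', 'B', 'A', 'C', 'B', 'A', 'B']})

# Absolute counts, as in the cell above
counts = df['Brand'].value_counts()
df['Brand_count'] = df['Brand'].map(counts)

# Relative frequencies: normalize=True divides each count by the number of rows
shares = df['Brand'].value_counts(normalize=True)
df['Brand_share'] = df['Brand'].map(shares)

print(df)
```

Brands A and B both occur three times, so they receive the same encoded value. Frequency encoding cannot tell apart categories that are equally common.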

## 6. Hash Encoding
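Converts categories into a fixed number of columns using a hash function. The column count stays fixed no matter how many categories appear, at the cost that distinct categories can collide in the same column.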
Useful for **very high-cardinality** data.

In [None]:
# Import required libraries (category_encoders is a separate package)
import pandas as pd
import category_encoders as ce
encoder = ce.HashingEncoder(cols=['City'], n_components=4)
# Fit and transform data in one step
hash_df = encoder.fit_transform(pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Kolkata', 'Delhi', 'Bangalore']}))
hash_df
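
To show the idea behind hashing without the extra dependency, here is a hand-rolled sketch using only `hashlib` and pandas. It illustrates the general technique and is not a description of how `category_encoders.HashingEncoder` is implemented; the helper `hash_bucket` and the `col_i` column names are hypothetical.

```python
import hashlib

import pandas as pd


def hash_bucket(value: str, n_components: int) -> int:
    """Map a string to one of n_components buckets with a stable hash."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_components


cities = pd.Series(['Delhi', 'Mumbai', 'Kolkata', 'Delhi', 'Bangalore'])
n_components = 4

# Each city lands in exactly one bucket; equal strings always share a bucket
buckets = cities.map(lambda city: hash_bucket(city, n_components))

# Build the fixed-width indicator matrix
hashed = pd.DataFrame(
    0, index=cities.index, columns=[f'col_{i}' for i in range(n_components)]
)
for row, bucket in buckets.items():
    hashed.loc[row, f'col_{bucket}'] = 1

print(hashed)
```

The width stays at four columns however many cities appear. With more distinct cities than buckets, some will necessarily share a column.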

## Summary Table
| Encoding Type | Best Use Case | Example Library |
|----------------|----------------|----------------|
| Label Encoding | Target labels; ordinal data whose alphabetical order matches its rank | sklearn.preprocessing.LabelEncoder |
| One-Hot Encoding | Nominal data, low cardinality | pandas.get_dummies |
| Ordinal Encoding | Ordered categorical | sklearn.preprocessing.OrdinalEncoder |
| Target Encoding | Supervised problems, with care to avoid data leakage | category_encoders.TargetEncoder |
| Frequency Encoding | When category frequency is relevant | pandas |
| Hash Encoding | Extremely high-cardinality variables | category_encoders.HashingEncoder |

