# Categorical Encoding Techniques in Python
## Prof. Dnyanesh Khedekar, Big Data Analytics, GIM
This notebook demonstrates when and how to use different categorical encoding techniques.
Each section includes explanation and code examples.

## 1. Label Encoding
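Label Encoding converts each category into a unique integer. scikit-learn's `LabelEncoder` assigns the integers in **sorted (alphabetical) order** of the category names.
Use it for an **ordinal** variable (i.e., one with a natural order to the categories) only when the alphabetical order of the labels matches that natural order, as with grades 'A' < 'B' < 'C'. Labels such as 'Low', 'Medium', 'High' sort as High, Low, Medium, so their integers would not reflect the true ranking.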

In [None]:
# Import pandas for data manipulation and LabelEncoder for encoding.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create a sample DataFrame with a 'Grade' column.
df = pd.DataFrame({'Grade': ['A','C','B','F','A','D']})
# Initialize the LabelEncoder.
le = LabelEncoder()
# Fit the encoder to the 'Grade' column and transform it into numerical labels.
df['Grade_encoded'] = le.fit_transform(df['Grade'])
# Display the DataFrame with the new encoded column.
df
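
The learned mapping can be inspected and reversed after fitting. The following is a minimal sketch using the same 'Grade' data; the variable names `classes` and `decoded` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Grade': ['A', 'C', 'B', 'F', 'A', 'D']})
le = LabelEncoder()
df['Grade_encoded'] = le.fit_transform(df['Grade'])

# classes_ holds the sorted categories; a category's position is its integer code.
classes = list(le.classes_)
print(classes)

# inverse_transform maps the integer codes back to the original labels.
decoded = list(le.inverse_transform(df['Grade_encoded']))
print(decoded)
```

Checking `classes_` is a quick way to confirm that the alphabetical order matches the intended ranking before the codes are used as an ordinal feature.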

Another example suited to Label Encoding: level tags that sort naturally, such as ['Level1','Level2','Level3','Level4'].

In [None]:
# Another example of Label Encoding with different quality levels.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample DataFrame with an ordinal 'Quality' feature.
df = pd.DataFrame({'Quality': ['Low', 'Medium', 'High', 'Medium', 'Low']})

# Initialize and apply the LabelEncoder.
le = LabelEncoder()
df['Quality_encoded'] = le.fit_transform(df['Quality'])
# Note that the encoding is based on alphabetical order: High (0), Low (1), Medium (2),
# so the natural order Low < Medium < High is NOT preserved here.
df
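
When the alphabetical codes do not follow the natural ranking, one possible workaround is an explicit mapping. The sketch below compares the two; the `quality_order` dictionary is an assumption introduced for illustration and is not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Quality': ['Low', 'Medium', 'High', 'Medium', 'Low']})

# Alphabetical codes produced by LabelEncoder.
le = LabelEncoder()
df['Quality_label'] = le.fit_transform(df['Quality'])
label_mapping = {cls: int(code) for code, cls in enumerate(le.classes_)}
print(label_mapping)

# Hypothetical explicit order (an assumption): Low < Medium < High.
quality_order = {'Low': 0, 'Medium': 1, 'High': 2}
df['Quality_ordered'] = df['Quality'].map(quality_order)
print(df)
```

With the explicit dictionary, a larger integer always means a higher quality level, which is what a model treating the column as ordinal would expect.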

## 2. One-Hot Encoding
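Creates one binary column per category. Use when the variable is **nominal** (no order). Drawback: the number of columns grows with the number of categories, which can lead to the curse of dimensionality for high-cardinality variables.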

In [None]:
# Create a sample DataFrame with a nominal 'Color' feature.
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Use pandas' get_dummies() function to perform one-hot encoding.
# This creates a new column for each color, with 1s and 0s indicating the presence of that color.
one_hot_df = pd.get_dummies(df, columns=['Color'])
one_hot_df
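
To see how the column count follows the number of categories, the sketch below compares the full encoding with the `drop_first=True` option of `get_dummies`, which keeps k - 1 columns for k categories. The names `full` and `reduced` are illustrative, and `dtype=int` is used only to print 0/1 instead of booleans.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# One indicator column per category: 3 distinct colors give 3 columns.
full = pd.get_dummies(df, columns=['Color'], dtype=int)

# drop_first=True drops the first category, which becomes the implicit baseline
# (a row of all zeros), so k categories give k - 1 columns.
reduced = pd.get_dummies(df, columns=['Color'], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one column saves little here, but it removes the redundancy among the indicator columns, which matters for linear models.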

## 3. Ordinal Encoding
Maps categories to integers based on a predefined order that you supply.
Useful for ordered categorical data (e.g., 'Poor' < 'Average' < 'Good').

In [None]:
# Import OrdinalEncoder for custom order encoding.
from sklearn.preprocessing import OrdinalEncoder

# Define the desired order of the categories.
ordinal_map = [['Poor', 'Average', 'Good']]
# Create a sample DataFrame.
df = pd.DataFrame({'Service': ['Good', 'Poor', 'Average', 'Good', 'Poor']})

# Initialize the OrdinalEncoder with the custom order.
enc = OrdinalEncoder(categories=ordinal_map)
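# Apply the encoder to the 'Service' column. fit_transform returns a 2-D array,
# so flatten it to one dimension before assigning it to a single column.
df['Service_encoded'] = enc.fit_transform(df[['Service']]).ravel()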
df
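
New data may contain a category that was not in the predefined order. The sketch below shows one way to handle this with `OrdinalEncoder`'s `handle_unknown='use_encoded_value'` option (available in scikit-learn 0.24 and later); the `new` DataFrame and the 'Excellent' value are assumptions made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'Service': ['Good', 'Poor', 'Average', 'Good', 'Poor']})
# Hypothetical new data: 'Excellent' never appeared during fitting.
new = pd.DataFrame({'Service': ['Average', 'Excellent']})

enc = OrdinalEncoder(
    categories=[['Poor', 'Average', 'Good']],
    handle_unknown='use_encoded_value',  # do not raise on unseen categories
    unknown_value=-1,                    # code assigned to unseen categories
)
enc.fit(train[['Service']])

# transform returns a 2-D float array; ravel() flattens it for a single column.
new['Service_encoded'] = enc.transform(new[['Service']]).ravel()
print(new)
```

Without this option the encoder's default behaviour is to raise an error on an unseen category, which may be preferable when unknown values indicate a data problem.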

## 4. Target Encoding
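Replaces each category with the mean of the target variable for that category.
Use carefully to prevent **data leakage**: the means should be computed from training rows only, never from the validation or test rows being encoded.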

In [None]:
# Create a sample DataFrame with a categorical feature and a numerical target ('Sales').
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B', 'C', 'A'], 'Sales': [100, 200, 150, 300, 250, 400, 120]})

# Calculate the mean of 'Sales' for each category.
target_mean = df.groupby('Category')['Sales'].mean().to_dict()
# Map the calculated means back to the 'Category' column to create the encoded feature.
df['Category_encoded'] = df['Category'].map(target_mean)
df
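
The cell above computes the means on the full dataset, so each row's encoding includes its own target value. One common way to reduce this leakage is out-of-fold encoding, sketched below with `KFold`. The choice of 3 folds, the fallback to the fold's overall mean, and the names `oof`, `fit_idx` and `enc_idx` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'C', 'B', 'C', 'A'],
    'Sales': [100, 200, 150, 300, 250, 400, 120],
})

# Out-of-fold result, filled in one fold at a time.
oof = pd.Series(index=df.index, dtype=float)
kf = KFold(n_splits=3, shuffle=True, random_state=0)

for fit_idx, enc_idx in kf.split(df):
    fit_rows = df.iloc[fit_idx]
    # Category means computed WITHOUT the rows that are about to be encoded.
    fold_means = fit_rows.groupby('Category')['Sales'].mean()
    # Categories missing from the fitting rows fall back to those rows' overall mean.
    fallback = fit_rows['Sales'].mean()
    encoded = df['Category'].iloc[enc_idx].map(fold_means).fillna(fallback)
    oof.iloc[enc_idx] = encoded.to_numpy()

df['Category_oof'] = oof
print(df)
```

Each row is encoded using only the target values of other rows, so the feature no longer contains the row's own 'Sales' value.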

## 5. Frequency Encoding
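Each category is replaced by its frequency (number of occurrences) in the dataset.
Helps when frequency conveys useful information. Categories that occur equally often receive the same value and can no longer be told apart.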

In [None]:
# Create a sample DataFrame with a 'Brand' column.
df = pd.DataFrame({'Brand': ['A', 'B', 'A', 'C', 'B', 'A', 'B']})
# Calculate the frequency of each brand.
freq_enc = df['Brand'].value_counts().to_dict()
# Map the frequencies to the 'Brand' column to create the encoded feature.
df['Brand_encoded'] = df['Brand'].map(freq_enc)
df
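
Raw counts depend on the dataset size, so a relative frequency is sometimes used instead. The sketch below uses `value_counts(normalize=True)` and also shows one possible treatment of unseen categories; the `new_brands` series and the choice of 0.0 for unseen brands are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Brand': ['A', 'B', 'A', 'C', 'B', 'A', 'B']})

# normalize=True returns each brand's share of rows instead of its raw count.
rel_freq = df['Brand'].value_counts(normalize=True)
df['Brand_freq'] = df['Brand'].map(rel_freq)
print(df)

# Hypothetical new data: brand 'D' was never seen, so map() yields NaN for it.
new_brands = pd.Series(['A', 'D'])
new_encoded = new_brands.map(rel_freq).fillna(0.0)  # treat unseen brands as frequency 0
print(new_encoded)
```

In this sample, brands A and B both cover 3 of the 7 rows, so they receive the same encoded value.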

## 6. Hash Encoding
Converts categories into a fixed number of columns using a hash function.
Useful for **very high-cardinality** data, at the cost that different categories can collide in the same column.

In [None]:
# Import the category_encoders library for more advanced encoding techniques.
# It is a third-party package and must be installed separately (pip install category_encoders).
import category_encoders as ce
# Initialize the HashingEncoder with the desired number of output columns (n_components).
encoder = ce.HashingEncoder(cols=['City'], n_components=4)
# Apply the encoder to the 'City' column.
hash_df = encoder.fit_transform(pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Kolkata', 'Delhi', 'Bangalore']}))
hash_df
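
To illustrate the idea behind hashing, the sketch below builds the columns by hand with the standard library's `hashlib`. It is not necessarily what `category_encoders` does internally; the helper `hash_bucket`, the use of MD5 and the `col_i` column names are assumptions for illustration.

```python
import hashlib
import pandas as pd

def hash_bucket(value: str, n_buckets: int) -> int:
    """Map a string to a bucket index in [0, n_buckets) using a stable hash."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

cities = pd.Series(['Delhi', 'Mumbai', 'Kolkata', 'Delhi', 'Bangalore'], name='City')
n_buckets = 4

# Each city is assigned to one of n_buckets columns, regardless of how many cities exist.
buckets = cities.map(lambda c: hash_bucket(c, n_buckets))

# Build the fixed-width indicator matrix: a 1 in the column the city hashes to.
hashed = pd.DataFrame(0, index=cities.index,
                      columns=[f'col_{i}' for i in range(n_buckets)])
for row, bucket in buckets.items():
    hashed.loc[row, f'col_{bucket}'] = 1

print(hashed)
```

The output width stays at `n_buckets` however many distinct cities appear, and identical cities always land in the same column. Two different cities may share a column, which is the collision trade-off of this technique.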

## Summary Table
| Encoding Type | Best Use Case | Example Library |
|----------------|----------------|----------------|
| Label Encoding | Ordinal data whose labels sort alphabetically in their natural order | sklearn.preprocessing.LabelEncoder |
| One-Hot Encoding | Nominal data, low cardinality | pandas.get_dummies |
| Ordinal Encoding | Ordered categorical | sklearn.preprocessing.OrdinalEncoder |
| Target Encoding | Supervised problems with caution | category_encoders.TargetEncoder |
| Frequency Encoding | When category frequency is relevant | pandas |
| Hash Encoding | Extremely high-cardinality variables | category_encoders.HashingEncoder |

