# Data Encoding: Nominal and One-Hot Encoding (OHE)

## Introduction

Most machine learning algorithms operate on numbers, not text. **Categorical encoding** converts categorical (text) variables into a numerical format that those algorithms can process.

### Why Do We Need Encoding?

Most ML algorithms require numerical input:
- Linear Regression, Logistic Regression
- Neural Networks
- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)

**Example Problem:**
```
Colors: ['Red', 'Blue', 'Green']
ML Algorithm: ❌ Cannot process strings
Solution: ✅ Encode as numbers
```
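
To make the problem concrete, here is a minimal sketch showing a scikit-learn estimator rejecting string input and accepting the same data once it is one-hot encoded. The toy colors and target values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one text feature and a numeric target
X_text = np.array([["Red"], ["Blue"], ["Green"]], dtype=object)
y = [1.0, 2.0, 3.0]

# Fitting on raw strings fails because the values cannot be converted to floats
try:
    LinearRegression().fit(X_text, y)
    fit_failed = False
except ValueError:
    fit_failed = True

# The same feature, one-hot encoded, is accepted
X_encoded = pd.get_dummies(pd.Series(X_text.ravel(), name="Color"), dtype=float)
model = LinearRegression().fit(X_encoded, y)
print("String input rejected:", fit_failed)
print("Encoded columns:", list(X_encoded.columns))
```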

### Types of Categorical Variables

**1. Nominal (No Order)**
- Categories have no inherent order
- Examples: Color (Red, Blue, Green), Country (USA, UK, India), Gender (Male, Female)
- **Solution:** One-Hot Encoding

**2. Ordinal (Has Order)**
- Categories have meaningful order
- Examples: Education (High School < Bachelor < Master < PhD), Rating (Bad < Good < Excellent)
- **Solution:** Label Encoding or Ordinal Encoding
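
The distinction matters because integer codes carry an order. The sketch below, using made-up education levels, compares scikit-learn's `OrdinalEncoder` with its default (alphabetical) category order against an explicitly supplied order. Only the explicit order matches the ranking described above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature
education = pd.DataFrame({"Education": ["Bachelor", "High School", "PhD", "Master"]})

# Default behaviour: categories are sorted alphabetically, so Bachelor < High School
alphabetical = OrdinalEncoder()
alphabetical_codes = alphabetical.fit_transform(education).ravel().astype(int).tolist()

# Explicit order: High School < Bachelor < Master < PhD
explicit = OrdinalEncoder(categories=[["High School", "Bachelor", "Master", "PhD"]])
explicit_codes = explicit.fit_transform(education).ravel().astype(int).tolist()

print("Alphabetical codes:", alphabetical_codes)
print("Explicit-order codes:", explicit_codes)
```

For a nominal feature such as Color, neither ordering would be meaningful, which is why one-hot encoding is preferred there.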

### What is One-Hot Encoding?

**One-Hot Encoding (OHE)** creates one binary column per category, with 1 indicating that the row belongs to that category and 0 indicating that it does not.

**Example:**
```
Original: Color = ['Red', 'Blue', 'Green', 'Red']

After OHE:
  Color_Red  Color_Blue  Color_Green
  1          0           0
  0          1           0
  0          0           1
  1          0           0
```

### How One-Hot Encoding Works

1. **Identify unique categories** in the column
2. **Create new binary column** for each category
3. **Set 1** where that category appears, **0** elsewhere
4. **Drop original column** (optional)
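
The four steps above can be sketched by hand with plain pandas. This is only an illustration of the mechanics, reusing the Color example from earlier. The library helpers shown later in this notebook do the same work.

```python
import pandas as pd

df_colors = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# Step 1: identify the unique categories (sorted for a stable column order)
categories = sorted(df_colors["Color"].unique())

# Steps 2 and 3: one binary column per category, 1 where the category appears
for category in categories:
    df_colors[f"Color_{category}"] = (df_colors["Color"] == category).astype(int)

# Step 4 (optional): drop the original text column
manual_ohe = df_colors.drop(columns=["Color"])
print(manual_ohe)
```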

### When to Use One-Hot Encoding

✅ **Use OHE When:**
- Categories are **nominal** (no order)
- Number of categories is **small to moderate** (roughly under 10-15)
- Categories are **equally important**
- Using algorithms that would treat integer codes as ordered quantities (linear models, KNN, SVM, neural networks)
- Want to avoid imposing false ordinal relationships

❌ **Avoid OHE When:**
- **High cardinality** (many unique categories)
- Categories have **ordinal relationship**
- **Memory constraints** (creates many columns)
- Using **tree-based implementations with native categorical support** (e.g., LightGBM, CatBoost)

### Advantages of One-Hot Encoding

- ✅ No ordinal relationship imposed
- ✅ Works with any algorithm that accepts numeric input
- ✅ Each category treated equally
- ✅ Easy to interpret
- ✅ Standard practice for nominal data

### Disadvantages

- ❌ **Curse of dimensionality** - creates one column per category
- ❌ **Sparse matrices** - mostly zeros
- ❌ **Memory intensive** with high cardinality
- ❌ **Multicollinearity** - the full set of dummy columns is linearly dependent
- ❌ Can slow down training

### The Dummy Variable Trap

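In a full set of one-hot columns, every row sums to 1, so any one column can be computed exactly from the others. This **perfect multicollinearity** is the dummy variable trap: in a linear model with an intercept, the coefficients are no longer uniquely determined. Dropping one column per feature removes the redundancy.

**Example:** If Color_Red=0 and Color_Blue=0, then Color_Green must be 1.

**Solution:** Drop one column (e.g., Color_Green).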
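
A quick way to see the redundancy is to check the rank of the design matrix. The sketch below adds an intercept column to the Color example and compares the full encoding with the reduced one. The helper `with_intercept` is a made-up name for this illustration. Note that `drop_first=True` drops the alphabetically first category (Color_Blue here) rather than Color_Green, which works equally well.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", "Red"], name="Color")

full = pd.get_dummies(colors, prefix="Color", dtype=int)
reduced = pd.get_dummies(colors, prefix="Color", drop_first=True, dtype=int)

def with_intercept(frame):
    """Prepend a column of ones, as a linear model with an intercept would."""
    return np.column_stack([np.ones(len(frame)), frame.to_numpy()])

# Full encoding: 4 columns but only 3 are linearly independent
full_matrix = with_intercept(full)
rank_full = np.linalg.matrix_rank(full_matrix)

# Reduced encoding: 3 columns, all linearly independent
reduced_matrix = with_intercept(reduced)
rank_reduced = np.linalg.matrix_rank(reduced_matrix)

print(f"Full:    {full_matrix.shape[1]} columns, rank {rank_full}")
print(f"Reduced: {reduced_matrix.shape[1]} columns, rank {rank_reduced}")
```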

Let's implement One-Hot Encoding with practical examples!

## Step 1: Import Libraries and Create Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')

# Create sample dataset with categorical variables
np.random.seed(42)

data = {
    'Customer_ID': range(1, 16),
    'Age': [25, 30, 35, 28, 42, 38, 45, 29, 33, 40, 27, 31, 36, 44, 26],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 
               'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'New York',
             'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'New York', 'Los Angeles',
             'Chicago', 'Houston', 'Phoenix'],
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone', 'Tablet', 'Laptop',
                'Phone', 'Tablet', 'Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone', 'Tablet'],
    'Purchase_Amount': [1200, 800, 500, 1300, 750, 520, 1250, 820, 510, 1280, 780, 530, 1220, 790, 495]
}

df = pd.DataFrame(data)

print("=" * 100)
print("SAMPLE DATASET WITH CATEGORICAL VARIABLES")
print("=" * 100)
print(df)

print("\n" + "-" * 100)
print("DATA TYPES:")
print("-" * 100)
print(df.dtypes)

print("\n" + "-" * 100)
print("CATEGORICAL COLUMNS ANALYSIS:")
print("-" * 100)
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    unique_values = df[col].nunique()
    print(f"\n{col}:")
    print(f"  Unique Values: {unique_values}")
    print(f"  Categories: {df[col].unique().tolist()}")
    print(f"  Value Counts:\n{df[col].value_counts()}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('Categorical Variables Distribution', fontsize=16, fontweight='bold')

for idx, col in enumerate(['Gender', 'City', 'Product']):
    counts = df[col].value_counts()
    axes[idx].bar(counts.index, counts.values, color='skyblue', alpha=0.7, edgecolor='black', linewidth=2)
    axes[idx].set_xlabel(col, fontweight='bold', fontsize=12)
    axes[idx].set_ylabel('Count', fontweight='bold', fontsize=12)
    axes[idx].set_title(f'{col} Distribution', fontweight='bold', fontsize=13)
    axes[idx].grid(axis='y', alpha=0.3)
    axes[idx].tick_params(axis='x', rotation=45)
    
    for i, (category, count) in enumerate(counts.items()):
        axes[idx].text(i, count + 0.1, str(count), ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("✅ Dataset loaded successfully!")
print("   Next: Apply One-Hot Encoding to convert categories to numbers")
print("=" * 100)

## Method 1: Pandas get_dummies()

The simplest way to perform One-Hot Encoding in Python is using pandas `get_dummies()`.

In [None]:
# Method 1: pandas get_dummies()
print("METHOD 1: PANDAS GET_DUMMIES()")
print("=" * 100)

# Apply get_dummies to specific columns
df_encoded = pd.get_dummies(df, columns=['Gender', 'City', 'Product'], drop_first=False)

print("\nOriginal DataFrame Shape:", df.shape)
print("Encoded DataFrame Shape:", df_encoded.shape)

print("\n" + "-" * 100)
print("ENCODED DATASET (first 10 rows):")
print("-" * 100)
print(df_encoded.head(10))

print("\n" + "-" * 100)
print("NEW COLUMNS CREATED:")
print("-" * 100)
new_cols = [col for col in df_encoded.columns if col not in df.columns]
print(new_cols)

# With drop_first=True (avoid dummy variable trap)
df_encoded_drop = pd.get_dummies(df, columns=['Gender', 'City', 'Product'], drop_first=True)

print("\n" + "=" * 100)
print("COMPARISON: drop_first=False vs drop_first=True")
print("=" * 100)
print(f"Without dropping first: {df_encoded.shape[1]} columns")
print(f"With dropping first:    {df_encoded_drop.shape[1]} columns")
print(f"Difference:             {df_encoded.shape[1] - df_encoded_drop.shape[1]} columns dropped")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Show encoding for Gender
sample_data = df_encoded[['Customer_ID', 'Gender_Male', 'Gender_Female']].head(8)
axes[0].axis('off')
table = axes[0].table(cellText=sample_data.values,
                      colLabels=sample_data.columns,
                      cellLoc='center',
                      loc='center',
                      colWidths=[0.3, 0.35, 0.35])
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2.5)
axes[0].set_title('One-Hot Encoding: Gender Example', fontweight='bold', fontsize=14, pad=20)

# Column count comparison
methods = ['Original', 'OHE (keep all)', 'OHE (drop first)']
col_counts = [df.shape[1], df_encoded.shape[1], df_encoded_drop.shape[1]]
bars = axes[1].bar(methods, col_counts, color=['blue', 'orange', 'green'], 
                   alpha=0.7, edgecolor='black', linewidth=2)
axes[1].set_ylabel('Number of Columns', fontweight='bold', fontsize=12)
axes[1].set_title('Column Count Comparison', fontweight='bold', fontsize=14)
axes[1].grid(axis='y', alpha=0.3)

for bar in bars:
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height + 0.5,
                f'{int(height)}', ha='center', fontweight='bold', fontsize=12)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("💡 KEY POINTS:")
print("  • drop_first=False: Creates column for EACH category (default)")
print("  • drop_first=True: Drops the first category of each feature (avoids multicollinearity)")
print("  • For Gender: Gender_Female is dropped, so Gender_Male=0 implies Female")
print("=" * 100)

## Method 2: Scikit-learn OneHotEncoder

For machine learning pipelines, use sklearn's `OneHotEncoder` class.

In [None]:
# Method 2: Scikit-learn OneHotEncoder
print("METHOD 2: SCIKIT-LEARN ONEHOTENCODER")
print("=" * 100)

# Initialize encoder
encoder = OneHotEncoder(drop='first', sparse_output=False)

# Select categorical columns
cat_cols = ['Gender', 'City', 'Product']

# Fit and transform
encoded_array = encoder.fit_transform(df[cat_cols])

# Get feature names
feature_names = encoder.get_feature_names_out(cat_cols)

# Create DataFrame
df_sklearn_encoded = pd.DataFrame(encoded_array, columns=feature_names)

# Combine with original numerical columns
df_final = pd.concat([df[['Customer_ID', 'Age', 'Purchase_Amount']], df_sklearn_encoded], axis=1)

print("\nEncoded Array Shape:", encoded_array.shape)
print("\nFeature Names Created:")
print(list(feature_names))

print("\n" + "-" * 100)
print("FINAL DATASET (first 10 rows):")
print("-" * 100)
print(df_final.head(10))

print("\n" + "-" * 100)
print("ENCODER PROPERTIES:")
print("-" * 100)
print(f"Categories per feature: {[len(cat) for cat in encoder.categories_]}")
print(f"\nActual categories:")
for i, col in enumerate(cat_cols):
    print(f"  {col}: {list(encoder.categories_[i])}")

print("\n" + "=" * 100)
print("✅ ADVANTAGES OF SKLEARN ONEHOTENCODER:")
print("  • Works in sklearn pipelines")
print("  • Can handle unseen categories (with handle_unknown='ignore')")
print("  • Applies the same fitted categories to train and test data")
print("  ‚Ä¢ Can inverse_transform")
print("=" * 100)

## Handling High Cardinality

When a categorical variable has many unique values (high cardinality), One-Hot Encoding can create too many columns.

In [None]:
# Demonstrating High Cardinality Problem
print("HIGH CARDINALITY CHALLENGE")
print("=" * 100)

# Create high cardinality example
np.random.seed(42)
high_card_data = {
    'Customer_ID': range(1, 101),
    'Country': np.random.choice(['USA', 'UK', 'Germany', 'France', 'Spain', 'Italy', 
                                 'Canada', 'Australia', 'Japan', 'China', 'India', 'Brazil',
                                 'Mexico', 'Russia', 'South Korea', 'Netherlands', 'Sweden',
                                 'Norway', 'Denmark', 'Finland'], 100),
    'Product_ID': [f'PROD_{i:04d}' for i in np.random.randint(1, 51, 100)],
    'Sales': np.random.randint(100, 1000, 100)
}

df_high_card = pd.DataFrame(high_card_data)

print("\nDataset with High Cardinality Features:")
print(f"  Rows: {len(df_high_card)}")
print(f"  Country unique values: {df_high_card['Country'].nunique()}")
print(f"  Product_ID unique values: {df_high_card['Product_ID'].nunique()}")

# Apply OHE to see the explosion
df_ohe_high = pd.get_dummies(df_high_card, columns=['Country', 'Product_ID'])

print("\n" + "-" * 100)
print("IMPACT OF ONE-HOT ENCODING:")
print("-" * 100)
print(f"Original columns: {df_high_card.shape[1]}")
print(f"After OHE: {df_ohe_high.shape[1]}")
print(f"Columns added: {df_ohe_high.shape[1] - df_high_card.shape[1]}")

# Strategies for high cardinality
print("\n" + "=" * 100)
print("STRATEGIES FOR HIGH CARDINALITY:")
print("=" * 100)

# Strategy 1: Keep only top N categories
top_n = 5
print(f"\n1. Keep Top {top_n} Categories + 'Other':")
top_countries = df_high_card['Country'].value_counts().head(top_n).index
df_reduced = df_high_card.copy()
df_reduced['Country_Grouped'] = df_reduced['Country'].apply(
    lambda x: x if x in top_countries else 'Other'
)
print(f"   Original unique values: {df_high_card['Country'].nunique()}")
print(f"   After grouping: {df_reduced['Country_Grouped'].nunique()}")

# Strategy 2: Frequency encoding
freq_encoding = df_high_card['Country'].value_counts() / len(df_high_card)
df_freq = df_high_card.copy()
df_freq['Country_Frequency'] = df_freq['Country'].map(freq_encoding)
print(f"\n2. Frequency Encoding:")
print(f"   Sample: {dict(list(freq_encoding.head(3).items()))}")

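# Strategy 3: Target encoding (requires a target variable; Sales is used here)
# Note: means computed on the full dataset leak the target; in practice, compute them on training data only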
target_mean = df_high_card.groupby('Country')['Sales'].mean()
df_target = df_high_card.copy()
df_target['Country_Target_Encoded'] = df_target['Country'].map(target_mean)
print(f"\n3. Target Encoding (Mean Sales by Country):")
print(f"   Sample: {dict(list(target_mean.head(3).items()))}")

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle('High Cardinality Handling Strategies', fontsize=16, fontweight='bold')

# Original distribution
country_counts = df_high_card['Country'].value_counts().head(10)
axes[0, 0].barh(country_counts.index, country_counts.values, color='skyblue', alpha=0.7, edgecolor='black')
axes[0, 0].set_xlabel('Count', fontweight='bold')
axes[0, 0].set_title('Original: Top 10 Countries', fontweight='bold')
axes[0, 0].grid(axis='x', alpha=0.3)

# After grouping
grouped_counts = df_reduced['Country_Grouped'].value_counts()
axes[0, 1].bar(grouped_counts.index, grouped_counts.values, color='lightgreen', alpha=0.7, edgecolor='black')
axes[0, 1].set_xlabel('Country Group', fontweight='bold')
axes[0, 1].set_ylabel('Count', fontweight='bold')
axes[0, 1].set_title(f'After Grouping (Top {top_n} + Other)', fontweight='bold')
axes[0, 1].tick_params(axis='x', rotation=45)
axes[0, 1].grid(axis='y', alpha=0.3)

# Column explosion comparison
methods = ['Original', 'OHE\n(all categories)', 'OHE\n(top 5 + Other)']
columns = [df_high_card.shape[1], df_ohe_high.shape[1], 
           pd.get_dummies(df_reduced, columns=['Country_Grouped']).shape[1]]
colors_comp = ['blue', 'red', 'green']
bars = axes[1, 0].bar(methods, columns, color=colors_comp, alpha=0.7, edgecolor='black', linewidth=2)
axes[1, 0].set_ylabel('Number of Columns', fontweight='bold')
axes[1, 0].set_title('Column Count Comparison', fontweight='bold')
axes[1, 0].grid(axis='y', alpha=0.3)

for bar in bars:
    height = bar.get_height()
    axes[1, 0].text(bar.get_x() + bar.get_width()/2., height + 1,
                   f'{int(height)}', ha='center', fontweight='bold')

# Encoding methods comparison
axes[1, 1].axis('off')
summary_text = f"""
ENCODING METHOD COMPARISON

Original Features:
• Country: {df_high_card['Country'].nunique()} unique values
• Product_ID: {df_high_card['Product_ID'].nunique()} unique values

One-Hot Encoding Results:
• Columns created: {df_ohe_high.shape[1] - df_high_card.shape[1]}
• Total columns: {df_ohe_high.shape[1]}

Alternative Strategies:
1. Top-N + Other: {pd.get_dummies(df_reduced, columns=['Country_Grouped']).shape[1]} columns
2. Frequency Encoding: +1 column per feature
3. Target Encoding: +1 column per feature

Recommendation: Use frequency or target encoding
for high cardinality (>10-15 categories)
"""
axes[1, 1].text(0.1, 0.5, summary_text, fontsize=10, family='monospace',
               verticalalignment='center',
               bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("💡 HIGH CARDINALITY BEST PRACTICES:")
print("  • < 10 categories: Use One-Hot Encoding")
print("  • 10-50 categories: Group rare categories or use target encoding")
print("  • > 50 categories: Avoid OHE, use frequency/target encoding")
print("  • Consider domain knowledge for grouping")
print("=" * 100)

## Summary: One-Hot Encoding Best Practices

### Quick Decision Guide

| Unique Categories | Recommendation |
|-------------------|----------------|
| 2-10 | ✅ Use One-Hot Encoding |
| 11-20 | ⚠️ Use OHE with caution or group rare categories |
| 21-50 | ❌ Avoid OHE, use target/frequency encoding |
| More than 50 | ❌ Avoid OHE, use embeddings or other methods |

### pandas get_dummies() vs sklearn OneHotEncoder

| Feature | pandas get_dummies() | sklearn OneHotEncoder |
|---------|---------------------|----------------------|
| **Simplicity** | ✅ Very simple | Slightly more complex |
| **Pipeline Integration** | ❌ No | ✅ Yes |
| **Handle Unknown Categories** | ❌ No | ✅ Yes (with handle_unknown='ignore') |
| **Inverse Transform** | ⚠️ Separate function (`pd.from_dummies()`, pandas 1.5+) | ✅ Yes (`inverse_transform()`) |
| **Sparse Matrix** | ⚠️ Optional (`sparse=True`) | ✅ Yes (sparse by default) |
| **Best For** | Quick analysis, EDA | Production ML pipelines |

### Best Practices

✅ **DO:**
- Use OHE for nominal (non-ordinal) categorical variables
- Drop the first category when fitting linear models, to avoid multicollinearity (drop_first=True)
- Check cardinality before encoding
- Use sklearn OneHotEncoder for production pipelines
- Split the data first, fit the encoder on the training set only, then apply it to the test set
- Document which categories were encoded
- Handle missing values before encoding

❌ **DON'T:**
- Use OHE for ordinal variables (use ordinal encoding instead)
- Use OHE with high cardinality (>15-20 categories)
- Fit the encoder on the full dataset before splitting (test-set information leaks into preprocessing)
- Forget to handle unseen categories in test data
- Use OHE when the tree-based library you use can handle categories directly
- Create too many sparse columns (memory issues)

### Common Pitfalls

**1. Data Leakage**
```python
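from sklearn.model_selection import train_test_split

# ❌ WRONG - encoding before split
df_encoded = pd.get_dummies(df)
X_train, X_test = train_test_split(df_encoded)

# ✅ CORRECT - split first, then derive the columns from the training data only
X_train, X_test = train_test_split(df)
X_train_encoded = pd.get_dummies(X_train)
# Align the test columns to the training columns:
# categories missing from the test set are filled with 0, unseen ones are dropped
X_test_encoded = pd.get_dummies(X_test).reindex(columns=X_train_encoded.columns, fill_value=0)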
```

**2. Unseen Categories in Test Data**
```python
# ❌ WRONG - may have different columns
X_train_ohe = pd.get_dummies(X_train)
X_test_ohe = pd.get_dummies(X_test)  # Different categories!

# ✅ CORRECT - use sklearn encoder
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(X_train)
X_train_ohe = encoder.transform(X_train)
X_test_ohe = encoder.transform(X_test)
```

**3. The Dummy Variable Trap**
```python
# ⚠️ Creates multicollinearity
df_ohe = pd.get_dummies(df, drop_first=False)

# ✅ Avoids multicollinearity
df_ohe = pd.get_dummies(df, drop_first=True)
```

### When NOT to Use One-Hot Encoding

1. **Ordinal Data** - Use Label/Ordinal Encoding instead
2. **High Cardinality** - Use target/frequency encoding
3. **Tree-based Models** - Some libraries (e.g., LightGBM, CatBoost) handle categories natively
4. **Deep Learning** - Use embeddings for categories
5. **Memory Constraints** - OHE creates wide, mostly-zero matrices
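
For the high-cardinality case, here is a minimal sketch of frequency and target encoding that learns its statistics from the training data only and applies them to the test data. The frames and values are made up. Falling back to the global mean (or a frequency of 0) for unseen categories is one reasonable choice, not the only one.

```python
import pandas as pd

# Hypothetical training data with a target, and test data with an unseen country
train = pd.DataFrame({
    "Country": ["USA", "USA", "UK", "UK", "India"],
    "Sales": [100, 300, 400, 600, 500],
})
test = pd.DataFrame({"Country": ["UK", "India", "Japan"]})

# Statistics are computed on the training set only
global_mean = train["Sales"].mean()
country_means = train.groupby("Country")["Sales"].mean()
country_freq = train["Country"].value_counts(normalize=True)

# Apply to the test set; unseen categories get a fallback value
test["Country_Target_Encoded"] = test["Country"].map(country_means).fillna(global_mean)
test["Country_Frequency"] = test["Country"].map(country_freq).fillna(0.0)
print(test)
```

Each feature adds a single numeric column regardless of how many categories it has, which is the main appeal over OHE here.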

### Code Template

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

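# 1. Split data first (X, y, and categorical_cols are assumed to be defined)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# 2. Initialize encoder
encoder = OneHotEncoder(
    drop='first',  # Avoid dummy variable trap
    sparse_output=False,  # Return dense array (scikit-learn >= 1.2; older versions use sparse=False)
    handle_unknown='ignore'  # Unseen categories encode as all zeros, same as the dropped category
)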

# 3. Fit on training data only
encoder.fit(X_train[categorical_cols])

# 4. Transform both sets
X_train_encoded = encoder.transform(X_train[categorical_cols])
X_test_encoded = encoder.transform(X_test[categorical_cols])

# 5. Get feature names
feature_names = encoder.get_feature_names_out(categorical_cols)
```

### Final Recommendations

- **For EDA/Quick Analysis**: Use `pd.get_dummies()`
- **For ML Pipelines**: Use `sklearn.preprocessing.OneHotEncoder`
- **For High Cardinality**: Consider target encoding or embeddings
- **For Ordinal Data**: Use ordinal encoding (next notebook!)

---

**You've mastered One-Hot Encoding!** Next, we'll learn about Label and Ordinal Encoding for ordered categorical variables.