# 021 - CatBoost: Ordered Boosting & Categorical Feature Mastery

## 📘 Introduction

**CatBoost** (Categorical Boosting) is Yandex's gradient boosting framework designed to excel at handling categorical features. Released in 2017, it introduces **ordered boosting** to reduce prediction shift and **ordered target statistics** for robust categorical encoding.

### Key Innovations

1. **Ordered Boosting** - Eliminates prediction shift from target leakage
2. **Ordered Target Statistics** - Smart categorical encoding without overfitting
3. **Oblivious Trees** - Symmetric trees with same split at each level (fast prediction)
4. **GPU Acceleration** - Optimized for both training and inference
5. **Robust to Overfitting** - Performs well with default parameters

### When to Use CatBoost

✅ **High-cardinality categoricals** (equipment_id: 1000s, user_id: millions)  
✅ **Small-medium datasets** (<1M samples) where accuracy matters most  
✅ **Need robust defaults** (minimal hyperparameter tuning)  
✅ **Interpretability required** (symmetric trees easier to visualize)  
✅ **Limited time for feature engineering** (handles categoricals automatically)  

### Comparison with XGBoost and LightGBM

| Aspect | XGBoost | LightGBM | CatBoost |
|--------|---------|----------|----------|
| **Categorical handling** | Manual encoding | Native (basic) | **Ordered target stats** |
| **Speed (large data)** | Medium | **Fastest** | Slower |
| **Accuracy (small data)** | High | High | **Highest** |
| **Overfitting resistance** | Good (with tuning) | Good (with tuning) | **Excellent (defaults)** |
| **Tree structure** | Level-wise | Leaf-wise | **Oblivious (symmetric)** |
| **Default performance** | Needs tuning | Needs tuning | **Strong defaults** |
| **GPU support** | Yes | Yes | **Yes (optimized)** |
| **Best for** | General purpose | Large datasets | Categoricals + small data |

### Learning Path Context

- **018_Gradient_Boosting** - Sequential boosting foundations
- **019_XGBoost** - Regularized gradient boosting
- **020_LightGBM** - Histogram-based speedup
- **021_CatBoost (this)** - Ordered boosting + categorical mastery
- **022_Voting_Stacking** (next) - Ensemble meta-learning


## 🔄 CatBoost Workflow

```mermaid
graph TD
    A[Input Data] --> B{Detect Categorical Features}
    B --> C[Ordered Target Statistics]
    C --> D[Random Permutations]
    D --> E[Calculate Target Stats]
    E --> F[Ordered Boosting]
    F --> G[Build Oblivious Tree]
    G --> H[Same Split at Each Level]
    H --> I{More Trees?}
    I -->|Yes| F
    I -->|No| J[Final Ensemble]
    J --> K[Predictions]
    
    style C fill:#e1f5ff
    style F fill:#fff4e1
    style G fill:#f0e1ff
    style J fill:#e1ffe1
```

**Key Stages:**
1. **Ordered Target Statistics** - Encode categoricals using historical targets
2. **Random Permutations** - Multiple orderings prevent overfitting
3. **Ordered Boosting** - Use different subsets for residual calculation
4. **Oblivious Trees** - Symmetric structure (2^depth leaves)
5. **Ensemble** - Combine trees with shrinkage


## 📐 Mathematical Foundation

### 1. Ordered Target Statistics (Categorical Encoding)

Traditional target encoding causes **target leakage** when the same data point's target influences its encoding.

**Problem with naive target encoding:**
$$\hat{x}_i^{cat} = \frac{\sum_{j: x_j^{cat} = x_i^{cat}} y_j}{\sum_{j: x_j^{cat} = x_i^{cat}} 1}$$

This includes $y_i$ in its own encoding → overfitting!

**CatBoost's Ordered Target Statistics:**
$$\hat{x}_i^{cat} = \frac{\sum_{j: j < i, x_j^{cat} = x_i^{cat}} y_j + a \cdot P}{\sum_{j: j < i, x_j^{cat} = x_i^{cat}} 1 + a}$$

Where:
- $j < i$: Only use **previous examples** in the random permutation
- $P$: Prior (global target mean)
- $a$: Weight of prior (regularization, default 1)

**Key insight:** Each sample's encoding depends only on samples that came **before** it in a random ordering.
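To make the formula concrete, here is a minimal pure-Python sketch of the ordered statistic on five rows. It assumes the rows are already in their random permutation order, uses the global target mean as $P$ and $a = 1$, and the function name `ordered_target_stats` is made up for illustration. CatBoost's real implementation averages over several permutations and differs in detail.

```python
from collections import defaultdict

def ordered_target_stats(categories, targets, prior, a=1.0):
    """Encode each row using only the rows that precede it in the given order."""
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running row count per category
    encoded = []
    for cat, y in zip(categories, targets):
        # History only: the current row's target is added AFTER encoding it
        encoded.append((sums[cat] + a * prior) / (counts[cat] + a))
        sums[cat] += y
        counts[cat] += 1
    return encoded

cats = ["A", "B", "A", "A", "B"]
ys = [1.0, 0.0, 0.0, 1.0, 1.0]
prior = sum(ys) / len(ys)  # global target mean = 0.6

enc = ordered_target_stats(cats, ys, prior)
print([round(v, 3) for v in enc])  # [0.6, 0.6, 0.8, 0.533, 0.3]
```

The first occurrence of each category falls back to the prior, and later occurrences drift toward the category's observed mean as history accumulates.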

---

### 2. Ordered Boosting

Traditional gradient boosting suffers from **prediction shift** - the model used to compute gradients sees the same data it was trained on.

**Standard GBM (prediction shift):**
$$g_i = \frac{\partial L(y_i, F_{t-1}(x_i))}{\partial F_{t-1}(x_i)}$$
Where $F_{t-1}$ was trained on data including $x_i$ → biased gradients!

**CatBoost's Ordered Boosting:**
$$g_i = \frac{\partial L(y_i, M_i(x_i))}{\partial M_i(x_i)}$$
Where $M_i$ is trained on $\{(x_j, y_j): j < i\}$ only

**Implementation:** Use multiple random permutations. For each sample $i$, compute the gradient using a model trained only on the samples that come before $i$ in that permutation.
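The idea can be illustrated with a deliberately tiny stand-in: a constant model (the mean of the targets seen so far) instead of a tree. This is only a sketch under that assumption. The helper names are hypothetical, and CatBoost itself uses trees and avoids training a separate model per sample.

```python
import random

def in_sample_residuals(ys):
    """Residuals from a constant model fit on ALL targets (y_i leaks into its own prediction)."""
    mean_all = sum(ys) / len(ys)
    return [y - mean_all for y in ys]

def ordered_residuals(ys, perm, init=0.0):
    """Residual of sample i uses a constant model M_i fit only on samples earlier in `perm`."""
    residuals = [0.0] * len(ys)
    running_sum, seen = 0.0, 0
    for idx in perm:
        m_i = running_sum / seen if seen else init  # M_i has never seen ys[idx]
        residuals[idx] = ys[idx] - m_i
        running_sum += ys[idx]
        seen += 1
    return residuals

ys = [3.0, 5.0, 4.0, 8.0]
perm = list(range(len(ys)))
random.Random(0).shuffle(perm)  # one random ordering of the samples

print("in-sample:", in_sample_residuals(ys))
print("ordered:  ", ordered_residuals(ys, perm))
```

Changing a single target changes the in-sample prediction for that row, but leaves its ordered prediction $M_i(x_i)$ untouched, which is the property ordered boosting relies on.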

---

### 3. Oblivious Decision Trees

CatBoost uses **oblivious (symmetric) trees** where all nodes at the same level use the **same splitting criterion**.

**Standard tree:** Each node has independent split → 2^depth - 1 splits  
**Oblivious tree:** Same split at each level → depth splits only

**Structure:**
```
Depth 0:        [Split: feature_1 < 5]
               /                      \
Depth 1:  [Split: feature_2 < 10]  [Split: feature_2 < 10]
          /        \               /        \
Leaves:  L1       L2             L3        L4
```

**Advantages:**
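- **Fast prediction**: the leaf index is built from `depth` comparisons that are the same for every sample, so evaluation is branch-free and vectorizes well (a standard tree also needs $O(\text{depth})$ comparisons, but along a data-dependent path)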
- **Less overfitting**: Fewer parameters (depth splits vs 2^depth - 1 splits)
- **Interpretable**: Easier to visualize and understand

**Prediction formula:**
$$F(x) = \sum_{t=1}^T \alpha_t \cdot h_t(x)$$
Where each $h_t$ is an oblivious tree with $2^{\text{depth}}$ leaves.
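Because every level shares one split, evaluating an oblivious tree reduces to building a `depth`-bit leaf index. The sketch below mirrors the depth-2 diagram above. The leaf values and function names are hypothetical, and it illustrates the idea rather than CatBoost's actual evaluator.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Evaluate one oblivious tree: one (feature_index, threshold) test per level."""
    leaf_index = 0
    for feature_index, threshold in splits:
        # Each level contributes one bit of the leaf index
        bit = 1 if x[feature_index] >= threshold else 0
        leaf_index = (leaf_index << 1) | bit
    return leaf_values[leaf_index]

def ensemble_predict(x, trees, alphas):
    """F(x) = sum_t alpha_t * h_t(x) over a list of (splits, leaf_values) trees."""
    return sum(a * oblivious_tree_predict(x, s, v) for a, (s, v) in zip(alphas, trees))

# Depth-2 tree: level 0 tests feature_1 < 5, level 1 tests feature_2 < 10
splits = [(0, 5.0), (1, 10.0)]
leaf_values = [1.0, 2.0, 3.0, 4.0]  # L1..L4 (hypothetical values)

print(oblivious_tree_predict([3.0, 12.0], splits, leaf_values))  # 2.0 (leaf L2)
print(oblivious_tree_predict([7.0, 2.0], splits, leaf_values))   # 3.0 (leaf L3)
```

A tree of depth $d$ stores only $d$ splits and $2^d$ leaf values, which is where the reduced parameter count comes from.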

---

### 4. Loss Function (same as GBM/XGBoost)

**Regression (MSE):**
$$L = \frac{1}{N} \sum_{i=1}^N (y_i - F(x_i))^2$$

**Classification (Logloss):**
$$L = -\frac{1}{N} \sum_{i=1}^N [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)]$$

Where $p_i = \sigma(F(x_i))$ and $\sigma(z) = 1/(1 + e^{-z})$
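As a quick check on these formulas, the following sketch evaluates both losses and the per-sample derivatives with respect to the raw score $F(x_i)$ that a boosting step would fit. The function names are illustrative, and the gradients omit the $1/N$ factor.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse(y_true, raw_scores):
    return sum((y - f) ** 2 for y, f in zip(y_true, raw_scores)) / len(y_true)

def logloss(y_true, raw_scores):
    total = 0.0
    for y, f in zip(y_true, raw_scores):
        p = sigmoid(f)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def mse_gradient(y, f):
    """d/dF of (y - F)^2 for one sample."""
    return 2.0 * (f - y)

def logloss_gradient(y, f):
    """d/dF of the per-sample logloss, which simplifies to p - y."""
    return sigmoid(f) - y

print(round(logloss([1, 0], [0.0, 0.0]), 4))  # 0.6931 (= ln 2)
print(logloss_gradient(1, 0.0))               # -0.5
print(mse_gradient(3.0, 1.0))                 # -4.0
```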

---

### Key Hyperparameters

| Parameter | Description | Default | Typical Range |
|-----------|-------------|---------|---------------|
| `iterations` | Number of trees | 1000 | 100-5000 |
| `learning_rate` | Shrinkage | 0.03 | 0.01-0.3 |
| `depth` | Tree depth | 6 | 4-10 |
| `l2_leaf_reg` | L2 regularization | 3.0 | 1-10 |
| `border_count` | Splits per feature | 254 | 32-255 |
| `random_strength` | Randomness in splits | 1.0 | 0-10 |
| `bagging_temperature` | Bayesian bootstrap | 1.0 | 0-10 |

**CatBoost's strength:** Works well with defaults, minimal tuning needed!
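If tuning does become necessary, one simple option is random search over the "Typical Range" column. The sketch below only samples candidate configurations. The helper name `sample_catboost_params` is hypothetical, the ranges are the table's suggestions rather than CatBoost limits, and each resulting dict would be passed on as, for example, `CatBoostRegressor(**params)`.

```python
import math
import random

def sample_catboost_params(rng):
    """Draw one configuration from the table's typical ranges."""
    return {
        "iterations": rng.randint(100, 5000),
        # Log-uniform, so small learning rates are sampled as often as large ones
        "learning_rate": math.exp(rng.uniform(math.log(0.01), math.log(0.3))),
        "depth": rng.randint(4, 10),
        "l2_leaf_reg": rng.uniform(1.0, 10.0),
        "border_count": rng.randint(32, 255),
    }

rng = random.Random(0)  # fixed seed so the candidate list is reproducible
candidates = [sample_catboost_params(rng) for _ in range(5)]
for params in candidates:
    print(params)
```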


## 🔧 Installation

### 📝 What's Happening in This Code?

**Purpose:** Install CatBoost library for gradient boosting with categorical features.

**Key Points:**
- **CatBoost**: Yandex's gradient boosting with ordered boosting
- **GPU support**: Available with CUDA, but opt-in: pass `task_type='GPU'` to the model (training runs on CPU by default)
- **Compatible**: Works with sklearn API and native CatBoost API


In [None]:
# Install CatBoost
# !pip install catboost

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score

# CatBoost
import catboost as cb
from catboost import CatBoostClassifier, CatBoostRegressor, Pool

# For comparison
import xgboost as xgb
import lightgbm as lgb

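# The GPU helper lives in catboost.utils rather than at the package top level
from catboost.utils import get_gpu_device_count

print("✅ Libraries loaded successfully")
print(f"   CatBoost version: {cb.__version__}")
print(f"   GPU available: {get_gpu_device_count() > 0}")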

## 🏷️ Categorical Encoding: Ordered Target Statistics vs One-Hot

### 📝 What's Happening in This Code?

**Purpose:** Demonstrate why CatBoost's ordered target statistics outperforms traditional encoding methods.

**Key Points:**
- **One-hot encoding**: 100 categories → 100 columns (dimension explosion)
- **Target encoding (naive)**: Uses target values → causes overfitting
- **Ordered target statistics**: Uses only **previous** samples → no leakage
- **Regularization**: Prior weight prevents overfitting on rare categories

**Why This Matters:** For post-silicon data with equipment_id (1000s) and lot_id (10,000s), one-hot encoding produces thousands of sparse columns and naive target encoding leaks the label. CatBoost handles both problems automatically.
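The leakage claim can be checked on pure noise. In the sketch below the category carries no signal at all, so any correlation between the encoded feature and the target comes from the encoding itself. It uses synthetic data with a fixed seed, $a = 1$, and treats the row order as the random permutation. All names are illustrative.

```python
import random
from collections import defaultdict

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

rng = random.Random(42)
n_rows, n_cats = 2000, 500
cats = [rng.randrange(n_cats) for _ in range(n_rows)]
ys = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]  # noise: category is uninformative
prior = sum(ys) / n_rows

# Naive target encoding: per-category mean, including each row's own target
sums, counts = defaultdict(float), defaultdict(int)
for c, y in zip(cats, ys):
    sums[c] += y
    counts[c] += 1
naive = [sums[c] / counts[c] for c in cats]

# Ordered target statistic: only rows seen earlier contribute
run_sum, run_cnt = defaultdict(float), defaultdict(int)
ordered = []
for c, y in zip(cats, ys):
    ordered.append((run_sum[c] + prior) / (run_cnt[c] + 1.0))
    run_sum[c] += y
    run_cnt[c] += 1

corr_naive = pearson(naive, ys)
corr_ordered = pearson(ordered, ys)
print(f"naive encoding vs target:   {corr_naive:.3f}")
print(f"ordered encoding vs target: {corr_ordered:.3f}")
```

With about four rows per category, the naive encoding correlates strongly with the target even though no real relationship exists, while the ordered encoding stays near zero.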


In [None]:
# Generate dataset with high-cardinality categorical
np.random.seed(42)
n_samples = 5000

# High-cardinality categorical (100 equipment IDs)
equipment_ids = [f'EQ_{i:03d}' for i in range(100)]
equipment_id = np.random.choice(equipment_ids, n_samples)

# Equipment has systematic effect on yield
equipment_quality = {eq: np.random.normal(0, 10) for eq in equipment_ids}

# Numerical features
voltage = np.random.normal(1.8, 0.05, n_samples)
temperature = np.random.uniform(25, 85, n_samples)

# Target depends on equipment + numerical features
y = (
    100 +
    np.array([equipment_quality[eq] for eq in equipment_id]) +
    5 * (voltage - 1.8) / 0.05 +
    -0.2 * (temperature - 25) +
    np.random.normal(0, 3, n_samples)
)

# Create DataFrame
df = pd.DataFrame({
    'equipment_id': equipment_id,
    'voltage': voltage,
    'temperature': temperature,
    'yield_score': y
})

print("📊 High-Cardinality Categorical Dataset:")
print(df.head(10))
print(f"\nShape: {df.shape}")
print(f"Unique equipment IDs: {df['equipment_id'].nunique()}")
print(f"\nEquipment ID distribution (top 10):")
print(df['equipment_id'].value_counts().head(10))

# Split data
X = df.drop('yield_score', axis=1)
y = df['yield_score'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\n✅ Train: {len(X_train)}, Test: {len(X_test)}")

### 📝 What's Happening in This Code?

**Purpose:** Compare CatBoost (ordered target stats) vs XGBoost (one-hot) vs LightGBM (native categorical).

**Key Points:**
- **CatBoost**: Specify `cat_features` → automatic ordered target statistics
- **XGBoost**: Requires manual one-hot encoding → 100 columns from 1 feature
- **LightGBM**: Native categorical but simpler encoding (not ordered)
- **Metrics**: Training time, test MSE, test R²

**Why This Matters:** On high-cardinality categoricals, CatBoost often matches or beats one-hot pipelines with less preprocessing. The size of any gain depends on the dataset, so measure it as below.


In [None]:
print("🚀 Comparing categorical encoding methods...\n")

# 1. CatBoost with ordered target statistics
print("1️⃣ CatBoost (Ordered Target Statistics)")
start_time = time.time()

cat_model = CatBoostRegressor(
    iterations=200,
    learning_rate=0.1,
    depth=6,
    random_state=42,
    verbose=False
)

cat_model.fit(
    X_train, y_train,
    cat_features=['equipment_id'],  # Specify categorical features
    eval_set=(X_test, y_test),
    early_stopping_rounds=20,
    verbose=False
)

cat_time = time.time() - start_time
cat_pred = cat_model.predict(X_test)
cat_mse = mean_squared_error(y_test, cat_pred)
cat_r2 = r2_score(y_test, cat_pred)

print(f"   Time: {cat_time:.2f}s")
print(f"   Test MSE: {cat_mse:.2f}")
print(f"   Test R²:  {cat_r2:.4f}")
print(f"   Trees: {cat_model.tree_count_}")

# 2. XGBoost with one-hot encoding
print(f"\n2️⃣ XGBoost (One-Hot Encoding)")
X_train_ohe = pd.get_dummies(X_train, columns=['equipment_id'])
X_test_ohe = pd.get_dummies(X_test, columns=['equipment_id'])
X_test_ohe = X_test_ohe.reindex(columns=X_train_ohe.columns, fill_value=0)

start_time = time.time()
xgb_model = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
    random_state=42,
    n_jobs=-1,
    early_stopping_rounds=20  # constructor argument (xgboost >= 1.6); fit() no longer accepts it in 2.x
)
xgb_model.fit(
    X_train_ohe, y_train,
    eval_set=[(X_test_ohe, y_test)],
    verbose=False
)
xgb_time = time.time() - start_time

xgb_pred = xgb_model.predict(X_test_ohe)
xgb_mse = mean_squared_error(y_test, xgb_pred)
xgb_r2 = r2_score(y_test, xgb_pred)

print(f"   Time: {xgb_time:.2f}s")
print(f"   Features after encoding: {X_train_ohe.shape[1]} (from 3 original)")
print(f"   Test MSE: {xgb_mse:.2f}")
print(f"   Test R²:  {xgb_r2:.4f}")

# 3. LightGBM with native categorical
print(f"\n3️⃣ LightGBM (Native Categorical)")
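# LightGBM rejects object-dtype columns, so cast to pandas 'category'
# using levels derived from the training split only
eq_dtype = pd.CategoricalDtype(categories=sorted(X_train['equipment_id'].unique()))
X_train_lgb = X_train.assign(equipment_id=X_train['equipment_id'].astype(eq_dtype))
X_test_lgb = X_test.assign(equipment_id=X_test['equipment_id'].astype(eq_dtype))

start_time = time.time()

lgb_model = lgb.LGBMRegressor(
    n_estimators=200,
    learning_rate=0.1,
    num_leaves=31,
    random_state=42,
    verbose=-1
)
# categorical_feature defaults to 'auto', which picks up 'category' dtype columns
lgb_model.fit(
    X_train_lgb, y_train,
    eval_set=[(X_test_lgb, y_test)],
    callbacks=[lgb.early_stopping(stopping_rounds=20, verbose=False)]
)
lgb_time = time.time() - start_time

lgb_pred = lgb_model.predict(X_test_lgb)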
lgb_mse = mean_squared_error(y_test, lgb_pred)
lgb_r2 = r2_score(y_test, lgb_pred)

print(f"   Time: {lgb_time:.2f}s")
print(f"   Test MSE: {lgb_mse:.2f}")
print(f"   Test R²:  {lgb_r2:.4f}")

# Comparison
print(f"\n📊 Summary:")
print(f"   {'Method':<25} {'Time (s)':<12} {'MSE':<12} {'R²':<10}")
print(f"   {'-'*60}")
print(f"   {'CatBoost (Ordered)':<25} {cat_time:<12.2f} {cat_mse:<12.2f} {cat_r2:<10.4f}")
print(f"   {'XGBoost (One-Hot)':<25} {xgb_time:<12.2f} {xgb_mse:<12.2f} {xgb_r2:<10.4f}")
print(f"   {'LightGBM (Native)':<25} {lgb_time:<12.2f} {lgb_mse:<12.2f} {lgb_r2:<10.4f}")

print(f"\n💡 Key Insights:")
print(f"   • CatBoost MSE relative to XGBoost: {(cat_mse - xgb_mse) / xgb_mse * 100:+.1f}% (negative = CatBoost better)")
print(f"   • No feature engineering required: 3 features vs {X_train_ohe.shape[1]} after one-hot")
print(f"   • Ordered target statistics prevents overfitting on rare categories")
print(f"   • Best for: equipment_id (1000s), lot_id (10,000s), user_id (millions)")

## 📋 Batch 1 Summary: Foundations Complete

### ✅ What We've Covered

1. **Ordered Target Statistics** - Categorical encoding without target leakage
2. **Ordered Boosting** - Eliminates prediction shift
3. **Oblivious Trees** - Symmetric structure for fast prediction
4. **Categorical Encoding Comparison** - CatBoost vs XGBoost vs LightGBM

### 🎯 Key Insights

- **CatBoost excels at high-cardinality categoricals**: equipment_id (1000s), user_id (millions)
- **Ordered target statistics**: Only uses previous samples → no target leakage
- **No feature engineering**: Automatic encoding replaces manual one-hot encoding
- **Strong defaults**: Works well without extensive hyperparameter tuning

---

### 🚀 Coming in Batch 2

- **Oblivious tree visualization** - Understand symmetric tree structure
- **Post-silicon application** - 100K device classification with equipment/lot/wafer IDs
- **Feature importance** - Identify critical categorical vs numerical features
- **Hyperparameter tuning** - Depth, learning rate, regularization
- **8 Real-world projects** - Post-silicon (4) + General AI/ML (4)
- **Comparison guide** - When to use CatBoost vs XGBoost vs LightGBM


## 🔬 Post-Silicon Application: Equipment-Lot-Wafer Yield Prediction

### 📝 What's Happening in This Code?

**Purpose:** Build production-scale yield classifier with high-cardinality categoricals (equipment, lot, wafer IDs).

**Key Points:**
- **Scale**: 100K devices with realistic categorical distributions
- **Categoricals**: 200 equipment IDs + 500 lot IDs + 1000 wafer IDs = 1700 unique categories
- **Business scenario**: Predict yield before packaging to save $0.50-2.00 per device
- **CatBoost advantage**: Handles 1700 categories natively (XGBoost would create 1700 columns!)

**Business Value:**
- **Early screening**: Identify failing devices before expensive packaging
- **Equipment correlation**: Discover which equipment IDs predict failures
- **Lot tracking**: Detect problematic lots early in production
- **Cost savings**: $50K-200K per day from improved screening
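The cost argument comes down to simple arithmetic over the confusion matrix. The sketch below uses the illustrative per-device figures from this notebook's business-impact printout ($1 packaging saved per caught failure, $5 per packaged failure, $3 per scrapped good device). The function name and the eight-device example are hypothetical.

```python
def screening_net_benefit(y_true, y_pred, packaging_cost=1.0, miss_cost=5.0, scrap_cost=3.0):
    """Labels: 1 = pass, 0 = fail. Devices predicted to fail are screened out before packaging."""
    pairs = list(zip(y_true, y_pred))
    caught = sum(1 for t, p in pairs if t == 0 and p == 0)    # failures screened out
    missed = sum(1 for t, p in pairs if t == 0 and p == 1)    # failures that get packaged
    scrapped = sum(1 for t, p in pairs if t == 1 and p == 0)  # good devices thrown away
    net = caught * packaging_cost - missed * miss_cost - scrapped * scrap_cost
    return {"caught": caught, "missed": missed, "scrapped": scrapped, "net_benefit": net}

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]
print(screening_net_benefit(y_true, y_pred))
# {'caught': 2, 'missed': 1, 'scrapped': 1, 'net_benefit': -6.0}
```

Because misses and scrapped good devices cost more than a caught failure saves, a classifier can be accurate and still lose money, which is why the decision threshold matters as much as AUC.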


In [None]:
# Generate realistic 100K device dataset
print("🏭 Generating 100K device production dataset...\n")

np.random.seed(42)
n_devices = 100000

# High-cardinality categorical features
equipment_ids = [f'EQ_{i:04d}' for i in range(200)]  # 200 test machines
lot_ids = [f'LOT_{i:05d}' for i in range(500)]       # 500 lots
wafer_ids = [f'WFR_{i:05d}' for i in range(1000)]    # 1000 wafers

equipment_id = np.random.choice(equipment_ids, n_devices)
lot_id = np.random.choice(lot_ids, n_devices)
wafer_id = np.random.choice(wafer_ids, n_devices)

# Equipment/lot/wafer quality effects (systematic variations)
equipment_effects = {eq: np.random.normal(0, 3) for eq in equipment_ids}
lot_effects = {lot: np.random.normal(0, 4) for lot in lot_ids}
wafer_effects = {wfr: np.random.normal(0, 2) for wfr in wafer_ids}

# Parametric test results (15 features)
voltage = np.random.normal(1.8, 0.05, n_devices)
current = np.random.normal(150, 20, n_devices)
frequency = np.random.normal(2000, 100, n_devices)
temperature = np.random.uniform(25, 85, n_devices)
power = voltage * current
leakage = np.random.exponential(10, n_devices)
delay = np.random.normal(500, 50, n_devices)
jitter = np.random.exponential(20, n_devices)
noise_margin = np.random.normal(0.3, 0.06, n_devices)
skew = np.random.normal(0, 18, n_devices)

# Additional parametric tests
test_11 = np.random.normal(100, 12, n_devices)
test_12 = np.random.normal(50, 8, n_devices)
test_13 = np.random.exponential(15, n_devices)
test_14 = np.random.normal(200, 25, n_devices)
test_15 = np.random.uniform(10, 90, n_devices)

# Spatial features
die_x = np.random.randint(0, 50, n_devices)
die_y = np.random.randint(0, 50, n_devices)

# Complex yield model
yield_score = (
    100 +
    0.6 * (frequency - 2000) / 100 +
    -0.4 * (temperature - 25) / 10 +
    -1.2 * leakage / 10 +
    -0.08 * delay / 50 +
    -0.3 * jitter / 20 +
    12 * noise_margin +
    np.array([equipment_effects[eq] for eq in equipment_id]) +
    np.array([lot_effects[lot] for lot in lot_id]) +
    np.array([wafer_effects[wfr] for wfr in wafer_id]) +
    np.random.normal(0, 6, n_devices)
)

# Binary yield (threshold at 95)
yield_binary = (yield_score > 95).astype(int)

# Create DataFrame
df_ps = pd.DataFrame({
    'Equipment_ID': equipment_id,
    'Lot_ID': lot_id,
    'Wafer_ID': wafer_id,
    'Voltage_V': voltage,
    'Current_mA': current,
    'Frequency_MHz': frequency,
    'Temperature_C': temperature,
    'Power_mW': power,
    'Leakage_uA': leakage,
    'Delay_ps': delay,
    'Jitter_ps': jitter,
    'Noise_Margin': noise_margin,
    'Skew_ps': skew,
    'Test_11': test_11,
    'Test_12': test_12,
    'Test_13': test_13,
    'Test_14': test_14,
    'Test_15': test_15,
    'Die_X': die_x,
    'Die_Y': die_y,
    'Yield': yield_binary
})

print(f"✅ Production Dataset Generated:")
print(f"   Devices: {n_devices:,}")
print(f"   Features: {df_ps.shape[1] - 1}")
print(f"   - Categorical: 3 (Equipment_ID: 200, Lot_ID: 500, Wafer_ID: 1000)")
print(f"   - Numerical: 15 parametric tests")
print(f"   - Spatial: 2 (Die_X, Die_Y)")

print(f"\nYield Statistics:")
print(f"   Pass rate: {yield_binary.mean():.1%}")
print(f"   Fail rate: {1 - yield_binary.mean():.1%}")

print(f"\nCategorical Feature Cardinality:")
print(f"   Equipment_ID: {df_ps['Equipment_ID'].nunique()} unique values")
print(f"   Lot_ID: {df_ps['Lot_ID'].nunique()} unique values")
print(f"   Wafer_ID: {df_ps['Wafer_ID'].nunique()} unique values")
print(f"   Total unique categories: {df_ps['Equipment_ID'].nunique() + df_ps['Lot_ID'].nunique() + df_ps['Wafer_ID'].nunique()}")

print(f"\nBusiness Context:")
print(f"   100K devices spread across {len(lot_ids)} lots and {len(wafer_ids)} wafers")
print(f"   Packaging cost: $1.00/device average")
print(f"   Potential daily savings: ${int((1-yield_binary.mean()) * n_devices * 1.00):,}")

print(f"\n{df_ps.head(10)}")

### 📝 What's Happening in This Code?

**Purpose:** Train CatBoost classifier with native categorical feature handling and compare with XGBoost.

**Key Points:**
- **cat_features parameter**: Automatically applies ordered target statistics
- **Pool object**: CatBoost's internal data structure (like DMatrix/Dataset)
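- **Early stopping**: Monitors Logloss (the classifier's default metric) on the evaluation set; pass `eval_metric='AUC'` to monitor AUC instead. The code reuses the test split as the evaluation set, so the reported test metrics are slightly optimistic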
- **XGBoost comparison**: Must one-hot encode → 1700 columns vs 20 original features

**Why This Matters:** One-hot encoding 1700 categories creates sparse, high-dimensional data. CatBoost handles this elegantly with ordered target statistics.
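The early-stopping rule itself is simple: keep the best validation loss seen so far and stop once it has not improved for a fixed number of rounds. The sketch below shows that logic on a hypothetical loss curve. CatBoost's own overfitting detector may differ in details such as off-by-one handling.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_iteration, best_loss), stopping after `patience` rounds without improvement."""
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_iter, best_loss

# Hypothetical validation curve: improves, plateaus, then dips again too late
losses = [0.70, 0.62, 0.58, 0.57, 0.575, 0.58, 0.59, 0.56]
print(best_iteration_with_patience(losses, patience=3))  # (3, 0.57)
```

The final value of 0.56 is never reached because training stops at round 6. A larger patience trades extra training time for a lower risk of stopping on a temporary plateau.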


In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

# Prepare data
X = df_ps.drop('Yield', axis=1)
y = df_ps['Yield'].values
feature_names = X.columns.tolist()
cat_feature_names = ['Equipment_ID', 'Lot_ID', 'Wafer_ID']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("🚀 Training CatBoost on 100K devices...\n")

# 1. CatBoost with native categorical handling
print("1️⃣ CatBoost (Ordered Target Statistics)")
start_time = time.time()

# Create Pool objects (CatBoost's native data structure)
train_pool = Pool(X_train, y_train, cat_features=cat_feature_names)
test_pool = Pool(X_test, y_test, cat_features=cat_feature_names)

cat_clf = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    random_seed=42,
    verbose=False
)

cat_clf.fit(
    train_pool,
    eval_set=test_pool,
    early_stopping_rounds=30,
    verbose=False
)

cat_time = time.time() - start_time

# Predictions
y_pred_cat = cat_clf.predict(X_test)
y_pred_proba_cat = cat_clf.predict_proba(X_test)[:, 1]

# Metrics
cat_accuracy = accuracy_score(y_test, y_pred_cat)
cat_auc = roc_auc_score(y_test, y_pred_proba_cat)
cat_cm = confusion_matrix(y_test, y_pred_cat)

print(f"   Training time: {cat_time:.2f}s")
print(f"   Trees used: {cat_clf.tree_count_} (of 500 max)")
print(f"   Accuracy: {cat_accuracy:.4f}")
print(f"   AUC-ROC: {cat_auc:.4f}")

# 2. XGBoost with one-hot encoding (for comparison)
print(f"\n2️⃣ XGBoost (One-Hot Encoding)")
print(f"   Encoding 1700 categories...")

X_train_ohe = pd.get_dummies(X_train, columns=cat_feature_names)
X_test_ohe = pd.get_dummies(X_test, columns=cat_feature_names)
X_test_ohe = X_test_ohe.reindex(columns=X_train_ohe.columns, fill_value=0)

print(f"   Features after encoding: {X_train_ohe.shape[1]} (from 20 original)")

start_time = time.time()
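xgb_clf = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    random_state=42,
    n_jobs=-1,
    eval_metric='logloss',
    early_stopping_rounds=30  # constructor argument (xgboost >= 1.6); fit() no longer accepts it in 2.x
)
# use_label_encoder is deprecated and has no effect in recent releases, so it is omitted
xgb_clf.fit(
    X_train_ohe, y_train,
    eval_set=[(X_test_ohe, y_test)],
    verbose=False
)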
xgb_time = time.time() - start_time

y_pred_xgb = xgb_clf.predict(X_test_ohe)
y_pred_proba_xgb = xgb_clf.predict_proba(X_test_ohe)[:, 1]

xgb_accuracy = accuracy_score(y_test, y_pred_xgb)
xgb_auc = roc_auc_score(y_test, y_pred_proba_xgb)

print(f"   Training time: {xgb_time:.2f}s")
print(f"   Accuracy: {xgb_accuracy:.4f}")
print(f"   AUC-ROC: {xgb_auc:.4f}")

# Comparison
print(f"\n📊 Comparison:")
print(f"   {'Metric':<25} {'CatBoost':<15} {'XGBoost':<15}")
print(f"   {'-'*55}")
print(f"   {'Training time (s)':<25} {cat_time:<15.2f} {xgb_time:<15.2f}")
print(f"   {'Feature count':<25} {X_train.shape[1]:<15} {X_train_ohe.shape[1]:<15}")
print(f"   {'Accuracy':<25} {cat_accuracy:<15.4f} {xgb_accuracy:<15.4f}")
print(f"   {'AUC-ROC':<25} {cat_auc:<15.4f} {xgb_auc:<15.4f}")

print(f"\n⚡ CatBoost vs XGBoost:")
print(f"   • Training time ratio (XGBoost / CatBoost): {xgb_time / cat_time:.1f}x")
print(f"   • {X_train_ohe.shape[1] / X_train.shape[1]:.0f}x fewer input columns ({X_train.shape[1]} vs {X_train_ohe.shape[1]})")
print(f"   • AUC change vs XGBoost: {(cat_auc - xgb_auc) / xgb_auc * 100:+.2f}%")
print(f"   • No manual encoding step required")
print(f"   • Handles 1700 unique categories natively")

# Detailed metrics for CatBoost
print(f"\n📋 CatBoost Confusion Matrix:")
print(f"   True Neg:  {cat_cm[0,0]:7,}  |  False Pos: {cat_cm[0,1]:6,}")
print(f"   False Neg: {cat_cm[1,0]:7,}  |  True Pos:  {cat_cm[1,1]:6,}")

print(f"\n📋 Classification Report:")
print(classification_report(y_test, y_pred_cat, target_names=['Fail', 'Pass']))

# Business impact (label 1 = Pass, 0 = Fail, so "positive" means Pass)
caught_failures = cat_cm[0, 0]  # actual Fail, predicted Fail (true negatives)
missed_failures = cat_cm[0, 1]  # actual Fail, predicted Pass (false positives)
false_alarms = cat_cm[1, 0]     # actual Pass, predicted Fail (false negatives)

print(f"💰 Business Impact (100K device analysis):")
print(f"   Correctly caught failures: {caught_failures:,} devices")
print(f"   Cost avoided: ${caught_failures * 1.0:,.0f} (packaging saved)")
print(f"   Missed failures: {missed_failures:,} devices")
print(f"   Cost of misses: ${missed_failures * 5.0:,.0f} (packaged then failed)")
print(f"   False alarms: {false_alarms:,} devices")
print(f"   Opportunity cost: ${false_alarms * 3.0:,.0f} (good devices scrapped)")
print(f"\n   Net benefit: ${(caught_failures * 1.0 - false_alarms * 3.0 - missed_failures * 5.0):,.0f}")

In [None]:
# Feature importance analysis
feature_importance = cat_clf.get_feature_importance()
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print("📊 Top 15 Most Important Features:\n")
print(feature_importance_df.head(15).to_string(index=False))

# Visualize
plt.figure(figsize=(10, 8))
top_15 = feature_importance_df.head(15)
plt.barh(range(len(top_15)), top_15['Importance'])
plt.yticks(range(len(top_15)), top_15['Feature'])
plt.xlabel('Importance', fontsize=12)
plt.title('Top 15 Feature Importances - CatBoost Yield Predictor', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

# Analyze categorical vs numerical
cat_importance = feature_importance_df[feature_importance_df['Feature'].isin(cat_feature_names)]['Importance'].sum()
num_importance = feature_importance_df[~feature_importance_df['Feature'].isin(cat_feature_names)]['Importance'].sum()
total_importance = feature_importance_df['Importance'].sum()

print(f"\n💡 Categorical vs Numerical Importance:")
print(f"   Categorical features: {cat_importance / total_importance * 100:.1f}% of total importance")
print(f"   Numerical features: {num_importance / total_importance * 100:.1f}% of total importance")
print(f"\n   Top categorical: {feature_importance_df[feature_importance_df['Feature'].isin(cat_feature_names)].iloc[0]['Feature']}")
print(f"   Top numerical: {feature_importance_df[~feature_importance_df['Feature'].isin(cat_feature_names)].iloc[0]['Feature']}")

print(f"\n🎯 Insights for Test Optimization:")
print(f"   • Equipment/Lot/Wafer effects account for {cat_importance / total_importance * 100:.1f}% of prediction")
print(f"   • Ordered target statistics captures systematic quality variations")
print(f"   • Top 5 features: {', '.join(top_15['Feature'].head(5).tolist())}")
print(f"   • Consider equipment calibration for high-importance equipment IDs")

## 🎯 Real-World CatBoost Projects

Below are 8 comprehensive project ideas demonstrating CatBoost's categorical feature mastery:

---

### 🔬 Post-Silicon Validation Projects (4)

### **1. Equipment-Specific Yield Predictor**
**Objective:** Build per-equipment yield models to identify problematic test machines

**Features:**
- Equipment_ID (500+ testers), operator_id, shift, maintenance_date
- Parametric tests: voltage, current, frequency (15+ tests)
- Temporal: days_since_calibration, cumulative_test_count
- CatBoost ordered target statistics for equipment quality encoding

**Success Metrics:**
- Per-equipment AUC >0.90 (identify bad machines)
- Detect equipment drift within 1000 devices
- False positive rate <5% (avoid unnecessary downtime)
- Recommend calibration 2-3 days before quality degrades

**Business Value:** Prevent bad equipment from processing 50K+ devices → $50K-200K savings per incident

---

### **2. Lot-Level Quality Prediction**
**Objective:** Predict lot yield from first 1000 devices to enable early interventions

**Features:**
- Lot_ID (1000s), fab_site, process_node, wafer_supplier
- Early parametric data: first 1000 devices from 25-wafer lot
- Environmental: temperature, humidity, cleanroom class
- Process: etch_time, implant_dose, anneal_temperature

**Success Metrics:**
- Predict final lot yield within ±3% after 1000 devices
- 80%+ accuracy on lot pass/fail (>95% yield)
- Early warning: 2-3 days before lot completion
- Actionable: Identify correctable process issues

**Business Value:** Stop bad lots early → save $500K-2M per lot (avoid processing remaining wafers)

---

### **3. Multi-Site Categorical Correlation Engine**
**Objective:** Unified model across 5 global fabs with site-specific categorical features

**Features:**
- Site_ID (5 fabs), equipment_id per site (100+ each), lot_id (10,000s)
- Site-specific: local_vendor_id, operator_language, shift_pattern
- Shared parametrics: voltage, frequency, power (standardized tests)
- CatBoost handles 500+ unique equipment IDs across sites

**Success Metrics:**
- Unified model AUC >0.92 (all sites combined)
- Per-site AUC variance <2%
- Identify 15+ cross-site systematic patterns
- Transfer learning: New site reaches 90% accuracy with 10K samples

**Business Value:** Standardize quality predictions → $10-30M annual savings from shared learnings

---

### **4. Supplier Quality Scoring with Categorical Tracing**
**Objective:** Track wafer/material supplier quality using categorical IDs

**Features:**
- Wafer_supplier_id, material_lot_id, chemical_batch_id (high cardinality)
- Supply chain: ship_date, storage_duration, handling_count
- Quality metrics: defect_density, thickness_uniformity
- Parametric results: Post-processing test yields

**Success Metrics:**
- Supplier ranking accuracy >85% (compare with manual audits)
- Detect bad material batch within 5000 devices
- Predict supplier yield impact within ±2%
- Automated alerts for sub-spec supplier quality

**Business Value:** Negotiate with suppliers using data → 5-10% cost reduction on materials ($5-15M annually)

---

### 🌐 General AI/ML Projects (4)

### **5. E-Commerce Recommendation with User/Product IDs**
**Objective:** Predict purchase probability using high-cardinality user and product IDs

**Features:**
- user_id (millions), product_id (100,000s), category_id (1000s)
- Behavioral: clicks, cart_adds, time_on_page
- Contextual: device_type, location, time_of_day
- CatBoost ordered target statistics prevents user_id overfitting

**Success Metrics:**
- AUC >0.80 (industry benchmark)
- Precision >40% (top 10% predictions)
- Inference latency <20ms
- Daily retraining on 10M events

**Business Value:** 15-25% conversion rate improvement → $5-15M additional revenue per quarter

---

### **6. Credit Risk Scoring with Merchant/Location IDs**
**Objective:** Predict default risk using categorical merchant and location data

**Features:**
- merchant_id (millions), zip_code (40,000+), occupation_code (1000s)
- Financial: income, debt_ratio, payment_history
- Behavioral: transaction_velocity, merchant_category
- Temporal: days_since_last_payment, account_age

**Success Metrics:**
- AUC >0.75 (illustrative target)
- Precision >60% (minimize false approvals)
- Fair lending: No bias on protected categories
- Explainable: Feature importance for audit compliance

**Business Value:** Reduce default rate 20% → $50-200M annual savings for large lender

---

### **7. Healthcare Patient Risk with ICD Diagnosis Codes**
**Objective:** Predict readmission using high-cardinality ICD-10 codes

**Features:**
- primary_diagnosis_code (70,000 ICD-10 codes), hospital_id, insurance_type
- Clinical: age, length_of_stay, num_procedures, comorbidity_score
- Medication: drug_count, high_risk_meds (categoricals)
- CatBoost handles 70K ICD codes without dimension explosion

**Success Metrics:**
- AUC >0.75 (published benchmark)
- Precision >70% for high-risk patients
- Recall >80% (catch most readmissions)
- Interpretable: Top diagnosis codes for each prediction

**Business Value:** Reduce readmissions 15-20% → $3-8M annual savings per hospital

---

### **8. Marketing Attribution with Campaign/Channel IDs**
**Objective:** Multi-touch attribution using categorical campaign and channel data

**Features:**
- campaign_id (10,000s), channel_id (100s), creative_id (1000s)
- Touchpoint sequence: first_touch, last_touch, touch_count
- User: demographic_segment, purchase_history_category
- Temporal: days_since_first_touch, hour_of_conversion

**Success Metrics:**
- Attribution accuracy >80% (vs ground truth experiments)
- ROAS prediction within ±15%
- Campaign ranking correlation >0.90 with incrementality tests
- Handles 10K+ campaigns with CatBoost ordered encoding

**Business Value:** Optimize $10-100M ad budget → 20-30% efficiency gain ($2-30M savings)

---


## ✅ Key Takeaways: CatBoost Mastery

### 🎯 Use CatBoost When:

1. **High-cardinality categorical features**
   - equipment_id (1000s), user_id (millions), ICD codes (70,000+)
   - Ordered target statistics reduce overfitting from target leakage
   - No one-hot encoding required

2. **Small-medium datasets (<1M samples)**
   - Often better accuracy than XGBoost/LightGBM on small data
   - Strong default parameters
   - Less hyperparameter tuning required

3. **Need robust defaults**
   - Works well out-of-the-box
   - Less prone to overfitting than other GBM libraries
   - Oblivious trees act as a regularizer: one split per level leaves fewer free parameters than an unconstrained tree of the same depth

4. **Interpretability required**
   - Symmetric trees easier to visualize
   - Feature importance is reported per input feature, so categorical and numerical contributions can be compared directly
   - Ordered encoding is explainable (historical target average)

5. **Limited feature engineering time**
   - Handles categoricals automatically
   - No need for manual target encoding or hashing
   - Fast experimentation
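
To make the "historical target average" idea from the list above concrete, here is a minimal sketch of ordered target statistics. It is a simplification under stated assumptions: a single permutation (CatBoost itself uses several), the smoothed form `(sum + a * prior) / (count + a)` from the CatBoost paper, and a hypothetical function name `ordered_target_stats`. It is not the library's internal implementation.

```python
import random


def ordered_target_stats(categories, targets, prior, a=1.0, order=None, seed=0):
    """Encode each row using only the targets of rows that precede it
    in a permutation, so a row never sees its own label."""
    if order is None:
        order = list(range(len(categories)))
        random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean of the "history" for this category.
        encoded[idx] = (s + a * prior) / (n + a)
        # Only now add the current row's target to the history.
        sums[cat] = s + targets[idx]
        counts[cat] = n + 1
    return encoded


categories = ["A", "B", "A", "A", "B"]
targets = [1, 0, 1, 0, 1]
prior = sum(targets) / len(targets)  # global mean, 0.6

# Identity order is used here so the result is easy to follow by hand.
encoded = ordered_target_stats(categories, targets, prior, order=[0, 1, 2, 3, 4])
print([round(v, 4) for v in encoded])
```

The first row of each category falls back to the prior (0.6), while later rows blend the prior with the targets seen so far. Because a row's own label is added only after it is encoded, the leakage that plain target encoding suffers from is avoided.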

---

### ⚠️ Use XGBoost/LightGBM Instead When:

1. **Large datasets (>1M samples)**
   - LightGBM can be 5-20x faster to train on massive data, depending on data shape and hardware
   - XGBoost has a more mature ecosystem
   - CatBoost trains more slowly on large datasets

2. **Few or no categorical features**
   - CatBoost's main advantage unused
   - XGBoost/LightGBM may be faster
   - No benefit from ordered target statistics

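3. **Need maximum training speed**
   - LightGBM's histogram-based, leaf-wise training is usually the fastest
   - XGBoost's level-wise training is often faster than CatBoost's ordered boosting
   - Real-time constraints (<100ms): benchmark inference first, since CatBoost's oblivious trees are fast at prediction and the gap is mainly in training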

---

### 📊 CatBoost vs XGBoost vs LightGBM

| Aspect | XGBoost | LightGBM | CatBoost |
|--------|---------|----------|----------|
| **Categorical handling** | Manual encoding | Native (basic) | **Ordered target stats** |
| **Speed (large data)** | Medium | **Fastest** | Slower |
| **Accuracy (small data)** | High | High | **Highest** |
| **Default performance** | Needs tuning | Needs tuning | **Strong defaults** |
| **Overfitting resistance** | Good | Good | **Excellent** |
| **Tree structure** | Level-wise | Leaf-wise | **Oblivious** |
| **Best for** | General purpose | Large datasets | **Categoricals + small data** |
| **Feature engineering** | Required | Some required | **Minimal** |
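
To make the "oblivious" tree structure in the table concrete, the sketch below evaluates a single symmetric tree. Because every node on a level shares the same split, a depth-d tree reduces to d comparisons that form a d-bit index into a table of 2^d leaf values. The split and leaf values are invented, and `oblivious_predict` is an illustrative name, not a CatBoost API.

```python
def oblivious_predict(row, splits, leaf_values):
    """Evaluate one oblivious tree.

    splits: one (feature_index, threshold) pair per level.
    leaf_values: 2 ** len(splits) values, indexed by the comparison bits.
    """
    index = 0
    for level, (feature, threshold) in enumerate(splits):
        if row[feature] > threshold:
            index |= 1 << level  # set the bit for this level
    return leaf_values[index]


# A depth-2 tree: level 0 tests feature 0, level 1 tests feature 1.
splits = [(0, 0.5), (1, 10.0)]
leaf_values = [-1.0, 0.2, 0.4, 1.5]  # index = bit0 + 2 * bit1

print(oblivious_predict([0.7, 5.0], splits, leaf_values))   # bits (1, 0) -> index 1 -> 0.2
print(oblivious_predict([0.3, 12.0], splits, leaf_values))  # bits (0, 1) -> index 2 -> 0.4
print(oblivious_predict([0.9, 20.0], splits, leaf_values))  # bits (1, 1) -> index 3 -> 1.5
```

There is no pointer chasing or data-dependent branching between nodes. This is one reason symmetric trees predict quickly and vectorize well.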

---

### 🔧 Best Practices

1. **Always specify categorical features**
   ```python
   # Column names work when X_train is a pandas DataFrame; otherwise pass column indices
   model.fit(X_train, y_train, cat_features=['col1', 'col2'])
   ```

2. **Use Pool objects for efficiency**
   ```python
   from catboost import Pool

   train_pool = Pool(X_train, y_train, cat_features=['col1'])
   model.fit(train_pool)
   ```

3. **Start with defaults**
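   - `iterations=1000`, `depth=6`; `learning_rate` is chosen automatically for common losses (Logloss, MultiClass, RMSE) in many configurations when left unset, and defaults to 0.03 otherwise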
   - Tune only if accuracy insufficient
   - Most important: `depth`, `l2_leaf_reg`, `learning_rate`

4. **Enable early stopping**
   ```python
   # Monitor a held-out validation set, not the final test set, to avoid leaking test data
   model.fit(train_pool, eval_set=val_pool, early_stopping_rounds=30)
   ```

5. **Use GPU for medium-large datasets**
   ```python
   CatBoostClassifier(task_type='GPU', devices='0')
   ```

6. **Leverage built-in cross-validation**
   ```python
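   from catboost import cv  # cv is a module-level function, not a model method
   cv_results = cv(pool, params, fold_count=5)  # params must include 'loss_function'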
   ```

---

### 🚀 Next Steps

1. **022_Voting_Stacking_Ensembles.ipynb** - Meta-learning and model combination
2. **023_Hyperparameter_Optimization.ipynb** - Optuna, Hyperopt, Bayesian optimization
3. **024_Model_Interpretation.ipynb** - SHAP, LIME, feature interactions
4. **025_Imbalanced_Learning.ipynb** - Class weights, SMOTE, custom loss functions

---

### 🎓 What You've Mastered

✅ **Ordered target statistics** - Categorical encoding without target leakage  
✅ **Ordered boosting** - Eliminates prediction shift  
✅ **Oblivious trees** - Symmetric structure for fast prediction  
✅ **High-cardinality categoricals** - Handle 1000s-millions of unique values  
✅ **Production deployment** - Equipment/Lot/Wafer quality prediction  
✅ **Feature importance** - Categorical vs numerical contribution analysis  
✅ **Robust defaults** - Minimal tuning for strong performance  
✅ **Business applications** - $50K-200M impact across 8 domains  

You now understand when CatBoost's ordered boosting and categorical handling outperform other gradient boosting frameworks! 🎉


## 📚 References and Further Reading

### Original Papers

1. **CatBoost: unbiased boosting with categorical features** (2018)  
   Prokhorenkova et al., NeurIPS 2018  
   https://arxiv.org/abs/1706.09516

2. **Ordered Boosting**  
   Section 4 of CatBoost paper - eliminates prediction shift

3. **Ordered Target Statistics**  
   Section 3 of CatBoost paper - categorical feature encoding

4. **Oblivious Decision Trees**  
   Base predictors in the CatBoost paper - symmetric tree structure

### Official Documentation

5. **CatBoost Official Docs**  
   https://catboost.ai/docs/

6. **Categorical Features Guide**  
   https://catboost.ai/docs/concepts/categorical-features.html

7. **Training Parameters**  
   https://catboost.ai/docs/concepts/parameter-tuning.html

8. **GPU Training**  
   https://catboost.ai/docs/features/training-on-gpu.html

### Comparison Studies

9. **Benchmarking CatBoost vs XGBoost vs LightGBM** (2019)  
   Performance comparison on categorical-heavy datasets

10. **Kaggle Winning Solutions with CatBoost**  
    https://www.kaggle.com - Search "CatBoost" for competition examples

### Related Notebooks in This Series

- **016_Decision_Trees.ipynb** - Tree-based models fundamentals
- **017_Random_Forest.ipynb** - Parallel ensemble methods
- **018_Gradient_Boosting.ipynb** - Sequential boosting foundations
- **019_XGBoost.ipynb** - Regularized gradient boosting
- **020_LightGBM.ipynb** - Histogram-based speedup
- **022_Voting_Stacking_Ensembles.ipynb** (next) - Meta-learning

---

**Notebook Complete!** ✅  
**Next:** 022_Voting_Stacking_Ensembles.ipynb - Combining multiple models for superior performance
