# 022 - Voting & Stacking Ensembles: Meta-Learning for Superior Performance

## 📘 Introduction

**Ensemble methods** combine multiple models to achieve better performance than any single model. While Random Forest and Gradient Boosting are ensemble methods themselves, **meta-ensembles** take this further by combining *different types* of models.

### Types of Meta-Ensembles

1. **Voting** - Average/majority vote across models
2. **Stacking** - Train meta-model on base model predictions
3. **Blending** - Simpler stacking with hold-out validation set

### Why Meta-Ensembles?

**The wisdom of crowds:** Combining diverse models reduces errors because:
- **Reduces variance**: Different models make different errors
- **Can reduce bias**: Models with different assumptions complement each other (mostly through stacking; plain averaging mainly reduces variance)
- **Captures different patterns**: Linear + tree + neural approaches
- **Kaggle dominance**: Top solutions are almost always ensembles

### Key Concepts

**Diversity is critical:**
- Combining 5 identical models → no benefit
- Combining 5 diverse models → significant improvement
- Use different algorithms: Linear, Tree, Boosting, Neural
- Use different feature sets, hyperparameters, or data subsets

**When to Use Ensembles:**

✅ **Competitions** (Kaggle, etc.) - Every 0.01% accuracy matters  
✅ **High-stakes predictions** (medical, financial) - Need maximum reliability  
✅ **Production systems** - Robust to model degradation  
✅ **Diverse data** - Different models capture different patterns  
✅ **Have compute budget** - Can train multiple models  

❌ **Avoid when:**
- Need fast inference (<10ms)
- Limited training resources
- Interpretability critical (ensemble = black box)
- Single model already achieves target accuracy

### Comparison: Voting vs Stacking vs Blending

| Aspect | Voting | Stacking | Blending |
|--------|--------|----------|----------|
| **Complexity** | Simple | Complex | Medium |
| **Training** | Parallel | Sequential (2-level) | Sequential |
| **Overfitting risk** | Low | **Higher** | Medium |
| **Performance** | Good | **Best** | Good |
| **Interpretability** | Easy | Hard | Medium |
| **Use case** | Quick ensemble | **Competition winning** | Production balanced |

### Learning Path Context

- **016_Decision_Trees** - Single tree foundations
- **017_Random_Forest** - Bagging ensemble (parallel)
- **018-021_Gradient_Boosting** - Sequential ensembles
- **022_Voting_Stacking (this)** - Meta-ensembles (combining different models)
- **023_Hyperparameter_Optimization** (next) - Systematic tuning


## 🔄 Meta-Ensemble Workflows

### Voting Ensemble
```mermaid
graph TD
    A[Training Data] --> B1[Model 1: LogisticReg]
    A --> B2[Model 2: Random Forest]
    A --> B3[Model 3: XGBoost]
    
    B1 --> C[Voting]
    B2 --> C
    B3 --> C
    
    C --> D{Voting Type}
    D -->|Hard| E[Majority Vote]
    D -->|Soft| F[Average Probabilities]
    E --> G[Final Prediction]
    F --> G
    
    style C fill:#e1f5ff
    style G fill:#e1ffe1
```

### Stacking Ensemble
```mermaid
graph TD
    A[Training Data] --> B{K-Fold Split}
    B --> C1[Fold 1]
    B --> C2[Fold 2]
    B --> C3[Fold K]
    
    C1 --> D1[Base Models on Fold 1]
    C2 --> D2[Base Models on Fold 2]
    C3 --> D3[Base Models on Fold K]
    
    D1 --> E[Out-of-Fold Predictions]
    D2 --> E
    D3 --> E
    
    E --> F[Meta-Features]
    F --> G[Meta-Model: LogisticReg]
    G --> H[Final Prediction]
    
    style E fill:#fff4e1
    style F fill:#f0e1ff
    style H fill:#e1ffe1
```

**Key Differences:**
- **Voting**: Simple average/majority, no meta-model
- **Stacking**: Meta-model learns optimal weights from out-of-fold predictions


## 📐 Mathematical Foundation

### 1. Voting Ensembles

**Hard Voting (Classification):**
$$\hat{y} = \text{mode}(h_1(x), h_2(x), ..., h_M(x))$$

Where $h_m(x)$ is the prediction from model $m$, and mode is the most frequent class.

**Example:** 3 models predict [0, 1, 1] → Hard vote = 1 (majority)

**Soft Voting (Classification with probabilities):**
$$\hat{y} = \arg\max_c \frac{1}{M} \sum_{m=1}^M P_m(y = c | x)$$

Where $P_m(y = c | x)$ is model $m$'s predicted probability for class $c$.

**Example:**
- Model 1: P(class=1) = 0.6
- Model 2: P(class=1) = 0.55
- Model 3: P(class=1) = 0.9
- Average: (0.6 + 0.55 + 0.9) / 3 = 0.683 → Predict class 1

**Weighted Voting:**
$$\hat{y} = \arg\max_c \sum_{m=1}^M w_m \cdot P_m(y = c | x)$$

Where $w_m$ is the weight for model $m$ (e.g., based on validation accuracy), and $\sum w_m = 1$.

**Voting for Regression:**
$$\hat{y} = \frac{1}{M} \sum_{m=1}^M h_m(x)$$

Simple average of predictions.
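
The voting rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation; the helper names `hard_vote` and `soft_vote` are assumptions made for this example. It reproduces the two worked examples from the text and adds one illustrative case where hard and soft voting disagree.

```python
import numpy as np

def hard_vote(labels):
    """Majority vote over class labels of shape (n_models,). Ties go to the lowest label."""
    return int(np.bincount(labels).argmax())

def soft_vote(probas, weights=None):
    """Weighted average of class probabilities.

    probas: shape (n_models, n_classes); weights: shape (n_models,) or None.
    Returns (predicted_class, averaged_probabilities).
    """
    avg = np.average(probas, axis=0, weights=weights)
    return int(avg.argmax()), avg

# Hard-voting example from the text: predictions [0, 1, 1]
hard_example = hard_vote(np.array([0, 1, 1]))

# Soft-voting example from the text: P(class=1) = 0.6, 0.55, 0.9
p1 = np.array([0.6, 0.55, 0.9])
soft_example, avg = soft_vote(np.column_stack([1 - p1, p1]))
print(hard_example, soft_example, round(float(avg[1]), 3))

# Illustrative case where the two rules disagree:
# two lukewarm "class 0" votes against one confident "class 1" vote
p1_b = np.array([0.45, 0.45, 0.95])
hard_b = hard_vote((p1_b > 0.5).astype(int))
soft_b, avg_b = soft_vote(np.column_stack([1 - p1_b, p1_b]))
print(hard_b, soft_b)
```

In the last case the majority says class 0, while the averaged probability for class 1 is about 0.617, so soft voting picks class 1. Passing `weights` to `soft_vote` gives the weighted variant; `np.average` normalizes the weights, so they do not need to sum to 1.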

---

### 2. Stacking (Stacked Generalization)

**Two-level architecture:**

**Level 0 (Base models):**
$$h_m(x) = f_m(x; \theta_m), \quad m = 1, ..., M$$

Train $M$ diverse base models on training data.

**Level 1 (Meta-model):**
$$\hat{y} = g([h_1(x), h_2(x), ..., h_M(x)]; \phi)$$

Where:
- $g$ is the meta-model (often LogisticRegression, Ridge, or LightGBM)
- $[h_1(x), ..., h_M(x)]$ are the meta-features (base model predictions)
- $\phi$ are meta-model parameters

**Critical: Out-of-Fold Predictions**

To prevent overfitting, base models must predict on data they haven't seen:

1. Split training data into K folds
2. For each fold $k$:
   - Train base model on folds $\neq k$
   - Predict on fold $k$ (out-of-fold predictions)
3. Concatenate all out-of-fold predictions → meta-features
4. Train meta-model on meta-features

**Mathematical formulation:**
$$\text{Meta-features}_{\text{train}} = [h_1^{\text{OOF}}, h_2^{\text{OOF}}, ..., h_M^{\text{OOF}}]$$
$$\text{Meta-features}_{\text{test}} = [h_1(x_{\text{test}}), h_2(x_{\text{test}}), ..., h_M(x_{\text{test}})]$$

Where $h_m^{\text{OOF}}$ are out-of-fold predictions from model $m$.
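
The four steps above can be written out by hand with `cross_val_predict`, which returns, for every training row, a prediction from a model that was fit without that row. This is a sketch on a small synthetic dataset; the variable names (`X_demo`, `demo_base`, `oof`) and the choice of two base models are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, y_tr, X_te = X_demo[:500], y_demo[:500], X_demo[500:]

demo_base = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Steps 1-3: one out-of-fold probability column per base model
oof = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=cv, method="predict_proba")[:, 1]
    for m in demo_base
])

# Step 4: the meta-model is trained only on out-of-fold meta-features
meta = LogisticRegression().fit(oof, y_tr)

# Test time: refit each base model on all training rows, then stack its predictions
test_meta = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in demo_base
])
final_pred = meta.predict(test_meta)
print(oof.shape, test_meta.shape, final_pred.shape)
```

Using the same `cv` splitter for every base model keeps the rows of `oof` aligned across columns.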

---

### 3. Blending

Simpler alternative to stacking:

1. Split training data: 80% train, 20% hold-out
2. Train base models on 80% train
3. Predict on 20% hold-out → meta-features
4. Train meta-model on hold-out meta-features

**Advantage:** Simpler, faster (no K-fold cross-validation)  
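**Disadvantage:** The meta-model is trained only on the 20% hold-out, and the base models see only 80% of the data (stacking builds meta-features for every training row and refits base models on 100%)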
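
A minimal blending sketch following the four steps above. The dataset, the two base models, and names such as `blend_features` and `blender` are assumptions for illustration; an extra split (`X_new`) stands in for unseen data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_fit, X_new, y_fit, y_new = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Step 1: 80% for the base models, 20% hold-out for the meta-model
X_base, X_hold, y_base, y_hold = train_test_split(
    X_fit, y_fit, test_size=0.2, random_state=0, stratify=y_fit
)

# Step 2: train base models on the 80% split only
blend_base = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
]
for m in blend_base:
    m.fit(X_base, y_base)

def blend_features(models, X):
    """One class-1 probability column per base model."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# Steps 3-4: hold-out predictions become meta-features for the meta-model
blender = LogisticRegression().fit(blend_features(blend_base, X_hold), y_hold)
blend_pred = blender.predict(blend_features(blend_base, X_new))
print(X_base.shape[0], X_hold.shape[0], blend_pred.shape)
```

Each base model is fit exactly once here, which is where the speed advantage over K-fold stacking comes from.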

---

### 4. Why Ensembles Work: Bias-Variance Decomposition

**Expected error of a model:**
$$E[(y - \hat{y})^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$$

**For ensemble of M uncorrelated models:**
$$\text{Variance}_{\text{ensemble}} = \frac{1}{M} \cdot \text{Variance}_{\text{individual}}$$

Variance reduces by factor of $M$!

**For correlated models:**
$$\text{Variance}_{\text{ensemble}} = \rho \cdot \text{Variance}_{\text{individual}} + \frac{1-\rho}{M} \cdot \text{Variance}_{\text{individual}}$$

Where $\rho$ is correlation between models.

**Key insight:** Diversity (low $\rho$) is critical for ensemble performance!
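
To see how quickly correlation caps the benefit, the correlated-variance formula can be evaluated directly. The helper name `ensemble_variance` is an assumption for this sketch; it only restates the formula above.

```python
def ensemble_variance(sigma2, rho, n_models):
    """Variance of the average of n_models predictors, each with variance
    sigma2 and pairwise correlation rho (formula from the text)."""
    return rho * sigma2 + (1.0 - rho) / n_models * sigma2

for rho in (0.0, 0.5, 0.9):
    row = [round(ensemble_variance(1.0, rho, m), 3) for m in (1, 5, 50)]
    print(f"rho={rho}: M=1,5,50 -> {row}")
```

As $M$ grows the second term vanishes and the variance approaches the floor $\rho \cdot \text{Variance}_{\text{individual}}$. With $\rho = 0.9$, even 50 models remove only about 10% of the variance.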

---

### Key Parameters

**Voting:**
- `voting='hard'` or `'soft'`
- `weights=[w1, w2, ...]` for weighted voting

**Stacking:**
- `cv=5` (number of folds for out-of-fold predictions)
- `final_estimator` (meta-model; defaults to `LogisticRegression` for `StackingClassifier` and `RidgeCV` for `StackingRegressor`)
- `passthrough=False` (set to `True` to also feed the original features to the meta-model)
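
One way to check what the meta-model actually receives is to call `transform` on a fitted `StackingClassifier`. The sketch below uses a small synthetic binary dataset and a hypothetical helper `build_stack`. As far as I know, for binary targets with `stack_method='predict_proba'` scikit-learn keeps a single probability column per base model, and `passthrough=True` appends the original features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

def build_stack(passthrough):
    """Two base models, logistic-regression meta-model, 5-fold OOF predictions."""
    return StackingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=30, random_state=0)),
        ],
        final_estimator=LogisticRegression(),
        cv=5,
        stack_method="predict_proba",
        passthrough=passthrough,
    )

plain = build_stack(False).fit(X_demo, y_demo)
wide = build_stack(True).fit(X_demo, y_demo)

n_plain = plain.transform(X_demo).shape[1]  # base-model predictions only
n_wide = wide.transform(X_demo).shape[1]    # predictions + original features
print(n_plain, n_wide)
```

The width of `transform`'s output equals the number of coefficients in `final_estimator_.coef_`, which matters when interpreting meta-model weights.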


## 🔧 Setup and Imports

### 📝 What's Happening in This Code?

**Purpose:** Import libraries for ensemble methods and diverse base models.

**Key Points:**
- **VotingClassifier/Regressor**: Simple averaging ensemble
- **StackingClassifier/Regressor**: Two-level meta-learning
- **Diverse base models**: Linear, Tree, Boosting for maximum diversity


In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    roc_auc_score, roc_curve, mean_squared_error, r2_score
)

# Ensemble methods
from sklearn.ensemble import (
    VotingClassifier, VotingRegressor,
    StackingClassifier, StackingRegressor,
    RandomForestClassifier, RandomForestRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor
)

# Base models for diversity
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.naive_bayes import GaussianNB

# Modern boosting
import xgboost as xgb
import lightgbm as lgb

print("✅ Libraries loaded successfully")
print("   Ready to build meta-ensembles!")

## 🗳️ Voting Ensemble: Simple Yet Effective

### 📝 What's Happening in This Code?

**Purpose:** Combine diverse models using hard and soft voting for classification.

**Key Points:**
- **Base models**: LogisticRegression (linear), RandomForest (tree), XGBoost (boosting)
- **Hard voting**: Majority class wins (e.g., 2/3 predict class 1 → output class 1)
- **Soft voting**: Average probabilities (better when models are calibrated)
- **Diversity**: Linear + tree + boosting approaches capture different patterns

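**Why This Matters:** Voting ensembles are simple to implement, parallelize easily, and often match or beat the best single model. Gains of 2-5% are a common rule of thumb, but the actual improvement depends on the dataset and on how diverse the base models are.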


In [None]:
# Generate realistic classification dataset
from sklearn.datasets import make_classification

print("📊 Generating classification dataset...\n")

X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=15,
    n_redundant=3,
    n_clusters_per_class=2,
    class_sep=0.8,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"✅ Dataset generated:")
print(f"   Training samples: {len(X_train)}")
print(f"   Test samples: {len(X_test)}")
print(f"   Features: {X.shape[1]}")
print(f"   Classes: {len(np.unique(y))}")
print(f"   Class distribution: {np.bincount(y_train) / len(y_train) * 100}%")

In [None]:
print("🔨 Training individual base models...\n")

# Model 1: Logistic Regression (linear)
print("1️⃣ Logistic Regression (Linear)")
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_acc = accuracy_score(y_test, lr_pred)
lr_auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])
print(f"   Accuracy: {lr_acc:.4f}, AUC: {lr_auc:.4f}")

# Model 2: Random Forest (tree ensemble)
print(f"\n2️⃣ Random Forest (Tree Ensemble)")
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred)
rf_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print(f"   Accuracy: {rf_acc:.4f}, AUC: {rf_auc:.4f}")

# Model 3: XGBoost (gradient boosting)
print(f"\n3️⃣ XGBoost (Gradient Boosting)")
xgb_clf = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    use_label_encoder=False,
    eval_metric='logloss'
)
xgb_clf.fit(X_train, y_train)
xgb_pred = xgb_clf.predict(X_test)
xgb_acc = accuracy_score(y_test, xgb_pred)
xgb_auc = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:, 1])
print(f"   Accuracy: {xgb_acc:.4f}, AUC: {xgb_auc:.4f}")

print(f"\n📊 Individual Model Performance Summary:")
print(f"   {'Model':<25} {'Accuracy':<12} {'AUC':<10}")
print(f"   {'-'*47}")
print(f"   {'Logistic Regression':<25} {lr_acc:<12.4f} {lr_auc:<10.4f}")
print(f"   {'Random Forest':<25} {rf_acc:<12.4f} {rf_auc:<10.4f}")
print(f"   {'XGBoost':<25} {xgb_acc:<12.4f} {xgb_auc:<10.4f}")
print(f"\n   Best single model: {max([('LR', lr_acc), ('RF', rf_acc), ('XGB', xgb_acc)], key=lambda x: x[1])[0]} ({max(lr_acc, rf_acc, xgb_acc):.4f})")

In [None]:
print("🗳️ Building Voting Ensembles...\n")

# Hard Voting
print("1️⃣ Hard Voting (Majority Class)")
voting_hard = VotingClassifier(
    estimators=[
        ('lr', lr),
        ('rf', rf),
        ('xgb', xgb_clf)
    ],
    voting='hard'
)
voting_hard.fit(X_train, y_train)
voting_hard_pred = voting_hard.predict(X_test)
voting_hard_acc = accuracy_score(y_test, voting_hard_pred)
print(f"   Accuracy: {voting_hard_acc:.4f}")

# Soft Voting
print(f"\n2️⃣ Soft Voting (Average Probabilities)")
voting_soft = VotingClassifier(
    estimators=[
        ('lr', lr),
        ('rf', rf),
        ('xgb', xgb_clf)
    ],
    voting='soft'
)
voting_soft.fit(X_train, y_train)
voting_soft_pred = voting_soft.predict(X_test)
voting_soft_acc = accuracy_score(y_test, voting_soft_pred)
voting_soft_auc = roc_auc_score(y_test, voting_soft.predict_proba(X_test)[:, 1])
print(f"   Accuracy: {voting_soft_acc:.4f}")
print(f"   AUC: {voting_soft_auc:.4f}")

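# Weighted Soft Voting (weight by cross-validated AUC on the training set,
# so the test set is not used to choose the weights)
print(f"\n3️⃣ Weighted Soft Voting (Performance-Based Weights)")
weights = [
    cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc').mean()
    for model in (lr, rf, xgb_clf)
]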
voting_weighted = VotingClassifier(
    estimators=[
        ('lr', lr),
        ('rf', rf),
        ('xgb', xgb_clf)
    ],
    voting='soft',
    weights=weights
)
voting_weighted.fit(X_train, y_train)
voting_weighted_pred = voting_weighted.predict(X_test)
voting_weighted_acc = accuracy_score(y_test, voting_weighted_pred)
voting_weighted_auc = roc_auc_score(y_test, voting_weighted.predict_proba(X_test)[:, 1])
print(f"   Weights: {[f'{w:.3f}' for w in weights]}")
print(f"   Accuracy: {voting_weighted_acc:.4f}")
print(f"   AUC: {voting_weighted_auc:.4f}")

# Comparison
print(f"\n📊 Voting Ensemble Comparison:")
print(f"   {'Ensemble Type':<30} {'Accuracy':<12} {'AUC':<10}")
print(f"   {'-'*52}")
print(f"   {'Best Single Model':<30} {max(lr_acc, rf_acc, xgb_acc):<12.4f} {max(lr_auc, rf_auc, xgb_auc):<10.4f}")
print(f"   {'Hard Voting':<30} {voting_hard_acc:<12.4f} {'N/A':<10}")
print(f"   {'Soft Voting':<30} {voting_soft_acc:<12.4f} {voting_soft_auc:<10.4f}")
print(f"   {'Weighted Soft Voting':<30} {voting_weighted_acc:<12.4f} {voting_weighted_auc:<10.4f}")

improvement = (voting_soft_acc - max(lr_acc, rf_acc, xgb_acc)) / max(lr_acc, rf_acc, xgb_acc) * 100
print(f"\n💡 Ensemble Improvement: {improvement:.2f}% over best single model")
print(f"   Soft voting typically outperforms hard voting by using probability information")

## 📋 Batch 1 Summary: Voting Ensembles Complete

### ✅ What We've Covered

1. **Meta-ensemble concepts** - Combining different model types
2. **Voting ensembles** - Hard, soft, and weighted voting
3. **Diverse base models** - Linear + Tree + Boosting for maximum diversity
4. **Performance comparison** - Ensemble typically 2-5% better than single models

### 🎯 Key Insights

- **Diversity is critical**: Use different algorithm families (linear, tree, boosting)
- **Soft voting > hard voting**: Probability averaging uses more information
- **Weighted voting**: Give stronger models more influence
- **Simple to implement**: No complex training procedure

---

### 🚀 Coming in Batch 2

- **Stacking ensembles** - Meta-model learns optimal combination
- **Out-of-fold predictions** - Prevent overfitting in stacking
- **Post-silicon application** - Multi-model yield prediction
- **Blending** - Simpler alternative to stacking
- **8 Real-world projects** - Competition winning and production systems
- **Best practices** - When to use voting vs stacking


## 📚 Stacking: Meta-Learning with Out-of-Fold Predictions

### 📝 What's Happening in This Code?

**Purpose:** Build two-level stacking ensemble where meta-model learns optimal combination of base models.

**Key Points:**
- **Level 0 (Base models)**: Diverse models trained on original features
- **Out-of-fold predictions**: Each sample predicted by models that didn't see it during training
- **Level 1 (Meta-model)**: Learns to combine base model predictions optimally
- **Critical**: Must use out-of-fold predictions to prevent overfitting

**Why This Matters:** Stacking can outperform voting because the meta-model learns how to weight the base models instead of averaging them. Gains of 1-3% are often cited, though they vary by dataset. Stacking is widely used in winning Kaggle solutions.


In [None]:
print("📚 Building Stacking Ensemble...\n")

# Base models (diverse algorithm families)
base_models = [
    ('lr', LogisticRegression(max_iter=1000, random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)),
    ('xgb', xgb.XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.1,
                              random_state=42, use_label_encoder=False, eval_metric='logloss')),
    ('lgb', lgb.LGBMClassifier(n_estimators=100, num_leaves=31, learning_rate=0.1,
                               random_state=42, verbose=-1))
]

# Meta-model (simple linear model to learn optimal weights)
meta_model = LogisticRegression(max_iter=1000, random_state=42)

# Stacking Classifier with 5-fold out-of-fold predictions
stacking_clf = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5,  # 5-fold cross-validation for out-of-fold predictions
    stack_method='predict_proba',  # Use probabilities instead of hard predictions
    n_jobs=-1
)

print("🔨 Training stacking ensemble (this takes longer due to cross-validation)...")
start_time = time.time()
stacking_clf.fit(X_train, y_train)
stacking_time = time.time() - start_time

# Predictions
stacking_pred = stacking_clf.predict(X_test)
stacking_proba = stacking_clf.predict_proba(X_test)[:, 1]

# Metrics
stacking_acc = accuracy_score(y_test, stacking_pred)
stacking_auc = roc_auc_score(y_test, stacking_proba)

print(f"\n✅ Stacking complete ({stacking_time:.2f}s)")
print(f"   Base models: 4 (LR, RF, XGB, LGB)")
print(f"   Meta-model: LogisticRegression")
print(f"   Cross-validation: 5-fold")
print(f"   Accuracy: {stacking_acc:.4f}")
print(f"   AUC: {stacking_auc:.4f}")

# Compare with voting
print(f"\n📊 Stacking vs Voting vs Single Models:")
print(f"   {'Method':<30} {'Accuracy':<12} {'AUC':<10}")
print(f"   {'-'*52}")
print(f"   {'Best Single Model':<30} {max(lr_acc, rf_acc, xgb_acc):<12.4f} {max(lr_auc, rf_auc, xgb_auc):<10.4f}")
print(f"   {'Soft Voting':<30} {voting_soft_acc:<12.4f} {voting_soft_auc:<10.4f}")
print(f"   {'Weighted Soft Voting':<30} {voting_weighted_acc:<12.4f} {voting_weighted_auc:<10.4f}")
print(f"   {'Stacking':<30} {stacking_acc:<12.4f} {stacking_auc:<10.4f}")

print(f"\n💡 Key Insights:")
stacking_improvement = (stacking_acc - max(lr_acc, rf_acc, xgb_acc)) / max(lr_acc, rf_acc, xgb_acc) * 100
print(f"   • Stacking improvement over best single: {stacking_improvement:.2f}%")
print(f"   • Meta-model learns optimal weights from out-of-fold predictions")
print(f"   • Typically 1-3% better than voting (worth the extra complexity)")
print(f"   • Training time: {stacking_time:.1f}s (each base model is fit 6 times: 5 CV folds + 1 full refit)")

In [None]:
# Analyze meta-model learned weights
print("🔍 Meta-Model Analysis\n")

# Get meta-model coefficients (weights for each base model's predictions)
meta_coef = stacking_clf.final_estimator_.coef_[0]
base_model_names = [name for name, _ in base_models]

# For a binary target with stack_method='predict_proba', StackingClassifier keeps
# only the class-1 probability column of each base model (the class-0 column is
# dropped because it is perfectly collinear), so there is one coefficient per model.
num_base_models = len(base_models)
class_1_coefs = meta_coef[:num_base_models]

print("📊 Meta-Model Learned Weights (for class 1 probabilities):\n")
for name, coef in zip(base_model_names, class_1_coefs):
    print(f"   {name:<10} → {coef:>8.4f}")

# Visualize
plt.figure(figsize=(10, 6))
plt.bar(base_model_names, class_1_coefs)
plt.xlabel('Base Model', fontsize=12)
plt.ylabel('Meta-Model Weight', fontsize=12)
plt.title('Stacking Meta-Model Learned Weights', fontsize=14, fontweight='bold')
plt.axhline(y=0, color='black', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.show()

print(f"\n💡 Interpretation:")
print(f"   • Positive weights: Model's predictions positively influence final prediction")
print(f"   • Larger magnitude: More influence in ensemble")
print(f"   • Meta-model automatically learns optimal combination")
print(f"   • Different from equal-weight voting (all weights = 1/{len(base_models)})")

## 🔬 Post-Silicon Application: Multi-Model Yield Prediction

### 📝 What's Happening in This Code?

**Purpose:** Build production-grade stacking ensemble for semiconductor yield prediction.

**Key Points:**
- **Scale**: 50K devices with 25 parametric tests + categorical features
- **Base models**: 5 diverse models (Linear, Tree, RF, XGB, LGB)
- **Meta-model**: LightGBM (can capture non-linear interactions between base-model predictions, at a higher overfitting risk than a linear meta-model)
- **Business value**: Combine strengths of multiple models for robust predictions

**Why This Matters:** In production, model performance can degrade over time. Ensembles are more robust to drift because failure of one model doesn't break the system.


In [None]:
# Generate realistic post-silicon dataset
print("🏭 Generating 50K device post-silicon dataset...\n")

np.random.seed(42)
n_devices = 50000

# Categorical features
equipment_ids = [f'EQ_{i:03d}' for i in range(50)]
lot_ids = [f'LOT_{i:04d}' for i in range(100)]

equipment_id = np.random.choice(equipment_ids, n_devices)
lot_id = np.random.choice(lot_ids, n_devices)

# Equipment/lot effects
equipment_effects = {eq: np.random.normal(0, 2) for eq in equipment_ids}
lot_effects = {lot: np.random.normal(0, 3) for lot in lot_ids}

# Parametric tests (25 features)
voltage = np.random.normal(1.8, 0.04, n_devices)
current = np.random.normal(150, 18, n_devices)
frequency = np.random.normal(2000, 90, n_devices)
temperature = np.random.uniform(25, 85, n_devices)
power = voltage * current
leakage = np.random.exponential(9, n_devices)
delay = np.random.normal(500, 45, n_devices)
jitter = np.random.exponential(18, n_devices)
noise_margin = np.random.normal(0.3, 0.055, n_devices)
skew = np.random.normal(0, 16, n_devices)

# Additional tests
additional_tests = {f'test_{i}': np.random.normal(100, 10, n_devices) for i in range(11, 26)}

# Spatial
die_x = np.random.randint(0, 50, n_devices)
die_y = np.random.randint(0, 50, n_devices)

# Complex yield model
yield_score = (
    100 +
    0.5 * (frequency - 2000) / 90 +
    -0.35 * (temperature - 25) / 10 +
    -1.1 * leakage / 9 +
    -0.06 * delay / 45 +
    -0.25 * jitter / 18 +
    11 * noise_margin +
    np.array([equipment_effects[eq] for eq in equipment_id]) +
    np.array([lot_effects[lot] for lot in lot_id]) +
    np.random.normal(0, 5.5, n_devices)
)

yield_binary = (yield_score > 95).astype(int)

# Create DataFrame
df_ps = pd.DataFrame({
    'Equipment_ID': equipment_id,
    'Lot_ID': lot_id,
    'Voltage_V': voltage,
    'Current_mA': current,
    'Frequency_MHz': frequency,
    'Temperature_C': temperature,
    'Power_mW': power,
    'Leakage_uA': leakage,
    'Delay_ps': delay,
    'Jitter_ps': jitter,
    'Noise_Margin': noise_margin,
    'Skew_ps': skew,
    'Die_X': die_x,
    'Die_Y': die_y,
    'Yield': yield_binary
})

for name, values in additional_tests.items():
    df_ps[name] = values

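# Encode categoricals as integer codes for sklearn models
# (the code order is arbitrary: fine for tree models, only a rough proxy for the linear model)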
from sklearn.preprocessing import LabelEncoder
le_eq = LabelEncoder()
le_lot = LabelEncoder()
df_ps['Equipment_ID_encoded'] = le_eq.fit_transform(df_ps['Equipment_ID'])
df_ps['Lot_ID_encoded'] = le_lot.fit_transform(df_ps['Lot_ID'])

print(f"✅ Dataset Generated:")
print(f"   Devices: {n_devices:,}")
# Exclude the target and the two raw string ID columns (their encoded copies are counted)
print(f"   Features: {df_ps.shape[1] - 3} (2 categorical + 25 parametric + 2 spatial)")
print(f"   Yield rate: {yield_binary.mean():.1%}")
print(f"\n{df_ps.head()}")

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Prepare data (use encoded categoricals)
feature_cols = [col for col in df_ps.columns if col not in ['Yield', 'Equipment_ID', 'Lot_ID']]
X_ps = df_ps[feature_cols].values
y_ps = df_ps['Yield'].values

X_train_ps, X_test_ps, y_train_ps, y_test_ps = train_test_split(
    X_ps, y_ps, test_size=0.2, random_state=42, stratify=y_ps
)

print("🚀 Building Production Stacking Ensemble...\n")

# Diverse base models
base_models_ps = [
    ('lr', LogisticRegression(max_iter=1000, random_state=42)),
    ('dt', DecisionTreeClassifier(max_depth=10, random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=100, max_depth=12, random_state=42, n_jobs=-1)),
    ('xgb', xgb.XGBClassifier(n_estimators=150, max_depth=6, learning_rate=0.05,
                              random_state=42, use_label_encoder=False, eval_metric='logloss', n_jobs=-1)),
    ('lgb', lgb.LGBMClassifier(n_estimators=150, num_leaves=31, learning_rate=0.05,
                               random_state=42, verbose=-1, n_jobs=-1))
]

# Meta-model: LightGBM (can model non-linear interactions between base-model predictions)
meta_model_ps = lgb.LGBMClassifier(n_estimators=50, num_leaves=15, learning_rate=0.1,
                                   random_state=42, verbose=-1)

print("1️⃣ Training base models individually (for comparison)...")
base_results = {}
for name, model in base_models_ps:
    model.fit(X_train_ps, y_train_ps)
    pred = model.predict(X_test_ps)
    if hasattr(model, 'predict_proba'):
        proba = model.predict_proba(X_test_ps)[:, 1]
        auc = roc_auc_score(y_test_ps, proba)
    else:
        auc = None
    acc = accuracy_score(y_test_ps, pred)
    base_results[name] = {'accuracy': acc, 'auc': auc}
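    # A conditional is not valid inside a format spec, so build the AUC string first
    auc_str = f"{auc:.4f}" if auc is not None else 'N/A'
    print(f"   {name:<6} Accuracy: {acc:.4f}, AUC: {auc_str}")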

# Stacking ensemble
print(f"\n2️⃣ Training stacking ensemble (5-fold CV)...")
start_time = time.time()

stacking_ps = StackingClassifier(
    estimators=base_models_ps,
    final_estimator=meta_model_ps,
    cv=5,
    stack_method='predict_proba',
    n_jobs=-1
)

stacking_ps.fit(X_train_ps, y_train_ps)
stacking_ps_time = time.time() - start_time

# Predictions
stacking_ps_pred = stacking_ps.predict(X_test_ps)
stacking_ps_proba = stacking_ps.predict_proba(X_test_ps)[:, 1]

# Metrics
stacking_ps_acc = accuracy_score(y_test_ps, stacking_ps_pred)
stacking_ps_auc = roc_auc_score(y_test_ps, stacking_ps_proba)
stacking_ps_cm = confusion_matrix(y_test_ps, stacking_ps_pred)

print(f"\n✅ Stacking ensemble complete ({stacking_ps_time:.2f}s)")
print(f"   Accuracy: {stacking_ps_acc:.4f}")
print(f"   AUC: {stacking_ps_auc:.4f}")

# Comparison
print(f"\n📊 Model Comparison on 50K Device Dataset:")
print(f"   {'Model':<20} {'Accuracy':<12} {'AUC':<10}")
print(f"   {'-'*42}")
for name, results in base_results.items():
    auc_str = f"{results['auc']:.4f}" if results['auc'] else 'N/A'
    print(f"   {name.upper():<20} {results['accuracy']:<12.4f} {auc_str:<10}")
print(f"   {'-'*42}")
print(f"   {'STACKING ENSEMBLE':<20} {stacking_ps_acc:<12.4f} {stacking_ps_auc:<10.4f}")

best_base_acc = max(r['accuracy'] for r in base_results.values())
improvement = (stacking_ps_acc - best_base_acc) / best_base_acc * 100
print(f"\n💡 Stacking improvement: {improvement:.2f}% over best single model")

# Confusion matrix
print(f"\n📋 Confusion Matrix:")
print(f"   True Neg:  {stacking_ps_cm[0,0]:7,}  |  False Pos: {stacking_ps_cm[0,1]:6,}")
print(f"   False Neg: {stacking_ps_cm[1,0]:7,}  |  True Pos:  {stacking_ps_cm[1,1]:6,}")

print(f"\n📋 Classification Report:")
print(classification_report(y_test_ps, stacking_ps_pred, target_names=['Fail', 'Pass']))

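# Business impact (label 0 = Fail, 1 = Pass, so the "positive" class is Pass)
good_rejected = stacking_ps_cm[1, 0]    # false negatives: passing devices predicted to fail
missed_failures = stacking_ps_cm[0, 1]  # false positives: failing devices predicted to pass
true_negatives = stacking_ps_cm[0, 0]   # failing devices correctly caught

print(f"💰 Business Impact (test set: {len(y_test_ps):,} devices):")
print(f"   Correctly caught failures: {true_negatives:,}")
print(f"   Cost avoided: ${true_negatives * 1.0:,.0f}")
print(f"   Missed failures: {missed_failures:,}")
print(f"   Cost of misses: ${missed_failures * 5.0:,.0f}")
print(f"   Good devices rejected: {good_rejected:,}")
print(f"   Cost of false rejects: ${good_rejected * 3.0:,.0f}")
print(f"   Net benefit: ${(true_negatives * 1.0 - missed_failures * 5.0 - good_rejected * 3.0):,.0f}")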

print(f"\n🎯 Production Advantages:")
print(f"   • Robust to model degradation (single model failure doesn't break system)")
print(f"   • Combines strengths: Linear captures trends, trees capture non-linearities")
print(f"   • Meta-model re-learns weights whenever it is retrained on fresh data")
print(f"   • Base models can be added or removed, but the meta-model must then be refit")

## 🎯 Real-World Ensemble Projects

### 🔬 Post-Silicon Validation Projects (4)

### **1. Multi-Model Test Flow Optimizer**
**Objective:** Ensemble predicts test time from multiple model perspectives

**Features:**
- Base models: LinearRegression (trends), RandomForest (non-linear), XGBoost (interactions)
- Stacking meta-model learns optimal combination
- 15 parametric tests, equipment_id, lot_id categorical features
- Predict total test time for adaptive scheduling

**Success Metrics:**
- Ensemble MAE <5ms (better than single models by 15-20%)
- Robust to equipment changes (ensemble doesn't break)
- Test time reduction: 20-30% via dynamic scheduling
- Production uptime: 99.9% (voting fallback if one model fails)

**Business Value:** $100K-300K annual savings per production line

---

### **2. Robust Failure Mode Classification**
**Objective:** Multi-class ensemble for identifying 10+ failure modes

**Features:**
- Base models: SVM (boundaries), RandomForest (non-linear), XGBoost (boosting)
- 20 parametric tests + spatial features (die_x, die_y)
- Failure modes: 10 classes, including leakage, delay, power, frequency, voltage, and mixed
- Voting ensemble (hard voting for interpretability)

**Success Metrics:**
- Multi-class accuracy >75% (vs 65% single model)
- Confusion between similar modes <10%
- Root cause identification time: 2 hours → 15 minutes
- Ensemble robustness: Works even if one model degrades

**Business Value:** Reduce debug time 60% → $500K-2M annual savings

---

### **3. Production Drift-Resistant Yield Predictor**
**Objective:** Stacking ensemble adapts to process changes over time

**Features:**
- Base models: 6 diverse models (Linear, Trees, Boosting, SVM)
- Monthly retraining: Meta-model learns new weights as process drifts
- Categorical: equipment_id (100+), lot_id (1000s), supplier_id
- Temporal features: days_since_calibration, cumulative_devices

**Success Metrics:**
- AUC degradation <2% over 6 months (vs 8% single model)
- Automatic weight adjustment detects model degradation
- Alert when base model contribution drops >20%
- Production uptime: 99.95% (ensemble never fails completely)

**Business Value:** Consistent quality → $2-5M avoided rework costs annually
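
The "contribution drops >20%" alert could be implemented by comparing each base model's share of the meta-model weights between two retraining runs. The sketch below is one hypothetical way to do it: the function name `contribution_drops`, the use of absolute coefficient shares, and the example weights are all assumptions, not part of the project specification.

```python
import numpy as np

def contribution_drops(prev_weights, new_weights, names, threshold=0.20):
    """Return names of base models whose share of total |weight| fell by more
    than `threshold` (relative) between two meta-model fits."""
    prev = np.abs(np.asarray(prev_weights, dtype=float))
    new = np.abs(np.asarray(new_weights, dtype=float))
    prev_share = prev / prev.sum()
    new_share = new / new.sum()
    alerts = []
    for name, p, n in zip(names, prev_share, new_share):
        if p > 0 and (p - n) / p > threshold:
            alerts.append(name)
    return alerts

# Made-up weights from two monthly retraining runs
names = ["lr", "rf", "xgb"]
alerts = contribution_drops([1.0, 2.0, 2.0], [0.2, 2.4, 2.4], names)
print(alerts)
```

Here the share of `lr` falls from 0.20 to 0.04, an 80% relative drop, so it is the only model flagged. Coefficient shares are only a rough proxy for contribution when base-model outputs are on different scales.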

---

### **4. Kaggle-Style Competition: Semiconductor Yield Prediction**
**Objective:** Win internal data science competition using ensemble techniques

**Features:**
- Stacking: 10+ diverse base models (Linear, Tree, Boosting, Neural)
- Feature engineering: Polynomial features, interactions, aggregations
- Hyperparameter tuning: Optuna for each base model
- Blending: Multiple stacking ensembles averaged

**Success Metrics:**
- Top 3 finish (AUC >0.95)
- Beat baseline by >5%
- Reproducible: Cross-validation AUC within 0.5% of leaderboard
- Deploy winner to production

**Business Value:** Prestige + production model worth $1-3M annually

---

### 🌐 General AI/ML Projects (4)

### **5. Healthcare Multi-Model Readmission Predictor**
**Objective:** Ensemble combines clinical, demographic, and behavioral models

**Features:**
- Base models: LogisticReg (demographics), RandomForest (clinical), XGBoost (behavioral)
- Each model specializes in different feature subsets
- Stacking meta-model learns patient-specific weights
- 100+ features: diagnosis_codes, procedures, medications, social determinants

**Success Metrics:**
- AUC >0.78 (vs 0.72 single model)
- Precision >75% (minimize false positives)
- Recall >85% (catch most readmissions)
- Interpretable: Can explain which sub-model contributed most

**Business Value:** Reduce readmissions 18-25% → $8-15M per hospital

---

### **6. Financial Fraud Detection Ensemble**
**Objective:** Real-time fraud scoring with fallback redundancy

**Features:**
- Base models: IsolationForest (anomaly), XGBoost (patterns), LSTM (sequence)
- Voting ensemble for <10ms latency (parallel inference)
- Features: transaction_amount, velocity, merchant_risk, location_anomaly
- Fallback: If LSTM fails, other models still work

**Success Metrics:**
- Precision >85% (minimize false declines)
- Recall >92% (catch most fraud)
- Latency <10ms (real-time authorization)
- Uptime: 99.99% (ensemble never fully fails)

**Business Value:** Block $20-50M fraud annually, reduce false declines 30%
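
The fallback behaviour can be sketched in plain Python. The three scorer functions below are hypothetical placeholders for the IsolationForest, XGBoost, and LSTM services; only the averaging logic is the point. The sketch assumes every scorer returns a fraud probability in [0, 1].

```python
from statistics import mean


def soft_vote_with_fallback(scorers, transaction, min_models=1):
    """Average the scores of every scorer that responds; skip those that raise."""
    scores = {}
    for name, scorer in scorers.items():
        try:
            scores[name] = float(scorer(transaction))
        except Exception:
            # A failed model is left out of the vote instead of failing the request.
            continue
    if len(scores) < min_models:
        raise RuntimeError("too few models available to score the transaction")
    return mean(scores.values()), sorted(scores)


# Hypothetical scorers standing in for real model services.
def anomaly_score(tx):
    return 0.9 if tx["transaction_amount"] > 1000 else 0.1


def pattern_score(tx):
    return 0.7 if tx["velocity"] > 5 else 0.2


def sequence_score(tx):
    raise TimeoutError("sequence model unavailable")  # simulated outage


scorers = {
    "anomaly": anomaly_score,
    "pattern": pattern_score,
    "sequence": sequence_score,
}
tx = {"transaction_amount": 2500.0, "velocity": 8}

score, models_used = soft_vote_with_fallback(scorers, tx)
print(f"fraud score {score:.2f} from {models_used}")
```

Raising `min_models` trades availability for reliability: with `min_models=2` the request fails rather than trusting a single surviving model.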

---

### **7. E-Commerce Click-Through Rate Meta-Ensemble**
**Objective:** Stacking for ad CTR prediction across multiple platforms

**Features:**
- Base models: FM (interactions), GBDT (non-linear), DeepFM (neural)
- Each model trained on different feature representations
- Meta-model: Lightweight LR for <5ms inference
- 1M+ categorical features (user_id, ad_id, context)

**Success Metrics:**
- AUC >0.78 (vs 0.74 single model)
- CTR prediction error <5%
- Revenue lift: 12-18% from better targeting
- A/B test: Ensemble beats single models by 15%

**Business Value:** $10-30M additional revenue per quarter

---

### **8. Kaggle Competition Framework (Tabular Data)**
**Objective:** Reusable stacking pipeline for top 1% finishes

**Features:**
- Automated ensemble: 15+ base models with hyperparameter tuning
- Multi-level stacking: Level 1 (10 models) → Level 2 (5 models) → Level 3 (meta)
- Blending: Average 3 stacking ensembles for robustness
- Feature engineering: Automated polynomial, interactions, aggregations

**Success Metrics:**
- Top 1% finish on 5+ Kaggle competitions
- Reproducible: Cross-validation within 1% of leaderboard
- Fast iteration: End-to-end pipeline in <6 hours
- Open-source: Share framework for community

**Business Value:** Career advancement, consulting opportunities, reputation
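
Multi-level stacking can be sketched in scikit-learn by passing a `StackingClassifier` as the `final_estimator` of another one. The example below is a scaled-down illustration (three level-1 models, two level-2 models, one meta-model) on synthetic data; the model choices are assumptions, not a recommended line-up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Level 2 + Level 3: two models on level-1 predictions, then a linear meta-model.
level2 = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),
    cv=3,
)

# Level 1: diverse base models whose out-of-fold predictions feed level 2.
multi_level = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=30, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=30, random_state=0)),
    ],
    final_estimator=level2,
    cv=3,
)

multi_level.fit(X_train, y_train)
test_accuracy = multi_level.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Each extra level multiplies the number of cross-validated fits, so deeper stacks are worth comparing against a plain two-level stack before committing to them.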

---


## ✅ Key Takeaways: Meta-Ensemble Mastery

### 🎯 When to Use Each Ensemble Type

**Voting Ensembles:**
- ✅ Quick ensemble (parallel training)
- ✅ Interpretable (simple average/majority)
- ✅ Production systems with <100ms latency requirements
- ✅ Failover redundancy (voting continues if one model fails)
- ❌ Weights are not learned (equal or set manually), so the combination is rarely optimal

**Stacking Ensembles:**
- ✅ Competition winning (Kaggle, data science contests)
- ✅ Highest accuracy (often 1-2% better than voting, per the Performance Expectations table)
- ✅ Adapts to data changes on retraining (meta-model relearns the base-model weights)
- ✅ Feature-aware meta-learning (passthrough=True option)
- ❌ Slower training (cross-validation required)
- ❌ More complex (two-level architecture)

**Blending:**
- ✅ Simpler than stacking (no cross-validation)
- ✅ Faster training (single hold-out set)
- ✅ Good balance for production
- ❌ Uses less data (80% train vs 100% in stacking)
- ❌ Slightly worse than stacking
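
scikit-learn has no dedicated blending estimator, so blending is usually written by hand. The following is a minimal sketch on synthetic data, assuming an 80/20 split of the training set into a base-model portion and a hold-out portion; the helper name `meta_features` is made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Blending: base models train on 80% of the training data,
# the meta-model trains on their predictions for the remaining 20%.
X_base, X_hold, y_base, y_hold = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=0
)

base_models = {
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=15),
}
for model in base_models.values():
    model.fit(X_base, y_base)


def meta_features(models, X):
    """Stack each model's positive-class probability as one column."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models.values()])


meta_model = LogisticRegression()
meta_model.fit(meta_features(base_models, X_hold), y_hold)

blend_accuracy = meta_model.score(meta_features(base_models, X_test), y_test)
print(f"blended test accuracy: {blend_accuracy:.3f}")
```

The base models never see the hold-out rows, which is what keeps the meta-model's inputs honest without cross-validation. The cost is that the base models train on less data than they would in stacking.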

---

### 🔑 Critical Success Factors

1. **Diversity is everything**
   - Use different algorithm families: Linear, Tree, Boosting, Neural
   - Different feature sets (e.g., one model on raw, another on engineered)
   - Different hyperparameters or data subsets
   - Rule of thumb: Base model correlation <0.8

2. **Out-of-fold predictions for stacking**
   - **Critical**: Never train meta-model on in-sample predictions
   - Use cv=5 or cv=10 for robust out-of-fold predictions
   - Prevents catastrophic overfitting

3. **Meta-model simplicity**
   - Prefer simple meta-models: LogisticRegression, Ridge, Lasso
   - Complex meta-models (XGBoost) can overfit the handful of meta-features
   - Exception: Large datasets (>100K rows) can support a LightGBM meta-model

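4. **Computational trade-offs**
   - Voting: each base model is fit once (~1x wall-clock time when trained in parallel)
   - Stacking: each base model is fit cv + 1 times (one fit per fold for out-of-fold predictions, plus a final fit on all data), so roughly 6x with cv=5 and 11x with cv=10
   - Inference: Both ~Mx prediction time when base models run sequentially (M = number of base models); stacking adds one small meta-model call
   - If latency <10ms is required, prefer voting with base models served in parallel (no meta-model step)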
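
To see why in-sample predictions are dangerous as meta-features, the sketch below compares a random forest's scores on its own training rows with its out-of-fold scores from `cross_val_predict`. The data is synthetic with deliberately noisy labels (`flip_y=0.1`), so the exact numbers are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# In-sample: the forest scores rows it was trained on.
in_sample = rf.fit(X, y).predict_proba(X)[:, 1]

# Out-of-fold: each row is scored by a clone that never saw that row.
out_of_fold = cross_val_predict(rf, X, y, cv=cv, method="predict_proba")[:, 1]

in_sample_auc = roc_auc_score(y, in_sample)
oof_auc = roc_auc_score(y, out_of_fold)
print(f"in-sample AUC:   {in_sample_auc:.3f}")
print(f"out-of-fold AUC: {oof_auc:.3f}")

# Stacking cost per base model: one fit per fold plus a final fit on all data.
fits_per_base_model = cv.get_n_splits() + 1
print(f"fits per base model with cv=5: {fits_per_base_model}")
```

A meta-model trained on the in-sample column would learn to trust the forest far more than its out-of-fold performance justifies, which is the overfitting that out-of-fold predictions are meant to prevent.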

---

### 📊 Performance Expectations

| Dataset Size | Single Model | Voting | Stacking |
|--------------|--------------|--------|----------|
| Small (<10K) | Baseline | +1-2% | +2-4% |
| Medium (10K-100K) | Baseline | +2-3% | +3-5% |
| Large (>100K) | Baseline | +1-2% | +2-3% |

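*Gains are typical improvements in the evaluation metric relative to the single-model baseline; actual improvements depend on base model diversity and dataset complexity*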

---

### 🔧 Best Practices

1. **Start with voting**
   - Quick baseline ensemble
   - Identifies if ensemble approach is promising
   - Only move to stacking if voting shows benefit

2. **Validate base model diversity**
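   ```python
   import pandas as pd

   # Check correlation between base model predictions.
   # X: ideally a held-out validation set, not the training data;
   # model1 and model2 are already-fitted classifiers.
   predictions_df = pd.DataFrame({
       'model1': model1.predict_proba(X)[:, 1],  # positive-class probability
       'model2': model2.predict_proba(X)[:, 1],
   })
   print(predictions_df.corr())
   # Aim for pairwise correlation <0.8
   ```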

3. **Monitor base model contributions**
   - Track meta-model coefficients over time
   - Remove base models with near-zero weights
   - Alert when previously strong model degrades

4. **Use cross-validation consistently**
   - Same cv folds for all base models in stacking
   - Stratified folds for classification
   - Reproducible seeds

5. **Production deployment patterns**
   - Voting: Parallel microservices (failover redundancy)
   - Stacking: Sequential pipeline (base models → meta-model)
   - Cache base model predictions to speed up inference
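
Practices 3 and 4 can be combined in a short scikit-learn sketch: one seeded `StratifiedKFold` object is shared by all base models, and the fitted meta-model's coefficients are read back as per-model weights. The data is synthetic and the `LOW_WEIGHT` threshold is an arbitrary assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# One seeded, stratified splitter: every base model gets the same folds.
shared_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("nb", GaussianNB()),
]
stack = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression(), cv=shared_cv
)
stack.fit(X, y)

# Binary case: one meta-feature (positive-class probability) per base model.
weights = {
    name: coef
    for (name, _), coef in zip(estimators, stack.final_estimator_.coef_[0])
}
for name, weight in weights.items():
    print(f"{name}: {weight:+.3f}")

LOW_WEIGHT = 0.1  # hypothetical threshold for "near-zero"
low_weight_models = [n for n, w in weights.items() if abs(w) < LOW_WEIGHT]
print("candidates for removal:", low_weight_models)
```

Logging `weights` after each retraining run gives the time series needed to spot a base model whose contribution is fading.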

---

### 🚀 Next Steps

1. **023_Hyperparameter_Optimization.ipynb** - Systematic tuning with Optuna
2. **024_Model_Interpretation.ipynb** - SHAP, LIME, feature interactions
3. **025_Imbalanced_Learning.ipynb** - Class weights, SMOTE, custom loss
4. **026_K_Means_Clustering.ipynb** - Unsupervised learning begins

---

### 🎓 What You've Mastered

✅ **Voting ensembles** - Hard, soft, weighted voting  
✅ **Stacking ensembles** - Two-level meta-learning  
✅ **Out-of-fold predictions** - Prevent overfitting in stacking  
✅ **Blending** - Simpler alternative to stacking  
✅ **Diversity principles** - Why different models matter  
✅ **Production deployment** - Robust, failover-ready systems  
✅ **Meta-model selection** - Simple models for meta-learning  
✅ **Business applications** - $100K-$30M impact across domains  

You now understand how to combine multiple models for superior performance and production robustness! 🎉


## 📚 References and Further Reading

### Original Papers

1. **Stacked Generalization** (1992)  
   Wolpert, Neural Networks 1992  
   Original stacking paper introducing meta-learning concept

2. **Ensemble Methods in Machine Learning** (2000)  
   Dietterich, Multiple Classifier Systems 2000  
   Comprehensive survey of ensemble techniques

3. **Netflix Prize Winning Solution** (2009)  
   Koren, "The BellKor Solution to the Netflix Grand Prize" (team BellKor's Pragmatic Chaos)  
   100+ model ensemble, demonstrates stacking at scale

### Official Documentation

4. **Sklearn Ensemble Module**  
   https://scikit-learn.org/stable/modules/ensemble.html

5. **VotingClassifier/Regressor**  
   https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html

6. **StackingClassifier/Regressor**  
   https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html

### Kaggle Resources

7. **Kaggle Ensembling Guide**  
   https://mlwave.com/kaggle-ensembling-guide/

8. **Stacking Made Easy**  
   https://github.com/vecxoz/vecstack - Python library for stacking

### Related Notebooks

- **017_Random_Forest.ipynb** - Bagging ensemble (parallel trees)
- **018_Gradient_Boosting.ipynb** - Sequential ensemble foundations
- **019_XGBoost.ipynb** - Regularized boosting
- **020_LightGBM.ipynb** - Histogram-based boosting
- **021_CatBoost.ipynb** - Ordered boosting with categoricals
- **023_Hyperparameter_Optimization.ipynb** (next) - Tune base models systematically

---

**Notebook Complete!** ✅  
**Next:** 023_Hyperparameter_Optimization.ipynb - Systematic tuning with Optuna/Hyperopt
