# Ensemble Methods Lab

This hands-on lab demonstrates key ensemble learning concepts using real-world data. We'll explore:

1. **Using Multiple Models Together** - Combining different algorithms
2. **Random Forests and GBTs** - Tree-based ensemble methods
3. **Understanding Bootstrap Aggregation** - The bagging technique
4. **Combining Heterogeneous Models** - Stacking and blending
5. **Evaluating Ensembles of Methods** - Comprehensive performance analysis

## Dataset

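We'll use the **Wine recognition dataset** from the UCI Machine Learning Repository, loaded through scikit-learn's `load_wine()`. It contains 178 samples from a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. The task is to predict the cultivar (class 0, 1, or 2) from 13 chemical measurements.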

## Setup and Dependencies

The next code cell installs optional libraries used later in this notebook: **XGBoost** and **LightGBM**. These are advanced implementations of gradient boosting algorithms that are widely used in industry and competitive machine learning.

**Why these libraries?**
- **scikit-learn**: Great for learning and understanding core concepts, excellent documentation
- **XGBoost**: Highly optimized for speed and performance, includes advanced regularization
- **LightGBM**: Developed by Microsoft, extremely fast on large datasets, uses a unique leaf-wise tree growth

If these packages are already installed in your environment, the installation command will simply skip them. Don't worry if the installation takes a minute - these are substantial libraries!

In [None]:
!uv pip install xgboost lightgbm

In [None]:
# Import necessary libraries
# The imports below provide the datasets, modeling algorithms,
# preprocessing utilities, and evaluation metrics we will use.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine, load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve

# Individual models
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Ensemble methods
from sklearn.ensemble import (
    RandomForestClassifier, 
    GradientBoostingClassifier,
    BaggingClassifier,
    AdaBoostClassifier,
    VotingClassifier,
    StackingClassifier
)
# Optional high-performance gradient boosting libraries
# These packages may not be available in every environment (e.g., classroom machines).
# We import them defensively so the notebook continues to run if they're missing.
try:
    import xgboost as xgb
except Exception:
    xgb = None
    print('xgboost not available; XGBoost comparisons will be skipped.')

try:
    import lightgbm as lgb
except Exception:
    lgb = None
    print('lightgbm not available; LightGBM comparisons will be skipped.')

import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)



## Understanding Key Concepts

Before proceeding, let's establish some foundational knowledge that will help you understand the ensemble methods we'll explore:

### Important Concepts to Remember:

**1. Bagging vs Boosting:**
- **Bagging** (Bootstrap Aggregating): Trains multiple models independently in parallel, then averages their predictions. Think of it as "wisdom of the crowd" where each expert gives an independent opinion.
- **Boosting**: Trains models sequentially, where each new model focuses on fixing the mistakes of previous models. It's like learning from your errors iteratively.

**2. Feature Standardization:**
- **Why it matters**: Some algorithms (like SVM and KNN) are sensitive to the scale of features. If one feature ranges from 0-1 and another from 0-1000, the algorithm might give undue importance to the larger-scale feature.
- **Which models need it**: Distance-based models (KNN, SVM) and gradient-based models (Logistic Regression, Neural Networks)
- **Which models don't**: Tree-based models (Decision Trees, Random Forests, Gradient Boosting) naturally handle different scales

**3. Cross-Validation:**
- **Purpose**: Provides a more reliable estimate of model performance than a single train/test split
- **How it works**: Divides training data into K parts (folds), trains K times using different folds for validation each time
- **Why we use it**: Reduces the risk of getting lucky (or unlucky) with a particular train/test split

**4. Out-of-Bag (OOB) Evaluation:**
- **Unique to bagging methods**: When creating bootstrap samples, ~37% of data is left out of each sample
- **Free validation**: These left-out samples can be used to estimate model performance without needing a separate validation set
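
As a rough illustration of the cross-validation idea in item 3, the sketch below runs stratified 5-fold cross-validation by hand and compares it with `cross_val_score`. The logistic regression model, the `Pipeline` wrapper (so the scaler is refit inside every fold), and all variable names are assumptions made for this example; they are not taken from the lab's cells.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative names; the lab itself uses X_train / y_train
X_demo, y_demo = load_wine(return_X_y=True)

# Scaler + model in one pipeline, so scaling is learned per fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Manual K-fold loop: train on K-1 folds, validate on the held-out fold
manual_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    fold_model = clone(pipe)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The library helper performs the same loop internally
library_scores = cross_val_score(pipe, X_demo, y_demo, cv=skf)

print("Manual fold scores: ", np.round(manual_scores, 4))
print("Library fold scores:", np.round(library_scores, 4))
print(f"Mean: {manual_scores.mean():.4f}  Std: {manual_scores.std():.4f}")
```

The mean is the headline estimate and the standard deviation indicates how sensitive the model is to which samples it sees.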

### Quick Self-Check:
Before running the next cells, think about:
- Which algorithm might perform best on the Wine dataset and why?
- Do you expect linear models or tree-based models to work better for this classification task?

## Load and Explore the Dataset

We'll load the Wine dataset from scikit-learn, which is a well-known classification dataset perfect for demonstrating ensemble methods.

### Understanding the Dataset

The next cell loads the Wine dataset and displays essential information. As you run it, pay attention to:

**1. Class Balance:**
- Are the three wine classes equally represented?
- Imbalanced classes can bias models toward the majority class
- This dataset is reasonably balanced, which makes accuracy a reliable metric

**2. Feature Information:**
- Number of features (13 chemical properties like alcohol content, acidity, etc.)
- All features are continuous numerical values (no categorical variables)
- This is ideal for ensemble methods which work well with numerical data

**3. Dataset Size:**
- With 178 samples, this is a small dataset
- Bagging-style ensembles can still help on small datasets by averaging away some of the variance of individual models
- Cross-validation becomes especially important with limited data

**What to look for in the output:**
- Total samples and how they split across classes
- Feature names (these are chemical measurements from wine analysis)
- First few rows to see the scale and range of values

**Try this:** After running the cell, use `df.describe()` and `df.info()` in a new cell to explore the data distributions further!

In [None]:
# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=wine.feature_names)
df['target'] = y

print("Dataset Information:")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Number of classes: {len(np.unique(y))}")
print(f"\nClass distribution:")
print(pd.Series(y).value_counts().sort_index())

print("\nFeature names:")
print(wine.feature_names)

print("\nFirst few rows:")
df.head()

In [None]:
# Visualize class distribution
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
pd.Series(y).value_counts().sort_index().plot(kind='bar')
plt.title('Class Distribution')
plt.xlabel('Wine Class')
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.subplot(1, 2, 2)
df.iloc[:, :4].boxplot()
plt.title('Feature Distributions (First 4 Features)')
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

### Visualizing the Data

The cell above creates two visualizations:

**Left Plot - Class Distribution:**
- Shows how many samples we have for each wine class
- Helps identify if we have class imbalance (we don't in this case!)
- Balanced classes mean we can trust accuracy as our primary metric

**Right Plot - Feature Distributions (First 4 Features):**
- Box plots show the spread and outliers for each feature
- Notice the different scales across features (this is why we'll standardize later)
- Outliers appear as individual points beyond the whiskers
- The wide range of scales reinforces why distance-based algorithms need standardization

## Data Preparation

Split the data into training and test sets, and standardize the features for better model performance.

### Data Splitting and Scaling Explained

The next cell performs two critical preprocessing steps:

**1. Train/Test Split (70%/30%):**
- **Training set (70%)**: Used to train our models and tune parameters
- **Test set (30%)**: Held aside to evaluate final performance (simulates new, unseen data)
- **Stratified splitting**: Ensures each split maintains the same class proportions as the original dataset
  - Example: If the original data is 40% class 0, 30% class 1, 30% class 2, both train and test sets will have the same proportions
  - This is crucial for small datasets to avoid accidentally creating imbalanced splits

**2. Feature Standardization (StandardScaler):**
- **What it does**: Transforms each feature to have mean=0 and standard deviation=1
- **Why it matters**: 
  - Prevents features with larger scales from dominating the model
  - Essential for: SVM, KNN, Logistic Regression, and Neural Networks
  - Not needed for: Decision Trees, Random Forests, and Gradient Boosting (they use splits, not distances)
- **Important**: We fit the scaler ONLY on training data, then apply it to test data
  - This prevents "data leakage" - we don't want the test set influencing our preprocessing

**Which models in this lab need scaling?**
- ✅ Need scaling: SVM, KNN, Logistic Regression
- ❌ Don't need scaling: Decision Trees, Random Forests, Gradient Boosting

**Mini-experiment idea**: Try training a KNN model with and without scaling to see the difference in performance!
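
One possible way to run that mini-experiment is sketched below. It compares KNN on raw features with KNN inside a scaling pipeline, using 5-fold cross-validation on the full Wine data. The variable names and the use of `make_pipeline` are assumptions for this example, not part of the lab's own cells.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative names; loaded separately so this block runs on its own
X_demo, y_demo = load_wine(return_X_y=True)

# KNN on raw features: large-scale features dominate the distance
knn_raw = KNeighborsClassifier()

# KNN after standardization: every feature contributes on a similar scale
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_mean = cross_val_score(knn_raw, X_demo, y_demo, cv=5).mean()
scaled_mean = cross_val_score(knn_scaled, X_demo, y_demo, cv=5).mean()

print(f"KNN without scaling: {raw_mean:.4f}")
print(f"KNN with scaling:    {scaled_mean:.4f}")
```

Because the scaler sits inside the pipeline, it is refit on the training portion of each fold, which matches the "fit on training data only" rule described above.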

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardize features (important for some algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"\nTraining set class distribution:")
print(pd.Series(y_train).value_counts().sort_index())

# 1. Using Multiple Models Together

Before diving into ensemble methods, let's first train several individual models to establish baselines. We'll compare their performance and then see how combining them improves results.

### Establishing Baseline Performance

**Why start with individual models?**
Ensemble methods combine multiple models, so we need to understand how well individual models perform first. This gives us:
- A **baseline** to compare against - how much improvement do ensembles provide?
- Insight into **which models work well** for this dataset
- Understanding of **model diversity** - do different models make different mistakes?

**What this cell measures:**

1. **Test Accuracy**: How well the model performs on unseen data (our test set)
   - This is what you'd report as the final model performance

2. **Cross-Validation (CV) Score**: More reliable than a single test score
   - **CV Mean**: Average performance across 5 different train/validation splits
   - **CV Std (Standard Deviation)**: How much performance varies between folds
     - Low std = stable, consistent model
     - High std = model is sensitive to which data it sees

**The 5 Models We're Testing:**
- **Decision Tree**: Fast, interpretable, but prone to overfitting
- **Logistic Regression**: Linear model, works well when classes are linearly separable
- **SVM (Support Vector Machine)**: Finds optimal decision boundary, powerful but slower
- **K-Nearest Neighbors**: Classifies based on similarity to nearby training examples
- **Naive Bayes**: Fast probabilistic model, assumes feature independence

**After running, ask yourself:**
- Which model performs best? Why might that be?
- Which models show high variance (large CV std)? This suggests instability.
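- Do you see any models overfitting (train accuracy much higher than test)? The cell below reports only test and CV scores, so compare against `model.score(X_train_scaled, y_train)` to check.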
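
To make the overfitting question concrete, here is a small sketch that measures the train/test gap for a single unpruned decision tree. It reloads the data and repeats the 70/30 stratified split so it can run on its own; the variable names are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative names; same split settings as the lab's preparation cell
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# An unpruned tree can keep splitting until it fits the training data
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
gap = train_acc - test_acc

print(f"Train accuracy: {train_acc:.4f}")
print(f"Test accuracy:  {test_acc:.4f}")
print(f"Gap:            {gap:.4f}  (a large gap suggests overfitting)")
```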

In [None]:
# Train multiple individual models
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB()
}

# Store results
individual_results = {}

print("Training individual models...\n")
print("="*70)

for name, model in models.items():
    # Train the model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_scaled)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
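    # Cross-validation score on the training set
    # (the scaler was fit on all of X_train, so folds share scaling statistics;
    # wrapping scaler + model in a Pipeline would avoid this mild leakage)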
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    cv_mean = cv_scores.mean()
    cv_std = cv_scores.std()
    
    individual_results[name] = {
        'model': model,
        'test_accuracy': accuracy,
        'cv_mean': cv_mean,
        'cv_std': cv_std
    }
    
    print(f"{name}:")
    print(f"  Test Accuracy: {accuracy:.4f}")
    print(f"  CV Score: {cv_mean:.4f} (+/- {cv_std:.4f})")
    print("-"*70)

print("="*70)

In [None]:
# Visualize individual model performance
results_df = pd.DataFrame({
    'Model': list(individual_results.keys()),
    'Test Accuracy': [r['test_accuracy'] for r in individual_results.values()],
    'CV Mean': [r['cv_mean'] for r in individual_results.values()],
    'CV Std': [r['cv_std'] for r in individual_results.values()]
})

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.barh(results_df['Model'], results_df['Test Accuracy'])
plt.xlabel('Test Accuracy')
plt.title('Individual Model Performance')
plt.xlim(0.8, 1.0)

plt.subplot(1, 2, 2)
plt.errorbar(results_df['CV Mean'], results_df['Model'], 
             xerr=results_df['CV Std'], fmt='o', markersize=8)
plt.xlabel('Cross-Validation Score')
plt.title('CV Performance with Standard Deviation')
plt.xlim(0.8, 1.0)

plt.tight_layout()
plt.show()

print("\nPerformance Summary:")
print(results_df.to_string(index=False))

### Visualizing Model Performance

This visualization helps you compare models at a glance:

**Left Plot - Test Accuracy:**
- Simple bar chart showing final test performance
- Easy to identify the best and worst performers
- But this only shows one metric from one train/test split...

**Right Plot - Cross-Validation with Error Bars:**
- **Center point**: Mean performance across 5 folds
- **Error bars**: Show the standard deviation (spread of scores)
- **What good error bars tell you**:
  - Short bars = consistent, reliable model
  - Long bars = unstable model, performance varies with data
  
**Key insight**: A model with slightly lower mean but smaller error bars might be more trustworthy than one with higher mean but large variance!

### Simple Voting Ensemble

Now let's combine these models using a **Voting Classifier**. This ensemble method combines predictions from multiple models using either:
- **Hard voting**: Each model votes for a class, and the majority wins
- **Soft voting**: Each model's predicted class probabilities are averaged, and the class with the highest average wins (often performs better)

### How Voting Ensembles Work

Now we're combining our individual models! Think of this like a panel of experts voting on the correct answer.

**Two Voting Strategies:**

**1. Hard Voting (Majority Vote):**
- Each model predicts a class (0, 1, or 2)
- The class with the most votes wins
- Example: If 3 models predict class 1, and 2 models predict class 2, the final prediction is class 1
- Simple and interpretable

**2. Soft Voting (Probability Weighted):**
- Each model outputs probabilities for each class (e.g., [0.2, 0.7, 0.1])
- Probabilities are averaged across all models
- The class with the highest average probability wins
- Generally performs better because it considers confidence levels
- **Requirement**: All models must support `predict_proba()` method

**Why Voting Ensembles Work:**
- **Diversity of mistakes**: Different algorithms make different errors
  - Linear models struggle with non-linear boundaries
  - KNN can be fooled by noisy data
  - Decision trees might overfit to certain patterns
- When combined, the errors of one model can be corrected by others
- The "wisdom of crowds" principle in action!

**What you should see**: Voting ensemble performance typically matches or exceeds the best individual model, with more stable predictions.
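
The difference between the two strategies can be seen on a toy example. The probabilities below are made up for illustration (three hypothetical models, one sample, three classes); they do not come from the lab's models.

```python
import numpy as np

# Rows = hypothetical models, columns = predicted probability per class
probs = np.array([
    [0.45, 0.55, 0.00],  # model A: narrowly prefers class 1
    [0.45, 0.55, 0.00],  # model B: narrowly prefers class 1
    [0.90, 0.10, 0.00],  # model C: strongly prefers class 0
])

# Hard voting: each model casts one vote for its top class
votes = probs.argmax(axis=1)                                  # [1, 1, 0]
hard_winner = np.bincount(votes, minlength=probs.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the top class
mean_probs = probs.mean(axis=0)                               # [0.6, 0.4, 0.0]
soft_winner = mean_probs.argmax()

print(f"Hard voting picks class {hard_winner}")  # class 1 (two votes to one)
print(f"Soft voting picks class {soft_winner}")  # class 0 (confidence counts)
```

Two lukewarm votes beat one confident vote under hard voting, while soft voting lets the confident model tip the result.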

In [None]:
# Create voting classifiers
estimators = [(name, model) for name, model in models.items()]

# Hard voting
voting_hard = VotingClassifier(estimators=estimators, voting='hard')
voting_hard.fit(X_train_scaled, y_train)
y_pred_hard = voting_hard.predict(X_test_scaled)
acc_hard = accuracy_score(y_test, y_pred_hard)

# Soft voting
voting_soft = VotingClassifier(estimators=estimators, voting='soft')
voting_soft.fit(X_train_scaled, y_train)
y_pred_soft = voting_soft.predict(X_test_scaled)
acc_soft = accuracy_score(y_test, y_pred_soft)

print("\nVoting Ensemble Results:")
print("="*70)
print(f"Hard Voting Accuracy: {acc_hard:.4f}")
print(f"Soft Voting Accuracy: {acc_soft:.4f}")
print("\nComparison with best individual model:")
best_individual = max(individual_results.items(), key=lambda x: x[1]['test_accuracy'])
print(f"Best Individual Model: {best_individual[0]}")
print(f"Best Individual Accuracy: {best_individual[1]['test_accuracy']:.4f}")
print(f"\nImprovement (Soft Voting): {(acc_soft - best_individual[1]['test_accuracy']):.4f}")
print("="*70)

# 2. Random Forests and Gradient Boosted Trees

## Random Forests

Random Forests use **Bootstrap Aggregation (Bagging)** combined with random feature selection. Each tree is trained on a different bootstrap sample, and at each split, only a random subset of features is considered.

### Understanding Random Forests

**What is a Random Forest?**
A Random Forest is an ensemble of Decision Trees, but with two key sources of randomness:

1. **Bootstrap Sampling**: Each tree is trained on a different random sample of the data (with replacement)
2. **Random Feature Selection**: At each split, only a random subset of features is considered
   - For classification: typically √(n_features) are considered
   - This prevents trees from being too similar to each other

**Why Randomness Helps:**
- Without randomness, all trees would be identical (they'd see the same data and same features)
- Randomness creates **diverse** trees that make different mistakes
- When we average predictions, the errors cancel out, but the correct predictions reinforce each other

**The Experiment:**
This cell trains Random Forests with different numbers of trees (10, 50, 100, 200, 500) to answer:
- Does more trees = better performance?
- Is there a point of diminishing returns?
- Can Random Forests overfit with too many trees?

**What to watch for:**
- **Training accuracy**: Will be high (maybe too high) because decision trees can memorize data
- **Test accuracy**: Should improve initially, then plateau
- **The gap between train and test**: Should remain stable (Random Forests resist overfitting!)

**Computational trade-off**: More trees = better performance but slower training and prediction. In practice, 100-500 trees is a good range.
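
To see the "diverse trees, combined vote" idea in numbers, the sketch below compares each tree inside a fitted forest with the forest as a whole. It is a self-contained illustration under assumed settings (50 trees, the lab's 70/30 split); the variable names are not from the lab's cells.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=42)
forest.fit(X_tr, y_tr)

# Number of features each tree considers per split: int(sqrt(13)) = 3
features_per_split = forest.estimators_[0].max_features_

# Accuracy of every individual tree on the test set.
# Sub-estimators predict encoded class indices, which equal the Wine labels 0, 1, 2.
tree_accs = np.array([np.mean(t.predict(X_te) == y_te) for t in forest.estimators_])
forest_acc = forest.score(X_te, y_te)

print(f"Features considered per split: {features_per_split}")
print(f"Individual trees: mean={tree_accs.mean():.4f}, "
      f"min={tree_accs.min():.4f}, max={tree_accs.max():.4f}")
print(f"Whole forest:     {forest_acc:.4f}")
```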

In [None]:
# Train Random Forest with different numbers of trees
n_trees_list = [10, 50, 100, 200, 500]
rf_results = []

print("Training Random Forests with different number of trees...\n")

for n_trees in n_trees_list:
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    
    train_acc = rf.score(X_train, y_train)
    test_acc = rf.score(X_test, y_test)
    
    rf_results.append({
        'n_trees': n_trees,
        'train_accuracy': train_acc,
        'test_accuracy': test_acc
    })
    
    print(f"n_estimators={n_trees:3d} | Train: {train_acc:.4f} | Test: {test_acc:.4f}")

rf_results_df = pd.DataFrame(rf_results)

In [None]:
# Visualize Random Forest performance vs number of trees
plt.figure(figsize=(10, 5))

plt.plot(rf_results_df['n_trees'], rf_results_df['train_accuracy'], 
         marker='o', label='Training Accuracy', linewidth=2)
plt.plot(rf_results_df['n_trees'], rf_results_df['test_accuracy'], 
         marker='s', label='Test Accuracy', linewidth=2)
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.title('Random Forest Performance vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(0.85, 1.05)
plt.show()

print("\nTypical observations (check them against the plot above):")
print("- Performance improves with more trees initially")
print("- Returns diminish after a certain point")
print("- Random Forests are resistant to overfitting due to averaging")

### Interpreting the Random Forest Learning Curve

**What this plot shows:**
- **X-axis**: Number of trees in the forest
- **Blue line**: Training accuracy (how well the forest fits the training data)
- **Orange line**: Test accuracy (how well it generalizes to new data)

**Key observations to make:**

1. **Initial improvement**: Test accuracy usually improves, or is already close to its ceiling, going from 10 to 50 trees
   - More trees = more diverse opinions = more stable ensemble decisions

2. **Diminishing returns**: The curve flattens after ~100-200 trees
   - Additional trees help less and less
   - The ensemble has already captured most patterns

3. **No overfitting!**: Notice that test accuracy doesn't decrease as we add trees
   - Unlike single decision trees, Random Forests are resistant to overfitting
   - More trees increase computational cost but don't hurt generalization

4. **Train vs Test gap**: Training accuracy is near perfect, test is lower
   - This gap is expected and acceptable
   - The gap doesn't grow as we add trees (good sign!)

**Practical takeaway**: For most problems, 100-500 trees offers the best accuracy-speed trade-off.

In [None]:
# Analyze feature importance from Random Forest
rf_final = RandomForestClassifier(n_estimators=200, random_state=42)
rf_final.fit(X_train, y_train)

# Get feature importances
feature_importance = pd.DataFrame({
    'feature': wine.feature_names,
    'importance': rf_final.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nTop 5 Most Important Features:")
print(feature_importance.head().to_string(index=False))

### Feature Importance: Understanding What Matters

One of the most valuable aspects of Random Forests is **feature importance** - which features contribute most to predictions.

**How it's calculated:**
- For each tree, we track how much each feature reduces impurity (error) when used for splitting
- Average this across all trees in the forest
- Normalize to sum to 1.0 (or 100%)

**Why this matters:**
- **Interpretability**: Understand which chemical properties most distinguish wine types
- **Feature selection**: Could we build a simpler model using only the top features?
- **Domain insights**: Does the model's ranking align with wine expert knowledge?

**What to look for in the results:**
- A few dominant features vs. evenly distributed importance
- Features with near-zero importance (candidates for removal)
- Whether top features make intuitive sense for wine classification

**Important note**: Feature importance shows correlation with predictions, not causation! High importance means the feature is useful for prediction, not that it causes the wine type.
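
The "average across trees, then normalize" calculation described above can be checked directly. This sketch refits a forest on the full Wine data (an assumption for brevity; the lab fits on the training split) and rebuilds the forest-level importances from the individual trees.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine_demo = load_wine()  # illustrative name, loaded separately from the lab's `wine`
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(wine_demo.data, wine_demo.target)

# Impurity-based importance of each feature within each tree
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])

# Average across trees, then normalize so the importances sum to 1
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

print(f"Sum of importances: {forest.feature_importances_.sum():.4f}")
print("Matches forest.feature_importances_:",
      np.allclose(averaged, forest.feature_importances_))

# Spread across trees hints at how stable each feature's ranking is
top = np.argsort(averaged)[::-1][:3]
for i in top:
    print(f"{wine_demo.feature_names[i]:<30} "
          f"mean={averaged[i]:.3f}  std across trees={per_tree[:, i].std():.3f}")
```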

## Gradient Boosted Trees (GBTs)

Unlike Random Forests which build trees independently, Gradient Boosting builds trees **sequentially**. Each tree corrects the errors of the previous trees.

We'll compare:
- **Scikit-learn GradientBoosting**
- **XGBoost** (eXtreme Gradient Boosting)
- **LightGBM** (Light Gradient Boosting Machine)
- **AdaBoost** (an earlier boosting algorithm, included for reference)

### Gradient Boosting: Learning from Mistakes

**Fundamental Difference from Random Forests:**
- **Random Forest (Bagging)**: Build all trees independently in parallel, then average
- **Gradient Boosting**: Build trees sequentially, where each new tree corrects the errors of previous trees

**How Gradient Boosting Works:**
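1. Start with a simple model (the mean for regression, or the class prior log-odds for classification)
2. Calculate the errors this model makes (residuals for squared error; in general, the negative gradient of the loss)
3. Train a new tree to predict these errors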
4. Add this new tree's predictions to the ensemble (scaled by learning rate)
5. Repeat steps 2-4 for n_estimators iterations
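
The five steps above can be sketched by hand for the simplest case: regression with squared error, where the "errors" are plain residuals. The synthetic sine data, the hyperparameter values, and the variable names are assumptions chosen for illustration. The classifiers used in this lab optimize a classification loss instead, so treat this as a simplified picture rather than their actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (made up for this example)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_estimators = 50

# Step 1: start with a constant prediction (the mean of the targets)
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]
trees = []

for _ in range(n_estimators):
    # Step 2: errors of the current ensemble
    residuals = y_toy - prediction
    # Step 3: fit a shallow tree to those errors
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    # Step 4: add the tree's output, scaled by the learning rate
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)
    mse_history.append(np.mean((y_toy - prediction) ** 2))
    # Step 5: repeat

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_estimators} trees: {mse_history[-1]:.4f}")
```

A smaller `learning_rate` shrinks each correction, which is why it usually needs more trees to reach the same training error.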

**Key Parameters:**
- **n_estimators**: Number of boosting iterations (trees to build)
  - More trees = more opportunities to reduce error
  - But can overfit if too many!
  
- **learning_rate**: How much each tree contributes (typically 0.01 to 0.3)
  - Lower = more conservative, needs more trees but often better results
  - Higher = faster learning but may overshoot optimal solution
  
- **max_depth**: Complexity of each individual tree (typically 3-6)
  - Shallow trees prevent overfitting to each iteration's errors

**Four Boosting Models We'll Test:**
1. **Scikit-learn GradientBoosting**: Standard implementation, reliable and well-documented
2. **XGBoost**: Industry standard, highly optimized, includes regularization
3. **LightGBM**: Microsoft's implementation, very fast, uses leaf-wise growth
4. **AdaBoost**: Earlier boosting algorithm that reweights misclassified samples rather than fitting gradients; simpler but still effective

**What to expect**: Boosting often achieves the highest accuracy but requires more careful tuning than Random Forests.

In [None]:
# Train different gradient boosting implementations
print("Training Gradient Boosting Models...\n")
print("="*70)

# Scikit-learn Gradient Boosting
gb_sklearn = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, 
                                        max_depth=3, random_state=42)
gb_sklearn.fit(X_train, y_train)
gb_sklearn_acc = gb_sklearn.score(X_test, y_test)
print(f"Scikit-learn GradientBoosting: {gb_sklearn_acc:.4f}")

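# XGBoost (skipped if the optional import failed in the setup cell)
if xgb is not None:
    xgb_model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, 
                                  max_depth=3, random_state=42, eval_metric='mlogloss')
    xgb_model.fit(X_train, y_train)
    xgb_acc = xgb_model.score(X_test, y_test)
    print(f"XGBoost:                       {xgb_acc:.4f}")
else:
    xgb_acc = np.nan  # placeholder so the comparison cell below still runs
    print("XGBoost:                       skipped (not installed)")

# LightGBM (skipped if the optional import failed in the setup cell)
if lgb is not None:
    lgb_model = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1, 
                                   max_depth=3, random_state=42, verbose=-1)
    lgb_model.fit(X_train, y_train)
    lgb_acc = lgb_model.score(X_test, y_test)
    print(f"LightGBM:                      {lgb_acc:.4f}")
else:
    lgb_acc = np.nan  # placeholder so the comparison cell below still runs
    print("LightGBM:                      skipped (not installed)")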

# AdaBoost (another boosting variant)
ada_model = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
ada_model.fit(X_train, y_train)
ada_acc = ada_model.score(X_test, y_test)
print(f"AdaBoost:                      {ada_acc:.4f}")

print("="*70)

In [None]:
# Compare Random Forest vs Gradient Boosting
comparison_data = pd.DataFrame({
    'Model': ['Random Forest', 'Sklearn GB', 'XGBoost', 'LightGBM', 'AdaBoost'],
    'Accuracy': [rf_final.score(X_test, y_test), gb_sklearn_acc, xgb_acc, lgb_acc, ada_acc],
    'Type': ['Bagging', 'Boosting', 'Boosting', 'Boosting', 'Boosting']
})

plt.figure(figsize=(10, 5))
colors = ['blue' if t == 'Bagging' else 'orange' for t in comparison_data['Type']]
plt.barh(comparison_data['Model'], comparison_data['Accuracy'], color=colors)
plt.xlabel('Test Accuracy')
plt.title('Random Forest (Bagging) vs Gradient Boosting Methods')
plt.xlim(0.85, 1.0)
plt.axvline(x=0.95, color='red', linestyle='--', alpha=0.5, label='95% threshold')
plt.legend()
plt.tight_layout()
plt.show()

print("\nModel Comparison:")
print(comparison_data.to_string(index=False))

### Bagging vs Boosting: Visual Comparison

This chart directly compares the two major ensemble paradigms:

**Blue (Bagging - Random Forest):**
- Builds trees independently
- Reduces variance by averaging
- Parallel training (faster on multi-core systems)
- Very resistant to overfitting
- Good general-purpose choice

**Orange (Boosting - GB/XGBoost/LightGBM/AdaBoost):**
- Builds trees sequentially
- Reduces bias by focusing on errors
- Sequential training (slower to train)
- Can overfit if not careful
- Often achieves highest accuracy with proper tuning

**Questions to consider:**
- Which approach performs better on this dataset?
- Are the differences significant or marginal?
- Would the ranking change with different hyperparameters?
- Which would you choose for a production system and why?

**The 95% threshold line**: This reference helps visualize which models meet a high performance standard.

# 3. Understanding Bootstrap Aggregation (Bagging)

Let's dive deeper into how **Bootstrap Aggregation** works:

1. Create multiple bootstrap samples (random sampling with replacement)
2. Train a model on each bootstrap sample
3. Aggregate predictions (majority vote for classification, average for regression)

We'll demonstrate this manually and compare it to scikit-learn's BaggingClassifier.
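
A minimal hand-rolled version of those three steps might look like the sketch below. The number of trees (25), the random generator seed, and the variable names are assumptions for illustration. It reloads the data and repeats the 70/30 split so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

rng = np.random.default_rng(42)
n_trees = 25
n_classes = len(np.unique(y_tr))

all_preds = []
for i in range(n_trees):
    # Step 1: bootstrap sample (same size as the training set, with replacement)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: train one model on that sample
    tree = DecisionTreeClassifier(random_state=i).fit(X_tr[idx], y_tr[idx])
    all_preds.append(tree.predict(X_te))

# Step 3: aggregate by majority vote, one column per test sample
votes = np.stack(all_preds)  # shape: (n_trees, n_test_samples)
majority = np.array(
    [np.bincount(col, minlength=n_classes).argmax() for col in votes.T]
)

manual_acc = np.mean(majority == y_te)
print(f"Manual bagging accuracy ({n_trees} trees): {manual_acc:.4f}")
```

`BaggingClassifier` packages the same loop. Note that it averages predicted probabilities when the base estimator provides them, so its results can differ slightly from a pure majority vote.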

### The Magic of Bootstrap Sampling

Let's understand the foundation of bagging by examining **bootstrap sampling** in detail.

**What is Bootstrap Sampling?**
- Random sampling **with replacement** from the training data
- Each bootstrap sample has the same size as the original dataset
- Some samples appear multiple times, others don't appear at all

**The Mathematics:**
- Probability a specific sample is NOT selected in one draw: (n-1)/n
- Probability it's NOT selected in n draws: ((n-1)/n)^n
- As n gets large, this approaches 1/e ≈ 0.368 (36.8%)
- Therefore, ~63.2% of samples are included (with possible repeats)
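
The limit can be checked numerically. This short sketch evaluates ((n-1)/n)^n for a few illustrative values of n and compares each with 1/e.

```python
import math

limit = math.exp(-1)  # 1/e, roughly 0.3679

# Probability that a given sample is never drawn in n draws with replacement
p_oob_by_n = {}
for n in (10, 100, 1000, 10000):
    p_oob_by_n[n] = ((n - 1) / n) ** n
    print(f"n={n:>5}: P(not selected)={p_oob_by_n[n]:.4f}  "
          f"P(selected at least once)={1 - p_oob_by_n[n]:.4f}")

print(f"Limit 1/e = {limit:.4f}")
```

Even for modest n the value is already close to the limit, which is why the 63%/37% split appears on small training sets too.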

**The Out-of-Bag (OOB) Samples:**
- The ~36.8% of samples NOT selected are called "out-of-bag"
- These are like a free validation set for each tree!
- Each tree can be evaluated on its OOB samples
- Aggregating each sample's predictions from only the trees that never saw it gives a performance estimate comparable to cross-validation, without a separate validation set
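
In scikit-learn, the OOB estimate is available by passing `oob_score=True`. The sketch below shows it on a Random Forest and prints the held-out test accuracy next to it for comparison. The split and variable names are illustrative assumptions rather than the lab's own cell.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# oob_score=True scores each training sample using only the trees
# whose bootstrap sample did not contain it
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X_tr, y_tr)

oob_acc = forest.oob_score_
test_acc = forest.score(X_te, y_te)

print(f"OOB accuracy (training data only): {oob_acc:.4f}")
print(f"Test accuracy (held-out data):     {test_acc:.4f}")
```

`BaggingClassifier` accepts the same `oob_score=True` argument when bootstrapping is enabled.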

**Why This Matters:**
- Creates diverse training sets → diverse models → better ensemble
- OOB samples provide validation without reducing training data
- This comes for free with bagging; standard boosting trains every tree on the full training set, so it has no out-of-bag samples unless rows are subsampled

**Watch the percentages** in the output closely - you'll see they consistently hover around 63% unique samples and 37% OOB!

In [None]:
# Demonstrate bootstrap sampling
n_samples = len(X_train)
n_bootstrap = 3

print("Demonstrating Bootstrap Sampling:\n")
print(f"Original training set size: {n_samples}")
print(f"\nCreating {n_bootstrap} bootstrap samples...\n")

for i in range(n_bootstrap):
    # Create bootstrap sample (sampling with replacement)
    indices = np.random.choice(n_samples, size=n_samples, replace=True)
    unique_indices = len(np.unique(indices))
    
    # Out-of-bag samples (samples not selected)
    oob_indices = set(range(n_samples)) - set(indices)
    
    print(f"Bootstrap Sample {i+1}:")
    print(f"  Total samples: {len(indices)}")
    print(f"  Unique samples: {unique_indices} ({unique_indices/n_samples*100:.1f}%)")
    print(f"  Out-of-bag samples: {len(oob_indices)} ({len(oob_indices)/n_samples*100:.1f}%)")
    print()

print("Key Insight: Each bootstrap sample uses ~63.2% unique samples")
print("The remaining ~36.8% are out-of-bag (OOB) samples used for validation")

In [None]:
# Compare base model vs Bagging ensemble
print("\nComparing Single Model vs Bagging Ensemble:\n")
print("="*70)

# Single Decision Tree
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)
single_tree_acc = single_tree.score(X_test, y_test)
print(f"Single Decision Tree: {single_tree_acc:.4f}")

# Bagging with different numbers of estimators
n_estimators_list = [10, 50, 100, 200]
bagging_results = []

for n_est in n_estimators_list:
    bagging = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=n_est,
        random_state=42,
        n_jobs=-1
    )
    bagging.fit(X_train, y_train)
    bagging_acc = bagging.score(X_test, y_test)
    bagging_results.append({'n_estimators': n_est, 'accuracy': bagging_acc})
    print(f"Bagging ({n_est:3d} trees): {bagging_acc:.4f}")

print("="*70)

improvement = bagging_results[-1]['accuracy'] - single_tree_acc
print(f"\nImprovement from Bagging: {improvement:.4f} ({improvement/single_tree_acc*100:.1f}%)")

### Bagging in Action: From One Tree to Many

This experiment demonstrates the core value proposition of bagging.

**The Setup:**
- **Baseline**: A single Decision Tree (known for high variance and overfitting)
- **Bagging Ensembles**: Multiple decision trees trained on bootstrap samples (10, 50, 100, 200 trees)

**What We're Testing:**
- Does combining multiple unstable models create a stable ensemble?
- How many trees are needed for good performance?
- What's the improvement over a single tree?

**Decision Trees: The Perfect Base Learner for Bagging**
- **High variance**: Different training data → very different trees
- **Low bias**: Can fit complex patterns (even overfit)
- **Bagging fixes the variance problem** by averaging predictions!

**Expected Results** (typical pattern; exact numbers depend on the data split):
- Single tree: Decent performance but unstable
- Bagging (10 trees): Noticeable improvement
- Bagging (50-100 trees): Substantial improvement, plateau
- Bagging (200 trees): Marginal additional gains

**The key insight**: Many high-variance but diverse models can be averaged into a stable, accurate ensemble!

In [None]:
# Visualize the effect of ensemble size
bagging_df = pd.DataFrame(bagging_results)

plt.figure(figsize=(10, 5))
plt.plot(bagging_df['n_estimators'], bagging_df['accuracy'], 
         marker='o', linewidth=2, markersize=8, label='Bagging')
plt.axhline(y=single_tree_acc, color='red', linestyle='--', 
            label='Single Tree', linewidth=2)
plt.xlabel('Number of Estimators')
plt.ylabel('Test Accuracy')
plt.title('Bagging Performance vs Ensemble Size')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Visualizing Bagging's Impact

**The Story This Plot Tells:**

**Red dashed line (baseline)**: Performance of a single decision tree
- This is our starting point
- Prone to overfitting and high variance

**Blue line (Bagging curve)**: Accuracy as we add more trees
- Sharp initial rise: Adding the first few trees has huge impact
- Gradual plateau: Returns diminish after ~50-100 trees
- Stays well above baseline: Clear, consistent improvement

**Key Observations:**
1. Even 10 trees provide substantial improvement
2. The curve should flatten rather than decline: adding trees does not make bagging overfit, although test accuracy can wobble by a sample or two on a small test set
3. The gap between bagging and single tree shows the power of ensembling
4. There's a sweet spot around 100 trees (good performance without excessive computation)

**Why the plateau?**
- After enough trees, you've already captured the diversity in bootstrap samples
- Additional trees are learning similar patterns
- More trees still help a little (why Random Forests use 100-500), but diminishing returns set in
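
One way to see the plateau: if each tree's prediction has variance σ² and any two trees have correlation ρ, the variance of their average is ρσ² + (1 - ρ)σ²/n. The second term shrinks as trees are added, but the first does not. A small sketch under that idealized equal-variance, equal-correlation assumption; the values σ² = 1 and ρ = 0.3 are illustrative, not measured from this dataset:

```python
def ensemble_variance(sigma2: float, rho: float, n_trees: int) -> float:
    """Variance of the mean of n_trees equally correlated predictions."""
    # rho * sigma2 is the floor that averaging cannot remove;
    # (1 - rho) * sigma2 / n_trees is the part that shrinks with more trees.
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees

for n_trees in (1, 10, 50, 100, 200):
    var = ensemble_variance(sigma2=1.0, rho=0.3, n_trees=n_trees)
    print(f"{n_trees:>3} trees -> variance of the average: {var:.4f}")
```

Going from 100 to 200 trees barely moves the result because the correlated part dominates. Random Forests attack that floor by also randomizing features, which lowers the correlation between trees.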

**Practical lesson**: Even simple bagging with modest ensemble sizes (50-100) can dramatically improve model performance!

### Out-of-Bag (OOB) Evaluation

One of the benefits of bagging is **OOB evaluation** - we can estimate model performance without a separate validation set using the samples that weren't selected in each bootstrap.

In [None]:
# Demonstrate OOB scoring
bagging_oob = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
bagging_oob.fit(X_train, y_train)

print("Out-of-Bag Evaluation:")
print("="*70)
print(f"OOB Score (internal validation): {bagging_oob.oob_score_:.4f}")
print(f"Test Score:                      {bagging_oob.score(X_test, y_test):.4f}")
print("\nThe OOB score estimates held-out accuracy without needing a separate")
print("validation set, saving data for training!")
print("="*70)

### Out-of-Bag Scoring: Free Validation!

This cell demonstrates one of bagging's most elegant features: **built-in validation without using any extra data**.

**How OOB Scoring Works:**
1. For each tree in the ensemble, ~37% of training samples were not used (out-of-bag)
2. Use each tree to predict on its OOB samples
3. For each training sample, average predictions from all trees where it was OOB
4. Compare OOB predictions to true labels → OOB score
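
The four steps above can be reproduced by hand. The sketch below is self-contained and uses its own variable names (`X`, `y`, `bag`) on scikit-learn's `load_wine` data rather than the notebook's train/test split; it relies on the fitted ensemble's `estimators_` and `estimators_samples_` attributes and should be read as an approximation of what `oob_score=True` does internally, not as the library's exact code path:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# The base estimator is passed positionally so the sketch does not depend on
# the keyword name, which differs across scikit-learn versions.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        oob_score=True, random_state=42)
bag.fit(X, y)

n_samples, n_classes = X.shape[0], len(bag.classes_)
proba_sum = np.zeros((n_samples, n_classes))

# Steps 1-3: each tree contributes only to the samples it did not train on.
# Summing probabilities gives the same argmax as averaging them.
for tree, in_bag in zip(bag.estimators_, bag.estimators_samples_):
    oob = np.setdiff1d(np.arange(n_samples), in_bag)
    cols = tree.classes_.astype(int)  # class columns this tree can predict
    proba_sum[np.ix_(oob, cols)] += tree.predict_proba(X[oob])

# Step 4: compare the aggregated OOB prediction with the true label.
oob_pred = bag.classes_[proba_sum.argmax(axis=1)]
manual_oob = np.mean(oob_pred == y)

print(f"Manual OOB accuracy: {manual_oob:.4f}")
print(f"sklearn oob_score_:  {bag.oob_score_:.4f}")
```

The two numbers should agree closely, since both score every sample using only the trees that never saw it.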

**Why This is Powerful:**
- **No data loss**: Don't need to set aside a separate validation set
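- **Nearly unbiased estimate**: Each sample is scored only by trees that never saw it (the estimate tends to be slightly pessimistic, because only about a third of the trees vote on any given sample)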
- **Free cross-validation**: Similar to cross-validation but happens automatically
- **Particularly valuable for small datasets** where every training sample counts

**Comparing OOB vs Test Score:**
- OOB score: Estimated from internal validation during training
- Test score: True held-out performance
- They should be close! If OOB >> Test, the test set may be unrepresentative or simply too small
- If OOB ≈ Test, that is a good sign the performance estimate can be trusted

**When to use OOB scoring:**
- Quick model evaluation during development
- Hyperparameter tuning without extra validation split
- Confidence in generalization before final testing

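**Important**: `oob_score` is specific to bootstrap-based ensembles (`BaggingClassifier`, `RandomForestClassifier`). Standard boosting has no equivalent, although scikit-learn's gradient boosting exposes a related `oob_improvement_` attribute when `subsample < 1.0`.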

# 4. Combining Heterogeneous Models (Stacking)

**Stacking** is an advanced ensemble technique that combines different types of models:

1. Train multiple diverse base models (level 0)
2. Use their predictions as features for a meta-model (level 1)
3. The meta-model learns how to best combine the base model predictions

This is different from voting, which uses a fixed combination rule.

### Stacking: The Ultimate Ensemble

**What Makes Stacking Different?**
- **Voting**: Fixed combination rule (majority vote or average probabilities)
- **Stacking**: Learns the optimal way to combine model predictions using a meta-model

**How Stacking Works:**

**Level 0 (Base Models):**
1. Train diverse base models on the training data
2. Each uses a different algorithm with different strengths
   - Random Forest: Handles non-linearity and interactions
   - Gradient Boosting: Reduces bias, high accuracy
   - SVM: Finds optimal decision boundaries
   - KNN: Captures local patterns

**Level 1 (Meta-Model):**
1. Use base model predictions as features
2. Train a meta-model (here, Logistic Regression) to learn optimal combination
3. The meta-model discovers which base models to trust for which types of predictions

**The Cross-Validation Trick (cv=5):**
- To avoid overfitting, we don't use simple training predictions as meta-features
- Instead, we use 5-fold cross-validation:
  - Split training data into 5 parts
  - For each part, train base models on other 4 parts, predict on this part
  - This ensures meta-features are from "unseen" predictions
  - Prevents the meta-model from just memorizing training data
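
The cross-validation trick can be sketched by hand with `cross_val_predict`, which is roughly what `StackingClassifier(cv=5)` does for you. The example below is self-contained, uses scikit-learn's `load_wine` data, and uses only two illustrative base models; the variable names are local to the sketch:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Two illustrative base models (a subset of the four used in this lab).
base_models = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=42),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# For every sample, the probabilities come from a model fit on the other
# four folds, so no base model predicts a sample it was trained on.
oof_blocks = [
    cross_val_predict(model, X, y, cv=5, method="predict_proba")
    for model in base_models.values()
]
meta_features = np.hstack(oof_blocks)  # (n_samples, n_models * n_classes)

# The meta-model is trained on the out-of-fold predictions only.
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_features, y)

print("Meta-feature matrix shape:", meta_features.shape)
```

Each row of `meta_features` holds one block of class probabilities per base model, which is exactly the input the level-1 model learns to weigh.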

**Why Choose Stacking?**
- ✅ Often achieves best performance by combining strengths of diverse models
- ✅ Meta-model learns complex combination rules (not just averaging)
- ✅ Can weight models differently for different scenarios
- ❌ More complex to implement and tune
- ❌ Longer training time (with `cv=5`, each base model is fit on every fold and once more on the full training set, plus the meta-model)
- ❌ Less interpretable than simpler ensembles

**When to use Stacking:**
- When you need maximum accuracy and have compute resources
- When you have diverse base models that capture different aspects of data
- In competitions or high-stakes applications

**Expected Result:** Stacking usually matches or exceeds the best single base model and often beats simple voting, though this is not guaranteed and small gaps can be within noise on a small test set.

In [None]:
# Define base models (diverse set of algorithms)
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
]

# Define meta-model (final estimator)
meta_model = LogisticRegression(max_iter=1000, random_state=42)

# Create stacking classifier
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5  # Use cross-validation to generate meta-features
)

print("Training Stacking Ensemble...\n")
stacking.fit(X_train_scaled, y_train)
stacking_acc = stacking.score(X_test_scaled, y_test)

print("Stacking Results:")
print("="*70)
print("\nBase Models Performance:")
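# StackingClassifier fits clones of the estimators, so the originals in
# base_models are still unfitted and must be trained here before scoring
for name, model in base_models: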
    model.fit(X_train_scaled, y_train)
    acc = model.score(X_test_scaled, y_test)
    print(f"  {name:3s}: {acc:.4f}")

print(f"\nStacking Ensemble: {stacking_acc:.4f}")
print("="*70)

In [None]:
# Compare different ensemble strategies
ensemble_comparison = pd.DataFrame({
    'Method': ['Voting (Hard)', 'Voting (Soft)', 'Bagging', 'Random Forest',
               'XGBoost', 'Stacking'],
    'Accuracy': [
        acc_hard,
        acc_soft,
        bagging_oob.score(X_test, y_test),
        rf_final.score(X_test, y_test),
        xgb_acc,
        stacking_acc
    ],
    'Strategy': ['Voting', 'Voting', 'Bagging', 'Bagging', 'Boosting', 'Stacking']
})

plt.figure(figsize=(12, 6))
colors = {'Voting': 'skyblue', 'Bagging': 'lightgreen', 
          'Boosting': 'orange', 'Stacking': 'purple'}
bar_colors = [colors[s] for s in ensemble_comparison['Strategy']]

plt.barh(ensemble_comparison['Method'], ensemble_comparison['Accuracy'], color=bar_colors)
plt.xlabel('Test Accuracy')
plt.title('Comparison of Different Ensemble Strategies')
plt.xlim(0.85, 1.0)

# Add legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=color, label=strategy) 
                   for strategy, color in colors.items()]
plt.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.show()

print("\nEnsemble Method Comparison:")
print(ensemble_comparison.sort_values('Accuracy', ascending=False).to_string(index=False))

### Comparing All Ensemble Strategies

This comprehensive comparison brings together everything we've learned!

**The Four Ensemble Strategies (Color-Coded):**

**Sky Blue - Voting Ensembles:**
- Combines heterogeneous models (different algorithms)
- Simple, interpretable, easy to implement
- Soft voting usually beats hard voting

**Light Green - Bagging Ensembles:**
- Combines homogeneous models (same algorithm, different data)
- Random Forest adds feature randomness to bagging
- Great for reducing variance of high-variance models

**Orange - Boosting Ensembles:**
- Sequential training, each model corrects previous errors
- Often achieves highest accuracy
- Requires more careful tuning

**Purple - Stacking:**
- Meta-learning approach
- Learns optimal combination weights
- Most sophisticated but most complex

**What to Look For:**
- Which strategy achieved highest accuracy?
- Is the difference significant or marginal?
- Is the complexity/performance trade-off worth it?
- Which would you deploy in production considering interpretability, speed, and accuracy?

**Key Insight**: On this dataset, you'll likely see that all ensemble methods substantially outperform individual models (from Section 1), validating the ensemble approach!

# 5. Evaluating Ensembles of Methods

Let's perform a comprehensive evaluation of our best ensemble models using multiple metrics:
- Accuracy
- Precision, Recall, F1-score
- Confusion Matrix
- Cross-validation scores

### Detailed Performance Analysis

Now let's go beyond simple accuracy and examine our best models using multiple evaluation metrics.

**Why Accuracy Alone Isn't Enough:**
- **Class imbalance**: If 90% of samples are class A, predicting "always A" gives 90% accuracy but is useless
- **Different costs**: Misclassifying wine type might have different consequences for each class
- **Per-class performance**: A model might excel at some classes but fail at others

**The Classification Report Includes:**

**Precision**: Of all samples predicted as class X, what fraction actually were class X?
- Precision = True Positives / (True Positives + False Positives)
- High precision = Few false alarms
- Important when false positives are costly

**Recall (Sensitivity)**: Of all actual class X samples, what fraction did we correctly identify?
- Recall = True Positives / (True Positives + False Negatives)
- High recall = Few missed cases
- Important when false negatives are costly

**F1-Score**: Harmonic mean of precision and recall
- F1 = 2 × (Precision × Recall) / (Precision + Recall)
- Balances precision and recall
- Useful single metric when you care about both

**Support**: Number of samples in each class
- Helps you weight the importance of each class's metrics
- Small support → less reliable metrics for that class
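
To make the formulas concrete, the sketch below derives precision, recall, F1, and support directly from a confusion matrix and cross-checks them against scikit-learn. The labels are toy values invented for illustration, not results from this lab:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy three-class labels, invented purely for illustration.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 1, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred)   # rows = true class, columns = predicted
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp                # predicted as the class, but wrong
fn = cm.sum(axis=1) - tp                # belonged to the class, but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)

for k in range(len(tp)):
    print(f"class {k}: precision={precision[k]:.2f} recall={recall[k]:.2f} "
          f"f1={f1[k]:.2f} support={support[k]}")

# Cross-check against scikit-learn's own implementation.
p_ref, r_ref, f_ref, s_ref = precision_recall_fscore_support(y_true, y_pred)
print("Matches scikit-learn:", np.allclose(precision, p_ref) and np.allclose(recall, r_ref))
```

The per-class numbers are the same ones `classification_report` prints, just computed by hand.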

**What to Look For:**
- Are precision and recall balanced, or does the model favor one?
- Which classes are easier/harder to predict?
- Do all models struggle with the same classes?
- How do ensemble methods compare on per-class metrics?

In [None]:
# Select best models for detailed evaluation
best_models = {
    'Random Forest': rf_final,
    'XGBoost': xgb_model,
    'Stacking': stacking,
    'Voting (Soft)': voting_soft
}

# Generate predictions for each model
print("Detailed Classification Reports:\n")
print("="*70)

for name, model in best_models.items():
    print(f"\n{name}:")
    print("-"*70)
    
    # Use scaled or unscaled data based on model type
    if name in ['Stacking', 'Voting (Soft)']:
        y_pred = model.predict(X_test_scaled)
    else:
        y_pred = model.predict(X_test)
    
    print(classification_report(y_test, y_pred, target_names=wine.target_names))

In [None]:
# Visualize confusion matrices
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, (name, model) in enumerate(best_models.items()):
    # Use scaled or unscaled data based on model type
    if name in ['Stacking', 'Voting (Soft)']:
        y_pred = model.predict(X_test_scaled)
    else:
        y_pred = model.predict(X_test)
    
    cm = confusion_matrix(y_test, y_pred)
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx],
                xticklabels=wine.target_names,
                yticklabels=wine.target_names)
    axes[idx].set_title(f'{name} Confusion Matrix')
    axes[idx].set_ylabel('True Label')
    axes[idx].set_xlabel('Predicted Label')

plt.tight_layout()
plt.show()

### Reading Confusion Matrices

Confusion matrices show the complete picture of model predictions vs. reality.

**How to Read a Confusion Matrix:**
- **Rows**: True labels (actual wine class)
- **Columns**: Predicted labels (what the model said)
- **Diagonal (top-left to bottom-right)**: Correct predictions
- **Off-diagonal**: Mistakes

**Example Interpretation:**
If cell [row=1, col=2] = 3, it means:
- 3 samples that were actually class 1
- Were incorrectly predicted as class 2
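
The same reading can be checked in code. In this toy sketch (labels invented so that exactly three class-1 samples are predicted as class 2), the cell is addressed as `cm[true_label, predicted_label]`:

```python
from sklearn.metrics import confusion_matrix

# Toy labels chosen so that three class-1 samples are predicted as class 2.
y_true = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 2, 2]

cm = confusion_matrix(y_true, y_pred)  # rows = true label, columns = predicted
print(cm)
print("True class 1 predicted as class 2:", cm[1, 2])
print("Correct predictions (diagonal sum):", cm.trace())
```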

**What Makes a Good Confusion Matrix:**
- **Dark diagonal**: Most predictions on the diagonal (correct)
- **Light off-diagonal**: Few mistakes (off-diagonal cells)
- **Symmetric mistakes**: If the model confuses class A with B, does it also confuse B with A?

**Patterns to Look For:**
- Which classes are most confused with each other?
- Are errors symmetric or directional?
- Does the ensemble reduce specific types of errors?
- Compare matrices across models - do they make different mistakes?

**Wine-Specific Insight**: If certain wine types are chemically similar, you'd expect more confusion between them. The confusion matrix reveals these relationships!

In [None]:
# Cross-validation comparison
print("\nCross-Validation Performance Comparison:\n")
print("="*70)

cv_results = []

for name, model in best_models.items():
    # Use scaled or unscaled data based on model type
    if name in ['Stacking', 'Voting (Soft)']:
        X_cv = X_train_scaled
    else:
        X_cv = X_train
    
    scores = cross_val_score(model, X_cv, y_train, cv=5, scoring='accuracy')
    cv_results.append({
        'Model': name,
        'Mean CV Score': scores.mean(),
        'Std CV Score': scores.std(),
        'Min Score': scores.min(),
        'Max Score': scores.max()
    })
    
    print(f"{name}:")
    print(f"  Mean: {scores.mean():.4f} (+/- {scores.std():.4f})")
    print(f"  Range: [{scores.min():.4f}, {scores.max():.4f}]")
    print()

cv_results_df = pd.DataFrame(cv_results)
print("="*70)

### Cross-Validation: Measuring Reliability

A single test set only shows one snapshot of performance. Cross-validation gives us a more complete picture.

**What This Cell Does:**
- Performs 5-fold cross-validation on each of our best models
- Reports mean score, standard deviation, minimum, and maximum across folds

**Understanding the Metrics:**

**Mean CV Score**: Average performance across all 5 folds
- This is your best estimate of expected performance
- More reliable than a single test score

**Standard Deviation (Std)**: Variability across folds
- **Low std**: Model is stable, consistent predictions regardless of data split
- **High std**: Model is sensitive to training data, less reliable
- Example: Mean=0.95, Std=0.01 is often preferable to Mean=0.96, Std=0.05, because the small gain in mean comes with much less predictable performance

**Min and Max Scores**: Range of performance
- Shows best-case and worst-case scenarios
- Large range (high max-min) suggests instability
- Helps identify if model got lucky or is genuinely good

**Why This Matters:**
- **Model selection**: Choose models with good mean AND low variance
- **Trust**: A model with lower mean but higher stability might be preferable for production
- **Understanding limitations**: Know the range of expected performance

**What Ensembles Should Show:**
- Generally lower variance than individual models
- This is one of the key benefits of ensembling - more stable predictions!

In [None]:
# Visualize CV performance with error bars
plt.figure(figsize=(10, 6))
plt.errorbar(cv_results_df['Mean CV Score'], cv_results_df['Model'],
             xerr=cv_results_df['Std CV Score'], fmt='o', markersize=10,
             capsize=5, capthick=2, linewidth=2)
plt.xlabel('Cross-Validation Score')
plt.title('Cross-Validation Performance Comparison (with Standard Deviation)')
plt.xlim(0.90, 1.0)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Visualizing Model Stability

This error bar plot reveals model consistency at a glance.

**How to Interpret:**
- **Center point (circle)**: Mean cross-validation score
- **Horizontal error bars**: Standard deviation (uncertainty)
- **Left edge of bar**: Mean minus one standard deviation (not the worst case; see Min Score for that)
- **Right edge of bar**: Mean plus one standard deviation (not the best case; see Max Score for that)

**Comparing Models:**

**Ideal model**: Far right (high mean) with short bars (low variance)
- High performance AND consistent

**Risky model**: Far right but with long bars
- Sometimes great, sometimes mediocre
- Unreliable for production use

**Stable but limited model**: Middle position with short bars
- Predictable but not exceptional
- Might be acceptable if consistency matters more than peak performance

**What to Look For:**
- Do any models have bars that don't overlap? That suggests a meaningful performance difference, although with only 5 folds it is not a formal significance test
- Which model has the shortest bars (most stable)?
- Is there a model that's both high-performing AND stable?

**The Ensemble Advantage**: Typically, ensemble methods should show tighter error bars than individual models - they're more robust to data variations!

## Final Performance Summary

Let's create a comprehensive summary of all ensemble methods we've explored.

### The Grand Finale: Complete Performance Summary

Let's bring it all together and see the complete picture of what we've learned!

**What the Summary Shows:**
- **Every method** we've explored, from baseline to advanced ensembles
- Sorted by test accuracy (best at top)
- Categorized by ensemble strategy

**The Complete Journey:**
1. **Baseline**: Single decision tree - our starting point
2. **Voting**: Simple combination of diverse models
3. **Bagging**: Bootstrap aggregation with decision trees
4. **Boosting**: Sequential error correction
5. **Stacking**: Meta-learning optimal combinations

**Questions to Reflect On:**

**Performance:**
- How much improvement did we gain from ensembling?
- Which strategy worked best for this dataset?
- Are the top performers significantly better, or clustered close together?

**Complexity vs. Accuracy:**
- Is the best model worth its complexity?
- Could a simpler model (Random Forest) be "good enough"?
- What's the performance/interpretability trade-off?

**Practical Considerations:**
- Training time: Stacking > Boosting > Bagging > Voting > Single model
- Prediction speed: Single model > Voting ≈ Bagging > Boosting > Stacking
- Interpretability: Single model > Random Forest > other ensembles
- Robustness: Ensembles > Single model

**Real-World Decision Making:**
If you were deploying this in production, which model would you choose and why? Consider:
- Required accuracy level
- Computational budget
- Need for interpretability
- Latency requirements
- Maintenance complexity

In [None]:
# Create final summary table
final_summary = pd.DataFrame({
    'Ensemble Method': [
        'Single Decision Tree (Baseline)',
        'Voting - Hard',
        'Voting - Soft',
        'Bagging (100 trees)',
        'Random Forest (200 trees)',
        'Gradient Boosting (sklearn)',
        'XGBoost',
        'LightGBM',
        'AdaBoost',
        'Stacking'
    ],
    'Test Accuracy': [
        single_tree_acc,
        acc_hard,
        acc_soft,
        bagging_oob.score(X_test, y_test),
        rf_final.score(X_test, y_test),
        gb_sklearn_acc,
        xgb_acc,
        lgb_acc,
        ada_acc,
        stacking_acc
    ],
    'Category': [
        'Baseline',
        'Voting',
        'Voting',
        'Bagging',
        'Bagging',
        'Boosting',
        'Boosting',
        'Boosting',
        'Boosting',
        'Stacking'
    ]
})

final_summary = final_summary.sort_values('Test Accuracy', ascending=False)

print("\n" + "="*80)
print("FINAL PERFORMANCE SUMMARY")
print("="*80)
print(final_summary.to_string(index=False))
print("="*80)

# Calculate improvement over baseline
best_model = final_summary.iloc[0]
improvement = (best_model['Test Accuracy'] - single_tree_acc) / single_tree_acc * 100
print(f"\nBest Model: {best_model['Ensemble Method']}")
print(f"Improvement over baseline: {improvement:.2f}%")

### Creating the Final Summary Table

This cell compiles all results into one comprehensive comparison table and calculates the total improvement achieved through ensemble methods.

In [None]:
# Final visualization
plt.figure(figsize=(14, 8))

category_colors = {
    'Baseline': 'gray',
    'Voting': 'skyblue',
    'Bagging': 'lightgreen',
    'Boosting': 'orange',
    'Stacking': 'purple'
}

colors = [category_colors[cat] for cat in final_summary['Category']]

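plt.barh(final_summary['Ensemble Method'], final_summary['Test Accuracy'], color=colors)
plt.gca().invert_yaxis()  # barh draws the first row at the bottom; flip so the best model is on top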
plt.xlabel('Test Accuracy', fontsize=12)
plt.title('Comprehensive Ensemble Methods Performance Comparison', fontsize=14, fontweight='bold')
plt.xlim(0.85, 1.0)
plt.axvline(x=single_tree_acc, color='red', linestyle='--', linewidth=2, label='Baseline', alpha=0.7)

# Add value labels on bars
for idx, row in final_summary.iterrows():
    plt.text(row['Test Accuracy'], row['Ensemble Method'], 
             f" {row['Test Accuracy']:.4f}", 
             va='center', fontsize=9)

# Add legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=color, label=category) 
                   for category, color in category_colors.items()]
plt.legend(handles=legend_elements, loc='lower right', fontsize=10)

plt.tight_layout()
plt.show()

### Final Visual Comparison

This comprehensive bar chart is your visual summary of everything we've learned!

**Color Coding (The Ensemble Taxonomy):**
- **Gray**: Baseline (where we started)
- **Sky Blue**: Voting ensembles (simple combinations)
- **Light Green**: Bagging ensembles (bootstrap aggregation)
- **Orange**: Boosting ensembles (sequential error correction)
- **Purple**: Stacking (meta-learning)

**Reading the Chart:**
- **Bar length**: Test accuracy (longer = better)
- **Red dashed line**: Baseline performance (single decision tree)
- **Numeric labels**: Exact accuracy values
- **Vertical ordering**: Best performer at top

**The Story It Tells:**
1. **Baseline gap**: Notice how far all ensembles are from the baseline
2. **Within-category comparison**: How do different boosting methods compare?
3. **Across-category comparison**: Does one ensemble strategy dominate?
4. **Practical significance**: Are small differences meaningful, or just noise?

**Key Insights to Extract:**
- **Ensemble benefit**: Most, ideally all, ensemble methods should beat the baseline; note any that do not
- **Method comparison**: Which category performs best overall?
- **Diminishing returns**: If the top models cluster together, is the extra complexity worth the marginal gains?
- **Consistency**: Check whether methods in the same category perform similarly

**Your Takeaway:**
Based on this dataset, which ensemble approach would you recommend for a new project, and why? Consider not just accuracy, but the full context of implementation and deployment.

# Summary and Key Takeaways

Congratulations! You've completed a comprehensive journey through ensemble learning. Let's consolidate what you've learned.

## Ensemble Methods Overview

### 1. **Using Multiple Models Together (Voting)**
**What you learned:**
- Combining diverse models often performs better than any individual model
- Different algorithms make different mistakes; when combined, uncorrelated errors tend to cancel out
- **Hard voting**: Simple majority vote (each model gets one vote)
- **Soft voting**: Averages the predicted class probabilities, so confident predictions carry more weight (usually better when the probabilities are reasonably calibrated)
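
Hard and soft voting can disagree on the same sample. A minimal sketch with three hypothetical models; the probabilities are invented for illustration:

```python
import numpy as np

# Class probabilities from three hypothetical models for one sample
# (columns: class 0, class 1). The numbers are invented for illustration.
probas = np.array([
    [0.90, 0.10],   # model A: very confident in class 0
    [0.40, 0.60],   # model B: leans towards class 1
    [0.45, 0.55],   # model C: leans towards class 1
])

# Hard voting: each model casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the largest.
soft_prediction = probas.mean(axis=0).argmax()

print("Hard voting picks class", hard_prediction)
print("Soft voting picks class", soft_prediction)
```

Two lukewarm votes outnumber one confident vote under hard voting, while soft voting lets model A's confidence tip the average towards class 0.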

**When to use:** Quick improvement over single models, you already have diverse trained models

**Key insight:** The "wisdom of crowds" applies to machine learning!

---

### 2. **Random Forests and Gradient Boosted Trees**

#### Random Forests (Bagging Approach)
**What you learned:**
- Build many decision trees independently in parallel
- Each tree sees a random subset of data (bootstrap) and features
- Average predictions to reduce variance
- Resistant to overfitting (can add more trees safely)
- Provides feature importance for interpretability

**Strengths:**
- Good default choice for many problems
- Handles non-linear relationships naturally
- Works well out-of-the-box with minimal tuning
- Resistant to overfitting
- Fast to train (parallelizable)

**When to use:** Your first choice for most classification/regression problems

#### Gradient Boosted Trees (Boosting Approach)
**What you learned:**
- Build trees sequentially, not in parallel
- Each new tree corrects errors of previous trees
- Reduces bias by iteratively improving predictions
- More prone to overfitting than Random Forests (need to tune carefully)
- Three excellent implementations: sklearn, XGBoost, LightGBM

**Strengths:**
- Often achieves highest accuracy with proper tuning
- Powerful for complex patterns
- XGBoost/LightGBM optimized for speed and performance

**Weaknesses:**
- Requires more careful hyperparameter tuning
- Can overfit if not monitored
- Sequential training (slower than parallel methods)

**When to use:** When you need maximum accuracy and have time to tune parameters

**Key difference:** Bagging reduces variance, Boosting reduces bias

---

### 3. **Bootstrap Aggregation (Bagging) - Deep Dive**

**What you learned:**
- Bootstrap sampling: Random sampling **with replacement**
- Each bootstrap sample contains ~63.2% of the unique training points (the limit of 1 - (1 - 1/n)^n, which is 1 - 1/e)
- Remaining ~36.8% are "out-of-bag" (OOB) samples
- OOB samples provide free validation without reducing training data

**How it works:**
1. Create multiple bootstrap samples from training data
2. Train a model on each bootstrap sample
3. Aggregate predictions (vote for classification, average for regression)
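
The three steps fit in a few lines without `BaggingClassifier`. This is a hand-rolled sketch for intuition, assuming integer class labels and using scikit-learn's `load_wine` data with its own local variable names; in practice the library class is the better choice:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

rng = np.random.default_rng(42)
n_trees, n_train = 25, len(X_tr)
all_preds = []

for _ in range(n_trees):
    # Step 1: bootstrap sample (draw n_train indices with replacement).
    idx = rng.integers(0, n_train, size=n_train)
    # Step 2: train one model on that sample.
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    all_preds.append(tree.predict(X_te))

# Step 3: majority vote across trees for each test sample.
all_preds = np.array(all_preds)          # shape: (n_trees, n_test)
n_classes = len(np.unique(y))
bagged_pred = np.array([
    np.bincount(column, minlength=n_classes).argmax() for column in all_preds.T
])

bagged_accuracy = np.mean(bagged_pred == y_te)
print(f"Hand-rolled bagging accuracy: {bagged_accuracy:.4f}")
```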

**Why it works:**
- Creates diversity through different training sets
- Reduces variance by averaging predictions
- Each model sees different data, learns different patterns
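- Errors that are not perfectly correlated partially cancel out, while correct predictions reinforce each other

**Perfect for:** High-variance, unstable models (fully grown Decision Trees, Neural Networks). Stable learners such as KNN usually gain little from bagging.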

**The OOB advantage:** Get validation score without using separate validation set - especially valuable for small datasets!

---

### 4. **Combining Heterogeneous Models (Stacking)**

**What you learned:**
- Most sophisticated ensemble approach
- **Level 0 (Base models):** Train diverse algorithms (RF, GB, SVM, KNN, etc.)
- **Level 1 (Meta-model):** Learns how to optimally combine base model predictions
- Uses cross-validation to avoid overfitting (critical!)

**How it's different from voting:**
- Voting: Fixed combination rule (average or majority)
- Stacking: Learns the optimal combination through a meta-model

**Advantages:**
- Can capture complementary strengths of different algorithms
- Meta-model learns which models to trust in which situations
- Often achieves best performance

**Disadvantages:**
- More complex to implement and tune
- Longer training time (train base models + meta-model)
- Less interpretable
- Can overfit if not using cross-validation properly

**When to use:** Maximum accuracy needed, have computational resources, competitions

---

### 5. **Evaluating Ensembles Properly**

**What you learned:**
- **Multiple metrics matter:** Don't rely on accuracy alone
  - **Precision:** Of predicted positives, how many were correct? (minimize false alarms)
  - **Recall:** Of actual positives, how many did we find? (minimize missed cases)
  - **F1-Score:** Balance between precision and recall
  
- **Confusion matrix:** Shows exactly which classes are confused with each other

- **Cross-validation:** More reliable than single train/test split
  - Mean score: Expected performance
  - Standard deviation: Stability/consistency
  - Ensembles should show lower variance!

- **Consider the full picture:**
  - Accuracy (overall correctness)
  - Per-class performance (some classes harder than others?)
  - Stability across folds (reliable predictions?)
  - Computational cost (training time, prediction speed)
  - Interpretability (can you explain predictions?)

---

## Best Practices You Should Follow

### 1. **Start Simple, Then Ensemble**
- Begin with a single well-tuned model (baseline)
- Try Random Forest as your first ensemble (great default)
- If you need more accuracy, try Gradient Boosting
- Only use stacking if you have time and need maximum performance

### 2. **Ensure Model Diversity**
For voting and stacking to work, base models should:
- Use different algorithms (linear, tree-based, distance-based)
- Make different types of errors
- Have decent individual performance (bad models → bad ensemble)

### 3. **Validate Properly**
- Always use cross-validation for robust estimates
- Use OOB scoring for bagging methods (free validation!)
- Keep a separate test set for final evaluation
- Watch for overfitting (train vs validation performance)

### 4. **Tune Hyperparameters**
- Random Forest: `n_estimators`, `max_depth`, `min_samples_split`
- Gradient Boosting: `n_estimators`, `learning_rate`, `max_depth`
- Stacking: Choice of base models and meta-model
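
As a starting point, the Random Forest parameters listed above can be searched with `GridSearchCV`. The grid below is deliberately tiny and its values are illustrative assumptions, not recommendations; the sketch is self-contained and uses scikit-learn's `load_wine` data:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)

# A deliberately small, illustrative grid; these values are not recommendations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                 # 3 folds keeps the sketch fast
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

For gradient boosting the same pattern applies with `learning_rate` added to the grid; larger grids are usually better served by a randomized search.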

### 5. **Consider Practical Constraints**
- **Training time:** Single < Voting < Bagging < Boosting < Stacking
- **Prediction latency:** Single < Voting ≈ Bagging < Boosting < Stacking (rough ordering; lower is faster)
- **Memory:** Ensembles use more memory (store multiple models)
- **Interpretability:** Single > Random Forest (feature importance) > other ensembles

### 6. **Monitor Complexity vs. Performance**
- More complex ensembles aren't always better
- A 2% accuracy gain might not justify 10× training time
- Consider the full system context, not just accuracy

---

## Decision Guide: When to Use Each Method

### Random Forest
**Use when:**
- ✅ You need a strong baseline quickly
- ✅ You have tabular data with many features
- ✅ You want built-in feature importance
- ✅ You don't have time for extensive tuning
- ✅ You want robust, stable predictions

**Best for:** Default choice for most classification/regression problems

---

### Gradient Boosting (XGBoost/LightGBM)
**Use when:**
- ✅ You need maximum accuracy
- ✅ You have time to tune hyperparameters
- ✅ You have structured/tabular data
- ✅ You can monitor for overfitting

**Best for:** Kaggle competitions, high-stakes applications where accuracy is paramount

---

### Bagging (General)
**Use when:**
- ✅ You have a high-variance base model (Decision Trees, Neural Networks)
- ✅ You want to reduce overfitting
- ✅ You have limited training data (OOB scoring gives a validation estimate without holding data out)
- ✅ You can train models in parallel

**Best for:** Stabilizing unstable models

---

### Voting
**Use when:**
- ✅ You already have several trained models
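- ✅ You want a quick ensemble without retraining (average the predictions of already-fitted models; note that scikit-learn's `VotingClassifier.fit` refits its estimators)
- ✅ Models are diverse (different algorithms)
- ✅ You want an interpretable combination rule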

**Best for:** Combining pre-existing models, quick wins
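
Combining models that are already trained can be done by hand with soft voting, that is, averaging predicted class probabilities. The sketch below assumes `load_wine` as stand-in data; the `fitted_models` list is fitted inline only to stand in for models you would already have, and all names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Stand-ins for pre-existing models (assumption): in practice, load yours instead
fitted_models = [
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train),
    make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train),
    RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train),
]

# Soft voting: average class probabilities, then pick the most probable class.
# All models were fitted on the same labels, so their class order matches.
classes = np.unique(y_train)
avg_proba = np.mean([m.predict_proba(X_test) for m in fitted_models], axis=0)
vote_pred = classes[np.argmax(avg_proba, axis=1)]

vote_acc = float(np.mean(vote_pred == y_test))
print(f"Soft-voting accuracy: {vote_acc:.3f}")
```

No model is retrained here; the combination step is just an average, which is also what makes it easy to explain.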

---

### Stacking
**Use when:**
- ✅ You need absolute maximum performance
- ✅ You have diverse base models available
- ✅ You have computational resources for training
- ✅ Accuracy justifies complexity

**Best for:** Competitions, critical applications, squeezing out the last bit of accuracy

---

## Common Pitfalls to Avoid

❌ **Using identical base models in voting/stacking**
- Diversity is key! Use different algorithms

❌ **Not using cross-validation in stacking**
- Train the meta-model on out-of-fold predictions; if it sees predictions the base models made on their own training data, it will overfit
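
scikit-learn's `StackingClassifier` handles this through its `cv` parameter, which controls how the out-of-fold predictions used to train the final estimator are generated. The sketch below is illustrative: the base models, the `load_wine` stand-in data, and the variable names are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # meta-model is trained on 5-fold out-of-fold base predictions
)
stack.fit(X_train, y_train)

stack_test_acc = stack.score(X_test, y_test)
print(f"Stacking test accuracy: {stack_test_acc:.3f}")
```

If you build stacking manually, `cross_val_predict` can produce the same kind of out-of-fold predictions for the meta-model's training set.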

❌ **Adding too many trees without checking**
- More trees = more compute, with diminishing returns after a point; in boosting, too many trees can also overfit
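
One way to check how many trees are actually useful is to score a boosted model after each stage on held-out data. The sketch below uses `GradientBoostingClassifier.staged_predict`; the validation split, `load_wine` stand-in data, and names such as `val_acc` and `best_n` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gb.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators boosting stages
val_acc = [accuracy_score(y_val, pred) for pred in gb.staged_predict(X_val)]

# First stage count that reaches the best validation accuracy
best_n = int(np.argmax(val_acc)) + 1
print(f"Best validation accuracy {max(val_acc):.3f} reached with {best_n} stages")
```

If the best score is reached well before the last stage, the extra trees add compute without improving validation performance.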

❌ **Forgetting to scale features for scale-sensitive models**
- SVM and KNN (distance-based) and regularized Logistic Regression benefit from standardized features
- Trees don't need scaling
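
To see the effect on a distance-based model, compare KNN with and without standardization. This is a small sketch assuming `load_wine` as stand-in data; `raw_acc` and `scaled_acc` are hypothetical names. Placing `StandardScaler` inside a pipeline means it is fitted only on the training folds during cross-validation.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Same model, with and without feature standardization
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_acc = cross_val_score(knn_raw, X, y, cv=5).mean()
scaled_acc = cross_val_score(knn_scaled, X, y, cv=5).mean()

print(f"KNN without scaling: {raw_acc:.3f}")
print(f"KNN with scaling:    {scaled_acc:.3f}")
```

Using pipelines for the scale-sensitive base models also lets you mix them with tree models inside a voting or stacking ensemble without scaling the trees' inputs separately.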

❌ **Relying only on accuracy**
- Check precision, recall, confusion matrix
- Some errors might be more costly than others
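
A per-class view takes only two extra calls. The sketch below uses `confusion_matrix` and `classification_report` on scikit-learn's `load_breast_cancer` data, chosen here as an assumption because the cost of a missed malignant case differs from that of a false alarm.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, stratify=data.target, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 for each class
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

Reading the off-diagonal cells of the confusion matrix shows which kind of error the model makes, which overall accuracy hides.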

❌ **Not considering deployment constraints**
- Model accuracy is useless if it's too slow/large for production

---

## What You Should Remember

🎯 **Core Principle:** Ensemble methods work by combining diverse models to reduce errors

🎯 **Bagging:** Reduces variance by averaging models trained on different bootstrap samples (Random Forest)

🎯 **Boosting:** Reduces bias by sequentially correcting errors (Gradient Boosting)

🎯 **Voting:** Simple combination of diverse algorithms

🎯 **Stacking:** Meta-learning optimal combinations

🎯 **Practical wisdom:** Start with Random Forest, move to Gradient Boosting when you need more accuracy, and reserve Stacking for when the last bit of performance justifies the complexity

---

## Next Steps and Further Learning

**Practice exercises to deepen understanding:**
1. Apply these methods to a different dataset (imbalanced classes, regression, etc.)
2. Implement hyperparameter tuning with GridSearchCV
3. Build a stacking ensemble manually (without StackingClassifier)
4. Compare training times and prediction speeds
5. Try ensemble methods on a large dataset (>100K samples)

**Advanced topics to explore:**
- Ensemble diversity metrics
- Weighted voting strategies
- Multi-level stacking
- Ensemble pruning (removing weak models)
- Online learning with ensembles
- Deep learning ensembles

**Real-world applications:**
- Medical diagnosis (where accuracy and reliability matter)
- Fraud detection (imbalanced classes)
- Recommendation systems (combining collaborative and content-based)
- Financial forecasting (ensemble predictions are more stable)

---

## Final Reflection Questions

Take a moment to think about:

1. Which ensemble method surprised you most in its performance?
2. How would you explain the difference between bagging and boosting to a colleague?
3. In a real project with tight deadlines, which method would you choose first and why?
4. What trade-offs would you consider when moving a model to production?
5. How could you determine if the complexity of stacking is justified for your use case?

---

**Congratulations!** You now have a solid foundation in ensemble learning. These techniques are among the most powerful tools in modern machine learning. Practice applying them to different problems, and you'll develop intuition for when and how to use each approach effectively.

Remember: **The best model is not always the most complex one, but the one that best balances your accuracy requirements, computational constraints, and interpretability needs for your specific problem.**