# Ensemble Methods Lab

This hands-on lab demonstrates key ensemble learning concepts using real-world data. We'll explore:

1. **Using Multiple Models Together** - Combining different algorithms
2. **Random Forests and GBTs** - Tree-based ensemble methods
3. **Understanding Bootstrap Aggregation** - The bagging technique
4. **Combining Heterogeneous Models** - Stacking and blending
5. **Evaluating Ensembles of Methods** - Comprehensive performance analysis

## Dataset

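We'll use the **Wine dataset** bundled with scikit-learn (`load_wine`), a copy of the Wine recognition data from the UCI Machine Learning Repository. It contains 13 physicochemical measurements for 178 wines grown in the same region of Italy from three different cultivars; the task is to predict the cultivar.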

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Individual models
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Ensemble methods
from sklearn.ensemble import (
    RandomForestClassifier, 
    GradientBoostingClassifier,
    BaggingClassifier,
    AdaBoostClassifier,
    VotingClassifier,
    StackingClassifier
)
import xgboost as xgb
import lightgbm as lgb

import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Load and Explore the Dataset

We'll load the Wine dataset from scikit-learn. It is a small, well-known three-class classification dataset that trains quickly, which makes it convenient for comparing ensemble methods.

In [None]:
# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=wine.feature_names)
df['target'] = y

print("Dataset Information:")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Number of classes: {len(np.unique(y))}")
print(f"\nClass distribution:")
print(pd.Series(y).value_counts().sort_index())

print("\nFeature names:")
print(wine.feature_names)

print("\nFirst few rows:")
df.head()

In [None]:
# Visualize class distribution
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
pd.Series(y).value_counts().sort_index().plot(kind='bar')
plt.title('Class Distribution')
plt.xlabel('Wine Class')
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.subplot(1, 2, 2)
df.iloc[:, :4].boxplot()
plt.title('Feature Distributions (First 4 Features)')
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

## Data Preparation

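Split the data into training and test sets, and standardize the features. Scaling matters for distance- and margin-based models (KNN, SVM, logistic regression); tree-based models are insensitive to it, which is why they are trained on the unscaled data later in the lab.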

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardize features (important for some algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"\nTraining set class distribution:")
print(pd.Series(y_train).value_counts().sort_index())

# 1. Using Multiple Models Together

Before diving into ensemble methods, let's first train several individual models to establish baselines. We'll compare their performance and then see whether combining them improves results.

In [None]:
# Train multiple individual models
models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB()
}

# Store results
individual_results = {}

print("Training individual models...\n")
print("="*70)

for name, model in models.items():
    # Train the model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_scaled)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    # Cross-validation score
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    cv_mean = cv_scores.mean()
    cv_std = cv_scores.std()
    
    individual_results[name] = {
        'model': model,
        'test_accuracy': accuracy,
        'cv_mean': cv_mean,
        'cv_std': cv_std
    }
    
    print(f"{name}:")
    print(f"  Test Accuracy: {accuracy:.4f}")
    print(f"  CV Score: {cv_mean:.4f} (+/- {cv_std:.4f})")
    print("-"*70)

print("="*70)

In [None]:
# Visualize individual model performance
results_df = pd.DataFrame({
    'Model': list(individual_results.keys()),
    'Test Accuracy': [r['test_accuracy'] for r in individual_results.values()],
    'CV Mean': [r['cv_mean'] for r in individual_results.values()],
    'CV Std': [r['cv_std'] for r in individual_results.values()]
})

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.barh(results_df['Model'], results_df['Test Accuracy'])
plt.xlabel('Test Accuracy')
plt.title('Individual Model Performance')
plt.xlim(0.8, 1.0)

plt.subplot(1, 2, 2)
plt.errorbar(results_df['CV Mean'], results_df['Model'], 
             xerr=results_df['CV Std'], fmt='o', markersize=8)
plt.xlabel('Cross-Validation Score')
plt.title('CV Performance with Standard Deviation')
plt.xlim(0.8, 1.0)

plt.tight_layout()
plt.show()

print("\nPerformance Summary:")
print(results_df.to_string(index=False))

### Simple Voting Ensemble

Now let's combine these models using a **Voting Classifier**. This ensemble method combines predictions from multiple models using either:
- **Hard voting**: Each model votes for a class, and the majority wins
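- **Soft voting**: The predicted class probabilities are averaged across models, and the class with the highest average wins (often performs better when the models produce well-calibrated probabilities)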
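
The difference between the two rules is easiest to see on a single sample where they disagree. The sketch below uses made-up probabilities for three hypothetical models (they are not outputs from this lab): two models lean weakly toward class 1, while one is very confident in class 0.

```python
import numpy as np

# Hypothetical predicted class probabilities from three classifiers
# for a single sample (rows: models, columns: classes 0, 1, 2).
probas = np.array([
    [0.90, 0.10, 0.00],   # model A is very confident in class 0
    [0.40, 0.60, 0.00],   # model B leans slightly toward class 1
    [0.45, 0.55, 0.00],   # model C leans slightly toward class 1
])

# Hard voting: each model casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the largest.
mean_proba = probas.mean(axis=0)
soft_winner = mean_proba.argmax()

print(f"Hard voting picks class {hard_winner}")
print(f"Soft voting picks class {soft_winner} (mean probabilities: {mean_proba.round(3)})")
```

Hard voting follows the two-to-one majority, while soft voting lets the one confident model outweigh two lukewarm ones. Which behavior is preferable depends on how trustworthy the probabilities are.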

In [None]:
# Create voting classifiers
estimators = [(name, model) for name, model in models.items()]

# Hard voting
voting_hard = VotingClassifier(estimators=estimators, voting='hard')
voting_hard.fit(X_train_scaled, y_train)
y_pred_hard = voting_hard.predict(X_test_scaled)
acc_hard = accuracy_score(y_test, y_pred_hard)

# Soft voting
voting_soft = VotingClassifier(estimators=estimators, voting='soft')
voting_soft.fit(X_train_scaled, y_train)
y_pred_soft = voting_soft.predict(X_test_scaled)
acc_soft = accuracy_score(y_test, y_pred_soft)

print("\nVoting Ensemble Results:")
print("="*70)
print(f"Hard Voting Accuracy: {acc_hard:.4f}")
print(f"Soft Voting Accuracy: {acc_soft:.4f}")
print("\nComparison with best individual model:")
best_individual = max(individual_results.items(), key=lambda x: x[1]['test_accuracy'])
print(f"Best Individual Model: {best_individual[0]}")
print(f"Best Individual Accuracy: {best_individual[1]['test_accuracy']:.4f}")
print(f"\nImprovement (Soft Voting): {(acc_soft - best_individual[1]['test_accuracy']):.4f}")
print("="*70)

# 2. Random Forests and Gradient Boosted Trees

## Random Forests

Random Forests use **Bootstrap Aggregation (Bagging)** combined with random feature selection. Each tree is trained on a different bootstrap sample, and at each split, only a random subset of features is considered.
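
In scikit-learn, these two sources of randomness correspond to the `bootstrap` and `max_features` parameters of `RandomForestClassifier`. The short sketch below sets them explicitly and inspects one fitted tree; the `_demo` variable names are illustrative and separate from the variables used elsewhere in the lab.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_wine(return_X_y=True)

# Spell out the two sources of randomness explicitly.
rf_demo = RandomForestClassifier(
    n_estimators=25,
    bootstrap=True,        # each tree sees a bootstrap sample of the rows
    max_features="sqrt",   # each split considers a random subset of columns
    random_state=0,
)
rf_demo.fit(X_demo, y_demo)

n_features = X_demo.shape[1]
# Each fitted tree records how many features it considers per split.
features_per_split = rf_demo.estimators_[0].max_features_
print(f"Total features: {n_features}")
print(f"Features considered at each split: {features_per_split}")
```

Restricting each split to a subset of features makes the trees less correlated with one another, which is what lets averaging reduce variance more than plain bagging of full trees.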

In [None]:
# Train Random Forest with different numbers of trees
n_trees_list = [10, 50, 100, 200, 500]
rf_results = []

print("Training Random Forests with different number of trees...\n")

for n_trees in n_trees_list:
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    
    train_acc = rf.score(X_train, y_train)
    test_acc = rf.score(X_test, y_test)
    
    rf_results.append({
        'n_trees': n_trees,
        'train_accuracy': train_acc,
        'test_accuracy': test_acc
    })
    
    print(f"n_estimators={n_trees:3d} | Train: {train_acc:.4f} | Test: {test_acc:.4f}")

rf_results_df = pd.DataFrame(rf_results)

In [None]:
# Visualize Random Forest performance vs number of trees
plt.figure(figsize=(10, 5))

plt.plot(rf_results_df['n_trees'], rf_results_df['train_accuracy'], 
         marker='o', label='Training Accuracy', linewidth=2)
plt.plot(rf_results_df['n_trees'], rf_results_df['test_accuracy'], 
         marker='s', label='Test Accuracy', linewidth=2)
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.title('Random Forest Performance vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(0.85, 1.05)
plt.show()

print("\nWhat to look for in the plot:")
print("- Test accuracy usually improves as trees are added, then plateaus")
print("- Beyond that point, extra trees mainly add compute cost")
print("- Adding trees does not by itself cause overfitting, because predictions are averaged")

In [None]:
# Analyze feature importance from Random Forest
rf_final = RandomForestClassifier(n_estimators=200, random_state=42)
rf_final.fit(X_train, y_train)

# Get feature importances
feature_importance = pd.DataFrame({
    'feature': wine.feature_names,
    'importance': rf_final.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nTop 5 Most Important Features:")
print(feature_importance.head().to_string(index=False))

## Gradient Boosted Trees (GBTs)

Unlike Random Forests, which build trees independently, Gradient Boosting builds trees **sequentially**: each new tree is fit to the errors (more precisely, the negative gradient of the loss) of the ensemble built so far.

We'll compare:
- **Scikit-learn GradientBoosting**
- **XGBoost** (eXtreme Gradient Boosting)
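- **LightGBM** (Light Gradient Boosting Machine)
- **AdaBoost** (adaptive boosting, which reweights misclassified samples instead of fitting gradients; included for reference)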
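
To make the sequential idea concrete, here is a minimal hand-rolled sketch of boosting for squared-error regression on synthetic data. It is a simplified illustration, not how the libraries above are implemented: for squared error the negative gradient is simply the residual, so each round fits a small tree to the current residuals. The data, `learning_rate`, and `n_rounds` values are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (synthetic data, for illustration only).
rng = np.random.default_rng(0)
X_toy = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the targets.
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual.
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    # Take a small step in the direction suggested by the new tree.
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

The training error shrinks round by round, which is also why boosting can overfit if run for too many rounds without a validation check.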

In [None]:
# Train different gradient boosting implementations
print("Training Gradient Boosting Models...\n")
print("="*70)

# Scikit-learn Gradient Boosting
gb_sklearn = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, 
                                        max_depth=3, random_state=42)
gb_sklearn.fit(X_train, y_train)
gb_sklearn_acc = gb_sklearn.score(X_test, y_test)
print(f"Scikit-learn GradientBoosting: {gb_sklearn_acc:.4f}")

# XGBoost
xgb_model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, 
                              max_depth=3, random_state=42, eval_metric='mlogloss')
xgb_model.fit(X_train, y_train)
xgb_acc = xgb_model.score(X_test, y_test)
print(f"XGBoost:                       {xgb_acc:.4f}")

# LightGBM
lgb_model = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1, 
                               max_depth=3, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
lgb_acc = lgb_model.score(X_test, y_test)
print(f"LightGBM:                      {lgb_acc:.4f}")

# AdaBoost (another boosting variant)
ada_model = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
ada_model.fit(X_train, y_train)
ada_acc = ada_model.score(X_test, y_test)
print(f"AdaBoost:                      {ada_acc:.4f}")

print("="*70)

In [None]:
# Compare Random Forest vs Gradient Boosting
comparison_data = pd.DataFrame({
    'Model': ['Random Forest', 'Sklearn GB', 'XGBoost', 'LightGBM', 'AdaBoost'],
    'Accuracy': [rf_final.score(X_test, y_test), gb_sklearn_acc, xgb_acc, lgb_acc, ada_acc],
    'Type': ['Bagging', 'Boosting', 'Boosting', 'Boosting', 'Boosting']
})

plt.figure(figsize=(10, 5))
colors = ['blue' if t == 'Bagging' else 'orange' for t in comparison_data['Type']]
plt.barh(comparison_data['Model'], comparison_data['Accuracy'], color=colors)
plt.xlabel('Test Accuracy')
plt.title('Random Forest (Bagging) vs Boosting Methods')
plt.xlim(0.85, 1.0)
plt.axvline(x=0.95, color='red', linestyle='--', alpha=0.5, label='95% threshold')
plt.legend()
plt.tight_layout()
plt.show()

print("\nModel Comparison:")
print(comparison_data.to_string(index=False))

# 3. Understanding Bootstrap Aggregation (Bagging)

Let's dive deeper into how **Bootstrap Aggregation** works:

1. Create multiple bootstrap samples (random sampling with replacement)
2. Train a model on each bootstrap sample
3. Aggregate predictions (majority vote for classification, average for regression)

We'll demonstrate the bootstrap sampling step manually, then use scikit-learn's `BaggingClassifier` for the full procedure.
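
The three steps listed above can also be written out by hand in a few lines. The following is a minimal sketch under simple assumptions (25 unpruned decision trees, plain majority vote); the variable names are illustrative and it re-splits the data so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

rng = np.random.default_rng(42)
n_models = 25
all_preds = []

for i in range(n_models):
    # Step 1: draw a bootstrap sample (same size, with replacement).
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: train one model on that sample.
    tree = DecisionTreeClassifier(random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    all_preds.append(tree.predict(X_te))

# Step 3: aggregate by majority vote across models, per test sample.
all_preds = np.array(all_preds)           # shape: (n_models, n_test)
n_classes = len(np.unique(y_tr))
vote_counts = np.apply_along_axis(
    np.bincount, 0, all_preds, minlength=n_classes
)                                         # shape: (n_classes, n_test)
bagged_pred = vote_counts.argmax(axis=0)

manual_bagging_acc = np.mean(bagged_pred == y_te)
print(f"Manual bagging accuracy ({n_models} trees): {manual_bagging_acc:.4f}")
```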

In [None]:
# Demonstrate bootstrap sampling
n_samples = len(X_train)
n_bootstrap = 3

print("Demonstrating Bootstrap Sampling:\n")
print(f"Original training set size: {n_samples}")
print(f"\nCreating {n_bootstrap} bootstrap samples...\n")

for i in range(n_bootstrap):
    # Create bootstrap sample (sampling with replacement)
    indices = np.random.choice(n_samples, size=n_samples, replace=True)
    unique_indices = len(np.unique(indices))
    
    # Out-of-bag samples (samples not selected)
    oob_indices = set(range(n_samples)) - set(indices)
    
    print(f"Bootstrap Sample {i+1}:")
    print(f"  Total samples: {len(indices)}")
    print(f"  Unique samples: {unique_indices} ({unique_indices/n_samples*100:.1f}%)")
    print(f"  Out-of-bag samples: {len(oob_indices)} ({len(oob_indices)/n_samples*100:.1f}%)")
    print()

print("Key Insight: on average, a bootstrap sample contains ~63.2% of the distinct training samples")
print("The remaining ~36.8% are out-of-bag (OOB) samples, which can be used for validation")
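
The ~63.2% figure comes from a short probability argument: when drawing n samples with replacement from n, a given sample is missed by every draw with probability (1 - 1/n)^n, which approaches 1/e as n grows. The quick numerical check below illustrates this; the values of n are arbitrary.

```python
import numpy as np

# Probability that a given sample appears at least once in a bootstrap
# sample of size n drawn with replacement from n samples.
for n in [10, 100, 1000, 100000]:
    p_in_bag = 1 - (1 - 1 / n) ** n
    print(f"n = {n:>6}: P(in bag) = {p_in_bag:.4f}")

limit = 1 - 1 / np.e
print(f"Limit as n grows: 1 - 1/e = {limit:.4f}")
```

For small training sets the expected fraction is slightly above the limit, so the percentages printed by the bootstrap demo will fluctuate around this value rather than match it exactly.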

In [None]:
# Compare base model vs Bagging ensemble
print("\nComparing Single Model vs Bagging Ensemble:\n")
print("="*70)

# Single Decision Tree
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)
single_tree_acc = single_tree.score(X_test, y_test)
print(f"Single Decision Tree: {single_tree_acc:.4f}")

# Bagging with different numbers of estimators
n_estimators_list = [10, 50, 100, 200]
bagging_results = []

for n_est in n_estimators_list:
    bagging = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=n_est,
        random_state=42,
        n_jobs=-1
    )
    bagging.fit(X_train, y_train)
    bagging_acc = bagging.score(X_test, y_test)
    bagging_results.append({'n_estimators': n_est, 'accuracy': bagging_acc})
    print(f"Bagging ({n_est:3d} trees): {bagging_acc:.4f}")

print("="*70)

improvement = bagging_results[-1]['accuracy'] - single_tree_acc
print(f"\nImprovement from Bagging: {improvement:.4f} ({improvement/single_tree_acc*100:.1f}%)")

In [None]:
# Visualize the effect of ensemble size
bagging_df = pd.DataFrame(bagging_results)

plt.figure(figsize=(10, 5))
plt.plot(bagging_df['n_estimators'], bagging_df['accuracy'], 
         marker='o', linewidth=2, markersize=8, label='Bagging')
plt.axhline(y=single_tree_acc, color='red', linestyle='--', 
            label='Single Tree', linewidth=2)
plt.xlabel('Number of Estimators')
plt.ylabel('Test Accuracy')
plt.title('Bagging Performance vs Ensemble Size')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Out-of-Bag (OOB) Evaluation

One of the benefits of bagging is **OOB evaluation**: each training sample is scored using only the models whose bootstrap sample did not include it, so we can estimate model performance without setting aside a separate validation set.
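
To show what happens behind `oob_score=True`, the sketch below recomputes an OOB accuracy by hand from a fitted `BaggingClassifier`, using its `estimators_` and `estimators_samples_` attributes. It is an illustration of the idea on the full Wine data, with hypothetical variable names, and is only expected to agree closely with scikit-learn's own `oob_score_`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)

# The base estimator is passed positionally so the call does not depend on
# the keyword name, which differs between scikit-learn versions.
bag = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, oob_score=True, random_state=0
)
bag.fit(X_demo, y_demo)

n_rows, n_classes = len(X_demo), len(bag.classes_)
proba_sum = np.zeros((n_rows, n_classes))

for tree, in_bag_idx in zip(bag.estimators_, bag.estimators_samples_):
    # Rows this tree never saw during training are its out-of-bag rows.
    oob_mask = np.ones(n_rows, dtype=bool)
    oob_mask[in_bag_idx] = False
    proba_sum[oob_mask] += tree.predict_proba(X_demo[oob_mask])

# Only score rows that were out-of-bag for at least one tree.
covered = proba_sum.sum(axis=1) > 0
manual_oob = np.mean(proba_sum[covered].argmax(axis=1) == y_demo[covered])

print(f"Manual OOB accuracy:     {manual_oob:.4f}")
print(f"scikit-learn oob_score_: {bag.oob_score_:.4f}")
```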

In [None]:
# Demonstrate OOB scoring
bagging_oob = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
bagging_oob.fit(X_train, y_train)

print("Out-of-Bag Evaluation:")
print("="*70)
print(f"OOB Score (internal validation): {bagging_oob.oob_score_:.4f}")
print(f"Test Score:                      {bagging_oob.score(X_test, y_test):.4f}")
print("\nThe OOB score approximates held-out performance without needing a separate")
print("validation set, so more data remains available for training.")
print("="*70)

# 4. Combining Heterogeneous Models (Stacking)

**Stacking** is an advanced ensemble technique that combines different types of models:

1. Train multiple diverse base models (level 0)
2. Use their predictions as features for a meta-model (level 1)
3. The meta-model learns how to best combine the base model predictions

This is different from voting, which uses a fixed combination rule.
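
The steps above can be sketched by hand with `cross_val_predict`, which produces out-of-fold predictions for the meta-model to train on. This is a simplified illustration with two arbitrarily chosen base models, not a reproduction of what `StackingClassifier` does internally; all names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Level 0: two deliberately different base models (illustrative choices).
level0 = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
]

# Out-of-fold probabilities become the meta-model's training features, so the
# meta-model never sees predictions made on a model's own training rows.
meta_train = np.hstack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")
    for m in level0
])

# Refit each base model on the full training set to featurize the test set.
meta_test = np.hstack([m.fit(X_tr, y_tr).predict_proba(X_te) for m in level0])

# Level 1: the meta-model learns how to weight the base predictions.
meta_model_demo = LogisticRegression(max_iter=1000)
meta_model_demo.fit(meta_train, y_tr)
manual_stacking_acc = meta_model_demo.score(meta_test, y_te)

print(f"Meta-feature matrix shape: {meta_train.shape}")
print(f"Manual stacking accuracy:  {manual_stacking_acc:.4f}")
```

Using out-of-fold predictions is the important design choice here: training the meta-model on in-sample predictions would reward whichever base model overfits the most.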

In [None]:
# Define base models (diverse set of algorithms)
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
]

# Define meta-model (final estimator)
meta_model = LogisticRegression(max_iter=1000, random_state=42)

# Create stacking classifier
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5  # Use cross-validation to generate meta-features
)

print("Training Stacking Ensemble...\n")
stacking.fit(X_train_scaled, y_train)
stacking_acc = stacking.score(X_test_scaled, y_test)

print("Stacking Results:")
print("="*70)
print("\nBase Models Performance:")
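# StackingClassifier fits internal clones, so the original base model
# objects are still unfitted; fit them here to get standalone accuracies
for name, model in base_models: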
    model.fit(X_train_scaled, y_train)
    acc = model.score(X_test_scaled, y_test)
    print(f"  {name:3s}: {acc:.4f}")

print(f"\nStacking Ensemble: {stacking_acc:.4f}")
print("="*70)

In [None]:
# Compare different ensemble strategies
ensemble_comparison = pd.DataFrame({
    'Method': ['Voting (Hard)', 'Voting (Soft)', 'Bagging', 'Random Forest', 
               'XGBoost', 'Stacking'],
    'Accuracy': [
        acc_hard,
        acc_soft,
        bagging_oob.score(X_test, y_test),
        rf_final.score(X_test, y_test),
        xgb_acc,
        stacking_acc
    ],
    'Strategy': ['Voting', 'Voting', 'Bagging', 'Bagging', 'Boosting', 'Stacking']
})

plt.figure(figsize=(12, 6))
colors = {'Voting': 'skyblue', 'Bagging': 'lightgreen', 
          'Boosting': 'orange', 'Stacking': 'purple'}
bar_colors = [colors[s] for s in ensemble_comparison['Strategy']]

plt.barh(ensemble_comparison['Method'], ensemble_comparison['Accuracy'], color=bar_colors)
plt.xlabel('Test Accuracy')
plt.title('Comparison of Different Ensemble Strategies')
plt.xlim(0.85, 1.0)

# Add legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=color, label=strategy) 
                   for strategy, color in colors.items()]
plt.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.show()

print("\nEnsemble Method Comparison:")
print(ensemble_comparison.sort_values('Accuracy', ascending=False).to_string(index=False))

# 5. Evaluating Ensembles of Methods

Let's perform a comprehensive evaluation of our best ensemble models using multiple metrics:
- Accuracy
- Precision, Recall, F1-score
- Confusion Matrix
- Cross-validation scores
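
As a reminder of how these metrics relate to one another, the sketch below derives per-class precision, recall, and F1 directly from a confusion matrix. The labels are hypothetical and are not results from this lab.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical labels for a 3-class problem (not results from this lab).
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])

# Rows are true classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

true_pos = np.diag(cm_demo)
precision = true_pos / cm_demo.sum(axis=0)   # of everything predicted as class k
recall = true_pos / cm_demo.sum(axis=1)      # of everything that truly is class k
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm_demo.sum()

print("Confusion matrix:")
print(cm_demo)
print("Precision per class:", precision.round(3))
print("Recall per class:   ", recall.round(3))
print("F1 per class:       ", f1.round(3))
print(f"Accuracy: {accuracy:.3f}")
```

The same per-class numbers are what `classification_report` prints, which is why the confusion matrix and the report are best read together.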

In [None]:
# Select best models for detailed evaluation
best_models = {
    'Random Forest': rf_final,
    'XGBoost': xgb_model,
    'Stacking': stacking,
    'Voting (Soft)': voting_soft
}

# Generate predictions for each model
print("Detailed Classification Reports:\n")
print("="*70)

for name, model in best_models.items():
    print(f"\n{name}:")
    print("-"*70)
    
    # Use scaled or unscaled data based on model type
    if name in ['Stacking', 'Voting (Soft)']:
        y_pred = model.predict(X_test_scaled)
    else:
        y_pred = model.predict(X_test)
    
    print(classification_report(y_test, y_pred, target_names=wine.target_names))

In [None]:
# Visualize confusion matrices
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, (name, model) in enumerate(best_models.items()):
    # Use scaled or unscaled data based on model type
    if name in ['Stacking', 'Voting (Soft)']:
        y_pred = model.predict(X_test_scaled)
    else:
        y_pred = model.predict(X_test)
    
    cm = confusion_matrix(y_test, y_pred)
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx],
                xticklabels=wine.target_names,
                yticklabels=wine.target_names)
    axes[idx].set_title(f'{name} Confusion Matrix')
    axes[idx].set_ylabel('True Label')
    axes[idx].set_xlabel('Predicted Label')

plt.tight_layout()
plt.show()

In [None]:
# Cross-validation comparison
print("\nCross-Validation Performance Comparison:\n")
print("="*70)

cv_results = []

for name, model in best_models.items():
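    # Use scaled or unscaled data based on model type.
    # Note: X_train_scaled was standardized using the full training set, so the
    # CV folds share scaling statistics; wrapping the model in a Pipeline with
    # StandardScaler would keep each fold's preprocessing separate.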
    if name in ['Stacking', 'Voting (Soft)']:
        X_cv = X_train_scaled
    else:
        X_cv = X_train
    
    scores = cross_val_score(model, X_cv, y_train, cv=5, scoring='accuracy')
    cv_results.append({
        'Model': name,
        'Mean CV Score': scores.mean(),
        'Std CV Score': scores.std(),
        'Min Score': scores.min(),
        'Max Score': scores.max()
    })
    
    print(f"{name}:")
    print(f"  Mean: {scores.mean():.4f} (+/- {scores.std():.4f})")
    print(f"  Range: [{scores.min():.4f}, {scores.max():.4f}]")
    print()

cv_results_df = pd.DataFrame(cv_results)
print("="*70)

In [None]:
# Visualize CV performance with error bars
plt.figure(figsize=(10, 6))
plt.errorbar(cv_results_df['Mean CV Score'], cv_results_df['Model'],
             xerr=cv_results_df['Std CV Score'], fmt='o', markersize=10,
             capsize=5, capthick=2, linewidth=2)
plt.xlabel('Cross-Validation Score')
plt.title('Cross-Validation Performance Comparison (with Standard Deviation)')
plt.xlim(0.90, 1.0)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Final Performance Summary

Let's create a comprehensive summary of all ensemble methods we've explored.

In [None]:
# Create final summary table
final_summary = pd.DataFrame({
    'Ensemble Method': [
        'Single Decision Tree (Baseline)',
        'Voting - Hard',
        'Voting - Soft',
        'Bagging (100 trees)',
        'Random Forest (200 trees)',
        'Gradient Boosting (sklearn)',
        'XGBoost',
        'LightGBM',
        'AdaBoost',
        'Stacking'
    ],
    'Test Accuracy': [
        single_tree_acc,
        acc_hard,
        acc_soft,
        bagging_oob.score(X_test, y_test),
        rf_final.score(X_test, y_test),
        gb_sklearn_acc,
        xgb_acc,
        lgb_acc,
        ada_acc,
        stacking_acc
    ],
    'Category': [
        'Baseline',
        'Voting',
        'Voting',
        'Bagging',
        'Bagging',
        'Boosting',
        'Boosting',
        'Boosting',
        'Boosting',
        'Stacking'
    ]
})

final_summary = final_summary.sort_values('Test Accuracy', ascending=False)

print("\n" + "="*80)
print("FINAL PERFORMANCE SUMMARY")
print("="*80)
print(final_summary.to_string(index=False))
print("="*80)

# Calculate improvement over baseline
best_model = final_summary.iloc[0]
improvement = (best_model['Test Accuracy'] - single_tree_acc) / single_tree_acc * 100
print(f"\nBest Model: {best_model['Ensemble Method']}")
print(f"Improvement over baseline: {improvement:.2f}%")

In [None]:
# Final visualization
plt.figure(figsize=(14, 8))

category_colors = {
    'Baseline': 'gray',
    'Voting': 'skyblue',
    'Bagging': 'lightgreen',
    'Boosting': 'orange',
    'Stacking': 'purple'
}

colors = [category_colors[cat] for cat in final_summary['Category']]

plt.barh(final_summary['Ensemble Method'], final_summary['Test Accuracy'], color=colors)
plt.xlabel('Test Accuracy', fontsize=12)
plt.title('Comprehensive Ensemble Methods Performance Comparison', fontsize=14, fontweight='bold')
plt.xlim(0.85, 1.0)
plt.axvline(x=single_tree_acc, color='red', linestyle='--', linewidth=2, label='Baseline', alpha=0.7)

# Add value labels on bars
for idx, row in final_summary.iterrows():
    plt.text(row['Test Accuracy'], row['Ensemble Method'], 
             f" {row['Test Accuracy']:.4f}", 
             va='center', fontsize=9)

# Add legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=color, label=category) 
                   for category, color in category_colors.items()]
plt.legend(handles=legend_elements, loc='lower right', fontsize=10)

plt.tight_layout()
plt.show()

# Summary and Key Takeaways

## Ensemble Methods Overview

### 1. **Using Multiple Models Together**
- Combining diverse models often performs better than individual models
- Voting ensembles are simple but effective
- Soft voting (averaging predicted probabilities) often outperforms hard voting when the base models' probabilities are well calibrated

### 2. **Random Forests and GBTs**
- **Random Forests** (Bagging):
  - Build trees independently in parallel
  - Reduce variance through averaging
  - Relatively resistant to overfitting as more trees are added
  - Good default choice for many problems
  
- **Gradient Boosted Trees** (Boosting):
  - Build trees sequentially
  - Each tree corrects previous errors
  - Often achieve highest accuracy
  - More prone to overfitting than Random Forests
  - XGBoost and LightGBM offer optimized implementations

### 3. **Bootstrap Aggregation (Bagging)**
- Creates diverse training sets through random sampling with replacement
- Each bootstrap sample contains, on average, ~63.2% of the distinct training samples
- Out-of-bag (OOB) samples provide free validation
- Reduces variance and improves stability

### 4. **Combining Heterogeneous Models**
- **Stacking** uses a meta-model to learn optimal combinations
- More sophisticated than simple voting
- Can capture complementary strengths of different algorithms
- Requires more computational resources

### 5. **Evaluating Ensembles**
- Always use cross-validation for robust performance estimates
- Consider multiple metrics (accuracy, precision, recall, F1)
- Examine confusion matrices for detailed error analysis
- Balance performance with computational cost

## Best Practices

1. **Start Simple**: Begin with Random Forest as a strong baseline
2. **Try Boosting**: Gradient boosting often gives the best performance
3. **Diversify**: Use different types of models in voting/stacking
4. **Validate Properly**: Use cross-validation to detect overfitting and obtain reliable performance estimates
5. **Monitor Complexity**: More complex ensembles aren't always better

## When to Use Each Method

- **Random Forest**: Good default choice, handles non-linear relationships well
- **Gradient Boosting**: When you need maximum accuracy and have time to tune
- **Bagging**: When you have a high-variance model to stabilize
- **Voting**: Quick ensemble of pre-trained models
- **Stacking**: When you have diverse models and want optimal combination