# Ensemble Methods and Advanced Models
## Module 8, Lab 5: Combining Models for Better Performance

Any single model has limitations. Ensemble methods combine several models so that their individual errors partly cancel out, which often improves accuracy, reduces overfitting, and makes predictions more robust. This lab covers the main ensemble techniques used in practice: bagging, boosting, voting, and stacking.

### Learning Objectives
By the end of this lab, you will be able to:
- Understand the principles behind ensemble methods
- Build Random Forest models (bagging)
- Implement Gradient Boosting models (boosting)
- Create voting and stacking ensembles
- Tune hyperparameters for optimal performance
- Compare ensemble methods with individual models

### Why Ensemble Methods Matter
Ensemble methods often win machine learning competitions and are widely used in industry because they:
- Reduce overfitting through model averaging
- Improve generalization to new data
- Provide more robust predictions
- Can capture different patterns in the data

## Setup and Data Loading

In [None]:
# Install required packages
!pip install --upgrade pip
!pip install pandas numpy matplotlib seaborn scikit-learn xgboost

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier, 
    VotingClassifier, BaggingClassifier, AdaBoostClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline

print("Libraries imported successfully!")

### Creating a Complex Dataset
We'll create a more complex dataset that benefits from ensemble methods.

In [None]:
# Create a complex customer churn dataset
np.random.seed(42)
n_customers = 2000

# Generate customer features with complex interactions
customer_data = {
    'age': np.random.normal(40, 15, n_customers),
    'income': np.random.lognormal(10.5, 0.6, n_customers),
    'account_length': np.random.exponential(3, n_customers),
    'monthly_charges': np.random.normal(65, 20, n_customers),
    'total_charges': np.random.normal(1500, 800, n_customers),
    'support_calls': np.random.poisson(2, n_customers),
    'contract_length': np.random.choice([1, 12, 24], n_customers, p=[0.4, 0.35, 0.25]),
    'payment_method': np.random.choice(['Credit Card', 'Bank Transfer', 'Electronic Check', 'Mailed Check'], 
                                      n_customers, p=[0.35, 0.25, 0.25, 0.15]),
    'internet_service': np.random.choice(['DSL', 'Fiber', 'No'], n_customers, p=[0.4, 0.45, 0.15]),
    'online_security': np.random.choice([0, 1], n_customers, p=[0.6, 0.4]),
    'tech_support': np.random.choice([0, 1], n_customers, p=[0.7, 0.3]),
    'streaming_tv': np.random.choice([0, 1], n_customers, p=[0.55, 0.45]),
    'streaming_movies': np.random.choice([0, 1], n_customers, p=[0.55, 0.45]),
    'paperless_billing': np.random.choice([0, 1], n_customers, p=[0.4, 0.6]),
    'senior_citizen': np.random.choice([0, 1], n_customers, p=[0.84, 0.16])
}

# Create DataFrame
df = pd.DataFrame(customer_data)

# Apply realistic constraints
df['age'] = np.clip(df['age'], 18, 80)
df['income'] = np.clip(df['income'], 20000, 200000)
df['account_length'] = np.clip(df['account_length'], 0, 10)
df['monthly_charges'] = np.clip(df['monthly_charges'], 20, 120)
df['total_charges'] = np.maximum(df['total_charges'], df['monthly_charges'] * df['account_length'])

# Create complex feature interactions
df['charges_to_income_ratio'] = df['monthly_charges'] / (df['income'] / 12)
df['avg_monthly_charges'] = df['total_charges'] / (df['account_length'] + 1)
df['service_count'] = (df['online_security'] + df['tech_support'] + 
                      df['streaming_tv'] + df['streaming_movies'])

print(f"Dataset created with {len(df)} customers")
print(f"Dataset shape: {df.shape}")
df.head()

### Creating the Target Variable (Churn)
We'll create a complex churn pattern that benefits from ensemble methods.

In [None]:
# Create complex churn probability with non-linear relationships
churn_probability = (
    0.05 +  # Base probability
    # Linear effects
    (df['support_calls'] / 10) * 0.3 +
    (df['charges_to_income_ratio'] > 0.15) * 0.2 +
    (df['contract_length'] == 1) * 0.25 +
    (df['payment_method'] == 'Electronic Check') * 0.15 +
    (df['paperless_billing'] == 1) * 0.1 +
    (df['senior_citizen'] == 1) * 0.1 +
    
    # Non-linear effects (quadratic)
    ((df['age'] - 40) ** 2 / 1000) * 0.1 +
    
    # Interaction effects
    (df['internet_service'] == 'Fiber') * (df['tech_support'] == 0) * 0.2 +
    (df['account_length'] < 1) * (df['monthly_charges'] > 80) * 0.3 +
    (df['service_count'] == 0) * 0.15 +
    
    # Random noise
    np.random.normal(0, 0.05, n_customers)
)

# Keep probabilities in a valid range (capped at 0.8 so no customer is certain to churn)
churn_probability = np.clip(churn_probability, 0, 0.8)

# Generate binary churn outcome
df['churn'] = np.random.binomial(1, churn_probability)

print(f"Churn rate: {df['churn'].mean():.2%}")
print(f"Customers who churned: {df['churn'].sum()}")
print(f"Customers who stayed: {len(df) - df['churn'].sum()}")

# Display churn by key features
print("\nChurn rates by key features:")
print(f"Contract length: {df.groupby('contract_length')['churn'].mean().round(3)}")
print(f"Payment method: {df.groupby('payment_method')['churn'].mean().round(3)}")
print(f"Internet service: {df.groupby('internet_service')['churn'].mean().round(3)}")

## Step 1: Data Preparation
Let's prepare our data for ensemble modeling.

In [None]:
# Define feature columns
numerical_features = ['age', 'income', 'account_length', 'monthly_charges', 'total_charges', 
                     'support_calls', 'charges_to_income_ratio', 'avg_monthly_charges', 'service_count']
categorical_features = ['contract_length', 'payment_method', 'internet_service']
binary_features = ['online_security', 'tech_support', 'streaming_tv', 'streaming_movies', 
                  'paperless_billing', 'senior_citizen']

# Combine all features
all_features = numerical_features + categorical_features + binary_features
X = df[all_features]
y = df['churn']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature types:")
print(f"  Numerical: {len(numerical_features)}")
print(f"  Categorical: {len(categorical_features)}")
print(f"  Binary: {len(binary_features)}")
print(f"\nClass distribution: {y.value_counts(normalize=True).round(3)}")

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_features),
        ('bin', 'passthrough', binary_features)
    ]
)

# Fit and transform the data
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print(f"\nProcessed training set shape: {X_train_processed.shape}")
print(f"Processed test set shape: {X_test_processed.shape}")

## Step 2: Individual Base Models
Let's first establish baseline performance with individual models.
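
One caveat about the cross-validation setup used throughout this lab: `X_train_processed` was scaled with statistics computed on the whole training set, so every cross-validation fold is scored with a scaler that has already seen its validation rows. For standardization the effect is usually small, but the cleaner pattern is to put preprocessing and model into a single `Pipeline` so the scaler is refit inside each fold. The sketch below illustrates this on synthetic stand-in data; the `demo_` names are hypothetical and are not used elsewhere in the lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: used only for illustration, not the churn data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Scaling and the classifier live in one estimator, so cross_val_score
# refits the scaler on the training part of every fold.
demo_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=demo_cv, scoring='roc_auc')
print(f"Pipeline CV AUC: {demo_scores.mean():.4f} (+/- {demo_scores.std() * 2:.4f})")
```

With the churn data, the same idea would mean wrapping `preprocessor` and each model in a `Pipeline` and passing the unprocessed `X_train` to `cross_val_score`.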

In [None]:
# Define base models
base_models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=10),
    'SVM': SVC(random_state=42, probability=True)
}

# Train and evaluate base models
base_results = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("Training and evaluating base models...")
for name, model in base_models.items():
    print(f"\nTraining {name}...")
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring='roc_auc')
    
    # Train on full training set
    model.fit(X_train_processed, y_train)
    
    # Predictions
    y_pred = model.predict(X_test_processed)
    y_pred_proba = model.predict_proba(X_test_processed)[:, 1]
    
    # Store results
    base_results[name] = {
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'test_accuracy': accuracy_score(y_test, y_pred),
        'test_precision': precision_score(y_test, y_pred),
        'test_recall': recall_score(y_test, y_pred),
        'test_f1': f1_score(y_test, y_pred),
        'test_auc': roc_auc_score(y_test, y_pred_proba),
        'predictions': y_pred,
        'probabilities': y_pred_proba
    }
    
    print(f"  CV AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    print(f"  Test AUC: {base_results[name]['test_auc']:.4f}")

# Create results DataFrame
results_df = pd.DataFrame({
    name: {
        'CV_AUC': results['cv_mean'],
        'Test_Accuracy': results['test_accuracy'],
        'Test_Precision': results['test_precision'],
        'Test_Recall': results['test_recall'],
        'Test_F1': results['test_f1'],
        'Test_AUC': results['test_auc']
    }
    for name, results in base_results.items()
}).T

print("\nBase Models Performance:")
print(results_df.round(4))

## Step 3: Bagging Methods
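Bagging (bootstrap aggregating) trains many copies of a model, each on a bootstrap sample of the training data (rows drawn with replacement), and then combines them by averaging their predictions or taking a majority vote.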
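
To make the mechanics concrete, here is a minimal hand-rolled sketch of bagging with decision trees. It uses synthetic data from `make_classification` rather than the churn dataset, and the variable names (`X_demo`, `bagged_proba`, and so on) are illustrative assumptions. `BaggingClassifier` and `RandomForestClassifier` do the same job with more options; Random Forest additionally samples a random subset of features at each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration (not the churn dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
n_models = 25
all_probas = []
for _ in range(n_models):
    # Bootstrap sample: draw len(X_tr) row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    all_probas.append(tree.predict_proba(X_te)[:, 1])

# Aggregate: average the per-tree probabilities, then threshold at 0.5
bagged_proba = np.mean(all_probas, axis=0)
bagged_pred = (bagged_proba >= 0.5).astype(int)

# Reference point: one tree trained on the full training set
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(f"Single tree accuracy:  {single_tree.score(X_te, y_te):.3f}")
print(f"Bagged trees accuracy: {(bagged_pred == y_te).mean():.3f}")
```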

### 3.1 Random Forest

In [None]:
# Train Random Forest with different configurations
rf_configs = {
    'RF_Basic': RandomForestClassifier(n_estimators=100, random_state=42),
    'RF_Tuned': RandomForestClassifier(
        n_estimators=200,
        max_depth=15,
        min_samples_split=10,
        min_samples_leaf=5,
        max_features='sqrt',
        random_state=42
    )
}

rf_results = {}

print("Training Random Forest models...")
for name, model in rf_configs.items():
    print(f"\nTraining {name}...")
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring='roc_auc')
    
    # Train on full training set
    model.fit(X_train_processed, y_train)
    
    # Predictions
    y_pred = model.predict(X_test_processed)
    y_pred_proba = model.predict_proba(X_test_processed)[:, 1]
    
    # Store results
    rf_results[name] = {
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'test_auc': roc_auc_score(y_test, y_pred_proba),
        'test_f1': f1_score(y_test, y_pred),
        'model': model
    }
    
    print(f"  CV AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    print(f"  Test AUC: {rf_results[name]['test_auc']:.4f}")
    print(f"  Test F1: {rf_results[name]['test_f1']:.4f}")

# Get the best Random Forest model (selected by CV AUC so the test set is used only for final evaluation)
best_rf_name = max(rf_results.keys(), key=lambda x: rf_results[x]['cv_mean'])
best_rf_model = rf_results[best_rf_name]['model']
print(f"\nBest Random Forest: {best_rf_name}")

In [None]:
# Analyze Random Forest feature importance
# Get feature names after preprocessing
num_feature_names = numerical_features
cat_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)
bin_feature_names = binary_features
all_feature_names = num_feature_names + list(cat_feature_names) + bin_feature_names

# Feature importance
feature_importance = pd.DataFrame({
    'feature': all_feature_names,
    'importance': best_rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 10 Most Important Features (Random Forest):")
print(feature_importance.head(10).round(4))

# Visualize feature importance
plt.figure(figsize=(12, 8))
top_features = feature_importance.head(15)
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Feature Importance')
plt.title('Top 15 Feature Importances in Random Forest')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### 3.2 Bagging with Different Base Models

In [None]:
# Try bagging with different base estimators
bagging_models = {
    'Bagging_DT': BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=10),  # `base_estimator` was renamed to `estimator` in scikit-learn 1.2
        n_estimators=100,
        random_state=42
    ),
    'Bagging_LR': BaggingClassifier(
        estimator=LogisticRegression(max_iter=1000),
        n_estimators=50,
        random_state=42
    )
}

bagging_results = {}

print("Training Bagging models...")
for name, model in bagging_models.items():
    print(f"\nTraining {name}...")
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring='roc_auc')
    
    # Train and evaluate
    model.fit(X_train_processed, y_train)
    y_pred_proba = model.predict_proba(X_test_processed)[:, 1]
    
    bagging_results[name] = {
        'cv_mean': cv_scores.mean(),
        'test_auc': roc_auc_score(y_test, y_pred_proba)
    }
    
    print(f"  CV AUC: {cv_scores.mean():.4f}")
    print(f"  Test AUC: {bagging_results[name]['test_auc']:.4f}")

## Step 4: Boosting Methods
Boosting trains models sequentially, with each model learning from the mistakes of the previous ones.
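
The sketch below shows that sequential idea in its simplest form: gradient boosting for regression with squared error, where each new tree is fit to the residuals of the ensemble so far. This is a deliberate simplification on synthetic data. The classifiers in this lab optimize a classification loss rather than squared error, and AdaBoost works differently again (it reweights misclassified samples). All names in the block are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # stage 0: a constant model
mse_per_stage = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residuals = y_demo - prediction  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)  # small corrective step
    mse_per_stage.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE after 0 stages:  {mse_per_stage[0]:.4f}")
print(f"Training MSE after 50 stages: {mse_per_stage[-1]:.4f}")
```

The `learning_rate` plays the same role as the parameter of that name in `GradientBoostingClassifier` and XGBoost: smaller steps need more trees but are less likely to overfit.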

### 4.1 Gradient Boosting

In [None]:
# Train Gradient Boosting models
gb_configs = {
    'GB_Basic': GradientBoostingClassifier(random_state=42),
    'GB_Tuned': GradientBoostingClassifier(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=6,
        min_samples_split=20,
        min_samples_leaf=10,
        subsample=0.8,
        random_state=42
    )
}

gb_results = {}

print("Training Gradient Boosting models...")
for name, model in gb_configs.items():
    print(f"\nTraining {name}...")
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring='roc_auc')
    
    # Train on full training set
    model.fit(X_train_processed, y_train)
    
    # Predictions
    y_pred = model.predict(X_test_processed)
    y_pred_proba = model.predict_proba(X_test_processed)[:, 1]
    
    # Store results
    gb_results[name] = {
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'test_auc': roc_auc_score(y_test, y_pred_proba),
        'test_f1': f1_score(y_test, y_pred),
        'model': model
    }
    
    print(f"  CV AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    print(f"  Test AUC: {gb_results[name]['test_auc']:.4f}")
    print(f"  Test F1: {gb_results[name]['test_f1']:.4f}")

# Get the best Gradient Boosting model (selected by CV AUC so the test set is used only for final evaluation)
best_gb_name = max(gb_results.keys(), key=lambda x: gb_results[x]['cv_mean'])
best_gb_model = gb_results[best_gb_name]['model']
print(f"\nBest Gradient Boosting: {best_gb_name}")

### 4.2 XGBoost

In [None]:
# Train XGBoost model
print("Training XGBoost model...")

xgb_model = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
    min_child_weight=1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='logloss'
)

# Cross-validation
cv_scores = cross_val_score(xgb_model, X_train_processed, y_train, cv=cv, scoring='roc_auc')

# Train on full training set
xgb_model.fit(X_train_processed, y_train)

# Predictions
y_pred_xgb = xgb_model.predict(X_test_processed)
y_pred_proba_xgb = xgb_model.predict_proba(X_test_processed)[:, 1]

xgb_results = {
    'cv_mean': cv_scores.mean(),
    'cv_std': cv_scores.std(),
    'test_auc': roc_auc_score(y_test, y_pred_proba_xgb),
    'test_f1': f1_score(y_test, y_pred_xgb)
}

print(f"XGBoost CV AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
print(f"XGBoost Test AUC: {xgb_results['test_auc']:.4f}")
print(f"XGBoost Test F1: {xgb_results['test_f1']:.4f}")

### 4.3 AdaBoost

In [None]:
# Train AdaBoost model
print("Training AdaBoost model...")

ada_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),  # `base_estimator` was renamed to `estimator` in scikit-learn 1.2
    n_estimators=100,
    learning_rate=1.0,
    random_state=42
)

# Cross-validation
cv_scores = cross_val_score(ada_model, X_train_processed, y_train, cv=cv, scoring='roc_auc')

# Train and evaluate
ada_model.fit(X_train_processed, y_train)
y_pred_ada = ada_model.predict(X_test_processed)
y_pred_proba_ada = ada_model.predict_proba(X_test_processed)[:, 1]

ada_results = {
    'cv_mean': cv_scores.mean(),
    'test_auc': roc_auc_score(y_test, y_pred_proba_ada),
    'test_f1': f1_score(y_test, y_pred_ada)
}

print(f"AdaBoost CV AUC: {cv_scores.mean():.4f}")
print(f"AdaBoost Test AUC: {ada_results['test_auc']:.4f}")
print(f"AdaBoost Test F1: {ada_results['test_f1']:.4f}")

## Step 5: Voting Ensembles
Voting ensembles combine predictions from multiple different algorithms.
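
To see how the two voting rules can disagree, here is a small sketch with made-up probabilities from three hypothetical models. Hard voting counts thresholded votes, while soft voting averages the probabilities before thresholding, so one very confident model can outweigh two lukewarm ones.

```python
import numpy as np

# Hypothetical predicted churn probabilities: one row per model, one column per customer
probas = np.array([
    [0.90, 0.40, 0.45, 0.20],  # model A
    [0.60, 0.45, 0.40, 0.30],  # model B
    [0.55, 0.95, 0.90, 0.10],  # model C
])

# Hard voting: each model casts a 0/1 vote at the 0.5 threshold; the majority wins
votes = (probas >= 0.5).astype(int)
hard_pred = (votes.sum(axis=0) > probas.shape[0] / 2).astype(int)

# Soft voting: average the probabilities first, then apply the threshold
mean_proba = probas.mean(axis=0)
soft_pred = (mean_proba >= 0.5).astype(int)

print("Hard voting:", hard_pred)
print("Soft voting:", soft_pred)
```

For the second and third customers, two models vote "no churn" with probabilities just under 0.5 while model C is highly confident, so hard voting predicts 0 and soft voting predicts 1.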

In [None]:
# Create voting ensembles
print("Creating Voting Ensembles...")

# Select best models from each category
voting_estimators = [
    ('lr', LogisticRegression(random_state=42, max_iter=1000)),
    ('rf', best_rf_model),
    ('gb', best_gb_model),
    ('xgb', xgb_model)
]

# Hard voting (majority vote)
hard_voting = VotingClassifier(
    estimators=voting_estimators,
    voting='hard'
)

# Soft voting (average probabilities)
soft_voting = VotingClassifier(
    estimators=voting_estimators,
    voting='soft'
)

voting_models = {
    'Hard_Voting': hard_voting,
    'Soft_Voting': soft_voting
}

voting_results = {}

for name, model in voting_models.items():
    print(f"\nTraining {name}...")
    
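    # Cross-validation. Hard voting predicts class labels only (no predict_proba),
    # so ROC AUC can only be cross-validated for the soft-voting ensemble.
    if name == 'Soft_Voting':
        cv_scores = cross_val_score(model, X_train_processed, y_train, cv=cv, scoring='roc_auc')
    else:
        cv_scores = np.array([np.nan])
    
    # Train and evaluate
    model.fit(X_train_processed, y_train)
    y_pred = model.predict(X_test_processed)
    
    if name == 'Soft_Voting':
        y_pred_proba = model.predict_proba(X_test_processed)[:, 1]
    else:
        # Hard voting has no probabilities of its own, so average those of its
        # fitted members (model.estimators_). VotingClassifier fits clones, so
        # the objects in voting_estimators are not fitted by model.fit.
        # Note: this average equals the soft-voting probability, so the test AUC matches.
        individual_probas = [
            estimator.predict_proba(X_test_processed)[:, 1]
            for estimator in model.estimators_
        ]
        y_pred_proba = np.mean(individual_probas, axis=0)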
    
    voting_results[name] = {
        'cv_mean': cv_scores.mean(),
        'test_auc': roc_auc_score(y_test, y_pred_proba),
        'test_f1': f1_score(y_test, y_pred)
    }
    
    print(f"  CV AUC: {cv_scores.mean():.4f}")
    print(f"  Test AUC: {voting_results[name]['test_auc']:.4f}")
    print(f"  Test F1: {voting_results[name]['test_f1']:.4f}")

## Step 6: Model Comparison and Analysis
Let's compare all our ensemble methods with the base models.

In [None]:
# Compile all results
all_results = {}

# Base models
for name, results in base_results.items():
    all_results[name] = {
        'Type': 'Base Model',
        'CV_AUC': results['cv_mean'],
        'Test_AUC': results['test_auc'],
        'Test_F1': results['test_f1']
    }

# Random Forest
for name, results in rf_results.items():
    all_results[name] = {
        'Type': 'Bagging',
        'CV_AUC': results['cv_mean'],
        'Test_AUC': results['test_auc'],
        'Test_F1': results['test_f1']
    }

# Gradient Boosting
for name, results in gb_results.items():
    all_results[name] = {
        'Type': 'Boosting',
        'CV_AUC': results['cv_mean'],
        'Test_AUC': results['test_auc'],
        'Test_F1': results['test_f1']
    }

# XGBoost
all_results['XGBoost'] = {
    'Type': 'Boosting',
    'CV_AUC': xgb_results['cv_mean'],
    'Test_AUC': xgb_results['test_auc'],
    'Test_F1': xgb_results['test_f1']
}

# AdaBoost
all_results['AdaBoost'] = {
    'Type': 'Boosting',
    'CV_AUC': ada_results['cv_mean'],
    'Test_AUC': ada_results['test_auc'],
    'Test_F1': ada_results['test_f1']
}

# Voting
for name, results in voting_results.items():
    all_results[name] = {
        'Type': 'Voting',
        'CV_AUC': results['cv_mean'],
        'Test_AUC': results['test_auc'],
        'Test_F1': results['test_f1']
    }

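# Create comprehensive results DataFrame
# from_dict(orient='index') keeps the metric columns numeric; DataFrame(...).T
# would turn every column into object dtype because 'Type' holds strings.
final_results_df = pd.DataFrame.from_dict(all_results, orient='index')
final_results_df = final_results_df.sort_values('Test_AUC', ascending=False)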

print("=== COMPREHENSIVE MODEL COMPARISON ===")
print(final_results_df.round(4))

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# AUC comparison
colors = {'Base Model': 'lightblue', 'Bagging': 'lightgreen', 
          'Boosting': 'lightcoral', 'Voting': 'lightyellow'}
model_colors = [colors[model_type] for model_type in final_results_df['Type']]

axes[0].barh(range(len(final_results_df)), final_results_df['Test_AUC'], color=model_colors)
axes[0].set_yticks(range(len(final_results_df)))
axes[0].set_yticklabels(final_results_df.index)
axes[0].set_xlabel('Test AUC Score')
axes[0].set_title('Model Performance Comparison (AUC)')
axes[0].grid(True, alpha=0.3)

# Add value labels
for i, v in enumerate(final_results_df['Test_AUC']):
    axes[0].text(v + 0.005, i, f'{v:.3f}', va='center')

# F1 comparison
axes[1].barh(range(len(final_results_df)), final_results_df['Test_F1'], color=model_colors)
axes[1].set_yticks(range(len(final_results_df)))
axes[1].set_yticklabels(final_results_df.index)
axes[1].set_xlabel('Test F1 Score')
axes[1].set_title('Model Performance Comparison (F1)')
axes[1].grid(True, alpha=0.3)

# Add value labels
for i, v in enumerate(final_results_df['Test_F1']):
    axes[1].text(v + 0.005, i, f'{v:.3f}', va='center')

# Create legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=color, label=model_type) 
                  for model_type, color in colors.items()]
fig.legend(handles=legend_elements, loc='upper center', bbox_to_anchor=(0.5, 0.02), ncol=4)

plt.tight_layout()
plt.show()

In [None]:
# Performance improvement analysis
best_base_model = final_results_df[final_results_df['Type'] == 'Base Model']['Test_AUC'].max()
best_ensemble_model = final_results_df[final_results_df['Type'] != 'Base Model']['Test_AUC'].max()
best_overall_model = final_results_df.iloc[0]

improvement = best_ensemble_model - best_base_model
improvement_pct = (improvement / best_base_model) * 100

print("=== ENSEMBLE METHODS ANALYSIS ===")
print(f"\n📊 Performance Summary:")
print(f"   • Best base model AUC: {best_base_model:.4f}")
print(f"   • Best ensemble model AUC: {best_ensemble_model:.4f}")
print(f"   • Improvement: {improvement:.4f} ({improvement_pct:.2f}%)")
print(f"   • Best overall model: {best_overall_model.name} (AUC: {best_overall_model['Test_AUC']:.4f})")

print(f"\n🎯 Key Insights:")
print(f"   • Ensemble methods {'improved' if improvement > 0 else 'did not improve'} upon base models")

# Analyze by ensemble type
ensemble_performance = final_results_df[final_results_df['Type'] != 'Base Model'].groupby('Type')['Test_AUC'].agg(['mean', 'max', 'count'])
print(f"\n📈 Ensemble Type Performance:")
for ensemble_type in ensemble_performance.index:
    stats = ensemble_performance.loc[ensemble_type]
    print(f"   • {ensemble_type}: Avg AUC = {stats['mean']:.4f}, Best AUC = {stats['max']:.4f}, Models = {stats['count']}")

# Best model recommendations
print(f"\n💡 Recommendations:")
top_3_models = final_results_df.head(3)
print(f"   • Top 3 models for deployment:")
for i, (model_name, model_data) in enumerate(top_3_models.iterrows(), 1):
    print(f"     {i}. {model_name} ({model_data['Type']}) - AUC: {model_data['Test_AUC']:.4f}")

print(f"   • Consider ensemble diversity and computational cost")
print(f"   • Soft voting often performs better than hard voting")
print(f"   • Boosting methods (XGBoost, GB) often excel on tabular data")

## Step 7: Hyperparameter Tuning
Let's optimize the best performing model.

In [None]:
# Hyperparameter tuning for the best model (assuming it's XGBoost or Random Forest)
best_model_name = final_results_df.index[0]
print(f"Performing hyperparameter tuning for: {best_model_name}")

if 'XGBoost' in best_model_name or 'GB' in best_model_name:
    # Tune XGBoost/Gradient Boosting
    param_grid = {
        'n_estimators': [100, 200],
        'learning_rate': [0.05, 0.1, 0.2],
        'max_depth': [4, 6, 8],
        'subsample': [0.8, 0.9]
    }
    
    if 'XGBoost' in best_model_name:
        base_model = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
    else:
        base_model = GradientBoostingClassifier(random_state=42)
        
elif 'RF' in best_model_name:
    # Tune Random Forest
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 15, 20],
        'min_samples_split': [5, 10, 20],
        'min_samples_leaf': [2, 5, 10]
    }
    base_model = RandomForestClassifier(random_state=42)
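else:
    # No grid is defined for this model type (e.g. voting or AdaBoost), so a
    # Random Forest is tuned instead; the comparison at the end of this cell
    # is then against a different model family.
    print("No parameter grid for this model; tuning a Random Forest instead.")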
    param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [10, 15],
        'min_samples_split': [5, 10]
    }
    base_model = RandomForestClassifier(random_state=42)

# Perform grid search
print("Performing Grid Search (this may take a few minutes...)")
grid_search = GridSearchCV(
    base_model,
    param_grid,
    cv=3,  # Reduced for speed
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train_processed, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

# Evaluate tuned model
tuned_model = grid_search.best_estimator_
y_pred_tuned = tuned_model.predict(X_test_processed)
y_pred_proba_tuned = tuned_model.predict_proba(X_test_processed)[:, 1]

tuned_auc = roc_auc_score(y_test, y_pred_proba_tuned)
tuned_f1 = f1_score(y_test, y_pred_tuned)

print(f"\nTuned model performance:")
print(f"  Test AUC: {tuned_auc:.4f}")
print(f"  Test F1: {tuned_f1:.4f}")

# Compare with original best model
original_auc = final_results_df.iloc[0]['Test_AUC']
improvement = tuned_auc - original_auc
print(f"\nImprovement from tuning: {improvement:.4f} ({improvement/original_auc*100:.2f}%)")

## Challenge: Your Turn to Practice!
Now it's your turn to experiment with ensemble methods.

### Challenge 1: Create a Custom Ensemble
Create a weighted voting ensemble where you assign different weights to different models based on their individual performance.

In [None]:
# Your code here for Challenge 1
# Hint: You can manually combine predictions using weights based on individual model AUC scores


### Challenge 2: Feature Importance Ensemble
Compare feature importances across Random Forest, Gradient Boosting, and XGBoost. Which features are consistently important?

In [None]:
# Your code here for Challenge 2
# Hint: Extract feature_importances_ from each model and create a comparison DataFrame


### Challenge 3: Stacking Ensemble
Create a simple stacking ensemble where you use the predictions of multiple models as features for a final meta-model.
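
As a reference point for checking a hand-built solution, scikit-learn also ships `StackingClassifier`, which generates the out-of-fold predictions internally. The sketch below runs it on synthetic data with an arbitrary choice of base models; the data, the model choices, and the variable names are assumptions for illustration, not part of the lab's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('dt', DecisionTreeClassifier(max_depth=5, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # the meta-model
    cv=5,                          # folds used to build out-of-fold predictions
    stack_method='predict_proba',  # feed base-model probabilities to the meta-model
)
stack.fit(X_tr, y_tr)

stack_auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(f"Stacking test AUC: {stack_auc:.4f}")
```

The out-of-fold step matters: if the meta-model were trained on predictions the base models made for their own training rows, it would learn to trust them more than it should.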

In [None]:
# Your code here for Challenge 3
# Hint: Use cross-validation to generate out-of-fold predictions, then train a meta-model


## Summary

Congratulations! You've mastered ensemble methods and advanced modeling techniques. Here's what you've learned:

### ✅ Key Skills Mastered:
1. **Bagging Methods**: Random Forest and general bagging with different base estimators
2. **Boosting Methods**: Gradient Boosting, XGBoost, and AdaBoost
3. **Voting Ensembles**: Hard voting (majority) and soft voting (probability averaging)
4. **Model Comparison**: Systematic evaluation of multiple ensemble approaches
5. **Hyperparameter Tuning**: Grid search for optimal model parameters
6. **Feature Importance**: Understanding which features drive ensemble predictions

### 🔍 Key Concepts Learned:
- **Bias-Variance Tradeoff**: Ensembles reduce variance (bagging) or bias (boosting)
- **Model Diversity**: Different algorithms capture different patterns in data
- **Overfitting Reduction**: Averaging multiple models reduces overfitting
- **Computational Cost**: Ensembles trade computational resources for better performance
- **Interpretability**: Ensemble models are less interpretable than individual models
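
The link between model diversity and variance reduction can be made precise under a simplifying assumption: suppose the ensemble averages $n$ models whose predictions each have variance $\sigma^2$ and pairwise correlation $\rho$ (real models only approximately satisfy this). Then the variance of the averaged prediction is

$$
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} f_i(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2 .
$$

The second term shrinks as more models are added, but the first does not, so adding models helps only as far as their errors are not strongly correlated. This is one way to read why Random Forest decorrelates its trees with `max_features` and why voting ensembles tend to benefit from mixing different algorithms.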

### 🚀 Next Steps:
In the next lab, we'll explore unsupervised learning techniques:
- K-Means and hierarchical clustering
- Principal Component Analysis (PCA)
- Customer segmentation applications
- Dimensionality reduction techniques

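### 📊 Performance Insights:
- Ensemble methods often improve on the best single model by a few percent (2-5% is a common rule of thumb, not a guarantee); check this against your own results from Step 6
- XGBoost and Random Forest are often top performers on tabular data
- Soft voting often outperforms hard voting, particularly when the member models produce well-calibrated probabilities
- Hyperparameter tuning can add a further 1-3%, depending on how well the default settings already fit the data
- Model diversity is key to ensemble success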

### 📚 Additional Resources:
- [Ensemble Methods in Scikit-learn](https://scikit-learn.org/stable/modules/ensemble.html)
- [XGBoost Documentation](https://xgboost.readthedocs.io/)
- [Random Forest Explained](https://towardsdatascience.com/understanding-random-forest-58381e0602d2)
- [Gradient Boosting Guide](https://towardsdatascience.com/gradient-boosting-classification-explained-through-python-60cc980eeb3d)