# ML Practice Questions Part 7: Ensemble Methods and Boosting

This notebook covers ensemble learning methods, with emphasis on bagging, boosting, and stacking techniques. Each question includes theoretical foundations, algorithmic implementations, and practical considerations for building robust ensemble models.

**Topics Covered:**
- Random Forest and bagging methods
- AdaBoost and gradient boosting algorithms
- XGBoost and advanced boosting techniques
- Stacking and blending ensemble strategies
- Ensemble diversity and variance reduction

**Format:** Each question includes theory, implementation, and empirical analysis sections.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification, make_regression, load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier, VotingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, mean_squared_error, classification_report
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')
np.random.seed(42)

## Question 1: Random Forest and Bagging Implementation

**Question:** Implement Random Forest from scratch and analyze how bootstrap sampling and feature randomness contribute to ensemble performance. Compare with bagging and single decision trees.

### Theory

**Bagging (Bootstrap Aggregating):**
1. Create $B$ bootstrap samples from training data
2. Train model on each bootstrap sample
3. Aggregate predictions: majority vote (classification) or average (regression)

**Random Forest Extensions:**
- **Feature randomness**: At each split, consider only $\sqrt{p}$ features (classification) or $p/3$ (regression)
- **Extra randomness**: Random thresholds for splits (Extra Trees)

**Variance Reduction:**
For $B$ identically distributed models, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is:
$$\text{Var}(\text{ensemble}) = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2$$
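
To see what the formula implies, the sketch below evaluates it for a few ensemble sizes. The function name `ensemble_variance` and the values $\sigma^2 = 1$, $\rho = 0.3$ are illustrative assumptions, not results from this notebook.

```python
def ensemble_variance(sigma2, rho, B):
    """Variance of the average of B identically distributed models
    with variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) / B * sigma2

# Illustrative values (assumed): sigma^2 = 1.0, rho = 0.3
for B in (1, 10, 100, 1000):
    print(f"B={B:5d}  Var={ensemble_variance(1.0, 0.3, B):.4f}")
```

With these values the variance falls from 1.0 at $B=1$ toward the floor $\rho\sigma^2 = 0.3$. Adding trees cannot go below that floor; only lowering $\rho$ can, which is what feature randomness aims for.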

**Out-of-Bag (OOB) Error:**
- Each bootstrap sample excludes about 37% ($\approx 1/e$) of the original samples on average
- Use excluded samples for unbiased error estimation
- $\text{OOB Error} = \frac{1}{n} \sum_{i=1}^n I(y_i \neq \hat{y}_i^{\text{OOB}})$
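
The 37% figure comes from the probability $(1 - 1/n)^n \to 1/e$ that a given sample is never drawn. The following sketch checks this numerically; the sample size, number of repetitions, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
n = 1000

# Probability that a given sample is never drawn in n draws with replacement
theoretical = (1 - 1 / n) ** n

# Empirical fraction of samples left out, averaged over 200 bootstrap draws
fractions = []
for _ in range(200):
    drawn = rng.integers(0, n, size=n)
    fractions.append(1 - np.unique(drawn).size / n)
empirical = float(np.mean(fractions))

print(f"theoretical: {theoretical:.4f}, empirical: {empirical:.4f}, 1/e: {np.exp(-1):.4f}")
```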

**Feature Importance:**
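$$\text{Importance}_j = \frac{1}{B} \sum_{b=1}^B \sum_{t \in T_b} p_t \Delta I_t \cdot I(\text{split on feature } j)$$
where $T_b$ is the set of internal nodes of tree $b$, $p_t$ is the fraction of training samples reaching node $t$, and $\Delta I_t$ is the impurity decrease produced by the split at $t$.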
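
For a single tree, the inner sum can be reproduced from the arrays that scikit-learn exposes on a fitted tree's `tree_` attribute. This is a sketch on a small synthetic dataset (the dataset and tree depth are assumptions), comparing the manual sum with `feature_importances_` after normalizing it to sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
t = clf.tree_

manual = np.zeros(X.shape[1])
n_root = t.weighted_n_node_samples[0]
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf: no split, no contribution
        continue
    # p_t * Delta I_t: weighted impurity decrease at this node
    decrease = (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    ) / n_root
    manual[t.feature[node]] += decrease

manual /= manual.sum()  # scikit-learn normalizes importances to sum to 1
print(np.allclose(manual, clf.feature_importances_))
```

A forest's importance is then the average of these per-tree vectors, which is what the custom class in this notebook computes with `np.mean(..., axis=0)`.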

In [None]:
from sklearn.tree import DecisionTreeClassifier

class RandomForestCustom:
    """Random Forest implementation from scratch."""
    
    def __init__(self, n_estimators=100, max_features='sqrt', max_depth=None, 
                 min_samples_split=2, min_samples_leaf=1, bootstrap=True, 
                 oob_score=False, random_state=None, task='classification'):
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.bootstrap = bootstrap
        self.oob_score = oob_score
        self.random_state = random_state
        self.task = task
        
        self.estimators_ = []
        self.feature_importances_ = None
        self.oob_score_ = None
        self.oob_predictions_ = None
        
    def _get_max_features(self, n_features):
        """Calculate number of features to consider at each split (at least 1)."""
        if self.max_features == 'sqrt':
            k = int(np.sqrt(n_features))
        elif self.max_features == 'log2':
            k = int(np.log2(n_features))
        elif isinstance(self.max_features, int):
            k = min(self.max_features, n_features)
        elif isinstance(self.max_features, float):
            k = int(self.max_features * n_features)
        else:
            k = n_features
        # Guard against 0 features (e.g. log2 of 1, or a small float fraction)
        return max(1, k)
    
    def _bootstrap_sample(self, X, y):
        """Create bootstrap sample."""
        n_samples = X.shape[0]
        if self.bootstrap:
            indices = np.random.choice(n_samples, n_samples, replace=True)
        else:
            indices = np.arange(n_samples)
        
        return X[indices], y[indices], indices
    
    def _calculate_oob_predictions(self, X, y, bootstrap_indices):
        """Calculate out-of-bag predictions."""
        if not self.oob_score:
            return
        
        n_samples = X.shape[0]
        oob_predictions = np.full((n_samples, len(self.estimators_)), np.nan)
        
        for i, (estimator, indices) in enumerate(zip(self.estimators_, bootstrap_indices)):
            # Find out-of-bag samples
            oob_indices = np.setdiff1d(np.arange(n_samples), indices)
            
            if len(oob_indices) > 0:
                # Both tree types expose the same predict() interface
                oob_pred = estimator.predict(X[oob_indices])
                
                oob_predictions[oob_indices, i] = oob_pred
        
        # Aggregate OOB predictions
        final_oob_predictions = []
        for i in range(n_samples):
            valid_predictions = oob_predictions[i, ~np.isnan(oob_predictions[i])]
            if len(valid_predictions) > 0:
                if self.task == 'classification':
                    # Majority vote
                    final_oob_predictions.append(Counter(valid_predictions).most_common(1)[0][0])
                else:
                    # Average
                    final_oob_predictions.append(np.mean(valid_predictions))
            else:
                final_oob_predictions.append(np.nan)
        
        self.oob_predictions_ = np.array(final_oob_predictions)
        
        # Calculate OOB score
        valid_mask = ~np.isnan(self.oob_predictions_)
        if np.sum(valid_mask) > 0:
            if self.task == 'classification':
                self.oob_score_ = accuracy_score(
                    y[valid_mask], 
                    self.oob_predictions_[valid_mask]
                )
            else:
                self.oob_score_ = -mean_squared_error(
                    y[valid_mask], 
                    self.oob_predictions_[valid_mask]
                )
    
    def fit(self, X, y):
        """Fit the Random Forest."""
        X = np.array(X)
        y = np.array(y)
        
        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        n_samples, n_features = X.shape
        max_features = self._get_max_features(n_features)
        
        self.estimators_ = []
        bootstrap_indices = []
        
        for i in range(self.n_estimators):
            # Create bootstrap sample
            X_bootstrap, y_bootstrap, indices = self._bootstrap_sample(X, y)
            bootstrap_indices.append(indices)
            
            # Create and fit decision tree
            if self.task == 'classification':
                estimator = DecisionTreeClassifier(
                    max_features=max_features,
                    max_depth=self.max_depth,
                    min_samples_split=self.min_samples_split,
                    min_samples_leaf=self.min_samples_leaf,
                    random_state=None  # Let each tree be different
                )
            else:
                estimator = DecisionTreeRegressor(
                    max_features=max_features,
                    max_depth=self.max_depth,
                    min_samples_split=self.min_samples_split,
                    min_samples_leaf=self.min_samples_leaf,
                    random_state=None
                )
            
            estimator.fit(X_bootstrap, y_bootstrap)
            self.estimators_.append(estimator)
        
        # Calculate feature importances
        self.feature_importances_ = np.mean(
            [estimator.feature_importances_ for estimator in self.estimators_], axis=0
        )
        
        # Calculate OOB score
        self._calculate_oob_predictions(X, y, bootstrap_indices)
        
        return self
    
    def predict(self, X):
        """Make predictions."""
        X = np.array(X)
        
        # Get predictions from all estimators
        predictions = np.array([estimator.predict(X) for estimator in self.estimators_])
        
        if self.task == 'classification':
            # Majority vote
            final_predictions = []
            for i in range(X.shape[0]):
                votes = predictions[:, i]
                final_predictions.append(Counter(votes).most_common(1)[0][0])
            return np.array(final_predictions)
        else:
            # Average
            return np.mean(predictions, axis=0)
    
    def predict_proba(self, X):
        """Predict class probabilities (classification only)."""
        if self.task != 'classification':
            raise ValueError("predict_proba only available for classification")
        
        X = np.array(X)
        
        # Get probability predictions from all estimators
        all_probabilities = [estimator.predict_proba(X) for estimator in self.estimators_]
        
        # Average probabilities
        return np.mean(all_probabilities, axis=0)

# Generate datasets
X_cls, y_cls = make_classification(n_samples=1000, n_features=20, n_informative=10, 
                                  n_redundant=5, n_clusters_per_class=1, random_state=42)
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(X_cls, y_cls, test_size=0.3, random_state=42)

# Compare different ensemble approaches
models = {
    'Single Tree': DecisionTreeClassifier(random_state=42),
    'Bagging': BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42),
    'Random Forest (sklearn)': RandomForestClassifier(n_estimators=100, random_state=42),
    'Random Forest (custom)': RandomForestCustom(n_estimators=100, oob_score=True, random_state=42),
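    # Feature subsampling without bootstrap. This is not true Extra Trees:
    # split thresholds are still optimized rather than drawn at random.
    'RF without bootstrap (custom)': RandomForestCustom(n_estimators=100, max_features='sqrt', bootstrap=False, random_state=42)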
}

results = {}
fitted_models = {}

for name, model in models.items():
    model.fit(X_train_cls, y_train_cls)
    y_pred = model.predict(X_test_cls)
    
    accuracy = accuracy_score(y_test_cls, y_pred)
    
    # Get OOB score if available (custom models store None when it was not computed)
    oob_score = getattr(model, 'oob_score_', None)
    if oob_score is None:
        oob_score = 'N/A'
    
    results[name] = {
        'test_accuracy': accuracy,
        'oob_score': oob_score
    }
    fitted_models[name] = model

results_df = pd.DataFrame(results).T
print("Ensemble Methods Comparison:")
print(results_df.round(4))

In [None]:
# Analyze ensemble diversity and variance reduction
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Performance comparison
model_names = list(results.keys())
test_accuracies = [results[name]['test_accuracy'] for name in model_names]
# Models without a numeric OOB score are plotted with a bar of height 0
oob_scores = [results[name]['oob_score'] if isinstance(results[name]['oob_score'], (int, float)) else 0 for name in model_names]

x_pos = np.arange(len(model_names))
width = 0.35

bars1 = axes[0, 0].bar(x_pos - width/2, test_accuracies, width, label='Test Accuracy', alpha=0.8)
bars2 = axes[0, 0].bar(x_pos + width/2, oob_scores, width, label='OOB Score', alpha=0.8)

axes[0, 0].set_xlabel('Models')
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].set_title('Model Performance Comparison')
axes[0, 0].set_xticks(x_pos)
axes[0, 0].set_xticklabels(model_names, rotation=45, ha='right')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Ensemble size vs performance
ensemble_sizes = range(1, 101, 10)
rf_performance = []
bagging_performance = []

for size in ensemble_sizes:
    # Random Forest
    rf = RandomForestCustom(n_estimators=size, random_state=42)
    rf.fit(X_train_cls, y_train_cls)
    rf_acc = accuracy_score(y_test_cls, rf.predict(X_test_cls))
    rf_performance.append(rf_acc)
    
    # Bagging
    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=size, random_state=42)
    bagging.fit(X_train_cls, y_train_cls)
    bagging_acc = accuracy_score(y_test_cls, bagging.predict(X_test_cls))
    bagging_performance.append(bagging_acc)

axes[0, 1].plot(ensemble_sizes, rf_performance, 'o-', label='Random Forest', linewidth=2, markersize=6)
axes[0, 1].plot(ensemble_sizes, bagging_performance, 's-', label='Bagging', linewidth=2, markersize=6)
axes[0, 1].set_xlabel('Number of Estimators')
axes[0, 1].set_ylabel('Test Accuracy')
axes[0, 1].set_title('Ensemble Size vs Performance')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Feature importance comparison
rf_model = fitted_models['Random Forest (custom)']
single_tree = fitted_models['Single Tree']

feature_names = [f'Feature_{i}' for i in range(X_cls.shape[1])]
x_features = np.arange(len(feature_names))
width = 0.35

axes[0, 2].bar(x_features - width/2, rf_model.feature_importances_, width, 
              label='Random Forest', alpha=0.8)
axes[0, 2].bar(x_features + width/2, single_tree.feature_importances_, width, 
              label='Single Tree', alpha=0.8)
axes[0, 2].set_xlabel('Features')
axes[0, 2].set_ylabel('Importance')
axes[0, 2].set_title('Feature Importance Comparison')
axes[0, 2].set_xticks(x_features)
axes[0, 2].set_xticklabels([f'F{i}' for i in range(len(feature_names))], rotation=45)
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# Variance analysis using bootstrap
n_bootstrap = 50
single_tree_vars = []
rf_vars = []

for _ in range(n_bootstrap):
    # Bootstrap sample
    bootstrap_idx = np.random.choice(len(X_train_cls), len(X_train_cls), replace=True)
    X_bootstrap = X_train_cls[bootstrap_idx]
    y_bootstrap = y_train_cls[bootstrap_idx]
    
    # Single tree
    tree = DecisionTreeClassifier(random_state=None)
    tree.fit(X_bootstrap, y_bootstrap)
    tree_pred = tree.predict(X_test_cls)
    single_tree_vars.append(accuracy_score(y_test_cls, tree_pred))
    
    # Random Forest
    rf = RandomForestCustom(n_estimators=10, random_state=None)
    rf.fit(X_bootstrap, y_bootstrap)
    rf_pred = rf.predict(X_test_cls)
    rf_vars.append(accuracy_score(y_test_cls, rf_pred))

# Plot variance comparison
models_var = ['Single Tree', 'Random Forest']
variances = [np.var(single_tree_vars), np.var(rf_vars)]
means = [np.mean(single_tree_vars), np.mean(rf_vars)]

axes[1, 0].bar(models_var, variances, alpha=0.7, color=['red', 'green'])
axes[1, 0].set_ylabel('Variance of Accuracy')
axes[1, 0].set_title('Model Variance Comparison')
for i, (model, var) in enumerate(zip(models_var, variances)):
    axes[1, 0].text(i, var + 0.0001, f'{var:.4f}', ha='center', va='bottom')
axes[1, 0].grid(True, alpha=0.3)

# Individual tree performance distribution
rf_custom = fitted_models['Random Forest (custom)']
individual_accuracies = []

for estimator in rf_custom.estimators_:
    pred = estimator.predict(X_test_cls)
    acc = accuracy_score(y_test_cls, pred)
    individual_accuracies.append(acc)

axes[1, 1].hist(individual_accuracies, bins=20, alpha=0.7, edgecolor='black')
axes[1, 1].axvline(np.mean(individual_accuracies), color='red', linestyle='--', 
                  linewidth=2, label=f'Mean: {np.mean(individual_accuracies):.3f}')
axes[1, 1].axvline(results['Random Forest (custom)']['test_accuracy'], color='green', 
                  linestyle='--', linewidth=2, label=f'Ensemble: {results["Random Forest (custom)"]["test_accuracy"]:.3f}')
axes[1, 1].set_xlabel('Individual Tree Accuracy')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Distribution of Individual Tree Performance')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# Feature randomness effect
max_features_options = [1, 2, 5, 10, 15, 20]  # Number of features to consider
feature_randomness_performance = []

for max_feat in max_features_options:
    rf_feat = RandomForestClassifier(n_estimators=50, max_features=max_feat, random_state=42)
    rf_feat.fit(X_train_cls, y_train_cls)
    acc = accuracy_score(y_test_cls, rf_feat.predict(X_test_cls))
    feature_randomness_performance.append(acc)

axes[1, 2].plot(max_features_options, feature_randomness_performance, 'o-', linewidth=2, markersize=8)
axes[1, 2].axvline(np.sqrt(X_cls.shape[1]), color='red', linestyle='--', 
                  alpha=0.7, label=f'√p = {int(np.sqrt(X_cls.shape[1]))}')
axes[1, 2].set_xlabel('Max Features per Split')
axes[1, 2].set_ylabel('Test Accuracy')
axes[1, 2].set_title('Effect of Feature Randomness')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print analysis
print(f"\nVariance Reduction Analysis:")
print(f"Single Tree Variance: {np.var(single_tree_vars):.6f}")
print(f"Random Forest Variance: {np.var(rf_vars):.6f}")
print(f"Variance Reduction: {(1 - np.var(rf_vars)/np.var(single_tree_vars))*100:.1f}%")

print(f"\nOOB Score Analysis:")
if rf_custom.oob_score_ is not None:
    print(f"Custom RF OOB Score: {rf_custom.oob_score_:.4f}")
    print(f"Custom RF Test Score: {results['Random Forest (custom)']['test_accuracy']:.4f}")
    print(f"OOB vs Test difference: {abs(rf_custom.oob_score_ - results['Random Forest (custom)']['test_accuracy']):.4f}")

print(f"\nIndividual Tree Performance:")
print(f"Mean individual accuracy: {np.mean(individual_accuracies):.4f}")
print(f"Ensemble accuracy: {results['Random Forest (custom)']['test_accuracy']:.4f}")
print(f"Ensemble improvement: {(results['Random Forest (custom)']['test_accuracy'] - np.mean(individual_accuracies))*100:.2f} percentage points")

## Question 2: AdaBoost and Gradient Boosting Implementation

**Question:** Implement AdaBoost and Gradient Boosting from scratch. Analyze how sequential learning and error correction contribute to ensemble performance compared to parallel ensemble methods.

### Theory

**AdaBoost Algorithm:**
1. Initialize weights: $w_i^{(1)} = \frac{1}{n}$ for all samples
2. For $t = 1, 2, \ldots, T$:
   - Train weak learner $h_t$ on weighted data
   - Calculate error: $\epsilon_t = \sum_{i=1}^n w_i^{(t)} I(h_t(x_i) \neq y_i)$
   - Calculate coefficient: $\alpha_t = \frac{1}{2} \ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
   - Update weights: $w_i^{(t+1)} = w_i^{(t)} \exp(-\alpha_t y_i h_t(x_i))$
   - Normalize weights
3. Final prediction: $H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$
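
One round of the update can be worked through by hand. The sketch below uses an assumed toy setup (10 samples, 3 of them misclassified) and shows a known property of the update: after reweighting, the misclassified samples carry exactly half of the total weight.

```python
import math

# Toy setup (assumed): 10 samples with uniform weights; the weak learner
# misclassifies samples 0, 1 and 2.
n = 10
weights = [1 / n] * n
correct = [False, False, False] + [True] * 7

# Weighted error and estimator coefficient
epsilon = sum(w for w, c in zip(weights, correct) if not c)
alpha = 0.5 * math.log((1 - epsilon) / epsilon)

# y_i * h_t(x_i) is +1 for a correct prediction and -1 for a mistake
updated = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
total = sum(updated)
updated = [w / total for w in updated]

misclassified_mass = sum(w for w, c in zip(updated, correct) if not c)
print(f"epsilon={epsilon:.2f}, alpha={alpha:.4f}, "
      f"misclassified mass after update={misclassified_mass:.2f}")
```

Because the previous learner is no better than chance under the new weights, the next weak learner is pushed toward a different decision rule.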

**Gradient Boosting:**
1. Initialize: $F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma)$
2. For $t = 1, 2, \ldots, T$:
   - Compute negative gradients: $r_{it} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{t-1}}$
   - Fit base learner $h_t$ to pseudo-residuals $r_{it}$
   - Find optimal step size: $\gamma_t = \arg\min_\gamma \sum_{i=1}^n L(y_i, F_{t-1}(x_i) + \gamma h_t(x_i))$
   - Update: $F_t(x) = F_{t-1}(x) + \gamma_t h_t(x)$
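
For binary log loss with $F$ interpreted as log-odds, the negative gradient has the closed form $y - \sigma(F)$. The sketch below checks that against a finite-difference derivative; the helper names are illustrative and not part of any library.

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def log_loss(y, f):
    """Binary log loss for label y in {0, 1} and raw score f (log-odds)."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def negative_gradient(y, f):
    """Closed-form pseudo-residual: -dL/dF = y - sigmoid(F)."""
    return y - sigmoid(f)

# Compare against a central finite difference at a few (y, F) pairs
eps = 1e-6
for y, f in [(1, -2.0), (1, 0.5), (0, 0.5), (0, 3.0)]:
    numeric = -(log_loss(y, f + eps) - log_loss(y, f - eps)) / (2 * eps)
    print(f"y={y}, F={f:+.1f}: closed form={negative_gradient(y, f):+.6f}, "
          f"numeric={numeric:+.6f}")
```

The pseudo-residual is large when the model is confidently wrong (for example $y=1$, $F=-2$) and shrinks toward zero as predictions improve, which is why later trees make progressively smaller corrections.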

**Key Differences:**
- **AdaBoost**: Reweights samples, focuses on misclassified examples
- **Gradient Boosting**: Fits to pseudo-residuals, general loss functions
- **Sequential vs Parallel**: Boosting is sequential (each model depends on previous), bagging is parallel

In [None]:
class AdaBoostCustom:
    """AdaBoost implementation from scratch."""
    
    def __init__(self, n_estimators=50, learning_rate=1.0, random_state=None):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.random_state = random_state
        
        self.estimators_ = []
        self.estimator_weights_ = []
        self.estimator_errors_ = []
        
    def fit(self, X, y):
        """Fit AdaBoost classifier."""
        X = np.array(X)
        y = np.array(y)
        
        # Convert labels to -1, 1 for AdaBoost
        y_ada = np.where(y == 0, -1, 1)
        
        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        n_samples = X.shape[0]
        
        # Initialize weights
        sample_weights = np.ones(n_samples) / n_samples
        
        self.estimators_ = []
        self.estimator_weights_ = []
        self.estimator_errors_ = []
        
        for t in range(self.n_estimators):
            # Train weak learner with weighted samples
            estimator = DecisionTreeClassifier(max_depth=1, random_state=None)  # Decision stump
            estimator.fit(X, y, sample_weight=sample_weights)
            
            # Make predictions
            y_pred = estimator.predict(X)
            y_pred_ada = np.where(y_pred == 0, -1, 1)
            
            # Calculate weighted error
            incorrect = y_pred_ada != y_ada
            error = np.sum(sample_weights * incorrect) / np.sum(sample_weights)
            
            # Avoid division by zero and ensure error < 0.5
            error = np.clip(error, 1e-10, 0.5 - 1e-10)
            
            # Calculate alpha (estimator weight)
            alpha = self.learning_rate * 0.5 * np.log((1 - error) / error)
            
            # Update sample weights
            sample_weights *= np.exp(-alpha * y_ada * y_pred_ada)
            sample_weights /= np.sum(sample_weights)  # Normalize
            
            # Store estimator and its weight
            self.estimators_.append(estimator)
            self.estimator_weights_.append(alpha)
            self.estimator_errors_.append(error)
            
            # Early stopping if perfect classification
            # (error was clipped to 1e-10 above, so compare against the clip floor)
            if error <= 1e-10:
                break
        
        return self
    
    def predict(self, X):
        """Make predictions using AdaBoost."""
        # Sign of the weighted vote, converted back to 0/1 labels
        decision_scores = self.decision_function(X)
        return np.where(decision_scores >= 0, 1, 0)
    
    def decision_function(self, X):
        """Get decision scores."""
        X = np.array(X)
        
        decision_scores = np.zeros(X.shape[0])
        
        for estimator, alpha in zip(self.estimators_, self.estimator_weights_):
            pred = estimator.predict(X)
            pred_ada = np.where(pred == 0, -1, 1)
            decision_scores += alpha * pred_ada
        
        return decision_scores

class GradientBoostingCustom:
    """Gradient Boosting implementation from scratch."""
    
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3, random_state=None):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.random_state = random_state
        
        self.estimators_ = []
        self.initial_prediction_ = None
        self.train_scores_ = []
        
    def _sigmoid(self, x):
        """Stable sigmoid function."""
        x = np.clip(x, -500, 500)
        return 1 / (1 + np.exp(-x))
    
    def _log_loss_gradient(self, y_true, y_pred_prob):
        """Compute gradient of log loss."""
        return y_pred_prob - y_true
    
    def fit(self, X, y):
        """Fit Gradient Boosting classifier."""
        X = np.array(X)
        y = np.array(y)
        
        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        # Initialize with log-odds
        positive_rate = np.mean(y)
        positive_rate = np.clip(positive_rate, 1e-10, 1 - 1e-10)  # Avoid log(0)
        self.initial_prediction_ = np.log(positive_rate / (1 - positive_rate))
        
        # Initialize predictions
        f_x = np.full(len(y), self.initial_prediction_)
        
        self.estimators_ = []
        self.train_scores_ = []
        
        for t in range(self.n_estimators):
            # Convert to probabilities
            y_pred_prob = self._sigmoid(f_x)
            
            # Compute negative gradients (residuals)
            residuals = -self._log_loss_gradient(y, y_pred_prob)
            
            # Fit regression tree to residuals
            estimator = DecisionTreeRegressor(
                max_depth=self.max_depth,
                random_state=None
            )
            estimator.fit(X, residuals)
            
            # Update predictions
            update = estimator.predict(X)
            f_x += self.learning_rate * update
            
            # Store estimator
            self.estimators_.append(estimator)
            
            # Calculate training score
            y_pred_prob_updated = self._sigmoid(f_x)
            train_accuracy = accuracy_score(y, (y_pred_prob_updated >= 0.5).astype(int))
            self.train_scores_.append(train_accuracy)
        
        return self
    
    def predict(self, X):
        """Make predictions using Gradient Boosting."""
        decision_scores = self.decision_function(X)
        probabilities = self._sigmoid(decision_scores)
        return (probabilities >= 0.5).astype(int)
    
    def predict_proba(self, X):
        """Predict class probabilities."""
        decision_scores = self.decision_function(X)
        probabilities = self._sigmoid(decision_scores)
        return np.column_stack([1 - probabilities, probabilities])
    
    def decision_function(self, X):
        """Get decision scores."""
        X = np.array(X)
        
        # Start with initial prediction
        decision_scores = np.full(X.shape[0], self.initial_prediction_)
        
        # Add contributions from all estimators
        for estimator in self.estimators_:
            decision_scores += self.learning_rate * estimator.predict(X)
        
        return decision_scores

# Generate datasets for boosting comparison
X_boost, y_boost = make_classification(n_samples=1000, n_features=10, n_informative=5, 
                                      n_redundant=2, n_clusters_per_class=1, 
                                      class_sep=0.8, random_state=42)
X_train_boost, X_test_boost, y_train_boost, y_test_boost = train_test_split(
    X_boost, y_boost, test_size=0.3, random_state=42)

# Compare boosting methods
boosting_models = {
    'Single Tree': DecisionTreeClassifier(max_depth=1, random_state=42),  # depth-1 stump, the same weak learner the custom AdaBoost uses
    'AdaBoost (sklearn)': AdaBoostClassifier(n_estimators=50, random_state=42),
    'AdaBoost (custom)': AdaBoostCustom(n_estimators=50, random_state=42),
    'Gradient Boosting (sklearn)': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting (custom)': GradientBoostingCustom(n_estimators=100, random_state=42)
}

boosting_results = {}
fitted_boosting_models = {}

for name, model in boosting_models.items():
    model.fit(X_train_boost, y_train_boost)
    y_pred = model.predict(X_test_boost)
    
    accuracy = accuracy_score(y_test_boost, y_pred)
    
    boosting_results[name] = {
        'test_accuracy': accuracy
    }
    fitted_boosting_models[name] = model

boosting_results_df = pd.DataFrame(boosting_results).T
print("Boosting Methods Comparison:")
print(boosting_results_df.round(4))

In [None]:
# Analyze boosting behavior and sequential learning
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Training curves for boosting methods
ada_custom = fitted_boosting_models['AdaBoost (custom)']
gb_custom = fitted_boosting_models['Gradient Boosting (custom)']

# AdaBoost error evolution
axes[0, 0].plot(range(1, len(ada_custom.estimator_errors_) + 1), 
               ada_custom.estimator_errors_, 'o-', linewidth=2, markersize=6)
axes[0, 0].set_xlabel('Boosting Round')
axes[0, 0].set_ylabel('Weighted Error')
axes[0, 0].set_title('AdaBoost: Weighted Error Evolution')
axes[0, 0].grid(True, alpha=0.3)

# AdaBoost alpha (estimator weights) evolution
axes[0, 1].bar(range(1, len(ada_custom.estimator_weights_) + 1), 
              ada_custom.estimator_weights_, alpha=0.7)
axes[0, 1].set_xlabel('Boosting Round')
axes[0, 1].set_ylabel('Estimator Weight (α)')
axes[0, 1].set_title('AdaBoost: Estimator Weights')
axes[0, 1].grid(True, alpha=0.3)

# Gradient Boosting training accuracy evolution
axes[0, 2].plot(range(1, len(gb_custom.train_scores_) + 1), 
               gb_custom.train_scores_, 'g-', linewidth=2)
axes[0, 2].set_xlabel('Boosting Round')
axes[0, 2].set_ylabel('Training Accuracy')
axes[0, 2].set_title('Gradient Boosting: Training Progress')
axes[0, 2].grid(True, alpha=0.3)

# Compare ensemble size effect for different methods
ensemble_sizes = range(5, 101, 10)
ada_performance = []
gb_performance = []
rf_performance_boost = []

for size in ensemble_sizes:
    # AdaBoost
    ada = AdaBoostCustom(n_estimators=size, random_state=42)
    ada.fit(X_train_boost, y_train_boost)
    ada_acc = accuracy_score(y_test_boost, ada.predict(X_test_boost))
    ada_performance.append(ada_acc)
    
    # Gradient Boosting
    gb = GradientBoostingCustom(n_estimators=size, random_state=42)
    gb.fit(X_train_boost, y_train_boost)
    gb_acc = accuracy_score(y_test_boost, gb.predict(X_test_boost))
    gb_performance.append(gb_acc)
    
    # Random Forest (for comparison)
    rf = RandomForestClassifier(n_estimators=size, random_state=42)
    rf.fit(X_train_boost, y_train_boost)
    rf_acc = accuracy_score(y_test_boost, rf.predict(X_test_boost))
    rf_performance_boost.append(rf_acc)

axes[1, 0].plot(ensemble_sizes, ada_performance, 'o-', label='AdaBoost', linewidth=2, markersize=6)
axes[1, 0].plot(ensemble_sizes, gb_performance, 's-', label='Gradient Boosting', linewidth=2, markersize=6)
axes[1, 0].plot(ensemble_sizes, rf_performance_boost, '^-', label='Random Forest', linewidth=2, markersize=6)
axes[1, 0].set_xlabel('Number of Estimators')
axes[1, 0].set_ylabel('Test Accuracy')
axes[1, 0].set_title('Ensemble Methods Comparison')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Learning rate effect on Gradient Boosting
learning_rates = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]
lr_performance = []

for lr in learning_rates:
    gb_lr = GradientBoostingCustom(n_estimators=100, learning_rate=lr, random_state=42)
    gb_lr.fit(X_train_boost, y_train_boost)
    acc = accuracy_score(y_test_boost, gb_lr.predict(X_test_boost))
    lr_performance.append(acc)

axes[1, 1].semilogx(learning_rates, lr_performance, 'o-', linewidth=2, markersize=8)
axes[1, 1].set_xlabel('Learning Rate')
axes[1, 1].set_ylabel('Test Accuracy')
axes[1, 1].set_title('Gradient Boosting: Learning Rate Effect')
axes[1, 1].grid(True, alpha=0.3)

# Decision boundary comparison (2D projection)
# Use first 2 features for visualization
X_2d_boost = X_train_boost[:, :2]
ada_2d = AdaBoostCustom(n_estimators=20, random_state=42)
gb_2d = GradientBoostingCustom(n_estimators=20, random_state=42)

ada_2d.fit(X_2d_boost, y_train_boost)
gb_2d.fit(X_2d_boost, y_train_boost)

# Create mesh
h = 0.02
x_min, x_max = X_2d_boost[:, 0].min() - 1, X_2d_boost[:, 0].max() + 1
y_min, y_max = X_2d_boost[:, 1].min() - 1, X_2d_boost[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

mesh_points = np.c_[xx.ravel(), yy.ravel()]
Z_ada = ada_2d.decision_function(mesh_points)
Z_ada = Z_ada.reshape(xx.shape)

axes[1, 2].contourf(xx, yy, Z_ada, alpha=0.8, cmap='RdYlBu')
scatter = axes[1, 2].scatter(X_2d_boost[:, 0], X_2d_boost[:, 1], c=y_train_boost, 
                            cmap='RdYlBu', edgecolors='black')
axes[1, 2].set_xlabel('Feature 1')
axes[1, 2].set_ylabel('Feature 2')
axes[1, 2].set_title('AdaBoost Decision Boundary')

plt.tight_layout()
plt.show()

# Summarize AdaBoost estimator errors and weights
print("\nAdaBoost Analysis:")
print(f"Number of weak learners used: {len(ada_custom.estimators_)}")
print(f"Weighted error of last weak learner: {ada_custom.estimator_errors_[-1]:.4f}")
print(f"Maximum estimator weight: {max(ada_custom.estimator_weights_):.4f}")
print(f"Total estimator weight: {sum(ada_custom.estimator_weights_):.4f}")

print("\nGradient Boosting Analysis:")
print(f"Initial prediction (log-odds): {gb_custom.initial_prediction_:.4f}")
print(f"Final training accuracy: {gb_custom.train_scores_[-1]:.4f}")
print(f"Accuracy improvement: {gb_custom.train_scores_[-1] - gb_custom.train_scores_[0]:.4f}")

# Compare sequential vs parallel learning
print("\nSequential vs Parallel Learning:")
print(f"AdaBoost (sequential): {boosting_results['AdaBoost (custom)']['test_accuracy']:.4f}")
print(f"Gradient Boosting (sequential): {boosting_results['Gradient Boosting (custom)']['test_accuracy']:.4f}")

# Train Random Forest on same data for comparison
rf_comparison = RandomForestClassifier(n_estimators=50, random_state=42)
rf_comparison.fit(X_train_boost, y_train_boost)
rf_acc = accuracy_score(y_test_boost, rf_comparison.predict(X_test_boost))
print(f"Random Forest (parallel): {rf_acc:.4f}")

## Question 3: Stacking and Advanced Ensemble Strategies

**Question:** Implement stacking (stacked generalization) and compare with voting ensembles. Analyze how meta-learning improves ensemble performance and discuss practical considerations for ensemble diversity.

### Theory

**Stacking (Stacked Generalization):**
1. **Level-0 Models**: Train base models on training data
2. **Meta-features**: Use cross-validation to generate out-of-fold predictions from the base models, so each meta-feature comes from a model that never saw that sample
3. **Meta-learner**: Train on meta-features to learn optimal combination

**Mathematical Framework:**
- Base models: $h_1(x), h_2(x), \ldots, h_K(x)$
- Meta-features: $\mathbf{z} = [h_1(x), h_2(x), \ldots, h_K(x)]^T$
- Meta-learner: $g(\mathbf{z}) = g(h_1(x), h_2(x), \ldots, h_K(x))$
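
To see why step 2 relies on cross-validation, the following sketch compares in-sample and out-of-fold predictions as candidate meta-features. It is an illustration under stated assumptions: the synthetic dataset, the label-noise level (`flip_y`), and the unpruned decision tree are chosen for the demo and are not taken from the experiments in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; flip_y adds label noise so memorization becomes visible.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=5,
                                     flip_y=0.1, random_state=0)

# In-sample meta-feature: an unpruned tree predicts the rows it was trained on.
tree = DecisionTreeClassifier(random_state=0)
in_sample_pred = tree.fit(X_demo, y_demo).predict(X_demo)

# Out-of-fold meta-feature: each row is predicted by a tree that never saw it.
oof_pred = cross_val_predict(DecisionTreeClassifier(random_state=0),
                             X_demo, y_demo, cv=5)

in_sample_acc = np.mean(in_sample_pred == y_demo)
oof_acc = np.mean(oof_pred == y_demo)
print(f"In-sample agreement with labels:   {in_sample_acc:.3f}")
print(f"Out-of-fold agreement with labels: {oof_acc:.3f}")
```

A meta-learner trained on the in-sample column would see a base model that looks far more reliable than it is on unseen data. The out-of-fold column reflects the accuracy the meta-learner can expect at prediction time.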

**Voting Ensembles:**
- **Hard Voting**: $\hat{y} = \text{mode}(h_1(x), h_2(x), \ldots, h_K(x))$
- **Soft Voting**: $\hat{y} = \arg\max_c \sum_{k=1}^K P_k(y=c|x)$
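
The two voting rules can disagree when one model is much more confident than the others. The sketch below uses hypothetical positive-class probabilities from three binary classifiers for a single sample; the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical P(y=1|x) from three binary classifiers for one sample.
p_pos = np.array([0.45, 0.45, 0.95])
proba = np.column_stack([1 - p_pos, p_pos])   # shape (K models, 2 classes)

# Hard voting: each model casts one vote for its most likely class.
votes = proba.argmax(axis=1)
hard_vote = np.bincount(votes, minlength=2).argmax()

# Soft voting: sum class probabilities across models, then take the argmax.
soft_vote = proba.sum(axis=0).argmax()

print("Per-model votes:", votes)
print("Hard vote:", hard_vote)
print("Soft vote:", soft_vote)
```

Two lukewarm votes for class 0 win the hard vote, while the single confident model tips the summed probabilities toward class 1. Soft voting therefore only helps when the base models' probabilities are reasonably well calibrated.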

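**Ensemble Diversity Measures** (for classifiers $h_i$ and $h_j$, $N^{ab}$ counts the samples where $a = 1$ if $h_i$ is correct and $b = 1$ if $h_j$ is correct, else $0$):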
- **Disagreement**: $\text{Dis}_{i,j} = \frac{N^{01} + N^{10}}{N^{00} + N^{01} + N^{10} + N^{11}}$
- **Q-statistic**: $Q_{i,j} = \frac{N^{11}N^{00} - N^{01}N^{10}}{N^{11}N^{00} + N^{01}N^{10}}$
- **Correlation coefficient**: $\rho_{i,j} = \frac{N^{11}N^{00} - N^{01}N^{10}}{\sqrt{(N^{11}+N^{10})(N^{01}+N^{00})(N^{11}+N^{01})(N^{10}+N^{00})}}$
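
As a small worked example of these three measures, the sketch below uses hypothetical correctness indicators for two classifiers on ten samples (1 = correct, 0 = wrong). Here `n11` counts samples both classifiers get right, `n00` samples both get wrong, and `n10` / `n01` samples only the first / only the second gets right. The data are invented for illustration.

```python
import numpy as np

# Hypothetical per-sample correctness of classifiers i and j.
correct_i = np.array([1, 1, 1, 1, 1, 1, 0, 0, 1, 0])
correct_j = np.array([1, 1, 1, 1, 0, 0, 1, 1, 1, 0])

n11 = int(np.sum((correct_i == 1) & (correct_j == 1)))  # both correct
n10 = int(np.sum((correct_i == 1) & (correct_j == 0)))  # only i correct
n01 = int(np.sum((correct_i == 0) & (correct_j == 1)))  # only j correct
n00 = int(np.sum((correct_i == 0) & (correct_j == 0)))  # both wrong
total = n11 + n10 + n01 + n00

disagreement = (n01 + n10) / total
q_stat = (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)
rho = (n11 * n00 - n01 * n10) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

print(f"N11={n11}, N10={n10}, N01={n01}, N00={n00}")
print(f"Disagreement: {disagreement:.3f}")
print(f"Q-statistic:  {q_stat:.3f}")
print(f"Correlation:  {rho:.3f}")
```

The counts are $N^{11}=5$, $N^{10}=2$, $N^{01}=2$, $N^{00}=1$, giving a disagreement of $0.4$, $Q = 1/9$, and $\rho = 1/21$. Higher disagreement indicates more diversity, whereas for $Q$ and $\rho$ values near zero (or negative) indicate more diverse errors.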

In [None]:
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

class StackingEnsemble:
    """Stacking ensemble implementation."""
    
    def __init__(self, base_models, meta_model, cv=5, use_probabilities=False):
        self.base_models = base_models
        self.meta_model = meta_model
        self.cv = cv
        self.use_probabilities = use_probabilities
        
        self.fitted_base_models_ = []
        self.fitted_meta_model_ = None
        
    def _generate_meta_features(self, X, y, use_fitted_models=False):
        """Generate meta-features using cross-validation or fitted models."""
        if use_fitted_models:
            # Use already fitted models (for prediction)
            meta_features = []
            for model in self.fitted_base_models_:
                if self.use_probabilities and hasattr(model, 'predict_proba'):
                    pred = model.predict_proba(X)
                    if pred.shape[1] == 2:  # Binary classification
                        pred = pred[:, 1:]  # Use only positive class probability
                else:
                    pred = model.predict(X).reshape(-1, 1)
                meta_features.append(pred)
        else:
            # Generate meta-features using cross-validation
            meta_features = []
            cv_splitter = StratifiedKFold(n_splits=self.cv, shuffle=True, random_state=42)
            
            for model in self.base_models:
                if self.use_probabilities and hasattr(model, 'predict_proba'):
                    # Use probability predictions
                    cv_pred = cross_val_predict(model, X, y, cv=cv_splitter, method='predict_proba')
                    if cv_pred.shape[1] == 2:  # Binary classification
                        cv_pred = cv_pred[:, 1:]  # Use only positive class probability
                else:
                    # Use class predictions
                    cv_pred = cross_val_predict(model, X, y, cv=cv_splitter).reshape(-1, 1)
                
                meta_features.append(cv_pred)
        
        return np.column_stack(meta_features)
    
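    def fit(self, X, y):
        """Fit the stacking ensemble."""
        from sklearn.base import clone  # imported here so the method is self-contained

        X = np.array(X)
        y = np.array(y)
        
        # Generate meta-features using cross-validation
        meta_X = self._generate_meta_features(X, y)
        
        # Fit a fresh copy of the meta-learner so the caller's estimator is not mutated
        self.fitted_meta_model_ = clone(self.meta_model)
        self.fitted_meta_model_.fit(meta_X, y)
        
        # Fit clones of the base models on the full training data; cloning keeps
        # estimators that are shared between ensembles from being refit in place
        self.fitted_base_models_ = []
        for model in self.base_models:
            fitted_model = clone(model)
            fitted_model.fit(X, y)
            self.fitted_base_models_.append(fitted_model)
        
        return self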
    
    def predict(self, X):
        """Make predictions using stacking."""
        # Generate meta-features using fitted base models
        meta_X = self._generate_meta_features(X, None, use_fitted_models=True)
        
        # Use meta-learner to make final prediction
        return self.fitted_meta_model_.predict(meta_X)
    
    def predict_proba(self, X):
        """Predict probabilities using stacking."""
        if not hasattr(self.fitted_meta_model_, 'predict_proba'):
            raise ValueError("Meta-model doesn't support probability prediction")
        
        meta_X = self._generate_meta_features(X, None, use_fitted_models=True)
        return self.fitted_meta_model_.predict_proba(meta_X)

def calculate_ensemble_diversity(predictions_matrix, y_true):
    """Calculate diversity measures for ensemble."""
    n_models = predictions_matrix.shape[1]
    
    diversity_measures = {
        'disagreement': [],
        'q_statistic': [],
        'correlation': []
    }
    
    for i in range(n_models):
        for j in range(i + 1, n_models):
            pred_i = predictions_matrix[:, i]
            pred_j = predictions_matrix[:, j]
            
            # Calculate confusion matrix between two classifiers
            n_11 = np.sum((pred_i == y_true) & (pred_j == y_true))
            n_10 = np.sum((pred_i == y_true) & (pred_j != y_true))
            n_01 = np.sum((pred_i != y_true) & (pred_j == y_true))
            n_00 = np.sum((pred_i != y_true) & (pred_j != y_true))
            
            total = n_11 + n_10 + n_01 + n_00
            
            # Disagreement measure
            disagreement = (n_01 + n_10) / total
            diversity_measures['disagreement'].append(disagreement)
            
            # Q-statistic
            if (n_11 * n_00 + n_01 * n_10) != 0:
                q_stat = (n_11 * n_00 - n_01 * n_10) / (n_11 * n_00 + n_01 * n_10)
            else:
                q_stat = 0
            diversity_measures['q_statistic'].append(q_stat)
            
            # Correlation coefficient
            denom = np.sqrt((n_11 + n_10) * (n_01 + n_00) * (n_11 + n_01) * (n_10 + n_00))
            if denom != 0:
                correlation = (n_11 * n_00 - n_01 * n_10) / denom
            else:
                correlation = 0
            diversity_measures['correlation'].append(correlation)
    
    # Return average diversity measures
    return {
        'avg_disagreement': np.mean(diversity_measures['disagreement']),
        'avg_q_statistic': np.mean(diversity_measures['q_statistic']),
        'avg_correlation': np.mean(diversity_measures['correlation'])
    }

# Generate dataset for stacking comparison
X_stack, y_stack = make_classification(n_samples=1500, n_features=15, n_informative=10, 
                                      n_redundant=3, n_clusters_per_class=2, 
                                      class_sep=0.9, random_state=42)
X_train_stack, X_test_stack, y_train_stack, y_test_stack = train_test_split(
    X_stack, y_stack, test_size=0.3, random_state=42)

# Define diverse base models
base_models = [
    DecisionTreeClassifier(max_depth=10, random_state=42),
    RandomForestClassifier(n_estimators=50, random_state=42),
    LogisticRegression(random_state=42, max_iter=1000),
    GaussianNB(),
    KNeighborsClassifier(n_neighbors=5)
]

# Compare different ensemble strategies
ensemble_strategies = {
    'Voting (Hard)': VotingClassifier(
        estimators=[(f'model_{i}', model) for i, model in enumerate(base_models)],
        voting='hard'
    ),
    'Voting (Soft)': VotingClassifier(
        estimators=[(f'model_{i}', model) for i, model in enumerate(base_models)],
        voting='soft'
    ),
    'Stacking (LR)': StackingEnsemble(
        base_models=base_models.copy(),
        meta_model=LogisticRegression(random_state=42, max_iter=1000),
        use_probabilities=True
    ),
    'Stacking (RF)': StackingEnsemble(
        base_models=base_models.copy(),
        meta_model=RandomForestClassifier(n_estimators=10, random_state=42),
        use_probabilities=True
    ),
    'Stacking (sklearn)': StackingClassifier(
        estimators=[(f'model_{i}', model) for i, model in enumerate(base_models)],
        final_estimator=LogisticRegression(random_state=42, max_iter=1000),
        cv=5
    )
}

stacking_results = {}
fitted_ensemble_models = {}

for name, ensemble in ensemble_strategies.items():
    ensemble.fit(X_train_stack, y_train_stack)
    y_pred = ensemble.predict(X_test_stack)
    
    accuracy = accuracy_score(y_test_stack, y_pred)
    
    stacking_results[name] = {
        'test_accuracy': accuracy
    }
    fitted_ensemble_models[name] = ensemble

# Add individual base model results (clone so the shared estimators are left untouched)
from sklearn.base import clone

for model in base_models:
    model_copy = clone(model)
    model_copy.fit(X_train_stack, y_train_stack)
    y_pred = model_copy.predict(X_test_stack)
    accuracy = accuracy_score(y_test_stack, y_pred)
    
    model_name = type(model).__name__
    stacking_results[f'Base: {model_name}'] = {
        'test_accuracy': accuracy
    }

stacking_results_df = pd.DataFrame(stacking_results).T
print("Ensemble Strategies Comparison:")
print(stacking_results_df.round(4))

In [None]:
# Analyze ensemble diversity and meta-learning
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Performance comparison of ensemble strategies
ensemble_names = list(stacking_results.keys())
ensemble_accuracies = [stacking_results[name]['test_accuracy'] for name in ensemble_names]

# Separate base models and ensemble methods
base_models_results = {k: v for k, v in stacking_results.items() if k.startswith('Base:')}
ensemble_methods_results = {k: v for k, v in stacking_results.items() if not k.startswith('Base:')}

base_names = list(base_models_results.keys())
base_accuracies = [base_models_results[name]['test_accuracy'] for name in base_names]

ensemble_method_names = list(ensemble_methods_results.keys())
ensemble_method_accuracies = [ensemble_methods_results[name]['test_accuracy'] for name in ensemble_method_names]

# Plot base models vs ensemble methods
x_base = np.arange(len(base_names))
x_ensemble = np.arange(len(ensemble_method_names))

axes[0, 0].bar(x_base, base_accuracies, alpha=0.7, label='Base Models', color='lightblue')
axes[0, 0].bar(x_ensemble + len(base_names) + 1, ensemble_method_accuracies, 
              alpha=0.7, label='Ensemble Methods', color='orange')

all_names = base_names + [''] + ensemble_method_names
all_positions = list(x_base) + [len(base_names)] + list(x_ensemble + len(base_names) + 1)

axes[0, 0].set_xticks(all_positions)
axes[0, 0].set_xticklabels([name.replace('Base: ', '').replace('LogisticRegression', 'LogReg').replace('Classifier', '')
                           for name in all_names], rotation=45, ha='right')
axes[0, 0].set_ylabel('Test Accuracy')
axes[0, 0].set_title('Base Models vs Ensemble Methods')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Calculate and visualize ensemble diversity
# Get predictions from all base models
base_predictions = []
fitted_base_models = []

from sklearn.base import clone  # fresh copies keep the shared estimators untouched

for model in base_models:
    model_copy = clone(model)
    model_copy.fit(X_train_stack, y_train_stack)
    pred = model_copy.predict(X_test_stack)
    base_predictions.append(pred)
    fitted_base_models.append(model_copy)

predictions_matrix = np.column_stack(base_predictions)
diversity_metrics = calculate_ensemble_diversity(predictions_matrix, y_test_stack)

# Diversity metrics visualization
metrics = ['Disagreement', 'Q-statistic', 'Correlation']
values = [diversity_metrics['avg_disagreement'], 
          diversity_metrics['avg_q_statistic'], 
          diversity_metrics['avg_correlation']]

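# Higher disagreement means more diversity, but lower Q-statistic and correlation do,
# so one threshold cannot grade all three bars; use fixed colors instead
colors = ['seagreen', 'darkorange', 'steelblue']
bars = axes[0, 1].bar(metrics, values, color=colors, alpha=0.7)
axes[0, 1].set_ylabel('Measure Value')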
axes[0, 1].set_title('Ensemble Diversity Measures')
axes[0, 1].axhline(y=0, color='black', linestyle='-', alpha=0.3)
for bar, val in zip(bars, values):
    axes[0, 1].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.01,
                   f'{val:.3f}', ha='center', va='bottom')
axes[0, 1].grid(True, alpha=0.3)

# Pairwise correlation matrix of base model predictions
model_names_short = [type(model).__name__.replace('Classifier', '').replace('LogisticRegression', 'LogReg')
                    for model in base_models]
correlation_matrix = np.corrcoef(predictions_matrix.T)

im = axes[0, 2].imshow(correlation_matrix, cmap='coolwarm', vmin=-1, vmax=1)
axes[0, 2].set_xticks(range(len(model_names_short)))
axes[0, 2].set_yticks(range(len(model_names_short)))
axes[0, 2].set_xticklabels(model_names_short, rotation=45)
axes[0, 2].set_yticklabels(model_names_short)
axes[0, 2].set_title('Base Model Prediction Correlations')

# Add correlation values to heatmap
for i in range(len(model_names_short)):
    for j in range(len(model_names_short)):
        text = axes[0, 2].text(j, i, f'{correlation_matrix[i, j]:.2f}',
                              ha="center", va="center", color="black", fontsize=8)

plt.colorbar(im, ax=axes[0, 2])

# Meta-learner analysis for stacking
stacking_lr = fitted_ensemble_models['Stacking (LR)']

# Get meta-features for analysis
meta_X_train = stacking_lr._generate_meta_features(X_train_stack, y_train_stack)
meta_X_test = stacking_lr._generate_meta_features(X_test_stack, None, use_fitted_models=True)

# Meta-learner feature importance (if available)
if hasattr(stacking_lr.fitted_meta_model_, 'coef_'):
    meta_importances = np.abs(stacking_lr.fitted_meta_model_.coef_[0])
    
    axes[1, 0].bar(model_names_short, meta_importances, alpha=0.7, color='purple')
    axes[1, 0].set_ylabel('Meta-learner Coefficient (abs)')
    axes[1, 0].set_title('Meta-learner Feature Importance')
    axes[1, 0].tick_params(axis='x', rotation=45)
    axes[1, 0].grid(True, alpha=0.3)

# Ensemble performance vs diversity trade-off
# Test different subsets of base models
from itertools import combinations

subset_performances = []
subset_diversities = []
subset_sizes = []

for size in range(2, len(base_models) + 1):
    for subset_indices in combinations(range(len(base_models)), size):
        # Create subset ensemble
        subset_models = [base_models[i] for i in subset_indices]
        subset_voting = VotingClassifier(
            estimators=[(f'model_{i}', model) for i, model in enumerate(subset_models)],
            voting='soft'
        )
        
        # Fit and evaluate
        subset_voting.fit(X_train_stack, y_train_stack)
        subset_acc = accuracy_score(y_test_stack, subset_voting.predict(X_test_stack))
        
        # Calculate diversity for subset
        subset_predictions = np.column_stack([base_predictions[i] for i in subset_indices])
        subset_diversity = calculate_ensemble_diversity(subset_predictions, y_test_stack)
        
        subset_performances.append(subset_acc)
        subset_diversities.append(subset_diversity['avg_disagreement'])
        subset_sizes.append(size)

# Plot diversity vs performance
scatter = axes[1, 1].scatter(subset_diversities, subset_performances, 
                            c=subset_sizes, cmap='viridis', alpha=0.7, s=60)
axes[1, 1].set_xlabel('Average Disagreement (Diversity)')
axes[1, 1].set_ylabel('Ensemble Accuracy')
axes[1, 1].set_title('Diversity vs Performance Trade-off')
plt.colorbar(scatter, ax=axes[1, 1], label='Ensemble Size')
axes[1, 1].grid(True, alpha=0.3)

# Learning curves for ensemble methods
train_sizes = np.linspace(0.1, 1.0, 10)
voting_scores = []
stacking_scores = []
best_base_scores = []

for train_size in train_sizes:
    n_samples = int(train_size * len(X_train_stack))
    X_subset = X_train_stack[:n_samples]
    y_subset = y_train_stack[:n_samples]
    
    # Voting ensemble
    voting = VotingClassifier(
        estimators=[(f'model_{i}', model) for i, model in enumerate(base_models)],
        voting='soft'
    )
    voting.fit(X_subset, y_subset)
    voting_acc = accuracy_score(y_test_stack, voting.predict(X_test_stack))
    voting_scores.append(voting_acc)
    
    # Stacking ensemble
    stacking = StackingEnsemble(
        base_models=base_models.copy(),
        meta_model=LogisticRegression(random_state=42, max_iter=1000),
        use_probabilities=True
    )
    stacking.fit(X_subset, y_subset)
    stacking_acc = accuracy_score(y_test_stack, stacking.predict(X_test_stack))
    stacking_scores.append(stacking_acc)
    
    # Best base model (each estimator is refit in place on the current subset)
    best_acc = 0
    for model in base_models:
        model.fit(X_subset, y_subset)
        acc = accuracy_score(y_test_stack, model.predict(X_test_stack))
        best_acc = max(best_acc, acc)
    best_base_scores.append(best_acc)

train_sample_counts = (train_sizes * len(X_train_stack)).astype(int)  # matches n_samples used in the loop
axes[1, 2].plot(train_sample_counts, voting_scores, 'o-', label='Voting Ensemble', linewidth=2)
axes[1, 2].plot(train_sample_counts, stacking_scores, 's-', label='Stacking Ensemble', linewidth=2)
axes[1, 2].plot(train_sample_counts, best_base_scores, '^-', label='Best Base Model', linewidth=2)
axes[1, 2].set_xlabel('Training Set Size')
axes[1, 2].set_ylabel('Test Accuracy')
axes[1, 2].set_title('Learning Curves: Ensemble Strategies')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print analysis results
print("\nEnsemble Diversity Analysis:")
print(f"Average disagreement: {diversity_metrics['avg_disagreement']:.4f}")
print(f"Average Q-statistic: {diversity_metrics['avg_q_statistic']:.4f}")
print(f"Average correlation: {diversity_metrics['avg_correlation']:.4f}")

print("\nBest Performing Methods:")
sorted_results = sorted(stacking_results.items(), key=lambda x: x[1]['test_accuracy'], reverse=True)
for i, (name, result) in enumerate(sorted_results[:5]):
    print(f"{i+1}. {name}: {result['test_accuracy']:.4f}")

print("\nEnsemble Improvement over Best Base Model:")
best_base_acc = max([result['test_accuracy'] for name, result in stacking_results.items() if name.startswith('Base:')])
best_ensemble_acc = max([result['test_accuracy'] for name, result in stacking_results.items() if not name.startswith('Base:')])
improvement = (best_ensemble_acc - best_base_acc) * 100
print(f"Best base model: {best_base_acc:.4f}")
print(f"Best ensemble: {best_ensemble_acc:.4f}")
print(f"Improvement: {improvement:.2f} percentage points")

## Summary and Key Takeaways

### Ensemble Methods Fundamentals:

1. **Random Forest and Bagging**:
   - **Bootstrap sampling** reduces variance by averaging over different data subsets
   - **Feature randomness** reduces correlation between trees, enhancing diversity
   - **OOB error** provides an approximately unbiased performance estimate without a separate validation set
   - Variance reduction relative to a single tree depends on how correlated the trees are; it is largest when tree errors are weakly correlated

2. **Boosting Methods**:
   - **AdaBoost**: Sequential learning with sample reweighting; focuses on hard examples
   - **Gradient Boosting**: Fits to pseudo-residuals; more general framework for different loss functions
   - **Sequential vs Parallel**: Boosting builds models sequentially (each depends on previous), bagging builds in parallel
   - Often achieves higher accuracy than bagging but is more prone to overfitting, especially with noisy labels or too many boosting rounds

3. **Stacking and Meta-Learning**:
   - **Stacking** learns optimal combination of base models using meta-learner
   - **Cross-validation** yields out-of-fold meta-features, so the meta-learner is not trained on predictions the base models made for their own training samples
   - **Diversity** is crucial: uncorrelated errors lead to better ensemble performance
   - Meta-learner can capture complex interactions between base model predictions
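
The variance-reduction and diversity points above can be made concrete under an idealized assumption that is not part of this notebook's experiments: $B$ predictors whose errors each have variance $\sigma^2$ and share a common pairwise correlation $\rho$. In that setting the variance of their average is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$, so correlation sets a floor that adding more models cannot remove. The sketch below checks the formula with simulated Gaussian errors; `averaged_variance` and all constants are illustrative choices.

```python
import numpy as np

def averaged_variance(rho, sigma2, B):
    """Theoretical variance of the mean of B equicorrelated predictors."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / B

rng = np.random.default_rng(0)
B, sigma2, n_trials = 50, 1.0, 20000
results = {}

for rho in (0.0, 0.3, 0.9):
    # Equicorrelated errors: one shared component plus independent components,
    # scaled so every column has variance sigma2 and pairwise correlation rho.
    shared = np.sqrt(rho * sigma2) * rng.normal(size=(n_trials, 1))
    indep = np.sqrt((1.0 - rho) * sigma2) * rng.normal(size=(n_trials, B))
    errors = shared + indep
    simulated = errors.mean(axis=1).var()
    results[rho] = (averaged_variance(rho, sigma2, B), simulated)
    print(f"rho={rho:.1f}  theory={results[rho][0]:.3f}  simulated={simulated:.3f}")
```

With uncorrelated errors the averaged variance shrinks like $\sigma^2 / B$, while at $\rho = 0.9$ it stays close to $0.9\,\sigma^2$ regardless of $B$. This is the motivation for feature randomness in Random Forest and for mixing different algorithms in stacking.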

### Practical Guidelines:

**Choosing Ensemble Methods:**
- **Random Forest**: Default choice for tabular data; robust with little tuning, and feature importances give some interpretability
- **Gradient Boosting**: When maximum accuracy is needed; requires careful tuning
- **Stacking**: When base models are very different; adds complexity but can improve performance
- **Voting**: Simple and effective when base models have similar performance

**Ensemble Diversity:**
- Use different algorithms (tree-based, linear, probabilistic, distance-based)
- Different hyperparameters for same algorithm
- Different feature subsets or data transformations
- As a rough heuristic, aim for a disagreement rate of about 10-30% between base models; the useful range depends on base-model accuracy

**Performance Considerations:**
- **Training time**: Typically Bagging (parallel) < Boosting (sequential) < Stacking (CV + meta-learning), for comparable base models
- **Prediction time**: All methods require predictions from all base models
- **Memory usage**: Stores all base models; can be significant for large ensembles
- **Interpretability**: Decreases with ensemble complexity

### Key Insights:
- Ensemble improvement diminishes with highly correlated base models
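- Random Forest accuracy typically plateaus somewhere around 50-200 trees; adding more rarely hurts accuracy but increases training and prediction cost
- The learning rate in boosting trades off against the number of estimators: smaller rates need more rounds but usually generalize better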
- Stacking works best with diverse, well-performing base models
- OOB error closely approximates test error for bagging methods
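
The last point can be checked on a small synthetic problem. The sketch below is an illustration only: the dataset, split, and forest size are assumptions rather than the notebook's experiments. It compares the out-of-bag accuracy reported by scikit-learn's `RandomForestClassifier` (with `oob_score=True`) against accuracy on a held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data and split.
X_oob, y_oob = make_classification(n_samples=2000, n_features=20,
                                   n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_oob, y_oob,
                                          test_size=0.3, random_state=0)

# oob_score=True scores each training row using only the trees whose
# bootstrap sample did not contain that row.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

oob_acc = rf.oob_score_
test_acc = rf.score(X_te, y_te)
print(f"OOB accuracy:  {oob_acc:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
```

The two numbers are usually close, which is why the OOB estimate is a convenient substitute for a validation split when data are scarce. The gap can widen with very few trees, since each row is then scored by only a handful of them.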