<img src="./images/banner.png" width="800">

# Bagging

In our previous lecture on ensemble methods, we learned that combining multiple models often performs better than individual models. This is similar to how a group of experts might make better decisions than a single expert. Today, we'll dive deep into one of the most fundamental ensemble techniques: Bagging.


Bagging, short for **B**ootstrap **Agg**regat**ing**, is an ensemble learning technique introduced by Leo Breiman in 1996.


💡 **Key Insight:** Bagging creates multiple versions of a model trained on different subsets of the data, then combines their predictions to create a more robust model.


Let's understand this with a simple analogy:

Imagine you're trying to estimate the average height of people in a city:
- Instead of measuring everyone (impossible!), you take multiple random samples
- Each sample gives you an estimate
- By averaging these estimates, you get a more reliable result
- The more samples you take, the more stable your estimate becomes

This is exactly how bagging works in machine learning!


<img src="./images/Bagging-classifier.png" width="800">

Bagging helps reduce a critical problem in machine learning: **variance**. 

Let's break this down:
1. Individual models (especially complex ones like deep trees) are often sensitive to their training data
2. Small changes in the training data can lead to large changes in the model
3. Bagging reduces this sensitivity by:
   - Training each model on a different subset of data
   - Averaging their predictions to get a more stable result


❗️ **Important Note:** Bagging is particularly effective for high-variance algorithms (like decision trees) but might not help much with high-bias algorithms.
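
The contrast between high-variance and high-bias base models can be tried out directly. The sketch below is only an illustration under assumptions that are not part of the lecture: it uses scikit-learn's synthetic `make_friedman1` data, a fully grown tree as the high-variance learner, a depth-1 tree as the high-bias learner, and 50 estimators. Exact scores depend on the data and the seed.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression data (an assumption made for this sketch)
X_demo, y_demo = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

base_models = {
    "deep tree (high variance)": DecisionTreeRegressor(random_state=0),
    "stump (high bias)": DecisionTreeRegressor(max_depth=1, random_state=0),
}

results = {}
for name, base in base_models.items():
    # Test-set R^2 of the single model
    single_score = base.fit(X_tr, y_tr).score(X_te, y_te)
    # Test-set R^2 of 50 bagged copies of the same model
    bagged = BaggingRegressor(base, n_estimators=50, random_state=0)
    bagged_score = bagged.fit(X_tr, y_tr).score(X_te, y_te)
    results[name] = (single_score, bagged_score)
    print(f"{name}: single R^2 = {single_score:.3f}, bagged R^2 = {bagged_score:.3f}")
```

Comparing the two gaps shows how much of each model's error bagging can actually remove; the deep tree is the case where a clear gain is expected.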


Let's consider a simple mathematical example to understand why averaging helps:

Assume we have a true value $y$ and $n$ different predictions $\hat{y}_i$ where each prediction has some random error $\epsilon_i$:

$\hat{y}_i = y + \epsilon_i$


If we average $n$ predictions:

$\bar{\hat{y}} = \frac{1}{n}\sum_{i=1}^n \hat{y}_i = y + \frac{1}{n}\sum_{i=1}^n \epsilon_i$


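When the errors are independent and share the same variance $\mathrm{Var}(\epsilon)$, the variance of the average prediction is:

$\mathrm{Var}(\bar{\hat{y}}) = \frac{\mathrm{Var}(\epsilon)}{n}$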


<img src="./images/bias-variance.png" width="600">

<img src="./images/bagging-variance.jpg" width="600">

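This shows that, for independent errors, averaging reduces variance by a factor of $n$. In bagging, the models are trained on overlapping bootstrap samples, so their errors are positively correlated and the reduction is smaller than a factor of $n$, but averaging still lowers the variance.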
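
The $1/n$ behaviour for independent errors can be checked numerically. The simulation below is a minimal sketch with assumed values (a true value of 5.0, Gaussian errors with standard deviation 2.0, 25 predictions per average); none of these numbers come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

y_true = 5.0       # assumed true value
sigma = 2.0        # assumed standard deviation of each error
n_models = 25      # number of predictions being averaged
n_trials = 20_000  # number of repeated experiments

# Each row is one experiment: n_models independent noisy predictions of y_true
predictions = y_true + rng.normal(0.0, sigma, size=(n_trials, n_models))

single_var = predictions[:, 0].var()          # variance of one prediction
averaged_var = predictions.mean(axis=1).var()  # variance of the averaged prediction

print(f"Variance of a single prediction: {single_var:.3f}")
print(f"Variance of the average:         {averaged_var:.3f}")
print(f"Theoretical sigma^2 / n:         {sigma**2 / n_models:.3f}")
```

The simulated variance of the average should land close to $\sigma^2 / n$; with correlated errors, as in real bagging, it would sit somewhere between that value and the single-model variance.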


In the next section, we'll dive deeper into how these different datasets (D1, D2, D3) are created through a process called bootstrap sampling. This is the "Bootstrap" part of "Bootstrap Aggregating" that gives bagging its name.


💡 **Preview:** Bootstrap sampling is a statistical technique that allows us to create multiple training sets from a single dataset, which is crucial for making bagging work in practice.



**Table of contents**<a id='toc0_'></a>    
- [Bootstrap Sampling](#toc1_)    
  - [The Bootstrap Process](#toc1_1_)    
  - [Mathematical Properties](#toc1_2_)    
  - [Out-of-Bag (OOB) Samples](#toc1_3_)    
- [Bagging in Practice](#toc2_)    
  - [Core Components of Bagging](#toc2_1_)    
  - [Parallel Processing Advantage](#toc2_2_)    
  - [Practical Considerations](#toc2_3_)    
  - [Common Use Cases and Limitations](#toc2_4_)    
  - [Performance Monitoring](#toc2_5_)    
- [Practical Implementation](#toc3_)    
  - [Setting Up the Environment](#toc3_1_)    
  - [A Complete Bagging Example](#toc3_2_)    
  - [Model Configuration and Training](#toc3_3_)    
  - [Performance Analysis](#toc3_4_)    
  - [Best Practices and Common Pitfalls](#toc3_5_)    
  - [Putting It All Together](#toc3_6_)    
- [Summary](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Bootstrap Sampling](#toc0_)

Building on our previous section where we introduced bagging, let's dive into how we actually create those different training sets. This is where bootstrap sampling comes in.


Bootstrap sampling is a statistical method that involves:
- Sampling from a dataset **with replacement**
- Creating a sample of the same size as the original dataset
- Allowing data points to be selected multiple times


💡 **Key Insight:** Think of bootstrap sampling like drawing balls from a bag where you put each ball back after drawing it. This means some balls might be drawn multiple times while others might not be drawn at all.


### <a id='toc1_1_'></a>[The Bootstrap Process](#toc0_)


Let's understand this with a simple example:

Original dataset with 5 points: [A, B, C, D, E]


Possible bootstrap samples:
```text
Sample 1: [A, B, B, D, E]  # C is not selected, B appears twice
Sample 2: [B, C, C, D, E]  # A is not selected, C appears twice
Sample 3: [A, A, C, D, E]  # B is not selected, A appears twice
```


❗️**Important Note:** Each bootstrap sample:
- Has the same size as the original dataset
- Usually contains some duplicates
- Usually misses some original data points
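
The five-point example can be reproduced with NumPy. This is a small sketch; the seed is arbitrary, so the samples drawn will differ from the hand-written ones above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, used only for repeatability
original = np.array(["A", "B", "C", "D", "E"])

samples = []
for i in range(3):
    # Draw len(original) items with replacement
    sample = rng.choice(original, size=len(original), replace=True)
    samples.append(sample)
    missing = sorted(set(original.tolist()) - set(sample.tolist()))
    print(f"Sample {i + 1}: {sample.tolist()}  not selected: {missing}")
```

Each printed sample has five entries drawn from the original letters, and the "not selected" list shows which points that sample left out.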


### <a id='toc1_2_'></a>[Mathematical Properties](#toc0_)


The probability of a data point being selected in one draw is $\frac{1}{n}$ where n is the dataset size.


The probability of a data point NOT being selected after n draws is:
$(1 - \frac{1}{n})^n$


As $n \to \infty$, this approaches $\frac{1}{e} \approx 0.368$


This means:
- On average, about 63.2% of the original data points appear at least once in each bootstrap sample
- The remaining 36.8% or so are left out (these form the Out-of-Bag sample)
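
Both numbers can be verified with a few lines of NumPy. The sketch below assumes a dataset size of 10,000, chosen only so that the limit is a good approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # assumed dataset size for this check

# Theoretical probability that a given point is never drawn in n draws
p_missed = (1 - 1 / n) ** n

# Empirical check: draw one bootstrap sample of indices and count distinct ones
indices = rng.integers(0, n, size=n)
unique_fraction = np.unique(indices).size / n

print(f"(1 - 1/n)^n           = {p_missed:.4f}")
print(f"1/e                   = {np.exp(-1):.4f}")
print(f"Fraction in sample    = {unique_fraction:.4f}")
print(f"Fraction out-of-bag   = {1 - unique_fraction:.4f}")
```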


### <a id='toc1_3_'></a>[Out-of-Bag (OOB) Samples](#toc0_)


The data points not selected in a bootstrap sample form what we call the Out-of-Bag (OOB) sample. These samples are incredibly useful because:
1. They provide a nearly unbiased (if anything, slightly pessimistic) estimate of model performance
2. They can be used for validation without needing a separate validation set
3. They help in detecting overfitting
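
To make the idea concrete, the sketch below builds one bootstrap sample, collects the indices it missed, and scores a single tree on those out-of-bag points. The synthetic data from `make_regression` and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
n = len(X_demo)

# In-bag indices: drawn with replacement; OOB indices: never drawn
in_bag = rng.integers(0, n, size=n)
oob = np.setdiff1d(np.arange(n), in_bag)

# Train on the bootstrap sample, evaluate on the points the model never saw
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_demo[in_bag], y_demo[in_bag])
oob_r2 = tree.score(X_demo[oob], y_demo[oob])

print(f"OOB points: {oob.size} of {n}")
print(f"R^2 on OOB points: {oob_r2:.3f}")
```

In a full bagging ensemble the same bookkeeping is repeated per model, and each training point is scored only by the models that did not see it.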


<img src="./images/oob-samples.avif" width="600">

Here's a simple Python implementation of bootstrap sampling:


```python
import numpy as np

def bootstrap_sample(data, size=None):
    if size is None:
        size = len(data)
    indices = np.random.randint(0, len(data), size=size)
    return data[indices]

# Example usage
data = np.array([1, 2, 3, 4, 5])
bootstrap = bootstrap_sample(data)
bootstrap
```

```
array([3, 1, 3, 1, 3])
```

Now we can see how bootstrap sampling connects to bagging:
1. Create multiple bootstrap samples from training data
2. Train a separate model on each bootstrap sample
3. Combine predictions from all models


This leads us to our next section, where we'll see how to implement bagging efficiently and explore its practical advantages and limitations.


## <a id='toc2_'></a>[Bagging in Practice](#toc0_)

Now that we understand bootstrap sampling and the principles of bagging, let's explore how to implement bagging effectively and what makes it work in real-world scenarios.


### <a id='toc2_1_'></a>[Core Components of Bagging](#toc0_)


A bagging implementation consists of three main components:

1. **Base Estimator Selection**
   - Typically high-variance models (e.g., deep decision trees)
   - Must be sensitive to training data changes
   - Common choices:
     - Decision trees (most common)
     - Neural networks
     - K-nearest neighbors (often gains less, since k-NN is comparatively stable under resampling)


2. **Ensemble Generation**

```python
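# Minimal bagging implementation (X and y are NumPy arrays)
import numpy as np
from sklearn.base import clone

class BaggingEnsemble:
    def __init__(self, base_estimator, n_estimators=10):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.estimators = []

    def fit(self, X, y):
        self.estimators = []
        n = len(X)
        for _ in range(self.n_estimators):
            # Create bootstrap sample: n row indices drawn with replacement
            idx = np.random.randint(0, n, size=n)
            # Train a fresh, unfitted copy of the base model on the sample
            model = clone(self.base_estimator)
            model.fit(X[idx], y[idx])
            self.estimators.append(model)
        return self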
```

3. **Prediction Aggregation**
   - For regression: Average predictions
   - For classification: Majority voting
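
The two aggregation rules can be sketched in a few lines. The function names `aggregate_regression` and `aggregate_classification` are hypothetical helpers introduced here for illustration; each takes one row of predictions per base model.

```python
import numpy as np

def aggregate_regression(predictions):
    """Average the predictions of all models, sample by sample."""
    return np.asarray(predictions, dtype=float).mean(axis=0)

def aggregate_classification(predictions):
    """Return the most frequent label per sample (ties go to the smallest label)."""
    predictions = np.asarray(predictions)
    voted = []
    for column in predictions.T:  # one column per sample
        labels, counts = np.unique(column, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)

# Three models, two samples (regression)
reg_preds = [[2.0, 4.0], [3.0, 5.0], [4.0, 6.0]]
# Three models, three samples (classification)
clf_preds = [[0, 1, 1], [0, 1, 0], [1, 1, 0]]

avg = aggregate_regression(reg_preds)        # array([3., 5.])
votes = aggregate_classification(clf_preds)  # array([0, 1, 0])
print(avg, votes)
```

Hard voting is only one option: scikit-learn's `BaggingClassifier`, for instance, averages predicted class probabilities when the base estimator provides them.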


💡 **Key Insight:** The magic of bagging comes from the diversity of the base models combined with the wisdom of the crowd effect.


### <a id='toc2_2_'></a>[Parallel Processing Advantage](#toc0_)


One of the biggest practical advantages of bagging is that it's naturally parallel:


```python
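import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone

def parallel_fit(X, y, base_estimator, seed):
    # Draw a bootstrap sample (row indices with replacement) and fit a fresh model
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=len(X))
    model = clone(base_estimator)
    model.fit(X[idx], y[idx])
    return model

# Parallel implementation. X, y, base_estimator, and n_estimators are assumed
# to be defined already; each job gets its own seed so the samples differ.
estimators = Parallel(n_jobs=-1)(
    delayed(parallel_fit)(X, y, base_estimator, seed)
    for seed in range(n_estimators)
)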
```


❗️**Important Note:** Parallel processing can significantly reduce training time, especially with many estimators.


### <a id='toc2_3_'></a>[Practical Considerations](#toc0_)


1. **Number of Estimators**
   - More estimators generally improve performance
   - Diminishing returns after a certain point
   - Trade-off between performance and computational cost
   
   ```python
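   # Cross-validated score for different numbers of estimators
   from sklearn.ensemble import BaggingRegressor
   from sklearn.model_selection import cross_val_score

   scores = []
   for n in [1, 5, 10, 50, 100]:
       bag = BaggingRegressor(n_estimators=n)
       # Default scoring for regressors is R^2, so higher is better
       score = cross_val_score(bag, X, y).mean()
       scores.append(score)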
   ```

2. **Sample Size**
   - Default: same size as original dataset
   - Can be adjusted for specific needs:
     - Smaller: faster training, more diversity
     - Larger: more stable individual models

3. **Base Estimator Parameters**
   - Usually use default parameters
   - Complex base models → better diversity
   - Simple base models → faster training
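
The effect of sample size can be explored through the `max_samples` parameter of scikit-learn's `BaggingRegressor`. The sketch below is an illustration on assumed synthetic data (`make_friedman1`), with arbitrarily chosen fractions and 30 estimators; which fraction works best depends on the dataset.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, assumed only for this sketch
X_demo, y_demo = make_friedman1(n_samples=500, noise=1.0, random_state=0)

sample_size_scores = {}
for frac in [0.2, 0.5, 1.0]:
    bag = BaggingRegressor(
        DecisionTreeRegressor(random_state=0),
        n_estimators=30,
        max_samples=frac,  # fraction of the training rows drawn for each model
        random_state=0,
    )
    # Mean 5-fold cross-validated R^2
    sample_size_scores[frac] = cross_val_score(bag, X_demo, y_demo, cv=5).mean()

for frac, score in sample_size_scores.items():
    print(f"max_samples={frac}: mean CV R^2 = {score:.3f}")
```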


### <a id='toc2_4_'></a>[Common Use Cases and Limitations](#toc0_)


Understanding where and when to use bagging is crucial for its effective application. Like any machine learning technique, bagging has specific scenarios where it shines and others where alternative approaches might be more appropriate.


Bagging proves particularly valuable in the following situations:
- High-variance models that need stabilization (like deep decision trees)
- Datasets with significant noise where individual models might overfit
- Scenarios where computational resources allow for parallel processing
- Applications where prediction accuracy takes precedence over model interpretability


To illustrate this, consider a credit card fraud detection system. The dataset is typically noisy and imbalanced, and the cost of false negatives is high. In this case, bagging can help create a robust model that's less sensitive to individual transaction patterns and more reliable in detecting fraudulent activities.


However, bagging isn't a silver bullet. Let's examine its limitations to understand when we might need to consider alternative approaches:

1. Model Interpretability
   - The ensemble nature of bagging makes it harder to explain individual predictions
   - While we can still extract feature importance, the reasoning behind specific predictions becomes more opaque
   - This can be problematic in domains like healthcare or finance where decision transparency is crucial

2. Memory Requirements
   - Each base model in the ensemble needs to be stored in memory
   - For large datasets or complex base models, this can lead to significant memory overhead
   - This becomes especially challenging in resource-constrained environments

3. Prediction Time
   - Every prediction requires aggregating results from all base models
   - This can lead to slower inference times compared to single models
   - In real-time applications, this overhead might be problematic


Understanding these trade-offs helps us make informed decisions about when to use bagging. For instance, in a real-time recommendation system where quick predictions are crucial, we might need to carefully balance the number of base models against the required prediction speed.


### <a id='toc2_5_'></a>[Performance Monitoring](#toc0_)


Monitor these metrics to ensure effective bagging:

1. **Individual Model Performance**
   ```python
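   # Check performance of individual models on held-out data (X_val, y_val).
   # Assumes the default max_features=1.0; otherwise each model must be given
   # only its own columns from bagging.estimators_features_.
   for model in bagging.estimators_:
       score = model.score(X_val, y_val)
       print(f"Model score: {score:.3f}")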
   ```

2. **Ensemble Diversity**
   - Track prediction correlations
   - Monitor unique predictions

3. **OOB Score**
   ```python
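   from sklearn.ensemble import BaggingRegressor

   # Use enough estimators that every sample is out-of-bag at least once
   bagging = BaggingRegressor(n_estimators=100, oob_score=True)
   bagging.fit(X, y)
   print(f"OOB Score: {bagging.oob_score_:.3f}")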
   ```
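
Ensemble diversity (item 2) has no built-in attribute, but one way to track it is the pairwise correlation between the base models' predictions on held-out data. The sketch below assumes synthetic data and a small ensemble of 10 trees; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data and split, assumed only for this sketch
X_demo, y_demo = make_regression(
    n_samples=400, n_features=8, noise=20.0, random_state=0
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

ensemble = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10, random_state=0)
ensemble.fit(X_fit, y_fit)

# One row of validation predictions per base model, using that model's own columns
member_preds = np.array([
    est.predict(X_val[:, cols])
    for est, cols in zip(ensemble.estimators_, ensemble.estimators_features_)
])

# Pairwise correlation matrix between base models
corr = np.corrcoef(member_preds)
n_models = corr.shape[0]
mean_pairwise_corr = corr[~np.eye(n_models, dtype=bool)].mean()

print(f"Mean pairwise prediction correlation: {mean_pairwise_corr:.3f}")
```

A mean correlation close to 1 suggests the base models are nearly identical, in which case averaging them adds little; lower values indicate more diversity.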


Now that we understand how bagging works in practice, the next section turns to scikit-learn's implementation and walks through a complete example, from data preparation to model evaluation.

## <a id='toc3_'></a>[Practical Implementation](#toc0_)

Throughout our previous sections, we've built a strong theoretical foundation of bagging and its components. Now, let's put this knowledge into practice by implementing a bagging ensemble using scikit-learn. We'll walk through a complete example that demonstrates the concepts we've learned.


### <a id='toc3_1_'></a>[Setting Up the Environment](#toc0_)


First, let's import the necessary libraries and prepare our workspace. We'll use a real-world dataset to make our example more concrete.


```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor, BaggingClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score
import matplotlib.pyplot as plt
```

### <a id='toc3_2_'></a>[A Complete Bagging Example](#toc0_)


Let's work through a practical example using the California Housing dataset, which is a perfect case for demonstrating bagging due to its complexity and real-world nature.


```python
from sklearn.datasets import fetch_california_housing

# Load and prepare data
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

💡 **Key Insight:** The choice of base estimator and number of estimators significantly impacts the model's performance. Let's explore this through experimentation.


### <a id='toc3_3_'></a>[Model Configuration and Training](#toc0_)


Let's implement bagging with different configurations to understand their impact:


```python
# Basic bagging model
base_tree = DecisionTreeRegressor(max_depth=10)
bagging_model = BaggingRegressor(
    estimator=base_tree,  # named `base_estimator` in scikit-learn < 1.2
    n_estimators=100,
    max_samples=0.8,      # each model sees 80% of the rows (drawn with replacement)
    max_features=0.8,     # each model sees 80% of the features
    bootstrap=True,
    oob_score=True,
    random_state=42,
    n_jobs=-1  # Parallel processing
)

# Fit the model
bagging_model.fit(X_train, y_train)
```

### <a id='toc3_4_'></a>[Performance Analysis](#toc0_)


One of the key advantages of bagging is the ability to analyze performance using out-of-bag estimates. Let's explore different evaluation metrics:

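```python
bagging_model.oob_score_
```

```
0.8020028189326132
```

For a regressor, the OOB score is an $R^2$ value computed from predictions on the samples each model did not see during training. We can also evaluate the bagging model directly on the training and test sets: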

```python
# Create a function to evaluate performance on the training and test sets
def evaluate_bagging_performance(model, X_train, X_test, y_train, y_test):
    # Training and test predictions
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    # Calculate errors
    train_mse = mean_squared_error(y_train, train_pred)
    test_mse = mean_squared_error(y_test, test_pred)

    print(f"Training MSE: {train_mse:.4f}")
    print(f"Test MSE: {test_mse:.4f}")

    return train_mse, test_mse

evaluate_bagging_performance(bagging_model, X_train, X_test, y_train, y_test)
```

```
Training MSE: 0.1691
Test MSE: 0.2625
(0.16914477707404912, 0.262538416417678)
```

Bagging involves several important hyperparameters that need careful tuning. Let's explore how to optimize them:


```python
from sklearn.model_selection import GridSearchCV

# Function to perform hyperparameter search
def tune_bagging_params(X, y, param_grid):
    bagging = BaggingRegressor()
    grid_search = GridSearchCV(
        bagging, param_grid,
        cv=5, scoring='neg_mean_squared_error'
    )
    grid_search.fit(X, y)

    return grid_search.best_params_

# Example parameter grid
param_grid = {
    'n_estimators': [3, 5, 10],
    # 'max_samples': [0.7, 0.8, 0.9],
    # 'max_features': [0.7, 0.8, 0.9]
}

tune_bagging_params(X_train, y_train, param_grid)
```

```
{'n_estimators': 10}
```

### <a id='toc3_5_'></a>[Best Practices and Common Pitfalls](#toc0_)


When implementing bagging in practice, keep these important considerations in mind:

1. **Data Preprocessing**
   - Scale features only when the base estimator is scale-sensitive (e.g., k-NN or neural networks); decision trees do not need it
   - Handle missing values appropriately
   - Consider feature engineering to improve base model performance

2. **Model Selection**
   - Start with simple base models and gradually increase complexity
   - Monitor training time vs. performance improvements
   - Use cross-validation to ensure robust performance estimates

3. **Resource Management**
   - Be mindful of memory usage with large ensembles
   - Utilize parallel processing when possible
   - Consider model compression techniques for deployment


❗️**Important Note:** Always validate your model's performance on a separate test set to ensure generalization.


### <a id='toc3_6_'></a>[Putting It All Together](#toc0_)


Let's implement a complete workflow combining all these elements:


```python
# Complete implementation example
def build_optimal_bagging_model(X, y, param_grid):
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Find optimal parameters (param_grid is passed in rather than read from a global)
    best_params = tune_bagging_params(X_train, y_train, param_grid)

    # Create and train model with best parameters
    optimal_model = BaggingRegressor(**best_params)
    optimal_model.fit(X_train, y_train)

    # Evaluate performance
    evaluate_bagging_performance(
        optimal_model, X_train, X_test, y_train, y_test
    )

    return optimal_model
```


This implementation brings together all the concepts we've discussed and provides a solid foundation for using bagging in practice. Remember that the key to successful implementation lies in understanding your specific use case and adjusting these components accordingly.


As we conclude our exploration of bagging, we've seen how to implement it effectively in practice. In our next lecture, we'll build upon these concepts as we dive into Random Forests, which extend the bagging principle with additional randomization techniques.

## <a id='toc4_'></a>[Summary](#toc0_)

Throughout this lecture on bagging and bootstrap sampling, we've explored fundamental concepts that form the backbone of modern ensemble learning. Let's connect the key ideas we've covered and understand how they fit together in the bigger picture of machine learning.


We started with the core concept of bagging, understanding how it leverages the power of multiple models to create more robust predictions. The key insight here is that combining multiple "noisy" models can produce a more stable and accurate result, much like how averaging multiple expert opinions often leads to better decisions.


Bootstrap sampling, the mechanism that makes bagging possible, showed us how we can create diverse training sets from a single dataset. Remember our analogy of drawing balls from a bag with replacement - this simple yet powerful technique allows us to:
- Generate multiple training sets
- Ensure model diversity
- Create out-of-bag samples for validation


When we moved to practical implementation, we learned that bagging isn't just theoretical - it's a powerful tool with real-world applications. We saw how:
- Parallel processing makes bagging computationally efficient
- The number of estimators affects model performance
- Out-of-bag scoring provides built-in validation


💡 **Key Insight:** The true power of bagging lies in its ability to reduce variance while maintaining the predictive strength of complex models.


This understanding of bagging and bootstrap sampling sets the foundation for our next lecture on Random Forests, which builds upon these concepts by adding feature randomization to create even more powerful ensemble models.


❗️**Important Note:** As you move forward, remember that bagging is just one tool in your machine learning toolkit. Understanding when to use it - and when not to - is crucial for successful application.


To solidify your understanding, consider:
1. How does bootstrap sampling contribute to model diversity?
2. Why does averaging predictions reduce variance?
3. In what scenarios would you choose bagging over a single complex model?
4. How can you use out-of-bag samples to validate your model?


These concepts will continue to be relevant as we explore more advanced ensemble methods in upcoming lectures.