<img src="./images/banner.png" width="800">

# Bagging

In our previous lecture on ensemble methods, we learned that ensemble learning combines multiple models to create a more robust and accurate predictor.


Ensemble methods generally fall into two categories:
- **Parallel ensembles** (like bagging) where base learners are built independently
- **Sequential ensembles** (like boosting) where base learners are built sequentially


Bagging, short for **B**ootstrap **Agg**regat**ing**, is a parallel ensemble method introduced by Leo Breiman in 1996. It's designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression.


The core idea of bagging is remarkably intuitive:
1. Create multiple versions of the training dataset through bootstrap sampling
2. Train a model on each sampled dataset
3. Aggregate predictions from all models
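
The three steps above can be sketched from scratch in a few lines. This is a minimal illustration, not a reference implementation: the helper names `fit_line` and `bagged_predict` and the toy data are assumptions made for this example. A least-squares line is used as the base learner only to keep the sketch short; bagging usually gains little with such a stable learner.

```python
import numpy as np

def fit_line(x, y):
    """Base learner (hypothetical helper): least-squares line as (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

def bagged_predict(x_train, y_train, x_new, n_models=50, seed=0):
    """Bag `n_models` line fits and average their predictions at `x_new`."""
    rng = np.random.default_rng(seed)
    n = len(x_train)
    predictions = []
    for _ in range(n_models):
        # Step 1: bootstrap sample (n draws with replacement)
        idx = rng.integers(0, n, size=n)
        # Step 2: train a base model on the sampled data
        slope, intercept = fit_line(x_train[idx], y_train[idx])
        predictions.append(slope * x_new + intercept)
    # Step 3: aggregate by averaging (a classifier would use a majority vote)
    return np.mean(predictions, axis=0)

# Toy data (assumed for illustration): y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.size)

y_hat = bagged_predict(x, y, np.array([0.0, 5.0, 10.0]))
print(y_hat)  # should land close to the noise-free values 1, 11, 21
```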


Bagging is particularly effective because it addresses a fundamental challenge in machine learning: the bias-variance tradeoff.


❗️ **Important Note:** Bagging primarily helps reduce variance while keeping bias relatively unchanged.


Let's understand this mathematically. For any predictor $f(x)$ evaluated under squared-error loss, the expected prediction error can be decomposed as:

$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$


Bagging reduces variance through:
- Averaging multiple predictions
- Using bootstrap samples to create diverse training sets


For $B$ bagged predictors, if we assume they're identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of the bagged predictor is:

$\sigma_{bag}^2 = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$
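
As $B$ grows, the second term shrinks toward zero while the first term $\rho\sigma^2$ remains, so correlation between base models sets a floor on the variance that averaging alone cannot remove. The formula can be checked numerically with a small simulation. This is a sketch under stated assumptions: the predictors are modeled as Gaussian variables built from one shared and one private noise component, and the function names are made up for this example.

```python
import numpy as np

def bagged_variance_formula(sigma2, rho, B):
    """Variance of the average of B identically distributed, equicorrelated predictors."""
    return rho * sigma2 + (1 - rho) / B * sigma2

def simulate_bagged_variance(sigma2, rho, B, n_trials=50_000, seed=0):
    """Empirical variance of the mean of B predictors with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_trials, 1))   # component common to all predictors
    private = rng.standard_normal((n_trials, B))  # component unique to each predictor
    # Each column has variance sigma2; any two columns have correlation rho
    preds = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * private)
    return preds.mean(axis=1).var()

sigma2, rho = 4.0, 0.3
for B in (1, 10, 100):
    theory = bagged_variance_formula(sigma2, rho, B)
    empirical = simulate_bagged_variance(sigma2, rho, B)
    print(f"B={B:>3}  formula={theory:.3f}  simulated={empirical:.3f}")
```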


Bagging is most effective when:
1. The base model has high variance and low bias
2. The dataset is of moderate to large size
3. The base models are sensitive to changes in the training data


Common base learners for bagging include:
- Decision trees (most common)
- Neural networks
- Linear regression (in some cases)


While bagging is powerful, it's important to understand its limitations:
- Computational overhead due to multiple models
- No improvement in bias
- May not be as effective with stable algorithms (like k-nearest neighbors)


```python
# Simple example of bagging concept in scikit-learn
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Create a bagging classifier
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,    # number of base models
    max_samples=0.8,     # each bootstrap sample has 80% as many rows as the training set
    bootstrap=True,      # sample with replacement
    random_state=42
)
```

💡 **Tip:** When implementing bagging, start with these guidelines:
- Use 50-100 base estimators initially
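- Keep the bootstrap sample the same size as the original dataset (the default, `max_samples=1.0`); each sample then contains about 63.2% of the distinct original observations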
- Monitor out-of-bag error to assess performance


In the next section, we'll dive deeper into bootstrap sampling, which forms the foundation of the bagging methodology.

## Bootstrap Sampling

Bootstrap sampling is a statistical method that involves randomly sampling a dataset with replacement. Think of it like drawing balls from a bag where you:
1. Draw a ball
2. Record its value
3. Put it back in the bag
4. Repeat until you have as many samples as you want

💡 **Tip:** The term "bootstrap" comes from the phrase "pulling oneself up by one's bootstraps," suggesting the method's ability to use limited data to generate insights about a larger population.

### Mathematical Foundation

Let's say we have a dataset $D$ with $n$ observations: $D = \{x_1, x_2, \ldots, x_n\}$.

When we create a bootstrap sample $D^*$, each observation:
- Has probability $\frac{1}{n}$ of being selected on each draw
- Can be selected multiple times
- Might not be selected at all

The probability of an observation not being selected in a single draw is $(1-\frac{1}{n})$.

For a bootstrap sample of size $n$, the probability of an observation never being selected is:

$P(\text{not selected}) = (1-\frac{1}{n})^n \approx e^{-1} \approx 0.368$

This means:
- Approximately 63.2% of the distinct original observations appear in each bootstrap sample
- About 36.8% of observations are left out of any given sample (these form its Out-of-Bag sample)
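
A quick numerical check makes the limit concrete. The sketch below compares the exact value $(1-\frac{1}{n})^n$ with the fraction of observations actually left out of simulated bootstrap samples; the function names are illustrative only.

```python
import numpy as np

def prob_never_selected(n):
    """Exact probability that a given observation is missed by n draws with replacement."""
    return (1 - 1 / n) ** n

def simulated_oob_fraction(n, n_repeats=200, seed=0):
    """Average fraction of observations missing from a bootstrap sample of size n."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_repeats):
        idx = rng.integers(0, n, size=n)   # one bootstrap sample of indices
        n_unique = np.unique(idx).size     # distinct observations it contains
        fractions.append(1 - n_unique / n)
    return float(np.mean(fractions))

for n in (10, 100, 1000):
    print(f"n={n:>4}  exact={prob_never_selected(n):.4f}  "
          f"simulated={simulated_oob_fraction(n):.4f}")
print(f"limit e^-1 = {np.exp(-1):.4f}")
```

The approximation is already close for moderate $n$ and improves as $n$ grows.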

### Properties of Bootstrap Samples

1. **Size and Composition**
   - Typically same size as original dataset
   - Contains duplicates of some observations
   - Misses some original observations
   - Approximately preserves the distribution of the original data

2. **Statistical Properties**
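   ```python
   import numpy as np

   # Example of bootstrap sampling
   def create_bootstrap_sample(data, size=None):
       """Draw `size` items from `data` uniformly at random, with replacement."""
       data = np.asarray(data)  # accept lists as well as arrays
       if size is None:
           size = len(data)  # standard bootstrap: same size as the original
       # Each draw independently picks an index in [0, len(data))
       indices = np.random.randint(0, len(data), size=size)
       return data[indices]
   ```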

3. **Variance Estimation**
   Bootstrap samples help estimate the variance of a statistic $\theta$ using:

   $Var(\theta) \approx \frac{1}{B-1}\sum_{i=1}^B(\theta_i - \bar{\theta})^2$

   where $B$ is the number of bootstrap samples, $\theta_i$ is the statistic computed on the $i$-th bootstrap sample, and $\bar{\theta}$ is the mean of the $\theta_i$.
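
The variance formula can be applied to any statistic. The following is a minimal sketch, assuming a hypothetical helper `bootstrap_variance` and synthetic Gaussian data; for the sample mean, the result can be compared against the textbook estimate $s^2/n$.

```python
import numpy as np

def bootstrap_variance(data, statistic, n_bootstraps=2000, seed=0):
    """Estimate Var(statistic) from the spread of the statistic over bootstrap samples."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(n_bootstraps)
    for i in range(n_bootstraps):
        sample = data[rng.integers(0, n, size=n)]  # resample with replacement
        estimates[i] = statistic(sample)
    return estimates.var(ddof=1)  # ddof=1 gives the 1/(B-1) factor

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=500)

var_mean = bootstrap_variance(data, np.mean)
var_median = bootstrap_variance(data, np.median)
analytic_var_mean = data.var(ddof=1) / len(data)  # s^2 / n

print(f"bootstrap Var(mean)   = {var_mean:.5f}")
print(f"analytic  Var(mean)   = {analytic_var_mean:.5f}")
print(f"bootstrap Var(median) = {var_median:.5f}")
```

The median has no equally simple closed-form variance, which is where the bootstrap estimate is most useful.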

### Out-of-Bag (OOB) Samples

OOB samples are a crucial concept in bagging, providing:
1. **Built-in Validation Set**
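   - Often removes the need for a separate validation set or cross-validation
   - Approximately unbiased performance estimate, since each observation is scored only by models that never saw it during training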

2. **Error Estimation**
   ```python
   # Example of OOB score in scikit-learn
   from sklearn.ensemble import BaggingClassifier

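   bagging = BaggingClassifier(
       oob_score=True,  # Enable OOB scoring (requires bootstrap=True, the default)
       n_estimators=100
   )
   # After bagging.fit(X, y), the OOB accuracy is available as bagging.oob_score_
   ```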
   ```

3. **Feature Importance**
   - Can be used to estimate variable importance
   - Helps in feature selection
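
To see the OOB estimate used as a validation score, the sketch below fits a bagging classifier on synthetic data and compares `oob_score_` with accuracy on a held-out test set. The dataset from `make_classification` and the split sizes are assumptions chosen for illustration; the base estimator is left at scikit-learn's default, a decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumed for illustration)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

bagging = BaggingClassifier(
    n_estimators=100,
    oob_score=True,   # score each training point with the models that did not see it
    random_state=0,
)
bagging.fit(X_train, y_train)

oob_accuracy = bagging.oob_score_
test_accuracy = bagging.score(X_test, y_test)
print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The two numbers will generally be close but not identical, since both are estimates computed on different finite sets of points.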

### Practical Considerations

❗️ **Important Note:** When implementing bootstrap sampling, consider:

1. **Sample Size Selection**
   - Standard: Same as original dataset
   - Smaller: Reduces computation, might increase diversity
   - Larger: More stable but potentially less diverse

2. **Number of Bootstrap Samples**
   - More samples → more stable estimates
   - Diminishing returns after certain point
   - Trade-off with computational resources

```python
# Example: bootstrap distribution of the sample mean
# (assumes numpy is imported as np and create_bootstrap_sample is defined as above)
def demonstrate_bootstrap_properties(data, n_bootstraps=1000):
    original_mean = np.mean(data)
    bootstrap_means = []

    for _ in range(n_bootstraps):
        boot_sample = create_bootstrap_sample(data)
        bootstrap_means.append(np.mean(boot_sample))

    return {
        'original_mean': original_mean,
        'bootstrap_mean': np.mean(bootstrap_means),
        # ddof=1 matches the 1/(B-1) variance formula; estimates the standard error of the mean
        'bootstrap_std': np.std(bootstrap_means, ddof=1)
    }
```

### Common Applications

1. **Statistical Inference**
   - Confidence interval estimation
   - Hypothesis testing
   - Parameter uncertainty estimation

2. **Machine Learning**
   - Model validation
   - Ensemble creation
   - Uncertainty quantification

3. **Model Evaluation**
   - Performance metric estimation
   - Model stability assessment
   - Feature importance calculation

### Bootstrap Sampling Variations

1. **Balanced Bootstrap**
   - Ensures each observation appears exactly the same number of times across the full set of bootstrap samples
   - Useful for reducing simulation noise in bootstrap estimates

2. **Weighted Bootstrap**
   - Observations have different selection probabilities
   - Useful for dealing with sample bias

3. **Block Bootstrap**
   - For time series or dependent data
   - Samples blocks of consecutive observations
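
The three variations differ only in how indices are drawn. The sketch below shows one simple way to generate indices for each; the function names are hypothetical, and the block version is a basic moving-block scheme assumed for illustration rather than the only possible design.

```python
import numpy as np

def balanced_bootstrap_indices(n, n_bootstraps, rng):
    """Each index appears exactly n_bootstraps times across all samples combined."""
    pool = np.tile(np.arange(n), n_bootstraps)  # n_bootstraps copies of every index
    rng.shuffle(pool)                           # random assignment to samples
    return pool.reshape(n_bootstraps, n)        # one row per bootstrap sample

def weighted_bootstrap_indices(weights, rng):
    """Draw n indices with selection probabilities proportional to `weights`."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), size=len(p), replace=True, p=p)

def moving_block_bootstrap_indices(n, block_size, rng):
    """Concatenate randomly chosen blocks of consecutive indices, trimmed to length n."""
    n_blocks = -(-n // block_size)  # ceiling division
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    blocks = [np.arange(s, s + block_size) for s in starts]
    return np.concatenate(blocks)[:n]

rng = np.random.default_rng(0)
balanced = balanced_bootstrap_indices(n=10, n_bootstraps=5, rng=rng)
weighted = weighted_bootstrap_indices([0, 0, 1, 1], rng=rng)
blocked = moving_block_bootstrap_indices(n=10, block_size=3, rng=rng)

print(np.bincount(balanced.ravel(), minlength=10))  # every count equals 5
print(weighted)  # only indices 2 and 3, since the others have zero weight
print(blocked)   # runs of consecutive indices
```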

💡 **Tip:** Choose the appropriate bootstrap variation based on your data structure and problem requirements.

In the next section, we'll explore how these bootstrap samples are used in practice to create bagged models and improve prediction accuracy.