# Bagging
<img src="../images/bagging.webp" width="500" />

### Introduction
Bagging (bootstrap aggregating) is the process of training multiple instances of the same learning algorithm on different resampled subsets of the training data and then combining their predictions. It is a form of ensemble learning that often improves model performance, particularly for high-variance models such as decision trees, which are prone to overfitting. Random forests are a popular ensemble model that employs bagging.

At a high level (seen in the figure above), the steps of bagging are:
1. **Bootstrap Sampling**
    - Randomly drawing multiple resampled subsets (bootstrap samples) from the original training data, sampling with replacement.
2. **Training Base Learners**
    - Training individual base learners (most commonly decision trees) independently on each bootstrap sample.
3. **Aggregating Predictions**
    - Combining predictions from the individual models, typically by majority voting (classification) or averaging (regression).
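
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how scikit-learn implements bagging: the base learner is a toy one-split "stump" standing in for a decision tree, and the names `fit_stump`, `bagging_fit`, and `bagging_predict`, along with the six-point dataset, are hypothetical and exist only for this example.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y):
    """Toy base learner: the single-feature threshold split with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            left_label = majority(left)  # left is never empty because t is an observed value
            right_label = majority(right) if right else left_label
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    _, j, t, left_label, right_label = best
    return lambda row: left_label if row[j] <= t else right_label


def bagging_fit(X, y, n_estimators=25, seed=0):
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        # Step 1: bootstrap sample (n row indices drawn with replacement).
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: train one base learner on that sample.
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, row):
    # Step 3: aggregate the individual predictions by majority vote.
    return majority([model(row) for model in models])


# Hypothetical one-feature dataset with two well-separated classes.
X = [[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]]
y = [0, 0, 0, 1, 1, 1]
models = bagging_fit(X, y)
print(bagging_predict(models, [1.0]), bagging_predict(models, [9.0]))
```

For regression, the vote in `bagging_predict` would be replaced by an average of the base learners' outputs.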

### Bootstrap Sampling
Bootstrap sampling is a critical component of bagging: each bootstrap sample is created by randomly selecting instances *with* replacement from the original dataset, so some instances appear several times while others are left out. Because each base learner is trained on a different version of the data, their individual errors partly cancel when the predictions are aggregated. This lowers the variance of the final model and thereby mitigates overfitting.
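
To make sampling with replacement concrete, the sketch below draws one bootstrap sample of row indices and measures how many distinct rows it contains. It uses only the standard library; the helper name `bootstrap_indices` and the dataset size `n = 1000` are assumptions chosen for illustration. The expected fraction of distinct rows is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (roughly 63%) as n grows.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices uniformly at random, with replacement."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(0)
n = 1000  # hypothetical dataset size
idx = bootstrap_indices(n, rng)

# Fraction of the original rows that appear at least once in the sample.
unique_fraction = len(set(idx)) / n
# Expected value of that fraction: 1 - (1 - 1/n)^n.
expected_fraction = 1 - (1 - 1 / n) ** n

print(f"distinct rows in this sample: {unique_fraction:.3f}")
print(f"expected fraction:            {expected_fraction:.3f}")
```

The rows that a given sample leaves out are the reason different base learners see different data, even though every sample has the same size as the original dataset.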

## Demo: Predicting Types of Wine
<img src="../images/wine.gif" width="350" />

In [28]:
# Imports
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
import pandas as pd
import numpy as np

In [27]:
# Load data
data = load_wine()
df = pd.DataFrame(data= np.c_[data['data'],data['target']],
                  columns= data['feature_names'] + ['target'])

# Split data
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

df.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0.0 |


In [31]:
# Single decision tree
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Evaluation
y_pred = clf.predict(X_test)
score = accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)
print('Accuracy of Single Model\n------------------------')
print(round(score, 4))

Accuracy of Single Model
------------------------
0.8889


In [36]:
# Bagging 100 decision trees
bagging_clf = BaggingClassifier(estimator=clf, n_estimators=100)
bagging_clf.fit(X_train, y_train)

# Evaluation
y_pred = bagging_clf.predict(X_test)
score = accuracy_score(y_test, y_pred)  # signature is (y_true, y_pred)
print('Accuracy of Bagging Model\n-------------------------')
print(round(score, 4))

Accuracy of Bagging Model
-------------------------
0.9444
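
The two accuracies above come from a single random split with only about 36 test samples, so they will change from run to run. One way to get a steadier comparison is 5-fold cross-validation, sketched below. This is an optional check rather than part of the original demo; the variable names, the choice of `cv=5`, and the fixed `random_state=0` values are assumptions made for reproducibility.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Single decision tree vs. 100 bagged trees, both seeded for repeatable results.
tree = DecisionTreeClassifier(random_state=0)
# The base estimator is passed positionally so the call does not depend on
# whether the installed scikit-learn version names the parameter `estimator`.
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=100, random_state=0)

# For classifiers, an integer cv uses stratified folds, so each fold keeps
# the class proportions of the full dataset.
tree_scores = cross_val_score(tree, X, y, cv=5)
bagged_scores = cross_val_score(bagged, X, y, cv=5)

print(f"Single tree : {tree_scores.mean():.4f} (+/- {tree_scores.std():.4f})")
print(f"Bagged trees: {bagged_scores.mean():.4f} (+/- {bagged_scores.std():.4f})")
```

The standard deviation across folds indicates how much of the gap between the two models could be explained by the choice of split alone.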
