# Bagging

In [1]:
import numpy as np
import matplotlib.pyplot as plt

A single decision tree often generalizes poorly because it tends to overfit.  One possible solution is to construct multiple trees and combine them, which reduces variance.  If every tree were trained on exactly the same data, all the trees would be identical, so we need to inject some differences between them (i.e., make them as diverse as possible while still letting them see some overlapping samples).  One simple idea is to train each tree on a **bootstrap sample** of the training data and then aggregate the trees' decisions.

The process has the following steps:

1. Sample $m$ times **with replacement** from the original training data
2. Repeat $B$ times to generate $B$ "bootstrapped" training datasets $D_1, D_2, \cdots, D_B$
3. Train $B$ trees using the training datasets $D_1, D_2, \cdots, D_B$ 

Bootstrapping the data and then aggregating the predictions (averaging or majority vote) is called **bootstrap aggregation**, or **bagging**.

*Example*:

Assume that we have a training set with $m=4$ samples and $n=2$ features:

$$D = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)\}$$

We generate, say, $B = 3$ datasets by bootstrapping:

$$D_1 = \{(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_3, y_3)\}$$
$$D_2 = \{(x_1, y_1), (x_4, y_4), (x_4, y_4), (x_3, y_3)\}$$
$$D_3 = \{(x_1, y_1), (x_1, y_1), (x_2, y_2), (x_2, y_2)\}$$

We can then train 3 trees.
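The example above can be reproduced with a few lines of NumPy. This is only a minimal sketch: the seed and the names `rng` and `bootstrap_idx` are illustrative assumptions, and the drawn indices will generally differ from the hand-written $D_1, D_2, D_3$.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
m, B = 4, 3                      # 4 training samples, 3 bootstrapped datasets

# Each row holds the m indices drawn WITH replacement for one dataset D_b,
# so the same index may appear more than once within a row.
bootstrap_idx = rng.integers(0, m, size=(B, m))

for b, idx in enumerate(bootstrap_idx, start=1):
    # +1 only to match the 1-based subscripts used in the text
    print(f"D_{b} uses samples:", sorted((idx + 1).tolist()))
```

Indexing `X[bootstrap_idx[b]]` and `y[bootstrap_idx[b]]` would then give the training data for tree `b`.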

Note: When sampling is performed **without** replacement, it is called **pasting**.  In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
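The difference between the two sampling schemes can be seen by drawing indices for a single predictor. The sketch below assumes a hypothetical training set of 10 samples and a subset size of 6; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
m, size = 10, 6                  # hypothetical: 10 training samples, 6 drawn per predictor

# Bagging: with replacement, so an index may repeat within one predictor's sample
bagging_idx = rng.choice(m, size=size, replace=True)

# Pasting: without replacement, so every index within one predictor's sample is unique
pasting_idx = rng.choice(m, size=size, replace=False)

print("bagging:", bagging_idx)
print("pasting:", pasting_idx)
print("unique indices in pasting sample:", len(np.unique(pasting_idx)))
```

Repeating either draw for each predictor means an instance can still show up in several predictors' samples in both schemes.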

Let's try to code this from scratch.  To make our life easier, we shall use `DecisionTreeClassifier` from the sklearn library (since we already coded a decision tree from scratch in the previous class).

## 1. Scratch

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                test_size=0.3, shuffle=True, random_state=42)

In [3]:
from sklearn.tree import DecisionTreeClassifier
import random
from scipy import stats
from sklearn.metrics import classification_report

B = 5
m, n = X_train.shape
bootstrap_ratio = 1
tree_params = {'max_depth': 2, 'criterion':'gini', 'min_samples_split': 5}
models = [DecisionTreeClassifier(**tree_params) for _ in range(B)]

#sample size for each tree
sample_size = int(bootstrap_ratio * len(X_train))

xsamples = np.zeros((B, sample_size, n))
ysamples = np.zeros((B, sample_size))

#subsamples for each model
for i in range(B):
    #sampling with replacement; i.e., a sample can occur more than once
    #for the same predictor
    for j in range(sample_size):
        idx = random.randrange(m)   #<----with replacement (for pasting, draw indices without repetition instead)
        xsamples[i, j, :] = X_train[idx]
        ysamples[i, j] = y_train[idx]
        #TODO: keep track of the indices that were not used for the ith tree
#fitting each estimator
for i, model in enumerate(models):
    _X = xsamples[i, :]
    _y = ysamples[i, :]
    model.fit(_X, _y)
    
#make predictions (class labels) with each model
predictions = np.zeros((B, X_test.shape[0]))
for i, model in enumerate(models):
    yhat = model.predict(X_test)
    predictions[i, :] = yhat

#predictions.shape = (B, m_test)

#majority vote across the B trees (axis=0); np.ravel flattens the mode array to 1D regardless of the SciPy version
yhat = np.ravel(stats.mode(predictions, axis=0)[0])

print(classification_report(y_test, yhat))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
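To make the aggregation step concrete, here is a small sketch of majority voting written with plain NumPy. The `predictions` array and the helper `majority_vote` are hypothetical and only for illustration; the helper assumes non-negative integer class labels.

```python
import numpy as np

# Hypothetical predictions from B = 3 trees on 4 test samples (one row per tree)
predictions = np.array([
    [0, 1, 2, 2],
    [0, 1, 1, 2],
    [1, 1, 2, 0],
])

def majority_vote(preds):
    """Return the most frequent label in each column (ties go to the smallest label)."""
    preds = preds.astype(int)
    n_classes = preds.max() + 1
    # counts[c, j] = number of trees that voted class c for test sample j
    counts = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)

votes = majority_vote(predictions)
print(votes)   # [0 1 2 2]
```

Each column is voted on independently, which is the same column-wise mode that the scratch implementation takes over its `(B, m_test)` prediction matrix.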




## 2. Sklearn

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

tree = DecisionTreeClassifier()

'''
In sklearn, bagging is available through the BaggingClassifier API.
Pasting can be done with the same class by setting bootstrap=False.
'''

bag = BaggingClassifier(tree, n_estimators=5, max_samples=0.99)

bag.fit(X_train, y_train)
yhat = bag.predict(X_test)
print(classification_report(y_test, yhat))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
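For comparison, pasting can be tried with the same class by turning bootstrapping off. The sketch below is self-contained; `max_samples=0.8` and `random_state=0` are illustrative assumptions (with `bootstrap=False` and all samples used, every tree would see the same data), and the names `paste` and `yhat_paste` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)

# bootstrap=False -> each tree is trained on a subset drawn WITHOUT replacement (pasting).
# max_samples=0.8 is an illustrative choice so that the subsets differ between trees.
paste = BaggingClassifier(DecisionTreeClassifier(), n_estimators=5,
                          max_samples=0.8, bootstrap=False, random_state=0)
paste.fit(X_train, y_train)

yhat_paste = paste.predict(X_test)
print(accuracy_score(y_test, yhat_paste))
```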

