# Optional Coding Exercise 

## Implementing a Bagging Algorithm from Scratch

In [1]:
%load_ext watermark
%watermark  -d -u -a 'Sebastian Raschka' -v -p numpy,scipy,matplotlib,sklearn

Author: Sebastian Raschka

Last updated: 2021-12-17

Python implementation: CPython
Python version       : 3.9.6
IPython version      : 7.29.0

numpy     : 1.21.2
scipy     : 1.7.0
matplotlib: 3.4.3
sklearn   : 1.0



In [2]:
import numpy as np

## 2) Bagging

In this coding exercise, you will combine multiple decision trees into a bagging classifier. This time, we will use the decision tree algorithm implemented in scikit-learn (a variant of the CART algorithm for binary splits, as implemented earlier and discussed in class).

### 2.1 Bootstrapping

As you remember, bagging relies on bootstrap sampling. So, as a first step, your task is to implement a function for generating bootstrap samples. In this exercise, for simplicity, we will perform the computations based on the Iris dataset.

On an interesting side note, scikit-learn recently updated its version of the Iris dataset after it was discovered that the version hosted on the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/Iris/) has two data points that differ from R. Fisher's original paper (Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950)). Since most students may not have the latest scikit-learn version installed, we will work with the Iris dataset deposited on UCI, which has become a standard in the Python machine learning community for benchmarking algorithms. Instead of downloading it manually, we will fetch it through the `mlxtend` (http://rasbt.github.io/mlxtend/) library that you installed in the last homework.

In [3]:
# DO NOT EDIT OR DELETE THIS CELL

from mlxtend.data import iris_data
X, y = iris_data()

print('Number of examples:', X.shape[0])
print('Number of features:', X.shape[1])
print('Unique class labels:', np.unique(y))

Number of examples: 150
Number of features: 4
Unique class labels: [0 1 2]


Use scikit-learn's `train_test_split` function to divide the dataset into a training and a test set.

- The test set should contain 45 examples, and the training set should contain 105 examples.
- To ensure reproducible results, use `123` as a random seed.
- Perform a stratified split.

In [4]:
# SOLUTION
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True,
                                                    stratify=y)

print('Number of training examples:', X_train.shape[0])
print('Number of test examples:', X_test.shape[0])

Number of training examples: 105
Number of test examples: 45
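Because the split is stratified and Iris contains 50 examples per class, each class should keep the same proportion in both subsets. The following is a minimal sketch for checking this. To stay self-contained, it uses stand-in arrays (`X_demo` and `y_demo` are hypothetical names) with the same shape and class balance as the Iris data rather than the dataset itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 150 examples, 4 features, 50 examples per class
X_demo = np.arange(600).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

_, _, y_tr, y_te = train_test_split(X_demo, y_demo,
                                    test_size=0.3,
                                    random_state=123,
                                    shuffle=True,
                                    stratify=y_demo)

# With a stratified split, the class counts should be balanced in both subsets
print('Training class counts:', np.bincount(y_tr))
print('Test class counts:', np.bincount(y_te))
```

Running the same `np.bincount` check on `y_train` and `y_test` from the cell above is a quick way to confirm that `stratify=y` was passed correctly.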


In [5]:
X_train[:5]

array([[5. , 2. , 3.5, 1. ],
       [5.4, 3.9, 1.3, 0.4],
       [5.6, 3. , 4.1, 1.3],
       [7.4, 2.8, 6.1, 1.9],
       [4.6, 3.4, 1.4, 0.3]])

Next, we will implement a function that generates bootstrap samples of the training set. In particular, we will perform the bootstrapping as follows:

- Create an index array with values 0, ..., 104.
- Draw a random sample (with replacement) from this index array using the `choice` method of a NumPy `RandomState` object that is passed to the function as `rng`. 
- Select training examples from the X array and labels from the y array using the new sample of indices.

In [6]:
# SOLUTION

def draw_bootstrap_sample(rng, X, y):
    sample_indices = np.arange(X.shape[0])
    bootstrap_indices = rng.choice(sample_indices,
                                   size=sample_indices.shape[0],
                                   replace=True)
    return X[bootstrap_indices], y[bootstrap_indices]
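To build some intuition for what the function above returns: because the indices are drawn with replacement, a bootstrap sample of size n contains duplicates, and on average only a fraction 1 - (1 - 1/n)^n of the original examples appears in it (roughly 63% for large n). The sketch below checks this empirically using index arrays only. The number of rounds (`n_rounds`) and the seed are illustrative choices and not part of the exercise.

```python
import numpy as np

rng = np.random.RandomState(123)
n = 105          # size of the training set used in this exercise
n_rounds = 1000  # number of bootstrap rounds (illustrative choice)

unique_fractions = np.empty(n_rounds)
for r in range(n_rounds):
    # Same sampling scheme as in draw_bootstrap_sample
    idx = rng.choice(np.arange(n), size=n, replace=True)
    unique_fractions[r] = np.unique(idx).shape[0] / n

# Expected fraction of distinct examples in one bootstrap sample
expected = 1 - (1 - 1 / n) ** n

print('Mean fraction of unique examples: %.3f' % unique_fractions.mean())
print('Theoretical expectation:          %.3f' % expected)
```

The examples that are not drawn in a given round are the so-called out-of-bag examples for that round.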

I added the following code cell for your convenience to double-check your solution. If your results don't match the results shown below, there is a bug in your implementation of the `draw_bootstrap_sample` function.

In [7]:
# DO NOT EDIT OR DELETE THIS CELL

rng = np.random.RandomState(123)
X_boot, y_boot = draw_bootstrap_sample(rng, X_train, y_train)

print('Number of training inputs from bootstrap round:', X_boot.shape[0])
print('Number of training labels from bootstrap round:', y_boot.shape[0])
print('Labels:\n', y_boot)

Number of training inputs from bootstrap round: 105
Number of training labels from bootstrap round: 105
Labels:
 [0 0 1 0 0 1 2 0 2 1 0 0 2 1 1 1 1 2 1 1 2 0 2 1 2 1 1 1 0 1 0 0 1 2 0 0 0
 0 2 1 1 2 1 2 1 1 2 1 2 0 1 1 2 2 1 0 1 0 2 2 0 1 0 2 0 0 0 0 1 2 0 0 1 0
 1 1 0 1 1 2 2 0 2 0 2 0 1 1 2 2 0 2 2 2 0 1 0 1 2 2 2 1 0 0 0]


### 2.2 Bagging classifier from decision trees (4 pts)

In this section, you will implement a bagging algorithm based on scikit-learn's `DecisionTreeClassifier`. I have provided a partial solution for you.

In [9]:
# SOLUTION

from sklearn.tree import DecisionTreeClassifier


class BaggingClassifier(object):

    def __init__(self, num_trees=10, random_state=123):
        self.num_trees = num_trees
        self.rng = np.random.RandomState(random_state)

    def fit(self, X, y):
        # All trees share one RandomState, so the whole ensemble is reproducible
        self.trees_ = [DecisionTreeClassifier(random_state=self.rng)
                       for _ in range(self.num_trees)]
        # Fit each tree on its own bootstrap sample of the training set
        for tree in self.trees_:
            X_boot, y_boot = draw_bootstrap_sample(self.rng, X, y)
            tree.fit(X_boot, y_boot)

    def predict(self, X):
        # One column of predicted class labels per tree
        ary = np.zeros((X.shape[0], len(self.trees_)), dtype=np.int64)
        for i, tree in enumerate(self.trees_):
            ary[:, i] = tree.predict(X)

        # Majority vote across the trees for each example (row)
        maj = np.apply_along_axis(lambda x: np.argmax(np.bincount(x)),
                                  axis=1,
                                  arr=ary)
        return maj
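The majority vote in the `predict` method above can be hard to read at first glance, so here is a small worked example. The `votes` array is made up for illustration: each row is one example, and each column holds the class label predicted by one tree. `np.bincount` counts how often each label occurs in a row, and `np.argmax` returns the label with the highest count.

```python
import numpy as np

# Hypothetical predictions: 4 examples (rows) x 3 trees (columns)
votes = np.array([[0, 0, 1],
                  [2, 1, 2],
                  [1, 1, 1],
                  [0, 2, 1]])  # three-way tie

majority = np.apply_along_axis(lambda x: np.argmax(np.bincount(x)),
                               axis=1,
                               arr=votes)
print(majority)  # [0 2 1 0]
```

Note that `np.argmax` returns the first maximum, so ties (as in the last row) are resolved in favor of the smallest class label.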

I added the following code cell for your convenience to double-check your solution. If your results don't match the results shown below, there is a bug in your implementation of the `BaggingClassifier()`.

In [10]:
# DO NOT EDIT OR DELETE THIS CELL

model = BaggingClassifier()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

print('Individual Tree Accuracies:')
for tree in model.trees_:
    # Use a separate name so the ensemble predictions are not overwritten
    tree_predictions = tree.predict(X_test)
    print('%.1f%%' % ((tree_predictions == y_test).sum() / X_test.shape[0] * 100))

print('\nBagging Test Accuracy: %.1f%%' % ((predictions == y_test).sum() / X_test.shape[0] * 100))

Individual Tree Accuracies:
88.9%
93.3%
97.8%
93.3%
93.3%
93.3%
91.1%
97.8%
97.8%
97.8%

Bagging Test Accuracy: 97.8%
