# 05: Bagging and random forests

In [44]:
%matplotlib inline

import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt

import mylib as my

The goal of ensemble methods is to reduce bias and/or variance and thereby help prevent overfitting. In this notebook we look at two ensemble methods: bagging and random forests.

## Bootstrap samples
Let's start by seeing how we can draw a bootstrap sample given a dataset $D$. A bootstrap sample is a sample drawn randomly with replacement from the given dataset such that the size of the sample is the same as the size of the original dataset. That means some examples will show up multiple times in the drawn sample, while others will not appear at all.

In the example below, we are using a subset of the car dataset with classes indicating whether the car is in acceptable or unacceptable condition. The description of the original car dataset can be found at [this page](https://archive.ics.uci.edu/ml/datasets/Car+Evaluation).

In [45]:
df = pd.read_csv('datasets/ua_car.csv')
ds = my.DataSet(df, y=True)
print(df.iloc[:,-1].value_counts())
df.head()

unacc    384
acc      384
Name: y, dtype: int64


Unnamed: 0,buying,maintenance,doors,persons,luggage,safety,y
0,med,vhigh,4,4,small,med,unacc
1,vhigh,med,2,2,small,med,unacc
2,med,vhigh,5more,4,med,low,unacc
3,high,vhigh,3,more,med,high,unacc
4,med,med,4,2,small,med,unacc


In [46]:
train, test = ds.train_test_split(test_portion=.25, shuffle=True)
print(train)
print(test)

    buying maintenance  doors persons luggage safety      y
37     med       vhigh      3       4     med    med  unacc
429  vhigh         low      3       4     big    med    acc
597    med       vhigh  5more    more     med   high    acc
193  vhigh         low  5more       4   small    low  unacc
555   high         low  5more       4     med    med    acc
..     ...         ...    ...     ...     ...    ...    ...
763    low         low      3    more   small    med    acc
113  vhigh        high      2       2     med    med  unacc
195    low       vhigh      4       2     med    med  unacc
628    med        high  5more       4     med   high    acc
684    low       vhigh      2    more     big    med    acc

[576 rows x 7 columns]
    buying maintenance  doors persons luggage safety      y
632    med        high  5more    more     med    med    acc
43    high       vhigh      4       4   small   high  unacc
259   high       vhigh      2    more     big    med  unacc
196  vhigh      

Given the above training set, we can draw a bootstrap sample like this:

In [47]:
sample_indexes = np.random.randint(0, train.N, size=train.N)
bootstrap_sample = train.examples.iloc[sample_indexes, :]
bootstrap_sample # Same num of rows as train of course

Unnamed: 0,buying,maintenance,doors,persons,luggage,safety,y
294,high,vhigh,2,more,small,med,unacc
479,high,high,4,more,med,high,acc
159,vhigh,high,2,4,big,low,unacc
209,vhigh,high,4,2,small,med,unacc
661,med,med,5more,4,small,med,acc
...,...,...,...,...,...,...,...
350,high,high,5more,4,small,med,unacc
422,vhigh,low,2,4,big,med,acc
672,med,low,3,4,small,med,acc
451,vhigh,low,5more,more,small,high,acc


of which:

In [48]:
print("{:.2%}".format(pd.unique(bootstrap_sample.index).shape[0] / len(bootstrap_sample)), 'are unique examples')
# pd.unique(bootstrap_sample.index) holds the distinct row labels in bootstrap_sample; shape[0] is how many there are
print("{:.2%}".format(1 - pd.unique(bootstrap_sample.index).shape[0] / len(bootstrap_sample)), 'are repeated examples')

64.06% are unique examples
35.94% are repeated examples
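
The share of unique examples above is close to what theory predicts. Each of the $N$ draws misses a given example with probability $1 - 1/N$, so the example is left out of the whole sample with probability $(1 - 1/N)^N$, which approaches $1/e \approx 0.368$ as $N$ grows. The sketch below checks this numerically; the helper name `expected_unique_fraction` and the fixed seed are assumptions made for illustration, not part of `mylib`.

```python
import numpy as np

def expected_unique_fraction(n):
    """Expected share of distinct examples in a bootstrap sample of size n."""
    # Probability that one particular example is never drawn in n draws
    p_never_drawn = (1.0 - 1.0 / n) ** n
    return 1.0 - p_never_drawn

n = 576  # size of the training set used above
expected = expected_unique_fraction(n)  # about 0.632

# Draw one bootstrap sample of row positions and measure the share of distinct rows
rng = np.random.default_rng(0)  # seed chosen only to make the run repeatable
positions = rng.integers(0, n, size=n)
observed = np.unique(positions).size / n

print("expected: {:.2%}".format(expected))
print("observed: {:.2%}".format(observed))
```

The remaining third or so of the examples are never seen by a tree trained on that sample.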


Sometimes, it's useful to be able to identify the examples that are included in a given sample and those that aren't. Here are two functions for doing so.

In [49]:
def examples_in_sample(examples, sample):
    return examples[examples.index.isin(sample.index)]  # keep rows whose index label appears in the sample

def examples_not_in_sample(examples, sample):
    return examples[~examples.index.isin(sample.index)]

Here are the examples from the training set that are in the above bootstrap sample:

In [50]:
examples_in_sample(train.examples, bootstrap_sample)

Unnamed: 0,buying,maintenance,doors,persons,luggage,safety,y
597,med,vhigh,5more,more,med,high,acc
555,high,low,5more,4,med,med,acc
459,high,high,2,4,big,high,acc
198,vhigh,med,3,4,small,med,unacc
148,high,med,5more,2,med,high,unacc
...,...,...,...,...,...,...,...
313,high,low,3,more,small,med,unacc
661,med,med,5more,4,small,med,acc
180,low,vhigh,2,2,small,med,unacc
113,vhigh,high,2,2,med,med,unacc


And here are the examples from the training set that are not in the above bootstrap sample:

In [51]:
examples_not_in_sample(train.examples, bootstrap_sample)

Unnamed: 0,buying,maintenance,doors,persons,luggage,safety,y
37,med,vhigh,3,4,med,med,unacc
429,vhigh,low,3,4,big,med,acc
193,vhigh,low,5more,4,small,low,unacc
380,vhigh,high,3,more,big,med,unacc
361,vhigh,med,2,more,small,low,unacc
...,...,...,...,...,...,...,...
270,vhigh,high,2,2,med,med,unacc
312,med,low,3,more,med,low,unacc
763,low,low,3,more,small,med,acc
195,low,vhigh,4,2,med,med,unacc


## Bagging
The simplest form of ensemble methods is called **bagging**, which stands for **bootstrap aggregation**. The idea is simple:
* take $T$ bootstrap samples from the given dataset
* for each bootstrap sample, train a decision tree DT
* the predicted label of an unseen example is the average (for regression problems) or the plurality vote (for classification problems) of the outputs predicted by the $T$ trained trees.

Here is a simple implementation of bagging.

In [52]:
class Bagger:
    def __init__(self, dataset, nTrees):
        self.ds = dataset
        self.nTrees = nTrees
        self.classifiers = []
        self.samples = []
        self.make_trees()

    def make_trees(self):
        indexes = np.random.randint(0, self.ds.N,(self.ds.N,self.nTrees))
        for i in range(self.nTrees):
            # Create bootstrap samples one for each tree
            self.samples.append(self.ds.examples.iloc[indexes[:, i], :])

            # Build classifiers
            self.classifiers.append(my.DecisionTreeClassifier(my.DataSet(self.samples[i])))

    def predict(self, unseen):
        """
        Returns the most probable label (or class) for each unseen input. The
        unseen needs to be a data series with the same features (as indexes) as the 
        training data. It can also be a data frame with the same features as 
        the training data.
        """
        if unseen.ndim == 1:
            classes = np.array([ dt.predict(unseen) for dt in self.classifiers ])
            classes = classes[classes != None]
            return st.mode(classes).mode[0]
        
        else:
            return np.array([self.predict(unseen.iloc[i,:]) for i in range(len(unseen))]) 
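
The vote in `predict` relies on `st.mode`. To my knowledge, its handling of string labels and the shape of its result have changed across SciPy releases, so `st.mode(classes).mode[0]` may not run on every installation. As a hedged alternative, here is a minimal sketch of a plurality vote that uses only the standard library; the helper name `plurality_vote` is an assumption for illustration and is not part of `mylib`.

```python
from collections import Counter

def plurality_vote(labels):
    """Return the most common label, ignoring None; None if nobody voted."""
    # Trees may return None when they cannot classify an example
    votes = [label for label in labels if label is not None]
    if not votes:
        return None
    # most_common(1) returns [(label, count)]; ties go to the label seen first
    return Counter(votes).most_common(1)[0][0]

print(plurality_vote(['acc', 'unacc', None, 'acc']))  # acc
```

Inside `predict`, the list passed to such a helper would be the per-tree predictions for a single example.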

## Random forests
Bagging is not exclusive to decision trees; the random forests method, however, is bagging applied exclusively to decision trees. In addition to obtaining $T$ random bootstrap samples, it also requires what is sometimes called **feature bagging**. Feature bagging requires that only a randomly selected subset of the features is considered at each node during the construction of the decision tree.

That means we need to modify our implementation of the decision tree such that it takes a numeric parameter named `nFeatures` which defaults to 0. If `nFeatures` is 0, then the tree functions as normal. If not, it picks this many features randomly at each node and considers only the best of those during the construction of the tree. The provided `my.DecisionTreeClassifier` class already has these changes.
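
To make feature bagging concrete, here is a minimal sketch of the selection step a node might perform before searching for its best split. It is not the code inside `my.DecisionTreeClassifier`; the function name `pick_candidate_features` and the use of a NumPy random generator are assumptions for illustration.

```python
import numpy as np

def pick_candidate_features(features, nFeatures, rng):
    """Return the features a single node is allowed to consider."""
    # nFeatures == 0 means "no feature bagging": consider every feature
    if nFeatures == 0 or nFeatures >= len(features):
        return list(features)
    # Otherwise draw nFeatures distinct features at random
    return rng.choice(features, size=nFeatures, replace=False).tolist()

rng = np.random.default_rng(0)  # seed chosen only to make the run repeatable
features = ['buying', 'maintenance', 'doors', 'persons', 'luggage', 'safety']

subset = pick_candidate_features(features, 3, rng)
print(subset)  # three distinct feature names, different at each call
```

Because a fresh subset is drawn at every node, different trees tend to split on different features, which makes their errors less correlated.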

For prediction, a plurality vote of the $T$ predicted labels is returned. Here is a simple implementation of random forests. Think about the similarities and differences between these two classes.

In [53]:
class RandomForest:
    def __init__(self, dataset, nTrees, nFeatures=0):
        self.ds = dataset
        self.nTrees = nTrees
        self.nFeatures = nFeatures
        self.classifiers = []
        self.samples = []
        self.make_forest()

    def make_forest(self):
        indexes = np.random.randint(0, self.ds.N,(self.ds.N,self.nTrees))
        for i in range(self.nTrees):
            # Create bootstrap samples one for each tree
            self.samples.append(self.ds.examples.iloc[indexes[:, i], :])

            # Build classifiers
            self.classifiers.append(my.DecisionTreeClassifier(my.DataSet(self.samples[i]), nFeatures=self.nFeatures))

    def predict(self, unseen):
        """
        Returns the most probable label (or class) for each unseen input. The
        unseen needs to be a data series with the same features (as indexes) as the 
        training data. It can also be a data frame with the same features as 
        the training data.
        """
        if unseen.ndim == 1:
            classes = np.array([ dt.predict(unseen) for dt in self.classifiers ])
            classes = classes[classes != None]
            return st.mode(classes).mode[0]
        
        else:
            return np.array([self.predict(unseen.iloc[i,:]) for i in range(len(unseen))]) 

## Testing

In [54]:
# dt = my.DecisionTreeClassifier(train)
# cm = my.confusion_matrix(test.target, dt.predict(test.examples.iloc[:,:-1]))
# accuracy = np.trace(cm) / np.sum(cm)

# print(cm)
# print('Decision tree accuracy: ', accuracy)


# bg = Bagger(train, 20)
# cm = my.confusion_matrix(test.target, bg.predict(test.examples.iloc[:,:-1]))
# accuracy = np.trace(cm) / np.sum(cm)

# print(cm)
# print('Bagger accuracy: ', accuracy)

# rf = RandomForest(train, 20, nFeatures=3)
# cm = my.confusion_matrix(test.target, rf.predict(test.examples.iloc[:,:-1]))
# accuracy = np.trace(cm) / np.sum(cm)

# print(cm)
# print('Random forests accuracy: ', accuracy)

You should try different values for `nTrees` and `nFeatures`. These variables are considered hyperparameters, and cross-validation can be used to determine the best values for them. Common values for `nFeatures` are $\sqrt{m}$ and $\log_2(m)$ where $m$ is the number of features.
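
As a small worked example, the sketch below evaluates both rules of thumb for a given number of features. Rounding down and keeping at least one feature are assumptions made here; other rounding conventions are equally defensible. The helper name `candidate_n_features` is hypothetical.

```python
import math

def candidate_n_features(m):
    """Rule-of-thumb values for nFeatures given m features (rounded down)."""
    return {
        'sqrt': max(1, int(math.sqrt(m))),
        'log2': max(1, int(math.log2(m))),
    }

# The car dataset used in this notebook has 6 features
print(candidate_n_features(6))  # {'sqrt': 2, 'log2': 2}
```

These values are only starting points for a search over `nFeatures`.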

## Out of bag score
Another way of testing random forests is to calculate the so-called **out-of-bag** score. Such a score does not require splitting the dataset into training and test sets. One way to calculate it is to identify, for each example $x$ in the dataset, the list of trees that are trained using samples that do not include $x$; let's call this list of trees $D_x$. We then call the `predict` method on each tree of $D_x$ to get the list of predicted classes for example $x$; let's call this list of classes $C_x$. Finally, we find the class in $C_x$ that repeats the most and report it as the predicted class of $x$; let's call it $h_x$.

Doing this for each example in the dataset gives us an array of predicted classes, which we can compare against the actual target classes of these examples. Using the confusion matrix we can report the accuracy as the out of bag score.

Notice that the above implementations of `Bagger` and `RandomForest` already give you access to the bootstrap samples and the classifiers that are trained on them. You can use that to find out which samples do not include a given example.
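
One possible way to do that bookkeeping is sketched below. It assumes the sampled row positions are kept as an array of shape `(N, nTrees)`, like the `indexes` array built inside `make_forest`; the function name `out_of_bag_mask` is hypothetical. It only works out which examples are out of bag for which tree and does not compute the score itself.

```python
import numpy as np

def out_of_bag_mask(indexes, n):
    """
    indexes: (n, nTrees) array; column t holds the row positions drawn for tree t.
    Returns an (n, nTrees) boolean array that is True where row r was NOT
    drawn for tree t, that is, where r is out of bag for that tree.
    """
    n_trees = indexes.shape[1]
    mask = np.ones((n, n_trees), dtype=bool)
    for t in range(n_trees):
        # Every row position that appears in column t is "in bag" for tree t
        mask[indexes[:, t], t] = False
    return mask

# Toy case: 3 examples, 2 trees
indexes = np.array([[0, 1],
                    [0, 2],
                    [2, 2]])
mask = out_of_bag_mask(indexes, 3)
print(mask)
# Tree 0 drew rows {0, 2}, so only row 1 is out of bag for it.
# Tree 1 drew rows {1, 2}, so only row 0 is out of bag for it.
```

Row `r` of the mask then tells you which trees belong to $D_x$ for the example at position `r`.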

## CHALLENGE
Write a function that calculates the out of bag score as described above given three arguments: a dataset, number of trees (`nTrees`), and number of features (`nFeatures`). The function should use these arguments to create a random forest object to use for calculating this score.

Test and report the out of bag scores for the whole car dataset and for when `nTrees` is 10, 15, and 20.

# Challenge Accepted

The function below takes the given parameters: a dataset, the number of trees, and the number of features.

I start by creating a `RandomForest` classifier and an array that will hold the prediction for each example in the dataset `ds`. I then go through each example in `ds` and find the trees of that classifier whose bootstrap sample does not contain the current example. I then predict the target of the example using the trees I found. I take the most popular prediction as that example's prediction and add it to the array holding predictions.

After going through this process for each example, I take the predictions held in an array and compare them with the actual outcomes from the dataset. I compute the accuracy and return the value.

In [55]:
def calc_out_of_bag_score(ds, nTrees, nFeatures):
    # create RandomForest object
    rf = RandomForest(ds, nTrees=nTrees, nFeatures=nFeatures)
    hx_bag = []
    
    for x in range(ds.N):
        # this example
        x_as_df = ds.examples.iloc[x:x+1,:]
        
        # make list of trees whose sample doesn't include x
        dx = []
        for dt in rf.classifiers:
            # if x is not in decision tree's (dt) sample set
            if not x_as_df.index.isin(dt.dataset.examples.index)[0]:
                dx.append(dt)
        
        # call predict on each tree in Dx for this x
        predictions = []
        for dt in dx:
            predictions.append(dt.predict(x_as_df.iloc[:,:-1]))
        
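        # find the most frequent class in Cx, call it hx, and append it to the
        # array of predicted classes associated with ds.examples
        if not predictions:
            # every tree saw x, so there is no out-of-bag vote; fall back to the
            # true label (note: this counts x as correct and inflates the score)
            hx = x_as_df.iloc[0, -1]
        else:
            # plurality vote over the out-of-bag predictions
            hx = pd.Series([p[0] for p in predictions]).mode().iloc[0]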
        
        hx_bag.append(hx)
        
    # compare hx's with actual targets via confusion matrix and report accuracy
    cm = my.confusion_matrix(ds.examples.iloc[:,-1].values, np.array(hx_bag))
    accuracy = np.trace(cm) / np.sum(cm)
    return accuracy

In [56]:
out_of_bag_score_10_trees = calc_out_of_bag_score(ds, 10, 3)
out_of_bag_score_15_trees = calc_out_of_bag_score(ds, 15, 3)
out_of_bag_score_20_trees = calc_out_of_bag_score(ds, 20, 3)

In [57]:
print(f'Out-of-bag score for Random Forest using 10 trees:\n{out_of_bag_score_10_trees}\n')
print(f'Out-of-bag score for Random Forest using 15 trees:\n{out_of_bag_score_15_trees}\n')
print(f'Out-of-bag score for Random Forest using 20 trees:\n{out_of_bag_score_20_trees}\n')

Out-of-bag score for Random Forest using 10 trees:
0.8919270833333334

Out-of-bag score for Random Forest using 15 trees:
0.8919270833333334

Out-of-bag score for Random Forest using 20 trees:
0.83203125

