# Bagging and Random Forests
### Jack Bennetto
### July 17, 2017

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.special import comb

from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV

## Objectives

Morning Objectives

 * Explain & construct a random forest (classification or regression).
 * Explain the relationship and difference between random forest and bagging.
 * Explain why random forests are more accurate than a single decision tree.

Afternoon Objectives

 * Get feature importances from a random forest.
 * Explain how OOB error is calculated and what it is an estimate of.

## Agenda

Morning Agenda:

 * Discuss ensemble methods
 * Review bias/variance tradeoff
 * Review decision trees
 * Discuss bagging (bootstrap aggregation)
 * Discuss random forests

Afternoon Agenda:

 * Discuss out-of-bag error
 * Discuss feature importance



## What is an Ensemble Method?

In general, an **ensemble** method combines many weak models to form a strong model.

Train multiple different models on the data. The overall prediction is

 * the average prediction, for a regressor, or
 * the plurality choice, for a classifier (the fraction of models voting for a class serves as its predicted probability).
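
As a minimal sketch of the two aggregation rules above, the snippet below combines made-up predictions from a handful of hypothetical models. The arrays are invented for illustration only and do not come from any fitted model.

```python
import numpy as np

# Hypothetical predictions from 3 regressors on 2 data points (made-up numbers)
regressor_preds = np.array([[1.0, 2.0],
                            [2.0, 4.0],
                            [3.0, 6.0]])
# Regression: the ensemble prediction is the average over models
ensemble_regression = regressor_preds.mean(axis=0)

# Hypothetical class labels from 5 classifiers on 4 data points (made-up numbers)
classifier_preds = np.array([[0, 1, 2, 1],
                             [0, 1, 2, 2],
                             [0, 2, 2, 1],
                             [1, 1, 0, 0],
                             [0, 1, 2, 1]])
n_classes = 3
# vote_fractions[c, j] is the fraction of models that chose class c for point j
vote_fractions = np.array([(classifier_preds == c).mean(axis=0)
                           for c in range(n_classes)])
# Classification: the ensemble prediction is the class with the most votes
ensemble_class = vote_fractions.argmax(axis=0)

print(ensemble_regression)
print(ensemble_class)
print(vote_fractions.max(axis=0))
```

The last line prints the vote fraction of the winning class, which is what the ensemble reports as its probability for that prediction.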

Why is the probability important?

## Ensembles: Intuition

Suppose we have 5 *independent* binary classifiers that are each 70% accurate. What's the overall accuracy?

In [None]:
def find_ensemble_accuracy(n, p):
    '''Given n independent classifiers, each with accuracy p,
    return the accuracy of their majority vote.'''
    ensemble_accuracy = 0
    # Sum the binomial probabilities that a strict majority is correct
    for k in range(n // 2 + 1, n + 1):
        ensemble_accuracy += comb(n, k) * p**k * (1-p)**(n-k)
    return ensemble_accuracy

In [None]:
find_ensemble_accuracy(5, 0.7)

In [None]:
ns = np.arange(1, 55, 2)
vfea = np.vectorize(find_ensemble_accuracy, excluded=['p'])
ensemble_accuracies = vfea(ns, p=0.7)

fig, ax = plt.subplots()
ax.plot(ns, ensemble_accuracies, '.')
ax.set_ylabel("Ensemble accuracy")
ax.set_xlabel("Number of independent 0.7-accuracy classifiers")
ax.set_title("Accuracy of an Ensemble of Independent Classifiers")

$$ \binom{5}{5} 0.7^5 + \binom{5}{4} 0.7^4 0.3 + \binom{5}{3} 0.7^3 0.3^2 \approx 0.84 $$

With 55 such classifiers we can achieve 99.9% accuracy.


Ok, so that's all great, but what's the limitation?

## How to Make Them Independent?

If the learners are all the same, ensembles don't help.

Train each learner on a different subset of the data.

 * Why is this better than a single good model?
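
To see why independence matters, here is a small simulation sketch. It assumes a simple, made-up dependence model: each classifier copies a single shared outcome with probability `share` and otherwise is right or wrong on its own. The function name and the `share` parameter are hypothetical and only illustrate the idea.

```python
import numpy as np

def simulate_majority_accuracy(n_models, p, share, n_trials=20000, seed=0):
    """Estimate majority-vote accuracy when classifiers are partly identical.

    share=0 gives fully independent classifiers; share=1 makes them all the same.
    """
    rng = np.random.default_rng(seed)
    # One shared right/wrong outcome per trial
    shared = rng.random(n_trials) < p
    # Independent right/wrong outcomes for every classifier
    own = rng.random((n_trials, n_models)) < p
    # Which classifiers copy the shared outcome on each trial
    copies = rng.random((n_trials, n_models)) < share
    correct = np.where(copies, shared[:, None], own)
    # The ensemble is right when a strict majority of classifiers is right
    return (correct.sum(axis=1) > n_models / 2).mean()

acc_independent = simulate_majority_accuracy(5, 0.7, share=0.0)
acc_partial = simulate_majority_accuracy(5, 0.7, share=0.5)
acc_identical = simulate_majority_accuracy(5, 0.7, share=1.0)
print(acc_independent, acc_partial, acc_identical)
```

Under this toy model, identical classifiers give back roughly the single-model accuracy, and the benefit of voting grows as the classifiers become less alike.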

## Bias and Variance

**Bias:** Error from the model's failure to match the training set (underfitting)

**Variance:** Error from sensitivity to which training set happened to be sampled (overfitting)

What is the bias of an unpruned decision tree?

## Review: Classification Trees

A **classification tree** is a decision tree that predicts whether a data point is in one class or another. Each branch node is a decision, choosing left or right based on the value of a certain feature. Each leaf node gives the probability that a data point is in one class or another.

Let's look at the tennis dataset from the other day.

In [None]:
# Read in our data
tennis_df = pd.read_table('data/tennis.txt', sep=r'\s+')
tennis_df.rename(columns={'playtennis': 'played'}, inplace=True)
#tennis_df['played'] = tennis_df['played'].apply(lambda x: 1 if x == 'yes' else 0)
tennis_df

In [None]:
from graphviz import Digraph
dot = Digraph(comment='A simple classification tree')

dot.node('O', 'outlook?', shape='diamond')
dot.node('1', "no", shape='rectangle')
dot.node('H', 'humidity?', shape='diamond')
dot.node('O2', 'outlook?', shape='diamond')
dot.node('W', 'wind?', shape='diamond')
dot.node('3', 'yes', shape='rectangle')
dot.node('T', 'temperature?', shape='diamond')
dot.node('4', 'yes', shape='rectangle')
dot.node('5', "no", shape='rectangle')
dot.node('2', "no", shape='rectangle')
dot.node('W2', 'wind?', shape='diamond')
dot.node('6', "no", shape='rectangle')
dot.node('7', 'yes', shape='rectangle')

dot.edge('O', '1', 'overcast')
dot.edge('O', 'H', 'not overcast')
dot.edge('H', 'O2', 'high')
dot.edge('H', 'W', 'normal')
dot.edge('W', '3', 'False')
dot.edge('W', 'T', 'True')
dot.edge('T', '4', 'mild')
dot.edge('T', '5', 'cool')
dot.edge('O2', '2', 'sunny')
dot.edge('O2', 'W2', 'rainy')
dot.edge('W2', '7', 'False')
dot.edge('W2', '6', 'True')
dot

A classification tree is built by

* iteratively dividing the nodes so that the impurity (entropy or Gini) is minimized,
* applying stopping conditions such as a depth limit, and
* pruning the tree by merging nodes.
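
As a rough sketch of the first step, the snippet below scores candidate splits by weighted Gini impurity on a tiny made-up dataset. The helper names `gini` and `split_impurity` are hypothetical, and a real tree would search over every feature and threshold rather than the two shown here.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def split_impurity(feature, labels, threshold):
    """Impurity after splitting at threshold, weighted by child size."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up data: one numeric feature and a binary label
toy_feature = np.array([1, 2, 3, 4, 5, 6])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

print(gini(toy_labels))                              # impurity before splitting
print(split_impurity(toy_feature, toy_labels, 1.5))  # a poor split
print(split_impurity(toy_feature, toy_labels, 3.5))  # a perfect split
```

The tree-building procedure picks the split with the lowest weighted impurity, here the threshold at 3.5.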

## Review: Regression Trees

A **regression tree** predicts a number rather than the probability that something is in one class or another. Prediction works the same as with classification trees, but the leaf nodes give a number rather than probabilities of a class.

To train a regression tree, we

* Iteratively divide the nodes such that *total squared error* is minimized,

$$\sum_{i \in L} (y_i - m_L)^2 + \sum_{i\in R} (y_i - m_R)^2$$

* Use various stopping conditions, such as a depth limit or a minimum leaf size, and
* Prune trees by merging nodes.

## Regression Trees: Example

 $x_1$ |   $x_2$ |  $y$
-------|---------|--------
 1     |    1    |   1
 0     |    0    |   2
 1     |    0    |   3
 0     |    1    |   4

 Prior to the split we guess the mean, 2.5, for everything, giving total squared error:
 
 $$ E = (1-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2  = 5$$
 After we split on $x_1$ we guess 2 for rows 1 & 3 and 3 for rows 2 & 4:
 
 $$ E = (1-2)^2 + (3-2)^2 + (2-3)^2 + (4-3)^2 = 4 $$
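
The arithmetic above can be checked with a short sketch that also evaluates the alternative split on $x_2$. The helper names are hypothetical; the data is the four-row table from this example.

```python
import numpy as np

# The four rows of the example table
toy_x1 = np.array([1, 0, 1, 0])
toy_x2 = np.array([1, 0, 0, 1])
toy_y = np.array([1.0, 2.0, 3.0, 4.0])

def total_squared_error(values):
    """Squared error when every value is predicted by the group mean."""
    return np.sum((values - values.mean()) ** 2)

def split_error(feature, target):
    """Total squared error after splitting a binary feature into two leaves."""
    left = target[feature == 0]
    right = target[feature == 1]
    return total_squared_error(left) + total_squared_error(right)

error_before = total_squared_error(toy_y)
error_x1 = split_error(toy_x1, toy_y)
error_x2 = split_error(toy_x2, toy_y)
print(error_before, error_x1, error_x2)
```

Splitting on $x_2$ leaves the error at 5, so a greedy tree would choose $x_1$ for the first split.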

## Decision Tree Summary

What are the pros and cons?

Pros
 * No feature scaling needed
 * Model nonlinear relationships
 * Can do both classification and regression
 * Robust
 * Highly interpretable

Cons
 * Can be expensive to train
 * Often poor predictors because of high variance


## Review: Bootstrapping

What is a bootstrap sample?

What have we learned that bootstrap samples are good for so far?

Let's get the median of some data.

In [None]:
import scipy.stats as scs

data = scs.uniform(0,10).rvs(100)
np.median(data)

What's the confidence interval of this estimate?

In [None]:
alpha = .05
medians = []
for _ in range(10000):
    bootstrap_sample = np.random.choice(data, len(data))
    medians.append(np.median(bootstrap_sample))
print("The {:.0f}% confidence interval is from {:.2f} to {:.2f}".format(
    100*(1-alpha),
    np.percentile(medians, 100*(alpha/2.)),
    np.percentile(medians, 100*(1-alpha/2.))))

Our procedure was
  * Take 10000 bootstrap samples.
  * Take the median of each sample.
  * The 95% confidence interval for the median runs from the 250th to the 9750th of the sorted bootstrap medians.

## Bagging (bootstrap aggregation)

We could repeatedly sample from the population, building decision tree
models and averaging the results.  But we only have one sample. Instead, we simulate multiple draws from the data by using multiple bootstrap samples.

In a bit more detail:

 * Take a bunch of bootstrap samples - say $n$
 * Train a high-variance, low-bias model on each of them
 * Average the results - this can reduce the variance by up to a factor of $n$ (the standard deviation by up to $\sqrt n$)


Question: Why is the reduction in variance less than a factor of $n$?


 * We are thinking about the population of all possible decision tree models on our data.
 * If we take $n$ samples *iid* from this distribution and average them, the variance goes down by a factor of $n$ (the standard deviation by $\sqrt n$).
 * There is some correlation between models because they are all trained on bootstrap samples from the same draw, so the variance of the average does not shrink all the way.
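
The effect of correlation can be illustrated with a simulation sketch. It assumes an idealized setting that is not part of the lecture: $n$ unit-variance predictions with a common pairwise correlation $\rho$, for which the variance of the average is $\rho + (1-\rho)/n$. The function name is hypothetical.

```python
import numpy as np

def variance_of_average(n_models, rho, n_trials=50000, seed=0):
    """Empirical variance of the mean of n equally correlated unit-variance draws."""
    rng = np.random.default_rng(seed)
    common = rng.standard_normal((n_trials, 1))            # part shared by all models
    individual = rng.standard_normal((n_trials, n_models)) # part unique to each model
    # Each column has variance 1 and pairwise correlation rho
    draws = np.sqrt(rho) * common + np.sqrt(1 - rho) * individual
    return draws.mean(axis=1).var()

n_models = 25
for rho in (0.0, 0.3):
    empirical = variance_of_average(n_models, rho)
    theoretical = rho + (1 - rho) / n_models
    print(rho, round(empirical, 3), round(theoretical, 3))

var_iid = variance_of_average(n_models, 0.0)
var_correlated = variance_of_average(n_models, 0.3)
```

With no correlation the variance drops to about $1/n$; with correlation it levels off near $\rho$ no matter how many models are averaged.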

## An Experiment

You're each going to be a decision tree on some data based on a bootstrap sample, and then we'll all together be a random forest.

In [None]:
data = load_iris()

# Split into test/train, using the same random state for everyone
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=462)

In [None]:
clf = DecisionTreeClassifier()
# each of you has a different bootstrap sample
bootstrap_sample = np.random.choice(len(X_train), len(X_train))
clf.fit(X_train[bootstrap_sample], y_train[bootstrap_sample])

In [None]:
print("Accuracy = {:.3f}".format(np.mean(clf.predict(X_test) == y_test)))

In [None]:
print("My prediction: {}".format(clf.predict(X_test)[0:20]))
print("Actual result: {}".format(y_test[0:20]))

What is your prediction?
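
For anyone running this alone, the classroom exercise can be simulated with a sketch like the one below. It assumes 25 simulated "students" and uses its own variable names so it does not overwrite the cells above. Strictly speaking this is plain bagging, since every tree considers all features at each split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, random_state=462)

rng = np.random.default_rng(0)
n_trees = 25  # hypothetical class size
all_preds = []
for _ in range(n_trees):
    # Each "student" draws row indices with replacement and fits a tree
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    all_preds.append(tree.predict(X_te))
all_preds = np.array(all_preds)  # shape (n_trees, n_test_points)

# Count the votes for each of the 3 classes and take the plurality
votes = np.array([(all_preds == c).sum(axis=0) for c in range(3)])
bagged_pred = votes.argmax(axis=0)
bagged_accuracy = np.mean(bagged_pred == y_te)
print("Bagged accuracy = {:.3f}".format(bagged_accuracy))
```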

## Random Forests

Random Forests improve on Bagging by de-correlating the trees using a technique called Subspace Sampling.

 * At each decision tree split, only a random subset of $m$ of the $k$ features (often $m = \sqrt k$) is considered.
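
A minimal sketch of the sampling step, assuming a made-up feature count of 16: a fresh subset of feature indices is drawn for every split, and only those features are searched for the best split. In scikit-learn this behavior is controlled by the `max_features` parameter of the forest.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 16               # total number of features (made-up value)
m = int(np.sqrt(k))  # number of features considered at each split

for split_number in range(3):
    # Draw m distinct feature indices; only these are candidates for this split
    candidates = rng.choice(k, size=m, replace=False)
    print("split", split_number, "considers features", sorted(candidates.tolist()))
```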

## Random Forest Parameters

 * Total number of trees
 * Number of features to use at each split
 * Individual decision tree Parameters
    - e.g., tree depth, pruning, split criterion

In general, random forests are fairly robust to the choice of hyperparameters and to overfitting.
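
As an illustration of how these parameters map onto scikit-learn, the sketch below sets each one explicitly on a `RandomForestClassifier`. The specific values are arbitrary choices for the example, not recommendations, and the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris_X, iris_y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,     # total number of trees
    max_features='sqrt',  # number of features considered at each split
    max_depth=None,       # per-tree setting: grow until the leaves are pure
    min_samples_leaf=1,   # per-tree setting: minimum leaf size
    criterion='gini',     # per-tree setting: split criterion
    random_state=0,
)
cv_scores = cross_val_score(forest, iris_X, iris_y, cv=5)
print(cv_scores.mean())
```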

## Pros and Cons of Random Forest

Pros

 * Often give near state-of-the-art performance
 * Good out-of-the-box performance
 * No feature scaling needed
 * Model nonlinear relationships

Cons

 * Can be expensive to train (though can be done in parallel)
 * Not interpretable

## Accuracy

Let's investigate the accuracy of a random forest compared with a single decision tree using the breast cancer dataset.

In [None]:
# Load the breast cancer data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

X = X.drop(['worst concave points', 'mean concave points', 'worst perimeter', 'worst radius', 'worst area', 'mean concavity'], axis=1)
# Split into test/train
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=.33,
                                                    random_state=0)


In [None]:
X_train.columns

First, consider a decision tree, doing a grid search over hyperparameters.

In [None]:
# Parameter Search                                     
model = DecisionTreeClassifier()
depth_parm = np.arange(1, 12, 1)
num_samples_parm = np.arange(5,95,10)
parameters = {'max_depth' : depth_parm,
             'min_samples_leaf' : num_samples_parm}
clf = GridSearchCV(model, parameters, cv=10)
clf.fit(X_train,y_train)
print(clf.score(X_test, y_test))

In [None]:
clf.best_estimator_

In [None]:
clf.best_params_

Now random forests.

In [None]:

# Train and fit model                                                   
rf = RandomForestClassifier(n_estimators=1000,
                           max_features='sqrt',
                           random_state=0)
rf.fit(X_train, y_train)
                                     
# Test Prediction
pred = rf.predict(X_test)
print(rf.score(X_test, y_test))

So that's better.

# Afternoon Lecture

## Interpreting Random Forests

## Review: Bagging and Random Forests

What is bagging?

Can bagging be used with other models?

What's the difference between bagging and random forests?

In [None]:
1/np.e

## Out-Of-Bag Error

Measuring error of a bagged model.

* Out-of-bag (OOB) error is a method of estimating the error of ensemble methods that use Bagging.  
* For each data point, about 37% ($\approx 1/e$) of the estimators will not have seen it during training.
* Test each data point only on the estimators that didn't see that data point during training.  
* In practice we often use cross validation anyway, because we're comparing with other models and want to measure the accuracy the same way.
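
Both points can be sketched in a few lines. The first part checks the 37% figure by drawing one bootstrap sample; the second uses scikit-learn's `oob_score=True` option, which stores the out-of-bag accuracy in `oob_score_`. The variable names and the choice of 200 trees are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Fraction of rows left out of a single bootstrap sample
rng = np.random.default_rng(0)
n_rows = 10000
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)
fraction_left_out = 1 - len(np.unique(bootstrap_idx)) / n_rows
print(fraction_left_out, 1 / np.e)

# Out-of-bag accuracy from a forest, with no separate test set needed
X_bc, y_bc = load_breast_cancer(return_X_y=True)
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf_oob.fit(X_bc, y_bc)
print(rf_oob.oob_score_)
```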

---

## Feature Importances

One of the challenges of random forests is the lack of interpretability. Feature importances are a measure of which features most strongly affect the predictions.

This can be a critical business question. For example, with churn analysis, it's generally more important to understand *why* customers are churning than to predict which customers are going to churn.

How should we measure it?

## Feature Importances: Mean Decrease Impurity

How much does each feature decrease the impurity?

To compute the importance of the $j^{th}$ feature:

 * For each tree, each split is made in order to reduce the total impurity of the tree (Gini/entropy/MSE); we can record the magnitude of the reduction.
 * Then the importance of a feature is the average decrease in impurity across trees in the forest, as a result of splits defined by that feature.  
 * This is implemented in sklearn as the `feature_importances_` attribute.
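
A short sketch of the averaging step, assuming a forest fit on the iris data: each fitted tree exposes its own normalized impurity-based importances, and averaging them across the trees reproduces the forest-level values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_data = load_iris()
mdi_forest = RandomForestClassifier(n_estimators=50, random_state=0)
mdi_forest.fit(iris_data.data, iris_data.target)

# Impurity-based importances of each individual tree, one row per tree
per_tree = np.array([tree.feature_importances_ for tree in mdi_forest.estimators_])
# Average across trees, then renormalize so the importances sum to 1
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

for name, value in zip(iris_data.feature_names, averaged):
    print("{:20s} {:.3f}".format(name, value))
```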

## Feature Importances: Mean Decrease Accuracy

How much does randomly mixing values of a feature affect accuracy?

To compute the importance of the $j^{th}$ feature:

 * When the $b^{th}$ tree is grown, use it to predict the OOB samples and record accuracy.
 * Scramble the values of the $j^{th}$ feature in the OOB samples and do the prediction again.  Compute the new (lower) accuracy.
 * Average the decrease in accuracy across all trees.
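
scikit-learn offers a closely related measure through `sklearn.inspection.permutation_importance`. The sketch below is a variation on the procedure above, not the same thing: it shuffles each feature on a held-out test set rather than on the per-tree OOB samples. The split sizes and the variable names are assumptions for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(bc.data, bc.target,
                                          test_size=0.33, random_state=0)
perm_forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the drop in test accuracy
result = permutation_importance(perm_forest, X_te, y_te, n_repeats=10, random_state=0)

# Show the five features whose shuffling hurts accuracy the most
order = result.importances_mean.argsort()[::-1]
for i in order[:5]:
    print("{:25s} {:.4f}".format(bc.feature_names[i], result.importances_mean[i]))
```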

## Feature Importances: ipython

In [None]:
# Plot the feature importance
feat_scores = pd.DataFrame({'Mean Decrease Impurity' : rf.feature_importances_},
                           index=X.columns)
feat_scores = feat_scores.sort_values(by='Mean Decrease Impurity')
feat_scores.plot(kind='barh')

#### Mean Decrease Accuracy

A different approach to calculating feature importances shuffles the values of a feature and measures the decrease in accuracy.

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import r2_score
from collections import defaultdict

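# Use the columns of X (some features were dropped above) so names match column indices
names = X.columns

# Note: a regressor is fit to the 0/1 target here, so "accuracy" below means R^2
rf = RandomForestRegressor()
scores = defaultdict(list)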
 
# crossvalidate the scores on a number of 
# different random splits of the data
splitter = ShuffleSplit(100, test_size=.3)

for train_idx, test_idx in splitter.split(X, y):
    X_train, X_test = X.values[train_idx], X.values[test_idx]
    y_train, y_test = y.values[train_idx], y.values[test_idx]
    rf.fit(X_train, y_train)
    acc = r2_score(y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = r2_score(y_test, rf.predict(X_t))
        scores[names[i]].append((acc-shuff_acc)/acc)

score_series = pd.DataFrame(scores).mean()
scores = pd.DataFrame({'Mean Decrease Accuracy' : score_series})
scores.sort_values(by='Mean Decrease Accuracy').plot(kind='barh')