# Machine Learning with Scikit-Learn:<br/>Trees and Forests
## Jake VanderPlas

In [None]:
# The code in these cells is runnable.
# Click on this cell, then press Shift+Enter to run it, 
# or click the Run button in the toolbar.

print("Hello, World!")

# Outline

This notebook gives a conceptual introduction to decision trees and random forests, along with some examples of their usage in scikit-learn.

I will begin by exploring a simple decision tree for classification. After discussing the drawbacks of decision trees, I’ll turn to randomized collections of decision trees, known as random forests. The main benefit of random forests is the ability to fit complicated datasets while reducing the overfitting that single trees are prone to.

At the end of this notebook, the viewer will:

- Have an intuitive understanding of decision tree and random forest models
- Understand the term bagging and what it means in a machine learning context
- Understand how to use decision trees and random forests in scikit-learn

As usual, we'll start with imports and enable matplotlib's ``inline`` mode, so that figures display in the notebook rather than in a new window:

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

We'll also use the Seaborn style for our plots:

In [None]:
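# Matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8'
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')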

# Motivating Random Forests: Decision Trees

Random forests are an example of an *ensemble learner* built on decision trees.
For this reason, we'll start by discussing decision trees themselves.

Decision trees are extremely intuitive ways to classify or label objects: you simply ask a series of questions designed to zero in on the classification.
For example, if you wanted to build a decision tree to classify an animal you come across while on a hike, you might construct the one shown here:

![](figures/05.08-decision-tree.png)

The binary splitting makes this algorithm very efficient. In a well-constructed tree, each question will cut the number of options by approximately half, very quickly narrowing the options even among a large number of classes.
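
To make the halving argument concrete, here is a small back-of-the-envelope sketch. It assumes an idealized tree in which every question splits the remaining options exactly in half; the function name `questions_needed` is made up for this illustration.

```python
import math

def questions_needed(n_classes):
    """Number of yes/no questions needed if each one halves the options."""
    return math.ceil(math.log2(n_classes))

for n in (8, 100, 1000):
    print(n, "classes ->", questions_needed(n), "questions")
```

Real trees are rarely this balanced, so this is best read as a lower bound on the depth rather than a prediction.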

The trick, of course, is in deciding which questions to ask at each step!

In machine learning implementations of decision trees, the questions generally take the form of axis-aligned splits in the data: that is, each node in the tree splits the data into two groups using a cutoff value within one of the features.
Let's now look at an example of this.
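
As a minimal illustration of a single axis-aligned split, the sketch below fits a tree that is allowed to ask only one question (``max_depth=1``) and reads off which feature and cutoff it chose. The data and the ``_demo`` names are placeholders chosen for this example, so that nothing defined elsewhere in the notebook is overwritten.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Illustrative two-dimensional data with four classes
X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            random_state=0, cluster_std=1.0)

# max_depth=1 allows a single question, i.e. one axis-aligned split
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_demo, y_demo)

# Node 0 is the root; tree_.feature and tree_.threshold describe its split
feature = stump.tree_.feature[0]
threshold = stump.tree_.threshold[0]
print("root split: feature {0} <= {1:.2f}".format(feature, threshold))
```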

## Creating a Decision Tree

Consider the following two-dimensional data, which has one of four class labels:

In [None]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='viridis');

A simple decision tree built on this data will iteratively split the data along one or the other axis according to some quantitative criterion, and at each level assign the label of the new region according to a majority vote of points within it.
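
One common choice for the "quantitative criterion" is Gini impurity, which is the default ``criterion`` of scikit-learn's ``DecisionTreeClassifier``. The sketch below scores a candidate split by hand on a made-up one-feature dataset; the helper names ``gini`` and ``split_impurity`` are hypothetical and are not part of scikit-learn.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2, where p_k is the fraction of class k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, labels, threshold):
    """Size-weighted Gini impurity of the two groups made by x <= threshold."""
    left, right = labels[x <= threshold], labels[x > threshold]
    n = len(labels)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# Toy one-feature example (made-up values)
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([0, 0, 1, 1])

print("no split:      ", gini(y_toy))
print("split at 2.5:  ", split_impurity(x_toy, y_toy, threshold=2.5))
print("split at 1.5:  ", split_impurity(x_toy, y_toy, threshold=1.5))
```

A lower weighted impurity means purer regions, so a tree builder using this criterion would prefer the cutoff at 2.5 here.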

The following widget shows an interactive decision tree for this data:

In [None]:
import lesson2
lesson2.plot_tree_interactive(X, y);

Notice that after the first split, every point in the upper branch remains unchanged, so there is no need to further subdivide this branch.
Except for nodes that contain all of one color, at each level *every* region is again split along one of the two features.

## Decision Trees in Scikit-Learn

This process of fitting a decision tree to our data can be done in scikit-learn with the ``DecisionTreeClassifier`` estimator:

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X, y)

Using the utility function in ``lesson2.py`` lets us visualize the result:

In [None]:
import lesson2
lesson2.visualize_classifier(tree, X, y)
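
If the plotting helper is not available, the fitted rules can also be inspected as text. This is a small sketch using scikit-learn's ``export_text`` on a deliberately shallow tree; the feature names ``x0`` and ``x1`` and the ``_demo`` variables are placeholders chosen for this example.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            random_state=0, cluster_std=1.0)

# Keep the tree shallow so the printed rules stay readable
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(X_demo, y_demo)

# Each line is either a threshold test on one feature or a leaf's class
rules = export_text(shallow_tree, feature_names=['x0', 'x1'])
print(rules)
```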

## Decision Trees and Overfitting

Notice that as the depth increases, we tend to get very strangely shaped classification regions; for example, at a depth of five, there is a tall and skinny purple region between the yellow and blue regions.
It's clear that this is less a result of the true, intrinsic data distribution, and more a result of the particular sampling or noise properties of the data.
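
One rough way to quantify this is to hold out part of the data and compare training accuracy with test accuracy at several depths. The sketch below assumes a simple random train/test split; the variable names are chosen for this illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            random_state=0, cluster_std=1.0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

scores = {}
for depth in (2, 5, None):  # None means no depth limit
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print("max_depth={0}: train={1:.3f}, test={2:.3f}".format(depth, *scores[depth]))
```

An unrestricted tree keeps splitting until every leaf is pure, so it classifies its own training points perfectly; any gap between the two columns is a sign of overfitting.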

Such overfitting turns out to be a general property of decision trees: it is very easy to go too deep in the tree, and thus to fit details of the particular data rather than the overall properties of the distributions they are drawn from.
Another way to see this overfitting is to look at models trained on different subsets of the data. For example, in this figure we train two different trees, each on half of the original data:

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
for i in range(2):
    # pick a random half of the points, without replacement
    indices = np.random.choice(len(X), len(X) // 2, replace=False)
    tree = DecisionTreeClassifier()
    lesson2.visualize_classifier(tree, X[indices], y[indices], ax=ax[i])

It is clear that in some places, the two trees produce consistent results (e.g., in the four corners), while in other places, the two trees give very different classifications (e.g., in the regions between any two clusters).
The key observation is that the inconsistencies tend to happen where the classification is less certain—thus, by using information from *both* of these trees, we might come up with a better result!

Let's explore this with a widget:

In [None]:
from ipywidgets import interact

# store X & y here so that this widget still works
# when X & y are overwritten in later cells
randomized_tree_data = (X, y)

@interact
def fit_randomized_tree(random_state=100, frac=(0.0, 1.0)):
    X, y = randomized_tree_data
    clf = DecisionTreeClassifier(max_depth=15)
    i = np.arange(len(y))
    rng = np.random.RandomState(random_state)
    rng.shuffle(i)
    N = int(frac * X.shape[0])
    lesson2.visualize_tree(clf, X[i[:N]], y[i[:N]], boundaries=False,
                           xlim=(X[:, 0].min(), X[:, 0].max()),
                           ylim=(X[:, 1].min(), X[:, 1].max()))

Just as using information from two trees improves our results, we might expect that using information from many trees would improve our results even further; we'll explore this idea in the next section.
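
Here is a hedged sketch of what "using information from two trees" might look like in code. It fits one tree on each half of the data, measures where the two disagree on a grid, and combines them by averaging their class probabilities. The averaging rule and all ``_demo`` names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            random_state=0, cluster_std=1.0)

# Split the points into two disjoint halves and fit one tree per half
order = np.random.RandomState(0).permutation(len(X_demo))
half_a, half_b = order[:150], order[150:]
tree_a = DecisionTreeClassifier(random_state=0).fit(X_demo[half_a], y_demo[half_a])
tree_b = DecisionTreeClassifier(random_state=0).fit(X_demo[half_b], y_demo[half_b])

# Evaluate both trees on a regular grid covering the data
xx, yy = np.meshgrid(
    np.linspace(X_demo[:, 0].min(), X_demo[:, 0].max(), 100),
    np.linspace(X_demo[:, 1].min(), X_demo[:, 1].max(), 100))
grid = np.c_[xx.ravel(), yy.ravel()]

disagree = tree_a.predict(grid) != tree_b.predict(grid)
print("fraction of the grid where the trees disagree:", disagree.mean())

# Combine the two trees by averaging their class probabilities
avg_proba = (tree_a.predict_proba(grid) + tree_b.predict_proba(grid)) / 2
combined = tree_a.classes_[avg_proba.argmax(axis=1)]
```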

# Ensembles of Estimators: Bagging and Random Forests

The notion that multiple overfitting estimators can be combined to reduce the effect of overfitting is what underlies an ensemble method called *bagging*.
Bagging makes use of an ensemble (a grab bag, perhaps) of parallel estimators, each of which overfits the data, and averages the results to find a better classification.
An ensemble of randomized decision trees is known as a *random forest*.
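
To show the mechanics, here is a bare-bones sketch of bagging written by hand. It assumes plain bootstrap resampling (drawing points with replacement) and a simple majority vote, which simplifies what library implementations do; the names are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            random_state=0, cluster_std=1.0)
rng_demo = np.random.RandomState(0)

# Fit each tree on a bootstrap sample (points drawn with replacement)
trees = []
for _ in range(25):
    idx = rng_demo.randint(0, len(X_demo), len(X_demo))
    trees.append(DecisionTreeClassifier().fit(X_demo[idx], y_demo[idx]))

# Collect every tree's prediction: shape (n_trees, n_samples)
votes = np.array([t.predict(X_demo) for t in trees])

# Majority vote per sample; the labels here are the integers 0..3
y_bagged = np.array([np.bincount(col, minlength=4).argmax() for col in votes.T])
print("training accuracy of the vote:", np.mean(y_bagged == y_demo))
```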

This type of bagging classification can be done manually using scikit-learn's ``BaggingClassifier`` meta-estimator, as shown here:

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

model = DecisionTreeClassifier()
bag = BaggingClassifier(model, n_estimators=100, max_samples=0.8,
                        random_state=1)

bag.fit(X, y)
lesson2.visualize_classifier(bag, X, y)

In this example, we have randomized the data by fitting each estimator with a random subset of 80% of the training points.
In practice, decision trees are more effectively randomized by injecting some stochasticity in how the splits are chosen: this way all the data contributes to the fit each time, but the results of the fit still have the desired randomness.
For example, when determining which feature to split on, the randomized tree might select from among the top several features.
You can read more technical details about these randomization strategies in the <a href="http://scikit-learn.org/stable/modules/ensemble.html#forest" target="_blank">scikit-learn documentation and references within</a>.
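
As a rough sketch of split randomization, ``DecisionTreeClassifier`` exposes two relevant options: ``max_features`` (how many randomly chosen features are considered at each split) and ``splitter='random'`` (candidate thresholds are drawn at random). Treating these as the randomization described above is my reading, not a statement of exactly what a random forest does internally.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            random_state=0, cluster_std=1.0)

root_splits = []
train_scores = []
for seed in (0, 1, 2):
    # Every tree sees all of the data; only the split choices are randomized
    t = DecisionTreeClassifier(max_features=1, splitter='random',
                               random_state=seed)
    t.fit(X_demo, y_demo)
    root_splits.append((t.tree_.feature[0], t.tree_.threshold[0]))
    train_scores.append(t.score(X_demo, y_demo))

for seed, (feature, threshold) in enumerate(root_splits):
    print("seed {0}: root split on feature {1} at {2:.2f}".format(
        seed, feature, threshold))
```

Different seeds generally give different root splits, even though each tree was fit to identical data.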

In scikit-learn, such an optimized ensemble of randomized decision trees is implemented in the ``RandomForestClassifier`` estimator, which automatically takes care of all the randomization.
All you need to do is select a number of estimators, and it will very quickly (in parallel, if desired) fit the ensemble of trees:

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)
lesson2.visualize_classifier(model, X, y);

We see that by averaging over 100 randomly perturbed models, we end up with an overall model that is much closer to our intuition about how the feature space should be split.
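
To make the "averaging" explicit, the following sketch pulls the individual trees out of a fitted forest through its ``estimators_`` attribute and averages their class probabilities by hand. The ``_demo`` names are placeholders for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            random_state=0, cluster_std=1.0)
forest_demo = RandomForestClassifier(n_estimators=100, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Average the per-tree class probabilities manually
manual_proba = np.mean([t.predict_proba(X_demo)
                        for t in forest_demo.estimators_], axis=0)

# Compare with the forest's own probability estimate
matches = np.allclose(manual_proba, forest_demo.predict_proba(X_demo))
print("manual average matches forest.predict_proba:", matches)
```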

# Random Forest Regression

Earlier, we considered random forests within the context of classification.

Random forests can also be made to work in the case of regression (that is, continuous rather than categorical variables). The estimator to use for this is the ``RandomForestRegressor``, and the syntax is very similar to what we saw above.

Consider the following data, drawn from the combination of a fast and slow oscillation:

In [None]:
rng = np.random.RandomState(42)
x = 10 * rng.rand(200)

def model(x, sigma=0.3):
    fast_oscillation = np.sin(5 * x)
    slow_oscillation = np.sin(0.5 * x)
    noise = sigma * rng.randn(len(x))

    return slow_oscillation + fast_oscillation + noise

y = model(x)
plt.errorbar(x, y, 0.3, fmt='o');

Using the random forest regressor, we can find the best fit curve as follows:

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators=200)
forest.fit(x[:, None], y)

xfit = np.linspace(0, 10, 1000)
yfit = forest.predict(xfit[:, None])
ytrue = model(xfit, sigma=0)

plt.errorbar(x, y, 0.3, fmt='o', alpha=0.5)
plt.plot(xfit, yfit, '-r');
plt.plot(xfit, ytrue, '-k', alpha=0.5);

Here the true model is shown in the smooth black curve, while the random forest model is shown by the jagged red curve.
As you can see, the nonparametric random forest model is flexible enough to fit the multiperiod data, without our needing to specify a multiperiod model!
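
To put a number on the quality of the fit, one option is to compare the forest's error against the noise-free curve with that of a trivial baseline that always predicts the mean. This sketch regenerates similar data on its own; ``signal`` and the ``_demo`` names are chosen for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng_demo = np.random.RandomState(42)
x_demo = 10 * rng_demo.rand(200)

def signal(x):
    """Noise-free combination of a fast and a slow oscillation."""
    return np.sin(5 * x) + np.sin(0.5 * x)

y_demo = signal(x_demo) + 0.3 * rng_demo.randn(len(x_demo))

forest_demo = RandomForestRegressor(n_estimators=200, random_state=0)
forest_demo.fit(x_demo[:, None], y_demo)

x_grid = np.linspace(0, 10, 1000)
y_hat = forest_demo.predict(x_grid[:, None])

# Root-mean-square error against the noise-free curve
rmse_forest = np.sqrt(np.mean((y_hat - signal(x_grid)) ** 2))
rmse_mean = np.sqrt(np.mean((y_demo.mean() - signal(x_grid)) ** 2))
print("forest RMSE: {0:.3f}, constant-mean RMSE: {1:.3f}".format(
    rmse_forest, rmse_mean))
```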

# Example: Random Forest for Classifying Digits

In an earlier notebook, *Machine Learning with Scikit-Learn: Introduction to Machine Learning*, we took a quick look at the handwritten digits data.
Let's use that again here to see how the random forest classifier can be used in this context.

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()

To remind us what we're looking at, we'll visualize the first few data points:

In [None]:
# set up the figure
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

We can quickly classify the digits using a random forest as follows:

In [None]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target)
model = RandomForestClassifier()
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

We can take a look at the accuracy on the test data:

In [None]:
from sklearn import metrics
metrics.accuracy_score(ytest, ypred)

And for good measure, plot the confusion matrix:

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

mat = confusion_matrix(ytest, ypred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, cmap='Blues')
plt.xlabel('true label')
plt.ylabel('predicted label');

We find that a simple, untuned random forest results in a very accurate classification of the digits data.
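
For a per-digit breakdown rather than a single accuracy number, scikit-learn's ``classification_report`` is one convenient option. The sketch below is self-contained and fixes ``random_state`` for repeatability, so its numbers will not match the cells above exactly.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

digits_demo = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits_demo.data, digits_demo.target, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)

# Precision, recall, and F1 score for each of the ten digit classes
report = classification_report(y_te, clf.predict(X_te))
print(report)
```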

## Exploring Hyperparameters

In any ensemble estimator, one free parameter you can tune is the number of estimators.
Let's quickly look at how this affects the results in this case:

In [None]:
for n_estimators in [1, 5, 10, 50, 100, 500, 1000]:
    model = RandomForestClassifier(n_estimators=n_estimators)
    model.fit(Xtrain, ytrain)
    ypred = model.predict(Xtest)
    
    print(n_estimators, metrics.accuracy_score(ytest, ypred))

We see here that, in this case, a larger ``n_estimators`` value leads to better accuracy on the test data.

Similarly, we could explore the effect of ``max_depth``, which limits how many levels of splits each individual tree may have:

In [None]:
for max_depth in [1, 2, 4, 8, 16, 32]:
    model = RandomForestClassifier(n_estimators=1000,
                                   max_depth=max_depth)
    model.fit(Xtrain, ytrain)
    ypred = model.predict(Xtest)
    
    print(max_depth, metrics.accuracy_score(ytest, ypred))

We see that in general, deeper trees lead to better results.

These parameters, ``n_estimators`` and ``max_depth``, are known as *hyperparameters* (parameters that control the model itself, rather than parameters fit by the model).
In the next notebook, *Machine Learning with Scikit-Learn: Hyperparameters and Model Validation*, we will look in more detail at the process of choosing optimal hyperparameters.
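
As a small preview, the two loops above can be folded into a single cross-validated search with ``GridSearchCV``. This is only a sketch: the grid is deliberately tiny to keep it fast, and the particular values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

digits_demo = load_digits()

# A deliberately small grid; real searches usually cover more values
param_grid = {'n_estimators': [10, 50], 'max_depth': [4, 16]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(digits_demo.data, digits_demo.target)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy: {0:.3f}".format(search.best_score_))
```

Scoring each combination by cross-validation rather than on one test split makes the comparison less sensitive to how the data happened to be divided.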

# Summary

In this notebook, we looked at decision trees (very simple models that use binary splits to quickly predict a label on unknown data) and random forests (a collection of randomized decision trees).
Random forests are powerful estimators: they are fast to train and evaluate, and they can closely fit very complicated datasets.