# Random Forests

Random forests are a very successful machine learning algorithm developed in the late 1990s and early 2000s. You can read the [original notes online][1] from their creator, Leo Breiman. A random forest is an extension of the decision tree model.

## Random forests are a collection of decision trees

A random forest is simply a collection of decision trees built from the same training data. The user chooses how many trees the final model contains. For regression, the random forest makes a prediction by averaging the predictions of the individual trees.

### Isn't a decision tree deterministic?

Yes. The nodes and branches of a decision tree are created deterministically, meaning that the tree will always be built the same way given the same input and output data. There is one slight exception: when multiple splits of the data are tied for the best split, scikit-learn randomly selects which one to take.
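As a rough check of this determinism, the sketch below fits two trees on the same data and compares their splits. The data is synthetic (generated with `make_regression`) and the names `X_demo`, `y_demo`, `tree_a`, and `tree_b` are made up for this illustration. Setting `random_state` is an assumption added here so that any tied splits are resolved the same way in both fits.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

# Fixing random_state also pins down how tied splits are resolved
tree_a = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)
tree_b = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Compare the feature index and threshold chosen at every node
same_features = np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature)
same_thresholds = np.allclose(tree_a.tree_.threshold, tree_b.tree_.threshold)
print(same_features, same_thresholds)
```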

### A random forest adds two sources of randomness to the decision tree

Each decision tree created in a random forest will be different. There are **two sources of randomness**. The first source of randomness comes from choosing different random sets of data to build the trees from, so each tree is built from a different set of observations. The dataset for each tree is created through a procedure called **bootstrapping**: rows are sampled from the original dataset **with replacement**. Typically, each bootstrapped dataset has the same number of rows as the original. Because the data is sampled with replacement, many rows will be repeated and others will be absent.
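To get a feel for how many rows are repeated or absent, here is a minimal sketch of bootstrapping using only the standard library. The row count of 10,000 is arbitrary and the variable names are invented for this example. It is not how scikit-learn implements the sampling internally.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n_rows = 10_000
row_ids = list(range(n_rows))

# Bootstrap sample: draw n_rows row ids with replacement
bootstrap_ids = random.choices(row_ids, k=n_rows)

n_unique = len(set(bootstrap_ids))
frac_unique = n_unique / n_rows
print(f"rows in bootstrap sample: {len(bootstrap_ids)}")
print(f"distinct original rows:   {n_unique} ({frac_unique:.1%})")
print(f"original rows left out:   {n_rows - n_unique}")
```

For large datasets, the fraction of distinct rows approaches 1 - 1/e, or roughly 63%, so each tree sees a little under two-thirds of the original observations.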

The other source of randomness comes when building the tree itself. In a normal decision tree, every single feature is searched for the best binary split. With a random forest, you can opt to search only a random subset of the features at each node for the best binary split. By default, scikit-learn's `RandomForestRegressor` searches all the features, but it provides the `max_features` parameter to limit the number of features to search. Note that this random selection is repeated at every node, so a different set of features may be searched at each one.
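The per-node feature selection can be pictured with the conceptual sketch below. The feature names are borrowed from the housing example used later in this notebook, and `node_candidates` is a hypothetical name. This only illustrates the idea of drawing a fresh subset at each node and does not reproduce scikit-learn's internal splitter.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

features = ['GrLivArea', 'GarageArea', 'BedroomAbvGr', 'FullBath']
max_features = 2  # number of candidate features searched at each node

# Each node draws its own candidate set, without replacement
node_candidates = [random.sample(features, k=max_features) for _ in range(3)]

for node, candidates in enumerate(node_candidates):
    print(f"node {node}: search {candidates} for the best split")
```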

### Bagging = Bootstrap Aggregating

A random forest uses bootstrapping to create new random datasets for each tree. The predictions of all the trees are aggregated to produce a final decision. This combination of bootstrapping and aggregating is called **bagging**. This procedure is not specific to random forests and may be applied to many different models.
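Bagging can be sketched by hand in a few lines. The example below uses synthetic data and shallow decision trees as the base model, but any estimator could take their place. The names `X_demo`, `y_demo`, `models`, and `bagged_pred` are made up for this illustration, and scikit-learn also offers a ready-made `BaggingRegressor` in its `ensemble` module.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

n_models = 5
models = []
for _ in range(n_models):
    # Bootstrap: sample row positions with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    model = DecisionTreeRegressor(max_depth=2, random_state=0)
    model.fit(X_demo[idx], y_demo[idx])
    models.append(model)

# Aggregate: average the individual predictions for the first three rows
all_preds = np.array([m.predict(X_demo[:3]) for m in models])  # shape (5, 3)
bagged_pred = all_preds.mean(axis=0)
print(bagged_pred)
```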

## Random Forest in Scikit-Learn

Let's build a random forest on the housing dataset with three decision trees. Choose the number of trees by setting the parameter `n_estimators` during instantiation. We will keep the depth of the trees small to make visualization possible in the notebook. We can also set a limit to the number of features searched at each node with the `max_features` parameter.

### Import from the `ensemble` module

Random forests are considered an **ensemble** model, a term for any model that combines many other models. Let's begin by reading in our data and selecting a few of the columns to learn from.

[1]: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

In [None]:
import pandas as pd
housing = pd.read_csv('../data/housing_sample.csv')
cols = ['GrLivArea', 'GarageArea', 'BedroomAbvGr', 'FullBath']
X = housing[cols]
y = housing['SalePrice']
X.head()

We import `RandomForestRegressor` from the `ensemble` module and then instantiate it with 3 trees, a max depth of 2, and 2 features searched at each node for the best split. Finally, we train the model with the `fit` method.

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=3, max_depth=2, max_features=2)
rfr.fit(X, y)

### Each tree is stored in the `estimators_` attribute

We can get access to each individual tree in the `estimators_` attribute. It is a list of all the trees.

In [None]:
rfr.estimators_

### Visualize each tree

Let's visualize each tree so that we can verify that each one is indeed different. A loop plots each tree on its own axes.

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
%matplotlib inline

fig, axes = plt.subplots(3, 1, figsize=(14, 10))
fig.suptitle('Visualizing each tree from a random forest', fontsize=30)
kwargs = dict(feature_names=cols, precision=1, filled=True, rounded=True)
for i in range(3):
    plot_tree(rfr.estimators_[i], ax=axes[i], **kwargs)
    axes[i].set_title(f'Tree {i + 1}', fontsize=15, pad=10)

### Make a prediction

We have four features in our model. Let's predict the sale price of a house with an above-ground living area of 2,300 square feet, a garage area of 1,000 square feet, 4 bedrooms, and 2 full baths.

In [None]:
val = [[2300, 1000, 4, 2]]
rfr.predict(val)

### Verify prediction is average of individual trees

Let's iterate over each tree to get a prediction for each one. Then let's take the average and verify that it matches the above.

In [None]:
y_pred = [tree.predict(val) for tree in rfr.estimators_]
y_pred

In [None]:
sum(y_pred) / 3

### Evaluate the random forest

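Again, we can evaluate our model with the `score` method, which returns R² for regressors. Note that this score is computed on the same data the model was trained on.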

In [None]:
rfr.score(X, y)
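For regressors, scikit-learn's `score` method returns the coefficient of determination, R². The sketch below computes R² by hand to show what that number measures. The sale prices and predictions are made up for illustration and do not come from the housing dataset.

```python
# Hypothetical actual and predicted sale prices, made up for illustration
y_true = [200_000, 150_000, 320_000, 275_000]
y_hat = [210_000, 160_000, 300_000, 280_000]

mean_y = sum(y_true) / len(y_true)

# Residual sum of squares: error left over after using the model
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
# Total sum of squares: error from always predicting the mean
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot
print(round(r2, 4))
```

A value of 1 means the predictions are perfect, and a value of 0 means the model does no better than always predicting the mean sale price.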

## Random forests build weak learners. Why are they good?

A random forest purposefully withholds information from each tree. Not all the observations are available to each tree, and not all the features are available at each node. Each individual tree is considered a **weak learner** because of this. As it turns out, many weak learners acting independently can together make good decisions. This is sometimes referred to as **the wisdom of the crowd**.
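The effect of averaging many independent guesses can be seen in a small simulation. This is a simplified sketch under a strong assumption: each "learner" is modeled as the true value plus independent Gaussian noise, and all the numbers are arbitrary. Trees in a real forest are trained on overlapping data, so their errors are correlated and the improvement is smaller than shown here.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

true_value = 100.0
noise_sd = 20.0    # spread of each weak learner's error (assumed independent)
n_learners = 50
n_trials = 2_000

single_errors = []
crowd_errors = []
for _ in range(n_trials):
    guesses = [random.gauss(true_value, noise_sd) for _ in range(n_learners)]
    # Error of one learner on its own
    single_errors.append(abs(guesses[0] - true_value))
    # Error of the averaged prediction of all learners
    crowd_errors.append(abs(statistics.mean(guesses) - true_value))

mean_single_error = statistics.mean(single_errors)
mean_crowd_error = statistics.mean(crowd_errors)
print(f"average error of one learner: {mean_single_error:.2f}")
print(f"average error of the crowd:   {mean_crowd_error:.2f}")
```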

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Build many different random forests with different combinations of features and different values for the number of trees and the depth of each tree.</span>