# Ensembles of Decision Trees
Follow _Introduction to Machine Learning with Python_ [Chapter 2 Supervised Learning](https://github.com/amueller/introduction_to_ml_with_python/blob/master/02-supervised-learning.ipynb) **2.3.6 Ensembles of Decision Trees** (p.85)

> Ensembles are methods that combine multiple machine learning models to create more powerful models.

### Random Forest
Many different decision trees are built independently (in parallel); their predictions are combined by soft voting (classification) or averaging (regression)

Powerful, relatively easy to train, robust, parallelizable training, costly in prediction (many trees)

### Gradient boosted trees
A sequence of decision trees is built, each trained to correct the errors of the previous one

Powerful, a bit harder to train than random forest, XGBoost (a variant) wins many Kaggle competitions, costly in training, faster prediction

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import mglearn

## Random Forest 

What is random? 
1. The data points used to train each tree are randomly selected from the data using bootstrap sampling (sampling with replacement).
2. At each split, a subset of features is randomly selected as candidates for the best split.
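
The two sources of randomness can be sketched with plain NumPy. This is only an illustration under assumed toy data (`X_toy`, the sizes and the seed are made up here), not how scikit-learn implements it internally.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

n_samples, n_features = 10, 6
X_toy = rng.normal(size=(n_samples, n_features))  # hypothetical toy data

# 1. Bootstrap sample: draw n_samples row indices WITH replacement,
#    so some rows appear several times and others are left out.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
X_boot = X_toy[bootstrap_idx]
left_out = np.setdiff1d(np.arange(n_samples), bootstrap_idx)

# 2. Feature subset: draw a few candidate features WITHOUT replacement;
#    only these would be searched for the best split at one node.
max_features = 2
candidate_features = rng.choice(n_features, size=max_features, replace=False)

print("bootstrap rows:", bootstrap_idx)
print("rows left out:", left_out)
print("candidate features for this split:", candidate_features)
```

Each tree gets its own bootstrap sample, while a fresh feature subset is drawn at every split.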

Parameters:
- `n_estimators` the number of trees (not tuned, more is better).
- `max_features` number of features randomly drawn as candidates at each split.
- `max_depth` maximum number of tree levels. Stops tree growth.
- `min_samples_leaf` minimum number of samples required to be a leaf node. Stops tree growth.
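
As a rough illustration of how these parameters are passed, the sketch below fits a deliberately restricted forest on `make_moons` data. The specific values (50 trees, depth 3, and so on) are arbitrary choices for this example, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_moons(n_samples=200, noise=0.25, random_state=0)

limited_forest = RandomForestClassifier(
    n_estimators=50,     # number of trees
    max_features=1,      # one randomly drawn candidate feature per split
    max_depth=3,         # no tree may grow deeper than 3 levels
    min_samples_leaf=5,  # a split must leave at least 5 samples in each leaf
    random_state=0,
).fit(X_demo, y_demo)

# Inspect the fitted trees to see the depth limit in effect.
depths = [tree.get_depth() for tree in limited_forest.estimators_]
print("number of trees:", len(limited_forest.estimators_))
print("deepest tree:", max(depths))
```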


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

forest = RandomForestClassifier(n_estimators=5, random_state=2)
forest.fit(X_train, y_train)

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(20, 10))
for i, (ax, tree) in enumerate(zip(axes.ravel(), forest.estimators_)):
    ax.set_title("Tree {}".format(i))
    mglearn.plots.plot_tree_partition(X_train, y_train, tree, ax=ax)
    
mglearn.plots.plot_2d_separator(forest, X_train, fill=True, ax=axes[-1, -1],
                                alpha=.4)
axes[-1, -1].set_title("Random Forest")
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)

The decision boundary of each tree is slightly different

Random forest overfits less than individual trees

More trees would result in smoother decision boundaries
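
To see how the individual trees are combined, the sketch below averages the class probabilities of the trees by hand and compares the result with the forest's own output. It refits a small forest with the same settings as above under separate variable names (`X_sv`, `sv_forest`, chosen here so the notebook's variables are not overwritten).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_sv, y_sv = make_moons(n_samples=100, noise=0.25, random_state=3)
X_sv_train, X_sv_test, y_sv_train, y_sv_test = train_test_split(
    X_sv, y_sv, stratify=y_sv, random_state=42)

sv_forest = RandomForestClassifier(n_estimators=5, random_state=2)
sv_forest.fit(X_sv_train, y_sv_train)

# Soft voting: average the per-tree class probabilities ...
tree_probas = np.stack([tree.predict_proba(X_sv_test)
                        for tree in sv_forest.estimators_])
mean_proba = tree_probas.mean(axis=0)
# ... and predict the class with the highest averaged probability.
soft_vote = sv_forest.classes_[mean_proba.argmax(axis=1)]

print("probabilities match:",
      np.allclose(mean_proba, sv_forest.predict_proba(X_sv_test)))
print("predictions match:",
      np.array_equal(soft_vote, sv_forest.predict(X_sv_test)))
```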

### Random forest classification of breast cancer tumor

In [None]:
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

X_train, X_val, y_train, y_val = train_test_split(
    cancer.data, cancer.target, random_state=0)
forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Accuracy on validation set: {:.3f}".format(forest.score(X_val, y_val)))

Using the default parameters, we get better performance than with the linear models or a single decision tree

We might be overfitting as the training accuracy is at 100%

**Question:** How can we reduce model complexity?

**Answer:** ...


The feature importances of a random forest are aggregated over all of its trees, which makes them more reliable than those of a single tree. More features have non-zero importances.

In [None]:
def plot_feature_importances_cancer(model, figsize=(4, 6)):
    """Plot the model's feature importances for the cancer data as horizontal bars."""
    importances = pd.DataFrame(
        {'Feature importance': model.feature_importances_},
        index=cancer.feature_names,
    ).sort_values(by='Feature importance', ascending=False)
    importances.plot.barh(figsize=figsize);

plot_feature_importances_cancer(forest)
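
The aggregation can be checked by hand: the sketch below averages the per-tree importances and compares the result with the forest's `feature_importances_`. It refits the same default forest under separate names (`data`, `fi_forest`, introduced only for this example).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_fi_train, X_fi_val, y_fi_train, y_fi_val = train_test_split(
    data.data, data.target, random_state=0)
fi_forest = RandomForestClassifier(random_state=0).fit(X_fi_train, y_fi_train)

# One row of importances per tree, then the mean over all trees.
per_tree = np.array([tree.feature_importances_
                     for tree in fi_forest.estimators_])
averaged = per_tree.mean(axis=0)

nonzero_first_tree = np.count_nonzero(per_tree[0])
nonzero_forest = np.count_nonzero(fi_forest.feature_importances_)

print("matches forest.feature_importances_:",
      np.allclose(averaged, fi_forest.feature_importances_))
print("features used by the first tree:", nonzero_first_tree)
print("features used by the whole forest:", nonzero_forest)
```

A feature gets a non-zero importance in the forest as soon as any one tree uses it, which is why the forest spreads importance over more features than a single tree does.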

## Gradient Boosted Trees 

Trees are built in series, each tree trying to correct the mistakes made by the previous one

A new parameter `learning_rate` controls how much each tree tries to correct the previous. **Higher** `learning_rate` means **larger** corrections are possible, resulting in **more** complex models

Generally, shallow trees are used. This results in an ensemble of *weak learners*
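
The idea can be sketched from scratch for regression with squared-error loss, where "correcting the mistakes" means fitting each new tree to the current residuals. This is a simplified illustration on made-up data (`X_toy`, `y_toy`), not the algorithm exactly as `GradientBoostingClassifier` implements it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))                 # hypothetical toy data
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.2, size=200)

learning_rate = 0.1   # how strongly each tree corrects the ensemble so far
n_trees = 50

# Start from a constant prediction (the mean of the targets).
prediction = np.full_like(y_toy, y_toy.mean())
mse_per_stage = [np.mean((y_toy - prediction) ** 2)]
weak_learners = []

for _ in range(n_trees):
    residuals = y_toy - prediction                 # errors of the ensemble so far
    stump = DecisionTreeRegressor(max_depth=1)     # shallow tree = weak learner
    stump.fit(X_toy, residuals)
    prediction += learning_rate * stump.predict(X_toy)  # damped correction
    weak_learners.append(stump)
    mse_per_stage.append(np.mean((y_toy - prediction) ** 2))

print("training MSE before boosting: {:.3f}".format(mse_per_stage[0]))
print("training MSE after {} trees: {:.3f}".format(n_trees, mse_per_stage[-1]))
```

With a smaller `learning_rate`, each stump contributes less, so more trees are needed to reach the same training error.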

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

X_train, X_val, y_train, y_val = train_test_split(
    cancer.data, cancer.target, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on validation set: {:.3f}".format(gbrt.score(X_val, y_val)))

Just like with Random Forest, with the default parameters, we might be overfitting

We can pre-prune by reducing the maximum depth to reduce complexity:


In [None]:
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on validation set: {:.3f}".format(gbrt.score(X_val, y_val)))

Or reduce `learning_rate`:

In [None]:
gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)
gbrt.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on validation set: {:.3f}".format(gbrt.score(X_val, y_val)))
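
One way to look at the effect of `learning_rate` is to track validation accuracy as trees are added, using `staged_predict`. The sketch below compares three learning rates; the values tried and the names (`val_curves`, `X_st_train`, ...) are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_st_train, X_st_val, y_st_train, y_st_val = train_test_split(
    data.data, data.target, random_state=0)

val_curves = {}
for lr in (0.01, 0.1, 1.0):
    model = GradientBoostingClassifier(random_state=0, learning_rate=lr)
    model.fit(X_st_train, y_st_train)
    # staged_predict yields the ensemble's predictions after 1, 2, ... trees.
    val_curves[lr] = [np.mean(pred == y_st_val)
                      for pred in model.staged_predict(X_st_val)]

for lr, curve in val_curves.items():
    best_stage = int(np.argmax(curve)) + 1
    print("learning_rate={}: final accuracy {:.3f}, best {:.3f} after {} trees"
          .format(lr, curve[-1], max(curve), best_stage))
```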

### Feature importance of gradient boosted classifier

In [None]:
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
gbrt.fit(X_train, y_train)

plot_feature_importances_cancer(gbrt)

Some features are completely ignored by the gradient boosted classifier
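
The ignored features can also be listed directly instead of being read off the plot. A small sketch, refitting the same `max_depth=1` model under separate names (`stumps`, `ignored`, introduced here):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_zi_train, X_zi_val, y_zi_train, y_zi_val = train_test_split(
    data.data, data.target, random_state=0)

stumps = GradientBoostingClassifier(random_state=0, max_depth=1)
stumps.fit(X_zi_train, y_zi_train)

# Features that no tree ever split on have an importance of exactly zero.
ignored = data.feature_names[stumps.feature_importances_ == 0]
print("{} of {} features are never used:".format(
    len(ignored), len(data.feature_names)))
print(ignored)
```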