# Chapter 7 Ensemble Learning and Random Forests

In this notebook we will go through ensemble learning methods. An *ensemble* is a group of predictors (classifiers or regressors) whose aggregated prediction is often better than the prediction of the best individual predictor. This is called the *wisdom of the crowd*.

At the end of a project you will often create an ensemble from the different well-performing models you built and tested along the way. This is a way of making your final model as good as possible. Instead of relying on one good model, use the aggregated prediction from a few good models that differ in their training algorithms or hyperparameters.

## Voting classifiers

If you have built a few different classifiers (logistic regression, decision tree, SVM, etc.) on a particular dataset that all perform reasonably well (around 80% accuracy), a simple way to get an even better classifier is to aggregate the predictions of each classifier and predict the class with the most votes. This majority-vote classifier is a *hard-voting* classifier.

Even if the models are *weak learners* (predicting only slightly better than random guessing), the ensemble can be a *strong learner* (high accuracy) due to the *law of large numbers*, provided there are enough contributing models and they are sufficiently diverse.

The key thing to consider is that the models should be as independent from each other as possible (e.g. different training algorithms or different models altogether) so that the models don't all make the same types of errors.
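
To get a feel for the law of large numbers argument, the sketch below computes the probability that a strict majority of voters is correct. It assumes an idealised setting that is not in the experiments of this notebook: a binary task, every classifier correct with the same probability, and fully independent votes. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import exp, lgamma, log


def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that a strict majority of independent voters is correct.

    Assumes a binary task where each voter is right with probability
    p_correct, independently of the others (an idealised assumption).
    """
    n, p = n_classifiers, p_correct
    total = 0.0
    # Sum the binomial probabilities of k correct votes for every k > n / 2.
    for k in range(n // 2 + 1, n + 1):
        # log of the binomial coefficient C(n, k), computed in log space
        # so that large n does not overflow.
        log_binom = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        total += exp(log_binom + k * log(p) + (n - k) * log(1 - p))
    return total


# Odd ensemble sizes avoid ties.
for n in (1, 11, 101, 1001, 10001):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

The accuracy of the vote climbs steadily as more voters are added, even though each one is right only 51% of the time. Real classifiers trained on the same data make correlated errors, so the gain in practice is smaller than this idealised calculation suggests.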

In [28]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

log_clf = LogisticRegression()
forest_clf = RandomForestClassifier(n_estimators=10)
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', forest_clf), ('svc', svm_clf)],
    voting='hard'
)

voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)), ('rf', RandomF...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [29]:
from sklearn.metrics import accuracy_score
for clf in (log_clf, forest_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.85
RandomForestClassifier 0.87
SVC 0.87
VotingClassifier 0.87


So the classifier which takes the vote between all three classifiers matches the best individual classifiers here (0.87) and beats logistic regression (0.85). On this small test set it does not outperform every individual model.

*Soft voting* is when the ensemble model predicts the class which has the highest probability averaged over all the individual models. In this case, every model needs to have a predict_proba() method. This type of voting gives more weight to highly confident votes and so often gives better accuracy than hard voting. Let's do this:

In [30]:
log_clf = LogisticRegression()
forest_clf = RandomForestClassifier()
# By default the SVC class does not estimate class probabilities. Setting the probability
# hyperparameter to True makes it use cross-validation to estimate them (this slows down training).
svm_clf = SVC(probability=True)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', forest_clf), ('svc', svm_clf)],
    # Simply change the voting hyperparameter to 'soft'
    voting='soft'
)

for clf in (log_clf, forest_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.85
RandomForestClassifier 0.88
SVC 0.87
VotingClassifier 0.88


For this dataset, all the models are performing well, so it is hard to see the benefit of the voting classifier.
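
To see how the two voting schemes can disagree, here is a toy sketch with made-up probabilities for a single instance (the numbers are hypothetical and do not come from the models above). Two classifiers lean weakly towards class 0 and one is very confident about class 1.

```python
import numpy as np

# Hypothetical predict_proba() outputs for one instance:
# one row per classifier, one column per class.
proba = np.array([
    [0.55, 0.45],  # classifier A: weakly prefers class 0
    [0.55, 0.45],  # classifier B: weakly prefers class 0
    [0.05, 0.95],  # classifier C: strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class,
# and the class with the most votes wins.
hard_votes = proba.argmax(axis=1)
hard_pred = np.bincount(hard_votes, minlength=proba.shape[1]).argmax()

# Soft voting: average the probabilities per class, then pick the largest.
mean_proba = proba.mean(axis=0)
soft_pred = mean_proba.argmax()

print("hard voting predicts class", hard_pred)
print("soft voting predicts class", soft_pred, "with mean probabilities", mean_proba)
```

Hard voting picks class 0 by two votes to one, while soft voting picks class 1 because the confident classifier pulls the averaged probability above 0.5.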

## Bagging and Pasting

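In order to get a diverse ensemble, you can either train the models using very different algorithms, or use the same algorithm for every model but train each one on a different random subset of the training set.

In *bagging* (short for *bootstrap aggregating*), the sampling is performed *with* replacement (called 'bootstrapping' in statistics). This means that a training instance can be sampled several times for the same predictor.

In *pasting*, the sampling is performed *without* replacement, so a particular instance appears at most once in the subset used to train a given predictor. In both methods the same instance can still be used by several different predictors.

The final prediction is typically the *statistical mode* (hard voting) for classifiers or the average for regression. Each individual predictor has a higher bias than if it were trained on the full training set, but aggregation reduces both bias and variance. The net result is generally an ensemble with a similar bias but a lower variance than a single predictor trained on the full training set.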
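
The difference between the two sampling schemes can be sketched with NumPy alone. The training set size and subset size below are arbitrary values chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 10      # hypothetical training set size
subset_size = 6   # hypothetical number of instances drawn per predictor

# Bagging: draw indices with replacement, so an index may repeat.
bagging_idx = rng.choice(n_train, size=subset_size, replace=True)

# Pasting: draw indices without replacement, so every index is distinct.
pasting_idx = rng.choice(n_train, size=subset_size, replace=False)

print("bagging indices:", np.sort(bagging_idx))
print("pasting indices:", np.sort(pasting_idx))
print("distinct instances in the bagging sample:", np.unique(bagging_idx).size)
```

Each predictor in the ensemble would get its own fresh draw, which is what makes the predictors differ from one another.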

Bagging and pasting scale well with the size of the problem since the individual models can be trained and evaluated in parallel (multiple cores or servers).
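
As a minimal sketch, both schemes can be tried with Scikit-Learn's `BaggingClassifier` on the same moons data used earlier. The hyperparameter values here (100 decision trees, 100 instances per tree) are assumptions picked for illustration, not tuned settings. `bootstrap=True` gives bagging, `bootstrap=False` gives pasting, and `n_jobs=-1` asks Scikit-Learn to use all available CPU cores.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = {}
for name, bootstrap in (("bagging", True), ("pasting", False)):
    clf = BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=100,     # number of trees in the ensemble (assumed value)
        max_samples=100,      # instances drawn for each tree (assumed value)
        bootstrap=bootstrap,  # True: with replacement, False: without
        n_jobs=-1,            # train and predict in parallel on all cores
        random_state=42,
    )
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(name, scores[name])
```

Since each tree only sees 100 of the 400 training instances, the individual trees are weaker than one tree trained on everything, but their aggregated vote is usually less prone to overfitting.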