# Ensemble Learning and Random Forests

# Voting Classiers

Suppose you have trained several classifiers. A very simple way to create an even
better classifier is to aggregate the predictions of each classifier and predict the
class that gets the most votes. This majority-vote classifier is called a hard voting
classifier.
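
To make the majority vote concrete, here is a minimal sketch in plain NumPy. The `predictions` array and the `hard_vote` helper are made up for illustration and are not part of Scikit-Learn; each row holds one classifier's predicted labels and each column is one instance.

```python
import numpy as np

# Hypothetical predictions: one row per classifier, one column per instance.
predictions = np.array([
    [0, 1, 1, 0],  # classifier A
    [0, 1, 0, 0],  # classifier B
    [1, 1, 0, 1],  # classifier C
])

def hard_vote(predictions):
    """Return the most frequent class label for each instance (column)."""
    n_classes = predictions.max() + 1
    # Count the votes each class receives, column by column.
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    # Pick the class with the most votes (ties go to the lowest label).
    return counts.argmax(axis=0)

majority = hard_vote(predictions)
print(majority)  # [0 1 0 0]
```

With three voters and two classes there can be no ties; with an even number of voters the tie-breaking rule in this sketch (lowest label wins) is an arbitrary choice.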

The following code creates and trains a voting classifier in Scikit-Learn, composed of
three diverse classifiers (the training set is the moons dataset, introduced in
Chapter 5):

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn


In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [4]:
log_clf = LogisticRegression(solver='lbfgs')
rf_clf = RandomForestClassifier(n_estimators=100)
svm_clf = SVC(gamma='scale')

In [5]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

In [6]:
X, y = datasets.make_moons(n_samples=10000, noise=0.5)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33)

In [7]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

((6700, 2), (6700,), (3300, 2), (3300,))

In [8]:
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rf_clf), ('svc', svm_clf)], voting='hard')

In [9]:
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

In [10]:
from sklearn.metrics import accuracy_score

In [11]:
for clf in [log_clf, rf_clf, svm_clf, voting_clf]:
    clf.fit(X_train, y_train)
    y_hat = clf.predict(X_val)
    print(clf.__class__.__name__, accuracy_score(y_val, y_hat))

LogisticRegression 0.8087878787878788
RandomForestClassifier 0.7996969696969697
SVC 0.8248484848484848
VotingClassifier 0.8209090909090909


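There we have it! In this run, the voting classifier beats logistic regression and the random forest, but lands just below the SVC (about 0.821 versus 0.825). Neither the dataset nor the split uses a fixed random seed, so the exact figures will change from run to run.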

If every classifier in the ensemble can estimate class probabilities (that is, each has a predict_proba() method), we can average the predicted probabilities for each class and then predict the class with the highest average probability. This is called soft voting. It often yields better results than hard voting because it gives more weight to highly confident votes.
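
The difference between the two voting schemes is easiest to see on a single instance. The sketch below uses made-up probabilities (the `probas` array is purely illustrative): two classifiers lean weakly toward class 0 while the third is very confident in class 1.

```python
import numpy as np

# Hypothetical class probabilities for ONE instance.
# Rows are classifiers, columns are [P(class 0), P(class 1)].
probas = np.array([
    [0.55, 0.45],  # weakly prefers class 0
    [0.55, 0.45],  # weakly prefers class 0
    [0.05, 0.95],  # strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
hard_votes = probas.argmax(axis=1)
hard_prediction = np.bincount(hard_votes, minlength=2).argmax()

# Soft voting: average the probabilities per class, then take the largest.
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard:", hard_prediction)  # hard: 0
print("soft:", soft_prediction)  # soft: 1
```

Hard voting predicts class 0 (two votes to one), while soft voting predicts class 1 because the confident classifier pulls the average up. In Scikit-Learn, the equivalent switch is voting='soft' on VotingClassifier; note that SVC does not estimate probabilities by default, so it would need to be created with probability=True.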

# Bagging and Pasting in Scikit-Learn


In [13]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier


A bagging classifier takes a single type of classifier algorithm and trains many copies of it, each on a different random subset of the training set.

The following code trains an ensemble of 500 Decision Tree classifiers, each trained
on 100 training instances randomly sampled from the training set with replacement
(this is an example of bagging, but if you want to use pasting instead, just set
bootstrap=False). The n_jobs parameter tells Scikit-Learn the number of CPU cores to
use for training and predictions (-1 tells Scikit-Learn to use all available cores).

In [14]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)

In [16]:
bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=100,
                  n_estimators=500, n_jobs=-1)

In [17]:
y_hat = bag_clf.predict(X_val)

In [18]:
y_hat

array([0, 1, 1, ..., 1, 1, 0], dtype=int64)
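
The distinction between bagging and pasting mentioned earlier comes down to how each predictor's training subset is drawn. The following is only a sketch of that sampling idea using NumPy, not a description of how BaggingClassifier is implemented internally; the variable names and the seed are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for repeatability only

n_train = 6700     # size of the training set used above
max_samples = 100  # instances drawn for each predictor

# Bagging: sample WITH replacement, so an instance may appear more than once.
bagging_idx = rng.choice(n_train, size=max_samples, replace=True)

# Pasting: sample WITHOUT replacement, so every drawn instance is distinct.
pasting_idx = rng.choice(n_train, size=max_samples, replace=False)

print("distinct instances (bagging):", len(np.unique(bagging_idx)))
print("distinct instances (pasting):", len(np.unique(pasting_idx)))
```

Pasting always yields 100 distinct instances here, whereas bagging may yield fewer because of repeats. Each predictor in the ensemble would get its own independent draw.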