# Voting Classifiers

Let's build a voting classifier:

Scikit-Learn provides a VotingClassifier class that’s quite easy to use: just give
it a list of name/predictor pairs, and use it like a normal classifier. Let’s try it on
the moons dataset. We will load and split the moons
dataset into a training set and a test set, then we’ll create and train a voting classifier
composed of three diverse classifiers:

In [3]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)
voting_clf.fit(X_train, y_train)

When you fit a VotingClassifier, it clones every estimator and fits the clones. The
original estimators are available via the estimators attribute, while the fitted clones
are available via the estimators_ attribute. If you prefer a dict rather than a list, you
can use named_estimators or named_estimators_ instead. To begin, let’s look at each
fitted classifier’s accuracy on the test set:

In [4]:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test))

lr = 0.864
rf = 0.896
svc = 0.896


When you call the voting classifier’s predict() method, it performs hard voting. For
example, the voting classifier predicts class 1 for the first instance of the test set,
because two out of three classifiers predict that class:

In [5]:
voting_clf.predict(X_test[:1])

array([1], dtype=int64)

In [6]:
[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]

[array([1], dtype=int64), array([1], dtype=int64), array([0], dtype=int64)]
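
To make the hard-voting rule concrete, here is a minimal sketch that tallies the votes by hand. The `hard_vote` helper and the small `votes` array are hypothetical and only serve as an illustration; ties are resolved by `np.argmax`, which picks the lowest class label.

```python
import numpy as np

def hard_vote(predictions):
    """Return the majority class for each instance.

    predictions: array of shape (n_classifiers, n_instances) holding
    non-negative integer class labels (hypothetical helper, for illustration).
    """
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # For every instance (column), count how many votes each class received.
    counts = np.apply_along_axis(np.bincount, 0, predictions,
                                 minlength=n_classes)
    # counts has shape (n_classes, n_instances); argmax picks the winner
    # and breaks ties in favor of the lowest class label.
    return counts.argmax(axis=0)

votes = np.array([[1, 0, 1],   # classifier 1
                  [1, 0, 0],   # classifier 2
                  [0, 1, 0]])  # classifier 3
majority = hard_vote(votes)
print(majority)
```

The first column mirrors the situation above: two classifiers vote for class 1 and one votes for class 0, so class 1 wins.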

Now let’s look at the performance of the voting classifier on the test set:

In [7]:
voting_clf.score(X_test, y_test)

0.912

There you have it! The voting classifier outperforms all the individual classifiers.

If all classifiers are able to estimate class probabilities (i.e., if they all have a
predict_proba() method), then you can tell Scikit-Learn to predict the class with
the highest class probability, averaged over all the individual classifiers. This is called
soft voting. It often achieves higher performance than hard voting because it gives
more weight to highly confident votes. All you need to do is set the voting
classifier’s voting hyperparameter to "soft", and ensure that all classifiers can estimate
class probabilities. This is not the case for the SVC class by default, so you need
to set its probability hyperparameter to True (this will make the SVC class use
cross-validation to estimate class probabilities, slowing down training, and it will add
a predict_proba() method). Let’s try that:

In [8]:
voting_clf.voting = "soft"
voting_clf.named_estimators["svc"].probability = True
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)

0.92

We reach 92% accuracy simply by using soft voting—not bad!
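
The difference between the two voting schemes can be seen on a single instance. The following sketch uses made-up probability estimates (the `probas` array is hypothetical, not output from the classifiers above) in which two classifiers lean slightly toward class 1 while the third is very confident about class 0:

```python
import numpy as np

# Hypothetical class-probability estimates for one instance, one row per
# classifier; the columns are P(class 0) and P(class 1).
probas = np.array([[0.45, 0.55],
                   [0.40, 0.60],
                   [0.90, 0.10]])

# Hard voting: each classifier casts one vote for its most likely class.
hard_votes = probas.argmax(axis=1)
hard_winner = np.bincount(hard_votes).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_winner = mean_proba.argmax()

print("hard voting picks class", hard_winner)
print("soft voting picks class", soft_winner, "with mean probas", mean_proba)
```

Hard voting follows the two lukewarm votes, whereas soft voting lets the single confident classifier tip the average toward class 0.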

# Bagging and Pasting

Scikit-Learn offers a simple API for both bagging and pasting: the BaggingClassifier
class (or BaggingRegressor for regression). The following code trains an ensemble
of 500 decision tree classifiers: each is trained on 100 training instances randomly
sampled from the training set with replacement (this is an example of bagging, but
if you want to use pasting instead, just set bootstrap=False). The n_jobs parameter
tells Scikit-Learn the number of CPU cores to use for training and predictions, and
-1 tells Scikit-Learn to use all available cores:

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)

* A BaggingClassifier automatically performs soft voting instead
of hard voting if the base classifier can estimate class probabilities
(i.e., if it has a predict_proba() method), which is the case with
decision tree classifiers.

The figure below compares the decision boundary of a single decision tree with the decision
boundary of a bagging ensemble of 500 trees (from the preceding code), both trained
on the moons dataset. As you can see, the ensemble’s predictions will likely generalize
much better than the single decision tree’s predictions: the ensemble has a comparable
bias but a smaller variance (it makes roughly the same number of errors on the
training set, but the decision boundary is less irregular).

Bagging introduces a bit more diversity in the subsets that each predictor is trained
on, so bagging ends up with a slightly higher bias than pasting; but the extra diversity
also means that the predictors end up being less correlated, so the ensemble’s variance
is reduced. Overall, bagging often results in better models, which explains why
it’s generally preferred. But if you have spare time and CPU power, you can use
cross-validation to evaluate both bagging and pasting and select the one that works
best.
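
As a rough sketch of that comparison, the snippet below cross-validates a bagging ensemble and a pasting ensemble on the moons data. It regenerates the dataset so that it runs on its own, and it uses 100 trees and 5 folds purely as assumptions to keep the run short; the scores depend on those choices and on the random seed.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

scores = {}
for name, bootstrap in (("bagging", True), ("pasting", False)):
    # bootstrap=True samples with replacement (bagging);
    # bootstrap=False samples without replacement (pasting).
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=100, bootstrap=bootstrap,
                            random_state=42)
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
```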

In [None]:
# extra code - To generate figure below
import matplotlib.pyplot as plt
import numpy as np
def plot_decision_boundary(clf, X, y, alpha=1.0):
    axes=[-1.5, 2.4, -1, 1.5]
    x1, x2 = np.meshgrid(np.linspace(axes[0], axes[1], 100),
                         np.linspace(axes[2], axes[3], 100))
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    
    plt.contourf(x1, x2, y_pred, alpha=0.3 * alpha, cmap='Wistia')
    plt.contour(x1, x2, y_pred, cmap="Greys", alpha=0.8 * alpha)
    colors = ["#78785c", "#c47b27"]
    markers = ("o", "^")
    for idx in (0, 1):
        plt.plot(X[:, 0][y == idx], X[:, 1][y == idx],
                 color=colors[idx], marker=markers[idx], linestyle="none")
    plt.axis(axes)
    plt.xlabel(r"$x_1$")
    plt.ylabel(r"$x_2$", rotation=0)

tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

fig, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)
plt.sca(axes[0])
plot_decision_boundary(tree_clf, X_train, y_train)
plt.title("Decision Tree")
plt.sca(axes[1])
plot_decision_boundary(bag_clf, X_train, y_train)
plt.title("Decision Trees with Bagging")
plt.ylabel("")
plt.show()

## Out-of-Bag Evaluation

With bagging, some training instances may be sampled several times for any given
predictor, while others may not be sampled at all. By default a BaggingClassifier
samples m training instances with replacement (bootstrap=True), where m is the
size of the training set. With this process, it can be shown mathematically that only
about 63% of the training instances are sampled on average for each predictor. The
remaining 37% of the training instances that are not sampled are called out-of-bag
(OOB) instances. Note that they are not the same 37% for all predictors.
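
The 63% figure can be checked quickly. When drawing m times with replacement from m instances, a given instance is missed by every draw with probability (1 - 1/m)^m, which approaches exp(-1) ≈ 0.37 as m grows, so the expected sampled fraction approaches 1 - exp(-1) ≈ 0.63. The sketch below compares one simulated bootstrap sample with that value (the size `m = 10_000` is an arbitrary choice for illustration):

```python
import numpy as np

m = 10_000                       # training set size (arbitrary, for illustration)
rng = np.random.default_rng(42)

# One bootstrap sample: m draws with replacement from the indices 0..m-1.
sample = rng.integers(0, m, size=m)
sampled_fraction = np.unique(sample).size / m

exact = 1 - (1 - 1 / m) ** m     # expected fraction for this m
limit = 1 - np.exp(-1)           # limit as m grows

print(f"simulated: {sampled_fraction:.4f}")
print(f"exact:     {exact:.4f}")
print(f"limit:     {limit:.4f}")
```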
A bagging ensemble can be evaluated using OOB instances, without the need for
a separate validation set: indeed, if there are enough estimators, then each instance
in the training set will likely be an OOB instance of several estimators, so these
estimators can be used to make a fair ensemble prediction for that instance. Once
you have a prediction for each instance, you can compute the ensemble’s prediction
accuracy (or any other metric).

In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier
to request an automatic OOB evaluation after training. The following code
demonstrates this. The resulting evaluation score is available in the oob_score_ attribute:

In [None]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            oob_score=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

According to this OOB evaluation, this BaggingClassifier is likely to achieve about
89.6% accuracy on the test set. Let’s verify this:

In [None]:
from sklearn.metrics import accuracy_score

y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

We get 92% accuracy on the test set. The OOB evaluation was a bit too pessimistic, just
over 2 percentage points too low.

The OOB decision function for each training instance is also available through the
oob_decision_function_ attribute. Since the base estimator has a predict_proba()
method, the decision function returns the class probabilities for each training
instance. For example, the OOB evaluation estimates that the first training instance
has a 67.6% probability of belonging to the positive class and a 32.4% probability of
belonging to the negative class:

In [None]:
bag_clf.oob_decision_function_[:3]  # probas for the first 3 instances

# Random Forest

A random forest is an ensemble of decision trees, generally
trained via the bagging method (or sometimes pasting), typically with max_samples
set to the size of the training set. Instead of building a BaggingClassifier and
passing it a DecisionTreeClassifier, you can use the RandomForestClassifier
class, which is more convenient and optimized for decision trees (similarly, there
is a RandomForestRegressor class for regression tasks). The following code trains a
random forest classifier with 500 trees, each limited to a maximum of 16 leaf nodes, using
all available CPU cores:

In [None]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

With a few exceptions, a RandomForestClassifier has all the hyperparameters of
a DecisionTreeClassifier (to control how trees are grown), plus all the
hyperparameters of a BaggingClassifier to control the ensemble itself.

The random forest algorithm introduces extra randomness when growing trees;
instead of searching for the very best feature when splitting a node (see Chapter 6),
it searches for the best feature among a random subset of features. By default, it
samples $\sqrt{n}$ features (where n is the total number of features). The algorithm results
in greater tree diversity, which (again) trades a higher bias for a lower variance,
generally yielding an overall better model. So, the following BaggingClassifier is
equivalent to the previous RandomForestClassifier:

In [None]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42)

In [None]:
# extra code – verifies that the predictions are identical
bag_clf.fit(X_train, y_train)
y_pred_bag = bag_clf.predict(X_test)
np.all(y_pred_bag == y_pred_rf)  # same predictions

## Feature Importance

Yet another great quality of random forests is that they make it easy to measure the
relative importance of each feature. Scikit-Learn measures a feature’s importance by
looking at how much the tree nodes that use that feature reduce impurity on average,
across all trees in the forest. More precisely, it is a weighted average, where each
node’s weight is equal to the number of training samples that are associated with it
(see Week 7 Lecture Notes).

Scikit-Learn computes this score automatically for each feature after training, then
it scales the results so that the sum of all importances is equal to 1. You can access
the result using the feature_importances_ attribute. For example, the following code
trains a RandomForestClassifier on the iris dataset and
outputs each feature’s importance. It seems that the most important features are
the petal length (44%) and width (42%), while sepal length and width are rather
unimportant in comparison (11% and 2%, respectively):

In [None]:
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score, 2), name)

Similarly, if you train a random forest classifier on the MNIST dataset and plot each pixel’s importance, you get the image represented in figure below.

In [None]:
# extra code – this cell generates figure below

from sklearn.datasets import fetch_openml

X_mnist, y_mnist = fetch_openml('mnist_784', return_X_y=True, as_frame=False,
                                parser='auto')

rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rnd_clf.fit(X_mnist, y_mnist)

heatmap_image = rnd_clf.feature_importances_.reshape(28, 28)
plt.imshow(heatmap_image, cmap="hot")
cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(),
                           rnd_clf.feature_importances_.max()])
cbar.ax.set_yticklabels(['Not important', 'Very important'], fontsize=14)
plt.axis("off")
plt.show()

Random forests are very handy to get a quick understanding of what features actually
matter, in particular if you need to perform feature selection.
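
One possible way to turn these importances into a feature selection step is Scikit-Learn’s SelectFromModel transformer. The sketch below is only an illustration on the iris dataset; the choice of `threshold="mean"` (keep the features whose importance is above the mean importance) is an assumption, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris(as_frame=True)

# Fit a random forest, then keep only the features whose importance
# exceeds the mean importance across all features.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=500, random_state=42),
    threshold="mean")
selector.fit(iris.data, iris.target)

selected = list(iris.data.columns[selector.get_support()])
X_reduced = selector.transform(iris.data)

print("selected features:", selected)
print("reduced shape:", X_reduced.shape)
```

With the importances reported above, only the two petal features clear the mean threshold.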

# Boosting

## AdaBoost

In [None]:
# extra code – this cell generates figure below

m = len(X_train)

fig, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)
for subplot, learning_rate in ((0, 1), (1, 0.5)):
    sample_weights = np.ones(m) / m
    plt.sca(axes[subplot])
    for i in range(5):
        svm_clf = SVC(C=0.2, gamma=0.6, random_state=42)
        svm_clf.fit(X_train, y_train, sample_weight=sample_weights * m)
        y_pred = svm_clf.predict(X_train)

        error_weights = sample_weights[y_pred != y_train].sum()
        r = error_weights / sample_weights.sum()  # equation 7-1
        alpha = learning_rate * np.log((1 - r) / r)  # equation 7-2
        sample_weights[y_pred != y_train] *= np.exp(alpha)  # equation 7-3
        sample_weights /= sample_weights.sum()  # normalization step

        plot_decision_boundary(svm_clf, X_train, y_train, alpha=0.4)
        plt.title(f"learning_rate = {learning_rate}")
    if subplot == 0:
        plt.text(-0.75, -0.95, "1", fontsize=16)
        plt.text(-1.05, -0.95, "2", fontsize=16)
        plt.text(1.0, -0.95, "3", fontsize=16)
        plt.text(-1.45, -0.5, "4", fontsize=16)
        plt.text(1.36,  -0.95, "5", fontsize=16)
    else:
        plt.ylabel("")

plt.show()
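
The comments in the cell above refer to three numbered equations that are not reproduced in this section. Read back from the code itself, they would look as follows. The notation is an assumption made here for illustration: $w^{(i)}$ is the weight of training instance $i$, $\hat{y}_j^{(i)}$ is the prediction of the $j$-th predictor for that instance, $m$ is the number of training instances, and $\eta$ is the learning rate.

```latex
% Weighted error rate of the j-th predictor ("equation 7-1" in the code)
r_j = \frac{\sum_{i:\, \hat{y}_j^{(i)} \neq y^{(i)}} w^{(i)}}
           {\sum_{i=1}^{m} w^{(i)}}

% Predictor weight ("equation 7-2")
\alpha_j = \eta \, \log \frac{1 - r_j}{r_j}

% Weight update rule ("equation 7-3")
w^{(i)} \leftarrow
\begin{cases}
  w^{(i)}                 & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\
  w^{(i)} \exp(\alpha_j)  & \text{if } \hat{y}_j^{(i)} \neq y^{(i)}
\end{cases}

% Normalization step
w^{(i)} \leftarrow \frac{w^{(i)}}{\sum_{k=1}^{m} w^{(k)}}
```

A predictor that is right more often than wrong has $r_j < 0.5$, hence $\alpha_j > 0$, so the instances it misclassifies gain weight before the next predictor is trained.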

The following code trains an AdaBoost classifier based on 30 decision stumps using
Scikit-Learn’s AdaBoostClassifier class (as you might expect, there is also an
AdaBoostRegressor class). A decision stump is a decision tree with max_depth=1—in
other words, a tree composed of a single decision node plus two leaf nodes. This is
the default base estimator for the AdaBoostClassifier class:

Note: If your AdaBoost ensemble is overfitting the training set, you can
try reducing the number of estimators or regularizing the base estimator more strongly.

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=30,
    learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)

In [None]:
# extra code – in case you're curious to see what the decision boundary
#              looks like for the AdaBoost classifier
plot_decision_boundary(ada_clf, X_train, y_train)

## Gradient Boosting

Another very popular boosting algorithm is gradient boosting. Just like AdaBoost,
gradient boosting works by sequentially adding predictors to an ensemble, each one
correcting its predecessor. However, instead of tweaking the instance weights at every
iteration like AdaBoost does, this method tries to fit the new predictor to the residual
errors made by the previous predictor.

Let’s go through a simple regression example, using decision trees as the base
predictors; this is called gradient tree boosting, or gradient boosted regression trees (GBRT).
First, let’s generate a noisy quadratic dataset and fit a DecisionTreeRegressor to it:

In [None]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)  # y = 3x² + Gaussian noise

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

Now let's train another decision tree regressor on the residual errors made by the previous predictor:

In [None]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
tree_reg2.fit(X, y2)

And then we’ll train a third regressor on the residual errors made by the second
predictor:

In [None]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)
tree_reg3.fit(X, y3)

Now we have an ensemble containing three trees. It can make predictions on a new
instance simply by adding up the predictions of all the trees:

In [None]:
X_new = np.array([[-0.4], [0.], [0.5]])
sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
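
The three manual steps above follow one pattern, which can be written as a loop. The sketch below is a simplified illustration of that pattern, not Scikit-Learn’s implementation: the helper names `fit_residual_trees` and `predict_sum` are hypothetical, and the data is regenerated with a different random generator so the block runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic data, similar in spirit to the dataset above.
rng = np.random.default_rng(42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

def fit_residual_trees(X, y, n_trees=3, max_depth=2):
    """Fit each tree to the residuals left by the trees before it."""
    trees, residuals = [], y.copy()
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
        tree.fit(X, residuals)
        residuals = residuals - tree.predict(X)  # what is still unexplained
        trees.append(tree)
    return trees

def predict_sum(trees, X):
    """Ensemble prediction: the sum of the individual tree predictions."""
    return sum(tree.predict(X) for tree in trees)

trees = fit_residual_trees(X, y)

# Training MSE after 1, 2, and 3 trees.
mse_per_stage = [np.mean((y - predict_sum(trees[:k], X)) ** 2)
                 for k in range(1, len(trees) + 1)]
print(mse_per_stage)
```

Each added tree can only lower (or leave unchanged) the training error, since a regression tree predicts the mean residual in each leaf.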

In [None]:
# extra code – this cell generates figure below

def plot_predictions(regressors, X, y, axes, style,
                     label=None, data_style="b.", data_label=None):
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(regressor.predict(x1.reshape(-1, 1))
                 for regressor in regressors)
    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    if label or data_label:
        plt.legend(loc="upper center")
    plt.axis(axes)

plt.figure(figsize=(11, 11))

plt.subplot(3, 2, 1)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.2, 0.8], style="g-",
                 label="$h_1(x_1)$", data_label="Training set")
plt.ylabel("$y$  ", rotation=0)
plt.title("Residuals and tree predictions")

plt.subplot(3, 2, 2)
plot_predictions([tree_reg1], X, y, axes=[-0.5, 0.5, -0.2, 0.8], style="r-",
                 label="$h(x_1) = h_1(x_1)$", data_label="Training set")
plt.title("Ensemble predictions")

plt.subplot(3, 2, 3)
plot_predictions([tree_reg2], X, y2, axes=[-0.5, 0.5, -0.4, 0.6], style="g-",
                 label="$h_2(x_1)$", data_style="k+",
                 data_label="Residuals: $y - h_1(x_1)$")
plt.ylabel("$y$  ", rotation=0)

plt.subplot(3, 2, 4)
plot_predictions([tree_reg1, tree_reg2], X, y, axes=[-0.5, 0.5, -0.2, 0.8],
                  style="r-", label="$h(x_1) = h_1(x_1) + h_2(x_1)$")

plt.subplot(3, 2, 5)
plot_predictions([tree_reg3], X, y3, axes=[-0.5, 0.5, -0.4, 0.6], style="g-",
                 label="$h_3(x_1)$", data_style="k+",
                 data_label="Residuals: $y - h_1(x_1) - h_2(x_1)$")
plt.xlabel("$x_1$")
plt.ylabel("$y$  ", rotation=0)

plt.subplot(3, 2, 6)
plot_predictions([tree_reg1, tree_reg2, tree_reg3], X, y,
                 axes=[-0.5, 0.5, -0.2, 0.8], style="r-",
                 label="$h(x_1) = h_1(x_1) + h_2(x_1) + h_3(x_1)$")
plt.xlabel("$x_1$")

plt.show()

You can use Scikit-Learn’s GradientBoostingRegressor class to train GBRT ensembles more easily (there’s also a GradientBoostingClassifier class for classification). Much like the RandomForestRegressor class, it has hyperparameters to
control the growth of decision trees (e.g., max_depth, min_samples_leaf), as well
as hyperparameters to control the ensemble training, such as the number of trees
(n_estimators). The following code creates the same ensemble as the previous one:

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                 learning_rate=1.0, random_state=42)
gbrt.fit(X, y)

To find the optimal number of trees, you could perform cross-validation using
GridSearchCV or RandomizedSearchCV, as usual, but there’s a simpler way: if you set
the n_iter_no_change hyperparameter to an integer value, say 10, then the
GradientBoostingRegressor will automatically stop adding more trees during training if it
sees that the last 10 trees didn’t help. This is simply early stopping (introduced in
Chapter 4), but with a little bit of patience: it tolerates having no progress for a few
iterations before it stops. Let’s train the ensemble using early stopping:

In [None]:
gbrt_best = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.05, n_estimators=500,
    n_iter_no_change=10, random_state=42)
gbrt_best.fit(X, y)

If you set n_iter_no_change too low, training may stop too early and the model will
underfit. But if you set it too high, it will overfit instead. We also set a fairly small
learning rate and a high number of estimators, but the actual number of estimators in
the trained ensemble is much lower, thanks to early stopping:

In [None]:
gbrt_best.n_estimators_
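
To see what early stopping is approximating, one option is to train a fixed number of trees and measure the validation error after each one with the staged_predict() method. The sketch below does this on regenerated data; the explicit train/validation split and the budget of 200 trees are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Noisy quadratic data, similar in spirit to the dataset above.
rng = np.random.default_rng(42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

gbrt = GradientBoostingRegressor(max_depth=2, learning_rate=0.05,
                                 n_estimators=200, random_state=42)
gbrt.fit(X_train, y_train)

# staged_predict() yields the ensemble's predictions after 1, 2, ... trees.
val_errors = [mean_squared_error(y_val, y_pred)
              for y_pred in gbrt.staged_predict(X_val)]
best_n_estimators = int(np.argmin(val_errors)) + 1

print("best number of trees on the validation set:", best_n_estimators)
```

Unlike n_iter_no_change, this approach trains all 200 trees before looking back for the best count, so it costs more training time.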

In [None]:
# extra code – this cell generates figure below

fig, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)

plt.sca(axes[0])
plot_predictions([gbrt], X, y, axes=[-0.5, 0.5, -0.1, 0.8], style="r-",
                 label="Ensemble predictions")
plt.title(f"learning_rate={gbrt.learning_rate}, "
          f"n_estimators={gbrt.n_estimators_}")
plt.xlabel("$x_1$")
plt.ylabel("$y$", rotation=0)

plt.sca(axes[1])
plot_predictions([gbrt_best], X, y, axes=[-0.5, 0.5, -0.1, 0.8], style="r-")
plt.title(f"learning_rate={gbrt_best.learning_rate}, "
          f"n_estimators={gbrt_best.n_estimators_}")
plt.xlabel("$x_1$")

plt.show()

# Stacking 

Scikit-Learn provides two classes for stacking ensembles: StackingClassifier and
StackingRegressor. For example, we can replace the VotingClassifier we used at
the beginning of this chapter on the moons dataset with a StackingClassifier:

In [None]:
from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=43),
    cv=5  # number of cross-validation folds
)
stacking_clf.fit(X_train, y_train)

In [None]:
stacking_clf.score(X_test, y_test)
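
To give an idea of what stacking involves, the sketch below builds a similar ensemble by hand: out-of-fold predictions from the base classifiers become the training features of the final estimator (the blender). This is a simplified illustration under stated assumptions (only the class-1 probability of each base classifier is used as a feature), not a description of StackingClassifier’s exact internals, so its score may differ from the one above.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

base_estimators = [LogisticRegression(random_state=42),
                   RandomForestClassifier(random_state=42),
                   SVC(probability=True, random_state=42)]

# Out-of-fold P(class 1) from each base estimator, one column per estimator.
# Each training instance is predicted by models that never saw it.
blend_train = np.column_stack([
    cross_val_predict(est, X_train, y_train, cv=5,
                      method="predict_proba")[:, 1]
    for est in base_estimators])

# The blender learns how to combine the base estimators' predictions.
blender = RandomForestClassifier(random_state=43)
blender.fit(blend_train, y_train)

# For new data, the base estimators are refit on the full training set.
for est in base_estimators:
    est.fit(X_train, y_train)
blend_test = np.column_stack([est.predict_proba(X_test)[:, 1]
                              for est in base_estimators])
manual_score = blender.score(blend_test, y_test)
print(manual_score)
```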