# 8.

Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing).

Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM classifier.

Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting.

Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

Warning: since Scikit-Learn 0.24, fetch_openml() returns a Pandas DataFrame by default. To avoid this and keep the same code as in the book, we use as_frame=False.

In [1]:
import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)

In [2]:
from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)
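
As a quick sanity check of the two-step split above, here is a minimal sketch that uses a placeholder array of 70,000 row indices instead of the real MNIST pixels (the placeholder data and the `X`, `y` names are assumptions made for illustration). It only confirms the resulting set sizes and that the three sets do not overlap.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: one column holding the row index, so overlaps are easy to detect.
# This stands in for MNIST's 70,000 instances; it is not the real dataset.
X = np.arange(70_000).reshape(-1, 1)
y = np.zeros(70_000, dtype=np.uint8)

# Same two-step split as above: first carve out the test set, then the validation set.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=10_000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10_000, random_state=42)

print(len(X_train), len(X_val), len(X_test))

# Every instance should land in exactly one of the three sets.
all_ids = np.concatenate([X_train, X_val, X_test]).ravel()
print(len(np.unique(all_ids)))
```

Passing an integer `test_size` asks for that many instances rather than a fraction, which is why the second call still yields exactly 10,000 validation instances.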

In [3]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

In [4]:
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_tree_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, random_state=42)
mlp_clf = MLPClassifier(random_state=42)

In [5]:
%%time
estimators = [random_forest_clf, extra_tree_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(random_state=42)
Training the ExtraTreesClassifier(random_state=42)
Training the LinearSVC(max_iter=100, random_state=42, tol=20)
Training the MLPClassifier(random_state=42)


In [6]:
[estimator.score(X_val, y_val) for estimator in estimators]

[0.9692, 0.9715, 0.859, 0.9614]

The linear SVM is far outperformed by the other classifiers. However, let's keep it for now since it may improve the voting classifier's performance.
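
To illustrate why a weaker classifier might still help a hard-voting ensemble, here is a toy sketch in plain Python. The labels and predictions are made up for illustration (they are not MNIST results): the point is only that a majority vote can beat every voter when the voters make their mistakes on different instances.

```python
from collections import Counter

# Made-up ground truth and predictions for six instances (purely illustrative).
y_true = [0, 1, 2, 3, 4, 5]
preds = {
    "clf_a": [0, 1, 2, 3, 4, 0],  # wrong on the last instance
    "clf_b": [0, 1, 2, 3, 0, 5],  # wrong on the fifth instance
    "weak":  [0, 0, 0, 3, 4, 5],  # wrong twice, but on different instances
}

def accuracy(y_pred):
    """Fraction of predictions that match y_true."""
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

def majority_vote(columns):
    """Hard voting: pick the most frequent label for each instance."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*columns)]

ensemble = majority_vote(preds.values())

for name, y_pred in preds.items():
    print(name, round(accuracy(y_pred), 3))
print("ensemble", round(accuracy(ensemble), 3))
```

In practice the errors of real classifiers are correlated, so the gain is usually much smaller than in this toy case, and a weak voter can also hurt.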

Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier.

In [7]:
from sklearn.ensemble import VotingClassifier

named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_tree_clf", extra_tree_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
]

In [8]:
%%time
voting_clf = VotingClassifier(named_estimators)
voting_clf.fit(X_train, y_train)

In [9]:
voting_clf.score(X_val, y_val)

0.9722

In [10]:
[estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]

[0.9692, 0.9715, 0.859, 0.9614]

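Let's remove the SVM to see if performance improves. It is possible to remove an estimator by setting it to None using set_params(), as shown below. Note that Scikit-Learn 0.24 and later expects the string "drop" rather than None if you refit the ensemble afterwards; here we do not refit, so None still works: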



In [11]:
voting_clf.set_params(svm_clf=None)

The updated list of estimators:

In [12]:
voting_clf.estimators

[('random_forest_clf', RandomForestClassifier(random_state=42)),
 ('extra_tree_clf', ExtraTreesClassifier(random_state=42)),
 ('svm_clf', None),
 ('mlp_clf', MLPClassifier(random_state=42))]

However, it did not update the list of trained estimators:

In [13]:
voting_clf.estimators_

[RandomForestClassifier(random_state=42),
 ExtraTreesClassifier(random_state=42),
 LinearSVC(max_iter=100, random_state=42, tol=20),
 MLPClassifier(random_state=42)]

So we can either fit the VotingClassifier again, or just remove the SVM from the list of trained estimators:



In [14]:
del voting_clf.estimators_[2]
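
For comparison, here is a minimal sketch of the other option mentioned above, refitting the ensemble after dropping an estimator. It assumes a recent Scikit-Learn version (where the string `"drop"` is the supported way to disable an estimator) and uses the small bundled digits dataset plus hypothetical names (`demo_clf`, `rf`, `et`, `svm`) so that it runs quickly; it is not the MNIST setup used in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Small stand-in dataset (8x8 digits), not MNIST.
X, y = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=42)

demo_clf = VotingClassifier([
    ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
    ("et", ExtraTreesClassifier(n_estimators=20, random_state=42)),
    ("svm", LinearSVC(max_iter=100, tol=20, random_state=42)),
])
demo_clf.fit(X_tr, y_tr)
n_before = len(demo_clf.estimators_)

# Disable the SVM, then refit so that estimators_ is rebuilt without it.
demo_clf.set_params(svm="drop")
demo_clf.fit(X_tr, y_tr)
n_after = len(demo_clf.estimators_)

print(n_before, n_after)
print(demo_clf.score(X_va, y_va))
```

Refitting retrains every remaining estimator, which is why deleting the entry from `estimators_` is the cheaper shortcut on a dataset the size of MNIST.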

Now let's evaluate the VotingClassifier again:

In [15]:
voting_clf.score(X_val, y_val)

0.974

A bit better! The SVM was hurting performance. Now let's try using a soft voting classifier. We do not actually need to retrain the classifier; we can just set voting to "soft", since the three remaining classifiers all provide predict_proba():

In [16]:
voting_clf.voting = "soft"

In [17]:
voting_clf.score(X_val, y_val)

0.9699

Nope, hard voting wins here.
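
Hard and soft voting can disagree because soft voting averages the class probabilities, so one confident classifier can outweigh two hesitant ones. The sketch below uses made-up probabilities for a single instance and two classes (none of these numbers come from the MNIST models).

```python
# Made-up class probabilities from three classifiers for one instance.
probas = [
    [0.55, 0.45],  # leans towards class 0, weakly
    [0.60, 0.40],  # leans towards class 0, weakly
    [0.05, 0.95],  # picks class 1, confidently
]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Hard voting: each classifier casts one vote for its most likely class.
hard_votes = [argmax(p) for p in probas]
hard_prediction = max(set(hard_votes), key=hard_votes.count)

# Soft voting: average the probabilities per class, then take the argmax.
mean_proba = [sum(column) / len(probas) for column in zip(*probas)]
soft_prediction = argmax(mean_proba)

print(hard_votes, hard_prediction)
print([round(p, 2) for p in mean_proba], soft_prediction)
```

Which scheme wins depends on how well calibrated the individual probabilities are, so it is worth checking both on the validation set, as done above.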

Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [18]:
voting_clf.voting = "hard"
voting_clf.score(X_test, y_test)

0.9703

In [19]:
[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

[0.9645, 0.9691, 0.9607]

The voting classifier only slightly improved on the best individual model in this case: 0.9703 test accuracy versus 0.9691 for the Extra-Trees classifier.
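
To put the improvement in terms of error rate, the arithmetic can be sketched as follows, using the two test accuracies printed above (the variable names are made up for this example).

```python
# Test accuracies taken from the outputs above.
best_single_acc = 0.9691  # best individual classifier (Extra-Trees)
voting_acc = 0.9703       # hard voting ensemble

best_single_err = 1 - best_single_acc
voting_err = 1 - voting_acc

# Extra test images classified correctly, out of 10,000.
extra_correct = round((voting_acc - best_single_acc) * 10_000)

# Relative reduction of the error rate.
relative_reduction = (best_single_err - voting_err) / best_single_err

print(f"error rate: {best_single_err:.2%} -> {voting_err:.2%}")
print(f"extra correct images: {extra_correct}")
print(f"relative error reduction: {relative_reduction:.1%}")
```

That works out to 12 more test images classified correctly, or a relative error reduction of roughly 3.9%.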

# 9.

Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class.

Train a classifier on this new training set.

In [20]:
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)

In [21]:
X_val_predictions

array([[5., 5., 5., 5.],
       [8., 8., 8., 8.],
       [2., 2., 3., 2.],
       ...,
       [7., 7., 7., 7.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]], dtype=float32)
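
The matrix above stores each predicted class as a number, which suits a tree-based blender but implies an ordering between digits that a linear model or a neural network might take literally. One possible alternative, sketched here with NumPy on the first three rows of the output above, is to one-hot encode each prediction. This is only an assumption about what might help; it is not evaluated in this notebook.

```python
import numpy as np

n_classes = 10

# First three rows of the prediction matrix shown above, as integers:
# one row per image, one column per classifier.
preds = np.array([[5, 5, 5, 5],
                  [8, 8, 8, 8],
                  [2, 2, 3, 2]])

# Index an identity matrix with the class labels: shape (3, 4, 10).
one_hot = np.eye(n_classes, dtype=np.float32)[preds]

# Flatten to one row per image: 4 classifiers x 10 classes = 40 features.
X_blend = one_hot.reshape(len(preds), -1)

print(X_blend.shape)
print(X_blend[2].reshape(4, n_classes).argmax(axis=1))  # recovers [2 2 3 2]
```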

In [23]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_predictions, y_val)

In [25]:
rnd_forest_blender.oob_score_

0.97
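
The `oob_score_` above is an out-of-bag estimate: each tree is scored only on the training instances left out of its bootstrap sample, so the blender gets a validation-like accuracy without a separate held-out set. The sketch below compares the two on the small bundled digits dataset (an assumption made to keep it fast; it is not the blender's actual training data).

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small stand-in dataset (8x8 digits), not the blender's prediction matrix.
X, y = load_digits(return_X_y=True)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X_tr, y_tr)

oob_acc = forest.oob_score_              # estimated from out-of-bag instances
holdout_acc = forest.score(X_ho, y_ho)   # measured on data never seen in training

print(f"OOB: {oob_acc:.3f}  held-out: {holdout_acc:.3f}")
```

The two figures are typically close, which is what makes the OOB score a convenient stand-in when the validation set has already been used to train the blender.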

You could fine-tune this blender or try other types of blenders (e.g., an MLPClassifier), then select the best one using cross-validation, as always.

Congratulations, you have just trained a blender, and together with the classifiers they form a stacking ensemble!
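
Scikit-Learn 0.22 and later also provides a `StackingClassifier` that automates this kind of ensemble. The sketch below runs it on the small bundled digits dataset with hypothetical estimator names (`rf`, `et`). It is not equivalent to the manual procedure above: by default it trains the blender on out-of-fold class probabilities obtained by cross-validation rather than on hard predictions for a held-out validation set.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.model_selection import train_test_split

# Small stand-in dataset (8x8 digits), not MNIST.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=30, random_state=42)),
        ("et", ExtraTreesClassifier(n_estimators=30, random_state=42)),
    ],
    # The blender; a Random Forest, to mirror the manual example above.
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    cv=3,  # out-of-fold predictions from 3-fold cross-validation feed the blender
)
stacking_clf.fit(X_tr, y_tr)

stacking_acc = stacking_clf.score(X_te, y_te)
blender_inputs = stacking_clf.transform(X_te)  # the features the blender sees

print(stacking_acc)
print(blender_inputs.shape)  # 2 estimators x 10 class probabilities per instance
```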

Now let's evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. How does it compare to the voting classifier you trained earlier?

In [26]:
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=np.float32)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

In [27]:
y_pred = rnd_forest_blender.predict(X_test_predictions)

In [28]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.9691

This stacking ensemble does not perform as well as the voting classifier we trained earlier (0.9691 versus 0.9703 on the test set); it only matches the best individual classifier.