# Set up

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [12]:
from sklearn.datasets import fetch_openml

X_mnist, y_mnist = fetch_openml(
    "mnist_784", return_X_y=True, as_frame=False, parser="auto"
)

The MNIST dataset is already shuffled and split: the first 60,000 instances form the training set and the last 10,000 form the test set. We hold out the last 10,000 training instances as a validation set.

In [13]:
X_train, y_train = X_mnist[:50000], y_mnist[:50000]
X_valid, y_valid = X_mnist[50000:60000], y_mnist[50000:60000]
X_test, y_test = X_mnist[60000:], y_mnist[60000:]

Following the exercise instructions, we will train a random forest classifier, an extra-trees classifier, and an SVM classifier. We will also train a multilayer perceptron (MLP).

In [14]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

In [15]:
random_forest_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
extra_trees = ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, dual=True, random_state=42)
mlp_clf = MLPClassifier(random_state=42)

In [16]:
estimators = [random_forest_clf, extra_trees, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(n_jobs=-1, random_state=42)
Training the ExtraTreesClassifier(n_jobs=-1, random_state=42)
Training the LinearSVC(dual=True, max_iter=100, random_state=42, tol=20)
Training the MLPClassifier(random_state=42)


In [17]:
[estimator.score(X_valid, y_valid) for estimator in estimators]

[0.9736, 0.9743, 0.8662, 0.9658]

The linear SVM is outperformed by all the other models, but let's keep it for now to see whether the ensemble performs any better.

Following the exercise, we now combine them into an ensemble. Because `LinearSVC` cannot predict class probabilities (it has no `predict_proba()` method), we will use hard voting.
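As a quick check of that claim, the sketch below inspects which of the four classifier types expose a `predict_proba()` method. The dictionary keys are illustrative names, not identifiers from the exercise, and unfitted instances are used because only the presence of the method matters here.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Unfitted instances are enough: we only inspect which methods exist.
candidates = {
    "random_forest": RandomForestClassifier(),
    "extra_trees": ExtraTreesClassifier(),
    "linear_svc": LinearSVC(),
    "mlp": MLPClassifier(),
}

# True when the classifier can output class probabilities (needed for soft voting).
supports_proba = {
    name: hasattr(clf, "predict_proba") for name, clf in candidates.items()
}
print(supports_proba)
```

Only `LinearSVC` lacks the method, which is why an ensemble that includes it has to fall back on hard voting.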

In [18]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[
        ("random_forest_clf", random_forest_clf),
        ("extra_trees_clf", extra_trees),
        ("svm_clf", svm_clf),
        ("mlp_clf", mlp_clf),
    ],
    n_jobs=-1,
)

In [19]:
voting_clf.fit(X_train, y_train)

In [20]:
voting_clf.score(X_valid, y_valid)

0.975
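To make the hard-voting rule concrete, here is a minimal NumPy sketch of a majority vote. The `predictions` array and the `hard_vote` helper are hypothetical and only illustrate the idea; this is not the `VotingClassifier` source. In this sketch, ties go to the smallest class index.

```python
import numpy as np

# Hypothetical predictions: one row per classifier, one column per sample.
predictions = np.array([
    [3, 5, 8, 1],
    [3, 5, 3, 1],
    [3, 2, 8, 7],
])


def hard_vote(preds):
    """Return the most frequent class index in each column (one per sample)."""
    return np.array([np.bincount(column).argmax() for column in preds.T])


majority = hard_vote(predictions)
print(majority)  # [3 5 8 1]
```

Each sample gets the class that most classifiers agree on, so a single weak voter is outvoted whenever the others agree.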

The `VotingClassifier` cloned each base classifier and trained the clones using class indices as labels, instead of the original class names. Therefore, to evaluate the clones directly, we need to provide class indices as well. To convert the class names to class indices, we can use a `LabelEncoder`.

In [21]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_valid_encoded = encoder.fit_transform(y_valid)

In MNIST's case, we can convert the class names to class indices directly, since the digits match the class indices.

In [25]:
y_valid_encoded = y_valid.astype(np.int64)
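The label handling described above can be seen on a small dataset. The sketch below uses the bundled iris data with hypothetical string labels (standing in for MNIST's class names) and two arbitrary base classifiers. It compares the classes seen by the ensemble with those seen by one of its fitted clones.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical string labels standing in for MNIST's class names.
names = np.array(["setosa", "versicolor", "virginica"])
y_str = names[y]

toy_voting = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ]
)
toy_voting.fit(X, y_str)

ensemble_classes = toy_voting.classes_              # original class names
clone_classes = toy_voting.estimators_[0].classes_  # integer class indices
print(ensemble_classes, clone_classes)
```

The ensemble reports the original names, while the clone only knows the indices 0, 1, and 2. Scoring a clone against the string labels would therefore count every prediction as wrong.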

Here, we will evaluate the performance of the classifier clones.

In [27]:
[estimator.score(X_valid, y_valid_encoded) for estimator in voting_clf.estimators_]

[0.9736, 0.9743, 0.8662, 0.9613]

Let's remove the SVM to see whether performance improves. We can remove an estimator by setting it to `"drop"` with the `set_params()` method.

In [28]:
voting_clf.set_params(svm_clf="drop")

This updates the list of estimators.

In [29]:
voting_clf.estimators

[('random_forest_clf', RandomForestClassifier(n_jobs=-1, random_state=42)),
 ('extra_trees_clf', ExtraTreesClassifier(n_jobs=-1, random_state=42)),
 ('svm_clf', 'drop'),
 ('mlp_clf', MLPClassifier(random_state=42))]

However, doing so does not update the list of *trained* estimators.

In [30]:
voting_clf.estimators_

[RandomForestClassifier(n_jobs=-1, random_state=42),
 ExtraTreesClassifier(n_jobs=-1, random_state=42),
 LinearSVC(dual=True, max_iter=100, random_state=42, tol=20),
 MLPClassifier(random_state=42)]

In [33]:
voting_clf.named_estimators_

{'random_forest_clf': RandomForestClassifier(n_jobs=-1, random_state=42),
 'extra_trees_clf': ExtraTreesClassifier(n_jobs=-1, random_state=42),
 'svm_clf': LinearSVC(dual=True, max_iter=100, random_state=42, tol=20),
 'mlp_clf': MLPClassifier(random_state=42)}

So we can either fit the `VotingClassifier` again, or just remove the `LinearSVC` from the list of trained estimators, in both `estimators_` and `named_estimators_`.

In [34]:
svm_clf_trained = voting_clf.named_estimators_.pop("svm_clf")
voting_clf.estimators_.remove(svm_clf_trained)

Now the `LinearSVC` clone in the `voting_clf` is gone.

In [35]:
voting_clf.named_estimators_

{'random_forest_clf': RandomForestClassifier(n_jobs=-1, random_state=42),
 'extra_trees_clf': ExtraTreesClassifier(n_jobs=-1, random_state=42),
 'mlp_clf': MLPClassifier(random_state=42)}

In [36]:
voting_clf.estimators_

[RandomForestClassifier(n_jobs=-1, random_state=42),
 ExtraTreesClassifier(n_jobs=-1, random_state=42),
 MLPClassifier(random_state=42)]
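For comparison, the other option mentioned above (fitting the `VotingClassifier` again after dropping an estimator) might look like the sketch below. It uses the small iris dataset and hypothetical estimator names so that it runs quickly. On MNIST, refitting would retrain every remaining classifier, which is why the notebook edits the fitted lists instead.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

toy_voting = VotingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=10, random_state=42)),
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("svm", LinearSVC(dual=True, max_iter=10_000, random_state=42)),
    ]
)
toy_voting.fit(X, y)
n_before = len(toy_voting.estimators_)

# Mark the SVM as dropped, then refit so estimators_ is rebuilt without it.
toy_voting.set_params(svm="drop")
toy_voting.fit(X, y)
n_after = len(toy_voting.estimators_)
print(n_before, n_after)  # 3 2
```

Refitting keeps the fitted attributes consistent without touching them by hand, at the cost of training time.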

Now let's evaluate the `VotingClassifier` again.

In [37]:
voting_clf.score(X_valid, y_valid)

0.9761

A little better. It looks like the SVM did hurt performance. Now let's try soft voting. We don't need to retrain the classifier; we can just set `voting` to `"soft"`.

In [38]:
voting_clf.voting = "soft"

In [39]:
voting_clf.score(X_valid, y_valid)

0.9703

Well, hard voting wins in this case.
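The two voting modes can disagree on the same sample, as the sketch below shows. The probabilities are made up for illustration: two classifiers lean weakly toward class 1, while one is very confident about class 0.

```python
import numpy as np

# Hypothetical class probabilities for ONE sample: one row per classifier,
# one column per class.
probas = np.array([
    [0.45, 0.55, 0.00],
    [0.40, 0.60, 0.00],
    [0.90, 0.10, 0.00],
])

# Hard voting: each classifier votes for its top class, majority wins.
hard_choice = np.bincount(probas.argmax(axis=1)).argmax()

# Soft voting: average the probabilities, then pick the top class.
soft_choice = probas.mean(axis=0).argmax()

print(hard_choice, soft_choice)  # 1 0
```

Soft voting gives more weight to confident classifiers. That helps when the probabilities are well calibrated and can hurt when they are not, which may be part of why hard voting scores higher here.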

Now, back to the exercise. We will evaluate the ensemble and each individual classifier on the test set and compare their performance.

In [40]:
voting_clf.voting = "hard"
voting_clf.score(X_test, y_test)

0.9733

In [41]:
[
    estimator.score(X_test, y_test.astype(np.int8))
    for estimator in voting_clf.estimators_
]

[0.968, 0.9703, 0.9618]

The voting classifier reduces the error rate of the best individual model (the extra-trees classifier) from about 3% to about 2.7%, which is roughly 10% fewer errors.
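The arithmetic behind that comparison, using the test-set scores reported above (the variable names are illustrative):

```python
best_single_accuracy = 0.9703  # extra-trees score on the test set
ensemble_accuracy = 0.9733     # hard-voting ensemble score on the test set

best_error = 1 - best_single_accuracy
ensemble_error = 1 - ensemble_accuracy

# Relative reduction: share of the best model's errors that the ensemble removes.
relative_reduction = (best_error - ensemble_error) / best_error

print(f"{best_error:.2%} -> {ensemble_error:.2%} "
      f"({relative_reduction:.1%} fewer errors)")
# 2.97% -> 2.67% (10.1% fewer errors)
```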