# Set up

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
from sklearn.datasets import fetch_openml

X_mnist, y_mnist = fetch_openml(
    "mnist_784", return_X_y=True, as_frame=False, parser="auto"
)

The MNIST dataset is already shuffled and split: the first 60,000 instances form the training set and the last 10,000 form the test set. We hold out the last 10,000 instances of the training set as a validation set.

In [5]:
X_train, y_train = X_mnist[:50000], y_mnist[:50000]
X_valid, y_valid = X_mnist[50000:60000], y_mnist[50000:60000]
X_test, y_test = X_mnist[60000:], y_mnist[60000:]

Following the exercise instructions, we will train a random forest classifier, an extra-trees classifier, and an SVM classifier. We will also train a multilayer perceptron.

In [6]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

In [5]:
random_forest_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
extra_trees = ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, dual=True, random_state=42)
mlp_clf = MLPClassifier(random_state=42)

In [6]:
estimators = [random_forest_clf, extra_trees, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

Training the RandomForestClassifier(n_jobs=-1, random_state=42)
Training the ExtraTreesClassifier(n_jobs=-1, random_state=42)
Training the LinearSVC(dual=True, max_iter=100, random_state=42, tol=20)
Training the MLPClassifier(random_state=42)


Following the exercise instructions, we will build a new training set from the predictions on the validation set: each training instance is a vector containing the predictions of all our classifiers for one image, and the target is the image's class.

In [18]:
X_valid_predictions = np.empty((len(X_valid), len(estimators)), dtype=object)

for index, estimator in enumerate(estimators):
    X_valid_predictions[:, index] = estimator.predict(X_valid)

In [19]:
X_valid_predictions

array([['3', '3', '3', '3'],
       ['8', '8', '8', '8'],
       ['6', '6', '6', '6'],
       ...,
       ['5', '5', '5', '5'],
       ['6', '6', '6', '6'],
       ['8', '8', '8', '8']], dtype=object)

In [20]:
random_forest_blender = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=42
)
random_forest_blender.fit(X_valid_predictions, y_valid)

In [21]:
random_forest_blender.oob_score_

0.9723

You can fine-tune this blender or try other types of blenders (e.g. an `MLPClassifier`), then select the best one using cross-validation.
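As a minimal sketch of that selection step, the snippet below compares two candidate blenders with `cross_val_score`. To stay self-contained and fast, it uses a small synthetic stand-in for `X_valid_predictions` and `y_valid` (the names `X_demo`, `y_demo` and `candidate_blenders` are assumptions made for this illustration, not part of the notebook). The one-hot encoding in front of the MLP is also a design choice of the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for (X_valid_predictions, y_valid): four "classifiers"
# that each output the true digit roughly 90% of the time.
rng = np.random.default_rng(42)
n_samples, n_classifiers = 1000, 4
y_demo = rng.integers(0, 10, size=n_samples)
X_demo = np.column_stack([
    np.where(rng.random(n_samples) < 0.9, y_demo, rng.integers(0, 10, size=n_samples))
    for _ in range(n_classifiers)
])

candidate_blenders = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # One-hot encode the predicted labels so the MLP does not treat digits as ordered.
    "mlp": make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),
        MLPClassifier(max_iter=500, random_state=42),
    ),
}

# Mean 3-fold cross-validated accuracy for each candidate blender
cv_scores = {
    name: cross_val_score(blender, X_demo, y_demo, cv=3).mean()
    for name, blender in candidate_blenders.items()
}
best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores, "-> best:", best_name)
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_valid_predictions` and `y_valid`.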

Exercise: _Congratulations, you have just trained a blender, and together with the classifiers they form a stacking ensemble! Now let's evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. How does it compare to the voting classifier you trained earlier?_

In [22]:
X_test_predictions = np.empty((len(X_test), len(estimators)), dtype=object)

for index, estimator in enumerate(estimators):
    X_test_predictions[:, index] = estimator.predict(X_test)

In [23]:
from sklearn.metrics import accuracy_score

y_predict = random_forest_blender.predict(X_test_predictions)
accuracy_score(y_test, y_predict)

0.9695

It does not perform as well as our previous voting classifier.

Exercise: _Now try again using a `StackingClassifier` instead: do you get better performance? If so, why?_

`StackingClassifier` performs K-fold cross-validation internally, so we don't need a separate validation set. Let's merge the training set and the validation set into one larger training set.
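To illustrate why cross-validation removes the need for a held-out set, here is a small sketch using `cross_val_predict` on the much smaller `load_digits` dataset (a stand-in chosen for speed; `X_demo`, `y_demo` and `base_clf` are illustrative names). Every row receives a prediction from a model that was trained without that row, so the blender can be trained on these out-of-fold predictions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Small stand-in dataset (8x8 digits) so the sketch runs quickly
X_demo, y_demo = load_digits(return_X_y=True)

base_clf = RandomForestClassifier(n_estimators=50, random_state=42)

# With cv=5, each row is predicted by a model fitted on the other 4 folds,
# so no prediction comes from a model that saw that row during training.
oof_proba = cross_val_predict(
    base_clf, X_demo, y_demo, cv=5, method="predict_proba"
)
print(oof_proba.shape)  # one row per instance, one column per class
```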

In [24]:
X_train_big, y_train_big = X_mnist[:60000], y_mnist[:60000]

Now let's create and train the stacking classifier on the full training set.

In [25]:
named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
]

**Warning**: The following code cell takes a very long time to run, as it uses K-fold cross-validation with 5 folds by default. It will train the 4 classifiers 5 times, each time on 80% of the training set, plus one last time on the full training set. Lastly, it will train the final model on the cross-validated predictions. That is a total of 25 models to train.
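The count of 25 can be checked with a few lines of arithmetic (the variable names are only for illustration):

```python
n_base_estimators = 4  # random forest, extra-trees, SVM, MLP
n_folds = 5            # default number of folds used by StackingClassifier

cv_fits = n_base_estimators * n_folds  # each classifier is trained once per fold
full_fits = n_base_estimators          # each classifier is retrained on the full training set
final_fits = 1                         # the blender (final estimator)

total_fits = cv_fits + full_fits + final_fits
print(total_fits)  # 25
```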

In [26]:
from sklearn.ensemble import StackingClassifier

stack_clf = StackingClassifier(
    estimators=named_estimators, n_jobs=-1, final_estimator=random_forest_blender
)
stack_clf.fit(X_train_big, y_train_big)

Because training takes so long, we save the fitted model with joblib so that it can be reloaded later without retraining.

In [27]:
import joblib

joblib.dump(stack_clf, "stacking_classifier.pkl")
stack_clf = joblib.load("stacking_classifier.pkl")

In [28]:
stack_clf.score(X_test, y_test)

0.9783

The `StackingClassifier` outperforms our earlier custom stacking implementation (0.9783 versus 0.9695 accuracy on the test set). There are two main reasons:
- Since we could reclaim the validation set, the `StackingClassifier` was trained on a larger training set.
- It uses `predict_proba()` if available, otherwise `decision_function()`, and falls back to `predict()` only as a last resort, whereas our implementation used `predict()` alone. This gives the blender much richer inputs to work with.
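The second point can be made concrete with a small sketch on the `load_digits` dataset (a stand-in chosen for speed; the estimators and the `LogisticRegression` blender are assumptions made for this illustration). It compares the blender's input width under the default `stack_method="auto"` with the width when only `predict()` is used.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X_demo, y_demo = load_digits(return_X_y=True)

demo_estimators = [
    ("rf", RandomForestClassifier(n_estimators=30, random_state=42)),
    ("svm", LinearSVC(max_iter=100, tol=20, dual=True, random_state=42)),
]

# Default: use predict_proba() or decision_function() where available
stack_auto = StackingClassifier(
    estimators=demo_estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack_auto.fit(X_demo, y_demo)

# Hard labels only, as in the hand-written blender
stack_hard = StackingClassifier(
    estimators=demo_estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict",
    cv=3,
)
stack_hard.fit(X_demo, y_demo)

print(stack_auto.stack_method_)             # method chosen for each base estimator
print(stack_auto.transform(X_demo).shape)   # one column per class per estimator
print(stack_hard.transform(X_demo).shape)   # one column per estimator
```

With 10 classes, the default setting passes 10 scores per base classifier to the blender, while `stack_method="predict"` passes a single label per classifier.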