Load the MNIST data (introduced in Chapter 3), and split it into a training set, a
validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation,
and 10,000 for testing). Then train various classifiers, such as a Random
Forest classifier, an Extra-Trees classifier, and an SVM classifier. Next, try to combine
them into an ensemble that outperforms each individual classifier on the
validation set, using soft or hard voting. Once you have found one, try it on the
test set. How much better does it perform compared to the individual classifiers?
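The difference between hard and soft voting can be illustrated with a small NumPy sketch. The probabilities below are invented for illustration and do not come from the MNIST classifiers; the point is only that the two schemes can disagree on the same inputs.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for ONE sample
# over three classes (values invented purely for illustration).
probas = np.array([
    [0.90, 0.10, 0.00],  # classifier A is confident in class 0
    [0.40, 0.60, 0.00],  # classifier B leans towards class 1
    [0.45, 0.55, 0.00],  # classifier C leans towards class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_winner = mean_proba.argmax()

print("hard:", hard_winner, "soft:", soft_winner)  # hard: 1 soft: 0
```

Soft voting gives more weight to confident classifiers, which helps when their probabilities are well calibrated and can hurt when they are not.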

In [42]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [43]:
from sklearn.datasets import fetch_openml

# Fetch the MNIST dataset
# 'mnist_784' refers to the full 28x28 pixel MNIST dataset with 784 features.
# as_frame=False returns NumPy arrays instead of pandas DataFrames.
# cache=True (default) stores the dataset locally after the first download.
mnist = fetch_openml('mnist_784', as_frame=False, cache=True)

# The 'mnist' object is a Bunch object, similar to a dictionary.
# It contains 'data' (features) and 'target' (labels).
X = mnist.data  # Features (image pixel data)
y = mnist.target # Labels (digits 0-9)


In [44]:
from sklearn.model_selection import train_test_split

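# First split: 50,000 training instances; the remaining 20,000 are held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=50000, test_size=20000, random_state=42)
# Second split: divide the held-out 20,000 into 10,000 validation and 10,000 test instances.
X_val, X_test, y_val, y_test = train_test_split(
    X_test, y_test, train_size=10000, test_size=10000, random_state=42)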

In [45]:
X_train.shape, X_val.shape, X_test.shape

((50000, 784), (10000, 784), (10000, 784))

In [46]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

In [47]:
rand_forest = RandomForestClassifier(n_jobs=-1)

In [48]:
extra_trees = ExtraTreesClassifier(n_jobs=-1)

In [49]:
log_reg = LogisticRegression(max_iter=500, n_jobs=-1)

In [50]:
from sklearn.ensemble import VotingClassifier

hard_vote = VotingClassifier(
    estimators=[('rand_forest', rand_forest),
                ('extra_trees', extra_trees),
                ('log_reg', log_reg)],
    voting='hard',
    n_jobs=-1
)

soft_vote = VotingClassifier(
    estimators=[('rand_forest', rand_forest),
                ('extra_trees', extra_trees),
                ('log_reg', log_reg)],
    voting='soft',
    n_jobs=-1
)
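One detail worth knowing here: `VotingClassifier` fits clones of the estimators it is given, so fitting the ensemble leaves the original objects untouched. The sketch below shows this on scikit-learn's small bundled digits dataset (`load_digits`), which is an assumption made to keep the example fast; the names `rf`, `lr`, and `voting` are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Small 8x8 digits dataset bundled with scikit-learn (stand-in for MNIST).
X_small, y_small = load_digits(return_X_y=True)

rf = RandomForestClassifier(n_estimators=20, random_state=42)
lr = LogisticRegression(max_iter=5000)
voting = VotingClassifier(estimators=[("rf", rf), ("lr", lr)], voting="soft")
voting.fit(X_small, y_small)

# The ensemble trained clones, so the original forest is still unfitted,
# while the fitted copy is available through named_estimators_.
original_is_fitted = hasattr(rf, "estimators_")
clone_is_fitted = hasattr(voting.named_estimators_["rf"], "estimators_")
print(original_is_fitted, clone_is_fitted)  # False True
```

This is why the individual classifiers still need their own `fit` call if they are to be used on their own later.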

In [51]:
# rand_forest.fit(X_train, y_train)
# extra_trees.fit(X_train, y_train)
# log_reg.fit(X_train, y_train)
# hard_vote.fit(X_train, y_train)

In [52]:
# from sklearn.metrics import classification_report

# print(classification_report(y_val, rand_forest.predict(X_val)))

# print(classification_report(y_val, extra_trees.predict(X_val)))

# print(classification_report(y_val, log_reg.predict(X_val)))

In [53]:
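from sklearn.metrics import accuracy_score

# Fit each model on the training set and report its validation accuracy.
# Both ensembles print as "VotingClassifier": the first line is hard voting,
# the second is soft voting (they appear in loop order).
for clf in (rand_forest, extra_trees, log_reg, hard_vote, soft_vote):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_val)
    print(clf.__class__.__name__, accuracy_score(y_val, y_pred))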

RandomForestClassifier 0.9654
ExtraTreesClassifier 0.9692


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression 0.9143


VotingClassifier 0.9687


VotingClassifier 0.9496
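The convergence warning above suggests scaling the data. A minimal sketch of that idea, assuming a `StandardScaler` placed in front of the logistic regression in a pipeline, is shown below. It uses the small bundled digits dataset rather than MNIST so it runs quickly, and the names `scaled_log_reg`, `X_tr`, and `X_va` are illustrative. Whether scaling fully removes the warning on MNIST at `max_iter=500` has not been checked here.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small 8x8 digits dataset bundled with scikit-learn (stand-in for MNIST).
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_small, y_small, test_size=0.25, random_state=42)

# The scaler is fitted on the training data only, then applied to any data
# passed to predict/score, which avoids leaking validation statistics.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)

val_accuracy = scaled_log_reg.score(X_va, y_va)
print(f"validation accuracy: {val_accuracy:.4f}")
```

A pipeline like this can be passed to `VotingClassifier` in place of the bare `LogisticRegression`, since it exposes the same `fit`, `predict`, and `predict_proba` methods.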


Run the individual classifiers from the previous exercise to make predictions on
the validation set, and create a new training set with the resulting predictions:
each training instance is a vector containing the set of predictions from all your
classifiers for an image, and the target is the image’s class. Train a classifier on
this new training set. Congratulations, you have just trained a blender, and
together with the classifiers it forms a stacking ensemble! Now evaluate the
ensemble on the test set. For each image in the test set, make predictions with all
your classifiers, then feed the predictions to the blender to get the ensemble’s predictions.
How does it compare to the voting classifier you trained earlier?
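For reference, scikit-learn also ships a ready-made `StackingClassifier`. The sketch below is not the hold-out procedure the exercise describes: instead of a separate validation set, it builds the blender's training data from cross-validated predictions of the base estimators. It runs on the small bundled digits dataset, and the estimator names `"rf"` and `"et"` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small 8x8 digits dataset bundled with scikit-learn (stand-in for MNIST).
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.25, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=5000),  # the blender
    cv=5,  # blender is trained on 5-fold out-of-fold predictions
)
stack.fit(X_tr, y_tr)

stack_accuracy = stack.score(X_te, y_te)
print(f"test accuracy: {stack_accuracy:.4f}")
```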

In [54]:
rand_pred = rand_forest.predict(X_val)
extra_pred = extra_trees.predict(X_val)
log_pred = log_reg.predict(X_val)

In [56]:
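# Build the blender's training set: one column per base classifier, holding its
# predicted label for each validation image (digit strings, since y was not cast to int).
X_val_blender = pd.concat([pd.Series(rand_pred), pd.Series(extra_pred), pd.Series(log_pred)], axis=1)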

In [58]:
X_val_blender, y_val

(      0  1  2
 0     8  8  8
 1     5  5  5
 2     3  5  3
 3     4  4  4
 4     2  2  2
 ...  .. .. ..
 9995  8  8  8
 9996  7  7  7
 9997  3  3  3
 9998  7  7  7
 9999  0  0  0
 
 [10000 rows x 3 columns],
 array(['8', '5', '5', ..., '3', '7', '0'], dtype=object))

In [60]:
log_blender = LogisticRegression(max_iter=5000)
log_blender.fit(X_val_blender, y_val)

In [61]:
rand_test = rand_forest.predict(X_test)
extra_test = extra_trees.predict(X_test)
log_test = log_reg.predict(X_test)
X_test_blender = pd.concat([pd.Series(rand_test), pd.Series(extra_test), pd.Series(log_test)], axis=1)

In [62]:
accuracy_score(y_test, log_blender.predict(X_test_blender))

0.9494
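The blender above receives predicted labels, which a logistic regression then treats as ordinary numbers even though digit classes have no meaningful order. One possible alternative, sketched here as an assumption rather than as the exercise's intended solution, is to feed the blender each classifier's class probabilities instead. The example uses the small bundled digits dataset, and `proba_features` is a hypothetical helper defined only for this sketch.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small 8x8 digits dataset bundled with scikit-learn (stand-in for MNIST).
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_small, y_small, test_size=0.5, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Base classifiers are trained on the training set only.
base_models = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    ExtraTreesClassifier(n_estimators=50, random_state=42),
    LogisticRegression(max_iter=5000),
]
for model in base_models:
    model.fit(X_tr, y_tr)


def proba_features(models, X):
    """Hypothetical helper: stack each model's class probabilities side by side."""
    return np.hstack([m.predict_proba(X) for m in models])


# The blender is trained on validation-set probabilities, then scored on the test set.
blender = LogisticRegression(max_iter=5000)
blender.fit(proba_features(base_models, X_va), y_va)

test_features = proba_features(base_models, X_te)
test_accuracy = blender.score(test_features, y_te)
print(f"blender test accuracy: {test_accuracy:.4f}")
```

With three classifiers and ten classes, each instance becomes a 30-dimensional feature vector, so the blender can see how confident each classifier was rather than only which label it picked.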