8. Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [25]:
# Loading MNIST object from Scikit-Learn

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
X, y = mnist["data"], mnist["target"]

In [26]:
# Splitting data into train and test set because data is already shuffled

X_train, X_val, X_test, y_train, y_val, y_test = X[:50000], X[50000:60000], X[60000:], y[:50000], y[50000:60000], y[60000:]

for label, dataset in zip(['X_train', 'X_val', 'X_test', 'y_train', 'y_val', 'y_test'],[X_train, X_val, X_test, y_train, y_val, y_test]):
  print(f"{label} shape: {dataset.shape}")

X_train shape: (50000, 784)
X_val shape: (10000, 784)
X_test shape: (10000, 784)
y_train shape: (50000,)
y_val shape: (10000,)
y_test shape: (10000,)


In [27]:
# Metrics imports

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [28]:
# Function to create a report 

def create_report(predictions, validation_set):
  print(confusion_matrix(predictions, validation_set))
  print('Accuracy: {0}'.format(accuracy_score(predictions, validation_set)))
  print(classification_report(predictions, validation_set))

In [29]:
# Fitting Random Forest

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train,)

rf_train_preds = rf.predict(X_train)

rf_val_preds = rf.predict(X_val)

# print(confusion_matrix(y_val, rf_val_preds))
print('Training Accuracy: {0}'.format(accuracy_score(y_train, rf_train_preds)))
print('Validation Accuracy: {0}'.format(accuracy_score(y_val, rf_val_preds)))
# print(classification_report(y_val, rf_val_preds))

Training Accuracy: 1.0
Validation Accuracy: 0.9733


In [30]:
# Fitting Extra Trees

from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier()
et.fit(X_train, y_train)

et_train_preds = et.predict(X_train)

et_val_preds = et.predict(X_val)

print('Training Accuracy: {0}'.format(accuracy_score(y_train, et_train_preds)))
print('Validation Accuracy: {0}'.format(accuracy_score(y_val, et_val_preds)))

Training Accuracy: 1.0
Validation Accuracy: 0.9754


In [31]:
# Fitting SVM

from sklearn.svm import LinearSVC

svc = LinearSVC(
  max_iter=100,
  tol=20,
  random_state=0,
)

svc.fit(X_train, y_train)

svc_train_preds = svc.predict(X_train)

svc_val_preds = svc.predict(X_val)

print('Training Accuracy: {0}'.format(accuracy_score(y_train, svc_train_preds)))
print('Validation Accuracy: {0}'.format(accuracy_score(y_val, svc_val_preds)))

Training Accuracy: 0.87838
Validation Accuracy: 0.8836


In [32]:
from sklearn.neural_network import MLPClassifier

mlp_clf = MLPClassifier(random_state=42)

mlp_clf.fit(X_train, y_train)
mlp_clf_train_preds = mlp_clf.predict(X_train)

mlp_clf_val_preds = svc.predict(X_val)

print('Training Accuracy: {0}'.format(accuracy_score(y_train, mlp_clf_train_preds)))
print('Validation Accuracy: {0}'.format(accuracy_score(y_val, mlp_clf_val_preds)))

Training Accuracy: 0.99036
Validation Accuracy: 0.8836


In [33]:
# Ensembling using voting classifier

from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier

## Creating the Voting Classifier
voting_clf = VotingClassifier(
    estimators=[
      ('rf', rf), 
      ('et', et), 
      ('mlp', mlp_clf),
      # ('svc', svc)
    ],
    voting='hard'
)

## Fitting the voting classifier
voting_clf.fit(X_train, y_train)

voting_train_preds = voting_clf.predict(X_train)

voting_val_preds = voting_clf.predict(X_val)

print('Training Accuracy: {0}'.format(accuracy_score(y_train, voting_train_preds_preds)))
print('Validation Accuracy: {0}'.format(accuracy_score(y_val, voting_val_preds)))

Training Accuracy: 1.0
Validation Accuracy: 0.9765


9. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class. Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier?