<a href="https://colab.research.google.com/github/rahiakela/machine-learning-research-and-practice/blob/main/hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/7-ensemble-learning-and-random-forests/03_ensemble_learning_exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Ensemble Learning Exercises

If you aggregate
the predictions of a group of predictors (such as classifiers or regressors), you will
often get better predictions than with the best individual predictor. A group of predictors
is called an ensemble; thus, this technique is called Ensemble Learning, and an
Ensemble Learning algorithm is called an Ensemble method.
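
One way to see why aggregation can help is to compute how often a majority vote is right. The sketch below assumes a binary task in which every classifier has the same accuracy `p` and errs independently of the others, which classifiers trained on the same data rarely satisfy. `majority_vote_accuracy` is a name made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent classifiers,
    each correct with probability p, are right (use an odd n to avoid ties)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


# Assumption: independent errors and a shared accuracy of 51%.
for n in (1, 11, 101, 501):
    print(n, round(majority_vote_accuracy(0.51, n), 3))
```

The accuracy of the vote grows with `n` only because the errors are assumed to be independent; correlated classifiers gain much less.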

Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

There are many boosting methods available, but by far the most popular are:

* AdaBoost (short for Adaptive Boosting)
* Gradient Boosting
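
To make "each trying to correct its predecessor" concrete, here is a minimal pure-Python sketch of the residual-fitting idea behind Gradient Boosting on a 1-D regression task. It is not AdaBoost (which reweights instances instead) and not scikit-learn's implementation. `fit_stump`, `boost`, the toy data, and the learning rate of 0.5 are all made up for this illustration.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of 1-D inputs under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Train stumps sequentially, each on the residual errors left so far."""
    stumps = []
    predictions = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: sum(learning_rate * s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 10 for i in range(30)]
ys = [(x - 1.5) ** 2 for x in xs]  # toy quadratic target

single_stump = boost(xs, ys, n_rounds=1, learning_rate=1.0)
boosted = boost(xs, ys)
print(mse(single_stump, xs, ys), mse(boosted, xs, ys))
```

Each stump on its own is a weak learner; the sum of many stumps, each fitted to what the previous ones got wrong, tracks the target far more closely on the training data.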

In fact, the winning solutions in Machine Learning competitions
often involve several Ensemble methods (most famously in the Netflix Prize
competition).



##Setup

In [1]:
# Common imports
import numpy as np
import os

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

In [2]:
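try:
    # Preferred path: download MNIST from OpenML as NumPy arrays
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    # fetch_openml returns the labels as strings, so cast them to integers
    mnist.target = mnist.target.astype(np.int64)
except ImportError:
    # Fallback for old scikit-learn releases; fetch_mldata was removed in 0.22
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')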

##Exercise-1: Voting Classifier

Let's load the MNIST data, and split it into a training set, a
validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation,
and 10,000 for testing). 

Then train various classifiers, such as a Random
Forest classifier, an Extra-Trees classifier, and an SVM classifier. 

Next, try to combine
them into an ensemble that outperforms each individual classifier on the
validation set, using soft or hard voting. Once you have found one, try it on the
test set. 

How much better does it perform compared to the individual classifiers?

###Step-1

_Load the MNIST data and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing)._

In [3]:
x_train_val, x_test, y_train_val, y_test = train_test_split(mnist.data, mnist.target, test_size=10000, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x_train_val, y_train_val, test_size=10000, random_state=42)
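
The two calls carve up the data in stages: the first sets aside 10,000 test instances, and the second takes 10,000 validation instances from the remainder. The sketch below reproduces that arithmetic on plain index lists. It assumes MNIST's 70,000 instances without loading them, and `random.Random(42)` is only a stand-in for scikit-learn's shuffling, so the indices differ from the ones `train_test_split` picks.

```python
import random

n_instances = 70_000  # size of mnist_784 (assumed here, not loaded)
indices = list(range(n_instances))
random.Random(42).shuffle(indices)

# First split: hold out 10,000 instances for the test set.
test_idx, train_val_idx = indices[:10_000], indices[10_000:]
# Second split: hold out 10,000 of the remainder for validation.
val_idx, train_idx = train_val_idx[:10_000], train_val_idx[10_000:]

print(len(train_idx), len(val_idx), len(test_idx))  # prints: 50000 10000 10000
```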

###Step-2

_Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM._

In [4]:
random_forest_clf = RandomForestClassifier(n_estimators=10, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=10, random_state=42)
svm_clf = LinearSVC(random_state=42)
mlp_clf = MLPClassifier(random_state=42)

In [5]:
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]

for estimator in estimators:
  print(f"Training the {estimator}")
  estimator.fit(x_train, y_train)

Training the RandomForestClassifier(n_estimators=10, random_state=42)
Training the ExtraTreesClassifier(n_estimators=10, random_state=42)
Training the LinearSVC(random_state=42)
Training the MLPClassifier(random_state=42)


In [6]:
[estimator.score(x_val, y_val) for estimator in estimators]

[0.9469, 0.9492, 0.8695, 0.9639]

The linear SVM (0.8695) lags far behind the other classifiers, which all score above 0.94 on the validation set.

However, let's keep it for now since it may still improve the voting classifier's performance.

###Step-3

_Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier._

In [7]:
named_estimators = [
  ("random_forest_clf", random_forest_clf),
  ("extra_trees_clf", extra_trees_clf),
  ("svm_clf", svm_clf),
  ("mlp_clf", mlp_clf)                 
]

In [8]:
voting_clf = VotingClassifier(named_estimators)
voting_clf.fit(x_train, y_train)



VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(n_estimators=10,
                                                     random_state=42)),
                             ('extra_trees_clf',
                              ExtraTreesClassifier(n_estimators=10,
                                                   random_state=42)),
                             ('svm_clf', LinearSVC(random_state=42)),
                             ('mlp_clf', MLPClassifier(random_state=42))])

In [9]:
voting_clf.score(x_val, y_val)

0.9624

In [10]:
[estimator.score(x_val, y_val) for estimator in voting_clf.estimators_]

[0.9469, 0.9492, 0.8695, 0.9639]
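
`VotingClassifier` defaults to `voting="hard"`, so the ensemble above predicts, for each image, the class that most of its estimators voted for. A minimal sketch of that rule follows. `hard_vote` and the toy predictions are hypothetical, and breaking ties toward the smallest label is an assumption of this sketch.

```python
from collections import Counter


def hard_vote(predictions_per_estimator):
    """Majority class for each instance.

    predictions_per_estimator: one list of predicted labels per estimator,
    all of the same length. Ties go to the smallest label (an assumption
    of this sketch).
    """
    winners = []
    for votes in zip(*predictions_per_estimator):
        counts = Counter(votes)
        top = max(counts.values())
        winners.append(min(label for label, c in counts.items() if c == top))
    return winners


# Three hypothetical estimators voting on four images.
preds = [
    [5, 8, 2, 7],
    [5, 3, 2, 1],
    [5, 8, 4, 9],
]
print(hard_vote(preds))  # prints: [5, 8, 2, 1]
```

The last image shows the weakness of hard voting: when every estimator disagrees, the result depends entirely on the tie-breaking rule.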

###Step-4

Let's remove the SVM to see if performance improves. It is possible to remove an estimator by setting it to `None` using `set_params()` (scikit-learn 0.22 deprecated `None` for this purpose in favor of the string `"drop"`), like this:

In [11]:
voting_clf.set_params(svm_clf=None)

VotingClassifier(estimators=[('random_forest_clf',
                              RandomForestClassifier(n_estimators=10,
                                                     random_state=42)),
                             ('extra_trees_clf',
                              ExtraTreesClassifier(n_estimators=10,
                                                   random_state=42)),
                             ('svm_clf', None),
                             ('mlp_clf', MLPClassifier(random_state=42))])

This updated the list of estimators:

In [12]:
voting_clf.estimators

[('random_forest_clf',
  RandomForestClassifier(n_estimators=10, random_state=42)),
 ('extra_trees_clf', ExtraTreesClassifier(n_estimators=10, random_state=42)),
 ('svm_clf', None),
 ('mlp_clf', MLPClassifier(random_state=42))]

However, it did not update the list of _trained_ estimators:

In [13]:
voting_clf.estimators_

[RandomForestClassifier(n_estimators=10, random_state=42),
 ExtraTreesClassifier(n_estimators=10, random_state=42),
 LinearSVC(random_state=42),
 MLPClassifier(random_state=42)]

So we can either fit the `VotingClassifier` again, or just remove the SVM from the list of trained estimators:

In [14]:
del voting_clf.estimators_[2]

Now let's evaluate the `VotingClassifier` again:

In [15]:
voting_clf.score(x_val, y_val)

0.9652

A bit better (0.9652 versus 0.9624): the SVM was hurting performance.
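
Refitting is the other option mentioned above. The sketch below shows it on scikit-learn's small bundled digits dataset, a stand-in chosen to keep the run short rather than MNIST. It assumes a scikit-learn release that accepts the string `"drop"` for disabling an estimator. The estimator names mirror the ones used here, but the data and scores are not the notebook's.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

voting = VotingClassifier([
    ("random_forest_clf", RandomForestClassifier(n_estimators=10, random_state=42)),
    ("extra_trees_clf", ExtraTreesClassifier(n_estimators=10, random_state=42)),
    ("svm_clf", LinearSVC(random_state=42)),
])
voting.fit(X_train, y_train)
score_with_svm = voting.score(X_val, y_val)

# Disable the SVM, then refit so that estimators_ is rebuilt without it.
voting.set_params(svm_clf="drop")
voting.fit(X_train, y_train)
score_without_svm = voting.score(X_val, y_val)

print(len(voting.estimators_), score_with_svm, score_without_svm)
```

Refitting costs more time than deleting an entry from `estimators_`, but it keeps the fitted state consistent with the `estimators` parameter.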

###Step-5

Now let's try using a soft voting classifier. 

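We do not actually need to retrain the classifier; we can just set `voting` to `"soft"`, because the three remaining estimators all implement `predict_proba()` (`LinearSVC`, which we just removed, does not):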

In [16]:
voting_clf.voting = "soft"

voting_clf.score(x_val, y_val)

0.9698

That's a clear improvement (0.9698 versus 0.9652 with hard voting), and it beats the best individual classifier on the validation set (the `MLPClassifier`, 0.9639).
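
Soft voting averages the class probabilities of the estimators and picks the class with the highest mean, so a confident estimator can outweigh two hesitant ones. The sketch below illustrates this for a single instance. `soft_vote` and the probabilities are made up for this example.

```python
def soft_vote(probas_per_estimator):
    """Class with the highest mean probability for one instance.

    probas_per_estimator: one list of class probabilities per estimator.
    Returns the winning class index and the mean probabilities.
    """
    n = len(probas_per_estimator)
    mean = [sum(col) / n for col in zip(*probas_per_estimator)]
    return max(range(len(mean)), key=mean.__getitem__), mean


# Hypothetical probabilities for classes 0 and 1 from three estimators.
probas = [
    [0.45, 0.55],
    [0.45, 0.55],
    [0.95, 0.05],
]

# Hard voting only looks at each estimator's top class.
votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
hard_winner = max(set(votes), key=votes.count)

soft_winner, mean = soft_vote(probas)
print(hard_winner, soft_winner, [round(m, 3) for m in mean])
# prints: 1 0 [0.617, 0.383]
```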

###Step-6

_Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?_

In [17]:
voting_clf.score(x_test, y_test)

0.9677

In [18]:
[estimator.score(x_test, y_test) for estimator in voting_clf.estimators_]

[0.9437, 0.9474, 0.9604]

The voting classifier reduced the error rate from about 4.0% for our best model (the `MLPClassifier`, 0.9604 accuracy) to about 3.2%.

That's about 18% fewer errors, not bad!
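
The relative improvement follows directly from the two test accuracies printed above. The short calculation below recomputes it from those figures; the variable names are chosen for this illustration.

```python
voting_accuracy = 0.9677       # soft voting classifier on the test set
best_single_accuracy = 0.9604  # MLPClassifier on the test set

voting_error = 1 - voting_accuracy
best_single_error = 1 - best_single_accuracy
relative_reduction = (best_single_error - voting_error) / best_single_error

print(f"{best_single_error:.2%} -> {voting_error:.2%}, "
      f"{relative_reduction:.1%} fewer errors")
# prints: 3.96% -> 3.23%, 18.4% fewer errors
```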

##Exercise-2: Stacking Ensemble

Let's run the individual classifiers from the previous exercise to make predictions on
the validation set, and create a new training set with the resulting predictions:
each training instance is a vector containing the set of predictions from all your
classifiers for an image, and the target is the image’s class. Train a classifier on
this new training set. 

Congratulations, you have just trained a blender, and
together with the classifiers it forms a stacking ensemble! Now evaluate the
ensemble on the test set. For each image in the test set, make predictions with all
your classifiers, then feed the predictions to the blender to get the ensemble’s predictions.

How does it compare to the voting classifier you trained earlier?

###Step-1

_Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set._

In [21]:
x_val_predictions = np.empty((len(x_val), len(estimators)), dtype=np.float32)

In [22]:
for index, estimator in enumerate(estimators):
  x_val_predictions[:, index] = estimator.predict(x_val)

In [23]:
x_val_predictions

array([[5., 5., 5., 5.],
       [8., 8., 8., 8.],
       [2., 2., 2., 2.],
       ...,
       [7., 7., 7., 7.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]], dtype=float32)

In [24]:
rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(x_val_predictions, y_val)

RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)

In [26]:
rnd_forest_blender.oob_score_

0.9629

You could fine-tune this blender or try other types of blenders (e.g., an `MLPClassifier`), then select the best one using cross-validation, as always.
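
scikit-learn 0.22 and later also provide a `StackingClassifier`, which builds the blender's training set from cross-validated predictions instead of a held-out validation set. The sketch below shows that alternative, not the approach used in this notebook. It runs on the small bundled digits dataset to stay fast, and the choice of `LogisticRegression` as blender is an assumption made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stacking = StackingClassifier(
    estimators=[
        ("random_forest_clf", RandomForestClassifier(n_estimators=10, random_state=42)),
        ("extra_trees_clf", ExtraTreesClassifier(n_estimators=10, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # the blender
    cv=5,  # out-of-fold predictions become the blender's training features
)
stacking.fit(X_train, y_train)
stacking_score = stacking.score(X_test, y_test)
print(stacking_score)
```

With the default `stack_method="auto"`, the blender receives class probabilities when an estimator offers `predict_proba`, which gives it more information than the hard class labels used in this exercise.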

###Step-2

Congratulations, you have just trained a blender, and together with the classifiers it forms a stacking ensemble!

Now let's evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. 

How does it compare to the voting classifier you trained earlier?

In [27]:
x_test_predictions = np.empty((len(x_test), len(estimators)), dtype=np.float32)

In [28]:
for index, estimator in enumerate(estimators):
  x_test_predictions[:, index] = estimator.predict(x_test)

In [29]:
y_pred = rnd_forest_blender.predict(x_test_predictions)

In [30]:
accuracy_score(y_test, y_pred)

0.9623

This stacking ensemble (0.9623 on the test set) does not perform as well as the soft voting classifier we trained earlier (0.9677); it is only marginally better than the best individual classifier (0.9604).
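
The two test-time steps (predict with every classifier, then feed those predictions to the blender) can be wrapped in a single helper. The sketch below is one way to write it. `stack_predict` and the two stand-in classes are hypothetical and only exist so that the example runs without training anything.

```python
def stack_predict(estimators, blender, X):
    """Feed each estimator's predictions for X to the blender."""
    per_estimator = [list(est.predict(X)) for est in estimators]
    # Transpose: one row per instance, one column per estimator.
    meta_features = [list(row) for row in zip(*per_estimator)]
    return blender.predict(meta_features)


class ConstantShift:
    """Stand-in estimator (hypothetical): predicts x + shift."""

    def __init__(self, shift):
        self.shift = shift

    def predict(self, X):
        return [x + self.shift for x in X]


class MaxBlender:
    """Stand-in blender (hypothetical): picks the largest base prediction."""

    def predict(self, rows):
        return [max(row) for row in rows]


result = stack_predict([ConstantShift(0), ConstantShift(2)], MaxBlender(), [1, 5])
print(result)  # prints: [3, 7]
```

With the objects from this notebook, the equivalent call would presumably be `stack_predict(estimators, rnd_forest_blender, x_test)`.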