# Ensemble Learning and Random Forests

In [1]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

# Voting Classifier
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

# Measure each classifier's accuracy
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.904
SVC 0.896
VotingClassifier 0.904


# Exercises

## 1.
The five trained models can be combined into a voting ensemble, which often performs better than any individual model. The more the models differ from each other, the more the ensemble helps, and a slight improvement is still likely even though they were all trained on the same data.
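As a rough illustration of why voting can help, the sketch below computes how often a strict majority of five binary classifiers is right. It assumes that each classifier is correct 95% of the time and that their errors are fully independent. Both the 95% figure and the independence assumption are hypothetical: models trained on the same data make correlated errors, so the real gain is usually much smaller.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct."""
    min_votes = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(min_votes, n_models + 1)
    )

# Hypothetical setting: five independent models, each right 95% of the time.
ensemble_acc = majority_vote_accuracy(n_models=5, p_correct=0.95)
print(f"{ensemble_acc:.4f}")
```

Under these idealized assumptions the majority is right more often than any single voter, which is the intuition behind combining diverse models.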

## 2.
Hard voting counts one vote per classifier and predicts the class with the most votes. Soft voting averages the class probabilities predicted by all classifiers and picks the class with the highest average, so confident predictions carry more weight than uncertain ones.
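The difference can be seen on a single instance. The probabilities below are made up for illustration: two classifiers weakly prefer class 0, while a third strongly prefers class 1.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one instance
# (rows: classifiers, columns: classes 0 and 1).
probas = np.array([
    [0.55, 0.45],  # weakly prefers class 0
    [0.60, 0.40],  # weakly prefers class 0
    [0.05, 0.95],  # strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_probas = probas.mean(axis=0)
soft_prediction = mean_probas.argmax()

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Here hard voting picks class 0 (two votes to one), while soft voting picks class 1 because the third classifier is far more confident. In scikit-learn, `voting='soft'` requires every estimator to implement `predict_proba`, which for `SVC` means setting `probability=True`.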

## 3.
Yes, a bagging ensemble can be distributed and trained in parallel to speed up the process. The same applies to pasting and Random Forests ensembles. Stacking ensembles can be only partially distributed: all the predictors from a given layer are independent of each other, but the predictors in one layer can only be trained after predictors in the previous layer have all been trained. Boosting, on the other hand, cannot take advantage of such parallelism, since each step is strictly dependent on the last.

## 4.
Out-of-bag evaluation lets us evaluate each predictor in a bagging ensemble on instances it has never seen (the ones left out of its bootstrap sample), without needing a separate validation set. This leaves more data available for training, which should help the ensemble perform slightly better.
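A minimal sketch of out-of-bag evaluation with scikit-learn's `BaggingClassifier`, reusing the moons dataset from the start of this notebook. The variable names `X_moons`, `y_moons` and `bag_clf`, and the choice of 100 trees, are only for this illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_moons, y_moons = make_moons(n_samples=500, noise=0.3, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,   # sample with replacement, so some instances are left out
    oob_score=True,   # score each instance using only predictors that never saw it
    random_state=42,
)
bag_clf.fit(X_moons, y_moons)  # no separate validation split is needed

print(bag_clf.oob_score_)
```

The `oob_score_` attribute then serves as an estimate of generalization accuracy, obtained without holding any data back from training.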

## 5.
Extra-Trees consider only a random subset of features for splitting at each node, just like Random Forests, but rather than searching for the best threshold for each feature (as regular Decision Trees do), they use random thresholds. This speeds up training significantly, since the search for optimal thresholds is one of the most time-consuming parts of training a Decision Tree. Predictions, however, are neither faster nor slower than with regular Decision Trees.
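To make the cost difference concrete, here is a simplified sketch for a single feature. It is not scikit-learn's implementation: the helper names (`gini`, `split_cost`, `best_threshold`, `random_threshold`) are hypothetical, and drawing the threshold uniformly within the feature's range is an assumption made for this illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of integer class labels."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    return 1.0 - np.sum(p ** 2)

def split_cost(feature, labels, threshold):
    """Weighted Gini impurity of the two children produced by a threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)

def best_threshold(feature, labels):
    """Exhaustive search over midpoints, as a regular Decision Tree would do."""
    values = np.unique(feature)
    candidates = (values[:-1] + values[1:]) / 2
    costs = [split_cost(feature, labels, t) for t in candidates]
    return candidates[int(np.argmin(costs))]

def random_threshold(feature, rng):
    """Extra-Trees style: draw one threshold at random within the feature range."""
    return rng.uniform(feature.min(), feature.max())

rng = np.random.default_rng(42)
feature = rng.normal(size=200)
labels = (feature > 0.3).astype(int)  # toy labels, separable at 0.3

t_best = best_threshold(feature, labels)   # evaluates every candidate split
t_rand = random_threshold(feature, rng)    # evaluates none
```

The exhaustive search evaluates one candidate per distinct feature value, while the random draw costs a single call. The random split is usually worse for an individual tree, which the ensemble compensates for by averaging many trees.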

## 6.
If an AdaBoost ensemble is underfitting the data, adding more estimators might help. Another viable option is to reduce the regularization of the base estimator (for example, by allowing deeper trees). Increasing the learning rate could help as well.
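These three adjustments map directly onto `AdaBoostClassifier` arguments. The sketch below contrasts a deliberately constrained ensemble with a less constrained one on the moons dataset; all hyperparameter values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X_moons, y_moons = make_moons(n_samples=500, noise=0.3, random_state=42)

# A deliberately constrained ensemble that is likely to underfit.
weak_ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # heavily regularized base estimator
    n_estimators=5,
    learning_rate=0.1,
    random_state=42,
)
weak_ada.fit(X_moons, y_moons)

# The same model with the three adjustments applied.
tuned_ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),  # less regularized base estimator
    n_estimators=200,                     # more estimators
    learning_rate=1.0,                    # higher learning rate
    random_state=42,
)
tuned_ada.fit(X_moons, y_moons)

weak_train_acc = weak_ada.score(X_moons, y_moons)
tuned_train_acc = tuned_ada.score(X_moons, y_moons)
print(weak_train_acc, tuned_train_acc)
```

Training accuracy is compared here only to show the reduction in underfitting. Whether the less constrained model generalizes better still has to be checked on held-out data.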

## 7.
If a Gradient Boosting ensemble is overfitting the data, the learning rate should probably be decreased. Early stopping is often a good way to find the right number of predictors (overfitting often indicates that there are too many of them).
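One way to apply both ideas is the built-in early stopping of `GradientBoostingClassifier`, sketched below on the moons dataset. The specific values (500 trees at most, a learning rate of 0.05, a patience of 10 rounds) are assumptions chosen for this example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier

X_moons, y_moons = make_moons(n_samples=500, noise=0.3, random_state=42)

gb_clf = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.05,       # smaller steps than the default of 0.1
    validation_fraction=0.2,  # held out internally to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb_clf.fit(X_moons, y_moons)

# Number of trees actually kept after early stopping.
print(gb_clf.n_estimators_)
```

With `n_iter_no_change` set, training stops once the validation score has stopped improving, so `n_estimators` acts as a ceiling rather than a fixed count.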

## 8.

In [2]:
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

X = X.astype(np.uint8)
y = y.astype(np.uint8)

X_train = X[:60000]
y_train = y[:60000]
X_test = X[60000:]
y_test = y[60000:]

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=10000, random_state=42)

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf_clf = RandomForestClassifier(n_jobs=-1, random_state=42)
rf_clf.fit(X_train, y_train)
np.mean(cross_val_score(rf_clf, X_train, y_train, cv=3))

0.9633599935705593

In [4]:
from sklearn.ensemble import ExtraTreesClassifier

et_clf = ExtraTreesClassifier(n_jobs=-1, random_state=42)
et_clf.fit(X_train, y_train)
np.mean(cross_val_score(et_clf, X_train, y_train, cv=3))

0.9673799971738474

In [5]:
from sklearn.neural_network import MLPClassifier

mlp_clf = MLPClassifier(random_state=42)
mlp_clf.fit(X_train, y_train)
np.mean(cross_val_score(mlp_clf, X_train, y_train, cv=3))

0.9530600015624793

Let's evaluate the trained models on the validation set:

In [6]:
[estimator.score(X_val, y_val) for estimator in (rf_clf, et_clf, mlp_clf)]

[0.9683, 0.9718, 0.9668]

We can now combine the estimators into an ensemble to see if we can make it outperform them all:

In [7]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier([('random_forest', rf_clf), ('extra_trees', et_clf), ('mlp', mlp_clf)])
voting_clf.fit(X_train, y_train)
voting_clf.score(X_val, y_val)

0.9733

We can see an improvement in validation set performance! Finally, we can compare the performance of the voting classifier on the test set with that of the individual estimators:

In [8]:
voting_clf.score(X_test, y_test)

0.9721

In [9]:
[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

[0.9695, 0.9698, 0.9613]

We can see that the ensemble method actually performed better on the test set as well, although just slightly.

## 9.

In [10]:
X_val_stacking = np.zeros((len(X_val), len(voting_clf.estimators_)), dtype=np.float32)

for idx, estimator in enumerate(voting_clf.estimators_):
    X_val_stacking[:, idx] = estimator.predict(X_val)
    
X_val_stacking

array([[7., 7., 7.],
       [3., 3., 3.],
       [8., 8., 8.],
       ...,
       [9., 9., 9.],
       [8., 8., 8.],
       [2., 3., 1.]], dtype=float32)

In [11]:
rf_blender = RandomForestClassifier(oob_score=True, random_state=42)
rf_blender.fit(X_val_stacking, y_val)
rf_blender.oob_score_

0.9697

In [12]:
X_test_stacking = np.zeros((len(X_test), len(voting_clf.estimators_)), dtype=np.float32)

for idx, estimator in enumerate(voting_clf.estimators_):
    X_test_stacking[:, idx] = estimator.predict(X_test)
    
rf_blender.score(X_test_stacking, y_test)

0.9706

In this case, stacking did not help: the blender scores 0.9706 on the test set, slightly below the 0.9721 of the voting classifier from the previous exercise.
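As an alternative to building the blender by hand, scikit-learn provides `StackingClassifier`, which trains the final estimator on cross-validated predictions of the base models instead of a separate hold-out set. The sketch below uses the small moons dataset so that it runs quickly. It is not a rerun of the MNIST experiment above, and the choice of base models, blender and `cv=3` is only an assumption for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_moons, y_moons = make_moons(n_samples=500, noise=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_moons, y_moons, random_state=42)

stack_clf = StackingClassifier(
    estimators=[
        ("random_forest", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("extra_trees", ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # the blender
    cv=3,  # blender is trained on out-of-fold predictions of the base models
)
stack_clf.fit(X_tr, y_tr)

stack_test_acc = stack_clf.score(X_te, y_te)
print(stack_test_acc)
```

Unlike the manual approach above, which feeds predicted class labels to the blender, `StackingClassifier` uses `predict_proba` outputs by default when the base models provide them, which gives the blender more information to work with.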