### Ensemble Learning and Random Forests

*Wisdom of the crowd* idea: if we aggregate the predictions of a group of predictors, we often get better predictions than with the best individual predictor.

#### 1. Voting Classifiers

**Hard voting** classifier: aggregate the predictions of each classifier and predict the class that gets the most votes. 

**!** Ensemble methods work best when the predictors are as independent of one another as possible. One way to achieve this is to train the classifiers with different algorithms, which makes it more likely that they will make different types of errors and so improves the ensemble's accuracy.
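
To see why independence matters, here is a small sketch under an idealized assumption: 1,001 classifiers that are each correct 51% of the time and whose errors are fully independent. These numbers are illustrative and are not taken from the cells in this notebook. The probability that the majority vote is correct is then a binomial tail:

```python
from math import exp, lgamma, log

def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that more than half of n independent classifiers are right."""
    n, p = n_classifiers, p_correct
    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        # log of the binomial pmf, computed with lgamma to avoid overflow
        log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log(p) + (n - k) * log(1 - p))
        total += exp(log_pmf)
    return total

acc_single = 0.51
acc_ensemble = majority_vote_accuracy(1001, acc_single)
print(f"single: {acc_single:.2f}, majority of 1001: {acc_ensemble:.2f}")
```

Real classifiers trained on the same data make correlated errors, so the gain is smaller in practice.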

In [1]:
# Use moons dataset 
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples = 100, noise = 0.15)

In [2]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.3, # This can be changed, though it makes sense to use 25-30% of the data for test
        random_state=1996
    )

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC 

In [4]:
# Instantiate individual classifiers
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

In [5]:
voting_clf = VotingClassifier(
    estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting = 'hard'
)
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

In [6]:
from sklearn.metrics import accuracy_score 

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9
RandomForestClassifier 1.0
SVC 0.9666666666666667
VotingClassifier 0.9666666666666667


If all classifiers can estimate class probabilities (i.e. they have a predict_proba() method), the ensemble can predict the class with the highest class probability, averaged over all the individual classifiers. This is called **soft voting**.
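
The difference between the two voting schemes can be sketched by hand. The probabilities below are made up for illustration (they are not outputs of the classifiers in this notebook): two classifiers lean weakly towards class 0, while one is very confident in class 1.

```python
import numpy as np

# Hypothetical predict_proba() outputs of three classifiers for one instance
# (columns: P(class 0), P(class 1)); the numbers are invented for illustration
probas = np.array([
    [0.55, 0.45],
    [0.60, 0.40],
    [0.10, 0.90],
])

# Hard voting: each classifier votes for its most likely class, majority wins
hard_votes = probas.argmax(axis=1)
hard_prediction = np.bincount(hard_votes).argmax()

# Soft voting: average the probabilities, then pick the most likely class
soft_prediction = probas.mean(axis=0).argmax()

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Soft voting gives more weight to highly confident votes, which is why the two schemes can disagree.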

In [7]:
# Instantiate individual classifiers
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability = True) # probability is False by default for SVC; True enables predict_proba()

In [8]:
voting_clf = VotingClassifier(
    estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting = 'soft'
)
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()),
                             ('svc', SVC(probability=True))],
                 voting='soft')

In [9]:
from sklearn.metrics import accuracy_score 

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9
RandomForestClassifier 1.0
SVC 0.9666666666666667
VotingClassifier 0.9666666666666667


#### 2. Bagging and Pasting

Instead of using different training algorithms, we can use the same training algorithm for every predictor but train each one on a different random subset of the training set.

 * Sampling performed **with replacement** --> **BAGGING** (short for bootstrap aggregating)
 * Sampling performed **without replacement** --> **PASTING**

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is usually:

 * **statistical mode** (most frequent prediction, like a hard voting classifier) for classification
 * **average** for regression 
 
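Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. The net result is generally an ensemble with a similar bias but a lower variance than a single predictor trained on the original training set.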
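
A minimal sketch of the two sampling schemes, using NumPy index sampling only (no model is trained). The sizes mirror the 70 training instances and max_samples = 50 used in this notebook; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 70            # training set size (100 samples, 30% held out for testing)
max_samples = 50  # instances drawn for each predictor

# Bagging: the same instance can be drawn several times
bagging_idx = rng.choice(m, size=max_samples, replace=True)
# Pasting: every drawn instance is distinct
pasting_idx = rng.choice(m, size=max_samples, replace=False)

print("unique instances, bagging:", len(np.unique(bagging_idx)))
print("unique instances, pasting:", len(np.unique(pasting_idx)))
```

In BaggingClassifier, setting bootstrap = False switches from bagging to pasting.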

In [10]:
from sklearn.ensemble import BaggingClassifier 
from sklearn.tree import DecisionTreeClassifier 

In [11]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators = 500, max_samples = 50, bootstrap = True, n_jobs = -1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

#### Out-of-bag evaluation 

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. The training instances that are not sampled are called **out-of-bag** (oob) instances. Since a predictor never sees its oob instances during training, a bagging ensemble can be evaluated on them without the need for a separate validation set.
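
As a rough check on how many instances end up out-of-bag: if a predictor draws as many instances as the training set holds, with replacement, about $1/e \approx 37\%$ of the instances are never drawn. The sketch below simulates this with NumPy. The size m is an arbitrary illustrative value; the cells in this notebook draw only 50 of 70 instances per predictor, so their oob fraction is larger.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000  # illustrative training set size

# Draw m indices with replacement, as one bootstrap sample would
sampled = rng.integers(0, m, size=m)

# Mark every instance that was never drawn as out-of-bag
oob_mask = np.ones(m, dtype=bool)
oob_mask[sampled] = False
oob_fraction = oob_mask.mean()

print(f"out-of-bag fraction: {oob_fraction:.3f} (1/e = {np.exp(-1):.3f})")
```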

In [12]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators = 500, max_samples = 50, bootstrap = True, n_jobs = -1,
    oob_score = True # request an automatic out-of-bag evaluation after training
)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.9285714285714286

In [13]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

1.0

#### Random Patches and Random Subspaces 

The BaggingClassifier class also supports sampling the features, controlled by two hyperparameters:

 * max_features 
 * bootstrap_features
 

This technique is very useful when dealing with high-dimensional inputs (e.g. images). Sampling both training instances and features is called the **Random Patches** method.

Keeping all training instances (by setting bootstrap = False and max_samples = 1.0) but sampling features is instead called the **Random Subspaces** method. 
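
Both methods can be sketched with BaggingClassifier. The synthetic 20-feature dataset and the specific values (0.7, 0.5, 50 estimators) are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 20-feature dataset, used only for illustration
X_hd, y_hd = make_classification(n_samples=300, n_features=20, random_state=0)

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=0)
patches_clf.fit(X_hd, y_hd)

# Random Subspaces: keep all training instances, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=True,
    random_state=0)
subspaces_clf.fit(X_hd, y_hd)

# Number of feature indices drawn for the first tree (half of the 20 features)
print(len(subspaces_clf.estimators_features_[0]))
```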

#### 3. Random Forests

A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method with max_samples set to the size of the training set. It has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown) plus all the hyperparameters of a BaggingClassifier (to control the ensemble itself).

 * Introduces extra randomness when growing trees by searching for the best feature among a random subset of features when splitting a node
 * Greater tree diversity --> trades a higher bias for a lower variance 
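
To make the link with bagging concrete, the sketch below builds a rough equivalent of a Random Forest from a BaggingClassifier whose trees consider a random subset of features at each split. It assumes a fresh moons dataset with a fixed seed, and the two models are only approximately equivalent, not identical.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_m, y_m = make_moons(n_samples=500, noise=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, test_size=0.3, random_state=0)

# Bagging of trees that pick the best split among a random subset of features
bag_of_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=200, max_samples=1.0, bootstrap=True, random_state=0)
bag_of_trees.fit(X_tr, y_tr)

# The dedicated class, with matching tree and ensemble settings
forest = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=0)
forest.fit(X_tr, y_tr)

agreement = (bag_of_trees.predict(X_te) == forest.predict(X_te)).mean()
print(f"fraction of identical test predictions: {agreement:.2f}")
```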

In [14]:
rnd_clf = RandomForestClassifier(n_estimators = 500, max_leaf_nodes = 16, n_jobs = -1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

#### Feature importance 

A feature's importance is measured by how much the tree nodes that use that feature reduce impurity, averaged across all trees in the forest. More precisely, it is a weighted average, where each node's weight is the number of training samples associated with it. Scikit-learn computes this score automatically for each feature after training, then scales the results so that the importances sum to 1.

In [15]:
from sklearn.datasets import load_iris 

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators = 500, n_jobs = -1)
rnd_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.08700477823573695
sepal width (cm) 0.024017209601911817
petal length (cm) 0.4517276004598349
petal width (cm) 0.4372504117025165
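
As a sanity check on the normalization, the sketch below verifies that the importances sum to 1 and compares the forest's scores with the plain average of the per-tree scores. It assumes that each tree in estimators_ exposes its own normalized feature_importances_; the seed and number of trees are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris["data"], iris["target"])

# Importances of each individual tree, one row per tree
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print("sum of forest importances:", forest.feature_importances_.sum())
print("max |forest - mean of trees|:",
      np.abs(forest.feature_importances_ - averaged).max())
```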


#### 4. Boosting

Boosting refers to any Ensemble method that can combine several weak learners into a strong learner. The idea is to train predictors sequentially, each trying to correct its predecessor. 

#### 4a. AdaBoost (Adaptive Boosting)

A new predictor corrects its predecessor by paying more attention to the training instances that the predecessor underfitted, so new predictors focus more and more on the hard cases.

 1. Train a first base classifier 
 2. Use the base classifier to make predictions on the training set 
 3. Increase the relative weight of misclassified training instances 
 4. Train a second classifier using the updated weights
 5. Repeat steps 2 to 4 for each subsequent classifier 
 

Once all predictors are trained, the ensemble makes predictions like bagging or pasting, but in AdaBoost the predictors have different weights depending on their overall accuracy on the weighted training set. 

**! Drawback**: AdaBoost cannot be parallelized since each predictor can only be trained after the previous predictor has been trained and evaluated. 

 * Each instance weight $w^{(i)}$ is initially set to $\frac{1}{m}$
 * A first predictor is trained, and its weighted error rate $r_{1}$ is computed on the training set 
 

**Weighted error rate of the j-th predictor**

$r_{j} = \frac{\sum_{i = 1,\ \hat{y}_{j}^{(i)} \neq y^{(i)}}^{m} w^{(i)}}{\sum_{i = 1}^{m} w^{(i)}}$

 * $\hat{y}_{j}^{(i)}$ = j-th predictor's prediction for the i-th instance 


**Predictor weight**

$\alpha_{j} = \eta \log \frac{1 - r_{j}}{r_{j}}$

 * $\eta$ = learning rate hyperparameter
 * The more accurate the predictor is, the higher its weight 
 * Next, update the instance weights, which boosts the weights of the misclassified instances 


**Weight update rule**

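$w^{(i)} \leftarrow \begin{cases} w^{(i)} & \text{if } \hat{y}_{j}^{(i)} = y^{(i)} \\ w^{(i)} \exp(\alpha_{j}) & \text{if } \hat{y}_{j}^{(i)} \neq y^{(i)} \end{cases}$ for $i = 1, \dots, m$

All instance weights are then normalized (divided by $\sum_{i = 1}^{m} w^{(i)}$), and a new predictor is trained using the updated weights. The algorithm stops when the desired number of predictors is reached or when a perfect predictor is found. 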

**AdaBoost predictions** 

$\hat{y}(x) = \underset{k}{\operatorname{argmax}} \sum_{j = 1,\ \hat{y}_{j}(x) = k}^{N} \alpha_{j}$

 * N = number of predictors 
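
The formulas above can be sketched from scratch as follows. This is a minimal illustration and not sklearn's AdaBoostClassifier implementation: it uses decision stumps fitted with sample_weight, and the dataset seed, eta and n_predictors values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_a, y_a = make_moons(n_samples=200, noise=0.15, random_state=0)
m = len(X_a)
eta = 0.5            # learning rate (illustrative value)
n_predictors = 100   # desired number of predictors (illustrative value)

w = np.full(m, 1 / m)  # each instance weight starts at 1/m
stumps, alphas = [], []

for _ in range(n_predictors):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_a, y_a, sample_weight=w)
    wrong = stump.predict(X_a) != y_a

    r = w[wrong].sum() / w.sum()       # weighted error rate r_j
    if r == 0:                         # perfect predictor found: stop
        stumps.append(stump)
        alphas.append(1.0)
        break
    alpha = eta * np.log((1 - r) / r)  # predictor weight alpha_j

    w[wrong] *= np.exp(alpha)          # boost misclassified instances
    w /= w.sum()                       # normalize the instance weights

    stumps.append(stump)
    alphas.append(alpha)

def ada_predict(X_new):
    """Predict the class k that collects the largest sum of predictor weights."""
    classes = np.unique(y_a)
    votes = np.zeros((len(X_new), len(classes)))
    for stump, alpha in zip(stumps, alphas):
        pred = stump.predict(X_new)
        for k, c in enumerate(classes):
            votes[pred == c, k] += alpha
    return classes[votes.argmax(axis=1)]

train_accuracy = (ada_predict(X_a) == y_a).mean()
first_stump_accuracy = (stumps[0].predict(X_a) == y_a).mean()
print("first stump:", first_stump_accuracy, "ensemble:", train_accuracy)
```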

In [16]:
from sklearn.ensemble import AdaBoostClassifier 

# Note: recent scikit-learn releases deprecate algorithm = 'SAMME.R' in favor of 'SAMME'
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth = 1), n_estimators = 200, 
    algorithm = 'SAMME.R', learning_rate = 0.5)
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.5, n_estimators=200)

By default, sklearn's AdaBoostClassifier uses **Decision Stumps** as base estimators, i.e. Decision Trees with max_depth = 1. 

#### 4b. Gradient Boosting

Like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. 

**!** Instead of tweaking the instance weights at every iteration, it tries to fit the new predictor to the residual errors made by the previous predictor. 


In [17]:
from sklearn.tree import DecisionTreeRegressor 

tree_reg1 = DecisionTreeRegressor(max_depth = 2)
tree_reg1.fit(X, y)

DecisionTreeRegressor(max_depth=2)

In [18]:
# Train the second DecisionTreeRegressor on the residual errors made by the first predictor
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth = 2)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(max_depth=2)

In [19]:
# Train the third DecisionTreeRegressor on the residual errors made by the second predictor
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth = 2)
tree_reg3.fit(X, y3)

DecisionTreeRegressor(max_depth=2)

In [None]:
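# Make predictions on new instances by adding up the predictions of all the trees 
# X_new is not defined in this notebook: it stands for an array of new instances with the same features as X
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))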

In [23]:
# Automatically done by sklearn with the class GradientBoostingRegressor 

from sklearn.ensemble import GradientBoostingRegressor 

gbrt = GradientBoostingRegressor(max_depth = 2, n_estimators = 3, learning_rate = 1.0)
gbrt.fit(X, y)

GradientBoostingRegressor(learning_rate=1.0, max_depth=2, n_estimators=3)

**Shrinkage Regularization**

The learning_rate hyperparameter scales the contribution of each tree: with a low value (e.g. 0.1), we need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. 
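
A small sketch of this trade-off, measuring only the training error. The noisy quadratic dataset is generated for this illustration and is not part of the notebook; the seeds and sizes are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Noisy quadratic data, generated only for this illustration
rng = np.random.default_rng(42)
X_q = rng.uniform(-0.5, 0.5, size=(200, 1))
y_q = 3 * X_q[:, 0] ** 2 + 0.05 * rng.normal(size=200)

train_mse = {}
for learning_rate, n_estimators in [(1.0, 3), (0.1, 3), (0.1, 100)]:
    model = GradientBoostingRegressor(
        max_depth=2, n_estimators=n_estimators,
        learning_rate=learning_rate, random_state=42)
    model.fit(X_q, y_q)
    train_mse[(learning_rate, n_estimators)] = mean_squared_error(
        y_q, model.predict(X_q))

# Keys are (learning_rate, n_estimators) pairs
for key, mse in train_mse.items():
    print(key, round(mse, 5))
```

With only 3 trees, the low learning rate leaves the training set underfitted; adding trees closes the gap.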

**Early stopping** 

To find the optimal number of trees, we can use early stopping. One simple way to implement it is the staged_predict() method, which returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, and so on). 

In [26]:
import numpy as np 
from sklearn.metrics import mean_squared_error 

gbrt = GradientBoostingRegressor(max_depth = 2, n_estimators = 120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_test, y_pred) for y_pred in gbrt.staged_predict(X_test)]
best_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth = 2, n_estimators = best_n_estimators)
gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(max_depth=2, n_estimators=118)
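
An alternative sketch that stops training early instead of training all the trees and looking back: GradientBoostingRegressor can hold out part of the training data internally and stop once the validation score stops improving. The dataset and the hyperparameter values below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Noisy quadratic data, generated only for this illustration
rng = np.random.default_rng(42)
X_q = rng.uniform(-0.5, 0.5, size=(300, 1))
y_q = 3 * X_q[:, 0] ** 2 + 0.05 * rng.normal(size=300)

gbrt_es = GradientBoostingRegressor(
    max_depth=2, n_estimators=500, learning_rate=0.1,
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    random_state=42)
gbrt_es.fit(X_q, y_q)

print("trees actually trained:", gbrt_es.n_estimators_)
```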

**Stochastic Gradient Boosting** 

Specify the fraction of training instances used to train each tree with the subsample hyperparameter (e.g. subsample = 0.25 trains each tree on a randomly selected 25% of the training instances). This trades a higher bias for a lower variance and speeds up training. 
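
A minimal sketch of the subsample hyperparameter, again on a synthetic dataset generated only for this illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Noisy quadratic data, generated only for this illustration
rng = np.random.default_rng(0)
X_q = rng.uniform(-0.5, 0.5, size=(300, 1))
y_q = 3 * X_q[:, 0] ** 2 + 0.05 * rng.normal(size=300)

# Each tree is trained on a random 25% of the training instances
sgb = GradientBoostingRegressor(
    max_depth=2, n_estimators=100, learning_rate=0.1,
    subsample=0.25, random_state=0)
sgb.fit(X_q, y_q)

# With subsample < 1.0, sklearn also records the out-of-bag loss
# improvement at each stage
print(sgb.oob_improvement_[:3])
```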

**!** An optimized implementation of Gradient Boosting is available in the Python library **XGBoost**

In [30]:
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_test)

In [33]:
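# Early stopping: training stops once the validation RMSE has not improved for 2 consecutive rounds
# (newer XGBoost releases take early_stopping_rounds as an XGBRegressor constructor argument instead)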
xgb_reg.fit(X_train, y_train, eval_set = [(X_test, y_test)], early_stopping_rounds = 2)
y_pred = xgb_reg.predict(X_test)

[0]	validation_0-rmse:0.36873
[1]	validation_0-rmse:0.26813
[2]	validation_0-rmse:0.19520
[3]	validation_0-rmse:0.14229
[4]	validation_0-rmse:0.10389
[5]	validation_0-rmse:0.07617
[6]	validation_0-rmse:0.05635
[7]	validation_0-rmse:0.04064
[8]	validation_0-rmse:0.02908
[9]	validation_0-rmse:0.02178
[10]	validation_0-rmse:0.01739
[11]	validation_0-rmse:0.01309
[12]	validation_0-rmse:0.00992
[13]	validation_0-rmse:0.00757
[14]	validation_0-rmse:0.00583
[15]	validation_0-rmse:0.00490
[16]	validation_0-rmse:0.00432
[17]	validation_0-rmse:0.00399
[18]	validation_0-rmse:0.00378
[19]	validation_0-rmse:0.00380


#### 5. Stacking

Instead of using simple functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, we train a model to perform this aggregation. Such a model is called a blender or meta-learner: it takes the predictions output by the predictors as inputs and makes the final prediction. 

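 * To train the blender we can use a hold-out set: the base predictors are trained on one part of the training set, and their predictions on the held-out part become the blender's training inputs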

Open source implementation: DESlib 
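
The hold-out approach can be sketched by hand with sklearn estimators. The choice of base models, the LogisticRegression blender, the split sizes and the helper name blender_inputs are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_m, y_m = make_moons(n_samples=600, noise=0.15, random_state=0)
X_rest, X_te, y_rest, y_te = train_test_split(
    X_m, y_m, test_size=0.25, random_state=0)
# Split the training data: one half for the base predictors, one half held out
X_base, X_hold, y_base, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    SVC(probability=True, random_state=0),
]
for model in base_models:
    model.fit(X_base, y_base)

def blender_inputs(X_any):
    """Stack each base model's P(class 1) as one input column (hypothetical helper)."""
    return np.column_stack(
        [model.predict_proba(X_any)[:, 1] for model in base_models])

# The blender learns from predictions on instances the base models never saw
blender = LogisticRegression()
blender.fit(blender_inputs(X_hold), y_hold)

stack_accuracy = accuracy_score(y_te, blender.predict(blender_inputs(X_te)))
print("stacking accuracy:", stack_accuracy)
```

sklearn.ensemble.StackingClassifier offers a built-in variant that trains the blender on cross-validated predictions rather than on a single hold-out set.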

#### End of notebook