# Ensemble Learning and Random Forest

Let's say you pose a complex question to thousands of random people and then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert's answer. This is called the "wisdom of the crowd", and it is the concept upon which Ensemble Learning is built. In ensemble learning, we aggregate the predictions of a group of predictors (e.g. classifiers or regressors); this aggregated prediction will often be better than that of the best individual predictor. A group of predictors is called an **Ensemble**, the technique is called **Ensemble Learning**, and an Ensemble Learning algorithm is called an **Ensemble method**.

We saw this method in action in the chapter on Decision Trees. In the assignment section we built our own implementation of a Random Forest Classifier. We had trained a group of Decision Tree Classifiers on subsets of the training data, we then made predictions on the test data, and we aggregated by taking the prediction that was the most frequent. Such an ensemble of Decision Trees is called a **Random Forest**. Despite the simplicity of such a method, this algorithm is one of the most powerful machine learning algorithms. In this section we will be discussing the most popular ensemble methods in detail such as bagging, boosting and stacking. We will also discuss the Random Forest Algorithm in greater detail.

This chapter can be divided into the following sections:
1. Voting Classifiers
2. Bagging and Pasting
3. Random Forests
4. Boosting
5. Stacking

## 1. Voting Classifiers

Let's take a scenario where you have trained a few classifiers, each achieving about 80% accuracy. For example, they could be a LogisticRegression classifier, an SVM classifier, a RandomForest classifier, a K-Nearest Neighbors classifier and perhaps a few more.

<center>
<br>
<img src="https://miro.medium.com/max/962/1*Z6M1YaQ2k-0clMBpLEruOg.png" height="250">
<br>
</center>

We could create an even better classifier by combining these classifiers into an ensemble. We would do this by aggregating the predictions of the classifiers and predicting the class that gets the most votes. This type of aggregation (a majority-vote classifier) is called a **hard voting classifier**.

<center>
<br>
<img src="https://cdn-images-1.medium.com/max/1000/0*c0Eg6-UArkslgviw.png" height="300">
<br>
</center>

A **soft voting classifier**, on the other hand, averages the class probabilities estimated by the individual classifiers for an instance and then predicts the class with the highest average probability. (This is possible only if all the classifiers can estimate class probabilities, e.g. LogisticRegression, which has a predict_proba() method.)

A voting classifier often achieves a higher accuracy than the best classifier in the ensemble. This is true even if the ensemble is made up of weak learners (classifiers that do only slightly better than random guessing, e.g. a classifier with an accuracy of 51%). The ensemble can still be a strong learner, provided there is a sufficient number of weak learners and they are sufficiently diverse.

To understand how this is possible, we will use the example of a biased coin flip. Suppose you have a slightly biased coin that has a 51% chance of coming up heads and a 49% chance of coming up tails. If this biased coin is tossed 1,000 times, it will come up heads roughly 510 times and tails roughly 490 times, and hence give a majority of heads. If you do the math, you will find that the probability of obtaining a majority of heads after 1,000 tosses is close to 75%. The more you toss the coin, the higher the probability (with 10,000 tosses, it goes up to about 97%). This is due to the **law of large numbers**: as you keep tossing the coin, the proportion of heads gets closer and closer to the probability of heads (51%). The image below shows that as the number of tosses increases, the proportion of heads gets closer to 51%. By the end it is consistently above 50%.

<center>
<br>
<img src="https://miro.medium.com/max/1400/1*t5hIwNudkP1dQ5yk3NQBYw@2x.png" height="300">
<br>
</center>
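
The coin-toss probabilities quoted above can be checked numerically. The following is a minimal sketch using only the standard library; `prob_majority` is a hypothetical helper name introduced here, and it counts strict majorities only (a tie does not count as a majority of heads), so the result can differ slightly from figures that treat ties differently.

```python
import math

def prob_majority(n_tosses, p_heads):
    """Probability of strictly more than n_tosses / 2 heads (binomial tail)."""
    total = 0.0
    for k in range(n_tosses // 2 + 1, n_tosses + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k), kept in log space to avoid overflow
        log_pmf = (math.lgamma(n_tosses + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_tosses - k + 1)
                   + k * math.log(p_heads)
                   + (n_tosses - k) * math.log(1 - p_heads))
        total += math.exp(log_pmf)
    return total

for n in (10, 1_000, 10_000):
    print(f"{n:>6} tosses: P(majority heads) = {prob_majority(n, 0.51):.3f}")
```

The probability of a majority grows with the number of tosses, which is the same effect a voting ensemble relies on when its members make independent errors.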

For example, if you build an ensemble containing 1,000 classifiers that are individually correct only 51% of the time (barely better than random guessing) and you predict the majority-voted class, you might end up with close to 75% accuracy. However, this is only true if all the classifiers are perfectly independent and make uncorrelated errors, which will not be the case since they are trained on the same data. They are likely to make the same types of errors, so there will be many majority votes for the wrong class, reducing the ensemble's accuracy.

Note: Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse predictors is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble's accuracy. We can try this method on our Titanic assignment.

Given below is an implementation that trains a voting classifier in sklearn, composed of three diverse classifiers. They will be trained on the moons dataset. Earlier, in the chapter on Decision Trees, we made our own implementation of a voting classifier. We will now use the sklearn implementation.

In [2]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

# voting hard as we are using majority vote
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting='hard')

voting_clf.fit(X_train, y_train)

In [4]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_preds = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_preds))

LogisticRegression 0.836
RandomForestClassifier 0.86
SVC 0.86
VotingClassifier 0.868


Remember that the RandomForestClassifier is itself an ensemble: it is an ensemble of DecisionTreeClassifiers.

If all classifiers are able to estimate class probabilities, you can tell sklearn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called **soft voting**. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. To use soft voting, you simply need to replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities. This is not the case for a default SVC, so you will need to set its probability hyperparameter to True (this will make the SVC class estimate class probabilities, slowing down training, and it will add a predict_proba() method).

Let's implement this as follows.

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)

# voting soft: average the predicted class probabilities
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting='soft')

voting_clf.fit(X_train, y_train)

In [6]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_preds = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_preds))

LogisticRegression 0.836
RandomForestClassifier 0.852
SVC 0.86
VotingClassifier 0.876


As you can see the performance of the ensemble has slightly improved simply by using soft voting.
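
To see why hard and soft voting can disagree, here is a small standalone sketch. The probability values are made up for illustration, and `hard_vote` and `soft_vote` are hypothetical helper names, not part of sklearn. Two classifiers weakly prefer class 1 while a third strongly prefers class 0.

```python
# Hypothetical class-probability estimates from three classifiers for ONE
# instance (columns: class 0, class 1). The numbers are illustrative only.
probas = [
    [0.45, 0.55],  # classifier A: weakly prefers class 1
    [0.45, 0.55],  # classifier B: weakly prefers class 1
    [0.90, 0.10],  # classifier C: strongly prefers class 0
]

def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return max(set(votes), key=votes.count)

def soft_vote(probas):
    """Average the class probabilities, then pick the most likely class."""
    n_classes = len(probas[0])
    avg = [sum(p[k] for p in probas) / len(probas) for k in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

print("hard vote:", hard_vote(probas))  # class 1 wins two votes to one
print("soft vote:", soft_vote(probas))  # class 0 wins on average probability
```

Hard voting ignores how confident each classifier is, whereas soft voting lets the single highly confident classifier outweigh the two lukewarm ones.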

## 2. Bagging and Pasting

So far, we have obtained a diverse set of classifiers by training different algorithms. Another approach is to use the same training algorithm for every predictor, but to train each one on a different random subset of the training data. When sampling is performed with replacement, this method is called **bagging** (short for bootstrap aggregating, the method used by Random Forests). When sampling is performed without replacement, it is called **pasting**.

Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor.
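
The difference between the two sampling schemes can be sketched with the standard library alone. The ten-element index list and the subset size below are arbitrary stand-ins for a real training set.

```python
import random

random.seed(42)
training_indices = list(range(10))   # stand-in for a 10-instance training set
subset_size = 6

# Bagging: sample WITH replacement, so an index can repeat within one subset.
bagging_subset = random.choices(training_indices, k=subset_size)

# Pasting: sample WITHOUT replacement, so indices within one subset are unique.
pasting_subset = random.sample(training_indices, k=subset_size)

print("bagging:", sorted(bagging_subset))
print("pasting:", sorted(pasting_subset))
```

Repeating either draw for each predictor gives every predictor its own subset, which is what makes the ensemble members differ from one another.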

<center>
<br>
<img src="https://raw.githubusercontent.com/Akramz/Hands-on-Machine-Learning-with-Scikit-Learn-Keras-and-TensorFlow/63b8d7a91ff1ca2fdc35947a6a390c5a81085bd2//static/imgs/bagging.png" height="300">
<br>
</center>

Once all the predictors are trained, the ensemble can make a prediction for a new instance simply by aggregating the predictions of all predictors. The aggregation function is usually the *statistical mode* for classification (we used SciPy's mode() function earlier) or the *average* for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. The net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.

Predictors can be trained in parallel, on different CPU cores or even on different servers. Similarly, predictions can be made in parallel. This is a major reason why bagging and pasting are so popular: they scale well.

This section will be divided into the following subsections:
1. Bagging and Pasting in Scikit-Learn
2. Out of Bag Evaluation
3. Random Patches and Random Subspaces

### 2.1 Bagging and Pasting in Scikit-Learn

Sklearn offers a simple API for both bagging and pasting with the BaggingClassifier and BaggingRegressor classes. The following code trains an ensemble of 500 DecisionTree classifiers, each trained on 100 training instances randomly sampled from our training set with replacement (bootstrap=True gives bagging; set it to False if you want pasting). The n_jobs parameter tells sklearn the number of CPU cores to use, which allows it to run in parallel. A value of -1 tells it to use all available cores. The implementation is as follows:

In [7]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.856

The BaggingClassifier automatically chooses whether to perform soft or hard voting based on whether or not the given base predictor can estimate class probabilities. Since DecisionTreeClassifier can, it performs soft voting.

<center>
<br>
<img src="https://raw.githubusercontent.com/Akramz/Hands-on-Machine-Learning-with-Scikit-Learn-Keras-and-TensorFlow/63b8d7a91ff1ca2fdc35947a6a390c5a81085bd2//static/imgs/dt_bagging.png" height="300">
<br>
</center>

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting. However, the extra diversity also means that the predictors end up being less correlated, so the ensemble's variance is reduced. Overall, bagging often results in better models, which is why it is generally preferred. However, if you have spare time and CPU power, you can use cross-validation to evaluate both bagging and pasting and select the one that works best.

### 2.2 Out of Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. By default a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set. This means that only about 63% of the training instances are sampled on average for each predictor (as m grows, the ratio approaches $1 - \exp(-1) \approx 63.2\%$). The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances. Note that they are not the same 37% for all predictors.
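
The 63% / 37% split can be checked with a quick simulation. This is a minimal sketch using only the standard library; the training-set size of 10,000 is an arbitrary choice for illustration.

```python
import random

random.seed(0)
m = 10_000   # hypothetical training-set size

# Draw m indices with replacement and keep the distinct ones (the "in-bag" instances).
sampled = {random.randrange(m) for _ in range(m)}

in_bag_fraction = len(sampled) / m
print(f"in-bag fraction:     {in_bag_fraction:.3f}")
print(f"out-of-bag fraction: {1 - in_bag_fraction:.3f}")
```

Running the draw again with a different seed selects a different set of out-of-bag instances, which is why each predictor has its own oob set.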

Since a predictor never sees the oob instances during training, it can be evaluated on these instances without the need for a separate validation set. You can evaluate the ensemble itself by averaging out the oob evaluations of each predictor. In sklearn you can set oob_score=True when creating a BaggingClassifier to request an automatic oob evaluation after training. The following code demonstrates this; the resulting evaluation score is available through the oob_score_ variable.

In [8]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, 
    bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.8426666666666667

In [9]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.852

The oob decision function for each training instance is also available through the oob_decision_function_ variable. In this case, since the base estimator has a predict_proba() method, the decision function returns the class probabilities for each training instance. For example, the oob evaluation estimates that the first training instance has a 51.40% probability of belonging to the positive class (and a 48.60% probability of belonging to the negative class).

In [10]:
bag_clf.oob_decision_function_

array([[0.48603352, 0.51396648],
       [0.77486911, 0.22513089],
       [1.        , 0.        ],
       ...,
       [0.98876404, 0.01123596],
       [0.71186441, 0.28813559],
       [0.        , 1.        ]])

### 2.3 Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. Sampling is controlled by two hyperparameters: max_features and bootstrap_features. They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features.

This technique is especially useful when dealing with high-dimensional inputs (such as images). Sampling both the training instances and the features is called the Random Patches method. Keeping all training instances (bootstrap=False and max_samples=1.0) but sampling features (bootstrap_features=True and/or max_features set to a value smaller than 1.0) is called the Random Subspaces method.

Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
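
The two methods differ only in how the BaggingClassifier hyperparameters are set. The following is a minimal sketch on a synthetic 10-feature dataset; the dataset, the number of estimators and the sampling fractions are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 10-feature dataset, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Random Subspaces: keep every training instance, sample half of the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5, random_state=42)

# Random Patches: sample both the training instances and the features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5, random_state=42)

for name, clf in [("subspaces", subspaces_clf), ("patches", patches_clf)]:
    clf.fit(X_demo, y_demo)
    # estimators_features_ lists the feature indices each tree was given
    print(name, "first tree used features:", clf.estimators_features_[0])
```

With bootstrap_features=True the feature indices are drawn with replacement, so a tree may receive the same feature more than once.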


## 3. Random Forests

A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method, typically with max_samples set to the size of the training set. Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier class, which is more convenient and optimised for Decision Trees (there is also a RandomForestRegressor class for regression tasks). The following code uses all available CPU cores to train a Random Forest classifier with 500 trees (each limited to a maximum of 16 leaf nodes):

In [11]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

Apart from a few exceptions (for instance, the base estimator is fixed to a DecisionTreeClassifier built with the provided hyperparameters, and the splitter cannot be changed), a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown) plus all the hyperparameters of a BaggingClassifier (to control the ensemble itself).

The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which trades a higher bias for a lower variance, generally yielding an overall better model. The following BaggingClassifier is roughly equivalent to the previous RandomForestClassifier.

In [12]:
bag_clf = BaggingClassifier(
    # max_features="sqrt": each split considers a random subset of the features
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

This section will now be divided into the following subsections:
1. Extra-Trees
2. Feature Importance

### 3.1 Extra-Trees

When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting. It is possible to make trees even more random by using random thresholds for each feature rather than searching for the best possible thresholds. A forest of such extremely random trees is called an Extremely Randomized Trees ensemble (or Extra-Trees for short). Once again, this trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular Random Forests since finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.

You can create an Extra-Trees classifier using sklearn's ExtraTreesClassifier class. Its API is identical to that of the RandomForestClassifier class. Similarly, the ExtraTreesRegressor class has the same API as the RandomForestRegressor class.

We cannot tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier. The only way to know is to try both and compare them using cross-validation (tuning the hyperparameters using grid search).
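
Such a comparison could look like the following sketch. The dataset (moons with 500 samples), the 100 trees per forest and the 5-fold setup are assumptions chosen to keep the example quick, and no hyperparameter tuning is performed here.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_moons(n_samples=500, noise=0.4, random_state=42)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each model on the same data
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Which model comes out ahead depends on the dataset and the hyperparameters, so the printed scores should be read as one data point rather than a general ranking.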

### 3.2 Feature Importance

Another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Sklearn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all the trees in the forest). More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it.

Sklearn computes this score automatically for each feature after training, then scales the results so that the sum of all importances is equal to 1. You can access the results using the feature_importances_ variable. For example, the following code trains a RandomForestClassifier on the iris dataset and outputs each feature's importance. It seems that the most important features are the petal length (44%) and the petal width (43%), while sepal length and width are rather unimportant in comparison (11% and 3% respectively).

In [13]:
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()

rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(f"{name}: {score}")

sepal length (cm): 0.10623141495246413
sepal width (cm): 0.025929552479618437
petal length (cm): 0.4387934645557202
petal width (cm): 0.42904556801219723


Similarly, if you train a Random Forest classifier on the MNIST dataset and plot each pixel's importance, you get an image that highlights which pixels matter most.

Random Forests are very useful for getting a quick understanding of which features actually matter. In particular, this can be helpful for feature selection.

## 4. Boosting

Boosting, also called hypothesis boosting, refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are AdaBoost (Adaptive Boosting) and Gradient Boosting.

This section will be divided into the following subsections:
1. AdaBoost
2. Gradient Boosting

### 4.1 AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.

For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased. A second classifier is trained using the updated weights and again it makes predictions on the training set, weights are updated again and so on.

<center>
<br>
<img src="https://miro.medium.com/max/1400/1*gf0ifraaFvG6fmgRLZcetQ.png" height="300">
<br>
</center>

The following figure shows the decision boundaries of five consecutive predictors on the moons dataset. Each predictor is a highly regularised SVM classifier with an RBF kernel (this is only for illustration: SVMs are generally not good base predictors for AdaBoost, as they are slow and tend to be unstable with it). The first classifier gets many instances wrong, so their weights get boosted. The second classifier therefore does a better job on these instances, and so on. The plot on the right represents the same sequence of predictors, except that the learning rate is halved (so the misclassified instances don't get boosted as much at each iteration). As you can see, this sequential learning technique has some similarities with Gradient Descent, except that instead of tweaking a single predictor's parameters to minimise a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

<center>
<br>
<img src="https://i0.wp.com/thecleverprogrammer.com/wp-content/uploads/2020/08/image-13.png?resize=789%2C309&ssl=1" height="300">
<br>
</center>

Once all the predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that the predictors have different weights depending on their overall accuracy on the weighted training set.

There is, however, an important drawback to this sequential learning technique: it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as bagging or pasting.

Let's take a closer look at the AdaBoost algorithm. Each instance weight $w^{(i)}$ is initially set to $\frac{1}{m}$, where m is the number of training instances. A first predictor is trained, and its weighted error rate $r_1$ is computed on the training set. The predictor's weight $\alpha_1$ is then computed from this error rate. Both quantities are given by the following equations.

Weighted error rate of the $j^{th}$ predictor:

$r_j = \frac{\sum_{i=1}^{m} w^{(i)} \cdot 1(y^{(i)} \neq \hat{y}_j^{(i)})}{\sum_{i=1}^{m} w^{(i)}}$

Where $\hat{y}_j^{(i)}$ is the $j^{th}$ predictor's prediction for the $i^{th}$ instance.

Predictor's weight $\alpha_j$:

$\alpha_j = \eta \log \frac{1 - r_j}{r_j}$

Where $\eta$ is the learning rate hyperparameter (defaults to 1). The more accurate the predictor is, the higher its weight will be. If it is just guessing randomly, then its weight will be close to zero. However if it is most often wrong (less accurate than random guessing), then its weight will be negative.
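
The behaviour of the predictor weight is easy to check by evaluating the formula for a few error rates. This is a minimal sketch; `predictor_weight` is a hypothetical helper name, and the error rates are arbitrary sample values.

```python
import math

def predictor_weight(r, learning_rate=1.0):
    """alpha_j = eta * log((1 - r_j) / r_j), for a weighted error rate 0 < r_j < 1."""
    return learning_rate * math.log((1 - r) / r)

for r in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"r = {r:.1f} -> alpha = {predictor_weight(r):+.3f}")
```

An error rate below 0.5 gives a positive weight, exactly 0.5 gives zero, and anything above 0.5 gives a negative weight, matching the description above.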

Next the AdaBoost algorithm updates the instance weights using the equation given below, which boosts the weights of the misclassified instances.

Weight Update Rule:

for i = 1 to m:

$w^{(i)} \leftarrow \begin{cases} w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\ w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)} \end{cases}$

Then all the instance weights are normalised (i.e. divided by $\sum_{i=1}^{m} w^{(i)}$).

Finally a new predictor is trained using the updated weights and the whole process is repeated. The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.

To make predictions, AdaBoost simply computes the predictions of all the predictors and weights them using the predictor weights $\alpha_j$. The predicted class is the one that receives the majority of the weighted votes.

AdaBoost Predictions:

$\hat{y}(x) = \underset{k}{\operatorname{argmax}} \sum_{j=1}^{N} \alpha_j \, 1(\hat{y}_j(x) = k)$

Where $\hat{y}_j(x)$ is the $j^{th}$ predictor's prediction for the instance x, and N is the number of predictors.
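
The equations above can be put together in a short from-scratch sketch. This is not sklearn's implementation: `fit_adaboost` and `predict_adaboost` are hypothetical helper names, decision stumps are assumed as the base predictor, and the clipping of the error rate is an added safeguard against division by zero.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

def fit_adaboost(X, y, n_estimators=50, learning_rate=1.0):
    """Train decision stumps sequentially, reweighting misclassified instances."""
    m = len(X)
    w = np.full(m, 1.0 / m)                      # initial instance weights: 1/m
    predictors, alphas = [], []
    for _ in range(n_estimators):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        missed = stump.predict(X) != y
        r = w[missed].sum() / w.sum()            # weighted error rate r_j
        r = np.clip(r, 1e-10, 1 - 1e-10)         # safeguard: avoid log(0) and division by zero
        alpha = learning_rate * np.log((1 - r) / r)   # predictor weight alpha_j
        predictors.append(stump)
        alphas.append(alpha)
        if not missed.any():                     # perfect predictor found: stop
            break
        w[missed] *= np.exp(alpha)               # boost the misclassified instances
        w /= w.sum()                             # normalise the weights
    return predictors, np.array(alphas)

def predict_adaboost(predictors, alphas, X, classes):
    """Return the class with the largest alpha-weighted vote for each instance."""
    votes = np.zeros((len(X), len(classes)))
    for stump, alpha in zip(predictors, alphas):
        pred = stump.predict(X)
        for col, k in enumerate(classes):
            votes[:, col] += alpha * (pred == k)
    return np.asarray(classes)[votes.argmax(axis=1)]

X_demo, y_demo = make_moons(n_samples=500, noise=0.3, random_state=42)
predictors, alphas = fit_adaboost(X_demo, y_demo, n_estimators=50)
y_hat = predict_adaboost(predictors, alphas, X_demo, classes=[0, 1])
train_accuracy = (y_hat == y_demo).mean()
print(f"training accuracy with {len(predictors)} stumps: {train_accuracy:.3f}")
```

Each loop iteration depends on the weights produced by the previous one, which makes the sequential nature of the algorithm explicit.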

Sklearn uses a multiclass version of AdaBoost called SAMME, which stands for Stagewise Additive Modelling using a Multiclass Exponential loss function. When there are just two classes, SAMME is equivalent to AdaBoost. If the predictors can estimate class probabilities (i.e. if they have a predict_proba() method), then Sklearn can use a variant of SAMME called SAMME.R (the R stands for Real), which relies on class probabilities rather than predictions and generally performs better.

The following code trains an AdaBoost classifier based on 200 Decision Stumps using sklearn's AdaBoostClassifier class (there is also an AdaBoostRegressor class). A Decision Stump is a Decision Tree with max_depth=1, in other words a tree composed of a single decision node plus two leaf nodes. This is the default base estimator for the AdaBoostClassifier class.


In [14]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    # Note: recent scikit-learn releases deprecate 'SAMME.R'; if it is rejected, use 'SAMME'.
    algorithm='SAMME.R', learning_rate=0.5
)

ada_clf.fit(X_train, y_train)

Note: if your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators or more strongly regularise the base estimator.

### 4.2 Gradient Boosting

Gradient Boosting is another very popular boosting algorithm. Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

We will use a simple regression example, using Decision Trees as the base predictors. This is called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT). First, let's fit a DecisionTreeRegressor with max_depth=2 to the training set (for example, a noisy quadratic training set; X and y below are assumed to hold such regression data rather than the moons dataset used earlier):

In [15]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)

tree_reg1.fit(X, y)

Next we will train a second DecisionTreeRegressor on the residual errors made by the first predictor:

In [16]:
y2 = y - tree_reg1.predict(X)

tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

We then train a third regressor on the residual errors made by the second predictor:

In [17]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

This has resulted in an ensemble containing three trees. It can make predictions on a new instance simply by adding up the predictions of all the trees. In other words, this is a Gradient Boosted Regression Tree ensemble with three trees.

In [18]:
y_pred = sum(tree.predict(X) for tree in (tree_reg1, tree_reg2, tree_reg3))
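
The three manual steps can also be written as a loop. The following self-contained sketch generates its own noisy quadratic data (the data-generating formula, the sample size and the names `X_quad` and `y_quad` are assumptions for illustration) and records the training error after each tree is added.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Assumed data: a noisy quadratic, y = 3x^2 + Gaussian noise (illustrative only).
rng = np.random.default_rng(42)
X_quad = rng.random((100, 1)) - 0.5
y_quad = 3 * X_quad[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

trees, stage_mse = [], []
residual = y_quad.copy()
ensemble_pred = np.zeros_like(y_quad)
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_quad, residual)             # fit the current residual errors
    ensemble_pred += tree.predict(X_quad)  # ensemble prediction = sum of all trees
    residual = y_quad - ensemble_pred      # what the ensemble still gets wrong
    trees.append(tree)
    stage_mse.append(mean_squared_error(y_quad, ensemble_pred))

for n, mse in enumerate(stage_mse, start=1):
    print(f"{n} tree(s): training MSE = {mse:.5f}")
```

The training error shrinks as trees are added because each new tree only has to model what the previous trees left unexplained.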

The figure given below represents the predictions of these three trees in the left column and the ensemble's predictions in the right column. In the first row, the ensemble has just one tree, so its predictions are exactly the same as the first tree's predictions. In the second row, a new tree is trained on the residual errors of the first tree. You can see that the ensemble's predictions are equal to the sum of the first two trees' predictions. Similarly, in the third row another tree is trained on the residual errors of the second tree. You can see that the ensemble's predictions gradually get better as trees are added to the ensemble.

A simple way to train GBRT ensembles is to use Scikit-Learn's GradientBoostingRegressor class. Much like the RandomForestRegressor class, it has hyperparameters to control the growth of Decision Trees (e.g. max_depth, min_samples_leaf), as well as hyperparameters to control the ensemble training, such as the number of trees (n_estimators). The following code creates the same ensemble as the one we just made manually.

In [19]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

<center>
<img src="https://raw.githubusercontent.com/Akramz/Hands-on-Machine-Learning-with-Scikit-Learn-Keras-and-TensorFlow/63b8d7a91ff1ca2fdc35947a6a390c5a81085bd2//static/imgs/boosting_ensembles.png" width="600">
</center>

The learning rate hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called shrinkage. The figure given below shows two GBRT ensembles trained with a low learning rate of 0.1 (left) and a high learning rate of 1 (right). The ensemble on the left does not have enough trees to fit the training set, while the ensemble on the right overfits it.

<center>
<img src="https://raw.githubusercontent.com/Akramz/Hands-on-Machine-Learning-with-Scikit-Learn-Keras-and-TensorFlow/63b8d7a91ff1ca2fdc35947a6a390c5a81085bd2//static/imgs/ensemble_overunderfit.png" width="600">
</center>

In order to find the optimal number of trees, you can use early stopping. A simple way to implement this is to use the staged_predict() method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.). The following code trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees.

In [20]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
# errors[0] corresponds to an ensemble with one tree, hence the + 1
bst_n_estimators = int(np.argmin(errors)) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

The validation errors are represented on the left of the figure given below and the best model predictions are represented on the right. The optimal number of trees appears to be around 60.

<center>
<img src="https://raw.githubusercontent.com/Akramz/Hands-on-Machine-Learning-with-Scikit-Learn-Keras-and-TensorFlow/63b8d7a91ff1ca2fdc35947a6a390c5a81085bd2//static/imgs/gradient_boosting_optimization.png" width="600">
</center>

It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number). You can do so by setting warm_start=True, which makes Scikit-Learn keep existing trees when the fit() method is called, allowing incremental training. The following code stops training when the validation error does not improve for five iterations in a row.

In [21]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float('inf')
error_going_up = 0  # consecutive iterations without improvement

for n_estimators in range(1, 120):
    # With warm_start=True, fit() keeps the existing trees and only adds the new one
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping
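
As a point of comparison with the manual loop, GradientBoostingRegressor can also hold out part of the training data internally and stop on its own through the n_iter_no_change and validation_fraction hyperparameters. The sketch below is only illustrative: the synthetic dataset and the names X_demo, y_demo and gbrt_es are assumptions made for the example, not part of this chapter's data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data standing in for the chapter's X and y (an assumption)
X_demo, y_demo = make_regression(n_samples=500, n_features=5,
                                 noise=10.0, random_state=42)

gbrt_es = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=500,         # upper bound on the number of trees
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=5,       # stop after 5 iterations without improvement
    random_state=42,
)
gbrt_es.fit(X_demo, y_demo)

# n_estimators_ holds the number of trees that were actually fitted
print("Trees fitted:", gbrt_es.n_estimators_)
```

Note that this variant carves its validation set out of the data passed to fit(), whereas the loop above uses a separate X_val that you control.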

The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, if subsample=0.25, then each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by now, this trades a higher bias for a lower variance. It also speeds up training considerably. This technique is called Stochastic Gradient Boosting.
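
A small sketch of Stochastic Gradient Boosting, assuming a synthetic dataset (the names X_demo, y_demo, X_tr, X_va and val_mse are illustrative). It trains the same ensemble with and without subsampling so that the validation errors can be compared; which one comes out ahead depends on the data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the chapter's X and y (an assumption)
X_demo, y_demo = make_regression(n_samples=1000, n_features=10,
                                 noise=15.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

val_mse = {}
for subsample in (1.0, 0.25):
    # subsample=0.25 trains each tree on a random 25% of the training instances
    model = GradientBoostingRegressor(max_depth=2, n_estimators=100,
                                      subsample=subsample, random_state=0)
    model.fit(X_tr, y_tr)
    val_mse[subsample] = mean_squared_error(y_va, model.predict(X_va))
    print(f"subsample={subsample}: validation MSE = {val_mse[subsample]:.1f}")
```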

It is also possible to use Gradient Boosting with other cost functions. For example, you can perform regression using the Huber loss, which is less sensitive to outliers than the mean squared error. Scikit-Learn supports this directly: the cost function is controlled by the loss hyperparameter of GradientBoostingRegressor (for example, loss="huber").
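
A minimal sketch of choosing the cost function through the loss hyperparameter, assuming a toy one-dimensional dataset with a few injected outliers (the data and the names X_demo, y_demo and gbrt_huber are made up for illustration). The alpha hyperparameter sets the quantile at which the Huber loss switches from quadratic to linear.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy quadratic data (an assumption), with a few large outliers in the targets
rng = np.random.RandomState(42)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.3, size=300)
y_demo[:10] += 50.0

# loss="huber" is quadratic for small residuals and linear for large ones
gbrt_huber = GradientBoostingRegressor(loss="huber", alpha=0.9, max_depth=2,
                                       n_estimators=100, random_state=42)
gbrt_huber.fit(X_demo, y_demo)

# Baseline using the default squared-error loss, for comparison
gbrt_default = GradientBoostingRegressor(max_depth=2, n_estimators=100,
                                         random_state=42)
gbrt_default.fit(X_demo, y_demo)

X_new = np.array([[0.0], [1.0], [2.0]])
print("Huber loss:  ", gbrt_huber.predict(X_new))
print("Default loss:", gbrt_default.predict(X_new))
```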

It is worth noting that an optimized implementation of Gradient Boosting is available in the popular Python library XGBoost, which stands for Extreme Gradient Boosting. It has been an important component of many winning entries in ML competitions in recent years. XGBoost offers several nice features, such as automatically taking care of early stopping. It can exploit all your CPU cores during training, it is optimized to work efficiently on large datasets, and it handles missing values natively. XGBoost is also under active development, and it is often faster than Scikit-Learn's GradientBoostingRegressor. In short, XGBoost is a great tool to have in your ML toolbox, but you should also consider using Scikit-Learn's GradientBoostingRegressor class, as it is well integrated with the rest of the library.

In [22]:
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)

In [23]:
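# Stop when validation RMSE fails to improve for 2 rounds. In recent XGBoost releases,
# pass early_stopping_rounds to the XGBRegressor constructor instead of fit().
xgb_reg.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=2)
y_pred = xgb_reg.predict(X_val)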

[0]	validation_0-rmse:0.42274
[1]	validation_0-rmse:0.38239
[2]	validation_0-rmse:0.35219
[3]	validation_0-rmse:0.34369
[4]	validation_0-rmse:0.33817
[5]	validation_0-rmse:0.33043
[6]	validation_0-rmse:0.32999
[7]	validation_0-rmse:0.33048


### 4.3 Stacking

The last ensemble method we will discuss in this chapter is called stacking (short for stacked generalization). It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don't we train a model to perform this aggregation? The figure given below shows such an ensemble performing a regression task on a new instance. Each of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0).

<center>
<img src="https://raw.githubusercontent.com/Akramz/Hands-on-Machine-Learning-with-Scikit-Learn-Keras-and-TensorFlow/63b8d7a91ff1ca2fdc35947a6a390c5a81085bd2//static/imgs/blender.png" width="600">
</center>

To train the blender, a common approach is to use a hold-out set. First, the training set is split in two subsets. The first subset is used to train the predictors in the first layer. 

<center>
<img src="https://vatsalparsaniya.github.io/ML_Knowledge/_images/stacking2.png" width="600">
</center>

Next, the first-layer predictors are used to make predictions on the second (held-out) set. This ensures that the predictions are "clean", since the predictors never saw these instances during training. For each instance in the hold-out set there are now three predicted values. We can create a new training set using these predicted values as input features (which makes this new training set three-dimensional), and keeping the target values. The blender is trained on this new training set, so it learns to predict the target value given the first layer's predictions. Now we can evaluate the blender on the test set; it will often generalize better than the individual first-layer predictors, although this is not guaranteed.

<center>
<img src="https://vatsalparsaniya.github.io/ML_Knowledge/_images/stacking3.png" width="600">
</center>

It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression, and so on) to get a whole layer of blenders, and then to train one more blender on top of that layer. The trick is to split the training set into three subsets. The first subset is used to train the first-layer predictors. The second subset is used to create the training set for the second layer: the first-layer predictors make predictions on it, and the second-layer predictors are trained on those predictions. The third subset is used in the same way to create the training set for the third layer, using predictions made by the second layer. Once this is done, a prediction for a new instance is made by passing it through each layer in turn.

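Scikit-Learn supports stacking directly through the StackingClassifier and StackingRegressor classes (available since version 0.22), which build the blender's training set from cross-validated predictions rather than a single hold-out set. It is also not too hard to roll out your own implementation of the hold-out approach described above.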
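
The hold-out procedure described above can be sketched by hand as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the synthetic dataset, the choice of the three first-layer predictors, the LinearRegression blender and the helper first_layer_predictions() are all picked for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for a real dataset (an assumption)
X_demo, y_demo = make_regression(n_samples=1500, n_features=8,
                                 noise=20.0, random_state=42)

# Set aside a test set, then split the rest into the first-layer
# training subset and the hold-out subset used to train the blender
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
X_layer1, X_hold, y_layer1, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# First layer: three different predictors trained on the first subset only
first_layer = [
    Ridge(alpha=1.0),
    KNeighborsRegressor(n_neighbors=5),
    RandomForestRegressor(n_estimators=50, random_state=42),
]
for predictor in first_layer:
    predictor.fit(X_layer1, y_layer1)

def first_layer_predictions(X):
    """Hypothetical helper: one column per predictor, shape (n_samples, 3)."""
    return np.column_stack([p.predict(X) for p in first_layer])

# The blender is trained on "clean" predictions made on the hold-out subset
blender = LinearRegression()
blender.fit(first_layer_predictions(X_hold), y_hold)

# To predict, pass instances through the first layer, then the blender
y_pred = blender.predict(first_layer_predictions(X_test))
print("Blender predictions for the first 3 test instances:", y_pred[:3])
```

Because half of the non-test data is reserved for the blender, the first-layer predictors see fewer training instances; that is the main cost of the hold-out approach.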

## 5. Exercises

1. Load the MNIST dataset and split it into a training set, a validation set and a test set (take the first 50,000 instances for training, 10,000 for validation and the remaining 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using a soft or hard voting classifier. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?
2. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the classifiers they form a stacking ensemble! Now evaluate the ensemble on the test set. How does it compare to the voting classifier you trained earlier?