# Ensemble Learning & Random Forests

## Voting Classifiers

Suppose you have trained a few classifiers, each one achieving about 80% accuracy. You have a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a KNN classifier, and perhaps a few more.

A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a **hard voting** classifier.

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a `weak learner` (i.e., it does only slightly better than random guessing), the ensemble can still be a `strong learner` (i.e., it achieves high accuracy), provided there is a sufficient number of weak learners and they are sufficiently diverse. This follows from the `law of large numbers`.

The `law of large numbers` states that if you perform the same experiment a large number of times, the average of the results should be close to the expected value, and it tends to get closer as more trials are performed. For example, if you flip a biased coin 1,000 times, with a 51% probability of heads and a 49% probability of tails, you would expect about 510 heads and 490 tails, and the ratio of heads gets closer to 51% as you keep flipping.
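
The coin example can be checked with a quick simulation. This is only a sketch: the seed, the number of flips, and the variable names are arbitrary choices, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
p_heads = 0.51                   # biased coin from the example above

# One long series of biased coin flips (True = heads).
flips = rng.random(100_000) < p_heads

# Ratio of heads observed after each flip.
running_ratio = np.cumsum(flips) / np.arange(1, len(flips) + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n:>7} flips: heads ratio = {running_ratio[n - 1]:.4f}")
```

The ratio can be far from 51% after a handful of flips, but it should settle close to 51% as the number of flips grows.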

The following code creates and trains a voting classifier in Scikit-Learn:

In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.15)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)]
voting_clf = VotingClassifier(estimators=estimators, voting="hard")
voting_clf.fit(X_train, y_train)

In [36]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9
RandomForestClassifier 0.85
SVC 0.95
VotingClassifier 0.9


If all classifiers are able to estimate class probabilities (i.e., they all have a `predict_proba` method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called `Soft Voting`. `Soft Voting` often achieves higher performance than hard voting because it gives more weight to highly confident votes.

All you need to do is replace `voting="hard"` with `voting="soft"` and ensure that all the classifiers can estimate class probabilities. `SVC` does not do so by default, so it must be created with `probability=True`.
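
A minimal sketch of the soft voting variant follows. It rebuilds the moons data under new names (the `_demo` suffix and the fixed `random_state` values are assumptions added here so the snippet runs on its own and does not overwrite the notebook's variables).

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-alone copy of the moons data (names and seeds are illustrative).
X_demo, y_demo = make_moons(n_samples=100, noise=0.15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(random_state=42)),
        # probability=True gives SVC a predict_proba method (at some training cost).
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
soft_voting_clf.fit(X_tr, y_tr)

# Class probabilities averaged over the three classifiers.
proba = soft_voting_clf.predict_proba(X_te)
print(proba[:3])
```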

## Bagging & Pasting

One way to get a diverse set of classifiers is to use very different training algorithms (e.g., SVC, KNN, etc.). Another is to use the same training algorithm for every predictor and train each one on a different random subset of the training set.

When sampling is performed with replacement, this is called `bagging`. When sampling is performed without replacement, this is called `pasting`. Only bagging allows training instances to be sampled several times for the same predictor.

Once all the predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the `statistical mode` for classification (i.e., the most frequent prediction, just like a hard voting classifier) and the average for regression.
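
The two aggregation functions can be sketched with NumPy. The prediction arrays below are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical class predictions: one row per predictor, one column per instance.
class_votes = np.array([
    [0, 1, 1, 2],
    [0, 1, 2, 2],
    [1, 1, 2, 0],
])

# Classification: the statistical mode of each column
# (np.bincount(...).argmax() breaks ties in favor of the smallest label).
majority = np.array([np.bincount(column).argmax() for column in class_votes.T])

# Hypothetical regression predictions, same layout.
value_preds = np.array([
    [2.0, 3.5],
    [2.4, 3.1],
    [2.2, 3.3],
])

# Regression: the average of each column.
average = value_preds.mean(axis=0)

print(majority)
print(average)
```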

Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.

### Bagging & Pasting in Scikit-Learn

`Scikit-Learn` has a very simple API for both bagging and pasting with the `BaggingClassifier` class (or `BaggingRegressor` for regression).

* `n_estimators` = number of predictors in the ensemble (e.g., 500 Decision Trees)
* `max_samples` = number of training instances (or, as a float, the fraction of them) used to train each estimator
* `bootstrap` = `True` for bagging, `False` for pasting

Overall, bagging often results in better models, which explains why it is generally preferred.

In [39]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=75, bootstrap=True, n_jobs=1)

bag_clf.fit(X_train, y_train)

y_pred = bag_clf.predict(X_test)

accuracy_score(y_test, y_pred)

0.9

### Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. By default, a `BaggingClassifier` samples `m` training instances with replacement (i.e., `bootstrap=True`), where `m` is the size of the training set. This means that only about 63% of the training instances are sampled on average for each predictor. The remaining 37% that are not sampled are called `out-of-bag` (oob) instances. Note that they are not the same 37% for every predictor.

Since a predictor never sees its `out-of-bag` instances during training, it can be evaluated on them without the need for a separate validation set. You can evaluate the ensemble itself by averaging the `oob` evaluations of each predictor. In Scikit-Learn, setting `oob_score=True` requests this evaluation after training, and the result is available in the `oob_score_` attribute:
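
The 63% figure comes from the probability that a given instance is drawn at least once in `m` draws with replacement, which is `1 - (1 - 1/m)**m` and approaches `1 - 1/e` (about 0.632) as `m` grows. The sketch below checks this numerically; the value of `m` and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
m = 10_000                       # hypothetical training-set size

# Draw m indices with replacement, as a single bootstrap sample would.
sample = rng.integers(0, m, size=m)
sampled_fraction = np.unique(sample).size / m

# Probability that a given instance is drawn at least once in m draws.
expected_fraction = 1 - (1 - 1 / m) ** m

print(f"simulated: {sampled_fraction:.3f}, expected: {expected_fraction:.3f}")
```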

In [42]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, bootstrap=True, n_jobs=1, oob_score=True)

bag_clf.fit(X_train, y_train)

print(f"Validation Score: {bag_clf.oob_score_}")

Validation Score: 0.9375


In [44]:
y_pred = bag_clf.predict(X_test)

accuracy_score(y_test, y_pred)

0.85

## Random Patches & Random Subspaces

The `BaggingClassifier` class supports sampling the features as well, which is controlled by two hyperparameters: `max_features` and `bootstrap_features`. This allows each predictor to be trained on a random subset of the input features, a technique that is particularly useful when you are dealing with high-dimensional inputs such as images.

Sampling both training instances and features is called the `Random Patches` method, while keeping all training instances but sampling features is called the `Random Subspaces` method.
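
Both methods can be sketched with `BaggingClassifier`. The synthetic dataset, the sampling fractions, and the variable names below are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10 features (illustrative only).
X_feat, y_feat = make_classification(n_samples=200, n_features=10, random_state=42)

# Random Subspaces: keep every training instance, sample half of the features per tree.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=False, max_features=0.5,
    random_state=42,
)
subspaces_clf.fit(X_feat, y_feat)

# Random Patches: sample both training instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    bootstrap=True, max_samples=0.5,
    bootstrap_features=False, max_features=0.5,
    random_state=42,
)
patches_clf.fit(X_feat, y_feat)

# Feature indices seen by the first tree of each ensemble.
print(subspaces_clf.estimators_features_[0])
print(patches_clf.estimators_features_[0])
```

Here `bootstrap_features=False` draws the features without replacement, so every tree sees five distinct features.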

## Random Forests

A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method, typically with `max_samples` set to the size of the training set. Instead of building a `BaggingClassifier` and passing it a `DecisionTreeClassifier`, you can use the `RandomForestClassifier` class:

In [46]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

Random Forests introduce extra randomness when growing trees: instead of searching for the very best feature when splitting a node, the algorithm searches for the best feature among a random subset of features.
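
To make the link with bagging concrete, the sketch below builds a `BaggingClassifier` that is roughly comparable to the Random Forest above: each tree considers a random subset of features at every split (`max_features="sqrt"`) and is trained on a bootstrap sample the size of the training set. The data copy, seeds, and the smaller `n_estimators` are assumptions made to keep the snippet self-contained and quick.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-alone copy of the moons data (names and seeds are illustrative).
X_rf, y_rf = make_moons(n_samples=100, noise=0.15, random_state=42)
X_rf_train, X_rf_test, y_rf_train, y_rf_test = train_test_split(
    X_rf, y_rf, test_size=0.2, random_state=42
)

bag_of_random_trees = BaggingClassifier(
    # Each split only looks at a random subset of the features.
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=100,
    max_samples=1.0,   # bootstrap sample as large as the training set
    bootstrap=True,
    n_jobs=1,
    random_state=42,
)
bag_of_random_trees.fit(X_rf_train, y_rf_train)

y_pred_bag = bag_of_random_trees.predict(X_rf_test)
```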

### Feature Importance

A great quality of Random Forests is that they make it easy to measure the relative importance of each feature.

Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). Scikit-Learn computes this score automatically for each feature after training, and then scales the results so that the sum of all importances is equal to 1. These scores can be accessed via the `feature_importances_` attribute:

In [47]:
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=1)

X = iris["data"]
y = iris["target"]

rnd_clf.fit(X, y)

for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.08682404757777516
sepal width (cm) 0.021006785266283006
petal length (cm) 0.44391103027366163
petal width (cm) 0.44825813688228033


Random Forests are very handy for getting a quick understanding of which features actually matter, in particular if you need to perform feature selection.
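
One simple way to turn the importances into a feature selection step is to rank the features and keep the top few. This is only a sketch: the cutoff `top_k`, the seed, and the variable names are assumptions, and other selection strategies are possible.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_demo = load_iris()
forest = RandomForestClassifier(n_estimators=500, random_state=42)
forest.fit(iris_demo["data"], iris_demo["target"])

# Feature indices sorted from most to least important.
ranking = np.argsort(forest.feature_importances_)[::-1]

top_k = 2  # hypothetical cutoff
selected = ranking[:top_k]
selected_names = [iris_demo["feature_names"][i] for i in selected]

# Reduced dataset containing only the selected columns.
X_selected = iris_demo["data"][:, selected]

print(selected_names, X_selected.shape)
```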

## Boosting

Boosting refers to any ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

There are many boosting methods available, but by far the most popular are `AdaBoost` (short for Adaptive Boosting) and `Gradient Boosting`.

### AdaBoost          

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases.

For example, when training an AdaBoost classifier: 

1. The algorithm first trains a base classifier (such as a Decision Tree) and uses it to make predictions on the training set.
2. The algorithm then increases the relative weight of the misclassified training instances.
3. Then it trains a second classifier, using the updated weights, and again makes predictions on the training set.
4. Steps 2 and 3 are repeated until all estimators have been trained.

Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that the predictors have different weights depending on their overall accuracy on the weighted training set.
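
The steps above can be sketched from scratch for a binary problem. The formulas used here are the commonly cited AdaBoost ones (a predictor weight of `learning_rate * log((1 - r) / r)` for a weighted error rate `r`, and misclassified instance weights multiplied by `exp(alpha)`); they are not spelled out in the text above, and this is an illustration rather than what `AdaBoostClassifier` does internally. The data, the number of rounds, and all names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_ada, y_ada = make_moons(n_samples=200, noise=0.15, random_state=42)

n_rounds = 10          # hypothetical number of boosting rounds
learning_rate = 0.5
n = len(X_ada)

weights = np.full(n, 1 / n)  # every instance starts with the same weight
stumps, alphas = [], []

for _ in range(n_rounds):
    # Steps 1 and 3: train a weak learner on the weighted training set.
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_ada, y_ada, sample_weight=weights)
    missed = stump.predict(X_ada) != y_ada

    # Weighted error rate, clipped to keep the logarithm finite.
    r = np.clip(weights[missed].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    alpha = learning_rate * np.log((1 - r) / r)  # predictor weight

    # Step 2: boost the weights of the misclassified instances, then normalize.
    weights[missed] *= np.exp(alpha)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Prediction: each stump votes for a class with weight alpha.
votes = np.zeros((n, 2))
for stump, alpha in zip(stumps, alphas):
    votes[np.arange(n), stump.predict(X_ada)] += alpha
y_pred_ada = votes.argmax(axis=1)

train_accuracy = (y_pred_ada == y_ada).mean()
print(f"training accuracy: {train_accuracy:.3f}")
```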

In [52]:
from sklearn.ensemble import AdaBoostClassifier

# `algorithm` is left at its default: recent Scikit-Learn releases deprecate the "SAMME.R" option.
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, learning_rate=0.5)

ada_clf.fit(X_train, y_train)

y_pred = ada_clf.predict(X_test)

accuracy_score(y_test, y_pred)

0.85

### Gradient Boosting

Just like `AdaBoost`, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

Let's go through an example of how this works, using Decision Trees as the base predictors:

In [56]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X_train, y_train)

# Train a second DecisionTreeRegressor on the residual errors of the first predictor:

y2 = y_train - tree_reg1.predict(X_train)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X_train, y2)

# Train a third DecisionTreeRegressor on the residual errors of the second predictor:
y3 = y2 - tree_reg2.predict(X_train)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X_train, y3)

Now we have an ensemble of 3 trees - it can make predictions on a new instance by simply adding up the predictions of all the trees:

In [61]:
y_pred = sum(tree.predict(X_test) for tree in (tree_reg1, tree_reg2, tree_reg3))

A simpler way to train a Gradient Boosting ensemble is to use Scikit-Learn's `GradientBoostingRegressor` class. Much like the `RandomForestRegressor` class, it has hyperparameters to control the growth of the Decision Trees (e.g., `max_depth`, `min_samples_leaf`) as well as hyperparameters to control the ensemble training, such as the number of trees (`n_estimators`). The code below uses its classification counterpart, `GradientBoostingClassifier`, which takes the same hyperparameters, so that the result can be scored with `accuracy_score`:

In [62]:
from sklearn.ensemble import GradientBoostingClassifier

gbrt = GradientBoostingClassifier(max_depth=2, n_estimators=3, learning_rate=1.6)
gbrt.fit(X_train, y_train)

y_pred = gbrt.predict(X_test)

accuracy_score(y_test, y_pred)

0.95

The `learning_rate` hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called `shrinkage`.

With a low `learning_rate`, too few trees will underfit the training set, while too many trees will overfit it.

To find the optimal number of trees, you can use early stopping (i.e., stop training before the model starts to overfit). The following code implements it by measuring the validation error at each stage of training with `staged_predict()`, picking the number of trees with the lowest error, and then training another ensemble with that number of trees:

In [69]:
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor

# Load the dataset
california_housing = fetch_california_housing()

# Get the features and target variable
X = california_housing.data
y = california_housing.target

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=300)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]

best_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n_estimators)
gbrt_best.fit(X_train, y_train)

In [70]:
best_n_estimators

299

Alternatively, you can stop training as soon as the validation error stops improving, instead of training all the trees first and looking back afterwards. Setting `warm_start=True` makes Scikit-Learn keep the existing trees when `fit()` is called again, which allows incremental training. The following code stops when the validation error has not improved for five iterations in a row:

In [71]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0

for n_estimators in range(1, 300):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break # Early Stopping
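
As an aside that is not part of the original notebook, `GradientBoostingRegressor` also has built-in early stopping through its `n_iter_no_change` and `validation_fraction` hyperparameters. The sketch below uses a small synthetic dataset (an assumption, so that it runs without downloading anything) and arbitrary settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic regression problem (illustrative only).
X_es, y_es = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

gbrt_auto = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=300,         # upper bound on the number of trees
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=5,       # stop after 5 iterations without improvement
    random_state=42,
)
gbrt_auto.fit(X_es, y_es)

# Number of trees actually trained before stopping.
print(gbrt_auto.n_estimators_)
```

The trade-off is that the validation split is made internally, so you give up control over which instances are held out.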

The `GradientBoostingRegressor` class also supports a `subsample` hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, if `subsample=0.25`, then each tree is trained on 25% of the training instances, selected randomly. This technique trades a higher bias for a lower variance and speeds up training considerably. It is called `Stochastic Gradient Boosting`.

It is worth noting that an optimized implementation of Gradient Boosting is available in the popular Python library XGBoost. XGBoost is often an important component of the winning entries in ML competitions:
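
A minimal sketch of Stochastic Gradient Boosting follows, again on a small synthetic dataset chosen here so the snippet is self-contained. The dataset, seed, and number of trees are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic regression problem (illustrative only).
X_sgb, y_sgb = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

gbrt_stochastic = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=100,
    subsample=0.25,   # each tree is trained on a random 25% of the instances
    random_state=42,
)
gbrt_stochastic.fit(X_sgb, y_sgb)

y_pred_sgb = gbrt_stochastic.predict(X_sgb)
```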

In [78]:
# !pip install xgboost

In [77]:
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)

val_error = mean_squared_error(y_val, y_pred)
val_error

0.22144596276767312

XGBoost has several nice features, such as automatically taking care of early stopping:

In [None]:
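# Recent XGBoost releases take early_stopping_rounds in the constructor (older ones took it in fit()).
xgb_reg = xgboost.XGBRegressor(early_stopping_rounds=2)
xgb_reg.fit(X_train, y_train, eval_set=[(X_val, y_val)])
y_pred = xgb_reg.predict(X_val)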

## Stacking