# Chapter 7 Ensemble Learning and Random Forests

In [19]:
# Load the moons dataset and split it into training and test sets
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=10000, noise=0.15)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

In [21]:
# Examine each classifier's accuracy on the test set
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.8785
RandomForestClassifier 0.99
SVC 0.991
VotingClassifier 0.992


The voting classifier slightly outperforms every individual classifier.
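To make voting='hard' concrete, here is a minimal sketch of majority voting done by hand. The prediction values are made up for illustration (they are not the outputs of the classifiers above), and hard_vote is a hypothetical helper, not a Scikit-Learn function.

```python
import numpy as np

# Hypothetical predictions from three classifiers for five instances
# (rows = classifiers, columns = instances); the values are made up.
predictions = np.array([
    [0, 1, 1, 0, 1],   # e.g. a logistic regression
    [0, 1, 0, 0, 1],   # e.g. a random forest
    [1, 1, 0, 0, 0],   # e.g. an SVM
])

def hard_vote(preds):
    """Return the class that receives the most votes for each instance."""
    n_classes = preds.max() + 1
    # votes[c, i] = number of classifiers that predicted class c for instance i
    votes = np.array([(preds == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)

majority = hard_vote(predictions)
print(majority)
```

Each column is decided independently, so the ensemble can be right on an instance even when one of its members is wrong.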

# Bagging and Pasting in Scikit-Learn

Ways to get a diverse set of classifiers
- Use very different training algorithms
- Use the same training algorithm for every predictor, but train each predictor on a different random subset of the training set
    - Bagging: sampling WITH REPLACEMENT (bootstrap=True)
    - Pasting: sampling WITHOUT REPLACEMENT (bootstrap=False)
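The difference between the two sampling schemes can be sketched with NumPy alone. This only illustrates how the index subsets differ; it is not how BaggingClassifier is implemented internally, and the sizes (n_train, max_samples) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, max_samples = 1000, 100   # arbitrary sizes for illustration

# Bagging: indices drawn WITH replacement, so duplicates are possible
bagging_idx = rng.choice(n_train, size=max_samples, replace=True)

# Pasting: indices drawn WITHOUT replacement, so every index is distinct
pasting_idx = rng.choice(n_train, size=max_samples, replace=False)

n_unique_bagging = len(np.unique(bagging_idx))
n_unique_pasting = len(np.unique(pasting_idx))
print("distinct instances with bagging:", n_unique_bagging)
print("distinct instances with pasting:", n_unique_pasting)
```

With pasting, the number of distinct instances always equals max_samples; with bagging it can only be equal or smaller.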

In [22]:
# Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

NOTES
- BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., it has a predict_proba() method), which is the case for DecisionTreeClassifier
- n_jobs: number of CPU cores to use for training and predictions (-1 tells Scikit-Learn to use all available cores)
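To illustrate the soft-voting note, the sketch below averages the class probabilities of the individual trees by hand and compares the result with the ensemble's own prediction. It uses a smaller dataset and fewer trees than the cell above so that it runs quickly; the names X_demo, y_demo and bag_demo are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.15, random_state=42)

bag_demo = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=25,
    max_samples=100, bootstrap=True, random_state=42)
bag_demo.fit(X_demo, y_demo)

# Soft voting by hand: average the per-tree class probabilities,
# then pick the class with the highest average probability
mean_proba = np.mean(
    [tree.predict_proba(X_demo) for tree in bag_demo.estimators_], axis=0)
soft_vote = bag_demo.classes_[mean_proba.argmax(axis=1)]

matches = np.array_equal(soft_vote, bag_demo.predict(X_demo))
print("manual soft vote matches bag_demo.predict:", matches)
```

The comparison relies on every tree seeing all features (the default max_features=1.0); with feature sampling, each tree would need to be fed its own feature subset.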

# Out-of-Bag (oob) Evaluations
With bagging, some instances may be sampled several times for a given predictor, while others may not be sampled at all. The instances that a predictor does not sample are its out-of-bag (oob) instances.
Because a predictor never sees its oob instances during training, it can be evaluated on them without a separate validation set. The ensemble itself can be evaluated by averaging the oob evaluations of each predictor.
In Scikit-Learn, passing oob_score=True when creating a BaggingClassifier requests an automatic oob evaluation after training.
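How many instances end up out-of-bag? When m instances are drawn with replacement from a training set of size m, the probability that a given instance is never drawn is (1 - 1/m)^m, which approaches exp(-1) ≈ 0.37 as m grows. The following sketch checks this empirically for a single bootstrap sample; the value of m is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000                                   # training set size (arbitrary)

# One bootstrap sample: m indices drawn with replacement
sample = rng.integers(0, m, size=m)

# Mark every instance that was drawn at least once
in_bag = np.zeros(m, dtype=bool)
in_bag[sample] = True

oob_fraction = 1.0 - in_bag.mean()           # empirical share of oob instances
expected = (1 - 1 / m) ** m                  # theoretical share

print(f"empirical oob fraction: {oob_fraction:.3f}")
print(f"theoretical value:      {expected:.3f}")
```

So each predictor leaves roughly a third of the training set unseen, which is what makes the oob evaluation possible.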

In [23]:
# Pass oob_score=True when creating the BaggingClassifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, 
    bootstrap=True, n_jobs=-1, oob_score=True)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.9885

According to this oob evaluation, the BaggingClassifier is likely to achieve about 98.85% accuracy on the test set. This can be verified.

In [24]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.9895

We get 98.95% accuracy on the test set, which is close to the 98.85% oob estimate.

In [25]:
# oob decision function for each training instance (available through the oob_decision_function_ attribute)
bag_clf.oob_decision_function_

array([[0., 1.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [0., 1.],
       [0., 1.]])
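Each row of oob_decision_function_ holds the class probabilities for one training instance, estimated only by the predictors for which that instance was out-of-bag. As a sketch of how this relates to oob_score_, the accuracy can be recomputed from the decision function by hand. The example uses a small dataset and the illustrative names X_demo, y_demo and bag_demo; it assumes every instance is out-of-bag for at least one predictor, which is practically certain with 100 estimators.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.15, random_state=42)

bag_demo = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    bootstrap=True, oob_score=True, random_state=42)
bag_demo.fit(X_demo, y_demo)

# Most probable class for each training instance, according to the
# predictors that never saw that instance during training
oob_proba = bag_demo.oob_decision_function_
oob_pred = bag_demo.classes_[oob_proba.argmax(axis=1)]

manual_oob_score = accuracy_score(y_demo, oob_pred)
print(manual_oob_score, bag_demo.oob_score_)
```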

# Random Patches and Random Subspaces

BaggingClassifier also supports sampling features, controlled by the "max_features" and "bootstrap_features" hyperparameters. They work like max_samples and bootstrap, but for features instead of instances.

Random Patches
- Sampling both training instances and features

Random Subspaces
- Keeping all training instances (by setting bootstrap=False and max_samples=1.0) but sampling features (bootstrap_features=True and/or max_features set to a value smaller than 1.0)

Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
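The two configurations can be sketched as follows. Since the moons dataset has only two features, this example assumes a synthetic 10-feature dataset from make_classification; the specific values (max_samples=0.7, max_features=0.5, 20 estimators) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42)

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True, random_state=42)
patches_clf.fit(X_demo, y_demo)

# Random Subspaces: keep all training instances, sample only features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False, random_state=42)
subspaces_clf.fit(X_demo, y_demo)

# Feature indices used by the first predictor of each ensemble
print(patches_clf.estimators_features_[0])
print(subspaces_clf.estimators_features_[0])
```

The estimators_features_ attribute records which feature subset each predictor was trained on, which is a convenient way to check that the sampling behaves as intended.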

# Random Forests

- Random Forest: an ensemble of Decision Trees, generally trained via the bagging method, typically with max_samples set to the size of the training set

- Use RandomForestClassifier class (alternatively, can be done by building a BaggingClassifier and passing it a DecisionTreeClassifier)

- RandomForestRegressor: for regression

In [26]:
# Uses all available CPU cores to train a Random Forest classifier with 500 trees (each limited to a maximum of 16 leaf nodes)

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
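For comparison, here is a sketch of the BaggingClassifier alternative mentioned above. It assumes that passing max_features="sqrt" to the DecisionTreeClassifier mimics the per-split feature sampling of a Random Forest; the two ensembles are roughly equivalent, not bit-for-bit identical. A smaller dataset and 100 trees keep the run short, and the *_demo names are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=1000, noise=0.15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

rf_demo = RandomForestClassifier(
    n_estimators=100, max_leaf_nodes=16, random_state=42)

# Bagging ensemble of trees that consider a random feature subset at each split
bag_demo = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=100, random_state=42)

rf_acc = accuracy_score(y_te, rf_demo.fit(X_tr, y_tr).predict(X_te))
bag_acc = accuracy_score(y_te, bag_demo.fit(X_tr, y_tr).predict(X_te))
print("RandomForestClassifier:", rf_acc)
print("BaggingClassifier:     ", bag_acc)
```

RandomForestClassifier is usually the more convenient choice, since it is optimized for Decision Trees and exposes the tree hyperparameters directly.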

# Feature Importance

Feature Importance
- Measured by how much the tree nodes that use a feature reduce impurity on average, across all trees in the forest
- This is a weighted average: each node's weight is equal to the number of training samples associated with it
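The definition above can be checked on a single Decision Tree by walking its nodes and summing the weighted impurity decrease per feature. This is a sketch based on the public tree_ attributes (children_left, children_right, feature, impurity, weighted_n_node_samples); the final normalization so that the importances sum to 1 is included because Scikit-Learn reports them that way.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(iris["data"], iris["target"])
tree = tree_clf.tree_

w = tree.weighted_n_node_samples            # number of samples reaching each node
importances = np.zeros(iris["data"].shape[1])

for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:                          # leaf node: no split, no contribution
        continue
    # Impurity reduction achieved by this split, weighted by sample counts
    decrease = (w[node] * tree.impurity[node]
                - w[left] * tree.impurity[left]
                - w[right] * tree.impurity[right])
    importances[tree.feature[node]] += decrease

importances /= importances.sum()            # normalize so the scores sum to 1

for name, score in zip(iris["feature_names"], importances):
    print(name, round(score, 4))
```

A Random Forest then averages these per-tree scores over all of its trees.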

In [28]:
# Feature Importance, using iris dataset

from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09825657448206183
sepal width (cm) 0.024746575681776514
petal length (cm) 0.42349326327318154
petal width (cm) 0.4535035865629802


The most important features are petal width (45.4%) and petal length (42.3%), while sepal length (9.8%) and sepal width (2.5%) are much less important.

# Boosting

Boosting
- any ensemble method that can combine several weak learners into a strong learner
- trains predictors sequentially, each trying to correct its predecessor
- one of the most popular methods: AdaBoost

AdaBoost
- pays more attention to the training instances that its predecessor underfitted