### grp

## Hands-On Machine Learning with Scikit-Learn & TensorFlow

## CHAPTER 7: Ensemble Learning and Random Forests

## Ensemble Learning:
https://en.wikipedia.org/wiki/Ensemble_learning
-  aggregate the predictions of a group of predictors (the group is called an ensemble)
-  _ensemble methods work best when the predictors are as independent from one another as possible_
-  _one way to get a diverse set of classifiers is to use different training algorithms_

## Ensemble Methods (Algorithms) Techniques:
-  _Option 1 - use different training algorithms_
-  _Option 2 - use same training algorithms for every predictor, but train them on different random subsets of the training set_

-  once all predictors are trained, the ensemble makes a prediction for a new instance by aggregating the predictions of all predictors:
    -  most frequent prediction (classification)
    -  average (regression)
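
The two aggregation rules above can be sketched with NumPy. This is a minimal illustration, not code from the book: the prediction arrays are made-up values chosen only to show the mechanics.

```python
import numpy as np

# Hypothetical class predictions from 3 classifiers (rows) for 4 instances (columns).
class_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])

# Classification: most frequent prediction per instance (majority vote).
majority = np.array([np.bincount(col).argmax() for col in class_preds.T])
print(majority)  # [0 1 0 0]

# Hypothetical numeric predictions from 3 regressors (rows) for 2 instances (columns).
reg_preds = np.array([
    [2.0, 10.0],
    [3.0, 12.0],
    [4.0, 14.0],
])

# Regression: average prediction per instance.
average = reg_preds.mean(axis=0)
print(average)  # [ 3. 12.]
```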

## Ensemble Method Types:
-  bagging
-  pasting
-  boosting
-  stacking

### Bagging => https://en.wikipedia.org/wiki/Bootstrap_aggregating
-  sampling performed WITH replacement:
    -  some instances may be sampled several times for any predictor
    -  some instances will not be sampled at all for a given predictor, aka "out of bag" (OOB) instances:
        -  thus can evaluate the ensemble by averaging out the OOB evaluations for each predictor
        -  works like a held-out validation set, so no separate validation set or cross-validation is needed

### Pasting
-  sampling performed WITHOUT replacement

#### Bagging vs Pasting: 
https://en.wikipedia.org/wiki/Sampling_(statistics)#Replacement_of_selected_units

Sampling schemes may be without replacement ('WOR'—no element can be selected more than once in the same sample) or with replacement ('WR'—an element may appear multiple times in the one sample). For example, if we catch fish, measure them, and immediately return them to the water before continuing with the sample, this is a WR design, because we might end up catching and measuring the same fish more than once. However, if we do not return the fish to the water, this becomes a WOR design. If we tag and release the fish we caught, we can see whether we have caught a particular fish before.
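
The difference between the two sampling schemes, and the out-of-bag instances that bagging leaves behind, can be sketched with NumPy. This is an illustrative sketch: the sample sizes and the seed are arbitrary assumptions, not values from the book.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances = 1000
indices = np.arange(n_instances)

# Bagging: sample WITH replacement, so the same index can be drawn several times.
bag_sample = rng.choice(indices, size=n_instances, replace=True)

# Pasting: sample WITHOUT replacement, so every drawn index is unique.
paste_sample = rng.choice(indices, size=n_instances // 2, replace=False)

# Out-of-bag instances are the ones never drawn into the bagging sample.
oob_mask = ~np.isin(indices, bag_sample)

print("unique instances in bag:", np.unique(bag_sample).size)
print("out-of-bag fraction:", oob_mask.mean())  # roughly 0.37 when the sample is as large as the training set
print("pasting has duplicates:", np.unique(paste_sample).size < paste_sample.size)
```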

### Boosting => https://en.wikipedia.org/wiki/Boosting_(machine_learning)
- combine several weak learners into a strong learner:
    -  train predictors sequentially, each one trying to correct its predecessor
    -  types:
        -  AdaBoost => https://en.wikipedia.org/wiki/AdaBoost
        -  Gradient Boosting => https://en.wikipedia.org/wiki/Gradient_boosting

## AdaBoost Classifier:
1.  train a first base classifier (ex: DT) and make predictions on training set
2.  relative weights of misclassified training instances are increased aka "***boosted***" at every iteration
3.  train a second base classifier using updated weights and make predictions on training set
4.  predictions are weighted by the predictor weights and the predicted class is the one that receives the majority of weighted votes
5.  continue the process until the desired number of predictors is reached or a perfect predictor is found
#### the more accurate the predictor is, the higher its weight will be (positive value)
#### a predictor that is wrong more often than not (worse than random guessing) gets a negative weight
#### a predictor that is just guessing randomly gets a weight close to 0 (zero value)
#### _AdaBoost cannot be parallelized because each predictor can only be trained after the previous one has been trained and evaluated, so it does not scale as well as bagging or pasting_
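
The weight rules above follow from the usual AdaBoost predictor-weight formula, weight = learning_rate * log((1 - r) / r), where r is the predictor's weighted error rate. A small sketch, assuming that formula; the function name `predictor_weight` is hypothetical and used only for illustration:

```python
import math

def predictor_weight(error_rate, learning_rate=1.0):
    """AdaBoost-style predictor weight: learning_rate * log((1 - r) / r)."""
    return learning_rate * math.log((1.0 - error_rate) / error_rate)

# accurate predictor, random guesser, and worse-than-random predictor
for r in (0.1, 0.5, 0.9):
    print(r, round(predictor_weight(r), 3))
```

An error rate of 0.1 gives a positive weight, 0.5 gives zero, and 0.9 gives a negative weight of the same magnitude as the first.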

## Gradient Boosting:
1.  train a first base predictor (ex: DT) and make predictions on training set
2.  train a second base predictor (ex: DT) on the residual errors made by the first predictor
3.  train a third base predictor (ex: DT) on the residual errors made by the second predictor
4.  the result is an ensemble containing 3 trees
5.  make predictions on new instances by adding up the predictions of all the trees
#### the goal is to fit each new predictor to the _residual error_ made by the previous predictor

### Stacking
1.  split training set in 2 subsets
2.  use 1st subset to train the predictors in first layer
3.  use 2nd subset (held out) to create the training set for the second layer: the inputs are the predictions that the first-layer predictors make on it
4.  the blender (second-layer model) takes the first-layer predictions as inputs and makes the final prediction
#### goal is to train a model to perform the final aggregation, instead of using a fixed rule such as voting: it takes all predictions as inputs and makes the final prediction
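
The four steps above can be sketched with scikit-learn. This is a hand-rolled illustration under stated assumptions, not the book's implementation: the choice of first-layer models, the logistic-regression blender, the 50/50 split, and the helper `first_layer_outputs` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 1: split the training set into two subsets.
X_layer1, X_blend, y_layer1, y_blend = train_test_split(
    X_train, y_train, test_size=0.5, random_state=42)

# Step 2: train the first-layer predictors on the first subset.
first_layer = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    SVC(gamma="auto", probability=True, random_state=42),
]
for clf in first_layer:
    clf.fit(X_layer1, y_layer1)

def first_layer_outputs(X):
    """Stack each predictor's positive-class probability as one column."""
    return np.column_stack([clf.predict_proba(X)[:, 1] for clf in first_layer])

# Step 3: predictions on the held-out subset become the blender's training set.
blender = LogisticRegression(solver="liblinear", random_state=42)
blender.fit(first_layer_outputs(X_blend), y_blend)

# Step 4: the blender makes the final prediction from the first-layer outputs.
stack_accuracy = blender.score(first_layer_outputs(X_test), y_test)
print("stacking test accuracy:", stack_accuracy)
```

Training the blender on the held-out subset matters: the first-layer predictors never saw those instances, so their predictions on them are not overly optimistic.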

***with bagging and pasting, predictors can be trained in parallel and predictions can be made in parallel, so both scale very well***

## Voting Classifiers:
1.  train many different classifiers (SVM, RF, KNN, LR, etc.)
2.  aggregate the predictions of each classifier
3.  predict the class that gets the most votes

#### _hard voting_ classifier => majority-vote classifier

## Random Forest:
https://en.wikipedia.org/wiki/Random_forest
1. train a group of DT classifiers, each on a different random subset of the training set, via the bagging (or pasting) method
2. gather predictions of all individual trees
3. predict class that gets most votes
#### when splitting a node, RF searches for the best feature among a random subset of features; the resulting per-feature importance scores are available via sklearn's `feature_importances_` attribute
#### _sklearn's RandomForestClassifier is roughly equivalent to sklearn's BaggingClassifier wrapping DecisionTreeClassifier instances that randomize the features considered at each split_
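
The rough equivalence can be checked empirically. The sketch below assumes that "roughly equivalent" means bagging over decision trees that consider a random subset of features at each split; `max_features="sqrt"`, the tree count, and `max_leaf_nodes=16` are illustrative choices, and the two models are not expected to match exactly.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rnd_clf = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=42)
rnd_clf.fit(X_train, y_train)

# Bagging over trees that consider a random feature subset at each split.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16,
                           random_state=42),
    n_estimators=200, max_samples=1.0, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)

rf_acc = rnd_clf.score(X_test, y_test)
bag_acc = bag_clf.score(X_test, y_test)
# Fraction of test instances on which the two ensembles agree.
agreement = (rnd_clf.predict(X_test) == bag_clf.predict(X_test)).mean()
print(rf_acc, bag_acc, agreement)
```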

## _Exercises_

In [1]:
import sklearn
print(sklearn.__version__)

0.20.0


### voting classifier

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
X[:10]

array([[ 0.83103915, -0.25874875],
       [ 1.18506381,  0.92038714],
       [ 1.16402213, -0.45552558],
       [-0.0236556 ,  1.08628844],
       [ 0.48050273,  1.50942444],
       [ 1.31164912, -0.55117606],
       [ 1.16542367, -0.15862989],
       [ 0.1567364 ,  1.31817168],
       [ 0.45330102,  0.49607493],
       [ 1.65139719, -0.45980435]])

In [4]:
y[:10]

array([1, 0, 1, 0, 0, 1, 1, 0, 0, 1])

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
svm_clf = SVC(gamma="auto", random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

In [6]:
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)), ('rf', Rando...f',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=None, voting='hard', weights=None)

In [7]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.896


**quote**: "If all classifiers are able to estimate class probabilities (i.e., **they have a predict_proba() method**),
then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the
individual classifiers. This is called **soft voting**. It often achieves higher performance than hard voting
because it gives more weight to highly confident votes. All you need to do is replace voting="hard"
with voting="soft" and ensure that all classifiers can estimate class probabilities. This is not the case
of the SVC class by default, so you need to set its probability hyperparameter to True (this will make
the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add
a predict_proba() method)." - _hands on ml w sklearn and tf (aurelien geron)_
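
A tiny numeric sketch shows how the two voting schemes can disagree. The probabilities below are made-up values for a single instance, chosen so that one highly confident classifier outweighs two lukewarm ones under soft voting.

```python
import numpy as np

# Hypothetical P(class 1) from three classifiers for a single instance.
proba_class1 = np.array([0.45, 0.40, 0.95])

# Hard voting: each classifier casts one vote for its most likely class.
votes = (proba_class1 >= 0.5).astype(int)  # two votes for class 0, one for class 1
hard_prediction = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = proba_class1.mean()  # about 0.6
soft_prediction = int(mean_proba >= 0.5)

print(hard_prediction, soft_prediction)  # 0 1
```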

In [8]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=10, random_state=42)
svm_clf = SVC(gamma="auto", probability=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)), ('rf', Rando...bf',
  max_iter=-1, probability=True, random_state=42, shrinking=True,
  tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=None, voting='soft', weights=None)

In [9]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.912


### bagging

In [10]:
# example:
    # ensemble of 500 DT classifiers
    # each trained on 100 training instances randomly sampled from the training set w/ replacement (bagging)
    # n_jobs => number of CPU cores sklearn uses for training and predictions [-1 = use all available cores]
    # max_samples => each predictor is trained on a random subset of input samples
    # max_features => each predictor is trained on a random subset of input features
    # random patches and random subspaces => see book [page 190/191]

In [11]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [12]:
# the bagging classifier automatically performs SOFT voting instead of HARD voting if ...
    # base classifier can estimate class probabilities (i.e. if it has a predict_proba() method)

In [13]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred)) # bagging test set accuracy

0.904


In [14]:
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree)) # regular DT test set accuracy

0.856


### oob evaluation

In [15]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True, random_state=40)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_ # saying the test set is likely to be close to 90.1% accuracy

0.9013333333333333

In [16]:
bag_clf.oob_decision_function_[:10] # probabilities for each instance [ex: 32% negative class; 68% positive class]

array([[0.31746032, 0.68253968],
       [0.34117647, 0.65882353],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.08379888, 0.91620112],
       [0.31693989, 0.68306011],
       [0.02923977, 0.97076023],
       [0.97687861, 0.02312139],
       [0.97765363, 0.02234637]])

In [17]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred) # test set accuracy is 91.2%, close to the OOB estimate

0.912

### random forest

In [18]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

### RF feature importance

In [19]:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.11249225099876374
sepal width (cm) 0.023119288282510326
petal length (cm) 0.44103046436395765
petal width (cm) 0.4233579963547681


In [20]:
# a feature's importance is measured by ...
    # how much the tree nodes that use that feature reduce impurity, on average across all trees in the forest
    # this is a weighted average ...
    # where each node's weight is equal to the number of training samples associated with it
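
One way to see the "average across all trees" part is to recompute the forest-level scores from the individual trees. This is a sketch under the assumption that every tree has at least one split (true for iris); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris["data"], iris["target"])

# Each tree reports its own normalized impurity-based importances ...
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# ... and averaging them over the trees gives the forest-level scores.
manual = per_tree.mean(axis=0)
manual /= manual.sum()

print(np.allclose(manual, forest.feature_importances_))
print(forest.feature_importances_.sum())  # scores are normalized to sum to (about) 1
```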

In [21]:
rnd_clf.feature_importances_

array([0.11249225, 0.02311929, 0.44103046, 0.423358  ])

### adaboost

In [22]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=0.5, n_estimators=200, random_state=42)

### gradient boosting

In [23]:
import numpy as np
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

In [24]:
X[:10]

array([[-0.12545988],
       [ 0.45071431],
       [ 0.23199394],
       [ 0.09865848],
       [-0.34398136],
       [-0.34400548],
       [-0.44191639],
       [ 0.36617615],
       [ 0.10111501],
       [ 0.20807258]])

In [25]:
y[:10]

array([ 0.0515729 ,  0.59447979,  0.16605161, -0.07017796,  0.34398593,
        0.37287494,  0.65976498,  0.3763414 , -0.00975194,  0.10479474])

In [26]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')

In [27]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')

In [28]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')

In [29]:
X_new = np.array([[0.8]])

In [30]:
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

In [31]:
y_pred

array([0.75026781])

### sklearn gradient boosting class

In [32]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)
gbrt.fit(X, y)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=1.0, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=3, n_iter_no_change=None, presort='auto',
             random_state=42, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)

In [33]:
# learning rate => scales the contribution of each tree
# "shrinkage" regularization technique:
    # low value learning rate means algorithm will need more trees in the ensemble to fit the training set ...
    # but predictions will typically generalize better

### early stopping [find optimal number of trees] technique

In [34]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=49)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=120, n_iter_no_change=None, presort='auto',
             random_state=42, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)

In [35]:
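errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
# note: argmin returns a 0-based index and errors[i] belongs to the ensemble with i + 1 trees,
# so the lowest validation error is actually reached with np.argmin(errors) + 1 trees
bst_n_estimators = np.argmin(errors)

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators, random_state=42)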
gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=55, n_iter_no_change=None, presort='auto',
             random_state=42, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)

In [36]:
print(gbrt_best.n_estimators)

55


In [37]:
# visualization on page 200 / 201

### early stopping with warm_start (incremental training)

In [38]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping

In [39]:
# code above stops training when the validation error does not improve for five consecutive iterations 
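
scikit-learn can also apply this kind of patience rule internally through the `n_iter_no_change` and `validation_fraction` parameters (both visible in the estimator repr printed earlier). A minimal sketch, assuming the same synthetic data; the 25% validation share is an arbitrary choice, and the resulting tree count is not expected to match the manual loop because the internal validation split differs.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

gbrt_auto = GradientBoostingRegressor(
    max_depth=2, n_estimators=120,
    validation_fraction=0.25,  # share of the training data held out internally
    n_iter_no_change=5,        # stop after 5 rounds without improvement
    random_state=42)
gbrt_auto.fit(X, y)

# n_estimators_ is the number of trees actually fitted.
print(gbrt_auto.n_estimators_)
```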

In [40]:
print(gbrt.n_estimators)

61


In [41]:
print("Minimum validation MSE:", min_val_error)

Minimum validation MSE: 0.002712853325235463


### xgboost

In [42]:
try:
    import xgboost
except ImportError as ex:
    print("Error: the xgboost library is not installed.")
    xgboost = None

In [43]:
if xgboost is not None:  # not shown in the book
    xgb_reg = xgboost.XGBRegressor(random_state=42)
    xgb_reg.fit(X_train, y_train)
    y_pred = xgb_reg.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    print("Validation MSE:", val_error)

Validation MSE: 0.0028512559726563943


In [44]:
if xgboost is not None:  # not shown in the book
    xgb_reg.fit(X_train, y_train,
                eval_set=[(X_val, y_val)], early_stopping_rounds=2)
    y_pred = xgb_reg.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    print("Validation MSE:", val_error)

[0]	validation_0-rmse:0.286719
Will train until validation_0-rmse hasn't improved in 2 rounds.
[1]	validation_0-rmse:0.258221
[2]	validation_0-rmse:0.232634
[3]	validation_0-rmse:0.210526
[4]	validation_0-rmse:0.190232
[5]	validation_0-rmse:0.172196
[6]	validation_0-rmse:0.156394
[7]	validation_0-rmse:0.142241
[8]	validation_0-rmse:0.129789
[9]	validation_0-rmse:0.118752
[10]	validation_0-rmse:0.108388
[11]	validation_0-rmse:0.100155
[12]	validation_0-rmse:0.09208
[13]	validation_0-rmse:0.084791
[14]	validation_0-rmse:0.078699
[15]	validation_0-rmse:0.073248
[16]	validation_0-rmse:0.069391
[17]	validation_0-rmse:0.066277
[18]	validation_0-rmse:0.063458
[19]	validation_0-rmse:0.060326
[20]	validation_0-rmse:0.0578
[21]	validation_0-rmse:0.055643
[22]	validation_0-rmse:0.053943
[23]	validation_0-rmse:0.053138
[24]	validation_0-rmse:0.052415
[25]	validation_0-rmse:0.051821
[26]	validation_0-rmse:0.051226
[27]	validation_0-rmse:0.051135
[28]	validation_0-rmse:0.05091
[29]	validation_0-rmse

In [45]:
%timeit xgboost.XGBRegressor().fit(X_train, y_train) if xgboost is not None else None

4.71 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [46]:
%timeit GradientBoostingRegressor().fit(X_train, y_train)

13.9 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### additional exercises:

https://github.com/ageron/handson-ml/blob/master/07_ensemble_learning_and_random_forests.ipynb

1. If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?
2. What is the difference between hard and soft voting classifiers?
3. Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, random forests, or stacking ensembles?
4. What is the benefit of out-of-bag evaluation?
5. What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?
6. If your AdaBoost ensemble underfits the training data, what hyperparameters should you tweak and how?
7. If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?
8. Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 40,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM. Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?
9. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class. Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the classifiers they form a stacking ensemble! Now let’s evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier?

### grp