## Ensemble Learning and Random Forests
-----

Aggregating the predictions of a group of predictors (an ensemble) often gives better predictions than the best individual predictor.

## Voting Classifiers
-----

- Hard voting classifier: majority-vote classification. Given several trained classifiers, aggregate their predictions and predict the class that gets the most votes.
- An ensemble works best when the predictors are as independent from each other as possible.
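
The effect of independence can be illustrated with a small simulation. This is only a sketch under idealized assumptions: the ensemble size, the number of instances and the 51% accuracy are made-up values, and the classifiers are assumed to be perfectly independent, which real classifiers trained on the same data are not.

```python
import numpy as np

rng = np.random.default_rng(42)

n_classifiers = 1000   # hypothetical ensemble size
n_instances = 2000     # hypothetical number of test instances
p_correct = 0.51       # each weak classifier is right 51% of the time

# correct[i, j] is True when classifier j is right on instance i (independent draws)
correct = rng.random((n_instances, n_classifiers)) < p_correct

# The hard vote is right when more than half of the classifiers are right
majority_correct = correct.sum(axis=1) > n_classifiers / 2

individual_accuracy = correct.mean()
ensemble_accuracy = majority_correct.mean()
print(f"individual: {individual_accuracy:.3f}, majority vote: {ensemble_accuracy:.3f}")
```

The majority vote is noticeably more accurate than any single classifier here, but the gain shrinks when the classifiers make correlated errors.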

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [21]:
log_clf = LogisticRegression(random_state=42)
rnd_clf = RandomForestClassifier(random_state=42)
svm_clf = SVC(random_state=42)

In [22]:
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)

In [23]:
voting_clf.fit(X_train, y_train)



VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=42, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)), ('rf', RandomFore...rbf', max_iter=-1, probability=False, random_state=42,
  shrinking=True, tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=None, voting='hard', weights=None)

In [24]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.896




In [25]:
# soft voting: all classifiers need a predict_proba() method
log_clf_soft = LogisticRegression(random_state=42)
rnd_clf_soft = RandomForestClassifier(random_state=42)
svm_clf_soft = SVC(random_state=42, probability=True)

In [26]:
voting_clf_soft = VotingClassifier(
    estimators=[('lr', log_clf_soft), ('rf', rnd_clf_soft), ('svc', svm_clf_soft)],
    voting='soft'
)

In [27]:
for clf in (log_clf_soft, rnd_clf_soft, svm_clf_soft, voting_clf_soft):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.912
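
Hard and soft voting can disagree on the same instance. The sketch below uses hand-made probability estimates (hypothetical numbers, not taken from the classifiers above) where two classifiers lean slightly towards class 0 and one is very confident about class 1.

```python
import numpy as np

# Hypothetical class-probability estimates from three classifiers for one instance
# (columns: class 0, class 1)
probas = np.array([
    [0.55, 0.45],
    [0.55, 0.45],
    [0.10, 0.90],
])

# Hard voting: each classifier votes for its most likely class, majority wins
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print(hard_prediction, soft_prediction)
```

Soft voting gives more weight to highly confident votes, which is why it picks class 1 here while hard voting picks class 0.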




## Bagging and Pasting
-----

Use the same algorithm for every predictor, but train each one on a different random subset of the training set.
- when sampling is performed with replacement -> bagging
- when sampling is performed without replacement -> pasting

Both allow training instances to be sampled several times across multiple predictors, but only bagging allows a training instance to be sampled several times for the same predictor.

- Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. The ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.
- Predictors can be trained in parallel, and predictions can be made in parallel.
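
The difference between the two sampling schemes can be sketched with NumPy index sampling. The sizes below are illustrative assumptions (375 training instances, 100 drawn per predictor), and this is not how `BaggingClassifier` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 375      # assumed training set size
max_samples = 100  # instances drawn for each predictor

# Bagging: sample WITH replacement, so an index can repeat within one predictor
bagging_idx = rng.choice(n_train, size=max_samples, replace=True)

# Pasting: sample WITHOUT replacement, so every index is unique within one predictor
pasting_idx = rng.choice(n_train, size=max_samples, replace=False)

print("unique in bagging sample:", len(np.unique(bagging_idx)))
print("unique in pasting sample:", len(np.unique(pasting_idx)))
```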

#### Bagging and Pasting in Scikit-Learn

In [39]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=4
)
bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=100, n_estimators=500, n_jobs=4, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [40]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.912

Bagging is generally preferred over pasting. Compared with pasting, it gives:
- slightly higher bias
- less correlated predictors
- lower ensemble variance

#### Out-of-Bag Evaluation


In [43]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=4, oob_score=True
)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.9253333333333333

In [44]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.912
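
Out-of-bag evaluation works because a bootstrap sample leaves some training instances out, so each predictor can be scored on instances it never saw. The sketch below estimates the in-bag fraction, assuming the sample is as large as the training set (the cell above uses `max_samples=100`, which leaves even more instances out of the bag).

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 10_000  # hypothetical training set size

# Draw a bootstrap sample of the same size as the training set
sampled = rng.choice(n_train, size=n_train, replace=True)

# Mark which instances were drawn at least once
in_bag = np.zeros(n_train, dtype=bool)
in_bag[sampled] = True

in_bag_fraction = in_bag.mean()
oob_fraction = 1 - in_bag_fraction
print(f"in-bag: {in_bag_fraction:.3f}, out-of-bag: {oob_fraction:.3f}")
```

The in-bag fraction approaches 1 - 1/e (about 63%) as the training set grows, leaving roughly 37% of the instances out of the bag for each predictor.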

#### Random Patches and Random Subspaces
`BaggingClassifier` also supports sampling the features, controlled by `max_features` and `bootstrap_features`. Each predictor is then trained on a random subset of the input features.
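
A minimal sketch of the random subspaces setup, which keeps every training instance and samples only the features. The synthetic 20-feature dataset and the hyperparameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with enough features for feature sampling to matter
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50,
    bootstrap=False, max_samples=1.0,           # keep every training instance
    bootstrap_features=True, max_features=0.5,  # draw half of the features, with replacement
    random_state=42,
)
subspace_clf.fit(X_demo, y_demo)

# Feature indices drawn for the first predictor
n_features_first = len(subspace_clf.estimators_features_[0])
print(n_features_first)
```

Setting `bootstrap=True` or `max_samples` below 1.0 as well would sample both instances and features, which is the random patches variant.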

## Random Forests
-----

A random forest is an ensemble of decision trees, generally trained with the bagging method.

In [45]:
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=4)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred_rf)

0.92

#### Extra Trees

In [47]:
from sklearn.ensemble import ExtraTreesClassifier
et_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=4)
et_clf.fit(X_train, y_train)
y_pred_et = et_clf.predict(X_test)
accuracy_score(y_test, y_pred_et)

0.912

#### Feature Importance

Use the `feature_importances_` attribute of a trained forest to see how much each feature contributes.

Random forests are useful for getting a quick understanding of which features actually matter, for example when doing feature selection.
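
A short sketch of reading feature importances, using the iris dataset as an assumed example (it is not used elsewhere in these notes). The attribute is spelled `feature_importances_` in scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris["data"], iris["target"])

# One score per feature; the scores are normalized to sum to 1
for name, score in zip(iris["feature_names"], forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```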

## Boosting
-----

Boosting refers to any ensemble method that can combine several weak learners into a strong learner. The general idea is to train predictors sequentially, each new predictor trying to correct its predecessor.

#### AdaBoost (adaptive boosting)
Each new predictor pays more attention to the training instances that its predecessor underfitted, so the ensemble focuses more and more on the hard cases.
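
The reweighting idea can be sketched by hand. This is a simplified discrete AdaBoost for binary labels in {-1, +1}, written for illustration only; it is not necessarily what `AdaBoostClassifier` does internally, and the number of rounds and the learning rate are assumed values.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)
y_signed = 2 * y_demo - 1  # map labels {0, 1} to {-1, +1}

n_rounds = 50        # assumed number of boosting rounds
learning_rate = 0.5  # assumed shrinkage factor
weights = np.full(len(X_demo), 1 / len(X_demo))  # start with uniform instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_demo, y_signed, sample_weight=weights)
    wrong = stump.predict(X_demo) != y_signed

    # Weighted error rate of this stump, clipped to avoid division by zero
    error = np.clip(weights[wrong].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    alpha = learning_rate * np.log((1 - error) / error)  # predictor weight

    # Boost the weights of misclassified instances, then renormalize
    weights[wrong] *= np.exp(alpha)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def predict(X_new):
    # Weighted vote of all stumps
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)

train_accuracy = np.mean(predict(X_demo) == y_signed)
print(train_accuracy)
```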

In [58]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=500, algorithm="SAMME.R", learning_rate=.5
)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)
accuracy_score(y_test, y_pred_ada)

0.88

#### Gradient Boosting

Gradient boosting fits each new predictor to the residual errors made by the previous predictor.

In [60]:
import numpy as np
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

In [61]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')

In [62]:

y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')

In [63]:

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')

In [64]:
X_new = np.array([[0.8]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred

array([0.75026781])

In [65]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)
gbrt.fit(X, y)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=1.0, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=3, n_iter_no_change=None, presort='auto',
             random_state=42, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)

In [66]:
gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1, random_state=42)
gbrt_slow.fit(X, y)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=200, n_iter_no_change=None, presort='auto',
             random_state=42, subsample=1.0, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)

In [68]:
# early stopping
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=49)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
# staged_predict yields predictions for 1, 2, ... trees, so index i means i + 1 trees
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators, random_state=42)
gbrt_best.fit(X_train, y_train)


gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping

In [69]:
print(gbrt.n_estimators)

61
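
The estimator output above lists `n_iter_no_change`, `validation_fraction` and `tol`, which suggests the early stopping loop can also be delegated to the estimator itself. A minimal sketch, assuming a scikit-learn version that supports these parameters; the values chosen here (25% validation split, patience of 5) are assumptions, and the result will not necessarily match the manual loop.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Same kind of noisy quadratic data as above
rng = np.random.RandomState(42)
X_demo = rng.rand(100, 1) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.randn(100)

gbrt_auto = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=500,          # upper bound on the number of trees
    validation_fraction=0.25,  # share of the training data held out internally
    n_iter_no_change=5,        # stop after 5 rounds without improvement
    random_state=42,
)
gbrt_auto.fit(X_demo, y_demo)

# Number of trees actually fitted before stopping
n_used = gbrt_auto.n_estimators_
print(n_used)
```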


## Stacking
-----

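Train a model (a blender) to do the aggregation instead of using a trivial function such as hard voting. Older scikit-learn releases did not include stacking, so a library such as [brew](https://github.com/viisar/brew) was needed; since version 0.22 scikit-learn provides `StackingClassifier` and `StackingRegressor`.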
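
The idea can be sketched by hand with out-of-fold predictions, so that the blender is trained on predictions the base models made for instances they did not see. The choice of base estimators, the logistic regression blender and the 5-fold split are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

base_estimators = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    SVC(probability=True, random_state=42),
]

# Out-of-fold class-1 probabilities become the blender's input features
train_meta = np.column_stack([
    cross_val_predict(est, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for est in base_estimators
])
blender = LogisticRegression(random_state=42)
blender.fit(train_meta, y_tr)

# Refit the base estimators on the full training set before predicting
for est in base_estimators:
    est.fit(X_tr, y_tr)
test_meta = np.column_stack([est.predict_proba(X_te)[:, 1] for est in base_estimators])

stacking_accuracy = blender.score(test_meta, y_te)
print(stacking_accuracy)
```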

## Exercises
-----
1. If the models are sufficiently different (SVM, SGD, tree, ...), a voting ensemble should still give a small improvement.
2. Hard voting is a majority vote: it counts the votes of each classifier. Soft voting computes the average estimated probability of each class and selects the class with the highest average probability.
3. In bagging, pasting and random forests the individual classifiers are independent, so they can be trained in parallel and can also make predictions in parallel. In boosting each new classifier requires knowledge of its predecessor, so training is sequential and cannot be distributed across servers. In a stacking ensemble the predictors within a layer can be trained in parallel, but the layers must be trained one after another.
4. With out-of-bag evaluation, the trained bagging ensemble can be evaluated on the part of the training set that was held out for each predictor.
5. Extra-Trees use a random threshold for each feature instead of searching for the best possible threshold, so training is faster, but making a prediction is still a binary tree lookup.
6. Increase the number of estimators or reduce the regularization hyperparameters.
7. Try decreasing the learning rate.
8. 

In [4]:
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import ExtraTreesClassifier
import numpy as np
# fetch_mldata was removed from scikit-learn; fetch_openml provides MNIST (labels come back as strings)
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"].astype(np.uint8)
X_train, y_train, X_test, y_test = X[:60000], y[:60000], X[60000:], y[60000:]



In [5]:
np.random.seed(42)
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float32))
X_test_scaled = scaler.transform(X_test.astype(np.float32))

In [None]:

svm_clf = SVC(random_state=42, probability=True)
svm_clf.fit(X_train_scaled, y_train)

y_pred = svm_clf.predict(X_test_scaled)

accuracy_score(y_test, y_pred)

In [None]:
rnf_clf = RandomForestClassifier(random_state=42, n_jobs=4)
rnf_clf.fit(X_train_scaled, y_train)

y_pred = rnf_clf.predict(X_test_scaled)

accuracy_score(y_test, y_pred)

In [None]:
et_cle = ExtraTreesClassifier(random_state=42, n_jobs=4)
et_cle.fit(X_train_scaled, y_train)

y_pred = et_cle.predict(X_test_scaled)

accuracy_score(y_test, y_pred)

In [None]:
voting_clf_soft = VotingClassifier(
    estimators=[('svm', svm_clf), ('rf', rnf_clf), ('et', et_cle)],
    voting='soft'
)
# fit on the scaled features, like the individual classifiers above
voting_clf_soft.fit(X_train_scaled, y_train)