#### Bagging

Bagging is an ensemble method that trains the same algorithm many times on different subsets sampled from the training data. Bagging can be used to create a tree ensemble, and the random forests algorithm adds further ensemble diversity through randomization at the level of each split in the trees that form the ensemble.

**Bagging** is also known as **b**ootstrap **agg**regation. In bagging, the ensemble is formed by models that use the same training algorithm, but these models aren't trained on the entire training set; each one is trained on a different subset of the data. Overall, bagging has the effect of reducing the variance of the individual models in the ensemble.

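A bootstrap sample is drawn from the training set with replacement, so the same instance can appear more than once. Drawing n such samples gives n training subsets, and each one is used to train one of n models that all use the same algorithm.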
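To make sampling with replacement concrete, here is a minimal sketch using only the Python standard library. The toy data and the names `data`, `n_models`, and `bootstrap_samples` are illustrative assumptions, not part of the original notes.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# toy "training set": 10 instance indices (illustrative only)
data = list(range(10))
n_models = 3

# one bootstrap sample per model: same size as the data, drawn with replacement
bootstrap_samples = [random.choices(data, k=len(data)) for _ in range(n_models)]

for i, sample in enumerate(bootstrap_samples):
    # instances that were never drawn for this model
    left_out = sorted(set(data) - set(sample))
    print('model {}: sample={}, left out={}'.format(i, sorted(sample), left_out))
```

Each printed sample typically repeats some indices and misses others, which is exactly what gives every model a slightly different view of the data.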

Each model outputs a prediction; the meta model collects these predictions and outputs a final prediction whose form depends on the type of problem.

Classification uses majority voting for its final prediction (in scikit-learn: BaggingClassifier).

Regression uses the average of the predictions made by the individual models in the ensemble for its final prediction (in scikit-learn: BaggingRegressor).
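The two aggregation rules can be sketched by hand. The prediction values below are made up for three hypothetical models, and this is only an illustration of the idea: scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides them and falls back to voting otherwise.

```python
from collections import Counter
from statistics import mean

# made-up class predictions from 3 models for 4 instances (one row per model)
class_preds = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
# majority vote: most common label per instance (column)
majority = [Counter(votes).most_common(1)[0][0] for votes in zip(*class_preds)]
print(majority)  # [0, 1, 0, 0]

# made-up regression predictions from 3 models for 3 instances
reg_preds = [
    [2.0, 3.5, 1.0],
    [2.5, 3.0, 1.5],
    [1.5, 4.0, 0.5],
]
# average of the individual predictions per instance (column)
average = [mean(values) for values in zip(*reg_preds)]
print(average)  # [2.0, 3.5, 1.0]
```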

In [None]:
# import models and utility functions
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# set seed for reproducibility
SEED = 1

# split the dataset into 70% train, 30% test (X and y are assumed to be loaded already)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

# instantiate a classification tree
dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16, random_state=SEED)
# instantiate a bagging classifier that consists of 300 classification trees
# setting n_jobs to -1 makes it so all CPU cores are used in computation
# the tree is passed positionally: the keyword is estimator in scikit-learn >= 1.2 (base_estimator before)
bc = BaggingClassifier(dt, n_estimators=300, n_jobs=-1)

# fit to the training set
bc.fit(X_train, y_train)
# predict test set labels
y_pred = bc.predict(X_test)

# evaluate and print the test set accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy of Bagging Classifier: {:.3f}'.format(accuracy))

#### Out of Bag Evaluation

When you use bagging, some instances may be sampled several times for one model and other instances may not be sampled at all.
For each model, on average, about 63% of the training instances are sampled.
The remaining 37% that aren't sampled are the *OOB* (out-of-bag) instances. Because a model never sees its OOB instances during training, they can be used to estimate the performance of the ensemble without cross-validation. This is OOB evaluation: train each model on its bootstrap sample and evaluate it on its OOB instances. Conceptually, the OOB score of the bagging ensemble is the average of the OOB scores of the individual models.
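The 63% figure comes from the chance that a given instance is drawn at least once in n draws with replacement, 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) as n grows. A quick sketch to check this, where the training-set size `n = 1000` is an arbitrary assumption:

```python
import math
import random

n = 1000  # hypothetical training-set size

# chance that a given instance is drawn at least once in n draws with replacement
p_in_bag = 1 - (1 - 1 / n) ** n
print('analytical: {:.3f}'.format(p_in_bag))             # 0.632
print('limit 1 - 1/e: {:.3f}'.format(1 - math.exp(-1)))  # 0.632

# empirical check on a single bootstrap sample
random.seed(1)
sample = random.choices(range(n), k=n)
frac_in_bag = len(set(sample)) / n
print('simulated: {:.3f}'.format(frac_in_bag))
```

The simulated fraction varies slightly from sample to sample but stays close to the analytical value.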

In scikit-learn, the OOB score corresponds to accuracy for classifiers and R-squared for regressors.

In [None]:
# OOB evaluation in sklearn
# is a tumor benign or not?
# import models and utility functions
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# set seed for reproducibility
SEED = 1

# split the dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

# instantiate a classification tree
dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16, random_state=SEED)
# instantiate a bagging classifier that consists of 300 classification trees
# use oob_score so you can get the oob accuracy of bc after training
# the tree is passed positionally: the keyword is estimator in scikit-learn >= 1.2 (base_estimator before)
bc = BaggingClassifier(dt, n_estimators=300, oob_score=True, n_jobs=-1)

# fit to the training set
bc.fit(X_train, y_train)
# predict test set labels
y_pred = bc.predict(X_test)

# evaluate and print the test set accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy of Bagging Classifier: {:.3f}'.format(accuracy))

# extract the OOB accuracy from the fitted ensemble
oob_accuracy = bc.oob_score_
# print the OOB accuracy
print('OOB accuracy: {:.3f}'.format(oob_accuracy))

# the two accuracies in this case are pretty close, oob can be used to get a good performance metric on unseen data without cv

#### Random Forests

Random forests is another ensemble learning method.

In bagging, the base estimator can be anything (a decision tree, logistic regression, or even a neural network). Each estimator is trained on a distinct bootstrap sample of the training set, and estimators use all features for training and prediction.

The random forest ensemble method uses a decision tree as the base estimator. Each estimator is trained on a different bootstrap sample, which is the same size as the training set. RF introduces even more randomization than bagging when training each of the trees (base estimators): when a tree is trained, only d features are sampled at each node, without replacement (d is a number smaller than the total number of features). In scikit-learn, d defaults to the square root of the number of features for RandomForestClassifier (100 features would mean that only 10 features are sampled at each node), while RandomForestRegressor considers all features by default, so a smaller d has to be set through max_features.
Each tree's prediction is collected by the random forests meta model, and the final prediction depends on the nature of the problem: majority voting for classification (RandomForestClassifier), and the average of the values predicted by the base estimators for regression (RandomForestRegressor).
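The per-node feature sampling can be illustrated in isolation. This is only a sketch of the idea using the standard library, not scikit-learn's internal splitter; `n_features`, `d`, and the three example nodes are illustrative assumptions.

```python
import math
import random

random.seed(1)
n_features = 100
d = int(math.sqrt(n_features))  # 10 candidate features per node

# at each node, draw d distinct feature indices (sampling without replacement)
for node in range(3):
    candidates = random.sample(range(n_features), k=d)
    print('node {}: candidate features {}'.format(node, sorted(candidates)))
```

In scikit-learn the size of this subset is controlled by the `max_features` parameter, for example `max_features='sqrt'`.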

Random forests usually achieve a lower variance than individual trees.

When a tree-based method is trained, the predictive power of a feature, or its importance, can be assessed. In other words, tree-based methods make it possible to measure the importance of each feature in the prediction. In sklearn, importance is measured by how much the tree nodes use a particular feature to reduce impurity (as a weighted average). The importance of a feature is expressed as a percentage that indicates the weight of that feature in training and prediction.

In [None]:
# random forests regressor in sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

# set seed for reproducibility
SEED = 1

# split the dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# instantiate a random forest regressor with 400 estimators, each leaf will contain at least 12% of the data used in training
rf = RandomForestRegressor(n_estimators=400, min_samples_leaf=0.12, random_state=SEED)

# fit to the training set
rf.fit(X_train, y_train)
# predict test set labels
y_pred = rf.predict(X_test)

# evaluate and print the test set RMSE 
rmse_test = MSE(y_test, y_pred)**(1/2)
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))
# the RMSE shows a smaller error than the one achieved by a single regression tree

In [None]:
# visualize the importance of features as assessed by rf in sklearn
import pandas as pd
import matplotlib.pyplot as plt

# create a series of features importances
importances_rf = pd.Series(rf.feature_importances_, index=X.columns)

# sort the importances
sorted_importances_rf = importances_rf.sort_values()

# make a horizontal bar plot
sorted_importances_rf.plot(kind='barh', color='lightgreen')
plt.show()