# Bagging (Bootstrap Aggregating)

* **Bagging** is an ensemble meta-algorithm that reduces variance in an estimator
* It can be used for both classification and regression tasks
  * When components are **regressors**, the ensemble averages their predictions
  * When components are **classifiers**, the ensemble returns the mode class
* Fits multiple models independently, each on a different variant of the training data
* Training data variants are created using a technique called bootstrap resampling
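
The steps above can be sketched by hand. This is a minimal illustration, not a reference implementation: the dataset sizes, the seed, and the choice of 25 trees are arbitrary assumptions, and the names `models`, `all_preds`, and `bagged_preds` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

# Small synthetic binary classification problem (sizes are assumptions)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_models = 25  # odd, so a majority vote over two classes cannot tie
models = []
for _ in range(n_models):
    # Bootstrap variant: draw row indices with replacement,
    # keeping the same number of observations as the training set
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    models.append(tree.fit(X_train[idx], y_train[idx]))

# Each row holds one model's predictions for the whole test set
all_preds = np.array([m.predict(X_test) for m in models])

# Classifiers: return the mode class for each test instance.
# Labels are 0/1 here, so the mode is 1 when more than half the models vote 1.
bagged_preds = (all_preds.mean(axis=0) > 0.5).astype(int)

print('Bagged accuracy: %.3f' % (bagged_preds == y_test).mean())
```

For regressors, the voting step would be replaced by averaging the rows of `all_preds`.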

##### Bootstrap Resampling
* A method of estimating the uncertainty in a statistic
* Bootstrap resampling assumes that the observations in the sample were drawn independently
* Produces multiple variants by repeatedly resampling, with replacement, from the original sample
* Each variant sample has the same number of observations as the original sample
* A statistic can be computed for each variant
* The spread of these statistics can be used to estimate uncertainty by:
  * Creating a confidence interval
  * Calculating the standard error
  
Bagging is particularly useful for estimators that have high variance and low bias, such as decision trees.  
Ensembles of bagged decision trees, called random forests, are used often and successfully.

In [1]:
import numpy as np

# Sample 10 integers
sample = np.random.randint(low=1, high=100, size=10)
print('Original sample: %s' % sample)
print('Sample mean: %s' % sample.mean())

# Bootstrap re-sample 100 times by re-sampling with replacement from original sample
resamples = [np.random.choice(sample, size=sample.shape) for i in range(100)]
print('Number of bootstrap re-samples: %s' % len(resamples))
print('Example re-sample: %s' % resamples[0])

resample_means = np.array([resample.mean() for resample in resamples])
print('Mean of re-samples\' means: %s' % resample_means.mean())

Original sample: [88 14 95 58 17 80 25 35 43 46]
Sample mean: 50.1
Number of bootstrap re-samples: 100
Example re-sample: [43 46 58 14 43 88 58 88 25 88]
Mean of re-samples' means: 50.87700000000001
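
The two uncertainty estimates listed earlier (standard error and confidence interval) can be read off the bootstrap means. The sketch below is one common way to do this, assuming a percentile interval at the 95% level; the seed and the choice of 1000 re-samples are arbitrary, and the values it prints will differ from the output above because the sample is drawn afresh.

```python
import numpy as np

rng = np.random.default_rng(11)  # arbitrary seed, for repeatability only

# Sample 10 integers, as in the cell above
sample = rng.integers(low=1, high=100, size=10)

# 1000 bootstrap re-samples, each the same size as the original sample,
# reduced to their means
resample_means = np.array([
    rng.choice(sample, size=sample.shape, replace=True).mean()
    for _ in range(1000)
])

# Standard error of the mean: standard deviation of the bootstrap means
standard_error = resample_means.std(ddof=1)

# 95% percentile confidence interval: the middle 95% of the bootstrap means
ci_low, ci_high = np.percentile(resample_means, [2.5, 97.5])

print('Sample mean: %s' % sample.mean())
print('Bootstrap standard error: %.3f' % standard_error)
print('95%% confidence interval: (%.2f, %.2f)' % (ci_low, ci_high))
```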


### Random Forest
* An application of bagging to decision trees
* Number of trees is an important hyperparameter
* Increasing the number of trees generally improves the model's performance, with diminishing returns, and increases computational cost
* At each node, the algorithm selects the best split from a random subset of the features

In [3]:
# Training a Random Forest with Scikit-Learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Create an artificial classification dataset and split it into training and test sets
# Dataset has 1000 instances with 100 features, of which 20 are informative
#     while the rest are redundant combinations of informative features or noise.
X, y = make_classification(
    n_samples=1000, n_features=100, n_informative=20, n_clusters_per_class=2, random_state=11)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

# Train and evaluate a single decision tree
clf = DecisionTreeClassifier(random_state=11)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

          0       0.73      0.66      0.69       127
          1       0.68      0.75      0.71       123

avg / total       0.71      0.70      0.70       250



In [5]:
# Now train and evaluate a random forest with 10 trees
clf = RandomForestClassifier(n_estimators=10, random_state=11)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

          0       0.74      0.83      0.79       127
          1       0.80      0.70      0.75       123

avg / total       0.77      0.77      0.77       250
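
To see how the number of trees affects the result, the same dataset can be fit with several values of `n_estimators`. This is a rough sketch: the tree counts are arbitrary, the `scores` dictionary is a name made up for this example, and `max_features='sqrt'` is set explicitly only to make the random feature subset at each split visible. Scores depend on the seed and the scikit-learn version, so none are quoted here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same artificial dataset and split as in the cells above
X, y = make_classification(
    n_samples=1000, n_features=100, n_informative=20, n_clusters_per_class=2, random_state=11)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

scores = {}
for n_trees in (1, 10, 50, 100):
    # max_features='sqrt': each split considers a random subset of
    # about sqrt(100) = 10 features
    clf = RandomForestClassifier(n_estimators=n_trees, max_features='sqrt', random_state=11)
    clf.fit(X_train, y_train)
    scores[n_trees] = clf.score(X_test, y_test)  # mean accuracy on the test set
    print('%3d trees: test accuracy %.3f' % (n_trees, scores[n_trees]))
```

Training time grows roughly in proportion to the number of trees, so the count is usually raised only until the test score stops improving.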



# Boosting
