# Bagging with Random Forests

## Bagging ensembles
Ensemble models aggregate the results of multiple machine learning models, which often makes them less prone to error than any single model. Ensemble methods are generally classified into two types. The first type combines different machine learning models as chosen by the user. The second type, such as XGBoost and random forests, combines many versions of the same model. The overall accuracy of ensemble methods tends to be higher than that of individual models, but they are more computationally expensive and take longer to train.

Bootstrap aggregation, or bagging, is a method of combining multiple machine learning models to reduce variance, and it is a special case of the model averaging approach. In bagging, the same machine learning algorithm is trained many times, each time on a different sample drawn with replacement from the training data. The final output aggregates the predictions of all of the sub-models: an average for regression, a majority vote for classification. Because the aggregate is less sensitive to small changes in the training data than any single model, bagging also reduces the chance of overfitting. Random forests aggregate the predictions of bootstrapped decision trees.
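
The bagging procedure described above can be sketched by hand. This is a minimal illustration, not how scikit-learn implements random forests: the helper names `fit_bagged_trees` and `predict_majority` are hypothetical, and the data is synthetic (generated with `make_classification`) so that the block runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_estimators=25, seed=0):
    """Fit one decision tree per bootstrap sample of (X, y). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    trees = []
    for _ in range(n_estimators):
        # Draw n_rows row indices with replacement (a bootstrap sample).
        idx = rng.integers(0, n_rows, size=n_rows)
        tree = DecisionTreeClassifier(random_state=int(rng.integers(0, 2**31 - 1)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_majority(trees, X):
    """Aggregate binary (0/1) predictions by majority vote. Hypothetical helper."""
    votes = np.stack([tree.predict(X) for tree in trees])  # shape: (n_trees, n_rows)
    return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic binary data, used here only so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=2)
bagged_trees = fit_bagged_trees(X_demo, y_demo)
y_pred = predict_majority(bagged_trees, X_demo)
train_accuracy = (y_pred == y_demo).mean()
print('Training accuracy of the bagged trees: %0.3f' % train_accuracy)
```

An odd number of trees avoids ties in the vote. Note that `RandomForestClassifier` goes one step further than this sketch: it also considers only a random subset of features at each split, which decorrelates the trees more than bootstrapping alone.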

## Exploring random forests

In [2]:
# Use a random forest classifier to predict whether a user makes more or less than $50,000 USD using census data.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

In [3]:
df_census = pd.read_csv('census_cleaned.csv')
X_census = df_census.iloc[:, :-1]
y_census = df_census.iloc[:, -1]

In [4]:
# Initialize a Random Forest classifier with 10 estimators
rf = RandomForestClassifier(n_estimators=10, random_state=2, n_jobs=-1)
scores = cross_val_score(rf, X_census, y_census, cv=5)

In [6]:
print('Accuracy:', np.round(scores, 3))
print('Accuracy mean: %0.3f' % (scores.mean()))

Accuracy: [0.851 0.844 0.851 0.852 0.851]
Accuracy mean: 0.850
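
Because `cross_val_score` accepts any scikit-learn estimator, a single decision tree can be scored next to the forest under the same folds. The sketch below assumes synthetic stand-in data from `make_classification` rather than the census file, so its numbers say nothing about the census results; it only shows the shape of such a comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the census dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=2)

models = {
    'decision tree': DecisionTreeClassifier(random_state=2),
    'random forest': RandomForestClassifier(n_estimators=10, random_state=2),
}

mean_scores = {}
for name, model in models.items():
    # Same 5-fold cross-validation for both models.
    cv_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_scores[name] = cv_scores.mean()
    print('%s accuracy mean: %0.3f' % (name, mean_scores[name]))
```

Which model scores higher depends on the dataset, so the comparison is worth running rather than assuming.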


The bagging method described in the previous section probably explains why the random forest performs better than the decision tree in the previous chapter (85% vs. 81%). Each tree is trained on a different bootstrap sample, which adds diversity among the trees, and their predictions are then aggregated. Random forests also work with regression.

In [8]:
df_bikes = pd.read_csv('bike_rentals_cleaned.csv')
df_bikes.head()

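| | instant | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 0.0 | 1.0 | 0.0 | 6.0 | 0.0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 985 |
| 1 | 2 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 801 |
| 2 | 3 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 1349 |
| 3 | 4 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 1562 |
| 4 | 5 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 1600 |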
