Each machine learning algorithm has strengths and weaknesses. A weakness of decision trees is that they are prone to overfitting the training set. One way to mitigate this problem is to constrain how large a tree can grow. Bagged trees take a different approach: they grow many deep decision trees, each on a bootstrapped sample of the training data (rows drawn with replacement), and average their predictions. The idea is that many trees protect each other from their individual weaknesses.
![images](images/baggedTrees.png)
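
To make the idea concrete, here is a minimal hand-rolled sketch of bagging. It is not how `BaggingRegressor` is implemented internally, only an illustration of the same idea. The noisy sine data and the names `X_toy`, `y_toy`, and `X_new` are assumptions made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy sine curve
X_toy = rng.uniform(0, 6, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=200)
X_new = np.linspace(0, 6, 50).reshape(-1, 1)

n_trees = 25
all_preds = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement, same size as the data
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    # Grow a deep (unconstrained) tree on the bootstrap sample
    tree = DecisionTreeRegressor(random_state=i)
    tree.fit(X_toy[idx], y_toy[idx])
    all_preds.append(tree.predict(X_new))

# The bagged prediction is the average over all trees
bagged_pred = np.mean(all_preds, axis=0)
print(bagged_pred[:5])
```

Each tree overfits its own bootstrap sample, but the trees overfit in different ways, so averaging tends to smooth out the individual errors.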


## Import Libraries

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

# Bagged Trees Regressor
from sklearn.ensemble import BaggingRegressor

## Load the Dataset
This dataset contains house sale prices for King County, which includes Seattle. It covers homes sold between May 2014 and May 2015. The code below loads the dataset. The goal is to predict price from features like the number of bedrooms and bathrooms.

In [None]:
df = pd.read_csv('data/kc_house_data.csv')

df.head()

In [None]:
# This notebook selects only a few features for simplicity
# However, I encourage you to experiment with adding and removing features
features = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors']

X = df.loc[:, features]

y = df.loc[:, 'price'].values

## Splitting Data into Training and Test Sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
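
If you are curious what the split does, the sketch below runs `train_test_split` on a tiny made-up array (`X_toy` and `y_toy` are assumptions for illustration, not the housing data). With the default settings, 25% of the rows are held out for testing, and `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# With the default test_size, 25% of the rows go to the test set
print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)
```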

Note: another benefit of bagged trees, as with single decision trees, is that you don’t have to standardize your features, unlike with algorithms such as logistic regression and K-Nearest Neighbors.
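
Here is a small sketch of why standardizing is unnecessary for trees. Tree splits depend only on the ordering of each feature's values, and standardizing preserves that ordering. The synthetic data and variable names below are assumptions for illustration. The two trees should agree on all, or nearly all, rows; in principle a tiny floating-point difference could flip a borderline split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic features on very different scales (an assumption for illustration)
X_toy = np.column_stack([
    rng.uniform(1, 6, size=300),        # a small, count-like feature
    rng.uniform(500, 5000, size=300),   # a large, area-like feature
])
y_toy = 50 * X_toy[:, 0] + 0.1 * X_toy[:, 1] + rng.normal(scale=10, size=300)

X_fit, X_eval = X_toy[:200], X_toy[200:]
y_fit = y_toy[:200]

# Tree fit on the raw features
tree_raw = DecisionTreeRegressor(random_state=0).fit(X_fit, y_fit)
pred_raw = tree_raw.predict(X_eval)

# Same tree settings, fit on standardized features
scaler = StandardScaler().fit(X_fit)
tree_scaled = DecisionTreeRegressor(random_state=0).fit(scaler.transform(X_fit), y_fit)
pred_scaled = tree_scaled.predict(scaler.transform(X_eval))

# Fraction of evaluation rows where both trees give the same prediction
print(np.mean(np.isclose(pred_raw, pred_scaled)))
```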

## Bagged Trees

<b>Step 1:</b> Import the model you want to use

In sklearn, all machine learning models are implemented as Python classes.

In [None]:
# This was already imported earlier in the notebook, so it is commented out here
#from sklearn.ensemble import BaggingRegressor

<b>Step 2:</b> Make an instance of the model

This is a place where we can tune the hyperparameters of a model. 

In [None]:
reg = BaggingRegressor(n_estimators=100,
                       random_state=0)
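
Beyond `n_estimators`, `BaggingRegressor` has other hyperparameters you can set when making the instance. The sketch below lists them with `get_params()` and builds a second, hypothetical variant. The names `reg_default` and `reg_variant` and the specific values (0.8, `n_jobs=-1`) are illustrative assumptions, not tuned recommendations.

```python
from sklearn.ensemble import BaggingRegressor

# Same settings as the cell above
reg_default = BaggingRegressor(n_estimators=100, random_state=0)

# get_params() lists every hyperparameter and its current value
for name, value in sorted(reg_default.get_params().items()):
    print(f"{name}: {value}")

# Hypothetical variant: the values below are illustrative, not tuned
reg_variant = BaggingRegressor(
    n_estimators=100,
    max_samples=0.8,    # each tree is trained on a sample of 80% of the rows
    max_features=0.8,   # each tree is trained on a random 80% of the columns
    n_jobs=-1,          # fit the trees in parallel on all available cores
    random_state=0,
)
```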

<b>Step 3:</b> Train the model on the data, storing the information learned from the data

The model learns the relationship between X (features like number of bedrooms) and y (price).

In [None]:
reg.fit(X_train, y_train)

<b>Step 4:</b> Make Predictions

This step uses the information the model learned during training.

In [None]:
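# Predict for one observation; predict() returns a NumPy array
# Double brackets keep the row as a one-row DataFrame, so the column
# names match the ones the model saw during fit
reg.predict(X_test.iloc[[0]])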

Predict for Multiple Observations at Once

In [None]:
reg.predict(X_test[0:10])
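
Under the hood, a bagged regressor's prediction is the average of the predictions of its individual trees. The sketch below checks this on synthetic data (the data and the names `bag`, `per_tree`, and `manual_average` are assumptions for illustration). It relies on the fitted attributes `estimators_` and `estimators_features_`, and the final comparison should print `True`.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = X_toy[:, 0] * 2 + X_toy[:, 1] ** 2 + rng.normal(scale=1.0, size=200)

bag = BaggingRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Predict with each fitted tree, using the feature columns that tree was trained on
per_tree = np.array([
    tree.predict(X_toy[:5][:, cols])
    for tree, cols in zip(bag.estimators_, bag.estimators_features_)
])

# Averaging the per-tree predictions reproduces the ensemble's prediction
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, bag.predict(X_toy[:5])))
```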

## Measuring Model Performance

Unlike classification models, where a common metric is accuracy, regression models use other metrics such as R^2, the coefficient of determination, to quantify a model's performance. The best possible score is 1.0. A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0. The score can also be negative, because a model can be arbitrarily worse than that constant baseline.
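
For intuition, R^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of y. The sketch below computes it by hand on a few made-up numbers and compares the result with `sklearn.metrics.r2_score`. The helper `r2_manual` and the arrays are hypothetical, written only for this example.

```python
import numpy as np
from sklearn.metrics import r2_score

# Small made-up example (values are illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

def r2_manual(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

# Both values should agree (0.925 for these numbers)
print(r2_manual(y_true, y_pred), r2_score(y_true, y_pred))

# A constant model that always predicts the mean of y_true scores 0.0
y_const = np.full_like(y_true, y_true.mean())
print(r2_manual(y_true, y_const))

# A model that is worse than the constant baseline gets a negative score
print(r2_manual(y_true, y_true[::-1]))
```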

In [None]:
score = reg.score(X_test, y_test)
print(score)

## Tuning n_estimators (Number of Decision Trees)

A tuning parameter for bagged trees is **n_estimators**, which represents the number of trees that should be grown. 

In [None]:
# List of values to try for n_estimators:
estimator_range = [1] + list(range(10, 150, 20))

scores = []

for estimator in estimator_range:
    reg = BaggingRegressor(n_estimators=estimator, random_state=0)
    reg.fit(X_train, y_train)
    scores.append(reg.score(X_test, y_test))

In [None]:
plt.figure(figsize=(10, 7))
plt.plot(estimator_range, scores)

plt.xlabel('n_estimators', fontsize=20)
plt.ylabel('Score', fontsize=20)
plt.tick_params(labelsize=18)
plt.grid()

Notice that the score stops improving after a certain number of estimators (decision trees). One way to get a better score would be to include more features in the features matrix.
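
The tuning loop above fits a brand new ensemble for every value of `n_estimators`. As an optional alternative, here is a sketch that uses the `warm_start` parameter so each fit keeps the trees already grown and only adds new ones. It runs on synthetic data (an assumption for illustration), and because the trees are grown incrementally, the scores are not guaranteed to match those from fitting each ensemble from scratch.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration)
X_toy = rng.uniform(0, 10, size=(300, 3))
y_toy = X_toy[:, 0] * 2 + X_toy[:, 1] ** 2 + rng.normal(scale=2.0, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# warm_start=True keeps the trees already grown and only adds new ones on refit
bag = BaggingRegressor(n_estimators=1, warm_start=True, random_state=0)

sizes = [1, 10, 30, 50]  # must be increasing when using warm_start
warm_scores = []
for n in sizes:
    bag.set_params(n_estimators=n)
    bag.fit(X_tr, y_tr)
    warm_scores.append(bag.score(X_te, y_te))

print(list(zip(sizes, warm_scores)))
```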
