Weak point of a single decision tree: it is difficult to keep both bias and variance low (a shallow tree underfits, a deep tree overfits)

bagging is one method that addresses this by:
- using many low-bias, high-variance (overfitting) trees
- combining them into one final low-bias, low-variance model
- in short, bagging reduces variance

Definition: bagging is short for bootstrap aggregating
- bootstrapping: drawing random samples from a set with replacement
- aggregating: any operation that combines many values into one value (e.g. mean or mode)

start with full training dataset with all training observations

then, create bagging subsets:
- an observation can be in multiple subsets
- a subset can have the same observation more than once
- each subset has to be the same size
- the number of subsets is a hyperparameter of a bagging model
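
As a minimal sketch of the subset rules above, the following draws bootstrap subsets of row indices with NumPy. The toy sizes (`n_obs`, `B`, `subset_size`) and the seed are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

n_obs = 10        # size of a toy training set (row indices 0..9)
B = 3             # number of bagging subsets (a hyperparameter)
subset_size = 8   # every subset has the same size

# Each row is one subset of row indices, drawn with replacement,
# so an index can repeat within a row and appear in several rows.
subsets = rng.integers(0, n_obs, size=(B, subset_size))

for b, idx in enumerate(subsets):
    print(f"subset {b}: {np.sort(idx)}")
```

Sorting each subset before printing makes repeated indices easy to spot.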

one copy of the regressor/classifier model is trained on each subset (for now assume that they are all decision trees)
together, the models trained on the different bagged subsets are called a bagging model

Training: 
- create B training subsets from the training dataset
- train B copies of the base model, one on each subset
- the trained copies act together as a bagging regressor/classifier

Prediction:
- for each test observation, each copy makes a prediction -- B predictions in total
- for regression, the average of the B predictions is the final prediction of the bagging model
- for classification, the mode of the B predictions is the final prediction of the bagging model (majority voting)
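
The training and prediction steps above can be sketched by hand for regression. This is an illustrative sketch, not how `BaggingRegressor` is implemented internally. The synthetic data, `B = 25`, and the choice to make each subset as large as the training set are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, for illustration only)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

B = 25        # number of copies of the base model
n = len(X)

# Training: one fully grown (overfitting) tree per bootstrap subset
trees = []
for b in range(B):
    idx = rng.integers(0, n, size=n)  # row indices sampled with replacement
    tree = DecisionTreeRegressor(random_state=b)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Prediction: B predictions per test observation, then their average
X_new = np.array([[2.5], [7.0]])
all_preds = np.stack([t.predict(X_new) for t in trees])  # shape (B, 2)
bagged_pred = all_preds.mean(axis=0)
print(bagged_pred)
```

For classification, the last step would take the most frequent predicted class among the B predictions instead of the mean.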

Why is bagging necessary?
- bagging reduces variance of base model
- bagging takes an overfitting model as the base model and keeps it from overfitting
- therefore, a high variance decision tree is a very good base model for a bagging model, creating a low variance bagging model

How does bagging reduce variance?
- high bias means that the model is too simple (it can't capture the true pattern)
    - every copy of a high-bias model predicts far from the true value in a similar way, so averaging the copies does not fix it
- high variance means that the model fits the training dataset too closely
    - training many copies of a high-variance model will result in predictions that jump around the true value
- assuming the errors of the copies are not correlated, the predictions are expected to be scattered around the true value
- if we average the predictions, the errors partly cancel out and the result will be much closer to the true value
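
A small simulation can illustrate the averaging argument. It assumes an idealized case: each copy's prediction is the true value plus independent noise. All numbers (`true_value`, `noise_sd`, `B`, `n_trials`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 10.0
noise_sd = 2.0      # spread of a single high-variance model's prediction
B = 50              # copies averaged per bagging model
n_trials = 10_000   # repeated experiments used to estimate the spread

# Each row holds B independent, unbiased but noisy predictions
preds = true_value + rng.normal(0, noise_sd, size=(n_trials, B))

single_sd = preds[:, 0].std()           # spread of one copy
averaged_sd = preds.mean(axis=1).std()  # spread of the average of B copies

print(f"single copy sd:      {single_sd:.3f}")
print(f"average of B copies: {averaged_sd:.3f}")
```

Under this independence assumption the spread of the average shrinks roughly by a factor of sqrt(B). In real bagging the subsets overlap, so the copies are correlated and the reduction is smaller.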


- note that each copy of the base model is trained on its own subset and not on the full training dataset
- the observations a copy does not see can be used to validate that copy
    - the prediction performance on the observations a copy doesn't see gives the out-of-bag (OOB) score

CV vs OOB
- the OOB score is computed during training, so it is cheaper
- OOB is not always reliable, since the prediction for each observation comes only from the copies that did not see it, not from all B copies
- CV is more expensive but more reliable

Bagging and Random Forest:
- a bagging model that has B trees is called a bagged trees model
- a random forest is a bagged trees model with a modification:
    - when a tree in the model branches out during training, only a random subset of predictors is allowed to be used for each decision rule

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import root_mean_squared_error
from sklearn.ensemble import BaggingRegressor, BaggingClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

In [2]:
trainf = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_features_train.csv")
trainp = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_prices_train.csv")
train = pd.merge(trainf, trainp)

testf = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_features_test.csv")
testp = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_prices_test.csv")
test = pd.merge(testf, testp)

predictors = ["mpg", "engineSize", "year", "mileage"]

X_train = train[predictors]
y_train = train["price"]

X_test = test[predictors]
y_test = test["price"]

In [3]:
# no scaling needed: the model is based on decision trees, which split on thresholds and are not affected by feature scale

# Model Inputs

In [4]:
# we first create the base model -- the Decision Tree inputs have the exact same idea as discussed before
base_model = DecisionTreeRegressor(
    random_state = 12,
    max_depth = 4,
    max_leaf_nodes = 10
)

model = BaggingRegressor(
    estimator = base_model, # can create a bagging model of any base model
    n_estimators = 10, # how many copies of the base model
    max_samples = 0.8, # fraction of the training observations drawn for each subset
    max_features = 0.75, # fraction of the predictors drawn for each subset
    bootstrap = True, # sample observations with replacement
    bootstrap_features=True, # same idea as bootstrap, but for predictors
    random_state=12,
)

# fit the model
model.fit(X_train, y_train)

### Tuning a Bagging Model

Don't bother tuning the base model: we want the base model to overfit, and the bagging model takes care of the variance.