# Random Forest

training:
- draw B bootstrap samples (random subsets of the observations) from the training data
- train one decision tree on each sample, B trees in total
- at each split in a tree, consider only a random subset of the predictors, instead of all of them, when searching for the best split

prediction:
- for each test observation get a prediction from each tree
- average the predictions for regression, take majority vote for classification
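
The training and prediction steps above can be sketched by hand for regression. This is a minimal illustration, not how scikit-learn implements its forests: the helper names `fit_forest` and `predict_forest` and the synthetic data are assumptions made for the example, and the per-split predictor subset is delegated to the `max_features` argument of `DecisionTreeRegressor`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_forest(X, y, n_trees=20, max_features=0.5, seed=0):
    """Hypothetical helper: train n_trees trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_obs = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # bootstrap sample: n_obs row indices drawn with replacement
        idx = rng.integers(0, n_obs, size=n_obs)
        # max_features makes every split look at a random subset of predictors
        tree = DecisionTreeRegressor(
            max_features=max_features,
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Hypothetical helper: average the per-tree predictions (regression)."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


# synthetic data, only for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

trees = fit_forest(X, y)
pred = predict_forest(trees, X)
print(pred[:5])
```

For classification, the averaging step would be replaced by a majority vote over the per-tree class predictions.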

hyperparameters:
- hyperparams of decision trees
- number of trees in the forest
- data subset size
- predictor subset size


Reason for Predictor Subsets:
- to make sure that averaging the trees actually reduces the variance

How?

- suppose there is a very strong predictor in the training data: a predictor so useful for prediction that most of the decision rules are created with it
- then the predictions of bagged trees will be highly correlated even though they are trained on different subsets, because every subset contains that strong predictor
- the problem is that averaging the predictions of highly correlated trees does not reduce the variance substantially, because the predictions are already very similar to each other

when we use random subsets of predictors:
- the strong predictors will not be available at many of the splits, so other predictors get a chance to be used
- this reduces the correlation between the trees, which allows the averaging to reduce the variance more
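
The effect of correlation on averaging can be made concrete with a standard result: if B predictions each have variance sigma^2 and every pair has correlation rho, their average has variance rho * sigma^2 + (1 - rho) * sigma^2 / B. The sketch below just evaluates that formula. Equal variances and equal pairwise correlations are an idealized assumption, and the function name is made up for the example.

```python
def variance_of_average(rho, sigma2, n_trees):
    """Variance of the mean of n_trees equally correlated predictions.

    Assumes every prediction has variance sigma2 and every pair of
    predictions has the same correlation rho (an idealization).
    """
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees


for rho in (0.9, 0.5, 0.1):
    print(rho, variance_of_average(rho, sigma2=1.0, n_trees=100))
```

With 100 trees and unit variance, the formula gives about 0.901 at rho = 0.9 but about 0.109 at rho = 0.1. The first term does not shrink as trees are added, which is why lowering the correlation matters more than adding more trees.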

Bagged Trees vs Random Forest: which model should be preferred? 
- random forest is more popular because the random predictor subset at each split is a reliable way to break the correlation between the trees and reduce the variance further
- bagged trees assign a random subset of the data to each tree, but do not take a random subset of the predictors at each split

Random Forest can struggle when many of the predictors are bad (uninformative):
- many of the random predictor subsets will then contain no useful predictor, so the splits made from them will be poor


- if the data is well prepared and all the predictors are actually useful and informative, random forest is usually preferred
- if not, bagging (with max_features = 1.0, i.e. every predictor available to every tree) may work better
- a bagging model can have any model as the base model, so it is more flexible; a random forest can only have trees
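
The flexibility point can be sketched with scikit-learn's `BaggingRegressor`, which accepts any regressor as its base model. The synthetic data and the choice of k-nearest neighbours as the non-tree base model are assumptions for illustration. The base model is passed positionally because the keyword name has changed between scikit-learn versions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# synthetic data, only for illustration
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=12)

# bagged trees: random observation subsets, all predictors available to every tree
bagged_trees = BaggingRegressor(
    DecisionTreeRegressor(),
    n_estimators=20, max_samples=0.8, max_features=1.0, random_state=12,
)

# bagging with a non-tree base model, which a random forest cannot do
bagged_knn = BaggingRegressor(
    KNeighborsRegressor(n_neighbors=5),
    n_estimators=20, max_samples=0.8, random_state=12,
)

# random forest: additionally draws a predictor subset at every split
forest = RandomForestRegressor(
    n_estimators=20, max_samples=0.8, max_features=0.5, random_state=12,
)

for name, m in [("bagged trees", bagged_trees),
                ("bagged kNN", bagged_knn),
                ("random forest", forest)]:
    m.fit(X, y)
    print(name, round(m.score(X, y), 3))  # training R^2, not a fair comparison
```

Note that in `BaggingRegressor` the float `1.0` means all predictors, while the integer `1` means a single predictor per base model.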

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import root_mean_squared_error
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier


In [4]:
trainf = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_features_train.csv")
trainp = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_prices_train.csv")
train = pd.merge(trainf, trainp)

testf = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_features_test.csv")
testp = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-3/Datasets/Car_prices_test.csv")
test = pd.merge(testf, testp)

train.head()

Unnamed: 0,carID,brand,model,year,transmission,mileage,fuelType,tax,mpg,engineSize,price
0,18473,bmw,6 Series,2020,Semi-Auto,11,Diesel,145,53.3282,3.0,37980
1,15064,bmw,6 Series,2019,Semi-Auto,10813,Diesel,145,53.043,3.0,33980
2,18268,bmw,6 Series,2020,Semi-Auto,6,Diesel,145,53.4379,3.0,36850
3,18480,bmw,6 Series,2017,Semi-Auto,18895,Diesel,145,51.514,3.0,25998
4,18492,bmw,6 Series,2015,Automatic,62953,Diesel,160,51.4903,3.0,18990


In [5]:
predictors = ['mpg', 'engineSize', 'year', 'mileage']
target = 'price'
X_train = train[predictors]
y_train = train[target]
X_test = test[predictors]
y_test = test[target]

In [6]:
model = RandomForestRegressor(
    random_state = 12,
    # tree hyperparameters are passed directly to the forest
    # the base model has to be a decision tree by definition: the extension
    # (each split seeing only a predictor subset) only makes sense for trees
    max_depth = 4,
    max_leaf_nodes = 10,
    n_estimators = 20,
    max_samples = 0.8, # observation subset size, identical to bagged trees
    # predictor subset size, NOT identical to bagged trees: in a random forest,
    # max_features is the share of predictors considered at each split in each tree
    # (here int(0.9 * 4) = 3 of the 4 predictors)
    max_features = .9,
    bootstrap=True # sample with replacement
)

model.fit(X_train, y_train)

# Tuning the Model

- the tree hyperparameters are left at their defaults, same as for bagged trees
- n_estimators should be set to a value large enough that the model performance stabilizes; use the OOB score to find it
- tune everything else (max_samples, max_features, bootstrap)

In [7]:
# tune n_estimators with OOB
model = RandomForestRegressor(
    random_state = 12,
    n_estimators = 40,
    oob_score=True)

model.fit(X_train, y_train)
model.oob_prediction_ # OOB prediction for each training observation, from the trees that did not train on it

model.oob_score_ # OOB score (R^2 for regression)

root_mean_squared_error(y_train, model.oob_prediction_) # OOB RMSE

5255.586691130506

In [8]:
# loop through an array of n_estimators values and find a number high enough to stabilize the OOB performance
# implemented the same way as for bagged trees
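
A minimal sketch of such a loop is given here, since the cell above only describes it. The synthetic data from `make_regression` and the candidate tree counts are assumptions for illustration; with the car data, `X_train` and `y_train` would take the place of `X` and `y`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in for the training data (assumption)
X, y = make_regression(n_samples=400, n_features=4, noise=10.0, random_state=12)

tree_counts = [50, 100, 150, 200]  # hypothetical candidate values
oob_r2 = {}

for n in tree_counts:
    model = RandomForestRegressor(
        n_estimators=n,
        oob_score=True,   # score each observation with the trees that did not see it
        random_state=12,
    )
    model.fit(X, y)
    oob_r2[n] = model.oob_score_  # OOB R^2

for n, score in oob_r2.items():
    print(n, round(score, 4))
```

The value to keep is the smallest `n_estimators` after which the OOB score stops changing noticeably. Very small tree counts are left out because some observations may then never be out-of-bag.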


In [10]:
# then create the model and the grid

model = RandomForestRegressor(
    random_state = 12,
    n_estimators = 100) # the n_estimators we found in OOB tuning above

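# max_samples is only valid with bootstrap=True; sklearn raises a ValueError
# if it is set together with bootstrap=False, so the two cases get separate grids
grid = [
    {"bootstrap" : [True],
     "max_samples" : [0.5, 0.7, 0.8, 0.9],
     "max_features" : [0.5, 0.7, 0.8, 0.9]},
    {"bootstrap" : [False],
     "max_features" : [0.5, 0.7, 0.8, 0.9]},
]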

# the rest is the same
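
The remaining step is presumably a cross-validated grid search. The sketch below shows one way to run it. `GridSearchCV`, the 5-fold split, the RMSE scoring, the reduced grid and the synthetic data are all assumptions for illustration, not necessarily what the bagged-trees notes use.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# synthetic stand-in for the training data (assumption)
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=12)

model = RandomForestRegressor(random_state=12, n_estimators=30)

# small grid to keep the example fast; bootstrap stays at its default (True)
small_grid = {
    "max_samples": [0.5, 0.9],
    "max_features": [0.5, 0.9],
}

cv = KFold(n_splits=5, shuffle=True, random_state=12)
search = GridSearchCV(
    model,
    small_grid,
    cv=cv,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes, so RMSE is negated
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best combination
```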

# Feature Importances

In [12]:
# random forests are built from decision trees, so feature importances are available

# a trained model
model = RandomForestRegressor(
    random_state = 12,
    n_estimators = 100) # the n_estimators we found in OOB tuning above

model.fit(X_train, y_train)

model.feature_importances_ # one value per column of X_train, in the order of predictors

array([0.14496009, 0.48871729, 0.20954816, 0.15677446])
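
The raw array is easier to read when paired with the predictor names. The values follow the column order of the training data, so in the output above the largest value (about 0.49) belongs to `engineSize`. A minimal sketch on synthetic data, where the column names `f0` to `f3` are hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical column names
X_arr, y = make_regression(
    n_samples=300, n_features=4, n_informative=2, noise=5.0, random_state=12
)
X = pd.DataFrame(X_arr, columns=feature_names)

model = RandomForestRegressor(n_estimators=50, random_state=12).fit(X, y)

# label each importance with its column and sort from most to least important
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the car data, `pd.Series(model.feature_importances_, index=predictors)` would give the same kind of labelled view.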

# Adaptive Boosting

Starting point:
- the weak point of a single decision tree: it is very difficult for a decision tree to avoid both high bias and high variance
- boosting is the second method to solve this problem (bagging was the first)
- boosting uses many high-bias, low-variance (underfitting) trees
    - combines them into one final low-bias, low-variance model
    - in short, boosting reduces bias

- unlike bagging, there are multiple ways to boost a number of trees
- we will start with the simplest one: Adaptive Boosting (AdaBoost)
- others: Gradient Boosting, XGBoost, CatBoost, LightGBM

AdaBoost:
- uses a number of copies of a base model
- the base model can be any underfitting model in theory, but we use decision trees
- difference from bagging: each copy is trained on the entire training data, not a subset
- copies are trained sequentially, not in parallel
- base model should be high bias, low variance
    - common example is a decision tree with low max_depth (with max_depth = 1 it is called a decision stump)
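
As a rough illustration of this setup, scikit-learn's `AdaBoostClassifier` can boost decision stumps. The synthetic classification data and the specific values of `n_estimators` and `learning_rate` are assumptions chosen for the example. The base model is passed positionally because the keyword name has changed between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# synthetic data, only for illustration
X, y = make_classification(n_samples=400, n_features=6, random_state=12)

# high-bias, low-variance base model: a decision stump
stump = DecisionTreeClassifier(max_depth=1)

boosted = AdaBoostClassifier(
    stump,
    n_estimators=50,     # number of sequentially trained copies
    learning_rate=0.5,   # shrinkage setting, value chosen only for illustration
    random_state=12,
)
boosted.fit(X, y)
stump.fit(X, y)

# training accuracy of one stump vs. the boosted ensemble
print(stump.score(X, y), boosted.score(X, y))
```

Each stump is fit on the full training set, and the ensemble is built one stump at a time rather than in parallel.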

We start with:
- all the training data
- a number of trees (hyperparameter)
- a lambda value, explained later (another hyperparameter)

Define two sets of weights
- observation weights
- tree weights
