
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Random Forests and ExtraTrees


_Authors: Matt Brems (DC), Riley Dallas (AUS)_

---

## Random Forests
---

With bagged decision trees, we generate many different trees on pretty similar data. These trees are **strongly correlated** with one another. Because the trees are correlated, averaging them removes much less variance than averaging independent trees would. To see why, look at the variance of the average of two random variables $T_1$ and $T_2$:

$$
\begin{eqnarray*}
Var\left(\frac{T_1+T_2}{2}\right) &=& \frac{1}{4}\left[Var(T_1) + Var(T_2) + 2Cov(T_1,T_2)\right]
\end{eqnarray*}
$$

If $T_1$ and $T_2$ are highly correlated, then the variance of their average will be about as high as we'd see with an individual decision tree. By "de-correlating" our trees from one another, we can drastically reduce the variance of our model.
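To make the formula concrete, here is a small simulation sketch. It assumes two normally distributed variables with unit variance and a chosen correlation `rho` (these distributional choices are illustrative assumptions, not something trees satisfy exactly). With unit variances the formula reduces to $(1 + \rho)/2$, so the variance of the average approaches the variance of a single variable as `rho` approaches 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def var_of_average(rho, n_draws=200_000):
    """Empirical variance of (T1 + T2) / 2 when Var(T1) = Var(T2) = 1 and Corr(T1, T2) = rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    draws = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n_draws)
    return draws.mean(axis=1).var()

for rho in (0.0, 0.5, 0.95):
    theory = (1 + rho) / 2  # (1/4) * [1 + 1 + 2 * rho]
    print(f"rho={rho:.2f}  simulated={var_of_average(rho):.3f}  theory={theory:.3f}")
```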

That's the difference between bagged decision trees and random forests! We're going to do the same thing as before, but we're going to de-correlate our trees. This will reduce our variance (at the expense of a small increase in bias) and thus should greatly improve the overall performance of the final model.

So how do we "de-correlate" our trees?

Random forests differ from bagging decision trees in only one way: they use a modified tree learning algorithm that selects, at each split in the learning process, a **random subset of the features**. This process is sometimes called the *random subspace method*.

The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be used in many/all of the bagged decision trees, causing them to become correlated. By selecting a random subset of features at each split, we counter this correlation between base trees, strengthening the overall model.

For a problem with $p$ features, it is typical to use:

- $\sqrt{p}$ (rounded down) features in each split for a classification problem.
- $p/3$ (rounded down) with a minimum node size of 5 as the default for a regression problem.

This is only a guideline, but Hastie and Tibshirani (co-authors of *An Introduction to Statistical Learning* and *The Elements of Statistical Learning*) suggest it as a good rule in the absence of a rationale for doing something different.
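As a rough sketch of how these rules of thumb map onto `sklearn`, the snippet below computes both subset sizes for a hypothetical `p = 1000` and passes the equivalent settings through `max_features`. Mapping "minimum node size of 5" to `min_samples_leaf=5` is an assumption on my part. Note also that, in recent `sklearn` releases, `RandomForestRegressor` considers all features at each split by default, so the $p/3$ rule has to be requested explicitly.

```python
import math
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

p = 1000  # hypothetical number of features

m_classification = math.isqrt(p)  # floor(sqrt(p))
m_regression = p // 3             # floor(p / 3)
print(m_classification, m_regression)

# "sqrt" asks sklearn to consider sqrt(p) features at each split.
clf = RandomForestClassifier(max_features="sqrt", random_state=42)

# A float is read as a fraction of the features; min_samples_leaf=5 is an
# assumed stand-in for the "minimum node size of 5" in the rule above.
reg = RandomForestRegressor(max_features=1/3, min_samples_leaf=5, random_state=42)
```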

Random forests, a step beyond bagged decision trees, are **very widely used** classifiers and regressors. They are relatively simple to use because they require very few parameters to set and they perform pretty well.
- It is quite common for interviewers to ask how a random forest is constructed or how it is superior to a single decision tree.

--- 

## Extremely Randomized Trees (ExtraTrees)
Adding another step of randomization (and thus de-correlation) yields extremely randomized trees, or _ExtraTrees_. Like Random Forests, these are trained using the random subspace method (sampling of features). However, they are trained on the entire dataset instead of bootstrapped samples. A layer of randomness is introduced in the way the nodes are split. Instead of computing the locally optimal feature/split combination (based on, e.g., information gain or the Gini impurity) for each feature under consideration, a random value is selected for the split. This value is selected from the feature's empirical range.
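The difference in split rules can be sketched on a single toy feature. The helper names below (`gini`, `weighted_gini`, `best_threshold`, `random_threshold`) are hypothetical and written only for illustration; they are not `sklearn` functions. As far as I understand the `sklearn` documentation, its ExtraTrees implementation draws one random threshold per candidate feature and then keeps the best of those random candidates, so this sketch covers only the threshold-drawing step.

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

def weighted_gini(x, y, threshold):
    """Size-weighted Gini impurity of the two children produced by x <= threshold."""
    left = x <= threshold
    return (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)

def best_threshold(x, y):
    """Exhaustive search: try every midpoint between sorted unique values, keep the best."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    scores = [weighted_gini(x, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]

def random_threshold(x, rng):
    """ExtraTrees-style: draw one threshold uniformly from the feature's observed range."""
    return rng.uniform(x.min(), x.max())

rng = np.random.default_rng(42)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])  # toy feature
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                  # toy labels

t_best = best_threshold(x, y)
t_rand = random_threshold(x, rng)
print("optimal split:", t_best, "impurity:", weighted_gini(x, y, t_best))
print("random split: ", t_rand, "impurity:", weighted_gini(x, y, t_rand))
```

The random threshold is cheaper to compute and usually yields a less pure split than the optimal one, which is where the added randomness (and added bias) comes from.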

The extra randomization further reduces the variance, but it also increases the bias. If you're considering ExtraTrees, you can treat the choice between ExtraTrees and a Random Forest as a hyperparameter to tune: build both models, then compare their performance!

That's exactly what we'll do below.

## Import libraries
---

We'll need the following libraries for today's lecture:
- `pandas`
- `numpy`
- `GridSearchCV`, `train_test_split` and `cross_val_score` from `sklearn`'s `model_selection` module 
- `RandomForestClassifier` and `ExtraTreesClassifier` from `sklearn`'s `ensemble` module 

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import set_config
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, mean_squared_error

set_config(display='diagram')  # render estimators as diagrams in the notebook

## Load Data
---

Our first dataset has to do with detecting sarcasm in news headlines. [Source](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection)

In [3]:
sarcasm_df = pd.read_csv('data/sarcasm.csv')

In [5]:


sarcasm_df.head()

Unnamed: 0,link,headline,sarcastic
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0


In [6]:
print(sarcasm_df.iloc[3, 1])

boehner just wants wife to listen, not come up with alternative debt-reduction ideas


## Challenge: What is our baseline accuracy?
---

The baseline accuracy is the percentage of the majority class, regardless of whether it is 1 or 0. It serves as the benchmark for our model to beat.

In [10]:
sarcasm_df['sarcastic'].value_counts(normalize = True)

0    0.561047
1    0.438953
Name: sarcastic, dtype: float64
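The baseline is just the largest of those proportions. A minimal sketch on a hypothetical toy target (the `labels` series below is made up for illustration):

```python
import pandas as pd

labels = pd.Series([0, 0, 0, 1, 1])  # hypothetical toy target: three 0s, two 1s

# Share of the majority class = accuracy of always predicting that class.
baseline_accuracy = labels.value_counts(normalize=True).max()
print(baseline_accuracy)
```

Applied to this notebook, `sarcasm_df['sarcastic'].value_counts(normalize=True).max()` would pick out the 0.561 shown above.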

## Train/Test Split
---



In [11]:
X_train, X_test, y_train, y_test = train_test_split(sarcasm_df['headline'],
                                                    sarcasm_df['sarcastic'],
                                                    stratify = sarcasm_df['sarcastic'],
                                                    random_state = 42)

## Model instantiation
---

Vectorize the headlines with `CountVectorizer`, then create an instance of `RandomForestClassifier` and `ExtraTreesClassifier`.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [18]:
cvect = CountVectorizer(max_features = 1000) # keep only the 1,000 most frequent words

In [19]:
X_train_vect = cvect.fit_transform(X_train)
X_test_vect = cvect.transform(X_test)

In [21]:
forest = RandomForestClassifier()
extra = ExtraTreesClassifier()

In [22]:
forest.fit(X_train_vect, y_train)

In [23]:
extra.fit(X_train_vect, y_train)

## Model Evaluation
---

Which one has a higher `cross_val_score`?

In [24]:
cross_val_score(forest, X_train_vect, y_train) #gives us accuracy 

array([0.7873721 , 0.78781827, 0.78307539, 0.78282576, 0.7870694 ])

In [25]:
cross_val_score(extra, X_train_vect, y_train)

array([0.78762166, 0.78282576, 0.78107838, 0.78207688, 0.78482277])
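One caveat with the cells above: the vectorizer was fit on all of `X_train` before cross-validation, so each validation fold contributed to the vocabulary. A possible alternative, sketched here on hypothetical toy headlines and labels, is to wrap the vectorizer and the model in a pipeline so the vocabulary is re-learned inside every fold. This is a sketch of the pattern, not a claim that it changes the scores reported above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = sarcastic, 0 = not sarcastic.
docs = [
    "area man wins argument with himself",
    "local dog elected mayor again",
    "nation shocked that monday exists",
    "area mom discovers internet",
    "local man bravely naps",
    "nation agrees to disagree forever",
    "senate passes budget bill",
    "storm closes coastal highway",
    "city council approves new park",
    "company reports quarterly earnings",
    "school board votes on budget",
    "study links sleep to health",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipe = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=25, random_state=42),
)

# The vectorizer is refit on the training portion of each fold.
scores = cross_val_score(pipe, docs, labels, cv=3)
print(scores, scores.mean())
```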

## Grid Search
---



In [None]:
#n_estimators = [10, 100, 500, 1000]
#max_depth = [1, 2, 3, 4]

In [26]:
params = {
    'n_estimators': [10, 100, 500],
    'max_depth': [10, 20]
}



In [27]:
grid = GridSearchCV(forest, param_grid=params)

In [28]:
grid.fit(X_train_vect, y_train)

In [29]:
grid.score(X_test_vect, y_test)

0.7470799640610961
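The test score here is lower than the untuned cross-validation scores above, which may mean the `max_depth` cap in the grid is holding the forest back. A fitted `GridSearchCV` exposes `best_params_` and `best_score_` for checking this. The sketch below shows the pattern on synthetic data from `make_classification` (an assumption for illustration, with a deliberately small grid so it runs quickly).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, not the sarcasm dataset.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=42)

toy_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 50], "max_depth": [2, 5]},
    cv=3,
)
toy_grid.fit(X_tr, y_tr)

print(toy_grid.best_params_)        # winning combination
print(toy_grid.best_score_)         # mean cross-validated accuracy of the winner
print(toy_grid.score(X_te, y_te))   # accuracy of the refit winner on held-out data
```

In this notebook, the equivalent check would be `grid.best_params_` and `grid.best_score_`.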

### Problem

In small groups, build a simple classifier to predict customer churn. Don't get hung up on too much preprocessing; your main goal is to compare a `KNeighborsClassifier`, `LogisticRegression`, `DecisionTreeClassifier`, and `RandomForestClassifier`.


If you have time, investigate the feature importances of the decision tree and random forest.  What features do they suggest are the most important to predicting customer churn?
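As a starting point for the feature-importance question, fitted tree models expose a `feature_importances_` attribute. The sketch below uses synthetic data in which only the first two columns are informative (an assumption made so the expected ranking is known); the column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the two informative features in columns 0 and 1.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_syn.shape[1])]  # hypothetical names

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_syn, y_syn)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

With the churn data, the same pattern would use `X.columns` as the index.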

In [30]:
churn_df = pd.read_csv('data/cell_phone_churn.csv')
churn_df.head(2)

Unnamed: 0,state,account_length,area_code,intl_plan,vmail_plan,vmail_message,day_mins,day_calls,day_charge,eve_mins,eve_calls,eve_charge,night_mins,night_calls,night_charge,intl_mins,intl_calls,intl_charge,custserv_calls,churn
0,KS,128,415,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False


In [31]:
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

In [33]:
X = churn_df.drop(['vmail_plan', 'intl_plan', 'state', 'churn'], axis = 1)
y = churn_df['churn']

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [35]:
X_train

Unnamed: 0,account_length,area_code,vmail_message,day_mins,day_calls,day_charge,eve_mins,eve_calls,eve_charge,night_mins,night_calls,night_charge,intl_mins,intl_calls,intl_charge,custserv_calls
367,45,415,0,78.2,127,13.29,253.4,108,21.54,255.0,100,11.48,18.0,3,4.86,1
3103,115,415,0,195.9,111,33.30,227.0,108,19.30,313.2,113,14.09,13.2,1,3.56,2
549,121,408,31,237.1,63,40.31,205.6,117,17.48,196.7,85,8.85,10.1,5,2.73,4
2531,180,415,0,143.3,134,24.36,180.5,113,15.34,184.2,87,8.29,10.1,4,2.73,1
2378,112,510,0,206.2,122,35.05,164.5,94,13.98,140.3,101,6.31,12.6,7,3.40,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,106,510,0,274.4,120,46.65,198.6,82,16.88,160.8,62,7.24,6.0,3,1.62,1
1130,122,415,0,35.1,62,5.97,180.8,89,15.37,251.6,58,11.32,12.7,2,3.43,1
1294,66,408,0,87.6,76,14.89,262.0,111,22.27,184.6,125,8.31,9.2,5,2.48,1
860,169,415,0,179.2,111,30.46,175.2,130,14.89,228.6,92,10.29,9.9,6,2.67,2


In [37]:
#Baseline null model: predict the training-set churn rate for every test row
baseline_test_preds = np.full(y_test.shape, np.mean(y_train))


In [40]:
#RMSE

from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, baseline_test_preds)) #np.sqrt avoids the `squared` argument, which newer sklearn versions no longer accept

0.3570152856061543
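Because `churn` is a True/False target, the RMSE of a constant prediction has a simple closed form. If a fraction $p$ of the labels are positive and we predict $p$ for every row, the RMSE on those same labels is $\sqrt{p(1-p)}$. The sketch below checks this on a hypothetical toy target with 15% positives (the 15% figure is made up for illustration).

```python
import numpy as np

y_toy = np.array([True] * 15 + [False] * 85).astype(float)  # hypothetical: 15% positives
p = y_toy.mean()

constant_preds = np.full(y_toy.shape, p)
rmse = np.sqrt(np.mean((y_toy - constant_preds) ** 2))

print(rmse, np.sqrt(p * (1 - p)))
```

The notebook's baseline predicts the training mean on the test set, so its RMSE will be close to, but not exactly, this value.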

In [45]:
#instantiate a standard scaler 

scaler = StandardScaler()

#fit transform the train data 

X_train_scaled = scaler.fit_transform(X_train)

#transform the test data 

X_test_scaled = scaler.transform(X_test)

In [46]:
#Linear Regression Model RMSE 
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)


In [49]:
#make predictions 

lr_preds = lr.predict(X_test_scaled)

In [50]:
#RMSE 

lr_rmse = np.sqrt(mean_squared_error(y_test, lr_preds))
print(lr_rmse) #better than baseline 

0.33423295817696463


In [None]:
#KNN Model RMSE 

In [51]:
#knn instantiate (leave at default - 5 neighbors) 

knn_reg = KNeighborsRegressor()

#fit 

knn_reg.fit(X_train_scaled, y_train)

#make predictions 
knn_preds = knn_reg.predict(X_test_scaled)

In [52]:
#RMSE 

np.sqrt(mean_squared_error(y_test, knn_preds)) #better than linear model

0.31195215767975565

In [None]:
#Random Forest and ExtraTrees classifiers (scored by accuracy, not RMSE)

In [53]:
forest = RandomForestClassifier()
extra = ExtraTreesClassifier()

In [55]:
forest.fit(X_train_scaled, y_train)

In [60]:
forest_preds = forest.predict(X_test_scaled)

In [57]:
cross_val_score(forest, X_train_scaled, y_train)

array([0.914     , 0.918     , 0.932     , 0.918     , 0.92184369])

In [58]:
cross_val_score(extra, X_train_scaled, y_train)

array([0.914     , 0.91      , 0.93      , 0.9       , 0.91382766])

In [64]:
#Decision Tree 
from sklearn.tree import DecisionTreeRegressor



In [66]:
dtr = DecisionTreeRegressor(max_depth=2)

In [67]:
dtr.fit(X_train_scaled, y_train)

In [69]:
dtr_preds = dtr.predict(X_test_scaled)

In [70]:
#RMSE 

np.sqrt(mean_squared_error(y_test, dtr_preds)) #lowest RMSE of the regressors so far

0.3021439001298899

## WHAT JACOB DID