### Trees: Ensemble Methods - Bagging

Bagging, in summary: train many individual models independently (so they can run in parallel), each on a random subset of the data, then combine their predictions.

BAGGing stands for Bootstrapping (sampling with replacement) and AGGregating (averaging predictions, or majority voting for classification).
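The two steps can be sketched by hand. This is a minimal illustration, not how scikit-learn implements bagging internally; the number of models (25), the seeds, and the use of the iris data are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
n_models = 25                    # illustrative choice

trees = []
for _ in range(n_models):
    # Bootstrapping: draw as many row indices as there are rows, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Aggregating: average the predicted class probabilities of all trees,
# then pick the most probable class (iris labels are 0, 1, 2, so argmax is the label)
proba = np.mean([t.predict_proba(X_test) for t in trees], axis=0)
bagged_pred = proba.argmax(axis=1)
accuracy = (bagged_pred == y_test).mean()
print(accuracy)
```

Each tree sees a slightly different version of the training set, so the trees disagree in places, and the averaging smooths those disagreements out.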

In addition to training each tree on a random subset of the data, a Random Forest considers only a random selection of features, rather than all features, when growing its trees (in scikit-learn, a fresh random subset of features is drawn at each split). Many such randomized trees together are called a Random Forest.
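The difference between plain bagging and a random forest can be seen in scikit-learn through the number of features each tree may consider. The sketch below is illustrative; the choice of 100 trees, `max_features="sqrt"`, and the seeds are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 4 features

# Plain bagging: bootstrap the rows, every split may look at all features
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
bagging.fit(X, y)

# Random forest: bootstrap the rows AND consider only a random subset
# of features (here sqrt(4) = 2) at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

# Number of features each individual tree considers per split
print(bagging.estimators_[0].max_features_, forest.estimators_[0].max_features_)
```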

With a Random Forest, our goal is to reduce the variance of a single decision tree. We end up with an ensemble of different models, and the average of the predictions from the different trees is more robust than the prediction of a single decision tree.

- The base learners of a forest (deep decision trees) have high variance and low bias
- Bagging decreases the model’s variance
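
A rough way to see why averaging reduces variance is the standard result below. It assumes the B trees are identically distributed, each with variance σ² and pairwise correlation ρ; these symbols are introduced here for illustration and are not part of the original notes.

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees (larger B) shrinks the second term, while the random feature selection of a Random Forest lowers the correlation ρ between trees, which shrinks the first.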

<img src="./images/boostrap_aggregating.png" width="500" height="500" />

### <strong> Extremely Randomized Trees </strong>

Extremely Randomized Trees, abbreviated as ExtraTrees in scikit-learn, add one more step of randomization to the random forest algorithm.

Random forests will

1. compute the optimal split for each feature within the randomly selected subset, and then choose the best feature to split on.
2. build each tree on a bootstrap sample (in scikit-learn, bootstrap = True by default), which means the rows are sampled with replacement.

ExtraTrees, in contrast, will

1. choose a random split for each feature within that random subset, and then choose the best feature to split on by comparing those randomly chosen splits (nodes are split on random splits, not best splits).
2. build multiple trees with bootstrap = False by default, which means each tree is trained on the whole training set rather than on a bootstrap sample.
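
The difference in how a threshold is picked for one feature can be sketched by hand. This is a simplified illustration with toy data, not scikit-learn's implementation; the helper names (`weighted_gini`, `best_threshold`, `random_threshold`) are hypothetical.

```python
import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x, y, threshold):
    # Impurity after splitting on x <= threshold, weighted by child size
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return gini(y)  # degenerate split: nothing was separated
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

def best_threshold(x, y):
    # Random-forest style: scan all midpoints between sorted unique values
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    scores = [weighted_gini(x, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]

def random_threshold(x, rng):
    # ExtraTrees style: draw a single threshold uniformly between min and max
    return rng.uniform(x.min(), x.max())

# Toy feature and labels (made up for the example)
x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 1, 1, 1])
rng = np.random.default_rng(0)

t_best = best_threshold(x, y)
t_rand = random_threshold(x, rng)
print(t_best, weighted_gini(x, y, t_best))
print(t_rand, weighted_gini(x, y, t_rand))
```

In both algorithms the per-feature thresholds are then compared across the candidate features, and the feature with the lowest impurity wins. The random version evaluates one threshold per feature instead of scanning every candidate.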

Extremely randomized trees are typically cheaper to train than random forests, because they skip the search for the optimal split threshold, and their predictive performance is often comparable. In some cases, they may even perform better!
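
A quick way to check this on a given dataset is to cross-validate both models side by side. The sketch below uses the iris data with 100 trees and 5-fold cross-validation, all chosen for illustration; results on other datasets may differ.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "extra trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    # model.bootstrap shows the differing defaults of the two estimators
    print(f"{name}: bootstrap={model.bootstrap}, mean CV accuracy={scores[name]:.3f}")
```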

Link to Paper: https://link.springer.com/content/pdf/10.1007/s10994-006-6226-1.pdf

In [1]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score

In [2]:
#load dataset

X,y = load_iris(return_X_y=True)

#train,test split

X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)

#random forest with gini
rf = RandomForestClassifier(criterion='gini',n_estimators=150,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)  #fit on the data

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])

In [3]:
#random forest with entropy
rf = RandomForestClassifier(criterion='entropy',n_estimators=200,max_depth=4,n_jobs=-1)

rf.fit(X_train,y_train)

rf_predict = rf.predict(X_test)

f1_score(y_test, rf_predict, average=None)

array([1., 1., 1.])
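
The two criteria used above measure node impurity in slightly different ways. As a small sketch (the helper names are hypothetical, and the class proportions are made up), both can be computed directly from the class proportions in a node:

```python
import numpy as np

def gini_impurity(p):
    # 1 - sum of squared class proportions
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy_impurity(p):
    # Shannon entropy in bits; classes with proportion 0 contribute nothing
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.sum(p * np.log2(1.0 / p))

# A pure node, a two-class mix, and an even three-class mix
for p in ([1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [1/3, 1/3, 1/3]):
    print(p, round(gini_impurity(p), 3), round(entropy_impurity(p), 3))
```

Both measures are zero for a pure node and largest for an even mix, which is why the two criteria often lead to similar trees.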

In [7]:
df = pd.read_csv('./data/shoesize_data/shoesizes.csv',index_col=0)

In [8]:
df.head()

   height  sex_no  shoe_size
0   160.0       2         40
1   171.0       2         39
2   174.0       2         39
3   176.0       2         40
4   195.0       1         46


In [9]:
df.isnull().sum()

height       0
sex_no       0
shoe_size    0
dtype: int64

In [21]:
df.sex_no.value_counts()

2    115
1     66
Name: sex_no, dtype: int64

In [20]:
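df.loc[df.sex_no == 0,'sex_no'] = 2   #recode sex_no 0 as 2 (this cell was executed before the value_counts cell above, so its output already reflects the change)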

In [22]:
df.head()

   height  sex_no  shoe_size
0   160.0       2         40
1   171.0       2         39
2   174.0       2         39
3   176.0       2         40
4   195.0       1         46


In [24]:
y = df.shoe_size
x = df.drop('shoe_size',axis=1)

#train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42,shuffle=True)

In [64]:
from sklearn.ensemble import RandomForestRegressor

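#criterion='mae' was renamed to 'absolute_error' in scikit-learn 1.0 (the old name was removed in 1.2)
rfr = RandomForestRegressor(n_estimators=50,max_depth=4,n_jobs=-1,criterion='absolute_error')

In [65]:
rfr.fit(x_train,y_train)

RandomForestRegressor(criterion='absolute_error', max_depth=4, n_estimators=50, n_jobs=-1)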

In [66]:
pred = rfr.predict(x_test)

In [67]:
#mean absolute error

from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_test,pred)

1.3208108108108108

Exercise: Can you get a lower mean absolute error than the random forest used at https://shoe-size-predict.herokuapp.com/ ?

Its mean absolute error at present is 0.789.
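
One possible starting point for the exercise is a small grid search scored by mean absolute error. The sketch below is only a suggestion: `tune_forest` is a hypothetical helper, the grid values are illustrative, and the demo runs on synthetic stand-in data rather than the shoe-size file, so the printed error says nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def tune_forest(x, y):
    # Hypothetical helper: small grid search scored by (negated) MAE
    grid = {"n_estimators": [50, 100], "max_depth": [2, 4, None]}  # illustrative values
    search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid=grid,
        scoring="neg_mean_absolute_error",
        cv=3,
    )
    search.fit(x, y)
    return search

# Synthetic stand-in data (NOT the shoe-size file): height in cm, sex_no in {1, 2}
rng = np.random.default_rng(0)
height = rng.uniform(150, 200, size=120)
sex_no = rng.integers(1, 3, size=120)
shoe = np.round(0.2 * height + 6 + rng.normal(0, 1, size=120))
x_demo = np.column_stack([height, sex_no])

search = tune_forest(x_demo, shoe)
# best_score_ is the negated MAE, so flip the sign to read it as an error
print(search.best_params_, -search.best_score_)
```

With the real data, the same helper could be called as `tune_forest(x_train, y_train)`, and the best estimator then evaluated on `x_test` with `mean_absolute_error` as above.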