# Random Forest

Random forest is a bagging (bootstrap aggregation) technique: it combines the output of multiple decision trees to reach a single result. It can handle both classification and regression problems.
<br><br>
## Intuition

> First, it does bagging. That means a dataset with d records is used to train n decision trees, and each decision tree gets d' records as training data. This is called row sampling with replacement. "With replacement" simply means that a selected row goes back into the pool, so it can be picked again for the same tree and can also be selected for other trees.
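
Row sampling with replacement can be sketched in a few lines of plain Python. The function name `bootstrap_sample` and the toy sizes are illustrative assumptions, not part of any library.

```python
import random

def bootstrap_sample(n_rows, sample_size, rng):
    """Draw `sample_size` row indices from range(n_rows) with replacement."""
    return rng.choices(range(n_rows), k=sample_size)

rng = random.Random(42)   # fixed seed so the sketch is repeatable
d = 10                    # total number of records (toy value)
d_prime = 10              # records given to each tree (assumed equal to d here)
n_trees = 3

# Each tree gets its own independent sample; repeated indices are expected.
samples = [bootstrap_sample(d, d_prime, rng) for _ in range(n_trees)]
for i, rows in enumerate(samples, start=1):
    print(f"tree {i}: rows {sorted(rows)}")
```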

Its final output is robust to overfitting and has lower variance.

Pseudocode:



```
for i=1,2,...,B:
  split data 80/20
  train on 80%  -> ti
  get accuracy on 20% -> Ai
```
For classification, the majority vote is the final output.

For regression, the average of all decision tree results is the final output.

The problem here is that if we get 60% positive results and 40% negative results, we will assign the positive class as the final result, but with low confidence. The model's prediction is uncertain.
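
The two aggregation rules, and the 60/40 situation described above, might look like this in plain Python. The helper names `majority_vote` and `average_prediction` and the toy predictions are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Return the most common label and the share of trees that voted for it."""
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions)

def average_prediction(predictions):
    """Regression: average the numeric outputs of the trees."""
    return mean(predictions)

# Ten hypothetical trees: 6 vote positive (1), 4 vote negative (0).
tree_votes = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
label, share = majority_vote(tree_votes)
print(label, share)   # positive class wins, but with only a 0.6 vote share

# Three hypothetical regression trees.
tree_values = [2.0, 3.0, 4.0]
print(average_prediction(tree_values))
```

The vote share can be read as a rough confidence: a value close to 0.5 signals that the trees disagree.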

> **Random Subset:** In plain bagging we get highly correlated trees because all trees can use all features. With random subsets, only a subset of m' features is considered at each split.
This introduces variability in the columns as well.

## How many features should be selected?

There is no hard rule for how many features should be selected at each split, but the following are the conventions:

> For classification:  sqrt(p)

> For regression:  p/3

p is the number of features in the dataset.
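
As a small worked example of these conventions, the sketch below computes m' for a dataset with 12 features and draws one random feature subset for a split. Rounding down, the helper name `features_per_split` and the feature names `f0` to `f11` are assumptions.

```python
import math
import random

def features_per_split(p, task):
    """Conventional number of features to try at each split (rounded down)."""
    if task == "classification":
        return max(1, int(math.sqrt(p)))
    if task == "regression":
        return max(1, p // 3)
    raise ValueError(f"unknown task: {task!r}")

feature_names = [f"f{i}" for i in range(12)]   # p = 12 hypothetical features
p = len(feature_names)

m_clf = features_per_split(p, "classification")   # int(sqrt(12)) = 3
m_reg = features_per_split(p, "regression")       # 12 // 3 = 4

rng = random.Random(0)
# A fresh subset (without repeats) is drawn every time a node is split.
split_candidates = rng.sample(feature_names, m_clf)
print(m_clf, m_reg, split_candidates)
```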

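## Why use a random forest if we have decision trees?

To reduce overfitting and variance: a single deep decision tree tends to memorize its training data, while averaging many trees smooths out their individual errors.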

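## Bagging vs Random Forest

Plain bagging samples only rows, so every tree can use all features at every split. A random forest adds a random subset of features at each split on top of bagging, which makes the trees less correlated.
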
## Random Forest vs Decision Tree
A random forest uses multiple decision trees to generate results. Its benefits are the same as those of decision trees, but it can be computationally expensive and is less interpretable. A random forest generalizes better than a single decision tree.

<br><br>
## Hyperparameters

Hyperparameters that are related to random forest are as follows:

**n_estimators:**  Defines the number of decision trees used for training. Default value: 100

**max_features:**  Defines the number of features considered at each split. It accepts the following values:



1.   auto (takes sqrt for classification; removed in newer scikit-learn versions)
2.   sqrt (takes the square root of the number of features)
3.   log2 (takes the base-2 log of the number of features)
4.   int (means the exact number of features)
5.   float (means the fraction of features)

**bootstrap:** True or False (True means records are sampled with replacement)

**max_samples:** Number of rows (int) or fraction of rows (float) drawn for each decision tree
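
To make the int versus float distinction concrete, here is a rough sketch of how `max_features` and `max_samples` settings could translate into actual counts. The helper names are hypothetical, and the rounding rules are an assumption modeled on scikit-learn's documented behaviour, not a copy of its code.

```python
import math

def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a number of features per split."""
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features                              # exact count
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))    # fraction of features
    raise ValueError(f"unsupported max_features: {max_features!r}")

def resolve_max_samples(max_samples, n_rows):
    """Translate a max_samples setting into a number of rows per tree."""
    if isinstance(max_samples, int):
        return max_samples                               # exact count
    return max(1, round(max_samples * n_rows))           # fraction of rows

for setting in ("sqrt", "log2", 5, 0.4):
    print(setting, "->", resolve_max_features(setting, n_features=12))

# Note the difference between the int 1 and the float 1.0.
print(resolve_max_samples(1, n_rows=200), resolve_max_samples(1.0, n_rows=200))
```

Under these assumptions, an integer `1` means a single row per tree, while `1.0` means all rows, so the two are easy to confuse in a parameter grid.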

<br><br>

## Feature Importance

### Idea 1



*   Measure the accuracy on the ith training set
*   Shuffle the values of one feature
*   Measure the accuracy on the shuffled training set
*   Average the difference over all training sets

Importance = (accuracy on the original training set - accuracy on the shuffled set)

If a feature is important, the model's accuracy will decrease when it is shuffled.
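
The shuffling idea can be sketched without any library. The toy model below stands in for a trained forest and only looks at the first feature, so shuffling that feature should hurt accuracy while shuffling the second should not. All names and data are made up, and the average here is taken over repeated shuffles of one dataset.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance_sketch(model, X, y, n_repeats, rng):
    """Average accuracy drop after shuffling each feature column in turn."""
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            X_shuffled = [row[:j] + [value] + row[j + 1:]
                          for row, value in zip(X, column)]
            drops.append(baseline - accuracy(model, X_shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Stand-in for a trained forest: uses feature 0 only."""
    return int(row[0] > 0.5)

rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

importances = permutation_importance_sketch(toy_model, X, y, n_repeats=5, rng=rng)
print(importances)   # large drop for feature 0, no drop for feature 1
```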




### Idea 2

The importance of every node in every decision tree is calculated as follows:
<br><br>

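$$
\text{node importance} = \frac{N_t}{N}\left[\text{impurity} - \frac{N_{t_R}}{N_t}\,\text{right\_impurity} - \frac{N_{t_L}}{N_t}\,\text{left\_impurity}\right]
$$

Here N is the total number of samples, N_t is the number of samples at the current node, and N_{t_R} and N_{t_L} are the numbers of samples in its right and left children.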

<br><br>

The importance of a feature is the sum of the importances of the nodes that split on it, normalized so that the importances of all features sum to one.
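
A small worked example may help. The tree below is hypothetical: 100 samples split 50/50 at the root on feature `a`, and the left child is split again on feature `b`. The impurity values are Gini impurities computed by hand for those class counts.

```python
def node_importance(n_total, n_node, impurity,
                    n_left, left_impurity, n_right, right_impurity):
    """Weighted impurity decrease contributed by one split node."""
    return (n_node / n_total) * (
        impurity
        - (n_right / n_node) * right_impurity
        - (n_left / n_node) * left_impurity
    )

# Root (100 samples, 50/50 classes, Gini 0.5) splits on feature "a" into
# left: 50 samples (40/10, Gini 0.32) and right: 50 samples (10/40, Gini 0.32).
root = node_importance(100, 100, 0.5, 50, 0.32, 50, 0.32)

# Left child (50 samples, Gini 0.32) splits on feature "b" into two pure
# leaves of 40 and 10 samples (Gini 0.0 each).
child = node_importance(100, 50, 0.32, 40, 0.0, 10, 0.0)

raw = {"a": root, "b": child}            # roughly 0.18 and 0.16
total = sum(raw.values())
normalized = {name: value / total for name, value in raw.items()}
print(raw)
print(normalized)                        # values sum to one
```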

> This can be misleading for high-cardinality features. In that case, the `permutation_importance` function from `sklearn.inspection` is used instead.
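
A minimal sketch comparing the two kinds of importance in scikit-learn. The synthetic data from `make_classification` and the parameter values are assumptions chosen to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; swap in your own X and y).
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importances (Idea 2), normalized to sum to one.
print(rf.feature_importances_)

# Permutation importances (Idea 1), measured here on held-out data.
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                scoring="accuracy", random_state=0)
print(result.importances_mean)
```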

<br><br>

## Out of Bag problem


As we have already seen, when rows are sampled with replacement (d') for the base models (that is, the decision trees), some rows can be repeated. It can be shown statistically that some rows are not selected at all for a given tree. These unselected records are called out-of-bag (OOB) records for that tree.

#### OOB-Score

The OOB records act as a validation dataset: each row is predicted using only the trees that did not see it during training. The accuracy of these predictions is called the OOB score.
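
In scikit-learn the OOB score can be requested with `oob_score=True`, which requires `bootstrap=True`. The sketch below uses synthetic data as a stand-in, so the printed value says nothing about any real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,     # OOB records only exist when rows are bootstrapped
    oob_score=True,     # score each row with the trees that never saw it
    random_state=0,
)
rf.fit(X, y)

print(rf.oob_score_)    # accuracy estimated without a separate validation split
```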

#### How much data is in OOB?

About 2/3 ∗ d of the records (roughly 63%) are in the training data of a given tree.

About 1/3 ∗ d of the records (roughly 37%) are in OOB.
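
These fractions follow from a short calculation, assuming each tree draws d' = d rows with replacement: a given row is missed by one draw with probability 1 - 1/d, so it is missed by all d draws with probability (1 - 1/d)^d, which approaches 1/e (about 0.368) as d grows. The helper name `oob_fraction` is illustrative.

```python
import math

def oob_fraction(d):
    """Probability that a given row is never drawn in d draws with replacement."""
    return (1 - 1 / d) ** d

for d in (10, 100, 299, 1000):
    print(d, round(oob_fraction(d), 4))

# Limit for large d: 1/e, a little more than one third.
print(round(math.exp(-1), 4))
```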



In [2]:
## Hyperparameter tuning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [9]:
df = pd.read_csv('/content/heaart.csv')
df.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


In [11]:
df.shape

(299, 13)

In [12]:
x = df.iloc[:,0:-1]
y = df.iloc[:,-1]

In [13]:
x_train,x_test, y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [14]:
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(239, 12) (60, 12) (239,) (60,)


In [15]:
rf = RandomForestClassifier()

In [16]:
rf.fit(x_train,y_train)

In [17]:
y_pred = rf.predict(x_test)

In [18]:
accuracy_score(y_test,y_pred)

0.7166666666666667

In [20]:
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(RandomForestClassifier(), x,y, cv=10, scoring='accuracy'))

0.7722988505747127

In [21]:
n_estimator = [20,40,60,100,120]
max_feature = [0.2,0.4,0.6,0.7]
max_depth = [2,4,7,None]
max_sample = [0.2, 0.5,0.75,1]  # caution: the int 1 means 1 row per tree; use 1.0 for 100% of the rows

In [22]:
param_grid = {
    'n_estimators': n_estimator,
    'max_features': max_feature,
    'max_depth': max_depth,
    'max_samples': max_sample
}

In [23]:
print(param_grid)

{'n_estimators': [20, 40, 60, 100, 120], 'max_features': [0.2, 0.4, 0.6, 0.7], 'max_depth': [2, 4, 7, None], 'max_samples': [0.2, 0.5, 0.75, 1]}


In [24]:
rf = RandomForestClassifier()

In [26]:
from sklearn.model_selection import GridSearchCV

rf_grid = GridSearchCV(
    estimator=rf,
    param_grid = param_grid,
    cv = 10,
    verbose =2
)

In [27]:
rf_grid.fit(x_train,y_train)

Fitting 10 folds for each of 320 candidates, totalling 3200 fits
[CV] END max_depth=2, max_features=0.2, max_samples=0.2, n_estimators=20; total time=   0.1s
[CV] END max_depth=2, max_features=0.2, max_samples=0.2, n_estimators=20; total time=   0.0s
[CV] END max_depth=2, max_features=0.2, max_samples=0.2, n_estimators=20; total time=   0.0s
[CV] END max_depth=2, max_features=0.2, max_samples=0.2, n_estimators=20; total time=   0.0s
[CV] END max_depth=2, max_features=0.2, max_samples=0.2, n_estimators=20; total time=   0.0s
[CV] END max_depth=2, max_features=0.2, max_samples=0.2, n_estimators=20; total time=   0.0s
[CV] END max_depth=2, max_features=0.2, max_samples=0.2, n_estimators=20; total time=   0.1s
[CV] END max_depth=2, max_features=0.2, max_samples=0.2, n_estimators=20; total time=   0.1s
[CV] END max_depth=2, max_features=0.2, max_samples=0.2, n_estimators=20; total time=   0.1s
[CV] END max_depth=2, max_features=0.2, max_samples=0.2, n_estimators=20; total time=   0.1s

In [32]:
rf_grid.best_params_

{'max_depth': 4, 'max_features': 0.4, 'max_samples': 0.5, 'n_estimators': 20}

In [29]:
rf_grid.best_score_

0.8871376811594203

In [39]:
rf = RandomForestClassifier(max_depth=4,
                            max_features = 0.4,
                            max_samples = 0.5,
                            n_estimators = 20)

In [47]:
rf.fit(x_train,y_train)

In [48]:
y_pred = rf.predict(x_test)

In [49]:
accuracy_score(y_test,y_pred)

0.7666666666666667