# Random Forest Classifier

Random Forests are an ensemble of decision trees typically trained using a (tweaked) bagging or pasting approach. The idea is that we train numerous weak learners and aggregate their predictive power. For classification problems, the predicted class that gets the most votes is chosen. For regression tasks, the mean of the individual decision trees is considered.

The reason why an ensemble of weak learners produces a powerful model is due to the law of large numbers. According to the theory, if we repeat an experiment many times and average the results, the results obtained will be close to the theoretical expected value. Thus, if we create a weak learner that is only slightly better than random guessing (51%), a series of 1000 such learners will ultimately produce a much larger predictive power.


# Training Process

It’s important to note that random forests utilize a bagging-based resampling approach. This process involves taking multiple bootstapped (bagged) samples of the original dataset with replacement and each sample is used to train a seperate decision tree. 

A typical bagging based approach would train multiple trees and all trees would end up splitting at the most important feature. This results in a highly correlated prediction thereby producing high variance. Random forests address this problem by forcing each split based on a random sample of a predictor subset of $m$ predictors (column subsampling). The model then splits the trees based on the best feature within the subset. This increases tree diversity as some trees will contain predictors that would have been dismissed otherwise.

To summarize, the primary difference between bagging and random forests is the criteria for choosing the predictor subset. In bagging we use all predictors $m = p$ vs while in random forests $m = √p$. This process is advantageous as it prevents all trees from splitting at the most important variable (thereby reducing variance). This decorrelation allows the model to reduce overfitting and produces greater tree diversity.


# Random Forest Hyperparameters

Random forests have almost all hyperparameters of decision trees and bagging classifiers. Tweaking these allows us to control tree growth and the ensemble process. At each tree split only a random subset of the features is considered. This randomness can be further exaggerated or constrained allowing us to control the bias-variance trade-off. Higher randomness typically increases bias but produces a more reliable model.

- Node Size

- Number of trees 

- Number of features sampled


# Feature Importance

Another interesting characteristic of random forest is that it measures the relative importance of each input feature. It computes feature importance using the largest relative mean decrease. For regression, we use the RSS decrease. A large value indicates a stronger predictor.


# Random Forest Pros and Cons


**Pros**

- Produce lower variance relative to decision trees and regular bagging as it decorrelates features.

- Produces a stronger model at the cost of interpretability. Unlike a regular decision tree model, we cannot produce a decision tree diagram. When we bag numerous trees, it is not possible to visualize the statistical learning process through a decision tree diagram. 

- Allows us to determine feature importance. This could be used as a method conduct feature extraction.

- Doesn't require data-pre-processing (feature scaling)

- Generally robust against outliers.

**Cons**

- Trades better performance for lack of interpretability

- Works well with non-linear datasets

- Lots of Tuning parameters

- Cannot be used in time series-based datasets.



# 1. Libraries

In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [2]:
# Import Data
df = pd.read_csv('LungCapData.csv')
df.head()

Unnamed: 0,LungCap,Age,Height,Smoke,Gender,Caesarean
0,6.475,6,62.1,no,male,no
1,10.125,18,74.7,yes,female,no
2,9.55,16,69.7,no,female,yes
3,11.125,14,71.0,no,male,no
4,4.8,5,56.9,no,male,no


# 2. Preprocessing

In [3]:
# Predictors and Target
X = df.drop(columns = ['LungCap'])
y = df['LungCap']

# Instantiate one-hot encoder
ohe = OneHotEncoder()

# columns to be one hot encoded
ct = make_column_transformer(

    (ohe, ['Smoke', 'Gender', 'Caesarean']),
    remainder = 'passthrough')

# predictors and target variable
X = np.array(ct.fit_transform(X))
y = np.array(y)

# Checck input and target variable shape
X.shape, y.shape

((725, 8), (725,))

In [4]:
# Training and Testing subsets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 911)

# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print('Standardized feature Mean:',  X_train.mean().round())
print('Standardized feature SD :',   X_train.std().round())

Standardized feature Mean: 0.0
Standardized feature SD : 1.0


# 3. Training

In [5]:
# Training the Decision Tree Classifier on default parameters
rf = RandomForestRegressor(random_state = 0)
rf.fit(X_train, y_train)

RandomForestRegressor(random_state=0)

# 4. Testing

In [6]:
# Predicting the Test set results
y_pred = rf.predict(X_test)

# Mean squared error
print('Mean Squared Error :', mean_squared_error(y_test, y_pred))

Mean Squared Error : 1.2464510183973299


# 5. K-Fold Cross Validation

In [7]:
# 10 fold cross validation
R2 = cross_val_score(estimator = RandomForestRegressor(),
                             X = X,
                             y = y,
                             cv = 10)

# Cross validation accuracy and standard deviation
print(R2)
print("R2: {:.3f} %".format(R2.mean()*100))
print("R2 Standard Deviation: {:.3f} %".format(R2.std()*100))

[0.75984122 0.82651557 0.84983209 0.73433377 0.82721232 0.8618788
 0.79488641 0.83534297 0.81926248 0.69429939]
R2: 80.034 %
R2 Standard Deviation: 5.148 %


# 6. Hyperparametric Tuning

In [8]:
# Grid Search CV
param_grid = [{'bootstrap': [True],
     'max_depth': [6, 10],
     'max_features': ['auto'],
     'min_samples_leaf': [3, 5],
     'min_samples_split': [4, 6],
     'n_estimators': [100, 350]}]

rf = RandomForestRegressor()

# Configure GridSearchCV
grid_search = GridSearchCV(rf, param_grid, cv=5,
                                  scoring="r2",
                                  n_jobs=-1)
# Initiate Search
grid_search.fit(X_train, y_train)

# Extract Tuned Parameters and Predictive Accuracy
tuned_params = grid_search.best_params_
tuned_score = grid_search.best_score_
best_estimator = grid_search.best_estimator_

# Print Results
print("Best R2: {:.2f} %".format(grid_search.best_score_*100))
print("Best Parameters:", tuned_params)

Best R2: 82.01 %
Best Parameters: {'bootstrap': True, 'max_depth': 6, 'max_features': 'auto', 'min_samples_leaf': 5, 'min_samples_split': 4, 'n_estimators': 350}


In [9]:
# Randomized Search
rf = RandomForestRegressor()

param_space = {"bootstrap": [True],
        "max_depth": [6, 8, 10, 12, 14],
        "max_features": ['auto'],
        "min_samples_leaf": [2, 3, 4],
        "min_samples_split": [2, 3, 4, 5],
        "n_estimators": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
}

# Configure Randomized Search
random_search = RandomizedSearchCV(rf, param_space,
                                        scoring="r2", cv=5,
                                        n_jobs=-1, random_state=911)
# Initiate Search
random_search.fit(X_train, y_train)

# Extract Tuned Parameters and Predictive Accuracy
tuned_params = random_search.best_params_
tuned_score = random_search.best_score_
best_estimator = random_search.best_estimator_

# Print accuracy and best parameters
print("Best Accuracy: {:.2f} %".format(random_search.best_score_*100))
print("Best Parameters:", tuned_params)

Best Accuracy: 81.84 %
Best Parameters: {'n_estimators': 400, 'min_samples_split': 3, 'min_samples_leaf': 4, 'max_features': 'auto', 'max_depth': 6, 'bootstrap': True}
