## Exploring Random Forest Classifier

- The dataset covers 303 patients, each described by 13 clinical measurements.
- The goal is to predict whether a patient has heart disease (the `target` column).


In [26]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

### Importing Dataset

In [27]:
heart_df = pd.read_csv(r"../../../dataset/heart.csv")


In [28]:
heart_df.head()


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [29]:
heart_df.shape


(303, 14)

### Train-Test Split

In [30]:
X = heart_df.iloc[:,0:-1]
y = heart_df.iloc[:,-1]

In [31]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)


In [32]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(242, 13)
(61, 13)
(242,)
(61,)


### Model Building and Comparison with Different Algorithms


In [33]:
rf = RandomForestClassifier()
gb = GradientBoostingClassifier()
svc = SVC()
lr = LogisticRegression()

In [34]:
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("accuracy score for rf: {:.2f}".format(accuracy_score(y_test, y_pred)))


accuracy score for rf: 0.85


In [35]:
gb.fit(X_train,y_train)
y_pred = gb.predict(X_test)
print("accuracy score for gb: {:.2f}".format(accuracy_score(y_test, y_pred)))


accuracy score for gb: 0.77


In [36]:
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)
print("accuracy score for svc: {:.2f}".format(accuracy_score(y_test, y_pred)))


accuracy score for svc: 0.70


In [38]:
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
print("accuracy score for lr: {:.2f}".format(accuracy_score(y_test, y_pred)))


accuracy score for lr: 0.89


- Logistic regression scores highest on this split, which may indicate that the classes are close to linearly separable.
- Random forest comes second without any tuning.


In [44]:
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(RandomForestClassifier(),X,y,cv=10,scoring='accuracy'))


0.8116129032258066

In [46]:
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(LogisticRegression(),X,y,cv=10,scoring='accuracy'))


0.8283870967741935

- The earlier accuracies came from a single train/test split, so they depend on which 61 rows happened to land in the test set.
- 10-fold cross-validation repeats the train-and-evaluate cycle on 10 different folds and averages the scores, which gives a more reliable estimate.
- Under cross-validation, random forest drops from 0.85 to about 0.81 and logistic regression drops from 0.89 to about 0.83.
- After cross-validation, the results for RF and LR are therefore more or less the same.
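To see how much a single split can move the score, the sketch below repeats the train/test split with ten different seeds. It uses a synthetic dataset from `make_classification` as a stand-in for `heart.csv` (an assumption made so the snippet is self-contained), so the numbers it prints are not the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for heart.csv: 303 rows, 13 features (assumption).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

split_scores = []
for seed in range(10):
    # Same model and data, only the random split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_scores.append(accuracy_score(y_te, model.predict(X_te)))

print("min: {:.2f}  max: {:.2f}  mean: {:.2f}".format(
    min(split_scores), max(split_scores), np.mean(split_scores)))
```

The spread between the minimum and maximum is the kind of variation that a single `train_test_split` result hides.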


In [47]:
rf = RandomForestClassifier(max_samples=0.75,random_state=42)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("accuracy score for rf_with_tuning: {:.2f}".format(accuracy_score(y_test, y_pred)))


accuracy score for rf_with_tuning: 0.90


In [48]:
from sklearn.model_selection import cross_val_score
np.mean(cross_val_score(RandomForestClassifier(max_samples=0.75),X,y,cv=10,scoring='accuracy'))


0.8347311827956989

- Here we set `max_samples=0.75`, so each decision tree is trained on a bootstrap sample containing 75% of the training rows.
- On the single train/test split, accuracy rose from 0.85 to 0.90.
- Under 10-fold cross-validation, accuracy rose from about 0.81 to about 0.83, a much smaller gain. Those cross-validation runs set no `random_state`, so part of the difference may be run-to-run noise.
- Random Forest has many hyperparameters (close to 20 constructor arguments in recent scikit-learn releases), so finding a good combination by hand is impractical and systematic hyperparameter tuning is needed.
- There are two common hyperparameter tuning methods:
  1. **Grid Search CV**: Exhaustively evaluates every combination in a specified set of hyperparameter values.
  2. **Randomized Search CV**: Randomly samples combinations from a range of hyperparameters, which can be more efficient with large parameter spaces.
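As an illustration of what `max_samples=0.75` means, the sketch below draws row indices with replacement for three hypothetical trees. It imitates the idea of bootstrap sampling with NumPy and is not scikit-learn's internal code; the names `n_rows`, `n_draw`, and `tree_rows` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

n_rows = 242        # size of X_train in this notebook
max_samples = 0.75  # fraction of rows drawn for each tree
n_draw = round(n_rows * max_samples)

# Each tree gets its own sample of row indices, drawn with replacement.
tree_rows = [rng.choice(n_rows, size=n_draw, replace=True) for _ in range(3)]

for i, rows in enumerate(tree_rows):
    print("tree {}: {} rows drawn, {} distinct".format(
        i, len(rows), len(np.unique(rows))))
```

Because rows are drawn with replacement, the number of distinct rows each tree sees is smaller than the number drawn, which adds diversity across trees.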


## GridSearchCV

- In GridSearchCV, each hyperparameter is assigned a set of values to test.
- A Random Forest is trained and cross-validated for every combination of those values.
- In our case, there are 3 × 2 × 3 × 2 = 36 different combinations of hyperparameters.
- This is why it is called Grid Search: it systematically searches through a "grid" of hyperparameter values.
- With 4 hyperparameters, the grid is 4-dimensional, and each point in it represents one unique combination of values.


In [57]:
# Number of trees in random forest
n_estimators = [20,60,100]

# Number of features to consider at every split
max_features = [0.2,0.6]

# Maximum number of levels in tree
max_depth = [2,8,None]

# Fraction of training rows drawn for each tree
max_samples = [0.5,0.75]

# 3 * 2 * 3 * 2 = 36 different random forest configurations to train

- We form a dictionary named `param_grid` to specify the hyperparameters and their values.
- In this dictionary, we provide the name of each hyperparameter as keys and their respective values as lists.


In [58]:
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
              'max_samples':max_samples
             }
print(param_grid)

{'n_estimators': [20, 60, 100], 'max_features': [0.2, 0.6], 'max_depth': [2, 8, None], 'max_samples': [0.5, 0.75]}


In [59]:
# creating a random forest object
rf = RandomForestClassifier()


In [60]:
from sklearn.model_selection import GridSearchCV

rf_grid = GridSearchCV(estimator = rf, 
                       param_grid = param_grid,  # hyperparameter values to try
                       cv = 5,  # 5-fold CV: each of the 36 candidates is fit 5 times
                       verbose=2, # print progress during the search
                       n_jobs = -1) # use all CPU cores to speed up the search

In [61]:
rf_grid.fit(X_train,y_train)


Fitting 5 folds for each of 36 candidates, totalling 180 fits


- The search fits 36 candidates × 5 folds = 180 models behind the scenes, then refits the best candidate on the full training set.
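The count of candidates and fits can be checked by enumerating the grid by hand. This is a small sketch using only the standard library; `grid`, `candidates`, and `n_folds` are illustrative names, and the values are copied from `param_grid` above.

```python
from itertools import product

grid = {
    "n_estimators": [20, 60, 100],
    "max_features": [0.2, 0.6],
    "max_depth": [2, 8, None],
    "max_samples": [0.5, 0.75],
}

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_folds = 5

print(len(candidates))            # 36 candidates
print(len(candidates) * n_folds)  # 180 fits
print(candidates[0])              # first point in the grid
```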


In [62]:
rf_grid.best_params_


{'max_depth': None,
 'max_features': 0.2,
 'max_samples': 0.5,
 'n_estimators': 60}

In [65]:
# best_score_ is the mean cross-validated accuracy of the best candidate, not a test-set score
print("accuracy score for rf_with_best_parameters: {:.2f}".format(rf_grid.best_score_))


accuracy score for rf_with_best_parameters: 0.83
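Note that `best_score_` is a mean cross-validation score computed on the training data. To check the tuned model on held-out data, one option is to predict with `best_estimator_`, which `GridSearchCV` refits on the full training set by default. The sketch below shows the pattern on a synthetic stand-in dataset with a deliberately small grid (both are assumptions made to keep the example quick and self-contained).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for heart.csv (assumption).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Small illustrative grid: 2 * 2 = 4 candidates.
small_grid = {"n_estimators": [20, 60], "max_samples": [0.5, 0.75]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5)
search.fit(X_tr, y_tr)

cv_score = search.best_score_  # mean CV accuracy on the training folds
test_score = accuracy_score(y_te, search.best_estimator_.predict(X_te))

print("best CV accuracy: {:.2f}".format(cv_score))
print("held-out test accuracy: {:.2f}".format(test_score))
```

Reporting both numbers makes it clear which one was used to pick the parameters and which one estimates performance on unseen rows.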


## RandomizedSearchCV

- In the previous method, we had 36 combinations (180 fits with 5-fold cross-validation). As more parameters and values are added, the number of combinations multiplies and Grid Search becomes slower.
- Randomized Search CV addresses this issue by randomly selecting a subset (e.g., 10-15) of the total combinations and performing the training on these randomly chosen samples.
- Below, we have added additional parameters (`bootstrap`, `min_samples_split`, and `min_samples_leaf`) for Randomized Search CV.


In [66]:
# Number of trees in random forest
n_estimators = [20,60,100,120]

# Number of features to consider at every split
max_features = [0.2,0.6,1.0]

# Maximum number of levels in tree
max_depth = [2,8,None]

# Number of samples
max_samples = [0.5,0.75,1.0]

# Bootstrap samples
bootstrap = [True,False]

# Minimum number of samples required to split a node
min_samples_split = [2, 5]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2]

In [67]:
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
              'max_samples':max_samples,
              'bootstrap':bootstrap,
              'min_samples_split':min_samples_split,
              'min_samples_leaf':min_samples_leaf
             }
print(param_grid)

{'n_estimators': [20, 60, 100, 120], 'max_features': [0.2, 0.6, 1.0], 'max_depth': [2, 8, None], 'max_samples': [0.5, 0.75, 1.0], 'bootstrap': [True, False], 'min_samples_split': [2, 5], 'min_samples_leaf': [1, 2]}


In [68]:
from sklearn.model_selection import RandomizedSearchCV

rf_grid = RandomizedSearchCV(estimator = rf, 
                       param_distributions = param_grid, 
                       cv = 5, 
                       verbose=2, 
                       n_jobs = -1)

In [69]:
rf_grid.fit(X_train,y_train)


Fitting 5 folds for each of 10 candidates, totalling 50 fits


- Randomized Search CV randomly selects 10 candidates (the default `n_iter`) from the 864 possible combinations, leaving out the rest.
- This process is much faster but may not always provide the best results compared to Grid Search.
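The size of the search space and the share that 10 candidates cover can be worked out directly. The sketch below uses only the standard library; `space`, `all_combos`, and `sampled` are illustrative names, and the sampling only imitates the idea of drawing candidates without replacement, not scikit-learn's own sampler.

```python
import math
import random
from itertools import product

space = {
    "n_estimators": [20, 60, 100, 120],
    "max_features": [0.2, 0.6, 1.0],
    "max_depth": [2, 8, None],
    "max_samples": [0.5, 0.75, 1.0],
    "bootstrap": [True, False],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# 4 * 3 * 3 * 3 * 2 * 2 * 2 = 864 possible combinations.
total = math.prod(len(values) for values in space.values())

# Draw 10 distinct combinations at random from the full grid.
random.seed(0)
all_combos = list(product(*space.values()))
sampled = random.sample(all_combos, 10)

print(total)
print("share of the space evaluated: {:.1%}".format(len(sampled) / total))
```

A full grid search over this space with 5-fold cross-validation would need 864 × 5 = 4320 fits, compared with 50 for the randomized search.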


In [70]:
rf_grid.best_params_


{'n_estimators': 100,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_samples': 0.5,
 'max_features': 0.6,
 'max_depth': None,
 'bootstrap': True}

In [71]:
print("accuracy score for rf_with_best_parameters: {:.2f}".format(rf_grid.best_score_))


accuracy score for rf_with_best_parameters: 0.81
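One caveat about the search space above: it pairs `bootstrap=False` with non-`None` values of `max_samples`. To my knowledge, recent scikit-learn releases reject that combination, in which case such candidates fail and are scored as NaN (older releases may silently ignore `max_samples` instead). A possible workaround, sketched below on a synthetic stand-in dataset with a reduced space, is to pass `param_distributions` as a list of dictionaries so that `max_samples` is only varied when `bootstrap=True`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for heart.csv (assumption).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# Settings shared by both sub-spaces (reduced for a quick example).
shared = {
    "n_estimators": [20, 60],
    "max_depth": [2, 8, None],
    "min_samples_leaf": [1, 2],
}

# max_samples is only varied when bootstrap sampling is enabled.
param_distributions = [
    {**shared, "bootstrap": [True], "max_samples": [0.5, 0.75, 1.0]},
    {**shared, "bootstrap": [False], "max_samples": [None]},
]

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=6,
    cv=3,
    random_state=42,  # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

Setting `random_state` on the search also makes the chosen candidates repeatable between runs, which the notebook's search does not do.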


## Summary of Implementation and Results

### Accuracy Scores

| Model                           | Train/Test Split (Default) | 10-Fold CV (Default) | Grid Search CV (Best Score) | Train/Test Split (`max_samples=0.75`) |
|---------------------------------|----------------------------|----------------------|-----------------------------|---------------------------------------|
| Random Forest (RF)              | 0.85                       | 0.81                 | 0.83                        | 0.90                                  |
| Gradient Boosting (GB)          | 0.77                       | Not Calculated       | Not Calculated              | Not Calculated                        |
| Support Vector Classifier (SVC) | 0.70                       | Not Calculated       | Not Calculated              | Not Calculated                        |
| Logistic Regression (LR)        | 0.89                       | 0.83                 | Not Calculated              | Not Calculated                        |

### Conclusions

- **Logistic Regression**: Achieved the highest accuracy on the single test split (0.89), which may suggest that the classes are close to linearly separable. Its 10-fold cross-validation accuracy was 0.83.

- **Random Forest (RF)**: Scored 0.85 on the single test split with default settings and 0.90 with `max_samples=0.75`. Under 10-fold cross-validation the scores were lower, about 0.81 with default settings and about 0.83 with `max_samples=0.75`, which shows how much a single split can vary across data subsets.

- **Gradient Boosting (GB)** and **Support Vector Classifier (SVC)**: Performed less effectively on the single test split, with accuracies of 0.77 and 0.70, respectively.

- **Hyperparameter Tuning**:
  - **Grid Search CV**: Evaluated 36 combinations of hyperparameters (180 fits with 5-fold cross-validation), giving a best mean cross-validation accuracy of 0.83. This method is thorough but computationally intensive.
  - **Randomized Search CV**: Evaluated 10 random combinations (50 fits), giving a best mean cross-validation accuracy of 0.81. This method is faster but may not always find the best parameters.

### Final Note

- **Grid Search CV** provides a comprehensive evaluation of hyperparameters and is preferred when computational resources are available, especially for smaller parameter spaces.
- **Randomized Search CV** is a practical alternative when the parameter space is large or each fit is expensive, because it runs far fewer fits, though it may not always yield the best results.


