## **RADI605: Modern Machine Learning**

### Assignment: Random Forests
**Romen Samuel Rodis Wabina** <br>
Student, PhD Data Science in Healthcare and Clinical Informatics <br>
Clinical Epidemiology and Biostatistics, Faculty of Medicine (Ramathibodi Hospital) <br>
Mahidol University

Note: If the Markdown in this notebook does not render correctly, the assignment is also available on GitHub: [Link](https://github.com/rrwabina/RADI605/blob/main/05%20Adaptive%20Boosting/scripts/assignment.ipynb)

In [67]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

from numpy import mean
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.datasets import make_classification
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from mlxtend.plotting import plot_learning_curves
from statsmodels.stats.outliers_influence import variance_inflation_factor

import random
import warnings
warnings.filterwarnings('ignore')

In [26]:
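def convert_binary(df, columns = ['Risk1Yr']):
    # Encode each listed column as 0 where the value is 'F' and 1 otherwise.
    for column in columns:
        df[column] = df[column].apply(lambda x: 0 if x == 'F' else 1)
    return df

def load_thoracic(path = '../data/ThoraricSurgery.csv'):
    data = pd.read_csv(path)
    # Drop the first two columns, then PRE6 and PRE14.
    data = data[data.columns[2:]]
    data = data.drop(['PRE6', 'PRE14'], axis = 1)
    # Binary-encode the target and the ten columns at positions 2 to 11.
    label_columns = data.columns[2:12]
    data = convert_binary(data, columns = ['Risk1Yr'])
    data = convert_binary(data, columns = label_columns)
    # Every column except the last is used as a feature; Risk1Yr is the target.
    include_columns = data.columns[0:-1]
    X, y = data[include_columns], data['Risk1Yr']
    X, y = X.to_numpy(), y.to_numpy()
    # Relabel the negative class from 0 to -1.
    y[y == 0] = -1
    return X, y, data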

X, y, data = load_thoracic()

When features are highly correlated, they carry redundant information. The trees in a Random Forest may then split repeatedly on what is effectively the same signal, which makes the trees more alike and can reduce the benefit of averaging them. Correlated features also share importance among themselves, so the feature importance scores become unstable and harder to interpret.

It is therefore worth checking for highly correlated features before training a Random Forest and removing or combining them where needed, for example through feature selection or principal component analysis (PCA). Reducing redundancy simplifies the model and may improve performance. Feature selection also keeps the model interpretable, whereas PCA components are linear combinations of the original features and are harder to read than the features themselves.
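
A minimal sketch of such a check, assuming the features are available as a numeric pandas DataFrame. The helper name `correlated_pairs`, the 0.8 threshold, and the synthetic columns `a`, `b`, and `c` are illustrative assumptions and not part of the assignment.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (feature_1, feature_2, |r|) for every pair of columns whose
    absolute Pearson correlation exceeds `threshold`, strongest first."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    return [(left, right, float(r)) for (left, right), r in pairs.items()]

# Synthetic data for illustration: `b` is almost a copy of `a`, `c` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.05, size=200),
    'c': rng.normal(size=200),
})
print(correlated_pairs(demo))
```

Pairs flagged this way are candidates for dropping one member or for combining both; the threshold is a judgment call and should be tuned to the dataset.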


In [162]:
def split_data(X, y, pca_included = False, smote_included = False):
    print('Model Assumptions:')
    # Split first so that PCA and SMOTE are fitted on the training portion only.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

    if pca_included is True:
        print('\t The dataset used PCA for dimensionality reduction.')
        # Learn the projection from the training rows, then apply it unchanged to the test rows.
        pca = PCA(n_components = 8)
        X_train = pca.fit_transform(X_train)
        X_test = pca.transform(X_test)

    if smote_included is True:
        print('\t The dataset used SMOTE to rectify class imbalance.')
        # Oversample the minority class in the training set only; the test set stays untouched.
        smote = SMOTE(sampling_strategy = 'minority', k_neighbors = 5, random_state = 42)
        X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
    else:
        print('\t The dataset did NOT use SMOTE.')
        X_resampled, y_resampled = X_train, y_train 
    return X_resampled, y_resampled, X_test, y_test

def init_parameters():
    param_grid = { 
                'n_estimators': [10, 2000],              # only these two forest sizes are sampled
                'criterion': ['gini', 'entropy'],
                'max_depth': np.arange(1, 20),
                'min_samples_split': np.arange(2, 5)     # must be at least 2; scikit-learn rejects 1
              }
    return param_grid

def validation(model, X_test, y_test):
    # Report the confusion matrix and per-class metrics on the held-out test set.
    predictions = model.predict(X_test)
    print('Confusion Matrix: ')
    print(confusion_matrix(y_test, predictions))
    print(classification_report(y_test, predictions))
    return predictions

def train_randomforest(X_train, y_train, X_test, y_test, use_randomsearch = False, print_params = False):
    if use_randomsearch is True:
        print('\t This model has been cross-validated through Random Search.')
        # Sample 10 hyperparameter combinations and score each with 10-fold cross-validation.
        model = RandomizedSearchCV(estimator = RandomForestClassifier(), 
                                   param_distributions = init_parameters(), 
                                   cv = 10, n_iter = 10)
    else:
        print('\t This model did NOT cross-validate through Random Search.')
        # Single forest with fixed hyperparameters, no search.
        model = RandomForestClassifier(criterion = 'gini', n_estimators = 100, 
                                       max_depth = 9, min_samples_split = 4)
    model.fit(X_train, y_train)
    if use_randomsearch is True and print_params is True:
        print(model.best_params_)
    return validation(model, X_test, y_test)

In [160]:
X, y, data = load_thoracic()
X_train, y_train, X_test, y_test = split_data(X, y, pca_included = False, smote_included = False)
predictions = train_randomforest(X_train, y_train, X_test, y_test, use_randomsearch = False)

Model Assumptions:
	 The dataset did NOT use SMOTE.
	 This model did NOT cross-validate through Random Search.
Confusion Matrix: 
[[75  0]
 [19  0]]
              precision    recall  f1-score   support

          -1       0.80      1.00      0.89        75
           1       0.00      0.00      0.00        19

    accuracy                           0.80        94
   macro avg       0.40      0.50      0.44        94
weighted avg       0.64      0.80      0.71        94



In [163]:
X, y, data = load_thoracic()
X_train, y_train, X_test, y_test = split_data(X, y, pca_included = True, smote_included = False)
predictions = train_randomforest(X_train, y_train, X_test, y_test, use_randomsearch = False)

Model Assumptions:
	 The dataset used PCA for dimensionality reduction.
	 The dataset did NOT use SMOTE.
	 This model did NOT cross-validate through Random Search.
Confusion Matrix: 
[[75  0]
 [19  0]]
              precision    recall  f1-score   support

          -1       0.80      1.00      0.89        75
           1       0.00      0.00      0.00        19

    accuracy                           0.80        94
   macro avg       0.40      0.50      0.44        94
weighted avg       0.64      0.80      0.71        94
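
One way to keep the PCA step free of test-set information is to wrap it together with the classifier in a scikit-learn `Pipeline`, so that the projection is learned from the training rows only (and from the training folds only if the pipeline is cross-validated). The sketch below uses synthetic data from `make_classification` as a stand-in for the thoracic surgery features; the sample count, class weights, `StandardScaler` step, and `random_state` values are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: not the thoracic surgery dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=14, n_informative=6,
                                     weights=[0.85, 0.15], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)

pipe = Pipeline([
    ('scale', StandardScaler()),       # PCA is sensitive to feature scale
    ('pca', PCA(n_components=8)),      # same number of components as split_data
    ('forest', RandomForestClassifier(n_estimators=100, max_depth=9,
                                      min_samples_split=4, random_state=42)),
])

# fit() learns the scaler, the PCA projection and the forest from the training rows only;
# score() reuses the already fitted scaler and projection on the test rows.
pipe.fit(X_tr, y_tr)
print('test accuracy:', pipe.score(X_te, y_te))
```

Because the whole pipeline behaves like a single estimator, it can also be passed to `RandomizedSearchCV`, in which case the scaler and the projection are refitted inside every training fold.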



In [157]:
X, y, data = load_thoracic()
X_train, y_train, X_test, y_test = split_data(X, y, pca_included = False, smote_included = True)
predictions = train_randomforest(X_train, y_train, X_test, y_test, use_randomsearch = True)

Model Assumptions:
	 The dataset used SMOTE to rectify class imbalance.
	 This model has been cross-validated through Random Search.
Confusion Matrix: 
[[67  8]
 [17  2]]
              precision    recall  f1-score   support

          -1       0.80      0.89      0.84        75
           1       0.20      0.11      0.14        19

    accuracy                           0.73        94
   macro avg       0.50      0.50      0.49        94
weighted avg       0.68      0.73      0.70        94



In [158]:
X, y, data = load_thoracic()
X_train, y_train, X_test, y_test = split_data(X, y, pca_included = True, smote_included = True)
predictions = train_randomforest(X_train, y_train, X_test, y_test, use_randomsearch = True)

Model Assumptions:
	 The dataset used PCA for dimensionality reduction.
	 The dataset used SMOTE to rectify class imbalance.
	 This model has been cross-validated through Random Search.
Confusion Matrix: 
[[68  7]
 [16  3]]
              precision    recall  f1-score   support

          -1       0.81      0.91      0.86        75
           1       0.30      0.16      0.21        19

    accuracy                           0.76        94
   macro avg       0.55      0.53      0.53        94
weighted avg       0.71      0.76      0.72        94



## Cervical Cancer Dataset

In [166]:
data = pd.read_csv('../data/sobar-72.csv', sep = ',', header = 0)
X = data.iloc[:, 0:19].to_numpy()
y = data.iloc[:, 19].to_numpy()
X_train, y_train, X_test, y_test = split_data(X, y, pca_included = False, smote_included = False)
predictions = train_randomforest(X_train, y_train, X_test, y_test, use_randomsearch = False)

Model Assumptions:
	 The dataset did NOT use SMOTE.
	 This model did NOT cross-validate through Random Search.
Confusion Matrix: 
[[8 0]
 [2 5]]
              precision    recall  f1-score   support

           0       0.80      1.00      0.89         8
           1       1.00      0.71      0.83         7

    accuracy                           0.87        15
   macro avg       0.90      0.86      0.86        15
weighted avg       0.89      0.87      0.86        15



In [170]:
data = pd.read_csv('../data/sobar-72.csv', sep = ',', header = 0)
X = data.iloc[:, 0:19].to_numpy()
y = data.iloc[:, 19].to_numpy()
X_train, y_train, X_test, y_test = split_data(X, y, pca_included = True, smote_included = False)
predictions = train_randomforest(X_train, y_train, X_test, y_test, use_randomsearch = True)

Model Assumptions:
	 The dataset used PCA for dimensionality reduction.
	 The dataset did NOT use SMOTE.
	 This model has been cross-validated through Random Search.
Confusion Matrix: 
[[8 0]
 [1 6]]
              precision    recall  f1-score   support

           0       0.89      1.00      0.94         8
           1       1.00      0.86      0.92         7

    accuracy                           0.93        15
   macro avg       0.94      0.93      0.93        15
weighted avg       0.94      0.93      0.93        15

