In [1]:
import pandas as pd
import numpy as np

# Computational Learning - Assignment 3

Tamir Zecler 204168223

## The Dataset

This part explains how I preprocessed the data and gives a feel for the dataset.

### Data Loading
To load the dataset properly, please make sure that the dataset path specified below is correct.
My code assumes the dataset is in the same directory as this notebook. It splits the data into labels and features.

In [2]:
dataset_path = "ex3_data.csv"

In [3]:
def load_data(path, ignore_nan=True):
  # read the csv and, by default, drop every row that has a missing value
  df = pd.read_csv(path)
  if ignore_nan:
      df = df.dropna()
  return df

def split_x_y_df(df):
  y_data = df['EVENT_PRIMARY']
  x_data = df.drop(['EVENT_PRIMARY'], axis=1)
  return x_data, y_data

In [4]:
x_df, y_df = split_x_y_df(load_data(dataset_path))

In [5]:
print(f'our data consists of {x_df.shape[0]} entries with {x_df.shape[1]} features each')

our data consists of 8281 entries with 29 features each


In [6]:
print(f'Outcome:: Dead: {sum(y_df)}, Alive:{y_df.shape[0]-sum(y_df)}')
print(f'Death Percentage: {100*sum(y_df)/y_df.shape[0]}')

Outcome:: Dead: 513, Alive:7768
Death Percentage: 6.194903997101799


### Data Preprocess

In [7]:
from sklearn.preprocessing import MinMaxScaler

#### Feature Types:
In this dataset we have 29 different features that we can use for prediction. Of those features we have:
<ul>
    <li>11 - Boolean features</li>
    <li>2 - Categorical features</li>
    <li>16 - Numeric features</li>
</ul>

The first and most important step at this point is to make the data more learnable.
To do that I used a min-max scaler on the numeric values, and dummy variables to encode the categorical features, since order has no meaning for either of them (inferred from the feature explanations in the assignment; I may have misunderstood the explanation of the clinical arm).
I also chose to ignore the RACE_BLACK feature, since it is already captured by RACE4 (it is the same column as the dummy variable for black) and using it just feels racist. I ignored the site location (NEWSITEID) as well: we hope to get results that do not depend on location, for robustness, so using it as a feature seems redundant. I arrived at this feature set through trial and error, adding and removing features to get the best results.
For the boolean features I chose not to one-hot encode them, as that makes the model more likely to suffer from the curse of dimensionality, consumes more memory, and can cause more overfitting than the plain representation. Since I cannot understand the meaning of each feature, I chose to limit the changes I make based on prior knowledge. I used the min-max scaler with its default value range so that it would not change my binary features. Since the values are all positive, there is no reason to prefer MaxAbsScaler. I used the min-max scaler because I have no prior knowledge, and learning on a standard range of 0 to 1 gives every feature a similar 'weight'.
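
The scaling argument above can be checked on a small example. This is only a sketch on made-up values (the two toy columns are illustrative and not taken from the dataset): with its default range of (0, 1), MinMaxScaler should leave a 0/1 column untouched and map a numeric column onto [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up for illustration): column 0 is binary, column 1 is numeric.
toy = np.array([[0, 110.0],
                [1, 140.0],
                [1, 125.0],
                [0, 170.0]])

scaled = MinMaxScaler().fit_transform(toy)

# The binary column already spans [0, 1], so it comes back unchanged.
print(scaled[:, 0])
# The numeric column is mapped to (x - min) / (max - min).
print(scaled[:, 1])
```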

In [83]:
features_ignored = ['RACE_BLACK', 'NEWSITEID']
categorical_features = ['RACE4', 'INTENSIVE']
boolean_features = ['INCLUSIONFRS', 'NOAGENTS', 'ASPIRIN', 'SUB_CKD', 'FEMALE', 'SUB_CVD', 'SUB_CLINICALCVD', 'SUB_SUBCLINICALCVD', 'SUB_SENIOR', 'STATIN']

In [99]:
def data_preprocess(x_df, y_df):
    # drop unused features
    x_df = x_df.drop(columns=features_ignored)

    # make sure boolean values are ints
    y_df = y_df.astype(int)
    x_df[boolean_features] = x_df[boolean_features].astype(int)

    # replace each categorical feature with its dummy variables
    for category in categorical_features:
        dummies = pd.get_dummies(x_df[[category]])
        x_df = pd.concat([x_df.drop(columns=[category]), dummies], axis=1)

    # scale the feature values with a min-max scaler;
    # fit_transform returns a numpy array and not a data frame
    x_ndr = MinMaxScaler().fit_transform(x_df)
    y_ndr = y_df.to_numpy()
    return x_ndr, y_ndr

## Models

In [10]:
from imblearn.over_sampling import RandomOverSampler 

The provided dataset is heavily unbalanced: as seen after loading, only about 6% of the entries are labeled as death.
To overcome this problem I used oversampling, with the RandomOverSampler from the imblearn package.
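
To give a feel for what random oversampling does, here is a minimal NumPy sketch. It is not imblearn's implementation: the helper `random_oversample` and the toy arrays are hypothetical and only illustrate the idea of duplicating randomly chosen minority rows until the classes are the same size.

```python
import numpy as np

def random_oversample(x, y, seed=0):
    """Duplicate randomly chosen minority rows until all classes have the same size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    majority_count = counts.max()
    keep = [np.arange(len(y))]  # start from every original row
    for cls, count in zip(classes, counts):
        if count < majority_count:
            cls_idx = np.flatnonzero(y == cls)
            # sample with replacement from the smaller class
            extra = rng.choice(cls_idx, size=majority_count - count, replace=True)
            keep.append(extra)
    idx = np.concatenate(keep)
    return x[idx], y[idx]

# Toy labels with 6% positives, similar to the imbalance in the dataset.
y_toy = np.array([1] * 6 + [0] * 94)
x_toy = np.arange(100).reshape(-1, 1)

x_res, y_res = random_oversample(x_toy, y_toy)
print(np.bincount(y_toy), np.bincount(y_res))
```

Only existing rows are repeated, so the resampled data contains no new information. Resampling only the training part of each split keeps duplicated rows out of the test set.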

### Linear Classifier

In [11]:
from sklearn.linear_model import LogisticRegression
clf_lm = LogisticRegression(max_iter=10000)

The linear classifier I chose to work with is logistic regression.

### Ensemble Model

In [12]:
from sklearn.ensemble import RandomForestClassifier
clf_em = RandomForestClassifier() 

The ensemble model I chose to use is the Random Forest classifier.

### Deep Learning Model

In [37]:
from sklearn.neural_network import MLPClassifier
clf_dl = MLPClassifier(solver='adam')

I chose adam as the optimizer since it is considered a standard choice, and since each model takes a long time to train I focused on other hyperparameters.

### Grid Search
To get the best hyperparameters, I ran a grid search for each model. I used the imblearn pipeline since it has better support for oversampling: it applies the sampler by itself during the fit stage. In general, since this dataset is not balanced, the AUC is a useful metric: it can be read as the probability that a random positive sample gets a higher score than a random negative sample.
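
The probabilistic reading of the AUC can be written out directly. The sketch below uses a hypothetical helper, `pairwise_auc`, and made-up scores. It counts, over all (positive, negative) pairs, how often the positive sample gets the higher score, with ties counted as half.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count as half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: 2 positives, 3 negatives, so 6 pairs in total.
y_toy = [0, 0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.8, 0.35, 0.9]
print(pairwise_auc(y_toy, scores_toy))
```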

In [14]:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler

from sklearn.model_selection import GridSearchCV

In [85]:
dataset_df = load_data(dataset_path)
x_df, y_df = split_x_y_df(dataset_df)
x_ndr_prcsd, y_ndr_prcsd = data_preprocess(x_df, y_df)

#### Linear Classifier

In [18]:
def lm_gridSearch(x_data, y_data, resampling=False, scoring='recall'):
  lm_parameters = {"tol": [0.001, .0025, 0.005, 0.01, 0.025], 'solver': ['newton-cg', 'lbfgs', 'saga']}
  clf_lm_grid = GridSearchCV(clf_lm, lm_parameters, scoring=scoring)
  if resampling:
      sampler = RandomOverSampler()
      pipeline = Pipeline([("sampler", sampler),
                           ("classifier", clf_lm_grid)])
      pipeline.fit(x_data, y_data)
      print("Best parameters for Logistic Regression with Resampling:")
      print(clf_lm_grid.best_params_)
  else:
      clf_lm_grid.fit(x_data, y_data)
      print("Best parameters for Logistic Regression without Resampling:")   
      print(clf_lm_grid.best_params_)

In [19]:
lm_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=False)

Best parameters for Logistic Regression without Resampling:
{'solver': 'newton-cg', 'tol': 0.001}


In [20]:
lm_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=False, scoring='f1')

Best parameters for Logistic Regression without Resampling:
{'solver': 'newton-cg', 'tol': 0.001}


In [21]:
lm_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=False, scoring='roc_auc')

Best parameters for Logistic Regression without Resampling:
{'solver': 'lbfgs', 'tol': 0.01}


In [22]:
lm_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=True)

Best parameters for Logistic Regression with Resampling:
{'solver': 'saga', 'tol': 0.025}


In [23]:
lm_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=True, scoring='f1')

Best parameters for Logistic Regression with Resampling:
{'solver': 'saga', 'tol': 0.01}


In [24]:
lm_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=True, scoring='roc_auc')

Best parameters for Logistic Regression with Resampling:
{'solver': 'newton-cg', 'tol': 0.0025}


Since I am going to use oversampling when fitting my model, I based my choice mainly on the runs with resampling, and on recall (the default scoring I set), since it feels more important to have good recall when trying to predict death. Overall, the parameters I am going to use are: {'solver': 'saga', 'tol': 0.002}


#### Ensemble Model

In [25]:
def em_gridSearch(x_data, y_data, resampling=False, scoring='recall'):
  em_parameters = {'n_estimators': [10,50,100,250], 'criterion': ['gini', 'entropy'], 'max_samples': [0.5, 0.625, 0.75, 0.825, 1], 'max_depth': [3,5,7,9]}
  clf_em_grid = GridSearchCV(clf_em, em_parameters, scoring=scoring)
  if resampling:
      sampler = RandomOverSampler()
      pipeline = Pipeline([("sampler", sampler), ("classifier", clf_em_grid)])
      pipeline.fit(x_data, y_data)
      print("Best parameters for Random Forest with Resampling:")
      print(clf_em_grid.best_params_)
  else:
      clf_em_grid.fit(x_data, y_data)
      print("Best parameters for Random Forest without Resampling:")   
      print(clf_em_grid.best_params_)
  # return the parameters so that callers can keep them
  return clf_em_grid.best_params_

In [26]:
em_best_parms = em_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=False)

Best parameters for Random Forest without Resampling:
{'criterion': 'gini', 'max_depth': 9, 'max_samples': 0.75, 'n_estimators': 10}


In [27]:
em_best_parms = em_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=False, scoring='f1')

Best parameters for Random Forest without Resampling:
{'criterion': 'gini', 'max_depth': 9, 'max_samples': 0.825, 'n_estimators': 10}


In [28]:
em_best_parms = em_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=False, scoring='roc_auc')

Best parameters for Random Forest without Resampling:
{'criterion': 'entropy', 'max_depth': 5, 'max_samples': 0.825, 'n_estimators': 250}


In [29]:
em_best_parms = em_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=True)

Best parameters for Random Forest with Resampling:
{'criterion': 'entropy', 'max_depth': 9, 'max_samples': 1, 'n_estimators': 50}


In [30]:
em_best_parms = em_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=True, scoring='f1')

Best parameters for Random Forest with Resampling:
{'criterion': 'gini', 'max_depth': 9, 'max_samples': 0.75, 'n_estimators': 250}


In [31]:
em_best_parms = em_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=True, scoring='roc_auc')

Best parameters for Random Forest with Resampling:
{'criterion': 'gini', 'max_depth': 9, 'max_samples': 0.825, 'n_estimators': 250}


As we can see, when using resampling the preferred parameters are a higher sample rate, greater depth, and a larger number of estimators. I will use the following parameters: {'criterion': 'entropy', 'max_depth': 9, 'max_samples': 0.9, 'n_estimators': 100}
My choices were mainly based on the runs with resampling, since I am going to resample, and on recall, which feels like the most natural metric to choose by: it tells us how well the positive labels are predicted, which in my opinion is the main goal when trying to classify death cases.
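
One detail worth keeping in mind when reading the `max_samples` values: in scikit-learn's RandomForestClassifier an integer is interpreted as an absolute number of rows drawn per tree, while a float is interpreted as a fraction of the training set. The integer 1 in the grid therefore means one row per tree, not the whole dataset. A minimal sketch on synthetic data (the data and the small forests are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

x_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# Integer 1: every tree is fit on a single bootstrap row.
forest_int = RandomForestClassifier(n_estimators=5, max_samples=1,
                                    random_state=0).fit(x_toy, y_toy)
# Float 0.9: every tree is fit on 90% of the rows, drawn with replacement.
forest_frac = RandomForestClassifier(n_estimators=5, max_samples=0.9,
                                     random_state=0).fit(x_toy, y_toy)

# Number of nodes per tree: a tree that saw one row cannot split at all.
nodes_int = [tree.tree_.node_count for tree in forest_int.estimators_]
nodes_frac = [tree.tree_.node_count for tree in forest_frac.estimators_]
print(nodes_int)
print(nodes_frac)
```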

#### Deep Learning Model
In this model I tried ReLU and sigmoid (logistic) activations, and took computational efficiency into account when choosing.
I tried only a few network architectures and values of alpha (the penalty weight), since each model is expensive to compute, and I only checked with resampling since I am going to use it anyway.

In [39]:
def dl_gridSearch(x_data, y_data, resampling=False, scoring='recall'):
    dl_paramaters = {
    'hidden_layer_sizes': [(7,3,5), (5,5,3,2), (10,10)],
    'activation': ['logistic', 'relu'],
    'alpha': [0.0001, 0.005]
    }
    clf_dl_grid = GridSearchCV(clf_dl, dl_paramaters, scoring=scoring)
    if resampling:
      sampler = RandomOverSampler()
      pipeline = Pipeline([("sampler", sampler), ("classifier", clf_dl_grid)])
      pipeline.fit(x_data, y_data)
      print("Best parameters for Neural Network with Resampling:")
      print(clf_dl_grid.best_params_)
    else:
      clf_dl_grid.fit(x_data, y_data)
      print("Best parameters for  Neural Network without Resampling:")   
      print(clf_dl_grid.best_params_)
    # return the parameters so that callers can keep them
    return clf_dl_grid.best_params_

In [40]:
dl_best_parms = dl_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=True)

Best parameters for Neural Network with Resampling:
{'activation': 'relu', 'alpha': 0.005, 'hidden_layer_sizes': (7, 3, 5)}


In [41]:
dl_best_parms = dl_gridSearch(x_ndr_prcsd, y_ndr_prcsd, resampling=True, scoring='roc_auc')

Best parameters for Neural Network with Resampling:
{'activation': 'relu', 'alpha': 0.005, 'hidden_layer_sizes': (10, 10)}


Since the two scoring metrics did not agree on an architecture, I used the one with better recall, that is:
{'activation': 'relu', 'alpha': 0.005, 'hidden_layer_sizes': (7, 3, 5)}

## Models Performance

I used StratifiedKFold to ensure that the labels are distributed evenly across the train and test splits.

### Models Training

I used a 10-fold cross validation in these experiments.
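
As a small illustration of what stratification buys us, the sketch below splits made-up labels (10 positives out of 100, an assumption for the example) into 5 folds and counts the positives in each test fold. Each fold should receive the same share of the rare class.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.array([1] * 10 + [0] * 90)
x_toy = np.zeros((100, 1))  # the features play no role in how the folds are built

fold_sizes = []
fold_positives = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(x_toy, y_toy):
    fold_sizes.append(len(test_idx))
    fold_positives.append(int(y_toy[test_idx].sum()))

print(fold_sizes)
print(fold_positives)
```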

In [43]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score


In [127]:
class classifier_results():
    def __init__(self, classifier_name):
        self.classifier_name = classifier_name
        self.acc = []
        self.auc = []
        self.precision = []
        self.recall = []
        self.f1 = []
        
    def print_eval(self):
        mean_acc = sum(self.acc) / len(self.acc)
        mean_auc = sum(self.auc) / len(self.auc)
        mean_precision = sum(self.precision) / len(self.precision)
        mean_recall = sum(self.recall) / len(self.recall)
        mean_f1 = sum(self.f1) / len(self.f1)
        print(f'{self.classifier_name} - Accuracy: {mean_acc} AUC: {mean_auc} Percision: {mean_precision} Recall: {mean_recall} F1: {mean_f1}')
        
    def add_evaluation(self, y_test, y_pred): 
      self.acc.append(accuracy_score(y_test, y_pred))
      self.auc.append(roc_auc_score(y_test, y_pred))
      self.precision.append(precision_score(y_test, y_pred))
      self.recall.append(recall_score(y_test, y_pred))
      self.f1.append(f1_score(y_test, y_pred))
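
Note that `add_evaluation` passes the hard 0/1 predictions to `roc_auc_score`. With only two distinct values the ROC curve has a single operating point, and the result equals the mean of sensitivity and specificity. Passing continuous scores instead (for example the positive-class column of `predict_proba`, which is an assumption about how one might obtain them) can give a different number. A sketch on made-up values:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.45]          # made-up predicted probabilities
labels = [1 if s >= 0.5 else 0 for s in scores]  # hard predictions at a 0.5 threshold

auc_from_labels = roc_auc_score(y_true, labels)
auc_from_scores = roc_auc_score(y_true, scores)
print(auc_from_labels, auc_from_scores)
```

Here every positive is ranked above every negative, but one positive falls below the 0.5 threshold, so the two values differ.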

In [135]:
def run_experiment(x_ndr, y_ndr, clf_lm, clf_em, clf_dl):
    num_splits = 10
    cross_validator = StratifiedKFold(n_splits=num_splits)
    sampler = RandomOverSampler()
    lm_classifier_results = classifier_results('Logistic Regression')
    em_classifier_results = classifier_results('Random Forest')
    dl_classifier_results = classifier_results('Multi Layer Percepton')
 
    # Training loop
        
    for i, (train_idxs, test_idxs) in enumerate(cross_validator.split(x_ndr, y_ndr)):
        print(f'iteration {i+1} out of {num_splits} iterations.')
        x_train, y_train = x_ndr[train_idxs], y_ndr[train_idxs]
        x_test, y_test = x_ndr[test_idxs], y_ndr[test_idxs]
        
        x_train_resample, y_train_resample= sampler.fit_resample(x_train, y_train)
        
        clf_lm.fit(x_train_resample, y_train_resample)
        clf_em.fit(x_train_resample, y_train_resample)
        clf_dl.fit(x_train_resample, y_train_resample)
        
        
        clf_lm_pred = clf_lm.predict(x_test)
        clf_em_pred = clf_em.predict(x_test)
        clf_dl_pred = clf_dl.predict(x_test)
        
        lm_classifier_results.add_evaluation(y_test, clf_lm_pred)
        em_classifier_results.add_evaluation(y_test, clf_em_pred)
        dl_classifier_results.add_evaluation(y_test, clf_dl_pred)
    
    print("\n The Models Results: \n")
    lm_classifier_results.print_eval()
    em_classifier_results.print_eval()
    dl_classifier_results.print_eval()

        

In [136]:
#define the used classifiers
clf_lm = LogisticRegression(max_iter=10000, solver='saga', tol=0.002)
clf_em = RandomForestClassifier(criterion='entropy', max_depth=9, max_samples=0.9, n_estimators=100) 
clf_dl = MLPClassifier(max_iter=1000, solver='adam', activation='relu', alpha=0.005, hidden_layer_sizes=(7,3,5))

In [137]:
dataset_df = load_data(dataset_path)
x_df, y_df = split_x_y_df(dataset_df)
x_ndr_prcsd, y_ndr_prcsd = data_preprocess(x_df, y_df)

In [138]:
run_experiment(x_ndr_prcsd, y_ndr_prcsd,clf_lm, clf_em, clf_dl)

iteration 1 out of 10 iterations.
iteration 2 out of 10 iterations.
iteration 3 out of 10 iterations.
iteration 4 out of 10 iterations.
iteration 5 out of 10 iterations.
iteration 6 out of 10 iterations.
iteration 7 out of 10 iterations.
iteration 8 out of 10 iterations.
iteration 9 out of 10 iterations.
iteration 10 out of 10 iterations.

 The Models Results: 

Logistic Regression - Accuracy: 0.68288782830137 AUC: 0.6371409344006372 Percision: 0.11047514631941471 Recall: 0.5849170437405731 F1: 0.1857566597323889
Random Forest - Accuracy: 0.8711508540060487 AUC: 0.5671732245526969 Percision: 0.14434995715384585 Recall: 0.220211161387632 F1: 0.1740507539831301
Multi Layer Percepton - Accuracy: 0.5954499629959848 AUC: 0.5933811949882295 Percision: 0.0885519125054697 Recall: 0.5909879336349924 F1: 0.15341536179759235


### Model Comparison

Overall, as we can see from this comparison, logistic regression gives the best AUC and F1, and its recall (0.585) is practically tied with the highest one, from the multi-layer perceptron (0.591). Although Random Forest gets the best accuracy, its recall is really low, which means it would miss many of the crucial cases. Weighing all these factors together, the best model for this task is, in my opinion, logistic regression, since it gets good results on most metrics and is the easiest to compute. On the other hand, it depends on the type of clinical trial: if accuracy is more important, Random Forest is better, but in this domain it feels like the linear model works best.
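
The gap between accuracy and recall can be made concrete with the class counts printed after loading the data. As a rough illustration, a trivial classifier that always predicts "alive" (a hypothetical baseline, not one of the trained models) already reaches a high accuracy while detecting no deaths at all:

```python
dead, alive = 513, 7768  # class counts printed after loading the data
total = dead + alive

# A trivial classifier that always predicts "alive":
baseline_accuracy = alive / total  # every alive case is right, every death is missed
baseline_recall = 0 / dead         # no death is ever predicted

print(f"accuracy={baseline_accuracy:.4f} recall={baseline_recall:.1f}")
```

This is why a high accuracy on its own says little on this dataset, and why recall, F1 and AUC carry more weight in the comparison.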