## Subash Chandra Biswal (U77884251) ##
# WE08 - Cardiotocography


## Introduction and Overview


Author: J. P. Marques de Sá, J. Bernardes, D. Ayers de Campos.  
Source: UCI  
Please cite: Ayres de Campos et al. (2000) SisPorto 2.0 A Program for Automated Analysis of Cardiotocograms. J Matern Fetal Med 5:311-318, UCI    

2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians and a consensus classification label assigned to each of them. Classification was both with respect to a morphologic pattern (A, B, C. ...) and to a fetal state (N, S, P). Therefore the dataset can be used either for 10-class or 3-class experiments.  

Attribute Information:  
LB - FHR baseline (beats per minute)  
AC - # of accelerations per second  
FM - # of fetal movements per second  
UC - # of uterine contractions per second  
DL - # of light decelerations per second  
DS - # of severe decelerations per second  
DP - # of prolongued decelerations per second  
ASTV - percentage of time with abnormal short term variability  
MSTV - mean value of short term variability  
ALTV - percentage of time with abnormal long term variability  
MLTV - mean value of long term variability  
Width - width of FHR histogram  
Min - minimum of FHR histogram  
Max - Maximum of FHR histogram  
Nmax - # of histogram peaks  
Nzeros - # of histogram zeros  
Mode - histogram mode  
Mean - histogram mean  
Median - histogram median  
Variance - histogram variance  
Tendency - histogram tendency  
CLASS - FHR pattern class code (1 to 10)  
NSP - fetal state class code (N=normal(1); S=suspect(2); P=pathologic(3))  

## Install and import necessary packages

In [1]:
# import packages
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from __future__ import print_function

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score,classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier 
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from scikeras.wrappers import KerasClassifier
from keras.initializers import GlorotNormal
import tensorflow as tf
from tensorflow import keras
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

# set random seed to ensure that results are repeatable
np.random.seed(1)
tf.random.set_seed(1)

## Load data 

In [2]:
X_train = pd.read_csv("./X_train.csv")
y_train = pd.read_csv("./y_train.csv")
X_test = pd.read_csv("./X_test.csv")
y_test = pd.read_csv("./y_test.csv")
X = pd.read_csv("./X.csv")
y = pd.read_csv("./y.csv")

## Performance Metrics ##
Since this is pharmacutical data and we are targeting suspects from various medical test data, we need to minimize the false negatives as this will cost somebody's life. This cost is significantly high as compared to false positive and in case of false positive the patient/insurance company needs to bear only the further investigation costs. 

Since this is a classification problem our score metrics is confusion matix and our measure of score is recall.

In [3]:
score_measure = "recall"
kfolds = 5

dtree = DecisionTreeClassifier()
svmm = SVC()
logreg = LogisticRegression()
adatree = AdaBoostClassifier()
rforest = RandomForestClassifier()
xgboost = XGBClassifier()
gboost = GradientBoostingClassifier()
ann = MLPClassifier()



##  Random search of parameter grids of all models ##

In [4]:
#Grid for Logistic Regression

param_grid_logr = [{
     'penalty': ['l1', 'l2', 'elasticnet', 'none'],
     'solver': ['saga'],
     'max_iter': np.arange(100,900),},
      {
     'penalty': ['l1', 'l2'],
     'solver': ['liblinear'],
     'max_iter': np.arange(100,900),},
    {
     'penalty': ['l2', 'none'],
     'solver': ['lbfgs'],
     'max_iter': np.arange(100,900),}
]    

rand_search_logr = RandomizedSearchCV(estimator = logreg, param_distributions=param_grid_logr, cv=kfolds, n_iter=100,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Logistic Regression model fit for grid search
_ = rand_search_logr.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_logr.best_score_}")
print(f"... with parameters: {rand_search_logr.best_params_}")

bestRecallLogr = rand_search_logr.best_estimator_

Fitting 5 folds for each of 100 candidates, totalling 500 fits
The best recall score is 0.9539958592132505
... with parameters: {'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 306}


70 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
70 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\scbis\anaconda3\envs\tf\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\scbis\anaconda3\envs\tf\lib\site-packages\sklearn\linear_model\_logistic.py", line 1291, in fit
    fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, prefer=prefer)(
  File "C:\Users\scbis\anaconda3\envs\tf\lib\site-packages\sklearn\utils\parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "C:\Users\scbis\anaconda3\envs\tf\lib\site-p

In [5]:
# Grid for decision tree
param_grid_tree = {
    'min_samples_split': np.arange(1,100),  
    'min_samples_leaf': np.arange(1,100),
    'min_impurity_decrease': np.arange(0.0001, 0.01, 0.0005),
    'max_leaf_nodes': np.arange(5, 50), 
    'max_depth': np.arange(1,50), 
    'criterion': ['entropy', 'gini'],
}

rand_search_tree = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid_tree, cv=kfolds, n_iter=100,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Decision tree model fit for grid search
_ = rand_search_tree.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_tree.best_score_}")
print(f"... with parameters: {rand_search_tree.best_params_}")

bestRecallTree = rand_search_tree.best_estimator_

Fitting 5 folds for each of 100 candidates, totalling 500 fits
The best recall score is 0.9453416149068324
... with parameters: {'min_samples_split': 31, 'min_samples_leaf': 4, 'min_impurity_decrease': 0.0021, 'max_leaf_nodes': 43, 'max_depth': 15, 'criterion': 'entropy'}


5 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\scbis\anaconda3\envs\tf\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\scbis\anaconda3\envs\tf\lib\site-packages\sklearn\tree\_classes.py", line 889, in fit
    super().fit(
  File "C:\Users\scbis\anaconda3\envs\tf\lib\site-packages\sklearn\tree\_classes.py", line 177, in fit
    self._validate_params()
  File "C:\Users\scbis\anaconda3\envs\tf\lib\site-packages\sklearn\base.py", line 600, in _validate_params
    validate_parameter_constraints(
  File "C:\User

In [6]:
# Grid for SVM
param_grid_svm = [{
    'degree': [2,3],
    'C': [1,5,10],
    'kernel': ['poly'],   
},
{
    'C': [1,5,10],
    'gamma': [1, 0.1],
    'kernel': ['rbf'],   
},
{
    'C': [1,5,10],
    'kernel': ['linear'],  
}]

rand_search_svm = RandomizedSearchCV(estimator = svmm, param_distributions=param_grid_svm, cv=kfolds, n_iter=50,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# SVM model fit for grid search
_ = rand_search_svm.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_svm.best_score_}")
print(f"... with parameters: {rand_search_svm.best_params_}")

bestRecallSvm = rand_search_svm.best_estimator_



Fitting 5 folds for each of 15 candidates, totalling 75 fits


  y = column_or_1d(y, warn=True)


The best recall score is 0.9712629399585921
... with parameters: {'kernel': 'linear', 'C': 1}


In [7]:
#Grid for ADABoost Classifier

param_grid_ada = {  
     'n_estimators': [10,50,250,1000,2000],
     'learning_rate': [0.01,0.1,0.2,1.0],}   

rand_search_ada = RandomizedSearchCV(estimator = adatree, param_distributions=param_grid_ada, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# ADABoost Classifier model fit for grid search
_ = rand_search_ada.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_ada.best_score_}")
print(f"... with parameters: {rand_search_ada.best_params_}")

bestRecallAda = rand_search_ada.best_estimator_



Fitting 5 folds for each of 20 candidates, totalling 100 fits


  y = column_or_1d(y, warn=True)


The best recall score is 0.959792960662526
... with parameters: {'n_estimators': 1000, 'learning_rate': 0.1}


In [8]:
#Grid for Randomforest Classifier

param_grid_rf = {  
     'n_estimators': [10,50,250,1000,2000],
     'max_features': ['auto', 'sqrt', 'log2'],
     'max_depth' : [4,6,8,10],
     'criterion' :['gini', 'entropy', 'log_loss'],}   

rand_search_rf = RandomizedSearchCV(estimator = rforest, param_distributions=param_grid_rf, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# ADABoost Classifier model fit for grid search
_ = rand_search_rf.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_rf.best_score_}")
print(f"... with parameters: {rand_search_rf.best_params_}")

bestRecallRf = rand_search_rf.best_estimator_



Fitting 5 folds for each of 180 candidates, totalling 900 fits


  self.best_estimator_.fit(X, y, **fit_params)
  warn(


The best recall score is 0.959792960662526
... with parameters: {'n_estimators': 250, 'max_features': 'auto', 'max_depth': 10, 'criterion': 'entropy'}


In [9]:
#Grid for XGBoost Classifier

param_grid_xg = {  
    'max_depth': range (2, 10, 1),
    'n_estimators': [10,50,250,1000,2000],
    'learning_rate': [1.0,0.2,0.1, 0.01, 0.05],}   

rand_search_xg = RandomizedSearchCV(estimator = xgboost, param_distributions=param_grid_xg, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# XGBoost Classifier model fit for grid search
_ = rand_search_xg.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_xg.best_score_}")
print(f"... with parameters: {rand_search_xg.best_params_}")

bestRecallXg = rand_search_xg.best_estimator_



Fitting 5 folds for each of 200 candidates, totalling 1000 fits
The best recall score is 0.9655072463768116
... with parameters: {'n_estimators': 250, 'max_depth': 4, 'learning_rate': 0.2}


In [10]:
#Grid for Gradient Boost Classifier

param_grid_gb = {  
    'min_samples_split': np.arange(1,20),  
    'min_samples_leaf': np.arange(1,12),
    'min_impurity_decrease': np.arange(0.0001, 0.01, 0.0005), 
    'loss': ['log_loss', 'deviance', 'exponential'],
    'criterion': ['friedman_mse', 'squared_error'],
    'max_depth': range (2, 10, 1),
    'n_estimators': [10,50,250,1000,2000],
    'learning_rate': [1.0,0.2,0.1, 0.01, 0.05],}   

rand_search_gb = RandomizedSearchCV(estimator = gboost, param_distributions=param_grid_gb, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Gradient Boost Classifier model fit for grid search
_ = rand_search_gb.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_gb.best_score_}")
print(f"... with parameters: {rand_search_gb.best_params_}")

bestRecallgb = rand_search_gb.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits


160 fits failed out of a total of 2500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
160 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\scbis\anaconda3\envs\tf\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\scbis\anaconda3\envs\tf\lib\site-packages\sklearn\ensemble\_gb.py", line 420, in fit
    self._validate_params()
  File "C:\Users\scbis\anaconda3\envs\tf\lib\site-packages\sklearn\base.py", line 600, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\scbis\anaconda3\envs\tf\lib\site-packages\sklearn\utils\_param_validation.py", line 97, in validate_parameter_cons

The best recall score is 0.9655486542443065
... with parameters: {'n_estimators': 2000, 'min_samples_split': 19, 'min_samples_leaf': 6, 'min_impurity_decrease': 0.0071, 'max_depth': 2, 'loss': 'deviance', 'learning_rate': 0.2, 'criterion': 'friedman_mse'}


In [11]:
%%time

param_grid_nn = {
    'hidden_layer_sizes': [ (50,), (70,), (100,), (50,30), (40,20), (50,30, 20), (70,50,40)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['adam', 'sgd'],
    'alpha': [0, .2, .5, .7, 1],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'learning_rate_init': [0.001, 0.01, 0.1, 0.2, 0.5],
    'max_iter': [5000]
}


rand_search_nn = RandomizedSearchCV(estimator = ann, param_distributions=param_grid_nn, cv=kfolds, n_iter=100,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search_nn.fit(X_train, y_train)

bestRecallNn = rand_search_nn.best_estimator_

print(rand_search_nn.best_params_)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


  y = column_or_1d(y, warn=True)


{'solver': 'adam', 'max_iter': 5000, 'learning_rate_init': 0.001, 'learning_rate': 'invscaling', 'hidden_layer_sizes': (50,), 'alpha': 0, 'activation': 'logistic'}
CPU times: total: 2.81 s
Wall time: 2min


In [12]:
%%time
model = keras.models.Sequential()
recall_metric = tf.keras.metrics.Recall()
def build_clf(hidden_layer_sizes, dropout):
    ann = tf.keras.models.Sequential()
    ann.add(keras.layers.Input(shape=36)),
    for hidden_layer_size in hidden_layer_sizes:
        model.add(keras.layers.Dense(hidden_layer_size, kernel_initializer= tf.keras.initializers.GlorotNormal(seed=1), 
                                     bias_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=None), activation="relu"))
        model.add(keras.layers.Dropout(dropout))
    ann.add(tf.keras.layers.Dense(1, activation='sigmoid'))
    ann.compile(loss = 'binary_crossentropy', metrics = [recall_metric])
    return ann


from scikeras.wrappers import KerasClassifier

keras_clf = KerasClassifier(
    model=build_clf,
    hidden_layer_sizes=36,
    dropout = 0.0
)


params = {
    'optimizer__learning_rate': [0.0005, 0.001, 0.005],
    'model__hidden_layer_sizes': [(90, ), (100,), (100, 90), (100,90,70)],
    'model__dropout': [0, 0.1],
    'batch_size':[20, 60, 100],
    'epochs':[10, 50, 100],
    'optimizer':["adam",'sgd']
}
keras_clf.get_params().keys()


rnd_search_cv = RandomizedSearchCV(estimator=keras_clf, param_distributions=params, scoring='recall', n_iter=50, cv=5)

import sys
sys.setrecursionlimit(10000) # note: the default is 3000 (python 3.9)

earlystop = EarlyStopping(monitor='val_loss', patience=5, verbose=0, mode='auto')
callback = [earlystop]

_ = rnd_search_cv.fit(X_train, y_train, callbacks=callback, verbose=0)

rnd_search_cv.best_params_

best_net = rnd_search_cv.best_estimator_
print(rnd_search_cv.best_params_)



{'optimizer__learning_rate': 0.0005, 'optimizer': 'sgd', 'model__hidden_layer_sizes': (100, 90), 'model__dropout': 0, 'epochs': 100, 'batch_size': 20}
CPU times: total: 7min 40s
Wall time: 24min 9s


In [13]:
print("\n=========================LOGISTIC REGRESSION====================\n")
print(f"The best {score_measure} score is {rand_search_logr.best_score_}")
print(f"... with parameters: {rand_search_logr.best_params_}")
print("\n=========================DECISION TREE==========================\n")
print(f"The best {score_measure} score is {rand_search_tree.best_score_}")
print(f"... with parameters: {rand_search_tree.best_params_}")
print("\n==============================SVM===============================\n")
print(f"The best {score_measure} score is {rand_search_svm.best_score_}")
print(f"... with parameters: {rand_search_svm.best_params_}")
print("\n=========================ADABOOST===============================\n")
print(f"The best {score_measure} score is {rand_search_ada.best_score_}")
print(f"... with parameters: {rand_search_ada.best_params_}")
print("\n=========================RANDOMFOREST===========================\n")
print(f"The best {score_measure} score is {rand_search_rf.best_score_}")
print(f"... with parameters: {rand_search_rf.best_params_}")
print("\n=========================XGBOOST================================\n")
print(f"The best {score_measure} score is {rand_search_xg.best_score_}")
print(f"... with parameters: {rand_search_xg.best_params_}")
print("\n=========================GRADIENT BOOST================================\n")
print(f"The best {score_measure} score is {rand_search_gb.best_score_}")
print(f"... with parameters: {rand_search_gb.best_params_}")
print("\n=========================NEURAL NETWORK================================\n")
print(f"The best {score_measure} score is {rand_search_nn.best_score_}")
print(f"... with parameters: {rand_search_nn.best_params_}")
print("\n=========================DEEP NEURAL NETWORK================================\n")
print(f"The best {score_measure} score is {rnd_search_cv.best_score_}")
print(f"... with parameters: {rnd_search_cv.best_params_}")



The best recall score is 0.9539958592132505
... with parameters: {'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 306}


The best recall score is 0.9453416149068324
... with parameters: {'min_samples_split': 31, 'min_samples_leaf': 4, 'min_impurity_decrease': 0.0021, 'max_leaf_nodes': 43, 'max_depth': 15, 'criterion': 'entropy'}


The best recall score is 0.9712629399585921
... with parameters: {'kernel': 'linear', 'C': 1}


The best recall score is 0.959792960662526
... with parameters: {'n_estimators': 1000, 'learning_rate': 0.1}


The best recall score is 0.959792960662526
... with parameters: {'n_estimators': 250, 'max_features': 'auto', 'max_depth': 10, 'criterion': 'entropy'}


The best recall score is 0.9655072463768116
... with parameters: {'n_estimators': 250, 'max_depth': 4, 'learning_rate': 0.2}


The best recall score is 0.9655486542443065
... with parameters: {'n_estimators': 2000, 'min_samples_split': 19, 'min_samples_leaf': 6, 'min_impurity_decrease': 0.0071, 'max_

### Final Grid Search ###

In [14]:
# ADABoosting classifier grid
param_grid_ada = {  
     'n_estimators': [800,1000,1200],
     'learning_rate': [0.07,0.1,0.13],}  


grid_search_ada = GridSearchCV(estimator = adatree, param_grid=param_grid_ada, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# ADABoosting classifier fit
_ = grid_search_ada.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_ada.best_score_}")
print(f"... with parameters: {grid_search_ada.best_params_}")

bestRecallAda = grid_search_ada.best_estimator_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


  y = column_or_1d(y, warn=True)


The best recall score is 0.959792960662526
... with parameters: {'learning_rate': 0.07, 'n_estimators': 800}


In [15]:
#Grid for XGBoost Classifier

param_grid_xg = {  
    'max_depth': [2,4,6],
    'n_estimators': [200,250,300],
    'learning_rate': [0.17,0.2,0.23],}   

grid_search_xg = GridSearchCV(estimator = xgboost, param_grid=param_grid_xg, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# XGBoost Classifier model fit for grid search
_ = grid_search_xg.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_xg.best_score_}")
print(f"... with parameters: {grid_search_xg.best_params_}")

bestRecallXg = grid_search_xg.best_estimator_

Fitting 5 folds for each of 27 candidates, totalling 135 fits
The best recall score is 0.9655072463768116
... with parameters: {'learning_rate': 0.2, 'max_depth': 4, 'n_estimators': 250}


In [16]:
#Grid for Randomforest Classifier
param_grid_rf = {  
     'n_estimators': [40,50,60],
     'max_features': ['sqrt'],
     'max_depth' : [8,10,12],
     'criterion' :['entropy'],}   

grid_search_rf = GridSearchCV(estimator = rforest, param_grid=param_grid_rf, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# ADABoost Classifier model fit for grid search
_ = grid_search_rf.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_rf.best_score_}")
print(f"... with parameters: {grid_search_rf.best_params_}")

bestRecallRf = grid_search_rf.best_estimator_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


  self.best_estimator_.fit(X, y, **fit_params)


The best recall score is 0.9569358178053831
... with parameters: {'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 40}


In [17]:
# Grid for SVM
param_grid_svm = {
    'C': [1,2,3],
    'kernel': ['linear'],  
}

grid_search_svm = GridSearchCV(estimator = svmm, param_grid=param_grid_svm, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# SVM model fit for grid search
_ = grid_search_svm.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_svm.best_score_}")
print(f"... with parameters: {grid_search_svm.best_params_}")

bestRecallSvm = grid_search_svm.best_estimator_

Fitting 5 folds for each of 3 candidates, totalling 15 fits


  y = column_or_1d(y, warn=True)


The best recall score is 0.9712629399585921
... with parameters: {'C': 1, 'kernel': 'linear'}


In [18]:
# Grid for decision tree
 
param_grid_tree = {
    'min_samples_split': [29,31,33],  
    'min_samples_leaf': [2,4,6],
    'min_impurity_decrease': [0.0018,0.0021,0.0024],
    'max_leaf_nodes': [41,43,45], 
    'max_depth': [13,15,17], 
    'criterion': ['entropy'],
}

grid_search_tree = GridSearchCV(estimator = dtree, param_grid=param_grid_tree, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Decision tree model fit for grid search
_ = grid_search_tree.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_tree.best_score_}")
print(f"... with parameters: {grid_search_tree.best_params_}")

bestRecallTree = grid_search_tree.best_estimator_

Fitting 5 folds for each of 243 candidates, totalling 1215 fits
The best recall score is 0.94824016563147
... with parameters: {'criterion': 'entropy', 'max_depth': 13, 'max_leaf_nodes': 41, 'min_impurity_decrease': 0.0018, 'min_samples_leaf': 2, 'min_samples_split': 29}


In [19]:
#Grid for Logistic Regression

param_grid_logr = {
     'penalty': ['l1'],
     'solver': ['liblinear'],
     'max_iter': [280,306,330],
}

grid_search_logr = GridSearchCV(estimator = logreg, param_grid=param_grid_logr, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Logistic Regression model fit for grid search
_ = grid_search_logr.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_logr.best_score_}")
print(f"... with parameters: {grid_search_logr.best_params_}")

bestRecallLogr = grid_search_logr.best_estimator_

Fitting 5 folds for each of 3 candidates, totalling 15 fits
The best recall score is 0.9539958592132505
... with parameters: {'max_iter': 280, 'penalty': 'l1', 'solver': 'liblinear'}


  y = column_or_1d(y, warn=True)


In [20]:
#Grid for Gradient Boost Classifier

param_grid_gb = {  
    'min_samples_split': [17,19,21],  
    'min_samples_leaf': [6,8,10],
    'min_impurity_decrease': [0.0059,0.0061,0.0063], 
    'loss': ['deviance'],
    'criterion': ['friedman_mse'],
    'max_depth': [4,6,8],
    'n_estimators': [200,250,300],
    'learning_rate': [0.07,0.1,0.13],}   

grid_search_gb = GridSearchCV(estimator = gboost, param_grid=param_grid_gb, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Gradient Boost Classifier model fit for grid search
_ = grid_search_gb.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_gb.best_score_}")
print(f"... with parameters: {grid_search_gb.best_params_}")

bestRecallgb = rand_search_gb.best_estimator_

Fitting 5 folds for each of 729 candidates, totalling 3645 fits


  y = column_or_1d(y, warn=True)


The best recall score is 0.9626501035196687
... with parameters: {'criterion': 'friedman_mse', 'learning_rate': 0.07, 'loss': 'deviance', 'max_depth': 6, 'min_impurity_decrease': 0.0059, 'min_samples_leaf': 8, 'min_samples_split': 17, 'n_estimators': 200}


In [21]:
%%time
 
param_grid_nn = {
    'hidden_layer_sizes': [ (40,), (50,), (60,)],
    'activation': ['logistic'],
    'solver': ['adam'],
    'alpha': [0,.3, .5, .7],
    'learning_rate': ['invscaling'],
    'learning_rate_init': [0.0007, 0.001, 0.0013 ],
    'max_iter': [4500,5000,5500]
}

ann = MLPClassifier()
grid_search_nn = GridSearchCV(estimator = ann, param_grid=param_grid_nn, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search_nn.fit(X_train, y_train)

bestRecallNn = grid_search_nn.best_estimator_

print(grid_search_nn.best_params_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


  y = column_or_1d(y, warn=True)


{'activation': 'logistic', 'alpha': 0, 'hidden_layer_sizes': (50,), 'learning_rate': 'invscaling', 'learning_rate_init': 0.0007, 'max_iter': 4500, 'solver': 'adam'}
CPU times: total: 3.86 s
Wall time: 2min 33s


In [22]:
%%time
 
model = keras.models.Sequential()
recall_metric = tf.keras.metrics.Recall()
def build_clf(hidden_layer_sizes, dropout):
    ann = tf.keras.models.Sequential()
    ann.add(keras.layers.Input(shape=36)),
    for hidden_layer_size in hidden_layer_sizes:
        model.add(keras.layers.Dense(hidden_layer_size, kernel_initializer= tf.keras.initializers.GlorotNormal(seed=1), 
                                     bias_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=None), activation="relu"))
        model.add(keras.layers.Dropout(dropout))
    ann.add(tf.keras.layers.Dense(1, activation='sigmoid'))
    ann.compile(loss = 'binary_crossentropy', metrics = [recall_metric])
    return ann


from scikeras.wrappers import KerasClassifier

keras_clf = KerasClassifier(
    model=build_clf,
    hidden_layer_sizes=36,
    dropout = 0.0
)


params = {
    'optimizer__learning_rate': [0.0002,0.0005, 0.0008],
    'model__hidden_layer_sizes': [(100,70), (100, 90), (120,90)],
    'model__dropout': [0, 0.1],
    'batch_size':[10, 20, 30],
    'epochs':[80,100,120],
    'optimizer':['sgd']
}
keras_clf.get_params().keys()


grid_search_cv = GridSearchCV(estimator=keras_clf, param_grid=params, scoring='recall', cv=5)

import sys
sys.setrecursionlimit(10000) # note: the default is 3000 (python 3.9)

earlystop = EarlyStopping(monitor='val_loss', patience=5, verbose=0, mode='auto')
callback = [earlystop]

_ = grid_search_cv.fit(X_train, y_train, callbacks=callback, verbose=0)

grid_search_cv.best_params_

bestRecallnet = grid_search_cv.best_estimator_
print(grid_search_cv.best_params_)



{'batch_size': 10, 'epochs': 100, 'model__dropout': 0.1, 'model__hidden_layer_sizes': (100, 90), 'optimizer': 'sgd', 'optimizer__learning_rate': 0.0005}
CPU times: total: 8h 45min 46s
Wall time: 4h 33min 4s


## Final models with best parameters ##

In [23]:
dtree = bestRecallTree
svmm = bestRecallSvm  
logreg = bestRecallLogr
adatree = bestRecallAda
rforest = bestRecallRf
xgboost = bestRecallXg
gboost = bestRecallgb
ann = bestRecallNn
dnn = bestRecallnet

## Model fit for train dataset and prediction with test dataset ##

In [24]:
_ = xgboost.fit(X_train, y_train)
y_pred = xgboost.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

xgboost_recall = recall_score(y_test, y_pred)

      Model             Score       
************************************
>> Recall Score:  0.959349593495935
Accuracy Score:   0.9887218045112782
Precision Score:  0.9915966386554622
F1 Score:         0.9752066115702479


In [25]:
_ = rforest.fit(X_train, y_train)
y_pred = rforest.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

rforest_recall = recall_score(y_test, y_pred)

  _ = rforest.fit(X_train, y_train)


      Model             Score       
************************************
>> Recall Score:  0.975609756097561
Accuracy Score:   0.9924812030075187
Precision Score:  0.9917355371900827
F1 Score:         0.9836065573770492


In [26]:
_ = adatree.fit(X_train, y_train)
y_pred = adatree.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

adatree_recall = recall_score(y_test, y_pred)

  y = column_or_1d(y, warn=True)


      Model             Score       
************************************
>> Recall Score:  0.983739837398374
Accuracy Score:   0.9962406015037594
Precision Score:  1.0
F1 Score:         0.9918032786885246


In [27]:
_ = logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

logreg_recall = recall_score(y_test, y_pred)

  y = column_or_1d(y, warn=True)


      Model             Score       
************************************
>> Recall Score:  0.975609756097561
Accuracy Score:   0.9906015037593985
Precision Score:  0.9836065573770492
F1 Score:         0.9795918367346939


In [28]:
_ = svmm.fit(X_train, y_train)
y_pred = svmm.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

svmm_recall = recall_score(y_test, y_pred)

  y = column_or_1d(y, warn=True)


      Model             Score       
************************************
>> Recall Score:  0.975609756097561
Accuracy Score:   0.9943609022556391
Precision Score:  1.0
F1 Score:         0.9876543209876543


In [29]:
_ = dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

dtree_recall = recall_score(y_test, y_pred)

      Model             Score       
************************************
>> Recall Score:  0.967479674796748
Accuracy Score:   0.9830827067669173
Precision Score:  0.9596774193548387
F1 Score:         0.9635627530364373


In [30]:
_ = gboost.fit(X_train, y_train)
y_pred = gboost.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

gboost_recall = recall_score(y_test, y_pred)

  y = column_or_1d(y, warn=True)


      Model             Score       
************************************
>> Recall Score:  0.983739837398374
Accuracy Score:   0.9962406015037594
Precision Score:  1.0
F1 Score:         0.9918032786885246


In [31]:
_ = ann.fit(X_train, y_train)
y_pred = ann.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

ann_recall = recall_score(y_test, y_pred)

  y = column_or_1d(y, warn=True)


      Model             Score       
************************************
>> Recall Score:  0.959349593495935
Accuracy Score:   0.9887218045112782
Precision Score:  0.9915966386554622
F1 Score:         0.9752066115702479


In [32]:
_ = dnn.fit(X_train, y_train)
y_pred = dnn.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

dnn_recall = recall_score(y_test, y_pred)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [33]:
print("Recall scores...")
print(f"{'Decision Tree:':25}{dtree_recall}")
print(f"{'SVM:':25}{svmm_recall}")
print(f"{'Logistic Regression':25}{logreg_recall}")
print("\n===========================================\n")
print(f"{'Random Forest:':25}{rforest_recall}")
print(f"{'Ada Boosted Tree:':25}{adatree_recall}")
print(f"{'XGBoost Tree:':25}{xgboost_recall}")
print(f"{'Gradient Boosting':25}{gboost_recall}")
print("\n===========================================\n")
print(f"{'Neural Network':25}{ann_recall}")
print(f"{'Deep Neural Network':25}{dnn_recall}")

Recall scores...
Decision Tree:           0.967479674796748
SVM:                     0.975609756097561
Logistic Regression      0.975609756097561


Random Forest:           0.975609756097561
Ada Boosted Tree:        0.983739837398374
XGBoost Tree:            0.959349593495935
Gradient Boosting        0.983739837398374


Neural Network           0.959349593495935
Deep Neural Network      0.967479674796748


## Analysis of Models ##

The analysis of recall score of all the models shows that the noural network doesnot perform better than ensemble models. The neural network scores 0.96 and deep neural network scores 0.97 which are same as decision tree and random forest but the gradient bossting model and Ada Boost are the best with recall score of 0.98.

## Save Model to disk ##

In [34]:
import pickle

# save model
pickle.dump(gboost, open('./gboost_model.pkl', "wb"))

# If you wish to load this model later, simply use pickle.load method
#loaded_model = pickle.load(open('logistic_model_example01.pkl', "rb"))