## Subash Chandra Biswal (U77884251) ##
# Assignment 1 - Cardiotocography


## Introduction and Overview


Author: J. P. Marques de Sá, J. Bernardes, D. Ayers de Campos.  
Source: UCI  
Please cite: Ayres de Campos et al. (2000) SisPorto 2.0 A Program for Automated Analysis of Cardiotocograms. J Matern Fetal Med 5:311-318, UCI    

2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians and a consensus classification label assigned to each of them. Classification was both with respect to a morphologic pattern (A, B, C. ...) and to a fetal state (N, S, P). Therefore the dataset can be used either for 10-class or 3-class experiments.  

Attribute Information:  
LB - FHR baseline (beats per minute)  
AC - # of accelerations per second  
FM - # of fetal movements per second  
UC - # of uterine contractions per second  
DL - # of light decelerations per second  
DS - # of severe decelerations per second  
DP - # of prolongued decelerations per second  
ASTV - percentage of time with abnormal short term variability  
MSTV - mean value of short term variability  
ALTV - percentage of time with abnormal long term variability  
MLTV - mean value of long term variability  
Width - width of FHR histogram  
Min - minimum of FHR histogram  
Max - Maximum of FHR histogram  
Nmax - # of histogram peaks  
Nzeros - # of histogram zeros  
Mode - histogram mode  
Mean - histogram mean  
Median - histogram median  
Variance - histogram variance  
Tendency - histogram tendency  
CLASS - FHR pattern class code (1 to 10)  
NSP - fetal state class code (N=normal(1); S=suspect(2); P=pathologic(3))  

## Install and import necessary packages

In [2]:
# import packages
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier 
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

# set random seed to ensure that results are repeatable
np.random.seed(1)

## Load data 

In [3]:
X_train = pd.read_csv("./X_train.csv")
y_train = pd.read_csv("./y_train.csv")
X_test = pd.read_csv("./X_test.csv")
y_test = pd.read_csv("./y_test.csv")
X = pd.read_csv("./X.csv")
y = pd.read_csv("./y.csv")

## Performance Metrics ##
Since this is pharmacutical data and we are targeting suspects from various medical test data, we need to minimize the false negatives as this will cost somebody's life. This cost is significantly high as compared to false positive and in case of false positive the patient/insurance company needs to bear only the further investigation costs. 

Since this is a classification problem our score metrics is confusion matix and our measure of score is recall.

In [4]:
score_measure = "recall"
kfolds = 5

dtree = DecisionTreeClassifier()
svmm = SVC()
logreg = LogisticRegression()
adatree = AdaBoostClassifier()
rforest = RandomForestClassifier()
xgboost = XGBClassifier()
gboost = GradientBoostingClassifier()

##  Random search of parameter grids of all models ##

In [16]:
#Grid for Logistic Regression

param_grid_logr = [{
     'penalty': ['l1', 'l2', 'elasticnet', 'none'],
     'solver': ['saga'],
     'max_iter': np.arange(100,900),},
      {
     'penalty': ['l1', 'l2'],
     'solver': ['liblinear'],
     'max_iter': np.arange(100,900),},
    {
     'penalty': ['l2', 'none'],
     'solver': ['lbfgs'],
     'max_iter': np.arange(100,900),}
]    

rand_search_logr = RandomizedSearchCV(estimator = logreg, param_distributions=param_grid_logr, cv=kfolds, n_iter=100,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Logistic Regression model fit for grid search
_ = rand_search_logr.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_logr.best_score_}")
print(f"... with parameters: {rand_search_logr.best_params_}")

bestRecallLogr = rand_search_logr.best_estimator_

Fitting 5 folds for each of 100 candidates, totalling 500 fits
The best recall score is 0.9539958592132505
... with parameters: {'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 306}


70 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
70 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\scbis\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\scbis\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1471, in fit
    raise ValueError(
ValueError: l1_ratio must be between 0 and 1; got (l1_ratio=None)

        nan 0.76132505 0.89358178 0.64078675 0.41097308 0.87639752
 0.38501035        nan 0.89072464 0.40517598 0.95395445 0.7699793
 0.95395445 0.95399586 0.8447205  0.95399586 0.82173913 0.95399586
 0.61486542 0.6148654

In [17]:
# Grid for decision tree
param_grid_tree = {
    'min_samples_split': np.arange(1,100),  
    'min_samples_leaf': np.arange(1,100),
    'min_impurity_decrease': np.arange(0.0001, 0.01, 0.0005),
    'max_leaf_nodes': np.arange(5, 50), 
    'max_depth': np.arange(1,50), 
    'criterion': ['entropy', 'gini'],
}

rand_search_tree = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid_tree, cv=kfolds, n_iter=100,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Decision tree model fit for grid search
_ = rand_search_tree.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_tree.best_score_}")
print(f"... with parameters: {rand_search_tree.best_params_}")

bestRecallTree = rand_search_tree.best_estimator_

Fitting 5 folds for each of 100 candidates, totalling 500 fits
The best recall score is 0.9453416149068324
... with parameters: {'min_samples_split': 31, 'min_samples_leaf': 4, 'min_impurity_decrease': 0.0021, 'max_leaf_nodes': 43, 'max_depth': 15, 'criterion': 'entropy'}


5 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\scbis\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\scbis\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 937, in fit
    super().fit(
  File "C:\Users\scbis\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 250, in fit
    raise ValueError(
ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

 0.62335404 0.90815735 0.72401656 0.84186335 0.80488613 0.79039337
 0.88530021 

In [18]:
# Grid for SVM
param_grid_svm = [{
    'degree': [2,3],
    'C': [1,5,10],
    'kernel': ['poly'],   
},
{
    'C': [1,5,10],
    'gamma': [1, 0.1],
    'kernel': ['rbf'],   
},
{
    'C': [1,5,10],
    'kernel': ['linear'],  
}]

rand_search_svm = RandomizedSearchCV(estimator = svmm, param_distributions=param_grid_svm, cv=kfolds, n_iter=50,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# SVM model fit for grid search
_ = rand_search_svm.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_svm.best_score_}")
print(f"... with parameters: {rand_search_svm.best_params_}")

bestRecallSvm = rand_search_svm.best_estimator_



Fitting 5 folds for each of 15 candidates, totalling 75 fits


  y = column_or_1d(y, warn=True)


The best recall score is 0.9712629399585921
... with parameters: {'kernel': 'linear', 'C': 1}


In [19]:
#Grid for ADABoost Classifier

param_grid_ada = {  
     'n_estimators': [10,50,250,1000,2000],
     'learning_rate': [0.01,0.1,0.2,1.0],}   

rand_search_ada = RandomizedSearchCV(estimator = adatree, param_distributions=param_grid_ada, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# ADABoost Classifier model fit for grid search
_ = rand_search_ada.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_ada.best_score_}")
print(f"... with parameters: {rand_search_ada.best_params_}")

bestRecallAda = rand_search_ada.best_estimator_



Fitting 5 folds for each of 20 candidates, totalling 100 fits


  y = column_or_1d(y, warn=True)


The best recall score is 0.959792960662526
... with parameters: {'n_estimators': 1000, 'learning_rate': 0.1}


In [20]:
#Grid for Randomforest Classifier

param_grid_rf = {  
     'n_estimators': [10,50,250,1000,2000],
     'max_features': ['auto', 'sqrt', 'log2'],
     'max_depth' : [4,6,8,10],
     'criterion' :['gini', 'entropy', 'log_loss'],}   

rand_search_rf = RandomizedSearchCV(estimator = rforest, param_distributions=param_grid_rf, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# ADABoost Classifier model fit for grid search
_ = rand_search_rf.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_rf.best_score_}")
print(f"... with parameters: {rand_search_rf.best_params_}")

bestRecallRf = rand_search_rf.best_estimator_



Fitting 5 folds for each of 180 candidates, totalling 900 fits


300 fits failed out of a total of 900.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
300 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\scbis\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\scbis\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 450, in fit
    trees = Parallel(
  File "C:\Users\scbis\anaconda3\lib\site-packages\joblib\parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\Users\scbis\anaconda3\lib\site-packages\joblib\parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\scbis\ana

The best recall score is 0.959792960662526
... with parameters: {'n_estimators': 1000, 'max_features': 'auto', 'max_depth': 10, 'criterion': 'entropy'}


In [21]:
#Grid for XGBoost Classifier

param_grid_xg = {  
    'max_depth': range (2, 10, 1),
    'n_estimators': [10,50,250,1000,2000],
    'learning_rate': [1.0,0.2,0.1, 0.01, 0.05],}   

rand_search_xg = RandomizedSearchCV(estimator = xgboost, param_distributions=param_grid_xg, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# XGBoost Classifier model fit for grid search
_ = rand_search_xg.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_xg.best_score_}")
print(f"... with parameters: {rand_search_xg.best_params_}")

bestRecallXg = rand_search_xg.best_estimator_



Fitting 5 folds for each of 200 candidates, totalling 1000 fits
The best recall score is 0.9655072463768116
... with parameters: {'n_estimators': 250, 'max_depth': 4, 'learning_rate': 0.2}


In [22]:
#Grid for Gradient Boost Classifier

param_grid_gb = {  
    'min_samples_split': np.arange(1,20),  
    'min_samples_leaf': np.arange(1,12),
    'min_impurity_decrease': np.arange(0.0001, 0.01, 0.0005), 
    'loss': ['log_loss', 'deviance', 'exponential'],
    'criterion': ['friedman_mse', 'squared_error'],
    'max_depth': range (2, 10, 1),
    'n_estimators': [10,50,250,1000,2000],
    'learning_rate': [1.0,0.2,0.1, 0.01, 0.05],}   

rand_search_gb = RandomizedSearchCV(estimator = gboost, param_distributions=param_grid_gb, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Gradient Boost Classifier model fit for grid search
_ = rand_search_gb.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_gb.best_score_}")
print(f"... with parameters: {rand_search_gb.best_params_}")

bestRecallgb = rand_search_gb.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
The best recall score is 0.9655072463768116
... with parameters: {'n_estimators': 10, 'min_samples_split': 8, 'min_samples_leaf': 10, 'min_impurity_decrease': 0.0096, 'max_depth': 9, 'loss': 'exponential', 'learning_rate': 1.0, 'criterion': 'friedman_mse'}


980 fits failed out of a total of 2500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
905 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\scbis\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\scbis\anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 525, in fit
    self._check_params()
  File "C:\Users\scbis\anaconda3\lib\site-packages\sklearn\ensemble\_gb.py", line 282, in _check_params
    raise ValueError("Loss '{0:s}' not supported. ".format(self.loss))
ValueError: Loss 'log_loss' not supported. 

-------------------------------------------------------------------------

In [23]:
print("\n=========================LOGISTIC REGRESSION====================\n")
print(f"The best {score_measure} score is {rand_search_logr.best_score_}")
print(f"... with parameters: {rand_search_logr.best_params_}")
print("\n=========================DECISION TREE==========================\n")
print(f"The best {score_measure} score is {rand_search_tree.best_score_}")
print(f"... with parameters: {rand_search_tree.best_params_}")
print("\n==============================SVM===============================\n")
print(f"The best {score_measure} score is {rand_search_svm.best_score_}")
print(f"... with parameters: {rand_search_svm.best_params_}")
print("\n=========================ADABOOST===============================\n")
print(f"The best {score_measure} score is {rand_search_ada.best_score_}")
print(f"... with parameters: {rand_search_ada.best_params_}")
print("\n=========================RANDOMFOREST===========================\n")
print(f"The best {score_measure} score is {rand_search_rf.best_score_}")
print(f"... with parameters: {rand_search_rf.best_params_}")
print("\n=========================XGBOOST================================\n")
print(f"The best {score_measure} score is {rand_search_xg.best_score_}")
print(f"... with parameters: {rand_search_xg.best_params_}")
print("\n=========================GRADIENT BOOST================================\n")
print(f"The best {score_measure} score is {rand_search_gb.best_score_}")
print(f"... with parameters: {rand_search_gb.best_params_}")



The best recall score is 0.9539958592132505
... with parameters: {'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 306}


The best recall score is 0.9453416149068324
... with parameters: {'min_samples_split': 31, 'min_samples_leaf': 4, 'min_impurity_decrease': 0.0021, 'max_leaf_nodes': 43, 'max_depth': 15, 'criterion': 'entropy'}


The best recall score is 0.9712629399585921
... with parameters: {'kernel': 'linear', 'C': 1}


The best recall score is 0.959792960662526
... with parameters: {'n_estimators': 1000, 'learning_rate': 0.1}


The best recall score is 0.959792960662526
... with parameters: {'n_estimators': 1000, 'max_features': 'auto', 'max_depth': 10, 'criterion': 'entropy'}


The best recall score is 0.9655072463768116
... with parameters: {'n_estimators': 250, 'max_depth': 4, 'learning_rate': 0.2}


The best recall score is 0.9655072463768116
... with parameters: {'n_estimators': 10, 'min_samples_split': 8, 'min_samples_leaf': 10, 'min_impurity_decrease': 0.0096, 'max_d

### Final Grid Search ###

In [24]:
# ADABoosting classifier grid
param_grid_ada = {  
     'n_estimators': [800,1000,1200],
     'learning_rate': [0.07,0.1,0.13],}  


grid_search_ada = GridSearchCV(estimator = adatree, param_grid=param_grid_ada, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# ADABoosting classifier fit
_ = grid_search_ada.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_ada.best_score_}")
print(f"... with parameters: {grid_search_ada.best_params_}")

bestRecallAda = grid_search_ada.best_estimator_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


  y = column_or_1d(y, warn=True)


The best recall score is 0.959792960662526
... with parameters: {'learning_rate': 0.07, 'n_estimators': 800}


In [25]:
#Grid for XGBoost Classifier

param_grid_xg = {  
    'max_depth': [2,4,6],
    'n_estimators': [200,250,300],
    'learning_rate': [0.17,0.2,0.23],}   

grid_search_xg = GridSearchCV(estimator = xgboost, param_grid=param_grid_xg, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# XGBoost Classifier model fit for grid search
_ = grid_search_xg.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_xg.best_score_}")
print(f"... with parameters: {grid_search_xg.best_params_}")

bestRecallXg = grid_search_xg.best_estimator_

Fitting 5 folds for each of 27 candidates, totalling 135 fits
The best recall score is 0.9655072463768116
... with parameters: {'learning_rate': 0.2, 'max_depth': 4, 'n_estimators': 250}


In [26]:
#Grid for Randomforest Classifier
param_grid_rf = {  
     'n_estimators': [40,50,60],
     'max_features': ['sqrt'],
     'max_depth' : [8,10,12],
     'criterion' :['entropy'],}   

grid_search_rf = GridSearchCV(estimator = rforest, param_grid=param_grid_rf, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# ADABoost Classifier model fit for grid search
_ = grid_search_rf.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_rf.best_score_}")
print(f"... with parameters: {grid_search_rf.best_params_}")

bestRecallRf = grid_search_rf.best_estimator_

Fitting 5 folds for each of 9 candidates, totalling 45 fits


  self.best_estimator_.fit(X, y, **fit_params)


The best recall score is 0.9568944099378882
... with parameters: {'criterion': 'entropy', 'max_depth': 10, 'max_features': 'sqrt', 'n_estimators': 60}


In [27]:
# Grid for SVM
param_grid_svm = {
    'C': [1,2,3],
    'kernel': ['linear'],  
}

grid_search_svm = GridSearchCV(estimator = svmm, param_grid=param_grid_svm, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# SVM model fit for grid search
_ = grid_search_svm.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_svm.best_score_}")
print(f"... with parameters: {grid_search_svm.best_params_}")

bestRecallSvm = grid_search_svm.best_estimator_

Fitting 5 folds for each of 3 candidates, totalling 15 fits


  y = column_or_1d(y, warn=True)


The best recall score is 0.9712629399585921
... with parameters: {'C': 1, 'kernel': 'linear'}


In [28]:
# Grid for decision tree
 
param_grid_tree = {
    'min_samples_split': [29,31,33],  
    'min_samples_leaf': [2,4,6],
    'min_impurity_decrease': [0.0018,0.0021,0.0024],
    'max_leaf_nodes': [41,43,45], 
    'max_depth': [13,15,17], 
    'criterion': ['entropy'],
}

grid_search_tree = GridSearchCV(estimator = dtree, param_grid=param_grid_tree, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Decision tree model fit for grid search
_ = grid_search_tree.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_tree.best_score_}")
print(f"... with parameters: {grid_search_tree.best_params_}")

bestRecallTree = grid_search_tree.best_estimator_

Fitting 5 folds for each of 243 candidates, totalling 1215 fits
The best recall score is 0.94824016563147
... with parameters: {'criterion': 'entropy', 'max_depth': 13, 'max_leaf_nodes': 41, 'min_impurity_decrease': 0.0018, 'min_samples_leaf': 2, 'min_samples_split': 29}


In [29]:
#Grid for Logistic Regression

param_grid_logr = {
     'penalty': ['l1'],
     'solver': ['liblinear'],
     'max_iter': [180,200,220],
}

grid_search_logr = GridSearchCV(estimator = logreg, param_grid=param_grid_logr, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Logistic Regression model fit for grid search
_ = grid_search_logr.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_logr.best_score_}")
print(f"... with parameters: {grid_search_logr.best_params_}")

bestRecallLogr = grid_search_logr.best_estimator_

Fitting 5 folds for each of 3 candidates, totalling 15 fits
The best recall score is 0.9539958592132505
... with parameters: {'max_iter': 180, 'penalty': 'l1', 'solver': 'liblinear'}


  y = column_or_1d(y, warn=True)


In [30]:
#Grid for Gradient Boost Classifier

param_grid_gb = {  
    'min_samples_split': [13,15,17],  
    'min_samples_leaf': [5,7,9],
    'min_impurity_decrease': [0.0038,0.0041,0.0044], 
    'loss': ['deviance'],
    'criterion': ['friedman_mse'],
    'max_depth': [3,5,7],
    'n_estimators': [200,250,300],
    'learning_rate': [0.17,0.2,0.23],}   

grid_search_gb = GridSearchCV(estimator = gboost, param_grid=param_grid_gb, cv=kfolds,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

# Gradient Boost Classifier model fit for grid search
_ = grid_search_gb.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search_gb.best_score_}")
print(f"... with parameters: {grid_search_gb.best_params_}")

bestRecallgb = rand_search_gb.best_estimator_

Fitting 5 folds for each of 729 candidates, totalling 3645 fits


  y = column_or_1d(y, warn=True)


The best recall score is 0.9683643892339544
... with parameters: {'criterion': 'friedman_mse', 'learning_rate': 0.2, 'loss': 'deviance', 'max_depth': 5, 'min_impurity_decrease': 0.0038, 'min_samples_leaf': 7, 'min_samples_split': 13, 'n_estimators': 200}


## Final models with best parameters ##

In [31]:
dtree = DecisionTreeClassifier(criterion='entropy', max_depth=13, max_leaf_nodes=41, min_impurity_decrease=0.0018, min_samples_leaf=2, min_samples_split=29)
svmm = SVC(C=1, kernel='linear')
logreg = LogisticRegression(max_iter=180, penalty='l1', solver='liblinear')
adatree = AdaBoostClassifier(learning_rate=0.07, n_estimators=800)
rforest = RandomForestClassifier(criterion='entropy', max_depth=10, max_features='sqrt', n_estimators=50)
xgboost = XGBClassifier(learning_rate=0.2, max_depth=4, n_estimators=250)
gboost = GradientBoostingClassifier(criterion='friedman_mse', learning_rate=0.17, loss='deviance', max_depth=7, min_impurity_decrease=0.0038, min_samples_leaf=5, min_samples_split=17, n_estimators=250)

## Model fit for train dataset and prediction with test dataset ##

In [32]:
_ = xgboost.fit(X_train, y_train)
y_pred = xgboost.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

xgboost_recall = recall_score(y_test, y_pred)

      Model             Score       
************************************
>> Recall Score:  0.959349593495935
Accuracy Score:   0.9887218045112782
Precision Score:  0.9915966386554622
F1 Score:         0.9752066115702479


In [33]:
_ = rforest.fit(X_train, y_train)
y_pred = rforest.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

rforest_recall = recall_score(y_test, y_pred)

  _ = rforest.fit(X_train, y_train)


      Model             Score       
************************************
>> Recall Score:  0.967479674796748
Accuracy Score:   0.9906015037593985
Precision Score:  0.9916666666666667
F1 Score:         0.9794238683127573


In [34]:
_ = adatree.fit(X_train, y_train)
y_pred = adatree.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

adatree_recall = recall_score(y_test, y_pred)

  y = column_or_1d(y, warn=True)


      Model             Score       
************************************
>> Recall Score:  0.983739837398374
Accuracy Score:   0.9962406015037594
Precision Score:  1.0
F1 Score:         0.9918032786885246


In [35]:
_ = logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

logreg_recall = recall_score(y_test, y_pred)

      Model             Score       
************************************
>> Recall Score:  0.975609756097561
Accuracy Score:   0.9906015037593985
Precision Score:  0.9836065573770492
F1 Score:         0.9795918367346939


  y = column_or_1d(y, warn=True)


In [36]:
_ = svmm.fit(X_train, y_train)
y_pred = svmm.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

svmm_recall = recall_score(y_test, y_pred)

  y = column_or_1d(y, warn=True)


      Model             Score       
************************************
>> Recall Score:  0.975609756097561
Accuracy Score:   0.9943609022556391
Precision Score:  1.0
F1 Score:         0.9876543209876543


In [37]:
_ = dtree.fit(X_train, y_train)
y_pred = dtree.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

dtree_recall = recall_score(y_test, y_pred)

      Model             Score       
************************************
>> Recall Score:  0.975609756097561
Accuracy Score:   0.9868421052631579
Precision Score:  0.967741935483871
F1 Score:         0.97165991902834


In [5]:
_ = gboost.fit(X_train, y_train)
y_pred = gboost.predict(X_test)

print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

gboost_recall = recall_score(y_test, y_pred)

  y = column_or_1d(y, warn=True)


      Model             Score       
************************************
>> Recall Score:  0.975609756097561
Accuracy Score:   0.9943609022556391
Precision Score:  1.0
F1 Score:         0.9876543209876543


In [39]:
print("Recall scores...")
print(f"{'Decision Tree:':25}{dtree_recall}")
print(f"{'SVM:':25}{svmm_recall}")
print(f"{'Logistic Regression':25}{logreg_recall}")

print(f"{'Random Forest:':25}{rforest_recall}")
print(f"{'Ada Boosted Tree:':25}{adatree_recall}")
print(f"{'XGBoost Tree:':25}{xgboost_recall}")
print(f"{'Gradient Boosting':25}{gboost_recall}")

Recall scores...
Decision Tree:           0.975609756097561
SVM:                     0.975609756097561
Logistic Regression      0.975609756097561
Random Forest:           0.967479674796748
Ada Boosted Tree:        0.983739837398374
XGBoost Tree:            0.959349593495935
Gradient Boosting        0.991869918699187


## Analysis of Models ##

The analysis of recall score of models show that Grading Boosting is the best model (0.992). 
While XGBoost performs worst (0.959), the Decision Tree, SVM, and Logistic Regression perform same (0.976). The random forest performs better than XGBoost (.967), but not better than most of the models.

This shows that the esemble models perform better than the pruning models.

## Save Model to disk ##

In [6]:
import pickle

# save model
pickle.dump(gboost, open('./gboost_model.pkl', "wb"))

# If you wish to load this model later, simply use pickle.load method
#loaded_model = pickle.load(open('logistic_model_example01.pkl', "rb"))