# **Cross Validation & Hyperparameter Tuning** 

**Part 1** - Feature Engineering --> ../content/04_data_preprocessing_&_feature_engineering/Solution_Classification_preprocessing.ipynb

**Part 2** - Model Building & Evaluation --> ../content/05_supervised_machine_learning/Classification_Part2_Model building & Evaluation.ipynb


This notebook contains Hyperparameter Tuning of Various Machine Learning algorithms and comparision of their performances.

**Input Data:** Cleaned and pre-processed files from Feature engineering session are used as input.

    x_train.csv, x_test.csv, y_train.csv, y_test.csv
    Location --> ../content/04_data_preprocessing_&_feature_engineering/Solution_Classification_preprocessing.ipynb

#### Importing Libraries

In [1]:
#importing required libraries for data analysis
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

#Visuals and Time libraries
import matplotlib.pyplot as plt
import time

# Import tuning library from sklearn
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

#Import Data balancing libraries
import imblearn
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Import evaluation metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, roc_curve, auc

In [2]:
# Read the training & test datasets from Part1- Feature Engineering solution 

x_train=pd.read_csv('../datasets/classification/processed/X_train.csv', index_col=0)
x_test=pd.read_csv('../datasets/classification/processed/X_test.csv', index_col=0)

y_train=pd.read_csv('../datasets/classification/processed/y_train.csv', index_col=0)
y_test=pd.read_csv('../datasets/classification/processed/y_test.csv', index_col=0)

print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)

(20659, 19) (8984, 19) (20659, 1) (8984, 1)


In [3]:
#importing SMOTETomek to handle class imbalance

balanced_data = SMOTETomek(random_state=42)

# fit predictor and target variable
x_smote, y_smote = balanced_data.fit_resample(x_train, y_train)

In [4]:
#To fix DataConversionWarning
y_smote = y_smote.values.ravel()
y_test = y_test.values.ravel()

##### Cross-Validation:
    
    The Cross-Validation then iterates through the folds and at each iteration uses one of the K folds as the validation set while using all remaining folds as the training set. This process is repeated until every fold has been used as a validation set. 
    K fold --> CV = x   where x Indicates number of iterations

##### Hyperparameter Tuning with Grid search

With this technique, we simply build a model for each possible combination of all of the hyperparameter values provided, evaluating each model, and selecting the architecture which produces the best results.

Hyperparameter --> parameter whose value is used to control the learning process

* Each ML model has its own set of hyperparameters
* Tree based models share few hyperparamaters considering their similar learning process



### 1. **Logistic Regression Model**


Possible parameters to tune:

    1. Regularization - penalty in [‘none’, ‘l1’, ‘l2’, ‘elasticnet’]
    2. The C parameter controls the penality strength - C in [100, 10, 1.0, 0.1, 0.01, 0.01]

In [5]:
# Identifying Grid Hyperparameters to train Logistic regression model
param_distributions = {'penalty': ['l1','l2', 'elasticnet'],
              'C' : [0.001, 0.1, 0.5, 0.75, 1, 10]}

start = time.time()
logi = LogisticRegression(max_iter=400)

logi_grid = RandomizedSearchCV(logi,param_distributions, n_jobs=-1, random_state=0, cv=5, verbose=1)
logi_grid.fit(x_smote,y_smote)
stop = time.time()

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [6]:
# Displaying best parameters
print(logi_grid.best_estimator_)
print(logi_grid.best_params_)

logi_optimal_model = logi_grid.best_estimator_

#class prediction of y on train and test
y_pred_logi_grid = logi_optimal_model.predict(x_test)
y_train_pred_logi_grid = logi_optimal_model.predict(x_smote)

LogisticRegression(C=0.1, max_iter=400)
{'penalty': 'l2', 'C': 0.1}


In [7]:
#getting all scores for Logistic Regression
log_acctr = round(accuracy_score(y_train_pred_logi_grid,y_smote), 3)
log_acc = round(accuracy_score(y_pred_logi_grid,y_test), 3)
log_prec = round(precision_score(y_pred_logi_grid,y_test), 3)
log_rec = round(recall_score(y_pred_logi_grid,y_test), 3)
log_f1 = round(f1_score(y_pred_logi_grid,y_test), 3)
log_roc = round(roc_auc_score(y_pred_logi_grid,y_test), 3)

#Feature co-efficients (Coef_ - Is generally not a great representaion of feature importance)
ft_imp = pd.DataFrame(data={'Attribute': x_smote.columns,'Importance': logi_optimal_model.coef_[0]}).sort_values(by='Importance', ascending=False)
log_feat = np.array(ft_imp['Attribute'][:3:])

#Training time calculation
log_time=stop-start

results = pd.DataFrame([['Logistic Regression tuned', log_acctr, log_acc, log_prec, log_rec, log_f1, log_roc, log_time, log_feat]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC', 'Training Time(s)','Important Features'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC,Training Time(s),Important Features
0,Logistic Regression tuned,0.632,0.557,0.738,0.299,0.426,0.585,5.801523,"[AGE_median, PAY_0, PAY_2]"


### 2. **KNN Model**

Possible parameters to tune:

    1. n_neighbors - Number of neighbors
    2. leaf_size = list(range(1,50))

In [8]:
# Identifying Grid Hyperparameters to train Logistic regression model
param_distributions = {'leaf_size': list(range(1,50)),
              'n_neighbors' : list(range(1,30))}

start = time.time()
knn = KNeighborsClassifier()

knn_grid = RandomizedSearchCV(knn, param_distributions, n_jobs=-1, random_state=0, cv=5, verbose=1)
knn_grid.fit(x_smote,y_smote)
stop = time.time()

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [9]:
# Displaying best parameters
print(knn_grid.best_estimator_)
print(knn_grid.best_params_)

knn_optimal_model = knn_grid.best_estimator_

#class prediction of y on train and test
y_pred_knn_grid = knn_optimal_model.predict(x_test)
y_train_pred_knn_grid = knn_optimal_model.predict(x_smote)

KNeighborsClassifier(leaf_size=20, n_neighbors=9)
{'n_neighbors': 9, 'leaf_size': 20}


In [10]:
#getting all scores for Logistic Regression
knn_acctr = round(accuracy_score(y_train_pred_knn_grid,y_smote), 3)
knn_acc = round(accuracy_score(y_pred_knn_grid,y_test), 3)
knn_prec = round(precision_score(y_pred_knn_grid,y_test), 3)
knn_rec = round(recall_score(y_pred_knn_grid,y_test), 3)
knn_f1 = round(f1_score(y_pred_knn_grid,y_test), 3)
knn_roc = round(roc_auc_score(y_pred_knn_grid,y_test), 3)

#Feature Importance
knn_feat = ['No coefficients available']

#Training time calculation
knn_time=stop-start

results = pd.DataFrame([['K Nearest Neighbor tuned', knn_acctr, knn_acc, knn_prec, knn_rec, knn_f1, knn_roc, knn_time, knn_feat]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC', 'Training Time(s)','Important Features'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC,Training Time(s),Important Features
0,K Nearest Neighbor tuned,0.798,0.624,0.521,0.301,0.382,0.564,9.038884,[No coefficients available]


### 3. **Decision Trees**

Possible parameters to tune:

    1. max_depth
    2. min_samples_split
    3. min_samples_leaf
    4. max_features
    5. max_leaf_nodes


In [11]:
# Identifying Grid Hyperparameters to train the model
param_distributions = {'max_depth': [5, 10, 30, 50],
#              'min_samples_split': [2, 5, 10, 15],
              'min_samples_leaf': [2, 5, 10, 20],
               'max_features' : [2, 5, 10, 30]}

start = time.time()
# Create an instance of the decision tree
dtc = DecisionTreeClassifier()

# Grid search
dtc_grid = RandomizedSearchCV(dtc, param_distributions, n_jobs=-1, random_state=0, cv=5, verbose=1)
dtc_grid.fit(x_smote, y_smote)
stop = time.time()

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [12]:
# Displaying best parameters
print(dtc_grid.best_estimator_)
print(dtc_grid.best_params_)

dtc_optimal_model = dtc_grid.best_estimator_

# class prediction of y on train and test
y_pred_dtc_grid=dtc_optimal_model.predict(x_test)
y_train_pred_dtc_grid=dtc_optimal_model.predict(x_smote)

DecisionTreeClassifier(max_depth=10, max_features=30, min_samples_leaf=5)
{'min_samples_leaf': 5, 'max_features': 30, 'max_depth': 10}


In [13]:
y_test.shape

(8984,)

In [14]:
#getting all scores for Decision trees
dtc_acctr = round(accuracy_score(y_train_pred_dtc_grid,y_smote), 3)
dtc_acc = round(accuracy_score(y_pred_dtc_grid,y_test), 3)
dtc_prec = round(precision_score(y_pred_dtc_grid,y_test), 3)
dtc_rec = round(recall_score(y_pred_dtc_grid,y_test), 3)
dtc_f1 = round(f1_score(y_pred_dtc_grid,y_test), 3)
dtc_roc = round(roc_auc_score(y_pred_dtc_grid,y_test), 3)

#Feature Importance
imp_ft = pd.DataFrame(data={'Attribute': x_smote.columns, 'Importance': dtc_optimal_model.feature_importances_}).sort_values(by='Importance', ascending=False)
dtc_feat = np.array(imp_ft['Attribute'][:3:])

#Training time calculation
dtc_time=stop-start

results = pd.DataFrame([['Decision trees tuned', dtc_acctr, dtc_acc, dtc_prec, dtc_rec, dtc_f1 , dtc_roc, dtc_time, dtc_feat ]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC', 'Training Time(s)','Important Features'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC,Training Time(s),Important Features
0,Decision trees tuned,0.87,0.798,0.394,0.567,0.465,0.704,2.244369,"[PAY_0, MARRIAGE_married, SEX_female]"


### 4. **Random Forest Classifer**


Possible parameters to tune:

    1. n_estimators in [10, 100, 1000]
    2. max_features in [‘sqrt’, ‘log2’]   or max_features [1 to 20]
    3. min_samples_split
    4. min_samples_leaf
    5. max_depth

In [15]:
# Hyperparameter Grid
param_distributions = {'n_estimators' : [10, 50, 70],
               'max_depth' : [2, 3, 5, 10]}

start = time.time()
# Create an instance of the RandomForestClassifier
rfc = RandomForestClassifier()

# Grid search
rfc_grid = RandomizedSearchCV(rfc ,param_distributions, n_jobs=-1, random_state=0, cv=5, verbose=1)
rfc_grid.fit(x_smote, y_smote)
stop = time.time()

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [16]:
# Displaying best parameters
print(rfc_grid.best_estimator_)
print(rfc_grid.best_params_)

rfc_optimal_model = rfc_grid.best_estimator_

#class prediction of y on train and test
y_pred_rfc_grid = rfc_optimal_model.predict(x_test)
y_train_pred_rfc_grid = rfc_optimal_model.predict(x_smote)

RandomForestClassifier(max_depth=10, n_estimators=50)
{'n_estimators': 50, 'max_depth': 10}


In [17]:
#getting all scores for Random Forest Classifier
rfc_acctr = round(accuracy_score(y_train_pred_rfc_grid,y_smote), 3)
rfc_acc = round(accuracy_score(y_pred_rfc_grid,y_test), 3)
rfc_prec = round(precision_score(y_pred_rfc_grid,y_test), 3)
rfc_rec = round(recall_score(y_pred_rfc_grid,y_test), 3)
rfc_f1 = round(f1_score(y_pred_rfc_grid,y_test), 3)
rfc_roc = round(roc_auc_score(y_pred_rfc_grid,y_test), 3)

#Feature Importance
imp_ft = pd.DataFrame(data={'Attribute': x_smote.columns, 'Importance': rfc_optimal_model.feature_importances_}).sort_values(by='Importance', ascending=False)
rfc_feat = np.array(imp_ft['Attribute'][:3:])

#Training time calculation
rfc_time=stop-start

results = pd.DataFrame([['Random Forest tuned', rfc_acctr, rfc_acc, rfc_prec, rfc_rec, rfc_f1, rfc_roc, rfc_time, rfc_feat]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC', 'Training Time(s)','Important Features'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC,Training Time(s),Important Features
0,Random Forest tuned,0.888,0.808,0.441,0.592,0.506,0.721,13.322136,"[PAY_0, MARRIAGE_married, PAY_2]"


### 5. **Gradient Boosting**

Possible parameters to tune:

1. learning_rate in [0.001, 0.01, 0.1]
2. n_estimators [10, 100, 1000]
3. subsample in [0.5, 0.7, 1.0]
4. max_depth in [2, 7, 9]

In [18]:
# Hyperparameter Grid       #3min
param_distributions = {'learning_rate': [0.01, 0.1],
              'n_estimators' : [10, 50, 100],
              'max_depth' : [2, 3, 4],}

start = time.time()
gbc = GradientBoostingClassifier()
stop = time.time()

# Grid search
gbc_grid = RandomizedSearchCV(gbc, param_distributions, n_jobs=-1, random_state=0, cv=5, verbose=1)
gbc_grid.fit(x_smote, y_smote)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [19]:
# Displaying best parameters
print(gbc_grid.best_estimator_)
print(gbc_grid.best_params_)

gbc_optimal_model = gbc_grid.best_estimator_

#class prediction of y on train and test
y_pred_gbc_grid = gbc_optimal_model.predict(x_test)
y_train_pred_gbc_grid = gbc_optimal_model.predict(x_smote)

GradientBoostingClassifier(max_depth=4)
{'n_estimators': 100, 'max_depth': 4, 'learning_rate': 0.1}


In [20]:
#getting all scores for Gradient booster
gbc_acctr = round(accuracy_score(y_train_pred_gbc_grid,y_smote), 3)
gbc_acc = round(accuracy_score(y_pred_gbc_grid,y_test), 3)
gbc_prec = round(precision_score(y_pred_gbc_grid,y_test), 3)
gbc_rec = round(recall_score(y_pred_gbc_grid,y_test), 3)
gbc_f1 = round(f1_score(y_pred_gbc_grid,y_test), 3)
gbc_roc = round(roc_auc_score(y_pred_gbc_grid,y_test), 3)

#Feature Importance
imp_ft = pd.DataFrame(data={'Attribute': x_smote.columns, 'Importance': gbc_optimal_model.feature_importances_}).sort_values(by='Importance', ascending=False)
gbc_feat = np.array(imp_ft['Attribute'][:3:])

#Training time calculation
gbc_time=stop-start

results = pd.DataFrame([['Gradient Boosting Tuned', gbc_acctr, gbc_acc, gbc_prec, gbc_rec, gbc_f1, gbc_roc, gbc_time, gbc_feat]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC', 'Training Time(s)','Important Features'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC,Training Time(s),Important Features
0,Gradient Boosting Tuned,0.881,0.81,0.39,0.614,0.477,0.728,7.3e-05,"[PAY_0, MARRIAGE_married, SEX_female]"


### 6. **XG Boosting**

Possible parameters to tune:

    1. learning_rate in [0.001, 0.01, 0.1]
    2. max_depth [1 to 20]
    3. max_leaf_nodes
    4. gamma
    5. min_child_weight
    6. n_estimators in (50, 100, 150]


In [21]:
# Hyperparameter Grid
param_distributions = {
#             'n_estimators' : [50, 100],
              'max_depth': [2, 3, 5],
#              'learning_rate': [0.1, 0.15],
              'gamma': [0.1,0.6,0.8]
               }


start = time.time()
# Create an instance of the XGBClassifier
xgb = XGBClassifier(random_state=0, use_label_encoder=False, eval_metric = 'logloss')

# Grid search
xgb_grid = RandomizedSearchCV(xgb ,param_distributions, n_jobs=-1, random_state=0, cv=2, verbose=1)

# fitting model
xgb_grid.fit(x_smote,y_smote)
stop = time.time()

Fitting 2 folds for each of 9 candidates, totalling 18 fits


In [22]:
# Displaying best parameters
print(xgb_grid.best_estimator_)
print(xgb_grid.best_params_)

xgb_optimal_model = xgb_grid.best_estimator_

#class prediction of y on train and test
y_pred_xgb_grid = xgb_optimal_model.predict(x_test)
y_train_pred_xgb_grid = xgb_optimal_model.predict(x_smote)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=0.8, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=3, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)
{'max_depth': 3, 'gamma': 0.8}


In [23]:
#getting all scores for XG Boosting Classifier
xgb_acctr = round(accuracy_score(y_train_pred_xgb_grid,y_smote), 3)
xgb_acc = round(accuracy_score(y_pred_xgb_grid,y_test), 3)
xgb_prec = round(precision_score(y_pred_xgb_grid,y_test), 3)
xgb_rec = round(recall_score(y_pred_xgb_grid,y_test), 3)
xgb_f1 = round(f1_score(y_pred_xgb_grid,y_test), 3)
xgb_roc = round(roc_auc_score(y_pred_xgb_grid,y_test), 3)

#Feature Importance
imp_ft = pd.DataFrame(data={'Attribute': x_smote.columns, 'Importance': xgb_optimal_model.feature_importances_}).sort_values(by='Importance', ascending=False)
xgb_feat = np.array(imp_ft['Attribute'][:3:])

#Training time calculation
xgb_time=stop-start

results = pd.DataFrame([['XG Boosting Tuned', xgb_acctr, xgb_acc, xgb_prec, xgb_rec, xgb_f1, xgb_roc, xgb_time, xgb_feat]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC', 'Training Time(s)','Important Features'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC,Training Time(s),Important Features
0,XG Boosting Tuned,0.885,0.812,0.388,0.625,0.479,0.733,635.614341,"[MARRIAGE_married, PAY_0, EDUCATION_university]"


### 7. **ADA Boosting**

Possible parameters to tune:

    1. learning_rate in [0.001, 0.01, 0.1]
    2. base_estimator
    3. n_estimators in [50, 100, 150]

In [24]:
# Hyperparameter Grid
param_distributions = {'learning_rate': [0.001, 0.01, 0.1, 0.5],
              'n_estimators' : [50, 100, 150]}

start = time.time()
ada = AdaBoostClassifier()

ada_grid = RandomizedSearchCV(ada ,param_distributions, n_jobs=-1, random_state=0, cv=5, verbose=1)

# fitting model
ada_grid.fit(x_smote,y_smote)
stop = time.time()

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [25]:
# Displaying best parameters
print(ada_grid.best_estimator_)
print(ada_grid.best_params_)

ada_optimal_model = ada_grid.best_estimator_

#class prediction of y on train and test
y_pred_ada_grid = ada_optimal_model.predict(x_test)
y_train_pred_ada_grid = ada_optimal_model.predict(x_smote)

AdaBoostClassifier(learning_rate=0.5, n_estimators=150)
{'n_estimators': 150, 'learning_rate': 0.5}


In [26]:
#getting all scores for Ada Boosting
ada_acctr = round(accuracy_score(y_train_pred_ada_grid,y_smote), 3)
ada_acc = round(accuracy_score(y_pred_ada_grid,y_test), 3)
ada_prec = round(precision_score(y_pred_ada_grid,y_test), 3)
ada_rec = round(recall_score(y_pred_ada_grid,y_test), 3)
ada_f1 = round(f1_score(y_pred_ada_grid,y_test), 3)
ada_roc = round(roc_auc_score(y_pred_ada_grid,y_test), 3)

#Feature Importance
imp_ft = pd.DataFrame(data={'Attribute': x_smote.columns, 'Importance': ada_optimal_model.feature_importances_}).sort_values(by='Importance', ascending=False)
ada_feat = np.array(imp_ft['Attribute'][:3:])

#Training time calculation
ada_time=stop-start

results = pd.DataFrame([['ADA Boosting tuned', ada_acctr, ada_acc, ada_prec, ada_rec, ada_f1, ada_roc, ada_time, ada_feat]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC', 'Training Time(s)','Important Features'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC,Training Time(s),Important Features
0,ADA Boosting tuned,0.863,0.806,0.392,0.6,0.474,0.721,47.734555,"[SEX_female, MARRIAGE_married, EDUCATION_unive..."


### 8. **Bagging**

Possible parameters to tune:

    1. learning_rate in [0.001, 0.01, 0.1]
    2. max_features
    3. n_estimators in [10,50, 100, 150]
    4. max_samples

In [27]:
# Hyperparameter Grid
param_distributions = {'n_estimators' : [5, 10, 50, 150, 300],
               'max_features' : [2, 5, 10, 15],
               'max_samples' : [0.05, 0.1, 0.2]}

start = time.time()
bag = BaggingClassifier()

bag_grid = RandomizedSearchCV(bag ,param_distributions, n_jobs=-1, random_state=0, cv=5, verbose=1)

bag_grid.fit(x_smote,y_smote)
stop = time.time()

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [28]:
# Displaying best parameters
print(bag_grid.best_estimator_)
print(bag_grid.best_params_)

bag_optimal_model = bag_grid.best_estimator_

#class prediction of y on train and test
y_pred_bag_grid = bag_optimal_model.predict(x_test)
y_train_pred_bag_grid = bag_optimal_model.predict(x_smote)

BaggingClassifier(max_features=5, max_samples=0.2, n_estimators=150)
{'n_estimators': 150, 'max_samples': 0.2, 'max_features': 5}


In [29]:
#getting all scores for Logistic Regression
bag_acctr = round(accuracy_score(y_train_pred_bag_grid,y_smote), 3)
bag_acc = round(accuracy_score(y_pred_bag_grid,y_test), 3)
bag_prec = round(precision_score(y_pred_bag_grid,y_test), 3)
bag_rec = round(recall_score(y_pred_bag_grid,y_test), 3)
bag_f1 = round(f1_score(y_pred_bag_grid,y_test), 3)
bag_roc = round(roc_auc_score(y_pred_bag_grid,y_test), 3)

#Feature Importance
bag_feat = ['No coefficients available']

#Training time calculation
bag_time=stop-start

results = pd.DataFrame([['Bagging classifier tuned', bag_acctr, bag_acc, bag_prec, bag_rec, bag_f1, bag_roc, bag_time, bag_feat]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC', 'Training Time(s)','Important Features'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC,Training Time(s),Important Features
0,Bagging classifier tuned,0.909,0.8,0.283,0.611,0.387,0.717,32.704007,[No coefficients available]


### **Final Model Comparision**

In [30]:
grid_classifiers = ['Optimal Logistic Regression', 'Optimal K Nearest Neighbors', 'Optimal Decision Tree', 'Optimal Random Forest',
                  'Optimal Gradient Boosting', 'Optimal XG Boosting', 'Optimal Ada Boosting', 'Optimal Bagging']

grid_train_accuracy =  [log_acctr, knn_acctr, dtc_acctr, rfc_acctr, gbc_acctr, xgb_acctr, ada_acctr, bag_acctr]
grid_test_accuracy =   [log_acc, knn_acc, dtc_acc, rfc_acc, gbc_acc, xgb_acc, ada_acc, bag_acc]
grid_precision_score = [log_prec, knn_prec, dtc_prec, rfc_prec, gbc_prec, xgb_prec, ada_prec, bag_prec]
grid_recall_score =    [log_rec, knn_rec, dtc_rec, rfc_rec, gbc_rec, xgb_rec, ada_rec, bag_rec]
grid_f1_score =        [log_f1, knn_f1, dtc_f1, rfc_f1, gbc_f1, xgb_f1, ada_f1, bag_f1]
grid_auc_score =       [log_roc, knn_roc, dtc_roc, rfc_roc, gbc_roc, xgb_roc, ada_roc, bag_roc]
training_time=[log_time, knn_time, dtc_time, rfc_time, gbc_time, xgb_time, ada_time, bag_time]
feature_imp = [log_feat, knn_feat, dtc_feat, rfc_feat, gbc_feat, xgb_feat, ada_feat, bag_feat]

In [31]:
compare_df = pd.DataFrame({'Classifier':grid_classifiers, 'Train Accuracy': grid_train_accuracy, 'Test Accuracy': grid_test_accuracy, 
                           'Precision': grid_precision_score, 'Recall': grid_recall_score, 'F1 Score': grid_f1_score , 
                           'AUC': grid_auc_score, 'Training Time(s)':training_time, 'Important Features':feature_imp})
compare_df

Unnamed: 0,Classifier,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,AUC,Training Time(s),Important Features
0,Optimal Logistic Regression,0.632,0.557,0.738,0.299,0.426,0.585,5.801523,"[AGE_median, PAY_0, PAY_2]"
1,Optimal K Nearest Neighbors,0.798,0.624,0.521,0.301,0.382,0.564,9.038884,[No coefficients available]
2,Optimal Decision Tree,0.87,0.798,0.394,0.567,0.465,0.704,2.244369,"[PAY_0, MARRIAGE_married, SEX_female]"
3,Optimal Random Forest,0.888,0.808,0.441,0.592,0.506,0.721,13.322136,"[PAY_0, MARRIAGE_married, PAY_2]"
4,Optimal Gradient Boosting,0.881,0.81,0.39,0.614,0.477,0.728,7.3e-05,"[PAY_0, MARRIAGE_married, SEX_female]"
5,Optimal XG Boosting,0.885,0.812,0.388,0.625,0.479,0.733,635.614341,"[MARRIAGE_married, PAY_0, EDUCATION_university]"
6,Optimal Ada Boosting,0.863,0.806,0.392,0.6,0.474,0.721,47.734555,"[SEX_female, MARRIAGE_married, EDUCATION_unive..."
7,Optimal Bagging,0.909,0.8,0.283,0.611,0.387,0.717,32.704007,[No coefficients available]


### **Conclusion**

* After cross validation and hyperparameter tunning, Ensemble algorithms shows better performance compared to other models

- The hyperparameter tuned optimized algorithms show improved performance compared to the Baseline models

* Hyperparameter tuning has also reduced the overfitting issue in Random forest & Decision tree models

* One can try tuning other hyperparameters, passing different ranges of values and a different CV fold to check for any further improvement in the model performance