# **Cross Validation & Hyperparameter Tuning** 

#### Importing Libraries

In [1]:
#importing required libraries for data analysis
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# Import models from sklearn
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Import evaluation metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, roc_curve, auc

In [2]:
# Read the training & test datasets from Part2_Classification-Model building part

x_smote=pd.read_csv('x_smote.csv')
x_test=pd.read_csv('x_test.csv')

y_smote=pd.read_csv('y_smote.csv')
y_test=pd.read_csv('y_test.csv')

print(x_smote.shape,x_test.shape,y_smote.shape,y_test.shape)

(8242, 86) (2250, 86) (8242, 1) (2250, 1)


##### Cross-Validation:
    
    The Cross-Validation then iterates through the folds and at each iteration uses one of the K folds as the validation set while using all remaining folds as the training set. This process is repeated until every fold has been used as a validation set. 
    K fold --> CV = x   where x Indicates number of iterations

##### Hyperparameter Tuning with Grid search

With this technique, we simply build a model for each possible combination of all of the hyperparameter values provided, evaluating each model, and selecting the architecture which produces the best results.

Hyperparameter --> parameter whose value is used to control the learning process

* Each ML model has its own set of hyperparameters
* Tree based models share few hyperparamaters considering their similar learning process



### 1. **Logistic Regression Model**


Possible parameters to tune:

    1. Regularization - penalty in [‘none’, ‘l1’, ‘l2’, ‘elasticnet’]
    2. The C parameter controls the penality strength - C in [100, 10, 1.0, 0.1, 0.01, 0.01]

In [3]:
# Identifying Grid Hyperparameters to train Logistic regression model
param_dict = {'penalty': ['l1','l2', 'elasticnet'],
              'max_iter' : [50, 100, 500, 1000],
              'C' : [0.001, 0.1, 0.5, 0.75, 1, 10]}

logi = LogisticRegression()

logi_grid = GridSearchCV(logi,param_dict, cv=5, n_jobs=-1)
logi_grid.fit(x_smote,y_smote)

In [4]:
# Displaying best parameters
print(logi_grid.best_estimator_)
print(logi_grid.best_params_)

logi_optimal_model = logi_grid.best_estimator_

#class prediction of y on train and test
y_pred_logi_grid = logi_optimal_model.predict(x_test)
y_train_pred_logi_grid = logi_optimal_model.predict(x_smote)

LogisticRegression(C=0.001, max_iter=50)
{'C': 0.001, 'max_iter': 50, 'penalty': 'l2'}


In [5]:
#getting all scores for Logistic Regression
log_acctr = round(accuracy_score(y_train_pred_logi_grid,y_smote), 3)
log_acc = round(accuracy_score(y_pred_logi_grid,y_test), 3)
log_prec = round(precision_score(y_pred_logi_grid,y_test), 3)
log_rec = round(recall_score(y_pred_logi_grid,y_test), 3)
log_f1 = round(f1_score(y_pred_logi_grid,y_test), 3)
log_roc = round(roc_auc_score(y_pred_logi_grid,y_test), 3)

results = pd.DataFrame([['Logistic Regression tuned', log_acctr, log_acc, log_prec, log_rec, log_f1, log_roc]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC
0,Logistic Regression tuned,0.71,0.778,0.561,0.522,0.541,0.693


### 2. **Decision Trees**

Possible parameters to tune:

    1. max_depth
    2. min_samples_split
    3. min_samples_leaf
    4. max_features
    5. max_leaf_nodes


In [85]:
# Identifying Grid Hyperparameters to train the model
param_dict = {'max_depth': [5, 10, 30, 50],
              'min_samples_split': [2, 5, 10, 15],
              'min_samples_leaf': [2, 5, 10, 20],
               'max_features' : [2, 5, 10, 30]}

# Create an instance of the decision tree
dtc = DecisionTreeClassifier()

# Grid search
dtc_grid = GridSearchCV(estimator=dtc,
                       param_grid = param_dict,
                       cv = 5, verbose=3, n_jobs = -1, scoring='roc_auc')
dtc_grid.fit(x_smote, y_smote)

Fitting 5 folds for each of 256 candidates, totalling 1280 fits


In [86]:
# Displaying best parameters
print(dtc_grid.best_estimator_)
print(dtc_grid.best_params_)

dtc_optimal_model = dtc_grid.best_estimator_

# class prediction of y on train and test
y_pred_dtc_grid=dtc_optimal_model.predict(x_test)
y_train_pred_dtc_grid=dtc_optimal_model.predict(x_smote)

DecisionTreeClassifier(max_depth=50, max_features=30, min_samples_leaf=10,
                       min_samples_split=10)
{'max_depth': 50, 'max_features': 30, 'min_samples_leaf': 10, 'min_samples_split': 10}


In [87]:
#getting all scores for Decision trees
dtc_acctr = round(accuracy_score(y_train_pred_dtc_grid,y_smote), 3)
dtc_acc = round(accuracy_score(y_pred_dtc_grid,y_test), 3)
dtc_prec = round(precision_score(y_pred_dtc_grid,y_test), 3)
dtc_rec = round(recall_score(y_pred_dtc_grid,y_test), 3)
dtc_f1 = round(f1_score(y_pred_dtc_grid,y_test), 3)
dtc_roc = round(roc_auc_score(y_pred_dtc_grid,y_test), 3)

results = pd.DataFrame([['Decision trees tuned', dtc_acctr, dtc_acc, dtc_prec, dtc_rec, dtc_f1, dtc_roc]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC
0,Decision trees tuned,0.859,0.725,0.504,0.424,0.461,0.632


### 3. **Random Forest Classifer**


Possible parameters to tune:

    1. n_estimators in [10, 100, 1000]
    2. max_features in [‘sqrt’, ‘log2’]   or max_features [1 to 20]
    3. min_samples_split
    4. min_samples_leaf
    5. max_depth

In [171]:
# Hyperparameter Grid   # 2min
param_dict = {'n_estimators' : [10, 50, 70],
               'max_depth' : [2, 3, 5, 10]}

# Create an instance of the RandomForestClassifier
rfc = RandomForestClassifier()

# Grid search
rfc_grid = GridSearchCV(estimator=rfc,
                       param_grid = param_dict,
                       cv = 3, verbose=2, scoring='roc_auc')
rfc_grid.fit(x_smote, y_smote)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] END .......................max_depth=2, n_estimators=10; total time=   0.0s
[CV] END .......................max_depth=2, n_estimators=10; total time=   0.0s
[CV] END .......................max_depth=2, n_estimators=10; total time=   0.0s
[CV] END .......................max_depth=2, n_estimators=50; total time=   0.1s
[CV] END .......................max_depth=2, n_estimators=50; total time=   0.3s
[CV] END .......................max_depth=2, n_estimators=50; total time=   0.1s
[CV] END .......................max_depth=2, n_estimators=70; total time=   0.2s
[CV] END .......................max_depth=2, n_estimators=70; total time=   0.2s
[CV] END .......................max_depth=2, n_estimators=70; total time=   0.1s
[CV] END .......................max_depth=3, n_estimators=10; total time=   0.0s
[CV] END .......................max_depth=3, n_estimators=10; total time=   0.0s
[CV] END .......................max_depth=3, n_e

In [172]:
# Displaying best parameters
print(rfc_grid.best_estimator_)
print(rfc_grid.best_params_)

rfc_optimal_model = rfc_grid.best_estimator_

#class prediction of y on train and test
y_pred_rfc_grid = rfc_optimal_model.predict(x_test)
y_train_pred_rfc_grid = rfc_optimal_model.predict(x_smote)

RandomForestClassifier(max_depth=10, n_estimators=50)
{'max_depth': 10, 'n_estimators': 50}


In [173]:
#getting all scores for Random Forest Classifier
rfc_acctr = round(accuracy_score(y_train_pred_rfc_grid,y_smote), 3)
rfc_acc = round(accuracy_score(y_pred_rfc_grid,y_test), 3)
rfc_prec = round(precision_score(y_pred_rfc_grid,y_test), 3)
rfc_rec = round(recall_score(y_pred_rfc_grid,y_test), 3)
rfc_f1 = round(f1_score(y_pred_rfc_grid,y_test), 3)
rfc_roc = round(roc_auc_score(y_pred_rfc_grid,y_test), 3)

results = pd.DataFrame([['Random Forest tuned', rfc_acctr, rfc_acc, rfc_prec, rfc_rec, rfc_f1, rfc_roc]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC
0,Random Forest tuned,0.872,0.786,0.544,0.54,0.542,0.7


### 4. **Gradient Boosting**

Possible parameters to tune:

1. learning_rate in [0.001, 0.01, 0.1]
2. n_estimators [10, 100, 1000]
3. subsample in [0.5, 0.7, 1.0]
4. max_depth in [2, 7, 9]

In [146]:
# Hyperparameter Grid       #3min
param_dict = {'learning_rate': [0.01, 0.1],
              'n_estimators' : [10, 50, 100],
              'max_depth' : [2, 3, 4]}

gbc = GradientBoostingClassifier()

# Grid search
gbc_grid = GridSearchCV(estimator=gbc,
                       param_grid = param_dict,
                       cv = 3, verbose=2, scoring='roc_auc')
gbc_grid.fit(x_smote, y_smote)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] END ...learning_rate=0.01, max_depth=2, n_estimators=50; total time=   1.1s
[CV] END ...learning_rate=0.01, max_depth=2, n_estimators=50; total time=   1.2s
[CV] END ...learning_rate=0.01, max_depth=2, n_estimators=50; total time=   1.1s
[CV] END ..learning_rate=0.01, max_depth=2, n_estimators=100; total time=   2.2s
[CV] END ..learning_rate=0.01, max_depth=2, n_estimators=100; total time=   2.3s
[CV] END ..learning_rate=0.01, max_depth=2, n_estimators=100; total time=   2.2s
[CV] END ...learning_rate=0.01, max_depth=3, n_estimators=50; total time=   1.5s
[CV] END ...learning_rate=0.01, max_depth=3, n_estimators=50; total time=   1.3s
[CV] END ...learning_rate=0.01, max_depth=3, n_estimators=50; total time=   1.3s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   3.4s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   3.8s
[CV] END ..learning_rate=0.01, max_depth=3, n_es

In [147]:
# Displaying best parameters
print(gbc_grid.best_estimator_)
print(gbc_grid.best_params_)

gbc_optimal_model = gbc_grid.best_estimator_

#class prediction of y on train and test
y_pred_gbc_grid = gbc_optimal_model.predict(x_test)
y_train_pred_gbc_grid = gbc_optimal_model.predict(x_smote)

GradientBoostingClassifier(max_depth=4)
{'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 100}


In [148]:
#getting all scores for Gradient booster
gbc_acctr = round(accuracy_score(y_train_pred_gbc_grid,y_smote), 3)
gbc_acc = round(accuracy_score(y_pred_gbc_grid,y_test), 3)
gbc_prec = round(precision_score(y_pred_gbc_grid,y_test), 3)
gbc_rec = round(recall_score(y_pred_gbc_grid,y_test), 3)
gbc_f1 = round(f1_score(y_pred_gbc_grid,y_test), 3)
gbc_roc = round(roc_auc_score(y_pred_gbc_grid,y_test), 3)

results = pd.DataFrame([['Gradient Boosting Tuned', gbc_acctr, gbc_acc, gbc_prec, gbc_rec, gbc_f1, gbc_roc]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC
0,Gradient Boosting Tuned,0.869,0.796,0.437,0.581,0.499,0.711


### 5. **XG Boosting**

Possible parameters to tune:

    1. learning_rate in [0.001, 0.01, 0.1]
    2. max_depth [1 to 20]
    3. max_leaf_nodes
    4. gamma
    5. min_child_weight
    6. n_estimators in (50, 100, 150]


In [36]:
# Hyperparameter Grid
param_dict = {'n_estimators' : [50, 100, 150],
              'max_depth' : [1, 3, 5],
              'learning_rate': [0.01, 0.1, 0.15]}

# Create an instance of the RandomForestClassifier
xgb = XGBClassifier()

# Grid search
xgb_grid = GridSearchCV(estimator=xgb,
                       param_grid = param_dict,
                       n_jobs=-1, cv = 3,
                       verbose=2, scoring='roc_auc')
# fitting model
xgb_grid.fit(x_smote,y_smote)

Fitting 3 folds for each of 27 candidates, totalling 81 fits


In [37]:
# Displaying best parameters
print(xgb_grid.best_estimator_)
print(xgb_grid.best_params_)

xgb_optimal_model = xgb_grid.best_estimator_

#class prediction of y on train and test
y_pred_xgb_grid = xgb_optimal_model.predict(x_test)
y_train_pred_xgb_grid = xgb_optimal_model.predict(x_smote)

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.15, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=5, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=150,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)
{'learning_rate': 0.15, 'max_depth': 5, 'n_estimators': 150}


In [38]:
#getting all scores for XG Boosting Classifier
xgb_acctr = round(accuracy_score(y_train_pred_xgb_grid,y_smote), 3)
xgb_acc = round(accuracy_score(y_pred_xgb_grid,y_test), 3)
xgb_prec = round(precision_score(y_pred_xgb_grid,y_test), 3)
xgb_rec = round(recall_score(y_pred_xgb_grid,y_test), 3)
xgb_f1 = round(f1_score(y_pred_xgb_grid,y_test), 3)
xgb_roc = round(roc_auc_score(y_pred_xgb_grid,y_test), 3)

results = pd.DataFrame([['XG Boosting Tuned', xgb_acctr, xgb_acc, xgb_prec, xgb_rec, xgb_f1, xgb_roc]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC
0,XG Boosting Tuned,0.946,0.802,0.422,0.607,0.498,0.723


### 6. **ADA Boosting**

Possible parameters to tune:

    1. learning_rate in [0.001, 0.01, 0.1]
    2. base_estimator
    3. n_estimators in [50, 100, 150]

In [57]:
# Hyperparameter Grid
param_dict = {'learning_rate': [0.001, 0.01, 0.1, 0.5],
              'n_estimators' : [50, 100, 150]}

ada = AdaBoostClassifier()

ada_grid = GridSearchCV(estimator=ada,
                       param_grid = param_dict,
                       n_jobs=-1, cv = 3,
                       verbose=2, scoring='roc_auc')
# fitting model
ada_grid.fit(x_smote,y_smote)

Fitting 3 folds for each of 12 candidates, totalling 36 fits


In [58]:
# Displaying best parameters
print(ada_grid.best_estimator_)
print(ada_grid.best_params_)

ada_optimal_model = ada_grid.best_estimator_

#class prediction of y on train and test
y_pred_ada_grid = ada_optimal_model.predict(x_test)
y_train_pred_ada_grid = ada_optimal_model.predict(x_smote)

AdaBoostClassifier(learning_rate=0.5, n_estimators=150)
{'learning_rate': 0.5, 'n_estimators': 150}


In [None]:
{'learning_rate': 0.5, 'n_estimators': 150}

Model	Train Accuracy	Test Accuracy	Precision	Recall	F1 Score	ROC
0	ADA Boosting tuned	0.785	0.784	0.475	0.541	0.506	0.694

In [59]:
#getting all scores for Ada Boosting
ada_acctr = round(accuracy_score(y_train_pred_ada_grid,y_smote), 3)
ada_acc = round(accuracy_score(y_pred_ada_grid,y_test), 3)
ada_prec = round(precision_score(y_pred_ada_grid,y_test), 3)
ada_rec = round(recall_score(y_pred_ada_grid,y_test), 3)
ada_f1 = round(f1_score(y_pred_ada_grid,y_test), 3)
ada_roc = round(roc_auc_score(y_pred_ada_grid,y_test), 3)

results = pd.DataFrame([['ADA Boosting tuned', ada_acctr, ada_acc, ada_prec, ada_rec, ada_f1, ada_roc]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC
0,ADA Boosting tuned,0.794,0.79,0.479,0.557,0.515,0.702


### 7. **Bagging**

Possible parameters to tune:

    1. learning_rate in [0.001, 0.01, 0.1]
    2. max_features
    3. n_estimators in [10,50, 100, 150]
    4. max_samples

In [82]:
# Hyperparameter Grid
param_dict = {'n_estimators' : [5, 10, 50, 150, 300],
               'max_features' : [2, 5, 10, 15],
               'max_samples' : [0.05, 0.1, 0.2]}

bag = BaggingClassifier()

bag_grid = GridSearchCV(estimator=bag, param_grid = param_dict,
                       n_jobs=-1,cv = 5,
                       verbose=2, scoring='roc_auc')

bag_grid.fit(x_smote,y_smote)

Fitting 5 folds for each of 60 candidates, totalling 300 fits


In [83]:
# Displaying best parameters
print(bag_grid.best_estimator_)
print(bag_grid.best_params_)

bag_optimal_model = bag_grid.best_estimator_

#class prediction of y on train and test
y_pred_bag_grid = bag_optimal_model.predict(x_test)
y_train_pred_bag_grid = bag_optimal_model.predict(x_smote)

BaggingClassifier(max_features=10, max_samples=0.2, n_estimators=150)
{'max_features': 10, 'max_samples': 0.2, 'n_estimators': 150}


In [84]:
#getting all scores for Logistic Regression
bag_acctr = round(accuracy_score(y_train_pred_bag_grid,y_smote), 3)
bag_acc = round(accuracy_score(y_pred_bag_grid,y_test), 3)
bag_prec = round(precision_score(y_pred_bag_grid,y_test), 3)
bag_rec = round(recall_score(y_pred_bag_grid,y_test), 3)
bag_f1 = round(f1_score(y_pred_bag_grid,y_test), 3)
bag_roc = round(roc_auc_score(y_pred_bag_grid,y_test), 3)

results = pd.DataFrame([['Bagging classifier tuned', bag_acctr, bag_acc, bag_prec, bag_rec, bag_f1, bag_roc]],
               columns = ['Model', 'Train Accuracy', 'Test Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

Unnamed: 0,Model,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,ROC
0,Bagging classifier tuned,0.928,0.792,0.382,0.583,0.461,0.707


### **Final Model Comparision**

In [174]:
grid_classifiers = ['Optimal Logistic Regression', 'Optimal Decision Tree', 'Optimal Random Forest',
                  'Optimal Gradient Boosting', 'Optimal XG Boosting', 'Optimal Ada Boosting', 'Optimal Bagging']

grid_train_accuracy =  [log_acctr, dtc_acctr, rfc_acctr, gbc_acctr, xgb_acctr, ada_acctr, bag_acctr]
grid_test_accuracy =   [log_acc, dtc_acc, rfc_acc, gbc_acc, xgb_acc, ada_acc, bag_acc]
grid_precision_score = [log_prec, dtc_prec, rfc_prec, gbc_prec, xgb_prec, ada_prec, bag_prec]
grid_recall_score =    [log_rec, dtc_rec, rfc_rec, gbc_rec, xgb_rec, ada_rec, bag_rec]
grid_f1_score =        [log_f1, dtc_f1, rfc_f1, gbc_f1, xgb_f1, ada_f1, bag_f1]
grid_auc_score =       [log_roc, dtc_roc, rfc_roc, gbc_roc, xgb_roc, ada_roc, bag_roc]

In [175]:
compare_df = pd.DataFrame({'Classifier':grid_classifiers, 'Train Accuracy': grid_train_accuracy, 'Test Accuracy': grid_test_accuracy, 'Precision': grid_precision_score, 'Recall': grid_recall_score, 'F1 Score': grid_f1_score , 'AUC': grid_auc_score})
compare_df

Unnamed: 0,Classifier,Train Accuracy,Test Accuracy,Precision,Recall,F1 Score,AUC
0,Optimal Logistic Regression,0.71,0.778,0.561,0.522,0.541,0.693
1,Optimal Decision Tree,0.859,0.725,0.504,0.424,0.461,0.632
2,Optimal Random Forest,0.872,0.786,0.544,0.54,0.542,0.7
3,Optimal Gradient Boosting,0.869,0.796,0.437,0.581,0.499,0.711
4,Optimal XG Boosting,0.946,0.802,0.422,0.607,0.498,0.723
5,Optimal Ada Boosting,0.794,0.79,0.479,0.557,0.515,0.702
6,Optimal Bagging,0.928,0.792,0.382,0.583,0.461,0.707


### **Conclusion**

* After cross validation and hyperparameter tunning, Ensemble algorithms shows better performance compared to other models

- The hyperparameter tuned optimized algorithms show improved performance compared to the Baseline models

* Hyperparameter tuning has also reduced the overfitting issue in Random forest & Decision tree models

* One can try tuning other hyperparameters, passing different ranges of values and a different CV fold to check for any further improvement in the model performance