## WE04

#### Srikar Pusuluri
#### U95310908


Universal bank has recently trialed a marketing campaign to sell their new CD account product to existing customers. They contacted 5000 of their non-CD account customers with an offer. The data provided in universal.csv is the result of this market test. 

Use the techniques covered in this class to load and clean the data. Then, identify the best predictive model (using only the models covered thus far: Logistic Regression, SVM (with various kernels), and Decision Trees). Your target variable is CD Account. Your scoring measure is recall. Use RandomizedSearchCV combined with GridSearchCV to identify the best parameters for each model tested.

Be sure to document your thought process using markdown. Think of this as a report that your manager will read. This assignment requires you to decide how to process the provided data best (i.e., encoding). Be sure to provide your arguments/observations in markdown as you progress through data preparation, fitting, and performance evaluation.

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier 
from matplotlib import pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression



np.random.seed(1)

In [2]:
# Load the Data
df = pd.read_csv('UniversalBank.csv')
df.head(10)

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1
5,6,37,13,29,92121,4,0.4,2,155,0,0,0,1,0
6,7,53,27,72,91711,2,1.5,2,0,0,0,0,1,0
7,8,50,24,22,93943,1,0.3,3,0,0,0,0,0,1
8,9,35,10,81,90089,3,0.6,2,104,0,0,0,1,0
9,10,34,9,180,93023,1,8.9,3,0,1,0,0,0,0


In [3]:
df.dtypes

ID                      int64
Age                     int64
Experience              int64
Income                  int64
ZIP Code                int64
Family                  int64
CCAvg                 float64
Education               int64
Mortgage                int64
Personal Loan           int64
Securities Account      int64
CD Account              int64
Online                  int64
CreditCard              int64
dtype: object

In [4]:
df.isnull().sum().sum()

0

In [5]:
df.columns = df.columns.str.replace(' ', '_')

In [6]:
df.head(5)

Unnamed: 0,ID,Age,Experience,Income,ZIP_Code,Family,CCAvg,Education,Mortgage,Personal_Loan,Securities_Account,CD_Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1


In [7]:
df.columns = df.columns.str.lower()
df.head(5)

Unnamed: 0,id,age,experience,income,zip_code,family,ccavg,education,mortgage,personal_loan,securities_account,cd_account,online,creditcard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1


In [8]:
df['cd_account'].unique()

array([0, 1], dtype=int64)

## Data Exploration

In [9]:
df.describe()

Unnamed: 0,id,age,experience,income,zip_code,family,ccavg,education,mortgage,personal_loan,securities_account,cd_account,online,creditcard
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,2500.5,45.3384,20.1046,73.7742,93152.503,2.3964,1.937938,1.881,56.4988,0.096,0.1044,0.0604,0.5968,0.294
std,1443.520003,11.463166,11.467954,46.033729,2121.852197,1.147663,1.747659,0.839869,101.713802,0.294621,0.305809,0.23825,0.490589,0.455637
min,1.0,23.0,-3.0,8.0,9307.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1250.75,35.0,10.0,39.0,91911.0,1.0,0.7,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2500.5,45.0,20.0,64.0,93437.0,2.0,1.5,2.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,3750.25,55.0,30.0,98.0,94608.0,3.0,2.5,3.0,101.0,0.0,0.0,0.0,1.0,1.0
max,5000.0,67.0,43.0,224.0,96651.0,4.0,10.0,3.0,635.0,1.0,1.0,1.0,1.0,1.0
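The summary above shows two values worth checking before modelling: `experience` has a minimum of -3, and `zip_code` has a minimum of 9307, which has four digits while the quartiles are all five-digit codes. The sketch below shows one way such rows could be flagged. It uses a small hypothetical frame rather than the real file, and dropping `id` is an assumption (that it is only a row identifier), not a step this notebook performs.

```python
import pandas as pd

# Hypothetical mini-frame that reuses the cleaned column names.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "experience": [1, 19, -3, 15],
    "zip_code": [91107, 90089, 94720, 9307],
    "cd_account": [0, 0, 1, 0],
})

# Rows with impossible (negative) years of experience.
neg_experience = sample[sample["experience"] < 0]

# ZIP codes that do not have exactly five digits.
bad_zip = sample[sample["zip_code"].astype(str).str.len() != 5]

# An identifier column carries no predictive signal, so one option is to drop it
# together with the target when building the feature matrix.
features = sample.drop(columns=["id", "cd_account"])

print(len(neg_experience), len(bad_zip), list(features.columns))
```

Whether flagged rows are corrected, dropped, or kept is a judgment call that should be stated in the report.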


### Splitting and Standardizing the Data

In [10]:
#splitting the data into X,y 
X = df.drop('cd_account', axis=1)
y = df['cd_account']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3)

# Initialise the scaler
sclr = StandardScaler()

# Fit the scaler to the training set
sclr.fit(X_train)

# Transform the training and testing sets
X_train = sclr.transform(X_train)
X_test = sclr.transform(X_test)
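Because the scaler above is fit on the whole training set, the cross-validation searches later in the notebook validate on folds the scaler has already seen. One common way to avoid this is to place the scaler inside a `Pipeline`, so it is refit on each training fold. The sketch below illustrates the idea on synthetic data from `make_classification` (an assumption, not the bank data); the names `pipe`, `search`, and `X_demo` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the bank data (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.9, 0.1], random_state=1)

# stratify keeps the class ratio the same in the train and test splits.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=900)),
])

# Parameters of a pipeline step are addressed as <step>__<param>.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5, scoring="recall")
search.fit(Xd_train, yd_train)
print(search.best_params_)
```

Passing `stratify` and `random_state` to `train_test_split` also makes the split reproducible and keeps the rare positive class represented in both sets.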

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  5000 non-null   int64  
 1   age                 5000 non-null   int64  
 2   experience          5000 non-null   int64  
 3   income              5000 non-null   int64  
 4   zip_code            5000 non-null   int64  
 5   family              5000 non-null   int64  
 6   ccavg               5000 non-null   float64
 7   education           5000 non-null   int64  
 8   mortgage            5000 non-null   int64  
 9   personal_loan       5000 non-null   int64  
 10  securities_account  5000 non-null   int64  
 11  cd_account          5000 non-null   int64  
 12  online              5000 non-null   int64  
 13  creditcard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB


In [12]:
df.isna().sum()

id                    0
age                   0
experience            0
income                0
zip_code              0
family                0
ccavg                 0
education             0
mortgage              0
personal_loan         0
securities_account    0
cd_account            0
online                0
creditcard            0
dtype: int64

## Modelling the Data

In [13]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

## Fitting a Logistic Regression Model

In [14]:
log_reg_model = LogisticRegression( max_iter=900)
_ = log_reg_model.fit(X_train, np.ravel(y_train))

In [15]:
model_preds = log_reg_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"default logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
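The confusion-matrix arithmetic above is repeated for every model in this notebook. As a sketch of how it could be factored out, the hypothetical helper `metrics_row` below builds one row with the same columns as `performance`, using scikit-learn's metric functions instead of manual TP/FP/FN counts.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def metrics_row(name, y_true, y_pred):
    """Return a one-row DataFrame with the same columns as `performance`."""
    return pd.DataFrame({
        "model": [name],
        "Accuracy": [accuracy_score(y_true, y_pred)],
        # zero_division=0 avoids a warning when no positives are predicted.
        "Precision": [precision_score(y_true, y_pred, zero_division=0)],
        "Recall": [recall_score(y_true, y_pred)],
        "F1": [f1_score(y_true, y_pred)],
    })

# Tiny hand-made example: 2 TP, 1 FN, 1 FP, 2 TN.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
row = metrics_row("toy example", y_true, y_pred)
print(row)
```

A row built this way could be appended with `pd.concat([performance, row], ignore_index=True)`, which would also give the summary table a unique index instead of repeated zeros.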

### Logistic Regression using RandomizedSearchCV

In [16]:
hyperparam= {'C':[0.001, 0.01, 0.1, 1, 10, 100],
                      'penalty':['l2', 'l1']
                     }

score_measure = "recall"
#create a logistic regression model
logistic_model = LogisticRegression()

#create a random search cv object
random_search = RandomizedSearchCV(estimator=logistic_model,
                                     param_distributions=hyperparam,
                                     cv=5,scoring=score_measure,
                                     n_iter=10,
                                     n_jobs=-1,
                                   )

_ = random_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {random_search.best_score_}")
print(f"... with parameters: {random_search.best_params_}")

bestRecallTree = random_search.best_estimator_


The best recall score is 0.6662790697674419
... with parameters: {'penalty': 'l2', 'C': 100}


30 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\psrik\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\psrik\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\psrik\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan        nan      
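The failed fits reported above come from the default `lbfgs` solver, which does not support the `l1` penalty, so every `l1` candidate scores `nan` and only the `l2` candidates are actually compared. A minimal sketch of one possible fix follows, assuming the `liblinear` solver (which supports both penalties) is acceptable. It runs on synthetic data, not the bank data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the bank data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.9, 0.1], random_state=1)

params = {"C": [0.001, 0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]}

# liblinear handles both penalties, so no candidate fails to fit.
# error_score="raise" makes any remaining failure stop the search loudly.
search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear", max_iter=900),
    param_distributions=params, n_iter=10, cv=5, scoring="recall",
    random_state=1, error_score="raise")
search.fit(X_demo, y_demo)

# Check whether any candidate ended up with a nan score.
print(np.isnan(search.cv_results_["mean_test_score"]).any())
```

The same change would apply to the `GridSearchCV` cell for logistic regression, which reports the same solver error.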

In [17]:
c_matrix = confusion_matrix(y_test, random_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

performance = pd.concat([performance, pd.DataFrame({'model':"LogReg - RandomCV", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### Logistic Regression using GridSearchCV 

In [18]:
score_measure = "recall"

Log_Reg = LogisticRegression()

#Define the parameter grid
param_grid = {
    'penalty':['l1','l2'],
    'C':[0.1,1,10]
}

#Create the grid search object
grid_searchCV = GridSearchCV(estimator=Log_Reg, param_grid=param_grid, cv = 5, scoring= score_measure )


_ = grid_searchCV.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_searchCV.best_score_}")
print(f"... with parameters: {grid_searchCV.best_params_}")

bestRecallTree = grid_searchCV.best_estimator_

The best recall score is 0.6662790697674419
... with parameters: {'C': 0.1, 'penalty': 'l2'}


15 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\psrik\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\psrik\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\psrik\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.



In [19]:
c_matrix = confusion_matrix(y_test, grid_searchCV.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

performance = pd.concat([performance, pd.DataFrame({'model':"LogReg - GridCV", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### SVM classification model using linear kernel

In [20]:
svm_linear_model = SVC(kernel="linear")
_ = svm_linear_model.fit(X_train, np.ravel(y_train))

In [21]:
model_preds = svm_linear_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"SVM Linear", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### SVM Linear Model - RandomizedSearchCV

In [22]:
score_measure = "recall"
# Create parameter grid
param_grid = {'C': [0.1, 1, 10, 100],
              'max_iter': [1000, 1500, 2000]}

#create a SVM classifier
clf = SVC(kernel='linear')

# Create the random search model
rand_search = RandomizedSearchCV(clf, param_grid, cv=5, n_iter=50, scoring=score_measure, n_jobs=-1, verbose=1)


_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_



Fitting 5 folds for each of 12 candidates, totalling 60 fits
The best recall score is 0.6800211416490486
... with parameters: {'max_iter': 1500, 'C': 100}
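The log line "Fitting 5 folds for each of 12 candidates" shows that, with `n_iter=50` and only 12 possible combinations, the random search reduces to an exhaustive search over the list. Random search tends to be more informative when it samples from a continuous distribution. The sketch below assumes `scipy.stats.loguniform` for `C` and uses synthetic data. It leaves `max_iter` at its default, since a low cap can stop the solver before it converges.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the bank data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     weights=[0.85, 0.15], random_state=1)
X_demo = StandardScaler().fit_transform(X_demo)

# C is drawn from a log-uniform distribution between 0.1 and 100,
# so each of the n_iter candidates is a freshly sampled value.
search = RandomizedSearchCV(
    SVC(kernel="linear"),
    param_distributions={"C": loguniform(1e-1, 1e2)},
    n_iter=8, cv=5, scoring="recall", random_state=1)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The best value found this way can then define a narrow list for a follow-up `GridSearchCV`.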




In [23]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

performance = pd.concat([performance, pd.DataFrame({'model':"SVM Linear- RandCV", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### SVM Linear - GridSearchCV

In [24]:
score_measure = "recall"
kfolds = 5
#create a SVM classifier
clf = SVC(kernel='linear')
#define the parameter grid
param_grid = {'C': [0.1, 1, 10, 100],  
              'gamma': [1, 0.1, 0.01, 0.001], 
              'kernel': ['linear']}  
#instantiate the GridSearchCV object
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring= score_measure)


_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_


The best recall score is 0.6662790697674419
... with parameters: {'C': 0.1, 'gamma': 1, 'kernel': 'linear'}
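Note that `gamma` is a coefficient for the `rbf`, `poly`, and `sigmoid` kernels only. With `kernel='linear'` it has no effect, so the four `gamma` values in the grid above repeat the same fit four times for each `C`. The short check below, on synthetic data (an assumption), compares predictions from two linear SVMs that differ only in `gamma`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

# Same C, very different gamma values, linear kernel in both cases.
pred_a = SVC(kernel="linear", C=1, gamma=1).fit(X_demo, y_demo).predict(X_demo)
pred_b = SVC(kernel="linear", C=1, gamma=0.001).fit(X_demo, y_demo).predict(X_demo)

# Reports whether the two models make identical predictions.
print(np.array_equal(pred_a, pred_b))
```

For the linear kernel, searching over `C` alone covers the same models at a quarter of the cost.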


In [25]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

performance = pd.concat([performance, pd.DataFrame({'model':"SVM Linear- GridCV", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### SVM classification model using rbf kernel

In [26]:
svm_rbf_model = SVC(kernel="rbf", C=10, gamma='scale')
_ = svm_rbf_model.fit(X_train, np.ravel(y_train))

In [27]:
model_preds = svm_rbf_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"SVM rbf", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### SVM rbf - RandomizedSearchCV

In [28]:
score_measure = "recall"
param_grid = {'C': [0.1, 1, 10, 100],
              'max_iter': [1000, 1500, 2000]}

#create a SVM classifier
svm_clf = SVC(kernel='rbf')

# Create the random search model
random_search_rbf = RandomizedSearchCV(svm_clf, param_grid, cv=5, n_iter=50, scoring=score_measure, n_jobs=-1, verbose=1)


_ = random_search_rbf.fit(X_train, y_train)

print(f"The best {score_measure} score is {random_search_rbf.best_score_}")
print(f"... with parameters: {random_search_rbf.best_params_}")

bestRecallTree = random_search_rbf.best_estimator_



Fitting 5 folds for each of 12 candidates, totalling 60 fits
The best recall score is 0.6849894291754757
... with parameters: {'max_iter': 1500, 'C': 100}




In [29]:
c_matrix = confusion_matrix(y_test, random_search_rbf.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

performance = pd.concat([performance, pd.DataFrame({'model':"SVM rbf- RandCV", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### SVM rbf - GridSearchCV

In [30]:
score_measure = "recall"
clf = SVC(kernel='rbf')

#define the parameter grid
param_grid = {'C': [0.1, 1, 10, 100],  
              'gamma': [1, 0.1, 0.01, 0.001], 
              'kernel': ['rbf']} 


grid_CV = GridSearchCV(clf, param_grid, cv=5, scoring= score_measure)


_ = grid_CV.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_CV.best_score_}")
print(f"... with parameters: {grid_CV.best_params_}")

bestRecallTree = grid_CV.best_estimator_


The best recall score is 0.666384778012685
... with parameters: {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}


In [31]:
c_matrix = confusion_matrix(y_test, grid_CV.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

performance = pd.concat([performance, pd.DataFrame({'model':"SVM rbf- GridCV", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### SVM classification model using polynomial kernel

In [34]:
svm_poly_model = SVC(kernel="poly", degree=3, coef0=1, C=10, probability = True)
_ = svm_poly_model.fit(X_train, np.ravel(y_train))

In [35]:
model_preds = svm_poly_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"poly svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### SVM Poly - RandomizedSearchCV

In [36]:
score_measure = "recall"
param_grid = {'C': [0.1, 1, 10, 100],
              'max_iter': [1000, 1500, 2000]}

svm_clf = SVC(kernel='poly')

# Create the random search model
rand_search_polym = RandomizedSearchCV(svm_clf, param_grid, cv=5, n_iter=50, scoring=score_measure, n_jobs=-1, verbose=1)


_ = rand_search_polym.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_polym.best_score_}")
print(f"... with parameters: {rand_search_polym.best_params_}")

bestRecallTree = rand_search_polym.best_estimator_



Fitting 5 folds for each of 12 candidates, totalling 60 fits
The best recall score is 0.739429175475687
... with parameters: {'max_iter': 1000, 'C': 100}




In [37]:
c_matrix = confusion_matrix(y_test, rand_search_polym.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

performance = pd.concat([performance, pd.DataFrame({'model':"SVM Poly- RandCV", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### SVM Poly - GridSearchCV

In [38]:
score_measure = "recall"
clf = SVC(kernel='poly')

#define the parameter grid
param_grid = {'C': [0.1, 1, 10, 100],  
              'gamma': [1, 0.1, 0.01, 0.001], 
              'kernel': ['poly']}  
#instantiate the GridSearchCV object
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring= score_measure)


_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_


The best recall score is 0.6849894291754757
... with parameters: {'C': 10, 'gamma': 0.1, 'kernel': 'poly'}


In [39]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

performance = pd.concat([performance, pd.DataFrame({'model':"SVM Poly- GridCV", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### Decision Tree classification model using defaults 

In [40]:
decisiontree = DecisionTreeClassifier().fit(X_train, np.ravel(y_train))

In [41]:
c_matrix = confusion_matrix(y_test, decisiontree.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree using Defaults", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### Decision Tree - RandomizedSearchCV

In [42]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(1,70),  
    'min_samples_leaf': np.arange(1,70),
    'min_impurity_decrease': np.arange(0.0001, 0.01, 0.0005),
    'max_leaf_nodes': np.arange(5, 200), 
    'max_depth': np.arange(1,50), 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
The best recall score is 0.6891120507399577
... with parameters: {'min_samples_split': 3, 'min_samples_leaf': 2, 'min_impurity_decrease': 0.0006000000000000001, 'max_leaf_nodes': 157, 'max_depth': 31, 'criterion': 'entropy'}


45 fits failed out of a total of 2500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\psrik\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\psrik\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 937, in fit
    super().fit(
  File "C:\Users\psrik\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 250, in fit
    raise ValueError(
ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

 0.59830867 0.59830867 0.59830867        nan 0.54799154 0.57547569
 0.598308
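The 45 failed fits above are caused by `min_samples_split=1`: as the error message states, integer values must be greater than 1. A minimal sketch of the adjusted search space follows, starting the range at 2. It runs on synthetic data with fewer iterations and a reduced parameter set (both assumptions made to keep the example short).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.9, 0.1], random_state=1)

param_dist = {
    "min_samples_split": np.arange(2, 70),  # starts at 2, the smallest valid integer
    "min_samples_leaf": np.arange(1, 70),
    "max_depth": np.arange(1, 50),
    "criterion": ["entropy", "gini"],
}

# error_score="raise" stops the search on the first invalid candidate
# instead of silently recording nan.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1), param_dist,
    n_iter=30, cv=5, scoring="recall", random_state=1, error_score="raise")
search.fit(X_demo, y_demo)
print(search.best_params_)
```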

In [43]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree- RandCV", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### Decision Tree - GridSearchCV

In [44]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(30,36),  
    'min_samples_leaf': np.arange(6,12),
    'min_impurity_decrease': np.arange(0.0048, 0.0054, 0.0001),
    'max_leaf_nodes': np.arange(162,168), 
    'max_depth': np.arange(15,21), 
    'criterion': ['entropy'],
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_

Fitting 5 folds for each of 9072 candidates, totalling 45360 fits
The best recall score is 0.6480972515856237
... with parameters: {'criterion': 'entropy', 'max_depth': 15, 'max_leaf_nodes': 162, 'min_impurity_decrease': 0.0048, 'min_samples_leaf': 6, 'min_samples_split': 30}
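The grid in this cell (for example `max_depth` 15 to 20 and `min_samples_split` 30 to 35) is not centered on the random-search result reported earlier (`max_depth` 31, `min_samples_split` 3), and its best cross-validated recall (0.648) is lower than the random search's (0.689). One way to link the two stages is to derive the grid from the random search's `best_params_`. The sketch below does this with a hypothetical helper, `window`, and the parameter values printed by the random search above.

```python
import numpy as np

def window(center, half_width, lower):
    """Integer range around `center`, clipped so it never goes below `lower`."""
    start = max(lower, center - half_width)
    return np.arange(start, center + half_width + 1)

# Best parameters reported by the random search earlier in this notebook.
best = {"min_samples_split": 3, "min_samples_leaf": 2,
        "max_leaf_nodes": 157, "max_depth": 31}

param_grid = {
    "min_samples_split": window(best["min_samples_split"], 2, lower=2),
    "min_samples_leaf": window(best["min_samples_leaf"], 2, lower=1),
    "max_leaf_nodes": window(best["max_leaf_nodes"], 2, lower=2),
    "max_depth": window(best["max_depth"], 2, lower=1),
    "criterion": ["entropy"],
}
print({k: list(v) for k, v in param_grid.items()})
```

In a live notebook, `best` would be read from `rand_search.best_params_` rather than typed in by hand.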


In [45]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]


performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree- GridCV", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [46]:
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | default logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | LogReg - RandomCV | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | LogReg - GridCV | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | SVM Linear | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | SVM Linear- RandCV | 0.896667 | 0.293103 | 0.614458 | 0.396887 |
| 0 | SVM Linear- GridCV | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | SVM rbf | 0.972 | 0.836066 | 0.614458 | 0.708333 |
| 0 | SVM rbf- RandCV | 0.962 | 0.675676 | 0.60241 | 0.636943 |
| 0 | SVM rbf- GridCV | 0.970667 | 0.819672 | 0.60241 | 0.694444 |
| 0 | poly svm | 0.962667 | 0.684932 | 0.60241 | 0.641026 |


### Analysis:


> The highest test recall, 0.6265, was obtained by three models: the decision tree with default parameters, the SVM with polynomial kernel tuned with RandomizedSearchCV, and the decision tree tuned with GridSearchCV.

> The second-highest recall, 0.6144, was obtained by the SVM with linear kernel tuned with RandomizedSearchCV, the SVM with RBF kernel, and the decision tree tuned with RandomizedSearchCV.

> Hence, based on recall, the top models are the SVM with polynomial kernel tuned with RandomizedSearchCV, the decision tree with default parameters, and the decision tree tuned with GridSearchCV.