### Sai Srihitha Goverdhana U58956033

#### Import Required Libraries

In [1]:
import warnings   #To get rid of warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

## Performance metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report


## train test split
from sklearn import model_selection
from sklearn.model_selection import train_test_split

## Different Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

## Hyper parameter tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

## Plots
from matplotlib import pyplot as plt
%matplotlib inline

## Neural Network
from sklearn.neural_network import MLPClassifier

## Set random seed to 1 - to maintain the consistency of the results
np.random.seed(1)

# from sklearn.preprocessing import StandardScaler
# from sklearn.preprocessing import LabelEncoder

### Import the data

In [2]:
import os
print(os.getcwd())

C:\Users\srihi\DSP


In [3]:
df = pd.read_csv("./data/thyroid_prep.csv")

In [4]:
X_train = pd.read_csv('./data/thyroid_X_train.csv')
y_train = pd.read_csv('./data/thyroid_y_train.csv')
X_test = pd.read_csv('./data/thyroid_X_test.csv')
y_test = pd.read_csv('./data/thyroid_y_test.csv')
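`pd.read_csv` returns the label files as one-column DataFrames, which is why some fits below wrap `y_train` in `np.ravel`. The sketch below shows the shape change on a small in-memory CSV; the `class` column name and the three label values are assumptions for illustration, not the contents of the real files.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for thyroid_y_train.csv: one label column.
csv_text = "class\n3\n3\n1\n"

y_frame = pd.read_csv(io.StringIO(csv_text))  # shape (3, 1): a column vector
y_flat = np.ravel(y_frame)                    # shape (3,): the 1-D form scikit-learn expects

print(y_frame.shape, y_flat.shape)
```

Flattening the labels once after loading would avoid repeating `np.ravel` in every `fit` call and would avoid the column-vector warning that scikit-learn otherwise raises.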

In [5]:
df.head(6)

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,i131_treatment,query_hypothyroid,...,goitre,tumor,hypopituitary,psych,tsh,t3,tt4,t4u,fti,class
0,0.73,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0.0006,0.015,0.12,0.082,0.146,3
1,0.24,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0.00025,0.03,0.143,0.133,0.108,3
2,0.47,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0.0019,0.024,0.102,0.131,0.078,3
3,0.64,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0.0009,0.017,0.077,0.09,0.085,3
4,0.23,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0.00025,0.026,0.139,0.09,0.153,3
5,0.69,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0.00025,0.016,0.086,0.07,0.123,3


### Model the Data

> Let's create a pandas DataFrame to store our performance metric, which is accuracy in our case.

> The purpose of the `performance` DataFrame is to make it easy to compare the different models on our dataset.

In [6]:
performance = pd.DataFrame({"model": [], "Accuracy": []})
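The `pd.concat(..., index=[0])` pattern used throughout this notebook gives every row the index 0. One possible alternative, sketched here, keeps the results in a plain list and rebuilds the table on demand. The helper name `record_result` and the `rows` list are hypothetical and are not used elsewhere in the notebook.

```python
import pandas as pd

def record_result(rows, model_name, accuracy):
    """Append one result and return the table rebuilt with a clean 0..n-1 index."""
    rows.append({"model": model_name, "Accuracy": accuracy})
    return pd.DataFrame(rows)

rows = []  # hypothetical accumulator for all model results
performance = record_result(rows, "Log Reg", 0.926852)
performance = record_result(rows, "Rand Log Reg", 0.956944)

# Sorting shows the best model first.
print(performance.sort_values(by="Accuracy", ascending=False))
```

With a unique index per row, `performance.loc[...]` lookups address a single model instead of every row at once.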

## Logistic Regression

In [7]:
log_reg_model = LogisticRegression(max_iter=1100)
_ = log_reg_model.fit(X_train, np.ravel(y_train))

In [8]:
model_preds = log_reg_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)

In [9]:
confusion_matrix(y_test, model_preds)

array([[  32,    9,    0],
       [   1,  113,    1],
       [  17,  130, 1857]], dtype=int64)

### Manually Calculating Total TP, FP and FN

> For a better understanding of model performance, let us calculate the True Positive, False Positive and False Negative values.

> As it is a multi-class classification problem (classes 1, 2, 3), our confusion matrix is a 3 x 3 matrix.

In [10]:
# Values are taken from the confusion matrix above (rows = true class, columns = predicted class)
Total_TP = 32+113+1857                 # diagonal entries
print("Total True Positive Values:",Total_TP)
Total_FP = (1+17)+(9+130)+(0+1)        # off-diagonal entries, summed per column
print("Total False Positive Values:",Total_FP)
Total_FN = (9+0)+(1+1)+(17+130)        # off-diagonal entries, summed per row
print("Total False Negative Values:",Total_FN)

Total True Positive Values: 2002
Total False Positive Values: 158
Total False Negative Values: 158


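> The output shows that total FP equals total FN. This always holds when the counts are summed over all classes, because every misclassified sample is a false negative for its true class and a false positive for its predicted class. Assuming the cost of an FP and an FN is equal, micro-averaged accuracy is sufficient to evaluate the model.

> In single-label multiclass evaluation, accuracy = recall = precision = F1 under 'micro' averaging.

> I am using the micro-averaged score because each prediction should be weighted equally.

**Because accuracy, recall, precision and F1 score are equal under micro averaging, I am using accuracy as the performance metric for comparing models.**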
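The micro-average equality can be checked directly from the logistic regression confusion matrix printed above. This is a minimal NumPy sketch; the array is copied from that output, and the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted).
cm = np.array([[32, 9, 0],
               [1, 113, 1],
               [17, 130, 1857]])

tp = np.diag(cm)             # correct predictions per class
fp = cm.sum(axis=0) - tp     # predicted as the class but actually another class
fn = cm.sum(axis=1) - tp     # actually the class but predicted as another class

# Micro averaging pools the counts over all classes before dividing.
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
accuracy = tp.sum() / cm.sum()

print(fp.sum(), fn.sum())
print(micro_precision, micro_recall, accuracy)
```

Because the pooled FP and FN totals are the same set of misclassified samples, both denominators equal the total number of samples, which is why all three values coincide.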


In [11]:
logreg_acc= accuracy_score(y_test, model_preds)
print(logreg_acc)

0.9268518518518518


In [12]:
precision_score(y_test, model_preds, average='micro')

0.9268518518518518

In [13]:
recall_score(y_test, model_preds, average='micro')

0.9268518518518518

In [14]:
f1_score(y_test, model_preds, average='micro')

0.9268518518518518

> Since we chose micro averaging, the cells above simply confirm that the metrics agree (all have the same value, 0.9268518518518518).

> For the remaining models, let's calculate and print accuracy only.

In [15]:
print(classification_report(y_test, model_preds))

              precision    recall  f1-score   support

           1       0.64      0.78      0.70        41
           2       0.45      0.98      0.62       115
           3       1.00      0.93      0.96      2004

    accuracy                           0.93      2160
   macro avg       0.70      0.90      0.76      2160
weighted avg       0.96      0.93      0.94      2160
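The per-class and macro-averaged rows of this report can be reproduced from the confusion matrix printed earlier for the same predictions. The sketch below uses only NumPy; the matrix is copied from that output.

```python
import numpy as np

# Logistic regression confusion matrix (rows = true, columns = predicted).
cm = np.array([[32, 9, 0],
               [1, 113, 1],
               [17, 130, 1857]])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)   # correct / everything predicted as the class
recall = tp / cm.sum(axis=1)      # correct / everything that truly is the class

# Macro averaging takes the plain mean over classes, so small classes count as much as large ones.
macro_precision = precision.mean()
macro_recall = recall.mean()

print(np.round(precision, 2), np.round(recall, 2))
print(round(macro_precision, 2), round(macro_recall, 2))
```

Unlike the micro average, the macro average is pulled down by the low precision on classes 1 and 2, which the overall accuracy hides because class 3 dominates the test set.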



In [16]:
performance = pd.concat([performance, pd.DataFrame({'model':"Log Reg", 
                                                    'Accuracy': logreg_acc }, index=[0])])

In [17]:
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852


### Logistic Regression - RandomizedSearchCV

In [18]:
#prepare the hyperparameter list to tune
hyperparameter_list = {'C':[0.001, 0.01, 0.1, 1, 10, 100]
                     }

score_measure = "accuracy"

#create a random search cv object
random_search_cv_log = RandomizedSearchCV(estimator=log_reg_model,
                                     param_distributions=hyperparameter_list,
                                     cv=5,
                                          scoring=score_measure,
                                     n_iter=10,
                                     n_jobs=-1
                                    )

_ = random_search_cv_log.fit(X_train, y_train)

print(f"The best {score_measure} score is {random_search_cv_log.best_score_}")
print(f"... with parameters: {random_search_cv_log.best_params_}")

bestRecallTree = random_search_cv_log.best_estimator_


The best accuracy score is 0.9642574257425742
... with parameters: {'C': 100}


In [19]:
model_preds = random_search_cv_log.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
rand_logreg_acc= accuracy_score(y_test, model_preds)
print(rand_logreg_acc)

0.9569444444444445


In [20]:
c_matrix = confusion_matrix(y_test, random_search_cv_log.predict(X_test))

performance = pd.concat([performance, pd.DataFrame({'model':"Rand Log Reg", 
                                                    'Accuracy': rand_logreg_acc }, index=[0])])

In [21]:
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944


### Logistic Regression - GridSearchCV 

In [22]:
score_measure = "accuracy"

#Define the parameter grid
param_grid = {
    'C':[0.1,1,10]
}

grid_search = GridSearchCV(estimator=log_reg_model, param_grid=param_grid, cv = 5, scoring=score_measure)


_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_

The best accuracy score is 0.9622772277227722
... with parameters: {'C': 10}


In [23]:
model_preds = grid_search.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
grid_logreg_acc= accuracy_score(y_test, model_preds)
print(grid_logreg_acc)

0.9546296296296296


In [24]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
performance = pd.concat([performance, pd.DataFrame({'model':"Grid Log Reg", 
                                                    'Accuracy': grid_logreg_acc }, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463


### SVM

#### 3.1 Fit an SVM classification model using linear kernel

In [25]:
svm_lin_model = SVC(kernel="linear")
_ = svm_lin_model.fit(X_train, np.ravel(y_train))

In [26]:
model_preds = svm_lin_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
lin_svm_acc= accuracy_score(y_test, model_preds)
print(lin_svm_acc)

0.9375


In [27]:
performance = pd.concat([performance, pd.DataFrame({'model':"Linear SVM", 
                                                    'Accuracy': lin_svm_acc }, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375


### SVM Linear - RandomizedSearchCV

In [28]:
score_measure = "accuracy"
# Create parameter grid
param_grid = {'C': [0.1, 1, 10, 100],
              'max_iter': [1000, 1500, 2000]}


# Create the random search model
rand_search = RandomizedSearchCV(svm_lin_model, param_grid, cv=5, n_iter=50, scoring=score_measure, n_jobs=-1, verbose=1)


_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_

Fitting 5 folds for each of 12 candidates, totalling 60 fits
The best accuracy score is 0.9681980198019801
... with parameters: {'max_iter': 1500, 'C': 10}
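The log reports 12 candidates even though `n_iter=50` was requested. When every parameter is given as a list, the randomized search cannot sample more combinations than the grid contains, so it evaluates the whole grid. The sketch below counts the combinations with the standard library only; `n_iter_requested` and `cv_folds` are illustrative names for the values passed above.

```python
from itertools import product

# Same search space as the cell above.
param_grid = {"C": [0.1, 1, 10, 100],
              "max_iter": [1000, 1500, 2000]}

# Every distinct combination of the listed values.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_iter_requested = 50
cv_folds = 5

# The search is capped at the size of the finite grid.
n_evaluated = min(n_iter_requested, len(candidates))
total_fits = n_evaluated * cv_folds

print(len(candidates), n_evaluated, total_fits)
```

With a space this small, the randomized search behaves like an exhaustive grid search; sampling only pays off when the space is larger than `n_iter` or when continuous distributions are used.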


In [29]:
model_preds = rand_search.predict(X_test)
lin_svm_acc= accuracy_score(y_test, model_preds)
print(lin_svm_acc)

0.9435185185185185


In [30]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
performance = pd.concat([performance, pd.DataFrame({'model':"Linear Rand SVM", 
                                                    'Accuracy': lin_svm_acc }, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519


### SVM Linear - GridSearchCV

In [31]:
score_measure = "accuracy"
kfolds = 5

#define the parameter grid
param_grid = {'C': [0.1, 1, 10, 100],  
              'gamma': [1, 0.1, 0.01, 0.001], 
              'kernel': ['linear']}  
#instantiate the GridSearchCV object
grid = GridSearchCV(svm_lin_model, param_grid, cv=5,scoring=score_measure)


_ = grid.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid.best_score_}")
print(f"... with parameters: {grid.best_params_}")

bestRecallTree = grid.best_estimator_


The best accuracy score is 0.9661782178217821
... with parameters: {'C': 100, 'gamma': 1, 'kernel': 'linear'}
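The grid above varies `gamma`, but the linear kernel has no `gamma` term, so for `kernel='linear'` only `C` distinguishes the candidates. The NumPy sketch below writes out the two kernel formulas to make this visible. The helper functions and the toy points are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def linear_kernel(a, b):
    """Linear kernel: plain dot products, no gamma involved."""
    return a @ b.T

def rbf_kernel(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * sq_dist)

# Three toy points, purely illustrative.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

K_lin = linear_kernel(X, X)
K_rbf_wide = rbf_kernel(X, X, gamma=0.01)    # small gamma: distant points still look similar
K_rbf_narrow = rbf_kernel(X, X, gamma=1.0)   # large gamma: similarity decays quickly

print(K_lin)
print(K_rbf_wide.round(3))
print(K_rbf_narrow.round(3))
```

This is why tuning `gamma` matters for the RBF searches later in the notebook but adds only redundant fits to the linear one.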


In [32]:
model_preds = grid.predict(X_test)
lingrid_svm_acc= accuracy_score(y_test, model_preds)
print(lingrid_svm_acc)

0.9555555555555556


In [33]:
c_matrix = confusion_matrix(y_test, grid.predict(X_test))
performance = pd.concat([performance, pd.DataFrame({'model':"Linear Grid SVM", 
                                                    'Accuracy': lingrid_svm_acc}, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556


### 3.2 Fit an SVM classification model using rbf kernel

In [34]:
svm_rbf_model = SVC(kernel="rbf", C=10, gamma='scale')
_ = svm_rbf_model.fit(X_train, np.ravel(y_train))

In [35]:
model_preds = svm_rbf_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)

In [36]:
rbf_svm_acc= accuracy_score(y_test, model_preds)
print(rbf_svm_acc)

0.8462962962962963


In [37]:
performance = pd.concat([performance, pd.DataFrame({'model':"rbf SVM", 
                                                    'Accuracy': rbf_svm_acc }, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556
0,rbf SVM,0.846296


### SVM rbf - RandomizedSearchCV

In [38]:
from sklearn.model_selection import RandomizedSearchCV

score_measure = "accuracy"
# Create parameter grid
param_grid = {'C': [0.1, 1, 10, 100],
              'max_iter': [1000, 1500, 2000]}

# Create the random search model
rand_search_rbf = RandomizedSearchCV(svm_rbf_model, param_grid, cv=5, n_iter=50, scoring=score_measure, n_jobs=-1, verbose=1)


_ = rand_search_rbf.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_rbf.best_score_}")
print(f"... with parameters: {rand_search_rbf.best_params_}")

bestRecallTree = rand_search_rbf.best_estimator_

Fitting 5 folds for each of 12 candidates, totalling 60 fits
The best accuracy score is 0.9164950495049504
... with parameters: {'max_iter': 1000, 'C': 10}


In [39]:
model_preds = rand_search_rbf.predict(X_test)
rand_rbf_acc= accuracy_score(y_test, model_preds)
print(rand_rbf_acc)

0.8462962962962963


In [40]:
c_matrix = confusion_matrix(y_test, rand_search_rbf.predict(X_test))
performance = pd.concat([performance, pd.DataFrame({'model':"rbf rand svm", 
                                                    'Accuracy': rand_rbf_acc }, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556
0,rbf SVM,0.846296
0,rbf rand svm,0.846296


### SVM rbf - GridSearchCV

In [41]:
score_measure = "accuracy"

#define the parameter grid
param_grid = {'C': [0.1, 1, 10, 100],  
              'gamma': [1, 0.1, 0.01, 0.001], 
              'kernel': ['rbf']}  
#instantiate the GridSearchCV object
grid = GridSearchCV(svm_rbf_model, param_grid, cv=5, scoring=score_measure)


_ = grid.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid.best_score_}")
print(f"... with parameters: {grid.best_params_}")

bestRecallTree = grid.best_estimator_


The best accuracy score is 0.9523366336633664
... with parameters: {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}


In [42]:
model_preds = grid.predict(X_test)
grid_rbf_acc= accuracy_score(y_test, model_preds)
print(grid_rbf_acc)

0.9361111111111111


In [43]:
c_matrix = confusion_matrix(y_test, grid.predict(X_test))
performance = pd.concat([performance, pd.DataFrame({'model':"rbf grid svm", 
                                                    'Accuracy': grid_rbf_acc }, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556
0,rbf SVM,0.846296
0,rbf rand svm,0.846296
0,rbf grid svm,0.936111


### 3.3 Fit an SVM classification model using polynomial kernel

In [44]:
svm_poly_model = SVC(kernel="poly", degree=3, coef0=1, C=10, probability=True)
_ = svm_poly_model.fit(X_train, np.ravel(y_train))

In [45]:
model_preds = svm_poly_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
poly_svm_acc= accuracy_score(y_test, model_preds)
print(poly_svm_acc)

0.9180555555555555


In [46]:
performance = pd.concat([performance, pd.DataFrame({'model':"poly svm", 
                                                    'Accuracy': poly_svm_acc}, index=[0])])

In [47]:
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556
0,rbf SVM,0.846296
0,rbf rand svm,0.846296
0,rbf grid svm,0.936111
0,poly svm,0.918056


### SVM Poly - RandomizedSearchCV

In [48]:
score_measure = "accuracy"
# Create parameter grid
param_grid = {'C': [0.1, 1, 10, 100],
              'max_iter': [1000, 1500, 2000]}


# Create the random search model
rand_search_poly = RandomizedSearchCV(svm_poly_model, param_grid, cv=5, n_iter=50, scoring=score_measure, n_jobs=-1, verbose=1)


_ = rand_search_poly.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search_poly.best_score_}")
print(f"... with parameters: {rand_search_poly.best_params_}")

bestRecallTree = rand_search_poly.best_estimator_

Fitting 5 folds for each of 12 candidates, totalling 60 fits
The best accuracy score is 0.9463762376237623
... with parameters: {'max_iter': 1000, 'C': 100}


In [49]:
model_preds = rand_search_poly.predict(X_test)
rand_poly_acc= accuracy_score(y_test, model_preds)
print(rand_poly_acc)

0.9101851851851852


In [50]:
c_matrix = confusion_matrix(y_test, rand_search_poly.predict(X_test))
performance = pd.concat([performance, pd.DataFrame({'model':"poly rand svm", 
                                                    'Accuracy': rand_poly_acc }, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556
0,rbf SVM,0.846296
0,rbf rand svm,0.846296
0,rbf grid svm,0.936111
0,poly svm,0.918056
0,poly rand svm,0.910185


### SVM Poly - GridSearchCV

In [51]:
score_measure = "accuracy"
#create a SVM classifier
clf = SVC(kernel='poly')
#define the parameter grid
param_grid = {'C': [0.1, 1, 10, 100],  
              'gamma': [1, 0.1, 0.01, 0.001], 
              'kernel': ['poly']}  
#instantiate the GridSearchCV object
grid = GridSearchCV(clf, param_grid, cv=5,scoring=score_measure)


_ = grid.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid.best_score_}")
print(f"... with parameters: {grid.best_params_}")

bestRecallTree = grid.best_estimator_


The best accuracy score is 0.8828118811881188
... with parameters: {'C': 1, 'gamma': 1, 'kernel': 'poly'}


In [52]:
model_preds = grid.predict(X_test)
grid_poly_acc= accuracy_score(y_test, model_preds)
print(grid_poly_acc)

0.8462962962962963


In [53]:
c_matrix = confusion_matrix(y_test, grid.predict(X_test))
performance = pd.concat([performance, pd.DataFrame({'model':"poly grid svm", 
                                                    'Accuracy': grid_poly_acc }, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556
0,rbf SVM,0.846296
0,rbf rand svm,0.846296
0,rbf grid svm,0.936111
0,poly svm,0.918056
0,poly rand svm,0.910185
0,poly grid svm,0.846296


### Decision Tree

In [54]:
dtree = DecisionTreeClassifier()
_ = dtree.fit(X_train, np.ravel(y_train))

In [55]:
model_preds = dtree.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
dtree_acc= accuracy_score(y_test, model_preds)
print(dtree_acc)

0.987037037037037


In [56]:
performance = pd.concat([performance, pd.DataFrame({'model':"D Tree", 
                                                    'Accuracy': dtree_acc }, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556
0,rbf SVM,0.846296
0,rbf rand svm,0.846296
0,rbf grid svm,0.936111
0,poly svm,0.918056
0,poly rand svm,0.910185
0,poly grid svm,0.846296
0,D Tree,0.987037


### DTree - RandomizedSearchCV

In [57]:
score_measure = "accuracy"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(2,50),  # min_samples_split must be at least 2
    'min_samples_leaf': np.arange(1,50),
    'min_impurity_decrease': np.arange(0.0001, 0.01, 0.0005),
    'max_leaf_nodes': np.arange(5, 200), 
    'max_depth': np.arange(1,50), 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
The best accuracy score is 0.996019801980198
... with parameters: {'min_samples_split': 38, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.004600000000000001, 'max_leaf_nodes': 129, 'max_depth': 41, 'criterion': 'gini'}


In [58]:
model_preds = rand_search.predict(X_test)
dtree_rand_acc= accuracy_score(y_test, model_preds)
print(dtree_rand_acc)

0.9828703703703704


In [59]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
performance = pd.concat([performance, pd.DataFrame({'model':"D Tree Rand", 
                                                    'Accuracy': dtree_rand_acc}, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556
0,rbf SVM,0.846296
0,rbf rand svm,0.846296
0,rbf grid svm,0.936111
0,poly svm,0.918056
0,poly rand svm,0.910185
0,poly grid svm,0.846296
0,D Tree,0.987037
0,D Tree Rand,0.98287


### DTree - GridSearchCV

In [60]:
score_measure = "accuracy"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(30,36),  
    'min_samples_leaf': np.arange(6,12),
    'min_impurity_decrease': np.arange(0.0048, 0.0054, 0.0001),
    'max_leaf_nodes': np.arange(162,168), 
    'max_depth': np.arange(15,21), 
    'criterion': ['entropy'],
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_

Fitting 5 folds for each of 9072 candidates, totalling 45360 fits
The best accuracy score is 0.9840990099009901
... with parameters: {'criterion': 'entropy', 'max_depth': 15, 'max_leaf_nodes': 162, 'min_impurity_decrease': 0.0048, 'min_samples_leaf': 8, 'min_samples_split': 30}
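The log reports 9072 candidates. The four integer ranges contribute 6 values each and `criterion` contributes 1, giving 6 x 6 x 6 x 6 = 1296, so the float `np.arange` for `min_impurity_decrease` must have produced 7 values rather than the 6 one might expect. NumPy's documentation recommends `np.linspace` for non-integer steps for this reason. The sketch below compares the two; the `linspace` endpoints are my assumption about the intended range.

```python
import numpy as np

# Float-step arange: the number of elements depends on floating-point rounding.
float_steps = np.arange(0.0048, 0.0054, 0.0001)

# linspace with an explicit count always returns exactly `num` values
# (assumed intent: 0.0048, 0.0049, ..., 0.0053).
even_steps = np.linspace(0.0048, 0.0053, num=6)

# Sizes of the other grid dimensions: four integer ranges of 6 values, one criterion.
other_sizes = 6 * 6 * 6 * 6 * 1
n_candidates = other_sizes * len(float_steps)

print(len(float_steps), len(even_steps), n_candidates)
```

Fixing the count explicitly keeps the number of fits predictable, which matters when each extra value multiplies the grid by 1296 combinations and 5 folds.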


In [61]:
model_preds = grid_search.predict(X_test)
dtree_grid_acc= accuracy_score(y_test, model_preds)
print(dtree_grid_acc)

0.950925925925926


In [62]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
performance = pd.concat([performance, pd.DataFrame({'model':"D Tree Grid", 
                                                    'Accuracy': dtree_grid_acc}, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556
0,rbf SVM,0.846296
0,rbf rand svm,0.846296
0,rbf grid svm,0.936111
0,poly svm,0.918056
0,poly rand svm,0.910185
0,poly grid svm,0.846296
0,D Tree,0.987037
0,D Tree Rand,0.98287
0,D Tree Grid,0.950926


## Neural Network

In [64]:
%%time

ann = MLPClassifier(hidden_layer_sizes=(60,50,40), solver='adam', max_iter=200)
_ = ann.fit(X_train, y_train)

Wall time: 915 ms


In [65]:
%%time
y_pred = ann.predict(X_test)

Wall time: 16.3 ms


In [66]:
nn_acc= accuracy_score(y_test, y_pred)
print(nn_acc)

0.8763888888888889


In [67]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.52      0.83      0.64        41
           2       0.32      0.97      0.48       115
           3       1.00      0.87      0.93      2004

    accuracy                           0.88      2160
   macro avg       0.61      0.89      0.69      2160
weighted avg       0.95      0.88      0.90      2160



In [68]:
performance = pd.concat([performance, pd.DataFrame({'model':"Neural Net", 
                                                    'Accuracy': nn_acc}, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556
0,rbf SVM,0.846296
0,rbf rand svm,0.846296
0,rbf grid svm,0.936111
0,poly svm,0.918056
0,poly rand svm,0.910185
0,poly grid svm,0.846296
0,D Tree,0.987037
0,D Tree Rand,0.98287
0,D Tree Grid,0.950926
0,Neural Net,0.876389


### Neural Networks With RandomizedSearchCV

In [69]:
%%time

score_measure = "accuracy"
kfolds = 5

param_grid = {
    'hidden_layer_sizes': [ (50,), (70,),(50,30), (40,20), (60,40, 20), (70,50,40)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['adam', 'sgd'],
    'alpha': [0, .2, .5, .7, 1],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'learning_rate_init': [0.001, 0.01, 0.1, 0.2, 0.5],
    'max_iter': [5000]
}

ann = MLPClassifier()
grid_search = RandomizedSearchCV(estimator = ann, param_distributions=param_grid, cv=kfolds, n_iter=100,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

bestRecallTree = grid_search.best_estimator_

print(grid_search.best_params_)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
{'solver': 'adam', 'max_iter': 5000, 'learning_rate_init': 0.01, 'learning_rate': 'invscaling', 'hidden_layer_sizes': (50,), 'alpha': 0, 'activation': 'tanh'}
Wall time: 23.5 s


In [70]:
%%time
y_pred = bestRecallTree.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.34      0.85      0.49        41
           2       0.41      0.97      0.57       115
           3       1.00      0.89      0.94      2004

    accuracy                           0.89      2160
   macro avg       0.58      0.90      0.67      2160
weighted avg       0.95      0.89      0.91      2160

Wall time: 24.4 ms


In [71]:
nn_acc_rand= accuracy_score(y_test, y_pred)
print(nn_acc_rand)

0.8935185185185185


In [72]:
performance = pd.concat([performance, pd.DataFrame({'model':"NN Rand", 
                                                    'Accuracy': nn_acc_rand}, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556
0,rbf SVM,0.846296
0,rbf rand svm,0.846296
0,rbf grid svm,0.936111
0,poly svm,0.918056
0,poly rand svm,0.910185
0,poly grid svm,0.846296
0,D Tree,0.987037
0,D Tree Rand,0.98287
0,D Tree Grid,0.950926
0,Neural Net,0.876389
0,NN Rand,0.893519


### Neural Networks With GridSearchCV

In [73]:
%%time

score_measure = "accuracy"
kfolds = 5

param_grid = {
    'hidden_layer_sizes': [ (30,), (50,), (70,), (90,)],
    'activation': ['tanh', 'relu'],
    'solver': ['adam'],
    'alpha': [.5, .7, 1],
    'learning_rate': ['adaptive', 'invscaling'],
    'learning_rate_init': [0.005, 0.01, 0.15],
    'max_iter': [5000]
}

ann = MLPClassifier()
grid_search = GridSearchCV(estimator = ann, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

bestRecallTree = grid_search.best_estimator_

print(grid_search.best_params_)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
{'activation': 'relu', 'alpha': 0.5, 'hidden_layer_sizes': (70,), 'learning_rate': 'invscaling', 'learning_rate_init': 0.005, 'max_iter': 5000, 'solver': 'adam'}
Wall time: 8.7 s


In [74]:
%%time
y_pred = bestRecallTree.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.59      0.78      0.67        41
           2       0.52      0.98      0.68       115
           3       1.00      0.94      0.97      2004

    accuracy                           0.94      2160
   macro avg       0.70      0.90      0.77      2160
weighted avg       0.97      0.94      0.95      2160

Wall time: 40.1 ms


In [75]:
nn_acc_grid= accuracy_score(y_test, y_pred)
print(nn_acc_grid)

0.9402777777777778


In [76]:
performance = pd.concat([performance, pd.DataFrame({'model':"NN Grid", 
                                                    'Accuracy': nn_acc_grid}, index=[0])])
performance

Unnamed: 0,model,Accuracy
0,Log Reg,0.926852
0,Rand Log Reg,0.956944
0,Grid Log Reg,0.95463
0,Linear SVM,0.9375
0,Linear Rand SVM,0.943519
0,Linear Grid SVM,0.955556
0,rbf SVM,0.846296
0,rbf rand svm,0.846296
0,rbf grid svm,0.936111
0,poly svm,0.918056
0,poly rand svm,0.910185
0,poly grid svm,0.846296
0,D Tree,0.987037
0,D Tree Rand,0.98287
0,D Tree Grid,0.950926
0,Neural Net,0.876389
0,NN Rand,0.893519
0,NN Grid,0.940278


In [77]:
performance.sort_values(by=['Accuracy'])

Unnamed: 0,model,Accuracy
0,poly grid svm,0.846296
0,rbf SVM,0.846296
0,rbf rand svm,0.846296
0,Neural Net,0.876389
0,NN Rand,0.893519
0,poly rand svm,0.910185
0,poly svm,0.918056
0,Log Reg,0.926852
0,rbf grid svm,0.936111
0,Linear SVM,0.9375
0,NN Grid,0.940278
0,Linear Rand SVM,0.943519
0,D Tree Grid,0.950926
0,Grid Log Reg,0.95463
0,Linear Grid SVM,0.955556
0,Rand Log Reg,0.956944
0,D Tree Rand,0.98287
0,D Tree,0.987037


## Analysis:

> The Decision Tree models performed best on this dataset. The untuned Decision Tree reached a test accuracy of 98.7%, and the Decision Tree tuned with RandomizedSearchCV reached 98.3%. The Decision Tree tuned with GridSearchCV scored lower, at 95.1%; its grid covered only narrow ranges with the criterion fixed to 'entropy', whereas the random search selected 'gini'. The next-best models were Random Logistic Regression (95.7%), Linear Grid SVM (95.6%) and Grid Logistic Regression (95.5%), followed by Neural Network Grid (94.0%).

> The least accurate models were Poly Grid SVM, RBF SVM and RBF Random SVM, all with a test accuracy of 84.6%. The RBF random search did not improve on the untuned RBF model: it tuned only C and max_iter, and it selected C=10, the value the untuned model already used.

> Overall, test accuracy ranged from 84.6% to 98.7%. Decision trees are the top performers on this dataset, the tuned logistic regression and linear SVM models are close behind at about 95% to 96%, and the untuned RBF SVM and the grid-searched polynomial SVM are the least reliable.
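As a point of reference for these accuracies, the class supports in the classification reports (41, 115 and 2004 test samples) imply that always predicting class 3 would already be correct for most of the test set. The sketch below computes that majority-class baseline and lists the models that fall below it. The accuracy values are copied from the outputs above, and the baseline comparison itself is my addition rather than part of the original analysis.

```python
# Test-set class supports, taken from the classification reports above.
support = {1: 41, 2: 115, 3: 2004}

# Accuracy of a classifier that always predicts the most frequent class.
baseline = max(support.values()) / sum(support.values())

# Test accuracies copied from the notebook outputs.
accuracies = {
    "Log Reg": 0.926852, "Rand Log Reg": 0.956944, "Grid Log Reg": 0.95463,
    "Linear SVM": 0.9375, "Linear Rand SVM": 0.943519, "Linear Grid SVM": 0.955556,
    "rbf SVM": 0.846296, "rbf rand svm": 0.846296, "rbf grid svm": 0.936111,
    "poly svm": 0.918056, "poly rand svm": 0.910185, "poly grid svm": 0.846296,
    "D Tree": 0.987037, "D Tree Rand": 0.98287, "D Tree Grid": 0.950926,
    "Neural Net": 0.876389, "NN Rand": 0.893519, "NN Grid": 0.940278,
}

below_baseline = sorted(name for name, acc in accuracies.items() if acc < baseline)

print(round(baseline, 4))
print(below_baseline)
```

Because the classes are this imbalanced, it may be worth reading the accuracy ranking alongside the per-class recall in the classification reports before choosing a final model.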