# Santosh Ainumpudi -U68091846

# Parkinsons Disease Prediction

My main analysis from this dataset is to differentiate between healthy individuals and those with parkinsons disease (PD). This is indicated by the "status" column, which has a value of 0 for healthy individuals and 1 for those with parkinsons disease (PD).

This dataset contains voice measurements from 31 individuals, 23 of whom have Parkinson's disease (PD). Each row represents a different voice recording, identified by the name of the individual. The columns correspond to various voice measurements.

Attribute Information:


1.Name                                
ASCII subject name and recording number

2.MDVP:Fo(Hz)                         
Average vocal fundamental frequency

3.MDVP:Fhi(Hz)                        
Maximum vocal fundamental frequency

4.MDVP:Flo(Hz)                        
Minimum vocal fundamental frequency

5.MDVP:Jitter(%) , MDVP:Jitter(Abs) ,MDVP:RAP , MDVP:PPQ , Jitter:DDP
Several measures of variation in fundamental frequency

6.MDVP:Shimmer , MDVP:Shimmer(dB) , Shimmer:APQ3 , Shimmer:APQ5 , MDVP:APQ , Shimmer:DDA
Several measures of variation in amplitude

7.NHR , HNR
Two measures of ratio of noise to tonal components in the voice

8.status                              
Health status of the subject (one) - Parkinson's, (zero) - healthy

9.RPDE , D2                           
Two nonlinear dynamical complexity measures

10.DFA                                
Signal fractal scaling exponent

11.spread1 , spread2 , PPE            
Three nonlinear measures of fundamental frequency variation

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix

# Read the Data

In [2]:
df=pd.read_csv('parkinsons.csv')

In [3]:
df.head()

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 1

The target variable is the "status" column, which is an integer indicating the presence or absence of Parkinson's disease in the patient. The input variables are all the other columns except for the "name" column, which just contains the name of the patient and is not relevant for the analysis.

In [5]:
df = df.drop('name', axis=1)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   MDVP:Fo(Hz)       195 non-null    float64
 1   MDVP:Fhi(Hz)      195 non-null    float64
 2   MDVP:Flo(Hz)      195 non-null    float64
 3   MDVP:Jitter(%)    195 non-null    float64
 4   MDVP:Jitter(Abs)  195 non-null    float64
 5   MDVP:RAP          195 non-null    float64
 6   MDVP:PPQ          195 non-null    float64
 7   Jitter:DDP        195 non-null    float64
 8   MDVP:Shimmer      195 non-null    float64
 9   MDVP:Shimmer(dB)  195 non-null    float64
 10  Shimmer:APQ3      195 non-null    float64
 11  Shimmer:APQ5      195 non-null    float64
 12  MDVP:APQ          195 non-null    float64
 13  Shimmer:DDA       195 non-null    float64
 14  NHR               195 non-null    float64
 15  HNR               195 non-null    float64
 16  status            195 non-null    int64  
 1

Most of the columns are of type float64 i.e 22 of them , with one column (status) of type int64

In [7]:
df.describe()

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
count,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,...,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0
mean,154.228641,197.104918,116.324631,0.00622,4.4e-05,0.003306,0.003446,0.00992,0.029709,0.282251,...,0.046993,0.024847,21.885974,0.753846,0.498536,0.718099,-5.684397,0.22651,2.381826,0.206552
std,41.390065,91.491548,43.521413,0.004848,3.5e-05,0.002968,0.002759,0.008903,0.018857,0.194877,...,0.030459,0.040418,4.425764,0.431878,0.103942,0.055336,1.090208,0.083406,0.382799,0.090119
min,88.333,102.145,65.476,0.00168,7e-06,0.00068,0.00092,0.00204,0.00954,0.085,...,0.01364,0.00065,8.441,0.0,0.25657,0.574282,-7.964984,0.006274,1.423287,0.044539
25%,117.572,134.8625,84.291,0.00346,2e-05,0.00166,0.00186,0.004985,0.016505,0.1485,...,0.024735,0.005925,19.198,1.0,0.421306,0.674758,-6.450096,0.174351,2.099125,0.137451
50%,148.79,175.829,104.315,0.00494,3e-05,0.0025,0.00269,0.00749,0.02297,0.221,...,0.03836,0.01166,22.085,1.0,0.495954,0.722254,-5.720868,0.218885,2.361532,0.194052
75%,182.769,224.2055,140.0185,0.007365,6e-05,0.003835,0.003955,0.011505,0.037885,0.35,...,0.060795,0.02564,25.0755,1.0,0.587562,0.761881,-5.046192,0.279234,2.636456,0.25298
max,260.105,592.03,239.17,0.03316,0.00026,0.02144,0.01958,0.06433,0.11908,1.302,...,0.16942,0.31482,33.047,1.0,0.685151,0.825288,-2.434031,0.450493,3.671155,0.527367


In [8]:
df.columns

Index(['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
       'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
       'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'status', 'RPDE', 'DFA',
       'spread1', 'spread2', 'D2', 'PPE'],
      dtype='object')

In [9]:
df.shape

(195, 23)

# Find the missing values

In [10]:
df.isnull().sum()

MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64

The df.isnull().sum() method returns the number of missing values in each column of the DataFrame. In this case, all columns have zero missing values, which indicates that the data is complete and ready for processing. Therefore, we can proceed with our analysis without having to impute or drop any missing values.





# Split into Features and Target

The features are the input variables that the machine learning model uses to make predictions, while the target variable is the output variable that the model is trying to predict.

In [11]:
X = df.drop(['status'], axis=1)
y = df['status']

# Split the data

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


To evaluate the performance of our model on unseen data we will split the data into training and testing sets, where 30% of the data will be used for testing, and the remaining 70% will be used for training

#### Scalling of the variables

In [13]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()

In [14]:
X_train=sc.fit_transform(X_train)

In [15]:
X_test=sc.transform(X_test)

# SVM Classification model with Linear Kernel

In [16]:
svm_lin_model = SVC(kernel="linear", probability=True)
_ = svm_lin_model.fit(X_train, np.ravel(y_train))

In [17]:
# define the performance DataFrame
performance = pd.DataFrame(columns=['model', 'Accuracy', 'Precision', 'Recall', 'F1'])


In [18]:
model_preds = svm_lin_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"svm with linear kernel", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043


# SVM Classification model with rbf Kernel

In [19]:
svm_rbf_model = SVC(kernel="rbf", C=4, gamma='scale', probability=True)
_ = svm_rbf_model.fit(X_train, np.ravel(y_train))

In [20]:
model_preds = svm_rbf_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"svm with rbf kernel", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043
0,svm with rbf kernel,0.915254,0.897959,1.0,0.946237


# SVM Classification model with Polynomial Kernel

In [21]:
svm_poly_model = SVC(kernel="poly", degree=3, coef0=1, C=4, probability=True)
_ = svm_poly_model.fit(X_train, np.ravel(y_train))

In [22]:
model_preds = svm_poly_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

if TP+FP == 0:
    precision = 0
else:
    precision = TP/(TP+FP)

performance = pd.concat([performance, pd.DataFrame({'model':"svm with polynomial kernel", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [precision], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance


Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043
0,svm with rbf kernel,0.915254,0.897959,1.0,0.946237
0,svm with polynomial kernel,0.932203,0.916667,1.0,0.956522


# SVM Classification model with Randomized Search Kernel


In [23]:
score_measure = "accuracy"
kfolds = 3
param_grid = {'C': [1, 2, 5], 
              'gamma': [1, 0.1, 0.01],
              'kernel': ['linear','poly']} 
  
rand_search = RandomizedSearchCV(SVC(), param_grid, refit=True, verbose=3)
  
# fitting the model for randomized search
rand_search.fit(X_train, y_train)
print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_


Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END .......C=2, gamma=1, kernel=linear;, score=0.857 total time=   0.0s
[CV 2/5] END .......C=2, gamma=1, kernel=linear;, score=0.926 total time=   0.0s
[CV 3/5] END .......C=2, gamma=1, kernel=linear;, score=0.852 total time=   0.0s
[CV 4/5] END .......C=2, gamma=1, kernel=linear;, score=0.889 total time=   0.0s
[CV 5/5] END .......C=2, gamma=1, kernel=linear;, score=0.741 total time=   0.0s
[CV 1/5] END .......C=2, gamma=0.1, kernel=poly;, score=0.929 total time=   0.0s
[CV 2/5] END .......C=2, gamma=0.1, kernel=poly;, score=0.963 total time=   0.0s
[CV 3/5] END .......C=2, gamma=0.1, kernel=poly;, score=0.889 total time=   0.0s
[CV 4/5] END .......C=2, gamma=0.1, kernel=poly;, score=0.926 total time=   0.0s
[CV 5/5] END .......C=2, gamma=0.1, kernel=poly;, score=0.778 total time=   0.0s
[CV 1/5] END ....C=5, gamma=0.01, kernel=linear;, score=0.893 total time=   0.0s
[CV 2/5] END ....C=5, gamma=0.01, kernel=linear;

In [24]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Random search SVM Linear", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043
0,svm with rbf kernel,0.915254,0.897959,1.0,0.946237
0,svm with polynomial kernel,0.932203,0.916667,1.0,0.956522
0,Random search SVM Linear,0.932203,0.934783,0.977273,0.955556


# SVM Classification model with Grid Search Kernel


In [25]:
score_measure = "accuracy"
kfolds = 3
param_grid = {'C': [0.1, 1, 5], 
              'gamma': [1, 0.1, 0.01],
              'kernel': ['linear','poly']} 
  
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3)
  
# fitting the model for grid search
grid.fit(X_train, y_train)
print(f"The best {score_measure} score is {grid.best_score_}")
print(f"... with parameters: {grid.best_params_}")

bestRecallTree = grid.best_estimator_

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.929 total time=   0.0s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.963 total time=   0.0s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.852 total time=   0.0s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.889 total time=   0.0s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.778 total time=   0.0s
[CV 1/5] END .......C=0.1, gamma=1, kernel=poly;, score=0.929 total time=   0.0s
[CV 2/5] END .......C=0.1, gamma=1, kernel=poly;, score=0.926 total time=   0.0s
[CV 3/5] END .......C=0.1, gamma=1, kernel=poly;, score=0.852 total time=   0.0s
[CV 4/5] END .......C=0.1, gamma=1, kernel=poly;, score=0.963 total time=   0.0s
[CV 5/5] END .......C=0.1, gamma=1, kernel=poly;, score=0.852 total time=   0.0s
[CV 1/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.929 total time=   0.0s
[CV 2/5] END ...C=0.1, gamma=0.1, kernel=linear;

In [26]:
c_matrix = confusion_matrix(y_test, grid.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Grid search SVM Linear", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043
0,svm with rbf kernel,0.915254,0.897959,1.0,0.946237
0,svm with polynomial kernel,0.932203,0.916667,1.0,0.956522
0,Random search SVM Linear,0.932203,0.934783,0.977273,0.955556
0,Grid search SVM Linear,0.949153,0.93617,1.0,0.967033


# Logistic Regression model without Tuning

In [27]:
log_reg_model = LogisticRegression(penalty='none')
_ = log_reg_model.fit(X_train, np.ravel(y_train))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [28]:
df = pd.DataFrame({'model': ['logistic regression', 'SVM', 'decision tree'], 'accuracy': [0.8, 0.75, 0.9]})

# sort the DataFrame by the 'accuracy' column in descending order
df_sorted = df.sort_values(by='accuracy', ascending=False)

# print the sorted DataFrame
print(df_sorted)

                 model  accuracy
2        decision tree      0.90
0  logistic regression      0.80
1                  SVM      0.75


In [29]:
# model_preds = log_reg_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"default logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043
0,svm with rbf kernel,0.915254,0.897959,1.0,0.946237
0,svm with polynomial kernel,0.932203,0.916667,1.0,0.956522
0,Random search SVM Linear,0.932203,0.934783,0.977273,0.955556
0,Grid search SVM Linear,0.949153,0.93617,1.0,0.967033
0,default logistic,0.932203,0.916667,1.0,0.956522


# RandomizedSearch with Logistic Regression

In [30]:
score_measure = "accuracy"
LR=LogisticRegression()
kfolds = 5
param_grid = {'C': [0.1, 1, 10,0.001], 
              "solver" : [ 'lbfgs', 'liblinear'],
              "penalty" : ['l1','l2','lasso','elastic']} 
  
grid = RandomizedSearchCV(LR, param_grid, refit = True, verbose = 3)
  
# fitting the model for grid search
grid.fit(X_train, y_train)
print(f"The best {score_measure} score is {grid.best_score_}")
print(f"... with parameters: {grid.best_params_}")

bestRecallTree = grid.best_estimator_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END C=10, penalty=elastic, solver=liblinear;, score=nan total time=   0.0s
[CV 2/5] END C=10, penalty=elastic, solver=liblinear;, score=nan total time=   0.0s
[CV 3/5] END C=10, penalty=elastic, solver=liblinear;, score=nan total time=   0.0s
[CV 4/5] END C=10, penalty=elastic, solver=liblinear;, score=nan total time=   0.0s
[CV 5/5] END C=10, penalty=elastic, solver=liblinear;, score=nan total time=   0.0s
[CV 1/5] END C=10, penalty=l1, solver=liblinear;, score=0.750 total time=   0.0s
[CV 2/5] END C=10, penalty=l1, solver=liblinear;, score=0.926 total time=   0.0s
[CV 3/5] END C=10, penalty=l1, solver=liblinear;, score=0.815 total time=   0.0s
[CV 4/5] END C=10, penalty=l1, solver=liblinear;, score=0.889 total time=   0.0s
[CV 5/5] END C=10, penalty=l1, solver=liblinear;, score=0.741 total time=   0.0s
[CV 1/5] END C=1, penalty=lasso, solver=liblinear;, score=nan total time=   0.0s
[CV 2/5] END C=1, penalty=lasso, 

30 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\santo\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\santo\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\santo\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 441, in _check_solver
    raise ValueError(
ValueError: Logistic Regression supports only penalties in ['l1', 'l2', 'elasticnet', 'none'], got elastic.

-

# Logistic Regression with Grid Search

In [31]:
model_preds = grid.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Logistic Regression Randomised", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043
0,svm with rbf kernel,0.915254,0.897959,1.0,0.946237
0,svm with polynomial kernel,0.932203,0.916667,1.0,0.956522
0,Random search SVM Linear,0.932203,0.934783,0.977273,0.955556
0,Grid search SVM Linear,0.949153,0.93617,1.0,0.967033
0,default logistic,0.932203,0.916667,1.0,0.956522
0,Logistic Regression Randomised,0.898305,0.895833,0.977273,0.934783


In [32]:
score_measure = "accuracy"
kfolds = 5
param_grid = {'C': [0.1, 1, 10], 
              'solver' : [ 'lbfgs', 'liblinear'],
              'penalty' : ['l1','l2','lasso','elastic']} 
  
grid = GridSearchCV(LogisticRegression(), param_grid, refit = True, verbose = 3)
  
# fitting the model for grid search
grid.fit(X_train, y_train)
print(f"The best {score_measure} score is {grid.best_score_}")
print(f"... with parameters: {grid.best_params_}")

bestRecallTree = grid.best_estimator_

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END .....C=0.1, penalty=l1, solver=lbfgs;, score=nan total time=   0.0s
[CV 2/5] END .....C=0.1, penalty=l1, solver=lbfgs;, score=nan total time=   0.0s
[CV 3/5] END .....C=0.1, penalty=l1, solver=lbfgs;, score=nan total time=   0.0s
[CV 4/5] END .....C=0.1, penalty=l1, solver=lbfgs;, score=nan total time=   0.0s
[CV 5/5] END .....C=0.1, penalty=l1, solver=lbfgs;, score=nan total time=   0.0s
[CV 1/5] END C=0.1, penalty=l1, solver=liblinear;, score=0.857 total time=   0.0s
[CV 2/5] END C=0.1, penalty=l1, solver=liblinear;, score=0.889 total time=   0.0s
[CV 3/5] END C=0.1, penalty=l1, solver=liblinear;, score=0.815 total time=   0.0s
[CV 4/5] END C=0.1, penalty=l1, solver=liblinear;, score=0.852 total time=   0.0s
[CV 5/5] END C=0.1, penalty=l1, solver=liblinear;, score=0.741 total time=   0.0s
[CV 1/5] END ...C=0.1, penalty=l2, solver=lbfgs;, score=0.857 total time=   0.0s
[CV 2/5] END ...C=0.1, penalty=l2, solver=

75 fits failed out of a total of 120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\santo\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\santo\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\santo\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

---------------------------

The best accuracy score is 0.8537037037037039
... with parameters: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}


In [33]:
model_preds = grid.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Logistic Regression Grid", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043
0,svm with rbf kernel,0.915254,0.897959,1.0,0.946237
0,svm with polynomial kernel,0.932203,0.916667,1.0,0.956522
0,Random search SVM Linear,0.932203,0.934783,0.977273,0.955556
0,Grid search SVM Linear,0.949153,0.93617,1.0,0.967033
0,default logistic,0.932203,0.916667,1.0,0.956522
0,Logistic Regression Randomised,0.898305,0.895833,0.977273,0.934783
0,Logistic Regression Grid,0.898305,0.895833,0.977273,0.934783


# Decision tree model using the randomsearch

In [34]:
score_measure = "accuracy"
kfolds = 3

param_grid = {
    'min_samples_split': np.arange(1,60),  
    'min_samples_leaf': np.arange(1,50),
    'min_impurity_decrease': np.arange(0.0001, 0.01, 0.0005),
    'max_leaf_nodes': np.arange(5, 200), 
    'max_depth': np.arange(1,50), 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=100,
                           scoring=score_measure, verbose=1, n_jobs=-1, return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_

Fitting 3 folds for each of 100 candidates, totalling 300 fits
The best accuracy score is 0.8676328502415459
... with parameters: {'min_samples_split': 10, 'min_samples_leaf': 7, 'min_impurity_decrease': 0.0081, 'max_leaf_nodes': 199, 'max_depth': 25, 'criterion': 'entropy'}


9 fits failed out of a total of 300.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
9 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\santo\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\santo\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 937, in fit
    super().fit(
  File "C:\Users\santo\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 250, in fit
    raise ValueError(
ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

 0.75732689 0.71980676 0.75732689 0.66956522 0.71239936 0.72028986
 0.80901771 

In [35]:
c_matrix = confusion_matrix(y_test, rand_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Decision tree random search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043
0,svm with rbf kernel,0.915254,0.897959,1.0,0.946237
0,svm with polynomial kernel,0.932203,0.916667,1.0,0.956522
0,Random search SVM Linear,0.932203,0.934783,0.977273,0.955556
0,Grid search SVM Linear,0.949153,0.93617,1.0,0.967033
0,default logistic,0.932203,0.916667,1.0,0.956522
0,Logistic Regression Randomised,0.898305,0.895833,0.977273,0.934783
0,Logistic Regression Grid,0.898305,0.895833,0.977273,0.934783
0,Decision tree random search,0.864407,0.909091,0.909091,0.909091


# Decision tree model using the Gridsearch

In [36]:
score_measure = "accuracy"
kfolds = 3

param_grid = {
    'min_samples_split': np.arange(25,32),  
    'min_samples_leaf': np.arange(3,6),
    'min_impurity_decrease': np.arange(0.0001, 0.0004, 0.0001),
    'max_leaf_nodes': np.arange(194,200), 
    'max_depth': np.arange(15,21), 
    'criterion': ['entropy'],
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_

Fitting 3 folds for each of 2268 candidates, totalling 6804 fits
The best accuracy score is 0.8380032206119163
... with parameters: {'criterion': 'entropy', 'max_depth': 15, 'max_leaf_nodes': 194, 'min_impurity_decrease': 0.0001, 'min_samples_leaf': 5, 'min_samples_split': 27}


In [37]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

if TP+FP == 0:
    precision = 0  # or precision = np.nan
else:
    precision = TP/(TP+FP)

performance = pd.concat([performance, pd.DataFrame({'model':"Grid search DT", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [precision], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

performance


Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043
0,svm with rbf kernel,0.915254,0.897959,1.0,0.946237
0,svm with polynomial kernel,0.932203,0.916667,1.0,0.956522
0,Random search SVM Linear,0.932203,0.934783,0.977273,0.955556
0,Grid search SVM Linear,0.949153,0.93617,1.0,0.967033
0,default logistic,0.932203,0.916667,1.0,0.956522
0,Logistic Regression Randomised,0.898305,0.895833,0.977273,0.934783
0,Logistic Regression Grid,0.898305,0.895833,0.977273,0.934783
0,Decision tree random search,0.864407,0.909091,0.909091,0.909091
0,Grid search DT,0.830508,0.904762,0.863636,0.883721


In [38]:
df = pd.DataFrame({'model': ['logistic regression', 'SVM', 'decision tree'], 'accuracy': [0.8, 0.75, 0.9]})

# sort the DataFrame by the 'accuracy' column in descending order
df_sorted = df.sort_values(by='accuracy', ascending=False)

# print the sorted DataFrame
print(df_sorted)

                 model  accuracy
2        decision tree      0.90
0  logistic regression      0.80
1                  SVM      0.75


In [39]:
performance.sort_values(by =['Accuracy'])

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Grid search DT,0.830508,0.904762,0.863636,0.883721
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043
0,Decision tree random search,0.864407,0.909091,0.909091,0.909091
0,Logistic Regression Randomised,0.898305,0.895833,0.977273,0.934783
0,Logistic Regression Grid,0.898305,0.895833,0.977273,0.934783
0,svm with rbf kernel,0.915254,0.897959,1.0,0.946237
0,svm with polynomial kernel,0.932203,0.916667,1.0,0.956522
0,Random search SVM Linear,0.932203,0.934783,0.977273,0.955556
0,default logistic,0.932203,0.916667,1.0,0.956522
0,Grid search SVM Linear,0.949153,0.93617,1.0,0.967033


# Neural Network with MLPClassifer

In [40]:
from __future__ import print_function
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import matplotlib.pyplot as plt

from sklearn import datasets
import pandas as pd

np.random.seed(1)

In [41]:
%%time

ann = MLPClassifier(hidden_layer_sizes=(60,50,40), solver='adam', max_iter=200)
_ = ann.fit(X_train, y_train)

Wall time: 994 ms


In [42]:
%%time
y_pred = ann.predict(X_test)

Wall time: 2.11 ms


In [43]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.80      0.86        15
           1       0.93      0.98      0.96        44

    accuracy                           0.93        59
   macro avg       0.93      0.89      0.91        59
weighted avg       0.93      0.93      0.93        59



# With RandomizedSearchCV

In [44]:
%%time

score_measure = "accuracy"
kfolds = 5

param_grid = {
    'hidden_layer_sizes': [ (50,), (70,),(50,30), (40,20), (60,40, 20), (70,50,40)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['adam', 'sgd'],
    'alpha': [0, .2, .5, .7, 1],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'learning_rate_init': [0.001, 0.01, 0.1, 0.2, 0.5],
    'max_iter': [5000]
}

ann = MLPClassifier()
grid_search = RandomizedSearchCV(estimator = ann, param_distributions=param_grid, cv=kfolds, n_iter=100,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

bestRecallTree = grid_search.best_estimator_

print(grid_search.best_params_)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
{'solver': 'adam', 'max_iter': 5000, 'learning_rate_init': 0.001, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (60, 40, 20), 'alpha': 1, 'activation': 'tanh'}
Wall time: 36.6 s


In [45]:
%%time
y_pred = bestRecallTree.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.87      0.93        15
           1       0.96      1.00      0.98        44

    accuracy                           0.97        59
   macro avg       0.98      0.93      0.95        59
weighted avg       0.97      0.97      0.97        59

Wall time: 8.12 ms


In [46]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

if TP+FP == 0:
    precision = 0  # or precision = np.nan
else:
    precision = TP/(TP+FP)

performance = pd.concat([performance, pd.DataFrame({'model':"Neural Network Randomized search DT", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [precision], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043
0,svm with rbf kernel,0.915254,0.897959,1.0,0.946237
0,svm with polynomial kernel,0.932203,0.916667,1.0,0.956522
0,Random search SVM Linear,0.932203,0.934783,0.977273,0.955556
0,Grid search SVM Linear,0.949153,0.93617,1.0,0.967033
0,default logistic,0.932203,0.916667,1.0,0.956522
0,Logistic Regression Randomised,0.898305,0.895833,0.977273,0.934783
0,Logistic Regression Grid,0.898305,0.895833,0.977273,0.934783
0,Decision tree random search,0.864407,0.909091,0.909091,0.909091
0,Grid search DT,0.830508,0.904762,0.863636,0.883721


# With GridSearchCV

In [47]:
%%time

score_measure = "accuracy"
kfolds = 5

param_grid = {
    'hidden_layer_sizes': [ (30,), (50,), (70,), (90,)],
    'activation': ['tanh', 'relu'],
    'solver': ['adam'],
    'alpha': [.5, .7, 1],
    'learning_rate': ['adaptive', 'invscaling'],
    'learning_rate_init': [0.005, 0.01, 0.15],
    'max_iter': [5000]
}

ann = MLPClassifier()
grid_search = GridSearchCV(estimator = ann, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

bestRecallTree = grid_search.best_estimator_

print(grid_search.best_params_)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
{'activation': 'tanh', 'alpha': 0.5, 'hidden_layer_sizes': (70,), 'learning_rate': 'adaptive', 'learning_rate_init': 0.15, 'max_iter': 5000, 'solver': 'adam'}
Wall time: 23.1 s


In [48]:
%%time
y_pred = bestRecallTree.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.87      0.90        15
           1       0.96      0.98      0.97        44

    accuracy                           0.95        59
   macro avg       0.94      0.92      0.93        59
weighted avg       0.95      0.95      0.95        59

Wall time: 12.6 ms


In [49]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]

if TP+FP == 0:
    precision = 0  # or precision = np.nan
else:
    precision = TP/(TP+FP)

performance = pd.concat([performance, pd.DataFrame({'model':"Neural Network Grid search DT", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [precision], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with linear kernel,0.864407,0.875,0.954545,0.913043
0,svm with rbf kernel,0.915254,0.897959,1.0,0.946237
0,svm with polynomial kernel,0.932203,0.916667,1.0,0.956522
0,Random search SVM Linear,0.932203,0.934783,0.977273,0.955556
0,Grid search SVM Linear,0.949153,0.93617,1.0,0.967033
0,default logistic,0.932203,0.916667,1.0,0.956522
0,Logistic Regression Randomised,0.898305,0.895833,0.977273,0.934783
0,Logistic Regression Grid,0.898305,0.895833,0.977273,0.934783
0,Decision tree random search,0.864407,0.909091,0.909091,0.909091
0,Grid search DT,0.830508,0.904762,0.863636,0.883721


# Using Keras

# Deep Network

In [50]:
import tensorflow as tf
from tensorflow import keras

# fix random seed for reproducibility
np.random.seed(1)
tf.random.set_seed(1)

In [51]:
X_train.shape

(136, 22)

In [52]:
%%time

# create model stucture
model = keras.models.Sequential()
model.add(keras.layers.Input(shape=22))
model.add(keras.layers.Dense(22, activation='relu'))
model.add(keras.layers.Dense(22, activation='relu'))
model.add(keras.layers.Dense(22, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid')) # final layer, 10 categories


# compile
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# if you want to overide the defaults for the optimizer....
#adam = keras.optimizers.Adam(learning_rate=0.01)
#model.compile(loss='sparse_categorical_crossentropy', optimizer=adam, metrics=['accuracy'])


Wall time: 112 ms


In [53]:
%%time

# fit the model
history = model.fit(X_train, y_train, 
                    validation_data=(X_test, y_test), 
                    epochs=20, batch_size=100)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Wall time: 3.21 s


In [54]:
# evaluate the model

scores = model.evaluate(X_test, y_test, verbose=0)
scores
# In results, first is loss, second is accuracy

[0.5043603777885437, 0.8305084705352783]

In [55]:
# let's format this into a better output...

print("%s: %.2f" % (model.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

loss: 0.50
accuracy: 83.05%


# Wide and Deep Network

In [56]:
#Define the model: for multi-class

model = keras.models.Sequential()

model.add(keras.layers.Input(shape=22))
model.add(keras.layers.Dense(50, activation='relu'))
model.add(keras.layers.Dense(50, activation='relu'))
model.add(keras.layers.Dense(50, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))

In [57]:
# Compile model

#Optimizer:
adam = keras.optimizers.Adam(learning_rate=0.01)
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])

In [58]:
# Fit the model

history = model.fit(X_train, y_train, 
                    validation_data=(X_test, y_test), 
                    epochs=20, batch_size=100)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [59]:
# evaluate the model

scores = model.evaluate(X_test, y_test, verbose=0)
scores

# In results, first is loss, second is accuracy

[0.2870670258998871, 0.9152542352676392]

In [60]:
# extract the accuracy from model.evaluate

print("%s: %.2f" % (model.metrics_names[0], scores[0]))
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

loss: 0.29
accuracy: 91.53%


# RandomGridSearch with kernal initializer

In [61]:
!pip install scikeras



In [62]:
%%time

# If you don't have the following installed, from command line '!pip install scikeras'
from scikeras.wrappers import KerasClassifier
from keras.initializers import GlorotNormal

score_measure = "accuracy"
kfolds = 5

def build_clf(hidden_layer_sizes, dropout):
    ann = tf.keras.models.Sequential()
    ann.add(keras.layers.Input(shape=22)),
    for hidden_layer_size in hidden_layer_sizes:
        model.add(keras.layers.Dense(hidden_layer_size, kernel_initializer= tf.keras.initializers.GlorotUniform(), 
                                     bias_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=None), activation="relu"))
        model.add(keras.layers.Dropout(dropout))
    ann.add(tf.keras.layers.Dense(1, activation='sigmoid'))
    ann.compile(loss = 'binary_crossentropy', metrics = ['accuracy'])
    return ann


Wall time: 39.7 ms


In [63]:
from scikeras.wrappers import KerasClassifier

keras_clf = KerasClassifier(
    model=build_clf,
    hidden_layer_sizes=64,
    dropout = 0.0
)


In [64]:
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import RandomizedSearchCV

params = {
    'optimizer__learning_rate': [0.0005, 0.001, 0.005],
    'model__hidden_layer_sizes': [(70,),(90, ), (100,), (100, 90)],
    'model__dropout': [0, 0.1],
    'batch_size':[20, 60, 100],
    'epochs':[10, 50, 100],
    'optimizer':["adam",'sgd']
}
keras_clf.get_params().keys()



dict_keys(['model', 'build_fn', 'warm_start', 'random_state', 'optimizer', 'loss', 'metrics', 'batch_size', 'validation_batch_size', 'verbose', 'callbacks', 'validation_split', 'shuffle', 'run_eagerly', 'epochs', 'hidden_layer_sizes', 'dropout', 'class_weight'])

In [65]:
rnd_search_cv = RandomizedSearchCV(estimator=keras_clf, param_distributions=params, scoring='accuracy', n_iter=50, cv=5)

import sys
sys.setrecursionlimit(10000) # note: the default is 3000 (python 3.9)

earlystop = EarlyStopping(monitor='val_loss', patience=5, verbose=0, mode='auto')
callback = [earlystop]

_ = rnd_search_cv.fit(X_train, y_train, callbacks=callback, verbose=0)




In [66]:
rnd_search_cv.best_params_

{'optimizer__learning_rate': 0.005,
 'optimizer': 'sgd',
 'model__hidden_layer_sizes': (100, 90),
 'model__dropout': 0,
 'epochs': 100,
 'batch_size': 20}

In [67]:
best_net = rnd_search_cv.best_estimator_
print(rnd_search_cv.best_params_)

{'optimizer__learning_rate': 0.005, 'optimizer': 'sgd', 'model__hidden_layer_sizes': (100, 90), 'model__dropout': 0, 'epochs': 100, 'batch_size': 20}


In [68]:
%%time
y_pred = best_net.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.69      0.73      0.71        15
           1       0.91      0.89      0.90        44

    accuracy                           0.85        59
   macro avg       0.80      0.81      0.80        59
weighted avg       0.85      0.85      0.85        59

Wall time: 151 ms


# Analysis

The Parkinson's disease dataset was used to train several predictive models, including SVM models with linear, rbf, and polynomial kernels, as well as logistic regression and decision tree models.



Among the SVM models, the Random search SVM Linear model and Grid search SVM Linear model had the highest accuracy, both achieving 94.9%. The polynomial kernel SVM model also performed well, achieving an accuracy of 93.2%.



For logistic regression, the model without tuning had an accuracy of 93.2%, while the Randomized and Grid search models had accuracies of 89.8% and 89.8%, respectively. The decision tree models had lower accuracies, with the best models achieving accuracies of 86.4% and 83.1% for the Random search and Grid search models, respectively.



The MLPClassifier models outperformed all of the other models, with the Randomized search model achieving the highest accuracy at 96.6% and the Grid search model achieving an accuracy of 94.9%. The Keras models also performed well, with the Wide and Deep Network achieving an accuracy of 91% and the Deep Network achieving an accuracy of 83%.



Overall, the MLPClassifier models showed the best performance, followed by the SVM models and the Keras models. While the decision tree models had lower accuracies, they still showed some promise and could potentially be improved with further tuning.

However, it's important to note that the performance of these models may vary depending on the specific features and characteristics of the dataset being used. It's also possible that with further experimentation and tuning of hyperparameters, other models could outperform the ones listed here.
