# Introduction
---

This notebook compares three common classification models (Logistic Regression, KNN, and Decision Tree), covering pre-processing and hyperparameter tuning to get the best result from each.

**In this dataset, the classes are imbalanced: most patients are diagnosed as negative.** A model trained on it tends to become biased toward predicting negative diabetes.

Therefore, the goal for this dataset is to **create a model with high sensitivity (recall)**, that is, one that correctly detects the patients who are positive for diabetes, in order to counter that bias. Such a model is preferable to one with high accuracy but low sensitivity: accuracy is computed over the whole dataset, so a model can score well by mostly predicting negative, whereas sensitivity is computed only over the patients who actually have diabetes.
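
To make the difference between accuracy and sensitivity concrete, here is a minimal sketch with made-up labels (65 negative, 35 positive; these counts are an assumption for illustration, not the dataset's real counts) and a degenerate "model" that always predicts negative.

```python
# Hypothetical labels: 65 negative (0) and 35 positive (1) patients.
y_true = [0] * 65 + [1] * 35

# A degenerate "model" that always predicts negative.
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)   # share of all patients labelled correctly
sensitivity = tp / (tp + fn)         # share of actual positives that were found

print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}")
# prints: accuracy=0.65, sensitivity=0.00
```

The model looks acceptable on accuracy while missing every diabetic patient, which is why sensitivity is the metric to watch here.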

Table of contents:
- <a href='#1'>1. EDA</a>
- <a href='#2'>2. Raw Data ML Modeling & Evaluation for Benchmark</a>
- <a href='#3'>3. Pre-Processing</a>
    - <a href='#3.1'>3.1 Dealing with Outliers</a>
    - <a href='#3.2'>3.2 Dealing with Normal and Skewed Distribution</a>
    - <a href='#3.3'>3.3 Dealing with Different Units Measurement</a>
- <a href='#4'>4. Pre-Processed Data ML Modeling & Evaluation</a>
- <a href='#5'>5. Hypertuning Parameters with Grid & Random Search</a>
- <a href='#6'>6. Evaluation</a>


Dataset originally from https://www.kaggle.com/uciml/pima-indians-diabetes-database

# Data Import & Collection
---

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy import stats
from scipy.stats import norm, skew
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')
df.head()

# EDA<a id='1'></a>
---

## Descriptive Analysis

**Dataset Information:**
> - Pregnancies: Number of times pregnant
> - Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
> - Blood Pressure: Diastolic blood pressure (mm Hg)
> - Skin Thickness: Triceps skin fold thickness (mm)
> - Insulin: 2-Hour serum insulin (mu U/ml)
> - BMI: Body mass index (weight in kg/(height in m)^2)
> - Diabetes pedigree function: A score for the likelihood of diabetes based on family history
> - Age: Years
> - Outcome: 1 is True, 0 is False

### Column, NULL Values, DTypes
Check whether the non-null count of each column matches the number of rows, and whether any column has an unexpected dtype.

In [None]:
df.info()

### Statistical Summary

In [None]:
df.describe()

## Descriptive Analysis Summary
> - According to `df.info()` above, there are no null values to deal with in pre-processing.
> - The dtype of every feature makes sense, so no conversion is needed in pre-processing.
> - Based on the statistical summary above, some records have 0 Glucose, 0 Blood Pressure, or 0 Skin Thickness. These are most likely data-entry errors or missing values recorded as 0.
> - There are some extreme values, for example a record with 17 pregnancies. Several other features also show outliers, which will be handled in the pre-processing step.
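
The implausible zeros mentioned above can be counted, and optionally marked as missing, along these lines. This is only a sketch on a tiny made-up frame that reuses the notebook's column names; the notebook itself does not treat the zeros, and the values below are illustrative.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame using the notebook's column names (values are illustrative only).
demo = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "SkinThickness": [35, 29, 0, 23],
})

# Columns where a value of 0 is not physiologically plausible.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness"]

# Count the zeros per column before deciding how to treat them.
zero_counts = (demo[zero_as_missing] == 0).sum()
print(zero_counts)

# One option: mark the zeros as missing so they can be imputed or dropped later.
cleaned = demo.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isna().sum())
```

On the real data the same two steps would run on `df` instead of `demo`.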

## Univariate Analysis

Let's analyze data distribution of each column individually!

In [None]:
numericals = ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']
target = ['Outcome']

In [None]:
plt.figure(figsize=(16,4))
for i in range(0, len(numericals)):
    plt.subplot(1,8, i+1)
    sns.boxplot(x=df[numericals[i]])
    plt.tight_layout()

In [None]:
plt.figure(figsize=(16,10))
for i in range(0, len(numericals)):
    plt.subplot(2,4, i+1)
    sns.distplot(df[numericals[i]])
    plt.tight_layout()

In [None]:
count_classes = df['Outcome'].value_counts(sort=True)  # Series.value_counts replaces the deprecated pd.value_counts
count_classes.plot(kind='bar', rot=0)
plt.title('Diabetes Outcome Distribution')
plt.xlabel('Class')
plt.ylabel('Frequency')

### Univariate Analysis Summary
> - According to the boxplots, every independent feature has outliers: some have many, others only a few.
> - According to the distribution plots, only 2 features look roughly normal, while the others are skewed. This affects how we remove outliers later in pre-processing.
> - According to the bar plot above, the target is imbalanced toward the non-diabetic class. This can distort the model evaluation, so during pre-processing we can use oversampling to balance the data for modeling.

# Multivariate Analysis
---

Let's analyze how strongly the columns relate to each other using a correlation heatmap and pairplots!

## Correlation & Heatmap

In [None]:
sns.heatmap(df.corr(), annot=True,  fmt='.2f')
plt.show()

## Pairplot

In [None]:
sns.pairplot(df, diag_kind='kde')

> ## Feature Categorization
> 
> Based on the summaries above, all independent features will be selected and treated as numerical.

## Pairplot + Hue

In [None]:
sns.pairplot(df, diag_kind='kde', hue='Outcome')

## Multivariate Analysis Summary
> - According to the heatmap, no pair of independent features is strongly correlated, so all features will be used in modeling.
> - According to the pairplot, the two outcome classes overlap heavily in every feature, with only a faint pattern separating them.
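
Reading a heatmap by eye can be backed up with a small check that lists feature pairs above a chosen correlation cut-off. The sketch below uses synthetic columns, and the 0.8 threshold is an assumption rather than a value taken from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a * 0.9 + rng.normal(scale=0.1, size=200),  # built to correlate strongly with "a"
    "c": rng.normal(size=200),                        # independent noise
})

corr = demo.corr().abs()
# Keep only the upper triangle so each pair is listed once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # assumed cut-off for a "strong" relationship
strong_pairs = upper.stack()
strong_pairs = strong_pairs[strong_pairs > threshold]
print(strong_pairs)
```

Run on `df[numericals]`, an empty result would support keeping every feature.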

- - -
# Modeling w/o Pre-processing (Raw Dataset)

Develop the first ML models without any pre-processing. The goal is a benchmark for comparing model performance on raw data against pre-processed data later.

## Split Train Test Data

In [None]:
X = df[numericals]
y = df[target[0]]  # 1-D Series; a one-column DataFrame triggers a shape warning in scikit-learn

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state=0) #Using 70-30 Rule
X_train.shape

## Logistic Regression (Raw Data)

### Fit & Predict

In [None]:
from sklearn.linear_model import LogisticRegression

logReg = LogisticRegression(random_state=0, max_iter=400)
logReg.fit(X_train, y_train)
y_predicted = logReg.predict(X_test)
y_predicted_proba = logReg.predict_proba(X_test)

### Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, recall_score,  roc_auc_score, precision_score
print('\nconfusion matrix') # generate the confusion matrix
print(confusion_matrix(y_test, y_predicted))

print('\n======================\nClassification Report:\n') # generate the classification report
print(classification_report(y_test, y_predicted))

## Logistic Regression Evaluation

> - We're off to a good start with an overall accuracy of 0.78 for Logistic Regression. However, as described in the introduction, the data is skewed toward negative diabetes, so accuracy alone is a misleading measure.
> - Looking at recall: of the patients who actually have diabetes, the model labels only 0.53 as positive, while of the patients who are actually negative it labels 0.90 as negative. The model still struggles to identify actual diabetes patients, which supports the assumption that the imbalance biases it toward negative diabetes.
> - Looking at precision: 0.80 of the negative predictions are actually negative, and 0.71 of the positive predictions are actually positive. Precision is reasonably good for both classes.
> - The F1-score is 0.85 for negative diabetes but still low at 0.60 for positive diabetes.
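
For readers less familiar with these metrics, the sketch below derives them by hand from a confusion matrix in scikit-learn's layout. The counts are hypothetical and are not this notebook's results.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows = actual class, columns = predicted class, order [negative, positive].
cm = np.array([[90, 10],
               [20, 30]])

tn, fp, fn, tp = cm.ravel()

recall_pos = tp / (tp + fn)      # sensitivity: actual positives that were caught
recall_neg = tn / (tn + fp)      # specificity: actual negatives that were caught
precision_pos = tp / (tp + fp)   # positive predictions that were right
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
accuracy = (tp + tn) / cm.sum()  # correct predictions over all patients

print(f"recall(+)={recall_pos:.2f} recall(-)={recall_neg:.2f} "
      f"precision(+)={precision_pos:.2f} f1(+)={f1_pos:.2f} accuracy={accuracy:.2f}")
# prints: recall(+)=0.60 recall(-)=0.90 precision(+)=0.75 f1(+)=0.67 accuracy=0.80
```

Note how accuracy (0.80) sits well above the positive-class recall (0.60), the same pattern seen in the evaluation above.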

## KNN (Raw Data)

### Fit & Predict

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5) #using default value
knn.fit(X_train, y_train)
y_predicted = knn.predict(X_test)

### Evaluation

In [None]:
print('\nconfusion matrix') # generate the confusion matrix
print(confusion_matrix(y_test, y_predicted))


print('\n======================\nClassification Report:\n') # generate the classification report
print(classification_report(y_test, y_predicted))

## KNN Evaluation
> - We're off to a good start with an overall accuracy of 0.78. However, as before, the data is skewed toward negative diabetes, so accuracy alone is a misleading measure.
> - Looking at recall: of the patients who actually have diabetes, the model labels only 0.53 as positive, while of the patients who are actually negative it labels 0.85 as negative. The model still struggles to identify actual diabetes patients, which supports the assumption that the imbalance biases it toward negative diabetes.
> - Looking at precision: 0.80 of the negative predictions are actually negative, and 0.63 of the positive predictions are actually positive. Compared with Logistic Regression, KNN has worse precision.
> - The F1-score of this KNN model is lower than that of Logistic Regression for both positive and negative diabetes.

## DTree (Gini as default) 

### Fit & Predict

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=0) # default criterion is Gini impurity
dt.fit(X_train, y_train)
y_predicted = dt.predict(X_test)

### Evaluation

In [None]:
print('\nconfusion matrix') # generate the confusion matrix
print(confusion_matrix(y_test, y_predicted))

print('\n======================\nClassification Report:\n') # generate the classification report
print(classification_report(y_test, y_predicted))

## Decision Tree (Gini) Evaluation
> - The model reaches an overall accuracy of 0.72, which is lower than the other 2 models. As before, the data is skewed toward negative diabetes, so accuracy alone is a misleading measure.
> - On recall, the model is as poor as the other 2 models at identifying actual positive diabetes and only does well on actual negative diabetes.
> - On precision, the model is also poor for positive predictions but good for negative predictions.
> - On F1-score, this DTree model scores poorly for positive diabetes and well only for negative diabetes.

- - -
# Pre-Processing

In the pre-processing part, we're going to use a few techniques, such as checking for missing and duplicated data, removing outliers, standardization/normalization, feature encoding, and over/undersampling.

## Dealing with Outliers

We use the boxplots from the univariate analysis to spot outliers visually, then remove them feature by feature below.

In [None]:
from scipy.stats import norm
df_p1 = df.copy() # df_p1 keeps the pre-processed data separate from the raw df

In [None]:
ax = sns.distplot(df_p1['Pregnancies'], color="y")
plt.title('skew: {}'.format(skew(df_p1['Pregnancies'])))

In [None]:
#Using IQR since pregnancies data is skewed.
Q1 = df_p1['Pregnancies'].quantile(0.25)
Q3 = df_p1['Pregnancies'].quantile(0.75)
IQR = Q3-Q1
low_limit = Q1 - (1.5 * IQR)
high_limit = Q3 + (1.5 * IQR)
filtered_entries = ((df_p1['Pregnancies'] >= low_limit) & (df_p1['Pregnancies'] <= high_limit))
df_p1 = df_p1[filtered_entries]
print('Q1=',Q1,'Q3=',Q3,'IQR=',IQR,'low_limit=',low_limit,'high_limit=',high_limit)

#plot the new data after outliers removed.
ax = sns.distplot(df_p1['Pregnancies'], color="y")
plt.title('skew: {}'.format(skew(df_p1['Pregnancies'])))

In [None]:
ax = sns.distplot(df_p1['Glucose'], color="y")
plt.title('skew: {}'.format(skew(df_p1['Glucose'])))

In [None]:
# using Z-Score as the data distribution is normal
from scipy import stats
z_scores = np.abs(stats.zscore(df_p1['Glucose']))
filtered_entries = (z_scores < 3)
df_p1 = df_p1[filtered_entries]

ax = sns.distplot(df_p1['Glucose'], color="y")
plt.title('skew: {}'.format(skew(df_p1['Glucose'])))

In [None]:
ax = sns.distplot(df_p1['BloodPressure'], color="y")
plt.title('skew: {}'.format(skew(df_p1['BloodPressure'])))

In [None]:
# using Z-Score as the data distribution is normal
from scipy import stats
z_scores = np.abs(stats.zscore(df_p1['BloodPressure']))
filtered_entries = (z_scores < 3)
df_p1 = df_p1[filtered_entries]
ax = sns.distplot(df_p1['BloodPressure'], color="y")
plt.title('skew: {}'.format(skew(df_p1['BloodPressure'])))

In [None]:
ax = sns.distplot(df_p1['SkinThickness'], color="y")
plt.title('skew: {}'.format(skew(df_p1['SkinThickness'])))

In [None]:
# using Z-Score as the data distribution is normal
from scipy import stats
z_scores = np.abs(stats.zscore(df_p1['SkinThickness']))
filtered_entries = (z_scores < 3)
df_p1 = df_p1[filtered_entries]
ax = sns.distplot(df_p1['SkinThickness'], color="y")
plt.title('skew: {}'.format(skew(df_p1['SkinThickness'])))

In [None]:
ax = sns.distplot(df_p1['Insulin'], color="y")
plt.title('skew: {}'.format(skew(df_p1['Insulin'])))

In [None]:
# Using IQR as data distribution is skewed
Q1 = df_p1['Insulin'].quantile(0.25)
Q3 = df_p1['Insulin'].quantile(0.75)
IQR = Q3-Q1
low_limit = Q1 - (1.5 * IQR)
high_limit = Q3 + (1.5 * IQR)
print('Q1=',Q1,'Q3=',Q3,'IQR=',IQR,'low_limit=',low_limit,'high_limit=',high_limit)
filtered_entries = ((df_p1['Insulin'] >= low_limit) & (df_p1['Insulin'] <= high_limit))
df_p1 = df_p1[filtered_entries]
ax = sns.distplot(df_p1['Insulin'], color="y")
plt.title('skew: {}'.format(skew(df_p1['Insulin'])))

In [None]:
ax = sns.distplot(df_p1['BMI'], color="y")
plt.title('skew: {}'.format(skew(df_p1['BMI'])))

In [None]:
# using Z-Score as the data distribution is normal
from scipy import stats
z_scores = np.abs(stats.zscore(df_p1['BMI']))
filtered_entries = (z_scores < 3)
df_p1 = df_p1[filtered_entries]
ax = sns.distplot(df_p1['BMI'], color="y")
plt.title('skew: {}'.format(skew(df_p1['BMI'])))

In [None]:
ax = sns.distplot(df_p1['DiabetesPedigreeFunction'], color="y")
plt.title('skew: {}'.format(skew(df_p1['DiabetesPedigreeFunction'])))

In [None]:
# Using IQR as data distribution is skewed
Q1 = df_p1['DiabetesPedigreeFunction'].quantile(0.25)
Q3 = df_p1['DiabetesPedigreeFunction'].quantile(0.75)
IQR = Q3-Q1
low_limit = Q1 - (1.5 * IQR)
high_limit = Q3 + (1.5 * IQR)
print('Q1=',Q1,'Q3=',Q3,'IQR=',IQR,'low_limit=',low_limit,'high_limit=',high_limit)
filtered_entries = ((df_p1['DiabetesPedigreeFunction'] >= low_limit) & (df_p1['DiabetesPedigreeFunction'] <= high_limit))
df_p1 = df_p1[filtered_entries]
ax = sns.distplot(df_p1['DiabetesPedigreeFunction'], color="y")
plt.title('skew: {}'.format(skew(df_p1['DiabetesPedigreeFunction'])))

In [None]:
ax = sns.distplot(df_p1['Age'], color="y")
plt.title('skew: {}'.format(skew(df_p1['Age'])))

In [None]:
# Using IQR as data distribution is skewed
Q1 = df_p1['Age'].quantile(0.25)
Q3 = df_p1['Age'].quantile(0.75)
IQR = Q3-Q1
low_limit = Q1 - (1.5 * IQR)
high_limit = Q3 + (1.5 * IQR)
print('Q1=',Q1,'Q3=',Q3,'IQR=',IQR,'low_limit=',low_limit,'high_limit=',high_limit)
filtered_entries = ((df_p1['Age'] >= low_limit) & (df_p1['Age'] <= high_limit))
df_p1 = df_p1[filtered_entries]
ax = sns.distplot(df_p1['Age'], color="y")
plt.title('skew: {}'.format(skew(df_p1['Age'])))

## Outliers Detection Summary
> About 126 outliers were detected and removed from the dataset.
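
The per-feature cells above repeat the same two rules (IQR for skewed features, z-score for roughly normal ones). One way to express them once is sketched below. The helper names `iqr_mask` and `zscore_mask` are hypothetical, and the demo frame is made up; the z-score uses the population standard deviation (`ddof=0`), as `scipy.stats.zscore` does by default.

```python
import pandas as pd

def iqr_mask(series, k=1.5):
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

def zscore_mask(series, limit=3.0):
    """True for values within `limit` population standard deviations of the mean."""
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() < limit

# Made-up data: one skewed column with an obvious outlier, one roughly symmetric column.
demo = pd.DataFrame({
    "skewed": [1, 2, 2, 3, 3, 3, 4, 4, 5, 40],
    "normalish": [10, 11, 9, 10, 12, 8, 10, 11, 9, 10],
})

# Apply the rules one feature at a time, as the notebook does.
rules = {"skewed": iqr_mask, "normalish": zscore_mask}
filtered = demo
for column, rule in rules.items():
    filtered = filtered[rule(filtered[column])]

print(len(demo), "->", len(filtered))
# prints: 10 -> 9
```

Because the filters run in sequence, each later rule is computed on the rows that survived the earlier ones, which matches the cell-by-cell behavior above.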

In [None]:
plt.figure(figsize=(16,10))
for i in range(0, len(numericals)):
    plt.subplot(2,4, i+1)
    sns.distplot(df[numericals[i]], color="b")
    plt.tight_layout()
    plt.title('Before Outliers Removal')

In [None]:
plt.figure(figsize=(16,10))
for i in range(0, len(numericals)):
    plt.subplot(2,4, i+1)
    sns.distplot(df_p1[numericals[i]], color="y")
    plt.tight_layout()
    plt.title('after Outliers Removal')

In [None]:
print(df.shape)
print(df_p1.shape)

## Oversampling for Imbalanced Data

To deal with the majority of the data falling in the negative diabetes class, we oversample the minority (positive) class to make the class distribution more even.
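
One caveat worth keeping in mind: when random oversampling runs before the train/test split, copies of the same row can land in both parts, which tends to inflate test scores. The sketch below shows an alternative order (split first, then oversample the training part only). It runs on synthetic data and swaps in `sklearn.utils.resample` for `RandomOverSampler`, so it is an illustration under those assumptions rather than this notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the cleaned data: 70 negatives, 30 positives.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"feature": rng.normal(size=100),
                     "Outcome": [0] * 70 + [1] * 30})

# Split first, keeping the class ratio in both parts.
train, test = train_test_split(demo, test_size=0.3,
                               stratify=demo["Outcome"], random_state=0)

# Oversample the minority class inside the training part only.
majority = train[train["Outcome"] == 0]
minority = train[train["Outcome"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["Outcome"].value_counts())
print(test["Outcome"].value_counts())
```

The test part keeps its original imbalance, so its metrics reflect how the model would behave on unseen, unbalanced data.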

In [None]:
df_p2 = df_p1
print(df_p2['Outcome'].value_counts())
x = df_p2[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']]
y = df_p2['Outcome']

In [None]:
from imblearn import over_sampling
x_over, y_over = over_sampling.RandomOverSampler(random_state=0).fit_resample(x,y) # fixed seed so the resampling is reproducible
print(pd.Series(y_over).value_counts())

In [None]:
df_over = pd.DataFrame(x_over, columns= df[numericals].columns)
target_over = pd.DataFrame(y_over, columns= df[target].columns)
df_over.describe()

In [None]:
count_classes = target_over['Outcome'].value_counts(sort=True)  # Series.value_counts replaces the deprecated pd.value_counts
count_classes.plot(kind='bar', rot=0)
plt.title('Diabetes Outcome Distribution')
plt.xlabel('Class')
plt.ylabel('Frequency')

### Feature Scaling

Since the independent features are measured in different units, we scale them with standardization and use the scaled data for Logistic Regression and KNN. Scaling does not really affect the DTree algorithm, so it is not applied there.
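
A common variant is to learn the scaling statistics from the training rows only and reuse them on the test rows, so no information from the test set leaks into the model. The sketch below shows that order on synthetic features; the means and spreads are arbitrary illustrative values, not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (illustrative values only).
X = np.column_stack([rng.normal(120, 30, size=200),
                     rng.normal(0.5, 0.3, size=200)])
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```

After scaling, each training column has mean 0 and standard deviation 1; the test columns will be close to, but not exactly, those values.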

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
Std = StandardScaler()
dfX = pd.DataFrame(Std.fit_transform(df_over),
        columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'])

In [None]:
dfX.describe()

- - -
# Modeling After Pre-processing

Here we repeat the modeling, this time on the pre-processed data.

## Split Train Test Data

In [None]:
X = dfX[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']]
y = target_over['Outcome']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state=0)
X_train.shape

## Logistic Regression

### Fit & Predict

In [None]:
from sklearn.linear_model import LogisticRegression

logReg = LogisticRegression(random_state=0, max_iter=400)
logReg.fit(X_train, y_train)
y_predicted = logReg.predict(X_test)
y_predicted_proba = logReg.predict_proba(X_test)

### Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, recall_score,  roc_auc_score, precision_score
print('\nconfusion matrix') # generate the confusion matrix
print(confusion_matrix(y_test, y_predicted))

print('\n======================\nClassification Report:\n') # generate the classification report
print(classification_report(y_test, y_predicted))

## Logistic Regression Pre-Processing Evaluation

> - After pre-processing, the model has a lower overall accuracy of 0.75, and its precision, recall, and F1-score for the negative class are lower as well.
> - However, pre-processing clearly improves performance on the positive class: recall rose from 0.53 to 0.69, and precision rose from 0.71 to 0.80.
> - The F1-score for positive diabetes also improved, from 0.60 to 0.74.
> - Overall, pre-processing makes the Logistic Regression model better at predicting positive diabetes.
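
The fit cells compute `predict_proba` but never use it. One further way to raise sensitivity, sketched here on synthetic data, is to lower the decision threshold applied to the positive-class probability. The 0.3 threshold is an arbitrary illustrative value; lowering it can only keep or increase recall, usually at the cost of precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the diabetes data (8 features, ~35% positives).
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.65, 0.35], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=400).fit(X_train, y_train)
proba_pos = model.predict_proba(X_test)[:, 1]  # probability of the positive class

recalls = {}
for threshold in (0.5, 0.3):  # 0.5 is the default cut-off; 0.3 is illustrative
    y_pred = (proba_pos >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, y_pred)
    print(threshold, round(recalls[threshold], 2))
```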

## KNN

### Fit & Predict

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5) #using default value
knn.fit(X_train, y_train)
y_predicted = knn.predict(X_test)
print(y_predicted)

### Evaluation

In [None]:
print('\nconfusion matrix') # generate the confusion matrix
print(confusion_matrix(y_test, y_predicted))

print('\n======================\nClassification Report:\n') # generate the classification report
print(classification_report(y_test, y_predicted))

## KNN Pre-Processing Evaluation

> - Pre-processing also improves KNN: overall accuracy rose from 0.72 to 0.77, and recall and precision for the positive class improved substantially.
> - For the positive class, recall rose from 0.59 to 0.74 and precision rose from 0.56 to 0.80.
> - The F1-score for positive diabetes also improved, from 0.58 to 0.77.
> - Overall, pre-processing makes the KNN model better at predicting positive diabetes.

## DTree (Gini) 

Rebuild X from the unscaled data, since we don't want the DTree to use the standardized features.

In [None]:
X = df_over[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']]
y = target_over['Outcome']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state=0)
X_train.shape

### Fit & Predict

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=0) # default criterion is Gini impurity
dt.fit(X_train, y_train)
y_predicted = dt.predict(X_test)
print(y_predicted)

### Evaluation

In [None]:
print('\nconfusion matrix') # generate the confusion matrix
print(confusion_matrix(y_test, y_predicted))

print('\n======================\nClassification Report:\n') # generate the classification report
print(classification_report(y_test, y_predicted))

## Decision Tree (Gini) Evaluation

> - Pre-processing makes the Decision Tree exceptionally good: overall accuracy rose from 0.72 to 0.85, and recall and precision for the positive class improved substantially.
> - For the positive class, recall rose from 0.59 to 0.89 and precision rose from 0.56 to 0.84.
> - The F1-score for positive diabetes also improved, from 0.58 to 0.85.
> - Overall, pre-processing makes the Decision Tree model really good at predicting both negative and positive diabetes.

# Hyperparameter Tuning
---

## Logistic Regression

### Logistic Regression GridSearch

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
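# Only valid penalty/solver pairs are listed: invalid pairs fail to fit and are scored NaN.
# 'elasticnet' (needs l1_ratio) and the no-penalty option (its spelling varies by
# scikit-learn version) are left out.
common = {'C': np.logspace(-4, 4, 20), 'max_iter': [100, 1000, 2500, 5000]}
param_grid = [
    {'penalty': ['l2'], 'solver': ['lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga'], **common},
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga'], **common},
]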

In [None]:
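# X_train currently holds the unscaled DTree split, so rebuild the scaled split first
X_train, X_test, y_train, y_test = train_test_split(dfX, target_over['Outcome'], test_size=0.3, random_state=0)
grid_search = GridSearchCV(logReg, param_grid, cv=5, verbose=True, n_jobs=-1)
best_model = grid_search.fit(X_train,y_train)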

In [None]:
best_model.best_params_

In [None]:
accuracy = best_model.best_score_
accuracy

In [None]:
y_pred = best_model.predict(X_test)
print(classification_report(y_test,y_pred))

### Logistic Regression RandomSearch

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
random_search = RandomizedSearchCV(logReg, param_grid, cv=5, verbose=True, n_jobs=-1)

In [None]:
best_model = random_search.fit(X_train,y_train)

In [None]:
best_model.best_params_

In [None]:
accuracy = best_model.best_score_
accuracy

In [None]:
y_pred = best_model.predict(X_test)
print(classification_report(y_test,y_pred))

## KNN

### KNN GridSearch

In [None]:
param_grid = [    
    {'n_neighbors':[5,6,7,8,9,10,11],
     'leaf_size':[1,2,3,5],
     'weights':['uniform', 'distance'],
     'algorithm':['auto', 'ball_tree','kd_tree','brute']
    }
]

In [None]:
grid_search = GridSearchCV(knn, param_grid, cv=5, verbose=True, n_jobs=-1)
best_model = grid_search.fit(X_train,y_train)

In [None]:
best_model.best_params_

In [None]:
accuracy = best_model.best_score_
accuracy

In [None]:
y_pred = best_model.predict(X_test)
print(classification_report(y_test,y_pred))

### KNN RandomSearch

In [None]:
random_search = RandomizedSearchCV(knn, param_grid, cv=5, verbose=True, n_jobs=-1)

In [None]:
best_model = random_search.fit(X_train,y_train)

In [None]:
best_model.best_params_

In [None]:
accuracy = best_model.best_score_
accuracy

In [None]:
y_pred = best_model.predict(X_test)
print(classification_report(y_test,y_pred))

## DTree GridSearch

In [None]:
param_grid = [    
    {'criterion':['gini','entropy'],
     'splitter':['best','random'],
     'max_features':[None,'sqrt','log2'] # 'auto' was removed for trees in recent scikit-learn; None uses all features
    }
]

In [None]:
grid_search = GridSearchCV(dt, param_grid, cv=5, verbose=True, n_jobs=-1)
best_model = grid_search.fit(X_train,y_train)

In [None]:
best_model.best_params_

In [None]:
accuracy = best_model.best_score_
accuracy

In [None]:
y_pred = best_model.predict(X_test)
print(classification_report(y_test,y_pred))

## DTree RandomSearch

In [None]:
random_search = RandomizedSearchCV(dt, param_grid, cv=5, verbose=True, n_jobs=-1)
best_model = random_search.fit(X_train,y_train)

In [None]:
best_model.best_params_

In [None]:
best_model.best_score_

In [None]:
y_pred = best_model.predict(X_test)
print(classification_report(y_test,y_pred))

## Hyperparameter Tuning Evaluation

> - Both GridSearch and RandomSearch greatly boost KNN performance, with GridSearch giving the best result.
> - For Logistic Regression and Decision Tree, hyperparameter tuning gives no significant boost; performance even drops slightly.
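
The searches above rank candidates by the estimator's default score, which is accuracy for these classifiers. Since the introduction prioritizes sensitivity, the search could instead be scored on recall. A minimal sketch on synthetic data follows; the small grid is an assumption for illustration, not the grid used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the diabetes data.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.65, 0.35], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}

# scoring="recall" ranks candidates by sensitivity on the positive class (label 1).
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated recall of the best candidate
```

With this setting, `best_score_` is a recall value rather than an accuracy, so it should not be compared directly with the scores from the accuracy-based searches.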

# Evaluation Summary
---
So far, KNN tuned with GridSearch gives the best model on this dataset for detecting whether a patient is negative or positive for diabetes, compared with the other 2 models.

However, Logistic Regression and Decision Tree are still good models for this dataset.

In the future I might extend the evaluation in this notebook with ROC AUC and k-fold cross-validation.
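
As a rough idea of what that follow-up could look like, the sketch below scores a scaled Logistic Regression with stratified 5-fold cross-validation and ROC AUC. It uses synthetic data and an assumed pipeline, so the numbers it prints say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the diabetes data.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.65, 0.35], random_state=0)

# Putting the scaler in a pipeline refits it on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=400))

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(auc_scores.round(3), auc_scores.mean().round(3))
```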