**Welcome to my Prediction Modeling Notebook**

Background:

The goal of this notebook is to predict whether or not a Pima Indian will have diabetes, given certain characteristics. The Pima Indians have the highest prevalence of type 2 diabetes in the world, well above the U.S. average.

According to the U.S. Department of Health and Human Services, "You are more likely to develop type 2 diabetes if you are age 45 or older, have a family history of diabetes, or are overweight or obese."

Many of the participants are below the age of 45 (635 out of 768 participants, 83%), so the age criterion alone would miss most of them. Accurately predicting early whether or not an individual will get type 2 diabetes is vital to the individual's health.


***

For prediction modeling, I will be using K-nearest neighbors (KNN) and Logistic Regression.




In [None]:
# Imports
import numpy as np 
import pandas as pd 
import sklearn
import os
import math
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score


from sklearn.linear_model import LogisticRegression




**Observing and Formatting Data**

In [None]:
# Observing and Formatting Data

df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")

# print(df.isnull().values.any())
# There are no NaN entries, but a 0 in the columns below marks a missing
# measurement, so these values must be handled.

# I will replace the 0s with the column mean

# Percentage of missing values




columns_remove_zero = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"
                      ,"DiabetesPedigreeFunction"]


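for column in columns_remove_zero:
    # Treat 0 as missing, then fill the gaps with the column mean.
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0.
    df[column] = df[column].replace(0, np.nan)
    # Keep the mean as a float; truncating it with int() would bias the fill value.
    column_mean = df[column].mean(skipna=True)
    df[column] = df[column].replace(np.nan, column_mean)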
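
The share of zero entries per column can be checked before any values are replaced. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical, not taken from diabetes.csv) to show one way to compute it; on the real data it would have to run before the replacement loop.

```python
import pandas as pd

# Hypothetical stand-in for the diabetes data; the values are invented for illustration.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Percentage of rows in each column where a 0 stands in for a missing measurement.
missing_pct = (toy == 0).mean() * 100
print(missing_pct.round(1))
```

A column with a large share of zeros (Insulin in this toy example) is one where mean imputation replaces a big part of the data, which is worth keeping in mind when reading the results.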
    




In [None]:
# Visualizing Data

plt.figure(figsize=(10,10))
sns.heatmap(df.corr(), annot=True, annot_kws={'size':10}, cmap='coolwarm')


There do not appear to be any strong correlations among the predictors or between the predictors and the outcome variable. Glucose level has the highest correlation with the outcome variable.

In [None]:
plt.pie(df['Outcome'].value_counts(),autopct='%.2f')
plt.legend(['Non-Diabetic','Diabetic'],loc='best', bbox_to_anchor=(1, 0, .5, 0.75))
plt.title("Percentage Breakdown of Non-Diabetic and Diabetic Individuals")


**Prediction Modeling: KNN Model**


In [None]:
# First Prediction Model: K-nearest neighbor

x=df.drop('Outcome',axis=1)
y=df['Outcome']


x_train, x_test, y_train, y_test = train_test_split(x,y,random_state=0, test_size = 0.2)

sc_x = StandardScaler()

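# Fit the scaler on the training data only, then apply the same
# mean and standard deviation to the test data to avoid leakage.
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)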
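
One way to keep test-set statistics out of the scaling step is to wrap the scaler and the classifier in a scikit-learn pipeline. The sketch below is a minimal illustration on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `pipe`, and the choice of 5 neighbors, are assumptions made for the example and are not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the diabetes features (8 predictors, binary outcome).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# fit() learns the scaling statistics from the training rows only;
# score() and predict() reuse those statistics on the test rows.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```

With a pipeline, the scaler cannot be refit on the test data by accident, because the test rows only ever pass through `transform`.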



In [None]:
# Find optimal k value

acc = []

for k in range(1, 51):
    knnModel = KNeighborsClassifier(n_neighbors=k, p = 2, metric = 'euclidean')
    
    knnModel.fit(x_train, y_train)
    y_pred = knnModel.predict(x_test)
    acc.append(accuracy_score(y_test, y_pred))
    
# acc[0] holds the accuracy for k = 1, so the best k is the index plus one
best_k = np.argmax(acc) + 1
print(best_k)
plt.plot(range(1, 51), acc)
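
Choosing k by its accuracy on the test set tends to give an optimistic estimate, because the same rows are later used to report performance. A common alternative is to pick k by cross-validation on the training data. The sketch below shows that idea on synthetic data; `X_demo`, `y_demo`, the range of k values, and the 5-fold setting are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training portion of the data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

k_values = list(range(1, 31))
cv_acc = []
for k in k_values:
    # Scaling lives inside the pipeline, so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_acc.append(scores.mean())

# Map the position of the best mean accuracy back to its k value.
best_k = k_values[int(np.argmax(cv_acc))]
print(best_k)
```

The test set is then touched only once, to evaluate the model built with the selected k.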





In [None]:

# p is the power parameter of the Minkowski distance, not the number of classes;
# p = 2 corresponds to the Euclidean distance
knnModel = KNeighborsClassifier(n_neighbors=32, p = 2, metric = 'euclidean')

# Train Model
knnModel.fit(x_train, y_train)

y_pred_knnModel = knnModel.predict(x_test)



In [None]:


# Confusion Matrix of the KNN Model - Code from https://github.com/DTrimarchi10/confusion_matrix



# Use the predictions of the final model, not y_pred left over from the k search loop
confMatrix = confusion_matrix(y_test, y_pred_knnModel)
# sns.heatmap(confMatrix, annot= True, cmap = 'Blues')

group_names = ['True Neg','False Pos','False Neg','True Pos']

group_counts = ["{0:0.0f}".format(value) for value in
                confMatrix.flatten()]

group_percentages = ["{0:.2%}".format(value) for value in
                     confMatrix.flatten()/np.sum(confMatrix)]

labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

sns.heatmap(confMatrix, annot=labels, fmt = '', cmap='Blues')
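
Because the same labeling code is needed for each model, it could be wrapped in a small helper. The function below is a sketch of that idea; `confusion_labels` is a hypothetical name introduced here, and the example matrix contains made-up counts rather than results from this notebook.

```python
import numpy as np

def confusion_labels(cm):
    """Build 2x2 annotation labels (name, count, share) for a binary confusion matrix."""
    cm = np.asarray(cm)
    names = ["True Neg", "False Pos", "False Neg", "True Pos"]
    counts = [str(int(v)) for v in cm.flatten()]
    shares = [f"{v:.2%}" for v in cm.flatten() / cm.sum()]
    labels = [f"{n}\n{c}\n{s}" for n, c, s in zip(names, counts, shares)]
    return np.asarray(labels).reshape(2, 2)

# Made-up counts, used only to show the label format.
example_cm = np.array([[90, 10], [20, 34]])
labels = confusion_labels(example_cm)
print(labels[0, 0])
```

The returned array can be passed to `sns.heatmap(..., annot=labels, fmt='')` in the same way as the `labels` array built inline above.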

In [None]:

print(f1_score(y_test, y_pred_knnModel))

print(accuracy_score(y_test, y_pred_knnModel))
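
To make the two metrics concrete, the sketch below recomputes them by hand from the four confusion-matrix cells and checks the result against scikit-learn. The label arrays `y_true` and `y_hat` are invented for the example and are unrelated to the model output above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up labels and predictions for ten individuals.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted diabetics that are correct
recall = tp / (tp + fn)     # share of actual diabetics that are found
f1_manual = 2 * precision * recall / (precision + recall)
acc_manual = (tp + tn) / (tp + tn + fp + fn)

print(f1_manual, f1_score(y_true, y_hat))
print(acc_manual, accuracy_score(y_true, y_hat))
```

Accuracy counts both classes equally, while the F1 score focuses on the positive (diabetic) class, which is why the two can differ noticeably when the classes are imbalanced.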

**Logistic Regression for Prediction**





In [None]:
logModel= LogisticRegression(solver='liblinear')
logModel.fit(x_train,y_train)


y_pred_logModel = logModel.predict(x_test)



In [None]:
# Confusion Matrix of the Logistic Model

# Code is relatively similar to the KNN version - Code from https://github.com/DTrimarchi10/confusion_matrix

confMatrix = confusion_matrix(y_test, y_pred_logModel)
# sns.heatmap(confMatrix, annot= True, cmap = 'Blues')

group_names = ['True Neg','False Pos','False Neg','True Pos']

group_counts = ["{0:0.0f}".format(value) for value in
                confMatrix.flatten()]

group_percentages = ["{0:.2%}".format(value) for value in
                     confMatrix.flatten()/np.sum(confMatrix)]

labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)

sns.heatmap(confMatrix, annot=labels, fmt = '', cmap='Blues')



In [None]:
# Accuracy and f1-score


print(f1_score(y_test, y_pred_logModel))

print(accuracy_score(y_test, y_pred_logModel))


