In [None]:
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


## Classification Model of Diabetic Patients

### Problem Statement


Diabetes is diagnosed with a fasting blood sugar test or with an A1c blood test, also known as a glycated hemoglobin test. 
A fasting blood sugar test is performed after the patient has had nothing to eat or drink for at least eight hours.
In most cases, if the **glucose** (blood sugar) level is equal to or greater than 126 mg/dl (7 mmol/l), the patient can be diagnosed with the disease.

In this case study, patient data is used to predict the likelihood of a patient being diagnosed with diabetes.
Such a model can serve as an indicator prior to testing.


### Understanding the Dataset

In [None]:
data = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
data.head()

In [None]:
data.shape

The dataset comprises 768 records.
Besides 8 patient metrics (such as blood pressure, BMI and age), it contains a binary variable - Outcome - indicating whether the patient was diagnosed with diabetes (1 - has the disease, 0 - doesn't have the disease).

### Data Cleaning

Before beginning the analysis and modeling, we need to clean and prepare the dataset.
Let's start by inspecting the data:

In [None]:
data.describe()

The summary shows values of 0 for Glucose, BloodPressure and BMI. A value of 0 is not plausible for these measurements, so it most likely stands for a missing reading, and we remove the affected rows:

In [None]:
data = data[data['Glucose'] != 0]
data = data[data['BloodPressure'] != 0]
data = data[data['BMI'] != 0]
data.describe()
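
The three filters above can also be written as a single boolean mask. The sketch below illustrates the pattern on a small made-up frame (`toy` and its values are hypothetical, not rows from the real dataset):

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset
toy = pd.DataFrame({
    "Glucose": [148, 0, 85, 183],
    "BloodPressure": [72, 66, 0, 64],
    "BMI": [33.6, 26.6, 28.1, 0.0],
    "Outcome": [1, 0, 0, 1],
})

cols = ["Glucose", "BloodPressure", "BMI"]

# Keep only the rows where none of the three columns is 0
mask = (toy[cols] != 0).all(axis=1)
cleaned = toy[mask]

print(len(toy), "rows before,", len(cleaned), "rows after")
```

Building the mask first also makes it easy to count how many rows are about to be dropped before committing to the filter.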

### Patient Sample Characterization

In [None]:
data.describe()

Number of healthy and diabetic patients in the dataset:

In [None]:
diabetic_patients = data[data['Outcome'] == 1] # diabetic patients
num_diabetics = diabetic_patients.shape[0] # count of diabetic patients

healthy_patients = data[data['Outcome'] == 0] # healthy patients
num_healthy = healthy_patients.shape[0] # count of healthy patients

print('Number of diabetic patients: ' + str(num_diabetics))
print('Number of healthy patients: ' + str(num_healthy))
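
As a shorter alternative, the class counts and their proportions can be read off with `value_counts`. This is a minimal sketch on made-up labels (`outcome` is a hypothetical stand-in for `data['Outcome']`):

```python
import pandas as pd

# Hypothetical labels (1 = diabetic, 0 = healthy)
outcome = pd.Series([0, 1, 0, 0, 1, 0], name="Outcome")

counts = outcome.value_counts()                # absolute count per class
shares = outcome.value_counts(normalize=True)  # proportion per class

print(counts)
print(shares.round(2))
```

Looking at the proportions is useful because accuracy alone can be misleading when one class is much larger than the other.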


Patient age:

In [None]:
data['Age'].describe()

While there's a range of patients aged between 21 and 81 years, 75% of patients are below the age of 36.

## Classification Model

For this project, a simple K-Nearest Neighbors classifier is used to predict whether a patient could have diabetes or not.
As mentioned before, the dataset provides the *labels* for the model with the variable **Outcome**.
The goal is to predict the possibility of having diabetes from the features provided, such as blood pressure, insulin levels, glucose, skin thickness, and BMI.

### 1. Normalization

Before building the model, the feature data needs to be normalized: KNN relies on distances between samples, so features with large numeric ranges would otherwise dominate. (More info:
[Why normalize data in KNN?](https://medium.com/analytics-vidhya/why-is-scaling-required-in-knn-and-k-means-8129e4d88ed7))
This is done below with the help of sklearn:

In [None]:
from sklearn import preprocessing

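feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                 'BMI', 'DiabetesPedigreeFunction', 'Age']

features = data[feature_names]  # feature columns only

target = data['Outcome']  # labels

# Note: normalize() rescales each row (sample) to unit L2 norm;
# it does not rescale each column (feature) separately.
X = preprocessing.normalize(features)  # normalized features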
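
Note that `preprocessing.normalize` rescales each sample (row) to unit length, which is not the same as putting every feature (column) on a comparable scale. The sketch below contrasts it with `StandardScaler` on made-up values (`toy` is hypothetical); whether switching scalers would change this model's scores is not tested here.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical rows: [Glucose, DiabetesPedigreeFunction]
toy = np.array([[148.0, 0.6],
                [85.0, 0.3],
                [183.0, 0.7]])

# normalize() works row by row: each sample is divided by its own L2 norm
row_scaled = preprocessing.normalize(toy)

# StandardScaler works column by column: each feature gets mean 0 and unit variance
col_scaled = preprocessing.StandardScaler().fit_transform(toy)

print(np.linalg.norm(row_scaled, axis=1))  # length of every row after normalize()
print(col_scaled.mean(axis=0))             # mean of every column after StandardScaler
print(col_scaled.std(axis=0))              # standard deviation of every column
```

When a scaler such as `StandardScaler` is used, it is usually fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.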

### 2. Choosing K

KNN classifies a sample by majority vote among its *k* nearest neighbors in the training data.
To choose the value of *k*, we test a range of values and keep the one that gets the best ***accuracy score*** for the model - the proportion of correct classifications over all predictions made.

This can be done by computing the score for each value of *k* and plotting the resulting (k, score) pairs.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# splitting train and test data
x_train, x_test, y_train, y_test = train_test_split(X, target, random_state=3)

# we'll store each k value and score tested
k_scores = {} # will map k and score
k = list(range(1,25))

scores = [] # stores model scores

## looping to test different k values
for i in k:
    # create and fit the model
    classifier = KNeighborsClassifier(n_neighbors=i)
    classifier.fit(x_train, y_train)
    y_pred = classifier.predict(x_test) # predicted values
    
    # store test results
    k_scores[i] = accuracy_score(y_test, y_pred) # stores classification score for k
    scores.append(k_scores[i]) # append result in list

Now we can plot the scores:

In [None]:
plt.plot(k, scores)
plt.xlabel("K Value")
plt.ylabel("Score")
plt.title('Accuracy Score for Each Value of K')

In [None]:
## prints out max score for the model
max_test = max(scores)
for val in k:
    if k_scores[val] == max_test:
        result = val

print('Max test score: ' + str(max_test) + ' with k: ' + str(result))


With k = 6, the model classifies 68.5% of the test samples correctly.
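
Because *k* is picked here using the same test split that reports the score, the resulting accuracy is likely to be somewhat optimistic. One common alternative is to select *k* with cross-validation. The sketch below shows the idea on synthetic data from `make_classification` (`X_demo` and `y_demo` are hypothetical stand-ins, not the patient data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the patient features and labels (not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=3)

cv_scores = {}  # maps k to its mean cross-validated accuracy
for n in range(1, 25):
    model = KNeighborsClassifier(n_neighbors=n)
    # mean accuracy over 5 folds for this value of k
    cv_scores[n] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()

best_k = max(cv_scores, key=cv_scores.get)
print("Best k:", best_k, "mean CV accuracy:", round(cv_scores[best_k], 3))
```

With this approach the test split can be kept aside and used only once, to report the score of the final model.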

### 3. Model Evaluation and Tuning

We have determined that 6 neighbors maximize the model's accuracy score on the test set (about 0.685).
However, [other measures should be taken into account when building classification models](https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9):
* Precision score: measures the proportion of true positives over all predicted positives (in this case, the number of correctly "diagnosed" patients over all patients the model labels as diabetic).
* Recall score: measures the proportion of true positives over all actual positives (in this case, the number of correctly "diagnosed" patients over all patients who actually have diabetes).
* F1 Score: takes into account precision and recall (using the harmonic mean).
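
To make these definitions concrete, here is a small worked example that computes each measure directly from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not the results of this model.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
tp, fp, fn, tn = 30, 10, 20, 40

accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct predictions over all predictions
precision = tp / (tp + fp)                          # true positives over predicted positives
recall = tp / (tp + fn)                             # true positives over actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", round(f1, 3))
```

In this example precision (0.75) is higher than recall (0.6): most positive predictions are right, but a sizeable share of the actual positives is missed, which matters for a screening tool.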

In [None]:
# final model
classifier = KNeighborsClassifier(n_neighbors=6)
classifier.fit(x_train, y_train)

predictions = classifier.predict(x_test) ## model predictions using test set

## metrics
accuracy = sklearn.metrics.accuracy_score(y_test, predictions)
precision = sklearn.metrics.precision_score(y_test, predictions)
recall = sklearn.metrics.recall_score(y_test, predictions)
f1 = sklearn.metrics.f1_score(y_test, predictions)  # harmonic mean of precision and recall

print("Accuracy: " + str(accuracy))
print("F1 score: " + str(f1))
print("Precision: " + str(precision))
print("Recall: " + str(recall))

While the model has an accuracy score of 68.5%, its precision score is 62%: of all patients who are "diagnosed" (as in, predicted by the model to have the disease), only 62% actually have it.

We can also display the results with the help of a confusion matrix:

In [None]:
pd.crosstab(y_test, predictions, rownames=['Actual'], colnames=['Predicted'])
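
The same table can also be produced with `sklearn.metrics.confusion_matrix`, which makes it easy to unpack the four counts. This is a minimal sketch on made-up labels (`actual` and `predicted` are hypothetical, not this model's output):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = diabetic, 0 = healthy)
actual = [0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 1, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes, both in the order [0, 1]
cm = confusion_matrix(actual, predicted, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

For a diagnostic indicator, the false negatives (diabetic patients predicted as healthy) are usually the count to watch most closely.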