# Predict whether a person will have diabetes or not

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split # Split our data into training and testing sets
from sklearn.preprocessing import StandardScaler # Standardize each feature to zero mean and unit variance
from sklearn.neighbors import KNeighborsClassifier # The KNN classifier itself
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Datasets/Diabetes/Cópia de diabetes.csv')
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
len(df)

768

## Preparing Column Values

Looking at the dataset, some rows (observations) have a value of 0 for Glucose, BloodPressure, SkinThickness, Insulin or BMI. A living person cannot have a 0 for any of these measurements, so we treat those zeros as missing data and replace them with the mean of the remaining values in the respective column.

In [None]:
columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

for column in columns:
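  df[column] = df[column].replace(0, np.nan) # Mark zeros as missing (np.NaN was removed in NumPy 2.0)
  mean = df[column].mean(skipna = True) # Mean of the observed (non-missing) values
  df[column] = df[column].replace(np.nan, mean) # Fill the missing values with that mean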

In [None]:
df[df['Glucose'] == 0] # No more data with 0 values (replaced by mean)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
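
As a small illustration of the replacement step above, here is a sketch on a toy frame. The values and the `toy` and `zeros_left` names are made up for this example; only the zero-to-mean idea comes from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) where 0 stands in for a missing measurement.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, 0.0, 23.3, 28.1],
})

for column in ["Glucose", "BMI"]:
    # Mark zeros as missing, then fill them with the mean of the observed values.
    toy[column] = toy[column].replace(0, np.nan)
    toy[column] = toy[column].fillna(toy[column].mean())

# Count how many zeros survive across the whole frame.
zeros_left = int((toy == 0).sum().sum())
print(toy)
print("zeros left:", zeros_left)
```

The mean is taken after the zeros are masked, so the placeholder zeros do not drag the fill value down.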


## Splitting the Data

Now we have to split our data into training and testing sets.

Our X is going to be all the feature values that are important to predict diabetes. In this dataset, all the features are important (all the columns except the last one).


Our y will be the last column, which is the class representing whether the person with that row of attributes has diabetes or not.

*Note*: `df.iloc` indexes the data frame by position.

In [None]:
train_test_split?

In [None]:
X = df.iloc[:, 0:8]
y = df.iloc[:, 8]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.2)
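
The same split can also be written by column name instead of by position. The sketch below uses a made-up toy frame; the `stratify` argument is an optional extra that is not used in the notebook and is shown only as a possibility for keeping class proportions similar in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (made-up values) with the same layout idea: features first, label last.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "Age":     [50, 31, 32, 21, 33, 30, 26, 29, 53, 54],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})

X_toy = toy.drop(columns="Outcome")  # every column except the label
y_toy = toy["Outcome"]               # the label column

# 20% of the rows go to the test set; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
```

Selecting by name avoids silently picking the wrong column if the column order of the file ever changes.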

## Standardizing the Data

Whenever a model relies on distances (as KNN does) or assumes normally distributed features: STANDARDIZE THE DATA.

We cannot apply such ML algorithms and expect great results if our data has values from 0 to 6 in one column and from 0 to 256 in another: the column with the larger range would dominate the distance. We need to standardize it.

In [None]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
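
To see what the scaler does, here is a minimal sketch on made-up numbers. The `train` and `test` arrays are assumptions for illustration; the fit-on-train, transform-on-test pattern is the one used above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (e.g. pregnancies vs. glucose).
train = np.array([[0.0, 85.0], [2.0, 120.0], [4.0, 150.0], [6.0, 197.0]])
test = np.array([[1.0, 100.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from the training rows only
test_scaled = scaler.transform(test)        # reuse the training statistics on the test rows

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

The scaled values are not bounded to a fixed interval; each column simply ends up with zero mean and unit variance on the training data.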

## Creating our KNN Model

To do this, we are going to use ``KNeighborsClassifier``, picking a value of k equal to the square root of the length of y_test, rounded down and reduced by one to give us an odd number.

In [None]:
import math 

math.sqrt(len(y_test))

12.409673645990857
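
The rule of thumb above can be written as a small helper. `odd_k_from_sqrt` is a hypothetical name introduced here, and the value 154 is inferred from the output above (12.4097² ≈ 154, which is 20% of 768 rounded up).

```python
import math

def odd_k_from_sqrt(n_samples: int) -> int:
    """Hypothetical helper: floor of sqrt(n_samples), lowered by one if it is even."""
    k = int(math.sqrt(n_samples))
    if k % 2 == 0:
        k -= 1
    return max(1, k)  # never return fewer than one neighbor

print(odd_k_from_sqrt(154))
```

An odd k is a common choice for two-class problems because it avoids tied votes when every neighbor counts equally.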

In [None]:
classifier = KNeighborsClassifier(n_neighbors = 11, weights = 'distance', p = 2, metric = 'euclidean')

## Training the KNN Model

Train the KNN model on X_train and y_train, using k = 11 neighbors and the Euclidean distance.

In [None]:
classifier.fit(X_train, y_train)

KNeighborsClassifier(metric='euclidean', n_neighbors=11, weights='distance')
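
To make the `weights = 'distance'` setting more concrete, the following is a simplified sketch of a distance-weighted vote for a single query point. It is not scikit-learn's implementation; `knn_predict_one`, the toy arrays and the small epsilon guard against division by zero are all assumptions made for illustration.

```python
import numpy as np

def knn_predict_one(X_train, y_train, x, k):
    """Simplified distance-weighted KNN vote for one point (binary labels 0/1)."""
    distances = np.linalg.norm(X_train - x, axis=1)       # Euclidean distance to every training row
    nearest = np.argsort(distances)[:k]                   # indices of the k closest rows
    weights = 1.0 / np.maximum(distances[nearest], 1e-12) # closer neighbors get larger weights
    votes = np.bincount(y_train[nearest], weights=weights, minlength=2)
    return int(np.argmax(votes))                          # class with the largest weighted vote

# Made-up training points: class 0 near the origin, class 1 near (5, 5).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 1, 1, 1])

pred_near_origin = knn_predict_one(X_toy, y_toy, np.array([0.2, 0.3]), k=3)
pred_near_five = knn_predict_one(X_toy, y_toy, np.array([5.0, 5.2]), k=3)
print(pred_near_origin, pred_near_five)
```

With inverse-distance weights, a few very close neighbors can outvote a larger number of distant ones.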

## Predict Values of the Classifier

In [None]:
y_pred = classifier.predict(X_test)
y_pred

array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

## Evaluate the Model

Compute confusion matrix to evaluate the accuracy of a classification.

By definition a confusion matrix $C$ is such that $C_{i,j}$ is equal to the number of observations known to be in group $i$ and predicted to be in group $j$.

Thus in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$.
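
The index convention above can be checked on a tiny example. The labels below are made up; unpacking the flattened matrix with `.ravel()` gives the four counts in the order $C_{0,0}$, $C_{0,1}$, $C_{1,0}$, $C_{1,1}$.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for eight observations.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are the true class, columns are the predicted class.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)
```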

In [None]:
confusion_matrix?

In [None]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[93 14]
 [16 31]]


This means that 93 of the predictions are true negatives, 14 are false positives, 16 are false negatives and 31 are true positives.

In [None]:
f1_score?

In [None]:
print(f1_score(y_test, y_pred)) # Harmonic mean of precision and recall for the positive class (ignores true negatives)

0.6739130434782609


In [None]:
print(accuracy_score(y_test, y_pred)) # Fraction of all predictions that are correct: (TP + TN) / total

0.8051948051948052
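
Both scores can be reproduced by hand from the four counts in the confusion matrix printed earlier (93, 14, 16, 31). The sketch below uses only the standard definitions of precision, recall, F1 and accuracy; the variable names are chosen for this example.

```python
# Counts read from the confusion matrix [[93, 14], [16, 31]].
tn, fp, fn, tp = 93, 14, 16, 31

precision = tp / (tp + fp)  # share of predicted positives that are truly positive
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f1)
print(accuracy)
```

The F1 score is lower than the accuracy here because it ignores the 93 true negatives and focuses on how well the positive (diabetic) class is detected.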
