# Pima Indians Diabetes Database

## Importing libraries and loading the dataset

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# loading dataset
df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")

display(df)

In [None]:
df.info()

The dataset has 768 rows and 9 columns, with no null values. Eight columns are features based on diagnostic measurements; the last column, `Outcome`, takes two values, 0 and 1.

**Here, `Outcome` is the target variable.**

## Handling outliers

A pairplot visualises pairwise relationships in a dataset. Here, each column is plotted against `Outcome`, five columns at a time.

In [None]:
for i in range(0, len(df.columns), 5):
    sns.pairplot(data=df, x_vars=df.columns[i:i+5], y_vars=['Outcome'])

The plots show clear outliers in the data (e.g. in `Insulin` and `SkinThickness`).

These outliers will be removed using the interquartile range (IQR) of each column.
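
As a minimal sketch of how IQR-based filtering behaves, the toy example below flags values that lie more than 1.5 × IQR outside the quartiles and drops the affected rows. The DataFrame `toy` and its values are made up for illustration and are not taken from the diabetes dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not from the diabetes dataset)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 100],   # 100 is an obvious outlier
    "b": [10, 11, 12, 13, 14, 15, 16, 17],
})

# Per-column quartiles and interquartile range
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Fences: values outside [lower, upper] are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# True where a value falls outside the fences of its own column
is_outlier = (toy < lower) | (toy > upper)

# Keep only rows that have no outlier in any column
toy_clean = toy[~is_outlier.any(axis=1)]
print(toy_clean)
```

Note that a single outlying value removes the whole row, so filtering many columns at once can discard a sizeable share of the data.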

In [None]:
# calculating IQR scores for each column
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3-q1
print(iqr)

Using these IQR values, we drop every row in which any column falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

In [None]:
# keeping only the rows with no outlier in any column
df = df[~((df < (q1 - 1.5 * iqr)) | (df > (q3 + 1.5 * iqr))).any(axis=1)]

# plotting the pairplots again
for i in range(0, len(df.columns), 5):
    sns.pairplot(data=df, x_vars=df.columns[i:i+5], y_vars=['Outcome'])

With the most extreme values removed, the plots look much cleaner and we can build our model.

## K-Nearest Neighbour

To train and evaluate the model, the data will be split into two sets.

> Training set - 90%

> Testing set - 10%

In [None]:
outcome = df['Outcome']
features = df.drop('Outcome', axis=1)

# splitting the data into training and testing sets
features_train, features_test, labels_train, labels_test = train_test_split(
    features, outcome, test_size=0.1
)

# creating the classifier
clf = KNeighborsClassifier()
clf.fit(features_train,labels_train)

pred = clf.predict(features_test)

# accuracy_score expects the true labels first, then the predictions
print("Accuracy =", accuracy_score(labels_test, pred) * 100, "%")
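
Because `train_test_split` shuffles the rows randomly, the accuracy printed above can change from run to run, and a 10% test set makes that variation more noticeable. The sketch below shows one way to make the split repeatable by fixing `random_state`, and to keep the class proportions similar in both sets with `stratify`. It runs on synthetic data from `make_classification` as a stand-in for the diabetes features; the sample size and seed values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 8 features and the binary Outcome column (assumption)
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

# random_state fixes the shuffle; stratify keeps the 0/1 ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

# Same classifier as above, with default settings (5 neighbours)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)
print(f"Accuracy = {acc * 100:.1f} %")
```

Running this block twice gives the same split and therefore the same accuracy, which makes results easier to compare when the model settings are changed.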