# CONTENT <a id='content'></a>
## 1. [Intro and imports](#intro)

## 2. [Reading CSV and EDA](#csv)

## 3. [Data Preprocessing](#preprocess)

## 4. [Model selection](#mod)
#### [KNN](#KNN)
#### [Logistic Regression](#lr)
#### [SVM](#svm)
---

# Intro and imports <a id='intro'></a>
##### [Back to CONTENT](#content)
#### Data
The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

BloodPressure: Diastolic blood pressure (mm Hg)

SkinThickness: Triceps skin fold thickness (mm)

Insulin: 2-Hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction: Diabetes pedigree function

Age: Age (years)

Outcome: Class variable (0 or 1)
#### Objective
We will try to build a machine learning model that accurately predicts whether the patients in the dataset have diabetes.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# model selection
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, cross_val_predict

# models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# evaluation metrics
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score, accuracy_score, roc_auc_score, roc_curve

# Reading CSV and EDA <a id='csv'></a>

##### [Back to CONTENT](#content)

In [None]:
# reading CSV file from kaggle
data = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
# reading CSV file from my PC
# data = pd.read_csv('diabetes.csv')


In [None]:
data.head()   

In [None]:
data.info()

In [None]:
data.describe()
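
The summary statistics above are worth checking for zeros. In this dataset a zero in columns such as Glucose, BloodPressure, SkinThickness, Insulin, or BMI is commonly read as "not measured" rather than as a real value; that reading is an assumption here, since the notebook itself does not state it. A minimal sketch of how such zeros could be counted and marked as missing, using a tiny hypothetical frame (`demo`) in place of `data`:

```python
import pandas as pd

# Tiny hypothetical stand-in for `data`; the values are illustrative only.
demo = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Pregnancies": [6, 0, 8, 1],  # zero is a legitimate value here
})

# Assumption: in these columns a zero means the measurement is missing.
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

# Count zeros per column.
zero_counts = (demo[zero_as_missing] == 0).sum()
print(zero_counts)

# Replace those zeros with NaN so later statistics and imputation see them as missing.
cleaned = demo.copy()
for col in zero_as_missing:
    cleaned[col] = demo[col].mask(demo[col] == 0)
print(cleaned.isna().sum())
```

Whether to impute or drop the marked values is a separate modelling decision and is not covered by this notebook.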

#### histogram

In [None]:
data.hist(figsize=(20,10));

In [None]:
# One box plot per predictor, laid out on a 2x4 grid
features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
fig, axes = plt.subplots(2, 4, figsize=(20, 12))
for ax, feature in zip(axes.ravel(), features):
    # Pass the column by keyword: a positional column combined with data= fails on recent seaborn releases
    sns.boxplot(x=data[feature], ax=ax)

#### correlation

In [None]:
plt.subplots(figsize=(20, 15))
ax = sns.heatmap(
    data.corr(),
    cmap=sns.diverging_palette(20, 220, n=200),
    annot=True)

In [None]:
corr_matrix = data.corr()
corr_matrix["Outcome"].sort_values(ascending = False)
# Observed order, strongest to weakest correlation with Outcome:
# Glucose, BMI, Age, Pregnancies, DiabetesPedigreeFunction, Insulin, SkinThickness, BloodPressure

# Data preprocessing <a id='preprocess'></a>

##### [Back to CONTENT](#content)

In [None]:
y = data.Outcome
X = data.drop(['Outcome'],axis=1)
X.head()

In [None]:
# Hold out 20% of the rows as a test set (train_test_split is already imported above)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
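
The split above is a plain random split on unscaled features. Two common refinements, neither of which is part of the original notebook, are to stratify on the target so both subsets keep the same class balance, and to standardise the features with statistics learned from the training rows only. A minimal sketch on synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins for `X` and `y`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 features and an imbalanced binary target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_redundant=0,
    weights=[0.65, 0.35], random_state=0,
)

# stratify=y_demo keeps the positive rate nearly identical in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Fit the scaler on the training rows only, then reuse it on the test rows
# so that no information from the test set leaks into preprocessing.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print("positive rate (train):", y_tr.mean())
print("positive rate (test): ", y_te.mean())
```

Scaling matters most for distance-based and margin-based models such as KNN and SVM, where features with large ranges would otherwise dominate.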

# Model selection <a id='mod'></a>
##### [Back to CONTENT](#content)

## KNN <a id='KNN'></a>
##### [Back to CONTENT](#content)

In [None]:
# KNeighborsClassifier is already imported above
knn_model = KNeighborsClassifier(n_neighbors=110)
knn_model.fit(X_train, y_train)
y_preds = knn_model.predict(X_test)
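
The value of `n_neighbors` above is fixed by hand. Since `GridSearchCV` is imported at the top of the notebook, one possible way to choose it is a cross-validated search over a few candidates. The sketch below is an illustration on synthetic data, not the author's procedure; the candidate list and the scaling step are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features and target.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Scaling lives inside the pipeline so it is refit on every CV training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical candidate values for k.
param_grid = {"knn__n_neighbors": [3, 5, 11, 21, 51, 101]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("best k:", search.best_params_["knn__n_neighbors"])
print("mean CV accuracy:", search.best_score_)
print("test accuracy:", search.score(X_te, y_te))
```

The test set is used only once, after the search, so the reported test accuracy is not biased by the choice of k.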

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
# With 0/1 labels, MSE equals the misclassification rate, so RMSE = sqrt(1 - accuracy)
mse = mean_squared_error(y_test, y_preds)
rmse = sqrt(mse)
print("RMSE = ",rmse)
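
RMSE is normally a regression metric. For 0/1 class labels each squared error is 0 for a correct prediction and 1 for a wrong one, so the MSE is just the error rate and the RMSE carries the same information as accuracy. A small worked check on made-up labels (the arrays are illustrative, not taken from the dataset):

```python
from math import isclose, sqrt

import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Ten made-up labels with two wrong predictions (positions 2 and 5).
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0])
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])

mse_demo = mean_squared_error(y_true, y_hat)  # fraction of wrong predictions
acc_demo = accuracy_score(y_true, y_hat)      # fraction of correct predictions
rmse_demo = sqrt(mse_demo)

print("MSE      =", mse_demo)
print("accuracy =", acc_demo)
print("RMSE     =", rmse_demo)
print("MSE == 1 - accuracy:", isclose(mse_demo, 1 - acc_demo))
```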

In [None]:
print('My accuracy: ', knn_model.score(X_test, y_test)*100, '%')
# My accuracy:  78.57142857142857 %
print("Classification Report is:\n",classification_report(y_test, y_preds))

## Logistic Regression <a id='lr'></a>
##### [Back to CONTENT](#content)

In [None]:
# LogisticRegression is already imported above; liblinear is a reasonable solver for small datasets
lg_model = LogisticRegression(solver='liblinear')
lg_model.fit(X_train, y_train)
y_pred = lg_model.predict(X_test)

In [None]:
mse = mean_squared_error(y_test, y_pred)
rmse = sqrt(mse)
print("RMSE = ",rmse)

In [None]:
print('My accuracy: ', lg_model.score(X_test, y_test)*100, '%')
# My accuracy:  81.81818181818183 %
print("Classification Report is:\n",classification_report(y_test, y_pred))
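
Accuracy depends on the default 0.5 decision threshold. The notebook imports `roc_auc_score` and `roc_curve` without using them; a threshold-independent view could be sketched as follows, assuming the predicted probability of the positive class is used as the score. The data here is synthetic and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes features and target.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

clf = LogisticRegression(solver="liblinear")
clf.fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba = clf.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, proba)
fpr, tpr, thresholds = roc_curve(y_te, proba)

print("ROC AUC:", auc)
print("number of ROC points:", len(fpr))
```

The `fpr` and `tpr` arrays can be passed straight to `plt.plot(fpr, tpr)` to draw the curve.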

## SVM <a id='svm'></a>

##### [Back to CONTENT](#content)

In [None]:
# SVC is already imported above; with default arguments it uses an RBF kernel
svc_model = SVC()
svc_model.fit(X_train, y_train)
y_preds = svc_model.predict(X_test)

In [None]:
mse = mean_squared_error(y_test, y_preds)
rmse = sqrt(mse)
print("RMSE = ",rmse)

In [None]:
print('My accuracy: ', svc_model.score(X_test, y_test)*100, '%')
# My accuracy:  79.22077922077922 %
print("Classification Report is:\n",classification_report(y_test, y_preds))

## Logistic Regression performs best: about 81.8% test accuracy, versus 79.2% for SVM and 78.6% for KNN
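
The ranking above comes from a single 80/20 split, so differences of a few percentage points may not be stable. The notebook imports `cross_val_score` without using it; a cross-validated comparison could look roughly like the sketch below. It runs on synthetic data, and the scaling step and the KNN setting `n_neighbors=11` are illustrative assumptions rather than the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes features and target.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# Each model is wrapped in a pipeline so scaling is refit inside every fold.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=11)),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(solver="liblinear")
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

cv_results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for this model.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean scores together with their spread gives a better sense of whether one model is really ahead of the others.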