# Introduction

The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Contents:
1. [Load and Check Data](#1)
2. [Variable Description](#2)
3. [Basic Data Analysis - Visualization](#3)
    * [Pregnancies - Outcome Relationship](#4)
    * [Glucose - Outcome Relationship](#5)
    * [BloodPressure - Outcome Relationship](#6)
    * [SkinThickness - Outcome Relationship](#7)
    * [Insulin - Outcome Relationship](#8)
    * [BMI - Outcome Relationship](#9)
    * [DiabetesPedigreeFunction - Outcome Relationship](#10)
    * [Age - Outcome Relationship](#11)
4. [Outlier Detection](#12)
5. [Missing Value](#13)
6. [Feature Engineering](#14)
    * [Normalization and Data Split](#15)
7. [Modeling](#16)
    * [Decision Tree Classifier](#17)
    * [Random Forest Classifier](#18)
    * [Logistic Regression](#19)
    * [Support Vector Machine Algorithm](#20)
    * [Naive Bayes Algorithm](#21)


<a id = "1"></a><br>
# Load And Check Data

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.graph_objs as go
import plotly.express as px



import matplotlib.pyplot as plt
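# "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in Matplotlib 3.6; use whichever is available
style = "seaborn-v0_8-whitegrid" if "seaborn-v0_8-whitegrid" in plt.style.available else "seaborn-whitegrid"
plt.style.use(style)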

import seaborn as sns

from collections import Counter

import warnings
warnings.filterwarnings("ignore")

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [1]:
data = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")

In [1]:
data.head()

In [1]:
data.describe()

<a id = "2"></a><br>

# Variable Description

In [1]:
data.info()

### What do the columns in the data set mean?

1. Pregnancies: Number of times pregnant

1. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

1. BloodPressure: Diastolic blood pressure (mm Hg)

1. SkinThickness: Triceps skin fold thickness (mm)

1. Insulin: 2-hour serum insulin (mu U/ml)

1. BMI: Body mass index (weight in kg/(height in m)^2)

1. DiabetesPedigreeFunction: Diabetes pedigree function (a score based on family history of diabetes)

1. Age: Age (years)

1. Outcome: Class variable (0 or 1); 268 of the 768 instances are 1, the others are 0

* The Outcome column is the target variable: it indicates whether the patient has diabetes (1) or not (0).
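The class counts quoted above imply an imbalanced target. The sketch below reproduces that arithmetic on a stand-in series built from those counts (it does not read the CSV); the variable names are illustrative assumptions.

```python
import pandas as pd

# Stand-in for data["Outcome"]: 268 positive and 500 negative labels,
# matching the counts quoted in the variable description.
outcome = pd.Series([1] * 268 + [0] * 500, name="Outcome")

counts = outcome.value_counts()
positive_rate = outcome.mean()  # the mean of a 0/1 column is the share of ones

print(counts)
print(f"Positive rate: {positive_rate:.3f}")
print(f"Majority-class baseline accuracy: {1 - positive_rate:.3f}")
```

A classifier that always predicts 0 would already reach the majority-class baseline, so model accuracies are best read against that number.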

## Univariate Variable Analysis


* There is no categorical data in the data set.

In [1]:
var_val = data.Pregnancies.values
var_ind = data.Pregnancies.index

var_data = pd.DataFrame({ "Index" : var_ind , "Values" : var_val})
var_data

trace1 =go.Scatter(
    x = var_data["Index"],
    y = var_data["Values"],
    name = "Pregnancies")

İlişki_data =[trace1]

layout = go.Layout(
    dict(title = 'Variation of the "Pregnancies" Feature Across Individuals',
              xaxis= dict(title= 'Person Index',ticklen= 5,zeroline= False)) )


fig =go.Figure(data=İlişki_data , layout=layout)
fig

In [1]:
var_val = data.Glucose.values
var_ind = data.Glucose.index

var_data = pd.DataFrame({ "Index" : var_ind , "Values" : var_val})
var_data

trace1 =go.Scatter(
    x = var_data["Index"],
    y = var_data["Values"],
    name = "Glucose")

İlişki_data =[trace1]

layout = go.Layout(
    dict(title = 'Variation of the "Glucose" Feature Across Individuals',
              xaxis= dict(title= 'Person Index',ticklen= 5,zeroline= False)) )


fig =go.Figure(data=İlişki_data , layout=layout)
fig

In [1]:
var_val = data.BloodPressure.values
var_ind = data.BloodPressure.index

var_data = pd.DataFrame({ "Index" : var_ind , "Values" : var_val})
var_data

trace1 =go.Scatter(
    x = var_data["Index"],
    y = var_data["Values"],
    name = "BloodPressure")

İlişki_data =[trace1]

layout = go.Layout(
    dict(title = 'Variation of the "BloodPressure" Feature Across Individuals',
              xaxis= dict(title= 'Person Index',ticklen= 5,zeroline= False)) )


fig =go.Figure(data=İlişki_data , layout=layout)
fig

In [1]:
var_val = data.SkinThickness.values
var_ind = data.SkinThickness.index

var_data = pd.DataFrame({ "Index" : var_ind , "Values" : var_val})
var_data

trace1 =go.Scatter(
    x = var_data["Index"],
    y = var_data["Values"],
    name = "SkinThickness")

İlişki_data =[trace1]

layout = go.Layout(
    dict(title = 'Variation of the "SkinThickness" Feature Across Individuals',
              xaxis= dict(title= 'Person Index',ticklen= 5,zeroline= False)) )


fig =go.Figure(data=İlişki_data , layout=layout)
fig

In [1]:
var_val = data.Insulin.values
var_ind = data.Insulin.index

var_data = pd.DataFrame({ "Index" : var_ind , "Values" : var_val})
var_data

trace1 =go.Scatter(
    x = var_data["Index"],
    y = var_data["Values"],
    name = "Insulin")

İlişki_data =[trace1]

layout = go.Layout(
    dict(title = 'Variation of the "Insulin" Feature Across Individuals',
              xaxis= dict(title= 'Person Index',ticklen= 5,zeroline= False)) )


fig =go.Figure(data=İlişki_data , layout=layout)
fig

In [1]:
var_val = data.BMI.values
var_ind = data.BMI.index

var_data = pd.DataFrame({ "Index" : var_ind , "Values" : var_val})
var_data

trace1 =go.Scatter(
    x = var_data["Index"],
    y = var_data["Values"],
    name = "BMI")

İlişki_data =[trace1]

layout = go.Layout(
    dict(title = 'Variation of the "BMI" Feature Across Individuals',
              xaxis= dict(title= 'Person Index',ticklen= 5,zeroline= False)) )


fig =go.Figure(data=İlişki_data , layout=layout)
fig

In [1]:
var_val = data.DiabetesPedigreeFunction.values
var_ind = data.DiabetesPedigreeFunction.index

var_data = pd.DataFrame({ "Index" : var_ind , "Values" : var_val})
var_data

trace1 =go.Scatter(
    x = var_data["Index"],
    y = var_data["Values"],
    name = "DiabetesPedigreeFunction")

İlişki_data =[trace1]

layout = go.Layout(
    dict(title = 'Variation of the "DiabetesPedigreeFunction" Feature Across Individuals',
              xaxis= dict(title= 'Person Index',ticklen= 5,zeroline= False)) )


fig =go.Figure(data=İlişki_data , layout=layout)
fig

In [1]:
var_val = data.Age.values
var_ind = data.Age.index

var_data = pd.DataFrame({ "Index" : var_ind , "Values" : var_val})
var_data

trace1 =go.Scatter(
    x = var_data["Index"],
    y = var_data["Values"],
    name = "Age")

İlişki_data =[trace1]

layout = go.Layout(
    dict(title = 'Variation of the "Age" Feature Across Individuals',
              xaxis= dict(title= 'Person Index',ticklen= 5,zeroline= False)) )


fig =go.Figure(data=İlişki_data , layout=layout)
fig

<a id = "3"></a><br>

# Basic Data Analysis - Visualization

<a id = "4"></a><br>

### Pregnancies - Outcome Relationship

In [1]:
a = data[["Pregnancies","Outcome"]].groupby(["Pregnancies"], as_index = False).mean().sort_values(by="Outcome",ascending = False)
a

In [1]:
fig = px.bar(a , x = a.Pregnancies , y = a.Outcome)
fig.show()

* According to the table and bar chart above, the proportion of patients with diabetes tends to increase as the number of pregnancies increases.

<a id = "5"></a><br>

### Glucose - Outcome Relationship

In [1]:
x = data[["Glucose" , "Outcome"]].groupby(["Glucose"], as_index = False).mean().sort_values(by = "Glucose" , ascending = True)

x

In [1]:
fig = px.bar(x , x = x.Glucose , y = x.Outcome)
fig.show()

* According to the graphs, as the amount of glucose in the blood increases, the probability of having diabetes increases.
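Grouping by every distinct value of a continuous feature leaves very few patients per bar, so the per-value rates can be noisy. One possible alternative, sketched here on a hypothetical toy frame (the values and bin edges are assumptions, not taken from the dataset), is to bin the feature with `pd.cut` before averaging `Outcome`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Glucose": [85, 90, 110, 120, 135, 150, 165, 180],
    "Outcome": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Fixed bin edges chosen only for this example; intervals are right-inclusive.
bins = [0, 100, 140, 200]
toy["GlucoseBin"] = pd.cut(toy["Glucose"], bins=bins)

# The mean of the 0/1 Outcome column per bin is the share of diabetic patients in that bin.
rate_by_bin = toy.groupby("GlucoseBin", observed=True)["Outcome"].mean()
print(rate_by_bin)
```

The same pattern could be applied to the other continuous columns such as BMI or DiabetesPedigreeFunction.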

<a id = "6"></a><br>

### BloodPressure - Outcome Relationship

In [1]:
y = data[["BloodPressure" , "Outcome"]].groupby(["BloodPressure"] , as_index = False).mean().sort_values(by = "BloodPressure" , ascending = True)

y

In [1]:
fig = px.bar(y , x = y.BloodPressure , y = y.Outcome)
fig.show()

<a id = "7"></a><br>

### SkinThickness - Outcome Relationship

In [1]:
z = data[["SkinThickness" , "Outcome"]].groupby(["SkinThickness"] , as_index = False).mean().sort_values(by = "SkinThickness" , ascending = True)
z

In [1]:
fig = px.bar(z , x = z.SkinThickness , y = z.Outcome)
fig.show()

<a id = "8"></a><br>

### Insulin - Outcome Relationship

In [1]:
f = data[["Insulin" , "Outcome"]].groupby(["Insulin"] , as_index = False).mean().sort_values(by = "Insulin" , ascending = True)
f

In [1]:
fig = px.bar(f , x = f.Insulin , y = f.Outcome)
fig.show()

<a id = "9"></a><br>

### BMI - Outcome Relationship

In [1]:
g = data[["BMI" , "Outcome"]].groupby(["BMI"] , as_index = False).mean().sort_values(by = "BMI" , ascending = True)
g

In [1]:
fig = px.bar(g , x = g.BMI , y = g.Outcome)
fig.show()

<a id = "10"></a><br>

### DiabetesPedigreeFunction - Outcome Relationship

In [1]:
c = data[["DiabetesPedigreeFunction" , "Outcome"]].groupby(["DiabetesPedigreeFunction"] , as_index = False).mean().sort_values(by = "DiabetesPedigreeFunction" , ascending = True)
c

In [1]:
fig = px.bar(c , x = c.DiabetesPedigreeFunction , y = c.Outcome)
fig.show()

<a id = "11"></a><br>

### Age - Outcome Relationship

In [1]:
b = data[["Age" , "Outcome"]].groupby(["Age"] , as_index = False).mean().sort_values(by = "Age" , ascending = True)
b

In [1]:
fig = px.bar(b , x = b.Age , y = b.Outcome)
fig.show()

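* As can be seen, the proportion of patients with diabetes is lowest among the youngest patients and generally higher in the older age groups, so the overall relationship between age and diabetes is positive. The bars for the oldest ages are based on only a few patients and are therefore noisy.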

In [1]:
sns.heatmap(data.corr() , annot = True)
plt.show()

* There is a relatively strong positive correlation (about 0.54) between the number of pregnancies and age.

* There is a positive correlation of 0.47 between the blood glucose level and diabetes status (Outcome).

* "Pregnancies", "BMI", "DiabetesPedigreeFunction" and "Age" are also positively correlated with diabetes status.
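To read the feature-to-target correlations without scanning the whole heatmap, the `Outcome` column of the correlation matrix can be sorted. This is a minimal sketch on synthetic data (the `toy` frame and its columns are assumptions made for illustration); on the real data the same expression would be applied to `data`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for `data`: "Glucose" drives the label, "Noise" does not.
glucose = rng.normal(120, 30, n)
noise = rng.normal(0, 1, n)
outcome = (glucose + rng.normal(0, 30, n) > 140).astype(int)
toy = pd.DataFrame({"Glucose": glucose, "Noise": noise, "Outcome": outcome})

# Pearson correlation of each feature with the label, strongest first.
corr_with_outcome = (
    toy.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
)
print(corr_with_outcome)
```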

<a id = "12"></a><br>

## Outlier Detection

In [1]:
def outlier(df,features):
    outlier_indices = []
    
    for i in features:
        Q1 = np.percentile(df[i] , 25)
        Q3 = np.percentile(df[i] , 75)
        IQR = Q3-Q1
        outlier_step = IQR*1.5
        
        outlier_list_col = df[(df[i] < Q1 - outlier_step) | (df[i] > Q3 + outlier_step)].index
        outlier_indices.extend(outlier_list_col)
        
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(i for i, v in outlier_indices.items() if v > 2)
    
    return multiple_outliers

In [1]:
data.loc[outlier(data,["Pregnancies" , "Glucose" , "BloodPressure" , "SkinThickness" , "Insulin" , "BMI" , "DiabetesPedigreeFunction" , "Age"])]
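The function above applies the 1.5 × IQR rule to each feature and then keeps only rows flagged in more than two features. The rule itself can be checked on a single column; the numbers below are made up for illustration.

```python
import numpy as np

# Made-up sample with one obviously extreme value.
values = np.array([10, 12, 13, 14, 15, 16, 18, 95])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# A value is an outlier if it falls outside [lower, upper].
is_outlier = (values < lower) | (values > upper)
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=({lower}, {upper})")
print(values[is_outlier])
```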


<a id = "13"></a><br>
## Missing Value

In [1]:
data.isnull().sum()

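* `isnull()` finds no missing (NaN) values in the data set. Note, however, that zeros in columns such as Glucose, BloodPressure, SkinThickness, Insulin and BMI are not physiologically plausible and most likely stand for missing measurements.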
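`isnull()` only counts NaN entries, so measurements recorded as 0 go unnoticed. A possible way to surface them, sketched on a hypothetical toy frame (the values and the choice of columns are assumptions), is to count zeros and convert them to NaN in the columns where zero is not a plausible reading.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "Insulin": [0, 0, 94],
    "Pregnancies": [6, 0, 8],  # zero is a legitimate value here
})

# Columns where a zero is treated as "not measured" (an assumption).
zero_as_missing = ["Glucose", "BloodPressure", "Insulin"]

print((toy[zero_as_missing] == 0).sum())

cleaned = toy.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isnull().sum())
```

How to fill or drop the resulting NaN values is a separate modeling decision that this sketch does not make.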


<a id = "14"></a><br>
## Feature Engineering

<a id = "15"></a><br>

### Normalization and Data Split

In [1]:
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV


In [1]:
x = data.drop("Outcome" , axis = 1 )
y = data.Outcome

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.33, random_state = 42)

In [1]:
print("X_train: {}".format(X_train.shape))
print("X_test: {}".format(X_test.shape))
print("y_train: {}".format(y_train.shape))
print("y_test: {}".format(y_test.shape))
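Because the two classes are imbalanced, a plain random split can give the train and test sets slightly different positive rates. The sketch below shows the `stratify` option of `train_test_split` on synthetic labels with the same class counts as the dataset; the placeholder feature matrix is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the dataset's class counts (500 zeros, 268 ones).
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# With stratify, both parts keep roughly the same share of positive labels.
print(len(y_tr), len(y_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```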

In [1]:
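# Min-max normalization per column; DataFrame.min()/max() keep the statistics column-wise.
# Note: the test set is scaled with its own min/max here, not the training-set values.
X_train = (X_train - X_train.min()) / (X_train.max() - X_train.min())
X_test = (X_test - X_test.min()) / (X_test.max() - X_test.min())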
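A common alternative is to learn the scaling parameters from the training set only and reuse them for the test set, so that both are mapped with the same transform. The following is a minimal sketch with scikit-learn's `MinMaxScaler` on made-up arrays, not the notebook's actual data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test rows with two features.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min/max from the training rows only
test_scaled = scaler.transform(test)        # reuses the training min/max

print(train_scaled)
print(test_scaled)
```

Test values outside the training range map outside [0, 1], which is expected with this approach.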


<a id = "16"></a><br>

## Modeling

In [1]:
# Classifiers

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier


# Evaluation metrics

from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

# Cross-validation and hyperparameter search

from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score


<a id = "17"></a><br>

### Decision Tree Classifier

In [1]:
dc = DecisionTreeClassifier()
dc.fit(X_train , y_train)
print("Score:" , dc.score(X_test , y_test) )
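Accuracy alone hides how the errors are split between the two classes. As a sketch of how the already imported `confusion_matrix` and `classification_report` could be used, the example below trains a decision tree on synthetic data from `make_classification` (an assumption, standing in for the diabetes features).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced two-class data with 8 features (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.65, 0.35], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = tree.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
print(cm)
print(classification_report(y_te, y_pred))
```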

<a id = "18"></a><br>

### Random Forest Classifier

In [1]:
score = []
for i in range(1,50):
    rf = RandomForestClassifier(n_estimators = i , random_state = 42)
    rf.fit(X_train , y_train)
    score.append(rf.score(X_test , y_test))
    
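plt.plot(range(1, 50), score)  # x-axis shows the actual n_estimators value
plt.xlabel("n_estimators")
plt.ylabel("Test accuracy")
plt.show()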

In [1]:
rf = RandomForestClassifier(n_estimators = 33 , random_state = 42)
rf.fit(X_train , y_train)
print("Random Forest Score: {}".format(rf.score(X_test ,y_test)))
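Choosing `n_estimators` by its score on the test set makes the reported test accuracy optimistic, because the test data has influenced the model choice. One possible alternative, sketched here on synthetic data (the candidate values and data are assumptions), is to compare settings with cross-validation on the training set only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic training data standing in for X_train / y_train.
X_tr, y_tr = make_classification(n_samples=300, n_features=8, random_state=0)

cv_scores = {}
for n in (5, 20, 50):
    rf_demo = RandomForestClassifier(n_estimators=n, random_state=42)
    # Mean accuracy over 5 folds of the training data; the test set is never touched.
    cv_scores[n] = cross_val_score(rf_demo, X_tr, y_tr, cv=5, scoring="accuracy").mean()

best_n = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("Best n_estimators by CV:", best_n)
```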

<a id = "19"></a><br>

### Logistic Regression

In [1]:
model = LogisticRegression()
solvers = ['newton-cg', 'lbfgs', 'liblinear']
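# newton-cg and lbfgs do not support the l1 penalty; with error_score=0 those
# combinations fail to fit and are scored as 0 instead of aborting the search
penalty = ['l1' , 'l2']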
c_values = [100, 10, 1.0, 0.1, 0.01]

grid = dict(solver=solvers,penalty=penalty,C=c_values)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_train, y_train)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

### K-Nearest Neighbors Classifier

In [1]:
model = KNeighborsClassifier()
n_neighbors = range(1, 21, 2)
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']

grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_train, y_train)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

<a id = "20"></a><br>

### Support Vector Machine Algorithm

In [1]:
model = SVC()
kernel = ['poly', 'rbf', 'sigmoid']
C = [50, 10, 1.0, 0.1, 0.01]
gamma = ['scale']
# define grid search
grid = dict(kernel=kernel,C=C,gamma=gamma)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
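The grid searches print cross-validated accuracy on the training data, which is not the same quantity as the test-set scores reported for the tree-based models and Naive Bayes. A minimal sketch of scoring the tuned model on held-out data follows; it uses synthetic data and a reduced grid, both of which are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for the scaled diabetes features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

search = GridSearchCV(
    SVC(gamma="scale"),
    {"kernel": ["rbf", "poly"], "C": [0.1, 1.0, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_          # mean CV accuracy on the training data
test_accuracy = search.score(X_te, y_te)  # accuracy of the refit best model on unseen data

print(search.best_params_)
print(f"CV accuracy: {cv_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```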

<a id = "21"></a><br>

### Naive Bayes Algorithm

In [1]:
nb = GaussianNB()
nb.fit(X_train , y_train)

print("Accuracy of Naive Bayes Algorithm:" , nb.score(X_test , y_test))

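* According to these results, the Random Forest Classifier gives the best result, with an accuracy of 77.95% on the test set. Note that the grid-search scores printed for Logistic Regression, K-Nearest Neighbors and SVM are cross-validated accuracies on the training set, so they are not directly comparable with the test-set scores.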
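To compare models on equal terms, each one can be fitted on the same training split and scored on the same test split. The sketch below does this for three of the classifiers on synthetic data; the data, the model settings and the `results` name are assumptions, and the printed numbers are not the notebook's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the diabetes features and labels.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

# Fit every model on the same split and collect its test accuracy.
results = pd.Series(
    {name: model.fit(X_tr, y_tr).score(X_te, y_te) for name, model in models.items()}
).sort_values(ascending=False)
print(results)
```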