# Introduction
Welcome to my kernel. In this kernel I'll examine the dataset and then train a neural network on it. Before starting, let's take a look at the content.

# Content
1. Importing Libraries and The Data
1. Data Overview
1. Simple Data Analyses
    * Examining Pregnancies Feature
    * Examining Glucose Feature
    * Examining Blood Pressure Feature
    * Examining Skin Thickness Feature
    * Examining Insulin Feature
    * Examining BMI Feature
    * Examining DiabetesPedigreeFunction Feature
    * Examining Age Feature
    * Examining Outcome Feature
1. Outlier Detection
    * Defining Function
    * Dropping Outliers
1. Detailed Data Analyses
    * Correlation Heatmap
    * Glucose - Outcome
    * BMI - Outcome
    * Age - Outcome
1. Preprocessing
    * Preparing Pregnancies Feature
    * Normalization
    * Train Test Split
1. Modeling
1. Predicting
1. Conclusion

# Importing Libraries and The Dataset

In this section I will import the libraries and the dataset. I am not going to import the deep learning libraries yet; I will import them when I need them.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns
import warnings as wrn

wrn.filterwarnings('ignore')
sns.set_style("whitegrid")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


* And now I'll import the data.

In [None]:
data = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')


# Data Overview
In this section I will get a general idea about the dataset.

In [None]:
data.head()

* There are 9 columns in the dataset: 8 features and the Outcome label.

In [None]:
data.info()

* All of the columns are numerical. 7 of them are int and the other 2 (BMI and DiabetesPedigreeFunction) are float.
* There are no missing values.
* There are 768 rows in the dataset.

# Simple Data Analyses

In this section I will examine the distribution of each feature's values. To do this I am going to use distplots and count plots.

## Examining Pregnancies Feature

In [None]:
data["Pregnancies"].value_counts()

* Although there are 17 unique values, most of them are 0, 1 and 2.
* We can merge 11, 12, 13, 14, 15 and 17 into a single category.


In [None]:
fig,ax = plt.subplots(figsize=(10,7))
sns.countplot(x=data["Pregnancies"]) # Passing the column as x= also works on recent seaborn versions
plt.show()

## Examining Glucose Feature

In [None]:
data["Glucose"].head()

* As we can see, glucose data is not categorical, so we should use a distplot for examining it.

In [None]:
fig,ax = plt.subplots(figsize = (10,7))
sns.distplot(data["Glucose"],color="#FE5205")
plt.show()

* Most of the values are between 70 and 130. 

## Examining Blood Pressure Feature

* Let's start by taking another look at the feature.

In [None]:
data["BloodPressure"].head(10)

* This is not a categorical feature either.
* So let's use a distplot.

In [None]:
fig,ax = plt.subplots(figsize=(10,7))
sns.distplot(data["BloodPressure"],color="#00B037")
plt.show()

* An interesting chart. There are a number of 0 values in the dataset. However, values between 0 and 40 are very rare.
* Most of the values are between 40 and 100, especially between 60 and 80.


## Examining Skin Thickness Feature

In [None]:
data["SkinThickness"].head(10)

* Like Blood Pressure and Glucose, this feature is not categorical.

In [None]:
fig,ax = plt.subplots(figsize=(10,7))
sns.distplot(data["SkinThickness"],color="#C0F714")
plt.show()

* Most values are between 0 and 60, and 0 is the most frequent single value.

## Examining Insulin Feature

In [None]:
data["Insulin"].head(10)

* Judging by these rows, many of the values seem to be 0.

In [None]:
fig,ax = plt.subplots(figsize=(10,7))
sns.distplot(data["Insulin"],color="#077F8F")
plt.show()

* 0 is by far the most frequent value.
* The remaining values are less common and lie mostly between 0 and 400.
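
Because data.info() only counts NaN entries, zeros like these are not reported as missing. Below is a minimal sketch of how the zeros could be counted per column. The small DataFrame is made-up illustration data, not rows from the real dataset, and reading 0 as a placeholder for a missing measurement is an assumption, not something the dataset documents.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from diabetes.csv).
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 0, 64, 66],
    "SkinThickness": [35, 0, 0, 23],
    "Insulin": [0, 0, 0, 94],
})

# Count how many zeros each column contains.
zero_counts = (toy == 0).sum()
print(zero_counts)

# Share of zeros per column, as a fraction of all rows.
zero_share = zero_counts / len(toy)
print(zero_share)
```

On the real data the same check would be (data == 0).sum().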


## Examining BMI Feature


In [None]:
data["BMI"].head(10)

In [None]:
plt.subplots(figsize=(10,7))
sns.distplot(data["BMI"],color="#DB6A14")
plt.show()

* Not an interesting distplot
* Most of the values are between 20 and 50.

## Examining DiabetesPedigreeFunction Feature


In [None]:
data["DiabetesPedigreeFunction"].head()

In [None]:
fig,ax = plt.subplots(figsize=(10,7))
sns.distplot(data["DiabetesPedigreeFunction"],color="#8F105A")
plt.show()

* Although most of the values are between 0 and 1, there are values between 1 and 2.5.

## Examining Age Feature


In [None]:
data["Age"].head(10)

In [None]:
fig,ax = plt.subplots(figsize=(10,7))
sns.distplot(data["Age"],color="#DB620D")
plt.show()

* Most of the values are between 20 and 40.

## Examining Outcome Feature

* Outcome feature is our label.
* It is a categorical feature, so we can use a count plot.

In [None]:
data["Outcome"].value_counts()

In [None]:
fig,ax = plt.subplots(figsize=(10,7))
sns.countplot(x=data["Outcome"]) # Passing the column as x= also works on recent seaborn versions
plt.show()

* The data is unbalanced.
* Most of the values are 0.
* There are 500 rows with the value 0 and 268 rows with the value 1.
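
One consequence of this imbalance can be sketched with the counts reported above. A model that always predicts 0 would already be right for roughly 65% of the rows, so any accuracy score should be read against that baseline.

```python
# Class counts as reported by value_counts() above.
counts = {0: 500, 1: 268}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```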

# Outlier Detection
In this section I will drop the outliers, because outliers can distort training. I am going to find them with a handwritten function, so let's start by defining it.

## Defining Function

In [None]:
def outlier_dropper(dataset):
    check_index = []
    final_index = []
    for feature in dataset: # Each iteration is a different feature
        
        Q1 = dataset[feature].quantile(0.25) # Lower Quartile
        Q3 = dataset[feature].quantile(0.75) # Upper Quartile
        
        IQR = Q3-Q1
        # The fences below use 1 x IQR, which is stricter than the common 1.5 x IQR rule.
        
        # Use the function argument (dataset), not the global variable, to take the outliers' indexes.
        indexes = dataset[(dataset[feature]<Q1-IQR) | (dataset[feature]>Q3+IQR)].index.values
        
        for i in indexes:  
            check_index.append(i) # Appending each index into the check_index list.
    
    # Caution: removing items while looping over the same list skips some entries,
    # so a few repeated indexes can be missed by this loop.
    for i in check_index:        
        check_index.remove(i)
        if i in check_index: # If i still exists (If there are two outliers in the i index)
            final_index.append(i) # Append it.
    
    return np.unique(final_index)
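
For comparison, here is a minimal alternative sketch. For every row it counts how many features fall outside the IQR fences, then returns the rows that reach a threshold. The function name multi_outlier_index and its parameters k and min_features are hypothetical. Because it counts every repeated index, it is not guaranteed to return the same rows as outlier_dropper.

```python
from collections import Counter

import pandas as pd


def multi_outlier_index(dataset, k=1.0, min_features=2):
    """Return sorted row labels that are outliers in at least `min_features` columns.

    A value is an outlier when it lies below Q1 - k*IQR or above Q3 + k*IQR.
    k=1.5 is the common Tukey fence; k=1.0 is stricter.
    """
    hits = Counter()
    for feature in dataset.columns:
        q1 = dataset[feature].quantile(0.25)
        q3 = dataset[feature].quantile(0.75)
        iqr = q3 - q1
        mask = (dataset[feature] < q1 - k * iqr) | (dataset[feature] > q3 + k * iqr)
        # Add one hit for every row that is an outlier in this feature.
        hits.update(dataset.index[mask].tolist())
    return sorted(label for label, n in hits.items() if n >= min_features)


# Made-up data: row 4 is extreme in both columns, row 0 in only one.
toy = pd.DataFrame({
    "a": [100, 10, 11, 12, 90, 11, 10, 12],
    "b": [5, 6, 5, 6, 70, 5, 6, 5],
})
print(multi_outlier_index(toy))
```

The returned labels could be passed to data.drop in the same way as the output of outlier_dropper.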

* And now let's use our function.

## Dropping Outliers


In [None]:
indexes = outlier_dropper(data)
print(indexes)
print("------------------------------------------------------------------------------")
print(len(indexes))

* The function returns 47 rows that have outliers in more than one feature.

In [None]:
data.drop(indexes,inplace=True)

In [None]:
data.info()

* Now we have 721 entries.

# Detailed Data Analyses
In this section I am going to examine the correlations between the features. I am going to start with examining the correlation heatmap.

In [None]:
fig,ax = plt.subplots(figsize=(8,8))
sns.heatmap(data.corr(),annot=True,fmt=".2f",linewidths=1.5)
plt.show()

* Three features stand out by their correlation with the outcome. Glucose is clearly the strongest; Age and BMI are weaker.

* Glucose - Outcome (0.46)
* Age - Outcome (0.24)
* BMI - Outcome (0.29)
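
These values can also be read programmatically, not only off the heatmap. Below is a small sketch, on made-up numbers, not the real dataset, that ranks features by their correlation with the label.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Glucose": [90, 100, 120, 150, 160, 180],
    "Age":     [25, 40, 30, 35, 50, 45],
    "Noise":   [3, 1, 2, 2, 1, 3],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the label, strongest first.
ranking = (
    toy.corr()["Outcome"]
    .drop("Outcome")
    .sort_values(ascending=False)
)
print(ranking)
```

On the notebook's data the same expression would start from data.corr().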

Let's examine them using different plots.

## Glucose - Outcome

In [None]:
fig = plt.figure(figsize=(7,5))
fig.add_subplot(1,2,1)
sns.kdeplot(x=data["Glucose"],y=data["Outcome"]) # Recent seaborn versions need x= and y= keywords
fig.add_subplot(1,2,2)
sns.scatterplot(x=data["Glucose"],y=data["Outcome"])
plt.show()

* When the outcome is 1, glucose is mostly above 100.
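
The same comparison can be summarized numerically. Below is a minimal sketch with made-up values, not rows from the dataset, that compares the glucose distribution of the two classes.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Glucose": [85, 95, 110, 99, 140, 125, 168, 105],
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Per-class summary statistics of Glucose.
summary = toy.groupby("Outcome")["Glucose"].agg(["min", "median", "max"])
print(summary)

# Fraction of positive rows whose glucose exceeds 100.
positives = toy.loc[toy["Outcome"] == 1, "Glucose"]
share_above_100 = (positives > 100).mean()
print(share_above_100)
```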

## Age - Outcome

In [None]:
fig = plt.figure(figsize=(7,5))
fig.add_subplot(1,2,1)
sns.kdeplot(x=data["Outcome"],y=data["Age"]) # Recent seaborn versions need x= and y= keywords
fig.add_subplot(1,2,2)
sns.scatterplot(x=data["Outcome"],y=data["Age"])
plt.show()

## BMI - Outcome

In [None]:
fig = plt.figure(figsize=(7,5))
fig.add_subplot(1,2,1)
sns.kdeplot(x=data["BMI"],y=data["Outcome"]) # Recent seaborn versions need x= and y= keywords
fig.add_subplot(1,2,2)
sns.scatterplot(x=data["BMI"],y=data["Outcome"])
plt.show()

# Preprocessing
In this section I am going to prepare the dataset for modeling. I will follow these steps:

* Preparing Pregnancies Feature
    * Joining 11,12,13,14,15,17
    * One Hot Encoding
* Normalization
* Train Test Splitting

## Preparing Pregnancies Feature

### Joining 11,12,13,14,15,17 

* Before the merge, let's look at the countplot of the Pregnancies feature again.

In [None]:
fig,ax = plt.subplots(figsize=(10,7))
sns.countplot(x=data["Pregnancies"]) # Passing the column as x= also works on recent seaborn versions
plt.show()

In [None]:
# Merge the rare values 12, 13, 14, 15 and 17 into the category 11.
data["Pregnancies"] = data["Pregnancies"].replace([12,13,14,15,17],11)

* And now I will check countplot again.

In [None]:
fig,ax = plt.subplots(figsize=(10,7))
sns.countplot(x=data["Pregnancies"]) # Passing the column as x= also works on recent seaborn versions
plt.show()

* Okay, we are ready for one hot encoding.

### One Hot Encoding

In [None]:
data = pd.get_dummies(data,columns=["Pregnancies"])
data.head()
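
To make the effect of pd.get_dummies concrete, here is a small sketch on a made-up column. Every distinct value becomes its own indicator column, and each row has exactly one indicator set.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"Pregnancies": [0, 1, 11, 1], "Glucose": [148, 85, 183, 89]})

encoded = pd.get_dummies(toy, columns=["Pregnancies"])
print(encoded.columns.tolist())

# Each row has exactly one indicator column switched on.
indicator_cols = [c for c in encoded.columns if c.startswith("Pregnancies_")]
print(encoded[indicator_cols].sum(axis=1).tolist())
```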

## Normalization (Scaling)

And now I am going to normalize the data, because features on the same scale usually make training faster and more stable.

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0,1))

x = data.drop("Outcome",axis=1)
y = data.Outcome

x = scaler.fit_transform(x)

I've created x and y in this section because I don't want to normalize the label y.
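
For reference, min-max scaling maps each column to the range 0 to 1 with x' = (x - min) / (max - min). Below is a minimal NumPy sketch of that formula. The helper names fit_min_max and apply_min_max are hypothetical. This illustrates the arithmetic only; it is not the scikit-learn implementation. Keeping the two steps separate makes it possible to learn the minimum and maximum from the training rows only and reuse them on the test rows.

```python
import numpy as np


def fit_min_max(x):
    """Return the per-column minimum and range of a 2-D float array."""
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # Avoid division by zero for constant columns.
    col_range[col_range == 0] = 1.0
    return col_min, col_range


def apply_min_max(x, col_min, col_range):
    """Scale columns with previously fitted statistics."""
    return (x - col_min) / col_range


# Made-up values for illustration only.
train = np.array([[50.0, 20.0], [100.0, 30.0], [150.0, 40.0]])
test = np.array([[125.0, 25.0]])

col_min, col_range = fit_min_max(train)
print(apply_min_max(train, col_min, col_range))
print(apply_min_max(test, col_min, col_range))
```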

In [None]:
print("Shape of x",x.shape)
y = y.values
print("Shape of y",y.shape)

In [None]:
y = y.reshape(-1,1)

## Train Test Splitting
In this section I will split the data into train and test sets. To do this I will use train_test_split from the scikit-learn library.

In [None]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=1)

* Finally we are ready for modeling!

# Modeling
In this section I'll build the model using Keras library and after that I will fit it using our x_train and y_train.

In [None]:
from keras.layers import Dropout,Dense
from keras.models import Sequential

In [None]:
model = Sequential()
model.add(Dense(units=16,kernel_initializer="uniform",activation="tanh",input_dim=19)) # Layer 1
model.add(Dropout(0.25))

model.add(Dense(units=16,kernel_initializer="uniform",activation="tanh")) # Layer 2
model.add(Dropout(0.50))

model.add(Dense(units=32,kernel_initializer="uniform",activation="tanh")) # Layer 3
model.add(Dropout(0.50))

model.add(Dense(units=32,kernel_initializer="uniform",activation="tanh")) # Layer 4 
model.add(Dropout(0.50))

model.add(Dense(units=32,kernel_initializer="uniform",activation="tanh")) # Layer 5
model.add(Dropout(0.50))

model.add(Dense(units=32,kernel_initializer="uniform",activation="tanh")) # Layer 6
model.add(Dropout(0.50))

model.add(Dense(units=32,kernel_initializer="uniform",activation="tanh")) # Layer 7
model.add(Dropout(0.50))

model.add(Dense(units=1,kernel_initializer="uniform",activation="sigmoid")) # Output Layer
model.compile(optimizer="adam",loss="binary_crossentropy",metrics=["accuracy"])
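
As a sanity check on the size of this network, the number of trainable parameters of a dense layer is (inputs + 1) * units, where the +1 is the bias. The sketch below applies that formula to the layer sizes defined above; dropout layers add no parameters.

```python
# Layer widths as defined above: 19 inputs, then the units of each Dense layer.
layer_sizes = [19, 16, 16, 32, 32, 32, 32, 32, 1]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = (n_in + 1) * n_out  # weights plus one bias per unit
    print(f"{n_in:>2} -> {n_out:>2}: {params} parameters")
    total += params

print("Total trainable parameters:", total)
```

That is about 5,400 parameters for roughly 540 training rows (75% of 721).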

* Our model is ready, let's fit it using our train arrays.

In [None]:
model.fit(x_train,y_train,epochs=250)

# Predicting
In this section I will predict the values using our model and after that I will take a look at the confusion matrix and the score.

In [None]:
from sklearn.metrics import accuracy_score
y_head = (model.predict(x_test) > 0.5).astype("int32") # predict_classes was removed from recent Keras versions

print("The score is ",accuracy_score(y_test,y_head))

* Not bad but not good.

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test,y_head) # Separate name so the imported function is not overwritten

fig,ax = plt.subplots(figsize=(6,6))
sns.heatmap(cm,annot=True,fmt="d",cmap="Greens_r",linewidths=1.5)
plt.show()

* The model had difficulty predicting the 1 values.
* This is a predictable result because, as you will remember, there are fewer 1 values than 0 values in the dataset.
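
Because the classes are unbalanced, accuracy hides how the model does on the 1 class. Precision and recall for that class can be read off the confusion matrix. Below is a minimal sketch with hypothetical counts, not the results of this notebook. It follows scikit-learn's layout, where rows are true labels and columns are predictions.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = [
    [100, 15],  # true 0: 100 predicted 0, 15 predicted 1
    [30, 35],   # true 1: 30 predicted 0, 35 predicted 1
]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted 1s, how many are really 1
recall = tp / (tp + fn)     # of the real 1s, how many were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

In this made-up case the accuracy looks acceptable while nearly half of the real 1 values are missed, which is the pattern described above.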

# Conclusion

Thanks for your attention. If you have any questions, you can ask me in the comment section. I am waiting for your comments, questions and upvotes.
