![diabetes%20data.png](attachment:diabetes%20data.png)

One of the visualizations from the dataset.


# Context


This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

# Content 


The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

# Goal 

Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes?



Importing and reading the required files

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')
df.head()

In [None]:
import matplotlib.pyplot as plt 
import seaborn as sns 
sns.set(style='darkgrid')

## First we need to do the feature engineering and data visualization 



In [None]:
sns.heatmap(df.corr())

Conclusions from the correlation matrix:

1. Glucose level shows the strongest correlation with whether the person is diabetic or not.
2. After that, BMI, Pregnancies and Age show the next strongest correlation.
3. Third come Insulin and DiabetesPedigreeFunction.
4. Finally, all the other factors show a weak correlation as well.
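
To read such a ranking off as numbers rather than from heatmap colours, one option is to sort the `Outcome` column of the correlation matrix. The sketch below runs on a small made-up frame (`toy` is an assumption for illustration only, not the real data); with the notebook's data you would use `df` in its place.

```python
import pandas as pd

# Tiny made-up frame for illustration only; NOT the Pima data.
toy = pd.DataFrame({
    "Glucose": [85, 90, 150, 160, 100, 170],
    "BMI":     [22.0, 30.0, 28.0, 35.0, 24.0, 31.0],
    "Age":     [25, 40, 30, 50, 35, 45],
    "Outcome": [0, 0, 1, 1, 0, 1],
})

# Pearson correlation of every feature with the target, strongest first.
ranking = (
    toy.corr()["Outcome"]
       .drop("Outcome")
       .sort_values(ascending=False)
)
print(ranking)
```

Keep in mind that a correlation coefficient only measures linear association, so a low value does not prove that a feature is useless to the model.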

Let's analyse the dataset a little more. If you are not interested, you can skip the data visualization and exploratory data analysis that follow and jump directly to the 'ANN with Pytorch' section.

In [None]:
df.isnull().sum()

So there are no NaN values in the dataset. Note that `isnull()` would not flag a missing measurement that was recorded as 0.


In [None]:
df['Outcome'].value_counts()

From the above observation we can say that the dataset is not severely imbalanced, as neither outcome dominates the dataset completely.
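
To put a number on the balance, it can help to look at class shares rather than raw counts. This is a minimal sketch on made-up labels (the `outcome` series is an assumption for illustration only); with the notebook's data you would pass `df['Outcome']`.

```python
import pandas as pd

# Made-up labels for illustration only; NOT the real Outcome column.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name="Outcome")

counts = outcome.value_counts()
shares = outcome.value_counts(normalize=True)

# Ratio of the larger class to the smaller one; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print("majority/minority ratio:", imbalance_ratio)
```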

In [None]:
Glucose_mean = df.groupby('Outcome').Glucose.mean()
Glucose_min = df.groupby('Outcome').Glucose.min()
Glucose_max = df.groupby('Outcome').Glucose.max()
print('Mean value of glucose of people affected and not affected with diabetes', Glucose_mean)
print('Minimum value of glucose of people affected and not affected with diabetes', Glucose_min)
print('Maximum value of glucose of people affected and not affected with diabetes', Glucose_max)

From the above data we can conclude:
1. People with a higher glucose level are more likely to have diabetes, as the group means suggest.
2. The minimum glucose value is 0, including among people with diabetes. A glucose level of 0 is not physiologically plausible, so this most likely marks a missing measurement rather than a real reading.
3. Someone with a very high glucose level, around 197, may still not have diabetes.
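
A glucose reading of 0 is not physiologically plausible, so zeros in such columns are probably placeholders for missing measurements. The sketch below counts them and marks them as NaN. It runs on made-up rows, and the choice of `suspect_cols` is an assumption that you should check against the real data before applying it to `df`.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; NOT the real data.
toy = pd.DataFrame({
    "Glucose":       [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "BMI":           [33.6, 26.6, 0.0, 28.1],
    "Pregnancies":   [6, 1, 0, 0],   # zero is a valid value here
})

# Assumption: a zero in these columns is a missing measurement, not a real reading.
suspect_cols = ["Glucose", "BloodPressure", "BMI"]

# How many zeros does each suspect column contain?
zero_counts = (toy[suspect_cols] == 0).sum()
print(zero_counts)

# Mark the placeholders as missing so isnull() and mean() treat them correctly.
cleaned = toy.copy()
cleaned[suspect_cols] = cleaned[suspect_cols].replace(0, np.nan)
print(cleaned.isnull().sum())
```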

In [None]:
sns.scatterplot(x = 'Glucose',y = 'Insulin', hue = 'Outcome', data=df)
plt.title('Relation between Insulin and Glucose and how it affects diabetes')

In [None]:
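# distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(df['Insulin'], bins=8, kde=True)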
plt.title('Distribution of Insulin column in the dataset')

In [None]:
# recent seaborn releases require x and hue to be passed as keywords
sns.countplot(x='Pregnancies', hue='Outcome', data=df)

In [None]:
plt.figure(figsize=(10,6))
plt.boxplot([df['Age'], df['BMI'], df['BloodPressure'], df['Glucose']], vert=False)
plt.yticks([1, 2, 3, 4], ['Age', 'BMI', 'BloodPressure', 'Glucose'])
plt.xlabel('Value')
plt.title("Box Plot")

It is clear from the box plot that there are many outliers. We will see whether they affect the prediction or not
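
The points a box plot draws as outliers can also be counted directly. With matplotlib's default settings, a point is drawn as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. This sketch applies that rule to made-up values (`values` is an assumption for illustration only); with the notebook's data you would pass a column such as `df['BMI']`.

```python
import pandas as pd

# Made-up values for illustration only; NOT a real column.
values = pd.Series([22, 24, 25, 26, 27, 28, 29, 30, 31, 67], name="BMI")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the first and third quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)
```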

In [None]:
sns.scatterplot(x=df['Glucose'], y=df['BMI'], hue=df['Outcome'])
plt.title('Relation between glucose level and BMI and how it affects diabetes')

## ANN with Pytorch 

- PyTorch is a library that uses tensors to build neural networks. A tensor in PyTorch is very similar to a NumPy array, but it can also run on a GPU.
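
As a minimal illustration of that similarity, the sketch below turns a NumPy array into a tensor and moves it to a GPU if one is available. The array `arr` is made up for the example.

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)

# from_numpy shares memory with the array; no copy is made.
t = torch.from_numpy(arr)

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_dev = t.to(device)

# Tensor operations mirror the NumPy ones.
print((t_dev * 2).sum().item())
print(float((arr * 2).sum()))
```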


The neural network that I am going to create will have three hidden layers, followed by an output layer:
- hidden layer 1 - 20 neurons
- hidden layer 2 - 10 neurons
- hidden layer 3 - 5 neurons
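
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch assumes 8 input features and 2 output classes, which are the defaults of the `ANN_Model` class defined further down.

```python
# Layer widths: 8 input features, three hidden layers, 2 output classes.
layer_sizes = [8, 20, 10, 5, 2]

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    # A fully connected layer has n_in * n_out weights plus n_out biases.
    n_params = n_in * n_out + n_out
    print(f"{n_in:>2} -> {n_out:<2}: {n_params} parameters")
    total += n_params

print("total trainable parameters:", total)
```

Once the model exists, the same total should be reported by `sum(p.numel() for p in model.parameters())`.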

In [None]:
# Taking all the independent variables (features) in X and the dependent variable (target) in y

X=df.drop('Outcome',axis=1).values
y=df['Outcome'].values

In [None]:
# Making an 80:20 train-test split

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)

X_train.shape
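
The split above is purely random. An optional variant, not used in this notebook, is a stratified split, which keeps the share of each class the same in the training and test sets. The sketch below shows it on made-up data (`X_demo` and `y_demo` are assumptions for illustration only).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data for illustration only: 100 samples, 30% positive.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo preserves the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```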

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

The first step towards creating the model is to create a tensor out of the pandas dataframe or numpy array.

In [None]:
##### Creating Tensors
X_train=torch.FloatTensor(X_train)
X_test=torch.FloatTensor(X_test)
y_train=torch.LongTensor(y_train)
y_test=torch.LongTensor(y_test)

`f_connected` in this case means fully connected layer.

In [None]:
class ANN_Model(nn.Module):
    def __init__(self,input_features=8,hidden1=20,hidden2=10,hidden3= 5, out_features=2):
        super().__init__()
        self.f_connected1=nn.Linear(input_features,hidden1)
        self.f_connected2=nn.Linear(hidden1,hidden2)
        self.f_connected3=nn.Linear(hidden2,hidden3)
        self.out=nn.Linear(hidden3,out_features)
    def forward(self,x):
        x=F.relu(self.f_connected1(x))
        x=F.relu(self.f_connected2(x))
        x=F.relu(self.f_connected3(x))
        x=self.out(x)
        return x

In [None]:
torch.manual_seed(20)
model=ANN_Model()
# torch.manual_seed() seeds PyTorch's random number generator, so the initial weights are the same on every rerun

In [None]:
model.parameters

In [None]:
### Backpropagation: define the loss function and the optimizer
loss_function=nn.CrossEntropyLoss()
optimizer=torch.optim.Adam(model.parameters(),lr=0.01)
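
`nn.CrossEntropyLoss` expects raw, unnormalised scores (logits) and integer class labels, which is why the model has two outputs with no final softmax and why the targets are `LongTensor`s. The NumPy sketch below is an illustration of the computation, not PyTorch's actual implementation: a log-softmax, followed by the mean negative log-probability of the correct class. The logits and targets are made up.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy of raw logits (n, classes) against integer labels (n,)."""
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to the correct class of each sample.
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()

# Made-up logits for three samples and two classes.
logits = np.array([[2.0, 0.5], [0.1, 1.2], [0.0, 0.0]])
targets = np.array([0, 1, 0])

loss = cross_entropy(logits, targets)
print(loss)
```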

In [None]:
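epochs=500
final_losses=[]
for i in range(epochs):
    i=i+1
    # calling the model directly runs forward() and is the idiomatic form
    y_pred=model(X_train)
    loss=loss_function(y_pred,y_train)
    # store a plain float; appending the tensor would keep its graph alive and break plotting
    final_losses.append(loss.item())
    if i%10==1:
        print("Epoch number: {} and the loss : {}".format(i,loss.item()))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()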

In [None]:
### plot the loss function
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.plot(range(epochs),final_losses)
plt.ylabel('Loss')
plt.xlabel('Epoch')

In [None]:
predictions=[]
with torch.no_grad():
    for i,data in enumerate(X_test):
        y_pred=model(data)
        predictions.append(y_pred.argmax().item())

In [None]:
from sklearn.metrics import accuracy_score
score=accuracy_score(y_test,predictions)
score
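
Accuracy alone does not show which kind of mistake the model makes. A confusion matrix separates false positives from false negatives, which matters when the two classes are not equally frequent. The sketch below uses made-up labels (`y_true` and `y_hat` are assumptions for illustration only); with the notebook's data you would pass `y_test` and `predictions`.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Made-up labels for illustration only; NOT the model's real predictions.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat  = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(classification_report(y_true, y_hat))
```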