### **1. Loading the data and imports**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import seaborn as sns

In [None]:
df= pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")

### **2. Exploratory Data Analysis**

In [None]:
df.info()

In [None]:
df.head()

**Let us start by looking at the count of people with and without Diabetes**

In [None]:
sns.countplot(x='Outcome', data= df)

**Now, let us look at how diabetes varies with age**

In [None]:
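# displot is a figure-level function and creates its own figure,
# so the size is set through height/aspect rather than plt.figure
sns.displot(data=df, x='Age', hue='Outcome', height=4, aspect=3)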

**Looking at the plot, the share of people with diabetes appears to increase with age**
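
A histogram makes this trend hard to read because older age groups contain fewer people overall. One way to check it is to compute the share of positive outcomes per age band. The sketch below is only an illustration: `outcome_rate_by_bin` is a hypothetical helper, the bin edges are arbitrary, and the toy rows are made up rather than taken from `diabetes.csv`.

```python
import pandas as pd

def outcome_rate_by_bin(frame, column, bins):
    """Share of rows with Outcome == 1 inside each bin of `column`."""
    binned = pd.cut(frame[column], bins=bins)
    return frame.groupby(binned, observed=False)["Outcome"].mean()

# Toy rows for illustration only (not values from the real dataset)
toy = pd.DataFrame({
    "Age": [22, 25, 28, 33, 37, 45, 52, 58],
    "Outcome": [0, 0, 1, 0, 1, 1, 1, 0],
})

rates = outcome_rate_by_bin(toy, "Age", bins=[20, 30, 40, 60])
print(rates)
```

On the real data, the same call would be `outcome_rate_by_bin(df, 'Age', bins=...)` with bin edges chosen to suit the age range.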


In [None]:
# Size the figure-level plot through height/aspect (plt.figure has no effect here)
sns.displot(data=df, x='Pregnancies', hue='Outcome', height=4, aspect=3)

In [None]:
df_corr= df.corr()

In [None]:
sns.heatmap(df_corr, annot=True, cmap='viridis')

In [None]:
# Reuse the correlation matrix computed above instead of recomputing it
df_corr['Outcome'].drop('Outcome').sort_values().plot(kind='bar')

**Understandably, the outcome shows its strongest correlations with Glucose and BMI**

**It is now time to see how different features affect the outcome**

In [None]:
sns.boxplot(x='Outcome', y='BloodPressure', data=df)

In [None]:
df.groupby('Outcome')['BloodPressure'].mean()

**On average, people with diabetes have a higher blood pressure**

In [None]:
df.columns

**Let us have a look at how skin thickness varies for people with and without diabetes**


In [None]:
sns.boxplot(x='Outcome', y='SkinThickness', data=df)

In [None]:
sns.pairplot(df)

**Inference from the pair plot:** the following pairs appear to have a roughly linear relationship:
1. BMI and Skin Thickness
2. Insulin and Glucose
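
A pair plot only suggests linearity by eye, so it can help to put a number on each pair with the Pearson correlation coefficient. The sketch below assumes the column names used elsewhere in this notebook; `pair_correlations` is a hypothetical helper and the toy values are invented for illustration.

```python
import pandas as pd

def pair_correlations(frame, pairs):
    """Pearson correlation for each (col_a, col_b) pair, returned as a dict."""
    return {(a, b): frame[a].corr(frame[b]) for a, b in pairs}

# Toy rows for illustration only (not values from the real dataset)
toy = pd.DataFrame({
    "BMI": [20.0, 25.0, 30.0, 35.0, 40.0],
    "SkinThickness": [10.0, 16.0, 22.0, 28.0, 34.0],  # exactly linear in BMI
    "Glucose": [90, 110, 130, 150, 170],
    "Insulin": [200, 50, 120, 30, 180],               # no clear pattern
})

result = pair_correlations(
    toy, [("BMI", "SkinThickness"), ("Insulin", "Glucose")]
)
for pair, value in result.items():
    print(pair, round(value, 3))
```

A coefficient close to 1 or -1 supports a linear relationship, while a value near 0 suggests the pattern seen in the plot is weak.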

### **3. Data Cleaning and PreProcessing**

Let us start this process by creating a series which shows us the count of null values

In [None]:
df.isnull().sum()

`isnull()` reports no null values, so we can proceed to model training.
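
One caveat: `isnull()` only detects `NaN`. In this dataset, several measurements contain zeros that are not physiologically plausible (a glucose level or BMI of 0, for example) and are commonly treated as missing readings. The sketch below shows one way to count and mark them. The column list is an assumption about which columns this applies to, `mark_zeros_as_missing` is a hypothetical helper, and the toy rows are invented.

```python
import numpy as np
import pandas as pd

# Assumption: a value of 0 is not a valid reading in these columns
ZERO_AS_MISSING = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

def mark_zeros_as_missing(frame, columns=ZERO_AS_MISSING):
    """Return a copy of `frame` with zeros in `columns` replaced by NaN."""
    cleaned = frame.copy()
    present = [c for c in columns if c in cleaned.columns]
    cleaned[present] = cleaned[present].replace(0, np.nan)
    return cleaned

# Toy rows for illustration only (not values from the real dataset)
toy = pd.DataFrame({
    "Glucose": [120, 0, 95],
    "BloodPressure": [70, 64, 0],
    "BMI": [28.1, 0.0, 31.4],
    "Outcome": [1, 0, 1],   # a 0 here is a real label and must be kept
})

zero_counts = (toy[["Glucose", "BloodPressure", "BMI"]] == 0).sum()
cleaned = mark_zeros_as_missing(toy)
print(zero_counts)
print(cleaned.isnull().sum())
```

Whether to impute these values (with the column median, for instance) or drop the rows is a modelling choice that this notebook does not make.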

## **4. Model Creation and Evaluation**

## **1. I am going to first use a Random Forest classifier here**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X= df.drop('Outcome', axis=1)
y=df['Outcome']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Fix the seed so the forest (and the report below) is reproducible
RC = RandomForestClassifier(n_estimators=100, random_state=42)

In [None]:
RC.fit(X_train, y_train)

In [None]:
pred= RC.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
print(classification_report(y_test, pred))

**We have obtained decent precision and recall scores, but the accuracy is quite low. Let's try an ANN.**

In [None]:
# Share of rows without diabetes (Outcome == 0)
(df['Outcome'] == 0).mean()
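
The fraction computed above is also the accuracy of a model that always predicts the majority class, which makes it a useful reference point when judging any classifier's accuracy. A minimal sketch of that baseline, using a hypothetical helper and made-up labels:

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(np.asarray(y_true), return_counts=True)
    return counts.max() / counts.sum()

# Toy labels for illustration only: 7 negatives and 3 positives
toy_labels = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]

baseline = majority_baseline_accuracy(toy_labels)
print(f"baseline accuracy: {baseline:.2f}")
```

A classifier whose test accuracy is not clearly above `majority_baseline_accuracy(y_test)` has learned little beyond the class imbalance.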

## **ANN**

**Before we start creating a Neural Network, let us first scale the data**

In [None]:
df.select_dtypes(['object']).columns

There are no object-typed columns here, so there is nothing to encode as categorical variables.

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler= MinMaxScaler()

In [None]:
X2= df.drop('Outcome', axis=1)
y2=df['Outcome']

In [None]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.33, random_state=42)

In [None]:
X2_train=scaler.fit_transform(X2_train)
X2_test=scaler.transform(X2_test)
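
The scaler is fitted on the training split only and then applied to the test split, so no information from the test rows leaks into preprocessing. The NumPy sketch below illustrates the arithmetic behind that step. It is not scikit-learn's implementation, and `fit_min_max` and `apply_min_max` are hypothetical names.

```python
import numpy as np

def fit_min_max(train):
    """Learn per-column minimum and range from the training data only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid dividing by zero for constant columns
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale `data` with statistics learned from the training data."""
    return (data - col_min) / col_range

# Toy arrays for illustration only
train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]])
test = np.array([[2.0, 50.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(train_scaled)
print(test_scaled)   # the second value exceeds 1: it lies outside the training range
```

Test values can fall outside [0, 1] when they lie beyond the range seen in training, which is expected and not a sign of a bug.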

**Creating the Model**

In [None]:
df.shape

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout


In [None]:
model=Sequential()

In [None]:
# first hidden layer; Keras infers the input size from the data at fit time
model.add(Dense(8, activation='relu'))
model.add(Dropout(0.2))

# hidden layer
model.add(Dense(4, activation='relu'))
model.add(Dropout(0.2))

# hidden layer
model.add(Dense(2, activation='relu'))
model.add(Dropout(0.2))

# output layer
model.add(Dense(units=1,activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam')

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=25)

In [None]:
model.fit(x=X2_train, 
          y=y2_train, 
          epochs=500,
          validation_data=(X2_test, y2_test), verbose=1,
          callbacks=[early_stop]
          )
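
Note that the call above passes the test split as `validation_data`, so early stopping is tuned on the same rows that are later used for the classification report, which can make that report optimistic. One alternative is to hold out a separate validation set. The sketch below shows a simple three-way index split in NumPy. The function name and the fractions are assumptions for illustration, not part of the original workflow.

```python
import numpy as np

def three_way_split(n_rows, val_fraction=0.15, test_fraction=0.33, seed=42):
    """Return disjoint train, validation and test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_fraction))
    n_val = int(round(n_rows * val_fraction))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The validation rows would then be passed as `validation_data`, leaving the test rows untouched until the final report. Keras's `fit` also accepts a `validation_split` argument, which holds out a fraction of the training arrays for the same purpose.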

In [None]:
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()

In [None]:
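# predict_classes is no longer available in recent TensorFlow releases;
# threshold the sigmoid output at 0.5 instead
pred1 = (model.predict(X2_test) > 0.5).astype('int32')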

In [None]:
print(classification_report(y2_test,pred1))

**Let's compare the ANN's report with the Random Forest classifier's**

In [None]:
print(classification_report(y_test,pred))

**The Random Forest classifier has performed better here, which is plausible for a small tabular dataset. Removing the dropout layers might improve the ANN's accuracy, but at the risk of over-fitting.**

### **Thank You! I would appreciate your Feedback**