**The Pima Indians Diabetes dataset contains records of women with and without a diabetes diagnosis, with each column holding a feature that plays an important role in the diagnosis. This notebook compares several classifier models on this binary classification task and concludes with which models give the best results.**

In [None]:
#Importing all the necessary libraries and data set
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#display the data
data = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')
data.head()

In [None]:
#checking 'NaN' values in the data
data.isnull().sum()
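
`isnull()` only counts entries stored as NaN. In this dataset, missing measurements are widely treated as being recorded as 0 in columns such as Glucose, BloodPressure, SkinThickness, Insulin and BMI, so counting zeros per column can be a useful extra check. A minimal sketch, assuming a small hypothetical frame (`sample`) in place of `data`; the values are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for `data`; values are made up for illustration.
sample = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Columns where a value of 0 is not physiologically plausible.
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

# Count zeros per column; these entries do not show up in isnull().sum().
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)
```

On the real data, the same two lines can be run with `data` in place of `sample` and the full list of affected columns.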

In [None]:
#Summary statistics (count, mean, std, min, quartiles, max) for each column
data.describe()

# Data Visualization

In [None]:
#Plot a count plot
sns.countplot(data = data, x = "Outcome",hue = "Outcome")
plt.title("WomenDiabetesData")

In [None]:
#Plot Box plot
fig, ax = plt.subplots(figsize = (5,5))
sns.boxplot(data = data, y = "Pregnancies", x = "Outcome", hue = "Outcome")
plt.title("Pregnancies")

In [None]:
#Plot mean blood pressure by age (no hue: one hue level per age would leave a single point per line)
sns.lineplot(data = data, x = 'Age', y = 'BloodPressure')
plt.title("Age and BP")

In [None]:
#Plotting histograms for all features in the data set
for i in data.columns:
    plt.figure(figsize=(5,5))
    plt.hist(data[i])
    plt.title(i)
    plt.show()

In [None]:
#Plot scatter plot for all features in the data set
sns.pairplot(data = data, hue = "Outcome")

## Correlation between features

In [None]:
#Plot a heatmap showing the pairwise correlation between features
corr_mat = data.corr()
plt.figure(figsize=(12,10))
sns.heatmap(corr_mat, annot = True, cmap = "coolwarm")
plt.show()
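
The heatmap shows every pairwise correlation; for classification, the column for the target is usually the one of most interest. A minimal sketch of ranking features by their correlation with `Outcome`, assuming a small hypothetical frame (`sample`) in place of `data`:

```python
import pandas as pd

# Hypothetical stand-in for `data`; values are made up for illustration.
sample = pd.DataFrame({
    "Glucose": [85, 90, 150, 160, 95, 170],
    "Age": [30, 45, 33, 50, 28, 41],
    "Outcome": [0, 0, 1, 1, 0, 1],
})

# Pearson correlation of every feature with the target, strongest first.
ranked = (
    sample.corr()["Outcome"]
    .drop("Outcome")              # the target's correlation with itself is always 1
    .sort_values(ascending=False)
)
print(ranked)
```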

## Standard Scaling


In [None]:
#Assigning X and Y
X = data[['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']]
Y = data.Outcome

In [None]:
#Assigning training and test data
from sklearn.model_selection import train_test_split
X_train_orig, X_test, Y_train,Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

In [None]:
#Feature Scaling the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train_orig)
X_train = sc.transform(X_train_orig)
X_test = sc.transform(X_test)
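
The scaler is fitted on the training split only, and the test split is transformed with the training mean and standard deviation. A minimal sketch that checks this by hand on synthetic data; the arrays `X_tr` and `X_te` are hypothetical stand-ins for the notebook's splits:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test features.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_te = rng.normal(loc=50.0, scale=10.0, size=(25, 3))

scaler = StandardScaler().fit(X_tr)       # statistics come from the training data only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Manual z-score with the training mean and population std (ddof=0),
# which is the convention StandardScaler follows.
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(manual, X_te_scaled))
```

After scaling, the training columns have mean 0 and unit standard deviation; the test columns are only approximately so, because they reuse the training statistics.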

# Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
lr_classifier = LogisticRegression()
lr_classifier.fit(X_train, Y_train)
Y_pred = lr_classifier.predict(X_test)
print("Accuracy of the model:",accuracy_score(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred))

# Naive Bayes Classifier

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb_classifier = GaussianNB()
gnb_classifier.fit(X_train, Y_train)
Y_pred = gnb_classifier.predict(X_test)
print("Accuracy of the model:",accuracy_score(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred))

# Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, Y_train)
Y_pred = rf_classifier.predict(X_test)
print("Accuracy of the model:",accuracy_score(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred))

# Support Vector Machine

In [None]:
from sklearn.svm import SVC
svm_classifier = SVC()
svm_classifier.fit(X_train, Y_train)
Y_pred = svm_classifier.predict(X_test)
print("Accuracy of the model:",accuracy_score(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred))

# Neural Network Model

In [None]:
import tensorflow as tf 
from keras.models import Sequential
from keras.layers import Dense

#define keras model
model = Sequential()
model.add(Dense(12, input_dim = 8, activation = 'relu'))
model.add(Dense(8, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))

#compile the keras model
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])


#print the architecture, then fit the model
model.summary()
model.fit(X_train, Y_train, epochs = 150, batch_size = 10)

In [None]:
#Evaluate the model on the held-out test set, as for the other classifiers
_, accuracy = model.evaluate(X_test, Y_test, verbose = 0)
print("Accuracy of the model:", accuracy)
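
The scikit-learn classifiers above are reported with per-class recall, whereas the network reports accuracy only. To compare it on the same footing, its sigmoid outputs can be thresholded into class labels and passed to the same metrics. A minimal sketch, assuming hypothetical probabilities (`probs`) shaped like the output of `model.predict(X_test)` and hypothetical labels (`y_true`):

```python
import numpy as np
from sklearn.metrics import classification_report, recall_score

# Hypothetical sigmoid outputs with shape (n_samples, 1), and matching labels.
probs = np.array([[0.91], [0.12], [0.55], [0.40], [0.08], [0.73]])
y_true = np.array([1, 0, 1, 1, 0, 0])

# Threshold at 0.5 to turn probabilities into 0/1 predictions.
y_pred_nn = (probs.ravel() >= 0.5).astype(int)

# Recall for the positive class: TP / (TP + FN).
nn_recall = recall_score(y_true, y_pred_nn)
print(classification_report(y_true, y_pred_nn))
```

The 0.5 threshold is an assumption; lowering it trades precision for recall.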

**We can conclude that the SVM and Logistic Regression models give the best results in terms of recall. 
In medical diagnosis, recall matters most because a false negative is a missed case, so we use it as the deciding metric here.**
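
A side-by-side listing makes a recall comparison easier to check than reading separate reports. The sketch below is one possible way to collect it; it uses synthetic data from `make_classification` as a stand-in for the diabetes data, so the numbers it prints say nothing about the results above. The names `X_demo`, `y_demo`, `models` and `recalls` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 8 features, like the notebook's X.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scale with statistics from the training split only.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

# Fit each model and record recall for the positive class on the test split.
recalls = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    recalls[name] = recall_score(y_te, clf.predict(X_te))

# Print the models from highest to lowest recall.
for name, value in sorted(recalls.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: recall = {value:.3f}")
```

Fixing `random_state` on the Random Forest keeps the comparison repeatable between runs.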