# **<p style="color:red;">About The Dataset :</p>**

<p>This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.</p>


<img src= "https://www.pngitem.com/pimgs/m/255-2558137_diabetes-png-pictures-diabetes-png-transparent-png.png" alt ="Diabetes illustration" style='width: 600px;'>


# **<p style="color:red;">About The Data :</p>**


1. Pregnancies - Number of times pregnant

2. Glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test

3. BloodPressure - Diastolic blood pressure (mm Hg)

4. SkinThickness - Triceps skin fold thickness (mm)

5. Insulin - 2-hour serum insulin (μU/mL)

6. BMI - Body mass index (weight in kg / (height in m)^2)

7. DiabetesPedigreeFunction - Diabetes pedigree function

8. Age - Age (years)

9. Outcome - Class variable (0 or 1) - (Target variable)


# **<p style="color:red;">Aim?</p>**


To build a machine learning model that accurately predicts whether or not the patients in the dataset have diabetes.



#  <p style="color:Blue;">1. Importing The Modules :</p>


In [None]:
# Standard library
import os
import warnings

# Data handling and plotting
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Preprocessing, resampling and splitting
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Classifiers
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Evaluation metrics
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

warnings.filterwarnings("ignore")

#  <p style="color:Blue;">2. Reading The Dataset and Creating DataFrame. </p>

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

diabetes = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")


#  <p style="color:Blue;">3. Analyzing the Data </p>

**3.1 Checking Initial 10 Records of the DataFrame**

In [None]:
diabetes.head(10)

**3.2 Checking The Number Of Rows and Columns In DataFrame**

In [None]:
print(f"Number of rows in the DataFrame: {diabetes.shape[0]} \nNumber of columns in the DataFrame: {diabetes.shape[1]} \n")

**3.3 Checking The Information Of DataFrame**

In [None]:
diabetes.info()

> * 7 columns of type "int".
> * 2 columns of type "float"

**3.4 Checking Statistical Data Of DataFrame**

In [None]:
diabetes.describe()

**3.5 Checking The Duplicate Rows**

In [None]:
diabetes[diabetes.duplicated()]

> **0** duplicate rows found in the dataframe.

**3.6 Checking Unique Values In DataFrame**

In [None]:
diabetes.nunique()

> * The column DiabetesPedigreeFunction has the highest number of unique values.

**3.7 Checking For The Null Values in DataFrame**

In [None]:
diabetes.isnull().sum()

> **0** null values found
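
Note that `isnull()` only counts `NaN` entries. In this dataset, a value of 0 in columns such as Glucose, BloodPressure, Insulin or BMI is not physiologically plausible and may stand for a measurement that was not taken. A minimal sketch of how such zeros could be counted, using a small made-up frame (the `toy` values and the `zero_as_missing` list are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "Insulin": [0, 0, 94],
    "BMI": [33.6, 26.6, 0.0],
})

# Columns where a zero is assumed to mean "not measured"
zero_as_missing = ["Glucose", "BloodPressure", "Insulin", "BMI"]

# Count the zeros in each of those columns
zero_counts = (toy[zero_as_missing] == 0).sum()
print(zero_counts)
```

The same comparison could be run on `diabetes` itself to see how many rows are affected before deciding whether to impute or drop them.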

**3.8 Checking Class Distribution**

In [None]:
print("Number of samples for Outcome 0 are : ",len(diabetes[diabetes['Outcome']==0]))
print("Number of samples for Outcome 1 are : ",len(diabetes[diabetes['Outcome']==1]))


> Classes are **imbalanced**.
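
With imbalanced classes, accuracy alone can be misleading: a model that always predicts the majority class already scores as high as that class's share of the data. The sketch below computes this baseline for a made-up label list (`toy_labels` and `majority_baseline_accuracy` are hypothetical names, and the counts are not the dataset's):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical labels, for illustration only
toy_labels = [0] * 13 + [1] * 7
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

This is one reason precision and recall are reported alongside accuracy later in the notebook.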

#  <p style="color:Blue;">4. Data Visualization</p>

In [None]:
# Add all column names to a list except for the target variable
columns=diabetes.columns
columns=list(columns)
columns.pop()
print("Column names except for the target column are :",columns)

#Graphs to be plotted with these colors
colours=['b','c','g','k','m','r','y','b']
print()
print('Colors for the graphs are :',colours)

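**4.1 Distribution Plot For Various Features**

In [None]:
sns.set(rc={'figure.figsize':(15,17)})
sns.set_style(style='white')
for i in range(len(columns)):
    
    plt.subplot(4,2,i+1)
    # distplot is deprecated in recent seaborn releases; histplot with a KDE curve plus rugplot draws a comparable figure
    sns.histplot(x=diabetes[columns[i]], kde=True, stat="density", color=colours[i])
    sns.rugplot(x=diabetes[columns[i]], color=colours[i])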

**4.2 ViolinPlot For Outcome Vs. Other Attributes**

In [None]:
sns.set(rc={'figure.figsize':(15,17)})
colors_list = ['#78C850', '#F08030']
j=1
sns.set_style(style='white')

for i in (columns):
    plt.subplot(4,2,j)

    # kind, height and aspect are catplot arguments, and split needs a hue variable, so they are dropped here
    sns.violinplot(x="Outcome", y=i, data=diabetes, palette=colors_list)
       
    sns.swarmplot(x='Outcome', y=i,data=diabetes, color="k", alpha=0.8)
    

    j=j+1


**4.3 ScatterPlot Of All Attributes Against Each Other**

In [None]:
sns.set(rc={'figure.figsize':(20,100)})
j=1

sns.set_style(style='white')
for i in range(len(columns)):
    # Start at i+1 so that each pair of distinct features is plotted exactly once
    for k in range(i+1,len(columns)):
        plt.subplot(18,2,j)
        sns.scatterplot(x=columns[i],y=columns[k],hue="Outcome",data=diabetes)
        j=j+1

**4.4 Strip Plot Distribution Of Attributes Vs Outcome**

In [None]:
sns.set(rc={'figure.figsize':(15,15)})
j=1
sns.set_style(style='white')

for i in range(len(columns)):
    plt.subplot(4,2,j)
    sns.stripplot(x='Outcome', y=columns[i] , data=diabetes)
    j=j+1

    


**4.5 Plotting The Pair Plot**

In [None]:
sns.set(rc={'figure.figsize':(15,100)})
sns.set_style(style='white')

sns.pairplot(diabetes, hue='Outcome')


**4.6 HeatMap**

In [None]:
plt.figure(figsize=(20,20))
# Keep the palette and pass it to the heatmap; otherwise it has no effect
cmap = sns.light_palette("seagreen", as_cmap=True)
sns.heatmap(diabetes.corr(), annot=True, cmap=cmap)
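
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a rough illustration of the formula behind a single cell, here is a plain-Python sketch (the `pearson` helper and the sample lists are hypothetical and not taken from the dataset):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, for illustration only
ages = [21, 30, 45, 50]
pregnancies = [0, 2, 5, 6]
r = pearson(ages, pregnancies)
print(f"Pearson r: {r:.3f}")
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.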


**4.7 Distribution Of Target Variable**

In [None]:
plt.figure(figsize=(6,8))
sns.set_style(style='white')
sns.countplot(x='Outcome', data=diabetes)


**4.8 KDE Plot**

In [None]:
with sns.axes_style("white"):
    sns.set_palette("BuGn_r")
    # Recent seaborn releases expect x and y as keyword arguments
    g2 = sns.jointplot(x="Insulin", y="BMI", data=diabetes,
                kind="kde", space=0)


#  <p style="color:Blue;">5. Data PreProcessing</p>

**5.1 Feature Engineering**

In [None]:
# Scaling those columns which have values greater than 1

scaleIt = MinMaxScaler()
columns_to_be_scaled = [c for c in diabetes.columns if diabetes[c].max() > 1]
print("The columns which are to be scaled are :",columns_to_be_scaled)

scaled_columns = scaleIt.fit_transform(diabetes[columns_to_be_scaled])
scaled_columns = pd.DataFrame(scaled_columns, columns=columns_to_be_scaled)
scaled_columns['Outcome'] = diabetes['Outcome'] 



#copying the scaled DataFrame to original DataFrame

diabetes=scaled_columns
diabetes
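
Min-max scaling maps each value to x' = (x - min) / (max - min). In the cell above the scaler is fitted on the whole dataset, before the split in 5.3, so the test rows influence the minimum and maximum. One common alternative is to learn the range from the training rows only and reuse it for the test rows. The plain-Python sketch below illustrates that idea; `fit_min_max`, `apply_min_max` and the sample values are hypothetical, not part of the notebook:

```python
def fit_min_max(values):
    """Return the (min, max) learned from the training values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with x' = (x - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        # A constant column carries no information; map it to 0
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

# Hypothetical glucose readings, for illustration only
train_glucose = [80, 100, 120, 180]
test_glucose = [90, 200]

# Learn the range from the training values and reuse it for the test values
lo, hi = fit_min_max(train_glucose)
train_scaled = apply_min_max(train_glucose, lo, hi)
test_scaled = apply_min_max(test_glucose, lo, hi)
print(train_scaled, test_scaled)
```

Test values can then fall outside [0, 1], which is expected when the range comes from the training data alone. With scikit-learn, the equivalent would be calling `fit_transform` on the training features and `transform` on the test features.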

**5.2 Dividing The Data Into X And Y**

In [None]:
x=diabetes.iloc[:,:-1]
# Select the target as a 1-D Series, the shape scikit-learn estimators expect
y=diabetes.iloc[:,-1]
x.head(5),y.head(5)

**5.3 Train Test Split**

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y , test_size = 0.2, random_state = 42)

**5.4 Using SMOTE To Handle Class Imbalance**


In [None]:
# value_counts(normalize=True) already returns class proportions that sum to 1
before = y_train.value_counts(normalize=True)
print("Percentage of Positive Values in training data before Smote :",before[1]*100,"%")
print("Percentage of Negative Values in training data before Smote :",before[0]*100,"%")

print()
print('Shape of x before applying SMOTE :', x_train.shape)

# Fixed seed (an added assumption) so the resampled training set is reproducible
smote = SMOTE(random_state=42)
x_train,y_train = smote.fit_resample(x_train,y_train)

print('Shape of x after applying SMOTE : ', x_train.shape)
print()

after = y_train.value_counts(normalize=True)
print("Percentage of Positive Values in training data after Smote :",after[1]*100,"%")
print("Percentage of Negative Values in training data after Smote :",after[0]*100,"%")
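
SMOTE balances the classes by creating synthetic minority samples: it picks a minority point, picks one of its nearest minority neighbours, and places a new point somewhere on the line segment between them. The sketch below shows only that interpolation step in plain Python. It is a simplified illustration, not the imblearn implementation, and `interpolate_sample` and the sample points are hypothetical:

```python
import random

def interpolate_sample(point, neighbor, rng):
    """Create a synthetic point on the line segment between two minority samples."""
    lam = rng.random()  # random fraction in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)

# Hypothetical scaled feature vectors of two minority-class samples
minority_point = [0.2, 0.5]
minority_neighbor = [0.4, 0.9]

synthetic = interpolate_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```

Because new points are interpolated from training rows, resampling is applied only to the training split, as done above, so the test set keeps its original class distribution.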


#  <p style="color:Blue;">6. Building The Models</p>

**6.1 Logistic Regression**

In [None]:
model = LogisticRegression()
model.fit(x_train, y_train)
predicted=model.predict(x_test)
conf = confusion_matrix(y_test, predicted)
print ("Confusion Matrix : \n", conf)
print()
print ("The accuracy of Logistic Regression is : ", accuracy_score(y_test, predicted)*100, "%")
print()
print("Precision score for Logistic Regression is :",precision_score(y_test, predicted,)*100, "%")
print()
print("Recall score for Logistic Regression is :",recall_score(y_test, predicted,)*100, "%")
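
The three scores printed above can all be derived from the confusion matrix. For binary labels 0 and 1, scikit-learn's `confusion_matrix` returns rows for the true class and columns for the predicted class, that is `[[TN, FP], [FN, TP]]`. A minimal sketch of the arithmetic, using made-up counts (`metrics_from_confusion` and `toy_conf` are hypothetical):

```python
def metrics_from_confusion(conf):
    """Compute accuracy, precision and recall from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = conf
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Precision: share of predicted positives that are truly positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that the model found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts, for illustration only
toy_conf = [[80, 20], [10, 40]]
accuracy, precision, recall = metrics_from_confusion(toy_conf)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

In a screening setting such as this one, recall is often watched closely, since a false negative means a patient with diabetes goes undetected.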


**6.2 Gaussian Naive Bayes**


In [None]:
model = GaussianNB()
model.fit(x_train, y_train)
  
predicted = model.predict(x_test)
  
conf = confusion_matrix(y_test, predicted)
print ("Confusion Matrix : \n", conf)
print()
print ("The accuracy of Gaussian Naive Bayes is : ", accuracy_score(y_test, predicted)*100, "%")
print()
print("Precision score for Gaussian Naive Bayes is :",precision_score(y_test, predicted,)*100, "%")
print()
print("Recall score for Gaussian Naive Bayes is :",recall_score(y_test, predicted,)*100, "%")


**6.3 Bernoulli Naive Bayes**


In [None]:
model = BernoulliNB()
model.fit(x_train, y_train)
  
predicted = model.predict(x_test)
conf = confusion_matrix(y_test, predicted)
print ("Confusion Matrix : \n", conf)
print()
print ("The accuracy of Bernoulli Naive Bayes is : ", accuracy_score(y_test, predicted)*100, "%")
print()
print("Precision score for Bernoulli Naive Bayes is :",precision_score(y_test, predicted,)*100, "%")
print()
print("Recall score for Bernoulli Naive Bayes is :",recall_score(y_test, predicted,)*100, "%")


**6.4 Support Vector Machine**


In [None]:
model = SVC()
model.fit(x_train, y_train)
predicted = model.predict(x_test)

conf = confusion_matrix(y_test, predicted)
print ("Confusion Matrix : \n", conf)
print()
print ("The accuracy of SVM is : ", accuracy_score(y_test, predicted)*100, "%")
print()
print("Precision score for SVM is :",precision_score(y_test, predicted,)*100, "%")
print()
print("Recall score for SVM is :",recall_score(y_test, predicted,)*100, "%")


**6.5 K Nearest Neighbours**

In [None]:
model = KNeighborsClassifier(n_neighbors = 1)  
model.fit(x_train, y_train)
predicted = model.predict(x_test)

print()
print ("The accuracy of KNN is : ", accuracy_score(y_test, predicted)*100, "%")
print()
print("Precision score for KNN is :",precision_score(y_test, predicted,)*100, "%")
print()
print("Recall score for KNN is :",recall_score(y_test, predicted,)*100, "%")




**6.6 XGBoost (Extreme Gradient Boosting)**

In [None]:
model = xgb.XGBClassifier(use_label_encoder=False)
model.fit(x_train, y_train)
predicted = model.predict(x_test)


print()
print ("The accuracy of XGBoost is : ", accuracy_score(y_test, predicted)*100, "%")
print()
print("Precision score for XGBoost is :",precision_score(y_test, predicted,)*100, "%")
print()
print("Recall score for XGBoost is :",recall_score(y_test, predicted,)*100, "%")


#  <p style="color:Blue;">7. Conclusion</p>
1. Class imbalance was found in the target variable and was handled by applying SMOTE to the training data.
2. Gaussian Naive Bayes performed best in terms of precision and recall.