# **Diabetes Prediction**

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

![](https://image.freepik.com/free-vector/cute-happy-blood-drop-with-glucose-measuring-device-character-flat-style-cartoon-illustration-icon-isolated-white-normal-risk-diabetes-blood-sugar-glucometer-level_92289-457.jpg)

**The data contains:**

1. **Pregnancies:** Number of times pregnant

2. **Glucose:** Plasma glucose concentration at 2 hours in an oral glucose tolerance test

3. **BloodPressure:** Diastolic blood pressure (mm Hg)

4. **SkinThickness:** Triceps skin fold thickness (mm)

5. **Insulin:** 2-Hour serum insulin (mu U/ml)

6. **BMI:** Body mass index (weight in kg/(height in m)^2)

7. **DiabetesPedigreeFunction:** Diabetes pedigree function

8. **Age:** Age (years)

9. **Outcome:** 1 if diabetes, 0 if no diabetes

**Load the required libraries**

In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

**Let us load the data set**

In [None]:
diab_df= pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')

### **Data Analysis On Diabetes Data Set**

**Checking the first 5 and last 5 records of the dataset**

In [None]:
diab_df.head(5)

In [None]:
diab_df.tail(5)

**Let's check for duplicate rows in the dataset**

In [None]:
diab_df.duplicated().sum()

In [None]:
diab_df.shape

In [None]:
diab_df.info()

In [None]:
diab_df.isnull().sum()

**So there are 768 records and 9 columns, with no null values and no duplicate rows.**

**However, several columns contain many 0s where a zero is not a plausible measurement (for example, a glucose level or BMI of 0). These zeros most likely stand for missing values, so we replace them with NaN, which makes them easy to count, and then fill them with suitable values.**

In [None]:
# Columns where a zero cannot be a real measurement
zero_cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
diab_df[zero_cols] = diab_df[zero_cols].replace(0, np.nan)
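
Before converting zeros to NaN, it can help to count them per column and to leave alone any column where zero is a legitimate value (such as Pregnancies). The sketch below illustrates this on a small made-up frame; `toy_df` and `zero_as_missing` are hypothetical names, and the values are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for diab_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 0],      # zero is a valid count here
    "Glucose": [148, 0, 183, 89],     # zero is not a plausible reading
    "BMI": [33.6, 26.6, 0.0, 0.0],    # zero is not a plausible reading
})

# Only the columns where zero cannot be a real measurement
zero_as_missing = ["Glucose", "BMI"]

# Count the zeros first, so we know how much data is affected
zero_counts = (toy_df[zero_as_missing] == 0).sum()
print(zero_counts)

# Replace zeros with NaN in those columns only
toy_df[zero_as_missing] = toy_df[zero_as_missing].replace(0, np.nan)
print(toy_df.isnull().sum())
```

Pregnancies keeps its zeros, while the zeros in Glucose and BMI are now reported by `isnull()`.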

In [None]:
diab_df.isnull().sum()

**We can see that the columns mentioned above contained many 0s, which are now counted as missing values.**

**To decide how to fill these NaN values, let's look at the data distribution.**

In [None]:
diab_df.hist(figsize = (11,11), color="#008080")

**Let's replace the NaN values in each column in accordance with its distribution: the mean for Glucose and BloodPressure, and the median for SkinThickness, Insulin and BMI.**

In [None]:
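# Assign the result back instead of calling fillna(..., inplace=True) on a column,
# which is deprecated chained assignment in recent pandas versions
diab_df['Glucose'] = diab_df['Glucose'].fillna(diab_df['Glucose'].mean())
diab_df['BloodPressure'] = diab_df['BloodPressure'].fillna(diab_df['BloodPressure'].mean())
diab_df['SkinThickness'] = diab_df['SkinThickness'].fillna(diab_df['SkinThickness'].median())
diab_df['Insulin'] = diab_df['Insulin'].fillna(diab_df['Insulin'].median())
diab_df['BMI'] = diab_df['BMI'].fillna(diab_df['BMI'].median())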
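
One way to make the choice between mean and median less of a judgment call is to look at each column's skewness. The following is a minimal sketch on made-up data, not the rule used in this notebook: the cut-off `SKEW_THRESHOLD` and the toy column names are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy columns with one missing value each; the numbers are made up.
toy_df = pd.DataFrame({
    "roughly_symmetric": [70.0, 72.0, 74.0, np.nan, 76.0, 78.0],
    "right_skewed": [10.0, 12.0, 15.0, np.nan, 20.0, 400.0],
})

SKEW_THRESHOLD = 1.0  # hypothetical cut-off, not a standard value

fill_values = {}
for col in toy_df.columns:
    skew = toy_df[col].skew()  # NaN values are skipped by default
    # A strongly skewed column gets the median, which resists outliers;
    # otherwise the mean is used.
    if abs(skew) > SKEW_THRESHOLD:
        fill_values[col] = toy_df[col].median()
    else:
        fill_values[col] = toy_df[col].mean()

# fillna accepts a dict mapping column names to fill values
toy_filled = toy_df.fillna(fill_values)
print(fill_values)
print(toy_filled.isnull().sum())
```

The median is less sensitive to extreme values, which is why it is often preferred for skewed columns.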

In [None]:
diab_df.isnull().sum()

**After replacing NaN Values, the dataset is almost clean now. We can move ahead with our EDA.**

In [None]:
diab_df.info()

### **Exploratory Data Analysis**

In [None]:
plt.figure(figsize=(10,5))
plt.title('Diabetes Plot Yes/No', fontsize=14)
sns.countplot(x="Outcome", data=diab_df, palette=('#23C552','#C52219'))
plt.xlabel("Diabetes (0 = No, 1= Yes)", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

**From the plot above, we can see that the dataset contains fewer diabetic patients than non-diabetic ones.**

In [None]:
fig, axes = plt.subplots(2, 4, figsize=(18, 12))
fig.suptitle('Diabetes Outcome Distribution WRT All Independent Variables', fontsize=16)

# One boxplot per predictor, grouped by Outcome, filled row by row
features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

for ax, col in zip(axes.flat, features):
    sns.boxplot(ax=ax, x=diab_df['Outcome'], y=diab_df[col], hue=diab_df['Outcome'], palette=('#23C552','#C52219'))
    ax.set_title(f"Diabetes Outcome vs {col}", fontsize=12)

**From the boxplots above, we can see that diabetic patients tend to have higher Glucose levels, Age, BMI, Pregnancies and Insulin measures.**

In [None]:
sns.pairplot(diab_df, hue='Outcome', palette=('#23C552','#C52219'))

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(diab_df.corr(), annot=True, cmap='RdYlGn')
plt.title("Feature Correlation Matrix",fontsize=20)
plt.show()

**We can see that a few of the features are moderately correlated - Age and number of Pregnancies, Insulin and Glucose levels, Skin Thickness and BMI - but not so much as to cause concern.**
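
Reading correlated pairs off a heatmap by eye can be error-prone, so it may help to list them in order. The sketch below does this on synthetic data; the frame `toy` and its columns `a`, `b`, `c` are hypothetical, with `b` deliberately constructed to correlate with `a`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Synthetic frame: b is built from a, c is independent noise
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
        .stack()
        .dropna()
        .sort_values(key=np.abs, ascending=False)
)
print(pairs)
```

The result is a Series indexed by column pairs, sorted by absolute correlation, so the most strongly related features appear first.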

### **Modelling**

In [None]:
diab_df.describe()

In [None]:
diab_df['Outcome'].value_counts()

**Here we have just checked the distribution.**

**First, let's split the data into features (x) and target (y).**

In [None]:
from sklearn.model_selection import train_test_split
x = diab_df.drop(['Outcome'],axis=1)
y = diab_df['Outcome']

**Then we use the standard scaler to scale the data.**

In [None]:
from sklearn.preprocessing import StandardScaler

sc= StandardScaler()
x_scaled= sc.fit_transform(x)

**Now let's split the data into train and test sets.**

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.3, random_state=0)

In [None]:
x_train.shape, y_train.shape

In [None]:
x_test.shape, y_test.shape
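
Note that fitting a scaler on the full dataset before splitting lets the test rows influence the scaling statistics. A common alternative is to split first and put the scaler inside a pipeline, so it is fitted on the training rows only. The sketch below shows that pattern on synthetic data from `make_classification`; the names `X_demo`, `y_demo` and `pipe` are hypothetical, `stratify` is an optional extra not used in this notebook, and the resulting accuracy says nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the diabetes data (768 rows, 8 features)
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# The scaler inside the pipeline is fitted on the training rows only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {test_accuracy:.3f}")
```

When `pipe.score` or `pipe.predict` is called, the test rows are transformed with the mean and standard deviation learned from the training rows.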

### **Applying Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)

**Computing confusion matrix**

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
# scikit-learn expects (y_true, y_pred): rows are true labels, columns are predictions
confmat = confusion_matrix(y_test, y_pred)
confmat

In [None]:
from sklearn import metrics
cm=metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_test,y_pred,labels=logreg.classes_),
                              display_labels=logreg.classes_)
cm.plot(cmap="magma")

In [None]:
accuracy_score(y_test, y_pred)
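
Accuracy alone can hide how a model treats each class, especially when one class is less frequent. As a small illustration, the sketch below unpacks a confusion matrix into its four cells and derives precision and recall from them. The label lists `y_true_demo` and `y_pred_demo` are made up; `confusion_matrix` takes the true labels first, so rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 5 negatives and 3 positives
y_true_demo = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 0, 1, 0, 1]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")

# The hand-computed values agree with scikit-learn's helpers
assert abs(precision - precision_score(y_true_demo, y_pred_demo)) < 1e-12
assert abs(recall - recall_score(y_true_demo, y_pred_demo)) < 1e-12
```

In a screening context, recall for the diabetic class is often the figure of interest, since a false negative means a missed case.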

### **Model Accuracy with Logistic Regression: 76.19%**

### **Applying Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(criterion = "gini", 
                                       min_samples_leaf = 1, 
                                       min_samples_split = 10,   
                                       n_estimators=100, 
                                       max_features='sqrt',  # 'auto' was removed in scikit-learn 1.3; 'sqrt' is its classifier equivalent
                                       oob_score=True, 
                                       random_state=1, 
                                       n_jobs=-1)

random_forest.fit(x_train, y_train)
y_pred = random_forest.predict(x_test)

**Computing confusion matrix**

In [None]:
# True labels first, predictions second
confmat1 = confusion_matrix(y_test, y_pred)
confmat1

In [None]:
from sklearn import metrics
cm=metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_test,y_pred,labels=random_forest.classes_),
                              display_labels=random_forest.classes_)
cm.plot(cmap="magma")

In [None]:
accuracy_score(y_test, y_pred)

### **Model Accuracy with Random Forest: 77.05%**
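
A forest trained with `oob_score=True` also exposes an out-of-bag accuracy estimate (`oob_score_`) and per-feature importances (`feature_importances_`). The sketch below shows how these could be read, using synthetic data rather than the diabetes set; `X_demo`, `y_demo` and `forest` are hypothetical names and the printed numbers are not results for this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 8 features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=1
)

forest = RandomForestClassifier(
    n_estimators=100,
    min_samples_split=10,
    oob_score=True,     # score each sample using only trees that did not see it
    random_state=1,
)
forest.fit(X_demo, y_demo)

# Out-of-bag accuracy: an estimate obtained without a separate validation set
print(f"OOB accuracy estimate: {forest.oob_score_:.3f}")

# Impurity-based importances sum to 1; rank features from most to least important
ranked = sorted(
    enumerate(forest.feature_importances_), key=lambda pair: pair[1], reverse=True
)
for idx, importance in ranked:
    print(f"feature {idx}: {importance:.3f}")
```

With a DataFrame such as `x`, the feature indices could be mapped back to column names to see which measurements the forest relies on most.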