## Diabetes Prediction: Extensive EDA, Feature Engineering, Visualizations and Modeling

<div align="center">
    <img src="https://cdn-a.william-reed.com/var/wrbm_gb_food_pharma/storage/images/publications/food-beverage-nutrition/nutraingredients-asia.com/news/regulation-policy/fiji-s-diabetes-epidemic-nation-already-exceeding-who-s-predicted-rate-for-2030/8258832-1-eng-GB/Fiji-s-diabetes-epidemic-Nation-already-exceeding-WHO-s-predicted-rate-for-2030_wrbm_large.jpg" alt="diabetes image" width="500" height="300" style="border-radius:10px;"/>

</div>

<b>Data Dictionary</b>
<ul>
    <li>Pregnancies: Number of times pregnant</li>
    <li>Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test</li>
    <li>BloodPressure: Diastolic blood pressure (mm Hg)</li>
    <li>SkinThickness: Triceps skin fold thickness (mm)</li>
    <li>Insulin: 2-Hour serum insulin (mu U/ml)</li>
    <li>BMI: Body mass index (weight in kg/(height in m)^2)</li>
    <li>DiabetesPedigreeFunction: Diabetes pedigree function</li>
    <li>Age: Age (years)</li>
    <li>Outcome: Class variable (0 or 1)</li>
</ul>

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import imblearn
from imblearn.over_sampling import SMOTE
from collections import Counter

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
import xgboost
from xgboost import XGBClassifier

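from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# plot_confusion_matrix was removed in scikit-learn 1.2; fall back to the
# equivalent ConfusionMatrixDisplay.from_estimator under the same name.
try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    from sklearn.metrics import ConfusionMatrixDisplay
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator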

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [None]:
DATA_PATH = "../input/pima-indians-diabetes-database/diabetes.csv"

In [None]:
data = pd.read_csv(DATA_PATH)
data.head()

## Exploring the data

In [None]:
data.info()

In [None]:
data.describe()

<div style="color:black;background-color:lightgreen;border-radius:10px;padding:20px;">
<b>OBSERVATION</b><br/>The table above shows a minimum value of 0 for the features Glucose, BloodPressure, SkinThickness, Insulin and BMI, which is physiologically impossible. These zeros most likely stand for missing measurements, so let us replace them with NaN and deal with them later in the Univariate Analysis.
</div>

In [None]:
columns_with_wrong_data = ["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]

# Treat the impossible zeros in these columns as missing values
data[columns_with_wrong_data] = data[columns_with_wrong_data].replace(0, np.nan)

In [None]:
print(data.isnull().sum())
data.isnull().sum().plot(kind = "bar")
plt.title("NaN values Plot")
plt.show()

## Univariate Analysis

<b>Analyzing Outcome</b>

In [None]:
counts = data["Outcome"].value_counts()
diag_cols = ["Non Diabetic", "Diabetic"]
diag_counts = [counts[0], counts[1]]

nd = (diag_counts[0] / sum(diag_counts))*100
d = (diag_counts[1] / sum(diag_counts)) * 100

print(f"Diabetic: {d}%")
print(f"Non Diabetic: {nd}%")

print()

plt.figure(figsize = (10, 8))
sns.barplot(x = diag_cols, y = diag_counts)
plt.show()

<div style="color:black;background-color:lightgreen;border-radius:10px;padding:20px;">
<b>OBSERVATION</b><br/>From the above plot we can observe that the classes are imbalanced: there are noticeably fewer diabetic than non-diabetic patients. So we need to upsample the minority class before training.
</div>
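
SMOTE, which is used for upsampling later in this notebook, creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates only that core idea on made-up data; `smote_like_sample` is a hypothetical helper written for this illustration and is not the imblearn implementation.

```python
import numpy as np

def smote_like_sample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per point
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # random minority point
        nb = neighbours[base, rng.integers(k)]   # one of its k neighbours
        gap = rng.random()                       # position on the connecting segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Made-up minority-class samples with two features
X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])
X_new = smote_like_sample(X_minority, n_new=6)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already covered by the minority class.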

<b>Analyzing Pregnancies Column</b>

In [None]:
print(f"Number of unique values in Pregnancies: {len(data.Pregnancies.unique())}")
print(f"Unique values in Pregnancies: {data.Pregnancies.unique()}")

In [None]:
pd.crosstab(data["Pregnancies"], data["Outcome"])

In [None]:
data.groupby(by="Pregnancies")["Outcome"].sum().sort_values(ascending=False).plot(kind = "bar")
plt.show()

<div style="color:black;background-color:lightgreen;border-radius:10px;padding:20px;">
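<b>OBSERVATION</b><br/>From the above plot we can observe that the largest numbers of diabetic patients are found among those with fewer pregnancies. Since the plot shows counts rather than rates, this does not by itself show that patients with fewer pregnancies are more prone to diabetes.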
</div>
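
To separate "how many diabetic patients are in a group" from "how likely a patient in that group is to be diabetic", the count can be placed next to the group size and the rate. This is a minimal sketch on a made-up frame (the values are illustrative, not taken from the dataset); the same `groupby` call could be applied to `data`.

```python
import pandas as pd

# Toy data: groups 0 and 1 have the same diabetic count but different group sizes
toy = pd.DataFrame({
    "Pregnancies": [0, 0, 0, 0, 1, 1, 5, 5],
    "Outcome":     [1, 0, 0, 0, 1, 0, 1, 1],
})

summary = toy.groupby("Pregnancies")["Outcome"].agg(
    diabetic_count="sum",   # what the bar plot of sums shows
    patients="count",       # size of each group
    diabetic_rate="mean",   # share of diabetic patients within the group
)
print(summary)
```

In the toy frame, groups 0 and 1 have the same diabetic count, yet the rate in group 1 is twice that of group 0, which is why the rate is the safer basis for statements about risk.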

<b>Analyzing Glucose Column</b>

In [None]:
plt.figure(figsize = (7, 4))
sns.distplot(data["Glucose"])
plt.title("Glucose distribution plot")
plt.show()

In [None]:
data["Glucose"].isnull().sum()

<div style="color:black;background-color:lightgreen;border-radius:10px;padding:20px;">
<b>OBSERVATION</b><br/>There are 5 missing data points in the Glucose column. From the distribution plot we can observe that there is not much skewness in the data. So, let us replace the missing values with the mean of the data.
</div>
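
The choice between mean and median imputation can also be checked numerically rather than only by eye. The sketch below uses pandas' `Series.skew()` on made-up values; the helper `choose_strategy` and its 0.5 cut-off are assumptions for illustration (a common rule of thumb), not part of the original analysis.

```python
import numpy as np
import pandas as pd

def choose_strategy(series, threshold=0.5):
    """Suggest 'median' for clearly skewed data, 'mean' otherwise (hypothetical helper)."""
    skewness = series.skew()  # NaN values are skipped by default
    return "median" if abs(skewness) > threshold else "mean"

# Made-up columns with one missing value each
symmetric = pd.Series([90, 100, 110, 120, 130, np.nan])
right_skewed = pd.Series([10, 12, 13, 15, 16, 18, 200, np.nan])

print(choose_strategy(symmetric))
print(choose_strategy(right_skewed))
```

The returned string can be passed directly as the `strategy` argument of `SimpleImputer`.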

In [None]:
gluc_imputer = SimpleImputer(strategy="mean")
data["Glucose"] = gluc_imputer.fit_transform(data["Glucose"].values.reshape(-1, 1)).copy()
data["Glucose"].isnull().sum()

In [None]:
plt.figure(figsize = (7, 4))
sns.distplot(data["Glucose"])
plt.title("Glucose distribution plot after imputing with mean")
plt.show()

<b>Analyzing BloodPressure Column</b>

In [None]:
plt.figure(figsize = (7, 4))
sns.distplot(data["BloodPressure"])
plt.title("BloodPressure Distribution Plot")
plt.show()

In [None]:
data["BloodPressure"].isnull().sum()

<div style="color:black;background-color:lightgreen;border-radius:10px;padding:20px;">
<b>OBSERVATION</b><br/>There are 35 missing data points in the BloodPressure column. From the distribution plot we can observe that there is no noticeable skewness in the data. So, let us replace the missing values with the mean of the data.
</div>

In [None]:
bp_imputer = SimpleImputer(strategy="mean")
data["BloodPressure"] = bp_imputer.fit_transform(data["BloodPressure"].values.reshape(-1, 1)).copy()
data["BloodPressure"].isnull().sum()

In [None]:
plt.figure(figsize = (7, 4))
sns.distplot(data["BloodPressure"])
plt.title("BloodPressure Distribution Plot after imputation")
plt.show()

<b>Analyzing SkinThickness Column</b>

In [None]:
plt.figure(figsize = (7, 4))
sns.distplot(data["SkinThickness"])
plt.title("SkinThickness Distribution Plot")
plt.show()

In [None]:
data["SkinThickness"].isnull().sum()

<div style="color:black;background-color:lightgreen;border-radius:10px;padding:20px;">
<b>OBSERVATION</b><br/>There are 227 missing data points in the SkinThickness column. From the distribution plot we can observe that the SkinThickness data is right-skewed. So, let us replace the missing values with the median of the data, which is less affected by the long tail than the mean.
</div>

In [None]:
skt_imputer = SimpleImputer(strategy="median")
data["SkinThickness"] = skt_imputer.fit_transform(data["SkinThickness"].values.reshape(-1, 1)).copy()
data["SkinThickness"].isnull().sum()

In [None]:
plt.figure(figsize = (7, 4))
sns.distplot(data["SkinThickness"])
plt.title("SkinThickness Distribution Plot after imputation")
plt.show()

<b>Analyzing Insulin Column</b>

In [None]:
plt.figure(figsize = (7, 4))
sns.distplot(data["Insulin"])
plt.title("Insulin Distribution Plot")
plt.show()

In [None]:
data["Insulin"].isnull().sum()

In [None]:
percent_of_missing = (data["Insulin"].isnull().sum() / data.shape[0]) *100
print(f"{percent_of_missing}% of Insulin data is missing.")

In [None]:
insulin_imputer = SimpleImputer(strategy="median")
data["Insulin"] = insulin_imputer.fit_transform(data["Insulin"].values.reshape(-1, 1)).copy()
data["Insulin"].isnull().sum()

In [None]:
plt.figure(figsize = (7, 4))
sns.distplot(data["Insulin"])
plt.title("Insulin Distribution Plot after imputing")
plt.show()

<b>Analyzing BMI Column</b>

In [None]:
plt.figure(figsize = (7, 4))
sns.distplot(data["BMI"])
plt.title("BMI Distribution Plot")
plt.show()

In [None]:
data["BMI"].isnull().sum()

In [None]:
bmi_imputer = SimpleImputer(strategy="mean")
data["BMI"] = bmi_imputer.fit_transform(data["BMI"].values.reshape(-1, 1)).copy()
data["BMI"].isnull().sum()

In [None]:
plt.figure(figsize = (7, 4))
sns.distplot(data["BMI"])
plt.title("BMI Distribution Plot after imputation")
plt.show()

<b>Analyzing DiabetesPedigreeFunction Column</b>

In [None]:
plt.figure(figsize = (7, 4))
sns.distplot(data["DiabetesPedigreeFunction"])
plt.title("DiabetesPedigreeFunction Distribution Plot")
plt.show()

<b>Analyzing Age Column</b>

In [None]:
plt.figure(figsize = (7, 4))
sns.distplot(data["Age"])
plt.title("Age Distribution Plot")
plt.show()

## Bivariate Analysis

In [None]:
continuous_data_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin','BMI', 'DiabetesPedigreeFunction', 'Age']

In [None]:
plt.figure(figsize = (11,7))
sns.heatmap(data[continuous_data_cols].corr(), center = 0, annot = True)
plt.title("Correlation Plot")
plt.show()

<div style="color:black;background-color:lightgreen;border-radius:10px;padding:20px;">
<b>OBSERVATION</b><br/>The heatmap shows no pair of features with a very strong correlation, so multicollinearity does not appear to be a problem in this data.
</div>
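
Reading a heatmap by eye works for a handful of features, but the same check can be made explicit by listing every feature pair whose absolute correlation exceeds a cut-off. This is a minimal sketch on synthetic data; the helper `high_corr_pairs` and the 0.8 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold]

# Synthetic example: "b" is almost a multiple of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

flagged = high_corr_pairs(toy)
print(flagged)
```

An empty result for `high_corr_pairs(data[continuous_data_cols])` would support the observation above.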

In [None]:
# pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(data[continuous_data_cols + ["Outcome"]], hue = "Outcome")
plt.show()

In [None]:
all_columns = list(data.columns)
X = data[all_columns[:-1]]
y = data[all_columns[-1]]

## Using Cross Validation for Base Model Selection

In [None]:
models = {
    "xgb_classifier": XGBClassifier(eval_metric="logloss"),
    "rf_model": RandomForestClassifier(random_state = 18),
    "svm_model":SVC(),
    "logistic_regression":LogisticRegression(),
    "ada_boost": AdaBoostClassifier(RandomForestClassifier(random_state = 18))
}

for model_name in models:
    print(f"Model Name: {model_name}")
    print("Cross validation Scores")
    cv_scores = cross_val_score(make_pipeline(StandardScaler(), models[model_name]), X, y, cv = 5)
    print(f"Min Score: {min(cv_scores)}")
    print(f"Max Score: {max(cv_scores)}")    
    print(f"Mean Score: {np.mean(cv_scores)}")
    print()

<div style="color:black;background-color:lightgreen;border-radius:10px;padding:20px;">
<b>OBSERVATION</b><br/>We can notice that Logistic Regression, AdaBoost and RandomForest perform better than the remaining models.
</div>
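
For classifiers, `cross_val_score` reports accuracy unless a `scoring` argument is given. Because the classes here are imbalanced, it may be worth looking at recall and F1 as well. The sketch below does this with `cross_validate` on synthetic data from `make_classification`; the data, the class weights and the stratified, shuffled folds are all assumptions for illustration, not the setup used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced binary problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=0
)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

metrics = ["accuracy", "recall", "f1"]
results = cross_validate(pipeline, X_demo, y_demo, cv=cv, scoring=metrics)

for metric in metrics:
    scores = results[f"test_{metric}"]
    print(f"{metric}: mean={scores.mean():.3f}, min={scores.min():.3f}, max={scores.max():.3f}")
```

Ranking the candidate models by mean recall or F1 instead of accuracy can change which ones look best when missing a diabetic patient is the costlier mistake.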

## Splitting the data into train and test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size = 0.2, random_state = 0)
print(f"Train Data: {X_train.shape}, {y_train.shape}")
print(f"Test Data: {X_test.shape}, {y_test.shape}")

## Upsampling using SMOTE

In [None]:
counter = Counter(y_train)
counter

In [None]:
upsample = SMOTE()
X_train, y_train = upsample.fit_resample(X_train, y_train)
counter = Counter(y_train)
print(counter)

In [None]:
print(f"Total Data after Upsampling: {len(X_train)}")

In [None]:
print(f"Train Data: {X_train.shape}, {y_train.shape}")
print(f"Test Data: {X_test.shape}, {y_test.shape}")

## Logistic Regression

In [None]:
logistic_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
logistic_pipeline.fit(X_train, y_train)

# Accuracy on Test Data
predictions = logistic_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on Test Data: {accuracy*100}%")
print(f"Precision Score: {precision_score(y_test, predictions)}")
print(f"Recall Score: {recall_score(y_test, predictions)}")
print(f"F1 Score: {f1_score(y_test, predictions)}")
plot_confusion_matrix(logistic_pipeline, X_test, y_test)
plt.title("Confusion Matrix for Test Data")
plt.show()

print()

# Accuracy on Whole Data
predictions = logistic_pipeline.predict(X.values)
accuracy = accuracy_score(y, predictions)
print(f"Accuracy on Whole Data: {accuracy*100}%")
print(f"Precision Score: {precision_score(y, predictions)}")
print(f"Recall Score: {recall_score(y, predictions)}")
print(f"F1 Score: {f1_score(y, predictions)}")
plot_confusion_matrix(logistic_pipeline, X.values, y)
plt.title("Confusion Matrix for Whole Data")
plt.show()

## RandomForest Classifier

In [None]:
rf_pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(random_state = 18))
rf_pipeline.fit(X_train, y_train)

# Accuracy on Test Data
predictions = rf_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on Test Data: {accuracy*100}%")
print(f"Precision Score: {precision_score(y_test, predictions)}")
print(f"Recall Score: {recall_score(y_test, predictions)}")
print(f"F1 Score: {f1_score(y_test, predictions)}")
plot_confusion_matrix(rf_pipeline, X_test, y_test)
plt.title("Confusion Matrix for Test Data")
plt.show()

print()

# Accuracy on Whole Data
predictions = rf_pipeline.predict(X.values)
accuracy = accuracy_score(y, predictions)
print(f"Accuracy on Whole Data: {accuracy*100}%")
print(f"Precision Score: {precision_score(y, predictions)}")
print(f"Recall Score: {recall_score(y, predictions)}")
print(f"F1 Score: {f1_score(y, predictions)}")
plot_confusion_matrix(rf_pipeline, X.values, y)
plt.title("Confusion Matrix for Whole Data")
plt.show()

## Adaboost Classifier

In [None]:
ada_pipeline = make_pipeline(StandardScaler(), AdaBoostClassifier(RandomForestClassifier(random_state = 18)))
ada_pipeline.fit(X_train, y_train)

# Accuracy on Test Data
predictions = ada_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on Test Data: {accuracy*100}%")
print(f"Precision Score: {precision_score(y_test, predictions)}")
print(f"Recall Score: {recall_score(y_test, predictions)}")
print(f"F1 Score: {f1_score(y_test, predictions)}")
plot_confusion_matrix(ada_pipeline, X_test, y_test)
plt.title("Confusion Matrix for Test Data")
plt.show()

print()

# Accuracy on Whole Data
predictions = ada_pipeline.predict(X.values)
accuracy = accuracy_score(y, predictions)
print(f"Accuracy on Whole Data: {accuracy*100}%")
print(f"Precision Score: {precision_score(y, predictions)}")
print(f"Recall Score: {recall_score(y, predictions)}")
print(f"F1 Score: {f1_score(y, predictions)}")
plot_confusion_matrix(ada_pipeline, X.values, y)
plt.title("Confusion Matrix for Whole Data")
plt.show()

<div style="color:black;background-color:lightblue;border-radius:10px;padding:20px;">
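<b>RESULT</b><br/>After extensive data analysis, feature engineering and modeling, RandomForestClassifier outperformed the other models, with a recall score of 0.80 and an accuracy of 82.46% on the test data, and a recall score of 0.96 and an accuracy of 96.48% on the whole data. Note that the whole data includes the samples the model was trained on, so the test-data figures are the better estimate of performance on unseen patients.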
</div>