# Introduction
In this notebook I'll explore and clean the *PIMA Diabetes* dataset, then build a model on it.

![Diabetes Picture](https://images.unsplash.com/photo-1593491205049-7f032d28cf5c?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&dl=mykenzie-johnson-4qjxCUOc3iQ-unsplash.jpg)

# About the dataset

## Context
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

## Content
The dataset consists of several medical predictor variables and one target variable, `Outcome`. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

# Setting up the environment

In [None]:
import numpy as np # for linear algebra
import pandas as pd # data processing, CSV file I/O, etc
import seaborn as sns # for plots
import plotly.graph_objects as go # for plots
import plotly.express as px #for plots
import matplotlib.pyplot as plt # for visualizations and plots
import missingno as msno # for plotting missing data

# this eliminates the requirement to use plt.show() after every plot
%matplotlib inline

# changing the default figure sizes
from pylab import rcParams
rcParams['figure.figsize'] = 15, 10

import random # random library
pallete = ['Accent_r', 'Blues', 'BrBG', 'BrBG_r', 'BuPu', 'CMRmap', 'CMRmap_r', 'Dark2', 'Dark2_r', 'GnBu', 'GnBu_r', 'OrRd', 'Oranges', 'Paired', 'PuBu', 'PuBuGn', 'PuRd', 'Purples', 'RdGy_r', 'RdPu', 'Reds', 'autumn', 'cool', 'coolwarm', 'flag', 'flare', 'gist_rainbow', 'hot', 'magma', 'mako', 'plasma', 'prism', 'rainbow', 'rocket', 'seismic', 'spring', 'summer', 'terrain', 'turbo', 'twilight']

from sklearn.model_selection import train_test_split # splitting training and testing data
from sklearn.preprocessing import MinMaxScaler # data normalization with sklearn
from sklearn.preprocessing import StandardScaler # data standardization with  sklearn
from sklearn.ensemble import RandomForestClassifier # model
from sklearn.linear_model import LogisticRegression # model
from sklearn.neighbors import KNeighborsClassifier # model
from sklearn.metrics import classification_report, confusion_matrix # to evaluate the model
from mlxtend.plotting import plot_confusion_matrix # plot confusion matrix
from sklearn.model_selection import GridSearchCV # to finetune the model

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Read the dataset

In [None]:
df = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")
df.head() # displays the top 5 values in the dataset

# Getting info about the dataset
### General stats

In [None]:
df.info()

In [None]:
df.describe()

### Checking for NaN values

In [None]:
df.isnull().sum()

Wow, this dataset doesn't have any null values!!

# Data Cleaning
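In the above stats we can see that there are people with 0 BP (dead person?), 0 skin thickness (skeleton?) and 0 Glucose (how do you even survive?). `Insulin` and `BMI` also contain 0s. Such values are not physically plausible, so they are best treated as missing measurements.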
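Before converting anything, it can help to count how many zeros each column actually holds. The snippet below is a minimal sketch on a small, made-up frame (`toy` and its values are hypothetical, not taken from the dataset); the same expression should work on `df` restricted to the affected columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for a few columns of the real dataset
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Comparing to 0 gives a boolean frame; summing counts the True entries per column
zero_counts = (toy == 0).sum()
print(zero_counts)
```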

Let's convert those 0s to NaN.

In [None]:
# Zeros in these columns are not plausible measurements, so mark them as missing
zero_as_nan_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[zero_as_nan_cols] = df[zero_as_nan_cols].replace(0, np.nan)

In [None]:
df.isnull().sum()

# EDA
### Distribution of the data

In [None]:
px.pie(df, names="Outcome")

Here we can see that 65.1% of the people in this dataset don't have diabetes and 34.9% do.
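The same percentages can be read off without a plot. This is a small sketch using a hypothetical label series (`outcome`, with made-up counts); on the real data the equivalent call would presumably be made on `df["Outcome"]`.

```python
import pandas as pd

# Hypothetical labels: 13 non-diabetic (0) and 7 diabetic (1)
outcome = pd.Series([0] * 13 + [1] * 7, name="Outcome")

# normalize=True returns fractions instead of raw counts; multiply by 100 for percentages
shares = outcome.value_counts(normalize=True) * 100
print(shares.round(1))
```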

In [None]:
sns.countplot(x="Outcome", data=df, palette=random.choice(pallete))

### Pregnancies vs Outcome

In [None]:
sns.countplot(x="Pregnancies", hue = "Outcome", data=df, palette=random.choice(pallete))

In [None]:
sns.histplot(x="Pregnancies", hue="Outcome", data=df, kde=True, palette=random.choice(pallete))

### Blood Pressure vs Outcome

In [None]:
sns.histplot(x="BloodPressure", hue="Outcome", data=df, kde=True, palette=random.choice(pallete))

Here we can see that the BP levels of diabetic people are a little higher.

### Glucose vs Outcome

In [None]:
sns.histplot(x="Glucose", hue="Outcome", data=df, kde=True, palette=random.choice(pallete))

Here we can see that the glucose levels of diabetic people are generally higher.

### Skin Thickness vs Outcome

In [None]:
sns.histplot(x="SkinThickness", hue="Outcome", data=df, kde=True, palette=random.choice(pallete))

Here we can see that diabetic people have slightly thicker skin.

### Insulin vs Outcome

In [None]:
sns.histplot(x="Insulin", hue="Outcome", data=df, kde=True, palette=random.choice(pallete))

Here we can see that diabetic people have slightly higher insulin levels.

### Age vs Outcome

In [None]:
sns.histplot(x="Age", hue="Outcome", data=df, kde=True, palette=random.choice(pallete))

We can see that older people are more likely to be diabetic.

### BMI vs Outcome

In [None]:
sns.histplot(x="BMI", hue="Outcome", data=df, kde=True, palette=random.choice(pallete))

Diabetic people tend to have a higher BMI.

### DiabetesPedigreeFunction vs Outcome

In [None]:
sns.histplot(x="DiabetesPedigreeFunction", hue="Outcome", data=df, kde=True, palette=random.choice(pallete))

### Pairplot

In [None]:
sns.pairplot(df, hue='Outcome',palette=random.choice(pallete))

### Boxplots

In [None]:
fig, axs = plt.subplots(4, 2, figsize=(20,20))
axs = axs.flatten()
for i in range(len(df.columns)-1):
    sns.boxplot(data=df, x=df.columns[i], ax=axs[i], palette=random.choice(pallete))

### Correlation Matrix

In [None]:
sns.heatmap(df.corr(), linewidths=0.1, vmax=1.0, square=True, cmap='coolwarm', linecolor='white', annot=True).set_title("Correlation Map")

`Outcome` is highly correlated with `Glucose`.
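Reading one column of a large heatmap by eye is error-prone, so it can be handy to rank the features by their correlation with the target. The sketch below uses a tiny made-up frame (`toy` and all its values are hypothetical); the same chain of calls should apply to `df`.

```python
import pandas as pd

# Hypothetical toy data: Glucose separates the classes clearly, Age barely does
toy = pd.DataFrame({
    "Glucose": [85, 90, 150, 160, 95, 170],
    "Age":     [50, 30, 31, 45, 29, 40],
    "Outcome": [0, 0, 1, 1, 0, 1],
})

# Take the Outcome column of the correlation matrix, drop the trivial
# self-correlation, and sort from strongest positive to weakest
ranked = toy.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
print(ranked)
```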

# Cleaning the dataset

## NaN Values Analysis
Let's get rid of them NaNs.

In [None]:
df.isnull().sum()

### Barplot

In [None]:
msno.bar(df)

### Matrix

**How to read?**

Each row in the matrix represents the corresponding row in the dataset. A missing (NaN) value is drawn in white; a present value is drawn in black (or gray).

The sparkline on the right shows how complete each row is. If a row has too many NaN values, we can remove it.

In [None]:
msno.matrix(df, figsize=(20,35))
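The idea of removing rows with too many NaN values can also be checked numerically. This is a minimal sketch on a made-up frame; `toy` and the threshold of 3 non-missing values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a mostly empty second row
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BloodPressure": [72.0, np.nan, 64.0],
    "SkinThickness": [35.0, np.nan, np.nan],
    "BMI": [33.6, 26.6, 23.3],
})

# Number of missing values in each row
missing_per_row = toy.isnull().sum(axis=1)
print(missing_per_row)

# Keep only rows with at least 3 non-missing values (arbitrary threshold)
kept = toy.dropna(thresh=3)
print(kept)
```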

### Heatmap
The heatmap is used to identify correlations of the nullity between each of the different columns.

In [None]:
msno.heatmap(df, cmap=random.choice(pallete))

Here we can see that Insulin and Skin Thickness are highly positively correlated with each other (nullity corr).
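To make "nullity correlation" concrete: turn every cell into a 0/1 missing indicator and correlate those indicators. The sketch below does this by hand on made-up data (`toy` is hypothetical); it illustrates the idea and is not necessarily how `missingno` computes its heatmap internally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Insulin and SkinThickness tend to be missing together
toy = pd.DataFrame({
    "Insulin":       [np.nan, 94.0, np.nan, 168.0, np.nan, 88.0],
    "SkinThickness": [np.nan, 23.0, np.nan, 35.0, 32.0, 29.0],
    "Glucose":       [148.0, np.nan, 183.0, 89.0, 137.0, 116.0],
})

# 1 where the value is missing, 0 where it is present
nullity = toy.isnull().astype(int)

# Pearson correlation between the missing indicators of each pair of columns
nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

A value near 1 means two columns tend to be missing in the same rows; a negative value means one tends to be present when the other is missing.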

### Dendrogram
The dendrogram plot provides a tree-like graph generated through hierarchical clustering and groups together columns that have strong correlations in nullity.


**How to read?**

If a number of columns are grouped together at level zero, then the presence of nulls in one of those columns is directly related to the presence or absence of nulls in the other columns. The more separated the columns are in the tree, the less likely it is that their null values are correlated.

In [None]:
msno.dendrogram(df)

### Percentages of NaNs

In [None]:
df.isnull().sum()/len(df)*100

We can see that nearly 50% of the values in the `Insulin` column are NaN. Therefore, it would be wise to drop the column entirely!

In [None]:
df.drop(columns=["Insulin"], inplace=True)

In [None]:
df.describe()

In [None]:
df.skew()

For highly skewed columns we'll impute the missing values with the **median**; for the others we'll use the **mean**.
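The median/mean rule can be automated once a cutoff for "highly skewed" is fixed. The sketch below assumes a cutoff of `|skew| > 1`, which is a common rule of thumb and not something this notebook specifies; `toy` and `SKEW_THRESHOLD` are hypothetical names.

```python
import numpy as np
import pandas as pd

SKEW_THRESHOLD = 1.0  # assumption: |skew| above this counts as "highly skewed"

# Hypothetical toy frame: one long-tailed column, one symmetric column
toy = pd.DataFrame({
    "skewed":    [1.0, 1.0, 2.0, 2.0, 3.0, 50.0, np.nan],
    "symmetric": [10.0, 11.0, 12.0, 13.0, 14.0, np.nan, 12.0],
})

for col in toy.columns:
    # skew() ignores NaN by default
    use_median = abs(toy[col].skew()) > SKEW_THRESHOLD
    fill_value = toy[col].median() if use_median else toy[col].mean()
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```

The median is less affected by the single extreme value (50.0), which is why it is preferred for the long-tailed column.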

In [None]:
# Assign the result back: calling replace(..., inplace=True) on a selected column
# may not update df in newer pandas versions
# Highly skewed
df["BMI"] = df["BMI"].fillna(df["BMI"].median())
df["Pregnancies"] = df["Pregnancies"].fillna(df["Pregnancies"].median())

# Normal
df["Glucose"] = df["Glucose"].fillna(df["Glucose"].mean())
df["BloodPressure"] = df["BloodPressure"].fillna(df["BloodPressure"].mean())
df["SkinThickness"] = df["SkinThickness"].fillna(df["SkinThickness"].mean())

### Outliers
Outliers... more like **OUT**-liers!

**Method 1**

*IQR Method*

This technique uses the interquartile range, $IQR = Q3 - Q1$, to remove outliers. The rule of thumb is that anything outside the range from $(Q1 - 1.5 \cdot IQR)$ to $(Q3 + 1.5 \cdot IQR)$ is an outlier, and can be removed.
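A tiny worked example makes the rule easier to follow. The numbers below are made up purely for illustration (`values` is hypothetical).

```python
import pandas as pd

# Hypothetical sample with one obviously extreme value
values = pd.Series([1, 2, 3, 4, 100])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag everything outside the fences
is_outlier = (values < lower) | (values > upper)
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, range=({lower}, {upper})")
print(values[is_outlier])
```

Here the fences come out as -1 and 7, so only the value 100 is flagged.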

In [None]:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

In [None]:
df_out = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
print(f'Before: {df.shape}, After: {df_out.shape}')

Using this method, we will be losing around 140 data points.

**Method 2**

*Median Method*

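In this method we will replace the outliers with the *median*. Here, any value above the 90th percentile or below the 10th percentile of its column is treated as an outlier.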

In [None]:
for col in df.columns[:-1]:
    up_out = df[col].quantile(0.90)
    low_out = df[col].quantile(0.10)
    med = df[col].median()
#     print(col, up_out, low_out, med)
    df[col] = np.where(df[col] > up_out, med, df[col])
    df[col] = np.where(df[col] < low_out, med, df[col])

In [None]:
df.describe()

# Modeling

## Split the data

In [None]:
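# df_out is the IQR-filtered frame from Method 1; the median replacement of Method 2 only changed df
X = df_out[df_out.columns[:-1]]
y = df_out['Outcome']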

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
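The `stratify=y` argument is what keeps the class ratio the same in the training and test sets. The sketch below checks this on synthetic labels (the arrays and the 65/35 split are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 65 negatives and 35 positives, one dummy feature
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 65 + [1] * 35)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# With stratification, the share of positives should be (nearly) identical in both parts
print("train positive share:", y_tr.mean())
print("test positive share: ", y_te.mean())
```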

## Normalize the data

In [None]:
norm = MinMaxScaler().fit(X_train)
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)
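Min-max scaling maps each feature to $(x - min) / (max - min)$, where the minimum and maximum come from the data the scaler was fitted on. The sketch below, on made-up numbers, shows that fitting on the training split only means test values can fall outside $[0, 1]$, which is expected behavior and not a bug.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature splits
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

# Learn min and max from the training data only
scaler = MinMaxScaler().fit(train)
test_scaled = scaler.transform(test)

# The same transformation written out by hand
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(test_scaled.ravel())
```

The test value 40 lies above the training maximum of 30, so it is scaled to a value greater than 1.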

# Models
## Logistic Regression

In [None]:
log_params = {'C': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1, 10, 100]}  # duplicate 100 removed
log_model = GridSearchCV(LogisticRegression(), log_params, cv=5)
log_model.fit(X_train_norm, y_train)
log_pred = log_model.predict(X_test_norm)

## Random Forest Classifier

In [None]:
rf_params = {'criterion' : ['gini', 'entropy'],
             'n_estimators': list(range(60, 140, 20)),
             'max_depth': list(range(3, 20, 2))}
rf_model = GridSearchCV(RandomForestClassifier(), rf_params, cv=5)
rf_model.fit(X_train_norm, y_train)
rf_pred = rf_model.predict(X_test_norm)

## K Neighbors Classifier

In [None]:
knn_params = {'n_neighbors': list(range(1,50))}
knn_model = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5)
knn_model.fit(X_train_norm, y_train)
knn_pred = knn_model.predict(X_test_norm)
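After a grid search it is usually worth checking which hyperparameters were actually selected. The sketch below runs a small search on synthetic data from `make_classification` (the data and the reduced grid are assumptions for illustration); the same attributes, `best_params_` and `best_score_`, should be available on `log_model`, `rf_model` and `knn_model`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Small grid so the example runs quickly
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X, y)

# Best setting found and its mean cross-validated accuracy
print(search.best_params_)
print(round(search.best_score_, 3))
```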

# Evaluation
For the evaluation we will mainly look at the `Precision` and `Recall` values. This is because the dataset contains far fewer data points for diabetic people, so even a model that predicts `0` for everyone can reach a deceptively high accuracy!
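The point about accuracy can be demonstrated with a baseline that always predicts `0`. The labels below are made up (65 negatives, 35 positives) to roughly mirror the class balance seen earlier; they are not the notebook's actual test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical ground truth: 65 non-diabetic, 35 diabetic
y_true = np.array([0] * 65 + [1] * 35)

# A "model" that predicts non-diabetic for everyone
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
# zero_division=0 avoids a warning, since no positives are predicted at all
prec = precision_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f}, precision={prec:.2f}, recall={rec:.2f}")
```

Accuracy looks acceptable at 0.65, yet the baseline finds none of the diabetic cases, which precision and recall make visible.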

In [None]:
print("Logistic Regression: \n", classification_report(y_test, log_pred)) 
print("\nRandom Forest Classifier: \n", classification_report(y_test, rf_pred)) 
print("\nK Neighbors Classifier: \n", classification_report(y_test, knn_pred)) 

Here we can see, *Random Forest Classifier* has better Precision and Recall (hence, better f1-score) compared to other models.

# Confusion Matrix

## Logistic Regression

In [None]:
labels = ["Not Diabetic", "Diabetic"]
cm  = confusion_matrix(y_test, log_pred)
plt.figure()
plot_confusion_matrix(cm, hide_ticks=True, cmap="Reds")
plt.xticks(range(2), labels, fontsize=14)
plt.yticks(range(2), labels, fontsize=14)
plt.show()

## Random Forest Classifier

In [None]:
labels = ["Not Diabetic", "Diabetic"]
cm  = confusion_matrix(y_test, rf_pred)
plt.figure()
plot_confusion_matrix(cm, hide_ticks=True, cmap="Blues")
plt.xticks(range(2), labels, fontsize=14)
plt.yticks(range(2), labels, fontsize=14)
plt.show()

## K Neighbors Classifier

In [None]:
labels = ["Not Diabetic", "Diabetic"]
cm  = confusion_matrix(y_test, knn_pred)
plt.figure()
plot_confusion_matrix(cm, hide_ticks=True, cmap="Greens")
plt.xticks(range(2), labels, fontsize=14)
plt.yticks(range(2), labels, fontsize=14)
plt.show()
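The confusion matrices connect directly to the precision and recall values reported earlier. The sketch below uses made-up labels and predictions to show how the four cells map to those metrics for the positive (diabetic) class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted diabetics, how many really are
recall = tp / (tp + fn)     # of the actual diabetics, how many were found

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```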

We can conclude that the **Random Forest Classifier** model works best in this case.

I hope you liked my notebook. Do not forget to upvote it.
## Thank You