**Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high.**
The objective of the dataset is to **diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset**. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

### Context
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.

### Content
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

## Setting up Environment

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")


# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Cleaning dataset

In [None]:
df = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')
df.head()

The different columns present in the dataset are:

- **Pregnancies** -> Number of times Pregnant

- **Glucose** -> Plasma glucose concentration

- **BloodPressure** -> Diastolic blood pressure (mm Hg)

- **SkinThickness** -> Triceps skin fold thickness (mm)

- **Insulin** -> 2-Hour serum insulin (mu U/ml)

- **BMI** -> Body Mass Index

- **DiabetesPedigreeFunction** -> Diabetes pedigree function (a function which scores likelihood of diabetes based on family history).

- **Age** -> Age in years

- **Outcome** -> Whether the patient is diabetic or not: 0 means the patient is not diabetic and 1 means the patient is diabetic.

In [None]:
df.info()

In [None]:
for col in df.columns:
    print("The minimum value for the column {} is {}".format(col, df[col].min()))


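A value of 0 is not physiologically plausible for Glucose, BloodPressure, SkinThickness, Insulin or BMI, so zeros in these columns most likely stand for missing measurements. Let's replace those 0 values with null.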

In [None]:
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[zero_cols] = df[zero_cols].replace(0, np.nan)
df.info()
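
Only some columns are converted because a zero in Pregnancies is a legitimate count, while a zero in Glucose or BMI is not a plausible reading. The sketch below illustrates this on a tiny made-up frame; the name `toy` and all of its values are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from diabetes.csv.
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5],   # 0 is a valid count here
    "Glucose": [148, 0, 120],   # 0 stands in for "not recorded"
    "BMI": [33.6, 26.6, 0.0],   # 0 stands in for "not recorded"
})

zero_as_missing = ["Glucose", "BMI"]

# Count zeros before converting them, so the change can be audited.
zeros_before = (toy[zero_as_missing] == 0).sum()

# Convert zeros to NaN only in the columns where 0 is implausible.
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)
missing_after = toy.isnull().sum()

print(zeros_before)
print(missing_after)
```

Counting the zeros first makes it easy to confirm that the number of nulls introduced matches the number of zeros that were replaced.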

## EDA and Visualization

In [None]:
import seaborn as sns
from itertools import cycle
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])

sns.countplot(x='Outcome', data=df)  # pass the column by keyword; recent seaborn versions require this
plt.show()

In [None]:
df['Outcome'].value_counts()

In [None]:
import plotly.graph_objects as go

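# Derive the class shares from the data instead of hard-coding them
outcome_pct = df['Outcome'].value_counts(normalize=True).mul(100).round(2)
labels = ['Diabetic' if k == 1 else 'Non-Diabetic' for k in outcome_pct.index]
pull = [0.05 if k == 1 else 0 for k in outcome_pct.index]  # pull out the diabetic slice
fig = go.Figure(data=[go.Pie(labels=labels, values=outcome_pct.values, pull=pull)])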
fig.show()

In [None]:
def msv_1(data, thresh = 20, color = 'black', edgecolor = 'black', height = 3, width = 15):
    
    plt.figure(figsize = (width, height))
    percentage = (data.isnull().mean()) * 100
    percentage.sort_values(ascending = False).plot.bar(color = color, edgecolor = edgecolor)
    plt.axhline(y = thresh, color = 'r', linestyle = '-')
    
    plt.title('Missing values percentage per column', fontsize=20, weight='bold' )
    
    plt.text(len(data.isnull().sum()/len(data))/1.7, thresh+2.5, f'Columns with more than {thresh}% missing values', fontsize=10, color='crimson',
         ha='left' ,va='top')
    plt.text(len(data.isnull().sum()/len(data))/1.7, thresh - 0.5, f'Columns with less than {thresh}% missing values', fontsize=10, color='green',
         ha='left' ,va='top')
    plt.xlabel('Columns', size=15, weight='bold')
    plt.ylabel('Missing values percentage')
    
    return plt.show()
msv_1(df, 15, color=sns.color_palette('Oranges',15))

In [None]:
df['Insulin'] = df['Insulin'].fillna(df['Insulin'].median()) # Filling null values with the median.

for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI']:
    df[col] = df[col].fillna(df[col].mean())
df.isnull().sum()

**All null values have been taken care of.**

Let's have a look at the distribution of the data.

In [None]:
columns=df.columns
columns=list(columns)
columns.pop()
print("Column names except for the target column are :",columns)

#Graphs to be plotted with these colors
colours=['r','c','k','m','r','y','b','g']
sns.set(rc={'figure.figsize':(15,17)})
sns.set_style(style='white')
for i in range(len(columns)):
    
    plt.subplot(4,2,i+1)
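    # distplot is deprecated in recent seaborn; histplot plus rugplot gives a similar view
    sns.histplot(x=df[columns[i]], kde=True, stat='density', color=colours[i])
    sns.rugplot(x=df[columns[i]], color=colours[i])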

In [None]:
sns.lmplot(data=df, x="Age", y="BloodPressure",hue = "Outcome",palette="Set2",col = "Outcome")
sns.lmplot(data=df, x="Age", y="Glucose",hue = "Outcome",palette="Set1")
sns.lmplot(data=df, x="Age", y="SkinThickness",hue = "Outcome",palette="Set3")

In [None]:
import plotly.express as px
for i in df.columns:
    if i!='Outcome':
        fig = px.box(df, x='Outcome', y=i, color='Outcome')
        fig.show()


#### Thus, we reach a few important conclusions.
- Higher glucose levels are associated with a higher chance of diabetes.
- The probability of diabetes is higher when blood pressure is high.
- The higher the insulin level, the higher the chance of diabetes.
- The higher the BMI, the higher the chance of diabetes.
- Diabetic patients tend to have a higher DiabetesPedigreeFunction value, i.e. family history plays some role in diabetes among these patients.
- Diabetes is less common among younger patients.

### **Let's now check how the number of pregnancies relates to diabetes.**

In [None]:
fig = px.histogram(df, x = df['Pregnancies'], color = 'Outcome')
fig.update_layout(
    bargap=0.2)
fig.show()

We can see that **the higher the number of pregnancies**, **the higher the risk of having diabetes**.

Let's now check **skewness of the data**.

In [None]:
# pandas' Series.skew() is used here, so no separate scipy import is needed
for col in df.drop('Outcome', axis = 1).columns:
    print("Skewness for the column {} is {}".format(col, df[col].skew()))

We see that the columns **Insulin** and **DiabetesPedigreeFunction** are quite skewed. On the other hand, columns like **Pregnancies, Glucose, BloodPressure, SkinThickness** and **BMI** are much less skewed. This is why the null values in Glucose, BloodPressure, SkinThickness and BMI were filled with the mean earlier, while a skewed column like Insulin was filled with the median, which is less affected by the long tail.
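
To make the mean-versus-median reasoning concrete, here is a small sketch on made-up, right-skewed values. The numbers are illustrative assumptions, not dataset values; the point is only that a single large value pulls the mean far more than the median, so the two fill values differ a lot.

```python
import pandas as pd

# Made-up, right-skewed values with one large outlier and two missing entries.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 200.0, None, None])

skewness = values.skew()          # positive for a long right tail
mean_value = values.mean()        # pulled up by the outlier
median_value = values.median()    # stays near the bulk of the data

mean_fill = values.fillna(mean_value)
median_fill = values.fillna(median_value)

print("skewness:", skewness)
print("value used by mean imputation:", mean_value)
print("value used by median imputation:", median_value)
```

Here the mean lands far above every typical observation, while the median stays with the bulk of the data, which is the usual argument for median imputation on skewed columns.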

## Correlation Matrix

In [None]:
corr = df.corr()
f, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(corr,vmax=1.0, center=0,
            square=True, linewidths=1, cbar_kws={"shrink": .5}, annot = True)

From the above heatmap, we can observe that the features are only weakly correlated with one another, so multicollinearity is unlikely to be a concern here.
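
A heatmap reading can be cross-checked by listing the feature pairs whose absolute correlation exceeds a cut-off. The sketch below is one possible way to do this; the helper name `strongest_pairs`, the 0.7 threshold and the synthetic columns `a`, `b`, `c` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def strongest_pairs(frame, threshold=0.7):
    """Return (col_a, col_b, r) for column pairs with |r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic data: "b" is nearly a scaled copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = strongest_pairs(toy, threshold=0.7)
print(pairs)
```

An empty result on the real feature matrix would support the conclusion drawn from the heatmap; the threshold itself is a judgment call.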

## Dataset Splitting and Features Scaling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop('Outcome', axis = 1)
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify = y)

sc = StandardScaler()
X_train = pd.DataFrame(sc.fit_transform(X_train), columns=X.columns)
# Reuse the statistics learned on the training split so both splits are scaled
# consistently and no test information is used
X_test = pd.DataFrame(sc.transform(X_test), columns=X.columns)
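
A note on scaling: the mean and standard deviation should be estimated on the training split only and then reused for the test split. The sketch below reproduces the standardisation by hand with NumPy on made-up numbers (the arrays are illustrative assumptions) and contrasts it with what happens if the test split is scaled with its own statistics.

```python
import numpy as np

# Made-up single-feature splits; illustrative only.
X_train_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test_toy = np.array([[6.0], [7.0]])

# Statistics are estimated on the training split only.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

X_train_scaled = (X_train_toy - mu) / sigma
X_test_scaled = (X_test_toy - mu) / sigma  # reuse the training statistics

# Scaling the test split with its own statistics centres it on its own mean,
# which hides the fact that these values lie above the training range.
X_test_refit = (X_test_toy - X_test_toy.mean(axis=0)) / X_test_toy.std(axis=0)

print(X_test_scaled.ravel())
print(X_test_refit.ravel())
```

With the training statistics, the test values keep their position relative to the training data; with refitting, they are mapped to -1 and 1 regardless of where they sit.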

## Baseline Models

## Modelling

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
import xgboost as xgb
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

Let's model these one by one.

In [None]:
log_params = {'penalty':['l1', 'l2'], 
              'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100], 
              'solver':['liblinear', 'saga']} 
log_model = GridSearchCV(LogisticRegression(), log_params, cv=5) #Tuning the hyper-parameters
log_model.fit(X_train, y_train)
log_predict = log_model.predict(X_test)
log_score = log_model.best_score_

In [None]:
# knn
knn_params = {'n_neighbors': list(range(3, 20, 2)),
          'weights':['uniform', 'distance'],
          'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute'],
          'metric':['euclidean', 'manhattan', 'chebyshev', 'minkowski']}
knn_model = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5) #Tuning the hyper-parameters
knn_model.fit(X_train, y_train)
knn_predict = knn_model.predict(X_test)
knn_score = knn_model.best_score_

In [None]:
# Support Vector Classification(svc)
svc_params = {'C': [0.001, 0.01, 0.1, 1],
              'kernel': [ 'linear' , 'poly' , 'rbf' , 'sigmoid' ]}
svc_model = GridSearchCV(SVC(), svc_params, cv=5) #Tuning the hyper-parameters
svc_model.fit(X_train, y_train)
svc_predict = svc_model.predict(X_test)
svc_score = svc_model.best_score_

In [None]:
# Decision Tree
dt_params = {'criterion' : ['gini', 'entropy'],
              'splitter': ['random', 'best'], 
              'max_depth': [3, 5, 7, 9, 11, 13]}
dt_model = GridSearchCV(DecisionTreeClassifier(), dt_params, cv=5) #Tuning the hyper-parameters
dt_model.fit(X_train, y_train)
dt_predict = dt_model.predict(X_test)
dt_score = dt_model.best_score_

In [None]:
# Random Forest
rf_params = {'criterion' : ['gini', 'entropy'],
             'n_estimators': list(range(5, 26, 5)),
             'max_depth': list(range(3, 20, 2))}
rf_model = GridSearchCV(RandomForestClassifier(), rf_params, cv=5) #Tuning the hyper-parameters
rf_model.fit(X_train, y_train)
rf_predict = rf_model.predict(X_test)
rf_score = rf_model.best_score_

In [None]:
# lgb
lgb_params = {'n_estimators': [5, 10, 15, 20, 25, 50, 100],
                   'learning_rate': [0.01, 0.05, 0.1],
                   'num_leaves': [7, 15, 31],
                  }
lgb_model = GridSearchCV(LGBMClassifier(), lgb_params, cv=5) #Tuning the hyper-parameters
lgb_model.fit(X_train, y_train)
lgb_predict = lgb_model.predict(X_test)
lgb_score = lgb_model.best_score_


In [None]:
# xgb
xgb_params = {'max_depth': [3, 5, 7, 9],
              'n_estimators': [5, 10, 15, 20, 25, 50, 100],
              'learning_rate': [0.01, 0.05, 0.1]}
xgb_model = GridSearchCV(xgb.XGBClassifier(eval_metric='logloss'), xgb_params, cv=5) #Tuning the hyper-parameters
xgb_model.fit(X_train, y_train)
xgb_predict = xgb_model.predict(X_test)
xgb_score = xgb_model.best_score_

## Evaluation
Time to evaluate the models.

In [None]:
models = ['LogisticRegression', 'KNeighborsClassifier', 'SVC', 'DecisionTreeClassifier', 
          'RandomForestClassifier', 'LGBMClassifier', 'XGBClassifier']
scores = [log_score, knn_score, svc_score,dt_score,rf_score, lgb_score, xgb_score]
score_table = pd.DataFrame({'Model':models, 'Score':scores})
# Keep the sorted table so the printout and the bar plot share the same order
score_table = score_table.sort_values(by='Score', ascending=False)
print(score_table)
sns.barplot(x='Score', y='Model', data=score_table, palette='inferno');

Thus, **Logistic Regression** is the best performing model in terms of mean cross-validation accuracy on the training set.
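
Note that the `Score` values come from `best_score_`, which is the mean cross-validated accuracy on the training split, not the accuracy on the held-out test split. The sketch below shows both numbers side by side. It uses synthetic data from `make_classification` rather than the diabetes data, and the small `C` grid is an assumption chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; not the diabetes dataset.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_          # mean accuracy over the 5 CV folds
test_accuracy = search.score(X_te, y_te)  # accuracy of the refit best model on held-out data

print(f"CV accuracy:   {cv_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Reporting the test accuracy next to the cross-validation score gives a check on whether the model ranking carries over to unseen data.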

In [None]:
from sklearn import metrics
print('Classification Report_test','\n',metrics.classification_report(y_test, log_predict))
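
The precision, recall and F1 columns of the classification report can be reproduced by hand from the four confusion-matrix counts. The sketch below does this in plain Python for the positive class; the helper name `binary_metrics` and the label lists are made up for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion counts and precision/recall/F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 1 = diabetic, 0 = not diabetic.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

result = binary_metrics(y_true_toy, y_pred_toy)
print(result)
```

For a screening task like this one, recall on the diabetic class is often the number to watch, since a false negative means a missed case.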

Hope you liked the notebook. Any suggestions would be highly appreciated.

**Please upvote if you liked it.**