# **DIABETES DATA: EDA AND MODEL BUILDING**

![](https://images.everydayhealth.com/images/diabetes-awareness-month-1440x810.jpg)

# Overview

Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy. Sometimes your body doesn’t make enough—or any—insulin or doesn’t use insulin well. Glucose then stays in your blood and doesn’t reach your cells.

Over time, having too much glucose in your blood can cause health problems. Although diabetes has no cure, you can take steps to manage your diabetes and stay healthy.

According to Wikipedia, the number of people with diabetes in India increased from 26 million in 1990 to 65 million in 2016. According to the 2019 National Diabetes and Diabetic Retinopathy Survey report released by the Ministry of Health and Family Welfare, the prevalence was found to be 11.8% in people over the age of 50. The prevalence of diabetes is 6.5% and of prediabetes 5.7% among adults below the age of 50 years, according to the DHS survey. The prevalence was similar in the male (12%) and female (11.7%) populations, and it was higher in urban areas. When surveyed for diabetic retinopathy, which threatens eyesight, 16.9% of the diabetic population aged up to 50 years were found to be affected. Per the report, the prevalence of diabetic retinopathy was 18.6% in the 60-69-years age group, 18.3% in the 70-79-years age group, and 18.4% in those over 80 years of age. A lower prevalence of 14.3% was observed in the 50-59-years age group. A high prevalence of diabetes is reported in economically and epidemiologically advanced states such as Tamil Nadu and Kerala, where many of the research institutes that conduct prevalence studies are also located.

# How will we proceed?

1. **Understanding the Data**

2. **EDA**

3. **Model Building**

4. **Model Performance**

5. **Inference**


# **UNDERSTANDING THE DATA**

# Including Required Packages 

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

**READING THE DATA**

In [None]:
df= pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
df.head()


# What are the features?


In [None]:
df.columns

In [None]:
df.shape

So the dataset has 9 columns: 8 features used to predict diabetes, plus the target column `Outcome`.

In [None]:
df.info()

**DESCRIPTION OF THE DATASET**

In [None]:
df.describe()

**Let Us Know if We Have any missing values**

In [None]:
# Step 1: list the features that contain at least one missing value
features_with_na = [feature for feature in df.columns if df[feature].isnull().sum() > 0]
# Step 2: print each feature name with its percentage of missing values
for feature in features_with_na:
    print(feature, np.round(df[feature].isnull().mean() * 100, 2), '% missing values')
features_with_na

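**GREAT! `isnull()` finds no null values in the dataset, which makes the work a lot easier. Keep in mind, though, that some columns contain zeros (for example `Insulin` or `BMI`), and these are commonly treated as missing measurements in this dataset.**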
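
Because a null check does not catch missing values that are stored as zeros, it can help to count the zeros explicitly. The following is a minimal sketch on a toy frame; the values, the name `toy`, and the list `zero_as_missing` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Pregnancies": [6, 1, 0, 1],  # a zero here is a legitimate value
})

# Columns where a zero is assumed to be physiologically implausible.
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

# Count the zeros per column.
zero_counts = (toy[zero_as_missing] == 0).sum()
print(zero_counts)

# Recode the zeros as NaN so that isnull() reports them.
cleaned = toy.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Whether to impute or drop the recoded values afterwards is a separate modelling decision.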

# **EDA**

# Number of Numerical Variables

In [None]:
numerical_features = [feature for feature in df.columns if df[feature].dtypes != 'O']
len(numerical_features)


In [None]:
numerical_features

Wow!! We got to know all of the features are numerical variables ! 

**We need to know the number of discrete variables, Let us find it out !**

In [None]:
discrete_feature=[feature for feature in numerical_features if len(df[feature].unique())<25]
print("Discrete Variables Count: {}".format(len(discrete_feature)))

In [None]:
discrete_feature

**Now let's deal with the Continuous Variables**

In [None]:
continuous_feature=[feature for feature in numerical_features if feature not in discrete_feature]
print("Continuous feature Count {}".format(len(continuous_feature)))

In [None]:
for feature in continuous_feature:
    # Plot a histogram of each continuous feature
    df[feature].hist(bins=25)
    plt.xlabel(feature)
    plt.ylabel("Count")
    plt.title(feature)
    plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
sns.heatmap(df.corr(),annot=True,ax=ax)

**Results against the Age**

In [None]:
sns.displot(x='Age', hue='Outcome', data=df, alpha=0.6)
plt.show()

In [None]:
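# Take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning
diabetes = df[df['Outcome']==1].copy()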
sns.displot(diabetes.Age, kind='kde')
plt.show()

In [None]:
sns.displot(diabetes.Age, kind='ecdf')
plt.grid(True)
plt.show()

In [None]:
ranges = [0, 30, 40, 50, 60, 70, np.inf]
labels = ['0-30', '30-40', '40-50', '50-60', '60-70', '70+']

diabetes['Age'] = pd.cut(diabetes['Age'], bins=ranges, labels=labels)
diabetes['Age'].head()

In [None]:
# Pass the column as a keyword argument; recent seaborn versions require x= here
sns.countplot(x=diabetes.Age)

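**WE SEE THAT AGES BETWEEN 50-60 ARE THE MOST PRONE TO DIABETES. KEEP IN MIND THAT THE PLOT COUNTS DIABETIC PATIENTS PER AGE GROUP, SO IT REFLECTS THE SIZE OF EACH GROUP AS WELL AS ITS RISK.**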
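
To separate risk from group size, the share of diabetic patients within each age bin can be computed alongside the raw count. This is a minimal sketch on made-up data; `toy` and `summary` are hypothetical names, and the bins simply reuse the `ranges` and `labels` defined in the cell above.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df; ages and outcomes are made up for illustration.
toy = pd.DataFrame({
    "Age": [22, 25, 28, 35, 38, 45, 52, 55, 58, 63],
    "Outcome": [0, 0, 1, 0, 1, 1, 1, 1, 0, 0],
})

ranges = [0, 30, 40, 50, 60, 70, np.inf]
labels = ['0-30', '30-40', '40-50', '50-60', '60-70', '70+']

# Bin every patient (not only the diabetic ones) by age.
age_bin = pd.cut(toy["Age"], bins=ranges, labels=labels)

# Per bin: number of diabetic cases, group size, and the resulting rate.
summary = toy.groupby(age_bin, observed=True)["Outcome"].agg(
    cases="sum", total="count", rate="mean"
)
print(summary)
```

The `rate` column is the quantity that supports a statement about which age group is most prone to diabetes.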

In [None]:
df.head()

In [None]:
# BMI density for diabetic patients, then for the whole dataset
sns.displot(diabetes.BMI, kind='kde')
sns.displot(df.BMI, kind='kde')
plt.show()


In [None]:
df.head()

In [None]:
categorical_vars = ['Pregnancies']
continuous_vars= ['Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']

In [None]:
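for feature in continuous_vars:
    # Skip features that contain zeros, because log(0) is undefined
    if (df[feature] == 0).any():
        continue
    data = df.copy()
    # Box plot of the log-transformed feature
    data[feature] = np.log(data[feature])
    data.boxplot(column=feature)
    plt.ylabel(feature)
    plt.title(feature)
    plt.show()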

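**NOTE THAT THERE AREN'T OUTLIERS IN THE FEATURES PLOTTED ABOVE (ONLY FEATURES WITHOUT ZERO VALUES ARE PLOTTED, AFTER A LOG TRANSFORM)**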

# **MODEL BUILDING**

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.metrics import accuracy_score

**PREPARING THE DATASET FOR MODEL**

In [None]:
# Create a real copy, so that preprocessing does not modify df
data = df.copy()

In [None]:
data.head()

In [None]:

scaler = StandardScaler()

# One-hot encode the categorical column; dtype=float keeps the frame purely numeric
data = pd.get_dummies(data, columns=categorical_vars, drop_first=True, dtype=float)

# Standardise the continuous columns.
# Note: the scaler is fitted on the full dataset here, before the train/test split,
# so test-set statistics leak into the preprocessing.
data[continuous_vars] = scaler.fit_transform(data[continuous_vars])

# Define the features and the target
X = data.drop(['Outcome'], axis=1)
y = data['Outcome']



In [None]:
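# Fix the seed for reproducibility and stratify so both splits keep the class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=42)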
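
Scaling the whole dataset before splitting lets test-set statistics influence the preprocessing. One common way to avoid this is to put the scaler and the model in a single pipeline that is fitted on the training split only. The sketch below uses synthetic data from `make_classification` as a stand-in for the diabetes features; `X_demo`, `y_demo`, and `pipe` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target (not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Split first, so that nothing is learned from the held-out rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training rows only and
# reuses those statistics when transforming the test rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

The same pipeline object can be passed to `GridSearchCV` or `cross_val_score`, which then refit the scaler inside every fold.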

In [None]:
lr = LogisticRegression(random_state=42)

knn = KNeighborsClassifier()
para_knn = {'n_neighbors':np.arange(1, 50)}

grid_knn = GridSearchCV(knn, param_grid=para_knn, cv=5)

dt = DecisionTreeClassifier()
para_dt = {'criterion':['gini','entropy'],'max_depth':np.arange(1, 100), 'min_samples_leaf':[1,2,4,5,10,20,30,40,80,100]}
grid_dt = GridSearchCV(dt, param_grid=para_dt, cv=5)

rf = RandomForestClassifier()

# Define the dictionary 'params_rf'
params_rf = {
    'n_estimators':[100, 350, 500],
    'min_samples_leaf':[2, 10, 30]
}
grid_rf = GridSearchCV(rf, param_grid=params_rf, cv=5)

In [None]:
dt = DecisionTreeClassifier(criterion='gini', max_depth=20, min_samples_leaf=5, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=2, random_state=42)
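
The grid-search objects defined earlier are not fitted in the cells shown, and the hyperparameters above are hard-coded. As a minimal sketch of how a grid search could supply such values, the example below fits a KNN grid on synthetic data and reads off the selected setting; `X_demo`, `y_demo`, and `grid_knn_demo` are hypothetical names, and the grid is smaller than the one in the notebook to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train (not the real dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Search over the number of neighbours with 5-fold cross-validation.
grid_knn_demo = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': np.arange(1, 20)},
    cv=5,
)
grid_knn_demo.fit(X_demo, y_demo)

# best_params_ holds the winning setting; best_estimator_ is already
# refitted on all of the data passed to fit().
best_k = grid_knn_demo.best_params_['n_neighbors']
best_knn = grid_knn_demo.best_estimator_
print(best_k, round(grid_knn_demo.best_score_, 3))
```

The fitted `best_estimator_` can be used directly in the classifier list instead of re-creating the model by hand.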

In [None]:
# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt), ('Random Forest', rf)]


# **MODEL PERFORMANCES**

In [None]:
for clf_name, clf in classifiers:

    # Fit clf to the training set (np.ravel gives the 1-D target that scikit-learn expects)
    clf.fit(X_train, np.ravel(y_train))

    # Predict the labels of the test set
    y_pred = clf.predict(X_test)

    # Calculate accuracy; the signature is accuracy_score(y_true, y_pred)
    accuracy = accuracy_score(y_test, y_pred)

    # Report clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

**WE SEE THAT LOGISTIC REGRESSION PERFORMS THE BEST ON THIS TEST SPLIT (10% OF THE DATA)**
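
A ranking based on one small test split can change with the random split. A cross-validated comparison is usually more stable, and `cross_val_score` is already imported for that purpose. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, `models`, and `cv_results` are hypothetical names, and only two of the classifiers are included to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X / y (not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Classification Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
}

# The same stratified folds are used for every model, so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
    cv_results[name] = (scores.mean(), scores.std())
    print('{:s} : {:.3f} +/- {:.3f}'.format(name, scores.mean(), scores.std()))
```

Reporting the standard deviation next to the mean shows whether the gap between two models is larger than the fold-to-fold noise.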

In [None]:
# AdaBoost with the random forest as base learner.
# scikit-learn >= 1.2 names this argument `estimator`; older releases used `base_estimator`.
ada = AdaBoostClassifier(estimator=rf, n_estimators=100, random_state=1)

ada.fit(X_train, np.ravel(y_train))

y_pred = ada.predict(X_test)

accuracy_score(y_test, y_pred)

In [None]:
importances = pd.Series(data=rf.feature_importances_,
                        index=X_train.columns)

# Sort importances
importances_sorted = importances.sort_values()

# Draw a horizontal barplot of importances_sorted
plt.figure(figsize=(10, 10))
importances_sorted.plot(kind='barh', color='orange')
plt.title('Feature Importances')
plt.show()

# LIGHTGBM

In [None]:
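import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def cross_val(X, y, model, params, folds=5):
    """Fit `model` on each stratified fold, print the fold accuracy and return the last fitted model."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=21)
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        print(f"Fold: {fold}")
        x_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        x_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

        alg = model(**params)
        # LightGBM >= 4.0 takes early stopping and logging as callbacks
        # instead of the early_stopping_rounds / verbose arguments of fit().
        # The validation fold is also used for early stopping, so the
        # accuracy printed below is somewhat optimistic.
        alg.fit(x_train, np.ravel(y_train),
                eval_set=[(x_test, np.ravel(y_test))],
                callbacks=[lgb.early_stopping(stopping_rounds=100),
                           lgb.log_evaluation(period=400)])

        pred = alg.predict(x_test)
        accuracy = accuracy_score(y_test, pred)
        print(f" accuracy : {accuracy}")
        print("-"*50)
    return alg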
        



In [None]:
lgb_params= {'learning_rate': 0.0001, 
             'n_estimators': 20000, 
             'max_bin': 94,
             'num_leaves': 5, 
             'max_depth': 30, 
             'reg_alpha': 8.457, 
             'reg_lambda': 6.853, 
             'subsample': 0.749}

In [None]:
from lightgbm import LGBMClassifier
lgb_model = cross_val(X, y, LGBMClassifier, lgb_params)

# XGBoost

In [None]:
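from xgboost import XGBClassifier

classifier = XGBClassifier(
    n_estimators=10000,
    tree_method='hist',     # on XGBoost >= 2.0, add device='cuda' to train on a GPU (replaces 'gpu_hist' / 'gpu_predictor')
    learning_rate=0.01,
    max_depth=29,
    max_leaves=31,
    eval_metric='logloss',  # binary target, so 'logloss' rather than the multi-class 'mlogloss'
    verbosity=3,
)
# Fit on the training split only, so the test-set accuracy below is not inflated
classifier.fit(X_train, y_train)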

In [None]:
y_pred = classifier.predict(X_test)
# accuracy_score takes (y_true, y_pred); np.ravel flattens the target without overwriting y_test
print("accuracy_score_XGBOOST: ", accuracy_score(np.ravel(y_test), y_pred))

# NEURAL NETWORK APPROACH

**IMPORTING THE NECESSARY LIBRARIES**

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout

In [None]:
X_train.shape

In [None]:

model = tf.keras.Sequential()
# Take the input width from the data instead of hard-coding it to 23
model.add(Dense(1024, input_dim=X_train.shape[1], activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.4))
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(32, activation="relu"))
model.add(Dropout(0.2))
# Sigmoid output: binary_crossentropy expects a probability between 0 and 1
model.add(Dense(1, activation="sigmoid"))
model.summary()  # Print model summary

In [None]:
model.compile(loss= "binary_crossentropy" , optimizer="adam", metrics=["accuracy"])
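
Binary cross-entropy compares the labels with predicted probabilities, so the output of the last layer has to pass through a sigmoid (in Keras, the alternative is a loss object created with `from_logits=True`). The NumPy sketch below illustrates the relationship; `sigmoid`, `binary_crossentropy`, and the sample `logits` are written for this illustration and are not the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    """Map raw scores (logits) to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy; p must already be a probability."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

# Made-up labels and raw network outputs for four samples.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
logits = np.array([2.0, -1.0, 0.5, -3.0])

# Convert the logits to probabilities before computing the loss.
probs = sigmoid(logits)
loss = binary_crossentropy(y_true, probs)
print(probs, loss)
```

Feeding raw logits into a loss that expects probabilities produces values outside (0, 1) that the loss can only clip, which distorts both the loss and the reported accuracy.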

In [None]:
Performance = model.fit(X_train, y_train, validation_split =0.1,epochs=5)

In [None]:
model.evaluate(X_test,y_test)

In [None]:
my_dpi = 50  # dots per inch (resolution)
plt.figure(figsize=(400/my_dpi, 400/my_dpi), dpi=my_dpi)
plt.plot(Performance.history['accuracy'], label='train accuracy')
plt.plot(Performance.history['val_accuracy'], label='val accuracy')
plt.legend()
# Save before show(): once the figure has been shown, savefig would write an empty image
plt.savefig('AccVal_acc')
plt.show()

# Inference

**The accuracies of the models are:**
1. **Logistic Regression : 0.805**
2. **K Nearest Neighbours : 0.727**
3. **Classification Tree : 0.740**
4. **Random Forest : 0.779**
5. **AdaBoost Classifier : 0.779**
6. **ANN : 0.770**

**Note that the neural network overfits. It is not advisable to use one here, because the data contain no complex patterns or high degree of non-linearity that would call for it.**

**So we see that the most important factors associated with diabetes are age and blood glucose level. It is therefore advisable to take particular care of elderly people, and the guidelines below may help.
Doctors generally advise people to get their blood sugar tested when they:**

**Urinate (pee) a lot, often at night**

**Are very thirsty**

**Lose weight without trying**

**Are very hungry**

**Have blurry vision**

**Have numb or tingling hands or feet**

**Feel very tired**

**Have very dry skin**

**Have sores that heal slowly**

**Have more infections than usual**




# THANK YOU! IF YOU LIKE THE NOTEBOOK, PLEASE UPVOTE