## Problem Statement

<div class="alert alert-warning">
    <strong>We aim to predict whether a given individual is likely to have diabetes, using a Kaggle dataset of 768 records covering diabetic and non-diabetic individuals.<br><br>
        Variables in the dataset:<br>
        1) Pregnancies<br>
        2) Glucose<br>
        3) Blood Pressure <br>
        4) Skin Thickness <br>
        5) Insulin <br>
        6) BMI <br>
        7) Diabetes Pedigree Function [a score of the likelihood of diabetes based on family history]<br>
        8) Age <br>
        9) Diabetic Outcome [the target variable we want to predict]<br>
    </strong>

</div>

In [1]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics
from sklearn.utils import resample,shuffle
from sklearn import metrics 
from sklearn.linear_model import LogisticRegression

In [2]:
data = pd.read_csv("diabetes.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


---
## Improving Data By Further Cleaning

In [3]:
#Check for NULL data
data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [4]:
#Correlation between Variables
data.corr()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 1.0 | 0.129459 | 0.141282 | -0.081672 | -0.073535 | 0.017683 | -0.033523 | 0.544341 | 0.221898 |
| Glucose | 0.129459 | 1.0 | 0.15259 | 0.057328 | 0.331357 | 0.221071 | 0.137337 | 0.263514 | 0.466581 |
| BloodPressure | 0.141282 | 0.15259 | 1.0 | 0.207371 | 0.088933 | 0.281805 | 0.041265 | 0.239528 | 0.065068 |
| SkinThickness | -0.081672 | 0.057328 | 0.207371 | 1.0 | 0.436783 | 0.392573 | 0.183928 | -0.11397 | 0.074752 |
| Insulin | -0.073535 | 0.331357 | 0.088933 | 0.436783 | 1.0 | 0.197859 | 0.185071 | -0.042163 | 0.130548 |
| BMI | 0.017683 | 0.221071 | 0.281805 | 0.392573 | 0.197859 | 1.0 | 0.140647 | 0.036242 | 0.292695 |
| DiabetesPedigreeFunction | -0.033523 | 0.137337 | 0.041265 | 0.183928 | 0.185071 | 0.140647 | 1.0 | 0.033561 | 0.173844 |
| Age | 0.544341 | 0.263514 | 0.239528 | -0.11397 | -0.042163 | 0.036242 | 0.033561 | 1.0 | 0.238356 |
| Outcome | 0.221898 | 0.466581 | 0.065068 | 0.074752 | 0.130548 | 0.292695 | 0.173844 | 0.238356 | 1.0 |


<div class="alert alert-warning">
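    <strong>Several columns (Glucose, BloodPressure, SkinThickness, Insulin, BMI) contain the integer 0 where a zero reading is not physiologically plausible, so these zeros stand in for missing values. We replace them with NaN.</strong>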

</div>
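Before converting the placeholders, it can help to count how many zeros each affected column holds and to confirm that columns where 0 is a legitimate value are left alone. The following is a minimal sketch on a made-up toy frame; the values and the names `toy`, `zero_as_missing`, and `cleaned` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 0, 8, 0],  # 0 is a legitimate value for this column
})

# Columns where 0 is treated as a placeholder for a missing measurement.
zero_as_missing = ["Glucose", "BMI"]

# Count the zero placeholders per column before converting them.
zero_counts = (toy[zero_as_missing] == 0).sum()

# Convert the placeholders to NaN; Pregnancies is deliberately left untouched.
cleaned = toy.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
missing_counts = cleaned.isnull().sum()

print(zero_counts)
print(missing_counts)
```

Restricting the replacement to an explicit column list is what keeps genuine zeros, such as zero pregnancies, from being turned into missing values.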



In [5]:
# Replace 0 with NaN in the columns where 0 denotes a missing value, then count NaNs per column
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
cols_with_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
data[cols_with_missing] = data[cols_with_missing].replace(0, np.nan)
data.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

<div class="alert alert-warning">
    <strong>We drop Insulin and SkinThickness because too many of their values are missing (374 and 227 of 768, respectively) and both correlate only weakly with the diabetes outcome (about 0.13 and 0.07). <br><br>BloodPressure also correlates weakly with the outcome, but it has few missing values (35), so we keep it. Replacing its missing values with the median might improve the correlation.</strong>

</div>
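The decision above can be made more concrete by expressing the missing counts as a fraction of the 768 rows. This is a rough sketch using the counts reported in the output above; the 20% cutoff is a hypothetical threshold chosen only for illustration, since the notebook does not state one.

```python
import pandas as pd

# Missing-value counts copied from the isnull().sum() output above.
n_rows = 768
missing = pd.Series({
    "Glucose": 5,
    "BloodPressure": 35,
    "SkinThickness": 227,
    "Insulin": 374,
    "BMI": 11,
})

# Share of rows with a missing value, per column.
missing_fraction = missing / n_rows

# Hypothetical cutoff for illustration only; the notebook does not specify one.
cutoff = 0.20
to_drop = sorted(missing_fraction[missing_fraction > cutoff].index)

print(missing_fraction.round(3))
print(to_drop)
```

Under this assumed cutoff, only Insulin and SkinThickness exceed the threshold, while BloodPressure, BMI, and Glucose each have under 5% missing.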

In [6]:
# Drop Insulin and SkinThickness as explained above.
data = data.drop(columns=["Insulin", "SkinThickness"])

In [7]:
# Summary statistics; the 50% row gives the median of each column
data.describe()

| | Pregnancies | Glucose | BloodPressure | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|
| count | 768.0 | 763.0 | 733.0 | 757.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 121.686763 | 72.405184 | 32.457464 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 30.535641 | 12.382158 | 6.924988 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 44.0 | 24.0 | 18.2 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 64.0 | 27.5 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 32.3 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 141.0 | 80.0 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [8]:
# Replace missing (NaN) values in Glucose, BMI and BloodPressure with each column's median
# (117, 32.3 and 72 respectively, per the 50% row of describe())
for col in ["Glucose", "BMI", "BloodPressure"]:
    data[col] = data[col].fillna(data[col].median())
data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

<div class="alert alert-warning">
    <strong>We replace NaN values with the median rather than the mean because the mean is sensitive to outliers.<br>
We do not use the mode because these variables are continuous rather than discrete, so the most frequent value is not a meaningful summary.</strong>

</div>
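The effect of an outlier on the two statistics can be seen in a small example. This is a sketch on made-up readings, not values from the dataset; the series `s` is purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up readings with one extreme value and one missing entry.
s = pd.Series([70.0, 72.0, 74.0, np.nan, 200.0])

# Both statistics skip NaN by default.
mean_value = s.mean()      # (70 + 72 + 74 + 200) / 4, pulled upward by the 200
median_value = s.median()  # midpoint of the two middle values, 72 and 74

# Impute the missing entry with the median.
filled = s.fillna(median_value)

print(mean_value, median_value)
print(filled.tolist())
```

Here the single extreme reading pulls the mean well above the typical values, while the median stays close to them, which is why the median is the safer fill value when outliers are present.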


