# 1. Introduction

## 1.1 General Description of the Dataset

__Introduction__

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage (by UCI Machine Learning).

__Variables:__

* Pregnancies: Number of times pregnant

    Gestational diabetes is a type of diabetes that can develop during pregnancy in women who don’t already have diabetes. Having gestational diabetes can increase the risk of high blood pressure during pregnancy (Centers for Disease Control and Prevention, Gestational Diabetes, https://www.cdc.gov/diabetes/basics/gestational.html).
        
* Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

* BloodPressure: Diastolic blood pressure (mm Hg)

* SkinThickness: Triceps skin fold thickness (mm). It indicates the body fat percentage.

   Below 20%: Underfat
   
   21-33%: Healthy
   
   34-39%: Overfat
   
   Above 39%: Obese
   
   
* Insulin: 2-Hour serum insulin (mu U/ml)
* BMI: Body mass index (weight in kg/(height in m)^2)
    
    BMI Categories:
     
     Underweight = <18.5
     
     Normal weight = 18.5–24.9
     
     Overweight = 25–29.9
     
     Obesity = BMI of 30 or greater



* DiabetesPedigreeFunction: Diabetes pedigree function
* Age: Age (years)
* Outcome: Class variable(0 or 1) (Target Variable)
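
The BMI formula and the categories listed above can be sketched as two small helpers. The function names `bmi` and `bmi_category` are hypothetical and are not used elsewhere in this notebook; values between 24.9 and 25 are assumed to fall into "Normal weight".

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kg divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to the categories listed above (assumed cut-offs)."""
    if value < 18.5:
        return "Underweight"
    if value < 25:
        return "Normal weight"
    if value < 30:
        return "Overweight"
    return "Obesity"


example = bmi(70, 1.75)  # 70 / 1.75**2
print(round(example, 1), bmi_category(example))  # 22.9 Normal weight
```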


## 1.2 Import Libraries

In [None]:
import warnings
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, \
    classification_report
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

__Settings__

In [None]:
warnings.simplefilter(action='ignore', category=FutureWarning)
# Recent NumPy versions no longer expose np.warnings, so the standard library module is used directly
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## 1.3 Read Data

In [None]:
df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')

# 2. Overview

In [None]:
df.head()

In [None]:
df.info()

There do not appear to be any categorical variables. However, a categorical variable may be stored as an integer. We can check this by counting the unique values of each variable.

In [None]:
# Select the numerical variables (target variable excluded)
# Note: `col not in "Outcome"` is a substring test, so an explicit comparison is used instead
cols = [col for col in df.columns if df[col].dtypes != 'O' and col != "Outcome"]

In [None]:
df[cols].nunique()

* Each variable has a large number of unique values, so none of the numerical variables appears to be a categorical variable in disguise.

In [None]:
# Plot a histogram for each numerical variable
def hist_for_nums(data, numeric_cols):
    for col in numeric_cols:
        data[col].hist()
        plt.xlabel(col)
        plt.title(col)
        plt.show()

hist_for_nums(df, cols)

* Several variables are right-skewed (Age, DiabetesPedigreeFunction, BMI, Insulin, SkinThickness, Pregnancies), while BloodPressure and Glucose are approximately normally distributed. Outliers are likely in the right-skewed variables. These variables may also indicate a range boundary.

* Outliers can be inspected with a boxplot of each variable.
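
The visual impression of skewness can be cross-checked numerically. The sketch below uses pandas' `Series.skew` on two made-up samples (not the dataset itself): a positive value indicates a right-skewed distribution, and a value near zero a symmetric one.

```python
import pandas as pd

right_skewed = pd.Series([1, 1, 2, 2, 3, 10])  # toy sample with a long right tail
symmetric = pd.Series([1, 2, 3, 4, 5])         # toy sample, mirror-symmetric around 3

print(right_skewed.skew())  # positive for a right-skewed sample
print(symmetric.skew())     # approximately 0 for a symmetric sample
```

On the real data, `df[cols].skew()` would give one value per variable.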

In [None]:
def boxplot_for_nums(data, numeric_cols):
    for col in numeric_cols:
        sns.boxplot(x=data[col])  # pass the series as x explicitly; newer seaborn treats the first positional argument as data
        plt.xlabel(col)
        plt.title(col)
        plt.show()

boxplot_for_nums(df, cols)


In [None]:
df.describe().T

# 3. Data Processing

## 3.1 Missing Value Treatment


In [None]:
df.isnull().sum()

In [None]:
df[df == 0]

* At first sight, the dataset appears to have no missing values.
* However, variables such as Age, BloodPressure, BMI, Glucose, Insulin and SkinThickness contain zeros. A value of 0 is not possible for these measurements, so those values need to be replaced with NaN.

In [None]:
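# Columns in which 0 stands for a missing measurement
df_zeros = ['Age','BloodPressure','BMI','Glucose','Insulin','SkinThickness']

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df[df_zeros] = df[df_zeros].replace(0, np.nan)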

In [None]:
df.isnull().sum()

* To fill the missing values, we compute the median of each affected variable separately for each outcome class and assign that median to the missing entries of the corresponding class (approach taken from Vincent Lugat's work: https://www.kaggle.com/vincentlugat/pima-indians-diabetes-eda-prediction-0-906).
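
As a minimal sketch of this idea, the group medians can also be computed and assigned in one step with `groupby(...).transform`, instead of hard-coding the values. The toy frame below is made up for illustration and is not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "Glucose": [100.0, np.nan, 110.0, 150.0, 130.0, np.nan],
})

# Median of each Outcome group, broadcast back to row level (NaNs are skipped)
group_median = toy.groupby("Outcome")["Glucose"].transform("median")

# Fill only the missing entries with the median of their own group
toy["Glucose"] = toy["Glucose"].fillna(group_median)
print(toy)
```

Here the missing value in class 0 becomes 105 and the one in class 1 becomes 140. The cells below instead assign the medians explicitly, one variable at a time.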

__Glucose__

In [None]:
df.groupby("Outcome")["Glucose"].median()

In [None]:
df.loc[(df['Outcome'] == 0 ) & (df['Glucose'].isnull()), 'Glucose'] = 107
df.loc[(df['Outcome'] == 1 ) & (df['Glucose'].isnull()), 'Glucose'] = 140

__Blood Pressure__

In [None]:
df.groupby("Outcome")["BloodPressure"].median()


In [None]:
df.loc[(df['Outcome'] == 0 ) & (df['BloodPressure'].isnull()), 'BloodPressure'] = 70
df.loc[(df['Outcome'] == 1 ) & (df['BloodPressure'].isnull()), 'BloodPressure'] = 74.5

__Skin Thickness__

In [None]:
df.groupby("Outcome")["SkinThickness"].median()

In [None]:
df.loc[(df['Outcome'] == 0 ) & (df['SkinThickness'].isnull()), 'SkinThickness'] = 27
df.loc[(df['Outcome'] == 1 ) & (df['SkinThickness'].isnull()), 'SkinThickness'] = 32

__Insulin__

In [None]:
df.groupby("Outcome")["Insulin"].median()

In [None]:
df.loc[(df['Outcome'] == 0 ) & (df['Insulin'].isnull()), 'Insulin'] = 102.500
df.loc[(df['Outcome'] == 1 ) & (df['Insulin'].isnull()), 'Insulin'] = 169.500

__BMI__

In [None]:
df.groupby("Outcome")["BMI"].median()

In [None]:
df.loc[(df['Outcome'] == 0 ) & (df['BMI'].isnull()), 'BMI'] = 30.100
df.loc[(df['Outcome'] == 1 ) & (df['BMI'].isnull()), 'BMI'] = 34.300

In [None]:
df.isnull().sum()

## 3.2 Outlier Treatment

In [None]:
# Determine Outlier Thresholds of each variable
def outlier_thresholds(dataframe, variable):
    quartile1 = dataframe[variable].quantile(0.25)
    quartile3 = dataframe[variable].quantile(0.75)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

In [None]:
# Print the variables that have values below low_limit or above up_limit
# (both limits are determined by the outlier_thresholds function).
def has_outliers(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    if dataframe[(dataframe[variable] < low_limit) | (dataframe[variable] > up_limit)].any(axis=None):
        print(variable, "yes")

for col in cols:
    has_outliers(df, col)


* It seems that all variables except Glucose have outliers.

In [None]:
# Set outliers to low_limit and up_limit respectively
def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

replace_with_thresholds(df, 'BloodPressure')
replace_with_thresholds(df, 'DiabetesPedigreeFunction')

In [None]:
# Set outliers which are only higher than up_limit to up_limit
def replace_with_thresholds_2(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

replace_with_thresholds_2(df, 'Insulin')
replace_with_thresholds_2(df, 'SkinThickness')

* __The outlier treatment above (which variables to cap and at which limits) was chosen after several trial-and-error attempts.__
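
The IQR rule and the capping step can be illustrated on a small made-up series. This sketch uses `Series.clip` as an alternative to the `.loc` assignments above; it is not the code used in this notebook.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])  # toy data

q1 = values.quantile(0.25)       # 3.0
q3 = values.quantile(0.75)       # 7.0
iqr = q3 - q1                    # 4.0
low_limit = q1 - 1.5 * iqr       # -3.0
up_limit = q3 + 1.5 * iqr        # 13.0

# Values outside [low_limit, up_limit] are set to the nearest limit
capped = values.clip(lower=low_limit, upper=up_limit)
print(capped.tolist())           # the outlier 100.0 becomes 13.0
```

Passing only `upper=up_limit` would correspond to the one-sided treatment in `replace_with_thresholds_2`.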

# 4. Feature Engineering

__New Body Fat__

* According to Durnin and Womersley (1973), skinfold thickness can be converted into body fat.
* In their method, body fat is estimated from the sum of four skinfolds (biceps, triceps, subscapular and supra-iliac), with separate values for males and females of different ages.
* However, this dataset only contains the triceps skinfold thickness.
* As a rough workaround, the SkinThickness variable is multiplied by 4 to approximate the sum of the four skinfolds.

https://www.cambridge.org/core/services/aop-cambridge-core/content/view/DAC8BA25856FCEB30E22F60E0AF80D07/S0007114574000614a.pdf/body_fat_assessed_from_total_body_density_and_its_estimation_from_skinfold_thickness_measurements_on_481_men_and_women_aged_from_16_to_72_years.pdf

In [None]:
df["New_SkinThickness"] = df["SkinThickness"] * 4

In [None]:
# Body fat categories are assigned according to skin thickness and age.
# The thresholds are taken from the paper referenced above.
df.loc[(df['Age'] >= 21 ) & (df['Age'] <= 29) & (df["New_SkinThickness"] < 34), 'New_Body_Fat'] = "Underfat"
df.loc[(df['Age'] >= 30 ) & (df['Age'] <= 39) & (df["New_SkinThickness"] < 29), 'New_Body_Fat'] = "Underfat"
df.loc[(df['Age'] >= 40 ) & (df['Age'] <= 49) & (df["New_SkinThickness"] < 23), 'New_Body_Fat'] = "Underfat"
df.loc[(df['Age'] >= 50 ) & (df["New_SkinThickness"] < 20), 'New_Body_Fat'] = "Underfat"

df.loc[(df['Age'] >= 21 ) & (df['Age'] <= 29) & (df["New_SkinThickness"] >= 34) & (df["New_SkinThickness"] <= 79), 'New_Body_Fat'] = "Healthy"
df.loc[(df['Age'] >= 30 ) & (df['Age'] <= 39) & (df["New_SkinThickness"] >= 29) & (df["New_SkinThickness"] <= 73), 'New_Body_Fat'] = "Healthy"
df.loc[(df['Age'] >= 40 ) & (df['Age'] <= 49) & (df["New_SkinThickness"] >= 23) & (df["New_SkinThickness"] <= 59), 'New_Body_Fat'] = "Healthy"
df.loc[(df['Age'] >= 50 ) & (df["New_SkinThickness"] >= 20) & (df["New_SkinThickness"] <= 49), 'New_Body_Fat'] = "Healthy"

df.loc[(df['Age'] >= 21 ) & (df['Age'] <= 29) &( df["New_SkinThickness"] > 79) & (df["New_SkinThickness"] <= 120), 'New_Body_Fat'] = "OverFat"
df.loc[(df['Age'] >= 30) & (df['Age'] <= 39) & (df["New_SkinThickness"] > 73) & (df["New_SkinThickness"] <= 115), 'New_Body_Fat'] = "OverFat"
df.loc[(df['Age'] >= 40 ) & (df['Age'] <= 49) & (df["New_SkinThickness"] > 59) & (df["New_SkinThickness"] <= 95), 'New_Body_Fat'] = "OverFat"
df.loc[(df['Age'] >= 50 ) & (df["New_SkinThickness"] > 49) & (df["New_SkinThickness"] <= 77), 'New_Body_Fat'] = "OverFat"

df.loc[(df['Age'] >= 21 ) & (df['Age'] <= 29) & (df["New_SkinThickness"] > 120), 'New_Body_Fat'] = "Obese"
df.loc[(df['Age'] >= 30 ) & (df['Age'] <= 39) & (df["New_SkinThickness"] > 115) , 'New_Body_Fat'] = "Obese"
df.loc[(df['Age'] >= 40 ) & (df['Age'] <= 49) & (df["New_SkinThickness"] > 95) , 'New_Body_Fat'] = "Obese"
df.loc[(df['Age'] >= 50 ) & (df["New_SkinThickness"] > 77) , 'New_Body_Fat'] = "Obese"


In [None]:
df.drop(["SkinThickness","New_SkinThickness"], axis =1, inplace=True)

__Glucose__

In [None]:
# Glucose levels of 140 or below are labelled No_Risk; levels above 140 are labelled Prediabetes.
df.loc[(df["Glucose"] <= 140), "New_Glucose"] = "No_Risk"
df.loc[(df["Glucose"] > 140), "New_Glucose"] = "Prediabetes"
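
The same two-way split could be written in a single step with `np.where`. The sketch below runs on made-up values rather than the real dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Glucose": [95.0, 140.0, 141.0, 180.0]})  # toy data

# Values of 140 or below get "No_Risk", everything else gets "Prediabetes"
toy["New_Glucose"] = np.where(toy["Glucose"] <= 140, "No_Risk", "Prediabetes")
print(toy)
```

One difference to keep in mind: `np.where` would label a missing Glucose value as "Prediabetes", whereas the two `.loc` assignments above leave it as NaN. After the imputation in section 3.1 this should not matter here.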

# 5. Encoding

In [None]:
# Categorical variables are selected
cat_cols = [col for col in df.columns if df[col].dtypes == 'O']

In [None]:
# Categorical variables are encoded
def one_hot_encoder(dataframe,categorical_cols,nan_as_category=False):
    original_columns = list(dataframe.columns)
    dataframe = pd.get_dummies(dataframe,columns = categorical_cols,dummy_na = nan_as_category,drop_first = True)
    new_columns = [c for c in dataframe.columns if c not in original_columns]
    return dataframe,new_columns

df,new_cols_ohe = one_hot_encoder(df,cat_cols)
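
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up column. With two categories, only one dummy column remains, because the first category (in sorted order) is dropped and becomes the implicit baseline.

```python
import pandas as pd

toy = pd.DataFrame({"New_Glucose": ["No_Risk", "Prediabetes", "No_Risk"]})  # toy data

encoded = pd.get_dummies(toy, columns=["New_Glucose"], drop_first=True)
print(encoded.columns.tolist())  # ['New_Glucose_Prediabetes']
```

A row with all dummy columns equal to zero therefore belongs to the dropped category ("No_Risk" in this toy example).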

# 6. Modelling

In [None]:
y = df["Outcome"] # Target variable is assigned to y
X = df.drop(["Outcome"], axis=1) # Predictor variables are assigned to X

In [None]:
# All models are assigned to models variable
models = [('LR', LogisticRegression()),
          ('KNN', KNeighborsClassifier()),
          ('CART', DecisionTreeClassifier()),
          ('RF', RandomForestClassifier()),
          ('SVM', SVC(gamma='auto')),
          ('GB', GradientBoostingClassifier()),
          ("LightGBM", LGBMClassifier())] 

In [None]:
# Dataset split into test and train (Holdout)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=46) # 80% train, 20% test 

In [None]:
results = []
names = []
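# K-fold cross-validation applied to the training set
for name, model in models:
    # shuffle=True is required by recent scikit-learn versions when random_state is set
    kfold = KFold(n_splits=10, shuffle=True, random_state=123456) # Training set is split into 10 folds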
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)


* According to the cross-validation results, the best models are KNN, GB, LightGBM and RF.

In [None]:
# Boxplot Algorithm Comparison
fig = plt.figure(figsize=(15, 10))
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

* The cross-validation scores of the KNN, RF, GB and LightGBM models show low deviation, so these models can be considered successful.

In [None]:
# Test data set prediction
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    msg = "%s: (%f)" % (name, acc)
    print(msg)

In [None]:
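# K-fold cross-validation on the training set, repeated for comparison with the test-set accuracies above
# (scores are only printed here, so that results and names are not filled a second time)
for name, model in models:
    # shuffle=True is required by recent scikit-learn versions when random_state is set
    kfold = KFold(n_splits=10, shuffle=True, random_state=123456) # Training set is split into 10 folds
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)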


* The cross-validation scores are higher than the accuracy scores of the models on the test set, which points to a risk of overfitting. However, the scores are very close, so the difference can be considered negligible.

* In my view, this problem occurs because the dataset is unbalanced.

In [None]:
sns.countplot(x = "Outcome", data = df)
plt.show()
# The column is named explicitly; the original relied on a leftover loop variable `col`
print(pd.DataFrame({"Outcome": df["Outcome"].value_counts(),
                    "Ratio": 100 * df["Outcome"].value_counts() / len(df)}), end = "\n\n\n")

* Unbalanced datasets are common in disease data, because most of the samples come from healthy individuals. In this dataset, about 65% of the samples are healthy individuals.
* This problem can be addressed with resampling methods or by collecting more data.
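
As one possible illustration of a resampling method, the sketch below performs simple random oversampling of the minority class with pandas. The toy frame is made up, and this step is not part of the pipeline above.

```python
import pandas as pd

# Toy data standing in for a training set (values are made up)
toy = pd.DataFrame({
    "Glucose": [90, 100, 110, 120, 150, 160],
    "Outcome": [0, 0, 0, 0, 1, 1],
})

majority = toy[toy["Outcome"] == 0]
minority = toy[toy["Outcome"] == 1]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=46)

balanced = pd.concat([majority, minority_upsampled], ignore_index=True)
print(balanced["Outcome"].value_counts())
```

If such a step were used, it would normally be applied to the training split only, so that the test set keeps the original class distribution.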