<h1 align='center'> Diabetes Diagnostic Model Based On Convolutional Neural Network</h1>


-----

|  **Contacts**  ||
|--------------------------|--------
|      **Supervisor**      |  **Dr Md Zakir Hossain**
|        **Student**       |  **Zeyu Zhang**





## Import Libraries

In [1]:
# Code Imports
# Every import is here, you may need to uncomment additional items as necessary.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
plt.style.use('seaborn')  # on Matplotlib >= 3.6 this style is named 'seaborn-v0_8'
%matplotlib inline

import sqlite3
from sqlite3 import Error
from scipy import stats
from sklearn.linear_model import LogisticRegression     # Logistic Regression
from sklearn.neighbors import KNeighborsClassifier      # k-Nearest Neighbours
from sklearn.preprocessing import LabelEncoder          # encoding variables
from sklearn.preprocessing import StandardScaler        # scaling variables
from sklearn.model_selection import train_test_split    # testing our models
#from sklearn.preprocessing import OneHotEncoder         # nominal variable
from sklearn.metrics import confusion_matrix            # scoring
from sklearn.tree import DecisionTreeClassifier         # decision trees
from sklearn.tree import DecisionTreeRegressor          # decision trees
from sklearn import tree                                # decision trees
from sklearn.decomposition import PCA                   # PCA 
from sklearn.cluster import KMeans                      # KMeans Clustering
from sklearn import metrics                             # metrics

# import math module for roundings
import math

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# import sklearn metrics for validation
import sklearn.metrics as skm

# import cdist for SSE (Distortion)
from scipy.spatial.distance import cdist

# cross validation
from sklearn.model_selection import cross_validate 

## Screening of Datasets

First, we need to study the correlation between the predictors and the outcome in each dataset. Screening multiple candidate datasets is therefore an indispensable step in finding an appropriate research object.


### Candidate Data Sources

- **`CDC_BRFSS2015`** - A dataset of 253,680 survey responses to the CDC's BRFSS2015.
- **`NIDDK_Pima`** - A dataset of 768 females of Pima Indian heritage, all at least 21 years old, from the National Institute of Diabetes and Digestive and Kidney Diseases.
- **`Sylhet`** - A dataset collected through direct questionnaires from 520 patients of Sylhet Diabetes Hospital in Sylhet, Bangladesh.

-----





### CDC BRFSS2015 Database

This is a clean dataset of 253,680 survey responses to the CDC's BRFSS2015 with 22 columns: the `Diabetes_binary` outcome and 21 predictors. These columns are either questions directly asked of participants, or calculated variables based on individual participant responses.

####  The Columns
| Column Name    | Expression    |
| :------------- | :------------- |
| Diabetes_binary| 0 = no diabetes 1 = diabetes |
| HighBP      | 0 = no high BP 1 = high BP |
| HighChol    | 0 = no high cholesterol 1 = high cholesterol |
| CholCheck       | 0 = no cholesterol check in 5 years 1 = yes cholesterol check in 5 years |
| BMI            | Body Mass Index |
| Smoker      | Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no 1 = yes |
| Stroke       | (Ever told) you had a stroke. 0 = no 1 = yes |
| HeartDiseaseorAttack        | coronary heart disease (CHD) or myocardial infarction (MI) 0 = no 1 = yes |
| PhysActivity       | physical activity in past 30 days - not including job 0 = no 1 = yes |
| Fruits        | Consume Fruit 1 or more times per day 0 = no 1 = yes |
| Veggies        | Consume Vegetables 1 or more times per day 0 = no 1 = yes |
| HvyAlcoholConsump        | (adult men >=14 drinks per week and adult women>=7 drinks per week) 0 = no 1 = yes |
| AnyHealthcare        | Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no 1 = yes |
| NoDocbcCost        | Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no 1 = yes |
| GenHlth       | Would you say that in general your health is: scale 1-5 1 = excellent 2 = very good 3 = good 4 = fair 5 = poor |
| MentHlth        | Number of days of poor mental health in the past 30 days, scale 0-30 |
| PhysHlth        | Number of days of physical illness or injury in the past 30 days, scale 0-30 |
| DiffWalk        | Do you have serious difficulty walking or climbing stairs? 0 = no 1 = yes |
| Sex        | 0 = female 1 = male |
| Age        | 13-level age category 1 = 18-24 9 = 60-64 13 = 80 or older |
| Education        | Education level scale 1-6 1 = Never attended school or only kindergarten 2 = elementary etc. |
| Income        | Income scale 1-8 1 = less than `$10000`, 5 = less than `$35000`, 8 = `$75000` or more |






In [2]:
# Read 'CDC_BRFSS2015.csv'
cdc_df = pd.read_csv("./data/CDC_BRFSS2015.csv")

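# Check for missing values, then drop incomplete rows.
# Caution: any() over a DataFrame iterates its column labels, so both checks
# print True regardless of the data; cdc_df.isnull().values.any() tests the cells.
print("Any null value:", any(cdc_df.isnull()))
print("Any NaN value:", any(cdc_df.isna()))
cdc_df = cdc_df.dropna()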

cdc_df.tail()

Any null value: True
Any NaN value: True


Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
253675,0.0,1.0,1.0,1.0,45.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,5.0,0.0,1.0,5.0,6.0,7.0
253676,1.0,1.0,1.0,1.0,18.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,4.0,0.0,0.0,1.0,0.0,11.0,2.0,4.0
253677,0.0,0.0,0.0,1.0,28.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,5.0,2.0
253678,0.0,1.0,0.0,1.0,23.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,7.0,5.0,1.0
253679,1.0,1.0,1.0,1.0,25.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,9.0,6.0,2.0
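
The missing-value check in the cell above can be made sensitive to the actual cell values. The following is a minimal sketch, assuming a small hypothetical frame `toy_df` in place of the real data. It contrasts the label-based `any(...)` call with a reduction over the boolean mask, and adds a per-column count of gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for cdc_df or niddk_df.
toy_df = pd.DataFrame({
    "BMI": [28.0, np.nan, 23.0],
    "HighBP": [1.0, 0.0, 1.0],
})

# any() over a DataFrame iterates its column labels, not its cells,
# so this is True for any frame whose labels are non-empty strings.
label_check = any(toy_df.isnull())

# Reducing the boolean mask itself tests the cell values.
has_missing = bool(toy_df.isnull().to_numpy().any())

# Per-column counts show where the gaps are.
missing_per_column = toy_df.isnull().sum()

# Drop incomplete rows, as the notebook does.
clean_df = toy_df.dropna()

print("Label-based check:", label_check)
print("Any missing value:", has_missing)
print(missing_per_column)
```

On a frame with no gaps, `has_missing` would be `False` while the label-based check would still report `True`.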


In [3]:
# Correlation matrix (pairwise Pearson correlation of all columns)
cdc_corr = cdc_df.corr()
cdc_corr

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
Diabetes_binary,1.0,0.263129,0.200276,0.064761,0.216843,0.060789,0.105816,0.177282,-0.118133,-0.040779,...,0.016255,0.031433,0.293569,0.069315,0.171337,0.218344,0.03143,0.177442,-0.124456,-0.163919
HighBP,0.263129,1.0,0.298199,0.098508,0.213748,0.096991,0.129575,0.209361,-0.125267,-0.040555,...,0.038425,0.017358,0.30053,0.056456,0.161212,0.223618,0.052207,0.344452,-0.141358,-0.171235
HighChol,0.200276,0.298199,1.0,0.085642,0.106722,0.091299,0.09262,0.180765,-0.078046,-0.040859,...,0.04223,0.01331,0.208426,0.062069,0.121751,0.144672,0.031205,0.272318,-0.070802,-0.085459
CholCheck,0.064761,0.098508,0.085642,1.0,0.034495,-0.009929,0.024158,0.044206,0.00419,0.023849,...,0.117626,-0.058255,0.046589,-0.008366,0.031775,0.040585,-0.022115,0.090321,0.00151,0.014259
BMI,0.216843,0.213748,0.106722,0.034495,1.0,0.013804,0.020153,0.052904,-0.147294,-0.087518,...,-0.018471,0.058206,0.239185,0.08531,0.121141,0.197078,0.04295,-0.036618,-0.103932,-0.100069
Smoker,0.060789,0.096991,0.091299,-0.009929,0.013804,1.0,0.061173,0.114441,-0.087401,-0.077666,...,-0.023251,0.048946,0.163143,0.092196,0.11646,0.122463,0.093662,0.120641,-0.161955,-0.123937
Stroke,0.105816,0.129575,0.09262,0.024158,0.020153,0.061173,1.0,0.203002,-0.069151,-0.013389,...,0.008776,0.034804,0.177942,0.070172,0.148944,0.176567,0.002978,0.126974,-0.076009,-0.128599
HeartDiseaseorAttack,0.177282,0.209361,0.180765,0.044206,0.052904,0.114441,0.203002,1.0,-0.087299,-0.01979,...,0.018734,0.031,0.258383,0.064621,0.181698,0.212709,0.086096,0.221618,-0.0996,-0.141011
PhysActivity,-0.118133,-0.125267,-0.078046,0.00419,-0.147294,-0.087401,-0.069151,-0.087299,1.0,0.142756,...,0.035505,-0.061638,-0.266186,-0.125587,-0.21923,-0.253174,0.032482,-0.092511,0.199658,0.198539
Fruits,-0.040779,-0.040555,-0.040859,0.023849,-0.087518,-0.077666,-0.013389,-0.01979,0.142756,1.0,...,0.031544,-0.044243,-0.103854,-0.068217,-0.044633,-0.048352,-0.091175,0.064547,0.110187,0.079929


Moreover, we evaluate the Weighted Average of the correlations between each predictor and the outcome. Every predictor carries the same weight, so this is the arithmetic mean:

$\text{Weighted Average of Correlation} = \frac{\text{Sum of Correlations of Predictors}}{\text{Number of Predictors}} \approx 0.073$


In [4]:
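# Evaluate the Weighted Average of the correlation of predictors:
# row 0 of cdc_corr is Diabetes_binary, so subtract its self-correlation (1)
# and divide by the 21 predictors.
print("Weighted Average:", (sum(cdc_corr.iloc[0])-1)/21)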

Weighted Average: 0.0731949124664723


-----
### NIDDK Pima Indians Diabetes Database

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

####  The Columns
| Column Name    | Expression    |
| :------------- | :------------- |
| Pregnancies | Number of times pregnant |
| Glucose | Plasma glucose concentration at 2 hours in an oral glucose tolerance test |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-Hour serum insulin (mu U/ml) |
| BMI | Body mass index (weight in kg/(height in m)^2) |
| DiabetesPedigreeFunction | Diabetes pedigree function |
| Age | Age (years) |
| Outcome | Class variable (0 or 1) |





In [5]:
# Read 'NIDDK_Pima.csv'
niddk_df = pd.read_csv("./data/NIDDK_Pima.csv")

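# Check for missing values, then drop incomplete rows.
# Caution: any() over a DataFrame iterates its column labels, so both checks
# print True regardless of the data; niddk_df.isnull().values.any() tests the cells.
print("Any null value:", any(niddk_df.isnull()))
print("Any NaN value:", any(niddk_df.isna()))
niddk_df = niddk_df.dropna()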

niddk_df.tail()

Any null value: True
Any NaN value: True


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0


In [6]:
# Correlation matrix (pairwise Pearson correlation of all columns)
niddk_corr = niddk_df.corr()
niddk_corr

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
Pregnancies,1.0,0.129459,0.141282,-0.081672,-0.073535,0.017683,-0.033523,0.544341,0.221898
Glucose,0.129459,1.0,0.15259,0.057328,0.331357,0.221071,0.137337,0.263514,0.466581
BloodPressure,0.141282,0.15259,1.0,0.207371,0.088933,0.281805,0.041265,0.239528,0.065068
SkinThickness,-0.081672,0.057328,0.207371,1.0,0.436783,0.392573,0.183928,-0.11397,0.074752
Insulin,-0.073535,0.331357,0.088933,0.436783,1.0,0.197859,0.185071,-0.042163,0.130548
BMI,0.017683,0.221071,0.281805,0.392573,0.197859,1.0,0.140647,0.036242,0.292695
DiabetesPedigreeFunction,-0.033523,0.137337,0.041265,0.183928,0.185071,0.140647,1.0,0.033561,0.173844
Age,0.544341,0.263514,0.239528,-0.11397,-0.042163,0.036242,0.033561,1.0,0.238356
Outcome,0.221898,0.466581,0.065068,0.074752,0.130548,0.292695,0.173844,0.238356,1.0


Moreover, we evaluate the Weighted Average of the correlations between each predictor and the outcome. Every predictor carries the same weight, so this is the arithmetic mean:

$\text{Weighted Average of Correlation} = \frac{\text{Sum of Correlations of Predictors}}{\text{Number of Predictors}} \approx 0.208$

In [7]:
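# Evaluate the Weighted Average of the correlation of predictors:
# the last row of niddk_corr is Outcome, so subtract its self-correlation (1)
# and divide by the 8 predictors.
print("Weighted Average:", (sum(niddk_corr.iloc[-1])-1)/8)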

Weighted Average: 0.2079678511272704
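
With the averages reported above (0.073 for `CDC_BRFSS2015` and 0.208 for `NIDDK_Pima`), the screening step amounts to ranking the candidates by this score. A minimal sketch of that ranking follows, assuming each candidate is supplied as a DataFrame together with the name of its outcome column. `rank_candidates` and the two toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd


def rank_candidates(candidates):
    """Rank datasets by the mean correlation of their predictors
    with the outcome (hypothetical helper).

    `candidates` maps a dataset name to a (DataFrame, outcome column) pair.
    """
    scores = {}
    for name, (df, target) in candidates.items():
        # Mean correlation of every non-outcome column with the outcome.
        scores[name] = df.corr()[target].drop(target).mean()
    ranking = pd.Series(scores, name="mean_correlation")
    return ranking.sort_values(ascending=False)


# Hypothetical toy candidates: the predictor tracks the outcome
# closely in the first frame and only loosely in the second.
strong_toy = pd.DataFrame({"outcome": [0, 0, 1, 1], "x": [1, 2, 3, 4]})
weak_toy = pd.DataFrame({"outcome": [0, 1, 0, 1], "x": [1, 2, 3, 4]})

ranking = rank_candidates({
    "weak_toy": (weak_toy, "outcome"),
    "strong_toy": (strong_toy, "outcome"),
})
print(ranking)
```

In the notebook, the dictionary would hold `cdc_df` with `Diabetes_binary` and `niddk_df` with `Outcome`.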
