# Diabetes Risk Streamlit App

## [Diabetes Risk Streamlit App Live Link](https://share.streamlit.io/alexteboul/diabetes-risk-app/diabetes_app.py)

## [GitHub Link](https://github.com/AlexTeboul/diabetes-risk-app)

## Purpose
* The purpose of this notebook is to collect/clean BRFSS data, build machine learning models to predict diabetes risk, and save/export those models for use in a Streamlit web app.
* I code up the actual Streamlit app in Visual Studio Code on my Mac.
* The dataset originally has 330 features (columns), but based on diabetes research and past experience with this dataset, only a handful of features are included.
* Data comes from the 2015 BRFSS survey.

### Selected Subset of Features from BRFSS 2015
Given the known diabetes risk factors, I selected features (columns/questions) in the BRFSS related to those risk factors. To understand what the columns mean, I consulted the BRFSS 2015 Codebook for the questions and their response codes, and matched the variable names in the codebook to the variable names in the dataset I downloaded from Kaggle. I also reference some of the same features chosen by Zidian Xie et al. in *Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques*, which used the 2014 BRFSS.

**BRFSS 2015 Codebook:** https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf

**Relevant Research Paper using BRFSS for Diabetes ML:** https://www.cdc.gov/pcd/issues/2019/19_0109.htm

**Dependent Variable:**
*   (Ever told) you have diabetes (If "Yes" and respondent is female, ask "Was this only when you were pregnant?". If Respondent says pre-diabetes or borderline diabetes, use response code 4.) --> DIABETE3

**Independent Variables:**

1. About how much do you weigh without shoes? (lbs) --> WEIGHT2
2. Height in inches --> HTIN4
3. Fourteen-level age category --> _AGEG5YR
4. Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high? --> TOLDHI2
5. Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional --> _RFHYPE5
6. Would you say that in general your health is: --> GENHLTH
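
The list above names weight and height, while the cleaning code in this notebook ends up selecting the precomputed `_BMI5` column instead. For reference, here is a minimal sketch of the standard imperial BMI formula that links the two. The function name `bmi_from_imperial` is made up for illustration and is not part of the notebook.

```python
def bmi_from_imperial(weight_lb: float, height_in: float) -> float:
    """Body mass index from weight in pounds and height in inches."""
    if weight_lb <= 0 or height_in <= 0:
        raise ValueError("weight and height must be positive")
    # 703 is the conversion factor from lb/in^2 to kg/m^2
    return 703.0 * weight_lb / (height_in ** 2)


# Example: 180 lb at 70 inches, rounded to a whole number as the notebook does for BMI
print(round(bmi_from_imperial(180, 70)))
```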

## 1. Collect/Clean the data

In [11]:
#imports
import os
import pandas as pd
import numpy as np
import random
import time

random_state = 7
random.seed(random_state)

In [12]:
def get_filenames(years):
    # build one filename per survey year, e.g. '2015' -> '2015.csv'
    return [f'{year}.csv' for year in years]

In [13]:
def clean_data(filename):
    # read in .csv data
    df = pd.read_csv(filename) 
        
    # select specific columns
    #df = df[['DIABETE3','SEX','WEIGHT2','HTIN4','_RACE','_AGEG5YR','TOLDHI2','_RFHYPE5','GENHLTH']]
    #df = df[['DIABETE3','WEIGHT2','HTIN4','_AGEG5YR','TOLDHI2','_RFHYPE5','GENHLTH']]
    df = df[['DIABETE3','_BMI5','_AGEG5YR','TOLDHI2','_RFHYPE5','GENHLTH']]
    
    # drop missing values
    df = df.dropna()
    
    # 0 DIABETE3 - Diabetes
    # 1 is yes, stays 1
    # 2 is gestational diabetes, but I'll make it 1 (yes diabetes risk) because 50% of women with gestational diabetes go on to develop Type 2 DM
    # 3 is no, make it 0
    # 4 is pre-diabetes or borderline diabetes, so make it 1 (yes diabetes risk)
    # Remove all 7 (don't know)
    # Remove all 9 (refused)
    df['DIABETE3'] = df['DIABETE3'].replace({2:1, 3:0, 4:1})
    df = df[df.DIABETE3 != 7]
    df = df[df.DIABETE3 != 9]
    
    # 2 WEIGHT2 - Weight
    # In the codebook, 7777 is don't know, 9999 is refused, 9000-9998 is metric. drop them all. 50 pounds to 999 pounds acceptable here.
    #df = df[df['WEIGHT2'].between(50, 999)]

    # 3 HTIN4 - Height
    # The computed height in inches. In the codebook, 36-95 inches are acceptable based on the survey
    #df = df[df['HTIN4'].between(36, 95)]
    
    # 4 _BMI5 - BMI
    # Stored as BMI * 100 (for example, 4018 is really 40.18), so divide by 100 and round to the nearest whole number
    df['_BMI5'] = df['_BMI5'].div(100).round(0)
    
    # 5 _AGEG5YR - Age
    # already ordinal. 1 is 18-24 all the way up to 13, which is 80 and older. 5 year increments.
    # remove 14 because it is don't know or missing
    df = df[df._AGEG5YR != 14]
    
    # 6 TOLDHI2 - HighChol
    # Change 2 to 0 because it is No
    # Remove all 7 (don't know)
    # Remove all 9 (refused)
    df['TOLDHI2'] = df['TOLDHI2'].replace({2:0})
    df = df[df.TOLDHI2 != 7]
    df = df[df.TOLDHI2 != 9]
    
    # 7 _RFHYPE5 - HighBP
    # Change 1 to 0 so it represents no high blood pressure, and 2 to 1 so it represents high blood pressure
    df['_RFHYPE5'] = df['_RFHYPE5'].replace({1:0, 2:1})
    df = df[df._RFHYPE5 != 9]
    
    # 8 GENHLTH - GenHlth
    # This is an ordinal variable that I want to keep (1 is Excellent -> 5 is Poor)
    # Remove 7 and 9 for don't know and refused
    df = df[df.GENHLTH != 7]
    df = df[df.GENHLTH != 9]
    
    #Rename the columns to make them more readable
    df = df.rename(columns = {'DIABETE3':'Diabetes_risk', 
                              #'WEIGHT2':'Weight',
                              #'HTIN4':'Height',
                              '_BMI5':'BMI',
                              '_AGEG5YR':'Age',
                              'TOLDHI2':'HighChol',
                              '_RFHYPE5':'HighBP',
                              'GENHLTH':'GenHlth'})
    
    return df
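
The cleaning function above repeats one pattern per column: drop the "don't know" and "refused" codes, then remap the remaining codes. A minimal sketch of that pattern as a reusable helper follows. The name `recode_column` and the toy data are assumptions for illustration only. It drops rows before remapping, which gives the same result as the notebook's order as long as no code is remapped onto a dropped code.

```python
import pandas as pd


def recode_column(df, column, mapping, drop_codes):
    """Drop rows whose code is in drop_codes, then remap the remaining codes."""
    out = df[~df[column].isin(drop_codes)].copy()
    out[column] = out[column].replace(mapping)
    return out


# Toy stand-in for the TOLDHI2 column: 1 = yes, 2 = no, 7 = don't know, 9 = refused
toy = pd.DataFrame({"TOLDHI2": [1.0, 2.0, 7.0, 9.0, 2.0]})
cleaned = recode_column(toy, "TOLDHI2", mapping={2: 0}, drop_codes=[7, 9])
print(cleaned["TOLDHI2"].tolist())
```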

In [14]:
#select years to include
years = ['2015']

# merge all the years of BRFSS .csv data
df = pd.concat(map(clean_data, get_filenames(years)), ignore_index=True)
df.shape

(344940, 6)

**At this point we have 344,940 records and 6 columns. Each record contains an individual's BRFSS survey responses.**

## 2. Basic data description

In [15]:
df.describe()

| | Diabetes_risk | BMI | Age | HighChol | HighBP | GenHlth |
|---|---|---|---|---|---|---|
| count | 344940.0 | 344940.0 | 344940.0 | 344940.0 | 344940.0 | 344940.0 |
| mean | 0.171186 | 28.226631 | 8.240372 | 0.426935 | 0.441807 | 2.572163 |
| std | 0.376672 | 6.602288 | 3.151097 | 0.494633 | 0.496603 | 1.090519 |
| min | 0.0 | 12.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 0.0 | 24.0 | 6.0 | 0.0 | 0.0 | 2.0 |
| 50% | 0.0 | 27.0 | 9.0 | 0.0 | 0.0 | 2.0 |
| 75% | 0.0 | 31.0 | 11.0 | 1.0 | 1.0 | 3.0 |
| max | 1.0 | 98.0 | 13.0 | 1.0 | 1.0 | 5.0 |


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344940 entries, 0 to 344939
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Diabetes_risk  344940 non-null  float64
 1   BMI            344940 non-null  float64
 2   Age            344940 non-null  float64
 3   HighChol       344940 non-null  float64
 4   HighBP         344940 non-null  float64
 5   GenHlth        344940 non-null  float64
dtypes: float64(6)
memory usage: 15.8 MB


In [17]:
for col in df:
    print(f'{col} unique values= {np.sort(df[col].unique())}')

Diabetes_risk unique values= [0. 1.]
BMI unique values= [12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29.
 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47.
 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65.
 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83.
 84. 85. 86. 87. 88. 89. 90. 91. 92. 95. 96. 97. 98.]
Age unique values= [ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13.]
HighChol unique values= [0. 1.]
HighBP unique values= [0. 1.]
GenHlth unique values= [1. 2. 3. 4. 5.]
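
The `Age` column holds BRFSS five-year categories (1 to 13), not ages in years, so anything that feeds the model a person's age has to convert it first. A minimal sketch of that conversion follows, assuming the bands run 18-24, 25-29, 30-34, and so on up to 80 and older, as described in the cleaning comments. The function name `age_to_brfss_category` is hypothetical.

```python
def age_to_brfss_category(age_years: int) -> int:
    """Map an age in years to the 1-13 five-year age category used in the Age column."""
    if age_years < 18:
        raise ValueError("BRFSS respondents are adults (18 or older)")
    if age_years >= 80:
        return 13  # 80 and older
    if age_years < 25:
        return 1   # 18-24
    # 25-29 -> 2, 30-34 -> 3, ..., 75-79 -> 12
    return (age_years - 25) // 5 + 2


for age in (18, 25, 62, 79, 80):
    print(age, age_to_brfss_category(age))
```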


In [18]:
#Check how many respondents fall in each class: 0 (no diabetes) vs 1 (prediabetes or diabetes). Note the class imbalance!
df.groupby(['Diabetes_risk']).size()

Diabetes_risk
0.0    285891
1.0     59049
dtype: int64

In [19]:
#Separate the 0(No Diabetes) and 1(Pre-diabetes and Diabetes)
#Get the 1s
is1 = df['Diabetes_risk'] == 1
df_1 = df[is1]

#Get the 0s
is0 = df['Diabetes_risk'] == 0
df_0 = df[is0] 

#Select 58949 random cases from the 0 (non-diabetes) group. The diabetes risk group has 59049 cases, so the classes end up roughly balanced
df_0_rand1 = df_0.take(np.random.permutation(len(df_0))[:58949])

#Append the 59049 1s to the 58949 randomly selected 0s
df_balanced = pd.concat([df_0_rand1, df_1], ignore_index=True)
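
One caveat on the cell above: `random.seed` only seeds Python's built-in `random` module, not NumPy, so the `np.random.permutation` draw is not reproducible between runs. A minimal sketch of a seeded alternative using `DataFrame.sample` follows. The helper name `undersample_majority` and the toy data are assumptions for illustration. Unlike the notebook, it balances the classes exactly.

```python
import pandas as pd


def undersample_majority(df, target, random_state=7):
    """Randomly keep as many rows of each class as the smallest class has."""
    n_minority = df[target].value_counts().min()
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    return pd.concat(parts, ignore_index=True)


# Toy data: 8 rows of class 0 and 3 rows of class 1
toy = pd.DataFrame({"Diabetes_risk": [0] * 8 + [1] * 3, "BMI": range(20, 31)})
balanced = undersample_majority(toy, "Diabetes_risk")
print(balanced["Diabetes_risk"].value_counts().to_dict())
```

Because the sample is seeded through `random_state`, calling the helper again with the same input returns the same rows.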

In [20]:
df_balanced.head()

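| | Diabetes_risk | BMI | Age | HighChol | HighBP | GenHlth |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 21.0 | 9.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 26.0 | 8.0 | 0.0 | 0.0 | 2.0 |
| 2 | 0.0 | 30.0 | 8.0 | 1.0 | 0.0 | 5.0 |
| 3 | 0.0 | 22.0 | 13.0 | 1.0 | 1.0 | 1.0 |
| 4 | 0.0 | 25.0 | 11.0 | 0.0 | 0.0 | 2.0 |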


In [21]:
df_balanced.groupby(['Diabetes_risk']).size()

Diabetes_risk
0.0    58949
1.0    59049
dtype: int64

## 3. Save to csv
Save the full cleaned dataset and the balanced version, each with the diabetes target variable in the first column. Prediabetes responses are still included, coded as 1 (diabetes risk).

In [22]:
#************************************************************************************************
df.to_csv('diabetes_risk.csv', sep=",", index=False)
df_balanced.to_csv('diabetes_risk_balanced.csv', sep=",", index=False)
#************************************************************************************************

## 4. Model Building
* A simple decision tree is used because it can handle the ordinal and binary features without needing dummy variables.

In [23]:
pip install pydotplus



In [25]:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.tree import DecisionTreeClassifier, export_text, export_graphviz
import graphviz
import pydotplus
from IPython.display import Image 

from sklearn.metrics import confusion_matrix, make_scorer, classification_report

In [26]:
df_balanced.head()

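| | Diabetes_risk | BMI | Age | HighChol | HighBP | GenHlth |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 21.0 | 9.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 26.0 | 8.0 | 0.0 | 0.0 | 2.0 |
| 2 | 0.0 | 30.0 | 8.0 | 1.0 | 0.0 | 5.0 |
| 3 | 0.0 | 22.0 | 13.0 | 1.0 | 1.0 | 1.0 |
| 4 | 0.0 | 25.0 | 11.0 | 0.0 | 0.0 | 2.0 |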


In [27]:
#select Diabetes_risk as the target variable:
y = df_balanced['Diabetes_risk']

#select all the other columns (everything except Diabetes_risk) as the feature variables:
X = df_balanced.drop(['Diabetes_risk'],axis=1)

In [28]:
#now make the train-test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15, random_state=random_state)
print('Dimensions: \n x_train:{} \n x_test{} \n y_train{} \n y_test{}'.format(X_train.shape, X_test.shape, y_train.shape, y_test.shape))

Dimensions: 
 x_train:(100298, 5) 
 x_test(17700, 5) 
 y_train(100298,) 
 y_test(17700,)
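
The split above is purely random, so the class proportions in the train and test sets can drift slightly from those of the full balanced data. If exact proportions matter, `train_test_split` accepts a `stratify` argument. A minimal sketch on synthetic data (not the BRFSS data) follows, with made-up variable names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows per class with a single feature
rng = np.random.default_rng(7)
X_demo = pd.DataFrame({"BMI": rng.integers(15, 50, size=200)})
y_demo = pd.Series([0] * 100 + [1] * 100, name="Diabetes_risk")

# stratify=y_demo keeps the 50/50 class ratio in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=7, stratify=y_demo
)
print(y_te.value_counts().to_dict())
```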


In [29]:
X_test

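| | BMI | Age | HighChol | HighBP | GenHlth |
|---|---|---|---|---|---|
| 33238 | 24.0 | 9.0 | 0.0 | 0.0 | 5.0 |
| 73115 | 27.0 | 7.0 | 1.0 | 1.0 | 5.0 |
| 24610 | 43.0 | 11.0 | 1.0 | 1.0 | 3.0 |
| 56154 | 42.0 | 11.0 | 1.0 | 0.0 | 2.0 |
| 15559 | 23.0 | 10.0 | 0.0 | 0.0 | 3.0 |
| ... | ... | ... | ... | ... | ... |
| 68393 | 37.0 | 12.0 | 1.0 | 1.0 | 3.0 |
| 69933 | 25.0 | 10.0 | 1.0 | 1.0 | 2.0 |
| 45604 | 27.0 | 11.0 | 0.0 | 0.0 | 3.0 |
| 13586 | 22.0 | 4.0 | 0.0 | 1.0 | 1.0 |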


In [30]:
#create true negative, false positive, false negative, and true positive 
def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

In [31]:
#Setup classifier scorers
scorers = {'Accuracy': 'accuracy',
           'roc_auc': 'roc_auc',
           'Sensitivity': 'recall',
           'precision': 'precision',
           'tp': make_scorer(tp),
           'tn': make_scorer(tn),
           'fp': make_scorer(fp),
           'fn': make_scorer(fn)}

### DT

In [32]:
#change this name here to change the print name
classifier_name = 'DT: '

start_ts = time.time()
#try swapping out the classifier for a different one or changing the parameters
clf = DecisionTreeClassifier(max_depth = 3, min_samples_split=500, criterion='entropy',random_state=random_state)
scores = cross_validate(clf, X, y, scoring=scorers, cv=5)          

Sensitivity = round(scores['test_tp'].mean() / (scores['test_tp'].mean() + scores['test_fn'].mean()),3)*100   #TP/(TP+FN), also recall
Specificity = round(scores['test_tn'].mean() / (scores['test_tn'].mean() + scores['test_fp'].mean()),3)*100   #TN/(TN+FP)
PPV = round(scores['test_tp'].mean() / (scores['test_tp'].mean() + scores['test_fp'].mean()),3)*100           #TP/(TP+FP), also precision
NPV = round(scores['test_tn'].mean() / (scores['test_fn'].mean() + scores['test_tn'].mean()),3)*100           #TN/(TN+FN)

scores_Acc = scores['test_Accuracy']
print(f"{classifier_name} Acc: %0.2f (+/- %0.2f)" % (scores_Acc.mean(), scores_Acc.std() * 2))
scores_AUC = scores['test_roc_auc']                  #Only works with binary classes, not multiclass
print(f"{classifier_name} AUC: %0.2f (+/- %0.2f)" % (scores_AUC.mean(), scores_AUC.std() * 2))
scores_sensitivity = scores['test_Sensitivity']      #Only works with binary classes, not multiclass
print(f"{classifier_name} Recall/Sensitivity: %0.2f (+/- %0.2f)" % (scores_sensitivity.mean(), scores_sensitivity.std() * 2))
scores_precision = scores['test_precision']          #Only works with binary classes, not multiclass
print(f"{classifier_name} Precision/PPV: %0.2f (+/- %0.2f)" % (scores_precision.mean(), scores_precision.std() * 2))
print(f"{classifier_name} Specificity = ", round(Specificity,2), "%")
print(f"{classifier_name} NPV = ", round(NPV,2), "%")

print("Runtime:", round(time.time()-start_ts,2), 'seconds')

DT:  Acc: 0.70 (+/- 0.02)
DT:  AUC: 0.76 (+/- 0.01)
DT:  Recall/Sensitivity: 0.75 (+/- 0.02)
DT:  Precision/PPV: 0.68 (+/- 0.01)
DT:  Specificity =  65.4 %
DT:  NPV =  72.3 %
Runtime: 0.42 seconds


In [33]:
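#train the decision tree (note: this uses the default 'gini' criterion, while the cross-validated classifier above used 'entropy')
dt = DecisionTreeClassifier(max_depth = 3, min_samples_split=500, random_state=random_state)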
dt.fit(X_train,y_train)
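
A fitted tree this shallow can be read directly as a set of rules. The notebook imports `export_text` but never calls it, so here is a minimal sketch of how it could be used. It runs on synthetic data with an invented labelling rule, so the printed thresholds are not the real model's. The variable names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["BMI", "Age", "HighChol", "HighBP", "GenHlth"]

# Synthetic features in the same ranges as the cleaned data
rng = np.random.default_rng(7)
X_demo = np.column_stack([
    rng.integers(12, 60, size=400),   # BMI
    rng.integers(1, 14, size=400),    # Age category 1-13
    rng.integers(0, 2, size=400),     # HighChol
    rng.integers(0, 2, size=400),     # HighBP
    rng.integers(1, 6, size=400),     # GenHlth 1-5
])
# Invented label for the demo: at risk if high blood pressure and BMI of 30 or more
y_demo = ((X_demo[:, 3] == 1) & (X_demo[:, 0] >= 30)).astype(int)

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=7).fit(X_demo, y_demo)
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

With the real model, the same call would be `export_text(dt, feature_names=list(X_train.columns))`.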

In [36]:
y_pred = dt.predict(X_test)

In [37]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[5919 3006]
 [2185 6590]]
              precision    recall  f1-score   support

         0.0       0.73      0.66      0.70      8925
         1.0       0.69      0.75      0.72      8775

    accuracy                           0.71     17700
   macro avg       0.71      0.71      0.71     17700
weighted avg       0.71      0.71      0.71     17700
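
The screening metrics reported earlier (sensitivity, specificity, PPV, NPV) can be recomputed directly from the confusion matrix printed above. A minimal sketch follows. The function name `screening_metrics` is hypothetical, and the matrix layout assumed is scikit-learn's `[[TN, FP], [FN, TP]]`.

```python
import numpy as np


def screening_metrics(cm):
    """Compute screening metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    return {
        "sensitivity": tp / (tp + fn),   # recall for class 1
        "specificity": tn / (tn + fp),   # recall for class 0
        "ppv": tp / (tp + fp),           # precision for class 1
        "npv": tn / (tn + fn),           # precision for class 0
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Confusion matrix from the held-out test set above
cm = [[5919, 3006], [2185, 6590]]
for name, value in screening_metrics(cm).items():
    print(f"{name}: {value:.3f}")
```

These values agree with the classification report: sensitivity and PPV match the recall and precision of class 1.0, and specificity and NPV match those of class 0.0.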



### Saving the Model

In [38]:
import joblib
joblib.dump(dt, 'dt_model.pkl') 

['dt_model.pkl']

* Now that we have the model saved, an image of the model structure, and the data saved to .csv, we're ready to go into VS Code and code up the Streamlit app itself.
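
On the app side, the saved model has to be loaded and given one row of inputs in the same column order used for training. A minimal sketch of that step follows. It is not the actual app code. The helper `predict_risk` is hypothetical, and a tiny stand-in model is trained and saved to a temporary folder so the example runs on its own. The real app would load the `dt_model.pkl` saved above.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["BMI", "Age", "HighChol", "HighBP", "GenHlth"]

# Stand-in model (assumption): four made-up rows, saved the same way as the notebook's model
train = pd.DataFrame(
    [[22, 3, 0, 0, 1], [24, 5, 0, 0, 2], [35, 10, 1, 1, 4], [40, 12, 1, 1, 5]],
    columns=FEATURES,
)
labels = [0, 0, 1, 1]
model_path = os.path.join(tempfile.mkdtemp(), "dt_model.pkl")
joblib.dump(DecisionTreeClassifier(max_depth=3, random_state=7).fit(train, labels), model_path)


def predict_risk(path, bmi, age_cat, high_chol, high_bp, gen_hlth):
    """Load the saved tree and return the predicted probability of class 1 (diabetes risk)."""
    model = joblib.load(path)
    row = pd.DataFrame([[bmi, age_cat, high_chol, high_bp, gen_hlth]], columns=FEATURES)
    return float(model.predict_proba(row)[0, 1])


print(predict_risk(model_path, bmi=38, age_cat=11, high_chol=1, high_bp=1, gen_hlth=4))
```

Building the input as a DataFrame with the training column names keeps the feature order explicit, which matters because the tree only sees positions, not meanings.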