## Data acquisition

* The analysis started from a cleaned dataset available from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators) or [Kaggle](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset?select=diabetes_012_health_indicators_BRFSS2015.csv). The data curation process is documented in a [Jupyter Notebook](https://www.kaggle.com/code/alexteboul/diabetes-health-indicators-dataset-notebook), which takes a [raw dataset](https://www.kaggle.com/datasets/cdc/behavioral-risk-factor-surveillance-system/data?select=2015.csv) as input. A review of that process identified the following issues:
    * The type of diabetes (Type I or Type II) was not specified.
    * Respondents who reported diabetes only during pregnancy were not excluded. The data provider treated them as having no diabetes.
    * Pre-diabetes was not counted as a positive case, which contradicts the dataset description.
* To address these problems, a new notebook was developed to curate the data appropriately.
    * Unfortunately, there is no direct way to distinguish between the two types of diabetes. As a workaround, participants under 30 years of age were removed in the `2.3 Data Preprocessing.ipynb` step.
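
The age-based exclusion itself lives in a separate notebook that is not shown here, so the following is only a minimal sketch of what such a filter might look like. It assumes the renamed `Age` column still holds the BRFSS `_AGEG5YR` five-year group codes, and that codes 1 and 2 correspond to ages 18-24 and 25-29. The constant name `MIN_AGE_CODE` and the toy data are hypothetical.

```python
import pandas as pd

# Toy frame using the renamed 'Age' column (assumed to hold _AGEG5YR group codes).
toy = pd.DataFrame({"Age": [1, 2, 3, 9, 13], "Diabetes_binary": [0, 1, 0, 1, 1]})

# Assumption: codes 1 and 2 cover ages 18-24 and 25-29,
# so a code of 3 or higher means 30 years or older.
MIN_AGE_CODE = 3  # hypothetical constant name

# Keep only respondents in the 30+ age groups.
adults_30_plus = toy[toy["Age"] >= MIN_AGE_CODE].reset_index(drop=True)
print(adults_30_plus)
```

The code-to-age correspondence should be checked against the BRFSS codebook before relying on this threshold.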

### Import packages

In [1]:
import pandas as pd

### Import data

In [2]:
df0 = pd.read_csv("./hide/raw data/BRFSS-2015.csv")

### Select columns

In [3]:
columns = ['DIABETE3', '_RFHYPE5', 'TOLDHI2', '_CHOLCHK', '_BMI5', 'SMOKE100', 'CVDSTRK3', '_MICHD', '_TOTINDA', '_FRTLT1', '_VEGLT1', '_RFDRHV5', 'HLTHPLN1', 'MEDCOST', 'GENHLTH', 'MENTHLTH', 'PHYSHLTH', 'DIFFWALK', 'SEX', '_AGEG5YR', 'EDUCA', 'INCOME2']
df = df0[columns]

In [4]:
df.shape

(441456, 22)
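
Because only 22 columns of the raw file are used, the selection could also happen at read time. The sketch below illustrates this with the `usecols` parameter of `pd.read_csv` on a tiny in-memory CSV. The toy data and the `UNUSED` column are made up for illustration, and this is an alternative to the cells above rather than what the notebook actually does.

```python
import io
import pandas as pd

# Tiny stand-in for the raw BRFSS file (hypothetical contents).
raw_csv = io.StringIO("DIABETE3,SEX,UNUSED\n3,1,x\n1,2,y\n")

# Only the listed columns are parsed; 'UNUSED' is never loaded into memory.
wanted = ["DIABETE3", "SEX"]
subset = pd.read_csv(raw_csv, usecols=wanted)
print(subset.shape)
```

With a large raw file, reading only the required columns can reduce memory use and load time.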

### Select rows
* Records with a response of 'Don't know', 'Not sure', 'Refused', or 'Missing' were excluded.
* For the field 'Ever told you have diabetes' (`DIABETE3`), records with gestational diabetes (`DIABETE3` = 2), 'Don't know/Not sure' (`DIABETE3` = 7), or 'Refused' (`DIABETE3` = 9) were excluded.

In [5]:
# Keep only rows with no excluded code in any of the listed fields.
condition = (
    ~df.DIABETE3.isin([2, 7, 9])  # 2 = gestational diabetes
    & ~df._RFHYPE5.isin([9])
    & ~df.TOLDHI2.isin([7, 9])
    & ~df._CHOLCHK.isin([9])
    & ~df.SMOKE100.isin([7, 9])
    & ~df.CVDSTRK3.isin([7, 9])
    & ~df._TOTINDA.isin([9])
    & ~df._FRTLT1.isin([9])
    & ~df._VEGLT1.isin([9])
    & ~df._RFDRHV5.isin([9])
    & ~df.HLTHPLN1.isin([7, 9])
    & ~df.MEDCOST.isin([7, 9])
    & ~df.GENHLTH.isin([7, 9])
    & ~df.MENTHLTH.isin([77, 99])
    & ~df.PHYSHLTH.isin([77, 99])
    & ~df.DIFFWALK.isin([7, 9])
    & ~df._AGEG5YR.isin([14])
    & ~df.EDUCA.isin([9])
    & ~df.INCOME2.isin([77, 99])
)
# Take an explicit copy so the column assignments below do not raise SettingWithCopyWarning.
df = df[condition].copy()
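
The same row filter can also be written with the exclusion codes held as data. The following is a minimal sketch on a toy frame, not the notebook's own code. The names `EXCLUDE_CODES` and `build_keep_mask` are hypothetical, and only three of the fields are listed to keep the example short.

```python
import pandas as pd

# Hypothetical lookup: codes to exclude for each column.
EXCLUDE_CODES = {
    "DIABETE3": [2, 7, 9],
    "TOLDHI2": [7, 9],
    "MENTHLTH": [77, 99],
}

def build_keep_mask(frame, exclude_codes):
    """Return a boolean Series that is True for rows with no excluded code."""
    keep = pd.Series(True, index=frame.index)
    for column, codes in exclude_codes.items():
        keep &= ~frame[column].isin(codes)
    return keep

toy = pd.DataFrame({
    "DIABETE3": [1, 2, 3, 4, 9],
    "TOLDHI2": [1, 2, 7, 1, 2],
    "MENTHLTH": [88, 5, 10, 99, 30],
})

# Only the first row has no excluded code in any listed column.
filtered = toy[build_keep_mask(toy, EXCLUDE_CODES)].copy()
print(filtered)
```

Holding the codes in a dictionary makes it easier to review them against the codebook one field at a time.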

### Mapping values

In [6]:
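# Mapping used by the original data provider, kept for reference (it codes gestational diabetes, 2, as no diabetes):
# df.DIABETE3 = df.DIABETE3.replace({2:0, 3:0, 1:2, 4:1})
# Mapping used here: 3 (no) -> 0, 4 (pre-diabetes) -> 1, 1 (yes) -> 2
df.DIABETE3 = df.DIABETE3.replace({3:0, 4:1, 1:2})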
df._RFHYPE5 = df._RFHYPE5.replace({1:0, 2:1})
df.TOLDHI2 = df.TOLDHI2.replace({2:0})
df._CHOLCHK = df._CHOLCHK.replace({3:0,2:0})
df._BMI5 = df._BMI5.div(100).round(0)
df.SMOKE100 = df.SMOKE100.replace({2:0})
df.CVDSTRK3 = df.CVDSTRK3.replace({2:0})
df._MICHD = df._MICHD.replace({2:0})
df._TOTINDA = df._TOTINDA.replace({2:0})
df._FRTLT1 = df._FRTLT1.replace({2:0})
df._VEGLT1 = df._VEGLT1.replace({2:0})
df._RFDRHV5 = df._RFDRHV5.replace({1:0, 2:1})
df.HLTHPLN1 = df.HLTHPLN1.replace({2:0})
df.MEDCOST = df.MEDCOST.replace({2:0})
df.MENTHLTH = df.MENTHLTH.replace({88:0})
df.PHYSHLTH = df.PHYSHLTH.replace({88:0})
df.DIFFWALK = df.DIFFWALK.replace({2:0})
df.SEX = df.SEX.replace({2:0})
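
As a sanity check on the recoding, it can help to confirm that the yes/no fields contain only 0 and 1 afterwards. The sketch below applies a few of the mappings above to a toy frame using the nested-dictionary form of `DataFrame.replace`, then lists any unexpected values. The names `recode`, `binary_columns`, and `unexpected` are hypothetical, and the toy data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "_RFHYPE5": [1, 2, 2],
    "SMOKE100": [1, 2, 1],
    "MENTHLTH": [88, 3, 30],
})

# Nested dict of the form {column: {old_value: new_value}}.
recode = {
    "_RFHYPE5": {1: 0, 2: 1},
    "SMOKE100": {2: 0},
    "MENTHLTH": {88: 0},
}
mapped = toy.replace(recode)

# Collect any value other than 0 or 1 that remains in the yes/no columns.
binary_columns = ["_RFHYPE5", "SMOKE100"]
unexpected = {col: sorted(set(mapped[col].unique()) - {0, 1}) for col in binary_columns}
unexpected = {col: vals for col, vals in unexpected.items() if vals}
print(unexpected)
```

An empty dictionary means every listed column holds only 0 and 1. Missing values would show up here as unexpected entries.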

In [7]:
df.shape

(294938, 22)

In [8]:
df.DIABETE3.value_counts(dropna=False)

DIABETE3
0.0    251882
2.0     37955
1.0      5098
NaN         3
Name: count, dtype: int64

### Create Binary Dataset for diabetes vs no diabetes
* Combine pre-diabetes (1) and diabetes (2) into a single value of 1, compared to no diabetes (0)

In [9]:
df_binary = df.copy()
df_binary = df_binary.dropna()
print("After removing missing records:")
print(df_binary.DIABETE3.value_counts(dropna=False))
df_binary.DIABETE3 = df_binary.DIABETE3.replace({2:1})
print("After further grouping pre-diabetes and diabetes:")
df_binary.DIABETE3.value_counts(dropna=False)

After removing missing records:
DIABETE3
0.0    211725
2.0     35346
1.0      4631
Name: count, dtype: int64
After further grouping pre-diabetes and diabetes:


DIABETE3
0.0    211725
1.0     39977
Name: count, dtype: int64
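
The grouped counts show that the target is imbalanced. As a small worked example, the class shares can be computed directly from the two counts reported above. The variable names are illustrative.

```python
import pandas as pd

# Counts copied from the output above (0 = no diabetes, 1 = pre-diabetes or diabetes).
counts = pd.Series({0.0: 211725, 1.0: 39977}, name="count")

share = counts / counts.sum()        # fraction of records in each class
positive_share = share.loc[1.0]      # about 0.159
imbalance_ratio = counts.loc[0.0] / counts.loc[1.0]  # negatives per positive, about 5.3

print(share.round(4))
```

Roughly 16% of the records are positive, which is worth keeping in mind when choosing evaluation metrics for any downstream model.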

#### Rename columns

In [10]:
column_name = {'DIABETE3':'Diabetes_binary', 
               '_RFHYPE5':'HighBP',  
               'TOLDHI2':'HighChol', 
               '_CHOLCHK':'CholCheck', 
               '_BMI5':'BMI', 
               'SMOKE100':'Smoker', 
               'CVDSTRK3':'Stroke', 
               '_MICHD':'HeartDiseaseorAttack', 
               '_TOTINDA':'PhysActivity', 
               '_FRTLT1':'Fruits', 
               '_VEGLT1':"Veggies", 
               '_RFDRHV5':'HvyAlcoholConsump', 
               'HLTHPLN1':'AnyHealthcare', 
               'MEDCOST':'NoDocbcCost', 
               'GENHLTH':'GenHlth', 
               'MENTHLTH':'MentHlth',
               'PHYSHLTH':'PhysHlth', 
               'DIFFWALK':'DiffWalk', 
               'SEX':'Sex', 
               '_AGEG5YR':'Age', 
               'EDUCA': 'Education',
               'INCOME2':'Income'}
df_binary = df_binary.rename(columns = column_name)

### Export data

In [11]:
df_binary.to_csv("./data/BRFSS-2015_binary.csv", index=False)