# Diabetes Classification

## About dataset
- The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey that collects data from U.S. residents on their health-related risk behaviors, chronic health conditions, and use of preventive services
- The survey was established in 1984 with 15 states; it now collects data from all 50 states, D.C., and 3 U.S. territories
- Over 400,000 adult interviews are completed each year, making it the largest continuous health survey system in the world
- Factors assessed include tobacco use, healthcare coverage, HIV/AIDS knowledge/prevention, physical activity, and fruit/vegetable consumption
- A record in the data corresponds to a single respondent (each from a single household)
- The description of columns can be found in the linked PDF file

### Feature descriptions
| Feature               | Description                                                                  |
|-----------------------|------------------------------------------------------------------------------|
| diabetes              | Subject was told they have diabetes                                          |
| high_blood_pressure   | Subject has high blood pressure                                              |
| high_cholesterol      | Subject has high cholesterol                                                 |
| cholesterol_check     | Subject had cholesterol check within the last five years                     |
| bmi                   | BMI of the subject                                                           |
| smoked_100_cigarettes | Subject has smoked at least 100 cigarettes during their life                 |
| stroke                | Subject experienced stroke during their life                                 |
| coronary_disease      | Subject has/had coronary heart disease or myocardial infarction              |
| exercise              | Subject does regular exercise or physical activity                           |
| consumes_fruit        | Subject consumes fruits at least once a day                                  |
| consumes_vegetable    | Subject consumes vegetables at least once a day                              |
| insurance             | Subject has some kind of health plan (insurance, prepaid plans, ...)         |
| no_doctor_money       | Subject was unable to visit doctor in the past 12 months because of cost     |
| health                | How good is the health of the subject (self rated)                           |
| mental_health         | Number of days in the past month when subject's mental health was not good   |
| physical_health       | Number of days in the past month when subject's physical health was not good |
| climb_difficulty      | Subject has difficulties climbing stairs                                     |
| sex                   | Sex of the subject                                                           |
| age_category          | Age category of the subject                                                  |
| education_level       | Highest level of education achieved by the subject                           |
| income                | Income of subject's household                                                |

In [18]:
from pathlib import Path

import pandas as pd


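def load_dataset(part_dir):
    """Load the five CSV parts (part1.csv to part5.csv) from part_dir and stack them."""
    part_paths = [Path(part_dir) / f"part{part_num}.csv" for part_num in range(1, 6)]
    # Each part keeps its own row labels, so the combined index repeats them per part.
    return pd.concat([pd.read_csv(path) for path in part_paths])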


dataset = load_dataset("data")
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500000 entries, 0 to 99999
Data columns (total 24 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Unnamed: 0  500000 non-null  int64  
 1   ID          500000 non-null  int64  
 2   DIABETE3    499995 non-null  float64
 3   _RFHYPE5    500000 non-null  int64  
 4   TOLDHI2     433630 non-null  float64
 5   _CHOLCHK    500000 non-null  int64  
 6   _BMI5       457835 non-null  float64
 7   SMOKE100    480420 non-null  float64
 8   CVDSTRK3    500000 non-null  int64  
 9   _MICHD      495475 non-null  float64
 10  _TOTINDA    500000 non-null  int64  
 11  _FRTLT1     500000 non-null  int64  
 12  _VEGLT1     500000 non-null  int64  
 13  _RFDRHV5    500000 non-null  int64  
 14  HLTHPLN1    500000 non-null  int64  
 15  MEDCOST     500000 non-null  int64  
 16  GENHLTH     499995 non-null  float64
 17  MENTHLTH    500000 non-null  int64  
 18  PHYSHLTH    500000 non-null  int64  
 19  DIFFWALK
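
The `Index: 500000 entries, 0 to 99999` line in the output above indicates that every part keeps its own 0-based row labels after concatenation, so the labels repeat across parts. A minimal sketch of the effect follows, using two small in-memory frames as hypothetical stand-ins for the CSV parts (their contents are made up for illustration). It shows how `ignore_index=True` would yield one continuous index instead:

```python
import pandas as pd

# Hypothetical stand-ins for two CSV parts, each with its own 0-based index.
part1 = pd.DataFrame({"ID": [0, 1, 2]})
part2 = pd.DataFrame({"ID": [3, 4, 5]})

# The default concat keeps the original labels, so 0, 1, 2 each appear twice.
stacked = pd.concat([part1, part2])
print(stacked.index.is_unique)

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index.
reindexed = pd.concat([part1, part2], ignore_index=True)
print(reindexed.index.is_unique)
```

Whether this belongs in `load_dataset` depends on whether later cells select rows by label; with repeated labels, `dataset.loc[0]` would return one row per part rather than a single record.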

In [19]:
rename_map = {
    "DIABETE3": "diabetes",
    "_RFHYPE5": "high_blood_pressure",
    "TOLDHI2": "high_cholesterol",
    "_CHOLCHK": "cholesterol_check",
    "_BMI5": "bmi",
    "SMOKE100": "smoked_100_cigarettes",
    "CVDSTRK3": "stroke",
    "_MICHD": "coronary_disease",
    "_TOTINDA": "exercise",
    "_FRTLT1": "consumes_fruit",
    "_VEGLT1": "consumes_vegetable",
    "HLTHPLN1": "insurance",
    "MEDCOST": "no_doctor_money",
    "GENHLTH": "health",
    "MENTHLTH": "mental_health",
    "PHYSHLTH": "physical_health",
    "DIFFWALK": "climb_difficulty",
    "SEX": "sex",
    "_AGEG5YR": "age_category",
    "EDUCA": "education_level",
    "INCOME2": "income"
}

dataset.rename(columns=rename_map, inplace=True)
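
By default, `DataFrame.rename` skips keys that do not match any column, so a typo in `rename_map` would go unnoticed. The sketch below is one possible sanity check, run on a hypothetical miniature frame. The column values are invented, and `_RFDRHV5` is left out of the map on purpose, mirroring the fact that it appears in the `info()` output but not in `rename_map`:

```python
import pandas as pd

# Hypothetical miniature frame with raw BRFSS-style column names.
df = pd.DataFrame({
    "DIABETE3": [1.0, 3.0],
    "_BMI5": [4018.0, 2509.0],
    "_RFDRHV5": [1, 1],
})
# Illustrative subset of a rename map; "SEX" has no matching column in df.
example_map = {"DIABETE3": "diabetes", "_BMI5": "bmi", "SEX": "sex"}

# Keys that rename() would silently skip because the column does not exist.
missing_keys = sorted(set(example_map) - set(df.columns))
# Columns that keep their raw name because the map has no entry for them.
unmapped_columns = sorted(set(df.columns) - set(example_map))

print("keys without a column:", missing_keys)
print("columns without a mapping:", unmapped_columns)

renamed = df.rename(columns=example_map)
print(list(renamed.columns))
```

Passing `errors="raise"` to `rename` is a stricter alternative: it raises a `KeyError` when a key in the map is absent from the frame.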

In [21]:
from categories import *

# Create a dictionary mapping column names to their corresponding Enum classes
enum_mapping = {
    "diabetes": Diabetes,
    "high_blood_pressure": HighBloodPressure,
    "high_cholesterol": BloodCholesterolHigh,
    "cholesterol_check": CholesterolChecked,
    "smoked_100_cigarettes": SmokedAtLeast100Cigarettes,
    "stroke": EverDiagnosedWithStroke,
    "coronary_disease": EverHadCHDorMI,
    "exercise": LeisureTimePhysicalActivity,
    "consumes_fruit": ConsumeFruitFrequency,
    "consumes_vegetable": ConsumeVegetablesFrequency,
    "insurance": HealthCareCoverage,
    "no_doctor_money": CouldNotSeeDoctorBecauseOfCost,
    "health": GeneralHealth,
    "climb_difficulty": DifficultyWalkingOrClimbingStairs,
    "sex": RespondentSex,
    "age_category": AgeFiveYearCategories,
    "education_level": EducationLevel,
    "income": IncomeLevel,
}


def to_category_value(val, enum_class):
    """Convert a numerical code to the string form of its Enum member, or pd.NA for non-answers."""

    convert_to_nan = {
        "REFUSED",
        "Blank",
        "DONT_KNOW_OR_NOT_SURE",
        "DONT_KNOW_OR_NOT_SURE_OR_REFUSED_OR_MISSING",
        "DONT_KNOW_OR_REFUSED_OR_MISSING",
        "BLANK",
        "DONT_KNOW_REFUSED_OR_MISSING",
    }

    if pd.isna(val):
        return pd.NA
    member = enum_class(val)
    if member.name in convert_to_nan:
        return pd.NA
    return str(member)


# Replace numerical values with string representations from the corresponding Enum classes.
# The Enum class is passed explicitly instead of being read from the loop's global variable.
for column, enum_class in enum_mapping.items():
    dataset[column] = dataset[column].apply(to_category_value, enum_class=enum_class)


# Convert columns which make use of Enum classes to category types
object_cols = dataset.select_dtypes(include=["object"]).columns
dataset[object_cols] = dataset[object_cols].astype("category")
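
The `categories` module is not shown in this notebook, so the sketch below uses a hypothetical `ExampleAnswer` Enum (its member names and codes are assumptions) to illustrate the same code-to-label conversion. It is an alternative formulation, not the notebook's method: it builds a lookup table once and applies it with `Series.map` instead of calling the Enum for every row.

```python
from enum import Enum

import pandas as pd


class ExampleAnswer(Enum):
    """Hypothetical stand-in for a class from the `categories` module."""
    YES = 1
    NO = 2
    DONT_KNOW_OR_NOT_SURE = 7
    REFUSED = 9


# Member names treated as missing answers (a subset of convert_to_nan above).
NON_ANSWERS = {"DONT_KNOW_OR_NOT_SURE", "REFUSED"}

# Build the code -> label lookup once; non-answers are left out of it.
lookup = {
    member.value: str(member)
    for member in ExampleAnswer
    if member.name not in NON_ANSWERS
}

raw = pd.Series([1.0, 2.0, 7.0, None, 9.0])
# Series.map leaves NaN and any code without a lookup entry as missing.
labels = raw.map(lookup).astype("category")
print(labels)
```

One behavioral difference is worth noting: calling an Enum class with an unknown code raises a `ValueError`, whereas the lookup approach quietly turns unknown codes into missing values. Which is preferable depends on whether unexpected codes should stop the pipeline.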

In [24]:
dataset.head()


Unnamed: 0.1,Unnamed: 0,ID,diabetes,high_blood_pressure,high_cholesterol,cholesterol_check,bmi,smoked_100_cigarettes,stroke,coronary_disease,...,insurance,no_doctor_money,health,mental_health,physical_health,climb_difficulty,sex,age_category,education_level,income
0,0,0,Diabetes.NO,HighBloodPressure.YES,BloodCholesterolHigh.YES,CholesterolChecked.CHECKED_IN_PAST_5_YEARS,4018.0,SmokedAtLeast100Cigarettes.YES,EverDiagnosedWithStroke.NO,EverHadCHDorMI.DID_NOT_REPORT_HAVING_MI_OR_CHD,...,HealthCareCoverage.YES,CouldNotSeeDoctorBecauseOfCost.NO,GeneralHealth.POOR,18,15,DifficultyWalkingOrClimbingStairs.YES,RespondentSex.FEMALE,AgeFiveYearCategories.AGE_60_TO_64,EducationLevel.GRADE_12_OR_GED,IncomeLevel.LESS_THAN_20000
1,1,1,Diabetes.NO,HighBloodPressure.NO,BloodCholesterolHigh.NO,CholesterolChecked.NOT_CHECKED_IN_PAST_5_YEARS,2509.0,SmokedAtLeast100Cigarettes.YES,EverDiagnosedWithStroke.NO,EverHadCHDorMI.DID_NOT_REPORT_HAVING_MI_OR_CHD,...,HealthCareCoverage.NO,CouldNotSeeDoctorBecauseOfCost.YES,GeneralHealth.GOOD,88,88,DifficultyWalkingOrClimbingStairs.NO,RespondentSex.FEMALE,AgeFiveYearCategories.AGE_50_TO_54,EducationLevel.COLLEGE_4_YEARS_OR_MORE,IncomeLevel.LESS_THAN_10000
2,2,2,Diabetes.NO,HighBloodPressure.NO,BloodCholesterolHigh.YES,CholesterolChecked.CHECKED_IN_PAST_5_YEARS,2204.0,,EverDiagnosedWithStroke.YES,,...,HealthCareCoverage.YES,CouldNotSeeDoctorBecauseOfCost.NO,GeneralHealth.FAIR,88,15,,RespondentSex.FEMALE,AgeFiveYearCategories.AGE_70_TO_74,EducationLevel.GRADE_12_OR_GED,IncomeLevel.REFUSED
3,3,3,Diabetes.NO,HighBloodPressure.YES,BloodCholesterolHigh.YES,CholesterolChecked.CHECKED_IN_PAST_5_YEARS,2819.0,SmokedAtLeast100Cigarettes.NO,EverDiagnosedWithStroke.NO,EverHadCHDorMI.DID_NOT_REPORT_HAVING_MI_OR_CHD,...,HealthCareCoverage.YES,CouldNotSeeDoctorBecauseOfCost.YES,GeneralHealth.POOR,30,30,DifficultyWalkingOrClimbingStairs.YES,RespondentSex.FEMALE,AgeFiveYearCategories.AGE_60_TO_64,EducationLevel.GRADE_12_OR_GED,IncomeLevel.GREATER_THAN_OR_EQUAL_75000
4,3,3,Diabetes.NO,HighBloodPressure.YES,BloodCholesterolHigh.YES,CholesterolChecked.CHECKED_IN_PAST_5_YEARS,2819.0,SmokedAtLeast100Cigarettes.NO,EverDiagnosedWithStroke.NO,EverHadCHDorMI.DID_NOT_REPORT_HAVING_MI_OR_CHD,...,HealthCareCoverage.YES,CouldNotSeeDoctorBecauseOfCost.YES,GeneralHealth.POOR,30,30,DifficultyWalkingOrClimbingStairs.YES,RespondentSex.FEMALE,AgeFiveYearCategories.AGE_60_TO_64,EducationLevel.GRADE_12_OR_GED,IncomeLevel.GREATER_THAN_OR_EQUAL_75000
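
In the preview above, the last two rows carry the same `ID` (3) and identical values, which suggests the data may contain repeated respondents. A minimal sketch of how such rows could be inspected and dropped follows, assuming `ID` uniquely identifies a respondent (that assumption is not confirmed by the source). The miniature frame is hypothetical and only mirrors the `ID` and `bmi` values shown in the preview:

```python
import pandas as pd

# Hypothetical miniature frame mirroring the preview: the last two rows repeat ID 3.
df = pd.DataFrame({
    "ID": [0, 1, 2, 3, 3],
    "bmi": [4018.0, 2509.0, 2204.0, 2819.0, 2819.0],
})

# keep=False marks every member of a duplicated group, not only the repeats.
repeated = df[df.duplicated(subset="ID", keep=False)]
print(repeated)

# Keep the first occurrence of each ID and rebuild a continuous index.
deduplicated = df.drop_duplicates(subset="ID", keep="first").reset_index(drop=True)
print(len(df), len(deduplicated))
```

Inspecting `repeated` before dropping anything helps confirm that rows sharing an `ID` really are identical rather than conflicting records.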
