# Diabetes Classification

## About dataset
- The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey that collects data from U.S. residents on their health-related risk behaviors, chronic health conditions, and use of preventive services
- BRFSS was established in 1984 with 15 states; it now collects data in all 50 states, D.C., and 3 U.S. territories
- Over 400,000 adult interviews are completed each year, making it the largest continuous health survey system in the world
- Factors assessed include tobacco use, healthcare coverage, HIV/AIDS knowledge/prevention, physical activity, and fruit/vegetable consumption
- A record in the data corresponds to a single respondent (each from a single household)
- The description of columns can be found in the linked PDF file

#### Feature descriptions
| Feature               | Description                                                                  |
|-----------------------|------------------------------------------------------------------------------|
| diabetes              | Subject was told they have diabetes                                          |
| high_blood_pressure   | Subject has high blood pressure                                              |
| high_cholesterol      | Subject has high cholesterol                                                 |
| cholesterol_check     | Subject had cholesterol check within the last five years                     |
| bmi                   | BMI of the subject                                                           |
| smoked_100_cigarettes | Subject has smoked at least 100 cigarettes during their life                 |
| stroke                | Subject experienced stroke during their life                                 |
| coronary_disease      | Subject has/had coronary heart disease or myocardial infarction              |
| exercise              | Subject does regular exercise or physical activity                           |
| consumes_fruit        | Subject consumes fruits at least once a day                                  |
| consumes_vegetables   | Subject consumes vegetables at least once a day                              |
| insurance             | Subject has some kind of health plan (insurance, prepaid plans, ...)         |
| no_doctor_money       | Subject was unable to visit doctor in the past 12 months because of cost     |
| health                | How good is the health of the subject (self rated)                           |
| mental_health         | Number of days in the past month when subject's mental health was not good   |
| physical_health       | Number of days in the past month when subject's physical health was not good |
| climb_difficulty      | Subject has difficulties climbing stairs                                     |
| sex                   | Sex of the subject                                                           |
| age_category          | Age category of the subject                                                  |
| education_level       | Highest level of education achieved by the subject                           |
| income                | Income of subject's household                                                |

Load the dataset. All five parts are concatenated into one DataFrame.

In [8]:
from utils import load_dataset


dataset = load_dataset("data")
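
The `load_dataset` helper comes from the project's `utils` module and is not shown here. As a rough illustration of the concatenation step, a loader of this kind might look like the sketch below. The function name `load_parts`, the `*.csv` pattern, and the assumption that the parts are CSV files in one directory are all hypothetical, not the actual implementation.

```python
from pathlib import Path

import pandas as pd


def load_parts(directory, pattern="*.csv"):
    """Read every file matching `pattern` in `directory` and stack them row-wise.

    Hypothetical stand-in for utils.load_dataset; file format and naming are assumed.
    """
    paths = sorted(Path(directory).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {directory!r}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Sorting the paths keeps the row order reproducible between runs, which matters if later steps rely on positional indexing.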

Do basic preprocessing of column names and categorical values to make the dataset more human-readable.

In [9]:
from utils import process_columns

process_columns(dataset)

# 'Unnamed: 0' is a duplicate column of ID
dataset.drop("Unnamed: 0", axis=1, inplace=True)
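
`process_columns` is also defined in `utils` and not shown. The labels in the preview (for example `HighBloodPressure.YES`) look like enum members, so one plausible way to decode a categorical column is to map raw numeric codes onto an `Enum`. The sketch below assumes that approach. The helper `decode_column` is hypothetical, and the codes 1 and 2 are placeholders rather than the real BRFSS values, which are documented in the codebook.

```python
from enum import Enum

import pandas as pd


class HighBloodPressure(Enum):
    # Placeholder codes for illustration only; check the codebook for the real ones
    YES = 1
    NO = 2


def decode_column(series, enum_cls):
    """Map raw codes to enum members; codes without a matching member become NaN."""
    lookup = {member.value: member for member in enum_cls}
    return series.map(lookup)
```

For example, `decode_column(pd.Series([1, 2, 9]), HighBloodPressure)` yields the two enum members followed by a missing value for the unknown code 9.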

In [10]:
dataset.head()


Unnamed: 0,ID,diabetes,high_blood_pressure,high_cholesterol,cholesterol_check,bmi,smoked_100_cigarettes,stroke,coronary_disease,exercise,...,insurance,no_doctor_money,health,mental_health,physical_health,climb_difficulty,sex,age_category,education_level,income
0,0,Diabetes.NO,HighBloodPressure.YES,BloodCholesterolHigh.YES,CholesterolChecked.CHECKED_IN_PAST_5_YEARS,4018.0,SmokedAtLeast100Cigarettes.YES,EverDiagnosedWithStroke.NO,EverHadCHDorMI.DID_NOT_REPORT_HAVING_MI_OR_CHD,LeisureTimePhysicalActivity.NO_PHYSICAL_ACTIVI...,...,HealthCareCoverage.YES,CouldNotSeeDoctorBecauseOfCost.NO,GeneralHealth.POOR,18,15,DifficultyWalkingOrClimbingStairs.YES,RespondentSex.FEMALE,AgeFiveYearCategories.AGE_60_TO_64,EducationLevel.GRADE_12_OR_GED,IncomeLevel.LESS_THAN_20000
1,1,Diabetes.NO,HighBloodPressure.NO,BloodCholesterolHigh.NO,CholesterolChecked.NOT_CHECKED_IN_PAST_5_YEARS,2509.0,SmokedAtLeast100Cigarettes.YES,EverDiagnosedWithStroke.NO,EverHadCHDorMI.DID_NOT_REPORT_HAVING_MI_OR_CHD,LeisureTimePhysicalActivity.HAD_PHYSICAL_ACTIV...,...,HealthCareCoverage.NO,CouldNotSeeDoctorBecauseOfCost.YES,GeneralHealth.GOOD,88,88,DifficultyWalkingOrClimbingStairs.NO,RespondentSex.FEMALE,AgeFiveYearCategories.AGE_50_TO_54,EducationLevel.COLLEGE_4_YEARS_OR_MORE,IncomeLevel.LESS_THAN_10000
2,2,Diabetes.NO,HighBloodPressure.NO,BloodCholesterolHigh.YES,CholesterolChecked.CHECKED_IN_PAST_5_YEARS,2204.0,,EverDiagnosedWithStroke.YES,,,...,HealthCareCoverage.YES,CouldNotSeeDoctorBecauseOfCost.NO,GeneralHealth.FAIR,88,15,,RespondentSex.FEMALE,AgeFiveYearCategories.AGE_70_TO_74,EducationLevel.GRADE_12_OR_GED,
3,3,Diabetes.NO,HighBloodPressure.YES,BloodCholesterolHigh.YES,CholesterolChecked.CHECKED_IN_PAST_5_YEARS,2819.0,SmokedAtLeast100Cigarettes.NO,EverDiagnosedWithStroke.NO,EverHadCHDorMI.DID_NOT_REPORT_HAVING_MI_OR_CHD,LeisureTimePhysicalActivity.NO_PHYSICAL_ACTIVI...,...,HealthCareCoverage.YES,CouldNotSeeDoctorBecauseOfCost.YES,GeneralHealth.POOR,30,30,DifficultyWalkingOrClimbingStairs.YES,RespondentSex.FEMALE,AgeFiveYearCategories.AGE_60_TO_64,EducationLevel.GRADE_12_OR_GED,IncomeLevel.GREATER_THAN_OR_EQUAL_75000
4,3,Diabetes.NO,HighBloodPressure.YES,BloodCholesterolHigh.YES,CholesterolChecked.CHECKED_IN_PAST_5_YEARS,2819.0,SmokedAtLeast100Cigarettes.NO,EverDiagnosedWithStroke.NO,EverHadCHDorMI.DID_NOT_REPORT_HAVING_MI_OR_CHD,LeisureTimePhysicalActivity.NO_PHYSICAL_ACTIVI...,...,HealthCareCoverage.YES,CouldNotSeeDoctorBecauseOfCost.YES,GeneralHealth.POOR,30,30,DifficultyWalkingOrClimbingStairs.YES,RespondentSex.FEMALE,AgeFiveYearCategories.AGE_60_TO_64,EducationLevel.GRADE_12_OR_GED,IncomeLevel.GREATER_THAN_OR_EQUAL_75000
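
The preview contains `bmi` values such as 4018.0 and day counts of 88, neither of which is plausible as a BMI or as a number of days in a month. One likely explanation is that these columns still carry the raw BRFSS encoding: BMI stored with two implied decimal places, and for the day-count questions 88 meaning "none", with 77 and 99 meaning "don't know" and "refused". Under those assumptions, which should be verified against the codebook, a minimal decoding sketch could look like this (the helper name `decode_numeric` is hypothetical):

```python
import pandas as pd


def decode_numeric(frame):
    """Return a copy with bmi rescaled and day-count codes decoded.

    Assumes raw BRFSS conventions: bmi has two implied decimals,
    88 = zero days, 77/99 = unknown or refused.
    """
    out = frame.copy()
    out["bmi"] = out["bmi"] / 100
    for column in ("mental_health", "physical_health"):
        days = pd.to_numeric(out[column], errors="coerce")
        days = days.mask(days.isin([77, 99]))  # unknown/refused -> NaN
        days = days.mask(days == 88, 0)        # "none" -> 0 days
        out[column] = days
    return out
```

Applied to the first previewed row, this would turn a `bmi` of 4018.0 into 40.18 and leave the day counts 18 and 15 unchanged.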
