The objective of this project is to build a classifier that detects signs of diabetes or prediabetes. The dataset contains questionnaire responses from over 70,000 patients and consists of 22 columns: health-related features plus a target label indicating the presence or absence of diabetes.
- y_Diabetes: Indicates whether the individual has diabetes or prediabetes (yes or no)
- HighBP: High blood pressure
- High Cholesterol: High cholesterol level
- Check Cholesterol: Whether the individual has undergone a cholesterol checkup or not
- BMI: Body Mass Index
- Smoker: Smoking status
- Stroke: Occurrence of stroke
- HeartDiseaseorAttack: History of heart disease or heart attack
- Physical Activity: Level of physical activity
- Fruits: Consumption of fruits
- Vegetables: Consumption of vegetables
- Heavy Alcohol Consumption: Heavy alcohol consumption
- Care Health Any: Health insurance coverage
- Cost of because Doctor No: Avoidance of doctor visits due to cost concerns
- General Health: General health condition
- Mental Health: Mental health condition
- Physical Health: Physical health condition
- Walking Difficulty: Difficulty in walking
- Sex: Gender
- Age: Age of the individual
- Education: Education level
- Income: Income level
- Removing missing data
- Normalizing features
- Feature engineering
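
The three steps above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual pipeline: the function name `preprocess`, the choice of min-max scaling, and the interaction feature `BMI_x_HighBP` are assumptions made for the example. Only the column names `y_Diabetes`, `BMI`, and `HighBP` come from the feature list.

```python
import pandas as pd


def preprocess(df: pd.DataFrame, target: str = "y_Diabetes") -> pd.DataFrame:
    """Drop incomplete rows, add one derived feature, and scale features to [0, 1]."""
    # Removing missing data: discard every row that has at least one NaN.
    clean = df.dropna().reset_index(drop=True)
    features = clean.drop(columns=[target])

    # Feature engineering (hypothetical example): an interaction between
    # BMI and high blood pressure, added only if both columns are present.
    if {"BMI", "HighBP"}.issubset(features.columns):
        features = features.assign(
            BMI_x_HighBP=features["BMI"] * features["HighBP"]
        )

    # Normalizing features: min-max scaling per column (an assumed choice).
    col_min = features.min()
    # Constant columns have a range of 0; use 1 instead to avoid dividing by zero.
    col_range = (features.max() - col_min).replace(0, 1)
    scaled = (features - col_min) / col_range

    # Re-attach the target column, which is left unscaled.
    scaled[target] = clean[target]
    return scaled
```

Calling `preprocess(df)` on the raw DataFrame would return a new DataFrame and leave the input untouched.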
To build the classifier, we will use the XGBoost library. The classifier will be trained on the dataset to predict the presence or absence of diabetes.
We will perform hyperparameter tuning to optimize the performance of the classifier.
We also visualize how the hyperparameters change during the tuning process.
After training the classifier, we achieved an accuracy of 75% on the test set. The picture below shows the confusion matrix.
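
To make the link between a confusion matrix and accuracy concrete, here is a small dependency-free sketch. The function names are illustrative, and the labels in the usage example are made-up toy values, not results from this project.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels encoded as 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        # Row index = true label, column index = predicted label.
        matrix[truth][pred] += 1
    return matrix


def accuracy(matrix):
    """Accuracy = correct predictions (the diagonal) divided by all predictions."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return (tn + tp) / total if total else 0.0


# Toy labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)            # [[4, 1], [1, 4]]
print(accuracy(cm))  # 0.8
```

The off-diagonal cells separate the two kinds of error: false positives (healthy individuals flagged as diabetic) and false negatives (diabetic individuals missed), which accuracy alone does not distinguish.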
Please refer to the code files for more detailed implementation and usage instructions.