# REPORT
This document outlines the steps taken to clean and preprocess the dataset `train.csv` used in the prediction of CHD. It also contains a summary of findings as well as recommendations.

## DATA CLEANING AND TRANSFORMATION

Libraries used: 
pandas, numpy, matplotlib, seaborn, sklearn and ydata_profiling.

*STEP 1:*
Dataset was imported using pandas.

*STEP 2:*
Initial inspection was conducted using pandas and automated EDA via ydata_profiling.

*STEP 3:*
Seven numeric columns with missing data were identified.
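
A check like this one can be sketched with pandas. The toy frame below stands in for `train.csv`, its values are illustrative only, and the helper name `numeric_columns_with_missing` is hypothetical rather than taken from the project code.

```python
import numpy as np
import pandas as pd


def numeric_columns_with_missing(df: pd.DataFrame) -> pd.Series:
    """Return missing-value counts for numeric columns that have any."""
    counts = df.select_dtypes(include="number").isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy data standing in for train.csv (illustrative values only).
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "glucose": [77.0, np.nan, 70.0, np.nan],
    "cigsPerDay": [0.0, 20.0, np.nan, 10.0],
})

missing = numeric_columns_with_missing(toy)
print(missing)
```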

*STEP 4:*
A binary feature was created to flag missing values in one of the columns, _glucose_, because almost 10% of its values were missing.
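
A minimal sketch of such an indicator, assuming it is created before imputation so the missingness information is not lost. The column name `glucose_missing` is an assumption; the report does not name the new feature.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the glucose column (illustrative values only).
toy = pd.DataFrame({"glucose": [77.0, np.nan, 70.0, np.nan]})

# 1 where glucose is missing, 0 otherwise.
# "glucose_missing" is a hypothetical name for the new feature.
toy["glucose_missing"] = toy["glucose"].isna().astype(int)
print(toy)
```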

*STEP 5:*
Missing values were filled with the column median using sklearn's SimpleImputer.
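
Median imputation with `SimpleImputer` might look roughly like the sketch below. The toy columns and values are illustrative, and wrapping the result back into a DataFrame is one possible choice rather than the project's confirmed approach.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data standing in for the numeric columns with gaps.
toy = pd.DataFrame({
    "glucose": [70.0, np.nan, 80.0, 100.0],
    "cigsPerDay": [0.0, 20.0, np.nan, 10.0],
})

# Replace each missing value with the median of its own column.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(imputed)
```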

*STEP 6:*
Columns with a float datatype that contained only whole numbers were identified and converted to int for memory efficiency and clarity.
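
One way to express this conversion is sketched below. The helper name `convert_whole_floats` is hypothetical, and the sketch assumes missing values have already been imputed, since a float column containing NaN cannot be cast to a plain integer dtype.

```python
import pandas as pd


def convert_whole_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Cast float columns to int64 when every value is a whole number."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        # Convert only float columns with no gaps and no fractional part.
        if (
            pd.api.types.is_float_dtype(s)
            and s.notna().all()
            and (s % 1 == 0).all()
        ):
            out[col] = s.astype("int64")
    return out


# Toy data: education holds whole numbers, sysBP does not.
toy = pd.DataFrame({
    "education": [1.0, 2.0, 4.0],
    "sysBP": [106.0, 121.5, 127.5],
})
converted = convert_whole_floats(toy)
print(converted.dtypes)
```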

*STEP 7:*
Outliers were identified in *twelve* columns and retained as they represented rare but valid cases.
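
The report does not state which detection rule was used. As an illustration only, the sketch below applies the common 1.5 × IQR rule; the function name `iqr_outlier_mask` and the toy values are assumptions.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Toy systolic blood pressure readings (illustrative values only).
toy = pd.Series([110, 120, 125, 130, 135, 140, 295], name="sysBP")
mask = iqr_outlier_mask(toy)

# Outliers are flagged for review but kept in the data.
print(toy[mask])
```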

*STEP 8:*
Cleaned dataset, ready for modeling, was exported using pandas.

*STEP 9:*
The data was split into subsets for model training and evaluation.
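
The split ratio and options are not given in this report. The sketch below assumes an 80/20 stratified split with scikit-learn's `train_test_split`; the `target` column name, the ratio, and the random seed are all placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 5 of them positive. "target" is a placeholder name.
toy = pd.DataFrame({
    "age": list(range(40, 60)),
    "target": [0] * 15 + [1] * 5,
})
X = toy.drop(columns="target")
y = toy["target"]

# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
```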

*STEP 10:*
- Columns with outliers representing a meaningful minority class were retained.
- _age_ and _education_ features were scaled using StandardScaler.
- Other numeric features with outliers were scaled using RobustScaler.
- Categorical features were encoded using OneHotEncoder to preserve nominal classification.
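
The scaling and encoding choices above could be combined in a single `ColumnTransformer`, as in the sketch below. The column groupings are assumptions for illustration: `sysBP` and `cigsPerDay` stand in for the features with outliers, and `sex` is assumed to be the raw categorical column behind `sex_M`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler

# Toy data standing in for the cleaned dataset (illustrative values only).
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "education": [4, 2, 1, 3],
    "sysBP": [106.0, 121.0, 127.5, 150.0],
    "cigsPerDay": [0, 0, 20, 30],
    "sex": ["M", "F", "M", "F"],
})

preprocessor = ColumnTransformer(
    [
        # Mean/variance scaling for age and education.
        ("standard", StandardScaler(), ["age", "education"]),
        # Median/IQR scaling for features with outliers.
        ("robust", RobustScaler(), ["sysBP", "cigsPerDay"]),
        # One indicator column per category.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

In a modeling workflow, the transformer would typically be fitted on the training subset only and then applied to the held-out data.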

## SUMMARY OF FINDINGS 

#### Key insights gained from analysis:

1. **Important Predictors of CHD:**
   * History of stroke played the most important role in predicting CHD risk.
   * Features such as _age_, _cigsPerDay_, _sysBP_, _diabetes_, _prevalentHyp_ and _sex_M_ were highly influential as well.
   * The indicator for missing glucose values was also useful in predicting CHD risk.

     
2. **Effectiveness of Data Cleaning and Imputation:**
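   * Median imputation filled the missing values without discarding rows. Because the median is less sensitive to outliers than the mean, it limits the distortion that imputation can introduce.
   * Outliers were examined in the numerical features and retained where they represented rare but valid cases.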

#### How they can be used to inform public health initiatives or personalized medical interventions for CHD prevention:
1. **Informing Public Health Initiatives:**

   * **Targeted Interventions:** Risk factors such as smoking rate, high blood pressure and diabetes prevalence point to targeted education and intervention campaigns, especially for males.
   * **Resource Allocation:** Public health resources can be concentrated on populations with key risk markers (e.g., elderly with elevated systolic BP).

2. **Personalized Medical Interventions:**

   * **Early Screening Programs:** Individuals with early risk factors can be flagged for regular cardiovascular monitoring.
   * **Tailored Lifestyle Plans:** Personalized recommendations can be generated from the model output (e.g., cholesterol management, smoking cessation, diet for diabetics).

## FUTURE DIRECTIONS

#### Ways in which the model can be improved in the future and additional features which may enhance its predictive power:
1. **Model Enhancements:**

   * **Incorporate Temporal Data:** Longitudinal health data (e.g., changes in BP, BMI over time) could improve predictions.
   * **Training with balanced classes or better handling of imbalanced classes:** Training on balanced classes helps keep the model from being biased toward the majority class. If the CHD class is imbalanced, resampling techniques or class-weighted models can improve sensitivity and reduce false negatives.


2. **Additional Predictive Features:**

   * **Family history of CHD**
   * **Physical activity levels**
   * **Socioeconomic status or geographic location**
   * **Dietary habits and alcohol consumption**
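
The class-weighting idea listed under Model Enhancements can be sketched as follows. The choice of `LogisticRegression`, the 85/15 class ratio, and the synthetic features are assumptions for illustration; the report does not specify the model or the actual class balance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced labels: 85 negatives, 15 positives (assumed ratio).
y = np.array([0] * 85 + [1] * 15)

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(dict(zip([0, 1], weights)))

# Synthetic features, shifted by the label so the classes are separable.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + y[:, None]

# The same weighting applied inside the model via class_weight.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, y)
```

Errors on the minority class are penalized more heavily during fitting, which tends to raise sensitivity at the cost of more false positives.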

#### How the model can be integrated into healthcare systems for early detection of CHD
  
1. **Electronic Health Record (EHR) Integration:**

   * The trained model can be embedded into EHR systems to run predictions during routine checkups or hospital visits.
   * Automatic flagging of high-risk patients enables real-time decision support for physicians.

2. **Mobile and Wearable Integration:**

   * With the rise of wearable health monitors, patient data (e.g., heart rate, activity) can feed directly into the model for continuous CHD risk assessment.

3. **Remote Risk Assessment Tools:**

   * The model can be deployed as a web or mobile app that lets users assess their CHD risk based on self-reported or synced health data.

4. **Clinical Decision Support:**

   * Doctors can use model outputs to justify further diagnostic testing or preventive therapies.
