# Appendix – Detailed Explanations

This appendix contains extended descriptions of the dataset choice, preprocessing decisions, feature engineering, model behavior, and evaluation results. These details are not required for the presentation but provide deeper context and are useful for answering questions.



## A1. Rationale for Choosing the Raw Merged Heart Dataset

The project initially used a different heart disease dataset, but that dataset was extremely clean:  
- almost no missing values,  
- uniform distributions,  
- minimal noise,  
- and preprocessing steps had little to no effect.

Because the dataset was so polished, it did not meaningfully demonstrate the impact of preprocessing, feature engineering, or model selection.  
To create a more realistic machine learning workflow, I switched to the **raw_merged_heart_dataset**, which includes natural irregularities found in real-world clinical data.

This dataset is more suitable because:
- Missing values appear in several important clinical variables.
- Numerical and categorical distributions are messy, and the classes are not linearly separable.
- Outliers and nonlinear relationships are present.
- Categorical variables play a strong predictive role.
- Preprocessing choices (e.g., imputation, encoding) genuinely affect performance.

Using this dataset made the comparisons between models, especially linear vs. tree-based, much more meaningful and closer to the challenges of applied data science.



## A2. Handling Missing Values: Why LightGBM Uses NaNs

Missing values were concentrated in *slope*, *ca*, and *thal* (8–13% missing in each). Dropping every row with a missing value would have discarded almost 40% of the dataset.
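The gap between per-column missingness and total rows lost can be checked directly. The sketch below is illustrative only: `missingness_report` is a hypothetical helper and the toy values are made up. It shows that when missing values fall in different rows, the per-column rates add up to a much larger share of rows containing at least one NaN.

```python
import numpy as np
import pandas as pd

def missingness_report(df: pd.DataFrame):
    """Return per-column missing fractions and the fraction of rows with any NaN."""
    per_column = df.isna().mean().sort_values(ascending=False)
    rows_lost = float(df.isna().any(axis=1).mean())
    return per_column, rows_lost

# Toy frame with made-up values, used only to show the mechanics.
toy = pd.DataFrame({
    "age":   [63, 54, 41, 67, 58],
    "slope": [1.0, np.nan, 2.0, 1.0, 2.0],
    "ca":    [0.0, 1.0, np.nan, 0.0, 2.0],
    "thal":  [2.0, 3.0, 2.0, np.nan, 3.0],
})

per_column, rows_lost = missingness_report(toy)
# Each of the three columns is 20% missing, but the NaNs sit in different rows,
# so dropping incomplete rows would remove 60% of this toy frame.
```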

**Baseline Models (Logistic Regression, Decision Tree)**

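Logistic regression cannot handle NaN values, and decision-tree implementations often cannot either (scikit-learn's trees only gained NaN support in version 1.3), so both baselines use: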
- numerical imputation (median),  
- categorical imputation (most frequent),  
- one-hot encoding for categories.

This is standard and ensures stable training.
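A minimal sketch of such a baseline pipeline, assuming scikit-learn is the library in use. The column lists, the `make_baseline_pipeline` helper, and the toy rows are hypothetical and would need to match the real dataset schema.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column split; adjust to the actual dataset.
NUMERIC_COLS = ["age", "chol", "oldpeak"]
CATEGORICAL_COLS = ["cp", "slope", "thal"]

def make_baseline_pipeline() -> Pipeline:
    """Median/most-frequent imputation plus one-hot encoding, then a linear model."""
    numeric = SimpleImputer(strategy="median")
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC_COLS),
        ("cat", categorical, CATEGORICAL_COLS),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])

# Made-up rows, only to show that NaNs are filled before the model sees them.
toy = pd.DataFrame({
    "age":     [63, 54, 41, 67, 58, 45],
    "chol":    [233.0, np.nan, 204.0, 286.0, 250.0, 199.0],
    "oldpeak": [2.3, 0.0, 1.4, np.nan, 0.8, 0.0],
    "cp":      [3, 1, 1, 0, 2, 1],
    "slope":   [0.0, np.nan, 2.0, 1.0, 2.0, 2.0],
    "thal":    [1.0, 2.0, np.nan, 2.0, 3.0, 2.0],
})
labels = [1, 0, 0, 1, 1, 0]

pipeline = make_baseline_pipeline().fit(toy, labels)
predictions = pipeline.predict(toy)
```

Keeping imputation inside the pipeline means the medians and modes are learned from the training fold only, which avoids leaking information from the test split.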

**LightGBM**

LightGBM handles missing values **natively**. During tree construction:
- rows with NaN are not given a branch of their own; at each candidate split they are tried on both sides,
- LightGBM then assigns the missing values to the left or right child, whichever yields the higher gain.
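The idea of a learned default direction can be illustrated with a simplified sketch. This is not LightGBM's actual implementation (which works on gradient and Hessian histograms); `split_gain` and `best_missing_direction` are hypothetical names, and the gain is a plain squared-error proxy on made-up residuals.

```python
import numpy as np

def split_gain(left: np.ndarray, right: np.ndarray) -> float:
    """Squared-error gain proxy: (sum of residuals)^2 / count, summed over children."""
    gain = 0.0
    for side in (left, right):
        if len(side) > 0:
            gain += side.sum() ** 2 / len(side)
    return gain

def best_missing_direction(feature, residuals, threshold):
    """Try sending NaN rows left, then right, and keep the higher-gain option."""
    missing = np.isnan(feature)
    goes_left = np.zeros(len(feature), dtype=bool)
    goes_left[~missing] = feature[~missing] <= threshold

    # Option 1: missing rows join the left child.
    gain_left = split_gain(residuals[goes_left | missing],
                           residuals[~goes_left & ~missing])
    # Option 2: missing rows join the right child.
    gain_right = split_gain(residuals[goes_left], residuals[~goes_left])

    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right

# Made-up example: the NaN rows behave like the high-value rows.
feature = np.array([1.0, 2.0, np.nan, 3.0, 4.0, np.nan])
residuals = np.array([-1.0, -1.0, 1.0, 1.0, 1.0, 1.0])
direction, gain = best_missing_direction(feature, residuals, threshold=2.5)
```

In this toy case the missing rows are routed to the right child, because grouping them with the similar high-value rows gives the larger gain.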

Because missingness itself can carry signal (e.g., missing exercise test results may relate to disease severity), keeping NaNs allows the model to use this information rather than hiding it through imputation.

For this reason, **the LightGBM preprocessing pipeline does not impute missing values** and keeps numeric columns unscaled, while categorical variables are ordinal-encoded.



## A3. Feature Engineering: Clinical Motivation and Model Behavior

Feature engineering was performed to explore whether domain-informed transformations could reveal additional signal. The engineered features include:

**Ratio Features**

- `chol_over_age`
- `restbps_over_age`

These normalize cholesterol and resting blood pressure by age, reflecting age-adjusted cardiovascular stress.

**Heart Rate Reserve Features**

- `maxhr_reserve = (220 - age) - thalachh`
- `percentage_maxhr = thalachh / (220 - age)`

These compare the maximum heart rate achieved (`thalachh`) with the age-predicted maximum (`220 - age`), as a proxy for exercise capacity and heart performance.

**Log-Transform**

- `oldpeak_log = log1p(oldpeak)`

This addresses heavy skew in the ST depression measure.

**Interaction Terms**

- `cp_exang = cp * exang`
- `slope_oldpeak = slope * oldpeak`

These capture clinically meaningful interactions between symptoms and exercise responses.
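The feature definitions above can be collected into one function. This is a sketch under assumptions: the raw column names `chol`, `trestbps`, `age`, `cp`, `exang`, `slope`, and `oldpeak` are guesses at the dataset schema (only `thalachh` appears in the formulas above), and `add_engineered_features` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the ratio, heart-rate, log, and interaction features."""
    out = df.copy()

    # Ratio features: age-adjusted cholesterol and resting blood pressure.
    out["chol_over_age"] = out["chol"] / out["age"]
    out["restbps_over_age"] = out["trestbps"] / out["age"]

    # Heart-rate features relative to the age-predicted maximum (220 - age).
    predicted_max_hr = 220 - out["age"]
    out["maxhr_reserve"] = predicted_max_hr - out["thalachh"]
    out["percentage_maxhr"] = out["thalachh"] / predicted_max_hr

    # Log transform; note that log1p is undefined for oldpeak <= -1.
    out["oldpeak_log"] = np.log1p(out["oldpeak"])

    # Interaction terms; NaN in either input propagates to the product.
    out["cp_exang"] = out["cp"] * out["exang"]
    out["slope_oldpeak"] = out["slope"] * out["oldpeak"]
    return out

# Made-up rows; the second has a missing slope.
toy = pd.DataFrame({
    "age": [60, 50], "chol": [240.0, 200.0], "trestbps": [120.0, 130.0],
    "thalachh": [128.0, 170.0], "oldpeak": [1.0, 0.0],
    "cp": [2, 0], "exang": [1, 0], "slope": [2.0, np.nan],
})
engineered = add_engineered_features(toy)
```

Because missing values propagate through the arithmetic, the engineered columns inherit the missingness of their inputs, which matters when NaNs are deliberately kept for LightGBM.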

**Outcome**

Feature engineering produced clinically meaningful variables, several of which appear high in LightGBM’s feature importance rankings.  
However, **model accuracy decreased slightly**, which suggests that:
- LightGBM already learns nonlinear interactions internally,
- the additional engineered features introduced redundancy and slight noise,
- the simplest feature representation (raw features only) generalizes best on this dataset.



## A4. Why Feature Engineering Reduced Performance for LightGBM

Although engineered features ranked highly in importance, the feature-engineered LightGBM model consistently performed **worse** than the raw-feature model, even after hyperparameter tuning.

This is a common outcome for boosted tree models, for several likely reasons:

1. **Redundancy**

LightGBM already constructs interaction-like splits. Ratios and interaction features duplicated patterns the model could infer independently.

2. **Increased Dimensionality**

More features increase the number of possible splits, which can:
- add noise,  
- cause mild overfitting,  
- reduce generalization.

3. **Lower Signal-to-Noise Ratio**

Some engineered features had weaker predictive power than the raw features they were derived from, so splits on them added little and slightly diluted the useful signal.

4. **LightGBM’s inherent flexibility**

Since LightGBM naturally captures nonlinear structure, manual feature engineering rarely provides large gains unless the engineered features encode domain relationships the model cannot infer automatically.

Thus, the **non-FE LightGBM model** was selected as the final model: it is simpler, cleaner, and performs best.
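One way to check that such a difference is consistent rather than an artifact of a single split is to score both feature sets on identical cross-validation folds. The sketch below is an assumption about how this could be done, not a record of how the comparison was actually run. `compare_feature_sets` is a hypothetical helper, the data is synthetic, and a small decision tree stands in for LightGBM so the snippet needs only scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def compare_feature_sets(estimator, X_raw, X_fe, y, seed=0):
    """Score raw and engineered feature sets on the same repeated stratified folds."""
    # A fixed integer seed makes both calls below use identical splits.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=seed)
    raw_scores = cross_val_score(estimator, X_raw, y, cv=cv, scoring="accuracy")
    fe_scores = cross_val_score(estimator, X_fe, y, cv=cv, scoring="accuracy")
    return raw_scores, fe_scores, fe_scores - raw_scores

# Synthetic stand-in data; the extra column mimics an engineered interaction term.
X_raw, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_fe = np.hstack([X_raw, X_raw[:, [0]] * X_raw[:, [1]]])

model = DecisionTreeClassifier(max_depth=3, random_state=0)  # stand-in for LightGBM
raw_scores, fe_scores, paired_diff = compare_feature_sets(model, X_raw, X_fe, y)
```

Looking at the per-fold differences in `paired_diff`, rather than two single accuracy numbers, shows whether one feature set wins on most folds or only on average.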



## A5. Limitations of the Current Approach

Despite strong results, several limitations remain:

1. **Dataset Size**

With roughly 2,000 rows, measured performance can vary noticeably with the train-test split. The model could benefit from more data, especially more positive cases.

2. **Single Dataset**

The model is evaluated on one dataset from one population.  
Generalization to other hospitals, demographics, or data collection protocols is not guaranteed.

3. **Missingness Patterns Not Modeled Explicitly**

Although LightGBM handles NaNs well, the underlying *reason* for missingness (e.g., clinical workflow) was not analyzed.

4. **Limited Model Family**

The project focused on a small number of models: logistic regression, decision tree, and LightGBM.  
Other competitive tabular models (CatBoost, XGBoost, Random Forests) were not tested.

5. **No Probability Calibration**

For medical use, calibrated probabilities can matter more than classification accuracy.  
This was outside the scope of the project.

6. **Interpretability Constraints**

While feature importance was examined, deeper interpretability methods such as SHAP values would help clinicians understand risk factors more clearly.
