# Predictive modeling for heart disease classification using Naive Bayes algorithm

---

## Project Information

**Authors**: 
- Vladyslav Lysenko
- Parmida Mashadi Assadollahi
- Sankruththian Senathirajah
  
**Course**: SCS 3251-071 Statistics for Data Science  
**Instructor**: Sergiy Nokhrin  
**Institution**: University of Toronto School of Continuing Studies  
**Submission Date**: November 2025  
**Purpose**:  


![University Logo](https://learn.utoronto.ca/themes/custom/de_theme/logo.svg)


---

## Table of Contents

1. [Project Information](#Project-Information)
2. [Introduction](#Introduction)
3. [Material and Methods](#Material-and-Methods)
4. [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis)
    - 4.1 [Univariate Analysis](#Univariate-Analysis)
    - 4.2 [Multivariate Analysis](#Multivariate-Analysis)
5. [Hypothesis Testing](#Hypothesis-Testing)
6. [Modeling](#Modeling)
7. [Evaluation](#Evaluation)
8. [Key Findings](#Key-Findings)
9. [Limitations](#Limitations)
10. [Future Work](#Future-Work)
11. [Conclusions](#Conclusions)
12. [Appendix](#Appendix)

---
## Introduction

Cardiovascular disease (CVD) is one of the best-known and deadliest diseases in the world, claiming millions of lives each year <sup>[[1]](#C.-References)</sup>. Early detection plays a critical role in improving outcomes, as timely intervention can significantly reduce the risk of severe complications or death. However, the diagnostic process is often complex, requiring multiple clinical tests, specialized expertise, and considerable time and resources.

Machine learning (ML) offers a complementary approach to traditional diagnostics. By analyzing patterns in patient data, ML models can provide rapid, low-cost, and consistent assessments that support clinicians in identifying individuals at risk. While not a replacement for medical evaluation, ML can serve as an effective tool for preliminary screening and decision support, helping streamline the diagnostic workflow<sup>[[2]](#C.-References)</sup>.

### Objective

The objective of applying machine learning in our project is to explore whether simple statistical models can detect patterns associated with heart disease earlier or more reliably than manual inspection alone. By modeling how clinical features interact, we aim to produce a transparent risk-prediction tool that could support clinicians in screening and decision-making.

In this project we aim to obtain an ML model that can predict heart disease using a probabilistic Gaussian Naive Bayes classifier. To accomplish that, we first perform exploratory data analysis to characterize the distribution of key variables and identify potential risk factors. We then preprocess the data (handling missing values, encoding categorical features, and scaling numerical variables where appropriate) and derive a binary outcome reflecting the presence or absence of heart disease. The model is trained and evaluated on separate train-test splits, and its performance is assessed using standard classification metrics (accuracy, precision, recall, F1-score, and ROC-AUC). Finally, we interpret the learned conditional distributions and feature effects to understand which clinical variables contribute most to predicted risk, and we discuss the potential usefulness and limitations of such a simple, interpretable model in a real clinical screening setting.

---
## Material and Methods

This project focuses on applying a simple, interpretable machine learning approach to the problem of heart disease prediction by using the Gaussian Naive Bayes algorithm as the primary classifier. The method was chosen for its efficiency, transparency, and strong performance on smaller datasets, making it well suited for an initial evaluation of predictive potential in medical data. Basic preprocessing steps were carried out to ensure clean and consistent input, including encoding categorical variables, standardizing selected numerical features, and splitting the data into training and testing sets. The Cleveland dataset used in this study was obtained from the UC Irvine Machine Learning Repository.

### Dataset Overview

The Cleveland heart disease dataset, available from the UC Irvine Machine Learning Repository, is commonly used for heart disease prediction with supervised machine learning.

The Cleveland dataset was collected by the Cleveland Clinic Foundation in 1988 for use in health research. The original dataset records 76 features for 303 subjects. However, most researchers use only 14 of these features, including the target class. These features include age, sex, blood pressure, cholesterol, blood sugar, and other health metrics.

The original Cleveland dataset has five class labels, encoded as integers from zero (no presence) to four. Experiments with the dataset have concentrated on distinguishing presence (values 1, 2, 3, 4) from absence (value 0). However, the classes are unevenly populated: values 0, 1, 2, 3, and 4 have 164, 55, 36, 35, and 13 samples, respectively. Researchers therefore suggest reducing the five classes to two: 0 = no disease and 1 = disease. The target feature refers to the presence of heart disease in the subject. Table 1 shows the features included in the Cleveland heart disease dataset.

| Order | Feature   | Description                                                                                          | Feature Value Range                                                                                                                             |
|-------|-----------|------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| 1     | Age       | Age in years                                                                                         | 29 to 77                                                                                                                                        |
| 2     | Sex       | Gender                                                                                               | 1 = male, 0 = female                                                                                                                            |
| 3     | Cp        | Chest pain type                                                                                       | 0 = typical angina; 1 = atypical angina; 2 = non-anginal pain; 3 = asymptomatic                                                                 |
| 4     | Trestbps  | Resting blood pressure (mm Hg on hospital admission)                                                  | 94 to 200                                                                                                                                       |
| 5     | Chol      | Serum cholesterol (mg/dL)                                                                             | 126 to 564                                                                                                                                      |
| 6     | Fbs       | Fasting blood sugar > 120 mg/dL                                                                       | 1 = true, 0 = false                                                                                                                             |
| 7     | Restecg   | Resting electrocardiographic results                                                                  | 0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy (Estes’ criteria)                                                       |
| 8     | Thalach   | Maximum heart rate achieved                                                                           | 71 to 202                                                                                                                                       |
| 9     | Exang     | Exercise-induced angina                                                                               | 1 = yes, 0 = no                                                                                                                                 |
| 10    | Oldpeak   | ST depression induced by exercise relative to rest                                                    | 0 to 6.2                                                                                                                                        |
| 11    | Slope     | Slope of the peak exercise ST segment                                                                 | 0 = upsloping; 1 = flat; 2 = downsloping                                                                                                        |
| 12    | Ca        | Number of major vessels colored by fluoroscopy                                                        | 0 to 3                                                                                                                                          |
| 13    | Thal      | Thallium heart rate test result                                                                       | 0 = normal; 1 = fixed defect; 2 = reversible defect                                                                                             |
| 14    | Target    | Diagnosis of heart disease                                                                            | 0 = no disease; 1 = disease                                                                                                                     |


<center><i>Table 1: List of features in the Cleveland heart disease dataset.</i></center>

In the original dataset, a total of 6 samples have null values: 4 in the “Ca (Number of Major Vessels)” feature and 2 in the “Thal (Thallium Heart Rate)” feature. Since null values are very few, these samples were removed from the dataset. The original dataset contains 303 samples; after removal, the dataset used for modeling contains 297 samples, of which 137 belong to the disease class (1) and 160 to the no disease class (0). Histograms of all features in the Cleveland heart disease dataset are shown in the [univariate analysis](#Univariate-Analysis).
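The cleaning and label-collapsing steps described above can be sketched in pandas as follows. This is a minimal illustration on a tiny made-up frame (the values are not taken from the Cleveland data), and the column names `ca`, `thal`, and `target` are assumed to match Table 1.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data (illustrative values only).
raw = pd.DataFrame({
    "age":    [63, 67, 41, 56, 52],
    "ca":     [0.0, 3.0, np.nan, 1.0, 0.0],
    "thal":   [6.0, 3.0, 3.0, np.nan, 7.0],
    "target": [0, 2, 0, 1, 4],
})

# Drop the few rows with a missing `ca` or `thal` value.
clean = raw.dropna(subset=["ca", "thal"]).copy()

# Collapse the five-level diagnosis (0-4) into absence (0) vs. presence (1).
clean["target"] = (clean["target"] > 0).astype(int)

print(clean["target"].value_counts().to_dict())
```

Dropping rows is reasonable here only because so few values are missing; with heavier missingness, imputation would be worth considering instead.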

### Machine Learning Algorithms

- #### Gaussian Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable<sup>[[3]](#C.-References)</sup>.

Naive conditional independence assumption: $$P(x_i | y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i | y)$$

Naive Bayes classification rule: $$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

The different naive Bayes classifiers differ mainly in the assumptions they make about the distribution of $P(x_i \mid y)$. As the name suggests, Gaussian Naive Bayes assumes that each feature is normally distributed within each class, so the likelihood of the features is Gaussian: $$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$ where $\mu_y$ and $\sigma^2_y$ are the mean and variance of feature $x_i$ within class $y$.

**Strengths**:
- Extremely fast to train
- Works well even with small datasets
- Easy to interpret (means and variances reveal feature influence)

**Limitations**:
- Independence assumption is rarely true
- Probability outputs are often not well-calibrated
- Performance may lag behind more flexible models

Despite these limitations, Naive Bayes often performs strongly for classification tasks, and it serves as a clear, interpretable baseline for health data.
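To make the classification rule concrete, the following is a minimal from-scratch sketch in NumPy: it estimates per-class priors, means, and variances, then picks the class with the largest log-posterior. It is an illustration only, not the implementation used in this project (which relies on scikit-learn); the function names, the toy data, and the small variance floor are assumptions of the sketch.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # A tiny floor keeps variances strictly positive (assumption of this sketch).
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class maximizing log P(y) + sum_i log P(x_i | y)."""
    # Broadcast to shape (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    log_post = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(log_post, axis=1)]

# Two well-separated toy classes (illustrative values only).
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(X_toy, y_toy)
pred = predict_gaussian_nb(np.array([[1.1, 2.1], [4.1, 4.9]]), *params)
print(pred)
```

Working in log space turns the product over features into a sum, which avoids numerical underflow when many small likelihoods are multiplied.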

- #### Logistic Regression

Logistic regression is a classification algorithm that models the probability of a binary outcome<sup>[[4]](#C.-References)</sup>. Instead of predicting $y$ directly, it estimates: $$\hat{p} = P(y=1 \mid \mathbf{x}) = \sigma\!\left(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\right)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function. The model parameters $\beta$ are learned by maximizing the log-likelihood (or equivalently minimizing cross-entropy loss): $$L = -\sum_{j=1}^{m} \left[ y_j \log(\hat{p}_j) + (1 - y_j)\log(1 - \hat{p}_j) \right]$$
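As a small numerical illustration of the two formulas above, the sketch below evaluates the sigmoid and the cross-entropy loss for a handful of points. The coefficients and data are hypothetical values chosen only to show the computation; they are not fitted to the Cleveland dataset.

```python
import numpy as np

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p_hat):
    """Negative log-likelihood of binary labels under predicted probabilities."""
    return -np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Hypothetical intercept and coefficients (not estimated from real data).
beta0 = -1.0
beta = np.array([2.0, -0.5])

X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 1.0]])
y = np.array([1, 0, 1])

p_hat = sigmoid(beta0 + X @ beta)   # predicted P(y = 1 | x) per row
loss = cross_entropy(y, p_hat)

print(np.round(p_hat, 3), round(float(loss), 4))
```

Training amounts to searching for the $\beta$ that makes this loss as small as possible on the training set.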


In this project, logistic regression serves as a baseline classifier to compare against the main algorithm, Naive Bayes. Its simplicity and interpretability make it a standard reference for tasks like predicting heart disease presence in the Cleveland dataset, where outputs are binary.

Logistic regression maps its output to probabilities through the nonlinear sigmoid, but the underlying score is still a linear combination of the features, so the model may not fully capture complex feature interactions. Comparing its accuracy with Naive Bayes helps show whether probabilistic modeling of feature distributions provides superior predictive performance for this medical dataset.

---
## Exploratory Data Analysis

The dataset comprises 303 subjects and a set of clinical variables commonly associated with cardiovascular risk. The cohort is predominantly middle-aged to older and male. Resting systolic blood pressure and total cholesterol are moderately elevated on average (131.7 mmHg and 246.7 mg/dL, respectively), while maximum heart rate (thalach) and ST depression (oldpeak) show substantial variability, indicating heterogeneous cardiovascular fitness and ischemic burden. 

The outcome variable (target) takes values 0-4 and is recoded into a binary indicator distinguishing healthy from diseased subjects. In the following [univariate analysis](#Univariate-Analysis), we first inspect the overall distribution of this outcome, and then examine how disease prevalence varies across key risk factors, focusing on sex and age groups as illustrated in the subsequent plots.

| Feature   | Count |   Mean    |    Std    | Min | 25%  | 50%  | 75%  | Max |
|-----------|-------|-----------|-----------|-----|------|------|------|-----|
| age       |   303 |   54.4389 |   9.03866 |  29 |   48 |   56 |   61 |  77 |
| sex       |   303 |    0.6799 |   0.4673  |   0 |    0 |    1 |    1 |   1 |
| cp        |   303 |    3.1584 |   0.9601  |   1 |    3 |    3 |    4 |   4 |
| trestbps  |   303 |  131.6900 |  17.5997  |  94 |  120 |  130 |  140 | 200 |
| chol      |   303 |  246.6930 |  51.7769  | 126 |  211 |  241 |  275 | 564 |
| fbs       |   303 |    0.1485 |   0.3562  |   0 |    0 |    0 |    0 |   1 |
| restecg   |   303 |    0.9901 |   0.9950  |   0 |    0 |    1 |    2 |   2 |
| thalach   |   303 |  149.6070 |  22.8750  |  71 | 133.5|  153 |  166 | 202 |
| exang     |   303 |    0.3267 |   0.4698  |   0 |    0 |    0 |    1 |   1 |
| oldpeak   |   303 |    1.0396 |   1.1611  |   0 |    0 |  0.8 |  1.6 | 6.2 |
| slope     |   303 |    1.6007 |   0.6162  |   1 |    1 |    2 |    2 |   3 |
| ca        |   299 |    0.6722 |   0.9374  |   0 |    0 |    0 |    1 |   3 |
| thal      |   301 |    4.7342 |   1.9397  |   3 |    3 |    3 |    7 |   7 |
| target    |   303 |    0.9373 |   1.2285  |   0 |    0 |    0 |    2 |   4 |


Most categorical predictors (cp, fbs, restecg, exang, slope, ca, thal) cover multiple categories, with only minimal missingness in ca and thal.


| #  | Column    | Non-Null Count | Dtype    |
|----|-----------|----------------|----------|
| 0  | age       | 303            | int64    |
| 1  | sex       | 303            | int64    |
| 2  | cp        | 303            | int64    |
| 3  | trestbps  | 303            | int64    |
| 4  | chol      | 303            | int64    |
| 5  | fbs       | 303            | int64    |
| 6  | restecg   | 303            | int64    |
| 7  | thalach   | 303            | int64    |
| 8  | exang     | 303            | int64    |
| 9  | oldpeak   | 303            | float64  |
| 10 | slope     | 303            | int64    |
| 11 | ca        | 299            | float64  |
| 12 | thal      | 301            | float64  |
| 13 | target    | 303            | int64    |


### Univariate Analysis

The study cohort consisted of 297 individuals, with a slightly higher proportion classified as healthy (160 cases, 53.9%) compared with those with disease (137 cases, 46.1%). Thus, the sample is nearly evenly split between conditions, providing a balanced basis for comparing outcomes between healthy and diseased groups.

<p align="center">
  <img src="artifacts/proportion-of-outcomes.png" alt="Proportion of outcomes" width="1000"/>
</p>

<p align="center"><em>Figure 1: Proportion of outcomes</em></p>

Stratification by gender showed a marked imbalance in disease burden. Among males, the number of disease cases exceeded healthy cases (approximately 110 vs. 90), indicating that disease is the predominant condition in this subgroup. In contrast, females were more often healthy than diseased (around 70 vs. 25 cases), suggesting a lower disease prevalence in women compared with men.

<p align="center">
  <img src="artifacts/proportion-of-disease-by-gender.png" alt="Proportion of disease by gender" width="700"/>
</p>

<p align="center"><em>Figure 2: Proportion of disease by gender</em></p>

Age-specific analysis indicates that disease is rare in the youngest participants (20-30 years) and remains less frequent than healthy status up to 50 years. From 50-60 years onward, the pattern reverses: disease cases slightly exceed healthy cases in the 50-60 group and remain relatively high in the 60-70 group, suggesting a shift in risk around midlife. In the oldest group (70-80 years) both healthy and diseased counts are low, indicating fewer individuals in this age range rather than a clear difference in disease prevalence.

<p align="center">
  <img src="artifacts/disease-occurences-by-age-group.png" alt="Disease occurrences by age group" width="700"/>
</p>

<p align="center"><em>Figure 3: Disease occurrences by age group</em></p>

In the univariate stage, we next examined the marginal distributions of all predictors (Figure 4). Continuous variables such as age, resting blood pressure (trestbps), cholesterol (chol), and maximum heart rate (thalach) are approximately bell-shaped, centered around mid-50s for age, ~130 mmHg for trestbps, ~250 mg/dL for chol, and ~150 bpm for thalach, with a few high-value outliers for chol and oldpeak. ST depression at peak exercise (oldpeak) and the number of major vessels (ca) are clearly right-skewed, with many patients showing no or minimal abnormalities and a small subset with markedly elevated values.

<p align="center">
  <img src="artifacts/distributions.png" alt="Distributions of predictors" width="1200"/>
</p>

<p align="center"><em>Figure 4: Distributions of predictors</em></p>

To further characterize the continuous variables, we inspected them for outliers using boxplots. Oldpeak shows a pronounced right tail, with several observations above 4-6 units, consistent with the strong right-skew seen in its histogram. For age and resting blood pressure, the spread is relatively compact with only a few high values, whereas cholesterol exhibits multiple high outliers, including one very extreme measurement. Maximum heart rate (thalach) also contains a small number of unusually low or high values compared with the bulk of the data.

All of these extreme values remain within clinically plausible ranges and likely reflect true high-risk patients rather than data errors. Therefore, they were retained in the dataset, but their presence is important to keep in mind when fitting models and interpreting coefficients, particularly for chol, oldpeak, and thalach.

<p align="center">
  <img src="artifacts/outliers.png" alt="Outliers" width="900"/>
</p>

<p align="center"><em>Figure 5: Outliers</em></p>

<p align="center">
  <img src="artifacts/oldpeak.png" alt="Oldpeak skewness" width="900"/>
</p>

<p align="center"><em>Figure 6: Oldpeak skewness</em></p>
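The boxplot-based outlier inspection described above rests on the usual 1.5 × IQR whisker rule, which can be sketched numerically as follows. The cholesterol-like values are made up for illustration, and the 1.5 multiplier is the conventional boxplot default rather than a setting confirmed by the source.

```python
import numpy as np

# Illustrative cholesterol-like values (mg/dL); not the actual dataset.
chol = np.array([199, 204, 211, 233, 236, 241, 263, 275, 286, 564])

# Quartiles and interquartile range.
q1, q3 = np.percentile(chol, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as boxplot outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = chol[(chol < lower_fence) | (chol > upper_fence)]

print(lower_fence, upper_fence, outliers)
```

Flagging a value this way only marks it as unusual relative to the bulk of the data; whether it is an error or a genuine high-risk measurement remains a clinical judgment.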

### Multivariate Analysis

In the multivariate step, we examined the correlation structure among all variables (Figure 7). Overall, correlations between predictors are moderate, suggesting limited multicollinearity. Exercise-related variables form a clear cluster: maximum heart rate (thalach) is inversely correlated with age, while ST depression (oldpeak), ST slope (slope), and exercise-induced angina (exang) are positively interrelated, reflecting a common ischemia/effort tolerance dimension.

For the subsequent hypothesis testing, we focus in particular on age, the number of affected vessels (ca), and the disease outcome (target). Age shows a moderate positive correlation with ca (r ≈ 0.36), indicating that older patients tend to have more obstructed vessels. Both variables are also positively correlated with target (r ≈ 0.23 for age-target and r ≈ 0.46 for ca-target), with ca displaying one of the strongest associations with disease in the dataset. This pattern suggests that structural coronary damage (captured by ca) increases with age and is closely linked to the presence of disease. In the next section, we formally test these relationships by comparing age distributions across outcome groups and assessing the association between ca and target using a chi-square test of independence.

<p align="center">
  <img src="artifacts/corr.png" alt="Correlation between variables" width="1200"/>
</p>

<p align="center"><em>Figure 7: Correlation between variables</em></p>

The contingency heatmap of heart disease status by number of major vessels (ca) shows a clear monotonic pattern. Among patients with no visible major vessel involvement (ca = 0), the majority are disease-free (129 without vs. 45 with disease). As the number of affected vessels increases, this relationship reverses: for ca = 1, diseased patients already outnumber non-diseased patients by about two to one (44 vs. 21), while for ca = 2 and ca = 3, diseased patients clearly dominate (31 vs. 7 and 17 vs. 3, respectively).

These frequencies suggest that higher vessel involvement is strongly associated with the presence of heart disease, consistent with the relatively high positive correlation between ca and the target variable observed in the correlation matrix. In the following hypothesis testing section, we formally assess this association using a test for independence between ca and disease outcome.

<p align="center">
  <img src="artifacts/major-vessels-heart-disease.png" alt="Observed implication of major vessels visibility on heart disease" width="600"/>
</p>

<p align="center"><em>Figure 8: Observed implication of major vessels visibility on heart disease</em></p>

---
## Hypothesis Testing

Motivated by the strong positive correlation between the number of major vessels (ca) and disease status, and the clear pattern in the contingency heatmap, we formally tested whether ca and heart disease are statistically associated.

Both variables are categorical (ca: 0-3 vessels; disease: present/absent), so we use a chi-square test of independence on their contingency table.
- Null hypothesis $H_0$: Number of major vessels on imaging and heart disease outcome are independent.
- Alternative hypothesis $H_A$: Number of major vessels on imaging and heart disease outcome are NOT independent (are associated).

Observed contingency table:

| Number of major vessels (ca) | No heart disease | Heart disease | Row total |
| ---------------------------- | ---------------- | ------------- | --------- |
| 0                            | 129              | 45            | 174       |
| 1                            | 21               | 44            | 65        |
| 2                            | 7                | 31            | 38        |
| 3                            | 3                | 17            | 20        |
| **Column total**             | **160**          | **137**       | **297**   |

<center><i>Table 2: Contingency table of observed values.</i></center>

Expected counts are computed from row and column totals as $$E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{N}$$
where $N = 297$.

For example, for $ca = 0$ and “No heart disease”: $$E_{11} = \frac{174 \times 160}{297} \approx 93.74$$

The full table of expected frequencies is

| Number of major vessels (ca) | No heart disease (E) | Heart disease (E) | Row total |
| ---------------------------- | -------------------- | ----------------- | --------- |
| 0                            | 93.74                | 80.26             | 174       |
| 1                            | 35.02                | 29.98             | 65        |
| 2                            | 20.47                | 17.53             | 38        |
| 3                            | 10.77                | 9.23              | 20        |
| **Column total**             | **160.00**           | **137.00**        | **297**   |

<center><i>Table 3: Contingency table of expected values.</i></center>
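The expected counts in Table 3 can be reproduced from the observed counts in Table 2 with an outer product of the row and column totals. The snippet below is a minimal NumPy sketch of that arithmetic; the variable names are illustrative.

```python
import numpy as np

# Observed counts from Table 2: rows are ca = 0..3,
# columns are (no heart disease, heart disease).
observed = np.array([[129, 45],
                     [21, 44],
                     [7, 31],
                     [3, 17]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# E_ij = (row total_i * column total_j) / N
expected = np.outer(row_totals, col_totals) / n

print(np.round(expected, 2))
```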

The chi-square test statistic is $$\chi^2 = \sum_{i=1}^4\sum_{j=1}^2{\frac{(O_{ij} - E_{ij})^2}{E_{ij}}}$$
where $O_{ij}$ and $E_{ij}$ are the observed and expected counts.

For example, the contribution from the cell ($ca = 0$, no disease) is $$\frac{(129 - 93.74)^2}{93.74} \approx 13.27$$

Summing the contributions from all 8 cells gives $$\chi^2 \approx 72.30$$

The degrees of freedom are $$df = (r - 1)(c - 1) = (4 - 1)(2 - 1) = 3$$

and the corresponding p-value is $$p \approx 1.37 \times 10^{-15}$$
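The hand calculation above can be cross-checked with SciPy's `chi2_contingency`, which returns the statistic, p-value, degrees of freedom, and expected counts in one call. This is a verification sketch rather than the code used in the project; for tables with more than one degree of freedom, no continuity correction is applied.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from Table 2: rows are ca = 0..3,
# columns are (no heart disease, heart disease).
observed = np.array([[129, 45],
                     [21, 44],
                     [7, 31],
                     [3, 17]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2_stat:.2f}, df = {dof}, p = {p_value:.2e}")
```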

#### Interpretation

With $\chi^2 \approx 72.30, df = 3$, and an extremely small p-value, we reject $H_0$ at any conventional significance level. There is very strong evidence that the number of major vessels (ca) and heart disease status are not independent: patients with more affected vessels are much more likely to have heart disease. This result supports the earlier descriptive findings and justifies using ca as a key predictor in subsequent modeling.

---
## Modeling

For the modeling stage, we first separated predictors and outcome by defining the feature matrix $X$ as all variables except target, and the response vector $y$ as the binary heart disease indicator. The data was then split into training and test sets using a stratified train-test split with 70% of observations used for training and 30% for testing. Stratification on $y$ ensures that the proportion of diseased and non-diseased patients is preserved in both subsets, which is important for obtaining an unbiased estimate of model performance.
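A stratified 70/30 split of this kind can be sketched with scikit-learn's `train_test_split`. The features below are synthetic placeholders, the class counts mirror those reported for the cleaned dataset, and `random_state=42` is an assumed seed rather than the value used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix (297 rows, 5 made-up columns).
X = rng.normal(size=(297, 5))
# Binary outcome with the reported class counts: 160 healthy, 137 diseased.
y = np.array([0] * 160 + [1] * 137)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,     # 30% held out for testing
    stratify=y,         # preserve the class ratio in both subsets
    random_state=42,    # assumed seed, for reproducibility only
)

print(len(y_train), len(y_test))
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Comparing the two printed class proportions is a quick check that stratification kept the disease prevalence similar in the training and test subsets.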

Next, we constructed a preprocessing pipeline based on the variable types. Numerical predictors were identified using their integer/float data types, and categorical predictors were identified using `category` and boolean types. A `ColumnTransformer` was then used to apply different transformations to these two groups:
- **Numerical features** were scaled with `RobustScaler`, which centers variables using the median and rescales them based on the interquartile range. This approach reduces the influence of extreme values and skewed distributions on the model, making the inputs more stable for algorithms that estimate class-conditional densities, such as Gaussian Naive Bayes.
- **Categorical features** were encoded using `OneHotEncoder` with `sparse_output=False` to produce a dense design matrix. This is necessary because the Gaussian Naive Bayes implementation in scikit-learn expects dense input and cannot directly work with sparse matrices.

All remaining columns were dropped (`remainder='drop'`), so only the explicitly specified numerical and categorical predictors enter the model. This preprocessing pipeline is later combined with the classifier in a single workflow, ensuring that the same transformations are consistently applied during both training and evaluation.

> For the complete preprocessing step, see the [implementation notebook](https://github.com/vregi/heart-disease-classification/blob/main/implementation.ipynb).

For the classification task, we compared two simple but commonly used baseline models: Logistic Regression and Gaussian Naive Bayes. Logistic Regression was configured with an increased maximum number of iterations (`max_iter=1000`) to ensure convergence, while Gaussian Naive Bayes was used with its default probabilistic formulation. Each model was wrapped in a `Pipeline` together with the previously defined preprocessing step, so that scaling and encoding are applied consistently during training and testing.

Each model was trained on the same stratified training set $(X_{train}, y_{train})$, and evaluated on the held-out test set $(X_{test}, y_{test})$. After fitting, we obtained class predictions $\hat{y}$ using `predict` and estimated class probabilities using `predict_proba`, from which we extracted the probability of heart disease (positive class). Model performance was quantified using several standard metrics: overall accuracy on the test set, the **ROC-AUC** computed from the predicted probabilities (to assess discrimination across all possible thresholds), and the **confusion matrix**, which summarizes true positives, true negatives, false positives, and false negatives. For each model, all relevant objects (fitted pipeline, predictions, scores, confusion matrix, and metrics) were stored for subsequent comparison and visualization in the results section.
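The evaluation metrics named above can be computed with `sklearn.metrics` as in the sketch below. The labels and probabilities are hypothetical stand-ins for `y_test` and the positive-class column of `predict_proba`, and the 0.5 threshold is the usual default assumed here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical test labels and predicted probabilities of heart disease.
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p_disease = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.20, 0.90, 0.70])

# Class predictions at the default 0.5 threshold.
y_pred = (p_disease >= 0.5).astype(int)

acc = accuracy_score(y_test, y_pred)
# ROC-AUC uses the probabilities, so it does not depend on the threshold.
auc = roc_auc_score(y_test, p_disease)
# For binary labels, ravel() yields (TN, FP, FN, TP).
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print(acc, auc, (tn, fp, fn, tp))
```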

> For the complete modeling step, see the [implementation notebook](https://github.com/vregi/heart-disease-classification/blob/main/implementation.ipynb).

---
## Evaluation

After training both logistic regression and Gaussian Naive Bayes on the processed dataset, we assessed their performance on a held-out test set using threshold-dependent and threshold-independent metrics. In particular, we report overall accuracy and ROC-AUC to quantify global discrimination, and precision, recall, and F1-scores separately for the disease and no-disease classes to capture the trade-off between correctly detecting heart disease and avoiding false alarms. 

This section summarizes and compares these results and discusses their practical implications in a screening context.

### Metrics

On the held-out test set, both models achieved strong predictive performance. 

The Gaussian Naive Bayes classifier reached an accuracy of about **0.86** with a ROC-AUC of about **0.95**, indicating very good separation between diseased and non-diseased patients; it showed high precision for the disease class (0.91) and particularly strong recall for the no-disease class (0.94), meaning it correctly identified most healthy individuals but still missed a fraction of true positives (disease recall of 0.76).

The logistic regression model obtained a slightly lower overall accuracy (around **0.83**) and a ROC-AUC of about **0.93**, with high recall for the no-disease class (0.90) and good precision for the disease class (0.86), reflecting reliable positive predictions but again some missed cases (disease recall of 0.76). Overall, both approaches deliver clinically useful discrimination on this dataset, with a trade-off between correctly ruling out healthy patients and capturing all true disease cases.

| Model                | Overall |  | No disease |  |  | Disease |  |  |
|----------------------|:-------:|:-------:|:----------:|:----------:|:----------:|:-------:|:-------:|:-------:|
|                      | Accuracy| ROC-AUC | Precision  |   Recall   |    F1      | Precision|  Recall |   F1    |
| Logistic Regression  | 0.8333  | 0.9325  |   0.81     |   0.90     |   0.85     |  0.86   |  0.76   |  0.81   |
| Gaussian Naive Bayes | 0.8556  | 0.9469  |   0.82     |   0.94     |   0.87     |  0.91   |  0.76   |  0.83   |

<center><i>Table 4: Metrics table</i></center>

### Confusion Matrices

The confusion matrices highlight how the two models trade off different types of errors. 

For logistic regression, 43 of 48 no-disease cases are correctly classified (5 false positives), and 32 of 42 disease cases are correctly identified (10 false negatives). 

Gaussian Naive Bayes slightly improves performance on the no-disease class, correctly classifying 45 of 48 healthy patients (3 false positives), while keeping the same number of true positives for disease (32) and thus the same number of false negatives (10). 

Overall, Naive Bayes is slightly more conservative in predicting disease: it raises fewer false alarms (3 vs. 5) while leaving sensitivity for true disease cases unchanged (32 of 42, about 0.76, for both models).

<p align="center">
  <img src="artifacts/confusion_matrices.png" alt="Confusion Matrices" width="1200"/>
</p>

<p align="center"><em>Figure 9: Confusion Matrices</em></p>
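
The values in Table 4 can be re-derived from the counts quoted above. The following is a minimal sketch in plain Python, not the notebook's implementation; the helper name `binary_metrics` is hypothetical, and the only inputs are the confusion-matrix counts stated in the text.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive accuracy and per-class precision/recall/F1 from confusion-matrix counts."""
    def prf(hit, false_alarm, miss):
        precision = hit / (hit + false_alarm)
        recall = hit / (hit + miss)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        # "Disease" is the positive class: hits are TP, false alarms FP, misses FN.
        "disease": prf(tp, fp, fn),
        # For "no disease" the roles swap: hits are TN, false alarms FN, misses FP.
        "no_disease": prf(tn, fn, fp),
    }

# Counts quoted in the text (90 test patients: 48 without and 42 with disease).
counts = {
    "Logistic Regression": dict(tn=43, fp=5, fn=10, tp=32),
    "Gaussian Naive Bayes": dict(tn=45, fp=3, fn=10, tp=32),
}

results = {name: binary_metrics(**c) for name, c in counts.items()}
for name, m in results.items():
    print(f"{name}: accuracy={m['accuracy']:.4f}")
    for label in ("no_disease", "disease"):
        p, r, f1 = m[label]
        print(f"  {label:<10} precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Running this reproduces the accuracy, precision, recall, and F1 entries of Table 4 for both models; ROC-AUC cannot be recovered this way because it depends on the predicted probabilities rather than on a single threshold.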

### ROC Curves

The ROC curves show that both models achieve strong discrimination between patients with and without heart disease across a wide range of decision thresholds. 

The curves for logistic regression and Gaussian Naive Bayes lie well above the diagonal, with areas under the curve of approximately **0.93** and **0.95**, respectively. The Naive Bayes curve is slightly closer to the top-left corner, reflecting a small advantage in maintaining high true-positive rates at relatively low false-positive rates.

This confirms that, on the test set, both classifiers provide clinically meaningful ranking of risk, with Gaussian Naive Bayes offering marginally better overall ranking performance.

<p align="center">
  <img src="artifacts/roc_curves.png" alt="ROC Curves" width="1200"/>
</p>

<p align="center"><em>Figure 10: ROC Curves</em></p>
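
One way to read "ranking of risk" is that ROC-AUC equals the probability that a randomly chosen diseased patient receives a higher predicted risk than a randomly chosen healthy one. The sketch below illustrates this interpretation in plain Python. The helper `roc_auc` and the eight example patients are illustrative assumptions, not the project's data or code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (diseased, healthy) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks for eight patients (1 = disease); not the project's data.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.10, 0.20, 0.35, 0.60, 0.30, 0.55, 0.80, 0.90]

auc = roc_auc(y_true, y_score)
print(f"AUC = {auc:.4f}")  # 13 of 16 pairs are ordered correctly -> AUC = 0.8125
```

Because the value depends only on the ordering of the scores, it is unaffected by the choice of decision threshold, which is why it complements the threshold-specific confusion matrices.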

### Cross-Validation

To assess how stable these results are with respect to the particular train-test split, we performed stratified 5-fold cross-validation on the training data for both models. 

Logistic Regression achieved mean accuracy of about **0.84** (range ≈ 0.75-0.92), whereas Gaussian Naive Bayes reached a lower mean accuracy of about **0.79** (range ≈ 0.70-0.88). 

Thus, although Naive Bayes appeared slightly better on the single held-out test set, repeated resampling suggests that Logistic Regression is, on average, the more reliable classifier, with better expected performance across different splits of the data. This supports choosing Logistic Regression as the primary model for deployment, while still recognizing Naive Bayes as a competitive and more parsimonious alternative.

| Model                | Fold 1  | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean accuracy ± SD |
|----------------------|:-------:|:------:|:------:|:------:|:------:|:------------------:|
| Logistic Regression  | 0.92    | 0.85   | 0.75   | 0.85   | 0.83   | 0.84 ± 0.06        |
| Gaussian Naive Bayes | 0.88    | 0.70   | 0.75   | 0.78   | 0.85   | 0.79 ± 0.07        |

<center><i>Table 5: Stratified K-Fold cross-validation results</i></center>
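
The summary column of Table 5 can be checked directly from the per-fold values. This is a small sketch using only the standard library; it assumes the sample standard deviation (n - 1 denominator), since the report does not state which convention was used, and it works from the rounded fold accuracies shown in the table.

```python
from statistics import mean, stdev

# Per-fold accuracies as reported (rounded to two decimals) in Table 5.
fold_accuracy = {
    "Logistic Regression": [0.92, 0.85, 0.75, 0.85, 0.83],
    "Gaussian Naive Bayes": [0.88, 0.70, 0.75, 0.78, 0.85],
}

summary = {}
for model, scores in fold_accuracy.items():
    # stdev() is the sample SD; this convention is an assumption (see lead-in).
    summary[model] = (mean(scores), stdev(scores), min(scores), max(scores))
    m, sd, lo, hi = summary[model]
    print(f"{model}: {m:.2f} ± {sd:.2f} (range {lo:.2f}-{hi:.2f})")
```

With only five folds, the spread of each model is comparable in size to the gap between the two means, so the ranking of the models should be read as suggestive rather than conclusive.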

---
## Key Findings

- #### Dataset and population

The dataset includes 303 predominantly middle-aged to older patients (mean age ≈ 54 years), with a clear male majority.

Classic cardiovascular risk factors (blood pressure, cholesterol, exercise response) are moderately elevated and show substantial variability, reflecting a clinically heterogeneous cohort.

- #### Univariate patterns and risk factors

Disease prevalence is slightly below 50% overall, but differs strongly across subgroups.

Men and older age groups (≥50 years) show markedly higher disease frequencies than women and younger patients, supporting known epidemiological patterns.

Several numeric variables (`chol`, `oldpeak`, `ca`) exhibit right-skew and clinically plausible high outliers; these were retained and handled via robust scaling.

- #### Association between vessels (`ca`), age, and disease

Age correlates positively with the number of affected vessels (ca), and both are positively associated with disease status.

The contingency analysis of ca vs. disease shows a clear monotonic trend: as the number of affected vessels increases from 0 to 3, the proportion of patients with heart disease rises sharply.

A chi-square test of independence confirms a strong, statistically significant association between ca and heart disease $(\chi^2 \approx 72.3, df = 3, p < 0.001)$, indicating that vessel involvement is a key structural marker of disease in this dataset.
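
For reference, the reported statistic follows the standard Pearson form for an $r \times c$ contingency table. The symbols $O_{ij}$ (observed count), $E_{ij}$ (expected count under independence), $R_i$ and $C_j$ (row and column totals), and $N$ (sample size) are notation assumed here. The degrees of freedom follow from the four levels of `ca` (0 to 3) and the two disease states:

$$
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i\,C_j}{N}, \qquad df = (r-1)(c-1) = (4-1)(2-1) = 3
$$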

- #### Model performance on the test set

Both models (Logistic Regression and Gaussian Naive Bayes) achieve good discrimination (ROC-AUC ≈ 0.93-0.95) and high accuracy (≈ 0.83-0.86) on the held-out test set.

Confusion matrices show that both models correctly classify most healthy patients (43 and 45 of 48, respectively), while each misses 10 of the 42 diseased patients, a smaller but non-negligible fraction.

Gaussian Naive Bayes appears slightly better on the single test split (higher accuracy and ROC-AUC, fewer false positives), suggesting that a simple probabilistic model can capture much of the predictive structure in the data.

- #### Model robustness and preferred classifier

Cross-validation on the training data reveals that Logistic Regression has a higher mean accuracy (≈ 0.84 vs. ≈ 0.79) and a slightly smaller spread across folds (SD 0.06 vs. 0.07) than Gaussian Naive Bayes.

This indicates that Logistic Regression is more robust to sampling variation and is likely to generalize more consistently to new data, even though Naive Bayes performed marginally better on the specific test split.

Overall, Logistic Regression emerges as the preferred model for deployment, with Gaussian Naive Bayes serving as a strong, interpretable baseline that confirms the stability of the main predictive signals.

---
## Limitations

- #### Limited clinical detail of predictors
Several variables are relatively coarse and may not reflect current best practice in cardiovascular risk assessment. For example, `chol` represents total cholesterol, whereas treatment decisions usually rely on **LDL** cholesterol (and its ratio to HDL) rather than on a single aggregate value. Because total cholesterol combines **LDL**, HDL, and other fractions, the same total can hide very different LDL/HDL profiles, so important risk may be misjudged. Similar simplifications likely apply to other predictors (e.g. binary fasting blood sugar, simplified exercise-ECG categories), which can blunt the true signal and limit the clinical interpretability of the models.

- #### Restricted dataset and potential bias
The analysis is based on a single, relatively small dataset (303 patients) with a male-dominated, middle-aged population. This restricts the ability to generalize the findings to other settings, age groups, or contemporary patients with different prevalence of risk factors and treatment patterns. No external validation was performed, so the reported performance may be optimistic.

- #### Simplifying modeling assumptions
The Gaussian Naive Bayes model assumes conditional independence between predictors and (after scaling) approximately Gaussian distributions within each class; both assumptions are clearly only approximate here. Even logistic regression is linear in the log-odds and may miss more complex, non-linear interactions between features (e.g. age × exercise response, multi-factor metabolic risk).

- #### Scope of methods and role of ML in practice
We deliberately focused on simple, transparent statistical models. While their performance is encouraging, more modern approaches (e.g. gradient boosting, random forests, or neural networks) could potentially capture richer patterns and further improve discrimination, especially in larger datasets. At the same time, ML systems, particularly those trained on limited and simplified data, should not be used as autonomous decision-makers in a complex domain like cardiology. Their appropriate role is to support clinicians by highlighting patterns and estimating risk, while final diagnostic and treatment decisions remain grounded in expert clinical judgment and patient-specific context.

---
## Future Work

- #### Richer and more granular clinical variables
A natural next step is to extend the feature set with more clinically specific biomarkers (e.g. LDL, HDL, triglycerides, HbA1c), medication use, and imaging-derived measures. This would address some of the limitations of total cholesterol and other coarse predictors, and allow the models to better reflect contemporary cardiology practice.

- #### Larger and more diverse datasets
Training and validating the models on larger, multi-center cohorts would improve statistical power, allow more reliable subgroup analyses (e.g. by sex, age, comorbidity profile), and test generalizability across different populations and healthcare settings. External validation on an independent dataset should be a priority.

- #### Expanded modeling and tuning
Beyond Gaussian Naive Bayes and logistic regression, future work could explore tree-based ensembles (Random Forests, Gradient Boosting, XGBoost), regularized models, and calibrated probabilistic methods. Systematic hyperparameter tuning and model selection, guided by cross-validation and calibration metrics, may yield more accurate and better-calibrated risk predictions.

- #### Model interpretability and clinical integration
Applying model-agnostic explanation methods (e.g. partial dependence plots, SHAP values) could clarify how each feature contributes to predicted risk and help clinicians understand when to trust or question model outputs. Translating risk scores into clinically meaningful thresholds and decision rules (e.g. via decision-curve analysis) would be a key step toward practical use.
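
As a rough illustration of what decision-curve analysis measures, the sketch below evaluates a single point of such a curve using the standard net-benefit formula, $\text{TP}/N - \text{FP}/N \cdot p_t/(1-p_t)$. It reuses the test-set confusion-matrix counts and assumes they were produced at a risk threshold of $p_t = 0.5$, which the report does not state explicitly. The helper `net_benefit` is hypothetical, and a full analysis would recompute the counts from predicted probabilities across a range of thresholds.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of acting on positive predictions at risk threshold p_t (0 < p_t < 1)."""
    return tp / n - (fp / n) * (threshold / (1 - threshold))

n = 90      # test-set size implied by the confusion matrices (48 + 42)
p_t = 0.5   # assumed cut-off; the report does not state the threshold used

strategies = {
    "Logistic Regression": net_benefit(tp=32, fp=5, n=n, threshold=p_t),
    "Gaussian Naive Bayes": net_benefit(tp=32, fp=3, n=n, threshold=p_t),
    # Reference strategy: label every patient as diseased (TP = 42, FP = 48).
    "Treat all": net_benefit(tp=42, fp=48, n=n, threshold=p_t),
    # Reference strategy: label nobody as diseased.
    "Treat none": 0.0,
}
for name, nb in strategies.items():
    print(f"{name}: net benefit = {nb:.3f}")
```

Under these assumptions both models score above the "treat all" and "treat none" reference strategies at this one threshold; whether that holds at lower, more screening-oriented thresholds is exactly what a full decision curve would show.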

- #### Prospective and workflow-oriented evaluation
Ultimately, the most informative evaluation would involve prospective testing in routine clinical workflows, examining not only predictive performance but also effects on clinician behavior, time to diagnosis, and patient outcomes. Any such deployment should be explicitly framed as decision support, complementing rather than replacing clinician judgment.

---
## Conclusions

In summary, we showed that even simple, transparent models can extract clinically meaningful patterns from routine heart disease data. Exploratory analysis confirmed known risk gradients by age, sex, and vessel involvement, and hypothesis testing highlighted the strong association between the number of affected vessels and disease status. 

Both Gaussian Naive Bayes and logistic regression achieved good discrimination on the test set, with logistic regression emerging as the more robust choice overall. While these results are encouraging, they are based on limited and somewhat simplified data, and the models should be viewed as decision-support tools rather than replacements for clinical expertise.

---
## Appendix

### A. Notebooks

All data preparation, analysis, and visualization were performed in a separate Jupyter notebook. Each step described in the report corresponds directly to its structure. The notebook contains the full implementation and can be accessed using the link below:

Implementation Notebook - https://github.com/vregi/heart-disease-classification/blob/main/implementation.ipynb 

### B. Data Source

The data used in this project come from the Heart Disease dataset hosted at the UC Irvine Machine Learning Repository. It is a widely used benchmark dataset derived from clinical records (primarily the Cleveland Clinic subset) and contains demographic, clinical, and exercise test variables together with a labeled heart disease outcome. The dataset was obtained in its processed tabular form and used without modification apart from the preprocessing steps described in the main text.

Data Source - https://archive.ics.uci.edu/dataset/45/heart+disease

### C. References

[1] : World Health Organization. Cardiovascular diseases (CVDs): Fact sheet. Available at: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds) \
[2] : Banerjee, T. et al. A systematic review of machine learning in heart disease prediction. Available at: https://pmc.ncbi.nlm.nih.gov/articles/PMC12614364/ \
[3] : scikit-learn developers. 1.9. Naive Bayes - scikit-learn User Guide. Available at: https://scikit-learn.org/stable/modules/naive_bayes.html \
[4] : Bishop, C. M. Pattern Recognition and Machine Learning. Chapter 4, pp. 205-206. Springer, 2006. PDF version available at: https://www.microsoft.com/en-us/research/wp-content/uploads/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf

---
<div style="text-align: right; margin-top: 40px;">
<strong>Vladyslav Lysenko</strong> &nbsp;&nbsp; <strong>Parmida Mashadi Assadollahi</strong> &nbsp;&nbsp; <strong>Sankruththian Senathirajah</strong><br>
University of Toronto School of Continuing Studies<br>
November 2025
</div>