# Cardiovascular disease prediction

## Content

1. Abstract
2. Introduction
3. Datasets
4. Working process
5. Models and algorithms
6. Hyperparameter optimization
7. Evaluation
8. Results
9. References

## 1. Abstract

Cardiovascular diseases (CVDs) are a group of disorders affecting the heart and blood vessels. In 2022, an estimated 18.6 million people worldwide died from CVD, which accounted for over 33% of all global deaths. This marks a continued upward trend in mortality, emphasizing that these diseases remain one of the leading causes of death globally. As a result, there is an urgent need for effective risk prediction and management tools. This study aims to develop a machine learning model to predict the likelihood of heart disease based on key health indicators and lifestyle factors. Furthermore, I will analyze the contribution of the features to the model's predictions to identify the most significant risk factors. The insights gained from this analysis can help healthcare professionals and individuals recognize high-risk profiles for early intervention and targeted preventive measures.

## 2. Introduction

Cardiovascular disease remains a major public health concern globally, with its prevalence rising as modern lifestyles contribute to a higher risk of heart-related issues. Predictive tools are crucial for identifying individuals at higher risk of developing heart disease, allowing for early intervention. In this project, I use several classification models: Logistic Regression, Decision Tree, Random Forest, XGBoost and SVM. These models were selected for their effectiveness in predicting the presence of heart disease. Logistic Regression and Decision Tree are simple models, while Random Forest is a more complex ensemble model that also highlights feature importance. XGBoost uses gradient boosting to improve predictive accuracy, making it a robust choice for medical datasets. Hyperparameter optimization is done with the Optuna library. Predictions are based on a variety of features such as blood indicators, health metrics and lifestyle habits, all of which are related to heart disease.

To enhance the interpretability of the models and better understand the decision-making process, I will also apply explainable AI techniques, specifically SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-Agnostic Explanations). These techniques will allow us to analyze the significance of individual features and understand how they contribute to the model's predictions. By identifying the key risk factors, this project aims not only to improve the accuracy of heart disease predictions but also to provide valuable insights that can aid healthcare professionals in tailoring preventive strategies and treatment plans.

## 3. Datasets

**[Dataset 1](data/heart_disease_uci.csv)** is a combination of 5 different datasets:
- Cleveland: 303 observations
- Hungarian: 294 observations
- Switzerland: 123 observations
- Long Beach VA: 200 observations
- Statlog (Heart) Data Set: 270 observations
  
It contains health-related data from hospital records that track cardiovascular health metrics. The source of this dataset is [Kaggle](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction). Each of the component datasets can be found in the UCI Machine Learning Repository. In total there are 1190 observations, 272 of which are duplicates, and 12 columns (11 features and the target):
- Age: The age of the patient.
- Sex: Gender of the patient, coded as M for male and F for female. 
- Chest Pain Type: The type of chest pain the patient experiences (categorical feature with multiple values: TA - Typical Angina, ATA - Atypical Angina, NAP - Non-Anginal Pain, ASY - Asymptomatic).
- RestingBP (Resting Blood Pressure): The blood pressure of the patient when at rest.
- Cholesterol: Serum cholesterol (mg/dl).
- FastingBS (Fasting Blood Sugar): Indicates whether the patient has a fasting blood sugar level higher than 120 mg/dl, which may indicate diabetes.
- RestingECG (Resting Electrocardiographic Results): The results of an ECG test taken at rest (Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria).
- MaxHR: The highest heart rate achieved during exercise (numeric value between 60 and 202). 
- Exercise Induced Angina: Indicates whether the patient experiences chest pain during exercise (Y - Yes, N - No).
- Oldpeak: Depression in the ST segment of the ECG induced by exercise, relative to rest.
- ST_Slope: The slope of the peak exercise ST segment, another indicator of heart function: Up - the ST segment slopes upward, indicating better heart function. Flat - the ST segment remains level, which may indicate potential heart issues. Down - the ST segment slopes downward, often associated with possible heart problems like ischemia.
- HeartDisease: This is the target variable, indicating whether the patient has heart disease (1) or not (0).

**[Dataset 2](data/heart_disease_dataset.csv)** also contains health-related data including various factors that could potentially be linked to heart disease. The source is [Mendeley data](https://data.mendeley.com/datasets/yrwd336rkz/2). It contains 1763 records representing 1763 unique patients and 12 columns: 
- age: The age of the patient.
- sex: Gender of the patient, coded as 1 for male and 0 for female. 
- chest pain type: The type of chest pain the patient experiences (categorical feature with multiple values: 1 - Typical Angina, 2 - Atypical Angina, 3 - Non-Anginal Pain, 4 - Asymptomatic).
- trestbps (resting blood pressure): The blood pressure of the patient when at rest (numeric variable).
- cholesterol: Serum cholesterol (numeric variable, mg/dl).
- fasting blood sugar: Indicates whether the patient has a fasting blood sugar level higher than 120 mg/dl (1), which may indicate diabetes, or not (0).
- resting ecg (resting electrocardiographic results): The results of an ECG test taken at rest (0: Normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2: showing probable or definite left ventricular hypertrophy by Estes' criteria).
- max heart rate: The highest heart rate achieved during exercise (numeric variable). 
- exercise angina: Indicates whether the patient experiences chest pain during exercise (1 - Yes, 0 - No).
- old peak: Depression in the ST segment of the ECG induced by exercise, relative to rest.
- ST slope: The slope of the peak exercise ST segment, another indicator of heart function: 1 - Up (the ST segment slopes upward, indicating better heart function). 2 - Flat (the ST segment remains level, which may indicate potential heart issues). 3 - Down (the ST segment slopes downward, often associated with possible heart problems like ischemia).
- target: This is the target variable, indicating whether the patient has heart disease (1) or not (0).

## 4. Working process

![process](img/working_process.png)

First I collect the data. The next step is tidying and cleaning the two datasets separately. Then I [merge the two datasets](Merging_datasets.ipynb) and remove duplicates. The purpose of [exploratory data analysis](Model_training_and_optimization.ipynb) is to understand the underlying structure and patterns in the data. I create histograms, bar charts, pie charts and a correlation matrix, which help to detect skewed distributions, potential outliers and relationships between variables. There are no missing values, but there are zero values in the cholesterol and blood_pressure columns, which are invalid. I use SimpleImputer with the mean strategy to replace those invalid values.

Exploration of the target (heart_disease) shows a class imbalance: 796 positive and 567 negative samples. Class imbalance means that the number of samples in one class (the minority class, negative in my case) is significantly smaller than in another class (the majority class, positive). This imbalance can lead to biased model performance, favoring the majority class and neglecting the minority class. To balance the classes I use SMOTE (Synthetic Minority Oversampling Technique), which generates new synthetic samples by interpolating between existing samples of the minority class. It selects a random sample from the minority class and identifies its k nearest neighbors (typically k = 5). A new synthetic sample is created by randomly selecting one of these neighbors and generating a point that lies along the line segment connecting the two samples in feature space. The synthetic samples are then added to the original data, balancing the number of samples in each class.
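
The interpolation step described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the library implementation used in the project; the function name `smote_sample` and the sample points are made up for the example.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point from a list of minority-class feature vectors."""
    rng = rng or random.Random()
    # 1. Pick a random minority sample.
    base = rng.choice(minority)
    # 2. Find its k nearest minority neighbours (Euclidean distance).
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    # 3. Pick one neighbour and interpolate at a random position on the segment.
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # value in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Hypothetical, already-scaled minority samples (two features each).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=3, rng=rng) for _ in range(4)]
print(synthetic)
```

Every synthetic point lies on a segment between two real minority samples, so it stays inside the region the minority class already occupies.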

I split the data into training (75%) and testing (25%) sets. Using stratify = target ensures that the distribution of the target variable is preserved in both sets: the ratio of heart disease (1) to no heart disease (0) cases is the same in the training and test sets as in the full dataset. This is especially useful when dealing with imbalanced datasets, because it prevents one subset from ending up with a distorted class ratio, which would bias training or evaluation.
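
A minimal sketch of such a split with scikit-learn's `train_test_split`, using stand-in data that only reproduces the class counts reported above. The dummy feature and `random_state=42` are assumptions made for the example.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the reported class counts:
# 796 positive (1) and 567 negative (0) samples, one dummy feature each.
y = [1] * 796 + [0] * 567
X = [[i] for i in range(len(y))]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)


def positive_share(labels):
    """Fraction of samples that belong to the positive class."""
    return sum(labels) / len(labels)


print(f"full:  {positive_share(y):.3f}")
print(f"train: {positive_share(y_train):.3f}")
print(f"test:  {positive_share(y_test):.3f}")
```

The three printed shares should be almost identical, which is exactly what stratification guarantees.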

For data preprocessing I use ColumnTransformer. It allows different preprocessing techniques to be applied to different feature subsets:
- OneHotEncoder is used for categorical features with no intrinsic order (chest_pain_type) and binary features (gender and exercise_angina).
- OrdinalEncoder is used for categorical features that have a natural order (resting_ecg and st_slope).
- StandardScaler is used to scale continuous features to a mean of 0 and a standard deviation of 1 (all numerical features: age, cholesterol, blood_pressure, max_heart_rate, oldpeak). This is important for models that are sensitive to the scale of input features, such as Logistic Regression and SVM.
  
A Pipeline in ML automates workflows by chaining steps together. In my case, the pipeline bundles the preprocessing, SMOTE and model training steps together, ensuring that data is consistently preprocessed and preventing data leakage.
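
The following sketch shows how the three transformers and a classifier can be chained. It is an assumption-laden illustration: the tiny DataFrame, the category labels and the category order passed to OrdinalEncoder are made up, and the SMOTE step is left out so that the example depends only on pandas and scikit-learn (a sampler such as SMOTE needs the pipeline class from imbalanced-learn rather than scikit-learn's).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Tiny hypothetical sample; values and category labels are illustrative only.
df = pd.DataFrame({
    "age": [40, 49, 37, 48, 54, 60],
    "cholesterol": [289, 180, 283, 214, 195, 250],
    "blood_pressure": [140, 160, 130, 138, 150, 145],
    "max_heart_rate": [172, 156, 98, 108, 122, 130],
    "oldpeak": [0.0, 1.0, 0.0, 1.5, 0.0, 2.0],
    "gender": ["M", "F", "M", "F", "M", "M"],
    "exercise_angina": ["N", "N", "N", "Y", "N", "Y"],
    "chest_pain_type": ["ATA", "NAP", "ATA", "ASY", "NAP", "ASY"],
    "resting_ecg": ["Normal", "Normal", "ST", "Normal", "LVH", "ST"],
    "st_slope": ["Up", "Flat", "Up", "Flat", "Up", "Down"],
})
y = [0, 1, 0, 1, 0, 1]

numeric = ["age", "cholesterol", "blood_pressure", "max_heart_rate", "oldpeak"]
nominal = ["chest_pain_type", "gender", "exercise_angina"]
ordinal = ["resting_ecg", "st_slope"]

preprocessor = ColumnTransformer([
    # No intrinsic order: one indicator column per category.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), nominal),
    # Assumed natural order, given explicitly per column.
    ("ordinal", OrdinalEncoder(categories=[["Normal", "ST", "LVH"],
                                           ["Up", "Flat", "Down"]]), ordinal),
    # Zero mean and unit variance for continuous features.
    ("scale", StandardScaler(), numeric),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(df, y)
predictions = model.predict(df)
print(predictions)
```

Because the transformers are fitted inside the pipeline, their statistics (means, categories) come only from the data passed to `fit`, which is what prevents leakage from the test set.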

## 5. Models and algorithms

For the purpose of my paper, I have chosen the following algorithms: 

### Decision Tree
Decision trees are trained by recursively partitioning the data into subsets based on feature values, creating a tree-like structure. At each partition, the algorithm selects the feature whose split gives the lowest impurity (for example, minimum entropy). The structure includes:
- Root – the first (topmost) node, that starts the process.
- Nodes – a condition based on the data features. Decision nodes are divided into two or more groups based on the predictors.
- Leaves – the final predictions (output or classification). Each leaf contains a prediction for the target.

DecisionTreeClassifier doesn't require preprocessing such as normalization. It handles categorical variables well (especially binary ones). It tends to overfit: it can create complex trees that fit the training data well but fail to generalize to unseen data. It can also struggle to find optimal splits when data distributions are skewed. Each leaf predicts by majority vote, i.e. it outputs the most common class among its samples. \
At each split, the decision tree selects the feature that best separates the data into subgroups, trying to make these subgroups as homogeneous as possible. This is achieved by computing a partition criterion. Two of the most commonly used criteria are the Gini index and Entropy, which are used to estimate Information Gain (IG). IG measures how much the impurity of the nodes decreases when the data is divided by a given feature. The tree chooses the feature that minimizes Gini or Entropy after the split. \
The Gini index measures the probability that a point randomly selected from a given set will be incorrectly classified if it is classified according to the distribution of classes in that set. Gini is a measure of impurity (how mixed the classes are at a given node). It is lower the purer the node is (the less mixed its classes are):
$$ Gini = 1 - \sum_{i=1}^{C} p_i^2 $$
Entropy measures the chaos or "impurity" in a given node. Lower entropy means a more homogeneous (cleaner) node. Gini is faster to compute (it needs no logarithm), while Entropy is slower to compute but more sensitive to small differences in class probabilities.
$$ Entropy = - \sum_{i=1}^{C} p_i \log_2(p_i) $$
where: \
$C$ is the number of classes, \
$p_i$ is the probability of class i.
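
As a small worked example, the two formulas and the resulting impurity decrease of a split can be computed directly from class counts. The helper names and the example counts are made up for illustration.

```python
import math


def class_probabilities(counts):
    """Turn per-class sample counts into probabilities, skipping empty classes."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def gini(counts):
    return 1.0 - sum(p ** 2 for p in class_probabilities(counts))


def entropy(counts):
    return -sum(p * math.log2(p) for p in class_probabilities(counts))


def impurity_decrease(parent, children, criterion):
    """Parent impurity minus the size-weighted impurity of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * criterion(child) for child in children)
    return criterion(parent) - weighted


# Hypothetical node with 5 negative and 5 positive samples,
# split into children with counts [4, 1] and [1, 4].
parent = [5, 5]
children = [[4, 1], [1, 4]]
print(gini(parent), entropy(parent))
print(impurity_decrease(parent, children, gini))
print(impurity_decrease(parent, children, entropy))
```

A perfectly mixed binary node has Gini 0.5 and entropy 1, while a pure node scores 0 on both, so a useful split always produces a positive decrease.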

### Random Forest 
A Random Forest is an ensemble learning method built upon DecisionTrees. It improves performance by combining multiple trees with similar parameters, to reduce overfitting and improve generalization. The idea is that a group of models (even weak ones) will work better than any individual model because they compensate for each other's weaknesses. \
Random Forest uses bagging: multiple submodels of the same type are created and trained independently of each other. Several subsets of the data are created through bootstrap sampling (sampling with replacement), and each tree is trained on a different random subset. Because Random Forest uses many trees, it can assess which features are most important for predictions, for example by analyzing how often a given feature is used for splitting. For classification, predictions are made by majority voting. Random Forest is a robust algorithm that generally outperforms a single Decision Tree, but training and prediction can be computationally expensive with many trees.
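
The two building blocks of bagging, bootstrap sampling and majority voting, can be sketched in plain Python. The function names and the toy inputs are illustrative and do not come from the project code.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the class predicted by most trees."""
    return Counter(predictions).most_common(1)[0][0]


rows = list(range(10))  # stand-in for 10 training rows
sample = bootstrap_sample(rows, random.Random(0))
print(sorted(sample))

# Hypothetical class predictions of five trees for one patient.
print(majority_vote([1, 0, 1, 1, 0]))
```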

### Logistic Regression


Logistic Regression is a supervised learning algorithm used for binary classification problems. It predicts the probability that an outcome belongs to a particular class (0 or 1). Instead of directly predicting class labels, it predicts the probability and uses a threshold (usually 0.5) to classify. Logistic Regression relies on the logistic (sigmoid) function, which ensures the output is between 0 and 1, making it interpretable as a probability. It uses the log-loss (binary cross-entropy) as its loss function. The model's goal is to find the coefficient values that minimize the difference between the predicted probabilities and the actual outcomes in the training data, using techniques like gradient descent.
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$ 
where:\
$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$ \
$z$ is the linear combination of the inputs $x_1, x_2, \ldots, x_n$ with weights $\beta_1, \beta_2, \ldots, \beta_n$ and intercept $\beta_0$.

Logistic regression classifies the outcome based on a threshold: \
If $P(Y=1|X=x) \geq 0.5 \Rightarrow Y = 1$ \
If $P(Y=1|X=x) < 0.5 \Rightarrow Y = 0$ \
In words: if the predicted probability of the positive class, $P(Y=1)$, is greater than or equal to 0.5, the instance is classified as class 1 (positive class); if it is less than 0.5, the instance is classified as class 0 (negative class).
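
The prediction rule can be written out as a short sketch. The intercept and weights below are hypothetical values chosen for the example, not coefficients from the trained model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, intercept, weights):
    """P(Y=1 | x) for a linear score z = intercept + sum(w_j * x_j)."""
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def predict(x, intercept, weights, threshold=0.5):
    return 1 if predict_proba(x, intercept, weights) >= threshold else 0


# Hypothetical coefficients for two scaled features.
intercept, weights = 0.0, [0.5, 1.0]
for x in ([1.0, 1.0], [1.0, -2.0]):
    print(x, round(predict_proba(x, intercept, weights), 3), predict(x, intercept, weights))
```

Lowering the threshold below 0.5 would classify more patients as positive, trading precision for recall.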

L1, L2 and ElasticNet are regularization techniques that prevent overfitting by penalizing large coefficients (weights) in the model. In logistic regression, the regularization term is added to the loss function to reduce the complexity of the model:
- L2 regularization (Ridge) adds a penalty equal to the sum of the squares of the coefficients. This discourages large coefficients, which can cause overfitting, and spreads the weights more evenly across features, leading to small but non-zero coefficients for all features. It is typically used when all features are expected to contribute to the decision, but none should dominate.
- L1 regularization (Lasso) adds a penalty equal to the sum of the absolute values of the coefficients. This encourages sparsity: it tends to drive some coefficients to exactly zero, effectively performing feature selection by keeping the most important features and removing irrelevant ones.
- ElasticNet combines both L1 and L2 regularization, offering a balance between Lasso and Ridge.
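
Written out, the three penalized objectives look as follows. The symbols are introduced here only for illustration: $\mathcal{L}(\beta)$ is the log-loss, $\lambda \geq 0$ is the regularization strength and $\alpha \in [0, 1]$ is the L1/L2 mixing ratio. The intercept $\beta_0$ is usually left unpenalized, and libraries may scale the terms differently (scikit-learn, for example, uses the inverse strength C).

$$ J_{L2}(\beta) = \mathcal{L}(\beta) + \lambda \sum_{j=1}^{n} \beta_j^2 $$
$$ J_{L1}(\beta) = \mathcal{L}(\beta) + \lambda \sum_{j=1}^{n} |\beta_j| $$
$$ J_{ElasticNet}(\beta) = \mathcal{L}(\beta) + \lambda \left( \alpha \sum_{j=1}^{n} |\beta_j| + (1 - \alpha) \sum_{j=1}^{n} \beta_j^2 \right) $$

Setting $\alpha = 1$ recovers the L1 penalty and $\alpha = 0$ the L2 penalty.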

### SVM


Support Vector Machines (SVMs) are supervised machine learning algorithms used for both classification and regression tasks. The core idea behind SVMs is to find the optimal hyperplane that separates data points into different classes with the maximum margin. The hyperplane is the decision boundary that separates the classes. The margin is the distance between the hyperplane and the nearest data points from each class. Support vectors are the data points that lie closest to the hyperplane and determine its position. \
I use SVC (Support Vector Classifier), which is designed for classification tasks. It is effective in high-dimensional spaces, so it can handle complex datasets with many features. It is also relatively robust to outliers, because the decision boundary depends only on the support vectors. With the kernel trick it works well for both linearly and non-linearly separable data. Types of Support Vector Classifiers:
- Linear SVC uses a linear kernel to find a linear hyperplane that separates the data. It is suitable for data that is linearly separable. The SVM finds the optimal hyperplane by maximizing the margin, ensuring that the support vectors lie on either side of the boundary.
- Non-linear SVC with kernel trick: for non-linearly separable data, SVC can use a non-linear kernel function that implicitly maps the data into a higher-dimensional space where it becomes linearly separable. Common kernels: the Radial Basis Function (RBF) kernel measures similarity by the distance between two points, which produces curved (circular or spherical) decision boundaries and is effective for many complex classification problems. The polynomial kernel computes the similarity between data points using polynomial functions of their original features.
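
Both kernels are simple functions of two feature vectors, as this sketch shows. The default values for `gamma`, `degree` and `coef0` are arbitrary choices for the example and not the tuned hyperparameters.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * ||x - y||^2): 1 for identical points, near 0 for distant ones."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))
print(rbf_kernel([0.0, 0.0], [1.0, 0.0]))
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], degree=2))
```

A larger `gamma` makes the RBF similarity drop off faster with distance, which gives a more flexible but also more overfitting-prone boundary.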


### XGBoost


XGBoost (Extreme Gradient Boosting) is an ensemble learning technique that combines multiple weak models to create a strong predictive model. It is based on the concept of gradient boosting. The key idea behind gradient boosting is to build an ensemble of decision trees sequentially, where each subsequent tree corrects the errors made by the previous ones. XGBoost also focuses on optimization, regularization and parallelization to make the boosting process more efficient and scalable. One of its regularization parameters is gamma, the minimum loss reduction required to make a further split: the larger the value of gamma, the more conservative and simpler the model becomes, which reduces the risk of overfitting. XGBoost uses an objective function to guide the learning process. It consists of two parts: a loss function (measures the difference between the true and predicted values) and a regularization term (prevents the model from overfitting). Total objective function:
$$L(\theta) = \sum_{i=1}^{n} \ell(y_i, \hat{y}_i) + \Omega(f)$$
where: \
$\ell(y_i, \hat{y}_i)$ is the loss function (log loss for classification) for the i-th data point, where $y_i$ is the true label and $\hat{y}_i$ is the predicted value. \
$\Omega(f)$ is the regularization term, where $f$ represents the model; it penalizes overly complex models.

In binary classification, each tree gives a real-valued output. The outputs of all trees are summed, and the sum is passed through the sigmoid function to produce a probability between 0 and 1. If the probability is greater than or equal to a threshold (commonly 0.5), the predicted class is 1; otherwise it is 0. The raw score before the sigmoid is:
$$ \hat{y}_i = \sum_{k=1}^K f_k(x_i) $$
where:\
$f_k(x_i)$ represents the prediction of the k-th tree for the i-th data point. \
$K$ is the total number of trees in the ensemble.
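
To make the additive prediction concrete, here is a sketch with three hand-written stand-ins for fitted trees. The split rules and leaf values are invented for the example; a real XGBoost model learns them from data.

```python
import math


# Hypothetical stand-ins for fitted trees: each maps a patient to a real-valued score.
def tree_1(x):
    return 0.4 if x["st_slope"] >= 2 else -0.3


def tree_2(x):
    return 0.2 if x["exercise_angina"] == 1 else -0.1


def tree_3(x):
    return 0.1 if x["oldpeak"] > 1.0 else -0.2


TREES = [tree_1, tree_2, tree_3]


def raw_score(x):
    """Sum of the outputs of all K trees."""
    return sum(tree(x) for tree in TREES)


def predict_proba(x):
    return 1.0 / (1.0 + math.exp(-raw_score(x)))


def predict(x, threshold=0.5):
    return 1 if predict_proba(x) >= threshold else 0


patient = {"st_slope": 2, "exercise_angina": 1, "oldpeak": 1.5}
print(raw_score(patient), predict_proba(patient), predict(patient))
```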

## 6. Hyperparameter optimization

I used Optuna, a hyperparameter optimization framework designed to automate the search for the best hyperparameters. It explores the hyperparameter space with the TPE (Tree-structured Parzen Estimator) algorithm. TPE is a probabilistic model-based optimization method: it uses the results of previous trials to model which hyperparameter values tend to give good and bad scores, and samples new candidates from the promising regions. Instead of searching the space at random, it balances exploration and exploitation to focus on promising areas. This usually leads to more efficient optimization than traditional methods like grid search or random search.

Optuna uses a study-object structure to manage and track multiple optimization tasks, running trials and storing results: 
- create_study initializes a study object to manage optimization tasks.
- optimize runs the optimization, executing trials and evaluating objective functions.
- trial.suggest_* methods (for example suggest_int and suggest_float) define the hyperparameters to be optimized.
- plot_optimization_history visualizes the optimization process, showing how the objective function’s value evolves.
  
The objective function in Optuna defines what I want to optimize. This function receives a trial object and uses it to suggest hyperparameters, then returns the evaluation metric (ROC AUC in my case) that the study aims to maximize or minimize. \
A trial represents a single run of the objective function with a specific set of hyperparameters. Optuna performs multiple trials, each with different hyperparameters, and tracks their performance to determine which set is the best.\
A study is a high-level object in Optuna that tracks the entire optimization process. It includes multiple trials and their results, such as the objective function values. A study helps manage the optimization process, handle storage, and ensure that the best hyperparameters are selected based on the trials' outcomes.

## 7. Evaluation

For my binary classification task I use several evaluation metrics that help assess model performance:

### ROC AUC
The ROC AUC score (Receiver Operating Characteristic Area Under Curve) measures the ability of a classifier to distinguish between the positive and negative classes. It is the area under the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR) at different thresholds. A perfect classifier has an AUC of 1, while a classifier that performs no better than random guessing has an AUC of 0.5.
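
An equivalent way to read the AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way for a small made-up example; it is meant as an illustration, not as a replacement for a library implementation.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for label, s in zip(y_true, scores) if label == 1]
    negatives = [s for label, s in zip(y_true, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 3 of 4 pairs are ranked correctly -> 0.75
```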

### Accuracy
Accuracy is the proportion of correct predictions made by the model over the total number of predictions.
$$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$$

### Precision
Precision measures how many of the predicted positive cases were actually positive. It is the ratio of true positives to all predicted positives.
$$\text{Precision} = \frac{TP}{TP+FP}$$

### Recall
Recall (sensitivity) is the ratio of true positives to the total actual positive cases. High recall minimizes false negatives.

$$\text{Recall} = \frac{TP}{TP+FN}$$

### F1-Score
The F1-score is the harmonic mean of precision and recall. It provides a balanced measure of the classifier's performance, especially when there is an imbalance between classes.
$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

### Confusion Matrix
The confusion matrix shows the performance of a classification algorithm. It breaks down the prediction results into four categories:
* True Positives (TP): Positive samples correctly predicted as positive.
* True Negatives (TN): Negative samples correctly predicted as negative.
* False Positives (FP): Negative samples incorrectly predicted as positive (type I error).
* False Negatives (FN): Positive samples incorrectly predicted as negative (type II error).
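
All of the metrics above follow directly from these four counts. As a worked example, the snippet below plugs in the XGBoost test-set confusion matrix reported in the Results section (TN = 117, FP = 25, FN = 46, TP = 153).

```python
# Confusion matrix counts for XGBoost on the test set, as reported in the Results section.
tn, fp, fn, tp = 117, 25, 46, 153

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1)]:
    print(f"{name}: {value:.4f}")
```

This gives accuracy 0.7918, precision 0.8596, recall 0.7688 and F1-score 0.8117, which agrees with the reported test metrics up to rounding.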

## 8. Results

The EDA shows the following insights:
- 40% of the women and 65% of the men in the merged dataset have heart disease.
- Most people who have heart disease are around 58 years old.
- 84% of the people with exercise induced angina have heart disease.
- The average cholesterol value of people with heart disease is slightly higher than that of those without heart disease.
- People with heart disease have a lower maximum heart rate (135) than those without heart disease (151), possibly because they are less physically fit and exercise is more difficult for them.
- People with heart disease have a higher average oldpeak value than those without heart disease.
- Columns cholesterol, blood_pressure and oldpeak have a slight positive skew (right tail). The column max_heart_rate has a slight negative skew (left tail).

Models: \
I set the ROC AUC score as the evaluation metric in the Optuna objective function. By this metric all models perform well, and XGBoost is the best. Results from training:
- XGBoost: ROC AUC = 0.8498, F1-score = 0.7942, accuracy = 76.61%. It shows the best overall performance, especially in terms of ROC AUC and F1-score, making it the most robust model in this comparison.
- RandomForest: ROC AUC = 0.835, F1-score = 0.7648, accuracy = 74.66%. It performs similarly to XGBoost.
- SVM: ROC AUC = 0.829, F1-score = 0.76, accuracy = 74.46%. It struggles slightly more with false negatives.
- LogisticRegression: ROC AUC = 0.8352, F1-score = 0.7923, accuracy = 76.81%. It offers good performance, especially in F1-score and accuracy.
- DecisionTree: ROC AUC = 0.7594, F1-score = 0.7561, accuracy = 72.21%. It has the lowest performance, particularly in ROC AUC and accuracy, likely due to its simplicity and tendency to overfit.

Testing is a crucial step in the machine learning model evaluation process. It involves assessing the model's performance on data that it has not seen during the training phase. This helps determine how well the model generalizes to new, unseen data.\
The ROC curve helps to visualize the model's ability to discriminate between classes at various probability thresholds, and the AUC gives a numerical summary of that ability. ROC curves from testing:

![process](img/roc_curves.png)

XGBoost performance on test set:\
ROC AUC score = 0.878 \
Accuracy = 0.792 \
F1-score = 0.812 \
Recall = 0.769\
Precision = 0.859\
Confusion matrix:\
True Negative = 117 , False Positive = 25 \
False Negative = 46 , True Positive = 153

![img](img/cm.png)

### Feature importances

I use plot_importance from the XGBoost library with importance_type = 'gain'. Gain measures how much, on average, the splits on a feature improve the model's objective, i.e. how much the feature contributes to the model's predictions. The visualization shows that st_slope contributes the most to model performance.

![fi](img/xgb_fi.png)

## 9. References

1. https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
2. https://ourworldindata.org/cardiovascular-diseases
3. https://datascience.stackexchange.com/questions/64460/strategies-to-encode-categorical-variables-with-many-categories
4. https://www.analyticsvidhya.com/blog/2021/10/handling-missing-value/
5. https://scikit-learn.org/stable/common_pitfalls.html
6. https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.fbeta_score.html
7. https://xgboost.readthedocs.io/en/latest/python/sklearn_estimator.html
8. https://medium.com/@cris.lincoleo/a-quick-guide-to-hyperparameter-optimization-with-optuna-1980f1d185dc
9. https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.Study.html