## Introduction

### Background data

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
Heart failure is a common event caused by CVDs.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management, where a machine learning model can be of great help.

### Project description (overview)

The input to our predictor is a medical dataset containing 12 features that can be used to predict mortality from heart failure. The project proceeds in four steps:
1. Data exploration
    - Apply Principal Component Analysis (PCA) to reduce the feature dimensionality and view the distribution of the input data.
    - Build a preliminary linear SVM model that incorporates all the features to see the baseline model performance.
2. Feature selection
    - Run a chi-square test to check the association between each categorical feature and the target death event.
    - Plot a heat map to identify the features with a high correlation coefficient with the death event.
    - Visualize each feature's contribution in the SVM model.
    - Compare the features returned by these methods and determine the final selected features.
3. Model comparison and hyperparameter tuning
    - Compare the performance of different preprocessing methods: MinMaxScaler, StandardScaler, RobustScaler.
    - Compare the performance under k-fold cross-validation and leave-one-out validation.
    - Compare the kernels available in the Support Vector Machine (linear or RBF).
    - Run a grid search to find the best-performing model.
4. Selected model performance
    - Calculate the precision, recall, accuracy and F1-score.
    - Plot the ROC and PR curves.
    - Plot the learning curve.
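
As a rough illustration of the metrics listed in step 4, the sketch below computes precision, recall, accuracy and F1-score from scratch for a binary target. The function name `classification_metrics` and the toy labels are made up for this example and are not taken from the project's dataset or code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, accuracy and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    # Count the cells of the confusion matrix.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against empty denominators by falling back to 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Toy labels (illustrative only): 1 = death event, 0 = survived.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With an imbalanced target, accuracy alone can look good while recall on the death events stays poor, which is one reason to report all four values together.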

---

## Related work

### Explorative data analysis (EDA) approach

1. [heart-fail-analysis-and-quick-prediction](https://www.kaggle.com/nayansakhiya/heart-fail-analysis-and-quick-prediction)

**Strength**: Detailed explorative and associative data analysis with great data visualization: each factor is visualized with different types of figures.
**Weakness**: The prediction models are quite rudimentary; the author did not select features or tune the models' hyperparameters.

### Predictive data analysis (PDA) approach

1. [heart-failure-model-prediction-comparisons-95](https://www.kaggle.com/rude009/heart-failure-model-prediction-comparisons-95)
**Strength**: The author compares six prediction models with feature selection. The Extra Gradient Booster Classifier reaches an accuracy of up to 95.0%.
**Weakness**: The author treats the "time" column as a useful feature. I disagree: "time" is the follow-up period (days), so it cannot itself contribute to the disease. I therefore treat this feature as useless in our prediction model.

2. [heart-failure-prediction-auc-0-98](https://www.kaggle.com/ksvmuralidhar/heart-failure-prediction-auc-0-98)
**Strength**: The author uses a new method, the chi-square test, to find the association between each single categorical feature and the target death_event.
**Weakness**: The visualization part is not as fancy as in the previous work.
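
To make the chi-square idea concrete, here is a minimal sketch of a Pearson chi-square test of independence between one binary feature and a binary target. It is not the referenced notebook's code. The function name `chi_square_2x2` and the toy data are assumptions for illustration, both columns are assumed to be coded 0/1 with every value occurring at least once, and no continuity correction is applied, so results may differ from library routines that apply one by default.

```python
import math


def chi_square_2x2(feature, target):
    """Pearson chi-square test for two binary (0/1) columns, 1 degree of freedom."""
    # Build the 2x2 table of observed counts: observed[feature][target].
    observed = [[0, 0], [0, 0]]
    for f, t in zip(feature, target):
        observed[f][t] += 1

    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

    # Sum (observed - expected)^2 / expected over the four cells.
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the upper tail is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data (illustrative only): 20 patients with the feature, 20 without.
feature = [1] * 20 + [0] * 20
target = [1] * 15 + [0] * 5 + [1] * 5 + [0] * 15
chi2, p_value = chi_square_2x2(feature, target)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the feature and the target are not independent, which makes the feature a candidate to keep; it does not by itself say how strong or how useful the association is for prediction.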