### Heart Disease Risk Prediction and Early-Stage Heart Disease Detection

In [2]:
import pandas as pd
import altair as alt
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split

# !! more imports later

### Summary

We wish to build a simple machine learning classification model that helps identify individuals at high risk of heart disease. We try two methods, logistic regression and a support vector machine (SVM) with an RBF kernel, each using 13 common features related to heart disease to make the predictions. The model returns whether a patient is predicted to develop heart disease or not, which lets us identify high-risk individuals and implement relevant prevention measures ahead of time.

We select the F2 score as our model performance metric because we care most about keeping the number of false negatives as low as possible. Our final classifier performed fairly well on an unseen test data set, with an F2 score (beta = 2) of <> and an overall accuracy of <>. Of the <num> test cases, it correctly predicted <num>. It incorrectly predicted <num> cases, all of which were false positives: predicting that a patient is prone to heart disease when in fact they are not. In our context, this kind of incorrect prediction is less harmful than a false negative. Although false positives could in theory cause a patient to undergo unnecessary treatment if the model is used as a decision tool, we expect additional decision layers to mitigate this. As such, we believe this model is, at the very least, a useful tool that helps medical professionals examine important cases more closely and follow up with them more frequently.
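
To make the choice of metric concrete, here is a minimal sketch of how the F2 score treats false negatives and false positives. The labels are made up for illustration and are not results from this analysis; the only library call assumed is scikit-learn's `fbeta_score`.

```python
from sklearn.metrics import fbeta_score

# Hypothetical labels for illustration only (1 = heart disease, 0 = no heart disease).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]

# Classifier A misses two sick patients (2 false negatives, 0 false positives).
y_pred_fn = [1, 1, 0, 0, 0, 0, 0, 0]

# Classifier B flags two healthy patients (0 false negatives, 2 false positives).
y_pred_fp = [1, 1, 1, 1, 1, 1, 0, 0]

# F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 2 weights recall more.
f2_fn = fbeta_score(y_true, y_pred_fn, beta=2)  # precision 1.0, recall 0.5
f2_fp = fbeta_score(y_true, y_pred_fp, beta=2)  # precision 2/3, recall 1.0

print(f"F2 with two false negatives: {f2_fn:.3f}")
print(f"F2 with two false positives: {f2_fp:.3f}")
```

Both toy classifiers make two mistakes, yet the one with false negatives scores noticeably lower, which is the behaviour we want from the metric here.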

### Introduction

<< HAVE TO EDIT >> According to the World Health Organization (WHO) (http://www.who.int/news-room/fact), cardiovascular diseases (CVDs) are the leading cause of death worldwide. CVDs comprise various diseases of the heart and blood vessels, such as coronary heart disease (CHD), cerebrovascular disease, and rheumatic heart disease (RHD), among others. According to the latest WHO report, available at http://www.who.int/news-room/fact and http://www.who.int/cardiovascular_diseases/en/, more than 17.7 million people are estimated to have died of CVDs in 2015, accounting for 31% of all deaths globally. Approximately 7.4 million of these deaths were due to CHD, which is also called coronary artery disease (CAD)1,2,3. In other words, CVDs, and CAD in particular, are among the deadliest diseases in both developed and developing countries, and paying attention to them is vital.

Although the CAD mortality rate is high, the chance of survival is higher if the diagnosis is made early enough. Therefore, scientists have devised predictive models to identify high-risk patients. 

Early and reliable prediction matters because traditional methods of diagnosis are quite subjective and can depend on the diagnosing physician's skill and experience (Street, Wolberg, and Mangasarian 1993). Thus, if a machine learning algorithm can accurately and effectively predict whether a patient is prone to heart disease, this could lead to less subjective and more scalable early prevention and diagnosis, which could contribute to better patient outcomes.

### Methods

#### 1. Data

< bla bla >

#### 2. Analysis

**2.1 Importing the data and preliminary EDA**

We start with some exploratory data analysis (EDA) to see what we are working with.

In [4]:
df = pd.read_csv("./data/heart.csv")
df
# EDA CODE AND MARKDOWN PART

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,59,1,1,140,221,0,1,164,1,0.0,2,0,2,1
1021,60,1,0,125,258,0,0,141,1,2.8,1,1,3,0
1022,47,1,0,110,275,0,0,118,1,1.0,1,1,2,0
1023,50,0,0,110,254,0,0,159,0,0.0,2,0,2,1


We split the data 80-20 into training and test sets, fixing `random_state` so that the split is reproducible.

In [None]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)

train_df.to_csv("./data/processed/heart_train.csv")
test_df.to_csv("./data/processed/heart_test.csv")
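
The split-and-save step can be sanity-checked with a small sketch like the one below. It uses a synthetic stand-in for the heart data, and the output folder is a temporary, hypothetical location rather than the project's `./data/processed`. It also shows two optional habits that are assumptions on our part, not part of the analysis above: creating the folder before writing, and passing `index=False` so that reloading the file does not produce an extra `Unnamed: 0` column.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data (hypothetical values, illustration only).
rng = np.random.default_rng(123)
toy_df = pd.DataFrame({
    "age": rng.integers(29, 78, size=100),
    "chol": rng.integers(126, 565, size=100),
    "target": rng.integers(0, 2, size=100),
})

toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=123)

# Compare the share of positive cases in each part of the split.
train_share = toy_train["target"].mean()
test_share = toy_test["target"].mean()
print(f"positive share: train={train_share:.2f}, test={test_share:.2f}")

# Create the output folder if it is missing, then write without the index
# so that reloading the file does not add an "Unnamed: 0" column.
out_dir = os.path.join(tempfile.mkdtemp(), "processed")  # hypothetical location
os.makedirs(out_dir, exist_ok=True)
train_path = os.path.join(out_dir, "heart_train.csv")
toy_train.to_csv(train_path, index=False)

reloaded = pd.read_csv(train_path)
```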

**2.2 Data Prep: Splitting target column and feature column cleanup**

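The test data was set aside above, before any further work, so that it cannot influence preprocessing or model selection and we do not violate the golden rule. We now separate the target column from the feature columns in both the training and test sets.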

In [None]:
X_train = train_df.drop(columns = ['target'])
y_train = train_df['target']
X_test = test_df.drop(columns = ['target'])
y_test = test_df['target']

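We can now define the preprocessing for each column type: numerical features are standardized, multi-category features are one-hot encoded, `slope` is ordinal encoded, and binary features are passed through unchanged.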

In [6]:
binary = ['sex','fbs','exang']
ohe = ['cp','restecg','thal']
numerical = ['age','trestbps','chol','thalach','oldpeak','ca']
ordinal = ['slope']

preprocessor = make_column_transformer(
 (StandardScaler(), numerical),
 (OneHotEncoder(), ohe),
 (OrdinalEncoder(), ordinal),
 ('passthrough', binary)
)

We can view the first 5 rows of the preprocessed data to get an idea of what we will feed into our model.

In [7]:
X_train_preprocessed = preprocessor.fit_transform(X_train)
# Column names must follow the order of the transformers in the preprocessor:
# numerical, one-hot encoded, ordinal, then binary (passthrough).
column_names = (
 numerical
 + preprocessor.named_transformers_['onehotencoder'].get_feature_names_out(ohe).tolist()
 + ordinal
 + binary)
X_train_preprocessed = pd.DataFrame(X_train_preprocessed, columns = column_names)
X_train_preprocessed.head(5)

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,cp_0,cp_1,cp_2,cp_3,...,restecg_1,restecg_2,thal_0,thal_1,thal_2,thal_3,slope,sex,fbs,exang
0,-0.04821,3.439983,0.659128,1.968909,-0.914049,0.217877,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,0.0,0.0
1,0.830279,-0.665467,0.621406,-2.052711,0.266165,0.217877,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
2,0.720468,0.132815,-0.265068,-0.216754,1.277777,1.173272,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,-2.134621,-0.323346,0.640267,0.264092,-0.914049,-0.737518,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,0.0,1.0
4,1.708768,-0.095265,1.394713,-1.790431,1.109175,2.128667,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0


_Table 2.2.1: Preprocessed columns_

**2.3 Model #1**

In [None]:
# code and analysis for model 1

# init model
# hyperparameter tuning and cross validate
# plot the hyperparameter outputs and select the best
# accuracy, confusion matrix, F2 score
# run on test set and get score
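
The outline above could be filled in along the following lines. This is only a sketch under stated assumptions: it assumes model 1 is the logistic regression mentioned in the Summary, it runs on synthetic data from `make_classification` rather than the heart data, and the grid of `C` values is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data standing in for the heart disease features (assumption).
X_toy, y_toy = make_classification(n_samples=300, n_features=13, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123
)

# F2 scorer: recall counts more than precision.
f2_scorer = make_scorer(fbeta_score, beta=2)

# init model
lr_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# hyperparameter tuning with 5-fold cross-validation (hypothetical grid for C)
lr_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}
lr_search = GridSearchCV(lr_pipe, lr_grid, scoring=f2_scorer, cv=5)
lr_search.fit(X_tr, y_tr)

best_C = lr_search.best_params_["logisticregression__C"]
cv_f2 = lr_search.best_score_

# run the refitted best model on the held-out data
test_f2 = fbeta_score(y_te, lr_search.predict(X_te), beta=2)
print(f"best C={best_C}, CV F2={cv_f2:.3f}, test F2={test_f2:.3f}")
```

In the real analysis the scaler would be replaced by the `preprocessor` defined earlier, so that the encoding is refit inside each cross-validation fold.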

**2.4 Model #2**

In [None]:
# code and analysis for model 2

# init model
# hyperparameter tuning and cross validate
# plot the hyperparameter outputs and select the best
# accuracy, confusion matrix, F2 score
# run on test set and get score
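
For the second model, the hyperparameter step could be sketched as below, assuming model 2 is the SVM with RBF kernel mentioned in the Summary. The data is synthetic and the `C` and `gamma` grids are hypothetical. The resulting table of mean cross-validated F2 scores is the kind of output one might plot as a heatmap when selecting the best combination.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary data standing in for the heart disease features (assumption).
X_toy, y_toy = make_classification(n_samples=300, n_features=13, random_state=123)

svm_pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical grids for the regularization strength and kernel width.
svm_grid = {
    "svc__C": [0.1, 1.0, 10.0, 100.0],
    "svc__gamma": [0.001, 0.01, 0.1, 1.0],
}
svm_search = GridSearchCV(
    svm_pipe, svm_grid, scoring=make_scorer(fbeta_score, beta=2), cv=5
)
svm_search.fit(X_toy, y_toy)

# One row per (C, gamma) pair, reshaped into a C-by-gamma table of mean CV F2 scores.
svm_results = pd.DataFrame(svm_search.cv_results_)
f2_table = svm_results.pivot(
    index="param_svc__C", columns="param_svc__gamma", values="mean_test_score"
)
print(f2_table.round(3))
print("best parameters:", svm_search.best_params_)
```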

### Results and Discussion

**Final model selection**

< select the better model and discuss the results >
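
Once a final model is chosen, the test-set figures referred to in the Summary (correct predictions, false positives, false negatives, accuracy and F2) can be read off a confusion matrix. A minimal sketch on synthetic data follows; the model and data here are placeholders, not the final classifier or its results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, fbeta_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the heart disease features (assumption).
X_toy, y_toy = make_classification(n_samples=300, n_features=13, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123
)

# Placeholder for whichever model is selected as final.
final_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = final_model.predict(X_te)

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
test_accuracy = accuracy_score(y_te, y_hat)
test_f2 = fbeta_score(y_te, y_hat, beta=2)

print(f"correct: {tp + tn} of {len(y_te)}")
print(f"false positives: {fp}, false negatives: {fn}")
print(f"accuracy: {test_accuracy:.3f}, F2: {test_f2:.3f}")
```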

### References

Alizadehsani, R., Roshanzamir, M., Abdar, M. et al. A database for using machine learning and data mining techniques for coronary artery disease diagnosis. Sci Data 6, 227 (2019). https://doi.org/10.1038/s41597-019-0206-3