**Title:**
Prediction of Heart Disease Diagnosis using Hungarian Heart Disease Dataset


**Introduction:**

Heart disease is a major contributor to global morbidity and mortality (Dai et al., 2021). In the United States of America, one person dies every 34 seconds from heart disease (Centers for Disease Control and Prevention, 2022). Many people who have heart disease do not show any physical symptoms and, as a result, are not diagnosed (Jin, 2014). In turn, these individuals do not take medications that could slow the progression of the disease. Thus, predictive models are needed to help diagnose patients, especially those who are asymptomatic, so that clinicians can intervene before the disease progresses.

In this project, we will examine the Heart Disease Data Set from the University of California, Irvine's Machine Learning Repository. The data were collected in three countries: the United States, Switzerland, and Hungary. The dataset is composed of 14 variables: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels, thalassemia, and the diagnosis of heart disease. All 14 variables are recorded as numbers, although several (such as sex and chest pain type) are numerically coded categories.

The main goal of this project is to use the variables from the dataset to predict whether or not a patient has heart disease. In the Methods section, we discuss which variables we will examine.

**Preliminary exploratory data analysis:**

Demonstrate that the dataset can be read from the web into R 
Clean and wrangle your data into a tidy format
Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 

-> Combine the regions together
-> Standardize the predictors using statistics from the training set only, then apply the same transformation to the testing set
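
The reading, tidying, and summarizing steps above could be sketched as follows. This is a minimal sketch under several assumptions: the URL stored in `hungarian_url` is our best guess at where the processed Hungarian file lives and should be checked against the repository; the helper `read_heart()` is a hypothetical name; treating `num > 0` as "disease" is a convention we chose; and the four data rows are made up in the file's format so the example runs without a network connection.

```r
library(dplyr)

# Column names as listed in the UCI documentation for the processed files.
heart_cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# ASSUMPTION: location of the processed Hungarian file; verify before use.
hungarian_url <- paste0(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/",
  "heart-disease/processed.hungarian.data"
)

# Hypothetical helper: read one regional file (URL or local path).
# "?" marks a missing value in the raw files.
read_heart <- function(path) {
  read.csv(path, header = FALSE, col.names = heart_cols, na.strings = "?") %>%
    # ASSUMPTION: any non-zero value of num is treated as heart disease.
    mutate(diagnosis = factor(ifelse(num > 0, "disease", "healthy"),
                              levels = c("healthy", "disease"))) %>%
    select(age, sex, trestbps, chol, fbs, diagnosis)
}

# Made-up rows in the same format, written to a temporary file so that
# the sketch does not depend on network access.
sample_file <- tempfile(fileext = ".data")
writeLines(c(
  "40,1,2,140,289,0,0,172,0,0,?,?,?,0",
  "49,0,3,160,180,0,0,156,0,1,2,?,?,1",
  "37,1,2,130,?,0,1,98,0,0,?,?,?,0",
  "48,0,4,138,214,0,0,108,1,1.5,2,?,?,3"
), sample_file)

heart <- read_heart(sample_file)  # real use: read_heart(hungarian_url)

# Flag rows with at least one missing value among the kept columns.
heart$has_missing <- !complete.cases(heart)

# One row per class: counts, predictor means, and rows with missing data.
summary_table <- heart %>%
  group_by(diagnosis) %>%
  summarise(
    n = n(),
    mean_age = mean(age, na.rm = TRUE),
    mean_trestbps = mean(trestbps, na.rm = TRUE),
    mean_chol = mean(chol, na.rm = TRUE),
    rows_with_missing = sum(has_missing)
  )

print(summary_table)
```

To combine regions, the same helper could be applied to each regional file and the results stacked with `bind_rows()`. In the actual analysis the summary table would be computed on the training set only.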

Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

-> Histograms to show the distribution of each variable we choose
-> Pairwise scatterplots for all combinations of the chosen variables
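
The histogram idea above could look roughly like the sketch below. The data frame `train` is synthetic (randomly generated, not the real training set), and its column names simply mirror the variables we plan to use; the plot settings are illustrative choices.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)

# SYNTHETIC stand-in for the training set; values are randomly generated.
n_obs <- 150
train <- data.frame(
  age = round(rnorm(n_obs, mean = 50, sd = 8)),
  trestbps = round(rnorm(n_obs, mean = 130, sd = 15)),
  chol = round(rnorm(n_obs, mean = 240, sd = 40)),
  diagnosis = factor(sample(c("healthy", "disease"), n_obs, replace = TRUE))
)

# Reshape to long format so each predictor gets its own panel.
train_long <- train %>%
  pivot_longer(cols = c(age, trestbps, chol),
               names_to = "variable", values_to = "value")

# One histogram per predictor, with the two classes overlaid.
hist_plot <- ggplot(train_long, aes(x = value, fill = diagnosis)) +
  geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
  facet_wrap(~ variable, scales = "free_x") +
  labs(x = "Value", y = "Number of patients", fill = "Diagnosis")

print(hist_plot)
```

Free x-axis scales are used because the predictors are measured in different units.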


**Methods:**

Explain how you will conduct your data analysis and which variables/columns you will use. Note: you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable, think: is this a useful variable for prediction?
Describe at least one way that you will visualize the results.

From the 14 variables, we will use only five as predictors: age, sex, resting blood pressure, serum cholesterol, and fasting blood sugar. We chose these because the risk factors for heart disease listed by the Centers for Disease Control and Prevention include age, sex, high blood pressure (which we will measure using the variable trestbps), high low-density lipoprotein (LDL) cholesterol (which we will approximate using the variable chol, serum cholesterol), and diabetes (which we will measure using the variable fbs, which indicates whether a patient's fasting blood sugar is high enough to suggest diabetes) (Centers for Disease Control and Prevention, 2022).

We will split the data into a training set (75% of the data) and a testing set (25%). To decide which observations go into each set, we will shuffle the data and stratify by diagnosis so that the two subsets have roughly equal proportions of each label. We will perform the classification with the k-nearest neighbours algorithm using the "tidymodels" package.
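
A stratified 75/25 split of this kind could be sketched with `initial_split()` from tidymodels. The data frame `heart` below is synthetic (randomly generated) and stands in for the cleaned data; the column name `diagnosis` and the seed are our own choices.

```r
library(tidymodels)

set.seed(2022)  # fixes the shuffle so the split is reproducible

# SYNTHETIC stand-in for the cleaned heart data.
n_obs <- 200
heart <- tibble(
  age = round(rnorm(n_obs, mean = 50, sd = 8)),
  sex = rbinom(n_obs, size = 1, prob = 0.7),
  trestbps = round(rnorm(n_obs, mean = 130, sd = 15)),
  chol = round(rnorm(n_obs, mean = 240, sd = 40)),
  fbs = rbinom(n_obs, size = 1, prob = 0.15),
  diagnosis = factor(sample(c("healthy", "disease"), n_obs,
                            replace = TRUE, prob = c(0.6, 0.4)))
)

# 75% training, 25% testing, stratified by the label.
heart_split <- initial_split(heart, prop = 0.75, strata = diagnosis)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

# Class proportions should be similar in the two subsets.
train_props <- prop.table(table(heart_train$diagnosis))
test_props <- prop.table(table(heart_test$diagnosis))

print(round(train_props, 2))
print(round(test_props, 2))
```

Stratifying on `diagnosis` keeps the share of each label roughly the same in both subsets, which matters most when one class is much rarer than the other.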

To get a better estimate of the accuracy of our k-nearest neighbours classifiers, we will use cross-validation: we will split the training data into several folds and, in turn, hold out each fold as a validation set while training on the remaining folds. We will compute the accuracy on each validation set and average these accuracies across folds. This procedure will help us pick the K that maximizes the estimated validation accuracy. After we have decided on a K, we will retrain the model on the full training set and evaluate it on the test set.
After testing, if the model is not accurate or not a good fit, we may decide to examine other variables from the dataset.
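
The tuning procedure described above might be sketched as follows. Everything specific here is an assumption made for illustration: the data are synthetic, the number of folds (5), the grid of candidate K values, and the seed are arbitrary choices, and the "kknn" engine must be installed. Because the recipe is bundled in a workflow, the centring and scaling statistics are estimated from training data only and then applied to held-out data.

```r
library(tidymodels)

set.seed(2022)

# SYNTHETIC stand-in for the cleaned data; diagnosis depends loosely on age.
n_obs <- 300
sim_age <- round(rnorm(n_obs, mean = 50, sd = 8))
heart <- tibble(
  age = sim_age,
  sex = rbinom(n_obs, size = 1, prob = 0.7),
  trestbps = round(rnorm(n_obs, mean = 130, sd = 15)),
  chol = round(rnorm(n_obs, mean = 240, sd = 40)),
  fbs = rbinom(n_obs, size = 1, prob = 0.15),
  diagnosis = factor(ifelse(sim_age + rnorm(n_obs, 0, 8) > 50,
                            "disease", "healthy"))
)

heart_split <- initial_split(heart, prop = 0.75, strata = diagnosis)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

# Standardize predictors; statistics are learned from training data only.
heart_recipe <- recipe(diagnosis ~ age + sex + trestbps + chol + fbs,
                       data = heart_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# K-nearest neighbours with the number of neighbours left to be tuned.
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# 5-fold cross-validation on the training set (fold count is a choice).
heart_vfold <- vfold_cv(heart_train, v = 5, strata = diagnosis)
k_grid <- tibble(neighbors = seq(1, 21, by = 4))

cv_results <- workflow() %>%
  add_recipe(heart_recipe) %>%
  add_model(knn_tune) %>%
  tune_grid(resamples = heart_vfold, grid = k_grid) %>%
  collect_metrics() %>%
  filter(.metric == "accuracy")

# Pick the K with the highest mean validation accuracy.
best_k <- cv_results %>%
  arrange(desc(mean)) %>%
  slice(1) %>%
  pull(neighbors)

# Refit on the full training set with the chosen K, then score the test set.
knn_final <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) %>%
  set_engine("kknn") %>%
  set_mode("classification")

final_fit <- workflow() %>%
  add_recipe(heart_recipe) %>%
  add_model(knn_final) %>%
  fit(data = heart_train)

test_accuracy <- predict(final_fit, heart_test) %>%
  bind_cols(heart_test) %>%
  accuracy(truth = diagnosis, estimate = .pred_class)

print(best_k)
print(test_accuracy)
```

The test set is touched only once, after K has been chosen, so the reported test accuracy is not biased by the tuning step.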

To visualize our results, we will create pairwise scatter plots of the predictor variables. In these plots, we will mark each testing-set observation with its actual and predicted label so that we can examine the accuracy of the model visually.
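
One panel of such a plot could be drawn as in the sketch below. The data frame `test_predictions` is a hypothetical, randomly generated stand-in for the testing set with the model's predictions attached (the column `.pred_class` mimics the name tidymodels gives to predicted labels); about 20% of the labels are flipped at random purely to give the plot some errors to show.

```r
library(ggplot2)

set.seed(3)

# HYPOTHETICAL stand-in for the test set with predictions attached.
n_obs <- 60
test_predictions <- data.frame(
  age = round(rnorm(n_obs, mean = 50, sd = 8)),
  chol = round(rnorm(n_obs, mean = 240, sd = 40)),
  diagnosis = factor(sample(c("healthy", "disease"), n_obs, replace = TRUE))
)

# Fake predictions: copy the true label, then flip roughly 20% of them.
flip <- runif(n_obs) < 0.2
pred <- as.character(test_predictions$diagnosis)
pred[flip] <- ifelse(pred[flip] == "disease", "healthy", "disease")
test_predictions$.pred_class <- factor(
  pred, levels = levels(test_predictions$diagnosis)
)

# Share of observations where prediction and truth agree.
test_predictions$correct <-
  test_predictions$diagnosis == test_predictions$.pred_class
observed_accuracy <- mean(test_predictions$correct)

# Colour shows the predicted label, shape shows the actual label,
# so misclassified points are those whose colour and shape disagree.
results_plot <- ggplot(test_predictions,
                       aes(x = age, y = chol,
                           colour = .pred_class, shape = diagnosis)) +
  geom_point(size = 2.5, alpha = 0.8) +
  labs(x = "Age (years)", y = "Serum cholesterol (mg/dl)",
       colour = "Predicted", shape = "Actual")

print(results_plot)
```

Repeating this for each pair of predictors would give the full set of pairwise plots.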

**Expected outcomes and significance:**

What do you expect to find?

We expect to be able to build a model that predicts whether or not a patient has heart disease based on the five predictor variables we selected.

What impact could such findings have?

Because many people have asymptomatic heart disease, they often go undiagnosed and, in turn, untreated, which can affect their quality of life. Our predictive model could help identify patients with heart disease whether or not they show symptoms. It could also give healthcare providers an additional tool to use in their practice. Diagnosing the disease before it becomes severe allows measures to be put in place to prevent further progression, which could lower costs for both patients and the healthcare system.

What future questions could this lead to?

One future question this project could lead to is whether we can predict if a patient is likely to develop heart disease, as opposed to whether they currently have it.

**References:**

1. Centers for Disease Control and Prevention. (2022, October 14). Heart disease facts. Centers for Disease Control and Prevention. Retrieved October 17, 2022, from https://www.cdc.gov/heartdisease/facts.htm 
2. Centers for Disease Control and Prevention. (2022, September 8). Heart disease and stroke. Centers for Disease Control and Prevention. Retrieved October 18, 2022, from https://www.cdc.gov/chronicdisease/resources/publications/factsheets/heart-disease-stroke.htm 
3. Dai, H., Bragazzi, N. L., Younis, A., Zhong, W., Liu, X., Wu, J., & Grossman, E. (2021). Worldwide trends in prevalence, mortality, and disability-adjusted life years for hypertensive heart disease from 1990 to 2017. Hypertension, 77(4), 1223-1233.
4. Jin, J. (2014). Testing for “silent” coronary heart disease. JAMA, 312(8), 858. https://doi.org/10.1001/jama.2014.9191

