**Title:**
Prediction Model for Heart Disease Diagnosis using Heart Disease Dataset from University of California, Irvine


**Introduction:**

Heart disease is a major contributor to global morbidity and mortality (Dai et al., 2021). In the United States, one person dies every 34 seconds from heart disease (Centers for Disease Control and Prevention, 2022). Many people with heart disease show no physical symptoms and therefore go undiagnosed (Jin, 2014). As a result, these individuals do not take medications that could slow the progression of the disease. Predictive models are therefore needed to help diagnose patients, especially those who are asymptomatic, so that clinicians can intervene before the disease progresses.

In this project, we examine the Heart Disease dataset from the University of California, Irvine's Machine Learning Repository. The data were collected in three countries: the United States, Switzerland, and Hungary. The dataset contains 14 variables: age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, oldpeak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels, thalassemia, and the diagnosis of heart disease. All 14 variables are recorded as numeric codes, although several (for example sex, chest pain type, and fasting blood sugar) are categorical.

The main goal of this project is to use variables from the dataset to predict whether a patient has heart disease. The Methods section discusses which variables we will examine.

**Preliminary exploratory data analysis:**



We obtained the heart disease data from the UCI ICS website, as shown in the code below. The data were originally collected at four sites: Hungary, Switzerland, Cleveland, and Long Beach. Because the original files do not include column names, we added them based on the documentation. We then combined all four sites into a single "heart_data" dataset for our project, using rbind since every file has the same 14 variables.

In [4]:
library(tidyverse)

# Column names taken from the dataset documentation (the raw files have no header row)
heart_cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
base_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/"

hungary <- read_csv(paste0(base_url, "processed.hungarian.data"), col_names = FALSE)
colnames(hungary) <- heart_cols

switzerland <- read_csv(paste0(base_url, "processed.switzerland.data"), col_names = FALSE)
colnames(switzerland) <- heart_cols

cleveland <- read_csv(paste0(base_url, "processed.cleveland.data"), col_names = FALSE)
colnames(cleveland) <- heart_cols

va <- read_csv(paste0(base_url, "processed.va.data"), col_names = FALSE)
colnames(va) <- heart_cols

# Stack the four regional data sets; all share the same 14 columns
heart_data <- rbind(hungary, switzerland, cleveland, va)


Rows: 294 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): X4, X5, X6, X7, X8, X9, X11, X12, X13
dbl (5): X1, X2, X3, X10, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 123 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): X4, X6, X7, X8, X9, X10, X11, X12, X13
dbl (5): X1, X2, X3, X5, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 303 Columns: 14
──

Of the 14 variables, we will examine only 6: age, sex, blood pressure, cholesterol, fasting blood sugar, and diagnosis of heart disease.

We chose these because the risk factors associated with heart disease, according to the Centers for Disease Control and Prevention (2022), are:

-*age* (in years)
 
-*sex* (0 = Female, 1 = Male)

-high blood pressure (measured using the variable *trestbps* from the data set, in mmHg)

-high cholesterol (the variable *chol* from our data set, in mg/dl)

-diabetes (the variable *fbs* from our data set, which records whether fasting blood sugar exceeds 120 mg/dl: 0 = False/No Diabetes, 1 = True/Diabetes)


The original dataset records the diagnosis of heart disease in the "num" variable.
According to the original dataset's documentation, num ≥ 1 represents a diagnosis of heart disease and num = 0 represents no diagnosis, with values above 0 distinguishing severity levels. Since we are only interested in whether there was a diagnosis, we recoded num as a binary label: any num greater than or equal to 1 became 1, indicating a diagnosis of heart disease, and 0 was left unchanged.

In [9]:
# Keep the six variables of interest and collapse severity levels 1-4
# into a single "heart disease" label (1); 0 stays "no heart disease"
heart_data_cleaned <- heart_data %>%
  select(age, sex, trestbps, chol, fbs, num) %>%
  mutate(num = if_else(num >= 1, 1, 0))

Below, we created a summary table of our 6 variables where we show the number of observations, mean or proportion (for variables with levels), number of missing data, and the range or levels for each variable.

In [11]:
# Convert the columns that contain "?" to numeric once (each conversion warns about NAs)
trestbps_num <- as.numeric(heart_data_cleaned$trestbps)
chol_num <- as.numeric(heart_data_cleaned$chol)

# Ranges are computed on numeric values; range() on a character column
# would compare strings alphabetically and give misleading limits
fmt_range <- function(x) paste(range(x, na.rm = TRUE), collapse = "-")

n_obs <- nrow(heart_data_cleaned)
tab <- matrix(c(n_obs, round(mean(heart_data_cleaned$age), digits = 2), sum(heart_data_cleaned$age == "?"), fmt_range(heart_data_cleaned$age),
                n_obs, "21.09% Female", sum(heart_data_cleaned$sex == "?"), "0 = Female, 1 = Male",
                n_obs, round(mean(trestbps_num, na.rm = TRUE), digits = 2), sum(heart_data_cleaned$trestbps == "?"), fmt_range(trestbps_num),
                n_obs, round(mean(chol_num, na.rm = TRUE), digits = 2), sum(heart_data_cleaned$chol == "?"), fmt_range(chol_num),
                n_obs, "75.22% False", sum(heart_data_cleaned$fbs == "?"), "0 = False, 1 = True",
                n_obs, "44.67% No Heart Disease", sum(is.na(heart_data_cleaned$num)), "0 = No Heart Disease, 1 = Heart Disease"),
              ncol = 4, byrow = TRUE)
colnames(tab) <- c('Number of Observations', 'Mean or Proportion', 'Number of Missing Data', 'Range or Levels')
rownames(tab) <- c('Age (years)', 'Sex', 'Resting Blood Pressure (mmHg)', 'Cholesterol (mg/dl)', 'Fasting Blood Sugar > 120 mg/dl', 'Heart Disease Diagnosis')
tab <- as.table(tab)
tab

“NAs introduced by coercion”
“NAs introduced by coercion”


                                Number of Observations Mean or Proportion      Number of Missing Data
Age (years)                     920                    53.51                   0
Sex                             920                    21.09% Female           0
Resting Blood Pressure (mmHg)   920                    132.13                  59
Cholesterol (mg/dl)             920                    199.13                  30
Fasting Blood Sugar > 120 mg/dl 920                    75.22% False            90
Heart Disease Diagnosis         920                    44.67% No Heart Disease 0
                                Range or Levels               

**Methods:**

Of the 14 variables, we will examine only 6: age, sex, blood pressure, cholesterol, fasting blood sugar, and heart disease diagnosis. We chose these because the risk factors associated with heart disease, according to the Centers for Disease Control and Prevention (2022), are age, sex, high blood pressure (which we will measure using the variable trestbps from the data set), high low-density lipoprotein (LDL) cholesterol (which we will approximate using the serum cholesterol variable chol from the data set), and diabetes (which we will measure using the variable fbs from the data set, which tells us whether someone has a high fasting blood sugar level indicative of diabetes).

We will split the data into a training set (75% of the observations) and a testing set (25%). To decide which observations go into each set, we will shuffle the data and stratify by the diagnosis label so that the two subsets have roughly equal proportions of each label. We will then fit a k-nearest neighbour classifier using the "tidymodels" package.
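The split described above could look roughly like the following sketch. It uses `initial_split()` from the rsample package (part of tidymodels). The toy data frame `toy_heart`, the seed, and the class counts are illustrative assumptions standing in for the real `heart_data_cleaned`, not values from the project.

```r
library(rsample)

set.seed(2022)  # arbitrary seed (assumption) so the shuffle is reproducible

# Hypothetical stand-in for heart_data_cleaned: 200 synthetic patients
toy_heart <- data.frame(
  age      = round(runif(200, 28, 77)),
  trestbps = round(rnorm(200, 130, 15)),
  chol     = round(rnorm(200, 200, 40)),
  num      = factor(rep(c(0, 1), times = c(90, 110)))
)

# 75% training / 25% testing, stratified on the diagnosis label
heart_split <- initial_split(toy_heart, prop = 0.75, strata = num)
heart_train <- training(heart_split)
heart_test  <- testing(heart_split)

# The label proportions should be similar in both subsets
prop.table(table(heart_train$num))
prop.table(table(heart_test$num))
```

Stratifying on `num` keeps the share of diagnosed patients close to the overall share in both subsets, which matters when the two labels are not perfectly balanced.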

To get a better estimate of the accuracy of our k-nearest neighbour classifiers, we will use cross-validation: the training data are split into several folds, and each fold serves once as the validation set while the model is fit on the remaining folds. We then average the validation accuracies across folds. This procedure will help us pick the K that maximizes validation accuracy. After deciding on a K, we will evaluate the model built with the training set on the test set.
If the model is not accurate or is a poor fit, we may decide to examine other variables from the dataset.
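As a rough sketch of how K might be tuned with tidymodels, the example below runs 5-fold cross-validation over a small grid of odd K values. The toy training set `toy_train`, the seed, the number of folds, the grid, and the decision to standardize the predictors are all assumptions made for illustration; the real analysis may choose differently. It also assumes the kknn package is installed, since that is the engine behind `nearest_neighbor()`.

```r
library(tidymodels)

set.seed(2022)  # arbitrary seed (assumption)

# Hypothetical stand-in for the training set
toy_train <- tibble(
  age      = rnorm(150, 54, 9),
  trestbps = rnorm(150, 132, 18),
  chol     = rnorm(150, 200, 50),
  num      = factor(rep(c("0", "1"), length.out = 150))
)

# Standardize predictors so that no variable dominates the distance (assumption)
heart_recipe <- recipe(num ~ ., data = toy_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# K is left as a tuning parameter
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

folds  <- vfold_cv(toy_train, v = 5, strata = num)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

# Mean validation accuracy for each candidate K
cv_results <- workflow() %>%
  add_recipe(heart_recipe) %>%
  add_model(knn_spec) %>%
  tune_grid(resamples = folds, grid = k_grid) %>%
  collect_metrics() %>%
  filter(.metric == "accuracy")

# K with the highest mean validation accuracy
best_k <- cv_results %>%
  slice_max(mean, n = 1, with_ties = FALSE) %>%
  pull(neighbors)
best_k
```

Because the toy labels are unrelated to the predictors, the accuracies here will hover around chance; the point is only the shape of the workflow.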

To visualize our results, we will create pairwise scatter plots of the predictor variables. In these plots, we will colour the points on the testing set by their actual and predicted labels to examine the accuracy of the model visually.
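One possible way to draw a single such scatter plot with ggplot2 is sketched below. The data frame `toy_preds` is entirely synthetic, and the choice of age versus cholesterol, colour for the actual label, and shape for whether the prediction was correct are illustrative assumptions. The column name `.pred_class` mirrors what `predict()` returns for classification models in tidymodels.

```r
library(ggplot2)

set.seed(2022)  # arbitrary seed (assumption)

# Hypothetical test-set results: actual labels (num) and predicted labels (.pred_class)
toy_preds <- data.frame(
  age         = round(runif(60, 28, 77)),
  chol        = round(rnorm(60, 200, 40)),
  num         = factor(sample(c(0, 1), 60, replace = TRUE)),
  .pred_class = factor(sample(c(0, 1), 60, replace = TRUE))
)

# Flag whether each prediction matches the actual diagnosis
toy_preds$correct <- toy_preds$num == toy_preds$.pred_class

pred_plot <- ggplot(toy_preds, aes(x = age, y = chol, colour = num, shape = correct)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(x = "Age (years)", y = "Cholesterol (mg/dl)",
       colour = "Actual diagnosis", shape = "Predicted correctly")

pred_plot
```

Repeating this for each pair of predictors would give the full set of pairwise plots.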

**Expected outcomes and significance:**

**What do you expect to find?**

We expect to be able to create a model that predicts whether a patient has heart disease based on the five predictor variables we selected (age, sex, resting blood pressure, cholesterol, and fasting blood sugar).

**What impact could such findings have?**

Because many people have asymptomatic heart disease, they often go undiagnosed and untreated, which can affect their quality of life. Our predictive model could help diagnose patients with or without symptoms and could serve as an additional tool for healthcare providers in their practice. It could also lower costs for patients and the healthcare system: diagnosing the disease before it becomes severe allows measures to be put in place that prevent further, and potentially costly, progression.

**What future questions could this lead to?**

One future question this project could lead to is whether we can predict if a patient is likely to develop heart disease, as opposed to whether they currently have it.

**References:**

1. Centers for Disease Control and Prevention. (2022, October 14). Heart disease facts. Centers for Disease Control and Prevention. Retrieved October 17, 2022, from https://www.cdc.gov/heartdisease/facts.htm 
2. Centers for Disease Control and Prevention. (2022, September 8). Heart disease and stroke. Centers for Disease Control and Prevention. Retrieved October 18, 2022, from https://www.cdc.gov/chronicdisease/resources/publications/factsheets/heart-disease-stroke.htm 
3. Dai, H., Bragazzi, N. L., Younis, A., Zhong, W., Liu, X., Wu, J., & Grossman, E. (2021). Worldwide trends in prevalence, mortality, and disability-adjusted life years for hypertensive heart disease from 1990 to 2017. Hypertension, 77(4), 1223-1233.
4. Jin, J. (2014). Testing for "silent" coronary heart disease. JAMA, 312(8), 858. https://doi.org/10.1001/jama.2014.9191

