# Project Proposal # 

## Introduction ##

Heart disease refers to several types of heart conditions whose manifestations range from heart attacks and arrhythmia (abnormal heartbeats) to heart failure. Although some people show symptoms, many are not diagnosed until they experience one of these events. The question we would like to address is whether we can predict the presence of heart disease using four variables as predictors: age, maximum heart rate, resting blood pressure, and cholesterol. These predictors are subject to change as we learn, later in the course, how to select the best predictors. If it is accurate enough, our model could serve as a preliminary test that indicates whether a patient needs further diagnosis before clinicians decide on a suitable treatment plan for that patient.

For our project, we chose a dataset from the UCI Machine Learning Repository, which we found on Kaggle, that contains data collected in Hungary, Cleveland, Switzerland, and Long Beach: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset



The dataset includes 14 columns:
- age
- sex
- cp: chest pain type (4 values; 0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)
- trestbps: resting blood pressure
- chol: serum cholesterol in mg/dl
- fbs: fasting blood sugar > 120 mg/dl
- restecg: resting electrocardiographic results (values 0,1,2)
- thalach: maximum heart rate achieved
- exang: exercise induced angina
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
- target: presence of heart disease, 0 = False, 1 = True


In [None]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source('tests.R')
source("cleanup.R")

In [None]:
heart_data <- read_csv(file = "heart.csv")

heart_data

In [None]:
# Count the missing (NA) values in each column of the dataset
na_check <- heart_data |>
     map_df(is.na) |>
     map_df(sum)

na_check

In [None]:
heart_data <- heart_data |>
    select(age, sex, chol, trestbps, thalach, target) |>
    rename(cholesterol = chol,
           rest_bp = trestbps,
           max_heart_rate = thalach,
           disease_present = target) |>
    # sex: 0 = female, 1 = male
    # disease_present: 0 = FALSE, 1 = TRUE; stored as a factor because
    # tidymodels classification models require a factor outcome
    mutate(sex = factor(sex, levels = c(0, 1), labels = c("female", "male")),
           disease_present = factor(disease_present, levels = c(0, 1),
                                    labels = c("FALSE", "TRUE")))

heart_data

In [None]:
set.seed(2023) # arbitrary seed so that the random split is reproducible
heart_split <- initial_split(heart_data, prop = 0.75, strata = disease_present)
heart_train <- training(heart_split)

heart_explore <- heart_train |>
    group_by(sex, disease_present) |>
    summarize(count = n(),
              avg_chol = mean(cholesterol),
              avg_age = round(mean(age)))

heart_explore

- The data contain unequal numbers of male and female patients, and the split between patients with and without heart disease also differs by sex. Summaries computed from such unbalanced groups, for example the average cholesterol in each group, can be unreliable because the smaller groups are estimated from fewer patients. Therefore, we need to preprocess the data before working with the model.
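
To make the imbalance easier to read, the counts can be turned into percentages within each sex. The cell below is a minimal sketch, not part of our final analysis: `disease_share_by_sex` and `percent_within_sex` are hypothetical names introduced for illustration, and the function assumes the columns `sex` and `disease_present` as named in `heart_train`.

```r
library(dplyr)

# Hypothetical helper: share of patients with and without heart disease
# within each sex. Assumes columns named `sex` and `disease_present`.
disease_share_by_sex <- function(data) {
    data |>
        count(sex, disease_present, name = "count") |>
        group_by(sex) |>
        mutate(percent_within_sex = 100 * count / sum(count)) |>
        ungroup()
}

# Usage with the training split created above:
# disease_share_by_sex(heart_train)
```

Within each sex the percentages sum to 100, so the two sexes can be compared directly even though they differ in size.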

In [None]:
heart_explore_counts <- heart_train |>
    group_by(disease_present) |>
    summarize(count = n(),
              percent = (n() / nrow(heart_train)) * 100)

heart_explore_counts

- The training set contains nearly equal percentages of patients with and without heart disease. This balance benefits our model, since it has a similar number of cases from each class to learn from.

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)

heart_rate_plot <- heart_train |>
    ggplot(aes(x = age, y = max_heart_rate, color = disease_present)) +
    geom_point() +
    labs(x = "Age", y = "Max Heart Rate", color = "Disease Present") +
    facet_grid(cols = vars(disease_present)) +
    theme(text = element_text(size = 15))

heart_rate_plot

- Here, we want to see whether a higher maximum heart rate is associated with a higher risk of heart disease. Based on the plot, we also tentatively hypothesize that younger and middle-aged patients with a high maximum heart rate are more prone to heart disease.

In [None]:
cholesterol_plot <- heart_train |>
    ggplot(aes(x = cholesterol, fill = disease_present)) +
    geom_histogram(bins = 30) + # 30 is the ggplot2 default, stated explicitly
    labs(x = "Cholesterol", y = "Count", fill = "Disease Present") +
    facet_grid(cols = vars(disease_present)) +
    theme(text = element_text(size = 15))

cholesterol_plot

- We wanted to see whether high cholesterol increases the risk of heart disease. According to the CDC, high cholesterol, especially the "bad" type (LDL), can build up in and block blood vessels, which can induce a heart attack. However, the histograms above do not seem to support this claim for our data. Cholesterol by itself may therefore not be an accurate predictor of the presence of heart disease.
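
A numeric summary can complement the visual comparison of the two histograms. The cell below is a rough sketch under our own assumptions: `cholesterol_summary` and its output column names are hypothetical, and the choice of median and quartiles (rather than the mean) is ours, made because a few very high cholesterol values could distort an average.

```r
library(dplyr)

# Hypothetical helper: centre and spread of cholesterol for each disease group.
# Assumes columns named `cholesterol` and `disease_present`.
cholesterol_summary <- function(data) {
    data |>
        group_by(disease_present) |>
        summarize(median_chol = median(cholesterol),
                  q25_chol = unname(quantile(cholesterol, 0.25)),
                  q75_chol = unname(quantile(cholesterol, 0.75)),
                  .groups = "drop")
}

# Usage with the training split:
# cholesterol_summary(heart_train)
```

If the medians and quartiles of the two groups largely overlap, that would be consistent with cholesterol being a weak predictor on its own.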

In [None]:
restbp_plot <- heart_train |>
    ggplot(aes(x = rest_bp, fill = disease_present)) +
    geom_histogram(bins = 30, color = "black", position = "dodge") +
    labs(x = "Resting Blood Pressure", y = "Count", fill = "Disease Present") +
    facet_grid(cols = vars(disease_present)) +
    theme(text = element_text(size = 15))

restbp_plot


- We want to see whether higher resting blood pressure is associated with a higher chance of having heart disease. In the plot above, heart disease appears relatively common among patients whose resting blood pressure is above 120 mm Hg.
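
The impression from the plot can be checked by comparing the percentage of patients with heart disease on each side of the cutoff. This is a minimal sketch: `disease_rate_by_bp`, `bp_group`, and `percent_with_disease` are hypothetical names, and the default cutoff of 120 is simply the value read off the plot, not a clinical threshold we have verified.

```r
library(dplyr)

# Hypothetical helper: percentage of patients with heart disease above and
# at or below a resting blood pressure cutoff.
# Assumes columns named `rest_bp` and `disease_present`.
disease_rate_by_bp <- function(data, cutoff = 120) {
    data |>
        mutate(bp_group = if_else(rest_bp > cutoff,
                                  "above cutoff", "at or below cutoff")) |>
        group_by(bp_group) |>
        # comparing with "TRUE" works whether disease_present is stored
        # as a logical or as a factor with the levels "FALSE" and "TRUE"
        summarize(count = n(),
                  percent_with_disease = 100 * mean(disease_present == "TRUE"),
                  .groups = "drop")
}

# Usage with the training split:
# disease_rate_by_bp(heart_train)
```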

Altogether, we plan to use four variables (age, maximum heart rate, resting blood pressure, and cholesterol) as the predictors for our model. We will preprocess the training data so that every predictor is centered and scaled, and we will tune K to avoid overfitting and underfitting.
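
The preprocessing and tuning plan above could look roughly like the following sketch. It rests on several assumptions of ours that the proposal does not fix yet: that K refers to the number of neighbors in a K-nearest neighbors classifier, that the `kknn` package is installed, and that 5-fold cross-validation over odd values of K from 1 to 21 is a reasonable search. `tune_knn` and its arguments are hypothetical names.

```r
library(tidymodels)

# Hypothetical helper, not our final analysis: estimate cross-validated
# accuracy for several values of K on the training data.
tune_knn <- function(train_data, k_values = seq(1, 21, by = 2), folds = 5) {
    # tidymodels classification models need a factor outcome
    train_data <- train_data |>
        mutate(disease_present = as.factor(disease_present))

    # scale and center the four predictors named in the proposal
    heart_recipe <- recipe(disease_present ~ age + max_heart_rate + rest_bp + cholesterol,
                           data = train_data) |>
        step_scale(all_predictors()) |>
        step_center(all_predictors())

    # K-nearest neighbors with K left open for tuning (assumed model)
    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
        set_engine("kknn") |>
        set_mode("classification")

    heart_vfold <- vfold_cv(train_data, v = folds, strata = disease_present)

    workflow() |>
        add_recipe(heart_recipe) |>
        add_model(knn_spec) |>
        tune_grid(resamples = heart_vfold, grid = tibble(neighbors = k_values)) |>
        collect_metrics() |>
        filter(.metric == "accuracy")
}

# Usage with the training split (set a seed first so the folds are reproducible):
# knn_results <- tune_knn(heart_train)
```

Because the recipe is part of the workflow, the scaling and centering values are computed only from the training portion of each fold, so the held-out portion does not influence preprocessing. A very small K tends to overfit and a very large K tends to underfit, so we would look for a K where the estimated accuracy is high and changes little for nearby values.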

## References ##

- Centers for Disease Control and Prevention. Cholesterol Myths and Facts. https://www.cdc.gov/cholesterol/myths_facts.htm