# Project Proposal # 

## Introduction ##

Heart disease refers to a range of heart conditions whose manifestations include heart attack, arrhythmia (abnormal heartbeat), and heart failure. Although some people show symptoms, many are not diagnosed until they experience one of these conditions. The question we would like to address is whether we can predict the presence of heart disease using four variables as predictors: age, maximum heart rate, resting blood pressure, and cholesterol. (These predictors are subject to change as we learn, later in the course, how to identify the best predictors.) An accurate model could serve as a preliminary test to determine whether a patient needs further diagnosis before clinicians decide on a suitable treatment plan. For our project, we chose a dataset from the UCI Machine Learning Repository, which we found on Kaggle, containing data collected in Hungary, Cleveland, Switzerland, and Long Beach: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset


The dataset includes 14 columns:
- age
- sex
- cp: chest pain type (4 values; 0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)
- trestbps: resting blood pressure
- chol: serum cholesterol in mg/dl
- fbs: fasting blood sugar > 120 mg/dl
- restecg: resting electrocardiographic results (values 0,1,2)
- thalach: maximum heart rate achieved
- exang: exercise induced angina
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
- target: presence of heart disease, 0 = False, 1 = True


The four variables we chose are resting blood pressure, age, maximum heart rate, and cholesterol. We chose blood pressure because high blood pressure is one of the biggest risk factors for heart disease, as stated on the CDC website. Blood pressure is the pressure of blood pushing through the arteries; when it is high, it increases the workload of the heart and makes the heart muscle stiffer (Grey, 2021). We chose cholesterol because, according to the CDC, high cholesterol, especially the "bad" type, builds up in and blocks the blood vessels, which can induce a heart attack. We chose maximum heart rate because a high heart rate is associated with a higher risk of mortality and heart disease (Perret-Guillaume et al., 2009). As Perret-Guillaume et al. state, “...the risk associated with accelerated heart rate is not only statistically significant but also clinically relevant and that it should be taken into account in the evaluation of the patients.” We chose age because, according to the CDC, the risk of developing heart disease increases with age: fatty deposits build up in the arteries, narrowing blood flow, and blood pressure tends to rise.


We settled on these four predictors after reading external resources that pinpoint the main contributors to heart disease. This research indicated that these factors substantially increase the risk of heart disease and are among the most important to consider.



In [None]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(broom)
options(repr.matrix.max.rows = 6)

In [None]:
#reading data
heart_data <- read_csv(file = "heart.csv")

heart_data

## Summarizing the Training Data ##

- To begin, we use <code>map_df(is.na)</code> to check whether there are any missing (NA) values in our dataset.

In [None]:
#Check to see if there is any n/a value in the dataset
na_check <- heart_data |>
     map_df(is.na) |>
     map_df(sum)

na_check
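
As a hedged alternative to the check above, the same per-column count of missing values can be sketched with `across()`. The `toy` tibble below is hypothetical stand-in data, not the real `heart_data`.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for heart_data (values are made up)
toy <- tibble(
  age  = c(52, NA, 61),
  chol = c(212, 203, NA),
  sex  = c(1, 0, 1)
)

# Count the missing values in every column
na_counts <- toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))

na_counts
```

A column with a count of zero has no missing values, so no imputation or row removal would be needed for it.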

- We select the <code>"age"</code>, <code>"sex"</code>, <code>"chol"</code>, <code>"trestbps"</code>, <code>"thalach"</code>, and <code>"target"</code> columns that we are going to use.
- We rename those columns to give them more descriptive names.
- The dataset encodes sex numerically, so we convert it to a categorical variable in which "1" represents "male" and "0" represents "female".
- Likewise, we convert the disease indicator from a numerical value to a categorical one in which "0" represents "false" and "1" represents "true".

In [None]:
heart_data  <- heart_data |>
    mutate(sex = as_factor(sex), target = as_factor(target)) |>
    select(age, sex, chol, trestbps, thalach, target)

colnames(heart_data) <- c("age", "sex", "cholesterol", "rest_bp", "max_heart_rate", "disease_present")

# changes 1 to male and 0 to female
levels(heart_data$sex) <- c("female", "male")
# changes disease_present to true/false (0 = false, 1 = true)
levels(heart_data$disease_present) <- c(FALSE, TRUE)

heart_data
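
The wrangling steps can also be sketched with `rename()` and explicit factor labels, which spell out the 0/1 mapping instead of relying on the order of the factor levels. This is only an illustrative variant; the `toy` tibble is hypothetical stand-in data with the same column names as the dataset.

```r
library(dplyr)
library(tibble)

# Hypothetical rows using the dataset's original column names (values are made up)
toy <- tibble(
  age = c(52, 53, 70), sex = c(1, 0, 1), chol = c(212, 203, 174),
  trestbps = c(125, 140, 145), thalach = c(168, 155, 125), target = c(0, 1, 0)
)

toy_clean <- toy |>
  select(age, sex, chol, trestbps, thalach, target) |>
  # give the columns descriptive names
  rename(cholesterol = chol, rest_bp = trestbps,
         max_heart_rate = thalach, disease_present = target) |>
  # map each numeric code to its label explicitly
  mutate(
    sex = factor(sex, levels = c(0, 1), labels = c("female", "male")),
    disease_present = factor(disease_present, levels = c(0, 1),
                             labels = c("FALSE", "TRUE"))
  )

toy_clean
```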

- After the wrangling step, our data contain **1025** observations.
- We use a random split with a proportion of **0.75**: 75% of the data go into our training set, and the remaining 25% go into the testing set, which we use to estimate the accuracy of our model's predictions.
- We call <code>set.seed()</code> to make our results reproducible and use <code>initial_split()</code> to randomly sample the rows of our data frame.

In [None]:
set.seed(2106)
# split the data into a training set and a testing set
heart_split <- initial_split(heart_data, prop = 0.75, strata = disease_present)
heart_train <- training(heart_split)
heart_testing <- testing(heart_split)
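
As a sanity check on the stratified split, one could compare the class shares in the training and testing sets. The sketch below uses simulated data (the `toy` tibble and the `class_share()` helper are hypothetical, not part of the analysis) to show the idea.

```r
library(dplyr)
library(tibble)
library(rsample)

set.seed(2106)

# Simulated stand-in data: 80 observations without disease, 120 with
toy <- tibble(
  x = rnorm(200),
  disease_present = factor(rep(c("FALSE", "TRUE"), times = c(80, 120)))
)

toy_split <- initial_split(toy, prop = 0.75, strata = disease_present)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Hypothetical helper: share of each class in a data frame
class_share <- function(df) {
  df |>
    count(disease_present) |>
    mutate(share = n / sum(n))
}

class_share(toy_train)
class_share(toy_test)
```

With `strata = disease_present`, the class shares in both sets should stay close to those of the full data.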

- We begin our exploratory data analysis by summarizing the training data in several tables.

In [None]:
heart_explore_counts <- heart_train |>
    group_by(disease_present) |>
    summarize(count = n(),
              percent = (n() / nrow(heart_train)) * 100)

heart_explore_counts

- The table above reports the number of observations in each class, where <code>"FALSE"</code> means that no heart disease is present and <code>"TRUE"</code> means that heart disease is present.
- We use <code>group_by(disease_present)</code> to group the table by the disease-present variable, and we use <code>n()</code> to count the observations in each class.
- Comparing the two classes shows that the number of people with the disease is larger than the number without it.
- The data set nevertheless has almost equal percentages of patients with and without heart disease, which benefits our model because it has a similar number of cases of each class to train on.

In [None]:
heart_explore <- heart_train |>
    group_by(sex, disease_present) |>
    summarize(count = n(),
              avg_chol = mean(cholesterol),
              avg_age = round(mean(age)))

heart_explore

- We use <code>group_by(sex, disease_present)</code> to count the observations with and without heart disease for males and females.
- We can see that the data contain unequal numbers of males and females, as well as an unequal distribution of heart disease within each sex. This imbalance can make group summaries, such as the average cholesterol in each group, less reliable. Therefore, we need to preprocess the data before working with the model.

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)

heart_rate_plot <- heart_train |>
    ggplot(aes(x = age, y = max_heart_rate, color = disease_present)) +
    geom_point() +
    labs(x = "Age", y = "Max Heart Rate", color = "Disease Present") +
    facet_grid(cols = vars (disease_present)) +
    theme(text = element_text(size = 15))

heart_rate_plot

- Here, we are trying to see whether a higher maximum heart rate is associated with a higher risk of heart disease. We can also tentatively assume that younger and middle-aged patients with a high maximum heart rate are more prone to heart disease.

In [None]:
cholesterol_plot <- heart_train |>
    ggplot(aes(x = cholesterol, fill = disease_present)) +
    geom_histogram() +
    labs(x = "Cholesterol", y = "Count", fill = "Disease Present") +
    facet_grid(rows = vars (disease_present)) +
    theme(text = element_text(size = 15))

cholesterol_plot

- We wanted to see whether high cholesterol increases the risk of heart disease. According to the CDC, high cholesterol, especially the "bad" type, builds up in and blocks the blood vessels, which can induce a heart attack. However, the graph above does not seem to support this claim. Cholesterol by itself may not be an accurate predictor of the presence of heart disease.

In [None]:
restbp_plot <- heart_train |>
    ggplot(aes(x = rest_bp, fill = disease_present)) +
    geom_histogram(color = "black", position = "dodge") +
    labs(x = "Rest Blood Pressure", fill = "Disease Present") +
    facet_grid(rows = vars (disease_present)) +
    theme(text = element_text(size = 15))

restbp_plot


- We want to see whether higher resting blood pressure is associated with a higher chance of heart disease. From the graph above, the presence of heart disease is relatively high when the resting blood pressure is above 120.

Altogether, we want to try the four variables age, maximum heart rate, resting blood pressure, and cholesterol as the predictors for our model. We will also apply preprocessing steps to make sure that our training data is centered and scaled. Furthermore, we will tune K to avoid overfitting and underfitting.
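
To illustrate what centering and scaling do, the minimal sketch below preprocesses a small hypothetical training set (`toy_train`, with made-up values) and checks that each predictor ends up with mean 0 and standard deviation 1.

```r
library(recipes)
library(tibble)

# Hypothetical stand-in for the training data (values are made up)
toy_train <- tibble(
  age = c(40, 50, 60, 70),
  cholesterol = c(180, 220, 260, 300),
  disease_present = factor(c("FALSE", "TRUE", "FALSE", "TRUE"))
)

toy_recipe <- recipe(disease_present ~ age + cholesterol, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# prep() estimates the means and standard deviations from the training data;
# bake(new_data = NULL) returns the preprocessed training data
toy_scaled <- toy_recipe |>
  prep() |>
  bake(new_data = NULL)

# Each predictor should now have mean ~0 and standard deviation ~1
c(mean = mean(toy_scaled$age), sd = sd(toy_scaled$age))
c(mean = mean(toy_scaled$cholesterol), sd = sd(toy_scaled$cholesterol))
```

Standardizing matters for K-nearest neighbors because distances would otherwise be dominated by the predictor with the largest numeric range.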

## Predictor Variable Selection ##



In [None]:
set.seed(1)
heart_subset <- heart_train |>
    select("age", "cholesterol", "rest_bp", "max_heart_rate", "disease_present")

names <- colnames(heart_subset|> select(-disease_present))

accuracies <- tibble(size = integer(), 
                     model_string = character(), 
                     accuracy = numeric())

# create a model specification
knn_spec <- nearest_neighbor(weight_func = "rectangular", 
                             neighbors = tune()) |>
     set_engine("kknn") |>
     set_mode("classification")

# create a 5-fold cross-validation object
heart_vfold <- vfold_cv(heart_subset, v = 5, strata = disease_present)

# store the total number of predictors
n_total <- length(names)

# stores selected predictors
selected <- c()

# for every size from 1 to the total number of predictors
for (i in 1:n_total) {
    # for every predictor still not added yet
    accs <- list()
    models <- list()
    for (j in 1:length(names)) {
        # create a model string for this combination of predictors
        preds_new <- c(selected, names[[j]])
        model_string <- paste("disease_present", "~", paste(preds_new, collapse="+"))

        # create a recipe from the model string
        heart_recipe <- recipe(as.formula(model_string), 
                                data = heart_subset) |>
                          step_scale(all_predictors()) |>
                          step_center(all_predictors())

        # tune the KNN classifier with these predictors, 
        # and collect the accuracy for the best K
        acc <- workflow() |>
          add_recipe(heart_recipe) |>
          add_model(knn_spec) |>
          tune_grid(resamples = heart_vfold, grid = 10) |>
          collect_metrics() |>
          filter(.metric == "accuracy") |>
          summarize(mx = max(mean))
        acc <- acc$mx |> unlist()

        # add this result to the dataframe
        accs[[j]] <- acc
        models[[j]] <- model_string
    }
    jstar <- which.max(unlist(accs))
    accuracies <- accuracies |> 
      add_row(size = i, 
              model_string = models[[jstar]], 
              accuracy = accs[[jstar]])
    selected <- c(selected, names[[jstar]])
    names <- names[-jstar]
}
accuracies



## Tuning ##

In [None]:
heart_recipe <- recipe(disease_present ~ age + rest_bp + cholesterol + max_heart_rate, data = heart_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

vfold <- vfold_cv(heart_train, v = 10, strata = disease_present)

gridvals <- tibble(neighbors = seq(1, 25))

results <- workflow() |>
    add_recipe(heart_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = vfold, grid = gridvals) |>
    collect_metrics() |>
    filter(.metric == "accuracy")

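# plot the estimated accuracy against the number of neighbors
cross_validation_plot <- results |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate") +
    scale_x_continuous(limits = c(1, 25), breaks = seq(1, 25), minor_breaks = seq(1, 25, 1))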

cross_validation_plot
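
Besides reading the best K off the plot, it can be extracted programmatically from the cross-validation results. The sketch below assumes a results table with `neighbors` and `mean` columns, like the one produced above; the `toy_results` values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical cross-validation accuracy estimates (made-up values)
toy_results <- tibble(
  neighbors = 1:5,
  mean = c(0.70, 0.72, 0.75, 0.73, 0.69)
)

# Pick the number of neighbors with the highest estimated accuracy
best_k <- toy_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

In practice it is still worth looking at the plot, since a slightly larger K with nearly the same accuracy may generalize better.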

## KNN Classification Testing ##

In [None]:
# Create the best KNN model
knn_spec_heart <- nearest_neighbor(weight_func = "rectangular", neighbors = 2) |>
    set_engine("kknn") |>
    set_mode("classification")

# Final recipe for centering and scaling
knn_recipe_heart <- recipe(disease_present ~ cholesterol + max_heart_rate + age + rest_bp, data = heart_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Use a workflow to combine the recipe and model, then fit on the training set
knn_fit_heart <- workflow() |>
    add_recipe(knn_recipe_heart) |>
    add_model(knn_spec_heart) |>
    fit(data = heart_train)

# Predict on the testing set and attach the predictions to the test data
heart_test_prediction <- predict(knn_fit_heart, heart_testing) |>
    bind_cols(heart_testing)

heart_test_prediction

# Confusion matrix of predicted vs. true classes
heart_confusion_df <- heart_test_prediction |>
    conf_mat(truth = disease_present, estimate = .pred_class)

heart_confusion_df

# Keep only the accuracy row of the test-set metrics
heart_accuracy <- heart_test_prediction |>
    metrics(truth = disease_present, estimate = .pred_class) |>
    filter(.metric == "accuracy") |>
    select(.metric, .estimate)

heart_accuracy
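
To make the accuracy metric concrete, the sketch below computes it on a handful of hypothetical predictions (`toy_pred`, made-up values) both with `accuracy()` from yardstick and by hand as the fraction of predictions that match the true class.

```r
library(yardstick)
library(tibble)

# Hypothetical predictions: 4 of the 5 match the true class
toy_pred <- tibble(
  disease_present = factor(c("TRUE", "TRUE", "FALSE", "FALSE", "TRUE"),
                           levels = c("FALSE", "TRUE")),
  .pred_class = factor(c("TRUE", "FALSE", "FALSE", "FALSE", "TRUE"),
                       levels = c("FALSE", "TRUE"))
)

# Accuracy as reported by yardstick
toy_accuracy <- accuracy(toy_pred, truth = disease_present, estimate = .pred_class)

# The same quantity by hand: the share of correct predictions
manual_accuracy <- mean(toy_pred$disease_present == toy_pred$.pred_class)

toy_accuracy
manual_accuracy
```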

## References ##

Centers for Disease Control and Prevention. https://www.cdc.gov/cholesterol/myths_facts.htm

Grey, H. (2021, December 27). Heart disease & age. Memorial Hermann. Retrieved October 29, 2022, from https://memorialhermann.org/services/specialties/heart-and-vascular/healthy-living/education/heart-disease-and-age

Perret-Guillaume, C., Joly, L., & Benetos, A. (2009). Heart rate as a risk factor for cardiovascular disease. Progress in Cardiovascular Diseases, 52(1), 6-10. doi: 10.1016/j.pcad.2009.05.003. PMID: 19615487.

U.S. Department of Health and Human Services. (n.d.). Heart health and aging. National Institute on Aging. Retrieved October 29, 2022, from https://www.nia.nih.gov/health/heart-health-and-aging#changes
