# Exploring the relative significance of various factors in determining the presence of heart disease

## Introduction
### Background Information
High blood pressure, unhealthy cholesterol levels, and resting heart rate are some of the main predictors of heart disease (CDC). Blood pressure, when too high, can put stress on the arteries in addition to organs such as the heart and kidneys. Cholesterol is a fat-like substance that can build up on the walls of the arteries and reduce blood flow to the heart. These variables can be changed via medications or lifestyle changes and are therefore worth studying as predictors of heart disease.
### Clearly state the question you will try to answer with your project
Which of the well-known predictors of heart disease has the greatest impact on whether the disease is present?
### Identify and describe the dataset that will be used to answer the question
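We will use the Heart Disease dataset from the UCI Machine Learning Repository, which contains variables such as sex, age, cholesterol, blood pressure, and smoking status. Note that the processed Cleveland file read in the code at the end of this proposal has 14 columns and does not include smoking status.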
## Preliminary exploratory data analysis
Demonstrate that the dataset can be read from the web into R 
Read the dataset directly from the web.
Clean and wrangle your data into a tidy format
Remove all columns except cholesterol (chol), age, sex, blood pressure (trestbps), smoker (smoke), and diagnosis of heart disease (num). 
Choose an age cutoff and remove every observation below it.
Convert num into a factor if needed.
### Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
Using the wrangled data above, group by num and summarize the mean of each predictor.
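The summary step could be sketched roughly as follows. This is only an illustration: `toy_train` is a made-up stand-in for the real training split, and the column names `num`, `chol`, and `trestbps` follow the wrangling notes above rather than the final code.

```r
library(tidyverse)

# Toy stand-in for the training split. These values are invented for
# illustration and are NOT taken from the Heart Disease dataset.
toy_train <- tibble(
  num      = factor(c(0, 0, 1, 1, 1)),
  chol     = c(210, 190, 250, NA, 280),
  trestbps = c(120, 130, 140, 150, 160)
)

# One row per class: number of observations, predictor means,
# and the number of rows with at least one missing predictor.
summary_table <- toy_train |>
  group_by(num) |>
  summarize(
    n_obs         = n(),
    mean_chol     = mean(chol, na.rm = TRUE),
    mean_trestbps = mean(trestbps, na.rm = TRUE),
    rows_with_na  = sum(is.na(chol) | is.na(trestbps)),
    .groups = "drop"
  )

summary_table
```

Swapping `toy_train` for the real training data should give the class counts, predictor means, and missing-row counts that the table is meant to report.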
### Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.
Plot age against heart disease diagnosis and use the plot to decide on an age cutoff.
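One possible shape for this plot and the cutoff step is sketched below. The data in `toy_train` are invented, and `age_cutoff` is a hypothetical value that would be chosen only after inspecting the real plot.

```r
library(tidyverse)

# Toy stand-in for the training data. These values are invented for
# illustration and are NOT taken from the Heart Disease dataset.
toy_train <- tibble(
  age = c(29, 41, 45, 52, 58, 60, 63, 67),
  num = factor(c(0, 0, 0, 1, 0, 1, 1, 1))
)

# Compare the age distribution of patients with and without a diagnosis.
age_plot <- ggplot(toy_train, aes(x = num, y = age)) +
  geom_boxplot() +
  labs(
    x = "Heart disease diagnosis (0 = absent, 1 = present)",
    y = "Age (years)"
  )

# Hypothetical cutoff (an assumption for this sketch, not a decision).
age_cutoff <- 40

# Keep only observations at or above the cutoff.
toy_filtered <- toy_train |>
  filter(age >= age_cutoff)

age_plot
```

A boxplot is just one option here; overlaid histograms of age for each diagnosis would show the same comparison.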

## Methods

### Variables Used
We will make the dataset smaller and more manageable by keeping only the cholesterol, resting heart rate, blood pressure, age, and sex variables. We will also keep the column that records whether the patient was diagnosed with heart disease so that we can train and then test our model.
### Data Analysis
After cleaning the data and retaining the columns for our variables, we will split the dataset by sex. For each sex, we will fit one classifier per variable and compare the accuracies of the resulting models. Each classification will start with a training dataset, which will be used to create a recipe. We will perform 5-fold cross-validation on the training set to find the best k value, then build a model using that k value and the training dataset. Finally, we will use that model to predict on the testing dataset and evaluate its accuracy. This process will be repeated for each variable, first for males and then for females.
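The workflow described above might look roughly like this for a single predictor. This is a sketch under several assumptions: that "k" is the number of neighbors in a K-nearest neighbors classifier fitted with the `kknn` engine, that the grid of candidate k values is 1 to 15 in steps of 2, and that the seed is arbitrary. The data are synthetic and stand in for one sex's subset of the real dataset.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed, chosen only to make the sketch reproducible

# Synthetic stand-in data (NOT the Heart Disease dataset): one predictor
# (chol) and a binary diagnosis (num).
toy <- tibble(
  chol = c(rnorm(60, mean = 220, sd = 30), rnorm(60, mean = 260, sd = 30)),
  num  = factor(rep(c(0, 1), each = 60))
)

# Training/testing split, stratified by diagnosis.
toy_split <- initial_split(toy, prop = 0.75, strata = num)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Recipe built from the training data only; the predictor is standardized.
chol_recipe <- recipe(num ~ chol, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN specification with the number of neighbors left to be tuned.
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training set.
folds <- vfold_cv(toy_train, v = 5, strata = num)

cv_results <- workflow() |>
  add_recipe(chol_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = folds, grid = tibble(neighbors = seq(1, 15, by = 2))) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest cross-validated accuracy.
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen k.
knn_best <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(chol_recipe) |>
  add_model(knn_best) |>
  fit(data = toy_train)

# Evaluate accuracy on the held-out testing set.
test_accuracy <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = num, estimate = .pred_class) |>
  filter(.metric == "accuracy") |>
  pull(.estimate)
```

Repeating this for each predictor within each sex would give one test accuracy per variable and sex, which is what the comparison needs.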
### Data Visualization
We will visualize the results with bar plots showing the accuracy achieved by each variable, separated by sex. This will demonstrate which variables are best at predicting a diagnosis of heart disease and highlight any discrepancies between the sexes. Additionally, we will produce a confusion matrix for each variable, along with tibbles reporting each variable's accuracy for each sex.
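A rough sketch of these outputs is given below. Every accuracy in `accuracy_tbl` is a 0.5 placeholder and the predictions in `toy_preds` are invented, so nothing here represents a result of the analysis. The object names are hypothetical.

```r
library(tidymodels)

# Placeholder accuracies only: every value is set to 0.5 and is NOT a result.
accuracy_tbl <- tibble(
  variable = rep(c("chol", "trestbps", "age"), times = 2),
  sex      = rep(c("female", "male"), each = 3),
  accuracy = rep(0.5, 6)
)

# Bar plot of accuracy per variable, with one bar per sex side by side.
accuracy_plot <- ggplot(accuracy_tbl, aes(x = variable, y = accuracy, fill = sex)) +
  geom_col(position = "dodge") +
  labs(x = "Predictor variable", y = "Test accuracy", fill = "Sex")

# Invented truth/prediction pairs, used only to show the confusion matrix call.
toy_preds <- tibble(
  truth      = factor(c(0, 0, 1, 1, 1, 0), levels = c(0, 1)),
  prediction = factor(c(0, 1, 1, 1, 0, 0), levels = c(0, 1))
)

# Confusion matrix: rows are predictions, columns are the true classes.
toy_conf_mat <- conf_mat(toy_preds, truth = truth, estimate = prediction)

accuracy_plot
toy_conf_mat
```

In the real analysis, `accuracy_tbl` would be assembled from the test accuracies of the fitted models rather than typed in by hand.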

## Expected outcomes and significance

### What do you expect to find? 
We expect to find that high cholesterol, high blood pressure, and increased resting heart rate are all positively correlated with a diagnosis of heart disease. We also expect cholesterol to be the best predictor of heart disease and heart rate to be the worst predictor in both sexes.
### What impact could such findings have?
Identifying the best predictor of heart disease also identifies the variable that is most important to change. So, if cholesterol is the best predictor, then one should focus on lowering their cholesterol before worrying about the other variables. 
### What future questions could this lead to?
This could lead to questions about how diet and lifestyle can be changed to reduce these levels. Additionally, we could ask whether more resources and energy should be targeted toward the identification and treatment of some variables over others. We may also wonder whether there are additional variables, not included in our tidied data, that have more of an impact than the variables we chose to include.

 Which factors are most prominent in positive diagnoses

# Load libraries (tidyverse already attaches dplyr)
library(tidyverse)
library(tidymodels)

# Fix the random seed so the train/test split is reproducible
# (the seed value is arbitrary)
set.seed(1)

# Set URL of dataset
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

# Set file path of dataset
file_path <- "processed.cleveland.data"

# Character vector of column names (the file has no header row)
column_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", 
                  "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target")

# Download dataset into the working directory
download.file(url, destfile = file_path, method = "auto")

# Read dataset into heart_csv
heart_csv <- read_delim(file_path, delim = ",", col_names = column_names)

# Clean dataset: keep the columns of interest, recode target as a binary
# factor (0 = no heart disease, 1 = heart disease), and treat age as a factor
heart_data <- heart_csv |>
    select(age, sex, chol, trestbps, target) |>
    mutate(target = as_factor(ifelse(target > 0, 1, 0))) |>
    mutate(age = as_factor(age))

# Create training and testing split for dataset
heart_split <- initial_split(heart_data, prop = 0.75, strata = target)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

# Create summary dataset: mean cholesterol for each diagnosis and age
summary_heart <- heart_data |>
    group_by(target, age) |>
    summarize(mean = mean(chol, na.rm = TRUE)) |>
    pivot_wider(names_from = age, values_from = mean)

summary_heart
# Create age vs. cp plot to demonstrate which ages we will focus on


Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): ca, thal
dbl (12): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
`summarise()` has grouped output by 'target'. You can override using the
`.groups` argument.


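| target | 29 | 34 | 35 | 37 | 38 | 39 | 40 | 41 | 42 | ⋯ | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 74 | 76 | 77 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0 | 204.0 | 196.0 | 187.5 | 232.5 | 175 | 246.6667 | 199 | 226.1111 | 237 | ⋯ | 305.75 | 258.5 | 354.6667 | 244.0 | 236.5 | 245 | 238.6667 | 269.0 | 197.0 | NA |
| 1 | NA | NA | 240.0 | NA | 231 | 219.0 | 195 | 172.0 | 315 | ⋯ | 252.25 | 228.6667 | 252.8333 | 233.5 | 254.0 | 255 | NA | NA | NA | 304.0 |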
