# DSCI Project Proposal - Group 15

## Introduction

Cardiovascular diseases claim roughly 17.9 million lives each year. Measures of heart activity, such as heart rate and resting ECG results, can be used to detect these diseases. For this project, we will use the “Heart Failure Prediction Dataset” to answer the following question:

***Does heart activity accurately predict whether a person is at risk of heart disease?***

This dataset combines independent datasets from hospitals in Hungary, Switzerland, and the United States of America. In its entirety, it includes eleven variables that can be used to predict heart disease. However, our project will focus solely on the resting ECG results and the maximum heart rate achieved, because we want to know whether heart activity alone is an accurate predictor of cardiovascular diseases.

#### *Variables in dataset:*

* Age: age of the patient [years]
* Sex: sex of the patient [M: Male, F: Female]
* ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
* RestingBP: resting blood pressure [mm Hg]
* Cholesterol: serum cholesterol [mg/dl]
* FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
* RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
* MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
* ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
* Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
* ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
* HeartDisease: output class [1: heart disease, 0: Normal] 

## Preliminary Exploratory Data Analysis

#### *Reading the data set:*

In [43]:
library(tidyverse)
library(repr)
library(tidymodels)

set.seed(9999)

In [44]:
heart_data <- read_csv("https://raw.githubusercontent.com/tyih985/DSCI-Project-Attempt-2/main/heart.csv")

heart_data

Rows: 918 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope
dbl (7): Age, RestingBP, Cholesterol, FastingBS, MaxHR, Oldpeak, HeartDisease

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<dbl>,<chr>,<dbl>
40,M,ATA,140,289,0,Normal,172,N,0,Up,0
49,F,NAP,160,180,0,Normal,156,N,1,Flat,1
37,M,ATA,130,283,0,ST,98,N,0,Up,0
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat,1
57,F,ATA,130,236,0,LVH,174,N,0.0,Flat,1
38,M,NAP,138,175,0,Normal,173,N,0.0,Up,0


#### *Cleaning and wrangling the data set:*

In [45]:
# Keep two predictors and the outcome, rename them to snake_case,
# and recode the outcome from 0/1 to a factor with readable labels.
heart_data_cleaned <- heart_data |>
    rename(resting_bp = RestingBP, max_hr = MaxHR, heart_disease = HeartDisease) |>
    select(resting_bp, max_hr, heart_disease) |>
    mutate(heart_disease = fct_recode(as_factor(heart_disease), "yes" = "1", "no" = "0"))

heart_data_cleaned

resting_bp,max_hr,heart_disease
<dbl>,<dbl>,<fct>
140,172,no
160,156,yes
130,98,no
⋮,⋮,⋮
130,115,yes
130,174,yes
138,173,no


#### *Summarizing the data set:*

In [46]:
# code
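
One possible summary, given here as a minimal sketch rather than the final analysis, counts the patients in each class, averages each predictor, and counts rows with missing values. The block repeats the loading and wrangling steps so that it runs on its own. The name `heart_summary` and the choice of statistics are illustrative assumptions.

```r
library(tidyverse)

# Same source file and wrangling steps as the cells above,
# repeated so this cell is self-contained.
heart_data_cleaned <- read_csv(
        "https://raw.githubusercontent.com/tyih985/DSCI-Project-Attempt-2/main/heart.csv",
        show_col_types = FALSE
    ) |>
    rename(resting_bp = RestingBP, max_hr = MaxHR, heart_disease = HeartDisease) |>
    select(resting_bp, max_hr, heart_disease) |>
    mutate(heart_disease = fct_recode(as_factor(heart_disease), "yes" = "1", "no" = "0"))

# One row per class: number of patients, mean of each predictor,
# and number of rows with a missing predictor value.
# (heart_summary is an illustrative name, not from the original notebook.)
heart_summary <- heart_data_cleaned |>
    group_by(heart_disease) |>
    summarize(
        n = n(),
        mean_resting_bp = mean(resting_bp, na.rm = TRUE),
        mean_max_hr = mean(max_hr, na.rm = TRUE),
        n_missing = sum(is.na(resting_bp) | is.na(max_hr))
    )

heart_summary
```

The per-class counts show whether the two classes are balanced, which matters when interpreting classifier accuracy later.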

#### *Visualizing the data set:*

In [47]:
# code
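
A scatter plot of the two cleaned predictors, with color and shape marking the class, is one way this cell could be filled in. The sketch below assumes the columns of `heart_data_cleaned` as built above and repeats the loading step so that it runs on its own. The name `heart_plot`, the axis labels, and the styling choices are illustrative assumptions.

```r
library(tidyverse)

# Same source file and wrangling steps as the cells above,
# repeated so this cell is self-contained.
heart_data_cleaned <- read_csv(
        "https://raw.githubusercontent.com/tyih985/DSCI-Project-Attempt-2/main/heart.csv",
        show_col_types = FALSE
    ) |>
    rename(resting_bp = RestingBP, max_hr = MaxHR, heart_disease = HeartDisease) |>
    select(resting_bp, max_hr, heart_disease) |>
    mutate(heart_disease = fct_recode(as_factor(heart_disease), "yes" = "1", "no" = "0"))

# Scatter plot of the two predictors; color and shape both encode the class,
# and sharing one legend title merges them into a single legend.
# (heart_plot is an illustrative name, not from the original notebook.)
heart_plot <- heart_data_cleaned |>
    ggplot(aes(x = resting_bp, y = max_hr, color = heart_disease, shape = heart_disease)) +
    geom_point(alpha = 0.6) +
    labs(
        x = "Resting blood pressure (mm Hg)",
        y = "Maximum heart rate achieved",
        color = "Heart disease",
        shape = "Heart disease"
    ) +
    theme(text = element_text(size = 14))

heart_plot
```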

## Methods

We will use the values from the selected columns (RestingECG and MaxHR) because they are direct measurements of heart activity, unlike variables such as cholesterol that are measured from bodily fluids. Using these two predictors, we will build a classification model with the K-nearest neighbors algorithm to predict whether a person is at risk of heart disease. To visualize the results, we will use scatter plots with human-readable axis labels and legends, using shape and color to distinguish the classes.
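
The K-nearest neighbors workflow described above could look roughly like the following sketch. It is written under several assumptions, not as the final method: it uses the two numeric columns kept in `heart_data_cleaned` (`resting_bp` and `max_hr`), since a categorical predictor such as RestingECG would first need to be encoded numerically (for example with `step_dummy()`); the 75/25 split and `neighbors = 5` are placeholder values that would normally be chosen by cross-validation; and the `kknn` package must be installed for the engine to work.

```r
library(tidyverse)
library(tidymodels)

set.seed(9999)

# Same source file and wrangling steps as in the exploratory analysis,
# repeated so this block is self-contained.
heart_data_cleaned <- read_csv(
        "https://raw.githubusercontent.com/tyih985/DSCI-Project-Attempt-2/main/heart.csv",
        show_col_types = FALSE
    ) |>
    rename(resting_bp = RestingBP, max_hr = MaxHR, heart_disease = HeartDisease) |>
    select(resting_bp, max_hr, heart_disease) |>
    mutate(heart_disease = fct_recode(as_factor(heart_disease), "yes" = "1", "no" = "0"))

# Hold out 25% of the rows for testing, stratified by class (assumed proportion).
heart_split <- initial_split(heart_data_cleaned, prop = 0.75, strata = heart_disease)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

# Standardize the predictors so neither dominates the distance calculation.
heart_recipe <- recipe(heart_disease ~ resting_bp + max_hr, data = heart_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# K-nearest neighbors classifier; neighbors = 5 is a placeholder, not a tuned value.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_fit <- workflow() |>
    add_recipe(heart_recipe) |>
    add_model(knn_spec) |>
    fit(data = heart_train)

# Predict on the held-out rows and report accuracy.
heart_predictions <- predict(knn_fit, heart_test) |>
    bind_cols(heart_test)

heart_predictions |>
    metrics(truth = heart_disease, estimate = .pred_class) |>
    filter(.metric == "accuracy")
```

Fitting the recipe on the training set only keeps information from the test set out of the scaling step, so the reported accuracy reflects performance on unseen patients.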

## Expected Outcomes and Significance

#### *What do we expect to find?*

words

#### *What impact could such findings have?*

words

#### *What future questions could this lead to?*

words