In [1]:
library(tidyverse)

# Read the processed Cleveland data; the file is comma-delimited and has no header row.
heart_disease <- read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", delim = ",", col_names = FALSE)
colnames(heart_disease) <- c("age", "sex", "chest_pain_type", "resting_blood_pressure", "chol", "fasting_blood_sugar", "resting_electrocardiographic_results", "maximum_heart_rate_achieved", "exercise_induced_angina", "oldpeak", "slope", "number_of_major_vessels", "thal", "diagnosis_of_heart_disease")
# Treat the categorical codes as factors and keep only the columns examined below.
heart_data <- heart_disease |>
    mutate(chest_pain_type = as_factor(chest_pain_type), fasting_blood_sugar = as_factor(fasting_blood_sugar)) |>
    select(age, sex, chest_pain_type, resting_blood_pressure, chol, fasting_blood_sugar, maximum_heart_rate_achieved)

heart_data


── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification

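| age | sex | chest_pain_type | resting_blood_pressure | chol | fasting_blood_sugar | maximum_heart_rate_achieved |
|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` | `<dbl>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 150 |
| 67 | 1 | 4 | 160 | 286 | 0 | 108 |
| 67 | 1 | 4 | 120 | 229 | 0 | 129 |
| 37 | 1 | 3 | 130 | 250 | 0 | 187 |
| 41 | 0 | 2 | 130 | 204 | 0 | 172 |
| 56 | 1 | 2 | 120 | 236 | 0 | 178 |
| 62 | 0 | 4 | 140 | 268 | 0 | 160 |
| 57 | 0 | 4 | 120 | 354 | 0 | 163 |
| 63 | 1 | 4 | 130 | 254 | 0 | 147 |
| 53 | 1 | 4 | 140 | 203 | 1 | 155 |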


# METHODS

### VARIABLES AND PREDICTORS

The following variables (columns) will be used as predictors in our data analysis. Each is a factor that may directly increase a person's risk of heart disease:

- Resting blood pressure
    
    An elevated heart rate is associated with elevated blood pressure and an increased risk of hypertension, which in turn raises the risk of heart disease. We therefore expect a person with a higher resting blood pressure to be more likely to be at risk of heart-related disease.
    
- Fasting Blood Sugar

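    High blood sugar can damage the blood vessels and the nerves that control the heart. People with diabetes are also more likely to have other conditions that raise the risk of heart disease. In this dataset, fasting blood sugar is recorded as a binary indicator (1 if it is above 120 mg/dl, 0 otherwise).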
    
- Cholesterol levels

    With high cholesterol, a person can develop fatty deposits in their blood vessels. Over time these deposits grow, making it difficult for enough blood to flow through the arteries. The deposits can also break suddenly and form a clot that causes a heart attack.
    
- Age

    Adults aged 50 and older are more likely than younger people to suffer from cardiovascular disease. Aging can cause changes in the heart and blood vessels that may increase a person's risk of developing heart disease.
    
- Oldpeak

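    Oldpeak is the ST depression induced by exercise relative to rest, measured on an ECG (electrocardiogram). It is a property of the ECG waveform rather than a heart rate. A larger ST depression can indicate reduced blood flow to the heart muscle, so it helps identify whether a person is at risk of heart disease.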

The impact of these factors can be seen in the graphs above and in the data summary, which shows how the mean of each predictor differs between diagnoses.

![image.jpeg](attachment:bb27128f-eddb-4fae-9eb2-dbb343282207.jpeg)

Oldpeak (ECG monitor)
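A summary of predictor means per diagnosis could be computed along the following lines. This is only a minimal sketch: `toy_heart` and every value in it are invented for illustration (they are not taken from the Cleveland data), and `predictor_means` is a hypothetical name. The column names mirror those assigned when the data was loaded.

```r
library(tidyverse)

# Hypothetical toy data: all values are invented for illustration only.
toy_heart <- tibble(
    diagnosis_of_heart_disease = factor(c(0, 0, 0, 1, 1, 1)),
    age = c(40, 45, 50, 60, 65, 70),
    resting_blood_pressure = c(120, 125, 130, 140, 150, 160),
    chol = c(200, 220, 240, 260, 280, 300)
)

# Mean of every numeric predictor within each diagnosis group.
predictor_means <- toy_heart |>
    group_by(diagnosis_of_heart_disease) |>
    summarize(across(where(is.numeric), mean))

predictor_means
```

Comparing the rows of such a table side by side gives a first indication of which predictors differ between the diagnosis groups.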

### STEPS

Conducting a data analysis involves a series of steps that help answer a predictive question about a dataset. The steps below come after the data has been cleaned and tidied and then split into a training set and a testing set.
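The split into training and testing sets might look like the sketch below. The 75/25 proportion, the seed, and the simulated `toy_heart` data frame are all assumptions made for illustration; the text above does not specify them.

```r
set.seed(2022)  # arbitrary seed so that the split is reproducible

# Hypothetical stand-in for the tidied data: 100 simulated patients.
toy_heart <- data.frame(
    age = round(rnorm(100, mean = 55, sd = 9)),
    chol = round(rnorm(100, mean = 245, sd = 50)),
    diagnosis = factor(rep(c("absent", "present"), each = 50))
)

# Put 75% of the rows in the training set and the rest in the testing set.
train_rows <- sample(nrow(toy_heart), size = 0.75 * nrow(toy_heart))
heart_train <- toy_heart[train_rows, ]
heart_test <- toy_heart[-train_rows, ]

c(train = nrow(heart_train), test = nrow(heart_test))
```

The testing set is set aside at this point and is not used again until the final model is evaluated.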

#### STEP 1:

Plot each variable to give the data visual context. Graphs reveal relationships that are not obvious from a list of numbers, and they provide a convenient way to compare different sets of data.
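As an illustration of this step, the sketch below draws overlapping histograms of one predictor for each diagnosis group. The data is simulated (the group means and spreads are arbitrary assumptions), and `chol_hist` is a hypothetical name.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed for reproducibility

# Simulated cholesterol values for two diagnosis groups (illustration only).
toy_heart <- data.frame(
    chol = c(rnorm(50, mean = 230, sd = 40), rnorm(50, mean = 260, sd = 40)),
    diagnosis = factor(rep(c("absent", "present"), each = 50))
)

# Overlay one semi-transparent histogram per diagnosis group.
chol_hist <- ggplot(toy_heart, aes(x = chol, fill = diagnosis)) +
    geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
    labs(x = "Cholesterol (mg/dl)", y = "Number of patients", fill = "Diagnosis")

chol_hist
```

If the two histograms are clearly shifted relative to each other, the variable is likely to be a useful predictor.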

#### STEP 2:

Identify the best value of k for k-nn classification using cross-validation. Cross-validation splits the training data into several folds and evaluates each candidate k on the folds it was not trained on. With a limited amount of data, this gives a more reliable estimate of accuracy than a single split of the training data. We use the k-nearest neighbors (k-nn) algorithm because it is simple, makes few assumptions about the shape of the data, and can be competitive in accuracy with more complex models. Its predictions become slower as the dataset grows, but that is not a concern for the 303 observations in this dataset.
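A minimal sketch of choosing k by 5-fold cross-validation is shown below. It uses `knn()` from the `class` package rather than any particular modelling framework, and it runs on simulated data. The number of folds, the candidate values of k, and the names `cv_accuracy` and `best_k` are assumptions for illustration.

```r
library(class)  # provides knn()

set.seed(2022)  # arbitrary seed for reproducibility

# Simulated two-class training data (hypothetical, for illustration only).
n <- 120
train <- data.frame(
    age = c(rnorm(n / 2, 50, 8), rnorm(n / 2, 60, 8)),
    chol = c(rnorm(n / 2, 230, 40), rnorm(n / 2, 265, 40)),
    diagnosis = factor(rep(c("absent", "present"), each = n / 2))
)

# Standardize the predictors so that no variable dominates the distance.
# Simplification: scaled once over all training rows, not within each fold.
predictors <- scale(train[, c("age", "chol")])

# Assign each row to one of 5 folds at random.
folds <- sample(rep(1:5, length.out = n))
k_values <- seq(1, 15, by = 2)

# For each k, average the held-out accuracy over the 5 folds.
cv_accuracy <- sapply(k_values, function(k) {
    mean(sapply(1:5, function(f) {
        held_out <- folds == f
        predicted <- knn(predictors[!held_out, ], predictors[held_out, ],
                         cl = train$diagnosis[!held_out], k = k)
        mean(predicted == train$diagnosis[held_out])
    }))
})

best_k <- k_values[which.max(cv_accuracy)]
best_k
```

Odd values of k are used here so that a two-class vote cannot end in a tie.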
    
#### STEP 3:

Create the final model using the value of k selected in Step 2. The final model is the one used to make predictions on new data. In general such a prediction may be a classification or a regression; in this case, it is a classification of the heart disease diagnosis.
    
#### STEP 4:

Test the model by measuring how accurately it predicts the diagnosis for the observations in the testing set.
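Steps 3 and 4 could be sketched together as follows, again with `knn()` from the `class` package on simulated data. The value `chosen_k <- 5` is only a placeholder for whatever cross-validation selects, and the 75/25 split is an assumption.

```r
library(class)  # provides knn()

set.seed(2022)  # arbitrary seed for reproducibility

# Simulated data (hypothetical, for illustration only).
n <- 160
toy_heart <- data.frame(
    age = c(rnorm(n / 2, 50, 8), rnorm(n / 2, 60, 8)),
    chol = c(rnorm(n / 2, 230, 40), rnorm(n / 2, 265, 40)),
    diagnosis = factor(rep(c("absent", "present"), each = n / 2))
)

train_rows <- sample(n, size = 0.75 * n)
train_x <- scale(toy_heart[train_rows, c("age", "chol")])
# Scale the testing set with the training means and standard deviations.
test_x <- scale(toy_heart[-train_rows, c("age", "chol")],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))

chosen_k <- 5  # placeholder; use the k selected by cross-validation
test_pred <- knn(train_x, test_x, cl = toy_heart$diagnosis[train_rows], k = chosen_k)

# Confusion matrix and overall accuracy on the testing set.
actual <- toy_heart$diagnosis[-train_rows]
confusion <- table(predicted = test_pred, actual = actual)
test_accuracy <- mean(test_pred == actual)

confusion
test_accuracy
```

The confusion matrix is worth reading alongside the accuracy, because in a medical setting a missed diagnosis and a false alarm do not carry the same cost.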

### WAYS TO VISUALIZE DATA

There are many ways to visualize our data; however, only the following will be used:

- HISTOGRAM
    - well suited to larger datasets

    - shows the major features of a numeric variable's distribution, such as its center, spread, and skew

- BAR GRAPH
    - compares values across categories

    - can be used to identify and track changes over time
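For a categorical variable, a bar graph of counts is the natural counterpart to a histogram. The sketch below uses a handful of invented chest pain codes (not the real data), and `pain_bar` is a hypothetical name.

```r
library(ggplot2)

# Hypothetical chest pain codes, invented for illustration only.
toy_heart <- data.frame(
    chest_pain_type = factor(c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4))
)

# geom_bar() counts the rows in each category and draws one bar per category.
pain_bar <- ggplot(toy_heart, aes(x = chest_pain_type)) +
    geom_bar() +
    labs(x = "Chest pain type", y = "Number of patients")

pain_bar
```

Because `chest_pain_type` is a factor, each level gets its own bar, whereas a histogram would first group a numeric variable into bins.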