In [None]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
library(readxl)
# library(rvest)
# library(stringr)
# install.packages('janitor')
# library(janitor)
library(ggplot2)
#options(repr.matrix.max.rows = 50)

**Demonstrating that the dataset can be read from the web into R:**

In [None]:
# Read the Cleveland data (the contents of processed.cleveland.data) as a CSV file with a header row.
# The column names match those listed in cleve.mod under 'Original atts', minus the bracketed annotations.
options(repr.matrix.max.rows = 2)

cleveland_dataset_web <- read_csv(file = url("https://archive.ics.uci.edu/static/public/45/data.csv"))

# Reuse the Cleveland column names, because the other files I plan to import do not contain column names.
first_row_contents <- names(cleveland_dataset_web)

# reading a file containing the data in processed.hungarian.data
hungarian_dataset_web <- read_delim(file = url("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data"), 
                                    delim = ",", col_names = first_row_contents) |>
                        mutate(trestbps = as.numeric(trestbps), chol = as.numeric(chol), fbs = as.numeric(fbs), restecg = as.numeric(restecg),
                               thalach = as.numeric(thalach), exang = as.numeric(exang), oldpeak = as.numeric(oldpeak),
                               slope = as.numeric(slope), ca = as.numeric(ca), thal = as.numeric(thal))

# reading a file containing the data in processed.switzerland.data
switzerland_dataset_web <- read_delim(file = url("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.switzerland.data"), 
                                    delim = ",", col_names = first_row_contents) |>
                        mutate(trestbps = as.numeric(trestbps), chol = as.numeric(chol), fbs = as.numeric(fbs), restecg = as.numeric(restecg), 
                               thalach = as.numeric(thalach), exang = as.numeric(exang), oldpeak = as.numeric(oldpeak),
                               slope = as.numeric(slope), ca = as.numeric(ca), thal = as.numeric(thal))

# reading a file containing the data in processed.va.data
# (VA stands for the V.A. Medical Center in Long Beach, not Virginia; the variable name is kept as-is)
virginia_dataset_web <- read_delim(file = url("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.va.data"), 
                                    delim = ",", col_names = first_row_contents) |>
                        mutate(trestbps = as.numeric(trestbps), chol = as.numeric(chol), fbs = as.numeric(fbs), restecg = as.numeric(restecg),
                               thalach = as.numeric(thalach), exang = as.numeric(exang), oldpeak = as.numeric(oldpeak),
                               slope = as.numeric(slope), ca = as.numeric(ca), thal = as.numeric(thal))

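# reading a file containing the data in reprocessed.hungarian.data (space-delimited, unlike the other files)
# NOTE: this file appears to cover the same Hungarian patients as processed.hungarian.data,
# so binding both into one dataset may count those patients twice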
reprocessed_hungarian_dataset_web <- read_delim(file = url("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/reprocessed.hungarian.data"), 
                                    delim = " ", col_names = first_row_contents) |>
                        mutate(trestbps = as.numeric(trestbps), chol = as.numeric(chol), fbs = as.numeric(fbs), restecg = as.numeric(restecg), 
                               thalach = as.numeric(thalach), exang = as.numeric(exang), oldpeak = as.numeric(oldpeak),
                               slope = as.numeric(slope), ca = as.numeric(ca), thal = as.numeric(thal))

global_dataset <- bind_rows(cleveland_dataset_web, hungarian_dataset_web, switzerland_dataset_web, virginia_dataset_web, reprocessed_hungarian_dataset_web)

global_dataset
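
The four `read_delim()` calls above repeat the same column conversions. A minimal sketch of one way to factor that out is shown here, assuming the processed files mark missing values with `?`. The helper name `read_heart_file` and the inline sample values are hypothetical and not part of the original notebook.

```r
library(readr)

# Hypothetical helper: read one headerless heart-disease file and parse every
# column as a number, treating "?" as a missing value.
read_heart_file <- function(file, col_names, delim = ",") {
  read_delim(
    file,
    delim = delim,
    col_names = col_names,
    na = "?",
    col_types = cols(.default = col_double())
  )
}

# Small inline sample standing in for a downloaded file (made-up values).
sample_text <- I("63,1,145,?\n41,0,130,204\n")
sample_names <- c("age", "sex", "trestbps", "chol")

sample_data <- read_heart_file(sample_text, sample_names)
sample_data
```

With a helper like this, each site's file could be read with a single call, passing `url(...)` as `file` and `first_row_contents` as `col_names`.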

Excerpt from the file named heart-disease.names : "Missing Attribute Values: Several.  Distinguished with value -9.0."

In [None]:
# flag every cell that holds the missing-value sentinel -9, then set those cells to NA
na_matrix <- global_dataset == -9

is.na(global_dataset) <- na_matrix

global_dataset
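
The same sentinel replacement can also be written with dplyr verbs. The sketch below uses a made-up toy tibble, not the heart-disease data, and assumes that every column of interest is numeric.

```r
library(dplyr)

# Toy tibble with made-up values; -9 marks a missing measurement.
toy <- tibble(
  age  = c(54, 61, 47),
  chol = c(-9, 230, 198),
  ca   = c(0, -9, -9)
)

# Replace the sentinel with NA in every numeric column.
toy_clean <- toy |>
  mutate(across(where(is.numeric), ~ na_if(.x, -9)))

# Count the missing values per column.
na_counts <- colSums(is.na(toy_clean))
na_counts
```

Here `na_counts` reports 0 missing values for `age`, 1 for `chol`, and 2 for `ca`.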

**Cleaning and wrangling the data into a tidy format:**

**Summarizing the data in at least one table using only training data:**

In [None]:
# The next cell builds a table that reports, for each class (healthy and sick):
#   - the number of observations,
#   - the number of rows with missing values,
#   - the percentage of all observations,
#   - the average age,
#   - the average resting blood pressure,
#   - the average cholesterol,
#   - the average maximum heart rate,
#   - the average ST depression induced by exercise relative to rest,
#   - the average number of major vessels colored by fluoroscopy.

In [None]:
global_dataset <- as_tibble(global_dataset)

global_dataset |>
      rename(Class = num) |>
      mutate(Class = as.factor(Class)) |>
      mutate(Class = fct_recode(Class, "healthy" = "0", "sick" = "1", "sick" = "2", "sick" = "3", "sick" = "4")) |>
      mutate(row_contains_na = (is.na(age) | is.na(sex) | is.na(cp) | is.na(trestbps) | is.na(chol) | is.na(fbs) | is.na(restecg) | is.na(thalach) | is.na(exang) | is.na(oldpeak) | is.na(slope) | is.na(ca) | is.na(thal))) |>
      group_by(Class) |>
      summarize(
         count = n(), 
         num_rows_with_na = sum(row_contains_na),
         percentage = count / nrow(global_dataset) * 100,
         average_age = mean(age, na.rm = TRUE),
         avg_resting_bp = mean(trestbps, na.rm = TRUE),
         avg_cholesterol = mean(chol, na.rm = TRUE),
         avg_max_hr = mean(thalach, na.rm = TRUE),
         avg_oldpeak = mean(oldpeak, na.rm = TRUE),
         avg_ca = mean(ca, na.rm = TRUE)
         )

global_dataset |>
      pivot_longer(cols = c(age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal)) |>
      rename(Class = num) |>
      mutate(Class = as.factor(Class)) |>
      mutate(Class = fct_recode(Class, "healthy" = "0", "sick" = "1", "sick" = "2", "sick" = "3", "sick" = "4")) |>
      group_by(Class, name) |>
      summarize(
         num_missing_values = sum(is.na(value))
      ) |> 
      group_by(Class) |>
      summarize(num_cols_with_na = sum(num_missing_values > 0))


The third column of the first summary tibble (`num_rows_with_na`) shows that 439 of the rows corresponding to healthy observations contain NA values, while 475 of the rows corresponding to sick observations contain NA values.

The second column of the second summary tibble (`num_cols_with_na`) shows that the NA values in the rows corresponding to healthy observations are all located in exactly 9 of the 14 columns, while the NA values in the rows corresponding to sick observations are located in exactly 10 of the 14 columns.
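
The row-level NA flag in the summary code lists every column by hand. As a sketch of a shorter alternative, `if_any()` can test all non-class columns at once. The toy tibble below uses made-up values and is not the heart-disease data.

```r
library(dplyr)

# Toy data with made-up values: one incomplete row in each class.
toy <- tibble(
  Class = factor(c("healthy", "healthy", "sick", "sick")),
  age   = c(54, NA, 61, 47),
  chol  = c(210, 230, NA, 198),
  ca    = c(0, 1, NA, 2)
)

toy_summary <- toy |>
  # TRUE when any column other than Class is NA in that row
  mutate(row_contains_na = if_any(-Class, is.na)) |>
  group_by(Class) |>
  summarize(
    count = n(),
    num_rows_with_na = sum(row_contains_na)
  )

toy_summary
```

Each class has two rows in this toy example, and one row per class contains at least one NA.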

> **Note:**
> 
> Some issues with cleveland.data:
> 1) 'cleveland.data' is not UTF-8 encoded.
> 2) It is unclear whether we should use this data: it contains many negative values, and the website we downloaded it from does not specify which attribute each column corresponds to.
> 3) The number of values in each row varies. This would be less of a problem if we knew what the attributes were, but we do not.

In [None]:
# preparing the dataset for the code that generates visualizations
global_dataset <- global_dataset |>
      # sex: 1 = male, 0 = female
      mutate(sex = as_factor(sex)) |>
      mutate(sex = fct_recode(sex, "male" = "1", "female" = "0")) |>
      # cp: chest pain type
      mutate(cp = as_factor(cp)) |>
      mutate(cp = fct_recode(cp, "typical angina" = "1", "atypical angina" = "2", "non-anginal pain" = "3", "asymptomatic" = "4")) |>
      # fbs: fasting blood sugar over 120 mg/dl
      mutate(fbs = as_factor(fbs)) |>
      mutate(fbs = fct_recode(fbs, "true" = "1", "false" = "0")) |>
      # restecg: resting electrocardiographic results
      mutate(restecg = as_factor(restecg)) |>
      mutate(restecg = fct_recode(restecg, "normal" = "0", "ST-T wave abnormality" = "1", "left ventricular hypertrophy" = "2")) |>
      # exang: exercise-induced angina
      mutate(exang = as_factor(exang)) |>
      mutate(exang = fct_recode(exang, "yes" = "1", "no" = "0")) |>
      # slope: slope of the peak exercise ST segment
      mutate(slope = as_factor(slope)) |>
      mutate(slope = fct_recode(slope, "upsloping" = "1", "flat" = "2", "downsloping" = "3")) |>
      # thal: 3 = normal, 6 = fixed defect, 7 = reversible defect
      mutate(thal = as_factor(thal)) |>
      mutate(thal = fct_recode(thal, "normal" = "3", "fixed" = "6", "reversible" = "7")) |>
      # num: 0 = healthy, 1 to 4 = sick
      mutate(num = as.factor(num)) |>
      mutate(num = fct_recode(num, "healthy" = "0", "sick" = "1", "sick" = "2", "sick" = "3", "sick" = "4"))

In [None]:
options(repr.matrix.max.rows = 2)
global_dataset <- global_dataset |>
      rename(Age = age, Sex = sex, "Chest_Pain_Type" = cp, "Resting_Blood_Pressure" = trestbps, Cholesterol = chol, "Fasting_blood_sugar_over_120_mg/dl" = fbs,
      "Resting_ecg_results" = restecg, "Max_heart_rate" = thalach, "Exercise_induced_angina" = exang, "ST_depression_induced_by_exercise_relative_to_rest" = oldpeak, 
      "slope_of_the_peak_exercise_ST_segment" = slope, "Number_of_major_vessels_colored_by_flourosopy" = ca, "Thalassemia" = thal, Class = num)

global_dataset

In [None]:
# visualizing the data with a plot relevant to the analysis we plan to do using only training data

options(repr.plot.width = 18, repr.plot.height = 18) 

# coloured and grouped-by-shape scatterplot

# Age & Resting_Blood_Pressure 
age_rbp <- global_dataset |> 
    ggplot(aes(x = Age, y = Resting_Blood_Pressure)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Age versus Resting Blood Pressure", x = "Age", y = "Resting Blood Pressure", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
age_rbp

# Age & Cholesterol 
age_chol <- global_dataset |> 
    ggplot(aes(x = Age, y = Cholesterol)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Age versus Cholesterol", x = "Age", y = "Cholesterol", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
age_chol

# Age & Max_heart_rate
age_maxhr <- global_dataset |> 
    ggplot(aes(x = Age, y = Max_heart_rate)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Age versus Max Heart Rate", x = "Age", y = "Max Heart Rate", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
age_maxhr

# Age & ST_depression_induced_by_exercise_relative_to_rest
age_oldpeak <- global_dataset |> 
    ggplot(aes(x = Age, y = ST_depression_induced_by_exercise_relative_to_rest)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Age versus ST Depression", x = "Age", y = "ST Depression", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
age_oldpeak

# Age & Number_of_major_vessels_colored_by_flourosopy
age_ca <- global_dataset |> 
    ggplot(aes(x = Age, y = Number_of_major_vessels_colored_by_flourosopy)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Age versus Number of major vessels colored by fluoroscopy", x = "Age", y = "Number of major vessels colored by fluoroscopy", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
age_ca

# Resting_Blood_Pressure & Cholesterol
rbp_chol <- global_dataset |> 
    ggplot(aes(x = Resting_Blood_Pressure, y = Cholesterol)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Resting Blood Pressure versus Cholesterol", x = "Resting Blood Pressure", y = "Cholesterol", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
rbp_chol

# Resting_Blood_Pressure & Max_heart_rate
rbp_mhr <- global_dataset |> 
    ggplot(aes(x = Resting_Blood_Pressure, y = Max_heart_rate)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Resting Blood Pressure versus Max Heart Rate", x = "Resting Blood Pressure", y = "Max Heart Rate", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
rbp_mhr

# Resting_Blood_Pressure & ST_depression_induced_by_exercise_relative_to_rest
rbp_oldpeak <- global_dataset |> 
    ggplot(aes(x = Resting_Blood_Pressure, y = ST_depression_induced_by_exercise_relative_to_rest)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Resting Blood Pressure versus ST Depression induced by exercise relative to rest", x = "Resting Blood Pressure", y = "ST Depression induced by exercise relative to rest", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
rbp_oldpeak

# Resting_Blood_Pressure & Number_of_major_vessels_colored_by_flourosopy
rbp_ca <- global_dataset |> 
    ggplot(aes(x = Resting_Blood_Pressure, y = Number_of_major_vessels_colored_by_flourosopy)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Resting Blood Pressure versus Number of major vessels colored by fluoroscopy", x = "Resting Blood Pressure", y = "Number of major vessels colored by fluoroscopy", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
rbp_ca

# Cholesterol & Max_heart_rate
chol_mhr <- global_dataset |> 
    ggplot(aes(x = Cholesterol, y = Max_heart_rate)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Cholesterol versus Max Heart Rate", x = "Cholesterol", y = "Max Heart Rate", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
chol_mhr

# Cholesterol & ST_depression_induced_by_exercise_relative_to_rest
chol_oldpeak <- global_dataset |> 
    ggplot(aes(x = Cholesterol, y = ST_depression_induced_by_exercise_relative_to_rest)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Cholesterol versus ST depression induced by exercise relative to rest", x = "Cholesterol", y = "ST depression induced by exercise relative to rest", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
chol_oldpeak

# Cholesterol & Number_of_major_vessels_colored_by_flourosopy
chol_ca <- global_dataset |> 
    ggplot(aes(x = Cholesterol, y = Number_of_major_vessels_colored_by_flourosopy)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Cholesterol versus Number of major vessels colored by fluoroscopy", x = "Cholesterol", y = "Number of major vessels colored by fluoroscopy", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
chol_ca

# Max_heart_rate & ST_depression_induced_by_exercise_relative_to_rest
mhr_oldpeak <- global_dataset |> 
    ggplot(aes(x = Max_heart_rate, y = ST_depression_induced_by_exercise_relative_to_rest)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Max Heart Rate versus ST Depression induced by exercise relative to rest", x = "Max Heart Rate", y = "ST Depression induced by exercise relative to rest", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
mhr_oldpeak

# Max_heart_rate & Number_of_major_vessels_colored_by_flourosopy
mhr_ca <- global_dataset |> 
    ggplot(aes(x = Max_heart_rate, y = Number_of_major_vessels_colored_by_flourosopy)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "Max Heart Rate versus Number of major vessels colored by fluoroscopy", x = "Max Heart Rate", y = "Number of major vessels colored by fluoroscopy", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
mhr_ca

# ST_depression_induced_by_exercise_relative_to_rest & Number_of_major_vessels_colored_by_flourosopy
oldpeak_ca <- global_dataset |> 
    ggplot(aes(x = ST_depression_induced_by_exercise_relative_to_rest, y = Number_of_major_vessels_colored_by_flourosopy)) + 
    geom_point(aes(colour = Class, shape = Class), size = 3) + 
    labs(title = "ST depression induced by exercise relative to rest versus Number of major vessels colored by fluoroscopy", x = "ST depression induced by exercise relative to rest", y = "Number of major vessels colored by fluoroscopy", colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
oldpeak_ca
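
The scatterplots above all share the same layers and differ only in the two plotted columns and their labels. One possible way to reduce the repetition is a small helper that takes the column names as strings. The function name `plot_pair` and the toy data frame are hypothetical and not part of the original notebook.

```r
library(ggplot2)

# Hypothetical helper: scatterplot of two columns, coloured and shaped by Class.
plot_pair <- function(data, x_col, y_col, x_label, y_label) {
  ggplot(data, aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_point(aes(colour = Class, shape = Class), size = 3) +
    labs(title = paste(x_label, "versus", y_label), x = x_label, y = y_label,
         colour = "Heart Disease Diagnosis", shape = "Heart Disease Diagnosis") +
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
}

# Toy data with made-up values, standing in for global_dataset.
toy <- data.frame(
  Age = c(54, 61, 47, 39),
  Cholesterol = c(210, 230, 250, 198),
  Class = factor(c("healthy", "sick", "sick", "healthy"))
)

age_chol_toy <- plot_pair(toy, "Age", "Cholesterol", "Age", "Cholesterol")
age_chol_toy
```

Passing column names as strings through `.data[[...]]` keeps the helper usable in a loop over column pairs.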

In [None]:
# transparent black scatterplot

# Max_heart_rate & ST_depression_induced_by_exercise_relative_to_rest
mhr_oldpeak <- global_dataset |> 
    ggplot(aes(x = Max_heart_rate, y = ST_depression_induced_by_exercise_relative_to_rest)) + 
    geom_point(alpha = 0.3, size = 3) + 
    labs(title = "Max Heart Rate versus ST Depression induced by exercise relative to rest", x = "Max Heart Rate", y = "ST Depression induced by exercise relative to rest") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
mhr_oldpeak

In [None]:
# grouped-by-colour scatterplot

# Max_heart_rate & ST_depression_induced_by_exercise_relative_to_rest
mhr_oldpeak <- global_dataset |> 
    ggplot(aes(x = Max_heart_rate, y = ST_depression_induced_by_exercise_relative_to_rest, colour = Class)) + 
    geom_point(alpha = 0.3, size = 3) + 
    labs(title = "Max Heart Rate versus ST Depression induced by exercise relative to rest", x = "Max Heart Rate", y = "ST Depression induced by exercise relative to rest", colour = "Heart Disease Diagnosis") + 
    theme(text = element_text(size = 20), plot.title = element_text(face = "bold"))
mhr_oldpeak

# Methods #

## Explain how you will conduct your data analysis and which variables/columns you will use. ##
The K-nearest neighbors algorithm will be used to predict whether a patient is healthy or sick, primarily based on their age, cholesterol, and/or resting blood pressure. Other factors, such as type of chest pain, may be explored to determine whether filtering by group would result in a more accurate classification.
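
As a rough illustration of the planned approach, the sketch below fits a K-nearest neighbors classifier on synthetic data. The `class` package, the 75/25 split, and `k = 5` are assumptions made for this example only; the actual analysis may use different tooling and a tuned value of K.

```r
library(class)  # provides knn(); an assumed stand-in for the project's tooling

set.seed(1)

# Synthetic data (made-up values, not the heart-disease data).
n <- 200
toy <- data.frame(
  age      = c(rnorm(n / 2, 50, 8), rnorm(n / 2, 58, 8)),
  chol     = c(rnorm(n / 2, 220, 40), rnorm(n / 2, 250, 40)),
  trestbps = c(rnorm(n / 2, 128, 15), rnorm(n / 2, 136, 15)),
  Class    = factor(rep(c("healthy", "sick"), each = n / 2))
)

# Hold out 25% of the rows for testing.
test_rows <- sample(n, size = n / 4)
train <- toy[-test_rows, ]
test  <- toy[test_rows, ]

# Standardize the predictors using the training means and standard deviations,
# so that no information from the test set leaks into the scaling.
predictors <- c("age", "chol", "trestbps")
train_scaled <- scale(train[, predictors])
test_scaled  <- scale(test[, predictors],
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))

# Classify each test row by majority vote among its 5 nearest training rows.
predicted <- knn(train = train_scaled, test = test_scaled, cl = train$Class, k = 5)

# Share of test rows classified correctly.
accuracy <- mean(predicted == test$Class)
accuracy
```

Standardizing matters here because K-nearest neighbors relies on distances, and cholesterol spans a much wider numeric range than age. Rows with missing values would need to be dropped or imputed before a step like this.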


## Describe at least one way that you will visualize the results ##

Histograms will first be used to determine if there is an association between age, cholesterol, resting blood pressure and sick vs. healthy patients. Then, a scatterplot will be used to plot two of these variables against each other. The points will be colored based on sick vs. healthy. 
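
A minimal sketch of the histogram step is shown here, using synthetic ages rather than the heart-disease data. The bin width of 5 years and the overlaid layout are assumptions made for illustration.

```r
library(ggplot2)

set.seed(2)

# Synthetic ages (made-up values) for 100 healthy and 100 sick patients.
toy <- data.frame(
  Age = c(rnorm(100, 50, 8), rnorm(100, 58, 8)),
  Class = factor(rep(c("healthy", "sick"), each = 100))
)

# Overlaid histograms, one fill colour per class.
age_hist <- ggplot(toy, aes(x = Age, fill = Class)) +
  geom_histogram(binwidth = 5, alpha = 0.5, position = "identity") +
  labs(x = "Age", y = "Number of patients", fill = "Heart Disease Diagnosis") +
  theme(text = element_text(size = 20))

age_hist
```

If the two distributions are clearly shifted relative to each other, the variable is a plausible candidate predictor for the classifier.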




# Expected outcomes and significance #

## What do you expect to find? ##
This study aims to propose an accurate classification model for heart disease prediction using a machine learning classification algorithm, K-nearest neighbors.

## What impact could such findings have? ##
The findings of this study could make an impact in the medical field, since an accurate heart disease prediction model can support early intervention, which may lead to better patient outcomes.

## What future questions could this lead to? ##
Future questions that this could raise involve how the K-nearest neighbors classifier compares to other models. For example, how does the prediction accuracy of the K-nearest neighbors approach compare with that of models based on other machine learning algorithms? What are the advantages and limitations of the K-nearest neighbors method relative to other methods when predicting heart disease?