## DSCI 100 (Mar. 2023) Group Project Final Report

Group 13
- Wenhui (Bessie) Bao, 59773879
- Lily (Hsin Yi) Wang, 76330125
- Sai Gubba, 94736980
- Isabella Paolozza, 81172967

### Addressing BC’s Healthcare Shortage: Prediction of Angiographic Disease (Heart Disease) Status Using KNN Classification

#### Introduction:

Heart disease was one of the 10 leading causes of death worldwide in 2019 (World Health Organization, 2020). These deaths are attributed to numerous potential risk factors, including but not limited to blood pressure, age, and sex. Our research project aims to answer the following question: with what accuracy can KNN classification predict angiographic disease status (mild or severe), using a dataset of 219 patients from Cleveland?

Heart disease datasets from several regions are available at “https://archive.ics.uci.edu/ml/datasets/Heart+Disease”. We use the Cleveland dataset, restricted to patients with a diagnosis value of 0 or 1. This leaves 219 observations from individual patients (each row represents a single patient) recorded for 14 variables (see the definitions of column names in the appendix), with no missing values.

##### Background information (Literature Summary): 
Studies have explored the relationship between certain risk factors and angiographic disease (Benson et al., 2019; Carson et al., 2020; Jousilahti et al., 1999). Factors found to correlate with the disease include age, sex, diabetes, and blood pressure, among others (Benson et al., 2019; Jousilahti et al., 1999). Other attributes, like smoking, are not included in this dataset. Risk factors such as cholesterol levels and chest pain were found not to have a meaningful correlation with angiographic disease (Benson et al., 2019; Carson et al., 2020). The correlating factors are expected to have good predictive power for angiographic disease diagnosis and were considered when choosing predictors for the KNN classifier.

#### Methods

##### Initial Selection of Predictor Variables for our KNN Classification Algorithm

To create a K-nearest neighbors (KNN) classification algorithm for diagnosing the severity of angiographic disease, we first had to choose our predictor variables. We began by reviewing existing scientific literature on factors related to heart disease, to determine which variables could be useful for developing a knn-classification model with maximum accuracy.

Subsequently, the heart disease dataset (Cleveland) was read into an R Jupyter Notebook to be inspected and tidied using the tidyverse package. The original dataset was mostly tidy and had no missing values, so removal of NA values was not necessary. However, all variables, including categorical ones, were of type dbl and the columns were unnamed. The required data wrangling therefore included assigning the data frame to an object, converting categorical variables to fct, and naming columns according to the descriptions from the database.

Next, the dataset was split into a training set and a testing set using set.seed(2023), with 75% of the rows in the training set. Data from the training set were used to build the knn-classification model, while data from the testing set represented “new patients” and were used to test the model’s accuracy, since their true diagnosis is known.

An exploratory data analysis was performed using the training data, which should be reflective of the overall dataset. The size of the dataset was identified, as well as the mean, standard deviation, minimum and maximum of each numerical variable; these characteristics were considered when scaling and when choosing the number of folds for cross validation.

Graphs were created, plotting each variable against diagnosis. Histograms were used to visualize the distribution of numerical variables, while bar plots were used to visualize the proportions of categorical attributes between mild and severe patients. Potential efficacy of a variable was visually determined by looking for differences in distributions and proportions in attributes between patient groups (patients of mild vs. severe diagnosis).
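
The within-group proportions described above can be computed without hard-coding the size of each diagnosis group. A minimal sketch in base R, assuming a small made-up data frame (`toy` and its values are illustrative only and are not rows from the Cleveland data):

```r
# Toy data: hypothetical values for illustration only
toy <- data.frame(
    sex = factor(c(0, 1, 1, 1, 0, 1, 1, 0)),
    num = factor(c(0, 0, 0, 1, 0, 1, 1, 0))
)

# Counts of each sex within each diagnosis group
counts <- table(sex = toy$sex, num = toy$num)

# margin = 2 divides each column by its own total, so the proportions
# within every diagnosis group sum to 1 regardless of the group's size
props <- prop.table(counts, margin = 2)
props
```

Because each column is normalized by its own total, the two groups can be compared directly even when one group is much larger than the other.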

##### Final Selection of Predictor Variables for our KNN Classification Algorithm

For the next steps, numerical variables were necessary in order for the knn-classification model to calculate distances between points. Factor variables were converted back to numeric. 
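
One point to keep in mind with this conversion is that `as.numeric()` applied to a factor returns the integer level codes rather than the original labels. For a two-level factor this is only a shift (0/1 becomes 1/2), which disappears once the predictors are centered and scaled. For factors with more levels, the codes follow the order of the levels, which may not match the numeric order of the labels. The sketch below uses a made-up factor (`thal_example` is a hypothetical name) to show the difference:

```r
# Hypothetical factor whose levels are stored in order of first appearance
thal_example <- factor(c("7", "3", "6", "3"), levels = c("7", "3", "6"))

# as.numeric() on a factor returns the integer level codes, not the labels
codes <- as.numeric(thal_example)

# Converting through character first recovers the original numbers
values <- as.numeric(as.character(thal_example))

codes
values
```

Here `codes` is 1, 2, 3, 2 while `values` is 7, 3, 6, 3, so distances computed from the codes can differ from distances computed from the original values.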

Using the 8 initially selected variables, those that showed different distributions between mild and severe diagnoses, the for-loop method was performed to assess the predictive power of each variable. This method builds up combinations of variables, fits each combination with a model workflow using a tuned (optimal) k value, then evaluates the maximum accuracy of each combination using 10-fold cross validation. We chose 10 folds instead of 5 because, with a relatively small training dataset, a higher number of folds should provide a more robust estimate of the model's accuracy. We visually inspected the resulting accuracies in a table. Variables that decreased the accuracy of the classifier, or did not contribute a meaningful increase in accuracy, were omitted from the final selection of variables.
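
For reference, forward selection as it is usually described grows the predictor set one variable at a time: at each step, every remaining predictor is tried alongside those already selected, and the one giving the highest cross-validated accuracy is kept. The following is a minimal base R sketch on simulated data, not the tidymodels workflow used in this report. The data, the helper names `knn_predict` and `cv_accuracy`, and the fixed k = 5 (no tuning) are all assumptions made for illustration.

```r
set.seed(2023)

# Simulated data (illustrative assumption): x1 and x2 carry signal, x3 is pure noise
n <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0, "severe", "mild"))

# Classify each test row by majority vote among its k nearest training rows
knn_predict <- function(train_x, train_y, test_x, k) {
    apply(test_x, 1, function(row) {
        d <- sqrt(rowSums(sweep(train_x, 2, row)^2))   # Euclidean distances
        votes <- table(train_y[order(d)[1:k]])
        names(votes)[which.max(votes)]
    })
}

# Mean accuracy across folds for one set of predictors
cv_accuracy <- function(data, predictors, folds, k = 5) {
    accs <- sapply(sort(unique(folds)), function(f) {
        train <- data[folds != f, ]
        test <- data[folds == f, ]
        # standardize using the training fold's mean and sd only
        mu <- colMeans(train[, predictors, drop = FALSE])
        s <- apply(train[, predictors, drop = FALSE], 2, sd)
        train_x <- scale(train[, predictors, drop = FALSE], center = mu, scale = s)
        test_x <- scale(test[, predictors, drop = FALSE], center = mu, scale = s)
        mean(knn_predict(train_x, train$y, test_x, k) == test$y)
    })
    mean(accs)
}

predictor_names <- c("x1", "x2", "x3")
folds <- sample(rep(1:10, length.out = n))   # one fold assignment shared by all models
selected <- c()
rows <- list()

for (i in seq_along(predictor_names)) {
    remaining <- setdiff(predictor_names, selected)
    # try adding each remaining predictor to the ones already selected
    accs <- sapply(remaining, function(p) cv_accuracy(toy, c(selected, p), folds))
    selected <- c(selected, remaining[which.max(accs)])
    rows[[i]] <- data.frame(size = i,
                            model = paste(selected, collapse = " + "),
                            accuracy = max(accs))
}

results <- do.call(rbind, rows)
results
```

Reusing the same fold assignment for every candidate model keeps the accuracy comparison between predictor sets on equal footing.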

##### Finalizing and Testing Accuracy of our KNN Classification Algorithm

Using our final selection of predictors (age, sex, exang, restecg, thal; see definitions in appendix), we created a classification model to tune. With a recipe that scales and centers the selected predictors so that they are on a comparable scale, we tested 10 different k values between 1 and 20 (the odd values 1, 3, ..., 19), then extracted and visualized the resulting accuracy with a line plot. After determining the k value that optimizes accuracy, we created our final classification model specified with k = 15. This algorithm was then tested using observations from “new patients” (i.e. the testing set). The predictions were bound to the testing data frame, and accuracy was computed and then summarized using a confusion matrix, since the large number of predictors makes a single plot of all of them impractical. An additional pair of scatterplots for true vs. predicted diagnosis, using age and thal (which had the most variation in values) and color coded by diagnosis, was used to visualize the accuracy of the predicted diagnosis.
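
To illustrate why the predictors are scaled and centered before distances are computed, the sketch below compares Euclidean distances on raw and standardized values. It uses base R only, and the three patients are made-up values, not observations from the dataset.

```r
# Three hypothetical patients: age in years, sex coded 0/1
patients <- data.frame(age = c(40, 60, 41), sex = c(0, 0, 1))

# Unscaled: the 20-year age gap dominates the 0/1 difference in sex
raw_dist <- as.matrix(dist(patients))

# scale() centers each column to mean 0 and rescales it to standard deviation 1
scaled_dist <- as.matrix(dist(scale(patients)))

raw_dist[1, ]      # distances from patient 1 to patients 1, 2 and 3
scaled_dist[1, ]
```

On the raw values, patient 1 is 20 units from patient 2 but only about 1.41 units from patient 3, so age alone decides who counts as a near neighbor. After standardizing, the two distances are of similar size, so sex and age contribute on a comparable scale.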

##### Corresponding Coding for Data Analysis

In [None]:
## attaching R packages
library(tidyverse)
library(repr)
library(dplyr)
install.packages('tidymodels')
library(tidymodels)
library(RColorBrewer)
install.packages('corrplot')
library(corrplot)
library(gridExtra) # provides grid.arrange(), used to place the final scatterplots side by side

In [None]:
##read in dataset, preview and wrangle it
    #name columns according to variable description from database
    #all variables are dbl in the original dataset, convert appropriate ones to factor
    #there are no missing values, no need to remove NA's
    #otherwise, the dataset is tidy

heart_disease <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",col_names=FALSE)
colnames(heart_disease) <- c('age','sex','chest_pain_type','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num')
heart_disease <- heart_disease |> 
        filter(num == 0 | num == 1) |>
        mutate(num = as_factor(num)) |>
        mutate(sex = as_factor(sex)) |>
        mutate(chest_pain_type = as_factor(chest_pain_type)) |>
        mutate(fbs = as_factor(fbs)) |>
        mutate(restecg = as_factor(restecg)) |>
        mutate(exang = as_factor(exang)) |>
        mutate(slope = as_factor(slope)) |>
        mutate(ca = as_factor(ca)) |>
        mutate(thal = as_factor(thal))

##### Table 1: Preview of Tidied Data Set

In [None]:
head(heart_disease, n = 5)

In [None]:
## find size of data set
rows <- nrow(heart_disease)
columns <- ncol(heart_disease)
    # there are 219 rows of observations and 14 columns, 13 variables and one column of diagnosis

##check for missing values
missing_values <- any(is.na(heart_disease))
sum_table=rbind(rows, columns, missing_values)

##### Table 2: Size and Number of Missing Values of Data Set

In [None]:
sum_table

In [None]:
## split data into testing and training sets, 75% in training set
set.seed(2023)
heart_split = initial_split(heart_disease, prop=0.75, strata = num)
heart_training = training(heart_split)
heart_testing = testing(heart_split)

##check specific sizes of training and testing data
train_rows <- nrow(heart_training) #164 rows
test_rows <- nrow(heart_testing)  #55 rows

In [None]:
##summary statistics table for all interested numerical variables:

heart_training_numerics <- select_if(heart_training, is.numeric)

    # one row per statistic, one column per numerical variable
    # built in a single step so that no variables named mean, sd, max or min mask the functions of the same name
summary_table_1 <- as.data.frame(rbind(
    mean = sapply(heart_training_numerics, mean, na.rm = TRUE),
    sd = sapply(heart_training_numerics, sd, na.rm = TRUE),
    max = sapply(heart_training_numerics, max, na.rm = TRUE),
    min = sapply(heart_training_numerics, min, na.rm = TRUE)))

##checking size of training set and any missing values
training_rows <- nrow(heart_training)
missing <- any(is.na(heart_training))
summary_table_2 = rbind(training_rows, missing)

##### Tables 3 and 4: Summary Statistics of Training Set

In [None]:
summary_table_1
summary_table_2

In [None]:
##choosing predictors using visualizations
##the visualizations below show potentially meaningful differences in distribution for the following predictors
##We used proportions of patients as the y-axis to better compare the characteristics of the two groups, since the group sizes differ.

# Patients in the age range [50, 60] are most likely to have mild angiographic disease, whereas patients about 60 years old are most likely to have
# severe angiographic disease.
age_plot <- heart_training |>
    ggplot(aes(x = age, fill = num)) +
    geom_histogram(aes(y = 100*after_stat(count)/sum(after_stat(count))),binwidth = 5) +
    facet_grid(rows = vars(num),labeller = labeller(num = c("0" = "Mild Angiographic Disease", "1" = "Severe Angiographic Disease"))) +
    labs(x = "Age (years)", y = "Percentage of Patients", fill = "Diagnosis\n0: Mild\n1: Severe") +
    ggtitle("Graph 1: Age distribution of Patients Upon Admission to Hospital") +
    theme(text = element_text(size = 15)) +
    scale_fill_brewer(palette = "Reds")


# More males tend to have severe angiographic disease 
sex_plot <- heart_training |>
    group_by(sex, num) |>
    summarize(count = n()) |>
    mutate(prop = ifelse(num == 0, count/123, count/41)) |>
    ggplot(aes(x = sex, y = prop, fill = sex)) +
    geom_bar(stat = 'identity', position = 'dodge') +
    labs(x = 'Biological Sex',
    y = 'Proportion of Patients',
    fill = "Biological Sex\n0: female\n1: male") +
    facet_grid(cols = vars(num),labeller = labeller(num = c("0" = "Mild Angiographic Disease", "1" = "Severe Angiographic Disease"))) +
    ggtitle("Graph 2: Sex Proportion of Patients") +
    theme(text = element_text(size = 15)) +
    scale_fill_brewer(palette = "Reds")

# Asymptomatic chest pain is the most prevalent type in the severe patient group, whereas non-anginal pain is the most common type in the mild patient group.
pain_plot <- heart_training |>
    group_by(chest_pain_type, num) |>
    summarize(count = n()) |>
    mutate(prop = ifelse(num == 0, count/123, count/41)) |>
    ggplot(aes(x = chest_pain_type, y = prop, fill = chest_pain_type)) +
    geom_bar(stat = 'identity', position = 'dodge') +
    labs(x = 'Chest Pain Type', y = 'Proportion of People with this Symptom', 
         fill = "Chest Pain Type\n1: typical angina\n2: atypical angina\n3: non-anginal pain\n4: asymptomatic") +
    facet_grid(cols = vars(num), labeller = labeller(num = c("0" = "Mild Angiographic Disease", "1" = "Severe Angiographic Disease"))) +
    ggtitle("Graph 3: Proportion of Chest Pain Type Symptoms of Patients") +
    theme(text = element_text(size = 15)) +
    scale_fill_brewer(palette = "Reds")


# Observably higher proportion of severe patients with fasting blood sugar less than 120 mg/dl.
fbs_plot <- heart_training |>
    group_by(fbs, num) |>
    summarize(count = n()) |>
    mutate(prop = ifelse(num == 0, count/123, count/41)) |>
    ggplot(aes(x = fbs, y = prop)) +
    geom_bar(stat = "identity", position = "dodge", aes(fill = fbs)) +
    labs(x = "Fasting Blood Sugar", y = "Proportion of Patients",
    fill = "Fasting Blood Sugar\n0: Less than 120 mg/dl\n1: More than 120 mg/dl") +
    facet_grid(cols = vars(num),labeller = labeller(num = c("0" = "Mild Angiographic Disease", "1" = "Severe Angiographic Disease"))) +
    ggtitle("Graph 4: Fasting Blood Sugar of Patients") +
    theme(text = element_text(size = 15)) +
    scale_fill_brewer(palette = "Reds")


# Higher proportion of severe patients with left ventricular hypertrophy.
ecg_plot <- heart_training |>
    group_by(restecg, num) |>
    summarize(count = n()) |>
    mutate(prop = ifelse(num == 0, count/123, count/41)) |>
    ggplot(aes(x = restecg, y = prop)) +
    geom_bar(stat = "identity", position = "dodge", aes(fill = restecg)) +
    labs(x = "Resting Electrocardiographic Results", y = "Proportion of People with this Symptom",
         fill = "Resting Electrocardiographic Results\n0: Normal\n1: ST-T Wave Abnormality\n2: Shows Left Ventricular Hypertrophy") +
    facet_grid(cols = vars(num),labeller = labeller(num=c("0" = "Mild Angiographic Disease", "1" = "Severe Angiographic Disease"))) +
    ggtitle("Graph 5: Resting Electrocardiographic Results") +
    theme(text = element_text(size = 15)) +
    scale_fill_brewer(palette = "Reds")


# Higher proportion of severe patients with exercise-induced angina (exang = 1).
exang_plot <- heart_training |>
    group_by(exang, num) |>
    summarize(count = n()) |>
    mutate(prop = ifelse(num == 0, count/123, count/41)) |>
    ggplot(aes(x = exang, y = prop, fill = exang)) +
    geom_bar(stat = "identity", position = "dodge") +
    labs(x = "whether a patient experiences exercise induced angina",
    y = "Proportion of Patients with this Symptom",
    fill = "exercise induced angina\n0: asymptomatic\n1: symptomatic") +
    facet_grid(cols = vars(num),labeller = labeller(num=c("0" = "Mild Angiographic Disease", "1" = "Severe Angiographic Disease"))) +
    ggtitle("Graph 6: Presence of exercise induced angina as a symptom") +
    theme(text = element_text(size = 15)) +
    scale_fill_brewer(palette = "Reds")


# Higher proportion of severe patients with a flat slope of the peak exercise ST segment
slope_plot <- heart_training |>
    group_by(slope, num) |>
    summarize(count = n()) |>
    mutate(prop = ifelse(num == 0, count/123, count/41)) |>
    ggplot(aes(x = slope, y = prop, fill = slope)) +
    geom_bar(stat = "identity", position = "dodge") +
    labs(x = "type of slope of the peak exercise ST segment", y = "Proportion of Patients",
         fill = "slope of the peak exercise ST segment\n1: upsloping\n2: flat\n3: downsloping") +
    facet_grid(cols = vars(num),labeller = labeller(num=c("0" = "Mild Angiographic Disease", "1" = "Severe Angiographic Disease"))) +
    ggtitle("Graph 7: Type of Slope of the Peak Exercise ST Segment of Patients") +
    theme(text = element_text(size = 15)) +
    scale_fill_brewer(palette = "Reds")

# The most common blood vessel status in the severe patient group is "Reversible Defect", whereas "Normal" is the most common
# in the mild patient group
thal_plot <- heart_training |>
    filter(thal!='?') |>
    group_by(thal, num) |>
    summarize(count = n()) |>
    mutate(prop = ifelse(num == 0, count/123, count/41)) |>
    ggplot(aes(x = thal, y = prop, fill = thal)) +
    geom_bar(stat = "identity", position = "dodge") +
    labs(x = "Thalassemia Diagnosis",
    y = "Proportion of Patients with this Symptom",
    fill = "Thalassemia Diagnosis \n3: Normal\n6: Fixed Defect\n7: Reversible Defect") +
    facet_grid(cols = vars(num), labeller = labeller(num=c("0" = "Mild Angiographic Disease", "1" = "Severe Angiographic Disease"))) +
    ggtitle("Graph 8: Thalassemia Diagnosis in Patients with Angiographic Disease") +
    theme(text = element_text(size = 15)) +
    scale_fill_brewer(palette = "Reds")

In [None]:
age_plot
sex_plot
pain_plot
fbs_plot
ecg_plot
exang_plot
slope_plot
thal_plot

In [None]:
set.seed(2023)
##converting all selected variables back to numeric so that the knn-classification model can calculate distances between points
##note: as.numeric() on a factor returns the integer level codes (1, 2, ...) rather than the original labels
heart_training = heart_training |>
    mutate(sex = as.numeric(sex)) |>
    mutate(chest_pain_type = as.numeric(chest_pain_type)) |>
    mutate(fbs = as.numeric(fbs)) |>
    mutate(restecg = as.numeric(restecg)) |>
    mutate(exang = as.numeric(exang)) |>
    mutate(slope = as.numeric(slope)) |>
    mutate(ca = as.numeric(ca)) |>
    mutate(thal = as.numeric(thal))

heart_testing = heart_testing |>
    mutate(sex = as.numeric(sex)) |>
    mutate(chest_pain_type = as.numeric(chest_pain_type)) |>
    mutate(fbs = as.numeric(fbs)) |>
    mutate(restecg = as.numeric(restecg)) |>
    mutate(exang = as.numeric(exang)) |>
    mutate(slope = as.numeric(slope)) |>
    mutate(ca = as.numeric(ca)) |>
    mutate(thal = as.numeric(thal))

In [None]:
##code for draft model with initially selected predictors:
set.seed(2023)

heart_training_interested <- heart_training |>
    select(sex, chest_pain_type, age, fbs, exang, slope, thal, restecg)

predictor_names <- colnames(heart_training_interested)

# create an empty tibble to store the results 
accuracies <- tibble(size = integer(),                      
                     model_string = character(),                    
                     accuracy = numeric())

# create a model specification 
knn_spec <- nearest_neighbor(weight_func = "rectangular",neighbors = tune()) |>      
  set_engine("kknn") |>      
  set_mode("classification")

# create a 10-fold cross-validation object 
heart_vfold <- vfold_cv(heart_training, v = 10, strata = num)

# store the total number of predictors 
n_total <- length(predictor_names)

# stores selected predictors 
selected <- c()

# for every size from 1 to the total number of predictors 
for (i in 1:n_total) {     
  # for every predictor still not added yet     
  accs <- list()     
  models <- list()     
  for (j in 1:length(predictor_names)) {
      
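    # create a model string for this combination of predictors
    # note: preds_new does not depend on j, so every pass of the inner loop evaluates the same model,
    # namely the first i predictors in the fixed order listed here
    preds_new <- c("sex", 'chest_pain_type', 'age', 'fbs', 'exang', 'slope', 'thal','restecg')[1:i]
    model_string <- paste("num", "~", paste(preds_new, collapse="+"))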
   
     # create a recipe from the model string         
    heart_recipe <- recipe(as.formula(model_string), data = heart_training) |>
        step_scale(all_predictors()) |>       
        step_center(all_predictors())
    
    # tune the KNN classifier with these predictors,         
    # and collect the accuracy for the best K         
    acc <- workflow() |>           
      add_recipe(heart_recipe) |>           
      add_model(knn_spec) |>           
      tune_grid(resamples = heart_vfold, grid = 10) |>   
      collect_metrics() |>           
      filter(.metric == "accuracy") |>           
      summarize(mx = max(mean))         
    acc <- acc$mx |> unlist()
    
      # add this result to the dataframe         
    accs[[j]] <- acc         
    models[[j]] <- model_string     
  }     
  jstar <- which.max(unlist(accs))     
  accuracies <- accuracies |>       
    add_row(size = i,               
            model_string = models[[jstar]],               
            accuracy = accs[[jstar]])     
  selected <- c(selected, predictor_names[[jstar]])     
  predictor_names <- predictor_names[-jstar] 
  } 

##### Table 5: Accuracy of Predictor Combinations

In [None]:
accuracies

In [None]:
    # A drop in accuracy after adding fbs and slope is unexpected; potentially a multicollinearity problem
##creating a correlation plot to check for correlation between variables
heart_training_interested <- heart_training |> 
    select(sex, age, exang, thal, restecg)

##### Graph 9: Correlation Plot of 5 Selected Final Predictors

In [None]:
corrplot(cor(heart_training_interested))

In [None]:
set.seed(2023)

##Refining classification model of manually selected predictors after observing accuracies to have optimal k value

    ##create recipe with scaled predictors
recipe <- recipe(num ~ sex+age+exang+thal+restecg, heart_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors()) 

    ##creating model specifications
model <- nearest_neighbor(weight_func='rectangular',neighbors=tune()) |>
    set_engine('kknn') |>
    set_mode('classification')

    ##preparing tibble for cross validation with 10 different k values and 10 folds
k_vals <- tibble(neighbors=seq(from = 1, to = 20, by = 2)) 

data_vfold <- vfold_cv(heart_training, v = 10, strata = num)

In [None]:
set.seed(2023)

##putting together model specifications with the recipe to create a workflow, conduct cross validation and extract accuracies
knn_results <- workflow() |>
    add_recipe(recipe) |>
    add_model(model) |>
    tune_grid(resamples = data_vfold, grid=k_vals) |>
    collect_metrics()

accuracies_k = knn_results |>
    filter(.metric=='accuracy')

    ##creating line graph to visually choose optimal k-value
kneighbors <- accuracies_k |>
    ggplot(aes(x=neighbors,y=mean)) + 
    geom_point() + 
    geom_line() +
    labs(x = "k values", y = "Accuracy of Classifier (0 to 1)") +
    theme(text = element_text(size = 15))
# Optimal k = 15

##### Table 6: Accuracy of Classifier with Different K Values

In [None]:
accuracies_k

##### Graph 10: Accuracy of Classifier with Different K Values

In [None]:
kneighbors

#### Results

- Variables chest_pain_type, fbs and slope were found not to contribute significantly to an increase in accuracy.
- Based on the k values plot, k = 15 optimizes the accuracy of the classifier using the final selected predictors: age, sex, exang, restecg, thal.
- With the aforementioned predictor combination and k = 15, the model had approximately 75% accuracy when tested on "new patients" from the testing set; it accurately diagnosed 43 out of 55 "new patients," misdiagnosed 10 severe patients as mild, and misdiagnosed 2 mild patients as severe.
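
Accuracy is the fraction of test patients whose predicted diagnosis matches the true diagnosis. As a quick arithmetic sketch, the counts quoted in the last bullet can be combined as follows (the variable names are illustrative):

```r
# Counts as quoted in the Results bullet above
correct <- 43
severe_as_mild <- 10
mild_as_severe <- 2

total <- correct + severe_as_mild + mild_as_severe   # 55 test patients
accuracy <- correct / total
round(accuracy, 3)
```

With these counts the fraction works out to about 0.78.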

In [None]:
set.seed(2023)

##creating final model using optimized k value
model_known_k <- nearest_neighbor(weight_func = 'rectangular',neighbors = 15) |>
        set_engine('kknn') |>
        set_mode('classification')

data_fit <- workflow() |>
    add_recipe(recipe) |>
    add_model(model_known_k) |>
    fit(data = heart_training)

data_test_predictions <- predict(data_fit , heart_testing) |>
       bind_cols(heart_testing)

data_prediction_accuracy <- data_test_predictions |>
    metrics(truth = num, estimate = .pred_class)  |>
    filter(.metric == "accuracy")     

data_conf_mat <- data_test_predictions |>
    conf_mat(truth = num, estimate = .pred_class)

##displaying accuracy of predictions
data_prediction_accuracy
data_conf_mat

In [None]:
##creating colour-coded scatterplot of true vs. predicted diagnosis
true_diagnosis_plot <- data_test_predictions |>
    ggplot(aes(y = age, x = thal, color = num)) +
    geom_point() +
    labs(x = "Thalassemia Diagnosis", y = "Age", color = "Angiographic Diagnosis\n0: Mild\n1: Severe") +
    ggtitle("True Diagnosis") +
    theme(text = element_text(size = 15))

pred_results_plot <- data_test_predictions |>
    ggplot(aes(y = age, x = thal, color = .pred_class)) +
    geom_point() +
    labs(x = "Thalassemia Diagnosis", y = "Age", color = "Angiographic Diagnosis\n0: Mild\n1: Severe") +
    ggtitle("Diagnosis Predicted using KNN-Classification") +
    theme(text = element_text(size = 15))
##arranging side by side
options(repr.plot.width = 15)
grid.arrange(true_diagnosis_plot, pred_results_plot, ncol=2)

#### Discussion

In this project, we found that the final, most accurate knn-classification model was trained on the training dataset of 164 patients, with 5 predictors (sex, age, exang, thal, and restecg) and k = 15 nearest neighbors, and yielded a prediction accuracy of approximately 75% on the testing dataset.

Some of the outcomes listed above were unexpected. First, an accuracy of 75% is not so low that the model is useless, but it is likely not sufficient for reliable use in a clinical setting (on average, 25 out of 100 patients may be misdiagnosed using this model). Second, the five predictor variables selected using the for-loop method were not the same variables that the scholarly literature suggested would contribute to a higher accuracy (history of diabetes (fbs), blood pressure (trestbps)). One potential explanation for these unexpected results is multicollinearity among many of the variables. For example, sex and history of thalassemia (thal) were found to be heavily correlated with each other, which suggests that using both predictors adds little predictive power beyond using either one alone. Such variables were nevertheless found to slightly increase the accuracy of the classifier, so they are presumably still correlated with the diagnosis. Additionally, many of the correlated variable pairs or groups were used together in the final model. The overall accuracy was decent, but the correlation among these predictors suggests that they are not truly independent, which may have contributed to the modest level of predictive accuracy.

The dataset used lacks variables with strong explanatory or predictive ability. Most variables were found to correlate with each other, to not correlate with the diagnosis of angiographic disease, or to correlate with the diagnosis only slightly, which may explain why the accuracy is not exceptionally high. We recommend collecting more variables, such as a factor variable for smoking history, which according to the literature may cause heart disease, to increase the prediction accuracy in further research.

This knn-classification model is not likely to be of significant use to healthcare providers. A classification model used to diagnose angiographic disease should ideally have an accuracy close to that of the actual diagnostic procedure; using the model developed in this project would result in misdiagnosis in about 25% of cases. However, the results of this project can inform future attempts to construct such a classifier. Future datasets should measure attributes that correlate more strongly with the severity of angiographic disease, or that even cause it, so that a knn-model can be trained more accurately.

A more in-depth analysis of the relationships between the variables should take place prior to developing a model, to account for our issue with multicollinearity. Moreover, beyond the statistical correlation between variables, the causal relationship between the variables in our dataset and a diagnosis could be examined further. Additionally, knn-classification may simply not have been the most effective model, as other predictive models exist that could better account for the kinds of issues that we encountered.

#### References

Benson R. A. et al. (2019). The Relationship Between Severity of Arterial Disease Quantified by the Bollinger Angiographic Scoring Method, Presenting Symptoms and Signs and Preoperative Comorbidities in the Bypass Versus Angioplasty in Severe Ischaemia of the Leg (BASIL) Trial. *European Journal of Vascular & Endovascular Surgery*, 58(6), e483-e484.

Carson J. A. S. et al. (2020). Dietary Cholesterol and Cardiovascular Risk: A Science Advisory From the American Heart Association. *Circulation*, 141(3), e39-e53. https://doi.org/10.1161/CIR.0000000000000743

Janosi A. et al. (1988, July 1). Heart Disease Data Set [Dataset]. UCI Machine Learning Repository. Retrieved from: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

Jousilahti P. et al. (1999). Sex, Age, Cardiovascular Risk Factors, and Coronary Heart Disease: A Prospective Follow-Up Study of 14 786 Middle-Aged Men and Women in Finland. *Circulation*, 99(9), 1165-1175. https://doi.org/10.1161/01.CIR.99.9.1165

World Health Organization. (2020). *The top 10 causes of death.* World Health Organization. Retrieved March 5, 2023, from https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death

#### Appendix

##### Definitions and Descriptions of Variables (in order of column number from left to right)
note: bolded variables are those used as predictors in our final knn-classifier model, along with the predicted attribute
1. **age**: measured in years
2. **sex**: 0 denotes female, 1 denotes male
3. chest_pain_type: 1 denotes typical angina (caused by insufficient blood flow to heart muscles), 2 denotes atypical angina, 3 denotes non-anginal pain, 4 denotes no chest pain
4. trestbps: resting blood pressure, measured in mm Hg upon admission to hospital
5. chol: serum cholesterol level, measured in mg/dl
6. fbs: fasting blood sugar, 0 denotes less than 120 mg/dl, 1 denotes more than 120 mg/dl
7. **restecg**: resting electrocardiograph results, 0 denotes normal, 1 denotes ST-T wave abnormality, 2 denotes left ventricular hypertrophy
8. thalach: maximum heart rate achieved (number of heartbeats per minute)
9. **exang**: exercise-induced angina (chest pain), 0 denotes asymptomatic, 1 denotes symptomatic
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: type of slope at the peak of the ST segment, units not specified
12. ca: number of major vessels colored by fluoroscopy, values 0 to 3 (whole numbers)
13. **thal**: undefined attribute that we assume represents thalassemia, 3 denotes normal, 6 denotes fixed defect, 7 denotes reversible defect
14. **num** (the predicted attribute): diagnosis of angiographic disease status, 0 denotes mild (less than 50% diameter narrowing), 1 denotes severe (more than 50% diameter narrowing)