Introduction

Our study explores whether a patient's gender can be identified from heart disease data, given that biological differences between male and female patients may shift the normal ranges of various predictors. Using the K-nearest neighbors classification method, we predict a patient's gender from a subset of variables in the heart disease dataset: disease classification, cholesterol level, resting electrocardiogram (ECG) result, and maximum heart rate achieved. Although the dataset contains many more variables, our analysis focuses exclusively on these selected predictors. The dataset contains 303 observations, which we split into training and testing sets.

Citation: Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1988). Heart Disease. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X

Methods

In [1]:
library(repr)
library(tidyverse)
library(rvest)
library(tidymodels)
options(repr.matrix.max.rows = 6)


In [None]:
url <- "https://raw.githubusercontent.com/victoriachoi7/group-4-dsci/main/processed.cleveland.data"
cleveland_data <- read_csv(url, col_names = FALSE)
colnames(cleveland_data) <- c("age", "sex", "chest_pain_type", "resting_bp", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

cleveland_data

From this table, we isolate the variables we are interested in: sex, disease classification (num), cholesterol level (chol), resting electrocardiogram (ECG) result (restecg), and maximum heart rate achieved (thalach). This step is part of the wrangling process. We also recode the sex variable, stored as 0 and 1, into the labels female and male for clarity.

In [None]:
# Keep only the columns of interest and label sex as female/male
cleveland_wrangled <- cleveland_data |>
  select(sex, num, chol, restecg, thalach) |>
  mutate(sex = as_factor(sex)) |>
  mutate(sex = fct_recode(sex, "female" = "0", "male" = "1"))
cleveland_wrangled

Next, we summarize the wrangled data to get a preliminary look at its distribution and averages. The first summary counts the female and male observations and reports each as a percentage of the total.

In [None]:
num_obs <- nrow(cleveland_wrangled)
cleveland_obs <- cleveland_wrangled |>
        group_by(sex)|>
        summarize(count = n(), percentage = n() / num_obs * 100)
cleveland_obs

We see here that male observations outnumber female observations in the data set.
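Because the classes are imbalanced, it can help to know how accurate a classifier would be if it always predicted the more common class. The sketch below computes that majority-class baseline. The labels are hypothetical and are not the Cleveland counts; with the real data, `cleveland_wrangled$sex` would take the place of `toy_sex`.

```r
# Hypothetical labels for illustration only; NOT the Cleveland counts.
toy_sex <- factor(c(rep("male", 7), rep("female", 3)),
                  levels = c("female", "male"))

# Share of each class in the data.
class_share <- prop.table(table(toy_sex))

# Accuracy of a classifier that always predicts the most common class.
baseline_accuracy <- max(class_share)
majority_class <- names(class_share)[which.max(class_share)]

print(class_share)
cat("Majority class:", majority_class, "\n")
cat("Baseline accuracy:", baseline_accuracy, "\n")
```

A model's accuracy is easier to interpret when read against this baseline rather than against 50 percent.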

Secondly, we can find the mean values of each of the variables to get a feel for how the data varies for each sex.

In [None]:
cleveland_sum <- cleveland_wrangled |>
        group_by(sex)|>
        summarize(chol_mean = mean(chol), num_mean = mean(num), restecg_mean = mean(restecg), thalach_mean = mean(thalach))
cleveland_sum

The means show where each variable tends to sit for each sex. Although num and restecg take only integer values, their means still indicate which values are more common among the female and male patients respectively.

Finally, using the next step, we can see if there are any missing values that might end up being a problem in our data analysis.

In [None]:
cleveland_missing <- sum(is.na(cleveland_wrangled))
cleveland_missing

We can see that there are no missing values detected in our data. Thus, we can go ahead without any extra steps and use the wrangled data for our next steps.
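A single total can hide where missing values sit, and `is.na()` only catches values that were read in as `NA`. The sketch below, on a hypothetical toy data frame, counts missing values per column and also counts a placeholder string such as "?". Whether any column of the real file uses such a placeholder is an assumption that would need to be checked against the raw data.

```r
# Toy data frame with hypothetical values, mimicking a raw file in which
# some missing entries were written as "?" instead of NA.
toy <- data.frame(
  chol = c(233, 286, NA, 250),
  ca   = c("0.0", "?", "2.0", "?"),
  stringsAsFactors = FALSE
)

# Count true NA values in each column.
na_per_column <- colSums(is.na(toy))

# Count "?" placeholders in each column; is.na() alone would miss these.
placeholder_per_column <- sapply(
  toy,
  function(column) sum(column == "?", na.rm = TRUE)
)

print(na_per_column)
print(placeholder_per_column)
```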

Our next step is to visualize the data we plan to analyze. We prepared six scatterplots, one for each pair of variables, with the genders shown in different colours to see whether males and females form separate groupings. Only the first plot (cholesterol against maximum heart rate) is rendered below; the other five are left commented out. Essentially, we are looking for trends that hint at how the formal analysis will go.

In [None]:
options(repr.plot.width = 6, repr.plot.height = 6)
cleveland_viz_1 <- cleveland_wrangled|>
        ggplot(aes(x = chol, y = thalach, colour = sex)) +
            geom_point()+
            labs(x = "Cholesterol (mg/dl)",
                 y = "Maximum Heart Rate Achieved (beats/minute)",
                 colour = "Sex")+
        ggtitle("Cholesterol vs Maximum Heart Rate Achieved of Patients")+
        theme(text = element_text(size = 10))
cleveland_viz_1

# cleveland_viz_2 <- cleveland_wrangled|>
#         ggplot(aes(x = chol, y = restecg, colour = sex)) +
#             geom_point()+
#             labs(x = "Cholesterol (mg/dl)",
#                  y = "Resting electrocardiogram (ECG) results",
#                  colour = "Sex")+
#         ggtitle("Cholesterol vs Resting ECG results of Patients")+
#         theme(text = element_text(size = 8))
# cleveland_viz_2

# cleveland_viz_3 <- cleveland_wrangled|>
#         ggplot(aes(x = chol, y = num, colour = sex)) +
#             geom_point()+
#             labs(x = "Cholesterol (mg/dl)",
#                  y = "Heart Disease Diagnosis",
#                  colour = "Sex")+
#         ggtitle("Cholesterol vs Heart Disease of Patients")+
#         theme(text = element_text(size = 8))
# cleveland_viz_3

# cleveland_viz_4 <- cleveland_wrangled|>
#         ggplot(aes(x = restecg, y = thalach, colour = sex)) +
#             geom_point()+
#             labs(x = "Resting electrocardiogram (ECG) results",
#                  y = "Maximum Heart Rate Achieved (beats/minute)",
#                  colour = "Sex")+
#         ggtitle("Resting ECG results vs Maximum Heart Rate Achieved of Patients")+
#         theme(text = element_text(size = 8))
# cleveland_viz_4

# cleveland_viz_5 <- cleveland_wrangled|>
#         ggplot(aes(x = num, y = thalach, colour = sex)) +
#             geom_point()+
#             labs(x = "Heart Disease Diagnosis",
#                  y = "Maximum Heart Rate Achieved (beats/minute)",
#                  colour = "Sex")+
#         ggtitle("Heart Disease vs Maximum Heart Rate Achieved of Patients")+
#         theme(text = element_text(size = 8))
# cleveland_viz_5

# cleveland_viz_6 <- cleveland_wrangled|>
#         ggplot(aes(x = num, y = restecg, colour = sex)) +
#             geom_point()+
#             labs(x = "Heart Disease Diagnosis",
#                  y = "Resting electrocardiogram (ECG) results",
#                  colour = "Sex")+
#         ggtitle("Heart Disease vs Resting ECG results of Patients")+
#         theme(text = element_text(size = 8))
# cleveland_viz_6

After wrangling the data, the next step is to build a K-nearest neighbors model to predict the patient's gender. We first split the data into a training set, used to fit and tune the model, and a testing set, held back to evaluate the final model on data it has not seen. We set a seed so that the random split is reproducible.

In [None]:
set.seed(1234) 

cleveland_split <- initial_split(cleveland_wrangled, prop = 0.75, strata = sex)
cleveland_train <- training(cleveland_split)   
cleveland_test <- testing(cleveland_split)
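To illustrate what stratifying by sex is meant to achieve, the sketch below performs a stratified 75/25 split by hand on hypothetical toy data. It is only an illustration of the idea, not the procedure that `initial_split()` uses internally, and the names `toy`, `toy_train`, and `toy_test` are made up for this example.

```r
set.seed(1234)  # fixed seed so the sampled rows are reproducible

# Hypothetical toy data: 8 female and 12 male rows.
toy <- data.frame(
  id  = 1:20,
  sex = rep(c("female", "male"), times = c(8, 12))
)

# Within each class, sample 75% of that class's row indices.
rows_by_class <- split(seq_len(nrow(toy)), toy$sex)
train_rows <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.75 * length(rows)))]
}))

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# The class proportions in the training set match the full data.
print(prop.table(table(toy$sex)))
print(prop.table(table(toy_train$sex)))
```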

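Next, we create the model specification, passing tune() as the neighbors argument so that the number of neighbors is chosen by tuning rather than fixed in advance. We also define a recipe that scales and centers all predictors, since K-nearest neighbors relies on distances and is sensitive to the units of each variable.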

In [None]:
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification")

knn_recipe <- recipe(sex ~ . , data = cleveland_train) |>
   step_scale(all_predictors()) |>
   step_center(all_predictors())
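As a rough illustration of what scaling and then centering does to a predictor, the sketch below applies the same arithmetic by hand in base R. The values are hypothetical, and `standardize()` is a helper written for this example rather than part of the recipes package.

```r
# Hypothetical predictor values, used only to illustrate the arithmetic.
toy <- data.frame(
  chol    = c(200, 240, 280),
  thalach = c(120, 150, 180)
)

# Divide by the standard deviation, then subtract the mean of the scaled
# column, mirroring the scale-then-center order of the recipe.
standardize <- function(x) {
  scaled <- x / sd(x)
  scaled - mean(scaled)
}

toy_std <- as.data.frame(lapply(toy, standardize))
print(toy_std)
```

After this transformation each column has mean 0 and standard deviation 1, so no single variable dominates the distance calculation just because of its units.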

Then, we plot the estimated accuracy against the number of neighbors (from 1 to 15) so that we can decide which number of neighbors is best for predicting the gender of the patients. To estimate accuracy we use 5-fold cross-validation, which checks the stability and performance of a model by training it several times on different subsets of the training data and evaluating it each time on the part that was left out.
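The idea behind the folds can be sketched by hand. The example below assigns a hypothetical set of 20 training rows to 5 folds and reports how many rows validate and how many train in each round. It ignores stratification, so it is a simplification of what `vfold_cv()` with `strata = sex` is asked to do.

```r
set.seed(1234)  # fixed seed so the fold assignment is reproducible

n_rows <- 20    # hypothetical number of training rows
v <- 5          # number of folds

# Shuffle the fold labels 1..v so each row belongs to exactly one fold.
fold_id <- sample(rep(seq_len(v), length.out = n_rows))

# In each round, the rows of one fold validate and all other rows train.
validation_counts <- as.integer(table(fold_id))
fold_sizes <- data.frame(
  fold         = seq_len(v),
  n_validation = validation_counts,
  n_training   = n_rows - validation_counts
)
print(fold_sizes)
```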

In [None]:
set.seed(1234) 

cleveland_vfold <- vfold_cv(cleveland_train, v = 5, strata = sex)

grid_vals <- tibble(neighbors = seq(1, 15))

knn_results <- workflow() |>
      add_recipe(knn_recipe) |>
      add_model(knn_tune) |>
      tune_grid(resamples = cleveland_vfold, grid = grid_vals) |>
      collect_metrics()

accuracies <- knn_results |> 
  filter(.metric == "accuracy")

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) + 
  geom_point() + 
  geom_line() + 
  labs(x = "Number of Neighbors", y = "Accuracy", title = "Cross-validation Results: kNN Accuracy by Number of Neighbors")

cross_val_plot

As the plot shows, the mean cross-validation accuracy peaks at about 0.73 when K = 13, which suggests that 13 is the best number of neighbors among the values we tried. We therefore fit a model with K = 13.
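Instead of reading the best value off the plot, it can also be picked programmatically from the table of mean accuracies. The sketch below does this on a hypothetical summary table with made-up numbers. With the real results, the `accuracies` table would take the place of `toy_accuracies`.

```r
# Hypothetical cross-validation summary; the numbers are made up.
toy_accuracies <- data.frame(
  neighbors = c(1, 5, 9, 13),
  mean      = c(0.62, 0.68, 0.71, 0.73)
)

# Row with the highest mean accuracy. which.max() returns the first
# maximum, so ties resolve to the smallest number of neighbors listed.
best_row <- toy_accuracies[which.max(toy_accuracies$mean), ]
best_k <- best_row$neighbors

print(best_row)
```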

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 13) |>
      set_engine("kknn") |>
      set_mode("classification")

cleveland_fit <- workflow() |>
       add_recipe(knn_recipe) |>
       add_model(knn_spec) |>
       fit(data = cleveland_train)
cleveland_fit

# Accuracy on the training set (the same data the model was fitted on)
train_predictions <- predict(cleveland_fit, cleveland_train) |>
  bind_cols(cleveland_train)

train_metrics <- train_predictions |>
  metrics(truth = sex, estimate = .pred_class) |>
  filter(.metric == "accuracy")
train_metrics

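Here we can see that the fitted model reaches an accuracy of 0.7389381 (about 74 percent) on the training data, which is good enough to move on to the next stage. Because this figure is computed on the same data the model was fitted on, it may be optimistic compared with performance on unseen data.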

Now that we have a K-nearest neighbors classifier object, we can use it to predict the class labels for our test set.

In [None]:
cleveland_test_predictions <- predict(cleveland_fit, cleveland_test) |>
  bind_cols(cleveland_test) |>
  select(.pred_class, sex)

cleveland_test_predictions

The first few rows of predictions look fairly accurate. However, to understand both how good our predictive model is and how closely gender is related to the other variables, we need a fuller evaluation.
The first and simplest evaluation step is a confusion matrix, a table that compares what the model predicted with what was actually true. It lets us see whether the model overpredicts one gender or is more balanced.

In [None]:
# Confusion matrix of predicted versus true sex on the test set
cleveland_mat <- cleveland_test_predictions |>
  conf_mat(truth = sex, estimate = .pred_class)
cleveland_mat
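To show how a confusion matrix can be read, the sketch below builds one from hypothetical labels in base R and derives the overall accuracy and the per-class recall from it. The labels are made up and are not results from the Cleveland test set.

```r
# Hypothetical true labels and predictions, for illustration only.
class_levels <- c("female", "male")
truth <- factor(
  c("male", "male", "male", "male", "female", "female", "male", "female"),
  levels = class_levels
)
predicted <- factor(
  c("male", "male", "male", "female", "male", "female", "male", "male"),
  levels = class_levels
)

# Rows are predictions and columns are the truth.
confusion <- table(Predicted = predicted, Truth = truth)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Recall per class: correct predictions over the true count of that class.
recall <- diag(confusion) / colSums(confusion)

print(confusion)
cat("Accuracy:", accuracy, "\n")
print(recall)
```

A large gap between the two recall values would indicate that the model favours one gender, which the overall accuracy alone cannot show.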

Results

Discussion

The model's accuracy of roughly 70 percent suggests some association between the selected predictors and gender. However, Tapia et al. (2023) found that gender affects only some heart health metrics in a meaningful way. This could explain our relatively low accuracy: our model gives weight to variables that cannot realistically be used to determine gender, and these contribute poor predictions to our results. Weighting the predictors according to how strongly each relates to gender would likely improve these results, but that is outside the scope of our analysis.

Tapia, J., Basalo, M., Enjuanes, C., Calero, E., José, N., Ruíz, M., Calvo, E., Garcimartín, P., Moliner, P., Hidalgo, E., Yun, S., Garay, A., Jiménez-Marrero, S., Pons, A., Corbella, X., & Colet, J. C. (2023). Psychosocial factors partially explain gender differences in health-related quality of life in heart failure patients. ESC Heart Failure, 10(2), 1090–1102. https://doi.org/10.1002/ehf2.14260



Conclusion
In conclusion, we looked at the relationship between various heart metrics and gender in the Cleveland heart disease data. We found that there is some relationship; however, our model's accuracy was about 74 percent, which indicates that there is significant room for improvement. Future studies could examine each metric separately to determine whether it should be used to predict gender, or could look at data from other areas. Understanding the relationship between gender and heart metrics would help medical professionals give more personalized care and may even improve healthcare outcomes. Overall, a better understanding of the heart and its typical values is important, and more work is needed to help improve the healthcare system.