# Data Analysis on Heart Disease

## Introduction
Heart disease has been one of the leading causes of death worldwide, including in the United States, for many years. With the number of cases rising worldwide, it is important to examine the leading risk factors for this disease and how they relate to a diagnosis. As the basis for our analysis, we use the heart disease dataset available on Kaggle.
(https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset/data)

This is the cleaned version of the data set given in Canvas -> group_project_proposal


The dataset dates from 1988 and combines data from four databases: Cleveland, Hungary, Switzerland, and Long Beach V. Although the original data contain 76 distinct attributes, this version uses only 14 of them. In our group project, we chose to focus on the variables age, resting blood pressure, cholesterol, and maximum heart rate, and to compare them with the diagnosis of heart disease. We chose these variables because, according to the National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP), they are considered to be among the leading contributors to heart disease. We will compare these numerical variables across the two categories of the diagnosis variable, “0” and “1”.

Using this information, we will attempt to answer the following predictive question.

**Predictive question:**
How do age, resting blood pressure, cholesterol level, and maximum heart rate help us predict the diagnosis of heart disease?

We will conduct a K-nearest neighbors (KNN) classification to answer the predictive question.

In [None]:
### Run this cell before continuing.
install.packages("kknn")
library(kknn)
library(repr)
library(tidyverse)
library(tidymodels)
library(dplyr)
library(ggplot2)
options(repr.matrix.max.rows = 10)

### Reading the Data

First, we read the raw data that we uploaded to GitHub.

In [None]:
heart_disease_data <- read_csv("https://raw.githubusercontent.com/yma24ma/dsci_009_43_gp/main/heart.csv")
heart_disease_data

**Variables**

age: Age

sex: Sex

cp: Chest pain type (4 values)

trestbps: resting blood pressure

chol: serum cholesterol in mg/dl

fbs: fasting blood sugar > 120 mg/dl

restecg: resting electrocardiographic results (values 0,1,2)

thalach: maximum heart rate achieved

exang: exercise induced angina

oldpeak: ST depression induced by exercise relative to rest

slope: the slope of the peak exercise ST segment

ca: number of major vessels (0-3) colored by fluoroscopy

thal: 0 = normal; 1 = fixed defect; 2 = reversible defect

target: diagnosis of heart disease

**Select Data**

We will now use the `select()` function to choose the columns used in this analysis and combine them into one table.

In [None]:
# Keep the four predictors and the diagnosis, then recode the diagnosis as a labelled factor
heart_disease_selected <- select(heart_disease_data, age, chol, target, thalach, trestbps) |>
                          mutate(target = as_factor(target)) |>
                          mutate(heart_disease = fct_recode(target, "Yes" = "1", "No" = "0"))
heart_disease_selected

We use `sum()` together with `is.na()` to check whether there are any NA values in our data table.

In [None]:
sum(is.na(heart_disease_selected))

**Average of Selected**

In [None]:
# Mean of each numeric column (the factor columns are skipped)
hd_average1 <- heart_disease_selected |>
                summarise(across(where(is.numeric), mean))
hd_average1

## Visualization
Now we will visualize the relationship between each numerical variable and the heart disease diagnosis.

**Age vs Cholesterol**

Now, we will visualize the relationship between age and cholesterol. First, we use the `select` function to create a table with only the columns we need.

In [None]:
hd_select_chol <- select(heart_disease_selected, age, chol, heart_disease)
hd_select_chol

Now, we will visualize the correlation between `age` and `chol` (the serum cholesterol level) and whether there is any relationship with the diagnosis of heart disease. We will use `ggplot` with `geom_point` to create a scatter plot of the two variables. We set `alpha` to 0.3 so that overlapping points remain visible and dense regions appear darker.
Hypothesis: We think that older patients will have higher cholesterol, which will lead to a higher rate of heart disease.

In [None]:
hd_chol_plot <- ggplot(hd_select_chol,aes(x = age, y = chol,colour = heart_disease)) +
                    geom_point(alpha = 0.3) +
                    labs(x = "Age", y = "Cholesterol (mg/dl)", colour = "Heart Disease") +       
                    ggtitle("Age vs Cholesterol Scatterplot")
hd_chol_plot   

**Analysis**

Looking at the scatter plot above, there appears to be weak to no correlation between the two variables (age and cholesterol level). However, the red dots, which represent patients without heart disease, tend to sit on the right side of the plot, which suggests that fewer patients are diagnosed with heart disease at higher ages. This surprised us to some degree, since we initially hypothesized that cholesterol would increase with age, and with it the number of patients diagnosed with heart disease.

In [None]:
hd_select_thalach <- select(heart_disease_selected, age, thalach, heart_disease)
hd_select_thalach

**Age vs Max Heart Rate**

Now we will use `age` and `thalach` to visualize the correlation between age and the maximum heart rate achieved by the patient, and whether they have any effect on the diagnosis of heart disease.

In [None]:
hd_thalach_plot <- ggplot(hd_select_thalach,aes(x = age, y = thalach, colour = heart_disease)) +
                       geom_point(alpha = 0.3)+
                       labs(x = "Age", y= "Maximum Heart Rate (bpm)", colour = "Heart Disease") +       
                       ggtitle("Age vs Maximum Heart Rate Scatterplot")
hd_thalach_plot   

**Analysis**

From the scatter plot above, there is a weak negative correlation between age and the maximum heart rate achieved. This is consistent with our hypothesis that the younger the patient, the higher the maximum heart rate. The hypothesis was based on the rule of thumb that the maximum heart rate can be estimated by subtracting the patient's age from 220 (Centers for Disease Control and Prevention).
In addition, patients with higher maximum heart rates tend to have a higher rate of heart disease.
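
As a rough illustration of the rule of thumb cited above (220 minus age), the age-predicted maximum heart rate can be computed directly. The helper name `estimated_max_hr` and the example ages are our own assumptions, chosen only for illustration; they are not taken from the dataset.

```r
# Age-predicted maximum heart rate: 220 minus age (the rule of thumb cited above).
# `estimated_max_hr` is a hypothetical helper name, not part of any package.
estimated_max_hr <- function(age) {
  220 - age
}

example_ages <- c(30, 50, 70)   # illustrative ages, not taken from the dataset
estimated_max_hr(example_ages)  # 190 170 150
```

Under this rule the estimate falls by one beat per minute for each additional year of age, which is the direction of the negative trend seen in the plot.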

In [None]:
hd_select_trestbps <- select(heart_disease_selected, age, trestbps, heart_disease)
hd_select_trestbps

**Age vs Resting Blood Pressure**

Now we will use `age` and `trestbps` to visualize the correlation between age and the resting blood pressure of the patient, and whether they have any effect on the diagnosis of heart disease.
Hypothesis: We think that older patients will have a higher resting blood pressure, which will lead to a higher number of positive heart disease diagnoses.

In [None]:
hd_trestbps_plot <- ggplot(hd_select_trestbps,aes(x = age, y = trestbps, colour = heart_disease)) +
                        geom_point(alpha = 0.3) +
                        labs(x = "Age", y = "Resting Blood Pressure (mm Hg)", colour = "Heart Disease") +
                        ggtitle("Age vs Resting Blood Pressure Scatterplot")
hd_trestbps_plot   

**Analysis**

According to the scatter plot above, there is almost no correlation between age and resting blood pressure, and neither variable shows a clear relationship with the diagnosis of heart disease.

## Methods
For our heart disease data set, we are going to use K-nearest neighbors classification. We will use the predictor variables `age`, `chol` (cholesterol level), `thalach` (maximum heart rate achieved), and `trestbps` (resting blood pressure) to predict the diagnosis class of heart disease, which is either 0 (no heart disease) or 1 (heart disease). The columns we will use are therefore `age`, `chol`, `thalach`, `trestbps`, and `target`. We chose to use four predictor variables because we think that more than two factors contribute to the diagnosis of heart disease. Because there are several predictors, we can avoid a 4D graph by using the `facet_grid` function to create a plot with multiple subplots arranged in a grid.
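
As a rough sketch of how K-nearest neighbors classification works, the example below classifies one new point by majority vote among its closest training points, using plain Euclidean distance. The toy data, the column names `x1` and `x2`, and the helper `knn_predict` are assumptions made for illustration; the analysis itself uses the `kknn` engine through `tidymodels`, not this function.

```r
# Toy training data: two standardized predictors and a known class label.
# All values are made up for illustration.
train <- data.frame(
  x1 = c(-1.0, -0.8, 0.9, 1.1, 0.2),
  x2 = c(-0.5, -1.2, 1.0, 0.7, 0.1),
  label = c("No", "No", "Yes", "Yes", "Yes")
)

# Classify one new point by majority vote among its k nearest neighbors.
# `knn_predict` is a hypothetical helper, not a function from any package.
knn_predict <- function(train, new_point, k = 3) {
  # Euclidean distance from the new point to every training row
  distances <- sqrt((train$x1 - new_point[1])^2 + (train$x2 - new_point[2])^2)
  # Row indices of the k closest training points
  nearest <- order(distances)[seq_len(k)]
  # Count the labels among those neighbors and return the most common one
  votes <- table(train$label[nearest])
  names(votes)[which.max(votes)]
}

knn_predict(train, c(1.0, 0.8), k = 3)    # "Yes"
knn_predict(train, c(-0.9, -0.9), k = 3)  # "No"
```

Each neighbor counts equally in this sketch, which corresponds to an unweighted vote.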

## Expected outcomes and significance


What do you expect to find?

We expect to identify the most relevant features that contribute to the presence or absence of heart disease. We will also look for patterns and correlations within the data.

What impact could such findings have?

Understanding the factors that contribute to heart disease can inform public health initiatives. Furthermore, discoveries from this dataset can enhance healthcare by improving diagnostic tools and predictive models for heart disease, potentially leading to earlier detection and treatment.

What future questions could this lead to?

Are there additional attributes that should be considered, or are there redundant variables that can be eliminated to improve model performance?



**Scaling and Centering Data**

Now, we will create a recipe that scales and centers the predictors using `step_scale` and `step_center`. Scaling and centering put the variables on a common scale by setting each mean to 0 and each standard deviation to 1. This simplifies comparisons between variables and matters for KNN classification, which is based on distances: without it, variables measured on larger scales would dominate the distance calculation.
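
To make the effect of scaling and centering concrete, here is a minimal base R sketch on a made-up vector. The helper name `standardize` and the values in `chol_example` are assumptions for illustration only. The sketch also checks that dividing by the standard deviation first and then subtracting the mean, which is the order of `step_scale` followed by `step_center`, gives the same result as the usual center-then-scale formula.

```r
# Standardize a numeric vector: subtract the mean, then divide by the standard deviation.
# `standardize` is a hypothetical helper name used only in this sketch.
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

chol_example <- c(200, 240, 180, 300, 220)  # made-up cholesterol values (mg/dl)
chol_std <- standardize(chol_example)

mean(chol_std)  # approximately 0
sd(chol_std)    # 1

# Same result when scaling first and centering second
scaled <- chol_example / sd(chol_example)
scaled_then_centered <- scaled - mean(scaled)
all.equal(chol_std, scaled_then_centered)  # TRUE
```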

In [None]:
set.seed(9999) 
heart_disease_recipe <- recipe(heart_disease ~ age + chol + thalach + trestbps, data = heart_disease_selected) |>
                        step_scale(all_predictors()) |>
                        step_center(all_predictors())
heart_disease_recipe
                        
heart_disease_scaled <- heart_disease_recipe |>  
                        prep() |> 
                        bake(heart_disease_selected)
heart_disease_scaled

**Visualize Scaled/Centered Data**

Now, we will visualize the scaled/centered data into a scatter plot using `ggplot`. 

In [None]:
options(repr.plot.width = 5, repr.plot.height = 5)


# The axes show standardized values, so the labels no longer carry the original units
hd_chol_scaled_plot <- ggplot(heart_disease_scaled, aes(x = age, y = chol, color = heart_disease)) +
    geom_point(alpha = 0.3) +
    labs(x = "Age (standardized)", y = "Cholesterol (standardized)", colour = "Heart Disease")
hd_chol_scaled_plot

hd_maxheart_scaled_plot <- ggplot(heart_disease_scaled, aes(x = age, y = thalach, color = heart_disease)) +
    geom_point(alpha = 0.3) +
    labs(x = "Age (standardized)", y = "Max Heart Rate (standardized)", colour = "Heart Disease")
hd_maxheart_scaled_plot

hd_trestbps_scaled_plot <- ggplot(heart_disease_scaled, aes(x = age, y = trestbps, color = heart_disease)) +
    geom_point(alpha = 0.3) +
    labs(x = "Age (standardized)", y = "Resting Blood Pressure (standardized)", colour = "Heart Disease")
hd_trestbps_scaled_plot

Looking at the three scatter plots above, the scaling and centering appear to have worked: the value 0 on both the x-axis and the y-axis falls roughly at the center of each scatter plot.
As a result, the variables are standardized and on a comparable scale, which allows meaningful comparisons between them and makes them suitable for KNN classification.

In [None]:
# Split the data into 75% training and 25% testing sets, stratified by diagnosis
split_set <- initial_split(heart_disease_selected, prop = 0.75, strata = heart_disease)
training_set <- training(split_set)
testing_set <- testing(split_set)
testing_set
training_set

In [None]:
options(repr.plot.height = 5, repr.plot.width = 6)

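# Model specification: KNN classifier with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
            set_engine("kknn") |>
            set_mode("classification")

# Recipe built on the training set only, so the testing set does not influence the scaling
# (the `mnist_` prefix is only a variable name; these objects refer to the heart disease data)
mnist_recipe <- recipe(heart_disease ~ age + chol + thalach + trestbps, data = training_set) |>
                step_scale(all_predictors()) |>
                step_center(all_predictors())

# 5-fold cross-validation on the training set, stratified by diagnosis
mnist_vfold <- vfold_cv(training_set, v = 5, strata = heart_disease)

# Try K = 2 to 6 and collect the cross-validated metrics
knn_results <- workflow() |>
                 add_recipe(mnist_recipe) |>
                 add_model(knn_spec) |>
                 tune_grid(resamples = mnist_vfold, grid = tibble(neighbors = c(2, 3, 4, 5, 6))) |>
                 collect_metrics()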

accuracies <- knn_results |>
                 filter(.metric == 'accuracy')

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
                  geom_point() +
                  geom_line() +
                  labs(x = 'Neighbors', y = 'Accuracy Estimate') +
                  theme(text = element_text(size = 20)) +
                  scale_x_continuous(breaks = seq(0, 20, 2)) +
                  scale_y_continuous(limits = c(0.7, 0.85))

cross_val_plot
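
The plot above lets us choose the number of neighbors by eye. As a small sketch of the same choice made in code, the example below picks the value with the highest mean accuracy. The data frame `cv_results` and its accuracy values are made up for illustration and are not our results; in this notebook the corresponding table is `accuracies`, whose `neighbors` and `mean` columns it imitates.

```r
# Made-up cross-validation results, shaped like the `neighbors` and `mean`
# columns of the `accuracies` table; these numbers are NOT our results.
cv_results <- data.frame(
  neighbors = c(2, 3, 4, 5, 6),
  mean = c(0.78, 0.80, 0.83, 0.81, 0.79)
)

# Number of neighbors with the highest estimated accuracy
best_k <- cv_results$neighbors[which.max(cv_results$mean)]
best_k  # 4
```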

In [None]:
# Final model specification with K = 4 neighbors
mnist_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 4) |>
              set_engine("kknn") |>
              set_mode("classification")

# Fit the recipe and the model on the training set
mnist_fit <- workflow() |>
             add_recipe(mnist_recipe) |>
             add_model(mnist_spec) |>
             fit(data = training_set)
mnist_fit

In [None]:
# Predict the diagnosis for a single new observation
new_ob <- tibble(age = 50, thalach = 170, chol = 140, trestbps = 200)
heart_disease_predicted <- predict(mnist_fit, new_ob)
heart_disease_predicted

In [None]:
mnist_predictions <- predict(mnist_fit ,testing_set) |>
      bind_cols(testing_set)

mnist_predictions

In [None]:
prediction_accuracy <- mnist_predictions |>
        metrics(truth = heart_disease, estimate = .pred_class)             

prediction_accuracy

In [None]:
set.seed(9999) 

mnist_metrics <- mnist_predictions |>
  metrics(truth = heart_disease,estimate = .pred_class) |>
filter(.metric=="accuracy")

mnist_conf_mat <- mnist_predictions |>
  conf_mat(truth = heart_disease, estimate = .pred_class)


mnist_metrics
mnist_conf_mat
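
To show how the accuracy metric relates to a confusion matrix like the one printed above, here is a minimal base R sketch. The counts in `conf_counts` are made up for illustration and are not our results.

```r
# Made-up confusion matrix counts (rows = predicted class, columns = true class);
# these numbers are NOT our results.
conf_counts <- matrix(
  c(40, 10,
     8, 42),
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("No", "Yes"), Truth = c("No", "Yes"))
)

# Accuracy = correctly classified observations / all observations.
# The correct classifications sit on the diagonal of the matrix.
accuracy <- sum(diag(conf_counts)) / sum(conf_counts)
accuracy  # 0.82
```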