# Data Analysis on Heart Disease

## Introduction
Heart disease is one of the leading causes of death worldwide and has been the leading cause of death in the United States for many years. With rates of heart disease increasing worldwide in recent years, it is important to examine the main risk factors for this disease and how they relate to its global increase. As the basis for our analysis, we use the heart disease dataset available on Kaggle.

The dataset dates from 1988 and combines data from four databases: Cleveland, Hungary, Switzerland, and Long Beach V. Although the original data contain 76 distinct attributes, only 14 of them are used.

**Predictive question:**
How does the amount of cholesterol, type of heart defect, age, and sex help us predict the diagnosis of heart disease?


In [None]:
### Run this cell before continuing.

library(repr)
library(tidyverse)
library(tidymodels)
library(dplyr)
library(ggplot2)
options(repr.matrix.max.rows = 10)

### Reading the Data

First, we read the raw data from the copy we uploaded to GitHub.

In [None]:
heart_disease_data <- read_csv("https://raw.githubusercontent.com/yma24ma/dsci_009_43_gp/main/heart.csv")
heart_disease_data

**Variables**

age: Age

sex: Sex

cp: Chest pain type (4 values)

trestbps: resting blood pressure

chol: serum cholesterol in mg/dl

fbs: fasting blood sugar > 120 mg/dl

restecg: resting electrocardiographic results (values 0,1,2)

thalach: maximum heart rate achieved

exang: exercise induced angina

oldpeak: ST depression induced by exercise relative to rest

slope: the slope of the peak exercise ST segment

ca: number of major vessels (0-3) colored by fluoroscopy

thal: 0 = normal; 1 = fixed defect; 2 = reversible defect

target: diagnosis of heart disease

**Select Data**

We now use the `select()` function to pick the columns used in this analysis and combine them into one table, converting the categorical columns to factors.

In [None]:
heart_disease_selected <- select(heart_disease_data, age, sex, fbs, exang, chol, target,thalach, trestbps)|>
                        mutate(sex=as_factor(sex))|>
                        mutate(fbs=as_factor(fbs))|>
                        mutate(exang=as_factor(exang))|>
                        mutate(target=as_factor(target))|>
                        mutate(heart_disease=fct_recode(target, "Yes" = "1", "No" = "0"))
heart_disease_selected
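
As a rough illustration of the recoding step, the sketch below uses base R's `factor()` on a small made-up vector (`toy_target` is hypothetical, not taken from the dataset). It shows that label strings become the factor levels verbatim, so a stray space such as `"No "` creates a level that does not match `"No"` when the levels are referenced later.

```r
# Hypothetical 0/1 outcome vector, for illustration only
toy_target <- c(0, 1, 1, 0, 1)

# Map 0 -> "No" and 1 -> "Yes"; labels are used verbatim as level names
toy_heart_disease <- factor(toy_target, levels = c(0, 1), labels = c("No", "Yes"))

levels(toy_heart_disease)
table(toy_heart_disease)

# A label with a trailing space is a different level from "No"
padded <- factor(toy_target, levels = c(0, 1), labels = c("No ", "Yes"))
"No" %in% levels(padded)
```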

We use the `sum()` function to check whether there are any NA values in our data table.

In [None]:
sum(is.na(heart_disease_selected))
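
A single total does not say which column any missing values come from. Here is a minimal sketch, using a made-up data frame `toy_df` (hypothetical values, not from the dataset), of how `colSums()` can break the count down per column:

```r
# Hypothetical data frame with two missing values, for illustration only
toy_df <- data.frame(
  age  = c(52, 61, NA, 45),
  chol = c(212, NA, 250, 198),
  sex  = c(1, 0, 1, 1)
)

# is.na() returns a logical matrix; summing it counts every missing cell
total_missing <- sum(is.na(toy_df))

# colSums() on the same matrix counts missing cells per column
missing_per_column <- colSums(is.na(toy_df))

total_missing
missing_per_column
```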

**Average of Selected**

In [None]:
hd_average1 <- heart_disease_selected |>
                select(where(is.numeric)) |>  # mean() is not defined for factor columns
                map(mean)
hd_average1

**Average Target by Sex**

In [None]:
hd_sex_target_df <- heart_disease_selected |>
                        group_by(sex) |>
                        # target is a factor, so compare it to "1" to get the proportion diagnosed
                        summarize(average_target = mean(target == "1"))
hd_sex_target_df

In [None]:
hd_age_chol <- select(heart_disease_selected, age, chol, trestbps, heart_disease)
hd_age_chol

In [None]:
hd_age_chol_plot<-ggplot(hd_age_chol,aes(x=age,y=chol,colour=heart_disease))+
                        geom_point(alpha=0.3)+
                        labs(x="Age",y="Cholesterol",colour="Heart Disease")+       
                    ggtitle("Age vs Cholesterol Scatterplot")
hd_age_chol_plot   

In [None]:
all_temp_plot <- hd_age_chol|> 
   ggplot(aes(x = age, y = chol)) + 
   geom_point() + 
   facet_wrap(facets = vars(factor(heart_disease, levels = c("No","Yes")))) +
   xlab('Age') + 
   ylab('Cholesterol') +
   theme(text = element_text(size=20))

all_temp_plot

In [None]:
hd_age_thalach <- select(heart_disease_selected, age, thalach, heart_disease)
hd_age_thalach

In [None]:
hd_age_thalach_plot<-ggplot(hd_age_thalach,aes(x=age,y=thalach,colour=heart_disease))+
                        geom_point(alpha=0.3)+
                        labs(x="Age",y="Maximum Heart Rate",colour="Heart Disease")+       
                    ggtitle("Age vs Maximum Heart Rate Scatterplot")
hd_age_thalach_plot   

In [None]:
all_temp_plot <- hd_age_thalach |> 
   ggplot(aes(x = age, y = thalach)) + 
   geom_point() + 
   facet_wrap(facets = vars(factor(heart_disease, levels = c("No","Yes")))) +
   xlab('Age') + 
   ylab('Maximum heart rate') +
   theme(text = element_text(size=20))

all_temp_plot

In [None]:
hd_age_trestbps <- select(heart_disease_selected, age, trestbps, heart_disease)
hd_age_trestbps

In [None]:
hd_age_trestbps_plot<-ggplot(hd_age_trestbps,aes(x=age,y=trestbps,colour=heart_disease))+
                        geom_point(alpha=0.3)+
                        labs(x="Age",y="Resting Blood Pressure (mm Hg)",colour="Heart Disease")+
                    ggtitle("Age vs Resting Blood Pressure Scatterplot")
hd_age_trestbps_plot   

In [None]:
all_temp_plot <- hd_age_trestbps |> 
   ggplot(aes(x = age, y = trestbps)) + 
   geom_point() + 
   facet_wrap(facets = vars(factor(heart_disease, levels = c("No","Yes")))) +
   xlab('Age') + 
   ylab('Resting Blood Pressure') +
   theme(text = element_text(size=20))

all_temp_plot

## Methods
For our heart disease dataset, we will use K-nearest neighbors classification. We will use the predictor variables chol (amount of cholesterol), thal (type of heart defect), age, and sex to predict the diagnosis of heart disease, which is coded as 0 (no heart disease) or 1 (heart disease). The columns we will use are therefore chol, thal, age, sex, and target. We chose four predictor variables because we believe that more than two factors contribute to the diagnosis of heart disease. Because there are several variables, we can avoid a 4D graph by using the `facet_grid` function to create a plot with multiple subplots arranged in a grid.
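
To make the idea of K-nearest neighbors concrete, here is a from-scratch sketch in base R rather than a tidymodels workflow. The function `knn_predict()` and the toy data are hypothetical and only illustrate the mechanics: compute the Euclidean distance from a new observation to every training row, take the `k` closest rows, and predict the majority class among them. It assumes the predictors are already on comparable scales.

```r
# Classify one new observation by majority vote among its k nearest neighbors.
# train_x: numeric matrix of predictors (rows = observations)
# train_y: factor of class labels, one per row of train_x
# new_x:   numeric vector with one value per predictor column
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every training row
  diffs <- sweep(train_x, 2, new_x)
  distances <- sqrt(rowSums(diffs^2))

  # Indices of the k smallest distances
  nearest <- order(distances)[seq_len(k)]

  # Majority vote among the neighbors' labels (ties go to the first level)
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Hypothetical, already standardized toy data (not from the heart dataset)
toy_x <- matrix(
  c(-1.0, -0.8,
    -0.9, -1.1,
    -1.2, -0.7,
     1.0,  0.9,
     1.1,  1.2,
     0.8,  1.0),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("age", "chol"))
)
toy_y <- factor(c("No", "No", "No", "Yes", "Yes", "Yes"))

knn_predict(toy_x, toy_y, new_x = c(0.9, 1.1), k = 3)
knn_predict(toy_x, toy_y, new_x = c(-1.0, -1.0), k = 3)
```

Because the vote depends on raw distances, a predictor measured on a much larger scale than the others would dominate the result, which is why standardizing the predictors matters for this method.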

## Expected outcomes and significance


What do you expect to find?

We expect to identify the most relevant features that contribute to the presence or absence of heart disease. We will also look for patterns and correlations within the data.

What impact could such findings have?

Understanding the factors that contribute to heart disease can inform public health initiatives. Furthermore, discoveries from this dataset can enhance healthcare by improving diagnostic tools and predictive models for heart disease, potentially leading to earlier detection and treatment.

What future questions could this lead to?

Are there additional attributes that should be considered, or are there redundant variables that can be eliminated to improve model performance?



In [None]:
set.seed(9999) 
heart_disease_recipe <- recipe(heart_disease ~ age + chol + thalach + trestbps,data = heart_disease_selected) |>
                       step_scale(all_predictors()) |>
                       step_center(all_predictors())
                        
heart_disease_scaled <- heart_disease_recipe |>  
                           prep() |> 
                           bake(heart_disease_selected)
heart_disease_scaled
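
As a quick sanity check on what scaling and centering do, the sketch below standardizes a made-up vector by hand (`toy_chol` is hypothetical). Dividing by the standard deviation and then subtracting the mean of the result gives the same values as the usual z-score `(x - mean(x)) / sd(x)`, so a standardized predictor has mean 0 and standard deviation 1.

```r
# Hypothetical cholesterol values, for illustration only
toy_chol <- c(180, 200, 220, 240, 260)

# Scale first (divide by the standard deviation), then center
scaled <- toy_chol / sd(toy_chol)
scaled_then_centered <- scaled - mean(scaled)

# The usual z-score computed in one step
z_score <- (toy_chol - mean(toy_chol)) / sd(toy_chol)

all.equal(scaled_then_centered, z_score)
mean(z_score)
sd(z_score)
```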

In [None]:
options(repr.plot.width = 5, repr.plot.height = 5)


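# The baked data are standardized, so the axes are unitless
heart_disease_chol_plot <- ggplot(heart_disease_scaled, aes(x = age, y = chol, colour = heart_disease)) +
    geom_point(alpha = 0.3) +
    labs(x = "Age (standardized)",
         y = "Cholesterol (standardized)",
         colour = "Heart Disease")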


heart_disease_chol_plot

heart_disease_maxheart_plot <- ggplot(heart_disease_scaled, aes(x = age, y = thalach, colour = heart_disease)) +
    geom_point(alpha = 0.3) +
    labs(x = "Age (standardized)",
         y = "Max Heart Rate (standardized)",
         colour = "Heart Disease")


heart_disease_maxheart_plot

heart_disease_rbp_plot <- ggplot(heart_disease_scaled, aes(x = age, y = trestbps, colour = heart_disease)) +
    geom_point(alpha = 0.3) +
    labs(x = "Age (standardized)",
         y = "Resting Blood Pressure (standardized)",
         colour = "Heart Disease")


heart_disease_rbp_plot