# Classifying Presence of Heart Disease

### DSCI 100 004 Group 24 Proposal

## Preliminary Exploratory Data Analysis

### Installation

Before beginning the analysis, the `cowplot` library must be installed. Run the cell below to install it.

In [None]:
install.packages("cowplot") 

### Libraries

Next we will import the libraries `tidyverse`, `tidymodels`, `dplyr`, `repr`, and 
the previously installed `cowplot`. 

These libraries will be used to read, clean, split, summarize, and visualize the data set. 

In [None]:
library(tidyverse)
library(tidymodels)
library(dplyr)
library(repr)
library(cowplot)

### Reading Data

The data is sourced from the UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/dataset/45/heart+disease.

The relevant files have been added to this repository's `data` directory and pushed to GitHub. This notebook reads the data file directly from its raw GitHub URL.


In [None]:
# Create list of column names found in data/heart-disease.names
column_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# Read in data
dataset <- read_delim("https://raw.githubusercontent.com/tamzeedq/dsci-100_group_project/main/data/processed.cleveland.data", delim = ",", col_names = column_names)
head(dataset, 5) # Preview first 5 rows

### Tidying/Cleaning the Data

Columns `ca` and `thal` are the only columns with missing data, and since they are not relevant to our analysis we will drop both. Some column headers are also unclear, so we will rename the ones relevant to our analysis to improve readability. Lastly, we will convert the columns representing chest pain type (`cp_type`) and presence of heart disease (`presence`) from numbers to factors. Both are categorical variables encoded as numeric codes; the numbers label categories and do not represent quantities on a numeric scale.


In [None]:
updated_column_names <- c("age", "sex", "cp_type", "rest_bps", "chol", "fbs", "restecg", "max_heart_rate", "exang", "oldpeak", "slope", "ca", "thal", "presence")

colnames(dataset) <- updated_column_names # Update the column names
updated_dataset <- dataset |>
            select(-ca, -thal) |> # Select every column except for ca and thal
            mutate(cp_type = as_factor(cp_type), presence = as_factor(presence)) # Convert column datatypes

head(updated_dataset, 5) # Preview first 5 rows
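
The claim that only `ca` and `thal` contain missing values can be checked by counting missing entries per column. The sketch below is illustrative only: it assumes missing entries are written as `?` in the raw file, and it uses four made-up toy rows with a reduced set of columns instead of the real data set.

```r
library(readr)
library(dplyr)

# Toy rows (hypothetical, not taken from the real file); "?" marks a missing value
toy_csv <- I("63,1,0.0,6.0\n67,1,3.0,3.0\n67,0,?,7.0\n37,1,0.0,?\n")

# Tell read_delim to treat "?" as NA so the columns are parsed as numbers
toy <- read_delim(toy_csv,
                  delim = ",",
                  col_names = c("age", "sex", "ca", "thal"),
                  na = "?",
                  show_col_types = FALSE)

# Count the missing values in every column
missing_counts <- toy |>
    summarize(across(everything(), \(x) sum(is.na(x))))

missing_counts
```

Applying the same `summarize(across(...))` step to the full data set, read with `na = "?"`, would show which columns have non-zero missing counts.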

### Split the Data

The updated data set is split into a training set and a testing set: 75% of the data is used for training and the remaining 25% for testing. The split is stratified by `presence`.


In [None]:
data_split <- initial_split(updated_dataset, prop = 0.75, strata = presence)  
training_data <- training(data_split)   
testing_data <- testing(data_split)
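
Because `initial_split` samples rows at random, the split changes on every run unless a seed is set first. The sketch below shows one way to make a stratified split reproducible and to check that stratification keeps the class proportions similar in the training set. The toy data frame, its column `x`, and the seed value are arbitrary assumptions for illustration, not part of the analysis above.

```r
library(rsample)

set.seed(2024) # Arbitrary seed; any fixed value makes the split repeatable

# Toy data: 120 rows of class "0" and 80 rows of class "1"
toy <- data.frame(x = rnorm(200),
                  presence = factor(rep(c("0", "1"), times = c(120, 80))))

toy_split <- initial_split(toy, prop = 0.75, strata = presence)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of each class in the training set; should be close to 0.6 and 0.4
train_share <- prop.table(table(toy_train$presence))
train_share
```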

### Summarize the Data

We will summarize the training data set in two ways. First, we will count the rows in each category of the variable we wish to predict (`presence`). Second, we will compute the mean of each quantitative variable we want to use to predict `presence`: `age`, `rest_bps`, `chol`, and `max_heart_rate`.

**Note**: Although we want to use chest pain type (`cp_type`) as one of our predictor variables, it is categorical, so it has no meaningful mean and is not included in the table of means.

Let's start with finding the count of different categories for our response variable.

In [None]:
presence_count <- training_data |>
                group_by(presence) |>
                summarize(count = n())
presence_count
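
Raw counts are easier to compare across classes when expressed as percentages, which helps reveal class imbalance. The following is a minimal sketch on a hypothetical six-row toy tibble; `toy_training` and `toy_props` are illustrative names, not objects from the analysis.

```r
library(dplyr)

# Toy response column: three rows of class "0", two of "1", one of "2"
toy_training <- tibble(presence = factor(c("0", "0", "0", "1", "1", "2")))

# Count rows per class, then convert each count to a percentage of all rows
toy_props <- toy_training |>
    count(presence, name = "count") |>
    mutate(percent = 100 * count / sum(count))

toy_props
```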

Next, we find the means of our quantitative predictor variables.

In [None]:
variable_means <- training_data |>
                summarize(mean_age = mean(age),
                          mean_rest_bps = mean(rest_bps),
                          mean_chol = mean(chol),
                          mean_max_rate = mean(max_heart_rate))

variable_means
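
Overall means do not show whether a predictor differs between classes. As a possible extension, the sketch below computes the means separately for each `presence` class using `across`, which avoids repeating `mean()` for every column. The toy tibble and its values are made up for illustration.

```r
library(dplyr)

# Toy training data with two classes and two quantitative predictors
toy_training <- tibble(presence = factor(c("0", "0", "1", "1")),
                       age = c(40, 50, 60, 70),
                       chol = c(200, 220, 240, 260))

# One row per class, one mean column per predictor
toy_means <- toy_training |>
    group_by(presence) |>
    summarize(across(c(age, chol), mean, .names = "mean_{.col}"))

toy_means
```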

### Plotting the Data

Because we have multiple predictor variables, a single plot showing all of them together would be unclear. Instead, the scatter plot below is an example that visualizes the association between three of our five predictor variables. It plots age on the x-axis and cholesterol on the y-axis, with point shape indicating chest pain type. Point colour indicates the class of our response variable (`presence`).

Below the scatter plot are histograms showing the distribution of each quantitative predictor variable in the training data set.

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6)


chest_pain_labels <- c("Typical Angina", "Atypical Angina", "Non-Anginal Pain", "Asymptomatic")

age_chol_plot <- training_data |>
    ggplot(aes(x = age, 
               y = chol, 
               colour = presence,
               shape = factor(cp_type, labels = chest_pain_labels))) +
    labs(x = "Age",
         y = "Cholesterol (mg/dl)",
         colour = "Presence",
         shape = "Chest Pain Type") +
    geom_point(size = 2) +
    ggtitle("Cholesterol vs Age") +
    theme(text = element_text(size = 20))

age_chol_plot

Distribution histograms

In [None]:
age_hist <- training_data |>
        ggplot(aes(x = age)) +
        geom_histogram() +
        ggtitle("Age Distribution")

bps_hist <- training_data |>
        ggplot(aes(x = rest_bps)) +
        geom_histogram() +
        ggtitle("Resting Blood Pressure Distribution")

chol_hist <- training_data |>
        ggplot(aes(x = chol)) +
        geom_histogram() + 
        ggtitle("Cholesterol Distribution")


heart_rate_hist <- training_data |>
        ggplot(aes(x = max_heart_rate)) +
        geom_histogram() + 
        ggtitle("Max Heart Rate Distribution")


plot_grid(age_hist, bps_hist, chol_hist, heart_rate_hist, ncol = 2, nrow=2)
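
When called without arguments, `geom_histogram()` falls back to 30 bins and prints a message suggesting a better value be chosen. The sketch below shows one way to set the bin width explicitly and label the axes. The simulated ages, the seed, and the 5-year bin width are arbitrary assumptions for illustration, not values from the data set.

```r
library(ggplot2)

set.seed(1) # Arbitrary seed so the simulated data is repeatable

# Simulated ages standing in for the training data (hypothetical values)
toy_training <- data.frame(age = round(rnorm(200, mean = 54, sd = 9)))

# Each bar covers a 5-year range; white outlines separate adjacent bars
age_hist_binned <- ggplot(toy_training, aes(x = age)) +
    geom_histogram(binwidth = 5, colour = "white") +
    labs(x = "Age (years)", y = "Count") +
    ggtitle("Age Distribution")

age_hist_binned
```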