# Heart Disease Classification

## Introduction

Cardiovascular diseases (CVDs), including heart disease, are the leading cause of death worldwide, according to the World Health Organization. CVDs kill an estimated 17.9 million people each year, accounting for 32% of all deaths. Detecting heart disease early from measurable factors such as cholesterol, blood pressure, and age can therefore save lives by allowing patients to make lifestyle changes or seek medical care sooner.

### Question

Our goal is to investigate ways to predict whether a new patient has heart disease based on related factors. More specifically, we aim to answer the question: **Is heart disease present (values = 1, 2, 3, 4) or absent (value = 0) in a new patient?**
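The question treats the diagnosis as two classes, so the five codes can be collapsed into a binary label. The following is a minimal base R sketch of that step. The codes are those of the first ten Hungarian rows previewed later in this report, and the name `presence_binary` is a hypothetical one introduced only for illustration.

```r
# Diagnosis codes of the first ten Hungarian rows (0 = absent, 1-4 = present)
presence <- c(0, 1, 0, 3, 0, 0, 0, 0, 1, 0)

# Collapse codes 1-4 into a single "present" class; 0 stays "absent"
presence_binary <- factor(
  ifelse(presence == 0, "absent", "present"),
  levels = c("absent", "present")
)

# Count how many patients fall into each class
table(presence_binary)
```

Fixing the factor levels explicitly keeps the class order stable even if a subset of the data happens to contain only one class.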

### Data set

To answer this question we will analyze data downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Heart+Disease). The downloaded data contains 4 databases: Cleveland, Hungary, Switzerland, and VA Long Beach. Below, we compile these databases into one data frame, `heart_disease_data`. Each row contains data about one person. The column `presence` (the predicted attribute) indicates the presence or absence of heart disease. In our project we will use 3 of the attributes listed below as predictor variables, namely `trestbps` (resting blood pressure), `chol` (serum cholesterol in mg/dl), and `age` (age in years), to predict the presence or absence of heart disease in a new patient.

Relevant columns in the dataset:

- `age` - age in years
- `sex` - sex (1 = male; 0 = female)
- `cp` - chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
- `trestbps` - resting blood pressure (in mm Hg on admission to the hospital)
- `chol` - serum cholesterol in mg/dl
- `fbs` - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- `restecg` - resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- `thalach` - maximum heart rate achieved
- `exang` - exercise induced angina (1 = yes; 0 = no)
- `oldpeak` - ST depression induced by exercise relative to rest
- `slope` - the slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
- `ca` - number of major vessels (0-3) colored by fluoroscopy
- `thal` - 3 = normal; 6 = fixed defect; 7 = reversible defect
- `presence` - predicted attribute (0 = heart disease absent; 1, 2, 3, 4 = present)

We will use K-nearest neighbors (KNN) classification to build a classifier in this project. We will train and test the model on the data above, then evaluate its predictions and performance. A well-fitted model, trained and evaluated this way, can then be used to predict the presence or absence of heart disease in new patients with good accuracy.
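As a rough illustration of the idea, the sketch below classifies one patient with KNN on the three predictors named above. It is not the project's final model. It assumes the `class` package (which ships with standard R installations) as a stand-in for whichever KNN implementation is eventually used, trains on only nine complete rows from the Hungarian file, and uses a hypothetical `new_patient` and an arbitrary `k = 3`.

```r
library(class)  # provides knn()

# Nine complete rows taken from the Hungarian file (age, trestbps, chol)
train <- data.frame(
  age      = c(40, 49, 37, 48, 39, 45, 54, 37, 48),
  trestbps = c(140, 160, 130, 138, 120, 130, 110, 140, 120),
  chol     = c(289, 180, 283, 214, 339, 237, 208, 207, 284)
)
# Binary labels for the same rows (diagnosis codes 1-4 collapsed to "present")
labels <- factor(c("absent", "present", "absent", "present", "absent",
                   "absent", "absent", "present", "absent"))

# Standardize the predictors so no single variable dominates the distance
train_scaled <- scale(train)

# Hypothetical new patient, scaled with the training means and standard deviations
new_patient <- data.frame(age = 50, trestbps = 150, chol = 200)
new_scaled <- scale(new_patient,
                    center = attr(train_scaled, "scaled:center"),
                    scale  = attr(train_scaled, "scaled:scale"))

# Classify by majority vote among the k = 3 nearest training rows
prediction <- knn(train = train_scaled, test = new_scaled, cl = labels, k = 3)
prediction
```

Scaling matters here because `chol` spans a much wider numeric range than `age`, so unscaled Euclidean distances would be driven almost entirely by cholesterol.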


## Preliminary exploratory data analysis


### Loading Libraries

In [17]:
library(tidyverse) # attaches ggplot2, dplyr, readr, and the other core tidyverse packages

### Reading data, Cleaning & Wrangling data

In [18]:
# comma-separated files and the region label that goes with each one
path_names <- c("data/processed.cleveland.data", "data/processed.switzerland.data", "data/processed.va.data")
factors <- c("cleveland", "switzerland", "va")
colnames <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "presence")

# the Hungarian file is space-delimited, so it is read separately and seeds the combined data frame
heart_disease_data <- read_delim("data/reprocessed.hungarian.data", delim = " ", col_names = colnames) %>%
  mutate(region = factor("hungary"))

for (i in seq_along(path_names)) {
    # read each file ("?" marks a missing value) and add a col that has the region name
    data_from_file <- read_csv(path_names[[i]], col_names = colnames, na = c("?")) %>%
      mutate(region = factor(factors[[i]]))

    # add the freshly read data to the master data frame
    heart_disease_data <- rbind(heart_disease_data, data_from_file)
}

# convert the existing presence col (codes 0-4) to a factor
heart_disease_data <- mutate(heart_disease_data, presence = factor(presence))
# keep only the numeric predictors, the region, and the class label
heart_disease_data <- heart_disease_data %>% select(age, trestbps, chol, thalach, oldpeak, region, presence)

heart_disease_data

Parsed with column specification:
cols(
  age = col_double(),
  sex = col_double(),
  cp = col_double(),
  trestbps = col_double(),
  chol = col_double(),
  fbs = col_double(),
  restecg = col_double(),
  thalach = col_double(),
  exang = col_double(),
  oldpeak = col_double(),
  slope = col_double(),
  ca = col_double(),
  thal = col_double(),
  presence = col_double()
)

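| age | trestbps | chol | thalach | oldpeak | region | presence |
|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` |
| 40 | 140 | 289 | 172 | 0.0 | hungary | 0 |
| 49 | 160 | 180 | 156 | 1.0 | hungary | 1 |
| 37 | 130 | 283 | 98 | 0.0 | hungary | 0 |
| 48 | 138 | 214 | 108 | 1.5 | hungary | 3 |
| 54 | 150 | -9 | 122 | 0.0 | hungary | 0 |
| 39 | 120 | 339 | 170 | 0.0 | hungary | 0 |
| 45 | 130 | 237 | 170 | 0.0 | hungary | 0 |
| 54 | 110 | 208 | 142 | 0.0 | hungary | 0 |
| 37 | 140 | 207 | 130 | 1.5 | hungary | 1 |
| 48 | 120 | 284 | 120 | 0.0 | hungary | 0 |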
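One row in the preview has `chol` equal to -9. The UCI documentation for these files uses -9 to mark missing values, so such entries likely need recoding to `NA` before modelling. The base R sketch below shows one possible way to do that on the first five previewed rows. The `preview` data frame and the choice of columns are assumptions made for illustration, not part of the pipeline above.

```r
# First five Hungarian rows from the preview (three columns shown for brevity)
preview <- data.frame(
  age      = c(40, 49, 37, 48, 54),
  trestbps = c(140, 160, 130, 138, 150),
  chol     = c(289, 180, 283, 214, -9)
)

# Recode the -9 sentinel as NA in every numeric column
numeric_cols <- c("age", "trestbps", "chol")
preview[numeric_cols] <- lapply(
  preview[numeric_cols],
  function(x) replace(x, x %in% -9, NA)
)

# Count the missing values in each column after recoding
colSums(is.na(preview))
```

Using `%in%` rather than `==` keeps the recoding safe when a column already contains `NA`, because `%in%` never returns `NA`.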


## Methods

## Expected outcomes & significance