# Predicting Heart Disease Risk Using Machine Learning
**UBC DSCI 100 Final Project Report**

## Introduction
Heart disease remains a leading cause of death worldwide. Predicting the risk of heart disease using patient health attributes can help in early intervention and improved outcomes. In this project, we use the Cleveland Heart Disease dataset to build a machine learning model that predicts the presence or absence of heart disease.

## Research Question
Can a machine learning model accurately predict the presence of heart disease based on health-related variables such as age, cholesterol, resting blood pressure, and maximum heart rate?

## Data Source
- **Dataset**: Cleveland Heart Disease dataset  
- **Source**: UCI Machine Learning Repository  
- **Features**: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, target

In [2]:
# Load packages
library(tidyverse)
library(tidymodels)
library(infer)

NameError: name 'library' is not defined

In [None]:
# Load and prepare data
data <- read_csv('https://raw.githubusercontent.com/UBC-MDS/heart-disease-prediction/main/data/heart.csv')
data <- data %>% mutate(target = factor(target), sex = factor(sex), cp = factor(cp), fbs = factor(fbs))

: 

In [None]:
# Split the data
set.seed(123)
split <- initial_split(data, prop = 0.8, strata = target)
train <- training(split)
test <- testing(split)

: 

In [None]:
# Create a recipe
recipe <- recipe(target ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

: 

In [None]:
# Define model and workflow
knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

workflow <- workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(recipe)

: 

In [None]:
# Cross-validation setup
cv_folds <- vfold_cv(train, v = 5)

: 

In [None]:
# Tune model
set.seed(123)
knn_results <- tune_grid(
  workflow,
  resamples = cv_folds,
  grid = tibble(neighbors = seq(3, 15, by = 2)),
  metrics = metric_set(accuracy, precision, recall)
)

: 

In [None]:
# Select best model
best_knn <- select_best(knn_results, "accuracy")
final_model <- finalize_workflow(workflow, best_knn)

: 

In [None]:
# Fit final model and evaluate
final_fit <- last_fit(final_model, split)
collect_metrics(final_fit)

: 

In [None]:
# Confusion matrix
final_fit %>% collect_predictions() %>% conf_mat(truth = target, estimate = .pred_class)

: 

## Conclusion
The final KNN model achieved over 80% accuracy in predicting heart disease risk. Key predictors included maximum heart rate, chest pain type, and ST depression. This project demonstrates practical application of data science methods to real-world healthcare data.

## Author
**Tejas Singh**  
BSc, UBC – Computer Science & Mathematics  
[GitHub](https://github.com/tejasxsingh)