# **Investigating Predictors of Heart Disease**
By Arnav, Drishti, Karan, and Samia

### Introduction:

Heart disease is a leading cause of mortality worldwide, making it a significant public health concern.
Understanding the factors associated with heart disease occurrence is crucial for prevention and intervention efforts. In our project, we aim to explore the predictors of heart disease using a dataset containing various patient attributes and heart disease diagnosis outcomes.

The central question guiding our analysis is: What factors contribute to the likelihood of heart disease occurrence?


### Dataset Description:
We will use the "Heart Disease" dataset from the UCI Machine Learning Repository, which includes patient demographic information, clinical attributes, and a target variable describing the heart disease diagnosis.
The dataset has been widely used in cardiovascular and machine learning (ML) research.
The full dataset comprises 76 attributes, but most published studies use a subset of 14. We work with the Cleveland database, the portion used most extensively by ML researchers. The "goal" field (`num`) records the diagnosis as an integer from 0 (absence) to 4; we treat values 1 to 4 as presence of heart disease.

(Janosi, A., Steinbrunn, W., Pfisterer, M., and Detrano, R. (1988). Heart Disease. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X)

The 14 columns used in this analysis are:
1. **age**: age in years
2. **sex**: sex (1 = male, 0 = female)
3. **cp**: chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
4. **trestbps**: resting blood pressure in mmHg
5. **chol**: serum cholesterol in mg/dl
6. **fbs**: fasting blood sugar > 120 mg/dl? (1 = True, 0 = False)
7. **restecg**: resting electrocardiographic results
8. **thalach**: maximum heart rate achieved
9. **exang**: exercise-induced angina (1 = True, 0 = False)
10. **oldpeak**: ST depression induced by exercise, relative to rest
11. **slope**: the slope of the peak exercise ST segment (1 = upslope, 2 = flat, 3 = downslope)
12. **ca**: number of major vessels (0-3) colored by fluoroscopy
13. **thal**: (3 = normal, 6 = fixed defect, 7 = reversible defect)
14. **num**: diagnosis of heart disease (1, 2, 3, 4 = presence, 0 = absence)




### Preliminary Exploratory Data Analysis:

We will programmatically access and import the "Heart Disease" dataset into our Jupyter notebook to ensure reproducibility.

Data cleaning and preprocessing steps will be performed to handle missing values, address outliers, and encode categorical variables as necessary.

Summary statistics will be calculated to understand the distribution of variables and identify potential trends.

Visualizations, such as histograms, bar charts, and correlation matrices, will be generated to explore relationships between variables and gain insights into the data.
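As a rough illustration of the kind of plots described above, the sketch below draws a bar chart of diagnosis counts and a scatter plot of age against maximum heart rate. It uses only six rows copied from the printed preview of the Cleveland data (columns `age`, `thalach`, and `num`), so it demonstrates syntax rather than findings. The names `sample_rows`, `diagnosis`, `diagnosis_bar`, and `age_thalach_scatter` are our own, and collapsing `num` into absence/presence is an assumption.

```r
library(tidyverse)

# Six rows copied from the printed Cleveland preview (age, thalach, num only).
sample_rows <- tibble(
  age     = c(63, 67, 67, 57, 57, 38),
  thalach = c(150, 108, 129, 115, 174, 173),
  num     = c(0, 2, 1, 3, 1, 0)
) %>%
  # Assumption: collapse the 0-4 diagnosis code into absence/presence.
  mutate(diagnosis = factor(if_else(num > 0, "presence", "absence")))

# Bar chart: number of patients in each diagnosis group.
diagnosis_bar <- ggplot(sample_rows, aes(x = diagnosis)) +
  geom_bar() +
  labs(x = "Diagnosis", y = "Number of patients")

# Scatter plot: age against maximum heart rate, coloured by diagnosis.
age_thalach_scatter <- ggplot(sample_rows,
                              aes(x = age, y = thalach, colour = diagnosis)) +
  geom_point() +
  labs(x = "Age (years)", y = "Maximum heart rate achieved",
       colour = "Diagnosis")

diagnosis_bar
age_thalach_scatter
```

On the full dataset the same calls would apply unchanged once `sample_rows` is swapped for the imported data frame.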

In [4]:
# importing libraries
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)
options(repr.matrix.max.rows = 6) # show at most 6 rows when printing data frames


In [3]:
# The file has no header row, so column names are supplied explicitly.
cleveland_data <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
                           col_names = c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", 
                                         "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"))
cleveland_data

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): ca, thal
dbl (12): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


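| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 57 | 1 | 4 | 130 | 131 | 0 | 0 | 115 | 1 | 1.2 | 2 | 1.0 | 7.0 | 3 |
| 57 | 0 | 2 | 130 | 236 | 0 | 2 | 174 | 0 | 0.0 | 2 | 1.0 | 3.0 | 1 |
| 38 | 1 | 3 | 138 | 175 | 0 | 0 | 173 | 0 | 0.0 | 1 | ? | 3.0 | 0 |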
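The last row printed above has a `?` in `ca`, which is why `read_csv` parsed `ca` and `thal` as character columns. Below is a minimal sketch of one way to clean this up, assuming `?` is the only missing-value marker in the file: convert `?` to `NA`, coerce both columns to numeric, and collapse `num` into a two-level factor. The names `clean_heart`, `diagnosis`, and `diagnosis_summary` are our own, and the toy tibble simply repeats the six rows shown above so the example runs without a download.

```r
library(tidyverse)

# Toy sample: the six rows printed above, with ca and thal kept as character
# to mirror how read_csv parsed them.
sample_rows <- tribble(
  ~age, ~sex, ~cp, ~trestbps, ~chol, ~fbs, ~restecg, ~thalach, ~exang, ~oldpeak, ~slope, ~ca, ~thal, ~num,
  63, 1, 1, 145, 233, 1, 2, 150, 0, 2.3, 3, "0.0", "6.0", 0,
  67, 1, 4, 160, 286, 0, 2, 108, 1, 1.5, 2, "3.0", "3.0", 2,
  67, 1, 4, 120, 229, 0, 2, 129, 1, 2.6, 2, "2.0", "7.0", 1,
  57, 1, 4, 130, 131, 0, 0, 115, 1, 1.2, 2, "1.0", "7.0", 3,
  57, 0, 2, 130, 236, 0, 2, 174, 0, 0.0, 2, "1.0", "3.0", 1,
  38, 1, 3, 138, 175, 0, 0, 173, 0, 0.0, 1, "?",   "3.0", 0
)

# Hypothetical helper: turn "?" into NA, make ca/thal numeric, and add a
# two-level diagnosis factor (num > 0 counts as presence).
clean_heart <- function(df) {
  df %>%
    mutate(
      across(c(ca, thal), ~ as.numeric(na_if(.x, "?"))),
      diagnosis = factor(if_else(num > 0, "presence", "absence"),
                         levels = c("absence", "presence"))
    )
}

cleaned <- clean_heart(sample_rows)

# Quick per-group summary: group size, mean age, and missing ca values.
diagnosis_summary <- cleaned %>%
  group_by(diagnosis) %>%
  summarize(n = n(), mean_age = mean(age), missing_ca = sum(is.na(ca)))

diagnosis_summary
```

An alternative is to pass `na = "?"` to `read_csv`, which treats `?` as missing at import time so that `ca` and `thal` are read as numeric from the start.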


### Methods:

- We will employ descriptive statistics, inferential analysis, and predictive modeling techniques to investigate predictors of heart disease.
- Variables such as age, sex, serum cholesterol, resting blood pressure, and exercise test results (maximum heart rate, exercise-induced angina, ST depression) will be considered as potential predictors.
- Feature engineering may be performed to derive new variables or transformations to enhance predictive performance.
- Machine learning algorithms, such as logistic regression or decision trees, will be applied to build predictive models of heart disease occurrence.
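The modeling step in the list above could look roughly like the following tidymodels sketch, which splits the data, fits a logistic regression, and measures accuracy on held-out rows. It is not our final analysis: the data frame `sim_heart` is simulated purely so the example runs without a download, the two predictors (`age`, `thalach`) and the 75/25 split are arbitrary choices, and every object name is our own.

```r
library(tidymodels)

set.seed(2024) # arbitrary seed so the simulation and split are repeatable

# SIMULATED stand-in data (not real patients), only so the sketch runs offline.
n_patients <- 200
sim_heart <- tibble(
  age     = round(rnorm(n_patients, mean = 55, sd = 9)),
  thalach = round(rnorm(n_patients, mean = 150, sd = 20))
) %>%
  mutate(
    p_disease = plogis(0.05 * (age - 55) - 0.03 * (thalach - 150)),
    diagnosis = factor(if_else(runif(n_patients) < p_disease, "presence", "absence"),
                       levels = c("absence", "presence"))
  ) %>%
  select(-p_disease)

# Hold out 25% of rows for testing, stratified by the outcome.
heart_split <- initial_split(sim_heart, prop = 0.75, strata = diagnosis)
heart_train <- training(heart_split)
heart_test  <- testing(heart_split)

# Logistic regression fitted with the glm engine.
heart_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

heart_recipe <- recipe(diagnosis ~ age + thalach, data = heart_train)

heart_fit <- workflow() %>%
  add_recipe(heart_recipe) %>%
  add_model(heart_spec) %>%
  fit(data = heart_train)

# Predict on the held-out rows and compute classification accuracy.
heart_predictions <- predict(heart_fit, heart_test) %>%
  bind_cols(heart_test)

heart_accuracy <- heart_predictions %>%
  accuracy(truth = diagnosis, estimate = .pred_class)

heart_accuracy
```

With the real data, `sim_heart` would be replaced by the Cleveland data frame once `num` has been recoded into a two-level factor, and the formula extended to the predictors under study.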

### Expected Outcomes and Significance:

We anticipate identifying key factors associated with the presence or absence of heart disease, providing valuable insights for healthcare professionals and policymakers.
Understanding these predictors could inform risk assessment strategies, intervention programs, and personalized treatment plans for individuals at risk of heart disease.
Our findings may prompt further research inquiries, such as investigating the effectiveness of specific lifestyle interventions or exploring disparities in heart disease risk across demographic groups.

