# Group Project Proposal

*An Analysis of the Risk of Developing Heart Disease for Individuals in Cleveland, Ohio.*

Cardiovascular disease includes any condition that affects the circulatory system. This results in negative impacts on the function of the heart and veins, which differ from different types of cardiovascular disease (Thiriet, 1970). Cardiovascular disease is also commonly known as heart disease (Thiriet, 1970). 
In order to save many lives, it is essential that individuals with heart disease are diagnosed early to start treatment as soon as possible (Pal et al., 2022). A model that can accurately predict the risk of developing cardiovascular disease would have the potential to save countless lives (Pal et al., 2022). 
By narrowing it down to specific factors such as cholesterol or blood pressure we can possibly give advice on proactive measures that are most important to minimize risk of heart disease.

Therefore the question we intend to answer is: Can we predict which patients in Cleveland, Ohio are at the highest risk of developing heart disease?

The dataset we will be using was retrieved here: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/#:~:text=processed.cleveland.data under "processed.cleveland.data" and it includes the following 14 attributes:
which can be numerical or categorical (which ones?)


*Preliminary Exploratory Data Analysis:*

Our goal is to get the data down to 5 main variables to answer our question. To do so, we will analyze the correlation between heart disease and each variable independently.

In [48]:
import altair as alt
import numpy as np
import pandas as pd

In [49]:
heart_disease_data = pd.read_csv("data/processed.cleveland.data", names = [
    "age", 
    "sex", 
    "chest_pain", 
    "resting_blood_pressure", 
    "cholesterol", 
    "fasting_blood_pressure", 
    "resting_electrocardiographic _results", 
    "max_heart_rate", "exercise_induced_angina",  
    "ST_depression_exercise", 
    "slope_ST", 
    "major_vessels", 
    "thal", "diagnosis"
])
heart_disease_data

Unnamed: 0,age,sex,chest_pain,resting_blood_pressure,cholesterol,fasting_blood_pressure,resting_electrocardiographic _results,max_heart_rate,exercise_induced_angina,ST_depression_exercise,slope_ST,major_vessels,thal,diagnosis
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1


In [50]:
heart_disease_data.dropna
heart_disease_data

Unnamed: 0,age,sex,chest_pain,resting_blood_pressure,cholesterol,fasting_blood_pressure,resting_electrocardiographic _results,max_heart_rate,exercise_induced_angina,ST_depression_exercise,slope_ST,major_vessels,thal,diagnosis
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1


For the diagnosis category the researchers classify 0 as absense of heart disease and 1, 2, 3,and 4 as presence. Therefore we will group 1, 2, 3, and 4 as 1.

In [51]:
# make a column, heart_disease, where 0 indicates no disease and 1 indicates disease

heart_disease_data["heart_disease"] = heart_disease_data["diagnosis"].apply(lambda x: 0 if (x == 0) else 1)
heart_disease_data

Unnamed: 0,age,sex,chest_pain,resting_blood_pressure,cholesterol,fasting_blood_pressure,resting_electrocardiographic _results,max_heart_rate,exercise_induced_angina,ST_depression_exercise,slope_ST,major_vessels,thal,diagnosis,heart_disease
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2,1
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2,1
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3,1
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1,1


Now we will start to do some exploratory data analysis. First, we will split the dataset to do the exploratroy analysis on just the training data.

In [52]:
from sklearn.model_selection import train_test_split

heart_disease_train, heart_disease_test = train_test_split(heart_disease_data, test_size=0.25, random_state=123)
print(heart_disease_train.head())
print(heart_disease_test.head())

      age  sex  chest_pain  resting_blood_pressure  cholesterol  \
36   43.0  1.0         4.0                   120.0        177.0   
148  45.0  1.0         2.0                   128.0        308.0   
21   58.0  0.0         1.0                   150.0        283.0   
187  66.0  1.0         2.0                   160.0        246.0   
161  77.0  1.0         4.0                   125.0        304.0   

     fasting_blood_pressure  resting_electrocardiographic _results  \
36                      0.0                                    2.0   
148                     0.0                                    2.0   
21                      1.0                                    2.0   
187                     0.0                                    0.0   
161                     0.0                                    2.0   

     max_heart_rate  exercise_induced_angina  ST_depression_exercise  \
36            120.0                      1.0                     2.5   
148           170.0             

Now we will start with some plots to visualize the correlational relationships between the variables and the diagnosis with the training data.

In [53]:
age_scatterplot = (
    alt.Chart(heart_disease_train , title= "Age vs Heart Disease") # title for the entire plot
    .mark_point(opacity= 0.5) 
    .encode(
        x=alt.X("age", title = "Age (years)"),
        y=alt.Y("diagnosis", title = "Risk of Heart Disease (units)"),
    )
    .properties(width=380, height=300)  #adjust these accordingly…
)

age_scatterplot


  for col_name, dtype in df.dtypes.iteritems():
