# Group Project Proposal

*An Analysis of the Risk of Developing Heart Disease for Individuals in Cleveland, Ohio.*

## Introduction

Cardiovascular disease includes any condition that affects the circulatory system. These conditions impair the function of the heart and blood vessels in ways that differ from one type of cardiovascular disease to another (Thiriet, 1970). Cardiovascular disease is also commonly known as heart disease (Thiriet, 1970).
Early diagnosis is essential so that individuals with heart disease can begin treatment as soon as possible (Pal et al., 2022). A model that accurately predicts the risk of developing cardiovascular disease therefore has the potential to save many lives (Pal et al., 2022).
By narrowing the analysis down to specific factors, such as cholesterol or blood pressure, we may be able to advise on which proactive measures matter most for minimizing the risk of heart disease.

Therefore, the question we intend to answer is: Can we predict which patients in Cleveland, Ohio are at the highest risk of developing heart disease?

The dataset we will be using was retrieved from "processed.cleveland.data" in the Heart Disease Data Set (https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/) provided by the UCI Machine Learning Repository. The dataset includes the following 14 attributes, each of which is either numerical or categorical.

| # | Variable | Description | Value |
|---|----------|-------------|-------|
| 1 | age | Individual's age | numerical |
| 2 | sex | Individual's sex | 1 = male; 0 = female |
| 3 | chest_pain | Chest pain type | <ul><li>1: typical angina</li><li>2: atypical angina</li><li>3: non-anginal pain</li><li>4: asymptomatic</li></ul> |
| 4 | resting_blood_pressure | Resting blood pressure (in mm Hg on admission to the hospital) | numerical |
| 5 | cholesterol | Serum cholesterol in mg/dL | numerical |
| 6 | fasting_blood_sugar | Fasting blood sugar > 120 mg/dL | 1 = true; 0 = false |
| 7 | resting_electrocardiographic_results | Resting electrocardiographic results | <ul><li>0: normal</li><li>1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)</li><li>2: showing probable or definite left ventricular hypertrophy by Estes' criteria</li></ul> |
| 8 | max_heart_rate | Maximum heart rate achieved | numerical |
| 9 | exercise_induced_angina | Exercise induced angina | 1 = yes; 0 = no |
| 10 | ST_depression_exercise | ST depression induced by exercise relative to rest | numerical |
| 11 | slope_ST | The slope of the peak exercise ST segment | <ul><li>1: upsloping</li><li>2: flat</li><li>3: downsloping</li></ul> |
| 12 | major_vessels | Number of major vessels (0-3) colored by fluoroscopy | numerical |
| 13 | thal | Thalassemia | 3 = normal; 6 = fixed defect; 7 = reversible defect |
| 14 | diagnosis | Diagnosis of presence of heart disease (Note: the Cleveland database uses this variable simply to distinguish presence of disease from absence, rather than to grade the narrowing of vessel diameter) | numerical; from 0 (no presence) to 4 |

## Preliminary Exploratory Data Analysis

The goal of this preliminary exploratory analysis is to narrow the data down to a few main variables for answering our question. To do so, we will examine the relationship between heart disease and each variable independently, creating a graph for each variable to compare how it relates to the risk of developing heart disease. We will then select the variables that show the clearest association in this initial analysis for further study. For boolean and categorical variables, we will use bar graphs to compare the risk of heart disease. For the other variables, we will use scatter plots. These choices may change depending on how readable the resulting graphs are.

First, we will load the dataset from the online directory using pandas.

In [611]:
import altair as alt
import numpy as np
import pandas as pd

In [612]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

heart_disease_data = pd.read_csv(url, names=[
    "age", 
    "sex", 
    "chest_pain", 
    "resting_blood_pressure", 
    "cholesterol", 
    "fasting_blood_sugar", 
    "resting_electrocardiographic_results", 
    "max_heart_rate",
    "exercise_induced_angina",  
    "ST_depression_exercise", 
    "slope_ST", 
    "major_vessels", 
    "thal",
    "diagnosis"
])
heart_disease_data

Unnamed: 0,age,sex,chest_pain,resting_blood_pressure,cholesterol,fasting_blood_sugar,resting_electrocardiographic_results,max_heart_rate,exercise_induced_angina,ST_depression_exercise,slope_ST,major_vessels,thal,diagnosis
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1


Next, to obtain a tidy format for analysis, we will clean and wrangle the data. 

1. We will remove missing values.
2. The values of `sex` are stored as floats. We will convert them to a categorical variable: 1 = male; 0 = female. Note: this binary classification does not imply sex or gender categorization.
3. The values of `chest_pain` are numbers, but they represent a categorical variable, so we will replace the four numbers with the corresponding types of chest pain.
4. For the `diagnosis` category, the researchers classify 0 as absence of heart disease and 1, 2, 3, and 4 as presence. Therefore, we will group 1, 2, 3, and 4 as "diagnosed" and label 0 as "undiagnosed".
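
One caveat for step 1: in `processed.cleveland.data`, missing entries are written as `?` rather than left blank, so pandas reads them as ordinary text unless told otherwise. The sketch below uses a small made-up sample (not rows from the real file, and only four of the columns) to show how passing `na_values="?"` to `pd.read_csv` lets `dropna()` remove those rows.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same comma-separated layout;
# the values are made up for illustration.
sample = io.StringIO(
    "63.0,1.0,0.0,6.0\n"
    "52.0,0.0,?,3.0\n"
    "48.0,1.0,2.0,?\n"
)
cols = ["age", "sex", "major_vessels", "thal"]

# Without na_values, "?" is kept as text and dropna() removes nothing.
raw = pd.read_csv(sample, names=cols)
rows_kept_raw = len(raw.dropna())

# With na_values="?", the markers become NaN and dropna() removes them.
sample.seek(0)
parsed = pd.read_csv(sample, names=cols, na_values="?")
cleaned = parsed.dropna()
rows_kept_clean = len(cleaned)

print(rows_kept_raw, rows_kept_clean)
```

Note that applying this to the real file would reduce the number of rows, so any counts and summary statistics computed afterwards would change accordingly.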

In [613]:
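# dropna is a method and must be called; its result is assigned back.
# Note: it only drops NaN. This file marks missing entries with "?",
# which count as missing only if they were parsed as NaN when loading.
heart_disease_data = heart_disease_data.dropna()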

heart_disease_data["sex"] = heart_disease_data["sex"].apply(lambda x: "male" if (x == 1.0) else "female")

# replace the values of chest_pain
heart_disease_data['chest_pain'] = heart_disease_data['chest_pain'].replace({
    1: 'Typical angina',
    2: 'Atypical angina',
    3: 'Non-anginal pain',
    4: 'Asymptomatic',
})

# define a column, "heart_disease", based on the "diagnosis" column
heart_disease_data["heart_disease"] = heart_disease_data["diagnosis"].apply(lambda x: "undiagnosed" if (x == 0) else "diagnosed")

heart_disease_data

Unnamed: 0,age,sex,chest_pain,resting_blood_pressure,cholesterol,fasting_blood_sugar,resting_electrocardiographic_results,max_heart_rate,exercise_induced_angina,ST_depression_exercise,slope_ST,major_vessels,thal,diagnosis,heart_disease
0,63.0,male,Typical angina,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0,undiagnosed
1,67.0,male,Asymptomatic,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2,diagnosed
2,67.0,male,Asymptomatic,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1,diagnosed
3,37.0,male,Non-anginal pain,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0,undiagnosed
4,41.0,female,Atypical angina,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0,undiagnosed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,male,Typical angina,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1,diagnosed
299,68.0,male,Asymptomatic,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2,diagnosed
300,57.0,male,Asymptomatic,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3,diagnosed
301,57.0,female,Atypical angina,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1,diagnosed


In [614]:
# each row is one patient, so every row gets a count of 1
heart_disease_data["number_of_patients"] = 1
heart_disease_data

Unnamed: 0,age,sex,chest_pain,resting_blood_pressure,cholesterol,fasting_blood_sugar,resting_electrocardiographic_results,max_heart_rate,exercise_induced_angina,ST_depression_exercise,slope_ST,major_vessels,thal,diagnosis,heart_disease,number_of_patients
0,63.0,male,Typical angina,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0,undiagnosed,1
1,67.0,male,Asymptomatic,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2,diagnosed,1
2,67.0,male,Asymptomatic,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1,diagnosed,1
3,37.0,male,Non-anginal pain,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0,undiagnosed,1
4,41.0,female,Atypical angina,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0,undiagnosed,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,male,Typical angina,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1,diagnosed,1
299,68.0,male,Asymptomatic,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2,diagnosed,1
300,57.0,male,Asymptomatic,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3,diagnosed,1
301,57.0,female,Atypical angina,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1,diagnosed,1


We check the mean, minimum, and maximum values of each of the following numerical variables: `age`, `resting_blood_pressure`, `cholesterol`, and `max_heart_rate`.

In [615]:
age_stat = heart_disease_data.agg({'age': ['mean', 'min', 'max']}).round(decimals=1)
rbp_stat = heart_disease_data.agg({'resting_blood_pressure': ['mean', 'min', 'max']}).round(decimals=2)
chol_stat = heart_disease_data.agg({'cholesterol': ['mean', 'min', 'max']}).round(decimals=2)
max_hr_stat = heart_disease_data.agg({'max_heart_rate': ['mean', 'min', 'max']}).round(decimals=2)

num_stat = pd.concat([age_stat, rbp_stat, chol_stat, max_hr_stat], axis=1)
num_stat

Unnamed: 0,age,resting_blood_pressure,cholesterol,max_heart_rate
mean,54.4,131.69,246.69,149.61
min,29.0,94.0,126.0,71.0
max,77.0,200.0,564.0,202.0


In [616]:
heart_disease_data["sex"].value_counts()

male      206
female     97
Name: sex, dtype: int64

Now we will begin the exploratory data analysis. First, we will split the dataset so that the exploratory analysis uses only the training data.

In [617]:
from sklearn.model_selection import train_test_split

heart_disease_train, heart_disease_test = train_test_split(heart_disease_data, test_size=0.25, random_state=123)
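
Because the split is random, the share of diagnosed patients can differ between the training and test sets. As an optional refinement (an assumption on our part, not something the analysis here relies on), `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both parts. The sketch uses a made-up table with a hypothetical `patient_id` column rather than the Cleveland data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 40 undiagnosed and 60 diagnosed rows.
toy = pd.DataFrame({
    "patient_id": list(range(100)),
    "heart_disease": ["undiagnosed"] * 40 + ["diagnosed"] * 60,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.25,
    random_state=123,
    stratify=toy["heart_disease"],  # keep the 40/60 balance in both parts
)

# Share of diagnosed rows in each part.
train_share = (toy_train["heart_disease"] == "diagnosed").mean()
test_share = (toy_test["heart_disease"] == "diagnosed").mean()
print(round(train_share, 2), round(test_share, 2))
```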

Next, we use the training data to plot the relationship between each variable and the diagnosis.
Disclaimer: counts of individuals refer only to the training portion of our data.

In [618]:
# keep only the training rows diagnosed with heart disease
diagnosed_data = heart_disease_train[heart_disease_train["heart_disease"] == "diagnosed"]

age_hist = (alt.Chart()
            .mark_bar()
            .encode(
                x=alt.X("age", title="Age", bin=alt.Bin(maxbins=30)),
                y=alt.Y("count()", title="Count of diagnosed individuals", stack=False),
            )
            .properties(height=200, width=400)
           )

age_hist_facet = (age_hist
                  .facet(
                      "sex",
                      data=diagnosed_data,
                      columns=0,
                      title='Distribution of Heart Disease over Age Ranges'
                  )
                  .configure_header(titleFontSize=18)
                  .configure_axis(labelFontSize=15, titleFontSize=15)
                  .configure_headerFacet(labelFontSize=18, titleFontSize=18)
                 )

age_hist_facet



From the distribution of individuals diagnosed with heart disease, we can see that among females most diagnosed individuals fall in the range from the mid-50s to the mid-60s. Among males, on the other hand, diagnosed individuals are spread over a broad range from the late 30s to the early 70s.

In [619]:
sex_bar_graph = (alt.Chart(heart_disease_train, title = "Number of Individuals with and without Heart Disease by Sex")
                 .mark_bar()
                 .encode(
                     x=alt.X("sex", title="Sex"), 
                     y=alt.Y("count()", title= "Count of individuals"),
                     color="heart_disease"
                 ).configure_title(fontSize=18)
                 .configure_axis(titleFontSize=15, labelFontSize=15)
                 .configure_legend(titleFontSize=15, labelFontSize=15)
                 .properties(width=380, height=300)
)
sex_bar_graph

Although there are fewer females than males in our data, the proportion of males diagnosed with heart disease is still noticeably higher than the proportion of females diagnosed.
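
Since the bar chart shows counts, the comparison of proportions is easier to check numerically. A minimal sketch, assuming the cleaned `sex` and `heart_disease` columns, uses `pd.crosstab` with `normalize="index"` to get the share of each diagnosis within each sex. The rows are made up for illustration and are not the Cleveland results.

```python
import pandas as pd

# Made-up rows with the same column names as the cleaned data.
toy = pd.DataFrame({
    "sex": ["male"] * 6 + ["female"] * 4,
    "heart_disease": (
        ["diagnosed"] * 4 + ["undiagnosed"] * 2      # males
        + ["diagnosed"] * 1 + ["undiagnosed"] * 3    # females
    ),
})

# Each row sums to 1, so groups of different sizes are comparable.
share_by_sex = pd.crosstab(toy["sex"], toy["heart_disease"], normalize="index")
print(share_by_sex.round(2))
```

Replacing `toy` with `heart_disease_train` would give the corresponding shares for the training data.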

In [620]:
chest_pain_graph = (
    alt.Chart(heart_disease_train, title = "Chest Pain vs Heart Disease")
    .mark_bar()
    .encode(
        x=alt.X("count()", title="Count of individuals"), 
        y=alt.Y("chest_pain:N", title="Type of chest pain"),
        color="heart_disease"
    ).properties(width=380, height=300)
    .configure_title(fontSize=18)
    .configure_axis(titleFontSize=15, labelFontSize=15)
    .configure_legend(titleFontSize=15, labelFontSize=15)
)
chest_pain_graph

Here we see that among individuals with the asymptomatic type of chest pain, the majority have been diagnosed with heart disease. Non-anginal pain is the second most frequently recorded chest pain type, but a large portion of those individuals are undiagnosed.

Compared with the other types of chest pain, the data may indicate a slight correlation between the asymptomatic type and heart disease, although chest pain alone might not be a good indicator of heart disease.

In [621]:
Resting_Blood_Pressure_graph = (
    alt.Chart(heart_disease_train, title = "Resting Blood Pressure vs Heart Disease")  
    .mark_point()
    .encode(
        x=alt.X("resting_blood_pressure", title= "Resting Blood Pressure (in mm Hg)"), 
        y=alt.Y("count()", title= "Count of Individuals"), color="heart_disease"
    ).properties(width=380, height=300)
)
Resting_Blood_Pressure_graph

There seems to be little to no direct correlation between resting blood pressure upon hospital admission and heart disease.

In [622]:
cholesterol_graph = (
    alt.Chart(heart_disease_train, title = "Cholesterol vs Heart Disease") 
    .mark_point()
    .encode(
        x=alt.X("cholesterol", title = "Serum cholesterol (mg/dL)"), 
        y=alt.Y("heart_disease", title= "diagnosed vs undiagnosed individual"), color="heart_disease"
    ).properties(width=380, height=300)
)
cholesterol_graph

There also seems to be little to no correlation between heart disease and cholesterol as an independent variable.
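
A scatter plot against a two-level variable can hide differences between the groups. One hedged way to double-check the visual impression is to compare summary statistics of the numeric variable within each `heart_disease` group. The values below are invented for illustration only.

```python
import pandas as pd

# Made-up cholesterol readings for two groups of four individuals.
toy = pd.DataFrame({
    "cholesterol": [200, 240, 260, 300, 210, 230, 250, 310],
    "heart_disease": ["diagnosed"] * 4 + ["undiagnosed"] * 4,
})

# One row per group, one column per summary statistic.
group_summary = (
    toy.groupby("heart_disease")["cholesterol"]
    .agg(["mean", "median", "min", "max"])
)
print(group_summary)
```

Similar means and medians in both groups would support the "little to no correlation" reading; a clear gap between them would not.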

In [623]:

# replace the values of fasting_blood_sugar
heart_disease_train["fasting_blood_sugar"] = heart_disease_train["fasting_blood_sugar"].replace({
    1: '> 120 mg/dL',
    0: '<= 120 mg/dL'
})

fasting_blood_sugar_graph = (
    alt.Chart(heart_disease_train, title = "Fasting blood sugar vs Heart Disease")  
    .mark_bar()
    .encode(
        x=alt.X("count()", title= "Count of Individuals"), color="heart_disease",
        y=alt.Y("fasting_blood_sugar", title= "The amount of fasting blood sugar"), 
    ).properties(width=500, height=200)
    .configure_title(fontSize=18)
    .configure_axis(titleFontSize=15, labelFontSize=15)
    .configure_legend(titleFontSize=15, labelFontSize=15)
)
fasting_blood_sugar_graph 

The proportions of diagnosed and undiagnosed seem similar in both categories, so we will regard this as little to no correlation for now. 

In [624]:
# replace the values of resting_electrocardiographic_results
heart_disease_train["resting_electrocardiographic_results"] = heart_disease_train[
    "resting_electrocardiographic_results"].replace({
    0: 'Normal',
    1: 'ST-T abnormality',
    2: 'Probable or definite left ventricular hypertrophy',
})

restecg_graph = (
    alt.Chart(heart_disease_train, title = "Resting electrocardiographic results vs Heart Disease")  
    .mark_bar()
    .encode(
        x=alt.X("count()", title= "Count of Individuals"), color="heart_disease",
        y=alt.Y("resting_electrocardiographic_results:N", title= "Resting electrocardiographic results"), 
    ).properties(width=500, height=300)
    .configure_title(fontSize=18)
    .configure_axis(titleFontSize=15, labelFontSize=15)
    .configure_legend(titleFontSize=15, labelFontSize=15)
)

restecg_graph

There seem to be more individuals diagnosed with heart disease among those with a result of **probable or definite left ventricular hypertrophy by Estes' criteria** than among those with a result of **having ST-T wave abnormality**. We will consider this a small correlation for now.

In [625]:
max_heart_rate_graph = (
    alt.Chart(heart_disease_train, title = "Maximum Heart Rate vs Heart Disease")  
    .mark_point()
    .encode(
        x=alt.X("max_heart_rate", title= "Maximum heart rate achieved"), 
        y=alt.Y("count()", title= "Count of Individuals"), color="heart_disease"
    ).properties(width=380, height=300)
)
max_heart_rate_graph

We can see that more of the orange points (undiagnosed) lie toward higher maximum heart rates. We are not yet sure how to interpret this, but we will keep this possible correlation in mind when deciding on variables.

In [626]:
exercise_induced_angina_graph = (
    alt.Chart(heart_disease_train, title = "Exercise Induced Angina vs Heart Disease")  
    .mark_bar()
    .encode(
        x=alt.X("exercise_induced_angina:N", title= "Experience of exercise induced angina (1 = yes; 0 = no)"), 
        y=alt.Y("count()", title= "Count of Individuals"), color="heart_disease"
    ).properties(width=380, height=300)
)
exercise_induced_angina_graph

The percentage of diagnosed individuals who experienced exercise-induced angina is noticeably higher than the percentage of undiagnosed individuals who experienced it, which suggests a strong correlation between this variable and heart disease.

In [627]:
ST_depression_exercise_graph = (
    alt.Chart(heart_disease_train, title = "ST depression induced by exercise vs Heart Disease")  
    .mark_bar()
    .encode(
        x=alt.X("ST_depression_exercise", title= "ST depression induced by exercise relative to rest"), 
        y=alt.Y("count()", title= "Count of Individuals"), color="heart_disease"
    ).properties(width=380, height=300)
)

ST_depression_exercise_graph

There seems to be a correlation between ST depression induced by exercise relative to rest and diagnosis. Most undiagnosed individuals fall between 0 and 2.0, while diagnosed individuals make up a larger share at higher ST depression values.
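
To put numbers on how the diagnosed share changes with ST depression, the continuous values can be grouped into bins and the share computed per bin. This is a sketch with made-up values; the bin edges (0, 1, 2 and above) and the `st_bin` column name are arbitrary choices for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "ST_depression_exercise": [0.0, 0.2, 0.8, 1.0, 1.4, 1.9, 2.3, 2.6, 3.5, 4.0],
    "heart_disease": [
        "undiagnosed", "undiagnosed", "undiagnosed", "diagnosed",
        "undiagnosed", "diagnosed", "diagnosed", "diagnosed",
        "diagnosed", "diagnosed",
    ],
})

# Left-closed bins: [0, 1), [1, 2), [2, inf)
toy["st_bin"] = pd.cut(
    toy["ST_depression_exercise"],
    bins=[0, 1, 2, float("inf")],
    right=False,
    labels=["0 to <1", "1 to <2", "2 or more"],
)

# The mean of a boolean column is the fraction of True values.
diagnosed_share = (
    toy.assign(is_diagnosed=toy["heart_disease"] == "diagnosed")
    .groupby("st_bin", observed=True)["is_diagnosed"]
    .mean()
)
print(diagnosed_share)
```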

In [628]:

slope_ST_graph = (
    alt.Chart(heart_disease_train, title = "Slope of the Peak Exercise ST Segment vs Heart Disease")  
    .mark_bar()
    .encode(
        x=alt.X("slope_ST", title= "Slope of the peak exercise ST segment"), 
        y=alt.Y("count()", title= "Count of Individuals"), color="heart_disease"
    ).properties(width=380, height=300)
)

slope_ST_graph

The percentage of people with heart disease is highest at the value of 2, which means a flat slope. Because we are not experts on the topic and, from our searches, do not understand this variable well, we will leave it out of our analysis.

In [629]:
major_vessels_graph = (
    alt.Chart(heart_disease_train, title = "Number of major vessels vs Heart Disease")  
    .mark_bar()
    .encode(
        x=alt.X("major_vessels", title= "Number of major vessels (0-3) colored by fluoroscopy"), 
        y=alt.Y("count()", title= "Count of Individuals"), color="heart_disease"
    ).properties(width=380, height=300)
)

major_vessels_graph

The more major vessels are colored, the higher the percentage of diagnosed individuals becomes. We think this could be an important variable for our question.

In [630]:
thalassemia_graph = (
    alt.Chart(heart_disease_train, title = "Thalassemia vs Heart Disease")  
    .mark_bar()
    .encode(
        x=alt.X("thal", title= "Presence of Thalassemia"), 
        y=alt.Y("count()", title= "Count of Individuals"), color="heart_disease"
    ).properties(width=380, height=300)
)

thalassemia_graph

Values 6 and 7 indicate the presence of thalassemia, with 7 being a reversible defect and 6 being a fixed defect. The graph suggests a strong correlation between a diagnosis of heart disease and thalassemia.

## Methods

The first part of our method is variable choice. Considering our analysis and visualizations, the top candidates for our model are the presence of thalassemia, chest pain, number of major vessels identified, exercise-induced angina, resting ECG results, and sex.
To narrow these down to the variables that we understand best and that are the strongest indicators so far, we will use the following:
- sex
- chest pain
- number of major vessels identified ?
- exercise induced angina ?
- thalassemia
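
The proposal does not name a classifier, so the following is only an illustrative sketch of how the selected variables could feed a model. It assumes scikit-learn, one-hot encodes the categorical columns, and uses k-nearest neighbours purely as a stand-in. The data are made up, only three of the listed variables are included, and `n_neighbors=3` is an arbitrary choice.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training rows using three of the selected variables.
toy_train = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male", "female"],
    "chest_pain": ["Asymptomatic", "Asymptomatic", "Atypical angina",
                   "Non-anginal pain", "Asymptomatic", "Atypical angina"],
    "major_vessels": [2.0, 3.0, 0.0, 0.0, 1.0, 0.0],
    "heart_disease": ["diagnosed", "diagnosed", "undiagnosed",
                      "undiagnosed", "diagnosed", "undiagnosed"],
})

# One-hot encode the text columns; pass the numeric column through.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["sex", "chest_pain"]),
    remainder="passthrough",  # major_vessels is already numeric
)
# Stand-in classifier (assumption): k-nearest neighbours.
model = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=3))

X_train = toy_train.drop(columns="heart_disease")
y_train = toy_train["heart_disease"]
model.fit(X_train, y_train)

# Hypothetical new patient to classify.
new_patient = pd.DataFrame({
    "sex": ["male"], "chest_pain": ["Asymptomatic"], "major_vessels": [2.0],
})
prediction = model.predict(new_patient)
print(prediction)
```

Keeping the encoding and the classifier in one pipeline means the same preprocessing is applied to the training data and to any new rows.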

## Expected outcomes and significance

### What we expect to find:

Patients with higher heart rates and higher blood pressures are at a higher risk of developing heart disease. 
Older patients are more likely to develop heart disease. 
Males are expected to be more likely to develop heart disease than females. 
Patients with high cholesterol are more likely to develop heart disease. 

### What could these findings lead to

These findings could give insight to many people who might be at increased risk of heart disease for unpreventable reasons. For example, family history, preexisting health conditions, age, and biological sex can all be used to inform individuals that they are at increased risk, so that preventative measures can be taken. As we mentioned, predicting heart disease could be life-changing, and identifying the determining variables could help save lives.

### What future questions could this lead to?

What variable is the best predictor of heart disease?
Does this apply to all American cities? The world?
Which socioeconomic groups are most at risk of heart disease based on the predictors discovered in the research? How can we reduce the exposure of these groups to heart disease risk?
Why are males more likely to develop heart disease than females? Is it a biological factor, or a matter of different lifestyles (for example, are males more likely to smoke or drink alcohol)?