# Group Project Proposal

*An Analysis of the Risk of Developing Heart Disease for Individuals in Cleveland, Ohio.*

## Introduction

Cardiovascular disease, commonly known as heart disease, includes any condition that affects the circulatory system (Thiriet, 1970). These conditions impair the function of the heart and blood vessels, and the effects differ between types of cardiovascular disease (Thiriet, 1970).
Early diagnosis is essential so that individuals with heart disease can start treatment as soon as possible (Pal et al., 2022). A model that accurately predicts the risk of developing cardiovascular disease could therefore save many lives (Pal et al., 2022).
By narrowing the analysis down to specific factors, such as cholesterol or blood pressure, we may also be able to advise on which proactive measures matter most for minimizing the risk of heart disease.

Therefore, the question we intend to answer is: can we predict which patients in Cleveland, Ohio are at the highest risk of developing heart disease?

The dataset we will be using was retrieved from "processed.cleveland.data" in the Heart Disease Data Set (https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/) provided by the UCI Machine Learning Repository. The dataset includes the following 14 attributes, each of which is either numerical or categorical.

| # | Variable | Description | Value |
|---|----------|-------------|-------|
| 1 | age | Individual's age | numerical |
| 2 | sex | Individual's sex | 1 = male; 0 = female |
| 3 | chest_pain | Chest pain type | <ul><li>1: typical angina</li><li>2: atypical angina</li><li>3: non-anginal pain</li><li>4: asymptomatic</li></ul> |
| 4 | resting_blood_pressure | Resting blood pressure (in mm Hg on admission to the hospital) | numerical |
| 5 | cholesterol | Serum cholesterol in mg/dL | numerical |
| 6 | fasting_blood_sugar | Fasting blood sugar > 120 mg/dL | 1 = true; 0 = false |
| 7 | resting_electrocardiographic | Resting electrocardiographic results | <ul><li>0: normal</li><li>1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)</li><li>2: showing probable or definite left ventricular hypertrophy by Estes' criteria</li></ul> |
| 8 | max_heart_rate | Maximum heart rate achieved | numerical |
| 9 | exercise_induced_angina | Exercise induced angina | 1 = yes; 0 = no |
| 10 | ST_depression_exercise | ST depression induced by exercise relative to rest | numerical |
| 11 | slope_ST | The slope of the peak exercise ST segment | <ul><li>1: upsloping</li><li>2: flat</li><li>3: downsloping</li></ul> |
| 12 | major_vessels | Number of major vessels (0-3) colored by fluoroscopy | numerical |
| 13 | thal | Thalassemia | 3 = normal; 6 = fixed defect; 7 = reversible defect |
| 14 | diagnosis | Diagnosis of presence of heart disease (Note: the Cleveland database uses this variable simply to distinguish presence of disease from absence, rather than distinguishing degrees of vessel diameter narrowing) | numerical; from 0 (no presence) to 4 |

## Preliminary Exploratory Data Analysis

Our goal is to get the data down to 5 main variables to answer our question. To do so, we will analyze the correlation between heart disease and each variable independently.
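
One possible way to screen variables one at a time, once the data is loaded, is to correlate each numerical predictor with a 0/1 disease indicator and rank the results. The sketch below is only an illustration under stated assumptions: the rows are invented, the `has_disease` column is a hypothetical helper name, and Pearson correlation is one choice among several (it does not suit the categorical variables).

```python
import pandas as pd

# Toy rows invented purely for illustration; they are not from the Cleveland data.
toy = pd.DataFrame({
    "age": [40, 45, 50, 55, 60, 65],
    "cholesterol": [200, 260, 210, 250, 220, 240],
    "diagnosis": [0, 0, 0, 1, 2, 3],
})

# Collapse the 0-4 diagnosis into a 0/1 indicator (1 = any presence of disease).
# "has_disease" is a hypothetical column name used only in this sketch.
toy["has_disease"] = (toy["diagnosis"] > 0).astype(int)

# Pearson correlation of each candidate predictor with the indicator,
# then ranked by absolute strength (strongest first).
predictors = ["age", "cholesterol"]
correlations = toy[predictors].corrwith(toy["has_disease"])
ranked = correlations.abs().sort_values(ascending=False)
print(ranked)
```

A ranking like this is only a first filter; variables that look weak on their own can still be useful in combination with others.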

First, we will load the dataset from the online directory using pandas.

In [254]:
import altair as alt
import numpy as np
import pandas as pd

In [255]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

heart_disease_data = pd.read_csv(url, names=[
    "age", 
    "sex", 
    "chest_pain", 
    "resting_blood_pressure", 
    "cholesterol", 
    "fasting_blood_sugar", 
    "resting_electrocardiographic _results", 
    "max_heart_rate",
    "exercise_induced_angina",  
    "ST_depression_exercise", 
    "slope_ST", 
    "major_vessels", 
    "thal",
    "diagnosis"
])
heart_disease_data

| | age | sex | chest_pain | resting_blood_pressure | cholesterol | fasting_blood_sugar | resting_electrocardiographic _results | max_heart_rate | exercise_induced_angina | ST_depression_exercise | slope_ST | major_vessels | thal | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45.0 | 1.0 | 1.0 | 110.0 | 264.0 | 0.0 | 0.0 | 132.0 | 0.0 | 1.2 | 2.0 | 0.0 | 7.0 | 1 |
| 299 | 68.0 | 1.0 | 4.0 | 144.0 | 193.0 | 1.0 | 0.0 | 141.0 | 0.0 | 3.4 | 2.0 | 2.0 | 7.0 | 2 |
| 300 | 57.0 | 1.0 | 4.0 | 130.0 | 131.0 | 0.0 | 0.0 | 115.0 | 1.0 | 1.2 | 2.0 | 1.0 | 7.0 | 3 |
| 301 | 57.0 | 0.0 | 2.0 | 130.0 | 236.0 | 0.0 | 2.0 | 174.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | 1 |


Next, to obtain a tidy format for analysis, we will clean and wrangle the data.

1. We will remove missing values.
2. The variable for sex is stored as a float. To make further analysis easier, we will convert it to a categorical variable: 1 = male; 0 = female. Note: this binary coding does not imply any sex or gender categorization.
3. For the diagnosis variable, the researchers classify 0 as absence of heart disease and 1, 2, 3, and 4 as presence. Therefore, we will group 1, 2, 3, and 4 as "diagnosed" and label 0 as "undiagnosed."

In [256]:
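# dropna is a method: call it and keep the returned DataFrame
# note: dropna() only removes NaN; missing entries read in as a placeholder
# string such as "?" would also need na_values in pd.read_csv to be caught
heart_disease_data = heart_disease_data.dropna()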

heart_disease_data["sex"] = heart_disease_data["sex"].apply(lambda x: "male" if (x == 1.0) else "female")

# define a column, "heart_disease", based on the "diagnosis" column
heart_disease_data["heart_disease"] = heart_disease_data["diagnosis"].apply(lambda x: "undiagnosed" if (x == 0) else "diagnosed")

heart_disease_data

| | age | sex | chest_pain | resting_blood_pressure | cholesterol | fasting_blood_sugar | resting_electrocardiographic _results | max_heart_rate | exercise_induced_angina | ST_depression_exercise | slope_ST | major_vessels | thal | diagnosis | heart_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | male | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 | undiagnosed |
| 1 | 67.0 | male | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 | diagnosed |
| 2 | 67.0 | male | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 | diagnosed |
| 3 | 37.0 | male | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 | undiagnosed |
| 4 | 41.0 | female | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 | undiagnosed |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45.0 | male | 1.0 | 110.0 | 264.0 | 0.0 | 0.0 | 132.0 | 0.0 | 1.2 | 2.0 | 0.0 | 7.0 | 1 | diagnosed |
| 299 | 68.0 | male | 4.0 | 144.0 | 193.0 | 1.0 | 0.0 | 141.0 | 0.0 | 3.4 | 2.0 | 2.0 | 7.0 | 2 | diagnosed |
| 300 | 57.0 | male | 4.0 | 130.0 | 131.0 | 0.0 | 0.0 | 115.0 | 1.0 | 1.2 | 2.0 | 1.0 | 7.0 | 3 | diagnosed |
| 301 | 57.0 | female | 2.0 | 130.0 | 236.0 | 0.0 | 2.0 | 174.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | 1 | diagnosed |
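
A detail worth checking when removing missing values is how they are encoded in the raw file. The processed Cleveland file appears to mark missing entries with "?", which pandas reads as ordinary text unless told otherwise. The sketch below uses three made-up rows and a reduced, illustrative set of columns to show the difference; it is an assumption-laden illustration, not the cleaning step used in this proposal.

```python
import io

import pandas as pd

# Three made-up rows in a comma-separated layout; "?" marks a missing value.
raw = io.StringIO(
    "63.0,1.0,0.0,6.0,0\n"
    "67.0,1.0,?,3.0,2\n"
    "41.0,0.0,0.0,?,0\n"
)
columns = ["age", "sex", "major_vessels", "thal", "diagnosis"]

# Without na_values, "?" stays a plain string, so dropna() keeps every row.
as_text = pd.read_csv(raw, names=columns)
rows_kept_without_na = len(as_text.dropna())

# With na_values="?", the marker becomes NaN and dropna() removes those rows.
raw.seek(0)
as_nan = pd.read_csv(raw, names=columns, na_values="?")
cleaned = as_nan.dropna()

print(rows_kept_without_na, len(cleaned))
```

Reading the marker as NaN also lets pandas store the affected columns as floats rather than strings, which matters for any later numerical summary.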


In [257]:
# every row represents one patient, so this column is a constant 1
heart_disease_data["number_of_patients"] = 1
heart_disease_data

| | age | sex | chest_pain | resting_blood_pressure | cholesterol | fasting_blood_sugar | resting_electrocardiographic _results | max_heart_rate | exercise_induced_angina | ST_depression_exercise | slope_ST | major_vessels | thal | diagnosis | heart_disease | number_of_patients |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | male | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 | undiagnosed | 1 |
| 1 | 67.0 | male | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 | diagnosed | 1 |
| 2 | 67.0 | male | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 | diagnosed | 1 |
| 3 | 37.0 | male | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 | undiagnosed | 1 |
| 4 | 41.0 | female | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 | undiagnosed | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45.0 | male | 1.0 | 110.0 | 264.0 | 0.0 | 0.0 | 132.0 | 0.0 | 1.2 | 2.0 | 0.0 | 7.0 | 1 | diagnosed | 1 |
| 299 | 68.0 | male | 4.0 | 144.0 | 193.0 | 1.0 | 0.0 | 141.0 | 0.0 | 3.4 | 2.0 | 2.0 | 7.0 | 2 | diagnosed | 1 |
| 300 | 57.0 | male | 4.0 | 130.0 | 131.0 | 0.0 | 0.0 | 115.0 | 1.0 | 1.2 | 2.0 | 1.0 | 7.0 | 3 | diagnosed | 1 |
| 301 | 57.0 | female | 2.0 | 130.0 | 236.0 | 0.0 | 2.0 | 174.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | 1 | diagnosed | 1 |


We check the mean, minimum, and maximum values of the following numerical variables: age, resting_blood_pressure, cholesterol, and max_heart_rate.

In [258]:
age_stat = heart_disease_data.agg({'age': ['mean', 'min', 'max']}).round(decimals=1)
rbp_stat = heart_disease_data.agg({'resting_blood_pressure': ['mean', 'min', 'max']}).round(decimals=2)
chol_stat = heart_disease_data.agg({'cholesterol': ['mean', 'min', 'max']}).round(decimals=2)
max_hr_stat = heart_disease_data.agg({'max_heart_rate': ['mean', 'min', 'max']}).round(decimals=2)

num_stat = pd.concat([age_stat, rbp_stat, chol_stat, max_hr_stat], axis=1)
num_stat

| | age | resting_blood_pressure | cholesterol | max_heart_rate |
|---|---|---|---|---|
| mean | 54.4 | 131.69 | 246.69 | 149.61 |
| min | 29.0 | 94.0 | 126.0 | 71.0 |
| max | 77.0 | 200.0 | 564.0 | 202.0 |
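
The same kind of summary can also be produced in one step by selecting the numerical columns first and passing a list of statistics to `agg`. This is a minimal sketch on invented rows; the `numeric_columns` list is a hypothetical helper, and the uniform rounding to two decimals is a choice made here rather than something the analysis above requires.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the Cleveland data.
toy = pd.DataFrame({
    "age": [29, 54, 77],
    "resting_blood_pressure": [94, 130, 200],
    "sex": ["male", "female", "male"],
})

# Select the numerical columns, then compute all three statistics at once.
numeric_columns = ["age", "resting_blood_pressure"]
summary = toy[numeric_columns].agg(["mean", "min", "max"]).round(2)
print(summary)
```

The result has one row per statistic and one column per variable, the same layout as the concatenated table above.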


In [259]:
heart_disease_data["sex"].value_counts()

male      206
female     97
Name: sex, dtype: int64
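
Because the sample contains roughly twice as many male as female patients, raw counts by sex can be misleading; a cross-tabulation with row-wise proportions may be easier to compare. The sketch below uses made-up rows with the same `sex` and `heart_disease` labels as above, and is only meant to illustrate the technique.

```python
import pandas as pd

# Made-up rows for illustration; the labels mirror the columns defined above.
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "male"],
    "heart_disease": [
        "diagnosed", "undiagnosed", "diagnosed",
        "undiagnosed", "undiagnosed", "diagnosed",
    ],
})

# Raw counts of each diagnosis label within each sex.
counts = pd.crosstab(toy["sex"], toy["heart_disease"])

# Row-wise proportions, so groups of different size can be compared directly.
shares = pd.crosstab(toy["sex"], toy["heart_disease"], normalize="index")
print(counts)
print(shares)
```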

Now we will start the exploratory data analysis. First, we will split the dataset so that the exploratory analysis uses only the training data.

In [260]:
from sklearn.model_selection import train_test_split

heart_disease_train, heart_disease_test = train_test_split(heart_disease_data, test_size=0.25, random_state=123)
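
One optional refinement, not used in the split above, is to stratify on the class label so that the training and test sets keep similar proportions of diagnosed and undiagnosed patients. The sketch below shows the `stratify` argument of `train_test_split` on invented rows; the `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented, perfectly balanced toy data: 10 diagnosed and 10 undiagnosed rows.
toy = pd.DataFrame({
    "age": range(40, 60),
    "heart_disease": ["diagnosed", "undiagnosed"] * 10,
})

# stratify keeps the class proportions (approximately) equal in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=123, stratify=toy["heart_disease"]
)

print(toy_train["heart_disease"].value_counts())
print(toy_test["heart_disease"].value_counts())
```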

Next, we will plot the training data to visualize the relationships between the variables and the diagnosis.

In [261]:
age_scatterplot = (
    alt.Chart(heart_disease_train, title="Age vs Heart Disease Diagnosis")
    .mark_point(opacity=0.5)
    .encode(
        x=alt.X("age", title="Age (years)"),
        # diagnosis is a 0-4 score rather than a measured risk, so label it as such
        y=alt.Y("diagnosis", title="Diagnosis (0 = no presence, 1-4 = presence)"),
        color="sex"
    )
    .properties(width=380, height=300)
)

age_scatterplot




In [262]:
rest_bp_vs_cholesterol_plot = (
    alt.Chart(heart_disease_train , title= "Cholesterol vs Resting Blood Pressure")
    .mark_point(opacity= 0.5) 
    .encode(
        x=alt.X("resting_blood_pressure", title = "Resting blood pressure (mm Hg)", scale=alt.Scale(zero=False)),
        y=alt.Y("cholesterol", title = "Serum cholesterol (mg/dL)"),
        color="heart_disease"
    )
    .configure_title(fontSize=18)
    .configure_axis(labelFontSize=15, titleFontSize=15)
    .configure_legend(labelFontSize=15, titleFontSize=15)
    .properties(width=380, height=300)
)

rest_bp_vs_cholesterol_plot

In [263]:
sex_bar_graph = (alt.Chart(heart_disease_train, title="Number of Individuals with and without Heart Disease by Sex")
                 .mark_bar()
                 .encode(
                     x=alt.X("sex", title="Sex"),
                     y=alt.Y("count()", title="Number of individuals"),
                     color="heart_disease"
                 ).configure_title(fontSize=18)
                 .configure_axis(titleFontSize=15, labelFontSize=15)
                 .configure_legend(titleFontSize=15, labelFontSize=15)
                 .properties(width=380, height=300)
)
sex_bar_graph

In [264]:
chest_pain_graph = (
    # bar chart of counts, because chest pain is a categorical code from 1 to 4
    alt.Chart(heart_disease_train, title="Chest Pain Type vs Heart Disease")
    .mark_bar()
    .encode(
        x=alt.X("chest_pain:N", title="Chest pain type"),
        y=alt.Y("count()", title="Number of individuals"),
        color="heart_disease"
    ).properties(width=380, height=300)
)
chest_pain_graph
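
To read a chart like this alongside numbers, it may help to replace the chest pain codes with the labels from the attribute table and compute the share of diagnosed patients per type. The sketch below is a hedged illustration on invented rows; `chest_pain_labels` and `share_diagnosed` are hypothetical names introduced here.

```python
import pandas as pd

# Code-to-label mapping taken from the attribute table in the introduction.
chest_pain_labels = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}

# Invented rows for illustration; codes are floats, as in the loaded data.
toy = pd.DataFrame({
    "chest_pain": [1.0, 4.0, 4.0, 3.0, 2.0, 4.0],
    "heart_disease": [
        "undiagnosed", "diagnosed", "diagnosed",
        "undiagnosed", "undiagnosed", "undiagnosed",
    ],
})

# Cast the float codes to int before mapping them to readable labels.
toy["chest_pain_label"] = toy["chest_pain"].astype(int).map(chest_pain_labels)

# Share of diagnosed patients within each chest pain type.
is_diagnosed = toy["heart_disease"] == "diagnosed"
share_diagnosed = is_diagnosed.groupby(toy["chest_pain_label"]).mean()
print(share_diagnosed)
```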
