# Group 22 Project Final Report

*An Analysis of the Risk of Developing Heart Disease for Individuals in Cleveland, Ohio.*

## Introduction

Cardiovascular disease includes any condition that affects the circulatory system. These conditions impair the function of the heart and blood vessels, and the specific effects differ between types of cardiovascular disease (Thiriet, 1970). Cardiovascular disease is also commonly known as heart disease (Thiriet, 1970).
Early diagnosis is essential so that individuals with heart disease can start treatment as soon as possible (Pal et al., 2022). A model that accurately predicts the risk of developing cardiovascular disease could therefore save many lives (Pal et al., 2022).
By narrowing the analysis down to specific factors such as cholesterol or blood pressure, we may also be able to advise on which proactive measures matter most for minimizing the risk of heart disease.

Therefore, the question we intend to answer is: can we predict which patients in Cleveland, Ohio are at the highest risk of developing heart disease?

The dataset we will be using is "processed.cleveland.data" from the Heart Disease Data Set (https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/) provided by the UCI Machine Learning Repository. It includes 14 attributes, each either numerical or categorical, which are described in the table in the next section.

## Summary of Preliminary Analysis

We will start with a quick summary of our previous analysis. The table below describes which variables have a possible correlation with heart disease. We noticed some minor mistakes in that analysis, but we will proceed with two numerical variables that seem to be correlated with heart disease.

The dataset has 14 columns: six numerical variables and eight categorical variables.

| # | Variable | Description | Value | Possible Correlation |
|---|----------|-------------|-------|----------------------|
| 1 | age | Individual's age | numerical | Yes |
| 2 | sex | Individual's sex | male or female | Yes |
| 3 | chest_pain | Chest pain type | Typical angina, Atypical angina, Non-anginal pain, or Asymptomatic | Yes |
| 4 | resting_blood_pressure | Resting blood pressure (in mm Hg on admission to the hospital) | numerical | Worth checking again |
| 5 | cholesterol | Serum cholesterol in mg/dL | numerical | Needs re-analysis |
| 6 | fasting_blood_sugar | Fasting blood sugar > 120 mg/dL | True or False | No |
| 7 | resting_electrocardiographic_results | Resting electrocardiographic results | <ul><li>0: normal</li><li>1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)</li><li>2: showing probable or definite left ventricular hypertrophy by Estes' criteria</li></ul> | No |
| 8 | max_heart_rate | Maximum heart rate achieved | numerical | Yes |
| 9 | exercise_induced_angina | Exercise induced angina | Yes or No | Yes |
| 10 | st_depression_exercise | ST depression induced by exercise relative to rest | numerical | Yes |
| 11 | slope_st | The slope of the peak exercise ST segment | Upsloping, Flat, or Downsloping | Yes (but too technical) |
| 12 | major_vessels | Number of major vessels (0-3) colored by fluoroscopy | numerical | Yes |
| 13 | thal | Thalassemia | Normal, Fixed defect, or Reversable defect | Yes |
| 14 | heart_disease | Diagnosis of presence of heart disease | diagnosed or undiagnosed | NA |
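
One simple way to revisit the "Possible Correlation" column for a numerical variable is to compare its mean between the two diagnosis groups. The following is only a minimal sketch under stated assumptions: the `toy` frame and the `group_means` helper are hypothetical names introduced for illustration, and the numbers are made up rather than taken from the Cleveland data.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the Cleveland data.
toy = pd.DataFrame({
    "max_heart_rate": [150.0, 170.0, 110.0, 130.0],
    "heart_disease": ["undiagnosed", "undiagnosed", "diagnosed", "diagnosed"],
})


def group_means(df, label_col, numeric_cols):
    """Return the mean of each numeric column within each label group."""
    return df.groupby(label_col)[numeric_cols].mean()


# Hypothetical helper applied to the toy frame
means = group_means(toy, "heart_disease", ["max_heart_rate"])
print(means)
```

A clear gap between the group means suggests the variable is worth keeping as a predictor, while similar means suggest it may need re-analysis. A difference in means alone does not establish a relationship, so this is best treated as a first check.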

## Cleaning and Wrangling

In [1]:
# This cleaning-up procedure is mostly the same as our proposal
# We added some steps to omit rows containing unclassified values
import altair as alt
import numpy as np
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

heart_disease_data = pd.read_csv(url, names=[
    "age", 
    "sex", 
    "chest_pain", 
    "resting_blood_pressure", 
    "cholesterol", 
    "fasting_blood_sugar", 
    "resting_electrocardiographic_results", 
    "max_heart_rate",
    "exercise_induced_angina",  
    "st_depression_exercise", 
    "slope_st", 
    "major_vessels", 
    "thal",
    "diagnosis"
])

heart_disease_data = heart_disease_data.dropna(how='all')

# omit rows containing unclassified values
heart_disease_data = heart_disease_data.drop(
    heart_disease_data[heart_disease_data.major_vessels == '?'].index
)
heart_disease_data["major_vessels"] = pd.to_numeric(heart_disease_data["major_vessels"])

heart_disease_data = heart_disease_data.drop(
    heart_disease_data[heart_disease_data.thal == '?'].index
)
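
As a possible alternative to dropping the `'?'` rows column by column, pandas can treat `'?'` as missing while reading the file, so that a single `dropna` call removes the affected rows. The sketch below is an assumption-laden illustration, not the procedure used in this report: it reads two made-up rows from an in-memory string instead of the real URL, and `sample`, `df`, and `clean` are hypothetical names.

```python
import io

import pandas as pd

# Two made-up rows, shortened to three columns for illustration;
# "?" marks a missing value in the same way as the raw file.
sample = io.StringIO("63.0,0.0,6.0\n67.0,?,3.0\n")

# na_values="?" makes pandas parse "?" as NaN, so both columns stay numeric
df = pd.read_csv(sample, names=["age", "major_vessels", "thal"], na_values="?")

# Drop every row that is missing either of the two affected columns
clean = df.dropna(subset=["major_vessels", "thal"])
print(clean)
```

With this approach `thal` would be read as numbers rather than strings, so any later relabeling step would need numeric keys (`3.0` instead of `'3.0'`).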

In [2]:
# replace numeric values with categorical labels for categorical variables
# except for 'resting_electrocardiographic_results' which requires a long description for each label
heart_disease_data["sex"] = heart_disease_data["sex"].apply(lambda x: "male" if (x == 1.0) else "female")

heart_disease_data["chest_pain"] = heart_disease_data["chest_pain"].replace({
    1: 'Typical angina',
    2: 'Atypical angina',
    3: 'Non-anginal pain',
    4: 'Asymptomatic'
})

heart_disease_data["fasting_blood_sugar"] = heart_disease_data["fasting_blood_sugar"].apply(lambda x: True if (x == 1.0) else False)

heart_disease_data["exercise_induced_angina"] = heart_disease_data["exercise_induced_angina"].apply(lambda x: 'Yes' if (x == 1.0) else 'No')

heart_disease_data["slope_st"] = heart_disease_data["slope_st"].replace({
    1.0: 'Upsloping',
    2.0: 'Flat',
    3.0: 'Downsloping'
})

# 'thal' was read as strings because the raw column contains '?', so the keys are strings
heart_disease_data["thal"] = heart_disease_data["thal"].replace({
    '3.0': 'Normal',
    '6.0': 'Fixed defect',
    '7.0': 'Reversable defect'
})

In [3]:
# define a column, 'heart_disease', based on the 'diagnosis' column and drop 'diagnosis'
heart_disease_data["heart_disease"] = heart_disease_data["diagnosis"].apply(
    lambda x: "undiagnosed" if (x == 0) else "diagnosed")
heart_disease_data = heart_disease_data.drop(columns=["diagnosis"])

heart_disease_data

| | age | sex | chest_pain | resting_blood_pressure | cholesterol | fasting_blood_sugar | resting_electrocardiographic_results | max_heart_rate | exercise_induced_angina | st_depression_exercise | slope_st | major_vessels | thal | heart_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | male | Typical angina | 145.0 | 233.0 | True | 2.0 | 150.0 | No | 2.3 | Downsloping | 0.0 | Fixed defect | undiagnosed |
| 1 | 67.0 | male | Asymptomatic | 160.0 | 286.0 | False | 2.0 | 108.0 | Yes | 1.5 | Flat | 3.0 | Normal | diagnosed |
| 2 | 67.0 | male | Asymptomatic | 120.0 | 229.0 | False | 2.0 | 129.0 | Yes | 2.6 | Flat | 2.0 | Reversable defect | diagnosed |
| 3 | 37.0 | male | Non-anginal pain | 130.0 | 250.0 | False | 0.0 | 187.0 | No | 3.5 | Downsloping | 0.0 | Normal | undiagnosed |
| 4 | 41.0 | female | Atypical angina | 130.0 | 204.0 | False | 2.0 | 172.0 | No | 1.4 | Upsloping | 0.0 | Normal | undiagnosed |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 297 | 57.0 | female | Asymptomatic | 140.0 | 241.0 | False | 0.0 | 123.0 | Yes | 0.2 | Flat | 0.0 | Reversable defect | diagnosed |
| 298 | 45.0 | male | Typical angina | 110.0 | 264.0 | False | 0.0 | 132.0 | No | 1.2 | Flat | 0.0 | Reversable defect | diagnosed |
| 299 | 68.0 | male | Asymptomatic | 144.0 | 193.0 | True | 0.0 | 141.0 | No | 3.4 | Flat | 2.0 | Reversable defect | diagnosed |
| 300 | 57.0 | male | Asymptomatic | 130.0 | 131.0 | False | 0.0 | 115.0 | Yes | 1.2 | Flat | 1.0 | Reversable defect | diagnosed |
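
After relabeling, it can be worth checking that no raw codes were left unmapped and how many rows fall into each diagnosis group. The following is a rough sketch on a made-up frame, not part of the original analysis: `unmapped_values` and `toy` are hypothetical names, and the stray `3.0` is planted deliberately to show what the check reports.

```python
import pandas as pd


def unmapped_values(series, allowed):
    """Return the distinct non-missing values in `series` that are not in `allowed`."""
    leftovers = set(series.dropna().unique()) - set(allowed)
    # key=str lets labels and stray numeric codes be sorted together
    return sorted(leftovers, key=str)


# Toy rows for illustration only; the 3.0 simulates a code that was never relabeled.
toy = pd.DataFrame({
    "slope_st": ["Upsloping", "Flat", 3.0],
    "heart_disease": ["undiagnosed", "diagnosed", "diagnosed"],
})

leftover = unmapped_values(toy["slope_st"], ["Upsloping", "Flat", "Downsloping"])
counts = toy["heart_disease"].value_counts()

print(leftover)
print(counts)
```

An empty `leftover` list would indicate that every value in the column is one of the expected labels. The `value_counts` result shows the class balance, which is useful to know before fitting any classifier.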
