# Predicting Heart Disease


Author: Xiaohua Su

Date: May 17th, 2022

# Overview

As of 2020, heart disease is the leading cause of death in the US, claiming close to 700,000 lives that year. It is the leading cause of death regardless of gender and for most racial and ethnic groups. The disease can lead to early death, increased medical visits, and lost productivity in our economy. As such, it is important to try to address it.


# Business Problem

With how prevalent heart disease is in the nation, it is important for doctors to discuss early prevention with their patients. To do this, doctors need to know more about a patient’s history in order to diagnose heart disease, potentially requiring blood work as well. The results from the blood work usually arrive after the patient has already left the doctor’s office. Calls are then made to discuss these results, and potential follow-up appointments are scheduled.

Our model aims to predict whether a patient who comes into a doctor’s office or hospital has heart disease. By predicting whether the patient has heart disease, we can flag the patient for the doctor electronically. Instead of waiting for a phone call, which may not even be between the patient and the doctor, the conversation about managing heart disease can begin during the visit. This flag can help start the conversation about early prevention steps and can guide the doctor in asking certain questions for further verification and testing.

# Data

The data was taken from the [CDC's 2020 Behavioral Risk Factor Surveillance System](https://www.cdc.gov/brfss/annual_data/annual_2020.html) (BRFSS). Because the data is so large, it was not uploaded to GitHub, but it can be found at the link above under the data files section.

It is survey data collected between 2020 and 2021 by the CDC to monitor people's health behaviors, chronic health conditions, and use of services to help manage their disease. The data contains information about the individual, such as `race` and `gender`, that we will not use in order to avoid these biases in our models. A new column was created because the data does not have a column specifically for heart disease; instead, it has two columns, `cvdinfr4` and `cvdcrhd4`, which record whether the individual was ever told they had a heart attack and whether they were ever told they had coronary heart disease. Both questions get at the issue of heart disease.

# Data Prep

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#display all columns in dataframe
pd.set_option('display.max_columns', None)

## Inspection of Original Dataset

In [None]:
#load the data
df = pd.read_sas('./LLCP2020.XPT')

In [None]:
#Inspect the data
df

In [None]:
#cleaning the columns names up for easier access
df.columns = [name.strip().lower() for name in df.columns]

After looking at the data, it is clear that the code book is required to figure out what each column represents. It will also be beneficial to rename these columns after cleaning and dropping some of them.

In [None]:
#verifying that this frequency matches with what's written in the code book
#looking at the years this survey was conducted
df.iyear.value_counts()

In [None]:
#heart attacks
df.cvdinfr4.value_counts(normalize= True)

In [None]:
df.cvdinfr4.isnull().sum()

In [None]:
#Coronary Heart Disease
df.cvdcrhd4.value_counts(normalize= True)

1 = yes,  2 = no,  7 = Don't know/Not sure ,   9 = refused

We will look at heart disease, which is defined by the CDC as : stuff. As such, it is reasonable to combine heart attacks and coronary heart disease into a new column called `heart_disease` after initial cleaning.

In [None]:
#creation of the heart_disease column
conditions = [
    (df.cvdcrhd4 == 1),
    (df.cvdinfr4 == 1)
]

values = [1,1]

df['heart_disease'] = np.select(conditions, values)

In [None]:
df.heart_disease.value_counts()

In [None]:
#counting the fully duplicated rows
df.duplicated().sum()

***After looking at the code book,*** these are the potential columns I want to keep, as they are or can be related to heart disease. Some are potentially environmental factors, such as income. Other features, while related to heart disease, were too fine-grained for the business problem or were not asked of the individual because they did not apply. Those features have 50% or more missing or blank values, and imputing them would skew the data heavily, so they were left out:

- GENHLTH : general health; Would you say that in general your health is

- PHYSHLTH : Number of Days Physical Health Not Good; Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? 

- MENTHLTH : Number of Days Mental Health Not Good; Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? 


- POORHLTH : Poor Physical or Mental Health; During the past 30 days, for about how many days did poor physical or mental health keep you from doing your usual activities, such as self-care, work, or recreation? 

- HLTHPLN1 :  Have any health care coverage

- PERSDOC2 : Multiple Health Care Professionals; Do you have one person you think of as your personal doctor or health care provider? (If ´No´ ask ´Is there more than one or is there no person who you think of as your personal doctor or health care provider?´.)

- MEDCOST : past 12 months, Could Not See Doctor Because of Cost

- CHECKUP1 : Length of time since last routine checkup

- EXERANY2 : Exercise in Past 30 Days

- SLEPTIM1 : How Much Time Do You Sleep 

- CVDSTRK3 : Ever Diagnosed with a Stroke

- ASTHMA3 : Ever Told Had Asthma 

- ASTHNOW  : Still Have Asthma 

- CHCSCNCR : (Ever told) you had skin cancer

- CHCOCNCR : (Ever told) you had any other types of cancer?

- CHCCOPD2 : (Ever told) (you had) chronic obstructive pulmonary disease, C.O.P.D., emphysema or chronic bronchitis?

- HAVARTH4 : Told Had Arthritis; (Ever told) (you had) some form of arthritis, rheumatoid arthritis, gout, lupus, or fibromyalgia? (Arthritis diagnoses include: rheumatism, polymyalgia rheumatica; osteoarthritis (not osteoporosis); tendonitis, bursitis, bunion, tennis elbow; carpal tunnel syndrome, tarsal tunnel syndrome; joint infection, etc.)

- ADDEPEV3 : (Ever told) (you had) a depressive disorder (including depression, major depression, dysthymia, or minor depression)?

- CHCKDNY2 : Ever told you have kidney disease?

- DIABETE4: (Ever told) you had diabetes; (Ever told) (you had) diabetes? (If ´Yes´ and respondent is female, ask ´Was this only when you were pregnant?´. If Respondent says pre-diabetes or borderline diabetes, use response code 4.)

- EDUCA : Education Level

- RENTHOM1 : Own or Rent Home

- EMPLOY1 :  Employment Status

- INCOME2 : Income Level 

- WEIGHT2 : Reported Weight in Pounds

- HEIGHT3 :  Reported Height in Feet and Inches 

- DIFFWALK : Difficulty Walking or Climbing Stairs

- SMOKE100 : Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]

- USENOW3 : Use of Smokeless Tobacco Products; Do you currently use chewing tobacco, snuff, or snus every day, some days, or not at all? (Snus (Swedish for snuff) is a moist smokeless tobacco, usually sold in small pouches that are placed under the lip against the gum.) [Snus (rhymes with ´goose´)]

- ALCDAY5 : Days in past 30 had alcoholic beverage 

- HIVRISK5 : Do Any High Risk Situations Apply

- ECIGARET : Ever used an e-cigarette?

All other columns will be dropped. They either contain information that would introduce bias into the models, such as `imrace` and `colgsex`, which refer to race and gender respectively, or they hold finer details about some of the columns above that over half of the respondents were not asked because the question did not pertain to them, and I did not think they would be helpful to the model. Further columns will be dropped along the way if they contain too many nulls or if it is later determined that they are too similar to another feature. No cleaning was done on the original dataset, as we still need to create our heart disease dataframe.

## Creation of Heart Disease Dataset

In [None]:
#Creating a heart_disease specific data frame
#.copy() makes heart_df independent of df, so the in-place edits below do not raise SettingWithCopyWarning
heart_df = df[['genhlth', 'physhlth', 'menthlth', 'poorhlth', 'hlthpln1' , 'persdoc2' , 'medcost' , 'checkup1' ,
                  'exerany2' , 'sleptim1' , 'cvdstrk3' , 'asthma3' , 'chcscncr' , 'chcocncr' , 'chccopd2' , 'havarth4' ,
                  'addepev3' , 'chckdny2' , 'diabete4' , 'educa' , 'renthom1' , 'employ1' , 'income2' , 'weight2' ,
                  'height3' , 'diffwalk' , 'smoke100' , 'usenow3' , 'alcday5' , 'hivrisk5' , 'ecigaret' ,
              'cvdcrhd4', 'cvdinfr4' , 'heart_disease']].copy()

In [None]:
heart_df

### Renaming our columns

Due to the nature of some of the column names, it is difficult to keep track of what they represent. As such, they will be renamed to more interpretable names.

In [None]:
heart_df.rename(columns = { 'genhlth': 'general_health',
                           'physhlth': 'physical_health',
                           'menthlth': 'mental_health',
                           'poorhlth' : 'poor_health30',
                           'hlthpln1': 'health_insurance',
                           'persdoc2':'health_care_doctors',
                           'medcost':'no_doc_bc_cost',
                           'checkup1':'last_checkup',
                           'exerany2':'excercise_30',
                           'sleptim1':'sleep',
                           'cvdstrk3':'stroke',
                           'asthma3':'asthma',
                           'chcscncr':'skin_cancer',
                           'chcocncr':'other_cancer',
                           'chccopd2':'copd_type_issue',
                           'havarth4':'arthritis_anyform',
                           'addepev3':'depressive_disorder',
                           'chckdny2':'kidney_disease',
                           'diabete4':'diabetes',
                           'educa':'education_lvl',
                           'renthom1':'rent_own',
                           'employ1':'employment_status',
                           'income2':'income_level',
                           'weight2':'weight_lbs',
                           'height3':'height_ftandinch',
                           'diffwalk':'difficulty_walking',
                           'smoke100':'smoke100_lifetime',
                           'usenow3':'smokeless_tobacco_products',
                           'alcday5':'alcohol_consumption_30',
                           'hivrisk5':'high_risk_situations',
                           'cvdcrhd4':'coronary_heart_disease',
                           'cvdinfr4':'heart_attack',
                          }, inplace = True)

In [None]:
heart_df

### Cleaning our target

Because our `heart_disease` column was created on the condition that someone responded yes to having had either a heart attack or coronary heart disease, every row is filled in. In reality, there are some nulls in both the heart attack and coronary heart disease columns.
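The effect described above can be reproduced on a small toy frame. This is only an illustrative sketch: the `toy` dataframe and the `needs_cleaning` column are hypothetical and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two renamed BRFSS columns
# (1 = yes, 2 = no, 7 = don't know/not sure, 9 = refused)
toy = pd.DataFrame({
    "heart_attack": [1.0, 2.0, np.nan, 2.0, 7.0],
    "coronary_heart_disease": [2.0, 2.0, np.nan, 1.0, 2.0],
})

# np.select falls back to its default (0) when no condition is True,
# so rows with missing or unclear answers are silently labelled 0
conditions = [toy.coronary_heart_disease == 1, toy.heart_attack == 1]
toy["heart_disease"] = np.select(conditions, [1, 1], default=0)

# Flag rows where either answer is not a clear yes (1) or no (2)
clear_answer = toy[["heart_attack", "coronary_heart_disease"]].isin([1, 2]).all(axis=1)
toy["needs_cleaning"] = ~clear_answer

print(toy)
```

The third and fifth rows receive a target of 0 even though the answers are missing or unsure, which is why those rows have to be handled before modeling.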

In [None]:
heart_df.heart_attack.value_counts()

In [None]:
heart_df.heart_attack.isna().sum()

In [None]:
#dropping the nulls
heart_df.dropna(subset= ['heart_attack', 'coronary_heart_disease'], inplace = True)

In [None]:
heart_df.heart_attack.isna().sum()

In [None]:
heart_df.coronary_heart_disease.value_counts()

In [None]:
heart_df.coronary_heart_disease.isna().sum()

Because these columns were used to create our target and are a bit more detailed, we will remove the individuals who refused to answer or were unsure. We want our target `heart_disease` to reflect a clear yes or no answer to both questions.

In [None]:
#subsetting where they responded with either 1, 2
heart_df = heart_df[(heart_df['heart_attack'] != 7.0) & (heart_df['heart_attack'] != 9.0)]

In [None]:
heart_df.heart_attack.value_counts()

In [None]:
#subsetting where they responded with either 1, 2
heart_df = heart_df[(heart_df['coronary_heart_disease'] != 7.0) & (heart_df['coronary_heart_disease'] != 9.0)]

In [None]:
heart_df.coronary_heart_disease.value_counts()

In [None]:
heart_df

In [None]:
heart_df.heart_disease.value_counts(normalize = True)

We have a heavy class imbalance: 92% of the individuals had neither coronary heart disease nor a heart attack, while 8% had at least one. As such, when we model, we will need to either oversample the minority class with SMOTE or undersample the majority class.
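As a rough illustration of the undersampling option, the sketch below balances a toy frame using only pandas. The helper name `undersample_majority` and the toy data are hypothetical; in practice the resampling would be applied to the training split only, and a library such as imbalanced-learn could be used instead.

```python
import pandas as pd

def undersample_majority(frame, target="heart_disease", random_state=42):
    """Randomly shrink every class down to the size of the smallest class."""
    minority_size = int(frame[target].value_counts().min())
    parts = [
        group.sample(n=minority_size, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)

# Toy data with the same 92% / 8% split as the notebook's target
toy = pd.DataFrame({"heart_disease": [0] * 92 + [1] * 8, "sleep": range(100)})
balanced = undersample_majority(toy)
print(balanced.heart_disease.value_counts())
```

Undersampling discards most of the majority rows, so it trades information for balance; SMOTE keeps all rows but adds synthetic minority examples.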

### Looking at our columns

In [None]:
heart_df.info()

In [None]:
heart_df.general_health.value_counts()

In [None]:
heart_df.poor_health30.value_counts()

In [None]:
heart_df.poor_health30.isna().sum()

Close to half of the individuals were not asked this question or their answer is missing. Of those who responded, the majority fall into the none category. We will drop this column instead of imputing it as 'not applicable', since ~200k missing values is close to half of our dataset and the imputed label would not be a continuous value.
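The share of missing values per column can be checked in one pass, which makes the 50% cutoff explicit. This is a minimal sketch on made-up data; the helper name `missing_share` and the 0.5 threshold variable are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def missing_share(frame):
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)

# Toy data: half of poor_health30 is missing, a quarter of sleep is missing
toy = pd.DataFrame({
    "poor_health30": [np.nan, np.nan, 88.0, 3.0],
    "sleep": [7.0, 8.0, 6.0, np.nan],
})

share = missing_share(toy)
threshold = 0.5  # assumed cutoff, matching the "half of the dataset" reasoning
too_sparse = share[share >= threshold].index.tolist()
print(share)
print(too_sparse)
```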

In [None]:
heart_df.drop(columns= ['poor_health30'], inplace = True)

In [None]:
heart_df.head()

In [None]:
heart_df.physical_health.value_counts()

In [None]:
heart_df.physical_health.isna().sum()

A lot of people stated none (88) for `physhlth`. We will need to recode 88 to 0 so that, if any scaling is done, the value 88 does not distort the scale. In addition, 0 makes sense because these individuals had 0 days of bad physical health out of 30.

In [None]:
heart_df.mental_health.value_counts()

In [None]:
heart_df.alcohol_consumption_30.value_counts()

The 1__ and 2__ codes mean two different things. Values in the format 1__ give how many days in a WEEK the individual drinks, while 2__ values give how many days in a MONTH. As such, the coding of these values needs to be cleaned before we can use this feature in our modeling.

In [None]:
#working on fixing the alcohol. Changing all the weekly drinks to monthly

# 101- 107 convert to days in a month

#201 - 230 remove the 2 in front

#filter by 1

heart_df.alcohol_consumption_30 = heart_df.alcohol_consumption_30.astype(str)

In [None]:
heart_df.alcohol_consumption_30

In [None]:
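#subsetting to just the weekly drinking responses (codes 101 - 107)
#.copy() avoids SettingWithCopyWarning when the column is reassigned below
heart_week = heart_df[heart_df.alcohol_consumption_30.str.startswith('1')].copy()

#removing only the leading "10" so that the number of days per week remains
heart_week.alcohol_consumption_30  = heart_week.alcohol_consumption_30.str.replace("10", "", n=1)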

In [None]:
heart_week.alcohol_consumption_30.value_counts()

In [None]:
heart_week.alcohol_consumption_30 = heart_week.alcohol_consumption_30.astype(float)

In [None]:
#calculating how much these individual consumed in a month
heart_week.alcohol_consumption_30 = [x*4.3 for x in heart_week.alcohol_consumption_30]

In [None]:
#checking to verify that math was applied correctly
heart_week.alcohol_consumption_30.value_counts()

In [None]:
#subset the months drinking
heart_month = heart_df[heart_df.alcohol_consumption_30.str.startswith('2')].copy()

In [None]:
#converting to string
heart_month.alcohol_consumption_30.astype(str)

In [None]:
#removing the 2 from the monthly
heart_month.alcohol_consumption_30  = heart_month.alcohol_consumption_30.str.replace("2","", 1)

In [None]:
heart_month.alcohol_consumption_30.value_counts()

In [None]:
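#astype returns a new Series, so the result has to be assigned back
heart_month.alcohol_consumption_30 = heart_month.alcohol_consumption_30.astype(float)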

In [None]:
heart_month.alcohol_consumption_30.value_counts()
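The weekly and monthly conversions can also be done numerically in a single pass, which keeps the result in one column. The sketch below is an alternative, not the notebook's method. It assumes that 888 codes for no drinks in the past 30 days and that 777 and 999 are don't know and refused (worth confirming against the code book), and it caps weekly values at 30 days because 7 × 4.3 slightly exceeds 30. The function name `alcohol_days_per_month` is hypothetical.

```python
import numpy as np
import pandas as pd

def alcohol_days_per_month(codes, weeks_per_month=4.3):
    """Convert ALCDAY5-style codes to drinking days per 30 days."""
    codes = pd.to_numeric(codes, errors="coerce")
    days = pd.Series(np.nan, index=codes.index, dtype=float)

    weekly = codes.between(101, 107)   # 1__ : days per week
    monthly = codes.between(201, 230)  # 2__ : days in the past 30 days

    # Scale weekly answers to a month and cap at 30 (assumption)
    days.loc[weekly] = ((codes[weekly] - 100) * weeks_per_month).clip(upper=30)
    days.loc[monthly] = codes[monthly] - 200
    days.loc[codes == 888] = 0         # assumed code for "no drinks"
    # Everything else (777, 999, missing) stays NaN for later imputation
    return days

toy_codes = pd.Series([101.0, 107.0, 215.0, 888.0, 777.0, np.nan])
converted = alcohol_days_per_month(toy_codes)
print(converted)
```

Applied to the notebook's data, this would be called on `heart_df.alcohol_consumption_30` before it is cast to a string.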

In [None]:
heart_df.weight_lbs.value_counts()

Values between 50 and 0776 are weights in lbs. Values that start with 9___ are weights in kilograms. 9999 = refused, 7777 = not sure.
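One possible way to bring both encodings onto a single scale is sketched below. It assumes that a kilogram value is the code minus 9000 and treats 7777, 9999, and anything outside the two ranges as missing; the exact kilogram range should be checked against the code book. The function name `weight_in_lbs` is hypothetical.

```python
import numpy as np
import pandas as pd

KG_TO_LBS = 2.20462

def weight_in_lbs(codes):
    """Convert WEIGHT2-style codes to pounds; unknown codes become NaN."""
    codes = pd.to_numeric(codes, errors="coerce")
    lbs = pd.Series(np.nan, index=codes.index, dtype=float)

    pounds = codes.between(50, 776)     # already reported in pounds
    kilos = codes.between(9001, 9998)   # assumed range for 9___ kilogram codes

    lbs.loc[pounds] = codes[pounds]
    lbs.loc[kilos] = (codes[kilos] - 9000) * KG_TO_LBS
    return lbs

toy_weights = pd.Series([150.0, 9070.0, 7777.0, 9999.0, np.nan])
weights = weight_in_lbs(toy_weights)
print(weights)
```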

**Note** Most of the features use 88.0, or some variant of it, to code for none instead of coding it as 0. As such, we will need to go into each column and recode it. I am recoding 88 to 0 so that, if any scaling is done, the value 88 does not distort the scale. A value of 77 means the respondent did not know or remember, and 99 means they refused to answer the question. We will need to decide what type of imputation to use for these values or whether to drop them.

In [None]:
#recoding using all features that contains 88.0, 7.0 and 9.0

#def recoding(dataframe):  
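A recoding helper along these lines might look like the sketch below. The function name `recode_columns`, the two mapping dictionaries, and the toy data are assumptions for illustration. Refused and unknown answers are turned into NaN here so that the imputation decision can be made later.

```python
import numpy as np
import pandas as pd

# Assumed code maps: day-count questions use 88/77/99, yes/no questions use 7/9
DAY_COUNT_CODES = {88: 0, 77: np.nan, 99: np.nan}
YES_NO_CODES = {7: np.nan, 9: np.nan}

def recode_columns(frame, columns, mapping):
    """Return a copy of frame with the mapping applied to the given columns only."""
    out = frame.copy()
    out[columns] = out[columns].replace(mapping)
    return out

toy = pd.DataFrame({
    "physical_health": [88.0, 5.0, 77.0],
    "mental_health": [30.0, 88.0, 99.0],
    "stroke": [2.0, 1.0, 7.0],
})

recoded = recode_columns(toy, ["physical_health", "mental_health"], DAY_COUNT_CODES)
recoded = recode_columns(recoded, ["stroke"], YES_NO_CODES)
print(recoded)
```

Passing the columns explicitly matters because the same number can mean different things in different questions: 7 is a valid day count in `physical_health` but means don't know in a yes/no column.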

In [None]:
heart_df

# Next Steps

- Figuring out a way to incorporate all of the other types of heart conditions that fall under cardiovascular disease. This project only looks at heart attack and coronary heart disease, while the true scale of the disease extends to high blood pressure, congenital heart disease, etc.