# Predicting Heart Disease


Author: Xiaohua Su

Date: May 17th, 2022

# Overview

As of 2020, heart disease is the leading cause of death in the US, claiming close to 700,000 lives that year. It is the leading cause of death regardless of gender and for most racial/ethnic groups. The disease can lead to early death, increased medical visits, and lost productivity in the economy, so it is important to address it. This project aims to build a predictive model for heart disease. Predicting whether a patient has heart disease can be used in hospitals to flag patients so that doctors can discuss ways to manage the disease, prevent early death, and potentially slow or mitigate its progression.

# Business Problem

With how prevalent heart disease is in the nation, it is important for doctors to discuss early prevention with their patients. To do this, doctors need to know more about a patient’s history in order to diagnose heart disease, potentially requiring blood work as well. The results of the blood work usually arrive after the patient has already left the doctor’s office. Calls are then made to discuss the results, and potential follow-up appointments are scheduled.

Our model aims to predict whether a patient who comes into a doctor’s office or hospital has heart disease. With that prediction, we can flag the patient for the doctor electronically. Instead of waiting for a phone call, which may not even be between the patient and the doctor, the conversation about managing heart disease can begin right away. The flag can prompt a discussion of early prevention steps and can guide the doctor toward questions for further verification and testing.

# Data

The data was taken from the [CDC's 2020 Behavioral Risk Factor Surveillance System](https://www.cdc.gov/brfss/annual_data/annual_2020.html) (BRFSS). Because the data is large, it was not uploaded to GitHub, but it can be found at the link above under the data files section.

It is survey data collected between 2020 and 2021 by the CDC to monitor people's health behaviors, chronic health conditions, and use of services to help manage their disease. The data contains information about the individual, such as `race` and `gender`, that we will not use, to avoid these biases in our models. The data does not have a column specifically called heart disease; instead it has two columns, `cvdinfr4` and `cvdcrhd4`, which record whether the individual was ever told they had a heart attack and whether they were ever told they had coronary heart disease. Both questions get at the issue of heart disease, so a new target column was created from them.
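The notebook loads an already-prepared `heart_df.csv`, so the exact rule used to build the target is not shown. As a rough sketch of how the two questions could be combined, the snippet below assumes the usual BRFSS coding (1 = yes, 2 = no, 7 = don't know, 9 = refused). The sample rows, the helper name `make_heart_disease_target`, and the handling of uncertain answers are all assumptions for illustration, not the author's preprocessing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the two BRFSS columns (assumed codes: 1 = yes, 2 = no,
# 7 = don't know, 9 = refused, NaN = missing).
raw = pd.DataFrame({
    "cvdinfr4": [1.0, 2.0, 2.0, 7.0, np.nan, 2.0],
    "cvdcrhd4": [2.0, 1.0, 2.0, 2.0, np.nan, 9.0],
})


def make_heart_disease_target(df: pd.DataFrame) -> pd.Series:
    """Return 1.0 if either question was answered 'yes', 0.0 if at least one
    was answered 'no' and none 'yes', and NaN when neither answer is definite."""
    cols = ["cvdinfr4", "cvdcrhd4"]
    said_yes = df[cols].eq(1).any(axis=1)
    said_no = df[cols].eq(2).any(axis=1)

    target = pd.Series(np.nan, index=df.index, name="heart_disease")
    target[said_no] = 0.0
    target[said_yes] = 1.0  # a 'yes' on either question overrides a 'no'
    return target


raw["heart_disease"] = make_heart_disease_target(raw)
print(raw)
```

Rows left as NaN would then need to be dropped or handled separately before modeling; which choice is appropriate depends on how the survey's "don't know" and "refused" answers should be interpreted.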

# Imports

***The neural network modeling was performed with TensorFlow. For it to work properly you must have TensorFlow 2.5 or above. Please use the provided yml file to create the environment on Windows, as it pins TensorFlow 2.8.0. Unfortunately, I do not have access to a Mac, so a Mac yml file is not provided. There are also known issues with more recent versions of TensorFlow on M1 chips, so I highly recommend running this notebook on Windows or on a cloud-based service such as Google Colab.***

In [26]:
# Data handling and plotting
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# scikit-learn: imputation, pipelines, splitting, and evaluation
from sklearn.experimental import enable_iterative_imputer  # must be imported before IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

# TensorFlow / Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Model, layers, regularizers
from tensorflow.keras.layers import Dense, Input, Normalization, IntegerLookup, CategoryEncoding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.metrics import BinaryAccuracy, Precision, Recall, AUC

# Local helper module (get_feature_names, used below, is expected to come from here)
from get_features import *

In [2]:
seed = 7
np.random.seed(seed)

In [3]:
print(tf.__version__)

2.8.0


In [4]:
heart = pd.read_csv('./Data/heart_df.csv')

In [5]:
heart

Unnamed: 0.1,Unnamed: 0,state,general_health,physical_health,mental_health,health_insurance,health_care_doctors,no_doc_bc_cost,last_checkup,excercise_30,...,income_level,weight_kg,height_m,difficulty_walking,smoke100_lifetime,smokeless_tobacco_products,alcohol_consumption_30,high_risk_situations,ecigaret,heart_disease
0,0,1.0,2.0,3.0,30.0,2.0,3.0,1.0,4.0,1.0,...,1.0,48.0,170.0,2.0,1.0,3.0,0.0,2.0,1.0,0.0
1,1,1.0,3.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,...,,,163.0,2.0,,,,,,0.0
2,2,1.0,3.0,0.0,0.0,1.0,1.0,2.0,1.0,1.0,...,7.0,,173.0,2.0,2.0,3.0,0.0,2.0,2.0,0.0
3,3,1.0,1.0,0.0,0.0,1.0,3.0,2.0,2.0,2.0,...,,,,2.0,2.0,3.0,0.0,2.0,2.0,0.0
4,4,1.0,2.0,0.0,0.0,1.0,1.0,2.0,1.0,1.0,...,,57.0,168.0,2.0,2.0,3.0,0.0,2.0,2.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397142,401953,72.0,3.0,0.0,0.0,2.0,1.0,2.0,2.0,1.0,...,,55.0,150.0,2.0,2.0,3.0,0.0,2.0,,0.0
397143,401954,72.0,3.0,0.0,0.0,1.0,1.0,2.0,3.0,2.0,...,4.0,76.0,152.0,2.0,2.0,3.0,0.0,2.0,,0.0
397144,401955,72.0,3.0,0.0,0.0,1.0,1.0,2.0,2.0,1.0,...,1.0,72.0,124.0,2.0,2.0,3.0,0.0,2.0,,0.0
397145,401956,72.0,3.0,0.0,0.0,1.0,1.0,2.0,1.0,1.0,...,,80.0,173.0,2.0,7.0,3.0,4.0,2.0,,0.0


In [6]:
heart.drop(columns = 'Unnamed: 0', inplace = True)

In [7]:
heart.drop(columns = ['education_lvl', 'income_level', 'employment_status', 'rent_own', 'health_care_doctors','no_doc_bc_cost', 'smokeless_tobacco_products', 'high_risk_situations', 'ecigaret', 'state'], inplace = True)

In [8]:
heart.dtypes

general_health            float64
physical_health           float64
mental_health             float64
health_insurance          float64
last_checkup              float64
excercise_30              float64
sleep                     float64
stroke                    float64
asthma                    float64
skin_cancer               float64
other_cancer              float64
copd_type_issue           float64
arthritis_anyform         float64
depressive_disorder       float64
kidney_disease            float64
diabetes                  float64
weight_kg                 float64
height_m                  float64
difficulty_walking        float64
smoke100_lifetime         float64
alcohol_consumption_30    float64
heart_disease             float64
dtype: object

# Train-test-validation split

In [9]:
heart = heart.head(1000)

In [10]:
heart.isna().sum()

general_health              1
physical_health            32
mental_health              15
health_insurance            2
last_checkup                6
excercise_30                0
sleep                      19
stroke                      2
asthma                      2
skin_cancer                 2
other_cancer                2
copd_type_issue             5
arthritis_anyform           3
depressive_disorder         4
kidney_disease              8
diabetes                    1
weight_kg                 108
height_m                   39
difficulty_walking         39
smoke100_lifetime          38
alcohol_consumption_30     50
heart_disease               0
dtype: int64

In [58]:
# heart.fillna(0)

# Preprocessing

In [11]:
X = heart.drop(columns='heart_disease')
y = heart.heart_disease

In [14]:
X_train_original, X_test, y_train_original, y_test = train_test_split(X,y, stratify=y ,random_state = 42)

In [17]:
X_train, X_val, y_train, y_val = train_test_split(X_train_original,y_train_original, stratify=y_train_original, random_state=42)
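Neither call above sets `test_size`, so scikit-learn's default of 0.25 applies twice. On the 1,000-row subset that gives 750/250 after the first split and 562/188 after the second, which matches the 562 rows reported by `train.info()` further down. The sketch below checks this arithmetic on synthetic data; the array names and the 10% positive rate are assumptions made only for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 1,000-row subset (10% positives is an assumption).
rng = np.random.default_rng(7)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:100] = 1

# First split: hold out the test set (default test_size=0.25).
X_rest_toy, X_test_toy, y_rest_toy, y_test_toy = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42)

# Second split: carve a validation set out of what remains.
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest_toy, y_rest_toy, stratify=y_rest_toy, random_state=42)

sizes = (len(X_train_toy), len(X_val_toy), len(X_test_toy))
print(sizes)  # (562, 188, 250)

# Stratification keeps the positive rate roughly equal across the three sets.
print(y_train_toy.mean(), y_val_toy.mean(), y_test_toy.mean())
```

The overall proportions are therefore about 56% train, 19% validation, and 25% test. Passing `test_size` explicitly would make that choice visible in the code.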

In [21]:
# Lists of column names that need to be either scaled (continuous) or one-hot encoded (categorical)
continous = ['physical_health', 'mental_health', 'last_checkup' , 'excercise_30', 'sleep', 'weight_kg',
             'height_m', 'alcohol_consumption_30']

categorical = list(X_train.columns.drop(continous))

In [22]:
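# Categorical codes: impute each feature with a random forest classifier
cat_pipe = Pipeline(steps=[
    ('cat_impute', IterativeImputer(estimator=RandomForestClassifier(),
                                    random_state=42, max_iter=5)),
])
# Continuous columns: impute with IterativeImputer's default estimator
scale_pipe = Pipeline(steps=[('scale_impute', IterativeImputer(random_state=42))])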

In [23]:
ct = ColumnTransformer(transformers=[('scale', scale_pipe, continous), ('cat', cat_pipe, categorical)]).fit(X_train)

In [24]:
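def df_imputation_transformer(dataframe):
    """Impute missing values with the fitted ColumnTransformer `ct` and
    return a DataFrame that carries the original column names."""
    dataframe = ct.transform(dataframe)
    dataframe = pd.DataFrame(dataframe, columns=get_feature_names(ct))
    # Strip the transformer prefixes that ColumnTransformer adds to each name
    dataframe.columns = [name.strip().replace("cat__", '').replace("scale__", '') for name in dataframe.columns]
    return dataframe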
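A `ColumnTransformer` returns a plain array whose columns follow the order of its transformers, not the order of the input frame. That is why the imputed frames list the continuous columns first and the categorical ones after. The sketch below illustrates this on a tiny hypothetical frame; `SimpleImputer` stands in for `IterativeImputer` only to keep the example fast, and the names `demo`, `demo_ct`, `continuous_cols`, and `categorical_cols` are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Tiny hypothetical frame; column names are borrowed from the notebook.
demo = pd.DataFrame({
    "stroke": [2.0, np.nan, 2.0, 1.0],
    "sleep": [6.0, 7.0, np.nan, 8.0],
    "asthma": [2.0, 2.0, np.nan, 2.0],
    "weight_kg": [81.0, np.nan, 68.0, 73.0],
})
continuous_cols = ["sleep", "weight_kg"]
categorical_cols = ["stroke", "asthma"]

# SimpleImputer is a stand-in here (assumption), not the notebook's imputer.
demo_ct = ColumnTransformer(transformers=[
    ("scale", SimpleImputer(strategy="median"), continuous_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
]).fit(demo)

# Output columns follow the transformer order: continuous first, then categorical.
imputed = pd.DataFrame(demo_ct.transform(demo),
                       columns=continuous_cols + categorical_cols)
print(imputed)
```

Rebuilding the names from the column lists like this only works when every transformer keeps one output column per input column, which holds for this sketch but would not hold after one-hot encoding.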

In [27]:
train = df_imputation_transformer(X_train)
val = df_imputation_transformer(X_val)
test = df_imputation_transformer(X_test)



In [28]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 562 entries, 0 to 561
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   physical_health         562 non-null    float64
 1   mental_health           562 non-null    float64
 2   last_checkup            562 non-null    float64
 3   excercise_30            562 non-null    float64
 4   sleep                   562 non-null    float64
 5   weight_kg               562 non-null    float64
 6   height_m                562 non-null    float64
 7   alcohol_consumption_30  562 non-null    float64
 8   general_health          562 non-null    float64
 9   health_insurance        562 non-null    float64
 10  stroke                  562 non-null    float64
 11  asthma                  562 non-null    float64
 12  skin_cancer             562 non-null    float64
 13  other_cancer            562 non-null    float64
 14  copd_type_issue         562 non-null    fl

In [29]:
train.isna().sum().sum()

0

In [31]:
continous_train = train[continous]
continous_train

Unnamed: 0,physical_health,mental_health,last_checkup,excercise_30,sleep,weight_kg,height_m,alcohol_consumption_30
0,0.0,0.0,2.0,2.0,6.0,81.496091,167.518221,0.0
1,2.0,0.0,1.0,1.0,7.0,91.000000,183.000000,0.0
2,0.0,30.0,1.0,2.0,5.0,68.000000,152.000000,0.0
3,0.0,0.0,1.0,1.0,8.0,73.000000,165.000000,0.0
4,0.0,0.0,1.0,1.0,7.0,100.000000,170.000000,0.0
...,...,...,...,...,...,...,...,...
557,0.0,5.0,2.0,1.0,6.0,73.000000,155.000000,0.0
558,0.0,0.0,2.0,1.0,6.0,75.000000,175.000000,0.0
559,0.0,0.0,1.0,2.0,8.0,95.000000,168.000000,0.0
560,0.0,0.0,1.0,1.0,7.0,71.000000,170.000000,0.0


In [36]:
continous_val = val[continous]

In [32]:
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(continous_train)

In [38]:
METRICS = [
    BinaryAccuracy(name='accuracy'),
    Precision(name='precision'),
    Recall(name='recall'),
    AUC(name='auc'),
    AUC(name='prc', curve='PR'), # precision-recall curve
]

In [55]:
def get_basic_model():
    model = tf.keras.Sequential([
        normalizer,
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(10, activation='relu'),
        # No activation: this layer emits raw logits, which can be negative
        tf.keras.layers.Dense(1)
    ])
    model.compile(optimizer='adam',
                  # BinaryCrossentropy defaults to from_logits=False, i.e. it expects probabilities
                  loss=tf.keras.losses.BinaryCrossentropy(),
                  metrics=[tf.keras.metrics.Recall()])
    return model

In [56]:
model = get_basic_model()
model.fit(continous_train, y_train, validation_data= (continous_val, y_val) ,epochs=15, batch_size=50)

Epoch 1/15


InvalidArgumentError: Graph execution error:

Detected at node 'assert_greater_equal/Assert/AssertGuard/Assert' defined at (most recent call last):
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
      app.launch_new_instance()
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\traitlets\config\application.py", line 845, in launch_instance
      app.start()
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\ipykernel\kernelapp.py", line 612, in start
      self.io_loop.start()
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\tornado\platform\asyncio.py", line 149, in start
      self.asyncio_loop.run_forever()
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\asyncio\base_events.py", line 570, in run_forever
      self._run_once()
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\asyncio\base_events.py", line 1859, in _run_once
      handle._run()
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\asyncio\events.py", line 81, in _run
      self._context.run(self._callback, *self._args)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\tornado\ioloop.py", line 690, in <lambda>
      lambda f: self._run_callback(functools.partial(callback, future))
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\tornado\ioloop.py", line 743, in _run_callback
      ret = callback()
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\tornado\gen.py", line 787, in inner
      self.run()
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\tornado\gen.py", line 748, in run
      yielded = self.gen.send(value)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\ipykernel\kernelbase.py", line 365, in process_one
      yield gen.maybe_future(dispatch(*args))
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\tornado\gen.py", line 209, in wrapper
      yielded = next(result)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\ipykernel\kernelbase.py", line 268, in dispatch_shell
      yield gen.maybe_future(handler(stream, idents, msg))
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\tornado\gen.py", line 209, in wrapper
      yielded = next(result)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\ipykernel\kernelbase.py", line 543, in execute_request
      self.do_execute(
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\tornado\gen.py", line 209, in wrapper
      yielded = next(result)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\ipykernel\ipkernel.py", line 306, in do_execute
      res = shell.run_cell(code, store_history=store_history, silent=silent)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\ipykernel\zmqshell.py", line 536, in run_cell
      return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\IPython\core\interactiveshell.py", line 2876, in run_cell
      result = self._run_cell(
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\IPython\core\interactiveshell.py", line 2922, in _run_cell
      return runner(coro)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\IPython\core\async_helpers.py", line 68, in _pseudo_sync_runner
      coro.send(None)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\IPython\core\interactiveshell.py", line 3145, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\IPython\core\interactiveshell.py", line 3337, in run_ast_nodes
      if (await self.run_code(code, result,  async_=asy)):
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\IPython\core\interactiveshell.py", line 3417, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "<ipython-input-56-e6de44335bc2>", line 2, in <module>
      model.fit(continous_train, y_train, validation_data= (continous_val, y_val) ,epochs=15, batch_size=50)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\keras\utils\traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\keras\engine\training.py", line 1384, in fit
      tmp_logs = self.train_function(iterator)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\keras\engine\training.py", line 1021, in train_function
      return step_function(self, iterator)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\keras\engine\training.py", line 1010, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\keras\engine\training.py", line 1000, in run_step
      outputs = model.train_step(data)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\keras\engine\training.py", line 864, in train_step
      return self.compute_metrics(x, y, y_pred, sample_weight)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\keras\engine\training.py", line 957, in compute_metrics
      self.compiled_metrics.update_state(y, y_pred, sample_weight)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\keras\engine\compile_utils.py", line 459, in update_state
      metric_obj.update_state(y_t, y_p, sample_weight=mask)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\keras\utils\metrics_utils.py", line 70, in decorated
      update_op = update_state_fn(*args, **kwargs)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\keras\metrics.py", line 178, in update_state_fn
      return ag_update_state(*args, **kwargs)
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\keras\metrics.py", line 1533, in update_state
      return metrics_utils.update_confusion_matrix_variables(
    File "C:\Users\xiaoh\anaconda3\envs\learn-env\lib\site-packages\keras\utils\metrics_utils.py", line 602, in update_confusion_matrix_variables
      tf.compat.v1.assert_greater_equal(
Node: 'assert_greater_equal/Assert/AssertGuard/Assert'
assertion failed: [predictions must be >= 0] [Condition x >= y did not hold element-wise:] [x (sequential_11/dense_35/BiasAdd:0) = ] [[-0.0859924629][-0.39077425][0.116408363]...] [y (Cast_2/x:0) = ] [0]
	 [[{{node assert_greater_equal/Assert/AssertGuard/Assert}}]] [Op:__inference_train_function_15813]
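The last line of the traceback points at the cause: the `Recall` metric asserts that predictions are `>= 0`, but the final `Dense(1)` layer has no activation, so it emits raw values such as `-0.0859`. One possible fix, sketched below on synthetic data, is to give the output layer a sigmoid activation so that predictions fall in [0, 1], which is also what `BinaryCrossentropy()` expects by default. The data, the names `X_syn`, `y_syn`, `demo_normalizer`, and `get_sigmoid_model`, and the epoch count are assumptions for the illustration, not the notebook's final model.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data; the shape and positive rate are assumptions.
rng = np.random.default_rng(7)
X_syn = rng.normal(size=(200, 8)).astype("float32")
y_syn = (rng.random(200) < 0.1).astype("float32")

demo_normalizer = tf.keras.layers.Normalization(axis=-1)
demo_normalizer.adapt(X_syn)


def get_sigmoid_model():
    model = tf.keras.Sequential([
        demo_normalizer,
        tf.keras.layers.Dense(10, activation="relu"),
        tf.keras.layers.Dense(10, activation="relu"),
        # Sigmoid keeps every prediction in [0, 1], as the Recall metric requires.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.BinaryCrossentropy(),  # from_logits=False by default
                  metrics=[tf.keras.metrics.Recall(name="recall")])
    return model


demo_model = get_sigmoid_model()
demo_model.fit(X_syn, y_syn, epochs=2, batch_size=50, verbose=0)

preds = demo_model.predict(X_syn, verbose=0)
print(preds.min(), preds.max())  # both values lie between 0 and 1
```

With probabilities as outputs, the threshold-based metrics in the `METRICS` list defined earlier (precision, recall, AUC) can also be passed to `compile` without triggering the assertion.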

# Next Steps

The next step for this project would be to further refine our target. This project only looks at heart attack and coronary artery disease (CAD). These are only some of the conditions that fall under heart disease, which also encompasses conditions such as high blood pressure and congenital heart disease. We would therefore have to refine the questions being asked of individuals.

The model itself also needs more refinement. Due to the computational limitations of my system and the computation time required, I was not able to run as many grid searches as I would have liked to fine-tune the model. We could also refine the model on data from patients' information forms and the diagnoses given by their doctors, to improve the flagging of individuals with such a condition so that their primary doctor knows to discuss it with the patient.

Build a better app. The app created was for demonstration purposes. Preferably, the app would be further improved to take in a picture of the form filled out by the patient, pick out the data from the image, and input it into our model.