<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# DSI-SG-42 Project 4:
## In A Heartbeat: Prediction of Heart Disease Risk for Early Detection
---

## Introduction of Problem Statement

Heart disease remains the leading cause of death in the US, a statistic that has persisted for over a century.  Many of the contributing factors are controllable and detectable through early intervention. At Teladoc, a medical teleconsulting company, we are committed to empowering patients with knowledge and tools for proactive health management.

This project focuses on developing a heart disease risk prediction model. Our goal is to raise awareness of this prevalent condition and encourage early detection, which plays a crucial role in risk reduction.  This model will be presented to our company leadership for consideration as a valuable addition to our teleconsulting platform.

## Key Question

#### *How can we develop and integrate a data-driven feature that provides Teladoc end-users with a rigorous prediction of their risk for heart disease to facilitate early detection?*

The data is downloaded from the [Centers for Disease Control and Prevention (CDC) webpage](https://www.cdc.gov/brfss/annual_data/annual_2022.html), which hosts the raw 2022 Behavioral Risk Factor Surveillance System (BRFSS) data collected through a telephone survey.

This notebook imports the dataset, which comprises 445,132 rows and 326 variables and takes up close to 500 MB of storage, and narrows it down to a select few variables for analysis.


Content:


[1. Import of Data](01_Data_Import_and_Cleaning.ipynb)  
[2A. Exploratory Data Analysis - Data Visualizations](02A_EDA_and_Data_Visualization.ipynb)  
[2B. Exploratory Data Analysis - Analysis on Missing Values](02B_EDA_MissingValues.ipynb)  
[2C. Exploratory Data Analysis - Before and After Imputation](02C_EDA_Before_and_After_Imputation.ipynb)  
[3. Supervised Learning](03_Modeling.ipynb)  

## 1. Data Import and Filter

In [None]:
# import libraries
import pandas as pd
import numpy as np

# setting displays
pd.set_option('display.width', 100000)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [None]:
# import full dataframe
df = pd.read_csv('../data/LLCP2022.csv')
# Unable to upload the above csv file into GitHub as it exceeds the maximum file size
# Please download the dataset from https://www.cdc.gov/brfss/annual_data/annual_2022.html

print(df.head()) # inspect first 5 rows of df

   _STATE  FMONTH        IDATE IMONTH   IDAY    IYEAR  DISPCODE          SEQNO          _PSU  CTELENM1  PVTRESD1  COLGHOUS  STATERE1  CELPHON1  LADULT1  COLGSEX1  NUMADULT  LANDSEX1  NUMMEN  NUMWOMEN  RESPSLCT  SAFETIME  CTELNUM1  CELLFON5  CADULT1  CELLSEX1  PVTRESD3  CCLGHOUS  CSTATE1  LANDLINE  HHADULT  SEXVAR  GENHLTH  PHYSHLTH  MENTHLTH  POORHLTH  PRIMINSR  PERSDOC3  MEDCOST1  CHECKUP1  EXERANY2  SLEPTIM1  LASTDEN4  RMVTETH4  CVDINFR4  CVDCRHD4  CVDSTRK3  ASTHMA3  ASTHNOW  CHCSCNC1  CHCOCNC1  CHCCOPD3  ADDEPEV3  CHCKDNY2  HAVARTH4  DIABETE4  DIABAGE4  MARITAL  EDUCA  RENTHOM1  NUMHHOL4  NUMPHON4  CPDEMO1C  VETERAN3  EMPLOY1  CHILDREN  INCOME3  PREGNANT  WEIGHT2  HEIGHT3  DEAF  BLIND  DECIDE  DIFFWALK  DIFFDRES  DIFFALON  HADMAM  HOWLONG  CERVSCRN  CRVCLCNC  CRVCLPAP  CRVCLHPV  HADHYST2  HADSIGM4  COLNSIGM  COLNTES1  SIGMTES1  LASTSIG4  COLNCNCR  VIRCOLO1  VCLNTES2  SMALSTOL  STOLTEST  STOOLDN2  BLDSTFIT  SDNATES1  SMOKE100  SMOKDAY2  USENOW3  ECIGNOW2  LCSFIRST  LCSLAST  LCSNUMCG 
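
Because only a few dozen of the columns are used downstream, one optional way to reduce memory use is to load just those columns with `usecols`. The sketch below uses a tiny in-memory CSV as a stand-in for `LLCP2022.csv`; the `wanted` list is a hypothetical illustration, not the notebook's final selection.

```python
import io
import pandas as pd

# Toy stand-in for LLCP2022.csv (assumption: the real file uses the same header style)
toy_csv = io.StringIO(
    "_STATE,SMOKE100,_AGE80,HTM4\n"
    "1,2,80,160\n"
    "1,1,56,172\n"
)

# Hypothetical subset of raw column names to keep
wanted = ["SMOKE100", "_AGE80"]

# usecols accepts a callable, which lets us tolerate stray whitespace in headers
small_df = pd.read_csv(toy_csv, usecols=lambda col: col.strip() in wanted)
print(small_df.shape)
```

This notebook loads the full file instead, which keeps every variable available while the codebook is being reviewed.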

In [None]:
# clean column headers
df.columns = df.columns.str.lower() # convert headers to lowercase
df.columns = df.columns.str.strip() # remove whitespace
df.columns = df.columns.str.replace(' ', '-') # replace space between words with dash
df.columns = df.columns.str.replace('_', '') # remove underscores (e.g. '_state' becomes 'state')
print(df.head()) # inspect df

   state  fmonth        idate imonth   iday    iyear  dispcode          seqno           psu  ctelenm1  pvtresd1  colghous  statere1  celphon1  ladult1  colgsex1  numadult  landsex1  nummen  numwomen  respslct  safetime  ctelnum1  cellfon5  cadult1  cellsex1  pvtresd3  cclghous  cstate1  landline  hhadult  sexvar  genhlth  physhlth  menthlth  poorhlth  priminsr  persdoc3  medcost1  checkup1  exerany2  sleptim1  lastden4  rmvteth4  cvdinfr4  cvdcrhd4  cvdstrk3  asthma3  asthnow  chcscnc1  chcocnc1  chccopd3  addepev3  chckdny2  havarth4  diabete4  diabage4  marital  educa  renthom1  numhhol4  numphon4  cpdemo1c  veteran3  employ1  children  income3  pregnant  weight2  height3  deaf  blind  decide  diffwalk  diffdres  diffalon  hadmam  howlong  cervscrn  crvclcnc  crvclpap  crvclhpv  hadhyst2  hadsigm4  colnsigm  colntes1  sigmtes1  lastsig4  colncncr  vircolo1  vclntes2  smalstol  stoltest  stooldn2  bldstfit  sdnates1  smoke100  smokday2  usenow3  ecignow2  lcsfirst  lcslast  lcsnumcg  
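
To see what the header cleaning does in isolation, here is a minimal sketch on a toy set of column names. The names are illustrative examples in the BRFSS style, and the final uniqueness check is an added suggestion rather than part of the original cell.

```python
import pandas as pd

# Toy headers in the same style as the raw BRFSS columns (illustrative only)
toy = pd.DataFrame(columns=["_STATE", " SMOKE100 ", "_AGE80", "_RFBING6"])

cleaned = (
    toy.columns
    .str.lower()                          # lowercase
    .str.strip()                          # trim surrounding whitespace
    .str.replace(" ", "-", regex=False)   # inner spaces become dashes
    .str.replace("_", "", regex=False)    # drop underscores
)
print(list(cleaned))

# Removing characters can in principle merge two distinct headers into one,
# so it is worth confirming the cleaned names are still unique.
print(cleaned.is_unique)
```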

### 1.1 To include profiles of non-smokers

For the section on Tobacco Use, the first question posed to respondents is 'Have you smoked at least 100 cigarettes in your entire life?' Responses are captured under the variable `smoke100`.   
  
If their answer is `1` 'yes', they will continue to answer relevant questions pertaining to tobacco use, and responses to these follow-up questions are captured under the variables `yrssmok` and `packday`.  
  
If their answer is `2` 'no', they will not be asked any further questions, and their responses to the follow-up questions will be captured as `Blank` for `yrssmok` and `packday` because those questions were not asked.  
To make sure that we include these non-smokers' profiles in our dataset, we will map `yrssmok` and `packday` to a value of `0` if their response to `smoke100` is `2` 'no'.  
  
Another question asked is 'Do you now smoke cigarettes every day, some days, or not at all?' Responses are captured under the variable `smokday2`.  
To make sure that we include these non-smokers' profiles in our dataset, we will map `yrssmok` and `packday` to a value of `0` if their response to `smokday2` is `3` 'not at all'.   

In [None]:
# To include profiles of non-smokers

# smoke100 == 2.0 means the respondent has not smoked at least 100 cigarettes in their entire life.
# We will assume the person is a non-smoker
df.loc[df['smoke100'] == 2.0, ['yrssmok','packday']] = 0.0


# smokday2 == 3.0 means the respondent currently does not smoke at all.
df.loc[df['smokday2'] == 3.0, ['yrssmok','packday']] = 0.0
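
The effect of the two mapping rules can be checked on a small toy frame. The values below are made up for illustration; only the column names and response codes come from the description above.

```python
import numpy as np
import pandas as pd

# Toy respondents (made-up values):
# row 0: never smoked 100 cigarettes, follow-ups blank
# row 1: smoked before but now 'not at all'
# row 2: current smoker with recorded history
# row 3: no answers at all
toy = pd.DataFrame({
    "smoke100": [2.0, 1.0, 1.0, np.nan],
    "smokday2": [np.nan, 3.0, 1.0, np.nan],
    "yrssmok":  [np.nan, 15.0, 20.0, np.nan],
    "packday":  [np.nan, 0.5, 1.0, np.nan],
})

# Same two rules as in the cell above
toy.loc[toy["smoke100"] == 2.0, ["yrssmok", "packday"]] = 0.0
toy.loc[toy["smokday2"] == 3.0, ["yrssmok", "packday"]] = 0.0
print(toy)
```

Rows 0 and 1 end up with `0.0` in both columns, row 2 keeps its recorded values, and row 3 stays missing. Note that for row 1 the second rule overwrites a previously recorded value, so any smoking history of respondents who have quit is not retained in these two columns.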

### 1.2 To include profiles of respondents under the age of 45

The U.S. Preventive Services Task Force (Task Force) recommends that adults aged 45 to 75 be screened for colorectal cancer (see [source](https://www.cdc.gov/cancer/colorectal/basic_info/screening/tests.htm)).  
  
For the section on Colorectal Cancer Screening, respondents are asked 'Colonoscopy and sigmoidoscopy are exams to check for colon cancer. Have you ever had either of these exams?'.  
  
If they are 45 years old and above, their responses will be captured under the variable `hadsigm4`.  
  
If they are below 45 years old, they will not be required to answer this question, and their responses will be captured as `Blank` as it was not asked.  
  
To make sure that we include these respondents' profiles in our dataset, we will map `hadsigm4` to a value of `999` if their age `age80` is less than 45.

In [None]:
# Respondents under 45 years old are not asked about colorectal examinations, so their responses are blank.
# We flag them with an arbitrary large sentinel value (999) so they can be told apart from genuinely missing answers.

# To indicate under 45 years old for Sigmoidoscopy/Colonoscopy
df.loc[df['age80'] < 45.0, ['hadsigm4']] = 999.0
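
A minimal sketch of the sentinel rule on toy data (ages and responses are made up) shows which rows are affected:

```python
import numpy as np
import pandas as pd

# Toy respondents: two under 45, one aged exactly 45 who answered, one older non-response
toy = pd.DataFrame({
    "age80":    [30.0, 44.0, 45.0, 67.0],
    "hadsigm4": [np.nan, np.nan, 1.0, np.nan],
})

# Same rule as above: under-45s get the sentinel 999
toy.loc[toy["age80"] < 45.0, ["hadsigm4"]] = 999.0
print(toy)
```

Only the first two rows receive `999`; the respondent aged exactly 45 keeps the recorded answer, and the older respondent with no answer stays missing. Since `999` is a label rather than a quantity, it probably makes sense to treat this column as categorical in later steps so the sentinel is not interpreted as a magnitude.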

### 1.3 To convert height and weight to metric measurements

In [None]:
# cleaning up height and weight variables

df1 = df.copy() # create a copy of dataframe
df1['height'] = df1['htm4'] / 100 # convert height from cm to m
df1['weight'] = df1['wtkg3'] / 100 # wtkg3 stores kg with 2 implied decimal places, so divide by 100 to get kg

# custom function to calculate bmi
def get_bmi(height, weight):
    #bmi formula
    bmi = weight/height**2
    return bmi

# create new column for bmi
df1['bmi'] = round(get_bmi(df1['height'], df1['weight']),2)

# drop the raw columns that were only needed to derive height and weight
df1 = df1.drop(columns=['htm4', 'wtkg3'])
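
As a worked example of the conversion and the formula BMI = weight (kg) / height (m)², the sketch below assumes raw values of `htm4 = 160` and `wtkg3 = 6804` for one respondent, plus a second respondent with both values missing. These raw values are assumptions chosen for illustration.

```python
import pandas as pd

def get_bmi(height_m, weight_kg):
    """BMI = weight in kg divided by height in metres squared."""
    return weight_kg / height_m ** 2

# Assumed raw values: 160 cm and 68.04 kg (stored as 6804), plus one missing row
toy = pd.DataFrame({"htm4": [160.0, None], "wtkg3": [6804.0, None]})

toy["height"] = toy["htm4"] / 100    # cm -> m
toy["weight"] = toy["wtkg3"] / 100   # implied 2 decimals -> kg
toy["bmi"] = get_bmi(toy["height"], toy["weight"]).round(2)
print(toy[["height", "weight", "bmi"]])
```

The first row works out to 68.04 / 1.60² ≈ 26.58, and the missing row propagates as `NaN` rather than raising an error.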

In [None]:
# rename column headers
df1 = df1.rename(columns = {'genhlth': 'health_status',
                            'age80': 'age',
                            'phys14d': 'phys_health_not_good',
                            'ment14d': 'mental_health_not_good',
                            'checkup1': 'last_routine_checkup',
                            'denvst3': 'visit_dentist_past_year',
                            'hlthpln': 'health_insurance',
                            'sleptim1': 'sleep_hours',
                            'totinda': 'phy_exercise_past_30_days',
                            'cvdstrk3': 'stroke',
                            'chcocnc1': 'cancer',
                            'chckdny2': 'kidney_disease',
                            'hadsigm4': 'colon_sigmoidoscopy',
                            'michd': 'chd_mi',
                            'asthms1': 'asthma_status',
                            'race1': 'race_ethnicity',
                            'educag': 'education',
                            'income3': 'income',
                            'smoker3': 'smoker_status',
                            'cureci2': 'e_cig_smoker',
                            'rfbing6': 'binge_drinker',
                            'rfdrhv8': 'heavy_drinker'})

print(df1.shape) # original rows and columns
print(df1.head()) # debug

(445132, 329)
   state  fmonth        idate imonth   iday    iyear  dispcode          seqno           psu  ctelenm1  pvtresd1  colghous  statere1  celphon1  ladult1  colgsex1  numadult  landsex1  nummen  numwomen  respslct  safetime  ctelnum1  cellfon5  cadult1  cellsex1  pvtresd3  cclghous  cstate1  landline  hhadult  sexvar  health_status  physhlth  menthlth  poorhlth  priminsr  persdoc3  medcost1  last_routine_checkup  exerany2  sleep_hours  lastden4  rmvteth4  cvdinfr4  cvdcrhd4  stroke  asthma3  asthnow  chcscnc1  cancer  chccopd3  addepev3  kidney_disease  havarth4  diabete4  diabage4  marital  educa  renthom1  numhhol4  numphon4  cpdemo1c  veteran3  employ1  children  income  pregnant  weight2  height3  deaf  blind  decide  diffwalk  diffdres  diffalon  hadmam  howlong  cervscrn  crvclcnc  crvclpap  crvclhpv  hadhyst2  colon_sigmoidoscopy  colnsigm  colntes1  sigmtes1  lastsig4  colncncr  vircolo1  vclntes2  smalstol  stoltest  stooldn2  bldstfit  sdnates1  smoke100  smokday2  u

In [None]:
# keep only the variables that will be used in the analysis
df1 = df1[['age', 'height', 'weight', 'bmi', 'yrssmok', 'packday',
           'yrsquit', 'sleep_hours', 'health_status', 'phys_health_not_good',
           'mental_health_not_good', 'last_routine_checkup', 'visit_dentist_past_year',
           'health_insurance', 'phy_exercise_past_30_days', 'stroke', 'cancer',
           'kidney_disease', 'colon_sigmoidoscopy', 'chd_mi', 'asthma_status', 'race_ethnicity',
           'sex', 'education', 'income', 'smoker_status', 'e_cig_smoker', 'binge_drinker',
           'heavy_drinker']]

print(f'Number of rows: {df1.shape[0]} and number of columns: {df1.shape[1]} \n')
print(df1.head()) # debug

Number of rows: 445132 and number of columns: 29 

    age  height  weight    bmi  yrssmok  packday  yrsquit  sleep_hours  health_status  phys_health_not_good  mental_health_not_good  last_routine_checkup  visit_dentist_past_year  health_insurance  phy_exercise_past_30_days  stroke  cancer  kidney_disease  colon_sigmoidoscopy  chd_mi  asthma_status  race_ethnicity  sex  education  income  smoker_status  e_cig_smoker  binge_drinker  heavy_drinker
0  80.0     NaN     NaN    NaN      0.0      0.0      0.0          8.0            2.0                   1.0                     1.0                   1.0                      9.0               9.0                        2.0     2.0     2.0             2.0                  1.0     2.0            3.0             1.0  2.0        4.0    99.0            4.0           1.0            1.0            1.0
1  80.0    1.60   68.04  26.58      0.0      0.0      0.0          6.0            1.0                   1.0                     1.0                   8

### 1.4 Prepare data for final export

In [None]:
# Final subset of dataframe for export
filtered_df = df1[['age', 'height', 'weight', 'bmi', 'yrssmok', 'packday',
                   'sleep_hours', 'health_status', 'phys_health_not_good',
                   'mental_health_not_good', 'last_routine_checkup', 'visit_dentist_past_year',
                   'health_insurance', 'phy_exercise_past_30_days', 'stroke', 'cancer',
                   'kidney_disease', 'colon_sigmoidoscopy', 'chd_mi', 'asthma_status',
                   'race_ethnicity', 'sex', 'education', 'income', 'smoker_status',
                   'e_cig_smoker', 'binge_drinker', 'heavy_drinker']]

filtered_df.to_csv('../data/filtered_data.csv', index=False)
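
Selecting a long list of columns raises a `KeyError` if any name is misspelled or was renamed differently than expected. A small guard can report exactly which names are missing before the export runs. `select_columns` below is a hypothetical helper, not part of the original notebook, shown on toy data.

```python
import pandas as pd

def select_columns(frame, wanted):
    """Return frame[wanted], listing any missing names in the error message."""
    missing = [col for col in wanted if col not in frame.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return frame[wanted]

# Toy frame with made-up values
toy = pd.DataFrame({"age": [80.0], "bmi": [26.58], "sex": [2.0]})

subset = select_columns(toy, ["age", "bmi"])
print(subset.shape)

# Asking for a column that does not exist is reported by name
try:
    select_columns(toy, ["age", "yrsquit"])
    caught = None
except KeyError as err:
    caught = str(err)
print(caught)
```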

### 1.5 Key Takeaways

Through a rigorous filtering process and codebook review, we have reduced the data from 326 variables to a core set of 28. These variables will undergo further data cleaning and analysis to extract valuable insights, as detailed in the next notebook, [2A. Exploratory Data Analysis - Data Visualizations](02A_EDA_and_Data_Visualization.ipynb).