#### Classification | Summary

# Predicting Heart Disease 

## Abstract
This project predicts patients' risk of heart disease from an excerpt of the CDC BRFSS survey data, to identify which patients would benefit from enrolling in an HMO's heart health program. The best-performing model was a logistic regression tuned with GridSearchCV.
 

## Design

Heart disease is the [leading cause](https://www.cdc.gov/heartdisease/facts.htm) of death in the U.S. for both men and women. Key risk factors include diabetes, overweight and obesity, unhealthy diet, physical inactivity, and excessive alcohol use.

The client, [Kaiser Permanente](https://healthy.kaiserpermanente.org/front-door), a [health maintenance organization (HMO)](https://www.investopedia.com/terms/h/hmo.asp) and integrated managed care consortium, wants to lower its operating costs and has requested a model that identifies patients at high risk of heart disease before they have a heart attack.

**Research Question:** What model can best predict patients' risk for heart disease?<br/>
**Impact Hypothesis:** Reduce the number of patients who develop heart disease (arterial plaque or heart attack).<br/>
**Error metric:** Recall, chosen so that the model misses as few patients at risk for heart disease as possible.
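
Recall is the share of truly at-risk patients that the model flags: TP / (TP + FN). The sketch below works through that arithmetic on made-up labels (not project data); the helper name `recall` is only for illustration.

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN): share of actual positives the model catches."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels for illustration only (1 = heart disease, 0 = no heart disease).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

example_recall = recall(y_true, y_pred)
print(example_recall)  # 3 of the 4 at-risk patients are flagged
```

A false positive (the fifth patient above) does not lower recall, which is why recall alone favors flagging more patients.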

## Data

[Personal Key Indicators of Heart Disease](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease) is excerpted from the CDC's health survey, the [BRFSS](https://www.cdc.gov/brfss/index.html). The dataset (n = 319,795) has 17 features, all survey responses: a mix of numeric, binary (yes/no, male/female), categorical (White, Black, Asian, American Indian/Alaskan Native, Other, Hispanic), and Likert-scale (poor, fair, good, very good, excellent) values. Complete survey questions are available in the [BRFSS 2020 Data Dictionary](https://www.cdc.gov/brfss/annual_data/2020/pdf/codebook20_llcp-v2-508.pdf).<br/>
**Data Transformation:** Mapped survey responses to numbers (e.g., Yes=1, No=0) to prepare for modeling.<br/>
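
A minimal sketch of this kind of mapping with pandas, on toy rows. The column names follow the Kaggle dataset but are assumptions here, and the 0 to 4 ordinal coding of the Likert scale is one plausible choice, not necessarily the one used in the project.

```python
import pandas as pd

# Toy rows for illustration; column names are assumed, not taken from the project code.
df = pd.DataFrame({
    "HeartDisease": ["No", "Yes", "No"],
    "Smoking": ["Yes", "No", "Yes"],
    "GenHealth": ["Poor", "Very good", "Excellent"],
})

# Binary survey answers become 1/0.
yes_no = {"Yes": 1, "No": 0}
# Assumed ordinal coding for the Likert-scale answers.
gen_health = {"Poor": 0, "Fair": 1, "Good": 2, "Very good": 3, "Excellent": 4}

for col in ["HeartDisease", "Smoking"]:
    df[col] = df[col].map(yes_no)
df["GenHealth"] = df["GenHealth"].map(gen_health)

print(df)
```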


## Algorithms

##### Iterative modeling
0 Baseline Logistic Regression (3 features)<br/>
1 Dummy Classifier<br/>
2 Logistic Regression<br/>
3 Decision Trees: Depth 2<br/>
4 Decision Trees: Depth 4<br/>
5 Random Forests<br/>
6 Gradient Boosted Trees: xgboost<br/>
7 Naive Bayes: Bernoulli<br/>
8 Naive Bayes: Gaussian<br/>
9 Naive Bayes: Multinomial<br/>
10 Ensemble: Naive Bayes Hard Voting Classifier<br/>
11 Ensemble: Naive Bayes Soft Voting Classifier<br/>
12 Ensemble: Stacking Classifier (non-NBs)<br/>
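
The comparison loop behind a list like this can be sketched as follows. This is an illustration on synthetic, imbalanced data from `make_classification`, not the project's data or code, and it covers only two of the models (the dummy classifier and a plain logistic regression).

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data: roughly 10% positive class.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logreg": LogisticRegression(max_iter=1000),
}

# Fit each candidate and record its recall on the validation split.
recalls = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    recalls[name] = recall_score(y_val, model.predict(X_val))

print(recalls)
```

A most-frequent dummy classifier never predicts the positive class, so its recall is 0; that is the floor each later model is measured against.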

##### Feature Engineering
13 Logistic regression: Question groups<br/>
* Groups of similar survey questions as defined by the [data dictionary](https://www.cdc.gov/brfss/annual_data/2020/pdf/codebook20_llcp-v2-508.pdf).
    * behaviors = physical activity days/month + sleep time hrs/day + alcohol + tobacco use
    * demographics = age + gender + race
    * disease = asthma + diabetes + kidney + skin cancer + stroke
    * (health) measures = bmi + general health + mental + mobility + physical
14 Logistic regression: Risk factors<br/>
* Heart disease risk factors as outlined by the [CDC](https://www.cdc.gov/heartdisease/facts.htm).
    * risk factors = diabetes + bmi + physical activity + alcohol
15 Logistic regression: Question groups + Risk factor features<br/>
* Model using the features engineered in models 13 and 14.
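
One reading of the `+` notation above is that each engineered feature is the row-wise sum of its numerically encoded columns. The sketch below shows that reading for the risk-factor feature on toy values; the column names and the summing itself are assumptions, not confirmed project code.

```python
import pandas as pd

# Toy, already-encoded rows; column names are assumed for illustration.
df = pd.DataFrame({
    "Diabetic": [1, 0, 0],
    "BMI": [31.2, 22.5, 27.0],
    "PhysicalActivity": [0, 1, 1],
    "AlcoholDrinking": [0, 0, 1],
})

# risk factors = diabetes + bmi + physical activity + alcohol
risk_cols = ["Diabetic", "BMI", "PhysicalActivity", "AlcoholDrinking"]
df["risk_factors"] = df[risk_cols].sum(axis=1)

print(df[["risk_factors"]])
```

Because BMI is on a much larger scale than the binary columns, a sum like this is dominated by BMI unless the inputs are scaled first.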

##### Class Imbalance Handling
16 Logistic regression: threshold = 0.05
* Based on the precision and recall curves for the validation set (X_val, y_val), model iteration with decision threshold = 0.05
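
Lowering the decision threshold means labeling a patient positive whenever the predicted probability clears a cutoff below the default 0.5. A minimal sketch on synthetic data (not the project's), using 0.05 as in the model title:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the survey data.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class

threshold = 0.05
y_pred_default = (proba >= 0.5).astype(int)     # what model.predict() would return
y_pred_low = (proba >= threshold).astype(int)   # custom, lower cutoff

recall_default = recall_score(y_val, y_pred_default)
recall_low = recall_score(y_val, y_pred_low)
print(recall_default, recall_low)
```

Every patient flagged at 0.5 is also flagged at the lower cutoff, so recall can only stay the same or rise; the cost is more false positives, which is what the precision and recall curves make visible.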

##### Model Tuning: GridSearchCV
17 Logistic regression: GridSearchCV
* Model based on best parameters: C=0.01, penalty="l2". 
18 Logistic regression: GridSearchCV + threshold = 0.05
* Model with decision threshold = 0.05
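
A sketch of how such a grid search could be set up, scoring on recall to match the error metric. The candidate values for `C`, the 5-fold setting, and the synthetic data are assumptions for illustration; only `C=0.01` and `penalty="l2"` come from the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the survey data.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Hypothetical search grid; the project's actual grid is not shown in this summary.
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="recall",  # select the parameters with the best cross-validated recall
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params, search.best_score_)
```

After fitting, `search.best_estimator_` is refit on all the data passed to `fit` and can be combined with a custom threshold on `predict_proba`, as in model 18.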


## Tools

* Python, numpy, pandas, sklearn<br/>
* Matplotlib, Seaborn<br/>

## Communication

Slides and code are available at https://github.com/slp22/classification-project.
<br/>
<br/>
<br/>

#### Figure 1. 
### Iterative Models (total = 18)

![model_eval.png](attachment:model_eval.png)

#### Figure 2. 
### Confusion Matrix for Logistic Regression GridSearchCV

![confusion-matrix-3.jpeg](attachment:confusion-matrix-3.jpeg)

#### Figure 3. 
### ROC AUC Curve for Logistic Regression GridSearchCV

![roc-auc-curve-3.jpeg](attachment:roc-auc-curve-3.jpeg)