# Predicting Heart Disease  

## Classification Model Stepwise Analysis <a id='top'></a> 

1. [Research Question](#1)<br/>
2. [Dataset: Personal Key Indicators of Heart Disease](#2) <br/>
3. [Exploratory Data Analysis](#3)<br/>
4. [Baselining](#4)<br/>
5. [Validation and Testing](#5)<br/>
6. [Model Iterations](#6) <br/>
7. [Model Selection](#7)<br/>



In [7]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, SVR
from sklearn.metrics import *
from sklearn.model_selection import GridSearchCV

import matplotlib.pyplot as plt
import seaborn as sns
# import plotly.express as px
# import plotly.graph_objects as go
# from plotly.subplots import make_subplots

## 1. Research Question<a id='1'></a> 

* **RQ:** Could a model predict the probability of a patient having heart disease based on the risk factors in electronic health records?
* **Data source:** [Personal Key Indicators of Heart Disease](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)
* **Evaluation metric:** Recall


## 2. Dataset: [Personal Key Indicators of Heart Disease](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)<a id='2'></a>  


In [8]:
df = pd.read_csv('heart_2020_cleaned.csv')

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   HeartDisease      319795 non-null  object 
 1   BMI               319795 non-null  float64
 2   Smoking           319795 non-null  object 
 3   AlcoholDrinking   319795 non-null  object 
 4   Stroke            319795 non-null  object 
 5   PhysicalHealth    319795 non-null  float64
 6   MentalHealth      319795 non-null  float64
 7   DiffWalking       319795 non-null  object 
 8   Sex               319795 non-null  object 
 9   AgeCategory       319795 non-null  object 
 10  Race              319795 non-null  object 
 11  Diabetic          319795 non-null  object 
 12  PhysicalActivity  319795 non-null  object 
 13  GenHealth         319795 non-null  object 
 14  SleepTime         319795 non-null  float64
 15  Asthma            319795 non-null  object 
 16  KidneyDisease     31

In [10]:
df.head()

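|  | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | 16.6 | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes |
| 1 | No | 20.34 | No | No | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No |
| 2 | No | 26.58 | Yes | No | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No |
| 3 | No | 24.21 | No | No | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes |
| 4 | No | 23.71 | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No |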


[back to top](#top)

## 3. Exploratory Data Analysis<a id='3'></a> 

**Look at, summarize, and clean the data.** 
- Examine at least some rows in micro detail, checking that the data is correct and appears as you expected (e.g., number of customers should not be negative!). 
- Also study the macro level by aggregating the data and looking at summary information and statistics: 
    - what is the data type of each column, 
    - how many entries and missing values are there, 
    - what are some descriptive statistics, like the mean, for numerical columns, etc. 
- Significant preprocessing/data preparation may be called for in cases of messy/problematic data.
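
The macro-level checks above can be sketched roughly as follows. This is only an illustration: `toy` is a small made-up frame standing in for `df` (its column names mirror a few of the real ones, its values are invented), and `summarize` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Toy stand-in for `df`; values are invented for illustration only.
toy = pd.DataFrame({
    "HeartDisease": ["No", "No", "Yes", "No", "No", "Yes"],
    "BMI": [18.5, 21.0, 27.5, 24.0, 23.0, 30.5],
    "SleepTime": [5.0, 7.0, 8.0, 6.0, 8.0, 12.0],
    "Sex": ["Female", "Female", "Male", "Female", "Female", "Male"],
})

def summarize(frame):
    """Return per-column dtype, missing-value count, and number of unique values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),
    })

summary = summarize(toy)
numeric_stats = toy.describe()               # count, mean, std, min, quartiles, max
n_duplicates = int(toy.duplicated().sum())   # fully duplicated rows
negative_bmi = int((toy["BMI"] < 0).sum())   # sanity check: BMI cannot be negative

print(summary)
print(numeric_stats)
```

On the real data, the same calls would be applied to `df` instead of `toy`.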

**Scope out classification viability:** 
- Look at key statistics and visualizations related to classification: 
    - correlation matrix, 
    - target distribution (how imbalanced are the classes you're predicting?), 
    - target vs. feature plots (e.g. seaborn pairplot with the target value passed as a hue, see logistic regression notebook for a good example). 
- Get some initial expectations for model performance and identify features that intuitively, visually, and based on their per-class distributions (e.g. box plots by target class) are likely to work well.
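
A minimal sketch of the class-balance and per-class checks described above, again on an invented stand-in frame (`toy`) rather than the real `df`. The choice of `BMI` and `PhysicalHealth` as example features is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-in for `df`: 8 negatives and 2 positives, values invented.
toy = pd.DataFrame({
    "HeartDisease": ["No"] * 8 + ["Yes"] * 2,
    "BMI": [22.0, 24.5, 21.0, 27.0, 23.5, 26.0, 25.0, 20.0, 31.0, 29.0],
    "PhysicalHealth": [0.0, 0.0, 3.0, 0.0, 1.0, 0.0, 2.0, 0.0, 20.0, 10.0],
})

# Target distribution: share of each class.
class_share = toy["HeartDisease"].value_counts(normalize=True)

# Per-class means of numeric features (the numbers behind a box plot by class).
per_class_mean = toy.groupby("HeartDisease")[["BMI", "PhysicalHealth"]].mean()

# Correlation of each numeric feature with the binary-encoded target.
y = (toy["HeartDisease"] == "Yes").astype(int)
corr_with_target = toy[["BMI", "PhysicalHealth"]].corrwith(y)

print(class_share)
print(per_class_mean)
print(corr_with_target)
```

Features whose per-class means differ clearly, or that correlate noticeably with the target, are natural first candidates for a baseline model.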

**Determine the most relevant classification metric(s):** 
- Given the model's use case and the distribution of the target, what metrics are most relevant for this problem? 
- It's critical to establish this before modeling so that you can properly decide how well a model is actually working. 
- Is the class distribution balanced, so that accuracy is a meaningful metric? Do you need both good recall and good precision for the positive class (use F1)? 
- Is recall more important than precision (use F_beta with beta > 1, care about recall beta times more than precision)? 
- Is it a probability ranking problem (use ROC AUC)? Think carefully about cost-benefit analysis from the use case perspective to decide on the right metric(s).
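
To make the metric choices above concrete, here is a small sketch comparing precision, recall, F1, and F2 on hand-written labels. The labels are invented for illustration; only the `sklearn.metrics` functions are real.

```python
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

# Invented labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # 3 TP, 1 FN, 2 FP, 4 TN

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)

# F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta > 1 favors recall.
f2 = fbeta_score(y_true, y_pred, beta=2)

print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f} F2={f2:.3f}")
```

Because recall is higher than precision in this example, F2 comes out above F1, which is the behavior you want when missed positives cost more than false alarms.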


[back to top](#top)

## 4. Baselining<a id='4'></a> 

- Build a simple baseline model such as logistic regression, KNN, or naive Bayes, using a small handful of features (you might get lucky and be able to explain your targets with very few features). 
- Start with features that are most likely to be predictive based on your domain knowledge and EDA (step 3) and/or are simple to handle (e.g. numeric without null values). 
- Calculate the model evaluation metrics you've determined are relevant to get a baseline score and feel for how well the model can perform. 
- Ideally, calculate the baseline score on a held-out set as in part 5, not the training data.
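
A baseline along these lines might look like the sketch below. It assumes synthetic data from `make_classification` as a stand-in for a few numeric heart-disease features (roughly 10% positives), so the score it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a handful of numeric features.
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

# Hold out 25% of the rows; stratify so both splits keep the class ratio.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

# Score on the held-out rows, not on the training data.
hold_pred = baseline.predict(X_hold)
baseline_recall = recall_score(y_hold, hold_pred)
baseline_precision = precision_score(y_hold, hold_pred, zero_division=0)

print(f"baseline recall={baseline_recall:.3f} precision={baseline_precision:.3f}")
```

Whatever this baseline scores on the real data becomes the number that later, more complex models need to beat.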

[back to top](#top)

## 5. Validation and Testing<a id='5'></a>

- Set up a data splitting structure for validating and testing your model. 
- Cross-validation will often be preferable to simple, single-set validation due to its robustness. 
- There may also be cases where a specialized validation setup is called for, such as in time series problems. 
- Using your chosen validation scheme, you can perform iterative feature selection/expansion/engineering and model complexity adjustments in order to complete the next 2 steps. 
- You will use the test data only once your model is finalized in order to compute a final estimate of generalization performance.
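
One possible splitting structure is sketched here: a single stratified test split that is set aside, plus stratified k-fold cross-validation on the remaining rows. The data is synthetic, and the 80/20 split and 5 folds are assumptions, not choices taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic, imbalanced stand-in data.
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)

# Set the test rows aside once; they are not touched during model iteration.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified folds keep the positive rate similar in every validation fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_recall = cross_val_score(
    LogisticRegression(max_iter=1000), X_dev, y_dev, cv=cv, scoring="recall"
)

print("recall per fold:", cv_recall.round(3))
print("mean recall:", cv_recall.mean().round(3))
```

The spread of the per-fold scores gives a rough sense of how much a single validation number could vary by chance.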



[back to top](#top)

## 6. Model Iterations <a id='6'></a> 

- Starting from the baseline and in an iterative, validated loop ask: 
    - Do you need more complexity or less (underfitting vs. overfitting)? 
    - Do you need a fancier model (nonlinear, additional feature engineering / transformations)? 
    - If you need more complexity, try tree-based models such as random forest or gradient boosted trees. 
    - Are you overfitting and need to make your model more conservative by removing features or using regularization? 
- Hopefully you can quickly acquire an understanding of which direction you need to go in from your baseline and early modeling results, then make more fine-tuned changes as you go.


- The impact of model choices should be consistently measured against the same validation data as in part 5, using your relevant classification performance metrics such as F1 or ROC AUC. 
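
Measuring every candidate against the same validation folds could be sketched as below. The data is synthetic, and the hyperparameters (`n_neighbors=5`, `n_estimators=50`) are placeholder assumptions rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the training + validation rows.
X_dev, y_dev = make_classification(
    n_samples=1000, n_features=8, n_informative=4, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

# Scale-sensitive models get a StandardScaler inside the pipeline, so the
# scaler is refit on each training fold and never sees the validation fold.
candidates = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# One fixed fold assignment shared by every candidate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
mean_recall = {
    name: cross_val_score(model, X_dev, y_dev, cv=cv, scoring="recall").mean()
    for name, model in candidates.items()
}

for name, score in sorted(mean_recall.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} mean recall = {score:.3f}")
```

Fixing `random_state` in the fold splitter is what makes the comparison apples to apples: each model is scored on exactly the same validation rows.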


#### k-Nearest Neighbors

[back to top](#top)

#### Logistic Regression 

[back to top](#top)

#### Random Forests

[back to top](#top)

#### Gradient Boosted Trees

[back to top](#top)

#### Ensembling

[back to top](#top)

#### Naive Bayes

[back to top](#top)

### Evaluation metric: Recall

[back to top](#top)

### Class Imbalance Handling 

- If your target class distribution is (highly) imbalanced, make sure to try imbalance handling strategies that dovetail with the metrics you’re most interested in, such as 
    - resampling, 
    - class_weight adjustments, and 
    - decision threshold tuning. 
- These methods are part of the modeling process, whether they happen before, during, or after training.
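
Two of these strategies, `class_weight` adjustment and decision threshold tuning, are sketched below on synthetic data. The threshold of 0.2 is an arbitrary illustrative value and `recall_at` is a hypothetical helper; in practice the threshold would be chosen on validation data against the precision you can tolerate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 10% positives).
X, y = make_classification(
    n_samples=3000, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=1,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Strategy 1 (during training): weight classes inversely to their frequency.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
recall_weighted = recall_score(y_val, weighted.predict(X_val))

# Strategy 2 (after training): keep the unweighted model, move the threshold.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = plain.predict_proba(X_val)[:, 1]   # estimated P(class = 1)

def recall_at(threshold):
    """Recall on the validation set when predicting 1 for proba >= threshold."""
    return recall_score(y_val, (proba >= threshold).astype(int))

recall_default = recall_at(0.5)
recall_lowered = recall_at(0.2)
precision_lowered = precision_score(y_val, (proba >= 0.2).astype(int), zero_division=0)

print(f"class_weight='balanced': recall={recall_weighted:.3f}")
print(f"threshold 0.5: recall={recall_default:.3f}")
print(f"threshold 0.2: recall={recall_lowered:.3f} precision={precision_lowered:.3f}")
```

Lowering the threshold can never reduce recall for a fixed model, but it usually costs precision, so both numbers are worth reporting together.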

[back to top](#top)

### Feature Engineering 

- The [feature engineering](https://app.thisismetis.com/courses/162/pages/home-feature-engineering-for-classification) lesson provides a model for how you might track your progress while iteratively expanding your model. 

[back to top](#top)

## 7. Model Selection<a id='7'></a> 

**Finalize and test:** 
- When satisfied with the results of your tuning in _Model Iterations_, establish your final model choices:
    - features, 
    - preprocessing, 
    - imbalance handling strategy, and 
    - hyperparameters.
- Retrain this model on all training + validation data. 
- Make predictions on the test data and score these predictions, reporting this score as your estimate of the model's generalization performance.
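
The finalize-and-test step might be sketched like this. The "final" model here (a class-weighted random forest with 100 trees) is a hypothetical choice used only to show the mechanics, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data.
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=7,
)

# X_dev holds all training + validation rows; X_test was never used for tuning.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)

# Hypothetical final choices, refit on all training + validation data.
final_model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=7
)
final_model.fit(X_dev, y_dev)

# Score the test set exactly once and report it as the generalization estimate.
test_pred = final_model.predict(X_test)
test_recall = recall_score(y_test, test_pred)

print(f"test recall = {test_recall:.3f}")
print(classification_report(y_test, test_pred, digits=3, zero_division=0))
```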

**Interpret:** 
- Extract and study your final model coefficients or feature importances. Are there any interesting or unexpected takeaways? 
- How do the model coefficients/importances align with your intuition and domain knowledge about the problem? 
- Be careful about complicating factors in interpretation such as differing feature scales and multicollinearity.
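
For a linear model, one way to make coefficients comparable is to standardize the features first, as in the sketch below. The data and the `feature_0` ... `feature_4` names are synthetic placeholders, not columns from the heart disease dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(
    n_samples=1500, n_features=5, n_informative=3, n_redundant=0, random_state=3
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Standardizing puts every feature on the same scale, so coefficient
# magnitudes can be compared with each other.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

coefs = pd.Series(
    model.named_steps["logisticregression"].coef_[0], index=feature_names
)
# Rank by absolute size; the sign gives the direction of the association.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

# Pairwise feature correlations: strongly correlated features share credit,
# which makes their individual coefficients harder to interpret.
feature_corr = pd.DataFrame(X, columns=feature_names).corr()

print(ranked)
print(feature_corr.round(2))
```

For tree ensembles, the analogous starting point is the fitted model's `feature_importances_` attribute, which is subject to the same multicollinearity caveat.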



[back to top](#top)