# Stage 1 - Individual Empowerment 👨🏻‍🎓
Analytical Model to empower individuals to monitor their cardiovascular health at home

<br>

**Goal:**
- identify the most important personal features for predicting heart attack risk
- find the best model for automatically predicting heart attack risk from those features (target: >80% accuracy)

**Dataset:**
- heart_pki_2020_encoded.csv

**Models:**
1. Logistic Regression
2. CART

<br>
<hr>



## 0. Pre-modelling preparations
- Install dependencies
- Import libraries and dataset

In [3]:
# install dependencies
# Run the following command in your terminal if you don't have the dependency installed:
    # pip install -U imbalanced-learn --user

# Or run in notebook:    
# %pip install -U imbalanced-learn 

In [8]:
# import basic libraries
import pandas as pd
import numpy as np

# model-related libraries
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics


In [35]:
# import personal_key_indicator dataset
# note: dataset is already OneHotEncoded and Integer Encoded in 'data-cleaning-preprocessing.ipynb'
pki_df = pd.read_csv('datasets/heart_pki_2020_encoded.csv')
pki_df

Unnamed: 0,HeartDisease,BMI,PhysicalHealth,MentalHealth,SleepTime,Smoking_No,Smoking_Yes,AlcoholDrinking_No,AlcoholDrinking_Yes,Stroke_No,...,GenHealth_Fair,GenHealth_Good,GenHealth_Poor,GenHealth_Very good,Asthma_No,Asthma_Yes,KidneyDisease_No,KidneyDisease_Yes,SkinCancer_No,SkinCancer_Yes
0,0,16.60,3.0,30.0,5.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
1,0,20.34,0.0,0.0,7.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
2,0,26.58,20.0,30.0,8.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
3,0,24.21,0.0,0.0,6.0,1.0,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,0,23.71,28.0,0.0,8.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315247,0,22.22,0.0,0.0,8.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
315248,1,27.41,7.0,0.0,6.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
315249,0,29.84,0.0,0.0,5.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0
315250,0,24.24,0.0,0.0,6.0,1.0,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


In [10]:
pki_df.columns.values

array(['HeartDisease', 'BMI', 'PhysicalHealth', 'MentalHealth',
       'SleepTime', 'Smoking_No', 'Smoking_Yes', 'AlcoholDrinking_No',
       'AlcoholDrinking_Yes', 'Stroke_No', 'Stroke_Yes', 'DiffWalking_No',
       'DiffWalking_Yes', 'Sex_Female', 'Sex_Male', 'AgeCategory_18-24',
       'AgeCategory_25-29', 'AgeCategory_30-34', 'AgeCategory_35-39',
       'AgeCategory_40-44', 'AgeCategory_45-49', 'AgeCategory_50-54',
       'AgeCategory_55-59', 'AgeCategory_60-64', 'AgeCategory_65-69',
       'AgeCategory_70-74', 'AgeCategory_75-79',
       'AgeCategory_80 or older', 'Race_American Indian/Alaskan Native',
       'Race_Asian', 'Race_Black', 'Race_Hispanic', 'Race_Other',
       'Race_White', 'Diabetic_No', 'Diabetic_No, borderline diabetes',
       'Diabetic_Yes', 'Diabetic_Yes (during pregnancy)',
       'PhysicalActivity_No', 'PhysicalActivity_Yes',
       'GenHealth_Excellent', 'GenHealth_Fair', 'GenHealth_Good',
       'GenHealth_Poor', 'GenHealth_Very good', 'Asthma_No', 'Asthma_

<hr>

## 1. Logistic Regression

**Why logistic regression?**
- The dependent variable, `HeartDisease`, can be encoded in binary (Yes: 1, No: 0)
- Large dataset of 315252 rows


**References:**
- https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8

#### 1.1 Train-Test split + SMOTE (Synthetic Minority Oversampling Technique)

**1.1 a) Train-Test splitting of dataset**

In [36]:
X = pki_df.loc[:, pki_df.columns != 'HeartDisease']
y = pki_df.loc[:, pki_df.columns == 'HeartDisease']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

**1.1 b) `HeartDisease` is imbalanced, so we resample it (using SMOTE on the training set) for Recursive Feature Elimination**

Note: SMOTE is applied to the training set only; the test set keeps its original class distribution.

In [37]:
num_of_HD_Yes = len(pki_df[pki_df['HeartDisease'] == 1])
num_of_HD_No = len(pki_df[pki_df['HeartDisease'] == 0])
num_rows = len(pki_df)

print('Proportion of HeartDisease == 0:', num_of_HD_No/num_rows)
print('Proportion of HeartDisease == 1:', num_of_HD_Yes/num_rows)

Proportion of HeartDisease == 0: 0.9156198850443454
Proportion of HeartDisease == 1: 0.08438011495565452


In [86]:
# Oversample the training set to fix the imbalanced 'HeartDisease' class

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=0)

X_train_os, y_train_os = smote.fit_resample(X_train, y_train)
X_train_os = pd.DataFrame(data = X_train_os, columns = X_train.columns)
y_train_os = pd.DataFrame(data = y_train_os, columns = y_train.columns)


# check the oversampled data (train)
print("===== Oversampled data =====")

num_of_HD_Yes = len(y_train_os[y_train_os['HeartDisease'] == 1])
num_of_HD_No = len(y_train_os[y_train_os['HeartDisease'] == 0])
num_rows = len(X_train_os)

print("Total number of rows:", num_rows)
print("Number of Heart Disease == 0 rows:", num_of_HD_No)
print("Number of Heart Disease == 1 rows:", num_of_HD_Yes)

print("Proportion of HeartDisease == 0:", num_of_HD_No/num_rows)
print("Proportion of HeartDisease == 1:", num_of_HD_Yes/num_rows)

===== Oversampled data =====
Total number of rows: 404218
Number of Heart Disease == 0 rows: 202109
Number of Heart Disease == 1 rows: 202109
Proportion of HeartDisease == 0: 0.5
Proportion of HeartDisease == 1: 0.5
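
To make the oversampling step less of a black box, the sketch below mimics the core idea behind SMOTE in plain NumPy: each synthetic row is a random interpolation between a minority-class row and one of its k nearest minority-class neighbours. The function name `smote_like` and the toy data are illustrative assumptions; this is a simplified stand-in, not the `imblearn` implementation used above.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours.

    Illustrative only: a simplified stand-in for the idea behind SMOTE.
    """
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances between minority rows
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest rows for each row

    base = rng.integers(0, len(X_min), size=n_new)             # random starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one random neighbour each
    gap = rng.random((n_new, 1))                               # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# toy minority class: 20 rows, 3 numeric features (made-up data)
X_minority = np.random.default_rng(1).normal(size=(20, 3))
X_synthetic = smote_like(X_minority, n_new=50)
print(X_synthetic.shape)  # (50, 3)
```

With this kind of interpolation, one-hot columns such as `Smoking_Yes` may take fractional values in the synthetic rows, which is worth keeping in mind when inspecting `X_train_os`.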


In [31]:
X_train.head()

Unnamed: 0,BMI,PhysicalHealth,MentalHealth,SleepTime,Smoking_No,Smoking_Yes,AlcoholDrinking_No,AlcoholDrinking_Yes,Stroke_No,Stroke_Yes,...,GenHealth_Fair,GenHealth_Good,GenHealth_Poor,GenHealth_Very good,Asthma_No,Asthma_Yes,KidneyDisease_No,KidneyDisease_Yes,SkinCancer_No,SkinCancer_Yes
18662,32.98,0.0,0.0,7.0,0.0,1.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
45050,25.61,0.0,0.0,6.0,0.0,1.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
191308,16.24,0.0,0.0,7.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
105093,18.3,0.0,0.0,9.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
161942,31.57,5.0,0.0,8.0,0.0,1.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0


#### 1.2 Recursive Feature Elimination

- i.e. repetitive pruning until the number of features desired is reached
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html

> Feature ranking with recursive feature elimination.
> 
> Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. 
> - First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through any specific attribute or callable. 
> - Then, the least important features are **`pruned`** from current set of features. 
> - That procedure is recursively repeated on the **`pruned set`** until the **desired number of features to select is eventually reached.**

<br>

! We do this to reduce the number of features before running `Backwards Elimination` <br>
! RFE is used to select the `top 20` most important features
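
The pruning loop described above can be sketched by hand. This is a minimal illustration on made-up data, assuming importance is measured by the absolute value of the logistic regression coefficient and that one feature is dropped per round; `manual_rfe`, `X_toy` and `y_toy` are hypothetical names, and scikit-learn's `RFE` remains the tool used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up data: only features 0 and 1 influence the label
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 6))
noise = rng.normal(scale=0.5, size=500)
y_toy = (2 * X_toy[:, 0] - 3 * X_toy[:, 1] + noise > 0).astype(int)

def manual_rfe(X, y, n_features_to_select):
    """Drop the feature with the smallest |coefficient| until the target count is reached."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        model = LogisticRegression(solver='lbfgs', max_iter=1000)
        model.fit(X[:, remaining], y)
        # position (within `remaining`) of the least important feature
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        remaining.pop(weakest)
    return remaining

selected = manual_rfe(X_toy, y_toy, n_features_to_select=2)
print(selected)
```

Coefficient magnitudes depend on feature scale, so unscaled numeric columns such as `BMI` may be ranked differently from 0/1 one-hot columns than they would be after standardisation.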

In [71]:
pki_df_features = pki_df.columns.values.tolist()
y_features = ['HeartDisease']
X_features = [i for i in pki_df_features if i not in y_features]

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver='lbfgs', max_iter=1000)
rfe = RFE(logreg, n_features_to_select=20, step=1)
rfe = rfe.fit(X_train_os, y_train_os.values.ravel())

print(rfe.support_)
print(rfe.ranking_)

[False False False False False False False  True  True  True False False
 False  True  True  True  True  True  True  True False False False  True
  True  True  True False  True False False False False False False  True
 False False False  True  True False  True  True False False False False
 False False]
[27 30 28 23  4 12  8  1  1  1  9 17  6  1  1  1  1  1  1  1 25  7  2  1
  1  1  1 16  1 13 31 24  5 18 14  1 19 26 15  1  1 22  1  1 10 21 11  3
 29 20]


In [89]:

pki_df_top_features = rfe.get_feature_names_out()
print(pki_df_top_features)

X_train_os = X_train_os[pki_df_top_features]

['AlcoholDrinking_Yes' 'Stroke_No' 'Stroke_Yes' 'Sex_Male'
 'AgeCategory_18-24' 'AgeCategory_25-29' 'AgeCategory_30-34'
 'AgeCategory_35-39' 'AgeCategory_40-44' 'AgeCategory_45-49'
 'AgeCategory_65-69' 'AgeCategory_70-74' 'AgeCategory_75-79'
 'AgeCategory_80 or older' 'Race_Asian' 'Diabetic_Yes'
 'GenHealth_Excellent' 'GenHealth_Fair' 'GenHealth_Poor'
 'GenHealth_Very good']


#### 1.3 Training Logistic Regression Model + Backwards Elimination

> Backward Elimination (https://www.simplilearn.com/what-is-backward-elimination-technique-in-machine-learning-article):
> - Backward elimination starts with a model that includes all explanatory variables.
> - Then, the variable with the highest p-value is removed from the model. 
> - This process is repeated until all variables in the model have a p-value below a given threshold.
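
The steps quoted above can be sketched as follows. This is only an illustration on made-up data: it assumes two-sided Wald p-values from a plain Newton's-method logistic fit written in NumPy, a threshold of 0.05, and an intercept that is never removed. The names `logit_pvalues` and `backward_elimination` are hypothetical helpers, not part of any library.

```python
import math
import numpy as np

def logit_pvalues(X, y, n_iter=25):
    """Fit logistic regression by Newton's method; return coefficients and Wald p-values."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        hessian = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    # standard errors from the inverse Hessian at the fitted coefficients
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    cov = np.linalg.inv(X.T @ (X * (p * (1 - p))[:, None]))
    z = beta / np.sqrt(np.diag(cov))
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])  # two-sided
    return beta, pvals

def backward_elimination(X, y, names, threshold=0.05):
    """Repeatedly drop the variable with the highest p-value until all are below threshold."""
    keep = list(range(X.shape[1]))
    while True:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])  # intercept + kept columns
        _, pvals = logit_pvalues(design, y)
        pvals = pvals[1:]  # skip the intercept, which is never removed
        worst = int(np.argmax(pvals))
        if pvals[worst] < threshold or len(keep) == 1:
            return [names[i] for i in keep]
        keep.pop(worst)

# made-up data: x0 and x1 drive the outcome, x2..x4 are pure noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2000, 5))
prob = 1.0 / (1.0 + np.exp(-(1.0 * X_toy[:, 0] - 1.5 * X_toy[:, 1])))
y_toy = (rng.random(2000) < prob).astype(float)

kept = backward_elimination(X_toy, y_toy, names=['x0', 'x1', 'x2', 'x3', 'x4'])
print(kept)
```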

In [106]:
# fit on the original (non-oversampled) training set, restricted to the 20 RFE-selected features
logreg_m1 = LogisticRegression(solver='lbfgs', max_iter=1000)
logreg_m1.fit(X_train[pki_df_top_features], y_train.values.ravel())
print("model fitting done!")

model fitting done!


In [114]:
y_pred = logreg_m1.predict(X_test[pki_df_top_features])
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg_m1.score(X_test[pki_df_top_features], y_test)))
print('Accuracy of logistic regression classifier on train set: {:.2f}'.format(logreg_m1.score(X_train[pki_df_top_features], y_train)))


Accuracy of logistic regression classifier on test set: 0.92
Accuracy of logistic regression classifier on train set: 0.92


TO EXPLORE:
- Backward elimination for feature importance
- confusion matrix
- Precision, F-score, etc
- ROC Curve
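
About 91.6% of rows have `HeartDisease == 0`, so a model that always predicts 0 would already reach roughly 0.92 accuracy, the same figure reported above. The sketch below uses made-up labels and two hypothetical models to show how the confusion matrix, precision, recall and F-score from the list above separate a useful model from a majority-class guesser. The metric functions are from `sklearn.metrics`; everything else is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# made-up labels: 92 negatives, 8 positives (an imbalance similar to HeartDisease)
y_true = np.array([0] * 92 + [1] * 8)

# hypothetical model A: always predicts the majority class
pred_majority = np.zeros(100, dtype=int)

# hypothetical model B: flags 10 negatives by mistake but catches 6 of the 8 positives
pred_b = np.zeros(100, dtype=int)
pred_b[:10] = 1     # 10 false positives
pred_b[92:98] = 1   # 6 true positives

for name, pred in [('majority', pred_majority), ('model B', pred_b)]:
    print(name)
    print(confusion_matrix(y_true, pred))  # rows: true class, columns: predicted class
    print('accuracy :', accuracy_score(y_true, pred))
    print('precision:', precision_score(y_true, pred, zero_division=0))
    print('recall   :', recall_score(y_true, pred))
    print('f1       :', f1_score(y_true, pred, zero_division=0))
```

In this toy example the majority-class guesser has the higher accuracy (0.92 versus 0.88) but a recall of 0, which is why accuracy alone may not be enough to judge the heart disease model.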

## 2. CART

https://towardsdatascience.com/cart-classification-and-regression-trees-for-clean-but-powerful-models-cc89e60b7a85

https://www.naukri.com/learning/articles/predicting-categorical-data-using-classification-algorithms/#lo