# Heart Health Ensemble

In [1]:
%%html
<style>
table {float:left}
</style>

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from collections import Counter
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from imblearn.combine import SMOTEENN
from imblearn.metrics import classification_report_imbalanced
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Read Data and Perform Data Cleaning

In [3]:
columns = [
    "HeartDisease",
    "BMI",
    "Smoking",
    "AlcoholDrinking",
    "Stroke",
    # "PhysicalHealth",
    "MentalHealth",
    "DiffWalking",
    "Sex",
    "AgeCategory",
    "Race",
    "Diabetic",
    "PhysicalActivity",
    "SleepTime",
    "Asthma",
    "KidneyDisease",
    "SkinCancer"
]

# Name of the target column
target = "HeartDisease"

In [4]:
# Load the data
file_path = Path('./DataTables/heart_2020_cleaned.csv')
df = pd.read_csv(file_path)
df = df.loc[:, columns].copy()

# Drop the null columns where all values are null
df = df.dropna(axis='columns', how='all')

# Drop the null rows
df = df.dropna()
df.head(10)

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.6,Yes,No,No,30,No,Female,55-59,White,Yes,Yes,5,Yes,No,Yes
1,No,20.34,No,No,Yes,0,No,Female,80 or older,White,No,Yes,7,No,No,No
2,No,26.58,Yes,No,No,30,No,Male,65-69,White,Yes,Yes,8,Yes,No,No
3,No,24.21,No,No,No,0,No,Female,75-79,White,No,No,6,No,No,Yes
4,No,23.71,No,No,No,0,Yes,Female,40-44,White,No,Yes,8,No,No,No
5,Yes,28.87,Yes,No,No,0,Yes,Female,75-79,Black,No,No,12,No,No,No
6,No,21.63,No,No,No,0,No,Female,70-74,White,No,Yes,4,Yes,No,Yes
7,No,31.64,Yes,No,No,0,Yes,Female,80 or older,White,Yes,No,9,Yes,No,No
8,No,26.45,No,No,No,0,No,Female,80 or older,White,"No, borderline diabetes",No,5,No,Yes,No
9,No,40.69,No,No,No,0,Yes,Male,65-69,White,No,Yes,10,No,No,No


# Read Data and Perform Data Cleaning
### The following three steps load three lookup tables from DataTables, each with pre-defined bins for one of these columns:
- AgeCategory
- Diabetic
- SleepTime

### AgeCategory
In the original dataset, AgeCategory has 13 groups, most spanning 5 years. To highlight how the risk of heart disease rises with age, the groups were consolidated into the following three AgeRisk bins.

| AgeCategory|AgeRisk|
|:------------|:-------|
| 18-24|Low Risk|
| 25-29|Low Risk|
| 30-34|Low Risk|
| 35-39|Low Risk|
| 40-44|Low Risk|
| 45-49|Medium Risk|
| 50-54|Medium Risk|
| 55-59|Medium Risk|
| 60-64|Medium Risk|
| 65-69|High Risk|
| 70-74|High Risk|
| 75-79|High Risk|
| 80 or older|High Risk|


In [None]:
# Define AgeCategory Bins dictionary
file_path = Path('./DataTables/AgeRecoded.csv')
ageDF = pd.read_csv(file_path)
ageDict = dict(zip(ageDF.AgeCategory, ageDF.AgeRisk))
print(ageDict)

{'18-24': 'Low Risk', '25-29': 'Low Risk', '30-34': 'Low Risk', '35-39': 'Low Risk', '40-44': 'Low Risk', '45-49': 'Medium Risk', '50-54': 'Medium Risk', '55-59': 'Medium Risk', '60-64': 'Medium Risk', '65-69': 'High Risk', '70-74': 'High Risk', '75-79': 'High Risk', '80 or older': 'High Risk'}


### Diabetic
In the original dataset, the Diabetic column has 4 groups covering a spectrum of diabetes statuses. To remove the ambiguous and potentially fluctuating degrees of diabetes, the groups were reduced to the following Diabetes Bins.

|Diabetic|Diabetes Bin|
|:------------|:-------|
|No|No|
|Yes|Yes|
|"No, borderline diabetes"|No|
|Yes (during pregnancy)|No|

In [None]:
# Define Diabetic Bins dictionary
file_path = Path('./DataTables/DiabeticRecoded.csv')
diabeticDF = pd.read_csv(file_path)
diabeticDict = dict(zip(diabeticDF.Diabetic, diabeticDF['Diabetes Bin']))
print(diabeticDict)

{'No': 'No', 'Yes': 'Yes', 'No, borderline diabetes': 'No', 'Yes (during pregnancy)': 'No'}


### SleepTime
In the original dataset, SleepTime is not grouped at all. Many rows show the recommended amount of sleep per day (7 to 9 hours). However, some entries record patients as sleeping up to 24 hours! To limit the effect of outliers and large deviations, the values were grouped into the following three Recommended Sleep bins.

|SleepTime|Recommended Sleep|
|------------:|:-------|
|1|Below|
|2|Below|
|3|Below|
|4|Below|
|5|Below|
|6|Below|
|7|Meets|
|8|Meets|
|9|Meets|
|10|Above|
|11|Above|
|12|Above|
|13|Above|
|14|Above|
|15|Above|
|16|Above|
|17|Above|
|18|Above|
|19|Above|
|20|Above|
|21|Above|
|22|Above|
|23|Above|
|24|Above|
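
The 24-row table above follows a simple threshold rule, so it can also be written as a small function. This is a minimal sketch, not the notebook's own code: the names `recommended_sleep` and `sleep_bins` are hypothetical, and it assumes SleepTime is a whole number of hours between 1 and 24.

```python
def recommended_sleep(hours: int) -> str:
    """Return the Recommended Sleep bin for a whole number of hours (1-24)."""
    if not 1 <= hours <= 24:
        raise ValueError(f"SleepTime out of range: {hours}")
    if hours < 7:
        return "Below"
    if hours <= 9:
        return "Meets"
    return "Above"


# Rebuild the lookup table from the rule (hypothetical helper dictionary)
sleep_bins = {hours: recommended_sleep(hours) for hours in range(1, 25)}
print(sleep_bins[6], sleep_bins[7], sleep_bins[9], sleep_bins[10])
```

Keeping the bins in a CSV file, as this notebook does, makes them easier to review and edit; the function form is mainly useful as a cross-check of the table.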

In [None]:
# Define SleepTime Bins dictionary
file_path = Path('./DataTables/SleepRecoded.csv')
sleepDF = pd.read_csv(file_path)
sleepDict = dict(zip(sleepDF.SleepTime, sleepDF['Recommended Sleep']))
print(sleepDict)

{1: 'Below', 2: 'Below', 3: 'Below', 4: 'Below', 5: 'Below', 6: 'Below', 7: 'Meets', 8: 'Meets', 9: 'Meets', 10: 'Above', 11: 'Above', 12: 'Above', 13: 'Above', 14: 'Above', 15: 'Above', 16: 'Above', 17: 'Above', 18: 'Above', 19: 'Above', 20: 'Above', 21: 'Above', 22: 'Above', 23: 'Above', 24: 'Above'}
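
The three dictionaries above still have to be applied to the matching columns of the DataFrame. One way to do that is `Series.map`, sketched here on made-up rows. This is an assumption about how the recoding could be done, not a copy of the original cell; `toy`, `binned`, and the abbreviated dictionaries are illustrative names only.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "AgeCategory": ["55-59", "80 or older", "40-44"],
    "Diabetic": ["Yes", "No, borderline diabetes", "Yes (during pregnancy)"],
    "SleepTime": [5, 7, 12],
})

# Abbreviated stand-ins for ageDict, diabeticDict and sleepDict
age_bins = {"40-44": "Low Risk", "55-59": "Medium Risk", "80 or older": "High Risk"}
diabetic_bins = {
    "No": "No",
    "Yes": "Yes",
    "No, borderline diabetes": "No",
    "Yes (during pregnancy)": "No",
}
sleep_bins = {5: "Below", 7: "Meets", 12: "Above"}

# Replace each raw value with its bin label
binned = toy.copy()
for column, mapping in [
    ("AgeCategory", age_bins),
    ("Diabetic", diabetic_bins),
    ("SleepTime", sleep_bins),
]:
    binned[column] = binned[column].map(mapping)

# Values missing from a dictionary become NaN, so check before modelling
assert not binned.isna().any().any()
print(binned)
```

Because `map` silently turns unmapped values into NaN, a check like the one at the end is worth keeping whenever the lookup tables are edited.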


In [8]:
## LabelEncoder
Because the machine learning models cannot operate on string values, each string has to be converted to a number. In this step we use LabelEncoder to transform every column containing string values into integer codes.


Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.6,Yes,No,No,30,No,Female,Medium Risk,White,Yes,Yes,Below,Yes,No,Yes
1,No,20.34,No,No,Yes,0,No,Female,High Risk,White,No,Yes,Meets,No,No,No
2,No,26.58,Yes,No,No,30,No,Male,High Risk,White,Yes,Yes,Meets,Yes,No,No
3,No,24.21,No,No,No,0,No,Female,High Risk,White,No,No,Below,No,No,Yes
4,No,23.71,No,No,No,0,Yes,Female,Low Risk,White,No,Yes,Meets,No,No,No


In [9]:
# Apply LabelEncoder to every column containing string values
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
string_columns = [
    "HeartDisease",
    "Smoking",
    "AlcoholDrinking",
    "Stroke",
    "DiffWalking",
    "Sex",
    "AgeCategory",
    "Race",
    "Diabetic",
    "PhysicalActivity",
    "SleepTime",
    "Asthma",
    "KidneyDisease",
    "SkinCancer",
]
df2 = df.copy()
# apply() refits the same encoder on each column in turn;
# within a column, labels are numbered in sorted order
df2[string_columns] = df2[string_columns].apply(le.fit_transform)
df2.head()

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,SleepTime,Asthma,KidneyDisease,SkinCancer
0,0,16.6,1,0,0,30,0,0,2,5,1,1,1,1,0,1
1,0,20.34,0,0,1,0,0,0,0,5,0,1,2,0,0,0
2,0,26.58,1,0,0,30,0,1,0,5,1,1,2,1,0,0
3,0,24.21,0,0,0,0,0,0,0,5,0,0,1,0,0,1
4,0,23.71,0,0,0,0,1,0,1,5,0,1,2,0,0,0
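
To see which integer each label receives, the encoder's `classes_` attribute can be inspected. The sketch below uses the five AgeCategory values from the output above; `encoder`, `codes`, and `mapping` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

# AgeCategory values of the first five rows after binning
age_risk = ["Medium Risk", "High Risk", "High Risk", "High Risk", "Low Risk"]

encoder = LabelEncoder()
codes = encoder.fit_transform(age_risk)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform recovers the original strings
print(encoder.inverse_transform(codes).tolist())
```

The codes follow alphabetical order ("High Risk" before "Low Risk" before "Medium Risk"), not the order of risk, so they should not be read as an ordinal scale.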


#  Separate the Features (X) from the Target (y)

In [10]:
# Create our features: drop the target, then one-hot encode the categorical columns
X = df.drop(columns="HeartDisease")
X = pd.get_dummies(X)

# Create our target
y = df["HeartDisease"]


In [11]:
pd.set_option('display.max_columns', None)
X.head()

Unnamed: 0,BMI,MentalHealth,Smoking_No,Smoking_Yes,AlcoholDrinking_No,AlcoholDrinking_Yes,Stroke_No,Stroke_Yes,DiffWalking_No,DiffWalking_Yes,Sex_Female,Sex_Male,AgeCategory_High Risk,AgeCategory_Low Risk,AgeCategory_Medium Risk,Race_American Indian/Alaskan Native,Race_Asian,Race_Black,Race_Hispanic,Race_Other,Race_White,Diabetic_No,Diabetic_Yes,PhysicalActivity_No,PhysicalActivity_Yes,SleepTime_Above,SleepTime_Below,SleepTime_Meets,Asthma_No,Asthma_Yes,KidneyDisease_No,KidneyDisease_Yes,SkinCancer_No,SkinCancer_Yes
0,16.6,30,0,1,1,0,1,0,1,0,1,0,0,0,1,0,0,0,0,0,1,0,1,0,1,0,1,0,0,1,1,0,0,1
1,20.34,0,1,0,1,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,1,1,0,0,1,0,0,1,1,0,1,0,1,0
2,26.58,30,0,1,1,0,1,0,1,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,0,1,1,0,1,0
3,24.21,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,0,0,1
4,23.71,0,1,0,1,0,1,0,0,1,1,0,0,1,0,0,0,0,0,0,1,1,0,0,1,0,0,1,1,0,1,0,1,0


In [13]:
X.shape

(319795, 34)

# Split Data into Training and Testing

In [14]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=78, train_size=0.80
)

# Determine the shape of our training and testing sets.
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(255836, 34)
(63959, 34)
(255836,)
(63959,)


In [15]:
# Creating a StandardScaler instance.
scaler = StandardScaler()
# Fitting the Standard Scaler with the training data.
X_scaler = scaler.fit(X_train)

# Scaling the data.
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
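
The scaler is fitted on the training split only, so the test rows are standardized with the training mean and standard deviation. Here is a minimal NumPy sketch of that arithmetic. The numbers are made up and stand in for a single feature such as BMI.

```python
import numpy as np

# Made-up values for a single feature; not taken from the dataset
train = np.array([[20.0], [25.0], [30.0]])
test = np.array([[25.0], [35.0]])

# Statistics come from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1. The test column generally does not, which is expected: it is measured against the training distribution.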

## Undersampling Model goes in here

## Combination (Over and Under) Sampling
In this section, you will test a combination over- and under-sampling algorithm to determine whether it performs better than the other sampling algorithms. You will resample the data using the SMOTEENN algorithm and complete the following steps:

1. View the count of the target classes using Counter from the collections library.
2. Use the resampled data to train a logistic regression model.
3. Calculate the balanced accuracy score from sklearn.metrics.
4. Print the confusion matrix from sklearn.metrics.
5. Generate a classification report using classification_report_imbalanced from imbalanced-learn.

Note: Use a random state of 1 for each sampling algorithm to ensure consistency between tests.

In [None]:
# Resample the training data with SMOTEENN
# Warning: This is a large dataset, and this step may take some time to complete
smote_enn = SMOTEENN(random_state=1)
X_resampled, y_resampled = smote_enn.fit_resample(X_train_scaled, y_train)
print(Counter(y_resampled))

In [None]:
# Train the Logistic Regression model using the resampled data
modelENN = LogisticRegression(solver='lbfgs', random_state=1)
modelENN.fit(X_resampled, y_resampled)

In [None]:
# Predict on the scaled test set, then calculate the balanced accuracy score
y_pred = modelENN.predict(X_test_scaled)
balanced_accuracy_score(y_test, y_pred)

In [None]:
# Display the confusion matrix
confusion_matrix(y_test, y_pred)
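
Balanced accuracy is the average of the per-class recalls, so it can be read straight off a confusion matrix. The sketch below uses a hypothetical matrix (the counts are invented, not results from this notebook) to show how it differs from plain accuracy on imbalanced classes.

```python
# Hypothetical binary confusion matrix, laid out as sklearn does:
# rows are true classes, columns are predicted classes
cm = [
    [90, 10],  # true "No":  90 predicted "No", 10 predicted "Yes"
    [8, 2],    # true "Yes":  8 predicted "No",  2 predicted "Yes"
]

# Recall per class = correct predictions for the class / all true members of the class
recalls = [row[i] / sum(row) for i, row in enumerate(cm)]

# Balanced accuracy weights every class equally
balanced_accuracy = sum(recalls) / len(recalls)

# Plain accuracy weights every row equally, so the majority class dominates
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

print(recalls)
print(balanced_accuracy)
print(accuracy)
```

In this invented case plain accuracy looks high because most rows are "No", while balanced accuracy exposes the poor recall on the "Yes" class.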

In [None]:
# Print the imbalanced classification report
print(classification_report_imbalanced(y_test, y_pred))