# GSB 545: Advanced Machine Learning for Business Analytics

## Predicting Heart Disease
### Primary Goals:
1. Predict heart disease.
2. One of the questions posted on the Kaggle page is, "Can you indicate which variables have a significant effect on the likelihood of heart disease?" If your work allows you to comment on this question, please do!

### Data
https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease

### Assignment Specs
You need to use at least one boosting model in your work to answer the questions above, but you should explore at least two other models in order to answer the above questions as best you can. You may use multiple boosting models if you like, but I'd encourage you to consider past model types we've discussed.

The Kaggle page indicates that the classes are extremely unbalanced in this dataset. Keep this in mind as you work and, if appropriate, take steps to adjust for it.

In [68]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import warnings

warnings.filterwarnings("ignore")

In [69]:
heart_disease = pd.read_csv("data/2020/heart_2020_cleaned.csv")

In [70]:
heart_disease.head()

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.6,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No


In [71]:
print(heart_disease.dtypes)

HeartDisease         object
BMI                 float64
Smoking              object
AlcoholDrinking      object
Stroke               object
PhysicalHealth      float64
MentalHealth        float64
DiffWalking          object
Sex                  object
AgeCategory          object
Race                 object
Diabetic             object
PhysicalActivity     object
GenHealth            object
SleepTime           float64
Asthma               object
KidneyDisease        object
SkinCancer           object
dtype: object

In [72]:
# Check the unique values in the column to ensure they are 'No' and 'Yes'
print(heart_disease['HeartDisease'].unique())

['No' 'Yes']


In [73]:
# Count the number of records in each class
heart_disease_counts = heart_disease['HeartDisease'].value_counts()

# Print the counts
print("Count of people with heart disease (Yes):", heart_disease_counts['Yes'])
print("Count of people without heart disease (No):", heart_disease_counts['No'])

Count of people with heart disease (Yes): 27373
Count of people without heart disease (No): 292422


In [74]:
27373/292422

0.09360786808106093

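Because the classes are extremely unbalanced (roughly one "Yes" for every eleven "No" records), accuracy alone is misleading: a model can score well simply by predicting "No" for almost everyone. We therefore also evaluate the models with precision, recall, and F1-score; ROC-AUC is another option.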
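To make the imbalance concrete, the class counts printed above can be turned into a prevalence and a majority-class baseline. This is a small arithmetic sketch; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
n_yes = 27373
n_no = 292422
n_total = n_yes + n_no

# Share of records with heart disease (the positive class).
prevalence = n_yes / n_total

# Accuracy of a trivial model that always predicts "No".
baseline_accuracy = n_no / n_total

print(f"Prevalence of heart disease: {prevalence:.4f}")
print(f"Always-'No' baseline accuracy: {baseline_accuracy:.4f}")
```

Any test accuracy close to this baseline should be read as "about as good as predicting 'No' for everyone", which is why the per-class metrics matter here.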

In [75]:
# Convert categorical variables to numerical format using LabelEncoder
label_encoders = {}
for column in heart_disease.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    heart_disease[column] = label_encoders[column].fit_transform(heart_disease[column])

# Normalize the numerical features
scaler = StandardScaler()
numerical_columns = heart_disease.select_dtypes(include=['float64']).columns
heart_disease[numerical_columns] = scaler.fit_transform(heart_disease[numerical_columns])

# Split the data into training and testing sets
X = heart_disease.drop('HeartDisease', axis=1)  # Features
y = heart_disease['HeartDisease']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
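One variation worth considering: the cell above fits the scaler on the full dataset before splitting, and `LabelEncoder` assigns alphabetical integer codes to nominal columns such as `Race`. The sketch below shows a possible alternative that splits first (with `stratify` to preserve the class ratio), one-hot encodes categorical columns, and fits all preprocessing on the training rows only. It uses a small synthetic table with made-up values, and the names `toy`, `preprocess`, and `clf` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic stand-in for the heart disease table (hypothetical values).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "BMI": rng.normal(28, 6, n),
    "SleepTime": rng.integers(4, 10, n).astype(float),
    "Smoking": rng.choice(["Yes", "No"], n),
    "GenHealth": rng.choice(["Poor", "Fair", "Good", "Very good"], n),
    "HeartDisease": rng.choice(["Yes", "No"], n, p=[0.1, 0.9]),
})

X_toy = toy.drop(columns="HeartDisease")
y_toy = (toy["HeartDisease"] == "Yes").astype(int)

# stratify=y_toy keeps the Yes/No ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["BMI", "SleepTime"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Smoking", "GenHealth"]),
])

# The scaler and encoder are fitted inside the pipeline, on training rows only.
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])
clf.fit(X_tr, y_tr)

print(f"Positive rate in train: {y_tr.mean():.3f}, in test: {y_te.mean():.3f}")
```

Because the preprocessing lives inside the pipeline, passing `clf` to `cross_val_score` or `GridSearchCV` would refit the scaler within each fold as well.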


## Modeling

### AdaBoost

In [76]:
# Define the model
AdaBoost = AdaBoostClassifier(n_estimators=50, algorithm='SAMME')

# Evaluate the model using cross-validation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(AdaBoost, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)

# Fit the model on the full training set
AdaBoost.fit(X_train, y_train)
y_pred = AdaBoost.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("AdaBoost Results:")
print(f"Cross-validated Accuracy: {np.mean(scores):.3f} ({np.std(scores):.3f})")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("")

AdaBoost Results:
Cross-validated Accuracy: 0.916 (0.002)
Test Accuracy: 0.914
Test Precision: 0.538
Test Recall: 0.087
Test F1-score: 0.150
Confusion Matrix:
[[57948   419]
 [ 5105   487]]
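As a sanity check, the reported metrics can be recomputed by hand from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, with "No" encoded as 0 and "Yes" as 1) and uses the numbers printed above.

```python
# Confusion matrix from the AdaBoost output: [[TN, FP], [FN, TP]].
tn, fp = 57948, 419
fn, tp = 5105, 487

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted "Yes", how many were right
recall = tp / (tp + fn)      # of the actual "Yes", how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.914 precision=0.538 recall=0.087 f1=0.150
```

The high accuracy comes almost entirely from the 57,948 true negatives; only 487 of the 5,592 actual positives are detected.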




### XGBoost

In [77]:
# Define the model
xgb_model = XGBClassifier(n_estimators=50)

# Evaluate the model using cross-validation
scores = cross_val_score(xgb_model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)

# Fit the model on the full training set
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("XGBoost Results:")
print(f"Cross-validated Accuracy: {np.mean(scores):.3f} ({np.std(scores):.3f})")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("")

XGBoost Results:
Cross-validated Accuracy: 0.916 (0.001)
Test Accuracy: 0.913
Test Precision: 0.514
Test Recall: 0.089
Test F1-score: 0.152
Confusion Matrix:
[[57895   472]
 [ 5093   499]]
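XGBoost has a `scale_pos_weight` parameter intended for imbalanced binary problems, and a commonly suggested starting value is the ratio of negative to positive examples. The sketch below only computes that ratio and builds a keyword dictionary; it assumes the full-dataset counts printed earlier (the training-split counts would be the more careful choice), and `xgb_params` is a hypothetical name.

```python
# Class counts from the value_counts() output earlier in the notebook.
n_yes, n_no = 27373, 292422

# Ratio of negative to positive examples, a common starting value for scale_pos_weight.
scale_pos_weight = n_no / n_yes

# Keyword arguments that could be passed as XGBClassifier(**xgb_params).
xgb_params = {"n_estimators": 50, "scale_pos_weight": round(scale_pos_weight, 2)}
print(xgb_params)
```

Weighting the positive class this way usually raises recall at the cost of precision, so the value is best treated as one more hyperparameter to tune rather than a fixed setting.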




### Logistic Regression

In [78]:
# Define the model
logreg = LogisticRegression()

# Evaluate the model using cross-validation
scores = cross_val_score(logreg, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)

# Fit the model on the full training set
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("Logistic Regression Results:")
print(f"Cross-validated Accuracy: {np.mean(scores):.3f} ({np.std(scores):.3f})")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("")

Logistic Regression Results:
Cross-validated Accuracy: 0.916 (0.002)
Test Accuracy: 0.913
Test Precision: 0.511
Test Recall: 0.087
Test F1-score: 0.149
Confusion Matrix:
[[57902   465]
 [ 5106   486]]
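Logistic regression in scikit-learn accepts `class_weight="balanced"`, which reweights each class inversely to its frequency and is one simple way to adjust for imbalance. The comparison below is a sketch on synthetic data with roughly 10% positives, not a rerun on the heart disease data; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives), standing in for the real table.
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

results = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    model = LogisticRegression(class_weight=weight, max_iter=1000)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[label] = {
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred),
    }
    print(label, results[label])
```

On data like this the balanced model typically finds more of the positives while flagging more false positives, the same trade-off that appears later with the balanced random forest.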




### Tuning hyper-parameters for AdaBoost

In [79]:
# Define the parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1.0],
    'estimator': [DecisionTreeClassifier(max_depth=1)],
    'algorithm': ['SAMME']
}

# Perform grid search
grid_search = GridSearchCV(AdaBoost, param_grid=param_grid,
                            scoring='f1', cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Best Model Test Accuracy:", accuracy)

Fitting 3 folds for each of 9 candidates, totalling 27 fits


Best Parameters: {'algorithm': 'SAMME', 'estimator': DecisionTreeClassifier(max_depth=1), 'learning_rate': 1.0, 'n_estimators': 200}
Best Model Test Accuracy: 0.9135539955283853


In [80]:
# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("AdaBoost Results:")
# The tuned model's cross-validated F1 is available as grid_search.best_score_.
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("")

AdaBoost Results:
Test Accuracy: 0.914
Test Precision: 0.528
Test Recall: 0.105
Test F1-score: 0.175
Confusion Matrix:
[[57844   523]
 [ 5006   586]]
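The cross-validation in this notebook scores accuracy. Since the evaluation focus is on precision, recall, and F1, the same cross-validation could report those metrics directly. The sketch below uses `cross_validate` with stratified folds on synthetic data; the dataset and the name `cv_strat` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data (about 10% positives).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the class ratio roughly constant in every fold.
cv_strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

cv_results = cross_validate(
    AdaBoostClassifier(n_estimators=50, random_state=0),
    X_syn, y_syn,
    scoring=["precision", "recall", "f1"],
    cv=cv_strat,
)

# Report mean and standard deviation across folds for each metric.
for metric in ["precision", "recall", "f1"]:
    fold_scores = cv_results[f"test_{metric}"]
    print(f"CV {metric}: {fold_scores.mean():.3f} ({fold_scores.std():.3f})")
```

Reporting the spread across folds alongside the mean gives a sense of whether small differences between models, such as an F1 of 0.150 versus 0.175, are larger than fold-to-fold noise.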




### Tuning hyper-parameters for XGBoost

In [81]:
# Grid of hyperparameters
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Perform grid search
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid,
                            scoring='f1', cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Best Model Test Accuracy:", accuracy)

Fitting 3 folds for each of 32 candidates, totalling 96 fits


Best Parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
Best Model Test Accuracy: 0.9135383605122032


In [82]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("XGBoost Results:")
# The tuned model's cross-validated F1 is available as grid_search.best_score_.
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("")

XGBoost Results:
Test Accuracy: 0.914
Test Precision: 0.533
Test Recall: 0.089
Test F1-score: 0.153
Confusion Matrix:
[[57929   438]
 [ 5092   500]]




### Tuning Hyperparameters for Logistic Regression

In [83]:
# Define the parameter grid
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100], 
    'penalty': ['l1', 'l2'],  
    'max_iter': [100, 500, 1000] 
}

# Perform the grid search
grid_search = GridSearchCV(LogisticRegression(), param_grid=param_grid,
                            scoring='f1', cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Best Model Test Accuracy:", accuracy)

Fitting 3 folds for each of 36 candidates, totalling 108 fits


Best Parameters: {'C': 1, 'max_iter': 100, 'penalty': 'l2'}
Best Model Test Accuracy: 0.9128973248487312
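One caveat about the grid above: to my understanding, scikit-learn's default `lbfgs` solver supports only the `l2` penalty (or no penalty), so the `l1` candidates cannot be fitted and, with warnings suppressed, would quietly be scored as NaN. A possible way around this is to give each penalty its own sub-grid with a compatible solver. The sketch below runs on synthetic data, and `param_grid_lr` and `search` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 10% positives).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# One sub-grid per penalty so each penalty is paired with a solver that supports it.
param_grid_lr = [
    {"penalty": ["l1"], "solver": ["liblinear"], "C": [0.01, 0.1, 1, 10]},
    {"penalty": ["l2"], "solver": ["lbfgs"], "C": [0.01, 0.1, 1, 10]},
]

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid_lr,
    scoring="f1",
    cv=3,
    error_score="raise",  # fail loudly instead of silently scoring NaN
)
search.fit(X_syn, y_syn)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

Setting `error_score="raise"` is a deliberate choice here: an incompatible combination stops the search immediately instead of disappearing from the results.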


In [84]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("Logistic Regression Results:")
print(f"Cross-validated Accuracy: {np.mean(scores):.3f} ({np.std(scores):.3f})")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Logistic Regression Results:
Cross-validated Accuracy: 0.916 (0.002)
Test Accuracy: 0.913
Test Precision: 0.511
Test Recall: 0.087
Test F1-score: 0.149
Confusion Matrix:
[[57902   465]
 [ 5106   486]]


### Balanced Random Forest

In [86]:
from imblearn.ensemble import BalancedRandomForestClassifier

model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)

# Evaluate the model using cross-validation
scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)

# Fit the model on the full training set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("BalancedRandomForest Results:")
print(f"Cross-validated Accuracy: {np.mean(scores):.3f} ({np.std(scores):.3f})")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("")


BalancedRandomForest Results:
Cross-validated Accuracy: 0.722 (0.003)
Test Accuracy: 0.720
Test Precision: 0.209
Test Recall: 0.791
Test F1-score: 0.330
Confusion Matrix:
[[41610 16757]
 [ 1170  4422]]




The `BalancedRandomForestClassifier` performed best of all the models I used when judged by recall and F1-score. It under-samples the majority class when building each tree, which reduces the bias towards the majority class while maintaining good performance on the minority class. All the other models had very low recall, none above 10.5%, which means they miss roughly nine out of ten patients who actually have heart disease, a critical failure in a healthcare setting.

The balanced random forest reached a recall of 79% and an F1-score of 33%, so 79% of the test-set patients with heart disease were correctly classified. The trade-off is lower precision (21%) and lower accuracy (72%), because the model also flags many patients who do not have heart disease.
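The balance between recall and precision can also be shifted without changing the model, by moving the decision threshold applied to predicted probabilities. The sketch below illustrates the idea with logistic regression on synthetic data; the thresholds are arbitrary examples, and in practice a threshold would be chosen on a validation set rather than on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives).
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # estimated probability of the positive class

recalls = []
for threshold in [0.5, 0.3, 0.1]:
    # Predict "positive" whenever the estimated probability reaches the threshold.
    pred = (proba >= threshold).astype(int)
    p = precision_score(y_te, pred, zero_division=0)
    r = recall_score(y_te, pred)
    recalls.append(r)
    print(f"threshold={threshold:.1f} precision={p:.3f} recall={r:.3f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases as the threshold drops, while precision usually falls.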