### In this project, we will be using the [Heart Attack Prediction Dataset](https://www.kaggle.com/datasets/iamsouravbanerjee/heart-attack-prediction-dataset?resource=download).

This project focuses on predicting heart attack risk using a dataset from Kaggle. It involves data cleaning (e.g., splitting blood pressure into systolic and diastolic values), feature encoding with `OneHotEncoder`, and training a `DecisionTreeClassifier`. Key techniques include hyperparameter tuning with `GridSearchCV` to optimize model performance and identifying the most important features influencing heart attack risk. Insights are provided on lifestyle factors and their impact on heart attack prevention, along with a discussion on the model’s limitations for medical decision-making.


### Part 1 - Clean the Data  

We will first load the data into a pandas `DataFrame`. Then we will split the `Blood Pressure` column into two separate columns, `Systolic BP` and `Diastolic BP`, make sure both new columns are of type `float`, and remove the original `Blood Pressure` column from the `DataFrame`.
Next, we will convert the categorical variables into numerical format using `OneHotEncoder`. The categorical columns to encode are `Sex`, `Diet`, `Country`, `Continent`, and `Hemisphere`.

In [1]:
import pandas as pd

df = pd.read_csv('heart_attack_prediction_dataset.csv')

In [2]:
df[['Systolic BP', 'Diastolic BP']] = df['Blood Pressure'].str.split('/', expand=True).astype(float)

# Remove the original 'Blood Pressure' column
df.drop(columns=['Blood Pressure'], inplace=True)
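
As a quick sanity check, the same split-and-cast expression can be run on a small hand-made frame. This is only a sketch: the `toy` frame and its three readings are illustration values in the same `systolic/diastolic` string format, not a claim about the dataset's contents.

```python
import pandas as pd

# Hypothetical toy readings in the same "systolic/diastolic" string format
toy = pd.DataFrame({'Blood Pressure': ['158/88', '165/93', '91/88']})

# Same split-and-cast expression as in the cell above
toy[['Systolic BP', 'Diastolic BP']] = (
    toy['Blood Pressure'].str.split('/', expand=True).astype(float)
)

# Drop the original string column once the numeric columns exist
toy = toy.drop(columns=['Blood Pressure'])

print(toy.dtypes)
print(toy)
```

Both new columns should come out as `float64`, which is what the later modelling steps expect.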

In [3]:
df

Unnamed: 0,Patient ID,Age,Sex,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,...,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Hemisphere,Heart Attack Risk,Systolic BP,Diastolic BP
0,BMW7812,67,Male,208,72,0,0,1,0,0,...,31.251233,286,0,6,Argentina,South America,Southern Hemisphere,0,158.0,88.0
1,CZE1114,21,Male,389,98,1,1,1,1,1,...,27.194973,235,1,7,Canada,North America,Northern Hemisphere,0,165.0,93.0
2,BNI9906,21,Female,324,72,1,0,0,0,0,...,28.176571,587,4,4,France,Europe,Northern Hemisphere,0,174.0,99.0
3,JLN3497,84,Male,383,73,1,1,1,0,1,...,36.464704,378,3,4,Canada,North America,Northern Hemisphere,0,163.0,100.0
4,GFO8847,66,Male,318,93,1,1,1,1,0,...,21.809144,231,1,5,Thailand,Asia,Northern Hemisphere,0,91.0,88.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8758,MSV9918,60,Male,121,61,1,1,1,0,1,...,19.655895,67,7,7,Thailand,Asia,Northern Hemisphere,0,94.0,76.0
8759,QSV6764,28,Female,120,73,1,0,0,1,0,...,23.993866,617,4,9,Canada,North America,Northern Hemisphere,0,157.0,102.0
8760,XKA5925,47,Male,250,105,0,1,1,1,1,...,35.406146,527,4,4,Brazil,South America,Southern Hemisphere,1,161.0,75.0
8761,EPE6801,36,Male,178,60,1,0,1,0,0,...,27.294020,114,2,8,Brazil,South America,Southern Hemisphere,0,119.0,67.0


In [4]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Specify the categorical columns to encode
categorical_columns = ['Sex', 'Diet', 'Country', 'Continent', 'Hemisphere']

# Initialize the encoder; sparse_output=False returns a dense array (requires scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)

# Perform one-hot encoding and convert to DataFrame
encoded_data = encoder.fit_transform(df[categorical_columns])
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_columns))

# Concatenate the one-hot encoded columns with the original DataFrame (excluding the original categorical columns)
df = pd.concat([df.drop(columns=categorical_columns), encoded_df], axis=1)


In [5]:
df

Unnamed: 0,Patient ID,Age,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,...,Country_United States,Country_Vietnam,Continent_Africa,Continent_Asia,Continent_Australia,Continent_Europe,Continent_North America,Continent_South America,Hemisphere_Northern Hemisphere,Hemisphere_Southern Hemisphere
0,BMW7812,67,208,72,0,0,1,0,0,4.168189,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,CZE1114,21,389,98,1,1,1,1,1,1.813242,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,BNI9906,21,324,72,1,0,0,0,0,2.078353,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,JLN3497,84,383,73,1,1,1,0,1,9.828130,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,GFO8847,66,318,93,1,1,1,1,0,5.804299,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8758,MSV9918,60,121,61,1,1,1,0,1,7.917342,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
8759,QSV6764,28,120,73,1,0,0,1,0,16.558426,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
8760,XKA5925,47,250,105,0,1,1,1,1,3.148438,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
8761,EPE6801,36,178,60,1,0,1,0,0,3.789950,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0


### Part 2 - Train a Model  
Next, we'll create a feature `DataFrame` called `X` by removing the Patient ID and Heart Attack Risk columns.
Then, we'll create a target `Series` called `y`.
We will split the data into 80% training and 20% test using `train_test_split`.
Then, we'll train a `DecisionTreeClassifier` with default arguments (except for `random_state=42`) and fit it to the training data.
After these steps, we'll display the 10 most important features and their names according to the built-in model importance measure.
We'll also print the model accuracy on the test set.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Create the feature DataFrame X and target Series y
X = df.drop(columns=['Patient ID', 'Heart Attack Risk'])  # Remove 'Patient ID' and 'Heart Attack Risk' for features
y = df['Heart Attack Risk']  # Target variable

In [7]:
# Split the data into 80% training and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Get feature importances and display the top 10 features
importances = clf.feature_importances_
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
top_10_features = feature_importances.sort_values(by='Importance', ascending=False).head(10)

# Display the top 10 most important features
print("Top 10 Important Features:\n", top_10_features)

# Print the model accuracy on the test set
accuracy = clf.score(X_test, y_test)
print(f"Model accuracy on test set: {accuracy:.4f}")


Top 10 Important Features:
                     Feature  Importance
12  Sedentary Hours Per Day    0.089845
15            Triglycerides    0.082088
14                      BMI    0.079637
8   Exercise Hours Per Week    0.079244
2                Heart Rate    0.070617
18              Systolic BP    0.064570
13                   Income    0.061281
1               Cholesterol    0.060264
0                       Age    0.059827
19             Diastolic BP    0.051464
Model accuracy on test set: 0.5271
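
An accuracy figure is easier to judge next to a trivial baseline. The sketch below compares a default decision tree with a `DummyClassifier` that always predicts the most frequent class. It uses synthetic data from `make_classification` as a stand-in for the real features, and the 65/35 class weights are an arbitrary illustration; both are assumptions made only so the block runs on its own. In the notebook, the same comparison could be run on `X_train`, `X_test`, `y_train`, and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): imbalanced binary target
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, weights=[0.65, 0.35], random_state=42
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(Xs_train, ys_train)

# Default decision tree, configured as in the cell above
tree = DecisionTreeClassifier(random_state=42).fit(Xs_train, ys_train)

baseline_acc = baseline.score(Xs_test, ys_test)
tree_acc = tree.score(Xs_test, ys_test)
print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
print(f"Decision tree accuracy:           {tree_acc:.4f}")
```

If a model does not clearly beat the majority-class baseline, its accuracy says little about what it has learned from the features.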


### Part 3 - Determine the Best Hyperparameters  

Considering the following candidate values for each of these hyperparameters, we will determine the best combination:
  - 'max_depth': 3, 5, 7, 10, 12
  - 'min_samples_split': 10, 30, 50, 70
  - 'min_samples_leaf': 5, 10, 20, 23
  - 'criterion': 'gini', 'entropy'
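
It helps to know how large this search is before running it. The sketch below counts the combinations with scikit-learn's `ParameterGrid`; the fold count of 5 matches the `cv=5` setting used in the grid search cell, and the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values listed above
param_grid = {
    'max_depth': [3, 5, 7, 10, 12],
    'min_samples_split': [10, 30, 50, 70],
    'min_samples_leaf': [5, 10, 20, 23],
    'criterion': ['gini', 'entropy'],
}

# Every combination of one value per hyperparameter: 5 * 4 * 4 * 2
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 160 800
```

With 160 candidates and 5 folds, the search trains 800 trees, plus one final refit on the full training set.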

In [8]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

param_grid = {
    'max_depth': [3, 5, 7, 10, 12],
    'min_samples_split': [10, 30, 50, 70],
    'min_samples_leaf': [5, 10, 20, 23],
    'criterion': ['gini', 'entropy']
}

# Initialize the DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=42)

# Set up GridSearchCV to find the best hyperparameter combination
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Perform the grid search on the training data
grid_search.fit(X_train, y_train)

# Get the best parameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Cross-Validation Score:", best_score)

Best Parameters: {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 5, 'min_samples_split': 10}
Best Cross-Validation Score: 0.6380884450784594


In [9]:
best_clf = DecisionTreeClassifier(**best_params, random_state=42)
best_clf.fit(X_train, y_train)

# Calculate accuracy on the test set
test_accuracy = best_clf.score(X_test, y_test)
print(f"Test Set Accuracy: {test_accuracy:.4f}")

Test Set Accuracy: 0.6412
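
Because `GridSearchCV` defaults to `refit=True`, the fitted search object already holds a model retrained on the full training set with the best parameters, exposed as `best_estimator_`. The sketch below checks that this model scores the same as a manually refitted tree. It uses synthetic stand-in data and a deliberately small grid so that it runs quickly on its own; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Small illustrative grid, not the one used in the notebook
small_grid = {'max_depth': [3, 5], 'min_samples_leaf': [5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    small_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(Xs_train, ys_train)

# Manual refit with the best parameters, as in the cell above
manual = DecisionTreeClassifier(**search.best_params_, random_state=42)
manual.fit(Xs_train, ys_train)

# The automatically refitted model should give the same test accuracy
auto_acc = search.best_estimator_.score(Xs_test, ys_test)
manual_acc = manual.score(Xs_test, ys_test)
print(auto_acc, manual_acc)
```

Refitting by hand, as the notebook does, is equally valid; `best_estimator_` simply saves a step.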


### Part 4 - Interpretation: "Which lifestyle factors appear to be the strongest predictors of heart attack risk?", "How might this information be used to develop preventive healthcare strategies?", "What are the limitations of using this model for medical decision-making?"

Based on the feature importance scores from the default decision tree in Part 2, `Sedentary Hours Per Day` emerges as the strongest predictor of heart attack risk, followed by `Triglycerides`, `BMI`, and `Exercise Hours Per Week`. These results suggest that factors related to physical activity and metabolic health, such as how long individuals remain sedentary, their exercise habits, and their triglyceride levels and BMI, play an important role in predicting heart attack risk. Healthcare strategies could therefore focus on encouraging physical activity, reducing sedentary time, and promoting healthier dietary choices to manage weight and triglyceride levels.

Preventive healthcare strategies can include public health campaigns aimed at reducing sedentary behavior, promoting regular exercise, and offering nutritional counseling to lower triglycerides and BMI.

The main limitation of using this model for medical decision-making is its accuracy. The default model reaches only 52.71% on the test set, and while the tuned model improves on this (0.6412 on the test set), that is still not sufficient for high-stakes medical applications. Medical decisions often require more robust and clinically validated models, as incorrect predictions could have serious consequences for patient care. The model also likely lacks clinical nuance: a tree limited to `max_depth=3` can capture only a few interactions between features, and it does not account for potential confounding variables.
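
Accuracy alone does not show which kind of error a model makes, and in a medical setting a missed at-risk patient (false negative) is usually the costlier mistake. The sketch below shows one way to inspect this with a confusion matrix and recall. It runs on synthetic stand-in data with an arbitrary 65/35 class split, both assumptions for illustration; in the notebook, `best_clf`, `X_test`, and `y_test` could be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): imbalanced binary target
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, weights=[0.65, 0.35], random_state=42
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Shallow tree, similar in depth to the tuned model
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(Xs_train, ys_train)
preds = tree.predict(Xs_test)

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(ys_test, preds, labels=[0, 1]).ravel()

# Recall: share of truly positive cases that the model flags
recall = tp / (tp + fn)
print(f"False negatives: {fn}, false positives: {fp}, recall: {recall:.4f}")
```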