# Data Preprocessing and Model Exploration 

*   **Patient ID** - Unique identifier for each patient
*   **Age** - Age of the patient
*   **Sex** - Gender of the patient (Male/Female)
*   **Cholesterol** - Cholesterol levels of the patient
*   **Blood Pressure** - Blood pressure of the patient (systolic/diastolic)
*   **Heart Rate** - Heart rate of the patient
*   **Diabetes** - Whether the patient has diabetes (1: Yes, 0: No)
*   **Family History** - Family history of heart-related problems (1: Yes, 0: No)
*   **Smoking** - Smoking status of the patient (1: Smoker, 0: Non-smoker)
*   **Obesity** - Obesity status of the patient (1: Obese, 0: Not obese)
*   **Alcohol Consumption** - Alcohol consumption of the patient (binary: 0 or 1 in this dataset)
*   **Exercise Hours Per Week** - Number of exercise hours per week
*   **Diet** - Dietary habits of the patient (Healthy/Average/Unhealthy)
*   **Previous Heart Problems** - Previous heart problems of the patient (1: Yes, 0: No)
*   **Medication Use** - Medication usage by the patient (1: Yes, 0: No)
*   **Stress Level** - Stress level reported by the patient (1-10)
*   **Sedentary Hours Per Day** - Hours of sedentary activity per day
*   **Income** - Income level of the patient
*   **BMI** - Body Mass Index (BMI) of the patient
*   **Triglycerides** - Triglyceride levels of the patient
*   **Physical Activity Days Per Week** - Days of physical activity per week
*   **Sleep Hours Per Day** - Hours of sleep per day
*   **Country** - Country of the patient
*   **Continent** - Continent where the patient resides
*   **Hemisphere** - Hemisphere where the patient resides
*   **Heart Attack Risk** - Presence of heart attack risk (1: Yes, 0: No)

## Imports

In [1]:
import warnings

import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")

In [2]:
# auto reload libs
%load_ext autoreload
%autoreload 2

## Paths

In [3]:
DATASET = "../data/heart_attack_prediction_dataset.csv"

## Load Data

In [4]:
org_df = pd.read_csv(DATASET)
org_df.head()

Unnamed: 0,Patient ID,Age,Sex,Cholesterol,Blood Pressure,Heart Rate,Diabetes,Family History,Smoking,Obesity,...,Sedentary Hours Per Day,Income,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Country,Continent,Hemisphere,Heart Attack Risk
0,BMW7812,67,Male,208,158/88,72,0,0,1,0,...,6.615001,261404,31.251233,286,0,6,Argentina,South America,Southern Hemisphere,0
1,CZE1114,21,Male,389,165/93,98,1,1,1,1,...,4.963459,285768,27.194973,235,1,7,Canada,North America,Northern Hemisphere,0
2,BNI9906,21,Female,324,174/99,72,1,0,0,0,...,9.463426,235282,28.176571,587,4,4,France,Europe,Northern Hemisphere,0
3,JLN3497,84,Male,383,163/100,73,1,1,1,0,...,7.648981,125640,36.464704,378,3,4,Canada,North America,Northern Hemisphere,0
4,GFO8847,66,Male,318,91/88,93,1,1,1,1,...,1.514821,160555,21.809144,231,1,5,Thailand,Asia,Northern Hemisphere,0


In [5]:
org_df.columns

Index(['Patient ID', 'Age', 'Sex', 'Cholesterol', 'Blood Pressure',
       'Heart Rate', 'Diabetes', 'Family History', 'Smoking', 'Obesity',
       'Alcohol Consumption', 'Exercise Hours Per Week', 'Diet',
       'Previous Heart Problems', 'Medication Use', 'Stress Level',
       'Sedentary Hours Per Day', 'Income', 'BMI', 'Triglycerides',
       'Physical Activity Days Per Week', 'Sleep Hours Per Day', 'Country',
       'Continent', 'Hemisphere', 'Heart Attack Risk'],
      dtype='object')

### Train, Validation, Test Split

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X = org_df.drop("Heart Attack Risk", axis=1)
y = org_df["Heart Attack Risk"]

In [8]:
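# Note: both splits stratify on Sex, not on the target, so the class ratio of y is not guaranteed to match across splits.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=X.Sex, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, stratify=X_train.Sex, random_state=42)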

In [9]:
X_train.shape, X_val.shape, X_test.shape

((6308, 25), (1578, 25), (877, 25))
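The two calls above stratify on `Sex`, which balances the sexes across splits but not the label. If the aim is the same positive rate in train, validation, and test, one option is to stratify on the target instead. The sketch below runs on a small synthetic frame (`demo_df` and the `*_demo` names are made up for illustration), not on the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 600 negatives and 300 positives (illustrative only).
rng = np.random.default_rng(42)
demo_df = pd.DataFrame(
    {
        "Age": rng.integers(18, 91, size=900),
        "Sex": rng.choice(["Male", "Female"], size=900),
        "Heart Attack Risk": np.repeat([0, 1], [600, 300]),
    }
)
X_demo = demo_df.drop("Heart Attack Risk", axis=1)
y_demo = demo_df["Heart Attack Risk"]

# Stratify on the label so every split keeps (approximately) the same positive rate.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tr, y_tr, test_size=0.2, stratify=y_tr, random_state=42
)

# Positive rate per split; all three should be close to 1/3.
print(y_tr.mean(), y_va.mean(), y_te.mean())
```

Switching the notebook itself to `stratify=y` would change which rows land in each split, so the outputs shown in later cells would no longer match.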

## Data Cleaning and Pre-processing

In [10]:
cols_to_drop = ["Patient ID", "Blood Pressure", "Country", "Continent", "Hemisphere", "Income"]

### Processing Blood Pressure column

In [11]:
source_df = org_df.copy()

bp_split = source_df["Blood Pressure"].str.split("/", expand=True).astype(int)
bp_split.columns = ["Systolic", "Diastolic"]
source_df.drop(cols_to_drop, axis=1, inplace=True)
source_df = pd.concat([source_df, bp_split], axis=1)

### Processing Categorical columns

In [12]:
source_df.Sex = source_df.Sex.map({"Male": 1, "Female": 0})
source_df.head()

Unnamed: 0,Age,Sex,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,...,Medication Use,Stress Level,Sedentary Hours Per Day,BMI,Triglycerides,Physical Activity Days Per Week,Sleep Hours Per Day,Heart Attack Risk,Systolic,Diastolic
0,67,1,208,72,0,0,1,0,0,4.168189,...,0,9,6.615001,31.251233,286,0,6,0,158,88
1,21,1,389,98,1,1,1,1,1,1.813242,...,0,1,4.963459,27.194973,235,1,7,0,165,93
2,21,0,324,72,1,0,0,0,0,2.078353,...,1,9,9.463426,28.176571,587,4,4,0,174,99
3,84,1,383,73,1,1,1,0,1,9.82813,...,0,9,7.648981,36.464704,378,3,4,0,163,100
4,66,1,318,93,1,1,1,1,0,5.804299,...,0,6,1.514821,21.809144,231,1,5,0,91,88


In [13]:
# The integer codes are labels only: the order Average, Healthy, Unhealthy is not a ranking.
source_df.Diet = source_df.Diet.map({"Average": 0, "Healthy": 1, "Unhealthy": 2})
source_df.Diet[:5]

0    0
1    2
2    1
3    0
4    2
Name: Diet, dtype: int64

### Correlation Matrix calculation

In [14]:
corr_matrix = source_df.corr()
corr_matrix["Heart Attack Risk"].sort_values(ascending=False)

Heart Attack Risk                  1.000000
Cholesterol                        0.019340
Systolic                           0.018585
Diabetes                           0.017225
Exercise Hours Per Week            0.011133
Triglycerides                      0.010471
Age                                0.006403
Diet                               0.004540
Sex                                0.003095
Medication Use                     0.002234
Previous Heart Problems            0.000274
BMI                                0.000020
Family History                    -0.001652
Smoking                           -0.004051
Stress Level                      -0.004111
Heart Rate                        -0.004251
Physical Activity Days Per Week   -0.005014
Sedentary Hours Per Day           -0.005613
Diastolic                         -0.007509
Obesity                           -0.013318
Alcohol Consumption               -0.013778
Sleep Hours Per Day               -0.018528
Name: Heart Attack Risk, dtype: float64
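Every correlation with the target above is below 0.02 in absolute value, so the ranking separates the features only weakly. If a rule-based selection is wanted, thresholding the absolute correlation is one possibility. The sketch below copies the values printed above; the cutoff `0.005` is an arbitrary assumption, not a value used in the notebook.

```python
import pandas as pd

# Correlations with "Heart Attack Risk", copied from the output above.
target_corr = pd.Series(
    {
        "Cholesterol": 0.019340,
        "Systolic": 0.018585,
        "Diabetes": 0.017225,
        "Exercise Hours Per Week": 0.011133,
        "Triglycerides": 0.010471,
        "Age": 0.006403,
        "Diet": 0.004540,
        "Sex": 0.003095,
        "Medication Use": 0.002234,
        "Previous Heart Problems": 0.000274,
        "BMI": 0.000020,
        "Family History": -0.001652,
        "Smoking": -0.004051,
        "Stress Level": -0.004111,
        "Heart Rate": -0.004251,
        "Physical Activity Days Per Week": -0.005014,
        "Sedentary Hours Per Day": -0.005613,
        "Diastolic": -0.007509,
        "Obesity": -0.013318,
        "Alcohol Consumption": -0.013778,
        "Sleep Hours Per Day": -0.018528,
    }
)

cutoff = 0.005  # hypothetical threshold on |correlation|
weak_cols = target_corr[target_corr.abs() < cutoff].index.tolist()
print(weak_cols)
```

With this cutoff the rule flags nine columns, from `Diet` through `Heart Rate` in the listing, and keeps `Physical Activity Days Per Week` because its absolute value is just above 0.005.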

### Column selection

In [15]:
drop_less_corr_cols = ["Diet", "Medication Use", "BMI", "Physical Activity Days Per Week"]
source_df.drop(drop_less_corr_cols, axis=1, inplace=True)
source_df.head()

Unnamed: 0,Age,Sex,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,Previous Heart Problems,Stress Level,Sedentary Hours Per Day,Triglycerides,Sleep Hours Per Day,Heart Attack Risk,Systolic,Diastolic
0,67,1,208,72,0,0,1,0,0,4.168189,0,9,6.615001,286,6,0,158,88
1,21,1,389,98,1,1,1,1,1,1.813242,1,1,4.963459,235,7,0,165,93
2,21,0,324,72,1,0,0,0,0,2.078353,1,9,9.463426,587,4,0,174,99
3,84,1,383,73,1,1,1,0,1,9.82813,1,9,7.648981,378,4,0,163,100
4,66,1,318,93,1,1,1,1,0,5.804299,1,6,1.514821,231,5,0,91,88


In [16]:
len(source_df.columns)

18

In [17]:
source_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      8763 non-null   int64  
 1   Sex                      8763 non-null   int64  
 2   Cholesterol              8763 non-null   int64  
 3   Heart Rate               8763 non-null   int64  
 4   Diabetes                 8763 non-null   int64  
 5   Family History           8763 non-null   int64  
 6   Smoking                  8763 non-null   int64  
 7   Obesity                  8763 non-null   int64  
 8   Alcohol Consumption      8763 non-null   int64  
 9   Exercise Hours Per Week  8763 non-null   float64
 10  Previous Heart Problems  8763 non-null   int64  
 11  Stress Level             8763 non-null   int64  
 12  Sedentary Hours Per Day  8763 non-null   float64
 13  Triglycerides            8763 non-null   int64  
 14  Sleep Hours Per Day     

In [18]:
source_df["Sleep Hours Per Day"].min(), source_df["Sleep Hours Per Day"].max()

(np.int64(4), np.int64(10))

In [19]:
source_df["Stress Level"].min(), source_df["Stress Level"].max()

(np.int64(1), np.int64(10))

In [20]:
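def cols_preprocessor(source_df: pd.DataFrame, drop_cols: list) -> pd.DataFrame:
    """Split Blood Pressure into Systolic/Diastolic, drop `drop_cols`, and encode Sex."""
    bp_split = source_df["Blood Pressure"].str.split("/", expand=True).astype(int)
    bp_split.columns = ["Systolic", "Diastolic"]
    # Return a new frame instead of dropping in place, so the caller's data is untouched.
    source_df = source_df.drop(columns=drop_cols)
    source_df = pd.concat([source_df, bp_split], axis=1)
    source_df["Sex"] = source_df["Sex"].map({"Male": 1, "Female": 0})
    return source_df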

In [21]:
cols_to_drop.extend(drop_less_corr_cols)
cols_to_drop

['Patient ID',
 'Blood Pressure',
 'Country',
 'Continent',
 'Hemisphere',
 'Income',
 'Diet',
 'Medication Use',
 'BMI',
 'Physical Activity Days Per Week']

In [22]:
X_train = cols_preprocessor(X_train, cols_to_drop)
X_val = cols_preprocessor(X_val, cols_to_drop)

In [23]:
X_train.columns

Index(['Age', 'Sex', 'Cholesterol', 'Heart Rate', 'Diabetes', 'Family History',
       'Smoking', 'Obesity', 'Alcohol Consumption', 'Exercise Hours Per Week',
       'Previous Heart Problems', 'Stress Level', 'Sedentary Hours Per Day',
       'Triglycerides', 'Sleep Hours Per Day', 'Systolic', 'Diastolic'],
      dtype='object')

### Data Preprocessor

In [24]:
# Stress Level is an ordinal variable: a categorical variable whose possible values are ordered.
categorical_cols = ["Stress Level"]

In [25]:
X_train.head()

Unnamed: 0,Age,Sex,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,Previous Heart Problems,Stress Level,Sedentary Hours Per Day,Triglycerides,Sleep Hours Per Day,Systolic,Diastolic
2572,33,1,166,87,1,1,1,1,1,19.284174,0,9,11.344357,684,4,135,94
2628,46,1,124,49,1,1,1,1,0,14.390176,1,6,2.554072,517,10,95,78
7661,76,1,191,50,0,0,1,1,1,19.161113,0,5,8.223748,275,4,134,95
4373,36,1,228,92,0,0,1,0,1,2.430102,0,8,5.391507,453,6,150,93
3440,52,0,352,54,1,1,1,1,0,1.345332,0,5,8.421591,596,8,99,62


In [26]:
X_train["Alcohol Consumption"].min(), X_train["Alcohol Consumption"].max()

(np.int64(0), np.int64(1))

In [27]:
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
X_train[categorical_cols] = oe.fit_transform(X_train[categorical_cols])
X_train.head()

Unnamed: 0,Age,Sex,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,Previous Heart Problems,Stress Level,Sedentary Hours Per Day,Triglycerides,Sleep Hours Per Day,Systolic,Diastolic
2572,33,1,166,87,1,1,1,1,1,19.284174,0,8.0,11.344357,684,4,135,94
2628,46,1,124,49,1,1,1,1,0,14.390176,1,5.0,2.554072,517,10,95,78
7661,76,1,191,50,0,0,1,1,1,19.161113,0,4.0,8.223748,275,4,134,95
4373,36,1,228,92,0,0,1,0,1,2.430102,0,7.0,5.391507,453,6,150,93
3440,52,0,352,54,1,1,1,1,0,1.345332,0,4.0,8.421591,596,8,99,62


In [28]:
oe.categories_

[array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])]

In [29]:
from sklearn.preprocessing import MinMaxScaler

# Every column except the ordinal-encoded ones gets min-max scaled.
all_cols = [col for col in X_train.columns if col not in categorical_cols]
all_cols

['Age',
 'Sex',
 'Cholesterol',
 'Heart Rate',
 'Diabetes',
 'Family History',
 'Smoking',
 'Obesity',
 'Alcohol Consumption',
 'Exercise Hours Per Week',
 'Previous Heart Problems',
 'Sedentary Hours Per Day',
 'Triglycerides',
 'Sleep Hours Per Day',
 'Systolic',
 'Diastolic']

In [30]:
mm = MinMaxScaler()
X_train[all_cols] = mm.fit_transform(X_train[all_cols])
X_train.head()

Unnamed: 0,Age,Sex,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,Previous Heart Problems,Stress Level,Sedentary Hours Per Day,Triglycerides,Sleep Hours Per Day,Systolic,Diastolic
2572,0.208333,1.0,0.164286,0.671429,1.0,1.0,1.0,1.0,1.0,0.964267,0.0,8.0,0.945411,0.849351,0.0,0.5,0.68
2628,0.388889,1.0,0.014286,0.128571,1.0,1.0,1.0,1.0,0.0,0.719521,1.0,5.0,0.212769,0.632468,1.0,0.055556,0.36
7661,0.805556,1.0,0.253571,0.142857,0.0,0.0,1.0,1.0,1.0,0.958112,0.0,4.0,0.685318,0.318182,0.0,0.488889,0.7
4373,0.25,1.0,0.385714,0.742857,0.0,0.0,1.0,0.0,1.0,0.121406,0.0,7.0,0.44926,0.549351,0.333333,0.666667,0.66
3440,0.472222,0.0,0.828571,0.2,1.0,1.0,1.0,1.0,0.0,0.067157,0.0,4.0,0.701808,0.735065,0.666667,0.1,0.04


In [31]:
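from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder


def data_preprocessor(df: pd.DataFrame, categorical_cols: list) -> pd.DataFrame:
    """Ordinal-encode `categorical_cols` and min-max scale every other column.

    Caveat: a new encoder and scaler are fitted on whatever frame is passed in,
    so validation/test data are scaled by their own min/max, not the training set's.
    """
    df = df.copy()  # avoid mutating the caller's frame
    numeric_cols = [col for col in df.columns if col not in categorical_cols]

    oe = OrdinalEncoder()
    df[categorical_cols] = oe.fit_transform(df[categorical_cols])

    mm = MinMaxScaler()
    df[numeric_cols] = mm.fit_transform(df[numeric_cols])
    return df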

In [32]:
X_val = data_preprocessor(X_val, categorical_cols)
X_val.head()

Unnamed: 0,Age,Sex,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,Previous Heart Problems,Stress Level,Sedentary Hours Per Day,Triglycerides,Sleep Hours Per Day,Systolic,Diastolic
6308,0.722222,1.0,0.492857,0.871429,0.0,0.0,1.0,0.0,0.0,0.748469,1.0,2.0,0.758366,0.236671,0.166667,0.911111,0.92
5961,0.027778,0.0,0.239286,0.571429,1.0,1.0,0.0,1.0,0.0,0.639408,0.0,8.0,0.203697,0.782835,0.0,0.411111,0.64
4995,0.236111,1.0,0.607143,0.628571,1.0,0.0,1.0,1.0,0.0,0.453316,0.0,0.0,0.053213,0.657997,0.5,0.4,0.52
7622,0.888889,1.0,0.939286,0.842857,1.0,0.0,1.0,0.0,1.0,0.665121,1.0,1.0,0.197962,0.997399,0.5,0.333333,0.72
4048,0.597222,1.0,0.485714,0.042857,1.0,1.0,1.0,0.0,1.0,0.195197,1.0,5.0,0.45201,0.3342,0.166667,0.233333,0.0
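`data_preprocessor` fits a new encoder and scaler on each frame it receives, so the validation set above is scaled with its own minimum and maximum instead of the training set's. A common alternative is to fit on the training split once and only call `transform` on the other splits. A minimal sketch on toy data follows; `fit_preprocessors` and `apply_preprocessors` are hypothetical helper names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder


def fit_preprocessors(train_df: pd.DataFrame, categorical_cols: list):
    """Fit the encoder and scaler on the training frame only."""
    numeric_cols = [c for c in train_df.columns if c not in categorical_cols]
    encoder = OrdinalEncoder().fit(train_df[categorical_cols])
    scaler = MinMaxScaler().fit(train_df[numeric_cols])
    return encoder, scaler, numeric_cols


def apply_preprocessors(df, encoder, scaler, categorical_cols, numeric_cols):
    """Transform any split with the already-fitted encoder and scaler."""
    out = df.copy()
    out[categorical_cols] = encoder.transform(out[categorical_cols])
    out[numeric_cols] = scaler.transform(out[numeric_cols])
    return out


# Toy frames (illustrative values only).
toy_train = pd.DataFrame({"Age": [20, 40, 60], "Stress Level": [1, 5, 10]})
toy_val = pd.DataFrame({"Age": [30, 50], "Stress Level": [5, 10]})
cat_cols = ["Stress Level"]

encoder, scaler, num_cols = fit_preprocessors(toy_train, cat_cols)
val_out = apply_preprocessors(toy_val, encoder, scaler, cat_cols, num_cols)
print(val_out)
```

Here the validation ages 30 and 50 map to 0.25 and 0.75 on the training range of 20 to 60. Refitting on the validation frame would map them to 0 and 1 instead, so the same raw value would mean different things in different splits.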


### Initial model training

In [33]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV

random_search = {
    "criterion": ["entropy", "gini"],
    "max_depth": [2, 3, 4, 5, 6, 7, 10],
    "min_samples_leaf": [4, 6, 8, 10],
    "min_samples_split": [5, 7, 10, 15],
    "n_estimators": [300, 350, 400, 450, 500],
}
rf = RandomForestClassifier()

rf_random = RandomizedSearchCV(rf, random_search, n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=-1)
rf_random.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END criterion=gini, max_depth=5, min_samples_leaf=10, min_samples_split=5, n_estimators=300; total time=   1.2s
[CV] END criterion=gini, max_depth=5, min_samples_leaf=10, min_samples_split=5, n_estimators=300; total time=   1.2s
[CV] END criterion=gini, max_depth=5, min_samples_leaf=10, min_samples_split=5, n_estimators=300; total time=   1.3s
[CV] END criterion=entropy, max_depth=3, min_samples_leaf=8, min_samples_split=5, n_estimators=350; total time=   1.2s
[CV] END criterion=gini, max_depth=5, min_samples_leaf=10, min_samples_split=5, n_estimators=300; total time=   1.3s
[CV] END criterion=entropy, max_depth=3, min_samples_leaf=8, min_samples_split=5, n_estimators=350; total time=   1.2s
[CV] END criterion=entropy, max_depth=3, min_samples_leaf=8, min_samples_split=5, n_estimators=350; total time=   1.3s
[CV] END criterion=gini, max_depth=5, min_samples_leaf=10, min_samples_split=5, n_estimators=300; total time=   1.

In [34]:
best_params = rf_random.best_params_

print("Best Hyperparameters:")
print(best_params)

Best Hyperparameters:
{'n_estimators': 500, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_depth': 10, 'criterion': 'gini'}


In [None]:
# save model

import pickle

filename = "rf_model.pkl"

with open(filename, "wb") as f:
    pickle.dump(rf_random, f)

In [None]:
# loading back pickle file
with open(filename, "rb") as f:
    rf_load = pickle.load(f)

In [37]:
X_test_processed = cols_preprocessor(X_test, cols_to_drop)
X_test_processed = data_preprocessor(X_test_processed, categorical_cols)
X_test_processed.head()

Unnamed: 0,Age,Sex,Cholesterol,Heart Rate,Diabetes,Family History,Smoking,Obesity,Alcohol Consumption,Exercise Hours Per Week,Previous Heart Problems,Stress Level,Sedentary Hours Per Day,Triglycerides,Sleep Hours Per Day,Systolic,Diastolic
5752,0.291667,1.0,0.825,0.942857,1.0,0.0,1.0,0.0,0.0,0.945005,0.0,8.0,0.512715,0.361039,0.166667,0.033333,0.06
2450,0.555556,1.0,0.546429,0.357143,1.0,1.0,1.0,1.0,1.0,0.189602,1.0,2.0,0.117367,0.637662,0.333333,0.422222,0.38
7264,1.0,1.0,0.632143,0.414286,1.0,1.0,1.0,1.0,0.0,0.539586,1.0,3.0,0.630089,0.571429,0.333333,0.366667,0.32
5826,0.930556,0.0,0.307143,0.357143,1.0,0.0,1.0,1.0,1.0,0.149615,0.0,8.0,0.696236,0.771429,0.166667,0.966667,0.5
7064,0.569444,1.0,0.889286,0.457143,0.0,1.0,1.0,1.0,1.0,0.617684,0.0,4.0,0.007679,0.309091,0.0,0.055556,0.72


In [38]:
y_pred = rf_load.predict(X_test_processed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.67      1.00      0.80       584
           1       0.00      0.00      0.00       293

    accuracy                           0.67       877
   macro avg       0.33      0.50      0.40       877
weighted avg       0.44      0.67      0.53       877
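The report shows recall 1.00 for class 0 and 0.00 for class 1, which (up to rounding) means every test row was predicted as class 0. The 0.67 accuracy is then just the share of the majority class. The sketch below reproduces those numbers from the class supports in the report (584 and 293); it does not use the fitted model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels rebuilt from the supports in the report: 584 negatives, 293 positives.
y_true = np.array([0] * 584 + [1] * 293)

# A "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

cm = confusion_matrix(y_true, y_majority)
acc = accuracy_score(y_true, y_majority)
print(cm)             # rows: true class, columns: predicted class
print(round(acc, 2))  # 0.67
```

If a trained model does not beat this baseline, options worth trying include `class_weight="balanced"` on `RandomForestClassifier`, or passing a `scoring` other than the default accuracy to `RandomizedSearchCV`. Whether either helps on this dataset is untested here.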



In [45]:
from scipy import stats
from xgboost import XGBClassifier

# Core parameter distributions
param_dist = {
    "n_estimators": stats.randint(200, 1000),  # integers in [200, 1000)
    "learning_rate": stats.uniform(0.01, 0.3),  # uniform(loc, scale): samples from [0.01, 0.31]
    "max_depth": stats.randint(3, 10),  # integers in [3, 10)
    "subsample": stats.uniform(0.5, 0.5),  # 0.5-1.0
    "colsample_bytree": stats.uniform(0.6, 0.4),  # 0.6-1.0
    "gamma": stats.uniform(0, 0.5),  # 0.0-0.5
    "reg_alpha": stats.expon(0, 50),  # L1 regularization; exponential with mean 50
    "reg_lambda": stats.expon(0, 50),  # L2 regularization; exponential with mean 50
    "min_child_weight": stats.randint(1, 10),  # integers in [1, 10)
    "scale_pos_weight": [1, 5, 10],  # For class imbalance
}

# Recommended search configuration
xgb_random = RandomizedSearchCV(
    estimator=XGBClassifier(
        objective="binary:logistic",
        eval_metric="aucpr",  # Good for imbalanced data
        n_jobs=-1,
        early_stopping_rounds=20,  # Needs the eval_set passed to fit()
    ),
    param_distributions=param_dist,
    n_iter=50,  # 50-100 iterations recommended
    cv=3,  # 3-5 folds for speed
    scoring="roc_auc",
    verbose=2,
    refit=True,
    random_state=42,
)
xgb_random.fit(X_train, y_train, eval_set=[(X_val, y_val)])

Fitting 3 folds for each of 50 candidates, totalling 150 fits
[0]	validation_0-aucpr:0.37371
[1]	validation_0-aucpr:0.37369
[2]	validation_0-aucpr:0.38320
[3]	validation_0-aucpr:0.38152
[4]	validation_0-aucpr:0.39984
[5]	validation_0-aucpr:0.40066
[6]	validation_0-aucpr:0.39379
[7]	validation_0-aucpr:0.38212
[8]	validation_0-aucpr:0.38409
[9]	validation_0-aucpr:0.38540
[10]	validation_0-aucpr:0.38279
[11]	validation_0-aucpr:0.38310
[12]	validation_0-aucpr:0.37859
[13]	validation_0-aucpr:0.37519
[14]	validation_0-aucpr:0.37365
[15]	validation_0-aucpr:0.36950
[16]	validation_0-aucpr:0.36938
[17]	validation_0-aucpr:0.36546
[18]	validation_0-aucpr:0.36701
[19]	validation_0-aucpr:0.36701
[20]	validation_0-aucpr:0.36660
[21]	validation_0-aucpr:0.36641
[22]	validation_0-aucpr:0.36641
[23]	validation_0-aucpr:0.36641
[24]	validation_0-aucpr:0.36455
[CV] END colsample_bytree=0.749816047538945, gamma=0.4753571532049581, learning_rate=0.22959818254342154, max_depth=7, min_child_weight=5, n_estimat

In [46]:
xgb_random.best_estimator_

In [47]:
y_pred = xgb_random.predict(X_test_processed)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       584
           1       0.33      1.00      0.50       293

    accuracy                           0.33       877
   macro avg       0.17      0.50      0.25       877
weighted avg       0.11      0.33      0.17       877
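Here the outcome is reversed: recall is 1.00 for class 1 and 0.00 for class 0, so (up to rounding) every test row was predicted positive. One possible contributor is that the search space offers `scale_pos_weight` values of 5 and 10, well above the roughly 2:1 class ratio. The XGBoost parameter documentation suggests the ratio of negative to positive counts as a typical value to consider. A sketch of that calculation follows, using the test supports from the report as stand-in labels (in practice it would be computed on `y_train`); `neg_pos_ratio` is a hypothetical helper.

```python
import pandas as pd


def neg_pos_ratio(y: pd.Series) -> float:
    """Return (# negatives) / (# positives) for a 0/1 label series."""
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples in y")
    return n_neg / n_pos


# Stand-in labels built from the supports in the report above.
y_stand_in = pd.Series([0] * 584 + [1] * 293)
ratio = neg_pos_ratio(y_stand_in)
print(round(ratio, 2))  # 1.99
```

The result could replace the fixed list, for example `"scale_pos_weight": [1, ratio]`. That is an assumption to test, not a confirmed fix for the report above.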

