# **Heart Disease Prediction**  

## **Baseline Model**  
This notebook focuses on creating a **baseline model** using a heuristic approach to serve as a reference for comparing future machine learning models. The goal is to establish a simple yet informative starting point that helps evaluate improvements in predictive performance.  

All steps will be implemented using **scikit-learn** and other relevant libraries to ensure a structured and reproducible workflow.  

`Simón Correa Marín`


The model is based on the following rules, derived from the EDA and other analyses:

1. **'Asymptomatic'** chest pain has a 73.42% influence on a positive heart disease prediction.
2. If the patient is a **man**, he is more likely to have heart disease.
3. If **fbs > 120 mg/dL (True)**, the patient is more likely not to have heart disease.
4. If the patient's **age is between 50 and 60 years**, they are more likely to have heart disease.
5. Heart disease is more likely in people with a **max_hr (maximum heart rate) between 120 and 160**.
6. If **thal is 'reversable'**, the patient is more likely to have heart disease.
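
Rules of this kind are typically read off group-wise rates. The sketch below shows one plausible way such a figure could be obtained: the share of positive cases per chest pain category. The `toy` DataFrame and its values are invented for illustration and are not the project's data.

```python
import pandas as pd

# Toy data, invented for illustration only (not the heart disease dataset)
toy = pd.DataFrame(
    {
        "chest_pain": [
            "asymptomatic",
            "asymptomatic",
            "asymptomatic",
            "typical",
            "typical",
            "nonanginal",
        ],
        "disease": [1, 1, 0, 0, 0, 1],
    }
)

# Share of positive cases within each chest pain category
disease_rate = toy.groupby("chest_pain")["disease"].mean()
print(disease_rate)
```

A category whose rate is far above the overall rate is a natural candidate for a "high risk" rule.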

### **1. Import Libraries and Configurations**

In [83]:
# base libraries for data science
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import (
    KFold,
    ShuffleSplit,
    cross_val_score,
    learning_curve,
    train_test_split,
)
from sklearn.pipeline import Pipeline

### **2. Load Data**

In [84]:
DATA_DIR = Path.cwd().resolve().parents[0] / "data"

hd_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/hd_type_fixed.parquet", engine="pyarrow"
)

In [85]:
# print library version for reproducibility

print("Pandas version: ", pd.__version__)

Pandas version:  2.2.3


### **3. Data Preparation**

Based on the rules above, the heuristic model needs only 6 feature columns: `chest_pain`, `sex`, `fbs`, `age`, `max_hr`, and `thal`. The target column `disease` is kept alongside them.

In [86]:
selected_features = ['chest_pain', 'sex', 'fbs', 'age', 'max_hr', 'thal', 'disease']
hd_features = hd_df[selected_features]
hd_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6848 entries, 0 to 6847
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   chest_pain  6648 non-null   category
 1   sex         6692 non-null   category
 2   fbs         6848 non-null   bool    
 3   age         6763 non-null   float64 
 4   max_hr      6453 non-null   float64 
 5   thal        6552 non-null   category
 6   disease     6848 non-null   bool    
dtypes: bool(2), category(3), float64(2)
memory usage: 141.0 KB


In [87]:
hd_features.isna().sum()

chest_pain    200
sex           156
fbs             0
age            85
max_hr        395
thal          296
disease         0
dtype: int64

In [88]:
# Change target data type to int (0, 1).
# assign() returns a new DataFrame, which avoids the warning raised when
# writing into a slice of hd_df with .loc
hd_features = hd_features.assign(disease=hd_features["disease"].astype(int))


#### **Duplicated Data**

In [89]:
len(hd_features.drop_duplicates())

474

In [90]:
hd_features = hd_features.drop_duplicates()
hd_features.shape

(474, 7)

For comparison, the same duplicate check on the whole dataset gave 508. From here on I'll use the selected-feature dataset with the duplicated rows dropped (474 rows).
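
Selecting fewer columns can only make more rows identical, which is one possible reason the count for the selected features is lower than for the whole dataset. A minimal sketch with invented data (the `chol` column is a hypothetical extra column, used only to make the point):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "age": [50, 50, 61, 61],
        "sex": ["Male", "Male", "Female", "Female"],
        "chol": [200, 210, 180, 180],  # hypothetical extra column
    }
)

# Unique rows when every column is considered
n_unique_all = len(toy.drop_duplicates())

# Unique rows when only a subset of the columns is kept
n_unique_subset = len(toy[["age", "sex"]].drop_duplicates())

print(n_unique_all, n_unique_subset)
```

Rows that differ only in a dropped column collapse into a single row once that column is removed.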

In [91]:
hd_features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 474 entries, 0 to 2423
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   chest_pain  417 non-null    category
 1   sex         432 non-null    category
 2   fbs         474 non-null    bool    
 3   age         457 non-null    float64 
 4   max_hr      363 non-null    float64 
 5   thal        409 non-null    category
 6   disease     474 non-null    int64   
dtypes: bool(1), category(3), float64(2), int64(1)
memory usage: 17.1 KB


In [92]:
hd_features.sample(10)

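|      | chest_pain   | sex    | fbs   | age  | max_hr | thal       | disease |
|------|--------------|--------|-------|------|--------|------------|---------|
| 1    | asymptomatic | Female | False | 57.0 | 163.0  | normal     | 0       |
| 1864 | NaN          | NaN    | True  | 43.0 | NaN    | reversable | 0       |
| 682  | NaN          | NaN    | True  | 50.0 | NaN    | normal     | 0       |
| 131  | nonanginal   | Male   | True  | 53.0 | 173.0  | normal     | 0       |
| 65   | asymptomatic | Male   | False | 39.0 | 140.0  | reversable | 1       |
| 430  | nonanginal   | Male   | False | 44.0 | 179.0  | normal     | 0       |
| 46   | nontypical   | Female | False | 45.0 | 175.0  | normal     | 0       |
| 157  | nontypical   | Male   | False | 56.0 | 169.0  | normal     | 0       |
| 591  | nonanginal   | Female | False | 44.0 | 149.0  | normal     | 0       |
| 2212 | nonanginal   | Female | True  | 60.0 | NaN    | NaN        | 1       |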


### **4. Feature Engineering**

In [93]:
nom_categorical_cols = ["chest_pain", "thal", "sex"]
disc_numerical_cols = ["age", "max_hr"]
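# fbs has no NaNs, so it needs no imputation. Note that it is not listed in the
# ColumnTransformer below, so it is dropped from the transformed output
# (the default is remainder="drop").
boolean_cols = ["fbs"]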

#### **Pipeline**

In [94]:
from sklearn.preprocessing import FunctionTransformer

numeric_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
    ]
)

nom_categorical_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipe, disc_numerical_cols),
        ("nominal_categoric", nom_categorical_pipe, nom_categorical_cols),
    ]
)

In [95]:
preprocessor
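
A `ColumnTransformer` discards every column that is not listed in `transformers` unless `remainder="passthrough"` is set. The sketch below illustrates the difference on invented data; whether `fbs` should be passed through in this notebook is a design decision left open here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "age": [55.0, np.nan, 62.0],
        "fbs": [True, False, False],
    }
)

# Default remainder="drop": columns not listed in `transformers` are discarded
drop_ct = ColumnTransformer(
    [("numeric", SimpleImputer(strategy="median"), ["age"])]
)
dropped = drop_ct.fit_transform(toy)
print(dropped.shape)  # (3, 1)

# remainder="passthrough" appends the untouched columns after the transformed ones
keep_ct = ColumnTransformer(
    [("numeric", SimpleImputer(strategy="median"), ["age"])],
    remainder="passthrough",
)
kept = keep_ct.fit_transform(toy)
print(kept.shape)  # (3, 2)
```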

**Data preprocessing example**

In [96]:
data_example = hd_features.drop(columns="disease").sample(10, random_state=42)
data_example

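|      | chest_pain   | sex    | fbs   | age  | max_hr | thal       |
|------|--------------|--------|-------|------|--------|------------|
| 1899 | asymptomatic | Male   | True  | 55.0 | NaN    | reversable |
| 2042 | nonanginal   | Female | False | 76.0 | NaN    | NaN        |
| 10   | asymptomatic | Female | False | 62.0 | 163.0  | normal     |
| 2362 | nontypical   | Female | False | 45.0 | 138.0  | NaN        |
| 42   | nonanginal   | Male   | False | 44.0 | 169.0  | normal     |
| 36   | asymptomatic | Male   | False | 47.0 | 143.0  | normal     |
| 2294 | asymptomatic | Female | False | 71.0 | NaN    | NaN        |
| 532  | asymptomatic | Male   | False | 61.0 | 161.0  | reversable |
| 91   | asymptomatic | Male   | False | 44.0 | 177.0  | normal     |
| 59   | typical      | Female | True  | 58.0 | NaN    | NaN        |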


In [97]:
preprocessor.fit_transform(data_example)

array([[55.0, 162.0, 'asymptomatic', 'reversable', 'Male'],
       [76.0, 162.0, 'nonanginal', 'normal', 'Female'],
       [62.0, 163.0, 'asymptomatic', 'normal', 'Female'],
       [45.0, 138.0, 'nontypical', 'normal', 'Female'],
       [44.0, 169.0, 'nonanginal', 'normal', 'Male'],
       [47.0, 143.0, 'asymptomatic', 'normal', 'Male'],
       [71.0, 162.0, 'asymptomatic', 'normal', 'Female'],
       [61.0, 161.0, 'asymptomatic', 'reversable', 'Male'],
       [44.0, 177.0, 'asymptomatic', 'normal', 'Male'],
       [58.0, 162.0, 'typical', 'normal', 'Female']], dtype=object)
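
The value 162.0 that fills the missing `max_hr` entries is the median of the six observed values in this sample. The sketch below reproduces it in isolation, using the `max_hr` column of the ten sampled rows shown above; it is a standalone check, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# max_hr values of the ten sampled rows shown above (NaN = missing)
max_hr = np.array(
    [np.nan, np.nan, 163.0, 138.0, 169.0, 143.0, np.nan, 161.0, 177.0, np.nan]
).reshape(-1, 1)

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(max_hr)

# Median learned from the observed values: (161 + 163) / 2
print(imputer.statistics_)
print(filled.ravel())
```

Because the imputer here is fitted on only ten rows, the fill value differs from the one learned later on the training set.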

### **5. Train/Test split**

In [98]:
X_features = hd_features.drop("disease", axis="columns")
Y_target = hd_features["disease"]

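# 80% train, 20% test; stratify keeps the class balance in both splits and
# random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    X_features, Y_target, test_size=0.2, stratify=Y_target, random_state=42
)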

In [99]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((379, 6), (379,), (95, 6), (95,))
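
The effect of `stratify` can be checked by comparing class proportions in the two splits. A minimal sketch on an invented target with a 70/30 class balance (not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with a 70/30 class balance, invented for illustration
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratification, both splits keep the 30% share of positives
print(y_tr.mean(), y_te.mean())
```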

**Data preprocessing**

In [100]:
# Note: fit() returns the fitted preprocessor itself, not transformed data
transformed_data = preprocessor.fit(x_train)

In [101]:
feature_names = preprocessor.get_feature_names_out()

# Transform x_train and wrap the resulting array in a DataFrame with the fitted feature names
x_train_transformed = preprocessor.transform(x_train)
x_train_transformed = pd.DataFrame(x_train_transformed, columns=feature_names)
x_train_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 379 entries, 0 to 378
Data columns (total 5 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   numeric__age                   379 non-null    object
 1   numeric__max_hr                379 non-null    object
 2   nominal_categoric__chest_pain  379 non-null    object
 3   nominal_categoric__thal        379 non-null    object
 4   nominal_categoric__sex         379 non-null    object
dtypes: object(5)
memory usage: 14.9+ KB


### **6. Model**

In [102]:
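class HeuristicModel(BaseEstimator, ClassifierMixin):
    """Rule-based baseline: predicts disease (1) as soon as any rule fires.

    Expects the raw column order of X_features:
    chest_pain, sex, fbs, age, max_hr, thal.
    The preprocessor above outputs a different order and drops fbs.
    """

    def fit(self, X, y=None):
        # Nothing is learned; only the class labels are stored for scikit-learn
        if y is not None:
            self.classes_ = np.unique(y)
        return self

    def predict(self, X) -> np.ndarray:
        CHEST_PAIN_THRESHOLD = "asymptomatic"  # High risk if chest pain is asymptomatic
        SEX_HIGH_RISK = "Male"  # Males are at higher risk
        FBS_THRESHOLD = True  # If fasting blood sugar > 120 mg/dL, lower risk
        AGE_LOW, AGE_HIGH = 50, 60  # High-risk age range
        MAX_HR_LOW, MAX_HR_HIGH = 120, 160  # High-risk max heart rate range
        THAL_HIGH_RISK = "reversable"  # Thal results indicating risk

        # Iterating a DataFrame yields column names, so convert to a 2D array
        # to loop over rows for both DataFrame and ndarray inputs
        rows = np.asarray(X, dtype=object)

        predictions = []
        for row in rows:
            # NaN fails the equality and range checks, so a missing value
            # never fires those rules
            if (
                (row[0] == CHEST_PAIN_THRESHOLD) or
                (row[1] == SEX_HIGH_RISK) or
                (row[2] != FBS_THRESHOLD) or
                (AGE_LOW <= row[3] <= AGE_HIGH) or
                (MAX_HR_LOW <= row[4] <= MAX_HR_HIGH) or
                (row[5] == THAL_HIGH_RISK)
            ):
                predictions.append(1)
            else:
                predictions.append(0)

        return np.array(predictions)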
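
The same OR-combined rules can also be written with pandas boolean masks, which selects columns by name instead of position. This is a sketch of an alternative formulation, not the notebook's model; `heuristic_predict` and the `toy` rows are hypothetical and invented for illustration.

```python
import numpy as np
import pandas as pd


def heuristic_predict(df: pd.DataFrame) -> np.ndarray:
    """Vectorized form of the six OR-combined rules (hypothetical helper)."""
    at_risk = (
        (df["chest_pain"] == "asymptomatic")
        | (df["sex"] == "Male")
        | ~df["fbs"].astype(bool)  # fbs False counts as higher risk
        | df["age"].between(50, 60)  # inclusive on both ends
        | df["max_hr"].between(120, 160)
        | (df["thal"] == "reversable")
    )
    return at_risk.astype(int).to_numpy()


# Two invented patients: the first triggers no rule, the second triggers all
toy = pd.DataFrame(
    {
        "chest_pain": ["typical", "asymptomatic"],
        "sex": ["Female", "Male"],
        "fbs": [True, False],
        "age": [40.0, 55.0],
        "max_hr": [175.0, 140.0],
        "thal": ["normal", "reversable"],
    }
)
print(heuristic_predict(toy))
```

Because the rules are joined with OR, a patient is predicted negative only when none of the six conditions holds.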
