# Phase 2 – Non-ML Baseline for Cardiovascular Disease

In this phase, I step away from machine learning and design a simple
rule-based function that predicts the likelihood of cardiovascular
disease using only hand-crafted rules (no training, no models).

The goal is to:
1. Reuse the cleaned dataset from Phase 1.
2. Convert `age` from days to years for easier reasoning.
3. Implement a non-ML `predict_non_ml_inputs` function that outputs a
   value between 0 and 1.
4. Apply this function to every row, then compute accuracy and a
   confusion matrix as a baseline for later neural network models.

In [None]:
import pandas as pd
import numpy as np

csv_path = "data/cardio_train 2.csv"

df = pd.read_csv(csv_path, sep=";")

print("Loaded data with shape:", df.shape)
print(df.head())
print(df.columns)

## Converting age from days to years

The dataset stores `age` in days. To make the rules more interpretable,
I convert age to years and look at its summary statistics.

In [None]:
# Convert age in days to years
if "age" in df.columns:
    df["age_years"] = df["age"] / 365.25
else:
    raise ValueError("Column 'age' not found in the dataset.")

print("Age in years summary:")
print(df["age_years"].describe())

## Designing the non-ML heuristic

The `predict_non_ml_inputs` function is a purely hand-crafted rule-based
heuristic. It does **not** learn from data and does not use any ML library.

The main ideas:
- Risk increases stepwise with age, at thresholds of 40, 50, and 60 years.
- Higher systolic/diastolic blood pressure increases risk.
- Higher cholesterol and glucose categories (2 and 3) add more risk.
- Smoking, alcohol, and low physical activity add small risk bumps.

The function sums these contributions and clamps the result to a risk
score between 0 and 1.
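
As a quick sanity check on the weights, the sketch below adds up the largest contribution from each rule group. The numbers are copied from the heuristic in this notebook; the names `max_contrib`, `raw_max`, and `clamped_max` are only illustrative. It suggests that the raw sum can go above 1, so the final clamp does real work for the highest-risk patients.

```python
# Largest possible contribution from each rule group in the heuristic.
# The dictionary keys are illustrative labels, not dataset column names.
max_contrib = {
    "age": 0.35,             # age_years > 60
    "blood_pressure": 0.30,  # ap_hi > 160 or ap_lo > 100
    "cholesterol": 0.25,     # cholesterol == 3
    "glucose": 0.20,         # gluc == 3
    "smoke": 0.05,           # smoke == 1
    "alcohol": 0.05,         # alco == 1
    "inactive": 0.10,        # active == 0
}

# Raw sum before clamping, and the value after clamping to [0, 1]
raw_max = sum(max_contrib.values())
clamped_max = max(0.0, min(raw_max, 1.0))

print(f"Largest raw score: {raw_max:.2f}")
print(f"After clamping:    {clamped_max:.2f}")
```

One consequence of this design is that every patient whose raw score is 1.0 or higher ends up with the same final risk of 1.0.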

In [None]:
def predict_non_ml_inputs(age_years, ap_hi, ap_lo, cholesterol, gluc, smoke, alco, active):
    """
    Hand-crafted heuristic:
    Returns a risk score between 0 and 1 based on manually chosen rules.
    """
    risk = 0.0

    # --- AGE CONTRIBUTION ---
    if age_years > 60:
        risk += 0.35
    elif age_years > 50:
        risk += 0.25
    elif age_years > 40:
        risk += 0.15

    # --- BLOOD PRESSURE CONTRIBUTION ---
    if ap_hi > 160 or ap_lo > 100:
        risk += 0.30
    elif ap_hi > 140 or ap_lo > 90:
        risk += 0.20
    elif ap_hi > 130:
        risk += 0.10

    # --- CHOLESTEROL CONTRIBUTION ---
    if cholesterol == 3:
        risk += 0.25
    elif cholesterol == 2:
        risk += 0.15

    # --- GLUCOSE CONTRIBUTION ---
    if gluc == 3:
        risk += 0.20
    elif gluc == 2:
        risk += 0.10

    # --- LIFESTYLE CONTRIBUTION ---
    if smoke == 1:
        risk += 0.05
    if alco == 1:
        risk += 0.05
    if active == 0:
        risk += 0.10

    # Clamp to [0, 1]: the largest contributions sum to 1.30, so the raw score can exceed 1
    risk = max(0.0, min(risk, 1.0))
    return risk


def predict_non_ml_row(row):
    """
    Wrapper that takes a pandas row and calls predict_non_ml_inputs
    with the correct columns.
    """
    return predict_non_ml_inputs(
        age_years=row["age_years"],
        ap_hi=row["ap_hi"],
        ap_lo=row["ap_lo"],
        cholesterol=row["cholesterol"],
        gluc=row["gluc"],
        smoke=row["smoke"],
        alco=row["alco"],
        active=row["active"],
    )

In [None]:
# Apply the non-ML predictor to each row
df["pred_prob_non_ml"] = df.apply(predict_non_ml_row, axis=1)
df["pred_label_non_ml"] = (df["pred_prob_non_ml"] >= 0.5).astype(int)

df[[
    "age_years", "ap_hi", "ap_lo",
    "cholesterol", "gluc", "smoke", "alco", "active",
    "cardio", "pred_prob_non_ml", "pred_label_non_ml"
]].head()

In [None]:
# Simple accuracy
accuracy = (df["pred_label_non_ml"] == df["cardio"]).mean()
print(f"Non-ML heuristic accuracy: {accuracy:.4f}")

# Majority-class baseline
majority_class = df["cardio"].mode()[0]
baseline_acc = (df["cardio"] == majority_class).mean()
print(f"Majority-class baseline accuracy (always predict {majority_class}): {baseline_acc:.4f}")

def confusion_matrix_simple(y_true, y_pred):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion_matrix_simple(df["cardio"], df["pred_label_non_ml"])
print("Confusion matrix (TP, TN, FP, FN):", tp, tn, fp, fn)

## Reflection on the non-ML baseline

This non-ML heuristic reaches about 63.6% accuracy on the full dataset,
which is better than the majority baseline (around 50%).
This means the hand-crafted rules are capturing some real patterns in
the data, especially for identifying healthy patients.

However, the confusion matrix shows a large number of false negatives
(patients with cardiovascular disease that the heuristic predicts as
healthy).
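
To put a number on the false-negative problem, the four confusion-matrix counts can be turned into rates. The sketch below is one possible helper; `rates_from_counts` and `safe_div` are hypothetical names, and the counts in the example call are made up for illustration, not results from this dataset.

```python
def rates_from_counts(tp, tn, fp, fn):
    """Derive common rates from confusion-matrix counts."""

    def safe_div(num, den):
        # Return NaN instead of raising when a denominator is zero
        return num / den if den else float("nan")

    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "recall": safe_div(tp, tp + fn),           # share of sick patients caught
        "specificity": safe_div(tn, tn + fp),      # share of healthy patients cleared
        "precision": safe_div(tp, tp + fp),        # share of positive calls that are right
        "false_negative_rate": safe_div(fn, tp + fn),
    }


# Hypothetical counts, for illustration only
example = rates_from_counts(tp=40, tn=50, fp=10, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

In this notebook, the same helper could be called with the `tp, tn, fp, fn` values returned by `confusion_matrix_simple`. A low recall next to a high specificity would match the pattern described above: the rules clear healthy patients well but miss many patients with disease.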

In later phases, a trained neural network should be able to learn more
subtle interactions between features and hopefully reduce these errors.