<a href="https://colab.research.google.com/github/tousifo/ml_notebooks/blob/main/ALS_QNN_PRO_ACT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Cell 1 — Imports & setup for the ALS QNN notebook

**Purpose:** bring in the sklearn utilities, pin/install Qiskit, and import the quantum pieces used by a variational regressor.

### What’s in this cell
- **Data utilities:** `train_test_split`, `MinMaxScaler`
- **Metrics:** `mean_squared_error`, `pearsonr`
- **Install (version-constrained):** `%pip install qiskit~=1.0 qiskit-machine-learning~=0.8.1 qiskit_algorithms` (`~=` allows compatible patch/minor releases; `qiskit_algorithms` is left unpinned)
- **Quantum stack:** `ZZFeatureMap`, `RealAmplitudes`, `COBYLA`, `VQR`, `Sampler`

> **Note:** The canonical PyPI name is `qiskit-algorithms` (hyphen). pip normalizes underscores to hyphens in package names, so `qiskit_algorithms` in the install line resolves to the same package.

### Why scale features?
Feature maps such as `ZZFeatureMap` encode each feature as a gate angle, and angles are periodic. Min–max scaling to a bounded range keeps distinct inputs from wrapping onto the same angle and tends to make training steadier.

<details>
<summary>Quick I/O expectations</summary>
After preprocessing later:
- Feature vectors will be scaled to a small range (often [0, 1]).
- The `ZZFeatureMap(feature_dimension=...)` should match your final number of features.
</details>
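
As a rough illustration of the scaling step, the sketch below reproduces min-max scaling with plain NumPy, fitting the range on the training split only. The array values and the helper names `fit_min_max` and `apply_min_max` are made up for this example; in the notebook itself `MinMaxScaler` plays this role.

```python
import numpy as np

def fit_min_max(X_train):
    """Learn the per-feature minimum and range from the training split only."""
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features: avoid division by zero
    return lo, span

def apply_min_max(X, lo, span, clip=True):
    """Scale with training statistics; optionally clip unseen values into [0, 1]."""
    Z = (X - lo) / span
    return np.clip(Z, 0.0, 1.0) if clip else Z

X_train = np.array([[50.0, 1.2], [70.0, 3.4], [60.0, 2.0]])  # toy values
X_test = np.array([[80.0, 0.9]])                              # outside the training range

lo, span = fit_min_max(X_train)
Z_train = apply_min_max(X_train, lo, span)
Z_test_raw = apply_min_max(X_test, lo, span, clip=False)  # may leave [0, 1]
Z_test = apply_min_max(X_test, lo, span)                  # clipped back into [0, 1]

n_features = Z_train.shape[1]  # the value feature_dimension should match
print(Z_train.min(axis=0), Z_train.max(axis=0))
print(Z_test_raw, Z_test, n_features)
```

Clipping is only one way to handle held-out values that fall outside the training range; `MinMaxScaler(clip=True)` provides the same behavior inside a scikit-learn pipeline.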

---


In [1]:
from sklearn.model_selection import train_test_split  # quick train/validation split
from sklearn.preprocessing import MinMaxScaler        # keep features in a compact range for angle encoding
from sklearn.metrics import mean_squared_error        # regression loss (lower is better)
from scipy.stats import pearsonr                      # correlation between predictions and targets (closer to 1 is better)
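# Version-constrained install; the comment sits on its own line so it is never passed to pip as an argument.
# pip treats 'qiskit_algorithms' and 'qiskit-algorithms' as the same package.
%pip install qiskit~=1.0 qiskit-machine-learning~=0.8.1 qiskit_algorithms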

# Qiskit Imports
from qiskit.circuit.library import ZZFeatureMap, RealAmplitudes  # feature map + ansatz for the variational circuit
from qiskit_algorithms.optimizers import COBYLA                  # gradient-free optimizer suited to noisy objectives
from qiskit_machine_learning.algorithms.regressors import VQR    # variational quantum regressor wrapper
from qiskit.primitives import Sampler                            # reference Sampler primitive (statevector-based; exact probabilities unless `shots` is set)

Collecting qiskit~=1.0
  Downloading qiskit-1.4.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting qiskit-machine-learning~=0.8.1
  Downloading qiskit_machine_learning-0.8.4-py3-none-any.whl.metadata (13 kB)
Collecting qiskit_algorithms
  Downloading qiskit_algorithms-0.4.0-py3-none-any.whl.metadata (4.7 kB)
Collecting rustworkx>=0.15.0 (from qiskit~=1.0)
  Downloading rustworkx-0.17.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting stevedore>=3.0.0 (from qiskit~=1.0)
  Downloading stevedore-5.5.0-py3-none-any.whl.metadata (2.2 kB)
Collecting symengine<0.14,>=0.11 (from qiskit~=1.0)
  Downloading symengine-0.13.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.2 kB)
Collecting scipy>=1.5 (from qiskit~=1.0)
  Downloading scipy-1.15.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)

In [4]:
import pandas as pd
import numpy as np
import warnings
from typing import Dict, Optional, List

warnings.filterwarnings("ignore")


class ALSDataProcessor:
    """
    CV-safe preprocessing for PRO-ACT to reproduce the paper's EDA:
      - Anchor to FIRST ALSFRS visit (t=0) per subject
      - Inputs: first 3 months (0–90 days from anchor) for all longitudinal tables
      - Outcome: ALSFRS Total slope between FIRST-after-3mo and FIRST-after-12mo
      - ALSFRS-R harmonization hooks (Q10 from 10a; merge Q5a/Q5b)
      - FVC reduced to max-of-trials per test before summarization
      - Seven summaries: min, max, median, std, first, last, slope (slope=NaN if only 1 obs)
      - Drop features with >30% missing (no other transforms here — avoid leakage)
    """

    def __init__(self):
        # columns that should never be summarized as values
        self.id_and_delta_cols = {
            "subject_id",
            "alsfrs_delta",
            "fvc_delta",
            "vitals_delta",
            "labs_delta",
            "grip_delta",
            "muscle_delta",
            "onset_delta",
            "death_delta",
            "history_delta",
            "anchor_days",
            "days_from_alsfrs_anchor",
        }

    # --------- Utilities ---------

    @staticmethod
    def _find_time_col(df: pd.DataFrame) -> Optional[str]:
        """Find a time column that represents days since baseline in a table."""
        # Prefer delta
        for c in df.columns:
            lc = c.lower()
            if "delta" in lc:
                return c
        # Fallback to 'days' if present
        for c in df.columns:
            lc = c.lower()
            if "day" in lc:
                return c
        return None

    # --------- ALSFRS-R harmonization ---------

    def _convert_alsfrs_r(self, alsfrs_df: pd.DataFrame) -> pd.DataFrame:
        """
        Prepare ALSFRS table. If ALSFRS-R subitems exist, map per paper:
          - Q10 <- 10a (dyspnea). Ignore 10b/10c.
          - Merge Q5a/Q5b into Q5 if present.
        If only totals exist, this is a no-op aside from coercions.
        """
        df = alsfrs_df.copy()

        if "ALSFRS_Total" in df.columns:
            df["ALSFRS_Total"] = pd.to_numeric(df["ALSFRS_Total"], errors="coerce")

        # Try to locate subitems by loose names
        cols = {c.lower(): c for c in df.columns}

        # Q10 from 10a (dyspnea)
        for candidate in ["alsfrs_r_q10a", "q10a", "dyspnea", "alsfrs_q10a"]:
            if candidate in cols:
                df["Q10"] = pd.to_numeric(df[cols[candidate]], errors="coerce")
                break

        # Merge Q5a/Q5b
        q5a = next(
            (cols[k] for k in ["alsfrs_r_q5a", "q5a", "cutting_wout_gastrostomy"] if k in cols),
            None,
        )
        q5b = next(
            (cols[k] for k in ["alsfrs_r_q5b", "q5b", "cutting_with_gastrostomy"] if k in cols),
            None,
        )
        if q5a and q5b:
            q5a_vals = pd.to_numeric(df[q5a], errors="coerce").values
            q5b_vals = pd.to_numeric(df[q5b], errors="coerce").values
            df["Q5"] = np.nanmax(np.vstack([q5a_vals, q5b_vals]), axis=0)

        return df

    # --------- Anchoring ---------

    def _alsfrs_anchor_days(self, alsfrs_df: pd.DataFrame) -> pd.Series:
        """
        Compute per-subject anchor day = first ALSFRS visit (min delta/days).
        """
        df = alsfrs_df.copy()
        tcol = self._find_time_col(df)
        if tcol is None:
            raise ValueError("ALSFRS table lacks a time delta/days column.")

        df.rename(columns={tcol: "alsfrs_delta"}, inplace=True)
        anchor_map = df.groupby("subject_id")["alsfrs_delta"].min()
        return anchor_map

    # --------- Data I/O ---------

    def load_and_inspect_data(self, file_path: str = "") -> Dict[str, pd.DataFrame]:
        datasets: Dict[str, pd.DataFrame] = {}
        file_list = [
            "PROACT_ALSFRS.csv",
            "PROACT_FVC.csv",
            "PROACT_VITALSIGNS.csv",
            "PROACT_RILUZOLE.csv",
            "PROACT_DEMOGRAPHICS.csv",
            "PROACT_LABS.csv",
            "PROACT_DEATHDATA.csv",
            "PROACT_HANDGRIPSTRENGTH.csv",
            "PROACT_MUSCLESTRENGTH.csv",
            "PROACT_ALSHISTORY.csv",
        ]
        print("--- Loading and Inspecting Data ---")
        for file_name in file_list:
            try:
                df = pd.read_csv(file_path + file_name, on_bad_lines="skip")
                # normalize subject_id
                if "subject_id" not in df.columns:
                    potential = [c for c in df.columns if "subject" in c.lower()]
                    if potential:
                        df = df.rename(columns={potential[0]: "subject_id"})
                # coerce delta-like numeric columns
                for c in df.columns:
                    if "delta" in c.lower() or "day" in c.lower():
                        df[c] = pd.to_numeric(df[c], errors="coerce")
                datasets[file_name] = df
                print(f"✓ {file_name}: {df.shape}")
            except FileNotFoundError:
                print(f"✗ {file_name}: File not found (skipped).")
        return datasets

    # --------- Outcome ---------

    def calculate_alsfrs_slope(self, alsfrs_df: pd.DataFrame) -> pd.DataFrame:
        """
        Outcome = slope between FIRST-after-3mo and FIRST-after-12mo ALSFRS totals,
        with time anchored to first ALSFRS visit.
        """
        df = alsfrs_df.copy()
        tcol = self._find_time_col(df)
        if tcol is None:
            raise ValueError("ALSFRS table lacks a time delta/days column.")
        if "ALSFRS_Total" not in df.columns:
            raise ValueError("ALSFRS_Total missing in ALSFRS table.")

        df.rename(columns={tcol: "alsfrs_delta"}, inplace=True)
        # Anchor
        anchor_map = df.groupby("subject_id")["alsfrs_delta"].min()
        df["days_from_anchor"] = df["alsfrs_delta"] - df["subject_id"].map(anchor_map)
        df["months"] = df["days_from_anchor"] / 30.44

        df = df.sort_values(["subject_id", "months"])
        slopes = {}

        for sid, g in df.groupby("subject_id", sort=False):
            g = g.dropna(subset=["months", "ALSFRS_Total"])
            t1 = g[g["months"] > 3.0].head(1)
            t2 = g[g["months"] > 12.0].head(1)
            if not t1.empty and not t2.empty:
                t1m = float(t1["months"].iloc[0])
                t2m = float(t2["months"].iloc[0])
                t1v = float(t1["ALSFRS_Total"].iloc[0])
                t2v = float(t2["ALSFRS_Total"].iloc[0])
                if t2m > t1m:
                    slopes[sid] = (t2v - t1v) / (t2m - t1m)

        return pd.DataFrame({"subject_id": list(slopes.keys()), "alsfrs_slope": list(slopes.values())})

    # --------- FVC collapse ---------

    @staticmethod
    def _fvc_collapse_trials(df: pd.DataFrame, time_col: str) -> pd.DataFrame:
        """
        Reduce FVC per row/time to the max across trials before summarization.
        Tries to detect typical trial columns; falls back gracefully.
        """
        d = df.copy()
        # Find obvious trial columns
        trial_cols = [c for c in d.columns if "trial" in c.lower()]
        # Some datasets have explicit liters columns per trial name
        if trial_cols:
            # Coerce each column first so stray strings cannot break the row-wise max
            d["FVC_Liters"] = d[trial_cols].apply(pd.to_numeric, errors="coerce").max(axis=1)
            keep = ["subject_id", time_col, "FVC_Liters"]
            return d[keep]
        # Fallbacks: look for liters column names
        liter_like = [c for c in d.columns if "liter" in c.lower() or "fvc" in c.lower()]
        if liter_like:
            # If multiple, take the row-wise max of the coerced values
            d["FVC_Liters"] = d[liter_like].apply(pd.to_numeric, errors="coerce").max(axis=1)
            keep = ["subject_id", time_col, "FVC_Liters"]
            return d[keep]
        # Last resort: return as-is
        return d

    # --------- Longitudinal summarization ---------

    def create_longitudinal_features(self, df: pd.DataFrame, time_col: str, prefix: str) -> pd.DataFrame:
        """
        Create 7 summaries over [0, 90] days from ALSFRS anchor:
          min, max, median, std, first, last, slope(first→last)
        Slope remains NaN if only one observation or zero time span.
        """
        if time_col not in df.columns:
            return pd.DataFrame()

        d = df.copy()
        # Coerce numerics (but keep subject_id/time cols)
        for c in d.columns:
            if c not in {"subject_id", time_col}:
                d[c] = pd.to_numeric(d[c], errors="coerce")

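        # Keep the 0..90-day window from the ALSFRS anchor (already anchored), ordered by
        # time so that "first", "last", and the slope follow visit order, not file order
        d = d[(d[time_col] >= 0) & (d[time_col] <= 90)].sort_values(["subject_id", time_col])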
        if d.empty:
            return pd.DataFrame()

        # Value columns (exclude identifiers/derived delta/time)
        val_cols = [
            c
            for c in d.select_dtypes(include=[np.number]).columns
            if c not in self.id_and_delta_cols and c not in {"subject_id", time_col}
        ]
        if not val_cols:
            return pd.DataFrame()

        out = []
        g = d.groupby("subject_id", as_index=True)
        for col in val_cols:
            agg = g[col].agg(["min", "max", "median", "first", "last"])
            std_ = g[col].std(ddof=0).rename("std")
            slope = g.apply(
                lambda x: (x[col].iloc[-1] - x[col].iloc[0]) / max(1e-9, (x[time_col].iloc[-1] - x[time_col].iloc[0]))
                if len(x) > 1 and (x[time_col].iloc[-1] - x[time_col].iloc[0]) > 0
                else np.nan
            ).rename("slope")
            feat = pd.concat([agg, std_, slope], axis=1)
            feat.columns = [f"{prefix}{col}_{cname}" for cname in feat.columns]
            out.append(feat)

        return pd.concat(out, axis=1).reset_index()

    # --------- Static table processing (no encoding here to avoid leakage) ---------

    @staticmethod
    def process_static_data(df: pd.DataFrame) -> pd.DataFrame:
        """
        CV-safe: DO NOT encode here. Just keep one row per subject.
        (Do categorical encoding in your modeling pipeline.)
        """
        if "subject_id" not in df.columns:
            return pd.DataFrame()
        # Keep first non-duplicated row per subject_id
        return df.drop_duplicates(subset=["subject_id"]).copy()

    # --------- Merge features ---------

    def merge_all_features(self, datasets: Dict[str, pd.DataFrame]) -> pd.DataFrame:
        if "PROACT_DEMOGRAPHICS.csv" not in datasets:
            raise ValueError("Demographics file is missing.")

        # Build ALSFRS anchor map
        alsfrs = datasets["PROACT_ALSFRS.csv"]
        anchor_map = self._alsfrs_anchor_days(alsfrs)

        # Start with demographics (static)
        final_df = self.process_static_data(datasets["PROACT_DEMOGRAPHICS.csv"])

        # Add static-ish other tables (keep CV-safe; no encodings)
        for file in ["PROACT_RILUZOLE.csv", "PROACT_ALSHISTORY.csv"]:
            if file in datasets:
                static_df = self.process_static_data(datasets[file])
                final_df = pd.merge(final_df, static_df, on="subject_id", how="left")

        # Longitudinal configs
        longitudinal = {
            "PROACT_ALSFRS.csv": "alsfrs_",
            "PROACT_FVC.csv": "fvc_",
            "PROACT_VITALSIGNS.csv": "vitals_",
            "PROACT_LABS.csv": "labs_",
            "PROACT_HANDGRIPSTRENGTH.csv": "grip_",
            "PROACT_MUSCLESTRENGTH.csv": "muscle_",
        }

        print("\n--- Generating Longitudinal Features (anchored to first ALSFRS; window = 0–90 days) ---")
        for file, prefix in longitudinal.items():
            if file not in datasets:
                continue

            df = datasets[file].copy()
            tcol = self._find_time_col(df)
            if tcol is None:
                print(f"Warning: No time delta/days column in {file}. Skipping.")
                continue

            # Anchor this table to ALSFRS first visit
            df["anchor_days"] = df["subject_id"].map(anchor_map)
            df = df[~df["anchor_days"].isna()].copy()
            df["days_from_alsfrs_anchor"] = pd.to_numeric(df[tcol], errors="coerce") - df["anchor_days"]

            # FVC special handling: collapse to max-of-trials BEFORE summarization
            if file == "PROACT_FVC.csv":
                df = self._fvc_collapse_trials(df, time_col="days_from_alsfrs_anchor")

            # Attempt to pivot long-form measurement tables (best effort)
            if file in {"PROACT_LABS.csv", "PROACT_MUSCLESTRENGTH.csv", "PROACT_HANDGRIPSTRENGTH.csv"}:
                try:
                    test_cols = [
                        c
                        for c in df.columns
                        if c not in {"subject_id", "days_from_alsfrs_anchor", "anchor_days"}
                        and any(k in c.lower() for k in ["test", "exam", "muscle", "site", "name", "strength_test"])
                    ]
                    value_cols = [
                        c
                        for c in df.columns
                        if c not in {"subject_id", "days_from_alsfrs_anchor", "anchor_days"}
                        and any(k in c.lower() for k in ["result", "value", "strength", "score"])
                    ]
                    if test_cols and value_cols:
                        tcol_name = test_cols[0]
                        vcol_name = value_cols[0]
                        df[vcol_name] = pd.to_numeric(df[vcol_name], errors="coerce")
                        df = (
                            df.pivot_table(
                                index=["subject_id", "days_from_alsfrs_anchor"],
                                columns=tcol_name,
                                values=vcol_name,
                                aggfunc="mean",
                            )
                            .reset_index()
                        )
                except Exception as e:
                    print(f"Warning: Pivoting failed for {file}: {e}")

            feats = self.create_longitudinal_features(df, "days_from_alsfrs_anchor", prefix)
            if not feats.empty:
                final_df = pd.merge(final_df, feats, on="subject_id", how="left")

        return final_df

    # --------- Eligibility ---------

    def filter_eligible_patients(self, feature_df: pd.DataFrame, alsfrs_df: pd.DataFrame) -> pd.DataFrame:
        """
        Keep subjects who have ANY ALSFRS >3 months AND >12 months AFTER the ALSFRS anchor.
        """
        df = alsfrs_df.copy()
        tcol = self._find_time_col(df)
        if tcol is None:
            raise ValueError("ALSFRS table lacks a time delta/days column.")

        df.rename(columns={tcol: "alsfrs_delta"}, inplace=True)
        anchor_map = df.groupby("subject_id")["alsfrs_delta"].min()
        df["days_from_anchor"] = df["alsfrs_delta"] - df["subject_id"].map(anchor_map)
        df["months"] = df["days_from_anchor"] / 30.44

        g = df.groupby("subject_id")["months"]
        has_t1 = g.apply(lambda s: (s > 3.0).any())
        has_t2 = g.apply(lambda s: (s > 12.0).any())
        eligible_ids = has_t1[has_t1].index.intersection(has_t2[has_t2].index)

        print(f"\nEligible patients: {len(eligible_ids)} / {df['subject_id'].nunique()}")
        return feature_df[feature_df["subject_id"].isin(eligible_ids)].copy()

    # --------- Orchestration ---------

    def run_pipeline(self, file_path: str = "") -> Optional[Dict[str, pd.DataFrame]]:
        """
        End-to-end EDA (CV-safe) that writes 'final_processed_als_data.csv'.
        Apart from dropping columns with >30% missing values, no imputation, scaling,
        or feature selection happens here; do those inside your CV pipeline.
        """
        print("====== Starting ALS Data Preprocessing Pipeline ======")
        datasets = self.load_and_inspect_data(file_path)
        if "PROACT_ALSFRS.csv" not in datasets:
            print("CRITICAL ERROR: PROACT_ALSFRS.csv not found. Aborting.")
            return None

        # ALSFRS prep + anchor
        datasets["PROACT_ALSFRS.csv"] = self._convert_alsfrs_r(datasets["PROACT_ALSFRS.csv"])

        # Outcome
        target_df = self.calculate_alsfrs_slope(datasets["PROACT_ALSFRS.csv"])
        print(f"\nCalculated ALSFRS slope for {len(target_df)} patients.")

        # Features
        full_features = self.merge_all_features(datasets)

        # Eligibility
        eligible_features = self.filter_eligible_patients(full_features, datasets["PROACT_ALSFRS.csv"])

        # Join features + target
        final_df = pd.merge(eligible_features, target_df, on="subject_id", how="inner")

        # Drop features with >30% missing
        print("\n--- Handling Missing Values (Dropping cols with >30% missing) ---")
        initial_cols = len(final_df.columns)
        missing_thresh = 0.30
        min_non_na = int(np.ceil(len(final_df) * (1 - missing_thresh)))
        final_df = final_df.dropna(axis=1, thresh=min_non_na)
        dropped = initial_cols - len(final_df.columns)
        print(f"Dropped {dropped} columns for >{int(missing_thresh*100)}% missingness.")

        # Separate X/y (no transforms here to avoid leakage)
        if "alsfrs_slope" not in final_df.columns:
            print("No target available after merges. Aborting.")
            return None

        y = final_df["alsfrs_slope"]
        valid = y.notna()
        final_df = final_df.loc[valid].reset_index(drop=True)

        subject_ids = final_df["subject_id"]
        y = final_df["alsfrs_slope"]
        X = final_df.drop(columns=["subject_id", "alsfrs_slope"])

        # Save CV-safe engineered dataset (raw features)
        out = pd.concat([subject_ids, y, X], axis=1)
        out.to_csv("final_processed_als_data.csv", index=False)
        print("\n✅ Saved CV-safe engineered data to 'final_processed_als_data.csv'")
        print(f"Feature matrix shape: {X.shape} | Target length: {len(y)}")

        return {"X": X, "y": y, "subject_ids": subject_ids, "raw_frame": out}


if __name__ == "__main__":
    # If your CSVs live elsewhere, set file_path accordingly (e.g., "C:/data/PROACT/")
    file_path = ""
    processor = ALSDataProcessor()
    processed = processor.run_pipeline(file_path=file_path)
    if processed is not None:
        print("\nPreview of columns:", list(processed["X"].columns)[:10])
        print("Done.")


--- Loading and Inspecting Data ---
✓ PROACT_ALSFRS.csv: (73845, 20)
✓ PROACT_FVC.csv: (49110, 10)
✓ PROACT_VITALSIGNS.csv: (84721, 36)
✓ PROACT_RILUZOLE.csv: (10363, 3)
✓ PROACT_DEMOGRAPHICS.csv: (12504, 14)
✓ PROACT_LABS.csv: (2937162, 5)
✓ PROACT_DEATHDATA.csv: (5043, 3)
✓ PROACT_HANDGRIPSTRENGTH.csv: (19032, 11)
✓ PROACT_MUSCLESTRENGTH.csv: (204875, 10)
✓ PROACT_ALSHISTORY.csv: (13765, 16)

Calculated ALSFRS slope for 1897 patients.

--- Generating Longitudinal Features (anchored to first ALSFRS; window = 0–90 days) ---

Eligible patients: 3317 / 8538

--- Handling Missing Values (Dropping cols with >30% missing) ---
Dropped 1413 columns for >30% missingness.

✅ Saved CV-safe engineered data to 'final_processed_als_data.csv'
Feature matrix shape: (1897, 346) | Target length: 1897

Preview of columns: ['Demographics_Delta', 'Age', 'Race_Caucasian', 'Sex', 'Subject_used_Riluzole', 'Riluzole_use_Delta', 'Subject_ALS_History_Delta', 'Site_of_Onset', 'alsfrs_Q1_Speech_min', 'alsfrs_Q1_S
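
To make the outcome definition concrete, here is a small worked example of the slope rule used in `calculate_alsfrs_slope`: take the first visit after 3 months and the first visit after 12 months (both measured from the first ALSFRS visit) and divide the score change by the elapsed months. The visit days and scores are invented, and `first_after` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

DAYS_PER_MONTH = 30.44  # same conversion factor as the pipeline

def first_after(months, values, threshold):
    """Return (month, value) of the earliest visit strictly after `threshold`, or None."""
    mask = months > threshold
    if not mask.any():
        return None
    i = int(np.argmax(mask))  # index of the first True; assumes months are sorted ascending
    return float(months[i]), float(values[i])

# Toy subject: days since trial baseline and ALSFRS totals (invented numbers)
days = np.array([14.0, 114.0, 214.0, 394.0])
totals = np.array([40.0, 38.0, 35.0, 30.0])

months = (days - days.min()) / DAYS_PER_MONTH  # anchor to the first ALSFRS visit
t1 = first_after(months, totals, 3.0)
t2 = first_after(months, totals, 12.0)

slope = None
if t1 and t2 and t2[0] > t1[0]:
    slope = (t2[1] - t1[1]) / (t2[0] - t1[0])  # ALSFRS points per month

print(t1, t2, slope)
```

A subject whose first visit after 3 months is also the first visit after 12 months gets no slope, because the two time points coincide. Given the code above, this is one way the slope count can fall below the eligibility count.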

In [None]:
import numpy as np
import pandas as pd
from typing import Tuple, Dict
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import make_scorer, mean_squared_error
from scipy.stats import pearsonr
import warnings

warnings.filterwarnings("ignore")


# ---------- Metrics ----------
def rmse(y_true, y_pred) -> float:
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def safe_pcc(y_true, y_pred) -> float:
    # Robust Pearson: returns 0 if either vector is (near) constant in a bootstrap draw
    yt = np.asarray(y_true, dtype=float)
    yp = np.asarray(y_pred, dtype=float)
    if np.std(yt) < 1e-12 or np.std(yp) < 1e-12:
        return 0.0
    return float(np.corrcoef(yt, yp)[0, 1])

def pcc_scorer_func(y_true, y_pred) -> float:
    return safe_pcc(y_true, y_pred)

PCC_SCORER = make_scorer(pcc_scorer_func, greater_is_better=True)


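def bootstrap_ci(y_true, y_pred, metric_fn, n_boot=5000, alpha=0.95, seed=42) -> Tuple[float, float]:
    """Percentile bootstrap CI. `alpha` is the confidence level (0.95 gives a 95% interval).
    Expects NumPy arrays, since resampling uses integer-array indexing."""
    rng = np.random.default_rng(seed)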
    n = len(y_true)
    stats = []
    idx = np.arange(n)
    for _ in range(n_boot):
        b = rng.choice(idx, size=n, replace=True)
        stats.append(metric_fn(y_true[b], y_pred[b]))
    lower = float(np.percentile(stats, (1 - alpha) / 2 * 100))
    upper = float(np.percentile(stats, (1 + alpha) / 2 * 100))
    return lower, upper


# ---------- Main ----------
def run_classical_pipeline_cv() -> pd.DataFrame:
    print("====== Classical Baselines (CV-safe, paper-aligned) ======")

    # 1) Load engineered data (from your preprocessing step)
    df = pd.read_csv("final_processed_als_data.csv")
    print(f"✓ Loaded engineered dataset: {df.shape}")

    X = df.drop(columns=["subject_id", "alsfrs_slope"])
    y = df["alsfrs_slope"].astype(float)

    # 80/20 split. The engineered table has one row per subject, so a random row split
    # is also a subject-level split (the paper used a single 80/20 split)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=42
    )
    print(f"Split: train={X_train.shape[0]}, test={X_test.shape[0]}")

    # 2) Column typing for preprocessing
    num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
    cat_cols = X_train.select_dtypes(exclude=[np.number]).columns.tolist()
    print(f"Detected numeric={len(num_cols)}, categorical={len(cat_cols)}")

    # Preprocessors (inside pipeline to avoid leakage)
    # Numeric: impute (mean/median tuned); scaler tuned for SVR; RF doesn't need scaling
    # Categorical: impute most_frequent + one-hot (inside CV)
    num_for_rf = Pipeline(steps=[
        ("imputer", SimpleImputer())  # strategy tuned in grid
        # no scaler for RF
    ])
    num_for_svr = Pipeline(steps=[
        ("imputer", SimpleImputer()),          # strategy tuned
        ("scaler", StandardScaler())           # scaler choice tuned (Std vs MinMax)
    ])
    cat_common = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ])

    # ColumnTransformers
    preproc_rf = ColumnTransformer(
        transformers=[
            ("num", num_for_rf, num_cols),
            ("cat", cat_common, cat_cols),
        ],
        remainder="drop"
    )
    preproc_svr = ColumnTransformer(
        transformers=[
            ("num", num_for_svr, num_cols),
            ("cat", cat_common, cat_cols),
        ],
        remainder="drop"
    )

    # 3) Pipelines + grids

    # Feature selection approximating Pearson via f_regression; k tuned.
    rf_pipe = Pipeline(steps=[
        ("preprocess", preproc_rf),
        ("select", SelectKBest(score_func=f_regression, k="all")),
        ("model", RandomForestRegressor(random_state=42, n_jobs=-1))
    ])

    rf_param_grid: Dict[str, list] = {
        "preprocess__num__imputer__strategy": ["median", "mean"],
        "select__k": [30, 50, 100, "all"],
        "model__n_estimators": [300, 500],
        "model__max_depth": [None, 10, 16],
        "model__min_samples_leaf": [1, 2, 4],
    }

    rf_grid = GridSearchCV(
        rf_pipe,
        rf_param_grid,
        cv=5,
        scoring={"rmse": "neg_root_mean_squared_error", "pcc": PCC_SCORER},
        refit="pcc",
        n_jobs=-1,
        verbose=0
    )
    print("\n--- Fitting RandomForest (5-fold CV, refit=PCC) ---")
    rf_grid.fit(X_train, y_train)
    print(f"RF best params: {rf_grid.best_params_}")

    # SVR pipeline
    svr_pipe = Pipeline(steps=[
        ("preprocess", preproc_svr),
        ("select", SelectKBest(score_func=f_regression, k="all")),
        ("model", SVR(kernel="rbf"))
    ])

    # Swap scaler choice inside the numeric branch
    svr_param_grid: Dict[str, list] = {
        "preprocess__num__imputer__strategy": ["median", "mean"],
        "preprocess__num__scaler": [StandardScaler(), MinMaxScaler()],
        "select__k": [30, 50, 100, "all"],
        "model__C": [1.0, 3.0, 10.0],
        "model__epsilon": [0.05, 0.1, 0.2],
        "model__gamma": ["scale", "auto"],
    }

    svr_grid = GridSearchCV(
        svr_pipe,
        svr_param_grid,
        cv=5,
        scoring={"rmse": "neg_root_mean_squared_error", "pcc": PCC_SCORER},
        refit="pcc",
        n_jobs=-1,
        verbose=0
    )
    print("\n--- Fitting SVR (5-fold CV, refit=PCC) ---")
    svr_grid.fit(X_train, y_train)
    print(f"SVR best params: {svr_grid.best_params_}")

    # 4) Test-set evaluation + bootstrap CIs
    results = []

    for name, grid in [("Random Forest", rf_grid), ("SVR (RBF)", svr_grid)]:
        y_pred = grid.best_estimator_.predict(X_test)
        test_rmse = rmse(y_test, y_pred)
        test_pcc  = safe_pcc(y_test.values, y_pred)

        rmse_lo, rmse_hi = bootstrap_ci(y_test.values, y_pred, rmse, n_boot=5000, alpha=0.95, seed=123)
        pcc_lo,  pcc_hi  = bootstrap_ci(y_test.values, y_pred, safe_pcc, n_boot=5000, alpha=0.95, seed=456)

        results.append({
            "Model": name,
            "RMSE": test_rmse,
            "RMSE 95% CI Low": rmse_lo,
            "RMSE 95% CI High": rmse_hi,
            "PCC": test_pcc,
            "PCC 95% CI Low": pcc_lo,
            "PCC 95% CI High": pcc_hi,
        })

    results_df = pd.DataFrame(results).set_index("Model")
    print("\n====== Test Set Performance (80/20 split) ======")
    print(results_df.round(4))

    # Optional: simple average ensemble (paper averaged DL models; we show an example here)
    # Align predictions
    rf_pred = rf_grid.best_estimator_.predict(X_test)
    svr_pred = svr_grid.best_estimator_.predict(X_test)
    ens_pred = 0.5 * (rf_pred + svr_pred)

    ens_rmse = rmse(y_test, ens_pred)
    ens_pcc  = safe_pcc(y_test.values, ens_pred)
    ens_rmse_ci = bootstrap_ci(y_test.values, ens_pred, rmse, n_boot=5000, alpha=0.95, seed=789)
    ens_pcc_ci  = bootstrap_ci(y_test.values, ens_pred, safe_pcc, n_boot=5000, alpha=0.95, seed=101112)

    print("\n--- Simple RF+SVR Avg Ensemble ---")
    print(pd.DataFrame({
        "RMSE": [ens_rmse],
        "RMSE 95% CI Low": [ens_rmse_ci[0]],
        "RMSE 95% CI High": [ens_rmse_ci[1]],
        "PCC": [ens_pcc],
        "PCC 95% CI Low": [ens_pcc_ci[0]],
        "PCC 95% CI High": [ens_pcc_ci[1]],
    }, index=["RF+SVR Ensemble"]).round(4))

    return results_df


if __name__ == "__main__":
    run_classical_pipeline_cv()


✓ Loaded engineered dataset: (1897, 348)
Split: train=1517, test=380
Detected numeric=343, categorical=3

--- Fitting RandomForest (5-fold CV, refit=PCC) ---
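
The percentile bootstrap behind `bootstrap_ci` can be tried on synthetic data before running the full grid search. The sketch below is self-contained and uses invented targets and predictions; `percentile_ci` is a hypothetical stand-in that mirrors the resampling logic above with fewer replicates.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def percentile_ci(y_true, y_pred, metric_fn, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap: resample (target, prediction) pairs with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        b = rng.integers(0, n, size=n)  # indices drawn with replacement
        stats[i] = metric_fn(y_true[b], y_pred[b])
    lo, hi = np.percentile(stats, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return float(lo), float(hi)

# Synthetic slopes and noisy predictions (not PRO-ACT data)
rng = np.random.default_rng(1)
y_true = rng.normal(-0.7, 0.5, size=200)
y_pred = y_true + rng.normal(0.0, 0.3, size=200)

point = rmse(y_true, y_pred)
lo, hi = percentile_ci(y_true, y_pred, rmse)
print(f"RMSE={point:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Resampling targets and predictions with the same index vector keeps each pair together, which is what makes the interval reflect test-set sampling variability rather than model refitting.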


In [None]:
# qrf_stacked_cv.py
import os
import time
import warnings

# Keep CPU libs tame. Set the thread caps before importing NumPy, because BLAS/OpenMP
# backends read these variables when they are first loaded.
for k in ["OMP_NUM_THREADS","OPENBLAS_NUM_THREADS","MKL_NUM_THREADS","NUMEXPR_NUM_THREADS"]:
    os.environ.setdefault(k, "1")

import numpy as np
import pandas as pd
from typing import Tuple, Dict
from tqdm.auto import tqdm

warnings.filterwarnings("ignore")
np.random.seed(42)

# --- sklearn & friends
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.linear_model import RidgeCV, Ridge
from sklearn.metrics import mean_squared_error

# --- stats
from scipy.stats import pearsonr

# --- qiskit
from qiskit.circuit.library import ZZFeatureMap, EfficientSU2
from qiskit.quantum_info import SparsePauliOp
from qiskit.primitives import Estimator
try:
    from qiskit_aer.primitives import Estimator as AerEstimator
    AER_OK = True
except Exception:
    AER_OK = False


# ======================= Utilities =======================

def safe_pcc(a, b) -> float:
    a, b = np.asarray(a).ravel(), np.asarray(b).ravel()
    if a.std() == 0 or b.std() == 0:
        return 0.0
    v = pearsonr(a, b)[0]
    return float(v) if np.isfinite(v) else 0.0

def rmse(y_true, y_pred) -> float:
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

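def bootstrap_ci(y_true, y_pred, metric_fn, n_boot=5000, alpha=0.95, seed=42) -> Tuple[float, float]:
    """Percentile bootstrap CI for metric_fn; `alpha` is the confidence level (0.95 -> 95% interval)."""
    # Coerce to arrays so positional indexing works for lists and pandas Series alike
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    idx = np.arange(n)
    vals = []
    for _ in range(n_boot):
        b = rng.choice(idx, size=n, replace=True)  # resample (target, prediction) pairs together
        vals.append(metric_fn(y_true[b], y_pred[b]))
    lo, hi = np.percentile(vals, [(1-alpha)/2*100, (1+alpha)/2*100])
    return float(lo), float(hi)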

def load_xy(path="final_processed_als_data.csv"):
    df = pd.read_csv(path)
    X_all = df.drop(columns=["subject_id","alsfrs_slope"], errors="ignore")
    # Use numeric features only (keeps CV safe and avoids encoding leakage here)
    X = X_all.select_dtypes(include=[np.number]).copy()
    y = df["alsfrs_slope"].values
    m = ~np.isnan(y)
    X, y = X.loc[m].reset_index(drop=True), y[m]
    print(f"✓ Data loaded: X_num={X.shape} (from {X_all.shape[1]} total features), y={y.shape}")
    return X, y

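def select_topk_features(X_df, y, k=16):
    # Rank with RF on training only
    imp = SimpleImputer(strategy="median")
    Xn = imp.fit_transform(X_df)
    # SimpleImputer drops all-NaN columns by default, which would shift the column indices below
    assert Xn.shape[1] == X_df.shape[1], "all-NaN columns present; drop them before ranking"
    rf = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1).fit(Xn, y)
    rf_rank = rf.feature_importances_
    idx = np.argsort(rf_rank)[::-1][:k]  # indices of the k most important columns, best first
    cols = [X_df.columns[i] for i in idx]
    print(f"✓ Top-{k} features: {cols}")
    return idx, cols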


# ======================= Quantum helpers =======================

def make_observables(n_qubits=4, basis="ZX", use_pairs=True):
    """
    Returns a list of SparsePauliOp observables:
      - Singles: Z_i, plus X_i for basis "ZX"/"ZXY" and Y_i for basis "ZXY"
      - Pairs: Z_i Z_j (i < j) if use_pairs
    """
    def pauli(ops):
        # ops maps qubit index -> Pauli letter. Qiskit labels are little-endian
        # (qubit 0 is the rightmost character), hence the reversal.
        p = ['I']*n_qubits
        for q, letter in ops.items():
            p[q] = letter
        return SparsePauliOp.from_list([("".join(p[::-1]), 1.0)])

    obs = []
    for i in range(n_qubits):
        obs.append(pauli({i: 'Z'}))
        if basis in ("ZX","ZXY"):
            obs.append(pauli({i: 'X'}))
        if basis == "ZXY":
            obs.append(pauli({i: 'Y'}))
    if use_pairs:
        for i in range(n_qubits):
            for j in range(i+1, n_qubits):
                obs.append(pauli({i: 'Z', j: 'Z'}))
    return obs

def _idx_from_param_name(name: str) -> int:
    if '[' in name and ']' in name: return int(name.split('[')[1].split(']')[0])
    if '_' in name: return int(name.split('_')[-1])
    digits = ''.join(ch for ch in name if ch.isdigit())
    return int(digits) if digits else 0

def build_random_sink(n_qubits=4, fmap_reps=1, ansatz_reps=2, rng=None):
    # rng may be None, an int seed, or an existing np.random.Generator (which is returned unchanged)
    rng = np.random.default_rng(rng)
    fmap = ZZFeatureMap(feature_dimension=n_qubits, reps=fmap_reps)
    ans  = EfficientSU2(num_qubits=n_qubits, reps=ansatz_reps, entanglement="linear")
    circ = fmap.compose(ans)

    # Freeze ansatz weights randomly for a QRF sink
    rand_theta = {p: float(rng.normal(0, 0.35)) for p in ans.parameters}
    circ = circ.assign_parameters(rand_theta, inplace=False)

    # Collect feature params in order x[0]..x[n-1]
    feat_params = [p for p in circ.parameters if p.name.startswith("x")]
    feat_params = sorted(feat_params, key=lambda p: _idx_from_param_name(p.name))
    assert len(feat_params) == n_qubits, f"Expected {n_qubits} feature params, got {len(feat_params)}"
    return circ, feat_params

def build_estimator():
    if AER_OK:
        try:
            return AerEstimator()
        except Exception:
            pass
    return Estimator()

def qrf_features_batched(estimator, sinks, observables, X_theta, batch_size=128, desc_prefix=""):
    """
    Given a list of sinks [(circ, feat_params)], a list of observables, and a matrix of angles X_theta,
    returns the quantum feature matrix Z of shape [N, len(sinks)*len(observables)] with expectation values.
    """
    N = X_theta.shape[0]; T = len(sinks); D = len(observables)
    Z = np.empty((N, T*D), dtype=float)
    for t, (circ, feat_params) in enumerate(sinks):
        col0 = t*D
        for s in tqdm(range(0, N, batch_size), desc=f"{desc_prefix}Sink {t+1}/{T}", leave=False):
            e = min(N, s+batch_size)
            circuits = []; obs_list = []
            for i in range(s, e):
                pmap = {feat_params[k]: float(X_theta[i, k]) for k in range(len(feat_params))}
                cb = circ.assign_parameters(pmap, inplace=False)
                circuits.extend([cb]*D)
                obs_list.extend(observables)
            vals = estimator.run(circuits, obs_list).result().values
            Z[s:e, col0:col0+D] = np.array(vals).reshape(e-s, D)
    return Z


# ======================= Main (Stacked QRF) =======================

def run_qrf_stacked(
    data_path="final_processed_als_data.csv",
    n_qubits=4,
    topk_for_proj=16,            # more classical features before PLS
    pls_components=4,
    seeds=(13, 37),
    zx_sinks=4,
    zxy_sinks=3,
    fmap_reps=1, ansatz_reps=2,
    use_pairs=True,
    batch_size=160,
    train_cap=None,              # cap per-fold TRAIN rows for quantum features (speed); final refit uses ALL
    strict_cv=True,              # currently unused; transforms are always fit per fold
    n_boot=5000,                 # bootstrap draws for test CIs
):
    t0 = time.time()
    # ---------- data
    X, y = load_xy(data_path)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=42)
    print(f"Split: train={X_tr.shape[0]} test={X_te.shape[0]}")

    # ---------- top-k selection (train-only)
    idxK, colsK = select_topk_features(X_tr, y_tr, k=topk_for_proj)

    # ---------- quantum config (sinks & observables)
    est = build_estimator()
    obs_ZX  = make_observables(n_qubits=n_qubits, basis="ZX",  use_pairs=use_pairs)
    obs_ZXY = make_observables(n_qubits=n_qubits, basis="ZXY", use_pairs=use_pairs)
    sink_specs = []
    for sd in seeds:
        rng = np.random.default_rng(sd)
        for _ in range(zx_sinks):
            sink_specs.append(("ZX",  build_random_sink(n_qubits, fmap_reps, ansatz_reps, rng)))
        for _ in range(zxy_sinks):
            sink_specs.append(("ZXY", build_random_sink(n_qubits, fmap_reps, ansatz_reps, rng)))

    # ---------- TEST-TIME transforms (fit on FULL train)
    impK_full = SimpleImputer(strategy="median")
    stdK_full = StandardScaler()
    XtrK_full = stdK_full.fit_transform(impK_full.fit_transform(X_tr.iloc[:, idxK]))
    XteK_full = stdK_full.transform(impK_full.transform(X_te.iloc[:, idxK]))

    pls_full = PLSRegression(n_components=pls_components, scale=False).fit(XtrK_full, y_tr)
    Xtr_p_full = pls_full.transform(XtrK_full)
    Xte_p_full = pls_full.transform(XteK_full)

    ang_full = MinMaxScaler(feature_range=(0.0, np.pi)).fit(Xtr_p_full)
    Xtr_th_full = ang_full.transform(Xtr_p_full)
    Xte_th_full = ang_full.transform(Xte_p_full)
    assert Xtr_th_full.shape[1] == n_qubits == pls_components, "pls_components must equal n_qubits"

    # Compute quantum features for ALL train rows (final refit) + ALL test rows
    D_ZX, D_ZXY = len(obs_ZX), len(obs_ZXY)
    total_D = zx_sinks*len(seeds)*D_ZX + zxy_sinks*len(seeds)*D_ZXY
    Z_tr_full = np.zeros((X_tr.shape[0], total_D), dtype=float)
    Z_te       = np.zeros((X_te.shape[0], total_D), dtype=float)

    col = 0
    for (basis, sink) in tqdm(sink_specs, desc="Quantum sinks (FINAL)"):
        obs = obs_ZX if basis == "ZX" else obs_ZXY
        Ztr_part = qrf_features_batched(est, [sink], obs, Xtr_th_full, batch_size=batch_size, desc_prefix="TR ")
        Zte_part = qrf_features_batched(est, [sink], obs, Xte_th_full, batch_size=batch_size, desc_prefix="TE ")
        Z_tr_full[:, col:col+len(obs)] = Ztr_part
        Z_te[:,      col:col+len(obs)] = Zte_part
        col += len(obs)

    # Append PLS comps to quantum features
    Z_tr_full = np.hstack([Z_tr_full, Xtr_p_full])  # (+pls_components)
    Z_te      = np.hstack([Z_te,      Xte_p_full])

    # Classical baseline features for RF (no scaling needed, just impute)
    imp_full = SimpleImputer(strategy="median")
    Xtr_full_base = imp_full.fit_transform(X_tr)
    Xte_full_base = imp_full.transform(X_te)

    # ---------- OOF stacking (strict CV: fit transforms per fold; avoids leakage)
    K = 5
    kf = KFold(n_splits=K, shuffle=True, random_state=42)

    oof_rf   = np.zeros_like(y_tr, dtype=float)
    oof_qrid = np.zeros_like(y_tr, dtype=float)
    oof_qhgb = np.zeros_like(y_tr, dtype=float)

    for fold, (tr_idx, va_idx) in enumerate(kf.split(X_tr), 1):
        # --- classical baseline per fold
        imp_b = SimpleImputer(strategy="median").fit(X_tr.iloc[tr_idx])
        Xtr_b = imp_b.transform(X_tr.iloc[tr_idx])
        Xva_b = imp_b.transform(X_tr.iloc[va_idx])

        rf = RandomForestRegressor(n_estimators=300, max_depth=None,
                                   random_state=100+fold, n_jobs=-1)
        rf.fit(Xtr_b, y_tr[tr_idx])
        oof_rf[va_idx] = rf.predict(Xva_b)

        # --- PLS → angles per fold (train-only fit)
        impK = SimpleImputer(strategy="median").fit(X_tr.iloc[tr_idx, idxK])
        stdK = StandardScaler().fit(impK.transform(X_tr.iloc[tr_idx, idxK]))

        XtrK = stdK.transform(impK.transform(X_tr.iloc[tr_idx, idxK]))
        XvaK = stdK.transform(impK.transform(X_tr.iloc[va_idx, idxK]))

        pls = PLSRegression(n_components=pls_components, scale=False).fit(XtrK, y_tr[tr_idx])
        trP = pls.transform(XtrK);  vaP = pls.transform(XvaK)

        ang = MinMaxScaler(feature_range=(0.0, np.pi)).fit(trP)
        trTH = ang.transform(trP);  vaTH = ang.transform(vaP)

        # --- quantum features per fold
        tr_rows = tr_idx
        if train_cap and len(tr_rows) > train_cap:
            tr_rows = np.sort(np.random.default_rng(123+fold).choice(tr_rows, size=train_cap, replace=False))

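        # Boolean mask over tr_idx (length len(tr_idx)) marking the rows kept after the cap;
        # it indexes fold-local arrays such as trTH and trP, not X_tr / y_tr directly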
        mask_tr_in_rows = np.isin(tr_idx, tr_rows)

        D_total = total_D
        Ztr_f = np.zeros((mask_tr_in_rows.sum(), D_total), dtype=float)
        Zva_f = np.zeros((len(va_idx), D_total), dtype=float)

        col = 0
        for (basis, sink) in tqdm(sink_specs, desc=f"Quantum sinks (fold {fold})", leave=False):
            obs = obs_ZX if basis == "ZX" else obs_ZXY
            Ztr_part = qrf_features_batched(est, [sink], obs, trTH[mask_tr_in_rows], batch_size=batch_size, desc_prefix="TRf ")
            Zva_part = qrf_features_batched(est, [sink], obs, vaTH,                batch_size=batch_size, desc_prefix="VAf ")
            Ztr_f[:, col:col+len(obs)] = Ztr_part
            Zva_f[:, col:col+len(obs)] = Zva_part
            col += len(obs)

        # Append PLS comps
        Ztr_f = np.hstack([Ztr_f, trP[mask_tr_in_rows]])
        Zva_f = np.hstack([Zva_f, vaP])

        # Whiten per fold for linear heads
        zsc = StandardScaler().fit(Ztr_f)
        Ztr_w = zsc.transform(Ztr_f)
        Zva_w = zsc.transform(Zva_f)

        # Heads (the mask is relative to tr_idx, so select the fold's targets first)
        y_fit = y_tr[tr_idx][mask_tr_in_rows]
        ridge = RidgeCV(alphas=np.array([0.03,0.1,0.3,1.0,3.0,10.0,30.0]), cv=3).fit(Ztr_w, y_fit)
        hgb   = HistGradientBoostingRegressor(max_depth=3, learning_rate=0.07, max_iter=400,
                                              l2_regularization=0.01, random_state=200+fold).fit(Ztr_w, y_fit)

        oof_qrid[va_idx] = ridge.predict(Zva_w)
        oof_qhgb[va_idx] = hgb.predict(Zva_w)

        tqdm.write(f"Fold {fold}/{K} done.")

    # Meta-learner on OOF preds
    X_oof = np.c_[oof_rf, oof_qrid, oof_qhgb]
    meta = Ridge(alpha=0.3, fit_intercept=True).fit(X_oof, y_tr)

    # ---------- FINAL refit on FULL train (paper-style)
    # baseline RF
    rf_full = RandomForestRegressor(n_estimators=300, random_state=123, n_jobs=-1).fit(Xtr_full_base, y_tr)

    # quantum heads on FULL
    zsc_full_refit = StandardScaler().fit(Z_tr_full)
    Ztr_w_full = zsc_full_refit.transform(Z_tr_full)
    Zte_w      = zsc_full_refit.transform(Z_te)

    ridge_full = RidgeCV(alphas=np.array([0.03,0.1,0.3,1.0,3.0,10.0,30.0]), cv=5).fit(Ztr_w_full, y_tr)
    hgb_full   = HistGradientBoostingRegressor(max_depth=3, learning_rate=0.07, max_iter=500,
                                               l2_regularization=0.01, random_state=999).fit(Ztr_w_full, y_tr)

    # Test predictions
    y_rf_te  = rf_full.predict(Xte_full_base)
    y_qr_te  = ridge_full.predict(Zte_w)
    y_qh_te  = hgb_full.predict(Zte_w)
    y_stack_te = meta.predict(np.c_[y_rf_te, y_qr_te, y_qh_te])

    # Metrics
    def report(tag, yhat):
        r = rmse(y_te, yhat); p = safe_pcc(y_te, yhat)
        print(f"{tag:<18} RMSE={r:.4f}  PCC={p:.4f}")
        return r, p

    print("\n===== RESULTS (Stacked QRF vs RF) =====")
    r_rf   = report("RF (baseline)", y_rf_te)
    r_qr   = report("QRFR-Ridge",   y_qr_te)
    r_qh   = report("QRFR-GBDT",    y_qh_te)
    r_meta = report("STACKED (meta)", y_stack_te)

    # Bootstrap CIs (test set)
    for tag, pred in [("RF", y_rf_te), ("QRFR-Ridge", y_qr_te), ("QRFR-GBDT", y_qh_te), ("STACKED", y_stack_te)]:
        rm_lo, rm_hi = bootstrap_ci(y_te, pred, rmse,    n_boot=n_boot, seed=202)
        pc_lo, pc_hi = bootstrap_ci(y_te, pred, safe_pcc, n_boot=n_boot, seed=303)
        print(f"{tag:<12}  RMSE 95% CI [{rm_lo:.4f}, {rm_hi:.4f}]   PCC 95% CI [{pc_lo:.4f}, {pc_hi:.4f}]")

    # Feature-dimension summary (total_D was computed before the final quantum pass)
    print("\nDims: ZX sinks/seed=%d  ZXY sinks/seed=%d  seeds=%d  -> Z_dims=%d (+PLS=%d) = %d"
          % (zx_sinks, zxy_sinks, len(seeds), total_D, Xtr_p_full.shape[1], Z_tr_full.shape[1]))
    print("Time: %.1fs   AER=%s" % (time.time()-t0, str(AER_OK)))

    return dict(rf=r_rf, qridge=r_qr, qgbdt=r_qh, stacked=r_meta)


if __name__ == "__main__":
    _ = run_qrf_stacked(
        data_path="final_processed_als_data.csv",
        n_qubits=4,
        topk_for_proj=16,
        pls_components=4,
        seeds=(13, 37),     # try (13,37,91) for a stronger model
        zx_sinks=4,         # bump to 5–6 for more capacity
        zxy_sinks=3,        # bump to 4–5 for more capacity
        fmap_reps=1,
        ansatz_reps=2,
        use_pairs=True,
        batch_size=160,
        train_cap=None,     # keep None for best final refit; set e.g. 800 for faster OOF
        strict_cv=True,
        n_boot=5000
    )


✓ Data loaded: X=(2022, 30), y=(2022,)
✓ Top-16 features: ['fvc_Subject_Liters_Trial_1_slope', 'alsfrs_ALSFRS_Total_slope', 'fvc_Subject_Liters_Trial_1_last', 'fvc_Subject_Liters_Trial_1_std', 'labs_AST(SGOT)_slope', 'vitals_Weight_slope', 'vitals_Pulse_std', 'Age', 'vitals_Vital_Signs_Delta_std', 'vitals_Blood_Pressure_Diastolic_std', 'alsfrs_ALSFRS_Total_std', 'labs_Sodium_slope', 'labs_Bicarbonate_slope', 'labs_Creatinine_slope', 'vitals_Pulse_median', 'labs_Laboratory_Delta_std']


Fold 1/5 done.
Fold 2/5 done.
Fold 3/5 done.
Fold 4/5 done.
Fold 5/5 done.

===== RESULTS (Stacked QRFR vs RF) =====
RF (baseline)      RMSD=0.5579  PCC=0.2764  R²=0.0731  ACC3=86.2%
QRFR-Ridge         RMSD=0.5817  PCC=0.1774  R²=-0.0075  ACC3=85.9%
QRFR-GBDT          RMSD=0.6010  PCC=0.1445  R²=-0.0757  ACC3=85.7%
STACKED (meta)     RMSD=0.5572  PCC=0.2770  R²=0.0756  ACC3=86.2%

Dims: ZX sinks/seed=4  ZXY sinks/seed=3  seeds=2  -> Z_dims=220 (+PLS=4) = 224
Time: 1380.4s   AER=False

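### Checking the `Dims` line by hand

Each sink contributes one expectation value per observable. With `n_qubits=4`, the `ZX` basis yields 4 Z and 4 X single-qubit terms plus 6 ZZ pairs (14 in total), and `ZXY` yields 12 single-qubit terms plus the same 6 pairs (18). The sketch below reproduces that count in plain Python, mirroring the logic of `make_observables` without needing Qiskit. `count_observables` is an illustrative helper, not a function from the notebook, and it only covers the two bases used in this run.

```python
from math import comb

def count_observables(n_qubits, basis, use_pairs=True):
    """Number of observables per sink for the 'ZX' or 'ZXY' basis (illustrative helper)."""
    singles_per_qubit = {"ZX": 2, "ZXY": 3}[basis]     # Z+X, or Z+X+Y, on each qubit
    n_pairs = comb(n_qubits, 2) if use_pairs else 0    # Z_i Z_j for every i < j
    return singles_per_qubit * n_qubits + n_pairs

# Settings taken from the run above
n_qubits, n_seeds, zx_sinks, zxy_sinks, pls_components = 4, 2, 4, 3, 4

d_zx = count_observables(n_qubits, "ZX")     # 8 singles + 6 pairs
d_zxy = count_observables(n_qubits, "ZXY")   # 12 singles + 6 pairs
z_dims = n_seeds * (zx_sinks * d_zx + zxy_sinks * d_zxy)

print(d_zx, d_zxy, z_dims, z_dims + pls_components)  # 14 18 220 224
```

The totals agree with the logged `Z_dims=220 (+PLS=4) = 224`. Raising `zx_sinks`, `zxy_sinks`, or the number of seeds grows the quantum feature block linearly, while adding qubits grows the pair terms quadratically.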