# ICU Preprocessing Pipeline with ReciPies

This notebook demonstrates a **complete end-to-end preprocessing pipeline** for ICU time-series data using ReciPies (see https://github.com/rvandewater/YAIB for more info). We'll cover:

1. **Data Loading**: Loading dynamic measurements, static features, and outcomes from parquet files
2. **Train/Test Split**: Proper group-level splitting to prevent data leakage
3. **Multi-Step Pipeline**:
   - Missing value imputation (forward fill + zero fill)
   - Feature scaling (standardization)
   - Custom domain-specific features
   - Label encoding of categorical features
   - Historical feature engineering (mean, min, max, and variance over each stay's history)
4. **Baking the Data**: Applying the preprocessing pipeline to both training and test sets
5. **Model Training**: Using the preprocessed data to train a machine learning model

The pipeline uses **Polars** for high-performance data processing, with ReciPies handling all preprocessing steps while maintaining column role information throughout the transformation pipeline.


## 1. Load ICU Data

We start by loading the ICU demo data, which consists of three components:
- **Dynamic data**: Time-varying measurements (vitals, lab values) recorded at regular intervals
- **Static data**: Patient-level features that don't change over time (age, sex, height, weight)
- **Outcome data**: The target variable we want to predict (mortality at 24 hours)

Let's examine the structure of each dataset.


In [None]:
import numpy as np
import polars as pl
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

from recipies import Ingredients, Recipe
from recipies.selector import all_predictors, all_numeric_predictors, has_role, has_type, all_of
from recipies.step import StepImputeFill, StepHistorical, StepScale, StepFunction, Accumulator, StepSklearn

dynamic_data = pl.read_parquet("../../examples/icu_demo_data/mortality24/eicu_demo/dyn.parquet")
static_data = pl.read_parquet("../../examples/icu_demo_data/mortality24/eicu_demo/sta.parquet")
outcome = pl.read_parquet("../../examples/icu_demo_data/mortality24/eicu_demo/outc.parquet")
print("Columns:")
print(f"dynamic: {dynamic_data.columns}")
print(f"static: {static_data.columns}")
print(f"outcome: {outcome.columns}")
print("Shapes:")
print(f"dynamic: {dynamic_data.shape}")
print(f"static: {static_data.shape}")
print(f"outcome: {outcome.shape}")
print("Heads:")
print(dynamic_data.head())
print(static_data.head())
print(outcome.head())

## 2. Train/Test Split

**Critical**: We perform a **group-level split** at the `stay_id` level. All records for a given patient stay are assigned to either the training set or the test set, never both. Otherwise, rows from the same stay could appear on both sides of the split and leak test information into training.

We shuffle the unique `stay_id` values and assign 80% of stays to training and 20% to testing. The split is grouped by stay; it is not stratified by label:


In [None]:
# Train/test split at the stay_id level (group-level split)
# This ensures all records for a given stay go to either train or test

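# Get unique stay_ids; sort first because unique() does not guarantee row order,
# then shuffle explicitly so the seeded split is reproducible
unique_stays = outcome.select("stay_id").unique().sort("stay_id").sample(fraction=1.0, shuffle=True, seed=42)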
n_train = int(len(unique_stays) * 0.8)
train_stay_ids = unique_stays.head(n_train)["stay_id"].to_list()
test_stay_ids = unique_stays.tail(len(unique_stays) - n_train)["stay_id"].to_list()

# Split dynamic, static, and outcome data
dynamic_train = dynamic_data.filter(pl.col("stay_id").is_in(train_stay_ids))
dynamic_test = dynamic_data.filter(pl.col("stay_id").is_in(test_stay_ids))

static_train = static_data.filter(pl.col("stay_id").is_in(train_stay_ids))
static_test = static_data.filter(pl.col("stay_id").is_in(test_stay_ids))

outcome_train = outcome.filter(pl.col("stay_id").is_in(train_stay_ids))
outcome_test = outcome.filter(pl.col("stay_id").is_in(test_stay_ids))

# Join train data
df_train = dynamic_train.join(static_train, on="stay_id", how="left")
df_train = df_train.join(outcome_train.select(["stay_id", "label"]), on="stay_id", how="left")

# Join test data
df_test = dynamic_test.join(static_test, on="stay_id", how="left")
df_test = df_test.join(outcome_test.select(["stay_id", "label"]), on="stay_id", how="left")

print(f"Train: {len(df_train)} rows, {len(train_stay_ids)} stays")
print(f"Test: {len(df_test)} rows, {len(test_stay_ids)} stays")

In [None]:
# Quick check: verify we have the expected columns after joining
print(f"Train dataframe columns: {len(df_train.columns)}")
print(f"Test dataframe columns: {len(df_test.columns)}")

## 3. Build Preprocessing Pipeline

Now we'll create a comprehensive preprocessing pipeline using ReciPies. The pipeline includes:

1. **Role Assignment**: Define which columns are outcomes, predictors, groups (`stay_id`), and sequences (`time`)
2. **Imputation**: Forward fill followed by zero fill for any remaining missing values
3. **Feature Scaling**: Standardize numeric predictors (mean=0, std=1)
4. **Custom Features**: Add domain-specific features like heart rate to temperature ratio
5. **Categorical Encoding**: Impute string and categorical columns with their most frequent value, then label encode them
6. **Historical Features**: Create mean, min, max, and variance aggregations over time within each stay

The key advantage of ReciPies is that all transformations maintain column role information, ensuring proper handling of grouped time-series data.


In [None]:
# Initialize Ingredients
ing = Ingredients(df_train)

# Define and build the recipe
rec = Recipe(
    ing,
    outcomes=["label"],
    predictors=[c for c in ing.columns if c not in {"label", "stay_id", "time"}],
    groups=["stay_id"],
    sequences=["time"],
)

# Impute missing values: forward fill first, then zero fill whatever is still missing
rec.add_step(StepImputeFill(sel=all_predictors(), strategy="forward"))
rec.add_step(StepImputeFill(sel=all_predictors(), strategy="zero"))

# Standardize numeric predictors (after imputation)
rec.add_step(StepScale(sel=all_numeric_predictors(), with_mean=True, with_std=True))


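# Add a custom domain feature (example: hr/temp ratio) via StepFunction.
# Note: this step runs after StepScale, so the ratio is computed on standardized
# values; add it before the scaling step if a ratio in original units is wanted.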
def add_custom_features(ingr: Ingredients) -> Ingredients:
    df_ = ingr.get_df()
    if all(col in df_.columns for col in ["hr", "temp"]):
        df_ = df_.with_columns((pl.col("hr") / pl.col("temp")).alias("hr_temp_ratio"))
        ingr.set_df(df_)
        ingr.update_role("hr_temp_ratio", "predictor")
    return ingr


rec.add_step(StepFunction(sel=has_role(["predictor"]), function=add_custom_features))

# Impute categorical features with their most frequent value, then label encode them
types = ["String", "Object", "Categorical"]
rec.add_step(StepSklearn(SimpleImputer(missing_values=np.nan, strategy="most_frequent"), sel=has_type(types)))
rec.add_step(StepSklearn(LabelEncoder(), sel=has_type(types), columnwise=True))

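# Capture the fixed list of original numeric predictors so that each historical
# step aggregates the original columns, not features created by earlier steps
original_predictors = all_of(list(all_numeric_predictors()(ing)))

# Historical features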
rec.add_step(StepHistorical(sel=original_predictors, fun=Accumulator.MEAN, suffix="_mean_hist"))
rec.add_step(StepHistorical(sel=original_predictors, fun=Accumulator.MIN, suffix="_min_hist"))
rec.add_step(StepHistorical(sel=original_predictors, fun=Accumulator.MAX, suffix="_max_hist"))
rec.add_step(StepHistorical(sel=original_predictors, fun=Accumulator.VAR, suffix="_var_hist"))

# Prep and bake (fit and transform) the training data
train_baked = rec.prep()
print(train_baked.head())
print(train_baked.columns)
print(len(train_baked.columns))
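
The `*_hist` columns can be read as running summaries of each stay's history. The NumPy sketch below shows this for a single stay, assuming that each historical value at time `t` aggregates all observations up to and including `t`, and that the variance is the sample variance (`ddof=1`). Both points are assumptions about the step's behavior, not something confirmed here.

```python
import numpy as np

# Heart rate for one stay, ordered by time (values are made up).
hr = np.array([80.0, 90.0, 70.0, 100.0])
n_seen = np.arange(1, len(hr) + 1)

# Running aggregates over everything observed so far.
hist_mean = np.cumsum(hr) / n_seen
hist_min = np.minimum.accumulate(hr)
hist_max = np.maximum.accumulate(hr)

# Sample variance is undefined for a single observation, so the first entry is NaN.
hist_var = np.array(
    [np.var(hr[: i + 1], ddof=1) if i > 0 else np.nan for i in range(len(hr))]
)

print("mean:", hist_mean)
print("min: ", hist_min)
print("max: ", hist_max)
print("var: ", hist_var)
```

Under these assumptions, the first row of every stay has no defined variance, which is one way missing values can reappear after imputation.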

## 4. Apply Pipeline to Test Data

Once the recipe is fitted on the training data using `prep()`, we can apply the same transformations to the test data using `bake()`. This ensures:

- **No data leakage**: Test data statistics are never used to fit the pipeline
- **Consistent transformations**: The same preprocessing steps are applied identically to both datasets
- **Reproducibility**: The fitted recipe can be saved and reused on new data

The `bake()` method applies the fitted transformations without refitting, so fitted statistics such as the scaling mean and standard deviation come from the training set only.
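
The fit-on-train, apply-to-test pattern can be sketched with a tiny standardizer in NumPy. `ToyScaler` is a hypothetical stand-in used only to illustrate the principle behind `prep()` and `bake()`; it is not how ReciPies implements scaling.

```python
import numpy as np


class ToyScaler:
    """Hypothetical standardizer: learns mean and std in fit(), reuses them in transform()."""

    def fit(self, values):
        self.mean_ = values.mean()
        self.std_ = values.std()
        return self

    def transform(self, values):
        return (values - self.mean_) / self.std_


train_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test_values = np.array([3.0, 10.0])

# "prep": statistics are estimated from the training data only.
scaler = ToyScaler().fit(train_values)
train_scaled = scaler.transform(train_values)

# "bake": the test data is transformed with the stored training statistics.
test_scaled = scaler.transform(test_values)

print(f"train mean/std after scaling: {train_scaled.mean():.2f} / {train_scaled.std():.2f}")
print(f"test values after scaling: {test_scaled}")
```

The test values are not centered at zero after scaling, and they should not be: refitting on the test set would hide any shift between the two sets.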


In [None]:
test_baked = rec.bake(df_test)
print(test_baked.head())
print(test_baked.columns)
print(len(test_baked.columns))

## 5. Train a Machine Learning Model

With our preprocessed data ready, we can now train a machine learning model. The preprocessed dataframes contain:
- All original features (scaled and imputed)
- Historical aggregated features
- Label-encoded categorical variables
- Custom domain features (e.g., hr/temp ratio)

For demonstration, we'll use a simple logistic regression model, but you can use any scikit-learn compatible model or more advanced methods like XGBoost, LightGBM, or neural networks.


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
import numpy as np

# Extract features and labels
# Exclude outcome, group, and sequence columns from features
feature_cols = [c for c in train_baked.columns if c not in ["label", "stay_id", "time"]]

X_train = train_baked.select(feature_cols).to_numpy()
y_train = train_baked.select("label").to_numpy().ravel()

X_test = test_baked.select(feature_cols).to_numpy()
y_test = test_baked.select("label").to_numpy().ravel()

# Handle any remaining NaN values (should be minimal after preprocessing)
X_train = np.nan_to_num(X_train, nan=0.0)
X_test = np.nan_to_num(X_test, nan=0.0)

print(f"Training set: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Test set: {X_test.shape[0]} samples, {X_test.shape[1]} features")
print(f"Class distribution (train): {np.bincount(y_train)}")
print(f"Class distribution (test): {np.bincount(y_test)}")

# Train model
model = LogisticRegression(max_iter=1000, random_state=42, class_weight="balanced")
model.fit(X_train, y_train)

# Predictions
y_train_pred = model.predict_proba(X_train)[:, 1]
y_test_pred = model.predict_proba(X_test)[:, 1]

# Evaluate
train_auc = roc_auc_score(y_train, y_train_pred)
test_auc = roc_auc_score(y_test, y_test_pred)

print("\nModel Performance:")
print(f"Train AUC: {train_auc:.4f}")
print(f"Test AUC: {test_auc:.4f}")

print("\nClassification Report (Test Set):")
print(classification_report(y_test, model.predict(X_test), target_names=["No Mortality", "Mortality"]))
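
Mortality labels are usually imbalanced, which is why the model above uses `class_weight="balanced"`. scikit-learn documents this setting as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below evaluates that formula on a made-up label vector.

```python
import numpy as np

# Made-up labels: 90 survivors (0) and 10 deaths (1).
y_toy = np.array([0] * 90 + [1] * 10)

class_counts = np.bincount(y_toy)
n_classes = len(class_counts)

# Balanced weights: inversely proportional to class frequency.
balanced_weights = len(y_toy) / (n_classes * class_counts)

# Total weight per class: each class contributes equally to the loss.
weighted_totals = class_counts * balanced_weights

print(f"class counts: {class_counts}")
print(f"balanced weights: {balanced_weights}")
print(f"total weight per class: {weighted_totals}")
```

With these weights, the 10 positive rows carry the same total weight as the 90 negative rows.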