Reproducibility gap in ICU pathogen prediction benchmarks (MIMIC-IV) #1154

netanelcyber · 2026-05-29T13:06:12Z

Hi PyHealth team,

I have been working on an open-source ICU pathogen prediction pipeline using MIMIC-IV:
https://github.com/netanelcyber/PenuX

During development we noticed something that may be relevant for the broader clinical ML community:

Many ICU prediction pipelines report strong AUROC values, but relatively few evaluate:

In our experiments, relatively small preprocessing choices produced unexpectedly large differences in:

This raises a broader reproducibility question:

Are current clinical ML benchmark pipelines sufficiently robust to temporal and operational shifts commonly seen in ICU environments?

I would be interested in contributing PyHealth-compatible examples for:

Current experiments include combinations of:

Are there existing efforts around temporal robustness benchmarking in PyHealth?
Would calibration-focused benchmark examples be useful for the project?
Is there interest in a community benchmark around ICU robustness / reproducibility failures?
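
To make the calibration and temporal-robustness questions concrete, here is a minimal sketch of the kind of check I have in mind. It is plain standard-library Python, not PyHealth API and not the PenuX code: the function names (`auroc`, `expected_calibration_error`, `temporal_split`, `report`), the `(year, label, probability)` row layout, the cutoff year, and the toy rows are all hypothetical and invented for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[int, int, float]  # (year, true label 0/1, predicted probability), hypothetical layout


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def expected_calibration_error(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: size-weighted mean of |observed rate - mean predicted probability|."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((y, p))
    total = len(y_true)
    ece = 0.0
    for b in bins:
        if b:
            observed = sum(y for y, _ in b) / len(b)
            predicted = sum(p for _, p in b) / len(b)
            ece += (len(b) / total) * abs(observed - predicted)
    return ece


def temporal_split(rows: Sequence[Row], cutoff_year: int) -> Tuple[List[Row], List[Row]]:
    """Split rows into those strictly before the cutoff year and those at or after it."""
    early = [r for r in rows if r[0] < cutoff_year]
    late = [r for r in rows if r[0] >= cutoff_year]
    return early, late


def report(rows: Sequence[Row]) -> Dict[str, float]:
    """Discrimination and calibration for one time slice."""
    labels = [r[1] for r in rows]
    probs = [r[2] for r in rows]
    return {"auroc": auroc(labels, probs), "ece": expected_calibration_error(labels, probs)}


if __name__ == "__main__":
    # Toy rows, invented for illustration; not results from MIMIC-IV or any real model.
    toy_rows: List[Row] = [
        (2015, 0, 0.10), (2015, 1, 0.80), (2016, 0, 0.40), (2016, 1, 0.35),
        (2019, 1, 0.60), (2019, 0, 0.70), (2020, 1, 0.90), (2020, 0, 0.20),
    ]
    early, late = temporal_split(toy_rows, cutoff_year=2018)
    print("early:", report(early))
    print("late: ", report(late))
```

The point of reporting both numbers per time slice is that a model can keep a similar AUROC across slices while its calibration drifts, which a single pooled AUROC would hide. A real benchmark example would likely replace these helpers with established implementations and use the dataset's own time fields.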

I would also appreciate feedback from others working on:

Thanks again for building PyHealth — it has been extremely useful for rapid experimentation in clinical ML.