Hi PyHealth team,
I have been working on an open-source ICU pathogen prediction pipeline using MIMIC-IV:
https://github.com/netanelcyber/PenuX
During development we noticed something that may be relevant for the broader clinical ML community:
Observation
Many ICU prediction pipelines report strong AUROC values, but relatively few evaluate:
- temporal drift
- calibration stability
- subgroup robustness
- rare-class confidence behavior
- leakage sensitivity across ICU workflows
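To make "calibration stability" concrete, here is a minimal sketch of expected calibration error (ECE) for a binary outcome, computed separately per time period. It is for illustration only and is not taken from PenuX or PyHealth; the function names and the `periods` labels are assumptions, and it uses equal-width bins as one common convention.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binary ECE using equal-width bins over the predicted probability."""
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp p == 1.0 into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        # Weight each bin's confidence/outcome gap by its share of samples.
        ece += (len(members) / total) * abs(mean_conf - frac_pos)
    return ece


def ece_by_period(
    probs: Sequence[float],
    labels: Sequence[int],
    periods: Sequence[Hashable],
    n_bins: int = 10,
) -> Dict[Hashable, float]:
    """ECE computed separately for each (hypothetical) time-period label."""
    grouped = defaultdict(lambda: ([], []))
    for p, y, t in zip(probs, labels, periods):
        grouped[t][0].append(p)
        grouped[t][1].append(y)
    return {
        t: expected_calibration_error(ps, ys, n_bins)
        for t, (ps, ys) in grouped.items()
    }
```

A large gap between the per-period values would suggest that calibration is drifting over time even when the pooled AUROC looks stable.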
In our experiments, relatively small preprocessing choices produced unexpectedly large differences in:
- pathogen ranking stability
- calibration curves
- minority-class behavior
- results on temporal hold-out splits (used as a proxy for external validation)
This raises a broader reproducibility question:
Are current clinical ML benchmark pipelines sufficiently robust to temporal and operational shifts commonly seen in ICU environments?
Potential contribution ideas
I would be interested in contributing PyHealth-compatible examples for:
- reproducible MIMIC-IV infection/pathogen prediction
- calibration-first evaluation
- temporal split utilities
- subgroup robustness analysis
- uncertainty estimation benchmarks
- eICU transfer experiments
- clinically interpretable evaluation templates
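As a rough idea of what a temporal split utility could look like, here is a minimal sketch that splits records at a cutoff date and keeps the two sides patient-disjoint. It is not an existing PyHealth API; `temporal_split` and the field names `admit_date` and `patient_id` are hypothetical. Dropping earlier admissions of test-side patients from training is one possible leakage guard, not the only reasonable choice.

```python
from datetime import date
from typing import Dict, List, Tuple

Record = Dict[str, object]


def temporal_split(
    records: List[Record],
    cutoff: date,
    time_key: str = "admit_date",      # hypothetical field name
    patient_key: str = "patient_id",   # hypothetical field name
) -> Tuple[List[Record], List[Record]]:
    """Split records at `cutoff`; test holds everything on or after it."""
    test = [r for r in records if r[time_key] >= cutoff]
    test_patients = {r[patient_key] for r in test}
    # Exclude earlier stays of patients who also appear in the test period,
    # so that no patient contributes to both sides of the split.
    train = [
        r
        for r in records
        if r[time_key] < cutoff and r[patient_key] not in test_patients
    ]
    return train, test
```

Since MIMIC-IV dates are shifted for de-identification, a real split would probably need to be keyed on a coarse period field such as `anchor_year_group` rather than on raw admission dates.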
Technical direction
Current experiments include combinations of:
- tabular clinical variables
- vitals/labs trajectories
- ICU time-series aggregation
- calibration analysis
- ranking-oriented evaluation rather than pure hard classification
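To illustrate what ranking-oriented evaluation means here, the following minimal sketch computes top-k accuracy and mean reciprocal rank over per-sample pathogen scores. It is an assumption about how such metrics could be set up, not the PenuX implementation; the function names and the example pathogen labels are illustrative.

```python
from typing import Dict, List, Sequence


def rank_of_truth(scores: Dict[str, float], truth: str) -> int:
    """1-based rank of the true label; ties are broken by label name."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return ordered.index(truth) + 1


def top_k_accuracy(
    all_scores: Sequence[Dict[str, float]], truths: Sequence[str], k: int = 3
) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    hits = sum(
        1 for s, t in zip(all_scores, truths) if rank_of_truth(s, t) <= k
    )
    return hits / len(truths)


def mean_reciprocal_rank(
    all_scores: Sequence[Dict[str, float]], truths: Sequence[str]
) -> float:
    """Average of 1 / rank of the true label across samples."""
    ranks: List[int] = [
        rank_of_truth(s, t) for s, t in zip(all_scores, truths)
    ]
    return sum(1.0 / r for r in ranks) / len(ranks)
```

Unlike hard-classification accuracy, these metrics give credit when the true pathogen is ranked near the top even if it is not the single highest-scoring class.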
Questions for maintainers/community
- Are there existing efforts around temporal robustness benchmarking in PyHealth?
- Would calibration-focused benchmark examples be useful for the project?
- Is there interest in a community benchmark around ICU robustness / reproducibility failures?
I would also appreciate feedback from others working on:
- MIMIC-IV
- eICU
- clinical foundation models
- ICU time-series
- trustworthy medical AI
Project:
https://github.com/netanelcyber/PenuX
Thanks for building PyHealth. It has been extremely useful for rapid experimentation in clinical ML.