Reproducibility gap in ICU pathogen prediction benchmarks (MIMIC-IV)

Hi PyHealth team,

I have been working on an open-source ICU pathogen prediction pipeline using MIMIC-IV:
https://github.com/netanelcyber/PenuX

During development we noticed something that may be relevant for the broader clinical ML community:

## Observation

Many ICU prediction pipelines report strong AUROC values, but relatively few evaluate:

- temporal drift
- calibration stability
- subgroup robustness
- rare-class confidence behavior
- leakage sensitivity across ICU workflows

In our experiments, relatively small preprocessing choices produced unexpectedly large differences in:
- pathogen rankings and their stability
- calibration curves
- minority-class behavior
- performance on temporal splits used as a proxy for external validation
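
To make the calibration point concrete, here is a minimal sketch of the kind of check I have in mind: a binned expected calibration error (ECE) for binary predictions, in plain Python. The function name, the equal-width binning, and the default of 10 bins are illustrative assumptions, not PenuX or PyHealth code.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width binned ECE for binary predictions (illustrative sketch)."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if len(probs) == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to one of n_bins equal-width probability bins.
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    # Weighted average of |mean confidence - observed positive rate| per bin.
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - pos_rate)
    return ece
```

Computing this separately per time window or per subgroup, rather than once on a pooled test set, is one simple way to see whether calibration holds up under a change in preprocessing.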

This raises a broader reproducibility question:

> Are current clinical ML benchmark pipelines sufficiently robust to temporal and operational shifts commonly seen in ICU environments?

## Potential contribution ideas

I would be interested in contributing PyHealth-compatible examples for:

- reproducible MIMIC-IV infection/pathogen prediction
- calibration-first evaluation
- temporal split utilities
- subgroup robustness analysis
- uncertainty estimation benchmarks
- eICU transfer experiments
- clinically interpretable evaluation templates
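
As a rough sketch of what a temporal split utility could look like, the function below trains on records before a cutoff and tests on later records, dropping test records from patients already seen in training. This is one possible leakage policy among several, not an existing PyHealth API. Because MIMIC-IV shifts dates per patient for de-identification, the time field would in practice have to be something coarse such as `anchor_year_group`; the `year` and `patient_id` keys here are placeholders.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def temporal_split(
    records: List[Record],
    cutoff: int,
    time_key: str = "year",          # placeholder field name
    patient_key: str = "patient_id",  # placeholder field name
) -> Tuple[List[Record], List[Record]]:
    """Split records at `cutoff`, keeping train and test patients disjoint."""
    # Everything strictly before the cutoff is available for training.
    train = [r for r in records if r[time_key] < cutoff]
    train_patients = {r[patient_key] for r in train}

    # Later records are tested on only if the patient was never seen in training.
    test = [
        r
        for r in records
        if r[time_key] >= cutoff and r[patient_key] not in train_patients
    ]
    return train, test
```

Dropping overlapping patients from the test side costs some test data but keeps the split strictly forward in time; other policies (for example, assigning each patient by first admission) trade this off differently.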

## Technical direction

Current experiments include combinations of:
- tabular clinical variables
- vitals/labs trajectories
- ICU time-series aggregation
- calibration analysis
- ranking-oriented evaluation rather than pure hard classification
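
By ranking-oriented evaluation I mean scoring where the true pathogen lands in the predicted ordering, rather than only whether the top prediction is correct. A minimal sketch, assuming each prediction is a dict of per-pathogen scores; the function names and input shape are illustrative assumptions:

```python
from typing import Dict, Sequence


def top_k_accuracy(
    scores: Sequence[Dict[str, float]], truths: Sequence[str], k: int = 3
) -> float:
    """Fraction of cases whose true label is among the k highest-scored labels."""
    if len(scores) != len(truths) or len(truths) == 0:
        raise ValueError("scores and truths must be non-empty and equal length")
    hits = 0
    for case_scores, truth in zip(scores, truths):
        ranked = sorted(case_scores, key=case_scores.get, reverse=True)
        hits += truth in ranked[:k]
    return hits / len(truths)


def mean_reciprocal_rank(
    scores: Sequence[Dict[str, float]], truths: Sequence[str]
) -> float:
    """Average of 1/rank of the true label; a label that was not scored counts as 0."""
    if len(scores) != len(truths) or len(truths) == 0:
        raise ValueError("scores and truths must be non-empty and equal length")
    total = 0.0
    for case_scores, truth in zip(scores, truths):
        ranked = sorted(case_scores, key=case_scores.get, reverse=True)
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(truths)
```

A model can look weak on top-1 accuracy while still placing the true pathogen in the top few candidates, which is arguably closer to how a ranked differential would be used.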

## Questions for maintainers/community

1. Are there existing efforts around temporal robustness benchmarking in PyHealth?
2. Would calibration-focused benchmark examples be useful for the project?
3. Is there interest in a community benchmark around ICU robustness / reproducibility failures?

I would also appreciate feedback from others working on:
- MIMIC-IV
- eICU
- clinical foundation models
- ICU time-series
- trustworthy medical AI

Thanks for building PyHealth. It has been extremely useful for rapid experimentation in clinical ML.
