bench: add Polars streaming vs in-memory loader benchmark #871

Closed
AhmedAli58 wants to merge 2 commits into sunlabuiuc:master from AhmedAli58:bench/polars-streaming-loader-benchmark

Conversation

@AhmedAli58

What this does

Adds a benchmarking script that compares PyHealth's new Polars streaming data loader
against the legacy in-memory loader across RAM usage, wall-clock time, and throughput
at 3 dataset scales (100, 1k, 5k patients).
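
For readers without the script at hand, here is a minimal sketch of how such a comparison could be structured. It is not the contents of benchmarks/loader_benchmark.py: the two loader functions are hypothetical stand-ins (a list versus a generator), not PyHealth's loaders, and the helper names are made up for illustration. It also uses `tracemalloc`, which only sees Python-level allocations; memory allocated natively by Polars would need a process-level measure such as RSS instead.

```python
import time
import tracemalloc
from typing import Callable, Dict, Iterable, List


def benchmark_loader(
    name: str, load: Callable[[int], Iterable], n_patients: int
) -> Dict[str, object]:
    """Run one loader once; record wall-clock time, peak memory, throughput."""
    tracemalloc.start()
    start = time.perf_counter()
    count = sum(1 for _ in load(n_patients))  # consume every sample
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "loader": name,
        "n_patients": n_patients,
        "samples": count,
        "seconds": elapsed,
        "peak_mib": peak_bytes / 2**20,  # Python-level allocations only
        "samples_per_sec": count / elapsed if elapsed > 0 else float("inf"),
    }


# Hypothetical stand-ins, NOT PyHealth APIs.
def in_memory_loader(n: int) -> List[dict]:
    # Materializes every sample before returning.
    return [{"patient_id": i} for i in range(n)]


def streaming_loader(n: int) -> Iterable[dict]:
    # Yields one sample at a time.
    return ({"patient_id": i} for i in range(n))


def run_all(scales=(100, 1_000, 5_000)) -> List[Dict[str, object]]:
    """Benchmark both stand-in loaders at each dataset scale."""
    rows = []
    for n in scales:
        rows.append(benchmark_loader("in_memory", in_memory_loader, n))
        rows.append(benchmark_loader("streaming", streaming_loader, n))
    return rows


if __name__ == "__main__":
    for row in run_all():
        print(row)
```

A single run per configuration is noisy; repeating each measurement and reporting the median would give more stable numbers.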

Why it matters

The new streaming loader was added without systematic benchmarks. This PR gives
maintainers and users data to make informed decisions about which loader to use
based on dataset size and available memory.

How to run

python benchmarks/loader_benchmark.py

Results

See benchmarks/results.csv and benchmarks/benchmark_chart.png for outputs.

@Logiquo
Collaborator

Logiquo commented Feb 24, 2026

InMemorySampleDataset is primarily designed for unit tests; it is not as battle-tested as SampleDataset.

Also, I think InMemorySampleDataset lacks some functionality compared with SampleDataset.

@jhnwu3
Collaborator

jhnwu3 commented Feb 24, 2026

Hey we did run some tests across a variety of things to make our decision: https://pyhealth.readthedocs.io/en/latest/why_pyhealth.html

Will be closing this PR. Would love to talk more on the Discord if you want to discuss benchmarking PyHealth 2.0 in general.
