[Data] Track peak object store memory usage in benchmarks #63418

Merged
bveeramani merged 2 commits into ray-project:master from yuhuan130:feat/add-object-store-high-watermark on May 20, 2026
Contributor

@yuhuan130 yuhuan130 commented May 18, 2026

Why are these changes needed?

Ray Data benchmarks already report runtime and object store spilling, but they do not show how much object store memory was used at peak during a benchmark run.

This metric is useful for backpressure tuning. We want to know not only whether a run spilled or got slower, but also how much object store memory it used.

What changes were made?

This PR adds a lightweight sampler to the benchmark utility.

While each benchmark case is running, the sampler periodically reads aggregate object store memory stats and keeps the highest value it sees.

This adds two new benchmark metrics:

object_store_memory_used_peak_gb
object_store_memory_utilization_peak
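
A minimal sketch of the sampling approach described above, not the code merged in this PR. `PeakSampler` and the `read_stats` callable are hypothetical stand-ins: `read_stats` is assumed to return `(used_bytes, capacity_bytes)`, where the real benchmark reads aggregate object store stats from the cluster. `bytes_to_gb` is likewise an illustrative name for the GB conversion.

```python
import threading
import time
from typing import Callable, Tuple


class PeakSampler:
    """Polls a stats reader on a background thread and keeps the maxima seen."""

    def __init__(self, read_stats: Callable[[], Tuple[int, int]], interval_s: float = 0.1):
        # read_stats is a hypothetical stand-in returning (used_bytes, capacity_bytes).
        self._read_stats = read_stats
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak_used_bytes = 0
        self.peak_utilization = 0.0

    def _sample_once(self) -> None:
        used, capacity = self._read_stats()
        self.peak_used_bytes = max(self.peak_used_bytes, used)
        if capacity > 0:
            self.peak_utilization = max(self.peak_utilization, used / capacity)

    def _run(self) -> None:
        # Event.wait acts as an interruptible sleep; it returns True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self._sample_once()

    def start(self) -> None:
        self._sample_once()  # take one reading before the thread starts
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
        self._sample_once()  # final reading, taken after the thread has exited


def bytes_to_gb(b: int) -> float:
    return round(b / (1024**3), 4)


# Fake reader: usage rises to 80 bytes and then falls back to 20.
_readings = iter([(10, 100), (80, 100)])
sampler = PeakSampler(lambda: next(_readings, (20, 100)), interval_s=0.01)
sampler.start()
time.sleep(0.1)  # stand-in for the benchmark body
sampler.stop()
print(sampler.peak_used_bytes, sampler.peak_utilization)
```

The two attributes would map onto the two metrics above: the peak in bytes converted to GB, and the peak as a fraction of capacity.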

Why sample during the benchmark?

Spilled bytes are cumulative, so they can be measured with an end-minus-start delta.

Object store memory usage is different: it can go up during the benchmark and drop back down before the end. Sampling during the run lets us capture that peak.
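
To illustrate the difference, here is a small example with made-up readings (the numbers are hypothetical, not benchmark output). A delta recovers the total for a cumulative counter but says almost nothing about a value that rises and falls.

```python
# Hypothetical readings taken at five points during one benchmark run (bytes).
spilled_total = [0, 0, 50, 120, 120]  # cumulative counter: never decreases
store_used = [10, 400, 900, 300, 15]  # gauge: rises and falls

# End-minus-start is correct for the cumulative counter.
spilled_during_run = spilled_total[-1] - spilled_total[0]

# The same delta on the gauge hides the peak entirely.
used_delta = store_used[-1] - store_used[0]
used_peak = max(store_used)

print(spilled_during_run)  # 120
print(used_delta)          # 5
print(used_peak)           # 900
```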

Test

Smoke test with a 100 MB object store allocation:
[Screenshot of the smoke test output, taken 2026-05-19]


Closes #63417

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces an ObjectStoreMemorySampler to track peak object store usage and utilization during benchmarks using a background thread. It also adds new metrics to the BenchmarkMetric enum and integrates the sampler into the run_fn execution flow. Reviewer feedback identifies performance improvements, specifically regarding the caching of GlobalState to avoid expensive re-initializations and the removal of a redundant gRPC call by reusing stats already collected by the sampler.

Comment thread release/nightly_tests/dataset/benchmark.py Outdated
Comment thread release/nightly_tests/dataset/benchmark.py Outdated
@yuhuan130
Contributor Author

Will follow up and rebuild Ray locally for some additional testing once I’m off the plane 🙌

@ray-gardener ray-gardener Bot added the data, observability, and community-contribution labels on May 18, 2026
Signed-off-by: Alex Chien <alexchien130@gmail.com>
@yuhuan130 yuhuan130 force-pushed the feat/add-object-store-high-watermark branch from 0986c2c to 761f092 Compare May 18, 2026 17:45
@bveeramani bveeramani self-assigned this May 19, 2026
Comment on lines +18 to +21
def _get_object_store_stats(state):
    """Get aggregate object store stats across the cluster."""
    memory_info = get_memory_info_reply(state)
    return memory_info.store_stats
Member

This seems like a really thin abstraction (2 LOC). Given that it is only called in two places, I'm not sure it's worth abstracting.

return round(b / (1024**3), 4)


class ObjectStoreMemorySampler:
Member

Make this a context manager like MemoryProfiler? More canonical than try-finally for cleanup
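
A sketch of what the context-manager form could look like. `SamplerContext` and `read_used_bytes` are hypothetical names, and this makes no claim about how `MemoryProfiler` or the merged sampler is actually written.

```python
import threading
import time


class SamplerContext:
    """Context-manager wrapper around a background peak sampler (illustrative)."""

    def __init__(self, read_used_bytes, interval_s=0.1):
        # read_used_bytes is a hypothetical zero-argument callable returning an int.
        self._read = read_used_bytes
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak_used_bytes = 0

    def _sample(self):
        self.peak_used_bytes = max(self.peak_used_bytes, self._read())

    def _run(self):
        while not self._stop.wait(self._interval_s):
            self._sample()

    def __enter__(self):
        self._sample()
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs on normal exit and when the body raises, like a finally block.
        self._stop.set()
        self._thread.join()
        self._sample()
        return False  # do not swallow exceptions from the benchmark body


with SamplerContext(lambda: 42, interval_s=0.01) as sampler:
    time.sleep(0.05)  # stand-in for the benchmark body

print(sampler.peak_used_bytes)  # 42
```

Compared with an explicit try-finally, the `with` block ties the sampler thread's lifetime to the benchmark body, so the caller cannot forget the cleanup call.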

Comment on lines +46 to +47
self.peak_used_bytes = 0
self.peak_utilization = 0.0
Member

Nit: Make these properties, to make it more explicit that they are part of the public interface?
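
One way to read this suggestion, sketched under assumptions: keep the running maxima in private attributes and expose them through read-only properties. `PeakStats` and `record` are hypothetical names used only for illustration.

```python
class PeakStats:
    """Holds running maxima; the peaks are exposed as read-only properties."""

    def __init__(self):
        self._peak_used_bytes = 0
        self._peak_utilization = 0.0

    @property
    def peak_used_bytes(self) -> int:
        return self._peak_used_bytes

    @property
    def peak_utilization(self) -> float:
        return self._peak_utilization

    def record(self, used_bytes: int, capacity_bytes: int) -> None:
        # Hypothetical update hook; a sampler would call this for every reading.
        self._peak_used_bytes = max(self._peak_used_bytes, used_bytes)
        if capacity_bytes > 0:
            self._peak_utilization = max(
                self._peak_utilization, used_bytes / capacity_bytes
            )


stats = PeakStats()
stats.record(50, 100)
stats.record(20, 100)
print(stats.peak_used_bytes)   # 50
print(stats.peak_utilization)  # 0.5
```

Because the properties define no setter, assigning to `stats.peak_used_bytes` from outside raises `AttributeError`, which signals that callers are meant to read the peaks, not write them.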

Signed-off-by: Alex Chien <alexchien130@gmail.com>
@bveeramani bveeramani enabled auto-merge (squash) May 20, 2026 22:03
@github-actions github-actions Bot added the go label (add ONLY when ready to merge, run all tests) on May 20, 2026
@bveeramani bveeramani merged commit 7cd206a into ray-project:master May 20, 2026
8 checks passed
Development

Successfully merging this pull request may close these issues.

[Data] Add peak object store memory usage metric to Ray Data benchmarks

2 participants