[Data] Track peak object store memory usage in benchmarks #63418
Code Review
This pull request introduces an ObjectStoreMemorySampler to track peak object store usage and utilization during benchmarks using a background thread. It also adds new metrics to the BenchmarkMetric enum and integrates the sampler into the run_fn execution flow. Reviewer feedback identifies performance improvements, specifically regarding the caching of GlobalState to avoid expensive re-initializations and the removal of a redundant gRPC call by reusing stats already collected by the sampler.
Will follow up and rebuild Ray locally for some additional testing once I’m off the plane 🙌
Signed-off-by: Alex Chien <alexchien130@gmail.com>
0986c2c to 761f092
def _get_object_store_stats(state):
    """Get aggregate object store stats across the cluster."""
    memory_info = get_memory_info_reply(state)
    return memory_info.store_stats
This seems like a really thin abstraction (2 LOC). Given that it is only called in two places, I'm not sure it is worth abstracting.
    return round(b / (1024**3), 4)


class ObjectStoreMemorySampler:
Make this a context manager like MemoryProfiler? That is more canonical than try-finally for cleanup.
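A minimal sketch of how this suggestion could look, not the PR's actual code: the background thread starts in `__enter__` and stops in `__exit__`, so cleanup happens even if the benchmark body raises. The class name `PeakSampler`, the `read_used_bytes` callable, and the `interval_s` parameter are assumptions made for illustration; the real sampler reads aggregate object store stats from the cluster.

```python
import threading
import time


class PeakSampler:
    """Hypothetical stand-in for ObjectStoreMemorySampler as a context manager."""

    def __init__(self, read_used_bytes, interval_s=0.01):
        # read_used_bytes: assumed zero-argument callable returning current usage.
        self._read_used_bytes = read_used_bytes
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak_used_bytes = 0

    def _sample(self):
        self.peak_used_bytes = max(self.peak_used_bytes, self._read_used_bytes())

    def _run(self):
        # Event.wait returns False on timeout, so this loops until stop is set.
        while not self._stop.wait(self._interval_s):
            self._sample()

    def __enter__(self):
        self._sample()  # take one sample before the thread starts
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        self._sample()  # final sample after the thread has stopped
        return False  # do not swallow exceptions from the benchmark body
```

A caller would then wrap the benchmark body in `with PeakSampler(read_used_bytes=...) as sampler:` and read `sampler.peak_used_bytes` afterwards, with no explicit try-finally.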
        self.peak_used_bytes = 0
        self.peak_utilization = 0.0
Nit: Make these properties to make it more explicit that they are part of the public interface?
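A sketch of what this nit might look like, assuming the peaks are kept in private attributes and exposed read-only. The class name `PeakStats` and the `_record` helper are hypothetical and not part of the PR.

```python
class PeakStats:
    """Hypothetical holder showing peaks exposed as read-only properties."""

    def __init__(self):
        self._peak_used_bytes = 0
        self._peak_utilization = 0.0

    @property
    def peak_used_bytes(self) -> int:
        """Highest object store usage seen so far, in bytes."""
        return self._peak_used_bytes

    @property
    def peak_utilization(self) -> float:
        """Highest used/capacity ratio seen so far."""
        return self._peak_utilization

    def _record(self, used_bytes: int, capacity_bytes: int) -> None:
        # Internal update path; outside callers only read the properties.
        self._peak_used_bytes = max(self._peak_used_bytes, used_bytes)
        if capacity_bytes > 0:
            self._peak_utilization = max(
                self._peak_utilization, used_bytes / capacity_bytes
            )
```

With no setters defined, assigning to `peak_used_bytes` from outside raises `AttributeError`, which signals that the values are outputs of the sampler, not inputs.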
Signed-off-by: Alex Chien <alexchien130@gmail.com>
Why are these changes needed?
Ray Data benchmarks already report runtime and object store spilling, but they do not show how much object store memory was used at peak during a benchmark run.
This metric is useful for backpressure tuning. We want to know not only whether a run spilled or got slower, but also how much object store memory it used.
What changes were made?
This PR adds a lightweight sampler to the benchmark utility.
While each benchmark case is running, the sampler periodically reads aggregate object store memory stats and keeps the highest value it sees.
This adds two new benchmark metrics:
Why sample during the benchmark?
Spilled bytes are cumulative, so they can be measured with an end-minus-start delta.
Object store memory usage is different: it can go up during the benchmark and drop back down before the end. Sampling during the run lets us capture that peak.
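The difference between the two kinds of metric can be illustrated with a small sketch. The readings below are made-up numbers, not results from any benchmark run, and the variable names are chosen only for this example.

```python
# Synthetic (made-up) readings taken at successive points during one run.
used_bytes_samples = [0, 40_000_000, 100_000_000, 30_000_000, 0]
spilled_bytes_samples = [0, 0, 10_000_000, 25_000_000, 25_000_000]

# Cumulative counter: end minus start recovers the total for the run.
spilled_delta = spilled_bytes_samples[-1] - spilled_bytes_samples[0]

# Gauge: end minus start misses the rise and fall entirely.
used_delta = used_bytes_samples[-1] - used_bytes_samples[0]

# Sampling during the run and keeping the maximum captures the peak.
peak_used = max(used_bytes_samples)

print(spilled_delta, used_delta, peak_used)
```

Here the end-minus-start delta for usage is 0 even though the run peaked at 100 MB, while the same delta for spilled bytes reports the full 25 MB.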
Test
Smoke test with a 100 MB object store allocation:

Closes #63417