[data] Fix `iter_batches` spilling (1/n): Remove outer `make_async_gen` to reduce untracked buffered batches + reduce prefetch onto GPU by justinvyu · Pull Request #63660 · ray-project/ray

justinvyu · 2026-05-27T07:44:06Z

Description

- `iter_batches` uses `make_async_gen(ref_bundle_iterator, num_workers=1)` to decouple the batching pipeline from the consumer thread. With a single worker, the multi-worker machinery (filling worker, per-worker input/output queues, round-robin draining) is unnecessary; it just adds complexity and hidden buffering.
- With `buffer_size=prefetch_batches` (added in #58657, "[Data] Handle prefetches buffering in iter_batches"), `make_async_gen` creates an input queue (capacity `prefetch_batches + 1`) and an output queue (capacity `prefetch_batches`), buffering up to `~2 * prefetch_batches` items that are invisible to the resource manager's memory accounting.
- **BEHAVIOR CHANGE AFTER THIS PR**: The outer `make_async_gen` output buffer was a queue of **GPU batches** with max size `prefetch_batches`. This means that up to `prefetch_batches + 1` batches (the queue plus the working batch) can sit in GPU memory. This happens implicitly and is not good default behavior, since users expect their entire GPU memory to be usable for model params, grads, optimizer states, and the current batch and its associated activations. Prefetching too many batches onto the GPU can silently hurt GPU utilization and throughput by forcing users to reduce their batch size. This PR bounds the number of prefetched GPU batches to 1, which has parity with the current defaults. A follow-up PR will introduce a configuration option to choose how many batches are prefetched to the GPU at a time.
- Replace with `iter_threaded`, a simplified utility that runs an iterator in a single background thread with a bounded output queue.
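The capacities quoted in the description can be tallied with a short sketch. The helper name `outer_buffer_capacity` and its dictionary keys are hypothetical and exist only for illustration; the formulas are the ones stated in the bullets above.

```python
def outer_buffer_capacity(prefetch_batches: int) -> dict:
    """Tally the outer make_async_gen queue capacities described in this PR.

    Hypothetical helper for illustration; it is not part of Ray.
    """
    input_queue = prefetch_batches + 1   # input queue capacity
    output_queue = prefetch_batches      # output queue capacity
    return {
        "input_queue": input_queue,
        "output_queue": output_queue,
        # Items hidden from the resource manager, roughly 2 * prefetch_batches.
        "untracked_total": input_queue + output_queue,
        # Output queue of GPU batches plus the batch being worked on.
        "max_gpu_batches_before": prefetch_batches + 1,
    }


print(outer_buffer_capacity(4))
```

With `prefetch_batches=4`, that is 9 untracked items and up to 5 batches on the GPU before this change.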

Related PRs

This PR brings the behavior closer to the original implementation: #33620

#51661

This change introduced extra queues to the mix and was mostly needed for the preserve_order case with multiple threadpool workers in Ray Data read tasks. The iter_batches usage of make_async_gen only uses a single thread and the behavior change was unintentional.

#58657

The change to `make_async_gen(buffer_size=prefetch_batches)` was unrelated to the main issue, which was the batch formatting threadpool's round-robin causing spiky next-batch latencies.

Release test results

backpressure_benchmark.single_node

| Metric | Master | PR1 |
| --- | --- | --- |
| runtime (s) | 1039.86 | 1035.94 |
| peak obj store (GB) | 9.50 | 7.63 |
| util peak | 0.9896 | 0.7943 |
| spilled (GB) | 5.88 | 0.0 |

backpressure_benchmark.multi_node

Note: Still spills with this PR, but significantly reduced. See PR2 for reducing spill to 0.

| Metric | Master | PR1 |
| --- | --- | --- |
| runtime (s) | 144.47 | 142.11 |
| peak obj store (GB) | 72.25 | 63.88 |
| util peak | 0.8362 | 0.7393 |
| spilled (GB) | 63.88 | 14.00 |

gemini-code-assist

Code Review

This pull request replaces the asynchronous generator make_async_gen with a simpler background thread iterator iter_in_background using a bounded queue. A critical issue was identified in the new iter_in_background utility: if the consumer stops iterating early, the background producer thread can block indefinitely on queue.put(), causing a thread and resource leak. A code suggestion was provided to use a threading.Event and a finally block to safely terminate the producer thread and drain the queue.
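The cleanup pattern described in this review (a `threading.Event` plus a `finally` block so the producer cannot block forever on `queue.put()`) might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the code in this PR: the names `iter_in_background_sketch` and `_Failure`, the `buffer_size` parameter, and the 0.05 s put timeout are all hypothetical.

```python
import queue
import threading
from typing import Iterator

_SENTINEL = object()


class _Failure:
    """Wraps a producer-side exception so the consumer can re-raise it."""

    def __init__(self, exc: Exception):
        self.exc = exc


def iter_in_background_sketch(it: Iterator, buffer_size: int = 1) -> Iterator:
    """Run `it` in one background thread behind a bounded queue (illustrative)."""
    out: queue.Queue = queue.Queue(maxsize=buffer_size)
    stopped = threading.Event()

    def _put(item) -> bool:
        # Put with a timeout so the producer notices `stopped` instead of
        # blocking forever when the consumer has gone away.
        while not stopped.is_set():
            try:
                out.put(item, timeout=0.05)
                return True
            except queue.Full:
                continue
        return False

    def _producer() -> None:
        try:
            for item in it:
                if not _put(item):
                    return  # consumer stopped early
            _put(_SENTINEL)
        except Exception as exc:
            _put(_Failure(exc))

    thread = threading.Thread(target=_producer, daemon=True)
    thread.start()
    try:
        while True:
            item = out.get()
            if item is _SENTINEL:
                return
            if isinstance(item, _Failure):
                raise item.exc
            yield item
    finally:
        # Runs on normal exhaustion, on errors, and when the generator is closed.
        stopped.set()
        thread.join(timeout=1.0)
```

Closing the generator early (for example by breaking out of a `for` loop and letting it be garbage collected, or by calling `.close()`) triggers the `finally` block, which signals the producer to stop.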

justinvyu · 2026-05-27T22:03:16Z

backpressure_training_prefetch.single_node (before/after screenshots omitted)

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 6639106.

cursor · 2026-06-02T22:30:52Z

+    worker_threads = [
+        threading.Thread(target=_worker, name="iter_threaded", daemon=True)
+        for _ in range(num_workers)
+    ]


Multi-worker runs duplicate pipelines

Medium Severity

When iter_threaded uses num_workers greater than one, each worker runs a full fn(...) over the shared base_iterator instead of splitting work like make_async_gen. Two pipelines interleave ref-bundle reads and emit batches into one queue, corrupting batching and ordering.

I don't think this is an actual bug since we are only using num_workers = 1 for _iter_batches.

I think the cursor comment isn't accurate, but I realized I have no unit tests. Will sanity check implementation and add unit tests.

Added unit tests.

rayhhome

A few comments but doesn't need to block; overall lgtm

rayhhome · 2026-06-02T23:37:24Z

+    worker_threads = [
+        threading.Thread(target=_worker, name="iter_threaded", daemon=True)
+        for _ in range(num_workers)
+    ]


I don't think this is an actual bug since we are only using num_workers = 1 for _iter_batches.

rayhhome · 2026-06-02T23:42:57Z

+_SENTINEL = object()
+
+
+def iter_threaded(


I think this function should live in python/ray/data/_internal/util.py beside make_async_gen?

I wanted to keep this utility in iter_batches so that it doesn't end up getting abused/co-opted for another purpose again and then forgetting the original intention. That's what happened with make_async_gen 😄

rayhhome · 2026-06-03T00:03:43Z

+
+    try:
+        while True:
+            item = result_queue.get()


I think make_async_gen has a polling mechanism that has a timeout for get and also detects interrupt events. I think this can lead to hanging issues but in very unlikely situations.

Interrupt should be handled by the finally: stopped.set(). Do you have a different deadlock/hang case in mind?

I was thinking about the case where a node dies and get hangs, but the old code seems to also hang in that case, so I guess the current implementation is fine 😃

I see -- this stuff is all just threads on a single node so if the node dies then there's no hanging, everything just dies.

There is one scenario where node death can cause a somewhat orthogonal issue: if block ref A gets sent over, it doesn't block the pipeline on this queue get, but it will block if ray.get(block_ref) fails in resolve_block_refs. But that's the same as the status quo and requires lineage reconstruction to unblock.
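For reference, the polling style of `get` raised in this thread can be sketched as below. This is an assumption-laden illustration of the general technique (a timed `get` plus a producer liveness check), not the behavior of `make_async_gen` or `iter_threaded`; the helper name `get_with_liveness_check` and the `poll_s` parameter are hypothetical.

```python
import queue
import threading


def get_with_liveness_check(q: queue.Queue, producer: threading.Thread,
                            poll_s: float = 0.1):
    """Block on `q`, but wake up periodically to check the producer is alive."""
    while True:
        try:
            return q.get(timeout=poll_s)
        except queue.Empty:
            if not producer.is_alive():
                # The producer may have enqueued an item right before exiting,
                # so look once more before giving up.
                try:
                    return q.get_nowait()
                except queue.Empty:
                    raise RuntimeError(
                        "producer thread exited without producing a result"
                    )
```

As the replies note, this guards only against a local producer thread dying silently; it does not help when a `ray.get` on a lost block hangs elsewhere in the pipeline.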

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

…e `make_async_gen` with `iter_threaded` (#63682)

Replaces the inner format/collate `make_async_gen` with `iter_threaded` from #63660, cutting the number of untracked batches that pin object store memory from ~16 to ~8 (2× reduction).

- `_format_in_threadpool` runs format + collate across a threadpool via `make_async_gen(num_workers=min(4, prefetch_batches), preserve_ordering=False)`. With the default `buffer_size=1`, this allocates one shared input queue of size `(buffer_size + 1) * num_workers` and `num_workers` per-worker output queues of size `buffer_size`. For `num_workers=4`, that is **8 (input) + 4 (in-flight in workers) + 4 (output) ≈ 16** batches buffered inside the threadpool, none of which are visible to the resource manager.
- These buffered batches are pre-format `pa.Table.slice()` views that pin their **full** source blocks in the object store (`pa.Table.slice` is zero-copy and references the entire underlying buffer). They keep blocks pinned in shared memory even after the distributed reference counter considers them out of scope, which is the accounting decoupling that contributes to streaming-split underestimation and spilling.
- Replace with `iter_threaded(..., num_workers=num_threadpool_workers, output_buffer_size=num_threadpool_workers)` from PR 1 (generalized in this stack to take a required `fn` and `num_workers`). Workers share `batch_iter` under a lock and funnel results through a single bounded queue sized to match the worker count, which gives enough depth to keep workers from blocking on each other's `put()` when collate is non-trivial. In-flight is now bounded to **~2 × num_workers ≈ 8** (workers + shared output buffer), roughly a 2× reduction in untracked pinned batches.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
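The "workers share an iterator under a lock and funnel results through one bounded queue" design can be sketched as follows. This is a simplified illustration, not the `iter_threaded` implementation: `LockedIterator` and `run_shared_workers` are hypothetical names, and error propagation and early-consumer-exit cleanup are deliberately left out.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator

_DONE = object()


class LockedIterator:
    """Iterator wrapper that serializes next() calls across threads."""

    def __init__(self, source: Iterable):
        self._it = iter(source)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            return next(self._it)


def run_shared_workers(source: Iterable,
                       fn: Callable[[Iterator], Iterator],
                       num_workers: int) -> Iterator:
    """Each worker runs fn over the shared iterator; output order is not preserved."""
    shared = LockedIterator(source)
    # One bounded output queue sized to the worker count, so roughly
    # 2 * num_workers items are in flight (one per worker plus the queue).
    out: queue.Queue = queue.Queue(maxsize=num_workers)

    def _worker() -> None:
        try:
            for result in fn(shared):
                out.put(result)
        finally:
            out.put(_DONE)  # always signal completion so the consumer cannot hang

    for _ in range(num_workers):
        threading.Thread(target=_worker, daemon=True).start()

    finished = 0
    while finished < num_workers:
        item = out.get()
        if item is _DONE:
            finished += 1
        else:
            yield item
```

Because every worker pulls from the same locked iterator, each input item is processed exactly once, which is the property the Bugbot comment earlier in this thread was questioning.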

…e behind `preserve_order` (#63792)

Part of the iter_batches consumer pipeline cleanup (#63660, #63682). Gates `restore_original_order` behind `DataContext.execution_options.preserve_order` (default off). When one format/collate worker lags, the reorder buffer grows with the other workers' completed batches, and ready batches aren't allowed to be yielded; this PR skips the reorder step when ordering isn't required. Recovers next-batch latency from PR1+2's regressed 113 ms steady back to 23 ms (lower than master's 32 ms), with no other regressions.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
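Why a lagging worker stalls ready batches can be seen in a small reorder-buffer sketch. This is an illustrative model only, not Ray's `restore_original_order`: the function name `restore_order_sketch` and the `(index, batch)` input shape are assumptions.

```python
from typing import Iterable, Iterator, Tuple


def restore_order_sketch(indexed_batches: Iterable[Tuple[int, object]],
                         preserve_order: bool) -> Iterator[object]:
    """Yield batches in index order, or in arrival order when ordering is off."""
    if not preserve_order:
        # No reorder step: every batch is yielded as soon as it arrives.
        for _, batch in indexed_batches:
            yield batch
        return

    pending = {}   # reorder buffer: completed batches waiting for their turn
    next_idx = 0
    for idx, batch in indexed_batches:
        pending[idx] = batch
        # Nothing can be released until the batch with index next_idx arrives.
        while next_idx in pending:
            yield pending.pop(next_idx)
            next_idx += 1


# Batch 0 comes from a lagging worker and arrives third.
arrivals = [(1, "b"), (2, "c"), (0, "a"), (3, "d")]
print(list(restore_order_sketch(arrivals, preserve_order=True)))
print(list(restore_order_sketch(arrivals, preserve_order=False)))
```

With ordering enabled, batches 1 and 2 sit in the buffer until batch 0 shows up; with it disabled they are handed to the consumer immediately.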

[Data] Replace outer make_async_gen with iter_in_background in BatchIterator

7e2328a

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu requested a review from a team as a code owner May 27, 2026 07:44

cursor Bot reviewed May 27, 2026

Comment thread python/ray/data/_internal/block_batching/util.py Outdated

gemini-code-assist Bot reviewed May 27, 2026

Comment thread python/ray/data/_internal/block_batching/util.py Outdated

[Data] Add producer thread cleanup to iter_in_background on early consumer exit

5898f1d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com>

ray-gardener Bot added the data Ray Data-related issues label May 27, 2026

Merge branch 'master' of https://github.com/ray-project/ray into justinvyu/replace-outer-make-async-gen

4f9a566

[Data] Rename iter_in_background -> iter_threaded with fn and num_workers

f5a6103

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into justinvyu/replace-outer-make-async-gen

6639106

cursor Bot reviewed Jun 2, 2026

bveeramani self-assigned this Jun 2, 2026

rayhhome approved these changes Jun 3, 2026

justinvyu added 2 commits June 3, 2026 11:34

[Data] Add unit tests for iter_threaded

63768a6

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

[Data] Short-circuit _locked_next when consumer has stopped

edc3eef

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

rayhhome added the go add ONLY when ready to merge, run all tests label Jun 3, 2026

justinvyu changed the title ~~[data] Fix iter_batches spilling (1/n): Remove outer make_async_gen and its untracked input/output queues~~ [data] Fix iter_batches spilling (1/n): Remove outer make_async_gen to reduce untracked buffered batches + reduce prefetch onto GPU Jun 4, 2026

justinvyu merged commit 80021c9 into ray-project:master Jun 4, 2026
8 checks passed

justinvyu deleted the justinvyu/replace-outer-make-async-gen branch June 4, 2026 17:31

3 participants
