
[data] Fix iter_batches spilling (1/n): Remove outer make_async_gen to reduce untracked buffered batches + reduce prefetch onto GPU#63660

Merged
justinvyu merged 7 commits into
ray-project:masterfrom
justinvyu:justinvyu/replace-outer-make-async-gen
Jun 4, 2026

Conversation

@justinvyu
Contributor

@justinvyu justinvyu commented May 27, 2026

Description

  • iter_batches uses make_async_gen(ref_bundle_iterator, num_workers=1) to decouple the batching pipeline from the consumer thread. With a single worker, the multi-worker machinery (filling worker, per-worker input/output queues, round-robin draining) is unnecessary — it just adds complexity and hidden buffering.
  • With buffer_size=prefetch_batches (added in [Data] Handle prefetches buffering in iter_batches #58657), make_async_gen creates an input queue (capacity prefetch_batches + 1) and an output queue (capacity prefetch_batches), buffering up to ~2 * prefetch_batches items that are invisible to the resource manager's memory accounting.
  • BEHAVIOR CHANGE AFTER THIS PR: The outer make_async_gen output buffer was a queue of GPU batches with a maximum size of prefetch_batches. This means that up to prefetch_batches + 1 batches (the queued batches plus the working batch) can be resident in GPU memory. This happens implicitly and is not good default behavior, since users expect their entire GPU memory to be usable for model params, grads, optimizer states, and the current batch and its associated activations. Prefetching too many batches onto the GPU can silently hurt GPU utilization and throughput by forcing users to reduce their batch size. This PR bounds the number of prefetched GPU batches to 1, which has parity with the current defaults. A follow-up PR will introduce a configuration option to choose how many batches are prefetched to the GPU at a time.
  • Replace it with iter_threaded, a simplified utility that runs an iterator in a single background thread with a bounded output queue.
Screenshot 2026-05-27 at 12 44 47 AM
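
The buffering arithmetic in the description can be checked with a few lines. This is a back-of-the-envelope sketch only: the helper name `outer_async_gen_buffered_items` is hypothetical, and the capacities are copied from the bullets above rather than read from the Ray source.

```python
def outer_async_gen_buffered_items(prefetch_batches: int) -> dict:
    """Queue capacities quoted in the PR description for num_workers=1."""
    input_capacity = prefetch_batches + 1  # input queue capacity
    output_capacity = prefetch_batches     # output queue capacity (GPU batches)
    return {
        "input": input_capacity,
        "output": output_capacity,
        # Items buffered outside the resource manager's accounting.
        "total_untracked": input_capacity + output_capacity,
        # Queued GPU batches plus the batch the consumer is working on.
        "max_gpu_batches": output_capacity + 1,
    }
```

For example, `outer_async_gen_buffered_items(4)` gives 9 untracked items in total and up to 5 batches on the GPU, which is the "~2 * prefetch_batches" and "prefetch_batches + 1" behavior the description calls out.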

Related PRs

This PR brings the behavior closer to the original implementation: #33620

#51661

  • This change introduced extra queues to the mix and was mostly needed for the preserve_order case with multiple threadpool workers in Ray Data read tasks. The iter_batches usage of make_async_gen only uses a single thread and the behavior change was unintentional.

#58657

  • The change that increased the buffer to make_async_gen(buffer_size=prefetch_batches) was unrelated to the main issue, which was the batch-formatting threadpool's round-robin draining causing spiky next-batch latencies.

Release test results

backpressure_benchmark.single_node

| Metric | Master | PR1 |
|---|---|---|
| runtime (s) | 1039.86 | 1035.94 |
| peak obj store (GB) | 9.50 | 7.63 |
| util peak | 0.9896 | 0.7943 |
| spilled (GB) | 5.88 | 0.0 |

backpressure_benchmark.multi_node

Note: Still spills with this PR, but significantly reduced. See PR2 for reducing spill to 0.

| Metric | Master | PR1 |
|---|---|---|
| runtime (s) | 144.47 | 142.11 |
| peak obj store (GB) | 72.25 | 63.88 |
| util peak | 0.8362 | 0.7393 |
| spilled (GB) | 63.88 | 14.00 |

…terator

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested a review from a team as a code owner May 27, 2026 07:44
Comment thread python/ray/data/_internal/block_batching/util.py Outdated
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request replaces the asynchronous generator make_async_gen with a simpler background thread iterator iter_in_background using a bounded queue. A critical issue was identified in the new iter_in_background utility: if the consumer stops iterating early, the background producer thread can block indefinitely on queue.put(), causing a thread and resource leak. A code suggestion was provided to use a threading.Event and a finally block to safely terminate the producer thread and drain the queue.
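
The early-exit problem described in this review can be illustrated with a minimal sketch. It is not the code that was merged: `iter_in_background_sketch` and `_Failure` are hypothetical names, and this variant has the producer poll on `put()` with a timeout instead of having the consumer drain the queue. The point is only to show how a `threading.Event` set in a `finally` block lets the producer thread exit when the consumer stops iterating early.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_SENTINEL = object()


class _Failure:
    """Wraps an exception raised by the source so the consumer can re-raise it."""

    def __init__(self, exc: BaseException) -> None:
        self.exc = exc


def iter_in_background_sketch(source: Iterable[T], buffer_size: int = 1) -> Iterator[T]:
    """Run `source` in one background thread behind a bounded queue."""
    out: queue.Queue = queue.Queue(maxsize=buffer_size)
    stopped = threading.Event()

    def _put(item) -> bool:
        # Retry with a timeout so the producer notices `stopped` instead of
        # blocking forever on a full queue.
        while not stopped.is_set():
            try:
                out.put(item, timeout=0.05)
                return True
            except queue.Full:
                continue
        return False

    def _producer() -> None:
        try:
            for item in source:
                if not _put(item):
                    return  # consumer went away; stop producing
            _put(_SENTINEL)
        except Exception as exc:
            _put(_Failure(exc))

    thread = threading.Thread(
        target=_producer, name="iter_in_background_sketch", daemon=True
    )
    thread.start()
    try:
        while True:
            item = out.get()
            if item is _SENTINEL:
                return
            if isinstance(item, _Failure):
                raise item.exc
            yield item
    finally:
        # Runs on normal exhaustion, on errors, and when the consumer
        # abandons the generator early (GeneratorExit).
        stopped.set()
        thread.join(timeout=1.0)
```

With `buffer_size=1`, at most one item waits in the queue while the producer holds the next one, so the amount of hidden buffering stays small and easy to reason about.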

Comment thread python/ray/data/_internal/block_batching/util.py Outdated
…sumer exit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label May 27, 2026
@justinvyu
Contributor Author

backpressure_training_prefetch.single_node

Before:
Screenshot 2026-05-27 at 3 01 58 PM
After:
Screenshot 2026-05-27 at 3 02 42 PM

…kers

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 6639106. Configure here.

worker_threads = [
    threading.Thread(target=_worker, name="iter_threaded", daemon=True)
    for _ in range(num_workers)
]

Multi-worker runs duplicate pipelines

Medium Severity

When iter_threaded uses num_workers greater than one, each worker runs a full fn(...) over the shared base_iterator instead of splitting work like make_async_gen. Two pipelines interleave ref-bundle reads and emit batches into one queue, corrupting batching and ordering.

Contributor

I don't think this is an actual bug since we are only using num_workers = 1 for _iter_batches.

Contributor Author

I think the cursor comment isn't accurate, but I realized I have no unit tests. Will sanity check implementation and add unit tests.

Contributor Author

Added unit tests.

@bveeramani bveeramani self-assigned this Jun 2, 2026
Contributor

@rayhhome rayhhome left a comment

A few comments but doesn't need to block; overall lgtm

_SENTINEL = object()


def iter_threaded(

Contributor

I think this function should live in python/ray/data/_internal/util.py beside make_async_gen?

Contributor Author

I wanted to keep this utility in iter_batches so that it doesn't end up getting abused/co-opted for another purpose again and then forgetting the original intention. That's what happened with make_async_gen 😄


try:
    while True:
        item = result_queue.get()

Contributor

I think make_async_gen has a polling mechanism that has a timeout for get and also detects interrupt events. I think this can lead to hanging issues but in very unlikely situations.

Contributor Author

Interrupt should be handled by the finally: stopped.set(). Do you have a different deadlock/hang case in mind?

Contributor

I was thinking about the case where a node dies and get hangs, but the old code seems to also hang in that case, so I guess the current implementation is fine 😃

Contributor Author

I see -- this stuff is all just threads on a single node so if the node dies then there's no hanging, everything just dies.

There is one scenario where node death can cause a somewhat orthogonal issue: if block ref A gets sent over, it doesn't block the pipeline on this queue get, but it will block if ray.get(block_ref) fails in resolve_block_refs. But that's the same as the status quo and requires lineage reconstruction to unblock.
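
For readers comparing the two approaches discussed in this thread, the following is a rough sketch of a polling `get` that wakes up on a timeout to check a stop flag, as opposed to a plain blocking `result_queue.get()`. The names `get_with_polling` and `STOPPED` are hypothetical; this is not how `make_async_gen` is actually written, only an illustration of the general pattern.

```python
import queue
import threading

STOPPED = object()  # hypothetical marker returned when the wait is cancelled


def get_with_polling(
    q: queue.Queue, stopped: threading.Event, poll_interval_s: float = 0.05
):
    """Wait for the next item, waking up periodically to check `stopped`."""
    while not stopped.is_set():
        try:
            return q.get(timeout=poll_interval_s)
        except queue.Empty:
            continue  # nothing yet; re-check the stop flag and wait again
    return STOPPED
```

The trade-off is a small amount of wake-up overhead in exchange for a consumer that can be cancelled while the queue is empty. As the thread notes, neither variant helps when the blocking happens elsewhere, for example inside ray.get(block_ref).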

justinvyu added 2 commits June 3, 2026 11:34
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@rayhhome rayhhome added the go add ONLY when ready to merge, run all tests label Jun 3, 2026
@justinvyu justinvyu changed the title from "[data] Fix iter_batches spilling (1/n): Remove outer make_async_gen and its untracked input/output queues" to "[data] Fix iter_batches spilling (1/n): Remove outer make_async_gen to reduce untracked buffered batches + reduce prefetch onto GPU" Jun 4, 2026
@justinvyu justinvyu merged commit 80021c9 into ray-project:master Jun 4, 2026
8 checks passed
justinvyu added a commit that referenced this pull request Jun 4, 2026
…e `make_async_gen` with `iter_threaded` (#63682)

Replaces the inner format/collate `make_async_gen` with `iter_threaded`
from #63660, cutting the number of untracked batches pinned in object
store memory from ~16 to ~8 (2× reduction).

- `_format_in_threadpool` runs format + collate across a threadpool via
`make_async_gen(num_workers=min(4, prefetch_batches),
preserve_ordering=False)`. With the default `buffer_size=1`, this
allocates one shared input queue of size `(buffer_size + 1) *
num_workers` and `num_workers` per-worker output queues of size
`buffer_size` — for `num_workers=4`, that is **8 (input) + 4 (in-flight
in workers) + 4 (output) ≈ 16** batches buffered inside the threadpool,
none of which are visible to the resource manager.
- These buffered batches are pre-format `pa.Table.slice()` views that
pin their **full** source blocks in the object store (`pa.Table.slice`
is zero-copy and references the entire underlying buffer). They keep
blocks pinned in shared memory even after the distributed reference
counter considers them out of scope, which is the accounting decoupling
that contributes to streaming-split underestimation and spilling.
- Replace with `iter_threaded(..., num_workers=num_threadpool_workers,
output_buffer_size=num_threadpool_workers)` from PR 1 (generalized in
this stack to take a required `fn` and `num_workers`). Workers share
`batch_iter` under a lock and funnel results through a single bounded
queue sized to match the worker count — enough depth to keep workers
from blocking on each other's `put()` when collate is non-trivial.
In-flight is now bounded to **~2 × num_workers ≈ 8** (workers + shared
output buffer) — roughly a 2× reduction in untracked pinned batches.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
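
The multi-worker arrangement described in this commit message (workers pulling from one shared iterator under a lock and funneling results into a single bounded queue) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the merged `iter_threaded`: the names `iter_threaded_sketch`, `_LockedIterator`, and `_Failure` are hypothetical, `fn` is assumed to map an iterator to an iterator, and early-exit cleanup is omitted, so abandoned worker threads are simply left as daemons.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator

_DONE = object()


class _Failure:
    """Carries a worker exception to the consumer thread."""

    def __init__(self, exc: Exception) -> None:
        self.exc = exc


class _LockedIterator:
    """Lets several threads call next() on one underlying iterator safely."""

    def __init__(self, it: Iterable) -> None:
        self._it = iter(it)
        self._lock = threading.Lock()

    def __iter__(self) -> "_LockedIterator":
        return self

    def __next__(self):
        with self._lock:
            return next(self._it)


def iter_threaded_sketch(
    base_iterator: Iterable,
    fn: Callable[[Iterator], Iterator],
    num_workers: int,
    output_buffer_size: int,
) -> Iterator:
    shared = _LockedIterator(base_iterator)
    out: queue.Queue = queue.Queue(maxsize=output_buffer_size)

    def _worker() -> None:
        try:
            # Each worker holds at most one in-progress item at a time.
            for result in fn(shared):
                out.put(result)
        except Exception as exc:
            out.put(_Failure(exc))
        finally:
            out.put(_DONE)  # one completion marker per worker

    threads = [
        threading.Thread(target=_worker, daemon=True) for _ in range(num_workers)
    ]
    for t in threads:
        t.start()

    finished = 0
    while finished < num_workers:
        result = out.get()
        if result is _DONE:
            finished += 1
        elif isinstance(result, _Failure):
            raise result.exc
        else:
            yield result
```

In this sketch the number of items in flight is bounded by roughly `num_workers + output_buffer_size`, which with `output_buffer_size=num_workers` matches the "~2 × num_workers" figure quoted above. Output order is not preserved when `num_workers > 1`.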
rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
…n` to reduce untracked buffered batches + reduce prefetch onto GPU (ray-project#63660)

rueian pushed a commit to rueian/ray that referenced this pull request Jun 4, 2026
…e `make_async_gen` with `iter_threaded` (ray-project#63682)

justinvyu added a commit that referenced this pull request Jun 4, 2026
…e behind `preserve_order` (#63792)

Part of the iter_batches consumer pipeline cleanup (#63660, #63682).
Gates restore_original_order behind
`DataContext.execution_options.preserve_order` (default off). When one
format/collate worker lags, the reorder buffer grows with the other
workers' completed batches, and ready batches aren't allowed to be
yielded; this PR skips the reorder step when ordering isn't required.
Recovers steady-state next-batch latency from the 113 ms regression
introduced by PR1+2 back to 23 ms (lower than master's 32 ms), with no
other regressions.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
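
The reorder-buffer behavior described in this commit message can be sketched in a few lines. This is a simplified illustration, not the Ray implementation: `maybe_restore_order` is a hypothetical name, a plain boolean stands in for the `preserve_order` setting, and batches are assumed to arrive as `(index, batch)` pairs.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def maybe_restore_order(
    indexed_batches: Iterable[Tuple[int, Any]], preserve_order: bool
) -> Iterator[Any]:
    """Yield batches in index order only when `preserve_order` is set."""
    if not preserve_order:
        # Pass batches through as soon as they arrive; nothing is buffered.
        for _, batch in indexed_batches:
            yield batch
        return

    pending: Dict[int, Any] = {}
    next_idx = 0
    for idx, batch in indexed_batches:
        # A lagging worker leaves a gap at `next_idx`, so later batches
        # accumulate here until the gap is filled.
        pending[idx] = batch
        while next_idx in pending:
            yield pending.pop(next_idx)
            next_idx += 1
```

If batch 0 arrives last out of four, the ordered path holds batches 1 through 3 in `pending` until batch 0 shows up, while the unordered path yields each batch immediately. That waiting is the latency cost the commit avoids when ordering is not required.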
@justinvyu justinvyu deleted the justinvyu/replace-outer-make-async-gen branch June 4, 2026 17:31
Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

3 participants