[data] Gate restore_original_order in iter_batches consumer pipeline behind preserve_order #63792
…ns.preserve_order Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Code Review
This pull request makes the restoration of the original batch order optional, executing it only when execution_options.preserve_order is enabled. This prevents head-of-line blocking and improves throughput for consumers that do not require order preservation. A corresponding unit test was added to verify this behavior. The reviewer noted that dynamically calling DataContext.get_current() inside the pipeline could lead to issues if the context changes or is evaluated asynchronously, and suggested capturing the preserve_order flag during initialization instead.
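The reviewer's suggestion can be illustrated with a small sketch. Everything here is a hypothetical stand-in (the Context and ExecutionOptions classes, get_current, and both pipeline classes are invented for illustration and are not Ray's implementation); it only shows why reading the flag once at construction gives a stable answer, while a late lookup can change mid-iteration.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionOptions:
    # Hypothetical stand-in for the real execution options object.
    preserve_order: bool = False


@dataclass
class Context:
    # Hypothetical stand-in for a process-wide data context.
    execution_options: ExecutionOptions = field(default_factory=ExecutionOptions)


_CURRENT = Context()


def get_current() -> Context:
    """Return the (mutable) process-wide context."""
    return _CURRENT


class LateLookupPipeline:
    """Reads the flag every time it is needed, so later mutations leak in."""

    def should_restore_order(self) -> bool:
        return get_current().execution_options.preserve_order


class CapturedFlagPipeline:
    """Reads the flag once at construction and keeps that value."""

    def __init__(self) -> None:
        self._preserve_order = get_current().execution_options.preserve_order

    def should_restore_order(self) -> bool:
        return self._preserve_order


late = LateLookupPipeline()
captured = CapturedFlagPipeline()

# Someone flips the flag after both pipelines were built.
get_current().execution_options.preserve_order = True

late_result = late.should_restore_order()          # follows the mutation
captured_result = captured.should_restore_order()  # keeps the original value
```

Capturing the value trades flexibility for predictability: a pipeline built with preserve_order off stays unordered for its whole lifetime, regardless of what other code does to the shared context afterwards.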
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 7cfe17f.

Description
Part of the iter_batches consumer pipeline cleanup (#63660, #63682). Gates restore_original_order behind DataContext.execution_options.preserve_order (default off). When one format/collate worker lags, the reorder buffer fills with the other workers' completed batches, and those ready batches cannot be yielded until the lagging batch arrives. This PR skips the reorder step when ordering isn't required. That brings steady-state next-batch latency back from 113 ms (the regression introduced by PR1+2) to 23 ms, lower than master's 32 ms, with no other regressions.
Why does restore_original_order reduce throughput?
Here's the chain that leads to higher average throughput:
Profiling the training worker
Before: loss.item() takes up ~25% of the profiling duration, since loss.item() is the CUDA/CPU sync point. The slow straggler batch time manifests in loss.item() timing outliers.
After: loss.item() takes <2% of the profiling duration.
Release test results
peak_object_store_memory variant (sleep=2s, exposes consumer-side buffer fill)
throughput variant (no sleep, pipeline is the rate-limiter)
PR3 closes the temporary next-batch latency regression that PR1+2 alone introduced (32 → 113 ms in the throughput variant), bringing it down to 23 ms, lower than master's 32 ms, while preserving PR1+2's −44% peak object store memory and +21% throughput wins.
Follow-up work
Extra GPU memory cost:
The reorder buffer holds batches while waiting for the next batch_idx to come in, and it can accumulate up to N items (one per format/collate worker that ran ahead) on top of the outer make_async_gen output queue of size prefetch_batches. ([data] Fix iter_batches spilling (1/n): Remove outer make_async_gen to reduce untracked buffered batches + reduce prefetch onto GPU #63660 is removing that output queue.)
restore_original_order instead to remove the unexpected buffer in GPU memory.
This PR mitigates the issue for the default case because the restore_original_order buffer isn't used.
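The reorder-buffer behavior described in this PR (ready batches held back behind a straggler, and up to one buffered batch per worker that ran ahead) can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the actual restore_original_order code: the consume function, its arguments, and the completion order are all hypothetical, and batches are represented only by their batch_idx.

```python
from typing import List, Sequence, Tuple


def consume(
    completion_order: Sequence[int], preserve_order: bool
) -> Tuple[List[List[int]], int]:
    """Simulate the consumer side of a multi-worker format/collate stage.

    completion_order lists batch_idx values in the order workers finish them.
    Returns (yield_log, peak_buffered), where yield_log[i] holds the batch_idx
    values handed to the consumer right after the i-th completion, and
    peak_buffered is the largest number of finished batches held back.
    """
    buffer = set()   # finished batches that cannot be yielded yet
    next_idx = 0     # the batch_idx the consumer must see next
    peak_buffered = 0
    yield_log: List[List[int]] = []

    for idx in completion_order:
        if not preserve_order:
            # No reorder step: every finished batch is yielded immediately.
            yield_log.append([idx])
            continue

        buffer.add(idx)
        ready = []
        # Drain the buffer only while the next expected index is present.
        while next_idx in buffer:
            buffer.remove(next_idx)
            ready.append(next_idx)
            next_idx += 1
        peak_buffered = max(peak_buffered, len(buffer))
        yield_log.append(ready)

    return yield_log, peak_buffered


# Four hypothetical workers; the one producing batch 0 is the straggler.
order = [1, 2, 3, 0]
ordered_log, ordered_peak = consume(order, preserve_order=True)
unordered_log, unordered_peak = consume(order, preserve_order=False)
```

With ordering enforced, nothing is yielded for the first three completions and three finished batches sit in the buffer until batch 0 arrives; with ordering off, each batch is yielded as soon as it completes and nothing is buffered.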