Tianye/4gpu 2ppl fix#4
Merged
zhenyulincs merged 2 commits intoMay 24, 2026
Merged
Conversation
…uter + max_concurrency Squashed working tree for the miles side of the 4-GPU 2-pipeline full-cross-overlap topology (P1 train=[0,1] infer=[0,1,2,3], P2 train=[2,3] infer=[0,1,2,3]). Companion to TianyeGGBond/rlix@tianye/4gpu-2ppl-fix. Three independent fixes: 1) RolloutManager never plumbed rlix_hooks (Layer 4) --------------------------------------------------- miles.ray.rollout.RolloutManager._get_rollout_data called ``call_rollout_fn(self.generate_rollout, ..., evaluation=False)`` without passing ``rlix_hooks``. fully_async_rollout.generate_rollout_async defaults the kwarg to None and promotes to NoOpRLixHooks — every ``begin_progress_batch`` / ``bump_completed`` was a silent no-op. Between rollouts the rlix central scheduler had no demand signal; gap-ratio had no basis to re-expand engines after _after_training released actor_train; the next rollout's ``await rollout_data`` hung forever. Fix: - ``RolloutManager.__init__`` initializes ``self._rlix_hooks = None`` (standalone miles keeps NoOp; only RLix-mode injects). - New ``RolloutManager.set_rlix_hooks(hooks)`` method. - ``_get_rollout_data`` forwards ``rlix_hooks=self._rlix_hooks``. - ``call_rollout_fn`` (miles/rollout/base_types.py) introspects the target fn's signature and only forwards ``rlix_hooks`` when the fn declares it, preserving back-compat for legacy rollout fns. The rlix-side driver (TianyeGGBond/rlix@tianye/4gpu-2ppl-fix) calls ``rollout_manager.set_rlix_hooks.remote(MilesRLixHooks(coord, pipeline_id))`` in Phase B step 4c. 2) Router leaked LB counter on ClientDisconnect (Layer 5) --------------------------------------------------------- miles/router/router.py:do_proxy acquired a worker via ``_use_url_async`` (``worker_request_counts[url] += 1``) BEFORE the ``try/finally`` containing the matching ``_finish_url`` decrement. The very next line (``await request.body()``) raises ``ClientDisconnect`` when a peer pipeline's engine shrink races a probe-time request — and that exception escaped the try, so ``_finish_url`` never ran. Every disconnect leaked one counter; the LB's ``min(candidates, key=worker_request_counts.get)`` eventually treated leaked workers as permanently busy. Fix: - Moved the ``try/finally`` boundary up to wrap ``await request.body()`` and the upstream request; ``_finish_url`` always runs. - ``proxy()`` catches ``ClientDisconnect`` explicitly and returns 499 with a one-line WARNING. Real CUDA / Ray / scheduler exceptions still propagate (only ``ClientDisconnect`` is suppressed). 3) MilesCoordinator max_concurrency ----------------------------------- Default ``max_concurrency=1`` made the coordinator serialize all RPCs: ``scheduler.resize_infer``, ``report_progress_from_scheduler``, and ``sync_base_weights_to_active`` queued behind each other. Bumped to 4 to mirror rlix's reference 2-pipeline 4-GPU example (``examples/start_multi_pipeline_test.py: COORDINATOR_MAX_CONCURRENCY``). Empirically reduced the post-rollout-0 incidence of Ray's internal ``'Worker' object has no attribute 'core_worker'`` race at ``worker.py:1039`` by relieving the coordinator's RPC-handler thread queue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-signal scheduler demand BEFORE each rollout dispatch so the gap-ratio planner can wake actor_infer engines without waiting for the rollout function's first begin_progress_batch — too late under cross-pipeline contention. Companion to rlix@tianye/4gpu-2ppl-fix commit which adds MilesPipeline.signal_rollout_demand. Root cause: between rollouts, both pipelines' actor_infer DP workers shrink to set() because actor_train (ACTOR_TRAINING priority) preempts the shared infer GPUs. The first pipeline whose rollout function fires begin_progress_batch wins the GENERATION budget; the peer pipeline starves on "Warning: No progress for 30.0s. Queue size: 0, Collected: 0/N". Mirrors rlix/pipeline/full_finetune_pipeline.py:696 (Phase 4.5 notify_release_then_request_cluster_gpus) — re-request actor_infer at GENERATION with fresh step_target_estimate per rollout. miles can't tear-down/re-request actor_infer mid-loop without resetting SGLang engines, so the rlix-side companion publishes a synthetic new_batch=True progress report instead. This change adds the train-loop hook + driver wire-up that invokes it. Changes: - miles/utils/rlix_train_loop.py: new optional signal_demand: StepHook kwarg on run_async_train_loop. Called pre-pre-loop-dispatch (for start_rollout_id) and after every after_step before dispatching the next rollout. Awaited (commit-before-dispatch ordering); generate.remote immediately follows so the scheduler sees fresh demand BEFORE the rollout function pulls samples. Failure path (release_only) does not call signal_demand because the loop re-raises and no next rollout fires. - examples/rlix/run_miles_dual.py: defines _signal_demand step hook that forwards rollout_id + rollout_batch_size to pipe.signal_rollout_demand.remote. step_target=rollout_batch_size matches the per-rollout demand the scheduler sees once the real begin_progress_batch arrives. Defensive <=0 short-circuit. Test: run 28 of /root/run_4ppl_full_overlap_n10.sh (--num-rollout 5, P1 train=[0,1] infer=[0,1,2,3], P2 train=[2,3] infer=[0,1,2,3]) — completed all 5 rollouts cleanly with EXIT=0; prior runs (run 25 baseline, run 27 with concurrency-bump only) hung on rollout 2 collection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zhenyulincs
added a commit
to rlops/rlix
that referenced
this pull request
May 24, 2026
TianyeGGBond's paired PRs (#16 + rlops/miles#4) ship the proper fix for B-13 (mp1 wedge after mp2 completes early under cross-overlap). The fix uses the scheduler's existing gap-ratio machinery via pre-signalled durable demand instead of fighting asyncio.gather, and is materially better than my proposed M11.3 fire-shutdown_hard-early sketch. Root cause (refined from TianyeGGBond's PR description, Codex review of the rlix planner code confirms): - Between rollouts, actor_train preempts shared infer GPUs → both pipelines' actor_infer DP workers shrink to set(). - begin_progress_batch fires INSIDE the rollout fn, after dispatch starts → first pipeline wins all DP workers, peer hangs. - Two compounding sub-bugs: (a) pending_bucket_gen is a per-cycle snapshot that vanishes once consumed (planner blind during rollout boundary); (b) MilesRLixHooks was never wired at init → begin_progress_batch hit NoOpRLixHooks (silent no-op). Joint fix: - rlix#16: durable rollout_open_pipelines registry in scheduler; MilesPipeline.signal_rollout_demand(rollout_id, step_target); wires MilesRLixHooks in Phase B init; pending_bucket_gen durability. - miles#4: _signal_demand step hook in dual driver; set_rlix_hooks + _rlix_hooks plumbing on RolloutManager; call_rollout_fn forwards rlix_hooks; max_concurrency=4 on coordinator; router 499 + counter balance fix. Codex joint verdict: APPROVE_WITH_NOTES, 0 BLOCKERS. Prior CRITICAL (signal_rollout_demand missing) + HIGH (set_rlix_hooks not wired) both RESOLVED in the pair. Joint smoke evidence (per miles#4 PR): run_4ppl_full_overlap_n10.sh --num-rollout 5, run 28 EXIT=0 — earlier baseline hung on rollout 2. Merge order required: rlix#16 MUST land before miles#4 (miles#4 alone raises AttributeError on pipe.signal_rollout_demand.remote without the rlix#16 method). M11.3 follow-up scope shrinks: concurrent-resize stress test under max_concurrency=4; automated 4-GPU 2-pipeline rollout-2+ regression; verify generate_rollout_fully_async accepts rlix_hooks kw.
zhenyulincs
added a commit
to rlops/rlix
that referenced
this pull request
May 24, 2026
Both PRs merged into the zhenyu/* branches; my Phase 1/3 work fast-forwarded cleanly under both merges (no conflicts because TianyeGGBond branched off my prior commits). Verified merged state contains both: - My Phase 3c finish_init_offload + Phase 1 release_train_only on rlix - TianyeGGBond's signal_rollout_demand + set_rlix_hooks wiring on rlix - My Phase 2 overlap env-driven topology on miles - TianyeGGBond's _signal_demand step hook + max_concurrency=4 on miles Vast 4×A40 dual-overlap regression smoke pending on a new instance — to be run with the joint fix to confirm all 7 PASS-bar conditions (including C2 shutdown_hard complete, previously blocked by B-13).
zhenyulincs
added a commit
to rlops/rlix
that referenced
this pull request
May 24, 2026
New: docs/m11-tianye-prs-review.md — standalone document summarizing TianyeGGBond's paired PRs (#16 + rlops/miles#4) that fixed B-13 (M11.2 overlap rollout-2+ hang). Covers: - Root cause analysis (per-cycle pending_bucket_gen + MilesRLixHooks never wired) - Fix architecture (durable rollout_open_pipelines registry + signal_rollout_demand pre-dispatch) - File-level change inventory across both PRs (~14 files / +300 LoC) - Validation layers: 5 Codex review rounds, local merge spot-check, real vast smoke v8/v9/v10 evidence - Outstanding follow-ups (max_concurrency=4 stress test, rlix_hooks kw verification) - Doc + commit pointers Easier to share than the full m11-2-overlap-log.md for reviewers who only need the Tianye-PR context.
zhenyulincs
added a commit
to rlops/rlix
that referenced
this pull request
May 24, 2026
Adds §3.5 "Post-Option-A milestones" tracking 6 new lines of work: - Phase 1 — R04-F1 try/finally + release_train_only shim (rlix 47bf02b) - Phase 3 — Option β state machine (finish_init_offload + Codex Q5 invariants) (rlix 47bf02b) - Phase 7 — F3+F4 driver-side cleanup (asyncio.wait FIRST_EXCEPTION + try/finally shutdown_hard) (miles 79f2874) - Tianye's paired PRs (#16 + rlops/miles#4) fix B-13 wedge — durable rollout_open_pipelines registry + signal_rollout_demand + MilesRLixHooks wiring; merged 2026-05-24 - B-14 hardware compat — MILES_SKIP_TMS_PAUSE=1 was unconditionally set; removing it on stable-tms hardware enables tms.pause() (rlix a011bbf) - Batch B-15+F5+F9+F7 — harness C0 + nvidia-smi INFO + odd-GPU validation + pause_generation docs (rlix a4c6369 + miles 1487c3f) Updates §1.4 status table — M11.2 now has 3 rows: - Option A (disjoint pools, M11.2 Attempt 4 GREEN) - Real overlap (shared infer pool, verified 2026-05-24 v8/v9/v10) - M11.3+ deferred Updates §6 known limitations — marks F1/F3/F4/B-13/B-14/B-15/F5/F7/F9 as RESOLVED with commit pointers. Adds rows for MED1 (concurrent resize stress) + LOW1 (rlix_hooks kw) follow-ups. F2 noted as howard989's PR in flight. Updates Appendix C — current HEADs (rlix 9b1fa90, miles 1487c3f). Cross-refs to plans/m11-2-overlap-log.md + docs/m11-tianye-prs-review.md.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Miles-side companion PR for the RLix 4-GPU 2-pipeline rollout-2+ hang fix.
RLix side: makes rollout demand durable in the scheduler.
Miles side: sends that demand before rollout dispatch starts.
Problem
Under full cross-overlap, both pipelines'
actor_inferDP workers can shrink to empty between rollouts.If demand is only reported inside the rollout function, the first pipeline to report can take the available GENERATION budget, while the peer pipeline hangs:
Changes
915a412rlix_hooks; fix router LB counter leak; setMilesCoordinator max_concurrency=4b2acc3esignal_demandhook before rollout dispatchRLix ↔ Miles Mapping
rollout_open_pipelinesMilesPipeline.signal_rollout_demand()run_miles_dual._signal_demand()request_gpus(GENERATION, step_target=N)(rollout_id, rollout_batch_size)MilesRLixHooksRolloutManager.set_rlix_hooks(hooks)New Flow
Extra Stability Fixes
ClientDisconnect.ClientDisconnectreturns499instead of leaking worker load state.MilesCoordinator max_concurrency=4avoids serializing resize/progress/sync RPCs.Test
run_4ppl_full_overlap_n10.sh --num-rollout 5Result: run 28 completed all 5 rollouts with
EXIT=0; earlier baseline/concurrency-only runs hung on rollout 2.