Tianye/4gpu 2ppl fix by TianyeGGBond · Pull Request #4 · rlops/miles

TianyeGGBond · 2026-05-22T19:19:39Z

Summary

Miles-side companion PR for the RLix 4-GPU 2-pipeline rollout-2+ hang fix.

RLix side: makes rollout demand durable in the scheduler.
Miles side: sends that demand before rollout dispatch starts.

Problem

Under full cross-overlap, both pipelines' actor_infer DP workers can shrink to empty between rollouts.

If demand is only reported inside the rollout function, the first pipeline to report can take the available GENERATION budget, while the peer pipeline hangs:

Warning: No progress for 30.0s. Queue size: 0, Collected: 0/N

Changes

Commit	What
`915a412`	Wire `rlix_hooks`; fix router LB counter leak; set `MilesCoordinator max_concurrency=4`
`b2acc3e`	Add awaited `signal_demand` hook before rollout dispatch

RLix ↔ Miles Mapping

RLix side	Miles side
Durable `rollout_open_pipelines`	Send rollout demand before dispatch
`MilesPipeline.signal_rollout_demand()`	`run_miles_dual._signal_demand()`
`request_gpus(GENERATION, step_target=N)`	Forward `(rollout_id, rollout_batch_size)`
Phase B injects `MilesRLixHooks`	`RolloutManager.set_rlix_hooks(hooks)`

New Flow

signal_demand()
→ pipe.signal_rollout_demand.remote(rollout_id, rollout_batch_size)
→ RLix scheduler records durable rollout demand
→ scheduler wakes at least one DP worker
→ generate.remote starts rollout dispatch

Extra Stability Fixes

Router now always decrements LB counters on ClientDisconnect.
ClientDisconnect returns 499 instead of leaking worker load state.
MilesCoordinator max_concurrency=4 avoids serializing resize/progress/sync RPCs.

Test

run_4ppl_full_overlap_n10.sh --num-rollout 5

Result: run 28 completed all 5 rollouts with EXIT=0; earlier baseline/concurrency-only runs hung on rollout 2.

…uter + max_concurrency Squashed working tree for the miles side of the 4-GPU 2-pipeline full-cross-overlap topology (P1 train=[0,1] infer=[0,1,2,3], P2 train=[2,3] infer=[0,1,2,3]). Companion to TianyeGGBond/rlix@tianye/4gpu-2ppl-fix. Three independent fixes: 1) RolloutManager never plumbed rlix_hooks (Layer 4) --------------------------------------------------- miles.ray.rollout.RolloutManager._get_rollout_data called ``call_rollout_fn(self.generate_rollout, ..., evaluation=False)`` without passing ``rlix_hooks``. fully_async_rollout.generate_rollout_async defaults the kwarg to None and promotes to NoOpRLixHooks — every ``begin_progress_batch`` / ``bump_completed`` was a silent no-op. Between rollouts the rlix central scheduler had no demand signal; gap-ratio had no basis to re-expand engines after _after_training released actor_train; the next rollout's ``await rollout_data`` hung forever. Fix: - ``RolloutManager.__init__`` initializes ``self._rlix_hooks = None`` (standalone miles keeps NoOp; only RLix-mode injects). - New ``RolloutManager.set_rlix_hooks(hooks)`` method. - ``_get_rollout_data`` forwards ``rlix_hooks=self._rlix_hooks``. - ``call_rollout_fn`` (miles/rollout/base_types.py) introspects the target fn's signature and only forwards ``rlix_hooks`` when the fn declares it, preserving back-compat for legacy rollout fns. The rlix-side driver (TianyeGGBond/rlix@tianye/4gpu-2ppl-fix) calls ``rollout_manager.set_rlix_hooks.remote(MilesRLixHooks(coord, pipeline_id))`` in Phase B step 4c. 2) Router leaked LB counter on ClientDisconnect (Layer 5) --------------------------------------------------------- miles/router/router.py:do_proxy acquired a worker via ``_use_url_async`` (``worker_request_counts[url] += 1``) BEFORE the ``try/finally`` containing the matching ``_finish_url`` decrement. The very next line (``await request.body()``) raises ``ClientDisconnect`` when a peer pipeline's engine shrink races a probe-time request — and that exception escaped the try, so ``_finish_url`` never ran. Every disconnect leaked one counter; the LB's ``min(candidates, key=worker_request_counts.get)`` eventually treated leaked workers as permanently busy. Fix: - Moved the ``try/finally`` boundary up to wrap ``await request.body()`` and the upstream request; ``_finish_url`` always runs. - ``proxy()`` catches ``ClientDisconnect`` explicitly and returns 499 with a one-line WARNING. Real CUDA / Ray / scheduler exceptions still propagate (only ``ClientDisconnect`` is suppressed). 3) MilesCoordinator max_concurrency ----------------------------------- Default ``max_concurrency=1`` made the coordinator serialize all RPCs: ``scheduler.resize_infer``, ``report_progress_from_scheduler``, and ``sync_base_weights_to_active`` queued behind each other. Bumped to 4 to mirror rlix's reference 2-pipeline 4-GPU example (``examples/start_multi_pipeline_test.py: COORDINATOR_MAX_CONCURRENCY``). Empirically reduced the post-rollout-0 incidence of Ray's internal ``'Worker' object has no attribute 'core_worker'`` race at ``worker.py:1039`` by relieving the coordinator's RPC-handler thread queue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-signal scheduler demand BEFORE each rollout dispatch so the gap-ratio planner can wake actor_infer engines without waiting for the rollout function's first begin_progress_batch — too late under cross-pipeline contention. Companion to rlix@tianye/4gpu-2ppl-fix commit which adds MilesPipeline.signal_rollout_demand. Root cause: between rollouts, both pipelines' actor_infer DP workers shrink to set() because actor_train (ACTOR_TRAINING priority) preempts the shared infer GPUs. The first pipeline whose rollout function fires begin_progress_batch wins the GENERATION budget; the peer pipeline starves on "Warning: No progress for 30.0s. Queue size: 0, Collected: 0/N". Mirrors rlix/pipeline/full_finetune_pipeline.py:696 (Phase 4.5 notify_release_then_request_cluster_gpus) — re-request actor_infer at GENERATION with fresh step_target_estimate per rollout. miles can't tear-down/re-request actor_infer mid-loop without resetting SGLang engines, so the rlix-side companion publishes a synthetic new_batch=True progress report instead. This change adds the train-loop hook + driver wire-up that invokes it. Changes: - miles/utils/rlix_train_loop.py: new optional signal_demand: StepHook kwarg on run_async_train_loop. Called pre-pre-loop-dispatch (for start_rollout_id) and after every after_step before dispatching the next rollout. Awaited (commit-before-dispatch ordering); generate.remote immediately follows so the scheduler sees fresh demand BEFORE the rollout function pulls samples. Failure path (release_only) does not call signal_demand because the loop re-raises and no next rollout fires. - examples/rlix/run_miles_dual.py: defines _signal_demand step hook that forwards rollout_id + rollout_batch_size to pipe.signal_rollout_demand.remote. step_target=rollout_batch_size matches the per-rollout demand the scheduler sees once the real begin_progress_batch arrives. Defensive <=0 short-circuit. Test: run 28 of /root/run_4ppl_full_overlap_n10.sh (--num-rollout 5, P1 train=[0,1] infer=[0,1,2,3], P2 train=[2,3] infer=[0,1,2,3]) — completed all 5 rollouts cleanly with EXIT=0; prior runs (run 25 baseline, run 27 with concurrency-bump only) hung on rollout 2 collection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

TianyeGGBond's paired PRs (#16 + rlops/miles#4) ship the proper fix for B-13 (mp1 wedge after mp2 completes early under cross-overlap). The fix uses the scheduler's existing gap-ratio machinery via pre-signalled durable demand instead of fighting asyncio.gather, and is materially better than my proposed M11.3 fire-shutdown_hard-early sketch. Root cause (refined from TianyeGGBond's PR description, Codex review of the rlix planner code confirms): - Between rollouts, actor_train preempts shared infer GPUs → both pipelines' actor_infer DP workers shrink to set(). - begin_progress_batch fires INSIDE the rollout fn, after dispatch starts → first pipeline wins all DP workers, peer hangs. - Two compounding sub-bugs: (a) pending_bucket_gen is a per-cycle snapshot that vanishes once consumed (planner blind during rollout boundary); (b) MilesRLixHooks was never wired at init → begin_progress_batch hit NoOpRLixHooks (silent no-op). Joint fix: - rlix#16: durable rollout_open_pipelines registry in scheduler; MilesPipeline.signal_rollout_demand(rollout_id, step_target); wires MilesRLixHooks in Phase B init; pending_bucket_gen durability. - miles#4: _signal_demand step hook in dual driver; set_rlix_hooks + _rlix_hooks plumbing on RolloutManager; call_rollout_fn forwards rlix_hooks; max_concurrency=4 on coordinator; router 499 + counter balance fix. Codex joint verdict: APPROVE_WITH_NOTES, 0 BLOCKERS. Prior CRITICAL (signal_rollout_demand missing) + HIGH (set_rlix_hooks not wired) both RESOLVED in the pair. Joint smoke evidence (per miles#4 PR): run_4ppl_full_overlap_n10.sh --num-rollout 5, run 28 EXIT=0 — earlier baseline hung on rollout 2. Merge order required: rlix#16 MUST land before miles#4 (miles#4 alone raises AttributeError on pipe.signal_rollout_demand.remote without the rlix#16 method). M11.3 follow-up scope shrinks: concurrent-resize stress test under max_concurrency=4; automated 4-GPU 2-pipeline rollout-2+ regression; verify generate_rollout_fully_async accepts rlix_hooks kw.

Both PRs merged into the zhenyu/* branches; my Phase 1/3 work fast-forwarded cleanly under both merges (no conflicts because TianyeGGBond branched off my prior commits). Verified merged state contains both: - My Phase 3c finish_init_offload + Phase 1 release_train_only on rlix - TianyeGGBond's signal_rollout_demand + set_rlix_hooks wiring on rlix - My Phase 2 overlap env-driven topology on miles - TianyeGGBond's _signal_demand step hook + max_concurrency=4 on miles Vast 4×A40 dual-overlap regression smoke pending on a new instance — to be run with the joint fix to confirm all 7 PASS-bar conditions (including C2 shutdown_hard complete, previously blocked by B-13).

New: docs/m11-tianye-prs-review.md — standalone document summarizing TianyeGGBond's paired PRs (#16 + rlops/miles#4) that fixed B-13 (M11.2 overlap rollout-2+ hang). Covers: - Root cause analysis (per-cycle pending_bucket_gen + MilesRLixHooks never wired) - Fix architecture (durable rollout_open_pipelines registry + signal_rollout_demand pre-dispatch) - File-level change inventory across both PRs (~14 files / +300 LoC) - Validation layers: 5 Codex review rounds, local merge spot-check, real vast smoke v8/v9/v10 evidence - Outstanding follow-ups (max_concurrency=4 stress test, rlix_hooks kw verification) - Doc + commit pointers Easier to share than the full m11-2-overlap-log.md for reviewers who only need the Tianye-PR context.

Adds §3.5 "Post-Option-A milestones" tracking 6 new lines of work: - Phase 1 — R04-F1 try/finally + release_train_only shim (rlix 47bf02b) - Phase 3 — Option β state machine (finish_init_offload + Codex Q5 invariants) (rlix 47bf02b) - Phase 7 — F3+F4 driver-side cleanup (asyncio.wait FIRST_EXCEPTION + try/finally shutdown_hard) (miles 79f2874) - Tianye's paired PRs (#16 + rlops/miles#4) fix B-13 wedge — durable rollout_open_pipelines registry + signal_rollout_demand + MilesRLixHooks wiring; merged 2026-05-24 - B-14 hardware compat — MILES_SKIP_TMS_PAUSE=1 was unconditionally set; removing it on stable-tms hardware enables tms.pause() (rlix a011bbf) - Batch B-15+F5+F9+F7 — harness C0 + nvidia-smi INFO + odd-GPU validation + pause_generation docs (rlix a4c6369 + miles 1487c3f) Updates §1.4 status table — M11.2 now has 3 rows: - Option A (disjoint pools, M11.2 Attempt 4 GREEN) - Real overlap (shared infer pool, verified 2026-05-24 v8/v9/v10) - M11.3+ deferred Updates §6 known limitations — marks F1/F3/F4/B-13/B-14/B-15/F5/F7/F9 as RESOLVED with commit pointers. Adds rows for MED1 (concurrent resize stress) + LOW1 (rlix_hooks kw) follow-ups. F2 noted as howard989's PR in flight. Updates Appendix C — current HEADs (rlix 9b1fa90, miles 1487c3f). Cross-refs to plans/m11-2-overlap-log.md + docs/m11-tianye-prs-review.md.

TianyeGGBond and others added 2 commits May 20, 2026 02:54

TianyeGGBond requested review from howard989, taoluo and zhenyulincs May 22, 2026 19:19

zhenyulincs merged commit 8f5cef8 into rlops:zhenyu/m11-mvp-test May 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tianye/4gpu 2ppl fix#4

Tianye/4gpu 2ppl fix#4
zhenyulincs merged 2 commits into
rlops:zhenyu/m11-mvp-testfrom
TianyeGGBond:tianye/4gpu-2ppl-fix

TianyeGGBond commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

TianyeGGBond commented May 22, 2026

Summary

Problem

Changes

RLix ↔ Miles Mapping

New Flow

Extra Stability Fixes

Test

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants