Skip to content

Tianye/4gpu 2ppl fix#4

Merged
zhenyulincs merged 2 commits into
rlops:zhenyu/m11-mvp-testfrom
TianyeGGBond:tianye/4gpu-2ppl-fix
May 24, 2026
Merged

Tianye/4gpu 2ppl fix#4
zhenyulincs merged 2 commits into
rlops:zhenyu/m11-mvp-testfrom
TianyeGGBond:tianye/4gpu-2ppl-fix

Conversation

@TianyeGGBond
Copy link
Copy Markdown

Summary

Miles-side companion PR for the RLix 4-GPU 2-pipeline rollout-2+ hang fix.

RLix side: makes rollout demand durable in the scheduler.
Miles side: sends that demand before rollout dispatch starts.

Problem

Under full cross-overlap, both pipelines' actor_infer DP workers can shrink to empty between rollouts.

If demand is only reported inside the rollout function, the first pipeline to report can take the available GENERATION budget, while the peer pipeline hangs:

Warning: No progress for 30.0s. Queue size: 0, Collected: 0/N

Changes

Commit What
915a412 Wire rlix_hooks; fix router LB counter leak; set MilesCoordinator max_concurrency=4
b2acc3e Add awaited signal_demand hook before rollout dispatch

RLix ↔ Miles Mapping

RLix side Miles side
Durable rollout_open_pipelines Send rollout demand before dispatch
MilesPipeline.signal_rollout_demand() run_miles_dual._signal_demand()
request_gpus(GENERATION, step_target=N) Forward (rollout_id, rollout_batch_size)
Phase B injects MilesRLixHooks RolloutManager.set_rlix_hooks(hooks)

New Flow

signal_demand()
→ pipe.signal_rollout_demand.remote(rollout_id, rollout_batch_size)
→ RLix scheduler records durable rollout demand
→ scheduler wakes at least one DP worker
→ generate.remote starts rollout dispatch

Extra Stability Fixes

  • Router now always decrements LB counters on ClientDisconnect.
  • ClientDisconnect returns 499 instead of leaking worker load state.
  • MilesCoordinator max_concurrency=4 avoids serializing resize/progress/sync RPCs.

Test

run_4ppl_full_overlap_n10.sh --num-rollout 5

Result: run 28 completed all 5 rollouts with EXIT=0; earlier baseline/concurrency-only runs hung on rollout 2.

TianyeGGBond and others added 2 commits May 20, 2026 02:54
…uter + max_concurrency

Squashed working tree for the miles side of the 4-GPU 2-pipeline
full-cross-overlap topology (P1 train=[0,1] infer=[0,1,2,3],
P2 train=[2,3] infer=[0,1,2,3]). Companion to
TianyeGGBond/rlix@tianye/4gpu-2ppl-fix.

Three independent fixes:

1) RolloutManager never plumbed rlix_hooks (Layer 4)
---------------------------------------------------
miles.ray.rollout.RolloutManager._get_rollout_data called
``call_rollout_fn(self.generate_rollout, ..., evaluation=False)`` without
passing ``rlix_hooks``. fully_async_rollout.generate_rollout_async defaults
the kwarg to None and promotes to NoOpRLixHooks — every
``begin_progress_batch`` / ``bump_completed`` was a silent no-op. Between
rollouts the rlix central scheduler had no demand signal; gap-ratio had no
basis to re-expand engines after _after_training released actor_train; the
next rollout's ``await rollout_data`` hung forever.

Fix:
- ``RolloutManager.__init__`` initializes ``self._rlix_hooks = None``
  (standalone miles keeps NoOp; only RLix-mode injects).
- New ``RolloutManager.set_rlix_hooks(hooks)`` method.
- ``_get_rollout_data`` forwards ``rlix_hooks=self._rlix_hooks``.
- ``call_rollout_fn`` (miles/rollout/base_types.py) introspects the target
  fn's signature and only forwards ``rlix_hooks`` when the fn declares it,
  preserving back-compat for legacy rollout fns.

The rlix-side driver (TianyeGGBond/rlix@tianye/4gpu-2ppl-fix) calls
``rollout_manager.set_rlix_hooks.remote(MilesRLixHooks(coord, pipeline_id))``
in Phase B step 4c.

2) Router leaked LB counter on ClientDisconnect (Layer 5)
---------------------------------------------------------
miles/router/router.py:do_proxy acquired a worker via ``_use_url_async``
(``worker_request_counts[url] += 1``) BEFORE the ``try/finally`` containing
the matching ``_finish_url`` decrement. The very next line
(``await request.body()``) raises ``ClientDisconnect`` when a peer
pipeline's engine shrink races a probe-time request — and that exception
escaped the try, so ``_finish_url`` never ran. Every disconnect leaked one
counter; the LB's ``min(candidates, key=worker_request_counts.get)``
eventually treated leaked workers as permanently busy.

Fix:
- Moved the ``try/finally`` boundary up to wrap ``await request.body()``
  and the upstream request; ``_finish_url`` always runs.
- ``proxy()`` catches ``ClientDisconnect`` explicitly and returns 499 with
  a one-line WARNING. Real CUDA / Ray / scheduler exceptions still
  propagate (only ``ClientDisconnect`` is suppressed).

3) MilesCoordinator max_concurrency
-----------------------------------
Default ``max_concurrency=1`` made the coordinator serialize all RPCs:
``scheduler.resize_infer``, ``report_progress_from_scheduler``, and
``sync_base_weights_to_active`` queued behind each other. Bumped to 4 to
mirror rlix's reference 2-pipeline 4-GPU example
(``examples/start_multi_pipeline_test.py: COORDINATOR_MAX_CONCURRENCY``).
Empirically reduced the post-rollout-0 incidence of Ray's internal
``'Worker' object has no attribute 'core_worker'`` race at
``worker.py:1039`` by relieving the coordinator's RPC-handler thread queue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-signal scheduler demand BEFORE each rollout dispatch so the gap-ratio
planner can wake actor_infer engines without waiting for the rollout
function's first begin_progress_batch — too late under cross-pipeline
contention. Companion to rlix@tianye/4gpu-2ppl-fix commit which adds
MilesPipeline.signal_rollout_demand.

Root cause: between rollouts, both pipelines' actor_infer DP workers
shrink to set() because actor_train (ACTOR_TRAINING priority) preempts
the shared infer GPUs. The first pipeline whose rollout function fires
begin_progress_batch wins the GENERATION budget; the peer pipeline
starves on "Warning: No progress for 30.0s. Queue size: 0, Collected: 0/N".

Mirrors rlix/pipeline/full_finetune_pipeline.py:696 (Phase 4.5
notify_release_then_request_cluster_gpus) — re-request actor_infer at
GENERATION with fresh step_target_estimate per rollout. miles can't
tear-down/re-request actor_infer mid-loop without resetting SGLang
engines, so the rlix-side companion publishes a synthetic
new_batch=True progress report instead. This change adds the train-loop
hook + driver wire-up that invokes it.

Changes:

- miles/utils/rlix_train_loop.py: new optional signal_demand: StepHook
  kwarg on run_async_train_loop. Called pre-pre-loop-dispatch (for
  start_rollout_id) and after every after_step before dispatching the
  next rollout. Awaited (commit-before-dispatch ordering); generate.remote
  immediately follows so the scheduler sees fresh demand BEFORE the
  rollout function pulls samples. Failure path (release_only) does not
  call signal_demand because the loop re-raises and no next rollout
  fires.

- examples/rlix/run_miles_dual.py: defines _signal_demand step hook
  that forwards rollout_id + rollout_batch_size to
  pipe.signal_rollout_demand.remote. step_target=rollout_batch_size
  matches the per-rollout demand the scheduler sees once the real
  begin_progress_batch arrives. Defensive <=0 short-circuit.

Test: run 28 of /root/run_4ppl_full_overlap_n10.sh
  (--num-rollout 5, P1 train=[0,1] infer=[0,1,2,3], P2 train=[2,3]
  infer=[0,1,2,3]) — completed all 5 rollouts cleanly with EXIT=0;
  prior runs (run 25 baseline, run 27 with concurrency-bump only)
  hung on rollout 2 collection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zhenyulincs added a commit to rlops/rlix that referenced this pull request May 24, 2026
TianyeGGBond's paired PRs (#16 + rlops/miles#4) ship the
proper fix for B-13 (mp1 wedge after mp2 completes early under
cross-overlap). The fix uses the scheduler's existing gap-ratio
machinery via pre-signalled durable demand instead of fighting
asyncio.gather, and is materially better than my proposed M11.3
fire-shutdown_hard-early sketch.

Root cause (refined from TianyeGGBond's PR description, Codex review
of the rlix planner code confirms):
- Between rollouts, actor_train preempts shared infer GPUs → both
  pipelines' actor_infer DP workers shrink to set().
- begin_progress_batch fires INSIDE the rollout fn, after dispatch
  starts → first pipeline wins all DP workers, peer hangs.
- Two compounding sub-bugs: (a) pending_bucket_gen is a per-cycle
  snapshot that vanishes once consumed (planner blind during rollout
  boundary); (b) MilesRLixHooks was never wired at init →
  begin_progress_batch hit NoOpRLixHooks (silent no-op).

Joint fix:
- rlix#16: durable rollout_open_pipelines registry in scheduler;
  MilesPipeline.signal_rollout_demand(rollout_id, step_target);
  wires MilesRLixHooks in Phase B init; pending_bucket_gen durability.
- miles#4: _signal_demand step hook in dual driver; set_rlix_hooks +
  _rlix_hooks plumbing on RolloutManager; call_rollout_fn forwards
  rlix_hooks; max_concurrency=4 on coordinator; router 499 + counter
  balance fix.

Codex joint verdict: APPROVE_WITH_NOTES, 0 BLOCKERS. Prior CRITICAL
(signal_rollout_demand missing) + HIGH (set_rlix_hooks not wired)
both RESOLVED in the pair.

Joint smoke evidence (per miles#4 PR): run_4ppl_full_overlap_n10.sh
--num-rollout 5, run 28 EXIT=0 — earlier baseline hung on rollout 2.

Merge order required: rlix#16 MUST land before miles#4 (miles#4 alone
raises AttributeError on pipe.signal_rollout_demand.remote without
the rlix#16 method).

M11.3 follow-up scope shrinks: concurrent-resize stress test under
max_concurrency=4; automated 4-GPU 2-pipeline rollout-2+ regression;
verify generate_rollout_fully_async accepts rlix_hooks kw.
@zhenyulincs zhenyulincs merged commit 8f5cef8 into rlops:zhenyu/m11-mvp-test May 24, 2026
zhenyulincs added a commit to rlops/rlix that referenced this pull request May 24, 2026
Both PRs merged into the zhenyu/* branches; my Phase 1/3 work
fast-forwarded cleanly under both merges (no conflicts because
TianyeGGBond branched off my prior commits).

Verified merged state contains both:
- My Phase 3c finish_init_offload + Phase 1 release_train_only on rlix
- TianyeGGBond's signal_rollout_demand + set_rlix_hooks wiring on rlix
- My Phase 2 overlap env-driven topology on miles
- TianyeGGBond's _signal_demand step hook + max_concurrency=4 on miles

Vast 4×A40 dual-overlap regression smoke pending on a new instance —
to be run with the joint fix to confirm all 7 PASS-bar conditions
(including C2 shutdown_hard complete, previously blocked by B-13).
zhenyulincs added a commit to rlops/rlix that referenced this pull request May 24, 2026
New: docs/m11-tianye-prs-review.md — standalone document summarizing
TianyeGGBond's paired PRs (#16 + rlops/miles#4) that fixed
B-13 (M11.2 overlap rollout-2+ hang).

Covers:
- Root cause analysis (per-cycle pending_bucket_gen + MilesRLixHooks
  never wired)
- Fix architecture (durable rollout_open_pipelines registry +
  signal_rollout_demand pre-dispatch)
- File-level change inventory across both PRs (~14 files / +300 LoC)
- Validation layers: 5 Codex review rounds, local merge spot-check,
  real vast smoke v8/v9/v10 evidence
- Outstanding follow-ups (max_concurrency=4 stress test, rlix_hooks
  kw verification)
- Doc + commit pointers

Easier to share than the full m11-2-overlap-log.md for reviewers
who only need the Tianye-PR context.
zhenyulincs added a commit to rlops/rlix that referenced this pull request May 24, 2026
Adds §3.5 "Post-Option-A milestones" tracking 6 new lines of work:

- Phase 1 — R04-F1 try/finally + release_train_only shim (rlix 47bf02b)
- Phase 3 — Option β state machine (finish_init_offload + Codex Q5
  invariants) (rlix 47bf02b)
- Phase 7 — F3+F4 driver-side cleanup (asyncio.wait FIRST_EXCEPTION +
  try/finally shutdown_hard) (miles 79f2874)
- Tianye's paired PRs (#16 + rlops/miles#4) fix B-13 wedge —
  durable rollout_open_pipelines registry + signal_rollout_demand +
  MilesRLixHooks wiring; merged 2026-05-24
- B-14 hardware compat — MILES_SKIP_TMS_PAUSE=1 was unconditionally set;
  removing it on stable-tms hardware enables tms.pause() (rlix a011bbf)
- Batch B-15+F5+F9+F7 — harness C0 + nvidia-smi INFO + odd-GPU
  validation + pause_generation docs (rlix a4c6369 + miles 1487c3f)

Updates §1.4 status table — M11.2 now has 3 rows:
- Option A (disjoint pools, M11.2 Attempt 4 GREEN)
- Real overlap (shared infer pool, verified 2026-05-24 v8/v9/v10)
- M11.3+ deferred

Updates §6 known limitations — marks F1/F3/F4/B-13/B-14/B-15/F5/F7/F9
as RESOLVED with commit pointers. Adds rows for MED1 (concurrent
resize stress) + LOW1 (rlix_hooks kw) follow-ups. F2 noted as
howard989's PR in flight.

Updates Appendix C — current HEADs (rlix 9b1fa90, miles 1487c3f).

Cross-refs to plans/m11-2-overlap-log.md + docs/m11-tianye-prs-review.md.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants