fix(rlix): gate MILES wake on per-process residual GPU memory (R02-01) #5

Open
howard989 wants to merge 3 commits into rlops:zhenyu/m11-mvp-test from howard989:howard/m11-forward-residual-gpu-env-v2

howard989 commented May 25, 2026

What

Two coordinated changes for the post-offload residual check, sender side of the paired RLix PR:

  1. Forward MILES_MAX_RESIDUAL_GPU_MEM_GB from both rlix-mode drivers (run_miles_rlix.py, run_miles_dual.py) into Ray runtime_env.
  2. Gate RolloutManager.shrink_engines on each engine's real per-process resident GPU memory after release_memory_occupation.
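
A minimal sketch of the forwarding step in item 1, assuming the drivers build a plain runtime_env dict before handing it to Ray. The helper name build_runtime_env and its parameters are hypothetical and not taken from run_miles_rlix.py or run_miles_dual.py; only the variable name and Ray's "env_vars" key come from the PR and from Ray itself.

```python
import os

ENV_NAME = "MILES_MAX_RESIDUAL_GPU_MEM_GB"


def build_runtime_env(base=None, environ=None):
    """Return a copy of a Ray runtime_env dict with the threshold forwarded.

    Hypothetical helper: the real drivers may assemble runtime_env differently.
    """
    environ = os.environ if environ is None else environ
    runtime_env = dict(base or {})
    env_vars = dict(runtime_env.get("env_vars", {}))
    value = environ.get(ENV_NAME)
    if value is not None:
        # Ray expects env_vars values to be strings, so forward the raw value.
        env_vars[ENV_NAME] = value
    runtime_env["env_vars"] = env_vars
    return runtime_env


# Typical use in a driver (not executed here):
#   ray.init(runtime_env=build_runtime_env())
forwarded = build_runtime_env({"env_vars": {"FOO": "1"}}, {ENV_NAME: "3.0"})
untouched = build_runtime_env({}, {})
```

Forwarding only when the variable is set leaves the choice of default to the receiver, which matches the Scope note that the default threshold lives on the RLix side.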

Receiver side: rlops/rlix branch howard/m11-residual-gpu-threshold-v2.

Why

Per @taoluo review (R02-01): "free memory is gpu-model dependent ... it would be more robust to check the residual memory allocation."

We investigated SGLang's /server_info accounting (weight+kvcache+graph) as a possible residual signal. A Vast smoke test with Qwen2.5-0.5B showed that /server_info still reports ~9.32 GiB after offload, which would falsely trip a hard gate.

That value reflects accounting (static-pool size), not resident memory. The KV static pool size is computed from tensor shapes, so it does not drop after a torch_memory_saver pause, which keeps the virtual allocation while freeing the physical pages.

Evidence from the same run:

active engine:
  server_info kvcache = 7.06 GiB
  nvidia-smi process = 10686 MiB

slept/offloaded engine:
  server_info kvcache = 8.16 GiB
  nvidia-smi process = 1852 MiB (~1.81 GiB)

The slept engine's accounting is higher than the active one, while its real resident memory is much lower. Therefore /server_info is kept as diagnostic logging only, and the hard gate uses the engine's per-process resident GPU memory.
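
As a worked check of the numbers above, the sketch below converts the nvidia-smi readings to GiB and compares both candidate signals against a 3.0 GiB threshold (the RLix-side default mentioned under E2E Verification). The helper name is illustrative only.

```python
MIB_PER_GIB = 1024


def mib_to_gib(mib):
    """Convert MiB (as reported by nvidia-smi) to GiB."""
    return mib / MIB_PER_GIB


THRESHOLD_GIB = 3.0  # RLix-side default quoted in this PR

active_gib = mib_to_gib(10686)   # nvidia-smi, active engine
slept_gib = mib_to_gib(1852)     # nvidia-smi, slept/offloaded engine
server_info_gib = 9.32           # /server_info total after offload

# /server_info accounting would trip the gate; resident memory does not.
accounting_trips = server_info_gib > THRESHOLD_GIB
resident_trips = slept_gib > THRESHOLD_GIB

print(f"active={active_gib:.2f} GiB slept={slept_gib:.2f} GiB")
print(f"accounting trips gate: {accounting_trips}, resident trips gate: {resident_trips}")
```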

How The Gate Works

New miles/utils/gpu_probe.py:

  • walks the engine process tree
    • self.process.pid is the multiprocessing spawn parent
    • the GPU-resident process is the sglang::scheduler child
  • queries:
      nvidia-smi --query-compute-apps=gpu_bus_id,pid,used_memory --format=csv,noheader,nounits
  • filters to PIDs in the engine process tree
  • sums matched usage within each GPU
  • takes the max across GPUs

This gives the engine's max per-GPU resident residual, matching the MAX semantics in MILES_MAX_RESIDUAL_GPU_MEM_GB and avoiding false failures for TP>1 engines.
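
The steps above can be sketched as two pure functions. This is not the code in miles/utils/gpu_probe.py: the function names, the {pid: ppid} mapping used to represent the process tree, and the sample PIDs and bus IDs are assumptions made for illustration. The CSV layout matches the nvidia-smi query shown above.

```python
from collections import defaultdict


def descendants(root_pid, parent_of):
    """Return the set of PIDs in the tree rooted at root_pid.

    parent_of is a {pid: ppid} mapping (hypothetical input; a real probe
    would have to build it from the OS, for example from /proc).
    """
    children = defaultdict(list)
    for pid, ppid in parent_of.items():
        children[ppid].append(pid)
    tree, stack = set(), [root_pid]
    while stack:
        pid = stack.pop()
        if pid in tree:
            continue
        tree.add(pid)
        stack.extend(children[pid])
    return tree


def max_per_gpu_residual_mib(csv_text, tree_pids):
    """Sum used_memory per GPU for PIDs in the tree, then take the max.

    Returns None (not 0) when no row matches, so callers can fail open.
    """
    per_gpu = defaultdict(int)
    for line in csv_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue
        bus_id, pid, used = fields
        if not pid.isdigit() or not used.isdigit():
            continue  # skip rows with non-numeric values
        if int(pid) in tree_pids:
            per_gpu[bus_id] += int(used)
    return max(per_gpu.values()) if per_gpu else None


# Illustrative data: engine parent 100 with two scheduler children (TP=2),
# plus an unrelated process 200 sharing the first GPU.
parent_of = {100: 1, 101: 100, 102: 100, 200: 1}
sample = (
    "00000000:01:00.0, 101, 1852\n"
    "00000000:02:00.0, 102, 1830\n"
    "00000000:01:00.0, 200, 9000\n"
)
tree = descendants(100, parent_of)
residual = max_per_gpu_residual_mib(sample, tree)
```

Taking the max across GPUs rather than the total is what keeps a TP=2 engine with about 1.8 GiB on each GPU under a per-GPU threshold, instead of being judged on the combined figure.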

Fail-Open Behavior

If nvidia-smi is unavailable or compute-app PIDs cannot be matched to the engine process tree, the probe returns None, logs a warning, and skips the hard gate. This avoids killing a healthy pipeline when the metric is unavailable, while engine-state polling remains the liveness gate.

If an older nvidia-smi does not support gpu_bus_id, the probe falls back to pid,used_memory and logs a warning that the fallback cannot distinguish per-GPU usage.
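
A minimal sketch of the fail-open decision, assuming the probe returns a value in GiB or None. The function name check_residual, the logger name, the exception type, and the strictness of the boundary comparison are assumptions; the actual gate is assert_post_sleep_process_vram_below_threshold, whose signature is not shown in this PR description.

```python
import logging

logger = logging.getLogger("residual_gate_sketch")


def check_residual(residual_gib, threshold_gib):
    """Return True if the gate ran and passed, False if it was skipped.

    Raises RuntimeError when a real measurement exceeds the threshold.
    Hypothetical stand-in for the real gate in sglang_engine.py.
    """
    if residual_gib is None:
        # Metric unavailable: warn and skip instead of failing the pipeline.
        logger.warning("process-resident GPU residual unavailable; skipping hard gate")
        return False
    # Assumption: a value exactly at the threshold is allowed.
    if residual_gib > threshold_gib:
        raise RuntimeError(
            f"post-sleep residual {residual_gib:.3f} GiB exceeds "
            f"threshold {threshold_gib:.3f} GiB"
        )
    return True


skipped = check_residual(None, 3.0)    # fail-open path
passed = check_residual(1.809, 3.0)    # value from the E2E log
zero_ok = check_residual(0.0, 3.0)     # 0.0 is a measurement, not "missing"
```

Distinguishing None from 0.0 matters here: a probe that reported 0 on failure would silently pass the gate while looking like a real measurement, which is the "None-not-0" case the tests cover.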

Changes

  • examples/rlix/run_miles_dual.py
    • forwards MILES_MAX_RESIDUAL_GPU_MEM_GB
  • examples/rlix/run_miles_rlix.py
    • forwards MILES_MAX_RESIDUAL_GPU_MEM_GB
  • miles/ray/rollout.py
    • calls assert_post_sleep_process_vram_below_threshold from shrink_engines
    • logs measured process-resident residual
  • miles/backends/sglang_utils/sglang_engine.py
    • adds process-resident residual gate
    • logs /server_info accounting as diagnostic
  • miles/utils/gpu_probe.py
    • adds dependency-free process-tree GPU residual probe
  • tests/test_gpu_probe.py
    • covers per-GPU max, same-GPU sum, fail-open None-not-0, and process-tree walking
  • tests/test_residual_gpu_mem_wiring.py
    • updated for the per-process gate

Tests

python3 -m py_compile \
  miles/utils/gpu_probe.py \
  miles/backends/sglang_utils/sglang_engine.py \
  miles/ray/rollout.py

python3 -m pytest -q tests/test_gpu_probe.py tests/test_residual_gpu_mem_wiring.py

Results:

tests/test_gpu_probe.py: 11 passed
tests/test_residual_gpu_mem_wiring.py: 2 passed

E2E Verification

Vast Qwen2.5-0.5B dual smoke with paired RLix branch:

shrink_engines: post-sleep process-resident GPU residual max=1.809 GiB per_engine=[1.809, 1.809] threshold=3.000 GiB
shrink_engines: post-sleep process-resident GPU residual max=1.828 GiB per_engine=[1.828] threshold=3.000 GiB
mp2 training loop complete
mp1 training loop complete
shutdown_hard complete for both pipelines
EXIT_CODE=0

The gate measured real residual memory, did not fall back to the fail-open path, and passed under the RLix-side default threshold of 3.0 GiB.

Known teardown noise (RolloutManager 500 / RemoteProtocolError) appears at shutdown while residual /generate requests are cancelled. Training completed and both pipelines reached shutdown_hard with EXIT_CODE=0.

Scope

Env forwarding + per-process residual gate only. Option Beta / hooks are already upstream and untouched. Default threshold lives on the RLix side.

Refs: plans/m11-review.review-report/R02.md (R02-01, MEDIUM).
