fix(rlix): gate MILES wake on per-process residual GPU memory (R02-01) by howard989 · Pull Request #5 · rlops/miles

howard989 · 2026-05-25T08:38:06Z

What

Two coordinated changes for the post-offload residual check. This PR is the sender side of the paired RLix PR:

- Forward MILES_MAX_RESIDUAL_GPU_MEM_GB from both rlix-mode drivers (run_miles_rlix.py, run_miles_dual.py) into the Ray runtime_env.
- Gate RolloutManager.shrink_engines on each engine's real per-process resident GPU memory after release_memory_occupation.
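A minimal sketch of the forwarding step, for reviewers who want to see the shape of it. The helper name build_runtime_env is hypothetical and not taken from the drivers; the only Ray detail assumed is that runtime_env accepts an "env_vars" mapping of strings.

```python
import os

RESIDUAL_ENV = "MILES_MAX_RESIDUAL_GPU_MEM_GB"


def build_runtime_env(environ=None, forwarded=(RESIDUAL_ENV,)):
    """Return a Ray-style runtime_env dict with only the forwarded variables that are set.

    Hypothetical helper: the real drivers may assemble runtime_env differently.
    """
    environ = os.environ if environ is None else environ
    # Unset variables are omitted rather than forwarded as empty strings,
    # so the receiver can still apply its own default.
    env_vars = {name: environ[name] for name in forwarded if name in environ}
    return {"env_vars": env_vars}


# With Ray installed, the result would be passed as ray.init(runtime_env=...).
print(build_runtime_env({"MILES_MAX_RESIDUAL_GPU_MEM_GB": "3.0", "HOME": "/root"}))
# {'env_vars': {'MILES_MAX_RESIDUAL_GPU_MEM_GB': '3.0'}}
```

Omitting the variable when it is unset fits the Scope note that the default threshold lives on the RLix side.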

Receiver side: rlops/rlix branch howard/m11-residual-gpu-threshold-v2.

Why

Per @taoluo review (R02-01): "free memory is gpu-model dependent ... it would be more robust to check the residual memory allocation."

We investigated the SGLang /server_info fields (weight + kvcache + graph) as a possible residual signal. A smoke test on Vast with Qwen2.5-0.5B showed that /server_info reports ~9.32 GiB after offload, which would falsely trip a hard gate.

That value is an accounting (static-pool) size, not resident memory. The KV static pool size is computed from tensor shapes, so it does not drop after a torch_memory_saver pause, which frees physical pages while keeping the virtual allocation.

Evidence from the same run:

active engine:
  server_info kvcache = 7.06 GiB
  nvidia-smi process = 10686 MiB

slept/offloaded engine:
  server_info kvcache = 8.16 GiB
  nvidia-smi process = 1852 MiB (~1.81 GiB)

The slept engine's accounting is higher than the active engine's, while its real resident memory is much lower. Therefore /server_info is kept as diagnostic logging only, and the hard gate uses the engine's per-process resident GPU memory.
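To make the comparison concrete, here is a small worked check using the numbers quoted in this description. It assumes the threshold is compared in GiB (as the shrink_engines log lines print it) and that the gate trips on a strict "greater than"; both are assumptions of this sketch, not confirmed behavior.

```python
MIB_PER_GIB = 1024


def mib_to_gib(mib):
    """Convert MiB (as reported by nvidia-smi) to GiB."""
    return mib / MIB_PER_GIB


THRESHOLD_GIB = 3.0  # RLix-side default quoted in this PR

server_info_gib = 9.32                   # /server_info total after offload
process_resident_gib = mib_to_gib(1852)  # nvidia-smi value for the slept engine

for name, value in [
    ("server_info accounting", server_info_gib),
    ("nvidia-smi process", process_resident_gib),
]:
    # Assumed comparison: strictly above the threshold trips the gate.
    verdict = "trip" if value > THRESHOLD_GIB else "pass"
    print(f"{name}: {value:.2f} GiB -> {verdict}")
# server_info accounting: 9.32 GiB -> trip
# nvidia-smi process: 1.81 GiB -> pass
```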

How The Gate Works

New miles/utils/gpu_probe.py:

- walks the engine process tree
  - self.process.pid is the multiprocessing spawn parent
  - the GPU-resident process is the sglang::scheduler child
- queries:

  nvidia-smi --query-compute-apps=gpu_bus_id,pid,used_memory --format=csv,noheader,nounits

- filters to PIDs in the engine process tree
- sums matched usage within each GPU
- takes the max across GPUs

This gives the engine's max per-GPU resident residual, matching the MAX semantics in MILES_MAX_RESIDUAL_GPU_MEM_GB and avoiding false failures for TP>1 engines.
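The steps above can be sketched roughly as follows. This is not the code in miles/utils/gpu_probe.py: the function names, the pid-to-parent map input, and the sample rows are all hypothetical, and the sketch works on already-captured nvidia-smi CSV text rather than invoking the tool.

```python
from collections import defaultdict


def descendants(root_pid, parent_of):
    """Return root_pid plus every transitive child, given a pid -> parent pid map."""
    children = defaultdict(list)
    for pid, ppid in parent_of.items():
        children[ppid].append(pid)
    tree, stack = set(), [root_pid]
    while stack:
        pid = stack.pop()
        if pid not in tree:
            tree.add(pid)
            stack.extend(children[pid])
    return tree


def max_per_gpu_residual_mib(csv_text, tree_pids):
    """Sum used_memory per GPU for PIDs in the tree; return the max, or None if no row matched."""
    per_gpu = defaultdict(int)
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue  # not a gpu_bus_id,pid,used_memory row
        bus_id, pid, used = fields
        if not (pid.isdigit() and used.isdigit()):
            continue  # skip non-numeric values
        if int(pid) in tree_pids:
            per_gpu[bus_id] += int(used)
    return max(per_gpu.values()) if per_gpu else None


# Hypothetical TP=2 engine: spawn parent 100 with two scheduler children,
# plus an unrelated process (PID 200) sharing the first GPU.
parent_of = {101: 100, 102: 100, 200: 1}
sample = """\
00000000:01:00.0, 101, 1852
00000000:02:00.0, 102, 1852
00000000:01:00.0, 200, 9000
"""
tree = descendants(100, parent_of)
print(max_per_gpu_residual_mib(sample, tree))  # 1852
```

In this made-up sample, summing across GPUs would give 3704 MiB (about 3.62 GiB) and exceed a 3.0 GiB threshold, while the per-GPU max stays at 1852 MiB. That is the TP>1 false failure the MAX semantics avoid.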

Fail-Open Behavior

If nvidia-smi is unavailable or compute-app PIDs cannot be matched to the engine process tree, the probe returns None, logs a warning, and skips the hard gate. This avoids killing a healthy pipeline when the metric is unavailable, while engine-state polling remains the liveness gate.

If an older nvidia-smi does not support gpu_bus_id, the probe falls back to pid,used_memory and logs a warning that the fallback cannot distinguish per-GPU usage.
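A sketch of the fail-open decision described in this section, under stated assumptions: check_residual and ResidualTooHigh are hypothetical names, and the strict "greater than" comparison is a guess at the threshold semantics. The point it illustrates is that None (metric unavailable) is handled separately from 0.0 (metric available and zero), matching the "fail-open None-not-0" case listed for tests/test_gpu_probe.py.

```python
import logging

logger = logging.getLogger("residual_gate_sketch")


class ResidualTooHigh(RuntimeError):
    """Hypothetical error type for a tripped gate."""


def check_residual(residual_gib, threshold_gib):
    """Return 'skipped' or 'passed', or raise ResidualTooHigh."""
    if residual_gib is None:
        # Fail open: the probe could not measure, so do not block the pipeline.
        logger.warning("GPU residual probe unavailable; skipping hard gate")
        return "skipped"
    if residual_gib > threshold_gib:
        raise ResidualTooHigh(
            f"residual {residual_gib:.3f} GiB exceeds threshold {threshold_gib:.3f} GiB"
        )
    return "passed"


print(check_residual(None, 3.0))   # skipped
print(check_residual(0.0, 3.0))    # passed
print(check_residual(1.809, 3.0))  # passed
```

Testing with "is None" rather than truthiness matters here: a falsy check would treat a genuine 0.0 reading as "unavailable" and skip the gate.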

Changes

examples/rlix/run_miles_dual.py
- forwards MILES_MAX_RESIDUAL_GPU_MEM_GB
examples/rlix/run_miles_rlix.py
- forwards MILES_MAX_RESIDUAL_GPU_MEM_GB
miles/ray/rollout.py
- calls assert_post_sleep_process_vram_below_threshold from shrink_engines
- logs measured process-resident residual
miles/backends/sglang_utils/sglang_engine.py
- adds process-resident residual gate
- logs /server_info accounting as diagnostic
miles/utils/gpu_probe.py
- adds dependency-free process-tree GPU residual probe
tests/test_gpu_probe.py
- covers per-GPU max, same-GPU sum, fail-open None-not-0, and process-tree walking
tests/test_residual_gpu_mem_wiring.py
- updated for the per-process gate

Tests

python3 -m py_compile \
  miles/utils/gpu_probe.py \
  miles/backends/sglang_utils/sglang_engine.py \
  miles/ray/rollout.py

python3 -m pytest -q tests/test_gpu_probe.py tests/test_residual_gpu_mem_wiring.py

Results:

tests/test_gpu_probe.py: 11 passed
tests/test_residual_gpu_mem_wiring.py: 2 passed

E2E Verification

Dual smoke test on Vast (Qwen2.5-0.5B) with the paired RLix branch:

shrink_engines: post-sleep process-resident GPU residual max=1.809 GiB per_engine=[1.809, 1.809] threshold=3.000 GiB
shrink_engines: post-sleep process-resident GPU residual max=1.828 GiB per_engine=[1.828] threshold=3.000 GiB
mp2 training loop complete
mp1 training loop complete
shutdown_hard complete for both pipelines
EXIT_CODE=0

The gate measured real residual memory, did not fail open, and passed under the RLix-side default threshold of 3.0.

Known teardown noise (RolloutManager 500 / RemoteProtocolError) appears during shutdown while residual /generate requests are cancelled. Training completed and both pipelines reached shutdown_hard with EXIT_CODE=0.

Scope

Env forwarding + per-process residual gate only. Option Beta / hooks are already upstream and untouched. Default threshold lives on the RLix side.

Refs: plans/m11-review.review-report/R02.md (R02-01, MEDIUM).


howard989 added 2 commits May 25, 2026 00:12

fix(miles): forward residual GPU threshold env

756e426

feat(miles): gate shrink on per-GPU resident process memory

64578cc

howard989 mentioned this pull request May 25, 2026

feat(miles): per-engine process-resident GPU residual gate + forward MILES_MAX_RESIDUAL_GPU_MEM_GB) rlops/rlix#17

Open

chore(miles): log train actor post-sleep process-resident GPU memory (baseline, no assert)

da068b3

howard989 wants to merge 3 commits into rlops:zhenyu/m11-mvp-test from howard989:howard/m11-forward-residual-gpu-env-v2
