fix(rlix): gate MILES wake on per-process residual GPU memory (R02-01) #5
Open
howard989 wants to merge 3 commits into
What
Two coordinated changes for the post-offload residual check, sender side of the paired RLix PR:
- Forward `MILES_MAX_RESIDUAL_GPU_MEM_GB` from both rlix-mode drivers (`run_miles_rlix.py`, `run_miles_dual.py`) into Ray `runtime_env`.
- Gate `RolloutManager.shrink_engines` on each engine's real per-process resident GPU memory after `release_memory_occupation`.

Receiver side: rlops/rlix branch `howard/m11-residual-gpu-threshold-v2`.
Why
Per @taoluo review (R02-01): "free memory is gpu-model dependent ... it would be more robust to check the residual memory allocation."
We investigated SGLang `/server_info` `weight+kvcache+graph` as a possible residual signal. A Vast smoke with Qwen2.5-0.5B showed that `/server_info` reports ~9.32 GiB after offload and would falsely trip a hard gate.
That value is accounting/static-pool size, not resident memory. The KV static pool is computed from tensor shapes and does not drop after `torch_memory_saver` pause, which keeps the virtual allocation while freeing physical pages.
Evidence from the same run:
The slept engine's accounting is higher than the active one, while its real resident memory is much lower. Therefore `/server_info` is kept as diagnostic logging only, and the hard gate uses the engine's per-process resident GPU memory.
How The Gate Works
New `miles/utils/gpu_probe.py`:
- `self.process.pid` is the multiprocessing spawn parent
- `sglang::scheduler` child

This gives the engine's max per-GPU resident residual, matching the `MAX` semantics in `MILES_MAX_RESIDUAL_GPU_MEM_GB` and avoiding false failures for TP>1 engines.
Fail-Open Behavior
If `nvidia-smi` is unavailable or compute-app PIDs cannot be matched to the engine process tree, the probe returns `None`, logs a warning, and skips the hard gate. This avoids killing a healthy pipeline when the metric is unavailable, while engine-state polling remains the liveness gate.
If an older `nvidia-smi` does not support `gpu_bus_id`, the probe falls back to `pid,used_memory` and logs a warning that the fallback cannot distinguish per-GPU usage.
Changes
- `examples/rlix/run_miles_dual.py`: forwards `MILES_MAX_RESIDUAL_GPU_MEM_GB`
- `examples/rlix/run_miles_rlix.py`: forwards `MILES_MAX_RESIDUAL_GPU_MEM_GB`
- `miles/ray/rollout.py`: calls `assert_post_sleep_process_vram_below_threshold` from `shrink_engines`
- `miles/backends/sglang_utils/sglang_engine.py`: logs `/server_info` accounting as diagnostic
- `miles/utils/gpu_probe.py`
- `tests/test_gpu_probe.py`
- `tests/test_residual_gpu_mem_wiring.py`

Tests
Results:
E2E Verification
Vast Qwen2.5-0.5B dual smoke with paired RLix branch:
The gate measured real residual memory, did not fail-open, and passed under the RLix-side default `3.0`.
Known shutdown `RolloutManager` `500` / `RemoteProtocolError` teardown noise appears while residual `/generate` requests are cancelled. Training completed and both pipelines reached `shutdown_hard`; `EXIT_CODE=0`.
Scope
Env forwarding + per-process residual gate only. Option Beta / hooks are already upstream and untouched. Default threshold lives on the RLix side.
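
The two pieces in scope can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the helper names (`forward_threshold_env`, `max_per_gpu_resident_gib`, `residual_gate`) are hypothetical, the input rows are assumed to look like `pid, gpu_bus_id, used_memory` in MiB (as a no-header, no-units CSV query would produce), and summing per GPU across matched PIDs with a `<=` comparison is an assumed reading of the `MAX` semantics.

```python
import os
from typing import Iterable, Mapping, Optional, Set

ENV_NAME = "MILES_MAX_RESIDUAL_GPU_MEM_GB"


def forward_threshold_env(runtime_env: dict, environ: Mapping[str, str] = os.environ) -> dict:
    """Return a copy of a Ray-style runtime_env with the threshold forwarded.

    Hypothetical helper: the variable is copied into "env_vars" only if
    the driver process actually has it set.
    """
    env_vars = dict(runtime_env.get("env_vars", {}))
    if ENV_NAME in environ:
        env_vars[ENV_NAME] = environ[ENV_NAME]
    return {**runtime_env, "env_vars": env_vars}


def max_per_gpu_resident_gib(csv_rows: Iterable[str], engine_pids: Set[int]) -> Optional[float]:
    """Max over GPUs of resident memory held by the engine's processes, in GiB.

    Assumes each row is "pid, gpu_bus_id, used_memory_mib". Returns None
    when no row matches the engine's process tree (metric unavailable).
    """
    per_gpu_mib = {}
    for row in csv_rows:
        parts = [p.strip() for p in row.split(",")]
        if len(parts) != 3:
            continue  # skip malformed rows instead of failing the probe
        try:
            pid, used_mib = int(parts[0]), float(parts[2])
        except ValueError:
            continue
        if pid in engine_pids:
            per_gpu_mib[parts[1]] = per_gpu_mib.get(parts[1], 0.0) + used_mib
    if not per_gpu_mib:
        return None
    return max(per_gpu_mib.values()) / 1024.0


def residual_gate(resident_gib: Optional[float], threshold_gib: float) -> bool:
    """True if the wake may proceed; a missing metric (None) fails open."""
    if resident_gib is None:
        return True
    return resident_gib <= threshold_gib


rows = [
    "100, 00000000:01:00.0, 512",   # engine process on GPU 0
    "101, 00000000:02:00.0, 2048",  # engine process on GPU 1
    "999, 00000000:01:00.0, 4096",  # unrelated process, ignored
]
resident = max_per_gpu_resident_gib(rows, {100, 101})
print(resident, residual_gate(resident, 3.0))
```

Taking the per-GPU maximum rather than the total across GPUs is what keeps a TP>1 engine from tripping the threshold just because it spans several devices.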
Refs: `plans/m11-review.review-report/R02.md` (R02-01, MEDIUM).