[Feature] Support offload and wake up of SGLang Diffusion #19152

Open

klhhhhh wants to merge 77 commits into sgl-project:main from klhhhhh:main

docs/advanced_features/sglang_for_rl.md

@@ -46,6 +46,7 @@ Enable memory saver support when launching the server:
     - This call asserts there are no ongoing requests. Ensure the engine is idle before calling it.
     - If `kv_cache` is released, SGLang flushes cache; subsequent requests will rebuild KV cache as needed.
+    - SGLang Diffusion also supports releasing memory occupation, but its request body has no `tags` field.
     ### Resume Memory
@@ -58,6 +59,8 @@ Enable memory saver support when launching the server:
     | `tags` | Which memory regions to resume. If omitted, all are resumed. | `None` | Type: list[str], values: `kv_cache`, `weights` |
     <!-- python/sglang/srt/managers/io_struct.py#L1393 currently only supports `kv_cache`, `weights` -->
+    SGLang Diffusion also supports resuming memory occupation, but its request body has no `tags` field.
     ## Open-To-Use Refit Functionality
     After training completes each step, rollout engines must be refit with new weights. SGLang supports three refit strategies so you can match your infrastructure style (co-located vs disaggregated) and scaling needs. Each strategy maps to a concrete API with clear request schemas. For a deeper dive into SGLang's weight update utilities, see [RL System Deep Thinking: Weight Update Mechanisms](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/sys-design/readme-1-EN.md).
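The two documentation lines added above say that the diffusion endpoints take no `tags` field. A minimal client-side sketch of what that could look like follows. The base URL and port are assumptions (they are not stated in this PR), and `build_memory_request` and `send` are hypothetical helper names, not part of SGLang.

```python
import json
import urllib.request

# Assumption: the diffusion server listens here; adjust to your deployment.
BASE_URL = "http://localhost:30000"


def build_memory_request(
    action: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST request for the release/resume endpoints.

    Hypothetical helper. No `tags` field is sent because the diffusion
    endpoints in this PR accept no request parameters.
    """
    if action not in ("release", "resume"):
        raise ValueError(f"unknown action: {action}")
    return urllib.request.Request(
        url=f"{base_url}/{action}_memory_occupation",
        data=b"",  # empty body
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Usage against a live server (not executed here):
#   send(build_memory_request("release"))
#   send(build_memory_request("resume"))
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx replies, so a caller would need to catch that exception to read the JSON body of a failed release or resume call.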

python/sglang/multimodal_gen/runtime/entrypoints/openai/utils.py

@@ -9,7 +9,7 @@
     from typing import Any, Generator, List, Optional, Union
     import httpx
-    from fastapi import UploadFile
+    from fastapi import HTTPException, UploadFile
     from sglang.multimodal_gen.configs.sample.sampling_params import (
         DataType,
@@ -24,7 +24,10 @@
         format_lora_message,
         save_outputs,
     )
-    from sglang.multimodal_gen.runtime.pipelines_core.schedule_batch import OutputBatch
+    from sglang.multimodal_gen.runtime.pipelines_core.schedule_batch import (
+        SLEEPING_ERROR_PREFIX,
+        OutputBatch,
+    )
     from sglang.multimodal_gen.runtime.scheduler_client import AsyncSchedulerClient
     from sglang.multimodal_gen.runtime.server_args import get_global_server_args
     from sglang.multimodal_gen.runtime.utils.logging_utils import (
@@ -257,14 +260,23 @@ async def process_generation_batch(
         batch,
     ) -> tuple[list[str], OutputBatch]:
         total_start_time = time.perf_counter()
         with log_generation_timer(logger, batch.prompt):
             result = await scheduler_client.forward([batch])
             if result.output is None and result.output_file_paths is None:
                 error_msg = result.error or "Unknown error"
-                raise RuntimeError(
-                    f"Model generation returned no output. Error from scheduler: {error_msg}"
-                )
+                if str(error_msg).startswith(SLEEPING_ERROR_PREFIX):
+                    raise HTTPException(
+                        status_code=400,
+                        detail={
+                            "message": error_msg,
+                        },
+                    )
+                else:
+                    raise RuntimeError(
+                        f"Model generation returned no output. Error from scheduler: {error_msg}"
+                    )
             if result.output_file_paths:
                 save_file_path_list = result.output_file_paths
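The hunk above routes one class of scheduler error to a client-facing HTTP 400 and leaves every other empty result as a `RuntimeError`. The dependency-free sketch below isolates that branching so it can be exercised on its own. The prefix value is a placeholder (the real `SLEEPING_ERROR_PREFIX` is defined in `schedule_batch.py` and is not shown in this diff), and `ClientError` is a hypothetical stand-in for `fastapi.HTTPException`.

```python
from typing import Optional

# Placeholder value: the real constant is not visible in this diff.
SLEEPING_ERROR_PREFIX = "[sleeping]"


class ClientError(Exception):
    """Hypothetical stand-in for fastapi.HTTPException."""

    def __init__(self, status_code: int, detail: dict):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def raise_for_empty_result(error: Optional[str]) -> None:
    """Mirror the branching in the diff for a result with no output."""
    error_msg = error or "Unknown error"
    if str(error_msg).startswith(SLEEPING_ERROR_PREFIX):
        # The engine is asleep: report a client error, not a server crash.
        raise ClientError(status_code=400, detail={"message": error_msg})
    raise RuntimeError(
        f"Model generation returned no output. Error from scheduler: {error_msg}"
    )
```

Matching on a string prefix keeps the scheduler and the HTTP layer loosely coupled, at the cost of relying on the error text staying stable; a dedicated error code on the result object would be one alternative.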

python/sglang/multimodal_gen/runtime/entrypoints/post_training/io_struct.py

@@ -1,4 +1,8 @@
-    """Request/response data structures for post-training APIs."""
+    """Request/response data structures for post-training APIs.
+    TODO(Shuwen, Chenyang): Split RL-oriented request types and serving-oriented
+    request types into dedicated files.
+    """
     from dataclasses import dataclass
@@ -17,3 +21,19 @@ class GetWeightsChecksumReqInput:
         """Compute SHA-256 checksum of loaded module weights for verification."""
         module_names: list[str] | None = None
+    @dataclass
+    class ReleaseMemoryOccupationReqInput:
+        """Request to release (sleep) GPU memory occupation for the diffusion engine."""
+        # TODO (Kun, Chenyang): We shall have rather dedicated
+        # control of the Diffusion model's memory occupation.
+        pass
+    @dataclass
+    class ResumeMemoryOccupationReqInput:
+        """Request to resume (wake) GPU memory occupation for the diffusion engine."""
+        pass
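Both new request types are empty dataclasses, so the type itself is the whole message. The toy sketch below shows one way a receiver could dispatch on such marker types. `ToyEngine`, its `sleeping` flag, and its failure messages are hypothetical and do not describe the actual scheduler in this PR; only the two dataclass names and the `success` key come from the diff.

```python
from dataclasses import dataclass


@dataclass
class ReleaseMemoryOccupationReqInput:
    """Marker request: sleep the engine (no fields, as in the diff)."""


@dataclass
class ResumeMemoryOccupationReqInput:
    """Marker request: wake the engine (no fields, as in the diff)."""


class ToyEngine:
    """Hypothetical stand-in for the scheduler; tracks only a sleep flag."""

    def __init__(self) -> None:
        self.sleeping = False

    def handle(self, req: object) -> dict:
        # Dispatch purely on the request type, since it carries no data.
        if isinstance(req, ReleaseMemoryOccupationReqInput):
            if self.sleeping:
                return {"success": False, "message": "already sleeping"}
            self.sleeping = True
            return {"success": True}
        if isinstance(req, ResumeMemoryOccupationReqInput):
            if not self.sleeping:
                return {"success": False, "message": "not sleeping"}
            self.sleeping = False
            return {"success": True}
        raise TypeError(f"unsupported request: {type(req).__name__}")
```

Keeping the types as dataclasses, even while empty, leaves room to add fields later (for example finer-grained memory control, as the TODO in the diff suggests) without changing the dispatch code.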

python/sglang/multimodal_gen/runtime/entrypoints/post_training/weights_api.py

@@ -5,12 +5,17 @@
     from sglang.multimodal_gen.runtime.entrypoints.post_training.io_struct import (
         GetWeightsChecksumReqInput,
+        ReleaseMemoryOccupationReqInput,
+        ResumeMemoryOccupationReqInput,
         UpdateWeightFromDiskReqInput,
     )
     from sglang.multimodal_gen.runtime.scheduler_client import async_scheduler_client
+    from sglang.multimodal_gen.runtime.utils.logging_utils import init_logger
     router = APIRouter()
+    logger = init_logger(__name__)
     @router.post("/update_weights_from_disk")
     async def update_weights_from_disk(request: Request):
@@ -60,3 +65,41 @@ async def get_weights_checksum(request: Request):
             return ORJSONResponse({"error": str(e)}, status_code=500)
         return ORJSONResponse(response.output, status_code=200)
+    async def _handle_memory_occupation_request(
+        req: ReleaseMemoryOccupationReqInput | ResumeMemoryOccupationReqInput,
+    ):
+        """Handle memory sleep/wake requests forwarded to scheduler."""
+        try:
+            response = await async_scheduler_client.forward(req)
+        except Exception as e:
+            logger.exception(f"scheduler_client.forward failed for {type(req).__name__}")
+            return ORJSONResponse({"success": False, "message": str(e)}, status_code=500)
+        payload = response.output if isinstance(response.output, dict) else None
+        if not isinstance(payload, dict) or "success" not in payload:
+            logger.error(f"missing success in scheduler output: {response.output}")
+            return ORJSONResponse(
+                {
+                    "success": False,
+                    "message": f"Missing 'success' field in scheduler response: {response.output}",
+                },
+                status_code=500,
+            )
+        success = bool(payload["success"])
+        return ORJSONResponse(payload, status_code=200 if success else 400)
+    @router.post("/release_memory_occupation")
+    async def release_memory_occupation():
+        """Release GPU memory occupation (sleep the engine)."""
+        return await _handle_memory_occupation_request(ReleaseMemoryOccupationReqInput())
+    @router.post("/resume_memory_occupation")
+    async def resume_memory_occupation():
+        """Resume GPU memory occupation (wake the engine)."""
+        return await _handle_memory_occupation_request(ResumeMemoryOccupationReqInput())
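The handler above maps the scheduler's output to one of three HTTP outcomes: 200 when `success` is truthy, 400 when it is falsy, and 500 when the output is not a dict or lacks a `success` key. The pure function below restates that mapping without FastAPI so it can be checked in isolation. `to_http_reply` is a hypothetical name, and the exception path around `forward` (also a 500 in the diff) is left out.

```python
from typing import Any


def to_http_reply(output: Any) -> tuple[dict, int]:
    """Return (JSON body, status code) for a scheduler output.

    A sketch of the validation in `_handle_memory_occupation_request`;
    the transport-failure branch is not modelled here.
    """
    payload = output if isinstance(output, dict) else None
    if payload is None or "success" not in payload:
        # Malformed scheduler reply: treat as a server-side error.
        return (
            {
                "success": False,
                "message": f"Missing 'success' field in scheduler response: {output}",
            },
            500,
        )
    # Well-formed reply: a failed operation is reported as a client error.
    return payload, 200 if bool(payload["success"]) else 400
```

For example, `to_http_reply({"success": True})` yields status 200, while `to_http_reply(None)` yields status 500 with an explanatory message.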

mickqian Mar 3, 2026


zhaochenyang20 Mar 3, 2026


mickqian Mar 3, 2026


zhaochenyang20 Mar 3, 2026
