77 commits
76634da
support Markdown/Notebook-Friendly Documentation Export for Downstrea…
klhhhhh Feb 2, 2026
411c310
run make markdown to generate markdown file
klhhhhh Feb 2, 2026
0705b6d
add release and resume handle in scheduler
klhhhhh Feb 22, 2026
db6d4a8
implement release and resume memory in gpu worker
klhhhhh Feb 22, 2026
19dd060
update release and resume api in http server
klhhhhh Feb 22, 2026
3973602
sync docs update
klhhhhh Feb 22, 2026
035583c
pre-commit lint
klhhhhh Feb 22, 2026
42dbcc3
pre-commit lint
klhhhhh Feb 22, 2026
3c5809a
remove tags for diffusion models and update io_struct in post training
klhhhhh Feb 23, 2026
9f5e243
adjust tags in function call
klhhhhh Feb 23, 2026
6aa39f1
implement new wake sleep func directly use get updated module in pipe…
klhhhhh Feb 23, 2026
5aad1c0
run lint
klhhhhh Feb 23, 2026
f9799ca
Implement new wake and sleep also add sanitize moving part for modules
klhhhhh Feb 24, 2026
0dd389b
refactor weight api
klhhhhh Feb 24, 2026
fa51ead
Retrun correct status code call generation when sleeping
klhhhhh Feb 24, 2026
efffb49
update comment
klhhhhh Feb 24, 2026
34b4a08
update lint
klhhhhh Feb 24, 2026
006f9b2
refactor all the code
klhhhhh Feb 24, 2026
a527dd2
refactor all the code
klhhhhh Feb 24, 2026
3d4a471
refactor gpu_worker and utils in openai entrypoint
klhhhhh Feb 24, 2026
5dc1faf
fix bugs in utils
klhhhhh Feb 24, 2026
9ea9b89
fix bugs in utils
klhhhhh Feb 24, 2026
c141e6a
fix bugs in weight api
klhhhhh Feb 24, 2026
7e7e373
fix comment in wake func
klhhhhh Feb 24, 2026
3b05c1c
refactor wake func
klhhhhh Feb 24, 2026
e5b1128
run lint
klhhhhh Feb 24, 2026
2cd6504
first version of sleep wake up test
klhhhhh Feb 25, 2026
720e69e
adjust log and adjust test model
klhhhhh Feb 25, 2026
f17cf9f
test pass
klhhhhh Feb 25, 2026
0ad9e82
change to logger
klhhhhh Feb 25, 2026
27550c0
add gpu mem check
klhhhhh Feb 25, 2026
2a44ca9
add test wake sleep in ci
klhhhhh Feb 25, 2026
4300ab9
run lint
klhhhhh Feb 25, 2026
c446886
adds pytest entry
zhaochenyang20 Feb 25, 2026
746c86b
fix race condition
zhaochenyang20 Feb 25, 2026
de41546
refactor process generation batch
klhhhhh Feb 25, 2026
52b881a
fix bugs:access output using details
klhhhhh Feb 26, 2026
2f12de0
change test name
klhhhhh Feb 26, 2026
4d63b7a
avoid worker exectution failed and keep consistent self._sleeping
klhhhhh Feb 26, 2026
86d02c1
refactor gpu worker
klhhhhh Feb 26, 2026
cc22119
refactor test
klhhhhh Feb 26, 2026
05c7123
add roll out function
klhhhhh Feb 26, 2026
5a64640
update scheduler
klhhhhh Feb 26, 2026
48c9dcb
refactor weight api
klhhhhh Feb 26, 2026
281374e
Merge branch 'main' into main
zhaochenyang20 Feb 26, 2026
3f63339
[refactor] move modules and rollback
zhaochenyang20 Feb 26, 2026
adade66
Merge branch 'main' of github.com:klhhhhh/sglang
zhaochenyang20 Feb 26, 2026
e2b853a
add Mook, https://github.com/Godmook
zhaochenyang20 Feb 26, 2026
a8572df
refactor: unit test
zhaochenyang20 Feb 26, 2026
8f6bf08
refactor: unit test, do not be verbose
zhaochenyang20 Feb 26, 2026
9804b6a
refactor: unit test, assert generate correct
zhaochenyang20 Feb 26, 2026
5093c28
refactor: _get_module_device
zhaochenyang20 Feb 26, 2026
431bd48
refactor: logging control
zhaochenyang20 Feb 26, 2026
9d3bb37
refactor: _handle_memory_occupation
zhaochenyang20 Feb 26, 2026
8cf3839
refactor: resume_memory_occupation
zhaochenyang20 Feb 26, 2026
4298262
update docs
zhaochenyang20 Feb 26, 2026
2170df4
change docs string
zhaochenyang20 Feb 26, 2026
6df31ed
fix rocm ci
zhaochenyang20 Feb 27, 2026
4a0ed98
minor refactor
alphabetc1 Feb 27, 2026
5ba4860
Merge pull request #2 from alphabetc1/feat/wakeup_sleep
klhhhhh Feb 27, 2026
b19dc4e
Merge branch 'main' into main
zhaochenyang20 Feb 27, 2026
6e581ca
Merge branch 'main' of github.com:klhhhhh/sglang
zhaochenyang20 Feb 27, 2026
13506c5
add todo for rollback expection
zhaochenyang20 Feb 27, 2026
06e6986
Adds ratish as the author
zhaochenyang20 Feb 27, 2026
aea49db
adds tests for idempotency
zhaochenyang20 Feb 27, 2026
3a85460
Merge branch 'main' into main
zhaochenyang20 Mar 2, 2026
1d780a6
self fixing comments
zhaochenyang20 Mar 3, 2026
19ee11d
refactor: pass the request instance instead of class type
alphabetc1 Mar 3, 2026
3102e77
move RL related tests to post-training dir
zhaochenyang20 Mar 3, 2026
2d6a570
fix untoched unit test in CI
zhaochenyang20 Mar 3, 2026
7d60cc5
extract _get_module_device into utility helper & add TODO for io_stru…
zhaochenyang20 Mar 3, 2026
ffdd714
Merge branch 'main' into main
zhaochenyang20 Mar 4, 2026
a775d99
Merge branch 'main' of github.com:klhhhhh/sglang
zhaochenyang20 Mar 4, 2026
a21ac27
fix lint
zhaochenyang20 Mar 4, 2026
c109f84
remove unit redunct tests
zhaochenyang20 Mar 4, 2026
5f90c43
move test
alphabetc1 Mar 6, 2026
f9de7e5
Merge branch 'main' into kun-main
alphabetc1 Mar 6, 2026
3 changes: 3 additions & 0 deletions docs/advanced_features/sglang_for_rl.md
@@ -46,6 +46,7 @@ Enable memory saver support when launching the server:

- This call asserts there are no ongoing requests. Ensure the engine is idle before calling it.
- If `kv_cache` is released, SGLang flushes cache; subsequent requests will rebuild KV cache as needed.
- SGLang Diffusion also supports releasing memory occupation, but its request body has no `tags` field.

### Resume Memory

@@ -58,6 +59,8 @@ Enable memory saver support when launching the server:
| `tags` | Which memory regions to resume. If omitted, all are resumed. | `None` | Type: list[str], values: `kv_cache`, `weights` |
<!-- python/sglang/srt/managers/io_struct.py#L1393 currently only supports `kv_cache`, `weights` -->

SGLang Diffusion also supports resuming memory occupation, but its request body has no `tags` field.

## Open-To-Use Refit Functionality

After training completes each step, rollout engines must be refit with new weights. SGLang supports three refit strategies so you can match your infrastructure style (co-located vs disaggregated) and scaling needs. Each strategy maps to a concrete API with clear request schemas. For a deeper dive into SGLang's weight update utilities, see [RL System Deep Thinking: Weight Update Mechanisms](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/sys-design/readme-1-EN.md).
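
The release and resume calls that this doc change describes for SGLang Diffusion can be exercised with a small client. The sketch below is illustrative only: the endpoint paths come from the router added later in this PR, while `BASE_URL`, the port, and the helper names `build_memory_request` and `send` are assumptions made for the example. No `tags` field is sent, which matches the diffusion endpoints.

```python
import json
import urllib.error
import urllib.request

# Assumption: address of a running diffusion server; adjust to your deployment.
BASE_URL = "http://localhost:30000"


def build_memory_request(action: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /release_memory_occupation or /resume_memory_occupation."""
    if action not in ("release", "resume"):
        raise ValueError(f"unknown action: {action!r}")
    return urllib.request.Request(
        url=f"{base_url}/{action}_memory_occupation",
        data=b"",  # empty body: the diffusion endpoints take no `tags` field
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> tuple[int, dict]:
    """Send the request and return (HTTP status, decoded JSON body)."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status, json.loads(resp.read() or b"{}")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a JSON body with "success" and "message".
        return err.code, json.loads(err.read() or b"{}")
```

With a server running, `send(build_memory_request("release"))` followed by `send(build_memory_request("resume"))` would put the engine to sleep and wake it again. Generation requests sent in between are expected to be rejected with status 400.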
@@ -9,7 +9,7 @@
from typing import Any, Generator, List, Optional, Union

import httpx
-from fastapi import UploadFile
+from fastapi import HTTPException, UploadFile

from sglang.multimodal_gen.configs.sample.sampling_params import (
DataType,
@@ -24,7 +24,10 @@
format_lora_message,
save_outputs,
)
-from sglang.multimodal_gen.runtime.pipelines_core.schedule_batch import OutputBatch
+from sglang.multimodal_gen.runtime.pipelines_core.schedule_batch import (
+    SLEEPING_ERROR_PREFIX,
+    OutputBatch,
+)
from sglang.multimodal_gen.runtime.scheduler_client import AsyncSchedulerClient
from sglang.multimodal_gen.runtime.server_args import get_global_server_args
from sglang.multimodal_gen.runtime.utils.logging_utils import (
@@ -257,14 +260,23 @@ async def process_generation_batch(
     batch,
 ) -> tuple[list[str], OutputBatch]:
     total_start_time = time.perf_counter()
+
     with log_generation_timer(logger, batch.prompt):
         result = await scheduler_client.forward([batch])

     if result.output is None and result.output_file_paths is None:
         error_msg = result.error or "Unknown error"
-        raise RuntimeError(
-            f"Model generation returned no output. Error from scheduler: {error_msg}"
-        )
+        if str(error_msg).startswith(SLEEPING_ERROR_PREFIX):
+            raise HTTPException(
+                status_code=400,
+                detail={
+                    "message": error_msg,
+                },
+            )
+        else:
+            raise RuntimeError(
+                f"Model generation returned no output. Error from scheduler: {error_msg}"
+            )

     if result.output_file_paths:
         save_file_path_list = result.output_file_paths
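
The error routing in this hunk can be read in isolation as: an empty result whose error starts with the sleeping prefix becomes a client error (400), and anything else stays a `RuntimeError`. The sketch below mirrors that branch without FastAPI. The prefix value, the `ClientError` stand-in for `HTTPException`, and the function name `raise_for_empty_output` are all hypothetical, since the diff does not show the real value of `SLEEPING_ERROR_PREFIX`.

```python
from typing import Optional

# Hypothetical value: the real constant is defined in schedule_batch.py
# and its contents are not shown in this diff.
SLEEPING_ERROR_PREFIX = "[sleeping]"


class ClientError(Exception):
    """Stand-in for fastapi.HTTPException so the sketch has no dependencies."""

    def __init__(self, status_code: int, detail: dict) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def raise_for_empty_output(error: Optional[str]) -> None:
    """Raise the exception that matches the scheduler error for an empty result."""
    error_msg = error or "Unknown error"
    if str(error_msg).startswith(SLEEPING_ERROR_PREFIX):
        # The engine is asleep: the caller must resume memory first, so report 400.
        raise ClientError(status_code=400, detail={"message": error_msg})
    # Any other empty result is treated as an internal failure.
    raise RuntimeError(
        f"Model generation returned no output. Error from scheduler: {error_msg}"
    )
```

Matching on a string prefix keeps the scheduler response schema unchanged, at the cost of coupling the HTTP layer to the exact wording of the scheduler's error message.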
@@ -1,4 +1,8 @@
-"""Request/response data structures for post-training APIs."""
+"""Request/response data structures for post-training APIs.
+
+TODO(Shuwen, Chenyang): Split RL-oriented request types and serving-oriented
+request types into dedicated files.
+"""

from dataclasses import dataclass

@@ -17,3 +21,19 @@ class GetWeightsChecksumReqInput:
"""Compute SHA-256 checksum of loaded module weights for verification."""

module_names: list[str] | None = None

Collaborator

We need to separate RL-related types from serving ones later.

Collaborator

What do you mean by this?

Collaborator

I mean we should use dedicated files for type definitions.

Collaborator

All the types in this file (UpdateWeightFromDiskReqInput, ReleaseMemoryOccupationReqInput, ResumeMemoryOccupationReqInput) are RL/post-training request types. No serving types are mixed in, so there is nothing to separate. UpdateWeightFromDisk is also an RL-oriented operation, not a serving one.


@dataclass
class ReleaseMemoryOccupationReqInput:
"""Request to release (sleep) GPU memory occupation for the diffusion engine."""

# TODO (Kun, Chenyang): We should have more dedicated
# control over the diffusion model's memory occupation.
pass


@dataclass
class ResumeMemoryOccupationReqInput:
"""Request to resume (wake) GPU memory occupation for the diffusion engine."""

pass
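
Both request types are field-less dataclasses, so the type of the instance is the whole message. A receiver can dispatch on it with `isinstance`, as in the toy sketch below. `ToyWorker` and the `sleeping` key in its reply are hypothetical and only illustrate the idea. The real scheduler and GPU worker handlers are not part of the hunks shown here.

```python
from dataclasses import dataclass


@dataclass
class ReleaseMemoryOccupationReqInput:
    """Field-less request: the type alone says 'sleep'."""


@dataclass
class ResumeMemoryOccupationReqInput:
    """Field-less request: the type alone says 'wake'."""


class ToyWorker:
    """Hypothetical stand-in for the GPU worker; it only tracks a sleeping flag."""

    def __init__(self) -> None:
        self.sleeping = False

    def handle(self, req: object) -> dict:
        """Dispatch on the request type and report the result as a dict."""
        if isinstance(req, ReleaseMemoryOccupationReqInput):
            self.sleeping = True
        elif isinstance(req, ResumeMemoryOccupationReqInput):
            self.sleeping = False
        else:
            return {
                "success": False,
                "message": f"unsupported request: {type(req).__name__}",
            }
        return {"success": True, "sleeping": self.sleeping}
```

Keeping the types empty for now leaves room to add fields later, such as the `tags` selector the LLM engine already accepts, without changing how requests are routed.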
@@ -5,12 +5,17 @@

from sglang.multimodal_gen.runtime.entrypoints.post_training.io_struct import (
GetWeightsChecksumReqInput,
ReleaseMemoryOccupationReqInput,
ResumeMemoryOccupationReqInput,
UpdateWeightFromDiskReqInput,
)
from sglang.multimodal_gen.runtime.scheduler_client import async_scheduler_client
from sglang.multimodal_gen.runtime.utils.logging_utils import init_logger

router = APIRouter()

logger = init_logger(__name__)


@router.post("/update_weights_from_disk")
async def update_weights_from_disk(request: Request):
@@ -60,3 +65,41 @@ async def get_weights_checksum(request: Request):
return ORJSONResponse({"error": str(e)}, status_code=500)

return ORJSONResponse(response.output, status_code=200)


async def _handle_memory_occupation_request(
req: ReleaseMemoryOccupationReqInput | ResumeMemoryOccupationReqInput,
):
"""Handle memory sleep/wake requests forwarded to scheduler."""
try:
response = await async_scheduler_client.forward(req)
except Exception as e:
logger.exception(f"scheduler_client.forward failed for {type(req).__name__}")
return ORJSONResponse({"success": False, "message": str(e)}, status_code=500)

payload = response.output if isinstance(response.output, dict) else None

if not isinstance(payload, dict) or "success" not in payload:
logger.error(f"missing success in scheduler output: {response.output}")
return ORJSONResponse(
{
"success": False,
"message": f"Missing 'success' field in scheduler response: {response.output}",
},
status_code=500,
)

success = bool(payload["success"])
return ORJSONResponse(payload, status_code=200 if success else 400)


@router.post("/release_memory_occupation")
async def release_memory_occupation():
"""Release GPU memory occupation (sleep the engine)."""
return await _handle_memory_occupation_request(ReleaseMemoryOccupationReqInput())


@router.post("/resume_memory_occupation")
async def resume_memory_occupation():
"""Resume GPU memory occupation (wake the engine)."""
return await _handle_memory_occupation_request(ResumeMemoryOccupationReqInput())
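
The status-code rule in `_handle_memory_occupation_request` can be summarized as a pure function from the scheduler output to a body and an HTTP status. The sketch below restates that mapping for the cases after a successful `forward` call. The function name `memory_response` is an assumption made for the example, and the 500 branch for exceptions raised by `forward` itself is left out.

```python
from typing import Any


def memory_response(output: Any) -> tuple[dict, int]:
    """Map a scheduler output to (JSON body, HTTP status)."""
    if not isinstance(output, dict) or "success" not in output:
        # Malformed scheduler reply: surface it as a server-side error.
        body = {
            "success": False,
            "message": f"Missing 'success' field in scheduler response: {output}",
        }
        return body, 500
    # Well-formed reply: 200 when the operation succeeded, 400 when it was refused.
    return output, 200 if bool(output["success"]) else 400
```

Under this rule a refused sleep or wake request reaches the client as a 400 with the scheduler's own payload, while 500 is reserved for transport failures and malformed replies.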