[Serve] [SGLang] POC PD disaggregation#63741
Code Review
This pull request introduces SGLang Prefill-Decode (PD) disaggregated LLM serving by adding SGLangPDPrefillServer and SGLangPDDecodeServer deployments, and updating the SGLang engine to pass bootstrap coordination fields. Feedback on these changes highlights critical issues: first, setting attributes directly on Pydantic v2 models will raise runtime errors, so object.__setattr__ should be used instead; second, the prefill generator must be consumed in a background task to prevent premature garbage collection and engine hangs; third, asyncio needs to be imported to support this background task; and finally, the new servers should be added to the __all__ list in deployment.py to be properly exposed in the public API.
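To illustrate the second and third points of the summary, here is a minimal sketch of consuming an async generator in a background task while holding a strong reference to that task. It uses only `asyncio`; `fake_prefill_stream`, `consume_in_background`, and `_background_tasks` are hypothetical names for illustration and are not taken from the PR.

```python
import asyncio
from typing import AsyncIterator, List, Set

# The event loop keeps only weak references to tasks, so hold strong ones here.
_background_tasks: Set[asyncio.Task] = set()


async def fake_prefill_stream(n: int) -> AsyncIterator[int]:
    """Hypothetical stand-in for the prefill engine's response generator."""
    for i in range(n):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        yield i


async def _drain(stream: AsyncIterator[int], sink: List[int]) -> None:
    """Consume every item so the producer can run to completion."""
    async for item in stream:
        sink.append(item)


def consume_in_background(stream: AsyncIterator[int], sink: List[int]) -> asyncio.Task:
    """Start draining the stream without blocking the caller."""
    task = asyncio.create_task(_drain(stream, sink))
    _background_tasks.add(task)
    # Drop the reference once the task finishes.
    task.add_done_callback(_background_tasks.discard)
    return task


async def demo() -> List[int]:
    sink: List[int] = []
    task = consume_in_background(fake_prefill_stream(3), sink)
    await task  # awaited here only so the demo can inspect the result
    return sink
```

In a server the caller would normally not await the task; the module-level set is what keeps it alive until the stream is exhausted.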
can you add a target in
Added the target in release_tests.yaml. Thanks
jeffreywang88
left a comment
Left some design questions for you to consider while revising the RFC.
| to the right place. The decode server generates the bootstrap_room upfront and
| dispatches both sides simultaneously; it does not wait for a prefill response
| before starting decode.
Looks similar to the parallel handoff pattern (https://github.com/ray-project/ray/pull/63950/changes#diff-ccc2365e3ea6c5f4fd309223ecbaaa15d44deb1d510f0acf985604060a7f1747R56). Could you assess the feasibility of leveraging the established concurrent handoff mechanism?
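For reference while comparing the two mechanisms, the simultaneous dispatch described in the quoted docstring could look roughly like the sketch below. `fake_prefill` and `fake_decode` are hypothetical stand-ins for the two deployment calls, and the 63-bit room width is an arbitrary choice made for this example.

```python
import asyncio
import secrets
from typing import Dict, List, Tuple


async def fake_prefill(request: Dict, log: List[str]) -> None:
    """Hypothetical prefill call; only records when it starts and ends."""
    log.append("prefill start")
    await asyncio.sleep(0.01)
    log.append("prefill done")


async def fake_decode(request: Dict, log: List[str]) -> str:
    """Hypothetical decode call; starts without waiting for prefill."""
    log.append("decode start")
    await asyncio.sleep(0.02)
    log.append("decode done")
    return f"response for room {request['bootstrap_room']}"


async def dispatch(prompt: str) -> Tuple[str, List[str]]:
    log: List[str] = []
    # The room id is generated upfront so both sides can be given it at once.
    request = {"prompt": prompt, "bootstrap_room": secrets.randbits(63)}
    # gather() schedules both coroutines before either one finishes.
    _, text = await asyncio.gather(
        fake_prefill(request, log),
        fake_decode(request, log),
    )
    return text, log
```

The recorded log shows decode starting before prefill completes, which is the property that distinguishes this flow from a sequential handoff.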
| # LLMServer.get_deployment_options calls get_engine_config(), which
| # unconditionally imports vLLM. The SGLang byod image uninstalls vLLM,
We will need to somehow decouple vLLM from the common paths. It might require a refactor. Could you explore some options?
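One option worth exploring is deferring engine imports until an engine is actually selected. The sketch below shows that pattern with a small registry built on `importlib`; `register_engine`, `load_engine_module`, and the registered names are hypothetical and do not reflect how the Ray Serve LLM code is currently organized.

```python
import importlib
from types import ModuleType
from typing import Callable, Dict

# Maps an engine name to a loader that imports its module on demand.
_ENGINE_LOADERS: Dict[str, Callable[[], ModuleType]] = {}


def register_engine(name: str, module_path: str) -> None:
    """Record where an engine's module lives without importing it."""
    _ENGINE_LOADERS[name] = lambda: importlib.import_module(module_path)


def load_engine_module(name: str) -> ModuleType:
    """Import the module for the requested engine, and only that one."""
    try:
        loader = _ENGINE_LOADERS[name]
    except KeyError:
        raise ValueError(f"unknown engine: {name!r}") from None
    return loader()


# Registration never imports anything, so an uninstalled engine is harmless
# until someone asks for it. "json" stands in for an installed engine package.
register_engine("stdlib_demo", "json")
register_engine("absent_demo", "surely_not_an_installed_module_xyz")
```

With this shape, a common path can call `load_engine_module` for the configured engine only, so an image without vLLM fails only if vLLM is explicitly requested.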
| carrying the same prefill bootstrap host/port/room; the decode
| KVReceiver connects to that prefill bootstrap server and blocks
| internally waiting for the KV cache to arrive.
| 5. Streams the decode response back to the client.
This is exactly what we need, nice!
| For each chat/completions request it:
| 1. Reads the PREFILL node's bootstrap_host and bootstrap_port, fetched
| from the prefill deployment at init (the bootstrap server lives there).
| 2. Generates a unique bootstrap_room integer.
In the design doc, it'd be great to clarify what bootstrap_room does and where it lives.
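As a starting point for that clarification, the sketch below shows one way to read the quoted steps: the room is a per-request integer that both the prefill and decode requests carry, alongside the prefill node's host and port. The field names come from the quoted docstring; `BootstrapInfo`, `build_pd_requests`, the 63-bit width, and the example address are assumptions made for illustration.

```python
import secrets
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class BootstrapInfo:
    """Address of the prefill node's bootstrap server, fetched once at init."""
    host: str
    port: int


def new_bootstrap_room() -> int:
    """Random per-request id; the 63-bit width is an assumption."""
    return secrets.randbits(63)


def build_pd_requests(
    base_request: Dict[str, Any], info: BootstrapInfo
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (prefill_request, decode_request) sharing one bootstrap room."""
    shared = {
        "bootstrap_host": info.host,
        "bootstrap_port": info.port,
        "bootstrap_room": new_bootstrap_room(),
    }
    # Both sides get identical coordination fields so they can find each other.
    return {**base_request, **shared}, {**base_request, **shared}
```

A call such as `build_pd_requests({"prompt": "hi"}, BootstrapInfo("10.0.0.1", 8998))` yields two dictionaries whose `bootstrap_room` values match.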
| # TODO: Users currently need to set disaggregation_mode manually in engine_kwargs.
| # The builder should set this automatically since it already knows which
| # config is prefill and which is decode.
We should emulate ray serve LLM's PD API for vLLM here. The user API should be similar.
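A rough sketch of what the TODO describes: the builder fills in `disaggregation_mode` for each side so users do not have to. `with_disaggregation_mode` is a hypothetical helper, and the accepted values `"prefill"` and `"decode"` as well as the `model_path` key are assumptions for this example.

```python
from copy import deepcopy
from typing import Any, Dict

_VALID_MODES = ("prefill", "decode")  # assumed values for this sketch


def with_disaggregation_mode(engine_kwargs: Dict[str, Any], mode: str) -> Dict[str, Any]:
    """Return a copy of engine_kwargs with disaggregation_mode set to mode."""
    if mode not in _VALID_MODES:
        raise ValueError(f"mode must be one of {_VALID_MODES}, got {mode!r}")
    existing = engine_kwargs.get("disaggregation_mode")
    # Fail loudly instead of silently overriding a conflicting user setting.
    if existing is not None and existing != mode:
        raise ValueError(
            f"engine_kwargs sets disaggregation_mode={existing!r}, "
            f"but this config is used as the {mode} side"
        )
    updated = deepcopy(engine_kwargs)  # leave the caller's dict untouched
    updated["disaggregation_mode"] = mode
    return updated
```

The builder would call this once with the prefill config and once with the decode config before constructing the deployments.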
| Unlike the vLLM flow, we do not wait for a prefill response before
| starting decode; the bootstrap_room is established upfront and both
| sides coordinate directly via SGLang's bootstrap server.
Is this bootstrap server strictly required?
| pass
|
| @PublicAPI(stability="beta")
We should start from alpha.
| - UCX_TLS=all
| - UCX_NET_DEVICES=all
Could you help me understand why we need these?
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
83f2eb6 to
1205202
Not ready for review yet; please ignore for now.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 0c1a7dd.

Description
A POC PR for my proposal (#63257) on Ray Serve SGLang PD disaggregation support.
Related issues - #62792 #63257