[Serve] [SGLang] POC PD disaggregation by limarkdcunha · Pull Request #63741 · ray-project/ray

limarkdcunha · 2026-05-30T03:23:32Z

Description

A POC PR for my proposal (#63257) to add Ray Serve SGLang PD disaggregation support.

Related issues - #62792 #63257

gemini-code-assist

Code Review

This pull request introduces SGLang Prefill-Decode (PD) disaggregated LLM serving by adding SGLangPDPrefillServer and SGLangPDDecodeServer deployments, and updating the SGLang engine to pass bootstrap coordination fields. Feedback on these changes highlights critical issues: first, setting attributes directly on Pydantic v2 models will raise runtime errors, so object.__setattr__ should be used instead; second, the prefill generator must be consumed in a background task to prevent premature garbage collection and engine hangs; third, asyncio needs to be imported to support this background task; and finally, the new servers should be added to the __all__ list in deployment.py to be properly exposed in the public API.
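To make the second point in the review concrete, here is a minimal sketch of draining an async generator in a background task while holding a strong reference to that task. It is not the PR's code: `fake_prefill_stream`, `consume_in_background`, and `_background_tasks` are hypothetical names, and the stand-in generator only yields integers.

```python
import asyncio
from typing import AsyncIterator, List, Set

# Strong references, so a pending task is not garbage-collected mid-flight.
_background_tasks: Set[asyncio.Task] = set()


async def fake_prefill_stream(n: int) -> AsyncIterator[int]:
    """Hypothetical stand-in for the prefill engine's response generator."""
    for i in range(n):
        await asyncio.sleep(0)
        yield i


async def _drain(stream: AsyncIterator[int], sink: List[int]) -> None:
    # Iterate to exhaustion so the producer side is never left waiting.
    async for chunk in stream:
        sink.append(chunk)


def consume_in_background(stream: AsyncIterator[int], sink: List[int]) -> asyncio.Task:
    """Start draining `stream` without blocking the caller."""
    task = asyncio.create_task(_drain(stream, sink))
    _background_tasks.add(task)
    # Drop the reference once the task has finished.
    task.add_done_callback(_background_tasks.discard)
    return task


async def demo() -> List[int]:
    sink: List[int] = []
    task = consume_in_background(fake_prefill_stream(3), sink)
    await task  # awaited here only so the demo can return the result
    return sink
```

The set of task references follows the pattern recommended in the `asyncio.create_task` documentation, since the event loop itself keeps only weak references to tasks.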

jeffreywang88 · 2026-06-02T04:04:56Z

can you add a target in release_tests.yaml? example: https://github.com/ray-project/ray/blob/master/release/release_tests.yaml#L4643-L4661

limarkdcunha · 2026-06-04T00:59:30Z

Added the target in release_tests.yaml. Thanks

jeffreywang88

Left some design questions for you to consider while revising the RFC.

jeffreywang88 · 2026-06-24T18:00:06Z

+to the right place. The decode server generates the bootstrap_room upfront and
+dispatches both sides simultaneously — it does not wait for a prefill response
+before starting decode.


Looks similar to the parallel handoff pattern (https://github.com/ray-project/ray/pull/63950/changes#diff-ccc2365e3ea6c5f4fd309223ecbaaa15d44deb1d510f0acf985604060a7f1747R56). Could you assess the feasibility of leveraging the established concurrent handoff mechanism?
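For reference while comparing the two patterns, a minimal sketch of the simultaneous dispatch described in the quoted docstring, assuming the bootstrap fields are already attached to the request. `fake_prefill`, `fake_decode`, and `dispatch_both` are hypothetical stand-ins, not Ray Serve or SGLang APIs.

```python
import asyncio
from typing import Dict, List, Union

Request = Dict[str, Union[str, int]]


async def fake_prefill(request: Request) -> str:
    """Hypothetical prefill call; a real one would run the prompt through the engine."""
    await asyncio.sleep(0.01)
    return f"prefill:{request['bootstrap_room']}"


async def fake_decode(request: Request) -> str:
    """Hypothetical decode call; a real one would block until the KV cache arrives."""
    await asyncio.sleep(0.01)
    return f"decode:{request['bootstrap_room']}"


async def dispatch_both(request: Request) -> List[str]:
    # Both sides receive the same host/port/room and are started together;
    # decode is not gated on a prefill response.
    prefill_result, decode_result = await asyncio.gather(
        fake_prefill(request), fake_decode(request)
    )
    return [prefill_result, decode_result]
```

Usage: `asyncio.run(dispatch_both({"prompt": "hi", "bootstrap_host": "10.0.0.1", "bootstrap_port": 8998, "bootstrap_room": 42}))`, where the address and room values are made up for illustration.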

jeffreywang88 · 2026-06-24T18:01:25Z

+        # LLMServer.get_deployment_options calls get_engine_config(), which
+        # unconditionally imports vLLM. The SGLang byod image uninstalls vLLM,


We will need to somehow decouple vLLM from the common paths. It might require a refactor. Could you explore some options?
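One option to explore, sketched under assumptions: resolve the engine module lazily by name, so the common path never imports an engine that is not installed. `load_engine_module` and the default mapping are hypothetical and not part of the existing `LLMServer` code.

```python
import importlib
from types import ModuleType
from typing import Dict, Optional

# Hypothetical mapping from engine name to the top-level module it needs.
_DEFAULT_ENGINE_MODULES: Dict[str, str] = {"vllm": "vllm", "sglang": "sglang"}


def load_engine_module(
    engine: str, modules: Optional[Dict[str, str]] = None
) -> ModuleType:
    """Import only the module for the requested engine, at call time."""
    table = _DEFAULT_ENGINE_MODULES if modules is None else modules
    try:
        module_name = table[engine]
    except KeyError:
        raise ValueError(f"Unknown engine: {engine!r}") from None
    # Nothing is imported at module load, so an image without vLLM
    # can still import this file and request the SGLang engine.
    return importlib.import_module(module_name)
```

With this shape, a missing engine package surfaces as an `ImportError` only when that engine is actually requested.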

jeffreywang88 · 2026-06-24T18:01:59Z

+         carrying the same prefill bootstrap host/port/room — the decode
+         KVReceiver connects to that prefill bootstrap server and blocks
+         internally waiting for the KV cache to arrive.
+      5. Streams the decode response back to the client.


This is exactly what we need, nice!

jeffreywang88 · 2026-06-24T18:02:50Z

+    For each chat/completions request it:
+      1. Reads the PREFILL node's bootstrap_host and bootstrap_port, fetched
+         from the prefill deployment at init (the bootstrap server lives there).
+      2. Generates a unique bootstrap_room integer.


In the design doc, it'd be great to clarify what bootstrap_room does and where it lives.
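As a starting point for that clarification, a minimal sketch of steps 1 and 2 from the quoted docstring. It reads `bootstrap_room` as a per-request key that both sides carry alongside the prefill bootstrap address. The 63-bit width, `PrefillBootstrapAddress`, and `bootstrap_fields` are assumptions for illustration, not taken from the PR.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class PrefillBootstrapAddress:
    """Hypothetical holder for the address fetched from the prefill deployment at init."""

    host: str
    port: int


def new_bootstrap_room() -> int:
    # Assumption: a random non-negative 63-bit integer is unique enough per request.
    return secrets.randbits(63)


def bootstrap_fields(addr: PrefillBootstrapAddress) -> Dict[str, Union[str, int]]:
    """Build the coordination fields attached to both the prefill and decode requests."""
    return {
        "bootstrap_host": addr.host,
        "bootstrap_port": addr.port,
        "bootstrap_room": new_bootstrap_room(),
    }
```

The address is fixed per prefill deployment, while the room is generated fresh for every request.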

jeffreywang88 · 2026-06-24T18:03:56Z

+        # TODO: Users currently need to set disaggregation_mode manually in engine_kwargs.
+        # The builder should set this automatically since it already knows which
+        # config is prefill and which is decode.


We should emulate ray serve LLM's PD API for vLLM here. The user API should be similar.
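A possible shape for resolving the TODO, as a sketch only: the builder fills in `disaggregation_mode` for each role and rejects a conflicting user-supplied value. `with_disaggregation_mode` is a hypothetical helper, and the accepted values `"prefill"` and `"decode"` are assumed here.

```python
import copy
from typing import Any, Dict

_VALID_MODES = ("prefill", "decode")


def with_disaggregation_mode(engine_kwargs: Dict[str, Any], mode: str) -> Dict[str, Any]:
    """Return a copy of engine_kwargs with disaggregation_mode set for the given role."""
    if mode not in _VALID_MODES:
        raise ValueError(f"mode must be one of {_VALID_MODES}, got {mode!r}")
    merged = copy.deepcopy(engine_kwargs)  # leave the user's config untouched
    existing = merged.get("disaggregation_mode")
    if existing is not None and existing != mode:
        # Fail loudly rather than silently overriding an explicit user setting.
        raise ValueError(
            f"engine_kwargs sets disaggregation_mode={existing!r}, "
            f"but this config is used as the {mode} side"
        )
    merged["disaggregation_mode"] = mode
    return merged
```

The builder would call this once per config, for example `with_disaggregation_mode(prefill_kwargs, "prefill")`, so users no longer set the field by hand.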

jeffreywang88 · 2026-06-24T18:04:38Z

+
+        Unlike the vLLM flow, we do not wait for a prefill response before
+        starting decode — the bootstrap_room is established upfront and both
+        sides coordinate directly via SGLang's bootstrap server.


Is this bootstrap server strictly required?

jeffreywang88 · 2026-06-24T18:05:01Z

    pass


+@PublicAPI(stability="beta")


We should start from alpha.

jeffreywang88 · 2026-06-24T18:08:40Z

+        - UCX_TLS=all
+        - UCX_NET_DEVICES=all


Could you help me understand why we need these?

Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>

limarkdcunha · 2026-07-03T00:06:39Z

Not ready for review yet; please ignore for now.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 0c1a7dd.

limarkdcunha changed the title ~~[Serve] POC ready for PD dissaggregation~~ [Serve] [SGLang] POC PD disaggregation May 30, 2026

gemini-code-assist Bot reviewed May 30, 2026

jeffreywang88 reviewed Jun 24, 2026

limarkdcunha marked this pull request as ready for review July 2, 2026 22:41

limarkdcunha requested review from a team, MengjinYan, SongGuyang, alimaazamat, andrewsykim, dayshah, edoakes, kfstorm, marosset, matthewdeng, pcmoritz, raulchen, richardliaw, ryanaoleary and thomasdesr as code owners July 2, 2026 22:41

test: Added Ray Serve SGLang PD disaggregation support

1205202

Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>

limarkdcunha force-pushed the feature/ray-serve-sglang-pd-disaggregation branch from 83f2eb6 to 1205202 Compare July 2, 2026 22:44

cursor Bot reviewed Jul 2, 2026

updates as per feedback

2e85e8e

Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>

cursor Bot reviewed Jul 3, 2026

Comment thread python/ray/llm/_internal/serve/engines/sglang/sglang_engine.py

ray-gardener Bot added the serve (Ray Serve Related Issue), llm, and community-contribution (Contributed by the community) labels Jul 3, 2026

cloud mirror model path fix

0f0c2ec

Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>

cursor Bot reviewed Jul 3, 2026

Comment thread python/ray/llm/_internal/serve/serving_patterns/prefill_decode/builder.py

bootstrap port offset fix

0cba02a

Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>

cursor Bot reviewed Jul 3, 2026

Comment thread python/ray/llm/_internal/serve/engines/sglang/kv_transfer/pd_connector.py Outdated

prefill token clamp fix

0c1a7dd

Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>

cursor Bot reviewed Jul 3, 2026

Comment thread python/ray/llm/_internal/serve/serving_patterns/prefill_decode/pd_server.py

DP gang scheduling explicit failure

b22e455

Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
