
[Serve] [SGLang] POC PD disaggregation#63741

Open: limarkdcunha wants to merge 6 commits into ray-project:master from limarkdcunha:feature/ray-serve-sglang-pd-disaggregation

@limarkdcunha

Contributor

Description

A POC PR for my proposal (#63257) to add SGLang PD disaggregation support to Ray Serve.

Related issues: #62792, #63257

@limarkdcunha changed the title from "[Serve] POC ready for PD dissaggregation" to "[Serve] [SGLang] POC PD disaggregation" on May 30, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces SGLang Prefill-Decode (PD) disaggregated LLM serving by adding SGLangPDPrefillServer and SGLangPDDecodeServer deployments, and updating the SGLang engine to pass bootstrap coordination fields. Feedback on these changes highlights critical issues: first, setting attributes directly on Pydantic v2 models will raise runtime errors, so object.__setattr__ should be used instead; second, the prefill generator must be consumed in a background task to prevent premature garbage collection and engine hangs; third, asyncio needs to be imported to support this background task; and finally, the new servers should be added to the __all__ list in deployment.py to be properly exposed in the public API.
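
The second and third points of this summary (drain the prefill generator in a background task, and import asyncio to do so) can be illustrated with a minimal, standard-library-only sketch. The names `fake_prefill_stream`, `drain`, `consume_in_background`, and `_background_tasks` are hypothetical and do not come from the PR; the sketch only shows the general asyncio pattern of consuming an async generator in a task while holding a strong reference to that task.

```python
import asyncio


# Hypothetical stand-in for the prefill engine's streaming response.
async def fake_prefill_stream(n_chunks: int):
    for i in range(n_chunks):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        yield f"chunk-{i}"


# The event loop keeps only weak references to tasks, so hold strong ones here.
_background_tasks: set = set()


async def drain(stream) -> int:
    """Consume the stream to completion and return how many items it produced."""
    count = 0
    async for _ in stream:
        count += 1
    return count


def consume_in_background(stream) -> asyncio.Task:
    """Start draining the stream without blocking the caller."""
    task = asyncio.create_task(drain(stream))
    _background_tasks.add(task)
    # Drop the reference once the task finishes so the set does not grow.
    task.add_done_callback(_background_tasks.discard)
    return task


async def main() -> int:
    task = consume_in_background(fake_prefill_stream(3))
    # The caller is free to do other work here, e.g. stream the decode output.
    return await task


result = asyncio.run(main())
```

Whether the real prefill generator needs exactly this treatment depends on the engine's behavior; the sketch only demonstrates why a task reference is kept.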

Comment thread python/ray/serve/llm/deployment.py Outdated
@jeffreywang88

jeffreywang88 commented Jun 2, 2026

Contributor

Can you add a target in release_tests.yaml? Example: https://github.com/ray-project/ray/blob/master/release/release_tests.yaml#L4643-L4661

@limarkdcunha

Contributor Author

Added the target in release_tests.yaml. Thanks

@jeffreywang88 jeffreywang88 left a comment

Contributor

Left some design questions for you to consider while revising the RFC.

Comment on lines +12 to +14
to the right place. The decode server generates the bootstrap_room upfront and
dispatches both sides simultaneously — it does not wait for a prefill response
before starting decode.

Contributor

Looks similar to the parallel handoff pattern (https://github.com/ray-project/ray/pull/63950/changes#diff-ccc2365e3ea6c5f4fd309223ecbaaa15d44deb1d510f0acf985604060a7f1747R56). Could you assess the feasibility of leveraging the established concurrent handoff mechanism?
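
For reference while comparing the two approaches, the "dispatch both sides simultaneously" behavior quoted above can be sketched with plain asyncio. This is an illustration under assumptions, not the PR's implementation: `fake_prefill`, `fake_decode`, and `handle_request` are hypothetical stand-ins, and the sleeps merely simulate work.

```python
import asyncio
import secrets


async def fake_prefill(room: int, log: list) -> None:
    log.append(("prefill-start", room))
    await asyncio.sleep(0.01)  # pretend to compute and ship the KV cache
    log.append(("prefill-done", room))


async def fake_decode(room: int, log: list) -> str:
    log.append(("decode-start", room))
    await asyncio.sleep(0.02)  # pretend to wait for the KV cache, then generate
    return f"decoded for room {room}"


async def handle_request() -> tuple:
    log: list = []
    # The room id is generated before either side is contacted.
    room = secrets.randbits(63)
    # Both coroutines are scheduled together; decode does not await prefill.
    _, text = await asyncio.gather(
        fake_prefill(room, log),
        fake_decode(room, log),
    )
    return room, text, log


room, text, log = asyncio.run(handle_request())
```

In the resulting log, the decode side starts before the prefill side reports completion, which is the property the quoted docstring describes.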

Comment on lines +144 to +145
# LLMServer.get_deployment_options calls get_engine_config(), which
# unconditionally imports vLLM. The SGLang byod image uninstalls vLLM,

Contributor

We will need to somehow decouple vLLM from the common paths. It might require a refactor. Could you explore some options?
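
One option worth considering is a lazy, per-engine import so that common code paths never import an engine package they do not use. The sketch below is only an illustration of that general pattern: `_ENGINE_MODULES`, `load_engine_module`, and `recording_importer` are hypothetical names, and nothing here reflects how `get_engine_config()` is actually structured.

```python
import importlib
from typing import Any, Callable, Dict

# Hypothetical registry: engine name -> module path to import on demand.
_ENGINE_MODULES: Dict[str, str] = {
    "vllm": "vllm",
    "sglang": "sglang",
}


def load_engine_module(
    engine: str,
    importer: Callable[[str], Any] = importlib.import_module,
) -> Any:
    """Import only the module for the requested engine, and only when asked."""
    try:
        module_path = _ENGINE_MODULES[engine]
    except KeyError:
        raise ValueError(f"unknown engine: {engine!r}") from None
    return importer(module_path)


# Demonstration with a recording importer, so neither package must be installed.
imported = []


def recording_importer(path: str) -> str:
    imported.append(path)
    return f"<module {path}>"


module = load_engine_module("sglang", importer=recording_importer)
```

With this shape, selecting the SGLang engine never touches the vLLM entry, so an image without vLLM installed would not fail on import.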

carrying the same prefill bootstrap host/port/room — the decode
KVReceiver connects to that prefill bootstrap server and blocks
internally waiting for the KV cache to arrive.
5. Streams the decode response back to the client.

Contributor

This is exactly what we need, nice!

For each chat/completions request it:
1. Reads the PREFILL node's bootstrap_host and bootstrap_port, fetched
from the prefill deployment at init (the bootstrap server lives there).
2. Generates a unique bootstrap_room integer.

Contributor

In the design doc, it'd be great to clarify what bootstrap_room does and where it lives.
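
As a starting point for that clarification, going only by the docstring quoted above, `bootstrap_room` appears to act as a per-request rendezvous key: the decode server generates it and attaches the same host, port, and room to both the prefill and the decode request. The sketch below illustrates that reading; `BootstrapInfo`, `new_bootstrap_info`, `attach`, the example address, and the 63-bit width are all assumptions, not taken from the PR or from SGLang.

```python
import secrets
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BootstrapInfo:
    bootstrap_host: str
    bootstrap_port: int
    bootstrap_room: int


def new_bootstrap_info(host: str, port: int) -> BootstrapInfo:
    # Assumption: 63 random bits, so the id fits in a signed 64-bit integer.
    return BootstrapInfo(host, port, secrets.randbits(63))


def attach(request: dict, info: BootstrapInfo) -> dict:
    # Return a copy so the original request is not mutated.
    return {**request, **asdict(info)}


# Hypothetical prefill bootstrap address, fetched once at init in the PR's flow.
info = new_bootstrap_info("10.0.0.5", 8998)
prefill_req = attach({"prompt": "hello", "max_tokens": 1}, info)
decode_req = attach({"prompt": "hello", "max_tokens": 64}, info)
```

Both requests end up carrying identical bootstrap fields, which is what would let the two sides find each other without the decode server waiting on a prefill response.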

Comment on lines +185 to +187
# TODO: Users currently need to set disaggregation_mode manually in engine_kwargs.
# The builder should set this automatically since it already knows which
# config is prefill and which is decode.

Contributor

We should emulate Ray Serve LLM's PD API for vLLM here. The user-facing API should be similar.


Unlike the vLLM flow, we do not wait for a prefill response before
starting decode — the bootstrap_room is established upfront and both
sides coordinate directly via SGLang's bootstrap server.

Contributor

Is this bootstrap server strictly required?

Comment thread python/ray/serve/llm/deployment.py Outdated
pass


@PublicAPI(stability="beta")

Contributor

We should start from alpha.

Comment on lines +4674 to +4675
- UCX_TLS=all
- UCX_NET_DEVICES=all

Contributor

Could you help me understand why we need these?

Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
@limarkdcunha limarkdcunha force-pushed the feature/ray-serve-sglang-pd-disaggregation branch from 83f2eb6 to 1205202 Compare July 2, 2026 22:44
Comment thread python/ray/llm/_internal/serve/engines/sglang/sglang_engine.py
Comment thread release/release_tests.yaml
Comment thread python/ray/llm/_internal/serve/core/configs/llm_config.py
Comment thread python/ray/llm/_internal/serve/core/configs/llm_config.py
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
Comment thread python/ray/llm/_internal/serve/engines/sglang/sglang_engine.py
@limarkdcunha

Contributor Author

Not ready for review yet; please ignore for now.

@ray-gardener (bot) added the serve, llm, and community-contribution labels on Jul 3, 2026
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
Comment thread python/ray/llm/_internal/serve/engines/sglang/kv_transfer/pd_connector.py Outdated
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 0c1a7dd.

Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
