[Serve] Gate ingress request router body forwarding behind escape hatch by eicherseiji · Pull Request #63183 · ray-project/ray

eicherseiji · 2026-05-07T05:59:31Z

Stacked on #63167.

Round-robin /internal/route ignores the request body, so make body forwarding opt-in via RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY (off by default). When off, HAProxy skips wait-for-body/tune.bufsize and the Lua POSTs Content-Length: 0. When on, prior body-forwarding behavior is restored unchanged for body-aware policies (prefix-cache, etc.).

LLMRouter is unchanged: await request.body() on a Content-Length: 0 request is essentially free, so no gate is needed there. _pick_replica's request_body / body_truncated parameters stay so flipping the flag on doesn't touch call sites.

Test plan

Existing test_ingress_request_router_end_to_end flips the flag on via monkeypatch; body-forwarded path still covered.

gemini-code-assist

Code Review

This pull request introduces a direct streaming mode for Ray LLM Serve, allowing the LLMServer deployment to act as its own ingress. Key changes include the addition of a late-bound ASGI application construction mechanism via serve_build_asgi_app, the implementation of a new LLMRouter for replica-level routing, and updates to HAProxy templates to support optional request body forwarding. A critical feedback point identifies that LLMServer must ensure its engine is started before attempting to build the ASGI app to avoid initialization errors.

gemini-code-assist · 2026-05-07T06:01:40Z

+    async def __serve_build_asgi_app__(self):
+        return await self.engine.build_asgi_app()


In the direct streaming path, LLMServer acts as the ingress deployment. Ray Serve will instantiate the replica and call __serve_build_asgi_app__ to initialize the ASGI application. However, self.engine and its internal state (such as _vllm_args in VLLMEngine) are only initialized in the start() method, which is not automatically called by Serve's lifecycle for ingress deployments.

If start() hasn't been called, self.engine will be None or missing required attributes, leading to an AttributeError when build_asgi_app() is invoked. You should ensure that self.start() is called and awaited before accessing the engine.

Suggested change

async def __serve_build_asgi_app__(self):

return await self.engine.build_asgi_app()

async def __serve_build_asgi_app__(self):

await self.start()

return await self.engine.build_asgi_app()

Wire up the controller to support "ingress bypass" mode where a router deployment (e.g. OpenAiIngress) serves /internal/route for Lua routing decisions, while non-router deployments (e.g. LLMServer) serve data plane traffic directly via their direct ingress ports. Key changes: - controller.py: _get_target_groups_for_app_with_router splits targets into router_targets (for Lua) and main targets (data plane) - application_state.py: track router deployment name per app - build_app.py, client.py, deploy_utils.py, deployment_info.py: plumb router flag through deployment args - default_impl.py: factory plumbing for router deployment - schema.py: router_targets field on TargetGroup Signed-off-by: Seiji Eicher <seiji@anyscale.com>

- Remove unused _summarize_target, _summarize_target_groups, _summarize_fallback_targets helpers - Remove unused get_fallback_targets_for_proxy_manager method - Revert unrelated RAY_SERVE_THROUGHPUT_OPTIMIZED env var plumbing - Fix stale docstring referencing OpenAiIngress Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

This reverts commit 9600be7. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Extend HAProxy configuration to support Lua-based custom request routing for the ingress bypass pattern. When a backend has router_targets configured, HAProxy calls /internal/route on a router replica to get the target host:port, then forwards the data plane request directly to that replica. Key changes: - haproxy.py: BackendConfig gains router_servers, custom_request_routing, max_server_conns, router_server fields; reload tracking; is_head readiness fix; summarize helpers for debugging - haproxy_templates.py: Lua route template for /internal/route calls Signed-off-by: Seiji Eicher <seiji@anyscale.com>

- Remove streaming-only restriction: route ALL POST requests through Lua when ingress bypass is enabled, not just streaming ones - Remove killswitch file check (/tmp/ray-serve-disable-ingress-bypass-routing) - Reduce Lua socket timeout from 5s to 500ms for fast fallback - Revert enable_hap_optimization splitting back to single flag - Remove reload tracking (_reload_count, _last_reload_mode) - Remove _hard_reload and _reload_with_config indirection - Remove get_reload_debug_summary Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

- Remove _summarize_target, _summarize_target_group, _summarize_backend_config (defined but never called) - Add TODO on single router_server selection noting the HA gap Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

Serve core - Replace LATE_BOUND_ASGI_APP sentinel with no-arg serve.ingress(): a class supplies its own ASGI app via __serve_build_asgi_app__. Drops the placeholder FastAPI() and the dual _set_asgi_app() init. - Hoist _DIRECT_INGRESS_ASGI_REQ_META to a module-level constant so /direct ingress dispatch skips a per-request RequestMetadata construction and two generate_request_id() UUIDs. LLM engine - Add LLMEngine.build_asgi_app() to the engine protocol; VLLMEngine implements it. LLMServer.__serve_build_asgi_app__ is a one-line delegate to self.engine.build_asgi_app(). Drops the public vllm_args/engine_client properties that leaked vLLM internals. LLM router - Bind via bind(server=llm_deployment) so the handle is injected as an init kwarg; drops the magic-string serve.get_deployment_handle(). - Replace the awaited llm_config.remote() priming RPC with a synchronous DeploymentHandle._init() + cached _get_request_router(). Removes the 60s timeout, the dependency on a method being removed in ray-project#63065, and the engine-cold-start coupling. - check_health() now asserts the request router is non-None as a liveness signal. - Drop GET / (HAProxy never calls it; Serve uses check_health for readiness). Keep GET /health as a human-operator convenience. - Cache the RequestRouter on self at init; precompute (host, port, full_id_str) tuples in _ready_endpoints to skip per-request formatting; two-tier cache (id(curr_replicas) fast path then frozenset(keys) fallback) avoids the frozenset allocation when curr_replicas is unchanged. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

The `if self._user_callable_asgi_app is not None` short-circuit added in the previous commits matched every @serve.ingress deployment, not just direct streaming, and bypassed RequestMetadata setup (route, request_id, session_id, tracing context, backpressure). That broke: - test_replica_request_context.py::TestHTTPRoute::test_matching_fastapi_route (request context route returned empty string) - test_http_headers.py::TestUserProvidedRequestIDHeader::test_fastapi (request_id header not parsed) - test_http_headers.py::test_reuse_request_id The existing flow at line 2535+ already handles ASGI apps correctly via _call_http_entrypoint (UserMethodInfo.is_asgi_app), so direct streaming goes through the same path with no special-case code. Buildkite failures: ray-project/premerge#65996. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

…e hatch Round-robin /internal/route ignores the request body, so we make body forwarding opt-in via RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY (off by default). Saves on every routed request: - HAProxy `wait-for-body` round-trip and per-connection buffer cost (~2 * tune.bufsize * maxconn). - Lua `txn.sf:req_body()` + body re-emit on the socket. - LLMRouter `await request.body()`. When the flag is on, the previous body-forwarding path is restored unchanged for body-aware policies (prefix-cache, etc.): - HAProxy frontend re-emits `wait-for-body` and `tune.bufsize`. - Lua reads the body and forwards it (with X-Body-Truncated when truncated). - LLMRouter reads the body and the truncation header. The end-to-end test that asserted the body got forwarded now flips the flag on via monkeypatch. Stacked on ray-project#63167. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

- router.py: drop the redundant else-branch (defaults already match). - haproxy.py: `str(bool).lower()` instead of "true"/"false" ternary. - haproxy_templates.py: flatten nested {% if %}{% if %} for tune.bufsize. - lua.tmpl, test: drop WHAT-comment narration. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

The expensive part of body forwarding lives in HAProxy + Lua (the `wait-for-body` round-trip, `tune.bufsize` connection cost, and body re-emit on the socket). On the LLMRouter side, `await request.body()` on a Content-Length: 0 request is essentially free (one async tick, returns b""), so gating it earns nothing. Reverts the LLM-side branch and keeps the body read unconditional, making this PR Serve-only. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Brings in ray-project#63215 (fail-loud) and resolves the conflict in ``ingress_request_router.lua.tmpl`` against this branch's FORWARD_BODY gate. Resolution: drop the ``empty_body`` sentinel that ray-project#63215 added. With the FORWARD_BODY gate, an empty body is no longer a distinct failure mode -- both modes call the router with ``body=""`` and let the router decide. The "router or fail" invariant still holds because the router *is* called; we just don't synthesize a sentinel locally. The 4 remaining failure-mode sentinels from ray-project#63215 (``router_unreachable`` / ``router_non_200`` / ``unparseable_replica_id`` / ``unknown_replica_id``) are unchanged. ``test_router_failure_fails_loud_with_reason`` updates fall out of the resolution: both empty and non-empty POSTs now surface as ``router_non_200`` against the broken-router stub. The empty-body case stays in the test to pin that empty bodies are routed through the router rather than silently bypassing it. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

Pins that ``tune.bufsize``, ``wait-for-body``, and the Lua ``FORWARD_BODY`` constant render consistently with the env var. Off by default; on only when ``RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY=1``. Catches the failure mode where any one of the three signals drifts out of sync with the others. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

Three pieces of dead code that drifted into this branch but aren't referenced by any FORWARD_BODY-related change and aren't reachable from runtime code paths: - ``router.py``: ``DIRECT_ROUTER_*`` env-var name constants and the ``ingress_request_router_app`` FastAPI placeholder. Defined but unused; staged for unrelated work. - ``haproxy.py``: free-function ``_get_safe_name`` and ``target_to_server`` duplicates of ``HAProxyManager._target_to_server`` / ``HAProxyManager.get_safe_name``. Never called. - ``test_router.py``: a second copy of ``_DirectRouterReplicaId`` / ``_FakeRequest`` / ``_DirectRouterReplica`` / ``_new_direct_router`` appended above an identical existing block (Python uses the second definition, making the first dead). Likely a merge artifact. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

Frame the default as a TTFR concern rather than a policy-specific one: buffering and re-emitting large request bodies adds noticeable time-to- first-response. Call out concrete policy categories so a reader knows when to flip the flag without grepping the routers: - round-robin / power-of-two: body-independent, leave off. - session-aware: keys on ``x-session-id`` header, leave off. - prefix-aware / prefix-cache: needs the body, flip on. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

Three failures in test_haproxy_api after merging master: 1. ``test_write_ingress_request_router_lua_no_routers``: assertion ``result == (None, False)`` was stale. ``_write_ingress_request_router_lua`` returns just the Lua path (or ``None``), not a tuple. Fix to ``assert result is None``. 2. ``test_ingress_request_router_does_not_leak_into_other_backends``: had embedded assertions that ``wait-for-body`` and ``tune.bufsize 65536`` appear in the rendered config. Both are wrong on master: ``wait-for-body`` is gated behind ``FORWARD_BODY=true`` (off by default), and the bufsize default is 262144 not 65536. Drop these assertions; their FORWARD_BODY-aware replacement is ``test_ingress_request_router_forward_body_gate_renders``. 3. ``test_ingress_request_router_end_to_end``: assertion ``router_captured["bodies"] == ['{"prompt": "hello"}']`` failed because ``_create_router_server``'s readiness probe POSTs ``{}`` to the same ``/internal/route`` handler that records bodies, so the first capture was always the probe body. Clear ``captured["bodies"]`` after the probe completes so callers see only client traffic. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit 380647f. Configure here.}

cursor · 2026-05-09T00:42:50Z

+# decision, e.g. prefix-aware / prefix-cache routing.
+RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY = get_env_bool(
+    "RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY", False
+)


Wrong default type passed to get_env_bool

Low Severity

get_env_bool documents and expects a string "0" or "1" as the default parameter, but False (a Python boolean) is passed here. Every other call site in the codebase passes "0" or "1". It happens to work because str(False) produces "False" which doesn't equal "1", but it violates the function's documented contract and is inconsistent with the established pattern.

^{Reviewed by Cursor Bugbot for commit 380647f. Configure here.}

jeffreywang-anyscale

LGTM, leaving nits. Is this "escape hatch" approach a long-term solution we're going for?

jeffreywang-anyscale · 2026-05-09T03:08:55Z

            # Use file locking to prevent concurrent writes from multiple processes
            # This is important in test environments where multiple nodes may run
            # on the same machine
            import fcntl


nit: move imports to top

jeffreywang-anyscale · 2026-05-09T03:09:09Z

        return lua_path

-    def _generate_config_file_internal(self) -> None:
+    def _generate_config_file_internal(self) -> bool:


we aren't returning bool

jeffreywang-anyscale · 2026-05-09T03:12:05Z

+# header (forwarded with the request line) rather than the body.
+#
+# Flip this to true if the configured request router needs the body for its
+# decision, e.g. prefix-aware / prefix-cache routing.


does this imply that we might underperform other LLM routers under prefix-cache affinity routing policy?

This is a fundamental system trade off and not so much of a implementation trade off, but we'll see. Your work of extending to other request routers need to be done for us to be able to answer this question

jeffreywang-anyscale · 2026-05-09T03:12:25Z

+# decision, e.g. prefix-aware / prefix-cache routing.
+RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY = get_env_bool(
+    "RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY", False
+)


…ch (ray-project#63183) Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha · 2026-05-09T16:28:32Z

LGTM, leaving nits. Is this "escape hatch" approach a long-term solution we're going for?

For now, we couldn't think of a better approach to turn it on / off, we have to do a study for when we do other request routing policies that need the request body

…ch (ray-project#63183) Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

…ch (ray-project#63183) Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: anindyam1969 <amukherjee@kinetica.com>

…ch (ray-project#63183) Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

gemini-code-assist Bot reviewed May 7, 2026

View reviewed changes

eicherseiji changed the title ~~[Serve][LLM] Gate ingress request router body forwarding behind escape hatch~~ [Serve] Gate ingress request router body forwarding behind escape hatch May 7, 2026

kouroshHakha added the go add ONLY when ready to merge, run all tests label May 7, 2026

eicherseiji added 27 commits May 7, 2026 11:05

[Serve] Wire router deployments through ingress bypass plumbing

cd1f016

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Move http router role into controller plumbing

3bd0852

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Rename http_router internals to ingress_request_router

30bbb00

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Fix ingress request router runtime config flag

4e1f136

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Clean up ingress request router target groups

76157d4

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Trim ingress request router target group tests

b00a351

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Use replica details for backend HTTP targets

16b524f

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Recover ingress request router ports

04111cc

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Share target group construction for ingress request router

255347e

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Inline ingress request router target selection

17ba93b

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Trim unrelated formatting changes

bbc27c1

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Treat ingress request router as ingress target metadata

4499c49

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Address ingress request router controller review

cd88eee

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Trim ingress request router validation

d80c642

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Clarify direct ingress port info variable

c5fc9d0

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Revert "[Serve] Clarify direct ingress port info variable"

bf76ee7

This reverts commit 9600be7. Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Keep ingress replica port naming

522a6c6

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Restore target group comments

f56e2e5

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Inline ingress router target groups

3995ed7

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Trim controller plumbing diff

ee42db1

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[Serve] Validate ingress request router target state

62cefd3

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Remove unrelated _controller_handle caching

87823c5

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

Remove unused summarize helpers, add TODO for router HA

3e10dd7

- Remove _summarize_target, _summarize_target_group, _summarize_backend_config (defined but never called) - Add TODO on single router_server selection noting the HA gap Signed-off-by: Seiji Eicher <seiji@anyscale.com>

kouroshHakha and others added 8 commits May 7, 2026 11:05

[Serve][LLM] Clean up direct streaming router wiring

90374b4

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

[Serve][LLM] Simplify direct streaming router

4d94611

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

Cleanup

39ad094

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha force-pushed the ingress-bypass/5-disable-body-forwarding branch from 2ba9e2b to 0390ad4 Compare May 7, 2026 18:12

Merge branch 'master' into ingress-bypass/5-disable-body-forwarding

d2a0038

kouroshHakha mentioned this pull request May 8, 2026

[Serve] Gate ingress request router body forwarding behind escape hatch #63232

Closed

4 tasks

kouroshHakha added 4 commits May 8, 2026 12:49

kouroshHakha marked this pull request as ready for review May 8, 2026 19:58

kouroshHakha requested a review from a team as a code owner May 8, 2026 19:58

cursor Bot reviewed May 8, 2026

View reviewed changes

Comment thread python/ray/serve/tests/test_haproxy_api.py Outdated

Comment thread python/ray/serve/tests/test_haproxy_api.py Outdated

kouroshHakha enabled auto-merge (squash) May 9, 2026 00:38

cursor Bot reviewed May 9, 2026

View reviewed changes

ray-gardener Bot added the serve Ray Serve Related Issue label May 9, 2026

kouroshHakha approved these changes May 9, 2026

View reviewed changes

kouroshHakha merged commit 76ce1d1 into ray-project:master May 9, 2026
8 checks passed

jeffreywang-anyscale reviewed May 9, 2026

View reviewed changes

		async def __serve_build_asgi_app__(self):
		return await self.engine.build_asgi_app()

Conversation

eicherseiji commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 9, 2026

Choose a reason for hiding this comment

Wrong default type passed to get_env_bool

Uh oh!

jeffreywang-anyscale May 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jeffreywang-anyscale left a comment

Choose a reason for hiding this comment

Uh oh!

jeffreywang-anyscale May 9, 2026

Choose a reason for hiding this comment

Uh oh!

jeffreywang-anyscale May 9, 2026

Choose a reason for hiding this comment

Uh oh!

jeffreywang-anyscale May 9, 2026

Choose a reason for hiding this comment

Uh oh!

kouroshHakha May 9, 2026

Choose a reason for hiding this comment

Uh oh!

jeffreywang-anyscale May 9, 2026

Choose a reason for hiding this comment

Uh oh!

kouroshHakha commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

eicherseiji commented May 7, 2026 •

edited

Loading

Wrong default type passed to `get_env_bool`