Skip to content

[Serve] Gate ingress request router body forwarding behind escape hatch#63183

Merged
kouroshHakha merged 95 commits into
ray-project:masterfrom
eicherseiji:ingress-bypass/5-disable-body-forwarding
May 9, 2026
Merged

[Serve] Gate ingress request router body forwarding behind escape hatch#63183
kouroshHakha merged 95 commits into
ray-project:masterfrom
eicherseiji:ingress-bypass/5-disable-body-forwarding

Conversation

@eicherseiji
Copy link
Copy Markdown
Contributor

@eicherseiji eicherseiji commented May 7, 2026

Stacked on #63167.

Round-robin /internal/route ignores the request body, so make body forwarding opt-in via RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY (off by default). When off, HAProxy skips wait-for-body/tune.bufsize and the Lua POSTs Content-Length: 0. When on, prior body-forwarding behavior is restored unchanged for body-aware policies (prefix-cache, etc.).

LLMRouter is unchanged: await request.body() on a Content-Length: 0 request is essentially free, so no gate is needed there. _pick_replica's request_body / body_truncated parameters stay so flipping the flag on doesn't touch call sites.

Test plan

  • Existing test_ingress_request_router_end_to_end flips the flag on via monkeypatch; body-forwarded path still covered.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a direct streaming mode for Ray LLM Serve, allowing the LLMServer deployment to act as its own ingress. Key changes include the addition of a late-bound ASGI application construction mechanism via serve_build_asgi_app, the implementation of a new LLMRouter for replica-level routing, and updates to HAProxy templates to support optional request body forwarding. A critical feedback point identifies that LLMServer must ensure its engine is started before attempting to build the ASGI app to avoid initialization errors.

Comment on lines +198 to +199
async def __serve_build_asgi_app__(self):
return await self.engine.build_asgi_app()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In the direct streaming path, LLMServer acts as the ingress deployment. Ray Serve will instantiate the replica and call __serve_build_asgi_app__ to initialize the ASGI application. However, self.engine and its internal state (such as _vllm_args in VLLMEngine) are only initialized in the start() method, which is not automatically called by Serve's lifecycle for ingress deployments.

If start() hasn't been called, self.engine will be None or missing required attributes, leading to an AttributeError when build_asgi_app() is invoked. You should ensure that self.start() is called and awaited before accessing the engine.

Suggested change
async def __serve_build_asgi_app__(self):
return await self.engine.build_asgi_app()
async def __serve_build_asgi_app__(self):
await self.start()
return await self.engine.build_asgi_app()

@eicherseiji eicherseiji changed the title [Serve][LLM] Gate ingress request router body forwarding behind escape hatch [Serve] Gate ingress request router body forwarding behind escape hatch May 7, 2026
@kouroshHakha kouroshHakha added the go add ONLY when ready to merge, run all tests label May 7, 2026
eicherseiji added 27 commits May 7, 2026 11:05
Wire up the controller to support "ingress bypass" mode where a router
deployment (e.g. OpenAiIngress) serves /internal/route for Lua routing
decisions, while non-router deployments (e.g. LLMServer) serve data
plane traffic directly via their direct ingress ports.

Key changes:
- controller.py: _get_target_groups_for_app_with_router splits targets
  into router_targets (for Lua) and main targets (data plane)
- application_state.py: track router deployment name per app
- build_app.py, client.py, deploy_utils.py, deployment_info.py: plumb
  router flag through deployment args
- default_impl.py: factory plumbing for router deployment
- schema.py: router_targets field on TargetGroup

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
- Remove unused _summarize_target, _summarize_target_groups,
  _summarize_fallback_targets helpers
- Remove unused get_fallback_targets_for_proxy_manager method
- Revert unrelated RAY_SERVE_THROUGHPUT_OPTIMIZED env var plumbing
- Fix stale docstring referencing OpenAiIngress

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
This reverts commit 9600be7.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Extend HAProxy configuration to support Lua-based custom request routing
for the ingress bypass pattern. When a backend has router_targets configured,
HAProxy calls /internal/route on a router replica to get the target host:port,
then forwards the data plane request directly to that replica.

Key changes:
- haproxy.py: BackendConfig gains router_servers, custom_request_routing,
  max_server_conns, router_server fields; reload tracking; is_head
  readiness fix; summarize helpers for debugging
- haproxy_templates.py: Lua route template for /internal/route calls

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
- Remove streaming-only restriction: route ALL POST requests through
  Lua when ingress bypass is enabled, not just streaming ones
- Remove killswitch file check (/tmp/ray-serve-disable-ingress-bypass-routing)
- Reduce Lua socket timeout from 5s to 500ms for fast fallback
- Revert enable_hap_optimization splitting back to single flag
- Remove reload tracking (_reload_count, _last_reload_mode)
- Remove _hard_reload and _reload_with_config indirection
- Remove get_reload_debug_summary

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
- Remove _summarize_target, _summarize_target_group,
  _summarize_backend_config (defined but never called)
- Add TODO on single router_server selection noting the HA gap

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
kouroshHakha and others added 8 commits May 7, 2026 11:05
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Serve core
- Replace LATE_BOUND_ASGI_APP sentinel with no-arg serve.ingress(): a
  class supplies its own ASGI app via __serve_build_asgi_app__. Drops
  the placeholder FastAPI() and the dual _set_asgi_app() init.
- Hoist _DIRECT_INGRESS_ASGI_REQ_META to a module-level constant so
  /direct ingress dispatch skips a per-request RequestMetadata
  construction and two generate_request_id() UUIDs.

LLM engine
- Add LLMEngine.build_asgi_app() to the engine protocol; VLLMEngine
  implements it. LLMServer.__serve_build_asgi_app__ is a one-line
  delegate to self.engine.build_asgi_app(). Drops the public
  vllm_args/engine_client properties that leaked vLLM internals.

LLM router
- Bind via bind(server=llm_deployment) so the handle is injected as an
  init kwarg; drops the magic-string serve.get_deployment_handle().
- Replace the awaited llm_config.remote() priming RPC with a synchronous
  DeploymentHandle._init() + cached _get_request_router(). Removes the
  60s timeout, the dependency on a method being removed in ray-project#63065, and
  the engine-cold-start coupling.
- check_health() now asserts the request router is non-None as a
  liveness signal.
- Drop GET / (HAProxy never calls it; Serve uses check_health for
  readiness). Keep GET /health as a human-operator convenience.
- Cache the RequestRouter on self at init; precompute (host, port,
  full_id_str) tuples in _ready_endpoints to skip per-request
  formatting; two-tier cache (id(curr_replicas) fast path then
  frozenset(keys) fallback) avoids the frozenset allocation when
  curr_replicas is unchanged.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
The `if self._user_callable_asgi_app is not None` short-circuit added
in the previous commits matched every @serve.ingress deployment, not
just direct streaming, and bypassed RequestMetadata setup (route,
request_id, session_id, tracing context, backpressure). That broke:

- test_replica_request_context.py::TestHTTPRoute::test_matching_fastapi_route
  (request context route returned empty string)
- test_http_headers.py::TestUserProvidedRequestIDHeader::test_fastapi
  (request_id header not parsed)
- test_http_headers.py::test_reuse_request_id

The existing flow at line 2535+ already handles ASGI apps correctly via
_call_http_entrypoint (UserMethodInfo.is_asgi_app), so direct streaming
goes through the same path with no special-case code.

Buildkite failures: ray-project/premerge#65996.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…e hatch

Round-robin /internal/route ignores the request body, so we make body
forwarding opt-in via RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY (off
by default). Saves on every routed request:
- HAProxy `wait-for-body` round-trip and per-connection buffer cost
  (~2 * tune.bufsize * maxconn).
- Lua `txn.sf:req_body()` + body re-emit on the socket.
- LLMRouter `await request.body()`.

When the flag is on, the previous body-forwarding path is restored
unchanged for body-aware policies (prefix-cache, etc.):
- HAProxy frontend re-emits `wait-for-body` and `tune.bufsize`.
- Lua reads the body and forwards it (with X-Body-Truncated when truncated).
- LLMRouter reads the body and the truncation header.

The end-to-end test that asserted the body got forwarded now flips the
flag on via monkeypatch.

Stacked on ray-project#63167.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
- router.py: drop the redundant else-branch (defaults already match).
- haproxy.py: `str(bool).lower()` instead of "true"/"false" ternary.
- haproxy_templates.py: flatten nested {% if %}{% if %} for tune.bufsize.
- lua.tmpl, test: drop WHAT-comment narration.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
The expensive part of body forwarding lives in HAProxy + Lua (the
`wait-for-body` round-trip, `tune.bufsize` connection cost, and body
re-emit on the socket). On the LLMRouter side, `await request.body()`
on a Content-Length: 0 request is essentially free (one async tick,
returns b""), so gating it earns nothing. Reverts the LLM-side branch
and keeps the body read unconditional, making this PR Serve-only.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@kouroshHakha kouroshHakha force-pushed the ingress-bypass/5-disable-body-forwarding branch from 2ba9e2b to 0390ad4 Compare May 7, 2026 18:12
Brings in ray-project#63215 (fail-loud) and resolves the conflict in
``ingress_request_router.lua.tmpl`` against this branch's FORWARD_BODY
gate.

Resolution: drop the ``empty_body`` sentinel that ray-project#63215 added. With
the FORWARD_BODY gate, an empty body is no longer a distinct failure
mode -- both modes call the router with ``body=""`` and let the router
decide. The "router or fail" invariant still holds because the router
*is* called; we just don't synthesize a sentinel locally.

The 4 remaining failure-mode sentinels from ray-project#63215
(``router_unreachable`` / ``router_non_200`` /
``unparseable_replica_id`` / ``unknown_replica_id``) are unchanged.

``test_router_failure_fails_loud_with_reason`` updates fall out of the
resolution: both empty and non-empty POSTs now surface as
``router_non_200`` against the broken-router stub. The empty-body case
stays in the test to pin that empty bodies are routed through the
router rather than silently bypassing it.

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Pins that ``tune.bufsize``, ``wait-for-body``, and the Lua
``FORWARD_BODY`` constant render consistently with the env var. Off
by default; on only when ``RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY=1``.
Catches the failure mode where any one of the three signals drifts out
of sync with the others.

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Three pieces of dead code that drifted into this branch but aren't
referenced by any FORWARD_BODY-related change and aren't reachable
from runtime code paths:

- ``router.py``: ``DIRECT_ROUTER_*`` env-var name constants and the
  ``ingress_request_router_app`` FastAPI placeholder. Defined but
  unused; staged for unrelated work.
- ``haproxy.py``: free-function ``_get_safe_name`` and
  ``target_to_server`` duplicates of ``HAProxyManager._target_to_server``
  / ``HAProxyManager.get_safe_name``. Never called.
- ``test_router.py``: a second copy of ``_DirectRouterReplicaId`` /
  ``_FakeRequest`` / ``_DirectRouterReplica`` / ``_new_direct_router``
  appended above an identical existing block (Python uses the second
  definition, making the first dead). Likely a merge artifact.

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Frame the default as a TTFR concern rather than a policy-specific one:
buffering and re-emitting large request bodies adds noticeable time-to-
first-response. Call out concrete policy categories so a reader knows
when to flip the flag without grepping the routers:
- round-robin / power-of-two: body-independent, leave off.
- session-aware: keys on ``x-session-id`` header, leave off.
- prefix-aware / prefix-cache: needs the body, flip on.

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha marked this pull request as ready for review May 8, 2026 19:58
@kouroshHakha kouroshHakha requested a review from a team as a code owner May 8, 2026 19:58
Comment thread python/ray/serve/tests/test_haproxy_api.py Outdated
Comment thread python/ray/serve/tests/test_haproxy_api.py Outdated
Three failures in test_haproxy_api after merging master:

1. ``test_write_ingress_request_router_lua_no_routers``: assertion
   ``result == (None, False)`` was stale. ``_write_ingress_request_router_lua``
   returns just the Lua path (or ``None``), not a tuple. Fix to
   ``assert result is None``.

2. ``test_ingress_request_router_does_not_leak_into_other_backends``: had
   embedded assertions that ``wait-for-body`` and ``tune.bufsize 65536``
   appear in the rendered config. Both are wrong on master:
   ``wait-for-body`` is gated behind ``FORWARD_BODY=true`` (off by
   default), and the bufsize default is 262144 not 65536. Drop these
   assertions; their FORWARD_BODY-aware replacement is
   ``test_ingress_request_router_forward_body_gate_renders``.

3. ``test_ingress_request_router_end_to_end``: assertion
   ``router_captured["bodies"] == ['{"prompt": "hello"}']`` failed because
   ``_create_router_server``'s readiness probe POSTs ``{}`` to the same
   ``/internal/route`` handler that records bodies, so the first capture
   was always the probe body. Clear ``captured["bodies"]`` after the
   probe completes so callers see only client traffic.

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha enabled auto-merge (squash) May 9, 2026 00:38
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 380647f. Configure here.

# decision, e.g. prefix-aware / prefix-cache routing.
RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY = get_env_bool(
"RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY", False
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong default type passed to get_env_bool

Low Severity

get_env_bool documents and expects a string "0" or "1" as the default parameter, but False (a Python boolean) is passed here. Every other call site in the codebase passes "0" or "1". It happens to work because str(False) produces "False" which doesn't equal "1", but it violates the function's documented contract and is inconsistent with the established pattern.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 380647f. Configure here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit +1

@ray-gardener ray-gardener Bot added the serve Ray Serve Related Issue label May 9, 2026
@kouroshHakha kouroshHakha merged commit 76ce1d1 into ray-project:master May 9, 2026
8 checks passed
Copy link
Copy Markdown
Contributor

@jeffreywang-anyscale jeffreywang-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, leaving nits. Is this "escape hatch" approach a long-term solution we're going for?

# Use file locking to prevent concurrent writes from multiple processes
# This is important in test environments where multiple nodes may run
# on the same machine
import fcntl
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move imports to top

return lua_path

def _generate_config_file_internal(self) -> None:
def _generate_config_file_internal(self) -> bool:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we aren't returning bool

# header (forwarded with the request line) rather than the body.
#
# Flip this to true if the configured request router needs the body for its
# decision, e.g. prefix-aware / prefix-cache routing.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this imply that we might underperform other LLM routers under prefix-cache affinity routing policy?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fundamental system trade off and not so much of a implementation trade off, but we'll see. Your work of extending to other request routers need to be done for us to be able to answer this question

# decision, e.g. prefix-aware / prefix-cache routing.
RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY = get_env_bool(
"RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY", False
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit +1

chillCode404 pushed a commit to chillCode404/ray-contrib that referenced this pull request May 9, 2026
…ch (ray-project#63183)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha
Copy link
Copy Markdown
Contributor

LGTM, leaving nits. Is this "escape hatch" approach a long-term solution we're going for?

For now, we couldn't think of a better approach to turn it on / off, we have to do a study for when we do other request routing policies that need the request body

dancingactor pushed a commit to dancingactor/ray that referenced this pull request May 13, 2026
…ch (ray-project#63183)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
am-kinetica pushed a commit to kineticadb/ray that referenced this pull request May 14, 2026
…ch (ray-project#63183)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: anindyam1969 <amukherjee@kinetica.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ch (ray-project#63183)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants