[Serve] Gate ingress request router body forwarding behind escape hatch#63183
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a direct streaming mode for Ray LLM Serve, allowing the LLMServer deployment to act as its own ingress. Key changes include the addition of a late-bound ASGI application construction mechanism via serve_build_asgi_app, the implementation of a new LLMRouter for replica-level routing, and updates to HAProxy templates to support optional request body forwarding. A critical feedback point identifies that LLMServer must ensure its engine is started before attempting to build the ASGI app to avoid initialization errors.
| async def __serve_build_asgi_app__(self): | ||
| return await self.engine.build_asgi_app() |
There was a problem hiding this comment.
In the direct streaming path, LLMServer acts as the ingress deployment. Ray Serve will instantiate the replica and call __serve_build_asgi_app__ to initialize the ASGI application. However, self.engine and its internal state (such as _vllm_args in VLLMEngine) are only initialized in the start() method, which is not automatically called by Serve's lifecycle for ingress deployments.
If start() hasn't been called, self.engine will be None or missing required attributes, leading to an AttributeError when build_asgi_app() is invoked. You should ensure that self.start() is called and awaited before accessing the engine.
| async def __serve_build_asgi_app__(self): | |
| return await self.engine.build_asgi_app() | |
| async def __serve_build_asgi_app__(self): | |
| await self.start() | |
| return await self.engine.build_asgi_app() |
Wire up the controller to support "ingress bypass" mode where a router deployment (e.g. OpenAiIngress) serves /internal/route for Lua routing decisions, while non-router deployments (e.g. LLMServer) serve data plane traffic directly via their direct ingress ports. Key changes: - controller.py: _get_target_groups_for_app_with_router splits targets into router_targets (for Lua) and main targets (data plane) - application_state.py: track router deployment name per app - build_app.py, client.py, deploy_utils.py, deployment_info.py: plumb router flag through deployment args - default_impl.py: factory plumbing for router deployment - schema.py: router_targets field on TargetGroup Signed-off-by: Seiji Eicher <seiji@anyscale.com>
- Remove unused _summarize_target, _summarize_target_groups, _summarize_fallback_targets helpers - Remove unused get_fallback_targets_for_proxy_manager method - Revert unrelated RAY_SERVE_THROUGHPUT_OPTIMIZED env var plumbing - Fix stale docstring referencing OpenAiIngress Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
This reverts commit 9600be7. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Extend HAProxy configuration to support Lua-based custom request routing for the ingress bypass pattern. When a backend has router_targets configured, HAProxy calls /internal/route on a router replica to get the target host:port, then forwards the data plane request directly to that replica. Key changes: - haproxy.py: BackendConfig gains router_servers, custom_request_routing, max_server_conns, router_server fields; reload tracking; is_head readiness fix; summarize helpers for debugging - haproxy_templates.py: Lua route template for /internal/route calls Signed-off-by: Seiji Eicher <seiji@anyscale.com>
- Remove streaming-only restriction: route ALL POST requests through Lua when ingress bypass is enabled, not just streaming ones - Remove killswitch file check (/tmp/ray-serve-disable-ingress-bypass-routing) - Reduce Lua socket timeout from 5s to 500ms for fast fallback - Revert enable_hap_optimization splitting back to single flag - Remove reload tracking (_reload_count, _last_reload_mode) - Remove _hard_reload and _reload_with_config indirection - Remove get_reload_debug_summary Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
- Remove _summarize_target, _summarize_target_group, _summarize_backend_config (defined but never called) - Add TODO on single router_server selection noting the HA gap Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Serve core - Replace LATE_BOUND_ASGI_APP sentinel with no-arg serve.ingress(): a class supplies its own ASGI app via __serve_build_asgi_app__. Drops the placeholder FastAPI() and the dual _set_asgi_app() init. - Hoist _DIRECT_INGRESS_ASGI_REQ_META to a module-level constant so /direct ingress dispatch skips a per-request RequestMetadata construction and two generate_request_id() UUIDs. LLM engine - Add LLMEngine.build_asgi_app() to the engine protocol; VLLMEngine implements it. LLMServer.__serve_build_asgi_app__ is a one-line delegate to self.engine.build_asgi_app(). Drops the public vllm_args/engine_client properties that leaked vLLM internals. LLM router - Bind via bind(server=llm_deployment) so the handle is injected as an init kwarg; drops the magic-string serve.get_deployment_handle(). - Replace the awaited llm_config.remote() priming RPC with a synchronous DeploymentHandle._init() + cached _get_request_router(). Removes the 60s timeout, the dependency on a method being removed in ray-project#63065, and the engine-cold-start coupling. - check_health() now asserts the request router is non-None as a liveness signal. - Drop GET / (HAProxy never calls it; Serve uses check_health for readiness). Keep GET /health as a human-operator convenience. - Cache the RequestRouter on self at init; precompute (host, port, full_id_str) tuples in _ready_endpoints to skip per-request formatting; two-tier cache (id(curr_replicas) fast path then frozenset(keys) fallback) avoids the frozenset allocation when curr_replicas is unchanged. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
The `if self._user_callable_asgi_app is not None` short-circuit added in the previous commits matched every @serve.ingress deployment, not just direct streaming, and bypassed RequestMetadata setup (route, request_id, session_id, tracing context, backpressure). That broke: - test_replica_request_context.py::TestHTTPRoute::test_matching_fastapi_route (request context route returned empty string) - test_http_headers.py::TestUserProvidedRequestIDHeader::test_fastapi (request_id header not parsed) - test_http_headers.py::test_reuse_request_id The existing flow at line 2535+ already handles ASGI apps correctly via _call_http_entrypoint (UserMethodInfo.is_asgi_app), so direct streaming goes through the same path with no special-case code. Buildkite failures: ray-project/premerge#65996. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…e hatch Round-robin /internal/route ignores the request body, so we make body forwarding opt-in via RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY (off by default). Saves on every routed request: - HAProxy `wait-for-body` round-trip and per-connection buffer cost (~2 * tune.bufsize * maxconn). - Lua `txn.sf:req_body()` + body re-emit on the socket. - LLMRouter `await request.body()`. When the flag is on, the previous body-forwarding path is restored unchanged for body-aware policies (prefix-cache, etc.): - HAProxy frontend re-emits `wait-for-body` and `tune.bufsize`. - Lua reads the body and forwards it (with X-Body-Truncated when truncated). - LLMRouter reads the body and the truncation header. The end-to-end test that asserted the body got forwarded now flips the flag on via monkeypatch. Stacked on ray-project#63167. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
- router.py: drop the redundant else-branch (defaults already match).
- haproxy.py: `str(bool).lower()` instead of "true"/"false" ternary.
- haproxy_templates.py: flatten nested {% if %}{% if %} for tune.bufsize.
- lua.tmpl, test: drop WHAT-comment narration.
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
The expensive part of body forwarding lives in HAProxy + Lua (the `wait-for-body` round-trip, `tune.bufsize` connection cost, and body re-emit on the socket). On the LLMRouter side, `await request.body()` on a Content-Length: 0 request is essentially free (one async tick, returns b""), so gating it earns nothing. Reverts the LLM-side branch and keeps the body read unconditional, making this PR Serve-only. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
2ba9e2b to
0390ad4
Compare
Brings in ray-project#63215 (fail-loud) and resolves the conflict in ``ingress_request_router.lua.tmpl`` against this branch's FORWARD_BODY gate. Resolution: drop the ``empty_body`` sentinel that ray-project#63215 added. With the FORWARD_BODY gate, an empty body is no longer a distinct failure mode -- both modes call the router with ``body=""`` and let the router decide. The "router or fail" invariant still holds because the router *is* called; we just don't synthesize a sentinel locally. The 4 remaining failure-mode sentinels from ray-project#63215 (``router_unreachable`` / ``router_non_200`` / ``unparseable_replica_id`` / ``unknown_replica_id``) are unchanged. ``test_router_failure_fails_loud_with_reason`` updates fall out of the resolution: both empty and non-empty POSTs now surface as ``router_non_200`` against the broken-router stub. The empty-body case stays in the test to pin that empty bodies are routed through the router rather than silently bypassing it. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Pins that ``tune.bufsize``, ``wait-for-body``, and the Lua ``FORWARD_BODY`` constant render consistently with the env var. Off by default; on only when ``RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY=1``. Catches the failure mode where any one of the three signals drifts out of sync with the others. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Three pieces of dead code that drifted into this branch but aren't referenced by any FORWARD_BODY-related change and aren't reachable from runtime code paths: - ``router.py``: ``DIRECT_ROUTER_*`` env-var name constants and the ``ingress_request_router_app`` FastAPI placeholder. Defined but unused; staged for unrelated work. - ``haproxy.py``: free-function ``_get_safe_name`` and ``target_to_server`` duplicates of ``HAProxyManager._target_to_server`` / ``HAProxyManager.get_safe_name``. Never called. - ``test_router.py``: a second copy of ``_DirectRouterReplicaId`` / ``_FakeRequest`` / ``_DirectRouterReplica`` / ``_new_direct_router`` appended above an identical existing block (Python uses the second definition, making the first dead). Likely a merge artifact. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Frame the default as a TTFR concern rather than a policy-specific one: buffering and re-emitting large request bodies adds noticeable time-to- first-response. Call out concrete policy categories so a reader knows when to flip the flag without grepping the routers: - round-robin / power-of-two: body-independent, leave off. - session-aware: keys on ``x-session-id`` header, leave off. - prefix-aware / prefix-cache: needs the body, flip on. Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Three failures in test_haproxy_api after merging master:
1. ``test_write_ingress_request_router_lua_no_routers``: assertion
``result == (None, False)`` was stale. ``_write_ingress_request_router_lua``
returns just the Lua path (or ``None``), not a tuple. Fix to
``assert result is None``.
2. ``test_ingress_request_router_does_not_leak_into_other_backends``: had
embedded assertions that ``wait-for-body`` and ``tune.bufsize 65536``
appear in the rendered config. Both are wrong on master:
``wait-for-body`` is gated behind ``FORWARD_BODY=true`` (off by
default), and the bufsize default is 262144 not 65536. Drop these
assertions; their FORWARD_BODY-aware replacement is
``test_ingress_request_router_forward_body_gate_renders``.
3. ``test_ingress_request_router_end_to_end``: assertion
``router_captured["bodies"] == ['{"prompt": "hello"}']`` failed because
``_create_router_server``'s readiness probe POSTs ``{}`` to the same
``/internal/route`` handler that records bodies, so the first capture
was always the probe body. Clear ``captured["bodies"]`` after the
probe completes so callers see only client traffic.
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 380647f. Configure here.
| # decision, e.g. prefix-aware / prefix-cache routing. | ||
| RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY = get_env_bool( | ||
| "RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY", False | ||
| ) |
There was a problem hiding this comment.
Wrong default type passed to get_env_bool
Low Severity
get_env_bool documents and expects a string "0" or "1" as the default parameter, but False (a Python boolean) is passed here. Every other call site in the codebase passes "0" or "1". It happens to work because str(False) produces "False" which doesn't equal "1", but it violates the function's documented contract and is inconsistent with the established pattern.
Reviewed by Cursor Bugbot for commit 380647f. Configure here.
jeffreywang-anyscale
left a comment
There was a problem hiding this comment.
LGTM, leaving nits. Is this "escape hatch" approach a long-term solution we're going for?
| # Use file locking to prevent concurrent writes from multiple processes | ||
| # This is important in test environments where multiple nodes may run | ||
| # on the same machine | ||
| import fcntl |
There was a problem hiding this comment.
nit: move imports to top
| return lua_path | ||
|
|
||
| def _generate_config_file_internal(self) -> None: | ||
| def _generate_config_file_internal(self) -> bool: |
There was a problem hiding this comment.
we aren't returning bool
| # header (forwarded with the request line) rather than the body. | ||
| # | ||
| # Flip this to true if the configured request router needs the body for its | ||
| # decision, e.g. prefix-aware / prefix-cache routing. |
There was a problem hiding this comment.
does this imply that we might underperform other LLM routers under prefix-cache affinity routing policy?
There was a problem hiding this comment.
This is a fundamental system trade off and not so much of a implementation trade off, but we'll see. Your work of extending to other request routers need to be done for us to be able to answer this question
| # decision, e.g. prefix-aware / prefix-cache routing. | ||
| RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY = get_env_bool( | ||
| "RAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY", False | ||
| ) |
…ch (ray-project#63183) Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
For now, we couldn't think of a better approach to turn it on / off, we have to do a study for when we do other request routing policies that need the request body |
…ch (ray-project#63183) Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
…ch (ray-project#63183) Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: anindyam1969 <amukherjee@kinetica.com>
…ch (ray-project#63183) Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>


Stacked on #63167.
Round-robin
/internal/routeignores the request body, so make body forwarding opt-in viaRAY_SERVE_INGRESS_REQUEST_ROUTER_FORWARD_BODY(off by default). When off, HAProxy skipswait-for-body/tune.bufsizeand the Lua POSTsContent-Length: 0. When on, prior body-forwarding behavior is restored unchanged for body-aware policies (prefix-cache, etc.).LLMRouter is unchanged:
await request.body()on a Content-Length: 0 request is essentially free, so no gate is needed there._pick_replica'srequest_body/body_truncatedparameters stay so flipping the flag on doesn't touch call sites.Test plan
test_ingress_request_router_end_to_endflips the flag on viamonkeypatch; body-forwarded path still covered.