[serve] Raise error when multiplex on ingress deployment used with direct ingress #64045
akyang-anyscale wants to merge 13 commits
Code Review
This pull request introduces validation to reject model multiplexing on the ingress deployment when direct ingress is enabled. It adds static detection of @serve.multiplexed decorators and implements the validation check at application build time, accompanied by unit and integration tests. The review feedback points out that the validation check should be updated to also verify if HAProxy is enabled (RAY_SERVE_ENABLE_HA_PROXY), ensuring consistency with the raised error message and preventing unsupported deployments under HAProxy.
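The check described in this summary can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the `multiplexed` decorator is a stand-in for `@serve.multiplexed`, the marker attribute and both helper functions are hypothetical, and the direct-ingress flag name `RAY_SERVE_ENABLE_DIRECT_INGRESS` is an assumption (only `RAY_SERVE_ENABLE_HA_PROXY` is named in the review).

```python
import inspect
import os

# Hypothetical marker attribute; the real decorator may record this differently.
_MULTIPLEXED_MARKER = "_is_multiplexed"


def multiplexed(func):
    """Stand-in for @serve.multiplexed: tag the function so it can be found statically."""
    setattr(func, _MULTIPLEXED_MARKER, True)
    return func


def callable_uses_multiplexing(target) -> bool:
    """Return True if a function, or any method of a class, carries the marker."""
    if getattr(target, _MULTIPLEXED_MARKER, False):
        return True
    if inspect.isclass(target):
        return any(
            getattr(member, _MULTIPLEXED_MARKER, False)
            for _, member in inspect.getmembers(target, callable)
        )
    return False


def validate_ingress_multiplexing(ingress_target, env=None) -> None:
    """Raise if the ingress uses multiplexing while direct ingress or HAProxy is on."""
    env = os.environ if env is None else env
    # Assumed flag name for direct ingress; the HAProxy flag is named in the review.
    direct_ingress = env.get("RAY_SERVE_ENABLE_DIRECT_INGRESS", "0") == "1"
    ha_proxy = env.get("RAY_SERVE_ENABLE_HA_PROXY", "0") == "1"
    if (direct_ingress or ha_proxy) and callable_uses_multiplexing(ingress_target):
        raise ValueError(
            "Model multiplexing on the ingress deployment is not supported "
            "when direct ingress or HAProxy is enabled."
        )
```

Checking both flags in one place keeps the condition consistent with the error message, which is the gap the review feedback points out.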
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 0c1a3a9.
built_app.validate_single_fastapi_ingress()
# This task runs on the cluster, so its view of the direct-ingress flag
# mirrors the replicas' (they inherit this task's runtime_env).
built_app.validate_multiplexing_with_direct_ingress(
What if we added a uses_multiplexing bit to the deploy args proto? Then the controller could validate in deploy_applications just once instead of duplicating the check.
Will we be able to catch dynamically initialized multiplexing? See https://github.com/ray-project/ray/blob/master/python/ray/llm/_internal/serve/core/server/llm_server.py#L242-L244
@eicherseiji I'm pretty sure deploy_applications isn't covered in the declarative path, but I could add the check when creating the deployment info, which happens in both cases.
@abrarsheikh this method would not catch that. I think the only way to do that would be at replica initialization time, wdyt?
@akyang-anyscale Ah, DeploymentInfo makes sense then.
I would also support a check at replica initialization, for correctness in the dynamic multiplexing case.
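To illustrate why the dynamic case needs a check at replica initialization, here is a small sketch. Everything in it is hypothetical: `ReplicaInfo`, `make_multiplexed`, and `DynamicServer` are invented for the example and do not mirror Serve's internals. The point is only that when the decorator is applied inside `__init__`, there is nothing on the class for a build-time scan to find, so the guard has to run when the wrapper is created on the replica.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaInfo:
    """Hypothetical view of what a replica knows about itself at startup."""

    is_ingress: bool
    direct_ingress_enabled: bool


def make_multiplexed(replica: ReplicaInfo):
    """Build a stand-in multiplexing decorator that validates when it is applied."""

    def multiplexed(func):
        # Runs at wrap time, so it also covers decorators applied dynamically.
        if replica.is_ingress and replica.direct_ingress_enabled:
            raise RuntimeError(
                "Model multiplexing on the ingress deployment is not supported "
                "when direct ingress is enabled."
            )
        return func

    return multiplexed


class DynamicServer:
    """Applies multiplexing inside __init__, invisible to a static class scan."""

    def __init__(self, replica: ReplicaInfo):
        multiplexed = make_multiplexed(replica)

        def load_model(model_id: str) -> str:
            return f"model:{model_id}"

        self.load_model = multiplexed(load_model)
```

A non-ingress replica constructs normally, while an ingress replica with direct ingress enabled fails during construction instead of serving without proper multiplexing support.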
# Imported lazily to avoid a circular import at module load time
# (multiplex -> metrics -> context -> client -> application_state).
from ray.serve.multiplex import _callable_uses_multiplexing
Let's move _callable_uses_multiplexing into a utility file to break the circular dependency.

Serve does not currently support model multiplexing on the ingress deployment when direct ingress or HAProxy is enabled. Raise an error instead of silently serving the app without proper multiplexing support.
We do this 2 ways: