feat(api): Return 503 on /health when engine is dead #24897
Conversation
Code Review
This pull request correctly identifies the need to return an HTTP 503 status when the engine is dead, improving the semantics of the /health endpoint. My review focuses on making this behavior consistent across the entire API by proposing a more robust implementation using a global exception handler, which will also simplify the code and enhance maintainability.
While this `try...except` block correctly handles the `EngineDeadError` for the `/health` endpoint, this approach has a significant drawback: other endpoints that rely on the engine (e.g., `/v1/chat/completions`) will still raise `EngineDeadError` and fall back to a generic 500 error. This creates inconsistent API behavior for the same underlying issue.
A more robust and maintainable solution is to use a global FastAPI exception handler. This ensures all endpoints consistently return a `503 Service Unavailable` response when the engine is dead, and it keeps the route handler logic clean.
You can achieve this by adding the following exception handler to the `build_app` function (e.g., after the `RequestValidationError` handler around line 1614):
@app.exception_handler(EngineDeadError)
async def engine_dead_exception_handler(request: Request, exc: EngineDeadError):
    # The /health and /ping endpoints expect a plain Response
    if request.url.path in ("/health", "/ping"):
        return Response(status_code=HTTPStatus.SERVICE_UNAVAILABLE.value)
    # Other endpoints expect an OpenAI-compatible error JSON
    err = ErrorResponse(
        error=ErrorInfo(
            message="The service is currently unavailable, please try again later. Reason: Engine is dead.",
            type=HTTPStatus.SERVICE_UNAVAILABLE.phrase,
            code=HTTPStatus.SERVICE_UNAVAILABLE.value))
    return JSONResponse(err.model_dump(),
                        status_code=HTTPStatus.SERVICE_UNAVAILABLE.value)
With this handler in place, the body of the `health` function can be reverted to its original, simpler implementation:
await engine_client(raw_request).check_health()
return Response(status_code=200)
Thank you for the suggestion to use a global exception handler; this is a great point for ensuring API consistency.
I initially explored this approach. However, my investigation revealed that a pre-existing, higher-level `try...except` block in `launcher.py` wraps the entire application.
Due to the execution order, this generic handler would intercept the `EngineDeadError` before a newly registered, specific handler in `api_server.py` could act, effectively overriding it and still returning a 500.
To ensure the correct 503 is returned for the `/health` endpoint without altering the existing high-level exception logic, I chose to handle the `EngineDeadError` specifically within the route itself. This is the most direct and lowest-risk way to achieve the desired behavior and avoids issues with handler precedence.
I'm happy to discuss further if you think modifying the root handler in launcher.py is a better path forward!
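For reference, a minimal sketch of the route-level handling described here. The import path for `EngineDeadError`, the `engine_client` helper, and the `router` name are assumptions based on vLLM's `api_server.py` and may differ from the actual diff:

```python
# Sketch only: `router` and `engine_client` are assumed to exist in
# vllm/entrypoints/openai/api_server.py, and EngineDeadError is assumed to
# live in vllm/v1/engine/exceptions.py.
from http import HTTPStatus

from fastapi import Request, Response

from vllm.v1.engine.exceptions import EngineDeadError


@router.get("/health")
async def health(raw_request: Request) -> Response:
    """Health check that reports 503 instead of 500 when the engine is dead."""
    try:
        await engine_client(raw_request).check_health()
        return Response(status_code=200)
    except EngineDeadError:
        # The engine process is gone: signal "temporarily unavailable" rather
        # than an application bug so probes and load balancers can react.
        return Response(status_code=HTTPStatus.SERVICE_UNAVAILABLE.value)
```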
Force-pushed from c347ea8 to 55852db
This PR of mine solves a similar problem: #24491. Maybe we need to discuss whether to keep both or just one?
Hi @lengrongfu, thank you for pointing this out! I've reviewed your PR #24491 and the related discussion in issue #24207. This is a great observation. It seems we are working on improving the …
I believe our changes can coexist perfectly. My PR ensures we report a known failure with the correct semantic, while your PR works on the more complex challenge of how to detect a silent failure. I'm happy to collaborate to ensure our changes merge smoothly together. My current implementation should not conflict with your proposed …
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: dongbo910220 <1275604947@qq.com>
🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: dongbo910220 <1275604947@qq.com>
Force-pushed from 55852db to 0cf8c31
This makes the health check more precise by only returning 503 for engine death scenarios rather than all exceptions. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: dongbo910220 <1275604947@qq.com>
Force-pushed from 0cf8c31 to 400e102
Hi @DarkLight1337, would you have a moment to review this PR when you get a chance? It's a small fix for the …
Since @robertgshaw2-redhat seems busy, I'll just stamp this as it looks reasonable.
Thanks for the PR. Could we add a test for this behavior? Otherwise, it cannot be relied upon by downstream users, as it could break in any commit.
Hi @cadedaniel, apologies for the late reply; I was on vacation. Thank you for the suggestion to add a test for this behavior. I've just created a new pull request #26074 to add the corresponding test case. It mocks the … Would appreciate a review on the new PR when you have a chance. Thanks!
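For context, a hypothetical sketch of what such a test might look like; this is not the actual test in #26074, and the stand-in `EngineDeadError` class and app factory are illustrative assumptions:

```python
# Hypothetical test sketch: force check_health() to raise an engine-dead error
# and assert that /health responds with 503. Not the code from #26074.
from http import HTTPStatus
from unittest.mock import AsyncMock

from fastapi import FastAPI, Request, Response
from fastapi.testclient import TestClient


class EngineDeadError(Exception):
    """Stand-in for vLLM's EngineDeadError, used only for this sketch."""


def build_test_app() -> FastAPI:
    app = FastAPI()
    engine = AsyncMock()
    engine.check_health.side_effect = EngineDeadError("engine is dead")

    @app.get("/health")
    async def health(raw_request: Request) -> Response:
        try:
            await engine.check_health()
            return Response(status_code=200)
        except EngineDeadError:
            return Response(status_code=HTTPStatus.SERVICE_UNAVAILABLE.value)

    return app


def test_health_returns_503_when_engine_dead():
    client = TestClient(build_test_app())
    assert client.get("/health").status_code == HTTPStatus.SERVICE_UNAVAILABLE
```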
Purpose
This pull request resolves #19881 by improving the HTTP semantics of the `/health` endpoint when the V1 engine dies unexpectedly.

Currently, when the `EngineCore` process is terminated (e.g., via `kill -9`), vLLM is able to reliably detect this condition thanks to the robust monitoring mechanism introduced in #21728. This detection correctly raises an `EngineDeadError`, which is then caught by a generic, high-level exception handler in `launcher.py` that returns a broad `HTTP 500 Internal Server Error`.

This PR introduces a more specific `try...except` block for `EngineDeadError` directly within the `/health` route handler. This change achieves two key objectives:

- The `/health` endpoint now returns `HTTP 503 Service Unavailable`, which accurately signals that the service is temporarily unable to handle requests due to an unavailable dependency (the engine). This is distinct from a `500` (an unexpected bug in the application code).
- The `503` response allows automated systems like Kubernetes and load balancers to make better decisions, such as gracefully routing traffic away from the unhealthy instance instead of treating it as a crashing application (see the probe sketch below).
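To make the second point concrete, here is a small hypothetical monitoring-side sketch; the server address and the suggested reactions are illustrative assumptions, not part of this PR:

```python
# Hypothetical probe: interpret the /health status code the way a readiness
# probe or load balancer would. The URL below is an assumption.
import httpx

resp = httpx.get("http://localhost:8000/health", timeout=5.0)
if resp.status_code == 200:
    print("healthy: keep routing traffic to this instance")
elif resp.status_code == 503:
    print("engine unavailable: drain traffic and restart or replace the instance")
else:
    print(f"unexpected status {resp.status_code}: treat as an application error")
```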
Test Plan

The functionality can be verified with the following end-to-end test using the `multiprocessing` backend:

1. Start the vLLM server on this branch.
2. Confirm the service is healthy: a `curl -v` request to `/health` should return `< HTTP/1.1 200 OK`.
3. Find and terminate the `EngineCore` process:
   - Find the API server PID: `pgrep -f "vllm.entrypoints.openai.api_server"`
   - Find the `EngineCore` child process PID (replace `<SERVER_PID>` with the result from the previous step): `pgrep -P <SERVER_PID>`
   - Kill the `EngineCore` process (replace `<ENGINE_PID>`), e.g. `kill -9 <ENGINE_PID>`
4. Verify the new health check behavior: the same `curl -v` request to `/health` should now return `< HTTP/1.1 503 Service Unavailable`.

Test Result
**Before (on `main` branch):**

When the `EngineCore` process is killed, a `curl` command to the `/health` endpoint returns `HTTP 500 Internal Server Error`. This is because the `EngineDeadError` is caught by a generic exception handler, `runtime_exception_handler`, in `launcher.py`. While the server does not crash, the status code does not accurately reflect the "service unavailable" nature of the failure.

**After (on this `improve-health-check` branch):**

When the `EngineCore` process is killed, a `curl` command to the `/health` endpoint correctly returns `HTTP 503 Service Unavailable`. The server remains responsive and provides the correct semantic signal for monitoring systems before gracefully shutting down.