
Conversation

@dongbo910220 (Contributor) commented Sep 15, 2025

Purpose

This pull request resolves #19881 by improving the HTTP semantics of the /health endpoint when the V1 engine dies unexpectedly.

Currently, when the EngineCore process is terminated (e.g., via kill -9), vLLM is able to reliably detect this condition thanks to the robust monitoring mechanism introduced in #21728. This detection correctly raises an EngineDeadError, which is then caught by a generic, high-level exception handler in launcher.py that returns a broad HTTP 500 Internal Server Error.

This PR introduces a more specific try...except block for EngineDeadError directly within the /health route handler. This change achieves two key objectives:

  1. Corrects the HTTP Semantics: It changes the response to a more appropriate HTTP 503 Service Unavailable. This accurately signals that the service is temporarily unable to handle requests due to an unavailable dependency (the engine), which is distinct from a 500 (an unexpected bug in the application code).
  2. Improves Production Observability: An explicit 503 response allows automated systems like Kubernetes and load balancers to make better decisions, such as gracefully routing traffic away from the unhealthy instance instead of treating it as a crashing application.
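
For illustration, here is a minimal sketch of the kind of route-level handling described above. It is not the actual api_server.py diff: the EngineDeadError class and engine_client helper below are self-contained stand-ins for the real vLLM names.

    from http import HTTPStatus

    from fastapi import FastAPI, Request, Response

    app = FastAPI()

    class EngineDeadError(Exception):
        """Stand-in for vLLM's EngineDeadError so the sketch runs on its own."""

    def engine_client(request: Request):
        # Stand-in for the api_server.py helper that returns the engine client
        # stored on the application state.
        return request.app.state.engine_client

    @app.get("/health")
    async def health(raw_request: Request) -> Response:
        """Return 200 while the engine answers, 503 once it is dead."""
        try:
            await engine_client(raw_request).check_health()
            return Response(status_code=HTTPStatus.OK.value)
        except EngineDeadError:
            # The engine process has died: 503 tells probes and load balancers
            # that the service is temporarily unavailable, not buggy.
            return Response(status_code=HTTPStatus.SERVICE_UNAVAILABLE.value)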

Test Plan

The functionality can be verified with the following end-to-end test using the multiprocessing backend:

  1. Start the vLLM server on this branch:

    python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf --distributed-executor-backend mp
  2. Confirm the service is healthy:

    curl -v http://localhost:8000/health
    • Expected Output: The response should include < HTTP/1.1 200 OK.
  3. Find and terminate the EngineCore process:

    • In a new terminal, find the main server's PID: pgrep -f "vllm.entrypoints.openai.api_server"
    • Find the EngineCore child process PID (replace <SERVER_PID> with the result from the previous step): pgrep -P <SERVER_PID>
    • Forcefully kill the EngineCore process (replace <ENGINE_PID>):
      kill -9 <ENGINE_PID>
  4. Verify the new health check behavior:

    • Wait for the monitor thread to detect the failure.
    • Check the health endpoint again:
      curl -v http://localhost:8000/health
    • Expected Output: The response should now include < HTTP/1.1 503 Service Unavailable.
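
For convenience, steps 2-4 can also be scripted. The following is a rough Python sketch, not part of the PR: it assumes the server from step 1 is already running on port 8000, that pgrep is available, and that the first child reported for the server process is the EngineCore process.

    import subprocess
    import time
    import urllib.error
    import urllib.request

    def health_status(url: str = "http://localhost:8000/health") -> int:
        """Return the HTTP status code of the /health endpoint."""
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.status
        except urllib.error.HTTPError as exc:
            return exc.code

    # Step 2: the service should be healthy before we touch anything.
    assert health_status() == 200

    # Step 3: find the api_server PID, then its EngineCore child, and kill -9 it.
    server_pid = subprocess.check_output(
        ["pgrep", "-f", "vllm.entrypoints.openai.api_server"], text=True).split()[0]
    engine_pid = subprocess.check_output(["pgrep", "-P", server_pid], text=True).split()[0]
    subprocess.run(["kill", "-9", engine_pid], check=True)

    # Step 4: give the monitor thread a moment to notice, then expect a 503.
    time.sleep(5)
    assert health_status() == 503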

Test Result

Before (on main branch):
When the EngineCore process is killed, a curl command to the /health endpoint returns HTTP 500 Internal Server Error. This is because the EngineDeadError is caught by a generic exception handler called runtime_exception_handler in launcher.py. While the server does not crash, the status code does not accurately reflect the "service unavailable" nature of the failure.

After (on this improve-health-check branch):
When the EngineCore process is killed, a curl command to the /health endpoint correctly returns HTTP 503 Service Unavailable. The server remains responsive and provides the correct semantic signal for monitoring systems before gracefully shutting down.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request correctly identifies the need to return an HTTP 503 status when the engine is dead, improving the semantics of the /health endpoint. My review focuses on making this behavior consistent across the entire API by proposing a more robust implementation using a global exception handler, which will also simplify the code and enhance maintainability.

Comment on lines +451 to +358

gemini-code-assist bot commented (severity: high):

While this try...except block correctly handles the EngineDeadError for the /health endpoint, this approach has a significant drawback: other endpoints that rely on the engine (e.g., /v1/chat/completions) will still raise EngineDeadError and fall back to a generic 500 error. This creates inconsistent API behavior for the same underlying issue.

A more robust and maintainable solution is to use a global FastAPI exception handler. This ensures all endpoints consistently return a 503 Service Unavailable response when the engine is dead, and it keeps the route handler logic clean.

You can achieve this by adding the following exception handler to the build_app function (e.g., after the RequestValidationError handler around line 1614):

    @app.exception_handler(EngineDeadError)
    async def engine_dead_exception_handler(request: Request, exc: EngineDeadError):
        # The /health and /ping endpoints expect a plain Response
        if request.url.path in ("/health", "/ping"):
            return Response(status_code=HTTPStatus.SERVICE_UNAVAILABLE.value)

        # Other endpoints expect an OpenAI-compatible error JSON
        err = ErrorResponse(
            error=ErrorInfo(message="The service is currently unavailable, please try again later. Reason: Engine is dead.",
                            type=HTTPStatus.SERVICE_UNAVAILABLE.phrase,
                            code=HTTPStatus.SERVICE_UNAVAILABLE.value))
        return JSONResponse(err.model_dump(),
                            status_code=HTTPStatus.SERVICE_UNAVAILABLE.value)

With this handler in place, the body of the health function can be reverted to its original, simpler implementation.

    await engine_client(raw_request).check_health()
    return Response(status_code=200)

@dongbo910220 (Contributor, Author) replied:

Thank you for the suggestion to use a global exception handler; this is a great point for ensuring API consistency.

I initially explored this approach. However, my investigation revealed that a pre-existing, higher-level try...except block in launcher.py wraps the entire application.

Due to the execution order, this generic handler would intercept the EngineDeadError before a newly registered, specific handler in api_server.py could act, effectively overriding it and still returning a 500.

To ensure the correct 503 is returned for the /health endpoint without altering the existing high-level exception logic, I chose to handle the EngineDeadError specifically within the route itself. This is the most direct and lowest-risk way to achieve the desired behavior and avoid issues with handler precedence.

I'm happy to discuss further if you think modifying the root handler in launcher.py is a better path forward!

@dongbo910220 (Contributor, Author) commented:

cc @robertgshaw2-redhat #19881

@lengrongfu (Contributor) commented:

My PR #24491 solves a similar problem; maybe we should discuss whether to keep both, or just one?

@dongbo910220 (Contributor, Author) commented:

Hi @lengrongfu, thank you for pointing this out! I've reviewed your PR #24491 and the related discussion in issue #24207.

This is a great observation. It seems we are working on improving the /health endpoint from two different, but complementary, angles.

I believe our changes can coexist perfectly. My PR ensures we report a known failure with the correct semantics, while your PR works on the more complex challenge of how to detect a silent failure.

I'm happy to collaborate to ensure our changes merge smoothly together. My current implementation should not conflict with your proposed minimal_generation() logic.

mergify bot commented Sep 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @dongbo910220.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Sep 17, 2025
dongbo910220 and others added 3 commits September 17, 2025 22:24
Signed-off-by: dongbo910220 <1275604947@qq.com>
Signed-off-by: dongbo910220 <1275604947@qq.com>
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dongbo910220 <1275604947@qq.com>
This makes the health check more precise by only returning 503 for
engine death scenarios rather than all exceptions.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dongbo910220 <1275604947@qq.com>
@dongbo910220 (Contributor, Author) commented:

Hi @DarkLight1337, would you have a moment to review this PR when you get a chance?

It's a small fix for the /health endpoint's status code (resolves #19881). Thanks!

@DarkLight1337 (Member) left a comment:

Since @robertgshaw2-redhat seems busy, I'll just stamp this as it looks reasonable.

DarkLight1337 enabled auto-merge (squash) September 18, 2025 12:55
github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 18, 2025
DarkLight1337 merged commit 67244c8 into vllm-project:main Sep 18, 2025
54 checks passed
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
@cadedaniel (Collaborator) commented:

Thanks for the PR. Could we add a test for this PR? Otherwise, this behavior cannot be relied upon by downstream users as it can break in any commit.

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
@dongbo910220 (Contributor, Author) commented:

Hi @cadedaniel, apologies for the late reply; I was on vacation.

Thank you for the suggestion to add a test for this behavior. I've just created a new pull request #26074 to add the corresponding test case. It mocks the EngineDeadError and verifies that the /health endpoint correctly returns a 503.

Would appreciate a review on the new PR when you have a chance. Thanks!
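
For reference, a self-contained sketch of this kind of test is shown below. It is purely illustrative and not the actual code in #26074: both the engine client and EngineDeadError are stubbed locally instead of being imported from vLLM.

    from http import HTTPStatus

    from fastapi import FastAPI, Request, Response
    from fastapi.testclient import TestClient

    class EngineDeadError(Exception):
        """Stand-in for vLLM's EngineDeadError so the test is self-contained."""

    class DeadEngineClient:
        async def check_health(self) -> None:
            raise EngineDeadError("EngineCore process is dead")

    def build_app() -> FastAPI:
        app = FastAPI()
        app.state.engine_client = DeadEngineClient()

        @app.get("/health")
        async def health(raw_request: Request) -> Response:
            try:
                await raw_request.app.state.engine_client.check_health()
                return Response(status_code=HTTPStatus.OK.value)
            except EngineDeadError:
                return Response(status_code=HTTPStatus.SERVICE_UNAVAILABLE.value)

        return app

    def test_health_returns_503_when_engine_is_dead():
        client = TestClient(build_app())
        assert client.get("/health").status_code == HTTPStatus.SERVICE_UNAVAILABLE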

Labels: frontend, ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: None yet

Successfully merging this pull request may close these issues:
[Feature]: Implement check_health for V1

4 participants