[V1] add generate optional in health api #24491

lengrongfu · 2025-09-09T07:52:04Z

Purpose

FIX #24207

Add an optional generate parameter to the /health API (/health?generate=true). When it is set, the health check runs a minimal generation with max_tokens set to 2.

Test Plan

If the EngineCore process crashes, this API responds with HTTP status code 500.

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

gemini-code-assist

Code Review

This pull request adds an optional generate parameter to the /health endpoint for a more thorough health check. The implementation is a good start, but the error handling can be made more robust. I've suggested a change to catch a broader range of exceptions and to handle all parts of the health check consistently within a single try...except block. This will prevent unhandled exceptions and ensure that any failure during the health check correctly returns a 500 status code.

vllm/entrypoints/openai/api_server.py

tomasruizt · 2025-09-09T09:28:20Z

2 ideas:

The abstract EngineClient could define a method minimal_generation() that encapsulates this logic. It already defines other logic like beam_search(). It could perhaps even return the response string. This makes the minimal generation usable even without the REST API. The REST API delegates to this method when generate=True.

prompt = "Hi"
sampling_params = SamplingParams(temperature=0, max_tokens=2)
request_id = random_uuid()
async for _ in client.generate(prompt, sampling_params,
                               request_id):
    pass

Currently, the exception is translated into an HTTP 500 response, but the underlying exception is not logged. You could log the exception in the try/except block before returning (https://stackoverflow.com/a/5191885/5730291):

except Exception as e:
    logger.exception(e)
    return Response(status_code=500)
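
The two ideas above can be combined in a small, self-contained sketch. It does not use vLLM: FakeEngineClient, its generate() signature, and the health() helper are hypothetical stand-ins chosen for illustration, and the integer return value merely mimics an HTTP status code.

```python
import asyncio
import logging
import uuid

logger = logging.getLogger("health_sketch")


class FakeEngineClient:
    """Hypothetical stand-in for an engine client; not the vLLM API."""

    def __init__(self, fail: bool = False):
        self.fail = fail

    async def generate(self, prompt: str, max_tokens: int, request_id: str):
        # Yield one fake token per step, or fail like a dead engine would.
        for i in range(max_tokens):
            if self.fail:
                raise RuntimeError("engine core is dead")
            await asyncio.sleep(0)
            yield f"tok{i}"

    async def minimal_generation(self) -> str:
        # Encapsulates the tiny request so callers other than the REST
        # layer can reuse it, and returns the generated text.
        request_id = uuid.uuid4().hex
        tokens = [t async for t in self.generate("Hi", 2, request_id)]
        return " ".join(tokens)


async def health(client: FakeEngineClient, generate: bool = False) -> int:
    """Return an HTTP-like status code for the health probe."""
    try:
        if generate:
            await client.minimal_generation()
        return 200
    except Exception:
        # Log the traceback before translating the failure into a 500.
        logger.exception("health check generation failed")
        return 500
```

Calling asyncio.run(health(FakeEngineClient(fail=True), generate=True)) logs the traceback and yields 500, while a healthy fake client yields 200.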

lengrongfu · 2025-09-10T02:48:37Z

@tomasruizt Hi, I have updated the PR according to your suggestions.

simon-mo · 2025-09-10T04:56:41Z

I have some concerns about this design. The request will be enqueued, but if the queue is long (while vLLM is still healthy), the request will time out. Priority scheduling is not enabled by default. This creates a false positive signal for health check generation.

Instead, we should only run the health check generation when the engine is idle and no other generation is in progress, so that we are not interrupting the current batch.

Also cc @njhill @robertgshaw2-redhat on the AsyncLLM interface addition

tomasruizt · 2025-09-10T05:43:12Z

@simon-mo very good point!

I'd nevertheless like to argue that a timeout on generate is highly informative for the client, since it means, as you said, that the server cannot serve generation requests within the requested timeout constraint.

It's a different outcome from an HTTP 500 error, which signals that the server is completely dead.

This is a nuanced difference, but the users of this endpoint (Kubernetes users) are also likely to understand it from the endpoint docs.

The difference in outcomes is also clear from the fact that, in the timeout outcome, the EngineClient does not throw any exception.
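
The distinction drawn in this comment can be illustrated with a minimal sketch. It assumes the timeout is enforced with asyncio.wait_for around the generation call; the probe() helper, the three toy generation coroutines, and the "ok"/"timeout"/"dead" labels are hypothetical and not part of the PR.

```python
import asyncio


async def fast_generation() -> None:
    # Engine responds immediately.
    await asyncio.sleep(0)


async def slow_generation() -> None:
    # Engine is healthy but the request waits behind a long queue.
    await asyncio.sleep(1.0)


async def crashing_generation() -> None:
    # Engine raises, as it would if the core process were dead.
    raise RuntimeError("engine core is dead")


async def probe(generation, timeout_s: float = 0.05) -> str:
    """Classify a health-check generation as ok, timeout, or dead."""
    try:
        await asyncio.wait_for(generation(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # No exception came from the engine; it was just too slow.
        return "timeout"
    except Exception:
        # The engine itself failed.
        return "dead"
    return "ok"
```

Under these assumptions, a slow but healthy engine and a crashed engine produce different outcomes, which a caller could map to different responses.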

lengrongfu · 2025-09-10T06:16:40Z

Could we check the scheduler's running queue length and, if it is greater than 0, skip the generate call?

lengrongfu · 2025-09-13T02:44:14Z

I think the /health API should only verify that the server is up and able to work; a generate timeout should instead be surfaced through a metric. So when the scheduler's running queue length or waiting queue length is greater than 0, we should consider the server healthy.
@simon-mo @tomasruizt
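
The queue-length gating proposed here might look roughly like the sketch below. QueueStats, health_status(), and the run_generation callback are hypothetical names for illustration; how the real scheduler exposes its queue lengths is not shown in this thread.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QueueStats:
    """Hypothetical snapshot of the scheduler queues."""
    running: int
    waiting: int


def health_status(stats: QueueStats, run_generation: Callable[[], None]) -> int:
    """Return an HTTP-like status code, generating only when idle."""
    if stats.running > 0 or stats.waiting > 0:
        # The engine is busy with real requests: treat it as healthy and
        # skip the probe so the current batch is not disturbed.
        return 200
    try:
        run_generation()
    except Exception:
        return 500
    return 200
```

Note that in this sketch a server with queued work is reported healthy without being exercised, so slow generation would have to be detected through a separate metric, as the comment suggests.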

cadedaniel · 2025-09-22T23:39:11Z

It is often helpful to distinguish between readiness and liveness for healthchecks. Ready being ready to handle more load, and live being the server is live and either recovering or starting up.

Separately, can we add a test for this? Otherwise, it is difficult to rely on this feature as it can break at any time.
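
The readiness versus liveness distinction described in this comment can be sketched as two separate checks. ServerState, its fields, the max_waiting threshold, and the use of 503 for a failing probe are all assumptions made for illustration, not behavior defined by this PR.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    """Hypothetical view of the server used by both probes."""
    process_alive: bool
    model_loaded: bool
    waiting: int
    max_waiting: int


def liveness(state: ServerState) -> int:
    # Live: the process is up, even if it is still starting or recovering.
    return 200 if state.process_alive else 503


def readiness(state: ServerState) -> int:
    # Ready: live, fully started, and able to take on more load.
    ready = (
        state.process_alive
        and state.model_loaded
        and state.waiting < state.max_waiting
    )
    return 200 if ready else 503
```

A server that is still loading its model would pass liveness() but fail readiness(), so an orchestrator could withhold traffic without restarting it.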

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>

lengrongfu requested a review from aarnphm as a code owner September 9, 2025 07:52

mergify bot added the frontend label Sep 9, 2025

gemini-code-assist bot reviewed Sep 9, 2025

lengrongfu force-pushed the feat/health-generate branch from 7648745 to bbe9dae Compare September 9, 2025 09:02

lengrongfu requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners September 9, 2025 16:18

mergify bot added the v1 label Sep 9, 2025

lengrongfu mentioned this pull request Sep 16, 2025

feat(api): Return 503 on /health when engine is dead #24897

Merged

lengrongfu force-pushed the feat/health-generate branch from e581133 to dbb7c06 Compare September 19, 2025 10:21

lengrongfu requested a review from chaunceyjiang as a code owner September 19, 2025 10:21

lengrongfu force-pushed the feat/health-generate branch from dbb7c06 to 7559956 Compare September 19, 2025 10:51

lengrongfu force-pushed the feat/health-generate branch from 7559956 to 353f3cb Compare September 28, 2025 01:54

lengrongfu requested review from DarkLight1337, simon-mo and NickLucche as code owners September 28, 2025 01:54

lengrongfu added 4 commits September 27, 2025 19:05

[V1] add generate optional in health api

0e20baf

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>

add minimal_generation to async_llm

c627686

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>

add check running queue length and waiting queue length

e3e2159

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>

add test_health to test health api

26258af

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>

lengrongfu force-pushed the feat/health-generate branch from 353f3cb to 26258af Compare September 28, 2025 02:12
