Contributor

@lengrongfu lengrongfu commented Sep 9, 2025

Purpose

FIX #24207

Add an optional generate parameter to the /health API (/health?generate=true). When it is set, the endpoint runs a minimal generation with max_tokens set to 2.

Test Plan

If the EngineCore process crashes, this API responds with an HTTP 500 status code.
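A minimal sketch of the behaviour described above, using only the standard library. `FakeEngineClient`, `EngineCrashed`, and `health` are hypothetical stand-ins, not vLLM's real classes or the PR's actual code; the stand-in models a failure that only surfaces once generation is attempted.

```python
import asyncio


class EngineCrashed(Exception):
    """Hypothetical error raised when the engine process is gone."""


class FakeEngineClient:
    """Hypothetical stand-in for the engine client (not vLLM's class)."""

    def __init__(self, generation_broken: bool = False) -> None:
        self.generation_broken = generation_broken

    async def check_health(self) -> None:
        # Lightweight check that never runs the model.
        return None

    async def generate(self, prompt: str, max_tokens: int):
        # Async generator that yields up to max_tokens dummy tokens.
        if self.generation_broken:
            raise EngineCrashed("engine core is not running")
        for i in range(max_tokens):
            await asyncio.sleep(0)
            yield f"tok{i}"


async def health(client: FakeEngineClient, generate: bool = False) -> int:
    """Return the HTTP status code the endpoint would send."""
    try:
        await client.check_health()
        if generate:
            # Drive a tiny generation end to end; the output is discarded.
            async for _ in client.generate("Hi", max_tokens=2):
                pass
    except Exception:
        return 500
    return 200
```

With this sketch, a plain health check on a client whose generation is broken still reports 200, while the same check with generate=True reports 500.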


Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@lengrongfu lengrongfu requested a review from aarnphm as a code owner September 9, 2025 07:52
@mergify mergify bot added the frontend label Sep 9, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds an optional generate parameter to the /health endpoint for a more thorough health check. The implementation is a good start, but the error handling can be made more robust. I've suggested a change to catch a broader range of exceptions and to handle all parts of the health check consistently within a single try...except block. This will prevent unhandled exceptions and ensure that any failure during the health check correctly returns a 500 status code.

@tomasruizt
Contributor

2 ideas:

  1. The abstract EngineClient could define a method minimal_generation() that encapsulates this logic, much as it already defines beam_search(). It could even return the response string. This would make the minimal generation usable without the REST API; the REST API would delegate to this method when generate=True.
prompt = "Hi"
sampling_params = SamplingParams(temperature=0, max_tokens=2)
request_id = random_uuid()
async for _ in client.generate(prompt, sampling_params,
                               request_id):
    pass
  2. Currently, the exception is translated into an HTTP 500 response, but the underlying exception is not logged. You could log it in the except block before returning (https://stackoverflow.com/a/5191885/5730291):
except Exception as e:
    logger.exception(e)
    return Response(status_code=500)
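Both ideas can be combined in one small sketch. `EngineClientSketch`, `EchoClient`, `BrokenClient`, and `health_generate` are hypothetical names for illustration and are much simpler than vLLM's real EngineClient; the function returns a bare status code instead of a framework response object.

```python
import asyncio
import logging
from abc import ABC, abstractmethod
from typing import AsyncIterator

logger = logging.getLogger("health_sketch")


class EngineClientSketch(ABC):
    """Hypothetical, simplified stand-in for the abstract EngineClient."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int) -> AsyncIterator[str]:
        ...

    async def minimal_generation(self) -> str:
        """Run a tiny generation and return the generated text."""
        pieces = []
        async for piece in self.generate("Hi", max_tokens=2):
            pieces.append(piece)
        return "".join(pieces)


class EchoClient(EngineClientSketch):
    async def generate(self, prompt: str, max_tokens: int):
        for _ in range(max_tokens):
            yield "a"


class BrokenClient(EngineClientSketch):
    async def generate(self, prompt: str, max_tokens: int):
        raise RuntimeError("engine core died")
        yield  # unreachable; keeps this function an async generator


async def health_generate(client: EngineClientSketch) -> int:
    try:
        await client.minimal_generation()
    except Exception:
        # logger.exception records the active traceback on its own,
        # so a short message is enough here.
        logger.exception("health check generation failed")
        return 500
    return 200
```

Because `minimal_generation()` lives on the client, it can be called directly (for example from an offline script) as well as from the health route.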

@lengrongfu
Contributor Author

@tomasruizt Hi, I have updated the PR according to your suggestions.

@simon-mo
Collaborator

I have some concerns about this design. The health-check request will be enqueued like any other request, so if the queue is long (but vLLM is still healthy), the request will time out. Priority scheduling is not enabled by default. This creates a false-positive failure signal for the health check generation.

Instead, we should only run the health-check generation when the engine is idle (no generation in progress), so that we do not interrupt the current batch.

Also cc @njhill @robertgshaw2-redhat on the AsyncLLM interface addition

@tomasruizt
Contributor

tomasruizt commented Sep 10, 2025

@simon-mo very good point!

I'd nevertheless like to argue that a timeout on generate is highly informative for the client, since it means, as you said, that the server cannot serve generation requests within the requested timeout constraint.

It's a different outcome from an HTTP 500 error, which signals that the server is completely dead.

This is a nuanced difference, but the users of this endpoint (Kubernetes users) are also likely to understand it from the endpoint docs.

The difference in outcomes is also clear from the fact that, in the timeout case, the EngineClient does not throw any exception.
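To make the two outcomes concrete, here is a rough sketch. The comment above describes a timeout observed by the caller; this sketch instead enforces the timeout on the server side with `asyncio.wait_for` so both outcomes are visible in one function. `probe_status` is a hypothetical helper, and using 503 for the timeout case is an assumption, not something the PR specifies.

```python
import asyncio
from typing import Awaitable, Callable


async def probe_status(probe: Callable[[], Awaitable[None]],
                       timeout_s: float) -> int:
    """Map the outcome of a generation probe to a status code."""
    try:
        await asyncio.wait_for(probe(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The probe did not finish in time: the server may be alive but
        # too busy to answer within the deadline (assumed status code).
        return 503
    except Exception:
        # The probe itself failed: treat the engine as dead.
        return 500
    return 200
```

The timeout branch has to come before the generic `except Exception`, because the timeout error is itself an `Exception` subclass.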

@lengrongfu
Contributor Author

Could we check the scheduler's running queue length and skip the generate call if it is greater than 0?

@lengrongfu
Contributor Author

> @simon-mo very good point!
>
> I'd nevertheless like to argue that a timeout on generate is highly informative for the client, since it means, as you said, that the server cannot serve generation requests within the requested timeout constraint.
>
> It's a different outcome from an HTTP 500 error, which signals that the server is completely dead.
>
> This is a nuanced difference, but the users of this endpoint (Kubernetes users) are also likely to understand it from the endpoint docs.
>
> The difference in outcomes is also clear from the fact that, in the timeout case, the EngineClient does not throw any exception.

I think the /health API should only verify that the server is up and able to work; generation timeouts should be surfaced through a metric instead. So when the scheduler's running queue or waiting queue has a length greater than 0, we should consider the server healthy.
@simon-mo @tomasruizt
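A sketch of the queue-based gate proposed here, under stated assumptions: `QueueSnapshot` and `health` are hypothetical names, and how the real scheduler exposes its running and waiting queue lengths is not shown in this thread. The metric export for timeouts is left out.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class QueueSnapshot:
    """Hypothetical view of the scheduler's queue lengths."""
    running: int
    waiting: int


async def health(snapshot: QueueSnapshot,
                 probe: Callable[[], Awaitable[None]]) -> int:
    """Only run the generation probe when the engine is idle."""
    if snapshot.running > 0 or snapshot.waiting > 0:
        # Work is queued or in flight: report healthy without adding
        # a probe request to the queue.
        return 200
    try:
        await probe()
    except Exception:
        return 500
    return 200
```

One consequence of this design is that a busy engine is never probed, so a failure that appears under load would have to be caught by other means, such as metrics.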

@cadedaniel
Collaborator

It is often helpful to distinguish between readiness and liveness for health checks. Ready means the server can handle more load; live means the server process is alive, even if it is still starting up or recovering.
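As an illustration of that split, a small sketch with two separate checks. `ServerState`, `liveness_status`, `readiness_status`, and the `max_waiting` threshold are all hypothetical; nothing here reflects endpoints that exist in vLLM or in this PR.

```python
from enum import Enum


class ServerState(Enum):
    """Hypothetical lifecycle states for the server."""
    STARTING = "starting"
    SERVING = "serving"
    RECOVERING = "recovering"
    DEAD = "dead"


def liveness_status(state: ServerState) -> int:
    """Live: the process is up, even if starting or recovering."""
    return 500 if state is ServerState.DEAD else 200


def readiness_status(state: ServerState, waiting: int,
                     max_waiting: int) -> int:
    """Ready: serving and with room for more load."""
    if state is not ServerState.SERVING:
        return 503
    return 200 if waiting < max_waiting else 503
```

In a Kubernetes deployment the first check would typically back a livenessProbe (failure restarts the container) and the second a readinessProbe (failure only removes the pod from load balancing).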

Separately, can we add a test for this? Otherwise, it is difficult to rely on this feature as it can break at any time.
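One possible shape for such a test, sketched with the standard library's `unittest`. `FakeClient` and `health_generate` are hypothetical stand-ins for the real client and route handler; an actual test in the repository would exercise the server itself.

```python
import unittest


class FakeClient:
    """Hypothetical client whose generation can be made to fail."""

    def __init__(self, fail: bool) -> None:
        self.fail = fail

    async def minimal_generation(self) -> str:
        if self.fail:
            raise RuntimeError("engine core died")
        return "ok"


async def health_generate(client: FakeClient) -> int:
    """Stand-in for the handler: 200 on success, 500 on any failure."""
    try:
        await client.minimal_generation()
    except Exception:
        return 500
    return 200


class HealthGenerateTest(unittest.IsolatedAsyncioTestCase):
    async def test_healthy_engine_returns_200(self):
        self.assertEqual(await health_generate(FakeClient(fail=False)), 200)

    async def test_dead_engine_returns_500(self):
        self.assertEqual(await health_generate(FakeClient(fail=True)), 500)
```

Saved as a module, this can be run with `python -m unittest`.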

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
Successfully merging this pull request may close these issues.

[Feature]: Support similar API, such as /health_generate