[V1] add generate optional in health api #24491
Conversation
Code Review
This pull request adds an optional `generate` parameter to the `/health` endpoint for a more thorough health check. The implementation is a good start, but the error handling can be made more robust. I've suggested a change to catch a broader range of exceptions and to handle all parts of the health check consistently within a single `try...except` block. This will prevent unhandled exceptions and ensure that any failure during the health check correctly returns a 500 status code.
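The single-`try...except` shape the review suggests can be sketched as follows. This is a hedged illustration, not the PR's actual code: `engine_client`, its `check_health` method, and its `generate` async generator are stand-ins for vLLM's real AsyncLLM client API.

```python
# Sketch of the reviewer's suggestion: wrap every part of the extended
# health check in one try/except so any failure maps to a 500 response.
# The engine_client object and its methods are hypothetical stand-ins.
import asyncio
import logging

logger = logging.getLogger("health")

async def health(engine_client, generate: bool = False) -> int:
    """Return the HTTP status code the /health endpoint would send."""
    try:
        # Basic liveness probe of the engine.
        await engine_client.check_health()
        if generate:
            # A tiny 2-token generation exercises the full engine path.
            async for _ in engine_client.generate("Hi", max_tokens=2):
                pass
        return 200
    except Exception:
        # Catch everything: an unhandled exception here would otherwise
        # bubble up instead of producing a clean 500 response.
        logger.exception("health check failed")
        return 500
```

Because both the plain check and the optional generation sit inside the same block, a failure in either path produces the same 500, which is the consistency the review asks for.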
2 ideas:

```python
try:
    prompt = "Hi"
    sampling_params = SamplingParams(temperature=0, max_tokens=2)
    request_id = random_uuid()
    async for _ in client.generate(prompt, sampling_params, request_id):
        pass
except Exception as e:
    logger.exception(e)
    return Response(status_code=500)
```
@tomasruizt hi, as you suggested, the update is done.
I have some concerns about this design. The request will be enqueued, but if the queue is long (while vLLM is still healthy), the request will time out. Priority scheduling is not enabled by default. This creates a false positive signal for the health-check generation. Instead, we should only run the generation when the engine is idle, so we are not interrupting the current batch. Also cc @njhill @robertgshaw2-redhat on the AsyncLLM interface addition.
@simon-mo very good point! I'd like to argue nevertheless that a timeout on generate is highly informative for the client, since it means, as you said, that the server cannot serve generation requests within the requested timeout constraint. It's a different outcome than an HTTP 500 error, which signals that the server is completely dead. This is a nuanced difference, but the users of this endpoint (Kubernetes users) are also likely to understand it from the endpoint docs. The difference in outcomes is also clear from the fact that in the timeout outcome the …
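The distinction being discussed here (overloaded-but-alive versus dead) could be surfaced with a client-side timeout around the probe generation. A hedged sketch, assuming a stand-in `engine_client` whose `generate` is an async generator; the 503-on-timeout mapping is an illustrative choice for "alive but not ready", not the PR's actual behavior:

```python
# Sketch: distinguish a slow/overloaded engine from a broken one by
# bounding the probe generation with a timeout. The engine_client API
# here is hypothetical.
import asyncio

async def generate_probe(engine_client, timeout_s: float = 5.0) -> int:
    async def _one_generation():
        # Drain a tiny generation to exercise the full engine path.
        async for _ in engine_client.generate("Hi", max_tokens=2):
            pass

    try:
        await asyncio.wait_for(_one_generation(), timeout=timeout_s)
        return 200
    except asyncio.TimeoutError:
        # Queue too long: the engine is alive but cannot serve in time.
        return 503
    except Exception:
        # The engine is actually broken.
        return 500
```

A Kubernetes readiness probe could then treat 503 as "not ready" while a liveness probe only acts on 500, which matches the nuance described in the comment above.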
Alternatively, we could check the scheduler's running queue length, and if it is greater than 0, skip the generate call.
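That idle-gate idea could look roughly like this. Note this is a sketch under assumptions: `get_num_unfinished_requests` is a hypothetical accessor standing in for vLLM's real scheduler statistics, and the rest of the client API is likewise a stand-in.

```python
# Sketch of the idle gate: only run the probe generation when the
# scheduler has no in-flight requests, so real traffic is never delayed.
# get_num_unfinished_requests is a hypothetical accessor.
import asyncio

async def health_with_idle_gate(engine_client, generate: bool = False) -> int:
    try:
        await engine_client.check_health()
        if generate and await engine_client.get_num_unfinished_requests() == 0:
            # Engine is idle: a 2-token generation will not interrupt a batch.
            async for _ in engine_client.generate("Hi", max_tokens=2):
                pass
        return 200
    except Exception:
        return 500
```

The trade-off is that under sustained load the generation path is never exercised, so the check silently degrades back to the basic liveness probe.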
It is often helpful to distinguish between readiness and liveness for health checks: ready meaning the server can handle more load, and live meaning the server is up and either serving, recovering, or starting up. Separately, can we add a test for this? Otherwise it is difficult to rely on this feature, as it could break at any time.
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
Purpose
FIX #24207
Add an optional `generate` parameter to the /health API (`/health?generate=true`) that runs a short generation with max_tokens set to 2.

Test Plan
If the EngineCore process crashes, this API should respond with HTTP status code 500.

Test Result
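The crash-returns-500 behavior in the test plan could be covered by a small regression test along these lines. This is a hedged sketch using a fake engine client rather than a live server; the `health` helper, `FakeEngine`, and its methods are stand-ins for the real endpoint handler and AsyncLLM client, and a real vLLM test would go through the API server instead.

```python
# Sketch of a regression test for /health?generate=true: a healthy engine
# yields 200, a crashed engine yields 500. All names here are stand-ins.
import asyncio

class FakeEngine:
    def __init__(self, healthy: bool):
        self.healthy = healthy

    async def check_health(self):
        if not self.healthy:
            raise RuntimeError("EngineCore process crashed")

    async def generate(self, prompt, max_tokens=2):
        if not self.healthy:
            raise RuntimeError("EngineCore process crashed")
        yield "tok"

async def health(engine, generate: bool = False) -> int:
    """Stand-in for the endpoint handler: status code it would return."""
    try:
        await engine.check_health()
        if generate:
            async for _ in engine.generate("Hi", max_tokens=2):
                pass
        return 200
    except Exception:
        return 500

def test_health_generate():
    assert asyncio.run(health(FakeEngine(True), generate=True)) == 200
    assert asyncio.run(health(FakeEngine(False), generate=True)) == 500
```

Run under pytest, this would lock in the contract the test plan describes and catch regressions if the endpoint's error handling changes.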
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.