Handle response body parsing for both streaming and non-streaming cases #178
/assign @courageJ
@liu-cong: GitHub didn't allow me to assign the following users: courageJ. Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Another thing to pay attention to is that each chunk of a streaming response starts with a `data: ` prefix, but the standard (non-streaming) mode doesn't have it.
/assign @JeffLuoo
Just validated this; we are getting metrics for responses of all types.
Usage stats collection

vLLM supports stream mode, with `"stream": True` set in the request. The way it works today is that it sends one output token per stream chunk, in a JSON format like so: `{"id":"cmpl-02acf58969a747e3ae312f53f38069e6","created":1734721204,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":"\n","logprobs":null,"finish_reason":null,"stop_reason":null}]}`.
To enable usage stats, we should pass the `"stream_options": {"include_usage": True}` parameter. The usage stats will be populated for the last chunk and are `null` for the others.
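For illustration, a request payload enabling both streaming and usage stats could look like the sketch below. The prompt and `max_tokens` value are placeholders, and the shape of the usage object in the trailing comment follows the OpenAI-compatible schema rather than anything stated above:

```python
# Sketch of a completions request that enables streaming plus usage stats.
# "stream" and "stream_options" are the parameters discussed above; the
# prompt and max_tokens are placeholders.
request_body = {
    "model": "meta-llama/Llama-2-7b-hf",
    "prompt": "San Francisco is a",
    "max_tokens": 16,
    "stream": True,
    "stream_options": {"include_usage": True},
}

# With include_usage set, the final non-[DONE] chunk is expected to carry a
# usage object roughly of this shape, while earlier chunks have "usage": null:
#   {"id": "cmpl-...", "choices": [...],
#    "usage": {"prompt_tokens": 5, "completion_tokens": 16, "total_tokens": 21}}
```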
To report request and per-output-token latency metrics, we need to know the end timestamp of a streaming response and the completion token count. In vLLM, when streaming is enabled, the last data chunk is a special string `[DONE]`, while the second-to-last chunk has the non-nil usage stats. We can use this to determine the end of the stream.
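A rough sketch of that parsing logic is below. It is illustrative only, not the EPP implementation; it assumes an SSE-style body where each chunk line is prefixed with `data: ` and that the chunk before `[DONE]` carries the usage stats:

```python
import json

def parse_response_body(body: str, streaming: bool):
    """Return (completion_tokens, done) parsed from a response body.

    Illustrative sketch: assumes the streaming body is a server-sent-events
    payload whose chunk lines are prefixed with "data: ", ends with
    "data: [DONE]", and that the chunk preceding [DONE] carries the usage
    stats when include_usage is set.
    """
    if not streaming:
        # Non-streaming: the body is a single JSON object with usage stats.
        resp = json.loads(body)
        return resp["usage"]["completion_tokens"], True

    completion_tokens = None
    done = False
    for line in body.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            done = True  # end of stream: record the end timestamp here
            continue
        chunk = json.loads(payload)
        usage = chunk.get("usage")
        if usage is not None:
            # Only the last real chunk carries usage when include_usage is set.
            completion_tokens = usage["completion_tokens"]
    return completion_tokens, done
```

In practice the streaming body arrives in pieces, so the same checks would run per received chunk rather than over a fully buffered body.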
Open question

vLLM only returns usage stats in stream mode if `"stream_options": {"include_usage": True}` is set in the request. Should we inject this if metric collection is enabled?
Error handling

Errors in streaming need to be handled carefully. Specifically, the EPP should correctly capture the following error types, especially so that metrics are reported correctly:
Appendix
I used the following code snippet to stream the response and print the chunks:
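What follows is a sketch in that spirit rather than the original snippet; it assumes a vLLM OpenAI-compatible server listening on localhost:8000 and reuses the model name from the example chunk above:

```python
import requests

# Stream a completion from a vLLM OpenAI-compatible server and print each
# raw chunk line as it arrives. URL, prompt, and max_tokens are placeholders.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "stream": True,
        "stream_options": {"include_usage": True},
    },
    stream=True,
)

for raw_line in resp.iter_lines():
    if raw_line:
        # Each line looks like 'data: {...}', with a final 'data: [DONE]'.
        print(raw_line.decode("utf-8"))
```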
Example output of the code snippet: