Handle response body parsing for both streaming and non-streaming cases #178
/assign @courageJ
@liu-cong: GitHub didn't allow me to assign the following users: courageJ. Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Another thing to pay attention to is that each chunk of a streaming response starts with a `data: ` prefix, but the standard (non-streaming) mode doesn't have it.
/assign @JeffLuoo
Just validated this; we are getting metrics for responses of all types.
Usage stats collection

vLLM supports stream mode, with `"stream": True` set in the request. The way it works today is that it sends one output token per stream chunk, in a JSON format like so: `{"id":"cmpl-02acf58969a747e3ae312f53f38069e6","created":1734721204,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":"\n","logprobs":null,"finish_reason":null,"stop_reason":null}]}`.
To enable usage stats, we should pass the `"stream_options": {"include_usage": True}` parameter. The usage stats will be populated for the last chunk and are `null` for the others.
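For illustration, a request payload enabling both streaming and usage stats could look like the sketch below. The prompt and `max_tokens` value are placeholders, and the shape of the usage object in the trailing comment follows the OpenAI-compatible schema rather than anything stated above:

```python
# Sketch of a completions request that enables streaming plus usage stats.
# "stream" and "stream_options" are the parameters discussed above; the
# prompt and max_tokens are placeholders.
request_body = {
    "model": "meta-llama/Llama-2-7b-hf",
    "prompt": "San Francisco is a",
    "max_tokens": 16,
    "stream": True,
    "stream_options": {"include_usage": True},
}

# With include_usage set, the final non-[DONE] chunk is expected to carry a
# usage object roughly of this shape, while earlier chunks have "usage": null:
#   {"id": "cmpl-...", "choices": [...],
#    "usage": {"prompt_tokens": 5, "completion_tokens": 16, "total_tokens": 21}}
```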
To report request and per-output-token latency metrics, we need to know the end timestamp of a streaming response and the completion token count. In vLLM, when streaming is enabled, the last data chunk is a special string `[DONE]`, while the second-to-last chunk has the non-nil usage stats. We can use this to determine the end of the stream.
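A rough sketch of that parsing logic is below. It is illustrative only, not the EPP implementation; it assumes an SSE-style body where each chunk line is prefixed with `data: ` and that the chunk before `[DONE]` carries the usage stats:

```python
import json

def parse_response_body(body: str, streaming: bool):
    """Return (completion_tokens, done) parsed from a response body.

    Illustrative sketch: assumes the streaming body is a server-sent-events
    payload whose chunk lines are prefixed with "data: ", ends with
    "data: [DONE]", and that the chunk preceding [DONE] carries the usage
    stats when include_usage is set.
    """
    if not streaming:
        # Non-streaming: the body is a single JSON object with usage stats.
        resp = json.loads(body)
        return resp["usage"]["completion_tokens"], True

    completion_tokens = None
    done = False
    for line in body.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            done = True  # end of stream: record the end timestamp here
            continue
        chunk = json.loads(payload)
        usage = chunk.get("usage")
        if usage is not None:
            # Only the last real chunk carries usage when include_usage is set.
            completion_tokens = usage["completion_tokens"]
    return completion_tokens, done
```

In practice the streaming body arrives in pieces, so the same checks would run per received chunk rather than over a fully buffered body.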
Open question

vLLM only returns usage stats in stream mode if `"stream_options": {"include_usage": True}` is set in the request. Should we inject this if metric collection is enabled?
Error handling

Errors in streaming need to be handled carefully. Specifically, the EPP should correctly capture the following error types, especially so that metrics are reported correctly:
Appendix
I used the following code snippet to stream the response and print the chunks:
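What follows is a sketch in that spirit rather than the original snippet; it assumes a vLLM OpenAI-compatible server listening on localhost:8000 and reuses the model name from the example chunk above:

```python
import requests

# Stream a completion from a vLLM OpenAI-compatible server and print each
# raw chunk line as it arrives. URL, prompt, and max_tokens are placeholders.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "San Francisco is a",
        "max_tokens": 20,
        "stream": True,
        "stream_options": {"include_usage": True},
    },
    stream=True,
)

for raw_line in resp.iter_lines():
    if raw_line:
        # Each line looks like 'data: {...}', with a final 'data: [DONE]'.
        print(raw_line.decode("utf-8"))
```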
Example output of the code snippet: