[Bug]: vLLM returning 415 status code at high load #14333

@chiragjn

Description

Your current environment

The output of `python collect_env.py`

// TODO
// Unable to run `collect_env.py` because the deployment is not an interactive environment.

🐛 Describe the bug

We are running neuralmagic/Llama-3.3-70B-Instruct-quantized.w8a8 on 2 × H100 80 GB GPUs.

vLLM openai image tag: v0.7.3

Docker Args

--host 0.0.0.0 --port 8000 \
  --disable-log-requests \
  --download-dir /data/ \
  --tokenizer-mode auto \
  --model neuralmagic/Llama-3.3-70B-Instruct-quantized.w8a8 \
  --tokenizer neuralmagic/Llama-3.3-70B-Instruct-quantized.w8a8 \
  --trust-remote-code \
  --dtype auto \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.99 \
  --served-model-name llm \
  --max-model-len 20000 \
  --enforce-eager \
  --kv-cache-dtype fp8 \
  --max-num-seqs 16

When running a load test (input = 16000 tokens, output = 256 tokens), vLLM starts returning HTTP 415 (Unsupported Media Type) for most requests once the load passes a certain point.
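For reference, the load-test requests are plain OpenAI-style chat completions. A minimal sketch of how one request is built (the base URL is an assumption; the model name matches `--served-model-name llm` from the args above). Since 415 means the server rejected the declared media type, the client pins `Content-Type: application/json` explicitly on every request, so the header should not be the cause here:

```python
import json

# Assumed server address for the deployment described above.
BASE_URL = "http://localhost:8000"

def build_chat_request(prompt: str, max_tokens: int = 256):
    """Return (url, headers, body) for one /v1/chat/completions call.

    Hypothetical helper for the load generator, not part of vLLM.
    """
    url = f"{BASE_URL}/v1/chat/completions"
    # An absent or mangled Content-Type is the usual trigger for HTTP 415,
    # so it is set explicitly here on every request.
    headers = {"Content-Type": "application/json"}
    body = json.dumps({
        "model": "llm",  # matches --served-model-name llm
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })
    return url, headers, body
```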

INFO 03-05 22:52:30 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 174.9 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 16.9%, CPU KV cache usage: 0.0%.
INFO 03-05 22:52:35 metrics.py:455] Avg prompt throughput: 7901.7 tokens/s, Avg generation throughput: 9.2 tokens/s, Running: 9 reqs, Swapped: 0 reqs, Pending: 7 reqs, GPU KV cache usage: 25.2%, CPU KV cache usage: 0.0%.
INFO 03-05 22:52:40 metrics.py:455] Avg prompt throughput: 8464.6 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 12 reqs, Swapped: 0 reqs, Pending: 4 reqs, GPU KV cache usage: 33.6%, CPU KV cache usage: 0.0%.
INFO 03-05 22:52:46 metrics.py:455] Avg prompt throughput: 8490.1 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 15 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 41.9%, CPU KV cache usage: 0.0%.
INFO 03-05 22:52:51 metrics.py:455] Avg prompt throughput: 2907.7 tokens/s, Avg generation throughput: 120.9 tokens/s, Running: 16 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 44.8%, CPU KV cache usage: 0.0%.
INFO:     100.64.0.26:41786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     100.64.0.26:35572 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     100.64.0.26:58366 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     100.64.0.27:49902 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     100.64.0.26:58372 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 03-05 22:52:56 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 174.9 tokens/s, Running: 11 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 30.8%, CPU KV cache usage: 0.0%.
INFO:     100.64.0.27:54012 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 03-05 22:53:01 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 156.5 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 28.2%, CPU KV cache usage: 0.0%.
INFO:     100.64.0.25:51610 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
INFO:     100.64.0.25:51610 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
INFO:     100.64.0.25:51624 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
INFO:     100.64.0.25:51636 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
INFO:     100.64.0.26:42064 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
INFO:     100.64.0.25:51624 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
...
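To quantify how many requests fail once the 415s begin, the uvicorn-style access-log lines above can be tallied by status code. A small sketch (the helper name and the regex are ours, not vLLM's; it only matches the access-log format shown above):

```python
import re
from collections import Counter

# Matches uvicorn access-log lines like:
#   INFO:  100.64.0.25:51610 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
ACCESS_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)

def tally_statuses(lines):
    """Return a Counter mapping HTTP status code -> number of log lines."""
    counts = Counter()
    for line in lines:
        m = ACCESS_LINE.search(line)
        if m:
            counts[int(m.group("status"))] += 1
    return counts

sample = [
    'INFO:     100.64.0.26:41786 - "POST /v1/chat/completions HTTP/1.1" 200 OK',
    'INFO:     100.64.0.25:51610 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type',
    'INFO:     100.64.0.25:51624 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type',
]
print(tally_statuses(sample))  # Counter({415: 2, 200: 1})
```

Running this over the full server log makes it easy to correlate the 415 rate with the concurrency level at which it starts.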

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
