Labels: bug
Your current environment
The output of `python collect_env.py`
// TODO: not able to run `collect_env.py` because this is not an interactive environment
🐛 Describe the bug
We are running neuralmagic/Llama-3.3-70B-Instruct-quantized.w8a8 on 2 x H100 80 GB GPUs.
vLLM OpenAI image tag: v0.7.3
Docker Args
--host 0.0.0.0 --port 8000 --disable-log-requests --download-dir /data/ --tokenizer-mode auto --model neuralmagic/Llama-3.3-70B-Instruct-quantized.w8a8 --tokenizer neuralmagic/Llama-3.3-70B-Instruct-quantized.w8a8 --trust-remote-code --dtype auto --tensor-parallel-size 2 --gpu-memory-utilization 0.99 --served-model-name llm --max-model-len 20000 --enforce-eager --kv-cache-dtype fp8 --max-num-seqs 16
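For reference, the arguments above would be passed to the container roughly as follows (the image name, volume mount, and `--gpus` flag are assumptions for illustration, not taken from the original report):

```shell
# Hypothetical launch command assuming the official vLLM OpenAI-compatible image
docker run --gpus all \
  -v /data:/data \
  -p 8000:8000 \
  vllm/vllm-openai:v0.7.3 \
  --host 0.0.0.0 --port 8000 --disable-log-requests --download-dir /data/ \
  --tokenizer-mode auto \
  --model neuralmagic/Llama-3.3-70B-Instruct-quantized.w8a8 \
  --tokenizer neuralmagic/Llama-3.3-70B-Instruct-quantized.w8a8 \
  --trust-remote-code --dtype auto --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.99 --served-model-name llm \
  --max-model-len 20000 --enforce-eager --kv-cache-dtype fp8 --max-num-seqs 16
```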
When running a load test (input = 16,000 tokens, output = 256 tokens), vLLM starts returning HTTP 415 (Unsupported Media Type) for most requests once the load passes a certain point:
INFO 03-05 22:52:30 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 174.9 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 16.9%, CPU KV cache usage: 0.0%.
INFO 03-05 22:52:35 metrics.py:455] Avg prompt throughput: 7901.7 tokens/s, Avg generation throughput: 9.2 tokens/s, Running: 9 reqs, Swapped: 0 reqs, Pending: 7 reqs, GPU KV cache usage: 25.2%, CPU KV cache usage: 0.0%.
INFO 03-05 22:52:40 metrics.py:455] Avg prompt throughput: 8464.6 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 12 reqs, Swapped: 0 reqs, Pending: 4 reqs, GPU KV cache usage: 33.6%, CPU KV cache usage: 0.0%.
INFO 03-05 22:52:46 metrics.py:455] Avg prompt throughput: 8490.1 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 15 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 41.9%, CPU KV cache usage: 0.0%.
INFO 03-05 22:52:51 metrics.py:455] Avg prompt throughput: 2907.7 tokens/s, Avg generation throughput: 120.9 tokens/s, Running: 16 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 44.8%, CPU KV cache usage: 0.0%.
INFO: 100.64.0.26:41786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 100.64.0.26:35572 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 100.64.0.26:58366 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 100.64.0.27:49902 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 100.64.0.26:58372 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 03-05 22:52:56 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 174.9 tokens/s, Running: 11 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 30.8%, CPU KV cache usage: 0.0%.
INFO: 100.64.0.27:54012 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 03-05 22:53:01 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 156.5 tokens/s, Running: 10 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 28.2%, CPU KV cache usage: 0.0%.
INFO: 100.64.0.25:51610 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
INFO: 100.64.0.25:51610 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
INFO: 100.64.0.25:51624 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
INFO: 100.64.0.25:51636 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
INFO: 100.64.0.26:42064 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
INFO: 100.64.0.25:51624 - "POST /v1/chat/completions HTTP/1.1" 415 Unsupported Media Type
...
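The load test can be approximated with a minimal sketch like the one below (endpoint, model name, and request shape follow the server flags above; everything else is an assumption, not the exact client used in the report). Note that HTTP 415 from a FastAPI-based server such as vLLM's OpenAI frontend usually indicates a missing or incorrect `Content-Type: application/json` header, so the sketch sets it explicitly:

```python
# Hypothetical reproduction sketch: fire concurrent /v1/chat/completions
# requests and tally status codes, which is how the 200 -> 415 flip in the
# logs above would surface on the client side.
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib import error, request

BASE_URL = "http://localhost:8000"  # assumed vLLM server address


def build_request(prompt: str, max_tokens: int = 256) -> request.Request:
    """Build a chat-completions request with an explicit JSON Content-Type."""
    body = json.dumps({
        "model": "llm",  # matches --served-model-name llm
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fire(req: request.Request) -> int:
    """Send one request and return the HTTP status code."""
    try:
        with request.urlopen(req, timeout=300) as resp:
            return resp.status
    except error.HTTPError as exc:
        return exc.code


def load_test(n_requests: int = 64, concurrency: int = 16) -> Counter:
    """Run n_requests concurrent requests and count status codes."""
    reqs = [build_request("x " * 8000) for _ in range(n_requests)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(fire, reqs))
```

Running `load_test()` against the deployment and inspecting the returned counter should show the mix of 200s and 415s flipping as concurrency rises.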