Generation throughput is too slow. I suspect the cause is the low GPU KV cache usage. How can I increase the GPU KV cache usage and improve generation throughput?
If you're only running one request, your KV cache is unlikely to fill up. To improve generation speed, you can consider FP8 quantization or speculative decoding.
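Concurrency is usually the biggest lever here: decode steps are largely memory-bandwidth-bound, so one forward step costs roughly the same whether it decodes 1 or 8 sequences, and aggregate throughput scales close to linearly with batch size until compute or KV cache capacity becomes the bottleneck. A toy sketch of that scaling (the per-step latency is an assumption backed out of the reported 41.9 tokens/s at batch size 1, not a measured constant):

```python
# Toy model: assume a fixed per-decode-step latency regardless of batch
# size (memory-bandwidth-bound regime). Inferred from 41.9 tok/s at
# batch size 1; this is an illustrative assumption, not a measurement.
STEP_TIME_S = 1 / 41.9

def est_throughput(batch_size: int) -> float:
    """Estimated aggregate generation throughput in tokens/s."""
    return batch_size / STEP_TIME_S

print(f"batch 1:  {est_throughput(1):7.1f} tok/s")
print(f"batch 8:  {est_throughput(8):7.1f} tok/s")
print(f"batch 32: {est_throughput(32):7.1f} tok/s")
```

In practice this means sending many requests to the server concurrently, or passing a list of prompts to `llm.generate` in the offline API; vLLM's continuous-batching scheduler packs them into shared decode steps automatically. The linear scaling above is an upper bound and flattens once the GPU saturates.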
Your current environment
How would you like to use vllm
1. Model size: 14B
2. Performance: Average prompt throughput: 120.1 tokens/s, average generation throughput: 41.9 tokens/s, running: 1 request, swapped: 0 requests, pending: 0 requests, GPU KV cache usage: 1.0%, CPU KV cache usage: 0.0%.