Unexpected latency of StarCoder when enabling tensor parallelism #666
Unanswered
zhaoyang-star asked this question in Q&A
Replies: 1 comment
-
I found that the number of running requests is ~6 after I set |
-
The latency of StarCoder running on 2 A100 40GB GPUs is higher than when running on 1 A100.
In contrast, the latency of LLaMA-13B running on 2 A100 40GB GPUs is lower than on 1 A100, as expected.
Could the MultiQueryAttention implementation in vLLM be causing this?
Note: the prompt length and the output length are both set to 1k tokens.
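
One way to reason about the question, offered as a rough sketch and not as a confirmed diagnosis: with multi-query attention there is only one key/value head, so it cannot be split across tensor-parallel ranks the way LLaMA's many KV heads can. The snippet below assumes that a KV head which cannot be divided is simply replicated on every GPU (this may not match what vLLM actually does), and it assumes the model shapes shown (40 layers and head size 128 for both models, 48 query heads with 1 KV head for StarCoder, 40 query and KV heads for LLaMA-13B), which should be checked against each model's `config.json`. `AttnConfig`, `kv_heads_per_gpu`, and `kv_bytes_per_token_per_gpu` are hypothetical helper names, not vLLM APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    """Attention shape of a model (values used below are assumptions)."""
    name: str
    num_layers: int
    num_q_heads: int
    num_kv_heads: int  # 1 for multi-query attention
    head_dim: int


def kv_heads_per_gpu(cfg: AttnConfig, tp_size: int) -> int:
    """KV heads held by each tensor-parallel rank.

    Assumption: KV heads are sharded evenly when there are at least
    tp_size of them; otherwise they are replicated on every rank.
    """
    if tp_size < 1:
        raise ValueError("tp_size must be >= 1")
    if cfg.num_kv_heads >= tp_size:
        if cfg.num_kv_heads % tp_size != 0:
            raise ValueError("num_kv_heads must be divisible by tp_size")
        return cfg.num_kv_heads // tp_size
    return cfg.num_kv_heads  # replicated, not sharded


def kv_bytes_per_token_per_gpu(cfg: AttnConfig, tp_size: int,
                               dtype_bytes: int = 2) -> int:
    """KV-cache bytes stored per token on each GPU (keys plus values)."""
    heads = kv_heads_per_gpu(cfg, tp_size)
    return 2 * cfg.num_layers * heads * cfg.head_dim * dtype_bytes


# Assumed shapes; verify against the published model configs.
STARCODER = AttnConfig("StarCoder", num_layers=40, num_q_heads=48,
                       num_kv_heads=1, head_dim=128)
LLAMA_13B = AttnConfig("LLaMA-13B", num_layers=40, num_q_heads=40,
                       num_kv_heads=40, head_dim=128)

if __name__ == "__main__":
    for cfg in (STARCODER, LLAMA_13B):
        for tp in (1, 2):
            size = kv_bytes_per_token_per_gpu(cfg, tp)
            print(f"{cfg.name} tp={tp}: {size} KV bytes per token per GPU")
```

Under these assumptions the per-GPU KV cache for LLaMA-13B halves when going from 1 to 2 GPUs, while for StarCoder it stays the same. That would mean the attention part of decoding gains nothing from the second GPU while tensor parallelism still adds all-reduce communication in every layer. Whether this accounts for the measured latency would need profiling to confirm.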