Unexpected latency of StarCoder when enable tensor parallel #666

zhaoyang-star · 2023-08-03T09:35:05Z

The latency of StarCoder running on 2 A100 40GB GPUs (with tensor parallelism enabled) is higher than on 1 A100.
In contrast, the latency of LLaMA-13B running on 2 A100 40GB GPUs is lower than on 1 A100, as expected.

So does the MultiQueryAttention implementation in vLLM cause this?

Note: The prompt length and the output length are both set to 1k tokens.

zhaoyang-star · 2023-08-12T09:19:46Z

I found that the number of running requests is ~6 when max_num_seqs is set to 10 for LLaMA-13B, whereas it is 10 when max_num_seqs is set to 10 for StarCoder. So the above chart for LLaMA-13B cannot be used for a fair comparison.

After setting max_num_seqs to 6 for both models, I found that the latency of both models on 2 A100s is higher than on 1 A100. So MultiQueryAttention may not be the reason.
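
For anyone who wants to reproduce the comparison, here is a minimal sketch of how such a measurement could be set up. It is not the exact script used above. It assumes the offline `LLM` API of vLLM as of mid-2023 (where `generate` accepts `prompt_token_ids`), reads "1k" as 1024 tokens, and uses `bigcode/starcoder` as an assumed model ID. The helper names `engine_kwargs`, `dummy_prompts`, and `measure_latency` are hypothetical.

```python
import time


def engine_kwargs(model: str, tensor_parallel_size: int, max_num_seqs: int = 6) -> dict:
    """Build keyword arguments for vllm.LLM with a fixed max_num_seqs.

    max_num_seqs is pinned to the same value for every model so that the
    number of concurrently running requests is comparable.
    """
    return {
        "model": model,
        "tensor_parallel_size": tensor_parallel_size,
        "max_num_seqs": max_num_seqs,
    }


def dummy_prompts(batch_size: int, input_len: int = 1024) -> list:
    """Create batch_size prompts of exactly input_len tokens (token id 0)."""
    return [[0] * input_len for _ in range(batch_size)]


def measure_latency(kwargs: dict, batch_size: int = 6,
                    input_len: int = 1024, output_len: int = 1024) -> float:
    """Return the wall-clock seconds needed to generate one batch."""
    # Imported lazily so the helpers above work without vLLM or a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(**kwargs)
    # ignore_eos forces every request to produce exactly output_len tokens.
    params = SamplingParams(max_tokens=output_len, ignore_eos=True)
    prompts = dummy_prompts(batch_size, input_len)

    start = time.perf_counter()
    llm.generate(prompt_token_ids=prompts, sampling_params=params, use_tqdm=False)
    return time.perf_counter() - start
```

Run each setting in a fresh process, e.g. `measure_latency(engine_kwargs("bigcode/starcoder", 1))` in one run and `measure_latency(engine_kwargs("bigcode/starcoder", 2))` in another, so that the two engines do not compete for GPU memory.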
