
Optimize MQA Kernel #452

Merged: 9 commits merged into main on Jul 15, 2023

Conversation

zhuohan123 (Collaborator)

This PR implements the MQA paged attention kernel and modifies the GPT BigCode model to use the optimized MQA kernel.

TODO: Check performance gain.
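
For context, a rough back-of-the-envelope sketch of why MQA helps here (a minimal illustration with assumed StarCoder-like numbers, not the PR's code): with MQA the KV cache stores a single shared K/V head per token instead of one per query head, so the per-token cache footprint shrinks by roughly num_heads / num_kv_heads.

# Illustrative only: per-token KV-cache footprint with and without a shared KV head.
# The config values below are assumptions (StarCoder-like, fp16), not measurements.
num_layers = 40
num_heads = 48        # query heads
num_kv_heads = 1      # MQA: one shared K/V head
head_size = 128
dtype_bytes = 2       # fp16

def kv_bytes_per_token(kv_heads: int) -> int:
    # K and V tensors, for every layer
    return 2 * num_layers * kv_heads * head_size * dtype_bytes

print(kv_bytes_per_token(num_heads))     # 983040 bytes/token if K/V were kept per query head
print(kv_bytes_per_token(num_kv_heads))  # 20480 bytes/token with MQA, i.e. 48x smaller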

@nivibilla

How hard is it to port MQA to LLaMA models?

WoosukKwon (Collaborator) left a comment:

Awesome! Thanks for the PR. Left some comments.

vllm/config.py: review comment (outdated, resolved)
Comment on lines 59 to 62
assert self.num_heads % self.num_kv_heads == 0
self.head_mapping = torch.repeat_interleave(
torch.arange(self.num_kv_heads, dtype=torch.int32, device="cuda"),
num_heads // self.num_kv_heads)

Style nit:

Suggested change:
-assert self.num_heads % self.num_kv_heads == 0
-self.head_mapping = torch.repeat_interleave(
-    torch.arange(self.num_kv_heads, dtype=torch.int32, device="cuda"),
-    num_heads // self.num_kv_heads)
+assert self.num_heads % self.num_kv_heads == 0
+self.num_queries_per_kv = self.num_heads // self.num_kv_heads
+self.head_mapping = torch.repeat_interleave(
+    torch.arange(self.num_kv_heads, dtype=torch.int32, device="cuda"),
+    self.num_queries_per_kv)

self.num_queries_per_kv can also be used on L97 and L100.
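
For illustration only (not part of the diff), here is what the mapping evaluates to for a hypothetical grouped configuration with 8 query heads and 2 KV heads; the device="cuda" argument is dropped so the snippet runs on CPU:

import torch

num_heads = 8
num_kv_heads = 2
num_queries_per_kv = num_heads // num_kv_heads  # 4

head_mapping = torch.repeat_interleave(
    torch.arange(num_kv_heads, dtype=torch.int32),
    num_queries_per_kv)
print(head_mapping)  # tensor([0, 0, 0, 0, 1, 1, 1, 1], dtype=torch.int32)
# Query heads 0-3 read KV head 0 and query heads 4-7 read KV head 1;
# with num_kv_heads == 1 (pure MQA) every query head maps to KV head 0.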

zhuohan123 merged commit 96853af into main on Jul 15, 2023 (2 checks passed).
zhuohan123 (Collaborator, Author) commented on Jul 17, 2023:

StarCoder latency after this PR on 1 GCP A100:

$ python benchmark_latency.py --model bigcode/starcoder --batch-size 1 --input-len 128 --output-len 128 --num-iters 1
Namespace(model='bigcode/starcoder', tokenizer=None, tensor_parallel_size=1, input_len=128, output_len=128, batch_size=1, n=1, use_beam_search=False, num_iters=1, profile=False)
INFO 07-17 07:23:28 llm_engine.py:60] Initializing an LLM engine with config: model='bigcode/starcoder', tokenizer='bigcode/starcoder', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 07-17 07:25:40 llm_engine.py:134] # GPU blocks: 20280, # CPU blocks: 13107
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperature=1.0, top_p=1.0, top_k=-1, use_beam_search=False, stop=[], ignore_eos=True, max_tokens=128, logprobs=None)
Avg latency: 3.6188764572143555 seconds

Before this PR:

(baseline-branch) ubuntu@zhuohan-vllm-4-manual:~/nfs/cacheflow/base-branch/vllm/benchmarks$ python benchmark_latency.py --model bigcode/starcoder --batch-size 1 --input-len 128 --output-len 128 --num-iters 1
Namespace(model='bigcode/starcoder', tokenizer=None, tensor_parallel_size=1, input_len=128, output_len=128, batch_size=1, n=1, use_beam_search=False, num_iters=1)
INFO 07-17 07:28:23 llm_engine.py:60] Initializing an LLM engine with config: model='bigcode/starcoder', tokenizer='bigcode/starcoder', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 07-17 07:30:55 llm_engine.py:134] # GPU blocks: 49, # CPU blocks: 273
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperature=1.0, top_p=1.0, top_k=-1, use_beam_search=False, stop=[], ignore_eos=True, max_tokens=128, logprobs=None)
Avg latency: 4.571176528930664 seconds

This is roughly a 1.26x latency improvement (4.57 s -> 3.62 s). Note that throughput can be boosted even further because the smaller MQA KV cache leaves far more free cache blocks (20,280 GPU blocks after this PR vs. 49 before).

zhyncs commented on Jul 17, 2023:

Hi @zhuohan123

May I ask why the MQA kernel optimization is implemented inside single_query_cached_kv_attention_kernel instead of being abstracted into a separate kernel?

zhuohan123 deleted the mqa-optimization branch on July 18, 2023, 22:18.
zhuohan123 (Collaborator, Author) replied:

> Hi @zhuohan123
> May I ask why the MQA kernel optimization is implemented inside single_query_cached_kv_attention_kernel instead of being abstracted into a separate kernel?

Hi! MQA can be combined with other attention modifications, such as RoPE and ALiBi embeddings. Having a separate class for MQA would introduce many new classes to cover all of these embedding variations.
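
To make the argument concrete, here is a simplified, non-paged sketch (assumed shapes; it is not vLLM's kernel and it omits the causal mask): a single attention routine parameterized by num_kv_heads covers MHA, MQA, and grouped-query attention through the same head mapping, and schemes like RoPE or ALiBi compose with it instead of requiring an MQA-specific class for each variant.

import torch

def attend(query, key, value, num_kv_heads):
    # query: [num_tokens, num_heads, head_size]
    # key, value: [num_tokens, num_kv_heads, head_size]
    num_heads = query.shape[1]
    assert num_heads % num_kv_heads == 0
    head_mapping = torch.repeat_interleave(
        torch.arange(num_kv_heads), num_heads // num_kv_heads)
    # Expand the shared KV heads to one per query head; a paged kernel would
    # instead use this mapping to index the (small) KV cache directly.
    key = key[:, head_mapping, :]      # [num_tokens, num_heads, head_size]
    value = value[:, head_mapping, :]
    scale = query.shape[-1] ** -0.5
    attn = torch.einsum("qhd,khd->hqk", query * scale, key).softmax(dim=-1)
    # RoPE would be applied to query/key above, and ALiBi added to `attn` as a
    # bias, without changing this MHA/MQA/GQA-agnostic structure.
    return torch.einsum("hqk,khd->qhd", attn, value)

# MHA: num_kv_heads == num_heads; MQA: num_kv_heads == 1; GQA: anything in between.
out = attend(torch.randn(16, 48, 128), torch.randn(16, 1, 128),
             torch.randn(16, 1, 128), num_kv_heads=1)
print(out.shape)  # torch.Size([16, 48, 128])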

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
sjchoi1 pushed a commit to casys-kaist-internal/vllm that referenced this pull request May 7, 2024