
[OPTIMIZATION] Optimizes the single_query_cached_kv_attention kernel #420

Conversation

naed90 (Contributor) commented Jul 10, 2023

Instead of having each thread group fetch the query head separately (which causes 64x as much memory to be read as necessary), we have all threads in the block share the task of loading the query head into shared memory. On a benchmark running 1000 sequences through LLaMA-13B on an A100 (80 GB), this improves throughput by 1.10x.
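For illustration, here is a minimal, self-contained CUDA sketch of the cooperative-load pattern described above. It is not the vLLM kernel itself: the kernel name, HEAD_SIZE, NUM_THREADS, and the plain-float layout are simplifying assumptions, and the real kernel operates on vectorized Q_vec types with many heads per grid.

// Hypothetical sketch, not the vLLM kernel: all threads in a block cooperatively
// load one query head into shared memory once, instead of each thread group
// re-reading it from global memory.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int HEAD_SIZE   = 128;  // elements per query head (assumed)
constexpr int NUM_THREADS = 128;  // threads per block (assumed)

__global__ void load_query_head_shared(const float* __restrict__ q,
                                       float* __restrict__ out,
                                       int head_idx) {
  // One copy of the query head per block, visible to every thread group.
  __shared__ float q_shared[HEAD_SIZE];
  const float* q_ptr = q + head_idx * HEAD_SIZE;

  // Strided cooperative load: thread i fetches elements i, i + NUM_THREADS, ...
  for (int i = threadIdx.x; i < HEAD_SIZE; i += NUM_THREADS) {
    q_shared[i] = q_ptr[i];
  }
  __syncthreads();  // every thread now sees the full query head

  // Placeholder use of the shared data so the sketch runs end to end; the real
  // kernel would compute q.k dot products against the cached keys here.
  for (int i = threadIdx.x; i < HEAD_SIZE; i += NUM_THREADS) {
    out[i] = q_shared[i];
  }
}

int main() {
  float *q = nullptr, *out = nullptr;
  cudaMallocManaged(&q, HEAD_SIZE * sizeof(float));
  cudaMallocManaged(&out, HEAD_SIZE * sizeof(float));
  for (int i = 0; i < HEAD_SIZE; ++i) q[i] = static_cast<float>(i);

  load_query_head_shared<<<1, NUM_THREADS>>>(q, out, /*head_idx=*/0);
  cudaDeviceSynchronize();
  printf("out[0]=%.1f out[%d]=%.1f\n", out[0], HEAD_SIZE - 1, out[HEAD_SIZE - 1]);

  cudaFree(q);
  cudaFree(out);
  return 0;
}

With the shared buffer, the query head is read from global memory once per block rather than once per thread group, which is what the 64x figure above refers to.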

naed90 (Contributor, Author) commented Jul 10, 2023

See #421 for a detailed description and analysis of this commit.

zhyncs (Contributor) commented Jul 11, 2023

Hi @naed90

overall LGTM, just had one small nitpick and looks like some formatting issues to address

naed90 (Contributor, Author) commented Jul 11, 2023

> Hi @naed90
>
> overall LGTM, just had one small nitpick and looks like some formatting issues to address

ty.
can't seem to find your review, can you send a link to it?

naed90 (Contributor, Author) commented Jul 13, 2023

@WoosukKwon @zhuohan123 hey, what do you think?

WoosukKwon (Collaborator) commented

Hey @naed90, thanks for submitting the PR and apologies for the late response. I was busy for the last few days. Will take a look at your issue and PR today.

naed90 (Contributor, Author) commented Jul 18, 2023

> Hey @naed90, thanks for submitting the PR and apologies for the late response. I was busy for the last few days. Will take a look at your issue and PR today.

@WoosukKwon bump :)

zhuohan123 (Member) commented

Tested a bit on the latency side:

Before optimization

$ python benchmark_latency.py --model huggyllama/llama-13b --input-len 128 --output-len 128 --num-iters 20
Namespace(model='huggyllama/llama-13b', tokenizer=None, tensor_parallel_size=1, input_len=128, output_len=128, batch_size=8, n=1, use_beam_search=False, num_iters=20, trust_remote_code=False)
INFO 07-24 21:53:31 llm_engine.py:67] Initializing an LLM engine with config: model='huggyllama/llama-13b', tokenizer='huggyllama/llama-13b', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 07-24 21:53:31 tokenizer.py:29] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 07-24 21:54:01 llm_engine.py:183] # GPU blocks: 899, # CPU blocks: 327
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperature=1.0, top_p=1.0, top_k=-1, use_beam_search=False, stop=[], ignore_eos=True, max_tokens=128, logprobs=None)
Warming up...
Profiling iterations: 100%|████████████████████████████████████████| 20/20 [01:11<00:00,  3.56s/it]
Avg latency: 3.5580986022949217 seconds

After optimization

$ python benchmark_latency.py --model huggyllama/llama-13b --input-len 128 --output-len 128 --num-iters 20
Namespace(model='huggyllama/llama-13b', tokenizer=None, tensor_parallel_size=1, input_len=128, output_len=128, batch_size=8, n=1, use_beam_search=False, num_iters=20, trust_remote_code=False)
INFO 07-24 21:55:36 llm_engine.py:67] Initializing an LLM engine with config: model='huggyllama/llama-13b', tokenizer='huggyllama/llama-13b', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 07-24 21:55:36 tokenizer.py:29] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 07-24 21:56:08 llm_engine.py:183] # GPU blocks: 899, # CPU blocks: 327
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperature=1.0, top_p=1.0, top_k=-1, use_beam_search=False, stop=[], ignore_eos=True, max_tokens=128, logprobs=None)
Warming up...
Profiling iterations: 100%|████████████████████████████████████████| 20/20 [01:09<00:00,  3.49s/it]
Avg latency: 3.4891188383102416 seconds

zhuohan123 (Member) left a comment


Thank you for your contribution! Left some small comments. We should be able to merge this after the changes.

csrc/attention/attention_kernels.cu

@@ -116,12 +117,15 @@ __global__ void single_query_cached_kv_attention_kernel(
   // th vectors of the query, and so on.
   // NOTE(woosuk): Because q is split from a qkv tensor, it may not be contiguous.
   const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
-  Q_vec q_vecs[NUM_VECS_PER_THREAD];
+  __shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
+  if (thread_group_idx <= NUM_THREAD_GROUPS_LOWER_BOUND) {
zhuohan123 (Member) commented on this diff:

This if seems redundant if we assume NUM_THREADS is divisible by THREAD_GROUP_SIZE?

naed90 (Contributor, Author) replied Aug 4, 2023

Replaced with an assert.
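For context, here is a minimal sketch of the kind of check being discussed, assuming NUM_THREADS and THREAD_GROUP_SIZE are compile-time template parameters; the exact form used in the merged kernel may differ.

// Hypothetical sketch only: make the divisibility assumption explicit instead
// of guarding the cooperative load with a runtime if.
#include <cassert>

template <int NUM_THREADS, int THREAD_GROUP_SIZE>
__global__ void attention_like_kernel() {
  // Compile-time form, available when both values are template parameters:
  static_assert(NUM_THREADS % THREAD_GROUP_SIZE == 0,
                "NUM_THREADS must be divisible by THREAD_GROUP_SIZE");
  // Device-side runtime form, closer to a plain assert:
  assert(NUM_THREADS % THREAD_GROUP_SIZE == 0);
  // ... cooperative query-head load and attention computation would follow ...
}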

(A further review comment on csrc/attention/attention_kernels.cu was marked outdated and resolved.)
zhuohan123 (Member) left a comment

LGTM! Thank you again for your hard work and detailed profiling!

zhuohan123 merged commit 79af7e9 into vllm-project:main on Aug 4, 2023. 2 checks passed.