Prefix Caching- fix t4 triton error #2517

caoshiyi · 2024-01-20T09:20:32Z

Fix #2513, need a smaller block size for Turing GPUs

Yard1 · 2024-01-20T09:24:18Z

vllm/model_executor/layers/triton_kernel/prefix_prefill.py

@@ -5,6 +5,8 @@
 import triton
 import triton.language as tl

+TESLA = 'Tesla' in torch.cuda.get_device_name(0)


Would it be possible to check for compute capability instead? Also, we should do this inside context_attention_fwd, as calling CUDA APIs before we set CUDA_VISIBLE_DEVICES will lead to errors.
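A minimal sketch of what this suggestion could look like, not the code merged in this PR. The helper names `pick_prefill_block` and `block_for_current_device`, the `major >= 8` cutoff, and the fallback value of 64 are assumptions for illustration; only `BLOCK = 128` and `torch.cuda.get_device_capability()` appear in the diffs on this PR.

```python
from typing import Tuple


def pick_prefill_block(capability: Tuple[int, int],
                       default: int = 128,
                       small: int = 64) -> int:
    """Map a (major, minor) compute capability to a Triton block size.

    Assumption: devices older than compute capability 8.0 (for example
    Turing at 7.5) get the smaller block; newer ones keep the default.
    """
    major, _minor = capability
    return default if major >= 8 else small


def block_for_current_device() -> int:
    """Query the device at call time, not at module import time."""
    # Imported lazily so that importing this module never touches CUDA,
    # which matters if CUDA_VISIBLE_DEVICES is set later by the caller.
    import torch
    return pick_prefill_block(torch.cuda.get_device_capability())


if __name__ == "__main__":
    print(pick_prefill_block((7, 5)))  # Turing, e.g. T4
    print(pick_prefill_block((8, 0)))  # Ampere, e.g. A100
```

Keeping the mapping separate from the device query means the policy can be unit tested on a machine without a GPU.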

Maybe we could add prefix_block_size as a parameter in CacheConfig and let users configure it through LLM?

This sort of thing should ideally be derived automatically.

@Yard1 @caoshiyi Does the block size affect memory utilization or prefix speed?

@esmeetu The block size mainly depends on the shared memory size of the GPU architecture. It affects the prefix-prefill kernel speed a little, but has nothing to do with GPU memory utilization.
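To make the shared memory point concrete, here is a rough back-of-the-envelope estimate. It is only a sketch: the head size of 128, the fp16 element size, and the assumption that one K tile and one V tile are staged at the same time are all illustrative, and the real footprint is decided by the Triton compiler. For scale, Turing GPUs such as the T4 offer at most 64 KB of shared memory per SM.

```python
def tile_bytes(rows: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold one rows x head_dim tile (fp16 by default)."""
    return rows * head_dim * bytes_per_elem


HEAD_DIM = 128  # hypothetical head size, for illustration only

for block in (128, 64):
    # Assumption: one K tile and one V tile are resident together.
    kib = 2 * tile_bytes(block, HEAD_DIM) / 1024
    print(f"BLOCK={block}: ~{kib:.0f} KiB for K and V tiles")
```

Under these assumptions, halving the block size halves the staged tile footprint, which is why a smaller block can fit on an architecture with less shared memory.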

esmeetu · 2024-01-20T12:00:49Z

Amazing! @caoshiyi Thanks for your help! This works for me now, and I do see a 2x-3x speedup. But in further testing, I am hitting an engine-stuck issue when the GPU KV cache is full (after I change the prefix 5~6 times), and the request always stays in the pending state.

INFO 01-20 19:58:39 llm_engine.py:823] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 1 reqs, GPU KV cache usage: 94.3%, CPU KV cache usage: 0.0%

After one more prefix change (which would take up >10% of the KV cache), the engine gets stuck.
So when does the prefix release its KV cache?

esmeetu · 2024-01-20T13:28:52Z

#2511 looks like it solves my second issue.

zhuohan123

LGTM! Left a small comment.

zhuohan123 · 2024-01-24T22:31:59Z

vllm/model_executor/layers/triton_kernel/prefix_prefill.py

@@ -5,6 +5,8 @@
 import triton
 import triton.language as tl

+TESLA = 'Tesla' in torch.cuda.get_device_name(0)


Can we set this variable inside a function instead of as a global variable? Setting it as a global variable may lead to issues in a distributed setting.
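The concern with a module-level value is that it is evaluated once, at import time, before a worker has had a chance to configure its environment. The sketch below illustrates the timing difference with a plain environment variable lookup standing in for the CUDA query; the names `SEEN_AT_IMPORT` and `seen_at_call` are hypothetical.

```python
import os

# Start from a clean state so the example is deterministic.
os.environ.pop("CUDA_VISIBLE_DEVICES", None)

# Module-level value: computed once, when this code is first executed.
SEEN_AT_IMPORT = os.environ.get("CUDA_VISIBLE_DEVICES")


def seen_at_call():
    """Function-level value: computed each time it is called."""
    return os.environ.get("CUDA_VISIBLE_DEVICES")


# Later, a worker process restricts itself to one device.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

print(SEEN_AT_IMPORT)  # still reflects the state at import time
print(seen_at_call())  # reflects the state at call time
```

A device query placed inside the function that launches the kernel behaves like `seen_at_call`: it sees whatever device the worker has selected by then.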

WoosukKwon · 2024-02-14T02:10:51Z

@caoshiyi What is the blocker to this PR? Could you address @Yard1 and @zhuohan123's comments?

esmeetu · 2024-02-15T10:02:41Z

vllm/model_executor/layers/triton_kernel/prefix_prefill.py

@@ -618,7 +618,9 @@ def context_attention_fwd(q,
                              b_ctx_len,
                              max_input_len,
                              alibi_slopes=None):
-        BLOCK = 128
+
+        cap = torch.cuda.get_device_capability()


Does prefix caching support other hardware, like AMD? This only considers the CUDA architecture. It might be better to define a global utility for getting the block size that handles different hardware.

I believe this kernel only works for NVIDIA right now. Let me merge this fix first and we can systematically test for AMD later.

zhuohan123

LGTM!

caoshiyi added 3 commits January 20, 2024 09:12

fix triton kernel block size for t4 gpu

15421a6

format

2f25efa

minor

30151a7

Yard1 reviewed Jan 20, 2024

zhuohan123 approved these changes Jan 24, 2024

WoosukKwon mentioned this pull request Feb 15, 2024

[v0.3.1] Release Tracker #2859

Closed

5 tasks

caoshiyi added 2 commits February 15, 2024 07:57

move cap check inside context_attention_fwd

47ea3ff

clean

1824cdb

caoshiyi requested a review from zhuohan123 February 15, 2024 08:20

esmeetu reviewed Feb 15, 2024

zhuohan123 approved these changes Feb 16, 2024

zhuohan123 merged commit 64da65b into vllm-project:main Feb 16, 2024
17 checks passed

xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 20, 2024

Prefix Caching- fix t4 triton error (vllm-project#2517)

120b2fd

xjpang pushed a commit to xjpang/vllm that referenced this pull request Feb 22, 2024

Prefix Caching- fix t4 triton error (vllm-project#2517)

73614b6

andy-neuma mentioned this pull request Feb 23, 2024

andy/bump main to v0.3.2 neuralmagic/nm-vllm#49

Closed

xjpang pushed a commit to xjpang/vllm that referenced this pull request Mar 4, 2024

Prefix Caching- fix t4 triton error (vllm-project#2517)

9a5b531
