Force paged attention v2 for long contexts #1510

Yard1 · 2023-10-30T21:20:55Z

Removes the hard limit on context length that was tied to paged attention v1 limitations and instead forces v2 to be used if the context cannot fit in shared memory.
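As a rough illustration of the behaviour described above, the sketch below picks a kernel version based on whether a v1-style footprint fits in shared memory. It is not the code from this PR. The function name `pick_paged_attention_version` and the cost model (one float32 per context position) are assumptions made for the example. The 101376-byte budget is the figure quoted in the error message of the linked issue #1267.

```python
# Hypothetical sketch, not vLLM's actual implementation.
FLOAT32_BYTES = 4  # assumption: v1 keeps one float32 value per context position


def pick_paged_attention_version(max_context_len: int, shared_mem_bytes: int) -> int:
    """Return 1 if the assumed v1 footprint fits in shared memory, else 2."""
    required_bytes = max_context_len * FLOAT32_BYTES
    return 1 if required_bytes <= shared_mem_bytes else 2


# A 4096-token context needs 16384 bytes under this model, so v1 is allowed.
print(pick_paged_attention_version(4096, 101376))   # 1
# A 65536-token context needs 262144 bytes, so v2 is forced instead of erroring.
print(pick_paged_attention_version(65536, 101376))  # 2
```

The point of the design is that an oversized context changes which kernel runs rather than raising an error at startup.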

Yard1 · 2023-10-30T21:21:00Z

cc @WoosukKwon

vllm/model_executor/layers/attention.py

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

WoosukKwon

LGTM! Thanks for the fix!

…llm-project#1559)

Merge changes from habana_main for embedding fix HabanaAI#1510

---- details ----
Fix the failures at warmup stage in pooling mode, due to:

[rank0]:   File "/wm/vllm-fork/vllm/worker/hpu_model_runner.py", line 2904, in warmup_model
[rank0]:     self.warmup_graphs(
[rank0]:   File "/wm/vllm-fork/vllm/worker/hpu_model_runner.py", line 2714, in warmup_graphs
[rank0]:     self.warmup_scenario(batch_size,
[rank0]:   File "/wm/vllm-fork/vllm/worker/hpu_model_runner.py", line 2561, in warmup_scenario
[rank0]:     inputs = self.prepare_model_input_align_worker(
[rank0]:   File "/wm/vllm-fork/vllm/worker/model_runner_base.py", line 233, in prepare_model_input_align_worker
[rank0]:     raise NotImplementedError
[rank0]: NotImplementedError

Co-authored-by: Libin Tang <litang@habana.ai>
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>


Force paged attention v2 for long contexts

b6237d0

WoosukKwon self-requested a review October 31, 2023 04:48

WoosukKwon reviewed Nov 1, 2023


vllm/model_executor/layers/attention.py (outdated, resolved)

Apply suggestion from code review

b3c22ac

Yard1 requested a review from WoosukKwon November 1, 2023 18:19

Lint

7023e5e

WoosukKwon reviewed Nov 1, 2023


vllm/model_executor/layers/attention.py (resolved)

Update vllm/model_executor/layers/attention.py

2997044

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Yard1 requested a review from WoosukKwon November 1, 2023 23:16

WoosukKwon approved these changes Nov 1, 2023


WoosukKwon merged commit 9738b84 into vllm-project:main Nov 1, 2023

Yard1 deleted the force_pa_v2_for_long_context branch November 1, 2023 23:30

exceedzhang mentioned this pull request Nov 2, 2023

#1529 # 1510 exceedzhang/vllm#4

Closed

viktor-ferenczi mentioned this pull request Nov 3, 2023

No support for longer context lengths. #1108

Closed

WoosukKwon mentioned this pull request Nov 7, 2023

RuntimeError: vLLM cannot currently support max_model_len=65536 with block_size=16 on GPU with compute capability (8, 9) (required shared memory 264252.0 > available shared memory 101376). #1267

Closed
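The figures in the error message of #1267 can be reproduced with simple arithmetic. The sketch below is inferred from the reported numbers only, not taken from the vLLM source: it assumes 4 bytes per element, a 512-element extra buffer, and a padding step done with true division (which would explain the trailing `.0`). The name `estimate_v1_shared_mem_bytes` and its default parameters are hypothetical.

```python
# Hypothetical reconstruction of the arithmetic behind the reported figure.
def estimate_v1_shared_mem_bytes(
    max_model_len: int,
    block_size: int,
    buffer_elems: int = 512,  # assumption: fixed extra buffer
    elem_bytes: int = 4,      # assumption: float32 elements
) -> float:
    # True division on purpose: it yields 65551.0 rather than a rounded
    # multiple of block_size, which matches the reported 264252.0.
    padded_len = (max_model_len + block_size - 1) / block_size * block_size
    return (padded_len + buffer_elems) * elem_bytes


required = estimate_v1_shared_mem_bytes(65536, 16)
available = 101376  # bytes, as quoted in the error message
print(required)              # 264252.0
print(required > available)  # True
```

Under these assumptions a 65536-token context needs about 2.6 times the shared memory reported for a compute capability (8, 9) GPU, which is the case this PR handles by switching to v2.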

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Force paged attention v2 for long contexts (vllm-project#1510)

6558a99
