
flashinfer paged attention #2772

Open · sumo43 wants to merge 22 commits into base: main

Conversation

@sumo43 commented Feb 5, 2024

Description

This PR integrates FlashInfer's GQA PagedAttention kernels, which are up to 2-3x faster than vLLM's implementation. FlashInfer is used for both prefill and decoding, while the existing cache_ops are kept for KV-cache management.

https://github.com/flashinfer-ai/flashinfer/
https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
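
(For context, below is a minimal pure-PyTorch sketch of the computation that GQA paged attention performs at decode time over a vLLM-style block table. It is a reference for the math only, written for this summary: it is not FlashInfer's API and not the code in this PR, and the function name and tensor layouts are assumptions.)

import torch

def paged_gqa_decode_reference(q, k_cache, v_cache, block_tables, seq_lens):
    # q:               [num_seqs, num_q_heads, head_dim]   query for the current token
    # k_cache/v_cache: [num_blocks, block_size, num_kv_heads, head_dim]
    # block_tables:    [num_seqs, max_blocks_per_seq] int  physical block ids per sequence
    # seq_lens:        [num_seqs] int                      cached tokens per sequence
    num_seqs, num_q_heads, head_dim = q.shape
    block_size, num_kv_heads = k_cache.shape[1], k_cache.shape[2]
    group = num_q_heads // num_kv_heads        # GQA: query heads per KV head
    scale = head_dim ** -0.5
    out = torch.empty_like(q)
    for i in range(num_seqs):
        n = int(seq_lens[i])
        num_blocks = (n + block_size - 1) // block_size
        blocks = block_tables[i, :num_blocks].long()
        # Gather this sequence's K/V from its (non-contiguous) physical blocks.
        k = k_cache[blocks].reshape(-1, num_kv_heads, head_dim)[:n]
        v = v_cache[blocks].reshape(-1, num_kv_heads, head_dim)[:n]
        # Broadcast each KV head to its group of query heads.
        k = k.repeat_interleave(group, dim=1)  # [n, num_q_heads, head_dim]
        v = v.repeat_interleave(group, dim=1)
        scores = torch.einsum("hd,nhd->hn", q[i], k) * scale
        out[i] = torch.einsum("hn,nhd->hd", scores.softmax(dim=-1), v)
    return out

FlashInfer's BatchPrefillWithPagedKVCache / BatchDecodeWithPagedKVCache kernels perform this block gather, GQA head mapping, softmax, and weighted sum in a single fused kernel; the prefill_wrapper and decode_wrapper mentioned in the TODOs below are the Python-side objects that plan and launch those kernels.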


Performance Results

I used the following setup:

python3 benchmarks/benchmark_throughput.py \
        --input-len 1000 \
        --output-len 1000 \
        --model mistralai/Mistral-7B-v0.1 \
        --num-prompts 100 \
        --enforce-eager

Throughput with flashinfer: 2.63 requests/s, 5258.27 tokens/s
Throughput without flashinfer: 1.82 requests/s, 3642.17 tokens/s
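
(That works out to roughly a 1.44x end-to-end throughput gain: 2.63 / 1.82 ≈ 5258 / 3642 ≈ 1.44. A smaller end-to-end gain than the 2-3x kernel-level figure is expected, since attention is only part of the per-token runtime.)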

TODOs

  • Reduce overhead when creating prefill_wrapper, decode_wrapper (DONE)
  • Check KV cache indexing, I think there is a bug there (DONE)
  • Run and debug with TP and different models
  • Fix sampler delay (?) (DONE)

@esmeetu (Collaborator) commented Feb 6, 2024

I got an error on a T4 GPU with a half dtype model.
RuntimeError: BatchPrefillWithPagedKVCache failed to dispatch with dtype Half

@sumo43 (Author) commented Feb 6, 2024

I got an error on a T4 GPU with a half dtype model. RuntimeError: BatchPrefillWithPagedKVCache failed to dispatch with dtype Half

Are you using the kvcache2 branch? Also, try setting the dtype to float16 instead of half.
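
(For reference, the dtype can also be pinned explicitly through vLLM's Python entrypoint; a minimal generic sketch, with the model name as a placeholder:)

from vllm import LLM

# Request fp16 explicitly instead of relying on "auto"/"half" dtype resolution.
llm = LLM(model="mistralai/Mistral-7B-v0.1", dtype="float16")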

@esmeetu (Collaborator) commented Feb 7, 2024

Hi @sumo43, yes I am.
I switched to the Yi-6B-Chat model and it throws a new error:

  File "/.conda/envs/infer/lib/python3.10/site-packages/flashinfer/prefill.py", line 461, in forward
    return self._wrapper.forward(
RuntimeError: BatchPrefillWithPagedKVCache failed with error code no kernel image is available for execution on the device

And when I test with tp=2, the engine gets stuck. It seems tensor parallelism is not supported.
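
(The "no kernel image is available" error usually means the installed FlashInfer wheel was not built for the GPU's compute capability; the T4 is sm_75, and the prebuilt wheel may not include kernels for it. A quick way to check what the GPU reports:)

import torch

# Compare this against the architectures the installed FlashInfer wheel was compiled for.
major, minor = torch.cuda.get_device_capability()
print(f"compute capability: sm_{major}{minor}")  # T4 reports sm_75, A100 reports sm_80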

@zhuohan123 (Collaborator)

@sumo43 Please feel free to ping me when the PR is ready for review!

@sumo43 (Author) commented Feb 7, 2024

@sumo43 Please feel free to ping me when the PR is ready for review!

Sounds good. So far I've made the KV cache compatible with FlashInfer and checked that the outputs are coherent. I'm currently debugging a few issues, like the sampler potentially taking longer to run (?), but I'll make it ready for review soon. Thanks!

@sumo43 sumo43 changed the title [WIP] flashinfer paged attention flashinfer paged attention Feb 7, 2024
@sumo43 sumo43 marked this pull request as ready for review February 7, 2024 22:33
@sumo43 (Author) commented Feb 7, 2024

So, I tested the core functionality and it works. However, my code doesn't support CUDA graphs, so those tests fail (they pass with eager mode). Also, FlashInfer only ships Python 3.10 and 3.11 wheels, so the Docker tests using Python 3.8 don't pass.
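
(The --enforce-eager flag used in the benchmark commands in this thread is the CLI side of this requirement; through vLLM's Python entrypoint the equivalent looks roughly like the sketch below. This is a generic usage example, not code from this branch, and the model name and prompt are placeholders.)

from vllm import LLM, SamplingParams

# enforce_eager=True skips CUDA graph capture, which the FlashInfer path
# does not support yet according to this thread.
llm = LLM(model="mistralai/Mistral-7B-v0.1", enforce_eager=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))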

@Yard1 (Collaborator) commented Feb 8, 2024

Regarding CUDA graphs, this PR should help (though it may not be the only thing needed) - flashinfer-ai/flashinfer#111

@WoosukKwon (Collaborator)

Hi @sumo43, thanks for submitting the PR! To accelerate the merge, we'd like to directly push some modifications to the PR. For example, we'd like to use FlashInfer's C++ APIs rather than the Python APIs. Would you allow us to directly commit the changes to this PR? Of course, you'll remain as a co-author of the PR.

@sumo43 (Author) commented Feb 12, 2024

Hi @WoosukKwon. Absolutely, feel free to make any changes you need.

@shanshanpt (Contributor) commented Feb 20, 2024

Hi @sumo43,
docker image: nvcr.io/nvidia/pytorch:23.07-py3, Python 3.10.6, A100 x 8

Trying to run the kvcache2 branch, I found the following error:

script command:
python3 benchmarks/benchmark_throughput.py --input-len 1000 --output-len 1000 --model /model/Mistral-7B-v0.1 --num-prompts 100 --enforce-eager

error log:
INFO 02-20 03:21:17 llm_engine.py:327] # GPU blocks: 27153, # CPU blocks: 2048
Processed prompts: 0%| | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/model/vllm-flashinfer/benchmarks/benchmark_throughput.py", line 336, in
main(args)
File "/model/vllm-flashinfer/benchmarks/benchmark_throughput.py", line 209, in main
elapsed_time = run_vllm(requests, args.model, args.tokenizer,
File "/model/vllm-flashinfer/benchmarks/benchmark_throughput.py", line 111, in run_vllm
llm._run_engine(use_tqdm=True)
File "/model/vllm-flashinfer/vllm/entrypoints/llm.py", line 208, in _run_engine
step_outputs = self.llm_engine.step()
File "/model/vllm-flashinfer/vllm/engine/llm_engine.py", line 802, in step
all_outputs = self._run_workers(
File "/model/vllm-flashinfer/vllm/engine/llm_engine.py", line 989, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/model/vllm-flashinfer/vllm/worker/worker.py", line 219, in execute_model
output = self.model_runner.execute_model(seq_group_metadata_list,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/model/vllm-flashinfer/vllm/worker/model_runner.py", line 613, in execute_model
hidden_states = model_executable(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/model/vllm-flashinfer/vllm/model_executor/models/mistral.py", line 304, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/model/vllm-flashinfer/vllm/model_executor/models/mistral.py", line 257, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/model/vllm-flashinfer/vllm/model_executor/models/mistral.py", line 205, in forward
hidden_states = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/model/vllm-flashinfer/vllm/model_executor/models/mistral.py", line 155, in forward
attn_output = self.attn(q, k, v, kv_cache, input_metadata)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/model/vllm-flashinfer/vllm/model_executor/layers/attention.py", line 168, in forward
output = flashinfer.single_prefill_with_kv_cache(query, key.contiguous(), value.contiguous(), causal=True)
File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 139, in single_prefill_with_kv_cache
return _kernels.single_prefill_with_kv_cache(
ValueError: When causal is true, kv_len must be greater than or equal to qo_len, got kv_len32032 and qo_len 128128

@markluofd

(quoting @shanshanpt's command and traceback above)

Got the same error

@Qiubo1 commented Feb 21, 2024

I used 300 requests to test LLaMA-13B with FlashInfer and with the original PagedAttention; the original PagedAttention throughput was about 10% higher than FlashInfer's. I wonder, does FlashInfer only help for GQA models?
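
(FlashInfer's advertised 2-3x speedup is specifically for GQA PagedAttention. LLaMA-13B uses plain multi-head attention, with one KV head per query head, while Mistral-7B is GQA with 32 query heads sharing 8 KV heads, so a smaller or no gain on LLaMA-13B is plausible. A quick way to check whether a model is GQA from its Hugging Face config; the model names are just examples, and gated repos need authentication:)

from transformers import AutoConfig

for name in ("mistralai/Mistral-7B-v0.1", "meta-llama/Llama-2-13b-hf"):
    cfg = AutoConfig.from_pretrained(name)
    q_heads = cfg.num_attention_heads
    kv_heads = getattr(cfg, "num_key_value_heads", q_heads)  # absent -> plain MHA
    kind = "GQA" if kv_heads < q_heads else "MHA"
    print(f"{name}: {kind} ({q_heads} query heads / {kv_heads} KV heads)")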

@sumo43 (Author) commented Feb 21, 2024

@pythonononer Yeah, I noticed it too. I'm looking into whether the C++ API is faster or not. Also @shanshanpt, I'd recommend using the pass-ci branch since it's a bit newer and passes the CI tests.

@Qiubo1 commented Feb 21, 2024

@pythonononer Yeah, I noticed it too. I'm looking into whether the C++ API is faster or not. Also @shanshanpt, I'd recommend using the pass-ci branch since it's a bit newer and passes the CI tests.

I think the C++ API is roughly equivalent to the Python API; using pybind to expose the C++ interface to Python only improves things a little. Maybe we need the author to make that optimization.

There are some restrictions I've found: 1. Python >= 3.9, torch >= 2.1, CUDA > 11.8. 2. Eager mode must be enabled and tp == 1, so LLaMA 70B doesn't work.

@MikeChenfu

Is this PR still active? I also get the same error:

RuntimeError: BatchPrefillWithPagedKVCache failed to dispatch with dtype bFloat16.
