
[Kernel] Use flashinfer for decoding #4353

Merged
merged 24 commits into vllm-project:main on May 3, 2024

Conversation

@LiuXiaoxuanPKU (Collaborator) commented Apr 25, 2024

This PR is a first attempt to integrate flashinfer for the decoding phase. The PR still uses flash attention for the prefill phase for now.
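For readers new to the codebase, here is a minimal, hypothetical sketch of the prefill/decode split described above (the function names and metadata layout are illustrative, not vLLM's actual backend API): prefill batches keep going through flash attention, while decode batches are routed to flashinfer.

# Hypothetical sketch of the prefill/decode routing; names are illustrative.
from dataclasses import dataclass

import torch


@dataclass
class AttnMetadataSketch:
    is_prompt: bool  # True for the prefill phase, False for decoding.


def run_prefill_attention(q, k, v):
    # Stand-in for the flash-attention prefill kernel.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)


def run_flashinfer_decode(q, k, v):
    # Stand-in for flashinfer's paged decode kernel (one new query token per sequence).
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)


def attention_forward(q, k, v, metadata: AttnMetadataSketch):
    # Route by phase: prefill still uses flash attention, decode uses flashinfer.
    if metadata.is_prompt:
        return run_prefill_attention(q, k, v)
    return run_flashinfer_decode(q, k, v)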

Updated after discussion with @yzh119
Things that need to be fixed:

  • Use flashinfer's RoPE embedding --> We will turn off FlashInfer's RoPE support and use vLLM's RoPE.
  • ALiBi slope --> FlashInfer will support alibi_slope as an input parameter.
  • CUDA graph support --> FlashInfer will simplify the begin_forward function and add CUDA graph support, so there is no need to pad from the vLLM side.

Next steps:

  • Remove the backend interface change
  • Add tests for e2e correctness
  • Add tests to check that the flashinfer integration works with tensor parallelism

@LiuXiaoxuanPKU marked this pull request as draft on April 25, 2024
@rkooo567 self-assigned this on Apr 25, 2024
@LiuXiaoxuanPKU marked this pull request as ready for review on April 28, 2024
@LiuXiaoxuanPKU (Collaborator, Author)

Passes the correctness tests for the single-GPU and TP settings. Feel free to take a first pass, @rkooo567.

@LiuXiaoxuanPKU changed the title from [WIP][Kernel] Use flashinfer for decoding to [Kernel] Use flashinfer for decoding on Apr 28, 2024
@rkooo567 (Collaborator) commented Apr 28, 2024

CUDA graph support --> FlashInfer will simplify the begin_forward function and add CUDA graph support, so there is no need to pad from the vLLM side.

But we still need to pad inputs for the other parts of the model besides attention? (I think flashinfer-ai/flashinfer#187 could be something nice to add for prefill CUDA graph.)
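For context, a small illustrative sketch of the kind of padding being discussed (the capture sizes are assumed, not vLLM's actual values): CUDA graphs are captured for a fixed set of batch sizes, so a smaller decode batch is padded up to the nearest captured size.

# Illustrative only: pad the decode batch up to the nearest captured graph size.
_CUDA_GRAPH_CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]  # assumed values


def padded_graph_batch_size(batch_size: int) -> int:
    """Return the smallest captured batch size that can hold `batch_size`."""
    for size in _CUDA_GRAPH_CAPTURE_SIZES:
        if batch_size <= size:
            return size
    raise ValueError(f"Batch size {batch_size} exceeds the largest captured graph")


# Example: a decode batch of 13 sequences runs in the graph captured for 16.
assert padded_graph_batch_size(13) == 16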

@rkooo567 (Collaborator) left a comment

Very clean!! Many comments are nits. So it seems like:

  1. not working with prefill
  2. not working with prefix caching
  3. not working with chunked prefill

Is this correct? If so, we should make sure we raise exceptions properly (see the sketch below).

Also, does flashinfer have an equivalent of the prefix attention kernel?
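A minimal sketch of the kind of guard being asked for (the flag names are hypothetical, not the PR's actual code), so unsupported configurations fail loudly instead of silently producing wrong results:

# Hypothetical guard for configurations the FlashInfer backend cannot handle yet.
def check_flashinfer_support(enable_prefix_caching: bool,
                             enable_chunked_prefill: bool) -> None:
    if enable_prefix_caching:
        raise NotImplementedError(
            "Prefix caching is not supported with the FlashInfer backend yet.")
    if enable_chunked_prefill:
        raise NotImplementedError(
            "Chunked prefill is not supported with the FlashInfer backend yet.")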

Review threads on: vllm/attention/backends/flashinfer.py, csrc/cache_kernels.cu, tests/basic_correctness/test_flashinfer.py, tests/distributed/test_flashinfer_distributed.py, vllm/utils.py
@esmeetu (Collaborator) commented Apr 28, 2024

@LiuXiaoxuanPKU Thank you for this PR! Can we support other prefill backends instead of just flash attention, like xFormers?

@esmeetu (Collaborator) commented Apr 29, 2024

@LiuXiaoxuanPKU Thank you for this PR! Can we support other prefill backends instead of just flash attention, like xFormers?

I have implemented an xFormers version, but the performance was not good (10%–20% throughput drop).

@rkooo567 (Collaborator)

I think the goal is to use flashinfer's prefill eventually, so flash attention will be replaced soon!

@LiuXiaoxuanPKU (Collaborator, Author)

@LiuXiaoxuanPKU what's the latest supported version from flashinfer? Also, is this something we can simply build with torch 2.3?

The latest torch version supported by the flashinfer Python package is 2.2. Yes, we can build from source with torch 2.3; I tested it locally and it passed the tests.

@rkooo567 (Collaborator) commented May 2, 2024

Let's merge it without adding tests to CI? I think we don't need to be blocked by them (to merge). I will also create an issue in their repo for torch 2.3 support (we can enable the tests when it is officially supported).


@ywang96 (Collaborator) left a comment

Left a few comments - Thanks for working on this!

Review threads on vllm/attention/backends/flashinfer.py (outdated)
    data_type: torch.dtype = None

    def __post_init__(self):
        if not self.is_prompt:
Reviewer comment (Collaborator):

When using flashinfer, we also create the FlashInferMetadata, which calls __post_init__ by default; here we want to skip the __post_init__ logic if it's the prefill phase.

IMO it's worth putting a NOTE in the code regarding this explanation.
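A simplified sketch of the behavior described above (most fields trimmed; not the exact PR code), showing the early return that skips the decode-only setup during prefill:

# Simplified sketch: skip flashinfer's decode-only setup for prefill batches.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class FlashInferMetadataSketch:
    is_prompt: bool
    data_type: Optional[torch.dtype] = None

    def __post_init__(self):
        # NOTE: prefill still goes through flash attention, so only decode
        # batches need the flashinfer-specific initialization below.
        if self.is_prompt:
            return
        # ... build the paged-KV index tensors and the flashinfer decode wrapper here.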

Comment on lines +565 to +568
        # Allocate 16MB workspace buffer
        # Follow the example of flashinfer: https://docs.flashinfer.ai/api/python/decode.html
        self.flashinfer_workspace_buffer = torch.empty(
            16 * 1024 * 1024, dtype=torch.uint8, device=self.device)
Reviewer comment (Collaborator):

Just curious - is there any point in making the workspace buffer size configurable?

@LiuXiaoxuanPKU (Collaborator, Author):

Currently, it's always 16 MB.
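For reference, a hedged sketch of how the fixed 16 MB workspace buffer is handed to flashinfer's decode wrapper, following the flashinfer decode docs linked above (requires the flashinfer package and a CUDA device; the exact vLLM integration may differ):

# Sketch based on https://docs.flashinfer.ai/api/python/decode.html
import torch
import flashinfer

# flashinfer uses this scratch space internally; the PR hard-codes 16 MB.
workspace_buffer = torch.empty(16 * 1024 * 1024, dtype=torch.uint8, device="cuda")
decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
    workspace_buffer, kv_layout="NHD")
# decode_wrapper.begin_forward(...) is then called with the paged-KV layout
# (indptr / indices / last_page_len) before decode_wrapper.forward(query, kv_cache).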

Review threads on: vllm/attention/selector.py (outdated), vllm/sequence.py
@cadedaniel (Collaborator) left a comment

Approving to unblock @LiuXiaoxuanPKU given @rkooo567's approval.

@ywang96 if you have any more comments, please provide them and @LiuXiaoxuanPKU can follow up in the next PR.

AttentionMetadataPerStage)


class FlashInferBackend(AttentionBackend):
Reviewer comment (Collaborator):

(can do in future PRs): it would be good to have a small description in the docstring so new people can understand what FlashInfer is, why it's useful over other backends, and where to learn more.


    is_prompt: bool

    use_cuda_graph: bool = False
Reviewer comment (Collaborator):

(can do in future PRs): it would be good to add a comment for these values (e.g. CUDA graph is not supported yet).

@@ -33,16 +34,19 @@ def test_models(
    dtype: str,
    max_tokens: int,
) -> None:
    enforce_eager = False
    backend_by_env_var = os.getenv(VLLM_ATTENTION_BACKEND)
Reviewer comment (Collaborator):

(can do in future PR): we should integrate this with #4548

Comment on lines +299 to +300
        return self.hf_text_config.num_attention_heads // \
            parallel_config.tensor_parallel_size
Reviewer comment (Collaborator):

(can do in future PR): do we need a TODO for vision models?

@ywang96 (Collaborator) left a comment

LGTM! I don't have any other comments ATM - please resolve the conflict though, thanks!

@cadedaniel merged commit 43c413e into vllm-project:main on May 3, 2024
59 checks passed
@rkooo567 (Collaborator) commented May 4, 2024

Awesome! Super excited to see the performance with CUDA graph!

robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request May 6, 2024
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
@zhyncs commented May 6, 2024

Hi @LiuXiaoxuanPKU Great work! After switching to the new backend, has there been any performance improvement compared to before? Have you conducted any relevant benchmarks? Thanks.

@zhyncs commented May 6, 2024

Hi @LiuXiaoxuanPKU Is FlashInfer currently enabled by default? After testing throughput on the ShareGPT dataset, there was no significant improvement for vLLM, and the gap with LMDeploy is still quite large (8.40 vs 19.62 req/s).
I'm not sure if there is a configuration error or something I haven't considered. Here is the reproduction method; please let me know if there are any mistakes. Thanks.

# env
NVIDIA A100-SXM4-80GB
PyTorch: 2.3.0+cu118
vLLM: 0.4.2+cu118
flash-attn: 2.5.8
LMDeploy: 0.4.0+81108ff

# vLLM Server
python3 -m vllm.entrypoints.openai.api_server --model /workdir/Meta-Llama-3-8B-Instruct
# vLLM Client
python3 benchmark_serving.py --backend vllm --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model /workdir/Meta-Llama-3-8B-Instruct

# LMDeploy Server
python3 -m lmdeploy serve api_server /workdir/Meta-Llama-3-8B-Instruct --cache-max-entry-count 0.95
# LMDeploy Client
python3 benchmark_serving.py --backend lmdeploy --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer /workdir/Meta-Llama-3-8B-Instruct --model llama3 --port 23333
# vLLM Result

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  119.11
Total input tokens:                      215196
Total generated tokens:                  186473
Request throughput (req/s):              8.40
Input token throughput (tok/s):          1806.76
Output token throughput (tok/s):         1565.61
---------------Time to First Token----------------
Mean TTFT (ms):                          32831.19
Median TTFT (ms):                        24628.54
P99 TTFT (ms):                           88550.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          154.28
Median TPOT (ms):                        149.54
P99 TPOT (ms):                           406.19
==================================================

# LMDeploy Result

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  50.98
Total input tokens:                      215196
Total generated tokens:                  187514
Request throughput (req/s):              19.62
Input token throughput (tok/s):          4221.44
Output token throughput (tok/s):         3678.41
---------------Time to First Token----------------
Mean TTFT (ms):                          18022.59
Median TTFT (ms):                        16949.77
P99 TTFT (ms):                           39935.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          30.41
Median TPOT (ms):                        28.74
P99 TPOT (ms):                           77.57
==================================================

@LiuXiaoxuanPKU (Collaborator, Author)

Hi @zhyncs, thanks for the interest and the benchmarking. Several things here:
FlashInfer is not turned on by default; it can only be enabled with the environment variable VLLM_ATTENTION_BACKEND=FLASHINFER (see the example below).

We don't turn it on by default because of performance concerns:

  1. We don't have CUDA graph support for flashinfer yet. Without CUDA graph, it might be hard to see any performance benefit, though we have not thoroughly benchmarked it yet.
  2. Flashinfer's begin_forward function causes some extra CPU/GPU communication, which will be optimized on the flashinfer side.

Both 1 and 2 are doable and fixable. We will coordinate with the FlashInfer team. After fixing the performance issues, we can turn FlashInfer on by default.
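For completeness, a small example of enabling the backend as described (the model name and prompt are illustrative):

# Enable the FlashInfer backend via the environment variable mentioned above;
# it must be set before vLLM selects its attention backend.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams  # import after setting the variable

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)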

@zhyncs commented May 6, 2024

(quoting @LiuXiaoxuanPKU's reply above)

Thanks for your reply!

@Qiubo1 commented May 7, 2024

I tried flashinfer for decoding in vLLM before, but its performance was poorer than FlashAttention 2.5; I wonder if I did something wrong. Have you compared the decode performance between FlashAttention 2.5 and flashinfer?

z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 7, 2024
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
@MichoChan

(quoting @LiuXiaoxuanPKU's reply above)

Hi, when will it be turned on by default?

@MichoChan

(quoting @LiuXiaoxuanPKU's reply above)

Hi, when will it be turned on by default?

I found that when the QPS or sequence length is not large, the speed is slower than vLLM's base decode kernel; item 2 above (removing the extra CPU/GPU communication) may be what's needed.

@rkooo567 (Collaborator)

@MichoChan I think it is because it doesn't have CUDA graph support yet (at large QPS, the CPU overhead that CUDA graph removes is usually negligible).

@Calculusss commented May 24, 2024

(quoting @LiuXiaoxuanPKU's reply above)

Hi @LiuXiaoxuanPKU, do you have a timeline regarding support for CUDA graph?

@rkooo567 (Collaborator)

We are blocked on the flashinfer side to enable CUDA graph support. We should be able to support it after that (I assume 2+ weeks to get it delivered).

mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>