[torch.compile] support all attention backends #10558
Conversation
The model changes LGTM.
vllm/attention/layer.py (outdated)

            alibi_slopes, sliding_window, kv_cache_dtype,
            blocksparse_params, logits_soft_cap)

        self.use_direct_call = envs.VLLM_USE_V1 or current_platform.is_tpu()
Just curious: why is TPU included in this exception? For V1, it's because of the `output` argument, right?
For V1, yes. For TPU, the KV cache is not a single tensor; it is a tuple of tensors, so the signature of its attention op does not match the others.
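For reference, a minimal sketch (not the actual vLLM code; `AttentionImpl` and `unified_attention_op` are illustrative names) of how such a `use_direct_call` flag can branch between calling the backend implementation directly and routing through a single registered attention op:

```python
import torch
from torch import nn


class AttentionImpl:
    """Stand-in for a backend-specific attention implementation."""

    def forward(self, q, k, v, kv_cache, attn_metadata, output=None):
        # A real backend would read/write the kv cache; this is a placeholder.
        return torch.softmax(q @ k.transpose(-1, -2), dim=-1) @ v


def unified_attention_op(q, k, v, layer_name: str):
    # Placeholder for the single custom op registered for torch.compile;
    # the real op finds the layer by name and forwards to its backend.
    raise NotImplementedError


class Attention(nn.Module):
    def __init__(self, impl: AttentionImpl, use_direct_call: bool, layer_name: str):
        super().__init__()
        self.impl = impl
        self.use_direct_call = use_direct_call
        self.layer_name = layer_name

    def forward(self, q, k, v, kv_cache=None, attn_metadata=None):
        if self.use_direct_call:
            # V1 / TPU: call the backend directly, because its signature
            # (extra `output` arg, or a tuple-of-tensors kv cache on TPU)
            # does not fit the unified op's schema.
            return self.impl.forward(q, k, v, kv_cache, attn_metadata)
        return unified_attention_op(q, k, v, self.layer_name)
```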
 @contextmanager
-def set_forward_context(context: Any):
+def set_forward_context(context: Any, vllm_config: VllmConfig):
A dumb question: can we use `get_current_vllm_config()` here?
No, `get_current_vllm_config()` only works during model initialization; this is model execution.
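To make this concrete, here is a minimal sketch (assumed shape, not the exact vLLM implementation) of a forward-context helper that carries both the per-step attention metadata and the `VllmConfig`, since a config captured at initialization time is not available during execution:

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ForwardContext:
    attn_metadata: Any   # per-step metadata consumed by the attention backends
    vllm_config: Any     # engine-wide configuration (VllmConfig)


_forward_context: Optional[ForwardContext] = None


def get_forward_context() -> ForwardContext:
    assert _forward_context is not None, "not inside set_forward_context()"
    return _forward_context


@contextmanager
def set_forward_context(context: Any, vllm_config: Any):
    """Expose the current step's context to code (e.g. a registered attention
    op) that cannot receive it as an explicit argument."""
    global _forward_context
    prev = _forward_context
    _forward_context = ForwardContext(attn_metadata=context,
                                      vllm_config=vllm_config)
    try:
        yield
    finally:
        _forward_context = prev
```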
The error comes from a Hugging Face timeout.
This PR does some refactoring, primarily on spyre_model_runner. It tries to reduce code duplication between static batching and continuous batching. However, this work will not be complete until a follow-up PR removes the KV cache manager from the Spyre model runner.

Summary of changes:
- Reduce code duplication in the Spyre model runner: some methods are shared in the `SpyreModelRunner` class, while `StaticBatchingSpyreModelRunner` and `ContinuousBatchingSpyreModelRunner` override a few of them for their specific logic.
- Changed the `ContinuousBatchingFmsModel` class to get the attention metadata via the forward context, and changed the model runner to use `with set_forward_context` to pass that metadata (usage sketched below). This is how vLLM supports multiple attention backends [[REF](vllm-project/vllm#10558)].
- Moved the left pads to the CachedRequestState.
- Bugfix: `execute_model` in the CB model runner was inconsistent with the input batch data when it output the result in `CBSpyreModelRunnerOutput`. Changed it, together with prepare_prompt, to use the input batch data.
- Misc: a few renamed variables, more comments, and TODOs.

Signed-off-by: Wallas Santos <wallashss@ibm.com>
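A hypothetical usage fragment matching the description above (illustrative names only; `set_forward_context` / `get_forward_context` refer to the sketch earlier in this thread, not the vllm-spyre code):

```python
# Inside the model runner's execute_model(), roughly:
with set_forward_context(attn_metadata, vllm_config):
    # Layers that need the metadata fetch it via get_forward_context()
    # instead of receiving it through the model's forward() signature.
    hidden_states = model(input_ids, positions)
```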
Previously, we registered attention ops separately, e.g. FlashInfer and FlashAttention. This PR changes the registration to a unified attention interface, so that we don't need to register these attention backends one by one.

How it works:

TODO:
In the future, we should make all attention implementations accept an `output` argument, so that they are aligned with the V1 attention behavior.
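As a rough illustration of the unified-registration idea (assumed names only: `vllm_sketch::unified_attention`, `_ATTENTION_LAYERS`, and `get_forward_context` are placeholders, not the exact symbols this PR adds): a single custom op takes query/key/value plus the layer name, looks the layer up at run time, and pulls the per-step metadata from the forward context; a fake (meta) implementation lets torch.compile trace it without running a real backend.

```python
import torch

# Hypothetical registry: Attention layers register themselves by name at
# construction time so the op can find them at run time.
_ATTENTION_LAYERS: dict = {}


def get_forward_context():
    """Stand-in for the forward-context helper sketched earlier; returns the
    object holding the current step's attention metadata."""
    raise NotImplementedError


@torch.library.custom_op("vllm_sketch::unified_attention", mutates_args=())
def unified_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      layer_name: str) -> torch.Tensor:
    # Dispatch to whatever backend this layer was constructed with; the
    # per-step metadata comes from the forward context, not the op's args.
    layer = _ATTENTION_LAYERS[layer_name]
    attn_metadata = get_forward_context().attn_metadata
    return layer.impl.forward(q, k, v, layer.kv_cache, attn_metadata)


@unified_attention.register_fake
def _(q, k, v, layer_name):
    # Shape/dtype-only version so torch.compile / fake tensors can trace
    # the graph without touching a real attention kernel.
    return torch.empty_like(q)
```

Under a scheme like this, adding a new backend only means providing an implementation object; no additional custom-op registration is needed.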