[torch.compile] support all attention backends #10558
Conversation
The model changes LGTM.
vllm/attention/layer.py (outdated)

            alibi_slopes, sliding_window, kv_cache_dtype,
            blocksparse_params, logits_soft_cap)

        self.use_direct_call = envs.VLLM_USE_V1 or current_platform.is_tpu()
Just curious: why is TPU included in this exception? For V1, it's because of the `output` argument, right?
For V1, yes. For TPU, the KV cache is not a single tensor; it is a tuple of tensors, so the signature of its attention op does not match the others.
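For reference, a minimal sketch (not the actual vLLM code; `AttentionImpl` and `unified_attention_op` are illustrative names) of how such a `use_direct_call` flag can branch between calling the backend implementation directly and routing through a single registered attention op:

```python
import torch
from torch import nn


class AttentionImpl:
    """Stand-in for a backend-specific attention implementation."""

    def forward(self, q, k, v, kv_cache, attn_metadata, output=None):
        # A real backend would read/write the kv cache; this is a placeholder.
        return torch.softmax(q @ k.transpose(-1, -2), dim=-1) @ v


def unified_attention_op(q, k, v, layer_name: str):
    # Placeholder for the single custom op registered for torch.compile;
    # the real op finds the layer by name and forwards to its backend.
    raise NotImplementedError


class Attention(nn.Module):
    def __init__(self, impl: AttentionImpl, use_direct_call: bool, layer_name: str):
        super().__init__()
        self.impl = impl
        self.use_direct_call = use_direct_call
        self.layer_name = layer_name

    def forward(self, q, k, v, kv_cache=None, attn_metadata=None):
        if self.use_direct_call:
            # V1 / TPU: call the backend directly, because its signature
            # (extra `output` arg, or a tuple-of-tensors kv cache on TPU)
            # does not fit the unified op's schema.
            return self.impl.forward(q, k, v, kv_cache, attn_metadata)
        return unified_attention_op(q, k, v, self.layer_name)
```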
 @contextmanager
-def set_forward_context(context: Any):
+def set_forward_context(context: Any, vllm_config: VllmConfig):
A dumb question: can we use `get_current_vllm_config()` here?
No, `get_current_vllm_config()` only works during model initialization; this is model execution.
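To make this concrete, here is a minimal sketch (assumed shape, not the exact vLLM implementation) of a forward-context helper that carries both the per-step attention metadata and the `VllmConfig`, since a config captured at initialization time is not available during execution:

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ForwardContext:
    attn_metadata: Any   # per-step metadata consumed by the attention backends
    vllm_config: Any     # engine-wide configuration (VllmConfig)


_forward_context: Optional[ForwardContext] = None


def get_forward_context() -> ForwardContext:
    assert _forward_context is not None, "not inside set_forward_context()"
    return _forward_context


@contextmanager
def set_forward_context(context: Any, vllm_config: Any):
    """Expose the current step's context to code (e.g. a registered attention
    op) that cannot receive it as an explicit argument."""
    global _forward_context
    prev = _forward_context
    _forward_context = ForwardContext(attn_metadata=context,
                                      vllm_config=vllm_config)
    try:
        yield
    finally:
        _forward_context = prev
```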
The error comes from a Hugging Face timeout.
This PR does some refactoring, primarily on spyre_model_runner. It tries to reduce code duplication between static batching and continuous batching. However, this work will not be complete until a follow-up PR removes the KV cache manager from the Spyre model runner.

Summary of changes:
- Reduce code duplication in the Spyre model runner: some methods are shared in the `SpyreModelRunner` class, while `StaticBatchingSpyreModelRunner` and `ContinuousBatchingSpyreModelRunner` override a few of them for their specific logic.
- Changed the `ContinuousBatchingFmsModel` class to get the attention metadata via the forward context, and changed the model runner to use `with set_forward_context` to pass that metadata (usage sketched below). This is how vLLM supports multiple attention backends [[REF](vllm-project/vllm#10558)].
- Moved the left pads to the CachedRequestState.
- Bugfix: `execute_model` in the CB model runner was inconsistent with the input batch data when it output the result in `CBSpyreModelRunnerOutput`. Changed it, together with prepare_prompt, to use the input batch data.
- Misc: a few renamed variables, more comments, and TODOs.

Signed-off-by: Wallas Santos <wallashss@ibm.com>
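A hypothetical usage fragment matching the description above (illustrative names only; `set_forward_context` / `get_forward_context` refer to the sketch earlier in this thread, not the vllm-spyre code):

```python
# Inside the model runner's execute_model(), roughly:
with set_forward_context(attn_metadata, vllm_config):
    # Layers that need the metadata fetch it via get_forward_context()
    # instead of receiving it through the model's forward() signature.
    hidden_states = model(input_ids, positions)
```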
Previously, we registered attention ops separately, e.g. FlashInfer and FlashAttention. This PR changes the registration to a unified attention interface, so that we don't need to register these attention backends one by one.

How it works:

TODO:
In the future, we should make all attention implementations accept an `output` argument, so that they are aligned with the V1 attention behavior.
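As a rough illustration of the unified-registration idea (assumed names only: `vllm_sketch::unified_attention`, `_ATTENTION_LAYERS`, and `get_forward_context` are placeholders, not the exact symbols this PR adds): a single custom op takes query/key/value plus the layer name, looks the layer up at run time, and pulls the per-step metadata from the forward context; a fake (meta) implementation lets torch.compile trace it without running a real backend.

```python
import torch

# Hypothetical registry: Attention layers register themselves by name at
# construction time so the op can find them at run time.
_ATTENTION_LAYERS: dict = {}


def get_forward_context():
    """Stand-in for the forward-context helper sketched earlier; returns the
    object holding the current step's attention metadata."""
    raise NotImplementedError


@torch.library.custom_op("vllm_sketch::unified_attention", mutates_args=())
def unified_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      layer_name: str) -> torch.Tensor:
    # Dispatch to whatever backend this layer was constructed with; the
    # per-step metadata comes from the forward context, not the op's args.
    layer = _ATTENTION_LAYERS[layer_name]
    attn_metadata = get_forward_context().attn_metadata
    return layer.impl.forward(q, k, v, layer.kv_cache, attn_metadata)


@unified_attention.register_fake
def _(q, k, v, layer_name):
    # Shape/dtype-only version so torch.compile / fake tensors can trace
    # the graph without touching a real attention kernel.
    return torch.empty_like(q)
```

Under a scheme like this, adding a new backend only means providing an implementation object; no additional custom-op registration is needed.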