
Conversation

@maleksan85 (Contributor) commented Aug 12, 2025

Torch compile was erroring in the attention layer on an assert that q_scale equals 1.0. The error originally came from HIP, which reported that the operation is not allowed during CUDA graph capture. This PR therefore introduces a float copy of q_scale, _q_scale_float (analogous to the existing _k_scale_float and _v_scale_float).

P.S. q_scale needs to be 1.0 because on AMD no upscaling happens in the preceding GEMMs, and scales are only applied to k and v if those are in fp8.
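
For illustration, here is a minimal sketch of the pattern described above, assuming a CUDA/ROCm device is available; the surrounding `AttentionLayer` class is hypothetical, and only the `_q_scale` / `_q_scale_float` naming follows the PR:

```python
import torch

class AttentionLayer:
    """Hypothetical layer illustrating the _*_scale / _*_scale_float pattern."""

    def __init__(self) -> None:
        # Scales are kept as GPU tensors so attention kernels can read them
        # on-device (assumes a CUDA/ROCm device is present).
        self._q_scale = torch.tensor(1.0, device="cuda")
        self._k_scale = torch.tensor(1.0, device="cuda")
        self._v_scale = torch.tensor(1.0, device="cuda")
        # Host-side float copies. Evaluating a GPU tensor in an assert forces
        # a device-to-host sync, which HIP/CUDA forbid during graph capture;
        # comparing plain Python floats avoids the sync entirely.
        self._q_scale_float = 1.0
        self._k_scale_float = 1.0
        self._v_scale_float = 1.0

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # Safe during graph capture: a pure host-side comparison.
        assert self._q_scale_float == 1.0, "q_scale must be 1.0 on this path"
        # By contrast, `assert self._q_scale == 1.0` would build a GPU bool
        # tensor and sync on its truth value -- the original failure mode.
        return q
```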

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests that quickly catches errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the v1 label Aug 12, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a fix for torch.compile errors related to q_scale assertions in the attention layer, particularly for FP8 KV cache on HIP devices. The approach of creating a float copy _q_scale_float for assertions is sound and consistent with existing patterns for k_scale and v_scale. The changes in vllm/attention/layer.py and vllm/v1/attention/backends/triton_attn.py are correct. However, there is a potential issue in vllm/model_executor/layers/quantization/kv_cache.py where _q_scale_float might be assigned a tensor instead of a float, which could lead to the same torch.compile issues. I've added a comment with a suggested fix.
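
For reference, a hedged sketch of the kind of fix the bot is pointing at; the function and the handling below are illustrative, not the actual code in vllm/model_executor/layers/quantization/kv_cache.py:

```python
import torch

def set_q_scale(layer, q_scale) -> None:
    """Illustrative: store both tensor and host-float forms of q_scale."""
    # Keep the tensor form for kernels that read the scale on-device.
    layer._q_scale = torch.as_tensor(q_scale)
    # Normalize the host copy to a plain float at load time. Assigning the
    # tensor itself to _q_scale_float would reintroduce the device sync
    # that breaks torch.compile / graph capture.
    if isinstance(q_scale, torch.Tensor):
        layer._q_scale_float = q_scale.item()
    else:
        layer._q_scale_float = float(q_scale)
```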

@gshtras (Collaborator) left a comment


Thanks for the fix
To clarify: this issue presents itself when using full_cuda_graph: true with the unified attention backend.
It would happen on CUDA and on ROCm >= 7.0.
ROCm < 7.0 allows accessing tensor contents on the CPU side during graph capture (an assert is one example of such access).
cc @SageMoore
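
A minimal reproduction sketch of the failure mode described above, assuming a CUDA device (or ROCm >= 7.0, where the same restriction applies); the exact error message varies by platform:

```python
import torch

scale = torch.tensor(1.0, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # `scale == 1.0` produces a GPU bool tensor; `assert` then needs its
    # truth value on the host, which triggers a stream sync. That sync is
    # rejected during graph capture on CUDA and ROCm >= 7.0 (ROCm < 7.0
    # tolerated such host-side access, masking the bug).
    assert scale == 1.0  # raises: operation not permitted during capture
```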

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@gshtras force-pushed the fp8_kv_cache_fix_for_torch_compile branch from 76153d5 to c9e1469 on August 14, 2025 16:41
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>

mergify bot commented Aug 20, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @maleksan85.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Aug 20, 2025
@maleksan85 closed this Sep 5, 2025
@maleksan85 reopened this Sep 9, 2025
@maleksan85 requested a review from tdoublep as a code owner September 9, 2025 16:55

mergify bot commented Sep 9, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @maleksan85.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot removed the needs-rebase label Sep 10, 2025
@gshtras added the ready label Sep 10, 2025
gshtras added a commit to ROCm/vllm that referenced this pull request Sep 10, 2025
@yewentao256 (Member) left a comment


LGTM, thanks for the work!
Please try merging from main to fix the CI issue.

@yewentao256 (Member) left a comment


Please merge from main to resolve the pre-commit issue.


mergify bot commented Sep 13, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @maleksan85.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Sep 13, 2025
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
@mergify bot removed the needs-rebase label Sep 16, 2025
@gshtras enabled auto-merge (squash) September 16, 2025 18:48
@gshtras merged commit 3053a22 into vllm-project:main Sep 16, 2025
49 checks passed
@gshtras deleted the fp8_kv_cache_fix_for_torch_compile branch September 16, 2025 22:17
bringlein added a commit to bringlein/vllm that referenced this pull request Sep 18, 2025
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Signed-off-by: charlifu <charlifu@amd.com>
Labels: ready, v1