fp8 kv cache support fix for torch.compile #22758
Conversation
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request introduces a fix for torch.compile errors related to q_scale assertions in the attention layer, particularly for FP8 KV cache on HIP devices. The approach of creating a float copy, _q_scale_float, for assertions is sound and consistent with the existing patterns for k_scale and v_scale. The changes in vllm/attention/layer.py and vllm/v1/attention/backends/triton_attn.py are correct. However, there is a potential issue in vllm/model_executor/layers/quantization/kv_cache.py where _q_scale_float might be assigned a tensor instead of a float, which could lead to the same torch.compile issues. I've added a comment with a suggested fix.
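A minimal sketch of the pattern the review is pointing at, using simplified, hypothetical function and attribute names (the actual code in vllm/model_executor/layers/quantization/kv_cache.py differs):

```python
import torch


def apply_kv_cache_scales(layer: torch.nn.Module,
                          q_scale: torch.Tensor,
                          k_scale: torch.Tensor,
                          v_scale: torch.Tensor) -> None:
    # Illustrative only: keep the tensor scales for the kernels, but mirror
    # them as plain Python floats for host-side checks, so torch.compile /
    # graph capture never has to read a device tensor.
    layer._q_scale = q_scale
    layer._k_scale = k_scale
    layer._v_scale = v_scale
    # .item() yields a Python float; assigning the tensor itself to
    # _q_scale_float would reintroduce the torch.compile failure mode the
    # review describes.
    layer._q_scale_float = q_scale.item()
    layer._k_scale_float = k_scale.item()
    layer._v_scale_float = v_scale.item()


# Usage example with a stand-in module.
layer = torch.nn.Linear(8, 8)
apply_kv_cache_scales(layer, torch.tensor(1.0), torch.tensor(0.5),
                      torch.tensor(0.5))
assert isinstance(layer._q_scale_float, float)
```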
Thanks for the fix.
To clarify: this issue would present itself when using full_cuda_graph: true together with the unified attention backend.
It would happen on CUDA and ROCm >= 7.0.
ROCm < 7.0 allows accessing tensor contents on the CPU side (an assert is one example of such access) during graph capture.
cc @SageMoore
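For context, a hedged sketch of a launch configuration along those lines; the compilation_config contents and the attention backend value are assumptions that may differ across vLLM versions:

```python
import os

from vllm import LLM

# Assumption: the unified Triton attention backend is selected via this
# environment variable; the exact value depends on the vLLM version.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"

llm = LLM(
    model="facebook/opt-125m",  # placeholder model
    kv_cache_dtype="fp8",       # FP8 KV cache
    # Assumption: full CUDA graph capture enabled through compilation_config.
    compilation_config={"full_cuda_graph": True},
)
```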
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Force-pushed from 76153d5 to c9e1469
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM, thanks for the work!
Please try merging from main to fix the CI issue.
Please merge from main to solve the pre-commit issue
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Signed-off-by: charlifu <charlifu@amd.com>
torch.compile was erroring out in the attention layer on the assert that q_scale equals 1.0. The error originally came from HIP, which reports that the operation is not allowed during CUDA graph capture. The fix therefore introduces a float copy of q_scale, _q_scale_float (similar to _k_scale_float and _v_scale_float), and uses it for the assertion.
PS: q_scale needs to be 1.0 because query upscaling does not happen on AMD from the preceding GEMMs, and scales are only applied to k and v if those are in fp8.
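A minimal sketch of the resulting assert pattern, with simplified attribute names modeled on the description above rather than the exact vllm/attention/layer.py code:

```python
import torch


class AttentionStub(torch.nn.Module):
    # Illustrative stand-in for the attention layer; not the real vLLM class.

    def __init__(self) -> None:
        super().__init__()
        # Tensor scale that the fused attention kernels consume.
        self._q_scale = torch.tensor(1.0)
        # Plain Python float mirror used only for host-side assertions.
        self._q_scale_float = 1.0

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # Comparing a Python float stays on the host, so it is safe under
        # torch.compile and CUDA/HIP graph capture. Asserting on the tensor
        # self._q_scale would require reading device memory, which HIP
        # rejects during capture.
        assert self._q_scale_float == 1.0, (
            "q_scale must be 1.0: query upscaling is not applied here")
        return query  # attention math elided


out = AttentionStub()(torch.randn(2, 4))
```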