Add gather_indexer_k_quant_cache kernel #25931

Barry-Delaney · 2025-09-30T06:35:40Z

This PR adds cp_gather_indexer_k_quant_cache, a kernel for gathering quantized k and k_scale from the indexer k cache.
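To make the gather semantics concrete, here is a minimal NumPy sketch of a paged-cache gather driven by a block table and cumulative sequence lengths. It is not the CUDA kernel from this PR: the cache layout (`[num_blocks, block_size, row_bytes]`), the helper name `gather_paged_rows`, and the toy sizes are all assumptions made for illustration, and it copies one opaque row per token instead of writing fp8 values and scales to separate destinations.

```python
import numpy as np

def gather_paged_rows(cache, block_table, cu_seq_lens):
    """Gather per-token rows from a paged cache into one contiguous array.

    cache:       [num_blocks, block_size, row_bytes]  (hypothetical layout)
    block_table: [batch, max_blocks_per_seq], physical block id per logical block
    cu_seq_lens: [batch + 1], cumulative sequence lengths
    """
    block_size = cache.shape[1]
    total_tokens = int(cu_seq_lens[-1])
    out = np.empty((total_tokens, cache.shape[2]), dtype=cache.dtype)
    for b in range(len(cu_seq_lens) - 1):
        start, end = int(cu_seq_lens[b]), int(cu_seq_lens[b + 1])
        for pos in range(end - start):
            # Logical position -> (physical block, slot inside that block).
            physical_block = block_table[b, pos // block_size]
            out[start + pos] = cache[physical_block, pos % block_size]
    return out

# Toy data: 4 blocks, block_size 2, 3 bytes per token row.
cache = np.arange(4 * 2 * 3, dtype=np.uint8).reshape(4, 2, 3)
block_table = np.array([[2, 0], [3, 1]])
cu_seq_lens = np.array([0, 3, 4])  # sequence 0 has 3 tokens, sequence 1 has 1
out = gather_paged_rows(cache, block_table, cu_seq_lens)
```

A slow reference like this can be handy for checking a fused kernel's output on small inputs, which is the kind of accuracy regression discussed later in this thread.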

github-actions · 2025-09-30T06:35:48Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, a small and essential subset of CI tests that quickly catches errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request introduces a new CUDA kernel cp_gather_indexer_k_quant_cache for gathering quantized K-cache data. My review has identified some critical issues. There's a significant bug in the CUDA kernel's grid launch configuration that will lead to incorrect data gathering for longer sequences. Additionally, there are several const correctness issues and incorrect mutability annotations in the C++ code and PyTorch bindings, which misrepresent the function's contract and can cause subtle bugs. I've provided suggestions to fix these issues.
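The review summary above does not preserve the specifics of the launch-configuration bug, so the following is only a generic illustration of how a grid that is too small can skip data for longer inputs. It simulates a 1-D launch in plain Python; `covered_tokens`, `ceil_div`, and the sizes used are hypothetical and are not taken from the PR.

```python
def covered_tokens(num_tokens, grid_size, threads_per_block, grid_stride):
    """Return the set of token indices visited by a simulated 1-D launch."""
    total_threads = grid_size * threads_per_block
    covered = set()
    for tid in range(total_threads):
        idx = tid
        while idx < num_tokens:
            covered.add(idx)
            if not grid_stride:
                break  # each thread handles at most one token
            idx += total_threads  # grid-stride loop: keep stepping by the grid width
    return covered

def ceil_div(a, b):
    return (a + b - 1) // b

num_tokens = 1000
threads = 128

# A fixed grid with one token per thread silently drops the tail.
fixed = covered_tokens(num_tokens, 4, threads, grid_stride=False)
# Sizing the grid from the token count covers everything.
sized = covered_tokens(num_tokens, ceil_div(num_tokens, threads), threads, grid_stride=False)
# A grid-stride loop also covers everything with the small grid.
strided = covered_tokens(num_tokens, 4, threads, grid_stride=True)
```

Either sizing the grid from the problem size or looping with a grid stride avoids the dropped tail; which approach the PR took is not stated in the surviving text.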

csrc/cache_kernels.cu

csrc/cache.h

csrc/cache_kernels.cu

csrc/torch_bindings.cpp

simon-mo · 2025-09-30T22:38:24Z

Can you integrate it and test the performance? cc @zyongye

mergify · 2025-10-03T07:03:35Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Barry-Delaney.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

zyongye · 2025-10-06T04:27:55Z

I am testing this branch.
I replaced the line here with

ops.cp_gather_indexer_k_quant_cache(
    kv_cache,
    k_fp8,
    k_scale,
    prefill_metadata.block_table,
    prefill_metadata.cu_seq_lens,
)

and I am getting 0.58 on gsm8k with 20 shots. I did rebase onto the main branch. Am I doing anything wrong?

Barry-Delaney · 2025-10-07T06:08:39Z

I fixed the bug in the latest force push. Could you please verify again, @zyongye? Thanks in advance!

zyongye · 2025-10-07T15:15:12Z

Retested with the new branch. Getting good results on gsm8k with both 5 shots and 20 shots.

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9447|±  |0.0063|
|     |       |strict-match    |     5|exact_match|↑  |0.9469|±  |0.0062|

20 shots

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9431|±  |0.0064|
|     |       |strict-match    |    20|exact_match|↑  |0.9439|±  |0.0063|

heheda12345 · 2025-10-07T15:43:42Z

@zyongye I think we can merge this PR first, and then you open a new PR for kernel integration.

heheda12345

Thanks for this important kernel!

zyongye

We may need to add this to pass the ROCm build.

csrc/cache_kernels.cu

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Signed-off-by: Simon Mo <simon.mo@hey.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com>

gemini-code-assist bot reviewed Sep 30, 2025

csrc/cache_kernels.cu (outdated, resolved)

csrc/cache.h (outdated, resolved)

csrc/cache_kernels.cu (outdated, resolved)

csrc/torch_bindings.cpp (outdated, resolved)

Barry-Delaney force-pushed the indexer_gather_kernel branch 3 times, most recently from d6b44e5 to b3c47a9, on September 30, 2025 19:48

mergify bot added the needs-rebase label Oct 3, 2025

Barry-Delaney force-pushed the indexer_gather_kernel branch from b3c47a9 to 5f7b7fe on October 7, 2025 06:07

mergify bot removed the needs-rebase label Oct 7, 2025

Barry-Delaney added 3 commits October 7, 2025 08:47

Add gather_indexer_k_quant_cache kernel

f597cc9

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

Fix on large num_tokens

a9b2f18

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

Perf optimization

bd88b9b

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

Barry-Delaney force-pushed the indexer_gather_kernel branch from 38b409e to bd88b9b on October 7, 2025 08:47

heheda12345 approved these changes Oct 7, 2025

heheda12345 enabled auto-merge (squash) October 7, 2025 15:44

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 7, 2025

Merge branch 'main' into indexer_gather_kernel

38b83fc

zyongye reviewed Oct 7, 2025

csrc/cache_kernels.cu (resolved)

simon-mo and others added 4 commits October 7, 2025 12:45

Update csrc/cache_kernels.cu

9bbca67

Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Simon Mo <simon.mo@hey.com>

skip the whole kernel for rocm

7bf9fe1

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

fix

ce8576d

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

try fix

e46273e

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

heheda12345 disabled auto-merge October 8, 2025 04:20

heheda12345 enabled auto-merge (squash) October 8, 2025 04:21

heheda12345 merged commit 127c8b7 into vllm-project:main Oct 8, 2025
85 checks passed

zyongye mentioned this pull request Oct 9, 2025

[Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather #26456

Merged
