[Feature][Attention][PCP] Support PCP (Prefill Context Parallel) with MLA #28988
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds support for Prefill Context Parallelism (PCP) to the Multi-head Latent Attention (MLA) backend. The changes are extensive, involving refactoring of context parallelism logic to be more generic, adding new metadata and utility functions for PCP, and implementing the PCP attention logic based on the Dual-Chunk-Swap strategy.
My review has identified a critical issue in the attention correction logic where DCP and PCP corrections are applied in the wrong order, which will lead to incorrect results. I have also pointed out a significant performance issue related to nested communication calls that should be optimized. Overall, the PR is a good step towards enabling PCP, but these critical issues need to be addressed.
      cur_allgather_kvcache.copy_(
-         get_dcp_group().all_gather(local_gathered_kvcache, dim=0)
+         get_pcp_group().all_gather(
+             get_dcp_group().all_gather(local_gathered_kvcache, dim=0),
+             dim=0,
+         )
      )
The nested all_gather calls, first over the DCP group and then over the PCP group, are inefficient as they introduce extra communication overhead and synchronization points. This should be optimized into a single all_gather operation.
To achieve this, a new communication group that combines the ranks from both DCP and PCP should be created during initialization. Then, a single all_gather can be performed over this combined "context parallel" (CP) group. This will be more performant. The TODO comment already acknowledges this, and this comment serves to emphasize its importance for performance.
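As a rough illustration of this suggestion, a minimal sketch assuming a combined CP group coordinator were created during distributed init; get_cp_group() is a hypothetical helper, not existing vLLM API:

# Hypothetical sketch only: assumes a single GroupCoordinator spanning both the
# DCP and PCP ranks is built at init time and exposed via an assumed
# get_cp_group() helper, so one all_gather replaces the nested pair above.
cur_allgather_kvcache.copy_(
    get_cp_group().all_gather(local_gathered_kvcache, dim=0)
)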
💡 Codex Review
https://github.com/vllm-project/vllm/blob/9c4884f9884071f7d36b26df87b69eeb6a08ae26/v1/attention/backends/mla/common.py#L212-L215
Importing undefined get_pcp_group
Lines 212‑215 import get_pcp_group from vllm.distributed.parallel_state, but that module still only exposes get_dcp_group (the commit merely introduced a _CP variable without any getter). Importing common.py will therefore immediately raise ImportError: cannot import name 'get_pcp_group', so none of the new PCP code paths can even be instantiated.
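For reference, a minimal sketch of the getter this import expects, mirroring the existing get_dcp_group pattern (the _PCP variable name and assertion message are assumptions):

# Sketch for vllm/distributed/parallel_state.py, where GroupCoordinator is
# defined; the module-level variable name is an assumption.
_PCP: GroupCoordinator | None = None


def get_pcp_group() -> GroupCoordinator:
    assert _PCP is not None, "prefill context parallel group is not initialized"
    return _PCP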
https://github.com/vllm-project/vllm/blob/9c4884f9884071f7d36b26df87b69eeb6a08ae26/v1/attention/backends/mla/flashattn_mla.py#L79-L86
FlashAttn builder now passes nonexistent kwarg
The call to super().__init__(…, supports_cp_with_varlen=True) in FlashAttnMLAMetadataBuilder.__init__ (lines 79‑86) will raise TypeError: __init__() got an unexpected keyword argument 'supports_cp_with_varlen' because MLACommonMetadataBuilder.__init__ still only accepts supports_dcp_with_varlen. This prevents the FlashAttn MLA backend from constructing at all.
https://github.com/vllm-project/vllm/blob/9c4884f9884071f7d36b26df87b69eeb6a08ae26/v1/attention/backends/mla/common.py#L572-L574
Referencing cp_kv_cache_interleave_size attribute that does not exist
Lines 572‑574 now read self.cp_local_block_size = parallel_config.cp_kv_cache_interleave_size, but ParallelConfig (vllm/config/parallel.py) defines only dcp_kv_cache_interleave_size. As soon as MLACommonMetadataBuilder is constructed this access raises AttributeError: 'ParallelConfig' object has no attribute 'cp_kv_cache_interleave_size', so the MLA backend cannot even initialize.
https://github.com/vllm-project/vllm/blob/9c4884f9884071f7d36b26df87b69eeb6a08ae26/v1/attention/backends/utils.py#L1118-L1124
New utils annotation causes NameError on import
The new helper pcp_kv_allgather_and_restore (lines 1118‑1124) annotates pcp_group: GroupCoordinator, but GroupCoordinator is only imported inside the TYPE_CHECKING block and there is no from __future__ import annotations. When Python evaluates these annotations at import time it looks up GroupCoordinator, fails to find the name, and raises NameError, breaking vllm.v1.attention.backends.utils for every runtime import.
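A minimal sketch of the usual remedies for this pattern (an assumed fix, not the PR's actual change):

# Either defer annotation evaluation for the whole module...
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from vllm.distributed.parallel_state import GroupCoordinator


# ...or quote the forward reference so only type checkers resolve it.
def pcp_kv_allgather_and_restore(pcp_group: "GroupCoordinator") -> None:
    # Simplified signature for illustration; the real helper takes more arguments.
    ...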
Force-pushed from 2d9034a to 8ac9843 (compare)
Force-pushed from 8ac9843 to 5e79da7 (compare)
This pull request has merge conflicts that must be resolved before it can be merged.
LucasWilkinson left a comment
some very initial comments; will review more thoroughly tmrw
vllm/v1/worker/gpu_model_runner.py
Outdated
-    logits_indices = query_start_loc[1:] - 1
+    if self.pcp_world_size > 1:
+        logits_indices = (
+            torch.from_numpy(cu_num_tokens) * self.pcp_world_size
+            - self.pcp_manager.num_pcp_pads_cpu_tensor[:num_reqs]
+            - 1
+        )
+    else:
+        logits_indices = query_start_loc[1:] - 1
why repeated logits_indices = query_start_loc[1:] - 1?
Can we do something like the following? The ultimate goal is to lower the visual presence and cognitive load, so people can easily ignore the PCP stuff when reading gpu_model_runner if they don't care about PCP:
logits_indices = query_start_loc[1:] - 1
if self.pcp_world_size > 1:
    logits_indices = self.pcp_manager.get_logits_indices(cu_num_tokens, num_reqs)
👍 Modified as suggested. I also refactored the other PCP logic into functions in the same way, including get_restore_hidden_states, get_discard_request_mask, and get_padded_slot_mapping.
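For context, a minimal sketch of what one of these helpers could look like, lifted directly from the inline formula in the diff above (the PCPManager class name and attributes are assumptions about the refactor):

import numpy as np
import torch


class PCPManager:
    # Assumed container for PCP bookkeeping; only the pieces used below are shown.
    def __init__(self, pcp_world_size: int, num_pcp_pads_cpu_tensor: torch.Tensor):
        self.pcp_world_size = pcp_world_size
        self.num_pcp_pads_cpu_tensor = num_pcp_pads_cpu_tensor

    def get_logits_indices(self, cu_num_tokens: np.ndarray, num_reqs: int) -> torch.Tensor:
        # Same formula as the inline version in the diff: the last token of each
        # request sits at (cumulative tokens * pcp_world_size) minus that
        # request's PCP padding, minus one.
        return (
            torch.from_numpy(cu_num_tokens) * self.pcp_world_size
            - self.num_pcp_pads_cpu_tensor[:num_reqs]
            - 1
        )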
vllm/v1/worker/gpu_model_runner.py
Outdated
     # create a dummy block table and slot mapping for them.
     blk_table_tensor = torch.zeros(
-        (num_reqs_padded, 1),
+        (num_tokens_padded, 1),
nit: rebase error?
Yes, sorry for the mistake, fixed it.
else:
    self.discard_request_mask.np[:num_reqs] = (
        self.seq_lens.np[:num_reqs] < num_tokens_np
    )
nit similar to: https://github.com/vllm-project/vllm/pull/28988/files#r2587693784
can we do:
if self.pcp_world_size > 1:
    self.discard_request_mask.np[:num_reqs] = self.pcp_manager.get_discard_request_mask(...)
else:
    self.discard_request_mask.np[:num_reqs] = (
        self.seq_lens.np[:num_reqs] < num_tokens_np
    )
Modified as suggested.
-    blk_table_tensor[num_reqs:num_reqs_padded].fill_(-1)
+    if self.pcp_world_size == 1:
+        slot_mapping[num_tokens:num_tokens_padded].fill_(-1)
+        blk_table_tensor[num_reqs:num_reqs_padded].fill_(-1)
won't this cause issues for decode batches with full-cudagraphs (we should try to get FULL_AND_PIECEWISE turned on for DCP)
Since the PCP logic is incompatible with padded attention for now, I've added an if condition here to ensure that these two lines of code are executed only when PCP is disabled. I don't think this will have any impact on DCP.
Force-pushed from b8d3dea to 5a51e85 (compare)
This pull request has merge conflicts that must be resolved before it can be merged.
Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com> Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
…ckend. Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
    2 * pcp_size,
    return_head=True,
)
kv_tail_indices, _ = get_pcp_part_indices(
shouldn't this be _, kv_tail_indices?
No, return_head is set to True here. For the query, the required KV starts from the first token, so we always need to return the indices starting from the head.
Force-pushed from 5a51e85 to 92d8766 (compare)
Hi @FENP, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, pre-commit will run these checks automatically.
Force-pushed from 92d8766 to 45c5c7c (compare)
Hi @FENP, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, pre-commit will run these checks automatically.
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Force-pushed from 45c5c7c to 4b40860 (compare)
LucasWilkinson left a comment
cu_seqlens_q=prefill.query_start_loc // 2,
cu_seqlens_k=prefill.query_start_loc // 2 * (self.pcp_rank + 1),
max_seqlen_q=prefill.max_query_len // 2,
max_seqlen_k=prefill.max_query_len // 2 * (self.pcp_rank + 1),
assert pcp_metadata is not None
output_head, lse_head = self._flash_attn_varlen_diff_headdims(
    q=torch.index_select(q, 0, pcp_metadata.query_head_indices),
    k=torch.index_select(k, 0, pcp_metadata.kv_head_indices),
could we replace the k/v index selects by instead treating pcp_metadata.kv_head_indices as the block table for a page_size==1 kv-cache (and pass k/v directly)? FA3 supports page size 1
👍 It's worth a try
Hi @LucasWilkinson, I've thought about this issue a bit more.
Could we consider using a single Triton kernel to perform the index select for qkv all at once? Compared to treating pcp_metadata.kv_head_indices as a block table, this approach could avoid the additional index select call for q. In addition, I think this approach would also make it easier to migrate to other attention backends.
makes sense to me; assuming the triton launch overheads are reasonable since we unfortunately don't have prefill CGs to hide this overhead
Thanks for the feedback! I'm currently working on this change — will push the update soon.
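For reference, a rough sketch of the fused-gather direction (shapes and names are assumptions, not the PR's actual kernel): a single Triton launch gathers the K and V rows selected by pcp_metadata.kv_head_indices instead of separate torch.index_select calls, and q could be folded in the same way.

import torch
import triton
import triton.language as tl


@triton.jit
def _gather_kv(k_ptr, v_ptr, idx_ptr, k_out_ptr, v_out_ptr,
               k_row, v_row, BLOCK_K: tl.constexpr, BLOCK_V: tl.constexpr):
    # One program per gathered token: copy k[src] and v[src] into output row `row`.
    row = tl.program_id(0)
    src = tl.load(idx_ptr + row)

    offs_k = tl.arange(0, BLOCK_K)
    mask_k = offs_k < k_row
    k_vals = tl.load(k_ptr + src * k_row + offs_k, mask=mask_k, other=0.0)
    tl.store(k_out_ptr + row * k_row + offs_k, k_vals, mask=mask_k)

    offs_v = tl.arange(0, BLOCK_V)
    mask_v = offs_v < v_row
    v_vals = tl.load(v_ptr + src * v_row + offs_v, mask=mask_v, other=0.0)
    tl.store(v_out_ptr + row * v_row + offs_v, v_vals, mask=mask_v)


def gather_kv(k: torch.Tensor, v: torch.Tensor, kv_indices: torch.Tensor):
    # Flatten trailing (num_heads, head_dim) so each token is one contiguous row.
    k2 = k.reshape(k.shape[0], -1).contiguous()
    v2 = v.reshape(v.shape[0], -1).contiguous()
    n = kv_indices.numel()
    k_out = torch.empty((n, k2.shape[1]), dtype=k.dtype, device=k.device)
    v_out = torch.empty((n, v2.shape[1]), dtype=v.dtype, device=v.device)
    _gather_kv[(n,)](
        k2, v2, kv_indices, k_out, v_out,
        k2.shape[1], v2.shape[1],
        BLOCK_K=triton.next_power_of_2(k2.shape[1]),
        BLOCK_V=triton.next_power_of_2(v2.shape[1]),
    )
    return k_out.reshape(n, *k.shape[1:]), v_out.reshape(n, *v.shape[1:])

As noted above, prefill has no CUDA graphs to hide launch overhead, so the real kernel would want to fold the q gather into the same launch as well.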
This pull request has merge conflicts that must be resolved before it can be merged.
# TODO: Support prompt logprobs.
logits_indices = query_start_loc[1:] - 1
if self.pcp_world_size > 1:
    logits_indices = self.pcp_manager.get_logits_indices(
logits_indices here seems to be a cpu_tensor; maybe it should be a device_tensor?
Makes sense.
]
all_positions = np.concatenate(all_positions_lst)
self.pcp_allgather_restore_idx.np[: all_positions.shape[0]] = (
    all_positions.argsort()
Right now, all_positions is a CPU tensor; what do you think about putting the tensor on the device and then sorting?
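A small sketch of that suggestion (function and argument names are assumed): move the concatenated positions to the GPU so the argsort runs on the device rather than through NumPy on the host.

import numpy as np
import torch


def restore_index_on_device(all_positions: np.ndarray, device: torch.device) -> torch.Tensor:
    # all_positions is the concatenated position array from the diff above;
    # sorting on the device avoids the host-side argsort and a later H2D copy.
    positions_gpu = torch.from_numpy(all_positions).to(device, non_blocking=True)
    return torch.argsort(positions_gpu)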
blk_table_tensor[num_reqs:num_reqs_padded].fill_(-1)

if self.pcp_world_size > 1:
    slot_mapping = self.pcp_manager.get_padded_slot_mapping(
The address of slot_mapping here seems to have changed; I think it may influence cudagraph in the future. Maybe we could keep the address fixed?
cp_unpad_mask = self.pcp_unpad_mask_cpu_tensor[
    : num_tokens * self.pcp_world_size
]
pcp_padded_slot_mapping.fill_(-1)
Could we delete this operation and initialize the tensor with -1 instead?
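For illustration, a sketch of the suggested alternative (shape, dtype, and device are assumptions): allocate the padded slot mapping once, pre-filled with -1, rather than calling fill_(-1) here.

import torch

# Assumptions for the sketch: an upper bound on padded tokens and the target device.
max_num_padded_tokens = 8192
device = torch.device("cuda")

# One-time allocation pre-filled with -1, as suggested above; whether the
# per-step fill_(-1) can then be dropped depends on how stale entries from
# previous steps are handled.
pcp_padded_slot_mapping = torch.full(
    (max_num_padded_tokens,), -1, dtype=torch.int64, device=device
)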
Hi @FENP, can you resolve the conflict? I want to test it locally.


Purpose
Ref to issue #25749. Enable PCP for MLA models.
This PR mainly includes the following changes:
- vllm/v1/worker/gpu_model_runner.py for PCP partitioning logic for tokens
- vllm/v1/attention/backends/mla/common.py to adapt the MLA backend to PCP
- vllm/v1/attention/backends/utils.py

Test Plan
Test Result
Future work
These items will be tackled in follow-up PRs; community contributions are warmly welcomed.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.