[Feat][Perf] Enable deepep-low-latency with round-robin expert placement. #28449
Conversation
Code Review
This pull request introduces support for the deepep-low-latency all-to-all backend with a round-robin expert placement strategy, which is a valuable performance enhancement for Mixture-of-Experts models. The implementation is well-structured, adding the necessary logic for mapping global to physical expert IDs and including checks for unsupported backends. My review includes a couple of suggestions: one to improve code quality and performance in the routing table generation, and another to relax a condition that currently limits the new placement strategy to specific model architectures, potentially broadening its applicability.
💡 Codex Review
Here are some automated review suggestions for this pull request.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 3fe12ae to 3a13456.
self.global_to_physical: torch.Tensor | None = None
self.physical_to_global: torch.Tensor | None = None
self.local_expert_global_ids: torch.Tensor | None = None
I think these should just be additional optional parameters to the __init__ method. Then we won't need a setter.
You can pass these through maybe_make_prepare_finalize as optional parameters as well.
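For illustration, a minimal sketch of that shape, using the three table attributes from the snippet above (the class name and the omission of any other constructor arguments are placeholders, not the final signature):

import torch


class PrepareFinalizeSketch:
    """Sketch only: the routing tables become optional constructor
    arguments instead of attributes injected later through a setter."""

    def __init__(
        self,
        global_to_physical: torch.Tensor | None = None,
        physical_to_global: torch.Tensor | None = None,
        local_expert_global_ids: torch.Tensor | None = None,
    ) -> None:
        self.global_to_physical = global_to_physical
        self.physical_to_global = physical_to_global
        self.local_expert_global_ids = local_expert_global_ids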
Done, the setter is removed.
This pull request has merge conflicts that must be resolved before it can be merged.
physical = self.global_to_physical[topk_ids.to(torch.long)]
return physical.to(topk_ids.dtype)
Are the casts necessary here? topk_ids should already be an int of some type and if needed, global_to_physical could be cast in set_expert_routing_info (or __init__) using topk_indices_dtype so that there's no issues about type mismatches.
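A small self-contained sketch of that idea (dtype and table contents are illustrative only, not taken from the PR):

import torch

# Cast the routing table once, when it is stored, to the dtype used for topk
# indices; the per-call casts in the lookup then disappear.
idx_dtype = torch.int32  # stand-in for what topk_indices_dtype() would return

global_to_physical = torch.arange(16).to(idx_dtype)  # identity table for the demo
topk_ids = torch.tensor([[3, 7], [0, 12]], dtype=idx_dtype)

# Recent PyTorch accepts int32 (and int64) index tensors, so neither
# topk_ids.to(torch.long) nor a cast of the result is needed.
physical_ids = global_to_physical[topk_ids]
assert physical_ids.dtype == topk_ids.dtype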
Done.
vllm/model_executor/layers/fused_moe/deepep_ll_prepare_finalize.py (outdated comment, resolved)
if hasattr(layer, "_ensure_expert_routing_tables"):
    layer._ensure_expert_routing_tables()
prepare_finalize = self.maybe_make_prepare_finalize()

if prepare_finalize is not None:
    logger.debug(
        "%s for %s(%s)", prepare_finalize.__class__.__name__, self, id(self)
    )
    if (
        getattr(layer, "use_ep", False)
        and hasattr(prepare_finalize, "set_expert_routing_info")
        and hasattr(layer, "expert_global_to_physical")
        and hasattr(layer, "expert_physical_to_global")
        and hasattr(layer, "expert_local_to_global")
    ):
        prepare_finalize.set_expert_routing_info(
            layer.expert_global_to_physical,
            layer.expert_physical_to_global,
            layer.expert_local_to_global,
        )
The construction logic was just refactored recently and maybe_init_modular_kernel has moved to FusedMoE. So that should make this a little bit simpler.
Done.
self.use_deepep_ht_kernels
or self.use_pplx_kernels
or self.use_flashinfer_cutlass_kernels
Can you change the condition to be not self.use_deepep_ll_kernels? That way, if any new all2all mechanisms are added, they won't accidentally skip this code.
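In sketch form, with plain booleans standing in for the config flags (the code guarded by the condition is not shown here):

# Placeholder values; flag names follow the snippet above.
use_deepep_ht_kernels = False
use_pplx_kernels = False
use_flashinfer_cutlass_kernels = False
use_deepep_ll_kernels = True

# Before: an allow-list of backends; a newly added all2all backend would fall
# outside it and accidentally skip the guarded code.
take_path = use_deepep_ht_kernels or use_pplx_kernels or use_flashinfer_cutlass_kernels

# Suggested: exclude only deepep low-latency, so future backends take the path
# by default.
take_path = not use_deepep_ll_kernels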
Done.
Nice perf numbers! I just had a few minor comments.
Force-pushed from 8645f1d to 2a8992d.
Hi @bnellnm, I’ve updated the code based on your comments. Could you take another look when you have a moment? Thanks a lot!
def _maybe_make_prepare_finalize(
    moe: FusedMoEConfig,
    quant_config: FusedMoEQuantConfig | None,
    layer: torch.nn.Module | None = None,
Could you pass the maps as individual optional tensors? Or an optional tuple of tensors? In latest main, this function has been moved to a standalone function that can't depend on the layer.
Also, I think adding a layer argument here would break one of the unit tests that depend on this function.
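A rough sketch of the tuple-of-tensors variant (the parameter name and types are assumptions, not the final code):

import torch


def _maybe_make_prepare_finalize_sketch(
    moe,           # FusedMoEConfig in vLLM; left untyped to keep the sketch standalone
    quant_config,  # FusedMoEQuantConfig | None
    routing_tables: tuple[torch.Tensor, torch.Tensor, torch.Tensor] | None = None,
):
    """Sketch only: the standalone factory accepts the routing tables as an
    optional tuple instead of reaching into a layer object."""
    if routing_tables is not None:
        global_to_physical, physical_to_global, local_expert_global_ids = routing_tables
        # ...pass the three tables to the prepare/finalize object being built...
    ...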
Thanks for the suggestion! I’ve updated the code to use a tuple of routing tables instead of passing the layer, and I’ve also rebased onto the latest main.
Could you take a look and see if this approach is acceptable to you?
Force-pushed from 2a8992d to 33a5de5.
Signed-off-by: bruceszchen <bruceszchen@tencent.com>
Co-authored-by: tbzhang <tbzhang@outlook.com>
Force-pushed from e2a3038 to 90bca85.
bnellnm left a comment:
LGTM!
hmellor left a comment:
Nothing seems obviously incorrect to me, let's see what the tests say
Purpose
Enable the deepep-low-latency all2all backend with the round-robin expert placement strategy, which yields a significant performance improvement.
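For readers unfamiliar with the strategy, a toy sketch of the difference (numbers and helper names are illustrative, independent of the vLLM implementation): linear placement gives each EP rank a contiguous block of experts, while round-robin placement interleaves them, sending global expert g to rank g % ep_size.

def linear_placement(num_experts: int, ep_size: int) -> list[list[int]]:
    # Each rank hosts a contiguous block of global expert ids.
    per_rank = num_experts // ep_size
    return [list(range(r * per_rank, (r + 1) * per_rank)) for r in range(ep_size)]


def round_robin_placement(num_experts: int, ep_size: int) -> list[list[int]]:
    # Global expert g lands on rank g % ep_size, interleaving experts across ranks.
    return [list(range(r, num_experts, ep_size)) for r in range(ep_size)]


# 8 experts on 4 EP ranks:
# linear_placement(8, 4)      -> [[0, 1], [2, 3], [4, 5], [6, 7]]
# round_robin_placement(8, 4) -> [[0, 4], [1, 5], [2, 6], [3, 7]]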
Performance
Test platform: CUDA 12.8, driver 550.144.03
Model: DeepSeek-R1-671B
GPU: H20 × 8 per node, 2 nodes
vLLM config: dp=16, tp=1, enable_expert_parallel=1, all2all_backend=deepep_low_latency, use_deep_gemm=1
The functionality is fully implemented, with accuracy preserved and a significant performance improvement.

In a benchmark with times=10, num_prompts=512, dataset=sharegpt, input_len=1024, output_len=512, max_concurrency=8, and req_rate=8, round-robin placement improved throughput by 14.57% and TPOT by 13.38% compared to the default linear placement.
serving command (head node)
export NCCL_CHECK_DISABLE=1
export NCCL_COLLNET_ENABLE=0
export NCCL_IB_CUDA_SUPPORT=1
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7,mlx5_bond_8
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_SL=3
export NCCL_IB_TC=160
export NCCL_LL_THRESHOLD=16384
export NCCL_NET_GDR_LEVEL=2
export NCCL_NVLS_ENABLE=0
export NCCL_P2P_DISABLE=0
export NCCL_PXN_DISABLE=0
export NCCL_SOCKET_IFNAME=bond1
export GLOO_SOCKET_IFNAME=bond1
export NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME=bond1
export NVSHMEM_HCA_LIST=mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1,mlx5_bond_4:1,mlx5_bond_5:1,mlx5_bond_6:1,mlx5_bond_7:1,mlx5_bond_8:1
export NVSHMEM_IB_TRAFFIC_CLASS=160
export NVSHMEM_DIR=/usr/local/nvshmem
export LD_LIBRARY_PATH=${NVSHMEM_DIR}/lib:$LD_LIBRARY_PATH
export PATH=${NVSHMEM_DIR}/bin:$PATH
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_DISABLE_NCCL_FOR_DP_SYNCHRONIZATION=1
export VLLM_ATTENTION_BACKEND=FLASHMLA
export VLLM_USE_DEEP_GEMM=1
export DG_JIT_CACHE_DIR=/root/.cache/vllm/deep_gemm/
vllm serve /path/to/DeepSeek-R1 \
  --host 0.0.0.0 \
  -tp 1 \
  --max-model-len 16384 \
  --max-num-batched-tokens 16384 \
  --expert-placement-strategy round_robin \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.8 \
  --load-format "auto" \
  --enable-expert-parallel \
  --data-parallel-size 16 \
  --data-parallel-size-local 8 \
  --data-parallel-address ${HOST_IP} \
  --data-parallel-rpc-port 12345 \
  --api-server-count=8
benchmark command
vllm bench serve \
  --backend vllm \
  --base-url "http://127.0.0.1:8500" \
  --port 8500 \
  --endpoint '/v1/completions' \
  --model ${model} \
  --dataset-name sharegpt \
  --num-prompts 512 \
  --max-concurrency 8 \
  --request-rate 8 \
  --random-input-len 1024 \
  --random-output-len 512
Accuracy