[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE #5723

rosenrodt · 2025-07-03T16:31:40Z

[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency

Description

Update TRTLLM MoE so that it performs better than, or at least on par with, CUTLASS MoE at mid-high concurrency (64 tokens per expert).
- For num_experts=128, top_k=8: 64 tokens/expert corresponds to a concurrency of 1024, under the unrealistic assumption that tokens are distributed round-robin across experts
Update the default heuristics for kernel selection
Enable AttentionDP for DeepSeek and Qwen3 (PyTorch backend)
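
As a rough illustration of the tokens-per-expert arithmetic in the description, the following Python sketch relates concurrency to the average number of tokens routed to each expert. It assumes one token per in-flight request per step and a perfectly even (round-robin) distribution, which the description itself calls unrealistic. The function names are hypothetical and are not part of TensorRT-LLM.

```python
def tokens_per_expert(concurrency: int, num_experts: int, top_k: int) -> float:
    """Average tokens routed to each expert, assuming a perfectly even split.

    Each token is sent to `top_k` experts, so `concurrency * top_k`
    token-to-expert assignments are spread over `num_experts` experts.
    """
    return concurrency * top_k / num_experts


def concurrency_for(tokens_per_expert_target: int, num_experts: int, top_k: int) -> int:
    """Inverse of the above: concurrency that yields the target tokens/expert."""
    return tokens_per_expert_target * num_experts // top_k


if __name__ == "__main__":
    # Values taken from the PR description: num_experts=128, top_k=8.
    print(tokens_per_expert(1024, num_experts=128, top_k=8))
    print(concurrency_for(64, num_experts=128, top_k=8))
```

Real routing is skewed, so the busiest experts would see more than the average at the same concurrency.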

Test Coverage

accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=TRTLLM-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=True-overlap_scheduler=False-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=TRTLLM-mtp_nextn=0-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=TRTLLM-mtp_nextn=2-fp8kv=False-attention_dp=False-cuda_graph=True-overlap_scheduler=False-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4[moe_backend=TRTLLM-mtp_nextn=2-fp8kv=True-attention_dp=False-cuda_graph=True-overlap_scheduler=True-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4_4gpus[moe_backend=TRTLLM-mtp_nextn=0-tp4-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4_4gpus[moe_backend=TRTLLM-mtp_nextn=0-ep4-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4_4gpus[moe_backend=TRTLLM-mtp_nextn=2-tp4-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_nvfp4_4gpus[moe_backend=TRTLLM-mtp_nextn=2-ep4-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False]
accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[tep4_latency_moe_trtllm]
accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[tep4_latency_moe_cutlass]
accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[dep4_latency_moe_trtllm]
accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[dep4_latency_moe_cutlass]
...

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

rosenrodt · 2025-07-04T04:38:22Z

/bot run

tensorrt-cicd · 2025-07-04T04:44:05Z

PR_Github #10929 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-04T07:10:25Z

PR_Github #10929 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8078 completed with status: 'FAILURE'

rosenrodt · 2025-07-04T09:29:43Z

@nekorobov Let's wait for #5743 to be merged first

rosenrodt · 2025-07-04T09:54:47Z

/bot run

tensorrt-cicd · 2025-07-04T09:59:57Z

PR_Github #10989 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-04T10:10:04Z

PR_Github #10989 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #8117 completed with status: 'FAILURE'

rosenrodt · 2025-07-07T01:36:30Z

/bot run

tensorrt-cicd · 2025-07-07T01:42:06Z

PR_Github #11077 [ run ] triggered by Bot

rosenrodt · 2025-07-07T01:47:52Z

/bot run

tensorrt-cicd · 2025-07-07T01:53:16Z

PR_Github #11078 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-07T01:53:19Z

PR_Github #11077 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-07-07T06:49:37Z

PR_Github #11078 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8191 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

rosenrodt · 2025-07-07T06:56:51Z

@nekorobov @byshiue PR is ready for re-review. I pushed updates earlier today to enable TRTLLM MoE + attention DP.

Also, since I cherry-picked #5743 to avoid intermittent failures, I prefer to wait until #5743 is merged.

tests/integration/defs/accuracy/test_llm_api_pytorch.py

rosenrodt · 2025-07-07T10:17:25Z

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1"

tensorrt-cicd · 2025-07-07T10:22:56Z

PR_Github #11140 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-07T11:28:10Z

PR_Github #11140 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8236 (Partly Tested) completed with status: 'SUCCESS'

rosenrodt · 2025-07-07T13:14:58Z

The partial test run covering the newly added cases passed, and the previous commit passed the full test run. The PR is in good shape.

[2025-07-07T11:12:17.275Z] DGX_B200-4_GPUs-PyTorch-Post-Merge-1/accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[dep4_latency_moe_cutlass] PASSED [ 64%]
[2025-07-07T11:14:08.816Z] DGX_B200-4_GPUs-PyTorch-Post-Merge-1/accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[dep4_latency_moe_trtllm] PASSED [ 67%]
[2025-07-07T11:16:15.323Z] DGX_B200-4_GPUs-PyTorch-Post-Merge-1/accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[tep4_latency_moe_cutlass] PASSED [ 70%]
[2025-07-07T11:18:09.403Z] DGX_B200-4_GPUs-PyTorch-Post-Merge-1/accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B::test_nvfp4[tep4_latency_moe_trtllm] PASSED [ 72%]

rosenrodt · 2025-07-08T02:20:52Z

/bot run

tensorrt-cicd · 2025-07-08T08:19:07Z

PR_Github #11255 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: Cherry picked from #5816 to avoid build failures on ToT
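
The usage line in this error has the shape produced by Python's argparse. A minimal sketch, assuming the bot tokenizes the comment and hands everything after /bot to an argparse parser with subcommands, shows how free text after `run` would be rejected. The actual bot implementation is not shown in this thread; the parser below is hypothetical and models only a few of the documented flags.

```python
import argparse
import shlex
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the bot's parser; only a subset of flags.
    parser = argparse.ArgumentParser(prog="/bot")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run")
    run.add_argument("--disable-fail-fast", action="store_true")
    run.add_argument("--stage-list", default=None)
    run.add_argument("--extra-stage", default=None)
    sub.add_parser("kill")
    return parser


def parse_comment(comment: str) -> Optional[argparse.Namespace]:
    """Return parsed arguments, or None if the comment is not a valid command."""
    tokens = shlex.split(comment)
    if not tokens or tokens[0] != "/bot":
        return None
    try:
        return build_parser().parse_args(tokens[1:])
    except SystemExit:
        # argparse prints the usage/error text to stderr and exits on bad input.
        return None


if __name__ == "__main__":
    ok = parse_comment('/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1"')
    print(ok)
    # Trailing prose is treated as unrecognized arguments and rejected.
    bad = parse_comment("/bot run Cherry picked from #5816 to avoid build failures on ToT")
    print(bad)
```

If the bot behaves like this sketch, keeping explanatory text out of the /bot command avoids the parsing error.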

rosenrodt · 2025-07-08T08:22:56Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1"

tensorrt-cicd · 2025-07-08T08:24:28Z

PR_Github #11256 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-08T08:28:05Z

PR_Github #11258 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-08T08:28:08Z

PR_Github #11256 [ run ] completed with state ABORTED

rosenrodt · 2025-07-08T15:18:11Z

/bot run --extra-stage "DGX_B200-4_GPUs-PyTorch-Post-Merge-1,GB200-4_GPUs-PyTorch-Post-Merge-1"

tensorrt-cicd · 2025-07-08T15:23:26Z

PR_Github #11317 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-08T15:23:28Z

PR_Github #11258 [ run ] completed with state ABORTED

rosenrodt requested a review from a team as a code owner July 3, 2025 16:31

rosenrodt requested review from suyoggupta and pcastonguay July 3, 2025 16:31

rosenrodt force-pushed the trtllm-moe-nvfp4-cubin-high-concurrency branch from ef5ea41 to 606ae9d Compare July 3, 2025 16:34

rosenrodt requested a review from nekorobov July 3, 2025 16:34

rosenrodt force-pushed the trtllm-moe-nvfp4-cubin-high-concurrency branch from 606ae9d to 8b0eebc Compare July 4, 2025 04:35

rosenrodt changed the title ~~[feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency~~ [feat] [https://nvbugs/5369010] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency Jul 4, 2025

nekorobov approved these changes Jul 4, 2025

rosenrodt force-pushed the trtllm-moe-nvfp4-cubin-high-concurrency branch from 105dfad to 3945b94 Compare July 4, 2025 09:10

rosenrodt force-pushed the trtllm-moe-nvfp4-cubin-high-concurrency branch from 3945b94 to 9969e33 Compare July 4, 2025 09:54

rosenrodt force-pushed the trtllm-moe-nvfp4-cubin-high-concurrency branch 2 times, most recently from 1171165 to 626add3 Compare July 7, 2025 01:35

rosenrodt changed the title ~~[feat] [https://nvbugs/5369010] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency~~ [feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE Jul 7, 2025

nekorobov approved these changes Jul 7, 2025

Resolved review thread on tests/integration/defs/accuracy/test_llm_api_pytorch.py

rosenrodt force-pushed the trtllm-moe-nvfp4-cubin-high-concurrency branch from 1b301e9 to 89c2d2c Compare July 7, 2025 10:08

rosenrodt force-pushed the trtllm-moe-nvfp4-cubin-high-concurrency branch 4 times, most recently from 419a0fc to f94a32b Compare July 8, 2025 08:12

rosenrodt and others added 6 commits July 8, 2025 23:15

update moe trtllm nvfp4 cubins

c1f555c

- add trtllm moe nvfp4 cubins for up to 64 tokens per expert - update nvfp4 trtllm moe default heuristic - TLLM_BATCHED_GEMM_PRINT_NAME=1 to print kernel name and problem shape Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>

enable attention_dp for TRTLLM MoE

09a6e89

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>

feat: Adds optional module cache for TRT-LLM Gen Gemm interfaces

f462557

Signed-off-by: David Clark <215764518+davidclark-nv@users.noreply.github.com> Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>

format

622be6f

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>

add more test

e2389b2

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>

fix deepseek-v3 TRTLLM MoE + attentionDP

7c07651

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>

rosenrodt force-pushed the trtllm-moe-nvfp4-cubin-high-concurrency branch from f94a32b to 7c07651 Compare July 8, 2025 15:16
