[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline by danielvegamyhre · Pull Request #3585 · pytorch/ao

danielvegamyhre · 2026-01-06T22:37:25Z

Stacked PRs:

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline

This stack creates a set of differentiable MXFP8 expert parallel primitives for MoE training:

a2a_dispatch
permute
unpermute
a2a_combine

The key idea is instead of waiting to quantize to MXFP8 until directly before the grouped GEMMs, we can quantize earlier, before the a2a collectives, to speed up these exposed comms. This requires staying in MXFP8 through the token permutation, then feeding those permuted fp8 data + scales into the MXFP8 grouped GEMM, which produces bf16 outputs.

We do this in both the forward pass and backward pass.

Design

Each color corresponds to a separate autograd function:

Benchmarks

This benchmark measures the entire forward + backward pass of the full EP pipeline:

Forward:

a2a_dispatch
permute
MXFP8 grouped GEMM
unpermute
a2a_combine

Backward:

a2a_combine backward()
unpermute backward()
MXFP8 grouped GEMM backward()
permute backward()
a2a_dispatch backward()

For DeepSeekV3 shapes we see:

1.4x speedup over vanilla bf16 pipeline (with torch._grouped_mm)
1.22x speedup over existing torchao pipeline (with bf16 all2alls, mxfp8 grouped GEMMs)

Note this is WITHOUT the mxfp8 grouped GEMM improvements landing in pytorch core soon, so we expect even higher speedups then! We have lower speedups than usual right now due to improvments landing to bf16 grouped gemm but none to mxfp8 grouped gemm yet.

Versus vanilla bf16 pipeline:

Expert Parallelism Pipeline Benchmark Results
World Size: 2
========================================================================================================================
+----------+-------+--------------+---------------+---------------+----------------+---------------+---------------+----------------+---------------+-----------------+
|   tokens |   dim |   hidden_dim |   num_experts |   fwd_bf16_ms |   fwd_mxfp8_ms | fwd_speedup   |   bwd_bf16_ms |   bwd_mxfp8_ms | bwd_speedup   | total_speedup   |
+==========+=======+==============+===============+===============+================+===============+===============+================+===============+=================+
|   131072 |  8192 |         5120 |             8 |        17.537 |         11.7   | 1.50x         |        29.303 |         21.563 | 1.36x         | 1.41x           |
+----------+-------+--------------+---------------+---------------+----------------+---------------+---------------+----------------+---------------+-----------------+
|   131072 |  7168 |         2048 |             8 |         9.285 |          6.387 | 1.45x         |        13.718 |         10.004 | 1.37x         | 1.40x           |
+----------+-------+--------------+---------------+---------------+----------------+---------------+---------------+----------------+---------------+-----------------+

Versus existing torchao strategy (bf16 all2alls, quantize to mxfp8 directly before grouped gemms)

========================================================================================================================
Expert Parallelism Pipeline Benchmark Results
World Size: 2
========================================================================================================================
+----------+-------+--------------+---------------+---------------+----------------+---------------+---------------+----------------+---------------+-----------------+
|   tokens |   dim |   hidden_dim |   num_experts |   fwd_bf16_ms |   fwd_mxfp8_ms | fwd_speedup   |   bwd_bf16_ms |   bwd_mxfp8_ms | bwd_speedup   | total_speedup   |
+==========+=======+==============+===============+===============+================+===============+===============+================+===============+=================+
|   131072 |  8192 |         5120 |             8 |        13.472 |         11.947 | 1.13x         |        25.453 |         21.891 | 1.16x         | 1.15x           |
+----------+-------+--------------+---------------+---------------+----------------+---------------+---------------+----------------+---------------+-----------------+
|   131072 |  7168 |         2048 |             8 |         7.903 |          6.388 | 1.24x         |        12.278 |         10.021 | 1.23x         | 1.23x           |
+----------+-------+--------------+---------------+---------------+----------------+---------------+---------------+----------------+---------------+-----------------+

stack-info: PR: #3585, branch: danielvegamyhre/stack/109

pytorch-bot · 2026-01-06T22:37:28Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3585

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

stack-info: PR: #3585, branch: danielvegamyhre/stack/109

vkuzo

didn't read closely, stamp for prototype

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline

e682d0c

stack-info: PR: #3585, branch: danielvegamyhre/stack/109

danielvegamyhre added a commit that referenced this pull request Jan 6, 2026

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline

e85cb48

stack-info: PR: #3585, branch: danielvegamyhre/stack/109

danielvegamyhre force-pushed the danielvegamyhre/stack/109 branch from 5388cac to e85cb48 Compare January 6, 2026 22:37

danielvegamyhre force-pushed the danielvegamyhre/stack/108 branch from edae1fa to 93f767e Compare January 6, 2026 22:37

meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 6, 2026

danielvegamyhre marked this pull request as draft January 6, 2026 23:28

danielvegamyhre changed the base branch from danielvegamyhre/stack/108 to main January 6, 2026 23:28

danielvegamyhre force-pushed the danielvegamyhre/stack/109 branch from e85cb48 to 5f4b776 Compare January 6, 2026 23:28

danielvegamyhre added a commit that referenced this pull request Jan 6, 2026

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline

5f4b776

stack-info: PR: #3585, branch: danielvegamyhre/stack/109

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/108 January 6, 2026 23:28

danielvegamyhre marked this pull request as ready for review January 6, 2026 23:29

danielvegamyhre marked this pull request as draft January 6, 2026 23:43

danielvegamyhre changed the base branch from danielvegamyhre/stack/108 to main January 6, 2026 23:43

danielvegamyhre added a commit that referenced this pull request Jan 6, 2026

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline

ed7dae9

stack-info: PR: #3585, branch: danielvegamyhre/stack/109

danielvegamyhre force-pushed the danielvegamyhre/stack/109 branch from 5f4b776 to ed7dae9 Compare January 6, 2026 23:43

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/108 January 6, 2026 23:44

danielvegamyhre marked this pull request as ready for review January 6, 2026 23:44

danielvegamyhre marked this pull request as draft January 6, 2026 23:47

danielvegamyhre changed the base branch from danielvegamyhre/stack/108 to main January 6, 2026 23:47

danielvegamyhre added a commit that referenced this pull request Jan 6, 2026

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline

2862df3

stack-info: PR: #3585, branch: danielvegamyhre/stack/109

danielvegamyhre force-pushed the danielvegamyhre/stack/109 branch from ed7dae9 to 2862df3 Compare January 6, 2026 23:47

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/108 January 6, 2026 23:48

danielvegamyhre marked this pull request as ready for review January 6, 2026 23:48

danielvegamyhre added a commit that referenced this pull request Jan 7, 2026

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline

b916621

stack-info: PR: #3585, branch: danielvegamyhre/stack/109

danielvegamyhre force-pushed the danielvegamyhre/stack/109 branch from 21f6964 to b916621 Compare January 7, 2026 00:29

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/108 January 7, 2026 00:29

danielvegamyhre marked this pull request as ready for review January 7, 2026 00:29

danielvegamyhre added mx moe topic: new feature Use this tag if this PR adds a new feature labels Jan 7, 2026

danielvegamyhre marked this pull request as draft January 7, 2026 18:27

danielvegamyhre changed the base branch from danielvegamyhre/stack/108 to main January 7, 2026 18:27

danielvegamyhre added a commit that referenced this pull request Jan 7, 2026

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline

bbaa0a4

stack-info: PR: #3585, branch: danielvegamyhre/stack/109

danielvegamyhre force-pushed the danielvegamyhre/stack/109 branch from b916621 to bbaa0a4 Compare January 7, 2026 18:27

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/108 January 7, 2026 18:28

danielvegamyhre marked this pull request as ready for review January 7, 2026 18:28

danielvegamyhre marked this pull request as draft January 7, 2026 18:59

danielvegamyhre changed the base branch from danielvegamyhre/stack/108 to main January 7, 2026 18:59

danielvegamyhre added a commit that referenced this pull request Jan 7, 2026

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline

a5acc85

stack-info: PR: #3585, branch: danielvegamyhre/stack/109

danielvegamyhre force-pushed the danielvegamyhre/stack/109 branch from bbaa0a4 to a5acc85 Compare January 7, 2026 18:59

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/108 January 7, 2026 19:00

danielvegamyhre marked this pull request as ready for review January 7, 2026 19:00

danielvegamyhre marked this pull request as draft January 7, 2026 21:57

danielvegamyhre changed the base branch from danielvegamyhre/stack/108 to main January 7, 2026 21:57

danielvegamyhre added a commit that referenced this pull request Jan 7, 2026

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline

cc59b9a

stack-info: PR: #3585, branch: danielvegamyhre/stack/109

danielvegamyhre force-pushed the danielvegamyhre/stack/109 branch from a5acc85 to cc59b9a Compare January 7, 2026 21:57

danielvegamyhre changed the base branch from main to danielvegamyhre/stack/108 January 7, 2026 21:57

danielvegamyhre marked this pull request as ready for review January 7, 2026 21:58

danielvegamyhre marked this pull request as draft January 7, 2026 22:07

danielvegamyhre changed the base branch from danielvegamyhre/stack/108 to main January 7, 2026 22:07

danielvegamyhre added a commit that referenced this pull request Jan 7, 2026

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline

d46c577

stack-info: PR: #3585, branch: danielvegamyhre/stack/109

danielvegamyhre mentioned this pull request Jan 9, 2026

[mxfp8 moe training] add torch.compile support #3606

Merged

vkuzo approved these changes Jan 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline#3585

[mxfp8 moe training] add benchmark for e2e mxfp8 EP pipeline#3585
danielvegamyhre merged 1 commit intomainfrom
danielvegamyhre/stack/109

danielvegamyhre commented Jan 6, 2026 •

edited

Loading

Uh oh!

pytorch-bot bot commented Jan 6, 2026 •

edited

Loading

Uh oh!

vkuzo left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

danielvegamyhre commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!