[mxfp8 moe training] integrate mxfp8 grouped gemm and triton kernels for scale conversion to blocked format #2977
Conversation

🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2977

Note: Links to docs will display an error until the docs builds have completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
    
Returns:
    - starting_row_after_padding: 1D integer tensor representing the starting row after padding each to blocked format.
    - starting_col_after_padding: 1D integer tensor representing the starting row after padding each to blocked format.
nit: row -> col (in the starting_col_after_padding description)
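For orientation, here is a minimal sketch of what a padded-offsets helper like this might compute. The helper name, signature, and the 128-row alignment are assumptions for illustration only; the PR's actual conversion happens in Triton kernels.

```python
import torch

def starting_offsets_after_padding(group_sizes: torch.Tensor,
                                   row_align: int = 128) -> torch.Tensor:
    """Hypothetical sketch: given per-group row counts, compute each group's
    starting row after every group is padded up to a multiple of `row_align`
    (the row alignment the blocked scale layout is assumed to require).

    Returns a 1D tensor of length len(group_sizes) + 1, where entry i is the
    starting row of group i and the last entry is the total padded row count.
    """
    padded = ((group_sizes + row_align - 1) // row_align) * row_align
    starts = torch.zeros(len(group_sizes) + 1, dtype=torch.int64)
    starts[1:] = torch.cumsum(padded, dim=0)
    return starts

# Example: groups of 100, 300, and 50 rows -> starts at 0, 128, 512 (total 640).
print(starting_offsets_after_padding(torch.tensor([100, 300, 50])))
```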
    out_dtype=out_dtype,
)

# Store what we need for backward before returning.
nit: comment not needed
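For context, the flagged comment annotates the standard torch.autograd.Function idiom of saving tensors before forward returns. A minimal generic sketch of that idiom (not the PR's actual forward):

```python
import torch

class _GroupedMMSketch(torch.autograd.Function):
    """Minimal generic sketch (not the PR's forward) of the idiom the
    diff comment annotates: stash tensors for backward, then return."""

    @staticmethod
    def forward(ctx, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        out = a @ b
        # Store what we need for backward before returning.
        ctx.save_for_backward(a, b)
        return out

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor):
        a, b = ctx.saved_tensors
        # d(a@b)/da = grad_out @ b^T ;  d(a@b)/db = a^T @ grad_out
        return grad_out @ b.t(), a.t() @ grad_out
```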
    
Summary
This PR integrates all of the recently landed grouped GEMMs and Triton kernels for per-group scale conversion (details below) into mxfp8 MoE training:
- torch._scaled_grouped_mm (MXFP8 grouped GEMM support for torch._scaled_grouped_mm + submodule bump, pytorch#162209)

Test plan
pytest test/prototype/moe_training/test_scaled_grouped_mm.py -k test_mxfp8_grouped_gemm_with_dq_fwd_bwd -s
(A sketch of the grouped GEMM semantics this test exercises appears at the end of this description.)

Next steps
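For intuition about the test plan above, here is a hedged, high-precision reference of what a 2D x 3D grouped GEMM with row offsets computes. The function name and shapes are illustrative assumptions; the mxfp8 path is presumably compared against a dequantized reference of this form (hence the "with_dq" in the test name).

```python
import torch

def grouped_mm_reference(a: torch.Tensor, b: torch.Tensor,
                         offs: torch.Tensor) -> torch.Tensor:
    """Hypothetical bf16/fp32 reference for a 2D x 3D grouped GEMM: rows of
    `a` are split into groups by `offs` (cumulative row ends), and group g is
    multiplied by expert weight b[g]. An mxfp8 grouped GEMM is expected to
    approximate this result after quantize/dequantize."""
    outs = []
    start = 0
    for g, end in enumerate(offs.tolist()):
        outs.append(a[start:end] @ b[g])
        start = end
    return torch.cat(outs, dim=0)

# Example: 2 experts, token groups of 3 and 5 rows, K=16, N=8.
a = torch.randn(8, 16)
b = torch.randn(2, 16, 8)
out = grouped_mm_reference(a, b, offs=torch.tensor([3, 8]))
print(out.shape)  # torch.Size([8, 8])
```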