[mxfp8 moe training] fallback cuda kernel for when input doesn't meet 2d TMA constraints by danielvegamyhre · Pull Request #3708 · pytorch/ao

danielvegamyhre · 2026-01-23T05:46:03Z

[mxfp8 moe training] fallback cuda kernel for when input doesn't meet 2d TMA constraints

Context

In issue #3636 ("mx_block_rearrange_2d_M_groups_cuda does not support scales_tensor with M-group size = 44"), a user reported a bug when training DeepSeekV3 16b. TL;DR: the CUDA blocked layout kernel for groups along M doesn't support input tensors that don't meet the 2d TMA constraint that the stride be a multiple of 16 bytes.
The short-term mitigation (#3656, "[mxfp8 moe training] cuda blocked layout kernel handling for skinnier scale tensors") was to assert that this constraint is met and throw an informative error prompting the user to use the slower but more flexible Triton kernel instead.
This PR lands a proper fix: a simple non-pipelined CUDA kernel that
- is ~5.5x faster than Triton for DSV3 16b shapes
- handles arbitrary column widths

We dispatch to the faster pipelined kernel when the input allows it, and otherwise fall back to this simpler kernel.
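
For readers skimming the dispatch rule, here is a minimal Python sketch of the check described above. It is not the code in this PR: the function names and the returned labels are hypothetical, and it assumes a contiguous row-major 2D scale tensor with 1-byte elements, so the row stride in bytes equals the column count.

```python
# Hypothetical sketch of the dispatch rule; names are illustrative only.
TMA_STRIDE_ALIGNMENT_BYTES = 16  # 2d TMA constraint cited in the PR description


def meets_2d_tma_stride_constraint(num_cols: int, itemsize: int = 1) -> bool:
    """Return True if the row stride in bytes is a multiple of 16.

    Assumes a contiguous row-major 2D tensor, so the row stride is
    num_cols * itemsize. itemsize=1 is an assumption about the scale dtype.
    """
    return (num_cols * itemsize) % TMA_STRIDE_ALIGNMENT_BYTES == 0


def select_kernel(num_cols: int, itemsize: int = 1) -> str:
    """Pick the pipelined kernel when possible, else the simpler fallback."""
    if meets_2d_tma_stride_constraint(num_cols, itemsize):
        return "pipelined"
    return "fallback"


if __name__ == "__main__":
    # 44 columns is the shape from the bug report; 64 is an aligned example.
    for cols in (44, 64):
        print(cols, select_kernel(cols))
```

Under these assumptions, the 44-column scale tensor from the bug report takes the fallback path, while a width such as 64 stays on the pipelined kernel.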

Tests

pytest test/prototype/moe_training/test_kernels.py -v -s -k cuda_mx_block

Benchmarks

input_shape      chunks_per_tb    torch_time_us    triton_time_us    cuda_time_us  triton_speedup    cuda_speedup
-------------  ---------------  ---------------  ----------------  --------------  ----------------  --------------
(131072, 44)                 1          1105.65            123.94           19.46  8.92x             56.83x
(131072, 44)                 4          1101.41            109.57           19.46  10.05x            56.61x
(131072, 44)                 8          1042.5             207.97           19.46  5.01x             53.58x
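
The table reports both speedups relative to the torch baseline. As a quick cross-check, the ratio of the new CUDA kernel to Triton can be recomputed from the reported times. The snippet below only uses the numbers in the table; the variable names are illustrative.

```python
# (chunks_per_tb, triton_time_us, cuda_time_us), copied from the table above.
rows = [
    (1, 123.94, 19.46),
    (4, 109.57, 19.46),
    (8, 207.97, 19.46),
]

# Speedup of the CUDA fallback kernel over Triton for each chunks_per_tb.
cuda_vs_triton = {chunks: triton_us / cuda_us for chunks, triton_us, cuda_us in rows}

if __name__ == "__main__":
    for chunks, speedup in cuda_vs_triton.items():
        print(f"chunks_per_tb={chunks}: {speedup:.2f}x faster than Triton")
```

The smallest ratio comes from the chunks_per_tb=4 row, where Triton is fastest, which lines up with the ~5.5x figure quoted in the description.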

End to end in Torchtitan on DSV3 16b, I see an extra ~3% TPS speedup with single-node dp2ep and with mxfp8_wgrad_with_hp + mxfp8 all2all/expert parallel enabled.

… 2d TMA constraints stack-info: PR: #3708, branch: danielvegamyhre/stack/119

pytorch-bot · 2026-01-23T05:46:07Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3708

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 7 Pending

As of commit 11d5e00 with merge base 28306f0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

slayton58

Some merge/rebase cleanup to sort, then LGTM

slayton58 · 2026-01-23T16:40:31Z

torchao/csrc/cuda/mx_kernels/mxfp8_extension.cpp

    int chunks_per_tb,
    cudaStream_t stream);

+<<<<<<< Updated upstream


Remove merge detritus :)

danielvegamyhre added a commit that referenced this pull request Jan 23, 2026

[mxfp8 moe training] fallback cuda kernel for when input doesn't meet…

46532e0

… 2d TMA constraints stack-info: PR: #3708, branch: danielvegamyhre/stack/119

danielvegamyhre force-pushed the danielvegamyhre/stack/119 branch from 41f65d2 to 46532e0 January 23, 2026 05:46

meta-cla bot added the CLA Signed label Jan 23, 2026

danielvegamyhre added the mx, moe, and topic: new feature labels Jan 23, 2026

danielvegamyhre marked this pull request as draft January 23, 2026 05:53

danielvegamyhre added a commit that referenced this pull request Jan 23, 2026

[mxfp8 moe training] fallback cuda kernel for when input doesn't meet…

43cebb1

… 2d TMA constraints stack-info: PR: #3708, branch: danielvegamyhre/stack/119

danielvegamyhre force-pushed the danielvegamyhre/stack/119 branch from 46532e0 to 43cebb1 January 23, 2026 05:53

danielvegamyhre marked this pull request as ready for review January 23, 2026 05:53

danielvegamyhre marked this pull request as draft January 23, 2026 06:08

danielvegamyhre added a commit that referenced this pull request Jan 23, 2026

[mxfp8 moe training] fallback cuda kernel for when input doesn't meet…

7a25231

… 2d TMA constraints stack-info: PR: #3708, branch: danielvegamyhre/stack/119

danielvegamyhre force-pushed the danielvegamyhre/stack/119 branch from 43cebb1 to 7a25231 January 23, 2026 06:08

danielvegamyhre marked this pull request as ready for review January 23, 2026 06:08

danielvegamyhre marked this pull request as draft January 23, 2026 06:08

danielvegamyhre force-pushed the danielvegamyhre/stack/119 branch from 7a25231 to 6cad436 January 23, 2026 06:08

danielvegamyhre added a commit that referenced this pull request Jan 23, 2026

[mxfp8 moe training] fallback cuda kernel for when input doesn't meet…

6cad436

… 2d TMA constraints stack-info: PR: #3708, branch: danielvegamyhre/stack/119

danielvegamyhre marked this pull request as ready for review January 23, 2026 06:08

slayton58 approved these changes Jan 23, 2026

[mxfp8 moe training] fallback cuda kernel for when input doesn't meet…

11d5e00

… 2d TMA constraints stack-info: PR: #3708, branch: danielvegamyhre/stack/119

danielvegamyhre marked this pull request as draft January 23, 2026 16:45

danielvegamyhre force-pushed the danielvegamyhre/stack/119 branch from 6cad436 to 11d5e00 January 23, 2026 16:45

danielvegamyhre marked this pull request as ready for review January 23, 2026 16:45

danielvegamyhre merged commit 0cbe590 into main Jan 23, 2026
19 checks passed

danielvegamyhre mentioned this pull request Feb 25, 2026

Expert group padding to multiple of 32 pytorch/torchtitan#2262

Closed
