[mxfp8 moe training] cuda blocked layout kernel handling for skinnier scale tensors by danielvegamyhre · Pull Request #3656 · pytorch/ao

danielvegamyhre · 2026-01-17T18:22:51Z

Stacked PRs:

[mxfp8 moe training] auto-select chunk_width in cuda blocked layout kernel #3658
->[mxfp8 moe training] cuda blocked layout kernel handling for skinnier scale tensors #3656

Summary

Fixes #3636

Context

The bug report shows a scales tensor with 44 columns, which this kernel cannot handle for two reasons:

1. TMA 2D transfers require the stride to be divisible by 16. For this row-major scale tensor the stride would be 44, so creating the Tensormap fails with a non-informative CUDA error.
   - "globalStrides array, which specifies tensor stride of each of the lower tensorRank - 1 dimensions in bytes, must be a multiple of 16 and less than 2^40." - source
2. Even for tensors with stride % 16 == 0, the minimum CHUNK_WIDTH the kernel handles is 64 (this corresponds to a model dim of 2048, as 2048/32 == 64). For smaller models like DSV3 16b, with intermediate dim = 1408 in the MoE layer, this kernel will fail.
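The arithmetic behind both failure modes can be checked with a short sketch. It assumes each scale occupies one byte (an e8m0 value), so a row-major row stride in bytes equals the number of scale columns. The helper names are hypothetical and are not part of torchao.

```python
# Sketch only: helper names are hypothetical, not torchao APIs.
MX_BLOCK_SIZE = 32           # elements covered by one scale (2048 / 32 == 64 above)
SCALE_ITEMSIZE_BYTES = 1     # assumption: scales are stored as 1-byte e8m0 values
TMA_STRIDE_ALIGN_BYTES = 16  # globalStrides must be a multiple of 16 bytes


def scale_cols_for(dim: int) -> int:
    """Number of scale columns for a given model/intermediate dim."""
    return dim // MX_BLOCK_SIZE


def row_stride_bytes(scale_cols: int) -> int:
    """Row stride in bytes of a contiguous row-major scale tensor."""
    return scale_cols * SCALE_ITEMSIZE_BYTES


def tma_stride_ok(scale_cols: int) -> bool:
    """True if the row stride meets the 16-byte TMA alignment rule."""
    return row_stride_bytes(scale_cols) % TMA_STRIDE_ALIGN_BYTES == 0


for dim in (1408, 2048):
    cols = scale_cols_for(dim)
    print(dim, cols, tma_stride_ok(cols))
# prints:
# 1408 44 False
# 2048 64 True
```

Under these assumptions, dim 1408 yields 44 scale columns, which misses the 16-byte alignment, while dim 2048 yields 64 columns and passes.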

Fix

- Add validation with clear error messages enforcing scale_cols % 16 == 0. The errors direct users to the more flexible, but slower, Triton kernel when an unusual model dim produces scale_cols that is not divisible by 16.
- Support chunk widths of 16 and 32 to handle smaller model dims.
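A minimal Python sketch of the kind of up-front check described in the fix is shown here. The function names, the error wording, and the "largest supported width that divides scale_cols" rule are all assumptions made for illustration. The actual validation lives in the CUDA kernel's host code, and the kernel may support widths beyond those listed.

```python
# Sketch only: names, message text, and selection rule are hypothetical.
# 16 and 32 are the widths added by this PR; 64 was the previous minimum.
# Larger widths the kernel may support are omitted here.
ILLUSTRATIVE_CHUNK_WIDTHS = (16, 32, 64)


def validate_scale_cols(scale_cols: int) -> None:
    """Raise a descriptive error instead of letting Tensormap creation fail."""
    if scale_cols <= 0 or scale_cols % 16 != 0:
        raise ValueError(
            f"scale_cols must be a positive multiple of 16 for the CUDA "
            f"blocked layout kernel, got {scale_cols}. "
            f"Use the Triton kernel for this shape instead."
        )


def pick_chunk_width(scale_cols: int) -> int:
    """Pick the largest illustrative chunk width that evenly divides scale_cols."""
    validate_scale_cols(scale_cols)
    for width in sorted(ILLUSTRATIVE_CHUNK_WIDTHS, reverse=True):
        if scale_cols % width == 0:
            return width
    # Unreachable after validation, since 16 always divides a valid scale_cols.
    raise AssertionError("no chunk width found")


print(pick_chunk_width(64))  # 64
print(pick_chunk_width(48))  # 16
try:
    pick_chunk_width(44)
except ValueError as err:
    print(err)
```

Checking before launching the kernel means a user with 44 scale columns sees an actionable message rather than an opaque CUDA error.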

Tests

pytest test/prototype/moe_training/test_kernels.py -v -s -k cuda_mx_block

… scale tensors stack-info: PR: #3656, branch: danielvegamyhre/stack/117

pytorch-bot · 2026-01-17T18:22:55Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3656

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 401405f with merge base a5f2693:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


danielvegamyhre added a commit that referenced this pull request Jan 17, 2026

[mxfp8 moe training] cuda blocked layout kernel handling for skinnier…

a065834

… scale tensors stack-info: PR: #3656, branch: danielvegamyhre/stack/117

danielvegamyhre force-pushed the danielvegamyhre/stack/117 branch from 3a50cd4 to a065834 Compare January 17, 2026 18:22

meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 17, 2026

danielvegamyhre added mx topic: bug fix Use this tag for PRs that fix bugs topic: improvement Use this tag if this PR is an improvement (doesn't fit into any of the other categories) moe labels Jan 17, 2026

danielvegamyhre marked this pull request as draft January 17, 2026 18:32

danielvegamyhre added a commit that referenced this pull request Jan 17, 2026

[mxfp8 moe training] cuda blocked layout kernel handling for skinnier…

b6881a0

… scale tensors stack-info: PR: #3656, branch: danielvegamyhre/stack/117

danielvegamyhre force-pushed the danielvegamyhre/stack/117 branch from a065834 to b6881a0 Compare January 17, 2026 18:32

danielvegamyhre marked this pull request as ready for review January 17, 2026 18:32

danielvegamyhre marked this pull request as draft January 17, 2026 18:35

danielvegamyhre added a commit that referenced this pull request Jan 17, 2026

[mxfp8 moe training] cuda blocked layout kernel handling for skinnier…

871b5b1

… scale tensors stack-info: PR: #3656, branch: danielvegamyhre/stack/117

danielvegamyhre force-pushed the danielvegamyhre/stack/117 branch from b6881a0 to 871b5b1 Compare January 17, 2026 18:35

danielvegamyhre marked this pull request as ready for review January 17, 2026 18:35

danielvegamyhre marked this pull request as draft January 17, 2026 19:19

danielvegamyhre marked this pull request as ready for review January 17, 2026 19:19

This was referenced Jan 17, 2026

[mxfp8 moe training] auto-select chunk_width in cuda blocked layout kernel #3658

Merged

mx_block_rearrange_2d_M_groups_cuda does not support scales_tensor with M-group size = 44 #3636

Closed

danielvegamyhre marked this pull request as draft January 17, 2026 23:01

danielvegamyhre marked this pull request as ready for review January 17, 2026 23:01

danielvegamyhre marked this pull request as draft January 17, 2026 23:02

danielvegamyhre marked this pull request as ready for review January 17, 2026 23:02

[mxfp8 moe training] cuda blocked layout kernel handling for skinnier…

401405f

… scale tensors stack-info: PR: #3656, branch: danielvegamyhre/stack/117

danielvegamyhre marked this pull request as draft January 17, 2026 23:08

danielvegamyhre force-pushed the danielvegamyhre/stack/117 branch from 871b5b1 to 401405f Compare January 17, 2026 23:08

danielvegamyhre marked this pull request as ready for review January 17, 2026 23:09

danielvegamyhre requested a review from drisspg January 18, 2026 17:31

drisspg approved these changes Jan 20, 2026

View reviewed changes

danielvegamyhre merged commit 80bae6b into main Jan 20, 2026
21 checks passed

jcaip pushed a commit that referenced this pull request Jan 22, 2026

[mxfp8 moe training] cuda blocked layout kernel handling for skinnier…

3137235

… scale tensors (#3656)

danielvegamyhre mentioned this pull request Jan 23, 2026

[mxfp8 moe training] fallback cuda kernel for when input doesn't meet 2d TMA constraints #3708

Merged

danielvegamyhre merged 1 commit into main from danielvegamyhre/stack/117
