
[mxfp8 moe training] cuda blocked layout kernel handling for skinnier scale tensors #3656

Merged
danielvegamyhre merged 1 commit into main from danielvegamyhre/stack/117
Jan 20, 2026

@danielvegamyhre
Contributor

@danielvegamyhre danielvegamyhre commented Jan 17, 2026

Summary

Fixes #3636

Context

The bug report shows a scales tensor with 44 columns, which this kernel cannot handle for two reasons:

  1. TMA 2D transfers require each global stride, in bytes, to be a multiple of 16. For this row-major scale tensor the stride would be 44, so creating the tensor map fails with an uninformative CUDA error.
    • "globalStrides array, which specifies tensor stride of each of the lower tensorRank - 1 dimensions in bytes, must be a multiple of 16 and less than 2^40." - source
  2. Even for tensors with stride % 16 == 0, the minimum CHUNK_WIDTH the kernel handles is 64 (this corresponds to a model dim of 2048, since 2048/32 == 64). For smaller models such as DSV3 16b, with intermediate dim = 1408 in the MoE layer, this kernel fails.
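The arithmetic behind both failure modes can be sketched in a few lines of Python. This is only an illustration, not code from the PR: the helper names are hypothetical, the block size of 32 is taken from the 2048/32 == 64 example above, and one byte per scale element is an assumption (so that the column count equals the row stride in bytes).

```python
# Illustrative only: hypothetical helpers, not part of torchao.
MX_BLOCK_SIZE = 32               # elements per scale, per the 2048/32 == 64 example
TMA_STRIDE_ALIGNMENT_BYTES = 16  # from the quoted globalStrides requirement
SCALE_BYTES = 1                  # assumption: each scale element occupies one byte


def scale_cols_for(dim: int) -> int:
    """Number of scale columns for a dimension split into blocks of 32."""
    if dim % MX_BLOCK_SIZE != 0:
        raise ValueError(f"dim={dim} is not a multiple of {MX_BLOCK_SIZE}")
    return dim // MX_BLOCK_SIZE


def row_stride_is_tma_aligned(scale_cols: int) -> bool:
    """True if the row-major row stride (in bytes) is a multiple of 16."""
    return (scale_cols * SCALE_BYTES) % TMA_STRIDE_ALIGNMENT_BYTES == 0


for dim in (1408, 2048):
    cols = scale_cols_for(dim)
    print(dim, cols, row_stride_is_tma_aligned(cols))
# prints:
# 1408 44 False
# 2048 64 True
```

Under these assumptions, the dim of 1408 yields the 44-column scale tensor from the bug report, which fails the 16-byte stride check, while 2048 yields the 64 columns the kernel already handled.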

Fix

  • Add validation with clear error messages enforcing scale_cols % 16 == 0. The messages direct users to the more flexible, but slower, Triton kernel when an unusual model dim yields scale_cols that is not divisible by 16.
  • Support chunk_size 16 and 32 to handle smaller model dims.
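A minimal sketch of what such a validation and chunk-width choice could look like on the Python side is shown here. It is not the PR's implementation: the function name, the error text, and the rule of picking the largest supported width that divides scale_cols are all assumptions made for illustration.

```python
# Hypothetical sketch; names and selection rule are assumptions, not the PR's code.
SUPPORTED_CHUNK_WIDTHS = (64, 32, 16)  # 64 existed before; 16 and 32 are the new sizes


def pick_chunk_width(scale_cols: int) -> int:
    """Validate scale_cols and return a chunk width the CUDA path could use."""
    if scale_cols <= 0 or scale_cols % 16 != 0:
        raise ValueError(
            f"scale_cols={scale_cols} must be a positive multiple of 16 for the "
            "CUDA kernel; use the more flexible (but slower) Triton kernel instead."
        )
    # Assumed rule: prefer the widest chunk that evenly divides the row.
    for width in SUPPORTED_CHUNK_WIDTHS:
        if scale_cols % width == 0:
            return width
    raise AssertionError("unreachable: any multiple of 16 is divisible by 16")


print(pick_chunk_width(64))  # 64
print(pick_chunk_width(48))  # 16
try:
    pick_chunk_width(44)
except ValueError as err:
    print(err)
```

Failing early with a message that names the alternative kernel replaces the opaque tensor-map creation error described in the Context section.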

Tests

  • pytest test/prototype/moe_training/test_kernels.py -v -s -k cuda_mx_block

danielvegamyhre added a commit that referenced this pull request Jan 17, 2026
… scale tensors

stack-info: PR: #3656, branch: danielvegamyhre/stack/117
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/117 branch from 3a50cd4 to a065834 Compare January 17, 2026 18:22
@pytorch-bot

pytorch-bot bot commented Jan 17, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3656

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 401405f with merge base a5f2693:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Jan 17, 2026
@danielvegamyhre danielvegamyhre added the mx, topic: bug fix, topic: improvement, and moe labels Jan 17, 2026
@danielvegamyhre danielvegamyhre marked this pull request as draft January 17, 2026 18:32
danielvegamyhre added a commit that referenced this pull request Jan 17, 2026
… scale tensors

stack-info: PR: #3656, branch: danielvegamyhre/stack/117
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/117 branch from a065834 to b6881a0 Compare January 17, 2026 18:32
@danielvegamyhre danielvegamyhre marked this pull request as ready for review January 17, 2026 18:32
@danielvegamyhre danielvegamyhre marked this pull request as draft January 17, 2026 18:35
danielvegamyhre added a commit that referenced this pull request Jan 17, 2026
… scale tensors

stack-info: PR: #3656, branch: danielvegamyhre/stack/117
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/117 branch from b6881a0 to 871b5b1 Compare January 17, 2026 18:35
@danielvegamyhre danielvegamyhre marked this pull request as ready for review January 17, 2026 18:35
@danielvegamyhre danielvegamyhre marked this pull request as draft January 17, 2026 19:19
@danielvegamyhre danielvegamyhre marked this pull request as ready for review January 17, 2026 19:19
@danielvegamyhre danielvegamyhre marked this pull request as draft January 17, 2026 23:01
@danielvegamyhre danielvegamyhre marked this pull request as ready for review January 17, 2026 23:01
@danielvegamyhre danielvegamyhre marked this pull request as draft January 17, 2026 23:02
@danielvegamyhre danielvegamyhre marked this pull request as ready for review January 17, 2026 23:02
… scale tensors

stack-info: PR: #3656, branch: danielvegamyhre/stack/117
@danielvegamyhre danielvegamyhre marked this pull request as draft January 17, 2026 23:08
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/117 branch from 871b5b1 to 401405f Compare January 17, 2026 23:08
@danielvegamyhre danielvegamyhre marked this pull request as ready for review January 17, 2026 23:09

@drisspg drisspg left a comment

image

@danielvegamyhre danielvegamyhre merged commit 80bae6b into main Jan 20, 2026
21 checks passed
jcaip pushed a commit that referenced this pull request Jan 22, 2026

Development

Successfully merging this pull request may close these issues.

mx_block_rearrange_2d_M_groups_cuda does not support scales_tensor with M-group size = 44

2 participants