
[mxfp8 moe training] auto-select chunk_width in cuda blocked layout kernel#3658

Merged
danielvegamyhre merged 1 commit into main from danielvegamyhre/stack/118
Jan 20, 2026


@danielvegamyhre danielvegamyhre commented Jan 17, 2026

Stacked PRs:


[mxfp8 moe training] auto-select chunk_width in cuda blocked layout kernel

Summary

  • Make the kernel less error-prone and simplify the user-facing API by auto-selecting the chunk_width used in the CUDA blocked layout kernel for groups along M.
  • chunk_width should be selected by heuristics at dispatch time rather than specified manually by the user, similar to how users of GEMM kernels don't specify the block size or the MMA instruction width (wgmma n64m16k16 vs n64m128k16, etc.).
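To illustrate what dispatch-time selection could look like, here is a minimal sketch in Python. It is not the heuristic used in this PR: the function name `select_chunk_width`, the candidate set `(16, 32, 64)`, and the "largest width that fits" rule are all assumptions made for illustration.

```python
# Hypothetical sketch of dispatch-time chunk_width selection.
# Nothing here is taken from the actual kernel: the candidate widths and the
# selection rule are assumptions chosen only to illustrate the idea.

CANDIDATE_CHUNK_WIDTHS = (16, 32, 64)  # assumed set of supported widths


def select_chunk_width(scale_cols: int, candidates=CANDIDATE_CHUNK_WIDTHS) -> int:
    """Pick the largest candidate width that does not exceed the column count.

    Falls back to the smallest candidate when the input is narrower than
    every candidate.
    """
    if scale_cols <= 0:
        raise ValueError("scale_cols must be positive")
    fitting = [w for w in candidates if w <= scale_cols]
    return max(fitting) if fitting else min(candidates)


def dispatch(scale_cols: int) -> dict:
    """Stand-in for a kernel launch: the caller never passes chunk_width."""
    return {"scale_cols": scale_cols, "chunk_width": select_chunk_width(scale_cols)}
```

With these assumptions, a caller only supplies the input, for example `dispatch(160)`, and the width is chosen internally. The real heuristic may weigh other factors, such as occupancy or the chunks_per_tb setting.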

Note: we should also add auto-selection for chunks_per_tb

Tests

  • pytest test/prototype/moe_training/test_kernels.py -v -s -k cuda_mx_block

Benchmarks

Performance is the same as the original numbers in #3546, or slightly better for some shapes.

input_shape      chunk_width    chunks_per_tb    torch_time_us    triton_time_us    cuda_time_us  triton_vs_torch    cuda_vs_torch      cuda_vs_triton
-------------  -------------  ---------------  ---------------  ----------------  --------------  -----------------  ---------------  ----------------
(32768, 160)              64                1           122.91             60.45           17.41  2.03x              7.06x                        3.47x
(32768, 160)              64                4           560.88             52.19           15.39  10.75x             36.44x                       3.39x
(32768, 160)              64                8           557.07             37.89           17.44  14.70x             31.94x                       2.17x
(131072, 160)             64                1          1867.25            128              31.78  14.59x             58.76x                       4.03x
(131072, 160)             64                4          1856.22            136.22           25.63  13.63x             72.42x                       5.31x
(131072, 160)             64                8          1853.86            279.55           23.55  6.63x              78.71x                      11.87x
(131072, 64)              64                1          1595.49            148.48           15.49  10.75x             103.01x                      9.59x
(131072, 64)              64                4          1528.14            103.42           15.39  14.78x             99.28x                       6.72x
(131072, 64)              64                8          1542.43            109.57           17.44  14.08x             88.44x                       6.28x
(131072, 224)             64                1          4884.77            234.4            39.97  20.84x             122.22x                      5.86x
(131072, 224)             64                4          4943.9             189.44           31.78  26.10x             155.59x                      5.96x
(131072, 224)             64                8          4923.94            211.97           29.73  23.23x             165.63x                      7.13x
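The three speedup columns appear to be plain ratios of the timing columns (baseline time divided by candidate time). The following sketch recomputes them for the first row of the table; the helper name `speedup` is hypothetical and is not part of the benchmark script.

```python
# Recompute the speedup columns from the timing columns of one table row.
# `speedup` is a hypothetical helper, not a function from the benchmark script.


def speedup(baseline_us: float, candidate_us: float) -> str:
    """Format baseline/candidate as a speedup string such as '7.06x'."""
    return f"{baseline_us / candidate_us:.2f}x"


# First row of the table: input_shape (32768, 160), chunks_per_tb 1.
row = {"torch_time_us": 122.91, "triton_time_us": 60.45, "cuda_time_us": 17.41}

triton_vs_torch = speedup(row["torch_time_us"], row["triton_time_us"])
cuda_vs_torch = speedup(row["torch_time_us"], row["cuda_time_us"])
cuda_vs_triton = speedup(row["triton_time_us"], row["cuda_time_us"])

print(triton_vs_torch, cuda_vs_torch, cuda_vs_triton)
```

The recomputed values match the 2.03x, 7.06x and 3.47x reported for that row.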

…ernel

stack-info: PR: #3658, branch: danielvegamyhre/stack/118
danielvegamyhre added a commit that referenced this pull request Jan 17, 2026
…ernel

stack-info: PR: #3658, branch: danielvegamyhre/stack/118
pytorch-bot bot commented Jan 17, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3658

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 2 Pending

As of commit ca73d5d with merge base 80bae6b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/118 branch from 00e7bfa to 086bc3f Compare January 17, 2026 19:19
@meta-cla meta-cla bot added the CLA Signed label Jan 17, 2026
@danielvegamyhre danielvegamyhre added the mx, topic: improvement, and moe labels Jan 17, 2026
@danielvegamyhre danielvegamyhre marked this pull request as draft January 17, 2026 23:01
@danielvegamyhre danielvegamyhre changed the base branch from danielvegamyhre/stack/117 to main January 17, 2026 23:01
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/118 branch from 086bc3f to f491d5a Compare January 17, 2026 23:01
@danielvegamyhre danielvegamyhre changed the base branch from main to danielvegamyhre/stack/117 January 17, 2026 23:01
@danielvegamyhre danielvegamyhre marked this pull request as ready for review January 17, 2026 23:01
@danielvegamyhre danielvegamyhre marked this pull request as draft January 17, 2026 23:02
@danielvegamyhre danielvegamyhre changed the base branch from danielvegamyhre/stack/117 to main January 17, 2026 23:02
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/118 branch from f491d5a to 1c9c708 Compare January 17, 2026 23:02
@danielvegamyhre danielvegamyhre changed the base branch from main to danielvegamyhre/stack/117 January 17, 2026 23:02
@danielvegamyhre danielvegamyhre marked this pull request as ready for review January 17, 2026 23:02
@danielvegamyhre danielvegamyhre marked this pull request as draft January 17, 2026 23:08
@danielvegamyhre danielvegamyhre changed the base branch from danielvegamyhre/stack/117 to main January 17, 2026 23:08
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/118 branch from 1c9c708 to e70c69d Compare January 17, 2026 23:08
@danielvegamyhre danielvegamyhre changed the base branch from main to danielvegamyhre/stack/117 January 17, 2026 23:09
@danielvegamyhre danielvegamyhre marked this pull request as ready for review January 17, 2026 23:09

@drisspg drisspg left a comment


image

@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/118 branch from e70c69d to ca73d5d Compare January 20, 2026 18:00
@danielvegamyhre danielvegamyhre changed the base branch from danielvegamyhre/stack/117 to main January 20, 2026 18:00
@danielvegamyhre danielvegamyhre merged commit 8fa0b5a into main Jan 20, 2026
16 of 22 checks passed
