
[mxfp8 moe training] default to triton kernel for dim0 cast #3560

Merged
danielvegamyhre merged 1 commit into main from danielvegamyhre/stack/93
Jan 2, 2026

Conversation

@danielvegamyhre
Contributor

@danielvegamyhre danielvegamyhre commented Jan 1, 2026

Stacked PRs:


[mxfp8 moe training] default to triton kernel for dim0 cast

torch.compile is still slow for RCEIL scale rounding (see pytorch/pytorch#170635). Since we are migrating to RCEIL as the default for MXFP8 training, we should default to the triton dim0 cast kernel, which has decent performance (~6,000 GB/s on a 1000W B200).
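For context, the RCEIL rounding mode can be sketched in plain Python. This is a hypothetical reference illustration of the scale computation, not the triton kernel from this PR: each 32-element MX block gets a power-of-two scale whose exponent is rounded *up* (ceiling), so the scaled values never exceed the float8_e4m3fn max of 448.

```python
import math

F8E4M3_MAX = 448.0                       # max magnitude representable in float8_e4m3fn
E8M0_MIN_EXP, E8M0_MAX_EXP = -127, 127   # E8M0 scale exponent range

def rceil_scale(block_amax: float) -> float:
    """Power-of-two scale for one 32-element MX block, RCEIL rounding.

    Hypothetical sketch: using exp = ceil(log2(amax / 448)) guarantees
    amax / scale <= 448, so quantization to e4m3 never overflows.
    """
    if block_amax == 0.0:
        return 2.0 ** E8M0_MIN_EXP
    exp = math.ceil(math.log2(block_amax / F8E4M3_MAX))
    exp = max(E8M0_MIN_EXP, min(E8M0_MAX_EXP, exp))
    return 2.0 ** exp

# amax exactly at the fp8 max needs no scaling; just above it needs scale 2
print(rceil_scale(448.0))  # 1.0
print(rceil_scale(449.0))  # 2.0
```

The ceiling (rather than nearest/floor) rounding is what makes the no-overflow guarantee unconditional, at the cost of occasionally losing up to one bit of dynamic range.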

(torch) dev@gpu-dev-8f27b069:~/ao$ PYTHONPATH=/home/dev/ao:$PYTHONPATH python benchmarks/mx_formats/cast_bench.py  --mode dim0_mxfp8_triton_rceil
W0102 04:28:08.388000 67699 site-packages/torch/_library/triton.py:222] _dequant_mxfp8_kernel not in collector.assignments
M 16384 K 16384 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.11.0.dev20260101+cu128
triton version: 3.6.0
mode: dim0_mxfp8_triton_rceil
time_us 136.25599443912506
mem_bw_gbps 5971.810483270397
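The reported bandwidth is consistent with the bytes moved. Assuming the benchmark counts a bf16 input read (2 B/element), an fp8 output write (1 B/element), and one E8M0 scale byte per 32-element block (an assumption about its accounting, not taken from its source), a quick sanity check:

```python
M = K = 16384
time_us = 136.25599443912506  # from the benchmark output above

# Assumed traffic: read bf16 input, write fp8 output,
# plus one E8M0 scale byte per 32-element block.
bytes_moved = M * K * 2 + M * K * 1 + (M * K) // 32

mem_bw_gbps = bytes_moved / (time_us * 1e-6) / 1e9
print(round(mem_bw_gbps, 1))  # 5971.8, matching the reported mem_bw_gbps
```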

@pytorch-bot

pytorch-bot bot commented Jan 1, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3560

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 3ac311d with merge base 8bb433e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

stack-info: PR: #3560, branch: danielvegamyhre/stack/93
@meta-cla meta-cla bot added the CLA Signed label Jan 1, 2026
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/93 branch from b07510a to 3ac311d Compare January 1, 2026 00:32
@danielvegamyhre danielvegamyhre added the mx, topic: performance, moe, and topic: improvement labels and removed the topic: performance label Jan 2, 2026
@danielvegamyhre danielvegamyhre merged commit 2319156 into main Jan 2, 2026
29 of 31 checks passed

Labels

CLA Signed · moe · mx · topic: improvement


2 participants