
[mxfp8 moe training] default to triton kernel for dim0 cast #3560

Merged
danielvegamyhre merged 1 commit into main from danielvegamyhre/stack/93
Jan 2, 2026

Conversation

@danielvegamyhre
Contributor

@danielvegamyhre danielvegamyhre commented Jan 1, 2026

Stacked PRs:


[mxfp8 moe training] default to triton kernel for dim0 cast

torch.compile is still slow for RCEIL scale rounding (see pytorch/pytorch#170635). Since we are migrating to RCEIL as the default for MXFP8 training, we should default to the triton dim0 cast kernel, which has decent performance (~6,000 GB/s on a 1000W B200).
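For context, the RCEIL rounding mode can be sketched in plain Python. This is a hypothetical reference illustration of the scale computation, not the triton kernel from this PR: each 32-element MX block gets a power-of-two scale whose exponent is rounded *up* (ceiling), so the scaled values never exceed the float8_e4m3fn max of 448.

```python
import math

F8E4M3_MAX = 448.0                       # max magnitude representable in float8_e4m3fn
E8M0_MIN_EXP, E8M0_MAX_EXP = -127, 127   # E8M0 scale exponent range

def rceil_scale(block_amax: float) -> float:
    """Power-of-two scale for one 32-element MX block, RCEIL rounding.

    Hypothetical sketch: using exp = ceil(log2(amax / 448)) guarantees
    amax / scale <= 448, so quantization to e4m3 never overflows.
    """
    if block_amax == 0.0:
        return 2.0 ** E8M0_MIN_EXP
    exp = math.ceil(math.log2(block_amax / F8E4M3_MAX))
    exp = max(E8M0_MIN_EXP, min(E8M0_MAX_EXP, exp))
    return 2.0 ** exp

# amax exactly at the fp8 max needs no scaling; just above it needs scale 2
print(rceil_scale(448.0))  # 1.0
print(rceil_scale(449.0))  # 2.0
```

The ceiling (rather than nearest/floor) rounding is what makes the no-overflow guarantee unconditional, at the cost of occasionally losing up to one bit of dynamic range.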

(torch) dev@gpu-dev-8f27b069:~/ao$ PYTHONPATH=/home/dev/ao:$PYTHONPATH python benchmarks/mx_formats/cast_bench.py  --mode dim0_mxfp8_triton_rceil
W0102 04:28:08.388000 67699 site-packages/torch/_library/triton.py:222] _dequant_mxfp8_kernel not in collector.assignments
M 16384 K 16384 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.11.0.dev20260101+cu128
triton version: 3.6.0
mode: dim0_mxfp8_triton_rceil
time_us 136.25599443912506
mem_bw_gbps 5971.810483270397
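The reported bandwidth is consistent with the bytes moved. Assuming the benchmark counts a bf16 input read (2 B/element), an fp8 output write (1 B/element), and one E8M0 scale byte per 32-element block (an assumption about its accounting, not taken from its source), a quick sanity check:

```python
M = K = 16384
time_us = 136.25599443912506  # from the benchmark output above

# Assumed traffic: read bf16 input, write fp8 output,
# plus one E8M0 scale byte per 32-element block.
bytes_moved = M * K * 2 + M * K * 1 + (M * K) // 32

mem_bw_gbps = bytes_moved / (time_us * 1e-6) / 1e9
print(round(mem_bw_gbps, 1))  # 5971.8, matching the reported mem_bw_gbps
```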

@pytorch-bot

pytorch-bot bot commented Jan 1, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3560

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 3ac311d with merge base 8bb433e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

stack-info: PR: #3560, branch: danielvegamyhre/stack/93
@meta-cla meta-cla bot added the CLA Signed label Jan 1, 2026
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/93 branch from b07510a to 3ac311d Compare January 1, 2026 00:32
@danielvegamyhre danielvegamyhre added the mx, topic: performance, moe, and topic: improvement labels and removed the topic: performance label Jan 2, 2026
@danielvegamyhre danielvegamyhre merged commit 2319156 into main Jan 2, 2026
29 of 31 checks passed

Labels

CLA Signed · moe · mx · topic: improvement


2 participants