- Goal: mxfp8 all2all -> stay in mxfp8 through the token shuffle -> mxfp8 grouped gemm (a rough sketch of the dispatch path follows this list)
- Initial mxfp8 all2all impl (drop-in replacement for all_to_all_single_autograd, sync required)
- mxfp8 token shuffle (a modified version of this Triton kernel that also permutes scales so they stay in the same order as their associated tokens; see the shuffle sketch below)
- Extend the mxfp8 grouped gemm autograd func to also accept pre-quantized inputs (see the autograd sketch below)
- Improve the 3D expert weight mxfp8 quantization CUDA kernel (currently at 65-70% of peak memory bandwidth; should target 85%+ like the other mxfp8 quantization kernels; see the back-of-envelope calc below)
- Investigate whether we can write e8m0 scales directly in the blocked format, instead of running separate conversion kernels.
- Improve mxfp8 grouped gemm performance for small K dim (dsv3/kimi shapes). Currently we see less speedup for small, skinny experts than for the larger experts in models like llama4. We need to improve this since dsv3/kimi base models are so popular now.
- Unify the dense and MoE mxfp8 training codebases
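
Rough sketch of the mxfp8 dispatch path, to make the first two items concrete. This is not the actual implementation: `mxfp8_quantize` is a placeholder for whatever quantization entry point we end up using, and I'm assuming a (packed fp8 data, one e8m0 scale per 32-element block) layout; only `torch.distributed.all_to_all_single` is a real API here. The idea is to quantize once on the send side and move the fp8 payload and the scales as two collectives, so tokens stay in mxfp8 from dispatch through the shuffle to the grouped gemm.

```python
import torch
import torch.distributed as dist

def mxfp8_all_to_all(tokens, input_splits, output_splits, group=None):
    """tokens: [num_tokens, dim] bf16 activations to dispatch.
    input_splits / output_splits: per-rank row counts (host-side lists of ints).
    Returns (fp8_data, e8m0_scales) for the received tokens, still in mxfp8."""
    # Hypothetical helper: block-quantize along the last dim (block size 32),
    # producing packed fp8 values and one e8m0 scale per 32-element block.
    fp8_data, scales = mxfp8_quantize(tokens, block_size=32)

    dim = tokens.shape[-1]
    out_rows = sum(output_splits)
    recv_data = torch.empty(out_rows, dim, dtype=torch.uint8, device=tokens.device)
    recv_scales = torch.empty(out_rows, dim // 32, dtype=torch.uint8, device=tokens.device)

    # Two collectives: one for the quantized payload, one for the scales.
    # fp8/e8m0 tensors are sent as their uint8 bit patterns. The split sizes must
    # already be known on the host, hence the sync called out in the task list.
    dist.all_to_all_single(recv_data, fp8_data.view(torch.uint8),
                           output_split_sizes=output_splits,
                           input_split_sizes=input_splits, group=group)
    dist.all_to_all_single(recv_scales, scales.view(torch.uint8),
                           output_split_sizes=output_splits,
                           input_split_sizes=input_splits, group=group)
    return recv_data.view(torch.float8_e4m3fn), recv_scales
```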
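
Shuffle sketch, in plain PyTorch rather than Triton, just to pin down the contract the kernel needs to satisfy: whatever permutation puts the fp8 token rows into expert-contiguous order must also be applied to the per-token scale rows, so each block scale stays paired with its token.

```python
import torch

def shuffle_tokens_and_scales(fp8_tokens, e8m0_scales, expert_ids):
    """fp8_tokens: [num_tokens, dim] mxfp8 data, e8m0_scales: [num_tokens, dim // 32],
    expert_ids: [num_tokens] expert assignment per token."""
    # Sort tokens by expert so each expert's rows are contiguous for the grouped gemm.
    perm = torch.argsort(expert_ids, stable=True)
    shuffled_tokens = fp8_tokens[perm]
    shuffled_scales = e8m0_scales[perm]  # same permutation keeps scales aligned with tokens
    # A real version would pass minlength=num_experts so empty experts still get a count.
    tokens_per_expert = torch.bincount(expert_ids)
    return shuffled_tokens, shuffled_scales, tokens_per_expert, perm
```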
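
Autograd sketch for the pre-quantized input path. `mxfp8_quantize` and `_mxfp8_grouped_mm` are stand-ins for the real quantization and grouped-gemm kernels, and the backward is elided since this task only changes the forward input handling.

```python
import torch

class MXFP8GroupedMM(torch.autograd.Function):
    @staticmethod
    def forward(ctx, A, A_scales, B_fp8, B_scales, offs):
        if A_scales is None:
            # Existing path: A is high precision, quantize activations here.
            A_fp8, A_scales = mxfp8_quantize(A, block_size=32)  # hypothetical helper
        else:
            # New path: A is already mxfp8 data (e.g. straight out of the all2all +
            # shuffle above) and A_scales are its e8m0 block scales; skip re-quantizing.
            A_fp8 = A
        ctx.save_for_backward(A_fp8, A_scales, B_fp8, B_scales, offs)
        return _mxfp8_grouped_mm(A_fp8, A_scales, B_fp8, B_scales, offs)  # hypothetical kernel

    @staticmethod
    def backward(ctx, grad_out):
        # Backward (quantize grad_out, run the transposed grouped gemms) is unchanged
        # by this task and omitted from the sketch.
        raise NotImplementedError("sketch only covers the forward input contract")
```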
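
Back-of-envelope for the 3D expert weight quantization kernel, to make the "% of peak" target concrete. All numbers here are assumptions (example expert weight shape, made-up kernel time, ~3.35 TB/s as an approximate H100 HBM3 peak): the kernel reads bf16 weights and writes fp8 data plus one e8m0 scale per 32-element block.

```python
E, N, K = 128, 2048, 7168               # hypothetical expert weight shape [E, N, K]
elems = E * N * K
bytes_moved = elems * (2 + 1 + 1 / 32)  # read bf16 (2 B) + write fp8 (1 B) + e8m0 scales (1/32 B)
kernel_time_s = 2.5e-3                  # made-up measured kernel time
achieved_tbps = bytes_moved / kernel_time_s / 1e12
peak_tbps = 3.35                        # approx. H100 HBM3 peak
print(f"achieved {achieved_tbps:.2f} TB/s = {achieved_tbps / peak_tbps:.0%} of peak")
```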