🤗 HuggingFace | 📑 Tech Report
- [2025.12.15]: Release the code of DSMoE and JiTMoE.
- [2025.12.21]: Add the implementation of Kimi Delta Attention (KDA) to DSMoE.
We release MoE Transformers that can be applied to both latent- and pixel-space diffusion frameworks, employing DeepSeek-style expert modules, alternative intermediate widths, varying expert counts, and enhanced attention positional encodings. The models have already been released on Hugging Face.
Download the ImageNet dataset and place it at your `IMAGENET_PATH`.
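A minimal sketch of the expected setup, assuming the standard `train`/`val` ImageFolder layout used by most ImageNet training repos (verify against this repo's data loader; the path below is a placeholder):

```shell
# Placeholder root; replace with your actual dataset location.
export IMAGENET_PATH=/tmp/imagenet
mkdir -p "$IMAGENET_PATH/train" "$IMAGENET_PATH/val"
# After extraction, the assumed layout is:
#   $IMAGENET_PATH/train/<wnid>/*.JPEG
#   $IMAGENET_PATH/val/<wnid>/*.JPEG
```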
```shell
pip install torch==2.6.0 torchvision==0.21.0
pip install peft==0.17.1
pip install pexpect timm torchdiffeq tensorboard diffusers transformers
pip install tensorflow==2.15.0
pip install -e git+https://github.com/LTH14/torch-fidelity.git@master#egg=torch-fidelity
```
Please follow the installation instructions of DiffMoE and JiT, respectively.
See DSMoE and JiTMoE for details, respectively.
We follow the evaluation protocols provided by DiffMoE and JiT.
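FID-50K compares the Inception-feature statistics of 50K generated samples against the reference set via the Fréchet distance between two fitted Gaussians. A minimal NumPy/SciPy sketch of that distance (feature extraction omitted; in practice use the torch-fidelity package installed above):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 @ sigma2)^(1/2))."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Toy check with synthetic "features" standing in for Inception activations.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 8))
mu, sigma = feats.mean(axis=0), np.cov(feats, rowvar=False)
d_same = frechet_distance(mu, sigma, mu, sigma)       # near 0
d_shift = frechet_distance(mu, sigma, mu + 1.0, sigma)  # grows with the mean shift
```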
- Our DSMoE vs. DiffMoE at 700K training steps with CFG = 1.0 (* denotes results reported in the official paper):
| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|---|---|---|---|
| DiffMoE-S-E16 | 32M | 41.02 | 37.53 |
| DSMoE-S-E16 | 33M | 39.84 | 38.63 |
| DSMoE-S-E48 | 30M | 40.20 | 38.09 |
| DiffMoE-B-E16 | 130M | 20.83 | 70.26 |
| DSMoE-B-E16 | 132M | 20.33 | 71.42 |
| DSMoE-B-E48 | 118M | 19.46 | 72.69 |
| DiffMoE-L-E16 | 458M | 11.16 (14.41*) | 107.74 (88.19*) |
| DSMoE-L-E16 | 465M | 9.80 | 115.45 |
| DSMoE-L-E48 | 436M | 9.19 | 118.52 |
| DSMoE-3B-E16 | 965M | 7.52 | 135.29 |
- Our DSMoE vs. DiffMoE at 700K training steps with CFG = 1.5:
| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|---|---|---|---|
| DiffMoE-S-E16 | 32M | 15.47 | 94.04 |
| DSMoE-S-E16 | 33M | 14.53 | 97.55 |
| DSMoE-S-E48 | 30M | 14.81 | 96.51 |
| DiffMoE-B-E16 | 130M | 4.87 | 183.43 |
| DSMoE-B-E16 | 132M | 4.50 | 186.79 |
| DSMoE-B-E48 | 118M | 4.27 | 191.03 |
| DiffMoE-L-E16 | 458M | 2.84 | 256.57 |
| DSMoE-L-E16 | 465M | 2.59 | 272.55 |
| DSMoE-L-E48 | 436M | 2.55 | 278.35 |
| DSMoE-3B-E16 | 965M | 2.38 | 304.93 |
- Our JiTMoE vs. JiT at 200 training epochs with a CFG interval (* denotes results reported in the official paper):
| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|---|---|---|---|
| JiT-B/16 | 131M | 4.81 (4.37*) | 222.32 (-) |
| JiTMoE-B/16-E16 | 133M | 4.23 | 245.53 |
| JiT-L/16 | 459M | 3.19 (2.79*) | 309.72 (-) |
| JiTMoE-L/16-E16 | 465M | 3.10 | 311.34 |
A large portion of the code in this repo is based on DiffMoE, JiT, and DeepSeekMoE.
```bibtex
@article{liu2025efficient,
  title={Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe},
  author={Liu, Yahui and Yue, Yang and Zhang, Jingyuan and Sun, Chenxi and Zhou, Yang and Zeng, Wencong and Tang, Ruiming and Zhou, Guorui},
  journal={arXiv preprint arXiv:2512.01252},
  year={2025}
}
```