🤗 HuggingFace | 📑 Tech Report
- [2025.12.15]: Release the code of DSMoE and JiTMoE.
- [2025.12.21]: Add the implementation of Kimi Delta Attention (KDA) to DSMoE.
We release MoE Transformers that can be applied to both latent- and pixel-space diffusion frameworks, employing DeepSeek-style expert modules, alternative intermediate widths, varying expert counts, and enhanced attention positional encodings. The models have already been released on Hugging Face.
Download the ImageNet dataset and place it at your `IMAGENET_PATH`.
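A minimal sketch of the expected setup, assuming the standard `train`/`val` ImageFolder layout used by most ImageNet training repos (verify against this repo's data loader; the path below is a placeholder):

```shell
# Placeholder root; replace with your actual dataset location.
export IMAGENET_PATH=/tmp/imagenet
mkdir -p "$IMAGENET_PATH/train" "$IMAGENET_PATH/val"
# After extraction, the assumed layout is:
#   $IMAGENET_PATH/train/<wnid>/*.JPEG
#   $IMAGENET_PATH/val/<wnid>/*.JPEG
```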
```shell
pip install torch==2.6.0 torchvision==0.21.0
pip install peft==0.17.1
pip install pexpect timm torchdiffeq tensorboard diffusers transformers
pip install tensorflow==2.15.0
pip install -e git+https://github.com/LTH14/torch-fidelity.git@master#egg=torch-fidelity
```
Please follow the installation instructions of DiffMoE and JiT, respectively.
See DSMoE and JiTMoE for details, respectively.
We follow the evaluation protocols provided by DiffMoE and JiT.
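FID-50K compares the Inception-feature statistics of 50K generated samples against the reference set via the Fréchet distance between two fitted Gaussians. A minimal NumPy/SciPy sketch of that distance (feature extraction omitted; in practice use the torch-fidelity package installed above):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 @ sigma2)^(1/2))."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Toy check with synthetic "features" standing in for Inception activations.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 8))
mu, sigma = feats.mean(axis=0), np.cov(feats, rowvar=False)
d_same = frechet_distance(mu, sigma, mu, sigma)       # near 0
d_shift = frechet_distance(mu, sigma, mu + 1.0, sigma)  # grows with the mean shift
```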
- Our DSMoE vs. DiffMoE at 700K training steps with CFG = 1.0 (* denotes results reported in the official paper):
| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|---|---|---|---|
| DiffMoE-S-E16 | 32M | 41.02 | 37.53 |
| DSMoE-S-E16 | 33M | 39.84 | 38.63 |
| DSMoE-S-E48 | 30M | 40.20 | 38.09 |
| DiffMoE-B-E16 | 130M | 20.83 | 70.26 |
| DSMoE-B-E16 | 132M | 20.33 | 71.42 |
| DSMoE-B-E48 | 118M | 19.46 | 72.69 |
| DiffMoE-L-E16 | 458M | 11.16 (14.41*) | 107.74 (88.19*) |
| DSMoE-L-E16 | 465M | 9.80 | 115.45 |
| DSMoE-L-E48 | 436M | 9.19 | 118.52 |
| DSMoE-3B-E16 | 965M | 7.52 | 135.29 |
- Our DSMoE vs. DiffMoE at 700K training steps with CFG = 1.5:
| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|---|---|---|---|
| DiffMoE-S-E16 | 32M | 15.47 | 94.04 |
| DSMoE-S-E16 | 33M | 14.53 | 97.55 |
| DSMoE-S-E48 | 30M | 14.81 | 96.51 |
| DiffMoE-B-E16 | 130M | 4.87 | 183.43 |
| DSMoE-B-E16 | 132M | 4.50 | 186.79 |
| DSMoE-B-E48 | 118M | 4.27 | 191.03 |
| DiffMoE-L-E16 | 458M | 2.84 | 256.57 |
| DSMoE-L-E16 | 465M | 2.59 | 272.55 |
| DSMoE-L-E48 | 436M | 2.55 | 278.35 |
| DSMoE-3B-E16 | 965M | 2.38 | 304.93 |
- Our JiTMoE vs. JiT at 200 training epochs with a CFG interval (* denotes results reported in the official paper):
| Model Name | # Act. Params | FID-50K↓ | Inception Score↑ |
|---|---|---|---|
| JiT-B/16 | 131M | 4.81 (4.37*) | 222.32 (-) |
| JiTMoE-B/16-E16 | 133M | 4.23 | 245.53 |
| JiT-L/16 | 459M | 3.19 (2.79*) | 309.72 (-) |
| JiTMoE-L/16-E16 | 465M | 3.10 | 311.34 |
A large portion of the code in this repo is based on DiffMoE, JiT, and DeepSeekMoE.
```bibtex
@article{liu2025efficient,
  title={Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe},
  author={Liu, Yahui and Yue, Yang and Zhang, Jingyuan and Sun, Chenxi and Zhou, Yang and Zeng, Wencong and Tang, Ruiming and Zhou, Guorui},
  journal={arXiv preprint arXiv:2512.01252},
  year={2025}
}
```