Yasaman Haghighi*, Bastien van Delft*, Mariam Hassan, Alexandre Alahi
École Polytechnique Fédérale de Lausanne (EPFL)
🧠 Abstract: We propose LayerSync, a domain-agnostic approach for improving the generation quality and the training efficiency of diffusion models. Prior studies have highlighted the connection between the quality of generation and the representations learned by diffusion models, showing that external guidance on model intermediate representations accelerates training. We reconceptualize this paradigm by regularizing diffusion models with their own intermediate representations. Building on the observation that representation quality varies across diffusion model layers, we show that the most semantically rich representations can act as an intrinsic guidance for weaker ones, reducing the need for external supervision. Our approach, LayerSync, is a self-sufficient, plug-and-play regularization term with no overhead on diffusion model training, and it generalizes beyond the visual domain to other modalities. LayerSync requires no pretrained models nor additional data. We extensively evaluate the method on image generation and demonstrate its applicability to other domains such as audio, video, and human motion generation. LayerSync consistently enhances the generation quality and the training efficiency. For example, it speeds up the training of a flow-based transformer by over 8.75x on the ImageNet dataset and improves the generation quality by 23.6%.
- [Jan 2025] Added fine-tuning with LoRA support! Check out the Finetuning-LoRA folder.
First, clone and set up the repository:
```
git clone https://github.com/vita-epfl/LayerSync.git
cd LayerSync
```

We provide an environment.yml file that can be used to create a Conda environment.
If you only want to run the pre-trained model locally on CPU, you can remove the cudatoolkit and pytorch-cuda requirements from the file.
```
conda env create -f environment.yml
conda activate LayerSync
```

We use the ImageNet-1K dataset. Please first register on Hugging Face, agree to the dataset's terms and conditions, and then download it:
We provide pretrained model checkpoints on Hugging Face:
Available checkpoints:
- Final checkpoint: Trained model after 800 epochs
- Representation evaluation checkpoint: Model checkpoint used for representation quality evaluation
To use a checkpoint, download it from the repository and load it during evaluation or inference.
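As a minimal illustration of loading a downloaded checkpoint in PyTorch (a sketch only: `nn.Linear` stands in for the repo's actual SiT-XL/2 model class, and the filename is made up):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the SiT-XL/2 model class from this repo.
model = nn.Linear(4, 4)

# A save/restore round-trip mirrors loading a downloaded checkpoint file.
torch.save(model.state_dict(), "layersync_demo.pt")
state = torch.load("layersync_demo.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()  # switch to inference mode before sampling
```

`map_location="cpu"` lets a checkpoint trained on GPU be loaded on a CPU-only machine.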
To train with LayerSync, run:
```
torchrun --nnodes=1 --nproc_per_node=N train.py \
  --model SiT-XL/2 \
  --loss-type "layer_sync" \
  --reg-weight 0.2 \
  --encoder-depth 8 \
  --gt-encoder-depth 16
```

You can generate images (the resulting .npz file can be used with the ADM evaluation suite) using:
```
torchrun --nnodes=1 --nproc_per_node=N sample_ddp.py ODE \
  --model SiT-XL/2 \
  --num-fid-samples 50000
```

To use LayerSync in your own project, add a regularization term that synchronizes two intermediate layers.
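Such a synchronization term can be sketched as follows. This is an illustrative sketch, not the repository's actual implementation: the function name `layersync_loss`, the negative-cosine-similarity objective, and the stop-gradient on the deeper layer are assumptions; the layer indices (8 and 16) and weight (0.2) mirror the training flags above.

```python
import torch
import torch.nn.functional as F

def layersync_loss(shallow_feats, deep_feats):
    """Align shallow-layer features with (stop-gradient) deep-layer features.

    Returns the negative mean cosine similarity, so minimizing it pulls the
    weaker representation toward the semantically richer one.
    """
    target = deep_feats.detach()  # the deeper layer acts as the teacher
    shallow = F.normalize(shallow_feats, dim=-1)
    target = F.normalize(target, dim=-1)
    return -(shallow * target).sum(dim=-1).mean()

# Hypothetical usage inside a training step:
# feats = {}  # populated by forward hooks at depths 8 and 16
# loss = diffusion_loss + 0.2 * layersync_loss(feats[8], feats[16])
```

Because the deeper features are detached, gradients from this term flow only into the shallower layers, which matches the idea of using the richest representations as intrinsic guidance for weaker ones.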
TODO:
- Release checkpoints
- Add audio scripts
- Add representation evaluation scripts
We greatly appreciate the tremendous effort behind the following fantastic projects:
This codebase is primarily built upon SiT and REPA repositories.
If you find our work or code useful, please cite:
```
@misc{haghighi2025layersyncselfaligningintermediatelayers,
  title={LayerSync: Self-aligning Intermediate Layers},
  author={Yasaman Haghighi and Bastien van Delft and Mariam Hassan and Alexandre Alahi},
  year={2025},
  eprint={2510.12581},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.12581},
}
```
