Contact: Wuyang Li
Email: wymanbest@outlook.com
- GPU-friendly training. Rank-32 LoRA post-training on Wan2.2-Animate reaches strong results with only thousands of iterations on 4 GPUs.
- Long-horizon animation. EverAnimate supports minute-scale human animation with controlled identity and motion consistency.
- Fully open source. Code, training/inference scripts, LoRA checkpoints, demo data, and ablation videos are released for reproducible research.
Note. EverAnimate builds on the long-video generation framework of SVI 2.0 Pro. Unlike the version described in our paper, which uses a rank of 128, we reproduce and release a lighter, more user-friendly LoRA (rank 32) version focused on long-horizon human animation. It comes with ready-to-run training and inference scripts and can be used on 80GB GPUs without DeepSpeed ZeRO-2 or ZeRO-3.
If you find EverAnimate useful for your research or applications, we would greatly appreciate a ⭐.
Important
ComfyUI workflow notice: The current community-deployed ComfyUI workflow has known issues and may cause severe background flickering between chunks. We are preparing an official version and will release it publicly once it is ready.
Known issues:
- The padding at wan_video_animate_adapter.py#L625 should correspond to four anchors, while the default Wan-Animate setting uses one anchor.
- Our method is theoretically not compatible with background masks.
- 27 May 2026: Code released.
- 14 May 2026: Paper released.
git clone https://github.com/vita-epfl/EverAnimate.git
cd EverAnimate
conda create -n everanimate python=3.10 -y
conda activate everanimate
pip install --upgrade pip setuptools wheel packaging ninja
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -e .
pip install flash-attn --no-build-isolation
Download all required files with one command:
bash scripts/download_models.sh
This downloads:
- Wan2.2-Animate diffusion, T5 encoder, VAE, and CLIP model files
- The google/umt5-xxl tokenizer required by the DiffSynth Wan pipeline
- The Wav2Vec processor files used by the default training pipeline
- EverAnimate 480p LoRA checkpoints and the 720p beta checkpoint under ckpts/everanimate-v1-lora32
- Demo assets from data, including the minimal training sample, inference demo, and Stage-1/Stage-2 ablation videos
After downloading, the default scripts use the local ckpts/ folder for both base models and EverAnimate LoRA checkpoints. For offline runs, set:
export DIFFSYNTH_MODEL_BASE_PATH=$PWD/ckpts
export DIFFSYNTH_SKIP_DOWNLOAD=True
Expected layout:
ckpts/
|-- Wan-AI/Wan2.2-Animate-14B/
| |-- diffusion_pytorch_model*.safetensors
| |-- models_t5_umt5-xxl-enc-bf16.pth
| |-- Wan2.1_VAE.pth
| `-- models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
|-- Wan-AI/Wan2.1-T2V-1.3B/
| `-- google/umt5-xxl/ # Tokenizer used by DiffSynth
|-- Wan-AI/Wan2.2-S2V-14B/
| `-- wav2vec2-large-xlsr-53-english/
|-- everanimate-v1-lora32/
| |-- stage1_480p.safetensors
| |-- stage2_480p.safetensors
| `-- stage3_720p_beta.safetensors # Beta, tested only at small scale
data/
|-- train/ # Minimal training sample
|-- test/ # Inference demo
`-- ablation/ # Stage-1 and Stage-2 ablation videos
The ablation videos are the two-stage outputs: data/ablation/stage1.mp4 is the Stage-1 result, and data/ablation/stage2.mp4 is the Stage-2 result.
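Before an offline run, the expected layout can be checked with a short script. This is a minimal sketch and not part of the repository: the helper name `missing_entries` is hypothetical, and the path list is copied from the tree above.

```python
from pathlib import Path

# Paths copied from the expected layout, relative to the repository root.
# The first entry is a glob pattern because the diffusion weights are sharded.
EXPECTED = [
    "ckpts/Wan-AI/Wan2.2-Animate-14B/diffusion_pytorch_model*.safetensors",
    "ckpts/Wan-AI/Wan2.2-Animate-14B/models_t5_umt5-xxl-enc-bf16.pth",
    "ckpts/Wan-AI/Wan2.2-Animate-14B/Wan2.1_VAE.pth",
    "ckpts/Wan-AI/Wan2.2-Animate-14B/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth",
    "ckpts/Wan-AI/Wan2.1-T2V-1.3B/google/umt5-xxl",
    "ckpts/Wan-AI/Wan2.2-S2V-14B/wav2vec2-large-xlsr-53-english",
    "ckpts/everanimate-v1-lora32/stage1_480p.safetensors",
    "ckpts/everanimate-v1-lora32/stage2_480p.safetensors",
    "ckpts/everanimate-v1-lora32/stage3_720p_beta.safetensors",
    "data/train",
    "data/test",
    "data/ablation",
]


def missing_entries(root="."):
    """Return the expected patterns that match nothing under `root`."""
    root = Path(root)
    return [pattern for pattern in EXPECTED if not any(root.glob(pattern))]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing entries:")
        for pattern in missing:
            print("  " + pattern)
    else:
        print("Layout looks complete.")
```

Run it from the repository root. An empty result only means the files exist; it does not validate their contents.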
EverAnimate follows the official DiffSynth-Studio model-loading convention.
Run the bundled 480p test demo:
bash test.sh
During inference, EverAnimate automatically saves the latest chunk latents so that long videos can be resumed from the saved state. Consecutive chunks overlap by 4 frames, and the last latent of the previous chunk (without decoding) guides generation of the current chunk.
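To see how the chunk count relates to output length, the overlap arithmetic can be sketched as follows. This is an illustration only: it assumes overlapping frames appear once in the stitched video, and `chunk_frames=81` is a placeholder value, not a number taken from test.sh.

```python
def total_frames(num_chunks, chunk_frames, overlap=4):
    """Frames in the stitched video when consecutive chunks share `overlap` frames.

    Assumption: each overlap region is counted once in the final output.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    if not 0 <= overlap < chunk_frames:
        raise ValueError("overlap must be in [0, chunk_frames)")
    return num_chunks * chunk_frames - (num_chunks - 1) * overlap


if __name__ == "__main__":
    # chunk_frames=81 is a hypothetical chunk length used only for illustration.
    for n in (1, 2, 20):
        print(n, total_frames(n, chunk_frames=81))
```

Under these assumptions, each additional chunk contributes `chunk_frames - overlap` new frames.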
Run a longer demo with 20 chunks:
NUM_CLIPS=20 OUTPUT_PATH=outputs/test/demo_000001_20chunks.mp4 bash test.sh
Run the 720p beta checkpoint:
The 720p checkpoint is a beta version and has only been tested at small scale.
LORA_PATH=ckpts/everanimate-v1-lora32/stage3_720p_beta.safetensors \
WIDTH=1280 \
HEIGHT=720 \
OUTPUT_PATH=outputs/test/demo_000001_720p_beta.mp4 \
bash test.sh
Use custom inputs:
INPUT_IMAGE=path/to/image.png \
POSE_VIDEO=path/to/pose.mp4 \
FACE_VIDEO=path/to/face.mp4 \
OUTPUT_PATH=outputs/custom.mp4 \
bash test.sh
The repository includes a minimal toy training sample under data/train/. Training videos should be longer than 160 frames.
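When preparing custom data, the length requirement can be applied as a simple pre-filter. The sketch below is hypothetical: `split_by_length` is not a repository function, and it assumes frame counts have already been obtained with a video tool of your choice.

```python
MIN_FRAMES = 160  # from the note above: videos should be longer than 160 frames


def split_by_length(frame_counts, min_frames=MIN_FRAMES):
    """Split {video_path: frame_count} into (usable, too_short) path lists.

    A video is usable only if it has strictly more than `min_frames` frames.
    """
    usable = sorted(p for p, n in frame_counts.items() if n > min_frames)
    too_short = sorted(p for p, n in frame_counts.items() if n <= min_frames)
    return usable, too_short


if __name__ == "__main__":
    # Illustrative file names and counts, not real data.
    counts = {"a.mp4": 161, "b.mp4": 160, "c.mp4": 400}
    usable, too_short = split_by_length(counts)
    print("usable:", usable)
    print("too short:", too_short)
```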
Stage 1 performs video extension with the last latent and memory, without anti-drifting. (data/ablation/stage1.mp4)
bash train_stage1.sh
Stage 2 conducts restorative flow matching. (data/ablation/stage2.mp4)
To apply an SVI/Helios-style anti-drifting strategy, pass --enable_image_enhancement and --image_enhancement_prob 0.95 to explicitly augment the motion latents. While this approach can further improve stability, it may introduce cross-chunk flickering.
bash train_stage2.sh
Use custom training data:
DATASET_BASE_PATH=path/to/data \
DATASET_METADATA_PATH=path/to/metadata.csv \
OUTPUT_PATH=experiments/stage1_custom \
bash train_stage1.sh
For Stage 2, the default Stage-1 LoRA is ckpts/everanimate-v1-lora32/stage1_480p.safetensors. To use another checkpoint:
LORA_CHECKPOINT=path/to/stage1.safetensors \
OUTPUT_PATH=experiments/stage2_custom \
bash train_stage2.sh
720p beta fine-tuning:
train_stage3.sh uses 1280x720-scale training through MAX_PIXELS=921600. This path is still beta and has only been tested at small scale. Training without DeepSpeed requires more than 80GB of GPU memory.
bash train_stage3.sh
Continue fine-tuning from the 720p beta checkpoint:
LORA_CHECKPOINT=ckpts/everanimate-v1-lora32/stage3_720p_beta.safetensors \
OUTPUT_PATH=experiments/stage3_720p_continue \
bash train_stage3.sh
EverAnimate/
|-- diffsynth/ # Core model, pipeline, diffusion, and utility code
|-- scripts/ # Download, training, and inference utilities
|-- data/train/ # Minimal toy training sample
|-- data/test/ # Minimal inference demo sample
|-- ckpts/Wan-AI/ # Wan base model files used by DiffSynth
|-- ckpts/everanimate-v1-lora32/ # EverAnimate 480p LoRA and 720p beta checkpoints
|-- train_stage1.sh
|-- train_stage2.sh
|-- train_stage3.sh
`-- test.sh
Q: What input resolution is supported?
Due to compute constraints, the stable public release focuses on 480p LoRA checkpoints. We also provide a 720p beta checkpoint, but it has only been tested at small scale. A more thoroughly fine-tuned and evaluated 720p model is planned for a future update, and the current 480p model can also support 720p inference to some extent.
Q: How are anchor frames used?
EverAnimate uses four anchor frames as guidance. For the first chunk, we directly copy the provided reference as the anchor. Starting from the second chunk, we use the first frame plus three randomly selected frames as anchors. Therefore, video decoding starts from the fifth latent, and WanAnimateAdapter in diffsynth/models/wan_video_animate_adapter.py pads the anchor positions accordingly. Users can also provide anchor frames explicitly for their own workflows.
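The anchor policy described above can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's implementation: `build_anchors` is a hypothetical helper, `history` stands for whatever pool of frames the anchors are drawn from (the answer above does not specify it), and repeating the reference four times for the first chunk and sorting the sampled frames are our own choices.

```python
import random


def build_anchors(chunk_index, reference, history, num_anchors=4, seed=None):
    """Return `num_anchors` anchor frames for the given chunk.

    chunk_index == 0: the reference repeated `num_anchors` times (assumption).
    Later chunks: history[0] plus (num_anchors - 1) distinct random later frames.
    """
    if chunk_index == 0:
        return [reference] * num_anchors
    if len(history) < num_anchors:
        raise ValueError("history must contain at least num_anchors frames")
    rng = random.Random(seed)
    # Sample from indices 1..len(history)-1 so the first frame is never duplicated.
    picked = sorted(rng.sample(range(1, len(history)), num_anchors - 1))
    return [history[0]] + [history[i] for i in picked]


if __name__ == "__main__":
    frames = ["frame_%03d" % i for i in range(20)]  # stand-ins for real frames
    print(build_anchors(0, "reference", frames))
    print(build_anchors(1, "reference", frames, seed=0))
```

Passing a fixed `seed` makes the selection reproducible, which helps when comparing runs.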
Q: What minor future improvements are planned?
In some samples, we observe a visible transition between the first and second chunks. We plan to improve this boundary behavior in future releases.
EverAnimate is an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: low-level quality drift, such as progressive degradation of static backgrounds, and high-level semantic drift, such as inconsistent character identity and view-dependent attributes. EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory. It consists of two complementary mechanisms: Persistent Latent Propagation, which maintains context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting, and Restorative Flow Matching, which introduces an implicit restoration objective during sampling through velocity adjustment to improve within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.
| Component | Purpose |
|---|---|
| Persistent Latent Propagation | Propagates identity and motion through latent memory across chunks. |
| Restorative Flow Matching | Corrects drifted latent trajectories with a bounded restorative velocity target. |
| Lightweight LoRA Adaptation | Enables efficient post-training on top of a video animation backbone. |
This work builds on the following projects:
- DiffSynth-Studio
- Wan-Animate: Unified Character Animation and Replacement with Holistic Replication
- Stable Video Infinity: Infinite-Length Video Generation with Error Recycling
This work has also been inspired by SVI 2.0 Pro and LongCat Video Avatar.
@misc{li2026everanimate,
title = {EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration},
author = {Wuyang Li and Yang Gao and Mariam Hassan and Lan Feng and Wentao Pan and Po-Chien Luan and Alexandre Alahi},
year = {2026},
eprint = {2605.15042},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.15042}
}
@misc{li2025stablevideoinfinity,
title = {Stable Video Infinity: Infinite-Length Video Generation with Error Recycling},
author = {Wuyang Li and Wentao Pan and Po-Chien Luan and Yang Gao and Alexandre Alahi},
year = {2025},
eprint = {2510.09212},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2510.09212}
}
