We present LatentSync, an end-to-end lip-sync framework based on audio-conditioned latent diffusion models, without any intermediate motion representation, diverging from previous diffusion-based lip-sync methods that rely on pixel-space diffusion or two-stage generation. Our framework leverages the powerful generative capabilities of Stable Diffusion to directly model complex audio-visual correlations. Additionally, we found that diffusion-based lip-sync methods exhibit inferior temporal consistency due to inconsistencies in the diffusion process across frames. We propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground-truth frames.
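A minimal sketch of how a TREPA-style loss could be computed, assuming a generic frozen self-supervised video backbone (`video_encoder` below is a placeholder, not necessarily the exact extractor or distance used in this repository):

```python
import torch
import torch.nn.functional as F

def trepa_loss(generated_frames, gt_frames, video_encoder):
    """Align temporal representations of generated and ground-truth clips.

    generated_frames, gt_frames: (B, T, C, H, W) pixel-space frame sequences.
    video_encoder: a frozen self-supervised video model returning a temporal
    representation per clip. Sketch only; the actual model and distance may differ.
    """
    with torch.no_grad():
        gt_repr = video_encoder(gt_frames)        # target representation, no gradients
    gen_repr = video_encoder(generated_frames)    # gradients flow back to the generator
    # L2 distance between temporal representations (assumed here for illustration).
    return F.mse_loss(gen_repr, gt_repr)
```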
LatentSync uses Whisper to convert the mel-spectrogram into audio embeddings, which are then integrated into the U-Net via cross-attention layers. The reference and masked frames are channel-wise concatenated with the noised latents as the input to the U-Net. During training, we use a one-step method to estimate the clean latents from the predicted noise, which are then decoded to obtain the estimated clean frames. The TREPA, LPIPS, and SyncNet losses are applied in pixel space.
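The one-step estimate of the clean latents follows the standard epsilon-prediction relation x₀ ≈ (xₜ − √(1 − ᾱₜ)·ε) / √ᾱₜ. A small sketch under that assumption (variable names are illustrative; the repository's scheduler may expose this differently):

```python
import torch

def estimate_clean_latents(noisy_latents, pred_noise, alphas_cumprod, timesteps):
    """One-step estimate of clean latents from predicted noise (epsilon parameterization).

    noisy_latents, pred_noise: (B, C, H, W) tensors.
    alphas_cumprod: 1-D tensor of cumulative alpha products from the noise schedule.
    timesteps: (B,) long tensor of diffusion timesteps for each sample.
    """
    a_bar = alphas_cumprod[timesteps].view(-1, 1, 1, 1)
    clean_latents = (noisy_latents - (1 - a_bar).sqrt() * pred_noise) / a_bar.sqrt()
    return clean_latents  # decode with the VAE to obtain the estimated clean frames
```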
| Original video | Lip-synced video |
| --- | --- |
| demo1_video.mp4 | demo1_output.mp4 |
| demo2_video.mp4 | demo2_output.mp4 |
- Inference code and checkpoints
- Data processing pipeline
- Training code
- Super-resolution integration
Install the required packages and download the checkpoints via:
```bash
source setup_env.sh
pip install gfpgan basicsr
git clone https://github.com/sczhou/CodeFormer
cd CodeFormer && python scripts/download_pretrained_models.py facelib
```

File structure after setup:
```
./checkpoints/
|-- latentsync_unet.pt
|-- latentsync_syncnet.pt
|-- whisper/
|-- auxiliary/
|-- CodeFormer/
```
```bash
python main.py \
    --video_path input.mp4 \
    --audio_path audio.wav \
    --superres [GFPGAN/CodeFormer] \
    --video_out_path output.mp4 \
    --inference_steps 25 \
    --guidance_scale 1.5
```

New Parameters:

- `--superres`: Apply face enhancement using GFPGAN or CodeFormer
- `--mask_smooth`: Blending smoothness (0-1, default=0.5)
- `--sr_ratio`: Super-resolution scale factor (1-4, default=2)
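For reference, `--guidance_scale` follows the usual classifier-free guidance formulation: the unconditional and audio-conditional noise predictions are combined and the difference is scaled. A minimal sketch (illustrative only; the repository's inference loop and conditioning inputs may differ):

```python
import torch

def guided_noise(unet, noisy_latents, timestep, audio_emb, null_audio_emb,
                 guidance_scale=1.5):
    """Classifier-free guidance: mix conditional and unconditional noise predictions.

    The `unet` call signature here is generic/hypothetical; real pipelines often
    batch both passes into a single forward call for speed.
    """
    eps_cond = unet(noisy_latents, timestep, encoder_hidden_states=audio_emb)
    eps_uncond = unet(noisy_latents, timestep, encoder_hidden_states=null_audio_emb)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```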
```bash
python gradio_app.py --enable_sr
```

```mermaid
graph TD
    A[Input Video] --> B(LatentSync Generation)
    B --> C{Face Detection}
    C -->|Detected| D[SR Processing]
    C -->|Not Detected| E[Direct Output]
    D --> F[GFPGAN/CodeFormer]
    F --> G[Mask Blending]
    G --> H[Final Output]
```
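A rough sketch of the mask-blending step at the end of this pipeline, assuming a per-pixel face mask and `--mask_smooth` controlling how much the mask edge is feathered (function and variable names are illustrative, not the repository's exact implementation):

```python
import cv2
import numpy as np

def blend_restored_face(frame, restored_face, face_mask, mask_smooth=0.5):
    """Blend the super-resolved face back into the original frame.

    frame, restored_face: uint8 images of the same size.
    face_mask: float mask in [0, 1], 1 inside the face region.
    mask_smooth: 0-1, larger values feather the mask edge more (assumed mapping).
    """
    # Feather the mask edge; the kernel size scales with mask_smooth and must be odd.
    ksize = max(1, int(mask_smooth * 31)) | 1
    soft_mask = cv2.GaussianBlur(face_mask.astype(np.float32), (ksize, ksize), 0)
    soft_mask = soft_mask[..., None]  # broadcast over color channels
    blended = soft_mask * restored_face + (1 - soft_mask) * frame
    return blended.astype(np.uint8)
```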
Evaluate results with:
```bash
./eval/eval_quality.sh \
    --video_path output.mp4 \
    --metric psnr ssim fid
```

Supported metrics:
- PSNR (Peak Signal-to-Noise Ratio)
- SSIM (Structural Similarity)
- FID (Fréchet Inception Distance)
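For a quick frame-level sanity check outside the provided script, PSNR and SSIM can be computed per frame pair with scikit-image (a minimal sketch; the evaluation script may aggregate over frames differently):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(generated: np.ndarray, reference: np.ndarray):
    """Per-frame PSNR and SSIM between a generated and a reference RGB frame (uint8)."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)
    return psnr, ssim
```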
- GFPGAN - Practical face restoration
- CodeFormer - Robust face enhancement
- BasicSR - Super-resolution framework
- AnimateDiff - Base architecture
Apache License 2.0 - See LICENSE for details

