We present LatentSync, an end-to-end lip-sync framework based on audio-conditioned latent diffusion models, without any intermediate motion representation, diverging from previous diffusion-based lip-sync methods that rely on pixel-space diffusion or two-stage generation. Our framework leverages the powerful generative capabilities of Stable Diffusion to directly model complex audio-visual correlations. Additionally, we found that diffusion-based lip-sync methods exhibit inferior temporal consistency due to inconsistencies in the diffusion process across frames. We propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground-truth frames.
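A minimal sketch of how a TREPA-style loss could be computed, assuming a generic frozen self-supervised video backbone (`video_encoder` below is a placeholder, not necessarily the exact extractor or distance used in this repository):

```python
import torch
import torch.nn.functional as F

def trepa_loss(generated_frames, gt_frames, video_encoder):
    """Align temporal representations of generated and ground-truth clips.

    generated_frames, gt_frames: (B, T, C, H, W) pixel-space frame sequences.
    video_encoder: a frozen self-supervised video model returning a temporal
    representation per clip. Sketch only; the actual model and distance may differ.
    """
    with torch.no_grad():
        gt_repr = video_encoder(gt_frames)        # target representation, no gradients
    gen_repr = video_encoder(generated_frames)    # gradients flow back to the generator
    # L2 distance between temporal representations (assumed here for illustration).
    return F.mse_loss(gen_repr, gt_repr)
```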
LatentSync uses Whisper to convert the mel-spectrogram into audio embeddings, which are then integrated into the U-Net via cross-attention layers. The reference and masked frames are channel-wise concatenated with the noised latents as the input to the U-Net. During training, we use a one-step method to estimate the clean latents from the predicted noise, which are then decoded to obtain the estimated clean frames. The TREPA, LPIPS, and SyncNet losses are applied in pixel space.
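The one-step estimate of the clean latents follows the standard epsilon-prediction relation x₀ ≈ (xₜ − √(1 − ᾱₜ)·ε) / √ᾱₜ. A small sketch under that assumption (variable names are illustrative; the repository's scheduler may expose this differently):

```python
import torch

def estimate_clean_latents(noisy_latents, pred_noise, alphas_cumprod, timesteps):
    """One-step estimate of clean latents from predicted noise (epsilon parameterization).

    noisy_latents, pred_noise: (B, C, H, W) tensors.
    alphas_cumprod: 1-D tensor of cumulative alpha products from the noise schedule.
    timesteps: (B,) long tensor of diffusion timesteps for each sample.
    """
    a_bar = alphas_cumprod[timesteps].view(-1, 1, 1, 1)
    clean_latents = (noisy_latents - (1 - a_bar).sqrt() * pred_noise) / a_bar.sqrt()
    return clean_latents  # decode with the VAE to obtain the estimated clean frames
```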
| Original video | Lip-synced video |
| --- | --- |
| demo1_video.mp4 | demo1_output.mp4 |
| demo2_video.mp4 | demo2_output.mp4 |
- Inference code and checkpoints
- Data processing pipeline
- Training code
- Super-resolution integration
Install the required packages and download the checkpoints via:
```bash
source setup_env.sh
pip install gfpgan basicsr
git clone https://github.com/sczhou/CodeFormer
cd CodeFormer && python scripts/download_pretrained_models.py facelib
```

File structure after setup:
```
./checkpoints/
|-- latentsync_unet.pt
|-- latentsync_syncnet.pt
|-- whisper/
|-- auxiliary/
|-- CodeFormer/
```
```bash
python main.py \
    --video_path input.mp4 \
    --audio_path audio.wav \
    --superres [GFPGAN/CodeFormer] \
    --video_out_path output.mp4 \
    --inference_steps 25 \
    --guidance_scale 1.5
```

New Parameters:

- `--superres`: Apply face enhancement using GFPGAN or CodeFormer
- `--mask_smooth`: Blending smoothness (0-1, default=0.5)
- `--sr_ratio`: Super-resolution scale factor (1-4, default=2)
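For reference, `--guidance_scale` follows the usual classifier-free guidance formulation: the unconditional and audio-conditional noise predictions are combined and the difference is scaled. A minimal sketch (illustrative only; the repository's inference loop and conditioning inputs may differ):

```python
import torch

def guided_noise(unet, noisy_latents, timestep, audio_emb, null_audio_emb,
                 guidance_scale=1.5):
    """Classifier-free guidance: mix conditional and unconditional noise predictions.

    The `unet` call signature here is generic/hypothetical; real pipelines often
    batch both passes into a single forward call for speed.
    """
    eps_cond = unet(noisy_latents, timestep, encoder_hidden_states=audio_emb)
    eps_uncond = unet(noisy_latents, timestep, encoder_hidden_states=null_audio_emb)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```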
```bash
python gradio_app.py --enable_sr
```

```mermaid
graph TD
    A[Input Video] --> B(LatentSync Generation)
    B --> C{Face Detection}
    C -->|Detected| D[SR Processing]
    C -->|Not Detected| E[Direct Output]
    D --> F[GFPGAN/CodeFormer]
    F --> G[Mask Blending]
    G --> H[Final Output]
```
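A rough sketch of the mask-blending step at the end of this pipeline, assuming a per-pixel face mask and `--mask_smooth` controlling how much the mask edge is feathered (function and variable names are illustrative, not the repository's exact implementation):

```python
import cv2
import numpy as np

def blend_restored_face(frame, restored_face, face_mask, mask_smooth=0.5):
    """Blend the super-resolved face back into the original frame.

    frame, restored_face: uint8 images of the same size.
    face_mask: float mask in [0, 1], 1 inside the face region.
    mask_smooth: 0-1, larger values feather the mask edge more (assumed mapping).
    """
    # Feather the mask edge; the kernel size scales with mask_smooth and must be odd.
    ksize = max(1, int(mask_smooth * 31)) | 1
    soft_mask = cv2.GaussianBlur(face_mask.astype(np.float32), (ksize, ksize), 0)
    soft_mask = soft_mask[..., None]  # broadcast over color channels
    blended = soft_mask * restored_face + (1 - soft_mask) * frame
    return blended.astype(np.uint8)
```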
Evaluate results with:
```bash
./eval/eval_quality.sh \
    --video_path output.mp4 \
    --metric psnr ssim fid
```

Supported metrics:
- PSNR (Peak Signal-to-Noise Ratio)
- SSIM (Structural Similarity)
- FID (Fréchet Inception Distance)
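For a quick frame-level sanity check outside the provided script, PSNR and SSIM can be computed per frame pair with scikit-image (a minimal sketch; the evaluation script may aggregate over frames differently):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(generated: np.ndarray, reference: np.ndarray):
    """Per-frame PSNR and SSIM between a generated and a reference RGB frame (uint8)."""
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    ssim = structural_similarity(reference, generated, channel_axis=-1, data_range=255)
    return psnr, ssim
```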
- GFPGAN - Practical face restoration
- CodeFormer - Robust face enhancement
- BasicSR - Super-resolution framework
- AnimateDiff - Base architecture
Apache License 2.0 - See LICENSE for details

