
LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync


📚 Abstract

We present LatentSync, an end-to-end lip sync framework based on audio-conditioned latent diffusion models without any intermediate motion representation, diverging from previous diffusion-based lip sync methods that rely on pixel-space diffusion or two-stage generation. Our framework can leverage the powerful capabilities of Stable Diffusion to directly model complex audio-visual correlations. Additionally, we found that diffusion-based lip sync methods exhibit inferior temporal consistency due to inconsistencies in the diffusion process across different frames. We propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground-truth frames.

🏗️ Framework

LatentSync uses Whisper to convert the melspectrogram into audio embeddings, which are then integrated into the U-Net via cross-attention layers. The reference and masked frames are channel-wise concatenated with the noised latents as the input to the U-Net. During training, we use a one-step method to obtain estimated clean latents from the predicted noise, which are then decoded to obtain the estimated clean frames. The TREPA, LPIPS and SyncNet losses are added in pixel space.
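To make this concrete, the sketch below shows the one-step clean-latent estimate and the pixel-space loss combination described above, assuming an epsilon-prediction DDPM parameterization and a diffusers-style VAE; the trepa_loss, lpips_loss, and syncnet_loss modules and all variable names are illustrative placeholders, not the repository's actual API.

import torch

# One-step estimate of the clean latents from the predicted noise
# (epsilon parameterization): x_t = sqrt(a_bar)*x_0 + sqrt(1 - a_bar)*eps
def estimate_clean_latents(noisy_latents, predicted_noise, alphas_cumprod, timesteps):
    a_bar = alphas_cumprod[timesteps].view(-1, 1, 1, 1)
    return (noisy_latents - torch.sqrt(1.0 - a_bar) * predicted_noise) / torch.sqrt(a_bar)

# Decode the per-frame latents to pixel space and add the supervision terms there
def pixel_space_loss(vae, clean_latents_hat, gt_frames, audio_embeds,
                     trepa_loss, lpips_loss, syncnet_loss):
    frames_hat = vae.decode(clean_latents_hat / vae.config.scaling_factor).sample
    return (trepa_loss(frames_hat, gt_frames)            # temporal alignment
            + lpips_loss(frames_hat, gt_frames).mean()   # perceptual similarity
            + syncnet_loss(frames_hat, audio_embeds))    # lip-sync supervision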

🎮 Demo

Original video → Lip-synced video
demo1_video.mp4 → demo1_output.mp4
demo2_video.mp4 → demo2_output.mp4

💑 Open-source Plan

  • Inference code and checkpoints
  • Data processing pipeline
  • Training code
  • Super-resolution integration

🔧 Setting up the Environment

Install the required packages and download the checkpoints via:

# Create the environment and download the LatentSync checkpoints
source setup_env.sh
# Install the super-resolution dependencies
pip install gfpgan basicsr
# Fetch CodeFormer and download its face-detection (facelib) weights
git clone https://github.com/sczhou/CodeFormer
cd CodeFormer && python scripts/download_pretrained_models.py facelib

File structure after setup:

./checkpoints/
|-- latentsync_unet.pt
|-- latentsync_syncnet.pt
|-- whisper/
|-- auxiliary/
|-- CodeFormer/

🚀 Enhanced Inference with Super-Resolution

Command Line Interface

python main.py \
    --video_path input.mp4 \
    --audio_path audio.wav \
    --superres [GFPGAN/CodeFormer] \
    --video_out_path output.mp4 \
    --inference_steps 25 \
    --guidance_scale 1.5

New Parameters:

  • --superres: Apply face enhancement using GFPGAN or CodeFormer
  • --mask_smooth: Blending smoothness (0-1, default=0.5)
  • --sr_ratio: Super-resolution scale factor (1-4, default=2)
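As a rough illustration of how --sr_ratio and --mask_smooth could interact, the sketch below upscales a generated face crop and feathers it back into the frame; enhance_face is a hypothetical stand-in for a GFPGAN/CodeFormer wrapper, and the blending logic is an assumption, not the actual implementation.

import cv2
import numpy as np

def blend_superres_face(frame, face_box, enhance_face, sr_ratio=2, mask_smooth=0.5):
    """Upscale a detected face crop and feather it back into the frame (in place)."""
    x1, y1, x2, y2 = face_box
    crop = frame[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # Hypothetical enhancer applied at sr_ratio, then resized back for pasting.
    restored = enhance_face(crop, upscale=sr_ratio)
    restored = cv2.resize(restored, (w, h), interpolation=cv2.INTER_AREA)
    # mask_smooth in [0, 1] controls the feather width of the blend mask.
    margin = max(1, int(mask_smooth * min(h, w) * 0.25))
    mask = np.zeros((h, w), dtype=np.float32)
    mask[margin:h - margin, margin:w - margin] = 1.0
    mask = cv2.GaussianBlur(mask, (2 * margin + 1, 2 * margin + 1), 0)[..., None]
    blended = mask * restored.astype(np.float32) + (1.0 - mask) * crop.astype(np.float32)
    frame[y1:y2, x1:x2] = np.clip(blended, 0, 255).astype(frame.dtype)
    return frame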

Gradio Interface

python gradio_app.py --enable_sr

🛠️ Technical Enhancements

Super-Resolution Pipeline

graph TD
    A[Input Video] --> B(LatentSync Generation)
    B --> C{Face Detection}
    C -->|Detected| D[SR Processing]
    C -->|Not Detected| E[Direct Output]
    D --> F[GFPGAN/CodeFormer]
    F --> G[Mask Blending]
    G --> H[Final Output]
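In code, the per-frame logic of this flowchart might look roughly like the following; detect_face, enhance_face, and blend_face are hypothetical helpers standing in for the face detector, the GFPGAN/CodeFormer call, and the mask-blending step.

def postprocess_frame(frame, detect_face, enhance_face, blend_face):
    """Per-frame control flow mirroring the flowchart above."""
    face_box = detect_face(frame)                  # Face Detection
    if face_box is None:                           # Not Detected -> Direct Output
        return frame
    restored = enhance_face(frame, face_box)       # SR Processing (GFPGAN/CodeFormer)
    return blend_face(frame, restored, face_box)   # Mask Blending -> Final Output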

📊 Quality Evaluation

Evaluate results with:

./eval/eval_quality.sh \
    --video_path output.mp4 \
    --metric psnr ssim fid

Supported metrics:

  • PSNR (Peak Signal-to-Noise Ratio)
  • SSIM (Structural Similarity)
  • FID (Fréchet Inception Distance)
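For reference, per-frame PSNR and SSIM can be computed with scikit-image as sketched below; FID requires a separate feature extractor (e.g. a pretrained Inception network) and is omitted. The frame-pairing logic and file names are illustrative, not the eval script itself, and both videos are assumed to have the same resolution and 8-bit frames.

import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def video_frames(path):
    """Yield frames from a video file as uint8 BGR arrays."""
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield frame
    cap.release()

psnrs, ssims = [], []
for ref, out in zip(video_frames("reference.mp4"), video_frames("output.mp4")):
    psnrs.append(peak_signal_noise_ratio(ref, out, data_range=255))
    ssims.append(structural_similarity(ref, out, channel_axis=-1, data_range=255))
print(f"PSNR: {sum(psnrs)/len(psnrs):.2f} dB, SSIM: {sum(ssims)/len(ssims):.4f}")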

🙏 Acknowledgements

🐟 License

Apache License 2.0 - See LICENSE for details
