[CVPR 2026] IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation
Hao Wu, Xiangyang Luo, Hao Wang, Jiawei Zhang, Yi Zhang, and Jinwei Wang
Abstract: With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder the scalability and accessibility of diffusion-based approaches across the research community. To address this, we propose a fine-tuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter. This backbone leverages the visual embedding capability of IP-Adapter to mine lip-related semantics from the pretrained Stable Diffusion. To address the challenges of identity drift, synchronization errors, and temporal instability, we also design three trainable-parameter-free components: (1) the Structurist, which explicitly disentangles and reassembles lip and appearance features to mitigate identity drift and appearance distortion; (2) the Structure Controller, which adaptively refines embeddings based on quasi-monotonic motion trends for precise lip synchronization; and (3) the Noise Sensor, which introduces a Gaussian prior to detect and suppress flicker and jitter artifacts and enhance temporal consistency. Experimental results show that our method outperforms existing SOTA approaches in both lip-sync accuracy (at least 0.16 gain in PCLD) and visual fidelity (at least 0.7 improvement in FID), establishing a novel fine-tuning-free diffusion framework for talking face generation.
freetalkdiff/
|-- configs/
| `-- inference.yaml # Main runtime configuration
|-- gfpgan/ # GFPGAN weights
| `-- weights/
| |-- detection_Resnet50_Final.pth
| `-- parsing_parsenet.pth
|-- modules/
| |-- face_enhancer/ # GFPGAN and RealESRGAN weights
| | |-- GFPGANv1.4.pth
| | `-- RealESRGAN_x4plus.pth
| |-- mediapipe/ # MediaPipe local assets
| | |-- canonical_face_model.obj
| | `-- selfie_multiclass_256x256.tflite
| |-- face_alignment.py # InsightFace alignment and recovery utilities
| |-- face_enhancer.py # GFPGAN/RealESRGAN restoration
| |-- masker.py # MediaPipe segmentation-based mouth/face masks
| |-- mediapipe_segmenter.py # Selfie segmentation wrapper
| |-- noise_sensor.py # Proposed Noise Sensor module
| |-- structure_controller.py # Proposed Structure Controller module
| `-- structurist.py # Proposed Structurist module
|-- pipelines/
| |-- sd15_inpaint_ipadapter_faceid_pipeline.py # Diffusers backbone
| `-- freetalkdiff.py # Full FreeTalkDiff pipeline
|-- scripts/
| `-- inference.py # End-to-end demo script
Create an environment:
conda create -n attack_talker python=3.10 -y
conda activate attack_talker
Install PyTorch for your CUDA version first. For example, choose the command recommended by the official PyTorch selector for your driver/runtime:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
Install the remaining runtime dependencies from the project requirements file:
pip install -r requirements.txt
The requirements.txt file is a minimal runtime dependency list for the main inference pipeline.
Notes:
- Match the PyTorch and CUDA versions to your own GPU environment.
- Match onnxruntime-gpu to your CUDA/cuDNN runtime as well.
- torchvision.io.read_video and write_video require a working video backend such as FFmpeg.
- InsightFace may download its buffalo_l model package into the local InsightFace cache on first use.
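The notes above can be checked before a first run with a small script. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and finding an `ffmpeg` executable on `PATH` is only a rough proxy for a working torchvision video backend.

```python
import importlib.util
import shutil


def check_environment():
    """Report which runtime prerequisites from the notes are present."""
    report = {
        # True when the package can be located in this environment.
        "torch": importlib.util.find_spec("torch") is not None,
        "torchvision": importlib.util.find_spec("torchvision") is not None,
        # onnxruntime-gpu installs the importable module "onnxruntime".
        "onnxruntime": importlib.util.find_spec("onnxruntime") is not None,
        # True when an ffmpeg executable is found on PATH (rough proxy only).
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "cuda": False,
        "onnx_providers": [],
    }
    if report["torch"]:
        import torch
        report["cuda"] = bool(torch.cuda.is_available())
    if report["onnxruntime"]:
        import onnxruntime
        report["onnx_providers"] = list(onnxruntime.get_available_providers())
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

If `cuda` is reported as False, or `onnx_providers` does not include `CUDAExecutionProvider`, the installed PyTorch or onnxruntime-gpu build probably does not match the local CUDA runtime.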
FreeTalkDiff loads several models automatically from Hugging Face when the pipeline starts:
- stable-diffusion-v1-5/stable-diffusion-inpainting
- h94/IP-Adapter
- h94/IP-Adapter-FaceID
- OedoSoldier/detail-tweaker-lora
The repo/config also expects these local assets:
- gfpgan/weights/detection_Resnet50_Final.pth
- gfpgan/weights/parsing_parsenet.pth
- modules/mediapipe/selfie_multiclass_256x256.tflite
- modules/mediapipe/canonical_face_model.obj
- modules/face_enhancer/RealESRGAN_x4plus.pth
- modules/face_enhancer/GFPGANv1.4.pth
- InsightFace buffalo_l model cache, downloaded by InsightFace if missing
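A quick way to confirm that the local files are in place is to test each path before starting the pipeline. This is a minimal sketch under the assumption that the paths are relative to the repository root; `REQUIRED_ASSETS` and `missing_assets` are hypothetical names, and the InsightFace cache is not checked because its location depends on the InsightFace installation.

```python
from pathlib import Path

# Relative paths copied from the local asset list in this README.
REQUIRED_ASSETS = [
    "gfpgan/weights/detection_Resnet50_Final.pth",
    "gfpgan/weights/parsing_parsenet.pth",
    "modules/mediapipe/selfie_multiclass_256x256.tflite",
    "modules/mediapipe/canonical_face_model.obj",
    "modules/face_enhancer/RealESRGAN_x4plus.pth",
    "modules/face_enhancer/GFPGANv1.4.pth",
]


def missing_assets(repo_root="."):
    """Return the required asset paths that are not files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_ASSETS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_assets()
    if missing:
        print("Missing assets:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All local assets found.")
```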
You can download them from Google Drive.
Run inference with an identity video, a driven video, and an output path:
python scripts/inference.py \
--cfg_path configs/inference.yaml \
--id_video path/to/id_video.mp4 \
--driven_video path/to/driven_video.mp4 \
--output_video path/to/output.mp4
Arguments:
- --cfg_path: inference config path. Defaults to configs/inference.yaml.
- --id_video: identity and appearance source video.
- --driven_video: mouth motion driving video.
- --output_video: output video save path.
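To process several video pairs, the same command can be driven from Python. The sketch below is not part of the repository: `build_command` and `run_batch` are hypothetical helpers that only assemble the four documented flags and call the script once per job, assuming the working directory is the repository root.

```python
import subprocess
import sys


def build_command(id_video, driven_video, output_video,
                  cfg_path="configs/inference.yaml"):
    """Build the argument list for one scripts/inference.py run."""
    return [
        sys.executable, "scripts/inference.py",
        "--cfg_path", str(cfg_path),
        "--id_video", str(id_video),
        "--driven_video", str(driven_video),
        "--output_video", str(output_video),
    ]


def run_batch(jobs):
    """Run each (id_video, driven_video, output_video) job in sequence."""
    for id_video, driven_video, output_video in jobs:
        # check=True raises CalledProcessError if a run exits non-zero.
        subprocess.run(
            build_command(id_video, driven_video, output_video), check=True
        )
```

For example, `run_batch([("a_id.mp4", "a_drv.mp4", "a_out.mp4")])` runs a single job and stops at the first failure.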
Important fields in configs/inference.yaml:
| Field | Default | Meaning |
|---|---|---|
| freetalkdiff.setup.det_size | 256 | Face detection/alignment size. |
| freetalkdiff.setup.scale | 1.0 | IP-Adapter scale. |
| freetalkdiff.setup.torch_dtype | float16 | Torch dtype used for diffusion models. |
| freetalkdiff.setup.variant | fp16 | Hugging Face model variant. |
| freetalkdiff.setup.device | cuda | Runtime device. |
| freetalkdiff.setup.seed | 42 | Generator seed for repeatability. |
| freetalkdiff.setup.proxy_port | 7890 | Local proxy port and Hugging Face mirror switch. |
| freetalkdiff.inference.guidance_scale | 7.5 | Classifier-free guidance scale. |
| freetalkdiff.inference.num_inference_steps | 50 | Diffusion sampling steps. |
| freetalkdiff.inference.strength | 1.0 | Inpainting denoising strength. |
| freetalkdiff.inference.clip_len | 5 | Number of frames processed per diffusion clip. |
When proxy_port is not null, the code sets HF_ENDPOINT=https://hf-mirror.com and configures HTTP_PROXY/HTTPS_PROXY to http://127.0.0.1:{proxy_port}. Set proxy_port: null if you do not want this behavior.
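The proxy behavior described above amounts to setting three environment variables. The following is a minimal sketch of that logic, not the repository's actual code; the function name `apply_proxy_settings` and its `env` parameter are assumptions made for illustration.

```python
import os


def apply_proxy_settings(proxy_port, env=None):
    """Apply the documented proxy behavior to an environment mapping."""
    env = os.environ if env is None else env
    if proxy_port is None:
        # proxy_port: null leaves the environment untouched.
        return env
    proxy_url = f"http://127.0.0.1:{proxy_port}"
    env["HF_ENDPOINT"] = "https://hf-mirror.com"
    env["HTTP_PROXY"] = proxy_url
    env["HTTPS_PROXY"] = proxy_url
    return env
```

Variables such as `HF_ENDPOINT` generally need to be set before the Hugging Face libraries are imported, so a change made after the models start loading may have no effect.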
Hugging Face downloads fail
- Check network access to Hugging Face or the configured mirror.
- If you do not use a local proxy, set freetalkdiff.setup.proxy_port: null.
- If you do use a proxy, make sure it is listening on the configured port.
CUDA or ONNX Runtime provider errors
- Install onnxruntime-gpu that matches your CUDA runtime.
- Confirm PyTorch sees your GPU with python -c "import torch; print(torch.cuda.is_available())".
- If you need CPU debugging, change device and provider assumptions carefully; the current pipeline is written for CUDA/fp16 execution.
No face detected
- Use clear, frontal source and driven videos.
- Ensure both videos contain a detectable face in the first frames.
- The pipeline falls back to previous valid embeddings for some per-frame misses, but the first valid frame must contain a face.
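The fallback described in the last point can be pictured as a forward fill over per-frame embeddings. This is only an illustrative sketch, not the pipeline's implementation: `fill_missing_embeddings` is a hypothetical name, a missed detection is represented by `None`, and the first entry is assumed to be the frame that must contain a face.

```python
def fill_missing_embeddings(embeddings):
    """Replace None entries with the most recent valid embedding."""
    if not embeddings or embeddings[0] is None:
        # Nothing earlier exists to fall back to.
        raise ValueError("the first frame must contain a detectable face")
    filled = []
    last_valid = None
    for emb in embeddings:
        if emb is not None:
            last_valid = emb
        filled.append(last_valid)
    return filled
```

With this scheme a miss in the middle of a video reuses the previous frame's embedding, while a miss on the first frame cannot be recovered, which is why clear faces in the opening frames matter.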
Missing local assets
- Verify the MediaPipe files listed in the asset checklist.
- Verify GFPGAN and RealESRGAN weights before enabling FaceEnhancer.
- Confirm InsightFace can download or find buffalo_l.
Video read/write errors
- Install FFmpeg and ensure it is visible to the Python environment.
- Try writing to another codec/container if torchvision.io.write_video fails on your platform.
AnimateDiff extends SD 1.5 with pluggable motion modules and motion LoRA, enabling coherent video generation. This design does not compromise compatibility with IP-Adapter, so AnimateDiff can be paired with IP-Adapter to form a seemingly feasible fine-tuning-free backbone for talking face generation. However, the current IP-Adapter implementation in Diffusers provides only a single shared lip reference for the entire input video clip, rather than assigning distinct lip features to each frame. As a result, the AnimateDiff + IP-Adapter backbone produces a static lip shape across frames, which prevents it from achieving fine-tuning-free talking face generation. Looking ahead, as the AnimateDiff, IP-Adapter, and Diffusers communities continue to evolve and these frameworks become more modular and open, we believe this combined backbone holds great potential to advance fine-tuning-free talking face generation.
The project license is Apache-2.0.
Please also check the licenses and usage terms of the upstream models and assets used by this project.
This project builds on excellent open-source work from the Diffusers, IP-Adapter, InsightFace, MediaPipe, GFPGAN, and RealESRGAN communities.
