FreeTalkDiff

[CVPR 2026] IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation

Hao Wu, Xiangyang Luo, Hao Wang, Jiawei Zhang, Yi Zhang, and Jinwei Wang

Abstract: With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder the scalability and accessibility of diffusion-based approaches across the research community. To address this, we propose a fine-tuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter. This backbone leverages the visual embedding capability of IP-Adapter to mine lip-related semantics from the pretrained Stable Diffusion. To address the challenges of identity drift, synchronization errors, and temporal instability, we also design three trainable-parameter-free components: (1) the Structurist, which explicitly disentangles and reassembles lip and appearance features to mitigate identity drift and appearance distortion; (2) the Structure Controller, which adaptively refines embeddings based on quasi-monotonic motion trends for precise lip synchronization; and (3) the Noise Sensor, which introduces a Gaussian prior to detect and suppress flicker and jitter artifacts and enhance temporal consistency. Experimental results show that our method outperforms existing SOTA approaches in both lip-sync accuracy (at least 0.16 gain in PCLD) and visual fidelity (at least 0.7 improvement in FID), establishing a novel fine-tuning-free diffusion framework for talking face generation.

Repository Layout

freetalkdiff/
|-- configs/
|   `-- inference.yaml                              # Main runtime configuration
|-- gfpgan/                                         # GFPGAN weights
|   `-- weights/
|       |-- detection_Resnet50_Final.pth
|       `-- parsing_parsenet.pth
|-- modules/             
|   |-- face_enhancer/                              # GFPGAN and RealESRGAN weights
|   |   |-- GFPGANv1.4.pth             
|   |   `-- RealESRGAN_x4plus.pth             
|   |-- mediapipe/                                  # MediaPipe local assets
|   |   |-- canonical_face_model.obj   
|   |   `-- selfie_multiclass_256x256.tflite   
|   |-- face_alignment.py                           # InsightFace alignment and recovery utilities
|   |-- face_enhancer.py                            # GFPGAN/RealESRGAN restoration
|   |-- masker.py                                   # MediaPipe segmentation-based mouth/face masks
|   |-- mediapipe_segmenter.py                      # Selfie segmentation wrapper
|   |-- noise_sensor.py                             # Proposed Noise Sensor module
|   |-- structure_controller.py                     # Proposed Structure Controller module
|   `-- structurist.py                              # Proposed Structurist module
|-- pipelines/
|   |-- sd15_inpaint_ipadapter_faceid_pipeline.py   # Diffusers backbone
|   `-- freetalkdiff.py                             # Full FreeTalkDiff pipeline
`-- scripts/
    `-- inference.py                                # End-to-end demo script

Environment Setup

Create an environment:

conda create -n attack_talker python=3.10 -y
conda activate attack_talker

Install PyTorch for your CUDA version first, using the command recommended by the official PyTorch selector for your driver/runtime. For example, for CUDA 12.1:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

Install the remaining runtime dependencies from the project requirements file:

pip install -r requirements.txt

The requirements.txt file is a minimal runtime dependency list for the main inference pipeline.

Notes:

Match the PyTorch and CUDA versions to your own GPU environment.
Match onnxruntime-gpu to your CUDA/cuDNN runtime as well.
torchvision.io.read_video and write_video require a working video backend such as FFmpeg.
InsightFace may download its buffalo_l model package into the local InsightFace cache on first use.
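
The notes above can be checked before the first run. The following is a minimal sketch and is not part of the repository: the helper name `check_environment` is hypothetical, and it only tests whether the packages are importable and whether an `ffmpeg` executable is on `PATH`. It does not confirm that the installed versions match your CUDA/cuDNN runtime.

```python
import importlib.util
import shutil


def check_environment():
    """Report which runtime prerequisites from the notes are present.

    Hypothetical helper, not part of FreeTalkDiff. It checks presence only,
    not version compatibility with the local CUDA/cuDNN runtime.
    """
    # onnxruntime-gpu is imported under the module name "onnxruntime".
    packages = ["torch", "torchvision", "onnxruntime", "insightface"]
    report = {name: importlib.util.find_spec(name) is not None for name in packages}
    # torchvision's video I/O needs a working backend such as FFmpeg.
    report["ffmpeg"] = shutil.which("ffmpeg") is not None
    return report


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Any entry reported as missing points back to the corresponding install step or note above.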

Model And Asset Checklist

FreeTalkDiff loads several models automatically from Hugging Face when the pipeline starts:

stable-diffusion-v1-5/stable-diffusion-inpainting
h94/IP-Adapter
h94/IP-Adapter-FaceID
OedoSoldier/detail-tweaker-lora

The repo/config also expects these local assets:

gfpgan/weights/detection_Resnet50_Final.pth
gfpgan/weights/parsing_parsenet.pth
modules/mediapipe/selfie_multiclass_256x256.tflite
modules/mediapipe/canonical_face_model.obj
modules/face_enhancer/RealESRGAN_x4plus.pth
modules/face_enhancer/GFPGANv1.4.pth
InsightFace buffalo_l model cache, downloaded by InsightFace if missing

You can download them from Google Drive.
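
A quick way to confirm the local files are in place is to test each checklist path before running inference. This is a sketch under stated assumptions: `LOCAL_ASSETS` and `missing_assets` are hypothetical names, the paths are copied from the checklist above, and the InsightFace `buffalo_l` cache is left out because InsightFace manages its own cache location.

```python
from pathlib import Path

# Local asset paths copied from the checklist, relative to the repo root.
LOCAL_ASSETS = [
    "gfpgan/weights/detection_Resnet50_Final.pth",
    "gfpgan/weights/parsing_parsenet.pth",
    "modules/mediapipe/selfie_multiclass_256x256.tflite",
    "modules/mediapipe/canonical_face_model.obj",
    "modules/face_enhancer/RealESRGAN_x4plus.pth",
    "modules/face_enhancer/GFPGANv1.4.pth",
]


def missing_assets(repo_root="."):
    """Return the checklist paths that are not regular files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in LOCAL_ASSETS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_assets()
    if missing:
        print("Missing local assets:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All local assets found.")
```

Run it from the repository root, or pass the root path explicitly to `missing_assets`.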

Quick Start

Run inference with an identity video, a driven video, and an output path:

python scripts/inference.py \
  --cfg_path configs/inference.yaml \
  --id_video path/to/id_video.mp4 \
  --driven_video path/to/driven_video.mp4 \
  --output_video path/to/output.mp4

Arguments:

--cfg_path: inference config path. Defaults to configs/inference.yaml.
--id_video: identity and appearance source video.
--driven_video: mouth motion driving video.
--output_video: output video save path.
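
For readers who want to wrap or extend the command line, the four documented flags correspond to an `argparse` definition along the following lines. This is an illustrative sketch, not the code in `scripts/inference.py`: `build_parser` is a hypothetical name, and marking the three video paths as required is an assumption, since the README only documents a default for `--cfg_path`.

```python
import argparse


def build_parser():
    """Build a parser mirroring the four documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="FreeTalkDiff inference (sketch)")
    parser.add_argument("--cfg_path", default="configs/inference.yaml",
                        help="inference config path")
    # Assumption: the video paths have no defaults and must be supplied.
    parser.add_argument("--id_video", required=True,
                        help="identity and appearance source video")
    parser.add_argument("--driven_video", required=True,
                        help="mouth motion driving video")
    parser.add_argument("--output_video", required=True,
                        help="output video save path")
    return parser


if __name__ == "__main__":
    # Example values only; replace with real paths.
    example = ["--id_video", "id.mp4", "--driven_video", "drive.mp4",
               "--output_video", "out.mp4"]
    print(build_parser().parse_args(example))
```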

Configuration

Important fields in configs/inference.yaml:

Field	Default	Meaning
`freetalkdiff.setup.det_size`	`256`	Face detection/alignment size.
`freetalkdiff.setup.scale`	`1.0`	IP-Adapter scale.
`freetalkdiff.setup.torch_dtype`	`float16`	Torch dtype used for diffusion models.
`freetalkdiff.setup.variant`	`fp16`	Hugging Face model variant.
`freetalkdiff.setup.device`	`cuda`	Runtime device.
`freetalkdiff.setup.seed`	`42`	Generator seed for repeatability.
`freetalkdiff.setup.proxy_port`	`7890`	Local proxy port and Hugging Face mirror switch.
`freetalkdiff.inference.guidance_scale`	`7.5`	Classifier-free guidance scale.
`freetalkdiff.inference.num_inference_steps`	`50`	Diffusion sampling steps.
`freetalkdiff.inference.strength`	`1.0`	Inpainting denoising strength.
`freetalkdiff.inference.clip_len`	`5`	Number of frames processed per diffusion clip.
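
The dotted field names in the table suggest a nested layout with `setup` and `inference` groups under a top-level `freetalkdiff` key. The sketch below mirrors the documented defaults as a Python dictionary and resolves a dotted name against it. The nesting is an assumption inferred from the field names, and `DEFAULT_CONFIG` and `get_field` are hypothetical names; the actual YAML file may be organized differently.

```python
# Assumed nesting inferred from the dotted field names in the table;
# values are the documented defaults.
DEFAULT_CONFIG = {
    "freetalkdiff": {
        "setup": {
            "det_size": 256,
            "scale": 1.0,
            "torch_dtype": "float16",
            "variant": "fp16",
            "device": "cuda",
            "seed": 42,
            "proxy_port": 7890,
        },
        "inference": {
            "guidance_scale": 7.5,
            "num_inference_steps": 50,
            "strength": 1.0,
            "clip_len": 5,
        },
    }
}


def get_field(config, dotted_key):
    """Look up a value by a dotted key such as 'freetalkdiff.setup.seed'."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]  # raises KeyError if the field does not exist
    return node


if __name__ == "__main__":
    print(get_field(DEFAULT_CONFIG, "freetalkdiff.inference.clip_len"))
```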

When proxy_port is not null, the code sets HF_ENDPOINT=https://hf-mirror.com and configures HTTP_PROXY/HTTPS_PROXY to http://127.0.0.1:{proxy_port}. Set proxy_port: null if you do not want this behavior.
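
The effect of `proxy_port` described above can be sketched as follows. This is a reconstruction from the description, not the project's code: `apply_proxy_settings` is a hypothetical name, and the optional `env` argument exists only so the behavior can be tried on a plain dictionary without touching the real process environment.

```python
import os


def apply_proxy_settings(proxy_port, env=None):
    """Set the mirror and proxy variables described above on env.

    Hypothetical helper. env defaults to os.environ; pass a dict to experiment.
    """
    env = os.environ if env is None else env
    if proxy_port is None:
        # proxy_port: null leaves the environment untouched.
        return env
    proxy_url = f"http://127.0.0.1:{proxy_port}"
    env["HF_ENDPOINT"] = "https://hf-mirror.com"
    env["HTTP_PROXY"] = proxy_url
    env["HTTPS_PROXY"] = proxy_url
    return env


if __name__ == "__main__":
    print(apply_proxy_settings(7890, env={}))
```

With the default port, all Hugging Face traffic is routed to the mirror through the local proxy, which is why downloads fail when nothing is listening on that port.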

Troubleshooting

Hugging Face downloads fail

Check network access to Hugging Face or the configured mirror.
If you do not use a local proxy, set freetalkdiff.setup.proxy_port: null.
If you do use a proxy, make sure it is listening on the configured port.

CUDA or ONNX Runtime provider errors

Install onnxruntime-gpu that matches your CUDA runtime.
Confirm PyTorch sees your GPU with python -c "import torch; print(torch.cuda.is_available())".
If you need CPU debugging, change device and provider assumptions carefully; the current pipeline is written for CUDA/fp16 execution.

No face detected

Use clear, frontal source and driven videos.
Ensure both videos contain a detectable face in the first frames.
The pipeline falls back to previous valid embeddings for some per-frame misses, but the first valid frame must contain a face.
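
The fallback described in the last item can be pictured as a forward fill over per-frame results. The sketch below illustrates that idea only. `fill_missing_embeddings` is a hypothetical name, `None` stands in for a frame with no detected face, and the README does not specify which misses the real pipeline covers.

```python
def fill_missing_embeddings(embeddings):
    """Replace None entries with the most recent valid embedding.

    Illustrative forward fill, not the pipeline's implementation. Raises
    ValueError when a miss occurs before any valid embedding, mirroring the
    requirement that the first frame must contain a face.
    """
    filled = []
    last_valid = None
    for index, embedding in enumerate(embeddings):
        if embedding is None:
            if last_valid is None:
                raise ValueError(f"no face detected at or before frame {index}")
            embedding = last_valid
        last_valid = embedding
        filled.append(embedding)
    return filled


if __name__ == "__main__":
    # Strings stand in for embedding vectors.
    print(fill_missing_embeddings(["e0", None, None, "e3", None]))
```

A miss in the very first frame has nothing to fall back on, which is why the first frames of both videos need a detectable face.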

Missing local assets

Verify the MediaPipe files listed in the asset checklist.
Verify GFPGAN and RealESRGAN weights before enabling FaceEnhancer.
Confirm InsightFace can download or find buffalo_l.

Video read/write errors

Install FFmpeg and ensure it is visible to the Python environment.
Try writing to another codec/container if torchvision.io.write_video fails on your platform.

Discussion

AnimateDiff extends SD 1.5 with pluggable motion modules and motion LoRA, enabling coherent video generation without compromising compatibility with IP-Adapter. Consequently, AnimateDiff can be paired with IP-Adapter to form a seemingly feasible fine-tuning-free backbone for talking face generation. However, the current IP-Adapter integration in Diffusers provides only a single shared lip reference for the entire input video clip, rather than assigning distinct lip features to each frame. As a result, the AnimateDiff + IP-Adapter backbone produces a static lip shape across frames, which prevents it from achieving fine-tuning-free talking face generation. Looking ahead, as the AnimateDiff, IP-Adapter, and Diffusers communities continue to evolve and these frameworks become more modular and open, we believe this combined backbone holds great potential to advance fine-tuning-free talking face generation.

License

The project license is Apache-2.0.

Please also check the licenses and usage terms of the upstream models and assets used by this project.

Acknowledgements

This project builds on excellent open-source work from the Diffusers, IP-Adapter, InsightFace, MediaPipe, GFPGAN, and RealESRGAN communities.
