This project provides a ComfyUI wrapper for FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait.
For a more advanced and maintained version, check out: ComfyUI-FLOAT_Optimized
If you like my projects and wish to see updates and new features, please consider supporting me. It helps a lot!
```bash
# run from your ComfyUI/custom_nodes directory
git clone https://github.com/yuvraj108c/ComfyUI-FLOAT.git
cd ./ComfyUI-FLOAT
pip install -r requirements.txt
```
- Load the example workflow
- Upload the driving image and audio, then click Queue
- Models auto-download to `/ComfyUI/models/float`
- The models are organized as follows:
```
|-- float.pth                                       # main model
|-- wav2vec2-base-960h/                             # audio encoder
|   |-- config.json
|   |-- model.safetensors
|   |-- preprocessor_config.json
|-- wav2vec-english-speech-emotion-recognition/     # emotion encoder
    |-- config.json
    |-- preprocessor_config.json
    |-- pytorch_model.bin
```
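If the auto-download fails (e.g. offline or behind a proxy), the two wav2vec folders can be fetched manually from the Hugging Face Hub. Below is a minimal sketch using `huggingface_hub`; the repo IDs and target path are assumptions inferred from the folder names above, not something this README specifies, and `float.pth` itself is fetched by the node from a source not listed here.

```python
# Manual model download sketch. ASSUMPTIONS: the repo IDs below and the
# ComfyUI/models/float layout are inferred from the folder tree above.
from pathlib import Path
from huggingface_hub import snapshot_download  # pip install huggingface_hub

FLOAT_DIR = Path("ComfyUI/models/float")  # adjust to your ComfyUI root

# Hypothetical repo IDs matching the expected folder names.
repos = {
    "wav2vec2-base-960h": "facebook/wav2vec2-base-960h",
    "wav2vec-english-speech-emotion-recognition": "r-f/wav2vec-english-speech-emotion-recognition",
}

for folder, repo_id in repos.items():
    snapshot_download(repo_id=repo_id, local_dir=FLOAT_DIR / folder)
    print(f"downloaded {repo_id} -> {FLOAT_DIR / folder}")
```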
- `ref_image`: Reference image with a face (must have batch size 1)
- `ref_audio`: Reference audio (for long audio, e.g. 3+ minutes, ensure you have enough RAM/VRAM)
- `a_cfg_scale`: Audio classifier-free guidance scale (default: 2)
- `r_cfg_scale`: Reference classifier-free guidance scale (default: 1)
- `emotion`: none, angry, disgust, fear, happy, neutral, sad, surprise (default: none)
- `e_cfg_scale`: Intensity of emotion (default: 1). For a more emotionally intense video, try larger values from 5 to 10
- `crop`: Enable only if the reference image does not have a centered face
- `fps`: Frame rate of the output video (default: 25)
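For scripted use, these inputs can also be set in an API-format workflow and sent to ComfyUI's `/prompt` endpoint. The sketch below is illustrative only: the node class names (`FloatProcess`) and exact input wiring are hypothetical assumptions, so export your own workflow via "Save (API Format)" to get the real names.

```python
# Minimal sketch: queue a FLOAT job through ComfyUI's HTTP API.
# ASSUMPTION: "FloatProcess" is a placeholder node name, not confirmed
# by this README; only the parameter names/defaults come from the list above.
import json
import urllib.request

prompt = {
    "1": {"class_type": "LoadImage", "inputs": {"image": "portrait.png"}},
    "2": {"class_type": "LoadAudio", "inputs": {"audio": "speech.wav"}},
    "3": {
        "class_type": "FloatProcess",  # hypothetical node name
        "inputs": {
            "ref_image": ["1", 0],
            "ref_audio": ["2", 0],
            "a_cfg_scale": 2.0,   # audio CFG scale (default 2)
            "r_cfg_scale": 1.0,   # reference CFG scale (default 1)
            "emotion": "none",
            "e_cfg_scale": 1.0,   # raise toward 5-10 for stronger emotion
            "crop": False,        # enable only for off-center faces
            "fps": 25,
        },
    },
}

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",  # default ComfyUI address
    data=json.dumps({"prompt": prompt}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```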
```bibtex
@article{ki2024float,
  title={FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait},
  author={Ki, Taekyung and Min, Dongchan and Chae, Gyeongsu},
  journal={arXiv preprint arXiv:2412.01064},
  year={2024}
}
```
Thanks to simplepod.ai for providing GPU servers.
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)