This is the code for SAiD: Blendshape-based Audio-Driven Speech Animation with Diffusion.
Run the following command to install it as a pip module:

```bash
pip install .
```

If you are developing this repo or want to run the scripts, run this instead:

```bash
pip install -e .[dev]
```

If there is an error related to pyrender, install the following additional packages:

```bash
apt-get install libboost-dev libglfw3-dev libgles2-mesa-dev freeglut3-dev libosmesa6-dev libgl1-mesa-glx
```
- `data`: Contains data used for preprocessing and training.
- `model`: Contains the weights of the VAE, which is used for the evaluation.
- `blender-addon`: Contains the Blender addon that can visualize the blendshape coefficients.
- `script`: Contains Python scripts for preprocessing, training, inference, and evaluation.
- `static`: Contains the resources for the project page.
You can download the pretrained weights of SAiD from the Hugging Face Repo. Run the inference script as follows:
```bash
python script/inference.py \
    --weights_path "<SAiD_weights>.pth" \
    --audio_path "<input_audio>.wav" \
    --output_path "<output_coeffs>.csv" \
    [--init_sample_path "<input_init_sample>.csv"] \ # Required for editing
    [--mask_path "<input_mask>.csv"] # Required for editing
```
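For reference, here is a minimal sketch of how the generated coefficients could be inspected. It assumes `pandas` is installed and that the output CSV stores one blendshape coefficient per column and one animation frame per row; adapt as needed.

```python
# Minimal sketch for inspecting the generated coefficients.
# Assumption: the CSV holds one blendshape coefficient per column
# (e.g. 'jawForward') and one animation frame per row.
import pandas as pd

coeffs = pd.read_csv("<output_coeffs>.csv")
print(coeffs.shape)          # (num_frames, num_blendshapes)
print(list(coeffs.columns))  # blendshape names
```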
Due to the license issue of VOCASET, we cannot distribute BlendVOCA directly. Instead, you can preprocess `data/blendshape_residuals.pickle` after constructing the `BlendVOCA` directory as follows for simple execution of the script.
```
├─ audio-driven-speech-animation-with-diffusion
│  ├─ ...
│  └─ script
└─ BlendVOCA
   └─ templates
      ├─ ...
      └─ FaceTalk_170915_00223_TA.ply
```
- `templates`: Download the template meshes from VOCASET.
Then, run the following command:

```bash
python script/preprocess_blendvoca.py \
    --blendshapes_out_dir "<output_blendshapes_dir>"
```
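For intuition, here is a rough sketch of the idea behind this preprocessing, under the assumption (documented in the dictionary format below) that `data/blendshape_residuals.pickle` maps each subject to per-blendshape vertex offsets. It uses `trimesh` for mesh I/O and is not the actual script.

```python
# Rough sketch only (not the actual preprocess_blendvoca.py): reconstruct a
# subject's blendshape meshes by adding each per-vertex residual to the
# neutral template mesh. Assumes `trimesh` is installed.
import pickle
import trimesh

with open("data/blendshape_residuals.pickle", "rb") as f:
    residuals = pickle.load(f)  # {subject: {blendshape_name: (V, 3) ndarray}}

subject = "FaceTalk_170915_00223_TA"
template = trimesh.load(f"BlendVOCA/templates/{subject}.ply", process=False)

for name, offset in residuals[subject].items():
    mesh = trimesh.Trimesh(
        vertices=template.vertices + offset,  # neutral + residual
        faces=template.faces,
        process=False,
    )
    mesh.export(f"<output_blendshapes_dir>/{subject}/{name}.obj")
```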
If you want to generate the blendshapes by yourself, follow the instructions below.
- Unzip `data/ARKit_reference_blendshapes.zip`.
- Download the template meshes from VOCASET.
- Crop the template meshes using `data/FLAME_head_idx.txt`. You can crop more indices and then restore them after finishing the construction process.
- Use Deformation-Transfer-for-Triangle-Meshes to construct the blendshape meshes.
  - Use `data/ARKit_landmarks.txt` and `data/FLAME_head_landmarks.txt` as marker vertices.
  - Find the correspondence map between the neutral meshes, and use it to transfer the deformation of arbitrary meshes.
- Create `blendshape_residuals.pickle`, which contains the blendshape residuals in the following Python dictionary format. Refer to `data/blendshape_residuals.pickle`.

  ```python
  {
      'FaceTalk_170731_00024_TA': {
          'jawForward': <np.ndarray object with shape (V, 3)>,
          ...
      },
      ...
  }
  ```
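One way to assemble this file is sketched below. It assumes the constructed blendshape meshes are stored as `<subject>/<blendshape>.obj` alongside a neutral `<subject>.obj` template (the paths are illustrative) and that `trimesh` and `numpy` are available.

```python
# Sketch only: pack per-subject blendshape residuals into the expected
# dictionary format. Residual = blendshape vertices - neutral vertices.
import glob
import os
import pickle

import numpy as np
import trimesh

blendshapes_dir = "BlendVOCA/blendshapes_head"  # illustrative layout
templates_dir = "BlendVOCA/templates_head"

residuals = {}
for subject in sorted(os.listdir(blendshapes_dir)):
    neutral = trimesh.load(os.path.join(templates_dir, f"{subject}.obj"), process=False)
    residuals[subject] = {}
    for path in sorted(glob.glob(os.path.join(blendshapes_dir, subject, "*.obj"))):
        name = os.path.splitext(os.path.basename(path))[0]  # e.g. 'jawForward'
        mesh = trimesh.load(path, process=False)
        residuals[subject][name] = np.asarray(mesh.vertices - neutral.vertices)  # (V, 3)

with open("blendshape_residuals.pickle", "wb") as f:
    pickle.dump(residuals, f)
```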
You can simply unzip `data/blendshape_coeffcients.zip`. If you want to generate the coefficients by yourself, we recommend constructing the `BlendVOCA` directory as follows for simple execution of the script.
```
├─ audio-driven-speech-animation-with-diffusion
│  ├─ ...
│  └─ script
└─ BlendVOCA
   ├─ blendshapes_head
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ noseSneerRight.obj
   ├─ templates_head
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA.obj
   └─ unposedcleaneddata
      ├─ ...
      └─ FaceTalk_170915_00223_TA
         ├─ ...
         └─ sentence40
```
- `blendshapes_head`: Place the constructed blendshape meshes (head).
- `templates_head`: Place the template meshes (head).
- `unposedcleaneddata`: Download the mesh sequences (unposed cleaned data) from VOCASET.
Then, run the following command:

```bash
python script/optimize_blendshape_coeffs.py \
    --blendshapes_coeffs_out_dir "<output_coeffs_dir>"
```
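Conceptually, the coefficient optimization fits the linear blendshape model to each frame of the mesh sequence. Below is a minimal per-frame sketch using plain bounded least squares via `scipy`, with no temporal smoothing; the actual script may use a different objective and solver.

```python
# Conceptual sketch, not the actual optimize_blendshape_coeffs.py:
# for each frame, solve  min_w || B w - (frame - neutral) ||^2  with 0 <= w <= 1,
# where the columns of B are the flattened blendshape residuals.
import numpy as np
from scipy.optimize import lsq_linear

def fit_frame(neutral, blendshape_residuals, frame_vertices):
    """neutral, frame_vertices: (V, 3); blendshape_residuals: list of (V, 3) arrays."""
    B = np.stack([r.reshape(-1) for r in blendshape_residuals], axis=1)  # (3V, K)
    target = (frame_vertices - neutral).reshape(-1)                      # (3V,)
    result = lsq_linear(B, target, bounds=(0.0, 1.0))
    return result.x  # (K,) blendshape coefficients in [0, 1]
```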
After generating the blendshape coefficients, create `coeffs_std.csv`, which contains the standard deviation of each coefficient. Refer to `data/coeffs_std.csv`.

```
jawForward,...
<std_jawForward>,...
```
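A minimal sketch of one way to produce this file, assuming the coefficient CSVs are laid out as `BlendVOCA/blendshape_coeffs/<subject>/<sentence>.csv` (as in the training directory below) and `pandas` is installed:

```python
# Sketch only: compute the per-coefficient standard deviation over all
# generated blendshape coefficient CSVs and write it as a one-row CSV.
import glob

import pandas as pd

paths = glob.glob("BlendVOCA/blendshape_coeffs/*/*.csv")
all_coeffs = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

all_coeffs.std().to_frame().T.to_csv("coeffs_std.csv", index=False)
```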
We recommend constructing the `BlendVOCA` directory as follows for simple execution of the scripts.
```
├─ audio-driven-speech-animation-with-diffusion
│  ├─ ...
│  └─ script
└─ BlendVOCA
   ├─ audio
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ sentence40.wav
   ├─ blendshape_coeffs
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ sentence40.csv
   ├─ blendshapes_head
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ noseSneerRight.obj
   └─ templates_head
      ├─ ...
      └─ FaceTalk_170915_00223_TA.obj
```
- `audio`: Download the audio from VOCASET.
- `blendshape_coeffs`: Place the constructed blendshape coefficients.
- `blendshapes_head`: Place the constructed blendshape meshes (head).
- `templates_head`: Place the template meshes (head).
- Train VAE:

  ```bash
  python script/train_vae.py \
      --output_dir "<output_logs_dir>" \
      [--coeffs_std_path "<coeffs_std>.txt"]
  ```

- Train SAiD:

  ```bash
  python script/train.py \
      --output_dir "<output_logs_dir>"
  ```
- Generate SAiD outputs on the test speech data:

  ```bash
  python script/test_inference.py \
      --weights_path "<SAiD_weights>.pth" \
      --output_dir "<output_coeffs_dir>"
  ```

- Remove the `FaceTalk_170809_00138_TA/sentence32-xx.csv` files from the output directory (a small cleanup sketch is given after this list). The ground-truth data does not contain the motion data of `FaceTalk_170809_00138_TA/sentence32`.

- Evaluate the SAiD outputs: FD, WInD, and Multimodality.

  ```bash
  python script/test_evaluate.py \
      --coeffs_dir "<input_coeffs_dir>" \
      [--vae_weights_path "<VAE_weights>.pth"] \
      [--blendshape_residuals_path "<blendshape_residuals>.pickle"]
  ```

- We have to generate videos to compute the AV offset/confidence. To avoid the memory leak issue of the pyrender module, we use a shell script. After updating `COEFFS_DIR` and `OUTPUT_DIR`, run the script:

  ```bash
  # Fix 1: COEFFS_DIR="<input_coeffs_dir>"
  # Fix 2: OUTPUT_DIR="<output_video_dir>"
  sh script/test_render.sh
  ```

- Use SyncNet to compute the AV offset/confidence.
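As referenced in the cleanup step above, here is a small sketch for removing those outputs, assuming the `-xx` part is a numeric suffix produced by `test_inference.py`:

```python
# Sketch only: delete the FaceTalk_170809_00138_TA/sentence32-xx.csv outputs,
# since the ground truth has no motion data for that sequence.
import glob
import os

for path in glob.glob("<output_coeffs_dir>/FaceTalk_170809_00138_TA/sentence32-*.csv"):
    os.remove(path)
```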
If you use this code as part of any research, please cite the following paper.
```bibtex
@misc{park2023said,
      title={SAiD: Speech-driven Blendshape Facial Animation with Diffusion},
      author={Inkyu Park and Jaewoong Cho},
      year={2023},
      eprint={2401.08655},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```