This is the code for SAiD: Blendshape-based Audio-Driven Speech Animation with Diffusion.
Run the following command to install it as a pip module:

```bash
pip install .
```

If you are developing this repo or want to run the scripts, run this instead:

```bash
pip install -e .[dev]
```

If there is an error related to pyrender, install the following additional packages:

```bash
apt-get install libboost-dev libglfw3-dev libgles2-mesa-dev freeglut3-dev libosmesa6-dev libgl1-mesa-glx
```
- `data`: Contains data used for preprocessing and training.
- `model`: Contains the weights of the VAE, which is used for the evaluation.
- `blender-addon`: Contains the Blender addon that can visualize the blendshape coefficients.
- `script`: Contains Python scripts for preprocessing, training, inference, and evaluation.
- `static`: Contains the resources for the project page.
You can download the pretrained weights of SAiD from the Hugging Face Repo. Run the inference script as follows:
```bash
python script/inference.py \
    --weights_path "<SAiD_weights>.pth" \
    --audio_path "<input_audio>.wav" \
    --output_path "<output_coeffs>.csv" \
    [--init_sample_path "<input_init_sample>.csv"] \ # Required for editing
    [--mask_path "<input_mask>.csv"] # Required for editing
```
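For reference, here is a minimal sketch of how the generated coefficients could be inspected. It assumes `pandas` is installed and that the output CSV stores one blendshape coefficient per column and one animation frame per row; adapt as needed.

```python
# Minimal sketch for inspecting the generated coefficients.
# Assumption: the CSV holds one blendshape coefficient per column
# (e.g. 'jawForward') and one animation frame per row.
import pandas as pd

coeffs = pd.read_csv("<output_coeffs>.csv")
print(coeffs.shape)          # (num_frames, num_blendshapes)
print(list(coeffs.columns))  # blendshape names
```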
Due to the license issue of VOCASET, we cannot distribute BlendVOCA directly. Instead, you can preprocess `data/blendshape_residuals.pickle` after constructing the `BlendVOCA` directory as follows for simple execution of the script.
```
├─ audio-driven-speech-animation-with-diffusion
│  ├─ ...
│  └─ script
└─ BlendVOCA
   └─ templates
      ├─ ...
      └─ FaceTalk_170915_00223_TA.ply
```
- `templates`: Download the template meshes from VOCASET.
Then, run the following command:

```bash
python script/preprocess_blendvoca.py \
    --blendshapes_out_dir "<output_blendshapes_dir>"
```
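For intuition, here is a rough sketch of the idea behind this preprocessing, under the assumption (documented in the dictionary format below) that `data/blendshape_residuals.pickle` maps each subject to per-blendshape vertex offsets. It uses `trimesh` for mesh I/O and is not the actual script.

```python
# Rough sketch only (not the actual preprocess_blendvoca.py): reconstruct a
# subject's blendshape meshes by adding each per-vertex residual to the
# neutral template mesh. Assumes `trimesh` is installed.
import pickle
import trimesh

with open("data/blendshape_residuals.pickle", "rb") as f:
    residuals = pickle.load(f)  # {subject: {blendshape_name: (V, 3) ndarray}}

subject = "FaceTalk_170915_00223_TA"
template = trimesh.load(f"BlendVOCA/templates/{subject}.ply", process=False)

for name, offset in residuals[subject].items():
    mesh = trimesh.Trimesh(
        vertices=template.vertices + offset,  # neutral + residual
        faces=template.faces,
        process=False,
    )
    mesh.export(f"<output_blendshapes_dir>/{subject}/{name}.obj")
```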
If you want to generate the blendshapes by yourself, follow the instructions below.
- Unzip `data/ARKit_reference_blendshapes.zip`.
- Download the template meshes from VOCASET.
- Crop the template meshes using `data/FLAME_head_idx.txt`. You can crop more indices and then restore them after finishing the construction process.
- Use Deformation-Transfer-for-Triangle-Meshes to construct the blendshape meshes.
  - Use `data/ARKit_landmarks.txt` and `data/FLAME_head_landmarks.txt` as marker vertices.
  - Find the correspondence map between the neutral meshes, and use it to transfer the deformation of arbitrary meshes.
- Create `blendshape_residuals.pickle`, which contains the blendshape residuals in the following Python dictionary format. Refer to `data/blendshape_residuals.pickle`.

  ```python
  {
      'FaceTalk_170731_00024_TA': {
          'jawForward': <np.ndarray object with shape (V, 3)>,
          ...
      },
      ...
  }
  ```
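One way to assemble this file is sketched below. It assumes the constructed blendshape meshes are stored as `<subject>/<blendshape>.obj` alongside a neutral `<subject>.obj` template (the paths are illustrative) and that `trimesh` and `numpy` are available.

```python
# Sketch only: pack per-subject blendshape residuals into the expected
# dictionary format. Residual = blendshape vertices - neutral vertices.
import glob
import os
import pickle

import numpy as np
import trimesh

blendshapes_dir = "BlendVOCA/blendshapes_head"  # illustrative layout
templates_dir = "BlendVOCA/templates_head"

residuals = {}
for subject in sorted(os.listdir(blendshapes_dir)):
    neutral = trimesh.load(os.path.join(templates_dir, f"{subject}.obj"), process=False)
    residuals[subject] = {}
    for path in sorted(glob.glob(os.path.join(blendshapes_dir, subject, "*.obj"))):
        name = os.path.splitext(os.path.basename(path))[0]  # e.g. 'jawForward'
        mesh = trimesh.load(path, process=False)
        residuals[subject][name] = np.asarray(mesh.vertices - neutral.vertices)  # (V, 3)

with open("blendshape_residuals.pickle", "wb") as f:
    pickle.dump(residuals, f)
```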
You can simply unzip `data/blendshape_coeffcients.zip`. If you want to generate the coefficients by yourself, we recommend constructing the `BlendVOCA` directory as follows for simple execution of the script.
```
├─ audio-driven-speech-animation-with-diffusion
│  ├─ ...
│  └─ script
└─ BlendVOCA
   ├─ blendshapes_head
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ noseSneerRight.obj
   ├─ templates_head
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA.obj
   └─ unposedcleaneddata
      ├─ ...
      └─ FaceTalk_170915_00223_TA
         ├─ ...
         └─ sentence40
```
- `blendshapes_head`: Place the constructed blendshape meshes (head).
- `templates_head`: Place the template meshes (head).
- `unposedcleaneddata`: Download the mesh sequences (unposed cleaned data) from VOCASET.
Then, run the following command:

```bash
python script/optimize_blendshape_coeffs.py \
    --blendshapes_coeffs_out_dir "<output_coeffs_dir>"
```
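Conceptually, the coefficient optimization fits the linear blendshape model to each frame of the mesh sequence. Below is a minimal per-frame sketch using plain bounded least squares via `scipy`, with no temporal smoothing; the actual script may use a different objective and solver.

```python
# Conceptual sketch, not the actual optimize_blendshape_coeffs.py:
# for each frame, solve  min_w || B w - (frame - neutral) ||^2  with 0 <= w <= 1,
# where the columns of B are the flattened blendshape residuals.
import numpy as np
from scipy.optimize import lsq_linear

def fit_frame(neutral, blendshape_residuals, frame_vertices):
    """neutral, frame_vertices: (V, 3); blendshape_residuals: list of (V, 3) arrays."""
    B = np.stack([r.reshape(-1) for r in blendshape_residuals], axis=1)  # (3V, K)
    target = (frame_vertices - neutral).reshape(-1)                      # (3V,)
    result = lsq_linear(B, target, bounds=(0.0, 1.0))
    return result.x  # (K,) blendshape coefficients in [0, 1]
```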
After generating the blendshape coefficients, create `coeffs_std.csv`, which contains the standard deviation of each coefficient. Refer to `data/coeffs_std.csv`.

```
jawForward,...
<std_jawForward>,...
```
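A minimal sketch of one way to produce this file, assuming the coefficient CSVs are laid out as `BlendVOCA/blendshape_coeffs/<subject>/<sentence>.csv` (as in the training directory below) and `pandas` is installed:

```python
# Sketch only: compute the per-coefficient standard deviation over all
# generated blendshape coefficient CSVs and write it as a one-row CSV.
import glob

import pandas as pd

paths = glob.glob("BlendVOCA/blendshape_coeffs/*/*.csv")
all_coeffs = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

all_coeffs.std().to_frame().T.to_csv("coeffs_std.csv", index=False)
```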
We recommend constructing the `BlendVOCA` directory as follows for simple execution of the scripts.
```
├─ audio-driven-speech-animation-with-diffusion
│  ├─ ...
│  └─ script
└─ BlendVOCA
   ├─ audio
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ sentence40.wav
   ├─ blendshape_coeffs
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ sentence40.csv
   ├─ blendshapes_head
   │  ├─ ...
   │  └─ FaceTalk_170915_00223_TA
   │     ├─ ...
   │     └─ noseSneerRight.obj
   └─ templates_head
      ├─ ...
      └─ FaceTalk_170915_00223_TA.obj
```
- `audio`: Download the audio from VOCASET.
- `blendshape_coeffs`: Place the constructed blendshape coefficients.
- `blendshapes_head`: Place the constructed blendshape meshes (head).
- `templates_head`: Place the template meshes (head).
- Train VAE:

  ```bash
  python script/train_vae.py \
      --output_dir "<output_logs_dir>" \
      [--coeffs_std_path "<coeffs_std>.txt"]
  ```

- Train SAiD:

  ```bash
  python script/train.py \
      --output_dir "<output_logs_dir>"
  ```
- Generate SAiD outputs on the test speech data:

  ```bash
  python script/test_inference.py \
      --weights_path "<SAiD_weights>.pth" \
      --output_dir "<output_coeffs_dir>"
  ```

- Remove the `FaceTalk_170809_00138_TA/sentence32-xx.csv` files from the output directory (a small cleanup sketch is given after this list). The ground-truth data does not contain the motion data of `FaceTalk_170809_00138_TA/sentence32`.

- Evaluate the SAiD outputs: FD, WInD, and Multimodality.

  ```bash
  python script/test_evaluate.py \
      --coeffs_dir "<input_coeffs_dir>" \
      [--vae_weights_path "<VAE_weights>.pth"] \
      [--blendshape_residuals_path "<blendshape_residuals>.pickle"]
  ```

- We have to generate videos to compute the AV offset/confidence. To avoid the memory leak issue of the pyrender module, we use a shell script. After updating `COEFFS_DIR` and `OUTPUT_DIR`, run the script:

  ```bash
  # Fix 1: COEFFS_DIR="<input_coeffs_dir>"
  # Fix 2: OUTPUT_DIR="<output_video_dir>"
  sh script/test_render.sh
  ```

- Use SyncNet to compute the AV offset/confidence.
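As referenced in the cleanup step above, here is a small sketch for removing those outputs, assuming the `-xx` part is a numeric suffix produced by `test_inference.py`:

```python
# Sketch only: delete the FaceTalk_170809_00138_TA/sentence32-xx.csv outputs,
# since the ground truth has no motion data for that sequence.
import glob
import os

for path in glob.glob("<output_coeffs_dir>/FaceTalk_170809_00138_TA/sentence32-*.csv"):
    os.remove(path)
```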
If you use this code as part of any research, please cite the following paper.
```bibtex
@misc{park2023said,
      title={SAiD: Speech-driven Blendshape Facial Animation with Diffusion},
      author={Inkyu Park and Jaewoong Cho},
      year={2023},
      eprint={2401.08655},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```