
Persona_Gestor

Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference
(Accepted by IEEE Transactions on Visualization and Computer Graphics (TVCG))

Abstract

Speech-driven gesture generation is an emerging field within virtual human creation. However, a significant challenge lies in accurately determining and processing the multitude of input features (such as acoustic, semantic, emotional, personality, and even subtle unknown features). Traditional approaches, reliant on various explicit feature inputs and complex multimodal processing, constrain the expressiveness of resulting gestures and limit their applicability.

To address these challenges, we present Persona-Gestor, a novel end-to-end generative model designed to generate highly personalized 3D full-body gestures relying solely on raw speech audio. The model combines a fuzzy feature extractor and a non-autoregressive Adaptive Layer Normalization (AdaLN) transformer diffusion architecture. The fuzzy feature extractor harnesses a fuzzy inference strategy that automatically infers implicit, continuous fuzzy features. These fuzzy features, represented as a unified latent feature, are fed into the AdaLN transformer. The AdaLN transformer introduces a conditional mechanism that applies a uniform function across all tokens, thereby effectively modeling the correlation between the fuzzy features and the gesture sequence. This module ensures a high level of gesture-speech synchronization while preserving naturalness. Finally, we employ the diffusion model to train and infer various gestures. Extensive subjective and objective evaluations on the Trinity, ZEGGS, and BEAT datasets confirm our model's performance is superior to current state-of-the-art approaches.
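The AdaLN conditioning idea described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's code (the names `adaln`, `W_scale`, and `W_shift` are our assumptions): each token is layer-normalized, then modulated by a scale and shift computed once from the conditioning vector and applied uniformly to every token in the sequence.

```python
import numpy as np

def adaln(tokens, cond, W_scale, W_shift, eps=1e-5):
    """Adaptive LayerNorm sketch: normalize each token over its feature
    dimension, then apply a condition-dependent scale and shift that is
    shared by all tokens (the 'uniform function across all tokens')."""
    # per-token layer normalization
    mu = tokens.mean(axis=-1, keepdims=True)
    var = tokens.var(axis=-1, keepdims=True)
    normed = (tokens - mu) / np.sqrt(var + eps)
    # one (scale, shift) pair predicted from the conditioning vector
    scale = cond @ W_scale  # shape (d,)
    shift = cond @ W_shift  # shape (d,)
    return normed * (1.0 + scale) + shift

rng = np.random.default_rng(0)
d, T, c = 8, 5, 16  # feature dim, sequence length, condition dim
tokens = rng.standard_normal((T, d))
cond = rng.standard_normal(c)
out = adaln(tokens, cond, rng.standard_normal((c, d)), rng.standard_normal((c, d)))
```

In the actual model the conditioning vector would be the unified latent fuzzy feature, and the learned projections would be trained jointly with the diffusion transformer.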

Persona-Gestor improves the system's usability and generalization capabilities, setting a new benchmark in speech-driven gesture synthesis and broadening the horizon for virtual human technology.

(image) Each pose depicted is a personalized gesture generated solely from raw speech audio. Persona-Gestor offers a versatile solution that bypasses complex multimodal processing, thereby enhancing user-friendliness.

For clarity, our contributions are summarized as follows:

  1. We pioneer a fuzzy feature inference strategy that enables driving a wider range of personalized gesture synthesis from speech audio alone, removing the need for style labels or extra inputs. This fuzzy feature extractor improves the usability and generalization capabilities of the system. To the best of our knowledge, this is the first approach that uses fuzzy features to generate personalized co-speech gestures.
  2. We combine the AdaLN transformer architecture with the diffusion model to enhance the modeling of the gesture-speech interplay. We demonstrate that this architecture generates gestures that strike an optimal balance between naturalness and speech synchronization.
  3. Extensive subjective and objective evaluations show that our model outperforms current state-of-the-art approaches. These results demonstrate the remarkable capability of our method to generate credible, speech-appropriate, and personalized gestures.

Environment

  • Ubuntu 18.04
  • Python 3.11
  • CUDA 12
  • PyTorch 2.2.2
  • Pytorch-lightning 2.2.3
    Note:
  • We recommend using Anaconda to manage your Python environment.
  • We recommend training on a GPU with ample memory (such as an A100 or V100) for faster training. For inference, we recommend an RTX 3090 or higher.

Installations

We provide a requirements.txt for installation:

git clone https://github.com/zf223669/Persona_Gestor.git
cd Persona_Gestor    # enter the project directory
conda create -n PG python=3.11
conda activate PG
pip install -r requirements.txt

Download the WavLM pre-trained models from:

WavLM-Base+: https://drive.google.com/file/d/1-zlAj2SyVJVsbhifwpTlAfrgc9qu-HDb/view
WavLM-Large: https://drive.google.com/file/d/12-cB34qCTvByWT-QtOcZaqwwO21FLSqU/view
Once downloaded, place the pre-trained models (WavLM-Base+.pt and WavLM-Large.pt) in the "src/utils/wavlm/pretrain-models/" folder.

Note: You can also visit the original WavLM repository for more details: https://github.com/microsoft/unilm/tree/master/wavlm
More WavLM models can be found at: https://huggingface.co/models?other=wavlm&sort=trending&search=WavLM-Base
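Placing the checkpoints can be scripted. The sketch below is a convenience helper under our own assumptions (the function name `install_wavlm_checkpoints` is ours; the file names match the download links above, so adjust them if your browser renamed the files):

```python
import os
import shutil

def install_wavlm_checkpoints(download_dir, repo_root="."):
    """Move downloaded WavLM checkpoints into the folder the code expects
    (src/utils/wavlm/pretrain-models/). Returns the list of files moved."""
    target = os.path.join(repo_root, "src", "utils", "wavlm", "pretrain-models")
    os.makedirs(target, exist_ok=True)
    moved = []
    for ckpt in ("WavLM-Base+.pt", "WavLM-Large.pt"):
        src = os.path.join(download_dir, ckpt)
        if os.path.isfile(src):  # skip any checkpoint you did not download
            shutil.move(src, os.path.join(target, ckpt))
            moved.append(ckpt)
    return moved
```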

Datasets download and preprocessing

1. Download the datasets

We use three datasets for training and evaluation: Trinity, ZEGGS, and BEAT. You can download and preprocess each dataset separately. The expected directory layout is:

data
├── BEAT
│   └── Sources
│      ├── audio
│      └── bvh
├── BEAT_processed_20s (When pre-processing is finished, the processed data is stored here)
│   └── feat_20fps_20s_WithExp_waveform_WithStd (Do not modify the folder name)
├── Trinity 
│   └── Sources
│      ├── audio
│      └── bvh
├── Trinity_processed_20s (When pre-processing is finished, the processed data is stored here)
│   └── feat_20fps_20s_WithExp_waveform_WithStd (Do not modify the folder name)
├── ZEGGS
│   └── Sources
│      ├── audio
│      └── bvh
├── ZEGGS_processed_20s (When pre-processing is finished, the processed data is stored here)
│   └── feat_20fps_20s_WithExp_waveform_WithStd (Do not modify the folder name)

Note: Due to some issues with the ZEGGS and BEAT datasets, we provide fixed and processed data for your convenience. You can download it from the following links:

The fixed and processed datasets can be downloaded from:
Trinity: https://pan.baidu.com/s/1EFxbOsxs4bylBfl9AyvhwA?pwd=1234 (extraction code: 1234)
ZEGGS: https://pan.baidu.com/s/1IVTY6K6JaG_yK-6-hvvOOg?pwd=1234 (extraction code: 1234)
BEAT: https://pan.baidu.com/s/1a9a74MMkhYDWZqDzSoNiBA?pwd=1234 (extraction code: 1234)
Place the extracted files in the corresponding folders.

Note: All processed data and the pre-trained models are also available on Google Drive: https://drive.google.com/drive/folders/1qgdBb2LgpW4hhmtOb3I9o_qysoQ4qVCP?usp=drive_link

2. Preprocessing

We use the preprocessing method proposed by https://github.com/simonalexanderson/StyleGestures, but replace the audio feature extraction with raw audio waveforms. (Each dataset is trained separately; you can preprocess and train whichever one you need.)

2.1 Preprocessing Trinity

First, open the preprocessing config at /configs_diffmotion/data/Trinity/prepare_trinity_dataset.yaml.
Set the two parameters (data_dir and processed_dir) at the top of the file:
  data_dir: the path to the raw data folder
  processed_dir: the path to the processed data folder

For example:

# When running from a terminal:
    data_dir: ./data/Trinity/Sources 
    processed_dir: ./data/Trinity/processed_20s 
# or when running from an IDE:
  data_dir: ../../../../data/Trinity/Sources 
  processed_dir: ../../../../data/Trinity/processed_20s 

Second, run the Trinity preprocessing script:

cd src/data/Data_Processing/Trinity_full_Spine/
python prepare_trinity_datasets.py
# Preprocessing takes about 20 minutes; alternatively, use the preprocessed data we provide.

2.2 Preprocessing ZEGGS

First, open the preprocessing config at /configs_diffmotion/data/EGGS/prepare_EGGS_dataset.yaml.
Set the two parameters (data_dir and processed_dir) at the top of the file:
  data_dir: the path to the raw data folder
  processed_dir: the path to the processed data folder

For example:

# When running from a terminal:
    data_dir: ./data/ZEGGS/Sources 
    processed_dir: ./data/ZEGGS/processed_20s 
# or when running from an IDE:
  data_dir: ../../../../data/ZEGGS/Sources 
  processed_dir: ../../../../data/ZEGGS/processed_20s 

Second, run the ZEGGS preprocessing script:

cd src/data/Data_Processing/ZEGGS/
python prepare_EGGS_datasets.py
# Preprocessing takes about 2 hours; alternatively, use the preprocessed data we provide.

2.3 Preprocessing BEAT

First, open the preprocessing config at /configs_diffmotion/data/BEAT/prepare_BEAT_dataset.yaml.
Set the two parameters (data_dir and processed_dir) at the top of the file:
  data_dir: the path to the raw data folder
  processed_dir: the path to the processed data folder

For example:

# When running from a terminal:
    data_dir: ./data/BEAT/Sources 
    processed_dir: ./data/BEAT/processed_20s 
# or when running from an IDE:
  data_dir: ../../../../data/BEAT/Sources 
  processed_dir: ../../../../data/BEAT/processed_20s

Second, run the BEAT preprocessing script:

cd src/data/Data_Processing/BEAT/
python prepare_BEAT_datasets.py
# Preprocessing takes about 12 hours; alternatively, use the preprocessed data we provide.

3. Inference

3.1 Download the pre-trained models

We provide trained models for inference, which you can download from the following link:

Note: All processed data and the pre-trained models are available on Google Drive: https://drive.google.com/drive/folders/1qgdBb2LgpW4hhmtOb3I9o_qysoQ4qVCP?usp=drive_link

Then place the downloaded models in the "src/models/Trinity/", "src/models/ZEGGS/", and "src/models/BEAT/" folders, respectively.

3.2 Set the config file

We provide config files for each dataset; set the parameters according to your needs. All config files are in /configs_diffmotion/experiment.
Files named xxx_WavLM_based.yaml are for the WavLM-Base+ model, and files named xxx_WavLM_large.yaml are for the WavLM-Large model.

  • Trinity: Trinity_WavLM_based_Generate.yaml or Trinity_WavLM_large_generate.yaml
  • ZEGGS: ZEGGS_WavLM_based_Generate.yaml or ZEGGS_WavLM_large_generate.yaml
  • BEAT: BEAT_WavLM_large_generate.yaml

Open the config file and set the parameters according to your needs, as follows:

ckpt_path: the path to the downloaded pre-trained model
paths:
  data_dir: ./data/xxx/processed_20s  # replace xxx with the dataset name
data:
  dataset: ${paths.data_dir}/feat_20fps_20s_WithExp_waveform_WithStd
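The `${paths.data_dir}` syntax in the config above is dotted-key variable interpolation (Hydra/OmegaConf style). Conceptually it resolves like the toy sketch below, which is ours for illustration only and not the project's actual config loader:

```python
import re

def resolve(cfg, value):
    """Replace each ${dotted.key} in `value` with the string found by
    walking the nested dict `cfg` along the dotted path."""
    def lookup(match):
        node = cfg
        for part in match.group(1).split("."):
            node = node[part]
        return node
    return re.sub(r"\$\{([^}]+)\}", lookup, value)

cfg = {"paths": {"data_dir": "./data/Trinity/processed_20s"}}
dataset = resolve(cfg, "${paths.data_dir}/feat_20fps_20s_WithExp_waveform_WithStd")
```

So the `dataset` entry ends up pointing at the feat_20fps_20s_WithExp_waveform_WithStd folder inside whatever `data_dir` you configured.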

3.3 Run the inference

We provide the inference code for each dataset. You can run the inference code according to your needs.

# enter the root directory of the project, then run (choose "based" or "large"):
python src/diffmotion/diffmotion_trainer/train_diffmotion.py -m experiment=Trinity_WavLM_[based/large]_Generate.yaml

The inference results are saved in the "logs/log_xxx/xxx/multi/DATA_Time" folder, where DATA_Time is the date and time at which inference started.

4. Training

We provide training code for each dataset. You can train the model according to your needs.

# Set the dataset path and processed data path in the config file, as in Section 3.
# Enter the root directory of the project, then run (choose "based" or "large"):
python src/diffmotion/diffmotion_trainer/train_diffmotion.py -m experiment=xxx_Training_WavLM_[based/large].yaml

5. Inference with in-the-wild speech audio

We provide in-the-wild speech audio clips from TED videos, or you can prepare your own (mono channel required).
Note: We recommend using the BEAT pre-trained model for inference.
All audio files are in the "data/TED/audio" folder.

  1. Set data_dir and processed_dir in the config file to specify the audio directory, as in Step 2.

  2. Copy the input_scaler.sav file from the BEAT processed data folder to the "data/TED/audio/" folder.

  3. Run:

python src/data/BEAT_Only_Audio/prepare_BEAT_datasets.py

  4. Copy the output_scaler.sav and data_pipe_20fps.save files from the BEAT processed data folder to the processed data folder.

  5. Run the gesture generation:

python src/diffmotion/diffmotion_trainer/train_diffmotion.py -m experiment=In_the_wild_audio_WavLM_large_generate.yaml
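The file-copying steps above can be scripted. This helper is a sketch under our own assumptions (the function name `stage_beat_scalers` is ours, and the default paths below are examples; point them at your actual BEAT processed data and in-the-wild folders):

```python
import os
import shutil

def stage_beat_scalers(beat_processed, ted_audio_dir, ted_processed_dir):
    """Copy the BEAT scaler/pipeline files into the in-the-wild folders:
    the input scaler next to the raw audio, and the output scaler plus
    motion pipeline next to the processed data."""
    os.makedirs(ted_audio_dir, exist_ok=True)
    os.makedirs(ted_processed_dir, exist_ok=True)
    # the input scaler must sit in the audio directory
    shutil.copy(os.path.join(beat_processed, "input_scaler.sav"), ted_audio_dir)
    # the output scaler and motion pipeline accompany the processed data
    for name in ("output_scaler.sav", "data_pipe_20fps.save"):
        shutil.copy(os.path.join(beat_processed, name), ted_processed_dir)
```

For example, `stage_beat_scalers("data/BEAT_processed_20s/feat_20fps_20s_WithExp_waveform_WithStd", "data/TED/audio", "data/TED/processed_20s")` would cover steps 2 and 4 in one call, assuming those are where your files live.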

References

@ARTICLE{zhang2024PersonalizedGesture,
  author={Zhang, Fan and Wang, Zhaohan and Lyu, Xin and Zhao, Siyuan and Li, Mengjian and Geng, Weidong and Ji, Naye and Du, Hui and Gao, Fuxing and Wu, Hao and Li, Shunman},
  journal={IEEE Transactions on Visualization and Computer Graphics}, 
  title={Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference}, 
  year={2024},
  volume={},
  number={},
  pages={1-16},
  keywords={Feature extraction;Transformers;Art;Synchronization;Semantics;Fuzzy logic;Adaptation models;Speech-driven;Gesture synthesis;Fuzzy inference;AdaLN;Diffusion;Transformer;DiTs},
  doi={10.1109/TVCG.2024.3393236}}
  

Our previous work (diffusion-based and LSTM models for gesture generation) and its project page are available at: https://github.com/zf223669/DiffmotionGG-beta

@inproceedings{zhang2023diffmotion,
  title={DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model},
  author={Zhang, Fan and Ji, Naye and Gao, Fuxing and Li, Yongping},
  booktitle={MultiMedia Modeling: 29th International Conference, MMM 2023, Bergen, Norway, January 9--12, 2023, Proceedings, Part I},
  pages={231--242},
  year={2023},
  organization={Springer}
}

@Article{Zhang2024DiTGesture,
  AUTHOR = {Zhang, Fan and Wang, Zhaohan and Lyu, Xin and Ji, Naye and Zhao, Siyuan and Gao, Fuxing},
  TITLE = {DiT-Gesture: A Speech-Only Approach to Stylized Gesture Generation},
  JOURNAL = {Electronics},
  VOLUME = {13},
  YEAR = {2024},
  NUMBER = {9},
  ARTICLE-NUMBER = {1702},
  URL = {https://www.mdpi.com/2079-9292/13/9/1702},
  ISSN = {2079-9292},
  DOI = {10.3390/electronics13091702}
}
