Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation (EAT)

Yuan Gan · Zongxin Yang · Xihang Yue · Lingyun Sun · Yi Yang


News:

  • 10/03/2024 Released all evaluation code used in our paper; please refer to here for more details.
  • 26/12/2023 Released the A2KP Training code. Thank you for your attention and patience~ 🎉
  • 05/12/2023 Released the LRW test code.
  • 27/10/2023 Released the Emotional Adaptation Training code. Thank you for your patience~ 🎉
  • 17/10/2023 Released the evaluation code for the MEAD test results. For more information, please refer to evaluation_eat.
  • 21/09/2023 Released the preprocessing code. Now, EAT can generate emotional talking-head videos with any portrait and driven video.
  • 07/09/2023 Released the pre-trained weight and inference code.

Environment

If you wish to run only our demo, we recommend trying it out in Colab. Please note that our preprocessing and training code must be executed locally and requires the following environment configuration:

conda/mamba env create -f environment.yml

Note: We recommend using mamba to install dependencies, which is faster than conda.
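For reference, a minimal setup might look like this (a sketch assuming a working conda installation; installing mamba into the base environment is optional):

# Optional: install mamba for faster dependency solving
conda install -n base -c conda-forge mamba
# Build the environment from the provided file, then activate it
mamba env create -f environment.yml
conda activate eat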

Checkpoints && Demo dependencies

In the EAT_code folder, use gdown as below (or download manually) and unzip ckpt, demo, and Utils into their respective folders:

gdown --id 1KK15n2fOdfLECWN5wvX54mVyDt18IZCo && unzip -q ckpt.zip -d ckpt
gdown --id 1MeFGC7ig-vgpDLdhh2vpTIiElrhzZmgT && unzip -q demo.zip -d demo
gdown --id 1HGVzckXh-vYGZEUUKMntY1muIbkbnRcd && unzip -q Utils.zip -d Utils

Demo

First, activate our eat environment:

conda activate eat

Then, run the demo with:

CUDA_VISIBLE_DEVICES=0 python demo.py --root_wav ./demo/video_processed/W015_neu_1_002 --emo hap

  • Parameters:
    • root_wav: Choose from ['obama', 'M003_neu_1_001', 'W015_neu_1_002', 'W009_sad_3_003', 'M030_ang_3_004']. Preprocessed wavs are located in ./demo/video_processed/. The 'obama' wav is approximately 5 minutes long, while the others are much shorter.
    • emo: Choose from ['ang', 'con', 'dis', 'fea', 'hap', 'neu', 'sad', 'sur']. A loop over all eight emotions is sketched below.
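For example, to render the same preprocessed clip once per supported emotion, the demo command above can be wrapped in a loop (a sketch; adjust root_wav and the GPU index to your setup):

# Generate one talking-head video per emotion label
for emo in ang con dis fea hap neu sad sur; do
    CUDA_VISIBLE_DEVICES=0 python demo.py --root_wav ./demo/video_processed/W015_neu_1_002 --emo "$emo"
done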

Note 1: Place your own images in ./demo/imgs/ and run the above command to generate talking-head videos with aligned new portraits. If you prefer not to align your portrait, simply place your cropped image (256x256) in ./demo/imgs_cropped. Due to the background used in the MEAD training set, results tend to be better with a similar background.

Note 2: To test with custom audio, you need to replace video_name/video_name.wav and the DeepSpeech feature video_name/deepfeature32/video_name.npy. The output length depends on the shorter of the audio and the driving poses. Refer to here for more details.

Note 3: The audio used in our work should be sampled at 16,000 Hz and the corresponding video should have a frame rate of 25 fps.
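ffmpeg is not part of this repository, but it offers one way to bring a custom recording into the expected format (a sketch; the filenames are placeholders, and mono audio is our assumption):

# Resample the audio track to a 16 kHz mono WAV
ffmpeg -i my_clip.mp4 -vn -ar 16000 -ac 1 my_clip.wav
# Re-encode the video at 25 fps (without audio) for pose extraction
ffmpeg -i my_clip.mp4 -an -r 25 my_clip_fps25.mp4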

Test MEAD

To reproduce the results of MEAD as reported in our paper, follow these steps:

First, download the additional MEAD test data from mead_data and unzip it into the mead_data directory:

gdown --id 1_6OfvP1B5zApXq7AIQm68PZu1kNyMwUY && unzip -q mead_data.zip -d mead_data

Then, execute the test using the following command:

CUDA_VISIBLE_DEVICES=0 python test_mead.py [--part 0/1/2/3] [--mode 0]

  • Parameters:
    • part: Choose from [0, 1, 2, 3]. These represent the four test parts in the MEAD test data; a sketch for running all four parts in parallel follows this list.
    • mode: Choose from [0, 1], where 0 tests only 100 samples in total and 1 tests all 985 samples.
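If four GPUs are available, the four parts can be run in parallel, one part per GPU (a sketch; adapt the GPU indices to your machine):

# Run the four MEAD test parts concurrently, one per GPU, over all samples
for part in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$part python test_mead.py --part $part --mode 1 &
done
wait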

You can use our evaluation_eat code to evaluate.

Test LRW

To reproduce the results of LRW as reported in our paper, you need to download and extract the LRW test dataset from here. Due to license restrictions, we cannot provide any video data. (The names of the test files can be found here for validation.) After downloading LRW, you will need to preprocess the videos with our preprocessing code, then move and rename the output files as follows:

imgs              --> lrw/lrw_images
latents           --> lrw/lrw_latent
deepfeature32     --> lrw/lrw_df32
poseimg           --> lrw/poseimg
video_fps25/*.wav --> lrw/lrw_wavs/*.wav
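In shell terms, the renaming might look like this (a sketch assuming the preprocessing outputs sit in the current directory and lrw/ is the target root):

# Move the preprocessed LRW outputs into the layout expected by the test script
mkdir -p lrw/lrw_wavs
mv imgs lrw/lrw_images
mv latents lrw/lrw_latent
mv deepfeature32 lrw/lrw_df32
mv poseimg lrw/poseimg
mv video_fps25/*.wav lrw/lrw_wavs/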

Change the dataset path in test_lrw_posedeep_normalize_neutral.py.

Then, execute the following command:

CUDA_VISIBLE_DEVICES=0 python test_lrw_posedeep_normalize_neutral.py --name deepprompt_eam3d_all_final_313 --part [0/1/2/3] --mode 0

or run them concurrently:

bash test_lrw_posedeep_normalize_neutral.sh

The results will be saved in './result_lrw/'.

Preprocessing

If you want to test with your own driving video (which must include audio), place it in the preprocess/video folder. Then execute the preprocessing code:

cd preprocess
python preprocess_video.py

The video will be processed and saved in demo/video_processed. To test it, run:

CUDA_VISIBLE_DEVICES=0 python demo.py --root_wav ./demo/video_processed/[fill in your video name] --emo [fill in emotion name]

The videos should contain only one person. We will crop the input video according to the estimated landmarks of the first frame. Refer to these videos for more details.

Note 1: The preprocessing code has been verified to work correctly with TensorFlow version 1.15.0, which can be installed on Python 3.7. Refer to this issue for more information.
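One way to obtain such an environment for the preprocessing step is a separate Python 3.7 environment with TensorFlow 1.15 (a sketch; the environment name eat_preprocess is a placeholder, tensorflow-gpu 1.15.0 expects CUDA 10.0, and the remaining preprocessing dependencies still need to be installed):

# Separate Python 3.7 environment for the TensorFlow-1.15-based preprocessing
conda create -n eat_preprocess python=3.7 -y
conda activate eat_preprocess
pip install tensorflow-gpu==1.15.0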

Note 2: Extract the bbox for training with preprocess/extract_bbox.py.
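For example (a sketch; check preprocess/extract_bbox.py for any paths or arguments it expects):

# Extract face bounding boxes for the training data
cd preprocess
python extract_bbox.py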

A2KP Training

Data&Ckpt Preparation:

Execution:

  • Run the following command to start training the A2KP transformer with the latent and PCA losses on 4 GPUs:

    python pretrain_a2kp.py --config config/pretrain_a2kp_s1.yaml --device_ids 0,1,2,3 --checkpoint ./ckpt/pretrain_new_274.pth.tar

  • Note: Stop training when the loss converges. We trained for 8 epochs here. Our training log is at ./output/qvt_2 30_10_22_14.59.29/log.txt. Copy and rename the output checkpoint into the ./ckpt folder, for example ckpt/qvt_2_1030_281.pth.tar (see the sketch after this list).

  • Run the following command to start training the A2KP transformer with all losses on 4 GPUs:

    python pretrain_a2kp_img.py --config config/pretrain_a2kp_img_s2.yaml --device_ids 0,1,2,3 --checkpoint ./ckpt/qvt_2_1030_281.pth.tar

  • Note: Stop training when the loss converges. We trained for 24 epochs here. Our training log is at: ./output/qvt_img_pca_sync_4 01_11_22_15.47.54/log.txt
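Between the two stages, the stage-1 checkpoint is copied into ./ckpt under a new name, as mentioned in the first note above. A sketch, where the output directory and checkpoint filename are placeholders for whatever your own run produced:

# Copy the converged stage-1 checkpoint into ./ckpt and rename it for stage 2
cp "./output/qvt_2 [timestamp]/[checkpoint].pth.tar" ./ckpt/qvt_2_1030_281.pth.tar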

Emotional Adaptation Training

Data&Ckpt Preparation:

  • The processed MEAD data used in our paper can be downloaded from Yandex or Baidu. After downloading, concatenate and unzip the files (see the sketch after this list), and update the paths in deepprompt_eam3d_st_tanh_304_3090_all.yaml and frames_dataset_transformer25.py.
  • We have updated environment.yml to cover the training environment. You can install the required packages using pip or mamba, or recreate the eat environment.
  • We have also updated ckpt.zip, which contains the pre-trained checkpoints that can be used directly for the second phase of training.
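The split archive can be reassembled and unpacked roughly like this (a sketch; the part filenames and target directory are placeholders for whatever the Yandex/Baidu download provides):

# Reassemble the split archive and unpack the processed MEAD training data
cat mead_part_* > mead_all.zip
unzip -q mead_all.zip -d [your_mead_data_root]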

Execution:

  • Run the following command to start training in 4 GPUs:

    python -u prompt_st_dp_eam3d.py --config ./config/deepprompt_eam3d_st_tanh_304_3090_all.yaml --device_ids 0,1,2,3 --checkpoint ./ckpt/qvt_img_pca_sync_4_01_11_304.pth.tar

  • Note 1: The batch_size in the config should match the number of GPUs. To compute the sync loss, we train on syncnet_T consecutive frames (5 in our paper) per batch. Each GPU is assigned one batch during training, consuming around 17 GB of VRAM.

  • Note 2: Our checkpoints are saved every half an hour. The results in the paper were obtained using 4 Nvidia 3090 GPUs, training for about 5-6 hours. Please refer to output/deepprompt_eam3d_st_tanh_304_3090_all\ 03_11_22_15.40.38/log.txt for the training logs at that time. The convergence speed of the training loss should be similar to what is shown there.

Evaluation:

  • The checkpoints and logs are saved at ./output/deepprompt_eam3d_st_tanh_304_3090_all [timestamp].

  • Change the data root in test_posedeep_deepprompt_eam3d.py and dirname in test_posedeep_deepprompt_eam3d.sh, then run the following command for batch testing:

    bash test_posedeep_deepprompt_eam3d.sh

  • The results from sample testing (100 samples) are stored in ./result. You can use our evaluation_eat code to evaluate.

Contact

Our code is released under the CC-BY-NC 4.0 license and is intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at ganyuan@zju.edu.cn and yangyics@zju.edu.cn.

Citation

If you find this code helpful for your research, please cite:

@InProceedings{Gan_2023_ICCV,
    author    = {Gan, Yuan and Yang, Zongxin and Yue, Xihang and Sun, Lingyun and Yang, Yi},
    title     = {Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {22634-22645}
}

Acknowledgements

We thank these works for their public code and generous help: EAMM, OSFV (unofficial), AVCT, PC-AVS, Vid2Vid, AD-NeRF, among others.
