This repository contains the source code for the paper Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion, accepted at Interspeech 2023. An overview of the method and the results can be found here.
- Emo-StarGAN, an emotion-preserving deep semi-supervised voice-conversion-based speaker anonymisation method, is proposed.
- Emotion supervision techniques are proposed: (a) direct, using an emotion classifier; (b) indirect, using losses that leverage acoustic features and deep features representing the emotional content of the source and converted samples.
- The indirect techniques can also be used in the absence of emotion labels.
- Experiments demonstrate its generalizability across different accents, genders, emotions and cross-corpus conversions on benchmark datasets.
Samples can be found here.
The demo can be found at `Demo/EmoStarGAN Demo.ipynb`.
- Python >= 3.9
- Install the Python dependencies listed in `requirements.txt`, e.g. with `pip install -r requirements.txt`.
- Before starting the training, specify the number of target speakers in `num_speaker_domains` and other details, such as the training and validation data, in the config file (a quick way to verify the config is sketched after this list).
- Download the VCTK and ESD datasets. The VCTK dataset needs preprocessing, which can be carried out using `Preprocess/getdata.py`. Adjust the dataset paths in the training list `train_list.txt` and the validation list `val_list.txt` present in `Data/`.
- Download and copy the emotion embedding weights to the folder `Utils/emotion_encoder`.
- Download and copy the vocoder weights to the folder `Utils/Vocoder`.
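Before launching a run, it can help to confirm that the config actually contains the speaker count you intend to train with. The snippet below is a minimal sketch: the key name `num_speaker_domains` and the config path come from this README, but the key's nesting inside the YAML is not specified here, so it is searched for recursively.

```python
# Sanity-check sketch: load the training config and locate num_speaker_domains.
# The key name is taken from this README; its position in the YAML tree is an
# assumption, so we search recursively instead of relying on a fixed nesting.
import yaml

def find_key(tree, key):
    """Depth-first search for `key` in nested dicts/lists; returns its value or None."""
    if isinstance(tree, dict):
        if key in tree:
            return tree[key]
        for value in tree.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(tree, list):
        for item in tree:
            found = find_key(item, key)
            if found is not None:
                return found
    return None

with open("Configs/speaker_domain_config.yml") as f:
    config = yaml.safe_load(f)

print("num_speaker_domains =", find_key(config, "num_speaker_domains"))
```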
python train.py --config_path ./Configs/speaker_domain_config.yml
The Emo-StarGAN model weights can be downloaded from here.
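To see what the downloaded checkpoint contains before wiring it into inference, a quick inspection with PyTorch is usually enough. This is a sketch only: the file name `emo_stargan.pth` is a placeholder for whatever the downloaded file is called, and the exact key layout of the checkpoint is not documented here.

```python
# Minimal inspection sketch for the downloaded checkpoint.
# "emo_stargan.pth" is a placeholder name -- use the file you actually downloaded.
import torch

checkpoint = torch.load("emo_stargan.pth", map_location="cpu")

# Checkpoints in StarGANv2-VC-style repositories usually bundle several
# state_dicts (generator, style encoder, etc.); listing the top-level keys
# shows how this particular file is organised.
for key in checkpoint.keys():
    print(key)
```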
When a speaker index in `train_list.txt` or `val_list.txt` is greater than or equal to the number of speakers (the hyperparameter `num_speaker_domains` in `speaker_domain_config.yml`), the following error is encountered:
[train]: 0%| | 0/66 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
Also note that the speaker index starts at 0 (not 1!) in the training and validation lists.
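A small pre-flight check can catch this before the CUDA assertion does. The sketch below assumes the pipe-separated `wav_path|speaker_index` list format used by yl4579/StarGANv2-VC (on which this repository is based) and uses a placeholder value for `num_speaker_domains`; adjust both to your setup.

```python
# Sketch: flag out-of-range speaker labels before training.
# Assumes the StarGANv2-VC list format "wav_path|speaker_index" (pipe-separated);
# adjust the parsing if your lists differ.
num_speaker_domains = 20  # placeholder -- must match speaker_domain_config.yml

for list_file in ("Data/train_list.txt", "Data/val_list.txt"):
    with open(list_file) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            speaker_index = int(line.strip().split("|")[-1])
            # Valid labels are 0 .. num_speaker_domains - 1 (0-based indexing).
            if not 0 <= speaker_index < num_speaker_domains:
                print(f"{list_file}:{line_no}: speaker index {speaker_index} is out of range")
```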
- Our repository is heavily based on this great repo yl4579/StarGANv2-VC
- kan-bayashi/ParallelWaveGAN
- keums/melodyExtraction_JDC