This repo contains the training code for Phoneme-level ASR for Voice Conversion (VC) and TTS (Text-Mel Alignment) used in StarGANv2-VC and StyleTTS.
- Python >= 3.7
- Clone this repository:
git clone https://github.com/yl4579/AuxiliaryASR.git
cd AuxiliaryASR
- Install python requirements:
pip install SoundFile torchaudio torch jiwer pyyaml click matplotlib g2p_en librosa
- Prepare your own dataset and put the train_list.txt and val_list.txt in the Data folder (see Training section for more details).
python train.py --config_path ./Configs/config.yml
Please specify the training and validation data in the config.yml file. The data list format needs to be filename.wav|label|speaker_number; see train_list.txt for an example (a subset of LJSpeech). Note that speaker_number can just be 0 for ASR, but it is useful to set a meaningful number for TTS training (if you need to use this repo for StyleTTS).
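As a rough illustration of the filename.wav|label|speaker_number format, the sketch below parses and writes data-list lines. It is not part of this repo: the helper names parse_list_line and format_list_line and the sample filename are hypothetical, and it assumes the label is the transcript text and contains no "|" character.

```python
def parse_list_line(line):
    """Split one data-list line into (wav_path, label, speaker_number).

    Assumes the format filename.wav|label|speaker_number with exactly
    two '|' separators (hypothetical helper, not part of the repo).
    """
    parts = line.rstrip("\n").split("|")
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 '|'-separated fields, got {len(parts)}: {line!r}"
        )
    wav_path, label, speaker = parts
    # int() raises ValueError if speaker_number is not an integer.
    return wav_path, label, int(speaker)


def format_list_line(wav_path, label, speaker_number=0):
    """Build one data-list line; speaker_number defaults to 0 (enough for ASR)."""
    if "|" in label:
        raise ValueError("label must not contain the '|' separator")
    return f"{wav_path}|{label}|{speaker_number}"


# Round trip on a made-up entry.
line = format_list_line("sample_0001.wav", "hello world")
print(line)
print(parse_list_line(line))
```

Running a check like this over every line of train_list.txt and val_list.txt before training can catch malformed entries early.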
Checkpoints and TensorBoard logs will be saved in log_dir. To speed up training, you may want to make batch_size as large as your GPU memory allows. Note, however, that batch_size = 64 takes around 10 GB of GPU memory.
This repo is set up for English with the g2p_en package, but you can train it on other languages. To train on datasets in a different language, you will need to modify the meldataset.py file (L86-93) to use your own phonemizer. You also need to change the vocabulary file (word_index_dict.txt) and change n_token in config.yml to reflect the number of tokens. A recommended phonemizer for other languages is the phonemizer package.
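The relationship between a new vocabulary and n_token can be sketched as follows. This is only an illustration under assumptions: build_token_index is a hypothetical helper, the special tokens and phoneme sequences are made up, and the sketch does not reproduce the actual layout of word_index_dict.txt, which you should check in the repo before writing your own file.

```python
def build_token_index(phoneme_sequences, specials=()):
    """Map each distinct token to a consecutive integer index.

    phoneme_sequences: iterable of token lists, e.g. the output of your
    own phonemizer (hypothetical input; the repo does not define this).
    specials: optional tokens that receive the first indices.
    """
    index = {}
    for tok in specials:
        index.setdefault(tok, len(index))
    for seq in phoneme_sequences:
        for tok in seq:
            # A token gets the next free index the first time it is seen.
            index.setdefault(tok, len(index))
    return index


# Made-up phonemized transcripts and special tokens, for illustration only.
sequences = [["b", "o", "n"], ["n", "o", "m"]]
token_index = build_token_index(sequences, specials=["<pad>", "<unk>"])

# n_token in config.yml should match the size of the vocabulary.
n_token = len(token_index)
print(token_index)
print("n_token =", n_token)
```

Whatever vocabulary you end up with, the point is that the token count used by the model and the value of n_token in config.yml stay in sync.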
The author would like to thank @tosaka-m for his great repository and valuable discussions.