This is an implementation for training the HiFi-GAN decoder part of the XTTSv2 model using Coqui/TTS.
In this repo, I'm using the LJSpeech dataset for experimentation, but you can easily swap in a different dataset as long as it follows the same format as LJSpeech. Make sure your dataset includes both audio and transcripts.
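For reference, LJSpeech stores its audio under wavs/ and lists every clip in metadata.csv, one pipe-delimited line per utterance (file ID, raw transcript, normalized transcript):

LJ001-0001|raw transcript|normalized transcript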
Another note: this repository currently supports training both the HiFi-GAN decoder and the speaker encoder. If you find this useful, please give it a star. Thank you!
- Download the LJSpeech dataset:
bash download_ljspeech.sh
- Download the XTTSv2 checkpoint:
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/coqui/XTTS-v2
Note that you can also use an XTTSv2 checkpoint whose GPT part you have already fine-tuned on your own data.
- Install Coqui/TTS from source
git clone https://github.com/coqui-ai/TTS
cd TTS
pip install -e .[all,dev,notebooks] # Select the relevant extras
- Install requirements
pip install -r requirements.txt
Instead of using mel-spectrograms like a conventional HiFi-GAN, the XTTSv2 decoder converts GPT latents into waveforms.
Run the script below to generate GPT latents into the "Ljspeech_latents" folder. You can customize the output folder in generate_latents.py.
python generate_latents.py
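If you want to see how latents can be extracted with the public Coqui/TTS API, here is a minimal sketch (not the repo's generate_latents.py itself) that loads the XTTS-v2 checkpoint and computes the conditioning latent and speaker embedding for one reference clip. The paths are assumptions based on the clone commands above; the actual script additionally saves the per-utterance GPT latents that the decoder is trained on.

```python
# Minimal sketch: load the XTTS-v2 checkpoint and extract conditioning latents
# with the public Coqui/TTS API. Paths are assumptions; the repo's
# generate_latents.py additionally stores the GPT latents for each utterance.
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("XTTS-v2/config.json")  # from the git-lfs clone above
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="XTTS-v2/", eval=True)
model.cuda()

with torch.no_grad():
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=["LJSpeech-1.1/wavs/LJ001-0001.wav"]
    )
print(gpt_cond_latent.shape, speaker_embedding.shape)
```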
After generating the GPT latents, use them to train the model:
python train.py
Since there is no pre-trained discriminator available and only the generator weights can be loaded, the generated audio may be a bit noisy in the early epochs. Be patient; the results improve in later epochs.
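For intuition, the training data boils down to (GPT-latent, waveform) pairs. A minimal PyTorch dataset for such pairs could look like the sketch below; the folder layout and ".pt" naming are assumptions for illustration, and the repo's own loader in train.py may differ.

```python
# Illustrative sketch only: pair saved GPT latents with their waveforms.
# The "Ljspeech_latents/<id>.pt" naming is an assumption, not the repo's exact layout.
import os
import torch
import torchaudio
from torch.utils.data import Dataset

class LatentWavDataset(Dataset):
    def __init__(self, latent_dir, wav_dir):
        self.latent_dir = latent_dir
        self.wav_dir = wav_dir
        self.ids = [f[:-3] for f in os.listdir(latent_dir) if f.endswith(".pt")]

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        utt_id = self.ids[idx]
        # GPT latent tensor saved by the latent-generation step
        latent = torch.load(os.path.join(self.latent_dir, utt_id + ".pt"))
        # Matching ground-truth waveform for the HiFi-GAN target
        wav, sr = torchaudio.load(os.path.join(self.wav_dir, utt_id + ".wav"))
        return latent, wav

dataset = LatentWavDataset("Ljspeech_latents", "LJSpeech-1.1/wavs")
```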
On multi-GPU systems, the following command enables multi-GPU training:
CUDA_VISIBLE_DEVICES="0,1" python -m trainer.distribute --script train.py
You can generate audio with the new HiFi-GAN decoder:
python test.py
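As a rough illustration of what test.py does, the sketch below loads the base XTTS-v2 checkpoint, swaps in fine-tuned HiFi-GAN decoder weights, and synthesizes a sentence. The decoder checkpoint path and state-dict layout are assumptions; check test.py for the exact names this repo uses.

```python
# Rough sketch, assuming the trained decoder weights are saved as a plain
# state dict at "output_path/best_model.pth"; see test.py for the exact logic.
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("XTTS-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="XTTS-v2/", eval=True)
model.cuda()

# Hypothetical path/key names: overwrite the decoder with the fine-tuned weights.
decoder_state = torch.load("output_path/best_model.pth", map_location="cpu")
model.hifigan_decoder.load_state_dict(decoder_state, strict=False)

gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["LJSpeech-1.1/wavs/LJ001-0001.wav"]
)
out = model.inference(
    "Hello, this is a test of the fine-tuned decoder.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)
# XTTSv2 outputs 24 kHz audio
torchaudio.save("test_output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
```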
Use TensorBoard to monitor training progress:
tensorboard --logdir [output_path]