Torch implementation of Whisper-guided DDPM-based Voice Conversion
- DiffWave: A Versatile Diffusion Model for Audio Synthesis, Zhifeng Kong et al., 2020. [arXiv:2009.09761]
- Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data, Sungwon Kim et al., 2022. [arXiv:2205.15370]
- Variational Diffusion Models, Kingma et al., 2021. [arXiv:2107.00630]
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision, Radford et al., 2022. [openai:whisper]
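To make the guidance idea in the references above concrete, here is a minimal sketch of one classifier-guided DDPM reverse step in PyTorch. This is an illustration of the general technique (noise-estimate shifting as in classifier guidance, with the classifier gradient supplied by a Whisper-style recognizer), not the repo's actual implementation; `eps_model`, `classifier_grad`, and the schedule tensors are hypothetical placeholders.

```python
import torch

def guided_reverse_step(xt, t, eps_model, classifier_grad,
                        alpha, alpha_bar, guidance_scale=1.0):
    """One DDPM reverse step with classifier guidance (illustrative sketch).

    eps_model:       predicts the noise eps(x_t, t)
    classifier_grad: returns grad_x log p(y | x_t), e.g. from a frozen
                     Whisper-style recognizer (hypothetical here)
    alpha, alpha_bar: per-step and cumulative noise-schedule scalars
    """
    eps = eps_model(xt, t)
    # Shift the noise estimate along the classifier gradient so sampling
    # is steered toward the conditioning signal.
    eps = eps - guidance_scale * torch.sqrt(1 - alpha_bar) * classifier_grad(xt)
    # Posterior mean of x_{t-1} given the guided noise estimate.
    mean = (xt - (1 - alpha) / torch.sqrt(1 - alpha_bar) * eps) / torch.sqrt(alpha)
    if t > 0:
        # Add noise on all but the final step.
        mean = mean + torch.sqrt(1 - alpha) * torch.randn_like(xt)
    return mean
```

With `guidance_scale=0` this reduces to an unconditional DDPM step, which is a convenient sanity check.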
Tested in a Python 3.7.9 conda environment.
Download the LibriTTS dataset from OpenSLR.
To train the model, run train.py:
python train.py \
--data-dir /datasets/LibriTTS/train-clean-360
To resume training from a previous checkpoint, pass --load-epoch:
python train.py \
--data-dir /datasets/LibriTTS/train-clean-360 \
--load-epoch 20 \
--config ./ckpt/t1.json
Checkpoints are written to TrainConfig.ckpt and TensorBoard summaries to TrainConfig.log.
tensorboard --logdir ./log
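The resume logic behind --load-epoch can be sketched with standard PyTorch checkpointing. This is a generic illustration, not the repo's train.py; the checkpoint filename and dictionary keys are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in model and optimizer (hypothetical; train.py defines its own).
model = nn.Linear(4, 4)
optim = torch.optim.Adam(model.parameters())

# Save a checkpoint at the end of epoch 20.
torch.save({'epoch': 20,
            'model': model.state_dict(),
            'optim': optim.state_dict()}, 't1_20.ckpt')

# Later: restore model and optimizer state, then continue training.
ckpt = torch.load('t1_20.ckpt', map_location='cpu')
model.load_state_dict(ckpt['model'])
optim.load_state_dict(ckpt['optim'])
start_epoch = ckpt['epoch'] + 1  # resume from the next epoch
```

Restoring the optimizer state alongside the model weights keeps Adam's moment estimates intact, so the resumed run behaves like an uninterrupted one.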
To run inference, use inference.py.
[TODO] Pretrained checkpoints are released on the releases page.
To use a pretrained model, download the files and unzip them. The following is a sample script:
import torch

from wgvc import WhisperGuidedVC

# load the checkpoint and restore the model
ckpt = torch.load('t1_200.ckpt', map_location='cpu')
wgvc = WhisperGuidedVC.load(ckpt)
wgvc.eval()