PyTorch implementation of GAN-based text-to-speech synthesis and voice conversion (VC)
Switch branches/tags
Nothing to show
Clone or download
r9y9 Merge pull request #24 from yamachu/fix/affect-global-fs
Use global fs to save wavfile correctly
Latest commit e661d18 Dec 11, 2018


Build Status PyPI DOI

PyTorch implementation of Generative adversarial Networks (GAN) based text-to-speech (TTS) and voice conversion (VC).

  1. Saito, Yuki, Shinnosuke Takamichi, and Hiroshi Saruwatari. "Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks." IEEE/ACM Transactions on Audio, Speech, and Language Processing (2017).
  2. Shan Yang, Lei Xie, Xiao Chen, Xiaoyan Lou, Xuan Zhu, Dongyan Huang, Haizhou Li, " Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under A Multi-task Learning Framework", arXiv:1707.01670, Jul 2017.

Generated audio samples

Audio samples are available in the Jupyter notebooks at the link below:

Notes on hyper parameters

  • adversarial_streams, which represents streams (mgc, lf0, vuv, bap) to be used to compute adversarial loss, is a very speech quality sensitive parameter. Computing adversarial loss on mgc features (except for first few dimensions) seems to be working good.
  • If mask_nth_mgc_for_adv_loss > 0, first mask_nth_mgc_for_adv_loss dimension for mgc will be ignored for computing adversarial loss. As described in saito2017asja, I confirmed that using 0-th (and 1-th) mgc for computing adversarial loss affects speech quality. From my experience, mask_nth_mgc_for_adv_loss = 1 for mgc order 25, mask_nth_mgc_for_adv_loss = 2 for mgc order 59 are working to me.
  • F0 extracted by WORLD will be spline interpolated. Set f0_interpolation_kind to "slinear" if you want frist-order spline interpolation, which is same as Merlin's default.
  • Set use_harvest to True if you want to use Harvest F0 estimation algorithm. If False, Dio and StoneMask are used to estimate/refine F0.
  • If you see cuda runtime error (2) : out of memory, try smaller batch size.

Notes on [2]

Though I haven't got improvements over Saito's approach [1] yet, but the GAN-based models described in [2] should be achieved by the following configurations:

  • Set generator_add_noise to True. This will enable generator to use Gaussian noise as input. Linguistic features are concatenated with the noise vector.
  • Set discriminator_linguistic_condition to True. The discriminator uses linguistic features as condition.



Please install PyTorch, TensorFlow and SRU (if needed) first. Once you have those, then

git clone --recursive & cd gantts
pip install -e ".[train]"

should install all other dependencies.

Repository structure

  • gantts/: Network definitions, utilities for working on sequence-loss optimization.
  • Acoustic feature extraction script for voice conversion.
  • Linguistic/duration/acoustic feature extraction script for TTS.
  • GAN-based training script. This is written to be generic so that can be used for training voice conversion models as well as text-to-speech models (duration/acoustic).
  • Adversarial training wrapper script for
  • Hyper parameters for VC and TTS experiments.
  • Evaluation script for VC.
  • Evaluation script for TTS.

Feature extraction scripts are written for CMU ARCTIC dataset, but can be easily adapted for other datasets.

Run demos

Voice conversion (en) is a clb to clt voice conversion demo script. Before running the script, please download wav files for clb and slt from CMU ARCTIC and check that you have all data in a directory as follows:

> tree ~/data/cmu_arctic/ -d -L 1
├── cmu_us_awb_arctic
├── cmu_us_bdl_arctic
├── cmu_us_clb_arctic
├── cmu_us_jmk_arctic
├── cmu_us_ksp_arctic
├── cmu_us_rms_arctic
└── cmu_us_slt_arctic

Once you have downloaded datasets, then:

./ ${experimental_id} ${your_cmu_arctic_data_root}


 ./ vc_gan_test ~/data/cmu_arctic/

Model checkpoints will be saved at ./checkpoints/${experimental_id} and audio samples are saved at ./generated/${experimental_id}.

Text-to-speech synthesis (en) is a self-contained TTS demo script. The usage is:

./ ${experimental_id}

This will download slt_arctic_full_data used in Merlin's demo, perform feature extraction, train models and synthesize audio samples for eval/test set. ${experimenta_id} can be arbitrary string, for example,

./ tts_test

Model checkpoints will be saved at ./checkpoints/${experimental_id} and audio samples are saved at ./generated/${experimental_id}.

Hyper paramters


Monitoring training progress

tensorboard --logdir=log



The repository doesn't try to reproduce same results reported in their papers because 1) data is not publically available and 2). hyper parameters are highly depends on data. Instead, I tried same ideas on different data with different hyper parameters.