
Multi-speaker TTS with ESPnet mel-spectrograms #209

Open
migi-gon opened this issue Nov 20, 2020 · 0 comments

Hello!

I have been following the system described in this paper by Y. Jia et al.: Link. So far, I have finished training the synthesizer module using the ESPnet Tacotron 2 multi-speaker TTS scripts provided here: Link. The trained model produces intelligible, albeit robotic, speech when vocoded with Griffin-Lim.
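
For reference, the Griffin-Lim reconstruction from a predicted mel-spectrogram looks roughly like the sketch below (a minimal librosa example; the sampling rate, FFT size, hop length, mel dimension, and file name are placeholders, not the exact hparams):

```python
import numpy as np
import librosa
import soundfile as sf

# Placeholder feature parameters -- the real values come from the ESPnet config.
SR = 22050        # sampling rate
N_FFT = 1024      # FFT size
HOP_LENGTH = 256  # hop length in samples

# Assumed layout: (frames, n_mels); transpose to (n_mels, frames) for librosa.
mel = np.load("synthesized_mel.npy").T

# If the dumped features are log-scale (as is typical for ESPnet),
# exponentiate them back to linear scale before inversion.
mel = np.exp(mel)

# Invert the mel filterbank and estimate the phase with Griffin-Lim.
wav = librosa.feature.inverse.mel_to_audio(
    mel, sr=SR, n_fft=N_FFT, hop_length=HOP_LENGTH, n_iter=60
)
sf.write("griffin_lim_output.wav", wav, SR)
```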

Now, to improve the synthesized outputs, I decided to train a WaveNet vocoder on the synthesized mel-spectrograms (the mel-spectrograms produced for the training set), as described in the paper. I trained the model for 1000k steps, but the output was garbled speech. I then extended the training (without changing the hparams) to 1600k steps, but there was still no improvement. Sample synthesized audio files (and the hparams file) can be found here: Link.
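
For context, the vocoder training data are just these dumped mel-spectrogram files. A quick way to inspect one before feeding it to the vocoder looks roughly like this (the file path and the (frames, n_mels) layout are assumptions, not my exact setup):

```python
import numpy as np

# Hypothetical path to one synthesized mel-spectrogram dumped for the train set.
feat = np.load("dump/train/feats/utt0001.npy")  # assumed shape: (frames, n_mels)

# Basic statistics, for comparing against the feature range the vocoder's
# hparams expect (e.g. log-mel vs. normalized mel).
print("shape:", feat.shape)
print("min / max:", feat.min(), feat.max())
print("mean / std:", feat.mean(), feat.std())
```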

Any help or insights on how I could continue would be very much appreciated. Thanks!
