
😋 Text To Speech (TTS)

Check the CHANGELOG file for a global overview of the latest modifications ! 😋

Project structure

├── custom_architectures
│   ├── tacotron2_arch.py
│   └── waveglow_arch.py
├── custom_layers
├── custom_train_objects
│   ├── losses
│   │   └── tacotron_loss.py    : custom Tacotron2 loss
├── datasets
├── example_outputs         : some pre-computed audios to show you an example
├── hparams
├── loggers
├── models
│   ├── siamese             : the `AudioSiamese` is used as encoder for the SV2TTS model
│   ├── tts
│   │   ├── sv2tts_tacotron2.py : SV2TTS main class
│   │   ├── tacotron2.py        : Tacotron2 main class
│   │   ├── vocoder.py          : main functions for complete inference
│   │   └── waveglow.py         : WaveGlow main class (both pytorch and tensorflow)
├── pretrained_models
├── unitest
├── utils
├── example_fine_tuning.ipynb
├── example_sv2tts.ipynb
├── example_tacotron2.ipynb
├── example_waveglow.ipynb
└── text_to_speech.ipynb

Check the main project for more information about the unextended modules / structure / main classes.

* Check my Siamese Networks project for more information about the models/siamese module

Available features

  • Text-To-Speech (module models.tts) :
| Feature | Function / class | Description |
| --- | --- | --- |
| Text-To-Speech | tts | perform TTS on the text you want, with the model you want |
| stream | tts_stream | perform TTS on the text you enter |
| TTS logger | loggers.TTSLogger | converts logging messages to speech and plays them |

You can check the text_to_speech notebook for a concrete demonstration
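For reference, here is a minimal usage sketch ; the `model` keyword name and the exact signature of `tts` are assumptions, so refer to the notebook for the up-to-date API :

```python
# Minimal usage sketch : the `model` keyword name is an assumption, the exact
# signature of `tts` may differ (see the text_to_speech notebook)
from models.tts import tts

# Synthesize a sentence with one of the pretrained models listed below
tts('Bonjour tout le monde !', model = 'sv2tts_siwis')
```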

Available models

Model architectures

Available architectures :

  • Synthesizer :
    • Tacotron2 with extensions for multi-speaker (by ID or SV2TTS)
    • SV2TTS : an extension of the Tacotron2 architecture for multi-speaker synthesis, based on speaker embeddings*
  • Vocoder :
    • WaveGlow : NVIDIA's vocoder, converting mel-spectrograms into audio

* Some speaker embeddings are created with the Siamese Networks approach, which differs from the original paper. Check the Siamese Networks project for more information on this architecture. More recent models use GE2E-loss based encoders (as in the original paper), but with a CNN architecture (instead of the 3-layer LSTM), as it is faster to train.

My SV2TTS models are fine-tuned from pretrained Tacotron2 models using the partial transfer learning procedure (see below for details), which considerably speeds up training.

Model weights

| Name | Language | Dataset | Synthesizer | Vocoder | Speaker Encoder | Trainer | Weights |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pretrained_tacotron2 | en | LJSpeech | Tacotron2 | WaveGlow | / | NVIDIA | Google Drive |
| tacotron2_siwis | fr | SIWIS | Tacotron2 | WaveGlow | / | me | Google Drive |
| sv2tts_tacotron2_256 | fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
| sv2tts_siwis | fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
| sv2tts_tacotron2_256_v2 | fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
| sv2tts_siwis_v2 | fr | SIWIS | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |

You can download the tensorflow version of WaveGlow at this link

Models must be unzipped in the pretrained_models/ directory !

Important Note : the NVIDIA models available on torch hub require a compatible GPU with a correct pytorch configuration. This is why I have released pre-converted tensorflow models (both Tacotron2 and WaveGlow), in case you do not want to configure pytorch ! 😄

The sv2tts_siwis model is a fine-tuned version of sv2tts_tacotron2_256 on the SIWIS (single-speaker) dataset. Fine-tuning a multi-speaker model on a single-speaker dataset tends to improve stability and to produce a voice with more intonation, compared to simply training a single-speaker model.

Usage and demonstration

Demonstration

A Google Colab demo is available at this link !

You can also find some generated audio samples in example_outputs/, or directly in the Colab notebook ;)

Installation and usage

  1. Clone this repository : git clone https://github.com/yui-mhcp/text_to_speech.git
  2. Go to the root of this repository : cd text_to_speech
  3. Install requirements : pip install -r requirements.txt
  4. Open the text_to_speech notebook and follow the instructions !

You also have to install ffmpeg for audio loading / saving.

TO-DO list :

  • Make the TO-DO list
  • Comment the code
  • Add pretrained weights for French
  • Make a Google Colab demonstration
  • Implement WaveGlow in tensorflow 2.x
  • Add batch_size support for vocoder inference
  • Add pretrained SV2TTS weights
  • Add a similarity loss to test a new training procedure for single-speaker fine-tuning
  • Add document parsing to perform TTS on document (in progress)
  • Add new languages support
  • Add new TTS architectures / models
  • Train a SV2TTS model based on an encoder trained with the GE2E loss
  • Add (experimental) support for long text inference
  • Add support for streaming inference

Multi-speaker Text-To-Speech

There are 2 main ways to enable multi-speaker synthesis in the Tacotron2 architecture :

  1. Use a speaker-id, embed it with an Embedding layer and concat / add it to the Encoder output
  2. Use a Speaker Encoder (SE) to embed audio from speakers and concat / add this embedding to the encoder output

I have not tested the 1st approach, but it is available in my implementation.
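To illustrate the 1st approach, here is a hedged sketch (not the repository's actual code) where the speaker id is embedded and added to each frame of the encoder output :

```python
import tensorflow as tf

def add_speaker_id(encoder_output, speaker_id, embedding_layer):
    """ Illustrative sketch of the speaker-id approach (not the repository's code)
        encoder_output  : (batch, text_length, encoder_dim)
        speaker_id      : (batch, ) integer speaker ids
        embedding_layer : a tf.keras.layers.Embedding(n_speakers, encoder_dim),
                          created once when building the model
    """
    spk_emb = embedding_layer(speaker_id)                      # (batch, encoder_dim)
    # Broadcast the speaker embedding over the time axis and add it to each frame
    return encoder_output + tf.expand_dims(spk_emb, axis = 1)
```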

Automatic voice cloning with the SV2TTS architecture

Note : in the next paragraphs, encoder refers to the Tacotron Encoder part while SE refers to a speaker encoder model (detailed below)

The basic intuition

The Speaker Encoder Text-To-Speech approach comes from the From Speaker Verification To Text-To-Speech (SV2TTS) paper, which shows how to use a Speaker Verification model to embed audio samples, and use these embeddings as additional input for a Tacotron2 model

The idea is the following :

  1. Train a model to identify speakers based on their audio : the speaker verification model. This model takes as input an audio sample (5-10 sec) from a speaker, embeds it, and compares it to reference embeddings to decide whether the speakers are the same or not
  2. The speaker encoder model is used to produce embeddings of the speaker to clone
  3. The Tacotron Encoder part performs a classical text encoding
  4. The speaker embedding (a 1D vector) is concatenated to each frame of the encoder output*
  5. The Tacotron Decoder part performs a classical forward pass

The idea is that the Decoder learns to use the speaker embedding to copy the speaker's prosody / intonation / ... and read the text with that speaker's voice : it works quite well !

* The embedding is a 1D vector while the encoder output is a matrix with shape (text_length, encoder_embedding_dim). The idea is to concatenate the embedding to each frame by repeating it text_length times
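To make this repeat-and-concatenate step concrete, here is a short TensorFlow sketch ; it is an illustration under the shape assumptions above, not the exact repository code :

```python
import tensorflow as tf

def concat_speaker_embedding(encoder_output, speaker_embedding):
    """ Illustrative sketch of step 4 (not the exact repository code)
        encoder_output    : (batch, text_length, encoder_embedding_dim)
        speaker_embedding : (batch, embedding_dim), produced by the Speaker Encoder
    """
    text_length = tf.shape(encoder_output)[1]
    # Repeat the 1D embedding `text_length` times : (batch, text_length, embedding_dim)
    tiled = tf.repeat(tf.expand_dims(speaker_embedding, axis = 1), text_length, axis = 1)
    # Concatenate on the feature axis : (batch, text_length, encoder_embedding_dim + embedding_dim)
    return tf.concat([encoder_output, tiled], axis = -1)
```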

Problems and solutions

There are some issues with the above approach :

  • A perfect generalization to new speakers is really hard, as it requires datasets with many speakers (more than 1k), which is really rare in Text-To-Speech datasets
  • The audio should be of good quality, to avoid creating noise in the output voices
  • The Speaker Encoder must be good enough to properly separate speakers
  • The Speaker Encoder must embed speakers in a relevant way, so that the Tacotron model can extract useful information about the speaker's prosody

For the 1st issue, there is no real solution, except combining different datasets, as done in the example notebooks, with the CommonVoice, VoxForge and SIWIS datasets

Another solution is to train a lower quality model (i.e. trained on a large amount of noisy data), and fine-tune it with a small amount of good quality data from a particular speaker. The big advantage of this approach is that you can train a new model really fast, with less than 20 min of annotated audio from the speaker (which is impossible with a classical single-speaker model training).

For the second point, make sure to use good quality audio : my experiments have shown that with the original datasets (which contain quite poor quality data), the model never converges

However there exists a solution : preprocessing ! The utils/audio module contains many powerful preprocessing functions for noise reduction (using the noisereduce library) and audio silence trimming (which is really important for the model)
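As an illustration of such preprocessing (the utils/audio module provides its own, more complete functions), a minimal sketch with the noisereduce and librosa libraries could look like this ; the exact keyword arguments depend on the library versions :

```python
import librosa
import noisereduce as nr

def preprocess_audio(filename, rate = 22050, top_db = 30):
    """ Minimal preprocessing sketch : noise reduction + silence trimming """
    audio, _ = librosa.load(filename, sr = rate)              # load and resample
    audio    = nr.reduce_noise(y = audio, sr = rate)          # reduce background noise
    audio, _ = librosa.effects.trim(audio, top_db = top_db)   # trim leading / trailing silence
    return audio
```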

For the last 2 points, read the next section on the Speaker Encoder

The Speaker Encoder (SE)

The SE part should be able to differentiate speakers, and embed them (i.e. encode them as 1-D vectors) in a meaningful way.

The model used in the paper is a 3-layer LSTM with a normalization layer, trained with the GE2E loss. The problem is that training this model is really slow : it took 2 weeks on 4 GPUs in CorentinJ's master thesis (cf. his github)

This was not feasible for me (because I do not have 4 GPUs 😄), so I tested something else : using the AudioSiamese model ! The objective of this model is to create speaker embeddings by minimizing the distance between embeddings of the same speaker, which is equivalent to the GE2E training objective !
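For intuition, speaker verification with such an encoder boils down to comparing embedding distances ; here is a hedged sketch, assuming a generic `encoder` callable rather than the exact AudioSiamese API :

```python
import numpy as np

def is_same_speaker(encoder, audio_1, audio_2, threshold = 0.75):
    """ Hedged sketch : `encoder` is any model mapping an audio sample (or its
        mel-spectrogram) to a 1-D embedding ; the threshold value is arbitrary """
    e1 = np.asarray(encoder(audio_1)).reshape(-1)
    e2 = np.asarray(encoder(audio_2)).reshape(-1)
    # Cosine similarity between the L2-normalized embeddings
    e1, e2 = e1 / np.linalg.norm(e1), e2 / np.linalg.norm(e2)
    return float(np.dot(e1, e2)) >= threshold
```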

Experiments have shown 2 interesting results :

  1. An AudioSiamese trained on raw audio is quite good at speaker verification, but its embeddings are not meaningful for Tacotron, so the results were quite poor
  2. An AudioSiamese trained on mel-spectrograms (same parameters as the Tacotron mel function) is just as good at speaker verification, and seems to extract more meaningful information !

The big advantage is that you can train your Speaker Encoder and use it in less than one training night, which is crazy : 1 night on a single GPU instead of 2 weeks on 4 GPUs !

Furthermore, in a visual comparison of the embeddings produced by the 3-layer LSTM encoder and by my Siamese Network encoder, the two seem quite similar

The partial Transfer Learning procedure

In order to avoid training a SV2TTS model from scratch, which would be completely impossible on a single GPU, I implemented a partial transfer learning procedure

The idea is quite simple : perform transfer learning between models that have the same number of layers but different layer shapes*. This allowed me to use my single-speaker pretrained model as a base for the SV2TTS model ! Experiments showed that it works pretty well : the model has to learn new neurons specific to voice cloning, but can reuse its pretrained neurons for speaking, which is quite funny !
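A hedged sketch of the idea (not the repository's actual implementation) : for each pair of layers, copy the overlapping slice of the pretrained weight and keep the remaining values at their random initialization :

```python
import numpy as np

def partial_transfer(new_model, pretrained_model):
    """ Hedged sketch of partial transfer learning between two Keras models with the
        same number of layers (not the repository's actual implementation) """
    for new_layer, old_layer in zip(new_model.layers, pretrained_model.layers):
        new_weights = []
        for w_new, w_old in zip(new_layer.get_weights(), old_layer.get_weights()):
            w = w_new.copy()
            # Copy the overlapping part of the pretrained weight, keep the rest as initialized
            slices = tuple(slice(0, min(a, b)) for a, b in zip(w_new.shape, w_old.shape))
            w[slices] = w_old[slices]
            new_weights.append(w)
        new_layer.set_weights(new_weights)
```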

Some ideas that showed some benefits (especially for single-speaker fine-tuning) :

  • After some epochs (2-5), the Postnet part can be set as non-trainable : this part basically improves mel quality and is not speaker-specific, so there is no need to train it too much
  • After some epochs (5-10), the Tacotron Encoder part can be set as non-trainable (only if your pretrained model was for the same language) : text encoding is not speaker-specific, so there is no need to train it too much

The idea behind these tricks is that the only speaker-specific part is the DecoderCell, so making the other parts non-trainable forces the model to focus its training on this specific part
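In Keras-style code, this freezing trick could look like the sketch below ; the postnet / encoder attribute names are assumptions about how the model exposes its sub-layers :

```python
def freeze_non_speaker_parts(tacotron, freeze_encoder = True):
    """ Hedged sketch : the `postnet` / `encoder` attribute names are assumptions """
    # The postnet only refines mel quality : it is not speaker-specific
    tacotron.postnet.trainable = False
    # Text encoding is not speaker-specific either (only freeze it if the language is unchanged)
    if freeze_encoder:
        tacotron.encoder.trainable = False
    # With Keras, the model must be re-compiled for the `trainable` changes to take effect
    return tacotron
```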

* Note that I also implemented it when models do not have the same number of layers

Contacts and licence

You can contact me at yui-mhcp@tutanota.com or on discord at yui#0732

The objective of these projects is to facilitate the development and deployment of useful Deep Learning applications for solving real-world problems and helping people. For this purpose, all the code is under the Affero GPL (AGPL) v3 licence

All my projects are "free software", meaning that you can use, modify, deploy and distribute them freely, in compliance with the Licence. They are not in the public domain and are copyrighted : there are some conditions on distribution, but their objective is to make sure that everyone is able to use and share any modified version of these projects.

Furthermore, if you want to use any project in a closed-source project, or in a commercial project, you will need to obtain another Licence. Please contact me for more information.

For my protection, it is important to note that all projects are provided on an "As Is" basis, without any warranties or conditions of any kind, either explicit or implied. However, do not hesitate to report issues on the repository or to make a Pull Request to solve them 😄

If you use this project in your work, please add this citation to give it more visibility ! 😋

@misc{yui-mhcp,
    author  = {yui},
    title   = {A Deep Learning projects centralization},
    year    = {2021},
    publisher   = {GitHub},
    howpublished    = {\url{https://github.com/yui-mhcp}}
}

Notes and references

The code for this project is a mixture of multiple GitHub projects, in order to have a fully modular Tacotron-2 implementation

  • [1] NVIDIA's repository (tacotron2 / waveglow) : this was my first implementation where I copied their architecture in order to reuse their pretrained model in a tensorflow 2.x implementation.
  • [2] The TFTTS project : my 1st model was quite slow and had many Out Of Memory (OOM) errors, so I improved the implementation by using the TacotronDecoder from this github, which allows the swap_memory argument by using dynamic_decode
  • [3] Tensorflow Addons : as I had some trouble using the library due to version issues, I copied just the dynamic_decode() and BaseDecoder class to use them in the TacotronDecoder implementation
  • [4] CorentinJ's Real-Time Voice Cloning project : this repository is an implementation of the SV2TTS architecture. I did not copy any of its code, as I already had my own (slightly different) implementation, but it inspired me to add the SV2TTS feature to my class.

Papers :