Check the CHANGELOG file for a global overview of the latest modifications !
├── custom_architectures
│   ├── tacotron2_arch.py
│   └── waveglow_arch.py
├── custom_layers
├── custom_train_objects
│   └── losses
│       └── tacotron_loss.py    : custom Tacotron2 loss
├── datasets
├── example_outputs             : some pre-computed audios to show you an example
├── hparams
├── loggers
├── models
│   ├── siamese                 : the `AudioSiamese` is used as encoder for the SV2TTS model
│   └── tts
│       ├── sv2tts_tacotron2.py : SV2TTS main class
│       ├── tacotron2.py        : Tacotron2 main class
│       ├── vocoder.py          : main functions for complete inference
│       └── waveglow.py         : WaveGlow main class (both pytorch and tensorflow)
├── pretrained_models
├── unitest
├── utils
├── example_fine_tuning.ipynb
├── example_sv2tts.ipynb
├── example_tacotron2.ipynb
├── example_waveglow.ipynb
└── text_to_speech.ipynb
Check the main project for more information about the unextended modules / structure / main classes.
* Check my Siamese Networks project for more information about the `models/siamese` module

- Text-To-Speech (module `models.tts`) :
| Feature | Function / class | Description |
| :------ | :--------------- | :---------- |
| Text-To-Speech | `tts` | perform TTS on the text you want with the model you want |
| Stream | `tts_stream` | perform TTS on the text you enter |
| TTS logger | `loggers.TTSLogger` | converts logging messages to voice and plays them |
You can check the `text_to_speech` notebook for a concrete demonstration.
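As an illustration, here is a minimal usage sketch. It assumes `tts` can be imported from the `models.tts` module and accepts a `model` keyword; the exact import path and signature may differ, so check the notebook for the real API.

```python
# Hypothetical usage sketch : the exact signature of `tts` may differ,
# see the `text_to_speech` notebook for the real API
from models.tts import tts

# Synthesize a sentence with one of the pretrained models listed below
tts('Bonjour à tous, ceci est une démonstration !', model = 'sv2tts_siwis')
```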
Available architectures :
- Synthesizer : Tacotron2, SV2TTSTacotron2
- Vocoder : WaveGlow (both pytorch and tensorflow)
* Some speaker embeddings are created with the Siamese Networks approach, which differs from the original paper : check the Siamese Networks project for more information on this architecture. More recent models use GE2E-loss based encoders (as in the original paper), but with a CNN architecture (instead of the 3-layer LSTM), as it is faster to train.
My SV2TTS models are fine-tuned from pretrained Tacotron2 models using the partial transfer learning procedure (see below for details), which greatly speeds up training.
| Name | Language | Dataset | Synthesizer | Vocoder | Speaker Encoder | Trainer | Weights |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| pretrained_tacotron2 | en | LJSpeech | Tacotron2 | WaveGlow | / | NVIDIA | Google Drive |
| tacotron2_siwis | fr | SIWIS | Tacotron2 | WaveGlow | / | me | Google Drive |
| sv2tts_tacotron2_256 | fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
| sv2tts_siwis | fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
| sv2tts_tacotron2_256_v2 | fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
| sv2tts_siwis_v2 | fr | SIWIS | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive |
You can download the `tensorflow` version of `WaveGlow` at this link.

Models must be unzipped in the `pretrained_models/` directory !
Important Note : the NVIDIA models available on torch hub require a compatible GPU with the correct `pytorch` configuration. This is why I have released pre-converted `tensorflow` models (both Tacotron2 and WaveGlow), so that you do not have to configure `pytorch` !
The `sv2tts_siwis` model is a fine-tuned version of `sv2tts_tacotron2_256` on the SIWIS (single-speaker) dataset. Fine-tuning a multi-speaker model on a single-speaker dataset tends to improve stability and to produce a voice with more intonation, compared to simply training a single-speaker model.
A Google Colab demo is available at this link ! You can also find some generated audio in `example_outputs/`, or directly in the Colab notebook ;)
- Clone this repository : `git clone https://github.com/yui-mhcp/text_to_speech.git`
- Go to the root of this repository : `cd text_to_speech`
- Install requirements : `pip install -r requirements.txt`
- Open the `text_to_speech` notebook and follow the instructions !

You also have to install `ffmpeg` for audio loading / saving.
- Make the TO-DO list
- Comment the code
- Add pretrained weights for French
- Make a Google Colab demonstration
- Implement WaveGlow in `tensorflow 2.x`
- Add `batch_size` support for vocoder inference
- Add pretrained `SV2TTS` weights
- Add a `similarity loss` to test a new training procedure for single-speaker fine-tuning
- Add document parsing to perform `TTS` on documents (in progress)
- Add support for new languages
- Add new TTS architectures / models
- Train a `SV2TTS` model based on an encoder trained with the `GE2E` loss
- Add experimental support for long text inference
- Add support for streaming inference
There exist 2 main ways to enable multi-speaker synthesis in the `Tacotron2` architecture :
- Use a speaker id, embed it with an `Embedding` layer, and concatenate / add it to the `Encoder` output
- Use a `Speaker Encoder (SE)` to embed audio from speakers, and concatenate / add this embedding to the encoder output

I have not tested the 1st idea, but it is available in my implementation.
Note : in the next paragraphs, `encoder` refers to the `Tacotron Encoder` part, while `SE` refers to a speaker encoder model (detailed below).
The `Speaker Encoder Text-To-Speech` approach comes from the From Speaker Verification To Text-To-Speech (SV2TTS) paper, which shows how to use a `Speaker Verification` model to embed audio and use these embeddings as input for a `Tacotron2` model.
The idea is the following :
- Train a model to identify speakers based on their audio : the `speaker verification` model. This model takes an audio sample (5-10 sec) from a speaker, embeds it, and compares it to baseline embeddings to decide whether the speakers are the same or not
- Use this `speaker encoder` model to produce embeddings of the speaker to clone
- Perform a classical text encoding with the `Tacotron Encoder` part
- Concatenate the `speaker embedding` (a 1D vector) to each frame of the `encoder output` *
- Perform a classical forward pass with the `Tacotron Decoder` part
The idea is that the `Decoder` will learn to use the `speaker embedding` to copy the speaker's prosody / intonation / ... and read the text with that speaker's voice : it works quite well !
\* The `embedding` is a 1D vector while the `encoder output` is a matrix with shape `(text_length, encoder_embedding_dim)`. The idea is to concatenate the `embedding` to each frame by repeating it `text_length` times.
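To make this repetition / concatenation step concrete, here is a minimal `tensorflow` sketch. The helper name and shapes are assumptions for illustration, not the repository's exact code.

```python
import tensorflow as tf

def concat_speaker_embedding(encoder_output, speaker_embedding):
    """ Sketch of the concatenation step (hypothetical helper, not the repo's exact code).

        encoder_output    : (batch_size, text_length, encoder_embedding_dim)
        speaker_embedding : (batch_size, speaker_embedding_dim)
        returns           : (batch_size, text_length, encoder_embedding_dim + speaker_embedding_dim)
    """
    text_length = tf.shape(encoder_output)[1]
    # Repeat the 1D speaker embedding once per encoder frame (i.e. `text_length` times)
    repeated = tf.tile(tf.expand_dims(speaker_embedding, axis = 1), [1, text_length, 1])
    # Concatenate along the feature dimension
    return tf.concat([encoder_output, repeated], axis = -1)
```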
There are some issues with the above approach :
- Perfect generalization to new speakers is really hard, as it requires datasets with many speakers (more than 1k), which are really rare among `Text-To-Speech` datasets
- The audio should be of good quality, to avoid creating noise in the output voices
- The `Speaker Encoder` must be good enough to separate speakers well
- The `Speaker Encoder` must be able to embed speakers in a relevant way, such that the `Tacotron` model can extract useful information about the speaker's prosody
For the 1st issue, there is no real solution except combining different datasets, as done in the example notebooks with the `CommonVoice`, `VoxForge` and `SIWIS` datasets.

Another solution is to train a low-quality model (i.e. one trained on a lot of noisy data), then fine-tune it with a small amount of good-quality data from a particular speaker. The big advantage of this approach is that you can train a new model really fast, with less than 20 min of annotated audio from the speaker (which is impossible with a classical single-speaker model training).
For the second point, pay attention to use good-quality audio : my experiments have shown that with the original datasets (which contain quite poor quality data), the model never converges. However, there exists a solution : preprocessing ! The `utils/audio` module contains many powerful preprocessing functions for noise reduction (using the noisereduce library) and audio silence trimming (which is really important for the model).
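As a rough illustration of such a preprocessing step (the repository's own helpers live in `utils/audio` and may have different names and parameters), using the `noisereduce` and `librosa` libraries :

```python
import librosa
import noisereduce as nr
import soundfile as sf

def preprocess_audio(filename, rate = 22050, top_db = 30):
    """ Hypothetical preprocessing sketch : noise reduction + silence trimming """
    audio, rate = librosa.load(filename, sr = rate)
    # Reduce stationary background noise with the `noisereduce` library
    audio = nr.reduce_noise(y = audio, sr = rate)
    # Trim leading / trailing silences, which matters a lot for Tacotron alignment
    audio, _ = librosa.effects.trim(audio, top_db = top_db)
    return audio, rate

cleaned, rate = preprocess_audio('speaker_sample.wav')
sf.write('speaker_sample_clean.wav', cleaned, rate)
```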
For the 2 last points, read the next section on the speaker encoder.
The SE part should be able to differentiate speakers and embed them (i.e. encode them into a 1D vector) in a meaningful way.
The model used in the paper is a 3-layer LSTM with a normalization layer, trained with the GE2E loss. The problem is that training this model is really slow : it took 2 weeks on 4 GPUs in CorentinJ's master thesis (cf. his github).
This was not possible for me (I do not have 4 GPUs), so I tested something else : using the AudioSiamese model ! The objective of this model is to create speaker embeddings while minimizing the distance between embeddings from the same speaker, which is equivalent to the GE2E training objective !
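As a rough sketch of this training objective (an assumed contrastive form, not the repository's exact `AudioSiamese` loss), a pairwise loss over speaker embeddings could look like this :

```python
import tensorflow as tf

def siamese_contrastive_loss(emb_1, emb_2, same_speaker, margin = 1.0):
    """ Hypothetical contrastive loss sketch : pull embeddings of the same speaker
        together and push embeddings of different speakers apart (same spirit as GE2E).

        emb_1, emb_2 : (batch_size, embedding_dim) speaker embeddings
        same_speaker : (batch_size, ) with 1. if both audios come from the same speaker, else 0.
    """
    distances = tf.norm(emb_1 - emb_2, axis = -1)
    positive  = same_speaker * tf.square(distances)
    negative  = (1. - same_speaker) * tf.square(tf.maximum(0., margin - distances))
    return tf.reduce_mean(positive + negative)
```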
Experiments have shown 2 interesting results :
- An `AudioSiamese` trained on raw audio is quite good for `speaker verification`, but embeds speakers in a non-meaningful way for `Tacotron`, so the results were quite poor
- An `AudioSiamese` trained on mel-spectrograms (same parameters as the `Tacotron` mel function) is as good for `speaker verification`, but seems to extract more meaningful information !
The big advantage is that you can train your `Speaker Encoder` in less than one night and use it directly, which is crazy : 1 night on a single GPU instead of 2 weeks on 4 GPUs ! Furthermore, a visual comparison of embeddings produced by the 3-layer LSTM encoder and my Siamese Network encoder shows that they are quite similar.
In order to avoid training a SV2TTS model from scratch, which would be completely impossible on a single GPU, I created a partial transfer learning procedure.

The idea is quite simple : perform transfer learning between models that have the same number of layers but different layer shapes*. This allowed me to use my single-speaker pretrained model as a base for the SV2TTS model ! Experiments showed that it works pretty well : the model has to learn new neurons specific to voice cloning, but can reuse its pretrained neurons for speaking, quite funny !
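To make the idea concrete, here is a minimal `numpy` sketch of partial transfer learning between two weight tensors with different shapes. It is an illustrative simplification under assumed shapes, not the repository's actual implementation.

```python
import numpy as np

def partial_transfer(pretrained_kernel, new_kernel):
    """ Copy the overlapping slice of a pretrained weight tensor into a bigger one,
        keeping the (randomly initialized) extra neurons untouched. """
    transferred = np.array(new_kernel, copy = True)
    # Slice both tensors to the size they have in common on every axis
    common = tuple(
        slice(0, min(old_dim, new_dim))
        for old_dim, new_dim in zip(pretrained_kernel.shape, new_kernel.shape)
    )
    transferred[common] = pretrained_kernel[common]
    return transferred

# e.g. a pretrained decoder kernel of shape (1024, 512) transferred to a SV2TTS kernel
# of shape (1024 + 256, 512) : the first 1024 rows are copied, while the 256 new rows
# (corresponding to the concatenated speaker embedding) keep their random initialization
old_kernel = np.random.normal(size = (1024, 512)).astype(np.float32)
new_kernel = np.random.normal(size = (1280, 512)).astype(np.float32)
merged     = partial_transfer(old_kernel, new_kernel)
```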
Some ideas that showed benefits (especially for single-speaker fine-tuning) :
- After some epochs (2-5), you can set the `Postnet` part as non-trainable : this part basically improves mel quality but is not speaker-specific, so there is no need to train it too much
- After some epochs (5-10), you can set the `Tacotron Encoder` part as non-trainable (only if your pretrained model was trained for the same language) : text encoding is not speaker-specific, so there is no need to train it too much

The idea behind these tricks is that the only speaker-specific part is the `DecoderCell`, so we can make the other parts non-trainable to force the model to learn this specific part.
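A minimal sketch of these freezing tricks, assuming a Keras model that exposes `postnet` and `encoder` sub-layers (hypothetical attribute names, not necessarily the repository's own) :

```python
def freeze_non_speaker_parts(model, epoch, freeze_postnet_after = 3, freeze_encoder_after = 8):
    """ Hypothetical helper : progressively freeze the non speaker-specific parts """
    if epoch >= freeze_postnet_after:
        model.postnet.trainable = False   # mel-quality refinement, not speaker-specific
    if epoch >= freeze_encoder_after:
        model.encoder.trainable = False   # text encoding, not speaker-specific (same language only)
```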
\* Note that I also implemented the case where models do not have the same number of layers.
You can contact me at yui-mhcp@tutanota.com or on discord at yui#0732
The objective of these projects is to facilitate the development and deployment of useful applications using Deep Learning for solving real-world problems and helping people. For this purpose, all the code is under the Affero GPL (AGPL) v3 licence.
All my projects are "free software", meaning that you can use, modify, deploy and distribute them freely, in compliance with the Licence. They are not in the public domain and are copyrighted; there are some conditions on distribution, but their objective is to make sure that everyone is able to use and share any modified version of these projects.
Furthermore, if you want to use any project in a closed-source project, or in a commercial project, you will need to obtain another Licence. Please contact me for more information.
For my protection, it is important to note that all projects are provided on an "As Is" basis, without any warranties or conditions of any kind, either explicit or implied. However, do not hesitate to report issues on the repository or to make a Pull Request to solve them !
If you use this project in your work, please add this citation to give it more visibility !
@misc{yui-mhcp,
    author       = {yui},
    title        = {A Deep Learning projects centralization},
    year         = {2021},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/yui-mhcp}}
}
The code for this project is a mixture of multiple GitHub projects, in order to have a fully modular `Tacotron-2` implementation :
- [1] NVIDIA's repository (tacotron2 / waveglow) : this was my first implementation, where I copied their architecture in order to reuse their pretrained models in a `tensorflow 2.x` implementation.
- [2] The TFTTS project : my 1st model was quite slow and had many `Out Of Memory (OOM)` errors, so I improved the implementation by using the `TacotronDecoder` from this github, which supports the `swap_memory` argument by using `dynamic_decode`
- [3] Tensorflow Addons : as I had some trouble using the library due to version issues, I copied just the `dynamic_decode()` function and the `BaseDecoder` class to use them in the `TacotronDecoder` implementation
- [4] CorentinJ's Real-Time Voice Cloning project : this repository is an implementation of the `SV2TTS` architecture. I did not copy any of its code, as I already had my own implementation (which is slightly different in this repo), but it inspired me to add the `SV2TTS` feature to my class.
Papers :
- [5] Tacotron 2 : the original Tacotron2 paper
- [6] WaveGlow : the WaveGlow model
- [7] Transfer learning from Speaker Verification to Text-To-Speech : the original paper for the SV2TTS idea
- [8] Generalized End-to-End loss for Speaker Verification : the GE2E loss paper