# Text-To-Speech inference

This notebook illustrates how to use the `tts` function to automatically convert text to natural speech! The function uses `Tacotron2` (with its `SV2TTS` variant) as the `synthesizer` (text to mel-spectrogram) and `WaveGlow` as the `vocoder` (mel-spectrogram to audio) to produce speech :)

Note that all models are loaded as singletons, meaning that only the 1st call is slow (due to model loading + compilation), while subsequent calls are **much faster**: up to 10 times faster than real time on an `RTX3090Ti` (i.e., it takes 1 second to generate 10 seconds of audio)!

All the audio outputs are available in the `example_outputs` directory, so that you can listen to them without loading the models :)

PS: the last French model demo will not be shared, but it is a good example of `SV2TTS` fine-tuning on a single speaker with only a limited amount of good-quality data :) Nonetheless, the foundation model (i.e., the multi-speaker version) and the training code will be shared so that you can replicate this on your own data!\*

\* The training code has not yet been tested with this updated code, and will be shared in a future update. The previously shared notebook may still work, without any warranty.

## Steps to reproduce

1. Download the model weights (see `README.md` for the links)
2. Unzip the weights in the `pretrained_models/` directory
3. Execute the cells!

Note: to associate a model with a language, update the `_pretrained` global variable at the end of the `models/tts/__init__.py` file. The `key` is the language and the `value` is the model's name.

### TTS on text

The `tts` function generates audio from the provided text (or list of texts).

By default, the model does not regenerate sentences that have already been generated. This behavior can be changed by passing `overwrite = True` to force regeneration. Note that for `SV2TTS`-based models, `overwrite` is `True` by default, as those models are designed to produce different intonations depending on the input embeddings.
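
The skip-unless-`overwrite` logic can be sketched as follows. This is only an illustration under simple assumptions, not the code of `tts`: `generate_all`, the in-memory `cache` dictionary and `fake_synthesize` are hypothetical stand-ins for the real synthesis and file storage.

```python
def generate_all(texts, synthesize, cache, overwrite=False):
    """Synthesize each text, reusing cached results unless `overwrite` is set."""
    results = []
    for text in texts:
        if overwrite or text not in cache:
            cache[text] = synthesize(text)  # (re)generate and store the result
        results.append(cache[text])
    return results

calls = []  # records every text that was actually synthesized

def fake_synthesize(text):
    # Hypothetical stand-in for the real (slow) synthesizer + vocoder
    calls.append(text)
    return f'audio for {text!r}'

cache = {}
generate_all(['Hello world !'], fake_synthesize, cache)                  # generated
generate_all(['Hello world !'], fake_synthesize, cache)                  # reused
generate_all(['Hello world !'], fake_synthesize, cache, overwrite=True)  # regenerated
print(len(calls))  # number of actual syntheses
```

Only the first and the last call reach the synthesizer; the second one is served from the cache.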

In this example, the model is loaded with `lang = 'en'`, which loads the `pretrained_tacotron2` model. This association can be changed by modifying the global `_pretrained` variable in the `models/tts/__init__.py` file.
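
The language-to-model association can be pictured as a plain dictionary lookup. The sketch below is an assumption about the general idea, not a copy of `models/tts/__init__.py`: only the `'en'` entry comes from this notebook, while `resolve_model` and `my_custom_model` are hypothetical names.

```python
# Illustrative content only; the real `_pretrained` lives in models/tts/__init__.py
_pretrained = {
    'en': 'pretrained_tacotron2',
}

def resolve_model(lang=None, model=None):
    """Hypothetical helper: an explicit model name wins over the language default."""
    if model is not None:
        return model
    if lang not in _pretrained:
        raise ValueError(f'No model associated with language {lang!r}')
    return _pretrained[lang]

print(resolve_model(lang='en'))

# Re-associate 'en' with another model (hypothetical name)
_pretrained['en'] = 'my_custom_model'
print(resolve_model(lang='en'))
```

Passing a model name directly, as done later with `model = 'sv2tts_fine_tuned'`, bypasses the language lookup in this sketch.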

In [None]:
from models.tts import tts
from loggers import set_level

text = """
Hello world ! I hope you will enjoy this funny API for Text-To-Speech ! 
"""

set_level('info')

_ = tts(
    text, lang = 'en', directory = 'example_outputs/en', display = True, overwrite = True
)

In [None]:
from models.tts import tts

text = [
    "Bonjour tout le monde ! J'espère que vous allez aimer cette démonstration de voix en français !"
]

_ = tts(
    text, lang = 'fr', directory = 'example_outputs/fr', display = True, overwrite = True
)

In regular inference, `Tacotron-2` (and `SV2TTS`) models struggle with long texts (e.g., longer than ~150 characters) because of their attention mechanism. To mitigate this limitation, the `attn_mask_win_len` argument applies a sliding window that moves dynamically so that at most the given number of tokens is visible to the attention. It is therefore important to set this parameter correctly when your text is long, or to use the `max_text_length` argument to split the text.
The 2nd solution works well, but gives lower-quality results: the model does not have access to the entire text, and therefore does not produce a *smooth and continuous* reading. The masking feature, however, is still experimental and may fail in some cases ;) Thanks to the random nature of the model, a simple workaround is to re-run the generation (with `overwrite = True`) to get another result!
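
To give an intuition of the sliding window, here is a minimal pure-Python sketch of a windowed attention. It is not the model's implementation: `window_bounds` and `masked_softmax` are hypothetical helpers, and centering the window on the currently attended token is an assumption made for the illustration.

```python
import math

def window_bounds(text_len, center, win_len):
    """Return [start, end) so that at most `win_len` tokens around `center` are visible."""
    start = max(0, min(center - win_len // 2, text_len - win_len))
    end = min(text_len, start + win_len)
    return start, end

def masked_softmax(scores, start, end):
    """Softmax over `scores` where tokens outside [start, end) get a zero weight."""
    visible = scores[start:end]
    top = max(visible)  # subtracted for numerical stability
    exps = [math.exp(s - top) for s in visible]
    total = sum(exps)
    return [0.0] * start + [e / total for e in exps] + [0.0] * (len(scores) - end)

scores = [0.0] * 10  # dummy attention scores for a 10-token text
start, end = window_bounds(len(scores), center=5, win_len=4)
weights = masked_softmax(scores, start, end)
print(start, end)
print(weights)
```

As decoding progresses, `center` moves forward and the window follows it, so the attention can never jump to a distant part of a long text.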

In [None]:
from models.tts import tts

text = """
Bonjour tout le monde ! J'espère que vous allez aimer cette démonstration de mon super modèle entrainé avec seulement 20 minutes 
d'audios ! Je ne partagerai pas ce modèle, mais je trouvais ça intéressant de montrer ce qu'il était possible de faire !
"""

_ = tts(
    text, model = 'sv2tts_fine_tuned', directory = 'example_outputs/fr', attn_mask_win_len = 150, display = True, overwrite = True
)
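
The other option mentioned earlier, splitting the text with `max_text_length`, can be sketched as follows. This is an illustrative assumption about how such a split could work, not the actual splitting logic of `tts`: `split_text` is a hypothetical helper that greedily groups sentences.

```python
import re

def split_text(text, max_text_length):
    """Greedily group sentences into chunks of at most `max_text_length` characters.

    A single sentence longer than the limit is kept whole in its own chunk.
    """
    normalized = ' '.join(text.split())  # collapse newlines and repeated spaces
    sentences = re.findall(r'[^.!?]+[.!?]*', normalized)
    chunks, current = [], ''
    for sentence in (s.strip() for s in sentences):
        if not sentence:
            continue
        candidate = f'{current} {sentence}' if current else sentence
        if current and len(candidate) > max_text_length:
            chunks.append(current)  # the current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "Hello world ! I hope you will enjoy this funny API for Text-To-Speech !"
for chunk in split_text(text, max_text_length=20):
    print(repr(chunk))
```

Each chunk would then be synthesized independently, which explains why the reading is less continuous than with the attention window.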

The next cell displays the audio examples, so that you do not have to re-execute the above cells to check the results! :)

In [None]:
import glob

from utils.audio import display_audio

for file in glob.glob('example_outputs/**/**/*.mp3'):
    print(file)
    _ = display_audio(file)