# Text to speech

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensein/senselab/blob/main/tutorials/audio/text_to_speech.ipynb)

This tutorial demonstrates how to use the `synthesize_texts` function to convert pieces of text into audio files. 

## Quick start
We will start with some ```HuggingFace``` models. 

The first example uses ```facebook/mms-tts-eng```, which requires only one input: the list of texts you want to synthesize.

In [None]:
%pip install "senselab[audio]"

In [None]:
# Model: facebook/mms-tts-eng (https://huggingface.co/facebook/mms-tts-eng)

# Import the audio player
from senselab.audio.tasks.plotting.plotting import play_audio

# Import the text-to-speech function
from senselab.audio.tasks.text_to_speech import synthesize_texts

# Import the Hugging Face model wrapper
from senselab.utils.data_structures import HFModel

# Initialize the model
hf_model = HFModel(path_or_uri="facebook/mms-tts-eng", revision="main")
# Write the text to be synthesized
texts = ["Hello world"]
# Call the text-to-speech function
audios = synthesize_texts(texts=texts, model=hf_model)

# Play the synthesized audio
play_audio(audios[0])

## More examples
Here is ```suno/bark-small``` (https://huggingface.co/suno/bark-small). As in the previous example, the only required input is the list of texts to synthesize.

In [None]:
# Model: suno/bark-small (https://huggingface.co/suno/bark-small)

# Import the audio player
from senselab.audio.tasks.plotting.plotting import play_audio

# Import the text-to-speech function
from senselab.audio.tasks.text_to_speech import synthesize_texts

# Import the Hugging Face model wrapper
from senselab.utils.data_structures import HFModel

# Initialize the model
hf_model = HFModel(path_or_uri="suno/bark-small", revision="main")
# Write the text to be synthesized
texts = ["Hello world"]
# Call the text-to-speech function
audios = synthesize_texts(texts=texts, model=hf_model)

# Play the synthesized audio
play_audio(audios[0])

Let's proceed with ```microsoft/speecht5_tts``` (https://huggingface.co/microsoft/speecht5_tts). This model requires the list of texts to synthesize plus the speaker embedding of the voice we want to clone. A speaker embedding is a vector of values describing the characteristics of someone's voice. If you want to learn more about extracting speaker embeddings with Senselab, please refer to the [dedicated documentation](https://sensein.group/senselab/senselab/audio/tasks/speaker_embeddings.html). Details about ```microsoft/speecht5_tts``` can be found in the model card. In our example, we use a speaker embedding from the ```Matthijs/cmu-arctic-xvectors``` dataset.

In [None]:
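import torch
from datasets import load_dataset

# Load precomputed x-vector speaker embeddings and pick one voice
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
# Add a leading batch dimension to the selected embedding
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)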

# Initialize the model
hf_model = HFModel(path_or_uri="microsoft/speecht5_tts", revision="main")
# Write the text to be synthesized
texts = ["Hello, world!"]
# Call the text-to-speech function
audios = synthesize_texts(texts=texts, model=hf_model, forward_params={"speaker_embeddings": speaker_embedding})

# Play the synthesized audio
play_audio(audios[0])
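
To build intuition for the speaker embeddings used above, here is a minimal sketch of how two embeddings can be compared with cosine similarity. It is illustrative only: the vectors are made-up 4-dimensional toy values (real x-vectors are much longer), and `cosine_similarity` is a hypothetical helper written for this example, not a Senselab function.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical values chosen for illustration)
speaker_a_take_1 = [0.9, 0.1, 0.3, 0.0]
speaker_a_take_2 = [0.8, 0.2, 0.4, 0.1]
speaker_b = [0.0, 0.9, 0.1, 0.7]

same_speaker = cosine_similarity(speaker_a_take_1, speaker_a_take_2)
different_speaker = cosine_similarity(speaker_a_take_1, speaker_b)

# Embeddings of the same voice should point in a more similar direction
print(same_speaker > different_speaker)
```

In this toy setup the two takes of speaker A score much closer to 1 than the pair of different speakers, which is the property that lets a TTS model condition on an embedding to imitate a voice.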

## Even more examples
Let's now try the advanced ```Mars5-tts``` model.

```Mars5-tts``` requires two inputs:
1. A list of pieces of text you want to synthesize.
2. Target voices you want to clone, along with their respective transcripts.

Although transcripts are not strictly necessary for the model to function, our initial tests show that they significantly improve the model's quality. For this reason, we have made transcripts mandatory in our interface in ```senselab```.

### Setup
First, let's import the necessary libraries and the function we'll be using.

In [None]:
from senselab.audio.data_structures import Audio
from senselab.audio.tasks.plotting.plotting import play_audio
from senselab.audio.tasks.preprocessing import downmix_audios_to_mono, extract_segments, resample_audios
from senselab.audio.tasks.text_to_speech import synthesize_texts
from senselab.utils.data_structures import DeviceType, Language, TorchModel

### Specifying the TTS model, the language and the preferred device
Let's initialize the model we want to use (remember to specify both the ```path_or_uri``` and the ```revision``` for reproducibility purposes), the language of the text we want to synthesize, and the device we prefer. In this tutorial, we are going to use [```mars5```](https://github.com/Camb-ai/MARS5-TTS), which only works for English.

In [None]:
model = TorchModel(path_or_uri="Camb-ai/mars5-tts", revision="master")
language = Language(language_code="en")
device = DeviceType.CPU

### Loading Target Audio File
Now let's load and process the audio file that contains the voice we want to target as part of our text-to-speech process. We extract only the first second of audio because that portion contains a single speaker.

In [None]:
!mkdir -p tutorial_audio_files
!wget -O tutorial_audio_files/audio_48khz_mono_16bits.wav https://github.com/sensein/senselab/raw/main/src/tests/data_for_testing/audio_48khz_mono_16bits.wav

audio = Audio(filepath="tutorial_audio_files/audio_48khz_mono_16bits.wav")
ground_truth = "This is Peter."
audio = extract_segments([(audio, [(0.0, 1.0)])])[0][0]
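
As a rough illustration of what extracting the segment `(0.0, 1.0)` means at the sample level, the sketch below converts a time range in seconds into sample indices and slices a plain Python list. It assumes the 48 kHz sampling rate suggested by the file name; `seconds_to_sample_range` and `slice_segment` are hypothetical helpers for this example and do not reflect how `extract_segments` is implemented in Senselab.

```python
def seconds_to_sample_range(start_s, end_s, sampling_rate):
    """Convert a time range in seconds to (start, end) sample indices."""
    if start_s < 0 or end_s <= start_s:
        raise ValueError("Expected 0 <= start_s < end_s")
    return int(round(start_s * sampling_rate)), int(round(end_s * sampling_rate))


def slice_segment(samples, start_s, end_s, sampling_rate):
    """Return the samples that fall inside the requested time range."""
    start, end = seconds_to_sample_range(start_s, end_s, sampling_rate)
    return samples[start:end]


sampling_rate = 48000  # assumed from the file name audio_48khz_mono_16bits.wav
samples = [0.0] * (3 * sampling_rate)  # three seconds of silence as a stand-in signal

first_second = slice_segment(samples, 0.0, 1.0, sampling_rate)
print(len(first_second))
```

At 48 kHz, one second of audio corresponds to 48,000 samples, so the extracted segment is one third of the three-second stand-in signal.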

### Preprocessing
Let's preprocess the audio so that it matches the input requirements of the TTS model, which are listed in the model card. In particular, our example model needs mono audio sampled at 24 kHz.

In [None]:
audio = downmix_audios_to_mono([audio])[0]
audio = resample_audios([audio], 24000)[0]
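
To make the two preprocessing steps concrete, here is a simplified, dependency-free sketch of downmixing and resampling. It is not the Senselab implementation: `downmix_to_mono` and `resample_linear` are hypothetical helpers, and the resampler uses naive linear interpolation without the low-pass filtering that production resamplers typically apply.

```python
def downmix_to_mono(channels):
    """Average equally long channels sample by sample."""
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("All channels must have the same number of samples")
    return [sum(frame) / len(channels) for frame in zip(*channels)]


def resample_linear(samples, orig_sr, target_sr):
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if not samples:
        return []
    n_out = int(len(samples) * target_sr / orig_sr)
    resampled = []
    for i in range(n_out):
        position = i * orig_sr / target_sr  # fractional index in the input
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        fraction = position - left
        resampled.append(samples[left] * (1 - fraction) + samples[right] * fraction)
    return resampled


# Tiny two-channel toy signal (hypothetical values)
stereo = [[0.0, 1.0, 2.0, 3.0], [2.0, 3.0, 4.0, 5.0]]
mono = downmix_to_mono(stereo)
halved = resample_linear(mono, orig_sr=48000, target_sr=24000)

print(mono)    # [1.0, 2.0, 3.0, 4.0]
print(halved)  # [1.0, 3.0]
```

Going from 48 kHz to 24 kHz halves the number of samples while keeping the duration unchanged, which is why the four-sample toy signal becomes two samples.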

And here is how our target audio sounds.

In [None]:
play_audio(audio)
print("Ground truth:", ground_truth)

### Synthesis
Finally, let's synthesize the audio.

Note: You can pass additional parameters to customize the synthesis. For more details, see the [**dedicated documentation**](https://sensein.group/senselab/senselab/audio/tasks/text_to_speech.html).

In [None]:
res = synthesize_texts(
    texts=["Hello, world. It's nice to meet you."],
    targets=[(audio, ground_truth)],
    model=model,
    language=language,
)

And here is the output audio of our tutorial.

In [None]:
play_audio(res[0])