# Text to speech

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensein/senselab/blob/main/tutorials/text_to_speech.ipynb)

This tutorial demonstrates how to use the `synthesize_texts` function to convert pieces of text into audio files efficiently.

## Setup
First, let's import the necessary libraries and the function we'll be using.

In [None]:
from senselab.utils.data_structures.model import TorchModel
from senselab.utils.data_structures.language import Language
from senselab.utils.data_structures.device import DeviceType
from senselab.audio.data_structures.audio import Audio
from senselab.audio.tasks.preprocessing.preprocessing import resample_audios, downmix_audios_to_mono, extract_segments
from senselab.audio.tasks.plotting.plotting import play_audio
from senselab.audio.tasks.text_to_speech import synthesize_texts

## Specifying the TTS model, the language and the preferred device
Let's initialize the model we want to use (remember to specify both the ```path_or_uri``` and the ```revision``` for reproducibility), the language of the text we want to synthesize, and the preferred device. In this tutorial, we use [```mars5```](https://github.com/Camb-ai/MARS5-TTS), which only supports English.

In [None]:
model = TorchModel(path_or_uri="Camb-ai/mars5-tts", revision="master")
language = Language(language_code="en")
device = DeviceType.CPU
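
If you would rather pick the device at runtime than hard-code the CPU, something like the sketch below may help. `pick_device_name` is a hypothetical helper, not part of senselab. It assumes PyTorch may or may not be installed and falls back to `"cpu"` when it is not. Mapping the result to a `DeviceType.CUDA` member is an assumption about the enum, so check it against your senselab version.

```python
def pick_device_name() -> str:
    """Return "cuda" if PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency; fall back to CPU if it is missing
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device_name = pick_device_name()
print("Selected device:", device_name)

# Assumption: DeviceType also exposes a CUDA member alongside CPU.
# device = DeviceType.CUDA if device_name == "cuda" else DeviceType.CPU
```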

## Loading the target audio file
Now let's load and process the audio file that contains the voice we want to target in our text-to-speech process. We extract only the first second of audio because it contains a single speaker.

In [None]:
audio = Audio.from_filepath("../src/tests/data_for_testing/audio_48khz_mono_16bits.wav")
ground_truth = "This is Peter."
audio = extract_segments([(audio, [(0.0, 1.0)])])[0][0]
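
To make the `(0.0, 1.0)` segment concrete, the sketch below converts a time range in seconds into sample indices. It is only an illustration of the arithmetic and not how `extract_segments` is implemented. `segment_to_sample_range` is a hypothetical helper, and the 48 kHz rate is taken from the file name rather than read from the file.

```python
from typing import Tuple


def segment_to_sample_range(start_s: float, end_s: float, sampling_rate: int) -> Tuple[int, int]:
    """Convert a (start, end) time range in seconds to (start, end) sample indices."""
    if start_s < 0 or end_s <= start_s:
        raise ValueError("Expected 0 <= start_s < end_s")
    return round(start_s * sampling_rate), round(end_s * sampling_rate)


# Assumption: the file is sampled at 48 kHz, as its name suggests.
start, end = segment_to_sample_range(0.0, 1.0, 48_000)
print(start, end)  # 0 48000
```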

## Preprocessing
Let's preprocess the audio so that it matches the requirements of the TTS model, which are listed in the model card on the Hugging Face Hub. For our example model, the audio must be mono and sampled at 24 kHz.

In [None]:
audio = downmix_audios_to_mono([audio])[0]
audio = resample_audios([audio], 24000)[0]
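
For intuition about what these two steps do to the signal, here is a minimal pure-Python sketch. It is not how senselab implements them: `downmix_to_mono` and `resample_linear` are hypothetical helpers, and linear interpolation stands in for the band-limited filtering that production resamplers typically apply to avoid aliasing.

```python
from typing import List


def downmix_to_mono(channels: List[List[float]]) -> List[float]:
    """Average the samples that share an index across all channels."""
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def resample_linear(samples: List[float], orig_sr: int, target_sr: int) -> List[float]:
    """Resample by linear interpolation (illustrative only, no anti-alias filter)."""
    if not samples:
        return []
    n_out = int(len(samples) * target_sr / orig_sr)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr  # fractional position in the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


stereo = [[0.0, 1.0, 0.0, -1.0], [0.0, 0.0, 0.0, 0.0]]
mono = downmix_to_mono(stereo)
halved = resample_linear(mono, 48_000, 24_000)
print(mono)         # [0.0, 0.5, 0.0, -0.5]
print(len(halved))  # 2
```

Halving the sampling rate halves the number of samples, so one second of 48 kHz audio becomes 24,000 samples at 24 kHz.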

Here is how our target audio sounds.

In [None]:
play_audio(audio)
print("Ground truth:", ground_truth)

## Synthesis
Finally, let's synthesize the audio.

Note: You can pass additional parameters to customize the process. For details, see the [**dedicated documentation**](https://sensein.group/senselab/senselab/audio/tasks/text_to_speech.html).

In [None]:
res = synthesize_texts(
    texts=["Hello, world. It's nice to meet you."],
    target=[[audio, ground_truth]],
    model=model,
    language=language,
)
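
The call above pairs one text with one `[audio, transcript]` target. If you want several texts spoken in the same reference voice, one option is to repeat that pair once per text, as sketched below. This assumes the function expects one target per text, which you should confirm in the documentation. `repeat_target` is a hypothetical helper, and a string stands in for the `Audio` object so the snippet runs on its own.

```python
from typing import Any, List


def repeat_target(texts: List[str], reference_audio: Any, transcript: str) -> List[List[Any]]:
    """Build one [audio, transcript] pair per text, all sharing one reference voice."""
    if not transcript.strip():
        raise ValueError("The reference transcript must not be empty")
    return [[reference_audio, transcript] for _ in texts]


texts = ["Hello, world.", "It's nice to meet you."]
# Placeholder string used here in place of a real Audio object.
targets = repeat_target(texts, reference_audio="<Audio placeholder>", transcript="This is Peter.")
print(len(targets))  # 2
```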

And here is the output audio of our tutorial.

In [None]:
play_audio(res[0])