# Text to speech

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensein/senselab/blob/main/tutorials/audio/text_to_speech.ipynb)

This tutorial demonstrates how to use the `synthesize_texts` function to convert pieces of text into audio files. 

## Quick start
We will start with some Hugging Face models.

The very first example uses ```facebook/mms-tts-eng```, which only requires the list of texts that you want to synthesize.

In [1]:
%pip install 'senselab[audio]'

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Model: facebook/mms-tts-eng (https://huggingface.co/facebook/mms-tts-eng)

# Import the audio player
from senselab.audio.tasks.plotting.plotting import play_audio

# Import the text-to-speech function
from senselab.audio.tasks.text_to_speech import synthesize_texts

# Import the Hugging Face model wrapper
from senselab.utils.data_structures import HFModel

%matplotlib inline

# Initialize the model
hf_model = HFModel(path_or_uri="facebook/mms-tts-eng", revision="main")
# Write the text to be synthesized
texts = ["Hello world"]
# Call the text-to-speech function
audios = synthesize_texts(texts=texts, model=hf_model)

# Play the synthesized audio
play_audio(audios[0])


Device set to use cpu


## More examples
Here is ```suno/bark-small``` (https://huggingface.co/suno/bark-small). In this case too, the only required input is the list of texts to synthesize.

In [3]:
# Model: suno/bark-small (https://huggingface.co/suno/bark-small)

# Import the audio player
from senselab.audio.tasks.plotting.plotting import play_audio

# Import the text-to-speech function
from senselab.audio.tasks.text_to_speech import synthesize_texts

# Import the Hugging Face model wrapper
from senselab.utils.data_structures import HFModel

# Initialize the model
hf_model = HFModel(path_or_uri="suno/bark-small", revision="main")
# Write the text to be synthesized
texts = ["Hello world"]
# Call the text-to-speech function
audios = synthesize_texts(texts=texts, model=hf_model)

# Play the synthesized audio
play_audio(audios[0])


Device set to use cpu
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Let's proceed with ```microsoft/speecht5_tts``` (https://huggingface.co/microsoft/speecht5_tts). This model requires the list of texts to synthesize plus the speaker embedding of the voice we want to clone. Speaker embeddings are vectors of values that describe the characteristics of someone's voice; if you want to learn more about extracting them with Senselab, please refer to the [dedicated documentation](https://sensein.group/senselab/senselab/audio/tasks/speaker_embeddings.html). Details about ```microsoft/speecht5_tts``` can be found in the model card. In our example, we use a speaker embedding from the ```Matthijs/cmu-arctic-xvectors``` dataset.
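To make the idea of a speaker embedding more concrete, here is a minimal sketch of how two embeddings are commonly compared, using cosine similarity. The vectors below are made-up 4-dimensional toy values, not real x-vectors (which are much longer), and the helper `cosine_similarity` is a hypothetical name for this illustration, not part of ```senselab```.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" with invented values, for illustration only
speaker_a_take1 = [0.9, 0.1, 0.3, 0.0]
speaker_a_take2 = [0.8, 0.2, 0.4, 0.1]
speaker_b = [0.1, 0.9, 0.0, 0.5]

same = cosine_similarity(speaker_a_take1, speaker_a_take2)
different = cosine_similarity(speaker_a_take1, speaker_b)
print(f"same speaker: {same:.2f}, different speaker: {different:.2f}")
```

In this toy setup, the two takes attributed to the same speaker score closer to 1 than the pair from different speakers, which is the intuition behind using an embedding to tell a model which voice to imitate.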

In [4]:
import torch
from datasets import load_dataset

embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Initialize the model
hf_model = HFModel(path_or_uri="microsoft/speecht5_tts", revision="main")
# Write the text to be synthesized
texts = ["Hello, world!"]
# Call the text-to-speech function
audios = synthesize_texts(texts=texts, model=hf_model, forward_params={"speaker_embeddings": speaker_embedding})

# Play the synthesized audio
play_audio(audios[0])


Device set to use cpu



## Even more examples
Let's now try the advanced ```Mars5-tts``` model.

```Mars5-tts``` requires two inputs:
1. A list of pieces of text you want to synthesize.
2. Target voices you want to clone, along with their respective transcripts.

Although transcripts are not strictly necessary for the model to function, our initial tests show that they significantly improve the model's quality. For this reason, we have made transcripts mandatory in our interface in ```senselab```.
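The pairing of inputs described above can be sketched as a small validation step. This is only an illustration: `VoiceTarget` and `pair_texts_with_targets` are hypothetical names, and the one-target-per-text rule is an assumption of this sketch rather than documented ```senselab``` behavior.

```python
from dataclasses import dataclass


@dataclass
class VoiceTarget:
    """Hypothetical container: a reference recording plus its transcript."""

    audio_path: str
    transcript: str


def pair_texts_with_targets(texts: list[str], targets: list[VoiceTarget]) -> list[tuple[str, VoiceTarget]]:
    """Pair each text with one target voice and reject empty transcripts."""
    # Assumption of this sketch: exactly one target per text
    if len(texts) != len(targets):
        raise ValueError("expected one target per text")
    for index, target in enumerate(targets):
        if not target.transcript.strip():
            raise ValueError(f"target {index} has an empty transcript")
    return list(zip(texts, targets))


pairs = pair_texts_with_targets(
    ["Hello, world. It's nice to meet you."],
    [VoiceTarget(audio_path="tutorial_audio_files/audio_48khz_mono_16bits.wav", transcript="This is Peter.")],
)
print(pairs[0][1].transcript)
```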

### Setup
First, let's import the necessary libraries and the function we'll be using.

In [5]:
from senselab.audio.data_structures import Audio
from senselab.audio.tasks.plotting.plotting import play_audio
from senselab.audio.tasks.preprocessing import downmix_audios_to_mono, extract_segments, resample_audios
from senselab.audio.tasks.text_to_speech import synthesize_texts
from senselab.utils.data_structures import DeviceType, Language, TorchModel



### Specifying the TTS model, the language and the preferred device
Let's initialize the model we want to use (remember to specify both the ```path_or_uri``` and the ```revision``` for reproducibility purposes), the language of the text we want to synthesize, and the device we prefer. In this tutorial, we are going to use [```mars5```](https://github.com/Camb-ai/MARS5-TTS), which only works for English.

In [6]:
model = TorchModel(path_or_uri="Camb-ai/mars5-tts", revision="master")
language = Language(language_code="en")
device = DeviceType.CPU


### Loading Target Audio File
Now let's load and process the audio file that contains the voice we want to target as part of our text-to-speech process. We keep only the first second of audio, since that segment contains a single speaker.
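Conceptually, keeping a time segment means converting start and end times into sample indices. The sketch below shows that arithmetic on a plain Python list; `extract_segment` is a hypothetical helper written for this illustration and is not the ```senselab``` implementation of `extract_segments`.

```python
def extract_segment(samples: list[float], sampling_rate: int, start_s: float, end_s: float) -> list[float]:
    """Return the samples between start_s and end_s (both in seconds)."""
    if not 0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    start = int(round(start_s * sampling_rate))
    # Clamp the end index so that a segment running past the file is truncated
    end = min(int(round(end_s * sampling_rate)), len(samples))
    return samples[start:end]


sampling_rate = 48_000
samples = [0.0] * (sampling_rate * 3)  # stand-in for 3 seconds of audio
first_second = extract_segment(samples, sampling_rate, 0.0, 1.0)
print(len(first_second))  # 48000
```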

In [9]:
!mkdir -p tutorial_audio_files
!wget -O tutorial_audio_files/audio_48khz_mono_16bits.wav https://github.com/sensein/senselab/raw/main/src/tests/data_for_testing/audio_48khz_mono_16bits.wav

import os

audio = Audio(filepath=os.path.abspath("tutorial_audio_files/audio_48khz_mono_16bits.wav"))
ground_truth = "This is Peter."
audio = extract_segments([(audio, [(0.0, 1.0)])])[0][0]



### Preprocessing
Let's preprocess the audio so that it matches the input characteristics of the TTS model, which are listed in its model card on the Hugging Face Hub. In particular, our example model needs the audio to be sampled at 24 kHz.
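As a rough intuition for these two preprocessing steps, here is a minimal pure-Python sketch. It averages channels to downmix and uses naive linear interpolation to resample, without the anti-aliasing filter that a proper resampler applies. Both helpers are hypothetical illustrations and not how ```senselab``` implements `downmix_audios_to_mono` or `resample_audios`.

```python
def downmix_to_mono(channels: list[list[float]]) -> list[float]:
    """Average equally long channels, sample by sample."""
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def resample_linear(samples: list[float], orig_sr: int, target_sr: int) -> list[float]:
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if not samples:
        return []
    n_out = int(len(samples) * target_sr / orig_sr)
    ratio = orig_sr / target_sr
    resampled = []
    for i in range(n_out):
        position = i * ratio
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        fraction = position - left
        resampled.append(samples[left] * (1 - fraction) + samples[right] * fraction)
    return resampled


# Two toy channels, one second each at 48 kHz
stereo = [[0.0, 1.0, 0.0, -1.0] * 12_000, [0.0, 0.5, 0.0, -0.5] * 12_000]
mono = downmix_to_mono(stereo)
resampled = resample_linear(mono, 48_000, 24_000)
print(len(mono), len(resampled))  # 48000 24000
```

For real audio, prefer the library functions used in the next cell; the sketch only illustrates why one second of 48 kHz audio becomes 24,000 samples at 24 kHz.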

In [10]:
audio = downmix_audios_to_mono([audio])[0]
audio = resample_audios([audio], 24000)[0]

And here is how our target audio sounds.

In [11]:
play_audio(audio)
print("Ground truth:", ground_truth)

Ground truth: This is Peter.


### Synthesis
Let's finally synthesize the audio.

Note: You can pass additional parameters to customize the process. For more details, see the [**dedicated documentation**](https://sensein.group/senselab/senselab/audio/tasks/text_to_speech.html).

In [12]:
res = synthesize_texts(
    texts=["Hello, world. It's nice to meet you."],
    targets=[(audio, ground_truth)],
    model=model,
    language=language,
)

Downloading: "https://github.com/Camb-ai/mars5-tts/zipball/master" to /Users/fabiocat/.cache/torch/hub/master.zip


100%|██████████| 1.42G/1.42G [00:31<00:00, 47.9MB/s]
100%|██████████| 863M/863M [00:16<00:00, 55.4MB/s] 
  WeightNorm.apply(module, name, dim)


Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /Users/fabiocat/.cache/torch/hub/checkpoints/encodec_24khz-d7cc33bc.th


100%|██████████| 88.9M/88.9M [00:02<00:00, 34.4MB/s]


config.yaml:   0%|          | 0.00/503 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/40.4M [00:00<?, ?B/s]

2025-09-15 19:00:53,942 - senselab - INFO - Time taken to initialize the Mars5-TTS model: 63.88 seconds


Note: using deep clone. Assuming input `c_phones` is concatenated prompt and output phones. Also assuming no padded indices in `c_codes`.
New x: torch.Size([1, 561, 8]) | new x_known: torch.Size([1, 561, 8]) . Base prompt: torch.Size([1, 75, 8]). New padding mask: torch.Size([1, 561]) | m shape: torch.Size([1, 561, 8])


2025-09-15 19:04:53,418 - senselab - INFO - Time taken for synthesizing audios: 239.47 seconds


And here is the output audio of our tutorial.

In [13]:
play_audio(res[0])
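
If you want to keep a synthesized waveform on disk, one option is to write the samples to a WAV file. The sketch below uses only the Python standard library and a generated sine tone as a stand-in for the model output. It assumes mono float samples in the range [-1, 1] at 24 kHz, and it does not use or describe any saving utility from ```senselab```; `write_wav_mono16` is a hypothetical helper.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_mono16(path: str, samples: list[float], sampling_rate: int) -> None:
    """Write float samples in [-1, 1] as 16-bit mono PCM."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


sampling_rate = 24_000
# One second of a 440 Hz tone, standing in for synthesized speech
tone = [0.3 * math.sin(2 * math.pi * 440 * n / sampling_rate) for n in range(sampling_rate)]
out_path = os.path.join(tempfile.gettempdir(), "tts_sketch.wav")
write_wav_mono16(out_path, tone, sampling_rate)

# Read the header back to confirm what was written
with wave.open(out_path, "rb") as wav_file:
    print(wav_file.getframerate(), wav_file.getnframes())
```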

## Let's experiment with Coqui-tts models
Here is plain TTS, without a target voice:

In [14]:
# Model: xtts_v2 (tts_models/multilingual/multi-dataset/xtts_v2)
# More models here: https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/.models.json

# Import the audio player
from senselab.audio.tasks.plotting.plotting import play_audio

# Import the text-to-speech function
from senselab.audio.tasks.text_to_speech import synthesize_texts

# Import the Coqui model wrapper and the language
from senselab.utils.data_structures import CoquiTTSModel, Language

# Initialize the model
coqui_model = CoquiTTSModel(path_or_uri="tts_models/multilingual/multi-dataset/xtts_v2", revision="main")
# Write the text to be synthesized
texts = ["Hello world"]
# Call the text-to-speech function
audios = synthesize_texts(texts=texts, model=coqui_model, language=Language(language_code="en"))

# Play the synthesized audio
play_audio(audios[0])

100%|██████████| 1.87G/1.87G [00:41<00:00, 44.8MiB/s]
4.37kiB [00:00, 24.7kiB/s]
361kiB [00:00, 2.16MiB/s]
100%|██████████| 32.0/32.0 [00:00<00:00, 111iB/s]
100%|██████████| 7.75M/7.75M [00:13<00:00, 22.4MiB/s]

And here is an example of TTS with a target voice:

In [15]:
# Model: xtts_v2 (tts_models/multilingual/multi-dataset/xtts_v2)
# More models here: https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/.models.json

# Download the audio file for the tutorial
!mkdir -p tutorial_audio_files
!wget -O tutorial_audio_files/audio_48khz_mono_16bits.wav https://github.com/sensein/senselab/raw/main/src/tests/data_for_testing/audio_48khz_mono_16bits.wav

# Import the standard library module used to build the file path
import os

# Import the audio data structure
from senselab.audio.data_structures import Audio

# Import the audio player
from senselab.audio.tasks.plotting.plotting import play_audio

# Import the audio preprocessing functions
from senselab.audio.tasks.preprocessing import downmix_audios_to_mono, extract_segments, resample_audios

# Import the text-to-speech function
from senselab.audio.tasks.text_to_speech import synthesize_texts

# Import the Coqui model wrapper and the language
from senselab.utils.data_structures import CoquiTTSModel, Language

# Initialize the model
coqui_model = CoquiTTSModel(path_or_uri="tts_models/multilingual/multi-dataset/xtts_v2", revision="main")
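# Write the text to be synthesized
texts = ["Hello world"]
# Load the target voice and keep its first second, which contains a single speaker
audio = Audio(filepath=os.path.abspath("tutorial_audio_files/audio_48khz_mono_16bits.wav"))
# Transcript of the target audio (not passed to synthesize_texts in this example)
ground_truth = "This is Peter."
audio = extract_segments([(audio, [(0.0, 1.0)])])[0][0]
# Downmix to mono and resample to 24 kHz
audio = downmix_audios_to_mono([audio])[0]
audio = resample_audios([audio], 24000)[0]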

audios = synthesize_texts(texts=texts, targets=[audio], model=coqui_model, language=Language(language_code="en"))

# Play the synthesized audio
play_audio(audios[0])

