# Text to speech

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensein/senselab/blob/main/tutorials/audio/text_to_speech.ipynb)

This tutorial shows how to use the `synthesize_texts` function to convert a list of texts into synthesized speech audio.

## Quick start
We will start with some Hugging Face models.

The first example uses ```facebook/mms-tts-eng```, which only requires the list of texts that you want to synthesize.

In [1]:
%pip install senselab

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Model: facebook/mms-tts-eng (https://huggingface.co/facebook/mms-tts-eng)

# Import the audio player
from senselab.audio.tasks.plotting.plotting import play_audio

# Import the text-to-speech function
from senselab.audio.tasks.text_to_speech import synthesize_texts

# Import the Hugging Face model wrapper
from senselab.utils.data_structures import HFModel

%matplotlib inline

# Initialize the model
hf_model = HFModel(path_or_uri="facebook/mms-tts-eng", revision="main")
# Write the text to be synthesized
texts = ["Hello world"]
# Call the text-to-speech function
audios = synthesize_texts(texts=texts, model=hf_model)

# Play the synthesized audio
play_audio(audios[0])


Device set to use cpu


## More examples
Here is ```suno/bark-small``` (https://huggingface.co/suno/bark-small). As before, the only required input is the list of texts to synthesize.

In [3]:
# Model: suno/bark-small (https://huggingface.co/suno/bark-small)

# Import the audio player
from senselab.audio.tasks.plotting.plotting import play_audio

# Import the text-to-speech function
from senselab.audio.tasks.text_to_speech import synthesize_texts

# Import the Hugging Face model wrapper
from senselab.utils.data_structures import HFModel

# Initialize the model
hf_model = HFModel(path_or_uri="suno/bark-small", revision="main")
# Write the text to be synthesized
texts = ["Hello world"]
# Call the text-to-speech function
audios = synthesize_texts(texts=texts, model=hf_model)

# Play the synthesized audio
play_audio(audios[0])


Device set to use cpu
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:10000 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Let's proceed with ```microsoft/speecht5_tts``` (https://huggingface.co/microsoft/speecht5_tts). In addition to the list of texts to synthesize, this model requires the speaker embedding of the voice we want to clone. Speaker embeddings are numerical vectors that describe the characteristics of someone's voice; to learn more about extracting them with Senselab, please refer to the [dedicated documentation](https://sensein.group/senselab/senselab/audio/tasks/speaker_embeddings.html). Details about ```microsoft/speecht5_tts``` can be found in the model card. In our example, we use a speaker embedding from the ```Matthijs/cmu-arctic-xvectors``` dataset.

In [4]:
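# HFModel, synthesize_texts, and play_audio were imported in the cells above
import torch
from datasets import load_dataset

# Load the speaker embeddings and select one, adding a leading batch dimension
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)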

# Initialize the model
hf_model = HFModel(path_or_uri="microsoft/speecht5_tts", revision="main")
# Write the text to be synthesized
texts = ["Hello, world!"]
# Call the text-to-speech function
audios = synthesize_texts(texts=texts, model=hf_model, forward_params={"speaker_embeddings": speaker_embedding})

# Play the synthesized audio
play_audio(audios[0])


Device set to use cpu
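
To build some intuition for the speaker embeddings used above, the following is a minimal sketch of how two embedding vectors are commonly compared, using cosine similarity. It relies only on the Python standard library and is not part of Senselab; the function name `cosine_similarity` and the 4-dimensional vectors are made up for illustration (real speaker embeddings have many more dimensions).

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical numbers, not real x-vectors)
speaker_a_clip1 = [0.9, 0.1, 0.3, 0.0]
speaker_a_clip2 = [0.8, 0.2, 0.4, 0.1]
speaker_b_clip1 = [0.1, 0.9, 0.0, 0.5]

same_speaker = cosine_similarity(speaker_a_clip1, speaker_a_clip2)
different_speaker = cosine_similarity(speaker_a_clip1, speaker_b_clip1)

print(f"same speaker:      {same_speaker:.3f}")
print(f"different speaker: {different_speaker:.3f}")
```

In this toy setup, the two clips attributed to the same speaker score higher than the pair from different speakers, which is the property that makes embeddings useful for describing a voice.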



## Let's experiment with Coqui TTS models
Here is plain TTS, with no target voice:

In [14]:
# Model: xtts_v2 (tts_models/multilingual/multi-dataset/xtts_v2)
# More models here: https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/.models.json

# Import the audio player
from senselab.audio.tasks.plotting.plotting import play_audio

# Import the text-to-speech function
from senselab.audio.tasks.text_to_speech import synthesize_texts

# Import the Coqui model wrapper and the language data structure
from senselab.utils.data_structures import CoquiTTSModel, Language

# Initialize the model
coqui_model = CoquiTTSModel(path_or_uri="tts_models/multilingual/multi-dataset/xtts_v2", revision="main")
# Write the text to be synthesized
texts = ["Hello world"]
# Call the text-to-speech function
audios = synthesize_texts(texts=texts, model=coqui_model, language=Language(language_code="en"))

# Play the synthesized audio
play_audio(audios[0])


And here is an example of TTS with a target voice, where a short reference recording is passed through `targets`:

In [None]:
# Model: xtts_v2 (tts_models/multilingual/multi-dataset/xtts_v2)
# More models here: https://github.com/idiap/coqui-ai-TTS/blob/dev/TTS/.models.json

# Download the audio file for the tutorial
!mkdir -p tutorial_audio_files
!wget -O tutorial_audio_files/audio_48khz_mono_16bits.wav https://github.com/sensein/senselab/raw/main/src/tests/data_for_testing/audio_48khz_mono_16bits.wav

import os

# Import the audio data structure
from senselab.audio.data_structures import Audio

# Import the audio player
from senselab.audio.tasks.plotting.plotting import play_audio

# Import the audio preprocessing functions
from senselab.audio.tasks.preprocessing import downmix_audios_to_mono, extract_segments, resample_audios

# Import the text-to-speech function
from senselab.audio.tasks.text_to_speech import synthesize_texts

# Import the Coqui model wrapper and the language data structure
from senselab.utils.data_structures import CoquiTTSModel, Language

# Initialize the model
coqui_model = CoquiTTSModel(path_or_uri="tts_models/multilingual/multi-dataset/xtts_v2", revision="main")
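# Write the text to be synthesized
texts = ["Hello world"]
# Load the reference recording of the target voice
audio = Audio(filepath=os.path.abspath("tutorial_audio_files/audio_48khz_mono_16bits.wav"))
# Note: ground_truth is not used by the synthesis call below
ground_truth = "This is Peter."
# Keep only the first second, downmix to mono, and resample to 24 kHz
audio = extract_segments([(audio, [(0.0, 1.0)])])[0][0]
audio = downmix_audios_to_mono([audio])[0]
audio = resample_audios([audio], 24000)[0]

# Call the text-to-speech function with the target voice
audios = synthesize_texts(texts=texts, targets=[audio], model=coqui_model, language=Language(language_code="en"))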

# Play the synthesized audio
play_audio(audios[0])
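
The target-voice cell trims the reference recording to its first second, downmixes it to mono, and resamples it to 24,000 Hz. Before preprocessing your own recordings, it can help to check what a WAV file actually contains. The following is a minimal sketch that uses only Python's standard `wave` module rather than Senselab; the helper names `write_sine_wav` and `describe_wav` are hypothetical, and a synthetic tone stands in for a real recording.

```python
import math
import os
import struct
import tempfile
import wave


def write_sine_wav(path, sample_rate=48000, seconds=1.0, freq_hz=440.0):
    """Write a mono, 16-bit PCM sine tone so the example has a file to inspect."""
    n_frames = int(sample_rate * seconds)
    samples = (
        int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"".join(struct.pack("<h", s) for s in samples))


def describe_wav(path):
    """Read basic properties from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        n_frames = wav_file.getnframes()
        sample_rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "bit_depth": 8 * wav_file.getsampwidth(),
            "sample_rate": sample_rate,
            "duration_s": n_frames / sample_rate,
        }


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_48khz_mono_16bits.wav")
    write_sine_wav(demo_path)
    info = describe_wav(demo_path)

print(info)

# Number of samples in a 1-second segment once it is resampled to 24 kHz
samples_after_resampling = int(1.0 * 24000)
print(samples_after_resampling)
```

Pointing `describe_wav` at your own reference file would tell you whether downmixing or resampling will change anything before you pass the clip as a target voice.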

