<a href="https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/mms-tts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## MMS TTS in 🤗 Transformers

The [Massively Multilingual Speech (MMS)](https://ai.meta.com/blog/multilingual-model-speech-recognition/) project from Meta AI is an endeavour to expand text-to-speech (TTS), speech-to-text (STT),
and language identification (LID) technology from around 100 languages to over 1,100, more than 10 times as many as before. The diagram below highlights the vast coverage the MMS TTS, STT and LID models achieve globally:

<p align="center">
  <img src="https://scontent-lhr8-2.xx.fbcdn.net/v/t39.8562-6/346646031_1369649176946105_149369221346745594_n.png?_nc_cat=104&ccb=1-7&_nc_sid=6825c5&_nc_ohc=3w0ZyKVY4eYAX8OK5lg&_nc_ht=scontent-lhr8-2.xx&oh=00_AfCwouWD9Bdcj5R7-FX16UUA3wMzPCcl78aPKKTC0Cy0sg&oe=64F6D356" width="600"/>
</p>

**Image source:** [MMS Blog Post](https://ai.meta.com/blog/multilingual-model-speech-recognition/)

For the first time, many low-resource languages have an effective model checkpoint that can be used for TTS, unlocking applications
that were previously reserved for high-resource languages. In this Colab, we'll go through a few examples of how to use the MMS TTS model in 🤗 Transformers to generate
speech in over 1,100 languages. For details on using the STT or LID models, see the corresponding [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/mms).

## Set-up

Let’s make sure we’re connected to a GPU to run this notebook. To get a GPU, click `Runtime` -> `Change runtime type`, then change `Hardware accelerator` from `None` to `T4 GPU`. Next, click the `Connect T4` button in the top-right hand pane of the Colab, which will assign us a GPU and connect us directly.

We can verify that we’ve been assigned a GPU and view its specifications through the `nvidia-smi` command:

In [1]:
!nvidia-smi

Mon Sep  4 10:25:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

We see here that we've got one Tesla T4 16GB GPU, although this may vary for you depending on GPU availability and Colab GPU assignment.

The STT and LID models from MMS were added to 🤗 Transformers in release 4.31. The TTS checkpoints are now available on the main
branch of 🤗 Transformers, allowing users to run inference with just three lines of code.

We'll install 🤗 Transformers from the main branch, as well as the 🤗 Accelerate package:

In [2]:
!pip install --quiet --upgrade git+https://github.com/huggingface/transformers.git accelerate

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Roman Alphabet

MMS-TTS uses the same model architecture as VITS. MMS trains a separate
model checkpoint for each of the 1100+ languages in the project. All available checkpoints can be found on the Hugging
Face Hub: [facebook/mms-tts](https://huggingface.co/models?sort=trending&search=facebook%2Fmms-tts), and the inference
documentation under [MMS-TTS](https://huggingface.co/docs/transformers/main/en/model_doc/mms#speech-synthesis-tts).

For languages with a Roman alphabet, such as English, Spanish or French, the tokenizer can be used directly to pre-process the text inputs.
Let's go through an example of using the MMS-TTS Spanish checkpoint [facebook/mms-tts-spa](https://huggingface.co/facebook/mms-tts-spa) for
speech synthesis.
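
The checkpoints used in this notebook (`facebook/mms-tts-spa`, `facebook/mms-tts-kor` and `facebook/mms-tts-bis`) share a naming pattern: the prefix `facebook/mms-tts-` followed by a language code. As a minimal sketch, assuming this pattern also holds for the other languages, the hypothetical helper below (`mms_tts_checkpoint` is not part of 🤗 Transformers) builds a model id from a code. Whether a checkpoint actually exists for a given code should still be confirmed on the Hub.

```python
def mms_tts_checkpoint(language_code):
    """Build a Hub model id from a language code (hypothetical helper)."""
    code = language_code.strip()
    if not code:
        raise ValueError("language_code must not be empty")
    # Assumed naming pattern, inferred from the checkpoints used in this notebook
    return f"facebook/mms-tts-{code}"


print(mms_tts_checkpoint("spa"))
```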

First, we'll load the tokenizer and model checkpoint using [`.from_pretrained`](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.from_pretrained):

In [3]:
from transformers import VitsTokenizer, VitsModel

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-spa")
model = VitsModel.from_pretrained("facebook/mms-tts-spa")



We can move the model to the GPU (if available) and set it to evaluation mode to disable dropout:

In [4]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model.to(device).eval();

We can then pre-process our text input by converting it from a string to token ids:

In [5]:
inputs = tokenizer("Hola - Hugging Face está al teléfono", return_tensors="pt")

The token ids can be passed to the model to generate the output waveform. Since VITS uses a flow-based model that is non-deterministic, it is good practice to set a seed to ensure reproducibility of
the outputs:

In [6]:
from transformers import set_seed

set_seed(555)

with torch.no_grad():
    outputs = model(**inputs.to(device))
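
To illustrate why the seed matters, here is a toy sketch that uses only Python's standard `random` module. The function `noisy_synthesis` is a hypothetical stand-in for any model that draws random samples during inference; it is not the VITS forward pass. With the same seed the samples are identical, and with a different seed they are not.

```python
import random


def noisy_synthesis(num_samples, seed=None):
    """Toy stand-in for a stochastic model: returns Gaussian noise samples."""
    rng = random.Random(seed)  # seed=None falls back to a non-reproducible source
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


first = noisy_synthesis(4, seed=555)
second = noisy_synthesis(4, seed=555)
other = noisy_synthesis(4, seed=556)

print(first == second)  # compares two runs with the same seed
print(first == other)   # compares runs with different seeds
```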

The generated waveform can be accessed through the `.waveform` attribute of the model outputs:

In [7]:
waveform = outputs.waveform[0]
waveform = waveform.cpu().float().numpy()

The resulting waveform can be saved as a `.wav` file:

In [8]:
import scipy.io.wavfile

# Write the float waveform to disk at the model's sampling rate
scipy.io.wavfile.write("synthesized_speech.wav", rate=model.config.sampling_rate, data=waveform)

Or displayed in a Jupyter Notebook / Google Colab:

In [9]:
from IPython.display import Audio

Audio(waveform, rate=model.config.sampling_rate)

Nice! The synthesised speech sounds as expected 🇪🇸
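
A note on the saved file: when given floating-point data, `scipy.io.wavfile.write` stores the samples as 32-bit float WAV, which some audio players do not handle. Below is a minimal sketch of writing 16-bit PCM instead, using only the standard library. `write_pcm16_wav` is a hypothetical helper, and it assumes the waveform is a sequence of floats nominally in [-1, 1] (values outside that range are clipped). The sine tone stands in for the model output, and the 16 kHz sampling rate is an assumption for the example; in the notebook, read it from `model.config.sampling_rate`.

```python
import array
import math
import os
import sys
import tempfile
import wave


def write_pcm16_wav(path, samples, sampling_rate):
    """Write an iterable of floats in [-1, 1] as a mono 16-bit PCM WAV file."""
    pcm = array.array("h")  # signed 16-bit integers
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())


sampling_rate = 16000  # assumed value for this example
# One second of a 440 Hz tone as a stand-in for the synthesized waveform
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sampling_rate) for n in range(sampling_rate)]
out_path = os.path.join(tempfile.gettempdir(), "tone_pcm16.wav")
write_pcm16_wav(out_path, tone, sampling_rate)
```

With the notebook's NumPy `waveform`, the call would be `write_pcm16_wav("synthesized_speech.wav", waveform, model.config.sampling_rate)`.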

## Non-Roman Alphabet

For certain languages with non-Roman alphabets, such as Mandarin, Hindi, or Korean, the [`uroman`](https://github.com/isi-nlp/uroman)
Perl package is required to convert the text inputs to the Roman alphabet. We can check whether we require the `uroman` package
for a given language by inspecting the `is_uroman` attribute of the pre-trained `tokenizer`.

Let's take the example of using the Korean MMS-TTS checkpoint [facebook/mms-tts-kor](https://huggingface.co/facebook/mms-tts-kor).
We'll first load the tokenizer from pre-trained, and then check the corresponding attribute `is_uroman`:

In [10]:
from transformers import VitsTokenizer

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-kor")
print(f"Requires uroman: {tokenizer.is_uroman}")

Requires uroman: True


Alright! We've confirmed that we require the uroman package to use this language. That means we should apply the uroman package to our text
inputs **prior** to passing them to the `VitsTokenizer`, since currently the tokenizer does not support performing the pre-processing
itself.
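
One way to keep this step from being forgotten is to wrap it in a small helper that romanizes only when the tokenizer asks for it. The following is a minimal sketch: `prepare_text` is a hypothetical helper, and `romanize` is any callable that maps a string to its romanized form (for example, a wrapper around uroman). The `SimpleNamespace` objects and the toy romanizer merely stand in for a real tokenizer and for uroman so that the example runs on its own.

```python
from types import SimpleNamespace


def prepare_text(text, tokenizer, romanize):
    """Romanize `text` only if the tokenizer reports that it needs uroman."""
    if getattr(tokenizer, "is_uroman", False):
        return romanize(text)
    return text


def toy_romanize(text):
    """Illustrative stand-in only; real code would call uroman here."""
    return text.replace("안녕", "annyeong")


# Stand-ins for tokenizers loaded with VitsTokenizer.from_pretrained
needs_uroman = SimpleNamespace(is_uroman=True)
plain_roman = SimpleNamespace(is_uroman=False)

print(prepare_text("안녕", needs_uroman, toy_romanize))
print(prepare_text("hola", plain_roman, toy_romanize))
```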

To enable `git clone` on Colab, we first have to set the IPython encoding to UTF-8 using the small hack below:

In [11]:
import locale

locale.getpreferredencoding = lambda: "UTF-8"

We can then `git clone` the uroman repository to our device:

In [12]:
!git clone https://github.com/isi-nlp/uroman.git

fatal: destination path 'uroman' already exists and is not an empty directory.
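
The `fatal: destination path 'uroman' already exists` message above simply means the repository had already been cloned by an earlier run of the cell, so it is harmless here. A minimal sketch of a guard that skips the clone when the directory is present is given below. `clone_if_missing` is a hypothetical helper, and it assumes `git` is available on the `PATH`.

```python
import os
import subprocess
import tempfile


def clone_if_missing(repo_url, destination):
    """Clone `repo_url` into `destination` unless that directory already exists.

    Returns True if a clone was performed, False if it was skipped.
    """
    if os.path.isdir(destination):
        return False
    subprocess.run(["git", "clone", repo_url, destination], check=True)
    return True


# Demonstrate the skip branch with a directory that certainly exists,
# so the example runs without network access.
with tempfile.TemporaryDirectory() as existing_dir:
    print(clone_if_missing("https://github.com/isi-nlp/uroman.git", existing_dir))
```

In the notebook, the call would be `clone_if_missing("https://github.com/isi-nlp/uroman.git", "uroman")`.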


We can then define a Python function to call the uroman package on our text input. This function takes two arguments:
1. `input_string`: input text that we want to uromanize
2. `uroman_path`: path to the uroman package that we git cloned

It then calls the uroman package on our text inputs, and returns the uromanized result:

In [13]:
import os
import subprocess


def uromanize(input_string, uroman_path):
    script_path = os.path.join(uroman_path, "bin", "uroman.pl")

    command = ["perl", script_path]

    process = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Execute the perl command
    stdout, stderr = process.communicate(input=input_string.encode())

    if process.returncode != 0:
        raise ValueError(f"Error {process.returncode}: {stderr.decode()}")

    # Return the output as a string and skip the new-line character at the end
    return stdout.decode()[:-1]
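
Each call to `uromanize` starts a new Perl process, which adds up when many sentences need to be processed. Below is a sketch of batching, under the assumption (not verified here) that `uroman.pl` reads standard input line by line and writes one romanized line per input line. `run_line_filter` is a hypothetical helper that sends several lines through a single subprocess. So that the example runs without Perl or uroman, it is demonstrated with a small Python command that upper-cases each line; for uroman, the command would be `["perl", script_path]` as in the function above.

```python
import subprocess
import sys


def run_line_filter(command, lines):
    """Pipe `lines` through a line-oriented command and return its output lines."""
    if not lines:
        return []
    if any("\n" in line for line in lines):
        raise ValueError("Each input must be a single line")
    result = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    output_lines = result.stdout.splitlines()
    if len(output_lines) != len(lines):
        raise ValueError("Command did not return one line per input line")
    return output_lines


# Stand-in command for illustration: upper-case every line read from stdin
upper_case = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]
print(run_line_filter(upper_case, ["hola", "buenos dias"]))
```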

Cool! Let's see how we can use the `uromanize` function to pre-process our text inputs. First, we'll load the tokenizer and model, and place the model on the correct device:

In [14]:
tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-kor")
model = VitsModel.from_pretrained("facebook/mms-tts-kor")

model.to(device).eval();

We can now uromanize our text inputs and convert them to token ids:

In [15]:
text = "이봐 무슨 일이야"
uromanized_text = uromanize(text, uroman_path="./uroman")

inputs = tokenizer(uromanized_text, return_tensors="pt")

Running the forward pass is then the same as before:

In [16]:
set_seed(555)

with torch.no_grad():
    outputs = model(**inputs.to(device))

Here's the generated waveform:

In [17]:
waveform = outputs.waveform[0]
waveform = waveform.cpu().numpy()

Audio(waveform, rate=model.config.sampling_rate)

Great! We have a working pipeline for TTS in Korean, or more generally for any language with a non-Roman script that requires the uroman package.

## Pipeline Usage

The MMS-TTS checkpoints are also compatible with the [text-to-audio pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextToAudioPipeline) class. This pipeline provides a high-level helper function to run inference in just three lines of code.

To use it, we first create an instance of the pipeline with a given model id. In this example, we'll load the checkpoint for [Bislama](https://en.wikipedia.org/wiki/Bislama) (`bis`), an English-based creole language and one of the official languages of [Vanuatu](https://en.wikipedia.org/wiki/Vanuatu):

In [18]:
from transformers import pipeline

pipe = pipeline("text-to-audio", model="facebook/mms-tts-bis", device=device)

Running inference is as easy as passing the input text to the pipeline:

In [19]:
outputs = pipe("Hey, it's Hugging Face on the phone")

Audio(outputs["audio"], rate=outputs["sampling_rate"])