<a href="https://colab.research.google.com/github/ylacombe/explanatory_notebooks/blob/main/seamless_m4t_hugging_face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SeamlessM4T 🤗

## Goal of this notebook

This notebook will teach you how to easily use [SeamlessM4T](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel), a foundational multimodal model for speech translation, with 🤗 Transformers.

## Resources

1. [SeamlessM4T docs in 🤗 Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel)
2. [Demo on 🤗 Spaces](https://huggingface.co/spaces/facebook/seamless_m4t)
3. Model cards: [medium](https://huggingface.co/facebook/hf-seamless-m4t-medium) and [large](https://huggingface.co/facebook/hf-seamless-m4t-large).
4. [Original repository](https://github.com/facebookresearch/seamless_communication)

## Presentation of the model

SeamlessM4T was proposed in [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team from Meta AI. It is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T enables multiple tasks without relying on separate models:

- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)

SeamlessM4T features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as "unit tokens," from the translated text.

Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the [HiFi-GAN](https://arxiv.org/abs/2010.05646) architecture is placed on top of the second seq2seq model.

Here's how the generation process works:

- Input text or speech is processed through its specific encoder.
- A decoder creates text tokens in the desired language.
- If speech generation is required, the second seq2seq model, following a standard encoder-decoder structure, generates unit tokens.
- These unit tokens are then passed through the final vocoder to produce the actual speech.
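
The control flow described above can be sketched with plain Python stand-ins. Every function below (`encode`, `decode_text`, `text_to_units`, `vocode`, `translate`) is a hypothetical placeholder that only mimics the order in which data moves through the pipeline; none of them is part of 🤗 Transformers, and the toy arithmetic has nothing to do with what the real components compute.

```python
from typing import List, Optional, Tuple


def encode(inputs: str, modality: str) -> List[int]:
    # Stand-in for the modality-specific encoder (text or speech).
    if modality not in ("text", "speech"):
        raise ValueError(f"unknown modality: {modality}")
    return [ord(c) for c in inputs]


def decode_text(hidden: List[int], tgt_lang: str) -> List[str]:
    # Stand-in for the text decoder: emits "text tokens" in the target language.
    return [f"{tgt_lang}:{h}" for h in hidden]


def text_to_units(text_tokens: List[str]) -> List[int]:
    # Stand-in for the second seq2seq model: text tokens -> unit tokens.
    return list(range(len(text_tokens)))


def vocode(units: List[int]) -> List[float]:
    # Stand-in for the vocoder: unit tokens -> waveform samples.
    return [u / 100.0 for u in units]


def translate(
    inputs: str, modality: str, tgt_lang: str, generate_speech: bool = True
) -> Tuple[List[str], Optional[List[float]]]:
    hidden = encode(inputs, modality)            # step 1: dedicated encoder
    text_tokens = decode_text(hidden, tgt_lang)  # step 2: text decoder
    if not generate_speech:
        return text_tokens, None                 # text-only output stops here
    units = text_to_units(text_tokens)           # step 3: second seq2seq model
    return text_tokens, vocode(units)            # step 4: vocoder
```

The point of the sketch is that translated text is always produced first, and speech is an optional second stage built on top of it.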

## Prepare the Environment

Let’s make sure we’re connected to a GPU to run this notebook. To get a GPU, click Runtime -> Change runtime type, then change Hardware accelerator from None to GPU. We can verify that we’ve been assigned a GPU and view its specifications through the nvidia-smi command:

In [None]:
!nvidia-smi

Wed Oct 25 10:04:49 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Next, we install the 🤗 Transformers package from the main branch and the sentencepiece package:

In [None]:
!pip install --quiet git+https://github.com/huggingface/transformers sentencepiece

## Preprocessing

### Load the model

The pre-trained SeamlessM4T checkpoints (medium and large) can be loaded from the 🤗 Hugging Face Hub. Change the repo id to select the checkpoint size you wish to use.

We'll default to the medium checkpoint for faster inference. To use the large checkpoint, replace `"facebook/hf-seamless-m4t-medium"` with `"facebook/hf-seamless-m4t-large"`.


In [None]:
from transformers import SeamlessM4TModel

model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")



Downloading (…)lve/main/config.json:   0%|          | 0.00/2.56k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/4.84G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/5.35k [00:00<?, ?B/s]

Move the model to an accelerator device if one is available.

In [None]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

### Load the Processor

Next, load `SeamlessM4TProcessor` to pre-process the inputs. The Transformers package has a wide range of processors, so we'll use the `AutoProcessor` class, which can recognize which processor to load from the repository id.

The processor's role here is twofold:
1. It prepares the inputs. It tokenizes the input text, i.e. cuts it into small pieces that the model can understand, and transforms the audio into a format more suitable for the model.
2. It processes the model outputs. Here, it is used to "detokenize" the output, i.e. to perform the opposite operation to the one described above.

In [None]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")


Downloading (…)rocessor_config.json:   0%|          | 0.00/3.36k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/39.0k [00:00<?, ?B/s]

Downloading (…)tencepiece.bpe.model:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/4.33k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/3.29k [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.

### Preparing audio
Here is how to use the processor to process audio. We'll use an audio sample taken from a Hindi speech corpus.

**Note that you don't need to specify the source language of the audio: the model will understand it automatically!**


In [None]:
# let's load an audio sample from a Hindi speech corpus
from datasets import load_dataset
dataset = load_dataset("google/fleurs", "hi_in", split="train", streaming=True)
audio_sample = next(iter(dataset))["audio"]

print(f"Sampling rate: {audio_sample['sampling_rate']}")

Downloading builder script:   0%|          | 0.00/12.6k [00:00<?, ?B/s]

Sampling rate: 16000


**Note:** The [sampling rate](https://en.wikipedia.org/wiki/Sampling_(signal_processing)) of the input audio must be 16 kHz. If your sampling rate is different, you'll need an additional step to prepare the input audio.

**Here is how to do it:**
```python
# we need an additional library
# you can install it with:
# !pip install torchaudio
import torchaudio, torch

# you need to convert the audio from a numpy array to a torch tensor first
audio = torch.tensor(audio_sample["array"])

# now downsample the audio
audio = torchaudio.functional.resample(audio, orig_freq=audio_sample['sampling_rate'], new_freq=model.config.sampling_rate)
```
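
To make concrete what "resampling" does to the array, here is a deliberately naive pure-Python sketch based on linear interpolation. The function name `resample_linear` is made up for this illustration. It is not how `torchaudio.functional.resample` works internally (that function uses windowed-sinc interpolation), and it applies no anti-aliasing filter, so prefer the torchaudio snippet above for real audio.

```python
from typing import List


def resample_linear(samples: List[float], orig_freq: int, new_freq: int) -> List[float]:
    """Naive linear-interpolation resampler, for illustration only."""
    if orig_freq <= 0 or new_freq <= 0:
        raise ValueError("sampling rates must be positive")
    if orig_freq == new_freq or not samples:
        return list(samples)

    # Number of output samples covering the same duration.
    n_out = len(samples) * new_freq // orig_freq
    out = []
    for i in range(n_out):
        # Position of output sample i on the input time axis.
        pos = i * orig_freq / new_freq
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring input samples.
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of 48 kHz audio becomes one second of 16 kHz audio.
one_second_48k = [0.0] * 48000
one_second_16k = resample_linear(one_second_48k, orig_freq=48000, new_freq=16000)
```

The duration is preserved while the number of samples changes, which is exactly why the model needs to know (or assume) the rate of the array it receives.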

Now, use the processor:

In [None]:
audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt").to(device)

It is strongly recommended to pass the `sampling_rate` argument to the processor call (for example `sampling_rate=audio_sample["sampling_rate"]`). Failing to do so can result in silent errors that might be hard to debug.


### Preparing text

Preparing text is much easier: you just have to give it to the processor, along with the language of the text. Here the text is in English, so we'll set `src_lang="eng"`.

In [None]:
# now, process some English text as well
text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt").to(device)

## Model usage

Now we have everything ready to actually use the model!

### Generate translated speech

`SeamlessM4TModel` can *seamlessly* generate text or speech with few or no changes. Let's target Russian voice translation:

In [None]:
audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()

**With the exact same code but different inputs, I’ve translated English text and Hindi speech to Russian speech samples.**

Now, let's listen to the generated audio!

#### From text

In [None]:
from IPython.display import Audio

sample_rate = model.config.sampling_rate
Audio(audio_array_from_text, rate=sample_rate)

#### From audio

In [None]:
Audio(audio_array_from_audio, rate=sample_rate)

You can also save audio as .wav files using a third-party library, e.g. scipy (note here that we also need to remove the channel dimension from our audio tensor):

```python
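# import the submodule explicitly: a bare `import scipy` may not load `scipy.io.wavfile`
import scipy.io.wavfile

# swap in audio_array_from_audio to save the speech-to-speech result instead
scipy.io.wavfile.write("seamless_m4t_out.wav", rate=sample_rate, data=audio_array_from_text)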
```
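
If you'd rather not depend on scipy, a rough alternative is the standard library's `wave` module. The sketch below is an assumption-laden illustration, not part of the original workflow: `save_wav` is a hypothetical helper, the samples are assumed to be mono floats in the range [-1, 1], the 16 kHz rate is hard-coded for the demo, and a synthetic 440 Hz tone stands in for the model output.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as 16-bit PCM (hypothetical helper)."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))          # clip to the valid range
        frames += struct.pack("<h", int(s * 32767))  # little-endian int16
    with wave.open(path, "wb") as f:
        f.setnchannels(1)          # mono
        f.setsampwidth(2)          # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))


# One second of a 440 Hz tone standing in for a generated waveform.
demo_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / demo_rate) for t in range(demo_rate)]
demo_path = os.path.join(tempfile.gettempdir(), "seamless_m4t_sketch.wav")
save_wav(demo_path, tone, demo_rate)
```

With the real model you would pass the squeezed NumPy array (for example `audio_array_from_text`) and `model.config.sampling_rate` instead of the synthetic tone.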


### Generate translated text

Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to `SeamlessM4TModel.generate`.

This time, let's translate the Hindi audio to English (I personally don't speak Hindi 🤗) and the English text to French.

In [None]:
# from audio
output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(f"Translation from audio: {translated_text_from_audio}")

# from text
output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(f"Translation from text: {translated_text_from_text}")

Translation from audio: Politicians said that they had found enough clarity in the Afghan constitution to define the decisive step in an unambiguous way.
Translation from text: Bonjour, mon chien est mignon


## Intermediary conclusion

Now you know how to use [SeamlessM4T using 🤗 Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel)!

Let's wrap it up:
1. SeamlessM4T can **translate text/speech to text/speech.**
2. It **supports numerous languages** and is a great step towards reducing language barriers in AI.
3. It is **fast and efficient**.
4. The rest of this notebook will share some **tips** on how to best use the model! (**Spoiler:** You can also do batched inference!)
5. You can try SeamlessM4T in this [demo on 🤗 Spaces](https://huggingface.co/spaces/facebook/seamless_m4t)

**Don't hesitate to share how you think this model should be used!**


## Tips


### 1. Use dedicated models

`SeamlessM4TModel` is the top-level Transformers model that generates both speech and text, but you can also use dedicated models that perform a single task without the additional components, thus reducing the memory footprint.
For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task. The rest of the code stays exactly the same:

```python
>>> from transformers import SeamlessM4TForSpeechToSpeech
>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")
```

Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task. In that case, you only have to remove `generate_speech=False`:

```python
>>> from transformers import SeamlessM4TForTextToText
>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")
```

Feel free to try out `SeamlessM4TForSpeechToText` and `SeamlessM4TForTextToSpeech` as well.
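
As a quick reference, the four dedicated classes named in this section can be organised by input and output modality. The lookup table and the helper `dedicated_model_name` below are hypothetical conveniences written for this notebook, not something 🤗 Transformers provides; only the class names themselves come from the library.

```python
# (input modality, output modality) -> name of the dedicated class
DEDICATED_MODELS = {
    ("speech", "speech"): "SeamlessM4TForSpeechToSpeech",  # S2ST
    ("speech", "text"): "SeamlessM4TForSpeechToText",      # S2TT and ASR
    ("text", "speech"): "SeamlessM4TForTextToSpeech",      # T2ST
    ("text", "text"): "SeamlessM4TForTextToText",          # T2TT
}


def dedicated_model_name(input_modality: str, output_modality: str) -> str:
    """Return the class name of the dedicated model for a given task."""
    try:
        return DEDICATED_MODELS[(input_modality, output_modality)]
    except KeyError:
        raise ValueError(
            f"unsupported task: {input_modality} -> {output_modality}"
        ) from None


print(dedicated_model_name("speech", "text"))
```

One possible use, assuming `transformers` is installed, is `getattr(transformers, dedicated_model_name("speech", "text")).from_pretrained("facebook/hf-seamless-m4t-medium")`.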

### 2. Change the speaker identity

You can change the speaker used for speech synthesis with the `spkr_id` argument. Some `spkr_id` values work better than others for some languages!


In [None]:
# let's test with, say, spkr_id=5 and tgt_lang="eng"
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="eng", spkr_id=5)[0].cpu().numpy().squeeze()

Audio(audio_array_from_audio, rate=sample_rate)


### 3. Change the generation strategy

You can use different [generation strategies](https://huggingface.co/docs/transformers/main/en/generation_strategies) for speech and text generation, e.g. `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)`, which will successively perform beam-search decoding on the text model and multinomial sampling on the speech model.


In [None]:
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="eng", spkr_id=7, text_num_beams=4, speech_do_sample=True, speech_temperature=0.6)[0].cpu().numpy().squeeze()

Audio(audio_array_from_audio, rate=sample_rate)
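
To illustrate how the `text_` and `speech_` prefixes route arguments, here is a small pure-Python sketch. It reflects my reading of the `generate` documentation (unprefixed arguments go to both sub-models, prefixed ones go to a single sub-model and take priority); the function `split_generation_kwargs` is hypothetical and is not the library's actual implementation. Arguments such as `tgt_lang` or `spkr_id`, which the model handles separately, are left out.

```python
def split_generation_kwargs(**kwargs):
    """Split generate() keyword arguments between the text and speech sub-models."""
    text_kwargs, speech_kwargs = {}, {}

    # Unprefixed arguments are shared by both sub-models.
    for key, value in kwargs.items():
        if not key.startswith(("text_", "speech_")):
            text_kwargs[key] = value
            speech_kwargs[key] = value

    # Prefixed arguments target one sub-model and override shared values.
    for key, value in kwargs.items():
        if key.startswith("text_"):
            text_kwargs[key[len("text_"):]] = value
        elif key.startswith("speech_"):
            speech_kwargs[key[len("speech_"):]] = value

    return text_kwargs, speech_kwargs


text_kwargs, speech_kwargs = split_generation_kwargs(
    num_beams=5, text_num_beams=4, speech_do_sample=True, speech_temperature=0.6
)
print(text_kwargs)
print(speech_kwargs)
```

In this example the text model would decode with 4 beams (the prefixed value wins), while the speech model would receive 5 beams along with sampling at temperature 0.6.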


### 4. Generate speech and text at the same time

Use `return_intermediate_token_ids=True` with `SeamlessM4TModel` to return both speech and text!

In [None]:
output = model.generate(**audio_inputs, return_intermediate_token_ids=True, tgt_lang="eng", spkr_id=7, text_num_beams=4, speech_do_sample=True, speech_temperature=0.6)

audio_array_from_audio = output[0].cpu().numpy().squeeze()
text_tokens = output[2]
translated_text_from_audio = processor.decode(text_tokens.tolist()[0], skip_special_tokens=True)
print(f"TRANSLATION: {translated_text_from_audio}")

Audio(audio_array_from_audio, rate=sample_rate)

TRANSLATION: Politicians said that they had found enough clarity in the Afghan constitution to unilaterally determine the decisive step.


### 5. Use batching for increased throughput

Batching with SeamlessM4T is only supported in 🤗 Transformers. Here is an example with two French sentences translated to English!

In [None]:
text_inputs = processor(text = ["J'aime HF de tout mon coeur.", "La vie est belle."], src_lang="fra", return_tensors="pt").to(device)

audio_array_from_text = model.generate(**text_inputs, tgt_lang="eng", spkr_id=7, num_beams=5, speech_do_sample=True, speech_temperature=0.6)

When batching, you can get the length of each generated waveform by accessing `audio_array_from_text[1]`.

In [None]:
# first sentence
length = audio_array_from_text[1][0]
audio = audio_array_from_text[0][0]
Audio(audio[:length].cpu().numpy().squeeze(), rate=sample_rate)

In [None]:
# second sentence
length = audio_array_from_text[1][1]
audio = audio_array_from_text[0][1]
Audio(audio[:length].cpu().numpy().squeeze(), rate=sample_rate)
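
The two cells above apply the same slicing to each batch entry. That step can be written as a small loop, sketched here with plain Python lists standing in for the padded waveform tensor and the lengths tensor. The helper name `trim_padded_batch` and the toy values are made up for this illustration.

```python
from typing import List, Sequence


def trim_padded_batch(
    waveforms: Sequence[Sequence[float]], lengths: Sequence[int]
) -> List[List[float]]:
    """Cut each padded waveform down to its true length."""
    if len(waveforms) != len(lengths):
        raise ValueError("expected one length per waveform")
    return [list(wave[:length]) for wave, length in zip(waveforms, lengths)]


# Toy batch: the second waveform is shorter and padded with zeros.
padded = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.0, 0.0],
]
true_lengths = [4, 2]

trimmed = trim_padded_batch(padded, true_lengths)
print(trimmed)
```

With the real output you would iterate over `audio_array_from_text[0]` and `audio_array_from_text[1]` in the same way, converting each trimmed waveform with `.cpu().numpy().squeeze()` before playing it.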