# SeamlessM4T 🤗

## Goal of this notebook

This notebook will teach you how to easily use [SeamlessM4T](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel), a foundational multimodal model for speech translation, with 🤗 Transformers.

## Resources

1. [SeamlessM4T docs in 🤗 Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel)
2. [Demo on 🤗 Spaces](https://huggingface.co/spaces/facebook/seamless_m4t)
3. Model cards: [medium](https://huggingface.co/facebook/hf-seamless-m4t-medium) and [large](https://huggingface.co/facebook/hf-seamless-m4t-large).
4. [Original repository](https://github.com/facebookresearch/seamless_communication)

## Presentation of the model

SeamlessM4T was proposed in [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team from Meta AI. It is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T enables multiple tasks without relying on separate models:

- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)

SeamlessM4T features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as "unit tokens," from the translated text.

Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the [HiFi-GAN](https://arxiv.org/abs/2010.05646) architecture is placed on top of the second seq2seq model.

Here's how the generation process works:

- Input text or speech is processed through its specific encoder.
- A decoder creates text tokens in the desired language.
- If speech generation is required, the second seq2seq model, following a standard encoder-decoder structure, generates unit tokens.
- These unit tokens are then passed through the final vocoder to produce the actual speech.
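
The steps above can be sketched as plain control flow. This is only a schematic under stated assumptions: `encode`, `decode_text`, `text_to_units`, `vocoder` and `translate` are hypothetical placeholder functions that return dummy values, not the 🤗 Transformers API, and the 320 samples per unit is an arbitrary number picked for the illustration.

```python
from typing import List, Optional, Tuple

# Placeholder stages: each one stands in for a real sub-network and only
# mimics the kind of data that flows between the stages.

def encode(inputs: str, modality: str) -> List[str]:
    """Route the input through its modality-specific encoder (dummy states)."""
    if modality not in ("text", "speech"):
        raise ValueError("modality must be 'text' or 'speech'")
    return [f"{modality}_state_{i}" for i in range(len(inputs.split()))]

def decode_text(states: List[str], tgt_lang: str) -> List[str]:
    """First seq2seq model: encoder states -> text tokens in the target language."""
    return [f"__{tgt_lang}__"] + [f"tok_{i}" for i in range(len(states))]

def text_to_units(text_tokens: List[str]) -> List[int]:
    """Second seq2seq model: text tokens -> discrete unit tokens."""
    return list(range(len(text_tokens) * 2))

def vocoder(units: List[int]) -> List[float]:
    """Vocoder: unit tokens -> waveform samples (320 per unit is arbitrary)."""
    return [0.0] * (len(units) * 320)

def translate(
    inputs: str, modality: str, tgt_lang: str, generate_speech: bool = True
) -> Tuple[List[str], Optional[List[float]]]:
    states = encode(inputs, modality)
    text_tokens = decode_text(states, tgt_lang)
    if not generate_speech:
        # Text-only tasks stop after the first seq2seq model.
        return text_tokens, None
    return text_tokens, vocoder(text_to_units(text_tokens))
```

The point of the sketch is that text output is always produced first, and speech is an optional second stage built on top of that text.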

## Prepare the Environment

Let’s make sure we’re connected to a GPU to run this notebook. To get a GPU, click Runtime -> Change runtime type, then change Hardware accelerator from None to GPU. We can verify that we’ve been assigned a GPU and view its specifications through the nvidia-smi command:

In [1]:
!nvidia-smi

Tue Dec  5 19:43:25 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.07             Driver Version: 537.34       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4070        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   34C    P8               1W / 200W |     14MiB / 12282MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

Next, we install the 🤗 Transformers package from the main branch and the sentencepiece package:

In [None]:
!pip install --quiet git+https://github.com/huggingface/transformers sentencepiece

## Preprocessing

### Load the model

The pre-trained SeamlessM4T medium and large checkpoints can be loaded from the 🤗 Hugging Face Hub. You can change the repo id to match the checkpoint size that you wish to use.

We'll default to the medium checkpoint, for faster inference. But you can use the large checkpoint by using `"facebook/hf-seamless-m4t-large"` instead of `"facebook/hf-seamless-m4t-medium"`.


In [2]:
from transformers import SeamlessM4TModel

model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")



Move the model to an accelerator device if one is available.

In [3]:
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)

### Load the Processor

First, load the `SeamlessM4TProcessor`, which pre-processes the inputs. The Transformers package has a wide range of processors, so we'll use the `AutoProcessor` class, which can recognize which processor to load from the repository id.

The processor plays two roles here:
1. It prepares the inputs. It tokenizes the input text, i.e. cuts it into small pieces that the model can understand, and transforms the audio into a format more suitable for the model.
2. It processes the model results. Here, it is used to "detokenize" the output, i.e. to perform the opposite operation to the one described above.

In [4]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


You can seamlessly use this model on text or on audio to generate either translated text or translated audio.

### Preparing audio
Here is how to use the processor to process audio. We'll use an audio sample taken from a Mandarin Chinese speech corpus.

**Note that you don't need to specify the source language: the model will work it out automatically!**


In [6]:
# let's load an audio sample from a Mandarin Chinese speech corpus
from datasets import load_dataset
dataset = load_dataset("google/fleurs", "cmn_hans_cn", split="train")


In [46]:
audio_sample = next(iter(dataset))["audio"]

print(f"Sampling rate: {audio_sample['sampling_rate']}")

Sampling rate: 16000


In [47]:
audio_sample

{'path': 'train/10009480476880474721.wav',
 'array': array([ 0.        ,  0.        ,  0.        , ...,  0.00084221,
         0.00065321, -0.00083894]),
 'sampling_rate': 16000}
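
Since the sample carries both the raw array and its sampling rate, the clip duration follows directly from the two. A minimal sketch, where `duration_seconds` is a hypothetical helper name and not part of 🤗 Datasets:

```python
def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Duration of a mono clip: number of samples divided by samples per second."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate
```

In this notebook it could be called as `duration_seconds(len(audio_sample["array"]), audio_sample["sampling_rate"])`.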

In [45]:
from IPython.display import Audio

Audio(audio_sample['array'], rate=16000)

**Note:** The [sampling rate](https://en.wikipedia.org/wiki/Sampling_(signal_processing)) of the input audio must be 16 kHz. If your audio uses a different sampling rate, you'll need an additional step to resample it.

**Here is how to do it:**
```python
# we need an additional library
# you can install it with:
# !pip install torchaudio
import torchaudio, torch

# you need to convert the audio from a numpy array to a torch tensor first
audio = torch.tensor(audio_sample["array"])

# now downsample the audio
audio = torchaudio.functional.resample(audio, orig_freq=audio_sample['sampling_rate'], new_freq=model.config.sampling_rate)
```
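
To make the idea of resampling concrete, here is a deliberately naive, dependency-free sketch based on linear interpolation. It is not what `torchaudio.functional.resample` does internally (torchaudio applies a proper low-pass filter to avoid aliasing), and `resample_linear` is a hypothetical name used only for this illustration. Prefer the torchaudio call for real audio.

```python
from typing import List, Sequence

def resample_linear(samples: Sequence[float], orig_freq: int, new_freq: int) -> List[float]:
    """Naive resampling: read the signal at new, evenly spaced positions."""
    if orig_freq <= 0 or new_freq <= 0:
        raise ValueError("frequencies must be positive")
    if orig_freq == new_freq or len(samples) == 0:
        return list(samples)

    n_out = len(samples) * new_freq // orig_freq  # number of output samples
    step = orig_freq / new_freq                   # input samples per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step                 # fractional position in the input
        left = min(int(pos), last)     # nearest input sample on the left
        right = min(left + 1, last)    # and on the right (clamped at the end)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out
```

For instance, one second of 48 kHz audio (48000 samples) becomes 16000 samples at 16 kHz.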

Now, use the processor:

In [10]:
audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt").to(device)

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


In [11]:
audio_inputs

{'input_features': tensor([[[-4.3248, -4.2316, -4.1590,  ..., -4.3039, -4.3007, -4.2950],
         [-4.3248, -4.2316, -4.1590,  ..., -4.3039, -4.3007, -4.2950],
         [-4.3248, -4.2316, -4.1590,  ..., -4.3039, -4.3007, -4.2950],
         ...,
         [-0.5762, -0.6554, -0.9157,  ..., -0.1630, -0.2886, -0.1657],
         [-0.4287, -0.5118, -0.5541,  ..., -0.3005, -0.3652, -0.2637],
         [-0.4532, -0.4167, -0.4836,  ..., -0.4294, -0.4666, -0.3861]]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1

### Preparing text

Preparing text is much easier: you just pass it to the processor, along with the language of the text. Here the text is in English, so we'll set `src_lang="eng"`.

In [12]:
# now, process some English text as well
text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt").to(device)

In [13]:
text_inputs

{'input_ids': tensor([[256047,  94124, 248079,   1537,   6658,    248,  95740,      3]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
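
The ids printed above can be read as three parts. The sketch below assumes the tokenizer lays out source text as a source-language code, then the text tokens, then an end-of-sequence token. The ids are copied from the output above, and the variable names are only illustrative.

```python
# Token ids copied from the `text_inputs` output above.
input_ids = [256047, 94124, 248079, 1537, 6658, 248, 95740, 3]
attention_mask = [1, 1, 1, 1, 1, 1, 1, 1]

# Assumed layout: <source language code> <text tokens ...> <end of sequence>
lang_token_id, *text_token_ids, eos_token_id = input_ids

# One attention-mask entry per token; all ones because a single sentence needs no padding.
num_tokens = len(input_ids)
num_attended = sum(attention_mask)

print(f"language token: {lang_token_id}, text tokens: {text_token_ids}, eos: {eos_token_id}")
```

With `src_lang="eng"`, the first id (256047) is therefore read here as the English language code and the last id (3) as the end-of-sequence marker.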

## Model usage

Now we have everything ready to actually use the model!

### Generate translated speech

`SeamlessM4TModel` can *seamlessly* generate text or speech with few or no changes. Let's target English speech translation:

In [17]:
audio_array_from_text = model.generate(**text_inputs, tgt_lang="eng")[0].cpu().numpy().squeeze()
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="eng")[0].cpu().numpy().squeeze()

**With the exact same code but different inputs, I’ve turned English text and Mandarin Chinese speech into English speech samples.**

Now, let's listen to the generated audios!

#### From text

In [18]:
from IPython.display import Audio

sample_rate = model.config.sampling_rate
Audio(audio_array_from_text, rate=sample_rate)

#### From audio

In [19]:
Audio(audio_array_from_audio, rate=sample_rate)

You can also save audio as .wav files using a third-party library, e.g. scipy (note here that we also need to remove the channel dimension from our audio tensor):

```python
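# import the submodule explicitly: a bare `import scipy` does not load `scipy.io` on older SciPy versions
import scipy.io.wavfile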

scipy.io.wavfile.write("seamless_m4t_out.wav", rate=sample_rate, data=audio_array_from_text) # audio_array_from_audio
```
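
If you would rather avoid the extra dependency, Python's standard `wave` module can also write the file. The sketch below is one possible approach, not the notebook's method: `write_wav_pcm16` is a hypothetical helper that assumes a mono sequence of floats in the range [-1, 1] and stores it as 16-bit PCM.

```python
import os
import struct
import tempfile
import wave
from typing import Sequence

def write_wav_pcm16(path: str, samples: Sequence[float], sample_rate: int = 16000) -> None:
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))        # avoid integer overflow
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))

# Demo with a few dummy samples written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "seamless_m4t_out.wav")
write_wav_pcm16(demo_path, [0.0, 0.5, -0.5, 1.0])
```

With the notebook's variables, the call would look like `write_wav_pcm16("seamless_m4t_out.wav", audio_array_from_text.tolist(), sample_rate)`.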


### Generate translated text

Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to `SeamlessM4TModel.generate`.

This time, let's translate the Mandarin Chinese audio to English (I personally don't speak Mandarin 🤗) and the English text to French.

In [20]:
# from audio
output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(f"Translation from audio: {translated_text_from_audio}")

# from text
output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print(f"Translation from text: {translated_text_from_text}")

Translation from audio: Despite these accusations, Ma Ying-jeou argued in an interview that China's relationship with the mainland was closer and easier to overcome.
Translation from text: Bonjour, mon chien est mignon


## Intermediary conclusion

Now you know how to use [SeamlessM4T using 🤗 Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel)!

Let's wrap it up:
1. SeamlessM4T can **translate text/speech to text/speech.**
2. It **supports numerous languages** and is a great step towards reducing language barriers in AI.
3. It is **fast and efficient**.
4. The rest of this notebook will share some **tips** on how to best use the model! (**Spoiler:** You can also do batched inference!)
5. You can try SeamlessM4T in this [demo on 🤗 Spaces](https://huggingface.co/spaces/facebook/seamless_m4t)

**Don't hesitate to share how you think this model should be used!**


## Tips


### 1. Use dedicated models

`SeamlessM4TModel` is the top-level 🤗 Transformers model that generates both speech and text, but you can also use dedicated models that perform a single task without the additional components, which reduces the memory footprint.
For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task. The rest of the code stays exactly the same:

```python
>>> from transformers import SeamlessM4TForSpeechToSpeech
>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")
```

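Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task. You only have to remove `generate_speech=False`. Note that this dedicated model returns the generated token ids directly, so decode `output_tokens[0].tolist()` rather than `output_tokens[0].tolist()[0]`.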

```python
>>> from transformers import SeamlessM4TForTextToText
>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")
```

Feel free to try out `SeamlessM4TForSpeechToText` and `SeamlessM4TForTextToSpeech` as well.

### 2. Change the speaker identity

You can change the speaker used for speech synthesis with the `spkr_id` argument. Some `spkr_id` values work better than others for some languages!


In [21]:
# let's test with, say, spkr_id=5 and tgt_lang="eng"
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="eng", spkr_id=5)[0].cpu().numpy().squeeze()

Audio(audio_array_from_audio, rate=sample_rate)


### 3. Change the generation strategy

You can use different [generation strategies](https://huggingface.co/docs/transformers/main/en/generation_strategies) for speech and text generation, e.g. `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)`, which will successively perform beam-search decoding on the text model and multinomial sampling on the speech model.


In [22]:
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="eng", spkr_id=7, text_num_beams=4, speech_do_sample=True, speech_temperature=0.6)[0].cpu().numpy().squeeze()

Audio(audio_array_from_audio, rate=sample_rate)
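
The `text_` and `speech_` prefixes decide which sub-model receives each generation argument. The sketch below mimics that routing in plain Python, based on my reading of the documented behavior (unprefixed arguments go to both sub-models, and prefixed ones take priority). `split_generation_kwargs` is a hypothetical name, not a 🤗 Transformers function.

```python
from typing import Any, Dict, Tuple

def split_generation_kwargs(**kwargs: Any) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split generate() keyword arguments into text-model and speech-model arguments."""
    text_kwargs: Dict[str, Any] = {}
    speech_kwargs: Dict[str, Any] = {}

    # First pass: arguments without a prefix apply to both sub-models.
    for key, value in kwargs.items():
        if not key.startswith(("text_", "speech_")):
            text_kwargs[key] = value
            speech_kwargs[key] = value

    # Second pass: prefixed arguments override the shared values.
    for key, value in kwargs.items():
        if key.startswith("text_"):
            text_kwargs[key[len("text_"):]] = value
        elif key.startswith("speech_"):
            speech_kwargs[key[len("speech_"):]] = value

    return text_kwargs, speech_kwargs
```

For example, `split_generation_kwargs(num_beams=2, text_num_beams=4, speech_do_sample=True)` gives the text model `num_beams=4`, while the speech model keeps `num_beams=2` and additionally gets `do_sample=True`.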


### 4. Generate speech and text at the same time

Use `return_intermediate_token_ids=True` with `SeamlessM4TModel` to return both speech and text!

In [23]:
output = model.generate(**audio_inputs, return_intermediate_token_ids=True, tgt_lang="eng", spkr_id=7, text_num_beams=4, speech_do_sample=True, speech_temperature=0.6)

audio_array_from_audio = output[0].cpu().numpy().squeeze()
text_tokens = output[2]
translated_text_from_audio = processor.decode(text_tokens.tolist()[0], skip_special_tokens=True)
print(f"TRANSLATION: {translated_text_from_audio}")

Audio(audio_array_from_audio, rate=sample_rate)

TRANSLATION: Despite these accusations, Ma Ying-jeou argued in a speech that China's relationship with the mainland was closer and easier to overcome.


### 5. Use batching for increased throughput

Batched inference with SeamlessM4T is only supported in 🤗 Transformers. Here is an example with two Chinese sentences translated to English speech!

In [40]:
text_inputs = processor(text = ["中国国家主席习近平的专机降落旧金山.", "中国的农历新年."], src_lang="cmn", return_tensors="pt").to(device)

audio_array_from_text = model.generate(**text_inputs, tgt_lang="eng", spkr_id=5, num_beams=2, speech_do_sample=True, speech_temperature=0.8)

When batching, you can get the length of each generated waveform by accessing `audio_array_from_text[1]`.

In [41]:
# first sentence
length = audio_array_from_text[1][0]
audio = audio_array_from_text[0][0]
Audio(audio[:length].cpu().numpy().squeeze(), rate=sample_rate)

In [42]:
# second sentence
length = audio_array_from_text[1][1]
audio = audio_array_from_text[0][1]
Audio(audio[:length].cpu().numpy().squeeze(), rate=sample_rate)
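
The two cells above repeat the same trimming step for each sentence. A small helper can do it for the whole batch. This is a sketch on plain Python lists, and `trim_batch` is a hypothetical name. With the real tensors you would slice each row the same way.

```python
from typing import List, Sequence

def trim_batch(waveforms: Sequence[Sequence[float]], lengths: Sequence[int]) -> List[List[float]]:
    """Cut the padding off each waveform in a batch using its reported length."""
    if len(waveforms) != len(lengths):
        raise ValueError("expected one length per waveform")
    return [list(waveform[:length]) for waveform, length in zip(waveforms, lengths)]
```

Batched waveforms are padded to the longest item, so skipping this step would leave trailing padding at the end of the shorter clips.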

# SeamlessM4Tv2

[SeamlessM4Tv2 docs in 🤗 Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2)

In [14]:
from transformers import AutoProcessor, SeamlessM4TModel
 
processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [15]:
#speech from text
text_inputs = processor(text = "Hello, special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained", src_lang="eng", return_tensors="pt")
audio_array_from_text = model.generate(**text_inputs, tgt_lang="cmn")[0].cpu().numpy().squeeze() #cmn


In [16]:
from IPython.display import Audio

sample_rate = model.config.sampling_rate
Audio(audio_array_from_text, rate=sample_rate)

In [17]:
sample_rate

16000

In [18]:
#text 2 text
output_tokens = model.generate(**text_inputs, tgt_lang="cmn", generate_speech=False)
translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
translated_text_from_text

'你好,在词汇中已经添加了特殊的标志,确保相关的词嵌入是精细调整或训练的'

In [21]:
import torchaudio
# from audio
audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
audio =  torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000) # must be a 16 kHz waveform array
Audio(audio, rate=sample_rate)

In [29]:
device="cuda:0"

In [30]:
audio_inputs = processor(audios=audio, return_tensors="pt")
audio_inputs.to(device)
model.to(device)
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="cmn", spkr_id=5, num_beams=2, speech_do_sample=True, speech_temperature=0.6)[0].cpu().numpy().squeeze()

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


In [31]:
Audio(audio_array_from_audio, rate=sample_rate)

In [32]:
# from audio
output_tokens = model.generate(**audio_inputs, tgt_lang="cmn", generate_speech=False)
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
translated_text_from_audio


'我们美国人民为了形成更完美的联盟建立正义确保家庭平静,为共同的防御提供'

In [33]:
from transformers import SeamlessM4TForTextToText
model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")

In [34]:
model.to(device)

SeamlessM4TForTextToText(
  (shared): Embedding(256206, 1024, padding_idx=0)
  (text_encoder): SeamlessM4TEncoder(
    (embed_tokens): Embedding(256206, 1024, padding_idx=0)
    (embed_positions): SeamlessM4TSinusoidalPositionalEmbedding()
    (layers): ModuleList(
      (0-11): 12 x SeamlessM4TEncoderLayer(
        (self_attn): SeamlessM4TAttention(
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (ffn): SeamlessM4TFeedForwardNetwork(
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
    

In [35]:
text_inputs = processor(text = "Hello, special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained", src_lang="eng", return_tensors="pt")
text_inputs.to(device)

{'input_ids': tensor([[256047,  94124, 248079,  23257,   1776,    585,   2986,   8857,  97238,
            108,    349, 177291,    276,   4379, 248079,   7038,  22802,    349,
         164490,   3422,  25079,   2314,   2895,   2442,  21510, 248105,  72838,
             76,    618,  57477,     76,      3]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

In [39]:
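# the dedicated text-to-text model returns the token ids directly,
# so output_tokens[0] is already the first sequence (no extra [0] after tolist())
output_tokens = model.generate(**text_inputs, tgt_lang="cmn")
translated_text_from_text = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)
translated_text_from_text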

In [7]:
from ipywebrtc import AudioRecorder, CameraStream #https://ipywebrtc.readthedocs.io/en/latest/AudioRecorder.html
import torchaudio
from IPython.display import Audio 
#https://medium.com/@harrycblum/record-audio-in-a-jupyter-notebook-da08a88278bb

In [8]:
camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

In [10]:
recorder

AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …

In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav file.wav -y -hide_banner -loglevel panic
sig, sr = torchaudio.load("file.wav")
print(sig.shape)
Audio(data=sig, rate=sr)
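
Microphone recordings are often captured at 44.1 kHz or 48 kHz, while the model expects 16 kHz, so it can help to check the converted file before passing it to the processor. The sketch below uses only the standard `wave` module. `inspect_wav` is a hypothetical helper, and the demo file is a dummy clip of silence written to a temporary directory.

```python
import os
import tempfile
import wave
from typing import Tuple

def inspect_wav(path: str, target_rate: int = 16_000) -> Tuple[int, int, bool]:
    """Return (sample_rate, channels, needs_resampling) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    return sample_rate, channels, sample_rate != target_rate

# Demo: write 0.1 s of 48 kHz mono silence, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "file.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(48_000)
    wav_file.writeframes(b"\x00\x00" * 4800)

demo_info = inspect_wav(demo_path)
print(demo_info)
```

If the last value is `True`, resample with `torchaudio.functional.resample` as shown earlier, or add ffmpeg's `-ar 16000` option to the conversion command.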