# Translation Transformer

In this notebook, we use a small pretrained transformer (Helsinki-NLP/opus-mt-fr-en) to translate French to English, stepping through tokenization, encoding, generation, and decoding.

<a target="_blank" href="https://colab.research.google.com/github/simonguest/CS-394/blob/main/src/01/notebooks/translation-transformer.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a target="_blank" href="https://github.com/simonguest/CS-394/raw/refs/heads/main/src/01/notebooks/translation-transformer.ipynb">
  <img src="https://img.shields.io/badge/Download_.ipynb-blue" alt="Download .ipynb"/>
</a>

## Load model

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-fr-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

## Tokenize

In [None]:
french_text = "Bonjour, comment allez-vous?"
input_ids = tokenizer.encode(french_text, return_tensors="pt")
print(input_ids[0])
print("Tokens:", tokenizer.convert_ids_to_tokens(input_ids[0]))

tensor([8703,    2, 1027, 5682,   21,  682,   54,    0])
Tokens: ['▁Bonjour', ',', '▁comment', '▁allez', '-', 'vous', '?', '</s>']
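The `▁` prefix in the token list marks the start of a new word, which is a SentencePiece convention. As a rough sketch that assumes only this convention, the pieces can be glued back into the original sentence. `pieces_to_text` is a hypothetical helper written for illustration, not part of the tokenizer API.

```python
# Token pieces copied from the output above, without the end-of-sequence marker.
pieces = ["▁Bonjour", ",", "▁comment", "▁allez", "-", "vous", "?"]

def pieces_to_text(pieces):
    """Hypothetical helper: join the pieces and turn each '▁' into a space."""
    return "".join(pieces).replace("▁", " ").strip()

reconstructed = pieces_to_text(pieces)
print(reconstructed)
```

Pieces without the `▁` prefix (such as `,` and `vous`) attach directly to the previous piece, which is why `allez-vous` comes back as a single word.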


In [None]:
# @title Demonstrate contextual vectors using the encoder

# French: "Bonjour , comment allez  - vous  ?   </s>"
#          ↓       ↓    ↓      ↓    ↓  ↓    ↓    ↓
# Encoder: [v1]   [v2] [v3]  [v4] [v5][v6][v7] [v8]  ← 8 vectors (one per token, including </s>), each 512-dim
#          └──────────────────────────────────────┘

encoder = model.get_encoder()
encoder_output = encoder(input_ids)
print("Encoder output shape:", encoder_output.last_hidden_state.shape)
print("Encoder output:", encoder_output)

Encoder output shape: torch.Size([1, 8, 512])
Encoder output: BaseModelOutput(last_hidden_state=tensor([[[-0.3943,  0.4660,  0.0190,  ..., -0.5069,  0.2120, -0.3190],
         [ 0.0957,  0.0780,  0.1918,  ..., -0.0854,  0.2138,  0.1528],
         [-0.6160,  0.0295,  0.1918,  ..., -0.3886,  0.0770,  0.2311],
         ...,
         [-0.1839, -0.3798,  0.1832,  ..., -0.0041, -0.3633, -0.5455],
         [ 0.0153,  0.0264,  0.1122,  ...,  0.1966, -0.3027, -0.3659],
         [-0.0484,  0.0147,  0.0078,  ..., -0.1359, -0.0295, -0.0799]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)
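To see why these vectors are called contextual, here is a toy self-attention step in plain Python. It is only a sketch: the real encoder adds learned projections, multiple heads, several layers, and positional information, and the 2-dimensional "embeddings" below are made-up values. The point is that the output for a token depends on the other tokens in the sentence.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_self_attention(vectors):
    """Each output vector is a softmax-weighted average of all input vectors."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot product between this token and every token in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs

# Made-up 2-dim "embeddings" for three tokens (illustrative values only).
sentence_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Same first two tokens, different third token.
sentence_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]

out_a = toy_self_attention(sentence_a)
out_b = toy_self_attention(sentence_b)

print("token 1 in sentence A:", out_a[0])
print("token 1 in sentence B:", out_b[0])
```

The first token has the same input vector in both sentences, yet its output differs because the third token changed.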


## Run through model

In [None]:
output_ids = model.generate(input_ids)
print(output_ids)


tensor([[59513, 10537,     2,   541,    52,    55,    54,     0]])
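`model.generate` builds the output one token at a time, feeding each chosen token back in to pick the next. The sketch below shows the simplest such strategy, greedy decoding, on a toy lookup table. Everything in it is hypothetical: `TOY_SCORES` stands in for the decoder, and the real call scores tokens using the encoder output and may use a different strategy, such as beam search, depending on the model's generation config.

```python
# Hypothetical next-token scores keyed by the previous token (stand-in for the decoder).
TOY_SCORES = {
    "<start>": {"Hello": 0.9, "Hi": 0.1},
    "Hello": {",": 0.8, "</s>": 0.2},
    ",": {"how": 0.7, "what": 0.3},
    "how": {"are": 0.9, "is": 0.1},
    "are": {"you": 0.95, "we": 0.05},
    "you": {"?": 0.9, "</s>": 0.1},
    "?": {"</s>": 1.0},
}

def greedy_decode(scores, start="<start>", end="</s>", max_steps=20):
    """Repeatedly append the highest-scoring next token until the end token appears."""
    tokens = [start]
    for _ in range(max_steps):
        candidates = scores[tokens[-1]]
        best = max(candidates, key=candidates.get)
        tokens.append(best)
        if best == end:
            break
    return tokens

decoded = greedy_decode(TOY_SCORES)
print(decoded)
```

Like the tensor above, the toy sequence begins with a start token and ends with an end-of-sequence token.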


## Decode back to text to complete the translation

In [None]:
english_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Translation:", english_text)

Translation: Hello, how are you?
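The `skip_special_tokens=True` flag drops marker ids before the text is assembled. A minimal sketch of that filtering step, using the ids printed by `generate`, is below. The set of special ids is an assumption: `0` maps to `</s>` in the tokenizer output earlier, and `59513` is assumed to be the padding / decoder-start id for this model.

```python
# Ids copied from the generate() output above.
output_ids = [59513, 10537, 2, 541, 52, 55, 54, 0]

# Assumed special ids: 0 is '</s>' (seen in the tokenizer output earlier);
# 59513 is assumed to be the padding / decoder-start id for this model.
ASSUMED_SPECIAL_IDS = {0, 59513}

# Keep only the ids that carry translated content.
content_ids = [i for i in output_ids if i not in ASSUMED_SPECIAL_IDS]
print(content_ids)
```

With the tokenizer loaded, `tokenizer.all_special_ids` reports the actual special ids, so the assumed set can be checked against it.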
