# EN→FR Translation with Hugging Face MarianMT Pretrained Model

This notebook demonstrates **English→French translation** using
the Hugging Face pretrained `Helsinki-NLP/opus-mt-en-fr` model.

> ✅ **Advantages:**
> - Translates out of the box, with no training or fine-tuning required
> - Pretrained on large‑scale parallel corpora
> - Runs on both CPU and GPU Colab runtimes. The initial download and load can take from tens of seconds to a few minutes. CPU inference typically takes seconds to tens of seconds per call, while GPU inference is much faster (sub-second to a few seconds).

## Learning Goals
- Load and run a pretrained model with `transformers`
- Tokenize and detokenize text for translation
- Translate single or multiple sentences

In [1]:
!pip install transformers sentencepiece --quiet

from transformers import MarianMTModel, MarianTokenizer

# 1. Load a pretrained EN→FR model (MarianMT family)
model_name = "Helsinki-NLP/opus-mt-en-fr"
hf_tokenizer = MarianTokenizer.from_pretrained(model_name)
hf_model = MarianMTModel.from_pretrained(model_name)

# 2. Translation function
def hf_translate_en_to_fr(texts):
    """Translate one English string or a list of strings; always returns a list of French strings."""
    # Accept either a single string or a list of strings
    if isinstance(texts, str):
        texts = [texts]  # Wrap a single sentence in a list for consistent processing
    # Tokenize into PyTorch tensors; pad to the longest sentence in the batch and truncate over-long inputs
    inputs = hf_tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    # Generate translated token ids using the model's default generation settings
    translated = hf_model.generate(**inputs)
    # Decode all generated id sequences back to strings, removing special tokens
    return hf_tokenizer.batch_decode(translated, skip_special_tokens=True)
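
The cell above loads the model on the CPU by default, so a GPU runtime is only used if the model and its inputs are moved there explicitly. The following is a minimal sketch of one way to do that, not part of the original notebook: `pick_device` and `translate_on_device` are hypothetical helper names, and the tokenizer and model are passed in as arguments so the sketch does not depend on the globals defined above.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed there is nothing to move, so report CPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def translate_on_device(texts, tokenizer, model, device):
    """Translate a string or list of strings with the model placed on `device`."""
    if isinstance(texts, str):
        texts = [texts]
    # Move the model weights to the target device (a no-op if already there)
    model = model.to(device)
    # The tokenizer output must live on the same device as the model
    inputs = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True
    ).to(device)
    output_ids = model.generate(**inputs)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

With the objects from this notebook, a call would look like `translate_on_device("I ate a sandwich.", hf_tokenizer, hf_model, pick_device())`. On a CPU-only runtime this should behave the same as `hf_translate_en_to_fr`.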





In [2]:
sentence = "I ate a sandwich."
result = hf_translate_en_to_fr(sentence)
print(f"EN: {sentence}")
print(f"FR: {result[0]}")

EN: I ate a sandwich.
FR: J'ai mangé un sandwich.
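
To make the tokenize and detokenize steps visible, the sketch below returns the subword pieces, their integer ids, and the text recovered by decoding those ids. `show_tokenization` is a hypothetical helper name, not something defined in the original notebook, and it takes the tokenizer as a parameter.

```python
def show_tokenization(text, tokenizer):
    """Return (subword pieces, token ids, text recovered by decoding the ids)."""
    # Calling the tokenizer on a single string yields a flat list of ids
    ids = tokenizer(text)["input_ids"]
    # Map each id back to its subword piece for inspection
    pieces = tokenizer.convert_ids_to_tokens(ids)
    # Detokenize: join the pieces back into a string, dropping special tokens
    restored = tokenizer.decode(ids, skip_special_tokens=True)
    return pieces, ids, restored
```

For example, `show_tokenization("I ate a sandwich.", hf_tokenizer)` should show SentencePiece pieces (word-initial pieces carry a `▁` prefix) followed by the end-of-sequence token `</s>` that the Marian tokenizer appends.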


In [3]:
sentences = [
    "She is reading a book.",
    "We are going to the park tomorrow.",
    "This AI model can translate English to French instantly."
]
results = hf_translate_en_to_fr(sentences)

for en, fr in zip(sentences, results):
    print(f"EN: {en}")
    print(f"FR: {fr}")
    print()

EN: She is reading a book.
FR: Elle lit un livre.

EN: We are going to the park tomorrow.
FR: Nous allons au parc demain.

EN: This AI model can translate English to French instantly.
FR: Ce modèle d'IA peut traduire l'anglais en français instantanément.
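
Because `padding=True` pads every sentence to the longest one in the call, passing a very long list at once can use a lot of memory. One option is to translate in fixed-size chunks. The helper below is a minimal sketch under that assumption: `translate_in_batches` and its `batch_size` default are illustrative choices, and the translation function is passed in as `translate_fn`.

```python
def translate_in_batches(sentences, translate_fn, batch_size=8):
    """Translate a list of sentences chunk by chunk, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results = []
    for start in range(0, len(sentences), batch_size):
        # Each chunk holds at most batch_size sentences
        chunk = sentences[start:start + batch_size]
        # translate_fn is expected to return one output string per input
        results.extend(translate_fn(chunk))
    return results
```

In this notebook it could be called as `translate_in_batches(sentences, hf_translate_en_to_fr, batch_size=2)`, which would translate the three example sentences in two calls.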

