[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1yhCfYC9Fxisoq_wXusada5Jz7zTkuLm5)

# Notebook #2: machine translation

## Description:

In this notebook, we will show you how to translate text from one language to another in a few lines of code. We will use the `transformers` package from [Hugging Face](https://huggingface.co/) and load translation models trained by the [Language Technology Research Group at the University of Helsinki (Helsinki-NLP)](https://blogs.helsinki.fi/language-technology/).

### Import modules

In [5]:
import transformers
from transformers import MarianMTModel, MarianTokenizer
from transformers.hf_api import HfApi

In [6]:
transformers.__version__

'4.1.1'

For this notebook, we are using version `4.1.1` of the `transformers` package, as the output above shows.

### Translation

The code below lets you use the 1008 machine translation models, covering 140 languages, that Helsinki-NLP contributed to the Hugging Face model hub.

The `Translator` class takes two arguments: `src_language`, the language of the input text, and `tgt_language`, the language to translate into.

It has two important methods:
- `load_model` loads the tokenizer and the model for the chosen language pair.
- `translate` uses the `loaded_model` (the tokenizer and model returned by `load_model`) to translate the `src_text` given as input.

In [7]:
class Translator:
    
    # Query the model hub once, when the class is defined, and keep only the Helsinki-NLP models.
    model_list = HfApi().model_list()
    valid_model_names = [x.modelId for x in model_list if x.modelId.startswith("Helsinki-NLP")]
    
    def __init__(self, src_language, tgt_language):
        self.src_language = src_language
        self.tgt_language = tgt_language
        self.model_name = f'Helsinki-NLP/opus-mt-{src_language}-{tgt_language}'
        if self.model_name not in self.valid_model_names:
            raise KeyError(f'{self.model_name} is not a valid model name.')

    def load_model(self):
        """Load translation model and tokenizer."""
        tokenizer = MarianTokenizer.from_pretrained(self.model_name)
        model = MarianMTModel.from_pretrained(self.model_name)
        return tokenizer, model
    
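    def translate(self, src_text, loaded_model):
        """Use loaded model and tokenizer to translate the source text."""
        tokenizer, model = loaded_model
        # Calling the tokenizer directly replaces the deprecated prepare_seq2seq_batch helper.
        batch = tokenizer([src_text], return_tensors="pt", padding=True)
        translated = model.generate(**batch)
        # Only one sentence was passed in, so decode the first (and only) output sequence.
        return tokenizer.decode(translated[0], skip_special_tokens=True)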

We first instantiate the `Translator` class, using Spanish as the source language and English as the target language as an example.

Languages are identified by their ISO 639-1 codes (you can find a list of these codes [here](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)). Not all language combinations are available: please check the [model list](https://huggingface.co/Helsinki-NLP) from `Helsinki-NLP` to see whether the combination you are interested in exists. Model names follow the pattern `Helsinki-NLP/opus-mt-<ISO_CODE_SRC_LANG>-<ISO_CODE_TGT_LANG>`. For instance, to check whether translation from English (ISO 639-1 = `en`) to French (ISO 639-1 = `fr`) is available, look for `Helsinki-NLP/opus-mt-en-fr`.
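
The naming pattern above can be sketched as a pair of small helpers. This is only an illustration: `build_model_name`, `is_available`, and the short `example_names` list are hypothetical names introduced here, and the list stands in for the model names that would normally be fetched from the model hub.

```python
def build_model_name(src_language, tgt_language):
    """Build a model name following the Helsinki-NLP/opus-mt-<src>-<tgt> pattern."""
    return f"Helsinki-NLP/opus-mt-{src_language}-{tgt_language}"


def is_available(model_name, valid_model_names):
    """Return True if the model name appears in the given collection of names."""
    return model_name in set(valid_model_names)


# Hypothetical stand-in for the list of model names retrieved from the hub.
example_names = ["Helsinki-NLP/opus-mt-en-fr", "Helsinki-NLP/opus-mt-es-en"]

print(is_available(build_model_name("en", "fr"), example_names))  # True
print(is_available(build_model_name("fr", "xx"), example_names))  # False
```

Checking availability before instantiating the class avoids the `KeyError` that `Translator.__init__` raises for an unknown pair.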

In [8]:
%%time
translator = Translator(src_language='es', tgt_language='en')

CPU times: user 35 µs, sys: 0 ns, total: 35 µs
Wall time: 39.1 µs


After instantiating the `Translator` class, we load the translation model.

In [9]:
model = translator.load_model()

Some weights of MarianMTModel were not initialized from the model checkpoint at Helsinki-NLP/opus-mt-es-en and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
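
Loading a model is the slow step, so when working with several language pairs it can help to load each pair only once and reuse the result. The sketch below shows one possible way to do this with a plain dictionary; `get_loaded_model` and `_loaded_models` are hypothetical names, not part of the notebook or of `transformers`, and the `loader` argument is assumed to be any callable that returns the loaded tokenizer and model.

```python
# In-memory cache: model name -> whatever the loader returned for it.
_loaded_models = {}


def get_loaded_model(model_name, loader):
    """Return the cached result of loader(model_name), calling loader only on first use."""
    if model_name not in _loaded_models:
        _loaded_models[model_name] = loader(model_name)
    return _loaded_models[model_name]


# Possible usage with the objects from this notebook (triggers the real model load):
# loaded = get_loaded_model(translator.model_name, lambda _: translator.load_model())
```

The cache lives only for the current Python session; restarting the kernel clears it.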


Finally, we pass the source text (the `src_text` argument, a string) and the loaded model to the `translate` method:

In [11]:
%%time
translator.translate(
    src_text='Hola! Quiero hacer una prueba para ver si este traductor funciona bien.',
    loaded_model=model
    )

CPU times: user 3.17 s, sys: 18.8 ms, total: 3.19 s
Wall time: 801 ms


'Hi! I want to do a test to see if this translator works well.'
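
The `translate` method handles one string at a time. As a rough sketch of how several sentences might be translated in one call, the hypothetical helper `translate_batch` below takes a list of strings plus the tokenizer and model returned by `load_model`. It assumes the tokenizer can be called directly with `padding=True` and `return_tensors="pt"`, and that the model exposes `generate`, as `MarianTokenizer` and `MarianMTModel` do.

```python
def translate_batch(texts, tokenizer, model):
    """Translate a list of strings in a single call to model.generate."""
    # padding=True pads shorter sentences so the whole batch fits in one tensor.
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    # Decode each generated sequence back into text, dropping padding and end markers.
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in generated]


# Possible usage with the objects from this notebook (requires the downloaded model):
# tokenizer, marian_model = translator.load_model()
# translate_batch(['Hola!', 'Quiero hacer una prueba.'], tokenizer, marian_model)
```

The output list has the same order and length as the input list, so each translation can be matched back to its source sentence by position.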

## References:

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, & Alexander M. Rush (2020). Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 38–45). Association for Computational Linguistics.

Jörg Tiedemann & Santhosh Thottingal (2020). OPUS-MT – Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT).