# EasyNMT - Example (Opus-MT Model)
This notebook shows the usage of [EasyNMT](https://github.com/UKPLab/EasyNMT) for machine translation.

Here, we use the [Opus-MT model](https://github.com/Helsinki-NLP/Opus-MT). The Helsinki-NLP group provides 1200+ pre-trained models for various language directions (e.g. en-de, es-fr, ru-fr). Each model has a size of about 300 MB.

EasyNMT makes these models easy to use: the model needed for your translation direction is loaded automatically and kept in memory for future use.

# Colab with GPU
When running this notebook in Colab, make sure a GPU is selected as the hardware accelerator. To enable this:
- Navigate to Edit → Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

With `!nvidia-smi` we can check which GPU was assigned to us in Colab.

In [None]:
!nvidia-smi
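
The same check can be sketched from Python, which is handy when later cells should react to a missing GPU. This is a minimal sketch, not part of EasyNMT: the helper name `gpu_names` is made up for illustration, and it assumes the `nvidia-smi` tool is on the `PATH` whenever an NVIDIA GPU is attached.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    # nvidia-smi is only present when the NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    # One GPU name per non-empty output line.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(gpu_names() or "No NVIDIA GPU detected")
```

An empty list usually means the runtime was started without a GPU accelerator.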

# Installation
You can install EasyNMT with pip. EasyNMT uses PyTorch. If you have a GPU available on your local machine, see [PyTorch Get Started](https://pytorch.org/get-started/locally/) for how to install PyTorch with CUDA support.

In [None]:
!pip install -U easynmt

# Create EasyNMT instance

Creating an EasyNMT instance and loading a model is easy. You pass the model name you want to use and all needed files are downloaded and cached locally.

In [None]:
from easynmt import EasyNMT
model = EasyNMT('opus-mt', cache_folder="./")

# Sentence Translation
When you have individual sentences to translate, you can call the method `translate()`.

In [None]:
translations = model.translate("我是誰", target_lang='en', source_lang="zh")

In [None]:
print(translations)

In [None]:
!pip install wfind

In [None]:
!python -m find "/" "tokenizer_config.json"
!python -m find "/" "source.spm"

In [None]:
!python -m find "/" "target.spm"
!python -m find "/" "vocab.json"
!python -m find "/" "config.json"

In [None]:
!python -m find "/" "pytorch_model.bin"
!python -m find "/" "generation_config.json"
!python -m find "/" "model.safetensors"
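
The file searches above can also be sketched in plain Python with `pathlib`, without installing an extra package. This is a minimal sketch: `find_files` is a hypothetical helper, not part of EasyNMT or the Hugging Face tooling.

```python
from pathlib import Path


def find_files(root, filename):
    """Return the sorted paths of all files named `filename` below `root`."""
    # rglob walks the directory tree recursively; keep regular files only.
    return sorted(p for p in Path(root).rglob(filename) if p.is_file())
```

Searching only below the cache directory, for example `find_files(Path.home() / ".cache" / "huggingface", "config.json")`, is usually much faster than scanning from `/`.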

In [None]:
!cp /root/.cache/huggingface/hub/models--Helsinki-NLP--opus-mt-zh-en/snapshots/badebd2bdd4cdfde141a969df82a0f2c4e3b1dfe/model.safetensors /root/.cache/huggingface/hub/models--Helsinki-NLP--opus-mt-zh-en/snapshots/cf109095479db38d6df799875e34039d4938aaa6

In [None]:
!python -m find "/" "easynmt.json"

In [None]:
!dir /root/.cache/huggingface/hub/models--Helsinki-NLP--opus-mt-zh-en/snapshots/cf109095479db38d6df799875e34039d4938aaa6

In [None]:
!dir /root/.cache/huggingface/hub/models--Helsinki-NLP--opus-mt-zh-en/snapshots/badebd2bdd4cdfde141a969df82a0f2c4e3b1dfe

In [None]:
!cp /content/opus-mt/easynmt.json /root/.cache/huggingface/hub/models--Helsinki-NLP--opus-mt-zh-en/snapshots/cf109095479db38d6df799875e34039d4938aaa6

In [None]:
from easynmt import EasyNMT
model = EasyNMT('/root/.cache/huggingface/hub/models--Helsinki-NLP--opus-mt-zh-en/snapshots/cf109095479db38d6df799875e34039d4938aaa6')

In [None]:
translations = model.translate("我是誰", target_lang='en', source_lang="zh")

In [None]:
print(translations)

In [None]:
!zip -r get.zip /root/.cache/huggingface/hub/models--Helsinki-NLP--opus-mt-zh-en/snapshots/cf109095479db38d6df799875e34039d4938aaa6

In [None]:
from google.colab import files
files.download("get.zip")

# Use Colab: on Windows this raises `requests.exceptions.MissingSchema: Invalid URL`


# **END**

# Document Translation
You can also pass longer documents (or a list of documents) to the `translate()` method.

As Transformer models can only translate inputs up to 512 (or 1024) word pieces, we first perform sentence splitting. Then, each sentence is translated individually.

In [None]:
import tqdm
document = """Berlin is the capital and largest city of Germany by both area and population.
Its 3,769,495 inhabitants as of 31 December 2019 make it the most-populous city of the European Union, according to population within city limits.
The city is also one of Germany's 16 federal states. It is surrounded by the state of Brandenburg, and contiguous with Potsdam, Brandenburg's capital.
The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants and an area of more than 30,000 km2, Germany's third-largest metropolitan region after the Rhine-Ruhr and Rhine-Main regions.
Berlin straddles the banks of the River Spree, which flows into the River Havel (a tributary of the River Elbe) in the western borough of Spandau.
Among the city's main topographical features are the many lakes in the western and southeastern boroughs formed by the Spree, Havel, and Dahme rivers (the largest of which is Lake Müggelsee).
Due to its location in the European Plain, Berlin is influenced by a temperate seasonal climate.
About one-third of the city's area is composed of forests, parks, gardens, rivers, canals and lakes.
The city lies in the Central German dialect area, the Berlin dialect being a variant of the Lusatian-New Marchian dialects.

First documented in the 13th century and at the crossing of two important historic trade routes, Berlin became the capital of the Margraviate of Brandenburg (1417–1701), the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–1933), and the Third Reich (1933–1945).
Berlin in the 1920s was the third-largest municipality in the world.
After World War II and its subsequent occupation by the victorious countries, the city was divided; West Berlin became a de facto West German exclave, surrounded by the Berlin Wall (1961–1989) and East German territory.
East Berlin was declared capital of East Germany, while Bonn became the West German capital.
Following German reunification in 1990, Berlin once again became the capital of all of Germany.

Berlin is a world city of culture, politics, media and science.
Its economy is based on high-tech firms and the service sector, encompassing a diverse range of creative industries, research facilities, media corporations and convention venues.
Berlin serves as a continental hub for air and rail traffic and has a highly complex public transportation network.
The metropolis is a popular tourist destination.
Significant industries also include IT, pharmaceuticals, biomedical engineering, clean tech, biotechnology, construction and electronics."""


print("Output:")
print(model.translate(document, target_lang='de'))
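
The split-then-translate idea described above can be sketched as follows. This is a simplified illustration, not EasyNMT's implementation: the regex splitter is a naive assumption (the library's own sentence splitting is more robust), and `str.upper` stands in for a real translation call so the sketch runs without a model.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_document(text, translate_sentence):
    """Translate each sentence on its own and join the results with spaces."""
    return " ".join(translate_sentence(s) for s in split_sentences(text))


# str.upper is a stand-in "translator" for demonstration only.
demo = translate_document("Berlin is big. It has many lakes! Really?", str.upper)
print(demo)
```

Because every sentence is translated separately, each model input stays short, which is what keeps long documents within the word-piece limit.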

# Language Detection
EasyNMT allows easy detection of the language of text. For this, we call the method `model.language_detection(text)`.

For language detection, we use [fastText](https://fasttext.cc/blog/2017/10/02/blog-post.html), which is able to recognize more than 170 languages.


In [None]:
sentences = ["This is an English sentence.", "Dies ist ein deutscher Satz.", "это русское предложение.", "这是一个中文句子。"]

for sent in sentences:
  print(sent)
  print("=> detected language:", model.language_detection(sent), "\n")
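
One use of language detection is to group mixed-language input by source language before translating, so that each group can be handled by one model. The sketch below assumes a detector that maps a sentence to a language code; `fake_detect` is a hypothetical stand-in so the example runs without a model, and `group_by_language` is not an EasyNMT function.

```python
from collections import defaultdict


def group_by_language(sentences, detect):
    """Group sentences by the language code returned by `detect`."""
    groups = defaultdict(list)
    for sent in sentences:
        groups[detect(sent)].append(sent)
    return dict(groups)


def fake_detect(sentence):
    """Hypothetical stand-in detector: 'de' if the word 'ist' occurs, else 'en'."""
    return "de" if "ist" in sentence.split() else "en"


grouped = group_by_language(["This is English.", "Dies ist Deutsch."], fake_detect)
print(grouped)
```

With a loaded model, `model.language_detection` could be passed in place of `fake_detect`.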

# Beam-Search
You can pass the beam size to the `translate()` method via the `beam_size` parameter. A larger beam size generally produces higher-quality translations but takes longer. By default, `beam_size` is set to 5.

In [None]:
import time
model = EasyNMT('opus-mt')

sentence = "Berlin ist die Hauptstadt von Deutschland und sowohl von den Einwohner als auch von der Fläche die größte Stadt in Deutschland, während Hamburg die zweit größte Stadt ist."

# Load and warm up the model
model.translate(sentence, target_lang='en', beam_size=1)

print("\nBeam-Size 1")
start_time = time.time()
print(model.translate(sentence, target_lang='en', beam_size=1))
print("Translated in {:.2f} sec".format(time.time()-start_time))

print("\nBeam-Size 10")
start_time = time.time()
print(model.translate(sentence, target_lang='en', beam_size=10))
print("Translated in {:.2f} sec".format(time.time()-start_time))
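
To see why a larger beam can find a better output, here is a toy beam search over a hand-written probability table. It is a minimal sketch of the general technique, not the decoder the translation models use: `TOY_MODEL` and its numbers are invented for illustration.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token.
TOY_MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"z": 0.9, "w": 0.1},
}


def beam_search(beam_size, steps=2):
    """Return (tokens, probability) of the best sequence found."""
    beams = [(0.0, ["<s>"])]  # each entry: (log-probability, tokens so far)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            for tok, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [tok]))
        # Keep only the `beam_size` highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    best_score, best_tokens = beams[0]
    return best_tokens[1:], math.exp(best_score)


print("beam_size=1:", beam_search(1))
print("beam_size=2:", beam_search(2))
```

With `beam_size=1` the search commits to `A` after the first step and ends with probability 0.30, while `beam_size=2` also keeps `B` alive and finds `B z` with probability 0.36. The wider beam expands more candidates per step, which is where the extra time goes.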


# Available Models


In [None]:
available_models = ['opus-mt', 'mbart50_m2m', 'm2m_100_418M']
# Note: EasyNMT also provides 'm2m_100_1.2B', but it needs too much RAM to be loaded in this notebook on the free Colab tier.
# It should work if you start a fresh Colab instance and load only 'm2m_100_1.2B'.

for model_name in available_models:
  print("\n\nLoad model:", model_name)
  model = EasyNMT(model_name)

  sentences = ['In dieser Liste definieren wir mehrere Sätze.',
              'Jeder dieser Sätze wird dann in die Zielsprache übersetzt.',
              'Puede especificar en esta lista la oración en varios idiomas.',
              'El sistema detectará automáticamente el idioma y utilizará el modelo correcto.']
  translations = model.translate(sentences, target_lang='en')

  print("Translations:")
  for sent, trans in zip(sentences, translations):
    print(sent)
    print("=>", trans, "\n")
  del model


# Translation Directions & Languages
To get all available translation directions for a model, you can simply call the following property. An entry like 'af-en' means that you can translate from *af* (Afrikaans) to *en* (English).

In [None]:
model = EasyNMT('opus-mt')
print("Language directions:")
print(sorted(model.lang_pairs))

To check which languages are supported, you can use the following method:

In [None]:
print("All Languages:")
print(model.get_languages())

print("\n\nAll languages with source_lang=en. I.e., we can translate English (en) to these languages.")
print(model.get_languages(source_lang='en'))

print("\n\nAll languages with target_lang=de. I.e., we can translate from these languages to German (de).")
print(model.get_languages(target_lang='de'))
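
Direction strings such as 'af-en' can also be turned into a lookup table from source language to reachable target languages. This is a minimal sketch: `targets_by_source` is a hypothetical helper, `example_pairs` is an invented subset, and it assumes each entry has the form `source-target` with the source code containing no hyphen.

```python
from collections import defaultdict


def targets_by_source(lang_pairs):
    """Map each source language to a sorted list of its target languages."""
    mapping = defaultdict(set)
    for pair in lang_pairs:
        # Split at the first hyphen: 'af-en' -> ('af', 'en').
        source, target = pair.split("-", 1)
        mapping[source].add(target)
    return {src: sorted(tgts) for src, tgts in mapping.items()}


example_pairs = ["af-en", "en-de", "en-fr", "de-en"]  # hypothetical subset
directions = targets_by_source(example_pairs)
print(directions["en"])
```

With a loaded model, `targets_by_source(model.lang_pairs)` should give a similar view to calling `get_languages(source_lang=...)` for each source language.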