# Machine Translation Tutorial

Machine provides a general framework for machine translation engines. It currently includes implementations for statistical MT (SMT) and neural MT (NMT). All MT engines implement the same interfaces, which gives calling applications a high level of extensibility.


In [None]:
%pip install sil-machine[thot]
%pip install transformers==4.34.0 datasets sacremoses accelerate==0.26.1 gcsfs==2023.3.0

In [None]:
!git clone https://github.com/sillsdev/machine.py.git
%cd machine.py/samples

## Statistical Machine Translation

Machine provides a phrase-based statistical machine translation engine that is based on the [Thot](https://github.com/sillsdev/thot) library. The SMT engine implemented in Thot is unique because it supports incremental training and interactive machine translation (IMT). Let's start by training an SMT model. MT models implement the `TranslationModel` interface. SMT models are trained on a parallel text corpus, so the first step is to create a `ParallelTextCorpus`.


In [6]:
from machine.corpora import TextFileTextCorpus

source_corpus = TextFileTextCorpus("data/sp.txt")
target_corpus = TextFileTextCorpus("data/en.txt")
parallel_corpus = source_corpus.align_rows(target_corpus)
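To make the idea of a parallel corpus concrete, here is a minimal pure-Python sketch of pairing rows from two texts. It assumes both sides hold one segment per line in the same order. `pair_rows` is a hypothetical helper written for this illustration; it is not part of Machine and is not how `align_rows` is implemented.

```python
from itertools import zip_longest


def pair_rows(source_lines, target_lines):
    """Pair line i of the source with line i of the target (hypothetical helper)."""
    pairs = []
    for src, tgt in zip_longest(source_lines, target_lines, fillvalue=""):
        src, tgt = src.strip(), tgt.strip()
        # Skip rows where either side is empty; they give the trainer nothing to learn from.
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


source_lines = ["Hablé con recepción.", "", "Gracias."]
target_lines = ["I spoke with reception.", "Hello.", "Thank you."]
rows = pair_rows(source_lines, target_lines)
for src, tgt in rows:
    print(f"{src} -> {tgt}")
```

The second row is dropped because its source side is empty, so only two sentence pairs remain.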

Trainers are responsible for training MT models. A trainer can be created either with its constructor or with the `create_trainer` method on the `TranslationModel` interface. Creating a trainer with the constructor is useful when you are training a new model, while `create_trainer` is useful when you are retraining an existing model. In this example, we are going to construct the trainer directly. Word alignment is at the core of SMT, and here we are going to use an HMM (hidden Markov model) for word alignment.


In [7]:
import os
import shutil
from machine.tokenization import LatinWordTokenizer
from machine.translation.thot import ThotSmtModelTrainer, ThotWordAlignmentModelType

tokenizer = LatinWordTokenizer()
os.makedirs("out/sp-en-smt", exist_ok=True)
shutil.copy("data/smt.cfg", "out/sp-en-smt/smt.cfg")
with ThotSmtModelTrainer(
    ThotWordAlignmentModelType.HMM,
    parallel_corpus,
    "out/sp-en-smt/smt.cfg",
    source_tokenizer=tokenizer,
    target_tokenizer=tokenizer,
    lowercase_source=True,
    lowercase_target=True,
) as trainer:
    print("Training model...", end="")
    trainer.train()
    print(" done.")
    print("Saving model...", end="")
    trainer.save()
    print(" done.")

Training model... done.
Saving model... done.


In order to fully translate a sentence, we need to perform pre-processing steps on the source sentence and post-processing steps on the target translation. Here are the steps to fully translate a sentence:

1. Tokenize the source sentence.
2. Lowercase the source tokens.
3. Translate the sentence.
4. Truecase the target tokens.
5. Detokenize the target tokens into a sentence.
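The steps above can be sketched as a small pipeline. Every function below is a toy stand-in written for this illustration (a whitespace tokenizer, a hypothetical word-for-word lookup table in place of the MT engine, and a trivial truecaser); none of them are Machine's implementations. The point is only to show the order in which the steps are applied.

```python
def tokenize(sentence):
    # Toy tokenizer: separate periods from words, then split on whitespace.
    return sentence.replace(".", " .").split()


def toy_translate(tokens):
    # Stand-in for the MT engine: a hypothetical word-for-word lookup table.
    table = {"hablé": "i spoke", "con": "with", "recepción": "reception", ".": "."}
    out = []
    for token in tokens:
        out.extend(table.get(token, token).split())
    return out


def truecase(tokens):
    # Toy truecaser: capitalize the first token and the pronoun "i".
    return [t.capitalize() if i == 0 or t == "i" else t for i, t in enumerate(tokens)]


def detokenize(tokens):
    # Toy detokenizer: join with spaces, then attach periods to the previous word.
    return " ".join(tokens).replace(" .", ".")


def translate_sentence(sentence):
    tokens = tokenize(sentence)            # 1. tokenize the source sentence
    tokens = [t.lower() for t in tokens]   # 2. lowercase the source tokens
    target = toy_translate(tokens)         # 3. translate
    target = truecase(target)              # 4. truecase the target tokens
    return detokenize(target)              # 5. detokenize into a sentence


print(translate_sentence("Hablé con recepción."))
```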

Truecasing is the process of properly capitalizing a lowercased sentence. Luckily, Machine provides a statistical truecaser that can learn the capitalization rules for a language. The next step is to train the truecaser model.


In [3]:
from machine.translation import UnigramTruecaserTrainer

with UnigramTruecaserTrainer("out/sp-en-smt/en.truecase.txt", target_corpus, tokenizer=tokenizer) as trainer:
    trainer.train()
    trainer.save()
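To give some intuition for what a statistical truecaser learns, here is a minimal sketch of a unigram approach: remember the most frequent cased form of each lowercased token. This is a simplification under stated assumptions, not a description of the internals of `UnigramTruecaser`; the function names `train_truecaser` and `apply_truecaser` are hypothetical.

```python
from collections import Counter, defaultdict


def train_truecaser(tokenized_sentences):
    """Count cased forms per lowercased token, skipping sentence-initial tokens."""
    counts = defaultdict(Counter)
    for tokens in tokenized_sentences:
        # The first token is capitalized because of its position, so it is not counted.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    # Keep only the most frequent cased form of each token.
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def apply_truecaser(model, tokens):
    # Unknown tokens are left as they are.
    cased = [model.get(t, t) for t in tokens]
    if cased:
        # Always capitalize the first token of the sentence.
        cased[0] = cased[0][:1].upper() + cased[0][1:]
    return cased


corpus = [
    ["We", "met", "Maria", "in", "Madrid", "."],
    ["Then", "Maria", "and", "I", "left", "Madrid", "."],
]
model = train_truecaser(corpus)
print(apply_truecaser(model, ["maria", "and", "i", "like", "madrid", "."]))
```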

Now that we have a trained SMT model and a trained truecasing model, we are ready to translate sentences. First, we need to load the SMT model. The model can then be used to translate sentences using the `translate` method. A `TranslationResult` instance is returned when a text segment is translated. In addition to the translated segment, `TranslationResult` contains further information about the translated sentence, such as the word confidences, alignment, phrases, and source/target tokens.

In [8]:
from machine.translation import UnigramTruecaser
from machine.translation.thot import ThotSmtModel
from machine.tokenization import LatinWordDetokenizer

truecaser = UnigramTruecaser("out/sp-en-smt/en.truecase.txt")
detokenizer = LatinWordDetokenizer()

with ThotSmtModel(
    ThotWordAlignmentModelType.HMM,
    "out/sp-en-smt/smt.cfg",
    source_tokenizer=tokenizer,
    target_tokenizer=tokenizer,
    target_detokenizer=detokenizer,
    truecaser=truecaser,
    lowercase_source=True,
    lowercase_target=True,
) as model:
    result = model.translate("Desearía reservar una habitación hasta mañana.")
    print("Translation:", result.translation)
    print("Source tokens:", result.source_tokens)
    print("Target tokens:", result.target_tokens)
    print("Alignment:", result.alignment)
    print("Confidences:", result.confidences)

Translation: I would like to book a room until tomorrow.
Source tokens: ['Desearía', 'reservar', 'una', 'habitación', 'hasta', 'mañana', '.']
Target tokens: ['I', 'would', 'like', 'to', 'book', 'a', 'room', 'until', 'tomorrow', '.']
Alignment: 0-1 0-2 1-3 1-4 2-5 3-6 4-7 5-8 6-9
Confidences: [0.1833474940416596, 0.3568307371510516, 0.3556863860951534, 0.2894564705698258, 0.726984900023586, 0.8915912178040876, 0.878754356224247, 0.8849444691927844, 0.8458962922106739, 0.8975745812873857]
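The printed alignment is easier to read once it is mapped back onto the tokens. The sketch below works on the values copied from the output above rather than on a live `TranslationResult`. It assumes each `i-j` pair means that source token `i` is aligned to target token `j`, which is consistent with the tokens shown; `parse_alignment` is a hypothetical helper.

```python
source_tokens = ["Desearía", "reservar", "una", "habitación", "hasta", "mañana", "."]
target_tokens = ["I", "would", "like", "to", "book", "a", "room", "until", "tomorrow", "."]
alignment_text = "0-1 0-2 1-3 1-4 2-5 3-6 4-7 5-8 6-9"


def parse_alignment(text):
    """Turn 'i-j' pairs into (source_index, target_index) tuples."""
    return [tuple(int(n) for n in pair.split("-")) for pair in text.split()]


pairs = parse_alignment(alignment_text)

# Target tokens that no source token points to.
aligned_targets = {j for _, j in pairs}
unaligned = [t for j, t in enumerate(target_tokens) if j not in aligned_targets]

for i, j in pairs:
    print(f"{source_tokens[i]} -> {target_tokens[j]}")
print("Unaligned target tokens:", unaligned)
```

Read this way, "Desearía" is aligned to both "would" and "like", and the pronoun "I" has no aligned source token.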


## Interactive Machine Translation

`ThotSmtModel` also supports interactive machine translation. Under this paradigm, the engine assists a human translator by providing translation suggestions based on what the user has translated so far. This paradigm can be coupled with incremental training to provide a model that is constantly learning from translator input. Models and engines must implement the `InteractiveTranslationModel` and `InteractiveTranslationEngine` interfaces to support IMT. The IMT paradigm is implemented in the `InteractiveTranslator` class. The `approve` method on `InteractiveTranslator` performs incremental training using the current prefix. Suggestions are generated from translations using a class that implements the `TranslationSuggester` interface.

In [5]:
from machine.translation import PhraseTranslationSuggester, InteractiveTranslatorFactory

suggester = PhraseTranslationSuggester()

def get_current_suggestion(translator):
    suggestion = next(iter(suggester.get_suggestions_from_translator(1, translator)), None)
    suggestion_text = "" if suggestion is None else detokenizer.detokenize(suggestion.target_words)
    if len(translator.prefix) == 0:
        suggestion_text = suggestion_text.capitalize()
    prefix_text = translator.prefix.strip()
    if len(prefix_text) > 0:
        prefix_text = prefix_text + " "
    return f"{prefix_text}[{suggestion_text}]"


with ThotSmtModel(
    ThotWordAlignmentModelType.HMM,
    "out/sp-en-smt/smt.cfg",
    source_tokenizer=tokenizer,
    target_tokenizer=tokenizer,
    target_detokenizer=detokenizer,
    truecaser=truecaser,
    lowercase_source=True,
    lowercase_target=True,
) as model:
    factory = InteractiveTranslatorFactory(model, target_tokenizer=tokenizer, target_detokenizer=detokenizer)

    source_sentence = "Hablé con recepción."
    print("Source:", source_sentence)
    translator = factory.create(source_sentence)

    suggestion = get_current_suggestion(translator)
    print("Suggestion:", suggestion)

    translator.append_to_prefix("I spoke ")
    suggestion = get_current_suggestion(translator)
    print("Suggestion:", suggestion)

    translator.append_to_prefix("with reception.")
    suggestion = get_current_suggestion(translator)
    print("Suggestion:", suggestion)
    translator.approve(aligned_only=False)
    print()

    source_sentence = "Hablé hasta cinco en punto."
    print("Source:", source_sentence)
    translator = factory.create(source_sentence)

    suggestion = get_current_suggestion(translator)
    print("Suggestion:", suggestion)

    translator.append_to_prefix("I spoke until five o'clock.")
    suggestion = get_current_suggestion(translator)
    print("Suggestion:", suggestion)

Source: Hablé con recepción.
Suggestion: [With reception]
Suggestion: I spoke [with reception]
Suggestion: I spoke with reception. []

Source: Hablé hasta cinco en punto.
Suggestion: [I spoke until five o'clock]
Suggestion: I spoke until five o'clock. []
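The core idea of prefix-based suggestion can be illustrated without the engine. The sketch below is a deliberately naive version: if the user's prefix matches the start of a fixed hypothesis, the rest of the hypothesis is offered as the suggestion. It is not the algorithm used by Thot or by `PhraseTranslationSuggester`, which search for translations compatible with the prefix and take confidences into account; `suggest_completion` is a hypothetical helper.

```python
def suggest_completion(hypothesis_tokens, prefix_tokens):
    """Return the part of the hypothesis that extends the user's prefix."""
    n = len(prefix_tokens)
    lowered_prefix = [t.lower() for t in prefix_tokens]
    lowered_start = [t.lower() for t in hypothesis_tokens[:n]]
    if lowered_prefix == lowered_start:
        return hypothesis_tokens[n:]
    # A real IMT engine would look for a new hypothesis that fits the prefix.
    # This sketch simply offers nothing.
    return []


hypothesis = ["i", "spoke", "with", "reception", "."]
print(suggest_completion(hypothesis, []))
print(suggest_completion(hypothesis, ["I", "spoke"]))
print(suggest_completion(hypothesis, ["We"]))
```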


## Neural Machine Translation

Machine also supports neural machine translation through the Hugging Face [Transformers](https://huggingface.co/docs/transformers/en/index) library. The Hugging Face NMT engine implements the same interfaces as the SMT engine, so you can train the engine and run inference with it using the same API.

Let's start by fine-tuning an NMT model using `HuggingFaceNmtModelTrainer`. One thing to note is that Hugging Face models typically have an associated tokenizer. The trainer handles tokenization for us, so we don't have to tokenize the corpus. We need to specify the base model and the training arguments. For this example, we will fine-tune an M2M100 model. We also need to specify the source and target languages using the language codes that the model expects.

In [9]:
# ignore transformers warnings
import warnings
from transformers.utils import logging as transformers_logging

warnings.simplefilter(action='ignore', category=FutureWarning)
transformers_logging.set_verbosity_error()

In [4]:
from transformers import Seq2SeqTrainingArguments
from machine.translation.huggingface import HuggingFaceNmtModelTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="out/sp-en-nmt", overwrite_output_dir=True, num_train_epochs=1, report_to=[], disable_tqdm=False
)

with HuggingFaceNmtModelTrainer(
    "facebook/m2m100_418M",
    training_args,
    parallel_corpus,
    src_lang="es",
    tgt_lang="en",
    add_unk_src_tokens=False,
    add_unk_tgt_tokens=False,
) as trainer:
    trainer.train()
    trainer.save()

Using custom data configuration default-d72fb4ece0e4f60a
Found cached dataset generator (C:/Users/damie/.cache/huggingface/datasets/generator/default-d72fb4ece0e4f60a/0.0.0)


{'train_runtime': 435.3667, 'train_samples_per_second': 2.297, 'train_steps_per_second': 0.287, 'train_loss': 0.27040533447265624, 'epoch': 1.0}
***** train metrics *****
  epoch                    =        1.0
  train_loss               =     0.2704
  train_runtime            = 0:07:15.36
  train_samples            =       1000
  train_samples_per_second =      2.297
  train_steps_per_second   =      0.287
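The throughput figures in the metrics can be checked with a little arithmetic. This sketch assumes the run used the default `per_device_train_batch_size` of 8 on a single device with no gradient accumulation, since the training arguments above do not override those settings. The sample count and runtime are copied from the log.

```python
import math

train_samples = 1000
train_runtime_s = 435.3667
num_train_epochs = 1
per_device_batch_size = 8  # assumption: the Transformers default, on one device

# One optimizer step per batch, so the step count is the number of batches.
steps_per_epoch = math.ceil(train_samples / per_device_batch_size)
total_steps = steps_per_epoch * num_train_epochs

samples_per_second = train_samples * num_train_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print("total steps:", total_steps)
print("samples per second:", round(samples_per_second, 3))
print("steps per second:", round(steps_per_second, 3))
```

Under these assumptions the computed rates agree with the logged `train_samples_per_second` and `train_steps_per_second`.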


Now that the model is trained, let's try using it to translate some sentences. We use `HuggingFaceNmtEngine` to load the model and run inference.

In [10]:
from machine.translation.huggingface import HuggingFaceNmtEngine

with HuggingFaceNmtEngine("out/sp-en-nmt", src_lang="es", tgt_lang="en") as engine:
    result = engine.translate("Desearía reservar una habitación hasta mañana.")
    print("Translation:", result.translation)
    print("Source tokens:", result.source_tokens)
    print("Target tokens:", result.target_tokens)
    print("Alignment:", result.alignment)
    print("Confidences:", result.confidences)

Translation: I would like to book a room until tomorrow.
Source tokens: ['▁D', 'ese', 'aría', '▁res', 'ervar', '▁una', '▁hab', 'itación', '▁hasta', '▁mañana', '.']
Target tokens: ['▁I', '▁would', '▁like', '▁to', '▁book', '▁a', '▁room', '▁until', '▁tom', 'orrow', '.']
Alignment: 1-2 2-0 2-1 3-4 4-3 5-5 6-6 8-7 9-8 9-9 10-10
Confidences: [0.9995167207904968, 0.9988614185814005, 0.9995524502931971, 0.9861009574421602, 0.9987220427038153, 0.998968593209302, 0.9944791909715244, 0.9989702587912649, 0.9749540518542505, 0.9996603689253716, 0.9930446924545876]
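Unlike the SMT output, the tokens here are subword pieces, so the alignment and confidences are reported per piece rather than per word. The sketch below, which uses the target tokens copied from the output above, assumes the usual SentencePiece-style convention that a leading `▁` marks the start of a new word. `join_pieces` and `group_into_words` are hypothetical helpers, not part of Machine.

```python
target_tokens = ["▁I", "▁would", "▁like", "▁to", "▁book", "▁a", "▁room", "▁until", "▁tom", "orrow", "."]


def join_pieces(pieces):
    """Concatenate pieces, turning each '▁' marker into a space."""
    return "".join(pieces).replace("▁", " ").strip()


def group_into_words(pieces):
    """Group piece indices into words; a piece starting with '▁' opens a new word."""
    words = []
    for index, piece in enumerate(pieces):
        if piece.startswith("▁") or not words:
            words.append([index])
        else:
            words[-1].append(index)
    return words


print(join_pieces(target_tokens))
print(group_into_words(target_tokens))
```

Note that under this convention the final period is grouped with the pieces of "tomorrow", because it does not start with `▁`.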


You can also run inference on a pretrained Hugging Face model without fine-tuning. Let's translate the same sentence using NLLB-200. Note that NLLB-200 expects different language codes (`spa_Latn`, `eng_Latn`) than M2M100 (`es`, `en`).

In [11]:
with HuggingFaceNmtEngine("facebook/nllb-200-distilled-600M", src_lang="spa_Latn", tgt_lang="eng_Latn") as engine:
    result = engine.translate("Desearía reservar una habitación hasta mañana.")
    print("Translation:", result.translation)
    print("Source tokens:", result.source_tokens)
    print("Target tokens:", result.target_tokens)
    print("Alignment:", result.alignment)
    print("Confidences:", result.confidences)

Translation: I'd like to reserve a room for tomorrow.
Source tokens: ['▁Dese', 'aría', '▁reser', 'var', '▁una', '▁habitación', '▁hasta', '▁mañana', '.']
Target tokens: ['▁I', "'", 'd', '▁like', '▁to', '▁reserve', '▁a', '▁room', '▁for', '▁tomorrow', '.']
Alignment: 0-1 0-3 1-0 1-2 2-5 5-6 5-7 6-8 7-9 8-4 8-10
Confidences: [0.766540320750896, 0.5910241514763206, 0.8868627789322919, 0.8544048979056736, 0.8613305047447863, 0.45655845183164, 0.8814725030368357, 0.8585703155792751, 0.3142652857171965, 0.8780149028315941, 0.8617016651426532]
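The confidences can be used to flag the parts of a translation that may deserve a second look. The sketch below works on the tokens and confidences copied from the output above. The cutoff of 0.5 is an arbitrary value chosen for this illustration, not a recommended setting.

```python
target_tokens = ["▁I", "'", "d", "▁like", "▁to", "▁reserve", "▁a", "▁room", "▁for", "▁tomorrow", "."]
confidences = [
    0.766540320750896, 0.5910241514763206, 0.8868627789322919, 0.8544048979056736,
    0.8613305047447863, 0.45655845183164, 0.8814725030368357, 0.8585703155792751,
    0.3142652857171965, 0.8780149028315941, 0.8617016651426532,
]

THRESHOLD = 0.5  # hypothetical cutoff for this example

# Pair each token with its confidence and keep the ones below the cutoff.
low_confidence = [
    (token, round(confidence, 3))
    for token, confidence in zip(target_tokens, confidences)
    if confidence < THRESHOLD
]

for token, confidence in low_confidence:
    print(f"{token}: {confidence}")
```

The two flagged tokens are `▁reserve` and `▁for`. These are also the places where this translation differs from the fine-tuned model's output, which used "book" and "until".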
