# Machine Translation Tutorial

Machine provides a general framework for machine translation engines. It currently includes implementations of rule-based MT, statistical MT (SMT), and neural MT (NMT). All MT engines implement the same interfaces, which gives calling applications a high level of extensibility.


In [1]:
#r "nuget:SIL.Scripture,12.0.1"
#r "nuget:Thot"
#r "nuget:Nito.AsyncEx"
#r "../src/SIL.Machine/bin/Debug/netstandard2.0/SIL.Machine.dll"
#r "../src/SIL.Machine.Morphology.HermitCrab/bin/Debug/netstandard2.0/SIL.Machine.Morphology.HermitCrab.dll"
#r "../src/SIL.Machine.Translation.Thot/bin/Debug/netstandard2.0/SIL.Machine.Translation.Thot.dll"

void Write(string text)
{
    Console.Write(text);
}

void WriteLine(string text = "")
{
    Console.Write(text + "\n");
}

## Statistical Machine Translation

Machine provides a phrase-based statistical machine translation engine based on the [Thot](https://github.com/sillsdev/thot) library. The SMT engine implemented in Thot is unusual in that it supports incremental training and interactive machine translation (IMT). Let's start by training an SMT model. MT models implement the `ITranslationModel` interface. SMT models are trained on a parallel text corpus, so the first step is to create a `ParallelTextCorpus`.


In [2]:
using SIL.Machine.Corpora;
using SIL.Machine.Tokenization;

var sourceCorpus = new TextFileTextCorpus("data/sp.txt");
var targetCorpus = new TextFileTextCorpus("data/en.txt");
var parallelCorpus = sourceCorpus.AlignRows(targetCorpus);

Trainers are responsible for training MT models. A trainer can be created either with its constructor or with the `CreateTrainer` method on the `ITranslationModel` interface. The constructor is useful when you are training a new model, while `CreateTrainer` is useful when you are retraining an existing model. In this example, we construct the trainer directly. Word alignment is at the core of SMT, and here we use an HMM word alignment model.


In [3]:
using System.IO;
using SIL.Machine.Translation.Thot;
using SIL.Machine.Utils;

var tokenizer = new LatinWordTokenizer();
Directory.CreateDirectory("out/sp-en");
File.Copy("data/smt.cfg", "out/sp-en/smt.cfg", overwrite: true);
{
    using var trainer = new ThotSmtModelTrainer(ThotWordAlignmentModelType.Hmm, parallelCorpus, "out/sp-en/smt.cfg")
    {
        SourceTokenizer = tokenizer,
        TargetTokenizer = tokenizer,
        LowercaseSource = true,
        LowercaseTarget = true
    };

    Write("Training model...");
    await trainer.TrainAsync();
    WriteLine(" done.");
    Write("Saving model...");
    await trainer.SaveAsync();
    WriteLine(" done.");
}

Training model... done.
Saving model... done.


To fully translate a sentence, we need to pre-process the source sentence and post-process the target translation. The complete sequence of steps is:

1. Tokenize the source sentence.
2. Lowercase the source tokens.
3. Translate the sentence.
4. Truecase the target tokens.
5. Detokenize the target tokens into a sentence.
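
The five steps above can be sketched without any Machine types. The following is a minimal, self-contained illustration only: `toyLexicon` and `TranslateSentence` are hypothetical names, the "model" is a word-for-word lookup table, and the truecasing step merely capitalizes the first token. It is not how `ThotSmtModel` is implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Toy lexicon standing in for a trained translation model (hypothetical data).
var toyLexicon = new Dictionary<string, string>
{
    ["hola"] = "hello",
    ["mundo"] = "world"
};

string TranslateSentence(string source)
{
    // 1. Tokenize: words and punctuation marks become separate tokens.
    List<string> tokens = Regex.Matches(source, @"\w+|[^\w\s]")
        .Cast<Match>()
        .Select(m => m.Value)
        .ToList();

    // 2. Lowercase the source tokens.
    tokens = tokens.Select(t => t.ToLowerInvariant()).ToList();

    // 3. Translate: word-for-word lookup; unknown tokens pass through unchanged.
    tokens = tokens
        .Select(t => toyLexicon.TryGetValue(t, out string target) ? target : t)
        .ToList();

    // 4. Truecase: this toy version only capitalizes the first token.
    if (tokens.Count > 0 && tokens[0].Length > 0)
        tokens[0] = char.ToUpperInvariant(tokens[0][0]) + tokens[0].Substring(1);

    // 5. Detokenize: join with spaces, attaching punctuation to the preceding token.
    var sb = new StringBuilder();
    foreach (string token in tokens)
    {
        bool isPunctuation = token.Length == 1 && char.IsPunctuation(token[0]);
        if (sb.Length > 0 && !isPunctuation)
            sb.Append(' ');
        sb.Append(token);
    }
    return sb.ToString();
}

Console.WriteLine(TranslateSentence("Hola, Mundo.")); // prints "Hello, world."
```

In Machine, each of these steps is handled by a configurable component (tokenizer, truecaser, detokenizer) rather than by inline code like this.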

Truecasing is the process of properly capitalizing a lowercased sentence. Luckily, Machine provides a statistical truecaser that can learn the capitalization rules for a language. The next step is to train the truecaser model.


In [4]:
using SIL.Machine.Translation;

{
    var truecaser = new UnigramTruecaser("out/sp-en/en.truecase.txt");
    using var trainer = truecaser.CreateTrainer(targetCorpus);
    await trainer.TrainAsync();
    await trainer.SaveAsync();
}
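
To give an intuition for what a statistical truecaser learns, here is a minimal sketch of one possible unigram approach: count how each word is cased in the training text (ignoring sentence-initial tokens, which are capitalized regardless of the word), then restore the most frequent casing. The names `casingCounts`, `TrainTruecaser`, and `Truecase` are hypothetical, and this is an assumption about the general technique, not a description of the internals of `UnigramTruecaser`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Counts of each observed casing, keyed by the lowercased token.
var casingCounts = new Dictionary<string, Dictionary<string, int>>();

void TrainTruecaser(IEnumerable<string[]> sentences)
{
    foreach (string[] tokens in sentences)
    {
        // Skip the first token: it is capitalized regardless of the word itself.
        foreach (string token in tokens.Skip(1))
        {
            string key = token.ToLowerInvariant();
            if (!casingCounts.TryGetValue(key, out var counts))
                casingCounts[key] = counts = new Dictionary<string, int>();
            counts[token] = counts.TryGetValue(token, out int n) ? n + 1 : 1;
        }
    }
}

string[] Truecase(string[] lowercasedTokens)
{
    // Pick the most frequent casing seen in training; unseen tokens stay as-is.
    string[] result = lowercasedTokens
        .Select(t => casingCounts.TryGetValue(t, out var counts)
            ? counts.OrderByDescending(kv => kv.Value).First().Key
            : t)
        .ToArray();

    // The sentence-initial token is always capitalized.
    if (result.Length > 0 && result[0].Length > 0)
        result[0] = char.ToUpperInvariant(result[0][0]) + result[0].Substring(1);
    return result;
}

TrainTruecaser(new[]
{
    new[] { "The", "hotel", "is", "in", "Madrid", "." },
    new[] { "I", "saw", "Madrid", "in", "May", "." }
});

string[] cased = Truecase(new[] { "we", "visited", "madrid", "in", "may", "." });
Console.WriteLine(string.Join(" ", cased)); // prints "We visited Madrid in May ."
```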

Now that we have a trained SMT model and a trained truecasing model, we are ready to translate sentences. First, we need to load the SMT model. The model can then be used to translate sentences using the `TranslateAsync` method.


In [5]:
var truecaser = new UnigramTruecaser("out/sp-en/en.truecase.txt");
var detokenizer = new LatinWordDetokenizer();

{   
    using var model = new ThotSmtModel(ThotWordAlignmentModelType.Hmm, "out/sp-en/smt.cfg")
    {
        SourceTokenizer = tokenizer,
        TargetTokenizer = tokenizer,
        TargetDetokenizer = detokenizer,
        Truecaser = truecaser,
        LowercaseSource = true,
        LowercaseTarget = true
    };

    var result = await model.TranslateAsync("Desearía reservar una habitación hasta mañana.");
    WriteLine(result.Translation);
}

I would like to book a room until tomorrow.


`ThotSmtModel` also supports interactive machine translation. Under this paradigm, the engine assists a human translator by providing translation suggestions based on what the user has translated so far. This paradigm can be coupled with incremental training to provide a model that is constantly learning from translator input. Models and engines must implement the `IInteractiveTranslationModel` and `IInteractiveTranslationEngine` interfaces to support IMT. The IMT paradigm is implemented in the `InteractiveTranslator` class. The `ApproveAsync` method on `InteractiveTranslator` performs incremental training using the current prefix. Suggestions are generated from translations using a class that implements the `ITranslationSuggester` interface.


In [9]:
var suggester = new PhraseTranslationSuggester();
string GetCurrentSuggestion(InteractiveTranslator translator)
{
    var suggestion = suggester.GetSuggestions(1, translator).FirstOrDefault();
    var suggestionText = suggestion is null ? "" : detokenizer.Detokenize(suggestion.TargetWords);
    if (translator.Prefix.Length == 0)
        suggestionText = suggestionText.Capitalize();
    var prefixText = translator.Prefix.Trim();
    if (prefixText.Length > 0)
        prefixText = prefixText + " ";
    return $"{prefixText}[{suggestionText}]";
}

{
    using var model = new ThotSmtModel(ThotWordAlignmentModelType.Hmm, "out/sp-en/smt.cfg")
    {
        SourceTokenizer = tokenizer,
        TargetTokenizer = tokenizer,
        TargetDetokenizer = detokenizer,
        Truecaser = truecaser,
        LowercaseSource = true,
        LowercaseTarget = true
    };
    var factory = new InteractiveTranslatorFactory(model)
    {
        TargetTokenizer = tokenizer,
        TargetDetokenizer = detokenizer
    };

    var sourceSentence = "Hablé con recepción.";
    WriteLine($"Source: {sourceSentence}");
    var translator = await factory.CreateAsync(sourceSentence);

    var suggestion = GetCurrentSuggestion(translator);
    WriteLine($"Suggestion: {suggestion}");

    translator.AppendToPrefix("I spoke ");
    suggestion = GetCurrentSuggestion(translator);
    WriteLine($"Suggestion: {suggestion}");

    translator.AppendToPrefix("with reception.");
    suggestion = GetCurrentSuggestion(translator);
    WriteLine($"Suggestion: {suggestion}");
    await translator.ApproveAsync(alignedOnly: false);
    WriteLine();

    sourceSentence = "Hablé hasta cinco en punto.";
    WriteLine($"Source: {sourceSentence}");
    translator = await factory.CreateAsync(sourceSentence);

    suggestion = GetCurrentSuggestion(translator);
    WriteLine($"Suggestion: {suggestion}");

    translator.AppendToPrefix("I spoke until five o'clock.");
    suggestion = GetCurrentSuggestion(translator);
    WriteLine($"Suggestion: {suggestion}");
}

Source: Hablé con recepción.
Suggestion: [With reception]
Suggestion: I spoke [with reception]
Suggestion: I spoke with reception. []

Source: Hablé hasta cinco en punto.
Suggestion: [I spoke until five o'clock]
Suggestion: I spoke until five o'clock. []
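
The interplay between the prefix, the suggestions, and incremental training can be imitated with a deliberately naive toy. In the sketch below, `imtLexicon`, `SplitWords`, `Suggest`, and `Approve` are all hypothetical names; the "model" is a small lookup table, and the update rule (if exactly one source word is unknown, map it to the target words that the lexicon cannot account for) is a simplifying assumption made for illustration. It bears no relation to how Thot actually aligns words or updates its model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy phrase table standing in for the SMT model (hypothetical entries).
var imtLexicon = new Dictionary<string, string>
{
    ["con"] = "with",
    ["recepción"] = "reception"
};

// Lowercase the text and split it on spaces and periods.
string[] SplitWords(string text) =>
    text.ToLowerInvariant().Split(new[] { ' ', '.' }, StringSplitOptions.RemoveEmptyEntries);

// Suggest the known translations that the prefix does not contain yet.
string Suggest(string source, string prefix)
{
    string padded = " " + string.Join(" ", SplitWords(prefix)) + " ";
    IEnumerable<string> remaining = SplitWords(source)
        .Where(w => imtLexicon.ContainsKey(w))
        .Select(w => imtLexicon[w])
        .Where(t => !padded.Contains(" " + t + " "));
    return string.Join(" ", remaining);
}

// Incremental update: if exactly one source word is unknown, map it to the
// target words that the lexicon cannot account for.
void Approve(string source, string target)
{
    string leftover = " " + string.Join(" ", SplitWords(target)) + " ";
    var unknown = new List<string>();
    foreach (string word in SplitWords(source))
    {
        if (imtLexicon.TryGetValue(word, out string known))
            leftover = leftover.Replace(" " + known + " ", " ");
        else
            unknown.Add(word);
    }
    leftover = leftover.Trim();
    if (unknown.Count == 1 && leftover.Length > 0)
        imtLexicon[unknown[0]] = leftover;
}

Console.WriteLine(Suggest("Hablé con recepción.", "I spoke "));  // prints "with reception"
Approve("Hablé con recepción.", "I spoke with reception.");
Console.WriteLine(Suggest("Hablé hasta cinco en punto.", ""));   // prints "i spoke"
```

As in the notebook output above, the toy cannot suggest anything for "Hablé" until the first sentence has been approved, after which the learned phrase carries over to the second sentence.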


## Rule-based Machine Translation

Machine provides an implementation of a simple, transfer-based MT engine. Transfer-based MT consists of three steps:

1. Analysis: source words are segmented into morphemes.
2. Transfer: source morphemes are converted to the equivalent target morphemes.
3. Synthesis: the target morphemes are combined into target words.
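
The three steps can be illustrated with a tiny hand-written example. In this sketch, the dictionaries `sourceRoots`, `sourceSuffixes`, and `targetForms`, and the functions `Analyze` and `TransferWord`, are hypothetical stand-ins: analysis is a naive root-plus-suffix split, transfer matches morphemes by gloss, and synthesis is plain concatenation. A real morphological parser such as HermitCrab handles far more than this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Source-language morphemes mapped to glosses (hypothetical toy data).
var sourceRoots = new Dictionary<string, string>
{
    ["dios"] = "god", ["cre"] = "create", ["el"] = "the", ["mundo"] = "world"
};
var sourceSuffixes = new Dictionary<string, string> { ["ó"] = "PAST" };

// Target-language morpheme forms keyed by gloss (hypothetical toy data).
var targetForms = new Dictionary<string, string>
{
    ["god"] = "god", ["create"] = "create", ["the"] = "the", ["world"] = "world", ["PAST"] = "d"
};

// 1. Analysis: segment a source word into morphemes, returned as glosses.
List<string> Analyze(string word)
{
    if (sourceRoots.TryGetValue(word, out string wholeGloss))
        return new List<string> { wholeGloss };
    for (int i = 1; i < word.Length; i++)
    {
        if (sourceRoots.TryGetValue(word.Substring(0, i), out string rootGloss)
            && sourceSuffixes.TryGetValue(word.Substring(i), out string suffixGloss))
        {
            return new List<string> { rootGloss, suffixGloss };
        }
    }
    return null; // the word could not be analyzed
}

string TransferWord(string word)
{
    List<string> glosses = Analyze(word);
    if (glosses == null)
        return word; // unanalyzable words pass through untranslated

    // 2. Transfer: look up the target morpheme with the same gloss.
    IEnumerable<string> targetMorphemes = glosses
        .Select(g => targetForms.TryGetValue(g, out string form) ? form : g);

    // 3. Synthesis: combine the target morphemes into a target word.
    return string.Concat(targetMorphemes);
}

string[] sourceWords = { "dios", "creó", "el", "mundo" };
Console.WriteLine(string.Join(" ", sourceWords.Select(TransferWord))); // prints "god created the world"
```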

The `TransferEngine` class implements this process. HermitCrab, a rule-based morphological parser, can be used to perform the analysis and synthesis steps. The HermitCrab parser implementation is provided in the [SIL.Machine.Morphology.HermitCrab](https://www.nuget.org/packages/SIL.Machine.Morphology.HermitCrab/) package. In this example, the transfer is performed using simple gloss matching.


In [7]:
using SIL.Machine.Morphology.HermitCrab;

var hcTraceManager = new TraceManager();

Language srcLang = XmlLanguageLoader.Load("data/sp-hc.xml");
var srcMorpher = new Morpher(hcTraceManager, srcLang);

Language trgLang = XmlLanguageLoader.Load("data/en-hc.xml");
var trgMorpher = new Morpher(hcTraceManager, trgLang);

var transferer = new SimpleTransferer(new GlossMorphemeMapper(trgMorpher));

{
    using var transferEngine = new TransferEngine(srcMorpher, transferer, trgMorpher)
    {
        SourceTokenizer = tokenizer,
        TargetDetokenizer = detokenizer,
        Truecaser = truecaser,
        LowercaseSource = true
    };

    var result = await transferEngine.TranslateAsync("Dios creó el mundo.");
    WriteLine(result.Translation.Capitalize());
}

God created the world.


## Hybrid Machine Translation

Machine includes a hybrid machine translation approach that allows you to merge the translation results from a rule-based engine and a data-driven engine. The translation from the data-driven engine is the base translation. Any words or phrases in the base translation that have a low score are replaced by the translations from the rule-based engine. This hybrid approach is implemented in the `HybridTranslationEngine` class.


In [8]:
{
    using var smtModel = new ThotSmtModel(ThotWordAlignmentModelType.Hmm, "out/sp-en/smt.cfg")
    {
        SourceTokenizer = tokenizer,
        TargetTokenizer = tokenizer,
        TargetDetokenizer = detokenizer,
        Truecaser = truecaser,
        LowercaseSource = true,
        LowercaseTarget = true
    };

    using var transferEngine = new TransferEngine(srcMorpher, transferer, trgMorpher)
    {
        SourceTokenizer = tokenizer,
        TargetDetokenizer = detokenizer,
        Truecaser = truecaser,
        LowercaseSource = true
    };

    using var hybridEngine = new HybridTranslationEngine(smtModel, transferEngine)
    {
        TargetDetokenizer = detokenizer
    };

    var sourceSentence = "Por favor, haga dos cuentas.";
    var result = await smtModel.TranslateAsync(sourceSentence);
    WriteLine($"SMT: {result.Translation.Capitalize()}");

    result = await transferEngine.TranslateAsync(sourceSentence);
    WriteLine($"Transfer: {result.Translation.Capitalize()}");

    result = await hybridEngine.TranslateAsync(sourceSentence);
    WriteLine($"Hybrid: {result.Translation.Capitalize()}");
}

SMT: Please make out two cuentas.
Transfer: Por favor, haga dos bills.
Hybrid: Please make out two bills.
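
The merging idea behind the output above can be sketched in a few lines. This is an illustrative assumption about the general approach, not the actual logic of `HybridTranslationEngine`: the scores, the `0.5` threshold, and the names `baseTranslation`, `ruleTranslations`, and `MergeTranslations` are all made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Base (data-driven) translation: one entry per source word with a made-up confidence score.
var baseTranslation = new (string Source, string Target, double Score)[]
{
    ("haga", "make out", 0.8),
    ("dos", "two", 0.9),
    ("cuentas", "cuentas", 0.0) // unknown word passed through untranslated
};

// Rule-based translations for the same source words (hypothetical).
var ruleTranslations = new Dictionary<string, string> { ["cuentas"] = "bills" };

string MergeTranslations(
    IEnumerable<(string Source, string Target, double Score)> baseWords,
    IReadOnlyDictionary<string, string> rules,
    double threshold)
{
    // Keep the base translation unless its score is low and a rule-based one exists.
    IEnumerable<string> merged = baseWords.Select(w =>
        w.Score < threshold && rules.TryGetValue(w.Source, out string ruleTarget)
            ? ruleTarget
            : w.Target);
    return string.Join(" ", merged);
}

Console.WriteLine(MergeTranslations(baseTranslation, ruleTranslations, threshold: 0.5));
// prints "make out two bills"
```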
