# Word Alignment Tutorial

In this notebook, we will demonstrate how to use Machine to train statistical word alignment models and then use them to predict alignments between sentences. Machine uses the [Thot](https://github.com/sillsdev/thot) library to implement word alignment models. These classes are enabled by installing the `sil-machine` package with the `thot` optional dependency. Machine provides implementations of the common statistical models, including the well-known IBM models (1-4), HMM, and FastAlign.

## Setting up a parallel text corpus

The first step in word alignment is setting up a parallel text corpus. Word alignment models are unsupervised, so they require only a parallel text corpus for training; manually created alignments are not necessary. Let's create a parallel corpus from the source and target monolingual corpora and tokenize it.

In [1]:
from machine.corpora import ParatextTextCorpus
from machine.tokenization import LatinWordTokenizer

source_corpus = ParatextTextCorpus("data/VBL-PT")
target_corpus = ParatextTextCorpus("data/WEB-PT")
parallel_corpus = source_corpus.align_rows(target_corpus).tokenize(LatinWordTokenizer())
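
To make the idea of a tokenized parallel corpus concrete, here is a minimal pure-Python sketch. It assumes each corpus is a mapping from a segment reference to text and that rows are paired when both corpora contain the same reference. The names `simple_tokenize` and `align_rows_sketch` are hypothetical, the data is a toy example, and the regular-expression tokenizer is only a stand-in for `LatinWordTokenizer`; none of this is Machine's actual implementation.

```python
import re


def simple_tokenize(text):
    # Stand-in tokenizer (assumption): word runs and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def align_rows_sketch(source, target):
    # Keep only references present in both corpora, in source order.
    rows = []
    for ref, source_text in source.items():
        if ref in target:
            rows.append((ref, simple_tokenize(source_text), simple_tokenize(target[ref])))
    return rows


# Toy data: "v2" has no counterpart in the target, so it is dropped.
source = {"v1": "la casa.", "v2": "la flor."}
target = {"v1": "the house.", "v3": "the book."}

rows = align_rows_sketch(source, target)
for ref, source_tokens, target_tokens in rows:
    print(ref, source_tokens, target_tokens)
```

Each resulting row holds a pair of token lists, which is the shape of input a word alignment model trains on.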

## Simple word alignment

The easiest way to align a parallel corpus is to use the `word_align_corpus` function. The function trains the model and aligns the corpus. The alignment for each row is stored in the `aligned_word_pairs` property as a collection of `AlignedWordPair` instances. By default, the `word_align_corpus` function uses FastAlign.

In [2]:
from machine.translation import word_align_corpus
from machine.corpora import AlignedWordPair

aligned_corpus = word_align_corpus(parallel_corpus.lowercase())
for row in aligned_corpus.take(5):
    print("Source:", row.source_text)
    print("Target:", row.target_text)
    print("Alignment:", AlignedWordPair.to_string(row.aligned_word_pairs, include_scores=False))

Source: esta carta trata sobre la palabra de vida que existía desde el principio , que hemos escuchado , que hemos visto con nuestros propios ojos y le hemos contemplado , y que hemos tocado con nuestras manos .
Target: that which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the word of life
Alignment: 0-0 2-1 4-4 5-34 6-6 7-36 8-7 9-8 10-8 12-5 13-12 14-13 15-9 15-10 15-14 15-15 15-16 16-11 18-22 19-20 20-17 21-18 22-19 26-24 27-23 27-25 28-26 29-26 30-27 31-28 32-29 32-30 33-31 35-28 37-32
Source: esta vida nos fue revelada . la vimos y damos testimonio de ella . estamos hablándoles de aquél que es la vida eterna , que estaba con el padre , y que nos fue revelado .
Target: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the father , and was revealed to us ) ;
Alignment: 0-2 1-3 2-8 3-4 3-5 4-7 5-7 6-6 7-7 
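
The printed alignments are compact strings of index pairs. As a rough illustration of how to read them, the sketch below parses such a string and looks up the corresponding words, assuming each item has the form `source_index-target_index` with zero-based token positions. The helper `parse_alignment` is hypothetical and not part of Machine; the sentences and the two pairs are taken from the first row of the output above.

```python
def parse_alignment(text):
    # "5-34 7-36" -> [(5, 34), (7, 36)]; each item is assumed to be "source-target".
    pairs = []
    for item in text.split():
        source_index, target_index = item.split("-")
        pairs.append((int(source_index), int(target_index)))
    return pairs


# First eight source tokens and the full target sentence from the first row.
source = "esta carta trata sobre la palabra de vida".split()
target = (
    "that which was from the beginning , that which we have heard , "
    "that which we have seen with our eyes , that which we saw , and our "
    "hands touched , concerning the word of life"
).split()

word_pairs = [(source[i], target[j]) for i, j in parse_alignment("5-34 7-36")]
print(word_pairs)  # [('palabra', 'word'), ('vida', 'life')]
```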

You can choose a different word alignment model with the `aligner` parameter. Let's align the same corpus using IBM-1.

In [3]:
aligned_corpus = word_align_corpus(parallel_corpus.lowercase(), aligner="ibm1")
for row in aligned_corpus.take(5):
    print("Source:", row.source_text)
    print("Target:", row.target_text)
    print("Alignment:", AlignedWordPair.to_string(row.aligned_word_pairs, include_scores=False))

Source: esta carta trata sobre la palabra de vida que existía desde el principio , que hemos escuchado , que hemos visto con nuestros propios ojos y le hemos contemplado , y que hemos tocado con nuestras manos .
Target: that which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the word of life
Alignment: 1-25 2-25 3-32 4-4 4-33 5-34 6-35 7-36 8-0 12-5 13-6 15-10 16-11 19-20 20-17 21-18 22-19 25-27 35-9
Source: esta vida nos fue revelada . la vimos y damos testimonio de ella . estamos hablándoles de aquél que es la vida eterna , que estaba con el padre , y que nos fue revelado .
Target: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the father , and was revealed to us ) ;
Alignment: 0-17 1-3 2-36 3-4 4-0 6-38 7-0 8-1 9-0 10-0 11-13 12-18 14-8 18-9 22-23 23-6 26-28 27-2 28-29 28-30 33-4 34-5
Source: los que hem

## Training models

Now, let's get into more advanced scenarios.

In this tutorial, we will start by training an IBM-1 model. There are two ways to train a model. First, we will demonstrate training from a model class. All alignment models inherit from the `WordAlignmentModel` abstract base class, which makes it easier to swap out different models in your code. We use the `create_trainer` method to create a trainer object that trains the model. If we do not specify a file path when creating the model object, the model exists only in memory: when we call the `save` method, the model instance is updated with the trained model parameters, but nothing is written to disk.

The corpus must be preprocessed before training. It was already tokenized when we built the parallel corpus, and here we also lowercase it, since that generally gives better results.

In [4]:
from machine.translation.thot import ThotIbm1WordAlignmentModel

model = ThotIbm1WordAlignmentModel()
trainer = model.create_trainer(parallel_corpus.lowercase())
trainer.train(lambda status: print(f"Training IBM-1 model: {status.percent_completed:.2%}"))
trainer.save()

Training IBM-1 model: 0.00%
Training IBM-1 model: 16.67%
Training IBM-1 model: 33.33%
Training IBM-1 model: 50.00%
Training IBM-1 model: 66.67%
Training IBM-1 model: 83.33%
Training IBM-1 model: 100.00%
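
For intuition about what the trainer estimates, here is a toy expectation-maximization loop for IBM Model 1 in pure Python. It is a textbook-style sketch, not Thot's implementation: the function name `train_ibm1`, the number of iterations, the use of `None` as the NULL source word, and the convention that the table stores P(target word | source word) are all assumptions made for this example.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=5):
    # pairs: list of (source_tokens, target_tokens).
    # Returns a table t[(source_word, target_word)] = P(target_word | source_word).
    target_vocab = {w for _, target in pairs for w in target}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform distribution
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words (plus NULL).
        for source, target in pairs:
            source_with_null = [None] + list(source)
            for tw in target:
                norm = sum(t[(sw, tw)] for sw in source_with_null)
                for sw in source_with_null:
                    fraction = t[(sw, tw)] / norm
                    count[(sw, tw)] += fraction
                    total[sw] += fraction
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(float, {(sw, tw): c / total[sw] for (sw, tw), c in count.items()})
    return t


pairs = [
    (["la", "casa"], ["the", "house"]),
    (["la", "flor"], ["the", "flower"]),
    (["casa"], ["house"]),
]
t = train_ibm1(pairs)
print(f"casa -> house: {t[('casa', 'house')]:.4f}")
print(f"casa -> the:   {t[('casa', 'the')]:.4f}")
```

Even on three sentence pairs, the probability mass for `casa` shifts toward `house` because the two words co-occur more consistently than `casa` and `the`.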


The other option is to construct a trainer object directly. This approach is useful when you are only interested in training the model and saving it to disk for later use. We specify where the model will be saved when constructing the trainer, and the model is written there when we call the `save` method.

In [5]:
import os
from machine.translation.thot import ThotWordAlignmentModelTrainer, ThotWordAlignmentModelType

os.makedirs("out/VBL-WEB-IBM1", exist_ok=True)
trainer = ThotWordAlignmentModelTrainer(
    ThotWordAlignmentModelType.IBM1, parallel_corpus.lowercase(), "out/VBL-WEB-IBM1/src_trg"
)

trainer.train(lambda status: print(f"Training IBM-1 model: {status.percent_completed:.2%}"))
trainer.save()
print("IBM-1 model saved")

Training IBM-1 model: 0.00%
Training IBM-1 model: 16.67%
Training IBM-1 model: 33.33%
Training IBM-1 model: 50.00%
Training IBM-1 model: 66.67%
Training IBM-1 model: 83.33%
Training IBM-1 model: 100.00%
IBM-1 model saved


## Aligning parallel sentences

Now that we have a trained alignment model, we can find the best alignment for a pair of parallel sentences by calling the `get_best_alignment` method. The result is returned as a `WordAlignmentMatrix` object.

In [6]:
model = ThotIbm1WordAlignmentModel("out/VBL-WEB-IBM1/src_trg")
for row in parallel_corpus.lowercase().take(5):
    alignment = model.get_best_alignment(row.source_segment, row.target_segment)

    print("Source:", row.source_text)
    print("Target:", row.target_text)
    print("Alignment:", str(alignment))

Source: esta carta trata sobre la palabra de vida que existía desde el principio , que hemos escuchado , que hemos visto con nuestros propios ojos y le hemos contemplado , y que hemos tocado con nuestras manos .
Target: that which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the word of life
Alignment: 2-2 2-20 2-25 2-29 2-30 3-32 4-4 4-33 5-34 6-35 7-36 12-3 12-5 13-6 13-12 13-21 13-26 13-31 15-0 15-7 15-10 15-13 15-16 15-22 16-1 16-8 16-11 16-14 16-23 20-17 21-18 22-19 22-28 25-27 35-9 35-15 35-24
Source: esta vida nos fue revelada . la vimos y damos testimonio de ella . estamos hablándoles de aquél que es la vida eterna , que estaba con el padre , y que nos fue revelado .
Target: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the father , and was revealed to us ) ;
Alignment: 0-17 0-35 1-3 1-20 1-24 2-36
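
In the output above, a source index can appear in several pairs while each target index appears at most once. That pattern is what you would expect if every target word independently picks its most probable source word, which is how the best alignment is usually described for IBM Model 1. The sketch below illustrates that idea under stated assumptions: `best_alignment_ibm1` is a hypothetical helper, the probability table is made up, and `None` stands for the NULL word. It is not the code Machine or Thot runs.

```python
def best_alignment_ibm1(source, target, t):
    # For each target position, pick the source position whose translation
    # probability is highest; leave the word unaligned if NULL (None) wins.
    alignment = []
    for j, target_word in enumerate(target):
        best_i = None
        best_p = t.get((None, target_word), 0.0)
        for i, source_word in enumerate(source):
            p = t.get((source_word, target_word), 0.0)
            if p > best_p:
                best_i, best_p = i, p
        if best_i is not None:
            alignment.append((best_i, j))
    return sorted(alignment)


# Made-up probabilities, keyed as (source_word, target_word).
t = {
    ("la", "the"): 0.7,
    ("casa", "the"): 0.1,
    ("casa", "house"): 0.8,
    ("casa", "is"): 0.1,
    (None, "the"): 0.2,
    (None, "is"): 0.5,
}
source = ["la", "casa"]
target = ["the", "house", "is"]

alignment = best_alignment_ibm1(source, target, t)
print(" ".join(f"{i}-{j}" for i, j in alignment))  # 0-0 1-1
```

Here `is` stays unaligned because the NULL word explains it better than any source word in the toy table.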

Word alignment models also provide the `get_best_alignment_batch` method, which aligns a batch of parallel sentences. This can be faster than aligning one sentence pair at a time.

In [7]:
segment_batch = list(parallel_corpus.lowercase().take(5).to_tuples())
alignments = model.get_best_alignment_batch(segment_batch)

for (source_segment, target_segment), alignment in zip(segment_batch, alignments):
    print("Source:", " ".join(source_segment))
    print("Target:", " ".join(target_segment))
    print("Alignment:", str(alignment))

Source: esta carta trata sobre la palabra de vida que existía desde el principio , que hemos escuchado , que hemos visto con nuestros propios ojos y le hemos contemplado , y que hemos tocado con nuestras manos .
Target: that which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the word of life
Alignment: 2-2 2-20 2-25 2-29 2-30 3-32 4-4 4-33 5-34 6-35 7-36 12-3 12-5 13-6 13-12 13-21 13-26 13-31 15-0 15-7 15-10 15-13 15-16 15-22 16-1 16-8 16-11 16-14 16-23 20-17 21-18 22-19 22-28 25-27 35-9 35-15 35-24
Source: esta vida nos fue revelada . la vimos y damos testimonio de ella . estamos hablándoles de aquél que es la vida eterna , que estaba con el padre , y que nos fue revelado .
Target: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the father , and was revealed to us ) ;
Alignment: 0-17 0-35 1-3 1-20 1-24 2-36

## Getting model probabilities

A statistical word alignment model consists of one or more conditional probability distributions that are estimated from the training data. For example, most models estimate a word translation probability distribution that can be queried to obtain the probability that a source word is a translation of a target word. Each model class has methods to obtain these probabilities. Let's get some translation probabilities from the IBM-1 model that we trained by calling the `get_translation_probability` method. To get the probability that a word does not translate to anything, pass `None` instead of the word string.

In [8]:
model = ThotIbm1WordAlignmentModel("out/VBL-WEB-IBM1/src_trg")
prob = model.get_translation_probability("es", "is")
print(f"es -> is: {prob:.4f}")
prob = model.get_translation_probability(None, "that")
print(f"NULL -> that: {prob:.4f}")

es -> is: 0.2720
NULL -> that: 0.0516


It can also be useful to obtain a score for an entire alignment. The `get_avg_translation_score` method can be used to compute the average translation probability for an alignment.

In [9]:
segment_batch = list(parallel_corpus.lowercase().take(5).to_tuples())
alignments = model.get_best_alignment_batch(segment_batch)

for (source_segment, target_segment), alignment in zip(segment_batch, alignments):
    print("Source:", " ".join(source_segment))
    print("Target:", " ".join(target_segment))
    print("Score:", round(model.get_avg_translation_score(source_segment, target_segment, alignment), 4))

Source: esta carta trata sobre la palabra de vida que existía desde el principio , que hemos escuchado , que hemos visto con nuestros propios ojos y le hemos contemplado , y que hemos tocado con nuestras manos .
Target: that which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the word of life
Score: 0.1396
Source: esta vida nos fue revelada . la vimos y damos testimonio de ella . estamos hablándoles de aquél que es la vida eterna , que estaba con el padre , y que nos fue revelado .
Target: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the father , and was revealed to us ) ;
Score: 0.1628
Source: los que hemos visto y oído eso mismo les contamos , para que también puedan participar de esta amistad junto a nosotros . esta amistad con el padre y su hijo jesucristo .
Target: that which we have seen and heard we
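
As a rough picture of what an average translation score means, the sketch below averages the word translation probabilities over the aligned pairs of a sentence. This is a simplification under assumptions: `avg_translation_score` is a hypothetical helper with a made-up probability table, and the library's `get_avg_translation_score` may treat unaligned words differently.

```python
def avg_translation_score(source, target, alignment, t):
    # Mean translation probability over the aligned (source, target) word pairs.
    probs = [t.get((source[i], target[j]), 0.0) for i, j in alignment]
    return sum(probs) / len(probs) if probs else 0.0


# Made-up probabilities, keyed as (source_word, target_word).
t = {("la", "the"): 0.7, ("casa", "house"): 0.8}
source = ["la", "casa"]
target = ["the", "house"]
alignment = [(0, 0), (1, 1)]

score = avg_translation_score(source, target, alignment, t)
print("Score:", round(score, 4))
```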

## Symmetrized alignment models

Most statistical word alignment models are directional and asymmetric. This means that they can only model one-to-one and one-to-many alignments in one direction. They cannot model many-to-many alignments, which occur in some language pairs. One way around this limitation is to train models in both directions (source-to-target and target-to-source) and then merge the alignments from the two models into a single alignment. This is called symmetrization and is a common practice when using statistical word alignment models. In addition, researchers have found that symmetrized alignments are generally of better quality than alignments from a single direction.
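
The two simplest ways to merge directional alignments are intersection (keep only links both models agree on) and union (keep every link from either model). The sketch below shows both on toy data. The function `symmetrize` and its string argument are hypothetical names for this example; heuristics from the grow-diag family start from the intersection and add selected links from the union, and are not implemented here.

```python
def symmetrize(direct, inverse, heuristic="intersection"):
    # direct: (source_index, target_index) links from the source-to-target model.
    # inverse: (target_index, source_index) links from the target-to-source model.
    direct_links = set(direct)
    inverse_links = {(s, t) for t, s in inverse}  # flip into source-target order
    if heuristic == "intersection":
        return sorted(direct_links & inverse_links)
    if heuristic == "union":
        return sorted(direct_links | inverse_links)
    raise ValueError(f"unknown heuristic: {heuristic}")


# Toy alignments: the direct model links source word 1 to two target words,
# while the inverse model links each target word to a different source word.
direct = [(0, 0), (1, 1), (1, 2)]
inverse = [(0, 0), (1, 1), (2, 2)]

print("Intersection:", symmetrize(direct, inverse, "intersection"))
# Intersection: [(0, 0), (1, 1)]
print("Union:", symmetrize(direct, inverse, "union"))
# Union: [(0, 0), (1, 1), (1, 2), (2, 2)]
```

The intersection tends to favor precision and the union favors recall, which is why heuristics that sit between the two are popular.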

Machine provides a special word alignment model class for symmetrization, `ThotSymmetrizedWordAlignmentModel`. Let's demonstrate how to use this class. First, we will train the symmetrized model using the `SymmetrizedWordAlignmentModelTrainer` class.

In [10]:
from machine.translation import SymmetrizedWordAlignmentModelTrainer

src_trg_trainer = ThotWordAlignmentModelTrainer(
    ThotWordAlignmentModelType.IBM1, parallel_corpus.lowercase(), "out/VBL-WEB-IBM1/src_trg"
)
trg_src_trainer = ThotWordAlignmentModelTrainer(
    ThotWordAlignmentModelType.IBM1, parallel_corpus.invert().lowercase(), "out/VBL-WEB-IBM1/trg_src"
)
symmetrized_trainer = SymmetrizedWordAlignmentModelTrainer(src_trg_trainer, trg_src_trainer)
symmetrized_trainer.train(lambda status: print(f"{status.message}: {status.percent_completed:.2%}"))
symmetrized_trainer.save()
print("Symmetrized IBM-1 model saved")

Training direct alignment model: 0.00%
Training direct alignment model: 0.00%
Training direct alignment model: 8.33%
Training direct alignment model: 16.67%
Training direct alignment model: 25.00%
Training direct alignment model: 33.33%
Training direct alignment model: 41.67%
Training direct alignment model: 50.00%
Training inverse alignment model: 50.00%
Training inverse alignment model: 50.00%
Training inverse alignment model: 58.33%
Training inverse alignment model: 66.67%
Training inverse alignment model: 75.00%
Training inverse alignment model: 83.33%
Training inverse alignment model: 91.67%
Training inverse alignment model: 100.00%
Symmetrized IBM-1 model saved


The model can also be trained using the `create_trainer` method on `ThotSymmetrizedWordAlignmentModel`.

Now that we've trained the symmetrized model, let's obtain some alignments. Machine supports many different symmetrization heuristics. The heuristic used when merging alignments is set through the `heuristic` property. In this case, we will use the `GROW_DIAG_FINAL_AND` heuristic.

In [11]:
from machine.translation import SymmetrizationHeuristic
from machine.translation.thot import ThotSymmetrizedWordAlignmentModel

src_trg_model = ThotIbm1WordAlignmentModel("out/VBL-WEB-IBM1/src_trg")
trg_src_model = ThotIbm1WordAlignmentModel("out/VBL-WEB-IBM1/trg_src")
symmetrized_model = ThotSymmetrizedWordAlignmentModel(src_trg_model, trg_src_model)
symmetrized_model.heuristic = SymmetrizationHeuristic.GROW_DIAG_FINAL_AND

segment_batch = list(parallel_corpus.lowercase().take(5).to_tuples())
alignments = symmetrized_model.get_best_alignment_batch(segment_batch)

for (source_segment, target_segment), alignment in zip(segment_batch, alignments):
    print("Source:", " ".join(source_segment))
    print("Target:", " ".join(target_segment))
    print("Alignment:", alignment)

Source: esta carta trata sobre la palabra de vida que existía desde el principio , que hemos escuchado , que hemos visto con nuestros propios ojos y le hemos contemplado , y que hemos tocado con nuestras manos .
Target: that which was from the beginning , that which we have heard , that which we have seen with our eyes , that which we saw , and our hands touched , concerning the word of life
Alignment: 1-25 2-25 3-32 4-4 4-33 5-34 6-35 7-36 8-0 12-5 13-6 15-10 16-11 19-20 20-17 21-18 22-19 25-27 35-9
Source: esta vida nos fue revelada . la vimos y damos testimonio de ella . estamos hablándoles de aquél que es la vida eterna , que estaba con el padre , y que nos fue revelado .
Target: ( and the life was revealed , and we have seen , and testify , and declare to you the life , the eternal life , which was with the father , and was revealed to us ) ;
Alignment: 0-17 1-3 2-36 3-4 4-0 6-38 7-0 8-1 9-0 10-0 11-13 12-18 14-8 18-9 22-23 23-6 26-28 27-2 28-29 28-30 33-4 34-5
Source: los que hem