# Word Alignment Tutorial

In this notebook, we demonstrate how to use Machine to train statistical word alignment models and then use them to predict alignments between sentences. Machine uses the [Thot](https://github.com/sillsdev/thot) library to implement word alignment models. The classes are enabled by installing the `sil-machine` package with the `thot` optional dependency. Machine has implementations of the common statistical models, including the IBM models (1-4), HMM, and FastAlign.

To get this to work with custom data, the following tags were edited in the `Settings.xml` file of each translation: `Naming`, `FileNameBookNameForm`, `FileNamePostPart`, and `BooksPresent`. It is unclear to me which of these edits were strictly necessary; the remaining tags can be left as is. It is also unclear whether the biblical data must be in .SFM format or whether other formats are accepted.

## Setting up a parallel text corpus

The first step in word alignment is setting up a parallel text corpus. Word alignment models are unsupervised, so they only require a parallel text corpus to train. Manually created alignments are not necessary. So let's create a parallel corpus from the source and target monolingual corpora.

In [1]:
!pip install sil-machine[thot]






In [5]:
from machine.corpora import ParatextTextCorpus
from machine.tokenization import LatinWordTokenizer

source_corpus = ParatextTextCorpus("data/VBL-PT")
target_corpus = ParatextTextCorpus("data/WEB-PT")
parallel_corpus = source_corpus.align_rows(target_corpus).tokenize(LatinWordTokenizer())
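
To make the idea of a parallel corpus concrete, here is a minimal sketch in plain Python that does not use Machine. It pairs two toy monolingual corpora by a shared verse reference and splits each text on whitespace. The `align_rows_by_ref` helper and the sample data are hypothetical illustrations; Machine's own `align_rows` and `LatinWordTokenizer` handle many more cases.

```python
from typing import Dict, List, Tuple

# Hypothetical toy corpora keyed by verse reference (not real Machine objects).
source = {"GEN 1:1": "En un principio", "GEN 1:2": "Pero la tierra"}
target = {"GEN 1:1": "In the beginning", "GEN 1:3": "And God said"}


def align_rows_by_ref(
    src: Dict[str, str], trg: Dict[str, str]
) -> List[Tuple[str, List[str], List[str]]]:
    """Pair rows that share a reference and tokenize both sides on whitespace."""
    rows = []
    for ref in sorted(src.keys() & trg.keys()):
        rows.append((ref, src[ref].split(), trg[ref].split()))
    return rows


parallel_rows = align_rows_by_ref(source, target)
for ref, src_tokens, trg_tokens in parallel_rows:
    print(ref, src_tokens, trg_tokens)
```

Only rows present on both sides survive, which is why a model can train on such a corpus without any manual alignments.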

## Simple word alignment

The easiest way to align a parallel corpus is to use the `word_align_corpus` function. The function will train the model and align the corpus. The alignment will be stored in the `aligned_word_pairs` property as a collection of `AlignedWordPair` instances. By default, the `word_align_corpus` function uses FastAlign. 

In [6]:
from machine.translation import word_align_corpus
from machine.corpora import AlignedWordPair

aligned_corpus = word_align_corpus(parallel_corpus.lowercase())
for row in aligned_corpus.take(5):
    print("Source:", row.source_text)
    print("Target:", row.target_text)
    print("Alignment:", AlignedWordPair.to_string(row.aligned_word_pairs, include_scores=False))

Source: en un principio ʼelohim creó los cielos y la tierra .
Target: in the beginning god created the heavens and the earth .
Alignment: 0-0 1-1 2-2 3-3 4-4 5-4 5-5 6-6 7-7 8-8 9-9 10-10
Source: pero la tierra estaba desolada y vacía , y había oscuridad sobre la superficie del abismo . el espíritu de ʼelohim se movía sobre la superficie de las aguas .
Target: now the earth was formless and void , and darkness was over the surface of the deep . and the spirit of god was hovering over the surface of the waters .
Alignment: 0-0 1-1 2-2 3-3 4-4 5-5 6-5 6-6 7-7 8-8 10-9 11-11 12-12 13-13 14-15 15-15 15-16 16-17 17-19 18-19 18-20 19-21 20-22 22-24 23-25 24-26 25-27 26-28 27-29 28-30 29-31
Source: entonces ʼelohim dijo : haya luz . y hubo luz .
Target: and god said , “ let there be light , ” and there was light .
Alignment: 0-0 0-2 1-1 2-3 2-4 3-4 3-5 4-6 4-7 5-8 6-10 7-9 7-11 8-12 9-14 10-13 10-15
Source: ʼelohim vio que la luz era buena e hizo separación entre la luz y la oscuridad .
Targe
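
The alignment strings printed above list `source-target` token index pairs, counting from zero. As a minimal sketch (the `parse_alignment` helper is hypothetical and not part of Machine), such a string can be turned back into word pairs with plain Python:

```python
from typing import List, Tuple


def parse_alignment(alignment: str) -> List[Tuple[int, int]]:
    """Parse a string such as "0-0 1-1" into (source_index, target_index) pairs."""
    pairs = []
    for token in alignment.split():
        src, trg = token.split("-")
        pairs.append((int(src), int(trg)))
    return pairs


# First sentence pair and its FastAlign alignment, copied from the output above.
source_tokens = "en un principio ʼelohim creó los cielos y la tierra .".split()
target_tokens = "in the beginning god created the heavens and the earth .".split()
alignment = "0-0 1-1 2-2 3-3 4-4 5-4 5-5 6-6 7-7 8-8 9-9 10-10"

word_pairs = [(source_tokens[i], target_tokens[j]) for i, j in parse_alignment(alignment)]
print(word_pairs[:4])
```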

You can choose a different word alignment model by setting the `aligner` parameter. Let's align the same corpus using IBM-1.

In [7]:
aligned_corpus = word_align_corpus(parallel_corpus.lowercase(), aligner="ibm1")
for row in aligned_corpus.take(5):
    print("Source:", row.source_text)
    print("Target:", row.target_text)
    print("Alignment:", AlignedWordPair.to_string(row.aligned_word_pairs, include_scores=False))

Source: en un principio ʼelohim creó los cielos y la tierra .
Target: in the beginning god created the heavens and the earth .
Alignment: 0-2 1-2 2-2 3-3 4-4 5-0 6-0 7-7 8-1 8-8 9-9 10-10
Source: pero la tierra estaba desolada y vacía , y había oscuridad sobre la superficie del abismo . el espíritu de ʼelohim se movía sobre la superficie de las aguas .
Target: now the earth was formless and void , and darkness was over the surface of the deep . and the spirit of god was hovering over the surface of the waters .
Alignment: 0-0 0-4 1-3 2-2 3-3 5-5 7-7 10-9 14-11 16-17 17-1 19-14 20-22 21-21 27-30 28-30
Source: entonces ʼelohim dijo : haya luz . y hubo luz .
Target: and god said , “ let there be light , ” and there was light .
Alignment: 0-5 1-1 2-2 2-4 3-2 3-3 4-7 5-8 5-14 6-15 7-0 8-6
Source: ʼelohim vio que la luz era buena e hizo separación entre la luz y la oscuridad .
Target: and god saw that the light was good , and he separated the light from the darkness .
Alignment: 0-1 1-2 1-3 

## Training models

Now, let's get into more advanced scenarios.

In this tutorial, we start by training an IBM-1 model. There are two ways to train a model. First, we will demonstrate training a model from a class that inherits from `WordAlignmentModel`. All alignment models implement the `WordAlignmentModel` abstract base class, which makes it easier to swap out different models in your code.

We use the `create_trainer` method to create a trainer object that trains the model. If we do not specify a file path when creating the model object, the model exists only in memory: when we call the `save` method, the model instance is updated with the trained parameters, but nothing is written to disk.

The corpus must be preprocessed before training. It was already tokenized when we built `parallel_corpus` above; we also lowercase the data, since that generally gives better results.

In [8]:
from machine.translation.thot import ThotIbm1WordAlignmentModel
from machine.translation import SymmetrizationHeuristic

model = ThotIbm1WordAlignmentModel()
trainer = model.create_trainer(parallel_corpus.lowercase())
trainer.train(lambda status: print(f"Training IBM-1 model: {status.percent_completed:.2%}"))
trainer.save()

Training IBM-1 model: 0.00%
Training IBM-1 model: 16.67%
Training IBM-1 model: 33.33%
Training IBM-1 model: 50.00%
Training IBM-1 model: 66.67%
Training IBM-1 model: 83.33%
Training IBM-1 model: 100.00%
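
For intuition about what such training estimates, the following is a textbook-style sketch of IBM Model 1 trained with expectation-maximization in plain Python. It is not Thot's implementation: the `train_ibm1` function, the `NULL` token name, the iteration count, and the toy corpus are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Corpus = List[Tuple[List[str], List[str]]]
NULL = "NULL"  # placeholder source token for target words that align to nothing


def train_ibm1(corpus: Corpus, iterations: int = 5) -> Dict[Tuple[str, str], float]:
    """Estimate t(target_word | source_word) with the EM procedure of IBM Model 1."""
    corpus = [([NULL] + src, trg) for src, trg in corpus]
    target_vocab = {word for _, trg in corpus for word in trg}
    uniform = 1.0 / len(target_vocab)
    # Start from a uniform distribution over target words.
    t: Dict[Tuple[str, str], float] = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count: Dict[Tuple[str, str], float] = defaultdict(float)
        total: Dict[str, float] = defaultdict(float)
        for src, trg in corpus:
            for trg_word in trg:
                # E-step: share each target word among the source words of its sentence.
                norm = sum(t[(src_word, trg_word)] for src_word in src)
                for src_word in src:
                    frac = t[(src_word, trg_word)] / norm
                    count[(src_word, trg_word)] += frac
                    total[src_word] += frac
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {pair: c / total[pair[0]] for pair, c in count.items()})
    return t


toy_corpus = [
    (["la", "casa"], ["the", "house"]),
    (["la", "flor"], ["the", "flower"]),
    (["casa"], ["house"]),
]
t = train_ibm1(toy_corpus)
print(f"casa -> house: {t[('casa', 'house')]:.4f}")
print(f"casa -> the: {t[('casa', 'the')]:.4f}")
```

Because "casa" co-occurs with "house" more consistently than with "the", the estimated probability for "house" ends up higher, even though no alignments were supplied.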


The other option is to construct a trainer object directly. This is useful when you only want to train the model and save it to disk for later use. We specify where the model will be saved when constructing the trainer; the files are written when we call the `save` method.

In [9]:
import os
from machine.translation.thot import ThotWordAlignmentModelTrainer, ThotWordAlignmentModelType

os.makedirs("out/VBL-WEB-IBM1", exist_ok=True)
trainer = ThotWordAlignmentModelTrainer(
    ThotWordAlignmentModelType.IBM1, parallel_corpus.lowercase(), "out/VBL-WEB-IBM1/src_trg"
)

trainer.train(lambda status: print(f"Training IBM-1 model: {status.percent_completed:.2%}"))
trainer.save()
print("IBM-1 model saved")

Training IBM-1 model: 0.00%
Training IBM-1 model: 16.67%
Training IBM-1 model: 33.33%
Training IBM-1 model: 50.00%
Training IBM-1 model: 66.67%
Training IBM-1 model: 83.33%
Training IBM-1 model: 100.00%
IBM-1 model saved


## Aligning parallel sentences

Now that we have a trained alignment model, we can find the best alignment for a parallel sentence. We call the `align` method to find the best alignment. The result is returned as a `WordAlignmentMatrix` object.

In [10]:
model = ThotIbm1WordAlignmentModel("out/VBL-WEB-IBM1/src_trg")
for row in parallel_corpus.lowercase().take(5):
    alignment = model.align(row.source_segment, row.target_segment)

    print("Source:", row.source_text)
    print("Target:", row.target_text)
    print("Alignment:", str(alignment))

Source: en un principio ʼelohim creó los cielos y la tierra .
Target: in the beginning god created the heavens and the earth .
Alignment: 2-2 2-6 3-3 4-4 6-0 7-7 8-1 8-5 8-8 9-9 10-10
Source: pero la tierra estaba desolada y vacía , y había oscuridad sobre la superficie del abismo . el espíritu de ʼelohim se movía sobre la superficie de las aguas .
Target: now the earth was formless and void , and darkness was over the surface of the deep . and the spirit of god was hovering over the surface of the waters .
Alignment: 0-0 0-4 0-6 0-13 0-16 0-20 0-24 0-27 2-2 3-3 3-10 3-23 5-5 5-8 5-18 7-7 10-9 14-11 14-25 16-17 16-31 17-1 17-12 17-15 17-19 17-26 17-29 20-22 21-14 21-21 21-28 28-30
Source: entonces ʼelohim dijo : haya luz . y hubo luz .
Target: and god said , “ let there be light , ” and there was light .
Alignment: 0-5 0-10 1-1 2-2 2-4 3-3 3-9 4-7 5-8 5-14 6-15 7-0 7-11 7-13 8-6 8-12
Source: ʼelohim vio que la luz era buena e hizo separación entre la luz y la oscuridad .
Target: and go

Word alignment models also provide the `align_batch` method, which aligns a batch of parallel sentences. This can be faster than aligning one parallel sentence at a time.

In [11]:
segment_batch = list(parallel_corpus.lowercase().take(5))
alignments = model.align_batch(segment_batch)

for (source_segment, target_segment), alignment in zip(segment_batch, alignments):
    print("Source:", " ".join(source_segment))
    print("Target:", " ".join(target_segment))
    print("Alignment:", str(alignment))

Source: en un principio ʼelohim creó los cielos y la tierra .
Target: in the beginning god created the heavens and the earth .
Alignment: 2-2 2-6 3-3 4-4 6-0 7-7 8-1 8-5 8-8 9-9 10-10
Source: pero la tierra estaba desolada y vacía , y había oscuridad sobre la superficie del abismo . el espíritu de ʼelohim se movía sobre la superficie de las aguas .
Target: now the earth was formless and void , and darkness was over the surface of the deep . and the spirit of god was hovering over the surface of the waters .
Alignment: 0-0 0-4 0-6 0-13 0-16 0-20 0-24 0-27 2-2 3-3 3-10 3-23 5-5 5-8 5-18 7-7 10-9 14-11 14-25 16-17 16-31 17-1 17-12 17-15 17-19 17-26 17-29 20-22 21-14 21-21 21-28 28-30
Source: entonces ʼelohim dijo : haya luz . y hubo luz .
Target: and god said , “ let there be light , ” and there was light .
Alignment: 0-5 0-10 1-1 2-2 2-4 3-3 3-9 4-7 5-8 5-14 6-15 7-0 7-11 7-13 8-6 8-12
Source: ʼelohim vio que la luz era buena e hizo separación entre la luz y la oscuridad .
Target: and go

## Getting model probabilities

A statistical word alignment model consists of one or more conditional probability distributions that are estimated from the training data. For example, most models estimate a word translation probability distribution that can be queried to obtain the probability that a source word is a translation of a target word. Each model class has methods to obtain these probabilities. Let's try getting some translation probabilities from the IBM-1 model that we trained by calling the `get_translation_probability` method. In order to get the probability that a word does not translate to anything, you can pass `None` instead of the word string.

In [12]:
model = ThotIbm1WordAlignmentModel("out/VBL-WEB-IBM1/src_trg")
prob = model.get_translation_probability("es", "is")
print(f"es -> is: {prob:.4f}")
prob = model.get_translation_probability(None, "that")
print(f"NULL -> that: {prob:.4f}")

es -> is: 0.0056
NULL -> that: 0.0071


It can also be useful to obtain a score for an entire alignment. The `get_avg_translation_score` method can be used to compute the average translation probability for an alignment.

In [13]:
segment_batch = list(parallel_corpus.lowercase().take(5))
alignments = model.align_batch(segment_batch)

for (source_segment, target_segment), alignment in zip(segment_batch, alignments):
    print("Source:", " ".join(source_segment))
    print("Target:", " ".join(target_segment))
    print("Score:", round(model.get_avg_translation_score(source_segment, target_segment, alignment), 4))

Source: en un principio ʼelohim creó los cielos y la tierra .
Target: in the beginning god created the heavens and the earth .
Score: 0.1898
Source: pero la tierra estaba desolada y vacía , y había oscuridad sobre la superficie del abismo . el espíritu de ʼelohim se movía sobre la superficie de las aguas .
Target: now the earth was formless and void , and darkness was over the surface of the deep . and the spirit of god was hovering over the surface of the waters .
Score: 0.1209
Source: entonces ʼelohim dijo : haya luz . y hubo luz .
Target: and god said , “ let there be light , ” and there was light .
Score: 0.1965
Source: ʼelohim vio que la luz era buena e hizo separación entre la luz y la oscuridad .
Target: and god saw that the light was good , and he separated the light from the darkness .
Score: 0.158
Source: ʼelohim llamó a la luz día y a la oscuridad llamó noche . y fue la tarde y fue la mañana : día primero .
Target: god called the light “ day , ” and the darkness he called “ 
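
One plausible reading of an "average translation score" is the mean of the word translation probabilities over the aligned word pairs. The sketch below implements that reading with a hypothetical probability table and a hypothetical `avg_translation_score` helper. It is an assumption for illustration and may differ from how `get_avg_translation_score` is computed in Machine.

```python
from typing import Dict, List, Tuple


def avg_translation_score(
    source: List[str],
    target: List[str],
    pairs: List[Tuple[int, int]],
    probs: Dict[Tuple[str, str], float],
) -> float:
    """Average the translation probability of every aligned (source, target) pair."""
    if not pairs:
        return 0.0
    # Pairs missing from the table are treated as probability 0.0 (an assumption).
    scores = [probs.get((source[i], target[j]), 0.0) for i, j in pairs]
    return sum(scores) / len(scores)


# Hypothetical probabilities, not values from the trained model.
toy_probs = {("luz", "light"): 0.8, ("y", "and"): 0.6, ("hubo", "was"): 0.1}
score = avg_translation_score(
    ["y", "hubo", "luz"],
    ["and", "there", "was", "light"],
    [(0, 0), (1, 2), (2, 3)],
    toy_probs,
)
print(round(score, 4))
```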

## Symmetrized alignment models

Most statistical word alignment models are directional and asymmetric. This means that they can only model one-to-one and one-to-many alignments in one direction. They cannot model many-to-many alignments, which can occur in some language pairs. One way around this limitation is to train models in both directions (source-to-target and target-to-source) and then merge the alignments from the two models into a single alignment. This is called symmetrization and is a common practice when using statistical word alignment models. In addition, researchers have found that symmetrized alignments are generally of higher quality.
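
As a rough sketch of the merging step, the two simplest heuristics are the intersection and the union of the two directional alignments. The code below uses plain Python sets; the `invert` and `symmetrize` helpers and the toy alignments are hypothetical and are not Machine's API.

```python
from typing import Set, Tuple

Alignment = Set[Tuple[int, int]]


def invert(alignment: Alignment) -> Alignment:
    """Swap indices so a target-to-source alignment reads as source-to-target."""
    return {(j, i) for i, j in alignment}


def symmetrize(direct: Alignment, inverse: Alignment, heuristic: str = "intersection") -> Alignment:
    """Merge a source-to-target alignment with a target-to-source one."""
    inverted = invert(inverse)
    if heuristic == "intersection":
        return direct & inverted  # keep only links both directions agree on
    if heuristic == "union":
        return direct | inverted  # keep links proposed by either direction
    raise ValueError(f"Unknown heuristic: {heuristic}")


# Toy alignments: direct holds (source, target) pairs, inverse holds (target, source) pairs.
direct = {(0, 0), (1, 1), (2, 1)}
inverse = {(0, 0), (1, 1), (2, 3)}

print(sorted(symmetrize(direct, inverse, "intersection")))
print(sorted(symmetrize(direct, inverse, "union")))
```

Heuristics such as grow-diag-final-and sit between these two extremes: they start from the intersection and selectively add links from the union.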

Machine provides a special word alignment model class to support symmetrization called `ThotSymmetrizedWordAlignmentModel`. Let's demonstrate how to use this class. First, we will train the symmetrized model using the `SymmetrizedWordAlignmentModelTrainer` class.

In [14]:
from machine.translation import SymmetrizedWordAlignmentModelTrainer

src_trg_trainer = ThotWordAlignmentModelTrainer(
    ThotWordAlignmentModelType.IBM1, parallel_corpus.lowercase(), "out/VBL-WEB-IBM1/src_trg"
)
trg_src_trainer = ThotWordAlignmentModelTrainer(
    ThotWordAlignmentModelType.IBM1, parallel_corpus.invert().lowercase(), "out/VBL-WEB-IBM1/trg_src"
)
symmetrized_trainer = SymmetrizedWordAlignmentModelTrainer(src_trg_trainer, trg_src_trainer)
symmetrized_trainer.train(lambda status: print(f"{status.message}: {status.percent_completed:.2%}"))
symmetrized_trainer.save()
print("Symmetrized IBM-1 model saved")

Training direct alignment model: 0.00%
Training direct alignment model: 0.00%
Training direct alignment model: 8.33%
Training direct alignment model: 16.67%
Training direct alignment model: 25.00%
Training direct alignment model: 33.33%
Training direct alignment model: 41.67%
Training direct alignment model: 50.00%
Training inverse alignment model: 50.00%
Training inverse alignment model: 50.00%
Training inverse alignment model: 58.33%
Training inverse alignment model: 66.67%
Training inverse alignment model: 75.00%
Training inverse alignment model: 83.33%
Training inverse alignment model: 91.67%
Training inverse alignment model: 100.00%
Symmetrized IBM-1 model saved


The model can also be trained using the `create_trainer` method on `ThotSymmetrizedWordAlignmentModel`. Now that we've trained the symmetrized model, let's obtain some alignments. Machine supports many different symmetrization heuristics. The symmetrization heuristic to use when merging alignments can be specified using the `heuristic` property. In this case, we will use the `GROW_DIAG_FINAL_AND` heuristic.

In [15]:
from machine.translation import SymmetrizationHeuristic
from machine.translation.thot import ThotSymmetrizedWordAlignmentModel

src_trg_model = ThotIbm1WordAlignmentModel("out/VBL-WEB-IBM1/src_trg")
trg_src_model = ThotIbm1WordAlignmentModel("out/VBL-WEB-IBM1/trg_src")
symmetrized_model = ThotSymmetrizedWordAlignmentModel(src_trg_model, trg_src_model)
symmetrized_model.heuristic = SymmetrizationHeuristic.GROW_DIAG_FINAL_AND

segment_batch = list(parallel_corpus.lowercase().take(5))
alignments = symmetrized_model.align_batch(segment_batch)

for (source_segment, target_segment), alignment in zip(segment_batch, alignments):
    print("Source:", " ".join(source_segment))
    print("Target:", " ".join(target_segment))
    print("Alignment:", alignment)

Source: en un principio ʼelohim creó los cielos y la tierra .
Target: in the beginning god created the heavens and the earth .
Alignment: 0-2 1-2 2-2 3-3 4-4 5-0 6-0 7-7 8-1 8-8 9-9 10-10
Source: pero la tierra estaba desolada y vacía , y había oscuridad sobre la superficie del abismo . el espíritu de ʼelohim se movía sobre la superficie de las aguas .
Target: now the earth was formless and void , and darkness was over the surface of the deep . and the spirit of god was hovering over the surface of the waters .
Alignment: 0-0 0-4 1-3 2-2 3-3 5-5 7-7 10-9 14-11 16-17 17-1 19-14 20-22 21-21 27-30 28-30
Source: entonces ʼelohim dijo : haya luz . y hubo luz .
Target: and god said , “ let there be light , ” and there was light .
Alignment: 0-5 1-1 2-2 2-4 3-2 3-3 4-7 5-8 5-14 6-15 7-0 8-6
Source: ʼelohim vio que la luz era buena e hizo separación entre la luz y la oscuridad .
Target: and god saw that the light was good , and he separated the light from the darkness .
Alignment: 0-1 1-2 1-3 

In [25]:
import numpy as np

In [30]:
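# First attempt: this cell fails. Iterating over a WordAlignmentMatrix yields
# boolean rows rather than "i-j" strings, so the split below raises ValueError.
for (source_segment, target_segment), alignment in zip(segment_batch, alignments):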
    word_pairs = []
    
    print(alignment)
    
    source = 0
    
    for i in alignment:
        key1, key2 = str(i).split('-')
        value1 = source_segment[int(key1)]
        value2 = target_segment[int(key2)]
        word_pairs.append(f'{source}-{value2}')
        source += 1
        
    print(' '.join(word_pairs))

0-2 1-2 2-2 3-3 4-4 5-0 6-0 7-7 8-1 8-8 9-9 10-10


ValueError: not enough values to unpack (expected 2, got 1)

In [32]:
print(alignment[0])

[False False  True False False False False False False False False]


In [42]:
import numpy as np
import pandas as pd

for (source_segment, target_segment), alignment in zip(segment_batch, alignments):
    spa = []
    eng = []

    # Iterating over the alignment matrix yields one boolean row per source token.
    for source_index, row in enumerate(alignment):
        true_indices = np.where(row)[0]  # target indices aligned to this source token
        if len(true_indices) > 0:
            # Keep only the first aligned target token; unaligned source tokens are skipped.
            spa.append(source_segment[source_index])
            eng.append(target_segment[int(true_indices[0])])

    table = pd.DataFrame({"English": eng, "Spanish": spa})
    display(table)

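|   | English | Spanish |
|---|---------|---------|
| 0 | beginning | en |
| 1 | beginning | un |
| 2 | beginning | principio |
| 3 | god | ʼelohim |
| 4 | created | creó |
| 5 | in | los |
| 6 | in | cielos |
| 7 | and | y |
| 8 | the | la |
| 9 | earth | tierra |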


|   | English | Spanish |
|---|---------|---------|
| 0 | now | pero |
| 1 | was | la |
| 2 | earth | tierra |
| 3 | was | estaba |
| 4 | and | y |
| 5 | , | , |
| 6 | darkness | oscuridad |
| 7 | over | del |
| 8 | . | . |
| 9 | the | el |


|   | English | Spanish |
|---|---------|---------|
| 0 | let | entonces |
| 1 | god | ʼelohim |
| 2 | said | dijo |
| 3 | said | : |
| 4 | be | haya |
| 5 | light | luz |
| 6 | . | . |
| 7 | and | y |
| 8 | there | hubo |


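|   | English | Spanish |
|---|---------|---------|
| 0 | god | ʼelohim |
| 1 | saw | vio |
| 2 | that | que |
| 3 | the | la |
| 4 | light | luz |
| 5 | he | era |
| 6 | separated | buena |
| 7 | separated | e |
| 8 | separated | hizo |
| 9 | separated | separación |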


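|   | English | Spanish |
|---|---------|---------|
| 0 | god | ʼelohim |
| 1 | called | llamó |
| 2 | called | a |
| 3 | the | la |
| 4 | light | luz |
| 5 | day | día |
| 6 | and | y |
| 7 | darkness | oscuridad |
| 8 | night | noche |
| 9 | . | . |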
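
The cell above keeps only the first aligned target word for each source word, so one-to-many links such as `8-1 8-8` lose their second target. A minimal sketch that keeps every link is shown below. It uses a plain nested list of booleans as a stand-in for the alignment matrix, and the `aligned_word_pairs` helper and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def aligned_word_pairs(
    matrix: Sequence[Sequence[bool]], source: List[str], target: List[str]
) -> List[Tuple[str, str]]:
    """Return a (source_word, target_word) pair for every True cell in the matrix."""
    pairs = []
    for i, row in enumerate(matrix):
        for j, is_aligned in enumerate(row):
            if is_aligned:
                pairs.append((source[i], target[j]))
    return pairs


# Toy 2x3 matrix: "la" is linked to both occurrences of "the".
toy_matrix = [
    [True, False, True],
    [False, True, False],
]
pairs = aligned_word_pairs(toy_matrix, ["la", "tierra"], ["the", "earth", "the"])
print(pairs)
```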
