# Word alignments

Aligning sentences and words are central tasks in statistical machine translation (SMT). In this assignment, you get to implement a word aligner. Given pairs of aligned sentences in two languages, source and target, the goal is to align source words to their target translations. The resulting alignments might contain unaligned or multiply aligned words, i.e., the word alignments are generally m:n, which makes the task challenging.

This assignment is almost identical to the first assignment of a well-known online course on SMT, which is available here: http://mt-class.org/jhu/hw1.html. The task is to train and evaluate IBM Model 1 for word alignments. The EM algorithm for training Model 1 is explained very well in the tutorial by Adam Lopez, which is linked from mt-class and our course website.

**Important**: For some reason, the Lopez tutorial systematically flips the role of English and Foreign strings, compared to the lecture slides. The text of the mt-class assignment is consistent with this, but the presentation on the lecture slides is consistent with the original publication of IBM Model 1 and a lot of other material, such as Koehn’s textbook on SMT. **Please implement the assignment as stated on the mt-class site**, with word translation probabilities P(f | e) and the assumption that each Foreign word is aligned to at most one English word.

The key points are as follows:

1. Clone the repository from https://github.com/xutaima/jhu-mt-hw, using Git. Observe that the repository contains some Python code and a dataset of 100,000 English-French sentence pairs ('hansards.e' and 'hansards.f') in the folder hw2. The first 37 sentence pairs are manually aligned, and these manual alignments are encoded in the file 'hansards.a'.

In [1]:
! git clone https://github.com/xutaima/jhu-mt-hw

Cloning into 'jhu-mt-hw'...
remote: Enumerating objects: 55, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 55 (delta 0), reused 1 (delta 0), pack-reused 47
Unpacking objects: 100% (55/55), done.


## Baseline

2. Get to know the code of the aligner. It provides a very simple baseline system; test it through the provided command-line interface. Determine the Alignment Error Rate (AER) of the baseline system, using the score-alignments script, and submit the result. Observe that it is terrible; word alignments are not an easy problem.
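As a reference for reading the scores reported in this notebook, here is a minimal sketch of how precision, recall and AER are commonly defined for word alignments with "sure" and "possible" gold links. The function name `alignment_scores` and the toy link sets are hypothetical, and it is an assumption that the `score-alignments` script uses exactly this definition.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (foreign word index, English word index)


def alignment_scores(guessed: Set[Link], sure: Set[Link],
                     possible: Set[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, AER) for pooled alignment links.

    Assumes `guessed` and `sure` are non-empty.
    """
    # Every sure link also counts as a possible link.
    possible = possible | sure
    a_and_s = len(guessed & sure)
    a_and_p = len(guessed & possible)
    precision = a_and_p / len(guessed)
    recall = a_and_s / len(sure)
    aer = 1.0 - (a_and_s + a_and_p) / (len(guessed) + len(sure))
    return precision, recall, aer


# Toy example (made-up links, not taken from hansards.a).
guessed = {(0, 0), (1, 1), (2, 3)}
sure = {(0, 0), (1, 1)}
possible = {(2, 2)}
precision, recall, aer = alignment_scores(guessed, sure, possible)
print(f"Precision = {precision:.3f}, Recall = {recall:.3f}, AER = {aer:.3f}")
```

Under this definition, a guessed link that matches only a possible link improves precision but leaves recall unchanged, which is why an aligner that proposes many links can combine high recall with low precision.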

Summary of the baseline alignment algorithm from the 'align' file:
0. Input: parallel sentences; a threshold for Dice's coefficient; the number of sentences to use.
1. Create a list of tuples of the form (list_of_split_f_words, list_of_split_e_words). Sentences are split on spaces.
2. Count all unique (f, e) word pairs in the parallel sentences. Note that only pairs in the (f, e) direction are counted, not (e, f). Also count all unique f and e words.
3. Iterate over the (f, e) counter and calculate the [Dice coefficient](https://en.wikipedia.org/wiki/Sørensen–Dice_coefficient):

$$DSC = \frac{2 |X \cap Y|}{|X| + |Y|}$$
4. The Dice coefficient of each pair lies in the range [0, 1].
Hence, pairs whose value is above the provided threshold (0.5 by default) are treated as aligned.
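The steps above can be sketched in a few lines of Python. This is an illustration, not the code of the 'align' script: the names `dice_scores` and `dice_align` and the toy corpus are hypothetical, and two details are assumptions of the sketch, namely that each word is counted once per sentence (so X and Y are the sets of sentences containing f and e) and that a score equal to the threshold counts as aligned.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Bitext = List[Tuple[List[str], List[str]]]  # [(f_words, e_words), ...]
Link = Tuple[int, int]


def dice_scores(bitext: Bitext) -> Dict[Tuple[str, str], float]:
    """Compute Dice's coefficient for every co-occurring (f, e) word pair."""
    f_count: Dict[str, int] = defaultdict(int)
    e_count: Dict[str, int] = defaultdict(int)
    fe_count: Dict[Tuple[str, str], int] = defaultdict(int)
    for f_sent, e_sent in bitext:
        f_types, e_types = set(f_sent), set(e_sent)  # count once per sentence
        for f in f_types:
            f_count[f] += 1
            for e in e_types:
                fe_count[(f, e)] += 1
        for e in e_types:
            e_count[e] += 1
    return {(f, e): 2.0 * n / (f_count[f] + e_count[e])
            for (f, e), n in fe_count.items()}


def dice_align(bitext: Bitext, threshold: float = 0.5) -> List[List[Link]]:
    """Link every (i, j) position pair whose word pair reaches the threshold."""
    scores = dice_scores(bitext)
    alignments = []
    for f_sent, e_sent in bitext:
        links = [(i, j)
                 for i, f in enumerate(f_sent)
                 for j, e in enumerate(e_sent)
                 if scores[(f, e)] >= threshold]
        alignments.append(links)
    return alignments


toy_bitext = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
]
toy_links = dice_align(toy_bitext, threshold=0.9)
print(toy_links)
```

Because every position pair is scored independently, nothing stops one foreign word from being linked to several English words, or to none at all.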

In [1]:
! cd ./jhu-mt-hw/hw2/ && python align -n 1000 | python score-alignments -n 1

Training with Dice's coefficient.......................................
  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 | *                                                                                   | chacun
 |    ?                                                                                | en
 |       ?                                                                             | lui
 |       ?                                                                             | -
 |       ?                                                                             | même
 |         (*)                           ( )                                           | est
 |             *                                                                       | très
 |               (*)         ( )            ( )   ( )            ( )         ( )( )    | complexe
 |                              

In [2]:
! cd ./jhu-mt-hw/hw2/ && python align -n 10000 | python score-alignments -n 0

Training with Dice's coefficient.......................................................................................................................................................................................................................................................................................................
Precision = 0.239020
Recall = 0.594675
AER = 0.681997


In [3]:
! cd ./jhu-mt-hw/hw2/ && python align -t 0.4 -n 1000 | python score-alignments -n 0

Training with Dice's coefficient.......................................
Precision = 0.220453
Recall = 0.721893
AER = 0.709249


## IBM Model 1

3. Your task is to improve over the baseline by implementing an aligner based on IBM Model 1. Your program should learn the parameters P(f | e) of Model 1 from the given data, and then use them to compute optimal alignments. Submit the AER for your implementation. Feel free to use NLTK if you find it helpful, but note that anything in the nltk.align and nltk.translate packages is disallowed in this assignment.
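For orientation, the following is a minimal sketch of EM training for IBM Model 1 with parameters P(f | e), followed by the decoding step that picks the most probable English word for each foreign word. It is not the code of the `em-align` script used below: the function names `train_model1` and `align_model1` and the toy corpus are hypothetical, and the sketch assumes uniform initialisation and no NULL word.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Bitext = List[Tuple[List[str], List[str]]]  # [(f_words, e_words), ...]
Link = Tuple[int, int]


def train_model1(bitext: Bitext, iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t[(f, e)] = P(f | e) with EM (no NULL word)."""
    f_vocab = {f for f_sent, _ in bitext for f in f_sent}
    uniform = 1.0 / len(f_vocab)
    t: Dict[Tuple[str, str], float] = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count_fe: Dict[Tuple[str, str], float] = defaultdict(float)
        count_e: Dict[str, float] = defaultdict(float)
        # E-step: collect expected counts of each (f, e) link.
        for f_sent, e_sent in bitext:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    posterior = t[(f, e)] / norm
                    count_fe[(f, e)] += posterior
                    count_e[e] += posterior
        # M-step: renormalise so that sum_f P(f | e) = 1 for every e.
        t = defaultdict(float, {(f, e): c / count_e[e]
                                for (f, e), c in count_fe.items()})
    return t


def align_model1(bitext: Bitext, t: Dict[Tuple[str, str], float]) -> List[List[Link]]:
    """Link each foreign position i to the English position j maximising P(f_i | e_j)."""
    alignments = []
    for f_sent, e_sent in bitext:
        links = []
        for i, f in enumerate(f_sent):
            best_j = max(range(len(e_sent)), key=lambda j: t[(f, e_sent[j])])
            links.append((i, best_j))
        alignments.append(links)
    return alignments


toy_bitext = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
]
t_toy = train_model1(toy_bitext, iterations=10)
model1_links = align_model1(toy_bitext, t_toy)
print(model1_links)
```

In this form every foreign word receives exactly one link, so the output always contains as many links as there are foreign words.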

In [3]:
import os
from typing import List, Optional

In [4]:
%%capture
if not os.path.isfile('em-align'):
    ! wget https://raw.githubusercontent.com/tsimafeip/LCT-master-course/main/Computational_Linguistics/HW5_word_alignment/em-align 

In [5]:
# runs ~1min
! cd ./jhu-mt-hw/hw2/ && python ../../em-align -n 1000 -b 0 | python score-alignments -n 1

100% 100/100 [00:50<00:00,  2.00it/s]
  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 | *             ( )                                                                   | chacun
 |    ?                                                 ( )                            | en
 |       ?                   ( )                                                       | lui
 |       ?                                                       ( )                   | -
 |       ?       ( )                                                                   | même
 |         (*)                                                                         | est
 |            (*)                                                                      | très
 |               (*)                                                                   | complexe
 |                                                   ( )          

In [6]:
# runs ~2min
! cd ./jhu-mt-hw/hw2/ && python ../../em-align -n 1000 -b 1 | python score-alignments -n 1

100% 100/100 [00:49<00:00,  2.02it/s]
100% 100/100 [00:49<00:00,  2.02it/s]
  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 | *                                                                                   | chacun
 |    ?                                                                                | en
 |       ?                                                                             | lui
 |       ?                                                                             | -
 |       ?                                                                             | même
 |         (*)                                                                         | est
 |            (*)                                                                      | très
 |               (*)                                                                   | complexe
 |                          

In [None]:
# runs ~15min
! cd ./jhu-mt-hw/hw2/ && python ../../em-align -n 10000 -b 0 | python score-alignments -n 1

100% 100/100 [15:22<00:00,  9.23s/it]
  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 | *                                                                         ( )       | chacun
 |    ?                                                 ( )                            | en
 |       ?                                                                   ( )       | lui
 |       ?                                                    ( )                      | -
 |       ?                                  ( )                                        | même
 |         (*)                                                                         | est
 |            (*)                                                                      | très
 |               (*)                                                                   | complexe
 |                              ( )                               

In [None]:
# runs ~30min
! cd ./jhu-mt-hw/hw2/ && python ../../em-align -n 10000 -b 1 | python score-alignments -n 1

100% 100/100 [15:04<00:00,  9.04s/it]
100% 100/100 [15:07<00:00,  9.08s/it]
  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 | *                                                                                   | chacun
 |    ?                                                                                | en
 |       ?                                                                             | lui
 |       ?                                                                             | -
 |       ?                                  ( )                                        | même
 |         (*)                                                                         | est
 |            (*)                                                                      | très
 |               (*)                                                                   | complexe
 |                          

In [4]:
# big setup:
# combined model with 10 EM iterations
# training on full available data (100k sentences)

# runs ~ 13 min and beats off-the-shelf aligner
! cd ./jhu-mt-hw/hw2/ && python ../../em-align -i 10 -b 1 -s 0 | python score-alignments -n 1

100% 10/10 [12:40<00:00, 76.07s/it]
100% 10/10 [12:28<00:00, 74.84s/it]
  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 |(*)                                                                                  | chacun
 |    ?                                                                                | en
 |       ?                                                                             | lui
 |       ?          ( )                                                                | -
 |       ?                                  ( )                                        | même
 |         (*)                                                                         | est
 |            (*)                                                                      | très
 |               (*)                                                                   | complexe
 |                              

Conclusions:

1) Increasing the number of training sentences improves AER drastically, but training takes much longer.

2) The one-directional IBM Model 1 always chooses a single English word for each foreign word. The combined version can leave some words without any alignment at all.

3) The baseline model can choose anywhere from zero to many English words for each foreign word, which explains its high recall and low precision.

4) Adding a null word to my code would require changes to the check-alignments script. Hence, I prefer to leave out nulls altogether, following a proposal from the Lopez tutorial.

## Off-the-shelf aligner

In addition to what is required in the JHU assignment, experiment also with an off-the-shelf aligner of your choice. GIZA++ used to be the standard, but is now hard to compile. Depending on your preference of programming language, you might try [MGIZA](https://github.com/moses-smt/mgiza), [fast align](https://github.com/clab/fast_align), or [the Berkeley aligner](https://github.com/mhajiloo/berkeleyaligner). Compare your IBM Model 1 to the implementation of Model 1 in the off-the-shelf aligner (if available) and another Model i > 1 of your choice.

I have decided to use fast_align. It is an implementation of IBM Model 2 with some improvements ([a model that prefers to align words close to the diagonal](http://aclweb.org/anthology/N/N13/N13-1073.pdf)).

This tool requires data in one file with '|||' as separator, so we have to convert existing files:

In [None]:
path_to_data = ['.', 'jhu-mt-hw', 'hw2', 'data']
f_filename = 'hansards.f'
e_filename = 'hansards.e'

In [None]:
def prepare_data_for_fastalign(path_to_data: List[str],
                               source_filename: str, target_filename: str,
                               res_filename: Optional[str] = None,
                               sentences_num: int = -1) -> str:
    """
    Converts data to fast_align format.

    Parameters
    ----------
    path_to_data : List[str]
        Path to folder with training data.
    source_filename : str
        Filename of source file.
    target_filename : str
        Filename of target file.
    res_filename : Optional[str]
        Optional filename of the resulting file. 
        If not set, then 'fast_align_data.txt' default value is used.
    sentences_num : int
        Optional number of training sentences.
        By default has value -1, so all available data will be used.

    Returns
    -------
    str
        Name of resulting file in fast_align format.
    """
    if res_filename is None:
        res_filename = 'fast_align_data.txt'
    
    fast_align_separator = '|||'
    path_to_target_data = os.path.join(*path_to_data, target_filename)
    path_to_source_data = os.path.join(*path_to_data, source_filename)

    with open(path_to_target_data, 'r') as target_file, \
         open(path_to_source_data, 'r') as source_file, \
         open(res_filename, 'w') as res_file:
        for i, (source_sent, target_sent) in enumerate(zip(source_file, target_file)):
            if sentences_num != -1 and i == sentences_num:
                break

            source_sent = source_sent.strip()
            target_sent = target_sent.strip()
            res_file.write(f'{source_sent} {fast_align_separator} {target_sent}\n')
    
    return res_filename

In [None]:
# source and target sides were chosen with respect to output format of fast_align
# and requirements of the task ('each Foreign word is aligned to at most one English word').
res_filename = prepare_data_for_fastalign(path_to_data=path_to_data, 
                                          source_filename=f_filename,
                                          target_filename=e_filename, 
                                          sentences_num=1000)

In [None]:
%%capture
# install fast_align with dependencies and build it.
! sudo apt-get install libgoogle-perftools-dev libsparsehash-dev
! git clone https://github.com/clab/fast_align
! cd fast_align && mkdir build && cd build && cmake .. && make

In [None]:
# runs fast_align based on data produced above
# the set of flags was chosen based on the examples provided in the fast_align README
# please note that we add the -r flag here, which means 'reverse'
! cd ./jhu-mt-hw/hw2/ && ../../fast_align/build/fast_align -i ../../$res_filename -d -o -v -r 2>/dev/null | python score-alignments -n 1

  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 |(*)                                                                                  | chacun
 |    ? ( )                                                                            | en
 |       ?                                                                             | lui
 |       ?                                                                             | -
 |       ?          ( )                                                                | même
 |         (*)                                                                         | est
 |            (*)                                                                      | très
 |               (*)                                                                   | complexe
 |                  ( )                                                                | et
 |          

In [None]:
# 'forward' path:
! cd ./jhu-mt-hw/hw2/ && ../../fast_align/build/fast_align -i ../../$res_filename -d -o -v 2>/dev/null | python score-alignments -n 1

  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 |(*)                                                                                  | chacun
 |    ? ( )                                                                            | en
 |       ?                                                                             | lui
 |       ?          ( )                                                                | -
 |       ?                                                                             | même
 |         (*)                                                                         | est
 |            (*)                                                                      | très
 |               (*)                              ( )                                  | complexe
 |                                                                                     | et
 |          

In [None]:
# try combining two models
! fast_align/build/fast_align -i $res_filename -d -o -v -r 2>/dev/null > reverse.align 
! fast_align/build/fast_align -i $res_filename -d -o -v 2>/dev/null > forward.align
! cd ./jhu-mt-hw/hw2/ && ../../fast_align/build/atools -i ../../forward.align -j ../../reverse.align -c grow-diag-final-and | python score-alignments -n 1

  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 |(*)                                                                                  | chacun
 |    ? ( )                                                                            | en
 |       ?                                                                             | lui
 |       ?                                                                             | -
 |       ?                                                                             | même
 |         (*)                                                                         | est
 |            (*)                                                                      | très
 |               (*)                                                                   | complexe
 |                  ( )                                                                | et
 |          

I have decided to train it on the full available data:

In [None]:
# source and target sides were chosen with respect to output format of fast_align
# and requirements of the task ('each Foreign word is aligned to at most one English word').
res_filename_full = prepare_data_for_fastalign(path_to_data=path_to_data, 
                                               source_filename=f_filename,
                                               target_filename=e_filename, 
                                               sentences_num=-1)

In [None]:
%%time
# rerun fast_align on the full available data
! cd ./jhu-mt-hw/hw2/ && ../../fast_align/build/fast_align -i ../../$res_filename_full -d -o -v -r 2>/dev/null | python score-alignments -n 1

  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 |(*)                                                                                  | chacun
 |    ? ( )                                                                            | en
 |       ? ( )                                                                         | lui
 |       ? ( )                                                                         | -
 |       ?                                  ( )                                        | même
 |         (*)                                                                         | est
 |            (*)                                                                      | très
 |               (*)                                                                   | complexe
 |                              ( )                                                    | et
 |          

In [None]:
%%time
# rerun 'forward' path on full data:
! cd ./jhu-mt-hw/hw2/ && ../../fast_align/build/fast_align -i ../../$res_filename_full -d -o -v 2>/dev/null | python score-alignments -n 1

  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 |(*)                                                                                  | chacun
 |   (?)( )                                                                            | en
 |       ?                                                                             | lui
 |       ?          ( )                                                                | -
 |       ?                                                                             | même
 |         (*)                                                                         | est
 |            (*)                                                                      | très
 |               (*)                              ( )                                  | complexe
 |                                                                                     | et
 |          

In [None]:
%%time
# rerun the combination of the two models
! fast_align/build/fast_align -i $res_filename_full -d -o -v -r 2>/dev/null > reverse.align 
! fast_align/build/fast_align -i $res_filename_full -d -o -v 2>/dev/null > forward.align
! cd ./jhu-mt-hw/hw2/ && ../../fast_align/build/atools -i ../../forward.align -j ../../reverse.align -c grow-diag-final-and | python score-alignments -n 1

  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 |(*)                                                                                  | chacun
 |   (?)( )                                                                            | en
 |       ? ( )                                                                         | lui
 |       ? ( )                                                                         | -
 |       ?                                                                             | même
 |         (*)                                                                         | est
 |            (*)                                                                      | très
 |               (*)                                                                   | complexe
 |                                                                                     | et
 |          

In [None]:
# an attempt to compare my aligner with the off-the-shelf aligner on the full data
# using almost the SAME running time, to eliminate the effect of longer training
! cd ./jhu-mt-hw/hw2/ && python ../../em-align -i 2 -b 1 -s 0 | python score-alignments -n 1

100%|█████████████████████████████████████████████| 2/2 [01:14<00:00, 37.09s/it]
100%|█████████████████████████████████████████████| 2/2 [01:14<00:00, 37.17s/it]
  Alignment 0  KEY: ( ) = guessed, * = sure, ? = possible
  ------------------------------------------------------------------------------------
 |(*)                                                                                  | chacun
 |    ?                                                                                | en
 |       ?                                                                             | lui
 |       ?          ( )                                                                | -
 |       ?                                                                             | même
 |         (*)                                                                         | est
 |            (*)                                                                      | très
 |               (*)                    

Conclusion: my implementation of IBM Model 1 delivers a comparably good AER (0.24 with 10 iterations of EM, 0.27 with only 2 iterations) on the full available data. The off-the-shelf fast_align, however, already gives a low error rate on the first 1,000 sentences (0.29 vs. 0.5 for my implementation), and on the full data it reaches an AER of about 0.2.

## Extra

Implement an aligner that improves over your implementation of IBM Model 1. Some ideas are suggested in the JHU homework assignment:

- Implement [a model that prefers to align words close to the diagonal](http://aclweb.org/anthology/N/N13/N13-1073.pdf).
- Implement [an HMM alignment model](https://aclanthology.org/C96-2141.pdf).
- Implement [a morphologically-aware alignment model](https://aclanthology.org/N13-1140.pdf).
- [Use maximum a posteriori inference under a Bayesian prior](https://aclanthology.org/P11-2032.pdf).
- Train a French-English model and an English-French model and [combine their predictions](https://aclanthology.org/N06-1014.pdf).
- Train [a supervised discriminative alignment model](https://aclanthology.org/P06-1009.pdf) on the annotated development set.
- Train [an unsupervised discriminative alignment model](https://aclanthology.org/P11-1042.pdf).

I have decided to implement the combination of a French-English model and an English-French model. However, I have implemented only the simplest version of it: keeping the alignments on which both models agree and omitting all others. [Alignment by Agreement](https://aclanthology.org/N06-1014.pdf) proposes a better approach, which I may implement later.
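The agreement heuristic described above amounts to intersecting the two link sets sentence by sentence. The sketch below illustrates the idea under stated assumptions and is not the code of `em-align`: the function name `intersect_alignments` is hypothetical, and it assumes that the French-to-English model stores links as (f index, e index) while the English-to-French model stores them as (e index, f index).

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]


def intersect_alignments(f2e: List[Set[Link]],
                         e2f: List[Set[Link]]) -> List[Set[Link]]:
    """Keep only the links (i, j) that both directional models propose.

    f2e links are (f index, e index); e2f links are (e index, f index)
    and are flipped before the comparison.
    """
    combined = []
    for forward, backward in zip(f2e, e2f):
        flipped = {(i, j) for (j, i) in backward}
        combined.append(forward & flipped)
    return combined


# Toy example with one sentence pair (made-up links).
f2e = [{(0, 0), (1, 1), (2, 1)}]
e2f = [{(0, 0), (1, 1), (2, 2)}]
agreed = intersect_alignments(f2e, e2f)
print(agreed)
```

Intersection keeps only links that both directions support, so it can leave words unaligned; it tends to trade recall for precision.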

## Submissions

Submit your code and document all your evaluation results. Submit at least one alignment visualization from your system in comparison to the baseline system and the off-the-shelf aligner so we can discuss it in class.