# Homework 4: Word Alignment

## Group: Wisefish 

* Wenhao Zhang, wenhaoz 
* Graeme Milne, gmilne 
* Mitchell McCormack, mmccorma
* Jonathan Lo, jcl60

## Development process

As with previous assignments, we arranged a group meeting for which everyone read the assignment outline beforehand and came with questions and points to discuss. We planned a meeting of a few hours so that we could begin implementing the algorithm.

That single session ended with a nearly complete implementation of the baseline algorithm. However, debugging the baseline at large values of N exposed some remaining issues that still required work.


## Imports and Command line setup

In [None]:
import optparse, sys, os, logging
from collections import defaultdict
from itertools import islice
from decimal import Decimal

An additional command line parameter `-m` was added to specify which model to use for training. `-m m1` will train using only model 1 and `-m m2` will train using model 2. `m2` is the default setting for the command line parameter.

In [None]:
optparser = optparse.OptionParser()
optparser.add_option("-d", "--datadir", dest="datadir", default="data", help="data directory (default=data)")
optparser.add_option("-p", "--prefix", dest="fileprefix", default="hansards", help="prefix of parallel data files (default=hansards)")
optparser.add_option("-e", "--english", dest="english", default="en", help="suffix of English (target language) filename (default=en)")
optparser.add_option("-f", "--french", dest="french", default="fr", help="suffix of French (source language) filename (default=fr)")
optparser.add_option("-l", "--logfile", dest="logfile", default=None, help="filename for logging output")
optparser.add_option("-t", "--threshold", dest="threshold", default=0.5, type="float", help="threshold for alignment (default=0.5)")
optparser.add_option("-n", "--num_sentences", dest="num_sents", default=sys.maxsize, type="int", help="Number of sentences to use for training and alignment")
optparser.add_option("-m", "--model", dest="model", default = "m2", help="m1 for model 1 and m2 for model 2")
(opts, _) = optparser.parse_args()

f_data = "%s.%s" % (os.path.join(opts.datadir, opts.fileprefix), opts.french)
e_data = "%s.%s" % (os.path.join(opts.datadir, opts.fileprefix), opts.english)


if opts.logfile:
    logging.basicConfig(filename=opts.logfile, filemode='w', level=logging.INFO)
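
As a small illustration of how the `-m` option behaves, the sketch below parses explicit argument lists rather than `sys.argv`. The helper name `build_parser` and the `type="choice"` restriction are assumptions made for this example; the option definition above accepts any string.

```python
import optparse

def build_parser():
    # Hypothetical helper: only the -m option is reproduced here.
    parser = optparse.OptionParser()
    # Assumption: restrict -m to the two supported models so a typo fails early.
    parser.add_option("-m", "--model", dest="model", type="choice",
                      choices=["m1", "m2"], default="m2",
                      help="m1 for model 1 and m2 for model 2")
    return parser

# No -m given: the default model is used.
opts_default, _ = build_parser().parse_args([])
# Explicit -m m1: only Model 1 is selected.
opts_m1, _ = build_parser().parse_args(["-m", "m1"])

print(opts_default.model, opts_m1.model)
```

With the `choice` type, `optparse` rejects any other value for `-m` with a usage error instead of silently falling through to a default branch.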


## Baseline Model 1 Implementation

### Initialization

The running estimate of the translation probability t(f<sub>i</sub> | e<sub>j</sub>) is stored as a dictionary of dictionaries, and every level of the structure is a `defaultdict`.

In [None]:
def loadData(f_data, e_data, num_sents):
    # Read the first num_sents sentence pairs and tokenize them on whitespace.
    with open(f_data) as f_file, open(e_data) as e_file:
        bitext = [[sentence.strip().split() for sentence in pair] for pair in islice(zip(f_file, e_file), num_sents)]
    f_vocab = set()
    e_vocab = set()

    # Collect both vocabularies from the sentence pairs that were just loaded.
    for (f, e) in bitext:
        f_vocab.update(f)
        e_vocab.update(e)

    return bitext, f_vocab, e_vocab

# loadData must be defined before it is called.
bitext, f_vocab, e_vocab = loadData(f_data, e_data, opts.num_sents)
size_f_vocab = len(f_vocab)
# Unseen (f, e) pairs start at the uniform probability 1/|f_vocab|.
t = defaultdict(lambda: defaultdict(lambda: 1/size_f_vocab))

### Lookup Functions

In our initial implementation, which used tuples of `(f_i, e_j)` as keys, we encountered memory errors because `t` (as well as `fe_count` in each training stage) grew too large to be held in memory. Even in a VM with more memory than most computers have, we could not complete a full training run with `n=100000`.

The root cause is the behaviour of Python's `defaultdict`. When a missing key is accessed with `d[key]`, the `defaultdict` calls its default factory (the argument given at creation), stores the result under that key, and returns it to the caller. This produced a huge number of key-value pairs: a key for every single query, with values that were most of the time just the default.

What we implemented was a lookup helper for `t` (and one for `a` in Model 2). Using `if ... in ...` to check for key existence in a `defaultdict` does NOT insert the default value, so we save a lot of memory by checking whether the key exists and otherwise returning the default as a plain value.
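
The difference between indexing and a membership test can be checked directly. The following is a minimal sketch; the names `d` and `safe_lookup` and the value `0.25` are made up for the illustration and are not part of `align.py`.

```python
from collections import defaultdict

d = defaultdict(lambda: 0.25)

# A membership test does not call the default factory, so nothing is stored.
present = "chat" in d
size_after_membership_test = len(d)

# Indexing a missing key calls the factory and stores the result under that key.
value = d["chat"]
size_after_indexing = len(d)

def safe_lookup(table, key, default):
    # Hypothetical helper: read a value without ever inserting a new key.
    return table[key] if key in table else default

probe = safe_lookup(d, "chien", 0.25)
size_after_safe_lookup = len(d)

print(size_after_membership_test, size_after_indexing, size_after_safe_lookup)
```

Only the direct `d["chat"]` access grows the dictionary; the membership test and the helper leave its size unchanged.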

We also found that using tuples as keys had a significant impact on memory usage (and, to a small extent, on runtime). So we switched to a dict of dicts, with `f_i` values as the first key and `e_j` values as the second key. This nested structure halved the memory required for `t`, which brought memory usage down to a level that allowed us to run this model.
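
One way to compare the two layouts on a given machine is to measure them with `tracemalloc`. This is only a sketch under assumptions: the synthetic vocabularies, the table size, and the helper names (`measure`, `build_tuple_keys`, `build_nested`) are invented for the example, and the numbers it prints will differ from what we observed on the real corpus.

```python
import tracemalloc

def build_tuple_keys(words_f, words_e):
    # Flat table: one tuple object per (f, e) pair is kept alive as a key.
    return {(f, e): 0.5 for f in words_f for e in words_e}

def build_nested(words_f, words_e):
    # Nested table: f is the first key, e is the second key.
    return {f: {e: 0.5 for e in words_e} for f in words_f}

def measure(build, words_f, words_e):
    # Return the table and the bytes still allocated after building it.
    tracemalloc.start()
    table = build(words_f, words_e)
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return table, current

# Synthetic vocabularies (assumption), created before measuring so that
# the word strings themselves are not counted.
words_f = ["f%d" % k for k in range(200)]
words_e = ["e%d" % k for k in range(200)]

tuple_table, tuple_bytes = measure(build_tuple_keys, words_f, words_e)
nested_table, nested_bytes = measure(build_nested, words_f, words_e)

print("tuple keys: %d bytes, nested dicts: %d bytes" % (tuple_bytes, nested_bytes))
```

Both tables hold the same number of entries, so any difference in the reported bytes comes from the key layout alone.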

In [None]:
def lookup(f_i, e_j):
    if f_i in t:
        if e_j in t[f_i]:
            return t[f_i][e_j]
    return 1 / size_f_vocab

### Training a lexical word alignment model

Training the model is done by iteratively improving the conditional translation probabilities. To do this, we go over every sentence pair in the corpus. We compute the expected counts of all possible word alignments, together with the total expected count for each English word. After computing the expected counts over all the sentences, we update our probabilities and repeat the process until convergence. In our case there is no convergence check at the end of each iteration; instead we settled on the simpler and more practical choice of training for 5 iterations.


In [None]:
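def train_model_one(bitext, t, lookup):
    for iteration in range(5):
        # E-step: collect expected counts over the whole corpus.
        e_count = defaultdict(int)
        fe_count = defaultdict(lambda: defaultdict(int))
        for (n, (f, e)) in enumerate(bitext):
            e = ["NULL"] + e
            for (i, f_i) in enumerate(f):
                # Z normalizes over all English words (including NULL) for this f_i.
                Z = 0
                for (j, e_j) in enumerate(e):
                    Z += lookup(f_i, e_j)
                for (j, e_j) in enumerate(e):
                    c = lookup(f_i, e_j)/Z
                    fe_count[f_i][e_j] += c
                    e_count[e_j] += c
        # M-step: re-estimate t from the expected counts.
        for f_i, e_js in fe_count.items():
            for e_j, value in e_js.items():
                t[f_i][e_j] = value / e_count[e_j]
    return t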
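
To see the effect of the iterations on something small, the sketch below runs the same expected-count updates on a two-sentence toy corpus. It is a simplified illustration, not our `align.py`: the corpus, the function name `train_toy_model_one`, the plain-dict storage, and the omission of the NULL word are all assumptions made to keep it short.

```python
from collections import defaultdict

def train_toy_model_one(bitext, iterations=5):
    f_vocab = {w for f, _ in bitext for w in f}
    uniform = 1 / len(f_vocab)
    t = {}  # t[f_i][e_j], filled in after the first iteration

    def lookup(f_i, e_j):
        # Fall back to the uniform value without inserting anything into t.
        return t.get(f_i, {}).get(e_j, uniform)

    for _ in range(iterations):
        # E-step: expected counts for every co-occurring word pair.
        fe_count = defaultdict(lambda: defaultdict(float))
        e_count = defaultdict(float)
        for f, e in bitext:
            for f_i in f:
                Z = sum(lookup(f_i, e_j) for e_j in e)
                for e_j in e:
                    c = lookup(f_i, e_j) / Z
                    fe_count[f_i][e_j] += c
                    e_count[e_j] += c
        # M-step: renormalize the counts for each English word.
        for f_i, e_js in fe_count.items():
            for e_j, value in e_js.items():
                t.setdefault(f_i, {})[e_j] = value / e_count[e_j]
    return t

toy_bitext = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
toy_t = train_toy_model_one(toy_bitext)
for f_i in sorted(toy_t):
    print(f_i, {e_j: round(p, 3) for e_j, p in toy_t[f_i].items()})
```

Because "la" co-occurs with "the" in both sentences, its probability given "the" grows with each iteration, which in turn pushes "maison" toward "house" and "fleur" toward "flower".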

###  Decoding the best alignment

The final step is the actual alignment of our data using the model we have just trained. For each sentence pair, we iterate over the words to find the arg max alignment. Or, as written in the assignment:

<b> 
\begin{align}
\hat{\textbf{a}} = \arg\max_{\textbf{a}} \Pr(\textbf{a} \mid \textbf{e}, \textbf{f})
\end{align}
</b>

In [None]:
for (n, (f, e)) in enumerate(bitext):
    for (i, f_i) in enumerate(f):
        bestp = 0
        bestj = 0
        for (j, e_j) in enumerate(e):
            # t is a dict of dicts, so read it through the lookup helper;
            # this also avoids inserting new keys while decoding.
            p = lookup(f_i, e_j)
            if p > bestp:
                bestp = p
                bestj = j
        sys.stdout.write("%i-%i " % (i, bestj))
    sys.stdout.write("\n")
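
Since Model 1 scores each French word independently, the arg max over whole alignments reduces to picking the best English word for each French word separately. The sketch below shows this on a hand-written table; `decode_sentence`, `toy_t`, and the probabilities in it are made up for illustration.

```python
def decode_sentence(f, e, t, default):
    # For each French position i, pick the English position j with the highest t.
    links = []
    for i, f_i in enumerate(f):
        best_p, best_j = 0.0, 0
        for j, e_j in enumerate(e):
            # Read without inserting; unseen pairs get the default value.
            p = t.get(f_i, {}).get(e_j, default)
            if p > best_p:
                best_p, best_j = p, j
        links.append((i, best_j))
    return links

# Hypothetical translation table, keyed as t[f_i][e_j].
toy_t = {
    "la": {"the": 0.7, "house": 0.1},
    "maison": {"the": 0.2, "house": 0.8},
}
links = decode_sentence(["la", "maison"], ["the", "house"], toy_t, default=0.05)
print(" ".join("%i-%i" % link for link in links))  # prints: 0-0 1-1
```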

### Create an alignment from the German-English Europarl dataset (Model 1)

In [None]:
! python3 align.py -p europarl -f de -n 100000 -m m1 > output.a 
! head -1000 output.a > europarl_align.a

### Create and score an alignment on the French-English Hansard dataset (Model 1)

In [None]:
! python3 align.py -n 100000 -m m1| python3 score-alignments.py

## Improvements with a Model 2 Implementation 

### Initialization 

The initialization for Model 2 involves running Model 1 to train the translation probabilities. That is, we first initialize the translation probabilities uniformly as before, but instead of using them directly in Model 2, we train them with Model 1 first. Model 2 adds the reverse distortion probability a(a<sub>i</sub> | i, I, J) to the model. We initialize `a` uniformly: the lookup function returns 1/(m+1) as the initial value for every entry of `a`, where m is the length of the English sentence. This uniform distribution assumes that all m+1 positions are equally likely for a translation. It is m+1 rather than m because it includes the NULL alignment.

The alignment probability for Model 2 is now: $P(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) = \prod_{i=1}^{I} t(f_i \mid e_{a_i}) \times a(a_i \mid i, I, J)$
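
As a worked instance of this product, the sketch below evaluates it for one fixed alignment. Everything in it is hypothetical: the function name `model2_alignment_probability`, the constant translation probability of 0.5, and the uniform distortion 1/(m+1) are chosen only so that the result is easy to check by hand.

```python
import math

def model2_alignment_probability(f, e, alignment, t, a):
    # alignment[i - 1] is a_i: the position in ["NULL"] + e chosen for f_i.
    e_null = ["NULL"] + e
    l, m = len(f), len(e)
    prob = 1.0
    for i, f_i in enumerate(f, start=1):
        j = alignment[i - 1]
        prob *= t(f_i, e_null[j]) * a(i, j, l, m)
    return prob

def toy_t(f_i, e_j):
    # Assumption: every translation probability is 0.5.
    return 0.5

def uniform_a(i, j, l, m):
    # Uniform distortion over the m English positions plus NULL.
    return 1 / (m + 1)

p = model2_alignment_probability(["la", "maison"], ["the", "house"], [1, 2], toy_t, uniform_a)
print(p)
```

Each of the two French words contributes a factor of 0.5 × 1/3, so the product is (1/6)² = 1/36.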


In [None]:
bitext, f_vocab, e_vocab = loadData(f_data, e_data, opts.num_sents)
size_f_vocab = len(f_vocab)

model_one__trained_t = train_model_one(bitext, t, t_lookup)
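# In the code shown here, a is only read through a_lookup, so this default value is never returned.
a = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 1/opts.num_sents))))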

### Lookup Functions

The lookup functions cover every case where `t(f_i, e_j)` or `a(i, j, l, m)` has no stored value yet, returning the uniform initial value instead of 0. Since these values are summed into Z and Z is a denominator in the algorithm, Z must not be 0.

As mentioned above for Model 1, the lookup functions also prevent the default dictionaries from becoming too bloated in memory: they return the base values `1 / size_f_vocab` (for `t`) and `1/(m+1)` (for `a`) directly, so we do not need to store the base values in the dictionaries.

In [None]:
def t_lookup(f_i, e_j):
    if f_i in t:
        if e_j in t[f_i]:
            return t[f_i][e_j]
    return 1 / size_f_vocab

def a_lookup(i,j,l,m):
    if i in a:
        if j in a[i]:
            if l in a[i][j]:
                if m in a[i][j][l]:
                    return a[i][j][l][m]
    return 1/(m+1)


### Training the Model


Model 2 improves upon Model 1 alignment by taking distortion into account. Instead of the forward distortion, we use the reverse distortion a(a<sub>i</sub> | i, I, J). This means we have to modify the normalization constant Z: each term of the sum is now the product of `t[f_i][e_j]` and `a[i][j][l][m]`, where `l` and `m` are the lengths of the given French and English sentences. The algorithm for Model 2 is otherwise the same as Model 1: it is an EM algorithm in which we first collect the expected counts and then update `t` and `a`.

In [None]:
    for iteration in range(5):
        e_count = defaultdict(int)
        fe_count = defaultdict(lambda: defaultdict(int))
        ijlm_count = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(int))))
        ilm_count = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        for (n, (f, e)) in enumerate(bitext):
            f_null = ["PLACE HOLDER"] + f
            e_null = ["NULL"] + e 
            l = len(f)
            m = len(e)
            for i in range(1, l + 1):
                Z = 0
                f_i = f_null[i]
                for j in range(0, m + 1):
                    e_j = e_null[j]
                    Z += t_lookup(f_i, e_j)*a_lookup(i,j,len(f),len(e))
                for j in range(0, m + 1):
                    e_j = e_null[j]
                    c = (t_lookup(f_i, e_j)*a_lookup(i,j,len(f),len(e)))/Z
                    fe_count[f_i][e_j] += c
                    e_count[e_j] += c
                    ijlm_count[i][j][len(f)][len(e)] += c
                    ilm_count[i][len(f)][len(e)] += c
        
        for f_i,e_is in fe_count.items():
            for e_j, value in e_is.items():
                t[f_i][e_j] = value/ e_count[e_j]
        
        for i, js in ijlm_count.items():
            for j,ls in js.items():
                for l, ms in ls.items():
                    for m, value in ms.items():
                        a[i][j][l][m] = value/ilm_count[i][l][m]
    return t,a
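
The expected count `c` in the loop above is the posterior probability that a French word aligns to each English position. The sketch below isolates that computation for a single word; the helper `posteriors`, the toy table, and its probabilities are assumptions for the illustration, while the argument order of the two lookups follows `t_lookup` and `a_lookup`.

```python
def posteriors(f_i, i, e_null, l, m, t_lookup, a_lookup):
    # Score every English position j (0 is NULL) by t * a, then normalize by Z.
    scores = [t_lookup(f_i, e_j) * a_lookup(i, j, l, m) for j, e_j in enumerate(e_null)]
    Z = sum(scores)
    return [s / Z for s in scores]

# Hypothetical translation table, keyed as t[f_i][e_j].
toy_table = {"maison": {"the": 0.2, "house": 0.6}}

def toy_t_lookup(f_i, e_j):
    # Unseen pairs fall back to a small made-up default.
    return toy_table.get(f_i, {}).get(e_j, 0.1)

def toy_a_lookup(i, j, l, m):
    # Uniform distortion, as in the initial state of a.
    return 1 / (m + 1)

post = posteriors("maison", 2, ["NULL", "the", "house"], 2, 2, toy_t_lookup, toy_a_lookup)
print([round(p, 3) for p in post])
```

The posteriors for one French word always sum to 1, which is why each French token adds exactly one unit of expected count in total to `fe_count` and `ijlm_count`.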

###  Decoding the best alignment

In [None]:
    for (n, (f, e)) in enumerate(bitext):
        current_sentence = ""
        l = len(f)
        m = len(e)
        for (i, f_i) in enumerate(f): 
            bestp = t_lookup(f_i, "NULL") * a_lookup(i + 1, 0, l, m)
            bestj = -1
            for (j, e_j) in enumerate(e):
                prob = t_lookup(f_i, e_j) * a_lookup(i + 1,j + 1,l,m)
                if prob > bestp:
                    bestp = prob
                    bestj = j
            if bestj != -1:
                current_sentence += "%i-%i " % (i ,bestj)
        alignment.append(current_sentence)

### Create an alignment from the German-English Europarl dataset (Model 2) 

In [None]:
! python3 align.py -p europarl -f de -n 100000 -m m2 > output.a 
! head -1000 output.a > europarl_align.a

### Create and score an alignment on the French-English Hansard dataset (Model 2)

In [None]:
! python3 align.py -n 100000 -m m2 | python3 score-alignments.py