# Homework 4 Aligner

In this notebook, I list the attempts I made and their results. The best AER I reached for fr-en alignment is 0.1926.

All attempts build on IBM Model 1, and the goal was the best AER achievable with that model. I also explored adding null words and LLR initialization. Each worked well on top of the baseline, but neither helped once I tried to combine all the improvements, so they are not included in this notebook. To see how those two improvements perform, refer to the notebooks under the backup folder.

- [Load the package and dataset](#part-0)
- [Baseline implementation: AER = 0.3417](#part-1)
- [Add n smoothing: AER = 0.3124](#part-2)
- [Use posterior probabilities + add n smoothing: AER = 0.2494 with threshold = 0.3](#part-3)
- [Intersection of alignments from two directions + add n smoothing: AER = 0.2030](#part-4)
- [Alignment by agreement + add n smoothing + posterior probability: AER = 0.1926](#part-5)

## Load the package and dataset <a id='part-0'></a>

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import optparse, sys, os, logging
from collections import defaultdict
from itertools import islice
import time
#from align_yabing import *

In [3]:
opts_datadir, opts_fileprefix = "data", "hansards"
opts_french, opts_english = "fr", "en"
# opts_datadir, opts_fileprefix = "data", "europarl"
# opts_french, opts_english = "de", "en"

In [4]:
opts_num_sents = 100000

In [5]:
f_data = "%s.%s" % (os.path.join(opts_datadir, opts_fileprefix), opts_french)
e_data = "%s.%s" % (os.path.join(opts_datadir, opts_fileprefix), opts_english)

In [6]:
bitext = [[sentence.strip().split() for sentence in pair] \
          for pair in islice(zip(open(f_data,encoding="utf8"), open(e_data,encoding="utf8")), opts_num_sents)]

## Baseline (AER = 0.3417)<a id='part-1'></a>

To start, I implemented the baseline according to the assignment description provided by the professor. It achieved an AER of 0.3417.

In [10]:
sys.stderr.write("Training with Expectation Maximization...\n")

Training with Expectation Maximization...


In [11]:
%%time
# f is the French word set
# e is the English word set
# f_count is the word count dictionary for French word set
# N is the number of sentences
f = set()
e = set()
f_count = defaultdict(int)
for pair in bitext:
    f = f.union(set(pair[0]))
    e = e.union(set(pair[1]))
    for f_i in set(pair[0]):
        f_count[f_i] += 1
N = len(bitext)

* $k = 0$<br>
* Initialize $t_0$ **## Easy choice: initialize uniformly ##**<br>
* repeat <br>
    * $k$ += 1 <br>
    * Initialize all counts to zero <br>
    * for each $(\textbf{f}, \textbf{e})$ in ${\cal D}$ <br>
        * for each $f_i$ in $\textbf{f}$ <br>
            * $Z$ = 0 **## Z commonly denotes a normalization term ##** <br>
            * for each $e_j$ in $\textbf{e}$ <br>
                * $Z$ += $t_{k-1}(f_i \mid e_j)$ <br>
            * for each $e_j$ in $\textbf{e}$ <br>
                * `c` = $ t_{k-1}(f_i \mid e_j) / Z $ <br>
                * count($f_i$, $e_j$) += `c` <br>
                * count($e_j$) += `c` <br>
    * for each ($f$, $e$) in count <br>
        * Set new parameters: $t_k(f \mid e)$ =  count($f,e$) / count($e$) <br>
* until k = 5
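
As a sanity check on the procedure above, here is a minimal sketch of the same EM loop on a two-sentence toy corpus. The function name `train_ibm1` and the toy sentences are hypothetical and not part of the assignment; the point is only to show that the shared word pair ("la", "the") pulls probability mass away from the other candidates.

```python
from collections import defaultdict

def train_ibm1(bitext, iterations=5):
    """Run the EM loop above; bitext is a list of (f_words, e_words) pairs."""
    f_vocab = {f_i for f_sent, _ in bitext for f_i in f_sent}
    # Uniform initialization: every t(f | e) starts at 1 / |French vocabulary|.
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        fe_count = defaultdict(float)
        e_count = defaultdict(float)
        for f_sent, e_sent in bitext:
            for f_i in f_sent:
                # Z is the normalization term over all English words in the sentence.
                z = sum(t[(f_i, e_j)] for e_j in e_sent)
                for e_j in e_sent:
                    c = t[(f_i, e_j)] / z
                    fe_count[(f_i, e_j)] += c
                    e_count[e_j] += c
        # M-step: renormalize the expected counts per English word.
        for (f_i, e_j), count in fe_count.items():
            t[(f_i, e_j)] = count / e_count[e_j]
    return t

# Hypothetical toy corpus: "la"/"the" co-occur in both sentence pairs.
toy_bitext = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
t = train_ibm1(toy_bitext)
print(t[("la", "the")], t[("maison", "the")], t[("maison", "house")])
```

On this toy data, $t(\text{la} \mid \text{the})$ ends up larger than $t(\text{maison} \mid \text{the})$, and $t(\text{maison} \mid \text{house})$ larger than $t(\text{la} \mid \text{house})$, which is the behavior the full-size run relies on.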

In [12]:
k = 0
# initialize theta uniformly
num_f = len(f_count)
theta = defaultdict(lambda: 1./num_f)
while k < 5:
    k += 1
    tic = time.time()
    sys.stderr.write(f"Iteration {k}.................................\n")
    e_count = defaultdict(int)
    fe_count = defaultdict(int)
    for n in range(N):
        for f_i in bitext[n][0]:
            Z = 0
            for e_j in bitext[n][1]:
                Z += theta[(f_i, e_j)]
            for e_j in bitext[n][1]:
                c = theta[(f_i, e_j)] / Z
                fe_count[(f_i, e_j)] += c
                e_count[e_j] += c
    for (f_i, e_j) in fe_count.keys():
        theta[(f_i, e_j)] = fe_count[(f_i, e_j)] / e_count[e_j]
    toc = time.time()
    sys.stderr.write(f"Iteration {k} finished. Time cost: {toc-tic}\n")

Iteration 1.................................
Iteration 2.................................
Iteration 3.................................
Iteration 4.................................
Iteration 5.................................


* for each $(\textbf{f}, \textbf{e})$ in ${\cal D}$
    * for each $f_i$ in $\textbf{f}$
        * `bestp` = 0
        * `bestj` = 0
        * for each $e_j$ in $\textbf{e}$
            * if $t(f_i \mid e_j)$ > `bestp`
                * `bestp` = $t(f_i \mid e_j)$
                * `bestj` = $j$
        * align $f_i$ to $e_{\texttt{bestj}}$

In [13]:
sys.stderr.write("Aligning...\n")

Aligning...


In [18]:
%%capture --no-stderr dice_a
for f, e in bitext:
    for i in range(len(f)):
        f_i = f[i]
        bestp = 0
        bestj = 0
        for j in range(len(e)):
            e_j = e[j]
            if theta[(f_i, e_j)] > bestp:
                bestp = theta[(f_i, e_j)]
                bestj = j
        sys.stdout.write(f"{i}-{bestj} ")
    sys.stdout.write("\n")

In [22]:
# dump the output to the local file dice.a
with open('dice.a','w',encoding="utf8") as fh:
    fh.write(str(dice_a))

In [1]:
#%run check-alignments.py -i dice.a

In [26]:
%run score-alignments.py -n 0 -i dice.a

Precision = 0.597603
Recall = 0.774889
AER = 0.341724
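
For reference, the three numbers above can be reproduced from link sets as in the sketch below. It assumes that `score-alignments.py` follows the standard definition of AER with sure and possible gold links; the function name `alignment_scores` and the toy link sets are hypothetical.

```python
def alignment_scores(predicted, sure, possible):
    """Return (precision, recall, aer) for sets of (i, j) links.

    Assumes the standard definition: precision is measured against the
    possible links, recall against the sure links.
    """
    a_and_s = len(predicted & sure)
    a_and_p = len(predicted & possible)
    precision = a_and_p / len(predicted)
    recall = a_and_s / len(sure)
    aer = 1.0 - (a_and_s + a_and_p) / (len(predicted) + len(sure))
    return precision, recall, aer

# Hypothetical gold annotation: two sure links plus one extra possible link.
sure = {(0, 0), (2, 1)}
possible = sure | {(1, 0)}
predicted = {(0, 0), (1, 1), (2, 1)}

precision, recall, aer = alignment_scores(predicted, sure, possible)
print(f"Precision = {precision:.4f}, Recall = {recall:.4f}, AER = {aer:.4f}")
```

Under this definition, a spurious link such as `(1, 1)` lowers precision and raises AER, while recall is unaffected as long as every sure link is found.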


## Add n smoothing (AER = 0.3124)<a id='part-2'></a>

In this trial, I added add n smoothing to the baseline. The AER improved from 0.3417 to 0.3124. I set n to 0.01; other values of n did not make much difference.
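
Written out, the smoothed update used in this trial replaces the baseline M-step with the following, where $n$ is `smooth_n` and $|V|$ is the assumed vocabulary size `vocab_N` (fixed at 100000 in the code rather than computed from the data):

$$
t_k(f \mid e) = \frac{\text{count}(f, e) + n}{\text{count}(e) + n\,|V|}
$$

With $n = 0.01$ and $|V| = 100000$, the denominator grows by $n\,|V| = 1000$, so the estimates for rare English words (small count($e$)) are pulled down the most.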

In [8]:
%%time
# f is the French word set
# e is the English word set
# f_count is the word count dictionary for French word set
# N is the number of sentences
f = set()
e = set()
f_count = defaultdict(int)
for pair in bitext:
    f = f.union(set(pair[0]))
    e = e.union(set(pair[1]))
    for f_i in set(pair[0]):
        f_count[f_i] += 1
N = len(bitext)

Wall time: 2min 59s


In [9]:
# add n smoothing
smooth_n = 0.01
vocab_N = 100000

In [10]:
k = 0
# initialize theta uniformly
num_f = len(f_count)
theta = defaultdict(lambda: 1./num_f)
while k < 5:
    k += 1
    tic = time.time()
    sys.stderr.write(f"Iteration {k}.................................\n")
    e_count = defaultdict(int)
    fe_count = defaultdict(int)
    for n in range(N):
        for f_i in bitext[n][0]:
            Z = 0
            for e_j in bitext[n][1]:
                Z += theta[(f_i, e_j)]
            for e_j in bitext[n][1]:
                c = theta[(f_i, e_j)] / Z
                fe_count[(f_i, e_j)] += c
                e_count[e_j] += c
    for (f_i, e_j) in fe_count.keys():
        theta[(f_i, e_j)] = (fe_count[(f_i, e_j)] + smooth_n) / (e_count[e_j] + vocab_N * smooth_n)
    toc = time.time()
    sys.stderr.write(f"Iteration {k} finished. Time cost: {toc-tic}\n")

Iteration 1.................................
Iteration 1 finished. Time cost: 70.06756949424744
Iteration 2.................................
Iteration 2 finished. Time cost: 69.03484225273132
Iteration 3.................................
Iteration 3 finished. Time cost: 66.50286340713501
Iteration 4.................................
Iteration 4 finished. Time cost: 67.4163613319397
Iteration 5.................................
Iteration 5 finished. Time cost: 66.89071893692017


In [11]:
%%capture --no-stderr dice_a
for f, e in bitext:
    for i in range(len(f)):
        f_i = f[i]
        bestp = 0
        bestj = 0
        for j in range(len(e)):
            e_j = e[j]
            if theta[(f_i, e_j)] > bestp:
                bestp = theta[(f_i, e_j)]
                bestj = j
        sys.stdout.write(f"{i}-{bestj} ")
    sys.stdout.write("\n")

In [13]:
# dump the output to the local file dice.a
with open('dice.a','w',encoding='utf8') as fh:
    fh.write(str(dice_a))

In [14]:
%run score-alignments.py -n 0 -i dice.a

Precision = 0.623889
Recall = 0.810054
AER = 0.312399


## Use posterior probabilities + add n smoothing (AER = 0.2494 with threshold 0.3)<a id='part-3'></a>

In this trial, I used posterior probabilities instead of the argmax method. Precision and recall vary with the threshold, and adjusting it always trades one against the other: a higher threshold keeps fewer links, which favors precision, while a lower one favors recall. With a threshold of 0.3, the AER improved from 0.3124 to 0.2494.
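
To make the tradeoff concrete, here is a minimal sketch of posterior thresholding on a single sentence pair. The helper `posterior_links` and the translation table `toy_theta` are hypothetical; only the decision rule (normalize over the English words, keep links at or above the threshold) mirrors this trial.

```python
def posterior_links(theta, f_sent, e_sent, delta):
    """Keep every (i, j) whose normalized score t(f_i | e_j) / Z reaches delta."""
    links = []
    for i, f_i in enumerate(f_sent):
        z = sum(theta[(f_i, e_j)] for e_j in e_sent)
        for j, e_j in enumerate(e_sent):
            if theta[(f_i, e_j)] / z >= delta:
                links.append((i, j))
    return links

# Hypothetical translation probabilities, keyed (french_word, english_word).
toy_theta = {
    ("la", "the"): 0.7, ("la", "house"): 0.1,
    ("maison", "the"): 0.2, ("maison", "house"): 0.6,
}
f_sent = ["la", "maison"]
e_sent = ["the", "house"]

loose = posterior_links(toy_theta, f_sent, e_sent, delta=0.2)
strict = posterior_links(toy_theta, f_sent, e_sent, delta=0.5)
print(loose)   # the weaker maison-the link (posterior 0.25) survives
print(strict)  # only the two strong links remain
```

Unlike argmax, this rule can emit zero or several links per French word, which is why the threshold directly controls the precision/recall balance.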

In [7]:
%%time
# f is the French word set
# e is the English word set
# f_count is the word count dictionary for French word set
# N is the number of sentences
f = set()
e = set()
f_count = defaultdict(int)
for pair in bitext:
    f = f.union(set(pair[0]))
    e = e.union(set(pair[1]))
    for f_i in set(pair[0]):
        f_count[f_i] += 1
N = len(bitext)

Wall time: 2min 55s


In [8]:
# add n smoothing
smooth_n = 0.01
vocab_N = 100000

In [9]:
k = 0
# initialize theta uniformly
num_f = len(f_count)
theta = defaultdict(lambda: 1./num_f)
while k < 5:
    k += 1
    tic = time.time()
    sys.stderr.write(f"Iteration {k}.................................\n")
    e_count = defaultdict(int)
    fe_count = defaultdict(int)
    for n in range(N):
        for f_i in bitext[n][0]:
            Z = 0
            for e_j in bitext[n][1]:
                Z += theta[(f_i, e_j)]
            for e_j in bitext[n][1]:
                c = theta[(f_i, e_j)] / Z
                fe_count[(f_i, e_j)] += c
                e_count[e_j] += c
    for (f_i, e_j) in fe_count.keys():
        theta[(f_i, e_j)] = (fe_count[(f_i, e_j)] + smooth_n) / (e_count[e_j] + vocab_N * smooth_n)
    toc = time.time()
    sys.stderr.write(f"Iteration {k} finished. Time cost: {toc-tic}\n")

Iteration 1.................................
Iteration 1 finished. Time cost: 67.98513007164001
Iteration 2.................................
Iteration 2 finished. Time cost: 67.01407670974731
Iteration 3.................................
Iteration 3 finished. Time cost: 67.34206819534302
Iteration 4.................................
Iteration 4 finished. Time cost: 68.05870580673218
Iteration 5.................................
Iteration 5 finished. Time cost: 67.69240880012512


* for each $(\textbf{f},\textbf{e})$ in ${\cal D}$
    * for each $f_i$ in $\textbf{f}$
        * $Z$ = 0
        * for each $e_j$ in $\textbf{e}$
            * $Z$ += $t(f_i \mid e_j)$
        * for each $e_j$ in $\textbf{e}$
            * `posterior` = $t(f_i \mid e_j) / Z$
            * if `posterior` $\geq \delta$, keep the alignment between $f_i$ and $e_j$

In [54]:
%%capture --no-stderr dice_a

# delta is the threshold we used to decide whether to keep the alignment pair
delta = 0.3

for f, e in bitext:
    for i, f_i in enumerate(f):
        Z = 0
        for j, e_j in enumerate(e):
            Z += theta[(f_i, e_j)]
        for j, e_j in enumerate(e):
            posterior = theta[(f_i, e_j)] / Z
            if posterior >= delta:
                sys.stdout.write(f"{i}-{j} ")
    sys.stdout.write("\n")

In [55]:
# dump the output to the local file dice.a
with open('dice.a','w',encoding='utf8') as fh:
    fh.write(str(dice_a))

In [56]:
%run score-alignments.py -n 0 -i dice.a

Precision = 0.740273
Recall = 0.765726
AER = 0.249424


## Intersection of alignments from two directions + add n smoothing (AER = 0.2030)<a id='part-4'></a>

In this trial, I trained one model per translation direction and used the intersection of the two argmax alignment sets as the final alignment. The AER improved substantially, to 0.2030: precision rose to 0.8587 at the cost of some recall (0.7303).
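
The idea can be sketched on one sentence pair as follows. The helper `argmax_links` and both toy tables are hypothetical; the tables are keyed the same way as `theta_e2f` (French word first) and `theta_f2e` (English word first) in this trial.

```python
def argmax_links(theta, src_sent, tgt_sent):
    """Link every source position to its highest-scoring target position."""
    links = set()
    for s, src_word in enumerate(src_sent):
        best = max(range(len(tgt_sent)),
                   key=lambda t: theta[(src_word, tgt_sent[t])])
        links.add((s, best))
    return links

f_sent = ["la", "petite", "maison"]
e_sent = ["the", "house"]

# Hypothetical table for the first direction, keyed (french_word, english_word).
toy_e2f = {
    ("la", "the"): 0.7, ("la", "house"): 0.1,
    ("petite", "the"): 0.2, ("petite", "house"): 0.3,
    ("maison", "the"): 0.1, ("maison", "house"): 0.8,
}
# Hypothetical table for the reverse direction, keyed (english_word, french_word).
toy_f2e = {
    ("the", "la"): 0.8, ("the", "petite"): 0.1, ("the", "maison"): 0.1,
    ("house", "la"): 0.05, ("house", "petite"): 0.15, ("house", "maison"): 0.8,
}

set_e2f = argmax_links(toy_e2f, f_sent, e_sent)  # (i, j) pairs
# The reverse model yields (j, i); flip to (i, j) before intersecting.
set_f2e = {(i, j) for j, i in argmax_links(toy_f2e, e_sent, f_sent)}
combined = set_e2f & set_f2e
print(sorted(combined))
```

Here the first direction is forced to link "petite" to some English word, but the reverse direction never picks that link, so the intersection drops it. This is the mechanism behind the higher precision and lower recall.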

In [7]:
%%time
# f is the French word set
# e is the English word set
# f_count is the word count dictionary for French word set
# e_count is the word count dictionary for English word set
# N is the number of sentences
f = set()
e = set()
f_count = defaultdict(int)
e_count = defaultdict(int)
for pair in bitext:
    f = f.union(set(pair[0]))
    e = e.union(set(pair[1]))
    for f_i in set(pair[0]):
        f_count[f_i] += 1
    for e_j in set(pair[1]):
        e_count[e_j] += 1
N = len(bitext)

Wall time: 3min 10s


In [8]:
num_f = len(f_count)
num_e = len(e_count)

# add n smoothing
smooth_n = 0.01
vocab_N = 100000

In [10]:
def align(num, N, isReverse=False):
    '''
    num: size of target language word count dict
    N: number of sentences
    isReverse: if the translation direction is reversed
    '''
    k = 0
    # initialize theta uniformly
    theta = defaultdict(lambda: 1./num)
    while k < 5:
        k += 1
        tic = time.time()
        sys.stderr.write(f"Iteration {k}.................................\n")
        e_count = defaultdict(int)
        fe_count = defaultdict(int)
        if isReverse:
            for n in range(N):
                for f_i in bitext[n][1]:
                    Z = 0
                    for e_j in bitext[n][0]:
                        Z += theta[(f_i, e_j)]
                    for e_j in bitext[n][0]:
                        c = theta[(f_i, e_j)] / Z
                        fe_count[(f_i, e_j)] += c
                        e_count[e_j] += c
        else:
            for n in range(N):
                for f_i in bitext[n][0]:
                    Z = 0
                    for e_j in bitext[n][1]:
                        Z += theta[(f_i, e_j)]
                    for e_j in bitext[n][1]:
                        c = theta[(f_i, e_j)] / Z
                        fe_count[(f_i, e_j)] += c
                        e_count[e_j] += c
        for (f_i, e_j) in fe_count.keys():
            theta[(f_i, e_j)] = (fe_count[(f_i, e_j)] + smooth_n) / (e_count[e_j] + vocab_N * smooth_n)
        toc = time.time()
        sys.stderr.write(f"Iteration {k} finished. Time cost: {toc-tic}\n")
    return theta

In [11]:
theta_e2f = align(num_f, N, False)
theta_f2e = align(num_e, N, True)

Iteration 1.................................
Iteration 1 finished. Time cost: 59.55198121070862
Iteration 2.................................
Iteration 2 finished. Time cost: 60.37677502632141
Iteration 3.................................
Iteration 3 finished. Time cost: 59.10754203796387
Iteration 4.................................
Iteration 4 finished. Time cost: 59.759299993515015
Iteration 5.................................
Iteration 5 finished. Time cost: 59.11804461479187
Iteration 1.................................
Iteration 1 finished. Time cost: 63.21055626869202
Iteration 2.................................
Iteration 2 finished. Time cost: 63.88322186470032
Iteration 3.................................
Iteration 3 finished. Time cost: 59.511693477630615
Iteration 4.................................
Iteration 4 finished. Time cost: 57.98052620887756
Iteration 5.................................
Iteration 5 finished. Time cost: 57.46198773384094


In [12]:
%%capture --no-stderr dice_a
for f, e in bitext:
    set_e2f = set()
    set_f2e = set()
    for i in range(len(f)):
        f_i = f[i]
        bestp = 0
        bestj = 0
        for j in range(len(e)):
            e_j = e[j]
            if theta_e2f[(f_i, e_j)] > bestp:
                bestp = theta_e2f[(f_i, e_j)]
                bestj = j
        set_e2f.add((i, bestj))        
    for j in range(len(e)):
        e_j = e[j]
        bestp = 0
        besti = 0
        for i in range(len(f)):
            f_i = f[i]
            if theta_f2e[(e_j, f_i)] > bestp:
                bestp = theta_f2e[(e_j, f_i)]
                besti = i
        set_f2e.add((besti,j))
    set_combined = set_f2e.intersection(set_e2f)
    
    for pair in set_combined:
        sys.stdout.write(f"{pair[0]}-{pair[1]} ")
    sys.stdout.write("\n")

In [13]:
# dump the output to the local file dice.a
with open('dice.a','w',encoding='utf8') as fh:
    fh.write(str(dice_a))

In [14]:
%run score-alignments.py -n 0 -i dice.a

Precision = 0.858713
Recall = 0.730312
AER = 0.202974


## Alignment by agreement + add n smoothing + posterior probability (AER = 0.1926)<a id='part-5'></a>

In this trial, I combined everything I had tried: add n smoothing, posterior probabilities for alignment, and agreement between two independently trained models. I experimented with several thresholds (see below). Note that the best threshold is close to the square of the threshold used in the posterior-only trial (0.3), because here I threshold the product of two posterior probabilities. The best threshold turned out to be 0.08. I also tried different values of n for add n smoothing (results not shown), but they made little difference to the AER, so I simply set n to 0.01 as specified in the paper.

Results with n = 0.01:

| Threshold | AER score | Precision | Recall   |
|-----------|-----------|-----------|----------|
| 0.04      | 0.215656  | 0.747634  | 0.841010 |
| 0.06      | 0.197782  | 0.800442  | 0.804606 |
| 0.07      | 0.194017  | 0.817271  | 0.791481 |
| 0.08      | 0.192571  | 0.829273  | 0.780337 |
| 0.09      | 0.194620  | 0.836773  | 0.767707 |
| 0.10      | 0.196451  | 0.844369  | 0.756067 |
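
Written out, the decision rule for this trial keeps a link $(i, j)$ when the product of the two directional posteriors exceeds $\delta$. The symbols $p_{e2f}$ and $p_{f2e}$ are shorthand introduced here for the per-sentence posteriors computed from `theta_e2f` and `theta_f2e`:

$$
p_{e2f}(i, j) = \frac{t_{e2f}(f_i \mid e_j)}{\sum_{j'} t_{e2f}(f_i \mid e_{j'})}, \qquad
p_{f2e}(j, i) = \frac{t_{f2e}(e_j \mid f_i)}{\sum_{i'} t_{f2e}(e_j \mid f_{i'})}
$$

$$
\text{keep } (i, j) \iff p_{e2f}(i, j) \cdot p_{f2e}(j, i) > \delta
$$

If each factor sat exactly at the single-direction threshold of 0.3, the product would be $0.3^2 = 0.09$, which is close to the best value of 0.08 in the table.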

In [7]:
%%time
# f is the French word set
# e is the English word set
# f_count is the word count dictionary for French word set
# e_count is the word count dictionary for English word set
# N is the number of sentences
f = set()
e = set()
f_count = defaultdict(int)
e_count = defaultdict(int)
for pair in bitext:
    f = f.union(set(pair[0]))
    e = e.union(set(pair[1]))
    for f_i in set(pair[0]):
        f_count[f_i] += 1
    for e_j in set(pair[1]):
        e_count[e_j] += 1
N = len(bitext)

Wall time: 2min 52s


In [8]:
num_f = len(f_count)
num_e = len(e_count)

# add n smoothing
smooth_n = 0.01
vocab_N = 100000

In [9]:
def align(num, N, isReverse=False):
    '''
    num: size of target language word count dict
    N: number of sentences
    isReverse: if the translation direction is reversed
    '''
    k = 0
    # initialize theta uniformly
    theta = defaultdict(lambda: 1./num)
    while k < 5:
        k += 1
        tic = time.time()
        sys.stderr.write(f"Iteration {k}.................................\n")
        e_count = defaultdict(int)
        fe_count = defaultdict(int)
        if isReverse:
            for n in range(N):
                for f_i in bitext[n][1]:
                    Z = 0
                    for e_j in bitext[n][0]:
                        Z += theta[(f_i, e_j)]
                    for e_j in bitext[n][0]:
                        c = theta[(f_i, e_j)] / Z
                        fe_count[(f_i, e_j)] += c
                        e_count[e_j] += c
        else:
            for n in range(N):
                for f_i in bitext[n][0]:
                    Z = 0
                    for e_j in bitext[n][1]:
                        Z += theta[(f_i, e_j)]
                    for e_j in bitext[n][1]:
                        c = theta[(f_i, e_j)] / Z
                        fe_count[(f_i, e_j)] += c
                        e_count[e_j] += c
        for (f_i, e_j) in fe_count.keys():
            theta[(f_i, e_j)] = (fe_count[(f_i, e_j)] + smooth_n) / (e_count[e_j] + vocab_N * smooth_n)
        toc = time.time()
        sys.stderr.write(f"Iteration {k} finished. Time cost: {toc-tic}\n")
    return theta

In [10]:
theta_e2f = align(num_f, N, False)
theta_f2e = align(num_e, N, True)

Iteration 1.................................
Iteration 1 finished. Time cost: 55.44521379470825
Iteration 2.................................
Iteration 2 finished. Time cost: 56.67413544654846
Iteration 3.................................
Iteration 3 finished. Time cost: 66.29871845245361
Iteration 4.................................
Iteration 4 finished. Time cost: 67.42438054084778
Iteration 5.................................
Iteration 5 finished. Time cost: 68.79667353630066
Iteration 1.................................
Iteration 1 finished. Time cost: 71.30190944671631
Iteration 2.................................
Iteration 2 finished. Time cost: 70.01798844337463
Iteration 3.................................
Iteration 3 finished. Time cost: 67.70860290527344
Iteration 4.................................
Iteration 4 finished. Time cost: 68.6275646686554
Iteration 5.................................
Iteration 5 finished. Time cost: 62.62635397911072


In [11]:
%%capture --no-stderr dice_a

# delta is the threshold we used to decide whether to keep the alignment pair
delta = 0.08

for f, e in bitext:
    posterior_e2f = defaultdict(float)
    posterior_f2e = defaultdict(float)
    for i, f_i in enumerate(f):
        Z = 0
        for j, e_j in enumerate(e):
            Z += theta_e2f[(f_i, e_j)]
        for j, e_j in enumerate(e):
            posterior_e2f[(i,j)] = theta_e2f[(f_i, e_j)] / Z
    
    for j, e_j in enumerate(e):
        Z = 0
        for i, f_i in enumerate(f):
            Z += theta_f2e[(e_j, f_i)]
        for i, f_i in enumerate(f):
            posterior_f2e[(j,i)] = theta_f2e[(e_j, f_i)] / Z

    for pair in posterior_e2f.keys():
        posterior = posterior_e2f[pair] * posterior_f2e[(pair[1],pair[0])]
        if posterior > delta:
            sys.stdout.write(f"{pair[0]}-{pair[1]} ")
    sys.stdout.write("\n")

In [12]:
# dump the output to the local file dice.a
with open('dice.a','w',encoding='utf8') as fh:
    fh.write(str(dice_a))

In [13]:
%run score-alignments.py -n 0 -i dice.a

Precision = 0.829273
Recall = 0.780337
AER = 0.192571
