# Overview

In this homework we tried two models for alignment:

- IBM Model 1
- HMM model

We applied bidirectional training and decoding to both models. Initializing the HMM model with the parameters pretrained by IBM Model 1 gave our best result, an **AER of 0.107** after 5 iterations.

# Baseline (IBM model 1)

For each French-English word pair $(f, e)$, we keep a translation probability $t(f|e)$, which is initialized uniformly to $1/|f|$.

In each iteration, we:

* initialize count() and count_pair() to 0
* for each parallel sentence pair $(f,e)$
    * for each French word $f_i$
        * $z = \sum_{e_j} t(f_i|e_j)$
        * for each English word $e_j$
            * $c = t(f_i|e_j) / z$
            * $count\_pair(f_i|e_j) += c$
            * $count(e_j) += c$
* for each word pair (f,e) in count_pair()
    * $t(f|e) = count\_pair(f|e) / count(e)$

Repeat the process until the difference between the new log likelihood and the previous one is smaller than a fixed value epsilon or until we have run a fixed number of iterations.
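The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than our actual implementation: the function names `train_ibm1` and `align_ibm1` are hypothetical, the uniform initial value is assumed to be one over the French vocabulary size, and the sketch runs a fixed number of iterations instead of checking the log likelihood.

```python
from collections import defaultdict

def train_ibm1(bitext, iterations=5):
    """EM for IBM Model 1; bitext is a list of (french_tokens, english_tokens)."""
    f_vocab = {f for fs, _ in bitext for f in fs}
    uniform = 1.0 / len(f_vocab)          # assumed uniform initial value
    t = defaultdict(lambda: uniform)      # t[(f, e)] approximates t(f|e)
    for _ in range(iterations):
        count_pair = defaultdict(float)
        count = defaultdict(float)
        # E-step: collect expected counts
        for fs, es in bitext:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count_pair[(f, e)] += c
                    count[e] += c
        # M-step: renormalize
        for (f, e), c in count_pair.items():
            t[(f, e)] = c / count[e]
    return t

def align_ibm1(t, fs, es):
    """Link each French position to its most probable English position."""
    return [(i, max(range(len(es)), key=lambda j: t[(fs[i], es[j])]))
            for i in range(len(fs))]

# Toy usage on two sentence pairs
toy = [("la maison".split(), "the house".split()),
       ("la fleur".split(), "the flower".split())]
t = train_ibm1(toy, iterations=10)
print(align_ibm1(t, ["la", "maison"], ["the", "house"]))
```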

### Result

After 8 iterations:
* Precision = 0.599407
* Recall = 0.773403
* AER = 0.341046

# Improvements

## Bidirectional IBM model 1 (align using $Pr(f|e)$ and $Pr(e|f)$)

We train one model for $Pr(f|e)$ and another for $Pr(e|f)$, decode the best alignment with each model independently, and report only the alignment links that appear in the intersection of the two alignment sets.
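The intersection step can be sketched as follows. The function name `intersect_alignments` and the index conventions are assumptions made for illustration: the $Pr(f|e)$ model is assumed to emit `(f_idx, e_idx)` links and the $Pr(e|f)$ model `(e_idx, f_idx)` links, so the second set is flipped before intersecting.

```python
def intersect_alignments(f2e, e2f):
    """Keep only the links that both directional models agree on.

    f2e: iterable of (f_idx, e_idx) links from the Pr(f|e) model
    e2f: iterable of (e_idx, f_idx) links from the Pr(e|f) model
    """
    flipped = {(f_idx, e_idx) for e_idx, f_idx in e2f}
    return set(f2e) & flipped

# Toy usage: the two directions disagree on French word 2
f2e = {(0, 0), (1, 1), (2, 1)}
e2f = {(0, 0), (1, 1), (2, 2)}
print(sorted(intersect_alignments(f2e, e2f)))
```

Because a link survives only when both directions propose it, the output can never be larger than either input set, which is consistent with higher precision and lower recall.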

### Result

After 100 iterations:
* Precision = 0.867216
* Recall = 0.695146
* AER = 0.220469

### Analysis

Intersecting the decoding results of the two alignment directions gives much higher precision but lower recall, which means we discard many good links that appear in only one direction. Both precision and recall can be improved by encouraging agreement between the two directions during training rather than only at decoding time (Liang et al. [2]).

# HMM-based alignment model

## Baseline

The HMM alignment model is an extension of IBM Model 1 that models not only the emission probability $p \left( f _ { j } | e _ { a _ { j } } \right)$ but also the transition probability $p \left( a _ { j } | a _ { j - 1 } , I \right)$, as follows:

$$
     Pr( f | e ) = \sum _ { a } \prod _ { j = 1 } ^ { J } \left[ p \left( a _ { j } | a _ { j - 1 } , I \right) \cdot p \left( f _ { j } | e _ { a _ { j } } \right) \right]
$$

Training can be done with the Baum-Welch algorithm [7], which makes use of the forward-backward algorithm [3]. The parameters are re-estimated as

$$
\begin{aligned}
p ( f | e ) &= \frac { c ( f , e ) } { \sum _ { f ^ { \prime } } c \left( f ^ { \prime } , e \right) } \\
p ( i | i ^ { \prime } , I ) &= \frac { c \left( i ^ { \prime } , i , I \right) } { \sum _ { i ^ { \prime \prime } = 1 } ^ { I } c \left( i ^ { \prime } , i ^ { \prime \prime } , I \right) }
\end{aligned}
$$
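The expected counts $c(f, e)$ and $c(i', i, I)$ are obtained from the forward-backward pass over each sentence pair. The sketch below shows that pass for one pair under stated assumptions: the name `forward_backward` is hypothetical, the initial distribution `init` is passed in explicitly because the formulas above do not say how $a_0$ is handled, and no scaling is applied, so long sentences would underflow in practice.

```python
def forward_backward(fs, es, t, trans, init):
    """Posteriors for one sentence pair.

    t[(f, e)]    : emission probability p(f|e)
    trans[k][i]  : transition probability p(i|k, I)
    init[i]      : assumed probability of starting at English position i
    Returns (gamma, xi, z) where gamma[j][i] = P(a_j = i | f, e),
    xi[j-1][k][i] = P(a_{j-1} = k, a_j = i | f, e), and z = Pr(f|e).
    """
    J, I = len(fs), len(es)
    emit = [[t[(f, e)] for e in es] for f in fs]
    alpha = [[0.0] * I for _ in range(J)]
    beta = [[0.0] * I for _ in range(J)]
    for i in range(I):
        alpha[0][i] = init[i] * emit[0][i]
        beta[J - 1][i] = 1.0
    for j in range(1, J):                       # forward pass
        for i in range(I):
            alpha[j][i] = emit[j][i] * sum(
                alpha[j - 1][k] * trans[k][i] for k in range(I))
    for j in range(J - 2, -1, -1):              # backward pass
        for k in range(I):
            beta[j][k] = sum(
                trans[k][i] * emit[j + 1][i] * beta[j + 1][i] for i in range(I))
    z = sum(alpha[J - 1])
    gamma = [[alpha[j][i] * beta[j][i] / z for i in range(I)] for j in range(J)]
    xi = [[[alpha[j - 1][k] * trans[k][i] * emit[j][i] * beta[j][i] / z
            for i in range(I)] for k in range(I)] for j in range(1, J)]
    return gamma, xi, z
```

Summing `gamma[j][i]` into $c(f_j, e_i)$ and `xi[j-1][k][i]` into $c(k, i, I)$ over all sentence pairs gives the counts used in the two re-estimation formulas.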

The best alignment is decoded with the Viterbi algorithm. See [3] for more detail.
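A minimal sketch of Viterbi decoding for one sentence pair is given below. The name `viterbi_align`, the explicit `init` distribution, and the dictionary layout of `t` are assumptions for illustration; scores are kept in log space to avoid underflow.

```python
import math

def viterbi_align(fs, es, t, trans, init):
    """Return the most probable English position for each French word.

    t[(f, e)]: emission p(f|e); trans[k][i]: p(i|k, I); init[i]: assumed start distribution.
    """
    J, I = len(fs), len(es)

    def lg(x):
        return math.log(x) if x > 0 else float("-inf")

    score = [[lg(init[i]) + lg(t[(fs[0], es[i])]) for i in range(I)]]
    back = []
    for j in range(1, J):
        row, ptr = [], []
        for i in range(I):
            # best previous state for landing on English position i
            k_best = max(range(I), key=lambda k: score[-1][k] + lg(trans[k][i]))
            row.append(score[-1][k_best] + lg(trans[k_best][i])
                       + lg(t[(fs[j], es[i])]))
            ptr.append(k_best)
        score.append(row)
        back.append(ptr)
    # follow back-pointers from the best final state
    i = max(range(I), key=lambda s: score[-1][s])
    path = [i]
    for ptr in reversed(back):
        i = ptr[i]
        path.append(i)
    path.reverse()
    return path
```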

### Result

After 4 iterations:
* Precision = 0.731220
* Recall = 0.862803
* AER = 0.223748



## Extensions

### Smoothing

Following [4], we smooth the transition probability by interpolating it with a uniform distribution over the $I$ English positions:

$$
p ^ { \prime } \left( a _ { j } | a _ { j - 1 } , I \right) = \alpha \cdot \frac { 1 } { I } + ( 1 - \alpha ) \cdot p \left( a _ { j } | a _ { j - 1 } , I \right)
$$

$\alpha$ is set to 0.4 as in [4] in our experiments.
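As a small illustration, the interpolation can be applied row by row to a transition matrix. The function name `smooth_transitions` and the list-of-lists representation are assumptions made for this sketch.

```python
def smooth_transitions(trans, alpha=0.4):
    """Interpolate each row of an I x I transition matrix with the uniform 1/I."""
    I = len(trans)
    return [[alpha / I + (1 - alpha) * p for p in row] for row in trans]

# Toy usage: a zero-probability jump receives some mass after smoothing
print(smooth_transitions([[1.0, 0.0], [0.5, 0.5]]))
```

If each input row sums to one, each smoothed row still sums to one, since the uniform part contributes $\alpha$ and the original part $1 - \alpha$.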

### Word-Dependent Transition Model

The baseline HMM model conditions the transition probability only on the previous alignment position. He [5] instead used a transition probability that also depends on the previously aligned English word. The word-dependent HMM model is therefore

$$
Pr( \boldsymbol { f } | \boldsymbol { e } ) = \sum _ { a } \prod _ { j = 1 } ^ { J } \left[ p \left( a _ { j } | a _ { j - 1 } , e _ { a _ { j - 1 } } , I \right) \cdot p \left( f _ { j } | e _ { a _ { j } } \right) \right]
$$

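Data sparsity is often a problem for word-dependent transition probabilities, because many words occur too rarely to estimate a full jump distribution. To estimate $p ( i | i ^ { \prime } , e , I )$, maximum a posteriori (MAP) estimation is used, with the word-independent transition probability as the prior: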

$$
p _ { M A P } ( i | i ^ { \prime } , e , I ) = \frac { c \left( i - i ^ { \prime } ; e \right) + \tau \cdot p ( i | i ^ { \prime } , I ) } { \sum _ { i ^ { \prime \prime } = 1 } ^ { I } c \left( i ^ { \prime \prime } - i ^ { \prime } ; e \right) + \tau }
$$

Following [5], we set $\tau$ to 1000 in practice.
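The MAP estimate can be sketched as below. The function name `map_transition` and the data layout are assumptions for illustration: `word_counts` maps a jump distance $d = i - i'$ to the expected count $c(d; e)$ for one English word, and `prior` holds the word-independent $p(i | i', I)$ for the current $i'$.

```python
def map_transition(word_counts, prior, i_prev, tau=1000.0):
    """MAP estimate of p(i | i_prev, e, I) for every target position i.

    word_counts: dict mapping jump distance d = i - i_prev to c(d; e)
    prior:       list with prior[i] = p(i | i_prev, I)
    """
    I = len(prior)
    total = sum(word_counts.get(i2 - i_prev, 0.0) for i2 in range(I))
    return [(word_counts.get(i - i_prev, 0.0) + tau * prior[i]) / (total + tau)
            for i in range(I)]

# Toy usage with a small tau so that the counts visibly shift the prior
print(map_transition({0: 1.0, 1: 3.0}, [0.25] * 4, i_prev=1, tau=4.0))
```

With a large $\tau$ such as 1000, rare words stay close to the word-independent prior, while frequent words with large counts can move away from it.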

### Distortion in Buckets

When calculating the distortion counts $c \left( i - i ^ { \prime } , I \right)$ and $c \left( i - i ^ { \prime } ; e \right)$, the counts are put into buckets $c[d \le -7], c[d=-6], \ldots, c[d \ge 7]$, following the idea in [2] and the settings in [5]. The counts for $d \le -7$ and $d \ge 7$ are evenly distributed when estimating parameters.
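One way to read the bucketing is sketched below. The helper names `bucket` and `jump_probs` are hypothetical, and the handling of the two end buckets is our interpretation of "evenly distributed": the count of an end bucket is split equally among the target positions that fall into it.

```python
from collections import defaultdict

def bucket(d, limit=7):
    """Clip a jump distance d into the range [-limit, limit]."""
    return max(-limit, min(limit, d))

def jump_probs(bucket_counts, i_prev, I, limit=7):
    """Turn bucketed jump counts into p(i | i_prev, I) for i = 0..I-1."""
    # number of target positions that share each bucket
    sizes = defaultdict(int)
    for i in range(I):
        sizes[bucket(i - i_prev, limit)] += 1
    weights = []
    for i in range(I):
        b = bucket(i - i_prev, limit)
        # a shared bucket's count is split evenly over its positions
        weights.append(bucket_counts.get(b, 0.0) / sizes[b])
    z = sum(weights)
    if z == 0:
        return [1.0 / I] * I  # fall back to uniform when nothing was observed
    return [w / z for w in weights]
```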

### Modelling Fertility

As in [6], we further extend our model to model fertility. 

$$
\begin{aligned} \tilde { p } \left( a _ { j } | a _ { j - 1 } , e _ { a _ { j - 1 } } , I \right) & = \delta \left( a _ { j } , a _ { j - 1 } \right) p ( stay | e _ { a _ { j - 1 } } ) \\ & + \left( 1 - \delta \left( a _ { j } , a _ { j - 1 } \right) \right) ( 1 - p(stay | e _ { a _ { j - 1 } } ) ) p \left( a _ { j } | a _ { j - 1 } , I \right) \end{aligned}
$$

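where the stay probability is smoothed as

$$
p ( stay | e _ { a _ { j - 1 } } ) = \lambda p _ { Z J } + ( 1 - \lambda ) \hat { p } (stay | e _ { a _ { j - 1 } } )
$$

Here $\hat { p } (stay | e _ { a _ { j - 1 } } )$ is the unsmoothed word-specific estimate, $p _ { Z J } = Pr \left( a _ { j } = i | a _ { j - 1 } = i , I \right)$ is the word-independent probability of a zero jump, and $\delta \left( a _ { j } , a _ { j - 1 } \right)$ is the Kronecker delta function.

$\lambda$ is set to 0.1 as in [6].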
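The two formulas can be combined into one small function, sketched here under assumptions: the name `stay_aware_transition` is hypothetical, `p_stay_word` stands for the word-specific stay estimate on the right-hand side of the smoothing formula, and `trans_row` holds $p(a_j | a_{j-1}, I)$ for the current $a_{j-1}$. The sketch follows the formulas literally and does not renormalize the resulting row.

```python
def stay_aware_transition(trans_row, i_prev, p_stay_word, p_zj, lam=0.1):
    """Transition row that treats staying on the same English word separately.

    trans_row:   trans_row[i] = p(i | i_prev, I)
    p_stay_word: word-specific stay estimate for the word at i_prev
    p_zj:        word-independent zero-jump probability
    """
    p_stay = lam * p_zj + (1 - lam) * p_stay_word
    return [p_stay if i == i_prev else (1 - p_stay) * p
            for i, p in enumerate(trans_row)]

# Toy usage: a word that tends to generate several French words keeps a high stay probability
print(stay_aware_transition([0.5, 0.3, 0.2], i_prev=0, p_stay_word=0.6, p_zj=0.5))
```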

### Bidirectional HMM

As with IBM Model 1, we make the HMM model bidirectional: we train it in both directions and intersect the decoded alignments.

## Combine bidirectional HMM model with bidirectional IBM model 1

We first train the bidirectional IBM Model 1 for 10 iterations and then use its parameters to initialize the bidirectional HMM model. This is possible because the HMM model is an extension of IBM Model 1: both model the same emission (translation) probability $p(f_j | e_i)$. Training the bidirectional HMM is very slow, taking around one and a half hours per iteration, so we train it for 5 iterations. Comparing the results with and without the pretrained IBM Model 1 parameters, loading them lowers the AER from 0.150 to 0.107, mainly by raising recall from 0.751 to 0.827.

### Result without pretrained IBM 1

After 5 iterations:
* Precision = 0.952381
* Recall = 0.750619
* AER = 0.150448

### Result with pretrained IBM 1

After 5 iterations:
* Precision = 0.954314
* Recall = 0.826895
* AER = 0.107305


# References

[1] "IBM Models". SMT Research Survey Wiki. 11 September 2015. Retrieved 20 Nov 2018.

[2] P. Liang, B. Taskar, and D. Klein. Alignment by agreement. In *NAACL*, 2006.

[3] Anahita Mansouri Bigvand. Word Alignment for Statistical Machine Translation Using Hidden Markov Models. 2015.

[4] Franz Josef Och and Hermann Ney. A comparison of alignment models for statistical machine translation. In *Proceedings of the 18th Conference on Computational Linguistics - Volume 2*, pages 1086–1090. Association for Computational Linguistics, 2000.

[5] Xiaodong He. Using word-dependent transition models in HMM-based word alignment for statistical machine translation. In *Proceedings of the Second Workshop on Statistical Machine Translation*, pages 80–87. Association for Computational Linguistics, 2007.

[6] Kristina Toutanova, H. Tolga Ilhan, and Christopher D. Manning. Extensions to HMM-based statistical word alignment models. In *Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10*, pages 87–94. Association for Computational Linguistics, 2002.

[7] Leonard E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. *Inequalities*, 3:1–8, 1972.

In [1]:
from HMMmodel import BiHMMmodel

# Paths to the French text, the English text, and the gold alignments
f_path = "data/hansards.fr"
e_path = "data/hansards.en"
a_path = "data/hansards.a"
with open(f_path) as f, open(e_path) as e, open(a_path) as a:
    f_data, e_data, a_data = f.readlines(), e.readlines(), a.readlines()

# Tokenized (French, English) sentence pairs, plus the reversed (English, French) direction
bitext = [[sentence.strip().split() for sentence in pair]
          for pair in zip(f_data, e_data)]
rev_bitext = [[e_sentence, f_sentence] for f_sentence, e_sentence in bitext]

# Load the saved bidirectional HMM model and score it against the gold alignments
bihmmmodel = BiHMMmodel()
bihmmmodel.load_model('bihmm_iter5.m')
bihmmmodel.validate(bitext, rev_bitext, f_data, e_data, a_data)

Precision = 0.954314
Recall = 0.826895
AER = 0.107305
