# Machine Translation

How can translation between languages be automated?

Notation
* Translate from: $S = source$
    * Sometimes denoted $F=french$
* Translate to: $T=target$
    * Sometimes denoted $E=english$

## Probabilistic Word Models
The chosen English sentence is the one that is most probable given the foreign sentence, i.e.

$$\hat{e} = argmax_e \ P(e|f)$$

This can be rewritten using Bayes' rule: $P(e|f) = \frac{P(f|e) P(e)}{P(f)}$

$$\hat{e} = argmax_e \ \frac{P(f|e) P(e)}{P(f)}$$


We can ignore $P(f)$ since it is constant across all English sentences and therefore does not change which sentence maximizes the objective. Thus the most probable English sentence becomes
$$\hat{e}= argmax_e \ P(f|e) P(e)$$

* $P(f|e)$: Translation model
* $P(e)$: Language model
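
As a minimal sketch of the decision rule above, the snippet below scores a handful of candidate English sentences by $P(f|e) P(e)$ and picks the best one. The two probability tables are made-up toy numbers (an assumption for illustration, not estimated from data), and `best_translation` is a hypothetical helper name.

```python
import math

# Toy, made-up probabilities for one fixed foreign sentence f (assumption).
translation_model = {  # P(f | e)
    "the house": 0.6,
    "house the": 0.6,
    "the home": 0.3,
}
language_model = {  # P(e)
    "the house": 0.05,
    "house the": 0.0001,
    "the home": 0.02,
}

def best_translation(candidates):
    """Return the candidate e that maximizes log P(f|e) + log P(e)."""
    return max(
        candidates,
        key=lambda e: math.log(translation_model[e]) + math.log(language_model[e]),
    )

best = best_translation(list(translation_model))
print(best)  # the house
```

Note how the language model breaks the tie between `the house` and `house the`, which the translation model alone scores equally.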

## IBM Model 1

Translation model $$p(f|e) = \sum_a P(f,a|e)$$
* French sentence $f = (f_1 ... f_{m})$
* English sentence $e = (e_1 ... e_{n})$
* Alignment: $a = (a_1 ... a_m)$

__Alignment__ 

Alignment function $a: j\rightarrow i$
* Maps the French word $f_j$ at position $j$ to the English word $e_i$ at position $i$
* Keep track of alignments when translating
* Also works when one word in $S$ becomes multiple words in $T$
* Some words may also be dropped or added in translation

$$P(f,a | e) = P(f| e, a) \cdot P(a|e) \quad \text{[Chain rule]}$$ 

$$P(f,a | e) = \frac{\epsilon}{(n+1)^m} \prod_{j=1}^m t(f_j | e_{a(j)})$$
* $t(f_j | e_i) = P(f_j | e_i)$: Probability of translating $e_i$ as $f_j$
* $\epsilon$: Normalization constant
* $(n+1)^m$: Number of possible alignments (each of the $m$ French words aligns to one of the $n$ English words or to the NULL word)
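
The formula for $P(f,a|e)$ can be sketched directly in code. This is a minimal illustration, assuming the English sentence is given with a NULL token at index 0 and that the translation table `t` is a plain dict keyed by `(french_word, english_word)`; the function name, the table values, and the default $\epsilon = 1$ are all assumptions.

```python
def joint_probability(f, e, a, t, epsilon=1.0):
    """P(f, a | e) under IBM Model 1.

    f: list of French words (length m)
    e: list of English words with "NULL" at index 0 (length n + 1)
    a: list of length m, a[j] is the index into e that f[j] is aligned to
    t: dict mapping (french_word, english_word) -> t(f | e)
    """
    m = len(f)
    n = len(e) - 1  # NULL is not counted in n
    prob = epsilon / (n + 1) ** m
    for j in range(m):
        prob *= t[(f[j], e[a[j]])]
    return prob

# Toy translation table (made-up values).
t = {("la", "the"): 0.7, ("maison", "house"): 0.8}
p = joint_probability(
    f=["la", "maison"],
    e=["NULL", "the", "house"],
    a=[1, 2],  # la -> the, maison -> house
    t=t,
)
print(p)  # approx. 0.0622 = 1/9 * 0.7 * 0.8
```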

__Training with EM Algorithm__
* Input: sentence-aligned corpus of $N$ sentence pairs
* Initialization: uniform $t(f_j | e_i)$ distribution

E-Step: Apply model on the data
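$$E[count(f_j, e_i)] = \sum_{(f,e)} \sum_a P(a | f,e) \cdot count_a(f_j, e_i)$$

where $count_a(f_j, e_i)$ is the number of times $f_j$ is aligned to $e_i$ under alignment $a$, and

$$P(a | f,e) = \frac{P(f, a | e)}{\sum_{a'} P(f,a' | e)}$$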

M-Step: Normalize probability
$$t(f_j | e_i) = \frac{E[count(f_j, e_i)]}{\sum_{j'} E[count(f_{j'}, e_i)]}$$
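
The E-step and M-step above can be sketched as a short training loop. This is an illustrative sketch, not a tuned implementation: it relies on the Model 1 property that the alignment posterior factorizes per French word, so the expected counts can be collected without enumerating all $(n+1)^m$ alignments. The function name `train_ibm1`, the toy corpus, and the iteration count are assumptions.

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """Estimate t(f | e) with EM.

    corpus: list of (french_words, english_words) sentence pairs.
    Returns a dict-like table keyed by (french_word, english_word).
    """
    f_vocab = {f for fs, _ in corpus for f in fs}
    # Uniform initialization of t(f | e).
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count(f, e)
        total = defaultdict(float)  # expected counts summed over f, per e
        # E-step: collect expected counts under the current t.
        for fs, es in corpus:
            es = ["NULL"] + es
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normalizer for this f
                for e in es:
                    c = t[(f, e)] / z  # posterior that f is aligned to e
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize the expected counts per English word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
t = train_ibm1(corpus)
print(t[("maison", "house")], t[("la", "house")])
```

On this toy corpus, `la` co-occurs with `the` in both pairs, so EM gradually shifts probability mass for `house` toward `maison`.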

__Evaluate model__

Perplexity score: How well the model fits the data
$$\log_2 Perp = - \sum_s \log_2 \ P(f_s|e_s)$$
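
A minimal sketch of this score, assuming $\epsilon = 1$ and a translation table stored as a dict keyed by `(french_word, english_word)`. For Model 1 the sum over alignments collapses into a product of per-word sums, $P(f|e) = \frac{\epsilon}{(n+1)^m} \prod_{j=1}^m \sum_{i=0}^n t(f_j | e_i)$, which is what the code uses. The function names and toy table are hypothetical.

```python
import math

def log2_sentence_prob(fs, es, t, epsilon=1.0):
    """log2 P(f | e) under IBM Model 1, summed over all alignments."""
    es = ["NULL"] + es
    logp = math.log2(epsilon) - len(fs) * math.log2(len(es))
    for f in fs:
        # Missing table entries are treated as probability 0 (assumption).
        logp += math.log2(sum(t.get((f, e), 0.0) for e in es))
    return logp

def log2_perplexity(corpus, t):
    """Negative sum of sentence log-probabilities over the corpus."""
    return -sum(log2_sentence_prob(fs, es, t) for fs, es in corpus)

# Toy table (made-up values) and a one-pair corpus.
t = {("la", "the"): 1.0, ("maison", "house"): 1.0}
corpus = [(["la", "maison"], ["the", "house"])]
score = log2_perplexity(corpus, t)
print(score)  # 2 * log2(3), approx. 3.17
```

A lower value means the model assigns higher probability to the observed sentence pairs.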

Limitations of IBM models:
* Word alignments allow many-to-one: several words $f_j$ can align to the same $e_i$
* ... But not one-to-many or many-to-many, since each $f_j$ aligns to exactly one $e_i$!

### Phrase Based Model

Unit of translation: Whole phrase
* Allows many-to-many modelling, carries more context, improves with more data
* State of the art before Deep Learning Models


### BLEU - Translation Evaluation
Evaluation of translation is a hard, subjective problem - very human.

Idea: Simple precision score ~ fraction of predicted n-grams that also occur in the reference

$$P_n = \frac{\text{#correct n-grams}}{\text{#total n-grams}} $$

$$BLEU = \text{min} \ (1, \frac{\text{len(output)}}{\text{len(reference)}}) \cdot (\prod_{n=1}^4 P_n)^{1/4}$$

#### Example
Reference text:
* `Israeli officials are responsible for airport security`

Model A:
* `Israeli officials responsibility of airport safety`
* 1-gram: 3/6
* 2-gram: 1/5
* 3-gram: 0/4
* 4-gram: 0/3
* Brevity: 6/7

Model B:
* `airport security Israeli officials are responsible`
* 1-gram: 6/6
* 2-gram: 4/5
* 3-gram: 2/4
* 4-gram: 1/3
* Brevity: 6/7

$$BLEU_A = min(1, \frac{6}{7}) \cdot (\frac{3}{6} \frac{1}{5} \frac{0}{4} \frac{0}{3})^{1/4} = 0$$

$$BLEU_B = min(1, \frac{6}{7}) \cdot (\frac{6}{6} \frac{4}{5} \frac{2}{4} \frac{1}{3})^{1/4} = 0.518$$

So, Model B performs vastly better than A wrt. BLEU, even though its word ordering is very Yoda-like.
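
The two scores can be reproduced with a short sketch of the simplified BLEU formula used in this section (single reference, whitespace tokenization, output at least 4 tokens long). Matches are clipped by the reference count, which is an assumption that makes no difference here because no n-gram repeats. The helper names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(output, reference, max_n=4):
    """Simplified single-reference BLEU as defined in this section."""
    out, ref = output.split(), reference.split()
    score = 1.0
    for n in range(1, max_n + 1):
        out_counts = Counter(ngrams(out, n))
        ref_counts = Counter(ngrams(ref, n))
        # Count an output n-gram as correct at most as often as it
        # appears in the reference (clipping).
        correct = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        total = sum(out_counts.values())
        score *= correct / total
    brevity = min(1.0, len(out) / len(ref))
    return brevity * score ** (1.0 / max_n)

reference = "Israeli officials are responsible for airport security"
bleu_a = bleu("Israeli officials responsibility of airport safety", reference)
bleu_b = bleu("airport security Israeli officials are responsible", reference)
print(round(bleu_a, 3))  # 0.0
print(round(bleu_b, 3))  # 0.518
```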