# Draft

# Statistical Machine Translation Using IBM Translation Models

An ongoing project to translate English to French.

## Introduction

The IBM translation models are a family of statistical machine translation algorithms that date to the late 1980s. The models were created as part of IBM's "Candide" project, which aimed to automatically translate French to English. As an exercise in theory, I decided to try implementing the models myself. The goal of this implementation is efficiency: I want to build a learning algorithm that can be trained even on home computers.


The IBM models work by estimating the conditional probability that a given sentence in one language is the translation of a given sentence in the other. Learning a distribution like this is difficult because training data is generally sparse. An algorithm is unlikely to see the same string translated two different ways, and unless one vectorizes sentences very cleverly, there is little relationship between the Euclidean distance of two string-vectors and their actual semantic similarity.

Clever vectorization was not the strategy used by the Candide team. Rather, they chose to make strict assumptions about the range of distributions the translation model would consider. With these restrictions in place, they then performed Maximum Likelihood Estimation to choose the likeliest such distribution.

A consequence of this approach is that, under the restrictions imposed by the Candide team, even the likeliest distribution tended to assign high conditional probabilities to ill-formed sentences. To resolve this, the Candide team adopted a noisy channel approach to translation. That is, rather than directly estimate the probability

$$\mathbb{P}(\mathbf{e}|\mathbf{f})$$

that an English string $\mathbf{e}$ is the translation of a French string $\mathbf{f}$, they chose to estimate the equivalent expression given by Bayes' rule,

$$\frac{\mathbb{P}(\mathbf{e})\mathbb{P}(\mathbf{f}|\mathbf{e})}{\mathbb{P}(\mathbf{f})}.$$

Since the denominator $\mathbb{P}(\mathbf{f})$ does not depend on $\mathbf{e}$, one translates a French sentence $\mathbf{f}$ to the English sentence $\mathbf{e}$ that maximizes the product

$$\mathbb{P}(\mathbf{e})\mathbb{P}(\mathbf{f}|\mathbf{e}).$$

The Candide team playfully described their reasoning as follows:
>A string of English words, $\mathbf{e}$, can be translated into a string of French words in many different ways. Often, knowing the broader context in which $\mathbf{e}$ occurs may serve to winnow the field of acceptable French translations, but even so, many acceptable translations will remain; the choice among them is largely a matter of taste. In statistical translation, we take the view that every French string, $\mathbf{f}$, is a possible translation of $\mathbf{e}$. We assign to every pair of strings $(\mathbf{e}, \mathbf{f})$ a number $\mathbb{P}(\mathbf{f}|\mathbf{e})$, which we interpret as the probability that a translator, when presented with $\mathbf{e}$, will produce $\mathbf{f}$ as his translation. We further take the view that when a native speaker of French produces a string of French words, he has actually conceived of a string of English words, which he translated mentally. Given a French string $\mathbf{f}$, the job of our translation system is to find the string $\mathbf{e}$ that the native speaker had in mind when he produced $\mathbf{f}$. We minimize our chance of error by choosing that string $\hat{\mathbf{e}}$ for which $\mathbb{P}(\mathbf{e} | \mathbf{f})$ is greatest. (Brown et al. 1993)

Practically, this approach shifts the emphasis on producing well-formed English strings to the distribution $\mathbb{P}(\mathbf{e})$, which is estimated by a "language model." A good language model assigns very low probabilities to ill-formed strings. The distribution $\mathbb{P}(\mathbf{f} | \mathbf{e})$ is estimated by a "translation model", of which the IBM models are an example.
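To make the division of labor concrete, here is a minimal sketch of noisy channel decoding over a small, fixed list of candidate English strings. The function name `noisy_channel_decode`, the dictionary-based models, and all of the probabilities are made up for illustration; they are not taken from the notebook.

```python
import math

def noisy_channel_decode(f, candidates, language_model, translation_model):
    """Return the candidate e that maximizes P(e) * P(f | e).

    language_model maps e -> P(e); translation_model maps (f, e) -> P(f | e).
    Both are hypothetical lookup tables. Scores are compared in log space.
    """
    def score(e):
        p_e = language_model.get(e, 0.0)
        p_f_given_e = translation_model.get((f, e), 0.0)
        if p_e == 0.0 or p_f_given_e == 0.0:
            return -math.inf  # impossible under at least one model
        return math.log(p_e) + math.log(p_f_given_e)

    return max(candidates, key=score)

# Toy numbers: the translation model slightly prefers the ill-formed string,
# but the language model assigns it a very low probability.
language_model = {"the house": 0.02, "house the": 0.0001}
translation_model = {
    ("la maison", "the house"): 0.3,
    ("la maison", "house the"): 0.4,
}

best = noisy_channel_decode(
    "la maison", ["the house", "house the"], language_model, translation_model
)
print(best)
```

In this toy case the language model overrules the translation model's preference for the ill-formed candidate, which is the effect the noisy channel formulation is meant to produce.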

## IBM Model 1

My implementation of IBM's simplest translation model is contained in the notebook "IBM Model 1.ipynb". Model 1 assumes that the probability $\mathbb{P}(\mathbf{f} | \mathbf{e})$ is a conditional distribution of the form

$$\mathbb{P}(\mathbf{f} | \mathbf{e}) = \epsilon(m|l)\prod_{f \in \mathcal{F}} \left(\sum_{e \in \mathcal{E}} c(e, \mathbf{e})t(f|e)\right)^{c(f, \mathbf{f})},$$

where
- $\mathcal{E}$ is the set of all English words,
- $\mathcal{F}$ is the set of all French words, 
- the function $c(w, \mathbf{w})$ counts how many times a word $w$ shows up in a string $\mathbf{w}$,
- the conditional distribution $t(f|e)$ estimates the probability that a French word $f$ is the translation of an English word $e$,
- the conditional distribution $\epsilon(m|l)$ estimates the probability that an English string of length $l$ translates to a French string of length $m$.
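As a rough illustration of how the formula above can be evaluated, the sketch below computes $\mathbb{P}(\mathbf{f} | \mathbf{e})$ for tokenized strings. The function name `model1_probability`, the dictionary layout for $t$ and $\epsilon$, and the toy probabilities are assumptions made for this example, not the notebook's actual interface.

```python
from collections import Counter

def model1_probability(e_words, f_words, t, epsilon):
    """Evaluate epsilon(m|l) * prod_f (sum_e c(e, e) * t(f|e)) ** c(f, f).

    t maps (f, e) -> t(f|e); epsilon maps (m, l) -> epsilon(m|l).
    Missing entries are treated as probability zero (an assumption).
    """
    l, m = len(e_words), len(f_words)
    e_counts = Counter(e_words)  # c(e, e) for each English word in the string
    f_counts = Counter(f_words)  # c(f, f) for each French word in the string
    prob = epsilon.get((m, l), 0.0)
    # French words absent from f have exponent 0 and contribute a factor of 1,
    # so it is enough to loop over the words that actually occur in f.
    for f, c_f in f_counts.items():
        inner = sum(c_e * t.get((f, e), 0.0) for e, c_e in e_counts.items())
        prob *= inner ** c_f
    return prob

# Toy parameters, made up for illustration.
t = {
    ("la", "the"): 0.5, ("la", "house"): 0.1,
    ("maison", "the"): 0.05, ("maison", "house"): 0.8,
}
epsilon = {(2, 2): 0.6}

p = model1_probability(["the", "house"], ["la", "maison"], t, epsilon)
print(p)
```

Here the inner sums are $0.5 + 0.1 = 0.6$ for "la" and $0.05 + 0.8 = 0.85$ for "maison", so the result is $0.6 \times 0.6 \times 0.85 = 0.306$ up to floating-point rounding.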

This model can be justified by the following assumptions. The probability that an English string $\mathbf{e}$ translates to a French string containing a French word $f$ is a conditional distribution of the form

$$\mathbb{P}(f | \mathbf{e}) = \sum_{e \in \mathcal{E}} c(e, \mathbf{e})t(f|e).$$

We imagine that translation of $\mathbf{e}$ into $\mathbf{f}$ is a sequence of independent trials. First, a length $m$ is chosen according to the conditional distribution $\epsilon(m|l)$. Next, $m$ French words are independently drawn from the conditional distribution $\mathbb{P}(f | \mathbf{e})$.
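The two-step generative story can be sketched as a small sampler. Everything here is hypothetical: the name `sample_translation`, the dictionary layout for $t$ and $\epsilon$, and the toy numbers. Note that `random.choices` accepts unnormalized weights, so the word weights $\sum_e c(e, \mathbf{e})t(f|e)$ are passed in as they are.

```python
import random
from collections import Counter

def sample_translation(e_words, t, epsilon, french_vocab, rng):
    """Draw a French string for e_words following the two-step story.

    t maps (f, e) -> t(f|e); epsilon maps (m, l) -> epsilon(m|l).
    Assumes epsilon has at least one entry for l = len(e_words).
    """
    l = len(e_words)
    # Step 1: choose a length m according to epsilon(. | l).
    lengths = [m for (m, l_key) in epsilon if l_key == l]
    length_weights = [epsilon[(m, l)] for m in lengths]
    m = rng.choices(lengths, weights=length_weights, k=1)[0]
    # Step 2: draw m French words independently, each weighted by
    # sum_e c(e, e) * t(f|e).
    e_counts = Counter(e_words)
    word_weights = [
        sum(c_e * t.get((f, e), 0.0) for e, c_e in e_counts.items())
        for f in french_vocab
    ]
    return rng.choices(french_vocab, weights=word_weights, k=m)

# Toy parameters, made up for illustration.
french_vocab = ["la", "maison"]
t = {
    ("la", "the"): 0.9, ("maison", "the"): 0.1,
    ("la", "house"): 0.2, ("maison", "house"): 0.8,
}
epsilon = {(1, 2): 0.2, (2, 2): 0.6, (3, 2): 0.2}

rng = random.Random(0)  # seeded so the draw is repeatable
sample = sample_translation(["the", "house"], t, epsilon, french_vocab, rng)
print(sample)
```

Because the words are drawn independently, the sampler has no notion of word order, which is one reason the translation model on its own assigns high probability to ill-formed strings.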

Every distribution $\mathbb{P}(\mathbf{f}|\mathbf{e})$ of this form is uniquely determined by the distributions $\epsilon$ and $t$. Our choice of $\epsilon$ and $t$ is determined by a sample $S$ of $n$ translated string pairs $(\mathbf{e}, \mathbf{f})$. Given such a sample, we choose $\epsilon$ and $t$ to maximize the average log-likelihood, which is the negative of the cross-entropy between the empirical distribution of $S$ and the model:

$$\frac{1}{n} \sum_{(\mathbf{e}, \mathbf{f}) \in S}\ln(\mathbb{P}(\mathbf{f}|\mathbf{e})) = \frac{1}{n} \sum_{(\mathbf{e}, \mathbf{f}) \in S} \left( \ln(\epsilon(m|l)) + \sum_{f \in \mathcal{F}}c(f, \mathbf{f})\ln \left( \sum_{e \in \mathcal{E}} c(e, \mathbf{e})t(f|e) \right)\right) .$$
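The right-hand side of this objective can be computed directly from counts. The sketch below is illustrative only: `average_log_likelihood`, the dictionary layout, and the toy parameters are assumptions, and it presumes every probability it looks up is strictly positive.

```python
import math
from collections import Counter

def average_log_likelihood(sample, t, epsilon):
    """Average of ln P(f | e) over the sample, using the expanded form.

    sample is a list of (e_words, f_words) pairs; t maps (f, e) -> t(f|e);
    epsilon maps (m, l) -> epsilon(m|l). Assumes all needed entries exist
    and are positive, since math.log fails on zero.
    """
    total = 0.0
    for e_words, f_words in sample:
        l, m = len(e_words), len(f_words)
        e_counts = Counter(e_words)
        total += math.log(epsilon[(m, l)])
        for f, c_f in Counter(f_words).items():
            inner = sum(c_e * t.get((f, e), 0.0) for e, c_e in e_counts.items())
            total += c_f * math.log(inner)
    return total / len(sample)

# Toy parameters and sample, made up for illustration.
t = {
    ("la", "the"): 0.5, ("la", "house"): 0.1,
    ("maison", "the"): 0.05, ("maison", "house"): 0.8,
}
epsilon = {(2, 2): 0.6, (1, 1): 0.9}
sample = [
    (["the", "house"], ["la", "maison"]),
    (["the"], ["la"]),
]

objective = average_log_likelihood(sample, t, epsilon)
print(objective)
```

Summing logarithms rather than multiplying probabilities keeps the computation numerically stable on long strings, where the raw product would underflow.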

For those unfamiliar with cross-entropy, this may seem an odd choice of gain function. We justify it as follows. We can define a distribution $\hat{\mu}$ over English strings with the formula

$$\hat{\mu}(\mathbf{e}) = \frac{c(\mathbf{e}, S)}{n},$$

where $c(\mathbf{e}, S)$ is the number of times the string $\mathbf{e}$ appears in the sample $S$. We will call $\hat{\mu}$ the empirical marginal distribution of our sample $S$. Similarly, we can define a conditional distribution $K$ from the set of all English strings to the set of all French strings by

$$ K(\mathbf{f} | \mathbf{e}) = \frac{c((\mathbf{e}, \mathbf{f}), S)}{c(\mathbf{e}, S)},$$ 

where $c((\mathbf{e}, \mathbf{f}), S)$ is the number of times the pair $(\mathbf{e}, \mathbf{f})$ appears in the sample $S$. We will call $K$ the empirical kernel of $S$.
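Both empirical quantities are simple ratios of counts, as the following sketch shows. The helper name `empirical_distributions` and the toy sample are made up for illustration; strings are represented as plain Python `str` values so they can be used as dictionary keys.

```python
from collections import Counter

def empirical_distributions(sample):
    """Return (mu_hat, kernel) for a list of (e, f) string pairs.

    mu_hat maps e -> c(e, S) / n.
    kernel maps (e, f) -> c((e, f), S) / c(e, S).
    """
    n = len(sample)
    e_counts = Counter(e for e, _ in sample)  # c(e, S)
    pair_counts = Counter(sample)             # c((e, f), S)
    mu_hat = {e: c / n for e, c in e_counts.items()}
    kernel = {(e, f): c / e_counts[e] for (e, f), c in pair_counts.items()}
    return mu_hat, kernel

# Toy sample of translated pairs, made up for illustration.
sample = [
    ("the house", "la maison"),
    ("the house", "la maison"),
    ("the house", "le logis"),
    ("the cat", "le chat"),
]

mu_hat, K = empirical_distributions(sample)
print(mu_hat)
print(K)
```

For this sample, $\hat{\mu}$ assigns $3/4$ to "the house" and $1/4$ to "the cat", while $K$ splits "the house" between "la maison" ($2/3$) and "le logis" ($1/3$).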

Given two distributions $\mu$, $\nu$ over some set, with $\mu$ absolutely continuous with respect to $\nu$, the Kullback-Leibler divergence $D(\mu || \nu)$ is an asymmetric measure of distance between the two distributions. It is given by
