# Neural IBM1


Consider IBM1's graphical model below. You will replace the standard parameterisation, which uses tabular CPDs, with deterministic parametric functions.

![ibm1](img/ibm1.png)

**Variables:**

* \\(F\\) is a Categorical random variable taking values from the French vocabulary
* \\(E\\) is a Categorical random variable taking values from the English vocabulary (extended by a NULL token)
* \\(A\\) is a Categorical random variable taking values in the set \\(\{0, \ldots, m\}\\); we call this variable the *alignment*, as it selects the mixture component (an English type) in the English sentence that is used to generate a French word
* \\(\theta\\) is a set of deterministic parameter assignments
* throughout, we assume \\(m\\) (the length of the English sentence) to be random, but observed

We can now write the joint distribution in terms of the conditional probability distributions (CPDs) in this directed graphical model

<span style="color:red">***Complete this***</span>

\begin{align}
P_\theta(f_1^n, a_1^n|e_0^m) &= \prod_{j=1}^n P_\theta(f_j, a_j|e_0^m) \\
 &= \ldots
\end{align}

## Parameterisation

Throughout, we will assume \\(P(A|m)\\) to always distribute uniformly over \\(m+1\\) events.
In this project we will concentrate on the lexical distribution (but you can probably imagine how to extend the argument).

IBM1 is parameterised by tabular CPDs, that is, tables of probability values that are independent up to normalisation, with one value for each condition-outcome pair.

**Tabular CPD:**

\begin{align}
P(F|E=e) &= \mathrm{Cat}(\theta_e) \\
 &\quad \text{where } 0 \le \theta_{f|e}\le 1 \\
 &\quad \text{ and } \sum_{f \in V_F} \theta_{f|e} = 1
\end{align}


* one parameter \\(\theta_{f|e}\\) per lexical event
* parameters are stored in a table

But nothing prevents us from using other parameterisations. For example, a feed-forward network would allow some parameters to be shared across events.

**Feed-forward neural network:**
* \\(P(F|E=e) = \mathrm{Cat}(f_\theta(e))\\) where 
    * \\(f_\theta(e) = \mathrm{softmax}(W_f r(e) + b_f)\\)
        * note that the softmax is necessary to make \\(f_\theta\\) produce valid parameters for the categorical distribution
        * \\(W_f \in \mathbb R^{V_F \times d}\\) and \\(b_f \in \mathbb R^{V_F}\\) 
        * \\(r(e) = W_r v(e)\\) with \\(W_r \in \mathbb R^{d \times V_E}\\) 
        * \\(v(e) \in \{0,1\}^{V_E} \\) is a one-hot encoding of \\(e\\), thus \\(\sum_i v(e)_i = 1\\) 
    * \\(\theta = \{W_f, b_f, W_r\}\\)
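
The parameterisation above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the toy sizes, the random initialisation and the helper names `softmax`, `init_params` and `f_theta` are assumptions made here, not part of the provided implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def init_params(V_E, V_F, d, rng):
    """theta = {W_f, b_f, W_r}, randomly initialised (assumed initialisation)."""
    return {
        "W_r": rng.normal(scale=0.1, size=(d, V_E)),  # embedding matrix
        "W_f": rng.normal(scale=0.1, size=(V_F, d)),  # output projection
        "b_f": np.zeros(V_F),                         # output bias
    }

def f_theta(e, params):
    """Categorical parameters of P(F | E=e) for an English word id e."""
    V_E = params["W_r"].shape[1]
    v = np.zeros(V_E)
    v[e] = 1.0                      # one-hot encoding v(e)
    r = params["W_r"] @ v           # r(e) = W_r v(e), i.e. column e of W_r
    return softmax(params["W_f"] @ r + params["b_f"])

rng = np.random.default_rng(0)
V_E, V_F, d = 7, 5, 3               # toy sizes, for illustration only
params = init_params(V_E, V_F, d, rng)
probs = f_theta(2, params)          # a valid distribution over V_F French words

n_neural = sum(p.size for p in params.values())  # d*V_E + V_F*d + V_F
n_tabular = V_E * V_F                            # one theta_{f|e} per pair
```

With these toy sizes the network is not smaller than the table; the saving only appears when \\(d\\) is much smaller than the vocabulary sizes, since the neural count grows with \\(d(V_E + V_F)\\) while the tabular count grows with \\(V_E V_F\\).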

Other architectures are also possible: different parameterisations may use more or fewer parameters. For example, with a CNN one could make this function sensitive to the characters in the words, and something along these lines could also be done with RNNs. We will use FFNNs in this project.

## MLE

We can use maximum likelihood estimation (MLE) to choose the parameters of our deterministic function \\(f_\theta\\). We know at least one general (convex) optimisation algorithm, namely *gradient ascent*. This is a gradient-based procedure that iteratively moves \\(\theta\\) in the direction of the gradient of our objective, until that gradient is (approximately) zero. Even though its guarantees of reaching a global optimum hold for convex problems, we often apply it to nonconvex problems. IBM1 would be convex with standard tabular CPDs, but FFNNs with 1 nonlinear hidden layer (or more) make it nonconvex.

Nowadays, we have tools that can perform automatic differentiation for us. Thus, for as long as our functions are differentiable, we can get gradients for them rather easily. This is convenient because we get to abstract away from most of the analytical work.

Still, some analytical work is on us when working with latent variable models. For example, we still need to be able to express the functional form of the likelihood.

Let us then express the log-likelihood (which is the objective we maximise in MLE) of a single sentence pair as a function of our free parameters (it generalises trivially to multiple sentence pairs).

<span style="color:red">***Complete this***</span>

\begin{align}
    \mathcal L(\theta|e_0^m, f_1^n) &= \log P_\theta(F_1^n=f_1^n|E_0^m = e_0^m) \\
    &= \ldots
\end{align}


Note that in fact our log-likelihood is a sum of independent terms \\(\mathcal L_j(\theta|e_0^m, f_j)\\), thus we can characterise the contribution of each French word in each sentence pair as

<span style="color:red">***Complete this***</span>

\begin{align}
\mathcal L_j(\theta|e_0^m, f_j) &= \log P_\theta(F=f_j|E_0^m = e_0^m) \\
 &= \ldots 
\end{align}



Neural network toolkits usually implement several flavours of gradient-based optimisation for us. However, they are mostly designed as *minimisation* (rather than *maximisation*) algorithms, so we have to work with the idea of a *loss*.

To get a loss, we simply negate our objective. 

You will find a lot of material that mentions some *categorical cross-entropy loss*. 

\begin{align}
    l(\theta) &= -\sum_{(e_0^m, f_1^n)} p_\star(f_1^n|e_0^m) \log P_\theta(f_1^n|e_0^m) \\
    &\approx -\frac{1}{S} \sum_{(e_0^m, f_1^n) \in \mathcal D} \log P_\theta(f_1^n|e_0^m)
\end{align}

But see that this is just the negative log-likelihood of our data (averaged over sentence pairs), assuming that the observations were independently sampled from the true data generating process \\(p_\star\\).

As discussed above, due to the assumptions in our graphical model, this loss factors over individual French positions.




<span style="color:red">***Complete this***</span>

\begin{align}
    l(\theta|\mathcal D) &= -\frac{1}{S} \sum_{(e_0^m, f_1^n) \in \mathcal D} \sum_{j=1}^n \mathcal L_j(\theta|e_0^m, f_j) \\
    &= \ldots
\end{align}

Here \\(\mathcal D\\) is our dataset of \\(S\\) sentence pairs.

### SGD

SGD is really quite simple: we sample a subset \\(\mathcal S\\) of the training data and compute a loss for that sample. We then use automatic differentiation to obtain a gradient \\(\nabla_\theta l(\theta|\mathcal S)\\). This gradient is used to update our deterministic parameters \\(\theta\\).


\begin{align}
\theta^{(t+1)} &= \theta^{(t)} - \delta_t \nabla_{\theta^{(t)}} l(\theta^{(t)}|\mathcal S)
\end{align}

The key here is to have a learning rate schedule that complies with a [Robbins and Monro](https://www.jstor.org/stable/2236626) sequence (check [this](http://cilvr.cs.nyu.edu/diglib/lsml/bottou-sgd-tricks-2012.pdf) for practical notes).
Stochastic optimisers are very well studied. Neural network toolkits implement several *well defined* optimisers for you, so using a valid learning rate sequence should not represent much work.
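
As a small illustration of the update rule, here is a sketch of minibatch SGD on a toy one-dimensional quadratic loss (not the IBM1 loss). The schedule \\(\delta_t = \delta_0/(t+1)\\) is one assumed example of a sequence whose sum diverges while the sum of its squares converges; the names `learning_rate` and `sgd` are hypothetical.

```python
import numpy as np

def learning_rate(t, delta_0=1.0):
    """delta_t = delta_0 / (t + 1): sum_t delta_t diverges, sum_t delta_t^2 converges."""
    return delta_0 / (t + 1)

def sgd(data, theta_0, n_steps, batch_size, rng):
    """Minimise l(theta) = mean_x 0.5 * (theta - x)^2 with minibatch SGD."""
    theta = theta_0
    for t in range(n_steps):
        batch = rng.choice(data, size=batch_size)  # sample a subset S of the data
        grad = theta - batch.mean()                # gradient of the loss on S
        theta = theta - learning_rate(t) * grad    # theta^(t+1) = theta^(t) - delta_t * grad
    return theta

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)   # toy dataset
theta_hat = sgd(data, theta_0=0.0, n_steps=500, batch_size=10, rng=rng)
```

For this toy loss the minimiser is the data mean, so `theta_hat` should end up close to `data.mean()`. In practice you would rely on the optimisers shipped with your toolkit rather than a hand-written loop.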

### Batching (and notes on terminology)

In neural network terminology \\(f_1, \ldots, f_n\\) is a sample and \\(j\\) is a timestep. A collection of samples is a batch. We often work with collections of samples that are much smaller than the full dataset. 

A note on implementation: 
* most toolkits deal with static computational graphs, thus we typically have to batch sequences of fixed length (which may require some padding);
* padding essentially means that sometimes we will be performing useless computations associated with positions that are not really there, in which case we mask their contributions to the loss (set them to 0) so as to avoid learning from them.
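
Padding and masking can be sketched as follows. The per-position log-probabilities are made-up numbers standing in for whatever the model computes, and `pad_batch` and `masked_loss` are hypothetical helper names, not functions from the provided implementation.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Pad variable-length id sequences to a common length; return ids and a 0/1 mask."""
    max_len = max(len(s) for s in sequences)
    ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(sequences), max_len))
    for i, s in enumerate(sequences):
        ids[i, :len(s)] = s
        mask[i, :len(s)] = 1.0       # 1 for real positions, 0 for padding
    return ids, mask

def masked_loss(log_probs, mask):
    """Negative log-likelihood per sentence, averaged over the batch.

    Padded positions are multiplied by 0 and therefore contribute nothing.
    """
    per_sentence = (log_probs * mask).sum(axis=1)
    return -per_sentence.mean()

french = [[4, 2, 7], [5, 1]]         # two toy French sentences as word ids
ids, mask = pad_batch(french)

# Made-up per-position log-probabilities, shape [batch, time];
# the last entry of the second row belongs to a padded position.
log_probs = np.log(np.array([[0.5, 0.25, 0.5],
                             [0.5, 0.5, 0.9]]))
loss = masked_loss(log_probs, mask)
```

Because the mask multiplies the padded entry by zero, the value 0.9 in the second row has no effect on `loss` or on its gradient.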

We are providing a TensorFlow implementation of this basic neural extension to IBM1.
Your task will be to extend it further.

# Extensions

From here we will discuss a few extensions which you will experiment with.


## Neural IBM1 (with monolingual context)


Consider the following extension:

![ibm1](img/ibm1prev.png)


Now that we can use FFNNs to deterministically predict the parameters of our categorical distributions, we can afford conditioning on more events! This model for example generates a French word at position \\(j\\) by conditioning on two events:
1. the English word that occupies the position it aligns to
2. and the French word in the previous position

Let us start by writing the marginal likelihood (the joint distribution summed over alignments), but let us show it for a single French word position (since we can generalise to a whole sentence trivially):

\begin{align}
P_\theta(f_j|e_0^m, f_{j-1}) &= \sum_{a_j=0}^m P(f_j, a_j|e_0^m, f_{j-1}) \\
 &= \sum_{a_j=0}^m P(a_j|m) P(f_j|e_{a_j}, f_{j-1})
\end{align}

Using tabular CPDs, this would be difficult to model: there would be \\(V_F \times V_E\\) CPDs, each defined over \\(V_F\\) events, and thus a lot of free parameters.

We are going to use FFNNs and make 

\begin{equation}
P(F|E=e, F_{\text{prev}}=f_{\text{prev}}) = \mathrm{Cat}(g_\theta(e, f_{\text{prev}}))
\end{equation}


<span style="color:red">***Complete this***</span>

Let's use both observations by concatenating their word embeddings.

1. embed e
2. embed the previous f
3. pass the concatenated embedding through affine transformation and nonlinearity
4. predict categorical parameters (i.e. affine transformation followed by softmax)


* \\(g_\theta(e, f) = \mathrm{softmax}(W_g r(e, f) + b_g)\\)
    * \\(W_g \in \mathbb R^{V_F \times d}\\) and \\(b_g \in \mathbb R^{V_F}\\) 
    * ...
* \\(\theta = \{W_g, b_g, \ldots\}\\)
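
The four steps listed above can be sketched in NumPy as follows. Everything beyond \\(W_g\\) and \\(b_g\\) is an assumption made for this sketch: the embedding tables `E_emb` and `F_emb`, the hidden layer `W_h`, `b_h`, the choice of `tanh` as nonlinearity and the toy sizes. It is one possible instance, not the intended solution.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def init_concat_params(V_E, V_F, d_emb, d_hid, rng, scale=0.1):
    """Hypothetical parameter set for the concatenation architecture."""
    return {
        "E_emb": rng.normal(scale=scale, size=(V_E, d_emb)),      # English embeddings
        "F_emb": rng.normal(scale=scale, size=(V_F, d_emb)),      # French embeddings
        "W_h": rng.normal(scale=scale, size=(d_hid, 2 * d_emb)),  # hidden layer
        "b_h": np.zeros(d_hid),
        "W_g": rng.normal(scale=scale, size=(V_F, d_hid)),        # output layer
        "b_g": np.zeros(V_F),
    }

def g_concat(e, f_prev, p):
    """Categorical parameters of P(F | E=e, F_prev=f_prev)."""
    x = np.concatenate([p["E_emb"][e], p["F_emb"][f_prev]])  # steps 1-2: embed and concatenate
    h = np.tanh(p["W_h"] @ x + p["b_h"])                     # step 3: affine + nonlinearity
    return softmax(p["W_g"] @ h + p["b_g"])                  # step 4: affine + softmax

rng = np.random.default_rng(0)
concat_params = init_concat_params(V_E=7, V_F=5, d_emb=4, d_hid=6, rng=rng)
concat_probs = g_concat(e=2, f_prev=1, p=concat_params)
```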


<span style="color:red">***Complete this***</span>

Let's use both observations by summing nonlinear transformations of their word embeddings scaled by a gate value (a scalar between 0 and 1).

1. embed e 
2. embed the previous f
3. compute a gate value (a number between 0 and 1) from the embedding of the previous word
4. compute a nonlinear transformation of the embedding of e
5. compute a nonlinear transformation of the embedding of the previous f
6. combine both representations with a weighted sum, where the representation of the previous f gets weighted by the gate value, and the representation of e is weighted by 1 minus the gate value
7. from the resulting vector, predict the parameters of the Categorical (that is, affine transformation followed by softmax)


* \\(g_\theta(e, f) = \mathrm{softmax}(W_g r(e, f) + b_g)\\)
    * \\(W_g \in \mathbb R^{V_F \times d}\\) and \\(b_g \in \mathbb R^{V_F}\\) 
    * ...
* \\(\theta = \{W_g, b_g, \ldots\}\\)
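
The gated combination described in the steps above could look like the following NumPy sketch. As before, all names other than \\(W_g\\) and \\(b_g\\) (the embedding tables, the gate parameters `w_s`, `b_s`, the two hidden layers) and the `tanh` nonlinearity are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gate_params(V_E, V_F, d, rng, scale=0.1):
    """Hypothetical parameter set for the gated architecture."""
    return {
        "E_emb": rng.normal(scale=scale, size=(V_E, d)),
        "F_emb": rng.normal(scale=scale, size=(V_F, d)),
        "w_s": rng.normal(scale=scale, size=d), "b_s": 0.0,           # gate
        "W_e": rng.normal(scale=scale, size=(d, d)), "b_e": np.zeros(d),
        "W_p": rng.normal(scale=scale, size=(d, d)), "b_p": np.zeros(d),
        "W_g": rng.normal(scale=scale, size=(V_F, d)), "b_g": np.zeros(V_F),
    }

def g_gate(e, f_prev, p):
    """Return the Categorical parameters of P(F | e, f_prev) and the gate value."""
    r_e = p["E_emb"][e]                            # 1. embed e
    r_f = p["F_emb"][f_prev]                       # 2. embed the previous f
    s = sigmoid(p["w_s"] @ r_f + p["b_s"])         # 3. scalar gate in (0, 1)
    h_e = np.tanh(p["W_e"] @ r_e + p["b_e"])       # 4. transform embedding of e
    h_f = np.tanh(p["W_p"] @ r_f + p["b_p"])       # 5. transform embedding of previous f
    h = s * h_f + (1.0 - s) * h_e                  # 6. gated weighted sum
    return softmax(p["W_g"] @ h + p["b_g"]), s     # 7. affine + softmax

rng = np.random.default_rng(0)
gate_params = init_gate_params(V_E=7, V_F=5, d=4, rng=rng)
gate_probs, gate = g_gate(e=2, f_prev=1, p=gate_params)
```

Note that the weighted sum requires both transformed representations to have the same dimensionality, which is why this sketch uses a single size `d` throughout.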



<span style="color:red">***Complete this***</span>

Discuss the differences between the two parameterisations above and the role of a gate value.

## Neural IBM1 with collocations

Consider the following extension:

![ibm1](img/ibm1c.png)

where we have introduced a binary latent variable \\(c\\) which decides between English components and French components. That is, when \\(c=0\\) we generate a French word by *translating* an English word, and when \\(c=1\\) we generate a French word by *inserting* it from monolingual (French) context.

**Note**:
* In comparison to the standard IBM1, French words are now themselves components, and they become available as we progress generating the French string from left-to-right.
* In comparison to the previous extension (IBM1 with monolingual context), we incorporate a different type of inductive bias as we give the model the power to explicitly choose between English and French components.
* Because we have an explicit latent treatment of this collocation variable, the model will reason with all of its possible assignments. That is, we will effectively marginalise over all options when computing the likelihood of observations (just like we did for alignment).


This is the marginal likelihood  (for a single French word position):

\begin{align}
P_\theta(f_j|e_0^m, f_{j-1}) &= \sum_{a_j=0}^m P(a_j|m) \left( \sum_{c_j=0}^1 P(c_j|f_{j-1})P(f_j|e_{a_j}, f_{j-1}, c_j) \right)\\
 &= \sum_{a_j=0}^m P(a_j|m) \left(P(C_j=0|F_{j-1}=f_{j-1})P(F=f_j|E=e_{a_j}) + P(C_j=1|F_{j-1}=f_{j-1})P(F=f_j|F_{j-1}=f_{j-1})\right)
\end{align}
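
To make the marginalisation concrete, here is a small numeric sketch of the second line of the equation above for one French position. The probabilities are made-up toy values and `marginal_collocation` is a hypothetical helper; in the model these quantities would come from the three FFNNs.

```python
import numpy as np

def marginal_collocation(p_insert, p_f_given_e, p_f_given_prev):
    """Marginal probability of f_j, summing out alignment a_j and collocation c_j.

    p_insert       : scalar, P(C_j=1 | f_{j-1})
    p_f_given_e    : array of length m+1, P(F=f_j | E=e_a) for a = 0..m
    p_f_given_prev : scalar, P(F=f_j | F_prev=f_{j-1})
    """
    p_f_given_e = np.asarray(p_f_given_e, dtype=float)
    n_components = p_f_given_e.shape[0]                  # m + 1 (includes NULL)
    p_align = np.full(n_components, 1.0 / n_components)  # uniform P(a_j | m)
    # Inner sum over c_j: translate with prob. 1 - p_insert, insert with prob. p_insert.
    per_alignment = (1.0 - p_insert) * p_f_given_e + p_insert * p_f_given_prev
    return float(np.sum(p_align * per_alignment))

# Toy example with m = 2 (three English positions including NULL).
value = marginal_collocation(p_insert=0.25,
                             p_f_given_e=[0.1, 0.4, 0.1],
                             p_f_given_prev=0.8)
```

Here each alignment contributes \\(0.75 \cdot P(f_j|e_{a_j}) + 0.25 \cdot 0.8\\), and the three contributions are averaged under the uniform alignment distribution, giving 0.35.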


Now you should be able to use simple FFNNs to parameterise the distributions
* \\(P(C|F_{\text{prev}})\\)
* \\(P(F|F_{\text{prev}})\\)
* \\(P(F|E)\\)


<span style="color:red">***Complete this***</span>

\begin{equation}
P(C|F_{\text{prev}}=f_{\text{prev}}) = \mathrm{Bernoulli}(\mathrm{sigmoid}(\ldots))
\end{equation}

* note that our FFNN predicts a single number, the parameter of the Bernoulli; this parameter must lie between 0 and 1, which is why we use a sigmoid

* ...

\begin{equation}
P(F|F_{\text{prev}}=f_{\text{prev}}) = \textrm{Cat}(\ldots)
\end{equation}

* ...

\begin{equation}
P(F|E=e) = \textrm{Cat}(\ldots)
\end{equation}

* ...

Standard NN models would use a *deterministic* gate value in place of this random collocation indicator. 

With a deterministic gate we do not have any marginalisation to compute. It also has a weaker inductive bias. However, at least in MT, there is a lot of empirical evidence that *soft* decisions (such as this blend of translation and insertion given by a gate value) perform better than *hard* decisions (such as the discrete decision of either inserting or translating). We will see a *stochastic* extension of the gate value pretty soon.

For now, can you comment on pros/cons of the stochastic view with discrete random variables:

<span style="color:red">***Complete this***</span>

1. Pros:
2. Cons:

## Bayesian IBM1 with collocations

Consider this last extension:

![ibm1](img/ibm1cz.png)

Here we have made the Bernoulli parameter \\(z\\) a random variable.

The big difference is that the CPD is a Bernoulli:

\begin{equation} 
P(C|F_{\text{prev}}=f_{\text{prev}}) = \mathrm{Bernoulli}(z_{f_{\text{prev}}})
\end{equation}

whose parameter is distributed by a [Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) with fixed parameters \\(a\\) and \\(b\\), that is

\begin{equation}
Z_{f_{\text{prev}}} \sim \mathrm{Beta}(a,b)
\end{equation}

We now have a big problem: our likelihood is no longer tractable!

\begin{align}
P(f_j|e_0^m, f_{j-1}) &= \sum_{a_j=0}^m P(a_j|m) \int \mathrm{Beta}(z_{f_{j-1}}|a,b) \sum_{c_j=0}^1 P(c_j|z_{f_{j-1}}) P(f_j|e_{a_j}, f_{j-1}, c_j) \mathrm{d}z_{f_{j-1}}
\end{align}

it involves marginalising over all possible Bernoulli parameters.
Before, this wasn't the case because we had a FFNN deterministically predict a single value for that parameter. Now we want to reason with all possible values of that parameter assuming a Beta prior.
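
To illustrate what "reasoning with all possible values" of the Bernoulli parameter means, the sketch below estimates the integral for a single position by naively sampling \\(z\\) from the Beta prior. This is not the VAE approach used in this project, only an assumed toy illustration with made-up probabilities and a hypothetical helper `mc_marginal`.

```python
import numpy as np

def mc_marginal(a, b, p_f_given_e, p_f_given_prev, n_samples, rng):
    """Naive Monte Carlo estimate of the marginal for one French position.

    z ~ Beta(a, b) plays the role of P(C_j=1 | z); alignments are uniform.
    """
    p_f_given_e = np.asarray(p_f_given_e, dtype=float)
    z = rng.beta(a, b, size=n_samples)       # samples of the Bernoulli parameter
    translate = p_f_given_e.mean()           # sum_a P(a|m) P(f_j | e_a), uniform P(a|m)
    per_sample = (1.0 - z) * translate + z * p_f_given_prev
    return per_sample.mean()                 # average over sampled values of z

rng = np.random.default_rng(0)
estimate = mc_marginal(a=2.0, b=2.0,
                       p_f_given_e=[0.1, 0.4, 0.1],
                       p_f_given_prev=0.8,
                       n_samples=100_000, rng=rng)
```

In this one-position toy case the integrand happens to be linear in \\(z\\), so the estimate can be checked against plugging in the prior mean \\(a/(a+b) = 0.5\\), which gives \\(0.5 \cdot 0.2 + 0.5 \cdot 0.8 = 0.5\\). That shortcut is not assumed to carry over to the full model.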

Because this is intractable, we will work with a *variational auto-encoder* (VAE).
As discussed in class, with a simpler variational approximation to the posterior, we can get an unbiased Monte Carlo (MC) estimate of a lower bound on the log-likelihood and of its gradient. 
In class, we saw a VAE that employed a Gaussian random variable, for which we derived a change of variable (*reparameterisation trick*). 
That trick is quite different for Beta distributions.

<span style="color:red">***Complete this***</span>