# Neural IBM1


Consider IBM1's graphical model below. You will be replacing the standard parameterisation, which uses tabular CPDs, with deterministic parametric functions.

![ibm1](img/ibm1.png)

**Variables:**

* \\(F\\) is a Categorical random variable taking values from the French vocabulary \\(V_F\\)
* \\(E\\) is a Categorical random variable taking values from the English vocabulary \\(V_E\\) (extended by a NULL token)
* \\(A\\) is a Categorical random variable taking values in the set \\(\{0, \ldots, m\}\\); we call this variable *alignment* as it selects a mixture component (English type) in the English sentence that is used to generate a French word
* \\(\theta\\) is a set of deterministic parameter assignments
* throughout, we assume \\(m\\) (the length of the English sentence) to be random, but observed

We can now write the joint distribution in terms of the conditional probability distributions (CPDs) in this directed graphical model

<span style="color:red">***[1] Complete this***</span>

\begin{align}
P_\theta(f_1^n, a_1^n|e_0^m) &= \prod_{j=1}^n P_\theta(f_j, a_j|e_0^m) \\
 &= \ldots
\end{align}

## Parameterisation

Throughout, we will assume \\(P(A|m)\\) to always distribute uniformly over \\(m+1\\) events.
In this project we will concentrate on the lexical distribution (but you can probably imagine how to extend the argument).

IBM1 is parameterised by tabular CPDs, that is, tables of independent (up to a normalisation) probability values, where we have one value for each condition-outcome pair.

**Tabular CPD:**

\begin{align}
P(F|E=e) &= \mathrm{Cat}(\theta_e) \\
 &\quad \text{where } 0 \le \theta_{f|e}\le 1 \\
 &\quad \text{ and } \sum_{f \in V_F} \theta_{f|e} = 1
\end{align}


* one parameter \\(\theta_{f|e}\\) per lexical event
* parameters are stored in a table
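
As a minimal sketch of what "one parameter per lexical event, stored in a table" can look like in code, the snippet below keeps one value per (English word, French word) pair in a nested dictionary and normalises each row so that it satisfies the constraints above. The toy vocabulary, the scores, and the helper name `normalise_rows` are assumptions made for illustration only.

```python
def normalise_rows(scores):
    """Turn non-negative scores into a tabular CPD: one Categorical per English word."""
    table = {}
    for e, row in scores.items():
        total = sum(row.values())
        table[e] = {f: value / total for f, value in row.items()}
    return table

# Hypothetical unnormalised scores for theta_{f|e}; any non-negative numbers would do.
scores = {
    "NULL": {"le": 1.0, "chien": 1.0, "noir": 1.0},
    "dog": {"le": 1.0, "chien": 8.0, "noir": 1.0},
    "black": {"le": 1.0, "chien": 1.0, "noir": 6.0},
}
theta = normalise_rows(scores)
print(theta["dog"]["chien"])  # P(F=chien | E=dog)
```

Note that the table needs one entry per pair, so its size grows with the product of the two vocabulary sizes.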

But nothing prevents us from using other parameterisations. For example, a feed-forward network would allow some parameters to be shared across events.

**Feed-forward neural network:**

\begin{equation}
    P(F|E=e) = \mathrm{Cat}(t_\theta(e))
\end{equation}

where
* \\(t_\theta(e) = \mathrm{softmax}(W_t h_E(e) + b_t)\\)
    * note that the softmax is necessary to make \\(t_\theta\\) produce valid parameters for the categorical distribution
    * \\(W_t \in \mathbb R^{|V_F| \times d_h}\\) and \\(b_t \in \mathbb R^{|V_F|}\\) 
* \\(h_E(e)\\) is defined below with \\(W_{h_E} \in \mathbb R^{d_h \times d_e}\\) and \\(b_{h_E} \in \mathbb R^{d_h}\\)
\begin{equation}
h_E(e) = \underbrace{\tanh(\underbrace{W_{h_E} r_E(e) + b_{h_E}}_{\text{affine}})}_{\text{elementwise nonlinearity}}
\end{equation}
* \\(r_E(e) = W_{r_E} v_E(e)\\) is a word embedding of \\(e\\) with \\(W_{r_E} \in \mathbb R^{d_e \times |V_E|}\\) 
* \\(v_E(e) \in \{0,1\}^{|V_E|} \\) is a one-hot encoding of \\(e\\), thus \\(\sum_i v_E(e)_i = 1\\) 
* \\(\theta = \{W_t, b_t, W_{h_E}, b_{h_E}, W_{r_E}\}\\)
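
The forward computation defined above can be sketched in plain Python as follows. This is only an illustration under assumed toy dimensions and random weights; the dictionary keys and helper names (`matvec`, `random_matrix`) are not part of the provided implementation.

```python
import math
import random

def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def t_theta(e, params):
    """Categorical parameters over V_F for the English word with id e."""
    # r_E(e) = W_{r_E} v_E(e): multiplying by a one-hot vector selects column e
    r = [row[e] for row in params["W_rE"]]
    # h_E(e) = tanh(W_{h_E} r_E(e) + b_{h_E})
    h = [math.tanh(a + b) for a, b in zip(matvec(params["W_hE"], r), params["b_hE"])]
    # t_theta(e) = softmax(W_t h_E(e) + b_t)
    logits = [a + b for a, b in zip(matvec(params["W_t"], h), params["b_t"])]
    return softmax(logits)

rng = random.Random(0)
size_VE, size_VF, d_e, d_h = 5, 7, 4, 3  # toy sizes, assumed for illustration
params = {
    "W_rE": random_matrix(d_e, size_VE, rng),   # d_e x |V_E|
    "W_hE": random_matrix(d_h, d_e, rng),       # d_h x d_e
    "b_hE": [0.0] * d_h,
    "W_t": random_matrix(size_VF, d_h, rng),    # |V_F| x d_h
    "b_t": [0.0] * size_VF,
}
probs = t_theta(2, params)
print(len(probs), sum(probs))
```

The one-hot product is implemented as a column lookup, which is how embedding layers are usually realised in practice.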



Other architectures are also possible: one can use different parameterisations that use more or fewer parameters. For example, with a CNN one could make this function sensitive to the characters in the words; something along these lines could also be done with RNNs. We will use FFNNs in this project.


**Remark on notation**

In answering the questions below, use notation similar to the above. Also adopt the following conventions:

* \\(v_F(f)\\) for one-hot encoding of \\(f\\)
* \\(\circ\\) for vector concatenation
* \\(r_F(f) = W_{r_F} v_F(f)\\) with \\(W_{r_F} \in \mathbb R^{d_f \times |V_F|}\\) for \\(f\\)'s word embedding

## MLE

We can use maximum likelihood estimation (MLE) to choose the parameters of our deterministic function \\(t_\theta\\). We know at least one general optimisation algorithm for this, namely *gradient ascent*. This is an iterative gradient-based procedure that moves \\(\theta\\) along the gradient of our objective until that gradient is (approximately) zero. Even though gradient ascent is only guaranteed to reach a global optimum for convex problems, we often apply it to nonconvex ones. IBM1 would be convex with standard tabular CPDs, but FFNNs with one nonlinear hidden layer (or more) make it nonconvex.

Nowadays, we have tools that can perform automatic differentiation for us. Thus, for as long as our functions are differentiable, we can get gradients for them rather easily. This is convenient because we get to abstract away from most of the analytical work.

Still, some analytical work is on us when working with latent variable models. For example, we still need to be able to express the functional form of the likelihood.

Let us then express the log-likelihood (which is the objective we maximise in MLE) of a single sentence pair as a function of our free parameters (it generalises trivially to multiple sentence pairs).

<span style="color:red">***[2.1] Complete this***</span>

\begin{align}
    \mathcal L(\theta|e_0^m, f_1^n) &= \log P_\theta(F_1^n=f_1^n|E_0^m = e_0^m) \\
    &= \ldots
\end{align}


Note that in fact our log-likelihood is a sum of independent terms \\(\mathcal L_j(\theta|e_0^m, f_j)\\), thus we can characterise the contribution of each French word in each sentence pair as

<span style="color:red">***[2.2] Complete this***</span>

\begin{align}
\mathcal L_j(\theta|e_0^m, f_j) &= \log P_\theta(F=f_j|E_0^m = e_0^m) \\
 &= \ldots 
\end{align}



Neural network toolkits usually implement several flavours of gradient-based optimisation for us. But, they are mostly designed as *minimisation* (rather than *maximisation*) algorithms. Thus, we have to work with the idea of a *loss*.

To get a loss, we simply negate our objective. 

You will find a lot of material that mentions some *categorical cross-entropy loss*. 

\begin{align}
    l(\theta) &= -\sum_{(e_0^m, f_1^n)} p_\star(f_1^n|e_0^m) \log P_\theta(f_1^n|e_0^m) \\
    &\approx -\frac{1}{S} \sum_{(e_0^m, f_1^n) \in \mathcal D} \log P_\theta(f_1^n|e_0^m)
\end{align}

But see that this is just the negative log-likelihood of our data (scaled by \\(1/S\\)), assuming that the \\(S\\) observations were independently sampled from the true data generating process \\(p_\star\\).
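
To make the connection concrete, here is a small sketch that computes this loss as an average of negative log-probabilities over a set of sentence pairs. The probabilities are made-up numbers standing in for \\(P_\theta(f_1^n|e_0^m)\\), and the function name is hypothetical.

```python
import math

def mean_negative_log_likelihood(sentence_probs):
    """Average of -log P_theta(f_1^n | e_0^m) over S observed sentence pairs."""
    S = len(sentence_probs)
    return -sum(math.log(p) for p in sentence_probs) / S

# Hypothetical model probabilities for S = 3 observed sentence pairs.
sentence_probs = [0.5, 0.25, 0.125]
loss = mean_negative_log_likelihood(sentence_probs)
print(loss)
```

Minimising this quantity is the same as maximising the log-likelihood, since the two differ only by sign and a constant factor.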

As discussed above, due to the assumptions in our graphical model, this loss factors over individual French positions.




<span style="color:red">***[2.3] Complete this***</span>

\begin{align}
    l(\theta|\mathcal D) &= -\frac{1}{S} \sum_{(e_0^m, f_1^n) \in \mathcal D} \sum_{j=1}^n \mathcal L_j(\theta|e_0^m, f_j) \\
    &= \ldots
\end{align}

Here \\(\mathcal D\\) is our dataset of \\(S\\) sentence pairs.

### SGD

SGD is really quite simple: we sample a subset \\(\mathcal S\\) of the training data and compute a loss for that sample. We then use automatic differentiation to obtain a gradient \\(\nabla_\theta l(\theta|\mathcal S)\\). This gradient is used to update our deterministic parameters \\(\theta\\).


\begin{align}
\theta^{(t+1)} &= \theta^{(t)} - \delta_t \nabla_{\theta^{(t)}} l(\theta^{(t)}|\mathcal S)
\end{align}

The key here is to have a learning rate schedule that complies with a [Robbins and Monro](https://www.jstor.org/stable/2236626) sequence (check [this](http://cilvr.cs.nyu.edu/diglib/lsml/bottou-sgd-tricks-2012.pdf) for practical notes).
Stochastic optimisers are very well studied. Neural network toolkits implement several *well defined* optimisers for you, so using a valid learning rate sequence should not represent much work.
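
The update rule above can be illustrated on a toy problem. The sketch below minimises a squared-error loss over a single scalar parameter, with the gradient written by hand in place of automatic differentiation. The data, the batch size, and the fixed step size are all assumptions chosen for illustration; in particular, the step size is deliberately kept constant here rather than following any particular schedule.

```python
import random

def minibatch_gradient(theta, batch):
    """Gradient of l(theta|S) = mean((theta - x)^2) over the sampled subset S."""
    return sum(2.0 * (theta - x) for x in batch) / len(batch)

rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(1000)]  # toy "training set"
theta = 0.0
delta = 0.05  # fixed step size, used only to keep the example short

for t in range(500):
    batch = rng.sample(data, 16)             # sample a subset S of the training data
    grad = minibatch_gradient(theta, batch)  # stands in for automatic differentiation
    theta = theta - delta * grad             # theta^(t+1) = theta^(t) - delta_t * gradient

print(theta)  # ends up near the mean of `data`, the minimiser of this toy loss
```

Because each gradient is computed on a random subset, the iterates keep fluctuating around the optimum when the step size is held fixed.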


<span style="color:red">***[3] Complete this***</span>

If \\(t\\) tracks the number of updates, and \\(\delta_t\\) is the learning rate for update \\(t\\), provide one series that complies with a Robbins and Monro sequence.

\begin{align}
    \delta_{t} &= \ldots
\end{align}

### Batching (and notes on terminology)

In neural network terminology \\(f_1, \ldots, f_n\\) is a sample and \\(j\\) is a timestep. A collection of samples is a batch. We often work with collections of samples that are much smaller than the full dataset. 

A note on implementation: 
* most toolkits deal with static computational graphs, thus we typically have to batch sequences of fixed length (which may require some padding);
* padding essentially means that sometimes we will be performing useless computations associated with samples that are not really there, in which case we will be masking (setting to 0) their contributions to the loss so as to avoid learning from them.
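
A minimal sketch of padding and masking, assuming word ids are integers and that id 0 is reserved for padding (both are assumptions for this example, as are the helper names):

```python
PAD = 0  # hypothetical id reserved for padding

def pad_batch(sequences, pad_id=PAD):
    """Pad variable-length sequences of word ids to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # mask is 1.0 at real positions and 0.0 at padded positions
    mask = [[1.0] * len(seq) + [0.0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

def masked_loss(per_position_loss, mask):
    """Sum the loss over real positions only; padded positions are multiplied by 0."""
    total = 0.0
    for losses, weights in zip(per_position_loss, mask):
        total += sum(l * w for l, w in zip(losses, weights))
    return total

batch = [[5, 8, 2], [7, 3], [4]]
padded, mask = pad_batch(batch)
# Hypothetical per-position losses, including junk values at padded positions.
per_position_loss = [[0.5, 1.0, 0.25], [2.0, 1.0, 9.9], [0.75, 9.9, 9.9]]
print(padded)
print(masked_loss(per_position_loss, mask))
```

The junk values at padded positions never reach the total, so they also contribute nothing to the gradient.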

We are providing a TensorFlow implementation of this basic neural extension to IBM1.
Your task will be to extend it further.

# Extensions

From here we will discuss a few extensions which you will experiment with.


## Neural IBM1 (with additional French context)


Consider the following extension:

![ibm1](img/ibm1prev.png)


Now that we can use FFNNs to deterministically predict the parameters of our categorical distributions, we can afford conditioning on more events! This model for example generates a French word at position \\(j\\) by conditioning on two events:
1. the English word that occupies the position it aligns to
2. and the French word in the previous position

Let us start by writing the joint distribution, but let us show it for a single French word position (since we can generalise for a whole sentence trivially):

\begin{align}
P_\theta(f_j|e_0^m, f_{j-1}) &= \sum_{a_j=0}^m P(f_j, a_j|e_0^m, f_{j-1}) \\
 &= \sum_{a_j=0}^m P(a_j|m) P(f_j|e_{a_j}, f_{j-1})
\end{align}
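
The marginalisation over alignments in the last line can be sketched as a simple loop. The lookup table below is a made-up stand-in for \\(P(f_j|e_{a_j}, f_{j-1})\\); in the project this distribution would come from an FFNN, and all names here are hypothetical.

```python
def marginal_probability(f_j, f_prev, english, lexical):
    """Sum over a_j of P(a_j|m) * P(f_j | e_{a_j}, f_prev), with uniform P(a_j|m)."""
    m_plus_1 = len(english)  # `english` includes NULL at position 0
    total = 0.0
    for e in english:  # one term per alignment a_j = 0..m
        total += lexical(f_j, e, f_prev) / m_plus_1
    return total

# Hypothetical values of P(f | e, f_prev) for a two-word French vocabulary.
toy_table = {
    ("NULL", "le"): {"chien": 0.2, "chat": 0.8},
    ("the", "le"): {"chien": 0.5, "chat": 0.5},
    ("dog", "le"): {"chien": 0.9, "chat": 0.1},
}

def lexical(f, e, f_prev):
    return toy_table[(e, f_prev)][f]

english = ["NULL", "the", "dog"]
p_chien = marginal_probability("chien", "le", english, lexical)
p_chat = marginal_probability("chat", "le", english, lexical)
print(p_chien, p_chat)
```

Since every component is a proper distribution and the alignment prior is uniform, the marginal is itself a proper distribution over French words.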

Using tabular CPDs, this would be difficult to model due to over-parameterisation.

<span style="color:red">***[4.1] Complete this***</span>

If \\(|V_F|\\) is the size of the French vocabulary, and \\(|V_E|\\) is the size of the English vocabulary (already counting NULL), then how many free parameters are necessary to model \\(P(F|E, F_{prev})\\) with tabular CPDs?

\begin{align}
\ldots
\end{align}

In this project, we are going to use FFNNs and make 

\begin{equation}
P(F|E=e, F_{prev}=f) = \mathrm{Cat}(t_\theta(e, f))
\end{equation}



### Concatenation

In a first variant, let's use both observations by **concatenating** their word embeddings.
Call \\(e\\) the English word we are aligning to, and call \\(f\\) the previous French word, then

1. embed \\(e\\) into \\(r_E(e)\\)
2. embed \\(f\\) into \\(r_F(f)\\)
3. concatenate the word embeddings: \\(r_E(e) \circ r_F(f)\\) 
4. pass the concatenated embedding through an affine transformation and an elementwise nonlinearity (e.g. tanh)
5. predict categorical parameters (i.e. affine transformation followed by softmax)


<span style="color:red">***[4.2] Specify r(e,f) according to the recipe above***</span>


* \\(t_\theta(e, f) = \mathrm{softmax}(W_t r(e, f) + b_t)\\)
    * \\(W_t \in \mathbb R^{|V_F| \times d}\\) and \\(b_t \in \mathbb R^{|V_F|}\\) 
    * ...
* \\(\theta = \{W_t, b_t, \ldots\}\\)


### Gate



In a second variant, let's use both words in the context by summing nonlinear transformations of their word embeddings scaled by a **gate value** (a scalar between 0 and 1). Call \\(e\\) the English word we are aligning to, and call \\(f\\) the previous French word, then

1. embed \\(e\\) into \\(r_E(e)\\)
2. embed \\(f\\) into \\(r_F(f)\\)
3. as a function of the embedding of the previous f, compute a gate value \\(0 \le s \le 1\\)
4. compute a nonlinear transformation of the embedding of e
5. compute a nonlinear transformation of the embedding of the previous f
6. combine both representations with a weighted sum, where the representation of the previous f gets weighted by the gate value, and the representation of e is weighted by 1 minus the gate value
7. from the resulting vector, predict the parameters of the Categorical (that is, affine transformation followed by softmax)

<span style="color:red">***[4.3] Specify r(e,f) according to the recipe above***</span>

* \\(t_\theta(e, f) = \mathrm{softmax}(W_t r(e, f) + b_t)\\)
    * \\(W_t \in \mathbb R^{|V_F| \times d}\\) and \\(b_t \in \mathbb R^{|V_F|}\\) 
    * ...
* \\(\theta = \{W_t, b_t, \ldots\}\\)



<span style="color:red">***[4.4] Complete this***</span>

Discuss the differences between the two parameterisations above and the role of a gate value.

# Neural IBM1 with collocations

Consider the following extension:

![ibm1](img/ibm1c.png)

where we have introduced a binary latent variable \\(c\\) which decides between English components and French components. That is, when \\(c=0\\) we generate a French word by *translating* an English word, and when \\(c=1\\) we generate a French word by *inserting* it from monolingual (French) context.

**Note**:
* In comparison to the standard IBM1, French words are now themselves components, and they become available as we progress generating the French string from left-to-right.
* In comparison to the previous extension (IBM1 with monolingual context), we incorporate a different type of inductive bias as we give the model the power to explicitly choose between English and French components.
* Because we have an explicit latent treatment of this collocation variable, the model will reason with all of its possible assignments. That is, we will effectively marginalise over all options when computing the likelihood of observations (just like we did for alignment).


This is the marginal likelihood  (for a single French word position):

\begin{align}
P_\theta(f_j|e_0^m, f_{j-1}) &= \sum_{a_j=0}^m P(a_j|m) \left( \sum_{c_j=0}^1 P(c_j|f_{j-1})P(f_j|e_{a_j}, f_{j-1}, c_j) \right)\\
 &= \sum_{a_j=0}^m P(a_j|m) \left(P(C_j=0|F_{j-1}=f_{j-1})P(F_j=f_j|E=e_{a_j}) + P(C_j=1|F_{j-1}=f_{j-1})P(F_j=f_j|F_{j-1}=f_{j-1})\right) \\
 &= \sum_{a_j=0}^m P(a_j|m) \left((1 - s_j)\times P(f_j|e_{a_j}) + s_j \times P(f_j|f_{j-1})\right)
\end{align}

where \\(s_j = P(C_j=1|F_{j-1}=f_{j-1})\\).
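
As a rough illustration of the last line of the derivation, the sketch below mixes a translation distribution and an insertion distribution with weight \\(s_j\\) and then averages over alignments. All tables and the value of `s` are made-up numbers, and the function name is hypothetical.

```python
def collocation_marginal(f_j, f_prev, english, s, translate, insert):
    """sum_a P(a|m) * ((1 - s) * P(f_j|e_a) + s * P(f_j|f_prev)), with uniform P(a|m)."""
    total = 0.0
    for e in english:  # `english` includes NULL
        mixture = (1.0 - s) * translate[e][f_j] + s * insert[f_prev][f_j]
        total += mixture / len(english)
    return total

# Hypothetical CPDs over a two-word French vocabulary.
translate = {
    "NULL": {"chien": 0.5, "noir": 0.5},
    "dog": {"chien": 0.9, "noir": 0.1},
}
insert = {"le": {"chien": 0.6, "noir": 0.4}}
s = 0.25  # assumed value of P(C=1 | F_prev = "le")

english = ["NULL", "dog"]
p_chien = collocation_marginal("chien", "le", english, s, translate, insert)
p_noir = collocation_marginal("noir", "le", english, s, translate, insert)
print(p_chien, p_noir)
```

Marginalising over the binary variable costs only one extra term per alignment, which is why the discrete case stays tractable.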

Note that here we have 3 CPDs (ignoring the uniform alignment distribution).
1. \\(P(C|F_{\text{prev}})\\) is a distribution over component types (translation vs insertion)
2. \\(P(F|F_{\text{prev}})\\) is a distribution over French words inserted from context
3. \\(P(F|E)\\) is the usual lexical translation distribution

Now you should be able to use simple FFNNs to parameterise those distributions.


<span style="color:red">***[5.1] Complete this***</span>

\begin{equation}
P(C|F_{prev}=f) = \mathrm{Bernoulli}(s_\theta(f))
\end{equation}

* \\(s_\theta(f) = \mathrm{sigmoid}(\ldots) \\)
    * note that our FFNN will predict a single number which is the parameter of the [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli_distribution), this parameter should be a single number between 0 and 1, that's why we use a sigmoid
    * ...


\begin{equation}
P(F|F_{prev}=f) = \textrm{Cat}(i_\theta(f))
\end{equation}

* \\(i_\theta(f) = \mathrm{softmax}(\ldots) \\)
* ...

\begin{equation}
P(F|E=e) = \textrm{Cat}(t_\theta(e))
\end{equation}

* \\(t_\theta(e) = \mathrm{softmax}(\ldots) \\)
* ...

Parameters:
* \\(\theta = \{\ldots\}\\)

Standard NN models would use a *deterministic* gate value in place of this random collocation indicator. 

For a deterministic gate we do not have any marginalisation to compute. It also has a weaker inductive bias. However, at least in MT, there is a lot of empirical evidence that *soft* decisions (such as this blend of translation and insertion given by a gate value) perform better than *hard* decisions (such as the discrete decision of either inserting or translating). We will see a *stochastic* extension of the gate value pretty soon.

For now, can you comment on pros/cons of the stochastic view with discrete random variables:

<span style="color:red">***[5.2] Complete this***</span>

1. Pros:
2. Cons:

## Neural IBM1 with collocations: latent gate

Consider this last extension:

![ibm1](img/ibm1s.png)

Here we made the collocation variable continuous and call it \\(S\\). You can interpret \\(S\\) as a random variable over gate values, thus this model offers a stochastic treatment to the deterministic gates that you computed in (4.3).

The big difference is that, while \\(C\\) was Bernoulli-distributed, \\(S\\) is [Beta-distributed](https://en.wikipedia.org/wiki/Beta_distribution):

\begin{equation} 
P(S|F_{prev}=f) = \mathrm{Beta}(a_\theta(f), b_\theta(f))
\end{equation}

where we make the shape parameters deterministic functions of the previous French observation. Again, we could easily employ FFNNs for that.



We now have a big problem: our likelihood is no longer tractable!

\begin{align}
P(f_j|e_0^m, f_{j-1}) &= \sum_{a_j=0}^m P(a_j|m) \int P(s_j|f_{j-1}) P(f_j|e_{a_j}, f_{j-1}, s_j) \mathrm{d}s_j 
\end{align}

It involves marginalising over all possible latent gate values.
Before, this wasn't the case because we had a FFNN deterministically predict a single value for the gate. Now we want to reason with all possible values.

Because this is intractable, we will work with a [*variational auto-encoder*](https://arxiv.org/abs/1312.6114) (VAE).
As discussed in [class](https://uva-slpl.github.io/nlp2/resources/slides/vae.pdf), with a simpler variational approximation to the posterior, we can get an unbiased Monte Carlo (MC) estimate of a lower bound on the log-likelihood and of its gradient. 

We want the posterior over a Beta-distributed variable, so we will use a Beta posterior approximation and make a mean-field assumption (that is, we will approximate the posterior locally to each French position):

\begin{align}
    q_\phi(s_j|f_{j-1}) &= \mathrm{Beta}(a_\phi(f_{j-1}), b_\phi(f_{j-1}))
\end{align}

In class, we saw a VAE that employed a Gaussian random variable, for which we derived a change of variable (*reparameterisation trick*). 
That trick is quite different for Beta distributions and requires a bit more maths (because Beta is not location-scale), thus here we use some simplifications inspired by [Nalisnick and Smyth, 2017 (section 3.2.1)](https://arxiv.org/pdf/1605.06197.pdf):

* first, we fix the first shape parameter to 1, effectively working with \\(q_\phi(s_j|f_{j-1}) = \mathrm{Beta}(1, b_\phi(f_{j-1}))\\)
* then, we sample the second shape parameter by sampling \\(\epsilon \sim \mathcal N(0, 1)\\) and then approximating the (intractable) CDF of the Gaussian using a [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function)

In sum, our reparameterisation looks like the following

1. \\(\epsilon \sim \mathcal N(0, 1) \\)
2. \\(s = \mathrm{sigmoid}(\mu_\phi(f) + \sigma_\phi(f) \times \epsilon)\\)

where \\(\mu_\phi(f)\\) and \\(\sigma_\phi(f)\\) are deterministic functions of the previous French word. We employ FFNNs with parameters \\(\phi\\) to predict the mean and standard deviation of the Gaussian approximation.
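
The two sampling steps can be sketched as follows. The values of `mu` and `sigma` are placeholders for the outputs of \\(\mu_\phi(f)\\) and \\(\sigma_\phi(f)\\), which are not specified here, and the function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_gate(mu, sigma, rng):
    """Draw s = sigmoid(mu + sigma * eps) with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)          # step 1: parameter-free noise
    return sigmoid(mu + sigma * eps)   # step 2: deterministic transformation

rng = random.Random(0)
mu, sigma = 0.5, 1.0  # hypothetical outputs of mu_phi(f) and sigma_phi(f)
samples = [sample_gate(mu, sigma, rng) for _ in range(1000)]
print(min(samples), max(samples))
```

All randomness sits in `eps`, so each sample is a differentiable function of `mu` and `sigma`; this is what lets gradients flow back to \\(\phi\\).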


<span style="color:red">***[6.1] Complete this***</span>

Specify \\(\mu_\phi(f)\\) and \\(\sigma_\phi(f)\\) using one tanh hidden layer for each

* \\(\mu_\phi(f) = W_{\mu}h_\mu(f) + b_\mu\\)
* ...

and

* \\(\sigma_\phi(f) = \exp(W_{\sigma}h_\sigma(f) + b_\sigma) \\)
    * note that standard deviations are always positive, that's why we exponentiate the affine transformation
* ...

and

* \\(\phi = \{W_{\mu}, b_{\mu}, W_{\sigma}, b_{\sigma}, \ldots \}\\)


<span style="color:red">***[6.2] Complete this***</span>

For a sampled \\(s\\), we can easily compute a distribution over French words \\(P(F|E=e, F_{prev}=f, S=s)\\) using a FFNN similar to the one in (4.3), but where instead of computing the gate value, we use \\(s\\) as the sampled gate value.

That is, \\(P(F|E=e, F_{prev}=f, S=s) = \mathrm{Cat}(t_\theta(e, f, s))\\) where

* \\(t_\theta(e, f, s) = \mathrm{softmax}(\ldots)\\)
* ...


### ELBO

Now, we will employ gradient-based optimisation to obtain a local optimum of the *evidence lower-bound* (ELBO), which for a single French position is

\begin{align}
\mathcal E_j(\theta, \phi|e_0^m, f_j) 
 &= \mathbb E_{q_\phi(S_j|f_{j-1})}[\log P(f_j|e_0^m, f_{j-1}, S_j)] 
    - \mathrm{KL}(q_\phi(S_j) || p_\theta(S_j)) \\
 &= \mathbb E_{q_\phi(S_j|f_{j-1})}[\log P(f_j|e_0^m, f_{j-1}, S_j)] 
    - \mathrm{KL}(\mathrm{Beta}(1, b_\phi(f_{j-1})) || \mathrm{Beta}(a_0, b_0))
\end{align}

where \\(\mathrm{Beta}(a_0, b_0)\\) is a Beta prior with shape parameters \\(a_0\\) and \\(b_0\\).

The KL divergence between two Beta distributions can be computed in [closed form](https://en.wikipedia.org/wiki/Beta_distribution)

\begin{align}
\mathrm{KL}(\mathrm{Beta}(a_1, b_1) || \mathrm{Beta}(a_0, b_0)) 
 &= \log \left(\frac{B(a_0, b_0)}{B(a_1, b_1)}\right) + (a_1 - a_0)\psi(a_1) + (b_1 - b_0)\psi(b_1) + (a_0 - a_1 + b_0 - b_1)\psi(a_1+b_1)
\end{align}

where

* \\(B(a,b)\\) is the [Beta function](https://en.wikipedia.org/wiki/Beta_function)
* \\(\psi(a)\\) is the [digamma function](https://en.wikipedia.org/wiki/Digamma_function)
* and \\(\log\\) is the natural logarithm
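
The closed-form expression above can be evaluated with the standard library alone, as in the sketch below. Python's `math` module has `lgamma` but no digamma, so the `digamma` helper here is a hand-written approximation (recurrence plus asymptotic series) for positive arguments; it is an assumption of this example, and a toolkit's built-in function would normally be used instead.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0: recurrence psi(x) = psi(x+1) - 1/x, then a series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # asymptotic series: log x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def log_beta(a, b):
    """log B(a, b) computed from log-gamma functions."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def kl_beta(a1, b1, a0, b0):
    """KL(Beta(a1, b1) || Beta(a0, b0)), term by term as in the formula above."""
    return (log_beta(a0, b0) - log_beta(a1, b1)
            + (a1 - a0) * digamma(a1)
            + (b1 - b0) * digamma(b1)
            + (a0 - a1 + b0 - b1) * digamma(a1 + b1))

# Against a uniform Beta(1, 1), Beta(1, 2) gives log(2) - 1/2 analytically.
print(kl_beta(1.0, 2.0, 1.0, 1.0))
```

A quick sanity check is that the divergence vanishes when both distributions share the same shape parameters.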




<span style="color:red">***[6.3] Complete this***</span>

Due to mean-field assumptions, the ELBO factors as independent contributions from French positions.

\begin{align}
\mathcal E(\theta, \phi|e_0^m, f_1^n) &= \ldots
\end{align}

<span style="color:red">***[6.4] Complete this***</span>

Neural network toolkits implement minimisation algorithms, so what is the loss for a single French word now?

\begin{align}
l_j(\theta, \phi|e_0^m, f_j) &= \ldots
\end{align}

and for a complete sentence?

\begin{align}
l(\theta, \phi|e_0^m, f_1^n) &= \ldots
\end{align}

and for the batch?

\begin{align}
l(\theta, \phi|\mathcal S) &= \ldots
\end{align}



