# Part 3: Sequence-to-sequence models

© Anatolii Stehnii, 2018

## Lecture 2: Neural attention

In [6]:
%%javascript
MathJax.Hub.Queue(
  ["resetEquationNumbers", MathJax.InputJax.TeX],
  ["PreProcess", MathJax.Hub],
  ["Reprocess", MathJax.Hub]
);

In [5]:
from IPython.display import HTML

def css_styling():
    # Load the notebook's custom stylesheet and return it for rendering.
    with open("../custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)

css_styling()

### Problems with sentence vector

Theoretically, a sufficiently large encoder-decoder model should be able to perform machine translation perfectly. However, to encode all words and their dependencies in arbitrary-length sentences, the thought vector would need an enormous number of dimensions. Such a model would require massive computational resources to train and to use, so this approach is impractical.

This problem can be solved with the attention technique ([Bahdanau et al., 2014](https://arxiv.org/abs/1409.0473)). Its basic idea is to replace the single vector representation of the input sentence with references to the representations of the individual words in it.

![Attention](attention.png)
*Attention matrix for English to French translation. Via [WildML](http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/)*

### Math

The matrix $\textbf{H}_\textbf{x} \in \mathbb{R}^{n\times m}$ ($n$ is the number of encoding steps, $m$ is the hidden vector dimensionality) is built by stacking, as rows, the hidden vectors $\textbf{h}_{x}^{(i)}$, $i \in 1..n$, produced at each step of the encoding process.

During decoding, each input vector $\textbf{w}_{y}^{(t)}$ is concatenated with a context vector $\boldsymbol{\phi}^{(t)}$:

\begin{equation}
\textbf{h}_y^{(t)} = rnn([\textbf{w}_y^{(t)}, \boldsymbol{\phi}^{(t)}], \textbf{h}_y^{(t-1)})
\end{equation}
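A minimal sketch of one such decoder step, assuming a plain tanh (Elman) cell as a stand-in for $rnn$. The cell type, the sizes and the weight names `W_in` and `W_rec` are illustrative assumptions, and the context vector is filled with placeholder values:

```python
import numpy as np

def rnn_step(x, h_prev, W_in, W_rec):
    """A plain tanh (Elman) RNN cell, used here as a stand-in for rnn(...)."""
    return np.tanh(W_in @ x + W_rec @ h_prev)

rng = np.random.default_rng(0)
emb_dim, m = 4, 3                          # assumed embedding and hidden sizes

w_y = rng.normal(size=emb_dim)             # decoder input vector w_y^(t)
phi = rng.normal(size=m)                   # context vector phi^(t), placeholder values
h_y_prev = np.zeros(m)                     # previous decoder state h_y^(t-1)

W_in = rng.normal(size=(m, emb_dim + m))   # hypothetical input weights
W_rec = rng.normal(size=(m, m))            # hypothetical recurrent weights

x = np.concatenate([w_y, phi])             # [w_y^(t), phi^(t)]
h_y = rnn_step(x, h_y_prev, W_in, W_rec)   # new decoder state h_y^(t)
```

The only change compared with a decoder without attention is the wider input: the cell sees the word vector and the context vector side by side.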

The context vector $\boldsymbol{\phi}^{(t)}$ is calculated as a weighted sum of all encoder representations:

\begin{equation}
\boldsymbol{\phi}^{(t)} = \textbf{H}_{\textbf{x}}^{T}\cdot\boldsymbol{\alpha}^{(t)}
\end{equation}
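As a small numeric illustration of this weighted sum, the sketch below uses made-up encoder states and made-up attention weights (both are assumptions chosen so the result is easy to check by hand):

```python
import numpy as np

# Toy encoder states: n = 3 encoding steps, m = 2 hidden dimensions (assumed values).
H_x = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])

# Hypothetical attention weights for one decoding step; they sum to 1.
alpha = np.array([0.5, 0.25, 0.25])

# Context vector: phi = H_x^T . alpha, a weighted sum of the rows of H_x.
phi = H_x.T @ alpha   # 0.5*[1, 0] + 0.25*[0, 1] + 0.25*[1, 1] = [0.75, 0.5]
```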

The attention weights $\boldsymbol{\alpha}^{(t)}$ can be calculated with an arbitrary attention score function (for example, a dot product) applied to every pair of the decoder vector $\textbf{h}_y^{(t-1)}$ and an encoder vector $\textbf{h}_x^{(i)}$, $i \in 1..n$.
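A sketch of the dot-product variant, assuming the encoder and decoder vectors have the same dimensionality and that the scores are normalised with a softmax. The function names and toy values are illustrative:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def dot_product_weights(H_x, h_y_prev):
    """Attention weights for one decoding step using dot-product scores."""
    scores = H_x @ h_y_prev      # one score per encoder step, shape (n,)
    return softmax(scores)       # non-negative weights that sum to 1

H_x = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])     # toy encoder states, one per row
h_y_prev = np.array([2.0, 0.0])  # toy decoder state h_y^(t-1)

alpha = dot_product_weights(H_x, h_y_prev)
```

Here the first and third encoder states get the same score (2) and the second gets 0, so the first and third positions receive equal, larger weights.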

The original work of Bahdanau et al. used a feed-forward network with one hidden layer:

\begin{equation}
\begin{gathered}
\hat{\boldsymbol{\alpha}}_i^{(t)} = \textbf{W}_{attn1} \cdot [\textbf{h}_x^{(i)}, \textbf{h}_y^{(t-1)}] \\
\tilde{\alpha}_i^{(t)} = \textbf{W}_{attn2} \cdot tanh(\hat{\boldsymbol{\alpha}}_{i}^{(t)}) \\
\boldsymbol{\alpha}^{(t)} = softmax(\tilde{\boldsymbol{\alpha}}^{(t)})
\end{gathered}
\end{equation}
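The scoring network above could be sketched in NumPy as follows. This is an illustration with random weights, not the reference implementation: the attention hidden size `d` and the variable names are assumptions, and bias terms are omitted, as in the equations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 4, 3, 5                        # encoder steps, hidden size, attention hidden size (assumed)

H_x = rng.normal(size=(n, m))            # encoder states h_x^(i), one per row
h_y_prev = rng.normal(size=m)            # decoder state h_y^(t-1)
W_attn1 = rng.normal(size=(d, 2 * m))    # first layer, applied to [h_x^(i), h_y^(t-1)]
W_attn2 = rng.normal(size=d)             # second layer, maps the hidden layer to a scalar

# Pair every encoder state with the same decoder state: shape (n, 2m).
pairs = np.concatenate([H_x, np.tile(h_y_prev, (n, 1))], axis=1)

hidden = pairs @ W_attn1.T               # first-layer outputs, shape (n, d)
scores = np.tanh(hidden) @ W_attn2       # unnormalised scores, shape (n,)

# Softmax over encoder positions gives the attention weights.
exp_scores = np.exp(scores - scores.max())
alpha = exp_scores / exp_scores.sum()

phi = H_x.T @ alpha                      # context vector, shape (m,)
```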

### Developing an idea

1. **Attention over history**
    
    Along with attention over encoder steps, you can use attention over previous decoder steps to recover possibly lost information.
    
1. **Attention over context**

    The previously mentioned [MQAN](https://einstein.ai/research/the-natural-language-decathlon) network uses, along with the source and target, a context: additional information that the network can use to find the answer to a question.
    
2. **Attention is all you need** - [Vaswani et al., 2017](http://papers.nips.cc/paper/7181-attention-is-all-you-need)
    
    Actually, when using attention over history, you can throw out all the RNN and LSTM stuff and just model any sequence with attention. In the paper above, this approach demonstrated better results on the machine translation task with lower training time.
    
3. **Show, attend and tell** - [Xu et al., 2015](http://arxiv.org/abs/1502.03044)

    Here attention over images is used to generate image description:
    
    ![attention image](attention-image.png)
    
4. **Attention over memory** 
    
    The hidden state of a recurrent neural network is a basic example of internal memory. LSTM developed this concept, allowing the network to store information about longer sequences. Soft attention allows a network to interact with even larger amounts of hierarchical data, as in [End-To-End memory networks](https://arxiv.org/abs/1503.08895).
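To make the idea of attention over history concrete, here is a simplified sketch in which every step of a sequence attends to itself and to earlier steps only. It uses scaled dot-product scores and deliberately leaves out the learned query, key and value projections and the multiple heads of the full model, so treat it as an illustration rather than the architecture from the paper:

```python
import numpy as np

def causal_self_attention(X):
    """Each row of X attends to itself and to earlier rows only."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)                       # pairwise scaled dot products, (n, n)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal: future steps
    scores = np.where(future, -np.inf, scores)          # future steps get zero weight after softmax
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))        # toy sequence: 5 steps, 4 features
out, weights = causal_self_attention(X)
```

The first step can only attend to itself, so its output equals its input; later steps mix information from everything before them.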

### Pointer networks

A neural network operates on vector representations of words that are selected from a predefined vocabulary. This creates the problem of unknown words, which have no vector representation. The problem is especially important for the translation task, where both the input and the output sequences can contain rare or special words and names.

Names of people or locations, however, should not be translated but copied into the target sequence. [Vinyals et al., 2015](http://papers.nips.cc/paper/5866-pointer-networks.pdf) proposed a solution to this problem: the **pointer network**. At each decoding step it calculates the probability that the next word is copied from the input sequence. The calculation of this probability is described step by step below.

Let $\textbf{h}_{y}^{(t)}$ denote the decoder output vector at decoding step $t$ and $\textbf{h}_{x}^{(i)}$ the encoder output vector at encoding step $i$. First, a hidden state of the pointer network is calculated:

\begin{equation}
\begin{gathered}
\textbf{h}_{x.pointer}^{(i)} = \textbf{W}_{x}^T \cdot \textbf{h}_x^{(i)}\\
\textbf{h}_{y.pointer}^{(t)} = \textbf{W}_{y}^T \cdot \textbf{h}_y^{(t)} \\
\textbf{h}_{pointer}^{(t, i)} = tanh(\textbf{h}_{x.pointer}^{(i)} + \textbf{h}_{y.pointer}^{(t)})\\
\end{gathered}
\end{equation}
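A sketch of this step with random weights. The pointer hidden size `k` and the choice to store the vectors as rows are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 4, 3, 5                  # encoder steps, hidden size, pointer hidden size (k is assumed)

H_x = rng.normal(size=(n, m))      # encoder outputs h_x^(i), one per row
h_y = rng.normal(size=m)           # decoder output h_y^(t)
W_x = rng.normal(size=(m, k))
W_y = rng.normal(size=(m, k))

H_x_pointer = H_x @ W_x            # row i is W_x^T . h_x^(i), shape (n, k)
h_y_pointer = W_y.T @ h_y          # shape (k,)

# Broadcasting adds the same decoder projection to every encoder row.
H_pointer = np.tanh(H_x_pointer + h_y_pointer)   # row i is h_pointer^(t, i)
```

The encoder projections do not depend on $t$, so in practice they could be computed once per input sentence and reused at every decoding step.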

This is performed for all $\textbf{h}_{x}^{(i)}, i \in 1..n$. The vectors $\textbf{h}_{pointer}^{(t, i)}$ are combined into a matrix $\textbf{H}_{pointer}^{(t)}$. This matrix is transformed into a vector of probabilities, one for each source word, of being copied from the source sequence into the target:

\begin{equation}
\begin{gathered}
\textbf{h}^{(t)}_{copy} = \textbf{W}_{pointer}^T \cdot \textbf{H}^{(t)}_{pointer} \\
\textbf{p}_{copy}^{(t)} = softmax(\textbf{h}^{(t)}_{copy})\\
\end{gathered}
\end{equation}
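The projection and softmax above can be sketched as follows. The pointer states here are random placeholders stored as columns, and the sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 5                                     # source length and pointer hidden size (assumed)

H_pointer = np.tanh(rng.normal(size=(k, n)))    # column i stands in for h_pointer^(t, i)
W_pointer = rng.normal(size=k)

h_copy = W_pointer @ H_pointer                  # W_pointer^T . H_pointer, one score per source token
exp_scores = np.exp(h_copy - h_copy.max())      # numerically stable softmax
p_copy = exp_scores / exp_scores.sum()

most_likely_position = int(np.argmax(p_copy))   # index of the source token most likely to be copied
```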

The vector $\textbf{p}_{copy}^{(t)}$ contains, for each input sequence token, the probability of being copied into the output sequence at step $t$. Decoder generation is therefore performed in two steps. First, the probability of generating versus copying is inferred from the hidden state $\textbf{h}_y^{(t)}$ using a matrix $\textbf{W}_{g.or.c} \in \mathbb{R}^{m\times2}$:

\begin{equation}
\textbf{p}_{g.or.c}^{(t)} = softmax(\textbf{W}_{g.or.c}^T \cdot \textbf{h}_y^{(t)})
\end{equation}
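A short sketch of this gate, with a random decoder state and random weights standing in for trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3                                    # hidden size (assumed)

h_y = rng.normal(size=m)                 # decoder output h_y^(t)
W_g_or_c = rng.normal(size=(m, 2))       # one column per action

logits = W_g_or_c.T @ h_y                # shape (2,)
exp_logits = np.exp(logits - logits.max())
p_g_or_c = exp_logits / exp_logits.sum() # two-way softmax

p_generate, p_copy_action = p_g_or_c     # probability of generating, probability of copying
```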

$\textbf{p}_{g.or.c}^{(t)}$ is a vector of two elements: the first is the probability of generating a word and the second is the probability of copying a word. The model's output is then a conditional probability that combines the two actions:

\begin{equation}
p(y_t|y_{<t}, \textbf{x}) = p(y_t=w_y|y_{<t},\textbf{x}) \cdot p_t(y_t=generate|y_{<t},\textbf{x}) + p(y_t=w_x|y_{<t},\textbf{x}) \cdot p_t(y_t=copy|y_{<t},\textbf{x})
\end{equation}
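A worked toy example of this mixture. All numbers, the three-word vocabulary and the two-token source sentence are made up for illustration; the source contains one out-of-vocabulary name that can only be produced by copying:

```python
from collections import defaultdict

# Hypothetical values for one decoding step t.
vocab = ["the", "cat", "sat"]
p_vocab = [0.5, 0.3, 0.2]               # p(y_t = w_y | ...): distribution over the vocabulary
source = ["Kyiv", "sat"]                # input tokens; "Kyiv" is out of vocabulary
p_copy = [0.9, 0.1]                     # p(y_t = w_x | ...): distribution over source positions
p_generate, p_copy_action = 0.4, 0.6    # the two elements of p_g.or.c

final = defaultdict(float)
for word, p in zip(vocab, p_vocab):
    final[word] += p_generate * p       # mass from generating the word
for word, p in zip(source, p_copy):
    final[word] += p_copy_action * p    # mass from copying the word

best = max(final, key=final.get)        # "Kyiv": 0.6 * 0.9 = 0.54
```

A word that is both in the vocabulary and in the source, such as "sat", collects probability from both terms: 0.4 · 0.2 + 0.6 · 0.1 = 0.14.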