# Part 3: Sequence-to-sequence models

© Anatolii Stehnii, 2018

## Lecture 1: Neural machine translation

In [4]:
%%javascript
MathJax.Hub.Queue(
  ["resetEquationNumbers", MathJax.InputJax.TeX],
  ["PreProcess", MathJax.Hub],
  ["Reprocess", MathJax.Hub]
);


In [6]:
from IPython.display import HTML

def css_styling():
    # Load the shared stylesheet and return it so the notebook renders it.
    with open("../custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)

css_styling()

### Historical overview

![X-ray glasses](https://cdn-images-1.medium.com/max/2000/1*JE5UQE0Jo5f7RM-TCkMpSg.png)
*Via [Medium](https://medium.com/beluga-team/a-brief-and-untold-history-of-machine-translation-ea7dc1aa1f5)*

**1954** – First demonstration of machine translation at IBM: 49 sentences translated from Russian to English.

**1956** – Dartmouth Conference: the term "artificial intelligence" is introduced, and getting machines to use language is among the problems the new field sets out to tackle.

![History](https://cdn-images-1.medium.com/max/2000/1*d-iF6wcVYCWFDLkghpJvkw.png)
*Via [Medium](https://medium.freecodecamp.org/a-history-of-machine-translation-from-the-cold-war-to-deep-learning-f1d335ce8b5)*

**1966** – The ALPAC report criticised machine translation efforts and recommended basic research in computational linguistics.

**1990-2015** – The era of statistical machine translation (SMT). Parallel bilingual corpora are used to build a statistical model of $p(e|f)$, the probability that sentence $e$ in the target language is a translation of sentence $f$ in the source language.

More details about the old times are in [A History of Machine Translation From the Cold War to Deep Learning](https://medium.freecodecamp.org/a-history-of-machine-translation-from-the-cold-war-to-deep-learning-f1d335ce8b5).

**Main problems of SMT**:
1. Different word meanings in different contexts (polysemy)
2. Reordering words in the target sentence
3. Transfer of syntactic structure
4. Rare words and named entities

**2014** – *[Cho et al.](https://arxiv.org/abs/1406.1078)* "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", one of the first encoder-decoder papers.

**2016** – Google launched [GNMT for 9 languages](https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html): an 8-layer encoder + an 8-layer decoder + attention.

### NMT task

Formulation of the machine translation problem:

Given a sentence in the source language $\textbf{x}$, the task is to infer the sentence in the target language $\textbf{y}$ that maximizes the conditional probability $p(\textbf{y}|\textbf{x})$:

\begin{equation}
\textbf{y} = \underset{\hat{\textbf{y}}}{\mathrm{argmax}} p(\hat{\textbf{y}}|\textbf{x})
\end{equation}

In NMT we approximate $p(\textbf{y}|\textbf{x})$ using a neural model with parameters $\theta$. To learn the values of $\theta$, we use a set of training examples consisting of pairs $(\textbf{y}^{(x)}, \textbf{x})$, where $\textbf{y}^{(x)}$ is the reference translation of $\textbf{x}$. Model parameters are learned by maximizing the conditional log-probability of the training set:

\begin{equation}
\theta = \underset{\theta}{\mathrm{argmax}} \sum_{\textbf{y}^{(x)}, \textbf{x}} \log p(\textbf{y}^{(x)}| \textbf{x}; \theta)
\end{equation}

Since sentences consist of words, we can factorize this probability into per-word probabilities. Let's denote the target sentence $\textbf{y}$ as a sequence of words $y_1, y_2, \ldots ,y_{t-1}, y_t$. Then, by the chain rule, the probability $p(\textbf{y}|\textbf{x})$ can be rewritten as:

\begin{equation}
p(\textbf{y}|\textbf{x}) = \prod_{i=1}^{t} p(y_i|y_{i-1}, y_{i-2}, \ldots, y_{2}, y_{1}, \textbf{x})
\end{equation}
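As a small worked illustration of this factorization, the sketch below multiplies per-word conditional probabilities for a four-word target sentence and checks that the product matches the exponent of the summed log-probabilities. The probability values are invented for illustration; they are not outputs of any model.

```python
import math

# Hypothetical values of p(y_i | y_<i, x) for a 4-word target sentence.
# These numbers are made up purely to illustrate the arithmetic.
step_probs = [0.5, 0.8, 0.9, 0.4]

# Chain rule: p(y | x) is the product of the per-step conditionals.
sentence_prob = math.prod(step_probs)

# The same quantity in log space: a sum instead of a product,
# which is what is used in practice to avoid numerical underflow.
sentence_log_prob = sum(math.log(p) for p in step_probs)

print(sentence_prob)
print(math.exp(sentence_log_prob))
```

Both printed values agree up to floating-point rounding, which is why the training objective can be written as a sum of per-word log-probabilities.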

### Neural model

We can approximate the probability $p(y_i|y_{i-1}, y_{i-2}, \ldots, y_{2}, y_{1}, \textbf{x})$ using a neural model. All we need is to represent the information about the source sentence $\textbf{x}$ and the previous words $y_{i-1}, y_{i-2}, \ldots, y_{2}, y_{1}$ as vectors. This can be done with a *sequence-to-sequence* model:

1. The source sentence $\textbf{x}$ is transformed into a vector $\textbf{h}_{\textbf{x}}$ by an RNN called the encoder.
2. Each word of the target sentence ($y_i$) is decoded by another RNN (the decoder). The decoder hidden state $\textbf{h}_{y}^{(i-1)}$ represents the information about the previous words $y_{i-1}, y_{i-2}, \ldots, y_{2}, y_{1}$.
3. There are two ways to connect the encoder and the decoder:
    * Thought vector – the last hidden state of the encoder $\textbf{h}_{x}^{(n)}$ is used as the initial hidden state of the decoder $\textbf{h}_{y}^{(0)}$.
    * Attention – the decoder RNN concatenates its input $\textbf{w}_{y}^{(i)}$ with an attention vector $\phi_{y}^{(i)}$, which is a weighted combination of the encoder hidden states $\textbf{H}_{\textbf{x}}$.
    
    *Attention will be described in detail in the next lecture.*
    
![Seq2seq with thought vector](Seq2seq thought vector.png)
*Seq2seq with thought vector*

![Seq2seq with attention](Seq2seq-attention.png)
*Seq2seq with attention*
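The thought-vector variant can be sketched in plain Python as follows. Everything here is an assumption made for illustration: a vanilla tanh RNN cell, random untrained weights, tiny dimensions, and random vectors standing in for word embeddings. Real systems typically use LSTM or GRU cells with learned parameters.

```python
import math
import random

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def rnn_step(w_in, w_rec, x, h):
    """One vanilla RNN step: h' = tanh(W_in x + W_rec h)."""
    a, b = matvec(w_in, x), matvec(w_rec, h)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
emb_dim, hid_dim = 3, 4  # hypothetical sizes

# The encoder and the decoder are two separate RNNs with their own weights.
enc_w_in = random_matrix(hid_dim, emb_dim, rng)
enc_w_rec = random_matrix(hid_dim, hid_dim, rng)
dec_w_in = random_matrix(hid_dim, emb_dim, rng)
dec_w_rec = random_matrix(hid_dim, hid_dim, rng)

def encode(source_embeddings):
    """Run the encoder over the source and return all hidden states H_x."""
    h = [0.0] * hid_dim
    states = []
    for x in source_embeddings:
        h = rnn_step(enc_w_in, enc_w_rec, x, h)
        states.append(h)
    return states

# A 5-word source sentence, represented by random stand-in embeddings.
source = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(5)]
encoder_states = encode(source)

# Thought vector: the last encoder state initialises the decoder state h_y^(0).
thought_vector = encoder_states[-1]
decoder_state = thought_vector

# One decoder step on the embedding of the previous target word (dummy vector).
prev_word_embedding = [0.1, -0.2, 0.3]
decoder_state = rnn_step(dec_w_in, dec_w_rec, prev_word_embedding, decoder_state)
print(len(encoder_states), len(thought_vector), len(decoder_state))
```

Note that the decoder only ever sees `thought_vector`, no matter how long the source is; the other entries of `encoder_states` are discarded in this variant.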

The thought vector approach is rarely used anymore: a fixed-size vector cannot hold all the information about a variable-length sequence, so quality drops as sentences get longer. Attention or its modifications are used in most modern Seq2seq networks.
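To make the idea of an attention vector as a combination of encoder states concrete, here is a minimal sketch. The scoring function (a dot product between the decoder state and each encoder state) is an assumption chosen for brevity; the lecture does not fix a particular scoring function, and the vectors are toy values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_vector(decoder_state, encoder_states):
    """Weighted sum of encoder states; weights come from dot-product scores."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[j] for w, h in zip(weights, encoder_states)) for j in range(dim)]
    return context, weights

# Toy encoder states H_x (3 source positions) and a toy decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
context, weights = attention_vector(decoder_state, encoder_states)

# The decoder input is the word embedding concatenated with the attention vector.
word_embedding = [0.5, -0.5, 0.25]
decoder_input = word_embedding + context
print(weights)
print(decoder_input)
```

Unlike the thought vector, the context is recomputed at every decoding step, so its content can follow the part of the source that matters for the current word.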

### Inferring the result

To get the distribution over possible words at decoding step $i$, transform the decoder output $\textbf{h}_{y}^{(i)}$ into a probability distribution over the vocabulary:

\begin{equation}
p(y_i|y_{<i}, \textbf{x}) = \mathrm{softmax}(\textbf{W}_{out}^T \cdot \textbf{h}_{y}^{(i)})
\end{equation}
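The projection can be illustrated with a tiny worked example. The vocabulary, the decoder output `h`, and the matrix `W_out` below are hypothetical values picked by hand; only the shape of the computation follows the equation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["<eos>", "the", "cat", "sat"]  # toy vocabulary
h = [0.2, -0.1, 0.4]                    # decoder output, hidden size 3

# W_out has shape (hidden, vocab), so the logits are W_out^T . h
W_out = [
    [0.1, 0.5, -0.3, 0.0],
    [0.2, -0.4, 0.1, 0.3],
    [-0.5, 0.3, 0.5, 0.1],
]
logits = [sum(W_out[k][v] * h[k] for k in range(len(h))) for v in range(len(vocab))]
probs = softmax(logits)

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

The result is a valid distribution (non-negative, summing to one) with one entry per vocabulary word, which is what the search strategies operate on.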

To infer a whole sentence from this distribution, you can choose among different strategies:

1. **Greedy search** – at each step, select the word with the highest probability.
2. **Random search** – at each step, sample a random word from the distribution.
3. **Beam search** – grow a tree of candidate translations, pruning it to the $n$ most probable options at each time step.

![Beam search](beam-search.png)
*Beam search example (via [Google AI Blog](https://ai.googleblog.com/2016/05/chat-smarter-with-allo.html))*
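The three strategies can be compared on a toy problem. In the sketch below, `next_word_probs` is a hypothetical stand-in for the decoder softmax, backed by a hand-written table; the table is constructed so that the greedy choice at the first step leads to a less probable sentence than the one beam search finds.

```python
import math
import random

EOS = "<eos>"

# Hand-written conditional distributions p(y_i | y_<i); purely illustrative.
TABLE = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.55, "dog": 0.45},
    ("a",): {"cat": 0.9, "dog": 0.1},
}

def next_word_probs(prefix):
    """Toy replacement for the decoder: unknown prefixes end the sentence."""
    return TABLE.get(tuple(prefix), {EOS: 1.0})

def greedy_search(max_len=10):
    prefix = []
    while len(prefix) < max_len:
        probs = next_word_probs(prefix)
        word = max(probs, key=probs.get)  # most probable word at this step
        if word == EOS:
            break
        prefix.append(word)
    return prefix

def random_search(rng, max_len=10):
    prefix = []
    while len(prefix) < max_len:
        probs = next_word_probs(prefix)
        word = rng.choices(list(probs), weights=list(probs.values()))[0]
        if word == EOS:
            break
        prefix.append(word)
    return prefix

def beam_search(beam_size, max_len=10):
    beams = [(0.0, [])]  # (log-probability, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for word, p in next_word_probs(prefix).items():
                new_score = score + math.log(p)
                if word == EOS:
                    finished.append((new_score, prefix))
                else:
                    candidates.append((new_score, prefix + [word]))
        if not candidates:
            break
        # Prune to the beam_size most probable partial sentences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[0])[1]

print(greedy_search())
print(beam_search(beam_size=2))
print(random_search(random.Random(0)))
```

With a beam of size 1 the procedure reduces to greedy search; a wider beam keeps the initially less likely "a" alive long enough to discover that "a cat" has a higher overall probability than "the cat".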

### Training

As mentioned before, the model is trained by maximizing the log-likelihood (equivalently, minimizing the negative log-likelihood loss), which is the sum of the log-probabilities of each word in a target training example:

\begin{equation}
\log p(\textbf{y}|\textbf{x}) = \sum_{i=1}^{t} \log p(y_i|y_{<i}, \textbf{x})
\end{equation}
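For one training example, the loss can be computed as in the sketch below. The per-step distributions in `predicted` are hypothetical numbers standing in for the decoder softmax outputs; the computation simply picks the probability assigned to each reference word and sums the logarithms.

```python
import math

vocab = ["<eos>", "the", "cat", "sat"]
target = ["the", "cat", "sat", "<eos>"]  # reference translation

# Hypothetical model outputs: one distribution over the vocabulary per step.
predicted = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.2, 0.1, 0.2, 0.5],
    [0.8, 0.1, 0.05, 0.05],
]

# log p(y | x) = sum_i log p(y_i | y_<i, x)
log_likelihood = sum(
    math.log(dist[vocab.index(word)]) for dist, word in zip(predicted, target)
)

# Training minimises the negative log-likelihood.
loss = -log_likelihood
print(loss)
```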

During inference, each decoder input is the decoder's previous output. To simplify training, you can instead feed the ground-truth target words as decoder inputs. This is called **teacher forcing**. However, this approach results in slightly different distributions during training and during inference. The **professor forcing** approach ([Goyal et al., 2016](http://papers.nips.cc/paper/6098-professor-forcing-a-new-algorithm-for-training-recurrent-networks)) adds an adversarial network (a discriminator) trained to distinguish the network's behaviour under **teacher forcing** from its behaviour during inference without ground truth; the main network is in turn trained to make the two indistinguishable.
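The mismatch between the two regimes can be seen on a toy example. The "model" below is a hypothetical lookup table that predicts the next word from the last input word only, with one deliberate mistake ("the" is followed by "dog" instead of "cat"); it is an illustration, not a trained network.

```python
BOS, EOS = "<bos>", "<eos>"

# Hypothetical "model": next word as a function of the last input word.
TOY_MODEL = {
    "<bos>": "the", "the": "dog", "cat": "sat", "sat": EOS,
    "dog": "barked", "barked": EOS,
}

def predict_next(last_word):
    return TOY_MODEL[last_word]

def decode_teacher_forced(target):
    """Training regime: decoder inputs are the ground truth shifted right."""
    inputs = [BOS] + target[:-1]
    return [predict_next(word) for word in inputs]

def decode_free_running(max_len=10):
    """Inference regime: each decoder input is the previous prediction."""
    outputs, last = [], BOS
    for _ in range(max_len):
        last = predict_next(last)
        outputs.append(last)
        if last == EOS:
            break
    return outputs

target = ["the", "cat", "sat", EOS]
print(decode_teacher_forced(target))
print(decode_free_running())
```

Under teacher forcing the single mistake stays local, because the next input is the correct word "cat". In free-running mode the wrong word is fed back, and the rest of the sentence drifts away from the reference.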

### Other applications of Seq2seq


1. **Text summarization**
    
    Seq2seq can be trained to produce short summaries of news or abstracts of scientific papers. However, the huge search spaces of both input and output require modifications of attention (for example, intra-temporal attention, paragraph vectors, etc.).
    
    GitHub repository with overview of text summarization approaches: https://github.com/icoxfog417/awesome-text-summarization

2. **Question answering, natural database interfaces**

3. **Text-to-code, code-to-text translation**
    
    Use code and its comments ([Barone and Sennrich 2017](https://arxiv.org/abs/1707.02275), [Yin and Neubig 2017](https://arxiv.org/abs/1704.01696)) to generate code from its comments and vice versa. Not practical, but quite fun.

### Metrics

1. **BLEU** (bilingual evaluation understudy) – the classical metric for MT evaluation. Can use multiple reference translations. Widely reported to correlate well with human evaluation.

2. **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation) – for text summarization.
    1. **CROGUE** – incorporates word embeddings to give credit to semantically similar but rephrased summaries (Zaytsev et al., 2018). C stands for "continuous".

3. **METEOR** (Metric for Evaluation of Translation with Explicit ORdering) – another metric for translations.
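To give a feel for how BLEU works, here is a simplified sentence-level sketch: clipped (modified) n-gram precision against multiple references, a geometric mean over n-gram orders, and a brevity penalty. It omits smoothing and corpus-level aggregation, so treat it as an illustration of the idea rather than a replacement for a standard evaluation tool.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each n-gram counts at most as often
    as it appears in the reference where it is most frequent."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n=4):
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
degenerate = "the the the the the the the".split()
print(modified_precision(degenerate, refs, 1))  # clipping penalises repetition

candidate = "the cat sat on the mat".split()
print(bleu(candidate, [refs[0]], max_n=2))
```

The clipping step is what keeps a degenerate output made of one repeated frequent word from scoring well, and the brevity penalty plays the role of recall by punishing candidates that are too short.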

### Seq2tree, Tree2tree

[Yin and Neubig 2017](https://arxiv.org/abs/1704.01696) used a Seq2tree architecture for the task of translating natural language to code. Syntactically correct code can be represented as an **abstract syntax tree**, so to generate a correct program the neural network can infer tree structures instead of sequences. To do this, tree generation is decomposed into a sequence of tree-growing and terminal commands. Additionally, the recurrent network takes as input the concatenation of the previous hidden state with the hidden state of the root node.
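The decomposition of a tree into a sequence of commands can be illustrated with Python's own `ast` module. The sketch below is a simplified assumption, not the grammar or action set from the paper: it walks the tree depth-first and emits a hypothetical `GROW` action for every node and a `TERMINAL` action for every leaf value.

```python
import ast

def tree_to_actions(node):
    """Depth-first linearisation of an AST into grow/terminal actions."""
    actions = [("GROW", type(node).__name__)]
    for _, value in ast.iter_fields(node):
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, ast.AST):
                # Inner node: recurse and append its action sub-sequence.
                actions.extend(tree_to_actions(child))
            elif child is not None:
                # Leaf value (identifier, literal): emit a terminal action.
                actions.append(("TERMINAL", repr(child)))
    return actions

tree = ast.parse("x = 1")
actions = tree_to_actions(tree)
for action in actions:
    print(action)
```

A decoder that predicts such actions one at a time is still a sequence model, but every prefix of its output corresponds to a partially built tree, which is what makes it possible to constrain generation to syntactically valid programs.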

[Stehnii 2017](http://er.ucu.edu.ua/handle/1/1191?locale-attribute=en) complemented this architecture with a recursive encoder, which takes as input the natural language description parsed into dependency trees and semantic trees. The resulting Tree2tree network did not improve performance, but it improved the author's self-esteem, at least.

[Chen et al. 2017](https://arxiv.org/abs/1707.05436) also incorporated syntactic structures into machine translation, using syntactic attention and a bidirectional syntactic encoder. Their network outperformed the then state of the art for Chinese-English translation.

### Knowledge transfer

#### Google multilingual translation

![Google translation](google-multy.gif)
*Via [Google AI Blog](https://ai.googleblog.com/2016/11/zero-shot-translation-with-googles.html)*

#### Multitask Question Answering Network - MQAN

One network to ~~rule them all~~ [solve 10 tasks](https://einstein.ai/research/the-natural-language-decathlon):

![MQAN](MQAN.gif)

1. **Question Answering**

2. **Machine Translation**

3. **Text Summarization**

4. **Natural Language Inference**: models receive two input sentences: a premise and a hypothesis. Models must then classify the inference relationship between the premise and hypothesis as one of entailment, neutrality, or contradiction.

5. **Sentiment Analysis**

6. **Semantic Role Labeling**

7. **Relation Extraction**

8. **Goal-Oriented Dialogue**: based on user utterances and system actions, dialogue state trackers keep track of which predefined goals the user has for the dialogue system and which kinds of requests the user makes as the system and user interact turn-by-turn.

9. **Semantic Parsing (SQL query generation)**

10. **Pronoun Resolution**: "Joan made sure to thank Susan for the help she had [given/received]. Who had [given/received] help? Susan or Joan?"


