# Part 3: Sequence-to-sequence models

© Anatolii Stehnii, 2018

## Lecture 1: Neural machine translation

### Historical overview

![X-ray glasses](https://cdn-images-1.medium.com/max/2000/1*JE5UQE0Jo5f7RM-TCkMpSg.png)
*Via [Medium](https://medium.com/beluga-team/a-brief-and-untold-history-of-machine-translation-ea7dc1aa1f5)*

**1954** – First demonstration of machine translation at IBM: 49 sentences translated from Russian to English.

**1956** – Dartmouth Conference: the term *artificial intelligence* is introduced, with machine use of language among the problems proposed for study.

![History](https://cdn-images-1.medium.com/max/2000/1*d-iF6wcVYCWFDLkghpJvkw.png)
*Via [Medium](https://medium.freecodecamp.org/a-history-of-machine-translation-from-the-cold-war-to-deep-learning-f1d335ce8b5)*

**1966** – The ALPAC report criticised machine translation efforts and recommended basic research in computational linguistics.

**1990-2015** – Era of statistical machine translation (SMT). Parallel bilingual corpora are used to build a statistical model of $p(e|f)$, the probability that sentence $e$ in the target language is a translation of sentence $f$ in the source language.

More details about the old times in [A History of Machine Translation From the Cold War to Deep Learning](https://medium.freecodecamp.org/a-history-of-machine-translation-from-the-cold-war-to-deep-learning-f1d335ce8b5).

**Main problems of SMT**:
1. Different word meanings in different contexts (polysemy)
2. Reordering words in the target sentence
3. Transfer of syntactic structure
4. Rare words and named entities

**2014** – *[Cho et al.](https://arxiv.org/abs/1406.1078)*, "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation": one of the first encoder-decoder papers.

**2016** – Google launched [GNMT for 9 languages](https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html): an 8-layer encoder + an 8-layer decoder + attention.

### NMT task

Formulation of the machine translation problem:

Given a sentence $\textbf{x}$ in the source language, the task is to infer the sentence $\textbf{y}$ in the target language that maximizes the conditional probability $p(\textbf{y}|\textbf{x})$:

$$
\textbf{y} = \underset{\hat{\textbf{y}}}{\mathrm{argmax}} \; p(\hat{\textbf{y}}|\textbf{x})
$$

In NMT we approximate $p(\textbf{y}|\textbf{x})$ using a neural model with parameters $\theta$. To learn the values of $\theta$, we use a set of training examples consisting of pairs $(\textbf{y}^{(x)}, \textbf{x})$, where $\textbf{y}^{(x)}$ is a reference translation of $\textbf{x}$. The model parameters are learned by maximizing the conditional log-probability of the training set:

$$
\theta = \underset{\theta}{\mathrm{argmax}} \sum_{\textbf{y}^{(x)}, \textbf{x}} \log p(\textbf{y}^{(x)}| \textbf{x}; \theta)
$$

Since sentences consist of words, we can factorize this probability into per-word conditional probabilities using the chain rule. Let's denote the target sentence $\textbf{y}$ as a sequence of words $y_1, y_2, \ldots ,y_{t-1}, y_t$. Then the probability $p(\textbf{y}|\textbf{x})$ can be rewritten as:

$$
p(\textbf{y}|\textbf{x}) = \prod_{i=1}^{t} p(y_i|y_{i-1}, y_{i-2}, \ldots, y_{2}, y_{1}, \textbf{x})
$$
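
The factorization above can be checked on a toy example. The sketch below is only an illustration: `TOY_MODEL` is a hypothetical lookup table of invented conditional probabilities for one fixed source sentence, standing in for a real neural model. It sums per-word log-probabilities, which is the same quantity that appears under the sum in the training objective.

```python
import math

# Hypothetical toy conditional distributions p(y_i | y_1..y_{i-1}, x) for one
# fixed source sentence x; the numbers are invented for illustration.
TOY_MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sleeps": 0.8, "</s>": 0.2},
    ("the", "cat", "sleeps"): {"</s>": 1.0},
}


def sentence_log_prob(words, model):
    """Sum log p(y_i | y_1..y_{i-1}, x) over all positions (chain rule)."""
    log_prob = 0.0
    for i, word in enumerate(words):
        prefix = tuple(words[:i])  # the words generated so far
        log_prob += math.log(model[prefix][word])
    return log_prob


log_p = sentence_log_prob(["the", "cat", "sleeps", "</s>"], TOY_MODEL)
# exp(log_p) equals the product 0.6 * 0.5 * 0.8 * 1.0, up to rounding.
print(math.exp(log_p))
```

Working in log space turns the product into a sum, which avoids numerical underflow on long sentences.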

### Neural model

We can approximate the probability $p(y_i|y_{i-1}, y_{i-2}, \ldots, y_{2}, y_{1}, \textbf{x})$ using a neural model. All we need is to somehow represent the information about the source sentence $\textbf{x}$ and the previous words $y_{i-1}, y_{i-2}, \ldots, y_{2}, y_{1}$ as vectors. This can be done with a *sequence-to-sequence* model:

1. The source sentence $\textbf{x}$ is transformed into a vector $h_{\textbf{x}}$ by an RNN called the encoder.
2. Each word of the target sentence ($y_i$) is decoded by another RNN (the decoder). The decoder hidden state $h_{y_{i-1}}$ represents information about the previous words $y_{i-1}, y_{i-2}, \ldots, y_{2}, y_{1}$.
3. There are two ways to connect the encoder and the decoder:
    * Thought vector – the last hidden state of the encoder $h_{x_n}$ is used as the initial hidden state of the decoder $h_{y_0}$.
    * Attention – the decoder RNN concatenates its hidden state $h_{y_{i-1}}$ with an attention vector $a_{y_{i}}$, which is a combination of the encoder hidden states $h_{x_k}$. Attention will be described in detail in the next lecture.
    
![Seq2seq with thought vector](Seq2seq thought vector.png)
*Seq2seq with thought vector*

![Seq2seq with attention](Seq2seq-attention.png)
*Seq2seq with attention*
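
The thought-vector variant can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not a reference implementation: the cells are vanilla tanh RNNs, the weights are random and untrained, and the sizes, the `BOS` token id and all function names are hypothetical choices made for this example.

```python
import math
import random

random.seed(0)
HIDDEN, SRC_VOCAB, TGT_VOCAB = 4, 5, 6  # toy sizes, chosen arbitrarily
BOS = 0  # assumed id of the start-of-sentence token


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(W_in, W_h, token_id, h):
    """Vanilla RNN cell: h' = tanh(W_in[:, token_id] + W_h h)."""
    rec = matvec(W_h, h)
    return [math.tanh(W_in[j][token_id] + rec[j]) for j in range(len(h))]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Untrained random parameters (the theta of the model).
enc_W_in, enc_W_h = rand_matrix(HIDDEN, SRC_VOCAB), rand_matrix(HIDDEN, HIDDEN)
dec_W_in, dec_W_h = rand_matrix(HIDDEN, TGT_VOCAB), rand_matrix(HIDDEN, HIDDEN)
W_out = rand_matrix(TGT_VOCAB, HIDDEN)


def encode(source_ids):
    h = [0.0] * HIDDEN
    for token_id in source_ids:
        h = rnn_step(enc_W_in, enc_W_h, token_id, h)
    return h  # last encoder state h_{x_n}: the thought vector


def decode_step(prev_token_id, h_prev):
    """Return the distribution over the next target word and the new state."""
    h = rnn_step(dec_W_in, dec_W_h, prev_token_id, h_prev)
    return softmax(matvec(W_out, h)), h


thought = encode([1, 3, 2])            # source sentence as token ids
probs, h1 = decode_step(BOS, thought)  # h_{y_0} is set to h_{x_n}
print(probs)                           # a distribution over TGT_VOCAB words
```

The only link between the two networks is the line that passes `thought` to the first decoder step, so the whole source sentence has to fit into one fixed-size vector.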

### Result search
1. Greedy search
2. Random search
3. Beam search https://ai.googleblog.com/2016/05/chat-smarter-with-allo.html
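
A minimal sketch of how greedy and beam search differ, assuming a hypothetical `TOY_MODEL` table of invented next-word probabilities in place of a trained decoder. Greedy search is treated here as beam search with a beam of size 1.

```python
import math

EOS = "</s>"

# Hypothetical next-word distributions p(y_i | prefix, x) for one fixed source
# sentence; the probabilities are invented so that greedy search is suboptimal.
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"cat": 0.5, "dog": 0.5},
    ("the",): {"cat": 0.9, "dog": 0.1},
    ("a", "cat"): {EOS: 1.0},
    ("a", "dog"): {EOS: 1.0},
    ("the", "cat"): {EOS: 1.0},
    ("the", "dog"): {EOS: 1.0},
}


def beam_search(model, beam_size, max_len=10):
    """Keep the beam_size most probable prefixes at every step."""
    beams = [((), 0.0)]  # pairs of (prefix, log-probability)
    finished = []
    for _ in range(max_len):
        # Extend every live prefix by every possible next word.
        candidates = []
        for prefix, score in beams:
            for word, p in model[prefix].items():
                candidates.append((prefix + (word,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the best beam_size candidates; set aside the completed ones.
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == EOS:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])


greedy_words, greedy_score = beam_search(TOY_MODEL, beam_size=1)
beam_words, beam_score = beam_search(TOY_MODEL, beam_size=2)
print(greedy_words, math.exp(greedy_score))
print(beam_words, math.exp(beam_score))
```

With these numbers, greedy search commits to "a" because it is the most probable first word, while a beam of size 2 keeps "the" alive and ends with a more probable full sentence.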

### Metric

### Other applications of Seq2seq models

* Text summarization
* Question answering
* Text-to-code, code-to-text translation

### Augmented networks
seq2tree, tree2tree

### Knowledge transfer

How Google is using different encoders and decoders

#### Multitask Question Answering Network - MQAN

One network to ~~rule them all~~ solve 10 tasks:
1. **Question Answering**

2. **Machine Translation**

3. **Text Summarization**

4. **Natural Language Inference**: models receive two input sentences: a premise and a hypothesis. Models must then classify the inference relationship between the premise and hypothesis as one of entailment, neutrality, or contradiction.

5. **Sentiment Analysis**

6. **Semantic Role Labeling**

7. **Relation Extraction**

8. **Goal-Oriented Dialogue**: based on user utterances and system actions, dialogue state trackers keep track of which predefined goals the user has for the dialogue system and which kinds of requests the user makes as the system and user interact turn-by-turn.

9. **Semantic Parsing (SQL query generation)**

10. **Pronoun Resolution**: "Joan made sure to thank Susan for the help she had [given/received]. Who had [given/received] help? Susan or Joan?"


https://einstein.ai/research/the-natural-language-decathlon
