# Seq2seq


Outline:  
1. Neural Machine Translation
1. RNN Seq2Seq Architecture
1. Teacher Forcing
1. Attention in RNN
1. Transformer

Readings:
1. http://jalammar.github.io/illustrated-transformer/
1. http://nlp.seas.harvard.edu/2018/04/03/attention.html
1. https://mlexplained.com/2018/01/13/weight-normalization-and-layer-normalization-explained-normalization-in-deep-learning-part-2/
1. https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

## 1 Neural Machine Translation

Why is it hard?

Let  
$x$ - sentence in the source language  
$y$ - sentence in the target language

$$ y^{*} = \arg \max_{y} P(y|x) = \arg \max_y \prod_{t=1}^T P(y_t | y_{<t}, x)$$
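The factorization above can be illustrated with a small sketch. The per-step probabilities below are made-up values, not model output; the point is only that the sequence probability is the product of the conditional token probabilities, which is usually computed as a sum of logs for numerical stability. Note also that exact maximization over $y$ is intractable: with a vocabulary of size $|V|$ there are $|V|^T$ candidate sequences of length $T$.

```python
import math


def sequence_log_prob(step_probs):
    """Sum of log P(y_t | y_<t, x) over all target positions."""
    return sum(math.log(p) for p in step_probs)


# Hypothetical probabilities the model assigned to each chosen target token.
step_probs = [0.5, 0.25, 0.8]

log_p = sequence_log_prob(step_probs)
prob = math.exp(log_p)  # equals the product 0.5 * 0.25 * 0.8

# Size of the search space for a hypothetical vocabulary and target length.
vocab_size, target_len = 10_000, 20
num_candidates = vocab_size ** target_len
```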


### BLEU score

1. N-gram overlap between candidates and reference translations (clipped) 
1. Compute precision for n-grams of length 1 to 4
1. Add brevity penalty if too short
1. Compute over the whole corpus

<img src="images/bleu.jpg" style="height:500px">
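The steps listed above can be sketched in plain Python. This is a simplified sentence-level version with a single reference, written only for illustration; corpus-level BLEU aggregates the n-gram counts over all sentences before taking the ratios. The function names are hypothetical, not taken from any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Each candidate n-gram is counted at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0


def brevity_penalty(cand_len, ref_len):
    if cand_len == 0:
        return 0.0
    return 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)


def bleu(candidate, reference, max_n=4):
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)


reference = "the cat is on the mat".split()
degenerate = "the the the the the the the".split()

# Clipping: "the" occurs 7 times in the candidate but only 2 times in the reference.
unigram_precision = clipped_precision(degenerate, reference, 1)
```

Without clipping, the degenerate candidate would get a unigram precision of 1; with clipping it gets 2/7.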


### ROUGE score

Overlap between n-grams of hypothesis and reference sentences

$$\text{ROUGE-N} = \frac {\text{number of common n-grams}} {\text{total number of n-grams in the reference}}$$
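A minimal sketch of the formula, assuming a single reference and whitespace-tokenized input. The function name `rouge_n` is hypothetical; counting common n-grams with clipping (an n-gram is matched at most as often as it appears in the hypothesis) is an assumption that the formula above leaves open.

```python
from collections import Counter


def rouge_n(hypothesis, reference, n=1):
    """Recall-oriented n-gram overlap: common n-grams / n-grams in the reference."""
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    common = sum(min(count, hyp[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return common / total if total else 0.0


hypothesis = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

rouge_1 = rouge_n(hypothesis, reference, n=1)  # 5 of 6 reference unigrams are matched
rouge_2 = rouge_n(hypothesis, reference, n=2)  # 3 of 5 reference bigrams are matched
```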

### METEOR

R - unigram recall  
P - unigram precision  

```
To take into account longer matches, METEOR computes a penalty for a given alignment as follows. First, all the unigrams in the system translation that are mapped to unigrams in the reference translation are grouped into the fewest possible number of chunks such that the unigrams in each chunk are in adjacent positions in the system translation, and are also mapped to unigrams that are in adjacent positions in the reference translation. Thus, the longer the n-grams, the fewer the chunks, and in the extreme case where the entire system translation string matches the reference translation there is only one chunk. In the other extreme, if there are no bigram or longer matches, there are as many chunks as there are unigram matches.
```

Penalty for alignment:
$$p = 0.5 (\frac c u)^3$$
where   
$c$ - number of chunks  
$u$ - number of matched unigrams  

$$METEOR = \frac {10 P R} {R + 9 P} (1-p) $$
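The two formulas combine as in the sketch below. It assumes that the alignment has already been computed, so precision, recall, the number of chunks, and the number of matched unigrams are given as inputs; the function name and the example numbers are made up for illustration.

```python
def meteor(precision, recall, chunks, matched):
    """METEOR from precomputed alignment statistics."""
    if matched == 0 or precision == 0 or recall == 0:
        return 0.0
    # Harmonic mean that weights recall more heavily than precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fragmentation penalty: more chunks per matched unigram means a larger penalty.
    penalty = 0.5 * (chunks / matched) ** 3
    return f_mean * (1 - penalty)


# Perfect match of a 6-token sentence: one chunk, tiny penalty.
perfect = meteor(precision=1.0, recall=1.0, chunks=1, matched=6)

# No bigram or longer matches: as many chunks as matched unigrams, maximum penalty 0.5.
fragmented = meteor(precision=0.5, recall=0.5, chunks=3, matched=3)
```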

## 2 RNN Seq2Seq Architecture

<img src="images/seq2seq.png" style="height:300px">


Generate a target sequence from a source sequence.  


**Encoder**:   
$h_t = LSTM(h_{t-1}, x_t)$ - encoder hidden state     
$e_t = out_e(h_t)$ - output of encoder at time $t$     

**Decoder**:  
$s_t = LSTM(s_{t-1}, y_{t-1})$ - decoder hidden state  
$g_t = out_g(s_t)$ - output of decoder at time $t$  
$p_t = softmax(g_t)$ - probabilities of tokens at time $t$  
$y_t = argmax(p_t)$ - predicted token at time $t$  
$s_0 = h_T$ - initial decoder hidden state is the last encoder hidden state.  
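The equations above map onto PyTorch modules roughly as follows. This is a minimal sketch under assumptions, not a reference implementation: the class names, the single-layer LSTMs, the shared hidden size, and the `sos_id` start token are all choices made for illustration, and the raw LSTM outputs stand in for $e_t$.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                          # src: (batch, T_src)
        outputs, state = self.lstm(self.embed(src))  # outputs play the role of e_t
        return outputs, state                        # state holds the last hidden state h_T


class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)  # out_g

    def forward(self, prev_token, state):            # prev_token: (batch, 1)
        output, state = self.lstm(self.embed(prev_token), state)
        return self.out(output.squeeze(1)), state    # logits g_t: (batch, vocab)


def greedy_decode(encoder, decoder, src, sos_id, max_len):
    _, state = encoder(src)                          # s_0 = h_T
    token = torch.full((src.size(0), 1), sos_id, dtype=torch.long)
    result = []
    for _ in range(max_len):
        logits, state = decoder(token, state)
        probs = logits.softmax(dim=-1)               # p_t
        token = probs.argmax(dim=-1, keepdim=True)   # y_t
        result.append(token)
    return torch.cat(result, dim=1)                  # (batch, max_len)
```

A real decoder would also stop at an end-of-sequence token; the fixed `max_len` loop keeps the sketch short.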


Seq2seq is a classification task, because at every time step the model has to choose which token to output.   
The loss function is very similar to the autoregressive case: it is the average cross-entropy loss over tokens. 
Since it is an average, it does not grow with the sequence length. 

$$ Loss(S_{pred}, S_{target}) = \frac 1 {|S_{target}|} \sum_{i=1}^{|S_{target}|} cross\_entropy(y_i, \hat y_i)$$
where  
$S_{pred}, S_{target}$ - predicted and target sequences.  
$y_i, \hat y_i$ - target token and prediction at position $i$ of the corresponding sequences (in practice the cross-entropy is computed from the predicted distribution over tokens).  
If the predicted sequence is shorter than the target sequence, it should be padded to the target sequence length.  
If the predicted sequence is longer than the target sequence, it should be truncated to the target sequence length.    
Because the model is trained on mini-batches, the target sequences have to be padded to a common length.  
Padding positions should not be counted as errors.  
Excluding them from the loss is done for you by `torch.nn.CrossEntropyLoss(ignore_index=<your padding value>)`.
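The effect of `ignore_index` can be checked on a toy batch. In this sketch the padding id `PAD = 0`, the tensor shapes, and the random logits are assumptions for illustration; the loss returned by `CrossEntropyLoss` is compared with the same quantity computed by hand from a padding mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD = 0  # hypothetical padding token id
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

torch.manual_seed(0)
logits = torch.randn(2, 4, 5)                 # (batch, time, vocab)
targets = torch.tensor([[3, 1, 2, PAD],
                        [4, 2, PAD, PAD]])    # padded to a common length

# CrossEntropyLoss expects (N, vocab) logits and (N,) targets.
loss = criterion(logits.reshape(-1, 5), targets.reshape(-1))

# The same value by hand: per-token losses averaged over non-padding positions only.
per_token = F.cross_entropy(logits.reshape(-1, 5), targets.reshape(-1), reduction="none")
mask = targets.reshape(-1) != PAD
manual = per_token[mask].mean()
```

Note that with the default `reduction="mean"` the average is taken over all non-padding tokens in the batch, not per sequence first.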

## 3 Training/Inference Discrepancy. Teacher forcing 

There is a problem with training the decoder.  
Because $y_{t}$ depends on $y_{t-1}, ..., y_0$, if a wrong token is predicted at some timestep, the rest of the sequence is likely to be wrong too.   
Teacher forcing was introduced to address this.  

<img src="images/teacher.png" style="height:400px">

During training, at every timestep the decoder is given the **true previous token $y_{t-1}$** to predict the current one $y_t$.  
During inference, at every timestep the decoder is given its own **predicted previous token $\hat y_{t-1}$** to predict the current one.

Teacher forcing has a problem of its own: the model is taught to do something different from what it actually has to do at inference time. Conditioning on the true previous token is somewhat easier.  
In PyTorch you can mix both regimes: train some batches with teacher forcing and others with the model's own predictions fed back as inputs.
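One common way to mix the two regimes is to flip a biased coin per batch. The sketch below assumes a hypothetical `TinyDecoder` built on `nn.LSTMCell` and a `teacher_forcing_ratio` parameter; neither name comes from the text above, and the decoder state would normally come from an encoder rather than being zero-initialized.

```python
import random

import torch
import torch.nn as nn


class TinyDecoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.cell = nn.LSTMCell(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, state):          # prev_token: (batch,)
        h, c = self.cell(self.embed(prev_token), state)
        return self.out(h), (h, c)


def decode_batch(decoder, state, target, sos_id, teacher_forcing_ratio):
    # One decision per batch: feed true tokens or the model's own predictions.
    use_teacher = random.random() < teacher_forcing_ratio
    prev = torch.full((target.size(0),), sos_id, dtype=torch.long)
    logits_per_step = []
    for t in range(target.size(1)):
        logits, state = decoder(prev, state)
        logits_per_step.append(logits)
        # Teacher forcing: next input is the true token; otherwise the argmax prediction.
        prev = target[:, t] if use_teacher else logits.argmax(dim=-1)
    return torch.stack(logits_per_step, dim=1)     # (batch, T, vocab)
```

The returned logits can be passed to the cross-entropy loss in either regime; only the decoder inputs differ.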


### Problems with Vanilla Encoder-Decoder Architecture

1. Poor performance on long sentences
1. Bias towards shorter candidates
1. Fluent but inadequate output
1. No guarantee that all input words are translated

## 4 Attention

<img src="images/attention.jpg" style="height:500px">

Attention is a mechanism that conditions every output on a weighted sum of the encoder outputs for the source tokens.

Introduce attention through a new scoring function $f$:  
$\alpha_{t'} = f(g_{t-1}, e_{t'})$ - unnormalized weights of the source tokens.    
$\bar \alpha = softmax(\alpha)$ - normalized weights.  
$c_t = \sum_{t'=1}^T \bar \alpha_{t'} e_{t'}$ - context vector, the weighted sum of the encoder outputs.

**Encoder**:   
$h_t = LSTM(h_{t-1}, x_t)$ - encoder hidden state     
$e_t = out_e(h_t)$ - output of encoder at time $t$     

**Decoder**:  
$s_t = LSTM(s_{t-1}, [y_{t-1}, c_t])$ - decoder hidden state  
$g_t = out_g(s_t)$ - output of decoder at time $t$  
$p_t = softmax(g_t)$ - probabilities of tokens at time $t$  
$y_t = argmax(p_t)$ - predicted token at time $t$  
$s_0 = h_T$ - initial decoder hidden state is the last encoder hidden state.  


Usual choices for attention functions:

$f(h, e) = h^T e$ - dot  
$f(h, e) = h^T W e$ - general    
$f(h, e) = v^T tanh(W [h, e])$ - concat  
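The three scoring functions and the weighted sum can be sketched as one module. This is an illustration under assumptions: the class name `Attention`, the `method` argument, and equal hidden sizes for the decoder state $h$ and the encoder outputs $e$ are choices made here, not part of the definitions above.

```python
import torch
import torch.nn as nn


class Attention(nn.Module):
    def __init__(self, hidden_size, method="dot"):
        super().__init__()
        if method not in ("dot", "general", "concat"):
            raise ValueError(f"unknown attention method: {method}")
        self.method = method
        if method == "general":
            self.W = nn.Linear(hidden_size, hidden_size, bias=False)
        elif method == "concat":
            self.W = nn.Linear(2 * hidden_size, hidden_size, bias=False)
            self.v = nn.Linear(hidden_size, 1, bias=False)

    def score(self, h, e):                 # h: (batch, hidden), e: (batch, T, hidden)
        if self.method == "dot":           # h^T e
            return torch.bmm(e, h.unsqueeze(2)).squeeze(2)
        if self.method == "general":       # h^T W e
            return torch.bmm(self.W(e), h.unsqueeze(2)).squeeze(2)
        # concat: v^T tanh(W [h, e]), with h repeated for every source position
        expanded = h.unsqueeze(1).expand(-1, e.size(1), -1)
        return self.v(torch.tanh(self.W(torch.cat([expanded, e], dim=2)))).squeeze(2)

    def forward(self, h, e):
        weights = torch.softmax(self.score(h, e), dim=1)          # (batch, T)
        context = torch.bmm(weights.unsqueeze(1), e).squeeze(1)   # c_t: (batch, hidden)
        return context, weights
```

The returned `weights` are what attention visualizations plot: one row of source-token weights per generated token.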


Basic attention mechanisms:  

<img src="images/attn2.png" style="height:500px">



Bonus: interpretable models. Because every predicted token is conditioned on a weighted sum of input tokens, you can see which input tokens were most influential for the predicted one.    


<img src="images/viz.png" style="height:500px">

## 5 Transformer

### Self Attention

<img src="images/selfatt.png" style="height:300px">

### Positional encoding

```
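import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        # 1 / 10000^(2i / d_model) for every even dimension index 2i
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions (assumes even d_model)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model), broadcasts over the batch
        # A buffer is saved with the module and moved across devices, but is not trained.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encodings of the first seq_len positions.
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)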
```

<img src="images/pos.png" style="height:300px">

### Layer Norm

BatchNorm:

$$\mu_j = \frac 1 N \sum_{i=1}^N x_{ij}$$
$$\sigma^2_j = \frac 1 N \sum_{i=1}^N (x_{ij} - \mu_j)^2$$
$$\hat x_{ij} = \frac {x_{ij} - \mu_j} {\sqrt{\sigma^2_j + \epsilon}}$$
$$\hat x_{ij} = \alpha \hat x_{ij}  + \beta$$


Problems with regular BatchNorm:
1. It puts a lower limit on the batch size
1. It is difficult to apply to the recurrent connections in recurrent neural networks
```
In a recurrent neural network, the recurrent activations of each time-step will have different statistics. This means that we have to fit a separate batch normalization layer for each time-step. This makes the model more complicated and – more importantly – it forces us to store the statistics for each time-step during training.
```

LayerNorm:

$$\mu_i = \frac 1 C \sum_{j=1}^C x_{ij}$$
$$\sigma^2_i = \frac 1 C \sum_{j=1}^C (x_{ij} - \mu_i)^2$$
$$\hat x_{ij} = \frac {x_{ij} - \mu_i} {\sqrt{\sigma^2_i + \epsilon}}$$
$$\hat x_{ij} = \alpha \hat x_{ij}  + \beta$$

<img src="images/layer.png" style="height:200px">
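The only difference between the two sets of formulas is the axis over which the statistics are computed. The sketch below assumes a 2-D input of shape `(N, C)` (batch by features) and hypothetical function names; the layer norm version can be compared against `torch.nn.functional.layer_norm`.

```python
import torch
import torch.nn.functional as F


def batch_norm(x, alpha, beta, eps=1e-5):
    """Statistics per feature j, computed over the batch dimension."""
    mu = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    return alpha * (x - mu) / torch.sqrt(var + eps) + beta


def layer_norm(x, alpha, beta, eps=1e-5):
    """Statistics per example i, computed over the feature dimension."""
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return alpha * (x - mu) / torch.sqrt(var + eps) + beta


torch.manual_seed(0)
x = torch.randn(4, 6)                        # N = 4 examples, C = 6 features
alpha, beta = torch.ones(6), torch.zeros(6)  # scale and shift parameters

ln = layer_norm(x, alpha, beta)
bn = batch_norm(x, alpha, beta)
reference = F.layer_norm(x, (6,))            # built-in layer norm, no affine parameters
```

Because layer norm uses only the features of a single example, it works with a batch size of 1 and gives the same result at training and inference time.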

### General Architecture

<img src="images/tr.png" style="height:600px">


### Masking target

<img src="images/mask.png" style="height:200px">

### Problems with NMT

1. Number of parameters dominated by size of word embeddings => shared embeddings 
1. Unknown word rate depends on vocab size
1. UNK replacement (using alignments)