# Seq2seq


Outline:  
1. General Architecture
1. Teacher Forcing
1. Beam Search
1. Attention
1. Neural Machine Translation

## 1 General Architecture

<img src="images/seq2seq.png" style="height:300px">


A seq2seq model generates a target sequence from a source sequence.  


**Encoder**:   
$h_t = LSTM(h_{t-1}, x_t)$ - encoder hidden state     
$e_t = out_e(h_t)$ - output of encoder at time $t$     

**Decoder**:  
$s_t = LSTM(s_{t-1}, y_{t-1})$ - decoder hidden state  
$g_t = out_g(s_t)$ - output of decoder at time $t$  
$p_t = softmax(g_t)$ - probabilities of tokens at time $t$  
$y_t = argmax(p_t)$ - predicted token at time $t$  
$s_0 = h_T$ - the initial decoder hidden state is the last encoder hidden state.  
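
The decoder recurrence above can be sketched as a greedy decoding loop. This is a minimal illustration in plain Python, not a trained model: `toy_step` is a hypothetical stand-in for the LSTM cell plus output layer, and `initial_state` stands in for the last encoder state $h_T$.

```python
import math

def softmax(logits):
    """Turn decoder outputs g_t into token probabilities p_t."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(step, initial_state, sos, eos, max_len):
    """Run the decoder recurrence, picking argmax(p_t) at every step."""
    state, prev, output = initial_state, sos, []
    for _ in range(max_len):
        state, logits = step(state, prev)   # s_t and g_t
        probs = softmax(logits)             # p_t
        prev = max(range(len(probs)), key=probs.__getitem__)  # y_t
        if prev == eos:
            break
        output.append(prev)
    return output

# Hypothetical toy decoder: the state is a counter. The logits favour the
# token whose id equals the new state, and favour EOS (id 0) after 3 steps.
def toy_step(state, prev_token):
    new_state = state + 1
    logits = [0.0] * 5
    logits[new_state if new_state <= 3 else 0] = 5.0
    return new_state, logits

decoded = greedy_decode(toy_step, initial_state=0, sos=4, eos=0, max_len=10)
print(decoded)
```

In a real model `step` would embed `prev_token`, run one LSTM step, and apply the output layer.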


Seq2seq is a classification task, because at every time step you have to choose which token to output.  
The loss function is very similar to the autoregressive case: it is the average of the cross-entropy loss over tokens.  
Since it is an average, its scale does not grow with the sequence length.  

$$ Loss(S_{pred}, S_{target}) = \frac 1 {|S_{target}|} \sum_{i=1}^{|S_{target}|} cross\_entropy(y_i, \hat y_i)$$
where  
$S_{pred}, S_{target}$ - predicted and target sequences.  
$y_i, \hat y_i$ - target token and prediction at position $i$ of the corresponding sequences.  
If the predicted sequence is shorter than the target sequence, it should be padded to the target sequence length.  
If the predicted sequence is longer than the target sequence, it should be truncated to the target sequence length.  
Because you train your model with mini-batches, you have to pad the target sequences in a batch to a common length.  
Padding positions should not be counted as errors.  
Skipping the padding positions and averaging over the rest is done for you by `torch.nn.CrossEntropyLoss(ignore_index=<your padding value>)`.
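
To make the role of the padding value concrete, here is a minimal plain-Python sketch of an averaged cross-entropy that skips padded positions. It only mirrors the idea behind `ignore_index`; the function name `masked_cross_entropy` and the toy numbers are assumptions made for illustration.

```python
import math

def masked_cross_entropy(log_probs, targets, pad_id):
    """Average negative log-likelihood over non-padding target positions.

    log_probs: one list of log-probabilities over the vocabulary per step.
    targets:   target token ids, padded with pad_id.
    """
    total, count = 0.0, 0
    for step_log_probs, target in zip(log_probs, targets):
        if target == pad_id:
            continue  # padding is not counted as an error
        total -= step_log_probs[target]
        count += 1
    return total / count if count else 0.0

PAD = 0
uniform = [math.log(0.25)] * 4
confident = [math.log(p) for p in (0.1, 0.7, 0.1, 0.1)]

# Same sequence, once padded to length 3 and once without padding.
padded_loss = masked_cross_entropy([confident, uniform, uniform], [1, 2, PAD], PAD)
plain_loss = masked_cross_entropy([confident, uniform], [1, 2], PAD)
print(padded_loss, plain_loss)
```

Both calls give the same value, so padding a batch to a common length does not change the loss of an individual sequence.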

## 2 Teacher forcing 

There is a problem with training the decoder.  
Because $y_t$ depends on $y_{t-1}, ..., y_0$, if a wrong token is predicted at some time step, the rest of the sequence is likely to be wrong too.  
Teacher forcing was introduced to address this.  

<img src="images/teacher.png" style="height:400px">

At the training phase, at every time step you give the decoder the **true previous token** (the ground-truth target at step $t-1$) to predict the current one $y_t$.  
At the inference phase, at every time step you give the decoder the **predicted previous token $y_{t-1}$** to predict the current one $y_t$.

However, teacher forcing has a problem of its own: the model is taught to do something different from what it actually has to do at inference. Conditioning on the true previous token is easier than conditioning on its own, possibly wrong, predictions.  
In PyTorch you can mix both regimes: train some batches with teacher forcing and others without it, feeding the decoder its own predictions.
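
The difference between the two regimes, and one way of mixing them, can be sketched in plain Python. Everything here is illustrative: `run_decoder`, `teacher_forcing_ratio`, and the lookup-table "decoder" `TOY_TABLE` are hypothetical names, and the regime is chosen once per call (think: once per batch).

```python
import random

def run_decoder(predict_next, target, sos, teacher_forcing_ratio, rng):
    """Decode len(target) steps and return the predicted tokens.

    With probability teacher_forcing_ratio the whole sequence is decoded with
    teacher forcing; otherwise the decoder is fed its own predictions.
    """
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    prev, predictions = sos, []
    for true_token in target:
        predicted = predict_next(prev)
        predictions.append(predicted)
        # Teacher forcing feeds the ground truth, free running the prediction.
        prev = true_token if use_teacher_forcing else predicted
    return predictions

# Hypothetical toy decoder: a lookup table with one mistake (2 -> 3, truth is 9).
TOY_TABLE = {0: 1, 1: 2, 2: 3, 3: 4, 9: 10}
target = [1, 2, 9, 10]
rng = random.Random(0)

forced = run_decoder(TOY_TABLE.__getitem__, target, sos=0,
                     teacher_forcing_ratio=1.0, rng=rng)
free = run_decoder(TOY_TABLE.__getitem__, target, sos=0,
                   teacher_forcing_ratio=0.0, rng=rng)
print(forced)  # the mistake at the third step stays local
print(free)    # the mistake at the third step also corrupts the fourth
```

A ratio strictly between 0 and 1 would pick a regime at random on every call.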


## 3 Beam search 

<img src="images/beam.png" style="height:500px">

There is another problem with seq2seq models: the prediction of discrete tokens at the inference phase.  
In general, you want to generate the most probable sequence given the input. So far you do that by predicting the most probable token at each time step, which is not the same thing:  
such a greedy search is not guaranteed to give you the optimal solution.  

To overcome this problem, beam search was introduced.  
Basically it says: at every time step, evaluate the probabilities of the already generated sequences extended with each new candidate token, and keep track of only the top-k most probable ones.  
This gives you a better, yet still not necessarily optimal, solution.  

1. at time $t$ you have the top-k already generated subsequences $\{(y_0^{(1)}, .., y_{t-1}^{(1)}), .., (y_0^{(k)}, .., y_{t-1}^{(k)})\}$
1. feed them into the generator and get the top-k most probable tokens for each subsequence ($k^2$ variants in total)
1. evaluate the probabilities of all new candidate subsequences of length $t+1$
1. select the top-k subsequences $\{(y_0^{(1)}, .., y_t^{(1)}), .., (y_0^{(k)}, .., y_t^{(k)})\}$
1. after max_len is reached or the terminal symbol is generated, select the most probable sequence as the output.
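
The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions: `next_log_probs` is a hypothetical callback returning log-probabilities of the next token for a given prefix, finished hypotheses are simply set aside, and no length normalization is applied. The toy distribution is chosen so that greedy search (beam width 1) misses the most probable sequence.

```python
import math

def beam_search(next_log_probs, sos, eos, beam_width, max_len):
    """Return (tokens, log_probability) of the best sequence found."""
    beams = [([sos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = next_log_probs(tokens)
            # Top-k continuations of this prefix.
            top = sorted(log_probs.items(), key=lambda kv: kv[1],
                         reverse=True)[:beam_width]
            for token, lp in top:
                candidates.append((tokens + [token], score + lp))
        # Keep the k best of the (at most) k * k candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Hypothetical toy model: the next-token distribution depends on the last token.
TABLE = {
    "<sos>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "x": 0.1},
    "x": {"<eos>": 1.0}, "y": {"<eos>": 1.0}, "z": {"<eos>": 1.0},
}

def toy_next_log_probs(tokens):
    return {tok: math.log(p) for tok, p in TABLE[tokens[-1]].items()}

greedy_tokens, greedy_score = beam_search(toy_next_log_probs, "<sos>", "<eos>", 1, 5)
beam_tokens, beam_score = beam_search(toy_next_log_probs, "<sos>", "<eos>", 2, 5)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

Greedy search commits to `a` (probability 0.6) and ends with a sequence of probability 0.3, while a beam of width 2 keeps `b` alive and finds a sequence of probability 0.36.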



## 4 Attention

<img src="images/attention.jpg" style="height:500px">

Another problem with vanilla seq2seq: for long sequences it is hard to fit all the information about the source sequence into the single intermediate encoder state. 

Attention is a mechanism that conditions every output on a weighted sum of the encoder outputs.

Introduce attention through a new function $f$:  
$\alpha_{t'} = f(g_{t-1}, e_{t'})$ - weights (scores) of source tokens at decoder step $t$.  
$\bar \alpha = softmax(\alpha)$ - normalized weights.  
$c_t = \sum_{t'=0}^T \bar \alpha_{t'} e_{t'}$ - weighted sum (context vector).  
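
A minimal plain-Python sketch of these three formulas for a single decoder step, assuming the dot product as the function $f$. The vectors are made-up toy values, and `query` stands in for $g_{t-1}$.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_context(query, encoder_outputs, score=dot):
    """Return (normalized weights, context vector) for one decoder step."""
    raw = [score(query, e) for e in encoder_outputs]  # alpha_{t'}
    weights = softmax(raw)                            # normalized weights
    dim = len(encoder_outputs[0])
    context = [sum(w * e[i] for w, e in zip(weights, encoder_outputs))
               for i in range(dim)]                   # c_t
    return weights, context

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # e_0, e_1, e_2
query = [2.0, 0.0]                                      # stands in for g_{t-1}
weights, context = attention_context(query, encoder_outputs)
print(weights)
print(context)
```

The first and third encoder outputs get equal, larger weights because they have the larger dot product with the query.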

**Encoder**:   
$h_t = LSTM(h_{t-1}, x_t)$ - encoder hidden state     
$e_t = out_e(h_t)$ - output of encoder at time $t$     

**Decoder**:  
$s_t = LSTM(s_{t-1}, [y_{t-1}, c_t])$ - decoder hidden state  
$g_t = out_g(s_t)$ - output of decoder at time $t$  
$p_t = softmax(g_t)$ - probabilities of tokens at time $t$  
$y_t = argmax(p_t)$ - predicted token at time $t$  
$s_0 = h_T$ - the initial decoder hidden state is the last encoder hidden state.  


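Usual choices for the attention function, where $h$ stands for the decoder-side argument ($g_{t-1}$ above) and $e$ for an encoder output:

$f(h, e) = h^T e$ - dot  
$f(h, e) = h^T W e$ - general  
$f(h, e) = v^T \tanh(W [h, e])$ - concat  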
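
The three scoring functions can be written out in plain Python as follows. This is only a sketch: the weights `W_general`, `W_concat` and `v` are hand-picked toy values, whereas in a model they would be learned parameters.

```python
import math

def dot_score(h, e):
    return sum(a * b for a, b in zip(h, e))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def general_score(h, e, W):
    return dot_score(h, matvec(W, e))            # h^T W e

def concat_score(h, e, W, v):
    hidden = [math.tanh(x) for x in matvec(W, h + e)]  # tanh(W [h, e])
    return dot_score(v, hidden)                  # v^T tanh(...)

h = [1.0, 2.0]
e = [3.0, 4.0]
W_general = [[1.0, 0.0], [0.0, 1.0]]   # identity, so "general" reduces to "dot"
W_concat = [[1.0, 0.0, 0.0, 0.0],      # picks h[0] from the concatenation
            [0.0, 0.0, 0.0, 1.0]]      # picks e[1] from the concatenation
v = [1.0, 1.0]

print(dot_score(h, e))
print(general_score(h, e, W_general))
print(concat_score(h, e, W_concat, v))
```

Note that the dot and general forms need compatible dimensions for `h` and `e`, while the concat form works for any pair of sizes.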


Basic attention mechanisms:  

<img src="images/attn2.png" style="height:500px">



Bonus: interpretable models. Because every predicted token is conditioned on a weighted sum of input tokens, you can see which input tokens were most influential for each predicted one.  


<img src="images/viz.png" style="height:500px">

## 5 Neural Machine Translation

NMT is usually evaluated with BLEU and ROUGE scores.  
Several reference translations can be provided for each source sentence.  

### BLEU score

<img src="images/bleu.jpg" style="height:500px">
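
As a rough illustration, sentence-level BLEU can be sketched in plain Python using the standard definition: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. This version is unsmoothed, assumes uniform weights over n-gram orders, and works on pre-tokenized lists; evaluation toolkits usually compute BLEU at corpus level and may differ in details.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, references, max_n=4):
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # unsmoothed: any zero precision gives BLEU = 0
        log_precisions.append(math.log(clipped / total))
    c = len(hypothesis)
    # Reference length closest to the hypothesis length.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = "the cat is on the mat".split()
perfect = bleu(reference, [reference])
repeated = bleu("the the the the".split(), [reference], max_n=1)
print(perfect, repeated)
```

In the second call, clipping limits the credit for the repeated word to the two occurrences found in the reference, and the brevity penalty lowers the score further because the hypothesis is shorter than the reference.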


### ROUGE score

ROUGE-N measures the overlap between the n-grams of the hypothesis and reference sentences, relative to the reference (a recall-oriented measure):

$$\text{ROUGE-N} = \frac {number\_of\_common\_Ngrams} {total\_number\_of\_Ngrams\_in\_reference}$$
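
The formula can be computed directly with n-gram counts. The sketch below assumes a single reference, whitespace tokenization, and that repeated n-grams are matched at most as often as they occur in both sentences. The function name `rouge_n` is chosen here for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(hypothesis, reference, n):
    """Share of reference n-grams that also occur in the hypothesis."""
    hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    common = sum((hyp & ref).values())  # multiset intersection of n-grams
    return common / total

hypothesis = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
rouge_1 = rouge_n(hypothesis, reference, 1)
rouge_2 = rouge_n(hypothesis, reference, 2)
print(rouge_1, rouge_2)
```

Here 5 of the 6 reference unigrams and 3 of the 5 reference bigrams appear in the hypothesis.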