# Intuition

Why did we start with an encoder and a decoder in the first place?

- Because we needed to compress the source sentence into a vector representation and then expand that vector into a target sentence.

But why keep that vector fixed after encoding? Why not use the source context at each output step?

- This paper introduces a model that does exactly that: it computes a fresh source context at each output step instead of relying on a single fixed vector.

# 1 Introduction

> The models proposed recently for neural machine translation often belong to a family of encoder–decoders and encode a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

Instead of decoding from a single fixed vector produced from the source sequence, the model supplies a source context at each decoder output step.

For each output word, the model soft-searches for the source positions that best align with it, then predicts the target word with the highest probability given that context vector and the previously generated words.
Any output position can therefore draw on any source position, which gives a richer context than a single fixed vector.

# 2 Background

Translation is framed as finding the target sentence y that maximizes the conditional probability p(y|x), where x is the source sentence.
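
Written out, this objective and the factorization used by the basic RNN encoder-decoder in the paper's background section look roughly as follows. Here $c$ is the single fixed-length vector summarizing the source and $T$ is the target length; the hat on $\hat{\mathbf{y}}$ is notation added for these notes.

$
\begin{gathered}
\hat{\mathbf{y}}=\arg \max _{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) \\
p(\mathbf{y})=\prod_{t=1}^{T} p\left(y_t \mid\left\{y_1, \ldots, y_{t-1}\right\}, c\right)
\end{gathered}
$

Every target word is conditioned on the same $c$, which is the bottleneck the paper sets out to remove.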

# 3 Learning to Align and Translate

## 3.1 Decoder

At each output step i, the probability of the target word y_i is computed as:

$p\left(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}\right)=g\left(y_{i-1}, s_i, c_i\right)$

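Notice that c_i is a distinct context vector for each target word y_i, rather than one vector shared across all steps. s_i is the hidden state of the decoder RNN at step i.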

> The context vector $c_i$ depends on a sequence of annotations $\left(h_1, \cdots, h_{T_x}\right)$ to which an encoder maps the input sentence. Each annotation $h_i$ contains information about the whole input sequence with a strong focus on the parts surrounding the $i$-th word of the input sequence. We explain in detail how the annotations are computed in Section 3.2.

$
c_i=\sum_{j=1}^{T_x} \alpha_{i j} h_j
$

The context vector c_i is computed as the weighted sum of the annotation vectors h_j.
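
As a minimal sketch of that weighted sum, the following plain-Python snippet computes one context vector. The function name `context_vector`, the toy annotations, and the weights are all made up for illustration; in the model the weights come from the alignment model, not from hand-picked numbers.

```python
def context_vector(weights, annotations):
    """Return c_i = sum_j alpha_ij * h_j for a single output step i."""
    dim = len(annotations[0])
    context = [0.0] * dim
    for alpha, h in zip(weights, annotations):
        for d in range(dim):
            context[d] += alpha * h[d]
    return context


# Three source positions (T_x = 3), each with a 2-dimensional annotation h_j.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical attention weights alpha_ij for one output step; they sum to 1.
weights = [0.5, 0.25, 0.25]

c_i = context_vector(weights, annotations)
print(c_i)  # [0.75, 0.5]
```

Because the weights sum to 1, c_i always lies inside the convex hull of the annotations, so it stays on the same scale as a single h_j.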

The weight alpha_ij of each annotation h_j is a softmax over the alignment scores:

$
\begin{gathered}
\alpha_{i j}=\frac{\exp \left(e_{i j}\right)}{\sum_{k=1}^{T_x} \exp \left(e_{i k}\right)} \\
e_{i j}=a\left(s_{i-1}, h_j\right)
\end{gathered}
$

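Here i is the output position and j is the input position, so e_ij is the energy (alignment score) measuring how well the input around position j matches the output at position i. It is computed from the previous decoder state s_{i-1} and the annotation h_j. The higher the score, the larger the weight alpha_ij given to h_j in the context vector.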
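
The two equations can be sketched in plain Python as below. The alignment model `a` is written here as a single-hidden-layer network, v^T tanh(W s_{i-1} + U h_j), which is the additive form the paper describes; the names `additive_score` and `softmax` and all parameter values are hypothetical toy choices, since the real parameters are learned.

```python
import math


def additive_score(s_prev, h_j, W, U, v):
    """e_ij = v^T tanh(W s_{i-1} + U h_j), a one-hidden-layer alignment model."""
    hidden = []
    for row_w, row_u in zip(W, U):
        pre = sum(w * s for w, s in zip(row_w, s_prev))
        pre += sum(u * h for u, h in zip(row_u, h_j))
        hidden.append(math.tanh(pre))
    return sum(v_k * z for v_k, z in zip(v, hidden))


def softmax(scores):
    """Normalize the scores e_i1..e_iTx into weights alpha_i1..alpha_iTx."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


# Toy, hand-picked parameters (assumption: identity matrices for readability).
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

s_prev = [0.5, -0.5]  # previous decoder state s_{i-1}
annotations = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # h_1..h_3

scores = [additive_score(s_prev, h, W, U, v) for h in annotations]
alphas = softmax(scores)
print(scores)
print(alphas)
```

With these toy values the second annotation receives the highest score and therefore the largest weight. Every operation is differentiable, which is what lets the alignment be trained by backpropagation.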

> the alignment is not considered to be a latent variable. Instead, the alignment model directly computes a soft alignment, which allows the gradient of the cost function to be backpropagated through. This gradient can be used to train the alignment model as well as the whole translation model jointly.

The alignment model a is trained jointly with the rest of the network, so the model learns to align better as it learns to translate.

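The weighted sum can be read as an expected annotation: if alpha_ij is treated as the probability that target word y_i is aligned to source word x_j, then c_i is the expectation of the annotations over all possible alignments. This expected annotation is the context vector.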

> The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.

Notably, the alignment score uses s_{i-1}, the previous hidden state of the RNN decoder, so the decoder itself decides which source annotations to retrieve at each step. This relieves the information bottleneck in the encoder.

## 3.2 Encoder: Bidirectional RNN for Annotating Sequences

How do we compute the annotations h_j?
They propose using a bidirectional RNN to encode the input sequence, so each annotation summarizes the preceding words as well as the following words. This matters because the meaning of a word can depend on the words both before and after it.

The forward and backward hidden states at each position j are concatenated to form the final annotation vector h_j.
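
A rough sketch of this concatenation in plain Python follows. It uses a deliberately simplified element-wise tanh RNN with scalar weights, which is an assumption made for brevity and not the gated recurrent unit used in the paper; `rnn_pass`, `bidirectional_annotations`, and the parameter values are hypothetical.

```python
import math


def rnn_pass(inputs, w_in, w_rec):
    """Run a toy element-wise tanh RNN and return the hidden state at every step."""
    state = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        state = [math.tanh(w_in * x_d + w_rec * s_d) for x_d, s_d in zip(x, state)]
        states.append(state)
    return states


def bidirectional_annotations(inputs, fwd_params, bwd_params):
    """Concatenate forward and backward states into one annotation per position."""
    forward = rnn_pass(inputs, *fwd_params)
    # Read the sentence right to left, then flip back so index j matches word j.
    backward = rnn_pass(inputs[::-1], *bwd_params)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Three source "words", each a 1-dimensional toy embedding.
inputs = [[1.0], [2.0], [3.0]]
annotations = bidirectional_annotations(inputs, (0.5, 0.5), (0.5, 0.5))

for j, h_j in enumerate(annotations, start=1):
    print(j, h_j)
```

The forward half of h_j has only seen words 1 through j and the backward half only words j through T_x, so their concatenation covers the whole sentence while staying focused on the neighbourhood of word j.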


