## Attention

### What's wrong with the seq2seq model?

The seq2seq model was born in the field of language modeling (Sutskever, et al. 2014). Broadly speaking, it aims to transform an input sequence (source) to a new one (target) and both sequences can be of arbitrary lengths. Examples of transformation tasks include machine translation between multiple languages in either text or audio, question-answer dialog generation, or even parsing sentences into grammar trees.

The seq2seq model normally has an encoder-decoder architecture, composed of:

- An **encoder** processes the input sequence and compresses the information into a context vector (also known as sentence embedding or “thought” vector) of a fixed length. This representation is expected to be a good summary of the meaning of the whole source sequence.

- A **decoder** is initialized with the context vector to emit the transformed output. The early work only used the last state of the encoder network as the decoder initial state.

Both the encoder and the decoder are recurrent neural networks, e.g. built from LSTM or GRU units.

![encoder-decoder-example.png](attachment:857cec9e-7496-478d-8326-0cfe49f74e2b.png)

![20200613115848985.png](attachment:941345bd-7f57-412d-9d20-31e8414f3595.png)

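**This approach has a drawback: translation relies too heavily on the single fixed-length vector into which the whole sentence is compressed. An input sentence may contain hundreds of words, so information is inevitably lost and the translation becomes less accurate.**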
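The bottleneck can be seen in a minimal NumPy sketch. It assumes a vanilla tanh RNN encoder with random, untrained weights; the names `encode`, `W_xh`, and `W_hh` are illustrative and not taken from any paper or library. Whatever the source length, the decoder receives a vector of the same size.

```python
import numpy as np

def encode(source, W_xh, W_hh):
    """Run a vanilla tanh RNN over `source` and return only the last hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x in source:  # one step per source token embedding
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h  # fixed-length context vector

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 16  # hypothetical sizes
W_xh = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

short_sentence = rng.normal(size=(5, embed_dim))    # 5 token embeddings
long_sentence = rng.normal(size=(200, embed_dim))   # 200 token embeddings

context_short = encode(short_sentence, W_xh, W_hh)
context_long = encode(long_sentence, W_xh, W_hh)
print(context_short.shape, context_long.shape)  # (16,) (16,)
```

Both sentences are squeezed into 16 numbers, so the longer one has far less capacity per token.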

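The attention mechanism was introduced to solve this problem. It lets the translation model use information from the original sentence directly, which reduces information loss. When the decoder generates $y$ at each time step, it uses all of $x_1, x_2, \ldots$ rather than only the hidden state of the last time step. Attention also lets the translator zoom in or out (use local or global information).

The attention mechanism is implemented as shown in the figure below: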

![20190215104836321.png](attachment:e192e8f8-7a26-4db5-bcd7-6ee60980809a.png)

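In the basic encoder-decoder model, the attention mechanism adds a context vector between the encoder and the decoder. In the figure above, the blue part on the left is the encoder and the red part is the decoder. A context vector is generated for every $y$ the decoder produces. It is a weighted sum of the information from each input word, where the weight vector is the attention vector: it represents how important each input word is when $y$ is generated at that step. Finally, the context vector is fused with the information about the current $y$ to form the output.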

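Building the context vector is straightforward. First, for a given target word, we compare its target state with every encoder state, which yields one score per state. Next, we normalize these scores with a softmax, which gives a probability distribution conditioned on the target state. Finally, we take the weighted sum of the source states to obtain the context vector, and fuse the context vector with the target state to produce the final output.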
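These three steps can be sketched in NumPy. The text does not fix how states are compared or fused, so this sketch assumes a dot product as the comparison function and plain concatenation as the fusion step; `attention_context` is a hypothetical helper name.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def attention_context(target_state, source_states):
    """Score, normalize, and take the weighted sum of the source states."""
    scores = source_states @ target_state   # one score per source state (assumed: dot product)
    weights = softmax(scores)               # distribution conditioned on the target state
    context = weights @ source_states       # weighted sum of source states
    return context, weights

rng = np.random.default_rng(0)
source_states = rng.normal(size=(6, 4))  # 6 encoder states of dimension 4
target_state = rng.normal(size=4)        # current decoder state

context, weights = attention_context(target_state, source_states)
fused = np.concatenate([context, target_state])  # assumed fusion: concatenation
print(weights.round(3), context.shape, fused.shape)
```

In a real model the fused vector would typically pass through a learned layer before predicting the output word.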

![Screen Shot 2021-10-26 at 8.58.14 PM.png](attachment:fb3e7eb7-8a61-43f8-8918-ce820774c85d.png)

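*Attention-Based Models for Speech Recognition*, 2015

Attention is used to associate each phoneme in the output sequence with specific audio frames in the input sequence. "A novel end-to-end trainable speech recognition architecture based on a hybrid attention mechanism, which combines content and location information to help select the next position in the input sequence for decoding. One desirable property of the proposed model is that it can recognize sentences longer than those in the training set."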

In a nutshell, attention in deep learning can be broadly interpreted as a vector of importance weights: in order to predict or infer one element, such as a pixel in an image or a word in a sentence, we estimate using the attention vector how strongly it is correlated with (or “attends to” as you may have read in many papers) other elements and take the sum of their values weighted by the attention vector as the approximation of the target.

### Born for Translation

The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT). Rather than building a single context vector out of the encoder’s last hidden state, the secret sauce invented by attention is to create shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element.

Because the context vector has access to the entire input sequence, we don’t need to worry about forgetting. The alignment between the source and target is learned and controlled by the context vector. Essentially the context vector consumes three pieces of information:

- encoder hidden states

- decoder hidden states

- alignment between source and target

Fig. The encoder-decoder model with additive attention mechanism in Bahdanau et al., 2015.    
![encoder-decoder-attention.png](attachment:e2f62b58-dd63-4b80-898c-d0a3c5a0c653.png)

### Definition

Now let’s define the attention mechanism introduced in NMT more formally. Say we have a source sequence $\mathbf{x}$ of length $n$ and try to output a target sequence $\mathbf{y}$ of length $m$:

$$\mathbf{x} = [x_1,x_2,...,x_n]$$

$$\mathbf{y} = [y_1,y_2,...,y_m]$$

(Variables in bold indicate that they are vectors; same for everything else in this post.)

The encoder is a bidirectional RNN (or another recurrent network setting of your choice) with a forward hidden state $\overrightarrow{h_i}$ and a backward one $\overleftarrow{h_i}$. A simple concatenation of the two represents the encoder state. The motivation is to include both the preceding and following words in the annotation of one word.

$$h_i=[\overrightarrow {h_i}^T;\overleftarrow {h_i}^T]^T,i=1,...,n$$
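As a rough illustration of this concatenation, the sketch below runs one vanilla tanh RNN left to right and another right to left, then joins the two states at each position. The helper names and the random, untrained weights are assumptions made for the example; a real encoder would use trained LSTM or GRU cells.

```python
import numpy as np

def rnn_states(inputs, W_xh, W_hh):
    """Return the hidden state of a vanilla tanh RNN at every position."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_states(inputs, fwd_params, bwd_params):
    """Concatenate forward and backward states so h_i summarizes both contexts."""
    forward = rnn_states(inputs, *fwd_params)
    # Run over the reversed sequence, then flip back so row i aligns with x_i.
    backward = rnn_states(inputs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
n, embed_dim, hidden_dim = 5, 3, 4  # hypothetical sizes
inputs = rng.normal(size=(n, embed_dim))
fwd_params = (rng.normal(size=(hidden_dim, embed_dim)),
              rng.normal(size=(hidden_dim, hidden_dim)))
bwd_params = (rng.normal(size=(hidden_dim, embed_dim)),
              rng.normal(size=(hidden_dim, hidden_dim)))

annotations = bidirectional_states(inputs, fwd_params, bwd_params)
print(annotations.shape)  # (5, 8)
```

Each row of `annotations` plays the role of one $h_i$: its first half has seen $x_1, \ldots, x_i$ and its second half has seen $x_i, \ldots, x_n$.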

**Decoder**

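The conditional probability of each predicted output word $y_i$ is defined as $p(y_i \mid \{y_1,...,y_{i-1}\}, x) = g(y_{i-1}, s_i, c_i)$, where $g$ is a nonlinear function, $y_{i-1}$ is the previous word, $s_i$ is the current hidden state of the RNN, and $c_i$ is the current context vector. So how is $c_i$ computed? The two commonly used methods are BahdanauAttention and LuongAttention.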

### Bahdanau Attention

https://arxiv.org/pdf/1409.0473.pdf

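The first type of attention is commonly referred to as additive attention.

The decoder hidden state and the encoder outputs at all positions are aligned through a linear combination to obtain the context vector, which is used to improve sequence-to-sequence translation models. **In essence: a two-layer fully connected network whose hidden layer uses a tanh activation and whose output layer has dimension 1.**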

The entire step-by-step process of applying Attention in Bahdanau’s paper is as follows:

#### 1. Producing the Encoder Hidden States

#### 2. Calculating Alignment Scores

After obtaining all of our encoder outputs, we can start using the decoder to produce outputs. At each time step of the decoder, we have to calculate the alignment score of each encoder output with respect to the decoder input and hidden state at that time step. The alignment score is the essence of the attention mechanism, as it quantifies the amount of "attention" the decoder will place on each of the encoder outputs when producing the next output.

The alignment model assigns a score to the pair of input at position $i$ and output at position $t$, $(y_t,x_i)$, based on how well they match. Normalizing these scores with a softmax gives the weights $\{ \alpha_{t,i}\}$, which define how much of each source hidden state should be considered for each output. In Bahdanau’s paper, the alignment score is parametrized by a feed-forward network with a single hidden layer, and this network is jointly trained with the other parts of the model. Given that tanh is used as the non-linear activation function, the score function takes the following form:

$$\text{score}(s_t,h_i)=v_a^\top \tanh(W_a[s_t;h_i])$$

where $v_a$ is a weight vector and $W_a$ is a weight matrix, both learned as part of the alignment model.
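A minimal NumPy sketch of this score function follows, with the scores then normalized by a softmax and used to form the context vector. The dimensions, the random untrained parameters, and the function name `additive_scores` are assumptions made for illustration.

```python
import numpy as np

def additive_scores(s_t, H, W_a, v_a):
    """Compute v_a^T tanh(W_a [s_t; h_i]) for every encoder state h_i (rows of H)."""
    n = H.shape[0]
    pairs = np.concatenate([np.tile(s_t, (n, 1)), H], axis=1)  # (n, d_s + d_h)
    hidden = np.tanh(pairs @ W_a.T)                            # hidden layer, (n, d_a)
    return hidden @ v_a                                        # output of size 1 per pair, (n,)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d_s, d_h, d_a = 5, 4, 6, 8           # hypothetical sizes
s_t = rng.normal(size=d_s)              # decoder hidden state
H = rng.normal(size=(n, d_h))           # encoder hidden states h_1..h_n
W_a = rng.normal(size=(d_a, d_s + d_h))
v_a = rng.normal(size=d_a)

scores = additive_scores(s_t, H, W_a, v_a)
alpha = softmax(scores)                 # attention weights over source positions
c_t = alpha @ H                         # context vector for this decoder step
print(alpha.round(3), c_t.shape)
```

Because $W_a[s_t;h_i]$ splits into one linear term in $s_t$ plus one linear term in $h_i$, the two contributions are summed before the tanh, which is where the name "additive" comes from.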

![1611916415151-fea70838-6001-45ae-8dba-c4d274a22ea3.jpeg](attachment:4df9fd11-c6ba-40ca-9d54-e3f3bae4124e.jpeg)

![1611916961914-a6ff4155-9b8b-4172-9ec5-4d6b4e99411f.jpeg](attachment:ea69e27e-1683-4f95-ba48-b177354e0428.jpeg)&emsp; &emsp; &emsp; &emsp;![1611917143821-6956eb0d-6204-4742-9735-012479f9dce0.jpeg](attachment:90545a02-42a9-4be2-8f2f-7c8592bf9d17.jpeg)

*Note: As there is no previous hidden state or output for the first decoder step, the last encoder hidden state and a Start Of String (`<SOS>`) token can be used to replace these two, respectively.*