# Attention

## Introduction

RNNs (more precisely, LSTMs and GRUs) have proven very effective at solving **complex sequence-related problems** when given a large amount of data. They have real-world applications in speech recognition, NLP, time series forecasting, etc.

Sequence to sequence (seq2seq) models are a special class of RNN architectures typically used for (but not restricted to) complex language-related problems such as machine translation, question answering, chatbots, and text summarization.

![IMG_1843.JPG](attachment:7efe3a91-7257-45a3-a9cc-d50461416ffa.JPG)

This post gives a detailed explanation of how seq2seq models are built and an intuitive understanding of how they solve these complex tasks.

## seq2seq

The basic form of neural machine translation (NMT) uses a seq2seq model without attention. The encoder-decoder architecture with RNNs became an effective approach for NMT. Its key benefits are the ability to train a single end-to-end model directly on source and target sentences, and the ability to handle variable-length input and output sequences of text.

Below is an illustration of NMT with an RNN based encoder-decoder architecture.

![IMG_1838.PNG](attachment:f275cf2d-d23d-4b69-b700-27770a90accf.PNG)

Both encoder and decoder are typically LSTM models (or sometimes GRU models).

The seq2seq model normally has an encoder-decoder architecture, composed of:

### Encoder (eg: Encoder LSTM)

The encoder reads the input sequence and summarizes the information in what are called the internal state vectors (for an LSTM, the hidden state and cell state vectors). We discard the outputs of the encoder and keep only the internal states.

- An **encoder** processes the input sequence and compresses the information into a context vector (also known as sentence embedding or “thought” vector) of a fixed length. This representation is expected to be a good summary of the meaning of the whole source sequence.

![IMG_60C6749C0376-1.jpeg](attachment:856ec86e-6d77-4f2b-bb47-d891ebec16e6.jpeg)

- $X_i$ => Input sequence at time step $i$
- $h_i$ and $c_i$ => The LSTM maintains two states ('h' for hidden state and 'c' for cell state) at each time step. Together these form the internal state of the LSTM at time step $i$
- $Y_i$ => Output sequence at time step $i$

Input sentence(English)=>"Rahul is a good boy"

Output sentence(Marathi)=>"राहुल चांगला मुलगा आहे"

#### Explanation for $X_i$

The above English sentence can be thought of as a sequence of 5 words. $X_1$='Rahul', $X_2$='is',$X_3$='a', $X_4$='good', $X_5$='boy'

![1.jpeg](attachment:45e68cb8-6cd0-4b9d-b5c7-d3fcff48fcc6.jpeg)

How to represent each $X_i$ as a vector? We will use the Embedding layer to map each word into a fixed length vector.
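
A minimal sketch of the lookup that an embedding layer performs, assuming a toy vocabulary built from the example sentence and a 4-dimensional embedding (both are illustrative choices; a real model learns the table values during training instead of drawing them at random):

```python
import random

random.seed(0)

sentence = ["Rahul", "is", "a", "good", "boy"]

# Hypothetical toy vocabulary: each distinct word gets an integer index.
word_to_index = {word: idx for idx, word in enumerate(sentence)}

embedding_dim = 4  # assumed size, chosen only for illustration

# The embedding table has one row per vocabulary entry. A real model learns
# these values; here they are random placeholders.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(word_to_index))
]

def embed(word):
    """Return the fixed-length vector stored for `word`."""
    return embedding_table[word_to_index[word]]

vectors = [embed(word) for word in sentence]  # X_1 ... X_5 as vectors
```

Every word maps to a vector of the same length, regardless of how long the word itself is.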

#### Explanation for $h_i$ and $c_i$

What is the role of the internal states ($h_i$ and $c_i$) at each time step?

In very simple terms, they remember what the LSTM has read (learned) so far. For example:

$h_3,c_3$ => These two vectors remember that the network has read 'Rahul is a' so far. In other words, they store a summary of the information up to time step 3 (and are thus called the states at time step 3).

Similarly, $h_5, c_5$ contain the summary of the entire input sentence, since the sentence ends at time step 5. The states coming out of the last time step are also called the "thought vectors", as they summarize the entire sequence in vector form.

**Note**: The size of both of these vectors is equal to number of units (neurons) used in the LSTM cell.

#### Explanation for $Y_i$

$Y_i$ at each time step is the output(prediction) of the LSTM model. But what type of a vector is $Y_i$? More specifically in case of word level language models each $Y_i$ is actually a probability distribution over the entire vocabulary which is generated by using a softmax activation. Thus each $Y_i$ is a vector of size "vocab_size" representing a probability distribution.
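
As a rough illustration of what one $Y_i$ looks like, the sketch below applies a softmax to made-up raw scores over a hypothetical 6-token target vocabulary and picks the most probable word. The scores and the vocabulary are assumptions for the example, not values from a trained model.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical target vocabulary and made-up raw scores (logits).
vocab = ["START_", "राहुल", "चांगला", "मुलगा", "आहे", "_END"]
logits = [0.1, 2.5, 0.3, 0.2, 0.0, -1.0]

y = softmax(logits)                      # one Y_i: a vector of size vocab_size
predicted_word = vocab[y.index(max(y))]  # the most probable word
```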

Depending on the context of the problem they might sometimes be used or sometimes be discarded.

In our case we have nothing to output until we have read the entire English sentence, because we only start generating the output sequence once the whole input has been read. Thus we discard the $Y_i$ of the encoder for our problem.

### Decoder (Decoder LSTM) -- Training Mode

Unlike the Encoder LSTM which has the same role to play in both the training phase as well as in the inference phase, the Decoder LSTM has a slightly different role to play in both of these phases. In this section we'll try to understand how to configure the Decoder during the training phase.

Output sequence=> "START_राहुल चांगला मुलगा आहे_END"

The decoder behaves a bit differently during training and inference. It takes the final hidden state of the encoder as the initial state of its first RNN cell, along with the \<start\> token as the initial input, and produces an output sequence. During training we use teacher forcing: at each time step the decoder is fed the ground-truth token from the previous time step rather than its own prediction, which makes training faster and more efficient. During inference, the input to the decoder at each time step is its own output from the previous time step.

![2.jpg](attachment:6d41280d-0be8-4f4f-80c5-8b1d40e0197c.jpg)

- A **decoder** is initialized with the context vector to emit the transformed output. The early work only used the last state of the encoder network as the decoder initial state.

The most important point is that the initial states $(h_0, c_0)$ of the decoder are set to the final states of the encoder. Intuitively, this means the decoder is trained to generate the output sequence based on the information encoded by the encoder. **Obviously the translated Marathi sentence must depend on the given English sentence.**

In the first time step we provide the START_ token so that the decoder starts generating the next token (the actual first word of the Marathi sentence). **After the last word in the Marathi sentence, we make the decoder learn to predict the \_END token. This will be used as the stopping condition during inference: it denotes the end of the translated sentence, and we stop the inference loop when it appears (more on this later).**
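
The way the ground-truth sentence is split into decoder inputs and decoder targets under teacher forcing can be sketched as follows. The token list is the example sentence with its start and end markers; the variable names are illustrative.

```python
target = ["START_", "राहुल", "चांगला", "मुलगा", "आहे", "_END"]

# Teacher forcing: the decoder input at step t is the ground-truth token
# from step t-1, and the label at step t is the ground-truth token at step t.
decoder_inputs = target[:-1]   # START_ ... आहे
decoder_targets = target[1:]   # राहुल ... _END

pairs = list(zip(decoder_inputs, decoder_targets))
for step, (fed_in, expected) in enumerate(pairs, start=1):
    print(f"t={step}: input={fed_in!r} -> target={expected!r}")
```

The two lists have the same length and are offset by one position, so the last target the decoder learns to predict is the end marker.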

Finally, the loss is calculated on the predicted outputs from each time step, and the errors are backpropagated through time to update the parameters of the network. Training the network for longer with a sufficiently large amount of data results in pretty good predictions, as we'll see later.
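
A small sketch of a per-time-step loss, assuming cross-entropy averaged over time steps (a common choice; the text above does not say which loss or reduction is used). The predicted distributions are made up.

```python
import math

def sequence_cross_entropy(step_distributions, target_indices):
    """Average negative log-likelihood of the correct token at each time step."""
    losses = [
        -math.log(dist[target])
        for dist, target in zip(step_distributions, target_indices)
    ]
    return sum(losses) / len(losses)

# Made-up decoder outputs over a 3-token vocabulary for 2 time steps.
predictions = [
    [0.7, 0.2, 0.1],  # step 1: correct token is index 0
    [0.1, 0.1, 0.8],  # step 2: correct token is index 2
]
targets = [0, 2]

loss = sequence_cross_entropy(predictions, targets)
```

The loss shrinks toward zero as the probability assigned to the correct token at each step approaches 1.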

The entire training process (Encoder + Decoder) can be summarized in the below diagram:

![IMG_1848.JPG](attachment:eee8104b-c473-4936-b59f-65c342a01022.JPG)

Both the encoder and decoder are recurrent neural networks, i.e. using LSTM or GRU units.

### Decoder (Decoder LSTM) -- Inference Mode

Let's now try to understand the setup required for inference. As already stated the Encoder LSTM plays the same role of reading the input sequence (English sentence) and generating the thought vector $(h_k, c_k)$.

However, the decoder now has to predict the entire output sequence (Marathi sentence) given these thought vectors. Let's try to understand this visually using the same example.

Input sequence -> "Rahul is a good boy"

(Expected) Output sequence -> "राहुल चांगला मुलगा आहे"

**Step1**: Encode the input sequence into the Thought Vectors:

![Screen Shot 2021-11-03 at 11.59.43 AM.png](attachment:9f66ee95-4e87-406f-af8c-ede279f8a9ec.png)

**Step2**: Start generating the output sequence in a loop, word by word:

At t=1

![Screen Shot 2021-11-03 at 11.49.50 AM.png](attachment:7f649e65-6ba5-41f1-afd3-87ae28876f07.png)

At t=3

![Screen Shot 2021-11-03 at 11.54.39 AM.png](attachment:bc169dd6-7c52-4cfa-9c19-4032f6cb5dc2.png)

At t=5

![Screen Shot 2021-11-03 at 11.56.58 AM.png](attachment:89a33f1a-d57b-4aea-875c-b8b58094ad66.png)

### Inference Algorithm:

- During inference, we generate one word at a time. Thus the Decoder LSTM is called in a loop, every time processing only one time step.    

- The initial states of the decoder are set to the final states of the encoder.    

- The initial input to the decoder is always the START_ token.    

- At each time step, we preserve the states of the decoder and set them as initial states for the next time step.

- At each time step, the predicted output is fed as input in the next time step.

- We break the loop when the decoder predicts the \_END token.
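
The steps above can be sketched as a greedy decoding loop. The `toy_decoder_step` function is a stand-in for a trained decoder LSTM (it is a fixed lookup table, not a model), and the `max_len` guard is an added assumption so the loop always terminates even if the end token is never predicted.

```python
def greedy_decode(decoder_step, initial_state, start_token="START_",
                  end_token="_END", max_len=20):
    """Generate tokens one at a time until end_token or max_len is reached."""
    token, state = start_token, initial_state
    output = []
    for _ in range(max_len):
        # One decoder time step: previous token + previous state in,
        # predicted token + updated state out.
        token, state = decoder_step(token, state)
        if token == end_token:
            break
        output.append(token)
    return output

# Stand-in for a trained decoder: a fixed lookup table keyed by previous token.
# The state is just a step counter here; a real LSTM would carry (h, c).
NEXT_TOKEN = {"START_": "राहुल", "राहुल": "चांगला", "चांगला": "मुलगा",
              "मुलगा": "आहे", "आहे": "_END"}

def toy_decoder_step(token, state):
    return NEXT_TOKEN[token], state + 1

# In a real model, initial_state would be the encoder's final (h, c).
translation = greedy_decode(toy_decoder_step, initial_state=0)
```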

The entire inference procedure can be summarized in the below diagram:

![Screen Shot 2021-11-03 at 2.46.00 PM.png](attachment:71a05e20-2105-4121-9954-8cdc2688bcb7.png)

### What's wrong with seq2seq model?

The information from the encoder will become less and less relevant with every time step as the encoder's state is only passed to the first RNN cell of the decoder. **This makes it difficult for the model to deal with long sequences. So, the apparent flaw in this system is its inability to remember longer sequences.** The output sequence relies heavily on the context defined by the hidden state in the final output of the encoder, making it challenging for the model to deal with long sentences.

### Attention Mechanism

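The attention mechanism was introduced to solve exactly this problem. It lets the translation model use the original source sentence directly, reducing information loss. When the decoder generates the output y at each time step, it draws on all of $x_1,x_2,...$ rather than only the hidden state at the last time step. Attention also lets the translator zoom in or out (use local or global information).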

The illustration below shows how the prediction of a word depends on two or more words in the sentence. In the following GIF, thin links contribute less to the prediction of a word, while thick links contribute more. We can observe that most of the predicted words in the target sequence depend on the words before and after the corresponding word in the source sequence.

![IMG_1839.GIF](attachment:60e06450-40c9-4806-9341-ad289b0c14fb.GIF)

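In the basic encoder-decoder model, the attention mechanism adds a context vector between the encoder and the decoder. As shown in the figure above, the encoder is in blue on the left and the decoder is in red. For every y the decoder generates, a context vector is produced. This context vector is a weighted sum of the information from each input word, where the weight vector is the attention vector: it represents how important each input word is when generating y at that moment. Finally, the context vector is combined with the information of the current y to form the output.

Building the context vector is straightforward. For a given target word, we first compare the target state with every encoder state, which gives a score for each encoder state. We then normalize these scores with a softmax, which gives a probability distribution conditioned on the target state. Finally, we take the weighted sum of the source states to get the context vector, and combine the context vector with the target state to produce the final output.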
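
A minimal sketch of these three steps (score, normalize, weighted sum). It assumes a plain dot product as the comparison function, which is only one possible choice, and uses made-up 2-dimensional states.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(target_state, encoder_states):
    """Return (attention weights, context vector) for one target state."""
    # 1. Compare the target state with every encoder state.
    scores = [dot(target_state, h) for h in encoder_states]
    # 2. Normalize the scores into a distribution over source positions.
    weights = softmax(scores)
    # 3. The weighted sum of the encoder states gives the context vector.
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Made-up 2-dimensional states for a 3-word source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_state = [2.0, 0.0]

weights, context = attend(target_state, encoder_states)
```

Here the first and third encoder states score equally against the target state, so they receive equal weights, and the second state receives the smallest weight.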

![Screen Shot 2021-10-26 at 8.58.14 PM.png](attachment:fb3e7eb7-8a61-43f8-8918-ce820774c85d.png)

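*Attention-Based Models for Speech Recognition*, 2015

Attention is used to associate each phoneme in the output sequence with specific audio frames in the input sequence. The paper proposes a novel end-to-end trainable speech recognition architecture based on a hybrid attention mechanism, which combines content and location information to select the next position in the input sequence for decoding. A desirable property of the proposed model is that it can recognize sentences longer than those in the training set.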

In a nutshell, attention in deep learning can be broadly interpreted as a vector of importance weights: in order to predict or infer one element, such as a pixel in an image or a word in a sentence, we estimate using the attention vector how strongly it is correlated with (or “attends to” as you may have read in many papers) other elements and take the sum of their values weighted by the attention vector as the approximation of the target.

### Born for Translation

The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT). Rather than building a single context vector out of the encoder’s last hidden state, the secret sauce invented by attention is to create shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element.

While the context vector has access to the entire input sequence, we don’t need to worry about forgetting. The alignment between the source and target is learned and controlled by the context vector. Essentially the context vector consumes three pieces of information:

- encoder hidden states

- decoder hidden states

- alignment between source and target

### Definition

Now let’s define the attention mechanism introduced in NMT in a scientific way. Say, we have a source sequence $\mathbf{x}$ of length n and try to output a target sequence $\mathbf{y}$ of length m:

$$\mathbf{x} = [x_1,x_2,...,x_n]$$

$$\mathbf{y} = [y_1,y_2,...,y_m]$$

(Variables in bold indicate that they are vectors; same for everything else in this post.)

The encoder is a bidirectional RNN (or other recurrent network setting of your choice) with a forward hidden state $\overrightarrow {h_i}$ and a backward one $\overleftarrow {h_i}$. A simple concatenation of two represents the encoder state. The motivation is to include both the preceding and following words in the annotation of one word.

$$h_i=[\overrightarrow {h_i}^T;\overleftarrow {h_i}^T]^T,i=1,...,n$$
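
To make the concatenation concrete, here is a toy sketch that uses a scalar tanh recurrence in place of a real RNN cell. The recurrence, weights, and inputs are all made up for illustration; only the forward/backward/concatenate structure mirrors the definition above.

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Toy scalar RNN: h_i = tanh(w_x * x_i + w_h * h_{i-1}), starting from h = 0."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

x = [0.5, -1.0, 2.0]  # made-up scalar inputs x_1..x_n

forward = run_rnn(x, w_x=0.8, w_h=0.5)               # reads x_1 -> x_n
backward = run_rnn(x[::-1], w_x=0.3, w_h=0.9)[::-1]  # reads x_n -> x_1, then realigned

# Annotation h_i concatenates the forward and backward state at position i,
# so it summarizes both the preceding and the following inputs.
annotations = [[f, b] for f, b in zip(forward, backward)]
```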

**Decoder**

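The conditional probability of each predicted output word $y_i$ is defined as $p(y_i|\{y_1,...,y_{i-1}\},x)=g(y_{i-1},s_i,c_i)$, where $g$ is a nonlinear function, $y_{i-1}$ is the previous word, $s_i$ is the current hidden state of the RNN, and $c_i$ is the current context vector. So how is $c_i$ computed? The two most commonly used methods are Bahdanau attention and Luong attention.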

## Bahdanau Attention

https://arxiv.org/pdf/1409.0473.pdf

![IMG_1841.PNG](attachment:96b68b9d-4580-4490-9f0e-8bfee7049f5c.PNG)

Bahdanau Attention is also known as Additive attention as it performs a linear combination of encoder states and the decoder states. 

The entire step-by-step process of applying Attention in Bahdanau’s paper is as follows:

### 1. Generating the Encoder Hidden States

We can use any variant of RNN, such as LSTM or GRU, to encode the input sequence. Each cell produces a hidden state for each input passed to it. Now, unlike in the plain seq2seq model, we pass all the hidden states produced by the RNN cells on to the next steps.

In [3]:
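import torch
import torch.nn as nn
import torch.nn.functional as F

# Use the GPU when available; `device` is needed by init_hidden below
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")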

class EncoderLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers=1, drop_prob=0):
        super(EncoderLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, n_layers, dropout=drop_prob, batch_first=True)
        
    def forward(self, inputs, hidden):
        # Embed input words
        embedded = self.embedding(inputs)
        # Pass the embedded word vectors into LSTM and return all outputs
        output, hidden = self.lstm(embedded, hidden)
        return output, hidden
    
    def init_hidden(self, batch_size=1):
        return (torch.zeros(self.n_layers, batch_size, self.hidden_size, device=device),
                torch.zeros(self.n_layers, batch_size, self.hidden_size, device=device))

### 2. Calculating Alignment Scores

After obtaining all of our encoder outputs, we can start **using the decoder to produce outputs**. At each time step of the decoder, we have to **calculate the alignment score of each encoder output with respect to the decoder input and hidden state at that time step**. The alignment score is the essence of the attention mechanism, as it quantifies the amount of "attention" the decoder will place on each of the encoder outputs when producing the next output.

Fig. The encoder-decoder model with additive attention mechanism in Bahdanau et al., 2015.

![encoder-decoder-attention.png](attachment:8eb3bd2e-e670-4686-81a1-b806cd174761.png)

The alignment model assigns a score $\alpha_{t,i}$ to the pair of input at position i and output at position t, $(y_t,x_i)$, based on how well they match. The set of $\{ \alpha_{t,i}\}$ are weights defining how much of each source hidden state should be considered for each output. In Bahdanau’s paper, the alignment score $\alpha$ is parametrized by a feed-forward network with a single hidden layer and this network is jointly trained with other parts of the model. The score function is therefore in the following form, given that tanh is used as the non-linear activation function:

$$\text{score}(s_t,h_i)=v_a^T\tanh(W_a[s_t;h_i])$$

$$\text{score}_{alignment}=W_{combined}\cdot \tanh(W_{decoder} \cdot H_{decoder} + W_{encoder}\cdot H_{encoder})$$

where both $v_a$ and $W_a$ are weight matrices to be learned in the alignment model.
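
A plain-Python sketch of the additive score, written in the split form $v^T\tanh(W_{decoder}\,s + W_{encoder}\,h)$. This equals $v_a^T\tanh(W_a[s;h])$ when $W_a$ is the two matrices placed side by side. All weights and states below are made-up 2-dimensional values; in the real model they are learned.

```python
import math

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_score(s, h, W_decoder, W_encoder, v):
    """score(s, h) = v^T tanh(W_decoder s + W_encoder h)."""
    projected = [a + b for a, b in zip(matvec(W_decoder, s), matvec(W_encoder, h))]
    hidden = [math.tanh(z) for z in projected]
    return sum(vi * zi for vi, zi in zip(v, hidden))

# Made-up weights and states, for illustration only.
W_decoder = [[1.0, 0.0], [0.0, 1.0]]
W_encoder = [[0.5, 0.0], [0.0, 0.5]]
v = [1.0, -1.0]

decoder_state = [0.2, 0.4]
encoder_states = [[1.0, 0.0], [0.0, 1.0]]

# One alignment score per encoder hidden state.
scores = [additive_score(decoder_state, h, W_decoder, W_encoder, v)
          for h in encoder_states]
```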

![1611916415151-fea70838-6001-45ae-8dba-c4d274a22ea3.jpeg](attachment:4df9fd11-c6ba-40ca-9d54-e3f3bae4124e.jpeg)

![1611916961914-a6ff4155-9b8b-4172-9ec5-4d6b4e99411f.jpeg](attachment:ea69e27e-1683-4f95-ba48-b177354e0428.jpeg)&emsp; &emsp; &emsp; &emsp;![1611917143821-6956eb0d-6204-4742-9735-012479f9dce0.jpeg](attachment:90545a02-42a9-4be2-8f2f-7c8592bf9d17.jpeg)

*Note: As there is no previous hidden state or output for the first decoder step, the last encoder hidden state and a Start Of String (\<SOS\>) token can be used to replace these two, respectively.*

In [4]:
class BahdanauDecoder(nn.Module):
    def __init__(self, hidden_size, output_size, n_layers=1, drop_prob=0.1):
        super(BahdanauDecoder, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.drop_prob = drop_prob
        
        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        
        self.fc_hidden = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.fc_encoder = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        # torch.FloatTensor(1, hidden_size) would leave the values uninitialized
        self.weight = nn.Parameter(torch.randn(1, hidden_size))
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.drop_prob)
        self.lstm = nn.LSTM(self.hidden_size * 2, self.hidden_size, batch_first=True)
        self.classifier = nn.Linear(self.hidden_size, self.output_size)
        
    def forward(self, inputs, hidden, encoder_outputs):
        # Drop only the batch dimension (batch size 1) so a length-1 sequence keeps its shape
        encoder_outputs = encoder_outputs.squeeze(0)
        # Embed input words
        embedded = self.embedding(inputs).view(1, -1)
        embedded = self.dropout(embedded)
        
        # Calculating Alignment Score
        x = torch.tanh(self.fc_hidden(hidden[0]) + self.fc_encoder(encoder_outputs))
        alignment_score = x.bmm(self.weight.unsqueeze(2))
        
        # Softmaxing alignment scores to get Attention weights
        attn_weights = F.softmax(alignment_score.view(1, -1), dim=1)
        
        # Multiplying the Attention weights with encoder outputs to get the context vector
        context_vector = torch.bmm(attn_weights.unsqueeze(0),
                                   encoder_outputs.unsqueeze(0))
        
        # Concatenating context vector with embedded input word
        output = torch.cat((embedded, context_vector[0]), 1).unsqueeze(0)
        # Passing the concatenated vector as input to LSTM cell
        output, hidden = self.lstm(output, hidden)
        # Passing the LSTM output through a Linear layer acting as a classifier
        output = F.log_softmax(self.classifier(output[0]), dim=1)
        return output, hidden, attn_weights

### 3. Softmaxing the Alignment Scores

After generating the alignment scores vector in the previous step, we can then apply a softmax on this vector to obtain the attention weights. The softmax function will cause the values in the vector to sum up to 1 and each individual value will lie between 0 and 1, therefore representing the weightage each input holds at that time step.

![1611917843363-acec1119-3f9e-4c76-9df2-a38d9daff8de.jpeg](attachment:22bdf470-1c38-4f4e-924a-609d1bd6a808.jpeg)

$$\alpha_{t,i}=\text{align}(y_t,x_i)=\frac{\exp(\text{score}(s_{t-1},h_i))}{\sum_{i'=1}^{n} \exp(\text{score}(s_{t-1},h_{i'}))}$$

The weight $\alpha_{t,i}$ measures how well $y_t$ and $x_i$ are aligned.
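
As a worked example with made-up numbers, suppose the alignment scores for a three-word source sentence at step $t$ are $(2, 1, 0)$. The softmax then gives, to three decimals:

$$\alpha_{t,1}=\frac{e^{2}}{e^{2}+e^{1}+e^{0}}\approx 0.665,\quad \alpha_{t,2}=\frac{e^{1}}{e^{2}+e^{1}+e^{0}}\approx 0.245,\quad \alpha_{t,3}=\frac{e^{0}}{e^{2}+e^{1}+e^{0}}\approx 0.090$$

The weights sum to 1, and the source position with the largest score receives the largest share of the attention.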

### 4. Calculating the Context Vector

Now we can generate the context vector by multiplying each encoder output by its attention weight and summing the results. The context vector depends on the annotations $(h_1,...,h_n)$ that the encoder maps the input sequence to, and is their weighted sum:

$$c_t=\sum_{i=1}^{n}\alpha_{t,i}h_i$$

where $c_t$ is the context vector for output $y_t$.
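
As a worked example with made-up numbers, take three 2-dimensional encoder outputs $h_1=[1,0]^T$, $h_2=[0,1]^T$, $h_3=[1,1]^T$ and attention weights $(0.7, 0.2, 0.1)$:

$$c_t = 0.7\begin{bmatrix}1\\0\end{bmatrix}+0.2\begin{bmatrix}0\\1\end{bmatrix}+0.1\begin{bmatrix}1\\1\end{bmatrix}=\begin{bmatrix}0.8\\0.3\end{bmatrix}$$

The result has the same dimension as a single encoder output and is dominated by $h_1$, the output with the largest weight.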

![1611918081370-1150d25a-4885-402d-8828-a313364e69f9.jpeg](attachment:c03d7012-75cd-4355-a0a2-f2c309adc9f1.jpeg)

### 5. Decoding the Output

The context vector we produced will then be concatenated with the previous decoder output. It is then fed into the decoder RNN cell to produce a new hidden state and the process repeats itself from step 2. The final output for the time step is obtained by passing the new hidden state through a Linear layer, which acts as a classifier to give the probability scores of the next predicted word.

![1611918239564-93d8c2f1-cbdd-4a4c-9f9f-6d932f9cc539.jpeg](attachment:5ba399d2-e8b3-4379-8e5a-3088951fd840.jpeg)

### 6. Attention Plot

![IMG_1842.PNG](attachment:99fa71ec-2f0e-4752-9c94-1a92b0275161.PNG)

The yellow and green shades suggest higher attention weights to the corresponding words in the source sequence in the prediction of the word of the target sequence.

### 7. BLEU Score

BLEU is a metric for evaluating a generated sentence against a reference sentence. A perfect match results in a score of 1.0, whereas a complete mismatch results in a score of 0.0.
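
A simplified, unsmoothed sentence-level BLEU can be sketched in a few lines. This is an illustrative version that assumes a single reference and n-grams up to length 4; it is not a replacement for an established implementation, which would also handle multiple references and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precision_sum += math.log(matches / total)
    # The brevity penalty discourages candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_precision_sum / max_n)

reference = "the cat is on the mat".split()
perfect = simple_bleu(reference, reference)
mismatch = simple_bleu("a dog ran in my yard".split(), reference)
partial = simple_bleu("the cat is on a mat".split(), reference)
```

An identical sentence scores 1.0, a sentence with no shared words scores 0.0, and a near match falls in between.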

## Luong Attention

![1611918589189-576fceff-f513-4200-a0d5-6cb8ab1016d1.jpeg](attachment:6822988e-2431-40eb-b592-f32649a998e8.jpeg)

### 1. Producing the Encoder Hidden State

Same as Bahdanau Attention.

### 2. Decoder RNN

Unlike in Bahdanau Attention, the decoder in Luong Attention uses the RNN in the first step of the decoding process rather than the last. The RNN will take the hidden state produced in the previous time step and the word embedding of the final output from the previous time step to produce a new hidden state which will be used in the subsequent steps.

In [None]:
class LuongDecoder(nn.Module):
    def __init__(self, hidden_size, output_size, attention, n_layers=1, drop_prob=0.1):
        super(LuongDecoder, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.drop_prob = drop_prob
        
        # The Attention Mechanism is defined in a separate class
        self.attention = attention
        
        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.dropout = nn.Dropout(self.drop_prob)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size)
        self.classifier = nn.Linear(self.hidden_size * 2, self.output_size)
        
    def forward(self, inputs, hidden, encoder_outputs):
        # Embed input words
        embedded = self.embedding(inputs).view(1, 1, -1)
        embedded = self.dropout(embedded)
        
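        # Passing previous output word(embedded) and hidden state into LSTM Cell
        lstm_out, hidden = self.lstm(embedded, hidden)

        # The remaining steps are a sketch: they assume self.attention(lstm_out,
        # encoder_outputs) returns one score per encoder output and that
        # encoder_outputs has shape (1, seq_len, hidden_size)
        alignment_scores = self.attention(lstm_out, encoder_outputs)
        # Softmaxing alignment scores to get Attention weights
        attn_weights = F.softmax(alignment_scores.view(1, -1), dim=1)
        # Multiplying the Attention weights with encoder outputs to get the context vector
        context_vector = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs)
        # Concatenating LSTM output with context vector, then classifying the next word
        output = torch.cat((lstm_out, context_vector), -1)
        output = F.log_softmax(self.classifier(output[0]), dim=1)
        return output, hidden, attn_weights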

Luong attention builds on Bahdanau attention and comes in two variants, depending on whether all encoder outputs are used: global attention and local attention.

Local attention is computationally less expensive than the global model or soft attention; at the same time, unlike hard attention, it is differentiable almost everywhere, making it easier to implement and train. Luong et al. also examine various alignment functions for their attention-based models.

Luong's encoder-decoder model uses a two-layer unidirectional recurrent neural network with LSTM gating units, as shown in the figure below.

![Screen Shot 2021-10-27 at 8.50.52 PM.png](attachment:4376855b-ff47-40ae-99e1-f20e8c02d018.png)

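Looking at the figure above, we can see that the encoder's final hidden state $h_t$ is passed to the decoder without any transformation, so the encoder and the decoder must have the same number of hidden units (call it n for now). This is one of the differences between the attention mechanisms proposed by Bahdanau and Luong.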

