# Attention

## 1. Attention Motivation

Two different ways of approaching "many-to-one"

<img src='images/last_hidden_state.png' style="width:400px;height:230px;"/>

* We know that an LSTM or GRU can learn "long-term" dependencies. However, if the sentence is too long, the network can still forget words that appeared early in the sequence.
* By taking the last RNN state, we hope the RNN has both found the relevant feature and remembered it all the way to the end, which is unrealistic for long sequences.

<img src='images/hard_max_hidden_state.png' style="width:400px;height:300px;"/>

* Doing a max pool over RNN states is like doing a max pool over CNN features - it is essentially saying "pick the most important feature"
* Hard max takes the max and forgets everything else.
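
The two "many-to-one" strategies above can be sketched in a few lines of NumPy. The `hidden_states` matrix is made-up illustrative data (4 time steps, 3 hidden units), not the output of a trained model.

```python
import numpy as np

# Hypothetical RNN outputs: T_x = 4 time steps, M = 3 hidden units.
hidden_states = np.array([
    [0.1, 0.9, -0.2],
    [0.7, 0.1, 0.3],
    [-0.4, 0.2, 0.8],
    [0.0, 0.3, 0.1],
])

# Option 1: keep only the last hidden state and discard the rest.
last_state = hidden_states[-1]

# Option 2: hard max over time, taken independently for each feature.
max_pooled = hidden_states.max(axis=0)

print(last_state)
print(max_pooled)
```

In both cases most of the `hidden_states` matrix is thrown away, which is the limitation that attention addresses.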

This leads us to a question: 

> Why not take all hidden states from the encoder into account and pay attention to the parts of those hidden states that are most relevant?

Actually research shows that:

> "One important property of human perception is that one does not tend to process a whole scene in its entirety at once. Instead humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene, guiding future eye movements and decision making."
>
> -- Recurrent Models of Visual Attention, 2014


This is what attention tries to do: 

> `Attention` uses softmax to produce a probability distribution over the encoder hidden states that says "how much to pay attention to" each of them.

## 2. Encoder-Decoder with Attention

The `Attention` mechanism is typically applied to the decoder. The encoder does not change: it is composed of an embedding layer and an RNN layer. But instead of a unidirectional RNN, we can use a bidirectional RNN.
* The output shape would be $T_x \times 2M$, where $T_x$ is the length of the sequence (the total number of time steps) and $M$ is the number of hidden units per direction (since we use a bidirectional RNN, the output dimension is $2M$).
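
To make the $T_x \times 2M$ shape concrete, here is a minimal sketch of a bidirectional encoder built from a vanilla tanh RNN in NumPy. The helper names (`simple_rnn`, `bidirectional_rnn`, `init_params`), the random weights, and the sizes are all assumptions for illustration; a real encoder would use a trained LSTM or GRU layer from a deep learning library.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b):
    """Run a vanilla tanh RNN over inputs of shape (T_x, D); return (T_x, M)."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        outputs.append(h)
    return np.stack(outputs)

def bidirectional_rnn(inputs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states; return (T_x, 2M)."""
    forward = simple_rnn(inputs, *fwd_params)
    # Run over the reversed sequence, then flip back so row t matches time step t.
    backward = simple_rnn(inputs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=-1)

rng = np.random.default_rng(0)
T_x, D, M = 5, 4, 3  # time steps, embedding size, hidden units per direction

def init_params():
    # Random, untrained weights: (W_x, W_h, b).
    return rng.normal(size=(M, D)), rng.normal(size=(M, M)), np.zeros(M)

embedded = rng.normal(size=(T_x, D))  # stand-in for the embedding layer output
encoder_states = bidirectional_rnn(embedded, init_params(), init_params())
print(encoder_states.shape)  # (5, 6), i.e. (T_x, 2M)
```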


**regular encoder-decoder vs. encoder-decoder with attention**


<img src='images/encoder-decoder_comparison.png' style="width:650px;height:280px;"/>

In the regular encoder-decoder architecture, the decoder takes only the final state from the encoder. In the encoder-decoder with attention architecture, the `attention` layer takes as input the hidden states from all time steps of the RNN layer in the encoder and outputs a `context` vector, which is essentially a `weighted sum` of those hidden states.

> The key concept of `Attention` is to calculate an **attention weight vector** that is used to amplify the information from the most relevant parts of the input sequence (internally represented by the hidden states of the encoder's RNN cell) and, at the same time, drown out the irrelevant parts.
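
As a small illustration of the "weighted sum" idea, the sketch below combines three encoder hidden states into a context vector. The hidden states and the attention weights are made-up numbers; in a real model the weights come from the score function and softmax.

```python
import numpy as np

# Hypothetical encoder hidden states: T_x = 3 time steps, 2 features each.
encoder_states = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

# Hypothetical attention weights: non-negative and summing to 1.
attention_weights = np.array([0.1, 0.7, 0.2])

# Context vector = sum over time steps of weight * hidden state.
context = attention_weights @ encoder_states
print(context)
```

The second hidden state has the largest weight, so it contributes most to the context vector, while the other two are mostly drowned out.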

We use the following score function to calculate the attention weight vector:

<font size='4'> $$ score(S_{t}, \overline h_{t^{'}}) $$</font>

where 
* $S_{t}$ is the decoder hidden state at time step $t$ 
* $\overline h_{t^{'}}$ is the encoder hidden state at time step $t^{'}$; stacking these for $t^{'} = 1, \dots, T_x$ gives all encoder hidden states.

Then, we calculate `attention weight vector` through softmax:

<br/>
<font size='4'> $$ \alpha_{t, t^{'}} = \frac{
\exp(score(S_{t}, \overline h_{t^{'}}))
}{
\displaystyle\sum_{k = 1}^{T_x} \exp(score(S_{t}, \overline h_{k}))
}$$ </font>
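
The softmax step can be sketched as follows. The `scores` values are made up for illustration, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Turn a 1-D array of scores into weights that sum to 1."""
    shifted = scores - np.max(scores)  # avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical attention scores for T_x = 3 encoder hidden states.
scores = np.array([2.0, 1.0, 0.1])
alpha = softmax(scores)
print(alpha, alpha.sum())
```

The highest score receives the largest weight, but unlike the hard max, every hidden state keeps a non-zero share.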

Before going into the details of how the score function is computed, we first show what the score function does.

<img src='images/compute_attention_weights.png'>

## 3. Attention Mechanism

The following picture illustrates the encoder-decoder with attention architecture.

<img src='images/encoder_decoder_attention_high_level.png' style="width:600px;height:640px;"/>

Now, the most mysterious part of attention is how to compute the `attention weight vector`. Before diving into the details, we first introduce the types of attention.

### 3.1 Types of Attention

There are two major types of attention: **additive attention** and **multiplicative attention**. They are sometimes called `Bahdanau attention` and `Luong attention` respectively, after the first authors of the papers that describe them. You can check these two papers for details:
* [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
* [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)


Basically, the score function can be calculated in three different ways (Of course, there are other variants).

<br/>
<font size='4'> $$ score(S_{t}, \overline h_{t^{'}})= \begin{cases}
    S^{T}_{t} \overline h_{t^{'}}   & \quad \text{dot } \\
    S^{T}_{t}W_a \overline h_{t^{'}}   & \quad \text{general } \\
    v^T_a \tanh\left(W_a[S_{t};\overline h_{t^{'}}]\right)  & \quad \text{concat}
 \end{cases} $$ </font>

where
* $S_{t}$ is the decoder hidden state at time step $t$ 
* $\overline h_{t^{'}}$ is the encoder hidden state at time step $t^{'}$; stacking these for $t^{'} = 1, \dots, T_x$ gives all encoder hidden states.

### 3.2 Compute attention score

In this section, we illustrate the three ways of computing the attention score in an intuitive way.

**Multiplicative Attention - dot product**

<img src='images/multiplicative_attention_dot_product.png' style="width:700px;height:550px;"/>

Essentially, this scoring method uses the dot product between one encoder hidden state vector and one decoder hidden state vector to calculate the attention score for that encoder hidden state. This makes sense because the dot product of two vectors in a word-embedding space is a measure of similarity between them.
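
A minimal sketch of the dot-product score, assuming the decoder state and every encoder state have the same dimension (3 here). All numbers are made up for illustration.

```python
import numpy as np

# Hypothetical decoder hidden state S_t.
decoder_state = np.array([1.0, 0.0, -1.0])

# Hypothetical encoder hidden states, one row per time step (T_x = 3).
encoder_states = np.array([
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 1.0],
    [2.0, 0.0, -2.0],
])

# One matrix-vector product gives the dot product with every encoder state.
scores = encoder_states @ decoder_state
print(scores)
```

The third encoder state points in the same direction as the decoder state, so it gets the highest score.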

The simplicity of this method comes with a drawback: it assumes that the encoder and decoder have the same embedding dimensions. This might work for text summarization, for example, where the encoder and decoder use the same language and the same embedding space. For machine translation, however, you might find that each language tends to have its own embedding space.

This is the case where we might want to use the second scoring method - general multiplicative attention.

**Multiplicative Attention - general**

This method is a slight variation on the first method. It simply inserts a weight matrix between the decoder hidden state and the encoder hidden states in the multiplication.

<img src='images/multiplicative_attention_general.png' style="width:750px;height:400px;"/>
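
A minimal sketch of the general score, assuming a decoder state of size 2, encoder states of size 3, and a weight matrix `W_a` of shape (2, 3). The values are made up; in practice `W_a` is learned during training.

```python
import numpy as np

# Hypothetical decoder hidden state S_t (size 2).
decoder_state = np.array([1.0, 2.0])

# Hypothetical encoder hidden states (T_x = 2, size 3).
encoder_states = np.array([
    [1.0, 1.0, 1.0],
    [0.0, 2.0, 5.0],
])

# W_a maps between the two spaces, so the dimensions no longer need to match.
W_a = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

# score = S_t^T W_a h for every encoder state h, computed in one shot.
scores = encoder_states @ (W_a.T @ decoder_state)
print(scores)
```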


**Additive Attention - concat**

Additive attention is computed with a small feedforward neural network. To better illustrate this concept, we will calculate the attention score for one encoder hidden state (i.e., $h_1$), instead of calculating the attention scores for all encoder hidden states at once.

<img src='images/additive_attention_concat.png' style="width:700px;height:520px;"/>

The concat scoring method is commonly done by concatenating one encoder hidden state vector and one decoder hidden state vector, and making that the input to a feedforward neural network. This network has a single hidden layer and outputs a score. The weights of this network (i.e., $W_a$ and $v_a$) are learned during the training process.

In the above example, we only calculate the attention score for one encoder hidden state. The attention scores for the other encoder hidden states are calculated in exactly the same way. This calculation is formalized as below:

<font size='4'>$$ v^T_a \tanh\left(W_a[S_{t};\overline h_{t^{'}}]\right)  $$</font>
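
The concat score can be sketched as a tiny feedforward network in NumPy. The function name `concat_score`, the layer sizes, and the random (untrained) weights are assumptions for illustration only.

```python
import numpy as np

def concat_score(decoder_state, encoder_states, W_a, v_a):
    """Score every encoder state against one decoder state.

    decoder_state: (d_dec,), encoder_states: (T_x, d_enc),
    W_a: (n_hidden, d_dec + d_enc), v_a: (n_hidden,). Returns (T_x,).
    """
    T_x = encoder_states.shape[0]
    # Pair the same decoder state with each encoder state: [S_t; h_t'].
    repeated = np.tile(decoder_state, (T_x, 1))
    concatenated = np.concatenate([repeated, encoder_states], axis=1)
    hidden = np.tanh(concatenated @ W_a.T)  # single hidden layer
    return hidden @ v_a                     # one scalar score per time step

rng = np.random.default_rng(0)
d_dec, d_enc, n_hidden, T_x = 3, 4, 5, 6
W_a = rng.normal(size=(n_hidden, d_dec + d_enc))
v_a = rng.normal(size=n_hidden)
decoder_state = rng.normal(size=d_dec)
encoder_states = rng.normal(size=(T_x, d_enc))

scores = concat_score(decoder_state, encoder_states, W_a, v_a)
print(scores.shape)  # (6,), one score per encoder hidden state
```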

This concat method is very similar to the scoring method from the Bahdanau paper, which is shown below:

<font size='4'>$$ v^T_a \tanh\left(W_a S_{t-1} + U_a \overline h_{t^{'}}\right)  $$</font>

In the Bahdanau paper, there are two major differences:
1. It uses two weight matrices instead of one, each applied to its respective vector.
2. It uses the hidden state from the previous time step of the decoder instead of the hidden state from the current time step.
    * Actually, using the hidden state from the previous time step and using the one from the current time step are both legitimate choices. However, the previous hidden state is the better choice if you want to feed the context vector into the RNN cell of the decoder, because the current hidden state has not been computed yet at that point. The current hidden state might be the better choice if you want to feed the context vector into the dense layer.
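
The sketch below computes the two-matrix score with the previous decoder state, and also shows how it relates to the concat form: applying $W_a$ and $U_a$ separately and summing gives the same result as applying the single block matrix $[W_a, U_a]$ to the concatenated vector. The sizes and random (untrained) weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_dec, d_enc, n_hidden, T_x = 3, 4, 5, 6

W_a = rng.normal(size=(n_hidden, d_dec))  # applied to the decoder state
U_a = rng.normal(size=(n_hidden, d_enc))  # applied to each encoder state
v_a = rng.normal(size=n_hidden)

prev_decoder_state = rng.normal(size=d_dec)     # stands in for S_{t-1}
encoder_states = rng.normal(size=(T_x, d_enc))  # one row per time step

# Two-matrix form: v_a^T tanh(W_a S_{t-1} + U_a h_t') for every t'.
scores_two_matrices = (
    np.tanh(W_a @ prev_decoder_state + encoder_states @ U_a.T) @ v_a
)

# Single-matrix form on the concatenation, using the block matrix [W_a, U_a].
W_concat = np.concatenate([W_a, U_a], axis=1)
stacked = np.concatenate(
    [np.tile(prev_decoder_state, (T_x, 1)), encoder_states], axis=1
)
scores_concat = np.tanh(stacked @ W_concat.T) @ v_a

print(np.allclose(scores_two_matrices, scores_concat))
```

One practical reason to keep the matrices separate is that `encoder_states @ U_a.T` does not depend on the decoder state, so it can be computed once and reused at every decoder time step.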