# Attention

## 1. Attention Motivation

Two different ways of approaching "many-to-one"

<img src='images/last_hidden_state.png' style="width:400px;height:230px;"/>

* We know that an LSTM or GRU can learn "long-term" dependencies. However, if the sentence is too long, it can still forget words that appeared early in the sequence.
* By taking only the last RNN state, we hope that the RNN has both found the relevant features and remembered them all the way to the end, which is unrealistic.

<img src='images/hard_max_hidden_state.png' style="width:400px;height:300px;"/>

* Doing a max pool over RNN states is like doing a max pool over CNN features: it is essentially saying "pick the most important feature".
* Hard max takes the max and forgets everything else.

This leads us to a question: 

> Why not take all hidden states from encoder into account and pay attention to parts of those hidden states that are most relevant?

Actually research shows that:

> "One important property of human perception is that one does not tend to process a whole scene in its entirety at once. Instead humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene, guiding future eye movements and decision making."
>
> -- Recurrent Models of Visual Attention, 2014


This is what attention tries to do: 

> `Attention` uses softmax to produce a probability distribution over the encoder hidden states that says "how much to pay attention to" each one.

## 2. Encoder-Decoder with Attention

The `Attention` mechanism is typically applied on the decoder side. The encoder does not change: it is still composed of an embedding layer and an RNN layer. However, instead of a unidirectional RNN, we can use a bidirectional RNN.

<img src='images/bidirectional_rnn.png' style="width:560px;height:300px;"/>

* The output shape would be $T_x \times 2M$, where $T_x$ is the length of the sequence (the total number of time steps) and $M$ is the number of hidden units. Since we use a bidirectional RNN, the output dimension is $2M$.
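
As a quick shape check, the sketch below uses NumPy with random stand-ins for the forward and backward RNN outputs. The sizes `T_x = 5` and `M = 4` and all variable names are illustrative assumptions, not values from this tutorial. Concatenating the two directions along the feature axis gives the $T_x \times 2M$ output described above.

```python
import numpy as np

T_x, M = 5, 4  # illustrative sizes: sequence length and hidden units per direction
rng = np.random.default_rng(0)

# Random stand-ins for the per-time-step outputs of the two directions.
h_forward = rng.standard_normal((T_x, M))    # reads x_1 ... x_Tx
h_backward = rng.standard_normal((T_x, M))   # reads x_Tx ... x_1, re-aligned to time order

# A bidirectional RNN (in concatenation mode) joins both directions at every time step.
h_bar = np.concatenate([h_forward, h_backward], axis=-1)

print(h_bar.shape)  # (5, 8)
```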

### 2.1 Regular encoder-decoder vs. Encoder-decoder with attention


<img src='images/encoder-decoder_comparison.png' style="width:650px;height:280px;"/>

In the regular encoder-decoder architecture, the decoder takes only the final state from the encoder. In the encoder-decoder with attention architecture, the `attention` layer instead takes as input the hidden states from all time steps of the encoder's RNN layer and outputs a `context` vector, which is essentially a `weighted sum` of those hidden states.

> The key concept of `Attention` is to calculate an **attention weight vector** that is used to amplify the most relevant parts of the input sequence (internally represented by the hidden states of the encoder RNN cell) and, at the same time, drown out the irrelevant parts.

## 3. Attention Mechanism

### 3.1 Attention Overview 

The core step of attention is to compute a `context vector` (also called `context`) at each time step $t$ of the decoder RNN cell. This `context` is a compact representation of the information processed by the encoder, and it emphasizes the parts of that information that are most relevant to the decoder hidden state being processed at time step $t$.

The following picture gives a high-level view of how we compute the `context vector` at time step $t$, given all the hidden states of the encoder RNN cell (e.g., $[h_{1}, h_{2}, h_{3}]$) and the current hidden state of the decoder RNN cell (e.g., $S_{t}$):

<img src='images/compute_attention_weights.png'>

### 3.2  Mathematical Definition for Attention

As shown in the picture above:
1. We first calculate an `attention score` for each encoder hidden state (e.g., $[h_{1}, h_{2}, h_{3}]$).
2. Then, we apply softmax to those `attention scores` to get an `attention weight vector`.
3. Finally, we compute the `context vector` as the weighted sum of the encoder hidden states, using the `attention weight vector` as the weights.

For each of the three steps, we have a mathematical definition:

**Scoring function for calculating attention scores:**

$$ score(S_{t}, \overline h_{t^{'}}) \tag{1}$$  

where 
* $S_{t}$ is the decoder hidden state at time step $t$ 
* $\overline h_{t^{'}} $ is the encoder hidden state at time step $t'$ (also written $h_{t'}$); stacking these for $t' = 1, \dots, T_x$ gives all encoder hidden states.
* Since we have two sequences, one for the encoder and one for the decoder, we use
    * $t$ to denote time step for decoder 
    * $t'$ to denote time step for encoder

**Softmax for calculating attention weight vector:**

<br/>
$$ \alpha_{t,t^{'}} = \frac{
\exp(score(S_{t}, \overline h_{t^{'}}))
}{
\displaystyle\sum_{k = 1}^{T_x} \exp(score(S_{t}, \overline h_{k}))
}  \tag{2}$$ 

where $\alpha_{t,t'}$ is the attention weight that decoder time step $t$ puts on encoder time step $t'$, $\alpha_t = (\alpha_{t,1}, \alpha_{t,2}, \alpha_{t,3}, \dots , \alpha_{t,T_x})$ is the resulting attention weight vector, and $T_x$ is the length of the input sequence (the number of time steps of the encoder RNN cell).

**Weighted sum for computing the context vector:**

$$context_{t} = \sum_{t' = 1}^{T_x} \alpha_{t,t'} h_{t'}\tag{3}$$
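
The three formulas can be traced on a tiny example. The following is a minimal NumPy sketch, assuming a plain dot product as the scoring function in formula (1) and hand-picked two-dimensional hidden states; the numbers and variable names are illustrative only.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# Three encoder hidden states h_1, h_2, h_3 (one per row) and one decoder state S_t.
h = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
S_t = np.array([2.0, 0.0])

scores = h @ S_t            # formula (1) with a dot-product score, shape (3,)
alpha_t = softmax(scores)   # formula (2): attention weights that sum to 1
context_t = alpha_t @ h     # formula (3): weighted sum of hidden states, shape (2,)

print(scores)      # [2. 0. 2.]
print(alpha_t)
print(context_t)
```

Here $h_1$ and $h_3$ receive the same score, so they get equal weights, while $h_2$ (orthogonal to $S_t$) is mostly drowned out.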

The following picture illustrates the encoder-decoder with attention architecture in more detail:

<img src='images/encoder_decoder_attention_high_level.png' style="width:600px;height:640px;"/>

Now, the most mysterious part of the attention is how to compute the `attention weight vector`. Before diving into the details of computing `attention weight vector`, we first introduce types of attention. 

### 3.3 Types of Attention

There are two major types of attention: **additive attention** and **multiplicative attention**. They are sometimes called `Bahdanau attention` and `Luong attention` respectively, after the first authors of the papers that describe them. You can check these two papers for details:
* [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
* [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)


Basically, the score function can be calculated in three different ways (Of course, there are other variants).

$$ score(S_{t}, \overline h_{t^{'}})= \begin{cases}
    S^{T}_{t} \overline h_{t^{'}}   & \quad \text{dot } \\
    S^{T}_{t}W_a \overline h_{t^{'}}   & \quad \text{general } \\
    v^T_a \tanh\left(W_a[S_{t};\overline h_{t^{'}}]\right)  & \quad \text{concat}
 \end{cases} $$ 

where
* $S_{t}$ is the decoder hidden state at time step $t$ 
* $\overline h_{t^{'}} $ is the encoder hidden state at time step $t'$.
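
The three scoring variants can be written down directly. This is a minimal NumPy sketch under stated assumptions: the dimensions `d_s`, `d_h`, `d_a`, the function names, and the random weights are all illustrative stand-ins for parameters that would be learned in a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_h, d_a = 4, 6, 5             # decoder, encoder, and hidden-layer sizes (illustrative)
S_t = rng.standard_normal(d_s)      # decoder hidden state
h_tp = rng.standard_normal(d_h)     # one encoder hidden state

def score_dot(S_t, h):
    # Requires both vectors to have the same dimension.
    return S_t @ h

def score_general(S_t, h, W_a):
    # W_a has shape (d_s, d_h) and maps between the two spaces.
    return S_t @ W_a @ h

def score_concat(S_t, h, W_a, v_a):
    # W_a has shape (d_a, d_s + d_h); v_a has shape (d_a,).
    return v_a @ np.tanh(W_a @ np.concatenate([S_t, h]))

W_general = rng.standard_normal((d_s, d_h))
W_concat = rng.standard_normal((d_a, d_s + d_h))
v_a = rng.standard_normal(d_a)

print(score_general(S_t, h_tp, W_general))
print(score_concat(S_t, h_tp, W_concat, v_a))
print(score_dot(S_t, rng.standard_normal(d_s)))  # dot needs matching sizes
```

Each function returns a single scalar score for one (decoder state, encoder state) pair. With $W_a$ set to the identity matrix, the general score reduces to the dot score.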

### 3.4 Attention Score Computation

In this section, we illustrate the three ways of computing the attention score in an intuitive way.

**Multiplicative Attention - dot product**

<img src='images/multiplicative_attention_dot_product.png' style="width:700px;height:550px;"/>

Essentially, this scoring method uses the dot product between one encoder hidden state vector and one decoder hidden state vector to calculate the attention score for that encoder hidden state. This makes sense because the dot product of two vectors in word-embedding space is a measure of similarity between them.

With the simplicity of this method comes the drawback of assuming that the encoder and decoder hidden states have the same dimension and live in the same embedding space. This might work for text summarization, for example, where the encoder and decoder use the same language and the same embedding space. For machine translation, however, you might find that each language tends to have its own embedding space.

This is the case where we might want to use the second scoring method - general multiplicative attention.

**Multiplicative Attention - general**

This method is a slight variation on the first method. It simply inserts a learned weight matrix $W_a$ between the decoder hidden state and the encoder hidden state in the product.

<img src='images/multiplicative_attention_general.png' style="width:750px;height:330px;"/>


**Additive Attention - concat**

Additive attention is computed with a small feed-forward neural network. To better illustrate this concept, we will calculate the attention score for one encoder hidden state (i.e., $h_1$), instead of calculating the attention scores for all encoder hidden states at once.

<img src='images/additive_attention_concat.png' style="width:700px;height:520px;"/>

The concat scoring method is commonly done by concatenating one encoder hidden state vector and one decoder hidden state vector, and making that the input to a feed-forward neural network. This network has a single hidden layer and outputs a score. The weights of this network (i.e., $W_a$ and $v_a$) are learned during the training process.

In the above example, we only calculate the attention score for one encoder hidden state. The attention scores for the other encoder hidden states (e.g., $h_2$, $h_3$) are calculated in exactly the same way. The calculation of the attention score for any encoder hidden state is formalized as below:

$$ v^T_a \tanh\left(W_a[S_{t};\overline h_{t^{'}}]\right)  $$

This concat method is very similar to the scoring method from the Bahdanau paper, which is shown below:

$$ v^T_a \tanh\left(W_a S_{t-1} + U_a \overline h_{t^{'}}\right)  $$

In the Bahdanau paper, there are two major differences:
1. It uses two weight matrices instead of one, each applied to its respective vector.
2. It uses the decoder hidden state from the previous time step instead of the hidden state from the current time step.
    * Using the hidden state from the previous time step and using the one from the current time step are both legitimate choices. However, the previous hidden state is the better choice if you want to feed the context vector into the RNN cell of the decoder, while the current hidden state might be the better choice if you want to feed the context vector into the dense layer.
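
The first difference is smaller than it may look: applying one matrix to a concatenated vector gives the same result as applying two matrices to the two parts and adding the results. The NumPy sketch below checks this numerically. The names `W`, `U`, `v` and all sizes are illustrative assumptions, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(1)
d_s, d_h, d_a = 4, 6, 5          # illustrative sizes
S = rng.standard_normal(d_s)     # decoder hidden state (S_t or S_{t-1})
h = rng.standard_normal(d_h)     # one encoder hidden state

W = rng.standard_normal((d_a, d_s))   # applied to the decoder state
U = rng.standard_normal((d_a, d_h))   # applied to the encoder state
v = rng.standard_normal(d_a)

# Two-matrix form: v^T tanh(W S + U h)
score_two_matrices = v @ np.tanh(W @ S + U @ h)

# Concat form with W_a = [W, U] placed side by side: v^T tanh(W_a [S; h])
W_a = np.hstack([W, U])
score_concat = v @ np.tanh(W_a @ np.concatenate([S, h]))

print(np.isclose(score_two_matrices, score_concat))  # True
```

Under this view, the choice between $S_{t-1}$ and $S_t$ is the difference that actually changes the computation.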

### 3.5 Apply Context Vector

After getting the attention scores, we apply formulas (2) and (3) to compute the `context vector`, denoted as $context_{t}$ (at time step $t$). The next question is how we should use this $context_{t}$.

Typically, we can feed this $context_{t}$ into two places:

<img src='images/feed_context_vector.png' style="width:200px;height:300px;"/>


1. `Decoder RNN cell`. If we are using teacher forcing, we concatenate $context_{t}$ with the target vector $y_t$ and feed $[context_{t}; y_t]$ into the RNN cell. In this case, we use $S_{t-1}$ to calculate the attention scores, since at this point we have not calculated $S_t$ yet.
2. `Dense layer`. We concatenate $context_{t}$ with the hidden state $S_t$ and feed $[context_{t}; S_t]$ into the dense layer. In this case, we use $S_{t}$ to calculate the attention scores.

### 3.6 Encoder-Decoder with Attention in Detailed View

<img src='images/encoder_decoder_architecture_detailed_view.png' style="width:750px;height:700px;"/>


## 4. Attention Implementation Details

We discuss the attention implementation details based on the **Keras framework**. If you are using another framework, the implementation might be different, but the main idea should not change much.

At a high level, the architecture has three components:
* Encoder
* Attention
* Decoder

Because the Attention is tightly coupled to the Decoder, we typically put them together as a single layer. This Decoder with Attention layer is the most complicated part of the whole architecture, so we will explain its implementation in detail. We will also keep track of tensor shapes while constructing the model to get a better understanding of how the code works.

### 4.1 Regular Decoder with no Attention

For a regular decoder with no attention, we just ask the Keras framework to do the job for us. The Keras LSTM automatically runs $T_y$ time steps, where $T_y$ is the length of `target_sequence` (used for teacher forcing).  
* We can feed the `target_sequence` all at once into decoder LSTM because each element of the `target_sequence` does not change while the LSTM is doing the computation for each time step.

```python
from tensorflow.keras.layers import LSTM, Dense

x = LSTM(units, return_sequences=True)(target_sequence)  # runs all T_y time steps
y_hat = Dense(dimension)(x)
```

### 4.2 Decoder with Attention

For decoder with attention, things become more complicated. The `context vector` at time step $t$ is calculated based on the hidden state $S_{t-1}$ that changes at each time step. As a result, the `context vector` changes at each time step. Therefore, we cannot rely on Keras LSTM to compute `context vectors` for us and we have to code the calculation of `context vector` at each time step. 

We define a `one_time_step_attention` function to calculate the `context vector` at each time step. The following picture illustrates the work that the `one_time_step_attention` function does:

<img src='images/attention_computation.png' style="width:500px;height:500px;"/>

`one_time_step_attention` function takes hidden state $S_{t-1}$ from decoder and hidden states of all time steps $[h_1, h_2, ... h_{T_x}]$ from encoder as input, and outputs the `context vector` at time step $t$.

The pseudocode for this function is:

```python
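from tensorflow.keras.activations import softmax
from tensorflow.keras.layers import RepeatVector, Concatenate, Dense, Dot

def softmax_over_time(x):
    # Softmax over axis 1 (the time axis) instead of the default last axis.
    return softmax(x, axis=1)

# T_x (the input sequence length) is assumed to be defined elsewhere.
repeat_layer = RepeatVector(T_x)  # Repeats the input T_x times.
concat_layer = Concatenate(axis=-1)
dense1 = Dense(10, activation='tanh')
dense2 = Dense(1, activation=softmax_over_time)
dot = Dot(axes=1, name='attn_dot_layer')

def one_time_step_attention(h_bar, S_t_1):
    S_t_1 = repeat_layer(S_t_1)       # copy S_{t-1} once per encoder time step
    x = concat_layer([h_bar, S_t_1])  # pair every h_t' with S_{t-1}
    x = dense1(x)                     # hidden layer of the scoring network
    alphas = dense2(x)                # attention weights, softmaxed over time
    context = dot([alphas, h_bar])    # weighted sum of encoder hidden states
    return context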
```

where
* `repeat_layer` copies `S_t_1` $T_x$ times.  
* `concat_layer` concatenates `h_bar` with `S_t_1`. 
    * `h_bar` is a tensor that contains the hidden states of all time steps $[h_1, h_2, ... h_{T_x}]$ from the encoder.

**Calculating shapes of tensors**

Make sure you understand the following calculation, because it helps you understand how the code works.

* Suppose mini-batch size is $N$
* Suppose encoder LSTM has hidden units = $M_1$, then,
    * the shape of $h_t$ (e.g., $h_1$) is $(N, 2M_1)$ since we are using Bidirectional LSTM in encoder.
    * the shape of $\overline h$ (`h_bar` in code) is $(N, T_x, 2M_1)$, where $T_x$ is the length of input sequence and also is the number of time steps of encoder RNN layer.
* Suppose decoder LSTM has hidden units = $M_2$, then,
    * the shape of $S_{t-1}$ (`S_t_1` in code) is $(N, M_2)$. 

Let's walk through the `one_time_step_attention` function line by line:

* After `repeat_layer`, the shape of `S_t_1` is $(N, T_x, M_2)$
* After `concat_layer`, which concatenates `h_bar` and `S_t_1`, the output `x` has shape $(N, T_x, 2M_1 + M_2)$ because we defined `concat_layer` to concatenate two input tensors over their last dimension.
* After `dense1`, which takes a tensor with shape $(N, T_x, 2M_1 + M_2)$ as input, the output `x` has shape $(N, T_x, 10)$, since `dense1` has 10 hidden units.
* After `dense2`, which takes a tensor with shape $(N, T_x, 10)$ as input, the output `alphas` has shape $(N, T_x, 1)$, since `dense2` has 1 hidden unit. Also, the second dimension (i.e., the time step dimension) of `alphas` is **softmaxed**. 
* After `dot`, which takes `alphas` with shape $(N, T_x, 1)$ and `h_bar` with shape $(N, T_x, 2M_1)$, the output `context` has shape $(N, 1, 2M_1)$. This is because:
    * We defined `dot` to calculate the dot product between two input tensors over their second dimension (axes=1), which is their time step dimension. 
    * Conceptually, what `dot` computes is equivalent to the following:
        1. Broadcast `alphas` to a tensor with shape $(N, T_x, 2M_1)$. More specifically, copy the original `alphas` with shape $(N, T_x, 1)$ $2M_1$ times and stack those copies together to form a $(N, T_x, 2M_1)$ tensor.
        2. Multiply the broadcast `alphas` and `h_bar` element-wise and sum over the second dimension, which gives the weighted sum of the encoder hidden states.
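
The shape walk-through can be reproduced without Keras. The following is a NumPy sketch under stated assumptions: the sizes are arbitrary, `W1` and `W2` are random stand-ins for the weights of `dense1` and `dense2`, and biases are omitted. It also checks that contracting over the time axis gives the same result as the broadcast-and-sum description.

```python
import numpy as np

N, T_x, M_1, M_2 = 2, 4, 3, 5      # illustrative sizes
rng = np.random.default_rng(0)
h_bar = rng.standard_normal((N, T_x, 2 * M_1))
S_t_1 = rng.standard_normal((N, M_2))

# Random stand-ins for the weights of dense1 (10 units) and dense2 (1 unit).
W1 = rng.standard_normal((2 * M_1 + M_2, 10))
W2 = rng.standard_normal((10, 1))

repeated = np.repeat(S_t_1[:, None, :], T_x, axis=1)    # (N, T_x, M_2)
x = np.concatenate([h_bar, repeated], axis=-1)          # (N, T_x, 2*M_1 + M_2)
x = np.tanh(x @ W1)                                     # (N, T_x, 10)
scores = x @ W2                                         # (N, T_x, 1)
e = np.exp(scores - scores.max(axis=1, keepdims=True))
alphas = e / e.sum(axis=1, keepdims=True)               # softmax over the time axis

# Contract over the time axis: (N, T_x, 1) with (N, T_x, 2*M_1) -> (N, 1, 2*M_1)
context = np.einsum('nti,ntj->nij', alphas, h_bar)

# Same result via broadcasting alphas and summing over time.
context_broadcast = (alphas * h_bar).sum(axis=1, keepdims=True)
print(context.shape, np.allclose(context, context_broadcast))  # (2, 1, 6) True
```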

Now we know how attention works at a single time step. Next, we place `one_time_step_attention` into the bigger picture, that is, how attention works together with the decoder.

The following picture shows the computation of decoder-attention at time step $t$:

<img src='images/decoder_attention_one_step.png' style="width:300px;height:400px;"/>

We will use a loop to perform the computation at each time step. The following picture shows the whole view of the decoder-attention layer:

<img src='images/decoder_attention_all_steps.png' style="width:600px;height:400px;"/>

The pseudocode for this layer is:

```python

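from tensorflow.keras.layers import Concatenate, LSTM, Dense

# Pseudocode: encoder, input_sequence, y_t, hidden_units, num_words_output,
# T_y and the initial states are assumed to be defined elsewhere.
concat_layer = Concatenate(axis=2)  # axis 2 is the last axis of a 3-D tensor
decoder_lstm = LSTM(hidden_units, return_state=True)
dense = Dense(num_words_output, activation='softmax')

h_bar = encoder(input_sequence)
s = initial_s  # initial decoder hidden state
c = initial_c  # initial decoder cell state
outputs = []
for t in range(T_y):
    context = one_time_step_attention(h_bar, s)
    x = concat_layer([context, y_t])
    o, s, c = decoder_lstm(x, initial_state=[s, c])
    y_hat = dense(o)
    outputs.append(y_hat)  # keep the prediction of every time step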
```

where 
* `y_t` is the target_sequence at time step $t$. 
* We concatenate the `context` vector with `y_t` and feed the concatenation into the decoder LSTM cell as input at time step $t$.
    * For inference/prediction, we concatenate the `context` vector with `y_hat_t`, which is the previously predicted word index, and feed the concatenation into the decoder LSTM cell as input at time step $t$.

**Calculating shapes of tensors**

* `y_t` is the target_sequence at time step $t$. It is a word vector. Let's assume that the word embedding dimension is $K$. Then, 
    * `y_t` has shape $(N, 1, K)$
* We already know that `context` has shape $(N, 1, 2M_1)$.
* After `concat_layer`, which concatenates `context` and `y_t`, the output `x` has shape $(N, 1, 2M_1 + K)$ because we defined `concat_layer` to concatenate two input tensors over axis 2, which is their last dimension.
* After `decoder_lstm`, we have output `o` with shape $(N, M_2)$
* After `dense`, we have output `y_hat` with shape $(N, D)$, where $D$ is the total number of words.
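
To see the whole loop run end to end with these shapes, here is a NumPy sketch. It is not the Keras model: a simple tanh recurrent cell stands in for the LSTM (so there is no cell state `c`), all weights are random instead of trained, biases are omitted, and every size and variable name is an illustrative assumption.

```python
import numpy as np

N, T_x, T_y, M_1, M_2, K, D = 2, 4, 3, 3, 5, 6, 7   # illustrative sizes
rng = np.random.default_rng(0)

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Random stand-ins for trained weights.
W1 = rng.standard_normal((2 * M_1 + M_2, 10)) * 0.1   # attention dense1
W2 = rng.standard_normal((10, 1)) * 0.1               # attention dense2
W_x = rng.standard_normal((2 * M_1 + K, M_2)) * 0.1   # recurrent cell, input weights
W_s = rng.standard_normal((M_2, M_2)) * 0.1           # recurrent cell, state weights
W_o = rng.standard_normal((M_2, D)) * 0.1             # output dense layer

def one_time_step_attention(h_bar, s):
    repeated = np.repeat(s[:, None, :], h_bar.shape[1], axis=1)
    x = np.tanh(np.concatenate([h_bar, repeated], axis=-1) @ W1)
    alphas = softmax(x @ W2, axis=1)                      # (N, T_x, 1)
    return (alphas * h_bar).sum(axis=1, keepdims=True)    # (N, 1, 2*M_1)

h_bar = rng.standard_normal((N, T_x, 2 * M_1))   # stand-in for encoder outputs
targets = rng.standard_normal((N, T_y, K))       # stand-in for the embedded target sequence
s = np.zeros((N, M_2))                           # initial decoder state

outputs = []
for t in range(T_y):
    context = one_time_step_attention(h_bar, s)      # (N, 1, 2*M_1)
    y_t = targets[:, t:t + 1, :]                     # (N, 1, K), teacher forcing
    x = np.concatenate([context, y_t], axis=2)       # (N, 1, 2*M_1 + K)
    s = np.tanh(x[:, 0, :] @ W_x + s @ W_s)          # (N, M_2), simplified cell
    outputs.append(softmax(s @ W_o, axis=-1))        # (N, D)

y_hat = np.stack(outputs, axis=1)
print(y_hat.shape)  # (2, 3, 7)
```

Because the state `s` is updated inside the loop, the context vector is recomputed from a different decoder state at every step, which is why the attention cannot be computed for all time steps at once.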

## References:

* [Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)
* [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
* [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)
