## What is Attention?
* **Attention** is ***simply a vector of weights, often the output of a dense layer followed by a softmax function.***
* Before the attention mechanism, ***translation relied on reading a complete sentence and compressing all of its information into a fixed-length vector***. As you can imagine, representing a sentence of hundreds of words with a single fixed-length vector will surely lead to information loss, inadequate translation, etc.

## Attention Architecture and the Idea Behind It

* The **basic idea:** each time the **model predicts an output word, it only uses parts of an input where the most relevant information is concentrated instead of an entire sentence.** In ***other words, it only pays attention to some input words. Let’s investigate how this is implemented.***

![](https://cdn-images-1.medium.com/max/800/1*9Lcq9ni9aujScFYyyHRhhA.png)

* The encoder works as usual, and the **difference is only in the decoder.** As you can see from the picture, ***the decoder’s hidden state is computed from a context vector, the previous output and the previous hidden state. But now we use not a single context vector c, but a separate context vector c_i for each target word.***
* These context vectors are **computed as a weighted sum of annotations generated by the encoder.** In **Bahdanau’s paper, the encoder is a bidirectional RNN, so these annotations are concatenations of the hidden states in the forward and backward directions.**
* The weight of each annotation is computed by an alignment model, which scores how well the **inputs and the output match.** The alignment model is, for instance, a **feedforward neural network**. In general, it can be any other model as well.
* As a result, the **alphas (the weights of the hidden states when computing a context vector) show how important a given annotation is in deciding the next state and generating the output word. These are the attention weights.**
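
The bullets above can be sketched in NumPy for a single decoder step. This is a minimal illustration, not the paper's implementation: the names `W1`, `W2` and `v`, the sizes, and the random weights are all assumptions made for the example, and the alignment model is taken to be a one-hidden-layer feedforward network.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def additive_attention(prev_dec_state, annotations, W1, W2, v):
    """Return (context, alphas) for one decoder step.

    prev_dec_state: (d_dec,)    previous decoder hidden state
    annotations:    (T, d_enc)  one encoder annotation per source word
    W1: (d_att, d_dec), W2: (d_att, d_enc), v: (d_att,)
        hypothetical alignment-model weights (random here, trained in practice)
    """
    # Alignment model: a small feedforward net that scores every annotation.
    hidden = np.tanh(W1 @ prev_dec_state + annotations @ W2.T)  # (T, d_att)
    scores = hidden @ v                                          # (T,)
    alphas = softmax(scores)                                     # weights, sum to 1
    context = alphas @ annotations                               # weighted sum, (d_enc,)
    return context, alphas

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 4, 6, 5, 8
annotations = rng.normal(size=(T, d_enc))
prev_dec_state = rng.normal(size=d_dec)
W1 = rng.normal(size=(d_att, d_dec))
W2 = rng.normal(size=(d_att, d_enc))
v = rng.normal(size=d_att)

context, alphas = additive_attention(prev_dec_state, annotations, W1, W2, v)
print(alphas, alphas.sum())
print(context.shape)
```

Each entry of `alphas` corresponds to one source position, so a new `context` is obtained for every target word simply by calling the function again with the updated decoder state.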

## Why Attention?

* The **core of a probabilistic language model** is to **assign a probability to a sentence under the Markov assumption.** Because sentences consist of different numbers of words, an RNN is a natural choice to model the **conditional probability among words.**
![](https://cdn-images-1.medium.com/max/800/0*SX7ClVkt8w9J39ed.)
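
For reference, the factorization this refers to is usually written as below. The notation is an assumption (the figure's own symbols may differ): $w_t$ is the word at position $t$, $n$ is the order of the Markov assumption, and $h_{t-1}$ is the RNN hidden state after reading the first $t-1$ words.

$$
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
\approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
$$

An RNN language model replaces the fixed window of $n-1$ previous words with a learned summary of the whole prefix, modeling each factor as $P(w_t \mid h_{t-1})$.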

**Vanilla RNN (the classic one) often runs into trouble when modeling:**

* ***Structure Dilemma:*** in the real world, **the lengths of outputs and inputs can be totally different**, while a **vanilla RNN** can only **handle fixed-length problems, which makes alignment difficult.** Consider an ***EN-FR translation example: “he doesn’t like apples” → “Il n’aime pas les pommes”.***
* ***Mathematical Nature:*** it suffers from **vanishing/exploding gradients**, which means ***it is hard to train once sentences get long (possibly after only a handful of words).***
* ***Translation often requires arbitrary input and output lengths. To deal with the deficits above, an encoder-decoder model is adopted, the basic RNN cell is changed to a GRU or LSTM cell, and the hyperbolic tangent activation is replaced by ReLU. We use GRU cells here.***

![](https://cdn-images-1.medium.com/max/800/0*VwQyyHLPDgEWSD-2.)

* The **embedding layer** maps **discrete words into dense vectors for computational efficiency**. Then the **embedded word vectors are fed into the encoder**, i.e. ***the GRU cells, sequentially.*** What happens during **encoding?** Information flows from left to right, and **each word's representation** is **computed from not only the current input but also all previous words.** When **the sentence is completely read, the encoder generates an output and a hidden state at timestep 4 for further processing.** For the ***decoding part, the decoder (GRUs as well) takes the hidden state from the encoder, is trained with teacher forcing (a mode in which the ground-truth previous word, rather than the decoder's own previous prediction, is fed as the current input), and then generates the translation words sequentially.***

* It seems amazing that this model can be applied to **N-to-M sequences**, yet there is still **one main deficit left unsolved: is one hidden state really enough?**
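
To make the bottleneck concrete, here is a minimal NumPy sketch of a GRU encoder. It is an illustration under stated assumptions rather than a trained model: the weights are random, biases are omitted, the sizes are arbitrary, and the update rule uses one common GRU convention. The point is only that the final hidden state has the same size whether the sentence has 4 words or 40.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One GRU update; p is a dict of hypothetical, randomly initialised weights."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))  # candidate state
    return (1.0 - z) * h + z * h_tilde                  # blend old and candidate state

def encode(embedded_words, p, hidden_size):
    """Run the GRU left to right; return every per-step state and the final state."""
    h = np.zeros(hidden_size)
    states = []
    for x in embedded_words:
        h = gru_step(x, h, p)
        states.append(h)
    return np.stack(states), h

rng = np.random.default_rng(0)
embed_size, hidden_size = 8, 5
# "W*" matrices act on the input, "U*" matrices act on the previous hidden state.
p = {
    name: rng.normal(scale=0.5,
                     size=(hidden_size, embed_size if name[0] == "W" else hidden_size))
    for name in ["Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]
}

short_sentence = rng.normal(size=(4, embed_size))    # stand-in for 4 embedded words
long_sentence = rng.normal(size=(40, embed_size))    # stand-in for 40 embedded words

states_short, final_short = encode(short_sentence, p, hidden_size)
states_long, final_long = encode(long_sentence, p, hidden_size)
print(final_short.shape, final_long.shape)    # (5,) (5,)
print(states_short.shape, states_long.shape)  # (4, 5) (40, 5)
```

A plain encoder-decoder hands only the final state to the decoder. The per-step `states` are exactly what an attention mechanism uses instead.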

## How does attention work?

![](https://cdn-images-1.medium.com/max/800/0*VrRTrruwf2BtW4t5.)

* Similar to the **basic encoder-decoder architecture,** this fancy mechanism **plugs a context vector into the gap between the encoder and the decoder.** In the schematic above, **blue represents the encoder** and **red represents the decoder;** we can see that the **context vector takes all encoder cells’ outputs as input to compute a probability distribution over the source-language words for each word the decoder wants to generate.** By using this mechanism, **the decoder can capture somewhat global information rather than inferring solely from one hidden state.**
* **Building the context vector is fairly simple.** For a **fixed target word**, ***first***, we **loop over all encoder states, comparing the target state** with **each source state to generate a score for each encoder state.** Then we **use softmax to normalize all the scores, which gives a probability distribution conditioned on the target state.** Finally, ***trainable weights are introduced so that the context vector can be learned. That’s it. The math is shown below:***

![](https://cdn-images-1.medium.com/max/800/0*4y96boGNMiNVHNo8.)

**To understand the seemingly complicated math, we need to keep three key points in mind:**
* ***During decoding, context vectors are computed for every output word.*** So we will have a **2D matrix whose size is the number of target words times the number of source words.** Equation **(1) demonstrates how to compute a single value given one target word and a set of source words.**
* **Once the context vector is computed, the attention vector** can be computed from the **context vector, the target word, and the attention function f.**
* We need **attention mechanism to be trainable**. According to equation **(4), both styles offer the trainable weights (W in Luong’s, W1 and W2 in Bahdanau’s). Thus, different styles may result in different performance.**
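
The three points above can be sketched as follows, assuming the multiplicative ("general") scoring form usually attributed to Luong, score(h_t, h_s) = h_tᵀ W h_s. The sizes and the random `W` are placeholders, and all target states are given up front purely for illustration; in real decoding they are produced one step at a time.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax with max-subtraction for numerical stability.
    e_x = np.exp(x - x.max(axis=1, keepdims=True))
    return e_x / e_x.sum(axis=1, keepdims=True)

def general_attention(target_states, source_states, W):
    """Multiplicative scoring: score(h_t, h_s) = h_t^T W h_s.

    target_states: (n_target, d)  decoder hidden states
    source_states: (n_source, d)  encoder hidden states
    W:             (d, d)         trainable weight matrix (random in this sketch)
    """
    scores = target_states @ W @ source_states.T   # (n_target, n_source)
    weights = softmax_rows(scores)                 # each row sums to 1
    contexts = weights @ source_states             # one context vector per target word
    return weights, contexts

rng = np.random.default_rng(1)
n_target, n_source, d = 5, 4, 3
target_states = rng.normal(size=(n_target, d))
source_states = rng.normal(size=(n_source, d))
W = rng.normal(size=(d, d))

weights, contexts = general_attention(target_states, source_states, W)
print(weights.shape)        # (5, 4): target words x source words
print(weights.sum(axis=1))  # every row sums to 1
```

Row `i` of `weights` is the distribution over source words used to build the context vector for target word `i`.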

## Attention Scoring
### Inputs to the scoring function
Let's start by looking at the inputs we'll give to the scoring function. We will assume we're in the first step of the decoding phase. The first input to the scoring function is the hidden state of the decoder (assuming a toy RNN with three hidden nodes -- not usable in real life, but easier to illustrate):

In [None]:
dec_hidden_state = [5,1,20]

Let's visualize this vector:

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Let's visualize our decoder hidden state
plt.figure(figsize=(1.5, 4.5))
sns.heatmap(np.transpose(np.matrix(dec_hidden_state)), annot=True, cmap=sns.light_palette("purple", as_cmap=True), linewidths=1)

Our first scoring function will score a single annotation (encoder hidden state), which looks like this:

In [None]:
annotation = [3,12,45] #e.g. Encoder hidden state

In [None]:
# Let's visualize the single annotation
plt.figure(figsize=(1.5, 4.5))
sns.heatmap(np.transpose(np.matrix(annotation)), annot=True, cmap=sns.light_palette("orange", as_cmap=True), linewidths=1)

### IMPLEMENT: Scoring a Single Annotation
Let's calculate the dot product of the decoder hidden state and a single annotation. NumPy's [dot()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html) is a good candidate for this operation.

In [None]:
def single_dot_attention_score(dec_hidden_state, enc_hidden_state):
    # Return the dot product of the two vectors
    return np.dot(dec_hidden_state, enc_hidden_state)
    
single_dot_attention_score(dec_hidden_state, annotation)


### Annotations Matrix
Let's now look at scoring all the annotations at once. To do that, here's our annotation matrix:

In [None]:
annotations = np.transpose([[3,12,45], [59,2,5], [1,43,5], [4,3,45.3]])

And it can be visualized like this (each column is a hidden state of an encoder time step):

In [None]:
# Let's visualize our annotation (each column is an annotation)
ax = sns.heatmap(annotations, annot=True, cmap=sns.light_palette("orange", as_cmap=True), linewidths=1)

### IMPLEMENT: Scoring All Annotations at Once
Let's calculate the scores of all the annotations in one step using matrix multiplication. Let's continue to use the dot scoring method.

<img src="http://yaox023.com/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0/RNN/attention%E7%90%86%E8%AE%BA/Attention_python%E6%BC%94%E7%A4%BA/images/scoring_functions.png" />

To do that, we'll have to transpose `dec_hidden_state` and [matrix multiply](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matmul.html) it with `annotations`.

In [None]:
def dot_attention_score(dec_hidden_state, annotations):
    # Return the product of dec_hidden_state (transposed) and the annotations matrix
    return np.matmul(np.transpose(dec_hidden_state), annotations)
    
attention_weights_raw = dot_attention_score(dec_hidden_state, annotations)
attention_weights_raw

Looking at these scores, can you guess which of the four vectors will get the most attention from the decoder at this time step?

## Softmax
Now that we have our scores, let's apply softmax:
<img src="http://yaox023.com/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0/RNN/attention%E7%90%86%E8%AE%BA/Attention_python%E6%BC%94%E7%A4%BA/images/softmax.png" />

In [None]:
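def softmax(x):
    # float64 with max-subtraction avoids overflow in exp() for large scores
    # and, unlike np.float128, is available on every platform.
    x = np.array(x, dtype=np.float64)
    e_x = np.exp(x - np.max(x, axis=0))
    return e_x / e_x.sum(axis=0)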

attention_weights = softmax(attention_weights_raw)
attention_weights

Even when you know which annotation will get the most focus, it's interesting to see how drastically softmax changes the final weights. The first and last annotations had scores of 927 and 929 respectively, but after softmax the attention they get is 0.12 and 0.88 respectively.
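
The arithmetic behind those two numbers can be checked by hand. Softmax depends only on the differences between scores, and the two middle scores (397 and 148 for the annotations used here) are so far below the top two that their terms are negligible:

$$
\alpha_1 = \frac{e^{927}}{e^{927} + e^{397} + e^{148} + e^{929}} \approx \frac{1}{1 + e^{2}} \approx 0.12,
\qquad
\alpha_4 \approx \frac{1}{1 + e^{-2}} \approx 0.88
$$

A gap of just 2 between raw scores is therefore enough to give one annotation roughly seven times the weight of the other.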

## Applying the scores back on the annotations
Now that we have our attention weights, let's multiply each annotation by its weight to move closer to the attention context vector. This is the multiplication part of this formula (we'll tackle the summation part in later cells).

<img src="http://yaox023.com/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0/RNN/attention%E7%90%86%E8%AE%BA/Attention_python%E6%BC%94%E7%A4%BA/images/Context_vector.png" />

In [None]:
def apply_attention_scores(attention_weights, annotations):
    # Multiply the annotations by their weights (each column is scaled by its weight)
    return attention_weights * annotations

applied_attention = apply_attention_scores(attention_weights, annotations)
applied_attention

Let's visualize how the context vector looks now that we've applied the attention scores back on it:

In [None]:
# Let's visualize our annotations after applying attention to them
ax = sns.heatmap(applied_attention, annot=True, cmap=sns.light_palette("orange", as_cmap=True), linewidths=1)

Contrast this with the raw annotations visualized earlier in the notebook, and we can see that the second and third annotations (columns) have been nearly wiped out. The first annotation maintains some of its value, and the fourth annotation is the most pronounced.

## Calculating the Attention Context Vector
All that remains to produce our attention context vector now is to sum up the four columns to produce a single attention context vector.


In [None]:
def calculate_attention_vector(applied_attention):
    return np.sum(applied_attention, axis=1)

attention_vector = calculate_attention_vector(applied_attention)
attention_vector

In [None]:
# Let's visualize the attention context vector
plt.figure(figsize=(1.5, 4.5))
sns.heatmap(np.transpose(np.matrix(attention_vector)), annot=True, cmap=sns.light_palette("blue", as_cmap=True), linewidths=1)
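
The multiply-then-sum steps above are equivalent to a single matrix-vector product between the annotations matrix and the weight vector. The sketch below repeats the toy values from the earlier cells so that it runs on its own; the function name `attention_context` is made up for this example.

```python
import numpy as np

dec_hidden_state = np.array([5, 1, 20], dtype=float)
# Each column is one annotation (encoder hidden state), as in the cells above.
annotations = np.transpose([[3, 12, 45], [59, 2, 5], [1, 43, 5], [4, 3, 45.3]])

def softmax(x):
    e_x = np.exp(x - np.max(x))  # subtract the max to avoid overflow
    return e_x / e_x.sum()

def attention_context(dec_hidden_state, annotations):
    scores = dec_hidden_state @ annotations   # dot score per annotation, shape (4,)
    weights = softmax(scores)                 # attention weights, sum to 1
    return annotations @ weights              # weighted sum of the columns, shape (3,)

# Same computation written as "scale each column, then sum across columns".
step_by_step = np.sum(softmax(dec_hidden_state @ annotations) * annotations, axis=1)
one_shot = attention_context(dec_hidden_state, annotations)
print(np.allclose(step_by_step, one_shot))  # True
print(one_shot)
```

Writing it as one product is also how batched implementations usually express this step.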

Now that we have the context vector, we can concatenate it with the hidden state and pass it through a hidden layer to produce the result of this decoding time step.
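
A minimal sketch of that last step, assuming the hidden layer is a single dense layer with a tanh activation applied to the concatenation (the form used in Luong-style attention). `W_c` is a hypothetical trainable matrix, random here, and the context values are illustrative stand-ins for the vector computed above.

```python
import numpy as np

def attentional_hidden_state(context, dec_hidden_state, W_c):
    """Compute tanh(W_c [context; dec_hidden_state]): one dense layer over the concatenation."""
    concat = np.concatenate([context, dec_hidden_state])  # shape (2 * d,)
    return np.tanh(W_c @ concat)                           # shape (d_out,)

rng = np.random.default_rng(0)
dec_hidden_state = np.array([5.0, 1.0, 20.0])
context = np.array([3.9, 4.1, 45.3])       # illustrative values only
W_c = rng.normal(scale=0.1, size=(3, 6))   # hypothetical weights, learned in practice

h_tilde = attentional_hidden_state(context, dec_hidden_state, W_c)
print(h_tilde.shape)  # (3,)
```

In a full model, `h_tilde` would then feed an output layer with softmax over the target vocabulary to pick the next word.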

### References

[1] Vinyals, Oriol, et al. Show and tell: A neural image caption generator. arXiv:1411.4555 (2014).  
[2] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 (2014).  
[3] Cho, Kyunghyun, Aaron Courville, and Yoshua Bengio. Describing Multimedia Content using Attention-based Encoder–Decoder Networks. arXiv:1507.01053 (2015).  
[4] Xu, Kelvin, et al. Show, attend and tell: Neural image caption generation with visual attention. arXiv:1502.03044 (2015).  
[5] Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. End-to-end memory networks. Advances in Neural Information Processing Systems. (2015).  
[6] Joulin, Armand, and Tomas Mikolov. Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets. arXiv:1503.01007 (2015).  
[7] Hermann, Karl Moritz, et al. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems. (2015).  
[8] Raffel, Colin, and Daniel PW Ellis. Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems. arXiv:1512.08756 (2015).  
[9] Vaswani, Ashish, et al. Attention Is All You Need. arXiv:1706.03762 (2017).  