## **Deep Learning Made Easy**

----

**Important:** The code in this notebook was developed by <a href="https://machinelearningmastery.com/the-attention-mechanism-from-scratch/">Stefania Cristina</a> and published on the *Machine Learning Mastery* site. The text is based on that post and on a few others, as indicated below. A few modifications were made by <a href="https://www.linkedin.com/in/valdivino-alexandre-de-santiago-j%C3%BAnior-103109206/?locale=en_US">Valdivino Alexandre de Santiago Júnior</a>. This notebook explains the attention mechanism.

<br>

**If you wish to cite this material, please cite the original authors, especially <a href="https://machinelearningmastery.com/the-attention-mechanism-from-scratch/">Stefania Cristina's</a> post.**


<br>

To run this notebook in Colab, use the link below:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vsantjr/DeepLearningMadeEasy/blob/master/Attention_Mechanism.ipynb)

## Seq2Seq Model
----

From: https://www.kdnuggets.com/2021/01/attention-mechanism-deep-learning-explained.html


The seq2seq model is normally built as an encoder-decoder architecture, where the encoder processes the input sequence and encodes/compresses the information into a context vector (or “thought vector”) **of fixed length**. This representation is expected to be a good summary of the **complete input sequence**. The decoder is then initialized with this context vector and, from it, starts producing the transformed or translated output.

A critical disadvantage of this **fixed-length context vector design is the inability of the system to retain longer sequences**. By the time it has processed the complete sequence, the model has often forgotten the earlier elements of the input. **The attention mechanism was created to resolve this problem of long-range dependencies**.

Thus, the main idea of the attention mechanism is that each time the model predicts an output word, it focuses on the parts of the input where the most relevant information is concentrated, instead of treating the entire sequence equally. **In simpler words, it pays attention mainly to some input words (not to the entire sequence of words)**.





## Example
----



Let us consider this sentence:

<br>

"The cat is sleeping on the couch". 

<br>

And the translation to Portuguese:

<br>

"O gato está dormindo no sofá".

<br>

<img src="https://drive.google.com/uc?id=1sKqU02obQ8Y4_680RBXmxwjB2p4Q2ro3" alt="Drawing" width="300"/>

<br>

Then, when predicting "gato", it is clear that this word results from the word "cat" in the input English sentence, regardless of the rest of the sentence. Thus, we say that while predicting "gato", we **pay more attention** to the word "cat" in the input sentence.
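The idea of "paying more attention" to one word can be illustrated with a toy calculation. The snippet below is only a sketch: the scores are invented by hand for illustration (they do not come from a trained model), and a softmax turns them into weights that sum to 1.

```python
import numpy as np

# Source-sentence tokens from the example above.
source_tokens = ["The", "cat", "is", "sleeping", "on", "the", "couch"]

# Hypothetical alignment scores for the output word "gato".
# These numbers are made up for illustration, not produced by a model.
scores_for_gato = np.array([0.5, 4.0, 0.2, 0.3, 0.1, 0.4, 0.6])

# Softmax: exponentiate and normalise so the weights are positive and sum to 1.
weights = np.exp(scores_for_gato - scores_for_gato.max())
weights /= weights.sum()

for token, weight in zip(source_tokens, weights):
    print(f"{token:>9s}: {weight:.3f}")

# The token with the largest weight is the one "attended to" the most.
most_attended = source_tokens[int(np.argmax(weights))]
print("Most attended word:", most_attended)
```

Every input word still receives some weight; the word "cat" simply receives most of it.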



## The Attention Mechanism
----

From: https://machinelearningmastery.com/the-attention-mechanism-from-scratch/

See also: https://docs.google.com/presentation/d/1ZXFIhYczos679r70Yu8vV9uO6B1J0ztzeDxbnBxD1S0/edit#slide=id.g31364026ad_3_2

The attention mechanism proposed by <a href="https://arxiv.org/abs/1409.0473">Bahdanau et al. (2015)</a> is divided into step-by-step computations of the alignment scores, the weights, and the context vector as described below:



*   **Alignment scores**. The alignment model takes the encoded hidden states, $h_i$, and the previous decoder output, $s_{t-1}$, to compute a score, $e_{t,i}$, that indicates how well the elements of the input sequence align with the current output at position $t$. The alignment model is represented by a function, $f(\cdot)$, which can be implemented by a feedforward neural network:

$$
e_{t,i} = f(s_{t-1}, h_i)
$$ 
*   **Weights**. The weights, $\alpha_{t,i}$, are computed by applying a softmax operation to the previously computed alignment scores:

$$
\alpha_{t,i} = \mathrm{softmax}(e_{t,i})
$$

*  **Context vector**. A unique context vector, $c_t$, is fed into the decoder at each time step. It is computed as a weighted sum of all $T$ encoder hidden states:

$$
c_t = \sum_{i=1}^{T} \alpha_{t,i} \cdot h_i
$$

Bahdanau et al. implemented both the encoder and the decoder as RNNs.
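The three steps above can be sketched in NumPy for a single decoder time step. This is a minimal illustration under stated assumptions, not the authors' implementation: the alignment function $f$ is taken to be a small additive feedforward network, $v^\top \tanh(W_s s_{t-1} + W_h h_i)$, and the sizes and the randomly drawn parameters `W_s`, `W_h` and `v` are hypothetical (in practice they would be learned).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: input length, encoder size, decoder size, alignment size.
T, d_h, d_s, d_a = 5, 4, 3, 6

h = rng.normal(size=(T, d_h))     # encoder hidden states h_1 .. h_T
s_prev = rng.normal(size=d_s)     # previous decoder output s_{t-1}

# Parameters of the feedforward alignment model (random here, learned in practice).
W_s = rng.normal(size=(d_a, d_s))
W_h = rng.normal(size=(d_a, d_h))
v = rng.normal(size=d_a)

# 1. Alignment scores: e_{t,i} = v^T tanh(W_s s_{t-1} + W_h h_i), one per input position.
e = np.tanh(W_s @ s_prev + h @ W_h.T) @ v      # shape (T,)

# 2. Weights: softmax over the T alignment scores.
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# 3. Context vector: weighted sum of the encoder hidden states.
c = alpha @ h                                  # shape (d_h,)

print("alignment scores:", e.round(3))
print("weights:", alpha.round(3))
print("context vector:", c.round(3))
```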

<br>

<br>

<img src="https://drive.google.com/uc?id=1e7YMnFvPwnDTdD_ePHWlgEx5XHSBtn6i" alt="Drawing" width="1000"/>



Source: <a href="https://arxiv.org/abs/1409.0473">Bahdanau et al. (2015) </a> and <a href="https://docs.google.com/presentation/d/1ZXFIhYczos679r70Yu8vV9uO6B1J0ztzeDxbnBxD1S0/edit#slide=id.g31364026ad_3_2">Beyer (2022)</a>.

## The General Attention Mechanism
---

From: https://machinelearningmastery.com/the-attention-mechanism-from-scratch/

See also: https://docs.google.com/presentation/d/1ZXFIhYczos679r70Yu8vV9uO6B1J0ztzeDxbnBxD1S0/edit#slide=id.g31364026ad_3_2


The general attention mechanism makes use of three main components, namely the queries, $Q$, the keys, $K$, and the values, $V$. 

If you compare these three components to the attention mechanism as proposed by <a href="https://arxiv.org/abs/1409.0473">Bahdanau et al. (2015)</a>, then the **query** would be analogous to the **previous decoder output, $s_{t-1}$**, while the **values** would be analogous to the **encoded inputs, $h_i$**. In the Bahdanau attention mechanism, **the keys and values are the same vector**.
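A minimal sketch of this query/key/value view for a single query is shown below. It assumes a plain dot product as the scoring function (Bahdanau et al. use a learned alignment model instead), and the helper `attend` and the toy numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax(x):
    # Subtracting the maximum improves numerical stability without changing the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys, values):
    """Return the attention output and the weights for a single query vector."""
    scores = keys @ query        # one dot-product score per key
    weights = softmax(scores)    # normalise the scores into weights
    return weights @ values, weights

# Toy encoded inputs h_i (numbers invented for illustration).
h = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
s_prev = np.array([1.0, 0.0])    # stands in for the previous decoder output

# Bahdanau-style usage: the same vectors serve as both keys and values.
context, weights = attend(query=s_prev, keys=h, values=h)
print("weights:", weights.round(3))
print("context:", context.round(3))
```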










## Self-Attention (SA)
----

From: https://blogs.oracle.com/ai-and-datascience/post/multi-head-self-attention-in-nlp

The general attention mechanism can also be explained through the mathematical representation below.

<br>


<img src="https://drive.google.com/uc?id=1IXRH1QBjN3_WDmeO6pkUvytKXIH0Nns-" alt="Drawing" width="600"/>

Source: <a href="https://blogs.oracle.com/ai-and-datascience/post/multi-head-self-attention-in-nlp">Praphul Singh</a>.

<br>

Here, $X$ is the **input word sequence**, from which we compute three matrices: $Q$, $K$ and $V$. The task is to find, among the **keys**, the words that are most important for the **query** word. This is done by passing the query and the keys to a mathematical function (usually a matrix multiplication followed by a softmax). The resulting context vector for $Q$ is the product of the probability vector obtained from the softmax and the **values**.

When the **query**, **key**, and **value** are all generated from the same input sequence $X$, the general attention mechanism is called **self-attention**.
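As a rough sketch, self-attention can be written as one function in which $Q$, $K$ and $V$ are all projections of the same input $X$. The function name `self_attention`, the shapes and the random matrices are assumptions made for illustration. No scaling of the scores is applied, to mirror the "matrix multiplication followed by softmax" description above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax; subtracting the maximum improves numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Queries, keys and values are all computed from the same input X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T                     # every query scored against every key
    weights = softmax(scores, axis=-1)   # one probability vector per query word
    return weights @ V, weights          # context vectors and attention weights

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))              # 4 words, each a 3-dimensional vector
W_Q, W_K, W_V = (rng.normal(size=(3, 3)) for _ in range(3))

context, weights = self_attention(X, W_Q, W_K, W_V)
print(context.shape, weights.shape)
```

Each row of `weights` tells how strongly one word of $X$ attends to every word of the same sequence.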





## Multi-Head Self-Attention (MSA)
----

From: https://blogs.oracle.com/ai-and-datascience/post/multi-head-self-attention-in-nlp

See also:  https://docs.google.com/presentation/d/1ZXFIhYczos679r70Yu8vV9uO6B1J0ztzeDxbnBxD1S0/edit#slide=id.g31364026ad_3_2

**Multi-head** attention simply repeats the computation discussed earlier with several different heads (brains?). The aim is to combine the knowledge gathered by multiple heads or agents instead of relying on a single one, as in the traditional case.

Mathematically, this means attending not only to the different words of the sentence, but also to different segments of each word vector. The word vectors are divided into a fixed number of chunks ($H$, the number of heads), and self-attention is applied to the corresponding chunks, resulting in $H$ context vectors for each word. The final context vector is obtained by concatenating those $H$ context vectors.
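The chunk-and-concatenate procedure can be sketched as follows. This is an illustration under assumptions: the split is applied to the projected queries, keys and values (as is common in Transformer implementations), the feature dimension must be divisible by $H$, and the final output projection used in full Transformer layers is omitted. The function names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention for one head: score, normalise, take the weighted sum of values.
    return softmax(Q @ K.T, axis=-1) @ V

def multi_head_self_attention(X, W_Q, W_K, W_V, num_heads):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Split the feature dimension into num_heads equal chunks.
    Q_chunks = np.split(Q, num_heads, axis=-1)
    K_chunks = np.split(K, num_heads, axis=-1)
    V_chunks = np.split(V, num_heads, axis=-1)
    # Apply attention to corresponding chunks: one context matrix per head.
    heads = [attention(q, k, v) for q, k, v in zip(Q_chunks, K_chunks, V_chunks)]
    # Concatenate the H context vectors of each word into the final one.
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 8))                       # 4 words, 8 features each
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))

out = multi_head_self_attention(X, W_Q, W_K, W_V, num_heads=2)
print(out.shape)
```

With two heads, the first four output features of each word come from the first head and the last four from the second.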

<br>

<img src="https://drive.google.com/uc?id=1kRiTpmdEU1U1EsxrVRMWdTcmeIOgb06U" alt="Drawing" width="450"/>

Source: <a href="https://blogs.oracle.com/ai-and-datascience/post/multi-head-self-attention-in-nlp">Praphul Singh</a>.

<br>

Above is a visualisation of the outputs when using two heads. We can see that if the **query** word is "it", the first head focuses more on the words "the animal", and the second head focuses more on the word "tired". Hence, the final context representation focuses on all of the words **"the", "animal" and "tired"**, which makes it a richer representation than the one obtained with a single head.

In [None]:
from numpy import array
from numpy import random
from numpy import dot
from scipy.special import softmax
 
# encoder representations of four different words, defined manually here;
# in practice they would be generated by an encoder
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])
 
# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])
print('Words:\n {} \n Shape: {}\n\n'.format(words,words.shape)) 


Words:
 [[1 0 0]
 [0 1 0]
 [1 1 0]
 [0 0 1]] 
 Shape: (4, 3)




The next step generates the weight matrices, which you will eventually multiply with the word embeddings to generate the queries, keys, and values. Here, these weight matrices are generated randomly; in actual practice, they would have been learned during training.

In [None]:
# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

print('Weight matrix - Queries:\n {} \n Shape: {}\n\n'.format(W_Q,W_Q.shape))
print('Weight matrix - Keys:\n {} \n Shape: {}\n\n'.format(W_K,W_K.shape))
print('Weight matrix - Values:\n {} \n Shape: {}\n\n'.format(W_V,W_V.shape))

Weight matrix - Queries:
 [[2 0 2]
 [2 0 0]
 [2 1 2]] 
 Shape: (3, 3)


Weight matrix - Keys:
 [[2 2 2]
 [0 2 1]
 [0 1 1]] 
 Shape: (3, 3)


Weight matrix - Values:
 [[1 1 0]
 [0 1 1]
 [0 0 0]] 
 Shape: (3, 3)




Subsequently, the query, key, and value vectors for each word are generated by multiplying each word embedding by each of the weight matrices.

In [None]:
# generating the queries, keys and values.
# PS: In Python, @ is a binary operator used for matrix multiplication.

Q = words @ W_Q
K = words @ W_K
V = words @ W_V

print('Queries:\n {} \n Shape: {}\n\n'.format(Q,Q.shape))
print('Keys:\n {} \n Shape: {}\n\n'.format(K,K.shape))
print('Values:\n {} \n Shape: {}\n\n'.format(V,V.shape))

Queries:
 [[2 0 2]
 [2 0 0]
 [4 0 2]
 [2 1 2]] 
 Shape: (4, 3)


Keys:
 [[2 2 2]
 [0 2 1]
 [2 4 3]
 [0 1 1]] 
 Shape: (4, 3)


Values:
 [[1 1 0]
 [0 1 1]
 [1 2 1]
 [0 0 0]] 
 Shape: (4, 3)




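Each query vector is then scored against all the key vectors with a dot product. The score values are subsequently passed through a softmax operation to generate the weights. Before doing so, it is common practice to divide the score values by the square root of the dimensionality of the key vectors (in this case, three) to keep the gradients stable. Finally, the attention output is computed as a weighted sum of the value vectors.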
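The effect of this scaling can be seen in a small experiment. The snippet is a sketch with hypothetical random vectors and an assumed key dimensionality of 64: for larger dimensionalities, raw dot products tend to have larger magnitudes, which pushes the softmax towards a near one-hot output, while the scaled version stays smoother.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 64                              # assumed key dimensionality for this demo
query = rng.normal(size=d_k)
keys = rng.normal(size=(5, d_k))

raw_scores = keys @ query             # unscaled dot products
scaled_scores = raw_scores / np.sqrt(d_k)

weights_raw = softmax(raw_scores)
weights_scaled = softmax(scaled_scores)

print("unscaled weights:", weights_raw.round(3))
print("scaled weights:  ", weights_scaled.round(3))
```

Dividing by a constant greater than one never increases the largest weight, so the scaled distribution is at least as spread out as the unscaled one.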

In [None]:
# scoring the query vectors against all key vectors
scores = Q @ K.transpose()
 
# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)
 
# computing the attention by a weighted sum of the value vectors
attention = weights @ V
 
print('Attention:\n', attention)

Attention:
 [[0.98522025 1.74174051 0.75652026]
 [0.90965265 1.40965265 0.5       ]
 [0.99851226 1.75849334 0.75998108]
 [0.99560386 1.90407309 0.90846923]]
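
To see what the matrix computation does for a single word, the first row of the result can be reproduced step by step. The sketch below hard-codes the key and value matrices and the first query vector from the outputs printed above; the variable names ending in `_1` are introduced here only for illustration.

```python
import numpy as np

# Keys and values copied from the outputs printed above.
K = np.array([[2, 2, 2],
              [0, 2, 1],
              [2, 4, 3],
              [0, 1, 1]])
V = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 2, 1],
              [0, 0, 0]])
query_1 = np.array([2, 0, 2])         # query vector of the first word

# Score the first query against all four keys with a dot product.
scores_1 = query_1 @ K.T

# Scale by the square root of the key dimensionality, then apply a softmax.
scaled_1 = scores_1 / np.sqrt(K.shape[1])
weights_1 = np.exp(scaled_1 - scaled_1.max())
weights_1 /= weights_1.sum()

# The attention output is the weighted sum of the value vectors.
attention_1 = weights_1 @ V

print("scores: ", scores_1)
print("weights:", weights_1.round(3))
print("attention:", attention_1)
```

The resulting vector matches the first row of the attention matrix printed above, and the largest weight falls on the third word, whose key has the highest dot product with the first query.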
