<a href="https://colab.research.google.com/github/victorviro/Deep_learning_python/blob/master/NLP_Attention_and_Transformer_architecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In a previous notebook, we explained an important NLP task: [*neural machine translation*](https://github.com/victorviro/Deep_learning_python/blob/master/NLP_Encoder_Decoder_NMT.ipynb) (NMT), using a pure Encoder-Decoder model. We also saw how we can improve this architecture using bidirectional RNNs or the beam search algorithm ([notebook](https://github.com/victorviro/Deep_learning_python/blob/master/Bidirectional_RNNs_and_Beam_Search.ipynb)). Let's see how we can improve it further with attention mechanisms, and finally we will look at the extraordinary Transformer architecture.

## Attention Mechanisms

Consider the path from the word "milk" to its translation "lait" in Figure 16-3 which depicts a pure encoder-decoder architecture: 

![Encoder-decoder model for machine translation](https://i.ibb.co/9mx5s5B/machine-learning-model.png)

This path is quite long! This means that a representation of this word (along with all the other words) needs to be carried over many steps before it is actually used. Now consider a very long sentence. The encoder needs to read the whole sentence, "memorize" it, and store it in the activations that are passed to the decoder, which then generates the translation. Note that this is not how a human translator would work, because it is hard to memorize a whole long sentence. A human translator would read the first part of the sentence and perhaps translate it, then read a second part and translate a few more words, and so on. As a result, the performance of a pure encoder-decoder model decreases when the sentences are very long (more than 30 or 40 words). An attention mechanism translates a bit more like a human might, looking at one part of the sentence at a time, and this allows the system to get good results even on very long sentences. A further explanation of the intuition behind the attention model can be seen in this [video](https://youtu.be/SysgYptB198?list=PLkDaE6sCZn6F6wUI9tvS_Gw1vaFAx6rd6).

The attention mechanism was the core idea of a groundbreaking 2014 [paper](https://arxiv.org/abs/1409.0473). The authors introduced a technique that allowed the decoder to focus on the appropriate words (as encoded by the encoder) at each time step. For example, at the time step where the decoder needs to output the word "lait", it will focus its attention on the word "milk". This means that the path from an input word to its translation is now much shorter, so the short-term memory limitations of RNNs have much less impact. Attention mechanisms revolutionized neural machine translation (and NLP in general), allowing a significant improvement in the state of the art, especially for long sentences (over 30 words).

Figure 16-6 shows this model’s architecture (slightly simplified, as we will see). 

![](https://i.ibb.co/DfZ2vjr/attention-module-nlp.png)

On the left, we have the encoder and the decoder. Instead of just sending the encoder’s final hidden state to the decoder (which is still done, although it is not shown in the figure), we now also send all of its outputs to the decoder. At each time step, the decoder’s memory cell computes a weighted sum of all these encoder outputs: this determines which words it will focus on at this step. The weight $\alpha(t,i)$ is the weight of the $i^{\text{th}}$ encoder output at the $t^{\text{th}}$ decoder time step: it indicates how much attention the decoder pays to the input word $x_{(i)}$ when predicting the output word at time step $t$. For example, if the weight $\alpha(3,2)$ is much larger than the weights $\alpha(3,0)$ and $\alpha(3,1)$, then the decoder will pay much more attention to word number 2 (“milk”) than to the other two words, at least at this time step. The rest of the decoder works just like earlier: at each time step the memory cell receives the inputs we just discussed, plus the hidden state from the previous time step, and finally (although it is not represented in the diagram) it receives the target word from the previous time step (or at inference time, the output from the previous time step).

But where do these $\alpha(t,i)$ weights come from? It’s actually pretty simple: they are generated by a type of small neural network called an *alignment model* (or an *attention layer*), which is trained jointly with the rest of the Encoder-Decoder model. This alignment model is illustrated on the righthand side of Figure 16-6. It starts with a `Dense` layer with a single neuron, which receives as input all the encoder outputs, concatenated with the decoder’s previous hidden state (e.g., $\boldsymbol{h}_{(2)}$). This layer outputs a score (or energy) for each encoder output (e.g., $e_{(3, 2)}$): this score measures how well each output is aligned with the decoder’s previous hidden state. Finally, all the scores go through a softmax layer to get a final weight for each encoder output (e.g., $\alpha(3,2)$). All the weights for a given decoder time step add up to 1 (since a softmax layer is used). This particular attention mechanism is called *Bahdanau attention* (named after the paper’s first author). Since it concatenates the encoder output with the decoder’s previous hidden state, it is sometimes called *concatenative attention* (or *additive attention*).

**Note**: If the input sentence is $n$ words long, and assuming the output sentence is about as long, then this model will need to compute about $n^2$ weights. Fortunately, this quadratic computational complexity is still tractable because even long sentences don’t have thousands of words.

Another common attention mechanism was proposed shortly after, in a 2015 [paper](https://arxiv.org/abs/1508.04025). Because the goal of the attention mechanism is to measure the similarity between one of the encoder’s outputs and the decoder’s previous hidden state, the authors proposed to simply compute the *dot product* of these two vectors, as this is often a fairly good similarity measure, and modern hardware can compute it much faster. For this to be possible, both vectors must have the same dimensionality. This is called *Luong attention* (again, after the paper’s first author), or sometimes *multiplicative attention*. The dot product gives a score, and all the scores (at a given decoder time step) go through a softmax layer to give the final weights, just like in Bahdanau attention. Another simplification they proposed was to use the decoder’s hidden state at the current time step rather than at the previous time step (i.e., $\boldsymbol{h}_{(t)}$ rather than $\boldsymbol{h}_{(t-1)}$), then to use the output of the attention mechanism (noted $\hat{\boldsymbol{h}}_{(t)}$) directly to compute the decoder’s predictions (rather than using it to compute the decoder’s current hidden state). They also proposed a variant of the dot product mechanism where the encoder outputs first go through a linear transformation (i.e., a Dense layer without a bias term) before the dot products are computed. This is called the "general" dot product approach. They compared both dot product approaches to the concatenative attention mechanism (adding a rescaling parameter vector $\boldsymbol{v}$), and they observed that the dot product variants performed better than concatenative attention. For this reason, concatenative attention is much less used now. These three attention mechanisms are summarized in the following equations.

$$\hat{\boldsymbol{h}}_{(t)}=\sum_{i}\alpha(t,i)\boldsymbol{y}_{(i)}$$

where 

$$\alpha(t,i)=\frac{\text{exp}(e_{(t,i)})}{\sum_{j}\text{exp}(e_{(t,j)})}$$

and 

$$
e_{(t,i)}=\begin{cases}
              \boldsymbol{h}_{(t)}^T\boldsymbol{y}_{(i)} & \text{dot }\\\\
              \boldsymbol{h}_{(t)}^T\boldsymbol{W}\boldsymbol{y}_{(i)} & \text{general }\\\\
              \boldsymbol{v}^T \text{tanh}(\boldsymbol{W}[\boldsymbol{h}_{(t)},\boldsymbol{y}_{(i)}]) & \text{concat }\\
\end{cases}
$$
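As a rough illustration of the three scoring functions above, here is a minimal NumPy sketch. The function names (`softmax`, `attention_scores`, `attend`), the `mode` argument, and the shapes chosen for $\boldsymbol{W}$ and $\boldsymbol{v}$ are assumptions made for this example; they do not come from the papers or from any library, and random values stand in for trained parameters.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def attention_scores(h_t, Y, mode="dot", W=None, v=None):
    """Scores e_(t,i) of one decoder state h_t [d] against encoder outputs Y [n, d]."""
    if mode == "dot":        # h_t^T y_i
        return Y @ h_t
    if mode == "general":    # h_t^T W y_i, with W of shape [d, d]
        return (Y @ W.T) @ h_t
    if mode == "concat":     # v^T tanh(W [h_t; y_i]), W of shape [u, 2d], v of shape [u]
        concat = np.concatenate([np.tile(h_t, (len(Y), 1)), Y], axis=1)
        return np.tanh(concat @ W.T) @ v
    raise ValueError(f"unknown mode: {mode}")

def attend(h_t, Y, **kwargs):
    """Return the weights alpha(t, .) and the attention output (weighted sum of Y)."""
    alpha = softmax(attention_scores(h_t, Y, **kwargs))
    return alpha, alpha @ Y

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 4))      # 5 encoder outputs of dimension 4 (random stand-ins)
h_t = rng.normal(size=4)         # decoder hidden state at step t
alpha, h_hat = attend(h_t, Y, mode="dot")
print(alpha.round(3), alpha.sum())  # one weight per encoder output, and their sum
```

Whatever the scoring function, the weights go through the same softmax and produce the same kind of weighted sum; only the way the scores are computed changes.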

A further explanation of how these attention modules work can be seen in this [video](https://youtu.be/quoGRI-1l0A?list=PLkDaE6sCZn6F6wUI9tvS_Gw1vaFAx6rd6).

This [notebook](https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt) shows how we can add Luong attention to an Encoder-Decoder model using TensorFlow Addons. Another [notebook](https://www.tensorflow.org/tutorials/text/nmt_with_attention) adds Bahdanau attention instead.

## Visual Attention


Attention mechanisms are now used for a variety of purposes. One of their first applications beyond NMT was in generating image captions using [visual attention](https://arxiv.org/abs/1502.03044): a convolutional neural network first processes the image and outputs some feature maps, then a decoder RNN equipped with an attention mechanism generates the caption, one word at a time. At each decoder time step (each word), the decoder uses the attention model to focus on just the right part of the image. For example, in Figure 16-7, the model generated the caption "A woman is throwing a frisbee in a park", and we can see what part of the input image the decoder focused its attention on when it was about to output the word "frisbee": clearly, most of its attention was focused on the frisbee.

![Visual attention example](https://i.ibb.co/NNXPsN8/visual-attention.png)

**Note**: One extra benefit of attention mechanisms is that they make it easier to understand what led the model to produce its output. This is called *explainability*. It can be especially useful when the model makes a mistake: for example, if an image of a dog walking in the snow is labeled as "a wolf walking in the snow", then you can go back and check what the model focused on when it output the word "wolf". We may find that it was paying attention not only to the dog but also to the snow, hinting at a possible explanation: perhaps the way the model learned to distinguish dogs from wolves is by checking whether or not there’s a lot of snow around. We can then fix this by training the model with more images of wolves without snow, and dogs with snow. This example comes from a great 2016 [paper](https://arxiv.org/abs/1602.04938) that uses a different approach to explainability: learning an interpretable model locally around a classifier’s prediction. In some applications, explainability is not just a tool to debug a model; it can be a legal requirement (think of a system deciding whether or not it should grant you a loan).

Attention mechanisms are so powerful that we can actually build state-of-the-art models using only attention mechanisms.

## Attention Is All You Need: The Transformer Architecture

In a groundbreaking 2017 [paper](https://arxiv.org/abs/1706.03762) a team of Google researchers suggested that "Attention Is All You Need". They managed to create an architecture called the *Transformer*, which significantly improved the state of the art in NMT without using any recurrent or convolutional layers, just attention mechanisms (plus embedding layers, dense layers, normalization layers, and a few other bits and pieces). As a bonus, this architecture was also much faster to train and easier to parallelize, so they managed to train it at a fraction of the time and cost of the previous state-of-the-art models.

The Transformer architecture is represented in Figure 16-8.

![Transformer architecture](https://i.ibb.co/b56p1mS/transformer-arquitecture.png)

Let’s walk through this figure:

- The lefthand part is the encoder. Just like earlier, it takes as input a batch of sentences represented as sequences of word IDs (the input shape is [*batch size, max input sentence length*]), and it encodes each word into a 512-dimensional representation (so the encoder’s output shape is [*batch size, max input sentence length,* 512]). Note that the top part of the encoder is stacked $N$ times (in the paper, $N = 6$).

- The righthand part is the decoder. During training, it takes the target sentence as input (also represented as a sequence of word IDs), shifted one time step to the right (i.e., a start-of-sequence token is inserted at the beginning). It also receives the outputs of the encoder (i.e., the arrows coming from the left side). Note that the top part of the decoder is also stacked $N$ times, and the encoder stack’s final outputs are fed to the decoder at each of these $N$ levels. Just like earlier, the decoder outputs a probability for each possible next word, at each time step (its output shape is [*batch size, max output sentence length, vocabulary length*]).

- During inference, the decoder cannot be fed targets, so we feed it the previously output words (starting with a start-of-sequence token). So the model needs to be called repeatedly, predicting one more word at every round (which is fed to the decoder at the next round, until the end-of-sequence token is output).



- Looking more closely, there are two embedding layers, $5 \times N$ skip connections, each of them followed by a normalization layer, $2 \times N$ "Feed Forward" modules that are composed of two dense layers each (the first one using the ReLU activation function, the second with no activation function), and finally, the output layer is a dense layer using the softmax activation function. All of these layers are time-distributed, so each word is treated independently of all the others. But how can we translate a sentence by only looking at one word at a time? Well, that’s where the new components come in:

 - The encoder’s *Multi-Head Attention* layer encodes each word’s relationship with every other word in the same sentence, paying more attention to the most relevant ones. For example, the output of this layer for the word "Queen" in the sentence "They welcomed the Queen of the United Kingdom" will depend on all the words in the sentence, but it will probably pay more attention to the words "United" and "Kingdom" than to the words "They" or "welcomed". This attention mechanism is called *self-attention* (the sentence is paying attention to itself). We will discuss exactly how it works shortly. The decoder’s *Masked Multi-Head Attention* layer does the same thing, but each word is only allowed to attend to words located before it. Finally, the decoder’s upper Multi-Head Attention layer is where the decoder pays attention to the words in the input sentence. For example, the decoder will probably pay close attention to the word "Queen" in the input sentence when it is about to output this word’s translation.

 - The *positional encodings* are simply dense vectors (much like word embeddings) that represent the position of a word in the sentence. This technique is needed because the architecture has no built-in notion of word order: all the words of the input sequence are fed to the Multi-Head Attention layers at once, and these layers do not consider the order or position of the words (unlike common RNN or ConvNet architectures); they only look at the relationships between them. Thus, the model has no idea how the words are ordered. Consequently, a position-dependent signal is added to each word embedding to help the model incorporate the order of the words. This addition preserves the embedding information while adding the vital position information.
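The inference procedure described in the walkthrough above (feeding the decoder its own previous outputs until the end-of-sequence token appears) can be sketched as a greedy decoding loop. This is only an illustration: `next_token_probs` is a hypothetical callable standing in for the trained model (with the encoder input already baked in), and `greedy_decode`, `toy_model`, and the token IDs are made up for this example.

```python
def greedy_decode(next_token_probs, sos_id, eos_id, max_len=50):
    """Call the model repeatedly, appending the most probable next token each round.

    next_token_probs(tokens) is a hypothetical stand-in for the trained model:
    it receives the decoder input so far and returns a list of probabilities
    over the vocabulary for the next token.
    """
    tokens = [sos_id]                      # start with the start-of-sequence token
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if next_id == eos_id:              # stop once end-of-sequence is predicted
            break
        tokens.append(next_id)             # feed the prediction back at the next round
    return tokens[1:]                      # drop the start-of-sequence token

def toy_model(tokens):
    """Toy stand-in: deterministically emits token 3, then 4, then EOS (ID 2)."""
    script = [3, 4, 2]
    target = script[len(tokens) - 1]
    return [1.0 if i == target else 0.0 for i in range(5)]

print(greedy_decode(toy_model, sos_id=1, eos_id=2))  # [3, 4]
```

Greedy selection is the simplest choice; the beam search algorithm mentioned at the start of this notebook keeps several candidate sequences instead of one.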


Let’s look a bit closer at both these novel components of the Transformer architecture, starting with the positional encodings.

### Positional encodings

A positional encoding is a dense vector that encodes the position of a word within a sentence: the $i^{\text{th}}$ positional encoding is simply added to the word embedding of the $i^{\text{th}}$ word in the sentence. These positional encodings can be learned by the model, but in the paper, the authors preferred to use fixed positional encodings, defined using the sine and cosine functions of different frequencies. The positional encoding matrix $\boldsymbol{P}$ is defined in Equation 16-2 and represented at the bottom of Figure 16-9 (transposed), where $P_{(p,i)}$ is the $i^{\text{th}}$ component of the embedding for the word located at the $p^{\text{th}}$ position in the sentence.

$$
\begin{cases}
P_{(p,2i)} = \sin\left(\frac{p}{10000^{\frac{2i}{d}}}\right)\\\\
P_{(p,2i+1)} = \cos\left(\frac{p}{10000^{\frac{2i}{d}}}\right)
\end{cases}
$$

where $d$ is the dimension of the word embeddings. 

For example, for word $w$ at position $\text{p}\in[0,L-1]$ in the input sequence $w=(w_0,...,w_{L-1})$, with 4-dimensional embedding $e_w$ ($d=4$ thus $i=0,1$), the operation would be

$$e^{\prime}_w=e_w+[\sin(\frac{p}{10000^0}),\cos(\frac{p}{10000^0}),\sin(\frac{p}{10000^{\frac{2}{4}}}),\cos(\frac{p}{10000^{\frac{2}{4}}})]=e_w+[\sin(p),\cos(p),\sin(\frac{p}{100}),\cos(\frac{p}{100})]$$
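To check this worked example numerically, here is a small NumPy sketch that builds the positional encoding matrix from the definition and compares one row with the closed form for $d=4$. The helper name `positional_encoding` and the choice of 10 positions and $p=3$ are assumptions made for this illustration.

```python
import numpy as np

def positional_encoding(max_steps, d):
    """Return the [max_steps, d] matrix P of sinusoidal encodings (d must be even)."""
    p = np.arange(max_steps)[:, np.newaxis]       # positions, shape [max_steps, 1]
    two_i = np.arange(0, d, 2)[np.newaxis, :]     # even indices 2i, shape [1, d / 2]
    angles = p / 10000 ** (two_i / d)             # shape [max_steps, d / 2]
    P = np.empty((max_steps, d))
    P[:, 0::2] = np.sin(angles)                   # components 2i
    P[:, 1::2] = np.cos(angles)                   # components 2i + 1
    return P

P = positional_encoding(max_steps=10, d=4)
p = 3
expected = [np.sin(p), np.cos(p), np.sin(p / 100), np.cos(p / 100)]
print(np.allclose(P[p], expected))
```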


In the next figure, we can visualize another example of positional encodings with sentences with $L=200$ words and $d=150$:

![Positional encoding matrix](https://i.ibb.co/K5KDjFx/positional-encoding-matrix.png)

This solution gives the same performance as learned positional encodings do, but it can extend to arbitrarily long sentences, which is why it’s favored. After the positional encodings are added to the word embeddings, the rest of the model has access to the absolute position of each word in the sentence because there is a unique positional encoding for each position (e.g., the positional encoding for the word located at the position $p=22$ in a sentence is represented by the vertical dashed line at the bottom left of Figure 16-9, and we can see that it is unique to that position). Moreover, the choice of oscillating functions (sine and cosine) makes it possible for the model to learn relative positions as well. For example, words located 38 words apart (e.g., at positions $p = 22$ and $p = 60$) always have the same positional encoding values in the embedding dimensions $i = 100$ and $i = 101$, as we can see in Figure 16-9. This explains why we need both the sine and the cosine for each frequency: if we only used the sine (the blue wave at $i = 100$), the model would not be able to distinguish positions $p = 25$ and $p = 35$ (marked by a cross).
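One way to see why the sine/cosine pair helps with relative positions is the angle-addition identity. Writing $\omega_i = 1/10000^{\frac{2i}{d}}$ (a shorthand introduced here, not used elsewhere in this notebook), the pair of components at position $p+k$ is a rotation of the pair at position $p$, and the rotation depends only on the offset $k$:

$$
\begin{pmatrix}\sin(\omega_i (p+k)) \\ \cos(\omega_i (p+k))\end{pmatrix}
=
\begin{pmatrix}\cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k)\end{pmatrix}
\begin{pmatrix}\sin(\omega_i p) \\ \cos(\omega_i p)\end{pmatrix}
$$

So the encoding of a word located $k$ positions further along is a linear function of the current encoding, which a dense layer can in principle learn. With the sine alone, no such linear map exists for a single component.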

There is no PositionalEncoding layer in TensorFlow, but it is easy to create one. For efficiency reasons, we precompute the positional encoding matrix in the constructor (so we need to know the maximum sentence length (`max_steps`), and the number of dimensions for each word representation (`max_dims`)). Then the `call()` method crops this embedding matrix to the size of the inputs, and it adds it to the inputs. Since we added an extra first dimension of size 1 when creating the positional encoding matrix, the rules of broadcasting will ensure that the matrix gets added to every sentence in the inputs:

In [None]:
from tensorflow import keras
import numpy as np
import tensorflow as tf

In [None]:
class PositionalEncoding(keras.layers.Layer):
    def __init__(self, max_steps, max_dims, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        if max_dims % 2 == 1: max_dims += 1 # max_dims must be even
        # p holds the positions and i the frequency indices, each of shape [max_dims // 2, max_steps]
        p, i = np.meshgrid(np.arange(max_steps), np.arange(max_dims // 2))
        # Extra first dimension of size 1 so that the matrix broadcasts over the batch
        pos_emb = np.empty((1, max_steps, max_dims))
        pos_emb[0, :, ::2] = np.sin(p / 10000**(2 * i / max_dims)).T   # even components
        pos_emb[0, :, 1::2] = np.cos(p / 10000**(2 * i / max_dims)).T  # odd components
        self.positional_encoding = tf.constant(pos_emb.astype(self.dtype))
    def call(self, inputs):
        # Crop the precomputed matrix to the inputs' sequence length and embedding size
        shape = tf.shape(inputs)
        return inputs + self.positional_encoding[:, :shape[-2], :shape[-1]]

Then we can create the first layers of the Transformer:

In [None]:
embed_size = 512
max_steps = 500
vocab_size = 10000
encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
positional_encoding = PositionalEncoding(max_steps, max_dims=embed_size)
encoder_in = positional_encoding(encoder_embeddings)
decoder_in = positional_encoding(decoder_embeddings)

Now let’s look deeper into the heart of the Transformer model: the Multi-Head Attention layer.



### Multi-head attention 


To understand how a Multi-Head Attention layer works, we must first understand the Scaled Dot-Product Attention layer, which it is based on. Let’s suppose the encoder analyzed the input sentence “They played chess,” and it managed to understand that the word “They” is the subject and the word “played” is the verb, so it encoded this information in the representations of these words. Now suppose the decoder has already translated the subject, and it thinks that it should translate the verb next. For this, it needs to fetch the verb from the input sentence. This is analogous to a dictionary lookup: it’s as if the encoder created a dictionary {“subject”: “They”, “verb”: “played”, …}, and the decoder wanted to look up the value that corresponds to the key “verb.” However, the model does not have discrete tokens to represent the keys (like “subject” or “verb”); it has vectorized representations of these concepts (which it learned during training), so the key it will use for the lookup (called the query) will not perfectly match any key in the dictionary. The solution is to compute a similarity measure between the query and each key in the dictionary, and then use the softmax function to convert these similarity scores to weights that add up to 1. If the key that represents the verb is by far the most similar to the query, then that key’s weight will be close to 1. Then the model can compute a weighted sum of the corresponding values, so if the weight of the “verb” key is close to 1, then the weighted sum will be very close to the representation of the word “played.” In short, we can think of this whole process as a differentiable dictionary lookup. The similarity measure used by the Transformer is just the dot product, like in Luong attention. In fact, the equation is the same as for Luong attention, except for a scaling factor. Its vectorized form is given later in this section.

Let's visualize an example with an input sequence of two words:

![](https://i.ibb.co/G07pY4s/self-attention-process.png)

- The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices ($W^Q$, $W^K$ and $W^V$) that are learned during training. Notice that, as in the example, these new vectors can be smaller in dimension than the embedding vector.

- The second step is to calculate a score. Say we’re calculating the self-attention for the first word in this example, "Thinking". We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position. The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of $q_1$ and $k_1$. The second score would be the dot product of $q_1$ and $k_2$.

- The third and fourth steps are to divide the scores by the square root of the dimension of the key vectors (this leads to more stable gradients; other values are possible, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1. This softmax score determines how much each word will be expressed at this position. The word at this position will usually have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.

- The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

- The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word). The resulting vector is one we can send along to the feed-forward neural network. 
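The six steps above can be traced in a few lines of NumPy for the first word of a two-word input. This is a sketch under stated assumptions: the sizes `d_model` and `d_k` are arbitrary, and random matrices stand in for the embeddings and for the trained $W^Q$, $W^K$ and $W^V$.

```python
import numpy as np

rng = np.random.default_rng(42)
d_model, d_k = 4, 3                      # illustrative sizes (assumptions)
X = rng.normal(size=(2, d_model))        # embeddings of the two words (random stand-ins)
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# Step 1: one query, key, and value vector per word
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 2: score the first word's query against every key
scores = np.array([Q[0] @ K[0], Q[0] @ K[1]])

# Step 3: divide by the square root of the key dimension
scaled = scores / np.sqrt(d_k)

# Step 4: softmax, so the weights are positive and add up to 1
weights = np.exp(scaled - scaled.max())
weights /= weights.sum()

# Step 5: multiply each value vector by its weight
weighted_values = weights[:, np.newaxis] * V

# Step 6: sum the weighted value vectors to get the output for the first word
z1 = weighted_values.sum(axis=0)
print(weights, z1)
```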

In the actual implementation, however, this calculation is done in matrix form for faster processing. So let’s look at that now that we’ve seen the intuition of the calculation on the word level.

The first step is to calculate the Query, Key, and Value matrices ($\boldsymbol{Q}$, $\boldsymbol{K}$ and $\boldsymbol{V}$). We do that by packing our embeddings into a matrix $\boldsymbol{X}$ (every row in the $\boldsymbol{X}$ matrix corresponds to a word in the input sentence), and multiplying it by the weight matrices we’ve trained ($W^Q$, $W^K$ and $W^V$). Finally, since we’re dealing with matrices, we can condense steps two through six into one formula to calculate the outputs of the self-attention layer.

$$\text{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})=\text{softmax}(\frac{\boldsymbol{Q}\boldsymbol{K}^T}{\sqrt{d_{\text{keys}}}})\boldsymbol{V}$$

where:

- $\boldsymbol{Q}$ is a matrix containing one row per query. Its shape is [$n_{\text{queries}}$, $d_{\text{keys}}$], where $n_{\text{queries}}$ is the number of queries and $d_{\text{keys}}$ is the number of dimensions of each query and each key.

- $\boldsymbol{K}$ is a matrix containing one row per key. Its shape is [$n_{\text{keys}}$, $d_{\text{keys}}$], where $n_{\text{keys}}$ is the number of keys and values.

- $\boldsymbol{V}$ is a matrix containing one row per value. Its shape is [$n_{\text{keys}}$, $d_{\text{values}}$], where $d_{\text{values}}$ is the number of dimensions of each value.

- The shape of $\boldsymbol{Q}\boldsymbol{K}^T$ is [$n_{\text{queries}}$, $n_{\text{keys}}$]: it contains one similarity score for each query/key pair. The output of the softmax function has the same shape, but all rows sum up to 1. The final output has a shape of [$n_{\text{queries}}$, $d_{\text{values}}$]: there is one row per query, where each row represents the query result (a weighted sum of the values).

- The scaling factor scales down the similarity scores to avoid saturating the softmax function, which would lead to tiny gradients.

- It is possible to mask out some key/value pairs by adding a very large negative value to the corresponding similarity scores, just before computing the softmax. This is useful in the Masked Multi-Head Attention layer.
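The equation and the shape remarks above can be checked with a minimal NumPy sketch. The function name `scaled_dot_product_attention`, the boolean `mask` convention (`True` means "may attend"), and the constant `-1e9` used as the "very large negative value" are choices made for this example, not part of any library API.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_keys)) V for a single sentence (no batch dimension).

    Q: [n_queries, d_keys], K: [n_keys, d_keys], V: [n_keys, d_values].
    mask: optional boolean array [n_queries, n_keys]; False entries are masked out.
    """
    d_keys = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_keys)                  # [n_queries, n_keys]
    if mask is not None:
        scores = scores + np.where(mask, 0.0, -1e9)     # large negative value before softmax
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # each row sums to 1
    return weights @ V, weights                         # [n_queries, d_values], [n_queries, n_keys]

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 2))   # 5 values of dimension 2
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (3, 2) (3, 5)
```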

In the encoder, this equation is applied to every input sentence in the batch, with $\boldsymbol{Q}$, $\boldsymbol{K}$, and $\boldsymbol{V}$ all equal to the list of words in the input sentence (so each word in the sentence will be compared to every word in the same sentence, including itself). Similarly, in the decoder’s masked attention layer, the equation will be applied to every target sentence in the batch, with $\boldsymbol{Q}$, $\boldsymbol{K}$, and $\boldsymbol{V}$ all equal to the list of words in the target sentence, but this time using a mask to prevent any word from comparing itself to words located after it (at inference time the decoder will only have access to the words it already output, not to future words, so during training we must mask out future output tokens). In the upper attention layer of the decoder, the keys $\boldsymbol{K}$ and values $\boldsymbol{V}$ are simply the list of word encodings produced by the encoder, and the queries $\boldsymbol{Q}$ are the list of word encodings produced by the decoder.
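The masking used in the decoder’s masked attention layer can be illustrated with a lower-triangular mask. In this sketch, random vectors `Z` stand in for the target word representations, the learned projections are left out, and the mask lets each position attend to itself and to earlier positions only; the names and sizes are assumptions for the example.

```python
import numpy as np

n, d = 4, 8                              # 4 target tokens of dimension 8 (illustrative)
rng = np.random.default_rng(0)
Z = rng.normal(size=(n, d))              # stand-in for the decoder's word representations

# True where the key position is not after the query position
causal_mask = np.tril(np.ones((n, n), dtype=bool))

scores = Z @ Z.T / np.sqrt(d)            # Q = K = V = Z (no learned projections here)
scores = scores + np.where(causal_mask, 0.0, -1e9)   # mask out future positions
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ Z

print(weights.round(2))                  # the upper triangle (future tokens) is all zeros
```

Because the first token can only attend to itself, its weight is 1 and its output equals its own representation.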

The `keras.layers.Attention` layer implements Scaled Dot-Product Attention, efficiently applying the previous equation to multiple sentences in a batch. Its inputs are just like $\boldsymbol{Q}$, $\boldsymbol{K}$, and $\boldsymbol{V}$, except with an extra batch dimension (the first dimension). They are passed as a list in the order `[query, value]` or `[query, value, key]`; when the key is omitted, the value is also used as the key.

If we ignore the skip connections, the layer normalization layers, the Feed Forward blocks, and the fact that this is Scaled Dot-Product Attention, not exactly Multi-Head Attention, then the rest of the Transformer model can be implemented like this:

In [None]:
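Z = encoder_in
for _ in range(6):  # N = 6 stacked encoder blocks
    # Self-attention: the query and the value (also used as key) are both Z
    Z = keras.layers.Attention(use_scale=True)([Z, Z])

encoder_outputs = Z
Z = decoder_in
for _ in range(6):  # N = 6 stacked decoder blocks
    # Masked self-attention over the target sentence
    Z = keras.layers.Attention(use_scale=True, causal=True)([Z, Z])
    # Attention over the encoder outputs (the queries come from the decoder)
    Z = keras.layers.Attention(use_scale=True)([Z, encoder_outputs])

outputs = keras.layers.TimeDistributed(
    keras.layers.Dense(vocab_size, activation="softmax"))(Z)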

The `use_scale=True` argument creates an additional parameter that lets the layer learn how to properly downscale the similarity scores. This is a bit different from the Transformer model, which always downscales the similarity scores by the same factor $\sqrt{d_{\text{keys}}}$. The `causal=True` argument, used when creating the decoder’s masked attention layer (the first attention layer in the decoder loop), ensures that each output token only attends to previous output tokens, not future ones. Note that more recent Keras releases replace this constructor argument with a `use_causal_mask=True` argument passed when calling the layer.
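The pieces left out of the simplified model above (the skip connections, the layer normalization, and the Feed Forward blocks) can be sketched in NumPy as follows. This is a simplified illustration: the function names are made up for this example, the layer normalization omits the learned gain and bias, and the sizes and random weights are arbitrary stand-ins.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each word vector (last axis) to zero mean and unit variance.
    Simplification: no learned gain/bias parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two dense layers applied to every position independently: ReLU, then linear."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def add_and_norm(x, sublayer_output):
    """Skip connection followed by layer normalization."""
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                         # illustrative sizes (assumptions)
x = rng.normal(size=(5, d_model))             # 5 word representations
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = add_and_norm(x, feed_forward(x, W1, b1, W2, b2))
print(out.shape)  # (5, 8)
```

In a full encoder block, the same `add_and_norm` wrapper would also surround the attention sub-layer, with the attention output taking the place of the feed-forward output.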

Now it’s time to look at the final piece of the puzzle: what is a Multi-Head Attention layer? Its architecture is shown in Figure 16-10.

![Multi-Head Attention layer architecture](https://i.ibb.co/NnD0xCG/multi-head-attention.png)

As we can see, it is just a bunch of Scaled Dot-Product Attention layers, each preceded by a linear transformation of the values, keys, and queries (i.e., a time-distributed Dense layer with no activation function). All the outputs are simply concatenated, and they go through a final linear transformation (again, time-distributed). But why? What is the intuition behind this architecture? Well, consider the word "played" we discussed earlier (in the sentence "They played chess"). The encoder was smart enough to encode the fact that it is a verb. But the word representation also includes its position in the text, thanks to the positional encodings, and it probably includes many other features that are useful for its translation, such as the fact that it is in the past tense. In short, the word representation encodes many different characteristics of the word. If we just used a single Scaled Dot-Product Attention layer, we would only be able to query all of these characteristics in one shot. This is why the Multi-Head Attention layer applies multiple different linear transformations of the values, keys, and queries: this allows the model to apply many different projections of the word representation into different subspaces, each focusing on a subset of the word’s characteristics. Perhaps one of the linear layers will project the word representation into a subspace where all that remains is the information that the word is a verb, another linear layer will extract just the fact that it is past tense, and so on. Then the Scaled Dot-Product Attention layers implement the lookup phase, and finally, we concatenate all the results and project them back to the original space.
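To make the description above concrete, here is a minimal NumPy sketch of a Multi-Head Attention layer for a single sentence: each head applies its own linear projections, runs Scaled Dot-Product Attention, and the head outputs are concatenated and projected back. The function names, the two-head configuration, and the random weights are assumptions for this illustration; masking, bias terms, and batching are omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X_q, X_kv, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) projection matrices, one triple per head.

    X_q: [n_queries, d_model] supplies the queries;
    X_kv: [n_keys, d_model] supplies the keys and values.
    """
    head_outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X_q @ W_Q, X_kv @ W_K, X_kv @ W_V      # project into this head's subspace
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        head_outputs.append(weights @ V)                  # the lookup phase
    # Concatenate all heads and project back to the original space
    return np.concatenate(head_outputs, axis=-1) @ W_O

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
d_head = d_model // n_heads
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))

X = rng.normal(size=(3, d_model))                 # a 3-word sentence
self_attn = multi_head_attention(X, X, heads, W_O)
print(self_attn.shape)  # (3, 8)
```

Passing the same matrix as `X_q` and `X_kv` gives self-attention; passing decoder representations as `X_q` and encoder outputs as `X_kv` corresponds to the decoder’s upper attention layer.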

A further explanation with nice visualizations of the transformer architecture can be seen in this [article](http://jalammar.github.io/illustrated-transformer/).

When this notebook was written, there was no Transformer class or MultiHeadAttention class available for TensorFlow 2 (more recent releases include a `keras.layers.MultiHeadAttention` layer). In any case, we can check out TensorFlow’s great tutorial for building a [Transformer model for language understanding](https://www.tensorflow.org/tutorials/text/transformer).

# References

- [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)

- [handson-ml2 GitHub repository](https://github.com/ageron/handson-ml2)

- [Attention model Coursera](https://www.coursera.org/lecture/nlp-sequence-models/attention-model-intuition-RDXpX)

- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)

- [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

- [Illustrated transformer post](http://jalammar.github.io/illustrated-transformer/)

- [Transformer model for language understanding with TensorFlow](https://www.tensorflow.org/tutorials/text/transformer)

