## Transformer

The Transformer was proposed in the paper [Attention is All You Need](https://arxiv.org/abs/1706.03762). A TensorFlow implementation of it is available as part of the [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor) package. Harvard's NLP group created a [guide annotating the paper with a PyTorch implementation](http://nlp.seas.harvard.edu/2018/04/03/attention.html). In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one, to make them easier to understand for readers without in-depth knowledge of the subject matter.

**Transformer architecture**

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/t_architecture.png" width="800"/>
  <br>
  <em>Transformer Architecture</em>
</p>



### A High-Level Look

Let's begin by looking at the model as a single black box. In a machine translation application, it takes a sentence in one language and outputs its translation in another.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/the_transformer_3.png" width="800"/>
  <br>
  <em>Transformer Black Box</em>
</p>

Popping open that <em>Optimus Prime</em> goodness, we see an <strong>encoding component</strong>, a <strong>decoding component</strong>, and the connections between them.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/The_transformer_encoders_decoders.png" width="600"/>
  <br>
  <em>Encoder and Decoder</em>
</p>

The encoding component is a stack of encoders (the original paper stacks six of them; there’s nothing magical about the number six, and you can experiment with other depths). The decoding component is a stack of the same number of decoders.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/The_transformer_encoder_decoder_stack.png" width="600"/>
  <br>
  <em>Encoder and Decoder Stack</em>
</p>

All encoders are identical in structure (but they do <strong>not</strong> share weights). Each encoder consists of two main sub-layers:

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/Transformer_encoder.png" width="600"/>
  <br>
  <em>Transformer Encoder</em>
</p>

First, the encoder’s inputs pass through a <strong>self-attention</strong> layer — this layer allows the encoder to consider other words in the input sentence when encoding a specific word. We’ll look more closely at self-attention later in the notebook.

The outputs of the self-attention layer then pass through a <strong>feed-forward neural network</strong>. The same feed-forward network is applied independently to each position.

The decoder has both of these layers too, but it also includes an additional attention layer between them. This extra attention layer helps the decoder focus on relevant parts of the input sentence — similar to the attention mechanism in <a href="https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/">seq2seq models</a>.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/Transformer_decoder.png" width="600"/>
  <br>
  <em>Transformer Decoder</em>
</p>


### Bringing The Tensors Into The Picture

Now that we've seen the major components of the model, let's start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

As is the case in NLP applications in general, we begin by turning each input word into a vector using an [embedding algorithm](https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca).

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/embeddings.png" width="600"/>
  <br>
  <em>Each word is embedded into a vector of size 512.</em>
</p>

The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of size 512 — in the bottom encoder that would be the word embeddings, but in higher encoders, it would be the output of the encoder below. The size of this list is a hyperparameter we can set — basically it matches the length of the longest sentence in our training dataset.

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/encoder_with_tensors.png" width="600"/>
  <br>
  <em>Word vectors flow through the self-attention and feed-forward layers of the encoder.</em>
</p>

Here we begin to see one key property of the Transformer: each word position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, so the various paths can be computed in parallel while flowing through the feed-forward network.
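To make the "no dependencies between positions" point concrete, here is a minimal NumPy sketch of a position-wise feed-forward network. The two-layer ReLU form and the inner size of 2048 come from the original paper rather than from the text above, and the weights are random stand-ins for trained parameters, so treat this as an illustration of the idea and not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048   # d_ff = 2048 is the paper's inner size (an assumption here)

# Random stand-ins for trained parameters.
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

def feed_forward(x):
    """Two linear layers with a ReLU in between; accepts one vector or a stack of vectors."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

Z = rng.normal(size=(3, d_model))   # pretend self-attention outputs for three positions

# Processing each position on its own gives the same result as processing
# all positions together, which is why this layer parallelizes trivially.
one_at_a_time = np.stack([feed_forward(z) for z in Z])
all_at_once = feed_forward(Z)
print(np.allclose(one_at_a_time, all_at_once))
```

Because no row of the output depends on any other row of the input, the per-position paths can be computed in any order, or all at once.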

Next, we'll switch to a shorter sentence and look at what happens in each sub-layer of the encoder.


### Now We're Encoding!

As we've mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a self-attention layer, then into a feed-forward neural network, and then sends the output upwards to the next encoder.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/encoder_with_tensors_2.png" width="600"/>
  <br>
  <em>Each word passes through self-attention and then a feed-forward neural network — the same network is applied to each vector separately.</em>
</p>



### Self-Attention at a High Level

Don’t be fooled by me throwing around the word “self-attention” like it’s a concept everyone should be familiar with — I personally hadn’t come across it until reading the *Attention Is All You Need* paper. Let’s break down how it works.

Say the following sentence is an input sentence we want to translate:

> “The animal didn’t cross the street because it was too tired.”

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s an easy question for a human, but not so trivial for an algorithm.

When the model processes the word “it”, self-attention allows it to associate “it” with “animal”.

As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the sequence for clues that help produce a better encoding for this word.

If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words with the current one it’s processing. Self-attention is the method the Transformer uses to “bake in” the understanding of other relevant words into the one we’re currently processing.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_self-attention_visualization.png" width="500"/>
  <br>
  <em>Encoding the word “it” — part of the attention mechanism focuses on “The animal” and blends its information into the encoding of “it”.</em>
</p>

Be sure to check out the [Tensor2Tensor notebook](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb) where you can load a Transformer model and examine it using this interactive visualization.

### Self-Attention in Detail

Let’s first look at how to calculate self-attention using vectors, then we’ll see how it’s actually implemented — using matrices.

The **first step** in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So, for each word we create a **Query vector**, a **Key vector**, and a **Value vector**. These vectors are produced by multiplying the embedding by three weight matrices that are learned during training.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality 512. They don’t *have* to be smaller — this is an architectural choice that keeps the computation of multi-headed attention (mostly) constant.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_self_attention_vectors.png" width="600"/>
  <br>
  <em>Each word embedding is multiplied by the weight matrices W<sub>Q</sub>, W<sub>K</sub>, and W<sub>V</sub> to produce its query, key, and value vectors.</em>
</p>
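As a rough sketch of this first step, the snippet below projects one 512-dimensional embedding into 64-dimensional query, key, and value vectors. The matrices `W_Q`, `W_K`, and `W_V` are filled with random numbers here purely to show the shapes; in a real model they would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64      # embedding size and q/k/v size used in the paper

# Random stand-ins for the embedding and the three learned weight matrices.
x1 = rng.normal(size=d_model)             # embedding of the first word
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

q1 = x1 @ W_Q   # query vector for word 1
k1 = x1 @ W_K   # key vector for word 1
v1 = x1 @ W_V   # value vector for word 1

print(q1.shape, k1.shape, v1.shape)
```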

What are the “query”, “key”, and “value” vectors?

They’re abstractions that help us calculate and reason about attention. Once you understand how attention scores are computed, you’ll see exactly what role each vector plays.

The **second step** in calculating self-attention is to compute a **score**. Suppose we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score every word in the input sentence relative to this word. The score determines how much focus to place on other parts of the sentence as we encode a word at a given position.

The score is computed by taking the dot product of the **query vector** with the **key vector** of the word being scored. So, if we’re processing self-attention for the word in position **#1**, the first score is the dot product of **q₁** and **k₁**. The second score is the dot product of **q₁** and **k₂**, and so on.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_self_attention_score.png" width="600"/>
  <br>
  <em>Calculating attention scores by taking the dot product of a word’s query vector with every key vector, including its own.</em>
</p>

The **third and fourth steps** are to divide the scores by 8 and then pass the result through a softmax operation. The divisor 8 is the square root of the key-vector dimension used in the paper (64); this scaling helps produce more stable gradients. Other values are possible, but this is the default. Softmax normalizes the scores so they’re all positive and sum to 1.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/self-attention_softmax.png" width="600"/>
  <br>
  <em>The raw scores are scaled and passed through softmax to get normalized attention weights.</em>
</p>

This softmax output determines how much each word will contribute at this position. Naturally, the word at this position often has the highest score, but it’s sometimes useful to attend more strongly to another relevant word.

The **fifth step** is to multiply each value vector by its softmax score (preparing them to be summed up). The intuition here is that important words keep their values intact, while less relevant words are “drowned out” by multiplying them by small numbers (like 0.001).

The **sixth step** is to sum up the weighted value vectors. This produces the output of the self-attention layer for this position (e.g., for the first word).

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/self-attention-output.png" width="600"/>
  <br>
  <em>The final output of self-attention for a word: a sum of the weighted value vectors.</em>
</p>
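Steps two through six for a single position can be sketched in a few lines of NumPy. The function name `attend_one_position` and the random inputs are assumptions made for illustration; only the sequence of operations follows the description above.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend_one_position(q, K, V):
    """Self-attention output for one query vector q, given all keys K and values V."""
    d_k = K.shape[-1]
    scores = K @ q                      # step 2: dot product of q with every key
    scaled = scores / np.sqrt(d_k)      # step 3: divide by sqrt(d_k), i.e. 8 when d_k = 64
    weights = softmax(scaled)           # step 4: softmax
    weighted = weights[:, None] * V     # step 5: scale each value vector by its weight
    z = weighted.sum(axis=0)            # step 6: sum the weighted value vectors
    return z, weights

rng = np.random.default_rng(0)
K = rng.normal(size=(2, 64))    # keys for a two-word sentence (random stand-ins)
V = rng.normal(size=(2, 64))    # values for the same two words
q1 = rng.normal(size=64)        # query of the first word

z1, w1 = attend_one_position(q1, K, V)
print(z1.shape, w1.sum())
```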

That concludes the self-attention calculation! The resulting vector is ready to be passed to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for efficiency. Now that we’ve seen the intuition at the word level, let’s look at how it’s done in matrix form.


### Matrix Calculation of Self-Attention

The **first step** is to calculate the Query, Key, and Value matrices. We do this by packing our embeddings into a matrix **X**, and multiplying it by the weight matrices we’ve learned (**W<sub>Q</sub>**, **W<sub>K</sub>**, **W<sub>V</sub>**).

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/self-attention-matrix-calculation.png" width="500"/>
  <br>
  <em>Each row in X corresponds to a word in the input. The embedding vector has a larger dimension (512) than the q/k/v vectors (64) for computational efficiency.</em>
</p>

**Finally**, since we’re dealing with matrices, we can condense steps two through six into a single formula to calculate the output of the self-attention layer.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/self-attention-matrix-calculation-2.png" width="600"/>
  <br>
  <em>The self-attention calculation in compact matrix form.</em>
</p>
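The compact formula in the figure is usually written as softmax(QKᵀ / √d_k) · V. Below is a minimal NumPy sketch of it; the weight matrices are random stand-ins for learned parameters, and the helper names are chosen for this example only.

```python
import numpy as np

def softmax_rows(x):
    """Numerically stable softmax applied to each row."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention for all positions at once."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # one row per word
    d_k = K.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))
    return weights @ V                         # row i is the output for word i

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
X = rng.normal(size=(2, d_model))              # two word embeddings
W_Q, W_K, W_V = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
                 for _ in range(3))

Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)
```

Each row of `Z` matches what the six per-word steps would produce for that position; the matrix form simply handles every position in one pass.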

### The Beast With Many Heads

The paper further refined the self-attention mechanism by introducing *multi-headed attention*. This improves the model in two ways:

1. It expands the model’s ability to focus on different positions. In the earlier example, **z₁** contains a bit of every other word’s encoding but could still be dominated by the word itself. For a sentence like *“The animal didn’t cross the street because it was too tired,”* it’s helpful for the model to know exactly which word “it” refers to.

2. It gives the model multiple “representation subspaces.” With multi-headed attention, we have multiple sets of Query/Key/Value weight matrices (the Transformer uses 8 attention heads, so there are 8 separate sets). Each is randomly initialized, and after training, each projects the input embeddings (or outputs from previous layers) into different subspaces.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_attention_heads_qkv.png" width="700"/>
  <br>
  <em>Multi-headed attention maintains separate Q/K/V weight matrices for each head. Each head produces different Q/K/V matrices.</em>
</p>

If we run the same self-attention calculation for each head with its unique weights, we get 8 different **Z** matrices:

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_attention_heads_z.png" width="700"/>
  <br>
  <em>Each attention head produces its own output matrix Z.</em>
</p>

This leaves us with a challenge: the feed-forward layer expects a **single matrix** (one vector per word), not 8 separate ones. So we need to combine them.

How? We concatenate the outputs from all heads, then multiply by an additional weight matrix **W<sub>O</sub>**:

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_attention_heads_weight_matrix_o.png" width="700"/>
  <br>
  <em>Concatenating all Z matrices and projecting them back with W<sub>O</sub>.</em>
</p>
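Here is one way the concatenate-then-project step might look in NumPy, assuming 8 heads of size 64 as in the paper. The per-head weights and `W_O` are random stand-ins, and looping over heads is chosen for readability; real implementations typically batch the heads into a single tensor operation.

```python
import numpy as np

def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) tuples, one tuple per attention head."""
    Z_per_head = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax_rows(Q @ K.T / np.sqrt(K.shape[-1]))
        Z_per_head.append(weights @ V)                 # (n_words, d_k)
    Z_concat = np.concatenate(Z_per_head, axis=-1)     # (n_words, n_heads * d_k)
    return Z_concat @ W_O                              # (n_words, d_model)

rng = np.random.default_rng(0)
d_model, d_k, n_heads = 512, 64, 8
scale = d_model ** -0.5
heads = [tuple(rng.normal(scale=scale, size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(scale=scale, size=(n_heads * d_k, d_model))

X = rng.normal(size=(2, d_model))
out = multi_head_attention(X, heads, W_O)
print(out.shape)
```

Note that 8 heads of size 64 concatenate to 512 columns, so the output has the same shape as the input and can go straight into the feed-forward layer.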

That’s pretty much all there is to multi-headed self-attention! It may seem like a lot of matrices, but they all come together in a clear way. Here’s a recap of the whole mechanism:

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_multi-headed_self-attention-recap.png" width="700"/>
  <br>
  <em>Putting it all together: multi-headed self-attention summarized in one diagram.</em>
</p>

Now that we’ve seen how attention heads work, let’s revisit our example to see where the different heads focus when encoding the word *“it”* in the sentence:

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_self-attention_visualization_2.png" width="500"/>
  <br>
  <em>When encoding “it”, one head focuses on “the animal” and another on “tired”, enriching the representation.</em>
</p>

When we add all the attention heads together, the visualization can look more complex and harder to interpret:

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_self-attention_visualization_3.png" width="500"/>
  <br>
  <em>Combined focus of all attention heads for the word “it”.</em>
</p>


### Representing The Order of The Sequence Using Positional Encoding

One thing missing from the model so far is a way to account for the order of words in the input sequence.

To address this, the Transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model can learn, which helps it determine the position of each word or the distance between words in the sequence. The intuition here is that adding these values to the embeddings encodes meaningful position information into the Q/K/V vectors during dot-product attention.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_positional_encoding_vectors.png" width="700"/>
  <br>
  <em>To give the model a sense of word order, we add positional encoding vectors with a predictable pattern.</em>
</p>

If we assume the embedding has a dimensionality of 4, the actual positional encodings might look like this:

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_positional_encoding_example.png" width="700"/>
  <br>
  <em>A toy example of positional encoding with an embedding size of 4.</em>
</p>

What does this pattern look like?

In the next figure, each row corresponds to the positional encoding vector for one word. So the first row is the vector added to the embedding of the first word in an input sequence. Each row contains 512 values — each between -1 and 1 — color-coded so the pattern is visible.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_positional_encoding_large_example.png" width="600"/>
  <br>
  <em>A real example: positional encodings for 20 words (rows) with an embedding size of 512 (columns). The left half comes from a sine function, the right half from a cosine function — they are concatenated.</em>
</p>

The formula for positional encoding is described in the paper (*Attention Is All You Need*, section 3.5). You can see the code for generating positional encodings in [`get_timing_signal_1d()`](https://github.com/tensorflow/tensor2tensor/blob/23bd23b9830059fbc349381b70d9429b5c40a139/tensor2tensor/layers/common_attention.py). This is not the only possible method for positional encoding — but it scales well to unseen sequence lengths (e.g. when the model translates a sentence longer than any in the training set).

**July 2020 Update:**  
The positional encoding shown above is from the Tensor2Tensor implementation of the Transformer. The method described in the original paper is slightly different: it doesn’t concatenate the sine and cosine parts, but interleaves them. The next figure shows what that looks like. [Here’s the code to generate it](https://github.com/jalammar/jalammar.github.io/blob/master/notebookes/transformer/transformer_positional_encoding_graph.ipynb):

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/attention-is-all-you-need-positional-encoding.png" width="600"/>
  <br>
  <em>Interleaved positional encoding as described in the original paper.</em>
</p>
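For reference, here is a short NumPy sketch of the sinusoidal encoding, following the formula in section 3.5 of the paper: sine on even columns and cosine on odd columns. The `interleave=False` branch rearranges the same values into a sine half and a cosine half to mimic the concatenated picture; it is not a reproduction of Tensor2Tensor's `get_timing_signal_1d()`, whose frequencies are computed slightly differently. The function assumes an even `d_model`.

```python
import numpy as np

def positional_encoding(n_positions, d_model, interleave=True):
    """Sinusoidal positional encodings, one row per position (d_model must be even)."""
    positions = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angles = positions / np.power(10000.0, 2 * i / d_model)
    sin, cos = np.sin(angles), np.cos(angles)
    if interleave:
        pe = np.empty((n_positions, d_model))
        pe[:, 0::2] = sin      # even columns
        pe[:, 1::2] = cos      # odd columns
        return pe
    return np.concatenate([sin, cos], axis=-1)         # sine half, then cosine half

pe = positional_encoding(20, 512)                      # 20 words, embedding size 512
pe_concat = positional_encoding(20, 512, interleave=False)
print(pe.shape, pe.min() >= -1.0, pe.max() <= 1.0)
```

Since the values depend only on the position index and the column, the same function can produce encodings for positions beyond any length seen in training.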


### The Residuals

One detail in the encoder’s architecture that we should mention before moving on is that each sub-layer (self-attention, feed-forward) in each encoder has a **residual connection** around it, followed by a [layer normalization](https://arxiv.org/abs/1607.06450) step.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_resideual_layer_norm.png" width="600"/>
  <br>
  <em>Each sub-layer (like self-attention) has a residual connection and layer normalization.</em>
</p>

If we visualize the vectors and the layer norm around the self-attention block, it looks like this:

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_resideual_layer_norm_2.png" width="500"/>
  <br>
  <em>Residual connection around self-attention plus layer norm.</em>
</p>
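The "add and normalize" step can be sketched as `LayerNorm(x + Sublayer(x))`. In the snippet below, `gain` and `bias` are the learnable layer-norm parameters (set to ones and zeros here), and the sub-layer is a placeholder function standing in for self-attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    """Normalize each row to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def add_and_norm(x, sublayer, gain, bias):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x), gain, bias)

rng = np.random.default_rng(0)
d_model = 512
x = rng.normal(size=(2, d_model))            # two positions

def placeholder_sublayer(h):
    # Stand-in for self-attention or the feed-forward network.
    return 0.5 * h

out = add_and_norm(x, placeholder_sublayer,
                   gain=np.ones(d_model), bias=np.zeros(d_model))
print(out.shape)
```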

The same structure applies to the sub-layers of the decoder as well. If we think of a Transformer with two stacked encoders and decoders, it would look like this:

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_resideual_layer_norm_3.png" width="700"/>
  <br>
  <em>Residual connections and layer normalization in stacked encoders and decoders.</em>
</p>


### The Decoder Side

Now that we’ve covered the main ideas on the encoder side, we basically know how the components of the decoder work too — but let’s look at how they work together.

The encoder starts by processing the input sequence. The output of the top encoder is transformed into a set of attention vectors **K** and **V**. These are used by each decoder block in its *encoder-decoder attention* layer to help the decoder focus on relevant parts of the input sequence:

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_decoding_1.gif" width="700"/>
  <br>
  <em>After encoding, decoding begins. Each step outputs a token of the output sequence (e.g., an English translation).</em>
</p>

The decoder repeats this process until it reaches a special `<end of sentence>` symbol, indicating that the output is complete. The output from each step is fed to the bottom decoder block at the next time step, and the decoder layers build up the result step by step, just like the encoders did. And just like the encoder inputs, we embed and add positional encodings to decoder inputs to signal the position of each word.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_decoding_2.gif" width="700"/>
  <br>
  <em>The decoder uses its own output so far as input for the next step, with positional encoding added.</em>
</p>

The self-attention layers in the decoder work slightly differently from those in the encoder:

In the decoder, the self-attention layer can only attend to **earlier positions** in the output sequence. This is enforced by *masking* future positions (setting them to `-inf`) before the softmax step in self-attention.
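A minimal sketch of that masking, assuming single-head attention and random queries and keys: positions above the diagonal of the score matrix are "future" positions, so they are set to `-inf` and receive exactly zero weight after the softmax.

```python
import numpy as np

def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention_weights(Q, K):
    """Attention weights where position i can only see positions 0..i."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide future positions
    return softmax_rows(scores)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))
K = rng.normal(size=(4, 64))
W = masked_self_attention_weights(Q, K)
print(np.round(W, 2))   # lower-triangular: zeros everywhere above the diagonal
```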

The *encoder-decoder attention* layer works just like multi-headed self-attention, except it creates its **Queries** matrix from the decoder layer below it, and takes the **Keys** and **Values** matrices from the encoder stack’s output.
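The only change from self-attention is where Q, K, and V come from. The single-head sketch below uses hypothetical names (`decoder_states`, `encoder_output`) and random weights to show the shapes involved.

```python
import numpy as np

def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def encoder_decoder_attention(decoder_states, encoder_output, W_Q, W_K, W_V):
    """Queries come from the decoder; keys and values come from the encoder output."""
    Q = decoder_states @ W_Q          # (n_target, d_k)
    K = encoder_output @ W_K          # (n_source, d_k)
    V = encoder_output @ W_V          # (n_source, d_k)
    weights = softmax_rows(Q @ K.T / np.sqrt(K.shape[-1]))   # (n_target, n_source)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
scale = d_model ** -0.5
W_Q, W_K, W_V = (rng.normal(scale=scale, size=(d_model, d_k)) for _ in range(3))

encoder_output = rng.normal(size=(3, d_model))   # 3 source words
decoder_states = rng.normal(size=(2, d_model))   # 2 target words generated so far

Z, weights = encoder_decoder_attention(decoder_states, encoder_output, W_Q, W_K, W_V)
print(Z.shape, weights.shape)
```

Each row of `weights` says how much one target position draws on each source word, which is why the matrix is not square when the two sequences differ in length.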


### The Final Linear and Softmax Layer

The decoder stack outputs a vector of floats — but how do we turn that into an actual word? That’s the job of the final **Linear** layer, which is followed by a **Softmax** layer.

The Linear layer is just a fully connected layer that projects the vector produced by the decoder stack into a much larger vector called a *logits vector*.

Suppose our model knows 10,000 unique English words (its *output vocabulary*) learned from the training data. The logits vector would then be 10,000 cells wide — each cell corresponds to the score for one word. That’s how we interpret the model’s output after the Linear layer.

The Softmax layer then turns these scores into probabilities (all positive, all summing to 1.0). The cell with the highest probability is selected, and the word associated with it is produced as the output for this time step.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_decoder_output_softmax.png" width="700"/>
  <br>
  <em>From the decoder’s output vector to the final word: Linear layer projects to logits, Softmax selects the most probable word.</em>
</p>
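As a small illustration, the snippet below runs a decoder output vector through a linear projection and a softmax, then picks the most probable word. To keep the output readable it uses a hypothetical six-word vocabulary instead of 10,000 words, and the projection weights are random, so the predicted word is arbitrary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # toy output vocabulary
d_model = 512

# Random stand-ins for the trained Linear layer.
W = rng.normal(scale=0.02, size=(d_model, len(vocab)))
b = np.zeros(len(vocab))

decoder_output = rng.normal(size=d_model)    # vector produced by the decoder stack

logits = decoder_output @ W + b              # one score per vocabulary word
probs = softmax(logits)                      # all positive, summing to 1
predicted_word = vocab[int(np.argmax(probs))]

print(predicted_word, probs.round(3))
```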


### Recap Of Training

Now that we’ve covered the full forward-pass process through a trained Transformer, it’s useful to look at the intuition behind training the model.

During training, an untrained model goes through the same forward pass. But because we train it on a labeled dataset, we can compare its predictions with the actual correct outputs.

To visualize this, let’s assume our output vocabulary only contains six words: *“a”*, *“am”*, *“i”*, *“thanks”*, *“student”*, and `<eos>` (short for “end of sentence”).

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/vocabulary.png" width="600"/>
  <br>
  <em>The model’s output vocabulary is created in the preprocessing step before training begins.</em>
</p>

Once we define our output vocabulary, we can use a vector of the same width to represent each word — this is called *one-hot encoding*. For example, the word *“am”* can be represented like this:

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/one-hot-vocabulary-example.png" width="500"/>
  <br>
  <em>Example: one-hot encoding for a word in the output vocabulary.</em>
</p>
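A quick sketch of one-hot encoding for this toy vocabulary, assuming the words are indexed in the order listed above:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector of zeros with a single 1 in the cell that belongs to `word`."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

am_vector = one_hot("am")
print(am_vector)
```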

Following this recap, we can now look at the model’s *loss function* — the metric we optimize during training to help the model become accurate.


### The Loss Function

Say we’re training our model. Imagine it’s the first training step, and we’re training on a simple example — translating *“merci”* to *“thanks.”*

What we want is for the model’s output to be a probability distribution that strongly points to the word *“thanks.”* But since the model is untrained, that’s unlikely at first.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/transformer_logits_output_and_label.png" width="500"/>
  <br>
  <em>At first, the randomly initialized model produces an arbitrary probability distribution. We compare this with the correct target, then adjust the model’s weights using backpropagation.</em>
</p>

How do we compare two probability distributions? We measure the difference — typically using [cross-entropy](https://colah.github.io/posts/2015-09-Visual-Information/) or [Kullback–Leibler divergence](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained).
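Both measures are short to write down. The sketch below compares the one-hot target for *“thanks”* against two made-up model outputs; the vocabulary order and the probability values are assumptions chosen for illustration.

```python
import numpy as np

def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return float(-np.sum(target * np.log(predicted + eps)))

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), skipping cells where the target is zero."""
    mask = target > 0
    return float(np.sum(target[mask] * np.log(target[mask] / (predicted[mask] + eps))))

# Assumed vocabulary order: a, am, i, thanks, student, <eos>
target = np.array([0, 0, 0, 1, 0, 0], dtype=float)        # one-hot for "thanks"
untrained = np.full(6, 1 / 6)                             # a uniform guess
better = np.array([0.02, 0.02, 0.02, 0.90, 0.02, 0.02])   # mostly "thanks"

loss_untrained = cross_entropy(target, untrained)
loss_better = cross_entropy(target, better)
print(loss_untrained, loss_better)
```

With a one-hot target the two measures coincide, because the target distribution has zero entropy; in both cases the loss shrinks as more probability mass lands on the correct word.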

Of course, this is an oversimplified one-word example. More realistically, the input might be *“je suis étudiant”* with the expected output *“i am a student.”* This means we want the model to output a series of probability distributions such that:

- Each distribution is a vector of size `vocab_size` (6 in our toy example, but realistically 30,000 or more).
- The first output distribution has the highest probability for *“i.”*
- The second output distribution has the highest probability for *“am.”*
- And so on, until the final output distribution signals `<end of sentence>` using its own cell in the vocabulary.

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/output_target_probability_distributions.png" width="600"/>
  <br>
  <em>Target probability distributions for one training example.</em>
</p>

After training the model long enough on a large dataset, we hope its output distributions look like this:

<p align="center">
  <img src="https://raw.githubusercontent.com/sarakpyny/CREST-Internship/main/img/transformer/output_trained_model_probability_distributions.png" width="600"/>
  <br>
  <em>After training, the model ideally outputs the expected translations. Note: every word still gets some probability mass — a useful property of softmax for training.</em>
</p>

Since the model produces its outputs one word at a time, it can simply pick the word with the highest probability at each step and discard the rest. This is called *greedy decoding*.

Another option is to hold on to, say, the top two candidates for the first word (*“I”* and *“a”*, for example). In the next step we run the model twice: once assuming the first output position was *“I”*, and once assuming it was *“a”*. Whichever version produces less error, considering both positions #1 and #2, is kept. We repeat this for positions #2 and #3, and so on.

This method is called *beam search*. Here `beam_size` is how many partial sequences we keep at each step (two in this example), and `top_beams` is how many final translations we return. Both are hyperparameters you can experiment with.
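To see the mechanics, here is a small, self-contained sketch of beam search over a made-up "model". `TOY_MODEL` is a hypothetical lookup table of next-word probabilities, not the output of a real Transformer, and the search ranks hypotheses by total log-probability (higher is better), which plays the role of "less error" in the description above. Setting `beam_size=1` reduces it to greedy decoding.

```python
import math

# Hypothetical next-word probabilities, keyed by the words generated so far.
TOY_MODEL = {
    (): {"i": 0.6, "a": 0.4},
    ("i",): {"am": 0.9, "<eos>": 0.1},
    ("a",): {"student": 0.9, "<eos>": 0.1},
    ("i", "am"): {"a": 0.8, "<eos>": 0.2},
    ("i", "am", "a"): {"student": 0.9, "<eos>": 0.1},
    ("i", "am", "a", "student"): {"<eos>": 1.0},
    ("a", "student"): {"<eos>": 1.0},
}

def toy_next_token_probs(tokens):
    """Stand-in for a forward pass: returns {word: probability} for the next position."""
    return TOY_MODEL.get(tuple(tokens), {"<eos>": 1.0})

def beam_search(next_token_probs, beam_size, top_beams, max_len, eos="<eos>"):
    beams = [([], 0.0)]          # (tokens so far, total log-probability)
    finished = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next word.
        candidates = []
        for tokens, score in beams:
            for word, p in next_token_probs(tokens).items():
                if p > 0:
                    candidates.append((tokens + [word], score + math.log(p)))
        # Keep only the beam_size best; hypotheses ending in <eos> are set aside.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)       # hypotheses cut off by max_len, if any
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:top_beams]

beam_result = beam_search(toy_next_token_probs, beam_size=2, top_beams=2, max_len=6)
greedy_result = beam_search(toy_next_token_probs, beam_size=1, top_beams=1, max_len=6)

for tokens, score in beam_result:
    print(" ".join(tokens), round(math.exp(score), 4))
```

One design choice worth noting: a hypothesis that ends in `<eos>` still uses up one of the `beam_size` slots in the step where it finishes. Other variants refill that slot with the next-best live hypothesis.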
