# GPT2 Transformer Architecture


#### Encoder-Decoder Architecture has been successful

![alt](image/1a.jpg)

### Language Modeling

> A language model is a machine learning model that looks at part of a sentence and predicts the next word. Example: the word suggestions on a smartphone keyboard.


[Playground :- Try Yourself](https://demo.allennlp.org/next-token-lm?text=A%20)


### Variants of GPT2

![alt](image/1.jpg)

### Difference From BERT

![alt](image/2.jpg)


**GPT-2** is built using transformer **decoder blocks** whereas **BERT** uses transformer **encoder blocks**. 

> The key difference between BERT and GPT-2 is that GPT-2, like traditional language models, outputs one token at a time.

These models work as follows: after each token is produced, that token is added to the sequence of inputs, and the new sequence becomes the input to the model in its next step. This idea is called “auto-regression”.

**BERT is not auto regressive.**

![alt](image/3.gif)
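The auto-regressive loop can be sketched in a few lines of plain Python. This is only an illustration: `generate`, `toy_next_token`, and `TOY_TABLE` are hypothetical names, and the lookup table stands in for a trained model, which would score its whole vocabulary at each step.

```python
def generate(next_token, prompt, max_new_tokens, end_token="<|endoftext|>"):
    """Autoregressive loop: each produced token is appended to the input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)   # the model sees everything produced so far
        if token == end_token:
            break
        tokens.append(token)         # the output becomes part of the next input
    return tokens


# Hypothetical stand-in for a trained model: a fixed lookup on the last token.
TOY_TABLE = {"a": "robot", "robot": "must", "must": "obey", "obey": "<|endoftext|>"}


def toy_next_token(tokens):
    return TOY_TABLE.get(tokens[-1], "<|endoftext|>")


result = generate(toy_next_token, ["a"], max_new_tokens=10)
print(result)  # ['a', 'robot', 'must', 'obey']
```

The important part is the `tokens.append(token)` line: nothing else distinguishes an auto-regressive model from one that reads a fixed input once.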

## Transformer BLOCK

### Encoder BLOCK

![alt](image/41.jpg)

### Decoder Block

![alt](image/5.jpg)

The decoder block has a small architectural variation from the encoder block: an extra layer that lets it pay attention to specific segments of the encoder’s output.

One key difference in the self-attention layer here, is that it masks future tokens – not by changing the word to [mask] like BERT, but by interfering in the self-attention calculation blocking information from tokens that are to the right of the position being calculated.

If, for example, we’re to highlight the path of position #4, we can see that it is only allowed to attend to the present and previous tokens:

![alt](image/5a.jpg)

It’s important that the distinction between self-attention (what BERT uses) and masked self-attention (what GPT-2 uses) is clear. A normal self-attention block allows a position to peek at tokens to its right. Masked self-attention prevents that from happening:

![alt](image/6a.jpg)

### The Decoder-Only Block

[Generating Wikipedia by Summarizing Long Sequences](https://arxiv.org/pdf/1801.10198.pdf) proposed another arrangement of the transformer block that is capable of doing language modeling. This model threw away the Transformer encoder. For that reason, let’s call the model the “Transformer-Decoder”.

![alt](image/7a.jpg)

These blocks were very similar to the original decoder blocks, except they did away with the second attention layer (the one that attends to the encoder’s output).

**The OpenAI GPT-2 model uses these decoder-only blocks.**

### Doing a Post-Mortem of GPT2

Let’s lay a trained GPT-2 on our surgery table and look at how it works.

![alt](image/6.jpg)

**The GPT-2 can process 1024 tokens. Each token flows through all the decoder blocks along its own path.**

The simplest way to run a trained GPT-2 is to allow it to ramble on its own (which is technically called generating unconditional samples). Alternatively, we can give it a prompt to have it speak about a certain topic (a.k.a. generating interactive conditional samples). In the rambling case, we can simply hand it the start token and have it start generating words. The trained model uses `<|endoftext|>` as its start token; let’s call it `s` instead.


![alt](image/7.jpg)

The model only has one input token, so that path would be the only active one. The token is processed successively through all the layers, then a vector is produced along that path. That vector can be scored against the model’s vocabulary (all the tokens the model knows, 50,257 in the case of GPT-2). In this case we selected the token with the highest probability, ‘the’. But we can certainly mix things up. You know how, if you keep clicking the suggested word in your keyboard app, it can sometimes get stuck in repetitive loops where the only way out is to click the second or third suggested word? The same can happen here. GPT-2 has a parameter called top-k that lets the model sample from words other than the top word (always picking the top word corresponds to top-k = 1).


In the next step, we add the output from the first step to our input sequence, and have the model make its next prediction:

![alt](image/8.jpg)

Notice that the second path is the only one that’s active in this calculation. Each layer of GPT-2 has retained its own interpretation of the first token and will use it in processing the second token (we’ll get into more detail about this in the following section about self-attention). GPT-2 does not re-interpret the first token in light of the second token.

## Going Deeper

### Input Encoding

Let’s look at more details to get to know the model more intimately. Let’s start from the input. As in other NLP models we’ve discussed before, the model looks up the embedding of the input word in its embedding matrix – one of the components we get as part of a trained model.

![alt](image/9.jpg)

So in the beginning, we look up the embedding of the start token s in the embedding matrix. Before handing that to the first block in the model, we need to incorporate positional encoding – a signal that indicates the order of the words in the sequence to the transformer blocks. Part of the trained model is a matrix that contains a positional encoding vector for each of the 1024 positions in the input.

![alt](image/10.jpg)

With this, we’ve covered how input words are processed before being handed to the first transformer block. We also know two of the weight matrices that constitute the trained GPT-2.


![alt](image/11.jpg)

**Sending a word to the first transformer block means looking up its embedding and adding the positional encoding vector for position #1.**
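As a rough sketch of this lookup-and-add step, the snippet below builds two small random matrices and combines one row from each. The names (`token_embeddings`, `position_encodings`, `embed`) and the tiny sizes are assumptions for illustration; in a trained GPT-2 both matrices are learned, and positions here are zero-indexed, so index 0 is position #1.

```python
import random

random.seed(0)

VOCAB = ["<|endoftext|>", "a", "robot", "must", "obey"]
D_MODEL = 4        # toy size; GPT-2 small uses 768
N_POSITIONS = 8    # toy size; GPT-2 uses 1024


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


token_embeddings = random_matrix(len(VOCAB), D_MODEL)      # one row per token
position_encodings = random_matrix(N_POSITIONS, D_MODEL)   # one row per position


def embed(token, position):
    """Input to the first block: token embedding plus positional encoding."""
    tok = token_embeddings[VOCAB.index(token)]
    pos = position_encodings[position]
    return [t + p for t, p in zip(tok, pos)]


x = embed("robot", 1)   # "robot" in the second slot
print(len(x))           # 4
```

The same token at a different position gets a different input vector, which is how the blocks can tell word order apart.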

### A journey up the Stack

The first block can now process the token by first passing it through the self-attention process, then passing it through its neural network layer. Once the first transformer block processes the token, it sends its resulting vector up the stack to be processed by the next block. The process is identical in each block, but each block has its own weights in both self-attention and the neural network sublayers.

![alt](image/12.jpg)





## Self-Attention Recap

Language heavily relies on context. For example, look at the second law:

> Second Law of Robotics:
> A robot must obey the orders given **it** by human beings except where **such orders** would conflict with **the First Law**.

I have highlighted three places in the sentence where the words are referring to other words. There is no way to understand or process these words without incorporating the context they are referring to. When a model processes this sentence, it has to be able to know that:

* it refers to the robot
* such orders refers to the earlier part of the law, namely “the orders given it by human beings”
* The First Law refers to the entire First Law

This is what self-attention does. It bakes in the model’s understanding of relevant and associated words that explain the context of a certain word before processing that word (passing it through a neural network). It does that by assigning scores to how relevant each word in the segment is, and adding up their vector representations weighted by those scores.

As an example, this self-attention layer in the top block is paying attention to “a robot” when it processes the word “it”. The vector it will pass to its neural network is a sum of the vectors for each of the three words multiplied by their scores.

![alt](image/13.jpg)

### Self-Attention Process

**Self-attention is processed along the path of each token in the segment**. The significant components are three vectors:

* Query: The query is a representation of the current word used to score against all the other words (using their keys). We only care about the query of the token we’re currently processing.
* Key: Key vectors are like labels for all the words in the segment. They’re what we match against in our search for relevant words.
* Value: Value vectors are actual word representations, once we’ve scored how relevant each word is, these are the values we add up to represent the current word.

![alt](images/14.jpg)

A crude analogy is to think of it like searching through a filing cabinet. The query is like a sticky note with the topic you’re researching. The keys are like the labels of the folders inside the cabinet. When a label matches the sticky note, you take out the contents of that folder; these contents are the value vector. Except you’re not only looking for one value, but a blend of values from a blend of folders.

Multiplying the query vector by each key vector produces a score for each folder (technically: dot product followed by softmax).

![alt](images/15.jpg)

We multiply each value by its score and sum up – resulting in our self-attention outcome.

![alt](images/16.jpg)

This weighted blend of value vectors results in a vector that paid 50% of its “attention” to the word robot, 30% to the word a, and 19% to the word it. Later in the post, we’ll go deeper into self-attention. But first, let’s continue our journey up the stack towards the output of the model.
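The query, key, and value steps described in this section can be sketched for a single token as follows. This is a minimal sketch with made-up two-dimensional vectors, and `attend` is a hypothetical helper name. It follows the description in the text (dot product, then softmax) and leaves out the scaling by the square root of the key dimension that real transformer implementations apply.

```python
import math


def softmax(scores):
    m = max(scores)                            # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Score one query against every key, then blend the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)                  # attention paid to each token
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended


# Made-up vectors for three tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, blended = attend(query, keys, values)
print([round(w, 2) for w in weights])
print([round(b, 2) for b in blended])
```

The weights always sum to 1, so the result is a blend of the value vectors in which the best-matching keys contribute the most.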

### Model Output

When the top block in the model produces its output vector (the result of its own self-attention followed by its own neural network), the model multiplies that vector by the embedding matrix.

![alt](images/17.jpg)

Recall that each row in the embedding matrix corresponds to the embedding of a word in the model’s vocabulary. The result of this multiplication is interpreted as a score for each word in the model’s vocabulary.


![alt](images/18.jpg)
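A small sketch of this scoring step, using a made-up three-word vocabulary and two-dimensional embeddings (the names `vocabulary_scores`, `embedding_matrix`, and `hidden` are illustrative assumptions):

```python
def vocabulary_scores(hidden, embedding_matrix):
    """Dot product of the output vector with every row of the embedding matrix."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding_matrix]


VOCAB = ["a", "robot", "must"]
embedding_matrix = [
    [1.0, 0.0],   # embedding of "a"
    [0.0, 1.0],   # embedding of "robot"
    [1.0, 1.0],   # embedding of "must"
]
hidden = [0.2, 0.9]   # output vector of the top block (made up)

scores = vocabulary_scores(hidden, embedding_matrix)
best = VOCAB[scores.index(max(scores))]
print(best)  # must
```

There is one score per vocabulary entry, so for GPT-2 this list would have as many entries as the model has tokens.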

We can simply select the token with the highest score (top_k = 1). But better results are achieved if the model considers other words as well. So a better strategy is to sample a word from the entire list using the score as the probability of selecting that word (so words with a higher score have a higher chance of being selected). A middle ground is setting top_k to 40, and having the model consider the 40 words with the highest scores.

![alt](images/19a.jpg)
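One way to sketch top-k sampling is shown below. It assumes the scores are turned into probabilities with a softmax over the k kept candidates; `top_k_sample` is a hypothetical helper, not GPT-2's own sampling code.

```python
import math
import random


def top_k_sample(scores, k, rng=random):
    """Keep the k highest-scoring tokens and sample one in proportion to exp(score)."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


scores = [0.1, 2.0, 1.5, -3.0]          # made-up scores for a 4-token vocabulary
print(top_k_sample(scores, k=1))        # 1, because k = 1 always picks the top token
print(top_k_sample(scores, k=2))        # 1 or 2, chosen at random
```

With `k=1` this reduces to picking the highest score every time; larger values of `k` trade some predictability for variety.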

With that, the model has completed an iteration resulting in outputting a single word. The model continues iterating until the entire context is generated (1024 tokens) or until an end-of-sequence token is produced.



![alt](image/20.jpg)

# Self Attention

Let’s start by looking at the original self-attention as it’s calculated in an encoder block. Let’s look at a toy transformer block that can only process four tokens at a time.

Self-attention is applied through three main steps:

* Create the Query, Key, and Value vectors for each path.
* For each input token, use its query vector to score against all the key vectors.
* Sum up the value vectors after multiplying them by their associated scores.

![alt](image/21.jpg)

### 1- Create Query, Key, and Value Vectors

The first step in self-attention is to calculate the three vectors for each token path (let’s ignore attention heads for now). Let’s focus on the first path: we’ll take its query and compare it against all the keys, which produces a score for each key.

![alt](image/22.jpg)

### 2- Score
Now that we have the vectors, we use the query and key vectors only for step #2. Since we’re focused on the first token, we multiply its query by all the key vectors (including its own), resulting in a score for each of the four tokens.

![alt](image/23.jpg)

### 3- Sum

We can now multiply the scores by the value vectors. A value with a high score will constitute a large portion of the resulting vector after we sum them up.

![alt](image/24.jpg)

**The lower the score, the more transparent we're showing the value vector. That's to indicate how multiplying by a small number dilutes the values of the vector.**

If we do the same operation for each path, we end up with a vector representing each token containing the appropriate context of that token. Those are then presented to the next sublayer in the transformer block (the feed-forward neural network):

![alt](image/25.jpg)
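Putting the three steps together for all four paths at once, here is a minimal sketch in plain Python. The weight matrices are identity matrices and the inputs are made up, purely to keep the numbers easy to follow; a trained block would use learned weights, and the helper names are assumptions.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def self_attention(x, w_q, w_k, w_v):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)   # step 1: create Q, K, V
    scores = matmul(q, transpose(k))                            # step 2: score
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)                                   # step 3: sum


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]   # four tokens, two dimensions

out = self_attention(x, identity, identity, identity)
print([round(v, 2) for v in out[3]])  # [0.5, 0.5]
```

The last token has an all-zero query, so it scores every key equally and its output is the plain average of the four value vectors.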

## Masked Self-Attention

Now that we’ve looked inside a transformer’s self-attention step, let’s proceed to look at masked self-attention. Masked self-attention is identical to self-attention except when it comes to step #2. Assume the model only has two tokens as input and we’re observing the second token. In this case, the last two of the four positions are masked, so the model interferes in the scoring step: it always scores the future tokens as 0 so the model can’t peek at future words:

![alt](images/26.jpg)

This masking is often implemented as a matrix called an attention mask. Think of a sequence of four words (“robot must obey orders”, for example). In a language modeling scenario, this sequence is absorbed in four steps – one per word (assuming for now that every word is a token). As these models work in batches, we can assume a batch size of 4 for this toy model that will process the entire sequence (with its four steps) as one batch.

![alt](images/27.jpg)

In matrix form, we calculate the scores by multiplying a queries matrix by a keys matrix. Let’s visualize it as follows, except instead of the word, there would be the query (or key) vector associated with that word in that cell:

![alt](images/28.jpg)

After the multiplication, we slap on our attention mask triangle. It sets the cells we want to mask to -infinity or a very large negative number (e.g. -1 billion in GPT2):

![alt](images/29.jpg)

Then, applying softmax on each row produces the actual scores we use for self-attention:

![alt](images/30.jpg)

What this scores table means is the following:

* When the model processes the first example in the dataset (row #1), which contains only one word (“robot”), 100% of its attention will be on that word.
* When the model processes the second example in the dataset (row #2), which contains the words “robot must”, and reaches the word “must”, 48% of its attention will be on “robot” and 52% on “must”.
* And so on
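The mask-then-softmax step can be sketched as follows. The raw scores are made-up numbers, so the resulting percentages will not match the figures above, and `masked_softmax_rows` is a hypothetical helper name.

```python
import math


def masked_softmax_rows(scores, mask_value=-1e9):
    """Apply a causal mask (row i may only see columns 0..i), then softmax each row."""
    out = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else mask_value for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]   # masked cells underflow to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Made-up raw scores (queries x keys) for "robot must obey orders".
scores = [
    [0.11, 0.00, 0.81, 0.79],
    [0.19, 0.50, 0.30, 0.48],
    [0.53, 0.98, 0.95, 0.14],
    [0.81, 0.86, 0.38, 0.90],
]

weights = masked_softmax_rows(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

Every cell above the diagonal comes out as exactly 0, and each row still sums to 1, so the first row puts all of its attention on the first word.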





## GPT-2 Masked Self-Attention
Let’s get into more detail on GPT-2’s masked attention.

### Evaluation Time: Processing One Token at a Time
We can make the GPT-2 operate exactly as masked self-attention works. But during evaluation, when our model is only adding one new word after each iteration, it would be inefficient to recalculate self-attention along earlier paths for tokens which have already been processed.

In this case, we process the first token (ignoring s for now).

![alt](images/31.jpg)

GPT-2 holds on to the key and value vectors of the “a” token. Every self-attention layer holds on to its respective key and value vectors for that token:

![alt](images/32.jpg)

Now in the next iteration, when the model processes the word “robot”, it does not need to generate the query, key, and value vectors for the “a” token again. It just reuses the key and value vectors it saved from the first iteration:

![alt](images/33.jpg)
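The caching idea can be sketched as a small class that stores the key and value vectors of every token it has already seen. `CachedAttention` is a hypothetical name and the vectors are made up; in a real model each layer (and each attention head) would keep its own cache.

```python
import math


class CachedAttention:
    """One attention layer that remembers the keys and values of earlier tokens."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the new token's key and value are added; earlier ones are reused.
        self.keys.append(key)
        self.values.append(value)
        scores = [sum(q * k for q, k in zip(query, kk)) for kk in self.keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        return [sum(w * v[i] for w, v in zip(weights, self.values))
                for i in range(len(value))]


layer = CachedAttention()
out1 = layer.step(query=[1.0, 0.0], key=[1.0, 0.0], value=[2.0, 0.0])  # first token
out2 = layer.step(query=[0.0, 1.0], key=[0.0, 1.0], value=[0.0, 4.0])  # second token
print(out1)  # [2.0, 0.0]
print([round(v, 2) for v in out2])
```

Because the cache only ever contains the current and earlier tokens, no explicit mask is needed at generation time: future tokens simply do not exist yet.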



## GPT-2 Self-attention: 1- Creating queries, keys, and values

Let’s assume the model is processing the word it. If we’re talking about the bottom block, then its input for that token would be the embedding of it + the positional encoding for slot #9:

![alt](image/34.jpg)

Every block in a transformer has its own weights (broken down later in the post). The first we encounter is the weight matrix that we use to create the queries, keys, and values.

![alt](image/35.jpg)

The multiplication results in a vector that’s basically a concatenation of the query, key, and value vectors for the word it.

![alt](image/36.jpg)

**Multiplying the input vector by the attention weight matrix (and adding a bias vector afterwards) results in the key, value, and query vectors for this token.**
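A toy version of this multiply-then-split step is sketched below, with a model dimension of 2 instead of 768. The names `project_qkv`, `w_attn`, and `b_attn` are assumptions for illustration, as is the query, key, value ordering of the three slices.

```python
def project_qkv(x, w_attn, b_attn):
    """Multiply by one fused weight matrix, add the bias, then cut into three vectors."""
    d = len(x)
    fused = [sum(x[i] * w_attn[i][j] for i in range(d)) + b_attn[j]
             for j in range(3 * d)]
    return fused[:d], fused[d:2 * d], fused[2 * d:]   # query, key, value


x = [1.0, 2.0]                       # input vector for one token
w_attn = [                           # shape: d_model x (3 * d_model)
    [1.0, 0.0, 2.0, 0.0, 3.0, 0.0],
    [0.0, 1.0, 0.0, 2.0, 0.0, 3.0],
]
b_attn = [0.0] * 6

q, k, v = project_qkv(x, w_attn, b_attn)
print(q, k, v)  # [1.0, 2.0] [2.0, 4.0] [3.0, 6.0]
```

Using one wide matrix instead of three separate ones is only a convenience: the result is the same as three separate projections laid side by side.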


## GPT-2 Self-attention: 1.5- Splitting into attention heads

In the previous examples, we dove straight into self-attention ignoring the “multi-head” part. It would be useful to shed some light on that concept now. Self attention is conducted multiple times on different parts of the Q,K,V vectors. “Splitting” attention heads is simply reshaping the long vector into a matrix. The small GPT2 has 12 attention heads, so that would be the first dimension of the reshaped matrix:

![alt](image/37.jpg)

In the previous examples, we’ve looked at what happens inside one attention head. One way to think of multiple attention-heads is like this (if we’re to only visualize three of the twelve attention heads):

![alt](image/38.jpg)
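Since splitting is just a reshape, it can be sketched with list slicing. With a 768-long vector and 12 heads, each head gets 768 / 12 = 64 numbers. `split_heads` here is an illustrative helper, and the vector is filled with placeholder values.

```python
def split_heads(vector, n_heads):
    """Reshape one long vector into n_heads equal chunks (one per attention head)."""
    head_dim, remainder = divmod(len(vector), n_heads)
    if remainder:
        raise ValueError("vector length must be divisible by n_heads")
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(n_heads)]


query = [float(i) for i in range(768)]   # placeholder query vector
heads = split_heads(query, n_heads=12)

print(len(heads), len(heads[0]))  # 12 64
```

The same split is applied to the key and value vectors, so head #1 only ever compares its slice of the query with its slice of each key.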

## GPT-2 Self-attention: 2- Scoring
We can now proceed to scoring – knowing that we’re only looking at one attention head (and that all the others are conducting a similar operation):

![alt](image/39.jpg)

Now the token can get scored against all the keys of the other tokens (that were calculated in attention head #1 in previous iterations):

![alt](image/40.jpg)

## GPT-2 Self-attention: 3- Sum

As we’ve seen before, we now multiply each value with its score, then sum them up, producing the result of self-attention for attention-head #1:

![alt](image/41a.jpg)


## GPT-2 Self-attention: 3.5- Merge attention heads

The way we deal with the various attention heads is that we first concatenate them into one vector:

![alt](image/42.jpg)

But the vector isn’t ready to be sent to the next sublayer just yet. We need to first turn this Frankenstein’s-monster of hidden states into a homogeneous representation.

## GPT-2 Self-attention: 4- Projecting

We’ll let the model learn how to best map concatenated self-attention results into a vector that the feed-forward neural network can deal with. Here comes our second large weight matrix that projects the results of the attention heads into the output vector of the self-attention sublayer:

![alt](image/43.jpg)

And with this, we have produced the vector we can send along to the next layer:

![alt](image/44.jpg)
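Merging and projecting can be sketched together: concatenate the per-head results, then multiply by a second weight matrix and add a bias. The toy below uses two heads of size 2 and an identity projection so the effect is easy to see; `merge_and_project`, `w_proj`, and `b_proj` are assumed names.

```python
def merge_and_project(head_outputs, w_proj, b_proj):
    """Concatenate the per-head results, then apply the output projection."""
    merged = [x for head in head_outputs for x in head]
    return [sum(m * w_proj[i][j] for i, m in enumerate(merged)) + b_proj[j]
            for j in range(len(b_proj))]


head_outputs = [[1.0, 2.0], [3.0, 4.0]]            # two heads, two numbers each
w_proj = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b_proj = [0.5] * 4

out = merge_and_project(head_outputs, w_proj, b_proj)
print(out)  # [1.5, 2.5, 3.5, 4.5]
```

In a trained model the projection is a learned, dense matrix, which is what lets information from different heads mix before reaching the feed-forward network.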

## GPT-2 Fully-Connected Neural Network: Layer #1

The fully-connected neural network is where the block processes its input token after self-attention has included the appropriate context in its representation. It is made up of two layers. The first layer is four times the size of the model (since the model dimension of GPT-2 small is 768, this layer has 768*4 = 3072 units). Why four times? That’s just the size the original transformer rolled with (model dimension was 512 and layer #1 in that model was 2048). This seems to give transformer models enough representational capacity to handle the tasks that have been thrown at them so far.

![alt](image/45.jpg)

## GPT-2 Fully-Connected Neural Network: Layer #2 - Projecting to model dimension

The second layer projects the result from the first layer back into model dimension (768 for the small GPT2). The result of this multiplication is the result of the transformer block for this token.

![alt](image/46.jpg)
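The two layers can be sketched as an expand-then-project pair. The toy below uses a model dimension of 2 (so the hidden layer has 8 units) and random weights. The activation between the layers is not discussed in the text above; the sketch uses the tanh approximation of GELU, which is what GPT-2 uses, and all names are illustrative.

```python
import math
import random


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(x, w, b):
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def feed_forward(x, w1, b1, w2, b2):
    hidden = [gelu(h) for h in linear(x, w1, b1)]   # d_model -> 4 * d_model
    return linear(hidden, w2, b2)                   # 4 * d_model -> d_model


random.seed(0)
D_MODEL = 2                                          # GPT-2 small uses 768
w1 = [[random.uniform(-1, 1) for _ in range(4 * D_MODEL)] for _ in range(D_MODEL)]
b1 = [0.0] * (4 * D_MODEL)
w2 = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(4 * D_MODEL)]
b2 = [0.0] * D_MODEL

out = feed_forward([1.0, -0.5], w1, b1, w2, b2)
print(len(out))  # 2
```

The output has the same length as the input, which is what allows the result to be passed straight to the next block in the stack.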

