# Deep Dive Into GPT-2

In this chapter we will deconstruct one of the best-known LLMs ever created: **GPT-2**.
The reason we chose this model is that most of the models that followed it reused much of its architecture.
If you understand GPT-2, you will not have too much trouble understanding later models.
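Additionally, GPT-2 is not too large: the biggest variant has "just" 1.5 billion parameters, and the `gpt2` checkpoint we load in this chapter is the smallest one, with roughly 124 million parameters.
You will therefore be able to load it into the memory of your local machine and won't have to provision a GPU instance.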

As usual, we need to import a few things:

In [1]:
import torch

from transformers import AutoTokenizer, AutoModel

We will also disable the progress bars so as not to clutter the book (you should probably keep them when following along):

In [2]:
import os

os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"

## Loading the Model and Performing Inference

First, we will load the tokenizer and the model:

In [3]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

In [4]:
print(type(tokenizer))

<class 'transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast'>


In [5]:
print(type(model))

<class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>


Next, we perform inference using an example text:

In [6]:
text = "This is an example sentence"

In [7]:
encoded_input = tokenizer(text, return_tensors='pt')

In [8]:
print(encoded_input)

{'input_ids': tensor([[1212,  318,  281, 1672, 6827]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}


In [9]:
output = model(**encoded_input)

In [10]:
print(type(output))

<class 'transformers.modeling_outputs.BaseModelOutputWithPastAndCrossAttentions'>


The `output` contains, among other things, a `last_hidden_state` attribute, which holds the hidden states at the output of the last layer for every token in the sequence:

In [11]:
output.last_hidden_state.shape

torch.Size([1, 5, 768])

Note that the first dimension is the batch size (which is `1` here since we only have a single text).
The second dimension is the number of tokens.
Finally, the third dimension is the dimension of the hidden state.

In [12]:
output.last_hidden_state

tensor([[[ 0.0530, -0.0137, -0.2393,  ..., -0.1245, -0.1116,  0.0225],
         [ 0.2470,  0.2260,  0.0397,  ...,  0.2413,  0.4349,  0.1768],
         [ 0.7483, -0.4052, -0.9382,  ...,  0.3646, -0.0287,  0.3722],
         [ 0.1990, -0.3695, -1.8210,  ..., -0.1772,  0.0093,  0.1647],
         [ 0.0704, -0.0537, -2.5189,  ...,  0.0582, -0.1217, -0.3843]]],
       grad_fn=<ViewBackward0>)

Since we want to predict the next token, we want only the hidden state at the last token:

In [13]:
hidden_state = output.last_hidden_state[0][-1]
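
`AutoModel` returns the bare transformer without a language-modeling head, so `hidden_state` is not yet a token prediction.
In GPT-2 the head reuses the token embedding matrix, so the logits are the dot products of the hidden state with every row of that matrix.
The following is a minimal sketch of this idea with small random tensors standing in for the real weights; the sizes and the names `wte_weight` and `last_hidden` are assumptions chosen for illustration:

```python
import torch

torch.manual_seed(0)

# Toy sizes (assumptions for illustration; GPT-2 uses 50257 and 768).
vocab_size, hidden_dim = 10, 4

# Stand-ins for the token embedding matrix and the last token's hidden state.
wte_weight = torch.randn(vocab_size, hidden_dim)
last_hidden = torch.randn(hidden_dim)

# One logit per vocabulary entry: dot product with each embedding row.
logits = wte_weight @ last_hidden

# Softmax turns the logits into a probability distribution over tokens.
probs = torch.softmax(logits, dim=-1)

# Greedy decoding picks the token with the highest logit.
next_token_id = int(torch.argmax(logits))
print(logits.shape, next_token_id)
```

With the real model, the same multiplication would use the full embedding matrix and produce 50257 logits.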

## The Architecture

Let's output the model architecture:

In [14]:
print(model)

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0): GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (1): GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwis

We see that the model has two embedding layers at the beginning - `wte` and `wpe`.
The `wte` layer is the embedding layer for the tokens:

In [15]:
print(model.wte)

Embedding(50257, 768)
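
An embedding layer is essentially a lookup table: each token ID selects one row of a weight matrix.
The sketch below illustrates this with a toy table (10 tokens, 4 dimensions, both assumptions) rather than the real `wte` layer:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy embedding table (assumption: 10 tokens, 4 dimensions).
embedding = nn.Embedding(10, 4)
ids = torch.tensor([[1, 3, 3, 7]])

looked_up = embedding(ids)        # shape: [1, 4, 4]
indexed = embedding.weight[ids]   # plain row indexing into the weight matrix

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
```

Both ways of retrieving the vectors give the same result, and repeated IDs (here `3`) map to identical rows.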


The `wpe` layer is the positional embedding layer:

In [16]:
print(model.wpe)

Embedding(1024, 768)


These layers are followed by a dropout layer:

In [17]:
print(model.drop)

Dropout(p=0.1, inplace=False)


Next, we have the module list `h`:

In [18]:
print(type(model.h))

<class 'torch.nn.modules.container.ModuleList'>


The module list consists of 12 blocks:

In [19]:
print(len(model.h))

12


Each block is a so-called `GPT2Block`:

In [20]:
print(type(model.h[0]))

<class 'transformers.models.gpt2.modeling_gpt2.GPT2Block'>


Looking inside such a `GPT2Block`, we will see a lot of very familiar things:

In [21]:
model.h[0]

GPT2Block(
  (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attn): GPT2Attention(
    (c_attn): Conv1D()
    (c_proj): Conv1D()
    (attn_dropout): Dropout(p=0.1, inplace=False)
    (resid_dropout): Dropout(p=0.1, inplace=False)
  )
  (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (mlp): GPT2MLP(
    (c_fc): Conv1D()
    (c_proj): Conv1D()
    (act): NewGELUActivation()
    (dropout): Dropout(p=0.1, inplace=False)
  )
)

Basically, a `GPT2Block` has a layer normalization, followed by a `GPT2Attention` block which contains the attention mechanism.
This is in turn followed by another layer normalization, followed by a `GPT2MLP` block.
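
To get a feel for where the parameters live, we can add up the sizes of the layers by hand.
The sketch below uses the sizes from the printed architecture plus two assumptions that the printout does not show: an MLP inner width of `4 * 768 = 3072`, and a final layer normalization `ln_f` after the last block (it is used later in this chapter):

```python
# Sizes read off the printed architecture, plus one assumption:
# the MLP inner width of 4 * 768 = 3072 is not visible in the printout.
vocab_size, max_positions, hidden = 50257, 1024, 768
mlp_inner = 4 * hidden
num_blocks = 12


def linear_params(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

block = (
    layer_norm                           # ln_1
    + linear_params(hidden, 3 * hidden)  # attn.c_attn (query, key, value)
    + linear_params(hidden, hidden)      # attn.c_proj
    + layer_norm                         # ln_2
    + linear_params(hidden, mlp_inner)   # mlp.c_fc
    + linear_params(mlp_inner, hidden)   # mlp.c_proj
)

total = (
    vocab_size * hidden       # wte
    + max_positions * hidden  # wpe
    + num_blocks * block
    + layer_norm              # ln_f (assumed final layer normalization)
)
print(f"{total:,}")  # 124,439,808
```

Under these assumptions, roughly two thirds of the parameters sit in the 12 blocks and most of the rest in the token embedding table.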

Whenever you want to find out how an LLM works, it is extremely instructive to print out its architecture to get a basic view of its components.
Now let's see how the tensors actually flow through the model.

We start with the embeddings.

## Embeddings

Let's retrieve the token IDs of our example text:

In [22]:
token_ids = encoded_input["input_ids"]

In [23]:
token_ids

tensor([[1212,  318,  281, 1672, 6827]])

Let's also get the attention mask for later:

In [24]:
attention_mask = encoded_input["attention_mask"]

In [25]:
print(attention_mask)

tensor([[1, 1, 1, 1, 1]])


Let's also generate the position IDs.

For every token ID we need a corresponding position ID.
The position ID sequence is constructed by simply starting at `0` and counting up to the number of tokens minus one (here `token_ids.shape[1] - 1 = 4`).

In [26]:
position_ids = torch.tensor([[0, 1, 2, 3, 4]], dtype=torch.long)
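
Hard-coding the position IDs only works for this one sentence.
A minimal sketch of how they could be generated for any sequence length, assuming a single unpadded sequence, uses `torch.arange`:

```python
import torch

token_ids = torch.tensor([[1212, 318, 281, 1672, 6827]])

# Sequence length is the size of dimension 1 (dimension 0 is the batch).
seq_len = token_ids.shape[1]

# Count from 0 to seq_len - 1 and add a batch dimension in front.
position_ids = torch.arange(seq_len, dtype=torch.long).unsqueeze(0)
print(position_ids)
```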

Let's calculate the token embeddings:

In [27]:
token_embeds = model.wte(token_ids)
print(token_embeds.shape)

torch.Size([1, 5, 768])


Let's also calculate the positional embeddings:

In [28]:
position_embeds = model.wpe(position_ids)
print(position_embeds.shape)

torch.Size([1, 5, 768])


To get the final embeddings, we simply *add* the token embeddings and the positional embeddings.

These final embeddings will also be what we pass to the model as the first layer input.
Since the `transformers` codebase refers to the layer inputs/outputs as "hidden states", we will stick to that convention:

In [29]:
hidden_states = token_embeds + position_embeds
print(hidden_states)

tensor([[[ 0.0065, -0.2930,  0.0762,  ...,  0.0184, -0.0275,  0.1638],
         [ 0.0142, -0.0437, -0.0393,  ...,  0.1487, -0.0278, -0.0255],
         [-0.0828, -0.0964,  0.1232,  ...,  0.0530,  0.0755, -0.1057],
         [ 0.0714, -0.2025,  0.1870,  ..., -0.3685, -0.0108, -0.1304],
         [-0.0888, -0.0326,  0.1666,  ..., -0.2539, -0.0370, -0.2046]]],
       grad_fn=<AddBackward0>)


Here is a graphic representation of the process:

![Embeddings](images/embeddings.png)

## The First GPT2 Block

### The Attention Part

The `hidden_states` tensor is what we will now feed into the first GPT block.

Let's save the tensor since we will need it later:

In [30]:
residual = hidden_states

Let's also give the module list a more meaningful name:

In [31]:
layer_blocks = model.h

In [32]:
layer_block = layer_blocks[0]

As a reminder, here is what the layer block looks like:

In [33]:
print(layer_block)

GPT2Block(
  (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attn): GPT2Attention(
    (c_attn): Conv1D()
    (c_proj): Conv1D()
    (attn_dropout): Dropout(p=0.1, inplace=False)
    (resid_dropout): Dropout(p=0.1, inplace=False)
  )
  (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (mlp): GPT2MLP(
    (c_fc): Conv1D()
    (c_proj): Conv1D()
    (act): NewGELUActivation()
    (dropout): Dropout(p=0.1, inplace=False)
  )
)


First, we pass the hidden states through the attention part of the block.
The first step is to run them through the normalization layer `ln_1`:

In [34]:
hidden_states = layer_block.ln_1(hidden_states)
print(hidden_states.shape)

torch.Size([1, 5, 768])
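
Layer normalization rescales every token vector to zero mean and unit variance and then applies a learned elementwise scale and shift.
The following sketch reproduces this by hand on toy data (width 8 instead of 768, random scale and shift, all assumptions) and compares it with PyTorch's own implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(1, 5, 8)   # toy hidden states (assumption: width 8, not 768)
weight = torch.randn(8)    # learned scale, playing the role of ln_1.weight
bias = torch.randn(8)      # learned shift, playing the role of ln_1.bias
eps = 1e-5                 # same epsilon as in the printed LayerNorm

# Normalize every token vector to zero mean and unit variance ...
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
normalized = (x - mean) / torch.sqrt(var + eps)

# ... then apply the learned elementwise scale and shift.
manual = normalized * weight + bias

reference = F.layer_norm(x, (8,), weight, bias, eps)
print(torch.allclose(manual, reference, atol=1e-5))
```

Note that the statistics are computed per token over the last dimension, so the shape of the tensor does not change.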


Next, we want to get the query, key and value tensors:

In [35]:
query_key_value = layer_block.attn.c_attn(hidden_states)
print(query_key_value.shape)

torch.Size([1, 5, 2304])


The queries, keys and values are combined into a single tensor.
This is why the third dimension is `2304 = 768 * 3`.
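
Despite its name, the `Conv1D` module that `transformers` uses for GPT-2 behaves like a dense layer whose weight matrix is stored as `(in_features, out_features)`, which is the transpose of what `nn.Linear` stores.
The sketch below illustrates this equivalence with toy sizes and random weights (assumptions, not the real `c_attn` parameters):

```python
import torch
from torch import nn

torch.manual_seed(0)

in_features, out_features = 8, 24   # toy sizes (GPT-2: 768 and 2304)
weight = torch.randn(in_features, out_features)  # Conv1D-style layout
bias = torch.randn(out_features)
x = torch.randn(1, 5, in_features)

# Conv1D-style projection: multiply from the right, then add the bias.
conv1d_style = x @ weight + bias

# The same projection expressed as nn.Linear, which stores the transpose.
linear = nn.Linear(in_features, out_features)
with torch.no_grad():
    linear.weight.copy_(weight.T)
    linear.bias.copy_(bias)

print(conv1d_style.shape)
print(torch.allclose(conv1d_style, linear(x), atol=1e-4))
```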

Since we want to work with these tensors separately, we need to split them using the `split` function.
We have three dimensions, and the queries, keys and values are laid out along dimension number 2 (number 0 is the batch size, number 1 is the number of tokens), so we need to `split` along `dim=2`.

The order of the items in the tensor is query first, key second and value third:

In [36]:
query, key, value = query_key_value.split(768, dim=2)

In [37]:
print(query.shape, key.shape, value.shape)

torch.Size([1, 5, 768]) torch.Size([1, 5, 768]) torch.Size([1, 5, 768])


Next, we need to split the attention heads.

The GPT-2 model has 12 attention heads and a head dimension of 64:

In [38]:
num_heads = 12
head_dim = 64
query = layer_block.attn._split_heads(query, num_heads, head_dim)
key = layer_block.attn._split_heads(key, num_heads, head_dim)
value = layer_block.attn._split_heads(value, num_heads, head_dim)
print(query.shape, key.shape, value.shape)

torch.Size([1, 12, 5, 64]) torch.Size([1, 12, 5, 64]) torch.Size([1, 12, 5, 64])
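
Splitting the heads is only a reshape: the 768 features of every token are cut into 12 consecutive chunks of 64, and the head axis is moved in front of the token axis.
The function below is a sketch of what a helper like `_split_heads` does; it is our own illustration, and the private method in `transformers` may differ between versions:

```python
import torch


def split_heads(tensor, num_heads, head_dim):
    """[batch, tokens, hidden] -> [batch, heads, tokens, head_dim]."""
    batch, tokens, _ = tensor.shape
    tensor = tensor.view(batch, tokens, num_heads, head_dim)
    return tensor.permute(0, 2, 1, 3)


# Consecutive numbers make it easy to see where each feature ends up.
query = torch.arange(1 * 5 * 768, dtype=torch.float32).view(1, 5, 768)
heads = split_heads(query, num_heads=12, head_dim=64)
print(heads.shape)

# Head 0 holds features 0..63 of every token, head 1 features 64..127, ...
print(torch.equal(heads[0, 1], query[0, :, 64:128]))
```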


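Next, we compute the attention weights and the attention output.
Note that, in the `transformers` version used here, `_attn` adds the mask to the raw scores, so the full model first converts the 0/1 `attention_mask` into an additive mask.
Passing the raw all-ones mask works here only because adding the same constant to every score leaves the softmax unchanged: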

In [39]:
attn_output, attn_weights = layer_block.attn._attn(query, key, value, attention_mask)

In [40]:
print(attn_output.shape, attn_weights.shape)

torch.Size([1, 12, 5, 64]) torch.Size([1, 12, 5, 5])
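
The two returned tensors are the result of causal scaled dot-product attention.
The following is a minimal sketch of that computation on random tensors with the same shapes as above; it leaves out dropout and any padding mask, so it is a simplification rather than the library's exact code:

```python
import math
import torch

torch.manual_seed(0)

batch, num_heads, tokens, head_dim = 1, 12, 5, 64
query = torch.randn(batch, num_heads, tokens, head_dim)
key = torch.randn(batch, num_heads, tokens, head_dim)
value = torch.randn(batch, num_heads, tokens, head_dim)

# Similarity of every query with every key, scaled by sqrt(head_dim).
scores = query @ key.transpose(-1, -2) / math.sqrt(head_dim)

# Causal mask: token i may only attend to tokens 0..i.
causal = torch.tril(torch.ones(tokens, tokens, dtype=torch.bool))
scores = scores.masked_fill(~causal, torch.finfo(scores.dtype).min)

# Softmax over the key axis turns scores into weights that sum to 1.
attn_weights = torch.softmax(scores, dim=-1)

# Each output vector is a weighted average of the value vectors.
attn_output = attn_weights @ value
print(attn_output.shape, attn_weights.shape)
```

Because of the causal mask, the first token can only attend to itself, so its output equals its own value vector.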


Now it's time to merge the attention heads back together:

In [41]:
attn_output = layer_block.attn._merge_heads(attn_output, num_heads, head_dim)
attn_output.shape

torch.Size([1, 5, 768])
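
Merging is the exact inverse of splitting: the head axis is moved back behind the token axis and the 12 chunks of 64 features are concatenated again.
The helper below is our own sketch of this step, not the library's private method:

```python
import torch


def merge_heads(tensor, num_heads, head_dim):
    """[batch, heads, tokens, head_dim] -> [batch, tokens, hidden]."""
    tensor = tensor.permute(0, 2, 1, 3).contiguous()
    batch, tokens = tensor.shape[:2]
    return tensor.view(batch, tokens, num_heads * head_dim)


original = torch.randn(1, 5, 768)

# Split into heads by hand, then merge again.
split = original.view(1, 5, 12, 64).permute(0, 2, 1, 3)
merged = merge_heads(split, num_heads=12, head_dim=64)
print(merged.shape, torch.equal(merged, original))
```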

Now, we pass the tensor through a final linear layer:

In [42]:
attn_output = layer_block.attn.c_proj(attn_output)

In [43]:
attn_output.shape

torch.Size([1, 5, 768])

Finally, we add the saved `residual` tensor to the output:

In [44]:
hidden_states = attn_output + residual

In [45]:
hidden_states.shape

torch.Size([1, 5, 768])

Here, again, is a visualization of the process:

![Attention](images/attention.png)

### The MLP Part

The second part of the GPT block is the MLP part.

Again, we first save the current hidden state tensor:

In [46]:
residual = hidden_states

And - again - we first pass the hidden states through a layer normalization block:

In [47]:
hidden_states = layer_block.ln_2(hidden_states)

In [48]:
print(hidden_states.shape)

torch.Size([1, 5, 768])


Next, we pass the hidden states through the MLP block, which consists of a linear layer (`c_fc`), a GELU activation and a second linear layer (`c_proj`):

In [49]:
feed_forward_hidden_states = layer_block.mlp(hidden_states)

In [50]:
print(feed_forward_hidden_states.shape)

torch.Size([1, 5, 768])
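
Inside the MLP, the hidden states are expanded to a wider dimension, passed through the GELU activation and projected back.
The sketch below spells this out with toy sizes and random weights (assumptions; for GPT-2 the inner width is commonly 4 times the hidden size) and uses the tanh approximation of GELU:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)


def new_gelu(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))
    return 0.5 * x * (1.0 + torch.tanh(inner))


hidden, inner_dim = 8, 32   # toy sizes (assumption; not the real widths)
x = torch.randn(1, 5, hidden)
w_fc, b_fc = torch.randn(hidden, inner_dim), torch.randn(inner_dim)
w_proj, b_proj = torch.randn(inner_dim, hidden), torch.randn(hidden)

# Expand, apply the nonlinearity, project back to the hidden width.
expanded = x @ w_fc + b_fc
activated = new_gelu(expanded)
mlp_output = activated @ w_proj + b_proj

print(mlp_output.shape)
print(torch.allclose(activated, F.gelu(expanded, approximate="tanh"), atol=1e-4))
```

The dropout layer of the real block is omitted here, since it does nothing in evaluation mode.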


Finally, we add the residual:

In [51]:
hidden_states = residual + feed_forward_hidden_states

In [52]:
print(hidden_states.shape)

torch.Size([1, 5, 768])


In [53]:
hidden_states

tensor([[[ 1.1035, -0.2464,  0.3231,  ..., -1.1987, -0.6627,  1.9803],
         [-1.3085, -0.2814, -0.8437,  ...,  0.2028,  0.1010,  0.7039],
         [-0.9680, -0.7867,  0.1910,  ..., -0.4938, -0.0832, -1.1468],
         [-2.1911, -0.1754, -1.5701,  ..., -2.0236,  0.8131,  1.1406],
         [-3.2628,  2.9190, -0.5089,  ...,  0.0336, -0.0497, -0.7092]]],
       grad_fn=<AddBackward0>)

## The Other GPT Blocks

Now, we simply pass the hidden states through one block after the other, where the output of each block is the input to the next block:

In [54]:
for block in model.h[1:]:
    hidden_states = block(hidden_states)[0]

In [55]:
print(hidden_states.shape)

torch.Size([1, 5, 768])
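
Each call to a block repeats the two steps we performed by hand: normalize, apply a sublayer, add the residual, once for attention and once for the MLP.
The sketch below captures this data flow with stand-in sublayers; the plain linear layers used here are assumptions chosen to keep the example small, not real attention or MLP modules:

```python
import torch
from torch import nn

torch.manual_seed(0)


def pre_norm_block(x, ln_1, attn, ln_2, mlp):
    """Residual block with normalization applied before each sublayer."""
    x = x + attn(ln_1(x))   # attention part
    x = x + mlp(ln_2(x))    # MLP part
    return x


hidden = 8
# Stand-in sublayers (assumption): linear layers instead of real
# attention and MLP modules, just to show the data flow.
ln_1, ln_2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)
attn, mlp = nn.Linear(hidden, hidden), nn.Linear(hidden, hidden)

x = torch.randn(1, 5, hidden)
with torch.no_grad():
    for _ in range(3):  # for brevity, the three stacked blocks share weights
        x = pre_norm_block(x, ln_1, attn, ln_2, mlp)
print(x.shape)
```

Since every block maps a `[batch, tokens, hidden]` tensor to a tensor of the same shape, any number of blocks can be chained.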


Finally, we pass the final result through one last layer normalization:

In [56]:
hidden_states = model.ln_f(hidden_states)
print(hidden_states.shape)

torch.Size([1, 5, 768])


Let's verify that our calculations are correct by checking if the `hidden_states` tensor we computed manually is the same as `output.last_hidden_state`:

In [57]:
torch.allclose(hidden_states, output.last_hidden_state, rtol=1e-04, atol=1e-06)

True
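
We compare with a tolerance instead of exact equality because floating-point operations performed in a slightly different order can produce tiny differences.
`torch.allclose` checks `|a - b| <= atol + rtol * |b|` for every element, which the following sketch reproduces by hand on made-up values:

```python
import torch


def manual_allclose(a, b, rtol, atol):
    """Elementwise check |a - b| <= atol + rtol * |b|, reduced with all()."""
    return bool(((a - b).abs() <= atol + rtol * b.abs()).all())


# Made-up values with small relative differences.
a = torch.tensor([1.0000, 100.0000, 0.0])
b = torch.tensor([1.00005, 100.0050, 0.0])

print(manual_allclose(a, b, rtol=1e-4, atol=1e-6))
print(torch.allclose(a, b, rtol=1e-4, atol=1e-6))
```

The relative term lets larger values differ by more in absolute terms, while `atol` covers values close to zero.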