Run locally or <a target="_blank" href="https://colab.research.google.com/github/aalgahmi/dl_handouts/blob/main/15.case_study_GPT-2.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Case Study: GPT-2

The goal here is to show how a transformer-based model is implemented from scratch, using a real example. GPT-2 is small enough to walk through in full, yet its architecture remains close to that of its larger successors.

The original GPT-2 model was implemented by OpenAI using TensorFlow, and its trained weights were released.  

In this notebook, we will:  

- Download Hugging Face's version of the GPT-2 model in PyTorch  
- Implement it from scratch  
- Copy the trained weights from the Hugging Face implementation to our implementation  
- Compare both models using an example  

Toward the end, we will also download Llama 3.2-1B and run its text-generation pipeline on the same prompt, to show how far foundation models have evolved since GPT-2.


First, we install the required libraries:

In [1]:
%%capture
!pip install -q tiktoken torchinfo

## Downloading the Pre-trained GPT-2 Model from Hugging Face

In [2]:
from transformers import GPT2LMHeadModel

hf_model = GPT2LMHeadModel.from_pretrained('gpt2')
hf_model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [3]:
from torchinfo import summary
summary(hf_model, depth=4)

Layer (type:depth-idx)                             Param #
GPT2LMHeadModel                                    --
├─GPT2Model: 1-1                                   --
│    └─Embedding: 2-1                              38,597,376
│    └─Embedding: 2-2                              786,432
│    └─Dropout: 2-3                                --
│    └─ModuleList: 2-4                             --
│    │    └─GPT2Block: 3-1                         --
│    │    │    └─LayerNorm: 4-1                    1,536
│    │    │    └─GPT2Attention: 4-2                2,362,368
│    │    │    └─LayerNorm: 4-3                    1,536
│    │    │    └─GPT2MLP: 4-4                      4,722,432
│    │    └─GPT2Block: 3-2                         --
│    │    │    └─LayerNorm: 4-5                    1,536
│    │    │    └─GPT2Attention: 4-6                2,362,368
│    │    │    └─LayerNorm: 4-7                    1,536
│    │    │    └─GPT2MLP: 4-8                      4,722,432
│    │    └─GPT2Block: 3

## Implementing GPT-2 from Scratch
We will follow a naming convention similar to that of Hugging Face.

In [4]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

* Setting the values of the hyperparameters that control the size and architecture of the model.

In [5]:
class GPT2Config:
  max_seq_len = 1024
  vocab_size = 50257 # 50,000 BPE merges + 256 byte tokens + 1 end_of_text token
  n_block = 12       # Blocks
  n_head = 12        # Heads
  n_embd = 768       # Must be divisible by n_head
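
A few quantities follow directly from these hyperparameters. The sketch below recomputes them in plain Python; the variable names are illustrative and the values are copied from the config above.

```python
# Hyperparameters copied from GPT2Config above
vocab_size, max_seq_len, n_head, n_embd = 50257, 1024, 12, 768

# Vocabulary: 50,000 BPE merges + 256 byte tokens + 1 end-of-text token
assert 50_000 + 256 + 1 == vocab_size

# Each head works on an equal slice of the embedding
head_size = n_embd // n_head

# Embedding tables hold one n_embd-sized row per token id / position
wte_params = vocab_size * n_embd
wpe_params = max_seq_len * n_embd

print(head_size)          # 64
print(f"{wte_params:,}")  # 38,597,376
print(f"{wpe_params:,}")  # 786,432
```

The two embedding counts match the `Embedding` rows in the `torchinfo` summary of the Hugging Face model.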

* Implementing self-attention (see lecture notes):
$$\mathbf{Y} = \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{D_K}}\right)\mathbf{V}$$
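
To make the formula concrete, here is a dependency-free sketch of single-head attention with a causal mask on a toy input. The helper names (`softmax`, `causal_attention`) and the toy values are assumptions made for illustration; this is not the notebook's PyTorch implementation.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors; q and k have length D (the head size).
    """
    T, D = len(q), len(q[0])
    scale = 1.0 / math.sqrt(D)
    weights, out = [], []
    for t in range(T):
        # Scores against every position; future positions (s > t) get -inf
        scores = [
            scale * sum(a * b for a, b in zip(q[t], k[s])) if s <= t else float("-inf")
            for s in range(T)
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        out.append([sum(w[s] * v[s][d] for s in range(T)) for d in range(len(v[0]))])
    return weights, out

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, y = causal_attention(q, k, v)
print(weights[0])  # the first token can only attend to itself
print(y[0])        # so its output equals v[0]
```

Each row of `weights` sums to 1, and every entry above the diagonal is exactly 0 because `exp(-inf)` is 0.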

In [6]:
class GPT2Attention(nn.Module):
  def __init__(self, config):
    super().__init__()

    self.config = config

    E, H = config.n_embd, config.n_head

    assert E % H == 0

    # To create query, key, and value projections
    self.c_attn = nn.Linear(E, 3 * E)
    # Final context projection
    self.c_proj = nn.Linear(E, E)

  def scaled_dot_product_attention(self, q, k, v, is_causal=False):
    B, H, T, L = q.size()
    scale_factor = 1 / math.sqrt(L)

    # (B, H, T, L) x (B, H, L, T) = (B, H, T, T)
    attn = (q @ k.transpose(-2, -1)) * scale_factor
    if is_causal:
      # Build the lower-triangular mask on the same device as the inputs
      mask = torch.tril(torch.ones(T, T, device=q.device)).view(1, 1, T, T)
      attn = attn.masked_fill(mask == 0, float('-inf'))

    attn = F.softmax(attn, dim=-1)

    return attn @ v # (B, H, T, T) x (B, H, T, L) -> (B, H, T, L)

  def forward(self, x):
    B, T, E = x.size()     # Batch size, Number of Tokens, Embedding Size
    H = self.config.n_head # Number of heads
    L = E // H             # Head length(size) such that E = L * H

    # Projecting input on queries, keys, and values
    qkv = self.c_attn(x)   # (B, T, 3 * E)
    q, k, v = qkv.split(E, dim=2) # each (B, T, E)

    q = q.view(B, T, H, L).transpose(1, 2) # (B, H, T, L)
    k = k.view(B, T, H, L).transpose(1, 2) # (B, H, T, L)
    v = v.view(B, T, H, L).transpose(1, 2) # (B, H, T, L)

    y = self.scaled_dot_product_attention(q, k, v, is_causal=True) # (B, H, T, L)
    # or
    # y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    y = y.transpose(1, 2).contiguous().view(B, T, E)
    y = self.c_proj(y)
    return y

Let's test it on a dummy input and summarize it.

In [7]:
cttn = GPT2Attention(GPT2Config())
cttn(torch.ones(7, 10, 768)).shape

torch.Size([7, 10, 768])

In [8]:
summary(cttn)

Layer (type:depth-idx)                   Param #
GPT2Attention                            --
├─Linear: 1-1                            1,771,776
├─Linear: 1-2                            590,592
Total params: 2,362,368
Trainable params: 2,362,368
Non-trainable params: 0

* Implementing the MLP component, using GELU activation instead of ReLU.

In [9]:
from collections import OrderedDict

class MLP(nn.Module):
  def __init__(self, config):
    super().__init__()
    E = config.n_embd
    self.net = nn.Sequential(OrderedDict([
        ('c_fc', nn.Linear(E, 4 * E)),
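        # Exact GELU; the Hugging Face model prints NewGELUActivation (the tanh
        # approximation), which nn.GELU(approximate='tanh') would match more closely
        ('gelu', nn.GELU()),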
        ('c_proj', nn.Linear(4 * E, E)),
    ]))

  def forward(self, x):
    return self.net(x)

In [10]:
summary(MLP(GPT2Config()), layer_names=True)

Layer (type:depth-idx)                   Param #
MLP                                      --
├─Sequential: 1-1                        --
│    └─Linear: 2-1                       2,362,368
│    └─GELU: 2-2                         --
│    └─Linear: 2-3                       2,360,064
Total params: 4,722,432
Trainable params: 4,722,432
Non-trainable params: 0
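
The Hugging Face printout lists `NewGELUActivation`, which, to the best of my understanding, is the tanh approximation of GELU, whereas `nn.GELU()` defaults to the exact form. The plain-Python sketch below compares the two formulas on a grid of inputs; the function names are illustrative.

```python
import math

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation of GELU
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

xs = [i / 10 for i in range(-40, 41)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"largest gap on [-4, 4]: {max_gap:.2e}")
```

The gap is small, which is consistent with the two models producing the same samples later in the notebook, but an implementation aiming for an exact match would probably use the tanh form.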

* Implementing the LayerNorm layer, which is applied before self-attention and again before the MLP in each block, plus once after the last block.

In [11]:
class LayerNorm(nn.Module):
  def __init__(self, config, eps=1e-5):
    super().__init__()
    E = config.n_embd

    self.a = nn.Parameter(torch.ones(E))   # learnable gain
    self.b = nn.Parameter(torch.zeros(E))  # learnable bias
    self.eps = eps

  def forward(self, x):
    mean = x.mean(-1, keepdim=True)
    # Biased variance with eps inside the square root, matching nn.LayerNorm
    var = x.var(-1, keepdim=True, unbiased=False)

    return self.a * (x - mean) / torch.sqrt(var + self.eps) + self.b

In [12]:
lnorm = LayerNorm(GPT2Config())
lnorm(torch.ones(7, 10, 768)).shape

torch.Size([7, 10, 768])

In [13]:
summary(lnorm)

Layer (type:depth-idx)                   Param #
LayerNorm                                1,536
Total params: 1,536
Trainable params: 1,536
Non-trainable params: 0
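
As a quick sanity check of what layer normalization does, the sketch below normalizes one small feature vector in plain Python. It assumes the common formulation (biased variance, `eps` inside the square root); the function name and sample values are illustrative.

```python
import math

def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Normalize one feature vector to zero mean and (almost) unit variance."""
    n = len(x)
    gain = gain or [1.0] * n
    bias = bias or [0.0] * n
    mean = sum(x) / n
    # Biased variance (divide by n), with eps inside the square root
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]

y = layer_norm([1.0, 2.0, 3.0, 4.0])
mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
print(mean_y, var_y)  # mean close to 0, variance close to 1
```

The gain and bias vectors are the only learnable parameters, which is why the summary above reports 2 x 768 = 1,536 parameters.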

* Putting together the GPT-2 blocks using the layers implemented above.

In [14]:
class GPT2Block(nn.Module):
  def __init__(self, config):
    super().__init__()
    E = config.n_embd

    self.block = nn.ModuleDict({
        'ln_1': nn.LayerNorm(E),
        'attn': GPT2Attention(config),
        'ln_2': nn.LayerNorm(E),
        'mlp': MLP(config),
    })

  def forward(self, x):
    x = x + self.block["attn"](self.block["ln_1"](x))
    x = x + self.block["mlp"](self.block["ln_2"](x))

    return x

In [24]:
summary(GPT2Block(GPT2Config()))

Layer (type:depth-idx)                   Param #
GPT2Block                                --
├─ModuleDict: 1-1                        --
│    └─LayerNorm: 2-1                    1,536
│    └─GPT2Attention: 2-2                --
│    │    └─Linear: 3-1                  1,771,776
│    │    └─Linear: 3-2                  590,592
│    └─LayerNorm: 2-3                    1,536
│    └─MLP: 2-4                          --
│    │    └─Sequential: 3-3              4,722,432
Total params: 7,087,872
Trainable params: 7,087,872
Non-trainable params: 0
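
The counts reported by `torchinfo` can be reproduced by hand. The sketch below does the arithmetic in plain Python; `linear_params` is a hypothetical helper, and the model total assumes that `lm_head` shares its weight with the token embedding (weight tying), so it is not counted twice.

```python
E, V, T, n_block = 768, 50257, 1024, 12  # values from GPT2Config

def linear_params(n_in, n_out):
    # Weight matrix plus one bias per output feature
    return n_in * n_out + n_out

layer_norm = 2 * E                                   # gain and bias vectors
attention = linear_params(E, 3 * E) + linear_params(E, E)
mlp = linear_params(E, 4 * E) + linear_params(4 * E, E)
block = 2 * layer_norm + attention + mlp

# Token and position embeddings, the blocks, and the final LayerNorm.
# lm_head is not counted separately, assuming its weight is tied to wte.
total = V * E + T * E + n_block * block + layer_norm

print(f"{attention:,}")  # 2,362,368
print(f"{mlp:,}")        # 4,722,432
print(f"{block:,}")      # 7,087,872
print(f"{total:,}")      # 124,439,808
```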

* Finally, implementing the GPT-2 model itself.





In [28]:
class GPT2(nn.Module):
  def __init__(self, config):
    super().__init__()

    self.config = config
    E, V, H, T = config.n_embd, config.vocab_size, config.n_head, config.max_seq_len

    self.transformer = nn.ModuleDict({
        'wte': nn.Embedding(V, E),
        'wpe': nn.Embedding(T, E),
        'h': nn.ModuleList([GPT2Block(config) for _ in range(config.n_block)]),
        'ln_f': nn.LayerNorm(E),
    })

    self.lm_head = nn.Linear(E, V, bias=False)

    # Weight tying === sharing weights between two layers
    self.transformer.wte.weight = self.lm_head.weight

  def forward(self, x):
    B, T = x.size()

    # Position embedding
    pos = torch.arange(0, T, dtype=torch.long, device=x.device) # shape (T)
    p_emb = self.transformer.wpe(pos) # (T, E)

    # Token ID embedding
    t_emb = self.transformer.wte(x) # (B, T, E)

    # Combined embedding
    x = t_emb + p_emb

    # Pass x through the GPT-2 blocks
    for block in self.transformer.h:
      x = block(x)

    x = self.transformer.ln_f(x)

    logits = self.lm_head(x) # (B, T, V)

    return logits

  @classmethod
  def from_pretrained(cls, hf_model):
      model = cls(GPT2Config())

      state = model.state_dict()
      hf_state = hf_model.state_dict()

      # Skip any '.attn.bias' entries (causal-mask buffers in some Hugging Face versions);
      # the remaining keys are matched one-to-one by position, not by name
      state_keys = [k for k in state if not k.endswith('.attn.bias')]
      hf_state_keys = [k for k in hf_state if not k.endswith('.attn.bias')]
      assert len(state_keys) == len(hf_state_keys)

      # Hugging Face's Conv1D stores weights as (in, out); nn.Linear expects (out, in)
      transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
      with torch.no_grad():
        for hf_k, k in zip(hf_state_keys, state_keys):
          if any(hf_k.endswith(w) for w in transposed):
            state[k].copy_(hf_state[hf_k].t())
          else:
            state[k].copy_(hf_state[hf_k])

      return model
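
The transposition in `from_pretrained` is needed because the two layer types lay out their weights differently. The plain-Python sketch below illustrates this on a toy layer, assuming that `Conv1D` stores its weight as (in, out) while `nn.Linear` stores it as (out, in); the helper functions are hypothetical and only mimic the two conventions.

```python
def matvec_linear(weight, bias, x):
    # nn.Linear convention: weight has shape (out, in); y = W x + b
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def matvec_conv1d(weight, bias, x):
    # Conv1D convention assumed here: weight has shape (in, out); y = x W + b
    n_out = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j] for j in range(n_out)]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Toy layer with 2 inputs and 3 outputs, stored in the (in, out) layout
w_conv1d = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]]
b = [0.5, 0.5, 0.5]
x = [1.0, -1.0]

y_conv1d = matvec_conv1d(w_conv1d, b, x)
y_linear = matvec_linear(transpose(w_conv1d), b, x)
print(y_conv1d, y_linear)  # both layouts give the same output
```

Biases, embeddings, and LayerNorm parameters have the same shape in both implementations, so they are copied without transposing.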

In [29]:
from_scratch_model = GPT2.from_pretrained(hf_model)
from_scratch_model

GPT2(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (block): ModuleDict(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): GPT2Attention(
            (c_attn): Linear(in_features=768, out_features=2304, bias=True)
            (c_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            (net): Sequential(
              (c_fc): Linear(in_features=768, out_features=3072, bias=True)
              (gelu): GELU(approximate='none')
              (c_proj): Linear(in_features=3072, out_features=768, bias=True)
            )
          )
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

* Comparing both models.



In [30]:
"HuggingFace:", summary(hf_model), "", "", "From Scratch:", summary(from_scratch_model)

('HuggingFace:',
 Layer (type:depth-idx)                             Param #
 GPT2LMHeadModel                                    --
 ├─GPT2Model: 1-1                                   --
 │    └─Embedding: 2-1                              38,597,376
 │    └─Embedding: 2-2                              786,432
 │    └─Dropout: 2-3                                --
 │    └─ModuleList: 2-4                             --
 │    │    └─GPT2Block: 3-1                         7,087,872
 │    │    └─GPT2Block: 3-2                         7,087,872
 │    │    └─GPT2Block: 3-3                         7,087,872
 │    │    └─GPT2Block: 3-4                         7,087,872
 │    │    └─GPT2Block: 3-5                         7,087,872
 │    │    └─GPT2Block: 3-6                         7,087,872
 │    │    └─GPT2Block: 3-7                         7,087,872
 │    │    └─GPT2Block: 3-8                         7,087,872
 │    │    └─GPT2Block: 3-9                         7,087,872
 │    │    └─GPT2Block

* Generating text with both models. First, here is a function to do that:

In [31]:
import tiktoken

def generate(model, prompt, n_gen_seqs = 5, K = 50, max_length=32):
  tokenizer = tiktoken.get_encoding("gpt2")

  model.eval()
  tokens = tokenizer.encode(prompt)
  sequences = torch.tensor(tokens, dtype=torch.long).unsqueeze(0).repeat(n_gen_seqs, 1)

  # A fresh generator starts from PyTorch's default seed, so every call draws the same samples
  rgen = torch.Generator()

  while sequences.size(1) < max_length:
    # forward the model to get the logits
    with torch.inference_mode():
      output = model(sequences)
      logits = output.logits if hasattr(output, "logits") else output # (B, T, V)
      # take the logits at the last position
      logits = logits[:, -1, :] # (B, V)
      # Softmax for probabilities
      probs = F.softmax(logits, dim=-1)
      # Top-k sampling
      topk_probs, topk_indices = torch.topk(probs, K, dim=-1)
      # Select a random token from the top-k probs
      i = torch.multinomial(topk_probs, 1, generator=rgen) # (B, 1)
      # Gather indices
      token = torch.gather(topk_indices, -1, i) # (B, 1)
      # Append to sequences
      sequences = torch.cat((sequences, token), dim=1)

  generated_sequences = []
  for i in range(n_gen_seqs):
      tokens = sequences[i, :max_length].tolist()
      generated_sequences.append(tokenizer.decode(tokens))

  return generated_sequences
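
The top-k step can be isolated from the model. Below is a minimal plain-Python sketch of the same idea on a toy distribution; `sample_top_k` and the probabilities are made up for illustration. It renormalizes the kept probabilities explicitly, whereas `torch.multinomial` accepts unnormalized weights, so the function above can skip that step.

```python
import random

def sample_top_k(probs, k, rng):
    """Sample one token id from the k most probable entries of probs."""
    # Indices of the k largest probabilities, highest first
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the kept probabilities so they sum to 1
    kept = [probs[i] for i in top]
    total = sum(kept)
    weights = [p / total for p in kept]
    # Draw among the kept entries and return the corresponding token id
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the draws are repeatable
probs = [0.05, 0.40, 0.10, 0.30, 0.15]
draws = [sample_top_k(probs, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))  # only the two most likely ids (1 and 3) appear
```

Restricting sampling to the top K tokens keeps very unlikely tokens out of the generated text while still allowing variety between sequences.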

In [32]:
generate(from_scratch_model, "The best thing in life is")

["The best thing in life is a good time, one with good people around. If you're one of those people you can give to someone else. If someone",
 'The best thing in life is you have freedom of choice and we all have. So many people have been given the choice of trying to change something, or to',
 "The best thing in life is just to stay alive in this world. It's very exciting when you're in the same situation as me.\n\nDo you",
 'The best thing in life is always finding food, so in this era people often need to eat, which is all we know."\n\nKathleen Mc',
 'The best thing in life is if you can feel it.\n\nAdvertisement\n\n"If you feel it, then you\'re going to find it. You']

In [33]:
generate(hf_model, "The best thing in life is")

["The best thing in life is a good time, one with good people around. If you're one of those people you can give to someone else. If someone",
 'The best thing in life is you have freedom of choice and we all have. So many people have been given the choice of trying to change something, or to',
 "The best thing in life is just to stay alive in this world. It's very exciting when you're in the same situation as me.\n\nDo you",
 'The best thing in life is always finding food, so in this era people often need to eat, which is all we know."\n\nKathleen Mc',
 'The best thing in life is if you can feel it.\n\nAdvertisement\n\n"If you feel it, then you\'re going to find it. You']

# Comparison with Llama 3.2-1B

You will need to log in with your Hugging Face token for this.

In [None]:
from huggingface_hub import login
login(token = 'hf_your_token_goes_here')

In [35]:
import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-1B"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_new_tokens=48,
)

pipe("The best thing in life is")

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': 'The best thing in life is to have a job you love, and the best thing in life is to have a job that you love and enjoy. You can’t get that from a job you hate. You can’t get that from a job that you hate. You'}]