In [None]:
#| default_exp GPTText2Text


## Introduction

Large language models like GPT have revolutionized natural language processing, but their architectures can seem complex and mysterious. In this tutorial, we demystify the GPT decoder architecture by building one from scratch using PyTorch.

This implementation is inspired by Andrej Karpathy's excellent "Let's build GPT" [tutorial](https://youtu.be/kCc8FmEb1nY), which provides a fantastic walkthrough of transformer fundamentals.

We'll implement a character-level language model trained on Shakespeare's works, constructing every component step by step: tokenization, embeddings, multi-head attention with causal masking, feed-forward networks, and the complete decoder stack. By working with a small model (about 826K parameters), we can understand the fundamentals without requiring extensive computational resources.





## Data Prep

### Download input
The text is taken from [Karpathy's nanogpt](https://github.com/karpathy/build-nanogpt).

In [None]:
try:
    import google.colab
    !pip install -q git+https://github.com/tripathysagar/NanoTransformer.git
except Exception as e:
    pass

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
#|export
from NanoTransformer.data import *

### Read text

In [None]:
print(tokenizer.vocab)
len(tokenizer.vocab)

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


65

## Configs

In [None]:
#|export
from dataclasses import dataclass
import torch

@dataclass
class GPTConfig:
    bs: int = 256
    seq_len: int = 128              # context length
    embedding_dim: int = 128        # dimension of the embedding layer
    n_layers: int = 4               # number of decoder blocks stacked on top of each other
    n_heads: int = 8                # number of attention heads in a single decoder block
    vocab_size: int = len(tokenizer.vocab)
    dropout: float = 0.1

    device: str = 'cuda' if torch.cuda.is_available() else 'cpu'
    # Use bfloat16 on Ampere+ GPUs, otherwise use float16
    dtype: torch.dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

    lr: float = 1e-3
    max_grad_norm: float = 1.0

    epochs: int = 75

gptConfig = GPTConfig()

## Tokenizer
The tokenizer maps each character to a unique index. Its key attributes are:
1. **vocab**: the list of all characters present in the text
1. **encode**: encodes a given string into a list of tokens
1. **decode**: decodes a list of tokens back into a string
1. **c2i** and **i2c**: helpers that convert a character to its token and a token to its character, respectively

In [None]:
s = 'abc'
assert tokenizer.decode(tokenizer.encode(s))  == s
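
To make the attributes listed above concrete, here is a minimal sketch of a character-level tokenizer. The class name `CharTokenizer` and the use of plain dictionaries for `c2i` and `i2c` are assumptions for illustration; the actual `NanoTransformer` tokenizer may be implemented differently.

```python
class CharTokenizer:
    """Hypothetical character-level tokenizer (illustrative, not the library's code)."""

    def __init__(self, text):
        self.vocab = sorted(set(text))                        # unique characters, sorted
        self.c2i = {c: i for i, c in enumerate(self.vocab)}   # char -> token index
        self.i2c = {i: c for i, c in enumerate(self.vocab)}   # token index -> char

    def encode(self, s):
        # Look up the index of every character in the string
        return [self.c2i[c] for c in s]

    def decode(self, ids):
        # Map every index back to its character and join them
        return "".join(self.i2c[i] for i in ids)


demo_tokenizer = CharTokenizer("hello world")
ids = demo_tokenizer.encode("hello")
print(ids, demo_tokenizer.decode(ids))
```

Sorting the vocabulary keeps the character-to-index mapping deterministic across runs, which matters when a trained model is reloaded later.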

## DataLoader
The dataset is split into **non-overlapping chunks** of `seq_len` tokens, which `get_text_dl` serves as train and validation dataloaders.
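
The internals of `get_text_dl` are not shown in this notebook. As a rough sketch, assuming the usual next-character objective where the target is the input shifted by one position, non-overlapping chunks could be built as follows (`make_chunks` is a hypothetical helper, not part of the library):

```python
import torch

def make_chunks(token_ids, seq_len):
    """Split a token stream into non-overlapping (input, target) chunks.

    Assumption: targets are the inputs shifted one position to the right.
    """
    ids = torch.tensor(token_ids)
    n_chunks = (len(ids) - 1) // seq_len          # leave room for the shifted target
    x = ids[: n_chunks * seq_len].view(n_chunks, seq_len)
    y = ids[1 : n_chunks * seq_len + 1].view(n_chunks, seq_len)
    return x, y

chunk_x, chunk_y = make_chunks(list(range(10)), seq_len=3)
print(chunk_x)
print(chunk_y)
```

Each row of `chunk_x` starts where the previous row ended, so no token appears as an input in two chunks.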


In [None]:
#|export
dls = get_text_dl(bs=gptConfig.bs, seq_len=gptConfig.seq_len)
dls

{'train': <torch.utils.data.dataloader.DataLoader>,
 'valid': <torch.utils.data.dataloader.DataLoader>}

In [None]:
for x, y in dls['train']:
  break
assert x.shape == y.shape
assert x.shape[0] == gptConfig.bs
assert x.shape[1] == gptConfig.seq_len
print("✓ Sequence lengths correct")

✓ Sequence lengths correct


In [None]:
[f"{k}:{len(v)=}" for k, v in dls.items()]

['train:len(v)=31', 'valid:len(v)=4']

The training dataloader has 31 batches and the validation dataloader has 4.

## Attention Mechanism

![pasted_image_be04a721-d6fa-4a79-a961-2e851b81545a.png](attachment:be04a721-d6fa-4a79-a961-2e851b81545a)

A single decoder block contains:

1. **Masked Multi-Head Attention** - prevents looking at future tokens (causal mask)
1. **Add & Norm** - residual connection + layer normalization
1. **Feed Forward** - two linear layers with an activation (usually GELU)
1. **Add & Norm** - another residual connection + layer norm

Full model structure:

1. **Input Embedding** - converts token indices to vectors
1. **Positional Encoding** - adds position information (learned or fixed)
1. **N× Decoder Blocks** - stacked blocks (GPT-2 small uses 12)
1. **Linear layer** - projects to vocabulary size
1. **Softmax** - converts to probabilities

**Key difference from the full transformer:** the GPT decoder only has the right side, with no encoder and no cross-attention (the "Add & Norm + Multi-Head Attention" in the middle). Just masked self-attention.

### Input Embedding Layer
It consists of two types of embedding:
1. Token embedding: one vector for each token in the vocab
1. Position embedding: one learned vector for each position in the context window, so the model knows the order of the tokens


In [None]:
#|export
import torch
from torch import nn
class Embedding(nn.Module):
    def __init__(self, config:GPTConfig):
        super().__init__()
        self.register_buffer('pos_ids', torch.arange(config.seq_len))  # position indices from 0 to seq_len - 1 for the positional embedding

        self.embed = nn.Embedding(config.vocab_size, config.embedding_dim)
        self.pos_embed =  nn.Embedding(config.seq_len, config.embedding_dim)

    def forward(self, x):           #bs * seq_len
        return self.embed(x) + self.pos_embed(self.pos_ids[:x.size(1)])     #bs * seq_len * embedding_dim

In [None]:
embed = Embedding(config=gptConfig)
x = torch.tensor([1, 2, 3])[None,:] #input sequence
pred_embed = embed(x)

assert x.shape == pred_embed.shape[:2]
assert pred_embed.shape[-1] == gptConfig.embedding_dim

### Single attention head
A single head relates all the tokens in the context window (`seq_len`) to each other, keeping track of how much each token affects every other token.

It is the heart of the **Transformer** architecture:
1. The input is projected by linear layers into query `Q`, key `K` and value `V`
1. **Causal mask**- This is crucial for GPT! It prevents the model from "looking ahead" at future tokens. You mask out positions that come after the current position.
1. **Softmax**- after the scaled dot product of Q and K, apply softmax to get the attention weights
1. **Dropout**- adds additional regularization

The final formula is:
    ```
    Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k) + mask) @ V
    ```

Where:
- `d_k` is the dimension of the key (`head_dim`)
- The mask ensures causality (no peeking at future tokens)
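
The division by `sqrt(d_k)` in the formula keeps the scores in a range where softmax does not saturate. A small numerical sketch, assuming queries and keys with independent standard-normal entries (an idealization, not a property of the trained model), illustrates this: the raw dot products have variance close to `d_k`, while the scaled ones have variance close to 1.

```python
import torch

torch.manual_seed(0)
d_k = 16                              # head_dim used in this notebook
q = torch.randn(10_000, d_k)          # idealized queries (assumption: unit-variance entries)
k = torch.randn(10_000, d_k)          # idealized keys

raw = (q * k).sum(dim=-1)             # one dot product per row
scaled = raw / d_k ** 0.5             # same scaling as in the attention formula

raw_var = raw.var().item()
scaled_var = scaled.var().item()
print(f"variance without scaling: {raw_var:.2f}")
print(f"variance with scaling:    {scaled_var:.2f}")
```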

**Causal mask**
- During training, the model sees the entire sequence at once
- Without masking, it could "cheat" by looking at future tokens
- The mask ensures each position only attends to previous positions (autoregressive)

  Example for seq_len=3:
  ```
  Position 0: can only see position 0
  Position 1: can see positions 0, 1
  Position 2: can see positions 0, 1, 2
  ```

- *How does it work?*
  1. by building the mask as a lower-triangular matrix filled with 1
  1. replacing the 0 entries with `-inf`
  1. applying softmax, which zeroes out these entries

In [None]:
head_dim = gptConfig.embedding_dim // gptConfig.n_heads
assert gptConfig.embedding_dim % gptConfig.n_heads == 0

In [None]:
Q_W = nn.Linear(gptConfig.embedding_dim, head_dim)
K_W = nn.Linear(gptConfig.embedding_dim, head_dim)
V_W = nn.Linear(gptConfig.embedding_dim, head_dim)

In [None]:
q = Q_W(pred_embed)
k = K_W(pred_embed)
v = V_W(pred_embed)
q.shape, k.shape, v.shape

(torch.Size([1, 3, 16]), torch.Size([1, 3, 16]), torch.Size([1, 3, 16]))

In [None]:
attn = q @ k.transpose(-2, -1) # Shape: (bs, seq_len, seq_len)
attn.shape

torch.Size([1, 3, 3])

In [None]:
mask = torch.tril(torch.ones(3, 3))
print(mask)
mask = mask.masked_fill(mask == 0, float('-inf'))
print(mask)

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
tensor([[1., -inf, -inf],
        [1., 1., -inf],
        [1., 1., 1.]])


In [None]:
attn = attn + mask
attn

tensor([[[4.6983,   -inf,   -inf],
         [2.8262, 0.2329,   -inf],
         [1.8241, 3.0780, 5.0162]]], grad_fn=<AddBackward0>)

In [None]:
torch.softmax(attn, dim =-1)

tensor([[[1.0000, 0.0000, 0.0000],
         [0.9304, 0.0696, 0.0000],
         [0.0347, 0.1215, 0.8439]]], grad_fn=<SoftmaxBackward0>)

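This is the heart of the attention block: the causal mask blocks out future tokens. Setting the masked entries to negative infinity `-inf` makes softmax assign them a weight of exactly zero, since $e^{-\infty}=0$. The allowed entries are shifted by the constant 1 from the mask, which does not change the result because softmax is invariant to adding the same constant to every finite entry of a row. This effectively prevents the model from attending to future tokens, ensuring causality.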
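
The mask built above holds 1 (rather than the more common 0) at the allowed positions. A quick check with made-up scores suggests this is harmless: both variants yield the same attention weights, because each row's allowed entries are all shifted by the same constant.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(1, 3, 3)                       # made-up attention scores
tril = torch.tril(torch.ones(3, 3))

# Variant used in this notebook: 1 where attention is allowed, -inf elsewhere
mask_ones = tril.masked_fill(tril == 0, float('-inf'))
# Common alternative: 0 where attention is allowed, -inf elsewhere
mask_zeros = torch.zeros(3, 3).masked_fill(tril == 0, float('-inf'))

w_ones = torch.softmax(scores + mask_ones, dim=-1)
w_zeros = torch.softmax(scores + mask_zeros, dim=-1)

print(torch.allclose(w_ones, w_zeros, atol=1e-6))
```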

In [None]:
#|export
class AttentionHead(nn.Module):
    def __init__(self, config:GPTConfig):
        super().__init__()

        assert config.embedding_dim % config.n_heads == 0
        self.head_dim = config.embedding_dim // config.n_heads

        self.Q_W = nn.Linear(config.embedding_dim, self.head_dim)
        self.K_W = nn.Linear(config.embedding_dim, self.head_dim)
        self.V_W = nn.Linear(config.embedding_dim, self.head_dim)


        mask = torch.tril(torch.ones(config.seq_len, config.seq_len))
        self.register_buffer('mask', mask.masked_fill(mask == 0, float('-inf'))) # causal mask: -inf above the diagonal

        self.dropout = nn.Dropout(p = config.dropout)

    def forward(self, x): #bs * seq_len * embedding_dim

        Q, K, V = self.Q_W(x), self.K_W(x), self.V_W(x)        #bs * seq_len * head_dim

        attn = Q @ K.transpose(-2, -1) /  self.head_dim ** 0.5         #bs * seq_len * head_dim @ bs * head_dim * seq_len -> bs * seq_len * seq_len

        attn += self.mask[:x.shape[1], :x.shape[1]]

        attn = torch.softmax(attn, dim=-1)

        return self.dropout(attn @ V)         # bs * seq_len * seq_len @ bs * seq_len * head_dim -> bs * seq_len *  head_dim

In [None]:
attn_head = AttentionHead(gptConfig)
test_input = torch.randn(2, 5, gptConfig.embedding_dim)  # bs=2, seq_len=5
output = attn_head(test_input)

assert output.shape == (2, 5, gptConfig.embedding_dim // gptConfig.n_heads)
print(f"Input shape: {test_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Expected head_dim: {gptConfig.embedding_dim // gptConfig.n_heads}")

Input shape: torch.Size([2, 5, 128])
Output shape: torch.Size([2, 5, 16])
Expected head_dim: 16


### Multi-Head Attention
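In each decoder block, multiple attention heads run in parallel on the same input. Their outputs are concatenated, passed through a linear projection with dropout, and combined with the input through a residual connection followed by layer normalization.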

In [None]:
#|export
class MultiHeadAttention(nn.Module):
    def __init__(self, config:GPTConfig):
        super().__init__()
        assert config.embedding_dim % config.n_heads == 0 # n_heads * head_dim must equal embedding_dim

        self.heads = nn.ModuleList([AttentionHead(config) for _ in range(config.n_heads)])
        self.dropout = nn.Dropout(p=config.dropout)
        self.linear = nn.Linear(config.embedding_dim, config.embedding_dim)
        self.layer_norm = nn.LayerNorm(config.embedding_dim)

    def forward(self, x): #bs * seq_len * embedding_dim
        head = torch.cat([head(x) for head in self.heads], dim=-1) #bs * seq_len * embedding_dim
        head = self.dropout(self.linear(head))                     #bs * seq_len * embedding_dim
        return self.layer_norm(head + x)

In [None]:
# Test 1: Shape preservation
mha = MultiHeadAttention(gptConfig)
test_input = torch.randn(2, 5, gptConfig.embedding_dim)
output = mha(test_input)
assert output.shape == test_input.shape, "Shape mismatch!"
print("✓ Test 1 passed: Shape preserved")

mha.eval()
# Test 2: Causality check
torch.manual_seed(42)
test_input = torch.randn(2, 5, gptConfig.embedding_dim)
output1 = mha(test_input)

# Modify last token
test_input_modified = test_input.clone()
test_input_modified[:, -1, :] = torch.randn(2, gptConfig.embedding_dim)
output2 = mha(test_input_modified)

# First tokens should be identical (not affected by future)
assert torch.allclose(output1[:, 0, :], output2[:, 0, :], atol=1e-5), "Causality violated!"
print("✓ Test 2 passed: Causality maintained")

# Test 3: Residual connection working
torch.manual_seed(42)
test_input = torch.randn(2, 5, gptConfig.embedding_dim)
output = mha(test_input)
# Output should be different from input (but related due to residual)
assert not torch.allclose(output, test_input), "Output shouldn't equal input"
print("✓ Test 3 passed: Residual connection active")

# Test 4: Different inputs produce different outputs
input1 = torch.randn(2, 5, gptConfig.embedding_dim)
input2 = torch.randn(2, 5, gptConfig.embedding_dim)
out1 = mha(input1)
out2 = mha(input2)
assert not torch.allclose(out1, out2), "Different inputs should produce different outputs"
print("✓ Test 4 passed: Different inputs → different outputs")

✓ Test 1 passed: Shape preserved
✓ Test 2 passed: Causality maintained
✓ Test 3 passed: Residual connection active
✓ Test 4 passed: Different inputs → different outputs


### Feed Forward Network (FFN)
The Feed Forward Network is applied to each position independently after the attention layer.

1. **Two linear layers** - expands then contracts the dimensionality
   - First layer: `embedding_dim → 4 * embedding_dim` (typical expansion factor is 4)
   - Second layer: `4 * embedding_dim → embedding_dim`
2. **Activation function** - GELU (Gaussian Error Linear Unit) is commonly used in GPT models, though ReLU also works
3. **Dropout** - applied after the second linear layer for regularization
4. **Residual connection + Layer Norm** - same pattern as in attention

The formula is:
`FFN(x) = LayerNorm(Dropout(Linear2(GELU(Linear1(x)))) + x)`


This allows the model to process information from the attention layer and make non-linear transformations.
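
The `FFN` class below uses `nn.GELU(approximate='tanh')`. As a small illustration in plain Python, the exact GELU and its tanh approximation can be compared on a grid of inputs; the two curves stay very close. The helper names `gelu_exact` and `gelu_tanh` are made up for this sketch.

```python
import math

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh-based approximation of GELU
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

xs = [i / 10 for i in range(-50, 51)]                       # grid from -5.0 to 5.0
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"largest gap on the grid: {max_gap:.6f}")
```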




In [None]:
#|export
class FFN(nn.Module):
    def __init__(self, config:GPTConfig):
        super().__init__()

        self.dropout = nn.Dropout(p=config.dropout)
        self.linear1 = nn.Linear(config.embedding_dim, 4 * config.embedding_dim)
        self.linear2 = nn.Linear(4 *config.embedding_dim, config.embedding_dim)
        self.layer_norm = nn.LayerNorm(config.embedding_dim)
        self.gelu = nn.GELU(approximate='tanh')

    def forward(self, x): #bs * seq_len * embedding_dim
        pred = self.linear2(self.gelu(self.linear1(x)))
        return self.layer_norm(self.dropout(pred) + x)

In [None]:
# Test 1: Shape preservation
ffn = FFN(gptConfig)
test_input = torch.randn(2, 5, gptConfig.embedding_dim)
output = ffn(test_input)
assert output.shape == test_input.shape, "Shape mismatch!"


A single decoder block is now straightforward: it simply combines these two pieces in sequence:

```text
x → MultiHeadAttention → FFN → output
```

## Final model
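We have now implemented all the components of the decoder block, so it is time to wrap them into a single model. Each decoder block computes `x → MultiHeadAttention → FFN → output`; `GPTModel` stacks `n_layers` of these blocks between the embedding layer and a final layer norm plus a linear head that projects to the vocabulary size.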

In [None]:
#|export

class GPTModel(nn.Module):
    def __init__(self, config:GPTConfig):
        super().__init__()

        self.embed = Embedding(config)
        self.blocks = nn.ModuleList(
            [
                nn.Sequential(MultiHeadAttention(config), FFN(config))
                for _ in range(config.n_layers)
            ])
        self.layer_norm = nn.LayerNorm(config.embedding_dim)
        self.lm_head = nn.Linear(config.embedding_dim, config.vocab_size)

    def forward(self, x):
        x = self.embed(x)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.layer_norm(x))

model = GPTModel(gptConfig)

In [None]:
sum(p.numel() for p in model.parameters() if p.requires_grad)

826433

Model Architecture Overview:
1. Vocabulary size `len(vocab)`: 65 (unique chars)
1. Embedding dimension `Embedding.embed`: 128
1. Number of decoder layers (`MultiHeadAttention` + `FFN`) stacked on top of each other: 4
1. Number of attention heads `AttentionHead` in each `MultiHeadAttention`: 8
1. Context length `config.seq_len`: 128 tokens
1. Total parameters: 826,433
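
The count of 826,433 printed by `sum(p.numel() ...)` can be reproduced by hand from the layer shapes defined in this notebook. The sketch below only does arithmetic; the variable names are chosen for illustration.

```python
vocab, d, seq_len, n_layers, n_heads = 65, 128, 128, 4, 8
head_dim = d // n_heads

embedding = vocab * d + seq_len * d                 # token + position embeddings
one_head = 3 * (d * head_dim + head_dim)            # Q, K, V linear layers with bias
mha = n_heads * one_head + (d * d + d) + 2 * d      # heads + output projection + LayerNorm
ffn = (d * 4 * d + 4 * d) + (4 * d * d + d) + 2 * d # two linear layers + LayerNorm
final = 2 * d + (d * vocab + vocab)                 # final LayerNorm + lm_head

total_params = embedding + n_layers * (mha + ffn) + final
print(total_params)  # → 826433
```

Most of the parameters sit in the decoder blocks, with the feed-forward network roughly twice the size of the attention sublayer.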

## Loss function: CrossEntropyLoss



A key detail: the model outputs shape `(bs, seq_len, vocab_size)`, but CrossEntropyLoss expects `(bs * seq_len, vocab_size)` for input and `(bs * seq_len)` for targets.



In [None]:
#|export
loss_func = nn.CrossEntropyLoss()

In [None]:
for x, y in dls['train']:
    break
x.shape, y.shape

(torch.Size([256, 128]), torch.Size([256, 128]))

In [None]:
logits = model(x)
logits.shape

torch.Size([256, 128, 65])

In [None]:
loss_func(logits.view(-1,gptConfig.vocab_size), y.view(-1))

tensor(4.2966, grad_fn=<NllLossBackward0>)
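
As a sanity check on the value above: an untrained model should assign roughly uniform probability to each of the 65 characters, which corresponds to a cross-entropy of about ln(65). The short calculation below (plain Python, names chosen for illustration) gives a reference point that the observed 4.2966 is close to.

```python
import math

n_classes = 65                           # vocabulary size in this notebook
uniform_prob = 1.0 / n_classes           # probability of each character under a uniform guess
expected_initial_loss = -math.log(uniform_prob)
print(f"{expected_initial_loss:.4f}")    # → 4.1744
```

An initial loss far above this value would hint at a problem such as badly scaled initial logits.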

## Training

In [None]:
#|export
from torch.optim import AdamW
from torch.nn.utils import clip_grad_norm_

model = GPTModel(gptConfig).to(gptConfig.device)
optimizer = AdamW(model.parameters(), lr=gptConfig.lr)

Please switch the runtime to GPU. The training will take some time, so take a break and enjoy a cup of ☕.

In [None]:

for epoch in range(gptConfig.epochs):
    model.train()
    train_loss = 0

    for x, y in dls['train']:
        x, y = x.to(gptConfig.device), y.to(gptConfig.device)
        optimizer.zero_grad()

        with torch.autocast(device_type=gptConfig.device, dtype=gptConfig.dtype):
            logits = model(x)
            loss = loss_func(logits.view(-1, gptConfig.vocab_size), y.view(-1))

        loss.backward()

        clip_grad_norm_(model.parameters(), gptConfig.max_grad_norm) # to clip gradients

        optimizer.step()

        train_loss += loss.item()

    model.eval()
    val_loss = 0
    with torch.no_grad(), torch.autocast(device_type=gptConfig.device, dtype=gptConfig.dtype):
        for x, y in dls['valid']:
            x, y = x.to(gptConfig.device), y.to(gptConfig.device)

            logits = model(x)
            loss = loss_func(logits.view(-1, gptConfig.vocab_size), y.view(-1))
            val_loss += loss.item()

    print(f"Epoch {epoch+1}/{gptConfig.epochs}, Train Loss: {train_loss/len(dls['train']):.4f} Validation Loss: {val_loss/len(dls['valid']):.4f}")

Epoch 1/200, Train Loss: 3.0104 Validation Loss: 2.6568
Epoch 2/200, Train Loss: 2.5700 Validation Loss: 2.5209
Epoch 3/200, Train Loss: 2.4969 Validation Loss: 2.4792
Epoch 4/200, Train Loss: 2.4562 Validation Loss: 2.4387
Epoch 5/200, Train Loss: 2.4110 Validation Loss: 2.4001
Epoch 6/200, Train Loss: 2.3606 Validation Loss: 2.3436
Epoch 7/200, Train Loss: 2.2957 Validation Loss: 2.2682
Epoch 8/200, Train Loss: 2.2161 Validation Loss: 2.1847
Epoch 9/200, Train Loss: 2.1318 Validation Loss: 2.1201
Epoch 10/200, Train Loss: 2.0595 Validation Loss: 2.0650
Epoch 11/200, Train Loss: 1.9976 Validation Loss: 2.0251
Epoch 12/200, Train Loss: 1.9425 Validation Loss: 1.9926
Epoch 13/200, Train Loss: 1.8969 Validation Loss: 1.9581
Epoch 14/200, Train Loss: 1.8553 Validation Loss: 1.9319
Epoch 15/200, Train Loss: 1.8223 Validation Loss: 1.9125
Epoch 16/200, Train Loss: 1.7947 Validation Loss: 1.8919
Epoch 17/200, Train Loss: 1.7698 Validation Loss: 1.8735
Epoch 18/200, Train Loss: 1.7517 Validat

## Inference

In [None]:
@torch.no_grad()
def generate(prompt, max_new_tokens=100, temperature=1.0):
    """
    prompt: string to start generation
    max_new_tokens: how many tokens to generate
    temperature: higher = more random, lower = more deterministic
    """
    model.eval()
    tokens = tokenizer.encode(prompt)
    tokens = torch.tensor(tokens).unsqueeze(0)  # Add batch dim
    tokens = tokens.to(gptConfig.device)        # use the configured device instead of hard-coding 'cuda'
    for _ in range(max_new_tokens):
        # Crop to last seq_len tokens if needed
        context = tokens if tokens.size(1) <= model.embed.pos_ids.size(0) else tokens[:, -model.embed.pos_ids.size(0):]

        # Get predictions (the decorator already disables gradients)
        with torch.autocast(device_type=gptConfig.device, dtype=gptConfig.dtype):
            logits = model(context)
        logits = logits[:, -1, :].float() / temperature  # Focus on last token; sample in float32

        # Sample next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)

        # Append to sequence
        tokens = torch.cat([tokens, next_token], dim=1)

    return tokenizer.decode(tokens.squeeze().tolist())
print(generate("To be or not to be"))

To be or not to be such a mind.

DUKE VINCENTIO:
Help, most that hang pleasing, wer
To please your forfections, strang
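
The `temperature` argument of `generate` rescales the logits before softmax. The plain-Python sketch below, using made-up logits and a hypothetical helper `softmax_with_temperature`, shows the effect: a lower temperature concentrates probability on the most likely token, while a higher one flattens the distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature, then apply a numerically stable softmax
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]

demo_logits = [2.0, 1.0, 0.0]            # made-up scores for three tokens
probs_by_temp = {t: softmax_with_temperature(demo_logits, t) for t in (0.5, 1.0, 2.0)}
for t, probs in probs_by_temp.items():
    print(t, [round(p, 3) for p in probs])
```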


## Conclusion

In this implementation, we built a GPT decoder model from scratch using PyTorch, gaining deep insights into the transformer architecture. Starting with a character-level tokenizer on Shakespeare's works, we implemented every component: embeddings with positional encoding, causal masked multi-head attention, feed-forward networks with GELU activation, and residual connections with layer normalization.

The final model contains 826,433 parameters across 4 decoder layers with 8 attention heads each, processing sequences of 128 tokens with an embedding dimension of 128. While significantly smaller than production models like GPT-2, this implementation demonstrates all the core concepts that make transformers powerful for language modeling.

Key learnings include understanding how causal masking enables autoregressive generation, why residual connections and layer normalization are crucial for training deep networks, and how multi-head attention allows the model to attend to different aspects of the input simultaneously.

This foundational implementation can be extended with techniques such as dropout scheduling, better initialization, and larger or alternative architectures to improve performance further.


## References

1. Vaswani et al. (2017) - ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) - The original Transformer paper
1. Radford et al. (2019) - ["Language Models are Unsupervised Multitask Learners"](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) - The GPT-2 paper
1. Andrej Karpathy - ["Let's build GPT: from scratch, in code, spelled out"](https://www.youtube.com/watch?v=kCc8FmEb1nY) - Video tutorial
1. 3Blue1Brown - [video introduction](https://youtu.be/wjZofJX0v4M) - Another excellent intro to the topic
