## Implementing a GPT Model from Scratch


This notebook covers Chapter 4 of [*Build a Large Language Model from Scratch*](https://www.manning.com/books/build-a-large-language-model-from-scratch) by Sebastian Raschka (2025).

### Coding an LLM Architecture

Generative pretrained transformer (GPT) models are large language models (LLMs) that generate text one token (roughly, a word or word piece) at a time.

- Although LLMs are large, the architecture itself is not especially complicated.
    - Many of its components simply repeat.

In this notebook, we'll code the [smallest GPT-2 model](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), which has 124 million parameters.

> "In the context of deep learning and LLMs like GPT, the term 'parameters' refers to the trainable weights of the model. These weights are essentially the internal variables of the model that are adjusted and optimized during the training process to minimize a specific loss function" (Raschka 2025:93).

**A GPT Model, Visualized**

![GPT Model](../img/fig-4_2-gpt-model.png)

> Source: Raschka, Sebastian. 2025. *Build a Large Language Model (From Scratch).* Shelter Island, NY: Manning.

### Build the Model

Start by specifying the configuration:

In [None]:
GPT_CONFIG_124M = {
    "vocab_size": 50257, # vocabulary size of the BPE tokenizer (number of unique tokens)
    "context_length": 1024, # maximum number of tokens in an input sequence
    "emb_dim": 768, # embedding dimensions
    "n_heads": 12, # number of attention heads
    "n_layers": 12, # number of transformer blocks
    "drop_rate": 0.1, # the dropout rate (helps prevent overfitting)
    "qkv_bias": False # whether to use a bias (intercept) for the Query-Key-Value linear layers
}
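A few sizes used later follow directly from this configuration. The sketch below derives them with plain arithmetic; the names `head_dim` and `ff_hidden_dim` are illustrative, and the factor of `4` is the feed forward expansion factor used later in this notebook.

```python
# Copy of the configuration values needed for the arithmetic below.
config = {"vocab_size": 50257, "context_length": 1024,
          "emb_dim": 768, "n_heads": 12, "n_layers": 12}

# Each attention head works on an equal slice of the embedding,
# so the embedding size must be divisible by the number of heads.
assert config["emb_dim"] % config["n_heads"] == 0
head_dim = config["emb_dim"] // config["n_heads"]

# The feed forward network expands the embedding by a factor of 4.
ff_hidden_dim = 4 * config["emb_dim"]

# Sizes of the two embedding tables (rows x columns).
tok_emb_params = config["vocab_size"] * config["emb_dim"]
pos_emb_params = config["context_length"] * config["emb_dim"]

print(f"head_dim: {head_dim}")                                  # 64
print(f"ff_hidden_dim: {ff_hidden_dim}")                        # 3072
print(f"token embedding parameters: {tok_emb_params:,}")        # 38,597,376
print(f"positional embedding parameters: {pos_emb_params:,}")   # 786,432
```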

Start with a basic placeholder backbone:

In [None]:
import torch
import torch.nn as nn

class DummyGPTModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        
        # token embeddings: lookup table of shape (vocab_size, emb_dim):
        self.tok_emb = nn.Embedding(config["vocab_size"], config["emb_dim"])

        # positional embeddings of shape (context_length, embedding_dims):
        self.pos_emb = nn.Embedding(config["context_length"], config["emb_dim"])

        # dropout:
        self.drop_emb = nn.Dropout(config["drop_rate"])

        # stack of n_layers transformer blocks (placeholders for now):
        self.transformer_blocks = nn.Sequential(
            *[DummyTransformerBlock(config)
              for _ in range(config["n_layers"])]
        )

        # normalization:
        self.final_norm = DummyLayerNorm(config["emb_dim"])

        # output:
        self.out_head = nn.Linear(config["emb_dim"], config["vocab_size"], bias=False)
    
    # forward pass:
    def forward(self, in_idx):
        batch_size, seq_length = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(
            torch.arange(seq_length, device=in_idx.device)
        )
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.transformer_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

class DummyTransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
    
    def forward(self, x):
        return x

class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
    
    def forward(self, x):
        return x

Prepare some input data:

In [None]:
import tiktoken

# tokenizer:
tokenizer = tiktoken.get_encoding("gpt2")

# sample texts:
text1 = "Every effort moves you"
text2 = "Every day holds a"

# encode and append to batch:
batch = []

batch.append(
    torch.tensor(tokenizer.encode(text1))
)

batch.append(
    torch.tensor(tokenizer.encode(text2))
)

batch = torch.stack(batch, dim=0)

print(f"Batch shape: {batch.shape}")
print(batch)

Instantiate a dummy model:

In [None]:
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print(f"Output shape: {logits.shape}")
print(logits)

The output tensor shape corresponds to:

- `2` $\rightarrow$ two text examples
- `4` $\rightarrow$ the length of the input sequences
- `50257` $\rightarrow$ the size of the tokenizer vocabulary

#### Normalizing activations with layer normalization 

Large, deep neural networks with many layers often encounter the issue of vanishing or exploding gradients, which makes training unstable.

Layer normalization improves the stability of deep neural network training.

- The goal of normalization is to give the activation outputs ***zero mean*** and ***unit variance***, i.e., a mean of `0` and a variance of `1`.
- Normalized activations make converging to good weights quicker, more reliable, and more stable.

In most GPT models and transformer architectures, normalization is applied before and after multi-headed attention.

Example:

In [None]:
torch.manual_seed(123)
batch_example = torch.randn(2, 5)
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)

Get mean and variance:

In [None]:
out_mean = out.mean(dim=-1, keepdim=True)
out_var = out.var(dim=-1, keepdim=True)

print(f"Mean: {out_mean}")
print(f"Variance: {out_var}")

Normalize:

In [None]:
out_norm = (out - out_mean) / torch.sqrt(out_var)
out_norm_mean = out_norm.mean(dim=-1, keepdim=True)
out_norm_var = out_norm.var(dim=-1, keepdim=True)

print(f"Normed outputs: {out_norm}")
print(f"Normed means: {out_norm_mean}")
print(f"Normed variances: {out_norm_var}")

Build out the `LayerNorm` class using these insights:

In [None]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        # small constant to prevent division by 0 errors:
        self.eps = 1e-5

        # trainable parameters that scale and shift the normalized
        # activations; the model adjusts them during training
        # if doing so improves performance:
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False) # no Bessel correction (n-1)
        norm_x = (x - mu) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

Try it out:

In [None]:
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
out_mean = out_ln.mean(dim=-1, keepdim=True)
out_var = out_ln.var(dim=-1, unbiased=False, keepdim=True)

print(f"Means: {out_mean}")
print(f"Variances: {out_var}")
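The `unbiased=False` argument matters mostly for small inputs. As a rough illustration in plain Python (the sample values are made up, and `statistics.pvariance` and `statistics.variance` stand in for the biased and unbiased `torch` variants), normalizing with the biased variance yields a biased variance of almost exactly `1`, while the unbiased estimate of the same values is larger by a factor of `n / (n - 1)`:

```python
import math
import statistics

# Made-up activations for one example (n = 5 features).
x = [0.2, -1.3, 0.7, 2.1, -0.4]
eps = 1e-5  # same small constant as in the LayerNorm class

mu = statistics.fmean(x)
var_biased = statistics.pvariance(x)  # divides by n, like unbiased=False

# Normalize as in LayerNorm.forward (with scale = 1 and shift = 0).
x_norm = [(v - mu) / math.sqrt(var_biased + eps) for v in x]

norm_mean = statistics.fmean(x_norm)
norm_var_biased = statistics.pvariance(x_norm)   # close to 1
norm_var_unbiased = statistics.variance(x_norm)  # close to n / (n - 1) = 1.25

print(f"mean: {norm_mean:.6f}")
print(f"biased variance: {norm_var_biased:.6f}")
print(f"unbiased variance: {norm_var_unbiased:.6f}")
```

With `emb_dim = 768`, the factor `n / (n - 1)` is about `1.0013`, so the choice has little practical effect at that size.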

#### Implementing a feed forward network with GELU activations

**ReLU (rectified linear unit)** has historically been the go-to activation function, but the **Gaussian error linear unit (GELU)** and **Swish-gated linear unit (SwiGLU)** are often employed in LLMs.

- GELU and SwiGLU are more complicated than ReLU, but they're also smoother.
- They can also offer better performance than ReLU in deep networks.

$\text{GELU}(x) = x \cdot \Phi(x), \text{where} \ \Phi \ \text{is the cumulative distribution function of the standard Gaussian distribution}$

In practice, a more efficient approximation is implemented:


$\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715 \cdot x^3\right)\right]\right)$
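To get a feel for how close the approximation is, the sketch below compares it with the exact definition in plain Python. It relies on the identity that the standard Gaussian cumulative distribution function can be written as $0.5 \cdot (1 + \text{erf}(x / \sqrt{2}))$; the function names and the grid over `[-3, 3]` are choices made for this illustration.

```python
import math

def gelu_exact(x: float) -> float:
    # x * Phi(x), with the standard Gaussian CDF written via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh-based approximation from the formula above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Compare the two on an evenly spaced grid over [-3, 3].
grid = [-3.0 + 6.0 * i / 600 for i in range(601)]
max_abs_diff = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)

print(f"max |exact - approx| on [-3, 3]: {max_abs_diff:.2e}")
```

On this range the two versions agree to within a small fraction of a percent of the plotted scale, which is why the cheaper form is a reasonable stand-in.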

Implement GELU:

In [None]:
class GELU(nn.Module):
    def __init__(self):
        super().__init__()
    
    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))

Plot GELU and ReLU:

In [None]:
import matplotlib.pyplot as plt

gelu = GELU()
relu = nn.ReLU()

x = torch.linspace(-3, 3, 100)
y_gelu, y_relu = gelu(x), relu(x)

plt.figure(figsize=(10,4))
for i, (y, label) in enumerate(zip([y_gelu, y_relu], ["GELU", "ReLU"]), 1):
    plt.subplot(1, 2, i)
    plt.plot(x, y)
    plt.title(f"{label} Activation")
    plt.xlabel("x")
    plt.ylabel(f"{label}(x)")
    plt.grid(True)
plt.tight_layout()
plt.show()

ReLU outputs `0` if `x` is negative, else `x`.
- Thus, the output is never negative.
- ReLU's sharp "corner" at `0` can make optimization harder.

By contrast, GELU is a smooth non-linear function.
- There is a non-zero gradient for most negative values (except around `-0.75`).
- Because GELU is smoother, it can lead to better optimization during training.
- Negative inputs produce small non-zero outputs rather than exactly `0`.
    - Neurons with negative values contribute to learning, though not as much as neurons with positive values.
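The point about `-0.75` can be checked numerically. The sketch below, in plain Python, scans negative inputs and locates where the tanh-approximated GELU reaches its lowest value, which is where its slope passes through zero; the grid resolution is an arbitrary choice for this illustration.

```python
import math

def gelu(x: float) -> float:
    # Tanh approximation of GELU, matching the GELU class above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def relu(x: float) -> float:
    return max(0.0, x)

# Scan negative inputs on a fine grid and locate GELU's lowest point.
grid = [-3.0 + 3.0 * i / 3000 for i in range(3001)]  # -3.0 ... 0.0
x_min = min(grid, key=gelu)

print(f"GELU minimum near x = {x_min:.3f}, value = {gelu(x_min):.4f}")
print(f"ReLU at the same x: {relu(x_min)}")
```

At that input GELU still returns a small negative value, whereas ReLU returns exactly `0`.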

Let's now add a **feed forward neural network** that incorporates GELU:

- This network will first ***expand*** the embedding dimension by some factor (here, `4`).
- Then, it will pass the outputs into the GELU activation function.
- Finally, the GELU outputs are projected back down to the original embedding dimension of `768`.
- This expansion and contraction should allow the model to learn more complicated and nuanced representations.

In [None]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(config["emb_dim"], 4 * config["emb_dim"]),
            GELU(),
            nn.Linear(4 * config["emb_dim"], config["emb_dim"])
        )
    
    def forward(self, x):
        return self.layers(x)

And try it out:

In [None]:
ffn = FeedForward(GPT_CONFIG_124M)
x = torch.rand(2, 3, 768)
out = ffn(x)

print(out.shape)
print(out)

#### Adding shortcut connections

***Shortcut*** (or ***skip*** or ***residual***) connections create alternative "shorter paths" for gradients to pass through the network by "skipping one or more layers" (p. 109).
- This helps avoid the vanishing gradient problem (i.e., gradients shrinking toward zero as they are propagated backward through many layers).

Shortcut connections work by adding input values to the outputs of a layer.
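One way to see why this helps: if a layer computes $f(x)$, then with a shortcut the output is $x + f(x)$, whose derivative is $1 + f'(x)$, so each layer contributes a factor near `1` rather than a possibly tiny $f'(x)$. The toy sketch below uses a hypothetical scalar "layer" `f(x) = 0.1 * x` (not a real network) and a finite-difference gradient to show the effect over five layers:

```python
def toy_layer(x: float, w: float = 0.1) -> float:
    # Hypothetical scalar "layer" with a small weight, for illustration only.
    return w * x

def forward(x: float, depth: int, use_shortcut: bool) -> float:
    for _ in range(depth):
        out = toy_layer(x)
        # With a shortcut, the layer's input is added to its output.
        x = x + out if use_shortcut else out
    return x

def input_gradient(x: float, depth: int, use_shortcut: bool, h: float = 1e-6) -> float:
    # Central finite difference of the output with respect to the input.
    upper = forward(x + h, depth, use_shortcut)
    lower = forward(x - h, depth, use_shortcut)
    return (upper - lower) / (2 * h)

grad_plain = input_gradient(1.0, depth=5, use_shortcut=False)    # about 0.1 ** 5
grad_shortcut = input_gradient(1.0, depth=5, use_shortcut=True)  # about 1.1 ** 5

print(f"without shortcut: {grad_plain:.2e}")
print(f"with shortcut:    {grad_shortcut:.4f}")
```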

Example:

In [None]:
# 5 layer neural network
class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
                nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
                nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
                nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
                nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())
            ]
        )

    def forward(self, x):
        for layer in self.layers:
            layer_output = layer(x)
            if self.use_shortcut and x.shape == layer_output.shape: # shapes must match for shortcut
                x = x + layer_output
            else:
                x = layer_output
        return x

Function for computing gradients:

In [None]:
def print_gradients(model, x):
    # model prediction:
    output = model(x)

    # target value:
    target = torch.tensor([[0.0]])

    # loss:
    loss_fn = nn.MSELoss()
    loss = loss_fn(output, target)

    # backpropagation:
    loss.backward()

    # mean absolute gradient of each layer's weight matrix:
    for name, param in model.named_parameters():
        if 'weight' in name:
            print(f"mean(abs(grad of {name})) = {param.grad.abs().mean().item()}")

Try it out; notice how small the gradients get, especially in the early layers:

In [None]:
layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1., 0., -1.]])

torch.manual_seed(123)
model_without_shortcut = ExampleDeepNeuralNetwork(layer_sizes, use_shortcut=False)
print_gradients(model_without_shortcut, sample_input)

We can correct for vanishing gradients by using skip connections:

In [None]:
layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1., 0., -1.]])

torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(layer_sizes, use_shortcut=True)
print_gradients(model_with_shortcut, sample_input)

#### Building the transformer block

Key ideas:

1. Multi-headed attention learns relationships between elements of the input sequence.
2. The feed forward network "modifies the data individually at each position" (p. 113).

The result: "a more nuanced understanding and processing of the input" and the enhancement of "the model's overall capacity for handling complex data patterns" (p. 113).

A transformer block:
1. Starts with inputs, whose tokens receive embeddings.
2. Token embeddings are passed through a normalization layer.
3. The normed outputs are passed into a masked multi-head attention layer.
4. Dropout is applied (optionally), and the block's input is added back via a shortcut connection.
5. The outputs pass through another normalization layer.
6. The normed outputs are passed through a feed forward network.
    - Linear --> GELU --> Linear
7. Dropout is applied (optionally), followed by a second shortcut connection.

We need to fetch our `MultiHeadAttention` class from chapter 3:

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out,
                 context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()

        # logic check:
        assert (d_out % num_heads == 0), "Error: d_out must be divisible by num_heads!"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # embedding size per head

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

        # linear layer for head outputs
        # (not strictly necessary, but commonly used):
        self.out_proj = nn.Linear(d_out, d_out)

        # dropout:
        self.dropout = nn.Dropout(dropout)

        # causal mask, registered as a buffer (not a trainable parameter):
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        batch_size, num_tokens, d_in = x.shape

        # queries, keys, values
        # of shape (batch_size, num_tokens, d_out):
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)
        
        # split the matrices:
        queries = queries.view(batch_size, num_tokens, self.num_heads, self.head_dim)
        keys = keys.view(batch_size, num_tokens, self.num_heads, self.head_dim)
        values = values.view(batch_size, num_tokens, self.num_heads, self.head_dim)

        # transpose from (batch_size, num_tokens, num_heads, head_dim)
        # to (batch_size, num_heads, num_tokens, head_dim):
        queries = queries.transpose(1, 2)
        keys = keys.transpose(1, 2)
        values = values.transpose(1, 2)

        # attention scores (dot products per head):
        attn_scores = queries @ keys.transpose(2, 3)

        # mask out future tokens, truncated to the current sequence length:
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        # scale by sqrt(head_dim) and normalize into attention weights:
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        # context vectors, transposed back to
        # (batch_size, num_tokens, num_heads, head_dim):
        context_vec = (attn_weights @ values).transpose(1, 2)

        # merge the heads into shape (batch_size, num_tokens, d_out):
        context_vec = context_vec.contiguous().view(
            batch_size, num_tokens, self.d_out
        )
        context_vec = self.out_proj(context_vec)
        return context_vec
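
To see what the mask does, here is a small sketch in plain Python for a single head with three tokens. The attention scores and the `head_dim` value are made-up numbers; positions above the diagonal (future tokens) are set to `-inf`, so the softmax assigns them a weight of exactly `0`.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up attention scores for 3 tokens (rows: queries, columns: keys).
scores = [
    [0.9, 0.2, 0.4],
    [0.1, 0.8, 0.3],
    [0.5, 0.6, 0.7],
]
head_dim = 4  # hypothetical head size, used only for the scaling step

num_tokens = len(scores)
weights = []
for i in range(num_tokens):
    # Token i may attend to tokens 0..i; later positions are masked out.
    masked = [scores[i][j] if j <= i else -math.inf for j in range(num_tokens)]
    scaled = [v / math.sqrt(head_dim) for v in masked]
    weights.append(softmax(scaled))

for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to `1`, but the first token can only attend to itself, the second to the first two tokens, and so on.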

    

And now we can build a transformer block:

In [None]:
class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        
        # attention:
        self.attention = MultiHeadAttention(
            d_in=config["emb_dim"],
            d_out=config["emb_dim"],
            context_length=config["context_length"],
            num_heads=config["n_heads"],
            dropout=config["drop_rate"],
            qkv_bias=config["qkv_bias"]
        )

        # feed forward:
        self.ff = FeedForward(config)

        # norm:
        self.norm1 = LayerNorm(config["emb_dim"])
        self.norm2 = LayerNorm(config["emb_dim"])
        
        # dropout with shortcut:
        self.drop_shortcut = nn.Dropout(config["drop_rate"])

    def forward(self, x):
        shortcut = x

        # pre-layer norm:
        x = self.norm1(x)

        # attention:
        x = self.attention(x)

        # dropout with shortcut:
        x = self.drop_shortcut(x)
        x = x + shortcut

        shortcut = x

        # pre-layer norm:
        x = self.norm2(x)

        # feed forward network:
        x = self.ff(x)

        # dropout with shortcut:
        x = self.drop_shortcut(x)
        x = x + shortcut
        
        return x

Test it out:

In [None]:
torch.manual_seed(123)
x = torch.rand(2, 4, 768)
block = TransformerBlock(GPT_CONFIG_124M)
output = block(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")

### Coding the full GPT model

The basic steps of the GPT model are:

1. Create token embeddings.
2. Create positional embeddings.
3. Add the positional embeddings to the token embeddings.
4. Perform dropout.
5. Pass the output from the dropout layer through `N` transformer layers.
6. Perform a final normalization.
7. Compute the outputs.

In [None]:
class GPTModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        # token embeddings: lookup table of shape (vocab_size, emb_dim):
        self.tok_emb = nn.Embedding(config["vocab_size"], config["emb_dim"])

        # positional embeddings of shape (context_length, embedding_dims):
        self.pos_emb = nn.Embedding(config["context_length"], config["emb_dim"])

        # dropout:
        self.drop_emb = nn.Dropout(config["drop_rate"])

        # N transformer blocks (corresponds to n_layers):
        self.trf_blocks = nn.Sequential(
            *[
                TransformerBlock(config) for _
                in range(config["n_layers"])
            ]
        )

        # normalization:
        self.final_norm = LayerNorm(config["emb_dim"])

        # final output head:
        self.out_head = nn.Linear(
            config["emb_dim"], config["vocab_size"], bias=False
        )

    # forward pass:
    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        token_embeddings = self.tok_emb(in_idx)
        pos_embeddings = self.pos_emb(
            torch.arange(seq_len, device=in_idx.device)
        )
        
        x = token_embeddings + pos_embeddings
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        
        logits = self.out_head(x)
        return logits

Initialize a GPT-2 model:

In [None]:
torch.manual_seed(123)

model = GPTModel(GPT_CONFIG_124M)
output = model(batch)

print(f"Input batch:\n {batch}\n")
print(f"Output shape: {output.shape}\n")
print(f"Output:\n {output}")

PyTorch provides functionality, such as `numel()` (*number of elements*), for inspecting the model:

In [None]:
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

The supposedly 124-million-parameter model actually has more than 163 million parameters. What gives?

- The original GPT-2 uses ***weight tying***.
- That is, weights from the token embedding layer are reused in the output layer.
- Our `GPTModel` keeps separate weights for `tok_emb` and `out_head`, so this matrix is counted twice.

If we subtract the number of parameters in the output head from the total number of parameters, we'll get to 124 million parameters:

In [None]:
total_params_gpt2 = (
    total_params - sum(p.numel() for p in model.out_head.parameters())
)

print(f"Number of trainable parameters with weight tying: {total_params_gpt2:,}")
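
The subtraction above only accounts for weight tying on paper; the `GPTModel` class still holds two separate matrices. As a minimal sketch of what tying could look like in PyTorch, the hypothetical `TinyTiedLM` class below (deliberately tiny, and not part of the notebook's model) shares one `Parameter` between the embedding and the output head. This works because both weights have shape `(vocab_size, emb_dim)`.

```python
import torch
import torch.nn as nn

class TinyTiedLM(nn.Module):
    # Hypothetical miniature model: embedding -> linear output head.
    def __init__(self, vocab_size: int, emb_dim: int, tie_weights: bool):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)
        if tie_weights:
            # Reuse the embedding matrix as the output projection.
            self.out_head.weight = self.tok_emb.weight

    def forward(self, idx):
        return self.out_head(self.tok_emb(idx))

def count_params(module: nn.Module) -> int:
    # parameters() yields each shared Parameter only once.
    return sum(p.numel() for p in module.parameters())

untied = TinyTiedLM(vocab_size=100, emb_dim=8, tie_weights=False)
tied = TinyTiedLM(vocab_size=100, emb_dim=8, tie_weights=True)

print(f"untied: {count_params(untied):,}")  # 1,600
print(f"tied:   {count_params(tied):,}")    # 800
```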

Parameters in feed forward and attention modules:

In [None]:
block = TransformerBlock(GPT_CONFIG_124M)
total_params = sum(p.numel() for p in block.ff.parameters())
print(f"Feed forward parameters: {total_params}")

total_params = sum(p.numel() for p in block.attention.parameters())
print(f"Attention parameters: {total_params}")
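
These counts can also be worked out by hand from the layer shapes. The sketch below redoes the arithmetic in plain Python, assuming the `GPT_CONFIG_124M` values and the layer definitions used in this notebook (query, key, and value projections without bias; all other linear layers in the block with bias; the causal mask is a buffer and therefore not counted).

```python
emb_dim, n_layers = 768, 12
vocab_size, context_length = 50257, 1024

# Feed forward: Linear(emb, 4*emb) and Linear(4*emb, emb), both with bias.
ff_params = (emb_dim * 4 * emb_dim + 4 * emb_dim) + (4 * emb_dim * emb_dim + emb_dim)

# Attention: query, key, value projections without bias,
# plus an output projection with bias.
attn_params = 3 * emb_dim * emb_dim + (emb_dim * emb_dim + emb_dim)

# Each LayerNorm has a scale and a shift vector; there are two per block.
norm_params = 2 * emb_dim
block_params = ff_params + attn_params + 2 * norm_params

total = (
    vocab_size * emb_dim          # token embeddings
    + context_length * emb_dim    # positional embeddings
    + n_layers * block_params     # transformer blocks
    + norm_params                 # final LayerNorm
    + emb_dim * vocab_size        # output head (no bias)
)

print(f"feed forward: {ff_params:,}")    # 4,722,432
print(f"attention:    {attn_params:,}")  # 2,360,064
print(f"full model:   {total:,}")        # 163,009,536
```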

Memory requirements:

In [None]:
# assume float32 (4 bytes per parameter):
total_params = sum(p.numel() for p in model.parameters())
total_size_bytes = total_params * 4
total_size_mb = total_size_bytes / (1024 * 1024)
total_size_gb = total_size_mb / 1024

print(f"Total size: {total_size_mb:.3f} MB ({total_size_gb:.3f} GB)")
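
The same estimate can be repeated for other numeric formats. The sketch below assumes the 163,009,536 parameters that this configuration works out to without weight tying, and it covers only the weights themselves (not activations, gradients, or optimizer state).

```python
total_params = 163_009_536  # parameter count of the untied 124M configuration

# Bytes per parameter for a few common floating-point formats.
bytes_per_param = {"float32": 4, "float16": 2, "bfloat16": 2}

sizes_mb = {
    dtype: total_params * n_bytes / (1024 * 1024)
    for dtype, n_bytes in bytes_per_param.items()
}

for dtype, size_mb in sizes_mb.items():
    print(f"{dtype}: {size_mb:.2f} MB")
```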

#### Generating text

The tensor outputs from the GPT model need to be converted back into tokens.

- With each iteration, the newly generated token is appended to the input, so the sequence context grows (i.e., the model sees more and more tokens).

The shape of the output tensor is: `[batch_size, num_tokens, vocab_size]`. To get new tokens, we must:

1. Extract the logits for the last position in the sequence.
2. Select a token ID based on the corresponding probability distribution.
3. Decode the token IDs back into natural language.

In practice, the vector at the last position of each sequence (the last entry along the `num_tokens` dimension) holds the scores for the ***next token***.

- The logits in this vector are passed through a softmax, converting them to probabilities.
    - This isn't strictly necessary, since softmax is a monotonic transformation.
    - That is, the index with the highest logit is also going to have the highest softmax probability.
- The index with the highest probability corresponds to the ID of the next token in the vocabulary.
    - Simply generating the most probable next token is called ***greedy decoding***. We'll use a sampling strategy later on to make outputs more interesting.
- The identified token is decoded and appended to the previous inputs.
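
The point about softmax being optional for greedy decoding can be illustrated with a few made-up logits. In the plain-Python sketch below (the toy vocabulary of five tokens and the helper names are assumptions for this example), taking the argmax of the logits and of the softmax probabilities selects the same token ID:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    # Index of the largest value (the first one in case of ties).
    return max(range(len(values)), key=values.__getitem__)

# Made-up logits for a toy vocabulary of 5 tokens at the last position.
last_logits = [1.2, -0.3, 3.1, 0.4, 2.9]
probs = softmax(last_logits)

next_from_logits = argmax(last_logits)
next_from_probs = argmax(probs)

print(f"probabilities: {[round(p, 3) for p in probs]}")
print(f"next token ID from logits: {next_from_logits}")  # 2
print(f"next token ID from probs:  {next_from_probs}")   # 2
```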

Implement greedy token generation:

In [None]:
def generate_text_simple(model, idx, max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        # keep only the last context_size tokens as context:
        idx_cond = idx[:, -context_size:]
        
        # predict:
        with torch.no_grad():
            logits = model(idx_cond)
        
        # keep only the last position:
        # (batch, n_tokens, vocab_size) --> (batch, vocab_size)
        logits = logits[:, -1, :]

        # get probabilities:
        probs = torch.softmax(logits, dim=-1)

        # get next token index with the highest probability:
        idx_next = torch.argmax(probs, dim=-1, keepdim=True)

        # append next token to previous inputs:
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
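
The truncation step matters once the sequence grows beyond the supported context. As a toy illustration in plain Python, the sketch below replaces the model with a hypothetical `toy_next_token` function (it just sums the visible tokens modulo 10) so that the effect of the sliding window is easy to trace by hand:

```python
def toy_next_token(context):
    # Hypothetical stand-in for a model: "predicts" the sum of the
    # visible context modulo 10, so the result depends on what is visible.
    return sum(context) % 10

def generate(tokens, max_new_tokens, context_size):
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        context = tokens[-context_size:]  # keep only the most recent tokens
        tokens.append(toy_next_token(context))
    return tokens

out = generate([1, 2, 3], max_new_tokens=4, context_size=3)
print(out)  # [1, 2, 3, 6, 1, 0, 7]
```

The full sequence keeps growing, but each prediction only ever sees the last `context_size` tokens.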

And test:

In [None]:
start_context = "Hello, I am"
encoded = tokenizer.encode(start_context)
print(f"Encoded input: {encoded}")

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print(f"Encoded tensor shape: {encoded_tensor.shape}")
print(f"Encoded tensor: {encoded_tensor}")

Run through model:

- Put the model in `.eval()` mode to disable random components (e.g., dropout).

In [None]:
model.eval()

out = generate_text_simple(
    model=model, 
    idx=encoded_tensor, 
    max_new_tokens=6, 
    context_size=GPT_CONFIG_124M["context_length"]
)

print(f"Output: {out}")
print(f"Output length: {len(out[0])}")

Decode output:

In [None]:
print(
    tokenizer.decode(out.squeeze(0).tolist())
)

Hooray, we generated gibberish with an untrained model!

---

**Done!**