# Chapter 4: Implementing a GPT Model from Scratch to Generate Text


This chapter covers
- Coding a GPT-like large language model (LLM) that can be trained to generate
human-like text
- Normalizing layer activations to stabilize neural network training
- Adding shortcut connections in deep neural networks to train models more
effectively
- Implementing transformer blocks to create GPT models of various sizes
- Computing the number of parameters and storage requirements of GPT models


In this chapter, we implement a GPT-like LLM architecture; the next chapter will focus on training this LLM.

  <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/01.webp" width="700px">


## 4.1 Coding an LLM architecture

- Chapter 1 discussed models like GPT and Llama, which generate words sequentially and are based on the decoder part of the original transformer architecture
- Therefore, these LLMs are often referred to as "decoder-like" LLMs
- Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code
- We'll see that many elements are repeated in an LLM's architecture

Figure 4.2 provides a top-down view of a GPT-like LLM, with its main components highlighted.

  <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/02.webp" width="700px">


- In the previous chapters, we used smaller embedding dimensions for simplicity, ensuring
that the concepts and examples could comfortably fit on a single page. In this chapter,
we scale up to the size of a small GPT-2 model, specifically the smallest version with
124 million parameters, as described in Radford et al.'s paper, "Language Models are
Unsupervised Multitask Learners." Note that while the original report mentions 117 million
parameters, this was later corrected.

- Chapter 6 will focus on loading pretrained weights into our implementation and adapting
it for larger GPT-2 models with 345, 762, and 1,542 million parameters. In the context of
deep learning and LLMs like GPT, the term "parameters" refers to the trainable weights of
the model. These weights are essentially the internal variables of the model that are
adjusted and optimized during the training process to minimize a specific loss function. This
optimization allows the model to learn from the training data.
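To get a feel for what these parameter counts mean in terms of storage, the following sketch estimates the memory needed just to hold the weights. It assumes that every parameter is stored as a 32-bit float (4 bytes); the helper name `weights_size_mib` is illustrative and not part of the chapter's code.

```python
# Parameter counts (in millions) of the GPT-2 sizes mentioned above
gpt2_sizes_millions = [124, 345, 762, 1542]

BYTES_PER_PARAM = 4  # assumption: each weight is stored as a 32-bit float


def weights_size_mib(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Return the approximate storage needed for the weights in MiB."""
    return n_params * bytes_per_param / (1024 ** 2)


for millions in gpt2_sizes_millions:
    n_params = millions * 1_000_000
    print(f"{millions:>5}M parameters: ~{weights_size_mib(n_params):,.0f} MiB")
```

Storing the weights at a different precision changes the result proportionally, and training needs additional memory for gradients and optimizer state, so this is a lower bound rather than a full estimate.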


**GPT-2 VERSUS GPT-3**

- Note that we are focusing on GPT-2 because OpenAI has made the weights of the
pretrained model publicly available, which we will load into our implementation in
chapter 6. GPT-3 is fundamentally the same in terms of model architecture, except
that it is scaled up from 1.5 billion parameters in GPT-2 to 175 billion parameters in
GPT-3, and it is trained on more data. As of this writing, the weights for GPT-3 are
not publicly available. GPT-2 is also a better choice for learning how to implement
LLMs, as it can be run on a single laptop computer, whereas GPT-3 requires a GPU
cluster for training and inference. According to Lambda Labs, it would take 355 years to train GPT-3 on a single V100 datacenter GPU, and 665 years on a consumer RTX 8000 GPU.


Configuration details for the 124 million parameter GPT-2 model include:


In [1]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

- "vocab_size" refers to a vocabulary of 50,257 tokens, as used by the
BPE tokenizer from chapter 2.
- "context_length" denotes the maximum number of input tokens the
model can handle, via the positional embeddings discussed in chapter 2.
- "emb_dim" represents the embedding size, transforming each token into
a 768-dimensional vector.
- "n_heads" indicates the count of attention heads in the multi-head
attention mechanism, as implemented in chapter 3.
- "n_layers" specifies the number of transformer blocks in the model,
which will be elaborated on in upcoming sections.
- "drop_rate" indicates the intensity of the dropout mechanism (0.1
implies a 10% drop of hidden units) to prevent overfitting, as covered in
chapter 3.
- "qkv_bias" determines whether to include a bias vector in the Linear
layers of the multi-head attention for query, key, and value computations.
We will initially disable this, following the norms of modern LLMs, but will
revisit it in chapter 6 when we load pretrained GPT-2 weights from
OpenAI into our model.
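As a rough sanity check on these settings, the following sketch derives a few quantities directly from the configuration values using plain Python arithmetic. The variable names (`head_dim`, `tok_emb_params`, and so on) are illustrative, and the sketch assumes that the embedding dimension is split evenly across the attention heads, as in the multi-head attention module from chapter 3.

```python
# Configuration values copied from GPT_CONFIG_124M
cfg = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
}

# Assumption: each attention head works on an equal slice of the embedding
assert cfg["emb_dim"] % cfg["n_heads"] == 0
head_dim = cfg["emb_dim"] // cfg["n_heads"]

# An embedding layer stores one emb_dim-sized vector per row
tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]
pos_emb_params = cfg["context_length"] * cfg["emb_dim"]

# A bias-free linear output layer maps emb_dim back to vocab_size
out_head_params = cfg["emb_dim"] * cfg["vocab_size"]

print("Dimensions per attention head:", head_dim)
print(f"Token embedding parameters:      {tok_emb_params:,}")
print(f"Positional embedding parameters: {pos_emb_params:,}")
print(f"Output layer parameters:         {out_head_params:,}")
```

The token embedding and the output layer are the same size because both connect the 768-dimensional embedding space to the 50,257-entry vocabulary.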


Using the configuration above, we will start this chapter by implementing a GPT placeholder
architecture (DummyGPTModel) in this section, as shown in Figure 4.3. This will provide us
with a big-picture view of how everything fits together and what other components we need
to code in the upcoming sections to assemble the full GPT model architecture.

  <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/03.webp" width="700px">



In [2]:
import torch
import torch.nn as nn

In [3]:
class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])])  #A
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])                    #B
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits


class DummyTransformerBlock(nn.Module):         #C
    def __init__(self, cfg):
        super().__init__()

    def forward(self, x):                       #D
        return x
    

class DummyLayerNorm(nn.Module):                        #E
    def __init__(self, normalized_shape, eps=1e-5):     #F
        super().__init__()

    def forward(self, x):               
        return x

#A Use a placeholder for TransformerBlock
#B Use a placeholder for LayerNorm
#C A simple placeholder class that will be replaced by a real TransformerBlock later
#D This block does nothing and just returns its input.
#E A simple placeholder class that will be replaced by a real LayerNorm later
#F The parameters here are just to mimic the LayerNorm interface.
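The forward method above adds token embeddings of shape [batch_size, seq_len, emb_dim] to positional embeddings of shape [seq_len, emb_dim]. The sketch below reproduces only that step with toy sizes (the values 10, 6, and 3 are arbitrary choices for readability, not settings from the chapter) to show that PyTorch's broadcasting adds the same positional vectors to every sequence in the batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Toy sizes (hypothetical, chosen small for readability)
vocab_size, context_length, emb_dim = 10, 6, 3

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

# A batch of 2 sequences with 4 token IDs each
in_idx = torch.tensor([[1, 2, 3, 4],
                       [5, 6, 7, 8]])
batch_size, seq_len = in_idx.shape

tok_embeds = tok_emb(in_idx)                 # shape: [2, 4, 3]
pos_embeds = pos_emb(torch.arange(seq_len))  # shape: [4, 3]

# Broadcasting expands pos_embeds along the batch dimension
x = tok_embeds + pos_embeds                  # shape: [2, 4, 3]

print(tok_embeds.shape, pos_embeds.shape, x.shape)
```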

- Next, we will prepare the input data and initialize a new GPT model to illustrate its
usage. Building on the figures we have seen in chapter 2, where we coded the tokenizer,
Figure 4.4 provides a high-level overview of how data flows in and out of a GPT model.

  <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/04.webp" width="700px">

Figure 4.4 A big-picture overview showing how the input data is tokenized, embedded, and fed to the GPT
model. Note that in our DummyGPTModel coded earlier, the token embedding is handled inside the GPT model.
In LLMs, the embedded input token dimension typically matches the output dimension. The output embeddings
here represent the context vectors we discussed in chapter 3.

- To implement the steps shown in Figure 4.4, we tokenize a batch consisting of two text
inputs for the GPT model using the tiktoken tokenizer introduced in chapter 2:


In [7]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
print(len(batch))
batch = torch.stack(batch, dim=0)
print(batch.shape)
print(batch)

2
torch.Size([2, 4])
tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


In [8]:
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)

Output shape: torch.Size([2, 4, 50257])
tensor([[[-0.9289,  0.2748, -0.7557,  ..., -1.6070,  0.2702, -0.5888],
         [-0.4476,  0.1726,  0.5354,  ..., -0.3932,  1.5285,  0.8557],
         [ 0.5680,  1.6053, -0.2155,  ...,  1.1624,  0.1380,  0.7425],
         [ 0.0448,  2.4787, -0.8843,  ...,  1.3219, -0.0864, -0.5856]],

        [[-1.5474, -0.0542, -1.0571,  ..., -1.8061, -0.4494, -0.6747],
         [-0.8422,  0.8243, -0.1098,  ..., -0.1434,  0.2079,  1.2046],
         [ 0.1355,  1.1858, -0.1453,  ...,  0.0869, -0.1590,  0.1552],
         [ 0.1666, -0.8138,  0.2307,  ...,  2.5035, -0.3055, -0.3083]]],
       grad_fn=<UnsafeViewBackward0>)


- The output tensor has two rows corresponding to the two text samples. Each text sample
consists of 4 tokens; each token is a 50,257-dimensional vector, which matches the size of
the tokenizer's vocabulary.
- Each output vector has 50,257 dimensions because each of these dimensions refers to a
unique token in the vocabulary. At the end of this chapter, when we implement the
postprocessing code, we will convert these 50,257-dimensional vectors back into token IDs,
which we can then decode into words.
- Now that we have taken a top-down look at the GPT architecture and its inputs and outputs,
we will code the individual placeholders in the upcoming sections, starting with the real
layer normalization class that will replace the DummyLayerNorm in the previous code.
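To make the idea of converting output vectors back into token IDs concrete, here is a minimal sketch that assumes the simplest possible rule: pick the index of the largest value along the vocabulary dimension. The toy logits are made up for illustration, and the postprocessing code at the end of the chapter may differ in its details.

```python
import torch

# Toy logits with shape [batch_size=1, num_tokens=2, vocab_size=5]
# (hypothetical values; a real GPT-2 output has 50,257 entries per token)
logits = torch.tensor([[[0.1, 2.0, -1.0, 0.3,  0.0],
                        [1.5, 0.2,  0.1, 3.1, -0.4]]])

# The index of the largest logit along the last (vocabulary) dimension
# serves as the predicted token ID for that position
token_ids = torch.argmax(logits, dim=-1)

print(token_ids.shape)     # one ID per token position: [1, 2]
print(token_ids.tolist())  # [[1, 3]]
```

With the untrained DummyGPTModel, the IDs obtained this way would be arbitrary, since its weights are still random.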


## 4.2 Normalizing activations with layer normalization

- Training deep neural networks with many layers can sometimes prove challenging due to
issues like vanishing or exploding gradients. These issues lead to unstable training
dynamics and make it difficult for the network to effectively adjust its weights, which
means the learning process struggles to find a set of parameters (weights) for the neural
network that minimizes the loss function. In other words, the network has difficulty learning
the underlying patterns in the data to a degree that would allow it to make accurate
predictions or decisions.
- In this section, we will implement layer normalization to improve the stability and
efficiency of neural network training.
- The main idea behind layer normalization is to adjust the activations (outputs) of a
neural network layer to have a mean of 0 and a variance of 1, also known as unit variance.
This adjustment speeds up the convergence to effective weights and ensures consistent,
reliable training. 
- Layer normalization is applied both before and after the multi-head attention module within the transformer block, which we will implement later; it is also applied before the final output layer.
- Before we implement layer normalization in code, Figure 4.5 provides a visual overview
of how layer normalization functions.

  <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/05.webp" width="700px">


- Let's see how layer normalization works by passing a small input sample through a simple neural network layer:

We can recreate the example shown in Figure 4.5 via the following code, where we
implement a neural network layer with 5 inputs and 6 outputs that we apply to two input
examples:

In [9]:
torch.manual_seed(123)
batch_example = torch.randn(2, 5)
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)

tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
       grad_fn=<ReluBackward0>)


- Before we apply layer normalization to these outputs, let's examine the mean and
variance:



In [16]:
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print(f'Mean:\n {mean}')
print(f'Variance:\n {var}')
mean.shape, var.shape

Mean:
 tensor([[0.1324],
        [0.2170]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[0.0231],
        [0.0398]], grad_fn=<VarBackward0>)


(torch.Size([2, 1]), torch.Size([2, 1]))

- The normalization is applied to each of the two inputs (rows) independently; using dim=-1 applies the calculation across the last dimension (in this case, the feature dimension) instead of the row dimension
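The effect of the dim argument can be checked on a small tensor with easily verifiable values. The sketch below uses made-up inputs (not the layer outputs from above) and also includes a 3D tensor of the kind a GPT model produces.

```python
import torch

# A small 2D example: 2 rows (inputs) with 3 features each
x2d = torch.tensor([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])

col_means = x2d.mean(dim=0, keepdim=True)   # across rows: 2.5, 3.5, 4.5
row_means = x2d.mean(dim=-1, keepdim=True)  # across features: 2.0 and 5.0

print(col_means.shape, row_means.shape)     # [1, 3] and [2, 1]

# A 3D tensor shaped like [batch_size, num_tokens, embedding_size]
x3d = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)
mean3d = x3d.mean(dim=-1, keepdim=True)

print(mean3d.shape)                         # [2, 3, 1]: one mean per token
```

Setting keepdim=True keeps the reduced dimension with size 1, so the result can be subtracted from the original tensor without reshaping.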

  <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/06.webp" width="700px">


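- Later, when adding layer normalization to the GPT model, which produces 3D tensors with shape
[batch_size, num_tokens, embedding_size], we can still use dim=-1 for normalization
across the last dimension, avoiding a change from dim=1 to dim=2.
- Next, we apply layer normalization to the layer outputs obtained earlier by subtracting the
mean and dividing by the square root of the variance (the standard deviation):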


In [17]:
out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer outputs:\n", out_norm)
print('Mean:\n', mean)
print('Variance:\n', var)


Normalized layer outputs:
 tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],
        [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],
       grad_fn=<DivBackward0>)
Mean:
 tensor([[9.9341e-09],
        [5.9605e-08]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)


- Each input is now centered at 0 and has unit variance; to improve readability, we can disable PyTorch's scientific notation:


In [18]:
torch.set_printoptions(sci_mode=False)
print('Mean:\n', mean)
print('Variance:\n', var)

Mean:
 tensor([[    0.0000],
        [    0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)


- Above, we normalized the features of each input
- Now, using the same idea, we can implement a LayerNorm class:

In [19]:
# A layer normalization class

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


- Note that in addition to performing the normalization by subtracting the mean and dividing by the square root of the variance (the standard deviation), we added two trainable parameters, a scale and a shift parameter
- The initial scale (multiplying by 1) and shift (adding 0) values don't have any effect; however, scale and shift are trainable parameters that the LLM automatically adjusts during training if doing so improves the model's performance on its training task
- This allows the model to learn appropriate scaling and shifting that best suit the data it is processing
- Note that we also add a small value (eps) to the variance before computing the square root; this avoids division-by-zero errors if the variance is 0
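The computation in the forward method can also be written as a formula. The symbols below are our own notation rather than the chapter's: $n$ is the embedding dimension, $\mu$ and $\sigma^2$ are the mean and variance across the last dimension, $\epsilon$ corresponds to eps, and $\gamma$ and $\beta$ correspond to the scale and shift parameters.

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2, \qquad
y_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
```

With $\gamma_i = 1$ and $\beta_i = 0$, the initial values used in the class, $y_i$ is simply the normalized input.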

**BIASED VARIANCE**
- In our variance calculation, we set unbiased=False. This means that we divide by the
number of inputs n in the variance formula. This approach does not apply Bessel's
correction, which uses n-1 instead of n in the denominator to adjust for bias in
sample variance estimation, and therefore results in a so-called biased estimate of the
variance. For large language models (LLMs), where the embedding dimension n is
large, the difference between using n and n-1 is practically negligible. We chose
this approach to ensure compatibility with the GPT-2 model's normalization layers and
because it reflects TensorFlow's default behavior, which was used to implement the
original GPT-2 model. Using a similar setting ensures our method is compatible with the
pretrained weights we will load in chapter 6.
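The size of this difference is easy to check numerically. The sketch below compares both variance estimates on random vectors of length 5 (the toy example size) and 768 (the GPT-2 embedding dimension); the helper `variance_ratio` is a hypothetical name introduced only for this illustration. The two estimates differ exactly by the factor n/(n-1).

```python
import torch

torch.manual_seed(123)


def variance_ratio(n):
    """Ratio of the unbiased to the biased variance of n random values."""
    x = torch.randn(n)
    var_unbiased = x.var(unbiased=True)   # divides by n - 1 (Bessel's correction)
    var_biased = x.var(unbiased=False)    # divides by n
    return (var_unbiased / var_biased).item()


for n in (5, 768):
    print(f"n={n}: ratio={variance_ratio(n):.4f}, n/(n-1)={n / (n - 1):.4f}")
```

For n=5 the two estimates differ by 25%, whereas for n=768 the difference is roughly 0.13%.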


Let's now try the LayerNorm module in practice and apply it to the batch input:


In [20]:
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, keepdim=True, unbiased=False)
print("Mean:\n", mean)
print("Variance:\n", var)


Mean:
 tensor([[    -0.0000],
        [     0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
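As an additional check, the sketch below compares the LayerNorm class with PyTorch's built-in nn.LayerNorm on a random 3D input. It repeats the class definition so that it runs on its own; the tensor sizes are arbitrary toy values. Both layers use a biased variance estimate and an eps of 1e-5, so with freshly initialized scale and shift parameters the outputs should agree up to floating-point error.

```python
import torch
import torch.nn as nn


class LayerNorm(nn.Module):  # same definition as above
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


torch.manual_seed(123)
x = torch.randn(2, 4, 8)      # [batch_size, num_tokens, emb_dim], toy sizes

custom_ln = LayerNorm(emb_dim=8)
builtin_ln = nn.LayerNorm(8)  # built-in layer; eps defaults to 1e-5

max_diff = (custom_ln(x) - builtin_ln(x)).abs().max().item()
print("Max absolute difference:", max_diff)
```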


In this section, we covered one of the building blocks we will need to implement the GPT
architecture, as shown in the mental model in Figure 4.7.

  <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch04_compressed/07.webp" width="700px">


- In the next section, we will look at the GELU activation function, which is one of the
activation functions used in LLMs, instead of the traditional ReLU function we used in this
section.


**LAYER NORMALIZATION VERSUS BATCH NORMALIZATION**
- If you are familiar with batch normalization, a common and traditional normalization
method for neural networks, you may wonder how it compares to layer
normalization. Unlike batch normalization, which normalizes across the batch
dimension, layer normalization normalizes across the feature dimension. LLMs often
require significant computational resources, and the available hardware or the
specific use case can dictate the batch size during training or inference. Since layer
normalization normalizes each input independently of the batch size, it offers more
flexibility and stability in these scenarios. This is particularly beneficial for distributed
training or when deploying models in environments where resources are constrained.
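The independence from the batch size can be illustrated with a small experiment. The sketch below uses a simplified `normalize` helper (a hypothetical function that omits the trainable parameters of layer normalization and the running statistics of a real batch normalization layer) and applies it either across features or across the batch.

```python
import torch

torch.manual_seed(123)
batch = torch.randn(4, 6)  # 4 inputs with 6 features each


def normalize(x, dim):
    """Normalize x to zero mean and unit variance along the given dimension."""
    mean = x.mean(dim=dim, keepdim=True)
    var = x.var(dim=dim, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + 1e-5)


# Layer-norm style: statistics per input, computed across its features
ln_full = normalize(batch, dim=-1)
ln_half = normalize(batch[:2], dim=-1)

# Batch-norm style: statistics per feature, computed across the batch
bn_full = normalize(batch, dim=0)
bn_half = normalize(batch[:2], dim=0)

# The first two inputs are normalized identically in both batch sizes
same_ln = torch.allclose(ln_full[:2], ln_half, atol=1e-6)
# Across the batch, the result depends on which other inputs are present
same_bn = torch.allclose(bn_full[:2], bn_half, atol=1e-6)

print("Feature-wise result unchanged by batch size:", same_ln)
print("Batch-wise result unchanged by batch size:  ", same_bn)
```

The feature-wise result for an input stays the same whether it is processed in a batch of four or a batch of two, while the batch-wise result changes with the other inputs in the batch.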
