#### Chapter 4: Implementing a GPT Model from Scratch to Generate Text

In [None]:
from importlib.metadata import version

import matplotlib
import tiktoken
import torch

print("matplotlib version:", version("matplotlib"))
print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

###### In this chapter, we implement a GPT-like LLM architecture; the next chapter will focus on training this LLM

#### 4.1 Coding an LLM architecture
###### Chapter 1 discussed models like GPT and Llama, which generate words sequentially and are based on the decoder part of the original transformer architecture
###### Therefore, these LLMs are often referred to as "decoder-like" LLMs
###### Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code
###### We'll see that many elements are repeated in an LLM's architecture

###### In previous chapters, we used small embedding dimensions for token inputs and outputs for ease of illustration, ensuring they fit on a single page
###### In this chapter, we consider embedding and model sizes akin to a small GPT-2 model
###### We'll specifically code the architecture of the smallest GPT-2 model (124 million parameters), as outlined in Radford et al.'s "Language Models are Unsupervised Multitask Learners" (note that the initial report lists it as 117M parameters, but this was later corrected in the model weight repository)
###### Chapter 6 will show how to load pretrained weights into our implementation, which will be compatible with model sizes of 345, 762, and 1542 million parameters
###### Configuration details for the 124 million parameter GPT-2 model include:

In [None]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,      # Vocabulary size
    "context_length": 1024,   # Context length
    "emb_dim": 768,           # Embedding dimension
    "n_heads": 12,            # Number of attention heads
    "n_layers": 12,           # Number of layers
    "drop_rate": 0.1,         # Dropout rate
    "qkv_bias": False         # Query-key-value bias
}

###### We use short variable names to avoid long lines of code later
###### "vocab_size" indicates a vocabulary size of 50,257 tokens, as used by the BPE tokenizer discussed in Chapter 2
###### "context_length" represents the model's maximum input token count, as enabled by positional embeddings covered in Chapter 2
###### "emb_dim" is the embedding size for token inputs, converting each input token into a 768-dimensional vector
###### "n_heads" is the number of attention heads in the multi-head attention mechanism implemented in Chapter 3
###### "n_layers" is the number of transformer blocks within the model, which we'll implement in upcoming sections
###### "drop_rate" is the dropout mechanism's intensity, discussed in Chapter 3; 0.1 means dropping 10% of hidden units during training to mitigate overfitting
###### "qkv_bias" decides whether the Linear layers in the multi-head attention mechanism (from Chapter 3) include a bias vector when computing the query (Q), key (K), and value (V) tensors; we'll disable this option, which is standard practice in modern LLMs, but we'll revisit it later when loading pretrained GPT-2 weights from OpenAI into our reimplementation in Chapter 5
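
###### As a quick sanity check, the following plain-Python sketch derives a few quantities implied by the configuration above; it assumes (as is usual for multi-head attention, though not stated in this section) that "emb_dim" is split evenly across the attention heads, and the names head_dim, tok_emb_params, and pos_emb_params are introduced here purely for illustration

```python
# Copy of the configuration above so that this snippet runs on its own
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}
cfg = GPT_CONFIG_124M

# Assumption: the embedding dimension is divided evenly among the heads
assert cfg["emb_dim"] % cfg["n_heads"] == 0
head_dim = cfg["emb_dim"] // cfg["n_heads"]

# An embedding table stores one emb_dim-sized vector per row
tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]      # one row per token ID
pos_emb_params = cfg["context_length"] * cfg["emb_dim"]  # one row per position

print("Dimension per attention head:", head_dim)
print("Token embedding parameters:", f"{tok_emb_params:,}")
print("Positional embedding parameters:", f"{pos_emb_params:,}")
```

###### Under these assumptions, each head works with 64 dimensions, and the token embedding table alone accounts for roughly 38.6 million of the model's parameters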

In [None]:
import torch
import torch.nn as nn

class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        # The config key is "drop_rate" (see GPT_CONFIG_124M)
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        # Use a placeholder for TransformerBlock
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )

        # Use a placeholder for LayerNorm
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        # in_idx holds token IDs with shape (batch_size, seq_len)
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        # One positional vector per position, shared by all sequences in the batch
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # A simple placeholder

    def forward(self, x):
        # This block does nothing and just returns its input
        return x

class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # The parameters here are just to mimic the LayerNorm interface

    def forward(self, x):
        # This layer does nothing and just returns its input 
        return x
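
###### The first steps of the forward method above (look up a vector per token ID, look up a vector per position, and add the two) can be traced by hand with nested lists; the sketch below uses tiny made-up sizes and hypothetical integer tables (tok_table, pos_table) in place of the learned nn.Embedding weights, so only the indexing pattern and the resulting shape carry over to the real model

```python
# Tiny made-up sizes, chosen only so the result is easy to read
vocab_size, context_length, emb_dim = 6, 4, 3

# Hypothetical lookup tables standing in for the learned embedding weights
tok_table = [[10 * t + d for d in range(emb_dim)] for t in range(vocab_size)]
pos_table = [[p + 1] * emb_dim for p in range(context_length)]

# A batch of 2 sequences with 4 token IDs each
batch = [[5, 0, 2, 1],
         [3, 3, 4, 0]]

# For every sequence, add the position vector to the token vector;
# the same position vectors are reused for each sequence in the batch
x = [
    [
        [tok_table[tok][d] + pos_table[pos][d] for d in range(emb_dim)]
        for pos, tok in enumerate(seq)
    ]
    for seq in batch
]

shape = (len(x), len(x[0]), len(x[0][0]))
print("Shape:", shape)  # (batch_size, seq_len, emb_dim)
print("Token 3 at position 0:", x[1][0])
print("Token 3 at position 1:", x[1][1])
```

###### Note that the same token ID (3) yields different vectors at positions 0 and 1, which is the purpose of adding positional embeddings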

In [None]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

batch = []

txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)

In [None]:
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)

logits = model(batch)
print("Output shape:", logits.shape)
print(logits)


#### Note

###### If you are running this code on Windows or Linux, the resulting values above may look as follows:
###### Output shape: torch.Size([2, 4, 50257])
###### tensor([[[-0.9289,  0.2748, -0.7557,  ..., -1.6070,  0.2702, -0.5888],
######         [-0.4476,  0.1726,  0.5354,  ..., -0.3932,  1.5285,  0.8557],
######         [ 0.5680,  1.6053, -0.2155,  ...,  1.1624,  0.1380,  0.7425],
######         [ 0.0447,  2.4787, -0.8843,  ...,  1.3219, -0.0864, -0.5856]],

######        [[-1.5474, -0.0542, -1.0571,  ..., -1.8061, -0.4494, -0.6747],
######         [-0.8422,  0.8243, -0.1098,  ..., -0.1434,  0.2079,  1.2046],
######         [ 0.1355,  1.1858, -0.1453,  ...,  0.0869, -0.1590,  0.1552],
######         [ 0.1666, -0.8138,  0.2307,  ...,  2.5035, -0.3055, -0.3083]]],
######       grad_fn=<UnsafeViewBackward0>)
###### Since these are just random numbers, this is not a reason for concern, and you can proceed with the remainder of the chapter without issues

#### 4.2 Normalizing activations with layer normalization
###### Layer normalization, also known as LayerNorm (Ba et al. 2016), centers the activations of a neural network layer around a mean of 0 and normalizes their variance to 1
###### This stabilizes training and enables faster convergence to effective weights
###### Layer normalization is applied both before and after the multi-head attention module within the transformer block, which we will implement later; it's also applied before the final output layer

###### Let's see how layer normalization works by passing a small input sample through a simple neural network layer:

In [None]:
torch.manual_seed(123)

# Create 2 training examples with 5 dimensions (features) each
batch_example = torch.randn(2, 5)

layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)

###### Let's compute the mean and variance for each of the 2 inputs above:

In [None]:
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)

print("Mean:\n", mean)
print("Variance:\n", var)

###### The normalization is applied to each of the two inputs (rows) independently; using dim=-1 applies the calculation across the last dimension (in this case, the feature dimension) instead of the row dimension
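
###### To make the per-row computation concrete, here is a plain-Python sketch using made-up activations (not the values printed above); it relies on the fact that torch.var uses the sample variance (dividing by n - 1) by default, which statistics.variance from the standard library also does

```python
from statistics import mean, variance

# Made-up activations: 2 examples (rows) with 6 features each
out = [
    [0.0, 0.5, 0.0, 1.5, 0.0, 1.0],
    [2.0, 0.0, 0.0, 0.4, 0.0, 0.0],
]

# Reducing over the last dimension means one statistic per row
row_means = [mean(row) for row in out]
row_vars = [variance(row) for row in out]  # sample variance, divides by n - 1

print("Mean per row:", row_means)
print("Variance per row:", row_vars)
```

###### Each row yields its own mean and variance, so the statistics of one example never influence another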

###### Subtracting the mean and dividing by the square root of the variance (the standard deviation) rescales each input to have a mean of 0 and a variance of 1 across the column (feature) dimension:

In [None]:
out_norm = (out - mean) / torch.sqrt(var)
print("Normalized layer outputs:\n", out_norm)

mean = out_norm.mean