
# 3) Coding an LLM architecture

In [None]:
from importlib.metadata import version


print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

In this notebook, we implement the architecture for a GPT-like Large Language Model (LLM). The next notebook will focus on the training process for this model, where we’ll dive into how to fine-tune and optimize it for specific tasks.

<br>
<br>
<br>
<br>

# 3.1 Coding an LLM architecture

Models like GPT, Gemma, Phi, Mistral, and Llama generate text sequentially, one token at a time, and are built on the decoder part of the original transformer architecture. As a result, these LLMs are commonly referred to as "decoder-like" models. Compared to traditional deep learning models, LLMs are significantly larger, not because of the amount of code but because of the sheer number of parameters they contain. As we will see, many components of an LLM’s architecture are repeated, which is how these models scale.

- Configuration details for the 124 million parameter GPT-2 model (GPT-2 "small") include:

In [2]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.0,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}
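
To make the link between this configuration and the "124 million" figure concrete, here is a minimal sketch that counts parameters from the config values alone. It assumes a standard GPT-2 block layout (a feed-forward layer with 4x expansion, a biased attention output projection, and two layer norms with scale and shift per block), because the `TransformerBlock` internals are not shown in this notebook. The helper `count_parameters` and its arguments `ffn_mult` and `tie_output_head` are hypothetical names used only for illustration.

```python
cfg = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_layers": 12,
    "qkv_bias": False,
}


def count_parameters(cfg, ffn_mult=4, tie_output_head=False):
    """Estimate the parameter count under an assumed GPT-2-style layout."""
    d = cfg["emb_dim"]
    hidden = ffn_mult * d  # assumed feed-forward expansion factor

    tok_emb = cfg["vocab_size"] * d
    pos_emb = cfg["context_length"] * d

    # Query, key, and value projections (bias only if qkv_bias is set)
    qkv = 3 * (d * d + (d if cfg["qkv_bias"] else 0))
    attn_out = d * d + d                            # output projection with bias
    ffn = (d * hidden + hidden) + (hidden * d + d)  # two linear layers with bias
    norms = 2 * (2 * d)                             # two layer norms: scale + shift
    block = qkv + attn_out + ffn + norms

    final_norm = 2 * d
    # A tied output head reuses the token-embedding matrix and adds nothing
    out_head = 0 if tie_output_head else d * cfg["vocab_size"]

    return tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head


total = count_parameters(cfg)
tied = count_parameters(cfg, tie_output_head=True)
print(f"{total:,} parameters with a separate output head")
print(f"{tied:,} parameters if the output head shared the token-embedding weights")
```

Under these assumptions, the "124M" name matches the count in which the output head is not counted separately; a model with its own `vocab_size x emb_dim` output layer holds more parameters than the name suggests.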

<br>
<br>
<br>
<br>



# 3.2 Coding the GPT model


We’re almost there! Now, let's integrate the transformer block, imported here from `supplementary`, into a complete architecture to create a functional GPT model. Note that the transformer block is repeated several times within the architecture. For instance, the smallest GPT-2 model, with 124 million parameters, repeats the transformer block 12 times.

- The corresponding code implementation, where `cfg["n_layers"] = 12`:

In [None]:
import torch
import torch.nn as nn
from supplementary import TransformerBlock, LayerNorm


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

- Using the configuration of the 124M parameter model, we can now instantiate this GPT model with random initial weights. We first tokenize two short texts and stack them into an input batch:

In [None]:
import torch
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

batch = []

txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)

In [None]:
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)

out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)
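
The output shape printed above follows directly from the steps in `GPTModel.forward`. As a rough sketch in plain Python (no PyTorch needed), the hypothetical helper `trace_shapes` below lists the shape after each stage. It assumes that each of the two texts encodes to four tokens and that the transformer blocks preserve the shape of their input, which is the usual design but is not shown in this notebook.

```python
cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768}


def trace_shapes(batch_size, num_tokens, cfg):
    """Return the tensor shape after each stage of the forward pass."""
    emb, vocab = cfg["emb_dim"], cfg["vocab_size"]
    if num_tokens > cfg["context_length"]:
        # The positional embedding table has only context_length rows
        raise ValueError("input is longer than the supported context length")
    return {
        "in_idx": (batch_size, num_tokens),
        "tok_embeds": (batch_size, num_tokens, emb),
        "pos_embeds": (num_tokens, emb),  # broadcast over the batch dimension
        "trf_blocks": (batch_size, num_tokens, emb),  # assumed shape-preserving
        "logits": (batch_size, num_tokens, vocab),
    }


shapes = trace_shapes(batch_size=2, num_tokens=4, cfg=cfg)
for name, shape in shapes.items():
    print(f"{name:>10}: {shape}")
```

Each input position therefore yields one vector of `vocab_size` logits, i.e., one score per vocabulary entry for the token that might come next.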

- We will train this model in the next notebook.

<br>
<br>
<br>
<br>



# 3.3 Generating text

LLMs, such as the GPT model we built earlier, generate one token at a time during the prediction process.

The `generate_text_simple` function below implements greedy decoding, a straightforward and efficient method for generating text. In greedy decoding, at each step, the model selects the token with the highest probability as the next output. Because softmax preserves the ordering of its inputs, the largest logit always corresponds to the highest probability, so we technically don’t need to compute the softmax function explicitly; the function below still computes it before taking the argmax. The diagram below illustrates how the GPT model generates the next word token based on a given input context.

In [6]:
def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):
        
        # Crop the current context if it exceeds the supported context size
        # E.g., if the LLM supports only 5 tokens and the sequence has 10,
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]
        
        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)
        
        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]  

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append the selected index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx


The `generate_text_simple` function above implements an iterative process, generating one token at a time with each iteration.
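
Two details of this function can be checked in isolation with plain Python: taking the argmax of the raw logits gives the same index as taking the argmax after softmax, and negative slicing keeps only the most recent tokens. This is a minimal sketch; the logit values, the 5-entry vocabulary, and the helper names `softmax` and `argmax` are made up for illustration.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    # Index of the largest value (the first one wins on ties)
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for a 5-entry vocabulary at the last position
logits = [1.2, -0.7, 3.4, 0.1, 2.9]
probas = softmax(logits)

# Softmax is strictly increasing, so the ranking of entries is unchanged
greedy_from_logits = argmax(logits)
greedy_from_probas = argmax(probas)
print("argmax on logits:", greedy_from_logits)
print("argmax on probabilities:", greedy_from_probas)

# Context cropping: keep only the last `context_size` token IDs
context_size = 5
token_ids = list(range(10))  # stand-in for a 10-token sequence
cropped = token_ids[-context_size:]
print("cropped context:", cropped)
```

The same slicing leaves a sequence shorter than `context_size` unchanged, which is why the function can apply it unconditionally on every iteration.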

<br>
<br>
<br>
<br>



# Exercise: Generate some text

- Use the `tokenizer.encode` method to encode your input text into token IDs.
- Convert the encoded tokens into a PyTorch tensor using `torch.tensor`.
- Add a batch dimension by calling `.unsqueeze(0)` on the tensor.
- Pass the prepared tensor into the `generate_text_simple` function to generate text based on the input.
- Finally, convert the generated token IDs back into text using the `tokenizer.decode` method.

In [7]:
model.eval();  # disable dropout

<br>
<br>
<br>
<br>



# Solution

In [None]:
start_context = "Hello, I am"

encoded = tokenizer.encode(start_context)
print("encoded:", encoded)

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)

In [None]:
out = generate_text_simple(
    model=model,
    idx=encoded_tensor, 
    max_new_tokens=6, 
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output:", out)
print("Output length:", len(out[0]))

- Remove batch dimension and convert back into text:

In [None]:
decoded_text = tokenizer.decode(out.squeeze(0).tolist())
print(decoded_text)