#### Chapter 5: Pretraining on Unlabeled Data

In [None]:
from importlib.metadata import version

pkgs = [
    "matplotlib",
    "numpy",
    "tiktoken",
    "torch",
    "tensorflow"  # For OpenAI's pretrained weights
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

###### In this chapter, we implement the training loop and code for basic model evaluation to pretrain an LLM
###### At the end of this chapter, we also load openly available pretrained weights from OpenAI into our model

###### The topics covered in this chapter are shown below

#### 5.1 Evaluating generative text models
###### We start this section with a brief recap of initializing a GPT model using the code from the previous chapter
###### Then, we discuss basic evaluation metrics for LLMs
###### Lastly, in this section, we apply these evaluation metrics to a training and validation dataset

##### 5.1.1 Using GPT to generate text
###### We initialize a GPT model using the code from the previous chapter


In [None]:
import torch
from previous_chapters import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,      # Vocabulary size
    "context_length": 256,    # Shortened context length (orig: 1024)
    "emb_dim": 768,           # Embedding dimension
    "n_heads": 12,            # Number of attention heads
    "n_layers": 12,           # Number of layers
    "drop_rate": 0.1,         # Dropout rate
    "qkv_bias": False         # Query-key-value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval();   # Disable dropout during inference
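
###### As a minimal sketch of what switching to evaluation mode does, the snippet below uses a standalone nn.Dropout layer with the same rate of 0.1 (a toy layer for illustration, not part of GPTModel)
###### In training mode, some entries are zeroed and the remaining ones are scaled by 1 / (1 - 0.1); in evaluation mode, the input passes through unchanged

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
dropout = nn.Dropout(p=0.1)   # toy layer; same rate as "drop_rate" in the config
x = torch.ones(2, 6)

dropout.train()               # training mode: entries are randomly zeroed
train_out = dropout(x)        # surviving entries are scaled by 1 / (1 - 0.1)

dropout.eval()                # evaluation mode: dropout becomes a no-op
eval_out = dropout(x)

print(train_out)
print(eval_out)
```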

###### We use dropout of 0.1 above, but it's relatively common to train LLMs without dropout nowadays
###### Modern LLMs also don't use bias vectors in the nn.Linear layers for the query, key, and value matrices (unlike earlier GPT models), which is achieved by setting "qkv_bias": False
###### We use a context length (context_length) of only 256 tokens to reduce the computational resource requirements for training the model, whereas the original 124 million parameter GPT-2 model used 1024 tokens
######       -> This is so that more readers will be able to follow and execute the code examples on their laptop computer
######       -> However, please feel free to increase the context_length to 1024 tokens (this would not require any code changes)
######       -> We will also load a model with a 1024 context_length later from pretrained weights
###### Next, we use the generate_text_simple function from the previous chapter to generate text
###### In addition, we define two convenience functions, text_to_token_ids and token_ids_to_text, for converting between token and text representations that we use throughout this chapter

In [None]:
import tiktoken
from previous_chapters import generate_text_simple

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)
    return tokenizer.decode(flat.tolist())

start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

###### As we can see above, the model does not produce good text because it has not been trained yet
###### How do we measure or capture what "good text" is, in a numeric form, to track it during training?
###### The next subsection introduces a loss metric for the generated outputs that we can use to measure the training progress
###### The next chapters on finetuning LLMs will also introduce additional ways to measure model quality

##### 5.1.2 Calculating the text generation loss: cross-entropy and perplexity
###### Suppose we have an inputs tensor containing the token IDs for 2 training examples (rows)
###### Corresponding to the inputs, the targets contain the desired token IDs that we want the model to generate
###### Notice that the targets are the inputs shifted by 1 position, as explained in chapter 2 when we implemented the data loader

In [None]:
inputs = torch.tensor([[16833, 3626, 6100],   # ["every effort moves",
                       [40,    1107, 588]])   #  "I really like"]

targets = torch.tensor([[3626, 6100, 345],    # [" effort moves you",
                        [1107, 588, 11311]])  #  " really like chocolate"]
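
###### To illustrate the shift-by-one relationship, here is a small sketch that builds an input-target pair from a single token ID sequence
###### The helper make_input_target_pair is a hypothetical name used only for this illustration; it is not the data loader from chapter 2

```python
def make_input_target_pair(token_ids, context_length):
    """Return (inputs, targets) where targets are the inputs shifted by one position."""
    inputs = token_ids[:context_length]
    targets = token_ids[1:context_length + 1]
    return inputs, targets

# Token IDs of the first input row above, followed by its next token (345)
token_ids = [16833, 3626, 6100, 345]
x, y = make_input_target_pair(token_ids, context_length=3)

print(x)  # [16833, 3626, 6100]
print(y)  # [3626, 6100, 345]
```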

###### Feeding the inputs to the model, we obtain the logits tensor for the 2 input examples that consist of 3 tokens each
###### Each token is represented by a 50,257-dimensional vector corresponding to the size of the vocabulary
###### Applying the softmax function, we can turn the logits tensor into a tensor of the same dimension containing probability scores

In [None]:
with torch.no_grad():
    logits = model(inputs)

probas = torch.softmax(logits, dim=-1)  # Probability of each token in the vocabulary
print(probas.shape) # Shape: (batch_size, num_tokens, vocab_size)
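
###### To make the softmax step concrete, the following plain-Python sketch applies the same computation to a made-up logits vector over a hypothetical 4-token vocabulary
###### The values are illustrative only and are not produced by the model

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    max_logit = max(logits)                           # subtract the max for stability
    exps = [math.exp(v - max_logit) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 0.5, -1.0, 0.1]   # made-up scores for a 4-token vocabulary
toy_probas = softmax(toy_logits)

print(toy_probas)
print(sum(toy_probas))               # approximately 1.0
```

###### The output has the same length as the input, all entries are positive and sum to 1, and the largest logit receives the largest probability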

###### The figure below, using a very small vocabulary for illustration purposes, outlines how we convert the probability scores back into text, which we discussed at the end of the previous chapter
###### As discussed in the previous chapter, we can apply the argmax function to convert the probability scores into predicted token IDs
###### The softmax function above produced a 50,257-dimensional vector for each token; the argmax function returns the position of the highest probability score in this vector, which is the predicted token ID for the given token
###### Since we have a batch of 2 inputs with 3 tokens each, we obtain 2 by 3 predicted token IDs: