#### Chapter 5: Pretraining on Unlabeled Data

In [None]:
from importlib.metadata import version

pkgs = [
    "matplotlib",
    "numpy",
    "tiktoken",
    "torch",
    "tensorflow"  # For OpenAI's pretrained weights
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

###### In this chapter, we implement the training loop and code for basic model evaluation to pretrain an LLM
###### At the end of this chapter, we also load openly available pretrained weights from OpenAI into our model

###### The topics covered in this chapter are shown below

#### 5.1 Evaluating generative text models
###### We start this section with a brief recap of initializing a GPT model using the code from the previous chapter
###### Then, we discuss basic evaluation metrics for LLMs
###### Lastly, in this section, we apply these evaluation metrics to a training and validation dataset

##### 5.1.1 Using GPT to generate text
###### We initialize a GPT model using the code from the previous chapter


In [None]:
import torch
from previous_chapters import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,      # Vocabulary size
    "context_length": 256,    # Shortened context length (orig: 1024)
    "emb_dim": 768,           # Embedding dimension
    "n_heads": 12,            # Number of attention heads
    "n_layers": 12,           # Number of layers
    "drop_rate": 0.1,         # Dropout rate
    "qkv_bias": False         # Query-key-value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval();   # Disable dropout during inference

###### We use dropout of 0.1 above, but it's relatively common to train LLMs without dropout nowadays
###### Modern LLMs also don't use bias vectors in the nn.Linear layers for the query, key, and value matrices (unlike earlier GPT models), which is achieved by setting "qkv_bias": False
###### We use a context length (context_length) of only 256 tokens to reduce the computational resource requirements for training the model, whereas the original 124 million parameter GPT-2 model used 1024 tokens
######       -> This is so that more readers will be able to follow and execute the code examples on their laptop computer
######       -> However, please feel free to increase the context_length to 1024 tokens (this would not require any code changes)
######       -> We will also load a model with a 1024 context_length later from pretrained weights
###### Next, we use the generate_text_simple function from the previous chapter to generate text
###### In addition, we define two convenience functions, text_to_token_ids and token_ids_to_text, for converting between token and text representations that we use throughout this chapter

In [None]:
import tiktoken
from previous_chapters import generate_text_simple

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # add batch dimension
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)
    return tokenizer.decode(flat.tolist())

start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]  # dict lookup uses square brackets
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

###### As we can see above, the model does not produce good text because it has not been trained yet
###### How do we measure or capture what "good text" is, in a numeric form, to track it during training?
###### The next subsection introduces a loss metric for the generated outputs that we can use to measure the training progress
###### The next chapters on finetuning LLMs will also introduce additional ways to measure model quality

##### 5.1.2 Calculating the text generation loss: cross-entropy and perplexity
###### Suppose we have an inputs tensor containing the token IDs for 2 training examples (rows)
###### Corresponding to the inputs, the targets contain the desired token IDs that we want the model to generate
###### Notice that the targets are the inputs shifted by 1 position, as explained in chapter 2 when we implemented the data loader

In [None]:
inputs = torch.tensor([[16833, 3626, 6100],   # ["Every effort moves",
                       [40,    1107, 588]])   #  "I really like"]

targets = torch.tensor([[3626, 6100, 345  ],  # [" effort moves you",
                        [1107,  588, 11311]]) #  " really like chocolate"]
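
###### The one-position shift between inputs and targets can be sketched in plain Python; the flat token list and the context_length of 3 below are illustrative assumptions that mirror the first example row

```python
# Token IDs of the first example row, followed by its next token (illustrative)
token_ids = [16833, 3626, 6100, 345]

context_length = 3  # tokens per input row in this toy setup

toy_inputs = token_ids[:context_length]         # positions 0..2
toy_targets = token_ids[1:context_length + 1]   # positions 1..3, shifted by one

# At every position the model sees the context so far and should predict the next token
for i in range(context_length):
    print(f"context {toy_inputs[:i + 1]} -> next token {toy_targets[i]}")
```

###### Each input row therefore yields one prediction task per position, not just one per row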

###### Feeding the inputs to the model, we obtain the logits vector for the 2 input examples that consist of 3 tokens each
###### Each of the tokens is a 50,257-dimensional vector corresponding to the size of the vocabulary
###### Applying the softmax function, we can turn the logits tensor into a tensor of the same dimension containing probability scores

In [None]:
with torch.no_grad():
    logits = model(inputs)

probas = torch.softmax(logits, dim=-1)   # Probability of each token in Vocabulary
print(probas.shape) # Shape: (batch_size, num_tokens, vocab_size)
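
###### To make the softmax step concrete, here is a minimal plain-Python sketch on a 4-entry toy vocabulary; the logit values are made up for illustration, and torch.softmax performs the same normalization along the last dimension

```python
import math

def softmax(logits):
    # Subtracting the maximum avoids overflow and leaves the result unchanged
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1, -1.0]   # one raw score per vocabulary entry
toy_probas = softmax(toy_logits)

print([round(p, 4) for p in toy_probas])
print("Sum of probabilities:", sum(toy_probas))
```

###### The output has the same length as the input, every entry is positive, and the entries sum to 1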

###### The figure below, using a very small vocabulary for illustration purposes, outlines how we convert the probability scores back into text, which we discussed at the end of the previous chapter
###### As discussed in the previous chapter, we can apply the argmax function to convert the probability scores into predicted token IDs
###### The softmax function above produced a 50,257-dimensional vector for each token; the argmax function returns the position of the highest probability score in this vector, which is the predicted token ID for the given token
###### Since we have 2 input examples with 3 tokens each, we obtain 2 by 3 predicted token IDs:

In [None]:
token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)
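
###### The same argmax-then-decode idea on a very small vocabulary, as a plain-Python sketch; the 7-word vocabulary and the probability rows are hypothetical values chosen for illustration

```python
# Hypothetical 7-word vocabulary mapping token IDs to words
toy_vocab = {0: "a", 1: "effort", 2: "every", 3: "forward", 4: "moves", 5: "you", 6: "zoo"}

# One probability row per position; each row sums to 1
toy_probas = [
    [0.10, 0.60, 0.20, 0.05, 0.00, 0.02, 0.03],
    [0.06, 0.07, 0.08, 0.02, 0.65, 0.05, 0.07],
    [0.02, 0.01, 0.01, 0.05, 0.10, 0.80, 0.01],
]

# argmax: index of the largest probability in each row is the predicted token ID
predicted_ids = [max(range(len(row)), key=row.__getitem__) for row in toy_probas]
predicted_text = " ".join(toy_vocab[i] for i in predicted_ids)

print("Predicted IDs:", predicted_ids)
print("Decoded text:", predicted_text)
```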

###### If we decode these tokens, we find that these are quite different from the tokens we want the model to predict, namely the target tokens:

In [None]:
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")

###### That's because the model wasn't trained yet
###### To train the model, we need to know how far it is away from the correct predictions (targets)
###### The token probabilities corresponding to the target indices are as follows:

In [None]:
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)

text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)
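
###### The indexing expression above picks, for each position, the probability at the target token's index; a plain-Python sketch of the same lookup, with made-up probabilities for a 4-token vocabulary:

```python
# Toy probabilities with shape (num_tokens=3, vocab_size=4); each row sums to 1
toy_probas = [
    [0.10, 0.20, 0.30, 0.40],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.40, 0.10],
]
toy_targets = [3, 0, 2]   # desired token ID at each position

# For position `pos`, read the probability stored at column `tok`
toy_target_probas = [toy_probas[pos][tok] for pos, tok in enumerate(toy_targets)]
print(toy_target_probas)
```

###### A perfectly trained model would place a probability close to 1 at each of these selected positions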

###### We want to maximize all these values, bringing them close to a probability of 1
###### In mathematical optimization, it is easier to maximize the logarithm of the probability score than the probability score itself; this is out of the scope of this book, but I have recorded a lecture with more details here: L8.2 Logistic Regression Loss Function

In [None]:
# Compute the logarithm of all target token probabilities
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(log_probas)

###### Next, we compute the average log probability:

In [None]:
# Calculate the average log probability over all target tokens
avg_log_probas = torch.mean(log_probas)
print(avg_log_probas)

###### The goal is to make this average log probability as large as possible by optimizing the model weights
###### Due to the log, the largest possible value is 0, and we are currently far away from 0
###### In deep learning, instead of maximizing the average log-probability, it's a standard convention to minimize the negative average log-probability value; in our case, instead of maximizing -10.7722 so that it approaches 0, we would minimize 10.7722 so that it approaches 0
###### The negative of -10.7722, i.e., 10.7722, is also called the cross-entropy loss in deep learning

In [None]:
neg_avg_log_probas = avg_log_probas * -1
print(neg_avg_log_probas)
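
###### The three steps (log, average, negate) can be traced by hand on a few numbers; the probabilities below are hypothetical and chosen as powers of 1/2 so the arithmetic is easy to follow

```python
import math

# Hypothetical probabilities assigned to the correct next tokens
toy_target_probas = [0.5, 0.25, 0.125, 0.125]

toy_log_probas = [math.log(p) for p in toy_target_probas]     # all values are <= 0
toy_avg_log_proba = sum(toy_log_probas) / len(toy_log_probas)
toy_neg_avg_log_proba = -toy_avg_log_proba                    # the quantity we minimize

print("Average log probability:", toy_avg_log_proba)
print("Negative average log probability:", toy_neg_avg_log_proba)

# If every target token received probability 1, the loss would be exactly 0
perfect_loss = -sum(math.log(p) for p in [1.0, 1.0, 1.0, 1.0]) / 4
print("Loss for perfect predictions:", perfect_loss)
```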

###### PyTorch already implements a cross_entropy function that carries out the previous steps

###### Before we apply the cross_entropy function, let's check the shape of the logits and targets

In [None]:
# Logits have shape (batch_size, num_tokens, vocab_size)
print("Logits shape:", logits.shape)

# Targets have shape (batch_size, num_tokens)
print("Targets shape:", targets.shape)

###### For the cross_entropy function in PyTorch, we want to flatten these tensors by combining them over the batch dimension:

In [None]:
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()

print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)

###### Note that the targets are the token IDs, which also represent the index positions in the logits tensors that we want to maximize
###### The cross_entropy function in PyTorch will automatically take care of applying the softmax and log-probability computation internally over those token indices in the logits that are to be maximized

In [None]:
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)
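
###### As a rough sketch of what happens inside such a function, the plain-Python version below computes the mean negative log-probability of the target index directly from raw logits and compares it with the step-by-step route (softmax, log, mean, negate); the toy logits are made up, and this is an illustration of the math, not PyTorch's actual implementation

```python
import math

def toy_cross_entropy(logits_flat, targets_flat):
    """Mean negative log-probability of each target index, from raw logits."""
    losses = []
    for row, target in zip(logits_flat, targets_flat):
        m = max(row)
        # log-sum-exp is the log of the softmax denominator
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_sum_exp - row[target])   # equals -log(softmax(row)[target])
    return sum(losses) / len(losses)

toy_logits_flat = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0], [-0.5, 1.5, 0.3]]
toy_targets_flat = [0, 2, 1]

toy_loss = toy_cross_entropy(toy_logits_flat, toy_targets_flat)

# Step-by-step route for comparison
def toy_softmax(row):
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

step_by_step = -sum(
    math.log(toy_softmax(row)[t]) for row, t in zip(toy_logits_flat, toy_targets_flat)
) / len(toy_targets_flat)

print(toy_loss, step_by_step)
```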

###### A concept related to the cross-entropy loss is the perplexity of an LLM
###### The perplexity is simply the exponential of the cross-entropy loss

In [None]:
perplexity = torch.exp(loss)
print(perplexity)

###### The perplexity is often considered more interpretable because it can be understood as the effective vocabulary size that the model is uncertain about at each step (in the example above, that'd be 48,725 words or tokens)
###### In other words, perplexity provides a measure of how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset
###### Similar to the loss, a lower perplexity indicates that the model predictions are closer to the actual distribution
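
###### A useful reference point, sketched here under the assumption of a hypothetical model that assigns the same probability to every token in the 50,257-token vocabulary: its perplexity equals the vocabulary size, so an untrained model with near-random outputs should land in that neighborhood

```python
import math

vocab_size = 50257

# A uniform model gives every target token the probability 1 / vocab_size
uniform_loss = -math.log(1.0 / vocab_size)      # cross-entropy of the uniform model
uniform_perplexity = math.exp(uniform_loss)     # exponential of the loss

print(f"Loss: {uniform_loss:.4f}")
print(f"Perplexity: {uniform_perplexity:.1f}")
```

###### Training should push both numbers well below this baseline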

##### 5.1.3 Calculating the training and validation set losses
###### We use a relatively small dataset for training the LLM (in fact, only one short story)

###### The reasons are:

###### You can run the code examples in a few minutes on a laptop computer without a suitable GPU
###### The training finishes relatively fast (minutes instead of weeks), which is good for educational purposes
###### We use a text from the public domain, which can be included in this GitHub repository without violating any usage rights or bloating the repository size
###### For example, Llama 2 7B required 184,320 GPU hours on A100 GPUs to be trained on 2 trillion tokens

###### At the time of this writing, the hourly cost of an 8xA100 cloud server at AWS is approximately $30
###### So, via a back-of-the-envelope calculation, training this LLM would cost 184,320 / 8 * $30 ≈ $690,000
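
###### The arithmetic behind this estimate, written out with the figures quoted above (the hourly price is the approximate value from the text, not a current quote):

```python
gpu_hours = 184_320           # A100 GPU hours reported for Llama 2 7B
gpus_per_server = 8           # one 8xA100 cloud server
server_cost_per_hour = 30     # USD, approximate

server_hours = gpu_hours / gpus_per_server
total_cost = server_hours * server_cost_per_hour

print(f"{server_hours:,.0f} server hours -> ${total_cost:,.0f}")
```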

###### Below, we use the same dataset we used in chapter 2

In [None]:
import os
import urllib.request

file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
         text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
         file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
         text_data = file.read()

###### A quick check that the text loaded correctly by printing the first and last 100 characters

In [None]:
# First 100 characters
print(text_data[:100])

In [None]:
# Last 100 characters
print(text_data[-100:])

In [None]:
total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))

print("Characters:", total_characters)
print("Tokens:", total_tokens)

###### With 5,145 tokens, the text is very short for training an LLM, but again, it's for educational purposes (we will also load pretrained weights later)
###### Next, we divide the dataset into a training and a validation set and use the data loaders from chapter 2 to prepare the batches for LLM training
###### For visualization purposes, the figure below assumes a max_length=6, but for the training loader, we set the max_length equal to the context length that the LLM supports
###### The figure below only shows the input tokens for simplicity
######       -> Since we train the LLM to predict the next word in the text, the targets look the same as these inputs, except that the targets are shifted by one position

In [None]:
from previous_chapters import create_dataloader_v1

# Train / validation ratio
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0 
)

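val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,  # keep every validation sample, including a final partial batch
    shuffle=False,    # sample order does not matter for evaluation
    num_workers=0
)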
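
###### To estimate how many batches a loader will yield, the sketch below counts windows in plain Python. It assumes that the chapter 2 dataset starts a window every `stride` tokens and needs one extra token for the shifted targets; the helper `count_batches` and the token counts are hypothetical, because the split above is done on characters, not tokens

```python
def count_batches(num_tokens, max_length, stride, batch_size, drop_last):
    # Assumed windowing: start positions 0, stride, 2*stride, ... as long as
    # max_length input tokens plus one shifted target token still fit
    num_windows = len(range(0, num_tokens - max_length, stride))
    full_batches, remainder = divmod(num_windows, batch_size)
    if drop_last or remainder == 0:
        return full_batches
    return full_batches + 1   # keep the final, smaller batch

# Hypothetical token counts for a roughly 90/10 split of about 5,145 tokens
print("Train batches:", count_batches(4630, 256, 256, batch_size=2, drop_last=True))
print("Val batches:", count_batches(515, 256, 256, batch_size=2, drop_last=True))

# With very little data, drop_last=True can leave a loader empty
print("Tiny set, drop_last=True:", count_batches(300, 256, 256, 2, drop_last=True))
print("Tiny set, drop_last=False:", count_batches(300, 256, 256, 2, drop_last=False))
```

###### With stride equal to max_length, the windows do not overlap, so every token is used as an input at most once per epoch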

In [None]:
# Sanity check

if total_tokens * (train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the training loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "increase the `train_ratio`")

if total_tokens * (1 - train_ratio) < GPT_CONFIG_124M["context_length"]:
    print("Not enough tokens for the validation loader. "
          "Try to lower the `GPT_CONFIG_124M['context_length']` or "
          "decrease the `train_ratio`")

###### We use a relatively small batch size to reduce the computational resource demand, and because the dataset is very small to begin with
###### Llama 2 7B was trained with a batch size of 1024, for example
###### An optional check that the data was loaded correctly:

In [None]:
print("Train loader:")
for x, y in train_loader:
    print(x.shape, y.shape)

print("\nValidation loader:")
for x, y in val_loader:
    print(x.shape, y.shape)

###### Another optional check that the token sizes are in the expected ballpark:

In [None]:
train_tokens = 0
for input_batch, target_batch in train_loader:
    train_tokens += input_batch.numel()

val_tokens = 0
for input_batch, target_batch in val_loader:
    val_tokens += input_batch.numel()

print("Training tokens:", train_tokens)
print("Validation tokens:", val_tokens)
print("All tokens:", train_tokens + val_tokens)

###### Next, we implement a utility function to calculate the cross-entropy loss of a given batch
###### In addition, we implement a second utility function to compute the loss for a user-specified number of batches in a data loader

In [None]:
def calc_loss_batch(input_batch, target_batch, model, device):
    # Move the batch to the same device as the model
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)
    # Flatten (batch, tokens, vocab) -> (batch * tokens, vocab) for cross_entropy
    loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), target_batch.flatten())
    return loss

def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches in the data loader
        # if num_batches exceeds the number of batches in the data loader
        num_batches = min(num_batches, len(data_loader))

    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches  # Average loss per batch

###### If you have a machine with a CUDA-supported GPU, the LLM will train on the GPU without making any changes to the code
###### Via the device setting, we ensure that the data is loaded onto the same device as the LLM model

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Note:
# Uncommenting the following lines will allow the code to run on Apple Silicon chips, if applicable,
# which is approximately 2x faster than on an Apple CPU (as measured on an M3 MacBook Air).
# However, the resulting loss values may be slightly different.

#if torch.cuda.is_available():
#    device = torch.device("cuda")
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
#else:
#    device = torch.device("cpu")
#
# print(f"Using {device} device.")

model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes

torch.manual_seed(123) # For reproducibility due to the shuffling in the data loader

with torch.no_grad():  # Disable gradient tracking for efficiency because we are not training yet
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)

#### 5.2 Training an LLM
###### In this section, we finally implement the code for training the LLM
###### We focus on a simple training function (if you are interested in augmenting this training function with more advanced techniques, such as learning rate warmup, cosine annealing, and gradient clipping, please refer to Appendix D)

In [None]:
def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs, eval_freq, eval_iter,
                          start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1  # Start at -1 so the first optimizer step is step 0

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode

        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()  # Calculate loss gradients 
            optimizer.step()  # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter
                )
                train_losses.append(train_loss)
                val_losses.append(val_loss)