# Chapter 5: Pretraining on unlabeled data

- basic model evaluation to measure the quality of the generated text
- how to load pretrained weights

In [1]:
import torch
from chapter_04 import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12, 
    "drop_rate": 0.1,
    "qkv_bias": False
}
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(256, 768)
  (drop_emb): Dropout(p=0.1, inplace=False)
  (trf_block): Sequential(
    (0): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=768, out_features=768, bias=False)
        (W_key): Linear(in_features=768, out_features=768, bias=False)
        (W_value): Linear(in_features=768, out_features=768, bias=False)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (ff): FeedForward(
        (layers): Sequential(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU()
          (2): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
      (norm1): LayerNorm()
      (norm2): LayerNorm()
      (drop_shortcut): Dropout(p=0.1, inplace=False)
    )
    (1): TransformerBlock(
      (att): MultiHeadAttention(
        (W_query): Linear(in_features=

In [2]:
import tiktoken
from chapter_04 import generate_text_simple

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)
    return tokenizer.decode(flat.tolist())

start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)
print("Output text:", token_ids_to_text(token_ids, tokenizer))

Output text: Every effort moves you rentingetic wasnم refres RexMeCHicular stren


The goal of training an LLM is to maximize the likelihood of the correct token, which means increasing its probability relative to the other tokens in the vocabulary.
Part of text evaluation is measuring "how far" the generated tokens are from the correct predictions (targets). We use this information to adjust the model weights.

Model training aims to increase the softmax probability at the index positions corresponding to the correct target token IDs. This softmax probability is also used in the evaluation metric to numerically assess the model's generated outputs: the higher the probability at the correct positions, the better.
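As a small illustration of what "softmax probability at the target index" means, here is a minimal pure-Python sketch. The 4-token vocabulary, the logit values, and the target index are made up for the example; the real model produces 50,257 logits per position.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a toy 4-token vocabulary (not real model output)
logits = [2.0, 0.5, -1.0, 0.1]
target = 0  # assumed index of the correct next token

probas = softmax(logits)
target_proba = probas[target]

print("Probabilities:", probas)
print("Probability assigned to the target token:", target_proba)
```

Training would push `target_proba` toward 1, which necessarily lowers the probabilities of the other tokens because the values always sum to 1.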

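Working with logarithms of probability scores is more manageable in mathematical optimization than handling the scores directly: the logarithm turns a product of many small probabilities into a sum, which is numerically more stable.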
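One practical reason for preferring logarithms can be sketched in a few lines. The example assumes a hypothetical sequence of 100 tokens, each assigned probability 1e-5 (the same order of magnitude as an untrained model would give); the numbers are illustrative only.

```python
import math

# Hypothetical per-token probabilities: 100 tokens, each with probability 1e-5
token_probs = [1e-5] * 100

# Multiplying the raw probabilities underflows to 0.0 in 64-bit floats
likelihood = math.prod(token_probs)

# Summing the log probabilities stays well within floating-point range
log_likelihood = sum(math.log(p) for p in token_probs)

print("Product of probabilities:", likelihood)
print("Sum of log probabilities:", log_likelihood)
```

The product carries no usable information once it collapses to zero, whereas the sum of logs remains a finite number that can be averaged and optimized.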

![log_likelihood](/home/mat/Documents/whisper_triton/understand_LLMs/log_likelihood.png)


The goal is to get the average log probability as close to 0 as possible by updating the model's weights as part of the training process.
In deep learning, the common practice is to minimize the negative average log probability instead, bringing it down toward 0. This negative average log probability is called the **cross entropy loss**.

At its core, the cross entropy loss measures the difference between two probability distributions: typically the true distribution of labels (here, the tokens in the dataset) and the predicted distribution from a model.
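To make the "two distributions" view concrete, the sketch below computes the cross entropy between a one-hot label distribution and a predicted distribution. The 4-token vocabulary and the probability values are assumptions chosen for illustration. With a one-hot label, the sum collapses to the negative log probability of the correct token, which is the per-token quantity averaged in the loss.

```python
import math

def cross_entropy(true_dist, pred_dist):
    """H(p, q) = -sum_i p_i * log(q_i), skipping terms where p_i is 0."""
    return -sum(p * math.log(q) for p, q in zip(true_dist, pred_dist) if p > 0)

# Toy 4-token vocabulary (hypothetical values); the correct token is index 2
true_dist = [0.0, 0.0, 1.0, 0.0]   # one-hot label distribution
pred_dist = [0.1, 0.2, 0.6, 0.1]   # stand-in for a model's softmax output

loss = cross_entropy(true_dist, pred_dist)

print("Cross entropy:", loss)
print("Negative log probability of the target:", -math.log(pred_dist[2]))
```

Both printed values agree, which is why the loss only needs to look up the probability at each target index.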

In [16]:
inputs = torch.tensor([[16833, 3626, 6100],   # ["every effort moves",
                       [40,    1107, 588]])   #  "I really like"]

targets = torch.tensor([[3626, 6100, 345  ],  # [" effort moves you",
                        [1107, 588, 11311]])  #  " really like chocolate"]

with torch.no_grad():
    logits = model(inputs)
probas = torch.softmax(logits, dim=-1)
print(probas.shape)

token_ids = torch.argmax(probas, dim=-1, keepdim=True)
print("Token IDs:\n", token_ids)

print()
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1:"
      f" {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")

text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 1:", target_probas_1)

text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print("Text 2:", target_probas_2)

log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
print(f"{log_probas=}")

avg_log_probas = torch.mean(log_probas)
print(f"{avg_log_probas=}")

neg_avg_log_probas = avg_log_probas * -1
print(f"{neg_avg_log_probas=}")
print()
print("Logits shape:", logits.shape)
print("Targets shape:", targets.shape)

# For the cross_entropy loss function in PyTorch we want to flatten these tensors by combining them over the batch dimension
logits_flat = logits.flatten(0,1)
targets_flat = targets.flatten()
print("Flattened logits:", logits_flat.shape)
print("Flattened targets:", targets_flat.shape)

print()
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)
print(loss)

torch.Size([2, 3, 50257])
Token IDs:
 tensor([[[16657],
         [  339],
         [42826]],

        [[49906],
         [29669],
         [41751]]])

Targets batch 1:  effort moves you
Outputs batch 1:  Armed heNetflix
Text 1: tensor([7.4540e-05, 3.1061e-05, 1.1563e-05])
Text 2: tensor([1.0337e-05, 5.6776e-05, 4.7559e-06])
log_probas=tensor([ -9.5042, -10.3796, -11.3677, -11.4798,  -9.7764, -12.2561])
avg_log_probas=tensor(-10.7940)
neg_avg_log_probas=tensor(10.7940)

Logits shape: torch.Size([2, 3, 50257])
Targets shape: torch.Size([2, 3])
Flattened logits: torch.Size([6, 50257])
Flattened targets: torch.Size([6])

tensor(10.7940)


# Perplexity

Perplexity is a measure often used alongside cross entropy loss to evaluate the performance of models in tasks like language modeling. It can provide a more interpretable way to understand the uncertainty of a model in predicting the next token in a sequence.

Perplexity measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset. Similar to the loss, a lower perplexity indicates that the model predictions are closer to the actual distribution

Perplexity can be calculated as `perplexity = torch.exp(loss)`, which returns `tensor(48725.8203)` when applied to the previously calculated loss.

Perplexity is often considered more interpretable than the raw loss value because it signifies the effective vocabulary size about which the model is uncertain at each step.
In the given example, this translates to the model being unsure which of roughly 48,725 tokens in the vocabulary to generate as the next token.
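The "effective vocabulary size" reading can be checked with a short pure-Python sketch. It reuses the six log probabilities printed earlier (rounded to four decimals, so the result differs slightly from the `torch.exp(loss)` value) and compares them with a hypothetical model that guesses uniformly over the 50,257-token vocabulary.

```python
import math

# Log probabilities of the six target tokens, copied from the printed output
log_probas = [-9.5042, -10.3796, -11.3677, -11.4798, -9.7764, -12.2561]

# Cross entropy loss = negative average log probability
loss = -sum(log_probas) / len(log_probas)
perplexity = math.exp(loss)

# Reference point: a model that assigns equal probability to every token
vocab_size = 50257
uniform_loss = -math.log(1 / vocab_size)
uniform_perplexity = math.exp(uniform_loss)

print("Loss:", loss)
print("Perplexity:", perplexity)
print("Perplexity of a uniform guess:", uniform_perplexity)
```

A uniform guess has a perplexity equal to the vocabulary size, so a value close to 50,257 indicates that the untrained model is doing little better than picking tokens at random.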