**To be covered**

1. Calculating the loss that backpropagation minimizes, and setting up training and validation datasets
2. Pretraining and saving model weights
3. Loading OpenAI GPT-2 weights into our architecture

![](images/eval-1.png)

![eval-2](images/eval-2.png)

**Generating Text Function**

1. unsqueeze(0): adding the "batch" dimension
Models like GPT are designed to process many sequences at once (a "batch") for efficiency. Because of this, the model expects a 2D tensor of token IDs with shape [batch_size, num_tokens], even if you are only sending it a single sentence.

    Your input: [15496, 995] (Shape: [2]), a flat list of token IDs.

    Model expects: [[15496, 995]] (Shape: [1, 2]), a batch containing one sequence.

    Calling .unsqueeze(0) tells PyTorch: "Add a new dimension at the very beginning (index 0) so this looks like a batch of 1."

2. squeeze(0): removing the "batch" dimension
The generated output still carries that extra batch dimension. However, tokenizer.decode() expects a flat list of token IDs, not a batch.

    Model output: tensor([[15496, 995, ...]]) (Shape: [1, 12])

    Decoder wants: [15496, 995, ...] (Shape: [12])

    Calling .squeeze(0) says: "Remove the outermost 'batch' dimension so I get the flat list of token IDs back."

In [6]:
import tiktoken
import torch
from modules import generate_text_simple, GPTModel, GPT_CONFIG_124M

def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flattened = token_ids.squeeze(0)
    return tokenizer.decode(flattened.tolist())

model = GPTModel(GPT_CONFIG_124M)
model.eval()  # evaluation mode: disables dropout

start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"],
)

print("output text - ", token_ids_to_text(token_ids, tokenizer))

output text -  Every effort moves youlord Simone intu intro caloric vaccinationestablished ClassificationPortland oats


The output is gibberish because the model is not yet trained. Before training it, we need an evaluation metric that tells us whether the model is getting better, so let's define the loss.

Training aims to increase the softmax probabilities at the indices of the correct token IDs. This is done by updating the model weights via backpropagation, which requires a loss function that measures the difference between the model's predicted outputs and the correct outputs.

![eval3](images/eval-3.png)

We compute the logits (the model's raw outputs), apply softmax, and pick out the probability assigned to the correct next token at each position. The loss is the negative average of the log of these probabilities, which is the cross-entropy loss.
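
The computation described here can be sketched in plain Python for a toy case. The vocabulary size (4), the logits, and the target IDs below are made-up values for illustration, and `softmax` and `cross_entropy` are hypothetical helpers; in the notebook the same quantity would typically come from `torch.nn.functional.cross_entropy` applied to the model's logits.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_per_position, target_ids):
    """Negative average log-probability assigned to the correct next tokens."""
    log_probs = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        log_probs.append(math.log(probs[target]))
    return -sum(log_probs) / len(log_probs)

# Hypothetical toy data: 3 positions, vocabulary of 4 tokens.
toy_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [-1.0, 3.0, 0.2, 0.1],
]
toy_targets = [0, 2, 1]

loss = cross_entropy(toy_logits, toy_targets)
print(f"loss = {loss:.4f}")
```

Raising the logit of a correct token lowers this value, which is the direction the weight updates push during training.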

**Perplexity**

A measure used alongside cross-entropy loss to evaluate models on tasks like language modelling. It indicates how far the model's predicted probability distribution is from the actual distribution of next tokens in the data. It is computed as torch.exp(loss), and as with the loss, lower is better. It can be read as the effective vocabulary size the model is uncertain about at each step.
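
As a sanity check on the "effective vocabulary size" reading: a model that spreads its probability evenly over `k` tokens has a loss of `ln(k)` and therefore a perplexity of `k`. The sketch below uses plain Python and made-up probabilities; `perplexity` is a hypothetical helper that plays the role of `torch.exp(loss)`.

```python
import math

def perplexity(correct_token_probs):
    """exp of the average negative log-probability of the correct tokens."""
    nll = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)
    return math.exp(nll)

# Uniform guess over the GPT-2 vocabulary (50,257 tokens):
# the perplexity comes out equal to the vocabulary size.
vocab_size = 50257
uniform = perplexity([1 / vocab_size] * 5)

# Made-up probabilities for a model that is fairly sure of each correct token.
confident = perplexity([0.9, 0.8, 0.95])

print(f"uniform guess: {uniform:.1f}")
print(f"confident model: {confident:.3f}")
```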

<h2>Training an LLM</h2>

In [7]:
import os
import requests

file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"

if not os.path.exists(file_path):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    text_data = response.text
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()



In [8]:
total_characters = len(text_data)
total_tokens = len(tokenizer.encode(text_data))

print("Characters:", total_characters)
print("Tokens:", total_tokens)


Characters: 20479
Tokens: 5145


**Creating training and validation datasets**

In [10]:
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader_v1(txt, batch_size, max_length, stride,
                         shuffle=True, drop_last=True, num_workers=0):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader
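
To see what the sliding window produces, the sketch below applies the same indexing to a short list of made-up integer IDs instead of real tokens. `sliding_windows` is a hypothetical helper that mirrors the loop in `GPTDatasetV1.__init__`; each target chunk is the input chunk shifted one position to the right.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs using the same indexing as GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]  # shifted by one token
        pairs.append((input_chunk, target_chunk))
    return pairs

toy_ids = list(range(10, 20))  # ten made-up token IDs: 10, 11, ..., 19

# stride == max_length gives non-overlapping input chunks
for inputs, targets in sliding_windows(toy_ids, max_length=4, stride=4):
    print(inputs, "->", targets)
```

With `stride=1` the same list yields six overlapping pairs instead of two.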


In [11]:
# Split on characters (not tokens): the first 90% is training data, the rest validation
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]


torch.manual_seed(123)

train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)
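
Because `GPTDatasetV1` only emits a window while more than `max_length` tokens remain, a small validation split can produce an empty loader. A rough sketch of the arithmetic follows. `num_windows` is a hypothetical helper, and the token counts and context lengths are assumptions: the split above is made on characters, and the value of `GPT_CONFIG_124M["context_length"]` is not shown here.

```python
def num_windows(n_tokens, max_length, stride):
    """Number of samples produced by the loop range(0, n_tokens - max_length, stride)."""
    return len(range(0, n_tokens - max_length, stride))

# Assumed token counts: roughly 90% / 10% of the 5,145 tokens reported above.
approx_train_tokens = 4630
approx_val_tokens = 515

for context_length in (256, 1024):  # two plausible settings, not taken from the notebook
    train_samples = num_windows(approx_train_tokens, context_length, context_length)
    val_samples = num_windows(approx_val_tokens, context_length, context_length)
    print(f"context_length={context_length}: "
          f"{train_samples} train samples, {val_samples} val samples")
```

With `batch_size=2` and `drop_last=True`, a leftover single training sample is also discarded, so it may be worth printing `len(train_loader)` and `len(val_loader)` after building the loaders.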