# Building LLMs From Scratch (Part 3): Crafting the Data Pipeline

Welcome to Part 3 of our **"Building LLMs from Scratch"** series! 

In [Part 1](https://soloshun.medium.com/building-llms-from-scratch-part-1-the-complete-theoretical-foundation-e66b45b7f379), we built the theoretical foundation. In [Part 2](https://soloshun.medium.com/building-llms-from-scratch-part-2-the-power-of-tokenization), we learned how to convert raw text into tokens.

Now it's time to tackle a crucial step: **How do we feed this data to our model?**

## What We'll Learn Today

1. **The core concept**: Why LLMs need input-target pairs
2. **Key parameters**: Context size and stride, the building blocks of our data pipeline
3. **PyTorch implementation**: Build a custom `Dataset` and `DataLoader`
4. **See it in action**: Watch our pipeline create training batches

Let's dive in! 🚀

---

**📖 This notebook accompanies the Medium article:** [Building LLMs From Scratch (Part 3): Crafting the Data Pipeline]

**📂 Find the clean Python script version:** [src/part03_dataloader.py](../src/part03_dataloader.py)


## Chapter 1: The Core Task - Next Token Prediction

A GPT-style LLM has **one fundamental job**: Given a sequence of tokens, predict what token comes next.

To train the model to do this, we need to show it millions of examples where:
- **Input**: A sequence of tokens (e.g., `[40, 367, 2885, 1464]`)
- **Target**: The next token that should follow (e.g., `1807`)

But there's a catch: we don't want just one prediction per sequence. We want a prediction at **every position**, so a single sequence of `N` tokens supplies `N` training examples at once. This is much more efficient for training.

Let's see what this looks like:


In [None]:
# Let's start with a simple example
import tiktoken

# Load our tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Example text 
text = "I HAD always thought Jack Gisburn rather a cheap genius"
tokens = tokenizer.encode(text)

print(f"Text: '{text}'")
print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")


In [None]:
# Now let's see how we create input-target pairs
context_size = 8  # How many tokens to use as input

# Create input and target sequences
input_tokens = tokens[:context_size]
target_tokens = tokens[1:context_size+1]  # Shifted by one position

print("🎯 Input-Target Pair Example:")
print(f"Input:  {input_tokens}")
print(f"Target: {target_tokens}")
print()

# Let's see what this means for each prediction task
print("📝 Individual Prediction Tasks:")
for i in range(len(input_tokens)):
    context = input_tokens[:i+1]
    target = target_tokens[i]
    context_text = tokenizer.decode(context)
    target_text = tokenizer.decode([target])
    print(f"'{context_text}' → '{target_text}'")
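
The loop above lists eight separate prediction tasks that come from a single input-target pair. As a rough sketch of why this helps training, the snippet below scores all eight positions with one loss call. No model exists yet in this part, so the random `logits` tensor is only a stand-in for model output and the random `targets` are placeholder token IDs; both are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 50257   # size of the GPT-2 vocabulary
context_size = 8

# Stand-in for model output: one row of logits per input position.
logits = torch.randn(context_size, vocab_size)
# Placeholder targets: the "next token" ID expected at each position.
targets = torch.randint(0, vocab_size, (context_size,))

# A single call scores all 8 next-token predictions and averages them.
loss = F.cross_entropy(logits, targets)
# reduction="none" exposes the individual per-position losses.
per_position = F.cross_entropy(logits, targets, reduction="none")

print(per_position.shape)  # torch.Size([8])
print(torch.isclose(loss, per_position.mean()))
```

Each row of the target sequence contributes its own loss term, which is why the targets are stored as a full shifted sequence rather than a single token.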


## Chapter 2: Key Concepts - Context Size & Stride

Now that we understand the basic idea, let's explore the two crucial parameters that control our data pipeline.

### Context Size (max_length)
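The **context size** determines how many tokens the model looks at when making a prediction. It's like the model's "attention span." In the code it appears as `context_size` in the early examples and as `max_length` in the dataset class.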

### Stride  
The **stride** controls how we slide our window across the text to create the next chunk:
- **stride = 1**: Maximum overlap (consecutive windows share all but one token), the most training examples, and the most computation
- **stride = context_size**: No overlap between input windows, fewer examples, faster training

Let's visualize this:


In [None]:
# Let's demonstrate different stride values
def show_sliding_window(tokens, context_size, stride, max_examples=5):
    print(f"📊 Sliding Window: context_size={context_size}, stride={stride}")
    print("-" * 60)
    
    count = 0
    for i in range(0, len(tokens) - context_size, stride):
        if count >= max_examples:
            print("... (and more)")
            break
            
        input_chunk = tokens[i:i + context_size]
        target_chunk = tokens[i + 1:i + context_size + 1]
        
        print(f"Window {count+1}: Input={input_chunk}, Target={target_chunk}")
        count += 1
    
    total_chunks = len(range(0, len(tokens) - context_size, stride))
    print(f"📈 Total chunks created: {total_chunks}")
    print()

# Compare different stride values
show_sliding_window(tokens, context_size=4, stride=1)
show_sliding_window(tokens, context_size=4, stride=4)
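
The chunk totals printed above follow a simple closed form. As a minimal sketch (the helper names `count_windows` and `window_overlap` are hypothetical and not part of the original notebook, and the 13-token length is an assumed example), a sequence of `n` tokens yields `ceil((n - context_size) / stride)` windows, because every window needs one extra token on its right for the shifted target:

```python
import math

def count_windows(n_tokens: int, context_size: int, stride: int) -> int:
    """Number of (input, target) windows the sliding loop produces."""
    usable = n_tokens - context_size  # valid start indices are 0 .. usable - 1
    return max(0, math.ceil(usable / stride))

def window_overlap(context_size: int, stride: int) -> int:
    """Number of tokens shared by two consecutive input windows."""
    return max(0, context_size - stride)

# Cross-check the closed form against the loop bounds used in this notebook.
for n, ctx, s in [(13, 4, 1), (13, 4, 4), (13, 8, 2), (4, 4, 1)]:
    assert count_windows(n, ctx, s) == len(range(0, n - ctx, s))

print(count_windows(13, 4, 1), window_overlap(4, 1))  # 9 3
print(count_windows(13, 4, 4), window_overlap(4, 4))  # 3 0
```

The same formula is a quick way to estimate dataset size before tokenizing a large corpus, and it returns 0 when the text is not longer than the context size.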


## Chapter 3: Building the PyTorch Dataset

Now let's implement this logic using PyTorch's `Dataset` class. This class needs three methods:

1. `__init__`: Set up the data (tokenize and create all chunks)
2. `__len__`: Return the number of chunks
3. `__getitem__`: Return a specific chunk by index


In [1]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        
        # 1. Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        
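        # 2. Use a sliding window to create input-target chunks.
        #    Stopping at len(token_ids) - max_length leaves room for the
        #    one extra token that each shifted target chunk needs.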
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
            
    def __len__(self):
        # 3. Return the total number of chunks
        return len(self.input_ids)
       
    def __getitem__(self, idx):
        # 4. Return a single input-target pair
        return self.input_ids[idx], self.target_ids[idx]

print("✅ GPTDatasetV1 class created!")


✅ GPTDatasetV1 class created!
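
`GPTDatasetV1` builds every chunk up front, which is simple but stores each token again in every input and target chunk that covers it. One possible alternative, sketched here under the assumption that the token IDs are already available as a plain list, keeps a single tensor and slices it on demand. The class name `LazyGPTDataset` is hypothetical and is not part of this series' code.

```python
import torch
from torch.utils.data import Dataset

class LazyGPTDataset(Dataset):
    """Hypothetical variant: store tokens once, slice windows in __getitem__."""

    def __init__(self, token_ids, max_length, stride):
        self.tokens = torch.tensor(token_ids, dtype=torch.long)
        self.max_length = max_length
        # Same start positions as the loop in GPTDatasetV1.
        self.starts = range(0, len(token_ids) - max_length, stride)

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, idx):
        start = self.starts[idx]
        end = start + self.max_length
        # The target is the input window shifted right by one token.
        return self.tokens[start:end], self.tokens[start + 1:end + 1]

# Stand-in token IDs (0..12) instead of real tokenizer output.
lazy_ds = LazyGPTDataset(list(range(13)), max_length=4, stride=4)
x, y = lazy_ds[1]
print(len(lazy_ds), x.tolist(), y.tolist())  # 3 [4, 5, 6, 7] [5, 6, 7, 8]
```

For a short story the eager version is perfectly fine; the lazy form mainly matters when the stride is small and the corpus is large.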


## Chapter 4: Creating the DataLoader

The `DataLoader` takes our `Dataset` and handles:
- **Batching**: Groups multiple examples together
- **Shuffling**: Randomizes the order for better training
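- **Parallel processing**: Can load batches in background worker processes when `num_workers` is greater than 0 (with the default of 0, loading happens in the main process)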
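
To make the batching options concrete, here is a small sketch on a toy dataset. The `TensorDataset` of ten integers is a stand-in chosen for illustration, not the text data used in this notebook. It shows how `batch_size` and `drop_last` together determine the number of batches:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten toy samples, each a single integer.
toy_ds = TensorDataset(torch.arange(10))

keep_tail = DataLoader(toy_ds, batch_size=4, shuffle=False, drop_last=False)
drop_tail = DataLoader(toy_ds, batch_size=4, shuffle=False, drop_last=True)

print(len(keep_tail))  # 3 batches: sizes 4, 4, 2
print(len(drop_tail))  # 2 batches: the incomplete last batch is discarded

# Each batch is a sequence holding one tensor, since TensorDataset wraps one tensor.
batch_sizes = [batch[0].shape[0] for batch in keep_tail]
print(batch_sizes)  # [4, 4, 2]
```

Dropping the last batch keeps every batch the same shape, at the cost of skipping a few samples per epoch.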

Let's create a utility function to set this up:


In [None]:
def create_dataloader_v1(
    txt, 
    batch_size=4,
    max_length=256,
    stride=128,
    shuffle=True,
    drop_last=True,
    num_workers=0
):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")
    
    # Create the dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    
    # Create the dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )
    
    print(f"📦 DataLoader created:")
    print(f"   Dataset size: {len(dataset)} chunks")
    print(f"   Batch size: {batch_size}")
    print(f"   Number of batches: {len(dataloader)}")
    
    return dataloader

print("✅ create_dataloader_v1 function ready!")


## Chapter 5: Testing with Real Data

Let's load our text data and see our pipeline in action!


In [None]:
# Load the text data
try:
    with open("../data/the-verdict.txt", "r", encoding="utf-8") as f:
        raw_text = f.read()
    print(f"✅ Successfully loaded text! Total characters: {len(raw_text)}")
    print(f"First 100 characters: '{raw_text[:100]}'")
except FileNotFoundError:
    print("❌ File not found. Make sure 'the-verdict.txt' is in the '../data/' folder")
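    # For demonstration, fall back to a short sample text.
    # Note: this sample has far fewer than 256 tokens, so the
    # max_length=256 dataloader created later will have no chunks to serve.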
    raw_text = """I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that he had been caught."""
    print(f"Using sample text instead: '{raw_text[:80]}...'")


In [None]:
# Create a small dataloader for easy inspection
print("🔬 Creating a small dataloader for inspection...")
small_dataloader = create_dataloader_v1(
    raw_text, 
    batch_size=8, 
    max_length=4, 
    stride=4, 
    shuffle=False  # Keep order for easier understanding
)

# Get the first batch
data_iter = iter(small_dataloader)
inputs, targets = next(data_iter)

print(f"\n🎯 First Batch:")
print(f"Inputs shape:  {inputs.shape}")
print(f"Targets shape: {targets.shape}")
print(f"\nInputs:\n{inputs}")
print(f"\nTargets:\n{targets}")


In [None]:
# Let's decode a few examples to see the actual text
print("📖 Decoded Examples:")
print("-" * 50)

for i in range(min(3, inputs.shape[0])):  # Show first 3 examples
    input_text = tokenizer.decode(inputs[i].tolist())
    target_text = tokenizer.decode(targets[i].tolist())
    
    print(f"Example {i+1}:")
    print(f"  Input:  '{input_text}'")
    print(f"  Target: '{target_text}'")
    print()


In [None]:
# Now let's create a more realistic dataloader
print("🚀 Creating a realistic dataloader for training...")
training_dataloader = create_dataloader_v1(
    raw_text,
    batch_size=4,
    max_length=256,
    stride=128,
    shuffle=True
)

# Show what a training batch looks like
data_iter = iter(training_dataloader)
inputs, targets = next(data_iter)

print(f"\n🎯 Training Batch:")
print(f"Inputs shape:  {inputs.shape}")
print(f"Targets shape: {targets.shape}")
print(f"Batch size: {inputs.shape[0]}")
print(f"Sequence length: {inputs.shape[1]}")

# Show a snippet of the first example
print(f"\nFirst example (first 20 tokens):")
print(f"Input:  {inputs[0][:20].tolist()}")
print(f"Target: {targets[0][:20].tolist()}")
print(f"Text:   '{tokenizer.decode(inputs[0][:20].tolist())}'...")
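
Because `shuffle=True` draws a new random order on every run, the batch printed above will differ between executions. If reproducibility matters, one option is to pass a seeded `torch.Generator` to the `DataLoader`. This is a sketch of that idea, not something `create_dataloader_v1` does; the toy dataset and the seed value `123` are arbitrary assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_ds = TensorDataset(torch.arange(20))

def first_batch(seed):
    # A fresh generator with a fixed seed reproduces the same shuffle order.
    gen = torch.Generator().manual_seed(seed)
    loader = DataLoader(toy_ds, batch_size=4, shuffle=True, generator=gen)
    return next(iter(loader))[0].tolist()

same_seed_matches = first_batch(123) == first_batch(123)
print(same_seed_matches)  # True
```

A fixed seed is handy while debugging the pipeline, since the same batch shows up on every run.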


## 🎉 Summary & What We've Accomplished

Congratulations! You've just built a complete data pipeline for training LLMs. Here's what we've achieved:

### Key Takeaways:

1. **Understanding the Task**: LLMs learn by predicting the next token at every position in a sequence
2. **Critical Parameters**: 
   - **Context Size**: Controls how much text the model sees at once
   - **Stride**: Controls overlap between training examples
3. **PyTorch Implementation**: We built a custom `Dataset` and `DataLoader` that can handle any text data
4. **Efficiency**: Our pipeline automatically turns raw text into training examples, one per window, so the number of examples depends on the text length and the stride

### The Magic ✨

Look at what we've created: our `DataLoader` takes raw text and automatically generates perfectly formatted training batches where:
- Each input sequence contains `max_length` tokens
- Each target sequence is the input shifted by one position
- The model will learn to predict the next token at every position
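
These properties are easy to check mechanically. The sketch below uses stand-in integer token IDs and a hypothetical helper, `make_windows`, that mirrors the sliding-window loop from `GPTDatasetV1` in plain Python. It then asserts that every target is its input shifted by one position.

```python
def make_windows(token_ids, max_length, stride):
    """Plain-Python mirror of the sliding-window loop (hypothetical helper)."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length],
                      token_ids[i + 1:i + max_length + 1]))
    return pairs

token_ids = list(range(100, 120))  # 20 unique stand-in token IDs
pairs = make_windows(token_ids, max_length=4, stride=2)

for inp, tgt in pairs:
    assert len(inp) == len(tgt) == 4  # every sequence has max_length tokens
    assert inp[1:] == tgt[:-1]        # target is the input shifted by one
    # The final target is the token that follows the input window.
    assert tgt[-1] == token_ids[token_ids.index(inp[-1]) + 1]

print(len(pairs), pairs[0])
```

Checks like these are cheap to run whenever the windowing logic changes.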

### What's Next?

Now that we can feed data to our model, we need to build the model itself!

**In Part 4**, we'll tackle **embeddings**—how we convert our token IDs into meaningful vectors that capture semantic relationships. This is where the real magic of understanding begins.

---

🔗 **Find the complete code:**
- **This notebook**: `notebooks/part03_dataloader.ipynb`
- **Python script**: `src/part03_dataloader.py`  
- **GitHub repository**: [llm-from-scratch](https://github.com/soloeinsteinmit/llm-from-scratch)

📝 **Read the full article**: [Building LLMs From Scratch (Part 3): Crafting the Data Pipeline]

Happy coding! 🚀
