# Building LLMs From Scratch (Part 5): The Complete Data Preprocessing Pipeline

Welcome to Part 5! This notebook demonstrates the complete end-to-end data preprocessing pipeline we've built over the previous parts. We'll see how tokenization, data loading, and embeddings work together to transform raw text into model-ready tensors.

### 🔗 Quick Links
- **Medium Article**: [Part 5: The Complete Data Preprocessing Pipeline](https://soloshun.medium.com/link-to-part-5)
- **GitHub Repository**: [llm-from-scratch](https://github.com/soloeinsteinmit/llm-from-scratch)

### 📋 What We'll Cover
1. **Step 1**: Loading the raw text
2. **Step 2**: Tokenization with BPE (Byte Pair Encoding)
3. **Step 3**: Creating input-target pairs with the DataLoader
4. **Step 4**: Token and positional embeddings
5. **Step 5**: Complete pipeline integration


## Setup and Imports

Let's import all the necessary libraries and our custom modules from previous parts.


In [None]:
import torch
import torch.nn as nn
import tiktoken
import sys

# Add the parent directory to the system path to allow imports
sys.path.insert(0, '../')

from src.part03_dataloader import create_dataloader_v1
from src.part04_embeddings import GPTEmbedding

print("✅ All imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"Device available: {'CUDA' if torch.cuda.is_available() else 'CPU'}")


## Step 1: Load Raw Text Data

First, let's load our text data. We'll use the text of "The Verdict" (`the-verdict.txt`), which we've been working with throughout the series.


In [None]:
# Load the raw text data
with open("../data/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("📖 Loaded text data:")
print(f"Total characters: {len(raw_text):,}")
print(f"First 100 characters: '{raw_text[:100]}'")
print(f"Last 100 characters: '{raw_text[-100:]}'")

# Show a sample of the text structure
lines = raw_text.split('\n')
print("\n📊 Text structure:")
print(f"Total lines: {len(lines)}")
print(f"Average line length: {sum(len(line) for line in lines) / len(lines):.1f} characters")


## Step 2: Tokenization with Byte Pair Encoding (BPE)

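Now we'll tokenize our text using OpenAI's GPT-2 tokenizer, which uses Byte Pair Encoding. The original GPT-3 models use this same vocabulary, while later models such as GPT-4 use different, larger BPE encodings (for example `cl100k_base` in `tiktoken`).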
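
To build some intuition for what "Byte Pair Encoding" means, here is a minimal character-level sketch of the core idea: repeatedly merge the most frequent adjacent pair of symbols. This is a toy, not how `tiktoken` is implemented. The real GPT-2 tokenizer works on bytes and applies a fixed merge table learned in advance, and the function names below (`most_frequent_pair`, `merge_pair`, `toy_bpe`) are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens


# After two merges the frequent substring "low" has become a single symbol.
print(toy_bpe("low lower lowest", num_merges=2))
```

Joining the returned symbols always gives back the original string, which is the same lossless round trip that `tokenizer.encode` and `tokenizer.decode` provide.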


In [None]:
# Initialize the GPT-2 tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Tokenize the entire text
tokenized_text = tokenizer.encode(raw_text)

print("🔢 Tokenization Results:")
print(f"Original text length: {len(raw_text):,} characters")
print(f"Tokenized length: {len(tokenized_text):,} tokens")
print(f"Compression ratio: {len(raw_text) / len(tokenized_text):.2f} chars/token")
print(f"Vocabulary size: {tokenizer.n_vocab:,}")

# Show some example tokens
print(f"\n🔍 First 20 tokens: {tokenized_text[:20]}")
print(f"Decoded: '{tokenizer.decode(tokenized_text[:20])}'")

# Show individual token examples
sample_tokens = tokenized_text[10:15]
print("\n📝 Token breakdown:")
for i, token_id in enumerate(sample_tokens):
    token_text = tokenizer.decode([token_id])
    print(f"  Token {i}: ID={token_id:5d} → '{token_text}'")


## Step 3: Create Input-Target Pairs with DataLoader

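Now we'll use our custom DataLoader from Part 3 to create training examples. `create_dataloader_v1` takes the raw text and tokenizes it internally, so we pass `raw_text` rather than the `tokenized_text` from Step 2. It then builds input-target pairs with a sliding window: each window of `max_length` tokens is an input, and the same window shifted one token to the right is its target.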
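
The sliding window itself can be sketched in a few lines of plain Python. This is only an illustration of the idea under simple assumptions: `sliding_window_pairs` is a hypothetical helper, it works on a plain list of integers instead of real tokenizer output, and the actual `create_dataloader_v1` from Part 3 may differ in its details.

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Return (input, target) windows; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that the shifted target window still fits inside token_ids.
    for start in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[start : start + max_length]
        target_chunk = token_ids[start + 1 : start + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


token_ids = list(range(10))  # stand-in IDs 0..9 instead of real tokenizer output
for input_chunk, target_chunk in sliding_window_pairs(token_ids, max_length=4, stride=4):
    print(input_chunk, "->", target_chunk)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

With `stride` equal to `max_length` the input windows do not overlap. A smaller stride, such as 1, produces more examples from the same text at the cost of heavy overlap between them.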


In [None]:
# Define hyperparameters
BATCH_SIZE = 8
CONTEXT_SIZE = 4  # Small for demonstration
STRIDE = CONTEXT_SIZE  # No overlap for this demo

# Create the DataLoader
dataloader = create_dataloader_v1(
    raw_text,
    batch_size=BATCH_SIZE,
    max_length=CONTEXT_SIZE,
    stride=STRIDE,
    shuffle=False,
    drop_last=True
)

print("📊 DataLoader Configuration:")
print(f"Batch size: {BATCH_SIZE}")
print(f"Context size: {CONTEXT_SIZE}")
print(f"Stride: {STRIDE}")
print(f"Total batches: {len(dataloader)}")
print(f"Total examples: {len(dataloader.dataset)}")

# Get one batch of data
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("\n🎯 Sample Batch:")
print(f"Inputs shape: {inputs.shape}")
print(f"Targets shape: {targets.shape}")
print(f"\nInputs (token IDs):\n{inputs}")
print(f"\nTargets (token IDs):\n{targets}")

# Show the relationship between inputs and targets
print("\n🔍 Input-Target Relationship (first example):")
input_tokens = inputs[0].tolist()
target_tokens = targets[0].tolist()
print(f"Input:  {input_tokens}")
print(f"Target: {target_tokens}")
print("Notice: Target is Input shifted by 1 position →")
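
The `Total batches` value printed above depends on `drop_last`. A PyTorch `DataLoader` over a fixed-size dataset yields `len(dataset) // batch_size` batches when `drop_last=True` and rounds up otherwise. The small sketch below reproduces that arithmetic. `num_batches` is a hypothetical helper and the dataset size of 1001 is an arbitrary number chosen to make the difference visible, not the size of our dataset.

```python
import math


def num_batches(num_examples, batch_size, drop_last):
    """Number of batches a loader yields for a dataset of `num_examples` items."""
    if drop_last:
        # The incomplete final batch is discarded.
        return num_examples // batch_size
    # The final batch is kept even if it is smaller than batch_size.
    return math.ceil(num_examples / batch_size)


# Hypothetical dataset size, chosen only to show the difference
print(num_batches(1001, batch_size=8, drop_last=True))   # 125
print(num_batches(1001, batch_size=8, drop_last=False))  # 126
```

Dropping the last batch keeps every batch the same shape, which is convenient during training.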


## Step 4: Token & Positional Embeddings

Now we'll convert our token IDs into dense vectors using our GPTEmbedding module from Part 4. This combines both token embeddings (semantic meaning) and positional embeddings (sequence order).
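
Conceptually, this is two table lookups followed by an addition. The dependency-free sketch below shows that idea with plain Python lists. It assumes `GPTEmbedding` follows the usual GPT-style scheme of adding a position vector to each token vector. The real module works on PyTorch tensors with trainable weights, and the helper names `make_table` and `embed` are made up for illustration.

```python
import random


def make_table(num_rows, emb_dim, rng):
    """Build a lookup table of small random vectors, one row per index."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(emb_dim)] for _ in range(num_rows)]


def embed(token_ids, token_table, pos_table):
    """Look up each token's vector and add the vector for its position."""
    return [
        [t + p for t, p in zip(token_table[token_id], pos_table[position])]
        for position, token_id in enumerate(token_ids)
    ]


rng = random.Random(123)
vocab_size, context_size, emb_dim = 10, 4, 3  # tiny sizes for readability
token_table = make_table(vocab_size, emb_dim, rng)   # one row per token ID
pos_table = make_table(context_size, emb_dim, rng)   # one row per position

vectors = embed([7, 2, 7, 5], token_table, pos_table)
print(len(vectors), len(vectors[0]))  # sequence length, embedding dimension
print(vectors[0] == vectors[2])       # same token ID (7) at two different positions
```

Token ID 7 appears at positions 0 and 2, yet the two resulting vectors differ because a different position vector was added to each. That is how the model can tell identical tokens apart by where they occur.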


In [None]:
# Define embedding parameters
VOCAB_SIZE = 50257  # GPT-2 vocabulary size
EMB_DIM = 256       # Embedding dimension

# Initialize the embedding layer
torch.manual_seed(123)  # For reproducible results
embedding_layer = GPTEmbedding(VOCAB_SIZE, EMB_DIM, CONTEXT_SIZE)

print("🧠 Embedding Layer Configuration:")
print(f"Vocabulary size: {VOCAB_SIZE:,}")
print(f"Embedding dimension: {EMB_DIM}")
print(f"Context size: {CONTEXT_SIZE}")

# Calculate total parameters
total_params = sum(p.numel() for p in embedding_layer.parameters())
token_params = VOCAB_SIZE * EMB_DIM
pos_params = CONTEXT_SIZE * EMB_DIM

print("\n📊 Parameter Count:")
print(f"Token embedding parameters: {token_params:,}")
print(f"Positional embedding parameters: {pos_params:,}")
print(f"Total parameters: {total_params:,}")

# Convert token IDs to embeddings
model_ready_inputs = embedding_layer(inputs)

print("\n✨ Embedding Results:")
print(f"Input shape (token IDs): {inputs.shape}")
print(f"Output shape (embeddings): {model_ready_inputs.shape}")
print(f"Each token ID → {EMB_DIM}-dimensional vector with positional info")

# Show the transformation for one example
print("\n🔍 Transformation Example (first sample):")
print(f"Token IDs: {inputs[0].tolist()}")
print(f"Embedding shape: {model_ready_inputs[0].shape}")
print(f"First embedding vector (first 10 values): {model_ready_inputs[0][0][:10]}")


## Step 5: Complete Pipeline Integration

Let's now put everything together into a single, streamlined function that demonstrates the complete preprocessing pipeline from raw text to model-ready tensors.


In [None]:
def complete_preprocessing_pipeline(raw_text, batch_size=8, context_size=4, emb_dim=256):
    """
    Complete data preprocessing pipeline from raw text to model-ready tensors.
    
    Args:
        raw_text: Raw input text
        batch_size: Number of examples per batch
        context_size: Length of each input sequence
        emb_dim: Embedding dimension
    
    Returns:
        model_ready_inputs: Tensor ready for transformer model [batch_size, context_size, emb_dim]
        targets: Target tokens for training [batch_size, context_size]
    """
    print("🚀 Running Complete Preprocessing Pipeline")
    print("=" * 50)
    
    # Step 1: Create DataLoader (handles tokenization + input-target pairs)
    dataloader = create_dataloader_v1(
        raw_text,
        batch_size=batch_size,
        max_length=context_size,
        stride=context_size,
        shuffle=False
    )
    print(f"✅ Step 1: DataLoader created ({len(dataloader)} batches)")
    
    # Step 2: Initialize embedding layer
    embedding_layer = GPTEmbedding(50257, emb_dim, context_size)
    print(f"✅ Step 2: Embedding layer initialized ({sum(p.numel() for p in embedding_layer.parameters()):,} params)")
    
    # Step 3: Get one batch and process it
    data_iter = iter(dataloader)
    inputs, targets = next(data_iter)
    print(f"✅ Step 3: Batch loaded {inputs.shape}")
    
    # Step 4: Convert to embeddings
    model_ready_inputs = embedding_layer(inputs)
    print(f"✅ Step 4: Embeddings created {model_ready_inputs.shape}")
    
    print("=" * 50)
    print("🎉 Pipeline Complete!")
    
    return model_ready_inputs, targets

# Run the complete pipeline
final_inputs, final_targets = complete_preprocessing_pipeline(
    raw_text, 
    batch_size=BATCH_SIZE, 
    context_size=CONTEXT_SIZE, 
    emb_dim=EMB_DIM
)

print("\n📊 Final Results:")
print(f"Model-ready inputs: {final_inputs.shape}")
print(f"Training targets: {final_targets.shape}")
print("Ready for transformer model! 🚀")


## 🎯 Summary

We have successfully built and demonstrated the complete data preprocessing pipeline for our LLM! 

### What We Accomplished:
- ✅ **Tokenization**: Used BPE (GPT-2 tokenizer) to convert text to token IDs
- ✅ **Data Loading**: Created input-target pairs with a sliding window approach
- ✅ **Embeddings**: Combined token and positional embeddings for rich representations
- ✅ **Batching**: Organized data for efficient training

### The Journey:
```
Raw Text → Tokenization → Input/Target Pairs → Embeddings → Model-Ready Tensors
```

### Key Takeaways:
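- **BPE tokenization** can encode any text, including words it has never seen, with a fixed vocabulary of 50,257 tokens
- **Sliding window** turns a single text into many training examples, with `stride` controlling how much they overlap
- **Embeddings** map arbitrary integer IDs to dense, trainable vectors that carry both token identity and position; they start out random and acquire meaning during training
- **Modular design** makes each component reusable and testable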

### What's Next:
In **Part 6**, we'll build the **self-attention mechanism**: the heart of the transformer, which will process these embeddings and learn to understand language!
