![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

# Module 2: Data Engineering for Context

**⏱️ Time:** 20 minutes

## 🎯 Learning Objectives

By the end of this module, you will:

1. **Understand** the data pipeline for LLM consumption
2. **Know** when to chunk documents (and when NOT to)
3. **Apply** transformation techniques for token optimization
4. **See** the impact of context engineering (91% token reduction)

---

## 📚 Part 1: The Data Pipeline (10 min)

### From Raw Data to LLM-Ready Context

Before you can use RAG effectively, you need to **prepare your data** for LLM consumption:

```
Raw Data → Extract → Clean → Transform → Optimize → Store
```

| Stage | What Happens | Example |
|-------|--------------|--------|
| **Extract** | Get data from sources | Pull course catalog from database |
| **Clean** | Remove noise | Strip `id`, timestamps, internal fields |
| **Transform** | Convert format | JSON → Natural text |
| **Optimize** | Reduce tokens | Summaries + details (progressive disclosure) |
| **Store** | Index for retrieval | Redis Vector DB with embeddings |

### Why This Matters

Raw data often contains:
- **Noise fields**: IDs, timestamps, internal metadata
- **Verbose formats**: Nested JSON, XML tags
- **Redundant information**: Repeated headers, boilerplate

All of this consumes precious tokens without adding value to the LLM's understanding.

In [1]:
# Setup
import os
import sys
import json
from pathlib import Path

repo_root = Path.cwd().parent
src_path = repo_root / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

from dotenv import load_dotenv
load_dotenv()
load_dotenv(repo_root / ".env")

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens in text for a given model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

print("✅ Setup complete!")

✅ Setup complete!


### Example: Raw vs Cleaned Data

In [2]:
# RAW DATA - What comes from the database
raw_course = {
    "id": "course_abc123",
    "created_at": "2024-01-15T10:30:00Z",
    "updated_at": "2024-01-20T14:22:00Z",
    "enrollment_capacity": 50,
    "current_enrollment": 0,
    "course_code": "CS301",
    "title": "Introduction to Machine Learning",
    "department": "Computer Science",
    "credits": 4,
    "difficulty_level": "intermediate",
    "format": "online",
    "instructor": "Dr. Smith",
    "description": "Comprehensive introduction to machine learning algorithms.",
    "syllabus": [
        {"week": 1, "topic": "Introduction to ML", "readings": ["Chapter 1"], "assignments": ["HW1"]},
        {"week": 2, "topic": "Supervised Learning", "readings": ["Chapter 2"], "assignments": ["HW2"]},
    ],
    "assignments": [{"title": "HW1", "points": 100}, {"title": "HW2", "points": 100}],
    "grading_policy": {"homework": 40, "midterm": 30, "final": 30}
}

raw_text = json.dumps(raw_course, indent=2)
print(f"Raw JSON: {count_tokens(raw_text)} tokens")
print(raw_text[:500] + "...")

Raw JSON: 323 tokens
{
  "id": "course_abc123",
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-20T14:22:00Z",
  "enrollment_capacity": 50,
  "current_enrollment": 0,
  "course_code": "CS301",
  "title": "Introduction to Machine Learning",
  "department": "Computer Science",
  "credits": 4,
  "difficulty_level": "intermediate",
  "format": "online",
  "instructor": "Dr. Smith",
  "description": "Comprehensive introduction to machine learning algorithms.",
  "syllabus": [
    {
      "week": 1,
     ...


In [3]:
# CLEANED & TRANSFORMED - LLM-ready format
clean_course = """CS301: Introduction to Machine Learning
Department: Computer Science | Credits: 4 | Level: Intermediate | Format: Online
Instructor: Dr. Smith
Description: Comprehensive introduction to machine learning algorithms.
Topics: Introduction to ML, Supervised Learning, Neural Networks, Model Evaluation
"""

print(f"Cleaned text: {count_tokens(clean_course)} tokens")
print(clean_course)

Cleaned text: 57 tokens
CS301: Introduction to Machine Learning
Department: Computer Science | Credits: 4 | Level: Intermediate | Format: Online
Instructor: Dr. Smith
Description: Comprehensive introduction to machine learning algorithms.
Topics: Introduction to ML, Supervised Learning, Neural Networks, Model Evaluation



In [4]:
# Compare
raw_tokens = count_tokens(raw_text)
clean_tokens = count_tokens(clean_course)
reduction = (1 - clean_tokens / raw_tokens) * 100

print(f"\n📊 Token Reduction: {reduction:.0f}%")
print(f"   Raw: {raw_tokens} tokens → Clean: {clean_tokens} tokens")


📊 Token Reduction: 82%
   Raw: 323 tokens → Clean: 57 tokens
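
The cleaned text in the cell above was written by hand. As a rough sketch of how the Clean and Transform stages could be automated, the snippet below drops an assumed set of noise fields and flattens what remains into compact text. The names `NOISE_FIELDS`, `clean_record`, and `to_text` are hypothetical, and the field list is an assumption about this catalog's schema rather than part of the course code.

```python
from typing import Any, Dict

# Assumption: these fields carry no meaning for the LLM in this catalog.
NOISE_FIELDS = {"id", "created_at", "updated_at",
                "enrollment_capacity", "current_enrollment"}


def clean_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Clean stage: drop fields that cost tokens but add no signal."""
    return {k: v for k, v in record.items() if k not in NOISE_FIELDS}


def to_text(record: Dict[str, Any]) -> str:
    """Transform stage: flatten a cleaned course record into natural text."""
    lines = [
        f"{record['course_code']}: {record['title']}",
        f"Department: {record['department']} | Credits: {record['credits']} | "
        f"Level: {record['difficulty_level'].capitalize()} | "
        f"Format: {record['format'].capitalize()}",
        f"Instructor: {record['instructor']}",
        f"Description: {record['description']}",
    ]
    # Keep only the topic names from the syllabus; weekly details are dropped.
    topics = ", ".join(week["topic"] for week in record.get("syllabus", []))
    if topics:
        lines.append(f"Topics: {topics}")
    return "\n".join(lines)


# A trimmed copy of the raw record used earlier in this module.
sample = {
    "id": "course_abc123",
    "created_at": "2024-01-15T10:30:00Z",
    "course_code": "CS301",
    "title": "Introduction to Machine Learning",
    "department": "Computer Science",
    "credits": 4,
    "difficulty_level": "intermediate",
    "format": "online",
    "instructor": "Dr. Smith",
    "description": "Comprehensive introduction to machine learning algorithms.",
    "syllabus": [
        {"week": 1, "topic": "Introduction to ML"},
        {"week": 2, "topic": "Supervised Learning"},
    ],
}

text = to_text(clean_record(sample))
print(text)
```

Because the topics are derived from the record itself, the generated text can only mention what the source data contains, which keeps the transformed context faithful to the database.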


---

## 📚 Part 2: The Chunking Decision (5 min)

### Chunking is a Design Choice, Not a Default

A common question: **"Do I need to chunk my documents?"**

**The answer: It depends on your data, application, and retrieval needs!**

| Data Type | Characteristics | Chunk?       | Why |
|-----------|-----------------|--------------|-----|
| Course catalog | Self-contained records | ❌ **No**     | Each course is a natural retrieval unit |
| Product listings | Natural boundaries | ❌ **No**     | Splitting would break context |
| FAQ entries | Atomic Q&A pairs | ❌ **No**     | Question + answer must stay together |
| Research papers | Multiple distinct sections | ✅ **Maybe**  | May need section-level retrieval |
| Legal contracts | Nested clauses | ✅ **Likely** | May need clause-level retrieval |
| Books/chapters | Long-form content | ✅ **Likely** | Topic-based retrieval helps |

### Decision Framework: Ask These Questions

```
┌─────────────────────────────────────────────┐
│ Is each item semantically complete?         │
│   YES → Don't chunk (preserve boundaries)   │
│   NO  ↓                                     │
├─────────────────────────────────────────────┤
│ Do users need to find specific sections?    │
│   NO  → Don't chunk (simpler is better)     │
│   YES ↓                                     │
├─────────────────────────────────────────────┤
│ Are there multiple distinct topics?         │
│   NO  → Don't chunk (one topic = one unit)  │
│   YES → Consider chunking strategies        │
└─────────────────────────────────────────────┘
```
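
The three questions above can be written down as a small helper, which is handy when you want to record the decision for each data source in a pipeline config or a review. This is only an illustrative sketch: `should_chunk` and its parameter names are hypothetical, and the answers to the three questions still have to come from your own judgment about the data.

```python
from typing import Tuple


def should_chunk(semantically_complete: bool,
                 needs_section_lookup: bool,
                 multiple_topics: bool) -> Tuple[bool, str]:
    """Walk the decision framework top to bottom and return (chunk?, reason)."""
    if semantically_complete:
        return False, "preserve boundaries"
    if not needs_section_lookup:
        return False, "simpler is better"
    if not multiple_topics:
        return False, "one topic = one unit"
    return True, "consider chunking strategies"


# Course catalog: each record is already a complete retrieval unit.
print(should_chunk(True, False, False))

# Long research paper: incomplete on its own, section lookups, many topics.
print(should_chunk(False, True, True))
```

Note that a `True` result only means chunking is worth evaluating; it does not pick a strategy or a chunk size for you.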

⚠️ **Warning:** Over-chunking can **hurt** retrieval quality by splitting related information! Sometimes your data is already at the right granularity.

### Research Context: Why This Matters

Research shows that **how you structure context matters more than fitting everything in**:

- **"Lost in the Middle"** (Stanford, 2023): LLMs show a "U-shaped" performance curve over long contexts. They use information near the beginning and end of the context well, but recall drops for information in the middle. ([arXiv:2307.03172](https://arxiv.org/abs/2307.03172))

- **"Context Rot"** (Chroma, 2025): Irrelevant content actively degrades model performance. Even 4 distractor documents hurt output quality. ([research.trychroma.com/context-rot](https://research.trychroma.com/context-rot))

**What this means for chunking:**

These research findings don't prescribe a universal chunking rule; they inform your design decisions:

- **Structured records** (courses, products, FAQs): The "lost in the middle" problem typically doesn't apply because each record is already a focused, atomic unit. However, if your records are unusually large or contain multiple distinct topics, chunking may still help.

- **Long-form documents**: Context rot and positional bias become more relevant as document length increases, but the degree depends on your specific content, query patterns, and model capabilities. Chunking can help surface relevant sections, but over-chunking fragments context.

- **Mixed content types**: Real-world data rarely fits neat categories. A research paper with embedded tables, a product listing with extensive reviews, or a FAQ with nested sub-questions all require case-by-case judgment.

The research provides **mental models for reasoning about trade-offs**, not binary rules. Experiment with your actual data and queries.
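
When long-form content does call for chunking, one simple strategy is to split along the document's own section boundaries, so each chunk stays a coherent unit instead of an arbitrary window of text. The sketch below assumes Markdown-style `#` headings; `split_by_headings` is a hypothetical helper written for this example, not a library function, and real documents usually need format-specific parsing.

```python
import re
from typing import Dict, List


def split_by_headings(document: str) -> List[Dict[str, str]]:
    """Return one chunk per heading section, keeping the heading as the title."""
    chunks: List[Dict[str, str]] = []
    title, body = "Preamble", []

    def flush() -> None:
        # Only emit a chunk if the section has non-blank content.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"title": title, "text": text})

    for line in document.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            flush()  # close the previous section before starting a new one
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()  # close the final section
    return chunks


paper = """# Abstract
We study retrieval.

# Methods
We index sections.

# Results
Section retrieval helps.
"""

chunks = split_by_headings(paper)
for chunk in chunks:
    print(f"{chunk['title']}: {chunk['text']}")
```

Storing the section title alongside each chunk keeps some of the surrounding context that splitting would otherwise discard.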

> 💡 For our course catalog, we use **whole-record embedding**: each course is already a natural retrieval unit.

---

## 📚 Part 3: Context Optimization Demo (5 min)

### Stage 1 vs Stage 2: The Impact

Let's see the real-world impact of context engineering on our course advisor.

In [5]:
# Stage 1 (Baseline RAG) - Returns EVERYTHING
stage1_context = {
    "query": "machine learning courses",
    "results": [
        {
            "id": "course_abc123",
            "created_at": "2024-01-15T10:30:00Z",
            "updated_at": "2024-01-20T14:22:00Z",
            "course_code": "CS301",
            "title": "Introduction to Machine Learning",
            "department": "Computer Science",
            "credits": 4,
            "difficulty_level": "intermediate",
            "format": "online",
            "instructor": "Dr. Smith",
            "description": "Comprehensive introduction to machine learning algorithms.",
            "syllabus": [
                {"week": 1, "topic": "Introduction to ML"},
                {"week": 2, "topic": "Supervised Learning"},
                {"week": 3, "topic": "Neural Networks"},
            ],
            "grading_policy": {"homework": 40, "midterm": 30, "final": 30}
        }
        # Imagine 4 more courses with full details...
    ]
}

stage1_text = json.dumps(stage1_context, indent=2)
stage1_per_course = count_tokens(stage1_text)
print(f"Stage 1 (per course): {stage1_per_course} tokens")
print(f"Stage 1 (5 courses):  ~{stage1_per_course * 5} tokens")

Stage 1 (per course): 245 tokens
Stage 1 (5 courses):  ~1225 tokens


In [6]:
# Stage 2 (Context-Engineered) - Clean, optimized
stage2_context = """Found 5 machine learning courses:

1. CS301: Introduction to Machine Learning
   Dept: CS | Credits: 4 | Level: Intermediate | Format: Online
   Prereqs: CS201 | Instructor: Dr. Smith

2. CS401: Deep Learning
   Dept: CS | Credits: 4 | Level: Advanced | Format: Hybrid
   Prereqs: CS301, MATH201 | Instructor: Dr. Johnson

3. CS402: Natural Language Processing
   Dept: CS | Credits: 3 | Level: Advanced | Format: Online
   Prereqs: CS301 | Instructor: Dr. Lee

4. DS301: Machine Learning for Data Science
   Dept: Data Science | Credits: 4 | Level: Intermediate | Format: In-person
   Prereqs: STAT201 | Instructor: Dr. Garcia

5. CS403: Computer Vision
   Dept: CS | Credits: 3 | Level: Advanced | Format: Hybrid
   Prereqs: CS301, MATH301 | Instructor: Dr. Wang
"""

stage2_tokens = count_tokens(stage2_context)
print(f"Stage 2 (5 courses): {stage2_tokens} tokens")

Stage 2 (5 courses): 223 tokens


In [7]:
# The Impact
stage1_total = 6133  # Measured from actual Stage 1 output
stage2_total = 539   # Measured from actual Stage 2 output

reduction = (1 - stage2_total / stage1_total) * 100

print("="*50)
print("CONTEXT ENGINEERING IMPACT")
print("="*50)
print(f"Stage 1 (Baseline):    {stage1_total:,} tokens")
print(f"Stage 2 (Engineered):  {stage2_total:,} tokens")
print(f"Reduction:             {reduction:.0f}%")
print("="*50)
print(f"\n💰 Cost Savings (per 1000 queries @ $0.01/1K tokens):")
print(f"   Stage 1: ${stage1_total * 1000 / 1000 * 0.01:.2f}")
print(f"   Stage 2: ${stage2_total * 1000 / 1000 * 0.01:.2f}")
print(f"   Savings: ${(stage1_total - stage2_total) * 1000 / 1000 * 0.01:.2f}")

CONTEXT ENGINEERING IMPACT
Stage 1 (Baseline):    6,133 tokens
Stage 2 (Engineered):  539 tokens
Reduction:             91%

💰 Cost Savings (per 1000 queries @ $0.01/1K tokens):
   Stage 1: $61.33
   Stage 2: $5.39
   Savings: $55.94
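
The cost arithmetic above is `tokens per query × number of queries ÷ 1000 × price per 1K tokens`. As a small sketch, it can be wrapped in a function so you can plug in your own traffic and pricing. `query_cost` and its parameter names are hypothetical, and the default price of $0.01 per 1K tokens is simply the illustrative rate used in this module, not a quote for any particular model.

```python
def query_cost(tokens_per_query: int,
               num_queries: int = 1000,
               price_per_1k_tokens: float = 0.01) -> float:
    """Input-token cost in dollars for a batch of queries."""
    total_tokens = tokens_per_query * num_queries
    return total_tokens / 1000 * price_per_1k_tokens


stage1_cost = query_cost(6133)  # baseline context size from this module
stage2_cost = query_cost(539)   # engineered context size from this module

print(f"Stage 1: ${stage1_cost:.2f}")
print(f"Stage 2: ${stage2_cost:.2f}")
print(f"Savings: ${stage1_cost - stage2_cost:.2f}")
```

Changing `num_queries` shows how the per-query difference scales with traffic, which is usually where the savings become significant.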


### Techniques Applied

| Technique | What It Does | Token Savings |
|-----------|--------------|---------------|
| **Cleaning** | Remove noise fields (id, timestamps) | ~150-200/course |
| **Transformation** | JSON → natural text | ~50-100/course |
| **Optimization** | Summaries, remove redundancy | ~200-500/course |

**Key Insight:** Context engineering isn't about losing information; it's about removing *noise* while preserving *signal*.

---

## 🎯 Key Takeaways

1. **Data pipeline matters**: Extract → Clean → Transform → Optimize → Store
2. **Chunking is a design choice**: Consider your data type and retrieval needs
3. **"Don't chunk" is valid**: For structured data like course catalogs, whole-record embedding often works best
4. **Over-chunking hurts**: Splitting related info degrades retrieval quality
5. **91% reduction is achievable**: Through cleaning, transformation, and optimization

---

## ➡️ Next Module

In **Module 3: RAG Essentials**, you'll learn:
- How vector embeddings enable semantic search
- Building a complete RAG pipeline with Redis
- Progressive disclosure in practice