![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

# Data Engineering for Context Systems: A Theoretical Foundation

**A Comprehensive Guide to Chunking, Data Modeling, and Retrieval Optimization**

## 🎯 Learning Objectives

By the end of this notebook, you will understand:

1. **The fundamental question**: When to chunk vs. when not to chunk
2. **Data modeling principles**: How to structure data for optimal retrieval
3. **Chunking strategies**: Document-based, fixed-size, and semantic approaches
4. **Context engineering impact**: How data engineering decisions affect what reaches the LLM
5. **Production patterns**: Real-world decision frameworks and trade-offs
6. **Multimodal content**: Handling tables, formulas, and figures in documents

---

## 📖 Table of Contents

**Part 1: The Foundation - Data Modeling for RAG**
- The critical first question: What is your natural retrieval unit?
- When NOT to chunk (structured records)
- The hierarchical pattern: Summaries + Details

**Part 2: When Chunking Matters**
- Document types that benefit from chunking
- Research foundations: Lost in the Middle, Context Rot
- The retrieval precision problem

**Part 3: Core Chunking Strategies**
- Strategy 1: Document-Based (Structure-Aware)
- Strategy 2: Fixed-Size (Token-Based)
- Strategy 3: Semantic (Meaning-Based)
- Comparative analysis and decision framework

**Part 4: Advanced Topics**
- Multimodal content (tables, formulas, figures)
- Complex documents (legal contracts, knowledge graphs)
- Troubleshooting common chunking failures

**Part 5: Context Engineering Principles**
- How chunking affects context quality
- Token efficiency vs. retrieval precision
- Production-ready decision frameworks

**⏱️ Estimated Time:** 45-60 minutes

---

## Prerequisites

- Understanding of vector embeddings and semantic search
- Familiarity with RAG (Retrieval-Augmented Generation) concepts
- Basic knowledge of LLM context windows

---

## Setup


In [None]:
import os
import sys
from pathlib import Path

from dotenv import load_dotenv

# Handle both running from notebooks/ directory and from project root
if Path.cwd().name == "notebooks":
    project_root = Path.cwd().parent
else:
    project_root = Path.cwd()

if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Load environment variables
env_path = project_root / ".env"
load_dotenv(dotenv_path=env_path)

# Verify required environment variables
required_vars = ["OPENAI_API_KEY"]
missing_vars = [var for var in required_vars if not os.getenv(var)]

if missing_vars:
    print(f"""⚠️  Missing required environment variables: {', '.join(missing_vars)}

Please create a .env file with:
OPENAI_API_KEY=your_openai_api_key
REDIS_URL=redis://localhost:6379
""")
    sys.exit(1)

REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379")
print("✅ Environment variables loaded")


In [None]:
import asyncio
import json
from typing import Any, Dict, List

import redis
import tiktoken
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

# Import hierarchical components
from redis_context_course.hierarchical_manager import HierarchicalCourseManager
from redis_context_course.hierarchical_context import HierarchicalContextAssembler

# Initialize (create the Redis client once and share it with the manager)
redis_client = redis.from_url(REDIS_URL, decode_responses=True)
hierarchical_manager = HierarchicalCourseManager(redis_client=redis_client)
context_assembler = HierarchicalContextAssembler()
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Token counter
encoding = tiktoken.encoding_for_model("gpt-4o")


def count_tokens(text: str) -> int:
    return len(encoding.encode(text))


print("✅ Dependencies loaded")


---

## Part 1: The Foundation - Data Modeling for RAG

### 🎯 The Critical First Question

Before thinking about chunking, you must ask:

> **"What is the natural unit of information I want to retrieve?"**

This is the most important question in data engineering for RAG systems. Just like database schema design, how you structure your knowledge base dramatically affects:
- **Retrieval quality**: Can you find the right information?
- **Token efficiency**: Are you wasting context on irrelevant data?
- **System performance**: How fast can you retrieve and process?

### 🔑 Why This Matters for Context Engineering

**Context engineering is about controlling what information reaches the LLM.** Your data modeling decisions directly impact:

1. **Precision**: Does the retrieved context contain exactly what's needed?
2. **Completeness**: Is all necessary information included?
3. **Efficiency**: Are you minimizing irrelevant tokens?

**The Wrong Approach:**
```
"I have documents → I need to chunk them → What chunk size should I use?"
```

**The Right Approach:**
```
"What is my natural retrieval unit? → Does it need chunking? → If yes, which strategy?"
```


### 📊 Natural Retrieval Units: Examples Across Domains

Understanding your natural retrieval unit is domain-specific. Here are common patterns:

| Domain | Natural Unit | Why | Chunking Needed? |
|--------|-------------|-----|------------------|
| **Course Catalog** | Individual course | Each course is self-contained, complete | ❌ No |
| **Product Catalog** | Individual product | All product info should be retrieved together | ❌ No |
| **FAQ Database** | Question + Answer pair | Q&A is an atomic unit | ❌ No |
| **Research Papers** | Section or paragraph | Different sections answer different queries | ✅ Yes |
| **Legal Contracts** | Clause or section | Need clause-level precision | ✅ Yes |
| **Support Tickets** | Individual ticket | Single issue with context | ❌ No |
| **Technical Docs** | Topic/section | Each section covers distinct functionality | ✅ Yes |
| **Code Repositories** | Function/class | Semantic boundaries at code structure | ✅ Yes |

**Key Insight:** Many structured data types (catalogs, FAQs, tickets) are already at optimal granularity. Chunking them would **reduce** retrieval quality.


### 🎓 Theory: The "Don't Chunk" Strategy

**Concept:** For structured records with natural boundaries, chunking is counterproductive.

**When to Use:**
- Data is already organized into discrete, self-contained units
- Each unit represents a complete semantic entity
- Query patterns align with unit boundaries
- Units are reasonably sized (typically 100-500 tokens)

**Example: Course Catalog**

Let's examine why a course catalog doesn't need chunking:

> **Note**: The following code cell demonstrates hierarchical retrieval with Redis. If you don't have Redis running with course data loaded, you can skip this cell and continue reading; the concepts are explained in the markdown cells.

In [None]:
# Get a sample course to analyze
async def show_course_example():
    sample_courses = await hierarchical_manager.search_summaries(
        query="programming courses", limit=3
    )
    sample_course = sample_courses[0]

    # Generate embedding text if not present
    if not sample_course.embedding_text:
        sample_course.generate_embedding_text()

    # Display the course structure
    print(f"""📚 Sample Course: {sample_course.course_code}
{'=' * 80}
Title: {sample_course.title}
Department: {sample_course.department}
Level: {sample_course.difficulty_level.value}
Credits: {sample_course.credits}
Instructor: {sample_course.instructor}

Description:
{sample_course.short_description}

Prerequisites: {', '.join(sample_course.prerequisite_codes) if sample_course.prerequisite_codes else 'None'}
Tags: {', '.join(sample_course.tags) if sample_course.tags else 'None'}
{'=' * 80}
Token count: {count_tokens(sample_course.embedding_text)}
""")


# Run the example
await show_course_example()

### 📊 Analysis: Why Courses Don't Need Chunking

Let's evaluate this course against chunking criteria:

**1. Semantic Completeness:** ✅
- All information about the course is in one record
- No cross-references to other sections
- Natural boundary exists (one course = one retrieval unit)

**2. Query Patterns:** ✅
- Users ask about specific courses or course types:
  - "What machine learning courses are available?"
  - "Tell me about CS016"
  - "What are the prerequisites for RU102JS?"
- Each query expects a complete course record, not fragments

**3. Retrieval Precision:** ✅
- When a user asks about a course, they need ALL the information
- Splitting would fragment related information:
  - Separating prerequisites from description
  - Splitting instructor from course content
  - Breaking apart tags from topics
- Each course is already the optimal retrieval unit

**4. Token Efficiency:** ✅
- Courses are reasonably sized (~150-200 tokens each)
- Not too large (no wasted context)
- Not too small (no fragmentation overhead)

**5. Context Engineering Impact:**
- **Without chunking**: Retrieve complete, coherent course information
- **With chunking**: Risk fragmenting related data, requiring multiple retrievals
- **Result**: Don't chunk - preserve natural boundaries

**Decision:** ❌ **Don't chunk course data** - it's already optimally structured!


### 🏗️ The Hierarchical Pattern: A Better Data Model

Instead of chunking structured records, use a **hierarchical pattern** with multiple tiers:

**Architecture:**

```
┌─────────────────────────────────────────────────────────────┐
│ Tier 1: Summaries (Lightweight, Searchable)                 │
│ - Stored in vector index                                    │
│ - ~150-200 tokens each                                      │
│ - Fast semantic search                                      │
│ - Returns: Top-k matches                                    │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│ Tier 2: Details (On-Demand, Complete)                       │
│ - Stored as plain Redis keys                                │
│ - Full information with all fields                          │
│ - Retrieved only when needed                                │
│ - Returns: Complete records for top matches                 │
└─────────────────────────────────────────────────────────────┘
```

**Why This Works:**

1. **Separation of Concerns**:
   - Search uses lightweight summaries (fast, efficient)
   - Context assembly uses full details (complete, accurate)

2. **Token Efficiency**:
   - Search 100 summaries = ~20,000 tokens
   - Retrieve 3 full details = ~1,500 tokens
   - Total context = ~1,500 tokens (not 20,000!)

3. **Retrieval Quality**:
   - Summaries optimized for semantic matching
   - Details optimized for completeness
   - No information fragmentation

**This is data modeling, not chunking** - we're structuring data for optimal retrieval patterns.
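The two-tier flow can be illustrated without Redis. The following is a minimal sketch under stated assumptions: the course codes, the in-memory `SUMMARIES` and `DETAILS` stores, and the helper names `search_tier1` and `fetch_tier2` are all hypothetical, and word overlap stands in for vector similarity. It is not the course library's implementation.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    course_code: str
    text: str  # short, searchable text (Tier 1)


# Hypothetical sample data standing in for the vector index and key-value store.
SUMMARIES = [
    Summary("DEMO101", "intro programming python beginner"),
    Summary("DEMO220", "machine learning models intermediate"),
    Summary("DEMO110", "linear algebra vectors matrices"),
]
DETAILS = {
    "DEMO101": {"title": "Intro to Programming", "prerequisites": []},
    "DEMO220": {"title": "Machine Learning", "prerequisites": ["DEMO101", "DEMO110"]},
    "DEMO110": {"title": "Linear Algebra", "prerequisites": []},
}


def search_tier1(query: str, limit: int) -> list[Summary]:
    """Tier 1: rank summaries by word overlap (a stand-in for vector similarity)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        SUMMARIES,
        key=lambda s: len(query_words & set(s.text.split())),
        reverse=True,
    )
    return ranked[:limit]


def fetch_tier2(course_codes: list[str]) -> list[dict]:
    """Tier 2: fetch complete records by key, only for the selected matches."""
    return [DETAILS[code] for code in course_codes]


top = search_tier1("beginner python programming", limit=1)
records = fetch_tier2([s.course_code for s in top])
print(top[0].course_code, "->", records[0]["title"])
```

Only the records returned by the second step would be placed in the LLM context; the summaries exist purely to make the search cheap.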


In [None]:
# Hierarchical retrieval example
async def demonstrate_hierarchical_retrieval():
    query = "beginner programming courses"

    # Tier 1: Search summaries (lightweight, fast)
    print(f"🔍 Query: '{query}'\n")
    print("Tier 1: Searching summaries...")
    summaries = await hierarchical_manager.search_summaries(query, limit=5)

    print(f"Found {len(summaries)} relevant courses\n")
    for i, course in enumerate(summaries, 1):
        print(f"{i}. {course.course_code}: {course.title}")
        print(f"   Level: {course.difficulty_level.value} | Credits: {course.credits}")
        print()

    # Tier 2: Get full details for top matches
    print("\nTier 2: Fetching full details for top 3 courses...")
    top_course_ids = [c.course_code for c in summaries[:3]]
    full_details = await hierarchical_manager.get_full_details(top_course_ids)

    print("\n📚 Full Details Retrieved:\n")
    for course in full_details:
        print(f"{course.course_code}: {course.title}")
        print(f"Description: {course.short_description[:100]}...")
        print(f"Prerequisites: {', '.join(course.prerequisite_codes) if course.prerequisite_codes else 'None'}")
        print(f"Instructor: {course.instructor}")
        print()

    # Show token efficiency. Summaries are only searched; the context sent to
    # the LLM contains the Tier 2 details for the top matches.
    summary_tokens = sum(count_tokens(c.embedding_text) for c in summaries)
    detail_tokens = sum(count_tokens(c.embedding_text) for c in full_details)

    # Baseline: send full details for every match instead of only the top 3
    all_details = await hierarchical_manager.get_full_details(
        [c.course_code for c in summaries]
    )
    baseline_tokens = sum(count_tokens(c.embedding_text) for c in all_details)
    savings = 100 * (1 - detail_tokens / baseline_tokens) if baseline_tokens else 0.0

    print(f"""📊 Token Efficiency:
{'=' * 80}
Tier 1 (5 summaries, searched only): {summary_tokens} tokens
Tier 2 (3 full details, sent to the LLM): {detail_tokens} tokens
vs. sending full details for all 5 matches: {baseline_tokens} tokens
Savings: {savings:.1f}%
""")


# Run the demonstration
await demonstrate_hierarchical_retrieval()

### 🔑 Key Takeaway: Part 1

> **For structured records like courses, products, or FAQs, the hierarchical pattern (summaries + details) is superior to chunking because it respects natural data boundaries and retrieval patterns.**

**Context Engineering Principle:**
- **Chunking** = Breaking apart what should stay together
- **Hierarchical modeling** = Organizing data at appropriate granularity levels
- **Result**: Better retrieval precision, lower token costs, clearer context

---

## Part 2: When Documents DO Need Chunking

Now let's examine the opposite case: **long-form documents** with multiple distinct topics.

### 🎯 The Problem: Information Overload

Some documents are fundamentally different from structured records:
- They contain multiple distinct topics
- Different sections answer different queries
- Retrieving the entire document wastes tokens and reduces precision

**Example: Research Papers**

Let's load a real research paper to understand the problem:


In [None]:
# Load the actual research paper PDF
import pypdf

pdf_path = project_root / "data" / "arxiv_2504_02268.pdf"
reader = pypdf.PdfReader(pdf_path)

# Extract text from all pages
research_paper = ""
for page in reader.pages:
    research_paper += page.extract_text() + "\n"

paper_tokens = count_tokens(research_paper)
print(f"""üìÑ Real Research Paper
{'=' * 80}
Title: "Advancing Semantic Caching for LLMs with Domain-Specific Embeddings"
Authors: Waris Gill et al. (Redis & Virginia Tech, 2025)
Source: arXiv:2504.02268

Structure:
- Abstract
- Introduction
- Background and Related Work
- Methodology (Synthetic Data Generation)
- Evaluation and Results
- Conclusion

Pages: {len(reader.pages)}
Token count: {paper_tokens:,}
Characters: {len(research_paper):,}
{'=' * 80}
""")


### 📊 Comparative Analysis: Course vs. Research Paper

Let's compare the course catalog (doesn't need chunking) with the research paper (does need chunking):

| Factor | Course Catalog | Research Paper |
|--------|---------------|----------------|
| **Document Structure** | Single topic per record | Multiple distinct sections |
| **Semantic Completeness** | Each course is self-contained | Sections cover different topics |
| **Query Patterns** | "Show me CS courses" | "How is synthetic data generated?" |
| **Optimal Retrieval Unit** | Whole course | Specific section |
| **Token Count** | ~150-200 per course | ~6,000 for entire paper |
| **Chunking Decision** | ❌ Don't chunk | ✅ Chunk by section |

### 🎓 Theory: Why Research Papers Need Chunking

**1. Multiple Distinct Topics:**
- Abstract, Introduction, Methodology, Evaluation, Conclusion each cover different aspects
- A query about "synthetic data generation" only needs the Methodology section, not the entire paper

**2. Retrieval Precision Problem:**

| Query | Needs | Without Chunking | With Chunking | Improvement |
|-------|-------|------------------|---------------|-------------|
| "How is synthetic data generated?" | Methodology section | Entire paper (~6,000 tokens) | Methodology (~500 tokens) | **92% reduction** |
| "What were the hit rate results?" | Evaluation + Tables | Entire paper (~6,000 tokens) | Evaluation (~400 tokens) | **93% reduction** |
| "What embedding models were tested?" | Results section | Entire paper (~6,000 tokens) | Results (~300 tokens) | **95% reduction** |
| "What is semantic caching?" | Introduction + Background | Entire paper (~6,000 tokens) | Intro+Background (~600 tokens) | **90% reduction** |

**Impact:** For these queries, chunking retrieves roughly 10-20x fewer tokens (6,000 vs. 300-600), leading to:
- Faster response times (less processing)
- Better answer quality (less noise)
- Lower costs (fewer tokens)
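The percentages in the table follow directly from the approximate token counts. A short sketch of the arithmetic, using the table's rounded figures (the helper name `reduction` is introduced here only for illustration):

```python
PAPER_TOKENS = 6_000  # approximate whole-paper size from the table

# Approximate tokens retrieved per query when the paper is chunked (from the table)
retrieved_tokens = {
    "How is synthetic data generated?": 500,
    "What were the hit rate results?": 400,
    "What embedding models were tested?": 300,
    "What is semantic caching?": 600,
}


def reduction(full_tokens: int, chunk_tokens: int) -> tuple[float, float]:
    """Return (percent of tokens avoided, size ratio full/chunk)."""
    percent = 100 * (1 - chunk_tokens / full_tokens)
    ratio = full_tokens / chunk_tokens
    return percent, ratio


for query, tokens in retrieved_tokens.items():
    percent, ratio = reduction(PAPER_TOKENS, tokens)
    print(f"{query:<38} {percent:5.1f}% fewer tokens ({ratio:.0f}x smaller)")
```

Actual numbers depend on the tokenizer and on where section boundaries fall, so these should be read as order-of-magnitude estimates.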

**3. Context Engineering Impact:**

Without chunking:
```
Query: "How is synthetic data generated?"
Retrieved: [Entire 6,000-token paper]
Problem: 5,500 tokens are irrelevant (Abstract, Intro, Evaluation, Conclusion)
Result: LLM must filter through noise to find answer
```

With chunking:
```
Query: "How is synthetic data generated?"
Retrieved: [Methodology section, 500 tokens]
Benefit: Only relevant content in context
Result: LLM gets precise, focused information
```

**💡 Key Insight:** Chunking isn't about fitting in context windows - it's about **data modeling for retrieval**. Just like you wouldn't store all customer data in one database row, you shouldn't embed all document content in one vector when sections serve different purposes.


### 📚 Research Background: Why Chunking Matters

Even with large context windows (128K+ tokens), research shows that **how you structure context matters more than fitting everything in**.

**Key Research Findings:**

**1. "Lost in the Middle" (Stanford/UC Berkeley, 2023)** - [arXiv:2307.03172](https://arxiv.org/abs/2307.03172)

**Finding:** LLMs exhibit a **U-shaped performance curve** - accuracy is highest when the relevant information sits at the beginning or end of the context, and degrades when it sits in the middle

**Implication for Chunking:**
- Chunking ensures relevant sections are retrieved and placed prominently
- Avoids burying critical information in the middle of long context
- Enables strategic placement of most relevant chunks
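One way to act on this finding is to reorder retrieved chunks so that the strongest matches sit at the edges of the context and the weakest in the middle. The sketch below is an illustrative assumption, not a method from the paper; the function name `place_at_edges` is hypothetical, and the input is assumed to be sorted with the most relevant chunk first.

```python
def place_at_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Reorder chunks so the most relevant ones land at the start and end.

    Input must be sorted most-relevant first. Even-ranked chunks fill the
    front in order; odd-ranked chunks fill the back in reverse, so the
    least relevant chunks end up in the middle.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        (front if rank % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the best match, "E" the weakest
print(place_at_edges(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

Whether reordering helps depends on the model and the task, so it is worth measuring on your own queries before adopting it.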

**2. "Context Rot" (Chroma Research, 2025)** - [research.trychroma.com/context-rot](https://research.trychroma.com/context-rot)

**Finding:** Performance degrades as input length increases, even when relevant info is present

**Key Observations:**
- **Distractor effect**: Irrelevant content actively hurts model performance
- Longer context ≠ better performance
- Quality of context > quantity of context

**Implication for Chunking:**
- Smaller, focused chunks reduce "distractor tokens"
- Precision retrieval beats comprehensive retrieval
- Token efficiency improves answer quality

**3. Needle in the Haystack (NIAH)** - [github.com/gkamradt/LLMTest_NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack)

**Finding:** Models often fail to retrieve information buried in long context

**Implication for Chunking:**
- For structured data, NIAH is irrelevant: each record IS the needle
- For long documents, chunking creates multiple small haystacks
- Semantic search finds the right haystack, avoiding the needle problem

**The Takeaway:** These findings inform design decisions but don't prescribe universal rules:
- **Structured records** (courses, products) don't need chunking
- **Long-form documents** (papers, books) benefit from chunking
- **Experiment with YOUR data** to find optimal approach


### 🔑 Key Takeaway: Part 2

> **Long-form documents with multiple distinct topics benefit from chunking because it enables precision retrieval, reduces irrelevant context, and improves answer quality.**

**Context Engineering Principle:**
- **Problem**: Entire document = too much irrelevant information
- **Solution**: Chunk by semantic boundaries (sections, topics)
- **Result**: Retrieve only what's needed, minimize noise

---

## Part 3: Core Chunking Strategies

Now that we understand **when** to chunk (long-form documents) and **when not to** (structured records), let's explore **how** to chunk effectively.

**There's no single "best" strategy** - the optimal approach depends on:
- Document structure (structured vs. unstructured)
- Content type (text, tables, code, formulas)
- Query patterns (specific facts vs. summaries)
- Token budget (how much context can you afford?)

We'll explore three core approaches with theoretical foundations and practical examples:


### Strategy 1: Document-Based Chunking (Structure-Aware)

**🎓 Theory:**

Split documents based on their inherent structure (sections, paragraphs, headings) rather than arbitrary token counts.

**Core Principle:** Respect semantic boundaries that authors created

**How It Works:**
1. Identify structural markers (markdown headers, section numbers, page breaks)
2. Split at these boundaries
3. Keep each section intact as a chunk
4. Preserve context (headers, section titles)

**Best For:**
- Research papers with clear sections
- Technical documentation with headers
- Books with chapters/sections
- Any document with explicit structure

**Context Engineering Impact:**
- ✅ Preserves semantic coherence (sections stay together)
- ✅ Keeps tables, formulas, and code WITH their context
- ✅ Natural alignment with query patterns ("What does the methodology section say?")
- ⚠️ Variable chunk sizes (some sections longer than others)


In [None]:
# Strategy 1: Document-Based Chunking
# Split research paper by sections (using markdown headers)


def chunk_by_structure(text: str, separator: str = "\n## ") -> List[str]:
    """
    Split text by structural markers (e.g., markdown headers).

    This respects the document's inherent organization, preserving
    semantic boundaries created by the author.
    """
    # Split by headers
    sections = text.split(separator)

    # Clean and format chunks
    chunks = []
    for i, section in enumerate(sections):
        if section.strip():
            # Add header back (except for first chunk which is title)
            if i > 0:
                chunk = "## " + section
            else:
                chunk = section
            chunks.append(chunk.strip())

    return chunks


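# Apply to research paper.
# Caveat: text extracted from a PDF usually contains no markdown "## " headers,
# so this call may return a single chunk unless the text is converted to markdown first.
structure_chunks = chunk_by_structure(research_paper)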

print(f"""üìä Strategy 1: Document-Based (Structure-Aware) Chunking
{'=' * 80}
Original document: {paper_tokens:,} tokens
Number of chunks: {len(structure_chunks)}

Chunk breakdown:
""")

for i, chunk in enumerate(structure_chunks[:5]):  # Show first 5
    chunk_tokens = count_tokens(chunk)
    # Show first 200 chars of each chunk
    preview = chunk[:200].replace("\n", " ")
    print(f"   Chunk {i+1}: {chunk_tokens:,} tokens - {preview}...\n")

if len(structure_chunks) > 5:
    print(f"... ({len(structure_chunks) - 5} more chunks)")


**Strategy 1 Analysis:**

✅ **Advantages:**
- **Semantic coherence**: Each chunk is a complete section with related content
- **Context preservation**: Tables, formulas, and code stay WITH their explanations
- **Query alignment**: Matches how users think about documents ("What's in the methodology?")
- **Easy implementation**: Simple to implement for structured documents
- **Author intent**: Respects the structure the author designed

⚠️ **Trade-offs:**
- **Variable sizes**: Some sections may be very long or very short
- **Requires structure**: Only works for documents with clear structural markers
- **May need refinement**: Very long sections might still need sub-chunking

🎯 **Best Use Cases:**
- Research papers with clear sections (Abstract, Introduction, Methods, Results)
- Technical documentation with hierarchical headers
- Books with chapters and subsections
- Any document where structure aligns with semantic boundaries

**Context Engineering Principle:**
> Structure-aware chunking optimizes for **semantic completeness** - each chunk contains a complete thought or topic, minimizing the need for cross-chunk references.
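The trade-offs above note that very long sections might still need sub-chunking. A minimal sketch of one way to handle that, assuming markdown-style `## ` headers and a hypothetical `max_chars` limit: sections under the limit stay whole, while oversized sections are split on blank lines with the header repeated in each piece so that context is preserved.

```python
import re


def chunk_by_headers(text: str, max_chars: int = 1200) -> list[str]:
    """Split on '## ' headers; sub-split oversized sections by paragraph.

    A single paragraph longer than max_chars is kept whole (not split further).
    For an oversized section, its first line is treated as the header.
    """
    # Zero-width split at the start of every line that begins with "## "
    sections = re.split(r"(?m)^(?=## )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        header, _, body = section.partition("\n")
        current = header
        for paragraph in body.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            # Start a new piece when adding this paragraph would exceed the limit
            if current != header and len(current) + len(paragraph) + 2 > max_chars:
                chunks.append(current)
                current = header
            current += "\n\n" + paragraph
        if current != header:
            chunks.append(current)
    return chunks


paragraphs = ["para " + "x" * 50] * 5
doc = "Title\n\n## Intro\n\nshort text\n\n## Methods\n\n" + "\n\n".join(paragraphs)
chunks = chunk_by_headers(doc, max_chars=150)
for c in chunks:
    print(len(c), repr(c[:20]))
```

Repeating the header costs a few tokens per piece, but it keeps each sub-chunk attributable to its section at retrieval time.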


### Strategy 2: Fixed-Size Chunking (Token-Based)

**🎓 Theory:**

Split text into chunks of a predetermined size (e.g., 512 tokens) with overlap to preserve context across boundaries.

**Core Principle:** Consistent, predictable chunk sizes for uniform processing

**How It Works:**
1. Define target chunk size (e.g., 800 characters, ~200 tokens)
2. Define overlap (e.g., 100 characters) to preserve context
3. Use smart separators (try paragraphs → sentences → words → characters)
4. Split text while respecting natural boundaries when possible

**Best For:**
- Unstructured text without clear sections
- Quick prototyping and baselines
- When consistent chunk sizes are required
- Documents where structure doesn't align with semantics

**Context Engineering Impact:**
- ✅ Predictable token usage (easier to budget context)
- ✅ Works on any text (structured or unstructured)
- ✅ Smart boundary detection (prefers paragraph and sentence breaks, splitting mid-sentence only when a segment still exceeds the size limit)
- ⚠️ May break semantic coherence (splits related content)
- ⚠️ Overlap creates redundancy (increases storage/cost)


In [None]:
# Strategy 2: Fixed-Size Chunking (Using LangChain)
# Industry-standard approach with smart boundary detection

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create text splitter with smart boundary detection
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,  # Target chunk size in characters
    chunk_overlap=100,  # Overlap to preserve context
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],  # Try these in order
    is_separator_regex=False,
)

print("üîÑ Running fixed-size chunking with LangChain...")
print("   Trying to split on: paragraphs ‚Üí sentences ‚Üí words ‚Üí characters\n")

# Apply to research paper
fixed_chunks_docs = text_splitter.create_documents([research_paper])
fixed_chunks = [doc.page_content for doc in fixed_chunks_docs]

print(f"""üìä Strategy 2: Fixed-Size (LangChain) Chunking
{'=' * 80}
Original document: {paper_tokens:,} tokens
Target chunk size: 800 characters (~200 words)
Overlap: 100 characters
Number of chunks: {len(fixed_chunks)}

Chunk breakdown:
""")

for i, chunk in enumerate(fixed_chunks[:5]):  # Show first 5
    chunk_tokens = count_tokens(chunk)
    preview = chunk[:100].replace("\n", " ")
    print(f"   Chunk {i+1}: {chunk_tokens:,} tokens - {preview}...")

if len(fixed_chunks) > 5:
    print(f"... ({len(fixed_chunks) - 5} more chunks)")


**Strategy 2 Analysis:**

✅ **Advantages:**
- **Consistent sizes**: Predictable token usage for context budgeting
- **Universal applicability**: Works on any text, structured or not
- **Smart boundaries**: Tries to split at natural points (paragraphs, sentences)
- **Overlap**: Preserves context across chunk boundaries
- **Battle-tested**: Industry-standard approach with proven libraries

⚠️ **Trade-offs:**
- **Ignores structure**: Doesn't understand document organization
- **May break coherence**: Can split related content (table from caption, formula from explanation)
- **Redundancy**: Overlap increases storage and processing costs
- **Arbitrary boundaries**: Splits based on size, not semantics

🎯 **Best Use Cases:**
- Unstructured text (novels, articles without clear sections)
- Quick prototyping and baseline implementations
- When you need consistent chunk sizes for processing
- Documents where structure doesn't provide semantic boundaries

**Context Engineering Principle:**
> Fixed-size chunking optimizes for **predictability** - you know roughly how much context each chunk will consume (the size limit is an upper bound), making it easier to manage token budgets.
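The LangChain splitter above measures size in characters. When the budget is expressed in tokens, the same idea reduces to a sliding window over a token list. The following is a minimal sketch; the function name `sliding_window` is hypothetical, and plain integers stand in for token IDs. In this notebook the list could come from `encoding.encode(text)`, with `encoding.decode(...)` turning each window back into text.

```python
def sliding_window(tokens: list, chunk_size: int, overlap: int) -> list[list]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return windows


tokens = list(range(10))  # stand-in for real token IDs
print(sliding_window(tokens, chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Unlike the recursive splitter, this sketch ignores sentence and paragraph boundaries entirely, which is the price of an exact token count per chunk.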


### Strategy 3: Semantic Chunking (Meaning-Based)

**🎓 Theory:**

Split text based on semantic similarity using embeddings - create new chunks when topic changes significantly.

**Core Principle:** Let meaning, not structure or size, determine boundaries

**How It Works:**
1. Split text into sentences or paragraphs
2. Generate embeddings for each segment
3. Calculate similarity between consecutive segments
4. Create chunk boundaries where similarity drops (topic shift detected)
5. Group similar consecutive segments into chunks
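The five steps above can be sketched in plain Python. This is an illustration under loud assumptions: `embed` is a toy bag-of-words counter standing in for a real embedding model, the fixed `threshold` of 0.2 is arbitrary, and the input is assumed to be already split into sentences. It is not how any particular library implements semantic chunking.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy bag-of-words 'embedding' (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever consecutive sentences fall below the threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            groups.append([sentence])  # similarity dropped: topic shift
        else:
            groups[-1].append(sentence)  # same topic: extend current chunk
    return [" ".join(group) for group in groups]


sentences = [
    "Redis stores vectors in memory.",
    "Redis vectors support fast search.",
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

A real embedding model would capture synonyms and paraphrases that word counts miss, and production implementations typically derive the threshold from the distribution of similarities rather than fixing it in advance.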

**Best For:**
- Dense academic text where topics shift gradually
- Legal documents with complex clause relationships
- Narratives and stories where semantic boundaries don't align with structure
- Content where you want adaptive chunk sizes based on coherence

**Context Engineering Impact:**
- ✅ Meaning-aware: Chunks based on topic shifts, not arbitrary boundaries
- ✅ Adaptive: Chunk sizes vary based on content coherence
- ✅ Better retrieval: Each chunk is semantically focused
- ✅ Free: Uses local embeddings (no API costs)
- ⚠️ Slower processing: Requires embedding generation for all segments
- ⚠️ Variable sizes: Harder to predict token usage
- ⚠️ May ignore structure: Doesn't respect document organization


In [None]:
# Strategy 3: Semantic Chunking (Using LangChain)
# Industry-standard approach with local embeddings (no API costs!)

from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings
import os

# Suppress tokenizer warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Initialize local embeddings (no API costs!)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

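# Create semantic chunker with percentile-based breakpoint detection.
# The percentile is taken over embedding *distances* between adjacent sentence
# groups: a split occurs wherever the distance exceeds the 25th percentile, so a
# low value yields many small chunks and a higher value yields fewer, larger ones.
semantic_chunker = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=25,  # 25th percentile of distances
    buffer_size=1,  # Embed each sentence together with 1 neighbor on each side
)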

print("üîÑ Running semantic chunking with LangChain...")
print("   Using local embeddings (sentence-transformers/all-MiniLM-L6-v2)")
print("   Breakpoint detection: 25th percentile of similarity scores\n")

# Apply to research paper
semantic_chunks_docs = semantic_chunker.create_documents([research_paper])

# Extract text from Document objects
semantic_chunks = [doc.page_content for doc in semantic_chunks_docs]

print(f"""üìä Strategy 3: Semantic (LangChain) Chunking
{'=' * 80}
Original document: {paper_tokens:,} tokens
Number of chunks: {len(semantic_chunks)}

Chunk breakdown:
""")

for i, chunk in enumerate(semantic_chunks[:5]):  # Show first 5
    chunk_tokens = count_tokens(chunk)
    preview = chunk[:100].replace("\n", " ")
    print(f"   Chunk {i+1}: {chunk_tokens:,} tokens - {preview}...")

if len(semantic_chunks) > 5:
    print(f"... ({len(semantic_chunks) - 5} more chunks)")


**Strategy 3 Analysis:**

✅ **Advantages:**
- **Meaning-aware**: Detects topic shifts using semantic similarity
- **Adaptive boundaries**: Chunk sizes vary based on content coherence
- **Better retrieval**: Each chunk is semantically focused on a single topic
- **No API costs**: Uses local embeddings (sentence-transformers)
- **Intelligent**: Understands when topics change, even without structural markers

⚠️ **Trade-offs:**
- **Slower processing**: Must generate embeddings for all segments
- **Variable sizes**: Harder to predict token usage and budget
- **May ignore structure**: Doesn't respect document organization (headers, sections)
- **Requires tuning**: Threshold and buffer size affect results
- **Computational cost**: More expensive than simple text splitting

🎯 **Best Use Cases:**
- Dense academic text where topics shift gradually
- Legal documents with complex semantic relationships
- Narratives and stories where structure doesn't indicate topic changes
- Content where semantic coherence is more important than structure

**Context Engineering Principle:**
> Semantic chunking optimizes for **topical coherence** - each chunk focuses on a single topic or concept, maximizing the relevance of retrieved content.


### 📊 Comparing Chunking Strategies: Decision Framework

Now let's compare all three strategies side-by-side:


In [None]:
print(f"""
{'=' * 80}
CHUNKING STRATEGY COMPARISON
{'=' * 80}

Document: Research Paper ({paper_tokens:,} tokens)

Strategy              | Chunks | Avg Size | Complexity | Best For
--------------------- | ------ | -------- | ---------- | --------
Document-Based        | {len(structure_chunks):>6} | {sum(count_tokens(c) for c in structure_chunks) // len(structure_chunks):>8} | Low        | Structured docs
Fixed-Size            | {len(fixed_chunks):>6} | {sum(count_tokens(c) for c in fixed_chunks) // len(fixed_chunks):>8} | Low        | Unstructured text
Semantic              | {len(semantic_chunks):>6} | {sum(count_tokens(c) for c in semantic_chunks) // len(semantic_chunks):>8} | High       | Dense academic text

{'=' * 80}
""")


### 🎯 YOUR Chunking Decision Framework

Chunking strategy is a **design choice** that depends on your specific context. There's no universal "correct" chunk size or strategy.

**Step 1: Start with Document Type**

| Document Type | Default Approach | Reasoning |
|---------------|------------------|----------|
| **Structured records** (courses, products, FAQs) | Don't chunk | Natural boundaries already exist |
| **Long-form text** (papers, books, docs) | Consider chunking | May need retrieval precision |
| **PDFs with visual layout** | Page-level or structure-based | Preserves tables, figures |
| **Code** | Function/class boundaries | Semantic structure matters |
| **Unstructured text** | Fixed-size with overlap | No clear structure to follow |

**Step 2: Evaluate These Factors**

1. **Semantic completeness:** Is each item self-contained?
   - ✅ Yes → Don't chunk (preserve natural boundaries)
   - ❌ No → Consider chunking strategy

2. **Query patterns:** What will users ask?
   - Specific facts → Smaller, focused chunks help
   - Summaries/overviews → Larger chunks or hierarchical
   - Mixed → Consider hierarchical approach

3. **Topic density:** How many distinct topics per document?
   - Single topic → Whole-document embedding often works
   - Multiple distinct topics → Chunking may improve precision

4. **Document structure:** Does it have clear organization?
   - ✅ Yes → Document-based chunking
   - ❌ No → Fixed-size or semantic chunking

5. **Content type:** What's in the document?
   - Text-only → Any strategy works
   - Tables/formulas/code → Structure-aware chunking (keep context together)

**Step 3: Choose Your Strategy**

```
┌──────────────────────────────────────────────────────────────┐
│ Decision Tree                                                │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│ Is data already structured records? (courses, products)      │
│   ├─ YES → Don't chunk, use hierarchical pattern             │
│   └─ NO → Continue                                           │
│                                                              │
│ Does document have clear structure? (sections, headers)      │
│   ├─ YES → Document-based chunking                           │
│   └─ NO → Continue                                           │
│                                                              │
│ Do you need consistent chunk sizes?                          │
│   ├─ YES → Fixed-size chunking                               │
│   └─ NO → Semantic chunking                                  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

**Example Decisions:**

| Domain | Data Characteristics | Decision | Why |
|--------|---------------------|----------|-----|
| **Course Catalog** | Small, self-contained records | **Don't chunk** | Each course is a complete retrieval unit |
| **Research Papers** | Multi-section, dense topics | **Document-based** | Sections are natural semantic units |
| **Support Tickets** | Single issue per ticket | **Don't chunk** | Already at optimal granularity |
| **Legal Contracts** | Nested structure, many clauses | **Hierarchical + Structure-based** | Need both overview and clause-level detail |
| **Novels** | Continuous narrative, no structure | **Semantic** | Topic shifts don't align with structure |
| **Technical Docs** | Clear sections and subsections | **Document-based** | Structure aligns with semantics |


### 🔑 Key Takeaway: Part 3

> **Ask "What is my natural retrieval unit?" before deciding on a chunking strategy. For many structured data use cases, the answer is "don't chunk."**

**Context Engineering Principles:**

1. **Structure-aware chunking** → Optimizes for semantic completeness
2. **Fixed-size chunking** → Optimizes for predictability
3. **Semantic chunking** → Optimizes for topical coherence

**Choose based on:**
- Your data characteristics
- Your query patterns
- Your token budget
- Your quality requirements

---

## Part 4: Advanced Topics

### 🎨 Handling Multimodal Content

**The Challenge:** Research papers and technical documents aren't just text - they contain:
- **Tables** with structured data
- **Formulas** with variable definitions
- **Figures** with visual patterns
- **Code** with implementation details

Standard text chunking can break these elements, separating content from context.


### 🎓 Theory: Multimodal Chunking Principles

**Core Principle:** Context is king - never separate content from explanation

**1. Tables:**
- **Problem**: Splitting table from caption loses meaning
- **Solution**: Keep table WITH caption and surrounding explanation
- **Implementation**: Detect table boundaries, include ±200 chars of context

**2. Formulas:**
- **Problem**: Formula without variable definitions is useless
- **Solution**: Keep formula WITH variable definitions and explanation
- **Implementation**: Include surrounding context (±200 chars)

**3. Figures:**
- **Problem**: Figure reference without description is incomplete
- **Solution**: Describe visual patterns in text, keep WITH caption
- **Implementation**: Extract caption and discussion together

**4. Code:**
- **Problem**: Code snippet without usage example is hard to understand
- **Solution**: Keep code WITH usage examples and context
- **Implementation**: Include function/class definitions with docstrings
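
For Python source, function and class boundaries can be found with the standard-library `ast` module. The sketch below is one possible approach under stated assumptions: `chunk_python_source` and the chunk dictionary layout are hypothetical, only top-level definitions are chunked, and `ast.get_source_segment` requires Python 3.8 or later.

```python
import ast


def chunk_python_source(source):
    """Return one chunk per top-level function or class, with its docstring as metadata."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "content_type": "code",
                "name": node.name,
                # Full definition text, so the body is never split mid-function
                "text": ast.get_source_segment(source, node),
                "docstring": ast.get_docstring(node),
            })
    return chunks


# Toy module (assumption) used only to demonstrate the chunker.
sample_source = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b


class Greeter:
    """Says hello."""

    def greet(self, name):
        return f"Hello, {name}"
'''

code_chunks = chunk_python_source(sample_source)
for chunk in code_chunks:
    print(chunk["name"], "-", chunk["docstring"])
```

Each chunk carries the whole definition plus its docstring, which keeps the code together with its explanation.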

**Context Engineering Impact:**

Without multimodal awareness:
```
Chunk 1: "...as shown in Table 1."
Chunk 2: [Table 1 data]
Chunk 3: "The results indicate..."

Problem: Table separated from context
Result: LLM can't interpret table meaning
```

With multimodal awareness:
```
Chunk: "...as shown in Table 1. [Table 1 data] The results indicate..."

Benefit: Table WITH context
Result: LLM understands table meaning and implications
```
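
The merge shown above can be sketched as a small post-processing step. This is an illustrative sketch only: it assumes the document has already been parsed into `(kind, text)` blocks, and `merge_tables_with_context` is a hypothetical helper, not part of any library used in this notebook.

```python
def merge_tables_with_context(blocks):
    """Attach each table block to the text blocks immediately before and after it."""
    chunks = []
    i = 0
    while i < len(blocks):
        kind, text = blocks[i]
        if kind == "table":
            parts = [text]
            # Pull in the preceding text block if it is still a standalone chunk.
            if chunks and chunks[-1]["content_type"] == "text":
                parts.insert(0, chunks.pop()["text"])
            # Pull in the following text block, if there is one.
            if i + 1 < len(blocks) and blocks[i + 1][0] == "text":
                parts.append(blocks[i + 1][1])
                i += 1
            chunks.append({"content_type": "table", "text": " ".join(parts)})
        else:
            chunks.append({"content_type": kind, "text": text})
        i += 1
    return chunks


# Toy blocks (assumption) mirroring the example above.
parsed_blocks = [
    ("text", "Intro paragraph."),
    ("text", "...as shown in Table 1."),
    ("table", "[Table 1 data]"),
    ("text", "The results indicate..."),
    ("text", "Conclusion."),
]

merged_chunks = merge_tables_with_context(parsed_blocks)
for chunk in merged_chunks:
    print(chunk["content_type"], "|", chunk["text"])
```

The table ends up in a single chunk with the sentence that introduces it and the sentence that interprets it, while unrelated paragraphs stay separate.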


### 📊 Practical Example: Chunking Multimodal Content

Let's see how to handle different content types from our research paper:


In [None]:
import re

# Example 1: Extract and chunk a table with context
table_pattern = r'(Table \d+:.*?)(?=\n\n[A-Z]|\nFigure|\n\d+\.|\Z)'
tables_found = re.findall(table_pattern, research_paper, re.DOTALL)

if tables_found:
    table_chunk = {
        "content_type": "table",
        "text": tables_found[0][:500],  # First 500 chars
        "metadata": {
            "page": "6",
            "section": "Evaluation",
            "table_id": "Table 1"
        }
    }

    print("✅ TABLE CHUNKING EXAMPLE:")
    print("=" * 70)
    print(f"Content Type: {table_chunk['content_type']}")
    print(f"Metadata: {table_chunk['metadata']}")
    print(f"\nChunk Text:\n{table_chunk['text'][:300]}...")
    print("\n✅ Best Practice: Keep table WITH caption and surrounding context")
else:
    print("Table extraction pattern needs adjustment for this PDF")


In [None]:
# Example 2: Extract and chunk formulas with context
formula_pattern = r'(loss.*?=.*?(?:\n|$))'
formulas = re.findall(formula_pattern, research_paper, re.IGNORECASE)

if formulas:
    # Find context around the formula
    formula_text = formulas[0]
    formula_idx = research_paper.find(formula_text)
    context_start = max(0, formula_idx - 200)
    context_end = min(len(research_paper), formula_idx + len(formula_text) + 200)

    formula_chunk = {
        "content_type": "formula",
        "text": research_paper[context_start:context_end],
        "metadata": {
            "section": "Methodology",
            "formula_type": "contrastive_loss"
        }
    }

    print("\n✅ FORMULA CHUNKING EXAMPLE:")
    print("=" * 70)
    print(f"Content Type: {formula_chunk['content_type']}")
    print(f"Metadata: {formula_chunk['metadata']}")
    print(f"\nChunk Text:\n{formula_chunk['text'][:300]}...")
    print("\n✅ Best Practice: Keep formula WITH variable definitions and explanation")


In [None]:
# Summary: Multimodal chunking principles
print("\n" + "=" * 70)
print("MULTIMODAL CHUNKING PRINCIPLES:")
print("=" * 70)
print("""
1. **Tables**: Keep WITH caption and explanation
   - Preserve structure (markdown/HTML)
   - Add metadata: table_id, section, content_type

2. **Formulas**: Keep WITH variable definitions
   - Include surrounding context (±200 chars)
   - Preserve LaTeX if available

3. **Figures**: Describe visual patterns in text
   - Keep caption WITH discussion
   - Add metadata: figure_id, visual_type

4. **Code**: Keep WITH usage examples and context
   - Preserve syntax and comments
   - Include function/class definitions

5. **General Rule**: Context is king - never separate content from explanation
""")


### 🏛️ Advanced Topic: Complex Documents (Legal Contracts)

**Note:** Some document types require approaches beyond chunking.

**Why Legal Documents Are Different:**

Legal contracts require sophisticated data engineering beyond simple chunking:

**Key Challenges:**
1. **Clause-level granularity** with hierarchical numbering (Section 3.2.1)
2. **Cross-references** between clauses ("as defined in Section 1.5...")
3. **Hierarchical dependencies** (amendments modify earlier provisions)
4. **Legal precedence** ("Notwithstanding Section 2.1..." creates overrides)

**What This Requires:**

Simple chunking is insufficient. You need:
- **Knowledge graphs** to capture clause relationships
- **Recursive retrieval** to fetch referenced clauses
- **Metadata enrichment** (clause type, parties, dates, jurisdiction)

**Example Retrieval Flow:**
```
Query: "What are the payment terms?"

1. Retrieve: Clause 3.2 (Payment Terms)
2. Detect reference: "as defined in Section 1.5"
3. Fetch: Clause 1.5 (Definitions: "Net 30")
4. Detect modification: Clause 8.1 modifies 3.2
5. Fetch: Clause 8.1 (Amendment: "Net 45 for Q4")
6. Assemble: [3.2 + 1.5 + 8.1] with relationship metadata
```
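
The retrieval flow above can be sketched in a few lines. This is a toy illustration under loud assumptions: a plain dictionary stands in for a knowledge graph, the clause texts are invented for the example, and the `modified_by` field and `retrieve_with_dependencies` function are hypothetical names.

```python
import re

# Toy clause store (assumption): in practice this would be a graph database.
CLAUSES = {
    "3.2": {
        "title": "Payment Terms",
        "text": "Invoices are payable as defined in Section 1.5.",
        "modified_by": ["8.1"],
    },
    "1.5": {
        "title": "Definitions",
        "text": "Net 30 means payment is due within 30 days.",
        "modified_by": [],
    },
    "8.1": {
        "title": "Amendment",
        "text": "Net 45 applies to Q4 invoices.",
        "modified_by": [],
    },
}

# Matches references such as "Section 1.5" or "Section 3.2.1".
SECTION_REF = re.compile(r"Section (\d+(?:\.\d+)*)")


def retrieve_with_dependencies(clause_id, clauses, seen=None):
    """Collect a clause, every clause it references, and every clause that modifies it."""
    seen = [] if seen is None else seen
    if clause_id in seen or clause_id not in clauses:
        return seen  # already collected (guards against cycles) or unknown id
    seen.append(clause_id)
    clause = clauses[clause_id]
    for referenced in SECTION_REF.findall(clause["text"]):
        retrieve_with_dependencies(referenced, clauses, seen)
    for amendment in clause["modified_by"]:
        retrieve_with_dependencies(amendment, clauses, seen)
    return seen


assembled = retrieve_with_dependencies("3.2", CLAUSES)
for cid in assembled:
    print(cid, CLAUSES[cid]["title"])
```

The `seen` list doubles as the result and as a cycle guard, which matters because real contracts often contain circular cross-references.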

**Recommendation:** This is a **research-level problem** requiring domain expertise. For production systems:
- Start with clause-level chunking as baseline
- Build knowledge graphs for relationships (Neo4j, etc.)
- Implement recursive retrieval for dependencies
- Consider specialized legal NLP tools (LexNLP, Blackstone)

**Resources:**
- [Multi-Graph Multi-Agent Systems](https://medium.com/enterprise-rag/legal-document-rag-multi-graph-multi-agent-recursive-retrieval-through-legal-clauses-c90e073e0052)
- [GraphRAG for Contracts](https://neo4j.com/blog/developer/agentic-graphrag-for-commercial-contracts/)


### 🔧 Troubleshooting Common Chunking Failures

**Common Failure Patterns and Solutions:**

| Problem | Likely Cause | Solution |
|---------|--------------|----------|
| **Tables split across chunks** | Fixed-size chunking | Use structure-aware chunking |
| **Formulas without context** | Naive chunking | Keep formulas with explanations |
| **Missing cross-references** | Single-chunk retrieval | Implement recursive retrieval |
| **Generic answers** | Chunks too large | Reduce chunk size or use semantic chunking |
| **Incomplete answers** | Chunks too small | Increase chunk size or add overlap |
| **Poor retrieval precision** | Wrong chunking strategy | Re-evaluate natural retrieval unit |
| **High token costs** | No chunking on long docs | Implement appropriate chunking |
| **Fragmented information** | Over-chunking structured data | Don't chunk, use hierarchical pattern |

**Iterative Process:**
1. Start simple (baseline strategy)
2. Measure performance (retrieval quality, token usage)
3. Identify failures (what queries fail? why?)
4. Test improvements (try different strategies)
5. Iterate (refine based on results)

**Context Engineering Principle:**
> Chunking is an iterative engineering process, not a one-time decision. Monitor, measure, and refine based on real-world performance.

---

## Part 5: Context Engineering Principles

### 🎯 How Data Engineering Affects Context Quality

Every data engineering decision directly impacts what information reaches the LLM. Let's understand the connections:


### 📊 The Context Engineering Stack

```
┌─────────────────────────────────────────────────────────────┐
│ Layer 5: LLM Response                                       │
│ - Quality depends on context quality                        │
└─────────────────────────────────────────────────────────────┘
                            ↑
┌─────────────────────────────────────────────────────────────┐
│ Layer 4: Context Assembly                                   │
│ - Combine retrieved chunks into coherent context            │
│ - Order matters (Lost in the Middle)                        │
└─────────────────────────────────────────────────────────────┘
                            ↑
┌─────────────────────────────────────────────────────────────┐
│ Layer 3: Retrieval                                          │
│ - Semantic search finds relevant chunks                     │
│ - Quality depends on chunk granularity                      │
└─────────────────────────────────────────────────────────────┘
                            ↑
┌─────────────────────────────────────────────────────────────┐
│ Layer 2: Chunking Strategy ← YOU ARE HERE                   │
│ - How you split documents affects retrieval precision       │
│ - Chunk size affects token efficiency                       │
└─────────────────────────────────────────────────────────────┘
                            ↑
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: Data Modeling                                      │
│ - Natural retrieval units                                   │
│ - Hierarchical patterns                                     │
└─────────────────────────────────────────────────────────────┘
```

**Key Insight:** Data engineering decisions at Layers 1-2 cascade through the entire stack, affecting final response quality.
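
Layer 4 notes that chunk order matters. One commonly used heuristic, shown here as a sketch rather than something this notebook implements, is to place the highest-ranked chunks at the start and end of the assembled context and the weakest in the middle. The function name `reorder_for_long_context` and the toy labels are assumptions.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Interleave a relevance-ranked list so the best chunks sit at both ends."""
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        # Even ranks fill the front; odd ranks fill the back (reversed below).
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


# Toy ranking (assumption): "A" is most relevant, "E" least relevant.
ranked_chunks = ["A", "B", "C", "D", "E"]
ordered_chunks = reorder_for_long_context(ranked_chunks)
print(ordered_chunks)
```

Whether this helps depends on the model and the task, so it is worth measuring against a plain relevance-ordered baseline.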


### 🔑 Core Context Engineering Principles

**Principle 1: Precision Over Completeness**

```
Bad: Retrieve entire 6,000-token document
Good: Retrieve 500-token relevant section

Why: Context Rot - irrelevant content actively hurts performance
```

**Principle 2: Semantic Boundaries Over Arbitrary Boundaries**

```
Bad: Split mid-table because chunk size limit reached
Good: Keep table with caption, even if chunk is larger

Why: Semantic completeness - content needs context to be useful
```

**Principle 3: Natural Units Over Forced Chunking**

```
Bad: Chunk course catalog into smaller pieces
Good: Keep each course as a complete unit

Why: Natural retrieval units - data already at optimal granularity
```

**Principle 4: Structure-Aware Over Structure-Blind**

```
Bad: Fixed-size chunking on research paper with clear sections
Good: Document-based chunking that respects section boundaries

Why: Author intent - structure often aligns with semantic boundaries
```

**Principle 5: Measure, Don't Assume**

```
Bad: "512 tokens is the best chunk size" (universal rule)
Good: Test different strategies on YOUR data with YOUR queries

Why: Context is domain-specific - what works for one use case may fail for another
```


### 📈 Token Efficiency vs. Retrieval Precision

The four scenarios below illustrate the trade-off between token efficiency and retrieval precision:

**Scenario 1: No Chunking (Long Documents)**

```
Document: 6,000 tokens
Query: "What is the methodology?"
Retrieved: Entire document (6,000 tokens)

Token Efficiency: ❌ Low (5,500 irrelevant tokens)
Retrieval Precision: ❌ Low (~92% irrelevant content)
Answer Quality: ❌ Poor (LLM must filter noise)
```

**Scenario 2: Over-Chunking (Structured Records)**

```
Course: 200 tokens
Chunked into: 4 chunks of 50 tokens each
Query: "Tell me about CS101"
Retrieved: 2-3 chunks (100-150 tokens)

Token Efficiency: ⚠️ Medium (some fragmentation)
Retrieval Precision: ❌ Low (missing related info)
Answer Quality: ❌ Poor (incomplete information)
```

**Scenario 3: Optimal Chunking (Research Paper)**

```
Document: 6,000 tokens
Chunked into: 12 sections (~500 tokens each)
Query: "What is the methodology?"
Retrieved: Methodology section (500 tokens)

Token Efficiency: ✅ High (only relevant content)
Retrieval Precision: ✅ High (exact section needed)
Answer Quality: ✅ Excellent (focused, complete)
```

**Scenario 4: Hierarchical Pattern (Structured Records)**

```
Catalog: 100 courses × 200 tokens = 20,000 tokens
Hierarchical: 100 summaries (150 tokens each) + full details on demand (200 tokens each)
Query: "Beginner programming courses"
Retrieved: 5 summaries (750 tokens) + 3 details (600 tokens) = 1,350 tokens

Token Efficiency: ✅ High (93% reduction)
Retrieval Precision: ✅ High (relevant courses)
Answer Quality: ✅ Excellent (complete, focused)
```
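
The Scenario 4 figures can be checked with a few lines of arithmetic. This sketch assumes 150 tokens per summary and 200 tokens per full course record, which is what the totals above imply; the variable names are illustrative.

```python
# Assumed per-item sizes, implied by the scenario totals
TOKENS_PER_COURSE = 200
TOKENS_PER_SUMMARY = 150

full_catalog_tokens = 100 * TOKENS_PER_COURSE                       # every course in context
retrieved_tokens = 5 * TOKENS_PER_SUMMARY + 3 * TOKENS_PER_COURSE   # 5 summaries + 3 details
reduction = 1 - retrieved_tokens / full_catalog_tokens

print(f"{retrieved_tokens:,} tokens retrieved, {reduction:.0%} reduction")
```

This prints `1,350 tokens retrieved, 93% reduction`, matching the scenario.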

**The Sweet Spot:**

```
Optimal Chunking = Maximum Retrieval Precision + Minimum Token Waste

Achieved by:
1. Understanding natural retrieval units
2. Choosing appropriate chunking strategy
3. Preserving semantic completeness
4. Measuring and iterating
```


### 🎯 Production-Ready Decision Framework

**Step-by-Step Process for Production Systems:**

**1. Analyze Your Data**
```
Questions to ask:
- What is the natural retrieval unit?
- How many distinct topics per document?
- Does structure align with semantics?
- What content types exist? (text, tables, code, formulas)
```

**2. Understand Your Query Patterns**
```
Questions to ask:
- What will users ask?
- Do queries target specific sections or whole documents?
- How precise do answers need to be?
- What's the acceptable token budget?
```

**3. Choose Initial Strategy**
```python
# Decision tree (the three flags are booleans describing your data)
def choose_strategy(structured_records, has_clear_structure, need_consistent_sizes):
    if structured_records:
        return "hierarchical_pattern"  # Don't chunk
    if has_clear_structure:
        return "document_based"  # Chunk by sections
    if need_consistent_sizes:
        return "fixed_size"  # Chunk by tokens
    return "semantic"  # Chunk by meaning
```

**4. Implement and Measure**
```
Metrics to track:
- Retrieval precision (% relevant chunks retrieved)
- Token efficiency (avg tokens per query)
- Answer quality (human eval or LLM-as-judge)
- Latency (time to retrieve and process)
```
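
The first two metrics are straightforward to compute once each query has labeled relevant chunks. The sketch below is a minimal example, assuming you have such labels; `retrieval_precision`, `evaluate`, and the toy query log are hypothetical and not part of the notebook's earlier code.

```python
def retrieval_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are labeled relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def evaluate(runs):
    """Average precision and token usage over a list of logged query runs."""
    count = len(runs)
    return {
        "avg_precision": sum(
            retrieval_precision(run["retrieved"], run["relevant"]) for run in runs
        ) / count,
        "avg_tokens": sum(run["tokens"] for run in runs) / count,
    }


# Toy query log (assumption): chunk ids retrieved, ids labeled relevant, tokens sent.
query_runs = [
    {"retrieved": ["c1", "c2", "c3", "c4"], "relevant": {"c1", "c3"}, "tokens": 1800},
    {"retrieved": ["c7", "c8"], "relevant": {"c7", "c8"}, "tokens": 900},
]

report = evaluate(query_runs)
print(report)
```

Running the same log against each chunking strategy gives a like-for-like comparison before any change goes to production.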

**5. Iterate and Refine**
```
Optimization loop:
1. Identify failure cases
2. Analyze root causes
3. Test alternative strategies
4. Measure improvements
5. Deploy and monitor
```


---

## Summary and Key Takeaways

### 🎯 The Core Insight

> **Chunking isn't about fitting in context windows - it's about data modeling for retrieval.**

Just like database schema design, how you structure your knowledge base dramatically affects retrieval quality, token efficiency, and system performance.

### 📚 Key Concepts Covered

**1. The Critical First Question**
- What is my natural retrieval unit?
- Many structured data types don't need chunking
- Chunking is a design choice, not a default step

**2. When NOT to Chunk**
- Structured records (courses, products, FAQs)
- Self-contained units with natural boundaries
- Data already at optimal granularity
- Use hierarchical patterns instead

**3. When Chunking Helps**
- Long-form documents with multiple topics
- Research papers, technical docs, books
- Improves retrieval precision (8-12x reduction in irrelevant context)
- Reduces token costs and improves answer quality

**4. Core Chunking Strategies**
- **Document-Based**: Split by structure (sections, headers)
  - Best for: Structured documents with clear organization
  - Optimizes for: Semantic completeness
- **Fixed-Size**: Split by token count with overlap
  - Best for: Unstructured text, consistent sizes needed
  - Optimizes for: Predictability
- **Semantic**: Split by topic shifts using embeddings
  - Best for: Dense academic text, adaptive boundaries
  - Optimizes for: Topical coherence

**5. Advanced Topics**
- Multimodal content (tables, formulas, figures)
- Complex documents (legal contracts, knowledge graphs)
- Troubleshooting common failures

**6. Context Engineering Principles**
- Precision over completeness
- Semantic boundaries over arbitrary boundaries
- Natural units over forced chunking
- Structure-aware over structure-blind
- Measure, don't assume

### 🎓 Decision Framework Summary

| Question | Answer | Strategy |
|----------|--------|----------|
| **What is my natural retrieval unit?** | Single record (course, product, FAQ) | Don't chunk - use hierarchical patterns |
| | Long-form document (paper, book) | Chunk by sections or semantically |
| | Legal contract with cross-references | Advanced: knowledge graphs + recursive retrieval |
| **How many topics per document?** | Single topic | Whole-document embedding |
| | Multiple distinct topics | Chunking improves precision |
| **What content types?** | Text-only | Standard chunking strategies |
| | Multimodal (tables, figures) | Keep content WITH context |
| **Does structure align with semantics?** | Yes | Document-based chunking |
| | No | Fixed-size or semantic chunking |

### 💡 Remember

**This is engineering, not magic:**
- Start with understanding your data
- Choose strategy based on characteristics
- Implement and measure
- Iterate based on results
- There's no universal "best" approach

**Context engineering is about:**
- Controlling what information reaches the LLM
- Maximizing retrieval precision
- Minimizing irrelevant tokens
- Preserving semantic completeness

---

## What's Next?

### Module 4: Memory Systems for Context Engineering

Now that you understand data modeling and chunking for knowledge bases, you'll learn to manage conversation context:
- **Working Memory**: Track conversation history within a session
- **Long-term Memory**: Remember user preferences across sessions
- **Memory-Enhanced RAG**: Combine retrieved knowledge with conversation memory
- **Redis Agent Memory Server**: Automatic memory extraction and retrieval

```
Module 1: Context Engineering Fundamentals
    ↓
Module 2: RAG Fundamentals
    ↓
Module 3: Chunking and Data Modeling ← You are here
    ↓
Module 4: Memory Systems ← Next
    ↓
Module 5: Building Agents (Complete System)
```

---

## Practice Exercises

### Exercise 1: Analyze Your Data
Think about a dataset you work with. Answer these questions:
1. What is the natural retrieval unit?
2. Does it need chunking? Why or why not?
3. If yes, which chunking strategy would you use?

### Exercise 2: Design a Chunking Strategy
For each document type, choose the best approach:
1. Product catalog with 1,000 items
2. 50-page technical manual with chapters
3. Customer support tickets (avg 200 words each)
4. Legal contracts (avg 20 pages, multiple clauses)

### Exercise 3: Experiment with Chunking
Take the research paper example and:
1. Try all three chunking strategies
2. Compare the number of chunks and average size
3. Which strategy would work best for queries about "semantic caching methodology"?

---

## Additional Resources

**Chunking Strategies:**
- [LangChain Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/)
- [LlamaIndex Node Parsers](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/)

**Research Papers:**
- ["Lost in the Middle" (arXiv:2307.03172)](https://arxiv.org/abs/2307.03172) - U-shaped performance curve: accuracy is highest when relevant information sits at the start or end of the context
- ["Context Rot" (Chroma Research, 2025)](https://research.trychroma.com/context-rot) - Performance degradation with input length
- [Needle in the Haystack Benchmark](https://github.com/gkamradt/LLMTest_NeedleInAHaystack) - Retrieval in long contexts
- ["Contextual Retrieval" (Anthropic, 2024)](https://www.anthropic.com/news/contextual-retrieval) - 49-67% reduction in retrieval failures
- ["Advancing Semantic Caching for LLMs" (arXiv:2504.02268)](https://arxiv.org/abs/2504.02268) - Redis/Virginia Tech research

**Advanced Topics:**
- [Multi-Graph Multi-Agent Systems for Legal Documents](https://medium.com/enterprise-rag/legal-document-rag-multi-graph-multi-agent-recursive-retrieval-through-legal-clauses-c90e073e0052)
- [GraphRAG for Commercial Contracts](https://neo4j.com/blog/developer/agentic-graphrag-for-commercial-contracts/)

**Vector Databases:**
- [Redis Vector Search Documentation](https://redis.io/docs/stack/search/reference/vectors/)
- [RedisVL Python Library](https://github.com/RedisVentures/redisvl)
