![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

# Module 3: Chunking and Data Modeling for RAG

## From Basic RAG to Production-Ready Knowledge Bases

In Module 2, you built a working RAG system with hierarchical search. Now you'll learn the critical engineering decisions that separate toy demos from production systems: **when and how to chunk your data**.

**The Critical Question:** Does my data need chunking?

This module teaches you that **chunking is a design choice, not a default step**. Just like database schema design, how you structure your knowledge base dramatically affects retrieval quality, token efficiency, and system performance.

## What You'll Learn

**1. The "Don't Chunk" Strategy:**
- When whole-document embedding is the right choice
- Why structured records (courses, products, FAQs) often don't need chunking
- How to recognize natural retrieval boundaries in your data

**2. When Chunking Helps:**
- Document types that benefit from chunking (research papers, long-form content)
- Research-backed insights: "Lost in the Middle", "Context Rot"
- How chunking improves retrieval precision

**3. Chunking Strategies:**
- Document-based (structure-aware): Split by sections/headers
- Fixed-size (token-based): Using LangChain's RecursiveCharacterTextSplitter
- Semantic (meaning-based): Using embeddings to detect topic shifts
- Trade-offs and decision framework

**4. Data Modeling for RAG:**
- The hierarchical pattern: summaries + details
- Engineering workflow: Extract → Clean → Transform → Optimize → Store
- Real-world examples with Redis University course catalog

**⏱️ Estimated Time:** 60-75 minutes

---

## Prerequisites

- Completed Module 2: RAG Fundamentals and Implementation
- Redis 8 running locally with course data loaded
- OpenAI API key set
- Understanding of vector embeddings and semantic search

---

## Setup

In [None]:
import os
import sys
from pathlib import Path

from dotenv import load_dotenv

# Handle both running from workshop/ directory and from project root
if Path.cwd().name == "workshop":
    project_root = Path.cwd().parent
else:
    project_root = Path.cwd()

if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Load environment variables from project root
env_path = project_root / ".env"
load_dotenv(dotenv_path=env_path)

# Verify required environment variables
required_vars = ["OPENAI_API_KEY"]
missing_vars = [var for var in required_vars if not os.getenv(var)]

if missing_vars:
    print(f"""⚠️  Missing required environment variables: {', '.join(missing_vars)}

Please create a .env file with:
OPENAI_API_KEY=your_openai_api_key
REDIS_URL=redis://localhost:6379
""")
    sys.exit(1)

REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379")
print("✅ Environment variables loaded")

In [1]:
import asyncio
import json
from typing import Any, Dict, List

import redis
import tiktoken
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

# Import hierarchical components (from Module 2)
from redis_context_course.hierarchical_manager import HierarchicalCourseManager
from redis_context_course.hierarchical_context import HierarchicalContextAssembler

# Initialize
# Create one Redis client and share it with the manager
redis_client = redis.from_url(REDIS_URL, decode_responses=True)
hierarchical_manager = HierarchicalCourseManager(redis_client=redis_client)
context_assembler = HierarchicalContextAssembler()
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Token counter
encoding = tiktoken.encoding_for_model("gpt-4o")


def count_tokens(text: str) -> int:
    return len(encoding.encode(text))


print("✅ Dependencies loaded")

✅ Environment variables loaded


## Part 1: Data Modeling - The Foundation of RAG Quality

### The Critical First Question: What is My Natural Retrieval Unit?

Before thinking about chunking, ask: **"What is the natural unit of information I want to retrieve?"**

This is similar to database design - you wouldn't store all customer data in one row, and you shouldn't embed all document content in one vector without thinking about retrieval patterns.

**Examples of Natural Retrieval Units:**

| Domain | Natural Unit | Why |
|--------|-------------|-----|
| **Course Catalog** | Individual course | Each course is self-contained, complete |
| **Product Catalog** | Individual product | All product info should be retrieved together |
| **FAQ Database** | Question + Answer pair | Q&A is an atomic unit |
| **Research Papers** | Section or paragraph | Different sections answer different queries |
| **Legal Contracts** | Clause or section | Need clause-level precision |
| **Support Tickets** | Individual ticket | Single issue with context |

Let's see this in practice with our course catalog:

### Example: Course Catalog - A Natural Retrieval Unit

Let's examine a single course to understand why it's already an optimal retrieval unit:

In [3]:
# Get a sample course to analyze using search
sample_courses = await hierarchical_manager.search_summaries(
    query="programming courses", limit=3
)
sample_course = sample_courses[0]  # Get first course

# Generate embedding text if not present
if not sample_course.embedding_text:
    sample_course.generate_embedding_text()

# Display the course summary
print(f"""📚 Sample Course: {sample_course.course_code}
{'=' * 80}
Title: {sample_course.title}
Department: {sample_course.department}
Level: {sample_course.difficulty_level.value}
Credits: {sample_course.credits}
Instructor: {sample_course.instructor}

Description:
{sample_course.short_description}

Prerequisites: {', '.join(sample_course.prerequisite_codes) if sample_course.prerequisite_codes else 'None'}
Tags: {', '.join(sample_course.tags) if sample_course.tags else 'None'}
{'=' * 80}

Token count: {count_tokens(sample_course.embedding_text)}
""")

17:29:08 httpx INFO   HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
📊 Naive Approach Results:
   Courses included: 10
   Token count: 1,688
   Estimated cost per request: $0.0042

   For 100 courses, this would be ~16,880 tokens!


📄 Sample of raw JSON context:
[
  {
    "id": "course_catalog:01KBQSB5VYQ55YPXGFV2CM0S9S",
    "course_code": "CS004",
    "title": "Database Systems",
    "description": "Design and implementation of database systems. SQL, normalization, transactions, and database administration.",
    "department": "Computer Science",
    "credits": 3,
    "difficulty_level": "intermediate",
    "format": "online",
    "instructor": "Christopher Adams",
    "prerequisites": [],
    "created_at": "2025-12-05 17:29:08.824502",
    "updated_a...


### Analysis: Why Courses Don't Need Chunking

**Semantic Completeness:** ✅ Each course is self-contained
- All information about the course is in one record
- No cross-references to other sections
- Natural boundary exists (one course = one retrieval unit)

**Query Patterns:** ✅ Users ask about specific courses or course types
- "What machine learning courses are available?"
- "Tell me about CS016"
- "What are the prerequisites for RU102JS?"

**Retrieval Precision:** ✅ Whole-course embedding maximizes relevance
- When a user asks about a course, they need ALL the information
- Splitting would fragment related information (e.g., separating prerequisites from description)
- Each course is already the optimal retrieval unit

**Token Efficiency:** ✅ Courses are reasonably sized (~150-200 tokens each)
- Not too large (no wasted context)
- Not too small (no fragmentation)

**Decision:** ❌ **Don't chunk course data** - it's already optimally structured!

This is the **"don't chunk" strategy** - a valid and often optimal choice for structured records.
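
A minimal sketch of what whole-record embedding can look like, assuming a course is a plain dictionary with the fields shown in the sample above. `build_embedding_text` is a hypothetical helper for illustration, not part of `redis_context_course`; the point is that every field ends up in one string, so one vector represents the whole course.

```python
def build_embedding_text(course: dict) -> str:
    """Flatten one course record into a single string to embed as one vector."""
    prereqs = ", ".join(course.get("prerequisites", [])) or "None"
    tags = ", ".join(course.get("tags", [])) or "None"
    return (
        f"{course['course_code']}: {course['title']}\n"
        f"Department: {course['department']}\n"
        f"Level: {course['difficulty_level']}\n"
        f"Description: {course['description']}\n"
        f"Prerequisites: {prereqs}\n"
        f"Tags: {tags}"
    )


# Hypothetical record, shaped like the catalog entries in this module
course = {
    "course_code": "CS004",
    "title": "Database Systems",
    "department": "Computer Science",
    "difficulty_level": "intermediate",
    "description": "Design and implementation of database systems.",
    "prerequisites": [],
    "tags": ["databases", "sql"],
}

text = build_embedding_text(course)
print(text)
```

Because prerequisites and description live in the same string, a query about either one retrieves the complete record.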

### The Hierarchical Pattern: A Better Data Model

Instead of chunking, we use a **hierarchical pattern** with two tiers:

**Tier 1: Summaries (Lightweight)**
- Searchable, compact course overviews
- Stored in vector index for fast retrieval
- ~150-200 tokens each

**Tier 2: Details (On-Demand)**
- Full course information with all fields
- Retrieved only when needed
- Stored as plain Redis keys (not in vector index)

This is **data modeling**, not chunking - we're structuring data for optimal retrieval patterns.

Let's see this in action:

In [None]:
# Hierarchical retrieval example
query = "beginner programming courses"

# Tier 1: Search summaries (fast, lightweight)
summaries, details = await hierarchical_manager.hierarchical_search(
    query=query,
    summary_limit=5,  # Get 5 summary matches
    detail_limit=3,   # Fetch full details for top 3
)

print(f"""🔍 Query: "{query}"
{'=' * 80}

📊 Tier 1: Summary Results ({len(summaries)} courses)
""")

for i, summary in enumerate(summaries, 1):
    print(f"{i}. {summary.course_code}: {summary.title} ({summary.difficulty_level})")

print(f"""
{'=' * 80}
📄 Tier 2: Detailed Information (top {len(details)} courses)
""")

for detail in details:
    prereq_codes = [p.course_code for p in detail.prerequisites] if detail.prerequisites else []
    print(f"""
{detail.course_code}: {detail.title}
Department: {detail.department} | Credits: {detail.credits}
Prerequisites: {', '.join(prereq_codes) if prereq_codes else 'None'}

Description: {detail.full_description[:200]}...
""")

# Assemble context
context = context_assembler.assemble_hierarchical_context(summaries, details, query)
context_tokens = count_tokens(context)

print(f"""
{'=' * 80}
📊 Context Statistics:
- Summaries: {len(summaries)} courses
- Details: {len(details)} courses
- Total tokens: {context_tokens:,}
- Retrieval pattern: Hierarchical (summaries + details)
""")

**Key Takeaway:** For structured records like courses, the hierarchical pattern (summaries + details) is superior to chunking because it respects natural data boundaries and retrieval patterns.
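
The two-tier idea can also be sketched without Redis. In this illustration the dictionaries `summary_tier` and `detail_tier`, the toy `keyword_score` function, and all course data are assumptions standing in for the vector index, the plain Redis keys, and real vector similarity; it is not how `HierarchicalCourseManager` is implemented.

```python
# In-memory stand-ins for the two tiers (hypothetical data, no Redis required)
summary_tier = {
    "CS001": "Intro to Programming: beginner Python course",
    "CS004": "Database Systems: intermediate SQL and transactions",
    "CS016": "Machine Learning: advanced models and training",
}
detail_tier = {
    "CS001": {"credits": 3, "syllabus": "Variables, loops, functions"},
    "CS004": {"credits": 3, "syllabus": "SQL, normalization, transactions"},
    "CS016": {"credits": 4, "syllabus": "Regression, trees, neural networks"},
}


def keyword_score(query: str, text: str) -> int:
    """Toy relevance score: number of query words present in the text."""
    words = set(text.lower().replace(":", " ").split())
    return sum(1 for word in query.lower().split() if word in words)


def hierarchical_lookup(query: str, summary_limit: int = 2, detail_limit: int = 1):
    """Tier 1: rank the summaries. Tier 2: fetch details for the top matches only."""
    ranked = sorted(
        summary_tier,
        key=lambda code: keyword_score(query, summary_tier[code]),
        reverse=True,
    )
    top_codes = ranked[:summary_limit]
    top_details = {code: detail_tier[code] for code in top_codes[:detail_limit]}
    return top_codes, top_details


codes, fetched = hierarchical_lookup("beginner python course")
print(codes, fetched)
```

Only the best match pays the cost of a detail lookup; the remaining summaries stay lightweight.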

---

## Part 2: When Documents DO Need Chunking

Now let's look at a completely different type of data: **long-form documents** with multiple distinct topics.

### Example: Research Paper

Let's create a sample research paper about Redis vector search optimization:

In [None]:
# Create a sample research paper about Redis vector search
research_paper = """
# Optimizing Vector Search Performance in Redis

## Abstract
This paper presents a comprehensive analysis of vector search optimization techniques in Redis,
examining the trade-offs between search quality, latency, and memory usage. We evaluate multiple
indexing strategies including HNSW and FLAT indexes across datasets ranging from 10K to 10M vectors.
Our results demonstrate that careful index configuration can improve search latency by up to 10x
while maintaining 95%+ recall. We also introduce novel compression techniques that reduce memory
usage by 75% with minimal impact on search quality.

## 1. Introduction
Vector databases have become essential infrastructure for modern AI applications, enabling semantic
search, recommendation systems, and retrieval-augmented generation (RAG). Redis, traditionally known
as an in-memory data structure store, has evolved to support high-performance vector search through
the RediSearch module. However, optimizing vector search performance requires understanding complex
trade-offs between multiple dimensions: search quality (recall), query latency, memory usage, and
index build time.

This paper makes three key contributions: (1) A systematic evaluation of HNSW parameter configurations
across different dataset sizes and query patterns, (2) Novel compression techniques that reduce memory
footprint while preserving search quality, and (3) Practical recommendations for production deployments
based on real-world workload analysis.

[... continues for several more pages ...]

## 2. Background and Related Work
Previous work on vector search optimization has focused primarily on algorithmic improvements to
approximate nearest neighbor (ANN) search. Malkov and Yashunin (2018) introduced HNSW (Hierarchical
Navigable Small World), which has become the de facto standard for high-dimensional vector search.
Johnson et al. (2019) developed FAISS, demonstrating that product quantization can significantly
reduce memory usage. Subramanya et al. (2019) proposed DiskANN for billion-scale search
with SSD-based storage.

However, these works primarily focus on standalone vector search systems. Our work specifically
addresses the unique challenges of integrating vector search into Redis, a multi-model database
that must balance vector search performance with other data structure operations.

[... continues ...]

## 3. Performance Analysis and Results

### 3.1 HNSW Configuration Trade-offs

Table 1 shows the performance comparison across different HNSW configurations. As M increases from 16 to 64,
we observe significant improvements in recall (0.89 to 0.97) but at the cost of increased latency (2.1ms to 8.7ms)
and memory usage (1.2GB to 3.8GB). The sweet spot for most real-world workloads is M=32 with ef_construction=200,
which achieves 0.94 recall with 4.3ms latency.

Table 1: HNSW Performance Comparison
| M  | ef_construction | Recall@10 | Latency (ms) | Memory (GB) | Build Time (min) |
|----|-----------------|-----------|--------------|-------------|------------------|
| 16 | 100            | 0.89      | 2.1          | 1.2         | 8                |
| 32 | 200            | 0.94      | 4.3          | 2.1         | 15               |
| 64 | 400            | 0.97      | 8.7          | 3.8         | 32               |

The data clearly demonstrates the fundamental trade-off between search quality and resource consumption.
For applications requiring high recall (>0.95), the increased latency and memory costs are unavoidable.

### 3.2 Mathematical Model

The recall-latency trade-off can be modeled as a quadratic function of the HNSW parameters:

Latency(M, ef) = α·M² + β·ef + γ

Where:
- M = number of connections per layer (controls graph connectivity)
- ef = size of dynamic candidate list (controls search breadth)
- α, β, γ = dataset-specific constants (fitted from experimental data)

For our e-commerce dataset, we fitted: α=0.002, β=0.015, γ=1.2 (R²=0.94)

[... continues ...]

## 4. Implementation Recommendations

Based on our findings, we recommend the following configuration for real-world deployments:

```python
# Optimal HNSW configuration for balanced performance
index_params = {
    "M": 32,                  # Balance recall and latency
    "ef_construction": 200,   # Higher quality index
    "ef_runtime": 100         # Fast search with good recall
}
```

This configuration achieves 0.94 recall with 4.3ms p95 latency, suitable for most real-time applications.

## 5. Conclusion
Our findings demonstrate that vector search optimization is fundamentally about understanding
YOUR specific requirements and constraints. There is no one-size-fits-all configuration.
"""

paper_tokens = count_tokens(research_paper)
print(f"""📄 Sample Research Paper
{'=' * 80}
Title: "Optimizing Vector Search Performance in Redis"

Structure:
- Abstract
- Introduction
- Background and Related Work
- Performance Analysis and Results
- Implementation Recommendations
- Conclusion

Token count: {paper_tokens:,}
Word count: ~{len(research_paper.split())}
{'=' * 80}
""")

### Analysis: Why This Research Paper NEEDS Chunking

Let's compare the course catalog (doesn't need chunking) with the research paper (does need chunking):

| Factor | Course Catalog | Research Paper |
|--------|---------------|----------------|
| **Document Structure** | Single topic per record | Multiple distinct sections |
| **Semantic Completeness** | Each course is self-contained | Sections cover different topics |
| **Query Patterns** | "Show me CS courses" | "What compression techniques?" |
| **Optimal Retrieval Unit** | Whole course | Specific section |
| **Token Count** | ~150-200 tokens | ~1,500+ tokens |
| **Chunking Decision** | ❌ Don't chunk | ✅ Chunk by section |

**Why the research paper needs chunking:**

**1. Multiple Distinct Topics:**
- Abstract, Introduction, Background, Results, Conclusion each cover different aspects
- A query about "compression techniques" only needs the relevant section, not the entire paper

**2. Retrieval Precision:**
- Without chunking: Retrieve entire 1,500-token paper for every query
- With chunking: Retrieve only the 200-300 token section that's relevant
- Result: about 80% less context per query (300 tokens instead of 1,500)

**3. Query-Specific Needs:**

| Query | Needs | Without Chunking | With Chunking |
|-------|-------|------------------|---------------|
| "What compression techniques?" | Methodology section | Entire paper (1,500 tokens) | Methodology (300 tokens) |
| "What were recall results?" | Results + Table | Entire paper (1,500 tokens) | Results section (250 tokens) |
| "How does HNSW work?" | Background + Formula | Entire paper (1,500 tokens) | Background (200 tokens) |
| "Recommended config?" | Implementation section | Entire paper (1,500 tokens) | Implementation (150 tokens) |

**Impact:** 5-10x less context per query in this illustration, leading to faster responses and better quality.
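
The reduction factors follow directly from the token counts in the table. A quick check, treating those counts as illustrative rather than measured:

```python
# Illustrative token counts taken from the table above
whole_paper_tokens = 1_500
section_tokens = {
    "What compression techniques?": 300,
    "What were recall results?": 250,
    "How does HNSW work?": 200,
    "Recommended config?": 150,
}

for query, tokens in section_tokens.items():
    factor = whole_paper_tokens / tokens      # how many times smaller the context is
    saved = 1 - tokens / whole_paper_tokens   # fraction of tokens no longer sent
    print(f"{query:<30} {tokens:>4} tokens  {factor:>4.1f}x smaller  {saved:.0%} fewer tokens")

factors = [whole_paper_tokens / tokens for tokens in section_tokens.values()]
```

The smallest factor comes from the 300-token section and the largest from the 150-token section.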

**💡 Key Insight:** Chunking isn't about fitting in context windows - it's about **data modeling for retrieval**. Just like you wouldn't store all customer data in one database row, you shouldn't embed all document content in one vector when sections serve different purposes.

---

## Part 3: Research Background - Why Chunking Matters

Even with large context windows (128K+ tokens), research shows that **how you structure context matters more than fitting everything in**.

### Key Research Findings

**1. "Lost in the Middle" (Stanford/UC Berkeley, 2023)**

*Source: [arXiv:2307.03172](https://arxiv.org/abs/2307.03172)*

- LLMs show a **U-shaped performance curve**: accuracy is highest when the relevant information sits at the beginning or end of the context and drops when it sits in the middle
- Happens even in models designed for long contexts
- **Implication:** Chunking ensures relevant sections are retrieved and placed prominently, not buried

**2. "Context Rot" (Chroma Research, 2025)**

*Source: [research.trychroma.com/context-rot](https://research.trychroma.com/context-rot)*

- Performance degrades as input length increases, even when relevant info is present
- **Distractor effect**: Irrelevant content actively hurts model performance
- Even 4 distractor documents can significantly degrade output quality
- **Implication:** Smaller, focused chunks reduce "distractor tokens"

**3. Needle in the Haystack (NIAH) Benchmark**

*Source: [github.com/gkamradt/LLMTest_NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack)*

- Models often fail to retrieve information buried in long context
- Performance varies by position (middle is worst)
- **Limitation:** Tests lexical retrieval only, not semantic understanding
- **Implication:** For structured data, NIAH is irrelevant: each record IS the needle

**The Key Insight:**

These findings inform design decisions but don't prescribe universal rules:

- **Structured records** (courses, products, FAQs): "Lost in the middle" doesn't apply, because each record is already focused
- **Long-form documents** (papers, books): Context rot and positional bias become relevant, so chunking helps
- **Mixed content**: Real-world data rarely fits neat categories, so experiment with YOUR data
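
One possible way to act on the positional-bias finding is to reorder retrieved chunks so that the strongest matches sit at the edges of the prompt and the weakest in the middle. The sketch below assumes the chunks arrive ranked best-first; `place_best_at_edges` is a hypothetical helper, not something prescribed by the cited research.

```python
from typing import List, TypeVar

T = TypeVar("T")


def place_best_at_edges(ranked: List[T]) -> List[T]:
    """Reorder items ranked best-first so the strongest sit at the start and end."""
    front = ranked[0::2]  # ranks 1, 3, 5, ... fill the beginning
    back = ranked[1::2]   # ranks 2, 4, 6, ... fill the end, in reverse order
    return front + back[::-1]


ranked_chunks = ["rank1", "rank2", "rank3", "rank4", "rank5"]
ordered = place_best_at_edges(ranked_chunks)
print(ordered)  # ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

Whether this helps depends on the model and the data, so it is worth measuring before adopting.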

---

## Part 4: Chunking Strategies - Three Approaches

Once you've determined your data needs chunking, the next question is: **How should you chunk it?**

There's no single "best" strategy - the optimal approach depends on YOUR data characteristics and query patterns.

### Strategy 1: Document-Based Chunking (Structure-Aware)

**Concept:** Split documents based on their inherent structure (sections, paragraphs, headings).

**Best for:** Structured documents with clear logical divisions (research papers, technical docs, books).

In [None]:
# Strategy 1: Document-Based Chunking
# Split research paper by sections (using markdown headers)


def chunk_by_structure(text: str, separator: str = "\n## ") -> List[str]:
    """Split text by structural markers (e.g., markdown headers)."""

    # Split by headers
    sections = text.split(separator)

    # Clean and format chunks
    chunks = []
    for i, section in enumerate(sections):
        if section.strip():
            # Add header back (except for first chunk which is title)
            if i > 0:
                chunk = "## " + section
            else:
                chunk = section
            chunks.append(chunk.strip())

    return chunks


# Apply to research paper
structure_chunks = chunk_by_structure(research_paper)

print(f"""📊 Strategy 1: Document-Based (Structure-Aware) Chunking
{'=' * 80}
Original document: {paper_tokens:,} tokens
Number of chunks: {len(structure_chunks)}

Chunk breakdown:
""")

for i, chunk in enumerate(structure_chunks):
    chunk_tokens = count_tokens(chunk)
    # Show the first 300 characters of each chunk
    preview = chunk[:300].replace("\n", " ")
    print(f"   Chunk {i+1}: {chunk_tokens:,} tokens - {preview}...\n")

**Strategy 1 Analysis:**

✅ **Advantages:**
- Respects document structure (sections stay together)
- Semantically coherent (each chunk is a complete section)
- Easy to implement for structured documents
- **Keeps tables, formulas, and code WITH their context**

⚠️ **Trade-offs:**
- Variable chunk sizes (some sections longer than others)
- Requires documents to have clear structure
- May create chunks that are still too large

🎯 **Best for:**
- Research papers with clear sections
- Technical documentation with headers
- Books with chapters/sections
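
The "chunks that are still too large" trade-off can be softened with a fallback: split by header first, then split any oversized section by paragraph while repeating the header for context. This is a sketch under stated assumptions; `chunk_by_headers`, the `max_words` budget, and the word-count size proxy are illustrative choices, not part of the notebook's code.

```python
import re
from typing import Dict, List


def chunk_by_headers(text: str, max_words: int = 120) -> List[Dict[str, str]]:
    """Split on level-2 markdown headers; split oversized sections by paragraph."""
    # Zero-width split before each line that starts with "## " (Python 3.7+)
    sections = re.split(r"(?m)^(?=## )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        header, _, body = section.partition("\n")
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
        if len(section.split()) <= max_words or not paragraphs:
            chunks.append({"header": header, "text": section})
            continue
        # Fallback: one chunk per paragraph, each prefixed with its section header
        for paragraph in paragraphs:
            chunks.append({"header": header, "text": f"{header}\n{paragraph}"})
    return chunks


doc = """# Title

## Short
One small paragraph.

## Long
First paragraph with several words here.

Second paragraph with several words here.
"""

chunks = chunk_by_headers(doc, max_words=10)
for chunk in chunks:
    print(repr(chunk["header"]), len(chunk["text"].split()), "words")
```

Repeating the header in each fallback chunk keeps a paragraph tied to its section even when it is retrieved on its own.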

### Strategy 2: Fixed-Size Chunking (Token-Based)

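**Concept:** Split text into chunks of a predetermined size (e.g., 512 tokens) with overlap. The cell below measures size in characters (`length_function=len`); at roughly 4 characters per token for English text, 800 characters is about 200 tokens.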

**Best for:** Unstructured text, quick prototyping, when you need consistent chunk sizes.

In [None]:
# Strategy 2: Fixed-Size Chunking (Using LangChain)
# Industry-standard approach with smart boundary detection

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create text splitter with smart boundary detection
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,  # Target chunk size in characters
    chunk_overlap=100,  # Overlap to preserve context
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],  # Try these in order
    is_separator_regex=False,
)

print("🔄 Running fixed-size chunking with LangChain...")
print("   Trying to split on: paragraphs → lines → sentences → words → characters\n")

# Apply to research paper
fixed_chunks_docs = text_splitter.create_documents([research_paper])
fixed_chunks = [doc.page_content for doc in fixed_chunks_docs]

print(f"""📊 Strategy 2: Fixed-Size (LangChain) Chunking
{'=' * 80}
Original document: {paper_tokens:,} tokens
Target chunk size: 800 characters (~200 tokens)
Overlap: 100 characters
Number of chunks: {len(fixed_chunks)}

Chunk breakdown:
""")

for i, chunk in enumerate(fixed_chunks[:5]):  # Show first 5
    chunk_tokens = count_tokens(chunk)
    preview = chunk[:100].replace("\n", " ")
    print(f"   Chunk {i+1}: {chunk_tokens:,} tokens - {preview}...")

if len(fixed_chunks) > 5:
    print(f"... ({len(fixed_chunks) - 5} more chunks)")

**Strategy 2 Analysis:**

✅ **Advantages:**
- **Respects natural boundaries**: Tries paragraphs → lines → sentences → words → characters
- Consistent chunk sizes (predictable token usage)
- Works on any text (structured or unstructured)
- **Doesn't split mid-sentence** (unless absolutely necessary)

⚠️ **Trade-offs:**
- Ignores document structure (doesn't understand sections)
- Can break semantic coherence (may split related content)
- Overlap creates redundancy (increases storage/cost)

🎯 **Best for:**
- Unstructured text (no clear sections)
- Quick prototyping and baselines
- When consistent chunk sizes are required
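
The core of fixed-size chunking with overlap is a sliding window whose step is `chunk_size - overlap`. The sketch below uses whitespace-separated words as a stand-in for tokenizer tokens, which is an assumption made to keep it dependency-free; `sliding_window_chunks` is a hypothetical helper and does not reproduce the boundary-aware behavior of `RecursiveCharacterTextSplitter`.

```python
from typing import List


def sliding_window_chunks(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Cut a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
        start += step
    return chunks


words = "one two three four five six seven eight nine ten".split()
windows = sliding_window_chunks(words, chunk_size=4, overlap=1)
for window in windows:
    print(" ".join(window))
```

Each window starts with the last word of the previous one, which is the redundancy that overlap trades for context.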

### Strategy 3: Semantic Chunking (Meaning-Based)

**Concept:** Split text based on semantic similarity using embeddings - create new chunks when topic changes significantly.

**How it works:**
1. Split text into sentences or paragraphs
2. Generate embeddings for each segment
3. Calculate similarity between consecutive segments
4. Create chunk boundaries where similarity drops (topic shift detected)

**Best for:** Dense academic text, legal documents, narratives where semantic boundaries don't align with structure.
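
The four steps above can be sketched with the standard library alone. The bag-of-words `embed` function is a deliberately crude stand-in for a real embedding model, and the fixed `threshold` is an assumption chosen for this toy input; the structure (embed, compare neighbors, break where similarity drops) is what matters.

```python
import math
import re
from collections import Counter
from typing import List


def embed(sentence: str) -> Counter:
    """Toy embedding: word counts (a real system would use a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: List[str], threshold: float = 0.2) -> List[str]:
    """Start a new chunk wherever similarity to the previous sentence drops below threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) < threshold:
            groups.append([current])      # topic shift detected
        else:
            groups[-1].append(current)    # same topic, extend the current chunk
    return [" ".join(group) for group in groups]


sentences = [
    "Redis stores vectors in memory.",
    "Redis indexes vectors with HNSW.",
    "Bananas are rich in potassium.",
    "Potassium helps muscles and bananas taste sweet.",
]
result = semantic_chunks(sentences)
for chunk in result:
    print(chunk)
```

The two Redis sentences share vocabulary and stay together; the boundary falls where the text switches to bananas.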

In [None]:
# Strategy 3: Semantic Chunking (Using LangChain)
# Industry-standard approach with local embeddings (no API costs!)

from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings
import os

# Suppress tokenizer warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Initialize local embeddings (no API costs!)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

# Create semantic chunker with percentile-based breakpoint detection
semantic_chunker = SemanticChunker(
    embeddings=embeddings,
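    breakpoint_threshold_type="percentile",  # Threshold is a percentile of the distances between neighbors
    breakpoint_threshold_amount=25,  # Split wherever distance exceeds the 25th percentile (most gaps)
    buffer_size=1,  # Embed each sentence together with 1 neighbor on each side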
)

print("🔄 Running semantic chunking with LangChain...")
print("   Using local embeddings (sentence-transformers/all-MiniLM-L6-v2)")
print("   Breakpoint detection: 25th percentile of distances between neighboring sentences\n")

# Apply to research paper
semantic_chunks_docs = semantic_chunker.create_documents([research_paper])

# Extract text from Document objects
semantic_chunks = [doc.page_content for doc in semantic_chunks_docs]

print(f"""📊 Strategy 3: Semantic (LangChain) Chunking
{'=' * 80}
Original document: {paper_tokens:,} tokens
Number of chunks: {len(semantic_chunks)}

Chunk breakdown:
""")

for i, chunk in enumerate(semantic_chunks[:5]):  # Show first 5
    chunk_tokens = count_tokens(chunk)
    preview = chunk[:100].replace("\n", " ")
    print(f"   Chunk {i+1}: {chunk_tokens:,} tokens - {preview}...")

if len(semantic_chunks) > 5:
    print(f"... ({len(semantic_chunks) - 5} more chunks)")

**Strategy 3 Analysis:**

✅ **Advantages:**
- **Meaning-aware**: Chunks based on topic shifts, not arbitrary boundaries
- **Adaptive**: Chunk sizes vary based on content coherence
- **Better retrieval**: Each chunk is semantically focused
- **Free**: Uses local embeddings (no API costs)

⚠️ **Trade-offs:**
- Slower processing (requires embedding generation)
- Variable chunk sizes (harder to predict token usage)
- May not respect document structure (sections, headers)
- Requires tuning (threshold, buffer size)

🎯 **Best for:**
- Dense academic text
- Legal documents
- Narratives and stories
- Content where semantic boundaries don't align with structure

### Comparing Chunking Strategies: Decision Framework

Now let's compare all strategies side-by-side:

In [None]:
print(f"""
{'=' * 80}
CHUNKING STRATEGY COMPARISON
{'=' * 80}

Document: Research Paper ({paper_tokens:,} tokens)

Strategy              | Chunks | Avg Size | Complexity | Best For
--------------------- | ------ | -------- | ---------- | --------
Document-Based        | {len(structure_chunks):>6} | {sum(count_tokens(c) for c in structure_chunks) // len(structure_chunks):>8} | Low        | Structured docs
Fixed-Size            | {len(fixed_chunks):>6} | {sum(count_tokens(c) for c in fixed_chunks) // len(fixed_chunks):>8} | Low        | Unstructured text
Semantic              | {len(semantic_chunks):>6} | {sum(count_tokens(c) for c in semantic_chunks) // len(semantic_chunks):>8} | High       | Dense academic text

{'=' * 80}
""")

### YOUR Chunking Decision Framework

Chunking strategy is a **design choice** that depends on your specific context. There's no universal "correct" chunk size.

**Step 1: Start with Document Type**

| Document Type | Default Approach | Reasoning |
|---------------|------------------|----------|
| **Structured records** (courses, products, FAQs) | Don't chunk | Natural boundaries already exist |
| **Long-form text** (papers, books, docs) | Consider chunking | May need retrieval precision |
| **PDFs with visual layout** | Page-level | Preserves tables, figures |
| **Code** | Function/class boundaries | Semantic structure matters |

**Step 2: Evaluate These Factors**

1. **Semantic completeness:** Is each item self-contained?
   - ✅ Yes → Don't chunk (preserve natural boundaries)
   - ❌ No → Consider chunking strategy

2. **Query patterns:** What will users ask?
   - Specific facts → Smaller, focused chunks help
   - Summaries/overviews → Larger chunks or hierarchical
   - Mixed → Consider hierarchical approach

3. **Topic density:** How many distinct topics per document?
   - Single topic → Whole-document embedding often works
   - Multiple distinct topics → Chunking may improve precision

**Example Decisions:**

| Domain | Data Characteristics | Decision | Why |
|--------|---------------------|----------|-----|
| **Course Catalog** | Small, self-contained records | **Don't chunk** | Each course is a complete retrieval unit |
| **Research Papers** | Multi-section, dense topics | Document-Based | Sections are natural semantic units |
| **Support Tickets** | Single issue per ticket | **Don't chunk** | Already at optimal granularity |
| **Legal Contracts** | Nested structure, many clauses | Hierarchical | Need both overview and clause-level detail |

> 💡 **Key Takeaway:** Ask "What is my natural retrieval unit?" before deciding on a chunking strategy. For many structured data use cases, the answer is "don't chunk."
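
The framework can be written down as a small function, which makes the order of the questions explicit. This is one possible encoding of the tables above, not a rule set from any library; the function name, its parameters, and the tie-breaking order are assumptions.

```python
def choose_chunking_strategy(
    self_contained: bool,
    has_clear_sections: bool,
    topic_shifts_ignore_structure: bool = False,
) -> str:
    """Map the framework's questions to a starting strategy (a default, not a verdict)."""
    if self_contained:
        return "don't chunk"       # the record is already the retrieval unit
    if topic_shifts_ignore_structure:
        return "semantic"          # meaning boundaries do not follow headers
    if has_clear_sections:
        return "document-based"    # sections are natural semantic units
    return "fixed-size"            # unstructured text, good baseline


examples = {
    "course catalog": choose_chunking_strategy(self_contained=True, has_clear_sections=False),
    "research paper": choose_chunking_strategy(self_contained=False, has_clear_sections=True),
    "dense narrative": choose_chunking_strategy(False, False, topic_shifts_ignore_structure=True),
    "raw transcript": choose_chunking_strategy(self_contained=False, has_clear_sections=False),
}
for domain, strategy in examples.items():
    print(f"{domain}: {strategy}")
```

Treat the result as a first experiment to run, then measure retrieval quality on your own queries.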

---

## Summary and Key Takeaways

### What You Learned

**1. Data Modeling is the Foundation of RAG Quality**
- The critical first question: "What is my natural retrieval unit?"
- For structured records (courses, products, FAQs), the answer is often "don't chunk"
- For long-form documents (papers, books), chunking may improve retrieval precision

**2. The "Don't Chunk" Strategy is Valid**
- Course catalogs, product listings, FAQ entries don't need chunking
- Each record is already semantically complete and self-contained
- Chunking would fragment related information and hurt quality
- Use hierarchical patterns (summaries + details) instead

**3. When Chunking Helps**
- Long-form documents with multiple distinct topics
- Research papers, technical documentation, books, legal contracts
- Improves retrieval precision by reducing irrelevant context
- Research-backed: "Lost in the Middle", "Context Rot" show why structure matters

**4. Three Chunking Strategies**
- **Document-Based (Structure-Aware):** Split by sections/headers - best for structured documents
- **Fixed-Size (Token-Based):** Split into fixed chunks with overlap - best for unstructured text
- **Semantic (Meaning-Based):** Split based on topic shifts - best for dense academic text
- Choose based on YOUR data characteristics and query patterns

**5. The Engineering Mindset**
- Chunking is a design choice, not a default step
- Like database schema design, structure affects retrieval quality
- No one-size-fits-all solution - analyze YOUR data and requirements
- Experiment, measure, iterate

### Decision Framework

**Ask these questions:**

1. **What is my natural retrieval unit?**
   - Single record (course, product) → Don't chunk
   - Document section (paper, book) → Consider chunking

2. **What are my query patterns?**
   - "Show me CS courses" → Whole-record embedding
   - "What compression techniques?" → Section-level chunking

3. **How many distinct topics per document?**
   - Single topic → Whole-document embedding
   - Multiple topics → Chunking improves precision

**Example Decisions:**

| Domain | Data Type | Decision | Strategy |
|--------|-----------|----------|----------|
| **Course Catalog** | Structured records | Don't chunk | Hierarchical (summaries + details) |
| **Research Papers** | Multi-section documents | Chunk | Document-based (by section) |
| **Support Tickets** | Single-issue records | Don't chunk | Whole-record embedding |
| **Legal Contracts** | Multi-clause documents | Chunk | Hierarchical + document-based |

### The Key Insight

> **Chunking isn't about fitting in context windows - it's about data modeling for retrieval.**
>
> Just like you wouldn't store all customer data in one database row, you shouldn't embed all document content in one vector without thinking about retrieval patterns.

---

## What's Next?

### Module 4: Memory Systems for Context Engineering

Now that you understand data modeling and chunking for knowledge bases, you'll learn to manage conversation context:
- **Working Memory:** Track conversation history within a session
- **Long-term Memory:** Remember user preferences across sessions
- **Memory-Enhanced RAG:** Combine retrieved knowledge with conversation memory
- **Redis Agent Memory Server:** Automatic memory extraction and retrieval

```
Module 1: Context Engineering Fundamentals
    ↓
Module 2: RAG Fundamentals ← Completed
    ↓
Module 3: Chunking and Data Modeling ← You are here
    ↓
Module 4: Memory Systems ← Next
    ↓
Module 5: Building Agents (Complete System)
```

---

## Practice Exercises

### Exercise 1: Analyze Your Data
Think about a dataset you work with. Answer these questions:
1. What is the natural retrieval unit?
2. Does it need chunking? Why or why not?
3. If yes, which chunking strategy would you use?

### Exercise 2: Design a Chunking Strategy
For each document type, choose the best approach:
1. Product catalog with 1,000 items
2. 50-page technical manual with chapters
3. Customer support tickets (avg 200 words each)
4. Legal contracts (avg 20 pages, multiple clauses)

### Exercise 3: Experiment with Chunking
Take the research paper example and:
1. Try all three chunking strategies
2. Compare the number of chunks and average size
3. Which strategy would work best for queries about "HNSW configuration"?
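
A small helper can make the comparison in Exercise 3 concrete. The sketch below is a hedged starting point: `summarize_chunks` is a hypothetical function, size is measured in words rather than tokens, and the chunk lists are made-up stand-ins for the outputs of the strategies in this module.

```python
from statistics import mean
from typing import Dict, List


def summarize_chunks(chunks: List[str], phrase: str) -> Dict[str, float]:
    """Report chunk count, average size in words, and how many chunks mention a phrase."""
    sizes = [len(chunk.split()) for chunk in chunks]
    hits = sum(1 for chunk in chunks if phrase.lower() in chunk.lower())
    return {
        "count": len(chunks),
        "avg_words": mean(sizes) if sizes else 0,
        "chunks_with_phrase": hits,
    }


# Hypothetical chunk lists standing in for the outputs of two strategies
strategies = {
    "document-based": [
        "## Results\nHNSW with M=32 reaches 0.94 recall",
        "## Conclusion\nNo single best config",
    ],
    "fixed-size": ["HNSW with M=32", "reaches 0.94 recall", "No single best config"],
}

report = {name: summarize_chunks(chunks, "HNSW") for name, chunks in strategies.items()}
for name, stats in report.items():
    print(name, stats)
```

In the notebook, the same function could be applied to `structure_chunks`, `fixed_chunks`, and `semantic_chunks`.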

---

## Additional Resources

**Chunking Strategies:**
- [LangChain Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/)
- [LlamaIndex Node Parsers](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/)

**Research Papers:**
- ["Lost in the Middle" (arXiv:2307.03172)](https://arxiv.org/abs/2307.03172)
- ["Context Rot" (Chroma Research)](https://research.trychroma.com/context-rot)
- [Needle in the Haystack Benchmark](https://github.com/gkamradt/LLMTest_NeedleInAHaystack)

**Data Modeling for RAG:**
- [OpenAI Best Practices](https://platform.openai.com/docs/guides/prompt-engineering)
- [Anthropic Prompt Engineering](https://docs.anthropic.com/claude/docs/prompt-engineering)

**Vector Databases:**
- [Redis Vector Search Documentation](https://redis.io/docs/stack/search/reference/vectors/)
- [RedisVL Python Library](https://github.com/RedisVentures/redisvl)
