![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

# Module 3: Chunking and Data Modeling for RAG

## From Basic RAG to Production-Ready Knowledge Bases

In Module 2, you built a working RAG system with hierarchical search. Now you'll learn the critical engineering decisions that separate toy demos from production systems: **when and how to chunk your data**.

**The Critical Question:** Does my data need chunking?

This module teaches you that **chunking is a design choice, not a default step**. Just like database schema design, how you structure your knowledge base dramatically affects retrieval quality, token efficiency, and system performance.

## What You'll Learn

**1. The "Don't Chunk" Strategy:**
- When whole-document embedding is the right choice
- Why structured records (courses, products, FAQs) often don't need chunking
- How to recognize natural retrieval boundaries in your data

**2. When Chunking Helps:**
- Document types that benefit from chunking (research papers, long-form content)
- Research-backed insights: "Lost in the Middle", "Context Rot"
- How chunking improves retrieval precision

**3. Chunking Strategies:**
- Document-based (structure-aware): Split by sections/headers
- Fixed-size (token-based): Using LangChain's RecursiveCharacterTextSplitter
- Semantic (meaning-based): Using embeddings to detect topic shifts
- Trade-offs and decision framework

**4. Data Modeling for RAG:**
- The hierarchical pattern: summaries + details
- Engineering workflow: Extract → Clean → Transform → Optimize → Store
- Real-world examples with Redis University course catalog

**⏱️ Estimated Time:** 60-75 minutes

---

## Prerequisites

- Completed Module 2: RAG Fundamentals and Implementation
- Redis 8 running locally with course data loaded
- OpenAI API key set
- Understanding of vector embeddings and semantic search

---

## Setup

In [43]:
import os
import sys
from pathlib import Path

from dotenv import load_dotenv

# Handle both running from workshop/ directory and from project root
if Path.cwd().name == "workshop":
    project_root = Path.cwd().parent
else:
    project_root = Path.cwd()

if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Load environment variables from project root
env_path = project_root / ".env"
load_dotenv(dotenv_path=env_path)

# Verify required environment variables
required_vars = ["OPENAI_API_KEY"]
missing_vars = [var for var in required_vars if not os.getenv(var)]

if missing_vars:
    print(f"""⚠️  Missing required environment variables: {', '.join(missing_vars)}

Please create a .env file with:
OPENAI_API_KEY=your_openai_api_key
REDIS_URL=redis://localhost:6379
""")
    sys.exit(1)

REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379")
print("✅ Environment variables loaded")

✅ Environment variables loaded


In [44]:
import asyncio
import json
from typing import Any, Dict, List

import redis
import tiktoken
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

# Import hierarchical components (from Module 2)
from redis_context_course.hierarchical_manager import HierarchicalCourseManager
from redis_context_course.hierarchical_context import HierarchicalContextAssembler

# Initialize (one Redis client shared by the manager and the rest of the notebook)
redis_client = redis.from_url(REDIS_URL, decode_responses=True)
hierarchical_manager = HierarchicalCourseManager(redis_client=redis_client)
context_assembler = HierarchicalContextAssembler()
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Token counter
encoding = tiktoken.encoding_for_model("gpt-4o")


def count_tokens(text: str) -> int:
    return len(encoding.encode(text))


print("✅ Dependencies loaded")

✅ Dependencies loaded


## Part 1: Data Modeling - The Foundation of RAG Quality

### The Critical First Question: What is My Natural Retrieval Unit?

Before thinking about chunking, ask: **"What is the natural unit of information I want to retrieve?"**

This is similar to database design - you wouldn't store all customer data in one row, and you shouldn't embed all document content in one vector without thinking about retrieval patterns.

**Examples of Natural Retrieval Units:**

| Domain | Natural Unit | Why |
|--------|-------------|-----|
| **Course Catalog** | Individual course | Each course is self-contained, complete |
| **Product Catalog** | Individual product | All product info should be retrieved together |
| **FAQ Database** | Question + Answer pair | Q&A is an atomic unit |
| **Research Papers** | Section or paragraph | Different sections answer different queries |
| **Legal Contracts** | Clause or section | Need clause-level precision |
| **Support Tickets** | Individual ticket | Single issue with context |

Let's see this in practice with our course catalog:

### Example: Course Catalog - A Natural Retrieval Unit

Let's examine a single course to understand why it's already an optimal retrieval unit:

In [45]:
# Get a sample course to analyze using search
sample_courses = await hierarchical_manager.search_summaries(
    query="programming courses", limit=3
)
sample_course = sample_courses[0]  # Get first course

# Generate embedding text if not present
if not sample_course.embedding_text:
    sample_course.generate_embedding_text()

# Display the course summary
print(f"""📚 Sample Course: {sample_course.course_code}
{'=' * 80}
Title: {sample_course.title}
Department: {sample_course.department}
Level: {sample_course.difficulty_level.value}
Credits: {sample_course.credits}
Instructor: {sample_course.instructor}

Description:
{sample_course.short_description}

Prerequisites: {', '.join(sample_course.prerequisite_codes) if sample_course.prerequisite_codes else 'None'}
Tags: {', '.join(sample_course.tags) if sample_course.tags else 'None'}
{'=' * 80}

Token count: {count_tokens(sample_course.embedding_text)}
""")

11:11:27 httpx INFO   HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
11:11:27 redisvl.index.index INFO   Index already exists, not overwriting.
11:11:27 redis_context_course.hierarchical_manager INFO   Created summary index: course_summaries
11:11:27 redis_context_course.hierarchical_manager INFO   Found 3 course summaries for query: programming courses
📚 Sample Course: CS003
Title: Programming Fundamentals with C++
Department: Computer Science
Level: beginner
Credits: 3
Instructor: Angie Henderson

Description:
Core programming concepts using C++ for beginners.

Prerequisites: None
Tags: programming, c++, beginner, fundamentals, systems

Token count: 39



### Analysis: Why Courses Don't Need Chunking

**Semantic Completeness:** ✅ Each course is self-contained
- All information about the course is in one record
- No cross-references to other sections
- Natural boundary exists (one course = one retrieval unit)

**Query Patterns:** ✅ Users ask about specific courses or course types
- "What machine learning courses are available?"
- "Tell me about CS016"
- "What are the prerequisites for RU102JS?"

**Retrieval Precision:** ✅ Whole-course embedding maximizes relevance
- When a user asks about a course, they need ALL the information
- Splitting would fragment related information (e.g., separating prerequisites from description)
- Each course is already the optimal retrieval unit

**Token Efficiency:** ✅ Courses are small (the sample summary above embeds in just 39 tokens)
- Not too large (no wasted context)
- Not too small (no fragmentation)

**Decision:** ❌ **Don't chunk course data** - it's already optimally structured!

This is the **"don't chunk" strategy** - a valid and often optimal choice for structured records.

### The Hierarchical Pattern: A Better Data Model

Instead of chunking, we use a **hierarchical pattern** with two tiers:

**Tier 1: Summaries (Lightweight)**
- Searchable, compact course overviews
- Stored in vector index for fast retrieval
- Small enough to embed whole (the CS003 summary above is 39 tokens)

**Tier 2: Details (On-Demand)**
- Full course information with all fields
- Retrieved only when needed
- Stored as plain Redis keys (not in vector index)

This is **data modeling**, not chunking - we're structuring data for optimal retrieval patterns.

Let's see this in action:

In [46]:
# Hierarchical retrieval example
query = "beginner programming courses"

# Tier 1: Search summaries (fast, lightweight)
summaries, details = await hierarchical_manager.hierarchical_search(
    query=query,
    summary_limit=5,  # Get 5 summary matches
    detail_limit=3,   # Fetch full details for top 3
)

print(f"""🔍 Query: "{query}"
{'=' * 80}

📊 Tier 1: Summary Results (5 courses)
""")

for i, summary in enumerate(summaries, 1):
    print(f"{i}. {summary.course_code}: {summary.title} ({summary.difficulty_level})")

print(f"""
{'=' * 80}
📄 Tier 2: Detailed Information (top 3 courses)
""")

for detail in details:
    prereq_codes = [p.course_code for p in detail.prerequisites] if detail.prerequisites else []
    print(f"""
{detail.course_code}: {detail.title}
Department: {detail.department} | Credits: {detail.credits}
Prerequisites: {', '.join(prereq_codes) if prereq_codes else 'None'}

Description: {detail.full_description[:200]}...
""")

# Assemble context
context = context_assembler.assemble_hierarchical_context(summaries, details, query)
context_tokens = count_tokens(context)

print(f"""
{'=' * 80}
📊 Context Statistics:
- Summaries: 5 courses
- Details: 3 courses
- Total tokens: {context_tokens:,}
- Retrieval pattern: Hierarchical (summaries + details)
""")

11:11:27 redis_context_course.hierarchical_manager INFO   Hierarchical search: 'beginner programming courses' (summaries=5, details=3)
11:11:29 httpx INFO   HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
11:11:29 redis_context_course.hierarchical_manager INFO   Found 5 course summaries for query: beginner programming courses
11:11:29 redis_context_course.hierarchical_manager INFO   Fetched 3 course details
11:11:29 redis_context_course.hierarchical_manager INFO   Hierarchical search complete: 5 summaries, 3 details
🔍 Query: "beginner programming courses"

📊 Tier 1: Summary Results (5 courses)

1. CS001: Introduction to Programming with Python (DifficultyLevel.BEGINNER)
2. CS002: Web Development Fundamentals (DifficultyLevel.BEGINNER)
3. CS003: Programming Fundamentals with C++ (DifficultyLevel.BEGINNER)
4. CS012: Machine Learning Fundamentals (DifficultyLevel.ADVANCED)
5. CS006: Web Development (DifficultyLevel.INTERMEDIATE)

📄 Tier 2: Detailed Infor

**Key Takeaway:** For structured records like courses, the hierarchical pattern (summaries + details) is superior to chunking because it respects natural data boundaries and retrieval patterns.
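To make the two-tier idea concrete without Redis, here is a minimal in-memory sketch. It is not how `HierarchicalCourseManager` works internally: the dictionaries, the word-overlap `score` function, and the placeholder detail text are all assumptions made for illustration (the notebook itself uses a vector index for Tier 1).

```python
from typing import Dict, List, Tuple

# Tier 1: compact summaries (in the notebook these live in a vector index)
SUMMARIES: Dict[str, str] = {
    "CS001": "Introduction to Programming with Python, beginner",
    "CS003": "Programming Fundamentals with C++, beginner",
    "CS012": "Machine Learning Fundamentals, advanced",
}

# Tier 2: full records under plain keys, fetched only on demand
# (placeholder text, not the real course descriptions)
DETAILS: Dict[str, dict] = {
    code: {"course_code": code, "full_description": f"Full record for {code}"}
    for code in SUMMARIES
}


def score(query: str, text: str) -> int:
    """Toy relevance: how many query words appear in the summary text."""
    words = set(text.lower().replace(",", " ").split())
    return sum(1 for word in query.lower().split() if word in words)


def two_tier_search(
    query: str, summary_limit: int, detail_limit: int
) -> Tuple[List[str], List[dict]]:
    """Rank all summaries, then fetch details only for the best few."""
    ranked = sorted(
        SUMMARIES, key=lambda code: score(query, SUMMARIES[code]), reverse=True
    )
    top = ranked[:summary_limit]
    details = [DETAILS[code] for code in top[:detail_limit]]
    return top, details


top, details = two_tier_search("beginner programming", summary_limit=2, detail_limit=1)
print(top)
print(details)
```

The point of the design is that the expensive, verbose records never enter the search index; only the handful of courses that win the summary search are expanded.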

---

## Part 2: When Documents DO Need Chunking

Now let's look at a completely different type of data: **long-form documents** with multiple distinct topics.

### Example: Research Paper

Let's load a real research paper about semantic caching for LLMs:

In [47]:
# Load the actual research paper PDF
# Requires the pypdf package (pip install pypdf)
import pypdf

pdf_path = project_root / "data" / "arxiv_2504_02268.pdf"
reader = pypdf.PdfReader(pdf_path)

# Extract text from all pages
research_paper = ""
for page in reader.pages:
    research_paper += page.extract_text() + "\n"

paper_tokens = count_tokens(research_paper)
print(f"""📄 Real Research Paper
{'=' * 80}
Title: "Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data"
Authors: Waris Gill et al. (Redis & Virginia Tech, 2025)
Source: arXiv:2504.02268

Structure:
- Abstract
- Introduction
- Background and Related Work
- Methodology (Synthetic Data Generation)
- Evaluation and Results
- Conclusion

Pages: {len(reader.pages)}
Token count: {paper_tokens:,}
Characters: {len(research_paper):,}
{'=' * 80}
""")


### Analysis: Why This Research Paper NEEDS Chunking

Let's compare the course catalog (doesn't need chunking) with the research paper (does need chunking):

| Factor | Course Catalog | Research Paper |
|--------|---------------|----------------|
| **Document Structure** | Single topic per record | Multiple distinct sections |
| **Semantic Completeness** | Each course is self-contained | Sections cover different topics and types (text, formulas, charts, etc.) |
| **Query Patterns** | "Show me CS courses" | "How is synthetic data generated?" |
| **Optimal Retrieval Unit** | Whole course | Specific section |
| **Chunking Decision** | ❌ Don't chunk | ✅ Chunk by section |

**Why the research paper needs chunking:**

**1. Multiple Distinct Topics:**
- Abstract, Introduction, Methodology, Evaluation, Conclusion each cover different aspects
- A query about "synthetic data generation" only needs the Methodology section, not the entire paper

**2. Retrieval Precision:**
- Without chunking: Retrieve entire ~6,000-token paper for every query
- With chunking: Retrieve only the 300-500 token section that's relevant
- Result: roughly 92-95% less context per query (300-500 tokens instead of ~6,000)

**3. Query-Specific Needs:**

| Query | Needs | Without Chunking | With Chunking |
|-------|-------|------------------|---------------|
| "How is synthetic data generated?" | Methodology section | Entire paper (~6,000 tokens) | Methodology (~500 tokens) |
| "What were the hit rate results?" | Evaluation + Tables | Entire paper (~6,000 tokens) | Evaluation (~400 tokens) |
| "What embedding models were tested?" | Results section | Entire paper (~6,000 tokens) | Results (~300 tokens) |
| "What is semantic caching?" | Introduction + Background | Entire paper (~6,000 tokens) | Intro+Background (~600 tokens) |

**Impact:** Using the estimates in the table, each query carries 10-20x less context, which lowers token cost and latency and leaves less irrelevant text to distract the model.
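The estimates in the table can be checked with a few lines of arithmetic. This sketch only reuses the table's approximate token counts; `context_savings` is a hypothetical helper name, not part of the course package.

```python
from typing import Tuple


def context_savings(full_tokens: int, chunk_tokens: int) -> Tuple[float, float]:
    """Return (x-fold reduction, percent of context avoided)."""
    factor = full_tokens / chunk_tokens
    percent = 100 * (1 - chunk_tokens / full_tokens)
    return factor, percent


PAPER_TOKENS = 6_000  # approximate whole-paper size used in the table above

# Approximate section sizes, also taken from the table above
retrieved = {
    "Methodology": 500,
    "Evaluation": 400,
    "Results": 300,
    "Intro+Background": 600,
}

for section, tokens in retrieved.items():
    factor, percent = context_savings(PAPER_TOKENS, tokens)
    print(f"{section:<18} {factor:4.0f}x smaller, {percent:.0f}% less context")
```

Running the same calculation on your own documents, with real token counts from `count_tokens`, is a quick way to see whether chunking is worth the added complexity.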

**💡 Key Insight:** Chunking isn't about fitting in context windows - it's about **data modeling for retrieval**. Just like you wouldn't store all customer data in one database row, you shouldn't embed all document content in one vector when sections serve different purposes.

### Research Background: Why Chunking Matters

Even with large context windows (128K+ tokens), research shows that **how you structure context matters more than fitting everything in**.

**Key Research Findings:**

**1. "Lost in the Middle" (Stanford/UC Berkeley, 2023)** - [arXiv:2307.03172](https://arxiv.org/abs/2307.03172)
- LLMs show a **U-shaped performance curve**: accuracy is highest when the relevant information sits at the beginning or end of the context and degrades when it sits in the middle
- **Implication:** Chunking ensures relevant sections are retrieved and placed prominently

**2. "Context Rot" (Chroma Research, 2025)** - [research.trychroma.com/context-rot](https://research.trychroma.com/context-rot)
- Performance degrades as input length increases, even when relevant info is present
- **Distractor effect**: Irrelevant content actively hurts model performance
- **Implication:** Smaller, focused chunks reduce "distractor tokens"

**3. Needle in the Haystack (NIAH)** - [github.com/gkamradt/LLMTest_NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack)
- Models often fail to retrieve information buried in long context
- **Implication:** For structured data, NIAH is irrelevant: each record IS the needle

**The Takeaway:** These findings inform design decisions but don't prescribe universal rules. Structured records (courses, products) don't need chunking. Long-form documents (papers, books) benefit from chunking. Experiment with YOUR data.
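One practical way to act on the "Lost in the Middle" finding is to order retrieved chunks so the strongest matches sit at the edges of the prompt. The sketch below is an assumption about how you might do that, not something the course package provides; `reorder_for_edges` is a hypothetical helper, and the input is assumed to be sorted from most to least relevant.

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_for_edges(ranked: List[T]) -> List[T]:
    """Place the best-ranked items at the start and end, the weakest in the middle.

    `ranked` must be sorted from most relevant to least relevant.
    """
    front: List[T] = []
    back: List[T] = []
    for position, item in enumerate(ranked):
        # Alternate: ranks 1, 3, 5... fill the front; ranks 2, 4... fill the back
        if position % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so rank 2 ends up last, next to the question
    return front + back[::-1]


chunks_by_rank = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_edges(chunks_by_rank))
```

Whether this helps depends on the model and the task, so treat it as something to measure rather than a rule.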

---

## Part 3: Core Chunking Strategies

Now that we understand **when** to chunk (long-form documents with multiple topics) and **when not to** (structured records), let's explore **how** to chunk effectively.

There's no single "best" strategy - the optimal approach depends on YOUR data characteristics and query patterns.

We'll explore three core approaches with hands-on examples:

### Strategy 1: Document-Based Chunking (Structure-Aware)

**Concept:** Split documents based on their inherent structure (sections, paragraphs, headings).

**Best for:** Structured documents with clear logical divisions (research papers, technical docs, books).

In [None]:
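# Strategy 1: Document-Based Chunking
# Split research paper by sections (using markdown headers)
# Note: text extracted from a PDF usually has no markdown headers, so with the
# default separator it may come back as a single chunk; pass a separator that
# matches the headings that actually appear in your text.


def chunk_by_structure(text: str, separator: str = "\n## ") -> List[str]:
    """Split text by structural markers (e.g., markdown headers)."""
    header_prefix = separator.lstrip("\n")  # e.g. "## "

    # Split by headers
    sections = text.split(separator)

    # Clean and format chunks
    chunks = []
    for i, section in enumerate(sections):
        if section.strip():
            # Restore the header marker that split() removed (the first piece
            # is whatever precedes the first header, typically the title)
            chunk = header_prefix + section if i > 0 else section
            chunks.append(chunk.strip())

    return chunks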


# Apply to research paper
structure_chunks = chunk_by_structure(research_paper)

print(f"""📊 Strategy 1: Document-Based (Structure-Aware) Chunking
{'=' * 80}
Original document: {paper_tokens:,} tokens
Number of chunks: {len(structure_chunks)}

Chunk breakdown:
""")

for i, chunk in enumerate(structure_chunks):
    chunk_tokens = count_tokens(chunk)
    # Show first 300 chars of each chunk
    preview = chunk[:300].replace("\n", " ")
    print(f"   Chunk {i+1}: {chunk_tokens:,} tokens - {preview}...\n")

**Strategy 1 Analysis:**

✅ **Advantages:**
- Respects document structure (sections stay together)
- Semantically coherent (each chunk is a complete section)
- Easy to implement for structured documents
- **Keeps tables, formulas, and code WITH their context**

⚠️ **Trade-offs:**
- Variable chunk sizes (some sections longer than others)
- Requires documents to have clear structure
- May create chunks that are still too large

🎯 **Best for:**
- Research papers with clear sections
- Technical documentation with headers
- Books with chapters/sections
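Text extracted from a PDF usually carries no markdown `## ` markers, so a header-based split can return the whole paper as one chunk. A possible workaround is to detect heading lines with a regular expression. The sketch below assumes headings look like `Abstract`, `References`, or a numbered title such as `1 Introduction`; that pattern is a guess about typical paper layouts and will need adjusting for real extracted text. `chunk_by_headings` is a hypothetical helper.

```python
import re
from typing import List

# Assumed heading shapes: "Abstract", "References", or a numbered title such as
# "1 Introduction" / "3.2 Results". Adjust the pattern to your own documents.
HEADING_RE = re.compile(
    r"^(?:Abstract|References|\d+(?:\.\d+)*\.?[ \t]+[A-Z][^\n]{0,80})$",
    re.MULTILINE,
)


def chunk_by_headings(text: str) -> List[str]:
    """Split plain text so that each chunk starts at a detected heading line."""
    starts = [match.start() for match in HEADING_RE.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    # Keep any text before the first heading (title, authors) as its own chunk
    bounds = ([0] if starts[0] > 0 else []) + starts + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [chunk for chunk in chunks if chunk]


sample = (
    "A Paper Title\n"
    "Abstract\nWe study caching.\n"
    "1 Introduction\nCaching saves cost.\n"
    "2 Method\nWe fine-tune embeddings.\n"
)
for chunk in chunk_by_headings(sample):
    print(repr(chunk.splitlines()[0]))
```

A heuristic like this will misfire on body lines that happen to start with a number and a capitalized word, so inspect the resulting chunks before indexing them.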

### Strategy 2: Fixed-Size Chunking (Token-Based)

**Concept:** Split text into chunks of a predetermined size (e.g., 512 tokens) with overlap.

**Best for:** Unstructured text, quick prototyping, when you need consistent chunk sizes.

In [None]:
# Strategy 2: Fixed-Size Chunking (Using LangChain)
# Industry-standard approach with smart boundary detection

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create text splitter with smart boundary detection
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,  # Target chunk size in characters
    chunk_overlap=100,  # Overlap to preserve context
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],  # Try these in order
    is_separator_regex=False,
)

print("🔄 Running fixed-size chunking with LangChain...")
print("   Trying to split on: paragraphs → sentences → words → characters\n")

# Apply to research paper
fixed_chunks_docs = text_splitter.create_documents([research_paper])
fixed_chunks = [doc.page_content for doc in fixed_chunks_docs]

print(f"""📊 Strategy 2: Fixed-Size (LangChain) Chunking
{'=' * 80}
Original document: {paper_tokens:,} tokens
Target chunk size: 800 characters (~200 tokens)
Overlap: 100 characters
Number of chunks: {len(fixed_chunks)}

Chunk breakdown:
""")

for i, chunk in enumerate(fixed_chunks[:5]):  # Show first 5
    chunk_tokens = count_tokens(chunk)
    preview = chunk[:100].replace("\n", " ")
    print(f"   Chunk {i+1}: {chunk_tokens:,} tokens - {preview}...")

if len(fixed_chunks) > 5:
    print(f"... ({len(fixed_chunks) - 5} more chunks)")

**Strategy 2 Analysis:**

✅ **Advantages:**
- **Respects natural boundaries**: Tries paragraphs → sentences → words → characters
- Consistent chunk sizes (predictable token usage)
- Works on any text (structured or unstructured)
- **Doesn't split mid-sentence** (unless absolutely necessary)

⚠️ **Trade-offs:**
- Ignores document structure (doesn't understand sections)
- Can break semantic coherence (may split related content)
- Overlap creates redundancy (increases storage/cost)

🎯 **Best for:**
- Unstructured text (no clear sections)
- Quick prototyping and baselines
- When consistent chunk sizes are required
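The LangChain example above measures `chunk_size` in characters, while the concept is usually described in tokens. To show the underlying mechanics, here is a minimal sliding-window sketch over a token list. It uses whitespace-separated words as stand-in tokens, which is an assumption for readability; in the notebook you could pass `encoding.encode(text)` instead. `sliding_window` is a hypothetical helper.

```python
from typing import List


def sliding_window(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Cut tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return windows


words = "one two three four five six seven eight nine ten".split()
for window in sliding_window(words, chunk_size=4, overlap=1):
    print(" ".join(window))
```

Each window repeats the last token of the previous one, which is exactly the redundancy listed under the trade-offs: larger overlaps preserve more context across boundaries but store and embed more text.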

### Strategy 3: Semantic Chunking (Meaning-Based)

**Concept:** Split text based on semantic similarity using embeddings - create new chunks when topic changes significantly.

**How it works:**
1. Split text into sentences or paragraphs
2. Generate embeddings for each segment
3. Calculate similarity between consecutive segments
4. Create chunk boundaries where similarity drops (topic shift detected)

**Best for:** Dense academic text, legal documents, narratives where semantic boundaries don't align with structure.

In [None]:
# Strategy 3: Semantic Chunking (Using LangChain)
# Industry-standard approach with local embeddings (no API costs!)

from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings
import os

# Suppress tokenizer warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Initialize local embeddings (no API costs!)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

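# Create semantic chunker with percentile-based breakpoint detection.
# SemanticChunker thresholds on the cosine *distance* between neighboring
# sentence groups and splits wherever the distance exceeds the given percentile,
# so the 75th percentile splits at the 25% of gaps with the lowest similarity.
semantic_chunker = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75,  # 75th percentile of distances
    buffer_size=1,  # Embed each sentence together with 1 neighbor on each side
)

print("🔄 Running semantic chunking with LangChain...")
print("   Using local embeddings (sentence-transformers/all-MiniLM-L6-v2)")
print("   Breakpoint detection: 75th percentile of distances (lowest 25% of similarities)\n")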

# Apply to research paper
semantic_chunks_docs = semantic_chunker.create_documents([research_paper])

# Extract text from Document objects
semantic_chunks = [doc.page_content for doc in semantic_chunks_docs]

print(f"""📊 Strategy 3: Semantic (LangChain) Chunking
{'=' * 80}
Original document: {paper_tokens:,} tokens
Number of chunks: {len(semantic_chunks)}

Chunk breakdown:
""")

for i, chunk in enumerate(semantic_chunks[:5]):  # Show first 5
    chunk_tokens = count_tokens(chunk)
    preview = chunk[:100].replace("\n", " ")
    print(f"   Chunk {i+1}: {chunk_tokens:,} tokens - {preview}...")

if len(semantic_chunks) > 5:
    print(f"... ({len(semantic_chunks) - 5} more chunks)")

**Strategy 3 Analysis:**

✅ **Advantages:**
- **Meaning-aware**: Chunks based on topic shifts, not arbitrary boundaries
- **Adaptive**: Chunk sizes vary based on content coherence
- **Better retrieval**: Each chunk is semantically focused
- **Free**: Uses local embeddings (no API costs)

⚠️ **Trade-offs:**
- Slower processing (requires embedding generation)
- Variable chunk sizes (harder to predict token usage)
- May not respect document structure (sections, headers)
- Requires tuning (threshold, buffer size)

🎯 **Best for:**
- Dense academic text
- Legal documents
- Narratives and stories
- Content where semantic boundaries don't align with structure
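The four steps listed under "How it works" can be sketched without any embedding model. The version below swaps real embeddings for bag-of-words counts and uses a fixed similarity threshold, both of which are simplifications chosen so the example runs anywhere; it is not what `SemanticChunker` does internally. `embed`, `cosine`, and `semantic_split` are hypothetical helpers.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts (a real system would use a model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_split(sentences: List[str], threshold: float = 0.2) -> List[str]:
    """Start a new chunk whenever consecutive sentences are dissimilar."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) < threshold:
            groups.append([current])  # similarity dropped: topic shift
        else:
            groups[-1].append(current)
    return [" ".join(group) for group in groups]


sentences = [
    "Redis stores vectors in memory.",
    "Redis vectors support fast search in memory.",
    "Bananas are yellow fruit.",
    "Yellow bananas are sweet fruit.",
]
for chunk in semantic_split(sentences):
    print(chunk)
```

With real embeddings the similarity scores are far less clear-cut than in this toy input, which is why libraries pick the threshold from the distribution of scores (for example, a percentile) instead of a fixed number.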

### Comparing Chunking Strategies: Decision Framework

Now let's compare all strategies side-by-side:

In [None]:
print(f"""
{'=' * 80}
CHUNKING STRATEGY COMPARISON
{'=' * 80}

Document: Research Paper ({paper_tokens:,} tokens)

Strategy              | Chunks | Avg Size | Complexity | Best For
--------------------- | ------ | -------- | ---------- | --------
Document-Based        | {len(structure_chunks):>6} | {sum(count_tokens(c) for c in structure_chunks) // len(structure_chunks):>8} | Low        | Structured docs
Fixed-Size            | {len(fixed_chunks):>6} | {sum(count_tokens(c) for c in fixed_chunks) // len(fixed_chunks):>8} | Low        | Unstructured text
Semantic              | {len(semantic_chunks):>6} | {sum(count_tokens(c) for c in semantic_chunks) // len(semantic_chunks):>8} | High       | Dense academic text

{'=' * 80}
""")

### YOUR Chunking Decision Framework

Chunking strategy is a **design choice** that depends on your specific context. There's no universal "correct" chunk size.

**Step 1: Start with Document Type**

| Document Type | Default Approach | Reasoning |
|---------------|------------------|----------|
| **Structured records** (courses, products, FAQs) | Don't chunk | Natural boundaries already exist |
| **Long-form text** (papers, books, docs) | Consider chunking | May need retrieval precision |
| **PDFs with visual layout** | Page-level | Preserves tables, figures |
| **Code** | Function/class boundaries | Semantic structure matters |

**Step 2: Evaluate These Factors**

1. **Semantic completeness:** Is each item self-contained?
   - ✅ Yes → Don't chunk (preserve natural boundaries)
   - ❌ No → Consider chunking strategy

2. **Query patterns:** What will users ask?
   - Specific facts → Smaller, focused chunks help
   - Summaries/overviews → Larger chunks or hierarchical
   - Mixed → Consider hierarchical approach

3. **Topic density:** How many distinct topics per document?
   - Single topic → Whole-document embedding often works
   - Multiple distinct topics → Chunking may improve precision

**Example Decisions:**

| Domain | Data Characteristics | Decision | Why |
|--------|---------------------|----------|-----|
| **Course Catalog** | Small, self-contained records | **Don't chunk** | Each course is a complete retrieval unit |
| **Research Papers** | Multi-section, dense topics | Document-Based | Sections are natural semantic units |
| **Support Tickets** | Single issue per ticket | **Don't chunk** | Already at optimal granularity |
| **Legal Contracts** | Nested structure, many clauses | Hierarchical | Need both overview and clause-level detail |

> 💡 **Key Takeaway:** Ask "What is my natural retrieval unit?" before deciding on a chunking strategy. For many structured data use cases, the answer is "don't chunk."
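If you want the framework as a starting point in code, the sketch below encodes the factors from Step 2 as a small function. The three inputs and the order of the checks are one possible reading of the tables above, not a definitive rule set, and `recommend_strategy` is a hypothetical helper.

```python
def recommend_strategy(
    self_contained: bool, distinct_topics: int, has_clear_structure: bool
) -> str:
    """Suggest a default chunking approach from three coarse data traits."""
    # Self-contained or single-topic items are already good retrieval units
    if self_contained or distinct_topics <= 1:
        return "don't chunk"
    # Multi-topic documents with headings: split along the existing structure
    if has_clear_structure:
        return "document-based"
    # Multi-topic text without usable structure: size- or meaning-based splits
    return "fixed-size or semantic"


examples = {
    "course record": (True, 1, True),
    "research paper": (False, 6, True),
    "meeting transcript": (False, 4, False),
}
for name, traits in examples.items():
    print(f"{name:<20} -> {recommend_strategy(*traits)}")
```

Treat the output as a default to test against real queries, since query patterns can override any of these heuristics.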

---

## Part 4: Advanced Example - Research Paper with Multimodal Content

You've learned the three core chunking strategies. Now let's apply them to a **real-world research paper** and tackle a common challenge: **multimodal content** (tables, formulas, figures).

**The Challenge:** Research papers aren't just text - they contain:
- **Tables** with structured data
- **Formulas** with variable definitions
- **Figures** with visual patterns
- **Code** with implementation details

Standard text chunking can break these elements. Let's see how to handle them properly using the actual arXiv paper on semantic caching.

### Real-World Example: Chunking a Research Paper with Multimodal Content

Research papers contain heterogeneous content that requires specialized handling:
- **Text**: Paragraphs, sections, abstracts
- **Tables**: Structured data with captions
- **Figures**: Visual information with descriptions
- **Formulas**: Equations with variable definitions
- **Code**: Implementation examples

Let's apply our chunking strategies to the actual arXiv paper and see how to handle each content type.

**Paper:** ["Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data"](https://arxiv.org/abs/2504.02268)
**Authors:** Waris Gill et al. (Redis & Virginia Tech, 2025)
**Length:** 12 pages, ~42,000 characters

In [None]:
# Load the actual research paper PDF
import pypdf

pdf_path = project_root / "data" / "arxiv_2504_02268.pdf"
reader = pypdf.PdfReader(pdf_path)

# Extract text from all pages
full_text = ""
for page in reader.pages:
    full_text += page.extract_text() + "\n"

print(f"""✅ PDF loaded successfully:
  Pages: {len(reader.pages)}
  Characters: {len(full_text):,}
  File: {pdf_path.name}
""")

In [None]:
# Strategy 1: Page-based chunking (simplest approach for PDFs)
page_chunks = []
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    page_chunks.append({
        "page": i + 1,
        "content": text,
        "char_count": len(text)
    })

print("PAGE-BASED CHUNKING:")
print(f"Number of chunks: {len(page_chunks)}")
print(f"Average chunk size: {sum(c['char_count'] for c in page_chunks) // len(page_chunks):,} chars")
print(f"\nFirst 3 pages:")
for chunk in page_chunks[:3]:
    preview = chunk['content'][:100].replace('\n', ' ')
    print(f"  Page {chunk['page']}: {preview}...")

In [None]:
# Strategy 2: Fixed-size chunking with overlap (using full text)
from langchain_text_splitters import RecursiveCharacterTextSplitter

fixed_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " "]
)

fixed_chunks = fixed_splitter.split_text(full_text)
print("FIXED-SIZE CHUNKING (1000 chars, 150 overlap):")
print(f"Number of chunks: {len(fixed_chunks)}")
print(f"Average chunk size: {sum(len(c) for c in fixed_chunks) // len(fixed_chunks):,} chars")
print(f"\nFirst chunk preview:\n{fixed_chunks[0][:200]}...")

In [None]:
# Compare the approaches
# Averages are computed first: nesting double quotes inside a double-quoted
# f-string is a syntax error before Python 3.12.
page_avg = sum(c["char_count"] for c in page_chunks) // len(page_chunks)
fixed_avg = sum(len(c) for c in fixed_chunks) // len(fixed_chunks)

print("CHUNKING STRATEGY COMPARISON:")
print("=" * 70)
print(f"{'Strategy':<25} {'Chunks':<10} {'Avg Size':<15} {'Best For'}")
print("-" * 70)
print(f"{'Page-based':<25} {len(page_chunks):<10} {f'{page_avg:,} chars':<15} {'Preserves layout/tables'}")
print(f"{'Fixed-size (1000)':<25} {len(fixed_chunks):<10} {f'{fixed_avg:,} chars':<15} {'Uniform retrieval'}")
print()
print("RECOMMENDATION: For this PDF:")
print("  - Page-based: Best for preserving tables, figures, and layout")
print("  - Fixed-size: Better for semantic search across the full text")
print("  - In production: Combine both (page metadata + semantic chunks)")
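The last recommendation, combining page metadata with smaller chunks, can be sketched in plain Python. This version uses a simple character window instead of `RecursiveCharacterTextSplitter` so it runs without extra dependencies; the record fields (`page`, `start`, `content`) and the function name `chunk_pages` are assumptions for illustration.

```python
from typing import Dict, List


def chunk_pages(pages: List[str], chunk_size: int = 1000, overlap: int = 150) -> List[Dict]:
    """Split each page into overlapping character windows tagged with the page number."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    records = []
    for page_number, text in enumerate(pages, start=1):
        for start in range(0, max(len(text), 1), step):
            piece = text[start:start + chunk_size]
            if piece.strip():  # skip empty pages and whitespace-only windows
                records.append({"page": page_number, "start": start, "content": piece})
            if start + chunk_size >= len(text):
                break  # this window reached the end of the page
    return records


toy_pages = ["a" * 25, "b" * 5, ""]
for record in chunk_pages(toy_pages, chunk_size=10, overlap=2):
    print(record["page"], record["start"], len(record["content"]))
```

Because chunks never cross a page boundary here, every retrieved chunk can be traced back to a single page, which is useful when the answer depends on a table or figure that only survives in the original layout.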

### Handling Multimodal Content: Tables, Formulas, Figures

**The Challenge:** Standard text chunking can break tables, formulas, and figures. Let's extract and chunk these properly from our PDF.

In [None]:
# Extract and chunk a table from the PDF
# The paper contains Table 1 on page 6 comparing embedding models

import re

# Find table content in the text
table_pattern = r'(Table \d+:.*?)(?=\n\n[A-Z]|\nFigure|\n\d+\.|\Z)'
tables_found = re.findall(table_pattern, full_text, re.DOTALL)

if tables_found:
    table_chunk = {
        "content_type": "table",
        "text": tables_found[0][:500],  # First 500 chars
        "metadata": {
            "page": "6",
            "section": "Evaluation",
            "table_id": "Table 1"
        }
    }

    print("✅ TABLE CHUNKING EXAMPLE:")
    print("=" * 70)
    print(f"Content Type: {table_chunk['content_type']}")
    print(f"Metadata: {table_chunk['metadata']}")
    print(f"\nChunk Text:\n{table_chunk['text'][:300]}...")
    print("\n✅ Best Practice: Keep table WITH caption and surrounding context")
else:
    print("Table extraction pattern needs adjustment for this PDF")

In [None]:
# Extract and chunk formulas/equations
# The paper discusses contrastive loss functions

formula_pattern = r'(loss.*?=.*?(?:\n|$))'
formulas = re.findall(formula_pattern, full_text, re.IGNORECASE)

if formulas:
    # Find context around the formula
    formula_text = formulas[0]
    formula_idx = full_text.find(formula_text)
    context_start = max(0, formula_idx - 200)
    context_end = min(len(full_text), formula_idx + len(formula_text) + 200)

    formula_chunk = {
        "content_type": "formula",
        "text": full_text[context_start:context_end],
        "metadata": {
            "section": "Methodology",
            "formula_type": "contrastive_loss"
        }
    }

    print("\n✅ FORMULA CHUNKING EXAMPLE:")
    print("=" * 70)
    print(f"Content Type: {formula_chunk['content_type']}")
    print(f"Metadata: {formula_chunk['metadata']}")
    print(f"\nChunk Text:\n{formula_chunk['text'][:300]}...")
    print("\n✅ Best Practice: Keep formula WITH variable definitions and explanation")

In [None]:
# Extract and chunk figure descriptions
# The paper has multiple figures comparing model performance

figure_pattern = r'(Figure \d+:.*?)(?=\n\n[A-Z]|\nTable|\n\d+\.|\Z)'
figures = re.findall(figure_pattern, full_text, re.DOTALL)

if figures:
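    figure_text = figures[0]
    figure_chunk = {
        "content_type": "figure",
        "text": figure_text[:400],
        "metadata": {
            # Section and visual type are hard-coded for this particular paper
            "section": "Evaluation",
            # Read the ID from the caption instead of assuming "Figure 1"
            "figure_id": re.match(r"Figure \d+", figure_text).group(0),
            "visual_type": "bar_chart"
        }
    }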

    print("\n✅ FIGURE CHUNKING EXAMPLE:")
    print("=" * 70)
    print(f"Content Type: {figure_chunk['content_type']}")
    print(f"Metadata: {figure_chunk['metadata']}")
    print(f"\nChunk Text:\n{figure_chunk['text'][:300]}...")
    print("\n✅ Best Practice: Describe visual patterns in text, keep WITH caption")

In [None]:
# Summary: Multimodal chunking principles
print("\n" + "=" * 70)
print("MULTIMODAL CHUNKING PRINCIPLES:")
print("=" * 70)
print("""
1. **Tables**: Keep WITH caption and explanation
   - Preserve structure (markdown/HTML)
   - Add metadata: table_id, section, content_type

2. **Formulas**: Keep WITH variable definitions
   - Include surrounding context (±200 chars)
   - Preserve LaTeX if available

3. **Figures**: Describe visual patterns in text
   - Keep caption WITH discussion
   - Add metadata: figure_id, visual_type

4. **Code**: Keep WITH usage examples and context
   - Preserve syntax and comments
   - Include function/class definitions

5. **General Rule**: Context is king - never separate content from explanation
""")


### Advanced Topic: When Chunking Isn't Enough - Legal Contracts

**Note:** Some document types require approaches beyond chunking. Legal contracts are a prime example.

**Why Legal Documents Are Different:**

Legal contracts need data engineering that goes beyond splitting text, because the meaning of a clause often depends on other clauses:

**Key Challenges:**
1. **Clause-level granularity** with hierarchical numbering (Section 3.2.1)
2. **Cross-references** between clauses ("as defined in Section 1.5...")
3. **Hierarchical dependencies** (amendments modify earlier provisions)
4. **Legal precedence** ("Notwithstanding Section 2.1..." creates overrides)

**What This Requires:**

Simple chunking is insufficient. You need:
- **Knowledge graphs** to capture clause relationships
- **Recursive retrieval** to fetch referenced clauses
- **Metadata enrichment** (clause type, parties, dates, jurisdiction)

**Example Retrieval Flow:**
```
Query: "What are the payment terms?"

1. Retrieve: Clause 3.2 (Payment Terms)
2. Detect reference: "as defined in Section 1.5"
3. Fetch: Clause 1.5 (Definitions: "Net 30")
4. Detect modification: Clause 8.1 modifies 3.2
5. Fetch: Clause 8.1 (Amendment: "Net 45 for Q4")
6. Assemble: [3.2 + 1.5 + 8.1] with relationship metadata
```
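
The flow above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production design: the `clauses` dictionary, the `modified_by` edge map, and the regex for detecting references are all hypothetical stand-ins for a real clause store and knowledge graph.

```python
import re

# Hypothetical clause store keyed by clause number
clauses = {
    "3.2": "Payment Terms: invoices are due as defined in Section 1.5.",
    "1.5": 'Definitions: "Net 30" means payment within 30 days of invoice.',
    "8.1": "Amendment: Net 45 applies for Q4 invoices.",
}
# Hypothetical graph edges: clause -> clauses that modify it
modified_by = {"3.2": ["8.1"]}

# Matches references such as "Section 1.5" or "Section 3.2.1"
REF_PATTERN = re.compile(r"Section (\d+(?:\.\d+)*)")


def retrieve_with_dependencies(clause_id, max_depth=3):
    """Collect a clause plus everything it references or is modified by."""
    ordered, seen = [], set()

    def visit(cid, depth, relation):
        # Skip unknown clauses, cycles, and overly deep chains
        if cid in seen or cid not in clauses or depth > max_depth:
            return
        seen.add(cid)
        ordered.append({"clause": cid, "relation": relation, "text": clauses[cid]})
        for ref in REF_PATTERN.findall(clauses[cid]):
            visit(ref, depth + 1, f"referenced_by:{cid}")
        for mod in modified_by.get(cid, []):
            visit(mod, depth + 1, f"modifies:{cid}")

    visit(clause_id, 0, "seed")
    return ordered


result = retrieve_with_dependencies("3.2")
```

The `seen` set and `max_depth` guard matter in practice: contracts can contain circular references, and an unbounded traversal would pull far more text into the context than the query needs.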

**Recommendation:** This is a **research-level problem** requiring domain expertise. For production systems:
- Start with clause-level chunking as baseline
- Build knowledge graphs for relationships (Neo4j, etc.)
- Implement recursive retrieval for dependencies
- Consider specialized legal NLP tools (LexNLP, Blackstone)

**Resources:** [Multi-Graph Multi-Agent Systems](https://medium.com/enterprise-rag/legal-document-rag-multi-graph-multi-agent-recursive-retrieval-through-legal-clauses-c90e073e0052), [GraphRAG for Contracts](https://neo4j.com/blog/developer/agentic-graphrag-for-commercial-contracts/)


---

## Part 5: Troubleshooting Chunking

**Common Failure Patterns and Solutions:**

| Problem | Likely Cause | Solution |
|---------|--------------|----------|
| Tables split across chunks | Fixed-size chunking | Use structure-aware chunking |
| Formulas without context | Naive chunking | Keep formulas with explanations |
| Missing cross-references | Single-chunk retrieval | Implement recursive retrieval |
| Generic answers | Chunks too large | Reduce chunk size or use semantic chunking |
| Incomplete answers | Chunks too small | Increase chunk size or add overlap |

**Iterative Process:** Start simple → Measure baseline → Identify failures → Test improvements → Iterate
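
For the "Incomplete answers" row, adding overlap is often the cheapest fix to try first. The sketch below shows what overlap does mechanically, using a plain word-based splitter written for illustration. The function name `chunk_words` and its parameters are assumptions; it counts words rather than tokens and is not the LangChain splitter used earlier in the module.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word windows where consecutive chunks share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Hypothetical 100-word document: "w0 w1 ... w99"
demo_text = " ".join(f"w{i}" for i in range(100))
demo_chunks = chunk_words(demo_text, chunk_size=40, overlap=10)
```

A sentence that straddles a boundary now appears whole in at least one chunk, at the cost of storing and embedding the shared words twice.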

---

## Summary and Key Takeaways

### The Key Insight

> **Chunking isn't about fitting in context windows - it's about data modeling for retrieval.**

### Decision Framework

| Question | Answer | Strategy |
|----------|--------|----------|
| **What is my natural retrieval unit?** | Single record (course, product, FAQ) | Don't chunk - use hierarchical patterns |
| | Long-form document (paper, book) | Chunk by sections or semantically |
| | Legal contract with cross-references | Advanced: knowledge graphs + recursive retrieval |
| **How many topics per document?** | Single topic | Whole-document embedding |
| | Multiple distinct topics | Chunking improves precision |
| **What content types?** | Text-only | Standard chunking strategies |
| | Multimodal (tables, figures) | Keep content WITH context |
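
The table can also be read as a small rule set. The function below is one possible encoding, written for illustration only: the name `recommend_strategy`, its parameters, and the order in which the rules are checked are assumptions, and real decisions should still be validated by measuring retrieval quality.

```python
def recommend_strategy(retrieval_unit, distinct_topics=1,
                       has_cross_references=False, multimodal=False):
    """Map the decision-framework questions to a suggested strategy and notes."""
    if has_cross_references:
        strategy = "knowledge graph + recursive retrieval"
    elif retrieval_unit == "record" or distinct_topics <= 1:
        strategy = "no chunking (whole-document embedding)"
    else:
        strategy = "chunk by sections or semantically"

    notes = []
    if multimodal:
        notes.append("keep tables and figures with their context")
    return strategy, notes


course_record = recommend_strategy("record")
research_paper = recommend_strategy("document", distinct_topics=6, multimodal=True)
contract = recommend_strategy("document", distinct_topics=12, has_cross_references=True)
```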

### Core Strategies

1. **Document-Based:** Split by sections/headers - best for structured documents
2. **Fixed-Size:** Split into fixed chunks with overlap - best for unstructured text
3. **Semantic:** Split based on topic shifts - best for dense academic text

**Remember:** Experiment, measure, iterate. This is engineering, not magic.
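
As a reminder of what the document-based strategy looks like in code, here is a minimal sketch that assumes Markdown-style `#` headers mark section boundaries. The function name `split_by_headers` and the sample document are illustrative; a PDF such as the research paper used in this module would need a different boundary pattern.

```python
import re

HEADER_PATTERN = re.compile(r"^(#{1,6})\s+(.*)")


def split_by_headers(markdown_text):
    """Return one chunk per header-delimited section, tagged with its title."""
    sections = []
    title, lines = "Preamble", []

    def flush():
        # Only emit sections that contain some non-blank body text
        if any(line.strip() for line in lines):
            sections.append({"section": title, "text": "\n".join(lines).strip()})

    for line in markdown_text.splitlines():
        match = HEADER_PATTERN.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections


# Hypothetical sample document
doc = "# Intro\nChunking is a design choice.\n\n## Method\nSplit by headers.\nKeep titles as metadata.\n"
sections = split_by_headers(doc)
```

Keeping the section title as metadata lets retrieval filter or boost by section without re-parsing the source document.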

---

## What's Next?

### Module 4: Memory Systems for Context Engineering

Now that you understand data modeling and chunking for knowledge bases, you'll learn to manage conversation context:
- **Working Memory:** Track conversation history within a session
- **Long-term Memory:** Remember user preferences across sessions
- **Memory-Enhanced RAG:** Combine retrieved knowledge with conversation memory
- **Redis Agent Memory Server:** Automatic memory extraction and retrieval

```
Module 1: Context Engineering Fundamentals
    ↓
Module 2: RAG Fundamentals ← Completed
    ↓
Module 3: Chunking and Data Modeling ← You are here
    ↓
Module 4: Memory Systems ← Next
    ↓
Module 5: Building Agents (Complete System)
```

---

## Practice Exercises

### Exercise 1: Analyze Your Data
Think about a dataset you work with. Answer these questions:
1. What is the natural retrieval unit?
2. Does it need chunking? Why or why not?
3. If yes, which chunking strategy would you use?

### Exercise 2: Design a Chunking Strategy
For each document type, choose the best approach:
1. Product catalog with 1,000 items
2. 50-page technical manual with chapters
3. Customer support tickets (avg 200 words each)
4. Legal contracts (avg 20 pages, multiple clauses)

### Exercise 3: Experiment with Chunking
Take the research paper example and:
1. Try all three chunking strategies
2. Compare the number of chunks and average size
3. Which strategy would work best for queries about "semantic caching methodology"?
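
For step 2, a small helper like the one below can make the comparison concrete. It is a sketch that assumes each strategy produces a list of chunk strings and measures size in whitespace-separated words rather than tokens; the name `chunk_stats` is hypothetical.

```python
from statistics import mean


def chunk_stats(chunks):
    """Summarize a list of chunk strings by count and word-length spread."""
    sizes = [len(chunk.split()) for chunk in chunks]
    if not sizes:
        return {"count": 0, "avg_words": 0, "min_words": 0, "max_words": 0}
    return {
        "count": len(sizes),
        "avg_words": mean(sizes),
        "min_words": min(sizes),
        "max_words": max(sizes),
    }


# Hypothetical output of one strategy
example_stats = chunk_stats(["a b c", "d e"])
```

Run it once per strategy and compare the dictionaries side by side; a large gap between `min_words` and `max_words` usually signals uneven chunks worth inspecting by hand.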

---

## Additional Resources

**Chunking Strategies:**
- [LangChain Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/)
- [LlamaIndex Node Parsers](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/)

**Research Papers:**
- ["Lost in the Middle" (arXiv:2307.03172)](https://arxiv.org/abs/2307.03172) - U-shaped performance curve: LLMs use information best when it appears at the start or end of the context
- ["Context Rot" (Chroma Research, 2025)](https://research.trychroma.com/context-rot) - Performance degradation with input length
- [Needle in a Haystack Benchmark](https://github.com/gkamradt/LLMTest_NeedleInAHaystack) - Retrieval in long contexts
- ["Contextual Retrieval" (Anthropic, 2024)](https://www.anthropic.com/news/contextual-retrieval) - 49-67% reduction in retrieval failures
- ["Advancing Semantic Caching for LLMs" (arXiv:2504.02268)](https://arxiv.org/abs/2504.02268) - Redis/Virginia Tech research
- ["VoxRAG" (arXiv:2505.17326, 2025)](https://arxiv.org/abs/2505.17326) - Transcription-free RAG with silence-aware chunking

**Advanced Topics:**
- [Multi-Graph Multi-Agent Systems for Legal Documents (Medium, 2024)](https://medium.com/enterprise-rag/legal-document-rag-multi-graph-multi-agent-recursive-retrieval-through-legal-clauses-c90e073e0052)
- [GraphRAG for Commercial Contracts (Neo4j, 2024)](https://neo4j.com/blog/developer/agentic-graphrag-for-commercial-contracts/)

**Data Modeling for RAG:**
- [OpenAI Best Practices](https://platform.openai.com/docs/guides/prompt-engineering)
- [Anthropic Prompt Engineering](https://docs.anthropic.com/claude/docs/prompt-engineering)

**Vector Databases:**
- [Redis Vector Search Documentation](https://redis.io/docs/stack/search/reference/vectors/)
- [RedisVL Python Library](https://github.com/RedisVentures/redisvl)
