# Chunking Optimization: Finding the Sweet Spot

**Learning Objectives:**
- Compare fixed-size, semantic, and contextual chunking strategies
- Measure impact of chunk size on retrieval quality
- Understand trade-offs between chunk granularity and context
- Implement a simplified version of Anthropic's contextual retrieval

**Execution Time:** <5 minutes (DEMO mode), ~12 minutes (FULL mode)  
**Cost Estimate:** $0.20 (DEMO), $1.00 (FULL)

---

## Setup

In [None]:
import sys
sys.path.append('..')

import json
import numpy as np
from typing import List, Dict
from backend.semantic_retrieval import generate_embeddings, build_vector_index, semantic_search
from backend.context_judges import ContextPrecisionJudge, ContextRecallJudge
import os

# Set execution mode
MODE = os.getenv("EXECUTION_MODE", "DEMO")
print(f"🔧 Execution Mode: {MODE}")

# Load recipe data
with open('../homeworks/hw4/data/processed_recipes.json', 'r') as f:
    recipes = json.load(f)

SAMPLE_SIZE = 30 if MODE == "DEMO" else 100
recipes = recipes[:SAMPLE_SIZE]

print(f"📚 Loaded {len(recipes)} recipes")

## Chunking Strategies Implementation

In [None]:
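def fixed_size_chunking(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size word chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Require chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    chunks = []
    stride = chunk_size - overlap  # Distance between consecutive chunk starts

    for i in range(0, len(words), stride):
        chunks.append(' '.join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # This chunk reached the end; another window would only repeat its tail

    return chunks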


def semantic_chunking(text: str, max_chunk_size: int = 300) -> List[str]:
    """Split text at natural boundaries (periods, newlines)."""
    # Split by periods and newlines; drop trailing periods so re-joining does not double them
    sentences = [s.strip().rstrip('.') for s in text.replace('\n', '. ').split('. ')]
    sentences = [s for s in sentences if s]
    
    chunks = []
    current_chunk = []
    current_size = 0
    
    for sentence in sentences:
        sentence_size = len(sentence.split())
        
        if current_size + sentence_size > max_chunk_size and current_chunk:
            chunks.append('. '.join(current_chunk) + '.')
            current_chunk = [sentence]
            current_size = sentence_size
        else:
            current_chunk.append(sentence)
            current_size += sentence_size
    
    if current_chunk:
        chunks.append('. '.join(current_chunk) + '.')
    
    return chunks


def contextual_chunking(text: str, document_context: str, chunk_size: int = 200) -> List[str]:
    """Prepend a document-level label to each chunk (a simplified take on Anthropic's contextual retrieval)."""
    base_chunks = fixed_size_chunking(text, chunk_size=chunk_size, overlap=50)
    
    # Prepend context to each chunk
    contextual_chunks = [
        f"[Document: {document_context}] {chunk}"
        for chunk in base_chunks
    ]
    
    return contextual_chunks


print("✅ Chunking strategies defined")
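
The stride arithmetic behind overlapping windows can be sketched on word offsets alone. `window_spans` is a hypothetical helper written for this illustration, not part of the notebook's backend; it mirrors a plain `range(0, n_words, stride)` loop and reports how many words end up being embedded in total.

```python
from typing import List, Tuple


def window_spans(n_words: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) word offsets produced by a plain stride loop."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Require chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # Distance between consecutive window starts
    return [(start, min(start + chunk_size, n_words)) for start in range(0, n_words, stride)]


spans = window_spans(n_words=500, chunk_size=200, overlap=50)
print(spans)  # [(0, 200), (150, 350), (300, 500), (450, 500)]

# Total words sent to the embedding model, relative to the document length
embedded_words = sum(end - start for start, end in spans)
print(f"Overhead: {embedded_words / 500:.2f}x")  # Overhead: 1.30x
```

Two things stand out in this example. Overlap makes the corpus about 30% larger than the source text, and the last window `(450, 500)` lies entirely inside the previous one `(300, 500)`, so a plain stride loop can emit a final chunk that adds no new words.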

## Create Chunked Corpora

In [None]:
# Prepare document texts
documents = []
for recipe in recipes:
    doc = f"{recipe['name']}. {recipe['description']}"
    documents.append({
        "text": doc,
        "name": recipe['name'],
        "ingredients": recipe['ingredients'][:10]
    })

# Test different chunking strategies
chunking_configs = [
    {"name": "Fixed-100", "strategy": "fixed", "chunk_size": 100, "overlap": 25},
    {"name": "Fixed-200", "strategy": "fixed", "chunk_size": 200, "overlap": 50},
    {"name": "Fixed-400", "strategy": "fixed", "chunk_size": 400, "overlap": 100},
    {"name": "Semantic-300", "strategy": "semantic", "max_chunk_size": 300},
    {"name": "Contextual-200", "strategy": "contextual", "chunk_size": 200},
]

if MODE == "DEMO":
    chunking_configs = chunking_configs[:3]  # Test only fixed-size variants

print(f"🔍 Testing {len(chunking_configs)} chunking strategies")

In [None]:
# Create chunked versions
chunked_corpora = {}

for config in chunking_configs:
    all_chunks = []
    chunk_metadata = []  # Track which document each chunk belongs to
    
    for doc_idx, doc in enumerate(documents):
        if config["strategy"] == "fixed":
            chunks = fixed_size_chunking(
                doc["text"],
                chunk_size=config["chunk_size"],
                overlap=config["overlap"]
            )
        elif config["strategy"] == "semantic":
            chunks = semantic_chunking(
                doc["text"],
                max_chunk_size=config["max_chunk_size"]
            )
        elif config["strategy"] == "contextual":
            context = f"Recipe: {doc['name']}"
            chunks = contextual_chunking(
                doc["text"],
                document_context=context,
                chunk_size=config["chunk_size"]
            )
        
        for chunk in chunks:
            all_chunks.append(chunk)
            chunk_metadata.append(doc_idx)
    
    chunked_corpora[config["name"]] = {
        "chunks": all_chunks,
        "metadata": chunk_metadata,
        "config": config
    }
    
    print(f"  {config['name']}: {len(all_chunks)} chunks (avg: {len(all_chunks)/len(documents):.1f} chunks/doc)")

print("\n✅ All chunked corpora created")

## Build Vector Indices

In [None]:
print("🔄 Building vector indices (this may take 2-5 minutes)...\n")

indices = {}

for name, corpus in chunked_corpora.items():
    print(f"  Building index for {name}...")
    embeddings = generate_embeddings(corpus["chunks"])
    index = build_vector_index(embeddings)
    indices[name] = {
        "index": index,
        "embeddings": embeddings,
        "chunks": corpus["chunks"],
        "metadata": corpus["metadata"]
    }

print("\n✅ All indices built")

## Evaluation: Retrieval Quality

In [None]:
# Test queries
test_queries = [
    "creamy pasta with cheese",
    "healthy vegetarian dinner",
    "quick chocolate dessert",
    "comfort food for winter",
]

if MODE == "DEMO":
    test_queries = test_queries[:2]

print(f"🔍 Testing {len(test_queries)} queries across {len(chunking_configs)} chunking strategies")

In [None]:
# Initialize judges
precision_judge = ContextPrecisionJudge()

results = []
k = 5  # Top-5 retrieval

for query_idx, query in enumerate(test_queries):
    print(f"\n{'='*60}")
    print(f"Query {query_idx+1}/{len(test_queries)}: '{query}'")
    print(f"{'='*60}")
    
    query_embedding = generate_embeddings([query])[0]
    
    for strategy_name, index_data in indices.items():
        # Retrieve top-k chunks
        search_results = semantic_search(query_embedding, index_data["index"], k=k)
        retrieved_chunks = [index_data["chunks"][idx] for idx, score in search_results]
        
        # Evaluate precision
        precision_eval = precision_judge.evaluate(query, retrieved_chunks)
        
        # Calculate diversity (how many unique documents retrieved)
        retrieved_doc_ids = [index_data["metadata"][idx] for idx, score in search_results]
        diversity = len(set(retrieved_doc_ids)) / k
        
        print(f"  {strategy_name:20s} Precision: {precision_eval['precision']:.2f}, Diversity: {diversity:.2f}")
        
        results.append({
            "query": query,
            "strategy": strategy_name,
            "precision": precision_eval['precision'],
            "diversity": diversity,
            "top_chunk": retrieved_chunks[0][:100] + "..."
        })
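
The diversity score used above can be isolated into a small helper and checked on toy data. This is a sketch under two assumptions: search results are `(chunk_index, score)` pairs, as in the loop above, and the score is normalized by the number of hits actually returned rather than by `k`, so a short result list is not penalized. `diversity_at_k` is a hypothetical name.

```python
from typing import Sequence, Tuple


def diversity_at_k(search_results: Sequence[Tuple[int, float]], chunk_to_doc: Sequence[int]) -> float:
    """Fraction of retrieved chunks that come from distinct documents."""
    if not search_results:
        return 0.0
    doc_ids = [chunk_to_doc[chunk_idx] for chunk_idx, _score in search_results]
    return len(set(doc_ids)) / len(doc_ids)


# Toy data: chunks 0-2 belong to document 0, chunk 3 to document 1, chunk 4 to document 2
chunk_to_doc = [0, 0, 0, 1, 2]
hits = [(0, 0.91), (1, 0.88), (3, 0.80), (2, 0.79), (4, 0.75)]
print(diversity_at_k(hits, chunk_to_doc))  # 3 distinct documents out of 5 hits -> 0.6
```

A score of 1.0 means every hit came from a different document; values near `1/k` mean one document dominates the result list, which is the pattern to expect when small chunks split a single long document into many near-duplicates.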

## Analysis: Find Optimal Chunk Size

In [None]:
import pandas as pd

# Create summary dataframe
df = pd.DataFrame(results)

# Aggregate by strategy
summary = df.groupby('strategy').agg({
    'precision': 'mean',
    'diversity': 'mean'
}).reset_index()

summary.columns = ['Strategy', 'Avg Precision', 'Avg Diversity']
summary = summary.sort_values('Avg Precision', ascending=False)

print("\n" + "="*60)
print("📊 CHUNKING STRATEGY COMPARISON")
print("="*60)
print(summary.to_string(index=False))

# Find best strategy
best_strategy = summary.iloc[0]
print(f"\n🏆 Best Strategy: {best_strategy['Strategy']}")
print(f"   Precision: {best_strategy['Avg Precision']:.3f}")
print(f"   Diversity: {best_strategy['Avg Diversity']:.3f}")

## Key Insights

**Trade-offs:**

1. **Small Chunks (100-150 words)**
   - ✅ High precision: Retrieves focused content
   - ❌ Low diversity: Multiple chunks from same document
   - ⚠️ Risk: May lose important context

2. **Medium Chunks (200-300 words)**
   - ✅ Balanced precision and context
   - ✅ Good diversity across documents
   - 🎯 **Recommended for most RAG applications**

3. **Large Chunks (400+ words)**
   - ✅ Maximum context preservation
   - ❌ Lower precision: Includes irrelevant content
   - ⚠️ Each retrieved chunk consumes more of the LLM context window

4. **Semantic Chunking**
   - ✅ Respects natural boundaries
   - ✅ Better readability
   - ⚡ Variable chunk sizes

5. **Contextual Retrieval (Anthropic)**
   - ✅ Prepends document-level context to each chunk (here, a static `Recipe: <name>` label)
   - ✅ Helps retrieval tell apart similar chunks from different documents
   - 💰 Higher embedding costs (longer chunks)

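**Recommendation:** Start with **~200-word chunks** split at sentence boundaries, with **~50-word overlap**, then confirm the choice against the comparison table above. Note that `semantic_chunking` as defined in this notebook does not add overlap.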
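
On item 5: `contextual_chunking` in this notebook prepends the same static label to every chunk. Anthropic's published contextual retrieval technique instead asks an LLM to write a short context specific to each chunk, given the full document, and prepends that before embedding. A minimal sketch of that flow follows. The prompt wording is an illustrative paraphrase, `contextualize_chunks` and `stub_llm` are hypothetical names, and the LLM is passed in as a plain callable so the example runs offline.

```python
from typing import Callable, List

# Illustrative prompt, paraphrasing the idea; adjust the wording for your model
CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Give a short, succinct context that situates this chunk within the overall "
    "document to improve search retrieval of the chunk. Answer only with the context."
)


def contextualize_chunks(
    document: str,
    chunks: List[str],
    generate_context: Callable[[str], str],
) -> List[str]:
    """Prepend an LLM-written, chunk-specific context to each chunk before embedding."""
    contextualized = []
    for chunk in chunks:
        prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
        context = generate_context(prompt).strip()
        # Fall back to the bare chunk if the model returns nothing usable
        contextualized.append(f"{context}\n\n{chunk}" if context else chunk)
    return contextualized


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (hypothetical); always returns the same context."""
    return "From a recipe for baked macaroni and cheese."


doc = "Baked Macaroni and Cheese. Boil the pasta. Stir in the cheese sauce and bake."
chunks = ["Boil the pasta.", "Stir in the cheese sauce and bake."]
for item in contextualize_chunks(doc, chunks, stub_llm):
    print(repr(item))
```

This design makes one LLM call per chunk on top of the longer embeddings, which is the main source of extra cost compared with the static-label variant.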

## Save Results

In [None]:
# Save results for dashboard
output = {
    "mode": MODE,
    "sample_size": SAMPLE_SIZE,
    "strategies_tested": len(chunking_configs),
    "queries_tested": len(test_queries),
    "results": results,
    "summary": summary.to_dict(orient='records'),
    "best_strategy": {
        "name": best_strategy['Strategy'],
        "precision": float(best_strategy['Avg Precision']),
        "diversity": float(best_strategy['Avg Diversity'])
    }
}

os.makedirs('results', exist_ok=True)
with open('results/chunking_comparison.json', 'w') as f:
    json.dump(output, f, indent=2)

print("\n✅ Results saved to: lesson-12/results/chunking_comparison.json")

---

## Next Steps

1. **Advanced Chunking:** Implement recursive text splitting (LangChain `RecursiveCharacterTextSplitter`)
2. **Context Window Aware:** Test chunk sizes against your LLM's context window
3. **Domain-Specific:** Tune chunk size for your specific document types
4. **Evaluation at Scale:** Run FULL mode, then raise `SAMPLE_SIZE` above its default of 100 to test 200+ documents
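
The idea behind recursive splitting (item 1) can be sketched in plain Python: split on the coarsest separator first, merge small pieces up to a word budget, and recurse with finer separators only for pieces that are still too large. This is a simplified illustration, not LangChain's `RecursiveCharacterTextSplitter`, which measures length in characters by default and also supports overlap. `recursive_split` and its parameters are hypothetical.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    max_words: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if len(text.split()) <= max_words:
        return [text.strip()] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut on word count
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    buffer = ""
    for piece in text.split(sep):
        if not piece.strip():
            continue
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if len(candidate.split()) <= max_words:
            buffer = candidate  # Keep merging small pieces into the current chunk
            continue
        if buffer:
            chunks.append(buffer.strip())
            buffer = ""
        if len(piece.split()) <= max_words:
            buffer = piece
        else:
            # Piece alone exceeds the budget: retry it with the finer separators
            chunks.extend(recursive_split(piece, max_words, finer))
    if buffer:
        chunks.append(buffer.strip())
    return chunks


text = (
    "Preheat the oven. Grease a baking dish.\n\n"
    "Boil the pasta until tender. Drain it well. Stir in the cheese sauce.\n\n"
    "Bake until golden."
)
for chunk in recursive_split(text, max_words=8):
    print(repr(chunk))
```

Separators that fall on a chunk boundary are dropped in this sketch, so a chunk may lose its final period. Paragraphs that already fit the budget stay intact, and only the oversized middle paragraph is broken at sentence boundaries.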

**Related Tutorials:**
- [Concept: Context Quality Evaluation](context_quality_evaluation.md)
- [Hybrid Search Comparison](hybrid_search_comparison.ipynb)
- [Lesson 13: RAG Generation & Attribution](../lesson-13/TUTORIAL_INDEX.md)