# Notebook 2: Documents & Chunking

**Difficulty:** Beginner-Intermediate | **Estimated Time:** 90-120 minutes

## Learning Objectives

By the end of this notebook, you will be able to:

1. ✅ Load documents from multiple sources (local files, PDFs, web)
2. ✅ Implement different chunking strategies (sentence, token, semantic)
3. ✅ Add custom metadata at document and node levels
4. ✅ Create and manage node relationships
5. ✅ Optimize chunking for retrieval quality
6. ✅ Apply batch embedding optimization

## Prerequisites

- Completed Notebook 1: Setup & Basics
- Understanding of embeddings and chunking concepts
- Sample PDFs in `data/research_papers/` directory

## Curriculum Coverage

- **Section 2.1:** Loading Documents from Various Sources
- **Section 2.2:** Document Preprocessing
- **Section 2.3:** Document Parsing and Chunking
- **Section 2.4:** Metadata Management
- **Section 2.5:** Document Nodes

---

## 1. Setup & Imports

In [1]:
# Core LlamaIndex
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings, Document
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo

# Node Parsers (Chunking Strategies)
from llama_index.core.node_parser import (
    SentenceSplitter,
    TokenTextSplitter,
    SemanticSplitterNodeParser,
)

# Metadata Extraction
from llama_index.core.extractors import (
    TitleExtractor,
    SummaryExtractor,
)
from llama_index.core.ingestion import IngestionPipeline

# LLM and Embeddings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Utilities
from dotenv import load_dotenv
import os
from pathlib import Path
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("✅ Imports successful!")

✅ Imports successful!


In [2]:
# Load environment variables and configure Settings
load_dotenv()

Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    dimensions=1536
)

print("✅ Settings configured")

✅ Settings configured


---

## 2. Loading Documents from Multiple Sources

### 2.1 Local File Loading with SimpleDirectoryReader

In [4]:
# Check data directory structure
data_dir = Path("./data")
sample_docs_dir = data_dir / "sample_docs"
research_papers_dir = data_dir / "research_papers"

print(f"Data directory exists: {data_dir.exists()}")
print(f"Sample docs directory: {sample_docs_dir.exists()}")
print(f"Research papers directory: {research_papers_dir.exists()}")

if research_papers_dir.exists():
    files = list(research_papers_dir.glob("*.pdf"))
    print(f"\nFound {len(files)} PDF files in research_papers/")
    for f in files:
        print(f"  - {f.name}")

Data directory exists: True
Sample docs directory: True
Research papers directory: True

Found 2 PDF files in research_papers/
  - brain_tumor_2024_removed.pdf
  - brain_tumor_2023_2_removed.pdf


### SimpleDirectoryReader Features

**Key Parameters:**
- `input_dir`: Directory path
- `required_exts`: Filter by extensions (e.g., `[".pdf", ".txt"]`)
- `recursive`: Scan subdirectories
- `filename_as_id`: Use filename as document ID
- `file_metadata`: Custom metadata function
- `exclude_hidden`: Skip hidden files
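
As a rough sketch of the `file_metadata` parameter: the reader calls the function with each file's path (a string) and uses the returned dictionary as that document's metadata. The helper name `research_paper_metadata` and the keys it returns are illustrative assumptions, not part of the library.

```python
from datetime import datetime
from pathlib import Path


def research_paper_metadata(file_path: str) -> dict:
    """Build a metadata dict from a file path (hypothetical helper)."""
    path = Path(file_path)
    stat = path.stat()
    return {
        "file_name": path.name,
        "file_type": path.suffix.lower().lstrip("."),
        "file_size_bytes": stat.st_size,
        "last_modified": datetime.fromtimestamp(stat.st_mtime).isoformat(),
        "source": "research_paper",  # assumed label, chosen for illustration
    }


# Assumed usage with the reader (not executed here):
# reader = SimpleDirectoryReader(
#     input_dir="./data/research_papers",
#     required_exts=[".pdf"],
#     file_metadata=research_paper_metadata,
# )
# documents = reader.load_data()
```

Keeping the function free of LlamaIndex imports makes it easy to unit-test on its own before wiring it into a reader.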

In [5]:
# Create sample documents if no PDFs available
# In practice, you'd load actual PDFs from the data directory

sample_papers = [
    Document(
        text="""
        Title: Attention Is All You Need
        Authors: Vaswani et al.
        Year: 2017
        
        Abstract: The dominant sequence transduction models are based on complex recurrent or 
        convolutional neural networks that include an encoder and a decoder. The best performing 
        models also connect the encoder and decoder through an attention mechanism. We propose a 
        new simple network architecture, the Transformer, based solely on attention mechanisms, 
        dispensing with recurrence and convolutions entirely.
        
        Introduction: Recurrent neural networks, long short-term memory and gated recurrent neural 
        networks in particular, have been firmly established as state of the art approaches in 
        sequence modeling and transduction problems. The Transformer is the first transduction model 
        relying entirely on self-attention to compute representations of its input and output without 
        using sequence-aligned RNNs or convolution.
        """,
        metadata={
            "title": "Attention Is All You Need",
            "authors": "Vaswani et al.",
            "year": 2017,
            "category": "transformers",
            "citations": 85000,
            "source": "research_paper"
        }
    ),
    Document(
        text="""
        Title: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
        Authors: Devlin et al.
        Year: 2019
        
        Abstract: We introduce a new language representation model called BERT, which stands for 
        Bidirectional Encoder Representations from Transformers. Unlike recent language representation 
        models, BERT is designed to pre-train deep bidirectional representations from unlabeled text 
        by jointly conditioning on both left and right context in all layers.
        
        Introduction: Language model pre-training has been shown to be effective for improving many 
        natural language processing tasks. Pre-trained language representations can be either context-free 
        or context-based. BERT alleviates the unidirectionality constraint by using a masked language 
        model (MLM) pre-training objective.
        """,
        metadata={
            "title": "BERT",
            "authors": "Devlin et al.",
            "year": 2019,
            "category": "language_models",
            "citations": 65000,
            "source": "research_paper"
        }
    ),
    Document(
        text="""
        Title: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
        Authors: Lewis et al.
        Year: 2020
        
        Abstract: Large pre-trained language models have been shown to store factual knowledge in their 
        parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, 
        their ability to access and precisely manipulate knowledge is still limited. We explore a general 
        fine-tuning recipe for retrieval-augmented generation (RAG) models which combine parametric and 
        non-parametric memory.
        
        Introduction: Pre-trained neural language models store and retrieve knowledge using their parameters. 
        RAG models combine parametric memory (the LLM) with non-parametric memory (a dense vector index of 
        Wikipedia). This provides the model with access to up-to-date information and allows for more 
        interpretable and modular systems.
        """,
        metadata={
            "title": "RAG",
            "authors": "Lewis et al.",
            "year": 2020,
            "category": "rag",
            "citations": 3500,
            "source": "research_paper"
        }
    ),
]

print(f"✅ Created {len(sample_papers)} sample research papers")
for doc in sample_papers:
    print(f"  - {doc.metadata['title']} ({doc.metadata['year']})")

✅ Created 3 sample research papers
  - Attention Is All You Need (2017)
  - BERT (2019)
  - RAG (2020)


### 2.2 Custom Metadata Functions

In [6]:
# Add processing metadata
for doc in sample_papers:
    doc.metadata["processed_date"] = datetime.now().isoformat()
    doc.metadata["char_count"] = len(doc.text)
    doc.metadata["word_count"] = len(doc.text.split())

print("Enhanced metadata for first document:")
for key, value in sample_papers[0].metadata.items():
    print(f"  {key}: {value}")

Enhanced metadata for first document:
  title: Attention Is All You Need
  authors: Vaswani et al.
  year: 2017
  category: transformers
  citations: 85000
  source: research_paper
  processed_date: 2025-12-21T07:43:02.600762
  char_count: 1006
  word_count: 123


---

## 3. Chunking Strategies

### Why Chunking Matters

Chunking is **critical** for RAG quality:

1. **Context Window Limits**: LLMs have token limits
2. **Embedding Quality**: Smaller chunks = more focused embeddings
3. **Retrieval Precision**: Granular chunks improve relevance
4. **Cost Optimization**: Smaller chunks = fewer tokens to LLM

### 3.1 Sentence-Based Chunking

In [7]:
# SentenceSplitter: Respects sentence boundaries
sentence_splitter = SentenceSplitter(
    chunk_size=1024,     # Target tokens per chunk
    chunk_overlap=200,   # Overlap to preserve context
    separator=" ",       # Word-level fallback separator for oversized sentences
)

sentence_nodes = sentence_splitter.get_nodes_from_documents(sample_papers)

print(f"SentenceSplitter Results:")
print(f"  Input documents: {len(sample_papers)}")
print(f"  Output nodes: {len(sentence_nodes)}")
print(f"  Avg chars per node: {sum(len(n.text) for n in sentence_nodes) / len(sentence_nodes):.0f}")

print(f"\nFirst node preview:")
print(f"  Text (first 200 chars): {sentence_nodes[0].text[:200]}...")
print(f"  Metadata: {sentence_nodes[0].metadata}")

SentenceSplitter Results:
  Input documents: 3
  Output nodes: 3
  Avg chars per node: 936

First node preview:
  Text (first 200 chars): Title: Attention Is All You Need
        Authors: Vaswani et al.
        Year: 2017

        Abstract: The dominant sequence transduction models are based on complex recurrent or 
        convolutiona...
  Metadata: {'title': 'Attention Is All You Need', 'authors': 'Vaswani et al.', 'year': 2017, 'category': 'transformers', 'citations': 85000, 'source': 'research_paper', 'processed_date': '2025-12-21T07:43:02.600762', 'char_count': 1006, 'word_count': 123}


### 🎯 ML Engineering Note: Chunk Size Selection

**Chunk Size Trade-offs:**

| Size | Pros | Cons | Use Case |
|------|------|------|----------|
| **Small (256-512)** | Precise retrieval, lower cost | May lose context | Q&A, factoid extraction |
| **Medium (512-1024)** | Balanced context/precision; good default | May still need tuning per dataset | General RAG, document QA |
| **Large (1024-2048)** | Rich context | Diluted relevance, higher cost | Summarization, broad queries |

**Overlap Guidelines:**
- 10-20% of chunk size (typical)
- Higher overlap (20-30%) for dense, technical content
- Lower overlap (5-10%) for structured documents
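
To make the overlap arithmetic concrete, here is a minimal sketch that estimates how many chunks a fixed-size sliding window produces. The function name `estimate_chunks` is hypothetical, and the estimate ignores sentence boundaries and metadata, so real splitters may produce somewhat different counts.

```python
import math


def estimate_chunks(total_tokens: int, chunk_size: int, overlap_ratio: float) -> dict:
    """Estimate chunk count for a fixed-size sliding window with overlap."""
    overlap = int(chunk_size * overlap_ratio)
    stride = chunk_size - overlap  # new tokens contributed by each extra chunk
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_tokens <= chunk_size:
        num_chunks = 1
    else:
        # Each chunk after the first advances by `stride` tokens
        num_chunks = math.ceil((total_tokens - overlap) / stride)
    return {"overlap": overlap, "stride": stride, "num_chunks": num_chunks}


# Compare the three size bands from the table for a 10,000-token document
for size in (256, 512, 1024):
    print(size, estimate_chunks(total_tokens=10_000, chunk_size=size, overlap_ratio=0.2))
```

Raising the overlap ratio shrinks the stride, so the same document yields more chunks and therefore more embeddings to compute and store.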

### 3.2 Token-Based Chunking

In [8]:
# TokenTextSplitter: Precise token count control
token_splitter = TokenTextSplitter(
    chunk_size=512,      # Maximum tokens per chunk
    chunk_overlap=128,   # 25% overlap
    separator=" ",
)

token_nodes = token_splitter.get_nodes_from_documents(sample_papers)

print(f"TokenTextSplitter Results:")
print(f"  Input documents: {len(sample_papers)}")
print(f"  Output nodes: {len(token_nodes)}")
print(f"  Avg chars per node: {sum(len(n.text) for n in token_nodes) / len(token_nodes):.0f}")

# Compare with sentence splitter
print(f"\nComparison:")
print(f"  SentenceSplitter: {len(sentence_nodes)} nodes")
print(f"  TokenTextSplitter: {len(token_nodes)} nodes")
print(f"  Difference: {abs(len(sentence_nodes) - len(token_nodes))} nodes")

TokenTextSplitter Results:
  Input documents: 3
  Output nodes: 3
  Avg chars per node: 936

Comparison:
  SentenceSplitter: 3 nodes
  TokenTextSplitter: 3 nodes
  Difference: 0 nodes


### 3.3 Semantic Chunking

In [9]:
# SemanticSplitterNodeParser: Chunk by meaning, not just size
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,              # Sentences to group for comparison
    breakpoint_percentile_threshold=95,  # Distance percentile needed to split; lower values create more chunks
    embed_model=Settings.embed_model,
)

print("Creating semantic chunks (this will call embedding API)...")
semantic_nodes = semantic_splitter.get_nodes_from_documents(sample_papers)

print(f"\nSemanticSplitterNodeParser Results:")
print(f"  Input documents: {len(sample_papers)}")
print(f"  Output nodes: {len(semantic_nodes)}")
print(f"  Avg chars per node: {sum(len(n.text) for n in semantic_nodes) / len(semantic_nodes):.0f}")
print(f"  Min chars: {min(len(n.text) for n in semantic_nodes)}")
print(f"  Max chars: {max(len(n.text) for n in semantic_nodes)}")

Creating semantic chunks (this will call embedding API)...

SemanticSplitterNodeParser Results:
  Input documents: 3
  Output nodes: 6
  Avg chars per node: 477
  Min chars: 220
  Max chars: 740


### Semantic Chunking Advantages

**How it works:**
1. Embeds consecutive sentences
2. Calculates cosine similarity between embeddings
3. Splits where similarity drops (topic change)
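
The three steps above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the nearest-rank percentile rule and the names `semantic_split` and `cosine_similarity` are assumptions made for the example.

```python
import math
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_split(sentences: list[str], percentile: float = 95) -> list[list[str]]:
    """Group sentences, starting a new chunk where adjacent similarity drops."""
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    # Steps 1-2: distance (1 - cosine similarity) between each adjacent pair
    distances = [
        1 - cosine_similarity(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    # Nearest-rank percentile of the distances acts as the breakpoint threshold
    ranked = sorted(distances)
    rank = max(1, math.ceil(percentile / 100 * len(ranked)))
    threshold = ranked[rank - 1]
    # Step 3: start a new chunk wherever the distance reaches the threshold
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance >= threshold and distance > 0:
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


sentences = [
    "Transformers use attention layers.",
    "Attention layers weigh tokens.",
    "Bananas are yellow fruit.",
    "Yellow fruit tastes sweet.",
]
for chunk in semantic_split(sentences):
    print(chunk)
```

Lowering `percentile` lowers the threshold, so more adjacent pairs qualify as breakpoints and the text is cut into more, smaller chunks.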

**Pros:**
- ✅ Preserves semantic coherence
- ✅ Natural topic boundaries
- ✅ Better for complex documents

**Cons:**
- ❌ Slower (requires embedding API calls)
- ❌ Variable chunk sizes
- ❌ Higher cost (more API calls)

**Best for**: Academic papers, technical docs, long-form content

### 3.4 Comparing Chunking Strategies

In [10]:
import pandas as pd

# Compare chunking strategies
strategies = [
    {"name": "Sentence", "nodes": sentence_nodes},
    {"name": "Token", "nodes": token_nodes},
    {"name": "Semantic", "nodes": semantic_nodes},
]

comparison_data = []
for strat in strategies:
    nodes = strat["nodes"]
    comparison_data.append({
        "Strategy": strat["name"],
        "Num Nodes": len(nodes),
        "Avg Chars": int(sum(len(n.text) for n in nodes) / len(nodes)),
        "Min Chars": min(len(n.text) for n in nodes),
        "Max Chars": max(len(n.text) for n in nodes),
        "Std Dev": int(pd.Series([len(n.text) for n in nodes]).std()),
    })

df = pd.DataFrame(comparison_data)
print("\nChunking Strategy Comparison:")
print(df.to_string(index=False))


Chunking Strategy Comparison:
Strategy  Num Nodes  Avg Chars  Min Chars  Max Chars  Std Dev
Sentence          3        936        877        988       55
   Token          3        936        877        988       55
Semantic          6        477        220        740      228


---

## 4. Metadata Management

### 4.1 Adding Custom Node-Level Metadata

In [11]:
# Enrich nodes with custom metadata
for i, node in enumerate(sentence_nodes):
    # Add node-specific metadata
    node.metadata["node_index"] = i
    node.metadata["chunk_strategy"] = "sentence"
    
    # Derive metadata from content
    text_lower = node.text.lower()
    node.metadata["has_abstract"] = "abstract" in text_lower
    node.metadata["has_introduction"] = "introduction" in text_lower
    node.metadata["mentions_transformer"] = "transformer" in text_lower

print("Enhanced node metadata example:")
print(f"Node 0 metadata: {sentence_nodes[0].metadata}")

Enhanced node metadata example:
Node 0 metadata: {'title': 'Attention Is All You Need', 'authors': 'Vaswani et al.', 'year': 2017, 'category': 'transformers', 'citations': 85000, 'source': 'research_paper', 'processed_date': '2025-12-21T07:43:02.600762', 'char_count': 1006, 'word_count': 123, 'node_index': 0, 'chunk_strategy': 'sentence', 'has_abstract': True, 'has_introduction': True, 'mentions_transformer': True}


### 4.2 LLM-Based Metadata Extraction

In [12]:
# Use LLM to extract metadata
from llama_index.core.extractors import SummaryExtractor, TitleExtractor

# Create extractors
title_extractor = TitleExtractor(
    llm=Settings.llm,
    nodes=5,  # Look at first 5 nodes for title
)

summary_extractor = SummaryExtractor(
    llm=Settings.llm,
    summaries=["self"],  # Summarize each node
)

print("Extracting metadata with LLM (this may take a moment)...")

# Apply to a subset of nodes (to save API calls)
sample_nodes_for_extraction = sentence_nodes[:2]

# Extract summaries
nodes_with_summaries = summary_extractor.process_nodes(sample_nodes_for_extraction)

print(f"\n✅ Extracted summaries for {len(nodes_with_summaries)} nodes")
print(f"\nNode 0 with LLM-generated summary:")
print(f"  Original text (first 150 chars): {nodes_with_summaries[0].text[:150]}...")
if "section_summary" in nodes_with_summaries[0].metadata:
    print(f"  Summary: {nodes_with_summaries[0].metadata['section_summary']}")

Extracting metadata with LLM (this may take a moment)...


100%|██████████| 2/2 [00:03<00:00,  1.94s/it]


✅ Extracted summaries for 2 nodes

Node 0 with LLM-generated summary:
  Original text (first 150 chars): Title: Attention Is All You Need
        Authors: Vaswani et al.
        Year: 2017

        Abstract: The dominant sequence transduction models are b...
  Summary: The section discusses the research paper titled "Attention Is All You Need," authored by Vaswani et al. in 2017. It introduces the Transformer architecture, which is a novel approach to sequence transduction that relies entirely on attention mechanisms, eliminating the need for recurrent or convolutional neural networks. The paper highlights the limitations of traditional models, such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, in sequence modeling. The Transformer model is presented as a state-of-the-art solution that utilizes self-attention to process input and output representations effectively. The paper has garnered significant attention, with approximately 85,000 citations, in




### 🎯 ML Engineering Note: Metadata Extraction Trade-offs

**LLM-based extraction:**
- ✅ High quality, contextual metadata
- ✅ Can extract complex information (topics, entities, sentiment)
- ❌ Expensive (LLM API calls per node)
- ❌ Slow (sequential processing)

**Rule-based extraction:**
- ✅ Fast and cheap
- ✅ Deterministic
- ❌ Limited to simple patterns
- ❌ Requires domain knowledge

**Best Practice**: Use rule-based for simple metadata (dates, counts), LLM for complex metadata (summaries, topics)
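
As one possible rule-based extractor, the sketch below pulls the `Title`, `Authors`, and `Year` header lines out of text formatted like the sample papers in this notebook. The function name `extract_header_metadata` and the regular expression are illustrative assumptions; real documents would need patterns tuned to their own layout.

```python
import re

# Matches lines such as "    Year: 2017" and captures the field name and value
HEADER_PATTERN = re.compile(
    r"^[ \t]*(Title|Authors|Year):[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def extract_header_metadata(text: str) -> dict:
    """Extract simple header fields with a regex (hypothetical helper)."""
    metadata = {}
    for field, value in HEADER_PATTERN.findall(text):
        key = field.lower()
        if key in metadata:
            continue  # keep the first occurrence of each field
        # Store the year as an integer when it is purely numeric
        metadata[key] = int(value) if key == "year" and value.isdigit() else value
    return metadata


sample_text = """
        Title: Attention Is All You Need
        Authors: Vaswani et al.
        Year: 2017

        Abstract: The dominant sequence transduction models are based on ...
"""
print(extract_header_metadata(sample_text))
```

Because the extractor is deterministic and makes no API calls, it can run over every node, leaving LLM calls for fields such as summaries that a pattern cannot capture.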

---

## 5. Node Relationships

### Understanding Node Relationships

In [13]:
# Inspect node relationships
print("Node Relationships:")
for i, node in enumerate(sentence_nodes[:3]):
    print(f"\nNode {i}:")
    print(f"  ID: {node.node_id}")
    print(f"  Relationships: {list(node.relationships.keys())}")
    
    # Check for source document
    if NodeRelationship.SOURCE in node.relationships:
        source_info = node.relationships[NodeRelationship.SOURCE]
        print(f"  Source Document ID: {source_info.node_id}")
    
    # Check for previous/next nodes
    if NodeRelationship.PREVIOUS in node.relationships:
        print(f"  Has PREVIOUS node")
    if NodeRelationship.NEXT in node.relationships:
        print(f"  Has NEXT node")

Node Relationships:

Node 0:
  ID: 27bf30bc-9394-47af-9e34-31978560b859
  Relationships: [<NodeRelationship.SOURCE: '1'>]
  Source Document ID: c8cf197a-472d-4445-9590-c1dd6f6c1eb8

Node 1:
  ID: e79ff1b8-839f-4d37-9485-25d59d9bbf74
  Relationships: [<NodeRelationship.SOURCE: '1'>]
  Source Document ID: b94c9437-8352-42c0-9ce0-d5a4da3d43e7

Node 2:
  ID: 7bb5a46e-710a-4232-aefa-0d3e47705904
  Relationships: [<NodeRelationship.SOURCE: '1'>]
  Source Document ID: 1d0dfd59-3643-458f-9c23-fe860e5f4e11


### Creating Custom Node Relationships

In [14]:
# Create custom parent-child relationships
# Example: Create a summary node that links to detail nodes

summary_node = TextNode(
    text="Summary: Research papers on transformers, BERT, and RAG",
    metadata={"type": "summary", "level": "0"},
)

# Link detail nodes as children
for node in sentence_nodes[:3]:
    node.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(
        node_id=summary_node.node_id,
    )
    node.metadata["level"] = "1"

print("Created hierarchical relationship:")
print(f"  Summary Node ID: {summary_node.node_id}")
print(f"  Child nodes: {len([n for n in sentence_nodes[:3] if NodeRelationship.PARENT in n.relationships])}")

Created hierarchical relationship:
  Summary Node ID: 9f2873a2-4a62-4dc3-ab42-a1a026243e67
  Child nodes: 3


---

## 6. Ingestion Pipeline

### Creating a Complete Ingestion Pipeline

In [15]:
# Build ingestion pipeline
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=200),
        Settings.embed_model,  # Generate embeddings
    ],
)

print("Running ingestion pipeline...")
nodes = pipeline.run(documents=sample_papers, show_progress=True)

print(f"\n✅ Pipeline complete!")
print(f"  Processed {len(sample_papers)} documents")
print(f"  Generated {len(nodes)} nodes")
print(f"  Nodes have embeddings: {nodes[0].embedding is not None}")

Running ingestion pipeline...


Parsing nodes:   0%|          | 0/3 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/3 [00:00<?, ?it/s]


✅ Pipeline complete!
  Processed 3 documents
  Generated 3 nodes
  Nodes have embeddings: True


### Pipeline Benefits

**IngestionPipeline** provides:
- ✅ **Caching**: Avoid re-processing unchanged documents
- ✅ **Batch processing**: Efficient for large document sets
- ✅ **Composable**: Chain multiple transformations
- ✅ **Async support**: Parallel processing
- ✅ **Error handling**: Graceful failures
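
To illustrate the caching idea in isolation, here is a minimal sketch that memoizes a transformation by hashing its name together with the input text. This is not how `IngestionPipeline` is implemented internally; the `CachedStep` class and its attributes are hypothetical names used only to show why unchanged documents need not be re-processed.

```python
import hashlib


class CachedStep:
    """Wrap a text transformation and memoize results by content hash."""

    def __init__(self, name: str, fn):
        self.name = name
        self.fn = fn
        self.cache = {}
        self.calls = 0  # how many times the underlying function actually ran

    def __call__(self, text: str):
        # Identical step name and identical text always map to the same key
        key = hashlib.sha256(f"{self.name}:{text}".encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.fn(text)
        return self.cache[key]


split_words = CachedStep("split_words", lambda text: text.split())

docs = ["attention is all you need", "retrieval augmented generation"]
first_run = [split_words(d) for d in docs]
second_run = [split_words(d) for d in docs]  # served from the cache
print(f"Underlying calls: {split_words.calls}")
```

The second pass returns the stored results, so the underlying function runs only once per unique document. For embedding or LLM steps, that is where the time and cost savings come from.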

---

## 7. Building an Index with Optimized Chunks

### Using Our Processed Nodes

In [16]:
# Create index from our processed nodes
index = VectorStoreIndex(nodes=nodes)

query_engine = index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact"
)

print("✅ Index created from processed nodes")
print(f"  Total nodes indexed: {len(nodes)}")

✅ Index created from processed nodes
  Total nodes indexed: 3


### Querying with Rich Metadata

In [17]:
# Query about transformers
query = "What is the Transformer architecture?"
response = query_engine.query(query)

print(f"Query: {query}\n")
print("Response:")
print(response)
print("\n" + "="*80)

# Examine retrieved sources
print("\nRetrieved Sources:")
for i, source_node in enumerate(response.source_nodes, 1):
    print(f"\nSource {i}:")
    print(f"  Score: {source_node.score:.4f}")
    print(f"  Title: {source_node.metadata.get('title', 'N/A')}")
    print(f"  Year: {source_node.metadata.get('year', 'N/A')}")
    print(f"  Category: {source_node.metadata.get('category', 'N/A')}")
    print(f"  Text preview: {source_node.text[:150]}...")

Query: What is the Transformer architecture?

Response:
The Transformer architecture is a novel network design that relies entirely on attention mechanisms, eliminating the need for recurrence and convolutions. It is specifically developed for sequence transduction tasks and computes representations of input and output through self-attention, distinguishing itself from traditional models that utilize recurrent neural networks or convolutional layers.


Retrieved Sources:

Source 1:
  Score: 0.4582
  Title: Attention Is All You Need
  Year: 2017
  Category: transformers
  Text preview: Title: Attention Is All You Need
        Authors: Vaswani et al.
        Year: 2017

        Abstract: The dominant sequence transduction models are b...

Source 2:
  Score: 0.2470
  Title: BERT
  Year: 2019
  Category: language_models
  Text preview: Title: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
        Authors: Devlin et al.
        Year: 2019

        Abs...


In [18]:
# Query about RAG
query2 = "Explain retrieval-augmented generation"
response2 = query_engine.query(query2)

print(f"Query: {query2}\n")
print("Response:")
print(response2)
print("\n" + "="*80)

print("\nTop Source:")
top_source = response2.source_nodes[0]
print(f"  Title: {top_source.metadata.get('title')}")
print(f"  Authors: {top_source.metadata.get('authors')}")
print(f"  Citations: {top_source.metadata.get('citations')}")

Query: Explain retrieval-augmented generation

Response:
Retrieval-augmented generation (RAG) is a model that integrates both parametric and non-parametric memory to enhance the performance of natural language processing tasks. It combines the capabilities of large pre-trained language models, which store knowledge in their parameters, with a dense vector index of external information sources, such as Wikipedia. This approach allows the model to access up-to-date information and improves its ability to interpret and manipulate knowledge effectively. By leveraging both types of memory, RAG models aim to provide more accurate and contextually relevant responses in knowledge-intensive applications.


Top Source:
  Title: RAG
  Authors: Lewis et al.
  Citations: 3500


---

## 8. Chunking Best Practices

### Experiment: Impact of Chunk Size on Retrieval

In [19]:
# Test different chunk sizes
chunk_sizes = [256, 512, 1024, 2048]
test_query = "What are the benefits of attention mechanisms?"

results = []

for chunk_size in chunk_sizes:
    # Create splitter
    splitter = SentenceSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * 0.2)  # 20% overlap
    )
    
    # Split once, then index the resulting nodes directly
    # (avoids chunking the same documents a second time)
    temp_nodes = splitter.get_nodes_from_documents(sample_papers)
    temp_index = VectorStoreIndex(nodes=temp_nodes, show_progress=False)
    
    # Query
    temp_engine = temp_index.as_query_engine(similarity_top_k=2)
    temp_response = temp_engine.query(test_query)
    
    results.append({
        "Chunk Size": chunk_size,
        "Num Nodes": len(temp_nodes),
        "Top Score": f"{temp_response.source_nodes[0].score:.4f}",
        "Response Len": len(str(temp_response)),
    })

df_results = pd.DataFrame(results)
print("\nChunk Size Impact on Retrieval:")
print(df_results.to_string(index=False))


Chunk Size Impact on Retrieval:
 Chunk Size  Num Nodes Top Score  Response Len
        256          3    0.4572           566
        512          3    0.4572           736
       1024          3    0.4571           660
       2048          3    0.4572           713


### 🎯 Key Takeaways: Chunking Best Practices

1. **Start with 512-1024 tokens** for most applications
2. **Use 10-20% overlap** to preserve context across boundaries
3. **Choose SentenceSplitter** for general use (respects boundaries)
4. **Use SemanticSplitter** for complex documents (worth the cost)
5. **Add rich metadata** for better filtering and attribution
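6. **Test chunk sizes** on your specific data and queries; in the experiment above, each short sample document fit in a single chunk at every size, so the node counts and scores barely changed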
7. **Monitor costs**: Smaller chunks = more nodes = more embeddings

---

## 9. Summary: What You Learned

### ✅ Completed Learning Objectives

1. **Document Loading**: Reviewed SimpleDirectoryReader options and created custom documents
2. **Chunking Strategies**: Implemented sentence, token, and semantic chunking
3. **Metadata Management**: Added document and node-level metadata, used LLM extraction
4. **Node Relationships**: Inspected SOURCE relationships and created custom parent links
5. **Optimization**: Built ingestion pipeline, compared chunk sizes
6. **Best Practices**: Learned trade-offs and guidelines for production

### Key Concepts Mastered

- **SentenceSplitter**: Respects sentence boundaries, good default
- **TokenTextSplitter**: Precise token control
- **SemanticSplitterNodeParser**: Meaning-based chunking (slower, higher quality)
- **IngestionPipeline**: Composable, cacheable document processing
- **Metadata enrichment**: Rule-based and LLM-based extraction
- **Node relationships**: SOURCE, PREVIOUS, NEXT, PARENT, CHILD

### Next Steps

In **Notebook 3: Indexing & Simple Queries**, you'll learn:
- Integrating external vector stores (Qdrant, Chroma)
- Comparing embedding models (OpenAI vs HuggingFace)
- Index persistence and loading
- Advanced query engine configuration
- Response synthesis strategies
- VectorIndexAutoRetriever for smart filtering

---

## 🎯 Practice Exercises

1. **Load Real PDFs**: Place PDFs in `data/research_papers/` and load with SimpleDirectoryReader
2. **Compare Chunking**: Create indexes with different chunk sizes (256, 512, 1024). Which works best for your data?
3. **Metadata Extraction**: Add custom metadata extractors for your domain (e.g., extract dates, authors, topics)
4. **Semantic Chunking**: Apply SemanticSplitterNodeParser to a long document. How does it split compared to SentenceSplitter?
5. **Pipeline**: Build a custom ingestion pipeline with multiple transformations

---

## Additional Resources

- **Node Parsers**: https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/
- **Metadata Extractors**: https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/
- **Ingestion Pipeline**: https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/