# Introduction
In this exercise, we will learn how to **enrich documents with metadata** before indexing them for RAG. Instead of storing only raw text, we'll add useful information such as titles, summaries or example questions/answers.

# Setup: Installing Required Libraries

Step 1: Before we begin, we need to install the necessary Python libraries. Run the cell below to install all dependencies for this notebook.

Step 2: Upload the file called why-language-models-hallucinate.pdf to Files.

In [None]:
# Install required libraries with working versions
# If you see dependency conflict warnings during installation, you can ignore them - they won't affect this notebook.
# Always restart your runtime after installation! (Runtime → Restart runtime)
!pip install -q llama-index-core==0.14.6 llama-index-embeddings-openai==0.5.1 \
    llama-index-llms-openai==0.6.6 openai==1.109.1 \
    chromadb==1.2.2 llama-index-vector-stores-chroma==0.5.3 \
    llama-index-readers-file llama-parse

print("✅ All libraries installed successfully!")
print("⚠️ IMPORTANT: Please restart your kernel/runtime now before running the next cell!")

# 1. Loading the data

We are going to work with the PDF file "why-language-models-hallucinate.pdf" (a recent OpenAI research piece that explores the statistical reasons behind model hallucinations) and load it using `SimpleDirectoryReader`:


In [None]:
import os

# Configure OpenAI API key
OPENAI_API_KEY = None

try:
    from google.colab import userdata  # type: ignore
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    if OPENAI_API_KEY:
        print('✅ API key loaded from Colab secrets')
except Exception:
    pass

if not OPENAI_API_KEY:
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

if not OPENAI_API_KEY:
    try:
        from getpass import getpass
        print('💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY')
        OPENAI_API_KEY = getpass('Enter your OpenAI API Key: ')
    except Exception as exc:
        raise ValueError('❌ ERROR: No API key provided! Set OPENAI_API_KEY as an environment variable or Colab secret.') from exc

if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == '':
    raise ValueError('❌ ERROR: No API key provided!')

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

print('✅ Authentication configured!')

OPENAI_MODEL = 'gpt-5-nano'  # Using gpt-5-nano for cost efficiency
print(f'🤖 Selected Model: {OPENAI_MODEL}')

OPENAI_EMBED_MODEL = 'text-embedding-3-small'
print(f'🧠 Embedding Model: {OPENAI_EMBED_MODEL}')


In [None]:
from llama_index.core import SimpleDirectoryReader

documents = await SimpleDirectoryReader(input_files=["why-language-models-hallucinate.pdf"]).aload_data()

The file has 36 pages, so the data connector created 36 document objects:

In [None]:
print("Number of document objects:", len(documents))

# 2. Controlling Metadata Visibility

Before we enrich our documents, we first need to understand what information is already stored in them.



We already know that each document carries not just the text but also metadata. In addition, two parameters control which parts of that metadata are visible to the embedding model and to the LLM at query time:

`excluded_embed_metadata_keys`:
- tells LlamaIndex which metadata **should not be included when creating embeddings**
- embeddings are designed to capture the semantic meaning of content, and technical details such as file size or last modified date do not add any useful meaning.

`excluded_llm_metadata_keys`:
- tells LlamaIndex which metadata **should not be sent to the LLM when the document is retrieved at query time**
- the reason for controlling this is that some metadata, such as the author, can give the LLM valuable context when generating an answer, while other metadata would only distract the model and reduce the clarity of its response

These two parameters give us **control over what information flows into embeddings and into the LLM**.
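
To make the idea concrete before we look at real documents, here is a minimal plain-Python sketch of the filtering step. It is not LlamaIndex code: `render_for` is a hypothetical helper, and the `key: value` layout is an assumption chosen for readability.

```python
# Simplified stand-in for how excluded keys shape what each consumer sees.
# NOT LlamaIndex code; `render_for` is a hypothetical helper for illustration.

def render_for(text, metadata, excluded_keys):
    """Return the visible metadata lines (minus excluded keys) followed by the text."""
    visible = {k: v for k, v in metadata.items() if k not in excluded_keys}
    metadata_str = "\n".join(f"{k}: {v}" for k, v in visible.items())
    return f"{metadata_str}\n\n{text}" if metadata_str else text

metadata = {"file_name": "report.pdf", "file_size": 12345, "author": "Jane Doe"}
text = "Quarterly revenue grew by 12%."

# Embeddings: drop purely technical fields.
embed_view = render_for(text, metadata, excluded_keys=["file_name", "file_size"])

# LLM: keep the file name as a source reference, still drop the file size.
llm_view = render_for(text, metadata, excluded_keys=["file_size"])

print("Embedding model sees:\n" + embed_view)
print("\nLLM sees:\n" + llm_view)
```

The same document therefore produces two different strings, one per consumer, which is exactly what the two exclusion lists control.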

In [None]:
documents[0].__dict__

## 2.1 Manually Constructed Documents

Most of the time we let LlamaIndex create documents automatically when we load files. But sometimes we want to **build a Document manually**. This is useful when:
1. Our data doesn't come from a file (e.g., it comes from a database or an API)
2. We want to attach custom metadata up front



The example below (adapted from the documentation) shows how we can construct a custom Document and explicitly control what metadata is included when building embeddings.

In this case, we give the document some metadata ("file_name", "category" and "author"). But notice that we tell LlamaIndex to **exclude "file_name"** from embeddings by setting `excluded_embed_metadata_keys`. This makes sense, because the actual file name is not semantically meaningful and would only add noise to the embedding space. The category ("finance") and author ("LlamaIndex"), however, may carry useful meaning for semantic search, so we leave them in.


In [None]:
# What the Embedding model will see

from llama_index.core import Document
from llama_index.core.schema import MetadataMode

document = Document(
    text="This is a short snippet of a super-customized document that will go to the embedding model",
    metadata={
        "file_name": "super_secret_document.txt",
        "category": "finance",
        "author": "LlamaIndex",
    },
    excluded_embed_metadata_keys=["file_name"]
)

print("The Embedding model sees this: \n", document.get_content(metadata_mode=MetadataMode.EMBED))

Just like we control which metadata flows into the embedding model, we can also **decide which metadata the LLM will receive when it is asked to answer a query**. This is important because the LLM doesn't only use the raw text of a chunk. It can also use metadata as extra context to generate a better answer.

In the example below, we create a custom Document with the same metadata fields. This time, however, we tell LlamaIndex to exclude the category from what the LLM sees. That means when the document is retrieved later, the model will still see the file name (so it knows the source) and the author (which may add credibility). In this case, the category is redundant - the text already makes it clear that the topic is finance, so it likely won't affect the LLM's response and only takes up prompt space.

In [None]:
# What the LLM model will see

from llama_index.core import Document
from llama_index.core.schema import MetadataMode

document = Document(
    text="This is a short snippet of a super-customized document that will go to the LLM",
    metadata={
        "file_name": "super_secret_document.txt",
        "category": "finance",
        "author": "LlamaIndex",
    },
    excluded_llm_metadata_keys=["category"],
)

print(
    "The LLM sees this: \n", document.get_content(metadata_mode=MetadataMode.LLM))

Some metadata is more useful for embeddings, some is more useful for the LLM, and some works for both. **Whether we exclude or include a piece of metadata depends on our use case**: are we trying to keep embeddings clean, or give the LLM more context? In real-world projects, this is a design choice you make depending on how much value versus noise the metadata adds.

By default, LlamaIndex already takes care of formatting metadata in a clean way, and in practice you usually don't need to change it. However, you can **customize the formatting** if you want more readable prompts for the LLM or if you want the metadata formatted in a certain style to match your company's pipelines or prompt style.

Optional parameters:
- `metadata_seperator` - Sets the character(s) between different pieces of metadata. The default is a newline (`"\n"`).
- `metadata_template` - Defines how each key-value pair is shown. Both `{key}` and `{value}` must be included.
- `text_template` - Defines how the metadata string and the text are combined. It takes two variables: `metadata_str` and `content`.

This doesn't change what information is sent, only how it is displayed. For example, for the LLM a cleaner format can sometimes help it parse metadata more naturally.
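
How the three parameters fit together can be sketched in a few lines of plain Python. This is an approximation for intuition, not the library's implementation: `format_document` is a hypothetical helper, and the default values in its signature are assumptions.

```python
# Plain-Python approximation of how the three formatting parameters compose.
# `format_document` is a hypothetical helper; the defaults are assumptions.

def format_document(text, metadata,
                    metadata_template="{key}: {value}",
                    separator="\n",
                    text_template="{metadata_str}\n\n{content}"):
    # 1) Render every key-value pair with the metadata template.
    pairs = [metadata_template.format(key=k, value=v) for k, v in metadata.items()]
    # 2) Join the rendered pairs with the separator.
    metadata_str = separator.join(pairs)
    # 3) Place the metadata string and the content into the text template.
    return text_template.format(metadata_str=metadata_str, content=text)

formatted = format_document(
    "Short snippet.",
    {"category": "finance", "author": "LlamaIndex"},
    metadata_template="{key}=>{value}",
    separator=" | ",
    text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
)
print(formatted)
```

The order of operations is the useful part: pairs are rendered first, then joined, then embedded into the surrounding template.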

In our exercise, we'll try a custom format just to see how this works.

In [None]:
# Formatting
document = Document(
    text="This is a short snippet of a super-customized document that will go to the model",
    metadata={
        "file_name": "super_secret_document.txt",
        "category": "finance",
        "author": "LlamaIndex",
    },
    metadata_seperator=", ",
    metadata_template="{key}:{value}",
    text_template="Metadata:\n{metadata_str}\n------\nContent:\n{content}",
)

print("The LLM sees this: \n", document.get_content(metadata_mode=MetadataMode.LLM))

Now let's return to the real documents created automatically when we loaded our PDF. Each of these documents already comes with some metadata attached.

- "file_path" can be useful (it tells us where the chunk came from)
- "page_label" usually does not add much value for embeddings (but it is handy to keep for the LLM if you want page references)

In [None]:
print(documents[0].get_content(metadata_mode=MetadataMode.EMBED))

Let's exclude page labels - we can loop through all documents, adjust their formatting template and tell LlamaIndex not to include "page_label" in the embeddings:

In [None]:
for doc in documents:
    # Defining the content/metadata template
    doc.text_template = "Metadata:\n{metadata_str}\n---\nContent:\n{content}"

    # Excluding page label from embedding
    if "page_label" not in doc.excluded_embed_metadata_keys:
        doc.excluded_embed_metadata_keys.append("page_label")

Let's check the transformation - page label should not be included in the metadata:

In [None]:
print(documents[0].get_content(metadata_mode=MetadataMode.EMBED))

# 3. Building a RAG Pipeline with Metadata Enrichment

Now we're getting to the interesting and fun part of the notebook. Up to this point, our documents only carried basic metadata like file names and page labels. That's useful for organizing files, but it doesn't really help a RAG system retrieve more accurate answers. So this is what we're going to do: make our RAG pipeline smarter by **enriching each document chunk with additional context** such as short titles, summaries and example Q&As. To do so, we'll **use a language model (`gpt-5-nano`) to generate this metadata**.

To really test whether this helps, we'll build **3 different versions of our nodes** (chunks of text):
- Baseline nodes (`baseline_nodes`) - only the basic metadata
- Title-enriched nodes (`nodes_1`) - chunks labeled with short, descriptive titles
- Fully enriched nodes (`nodes_2`) - chunks augmented with titles, summaries and example Q&A pairs

We'll follow three main steps in this experiment:
1. **Splitting the data**: we'll break the PDF into smaller, manageable chunks
2. **Creating three versions of nodes**
3. **Building and testing RAG indexes**: we'll run the same queries against each node set and compare the results to see how much metadata enrichment improves retrieval and answers

> NOTE: This setup is inspired by the official LlamaIndex metadata extraction [example](https://docs.llamaindex.ai/en/stable/examples/metadata_extraction/MetadataExtraction_LLMSurvey/#automated-metadata-extraction-for-better-retrieval-synthesis).

**OpenAI's Language model for transformations**

We will use OpenAI's `gpt-5-nano` model, which is fast, affordable, and accurate enough for our metadata extraction.

In [None]:
from llama_index.llms.openai import OpenAI

# Language model
llm_transformations = OpenAI(
    model = OPENAI_MODEL,
    temperature = 0.0,
    max_tokens = 512
)

## 3.1 Splitting the data

First, we need to prepare our documents for transformation by splitting them into smaller chunks, because large documents cannot be processed effectively all at once. We'll use `SentenceSplitter`, which splits the content into chunks of up to 1024 tokens with an overlap of 128 tokens between neighboring chunks. The overlap means that information near the end of one chunk also appears at the start of the next, so context is not cut off at the boundary. The `separator` parameter tells the splitter to break text along spaces (keeping words intact).
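
The effect of the overlap is easiest to see on a toy example. The sketch below is a deliberately simplified sliding window over whitespace-separated words; it assumes word counts instead of tokenizer tokens and ignores sentence boundaries, so it only approximates what `SentenceSplitter` does. `chunk_words` is a hypothetical helper.

```python
# Toy sliding-window chunker: counts words instead of tokens and ignores
# sentence boundaries, so it only illustrates the overlap idea.

def chunk_words(words, chunk_size, overlap):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = chunk_words(words, chunk_size=4, overlap=1)

for chunk in chunks:
    print(" ".join(chunk))
```

With `chunk_size=4` and `overlap=1`, every chunk starts with the last word of the previous one, which is the small-scale version of the 128-token overlap used in the notebook.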

In [None]:
from llama_index.core.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(
    separator = " ",
    chunk_size = 1024,
    chunk_overlap = 128
)
text_splitter

## 3.2 Creating three versions of Nodes

We'll now create three parallel versions of our corpus so we can run a fair comparison later.

### 3.2.1 Creating baseline nodes (split only)
First, we create the baseline: chunks produced by the splitter with no metadata enrichment. This gives us a control group. Any improvement we see later can be attributed to the extra metadata, not to changes in chunking.

In [None]:
# Baseline nodes
baseline_nodes = text_splitter.get_nodes_from_documents(documents)

In [None]:
baseline_nodes[:3]

### 3.2.2 Enriched nodes (titles extraction)

Next, we'll add short, descriptive titles for each chunk of text. These will be labels for each chunk and often help the retriever match user intent to the right passage.

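To do this, we use `TitleExtractor`, which takes an LLM and writes a `document_title` entry into each node's metadata. The parameter `nodes = 5` sets how many nodes from the start of each source document are used to generate that title; all nodes that come from the same document share it. Because every PDF page was loaded as its own document, the titles here are effectively page-level.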

In [None]:
from llama_index.core.extractors import TitleExtractor

title_extractor = TitleExtractor(llm = llm_transformations, nodes = 5)
title_extractor

Next, we run this transformation using `IngestionPipeline`. The pipeline executes a sequence of transformations: in our case, splitting the text into chunks and then adding titles. We set `in_place=False` to make sure we don't overwrite our baseline nodes. Instead, we produce a separate list (stored in `nodes_1`) for the A/B comparison.

In [None]:
from llama_index.core.ingestion import IngestionPipeline

pipeline_titles = IngestionPipeline(
    transformations=[
        text_splitter,
        title_extractor
    ]
)

In [None]:
# Running the pipeline
nodes_1 = pipeline_titles.run(
    documents = documents,
    in_place = False,
    show_progress = True
)

### 📝 EXERCISE 1: Explore Metadata Extraction


**Your task:**
1. Compare a baseline node (without metadata) to an enriched node (with title extraction)
2. Display the content of `baseline_nodes[5]` using `.get_content()`
3. Display the content of `nodes_1[5]` (with title metadata) using `.get_content(metadata_mode=MetadataMode.LLM)`
4. Observe: What additional information does the title provide? How might this help retrieval?


**Hint:** Use `MetadataMode.LLM` to see what the language model receives, including metadata.


In [None]:
# YOUR CODE HERE


### 3.2.3 Fully enriched nodes (titles + Q&A + summary extraction)

For the richest version of our nodes, we'll go beyond titles and also add example Q&A pairs and short summaries.

**Q&A pairs simulate how a real user might query the system and what kind of response a chunk could provide**. This makes the retriever's job easier because each chunk carries hints about the kinds of questions it can answer. In practice, adding Q&A metadata often improves recall (finding the right chunk) and helps the system produce more useful answers. We'll use `QuestionsAnsweredExtractor` and set `questions = 3`, which asks the LLM to generate three questions that the chunk can answer. They are stored under the metadata key `questions_this_excerpt_can_answer`; we refer to this as Q&A metadata throughout the notebook.

In [None]:
from llama_index.core.extractors import QuestionsAnsweredExtractor

qa_extractor = QuestionsAnsweredExtractor(llm = llm_transformations, questions = 3)
qa_extractor

**Summaries capture the core ideas of each chunk in a compact form**. They provide another layer of metadata that's especially helpful when users ask broader or high-level questions. We'll use `SummaryExtractor` for this task.

In [None]:
from llama_index.core.extractors import SummaryExtractor

summary_extractor = SummaryExtractor(llm = llm_transformations)
summary_extractor

Both transformations run inside the same pipeline, along with the SentenceSplitter and TitleExtractor, so each chunk ends up with a title, a short summary and 3 example Q&A pairs.

In [None]:
# titles + Q&A + summary
pipeline_rich = IngestionPipeline(
    transformations=[
        text_splitter,
        title_extractor,
        qa_extractor,
        summary_extractor
    ]
)

In [None]:
# Running the pipeline
nodes_2 = pipeline_rich.run(
    documents = documents,
    in_place = False,
    show_progress = True
)

In [None]:
print(nodes_2[0].get_content(metadata_mode=MetadataMode.LLM))

**Splicing Baseline and Enriched Nodes**

To fairly test the effect of metadata enrichment, we don't want to rebuild our dataset in three completely separate ways. That would make it difficult to know if differences in answers are due to enrichment or simply because the data was reprocessed differently. Instead, we keep most of the dataset identical and **replace only a small slice of nodes with enriched versions**.

This creates a controlled experiment:
- All three indexes contain the same core content.
- The only difference is that in `index_1` and `index_2`, a chosen section of the document is enriched with new metadata (titles, or titles + Q&A + summaries).
- If the enriched versions produce better answers, we can be confident the improvement comes from the metadata itself, not from unrelated differences.

  
First let's check the number of nodes in the baseline split:

In [None]:
print(len(baseline_nodes))

When deciding which nodes to replace, we need to balance two things:
1. Keep enough baseline nodes so the indexes are mostly identical.
2. Pick a meaningful section of the paper (not just references, etc.).
   
In our case, the baseline split produced **39 nodes**. A good rule of thumb is to replace about 20–25% of the nodes. That's large enough to see an effect, but small enough that the rest of the dataset remains constant. We chose the slice `[15:25]` (10 nodes at indices 15 through 24, roughly a quarter of the total), which corresponds to the middle of the paper.

We will create a helper function that replaces the baseline slice `[15:25]` with enriched nodes from `nodes_1` or `nodes_2`. The rest of the baseline stays intact:

In [None]:
def splice(orig, replacement, start=15, end=25):
    # keep same length, swap slice [start:end] with enriched nodes
    return orig[:start] + replacement[start:end] + orig[end:]

# mostly baseline nodes, with title-enriched nodes at indices 15-24
nodes_for_index_1 = splice(baseline_nodes, nodes_1, 15, 25)

# mostly baseline nodes, with fully enriched nodes (titles + Q&A + summaries) at indices 15-24
nodes_for_index_2 = splice(baseline_nodes, nodes_2, 15, 25)
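
To see what the splice does without touching real nodes, here is a toy run on lists of strings. The helper is repeated so the snippet runs on its own; the `base_` and `rich_` labels are made up for illustration. Note that the splice only lines up if both lists have the same length and order, which we assume holds here because both were produced by the same splitter on the same documents.

```python
# Toy check of the splice helper using strings instead of nodes.

def splice(orig, replacement, start=15, end=25):
    # keep same length, swap slice [start:end] with enriched items
    return orig[:start] + replacement[start:end] + orig[end:]

baseline = [f"base_{i}" for i in range(39)]   # stand-in for baseline_nodes
enriched = [f"rich_{i}" for i in range(39)]   # stand-in for nodes_1 / nodes_2

# The splice assumes both lists are aligned item by item.
assert len(baseline) == len(enriched)

mixed = splice(baseline, enriched, 15, 25)
replaced = [i for i, item in enumerate(mixed) if item.startswith("rich_")]

print("total items:", len(mixed))
print("replaced indices:", replaced[0], "to", replaced[-1])
print(f"share replaced: {len(replaced) / len(mixed):.1%}")
```

The end index is exclusive, so `[15:25]` swaps exactly 10 items (indices 15 through 24) and leaves the overall length unchanged.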

**Creating Embeddings**

Now we'll embed and index each node set with the same embedding model, `text-embedding-3-small`. Creating a `VectorStoreIndex` from nodes automatically computes embeddings for those nodes.

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex

embed_model = OpenAIEmbedding(model=OPENAI_EMBED_MODEL)

# baseline only
index_0 = VectorStoreIndex(baseline_nodes, embed_model=embed_model, show_progress=True)

# baseline with the slice replaced by titles
index_1 = VectorStoreIndex(nodes_for_index_1, embed_model=embed_model, show_progress=True)

# baseline with the slice replaced by titles + Q&A + summary
index_2 = VectorStoreIndex(nodes_for_index_2, embed_model=embed_model, show_progress=True)

**Querying**

Next, we'll create three query engines with the identical parameter `similarity_top_k=1`, so each returns the single most relevant node:

In [None]:
query = "What metrics are commonly used to evaluate text generation quality, and what are their limitations according to the paper?"

query_engine_0 = index_0.as_query_engine(similarity_top_k=1)
query_engine_1 = index_1.as_query_engine(similarity_top_k=1)
query_engine_2 = index_2.as_query_engine(similarity_top_k=1)

Each index is queried with the exact same question:

In [None]:
response_0 = query_engine_0.query(query)
response_1 = query_engine_1.query(query)
response_2 = query_engine_2.query(query)
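
For intuition about what `similarity_top_k=1` means, the sketch below ranks a few hand-made vectors by cosine similarity and keeps the best one. The three-dimensional vectors and node names are invented for illustration; real embeddings have many more dimensions, and the vector store may compute similarity differently.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, node_vecs, k=1):
    """Return the k (node_id, score) pairs with the highest similarity."""
    scored = [(cosine(query_vec, vec), node_id) for node_id, vec in node_vecs.items()]
    scored.sort(reverse=True)
    return [(node_id, score) for score, node_id in scored[:k]]

# Made-up 3-dimensional "embeddings" for three chunks.
node_vecs = {
    "intro": [0.9, 0.1, 0.0],
    "evaluation": [0.1, 0.9, 0.2],
    "references": [0.2, 0.7, 0.6],
}
query_vec = [0.0, 1.0, 0.1]

best = top_k(query_vec, node_vecs, k=1)
print(best)
```

With `k=1` only the single closest chunk reaches the LLM, so a near miss (for instance a references page that shares vocabulary with the query) leaves the model with nothing useful to ground its answer in.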

**Important Note: Results May Vary**

The effectiveness of metadata enrichment depends on several factors:
- **Which nodes were enriched**: We only enriched the slice `[15:25]` (about 25% of the nodes)
- **Semantic similarity**: How well the query embedding matches the chunk embeddings
- **Data quality**: Whether reference pages and irrelevant sections were included in the index

In this particular run, all three versions (baseline, titles, and titles + Q&A + summary) retrieved content from **page 20, which contains only references**. This demonstrates several important lessons about RAG systems:

1. **RAG pipelines are not fully deterministic**: Results can vary between runs, for example because the LLM-generated metadata (and therefore the text that gets embedded) and the generated answers change from run to run
2. **Metadata enrichment isn't a silver bullet**: While it helps improve retrieval, it doesn't guarantee perfect results every time
3. **Preprocessing matters critically**: We should have excluded reference pages, bibliographies, and appendices before indexing to prevent retrieving non-substantive content
4. **Retrieval can fail**: Even with enrichment, the retriever can still grab the wrong chunks, especially when queries have high lexical overlap with irrelevant sections
5. **LLMs hallucinate from poor context**: Notice how the model generates plausible-sounding answers (mentioning BLEU, ROUGE, perplexity) even though page 20 only contains citations, not actual discussion of these metrics. The model is drawing from its training knowledge rather than grounding its answer in the document.
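
Lesson 3 can be acted on with a simple pre-indexing filter. The sketch below is one possible heuristic, not a robust detector: it flags a page as a reference list when at least half of its non-empty lines look like numbered citations. The function name, the regular expression, the threshold and the sample page texts are all assumptions made up for illustration, and the pattern would need adjusting for papers that use author-year citations.

```python
import re

# Matches lines that start like "[12] ..." or "12. ..." (numbered citations).
CITATION_LINE = re.compile(r"^\s*(\[\d+\]|\d+\.)\s+\S")

def looks_like_references(text, threshold=0.5):
    """Heuristic: True if most non-empty lines look like numbered citations."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if CITATION_LINE.match(ln))
    return hits / len(lines) >= threshold

# Made-up stand-ins for loaded pages (page label + extracted text).
pages = [
    {"page_label": "3",
     "text": "Hallucinations arise from statistical pressures during training.\n"
             "We analyze why evaluations reward guessing."},
    {"page_label": "20",
     "text": "[1] A. Author. Some title. 2020.\n"
             "[2] B. Author. Another title. 2021."},
]

kept = [p for p in pages if not looks_like_references(p["text"])]
print([p["page_label"] for p in kept])
```

Applied to the real `documents` list, the same test could be run on each document's text before splitting, so reference-only pages never reach the index.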



In [None]:
print("\n[BASELINE]\n", response_0.response)
print("\n[TITLES]\n", response_1.response)
print("\n[TITLES + Q&A + SUMMARY]\n", response_2.response)

def show_sources(resp, k=1):
    for i, sn in enumerate(resp.source_nodes[:k], 1):
        md = sn.node.metadata or {}
        print(f"\nSource {i} | page={md.get('page_label')} | title={md.get('document_title')}")
        print(sn.node.get_content(metadata_mode=MetadataMode.NONE)[:400], "\n---------------")

print("\n SOURCES: BASELINE")
show_sources(response_0)

print("\n SOURCES: TITLES")
show_sources(response_1)

print("\n SOURCES: TITLES + Q&A + SUMMARY")
show_sources(response_2)

# 4. Persistent Storage

Once we've decided which pages to keep and which metadata to enrich (e.g., titles only, pages with references removed), we can persist that final node set, for example, in ChromaDB. We will take the enriched nodes stored in `nodes_1`, embed them with the same embedding model (`text-embedding-3-small`), and write those vectors into a Chroma collection we can reopen in future notebook sessions.

In [None]:
from chromadb import PersistentClient
from llama_index.vector_stores.chroma import ChromaVectorStore

embed_model = OpenAIEmbedding(model = OPENAI_EMBED_MODEL)

In the code below, we connect our pipeline to ChromaDB. We start by opening (or creating) a Chroma database on disk, then define a collection called `LLM_titles_only_v1` where our vectors will be stored. We'll build a `VectorStoreIndex` from our enriched nodes using the embedding model and route the vectors into Chroma.

In [None]:
from llama_index.core import StorageContext

CHROMA_PATH = "./chroma_database"
client = PersistentClient(path=CHROMA_PATH)
collection = client.get_or_create_collection("LLM_titles_only_v1")

# Routing vectors into Chroma via StorageContext
vector_store = ChromaVectorStore(chroma_collection = collection)
storage_context = StorageContext.from_defaults(vector_store = vector_store)

index = VectorStoreIndex(
    nodes_1,
    storage_context = storage_context,
    embed_model=embed_model,
    show_progress = True
)

When we come back in a new session, we just need to wrap the existing Chroma collection, rebuild the index from it, and create a query engine from that index.

In [None]:
client = PersistentClient(path=CHROMA_PATH)
collection = client.get_or_create_collection("LLM_titles_only_v1")
vector_store = ChromaVectorStore(chroma_collection=collection)

index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    embed_model=embed_model,  # query-time embeddings must match!
)

In [None]:
qe = index.as_query_engine(similarity_top_k=1)

In [None]:
query = "What are the conclusions about hallucinations of language models?"

In [None]:
response = qe.query(query)

In [None]:
print(response)

# 5. Using Metadata Filters in Queries

In the previous sections, we enriched our documents with metadata like titles, summaries, and Q&A pairs. But we haven't yet shown you how to actually **USE** that metadata to filter your queries. This is one of the most powerful features of metadata enrichment.

## Why Filter with Metadata?

Imagine you have thousands of documents from different sources, topics, or time periods. Sometimes you don't want to search through ALL of them; you want to search only within:
- A specific document or set of documents
- A particular category or topic
- Content from a certain time period
- Documents by a specific author

**Metadata filtering lets you do exactly this.** It combines semantic search (finding similar content) with structured filtering (like SQL WHERE clauses).

## How It Works

LlamaIndex allows you to add filters to your queries using the `MetadataFilters` class. You can filter by:
- **Exact match**: `key == value`
- **In list**: `key IN [value1, value2, ...]`
- **Greater than / Less than**: `key > value`, `key < value`
- **Not equal**: `key != value`
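
Conceptually, a metadata filter is a structured pre-selection that runs alongside the similarity search. The plain-Python sketch below mimics the operators listed above on a list of dictionaries; it does not use LlamaIndex, and the `(key, operator, value)` tuple format, the `matches` helper and the sample nodes are assumptions for illustration.

```python
import operator

# Map each operator symbol to a comparison function.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    "in": lambda value, options: value in options,
}

def matches(metadata, filters):
    """True if the metadata satisfies every (key, op, value) condition (logical AND)."""
    for key, op, value in filters:
        if key not in metadata or not OPERATORS[op](metadata[key], value):
            return False
    return True

# Made-up nodes with hand-written metadata.
nodes = [
    {"id": "a", "metadata": {"author": "OpenAI", "year": 2025}},
    {"id": "b", "metadata": {"author": "OpenAI", "year": 2019}},
    {"id": "c", "metadata": {"author": "Someone Else", "year": 2025}},
]

filters = [("author", "==", "OpenAI"), ("year", ">", 2020)]
candidates = [n["id"] for n in nodes if matches(n["metadata"], filters)]
print(candidates)
```

Only the surviving candidates would then be ranked by embedding similarity, which is why a filter on a key that is missing from the metadata returns nothing at all.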

Let's see this in action with our enriched nodes.

## 5.1 Example: Filter by File Name (Multi-Document Collections)

In real-world applications, you often have multiple documents indexed together. Metadata filtering becomes even more powerful here: you can search across all documents OR narrow down to specific ones.

Let's demonstrate this concept. Even though we only have one PDF in this notebook, imagine you had indexed multiple research papers. You could filter by `file_name` to search within just one paper.

In [None]:
from llama_index.core.vector_stores import MetadataFilters, MetadataFilter, FilterOperator

# Filter to search only within a specific file
file_filter = MetadataFilters(
    filters=[
        MetadataFilter(
            key="file_name",
            value="why-language-models-hallucinate.pdf",
            operator=FilterOperator.EQ  # Exact match
        )
    ]
)

# Create query engine with file filter
file_filtered_qe = index.as_query_engine(
    similarity_top_k=3,
    filters=file_filter
)

response = file_filtered_qe.query("What are the main conclusions about hallucinations?")

print("ANSWER:")
print(response.response)
print("\n" + "="*80)
print("SOURCES (all from the filtered file):")
print("="*80)

for i, node in enumerate(response.source_nodes, 1):
    print(f"\nSource {i}: {node.metadata.get('file_name', 'N/A')} (Page {node.metadata.get('page_label', 'N/A')})")

**Real-world use case:**

Imagine you're building a research assistant that has indexed 100 academic papers. A user asks:
> "What does the 2024 OpenAI paper say about hallucinations?"

Without filtering, the system might retrieve chunks from ANY of the 100 papers. With metadata filtering:
```python
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="author", value="OpenAI", operator=FilterOperator.EQ),
        MetadataFilter(key="year", value="2024", operator=FilterOperator.EQ)
    ]
)
```

Now the search is constrained to ONLY the relevant paper (assuming each node carries `author` and `year` metadata in exactly this form), which can substantially improve answer quality.

# 6. Response Synthesis Modes

So far, we've focused heavily on the **retrieval** side of RAG: how to find the right chunks using semantic search, metadata enrichment, and filtering. But there's another critical component: **how the LLM synthesizes those retrieved chunks into a final answer**.

This is called **response synthesis**, and LlamaIndex offers several different modes for doing this. Each mode has different trade-offs in terms of answer quality, context handling, and API cost.

## The Problem: What Happens with Multiple Retrieved Chunks?

When you set `similarity_top_k=5`, the retriever returns 5 chunks. But how does the LLM use them?

**Three challenges:**
1. **Context length**: If chunks are long, they might exceed the LLM's context window
2. **Information synthesis**: Should chunks be combined, compared, or processed one-by-one?
3. **Relevance**: Not all retrieved chunks are equally useful; some might be noise

Different response modes solve these challenges in different ways.

## 6.1 Response Mode: `compact` (Default)

This is the default mode in LlamaIndex.

**How it works:**
1. Retrieves multiple chunks (e.g., top 5)
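2. **Concatenates them into as few context strings as possible**, ideally a single one
3. Sends the concatenated context + query to the LLM, in ONE request when everything fits
4. The LLM generates the answer based on all chunks at once; if the chunks do not fit into one prompt, the packed pieces are processed with refine-style follow-up calls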
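
The concatenation in step 2 can be sketched as a greedy packer. The version below uses a character budget as a stand-in for the model's token limit; `pack_chunks`, the budget values and the toy chunks are assumptions for illustration and do not reproduce the library's actual packing logic.

```python
# Greedy packing sketch: merge chunks into as few context strings as fit
# a size budget. Characters stand in for tokens to keep the example simple.

def pack_chunks(chunks, max_chars):
    packed, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n\n{chunk}" if current else chunk
        # Accept the chunk if it fits, or if the buffer is empty
        # (an oversized single chunk is kept as-is in this sketch).
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed

chunks = ["a" * 40, "b" * 40, "c" * 40]

one_prompt = pack_chunks(chunks, max_chars=200)   # everything fits
two_prompts = pack_chunks(chunks, max_chars=100)  # budget forces a split

print(len(one_prompt), "context string(s) with a budget of 200")
print(len(two_prompts), "context string(s) with a budget of 100")
```

When everything fits, one context string means one LLM call. With the smaller budget the packer produces two strings, and to our understanding of the LlamaIndex documentation the library then works through them with refine-style calls rather than failing.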

**Characteristics:**
- ✅ **Fast**: Usually only one LLM call
- ✅ **Cheap**: Minimal API cost
- ✅ **Good for short chunks**: Works well when all chunks fit in context
- ❌ **Large contexts cost more calls**: If the chunks do not fit into one prompt, the mode falls back to several refine-style calls
- ❌ **No refinement in the common case**: With a single call, the LLM sees everything at once and can't iteratively improve

**When to use:**
- Default choice for most queries
- Short to medium-length chunks
- Fast prototyping

In [None]:
# Example: Compact mode (default)
query_engine_compact = index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact"  # This is actually the default
)

query = "What are the main causes of hallucinations in language models?"
response_compact = query_engine_compact.query(query)

print("="*80)
print("RESPONSE MODE: COMPACT")
print("="*80)
print(response_compact.response)
print("\n" + "="*80)
print(f"Number of chunks used: {len(response_compact.source_nodes)}")
print("="*80)

## 6.2 Response Mode: `refine`

This mode uses an **iterative refinement** approach.

**How it works:**
1. Retrieves multiple chunks (e.g., top 5)
2. Sends **chunk 1** to LLM → Get initial answer
3. Sends **chunk 2 + previous answer** to LLM → "Refine the answer based on new context"
4. Sends **chunk 3 + refined answer** to LLM → "Refine again"
5. Continues until all chunks are processed
6. Returns the final refined answer
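
The loop described above can be sketched with a fake LLM, so the control flow is visible without any API calls. The prompt wording, `refine_answer` and `fake_llm` are assumptions for illustration; LlamaIndex uses its own prompt templates.

```python
# Refine-loop sketch: one LLM call per chunk, each call seeing the previous answer.

def refine_answer(query, chunks, llm):
    answer = None
    for chunk in chunks:
        if answer is None:
            # First chunk: produce an initial answer.
            prompt = f"Context:\n{chunk}\n\nQuestion: {query}\nAnswer:"
        else:
            # Later chunks: ask the model to improve the existing answer.
            prompt = (
                f"Question: {query}\n"
                f"Existing answer: {answer}\n"
                f"New context:\n{chunk}\n"
                "Refine the existing answer if the new context helps."
            )
        answer = llm(prompt)
    return answer

# Fake LLM that records every prompt and returns a numbered answer.
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return f"answer v{len(calls)}"

final = refine_answer(
    "Why do models hallucinate?",
    ["chunk 1", "chunk 2", "chunk 3"],
    fake_llm,
)

print(final)
print("LLM calls:", len(calls))
```

Because each call depends on the previous answer, the calls are strictly sequential, which explains both the cost and the way later chunks get the last word.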

**Characteristics:**
- ✅ **Better quality**: Iteratively improves the answer with each chunk
- ✅ **Handles long contexts**: Processes chunks one at a time, avoids context limits
- ✅ **More comprehensive**: Can incorporate information from many chunks sequentially
- ❌ **Slower**: Makes N LLM calls (where N = number of chunks)
- ❌ **More expensive**: Each refinement costs API tokens
- ❌ **Later chunks matter more**: Information from later chunks might overshadow earlier ones

**When to use:**
- Complex questions requiring information from multiple sources
- Long documents where chunks are large
- When answer quality is more important than speed/cost

In [None]:
# Example: Refine mode
query_engine_refine = index.as_query_engine(
    similarity_top_k=3,
    response_mode="refine"
)

response_refine = query_engine_refine.query(query)

print("="*80)
print("RESPONSE MODE: REFINE")
print("="*80)
print(response_refine.response)
print("\n" + "="*80)
print(f"Number of chunks used: {len(response_refine.source_nodes)}")
print("="*80)
print("\n💡 Note: This made 3 LLM calls (one per chunk) to iteratively refine the answer")

## 6.3 Response Mode: `tree_summarize`

This mode uses a **hierarchical summarization** approach.

**How it works:**
1. Retrieves multiple chunks (e.g., top 8)
2. Groups them into batches that fit into one prompt (the diagram below uses pairs for illustration)
3. Summarizes each batch → creates intermediate summaries
4. Groups those summaries and summarizes again
5. Repeats until one final summary remains


```
Chunk1, Chunk2 → Summary A
Chunk3, Chunk4 → Summary B
Chunk5, Chunk6 → Summary C
Chunk7, Chunk8 → Summary D

Summary A, B → Summary AB
Summary C, D → Summary CD

Summary AB, CD → Final Answer
```
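
The diagram can be reproduced with a fake summarizer that simply brackets its inputs, which makes the tree structure and the number of calls visible. This is a sketch under the pairing assumption from the diagram (`batch_size=2`); `tree_summarize` and `fake_summarize` are hypothetical helpers, and the real mode sizes its batches by what fits into a prompt.

```python
# Hierarchical summarization sketch with a fake summarizer.

def tree_summarize(chunks, summarize, batch_size=2):
    if not chunks:
        raise ValueError("need at least one chunk")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    level = list(chunks)
    levels = 0
    # Keep reducing until a single summary remains.
    # (A lone chunk is returned unchanged in this sketch.)
    while len(level) > 1:
        level = [
            summarize(level[i:i + batch_size])
            for i in range(0, len(level), batch_size)
        ]
        levels += 1
    return level[0], levels

# Fake summarizer: records each call and brackets the batch it received.
calls = []

def fake_summarize(batch):
    calls.append(batch)
    return "(" + "+".join(batch) + ")"

chunks = [f"c{i}" for i in range(1, 9)]
final, levels = tree_summarize(chunks, fake_summarize)

print(final)
print("levels:", levels)
print("summarizer calls:", len(calls))
```

For 8 chunks in pairs the tree has 3 levels but needs 4 + 2 + 1 = 7 summarizer calls: the depth grows like log(N), while the total number of calls depends on how many chunks fit into each batch.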

**Characteristics:**
- ✅ **Handles many chunks**: Can process dozens of chunks efficiently
- ✅ **Balanced processing**: All chunks contribute equally (no recency bias)
- ✅ **Good for summarization**: Excellent for "summarize this document" type queries
- ❌ **Multiple LLM calls**: About log(N) levels of summarization for N chunks, with one call per batch at each level
- ❌ **Loss of detail**: Hierarchical summarization can lose fine-grained details
- ❌ **Slower than compact**: More calls than compact, though typically fewer than refine when several chunks fit into one batch

**When to use:**
- Large number of retrieved chunks (10+)
- Summarization tasks
- When you want balanced consideration of all chunks

In [None]:
# Example: Tree Summarize mode
query_engine_tree = index.as_query_engine(
    similarity_top_k=4,  # Use 4 chunks to show tree structure
    response_mode="tree_summarize"
)

response_tree = query_engine_tree.query(query)

print("="*80)
print("RESPONSE MODE: TREE_SUMMARIZE")
print("="*80)
print(response_tree.response)
print("\n" + "="*80)
print(f"Number of chunks used: {len(response_tree.source_nodes)}")
print("="*80)
print("\n💡 Note: This used hierarchical summarization (batches of chunks → intermediate summaries → final answer)")

## 6.4 Comparing Response Modes Side-by-Side

Let's compare all three modes on the same query to see the differences:

In [None]:
# Compare all three modes
comparison_query = "Summarize the paper's main findings about why language models hallucinate"

modes = ["compact", "refine", "tree_summarize"]
responses = {}

print("="*80)
print(f"QUERY: {comparison_query}")
print("="*80)

for mode in modes:
    qe = index.as_query_engine(similarity_top_k=3, response_mode=mode)
    response = qe.query(comparison_query)
    responses[mode] = response

    print(f"\n{'='*80}")
    print(f"MODE: {mode.upper()}")
    print(f"{'='*80}")
    print(response.response)
    print(f"\nChunks used: {len(response.source_nodes)}")



## 6.5 Key Takeaways: Response Synthesis Modes

### **Quick Reference Table:**

| Mode | LLM Calls | Speed | Cost | Quality | Best For |
|------|-----------|-------|------|---------|----------|
| **compact** | 1 (more if the chunks don't fit) | Fast | Cheap | ⭐⭐ Good | Default choice, short chunks |
| **refine** | N (per chunk) | Slow | Expensive | ⭐⭐⭐ Best | Complex queries, detail-oriented |
| **tree_summarize** | Several (about log(N) levels) | Medium | Medium | ⭐⭐ Good | Many chunks, summarization |

### **Decision Guide:**

**Use `compact` when:**
- ✅ You have 2-5 short chunks
- ✅ Fast response time is important
- ✅ You're prototyping or testing
- ✅ Cost is a concern

**Use `refine` when:**
- ✅ Answer quality is paramount
- ✅ You need comprehensive answers from multiple sources
- ✅ Chunks contain complementary information
- ✅ You can afford the extra API calls

**Use `tree_summarize` when:**
- ✅ You have many chunks (10+)
- ✅ You're doing summarization
- ✅ You want balanced treatment of all chunks
- ✅ Moderate cost/quality trade-off is acceptable

### **Pro Tips:**

1. **Start with `compact`** - It's the default for a reason. Only switch if you have a specific need.

2. **Monitor token usage** - In production, track how many tokens each mode uses. `refine` can get expensive quickly!

3. **Test with your data** - The "best" mode depends on your specific documents and queries. Run experiments!

4. **Consider hybrid approaches** - You can use `compact` for simple queries and `refine` for complex ones.

5. **Remember: Good retrieval matters more** - A perfect synthesis mode can't fix bad retrieval. Focus on metadata, chunking, and filtering first!
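
The hybrid approach from tip 4 could start as a tiny routing function that picks a mode per query. The keyword lists and thresholds below are arbitrary assumptions loosely based on the decision guide above, and `choose_response_mode` is a hypothetical helper; treat it as a starting point to tune on your own queries.

```python
# Heuristic router sketch: pick a response mode from the query text and
# the number of retrieved chunks. Keywords and thresholds are assumptions.

SUMMARY_WORDS = ("summarize", "summary", "overview")
DETAIL_WORDS = ("compare", "explain in detail", "step by step")

def choose_response_mode(query, n_chunks):
    q = query.lower()
    if n_chunks >= 10 or any(word in q for word in SUMMARY_WORDS):
        return "tree_summarize"
    if n_chunks > 5 or any(word in q for word in DETAIL_WORDS):
        return "refine"
    return "compact"

for question, k in [
    ("What is the main cause of hallucinations?", 3),
    ("Compare the two explanations given in the paper", 3),
    ("Summarize the paper's main findings", 3),
]:
    print(f"{choose_response_mode(question, k):<15} <- {question}")
```

The returned string could then be passed as `response_mode` when creating the query engine, so simple questions stay cheap and only demanding ones pay for extra calls.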

---
