# Introduction
In this exercise, we will learn how to **enrich documents with metadata** before indexing them for RAG. So instead of only storing raw text, we’ll add useful information such as titles, summaries or example questions/answers.

# 1. Loading the data

We are going to work with the PDF file "why-language-models-hallucinate.pdf" (a recent OpenAI research piece that explores the statistical reasons behind model hallucinations) and load it using `SimpleDirectoryReader`:


In [None]:
import os

# Configure OpenAI API key
OPENAI_API_KEY = None

try:
    from google.colab import userdata  # type: ignore
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    if OPENAI_API_KEY:
        print('✅ API key loaded from Colab secrets')
except Exception:
    pass

if not OPENAI_API_KEY:
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

if not OPENAI_API_KEY:
    try:
        from getpass import getpass
        print('💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY')
        OPENAI_API_KEY = getpass('Enter your OpenAI API Key: ')
    except Exception as exc:
        raise ValueError('❌ ERROR: No API key provided! Set OPENAI_API_KEY as an environment variable or Colab secret.') from exc

if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == '':
    raise ValueError('❌ ERROR: No API key provided!')

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

print('✅ Authentication configured!')

OPENAI_MODEL = 'gpt-5-nano'  # Using gpt-5-nano for cost efficiency
print(f'🤖 Selected Model: {OPENAI_MODEL}')

OPENAI_EMBED_MODEL = 'text-embedding-3-small'
print(f'🧠 Embedding Model: {OPENAI_EMBED_MODEL}')


In [None]:
from llama_index.core import SimpleDirectoryReader

documents = await SimpleDirectoryReader(input_files=["./data/why-language-models-hallucinate.pdf"]).aload_data()

The file has 36 pages, so the data connector created 36 document objects:

In [None]:
print("Number of document objects:", len(documents))

# 2. Controlling Metadata Visibility

Before we enrich our documents, we first need to understand what information is already stored in them.



We already know that each document carries not just the text but also metadata. In addition, two attributes control which parts of that metadata are visible to the embedding model and to the LLM at query time:

`excluded_embed_metadata_keys`:
- tells LlamaIndex which metadata **should not be included when creating embeddings**
- embeddings are designed to capture the semantic meaning of content, and technical details such as file size or last modified date do not add any useful meaning.

`excluded_llm_metadata_keys`:
- tells LlamaIndex which metadata **should not be sent to the LLM when the document is retrieved at query time**
- the reason for controlling this is that some metadata, such as the author, can provide valuable context to the LLM when generating an answer, while other metadata would only distract the model and reduce the clarity of its response

These two parameters give us **control over what information flows into embeddings and into the LLM**.
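The effect of the two exclusion lists can be pictured with a small plain-Python sketch. This is not LlamaIndex's implementation: the helper `visible_metadata` and the sample values are hypothetical and only show how one metadata dictionary can yield two different views.

```python
# Plain-Python sketch (not LlamaIndex code): one metadata dict, two views.

def visible_metadata(metadata, excluded_keys):
    """Return the metadata entries whose keys are not excluded."""
    return {k: v for k, v in metadata.items() if k not in excluded_keys}

# Hypothetical sample values, for illustration only
metadata = {"file_name": "report.pdf", "file_size": 123456, "author": "Jane Doe"}

excluded_embed_metadata_keys = ["file_name", "file_size"]  # hidden from the embedding model
excluded_llm_metadata_keys = ["file_size"]                 # hidden from the LLM

embed_view = visible_metadata(metadata, excluded_embed_metadata_keys)
llm_view = visible_metadata(metadata, excluded_llm_metadata_keys)

print("Embedding model sees:", embed_view)
print("LLM sees:", llm_view)
```

The same document therefore contributes only the author to the embedding, while the LLM additionally sees the file name.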

In [None]:
documents[0].__dict__

## 2.1 Manually Constructed Documents

Most of the time we let LlamaIndex create documents automatically when we load files. But sometimes we want to **build a Document manually**. This is useful when:
1. Our data doesn’t come from a file (e.g., from a database or an API)
2. We want to attach custom metadata up front



The example below (adapted from the documentation) shows how we can construct a custom Document and explicitly control what metadata is included when building embeddings.

In this case, we give the document some metadata ("file_name", "category" and "author"). But notice that we tell LlamaIndex to **exclude "file_name"** from embeddings by setting `excluded_embed_metadata_keys`. This makes sense, because the actual file name is not semantically meaningful and would only add noise to the embedding space. The category ("finance") and author ("LlamaIndex"), however, may carry useful meaning for semantic search, so we leave them in.


In [None]:
# What the Embedding model will see

from llama_index.core import Document
from llama_index.core.schema import MetadataMode

document = Document(
    text="This is a short snippet of a super-customized document that will go to the embedding model",
    metadata={
        "file_name": "super_secret_document.txt",
        "category": "finance",
        "author": "LlamaIndex",
    },
    excluded_embed_metadata_keys=["file_name"]
)

print("The Embedding model sees this: \n", document.get_content(metadata_mode=MetadataMode.EMBED))

Just like we control which metadata flows into the embedding model, we can also **decide which metadata the LLM will receive when it is asked to answer a query**. This is important because the LLM doesn’t only use the raw text of a chunk. It can also use metadata as extra context to generate a better answer.

In the example below, we create a custom Document with the same metadata fields. This time, however, we tell LlamaIndex to exclude the category from what the LLM sees. That means when the document is retrieved later, the model will still see the file name (so it knows the source) and the author (which may add credibility). Here we treat the category as redundant: when the text itself already makes the topic clear, the label is unlikely to affect the LLM’s response and only takes up prompt space.

In [None]:
# What the LLM will see

from llama_index.core import Document
from llama_index.core.schema import MetadataMode

document = Document(
    text="This is a short snippet of a super-customized document that will go to the embedding model",
    metadata={
        "file_name": "super_secret_document.txt",
        "category": "finance",
        "author": "LlamaIndex",
    },
    excluded_llm_metadata_keys=["category"],
)

print(
    "The LLM sees this: \n", document.get_content(metadata_mode=MetadataMode.LLM))

Some metadata is more useful for embeddings, some is more useful for the LLM, and some works for both. **Whether we exclude or include a given piece of information depends on our use case**: are we trying to keep embeddings clean, or give the LLM more context? In real-world projects, this is a design choice you make depending on whether a metadata field adds more value than noise.

By default, LlamaIndex already takes care of formatting metadata in a clean way, and in practice you usually don’t need to change it. However, you can **customize the formatting** if you want more readable prompts for the LLM or if you want the metadata formatted in a certain style to match your company’s pipelines or prompt style.

Optional parameters:
- `metadata_seperator` - Sets the character(s) between different pieces of metadata. The default is a newline (`"\n"`).
- `metadata_template` - Defines how each key-value pair is shown. Both `{key}` and `{value}` must be included.
- `text_template` - Defines how the metadata string and the text are combined. It takes two variables: `{metadata_str}` and `{content}`.

This doesn’t change what information is sent, only how it is displayed. For example, for the LLM a cleaner format can sometimes help it parse metadata more naturally.

In our exercise, we’ll try a custom format just to see how this works.

In [None]:
# Formatting
document = Document(
    text="This is a short snippet of a super-customized document that will go to the model",
    metadata={
        "file_name": "super_secret_document.txt",
        "category": "finance",
        "author": "LlamaIndex",
    },
    metadata_seperator=", ",
    metadata_template="{key}:{value}",
    text_template="Metadata:\n{metadata_str}\n------\nContent:\n{content}",
)

print("The LLM sees this: \n", document.get_content(metadata_mode=MetadataMode.LLM))
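How the three formatting options fit together can be sketched in plain Python. This is a simplified picture under the assumption that formatting happens in three steps (format each pair, join the pairs, wrap metadata and text); it is not the library's actual code, and the local variable names are ours.

```python
# Plain-Python sketch of how the three formatting options compose.
metadata = {
    "file_name": "super_secret_document.txt",
    "category": "finance",
    "author": "LlamaIndex",
}
content = "This is a short snippet of a super-customized document that will go to the model"

separator = ", "                      # plays the role of the metadata separator option
metadata_template = "{key}:{value}"   # how one key-value pair is rendered
text_template = "Metadata:\n{metadata_str}\n------\nContent:\n{content}"

# 1) format each key-value pair, 2) join the pairs, 3) wrap metadata and text
metadata_str = separator.join(
    metadata_template.format(key=k, value=v) for k, v in metadata.items()
)
rendered = text_template.format(metadata_str=metadata_str, content=content)
print(rendered)
```

Changing any of the three strings changes only the layout of the rendered text, not which fields are included.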

Now let’s return to the real documents created automatically when we loaded our PDF. Each of these documents already comes with some metadata attached.

- "file_path" can be useful (it tells us where the chunk came from)
- "page_label" usually does not add much value for embeddings (handy to keep for LLM if you want the reference)

In [None]:
print(documents[0].get_content(metadata_mode=MetadataMode.EMBED))

Let's exclude page labels - we can loop through all documents, adjust their formatting template and tell LlamaIndex not to include "page_label" in the embeddings:

In [None]:
for doc in documents:
    # Defining the content/metadata template
    doc.text_template = "Metadata:\n{metadata_str}\n---\nContent:\n{content}"

    # Excluding page label from embedding
    if "page_label" not in doc.excluded_embed_metadata_keys:
        doc.excluded_embed_metadata_keys.append("page_label")

Let's check the transformation - page label should not be included in the metadata:

In [None]:
print(documents[0].get_content(metadata_mode=MetadataMode.EMBED))

# 3. Building a RAG Pipeline with Metadata Enrichment

Now we’re getting to the interesting and fun part of the notebook. Up to this point, our documents only carried basic metadata like file names and page labels. That’s useful for organizing files, but it doesn’t really help a RAG system retrieve more accurate answers. So this is what we’re going to do: make our RAG pipeline smarter by **enriching each document chunk with additional context** such as short titles, summaries and example Q&As. To do so, we'll **use a language model (`gpt-5-nano`) to generate this metadata**.

To really test whether this helps, we’ll build **3 different versions of our nodes** (chunks of text):
- Baseline nodes (`baseline_nodes`) - only the basic metadata.
- Title-enriched nodes (`nodes_1`) - chunks labeled with short, descriptive titles.
- Fully enriched nodes (`nodes_2`) - chunks augmented with titles, summaries and example Q&A pairs.

We’ll follow three main steps in this experiment:
1. **Splitting the data**: we'll break the PDF into smaller, manageable chunks
2. **Creating three versions of nodes**
3. **Building and testing RAG indexes**: we’ll run the same queries against each node set and compare the results to see how much metadata enrichment improves retrieval and answers

> NOTE: This setup is inspired by the official LlamaIndex metadata extraction [example](https://docs.llamaindex.ai/en/stable/examples/metadata_extraction/MetadataExtraction_LLMSurvey/#automated-metadata-extraction-for-better-retrieval-synthesis).

**OpenAI's Language model for transformations**

We will use OpenAI’s `gpt-5-nano` model, which is fast, affordable, and accurate enough for our metadata extraction.

In [None]:
from llama_index.llms.openai import OpenAI

# Language model
llm_transformations = OpenAI(
    model = OPENAI_MODEL,
    temperature = 0.0,
    max_tokens = 512
)

## 3.1 Splitting the data

First, we need to prepare our documents for transformation by splitting them into smaller chunks, because large documents cannot be processed effectively all at once. We'll use `SentenceSplitter`, which splits the content into chunks of up to 1024 tokens with an overlap of 128 tokens between consecutive chunks. The overlap means that information appearing at the boundary of one chunk is also present at the start of the next one, so context is not cut off at the boundary. The `separator` parameter tells the splitter to break text along spaces (keeping words intact).

In [None]:
from llama_index.core.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(
    separator = " ",
    chunk_size = 1024,
    chunk_overlap = 128
)
text_splitter
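To make the idea of overlapping chunks concrete, here is a deliberately simplified sketch. It slides a fixed-size window over whitespace-separated words; the real `SentenceSplitter` counts tokens with a tokenizer and also tries to respect sentence boundaries, so treat `chunk_with_overlap` as a hypothetical illustration only.

```python
# Simplified sketch: fixed-size windows over whitespace "tokens".
# The real SentenceSplitter uses a tokenizer and respects sentence boundaries.

def chunk_with_overlap(words, chunk_size, chunk_overlap):
    """Split a list of words into windows that share `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks

words = "one two three four five six seven eight nine ten".split(" ")
chunks = chunk_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts with the last word of the previous one, which is exactly the role the 128-token overlap plays in our splitter.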

## 3.2 Creating three versions of Nodes

We’ll now create three parallel versions of our corpus so we can run a fair comparison later.

### 3.2.1 Creating baseline nodes (split only)
First, we create the baseline: chunks produced by the splitter with no metadata enrichment. This gives us a control group. Any improvement we see later can be attributed to the extra metadata, not to changes in chunking.

In [None]:
# Baseline nodes
baseline_nodes = text_splitter.get_nodes_from_documents(documents)

In [None]:
baseline_nodes[:3]

### 3.2.2 Enriched nodes (titles extraction)

Next, we'll add short, descriptive titles for each chunk of text. These will be labels for each chunk and often help the retriever match user intent to the right passage.

To do this, we use `TitleExtractor`, which takes an LLM and generates a title that is stored in each node's metadata under `document_title`. We also set the parameter `nodes = 5`, which tells the extractor to base each title on up to the first 5 chunks of a source document.

In [None]:
from llama_index.core.extractors import TitleExtractor

title_extractor = TitleExtractor(llm = llm_transformations, nodes = 5)
title_extractor

Next, we run this transformation using `IngestionPipeline`. The pipeline executes a sequence of transformations, in our case splitting the text into chunks and then adding titles. We set `in_place=False` so the pipeline does not modify its input in place and instead returns a separate list (stored in `nodes_1`) that we can compare against the baseline in an A/B test.

In [None]:
from llama_index.core.ingestion import IngestionPipeline

pipeline_titles = IngestionPipeline(
    transformations=[
        text_splitter,
        title_extractor
    ]
)

In [None]:
# Running the pipeline
nodes_1 = pipeline_titles.run(
    documents = documents,
    in_place = False,
    show_progress = True
)
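Conceptually, a pipeline of this kind just applies its transformations one after another. The sketch below shows that idea with plain dictionaries standing in for nodes. All three functions are hypothetical, and the deep copy used for `in_place=False` is this sketch's own choice for making the effect visible, not a claim about how LlamaIndex implements the flag.

```python
import copy

# Plain-Python sketch of a transformation pipeline (not LlamaIndex's implementation).
# Each "node" is a dict; each transformation takes a list of nodes and returns a list.

def split_in_two(nodes):
    """Hypothetical splitter: cut every text into two halves."""
    out = []
    for node in nodes:
        mid = len(node["text"]) // 2
        for part in (node["text"][:mid], node["text"][mid:]):
            out.append({"text": part, "metadata": dict(node["metadata"])})
    return out

def add_title(nodes):
    """Hypothetical extractor: use the first word as a stand-in title."""
    for node in nodes:
        node["metadata"]["document_title"] = node["text"].split()[0]
    return nodes

def run_pipeline(nodes, transformations, in_place=True):
    # in_place=False works on a copy so the caller's objects stay untouched
    working = nodes if in_place else copy.deepcopy(nodes)
    for transform in transformations:
        working = transform(working)
    return working

docs = [{"text": "alpha beta gamma delta", "metadata": {"page_label": "1"}}]
enriched = run_pipeline(docs, [split_in_two, add_title], in_place=False)

print(len(enriched), enriched[0]["metadata"])
print(docs[0]["metadata"])  # the input is unchanged
```

The order of the list matters: the title step only makes sense after the splitter has produced the chunks it should label.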

### 📝 EXERCISE 1: Explore Metadata Extraction (7-10 minutes)

**What you'll practice:** Understanding how metadata extraction enriches document chunks.

**Your task:**
1. Compare a baseline node (without metadata) to an enriched node (with title extraction)
2. Display the content of `baseline_nodes[5]` using `.get_content()`
3. Display the content of `nodes_1[5]` (with title metadata) using `.get_content(metadata_mode=MetadataMode.LLM)`
4. Observe: What additional information does the title provide? How might this help retrieval?

**Key concepts:**
- **Baseline nodes**: Just the text chunk, no context
- **Title-enriched nodes**: Include a descriptive title that summarizes the chunk's topic
- **Why it matters**: Titles help the LLM understand context and improve answer quality

**Hint:** Use `MetadataMode.LLM` to see what the language model receives, including metadata.

**Expected outcome:** The enriched node will have a "document_title" field that provides context about the chunk's content.

In [None]:
# YOUR CODE HERE
# Example solution structure:
# 
# print("BASELINE NODE (no metadata enrichment):")
# print("=" * 60)
# print(baseline_nodes[5].get_content())
# 
# print("\n\nENRICHED NODE (with title extraction):")
# print("=" * 60)
# print(nodes_1[5].get_content(metadata_mode=MetadataMode.LLM))
# 
# print("\n\nOBSERVATION:")
# print("Notice how the enriched node has a 'document_title' that")
# print("provides context about what this chunk discusses.")

### 3.2.3 Fully enriched nodes (titles + Q&A + summary extraction)

For the richest version of our nodes, we’ll go beyond titles and also add example Q&A pairs and short summaries.

**Q&A pairs simulate how a real user might query the system and what kind of response a chunk could provide**. This makes the retriever’s job easier because each chunk carries hints about the kinds of questions it can answer. In practice, adding Q&A metadata often improves recall (finding the right chunk) and helps the system produce more useful answers. We’ll use `QuestionsAnsweredExtractor` and set `questions = 3`, which asks the LLM to generate three realistic questions that each chunk can answer.

In [None]:
from llama_index.core.extractors import QuestionsAnsweredExtractor

qa_extractor = QuestionsAnsweredExtractor(llm = llm_transformations, questions = 3)
qa_extractor

**Summaries capture the core ideas of each chunk in a compact form**. They provide another layer of metadata that’s especially helpful when users ask broader or high-level questions. We’ll use `SummaryExtractor` for this task.

In [None]:
from llama_index.core.extractors import SummaryExtractor

summary_extractor = SummaryExtractor(llm = llm_transformations)
summary_extractor

Both transformations run inside the same pipeline, along with the SentenceSplitter and TitleExtractor, so each chunk ends up with a title, a short summary and 3 example Q&A pairs.

In [None]:
# titles + Q&A + summary
pipeline_rich = IngestionPipeline(
    transformations=[
        text_splitter,
        title_extractor,
        qa_extractor,
        summary_extractor
    ]
)

In [None]:
# Running the pipeline
nodes_2 = pipeline_rich.run(
    documents = documents,
    in_place = False,
    show_progress = True
)

In [None]:
print(nodes_2[0].get_content(metadata_mode=MetadataMode.LLM))

**Splicing Baseline and Enriched Nodes**

To fairly test the effect of metadata enrichment, we don’t want to rebuild our dataset in three completely separate ways. That would make it difficult to know if differences in answers are due to enrichment or simply because the data was reprocessed differently. Instead, we keep most of the dataset identical and **replace only a small slice of nodes with enriched versions**.

This creates a controlled experiment:
- All three indexes contain the same core content.
- The only difference is that in `index_1` and `index_2`, a chosen section of the document is enriched with new metadata (titles, or titles + Q&A + summaries).
- If the enriched versions produce better answers, we can be confident the improvement comes from the metadata itself, not from unrelated differences.

  
First let's check the number of nodes in the baseline split:

In [None]:
print(len(baseline_nodes))

When deciding which nodes to replace, we need to balance two things:
1. Keep enough baseline nodes so the indexes are mostly identical.
2. Pick a meaningful section of the paper (not just references, etc.).
   
In our case, the baseline split produced **39 nodes**. A good rule of thumb is to replace about 20–25% of the nodes. That’s large enough to see an effect, but small enough that the rest of the dataset remains constant. We chose the slice `[15:25]` (positions 15 through 24, i.e. 10 nodes or roughly a quarter of the total), which corresponds to the middle of the paper.

We will create a helper function that replaces the baseline slice `[15:25]` with the enriched nodes from `nodes_1` or `nodes_2`. The rest of the baseline stays intact:

In [None]:
def splice(orig, replacement, start=15, end=25):
    # keep same length, swap slice [start:end] with enriched nodes
    return orig[:start] + replacement[start:end] + orig[end:]

# mostly baseline nodes, with titles added in positions 15–25
nodes_for_index_1 = splice(baseline_nodes, nodes_1, 15, 25)

# mostly baseline nodes, with titles + Q&A + summaries added in positions 15–25
nodes_for_index_2 = splice(baseline_nodes, nodes_2, 15, 25)
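A quick way to convince ourselves that the helper behaves as intended is to run it on toy lists. In this sketch, strings stand in for nodes and the two lists are assumed to align index by index. That assumption also matters for the real data, so it is worth checking that `len(nodes_1) == len(baseline_nodes)` before splicing.

```python
# Toy check of the splice helper on plain lists (strings stand in for nodes).

def splice(orig, replacement, start=15, end=25):
    # keep same length, swap slice [start:end] with enriched nodes
    return orig[:start] + replacement[start:end] + orig[end:]

baseline = [f"base_{i}" for i in range(39)]   # 39 nodes, as in our baseline split
enriched = [f"rich_{i}" for i in range(39)]   # assumed to align index by index

mixed = splice(baseline, enriched, 15, 25)

# Positions that now hold an enriched node
replaced = [i for i, node in enumerate(mixed) if node.startswith("rich_")]

print("Total nodes:", len(mixed))
print("Replaced positions:", replaced[0], "to", replaced[-1])
print(f"Share replaced: {len(replaced) / len(mixed):.1%}")
```

The list keeps its length of 39, and only positions 15 through 24 (10 nodes) come from the enriched set.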

**Creating Embeddings**

Now we’ll embed and index each of the three node sets with the same embedding model, `text-embedding-3-small`. Creating a `VectorStoreIndex` from nodes automatically computes embeddings for those nodes.

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex

embed_model = OpenAIEmbedding(model=OPENAI_EMBED_MODEL)

# baseline only
index_0 = VectorStoreIndex(baseline_nodes, embed_model=embed_model, show_progress=True)

# baseline with the slice replaced by titles
index_1 = VectorStoreIndex(nodes_for_index_1, embed_model=embed_model, show_progress=True)

# baseline with the slice replaced by titles + Q&A + summary
index_2 = VectorStoreIndex(nodes_for_index_2, embed_model=embed_model, show_progress=True)

**Querying**

Next, we'll create three query engines with the same setting, `similarity_top_k=1`, so each one retrieves only the single most similar node:

In [None]:
query = "What metrics are commonly used to evaluate text generation quality, and what are their limitations according to the paper?"

query_engine_0 = index_0.as_query_engine(similarity_top_k=1)
query_engine_1 = index_1.as_query_engine(similarity_top_k=1)
query_engine_2 = index_2.as_query_engine(similarity_top_k=1)

Each index is queried with the exact same question:

In [None]:
response_0 = query_engine_0.query(query)
response_1 = query_engine_1.query(query)
response_2 = query_engine_2.query(query)
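What `similarity_top_k=1` means at retrieval time can be illustrated with a tiny plain-Python sketch. It assumes cosine similarity as the scoring function and uses made-up 3-dimensional vectors; the node names and numbers are hypothetical and have nothing to do with the real embeddings.

```python
import math

# Plain-Python sketch of top-k retrieval by cosine similarity.
# The vectors are made-up 3-dimensional "embeddings", for illustration only.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, node_vecs, k=1):
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

node_vecs = {
    "references_page": [0.9, 0.1, 0.0],
    "evaluation_section": [0.2, 0.9, 0.1],
    "introduction": [0.1, 0.2, 0.9],
}
query_vec = [0.1, 1.0, 0.0]

best = top_k(query_vec, node_vecs, k=1)
print(best)
```

With `k=1` the answer depends entirely on a single node, which is why a small shift in the embedded text (for example, an added title) can change which passage the LLM gets to see.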

Let's observe the results:

- **BASELINE**:
In the baseline setup, the system produced an answer mentioning BLEU, ROUGE, METEOR, and perplexity. However, the retrieved source was from page 20, which only contains references. This means the model did not actually ground its answer in the paper, but instead pulled in "classic NLP metrics" from its prior knowledge. The result looks plausible on the surface but is ultimately a hallucination because the metrics are not discussed in that section of the PDF.


- **TITLES**:
With titles added as metadata, the answer changed noticeably. This time, the system retrieved content from **page 34**, where the paper discusses evaluation benchmarks such as HELM, MMLU-Pro, and GPQA. The answer now highlighted evaluation methods like mathematical equivalence, instruction following, and user chat assessments, as well as the problem of benchmarks discouraging "I don’t know" responses. This aligns much more closely with the original text in Appendix F, making the response both faithful and relevant.


- **TITLES + Q&A + SUMMARY**:
This version performed similarly to the titles-only case. It also pointed to page 34 and generated an answer that covered mathematical equivalence, instruction following, penalties for abstention, and hallucination risks. While this was still grounded in the correct part of the paper and avoided the hallucination seen in the baseline, it did not provide a substantial improvement over the titles-only approach.




In [None]:
print("\n[BASELINE]\n", response_0.response)
print("\n[TITLES]\n", response_1.response)
print("\n[TITLES + Q&A + SUMMARY]\n", response_2.response)

def show_sources(resp, k=1):
    for i, sn in enumerate(resp.source_nodes[:k], 1):
        md = sn.node.metadata or {}
        print(f"\nSource {i} | page={md.get('page_label')} | title={md.get('document_title')}")
        print(sn.node.get_content(metadata_mode=MetadataMode.NONE)[:400], "\n---------------")

print("\n SOURCES: BASELINE")
show_sources(response_0)

print("\n SOURCES: TITLES")
show_sources(response_1)

print("\n SOURCES: TITLES + Q&A + SUMMARY")
show_sources(response_2)

In our experiment, the document included all pages, even those containing references. That meant **the retriever could sometimes pull from irrelevant sections**, as we saw with the baseline answer. Metadata enrichment helped correct this problem by steering retrieval toward more meaningful content.

In practice, however, we should **pre-process documents and exclude irrelevant sections such as references or bibliographies**. By removing these, we reduce noise in the retrieval stage and make it more likely that answers are grounded in the substantive parts of the text.
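One simple way to do this is to filter pages by their `page_label` before splitting. The sketch below uses a small stand-in class instead of real LlamaIndex documents, and the set of pages to drop is a hypothetical choice: we only observed that page 20 contains references, so the full range has to be checked against the PDF.

```python
from dataclasses import dataclass, field

# Stand-in for a loaded page; our real documents expose a similar `metadata` dict.
@dataclass
class Page:
    text: str
    metadata: dict = field(default_factory=dict)

def drop_pages(pages, labels_to_drop):
    """Keep only pages whose 'page_label' is not in labels_to_drop."""
    return [p for p in pages if p.metadata.get("page_label") not in labels_to_drop]

# Placeholder texts, for illustration only
pages = [
    Page("Main text", {"page_label": "1"}),
    Page("Reference list", {"page_label": "20"}),
    Page("Appendix", {"page_label": "34"}),
]

# Hypothetical choice: only page 20 is listed; check the PDF for the full range
reference_pages = {"20"}

kept = drop_pages(pages, reference_pages)
print([p.metadata["page_label"] for p in kept])
```

The same list comprehension would work on the `documents` list from the beginning of the notebook, since each document carries a `page_label` in its metadata.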

# 4. Persistent Storage

Once we’ve decided which pages to keep and which metadata to enrich (e.g., titles only, pages with references removed), we can persist that final node set, for example, in ChromaDB. We will take the enriched nodes stored in `nodes_1`, embed them with the same embedding model (`text-embedding-3-small`), and write those vectors into a Chroma collection that we can reopen in future notebook sessions.

### 📝 EXERCISE 2: Experiment with Different Chunk Sizes (12-15 minutes)

**What you'll practice:** Understanding how chunk size affects retrieval and answer quality.

**Your task:**
1. Create a new text splitter with a smaller chunk size (e.g., 256 tokens instead of 1024)
2. Generate new baseline nodes using this smaller chunk size
3. Create a new vector index from these smaller chunks
4. Query both the original index and your new index with the same question
5. Compare: Which chunk size produces better answers? Why?

**Key trade-offs to consider:**
- **Larger chunks (512+ tokens)**: More context per chunk, but may blend multiple topics
- **Smaller chunks (128-256 tokens)**: More focused, precise matching, but may lack context
- **Optimal size**: Depends on your documents and queries!

**Hint:** Follow the same pattern:
```python
new_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=50)
new_nodes = new_splitter.get_nodes_from_documents(documents)
new_index = VectorStoreIndex(new_nodes, embed_model=embed_model)
```

**Expected outcome:** You'll discover that chunk size significantly impacts answer quality. Smaller chunks may be more precise but might miss broader context.
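Before running the exercise, it can help to estimate how the number of chunks grows as the chunk size shrinks. The sketch below assumes fixed-size windows with a constant overlap, so it is only a rough guide: the real splitter works per page and respects sentence boundaries. The function `estimate_chunks` and the document length are hypothetical.

```python
import math

def estimate_chunks(total_tokens, chunk_size, chunk_overlap):
    """Rough number of fixed-size windows needed to cover total_tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new tokens covered by each extra window
    return math.ceil((total_tokens - chunk_overlap) / step)

# Hypothetical document length, for illustration only
total_tokens = 20_000

for size, overlap in [(1024, 128), (256, 50)]:
    print(f"chunk_size={size}, overlap={overlap}: "
          f"about {estimate_chunks(total_tokens, size, overlap)} chunks")
```

More chunks mean more embedding calls and a larger index, which is part of the trade-off to weigh when comparing answer quality.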

In [None]:
# YOUR CODE HERE
# Example solution structure:
# 
# # Create new splitter with different chunk size
# small_splitter = SentenceSplitter(
#     chunk_size=256,
#     chunk_overlap=50
# )
# 
# # Generate new nodes
# small_nodes = small_splitter.get_nodes_from_documents(documents)
# print(f"Original nodes: {len(baseline_nodes)}")
# print(f"Smaller chunk nodes: {len(small_nodes)}")
# 
# # Create new index
# small_index = VectorStoreIndex(small_nodes, embed_model=embed_model)
# # No explicit llm argument, so the default LLM is used, as for query_engine_0
# small_query_engine = small_index.as_query_engine(
#     similarity_top_k=2
# )
# 
# # Query both
# test_query = "What are language models used for?"
# 
# print(f"\nQuery: {test_query}\n")
# print("ORIGINAL CHUNKS (1024 tokens):")
# print(query_engine_0.query(test_query))
# 
# print("\n\nSMALLER CHUNKS (256 tokens):")
# print(small_query_engine.query(test_query))
# 
# print("\n\nANALYSIS:")
# print("Which version gave a more complete answer?")
# print("Which was more focused and precise?")

In [None]:
from chromadb import PersistentClient
from llama_index.vector_stores.chroma import ChromaVectorStore

embed_model = OpenAIEmbedding(model = OPENAI_EMBED_MODEL)

In the code below, we connect our pipeline to ChromaDB. We start by opening (or creating) a Chroma database on disk, then define a collection called "LLM_titles_only_v1" where our vectors will be stored. We then build a `VectorStoreIndex` from our enriched nodes using the embedding model and route the resulting vectors into Chroma.

In [None]:
from llama_index.core import StorageContext

CHROMA_PATH = "./chroma_database"
client = PersistentClient(path=CHROMA_PATH)
collection = client.get_or_create_collection("LLM_titles_only_v1")

# Routing vectors into Chroma via StorageContext
vector_store = ChromaVectorStore(chroma_collection = collection)
storage_context = StorageContext.from_defaults(vector_store = vector_store)

index = VectorStoreIndex(
    nodes_1,
    storage_context = storage_context,
    embed_model=embed_model,
    show_progress = True
)

When we come back in a new session, we just need to wrap the existing Chroma collection in a vector store, rebuild the index from it, and turn the index into a query engine.

In [None]:
client = PersistentClient(path=CHROMA_PATH)
collection = client.get_or_create_collection("LLM_titles_only_v1")
vector_store = ChromaVectorStore(chroma_collection=collection)

index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    embed_model=embed_model,  # query-time embeddings must match!
)

In [None]:
qe = index.as_query_engine(similarity_top_k=1)

In [None]:
query = "What are the conclusions about hallucinations of language models?"

In [None]:
response = qe.query(query)

In [None]:
print(response)