# 📓 The GenAI Revolution Cookbook

**Title:** Multi-Document Agent with LlamaIndex: The Ultimate Guide [2025]

**Description:** Build a production-ready multi-document agent with LlamaIndex, turning PDFs into retrieval and summarization tools using semantic selection for accurate answers.

**📖 Read the full article:** [Multi-Document Agent with LlamaIndex: The Ultimate Guide [2025]](https://blog.thegenairevolution.com/article/multi-document-agent-with-llamaindex-the-ultimate-guide-2025)

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## What You're Building

You'll create a multi\-document research assistant that can answer questions across multiple PDFs with precise citations. The agent leverages semantic vector search for targeted queries, hierarchical summarization for high\-level synthesis, and function calling to route queries to the appropriate tool. When you're done, you'll have a runnable notebook that handles cross\-document Q\&A, enforces consistent citations in \[file\_name p.page\_label\] format, and includes a minimal validation suite.

**Prerequisites:**

* Python 3\.10\+
* OpenAI API key
* 2 to 3 sample PDFs (research papers, reports, or technical documents)
* Expected cost: roughly $0\.10 to $0\.50 per summary\-heavy query depending on document size

## Why This Approach Works

**Per\-Document Tool Isolation**

Each PDF gets its own vector and summary tool. This prevents cross\-contamination, enables precise citations, and lets the agent reason about which document to query for any given question. It's a clean separation that makes debugging much easier too.

**Semantic Tool Retrieval**

An object index embeds tool descriptions and retrieves the top\-k relevant tools per query. This scales to dozens of documents without overwhelming the agent's context window. It sounds complicated, but it works well in practice.
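
To make the idea concrete, here is a minimal, standard-library sketch of top\-k selection over tool descriptions. It is an illustration only: the word-count "embedding", the helper names (`embed`, `cosine`, `top_k_tools`), and the demo descriptions are all assumptions made for this example, whereas the real object index uses the configured embedding model.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k_tools(query: str, descriptions: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k tools whose descriptions best match the query."""
    q = embed(query)
    scored = sorted(
        descriptions.items(),
        key=lambda item: cosine(q, embed(item[1])),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Hypothetical tool descriptions, written in the style used later in this notebook
demo_descriptions = {
    "vector_paper1_pdf": "Semantic vector search for paper1.pdf. Use for targeted, specific questions.",
    "summary_paper1_pdf": "Hierarchical summarization for paper1.pdf. Use for overviews and synthesis.",
    "summary_paper2_pdf": "Hierarchical summarization for paper2.pdf. Use for overviews and synthesis.",
}

print(top_k_tools("Give me overviews of paper2.pdf", demo_descriptions, k=1))
```

Only the selected tools are handed to the agent, which is why the description text matters: the file name and the intended use are what a query gets matched against.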

**Dual Retrieval Strategy**

Vector tools handle narrow, fact\-based queries like "What dataset did the authors use?" Summary tools handle broad synthesis questions like "Compare the main contributions across papers." The agent picks the right mode based on query semantics. Simple but effective.

**Citation Enforcement**

Every tool attaches file name and page metadata to results. The system prompt instructs the agent to cite sources after each claim, and you can post\-process responses to format citations programmatically. No more vague "according to the document" references.

## How It Works (High\-Level Overview)

1. **Load and chunk PDFs** – Extract text, split into sentence\-aware chunks, normalize metadata for citations.

2. **Build per\-document tools** – Create vector and summary tools for each PDF, wrap them with clear descriptions.

3. **Index tools semantically** – Embed tool descriptions in an object index for dynamic retrieval.

4. **Assemble the agent** – Use function calling with a strict system prompt to route queries and enforce citations.

5. **Validate and iterate** – Run test queries, inspect tool selection, tune retrieval thresholds and temperature.

## Setup \& Installation

First, run this cell to install the required packages:

In [None]:
%pip -q install llama-index llama-index-llms-openai llama-index-embeddings-openai pypdf nest_asyncio python-dotenv numpy pandas "jedi>=0.16"

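Next, configure your OpenAI API key. If you're running locally, create a .env file with OPENAI\_API\_KEY\=your\_key. If you're running in Colab, add your key to Secrets (Settings, then Secrets, then OPENAI\_API\_KEY). Note that Colab Secrets are not exported as environment variables automatically, so copy the value into os.environ (for example with google.colab.userdata.get) before running the check below.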

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

# Fail early if key is missing
assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY in .env or Colab Secrets"
print("API key loaded.")

Set up logging, suppress warnings, and enable async support for a clean notebook environment:

In [None]:
import logging
import warnings
import nest_asyncio

warnings.filterwarnings("ignore")
logging.basicConfig(level=logging.INFO)
nest_asyncio.apply()

Configure the LLM and embedding model globally for all LlamaIndex operations:

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Use GPT-4o for reliable function calling; fall back to gpt-4o-mini if needed
Settings.llm = OpenAI(model="gpt-4o", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")

Create a data directory and download sample PDFs programmatically so the notebook runs end\-to\-end:

In [None]:
import urllib.request

DATA_DIR = "data"
os.makedirs(DATA_DIR, exist_ok=True)

# Example: download public arXiv papers (replace with your own PDFs)
sample_urls = [
    ("https://arxiv.org/pdf/2005.11401.pdf", "paper1.pdf"),  # Retrieval-Augmented Generation (RAG) paper
    ("https://arxiv.org/pdf/2303.08774.pdf", "paper2.pdf"),  # GPT-4 Technical Report
]

for url, fname in sample_urls:
    fpath = os.path.join(DATA_DIR, fname)
    if not os.path.exists(fpath):
        print(f"Downloading {fname}...")
        urllib.request.urlretrieve(url, fpath)

pdf_files = [f for f in os.listdir(DATA_DIR) if f.lower().endswith(".pdf")]
print(f"Found {len(pdf_files)} PDFs:", pdf_files)

## Step\-by\-Step Implementation

### Step 1: Load and Chunk PDFs

Load documents from the data directory. For PDFs, the reader returns one Document per page and attaches file\_name and page\_label metadata automatically, which is exactly what the citations need:

In [None]:
from llama_index.core import SimpleDirectoryReader

# For PDFs, each page is loaded as its own Document
docs = SimpleDirectoryReader(DATA_DIR, recursive=False).load_data()
print(f"Loaded {len(docs)} page-level documents")

Split documents into sentence\-aware chunks for semantic retrieval. Sentence\-aware splitting avoids cutting thoughts mid\-sentence, which gives the vector index more coherent semantic units. This directly improves retrieval quality, especially for dense technical writing like research papers or legal clauses. For more strategies to boost retrieval accuracy in RAG systems, check out our guide on [retrieval tricks to boost answer accuracy](/article/rag-application-7-retrieval-tricks-to-boost-answer-accuracy-2).

In [None]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
nodes = splitter.get_nodes_from_documents(docs, show_progress=True)
print(f"Total chunks: {len(nodes)}")
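
To see what sentence\-aware splitting with overlap means in isolation, here is a small standard-library sketch. It is not how SentenceSplitter is implemented: the function name `sentence_chunks`, the regex-based sentence boundary, and the character budget (SentenceSplitter's chunk\_size counts tokens) are simplifying assumptions for illustration.

```python
import re

def sentence_chunks(text: str, max_chars: int = 200, overlap_sentences: int = 1) -> list[str]:
    """Greedy sentence packing: never cut inside a sentence, and repeat the last
    `overlap_sentences` sentences at the start of the next chunk."""
    # Naive boundary rule: split after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward so context spans the boundary
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

demo_text = "Alpha one. Beta two. Gamma three. Delta four."
for chunk in sentence_chunks(demo_text, max_chars=25, overlap_sentences=1):
    print(repr(chunk))
```

Each chunk ends on a sentence boundary, and neighboring chunks share one sentence. That shared sentence plays the same role as chunk\_overlap above: a fact that straddles a boundary is still retrievable from at least one chunk.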

Normalize metadata for accurate citations. You need to ensure every node has file\_name and page\_label:

In [None]:
for n in nodes:
    meta = n.metadata or {}
    if "file_name" not in meta:
        file_path = meta.get("file_path", meta.get("source", "unknown"))
        meta["file_name"] = os.path.basename(file_path) if isinstance(file_path, str) else "unknown"
    if "page_label" not in meta:
        meta["page_label"] = str(meta.get("page_number", "N/A"))
    n.metadata = meta

print("Sample chunk metadata:", nodes[0].metadata)
print("Sample chunk text:", nodes[0].text[:300], "...")
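
With file\_name and page\_label normalized, a citation string can be derived directly from a node's metadata. The helpers below are a minimal sketch: `format_citation` and `unique_citations` are hypothetical names introduced here, and they operate on plain metadata dicts rather than on LlamaIndex objects.

```python
def format_citation(metadata: dict) -> str:
    """Build a citation in the [file_name p.page_label] format from node metadata."""
    file_name = metadata.get("file_name", "unknown")
    page_label = metadata.get("page_label", "N/A")
    return f"[{file_name} p.{page_label}]"

def unique_citations(metadatas: list[dict]) -> list[str]:
    """Format citations for several nodes, dropping duplicates but keeping order."""
    return list(dict.fromkeys(format_citation(m) for m in metadatas))

# Hypothetical metadata dicts shaped like the normalized node metadata above
demo_metadata = [
    {"file_name": "paper1.pdf", "page_label": "3"},
    {"file_name": "paper1.pdf", "page_label": "3"},
    {"file_name": "paper2.pdf"},
]
print(unique_citations(demo_metadata))
```

In the notebook, something like `format_citation(nodes[0].metadata)` should produce a citation for the first chunk, assuming the normalization cell has already run.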

Group nodes by document for per\-document tool creation:

In [None]:
from collections import defaultdict

nodes_by_file = defaultdict(list)
for n in nodes:
    nodes_by_file[n.metadata["file_name"]].append(n)

print({k: len(v) for k, v in nodes_by_file.items()})

### Step 2: Build Per\-Document Vector Tools

Create a vector index for each document to enable precise passage retrieval:

In [None]:
from llama_index.core import VectorStoreIndex
from llama_index.core.tools import QueryEngineTool

vector_tools = {}

for fname, doc_nodes in nodes_by_file.items():
    v_index = VectorStoreIndex(doc_nodes, show_progress=True)
    v_engine = v_index.as_query_engine(similarity_top_k=5)
    v_tool = QueryEngineTool.from_defaults(
        name=f"vector_{fname.replace('.', '_')}",
        query_engine=v_engine,
        description=(
            f"Semantic vector search for {fname}. "
            "Use for targeted, specific questions that require exact passages and citations."
        )
    )
    vector_tools[fname] = v_tool

print(f"Vector tools created: {len(vector_tools)}")
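
One caveat on tool names: replacing only the dots works for files like paper1.pdf, but OpenAI function names are restricted to letters, digits, underscores, and dashes, up to 64 characters, so a file name containing spaces or parentheses can be rejected. A more defensive naming helper might look like the sketch below; `safe_tool_name` is a hypothetical helper, not part of LlamaIndex.

```python
import re

def safe_tool_name(prefix: str, file_name: str, max_len: int = 64) -> str:
    """Replace anything outside [A-Za-z0-9_-] with '_' and cap the total length."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "_", file_name)
    return f"{prefix}_{cleaned}"[:max_len]

print(safe_tool_name("vector", "paper1.pdf"))
print(safe_tool_name("summary", "My Report (v2).pdf"))
```

Truncation can make two long file names collide, so for large collections it may be worth appending a short hash or counter. The sketch leaves that out.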

Test a vector tool to verify retrieval quality. Always good to sanity check these things:

In [None]:
sample_file = next(iter(vector_tools.keys()))
resp = vector_tools[sample_file].query_engine.query("What problem does this paper address?")
print(resp)

### Step 3: Build Per\-Document Summary Tools

Create a summary index for each document to enable hierarchical summarization:

In [None]:
from llama_index.core import SummaryIndex

summary_tools = {}

for fname, doc_nodes in nodes_by_file.items():
    s_index = SummaryIndex(doc_nodes)
    s_engine = s_index.as_query_engine(
        response_mode="tree_summarize",
        use_async=True
    )
    s_tool = QueryEngineTool.from_defaults(
        name=f"summary_{fname.replace('.', '_')}",
        query_engine=s_engine,
        description=(
            f"Hierarchical summarization for {fname}. "
            "Use for overviews, key contributions, limitations, and document-wide synthesis."
        )
    )
    summary_tools[fname] = s_tool

print(f"Summary tools created: {len(summary_tools)}")

Test a summary tool to verify synthesis quality:

In [None]:
sample_file = next(iter(summary_tools.keys()))
resp = summary_tools[sample_file].query_engine.query("Provide a 5-bullet executive summary.")
print(resp)
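
Summary queries are where most of the cost comes from, because tree\_summarize over a summary index reads every chunk of the document. The sketch below gives a rough lower bound for one summary query. The function name and the example prices are placeholders chosen for illustration, so substitute the current rates for your model.

```python
def estimate_summary_cost(
    num_chunks: int,
    tokens_per_chunk: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Lower-bound USD estimate for one summary query that reads every chunk once.

    Intermediate summarization calls add to this, so real costs will be higher.
    """
    input_tokens = num_chunks * tokens_per_chunk
    cost = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return round(cost, 4)

# Illustrative numbers only: 60 chunks of up to 1024 tokens, 500 output tokens,
# and placeholder prices of $2.50 (input) and $10.00 (output) per million tokens.
print(estimate_summary_cost(60, 1024, 500, 2.50, 10.00))
```

Under these assumed numbers the estimate is about $0.16, which falls within the per\-query range quoted in the prerequisites. Vector queries are far cheaper because they only send the top few chunks.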

### Step 4: Index Tools Semantically

Build an object index over all tools for semantic tool selection. This embeds tool descriptions and retrieves the top\-k relevant tools per query. It's actually pretty clever how this works:

In [None]:
from llama_index.core.objects import ObjectIndex

all_tools = list(vector_tools.values()) + list(summary_tools.values())

obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
    show_progress=True
)

tool_retriever = obj_index.as_retriever(similarity_top_k=3)

Inspect which tools are retrieved for different queries to debug tool selection. This step is crucial for understanding what's happening under the hood:

In [None]:
def inspect_tools(query: str):
    retrieved_results = tool_retriever.retrieve(query)
    print(f"Query: {query}")
    for i, res in enumerate(retrieved_results, 1):
        tool_obj = None
        # Defensive check: unwrap the tool if the result is wrapped in a node with an .obj attribute
        if hasattr(res, 'node') and hasattr(res.node, 'obj'):
            tool_obj = res.node.obj
        # Otherwise, treat res as the QueryEngineTool itself (what the object retriever normally returns)
        else:
            tool_obj = res

        tool_name = getattr(getattr(tool_obj, 'metadata', None), 'name', None)

        if tool_name:
            name_parts = tool_name.split('_', 1)
            tool_type = name_parts[0] if len(name_parts) > 0 else "unknown"
            file_name = name_parts[1] if len(name_parts) > 1 else "unknown"
            print(f"#{i} -> {tool_type} | {file_name} | {tool_name}")
        else:
            print(f"#{i} -> Could not determine tool properties: tool object {tool_obj} has no valid 'name' attribute in its metadata or it's empty/None.")
        print("-" * 30)

# Test calls for inspect_tools
inspect_tools("Provide an executive summary across all documents.")
inspect_tools("Which sections discuss model architecture details?")

### Step 5: Assemble the Agent

Create the agent using function calling and a strict system prompt that enforces citation format. Now, while frameworks like LangChain and CrewAI are solid choices, LlamaIndex specializes in document workflows with first\-class support for indexing, retrieval, summarization, and agentic tool use that map cleanly to this problem. If you're interested in foundational agent patterns, you might want to check out our step\-by\-step tutorial on [building an LLM agent from scratch with GPT\-4 ReAct](/article/how-to-build-an-llm-agent-from-scratch-with-gpt-4-react-5).

In [None]:
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core import Settings

SYSTEM_PROMPT = """You are a multi-document research assistant.
- Use only the provided tools.
- Prefer vector tools for specific, narrow questions.
- Prefer summary tools for high-level synthesis.
- Always cite sources as [file_name p.page_label] after each relevant sentence.
- If you cannot find relevant evidence, say so explicitly."""

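# Passing every tool works for a handful of documents. For larger collections,
# FunctionAgent also accepts tool_retriever=tool_retriever (from Step 4) in place of tools.
agent = FunctionAgent(
    tools=all_tools,          # all vector and summary tools built above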
    llm=Settings.llm,         # your configured LLM
    system_prompt=SYSTEM_PROMPT,
    verbose=True,
)

## Run and Validate

Run a cross\-document query and verify the agent synthesizes answers with citations:

In [None]:
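import asyncio

async def ask(question: str):
    # agent.run returns an awaitable handler, so await it inside a coroutine
    return await agent.run(question)

response = asyncio.run(ask("Compare the main challenges and proposed collaboration mechanisms across the papers."))
print(str(response))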

Run a suite of test queries to validate agent routing, retrieval, and summarization. This is where you really see if everything's working together:

In [None]:
import asyncio

tests = [
    "List the datasets used by each paper and compare evaluation metrics.",
    "Provide a high-level summary of the main contributions across documents.",
    "According to the authors, what are the primary limitations?"
]

async def main():
    for q in tests:
        print("\nQ:", q)
        resp = await agent.run(q)
        print("A:", str(resp))

asyncio.run(main())
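
Reading answers by eye only goes so far. A minimal automated check, sketched below, parses citations in the \[file\_name p.page\_label\] format out of an answer and flags answers that cite nothing or cite files that were never loaded. The regex and the helper names (`extract_citations`, `check_answer`) are assumptions for this example, and the sample answer is made up.

```python
import re

# Matches citations such as [paper1.pdf p.12]
CITATION_RE = re.compile(r"\[([^\[\]]+?) p\.([^\[\]\s]+)\]")

def extract_citations(answer: str) -> list[tuple[str, str]]:
    """Return (file_name, page_label) pairs found in an answer."""
    return CITATION_RE.findall(answer)

def check_answer(answer: str, known_files: set[str]) -> dict:
    """Report whether an answer cites at least one source and only known files."""
    citations = extract_citations(answer)
    unknown = sorted({f for f, _ in citations if f not in known_files})
    return {
        "num_citations": len(citations),
        "has_citation": bool(citations),
        "unknown_files": unknown,
    }

# Made-up answer text, used only to exercise the checker
demo_answer = (
    "The first claim is supported here [paper1.pdf p.2]. "
    "The second claim is supported here [paper2.pdf p.1] [notes.pdf p.N/A]."
)
print(check_answer(demo_answer, known_files={"paper1.pdf", "paper2.pdf"}))
```

In the notebook, passing `str(resp)` and `set(nodes_by_file)` to `check_answer` inside the test loop would turn each query into a pass or fail signal. A non-empty unknown\_files list is a strong hint that the model invented a source.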

## Conclusion

You've built a multi\-document research assistant that routes queries to the right tool, retrieves precise passages, and enforces consistent citations. The key decisions here include per\-document tool isolation for clean attribution, semantic tool retrieval for scalability, and dual retrieval modes: vector for specifics, summary for synthesis.

Next steps to harden this for production:

1. **Persist indices** – Save vector and summary indices to disk or a vector database like pgvector or Pinecone to avoid re\-embedding on every run. Trust me, this saves a lot of time and money.

2. **Add retries and rate limits** – Wrap LLM calls with exponential backoff and timeout handling for robustness. Things will fail occasionally, so it's better to handle it gracefully.

3. **Implement structured logging** – Use LlamaIndex callbacks or a logging framework to trace tool calls, latency, and token usage. You'll thank yourself later when debugging.

4. **Cache answers** – Use an in\-memory LRU cache or a persistent store like Redis for repeated queries. For a deep dive into implementing semantic caching with Redis Vector to optimize LLM costs, see [how to implement semantic cache with Redis Vector](/article/semantic-cache-llm-how-to-implement-with-redis-vector-to-cut-costs-6).

5. **Post\-process citations** – Extract source\_nodes from responses and format citations programmatically to ensure consistency beyond prompt\-based enforcement. Prompt engineering only gets you so far.
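
As a starting point for the retry step, here is a minimal sketch of exponential backoff with jitter using only the standard library. `with_retries` is a hypothetical helper, and catching every exception is a simplification: in practice you would likely retry only on rate\-limit and timeout errors.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter keeps parallel clients from retrying in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

A call such as `with_retries(lambda: vector_tools[sample_file].query_engine.query("..."))` would wrap a single tool query. The `sleep` parameter exists so tests can swap in a fake sleeper instead of waiting.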

And that's it. You now have a working multi\-document research assistant that actually knows where its information comes from. Pretty useful for any serious document analysis work.