# Lesson 4: Building a Multi-Document Agent

In Lesson 3, we built an agent that can reason over a single document and answer complex questions about it while maintaining memory. In Lesson 4, we will learn how to extend that agent to handle multiple documents, in increasing degrees of complexity. We will start with a 3-document use case, and then expand to an 11-document use case.

## Setup

In [46]:
from helper import get_openai_api_key
OPENAI_API_KEY = get_openai_api_key()

In [47]:
import nest_asyncio
nest_asyncio.apply()

## 1. Set up an agent over 3 papers

The first task is to set up our function calling agent over 3 papers. We do this by combining the vector and summary tools for each document into a single list and passing it to the agent, so that the agent has 6 tools in total. We will download 3 papers from ICLR 2024 and convert each paper into a pair of tools.

**Note**: The PDF files are included with this lesson. To access these papers, go to the `File` menu and select `Open...`.

In [48]:
urls = [
    "https://openreview.net/pdf?id=VtmBAGCN7o",
    "https://openreview.net/pdf?id=6PmJoRfdaK",
    "https://openreview.net/pdf?id=hSyW5go0v8",
]

papers = [
    "metagpt.pdf",
    "longlora.pdf",
    "selfrag.pdf",
]

In Lesson 3, we used a helper function called `get_doc_tools`, which automatically builds a vector index tool and a summary index tool over a given paper. The vector tool performs vector search, and the summary tool performs summarization over the entire document.

For each paper, we get back both the vector tool and the summary tool, and we put them into a dictionary that maps each paper name to its two tools. Next, we flatten these tools into a single list.

In [49]:
from utils import get_doc_tools
from pathlib import Path

paper_to_tools_dict = {}
for paper in papers:
    print(f"Getting tools for paper: {paper}")
    vector_tool, summary_tool = get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool, summary_tool]

Getting tools for paper: metagpt.pdf
Getting tools for paper: longlora.pdf
Getting tools for paper: selfrag.pdf


In [50]:
initial_tools = [t for paper in papers for t in paper_to_tools_dict[paper]]
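To make the shape of the data concrete, here is a minimal sketch of the same dictionary-then-flatten pattern. It does not call the real `get_doc_tools`; `fake_doc_tools` is a hypothetical stand-in that returns only tool names, using the `vector_tool_<paper>` and `summary_tool_<paper>` naming that appears in the agent logs in this lesson.

```python
from pathlib import Path

papers = ["metagpt.pdf", "longlora.pdf", "selfrag.pdf"]


def fake_doc_tools(file_path, name):
    """Hypothetical stand-in for get_doc_tools: returns tool names, not real tools."""
    return f"vector_tool_{name}", f"summary_tool_{name}"


# Map each paper to its [vector tool, summary tool] pair.
paper_to_tools = {}
for paper in papers:
    vector_tool, summary_tool = fake_doc_tools(paper, Path(paper).stem)
    paper_to_tools[paper] = [vector_tool, summary_tool]

# Flatten the per-paper pairs into one list, preserving paper order.
flat_tools = [t for paper in papers for t in paper_to_tools[paper]]

print(len(flat_tools))
print(flat_tools[:2])
```

With 3 papers and 2 tools per paper, the flat list has 6 entries, and the two tools for each paper sit next to each other.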

We will use GPT-3.5 Turbo from OpenAI as our LLM. If we check the number of tools that will be passed to the agent, we see that it is 6. That's because we have 3 papers and 2 tools for each paper: a vector tool and a summary tool.

The next step is to construct our overall agent worker. This agent worker includes the 6 tools as well as the LLM that we pass in. Now we are able to ask questions across these 3 documents or within a single document. For now, let's ask a question about LongLoRA: `Tell me about the evaluation dataset used in LongLoRA, and then tell me about the evaluation results.` We get back the answer that one of the evaluation datasets used is the PG19 test split, followed by the evaluation results for the LongLoRA models.

In [51]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

In [52]:
len(initial_tools)

6

In [53]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    initial_tools, 
    llm=llm, 
    verbose=True
)
agent = AgentRunner(agent_worker)

In [54]:
response = agent.query(
    "Tell me about the evaluation dataset used in LongLoRA, "
    "and then tell me about the evaluation results"
)

Added user message to memory: Tell me about the evaluation dataset used in LongLoRA, and then tell me about the evaluation results
=== Calling Function ===
Calling function: vector_tool_longlora with args: {"query": "evaluation dataset"}
=== Function Output ===
PG19 test split
=== Calling Function ===
Calling function: vector_tool_longlora with args: {"query": "evaluation results"}
=== Function Output ===
The evaluation results include reporting perplexity for models and baselines on proof-pile and PG19 datasets, showing the effectiveness of the fine-tuning method with longer context sizes. The perplexity decreases as the context size increases, indicating improved performance. Additionally, experiments on retrieval in long contexts were conducted, comparing the model with other open LLMs on a topic retrieval task, showcasing promising results on extremely large settings.
=== LLM Response ===
The evaluation dataset used in LongLoRA includes the PG19 test split. 

As for the evaluation 

The next question we can ask is `Give me a summary of both Self-RAG and LongLoRA`. This allows us to do summarization across 2 papers. First the agent calls the summary tool for Self-RAG with the input `Self-RAG`, and gets back output describing what the paper is about. The agent then calls the LongLoRA summary tool with the input `LongLoRA`, and gets back an overall summary of LongLoRA. The final LLM response combines the summaries of Self-RAG and LongLoRA.

In [55]:
response = agent.query("Give me a summary of both Self-RAG and LongLoRA")
print(str(response))

Added user message to memory: Give me a summary of both Self-RAG and LongLoRA
=== Calling Function ===
Calling function: summary_tool_selfrag with args: {"input": "Self-RAG"}
=== Function Output ===
Self-RAG is a framework that enhances the quality and factuality of a large language model by incorporating retrieval and self-reflection mechanisms. It allows the language model to adaptively retrieve passages on-demand, generate text, and reflect on both the retrieved passages and its own generations using special tokens called reflection tokens. This approach enables the language model to control its behavior during inference, tailoring it to diverse task requirements and has shown significant performance improvements over existing models in various tasks such as open-domain QA, reasoning, fact verification, and long-form generation.
=== Calling Function ===
Calling function: summary_tool_longlora with args: {"input": "LongLoRA"}
=== Function Output ===
LongLoRA is an efficient method fo

## 2. Set up an agent over 11 papers

If we want to try out some queries on our own, we can use any combination of 2 or even all 3 of these papers, and ask both for summaries and for specific information within the papers, to see whether the agent is able to reason about the summary and vector tools for each document.

Let's expand into a more advanced use case, using 11 research papers from ICLR 2024.

### Download 11 ICLR papers

In [56]:
urls = [
    "https://openreview.net/pdf?id=VtmBAGCN7o",
    "https://openreview.net/pdf?id=6PmJoRfdaK",
    "https://openreview.net/pdf?id=LzPWWPAdY4",
    "https://openreview.net/pdf?id=VTF8yNQM66",
    "https://openreview.net/pdf?id=hSyW5go0v8",
    "https://openreview.net/pdf?id=9WD9KwssyT",
    "https://openreview.net/pdf?id=yV6fD7LYkF",
    "https://openreview.net/pdf?id=hnrB5YHoYu",
    "https://openreview.net/pdf?id=WbWtOYIzIK",
    "https://openreview.net/pdf?id=c5pwL0Soay",
    "https://openreview.net/pdf?id=TpD2aG1h0D"
]

papers = [
    "metagpt.pdf",
    "longlora.pdf",
    "loftq.pdf",
    "swebench.pdf",
    "selfrag.pdf",
    "zipformer.pdf",
    "values.pdf",
    "finetune_fair_diffusion.pdf",
    "knowledge_card.pdf",
    "metra.pdf",
    "vr_mcl.pdf"
]

We will build a dictionary mapping each paper to its vector and summary tool. This section can also take a little bit of time, since we need to process, index and embed 11 documents. 

To download these papers yourself, the needed code is below (commented out):


    # for url, paper in zip(urls, papers):
    #     !wget "{url}" -O "{paper}"
    
    
**Note**: The PDF files are included with this lesson. To access these papers, go to the `File` menu and select `Open...`.

In [57]:
from utils import get_doc_tools
from pathlib import Path

paper_to_tools_dict = {}
for paper in papers:
    print(f"Getting tools for paper: {paper}")
    vector_tool, summary_tool = get_doc_tools(paper, Path(paper).stem)
    paper_to_tools_dict[paper] = [vector_tool, summary_tool]

Getting tools for paper: metagpt.pdf
Getting tools for paper: longlora.pdf
Getting tools for paper: loftq.pdf
Getting tools for paper: swebench.pdf
Getting tools for paper: selfrag.pdf
Getting tools for paper: zipformer.pdf
Getting tools for paper: values.pdf
Getting tools for paper: finetune_fair_diffusion.pdf
Getting tools for paper: knowledge_card.pdf
Getting tools for paper: metra.pdf
Getting tools for paper: vr_mcl.pdf


### Extend the Agent with Tool Retrieval

Now let's collapse these tools into a flat list. This is the point at which we need a slightly more advanced agent and tool architecture. If we try to give the agent the tools for all 11 papers, which now amounts to 22 tools, or for 100 papers or more, then even though LLM context windows are getting longer, stuffing too many tool options into the LLM prompt leads to the following issues:

1. The tools may not all fit in the prompt, especially if the number of documents is large and we are modeling each document as a separate tool or set of tools. Costs and latency will also spike because we are increasing the number of tokens in the prompt.

2. The LLM can get confused. It may fail to pick the right tool when the number of choices is too large.
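A rough back-of-the-envelope sketch shows how the prompt overhead grows with the number of papers. The assumptions here are loud ones: word count is used as a crude proxy for tokens, the summary tool description follows the wording seen in this lesson's tool metadata, the vector tool description is hypothetical, and real tool schemas sent to the LLM are larger than a one-line description.

```python
def tool_descriptions(paper_names):
    """Build one-line descriptions for the two tools of each paper."""
    descriptions = []
    for name in paper_names:
        # Hypothetical wording for the vector tool description.
        descriptions.append(
            f"vector_tool_{name}: Useful for specific questions related to {name}"
        )
        # Wording modeled on the summary tool metadata shown in this lesson.
        descriptions.append(
            f"summary_tool_{name}: Useful for summarization questions related to {name}"
        )
    return descriptions


def approx_prompt_words(descriptions):
    """Crude size estimate: total whitespace-separated words."""
    return sum(len(d.split()) for d in descriptions)


for n_papers in (3, 11, 100):
    names = [f"paper{i}" for i in range(n_papers)]
    descs = tool_descriptions(names)
    print(n_papers, "papers ->", len(descs), "tools,",
          approx_prompt_words(descs), "words of tool text")
```

The growth is linear in the number of papers, and this overhead is paid on every call to the LLM, which is why both cost and latency rise as the tool list gets longer.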

A solution is to perform retrieval augmentation when the user asks a query, not at the level of text but at the level of tools. We first retrieve a small set of relevant tools, and then feed only those tools to the agent reasoning prompt instead of all the tools. This retrieval process is similar to the retrieval process used in RAG. At its simplest, it can just be top-K vector search, but we can also add any advanced retrieval techniques we want to filter down to the relevant set of results. Our agents let us plug in a tool retriever that accomplishes this.
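The idea of top-K search over tools can be sketched without any library. This toy version is an illustration only: it swaps a real embedding model for bag-of-words counts, and the `embed`, `cosine`, and `retrieve_top_k` helpers are hypothetical names, not LlamaIndex APIs.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (a real system uses an embedding model)."""
    # Lowercase and drop hyphens so "SWE-Bench" matches "swebench".
    return Counter(text.lower().replace("-", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, tool_descriptions, k=2):
    """Return the names of the k tools whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(
        tool_descriptions,
        key=lambda name: cosine(q, embed(tool_descriptions[name])),
        reverse=True,
    )
    return ranked[:k]


tool_descriptions = {
    f"summary_tool_{paper}": f"Useful for summarization questions related to {paper}"
    for paper in ["metagpt", "longlora", "loftq", "swebench", "selfrag", "metra"]
}

top_tools = retrieve_top_k(
    "Tell me about the eval dataset used in MetaGPT and SWE-Bench",
    tool_descriptions,
    k=2,
)
print(top_tools)
```

Only the retrieved tools would then be placed in the agent's prompt. A learned embedding model matches on meaning rather than exact words, so its rankings will differ from this word-overlap toy.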

So let's get this done. First we want to index the tools. LlamaIndex already has extensive indexing capabilities over general text documents. Since these tools are Python objects, we need to serialize them to a string representation and back. This is handled by the object index abstraction in LlamaIndex.

So we will define an object index and retriever over these tools. We import `VectorStoreIndex`, which is our standard interface for indexing text. Then we wrap `VectorStoreIndex` with `ObjectIndex`. And to construct an object index, we directly plug in these Python tools as input into the index. 

We can retrieve from an object index through an object retriever. This will call the underlying retriever from the index, and return the output directly as objects. In this case, it will be tools.
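Conceptually, an object index keeps two things side by side: a text representation of each object that can be searched, and a mapping from that representation back to the original object. The sketch below illustrates the round trip under simplifying assumptions. `ToyTool` and `ToyObjectIndex` are hypothetical classes, the vector tool description is invented wording, and word overlap stands in for vector search; this is not how LlamaIndex implements `ObjectIndex`.

```python
from dataclasses import dataclass


@dataclass
class ToyTool:
    """Hypothetical stand-in for an agent tool: just a name and a description."""
    name: str
    description: str


class ToyObjectIndex:
    def __init__(self, objects):
        # Key -> original Python object, so retrieval can return objects.
        self._objects = {obj.name: obj for obj in objects}
        # Key -> searchable text representation of the object.
        self._texts = {obj.name: obj.description for obj in objects}

    def retrieve(self, query, top_k=1):
        query_words = set(query.lower().split())

        def overlap(key):
            return len(query_words & set(self._texts[key].lower().split()))

        # Search over the text, then map the winning keys back to objects.
        ranked = sorted(self._texts, key=overlap, reverse=True)
        return [self._objects[key] for key in ranked[:top_k]]


tools = [
    ToyTool("vector_tool_loftq", "Useful for specific questions related to loftq"),
    ToyTool("summary_tool_loftq", "Useful for summarization questions related to loftq"),
    ToyTool("summary_tool_longlora", "Useful for summarization questions related to longlora"),
]
index = ToyObjectIndex(tools)
result = index.retrieve("summarization questions about loftq", top_k=1)
print(result[0].name)
```

The caller gets back the same Python object that was indexed, not a string, which is what lets an agent call the retrieved tool directly.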

In [58]:
all_tools = [t for paper in papers for t in paper_to_tools_dict[paper]]

In [59]:
# define an "object" index and retriever over these tools
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import ObjectIndex

obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
)

In [60]:
obj_retriever = obj_index.as_retriever(similarity_top_k=3)

Now that we have defined the object retriever, let's walk through a very simple example. Let's ask `Tell me about the eval dataset used in MetaGPT and SWE-Bench`. The retriever directly returns a list of tools. Looking at the first tool in this list, we see that it is the summary tool for MetaGPT (`summary_tool_metagpt`).

If we look at the second tool, we see that it is the summary tool for a paper unrelated to MetaGPT and SWE-Bench (`summary_tool_metra`), so the quality of retrieval depends on our embedding model. However, the last tool that is retrieved is indeed the summary tool for SWE-Bench (`summary_tool_swebench`).

In [61]:
tools = obj_retriever.retrieve(
    "Tell me about the eval dataset used in MetaGPT and SWE-Bench"
)

In [62]:
tools[0].metadata

ToolMetadata(description='Useful for summarization questions related to metagpt', name='summary_tool_metagpt', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)

In [63]:
tools[1].metadata

ToolMetadata(description='Useful for summarization questions related to metra', name='summary_tool_metra', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)

In [64]:
tools[2].metadata

ToolMetadata(description='Useful for summarization questions related to swebench', name='summary_tool_swebench', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)
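One simple way to quantify what we just observed is to count how many of the retrieved tools belong to the papers named in the query. The sketch below hard-codes the three tool names from the output above; the set of relevant papers is our own judgment, and `paper_of` is a hypothetical helper that assumes the `<kind>_tool_<paper>` naming convention.

```python
# Tool names as returned by the retriever in the cells above.
retrieved = [
    "summary_tool_metagpt",
    "summary_tool_metra",
    "summary_tool_swebench",
]

# Papers the query actually asks about (our own judgment, not a library output).
relevant_papers = {"metagpt", "swebench"}


def paper_of(tool_name):
    """Extract the paper stem, assuming the `<kind>_tool_<paper>` naming."""
    return tool_name.split("_tool_", 1)[1]


hits = [name for name in retrieved if paper_of(name) in relevant_papers]
misses = [name for name in retrieved if paper_of(name) not in relevant_papers]
precision = len(hits) / len(retrieved)

print("relevant:", hits)
print("unrelated:", misses)
print(f"precision@3: {precision:.2f}")
```

Two of the three retrieved tools are relevant here. Tracking a number like this over a handful of test queries is one way to compare embedding models or values of `similarity_top_k`.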

Now we are ready to set up our function calling agent. The setup is very similar to the setup in Lesson 3. As an additional feature, we can add a system prompt to the agent. This is optional, but it is useful if we want to give the agent additional guidance to output things in a certain way, or to take certain factors into account when it reasons over the tools. The cell below is an example of that.

In [65]:
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

agent_worker = FunctionCallingAgentWorker.from_tools(
    tool_retriever=obj_retriever,
    llm=llm, 
    system_prompt=""" \
You are an agent designed to answer queries over a set of given papers.
Please always use the tools provided to answer a question. Do not rely on prior knowledge.\

""",
    verbose=True
)
agent = AgentRunner(agent_worker)

Now let's try asking some comparison queries. We ask `Tell me about the evaluation dataset used in MetaGPT and compare it against SWE-Bench`. We see that the agent calls both the summary tool for MetaGPT and the summary tool for SWE-Bench. It gets back results for both, and then generates a final response.

In [66]:
response = agent.query(
    "Tell me about the evaluation dataset used "
    "in MetaGPT and compare it against SWE-Bench"
)
print(str(response))

Added user message to memory: Tell me about the evaluation dataset used in MetaGPT and compare it against SWE-Bench
=== Calling Function ===
Calling function: summary_tool_metagpt with args: {"input": "evaluation dataset used in MetaGPT"}
=== Function Output ===
The evaluation dataset used in MetaGPT includes three benchmarks: HumanEval, MBPP, and SoftwareDev. HumanEval consists of 164 handwritten programming tasks, MBPP comprises 427 Python tasks, and SoftwareDev is a collection of 70 representative software development tasks covering various scopes like mini-games, image processing algorithms, and data visualization.
=== Calling Function ===
Calling function: summary_tool_swebench with args: {"input": "evaluation dataset used in SWE-Bench"}
=== Function Output ===
The evaluation dataset used in SWE-Bench consists of task instances collected from real GitHub repositories, including popular Python repositories. It includes issues, pull requests, task instructions, retrieved files, docu

As a final example, let's compare and contrast the 2 LoRA papers, LongLoRA and LoftQ, asking the agent to analyze the approach in each paper first. When the agent executes this query, its first step is to take the input task and retrieve the set of tools that help it fulfill the task. Through the object retriever, the expectation is that it retrieves the LongLoRA and LoftQ query tools.

If we take a look at the intermediate outputs of the agent, we see that it has access to the relevant tools for LongLoRA and LoftQ. It first calls `summary_tool_longlora` with the argument `Approach in LongLoRA`, and gets back a summary of the approach. Similarly, it gets back the approach in LoftQ by calling `summary_tool_loftq`.

The final LLM response is able to compare these 2 approaches by comparing the responses from these 2 tools, and combining them to synthesize an answer that satisfies the user query.

In [67]:
response = agent.query(
    "Compare and contrast the LoRA papers (LongLoRA, LoftQ). "
    "Analyze the approach in each paper first. "
)

Added user message to memory: Compare and contrast the LoRA papers (LongLoRA, LoftQ). Analyze the approach in each paper first. 
=== Calling Function ===
Calling function: summary_tool_longlora with args: {"input": "Approach in LongLoRA"}
=== Function Output ===
The approach in LongLoRA involves efficiently extending the context length of large language models (LLMs) to significantly larger sizes while saving on computational costs during fine-tuning. It focuses on maintaining the quality of the original attention architecture during inference and emphasizes efficient adaptation of LLMs to longer context lengths by incorporating trainable normalization and embedding layers. Additionally, LongLoRA utilizes methods like S2-Attn, Flash-Attention2, and DeepSpeed during fine-tuning to achieve promising results on extremely large context settings and handle long documents effectively.
=== Calling Function ===
Calling function: summary_tool_loftq with args: {"input": "Approach in LoftQ"}
=== 

So that concludes our lesson. We should now be equipped with the right tools to build agents over a single document and also over multiple documents. This enables us to build more general, context-augmented research assistants that can answer complex questions.