# Recursive Retriever + Document Agents

This guide shows how to combine recursive retrieval and "document agents" for advanced decision making over heterogeneous documents.

Two observations motivate this approach:

* Decoupling retrieval embeddings from chunk-based synthesis. Fetching documents by their summaries often returns more relevant context for a query than fetching raw chunks. Recursive retrieval allows this directly.
* Within a document, users may need to dynamically perform tasks beyond fact-based question-answering. We introduce the concept of "document agents": agents that have access to both vector search and summary tools for a given document.

In [1]:
from dotenv import load_dotenv
load_dotenv()
import os

In [2]:
%pip install llama-index-llms-openai
%pip install llama-index-agent-openai

Note: you may need to restart the kernel to use updated packages.


In [3]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import SummaryIndex
from llama_index.core.schema import IndexNode
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI

In [4]:
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]

In [5]:
from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

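    # create the output directory if needed (no error if it already exists)
    data_path = Path("data")
    data_path.mkdir(exist_ok=True)

    # write as UTF-8 explicitly, since the articles contain non-ASCII text
    with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
        fp.write(wiki_text)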
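
The download loop above assumes every title resolves to a page with an `extract` field. When a title does not exist, the MediaWiki API returns a page entry marked `missing` with no `extract`, so `page["extract"]` would raise a `KeyError`. The following is a minimal sketch of a more defensive extraction step; `extract_page_text` is a hypothetical helper that is not part of the original notebook, and the payloads are hand-written stand-ins for decoded API responses.

```python
def extract_page_text(payload: dict) -> str:
    """Return the plain-text extract of the first page in a MediaWiki query response.

    Hypothetical helper: raises ValueError (instead of KeyError) when the
    page is missing or carries no extract.
    """
    pages = payload.get("query", {}).get("pages", {})
    if not pages:
        raise ValueError("response contains no pages")
    page = next(iter(pages.values()))
    title = page.get("title", "<unknown>")
    if "missing" in page:
        raise ValueError(f"page not found: {title}")
    text = page.get("extract")
    if not text:
        raise ValueError(f"no extract for page: {title}")
    return text


# Hand-written payloads shaped like the API's JSON (illustrative data only).
ok = {"query": {"pages": {"1": {"title": "Example", "extract": "Example text."}}}}
missing = {"query": {"pages": {"-1": {"title": "Nope", "missing": ""}}}}

print(extract_page_text(ok))
try:
    extract_page_text(missing)
except ValueError as err:
    print(err)
```

In the loop, this helper would take the place of the two lines that read `page` and `wiki_text` from `response`.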

In [6]:
# Load all wiki documents
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()

# Define LLM



In [14]:
from llama_index.core import Settings

Settings.llm = OpenAI(temperature=0, model="gpt-4o-mini")

# Build Document Agent for each Document
In this section we define "document agents" for each document.

First we define both a vector index (for semantic search) and a summary index (for summarization) for each document. The two query engines are then converted into tools that are passed to an OpenAI function-calling agent.

This document agent can dynamically choose to perform semantic search or summarization within a given document.

We create a separate document agent for each city.

In [15]:
from llama_index.agent.openai import OpenAIAgent

# Build agents dictionary
agents = {}

for wiki_title in wiki_titles:
    # build vector index
    vector_index = VectorStoreIndex.from_documents(
        city_docs[wiki_title],
    )
    # build summary index
    summary_index = SummaryIndex.from_documents(
        city_docs[wiki_title],
    )
    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    list_query_engine = summary_index.as_query_engine()

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=(
                    f"Useful for retrieving specific context from {wiki_title}"
                ),
            ),
        ),
        QueryEngineTool(
            query_engine=list_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=(
                    "Useful for summarization questions related to"
                    f" {wiki_title}"
                ),
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-4o-mini")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
    )

    agents[wiki_title] = agent
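
With function calling, the model is shown each tool's name and description and replies with a tool name plus JSON arguments, which is what the `Calling function: summary_tool with args: {"input": ...}` lines in the logs later in this guide record. The dispatch half of that loop can be sketched without any LLM. Everything in this block (`ToolSpec`, `dispatch`, the stub tools) is hypothetical and is not LlamaIndex's implementation; the tool choice is hard-coded where the real agent would ask the model.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToolSpec:
    """Hypothetical stand-in for a tool: a name, a description, and a callable."""

    name: str
    description: str
    fn: Callable[[str], str]


def dispatch(tools: Dict[str, ToolSpec], call_name: str, call_args_json: str) -> str:
    """Run the tool named by a function call, passing its 'input' argument."""
    if call_name not in tools:
        raise KeyError(f"unknown tool: {call_name}")
    args = json.loads(call_args_json)
    return tools[call_name].fn(args["input"])


# Stub tools that only tag the query; real tools would wrap query engines.
tools = {
    spec.name: spec
    for spec in [
        ToolSpec("vector_tool", "Useful for retrieving specific context", lambda q: f"[vector] {q}"),
        ToolSpec("summary_tool", "Useful for summarization questions", lambda q: f"[summary] {q}"),
    ]
}

# Stand-in for the function call the model would emit.
print(dispatch(tools, "summary_tool", '{"input": "sports teams in Boston"}'))
```

Because the model only sees names and descriptions, the description strings are what steer it toward one tool or the other.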

# Build Composable Retriever over these Agents
Now we define a set of summary nodes, where each node links to the corresponding Wikipedia city article. We then define a composable retriever + query engine on top of these Nodes to route queries down to a given node, which will in turn route it to the relevant document agent.

In [16]:
# define top-level nodes
objects = []
for wiki_title in wiki_titles:
    # define index node that links to these agents
    wiki_summary = (
        f"This content contains Wikipedia articles about {wiki_title}. Use"
        " this index if you need to lookup specific facts about"
        f" {wiki_title}.\nDo not use this index if you want to analyze"
        " multiple cities."
    )
    node = IndexNode(
        text=wiki_summary, index_id=wiki_title, obj=agents[wiki_title]
    )
    objects.append(node)

In [17]:
# define top-level retriever
vector_index = VectorStoreIndex(
    objects=objects,
)
query_engine = vector_index.as_query_engine(similarity_top_k=1, verbose=True)
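
To make the routing concrete, here is a toy, dependency-free sketch of the same idea: each top-level node carries short text used only for matching, plus a linked object that actually answers the query. The word-overlap score is a crude stand-in for embedding similarity, and `LinkedNode`, `score`, and `recursive_query` are hypothetical names rather than LlamaIndex APIs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LinkedNode:
    """Hypothetical node: text used for matching, plus the object it links to."""

    text: str
    index_id: str
    obj: Callable[[str], str]


def score(query: str, text: str) -> int:
    """Count shared lowercase words; a crude stand-in for embedding similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def recursive_query(nodes: List[LinkedNode], query: str) -> str:
    """Pick the single best-matching node and forward the query to its object."""
    best = max(nodes, key=lambda n: score(query, n.text))
    return best.obj(query)


# Each stub "agent" just reports which city handled the query.
nodes = [
    LinkedNode(
        text=f"Wikipedia articles about {city}",
        index_id=city,
        obj=lambda q, city=city: f"{city} agent handled: {q}",
    )
    for city in ["Toronto", "Boston"]
]

print(recursive_query(nodes, "Tell me about the sports teams in Boston"))
```

Selecting a single node mirrors `similarity_top_k=1` above: a query that spans several cities would still reach only one agent.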

# Running Example Queries


In [18]:
# should route to the Boston agent; the agent then picks vector_tool or summary_tool
response = query_engine.query("Tell me about the sports teams in Boston")

[1;3;38;2;11;159;203mRetrieval entering Boston: OpenAIAgent
[0m[1;3;38;2;237;90;200mRetrieving from object OpenAIAgent with query Tell me about the sports teams in Boston
[0mAdded user message to memory: Tell me about the sports teams in Boston
=== Calling Function ===
Calling function: summary_tool with args: {"input":"sports teams in Boston"}
Got output: Boston has teams in the four major North American men's professional sports leagues, as well as Major League Soccer. The city's professional sports teams have won a total of 40 championships across these leagues. Notable teams include:

- **Boston Red Sox**: A founding member of the American League, they play at Fenway Park, the oldest sports arena in active use in the U.S.
- **Boston Celtics**: A founding member of the NBA, they have won eighteen championships, the most of any NBA team, and play at TD Garden.
- **Boston Bruins**: The first American member of the NHL and an Original Six franchise, they also play at TD Garden.
- *

In [19]:
print(response)

Boston is home to several prominent sports teams across major professional leagues, as well as a vibrant college athletics scene. The key professional teams include:

- **Boston Red Sox**: A founding member of the American League, they play at Fenway Park, the oldest sports arena still in use in the U.S., and have a passionate fan base.

- **Boston Celtics**: As a founding member of the NBA, they are one of the most successful teams in basketball history, with eighteen championships, the most in the league. Their home games are held at TD Garden.

- **Boston Bruins**: The first American member of the NHL and part of the Original Six franchises, they also play at TD Garden and have a storied history in ice hockey.

- **New England Patriots**: Originally founded as the Boston Patriots, they now play in Foxborough and have won multiple Super Bowls, known for their strong performances in the NFL.

In addition to these professional teams, Boston boasts a robust college athletics scene with 

In [20]:
# should route to the Houston agent; the agent then picks vector_tool or summary_tool
response = query_engine.query("Tell me about the sports teams in Houston")

[1;3;38;2;11;159;203mRetrieval entering Houston: OpenAIAgent
[0m[1;3;38;2;237;90;200mRetrieving from object OpenAIAgent with query Tell me about the sports teams in Houston
[0mAdded user message to memory: Tell me about the sports teams in Houston
=== Calling Function ===
Calling function: summary_tool with args: {"input":"sports teams in Houston"}
Got output: Houston has professional sports teams in every major league except the National Hockey League. The teams include:

1. **Houston Astros** - Major League Baseball (MLB) team, which has won the World Series in 2017 and 2022.
2. **Houston Rockets** - National Basketball Association (NBA) team, with two championships won in 1994 and 1995.
3. **Houston Texans** - National Football League (NFL) team, established in 2002.
4. **Houston Dynamo** - Major League Soccer (MLS) team, which has won two MLS Cup titles in 2006 and 2007.
5. **Houston Dash** - National Women's Soccer League (NWSL) team, which won their first title in 2020.
6. **

In [21]:
# should use Chicago agent -> summary tool
response = query_engine.query(
    "Give me a summary on all the positive aspects of Chicago"
)

[1;3;38;2;11;159;203mRetrieval entering Chicago: OpenAIAgent
[0m[1;3;38;2;237;90;200mRetrieving from object OpenAIAgent with query Give me a summary on all the positive aspects of Chicago
[0mAdded user message to memory: Give me a summary on all the positive aspects of Chicago
=== Calling Function ===
Calling function: summary_tool with args: {"input":"positive aspects of Chicago"}
Got output: Chicago boasts a rich cultural heritage, with significant contributions to the arts, music, and theater. It is known for its vibrant music scene, particularly in genres like jazz, blues, and house music. The city is home to renowned institutions such as the Chicago Symphony Orchestra and the Art Institute of Chicago, which enhance its cultural landscape.

The city's architecture is another highlight, featuring a stunning skyline that includes iconic buildings like the Willis Tower and the John Hancock Center. Chicago is recognized for its innovative urban planning and was the birthplace of th

In [22]:
print(response)

Chicago is a vibrant city known for its rich cultural heritage, particularly in the arts and music, with a notable emphasis on jazz, blues, and house music. It is home to prestigious institutions like the Chicago Symphony Orchestra and the Art Institute of Chicago, which enhance its cultural landscape. The city's stunning architecture features iconic skyscrapers such as the Willis Tower and the John Hancock Center, showcasing innovative urban planning.

Economically, Chicago boasts a diverse and robust economy, serving as a major hub for finance, commerce, and industry, with a varied labor market. It also offers excellent educational opportunities through renowned universities like the University of Chicago and Northwestern University.

The city is enriched by extensive parks and green spaces, providing residents with recreational opportunities along the waterfront of Lake Michigan. Chicago's culinary scene is celebrated for its unique regional specialties, including deep-dish pizza an