In [1]:
from dotenv import load_dotenv
load_dotenv()
import os

# Recursive Retriever + Query Engine Demo
In this demo, we walk through a use case that showcases the ```RecursiveRetriever``` module over hierarchical data.

The idea behind recursive retrieval is that we explore not only the most directly relevant nodes, but also follow node relationships to additional retrievers or query engines and execute them. For instance, a node may hold a concise summary of a structured table and link to a SQL/Pandas query engine over that table. If that node is retrieved, we also query the linked query engine for the answer.

This is especially useful for documents with hierarchical relationships. In this example, we walk through a Wikipedia article about billionaires (in PDF form), which contains both text and a variety of embedded structured tables. We first create a Pandas query engine over each table, and we also represent each table by an IndexNode (which stores a link to the query engine); this node is stored along with the other nodes in a vector store.

At query time, if an IndexNode is fetched, the underlying query engine or retriever is queried as well.
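
To make the mechanism concrete, here is a minimal, library-free sketch of the idea. It is not how the library implements recursive retrieval: `ToyNode`, `overlap_score`, and `recursive_retrieve` are hypothetical names, the node texts are made up for illustration, and word overlap stands in for embedding similarity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToyNode:
    """A chunk of text; `index_id` (optional) links to a query engine."""

    text: str
    index_id: Optional[str] = None


def overlap_score(query: str, text: str) -> int:
    """Toy relevance: number of lowercase words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def recursive_retrieve(
    query: str,
    nodes: List[ToyNode],
    engines: Dict[str, Callable[[str], str]],
) -> str:
    """Fetch the best-matching node; if it links to an engine, run the engine."""
    best = max(nodes, key=lambda n: overlap_score(query, n.text))
    if best.index_id is not None:
        # The node is only a summary: delegate the query to the linked engine.
        return engines[best.index_id](query)
    return best.text


nodes = [
    ToyNode("The list is compiled and published every year"),
    ToyNode("Summary: table of the richest billionaires in 2023", index_id="table0"),
]
# Hypothetical engine: a real one would run the query over the table.
engines = {"table0": lambda q: "answer computed from table0"}

print(recursive_retrieve("who are the richest billionaires in 2023", nodes, engines))
print(recursive_retrieve("when is the list published", nodes, engines))
```

The first query matches the summary node, so the linked engine answers it; the second matches a plain text node, which is returned unchanged.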

In [2]:
# %pip install llama-index-embeddings-openai
# %pip install llama-index-readers-file pymupdf
# %pip install llama-index-llms-openai
# %pip install llama-index-experimental
# %pip install camelot-py

In [3]:
import camelot

# https://en.wikipedia.org/wiki/The_World%27s_Billionaires
from llama_index.core import VectorStoreIndex
from llama_index.experimental.query_engine import PandasQueryEngine
from llama_index.core.schema import IndexNode
from llama_index.llms.openai import OpenAI

from llama_index.readers.file import PyMuPDFReader
from typing import List

In [4]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load in Document (and Tables)
We use our PyMuPDFReader to read in the main text of the document.

We also use camelot to extract some structured tables from the document.

In [5]:
file_path = "billionaires_page.pdf"

In [6]:
# initialize PDF reader
reader = PyMuPDFReader()

In [7]:
docs = reader.load(file_path)

In [8]:
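# use camelot to parse tables


def get_tables(path: str, pages: List[int]):
    """Extract the first table found on each given page as a DataFrame."""
    table_dfs = []
    for page in pages:
        # camelot takes the page selection as a string
        table_list = camelot.read_pdf(path, pages=str(page))
        # keep only the first table detected on the page
        table_df = table_list[0].df
        # promote the first row to column names, then drop it from the data
        table_df = (
            table_df.rename(columns=table_df.iloc[0])
            .drop(table_df.index[0])
            .reset_index(drop=True)
        )
        table_dfs.append(table_df)
    return table_dfs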
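
The header promotion in `get_tables` can be illustrated without pandas or camelot. The sketch below is an assumption-laden stand-in: `promote_header` is a hypothetical helper, and the rows are plain lists of strings shaped like the extracted table. It also collapses the line breaks that PDF extraction leaves inside cells, which `get_tables` itself does not do.

```python
from typing import Dict, List


def promote_header(rows: List[List[str]]) -> List[Dict[str, str]]:
    """Treat the first row as column names and return the rest as records."""
    # " ".join(cell.split()) collapses newlines and repeated spaces.
    header = [" ".join(cell.split()) for cell in rows[0]]
    return [
        {name: " ".join(cell.split()) for name, cell in zip(header, row)}
        for row in rows[1:]
    ]


raw_rows = [
    ["No.", "Name", "Net worth\n(USD)"],
    ["1", "Bernard Arnault &\nfamily", "$211 billion"],
    ["2", "Elon Musk", "$180 billion"],
]
records = promote_header(raw_rows)
print(records[0])
```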


In [13]:
table_dfs = get_tables(file_path, pages=[3, 25])

2024-10-31T10:55:45 - INFO - Processing page-3


2024-10-31T10:55:47 - INFO - Processing page-25



In [10]:
# shows list of top billionaires in 2023
table_dfs[0]

Unnamed: 0,No.,Name,Net worth\n(USD),Age,Nationality,Primary source(s) of wealth
0,1,Bernard Arnault &\nfamily,$211 billion,74,France,LVMH
1,2,Elon Musk,$180 billion,51,United\nStates,"Tesla, SpaceX, X Corp."
2,3,Jeff Bezos,$114 billion,59,United\nStates,Amazon
3,4,Larry Ellison,$107 billion,78,United\nStates,Oracle Corporation
4,5,Warren Buffett,$106 billion,92,United\nStates,Berkshire Hathaway
5,6,Bill Gates,$104 billion,67,United\nStates,Microsoft
6,7,Michael Bloomberg,$94.5 billion,81,United\nStates,Bloomberg L.P.
7,8,Carlos Slim & family,$93 billion,83,Mexico,"Telmex, América Móvil, Grupo\nCarso"
8,9,Mukesh Ambani,$83.4 billion,65,India,Reliance Industries
9,10,Steve Ballmer,$80.7 billion,67,United\nStates,Microsoft


In [11]:
# shows the number of billionaires and their combined net worth by year
table_dfs[1]

Unnamed: 0,Year,Number of billionaires,Group's combined net worth
0,2023[2],2640.0,$12.2 trillion
1,2022[6],2668.0,$12.7 trillion
2,2021[11],2755.0,$13.1 trillion
3,2020,2095.0,$8.0 trillion
4,2019,2153.0,$8.7 trillion
5,2018,2208.0,$9.1 trillion
6,2017,2043.0,$7.7 trillion
7,2016,1810.0,$6.5 trillion
8,2015[18],1826.0,$7.1 trillion
9,2014[67],1645.0,$6.4 trillion
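
Both tables store amounts as strings such as `$180 billion`, so sorting or comparing them numerically needs a parsing step first. The following is a minimal sketch, assuming every amount follows the `$<number> <unit>` pattern seen in the output above; `parse_usd` and `UNITS` are hypothetical names, not part of the notebook.

```python
import re

# Multipliers for the unit words that appear in the tables (assumed complete).
UNITS = {"million": 1e6, "billion": 1e9, "trillion": 1e12}


def parse_usd(value: str) -> float:
    """Convert a string like '$94.5 billion' into a number of dollars."""
    match = re.fullmatch(r"\$([\d.]+)\s+(million|billion|trillion)", value.strip())
    if match is None:
        raise ValueError(f"unrecognized amount: {value!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


amounts = ["$114 billion", "$180 billion", "$94.5 billion"]

# Comparing the raw strings is lexicographic, so '$94.5 billion' sorts highest.
print(max(amounts))
# Comparing the parsed values gives the numerically largest amount.
print(max(amounts, key=parse_usd))
```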


# Create Pandas Query Engines
We create a pandas query engine over each structured table.

These can be executed on their own to answer queries about each table.

**⚠️WARNING**: This tool gives the LLM access to the ```eval``` function, so arbitrary code execution is possible on the machine running it. Although the generated code is filtered to some extent, do not use this tool in a production setting without heavy sandboxing or virtual machines.
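
To see why this matters, the sketch below contrasts `eval` with `ast.literal_eval` from the standard library on a harmless but telling string. It is illustrative only: the string stands in for whatever code an LLM might return, and `ast.literal_eval` is not a drop-in replacement for the query engine, since it accepts literals only.

```python
import ast
import os

# Stand-in for an expression that an LLM could return.
untrusted = "__import__('os').getcwd()"

# eval executes it: the string imports a module and calls a function.
print(eval(untrusted) == os.getcwd())

# ast.literal_eval accepts only Python literals, so it refuses the same string.
try:
    ast.literal_eval(untrusted)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```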

# define query engines over these tables


In [44]:
llm = OpenAI(model="gpt-4o")

df_query_engines = [
    PandasQueryEngine(table_df, llm=llm) for table_df in table_dfs
]

In [23]:
response = df_query_engines[0].query(
    "What's the net worth of the second richest billionaire in 2023?"
)
print(str(response))

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
$180 billion


In [24]:
response = df_query_engines[1].query(
    "How many billionaires were there in 2009?"
)
print(str(response))

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
793


# Build Vector Index
Build a vector index over the chunked document as well as over the additional IndexNode objects linked to the tables.
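
As a rough picture of what indexing text chunks and table summaries together means, the toy sketch below ranks both kinds of entries in a single list by cosine similarity. Everything here is an assumption for illustration: `embed`, `cosine`, and `top_k` are hypothetical helpers, the entry texts are invented, and word counts stand in for the OpenAI embedding model.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: a bag of lowercase words (not a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, entries: List[Tuple[str, str]], k: int = 1) -> List[str]:
    """Return the ids of the k entries most similar to the query."""
    q = embed(query)
    ranked = sorted(entries, key=lambda e: cosine(q, embed(e[1])), reverse=True)
    return [entry_id for entry_id, _ in ranked[:k]]


# Text chunks and table summaries live side by side in one list.
entries = [
    ("chunk0", "forbes values assets and subtracts known debt"),
    ("pandas0", "richest billionaires in 2023"),
    ("pandas1", "number of billionaires and combined net worth by year"),
]
print(top_k("number of billionaires by year", entries, k=1))
```

With `k=1` only the single best entry survives, so a table summary is followed only when it outranks every text chunk.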

In [25]:
from llama_index.core import Settings

doc_nodes = Settings.node_parser.get_nodes_from_documents(docs)

In [46]:
# define index nodes
summaries = [
    (
        "This node provides information about the world's richest billionaires"
        " in 2023"
    ),
    (
        "This node provides information on the number of billionaires and"
        " their combined net worth from 2000 to 2023."
    ),
]

df_nodes = [
    IndexNode(text=summary, index_id=f"pandas{idx}")
    for idx, summary in enumerate(summaries)
]

df_id_query_engine_mapping = {
    f"pandas{idx}": df_query_engine
    for idx, df_query_engine in enumerate(df_query_engines)
}

# construct top-level vector index + query engine


In [47]:
vector_index = VectorStoreIndex(doc_nodes + df_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


# Use ```RecursiveRetriever``` in our ```RetrieverQueryEngine```


We define a ```RecursiveRetriever``` object to recursively retrieve/query nodes. We then put this in our ```RetrieverQueryEngine``` along with a ```ResponseSynthesizer``` to synthesize a response.

We pass in mappings from id to retriever and id to query engine. We then pass in a root id representing the retriever we query first.
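
The role of the two mappings and the root id can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not the library's implementation: `resolve` is a hypothetical function, retrievers and query engines are plain callables, and a retrieved item is a `(text, linked_id)` pair.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-ins: linked_id is None for a plain text node.
Item = Tuple[str, Optional[str]]
Retriever = Callable[[str], List[Item]]
QueryEngine = Callable[[str], str]


def resolve(
    query: str,
    root_id: str,
    retriever_dict: Dict[str, Retriever],
    query_engine_dict: Dict[str, QueryEngine],
) -> List[str]:
    """Start at the retriever named root_id and follow every linked id."""
    results: List[str] = []
    for text, linked_id in retriever_dict[root_id](query):
        if linked_id is None:
            # Plain text node: keep its text.
            results.append(text)
        elif linked_id in retriever_dict:
            # Linked retriever: recurse into it with the same query.
            results.extend(
                resolve(query, linked_id, retriever_dict, query_engine_dict)
            )
        elif linked_id in query_engine_dict:
            # Linked query engine: run the query against it.
            results.append(query_engine_dict[linked_id](query))
        else:
            raise KeyError(f"no retriever or query engine for id {linked_id!r}")
    return results


retriever_dict = {
    "vector": lambda q: [("a plain text chunk", None), ("table summary", "pandas0")],
}
query_engine_dict = {"pandas0": lambda q: f"table engine answer to: {q}"}

print(resolve("example question", "vector", retriever_dict, query_engine_dict))
```

The ids used on the linked nodes must match the keys of one of the two dictionaries; in this sketch an unknown id raises a `KeyError`.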


In [48]:
# baseline vector index (that doesn't include the extra df nodes).
# used to benchmark
vector_index0 = VectorStoreIndex(doc_nodes)
vector_query_engine0 = vector_index0.as_query_engine()

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [49]:
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer

recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=df_id_query_engine_mapping,
    verbose=True,
)

response_synthesizer = get_response_synthesizer(response_mode="compact")

query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, response_synthesizer=response_synthesizer
)

In [50]:
response = query_engine.query(
    "What's the net worth of the second richest billionaire in 2023?"
)

Retrieving with query id None: What's the net worth of the second richest billionaire in 2023?
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Retrieving text node: 7/1/23, 11:31 PM
The World's Billionaires - Wikipedia
https://en.wikipedia.org/wiki/The_World%27s_Billionaires
2/33
stock are priced to market on a date roughly a month before publication. Privately held companies are
priced by the prevailing price-to-sales or price-to-earnings ratios. Known debt is subtracted from
assets to get a final estimate of an individual's estimated worth in United States dollars. Since stock
prices fluctuate rapidly, an individual's true wealth and ranking at the time of publication may vary
from their situation when the list was compiled.[7]
When a living individual has dispersed his or her wealth to immediate family members it is included
under a single listin

In [51]:
response.source_nodes[0].node.get_content()

'7/1/23, 11:31 PM\nThe World\'s Billionaires - Wikipedia\nhttps://en.wikipedia.org/wiki/The_World%27s_Billionaires\n2/33\nstock are priced to market on a date roughly a month before publication. Privately held companies are\npriced by the prevailing price-to-sales or price-to-earnings ratios. Known debt is subtracted from\nassets to get a final estimate of an individual\'s estimated worth in United States dollars. Since stock\nprices fluctuate rapidly, an individual\'s true wealth and ranking at the time of publication may vary\nfrom their situation when the list was compiled.[7]\nWhen a living individual has dispersed his or her wealth to immediate family members it is included\nunder a single listing (as a single "family fortune") provided that individual (the grantor) is still living.\nHowever, if a deceased billionaire\'s fortune has been dispersed, it will not appear as a single listing,\nand each recipient will only appear if his or her own total net worth is over a $Billion (his

In [52]:
str(response)

'The net worth of the second richest billionaire in 2023 is not specified in the provided information.'

In [53]:
response = vector_query_engine0.query(
    "How many billionaires were there in 2009?"
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [54]:
print(response.source_nodes[0].node.get_content())

7/1/23, 11:31 PM
The World's Billionaires - Wikipedia
https://en.wikipedia.org/wiki/The_World%27s_Billionaires
12/33
No.
Name
Net worth
(USD)
Age
Nationality
Source(s) of wealth
1 
Carlos Slim
$74.0 billion 
71
 Mexico
América Móvil, Grupo Carso
2 
Bill Gates
$56.0 billion 
55
 United
States
Microsoft
3 
Warren Buffett
$50.0 billion 
80
 United
States
Berkshire Hathaway
4 
Bernard Arnault
$41.0 billion 
62
 France
LVMH Moët Hennessy • Louis
Vuitton
5 
Larry Ellison
$39.5 billion 
66
 United
States
Oracle Corporation
6 
Lakshmi Mittal
$31.1 billion 
60
 India
Arcelor Mittal
7 
Amancio Ortega
$31.0 billion 
74
 Spain
Inditex Group
8 
Eike Batista
$30.0 billion 
53
 Brazil
EBX Group
9 
Mukesh Ambani
$27.0 billion 
54
 India
Reliance Industries
10 
Christy Walton &
family
$26.5 billion 
62
 United
States
Walmart
Slim narrowly eclipsed Gates to top the billionaire list for the first time. Slim saw his estimated worth
surge $18.5 billion to $53.5 billion as shares of America Movil rose 35 pe

In [55]:
print(str(response))

In 2009, there were a total of 1,011 billionaires.


In [56]:
response.source_nodes[0].node.get_content()

"7/1/23, 11:31 PM\nThe World's Billionaires - Wikipedia\nhttps://en.wikipedia.org/wiki/The_World%27s_Billionaires\n12/33\nNo.\nName\nNet worth\n(USD)\nAge\nNationality\nSource(s) of wealth\n1 \nCarlos Slim\n$74.0\xa0billion\xa0\n71\n\xa0Mexico\nAmérica Móvil, Grupo Carso\n2 \nBill Gates\n$56.0\xa0billion\xa0\n55\n\xa0United\nStates\nMicrosoft\n3 \nWarren Buffett\n$50.0\xa0billion\xa0\n80\n\xa0United\nStates\nBerkshire Hathaway\n4 \nBernard Arnault\n$41.0\xa0billion\xa0\n62\n\xa0France\nLVMH Moët Hennessy • Louis\nVuitton\n5 \nLarry Ellison\n$39.5\xa0billion\xa0\n66\n\xa0United\nStates\nOracle Corporation\n6 \nLakshmi Mittal\n$31.1\xa0billion\xa0\n60\n\xa0India\nArcelor Mittal\n7 \nAmancio Ortega\n$31.0\xa0billion\xa0\n74\n\xa0Spain\nInditex Group\n8 \nEike Batista\n$30.0\xa0billion\xa0\n53\n\xa0Brazil\nEBX Group\n9 \nMukesh Ambani\n$27.0\xa0billion \n54\n\xa0India\nReliance Industries\n10 \nChristy Walton &\nfamily\n$26.5\xa0billion\xa0\n62\n\xa0United\nStates\nWalmart\nSlim narrowly e

In [57]:
response = query_engine.query(
    "Which billionaires are excluded from this list?"
)

Retrieving with query id None: Which billionaires are excluded from this list?
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Retrieving text node: Retrieved 13 March 2019.
15. Kroll, Luisa (6 March 2018). "Forbes Billionaires 2018: Meet The Richest People On The Planet"
(https://www.forbes.com/sites/luisakroll/2018/03/06/forbes-billionaires-2018-meet-the-richest-peopl
e-on-the-planet/). Forbes. Archived (https://web.archive.org/web/20180308165924/https://www.for
bes.com/sites/luisakroll/2018/03/06/forbes-billionaires-2018-meet-the-richest-people-on-the-plane
t/) from the original on 8 March 2018. Retrieved 6 March 2018.
16. Dolan, Kerry A. "Why No Saudi Arabians Made The Forbes Billionaires List This Year" (https://ww
w.forbes.com/sites/kerryadolan/2018/03/06/no-saudi-arabian-billionaires-forbes-list-2018-alwaleed
-alamoudi/). Forbes. Archived (ht

In [43]:
print(str(response))

No Saudi Arabians made the Forbes Billionaires List in the year referenced.
