# Recursive Retriever + Query Engine Demo 

In this demo, we walk through a use case of showcasing our "RecursiveRetriever" module over hierarchical data.

The concept of recursive retrieval is that we not only explore the directly most relevant nodes, but also explore
node relationships to additional retrievers/query engines and execute them. For instance, a node may represent a concise summary of a structured table,
and link to a SQL/Pandas query engine over that structured table. Then if the node is retrieved, we want to also query the underlying query engine for the answer.

This can be especially useful for documents with hierarchical relationships. In this example, we walk through a Wikipedia article about billionaires (in PDF form), which contains both text and a variety of embedded structured tables. We first create a Pandas query engine over each table, but also represent each table by an `IndexNode` (stores a link to the query engine); this Node is stored along with other Nodes in a vector store. 

During query-time, if an `IndexNode` is fetched, then the underlying query engine/retriever will be queried. 

**Notes about Setup**

We use `camelot` to extract text-based tables from PDFs.

In [1]:
import camelot
from llama_index import Document, ListIndex
# https://en.wikipedia.org/wiki/The_World%27s_Billionaires
from llama_index import VectorStoreIndex, ServiceContext, LLMPredictor
from llama_index.query_engine import PandasQueryEngine, RecursiveRetrieverQueryEngine, RetrieverQueryEngine
from llama_index.retrievers import RecursiveRetriever
from llama_index.schema import IndexNode

from langchain.chat_models import ChatOpenAI
from llama_hub.file.pymu_pdf.base import PyMuPDFReader
from pathlib import Path
from typing import List

  from .autonotebook import tqdm as notebook_tqdm


## Load in Document (and Tables)

We use our `PyMuPDFReader` to read in the main text of the document.

We also use `camelot` to extract some structured tables from the document

In [2]:
file_path = "billionaires_page.pdf"

In [3]:
# initialize PDF reader
reader = PyMuPDFReader()

In [4]:
docs = reader.load(Path(file_path))

In [5]:
# use camelot to parse tables
def get_tables(path: str, pages: List[int]):
    table_dfs = []
    for page in pages:
        table_list = camelot.read_pdf(path, pages=str(page))
        table_df = table_list[0].df
        table_df = table_df.rename(columns=table_df.iloc[0]).drop(table_df.index[0]).reset_index(drop=True)
        table_dfs.append(table_df)
    return table_dfs

In [6]:
table_dfs = get_tables(file_path, pages=[3, 25])

In [7]:
# shows list of top billionaires in 2023
table_dfs[0]

Unnamed: 0,No.,Name,Net worth\n(USD),Age,Nationality,Primary source(s) of wealth
0,1,Bernard Arnault &\nfamily,$211 billion,74,France,LVMH
1,2,Elon Musk,$180 billion,51,United\nStates,"Tesla, SpaceX, X Corp."
2,3,Jeff Bezos,$114 billion,59,United\nStates,Amazon
3,4,Larry Ellison,$107 billion,78,United\nStates,Oracle Corporation
4,5,Warren Buffett,$106 billion,92,United\nStates,Berkshire Hathaway
5,6,Bill Gates,$104 billion,67,United\nStates,Microsoft
6,7,Michael Bloomberg,$94.5 billion,81,United\nStates,Bloomberg L.P.
7,8,Carlos Slim & family,$93 billion,83,Mexico,"Telmex, América Móvil, Grupo\nCarso"
8,9,Mukesh Ambani,$83.4 billion,65,India,Reliance Industries
9,10,Steve Ballmer,$80.7 billion,67,United\nStates,Microsoft


In [8]:
# shows list of top billionaires
table_dfs[1]

Unnamed: 0,Year,Number of billionaires,Group's combined net worth
0,2023[2],2640.0,$12.2 trillion
1,2022[6],2668.0,$12.7 trillion
2,2021[11],2755.0,$13.1 trillion
3,2020,2095.0,$8.0 trillion
4,2019,2153.0,$8.7 trillion
5,2018,2208.0,$9.1 trillion
6,2017,2043.0,$7.7 trillion
7,2016,1810.0,$6.5 trillion
8,2015[18],1826.0,$7.1 trillion
9,2014[67],1645.0,$6.4 trillion


## Create Pandas Query Engines

We create a pandas query engine over each structured table.

These can be executed on their own to answer queries about each table.

In [9]:
# define query engines over these tables
df_query_engines = [PandasQueryEngine(table_df) for table_df in table_dfs]

In [10]:
df_query_engines[0].query("What's the net worth of the second richest billionaire in 2023?")

df.iloc[1]['Net worth\n(USD)']


Response(response='$180\xa0billion', source_nodes=[], metadata={'pandas_instruction_str': "\ndf.iloc[1]['Net worth\\n(USD)']"})

## Build Vector Index

Build vector index over the chunked document as well as over the additional `IndexNode` objects linked to the tables.

In [11]:
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-4", streaming=True))
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
)

In [12]:
doc_nodes = service_context.node_parser.get_nodes_from_documents(docs)

In [13]:
# define index nodes
summaries = [
    "This node provides information on the number of billionaires and their combined net worth from 2000 to 2023.",
    "This node provides information about the world's richest billionaires in 2023"
]

df_nodes = [IndexNode(text=summary, index_id=f"pandas{idx}") for idx, summary in enumerate(summaries)]

df_id_query_engine_mapping = {f"pandas{idx}": df_query_engine for idx, df_query_engine in enumerate(df_query_engines)}

In [14]:
# construct top-level vector index + query engine
vector_index = VectorStoreIndex(doc_nodes + df_nodes)
vector_retriever = vector_index.as_retriever()

## Build RecursiveRetrieverQueryEngine

Our `RecursiveRetrieverQueryEngine` is a light layer around a `RecursiveRetriever` object to retrieve nodes, and a `ResponseSynthesizer` to synthesize a response.

We pass in mappings from id to retriever and id to query engine. We then pass in a root id representing the retriever we query first.

In [15]:
# baseline vector index (that doesn't include the extra df nodes).
# used to benchmark 
vector_index0 = VectorStoreIndex(doc_nodes)
vector_query_engine0 = vector_index0.as_query_engine()

In [16]:
from llama_index.query_engine import RecursiveRetrieverQueryEngine

query_engine = RecursiveRetrieverQueryEngine.from_args(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=df_id_query_engine_mapping,
    verbose=True
)

In [17]:
response = query_engine.query("What's the net worth of the second richest billionaire in 2023?")

df.loc[df['Year'] == '2023[2]', "Group's combined net worth"].iloc[1]


Traceback (most recent call last):
  File "/Users/jerryliu/Programming/gpt_index/llama_index/query_engine/pandas_query_engine.py", line 59, in default_output_processor
    raise e
  File "/Users/jerryliu/Programming/gpt_index/llama_index/query_engine/pandas_query_engine.py", line 57, in default_output_processor
    return str(eval(module_end_str, {}, local_vars))
  File "<string>", line 1, in <module>
  File "/Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages/pandas/core/indexing.py", line 1625, in _getitem_axis
    self._validate_integer(key, axis)
  File "/Users/jerryliu/Programming/gpt_index/.venv/lib/python3.10/site-packages/pandas/core/indexing.py", line 1557, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positio

df.iloc[1]['Net worth\n(USD)']


In [19]:
response.source_nodes[1].node.get_content()

"Query: What's the net worth of the second richest billionaire in 2023?\nResponse: $180\xa0billion"

In [20]:
str(response)

'\n$180 billion'

In [21]:
response = query_engine.query("How many billionaires were there in 2010?")

len(df[df['Age'] == 2010])


In [22]:
response = vector_query_engine0.query("What's the net worth of the second richest billionaire in 2023?")

In [23]:
print(response.source_nodes[1].node.get_content())

7/1/23, 11:31 PM
The World's Billionaires - Wikipedia
https://en.wikipedia.org/wiki/The_World%27s_Billionaires
3/33
No.
Name
Net worth
(USD)
Age
Nationality
Primary source(s) of wealth
1 
Bernard Arnault &
family
$211 billion 
74
 France
LVMH
2 
Elon Musk
$180 billion 
51
 United
States
Tesla, SpaceX, X Corp.
3 
Jeff Bezos
$114 billion 
59
 United
States
Amazon
4 
Larry Ellison
$107 billion 
78
 United
States
Oracle Corporation
5 
Warren Buffett
$106 billion 
92
 United
States
Berkshire Hathaway
6 
Bill Gates
$104 billion 
67
 United
States
Microsoft
7 
Michael Bloomberg
$94.5 billion 
81
 United
States
Bloomberg L.P.
8 
Carlos Slim & family
$93 billion 
83
 Mexico
Telmex, América Móvil, Grupo
Carso
9 
Mukesh Ambani
$83.4 billion 
65
 India
Reliance Industries
10 
Steve Ballmer
$80.7 billion 
67
 United
States
Microsoft
In the 36th annual Forbes list of the world's billionaires, the list included 2,668 billionaires with a
total net wealth of $12.7 trillion, down 97 members from 2021.[6

In [24]:
print(str(response))


The net worth of the second richest billionaire in 2023 is $211 billion.


In [25]:
response.source_nodes[1].node.get_content()

"7/1/23, 11:31 PM\nThe World's Billionaires - Wikipedia\nhttps://en.wikipedia.org/wiki/The_World%27s_Billionaires\n3/33\nNo.\nName\nNet worth\n(USD)\nAge\nNationality\nPrimary source(s) of wealth\n1 \nBernard Arnault &\nfamily\n$211\xa0billion\xa0\n74\n\xa0France\nLVMH\n2 \nElon Musk\n$180\xa0billion\xa0\n51\n\xa0United\nStates\nTesla, SpaceX, X Corp.\n3 \nJeff Bezos\n$114\xa0billion\xa0\n59\n\xa0United\nStates\nAmazon\n4 \nLarry Ellison\n$107\xa0billion\xa0\n78\n\xa0United\nStates\nOracle Corporation\n5 \nWarren Buffett\n$106\xa0billion\xa0\n92\n\xa0United\nStates\nBerkshire Hathaway\n6 \nBill Gates\n$104\xa0billion\xa0\n67\n\xa0United\nStates\nMicrosoft\n7 \nMichael Bloomberg\n$94.5\xa0billion\xa0\n81\n\xa0United\nStates\nBloomberg L.P.\n8 \nCarlos Slim & family\n$93\xa0billion\xa0\n83\n\xa0Mexico\nTelmex, América Móvil, Grupo\nCarso\n9 \nMukesh Ambani\n$83.4\xa0billion \n65\n\xa0India\nReliance Industries\n10 \nSteve Ballmer\n$80.7\xa0billion\xa0\n67\n\xa0United\nStates\nMicrosoft\n