# Comparing Methods for Structured Retrieval (Auto-Retrieval vs. Recursive Retrieval)

In a naive RAG system, the set of input documents are then chunked, embedded, and dumped to a vector database collection. Retrieval would just fetch the top-k documents by embedding similarity.

This can fail if the set of documents is large - it can be hard to disambiguate raw chunks, and you're not guaranteed to filter for the set of documents that contain relevant context.

In this guide we explore **structured retrieval** - more advanced query algorithms that take advantage of structure within your documents for higher-precision retrieval. We compare the following two methods:

- **Metadata Filters + Auto-Retrieval**: Tag each document with the right set of metadata. During query-time, use auto-retrieval to infer metadata filters along with passing through the query string for semantic search.
- **Store Document Hierarchies (summaries -> raw chunks) + Recursive Retrieval**: Embed document summaries and map that to the set of raw chunks for each document. During query-time, do recursive retrieval to first fetch summaries before fetching documents.

In [1]:
import logging
import sys
from llama_index import (
    SimpleDirectoryReader, 
    ListIndex, 
    ServiceContext
)

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [2]:
wiki_titles = ["Michael Jordan", "Elon Musk", "Rihanna"]
wiki_metadatas = {
    "Michael Jordan": {
        "category": "Sports",
        "country": "United States",
    },
    "Elon Musk": {
        "category": "Business",
        "country": "United States",
    },
    "Rihanna": {
        "category": "Music",
        "country": "Barbados",
    }
}

In [3]:
from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

    data_path = Path("data")
    if not data_path.exists():
        Path.mkdir(data_path)

    with open(data_path / f"{title}.txt", "w") as fp:
        fp.write(wiki_text)

In [4]:
# Load all wiki documents
docs_dict = {}
for wiki_title in wiki_titles:
    doc = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()[0]
    
    doc.metadata.update(wiki_metadatas[wiki_title])
    docs_dict[wiki_title] = doc

In [5]:
service_context = ServiceContext.from_defaults(chunk_size=1000)

## Metadata Filters + Auto-Retrieval

In [7]:
## Setup Weaviate
import weaviate

# cloud
resource_owner_config = weaviate.AuthClientPassword(
    username="username",
    password="password",
)
client = weaviate.Client(
    "https://llamaindex-test-ul4sgpxc.weaviate.network",
    auth_client_secret=resource_owner_config,
)

  self.adapters[prefix] = adapter


In [8]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores import WeaviateVectorStore
from IPython.display import Markdown, display

In [9]:
from llama_index.storage.storage_context import StorageContext

# If you want to load the index later, be sure to give it a name!
vector_store = WeaviateVectorStore(weaviate_client=client, index_name="LlamaIndex")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# NOTE: you may also choose to define a index_name manually.
# index_name = "test_prefix"
# vector_store = WeaviateVectorStore(weaviate_client=client, index_name=index_name)

In [10]:
index = VectorStoreIndex([], storage_context=storage_context)

# add documents to index
for wiki_title in wiki_titles:
    index.insert(docs_dict[wiki_title])

In [11]:
from llama_index.indices.vector_store.retrievers import VectorIndexAutoRetriever
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo


vector_store_info = VectorStoreInfo(
    content_info="brief biography of celebrities",
    metadata_info=[
        MetadataInfo(
            name="category",
            type="str",
            description="Category of the celebrity, one of [Sports, Entertainment, Business, Music]",
        ),
        MetadataInfo(
            name="country",
            type="str",
            description="Country of the celebrity, one of [United States, Barbados, Portugal]",
        ),
    ],
)
retriever = VectorIndexAutoRetriever(index, vector_store_info=vector_store_info)

In [12]:
retriever.retrieve("Tell me about a celebrity from the United States")

INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using query str: 
Using query str: 
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using filters: {'country': 'United States'}
Using filters: {'country': 'United States'}
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using top_k: 2
Using top_k: 2
INFO:openai:error_code=None error_message="'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference." error_param=None error_type=invalid_request_error message='OpenAI API error received' stream_error=False
error_code=None error_message="'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference." error_param=None error_type=invalid_request_error message='OpenAI API error received' stream_error=False
INFO:openai:error_code=None error_message="'$.input' is invalid. Please check the API reference: https://platform.opena

RetryError: RetryError[<Future at 0x2b7c7c490 state=finished raised InvalidRequestError>]

## Build Recursive Retriever over Document Summaries

In [29]:
# define top-level nodes
nodes = []
for wiki_title in wiki_titles:
    # use LLM-generated summary
#     list_index = ListIndex.from_documents([docs_dict[wiki_title]], service_context=service_context)
    
#     summarizer = list_index.as_query_engine()
    # response = summarizer.query(f"Give me a summary of {wiki_title}")
    
    # from llama_index.response_synthesizers import TreeSummarize
    # tree_summarize = TreeSummarize()
    # tree_summarize.get_response(f"Give me a summary of {wiki_title}", [docs_dict[wiki_title].get_content()])
    
    wiki_summary = response.response
    print(f'**Summary for {wiki_title}: {wiki_summary}')
    node = IndexNode(text=wiki_summary, index_id=wiki_title)
    nodes.append(node)

INFO:openai:error_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4146 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False
error_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4146 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False
error_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, your messages resulted in 4146 tokens. Please reduce the length of the messages." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False
error_code=context_length_exceeded error_message="This model's maximum context len

InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 4146 tokens. Please reduce the length of the messages.

In [None]:
# define top-level retriever
vector_index = VectorStoreIndex(nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)

In [None]:
# define recursive retriever
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import get_response_synthesizer

In [None]:
# note: can pass `agents` dict as `query_engine_dict` since every agent can be used as a query engine
recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=agents,
    verbose=True,
)

In [None]:
retriever.retrieve("Tell me about a celebrity from the United States")