# Auto-Retrieval with LlamaCloud

<a href="https://colab.research.google.com/github/run-llama/llamacloud-demo/blob/main/examples/advanced_rag/auto_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Auto-retrieval** is an advanced RAG technique in which an LLM infers the metadata filter parameters, along with the semantic query, before vector database retrieval begins. Naive RAG, by contrast, sends the user query directly to the vector db retrieval interface (e.g. dense vector search). If you come from the retrieval world, you can think of auto-retrieval as a form of query expansion/rewriting; it can also be seen as a specific form of function calling.

![](auto_retrieval_img.png)
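
To make the contrast concrete, here is a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the three-chunk `CORPUS`, the `QuerySpec` dataclass (a stand-in for the structured output an LLM would produce), and the word-overlap score (a stand-in for vector similarity). It is not how LlamaCloud ranks results.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict


@dataclass
class QuerySpec:
    """Hypothetical stand-in for the LLM's structured output: a query plus filters."""
    query: str
    filters: dict = field(default_factory=dict)


# Toy corpus, invented for this sketch.
CORPUS = [
    Chunk("METRA maximizes an objective over skill vectors.", {"file_name": "metra.pdf"}),
    Chunk("SWE-bench states its objective as resolving GitHub issues.", {"file_name": "swebench.pdf"}),
    Chunk("Self-RAG trains with reflection tokens.", {"file_name": "selfrag.pdf"}),
]


def tokens(text: str) -> set:
    """Lowercased word set; a crude substitute for an embedding."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(spec: QuerySpec, top_k: int = 2) -> list:
    """Apply exact-match metadata filters first, then rank the survivors by word overlap."""
    candidates = [
        c for c in CORPUS
        if all(c.metadata.get(k) == v for k, v in spec.filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda c: len(tokens(spec.query) & tokens(c.text)),
        reverse=True,
    )
    return ranked[:top_k]


# Naive RAG: the raw user query goes straight to retrieval, with no filters.
naive = retrieve(QuerySpec(query="ELI5 the objective function in Metra"))

# Auto-retrieval: the filter and the cleaned-up query would come from the LLM.
auto = retrieve(QuerySpec(query="objective function in Metra",
                          filters={"file_name": "metra.pdf"}))

print([c.metadata["file_name"] for c in naive])
print([c.metadata["file_name"] for c in auto])
```

In this toy setup the unfiltered query also pulls in a chunk from another paper that happens to share a word with the query, while the filtered query only returns chunks from the paper the user named.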

LlamaCloud helps you easily define chunk and document-level retrieval interfaces on top of any documents. In this guide we show you how to build an auto-retrieval pipeline on top of LlamaCloud retrievers over a research document corpus.

## Setup LlamaCloud 

Install core packages and download relevant files. Upload these documents to LlamaCloud, and then define a chunk and document-level retriever interface over these documents.

For more information on chunk-level and document-level retrieval, see our file-retrieval example notebook [here](https://github.com/run-llama/llamacloud-demo/blob/main/examples/10k_apple_tesla/demo_file_retrieval.ipynb).

In [None]:
!pip install llama-index
!pip install llama-index-core
!pip install llama-parse

In [2]:
# NOTE: uncomment more papers if you want to do research over a larger subset of docs

urls = [
    # "https://openreview.net/pdf?id=VtmBAGCN7o",
    # "https://openreview.net/pdf?id=6PmJoRfdaK",
    # "https://openreview.net/pdf?id=LzPWWPAdY4",
    "https://openreview.net/pdf?id=VTF8yNQM66",
    "https://openreview.net/pdf?id=hSyW5go0v8",
    # "https://openreview.net/pdf?id=9WD9KwssyT",
    # "https://openreview.net/pdf?id=yV6fD7LYkF",
    # "https://openreview.net/pdf?id=hnrB5YHoYu",
    # "https://openreview.net/pdf?id=WbWtOYIzIK",
    "https://openreview.net/pdf?id=c5pwL0Soay",
    # "https://openreview.net/pdf?id=TpD2aG1h0D",
]

papers = [
    # "metagpt.pdf",
    # "longlora.pdf",
    # "loftq.pdf",
    "swebench.pdf",
    "selfrag.pdf",
    # "zipformer.pdf",
    # "values.pdf",
    # "finetune_fair_diffusion.pdf",
    # "knowledge_card.pdf",
    "metra.pdf",
    # "vr_mcl.pdf",
]

data_dir = "iclr_docs"

In [None]:
!mkdir -p "{data_dir}"
for url, paper in zip(urls, papers):
    !wget "{url}" -O "{data_dir}/{paper}"
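
One thing to watch when uncommenting more papers: `zip` silently stops at the shorter list, so a URL uncommented without its matching file name (or the reverse) would save PDFs under the wrong names. The check below is a small sketch of a guard you could run before downloading; `plan_downloads` is a hypothetical helper introduced here, not part of the notebook.

```python
from pathlib import Path


def plan_downloads(urls, papers, data_dir):
    """Pair each URL with its target path, failing loudly if the lists are misaligned."""
    if len(urls) != len(papers):
        raise ValueError(
            f"{len(urls)} URLs but {len(papers)} file names; "
            "uncomment the same entries in both lists"
        )
    return [(url, Path(data_dir) / paper) for url, paper in zip(urls, papers)]


# Two of the active entries from the lists above, in the same order.
plan = plan_downloads(
    [
        "https://openreview.net/pdf?id=VTF8yNQM66",
        "https://openreview.net/pdf?id=hSyW5go0v8",
    ],
    ["swebench.pdf", "selfrag.pdf"],
    "iclr_docs",
)
for url, target in plan:
    print(f"{url} -> {target}")
```

Note that this only checks that the lists have the same length; keeping the entries in the same order is still up to you.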

#### Load Documents into LlamaCloud

Create a new index in LlamaCloud and drag and drop these downloaded PDFs into the data source.

For best results, in the Transformation Configuration click on the "Manual" tab, then set the segmentation configuration to page-level and additional chunking to "None".

#### Setup LlamaCloud Index

In [3]:
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex
import os

index = LlamaCloudIndex(
  name="research_papers_page",
  project_name="llamacloud_demo",
  api_key=os.environ["LLAMA_CLOUD_API_KEY"]
)
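
The cell above reads `LLAMA_CLOUD_API_KEY` from the environment and raises a bare `KeyError` if it is missing. The OpenAI LLM used later in this guide reads `OPENAI_API_KEY` by default. A small pre-flight check, sketched below with a hypothetical `missing_env_vars` helper, can give a clearer message up front.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


REQUIRED = ["LLAMA_CLOUD_API_KEY", "OPENAI_API_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Set these environment variables before running the notebook:", ", ".join(missing))
else:
    print("All required API keys are set.")
```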

## Setup Auto-Retrieval

Now we set up an **auto-retrieval** function over our LlamaCloud retrievers. At a high level, the function uses a function-calling LLM to infer the metadata filters for a user query, which gives more precise and relevant retrieval results than a raw semantic query alone.

This section shows you how to build it from scratch, including few-shot example selection to increase reliability:
1. Define a custom prompt to generate metadata filters.
2. Given a user query, first run chunk-level retrieval and collect the metadata of the retrieved chunks.
3. Inject that metadata as few-shot examples in the auto-retrieval prompt. The goal is to show the LLM what existing, relevant metadata values look like, so that it can infer correct metadata filters.

Much of the code below is adapted from our **VectorIndexAutoRetriever** module, which provides an out-of-the-box way to run auto-retrieval against a vector index.
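
Before the full implementation, steps 2 and 3 can be sketched without any LlamaCloud dependency. The idea is that one prompt variable (`example_rows`) is not passed in by the caller but computed at formatting time from the other arguments. In the sketch below, `FAKE_CHUNKS`, `fake_chunk_metadata`, and `render` are all hypothetical stand-ins, and the de-duplication step is an optional extra that the notebook code does not perform. This illustrates the idea only and is not LlamaIndex's implementation.

```python
import json

TEMPLATE = (
    "Example metadata from relevant chunks:\n"
    "{example_rows}\n\n"
    "Query: {query_str}"
)

# Hypothetical retrieval results: (chunk text, chunk metadata).
FAKE_CHUNKS = [
    ("metra objective", {"file_name": "metra.pdf"}),
    ("metra experiments", {"file_name": "metra.pdf"}),
    ("swe-bench construction", {"file_name": "swebench.pdf"}),
]


def fake_chunk_metadata(query_str):
    """Stand-in for chunk-level retrieval: keep chunks sharing a word with the query."""
    words = set(query_str.lower().split())
    return [meta for text, meta in FAKE_CHUNKS if words & set(text.split())]


def example_rows_fn(**kwargs):
    """Step 2: fetch metadata for the query and serialize one JSON object per line."""
    rows = fake_chunk_metadata(kwargs["query_str"])
    # Optional extra: drop duplicate rows so the prompt stays short.
    unique = list(dict.fromkeys(json.dumps(r, sort_keys=True) for r in rows))
    return "\n".join(unique)


def render(template, function_mappings, **kwargs):
    """Step 3: compute mapped variables from the call's kwargs, then fill the template."""
    computed = {name: fn(**kwargs) for name, fn in function_mappings.items()}
    return template.format(**{**kwargs, **computed})


prompt = render(
    TEMPLATE,
    {"example_rows": example_rows_fn},
    query_str="ELI5 the objective function in Metra",
)
print(prompt)
```

The caller only supplies `query_str`; the metadata examples are looked up per query, so the LLM always sees values that actually exist in the index.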

In [4]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")

In [26]:
from llama_index.core.prompts import ChatPromptTemplate
from llama_index.core.vector_stores.types import VectorStoreInfo, VectorStoreQuerySpec, MetadataInfo, MetadataFilters
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import Response

import json

SYS_PROMPT = """\
Your goal is to structure the user's query to match the request schema provided below.
You MUST call the tool in order to generate the query spec.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the \
following schema:

{schema_str}

The query string should contain only text that is expected to match the contents of \
documents. Any conditions in the filter should not be mentioned in the query as well.

Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters take into account the descriptions of attributes.
Make sure that filters are only used as needed. If there are no filters that should be \
applied return [] for the filter value.

If the user's query explicitly mentions number of documents to retrieve, set top_k to \
that number, otherwise do not set top_k.

The schema of the metadata filters in the vector db table is listed below, along with some example metadata dictionaries from relevant rows.
The user will send the input query string.

Data Source:
```json
{info_str}
```

Example metadata from relevant chunks:
{example_rows}

"""

example_rows_retriever = index.as_retriever(
    retrieval_mode="chunks",
    rerank_top_n=4
)

def get_example_rows_fn(**kwargs):
    """Retrieve relevant few-shot examples."""
    query_str = kwargs["query_str"]
    nodes = example_rows_retriever.retrieve(query_str)
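    # Collect each retrieved chunk's metadata and serialize one JSON object per line.
    metadata_list = [n.metadata for n in nodes]

    return "\n".join([json.dumps(m) for m in metadata_list])


# Map the `example_rows` template variable to the function above; it is called
# with the prompt's format kwargs (including `query_str`) at formatting time.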
chat_prompt_tmpl = ChatPromptTemplate.from_messages(
    [
        ("system", SYS_PROMPT),
        ("user", "{query_str}"),
    ],
    function_mappings={
        "example_rows": get_example_rows_fn
    }
)


# NOTE: VectorStoreInfo describes the corpus content and the metadata fields the LLM may filter on.
vector_store_info = VectorStoreInfo(
    content_info="contains content from various research papers",
    metadata_info=[
        MetadataInfo(
            name="file_name",
            type="str",
            description="Name of the source paper",
        ),
    ],
)

def auto_retriever_rag(query: str, retrieve_doc: bool = False, files_top_k: int = 1, rerank_top_n: int = 5) -> Response:
    """Infer metadata filters for the query, then answer it using chunk-level or document-level retrieval."""
    print(f"> User query string: {query}")
    # Use structured predict to infer the metadata filters and query string.
    query_spec = llm.structured_predict(
        VectorStoreQuerySpec,
        chat_prompt_tmpl,
        info_str=vector_store_info.model_dump_json(indent=4),
        schema_str=json.dumps(VectorStoreQuerySpec.model_json_schema()),
        query_str=query
    )
    # build retriever and query engine
    filters = MetadataFilters(filters=query_spec.filters) if len(query_spec.filters) > 0 else None
    print(f"> Inferred query string: {query_spec.query}")
    if filters:
        print(f"> Inferred filters: {filters.model_dump_json()}")

    # define retriever based on whether chunk or document-level is specified
    if retrieve_doc:
        retriever = index.as_retriever(
            retrieval_mode="files_via_content",
            # retrieval_mode="files_via_metadata",
            files_top_k=files_top_k,
            filters=filters
        )
    else:
        retriever = index.as_retriever(
            retrieval_mode="chunks",
            rerank_top_n=rerank_top_n,
            filters=filters
        )
    
    query_engine = RetrieverQueryEngine.from_args(
        retriever, 
        llm=llm,
        response_mode="tree_summarize"
    )
    # run query
    return query_engine.query(query_spec.query)
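
The system prompt asks the LLM to only filter on attributes that exist in the data source, but the function above trusts whatever comes back. If you want a guard, one option is to check the inferred filters against the declared metadata fields before building `MetadataFilters`. The sketch below uses plain dicts in place of the filter objects; `split_valid_filters` and the operator allow-list are assumptions made for illustration.

```python
ALLOWED_KEYS = {"file_name"}            # the names declared in metadata_info above
ALLOWED_OPERATORS = {"==", "!=", "in"}  # hypothetical allow-list for this sketch


def split_valid_filters(raw_filters, allowed_keys=ALLOWED_KEYS, allowed_ops=ALLOWED_OPERATORS):
    """Separate filters on declared attributes from ones the LLM made up."""
    valid, rejected = [], []
    for f in raw_filters:
        ok = f.get("key") in allowed_keys and f.get("operator", "==") in allowed_ops
        (valid if ok else rejected).append(f)
    return valid, rejected


# Example: one legitimate filter and one on an attribute that was never declared.
inferred = [
    {"key": "file_name", "value": "metra.pdf", "operator": "=="},
    {"key": "author", "value": "unknown", "operator": "=="},
]

valid, rejected = split_valid_filters(inferred)
print("keep:", valid)
print("drop:", rejected)
```

Dropping an unknown filter falls back to a broader search rather than failing the query; raising an error instead is an equally reasonable choice if you prefer strictness.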


### Try out Auto-Retrieval

Let's try running our auto-retriever on some sample queries. We try out both chunk-level and document-level retrieval.

In [27]:
from functools import partial

auto_doc_rag = partial(auto_retriever_rag, retrieve_doc=True)
auto_chunk_rag = partial(auto_retriever_rag, retrieve_doc=False)

In [10]:
response = auto_chunk_rag("ELI5 the objective function in Metra")
print(str(response))

> User query string: ELI5 the objective function in Metra
> Inferred query string: objective function in Metra
> Inferred filters: {"filters":[{"key":"file_name","value":"metra.pdf","operator":"=="}],"condition":"and"}
The objective function in METRA involves maximizing the expected inner product of the difference in state representations and a skill vector, subject to a Lipschitz constraint under the temporal distance metric. This is expressed as maximizing the expected value of \((\phi(s') - \phi(s))^\top z\), where \(\phi\) is a representation function, \(s\) and \(s'\) are states, and \(z\) is a skill vector. The constraint ensures that the difference in representations is bounded by the temporal distance between states, promoting the discovery of diverse behaviors that cover the latent space effectively.


In [19]:
response = auto_chunk_rag("How was SWE-Bench constructed? Tell me all the stages that went into it.")
print(str(response))

> User query string: How was SWE-Bench constructed? Tell me all the stages that went into it.
> Inferred query string: SWE-Bench construction stages
> Inferred filters: {"filters":[{"key":"file_name","value":"swebench.pdf","operator":"=="}],"condition":"and"}
The construction of SWE-bench involves a three-stage pipeline:

1. **Repo Selection and Data Scraping**: This stage involves collecting pull requests (PRs) from 12 popular open-source Python repositories on GitHub, resulting in approximately 90,000 PRs. The focus is on popular repositories due to their better maintenance, clear contributor guidelines, and comprehensive test coverage.

2. **Attribute-based Filtering**: In this stage, candidate tasks are created by selecting merged PRs that resolve a GitHub issue and make changes to the test files of the repository. This indicates that the user likely contributed tests to verify the resolution of the issue.

3. **Execution-based Filtering**: For each candidate task, the PR’s test co

In [20]:
response = auto_doc_rag("Give me a summary of the SWE-bench paper") 
print(str(response))

> User query string: Give me a summary of the SWE-bench paper
> Inferred query string: summary of the SWE-bench paper
> Inferred filters: {"filters":[{"key":"file_name","value":"swebench.pdf","operator":"=="}],"condition":"and"}
The SWE-bench paper introduces a new benchmark designed to evaluate the capabilities of language models (LMs) in real-world software engineering tasks. This benchmark, called SWE-bench, consists of 2,294 software engineering problems derived from GitHub issues and corresponding pull requests across 12 popular Python repositories. The task for the language models is to generate a pull request that resolves a given issue and passes the associated tests. SWE-bench challenges models to handle complex reasoning, long contexts, and cross-file code editing, which are typical in real-world software development but not commonly addressed in existing benchmarks.

The paper highlights that current state-of-the-art models, including proprietary ones like Claude 2, struggle

In [28]:
response = auto_doc_rag("Give me a summary of the Self-RAG paper") 
print(str(response))

> User query string: Give me a summary of the Self-RAG paper
> Inferred query string: summary of the Self-RAG paper
> Inferred filters: {"filters":[{"key":"file_name","value":"selfrag.pdf","operator":"=="}],"condition":"and"}
The Self-RAG paper introduces a framework called Self-Reflective Retrieval-Augmented Generation (SELF-RAG) designed to enhance the quality and factuality of large language models (LLMs). SELF-RAG improves upon traditional Retrieval-Augmented Generation (RAG) by incorporating a self-reflection mechanism that allows the model to retrieve relevant information on-demand and critique its own outputs. This is achieved through the use of reflection tokens, which guide the model in deciding when to retrieve information and how to evaluate the relevance and support of the retrieved passages. The framework enables the model to adapt its behavior to different tasks, improving factual accuracy and citation precision. Experiments demonstrate that SELF-RAG outperforms state-of-

## Next Steps

Now that you've learned the basics of auto-retrieval, you can build a standalone RAG pipeline on top of it or plug it into a broader agentic system. For instance, you can expose both the chunk-level and document-level auto-retrieval pipelines as tools for an agent to call.
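
As a rough illustration of that last idea, the sketch below wraps two pipelines as named tools with descriptions an agent could choose between. It is framework-agnostic and entirely hypothetical: the `Tool` dataclass, the tool names, and the two stub functions are placeholders for `auto_chunk_rag` and `auto_doc_rag`, so that the example runs without any API keys.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[[str], str]


def chunk_rag_stub(query: str) -> str:
    """Placeholder for the chunk-level auto-retrieval pipeline."""
    return f"[chunk-level answer for: {query}]"


def doc_rag_stub(query: str) -> str:
    """Placeholder for the document-level auto-retrieval pipeline."""
    return f"[document-level answer for: {query}]"


TOOLS = {
    tool.name: tool
    for tool in [
        Tool("chunk_retrieval", "Answers specific questions from individual pages.", chunk_rag_stub),
        Tool("doc_retrieval", "Summarizes or reasons over an entire paper.", doc_rag_stub),
    ]
}


def describe_tools(tools) -> str:
    """Render the tool list an agent's LLM would see when picking a tool."""
    return "\n".join(f"- {t.name}: {t.description}" for t in tools.values())


def call_tool(name: str, query: str) -> str:
    """Dispatch a query to the tool the agent selected."""
    return TOOLS[name].fn(query)


print(describe_tools(TOOLS))
print(call_tool("doc_retrieval", "Give me a summary of the SWE-bench paper"))
```

In a real agent, the tool descriptions are what the LLM uses to decide between page-level lookups and whole-document summaries, so they are worth writing carefully.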