# File-level and Chunk-Level Retrieval with LlamaCloud and Workflows

<a href="https://colab.research.google.com/github/run-llama/llamacloud-demo/blob/main/examples/advanced_rag/file_retrieve_workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we will show you how to perform file-level and chunk-level retrieval with LlamaCloud using a custom router query engine and a custom agent built with [Workflows](https://docs.llamaindex.ai/en/latest/module_guides/workflow/).

![](file_retrieve_workflow_img.png)

File-level retrieval is useful for handling user questions that require the entire document context to properly answer the question. Since only doing file-level retrieval can be slow + expensive, we also show you how to build an agent that can dynamically decide whether to do file-level or chunk-level retrieval! 

## Setup

Install LlamaIndex, apply nest_asyncio, and set up your OpenAI API key.

In [None]:
%pip install llama-index llama-index-indices-managed-llama-cloud

In [1]:
import nest_asyncio
nest_asyncio.apply()

In [2]:
import os
os.environ["OPENAI_API_KEY"] = "<Your OpenAI API Key>"

## [Optional] Setup Observability

We setup an integration with LlamaTrace (integration with Arize).

If you haven't already done so, make sure to create an account here: https://llamatrace.com/login. Then create an API key and put it in the `PHOENIX_API_KEY` variable below.

In [None]:
!pip install -U llama-index-callbacks-arize-phoenix

In [None]:
# setup Arize Phoenix for logging/observability
import llama_index.core
import os

PHOENIX_API_KEY = "<phoenix_api_key>"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
llama_index.core.set_global_handler(
    "arize_phoenix", endpoint="https://llamatrace.com/v1/traces"
)

## Load Documents into LlamaCloud

The first order of business is to download the 5 Apple and Tesla 10Ks and upload them into LlamaCloud.

You can easily do this by creating a pipeline and uploading docs via the "Files" mode.

After this is done, proceed to the next section.

In [None]:
!mkdir -p data
# download Apple 
!wget "https://s2.q4cdn.com/470004039/files/doc_earnings/2023/q4/filing/_10-K-Q4-2023-As-Filed.pdf" -O data/apple_2023.pdf
!wget "https://s2.q4cdn.com/470004039/files/doc_financials/2022/q4/_10-K-2022-(As-Filed).pdf" -O data/apple_2022.pdf
!wget "https://s2.q4cdn.com/470004039/files/doc_financials/2021/q4/_10-K-2021-(As-Filed).pdf" -O data/apple_2021.pdf
!wget "https://s2.q4cdn.com/470004039/files/doc_financials/2020/ar/_10-K-2020-(As-Filed).pdf" -O data/apple_2020.pdf
!wget "https://www.dropbox.com/scl/fi/i6vk884ggtq382mu3whfz/apple_2019_10k.pdf?rlkey=eudxh3muxh7kop43ov4bgaj5i&dl=1" -O data/apple_2019.pdf

# download Tesla
!wget "https://ir.tesla.com/_flysystem/s3/sec/000162828024002390/tsla-20231231-gen.pdf" -O data/tesla_2023.pdf
!wget "https://ir.tesla.com/_flysystem/s3/sec/000095017023001409/tsla-20221231-gen.pdf" -O data/tesla_2022.pdf
!wget "https://www.dropbox.com/scl/fi/ptk83fmye7lqr7pz9r6dm/tesla_2021_10k.pdf?rlkey=24kxixeajbw9nru1sd6tg3bye&dl=1" -O data/tesla_2021.pdf
!wget "https://ir.tesla.com/_flysystem/s3/sec/000156459021004599/tsla-10k_20201231-gen.pdf" -O data/tesla_2020.pdf
!wget "https://ir.tesla.com/_flysystem/s3/sec/000156459020004475/tsla-10k_20191231-gen_0.pdf" -O data/tesla_2019.pdf

## Helper Classes

We define the `Answer` model, which is a model that stores whether to pick chunk-level retrieval or document-level retrieval, along with a reason for that choice. We will let the LLM choose given a query string, and we will ask the LLM to produce a JSON output that can be parsed by a Pydantic model.

We will define the `RouterOutputParser` helper class, which parses the output from the LLM into a list of `Answer` models, which is then put into the `Answers` model that contains a list of `Answer`s.

In [2]:
import json

from llama_index.core.bridge.pydantic import BaseModel
from typing import List
from llama_index.core.types import BaseOutputParser
from llama_index.core import PromptTemplate

# tells LLM to select choices given a list
ROUTER_PROMPT = PromptTemplate(
    "Some choices are given below. It is provided in a numbered list (1 to"
    " {num_choices}), where each item in the list corresponds to a"
    " summary.\n---------------------\n{context_list}\n---------------------\nUsing"
    " only the choices above and not prior knowledge, return the top choices"
    " (no more than {max_outputs}, but only select what is needed) that are"
    " most relevant to the question: '{query_str}'\n"
)

# tells LLM to format list of choices in a certain way
FORMAT_STR = """The output should be formatted as a JSON instance that conforms to 
the JSON schema below. 

Here is the output schema:
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "choice": {
        "type": "integer"
      },
      "reason": {
        "type": "string"
      }
    },
    "required": [
      "choice",
      "reason"
    ],
    "additionalProperties": false
  }
}
"""

class Answer(BaseModel):
    """Answer model."""

    choice: int
    reason: str


class Answers(BaseModel):
    """List of answers model."""

    answers: List[Answer]

class RouterOutputParser(BaseOutputParser):
    """Custom output parser."""

    def _escape_curly_braces(self, input_string: str):
        """Escape the brackets in the format string so contents are not treated as variables."""

        return input_string.replace("{", "{{").replace("}", "}}")

    def _marshal_output_to_json(self, output: str):
        """Find JSON string within response."""

        output = output.strip()
        left = output.find("[")
        right = output.find("]")
        output = output[left : right + 1]
        return output

    def parse(self, output: str) -> Answers:
        """Parse string"""

        json_output = self._marshal_output_to_json(output)
        json_dicts = json.loads(json_output)
        answers = [Answer.parse_obj(json_dict) for json_dict in json_dicts]
        return Answers(answers=answers)
    
    def format(self, query: str) -> str:
        return query + "\n\n" + self._escape_curly_braces(FORMAT_STR)

## Router Query Workflow

In the code snippet below, we define a router query workflow. This workflow requires 2 events: a `ChooseQueryEngineEvent`, which chooses the document-level or chunk-retrieval query engine, and `SynthesizeAnswersEvent`, which contains the results from the query engines and synthesizes a final response.

The workflow consists of the following steps:
1. Choosing the query engine(s) by passing the prompt and output parser defined above into an LLM. Both query engines can be chosen if the LLM thinks both query engines (defined in `choose_query_engine()`).
2. Queries the engines chosen by the LLM in the previous step (defined in `query_each_engine`).
3. Synthesizes a final response given the results from the queries above (defined in `synthesize_response()`).

In [6]:
from typing import List, Optional, Any

from llama_index.core.query_engine import (
    BaseQueryEngine,
    RetrieverQueryEngine,
)
from llama_index.core import PromptTemplate
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import LLM
from llama_index.core.response_synthesizers import TreeSummarize
from llama_index.core.workflow import (
    Workflow,
    Event,
    StartEvent,
    StopEvent,
    step,
)

class ChooseQueryEngineEvent(Event):
    """Query engine event."""

    answers: Answers
    query_str: str

class SynthesizeAnswersEvent(Event):
    """Synthesize answers event."""

    responses: List[Any]
    query_str: str


class RouterQueryWorkflow(Workflow):
    """Router query workflow."""

    def __init__(
        self,
        query_engines: List[BaseQueryEngine],
        choice_descriptions: List[str],
        router_prompt: PromptTemplate,
        timeout: Optional[float] = 10.0,
        disable_validation: bool = False,
        verbose: bool = False,
        llm: Optional[LLM] = None,
        summarizer: Optional[TreeSummarize] = None,
    ):
        """Constructor"""

        super().__init__(timeout=timeout, disable_validation=disable_validation, verbose=verbose)

        self.query_engines: List[BaseQueryEngine] = query_engines
        self.choice_descriptions: List[str] = choice_descriptions
        self.router_prompt: PromptTemplate = router_prompt
        self.llm: LLM = llm or OpenAI(temperature=0, model="gpt-4o")
        self.summarizer: TreeSummarize = summarizer or TreeSummarize()

    def _get_choice_str(self, choices):
        """String of choices to feed into LLM."""

        choices_str = "\n\n".join([f"{idx+1}. {c}" for idx, c in enumerate(choices)])
        return choices_str
    
    async def _query(self, query_str: str, choice_idx: int):
        """Query using query engine"""

        query_engine = self.query_engines[choice_idx]
        return await query_engine.aquery(query_str)

    
    @step()
    async def choose_query_engine(self, ev: StartEvent) -> ChooseQueryEngineEvent:
        """Choose query engine."""

        # get query str
        query_str = ev.get("query_str")
        if query_str is None:
            raise ValueError("'query_str' is required.")
        
        # partially format prompt with number of choices and max outputs
        router_prompt1 = self.router_prompt.partial_format(
            num_choices=len(self.choice_descriptions),
            max_outputs=len(self.choice_descriptions),
        )
        
        
        # get choices selected by LLM
        choices_str = self._get_choice_str(self.choice_descriptions)
        output = llm.structured_predict(
            Answers,
            router_prompt1,
            context_list=choices_str, 
            query_str=query_str
        )

        if self._verbose:
            print(f"Selected choice(s):")
            for answer in output.answers:
                print(f"Choice: {answer.choice}, Reason: {answer.reason}")
        
        return ChooseQueryEngineEvent(answers=output, query_str=query_str)
            
    @step()
    async def query_each_engine(self, ev: ChooseQueryEngineEvent) -> SynthesizeAnswersEvent:
        """Query each engine."""

        query_str = ev.query_str
        answers = ev.answers

        # query using corresponding query engine given in Answers list
        responses = []

        for answer in answers.answers:
            choice_idx = answer.choice - 1
            response = await self._query(query_str, choice_idx)
            responses.append(response)
        
        return SynthesizeAnswersEvent(responses=responses, query_str=query_str)
    
    @step()
    async def synthesize_response(self, ev: SynthesizeAnswersEvent) -> StopEvent:
        """Synthesizes response."""

        responses = ev.responses
        query_str = ev.query_str

        # get result of responses
        if len(responses) == 1:
            return StopEvent(result=responses[0])
        else:
            response_strs = [str(r) for r in responses]
            result_response = self.summarizer.get_response(query_str, response_strs)
            return StopEvent(result=result_response)


## Define LlamaCloud Retriever over documents

We'll define an instance of `LLamaCloudIndex`, which will allow us to access the indexed docs stored on LlamaCloud. We define two separate retrievers for this index: a file-level retriever and a chunk-level retriever. We create two query engines from these retrievers.

After this, we give a description for what each retriever does to allow the LLM to know which one to pick. We'll define our router workflow based on the two query engines and descriptions.

In [7]:
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

index = LlamaCloudIndex(
    name="<index_name>", 
    project_name="<project_name>",
    organization_id="<organization_id>",
    # api_key="<Your API Key>"
)

llm = OpenAI("gpt-4o")

doc_retriever = index.as_retriever(retrieval_mode="files_via_content", files_top_k=1)
query_engine_doc = RetrieverQueryEngine.from_args(
    doc_retriever, llm=llm, response_mode="tree_summarize"
)

chunk_retriever = index.as_retriever(retrieval_mode="chunks", rerank_top_n=10)
query_engine_chunk = RetrieverQueryEngine.from_args(
    chunk_retriever, llm=llm, response_mode="tree_summarize"
)

DOC_METADATA_EXTRA_STR = """\
Each document represents a complete 10K report for a given year (e.g. Apple in 2019).
Here's an example of relevant documents:
1. apple_2019.pdf
2. tesla_2020.pdf
"""

TOOL_DOC_DESC = f"""\
Synthesizes an answer to your question by feeding in an entire relevant document as context. Best used for higher-level summarization options.
Do NOT use if answer can be found in a specific chunk of a given document. Use the chunk_query_engine instead for that purpose.

Below we give details on the format of each document:
{DOC_METADATA_EXTRA_STR}
"""

TOOL_CHUNK_DESC = f"""\
Synthesizes an answer to your question by feeding in a relevant chunk as context. Best used for questions that are more pointed in nature.
Do NOT use if the question asks seems to require a general summary of any given document. Use the doc_query_engine instead for that purpose.

Below we give details on the format of each document:
{DOC_METADATA_EXTRA_STR}
"""

router_query_workflow = RouterQueryWorkflow(
    query_engines=[query_engine_doc, query_engine_chunk],
    choice_descriptions=[TOOL_DOC_DESC, TOOL_CHUNK_DESC],
    verbose=True,
    llm=llm,
    router_prompt=ROUTER_PROMPT,
    timeout=60
)

After defining our router query workflow, we'll create a query engine wrapper around this workflow, and we'll define a query engine tool around this wrapper to pass into an agent.

## Creating an Agent Around the Query Engine

We'll create a workflow that acts as an agent around the router query engine. In this workflow, we need four events:
1. `GatherToolsEvent`: Gets all tools that need to be called (which is determined by the LLM).
2. `ToolCallEvent`: An individual tool call. Multiple of these events will be triggered at the same time.
3. `ToolCallEventResult`: Gets result from a tool call.
4. `GatherEvent`: Returned from dispatcher that triggers the `ToolCallEvent`.

This workflow consists of the following steps:
1. `chat()`: Appends the message to the chat history. This chat history is fed into the LLM, along with the given tools, and the LLM determines which tools to call. This returns a `GatherToolsEvent`.
2. `dispatch_calls()`: Triggers a `ToolCallEvent` for each tool call given in the `GatherToolsEvent` using `send_event()`. Returns a `GatherEvent` with the number of tool calls.
3. `call_tool()`: Calls an individual tool. This step will run multiple times if there is more than one tool call. This step calls the tool and appends the result as a chat message to the chat history. It returns a `ToolCallEventResult` with the result of the tool call.
4. `gather()`: Gathers the results from all tool calls using `collect_events()`. Waits for all tool calls to finish, then feeds chat history (following all tool calls) into the LLM. Returns the response from the LLM.

In [8]:
from typing import Dict, List

from llama_index.core.tools import BaseTool
from llama_index.core.llms import ChatMessage
from llama_index.core.llms.llm import ToolSelection
from llama_index.core.workflow import Context, Workflow
from llama_index.core.base.response.schema import Response
from llama_index.core.tools import FunctionTool


class InputEvent(Event):
    """Input event."""

class GatherToolsEvent(Event):
    """Gather Tools Event"""

    tool_calls: Any

class ToolCallEvent(Event):
    """Tool Call event"""

    tool_call: ToolSelection

class ToolCallEventResult(Event):
    """Tool call event result."""

    msg: ChatMessage


class RouterOutputAgentWorkflow(Workflow):
    """Custom router output agent workflow."""

    def __init__(self,
        rag_workflow: Workflow,
        timeout: Optional[float] = 10.0,
        disable_validation: bool = False,
        verbose: bool = False,
        llm: Optional[LLM] = None,
        chat_history: Optional[List[ChatMessage]] = None,
    ):
        """Constructor."""

        super().__init__(timeout=timeout, disable_validation=disable_validation, verbose=verbose)

        self.rag_workflow = rag_workflow

        def query_workflow(query_str: str) -> Response:
            """Queries 10k reports for a given year."""
            return self.rag_workflow.run(query_str=query_str)
        
        self.rag_workflow_tool = FunctionTool.from_defaults(query_workflow)
        
        self.llm: LLM = llm or OpenAI(temperature=0, model="gpt-4o")
        self.chat_history: List[ChatMessage] = chat_history or []
    

    def reset(self) -> None:
        """Resets Chat History"""

        self.chat_history = []

    @step()
    async def prepare_chat(self, ev: StartEvent) -> InputEvent:
        message = ev.get("message")
        if message is None:
            raise ValueError("'message' field is required.")
        
        # add msg to chat history
        chat_history = self.chat_history
        chat_history.append(ChatMessage(role="user", content=message))
        return InputEvent()

    @step()
    async def chat(self, ev: InputEvent) -> GatherToolsEvent | StopEvent:
        """Appends msg to chat history, then gets tool calls."""

        # Put msg into LLM with tools included
        chat_res = await self.llm.achat_with_tools(
            [self.rag_workflow_tool],
            chat_history=self.chat_history,
            verbose=self._verbose,
            allow_parallel_tool_calls=True
        )
        tool_calls = self.llm.get_tool_calls_from_response(chat_res, error_on_no_tool_call=False)
        
        ai_message = chat_res.message
        self.chat_history.append(ai_message)
        if self._verbose:
            print(f"Chat message: {ai_message.content}")

        # no tool calls, return chat message.
        if not tool_calls:
            return StopEvent(result=ai_message.content)

        return GatherToolsEvent(tool_calls=tool_calls)

    @step(pass_context=True)
    async def dispatch_calls(self, ctx: Context, ev: GatherToolsEvent) -> ToolCallEvent:
        """Dispatches calls."""

        tool_calls = ev.tool_calls
        await ctx.set("num_tool_calls", len(tool_calls))

        # trigger tool call events
        for tool_call in tool_calls:
            ctx.send_event(ToolCallEvent(tool_call=tool_call))
        
        return None
    
    @step()
    async def call_tool(self, ev: ToolCallEvent) -> ToolCallEventResult:
        """Calls tool."""

        tool_call = ev.tool_call

        # get tool ID and function call
        id_ = tool_call.tool_id

        if self._verbose:
            print(f"Calling function {tool_call.tool_name} with msg {tool_call.tool_kwargs}")

        # directly run workflow, don't call tools
        output = await self.rag_workflow.run(**tool_call.tool_kwargs)
        msg = ChatMessage(
            name=tool_call.tool_name,
            content=str(output),
            role="tool",
            additional_kwargs={
                "tool_call_id": id_,
                "name": tool_call.tool_name
            }
        )

        return ToolCallEventResult(msg=msg)
    
    @step(pass_context=True)
    async def gather(self, ctx: Context, ev: ToolCallEventResult) -> StopEvent | None:
        """Gathers tool calls."""
        # wait for all tool call events to finish.
        tool_events = ctx.collect_events(ev, [ToolCallEventResult] * await ctx.get("num_tool_calls"))
        if not tool_events:
            return None
        
        for tool_event in tool_events:
            # append tool call chat messages to history
            self.chat_history.append(tool_event.msg)
        
        # # after all tool calls finish, pass input event back, restart agent loop
        return InputEvent()
    

Creates an instance of the agent.

In [9]:
agent = RouterOutputAgentWorkflow(router_query_workflow, verbose=True, timeout=60)

#### Visualize Workflow

In [10]:
from llama_index.utils.workflow import draw_all_possible_flows

draw_all_possible_flows(RouterOutputAgentWorkflow)

<class 'NoneType'>
<class '__main__.ToolCallEventResult'>
<class '__main__.GatherToolsEvent'>
<class 'llama_index.core.workflow.events.StopEvent'>
<class '__main__.ToolCallEvent'>
<class 'llama_index.core.workflow.events.StopEvent'>
<class '__main__.InputEvent'>
workflow_all_flows.html


## Example Queries

In [11]:
from IPython.display import display, Markdown

response = await agent.run(message="Tell me the revenue for Apple and Tesla in 2021.")
display(Markdown(response))

Running step prepare_chat
Step prepare_chat produced event InputEvent
Running step chat
Chat message: None
Step chat produced event GatherToolsEvent
Running step dispatch_calls
Step dispatch_calls produced no event
Running step call_tool
Calling function query_workflow with msg {'query_str': 'Apple revenue 2021'}
Running step call_tool
Calling function query_workflow with msg {'query_str': 'Tesla revenue 2021'}
Running step choose_query_engine
Selected choice(s):
Choice: 2, Reason: The question 'Apple revenue 2021' is pointed in nature, asking for specific information about Apple's revenue in 2021. Therefore, using a relevant chunk as context is more appropriate than synthesizing an answer from an entire document.
Step choose_query_engine produced event ChooseQueryEngineEvent
Running step query_each_engine
Running step choose_query_engine
Selected choice(s):
Choice: 2, Reason: The question 'Tesla revenue 2021' is specific and pointed in nature, asking for a particular piece of informat

In 2021, Apple's revenue was $365.8 billion, while Tesla's revenue was $53.82 billion.

In [12]:
response = await agent.run(message="Tell me the tailwinds for Apple and Tesla in 2021.")
display(Markdown(response))

Running step prepare_chat
Step prepare_chat produced event InputEvent
Running step chat
Chat message: None
Step chat produced event GatherToolsEvent
Running step dispatch_calls
Step dispatch_calls produced no event
Running step call_tool
Calling function query_workflow with msg {'query_str': 'Apple tailwinds 2021'}
Running step call_tool
Calling function query_workflow with msg {'query_str': 'Tesla tailwinds 2021'}
Running step choose_query_engine
Selected choice(s):
Choice: 1, Reason: The question 'Apple tailwinds 2021' seems to require a general summary of the entire document for the year 2021. Since the question is about tailwinds, which could be a broad topic covering various aspects of Apple's business, using the entire document as context would be more appropriate.
Step choose_query_engine produced event ChooseQueryEngineEvent
Running step query_each_engine
Running step choose_query_engine
Selected choice(s):
Choice: 1, Reason: The question 'Tesla tailwinds 2021' seems to require

In 2021, Apple experienced several tailwinds that contributed to its growth:

- Significant increase in net sales across all product and service categories, driven by strong demand for new iPhone models, Mac computers, iPads, and wearables.
- Substantial growth in the services segment, particularly in advertising, the App Store, and cloud services.
- Favorable foreign currency movements and improved financial leverage.
- Expansion of its share repurchase program and increased dividends, reflecting strong financial health and confidence in future growth.

For Tesla, the tailwinds in 2021 included:

- Significant increase in vehicle production and deliveries, supported by ramping up production at Gigafactory Shanghai and the Fremont Factory.
- Spike in demand in the automotive industry and ongoing electrification of the sector.
- Increasing environmental awareness driving demand for electric vehicles.
- Efforts to reduce costs and increase affordability through localized procurement and manufacturing, particularly in China.
- Advancements in vehicle performance and functionality, including Full Self-Driving (FSD) capabilities.
- Growth in the energy generation and storage segment, with increased deployments of Megapack, Powerwall, and Solar Roof products.

These factors contributed to Tesla's strong financial performance, with total revenues increasing by 71% compared to the previous year.

In [13]:
response = await agent.run(message="How was apple doing generally in 2019?")
display(Markdown(response))

Running step prepare_chat
Step prepare_chat produced event InputEvent
Running step chat
Chat message: None
Step chat produced event GatherToolsEvent
Running step dispatch_calls
Step dispatch_calls produced no event
Running step call_tool
Calling function query_workflow with msg {'query_str': 'Apple performance 2019'}
Running step choose_query_engine
Selected choice(s):
Choice: 1, Reason: The question 'Apple performance 2019' requires a general summary of the entire document, which aligns with the purpose of choice 1. It is best used for higher-level summarization options and synthesizes an answer by feeding in an entire relevant document as context.
Step choose_query_engine produced event ChooseQueryEngineEvent
Running step query_each_engine
Step query_each_engine produced event SynthesizeAnswersEvent
Running step synthesize_response
Step synthesize_response produced event StopEvent
Step call_tool produced event ToolCallEventResult
Running step gather
Step gather produced event InputEv

In 2019, Apple's overall performance was mixed. The company experienced a 2% decrease in total net sales compared to 2018, primarily due to lower iPhone sales. However, there was growth in other areas:

- Increased sales in Wearables, Home and Accessories, and Services across all geographic segments.
- Growth in Mac and iPad sales.
- Significant growth in Services revenue.

Apple also focused on shareholder returns, repurchasing $67.1 billion of its common stock and paying $14.1 billion in dividends. Geographically, the Americas segment saw an increase in net sales, while Europe, Greater China, and Japan experienced declines. Overall, 2019 was marked by a shift in product sales dynamics and continued investment in shareholder returns.

## [Advanced] Setup Auto-Retrieval for Files

We make our file-level retrieval more sophisticated by allowing the LLM to infer a set of metadata filters, based on some relevant example documents. This allows document-level retrieval to be more precise, since it allows the LLM to narrow down search results via metadata filters and not just top-k.

We do some advanced things to make this happen
- Define a custom prompt to generate metadata filters
- Dynamically include few-shot examples of metadata as context to infer the set of metadata filters. These initial few-shot examples of metadata are obtained through vector search.

We prompt the LLM to generate a prompt, a list of filters, and optionally a top-k value. We will define another workflow that is subclassed from the `RouterQueryWorkflow`. In this workflow, we will replace the `_query()` method defined in `RouterQueryWorkflow`.

In this `_query()` method, we will check if the choice is the document-level retrieval. If it is, then we'll create a new query engine with certain LLM-generated filters applied. We'll return the response from this query engine.

A lot of the code below is lifted from our **VectorIndexAutoRetriever** module, which provides an out of the box way to do auto-retrieval against a vector index.

Since we are adding some customizations like adding few-shot examples, we re-use prompt pieces and implement auto-retrieval from scratch. 

In [23]:
from llama_index.core.prompts import ChatPromptTemplate
from llama_index.core.vector_stores.types import (
    VectorStoreInfo,
    VectorStoreQuerySpec,
    MetadataInfo,
    MetadataFilters,
)

SYS_PROMPT = """\
Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the \
following schema:

{schema_str}

The query string should contain only text that is expected to match the contents of \
documents. Any conditions in the filter should not be mentioned in the query as well.

Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters take into account the descriptions of attributes.
Make sure that filters are only used as needed. If there are no filters that should be \
applied return [] for the filter value.\

If the user's query explicitly mentions number of documents to retrieve, set top_k to \
that number, otherwise do not set top_k.

The schema of the metadata filters in the vector db table is listed below, along with some example metadata dictionaries from relevant rows.
The user will send the input query string.

Data Source:
```json
{info_str}
```

Example metadata from relevant chunks:
{example_rows}
"""


class AutoRetrievalRouterQueryWorkflow(RouterQueryWorkflow):
    """Router query engine with auto retrieval."""

    async def _get_auto_retriever_query_engine(
        self, query: str, verbose: bool = False
    ) -> RetrieverQueryEngine:
        """Gets auto doc retriever query engine"""

        # retriever that retrieves example rows
        example_rows_retriever = index.as_retriever(
            retrieval_mode="chunks", rerank_top_n=4
        )

        def get_example_rows_fn(**kwargs):
            """Retrieve relevant few-shot examples of metadata."""

            query_str = kwargs["query_str"]
            nodes = example_rows_retriever.retrieve(query_str)
            # get the metadata, join them
            metadata_list = [n.metadata for n in nodes]

            return "\n".join([json.dumps(m) for m in metadata_list])

        # define chat prompt template to feed into LLM
        chat_prompt_tmpl = ChatPromptTemplate.from_messages(
            [
                ("system", SYS_PROMPT),
                ("user", "{query_str}"),
            ],
            function_mappings={"example_rows": get_example_rows_fn},
        )

        # information about vector store - used to generate json schema in prompt template
        vector_store_info = VectorStoreInfo(
            content_info="document chunks around Apple and Tesla 10K documents",
            metadata_info=[
                MetadataInfo(
                    name="file_name",
                    type="str",
                    description="Name of the source file",
                ),
            ],
        )

        query_spec: VectorStoreQuerySpec = await llm.astructured_predict(
            VectorStoreQuerySpec,
            chat_prompt_tmpl,
            info_str=vector_store_info.model_dump_json(),
            schema_str=json.dumps(VectorStoreQuerySpec.model_json_schema()),
            query_str=query,
        )

        # build retriever and query engine
        filters = (
            MetadataFilters(filters=query_spec.filters)
            if len(query_spec.filters) > 0
            else None
        )
        if verbose:
            print(f"> Using query str: {query_spec.query}")

        if filters and verbose:
            print(f"> Using filters{filters.json()}")

        retriever = index.as_retriever(
            retrieval_mode="files_via_content", files_top_k=1, filters=filters
        )

        query_engine = RetrieverQueryEngine.from_args(
            retriever, llm=self.llm, response_mode="tree_summarize"
        )

        return query_engine

    async def _query(self, query_str: str, choice_idx: int):
        """Query with auto retriever"""

        if choice_idx == 0:
            query_engine = await self._get_auto_retriever_query_engine(
                query_str, self._verbose
            )
        else:
            query_engine = self.query_engines[choice_idx]
        return await query_engine.aquery(query_str)


Create the auto retrieval query workflow, then wrap it around a RouterQueryEngine, then create a tool around that engine.

In [27]:
# auto retrieval query engine
auto_retrieval_query_workflow = AutoRetrievalRouterQueryWorkflow(
    query_engines=[query_engine_doc, query_engine_chunk],
    choice_descriptions=[TOOL_DOC_DESC, TOOL_CHUNK_DESC],
    verbose=True,
    llm=llm,
    router_prompt=ROUTER_PROMPT,
    timeout=120
)

Create an agent using auto retrieval.

In [28]:
# agent
agent_router_output = RouterOutputAgentWorkflow(auto_retrieval_query_workflow, verbose=True, timeout=120)

## Example Queries

In [29]:
response = await agent_router_output.run(message="How was Tesla doing generally in 2021 and 2022?")
display(Markdown(response))

Running step prepare_chat
Step prepare_chat produced event InputEvent
Running step chat
Chat message: None
Step chat produced event GatherToolsEvent
Running step dispatch_calls
Step dispatch_calls produced no event
Running step call_tool
Calling function query_workflow with msg {'query_str': 'Tesla 2021 performance'}
Running step call_tool
Calling function query_workflow with msg {'query_str': 'Tesla 2022 performance'}
Running step choose_query_engine
Selected choice(s):
Choice: 1, Reason: The question 'Tesla 2021 performance' seems to require a general summary of Tesla's performance for the year 2021. Since the question is asking for a higher-level summarization of Tesla's performance, it is best to use the option that synthesizes an answer by feeding in an entire relevant document as context. This aligns with the description of choice 1, which is suitable for higher-level summarization options.
Step choose_query_engine produced event ChooseQueryEngineEvent
Running step query_each_eng

In 2021, Tesla had a strong year with significant growth and achievements:

- **Production and Deliveries**: Tesla produced 930,422 vehicles and delivered 936,222 vehicles.
- **Financial Performance**: The company recognized total revenues of $53.82 billion, a 71% increase compared to the previous year, and reported a net income of $5.52 billion, a favorable change of $4.80 billion from the prior year.
- **Cash and Investments**: Tesla ended the year with $17.58 billion in cash and cash equivalents.
- **Energy Products**: They deployed 3.99 GWh of energy storage products and 345 megawatts of solar energy systems.
- **Focus Areas**: Tesla focused on increasing vehicle production and capacity, improving battery technologies, enhancing Full Self-Driving (FSD) capabilities, and expanding global infrastructure.

In 2022, Tesla continued its growth trajectory despite challenges:

- **Production and Deliveries**: Tesla produced 1,369,611 consumer vehicles and delivered 1,313,851 vehicles, overcoming supply chain and logistics challenges.
- **Financial Performance**: Total revenues reached $81.46 billion, marking an increase of $27.64 billion from the previous year, and the net income was $12.56 billion, a favorable change of $7.04 billion compared to the prior year.
- **Cash and Investments**: The company ended the year with $22.19 billion in cash and cash equivalents and investments, an increase of $4.48 billion from the end of 2021.
- **Energy Products**: They deployed 6.5 GWh of energy storage products and 348 megawatts of solar energy systems.
- **Focus Areas**: Tesla focused on increasing production and delivery capabilities, improving battery technologies, enhancing Full Self-Driving capabilities, increasing vehicle affordability and efficiency, bringing new products to market, and expanding global infrastructure.

Overall, Tesla showed strong performance and growth in both years, with significant increases in production, deliveries, and financial metrics.