# Contextual Retrieval for Multimodal RAG

<a href="https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/parse/multimodal/multimodal_contextual_retrieval_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this cookbook we show you how to build a multimodal RAG pipeline with **contextual retrieval**.

Contextual retrieval was introduced in this Anthropic [blog post](https://www.anthropic.com/news/contextual-retrieval). The high-level intuition is that every chunk is given a concise summary of where it fits within the overall document. This inserts high-level concepts and keywords that help the chunk be retrieved for different types of queries.
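To make the intuition concrete, here is a minimal sketch that uses crude keyword overlap as a stand-in for a real retriever. The chunk, context, and query strings are hypothetical and invented for illustration; the functions `tokenize` and `overlap_score` are not part of any library.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for a real retriever's features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, text: str) -> int:
    """Count the query terms that also appear in the text."""
    return len(tokenize(query) & tokenize(text))


# Hypothetical chunk and generated context (illustration only).
chunk = "Revenue grew 3% over the previous quarter."
context = (
    "From ACME Corp's Q2 2023 financial report, "
    "summarizing quarterly revenue growth."
)
contextualized_chunk = f"{context}\n\n{chunk}"

query = "ACME Q2 2023 revenue growth"
raw_score = overlap_score(query, chunk)
contextual_score = overlap_score(query, contextualized_chunk)
print(raw_score, contextual_score)
```

On its own, the chunk shares only one term with the query; with the context attached, it also carries the company name and period that the query asks about.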

Generating this context takes one LLM call per chunk, and every call includes the full document, so these calls are expensive. Contextual retrieval depends on **prompt caching** to be efficient.
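A rough back-of-the-envelope sketch of why caching matters is shown below. Everything numeric here is an assumption for illustration: the token counts, the price per million input tokens, and the cache multipliers (a surcharge on the first call that writes the cache, a discount on later calls that read it). Output tokens are ignored, and calls are assumed to run sequentially within the cache lifetime. Check current provider pricing before relying on any of this.

```python
def estimate_input_cost(
    doc_tokens: int,
    chunk_tokens: int,
    num_chunks: int,
    price_per_mtok: float,
    cache_write_mult: float = 1.25,  # assumed surcharge for writing the cache
    cache_read_mult: float = 0.10,  # assumed discount for reading the cache
) -> tuple[float, float]:
    """Return (uncached, cached) input cost in dollars for num_chunks calls."""
    per_token = price_per_mtok / 1_000_000

    # Without caching, every call pays full price for the document and the chunk.
    uncached = num_chunks * (doc_tokens + chunk_tokens) * per_token

    # With caching, the first call writes the document to the cache and the
    # remaining calls read it; chunk tokens are always billed at full price.
    cached_doc_tokens = (
        doc_tokens * cache_write_mult
        + (num_chunks - 1) * doc_tokens * cache_read_mult
    )
    cached = (cached_doc_tokens + num_chunks * chunk_tokens) * per_token
    return uncached, cached


# Hypothetical numbers: a 20k-token document split into 40 page-sized chunks.
uncached, cached = estimate_input_cost(
    doc_tokens=20_000, chunk_tokens=500, num_chunks=40, price_per_mtok=1.0
)
print(f"uncached=${uncached:.3f} cached=${cached:.3f}")
```

Under these made-up numbers the cached run costs a small fraction of the uncached one, because the document dominates each prompt.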

In this notebook, we use Claude 3.5 Haiku to generate contextual summaries. We cache the document as text tokens, and generate each contextual summary by feeding in the parsed text chunk along with its page screenshot.

We feed both the text and image chunks into the final multimodal RAG pipeline to generate the response.

Status:
| Last Executed | Version | State      |
|---------------|---------|------------|
| Aug-20-2025   | 0.6.61  | Maintained |

![mm_rag_diagram](./multimodal_contextual_retrieval_rag_img.png)

## Setup

In [None]:
%pip install llama-cloud-services "llama-index>=0.13.0,<0.14.0" llama-index-embeddings-voyageai llama-index-llms-anthropic

### (Optional) Setup Observability

We set up an integration with LlamaTrace (an integration with Arize Phoenix).

If you haven't already done so, make sure to create an account here: https://llamatrace.com/login. Then create an API key and put it in the `PHOENIX_API_KEY` variable below.

In [None]:
!pip install -U llama-index-callbacks-arize-phoenix

In [None]:
# setup Arize Phoenix for logging/observability
import llama_index.core
import os

PHOENIX_API_KEY = "<PHOENIX_API_KEY>"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
llama_index.core.set_global_handler(
    "arize_phoenix", endpoint="https://llamatrace.com/v1/traces"
)

### Load Data

Here we load the [ICONIQ 2024 State of AI Report](https://cdn.prod.website-files.com/65e1d7fb19a3e64b5c36fb38/66eb856e019e59758ef73759_ICONIQ%20Analytics%20%2B%20Insights%20-%20State%20of%20AI%20Sep24.pdf).

In [None]:
!mkdir data
!mkdir data_images_iconiq
!wget "https://cdn.prod.website-files.com/65e1d7fb19a3e64b5c36fb38/66eb856e019e59758ef73759_ICONIQ%20Analytics%20%2B%20Insights%20-%20State%20of%20AI%20Sep24.pdf" -O data/iconiq_report.pdf

### Model Setup

Set up the models that will be used for downstream orchestration.

In [None]:
import os

# replace with your Anthropic API key
os.environ["ANTHROPIC_API_KEY"] = "sk-..."
# replace with your VoyageAI key
os.environ["VOYAGE_API_KEY"] = "pa-..."
# replace with your LlamaCloud API key
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."

In [None]:
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.voyageai import VoyageEmbedding
from llama_index.core import Settings


llm = Anthropic(model="claude-4-sonnet-20250514")
embed_model = VoyageEmbedding(model_name="voyage-3.5")

Settings.llm = llm
Settings.embed_model = embed_model



## Use LlamaParse to Parse Text and Images

In this example, we use LlamaParse to parse both the text and images from the document.

We parse out the text with LlamaParse premium.

**NOTE**: The report has 40 pages, and at ~5c per page, this will cost you $2. Just a heads up!

In [None]:
from llama_cloud_services import LlamaParse


parser = LlamaParse(
    parse_mode="parse_page_with_agent",
    model="openai-gpt-4-1-mini",
    high_res_ocr=True,
    adaptive_long_table=True,
    outlined_table_extraction=True,
    output_tables_as_HTML=True,
)

In [None]:
results = await parser.aparse("data/iconiq_report.pdf")

Started parsing the file under job_id 1384d483-16c8-4b20-a3ff-6863eafecbc1


In [None]:
print(results.pages[10].md)


# A Decision-Making Framework

When making decisions around GenAI investments, we believe it will be important to assess organization readiness, put in place a framework and processes for use case evaluation, and proactively mitigate risks

----

### Accelerate Value  
Find synergies between organizational readiness, use cases, and risk mitigation when making GenAI investment decisions

----

### Use Case Identification & Evaluation  
When determining use cases for GenAI, we believe stakeholders will need to assess business value, the fluency vs. accuracy of solutions, and the level of risk associated. Given the risks involved with using GenAI to build new products, many organizations are first starting with use cases for internal productivity.

It is also important to implement feedback loops and a system for measuring ROI to evaluate use cases.

----

### Organizational Readiness  
For enterprises adopting GenAI solutions for the first time, we believe it will be important to ensure

We can download the page screenshots directly, and we can use them as context later.

In [None]:
image_nodes = await results.aget_image_nodes(
    include_object_images=False,
    include_screenshot_images=True,
    image_download_dir="./iconiq_images",
)

In [None]:
text_nodes = results.get_markdown_nodes(split_by_page=True)

## Build Multimodal Index

In this section we build the multimodal index over the parsed deck. 

We do this by creating **text** nodes from the document that contain metadata referencing the original image path.

In this example we're indexing the text node for retrieval. The text node has a reference to both the parsed text as well as the image screenshot.

In [None]:
for text_node, image_node in zip(text_nodes, image_nodes):
    text_node.metadata["image_path"] = image_node.image_path

In [None]:
print(text_nodes[0].get_content(metadata_mode="all"))

page_number: 1
file_name: data/iconiq_report.pdf
image_path: iconiq_images/page_1.jpg


# The State of AI

September 2024

Navigating the present and promise of Generative AI

ICONIQ | Growth
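The positional `zip` above mispairs pages without warning if the two lists ever differ in length or order. A more defensive variant, sketched below, keys on the `page_number` metadata visible in the printed node. `FakeNode` and `attach_image_paths` are hypothetical stand-ins, and the page-to-path mapping is assumed to be something you build yourself (for example from the downloaded file names); none of this is part of the LlamaParse API.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Minimal stand-in for a parsed node; only `metadata` is modeled."""

    metadata: dict = field(default_factory=dict)


def attach_image_paths(text_nodes, image_paths_by_page):
    """Attach each page's screenshot path to the matching text node.

    Returns the page numbers for which no screenshot was found.
    """
    missing = []
    for node in text_nodes:
        page = node.metadata.get("page_number")
        path = image_paths_by_page.get(page)
        if path is None:
            missing.append(page)
        else:
            node.metadata["image_path"] = path
    return missing


# Hypothetical data: three text pages, but only two screenshots downloaded.
demo_nodes = [FakeNode({"page_number": n}) for n in (1, 2, 3)]
demo_paths = {
    1: "iconiq_images/page_1.jpg",
    2: "iconiq_images/page_2.jpg",
}
missing_pages = attach_image_paths(demo_nodes, demo_paths)
print(missing_pages)
```

Pages without a screenshot are reported rather than paired with the wrong image, and nodes lacking an `image_path` can then be handled explicitly downstream.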


#### Add Contextual Summaries

In this section we implement the key step in contextual retrieval: attaching metadata to each chunk that situates it within the overall document context.

We take advantage of prompt caching by feeding in the static document as prefix tokens, and swapping out only the chunk-specific tokens that follow it.
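Prompt caching generally relies on the prefix being byte-for-byte identical across requests, so it is worth checking that nothing chunk-specific leaks into the document portion. The sketch below is library-free and illustrative: `build_prompt_parts` and `prefix_fingerprint` are hypothetical helpers, and the page texts are placeholders.

```python
import hashlib

DOC_TEMPLATE = "Here is the entire document.\n<document>\n{doc}\n</document>"
CHUNK_TEMPLATE = (
    "Here is the chunk we want to situate within the whole document\n"
    "<chunk>\n{chunk}\n</chunk>"
)


def build_prompt_parts(doc_text: str, chunk: str) -> tuple[str, str]:
    """Return (cacheable_prefix, per_chunk_suffix) for one request."""
    return DOC_TEMPLATE.format(doc=doc_text), CHUNK_TEMPLATE.format(chunk=chunk)


def prefix_fingerprint(prefix: str) -> str:
    """Hash the prefix so identical prefixes are easy to compare."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


# Placeholder page texts standing in for the parsed pages.
chunks = ["page one text", "page two text", "page three text"]
doc_text = "\n".join(chunks)

parts = [build_prompt_parts(doc_text, c) for c in chunks]
fingerprints = {prefix_fingerprint(prefix) for prefix, _ in parts}
suffixes = {suffix for _, suffix in parts}

# One shared prefix, one distinct suffix per chunk.
print(len(fingerprints), len(suffixes))
```

If the fingerprint set ever contains more than one entry, something in the prefix varies between requests and cache hits should not be expected.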

In [None]:
from copy import deepcopy
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock, CachePoint
import time


whole_doc_text = """\
Here is the entire document.
<document>
{WHOLE_DOCUMENT}
</document>"""

chunk_text = """\
Here is the chunk we want to situate within the whole document
<chunk>
{CHUNK_CONTENT}
</chunk>"""

suffix_text = """Please give a short succinct context to situate this chunk within the overall document for \
the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""


def create_contextual_nodes(nodes, llm):
    """Function to create contextual nodes for a list of nodes"""
    nodes_modified = []

    # get overall doc_text string
    doc_text = "\n".join([n.get_content(metadata_mode="all") for n in nodes])

    for idx, node in enumerate(nodes):
        start_time = time.time()
        new_node = deepcopy(node)

        messages = [
            ChatMessage(
                role="user",
                blocks=[
                    TextBlock(text=whole_doc_text.format(WHOLE_DOCUMENT=doc_text)),
                    CachePoint(cache_control={"type": "ephemeral"}),
                    TextBlock(
                        text=chunk_text.format(
                            CHUNK_CONTENT=node.get_content(metadata_mode="all")
                        )
                    ),
                    TextBlock(
                        text="And here is the page screenshot for the corresponding chunk:"
                    ),
                    ImageBlock(path=node.metadata["image_path"]),
                    TextBlock(text=suffix_text),
                ],
            ),
        ]

        new_response = llm.chat(messages)
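        # use the message content; str(new_response) would prepend the "assistant: " role prefix
        new_node.metadata["context"] = str(new_response.message.content)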

        nodes_modified.append(new_node)
        print(f"Completed node {idx}, {time.time() - start_time}")

    return nodes_modified

In [None]:
context_llm = Anthropic(model="claude-3-5-haiku-latest")

new_text_nodes = create_contextual_nodes(text_nodes, context_llm)

Completed node 0, 5.0501158237457275
Completed node 1, 4.125281095504761
Completed node 2, 3.700598955154419
Completed node 3, 4.249290943145752
Completed node 4, 4.552713871002197
Completed node 5, 3.700002908706665
Completed node 6, 4.9324049949646
Completed node 7, 6.246585845947266
Completed node 8, 5.678989887237549
Completed node 9, 4.55932092666626
Completed node 10, 4.865902662277222
Completed node 11, 4.376728057861328
Completed node 12, 3.823659896850586
Completed node 13, 4.069238185882568
Completed node 14, 3.7528319358825684
Completed node 15, 3.789531946182251
Completed node 16, 4.54377818107605
Completed node 17, 3.3560800552368164
Completed node 18, 4.519093990325928
Completed node 19, 5.594789028167725
Completed node 20, 3.7624330520629883
Completed node 21, 3.778661012649536
Completed node 22, 3.895768880844116
Completed node 23, 3.6451258659362793
Completed node 24, 9.422847032546997
Completed node 25, 3.954685926437378
Completed node 26, 3.4985830783843994
Completed

#### Build Index

Once the text nodes are ready, we feed them into our vector store index abstraction, which indexes the nodes into a simple in-memory vector store (of course, you should definitely check out our 40+ vector store integrations!).
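Because the generated context lives in node metadata, it only influences retrieval if metadata is rendered into the text that gets embedded. The sketch below is a library-free approximation of that rendering (a `key: value` header followed by the node text, matching the printed node shown earlier); `render_for_embedding` is a hypothetical helper, not a LlamaIndex function, and the context string is invented. The `excluded_keys` argument mirrors the idea behind the `excluded_embed_metadata_keys` field on LlamaIndex nodes, which can keep bookkeeping fields such as `image_path` out of the embedded text.

```python
def render_for_embedding(
    text: str, metadata: dict, excluded_keys: tuple = ()
) -> str:
    """Approximate the 'key: value' metadata header placed ahead of node text."""
    lines = [
        f"{key}: {value}"
        for key, value in metadata.items()
        if key not in excluded_keys
    ]
    metadata_str = "\n".join(lines)
    return f"{metadata_str}\n\n{text}" if metadata_str else text


# Hypothetical metadata; the context string is invented for illustration.
metadata = {
    "page_number": 11,
    "image_path": "iconiq_images/page_11.jpg",
    "context": "Framework for evaluating GenAI investment decisions.",
}
embed_text = render_for_embedding(
    "# A Decision-Making Framework", metadata, excluded_keys=("image_path",)
)
print(embed_text)
```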

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(nodes=new_text_nodes)

#### Build Baseline Index

Build a baseline index from the original text nodes, without the contextual summaries.

In [None]:
base_index = VectorStoreIndex(text_nodes)

## Build Multimodal Query Engine

We now use LlamaIndex abstractions to build a **custom query engine**. In contrast to a standard RAG query engine, which retrieves the text node and puts only that into the prompt (response synthesis module), this custom query engine also loads the page image and puts both the text and the image into the response synthesis module.

In [None]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import MetadataMode
from llama_index.core.base.response.schema import Response


qa_prompt_block_text = """\
Below we give parsed text from slides, as well as the corresponding page images.

---------------------
{context_str}
---------------------
"""

image_prefix_block = TextBlock(text="And here are the corresponding images per page\n")

image_suffix = """\
Given the context information and not prior knowledge, answer the query. Explain whether you got the answer
from the parsed text or the image, whether there are any discrepancies, and your reasoning for the final answer.

Query: {query_str}
Answer: """


class MultimodalQueryEngine(CustomQueryEngine):
    """Custom multimodal Query Engine.

    Takes in a retriever to retrieve a set of document nodes and respond using an LLM + retrieved text/images.

    """

    retriever: BaseRetriever
    llm: Anthropic

    def __init__(self, **kwargs) -> None:
        """Initialize."""
        super().__init__(**kwargs)

    def custom_query(self, query_str: str):
        # retrieve text nodes
        nodes = self.retriever.retrieve(query_str)
        # create ImageBlock items from the image paths stored on the retrieved nodes
        image_blocks = [
            ImageBlock(path=n.metadata["image_path"])
            for n in nodes
            if n.metadata.get("image_path")
        ]

        # create context string from text nodes, dump into the prompt
        context_str = "\n\n".join(
            [r.get_content(metadata_mode=MetadataMode.LLM) for r in nodes]
        )

        formatted_msg = ChatMessage(
            role="user",
            blocks=[
                TextBlock(text=qa_prompt_block_text.format(context_str=context_str)),
                image_prefix_block,
                *image_blocks,
                TextBlock(text=image_suffix.format(query_str=query_str)),
            ],
        )

        # synthesize an answer from formatted text and images
        llm_response = self.llm.chat([formatted_msg])

        return Response(
            response=str(llm_response.message.content),
            source_nodes=nodes,
        )

In [None]:
query_engine = MultimodalQueryEngine(
    retriever=index.as_retriever(similarity_top_k=3),
    llm=Anthropic(model="claude-4-sonnet-20250514"),
)

base_query_engine = base_index.as_query_engine(similarity_top_k=3)

## Try out Queries

Let's try out some questions against the slide deck in this multimodal RAG pipeline.

In [None]:
response = query_engine.query(
    "which departments/teams use genAI the most and how are they using it?"
)
print(str(response))

In [None]:
base_response = base_query_engine.query(
    "which departments/teams use genAI the most and how are they using it?"
)
print(str(base_response))

In this next question, the same sources are retrieved with and without contextual retrieval, and the answer is correct for both approaches. This is thanks to LlamaParse Premium's ability to comprehend graphs.

In [None]:
query = "what are relevant insights from the 'deep dive on infrastructure' section in terms of model preferences, cost, deployment environments?"

response = query_engine.query(query)
print(str(response))

In [None]:
base_response = base_query_engine.query(query)
print(str(base_response))
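To check whether two pipelines really retrieved the same sources, a small comparison helper can be useful. This is a minimal sketch: `compare_sources` is a hypothetical function, and the page numbers in the example are made up rather than taken from an actual run.

```python
def compare_sources(contextual_pages, baseline_pages):
    """Summarize the overlap between two lists of retrieved page numbers."""
    a, b = set(contextual_pages), set(baseline_pages)
    union = a | b
    return {
        "shared": sorted(a & b),
        "only_contextual": sorted(a - b),
        "only_baseline": sorted(b - a),
        # Jaccard similarity: 1.0 means identical sets of pages.
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }


# Hypothetical page numbers for the two retrievers.
report = compare_sources([21, 22, 25], [21, 25, 30])
print(report)
```

With real responses, the page lists could be gathered with something like `[n.metadata["page_number"] for n in response.source_nodes]` for each engine, assuming the `page_number` metadata key shown earlier is present on every retrieved node.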