# Building a Natively Multimodal RAG Pipeline (over a Slide Deck)

<a href="https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/parse/multimodal/multimodal_rag_slide_deck.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this cookbook we show you how to build a multimodal RAG pipeline over a slide deck, with text, tables, images, diagrams, and complex layouts.

A limitation of text-based RAG pipelines is that they rely on purely text-based representations of complex documents. For instance, if a page contains a lot of images and diagrams, a text parser has to rely on raw OCR to extract the text. You can also use a multimodal model (e.g. gpt-4o and up) to do the text extraction, but this is inherently a lossy conversion.

Instead, a **native multimodal pipeline** stores both a text and an image representation of each document chunk. The chunks are indexed via embeddings (text or image), and during synthesis both the text and the image are fed directly to the multimodal model.

This can have the following advantages:
- **Robustness**: This solution is more robust than a pure text or even a pure image-based approach. In a pure text RAG approach, the parsing piece can be lossy. In a pure image-based approach, multimodal OCR is not perfect and may lose out against text parsing for text-heavy documents.
- **Cost Optimization**: You may choose to dynamically include text-only, or text + image depending on the content of the page.
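
As one illustration of the cost-optimization idea, the sketch below decides per page whether the screenshot is worth attaching. The heuristic (sparse text, or the presence of table/image markup) and the names `should_attach_image` and `min_chars` are assumptions made for this example; they are not part of LlamaParse or LlamaIndex.

```python
import re

# Hypothetical markers suggesting that layout carries meaning the text may miss.
_VISUAL_MARKERS = re.compile(r"<table|<img|!\[[^\]]*\]\(", re.IGNORECASE)


def should_attach_image(page_markdown: str, min_chars: int = 400) -> bool:
    """Return True when the page screenshot is likely worth its token cost."""
    text = page_markdown.strip()
    if len(text) < min_chars:
        # Sparse text often means the slide is mostly a chart or diagram.
        return True
    # Tables and embedded figures are where text parsing tends to be lossy.
    return bool(_VISUAL_MARKERS.search(text))
```

A text-heavy page without tables would then be sent as text only, while a slide dominated by a chart would also include its screenshot. The threshold would need tuning on your own documents.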

Status:
| Last Executed | Version | State      |
|---------------|---------|------------|
| Aug-20-2025   | 0.6.61  | Maintained |

![mm_rag_diagram](./multimodal_rag_slide_deck_img.png)

## Setup

In [None]:
%pip install llama-cloud-services "llama-index>=0.13.0,<0.14.0"

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-"
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."
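
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. A minimal alternative, assuming the keys are either already exported in the environment or can be typed in interactively (the helper name `ensure_env` is hypothetical):

```python
import os
from getpass import getpass


def ensure_env(name: str) -> str:
    """Return the value of an environment variable, prompting once if unset."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the input so the key does not end up in cell output.
        value = getpass(f"Enter {name}: ")
        os.environ[name] = value
    return value
```

Calling `ensure_env("OPENAI_API_KEY")` and `ensure_env("LLAMA_CLOUD_API_KEY")` would then replace the two assignments above.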

### (Optional) Setup Observability

We set up an integration with LlamaTrace (integration with Arize).

If you haven't already done so, make sure to create an account here: https://llamatrace.com/login. Then create an API key and put it in the `PHOENIX_API_KEY` variable below.

In [None]:
!pip install -U llama-index-callbacks-arize-phoenix

In [None]:
# setup Arize Phoenix for logging/observability
import llama_index.core
import os

PHOENIX_API_KEY = "<PHOENIX_API_KEY>"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
llama_index.core.set_global_handler(
    "arize_phoenix", endpoint="https://llamatrace.com/v1/traces"
)

### Load Data

Here we load the [ConocoPhillips 2023 investor meeting slide deck](https://static.conocophillips.com/files/2023-conocophillips-aim-presentation.pdf).

In [None]:
!mkdir -p data
!mkdir -p data_images
!wget "https://static.conocophillips.com/files/2023-conocophillips-aim-presentation.pdf" -O data/conocophillips.pdf

### Model Setup

Set up the models that will be used for downstream orchestration.

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
llm = OpenAI(model="gpt-5-mini")

Settings.embed_model = embed_model
Settings.llm = llm

## Use LlamaParse to Parse Text and Images

In this example, we use LlamaParse to parse both the text and the images from the document.

In [None]:
from llama_cloud_services import LlamaParse


parser = LlamaParse(
    parse_mode="parse_page_with_agent",
    model="openai-gpt-4-1-mini",
    high_res_ocr=True,
    adaptive_long_table=True,
    outlined_table_extraction=True,
    output_tables_as_HTML=True,
)

In [None]:
results = await parser.aparse("data/conocophillips.pdf")

Started parsing the file under job_id 2cf07879-5bdb-4dca-9a07-001b2a07727e
.

In [None]:
print(results.pages[10].md)


# Commitment to Disciplined Reinvestment Rate

<table>
<thead>
<tr>
  <th>Industry Growth Focus</th>
  <th>ConocoPhillips Strategy Reset</th>
  <th>Disciplined Reinvestment Rate is the Foundation for Superior Returns <br> <b>on and of</b> Capital, while Driving Durable CFO Growth</th>
</tr>
</thead>
<tbody>
<tr>
  <td style="text-align:center;">&gt;100%<br>Reinvestment Rate</td>
  <td style="text-align:center;">&lt;60%<br>Reinvestment Rate</td>
  <td style="text-align:center; font-weight:bold; color:#0055ff;">
    ~50%<br>10-Year Reinvestment Rate<br><br>
    ~6%<br>CFO CAGR 2024-2032<br><br>
    at $60/BBL WTI<br>Mid-Cycle Planning Price
  </td>
</tr>
<tr>
  <td>
    <div style="height:150px; width:50px; background-color:#b0b0b0; margin: 0 auto; position:relative;">
      <div style="position:absolute; bottom:0; width:100%; height:105%; background-color:#b0b0b0;"></div>
      <div style="position:absolute; bottom:0; width:100%; text-align:center; color:#fff; font-weight:bold;">~$75/B

We can download the page screenshots directly, and we can use them as context later.

In [None]:
image_nodes = await results.aget_image_nodes(
    include_object_images=False,
    include_screenshot_images=True,
    image_download_dir="./slide_images",
)

In [None]:
text_nodes = results.get_markdown_nodes(split_by_page=True)

## Build Multimodal Index

In this section we build the multimodal index over the parsed deck. 

We do this by creating **text** nodes from the document that contain metadata referencing the original image path.

In this example we're indexing the text node for retrieval. The text node has a reference to both the parsed text as well as the image screenshot.

#### Get Text Nodes

In [None]:
for text_node, image_node in zip(text_nodes, image_nodes):
    text_node.metadata["image_path"] = image_node.image_path

In [None]:
print(text_nodes[0].get_content(metadata_mode="all"))

page_number: 1
file_name: data/conocophillips.pdf
image_path: slide_images/page_1.jpg


# ConocoPhillips

## 2023 Analyst & Investor Meeting


#### Build Index

Once the text nodes are ready, we feed them into our vector store index abstraction, which indexes these nodes into a simple in-memory vector store (of course, you should definitely check out our 40+ vector store integrations!)

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(nodes=text_nodes)
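
Conceptually, retrieval over an in-memory vector store is a nearest-neighbour search over embedding vectors. The toy sketch below illustrates that idea with cosine similarity in plain Python. It is not how `VectorStoreIndex` is implemented, and the two-dimensional vectors are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=3):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = sorted(
        store.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True
    )
    return [node_id for node_id, _ in scored[:k]]


# Made-up 2-d "embeddings" keyed by page id.
store = {"page_1": [1.0, 0.0], "page_2": [0.7, 0.7], "page_3": [0.0, 1.0]}
print(top_k([0.9, 0.1], store, k=2))  # ['page_1', 'page_2']
```

In the real pipeline the vectors come from `text-embedding-3-large`, and `similarity_top_k` plays the role of `k`.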

## Build Multimodal Query Engine

We now use LlamaIndex abstractions to build a **custom query engine**. Whereas a standard RAG query engine retrieves the text nodes and puts only their text into the prompt (response synthesis module), this custom query engine also loads the page image referenced by each node and passes both the text and the images to the response synthesis module.

In [None]:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import MetadataMode
from llama_index.core.base.response.schema import Response
from llama_index.core.llms import TextBlock, ImageBlock, ChatMessage


qa_prompt_block_text = """\
Below we give the parsed markdown text from slides, followed by the page images.

---------------------
{context_str}
---------------------
"""

image_prefix_block = TextBlock(text="And here are the corresponding images per page\n")

image_suffix = """\
Given the context information and not prior knowledge, answer the query. Explain whether you got the answer
from the parsed markdown or the image, whether there are discrepancies, and your reasoning for the final answer.

Query: {query_str}
Answer: """


class MultimodalQueryEngine(CustomQueryEngine):
    """Custom multimodal Query Engine.

    Takes in a retriever to retrieve a set of document nodes and respond using an LLM + retrieved text/images.

    """

    retriever: BaseRetriever
    llm: OpenAI

    def __init__(self, **kwargs) -> None:
        """Initialize."""
        super().__init__(**kwargs)

    def custom_query(self, query_str: str):
        # retrieve text nodes
        nodes = self.retriever.retrieve(query_str)
        # create ImageBlock items from the image paths stored on the retrieved nodes
        image_blocks = [
            ImageBlock(path=n.metadata["image_path"])
            for n in nodes
            if n.metadata.get("image_path")
        ]

        # create context string from text nodes, dump into the prompt
        context_str = "\n\n".join(
            [r.get_content(metadata_mode=MetadataMode.LLM) for r in nodes]
        )

        formatted_msg = ChatMessage(
            role="user",
            blocks=[
                TextBlock(text=qa_prompt_block_text.format(context_str=context_str)),
                image_prefix_block,
                *image_blocks,
                TextBlock(text=image_suffix.format(query_str=query_str)),
            ],
        )

        # synthesize an answer from formatted text and images
        llm_response = self.llm.chat([formatted_msg])

        return Response(
            response=str(llm_response.message.content),
            source_nodes=nodes,
        )

In [None]:
query_engine = MultimodalQueryEngine(
    retriever=index.as_retriever(similarity_top_k=3), llm=llm
)
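
The query engine above builds an `ImageBlock` for every retrieved node that has an `image_path`. If a screenshot failed to download, or two nodes point at the same page, you may want to filter the paths first. The helper below is a hypothetical sketch of such a filter, not part of LlamaIndex. The `exists` parameter is there only so the check can be swapped out in tests.

```python
import os


def existing_image_paths(nodes_metadata, exists=os.path.isfile):
    """Return unique image paths, in retrieval order, that pass the `exists` check."""
    seen = set()
    paths = []
    for metadata in nodes_metadata:
        path = metadata.get("image_path")
        if path and path not in seen and exists(path):
            seen.add(path)
            paths.append(path)
    return paths
```

Inside `custom_query`, one could then build the blocks from `existing_image_paths(n.metadata for n in nodes)` instead of reading `image_path` directly.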

### Define Baseline

In addition, we define a "baseline" where we rely only on text-based indexing. Here we define an index using only the nodes that are parsed in text-mode from LlamaParse. 

In [None]:
base_index = VectorStoreIndex(nodes=text_nodes)
base_query_engine = base_index.as_query_engine(llm=llm, similarity_top_k=3)

## Build a Multimodal Agent

Build an agent around the multimodal query engine. This gives you agent capabilities like query planning/decomposition and memory around a central QA interface.

In [None]:
from llama_index.core.tools import QueryEngineTool
from llama_index.core.agent import FunctionAgent


vector_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="vector_tool",
    description=(
        "Useful for retrieving specific context from the data. Do NOT select if question asks for a summary of the data."
    ),
)
agent = FunctionAgent(
    tools=[vector_tool],
    llm=llm,
)

from llama_index.core.workflow import Context

# Context to store chat history for the session
ctx = Context(agent)

In [None]:
# define a similar agent for the baseline
base_vector_tool = QueryEngineTool.from_defaults(
    query_engine=base_query_engine,
    name="vector_tool",
    description=(
        "Useful for retrieving specific context from the data. Do NOT select if question asks for a summary of the data."
    ),
)
base_agent = FunctionAgent(
    tools=[base_vector_tool],
    llm=llm,
)

base_ctx = Context(base_agent)

## Try out Queries

Let's try out queries against the slide deck and compare the multimodal agent's responses with the baseline's.

In [None]:
query = (
    "Tell me about the diverse geographies where Conoco Phillips has a production base"
)

response = await agent.run(query, ctx=ctx)
base_response = await base_agent.run(query, ctx=base_ctx)

In [None]:
print(str(response))

In [None]:
print(str(base_response))
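
To compare the two agents over more than one query, a small helper can collect the answers side by side. This is a minimal sketch: `compare` is a hypothetical name, and it only assumes that each entry in `agents` is a callable that takes a query string and returns an awaitable, as `agent.run(...)` does in the cells above.

```python
import asyncio


async def compare(agents, queries):
    """Run every query against every agent and collect the answers.

    `agents` maps a label to a callable that takes a query string and
    returns an awaitable. Queries run one after another; the agents run
    concurrently for each query.
    """
    results = {}
    for query in queries:
        answers = await asyncio.gather(*(fn(query) for fn in agents.values()))
        results[query] = dict(zip(agents.keys(), (str(a) for a in answers)))
    return results
```

In the notebook this could be called as `await compare({"multimodal": lambda q: agent.run(q, ctx=ctx), "baseline": lambda q: base_agent.run(q, ctx=base_ctx)}, [query])`.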