# Building a Natively Multimodal RAG Pipeline (over a Slide Deck)

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/multimodal/multimodal_rag_slide_deck.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this cookbook we show you how to build a multimodal RAG pipeline over a slide deck, with text, tables, images, diagrams, and complex layouts.

A weakness of text-based RAG pipelines is that they depend on purely text-based representations of complex documents. For instance, if a page contains a lot of images and diagrams, a text parser has to rely on raw OCR to extract the text. You can also use a multimodal model (e.g. gpt-4o and up) to do the text extraction, but this is an inherently lossy conversion.

Instead, a **native multimodal pipeline** stores both a text and an image representation of each document chunk. The chunks are indexed via embeddings (text or image), and during synthesis both the text and the image are fed directly to the multimodal model.

This can have the following advantages:
- **Robustness**: This solution is more robust than a pure text or even a pure image-based approach. In a pure text RAG approach, the parsing piece can be lossy. In a pure image-based approach, multimodal OCR is not perfect and may lose out against text parsing for text-heavy documents.
- **Cost Optimization**: You may choose to dynamically include text-only, or text + image depending on the content of the page.
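
The cost-optimization idea above can be sketched as a small per-page heuristic. This is only an illustration, not something the pipeline below does: the function name `should_attach_image`, the `min_chars` threshold, and the specific signals (a markdown table, an image reference, very little extracted text) are all assumptions chosen for the example.

```python
def should_attach_image(page_text: str, markdown_text: str, min_chars: int = 400) -> bool:
    """Hypothetical heuristic: send the page screenshot only when text alone is likely lossy.

    The threshold and the signals are assumptions for illustration, not tuned values.
    """
    # a markdown table suggests a chart or table that the parser had to approximate
    has_table = "|---" in markdown_text or "| ---" in markdown_text
    # an image reference suggests visual content that text cannot capture
    has_image_ref = "![" in markdown_text
    # very little extracted text suggests a mostly visual page
    is_sparse = len(page_text.strip()) < min_chars
    return has_table or has_image_ref or is_sparse


text_heavy = should_attach_image("x" * 1000, "A plain paragraph of prose.")
table_page = should_attach_image("x" * 1000, "| a |\n|---|\n| 1 |")
```

With a rule like this, text-heavy pages would go to the model as text only, and pages with tables, figures, or sparse text would also include the screenshot.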

![mm_rag_diagram](./multimodal_rag_slide_deck_img.png)

## Setup

In [None]:
import nest_asyncio

nest_asyncio.apply()

### Setup Observability

We set up observability through LlamaTrace (an integration with Arize Phoenix).

If you haven't already done so, make sure to create an account here: https://llamatrace.com/login. Then create an API key and put it in the `PHOENIX_API_KEY` variable below.

In [None]:
!pip install -U llama-index-callbacks-arize-phoenix

In [None]:
# setup Arize Phoenix for logging/observability
import llama_index.core
import os

PHOENIX_API_KEY = "<PHOENIX_API_KEY>"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
llama_index.core.set_global_handler(
    "arize_phoenix", endpoint="https://llamatrace.com/v1/traces"
)

### Load Data

Here we load the [ConocoPhillips 2023 investor meeting slide deck](https://static.conocophillips.com/files/2023-conocophillips-aim-presentation.pdf).

In [None]:
!mkdir -p data
!mkdir -p data_images
!wget "https://static.conocophillips.com/files/2023-conocophillips-aim-presentation.pdf" -O data/conocophillips.pdf

### Model Setup

Set up the models that will be used for downstream orchestration.

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
llm = OpenAI(model="gpt-4o")

Settings.embed_model = embed_model
Settings.llm = llm

## Use LlamaParse to Parse Text and Images

In this example, we use LlamaParse to parse both the text and the images from the document.

We parse out the text in two ways: 
- in regular `text` mode using our default text layout algorithm
- in `markdown` mode using GPT-4o (`gpt4o_mode=True`). This also allows us to capture page screenshots

In [None]:
from llama_parse import LlamaParse


parser_text = LlamaParse(result_type="text")
parser_gpt4o = LlamaParse(result_type="markdown", gpt4o_mode=True)

In [None]:
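# plain-text parse: one Document covering the whole deck
print("Parsing text...")
docs_text = parser_text.load_data("data/conocophillips.pdf")
# markdown parse in GPT-4o mode: JSON result with per-page entries
print("Parsing PDF file...")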
md_json_objs = parser_gpt4o.get_json_result("data/conocophillips.pdf")
md_json_list = md_json_objs[0]["pages"]

In [None]:
print(docs_text[0].get_content())

In [None]:
print(md_json_list[10]["md"])

In [None]:
print(md_json_list[1].keys())

dict_keys(['page', 'text', 'md', 'images', 'items'])


In [None]:
image_dicts = parser_gpt4o.get_images(md_json_objs, download_path="data_images")

## Build Multimodal Index

In this section we build the multimodal index over the parsed deck. 

We do this by creating **text** nodes from the document that contain metadata referencing the original image path.

In this example we're indexing the text node for retrieval. The text node has a reference to both the parsed text as well as the image screenshot.

#### Get Text Nodes

In [None]:
from llama_index.core.schema import TextNode
from typing import Optional

In [None]:
# get pages loaded through llamaparse
import re
from pathlib import Path


def get_page_number(file_name):
    match = re.search(r"-page-(\d+)\.jpg$", str(file_name))
    if match:
        return int(match.group(1))
    return 0


def _get_sorted_image_files(image_dir):
    """Get image files sorted by page."""
    raw_files = [f for f in list(Path(image_dir).iterdir()) if f.is_file()]
    sorted_files = sorted(raw_files, key=get_page_number)
    return sorted_files
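
The numeric sort key matters because a plain string sort would place page 10 before page 2. The standalone sketch below shows the difference on made-up file names (the `deck-page-N.jpg` names are hypothetical; real LlamaParse file names carry a job ID prefix, as seen in the output further down). The helper `page_number` mirrors `get_page_number` above.

```python
import re


def page_number(file_name: str) -> int:
    """Extract N from a name ending in '-page-N.jpg'; return 0 when there is no match."""
    match = re.search(r"-page-(\d+)\.jpg$", file_name)
    return int(match.group(1)) if match else 0


# hypothetical file names, for illustration only
names = ["deck-page-10.jpg", "deck-page-2.jpg", "deck-page-1.jpg"]

lexicographic = sorted(names)             # string order: 1, 10, 2
numeric = sorted(names, key=page_number)  # page order: 1, 2, 10
```

Note that any file whose name does not match the pattern gets a key of 0 and therefore sorts to the front, which would shift every page by one position in `get_text_nodes`.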

In [None]:
from copy import deepcopy
from pathlib import Path


# attach image metadata to the text nodes
def get_text_nodes(docs, image_dir=None, json_dicts=None):
    """Split docs into nodes, by separator."""
    nodes = []

    image_files = _get_sorted_image_files(image_dir) if image_dir is not None else None
    md_texts = [d["md"] for d in json_dicts] if json_dicts is not None else None

    doc_chunks = docs[0].text.split("---")
    for idx, doc_chunk in enumerate(doc_chunks):
        chunk_metadata = {"page_num": idx + 1}
        if image_files is not None:
            image_file = image_files[idx]
            chunk_metadata["image_path"] = str(image_file)
        if md_texts is not None:
            chunk_metadata["parsed_text_markdown"] = md_texts[idx]
        chunk_metadata["parsed_text"] = doc_chunk
        node = TextNode(
            text="",
            metadata=chunk_metadata,
        )
        nodes.append(node)

    return nodes

In [None]:
# this will split into pages
text_nodes = get_text_nodes(docs_text, image_dir="data_images", json_dicts=md_json_list)
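
Because `get_text_nodes` pairs text chunks, screenshots, and markdown pages purely by position, a stray `---` in the text or an extra file in the image directory would silently misalign pages or raise an `IndexError`. A minimal sanity check, assuming you pass in the three lengths yourself, might look like the sketch below; `check_page_alignment` is a hypothetical helper, not part of LlamaIndex or LlamaParse.

```python
def check_page_alignment(num_chunks: int, num_images: int, num_md_pages: int) -> None:
    """Raise ValueError if the three per-page collections differ in length."""
    counts = {
        "text_chunks": num_chunks,
        "images": num_images,
        "markdown_pages": num_md_pages,
    }
    # all three counts must be identical for positional pairing to be valid
    if len(set(counts.values())) != 1:
        raise ValueError(f"Page counts differ: {counts}")


# example with made-up counts; in the notebook you would pass the real lengths
check_page_alignment(3, 3, 3)  # passes silently
```

Running a check like this before building the nodes makes a misalignment fail loudly instead of attaching the wrong screenshot to a page.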

In [None]:
print(text_nodes[10].get_content(metadata_mode="all"))

page_num: 11
image_path: data_images/d9137e19-3974-4b5d-998f-dac0cf29dd9d-page-10.jpg
parsed_text_markdown: # Commitment to Disciplined Reinvestment Rate

| Year       | Reinvestment Rate | WTI Average Price | Reinvestment Rate at $60/BBL WTI | Reinvestment Rate at $80/BBL WTI |
|------------|-------------------|-------------------|----------------------------------|----------------------------------|
| 2012-2016  | >100%             | ~$75/BBL          |                                  |                                  |
| 2017-2022  | <60%              | ~$63/BBL          |                                  |                                  |
| 2023E      |                   |                   |                                  | at $80/BBL WTI                   |
| 2024-2028  |                   |                   | at $60/BBL WTI                   | at $80/BBL WTI                   |
| 2029-2032  |                   |                   | at $60/BBL WTI                   | at $8

#### Build Index

Once the text nodes are ready, we feed them into our vector store index abstraction, which indexes them in a simple in-memory vector store (you can also swap in any of our 40+ vector store integrations).

In [None]:
import os
from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

if not os.path.exists("storage_nodes"):
    index = VectorStoreIndex(text_nodes, embed_model=embed_model)
    # save index to disk
    index.set_index_id("vector_index")
    index.storage_context.persist("./storage_nodes")
else:
    # rebuild storage context
    storage_context = StorageContext.from_defaults(persist_dir="storage_nodes")
    # load index
    index = load_index_from_storage(storage_context, index_id="vector_index")

retriever = index.as_retriever()

## Build Multimodal Query Engine

We now use LlamaIndex abstractions to build a **custom query engine**. A standard RAG query engine retrieves the text nodes and puts only their text into the prompt of the response synthesis module. This custom query engine also loads the page image referenced by each retrieved node, and passes both the text and the images to the response synthesis module.

In [None]:
from llama_index.core.query_engine import CustomQueryEngine, SimpleMultiModalQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.schema import ImageNode, NodeWithScore, MetadataMode
from llama_index.core.prompts import PromptTemplate
from llama_index.core.base.response.schema import Response
from typing import Optional


gpt_4o = OpenAIMultiModal(model="gpt-4o", max_new_tokens=4096)

QA_PROMPT_TMPL = """\
Below we give parsed text from slides in two different formats, as well as the image.

We parse the text in both 'markdown' mode as well as 'raw text' mode. Markdown mode attempts \
to convert relevant diagrams into tables, whereas raw text tries to maintain the rough spatial \
layout of the text.

Use the image information first and foremost. ONLY use the text/markdown information 
if you can't understand the image.

---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query. Explain whether you got the answer
from the parsed markdown or raw text or image, and if there's discrepancies, and your reasoning for the final answer.

Query: {query_str}
Answer: """

QA_PROMPT = PromptTemplate(QA_PROMPT_TMPL)


class MultimodalQueryEngine(CustomQueryEngine):
    """Custom multimodal Query Engine.

    Takes in a retriever to retrieve a set of document nodes.
    Also takes in a prompt template and multimodal model.

    """

    qa_prompt: PromptTemplate
    retriever: BaseRetriever
    multi_modal_llm: OpenAIMultiModal

    def __init__(self, qa_prompt: Optional[PromptTemplate] = None, **kwargs) -> None:
        """Initialize."""
        super().__init__(qa_prompt=qa_prompt or QA_PROMPT, **kwargs)

    def custom_query(self, query_str: str):
        # retrieve text nodes
        nodes = self.retriever.retrieve(query_str)
        # create ImageNode items from text nodes
        image_nodes = [
            NodeWithScore(node=ImageNode(image_path=n.metadata["image_path"]))
            for n in nodes
        ]

        # create context string from text nodes, dump into the prompt
        context_str = "\n\n".join(
            [r.get_content(metadata_mode=MetadataMode.LLM) for r in nodes]
        )
        fmt_prompt = self.qa_prompt.format(context_str=context_str, query_str=query_str)

        # synthesize an answer from formatted text and images
        llm_response = self.multi_modal_llm.complete(
            prompt=fmt_prompt,
            image_documents=[image_node.node for image_node in image_nodes],
        )
        return Response(
            response=str(llm_response),
            source_nodes=nodes,
            # expose the retrieved nodes (not the global `text_nodes` list)
            metadata={"text_nodes": nodes, "image_nodes": image_nodes},
        )

In [None]:
query_engine = MultimodalQueryEngine(
    retriever=index.as_retriever(similarity_top_k=9), multi_modal_llm=gpt_4o
)

### Define Baseline

In addition, we define a "baseline" where we rely only on text-based indexing. Here we define an index using only the nodes that are parsed in text-mode from LlamaParse. 

**NOTE**: We don't currently include the markdown-parsed text because that was parsed with GPT-4o, so already uses a multimodal model during the text extraction phase.

It is of course a valid experiment to compare RAG where multimodal extraction only happens during indexing, vs. the current multimodal RAG implementation where images are fed during synthesis to the LLM. 
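
As a starting point for that experiment, the sketch below turns LlamaParse page dicts into one `(text, metadata)` pair per page using only the markdown. It relies on the `page` and `md` keys shown in the JSON result earlier; the function name `markdown_page_records` and the choice to skip empty pages are assumptions for illustration. Each pair could then be wrapped in a `TextNode(text=..., metadata=...)` and indexed like the baseline below.

```python
def markdown_page_records(pages):
    """Build one (markdown_text, metadata) pair per parsed page (hypothetical helper)."""
    records = []
    for page in pages:
        md = page.get("md", "")
        if not md.strip():
            continue  # skip pages where no markdown was produced
        records.append((md, {"page_num": page.get("page")}))
    return records


# made-up sample in the same shape as entries of md_json_list
sample_pages = [{"page": 1, "md": "# Title"}, {"page": 2, "md": "  "}]
records = markdown_page_records(sample_pages)
```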

In [None]:
def get_nodes(docs):
    """Split docs into nodes, by separator."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split("\n---\n")
        for doc_chunk in doc_chunks:
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)

    return nodes

In [None]:
base_nodes = get_nodes(docs_text)

In [None]:
print(base_nodes[13].get_content(metadata_mode="all"))

Our Differentiated Portfolio: Deep; Durable and Diverse
                              20 BBOE of Resource                                           Diverse Production Base
                            Under $40/BBL Cost of Supply                              10-Year Plan Cumulative Production (BBOE)
      S50                   S32/BBL                                                Lower 48                           Alaska
                    Average Cost of Supply
  3$40                                                                                                                        GKA        GWA
                                                                                                                      GPA     WNS
      $30                                                                                                             EMENA
  3                                                                                                                              Norway
 

In [None]:
base_index = VectorStoreIndex(base_nodes, embed_model=embed_model)
base_query_engine = base_index.as_query_engine(llm=llm, similarity_top_k=9)

## Build a Multimodal Agent

Build an agent around the multimodal query engine. This gives you agent capabilities like query planning/decomposition and memory around a central QA interface.

In [None]:
from llama_index.core.tools import QueryEngineTool
from llama_index.core.agent import FunctionCallingAgentWorker


vector_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="vector_tool",
    description=(
        "Useful for retrieving specific context from the data. Do NOT select if question asks for a summary of the data."
    ),
)
agent = FunctionCallingAgentWorker.from_tools(
    [vector_tool], llm=llm, verbose=True
).as_agent()

In [None]:
# define a similar agent for the baseline
base_vector_tool = QueryEngineTool.from_defaults(
    query_engine=base_query_engine,
    name="vector_tool",
    description=(
        "Useful for retrieving specific context from the data. Do NOT select if question asks for a summary of the data."
    ),
)
base_agent = FunctionCallingAgentWorker.from_tools(
    [base_vector_tool], llm=llm, verbose=True
).as_agent()

## Try out Queries

Let's run the same query against the multimodal agent and the text-only baseline agent and compare the results.

In [None]:
# response = agent.query("Tell me about the different regions and subregions where Conoco Phillips has a production base.")
response = agent.query(
    "How does the Conoco Phillips capex/EUR in the delaware basin compare against other competitors?"
)
print(str(response))

Added user message to memory: How does the Conoco Phillips capex/EUR in the delaware basin compare against other competitors?
=== Calling Function ===
Calling function: vector_tool with args: {"input": "Conoco Phillips capex/EUR in the Delaware Basin"}
=== Function Output ===
The ConocoPhillips capex/EUR in the Delaware Basin is $10/BOE.

I obtained this information from the image provided. The image clearly shows a bar chart under the section "Delaware Basin Well Capex/EUR ($/BOE)" where ConocoPhillips is listed with a capex/EUR of $10/BOE. This information is consistent with the parsed markdown text, which also lists ConocoPhillips' capex/EUR as $10/BOE in the Delaware Basin. There are no discrepancies between the image and the parsed markdown text in this case.
=== Calling Function ===
Calling function: vector_tool with args: {"input": "competitors capex/EUR in the Delaware Basin"}
=== Function Output ===
The competitors' Capex/EUR in the Delaware Basin can be found in the image on 

In [None]:
print(response.source_nodes[0].get_content(metadata_mode="all"))

page_num: 38
image_path: data_images/d9137e19-3974-4b5d-998f-dac0cf29dd9d-page-37.jpg
parsed_text_markdown: # Delaware: Vast Inventory with Proven Track Record of Performance

## Prolific Acreage Spanning Over ~659,000 Net Acres¹

![Map of Delaware Basin](image)

### Total 10-Year Operated Permian Inventory

- Delaware Basin: 65%
- Midland Basin: 35%

### High Single-Digit Production Growth

## 12-Month Cumulative Production³ (BOE/FT)

| Months | 2019 | 2020 | 2021 | 2022 |
|--------|------|------|------|------|
| 1      | 0    | 0    | 0    | 0    |
| 2      | 5    | 6    | 7    | 8    |
| 3      | 10   | 12   | 14   | 16   |
| 4      | 15   | 18   | 21   | 24   |
| 5      | 20   | 24   | 28   | 32   |
| 6      | 25   | 30   | 35   | 40   |
| 7      | 30   | 36   | 42   | 48   |
| 8      | 35   | 42   | 49   | 56   |
| 9      | 40   | 48   | 56   | 64   |
| 10     | 45   | 54   | 63   | 72   |
| 11     | 50   | 60   | 70   | 80   |
| 12     | 55   | 66   | 77   | 88   |

~30% Improved

In [None]:
# base_response = base_agent.query("Tell me about the different regions and subregions where Conoco Phillips has a production base.")
base_response = base_agent.query(
    "How does the Conoco Phillips capex/EUR in the delaware basin compare against other competitors?"
)
print(str(base_response))

Added user message to memory: How does the Conoco Phillips capex/EUR in the delaware basin compare against other competitors?
=== Calling Function ===
Calling function: vector_tool with args: {"input": "Conoco Phillips capex/EUR in the Delaware Basin"}
=== Function Output ===
ConocoPhillips' capex/EUR in the Delaware Basin is approximately $20/BOE.
=== Calling Function ===
Calling function: vector_tool with args: {"input": "competitors capex/EUR in the Delaware Basin"}
=== Function Output ===
The average single well capex/EUR for competitors in the Delaware Basin is between $10 and $25 per BOE.
=== LLM Response ===
ConocoPhillips' capex/EUR in the Delaware Basin is approximately $20 per BOE. In comparison, the average capex/EUR for competitors in the Delaware Basin ranges between $10 and $25 per BOE. This places ConocoPhillips' capex/EUR towards the higher end of the competitive range.
ConocoPhillips' capex/EUR in the Delaware Basin is approximately $20 per BOE. In comparison, the aver

In [None]:
print(base_response.source_nodes[0].get_content(metadata_mode="llm"))

Deep, Durable and Diverse Portfolio with Significant Growth Runway
    1,2002022 Lower 48 Unconventional Production' (MBOED                                                  S50     ~S32/BBL
     000 ConocoPhillips                                                                                                   Cost of SupplyAverage
       00                                                                                                 S40
      500                                                                                            3
      400                                                                                            1    S30
     200
                                                                                                      5
   15,000ConocoPhillipsNet Remaining Well Inventory?                                                  1   S20
   12,000                                                                                                 S10
      000
