# Building a Natively Multimodal RAG Pipeline (over a Slide Deck)

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/multimodal/multimodal_rag_ppt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this cookbook we show you how to build a multimodal RAG pipeline over a slide deck, with text, tables, images, diagrams, and complex layouts.

A limitation of text-based RAG pipelines is that they rely on purely text-based representations of complex documents. For instance, if a page contains many images and diagrams, a text parser has to rely on raw OCR to extract the text. You can also use a multimodal model (e.g. gpt-4o and up) to do the text extraction, but this is inherently a lossy conversion.

Instead, a **native multimodal pipeline** stores both a text and an image representation of each document chunk. The chunks are indexed via embeddings (text or image), and during synthesis both the text and the image are fed directly to the multimodal model.

This can have the following advantages:
- **Robustness**: This solution is more robust than a pure text or even a pure image-based approach. In a pure text RAG approach, the parsing piece can be lossy. In a pure image-based approach, multimodal OCR is not perfect and may lose out against text parsing for text-heavy documents.
- **Cost Optimization**: You may choose to dynamically include text-only, or text + image depending on the content of the page.
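
The cost optimization point can be sketched as a per-page routing decision. The snippet below is a minimal illustration and is not part of the pipeline built in this notebook: `PageSummary`, `choose_modalities`, and the 200-character threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class PageSummary:
    """Hypothetical per-page summary; not a LlamaIndex or LlamaParse type."""

    page_num: int
    text: str
    num_images: int


def choose_modalities(page: PageSummary, min_text_chars: int = 200) -> list[str]:
    """Return which representations to send to the model for this page.

    Assumed heuristic: pages with images or very little extractable text are
    likely visual, so the screenshot is included; otherwise text alone is used.
    """
    if page.num_images > 0 or len(page.text.strip()) < min_text_chars:
        return ["text", "image"]
    return ["text"]


text_heavy = PageSummary(page_num=3, text="x" * 1000, num_images=0)
chart_page = PageSummary(page_num=7, text="Production outlook", num_images=2)

print(choose_modalities(text_heavy))
print(choose_modalities(chart_page))
```

A real deployment would likely tune the heuristic (or replace it with a classifier) based on how visual the pages in the deck actually are.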

## Setup

In [None]:
import nest_asyncio

nest_asyncio.apply()

### Setup Observability

We set up an integration with LlamaTrace (an integration with Arize).

If you haven't already done so, make sure to create an account here: https://llamatrace.com/login. Then create an API key and put it in the `PHOENIX_API_KEY` variable below.

In [None]:
!pip install -U llama-index-callbacks-arize-phoenix

In [None]:
# setup Arize Phoenix for logging/observability
import llama_index.core
import os

PHOENIX_API_KEY = "<PHOENIX_API_KEY>"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
llama_index.core.set_global_handler(
    "arize_phoenix", endpoint="https://llamatrace.com/v1/traces"
)

### Load Data

Here we load the [ConocoPhillips 2023 investor meeting slide deck](https://static.conocophillips.com/files/2023-conocophillips-aim-presentation.pdf).

In [None]:
!mkdir -p data
!mkdir -p data_images
!wget "https://static.conocophillips.com/files/2023-conocophillips-aim-presentation.pdf" -O data/conocophillips.pdf

--2024-07-10 22:07:20--  https://static.conocophillips.com/files/2023-conocophillips-aim-presentation.pdf
Resolving static.conocophillips.com (static.conocophillips.com)... 2600:9000:25ef:e200:13:a3a2:1e40:93a1, 2600:9000:25ef:2400:13:a3a2:1e40:93a1, 2600:9000:25ef:6400:13:a3a2:1e40:93a1, ...
Connecting to static.conocophillips.com (static.conocophillips.com)|2600:9000:25ef:e200:13:a3a2:1e40:93a1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41895745 (40M) [application/pdf]
Saving to: ‘data/conocophillips.pdf’


2024-07-10 22:07:21 (32.8 MB/s) - ‘data/conocophillips.pdf’ saved [41895745/41895745]



### Model Setup

Set up the models that will be used for downstream orchestration.

In [None]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
llm = OpenAI(model="gpt-4o")

Settings.embed_model = embed_model
Settings.llm = llm

## Use LlamaParse to Parse Text and Images

In this example, we use LlamaParse to parse both the text and images from the document.

We parse out the text in two ways: 
- in regular `text` mode using our default text layout algorithm
- in `markdown` mode using GPT-4o (`gpt4o_mode=True`). This also allows us to capture page screenshots

In [None]:
from llama_parse import LlamaParse


parser_text = LlamaParse(result_type="text")
parser_gpt4o = LlamaParse(result_type="markdown", gpt4o_mode=True)

In [None]:
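# parse in plain text mode; `docs_text` is used below to build the text nodes
docs_text = parser_text.load_data("data/conocophillips.pdf")
# parse in GPT-4o markdown mode; the JSON result also references page screenshots
json_objs = parser_gpt4o.get_json_result("data/conocophillips.pdf")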
json_list = json_objs[0]["pages"]

Started parsing the file under job_id cac11eca-71c5-46bd-add9-ca5f4d378899


In [None]:
print(docs_text[0].get_content())

In [None]:
print(json_list[10]["md"])

In [None]:
print(json_list[1]["md"])

In [None]:
image_dicts = parser_gpt4o.get_images(json_objs, download_path="data_images")

## Build Multimodal Index

In this section we build the multimodal index over the parsed deck. 

We do this by creating **text** nodes from the document that contain metadata referencing the original image path.

In this example we're indexing the text node for retrieval. The text node has a reference to both the parsed text as well as the image screenshot.
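
As a rough mental model of what each node carries, the sketch below uses plain dictionaries instead of `TextNode`. The `build_page_record` and `render_for_embedding` helpers are hypothetical and only approximate how a node with empty text and populated metadata turns into a string; the sample values are placeholders, not output from the deck.

```python
def build_page_record(page_num, raw_text, markdown_text, image_path):
    """Plain-dict stand-in for one page-level node (not a LlamaIndex type)."""
    return {
        "text": "",  # the node body stays empty, as in this notebook
        "metadata": {
            "page_num": page_num,
            "image_path": image_path,
            "parsed_text_markdown": markdown_text,
            "parsed_text": raw_text,
        },
    }


def render_for_embedding(record):
    """Hypothetical helper: join metadata as 'key: value' lines plus the body."""
    lines = [f"{key}: {value}" for key, value in record["metadata"].items()]
    return "\n".join(lines) + "\n\n" + record["text"]


record = build_page_record(
    page_num=1,
    raw_text="placeholder raw text",
    markdown_text="# placeholder markdown",
    image_path="data_images/placeholder-page-1.jpg",
)
print(render_for_embedding(record))
```

Because the node body is empty, the content that gets embedded and retrieved comes from the metadata, which is why both parsed text variants are stored there.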

#### Get Text Nodes

In [None]:
from llama_index.core.schema import TextNode
from typing import Optional

In [None]:
# get pages loaded through llamaparse
import re
from pathlib import Path


def get_page_number(file_name):
    match = re.search(r"-page-(\d+)\.jpg$", str(file_name))
    if match:
        return int(match.group(1))
    return 0


def _get_sorted_image_files(image_dir):
    """Get image files sorted by page."""
    raw_files = [f for f in list(Path(image_dir).iterdir()) if f.is_file()]
    sorted_files = sorted(raw_files, key=get_page_number)
    return sorted_files
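
To see why the files are sorted by the extracted page number rather than by name, here is a small standalone check. The file names are made up for illustration and only assume the `-page-<n>.jpg` suffix that the regex above expects.

```python
import re
from pathlib import Path


def get_page_number(file_name):
    """Same logic as above: pull the page number out of the file name."""
    match = re.search(r"-page-(\d+)\.jpg$", str(file_name))
    return int(match.group(1)) if match else 0


# hypothetical file names, used only for this illustration
files = [Path(f"deck-page-{n}.jpg") for n in (10, 2, 1)]

by_name = sorted(files, key=lambda f: f.name)  # lexicographic order
by_page = sorted(files, key=get_page_number)  # numeric page order

print([f.name for f in by_name])
print([f.name for f in by_page])
```

Sorting by name places page 10 before page 2, which would misalign screenshots with the page-level text chunks; sorting by the parsed integer avoids that.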

In [None]:
from copy import deepcopy
from pathlib import Path


# attach image metadata to the text nodes
def get_text_nodes(docs, image_dir=None, json_dicts=None):
    """Split the parsed text into one node per page and attach image and markdown metadata."""
    nodes = []

    image_files = _get_sorted_image_files(image_dir) if image_dir is not None else None
    md_texts = [d["md"] for d in json_dicts] if json_dicts is not None else None

    doc_chunks = docs[0].text.split("---")
    for idx, doc_chunk in enumerate(doc_chunks):
        chunk_metadata = {"page_num": idx + 1}
        if image_files is not None:
            image_file = image_files[idx]
            chunk_metadata["image_path"] = str(image_file)
        if md_texts is not None:
            chunk_metadata["parsed_text_markdown"] = md_texts[idx]
        #             chunk_metadata["parsed_text_markdown"] = f"""
        # Here is the parsed markdown. Parsing may be incorrect:
        # {md_texts[idx]}
        # -----
        # """
        chunk_metadata["parsed_text"] = doc_chunk
        node = TextNode(
            # text=doc_chunk,
            # text=chunk_metadata["text_markdown"],
            text="",
            metadata=chunk_metadata,
        )
        nodes.append(node)

    return nodes

In [None]:
# this will split into pages
text_nodes = get_text_nodes(docs_text, image_dir="data_images", json_dicts=json_list)

In [None]:
print(text_nodes[10].get_content(metadata_mode="llm"))

#### Build Index

Once the text nodes are ready, we feed them into our vector store index abstraction, which indexes these nodes into a simple in-memory vector store (of course, you should definitely check out our 40+ vector store integrations!).
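
Conceptually, an in-memory vector store keeps one embedding per node and ranks nodes by similarity to the query embedding. The sketch below illustrates that idea with toy 3-dimensional vectors; `cosine_similarity` and `top_k` are names invented for this example and are not the LlamaIndex implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Rank stored (node_id, vector) pairs by similarity to the query."""
    scored = [(node_id, cosine_similarity(query_vec, vec)) for node_id, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# toy "embeddings"; real ones come from the embedding model
stored = [
    ("page-1", [1.0, 0.0, 0.0]),
    ("page-2", [0.0, 1.0, 0.0]),
    ("page-3", [0.7, 0.7, 0.0]),
]
results = top_k([1.0, 0.1, 0.0], stored, k=2)
print([node_id for node_id, _ in results])
```

The `similarity_top_k` argument used later in this notebook plays the role of `k` here.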

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(text_nodes, embed_model=embed_model)

## Build Multimodal Query Engine

We now use LlamaIndex abstractions to build a **custom query engine**. In contrast to a standard RAG query engine that will retrieve the text node and only put that into the prompt (response synthesis module), this custom query engine will also load the image document, and put both the text and image document into the response synthesis module.
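
To make "putting both the text and the image into synthesis" concrete, here is a minimal sketch of a multimodal request body built from retrieved pages. The layout mirrors the content-part format used by OpenAI-style chat APIs, but it is an illustration of the idea under that assumption, not the code that `SimpleMultiModalQueryEngine` runs internally; `build_multimodal_parts` is a hypothetical helper and the image bytes are fake.

```python
import base64


def build_multimodal_parts(context_texts, image_bytes_list, query):
    """Assemble one text part plus one image part per retrieved page."""
    prompt = "\n\n".join(context_texts) + f"\n\nQuery: {query}\nAnswer: "
    parts = [{"type": "text", "text": prompt}]
    for image_bytes in image_bytes_list:
        # images are inlined as base64 data URLs
        encoded = base64.b64encode(image_bytes).decode("utf-8")
        parts.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
            }
        )
    return parts


# placeholder inputs; in the notebook these would come from the retrieved nodes
parts = build_multimodal_parts(
    context_texts=["parsed text for page 1", "parsed text for page 2"],
    image_bytes_list=[b"fake-jpeg-1", b"fake-jpeg-2"],
    query="<INSERT QUESTION HERE>",
)
print(len(parts), [p["type"] for p in parts])
```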

In [None]:
from llama_index.core.query_engine import CustomQueryEngine, SimpleMultiModalQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.schema import ImageNode, QueryBundle, NodeWithScore
from llama_index.core.prompts import PromptTemplate


gpt_4o = OpenAIMultiModal(model="gpt-4o", max_new_tokens=4096)


QA_PROMPT_TMPL = """\
Below we give parsed text from slides in two different formats, as well as the image.

We parse the text in both 'markdown' mode as well as 'raw text' mode. Markdown mode attempts \
to convert relevant diagrams into tables, whereas raw text tries to maintain the rough spatial \
layout of the text.

Use the markdown context, but keep in mind the parsed data may be incorrect. Try to correlate
data from text mode and the image to markdown. If you notice inconsistencies, then rely on your understanding
of text mode and the image to give the response instead.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query. Explain whether you got the answer
from the parsed markdown or raw text or image, and if there's discrepancies, and your reasoning for the final answer.

Query: {query_str}
Answer: """

qa_prompt = PromptTemplate(QA_PROMPT_TMPL)


class MultimodalQueryEngine(CustomQueryEngine):
    """Custom multimodal Query Engine."""

    retriever: BaseRetriever
    multi_modal_llm: OpenAIMultiModal

    def custom_query(self, query_str: str):
        nodes = self.retriever.retrieve(query_str)

        # create ImageNode items from text nodes
        image_nodes = [
            NodeWithScore(node=ImageNode(image_path=n.metadata["image_path"]))
            for n in nodes
        ]

        # TODO: we just use the `synthesize` mode from `SimpleMultiModalQueryEngine`
        # which synthesizes information from a query string and a set of text and image chunks.
        query_engine = SimpleMultiModalQueryEngine(
            self.retriever,  ## unused
            multi_modal_llm=self.multi_modal_llm,
            text_qa_template=qa_prompt,
        )
        response = query_engine.synthesize(
            QueryBundle(query_str=query_str),
            nodes + image_nodes
            # image_nodes ## TMP
        )

        return response

In [None]:
query_engine = MultimodalQueryEngine(
    retriever=index.as_retriever(similarity_top_k=9), multi_modal_llm=gpt_4o
)

### Define Baseline

In addition, we define a "baseline" where we rely only on text-based indexing. Here we can directly do `index.as_query_engine` to get back the "default" query engine over the index.

In [None]:
base_query_engine = index.as_query_engine(llm=llm, similarity_top_k=9)

## Build a Multimodal Agent

Build an agent around the multimodal query engine. This gives you agent capabilities like query planning/decomposition and memory around a central QA interface.

In [None]:
from llama_index.core.tools import QueryEngineTool
from llama_index.core.agent import FunctionCallingAgentWorker


vector_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="vector_tool",
    description=(
        "Useful for retrieving specific context from the data. Do NOT select if question asks for a summary of the data."
    ),
)
agent = FunctionCallingAgentWorker.from_tools(
    [vector_tool], llm=llm, verbose=True
).as_agent()

In [None]:
# define a similar agent for the baseline
base_vector_tool = QueryEngineTool.from_defaults(
    query_engine=base_query_engine,
    name="vector_tool",
    description=(
        "Useful for retrieving specific context from the data. Do NOT select if question asks for a summary of the data."
    ),
)
base_agent = FunctionCallingAgentWorker.from_tools(
    [base_vector_tool], llm=llm, verbose=True
).as_agent()

## Try out Queries

Let's try out queries against the slide deck and compare the multimodal agent against the baseline.

In [None]:
response = agent.query("<INSERT QUESTION HERE>")
print(str(response))
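
To compare the two agents side by side, one option is a small helper that sends the same question to each of them. This is a sketch: `compare_answers` is a hypothetical helper, and stub callables stand in for the agents so the snippet runs on its own.

```python
def compare_answers(question, engines):
    """Run one question through each named engine and collect the answers.

    `engines` maps a label to any callable that takes a question string and
    returns something printable, e.g. `agent.query`.
    """
    return {name: str(ask(question)) for name, ask in engines.items()}


# stand-ins so the sketch runs on its own; in the notebook you would pass
# {"multimodal": agent.query, "baseline": base_agent.query} instead
stub_engines = {
    "multimodal": lambda q: f"[multimodal] {q}",
    "baseline": lambda q: f"[baseline] {q}",
}

for name, answer in compare_answers("<INSERT QUESTION HERE>", stub_engines).items():
    print(f"--- {name} ---")
    print(answer)
```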