# Advanced RAG with LlamaParse

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook walks through using LlamaParse with advanced indexing and retrieval techniques in LlamaIndex over Apple's 2021 10-K filing.

This lets us answer sophisticated questions that "naive" parsing and indexing techniques cannot handle with existing models.

In [None]:
%pip install llama-index llama-cloud-services

In [None]:
!wget "https://s2.q4cdn.com/470004039/files/doc_financials/2021/q4/_10-K-2021-(As-Filed).pdf" -O apple_2021_10k.pdf

Set the API keys for LlamaCloud (used by LlamaParse) and OpenAI, then configure the default LLM and embedding model.

In [None]:
import os

# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."

# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = "sk-proj-..."
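
Since both keys above are placeholders, it can help to fail fast before any parsing job is submitted. The following is a minimal sketch, not part of LlamaParse or LlamaIndex: the helper name `missing_keys` and the `REQUIRED_KEYS` tuple are hypothetical and exist only for illustration.

```python
import os

# Environment variables this notebook relies on (set in the cell above).
REQUIRED_KEYS = ("LLAMA_CLOUD_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling `missing_keys()` with no arguments checks the real environment; an empty list means both keys are present. The check only confirms that a value is present, not that the key is valid.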

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

embed_model = OpenAIEmbedding(model_name="text-embedding-3-small")
llm = OpenAI(model="gpt-4o-mini")

Settings.llm = llm
Settings.embed_model = embed_model

## Using the brand new `LlamaParse` PDF reader for PDF Parsing

We compare three different retrieval/query engine strategies:
1. A baseline that uses the default parsing from `SimpleDirectoryReader`.
2. Raw markdown text as nodes for building the index, with a simple query engine generating the results.
3. Markdown plus page screenshots to help retrieve the proper nodes.

In [None]:
from llama_cloud_services import LlamaParse

result = await LlamaParse(take_screenshot=True).aparse("./apple_2021_10k.pdf")

markdown_nodes = await result.aget_markdown_nodes(split_by_page=True)
screenshot_image_nodes = await result.aget_image_nodes(
    include_screenshot_images=True,
    include_object_images=False,
    image_download_dir="./images",
)

Started parsing the file under job_id e403a457-1721-4093-82bf-4a316d2d637a


In [None]:
from llama_index.core import SimpleDirectoryReader

baseline_documents = SimpleDirectoryReader(
    input_files=["apple_2021_10k.pdf"]
).load_data()

## Setup Baseline Index

For comparison, we set up a naive RAG pipeline with default parsing and standard chunking, indexing, and retrieval.

In [None]:
from llama_index.core import VectorStoreIndex

baseline_index = VectorStoreIndex.from_documents(baseline_documents)
baseline_query_engine = baseline_index.as_query_engine(similarity_top_k=3)
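
To make `similarity_top_k=3` concrete, here is a toy sketch of top-k retrieval, assuming cosine similarity as the scoring function. It is not how LlamaIndex is implemented internally; `cosine_similarity`, `top_k_chunks`, and the 2-d vectors are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d "embeddings"; real embedding vectors are much longer.
toy_chunks = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.0, 1.0],
    "chunk_c": [1.0, 1.0],
    "chunk_d": [-1.0, 0.0],
}
top = top_k_chunks([1.0, 0.1], toy_chunks, k=3)
print([chunk_id for chunk_id, _ in top])
```

Only the three best-scoring chunks reach the LLM, so if the relevant table lands in a poorly parsed chunk that scores low, the answer cannot use it. That is the failure mode the LlamaParse-based indexes below try to avoid.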

## Setup our LlamaParse Indexes

Using both the markdown and screenshot images, we can build two different indexes.

1. An index over just the markdown documents
2. A custom index that uses the markdown + screenshot images to help with response quality.

In [None]:
from llama_index.core import VectorStoreIndex

markdown_index = VectorStoreIndex(nodes=markdown_nodes)
markdown_query_engine = markdown_index.as_query_engine(similarity_top_k=3)

In [None]:
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

# could also use other API-based multimodal models like voyageai or jinaai
# Note: this may take quite a while if running on CPU!
image_embed_model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    embed_batch_size=2,
    trust_remote_code=True,
    cache_folder="./hf_cache_2",
    device="cpu",  # set to "cuda" if you have a GPU or remove to auto-detect
)

multi_modal_index = MultiModalVectorStoreIndex(
    nodes=[*markdown_nodes, *screenshot_image_nodes],
    embed_model=Settings.embed_model,
    image_embed_model=image_embed_model,
    show_progress=True,
)

Below, we create a custom query engine that does a few things:
1. Retrieves both image nodes and text nodes
2. Splits them into two lists: one where an image and a text come from the same page, and one with the remaining texts alone
3. Uses a Jinja-based `RichPromptTemplate` to format the retrieved content automatically into a list of multimodal chat messages
4. Sends the messages to the LLM and returns the result


In [None]:
from llama_index.core.async_utils import asyncio_run
from llama_index.core.llms import LLM
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.prompts import RichPromptTemplate
from llama_index.core.response import Response
from llama_index.core.schema import NodeWithScore
from llama_index.core import Settings

TEXT_IMAGE_PROMPT_TEMPLATE = RichPromptTemplate(
    """
<context>
Here is some retrieved content from a knowledge base:
{% for image_path, text in images_and_texts %}
<page>
<text>{{ text }}</text>
<image>{{ image_path | image }}</image>
</page>
{% endfor %}
{% for text in texts %}
<page>
<text>{{ text }}</text>
</page>
{% endfor %}
</context>

Using the context, answer the following question:
<query>{{ query_str }}</query>
"""
)


class SimpleMultiModalQueryEngine(CustomQueryEngine):
    def __init__(
        self,
        index: MultiModalVectorStoreIndex,
        image_top_k: int = 4,
        text_top_k: int = 4,
        llm: LLM | None = None,
        **kwargs
    ):
        super().__init__(**kwargs)
        self._retriever = index.as_retriever(
            similarity_top_k=text_top_k, image_similarity_top_k=image_top_k
        )
        self._llm = llm or Settings.llm

    def _match_images_and_texts(
        self, text_results: list[NodeWithScore], image_results: list[NodeWithScore]
    ) -> tuple[list[tuple[str, str]], list[str]]:
        # pair each retrieved image with the text retrieved from the same page;
        # retrieving both an image and its matching text is a strong relevance signal
        images_and_texts = []
        text_keys = {
            (x.metadata["page_number"], x.metadata["file_name"]): x
            for x in text_results
        }
        for image_result in image_results:
            key = (
                image_result.metadata["page_number"],
                image_result.metadata["file_name"],
            )
            # add matching text to results if available
            if key in text_keys:
                text_result = text_keys[key]
                images_and_texts.append(
                    (image_result.node.image_path, text_result.node.text)
                )

                # remove from list
                text_keys.pop(key)

        # get the remaining texts as a fallback
        texts = [result.node.text for result in text_keys.values()]

        return images_and_texts, texts

    def custom_query(self, query_str: str) -> Response:
        # wrap the async method to avoid code duplication
        # asyncio_run is a slightly safer asyncio.run() call
        return asyncio_run(self.acustom_query(query_str))

    async def acustom_query(self, query_str: str) -> Response:
        text_results = await self._retriever.atext_retrieve(query_str)
        image_results = await self._retriever.atext_to_image_retrieve(query_str)

        images_and_texts, texts = self._match_images_and_texts(
            text_results, image_results
        )
        messages = TEXT_IMAGE_PROMPT_TEMPLATE.format_messages(
            images_and_texts=images_and_texts, texts=texts, query_str=str(query_str)
        )

        response = await self._llm.achat(messages)

        return Response(
            response.message.content, source_nodes=[*text_results, *image_results]
        )
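
The page-matching step in `_match_images_and_texts` can be tried in isolation. The sketch below mirrors its logic on plain dictionaries instead of `NodeWithScore` objects; the function name `match_pages` and the sample hits are hypothetical stand-ins, not part of the notebook's API.

```python
def match_pages(
    text_hits: list[dict], image_hits: list[dict]
) -> tuple[list[tuple[str, str]], list[str]]:
    """Pair each retrieved image with the text from the same page, if any.

    Returns (pairs, leftover_texts), in the same shape as
    `_match_images_and_texts`.
    """
    # Index text hits by (page_number, file_name), as the engine does.
    text_by_page = {
        (hit["page_number"], hit["file_name"]): hit for hit in text_hits
    }

    pairs = []
    for image in image_hits:
        key = (image["page_number"], image["file_name"])
        # pop() fetches the match and removes it from the leftovers in one step.
        text = text_by_page.pop(key, None)
        if text is not None:
            pairs.append((image["image_path"], text["text"]))

    leftovers = [hit["text"] for hit in text_by_page.values()]
    return pairs, leftovers


# Made-up retrieval results for illustration only.
sample_texts = [
    {"page_number": 41, "file_name": "apple_2021_10k.pdf", "text": "text 41"},
    {"page_number": 42, "file_name": "apple_2021_10k.pdf", "text": "text 42"},
]
sample_images = [
    {"page_number": 41, "file_name": "apple_2021_10k.pdf", "image_path": "images/page_41.jpg"},
    {"page_number": 43, "file_name": "apple_2021_10k.pdf", "image_path": "images/page_43.jpg"},
]
pairs, leftovers = match_pages(sample_texts, sample_images)
print(pairs)
print(leftovers)
```

Note one consequence of this design, which also holds for the engine above: an image whose page has no retrieved text (page 43 in the sample) is left out of the prompt entirely, while a text with no matching image is still passed along on its own.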

In [None]:
multimodal_query_engine = SimpleMultiModalQueryEngine(
    index=multi_modal_index,
    image_top_k=3,
    text_top_k=3,
)

## Try out the Query Engines and Compare!

Now with our three query engines assembled, we can compare each approach with a rough "vibes-based" evaluation.

In [None]:
query = "What were the total fair value of marketable securities in 2020"

response_1 = await baseline_query_engine.aquery(query)
print("\n***********Baseline Query Engine***********")
print(response_1)

response_2 = await markdown_query_engine.aquery(query)
print("\n***********Markdown Query Engine***********")
print(response_2)

response_3 = await multimodal_query_engine.aquery(query)
print("\n***********MultiModal Query Engine***********")
print(response_3)


***********Baseline Query Engine***********
The total fair value of marketable securities in 2020 was $190,516 million.

***********Markdown Query Engine***********
The total fair value of marketable securities in 2020 was $191,830 million.

***********MultiModal Query Engine***********
The total fair value of marketable securities in 2020 was $191,830 million.


As we can see, the multimodal and markdown query engines are able to retrieve the correct content, while the baseline query engine struggles to find the correct total value.

We can also inspect the source nodes, and see the pages that were retrieved. Here is the correct page for the total fair value of marketable securities in 2020:

In [None]:
response_3.source_nodes[4].node.image_path

'images/page_41.jpg'
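
Rather than indexing into `source_nodes` one position at a time, the retrieved pages can be listed in one pass. This is a rough sketch that relies only on the attributes already used in this notebook (`.metadata`, `.score`, `.node`, and `image_path` on image nodes); the helper name `summarize_sources` and the `SimpleNamespace` stand-ins with made-up scores are hypothetical.

```python
from types import SimpleNamespace


def summarize_sources(source_nodes) -> list[dict]:
    """Collect position, page number, score, and image path for each hit."""
    rows = []
    for position, hit in enumerate(source_nodes):
        rows.append(
            {
                "position": position,
                "page_number": hit.metadata.get("page_number"),
                "score": hit.score,
                # Text nodes carry no image path, so fall back to None.
                "image_path": getattr(hit.node, "image_path", None),
            }
        )
    return rows


# Stand-in objects so the sketch runs without an index; scores are invented.
fake_hits = [
    SimpleNamespace(
        metadata={"page_number": 41}, score=0.5, node=SimpleNamespace(text="page text")
    ),
    SimpleNamespace(
        metadata={"page_number": 41},
        score=0.6,
        node=SimpleNamespace(image_path="images/page_41.jpg"),
    ),
]
for row in summarize_sources(fake_hits):
    print(row)
```

In the notebook, `summarize_sources(response_3.source_nodes)` would show the text hits first and the image hits after them, since the engine builds `source_nodes` in that order.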

Let's try a few more queries to see how the query engines perform.

In [None]:
query = "What were the effective interest rates of all debt issuances in 2021"

response_1 = await baseline_query_engine.aquery(query)
print("\n***********Baseline Query Engine***********")
print(response_1)

response_2 = await markdown_query_engine.aquery(query)
print("\n***********Markdown Query Engine***********")
print(response_2)

response_3 = await multimodal_query_engine.aquery(query)
print("\n***********MultiModal Query Engine***********")
print(response_3)


***********Baseline Query Engine***********
The effective interest rates for the debt issuances in 2021 were as follows:

- Floating-rate notes: 0.48% – 0.63%
- Fixed-rate notes: 0.03% – 4.78% for maturities from 2022 to 2060
- Fixed-rate notes issued in the second quarter: 0.75% – 2.81% for maturities from 2026 to 2061
- Fixed-rate notes issued in the fourth quarter: 1.43% – 2.86% for maturities from 2028 to 2061

***********Markdown Query Engine***********
The effective interest rates for the debt issuances in 2021 were as follows:

- Floating-rate notes: 0.48% – 0.63%
- Fixed-rate notes: 0.03% – 4.78% for the 0.000% – 4.650% notes, 0.75% – 2.81% for the 0.700% – 2.800% notes, and 1.43% – 2.86% for the 1.400% – 2.850% notes.

***********MultiModal Query Engine***********
The effective interest rates of all debt issuances in 2021 were as follows:

1. **Floating-rate notes**: 0.48% – 0.63%
2. **Fixed-rate 0.000% – 4.650% notes**: 0.03% – 4.78%
3. **Fixed-rate 0.700% – 2.800% notes**: 

In [None]:
query = "federal deferred tax in 2019-2021"

response_1 = await baseline_query_engine.aquery(query)
print("\n***********Baseline Query Engine***********")
print(response_1)

response_2 = await markdown_query_engine.aquery(query)
print("\n***********Markdown Query Engine***********")
print(response_2)

response_3 = await multimodal_query_engine.aquery(query)
print("\n***********MultiModal Query Engine***********")
print(response_3)


***********Baseline Query Engine***********
The federal deferred tax amounts for the years 2019 to 2021 are as follows (in millions):

- **2019**: $(2,939)
- **2020**: $(3,619)
- **2021**: $(7,176)

These figures represent the deferred tax expense for each respective year.

***********Markdown Query Engine***********
As of September 25, 2021, the total deferred tax assets and liabilities for the years 2021 and 2020 are as follows:

**Deferred Tax Assets:**
- 2021: $25,176 million
- 2020: $19,336 million

**Deferred Tax Liabilities:**
- 2021: $7,200 million
- 2020: $10,138 million

**Net Deferred Tax Assets:**
- 2021: $13,073 million
- 2020: $8,157 million

The information for 2019 is not provided in the context.

***********MultiModal Query Engine***********
The federal deferred tax assets and liabilities for the years 2019 to 2021 are as follows:

### Deferred Tax Assets (in millions):
- **2021**: $25,176
- **2020**: $19,336
- **2019**: Not specified in the provided content.

### Def

In [None]:
query = "current state taxes per year in 2019-2021 (include +/-)"

response_1 = await baseline_query_engine.aquery(query)
print("\n***********Baseline Query Engine***********")
print(response_1)

response_2 = await markdown_query_engine.aquery(query)
print("\n***********Markdown Query Engine***********")
print(response_2)

response_3 = await multimodal_query_engine.aquery(query)
print("\n***********MultiModal Query Engine***********")
print(response_3)


***********Baseline Query Engine***********
The current state taxes for the years 2019 to 2021 are as follows (in millions):

- 2021: $1,620
- 2020: $455
- 2019: $475

This indicates an increase of $1,165 million from 2020 to 2021, a decrease of $20 million from 2018 to 2019, and an increase of $80 million from 2019 to 2020.

***********Markdown Query Engine***********
The current state taxes for the years 2019 to 2021 are as follows (in millions):

- **2021**: $1,620
- **2020**: $455
- **2019**: $475

The changes in current state taxes from year to year are:

- From 2019 to 2020: Decrease of $20 million
- From 2020 to 2021: Increase of $1,165 million

***********MultiModal Query Engine***********
The current state taxes for the years 2019 to 2021 are as follows (in millions):

- **2021**: $1,620
- **2020**: $455
- **2019**: $475

So, the changes are:
- From 2019 to 2020: Decrease of $20 million
- From 2020 to 2021: Increase of $1,165 million
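
The comparison cells above repeat the same three calls for every query. As a possible refactor, the sketch below runs one query against all engines concurrently and prints the answers in the same banner format. The helper names `compare_engines` and `print_comparison` are hypothetical; the only assumption about an engine is that it exposes the async `aquery` method used throughout this notebook.

```python
import asyncio


async def compare_engines(engines: dict, query: str) -> dict[str, str]:
    """Run one query against every engine concurrently and collect the answers."""
    names = list(engines)
    # gather() preserves argument order, so responses line up with names.
    responses = await asyncio.gather(
        *(engines[name].aquery(query) for name in names)
    )
    return {name: str(response) for name, response in zip(names, responses)}


def print_comparison(answers: dict[str, str]) -> None:
    """Print each answer under the banner style used in the cells above."""
    for name, answer in answers.items():
        print(f"\n***********{name} Query Engine***********")
        print(answer)
```

In a notebook cell this could be used as `print_comparison(await compare_engines({"Baseline": baseline_query_engine, "Markdown": markdown_query_engine, "MultiModal": multimodal_query_engine}, query))`. Running the engines concurrently changes only the wall-clock time, not the answers.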
