# Building a Multimodal RAG Pipeline over an Auto Insurance Claim

This cookbook shows how to use LlamaParse, backed by OpenAI's multimodal GPT-4o model, to parse an auto insurance claim document that contains complex tabular data.

This example demonstrates how LlamaParse can be used on insurance documents, which often contain complex tabular data. We parse these tabular PDF files into markdown-formatted tables, which can then be indexed and queried with a `VectorStoreIndex`. This can help insurance companies gather information about a particular accident from claim documents more quickly.

## Install and Setup

Install LlamaIndex and LlamaParse, download the data, and apply `nest_asyncio`.

In [None]:
%pip install llama-index llama-parse

In [None]:
!wget https://github.com/user-attachments/files/16435705/claim.pdf -O claim.pdf

In [None]:
import nest_asyncio

nest_asyncio.apply()

Set up your OpenAI and LlamaCloud keys.

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "<Your OpenAI API Key>"
os.environ["LLAMA_CLOUD_API_KEY"] = "<Your LlamaCloud API Key>"

## Code Implementation

Set up LlamaParse. We want to parse the PDF file into markdown, converting the tabular data into markdown tables. To improve the accuracy of the extracted tables, we use the GPT-4o multimodal model to parse the PDF.

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    parsing_instruction="This is an auto insurance claim document.",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
)

md_json_objs = parser.get_json_result(
    "claim.pdf"
)  # extract markdown data for insurance claim document
md_json_list = md_json_objs[0]["pages"]  # extract list of pages for insurance claim doc
parser.get_images(
    md_json_objs, download_path="data_images"
)  # extract images from PDFs and save them to ./data_images/
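
To make the shape of the parsed result more concrete, the sketch below walks a hand-written stand-in for the JSON result. Only the `pages` and `md` keys are taken from the code in this notebook. The sample content and the `preview_pages` helper are hypothetical, and the real result may carry additional fields.

```python
# Hypothetical stand-in for the parser's JSON result: a list with one entry
# per input file, each holding a "pages" list whose items carry the page
# markdown under "md". The content below is made up for illustration.
sample_json_objs = [
    {
        "pages": [
            {"md": "# Claim Form\n\n| Field | Value |\n|---|---|\n| Claimant | Jane Doe |"},
            {"md": "# Accident Description\n\nTwo vehicles collided at an intersection."},
        ]
    }
]


def preview_pages(json_objs, max_chars=40):
    """Return (page_number, first markdown line) for each page of the first file."""
    previews = []
    for page_num, page in enumerate(json_objs[0]["pages"], start=1):
        lines = page["md"].splitlines()
        first_line = lines[0] if lines else ""  # guard against empty pages
        previews.append((page_num, first_line[:max_chars]))
    return previews


for page_num, first_line in preview_pages(sample_json_objs):
    print(f"page {page_num}: {first_line}")
```

Running the same helper on `md_json_objs` is a quick way to confirm that every page was parsed before building nodes.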

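Create helper functions that build the list of nodes to feed into the `VectorStoreIndex`. Each page yields one `TextNode` holding the page's markdown and one `ImageNode` pointing to the page image.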

In [None]:
import re
from pathlib import Path
import typing as t
from llama_index.core.schema import TextNode, ImageNode


def get_page_number(file_name):
    """Gets page number of images using regex on file names"""
    match = re.search(r"-page-(\d+)\.jpg$", str(file_name))
    if match:
        return int(match.group(1))
    return 0


def _get_sorted_image_files(image_dir):
    """Get image files sorted by page."""
    raw_files = [f for f in list(Path(image_dir).iterdir()) if f.is_file()]
    sorted_files = sorted(raw_files, key=get_page_number)
    return sorted_files


def get_text_nodes(json_dicts, image_dir) -> t.List[TextNode]:
    """Creates nodes from json + images"""

    nodes = []

    docs = [doc["md"] for doc in json_dicts]  # extract text
    image_files = _get_sorted_image_files(image_dir)  # extract images

    for idx, doc in enumerate(docs):
        # adds both a text node and the corresponding image node (jpg of the page) for each page
        node = TextNode(
            text=doc,
            metadata={"image_path": str(image_files[idx]), "page_num": idx + 1},
        )
        image_node = ImageNode(
            image_path=str(image_files[idx]),
            metadata={"page_num": idx + 1, "text_node_id": node.id_},
        )
        nodes.extend([node, image_node])

    return nodes


text_nodes = get_text_nodes(md_json_list, "data_images")
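
The helper sorts image files by the page number embedded in the file name rather than by the name itself, so that page 10 does not sort before page 2. The sketch below illustrates the difference using the same regex. The file names are hypothetical examples of the `-page-N.jpg` pattern the regex expects, and `page_number` is a standalone copy of the logic for illustration.

```python
import re


def page_number(file_name):
    """Extract the page number from a name ending in '-page-<N>.jpg' (0 if absent)."""
    match = re.search(r"-page-(\d+)\.jpg$", str(file_name))
    return int(match.group(1)) if match else 0


# Hypothetical file names following the '-page-N.jpg' pattern.
names = ["claim-page-10.jpg", "claim-page-2.jpg", "claim-page-1.jpg"]

lexicographic = sorted(names)            # plain string order
by_page = sorted(names, key=page_number)  # numeric page order

print(lexicographic)
print(by_page)
```

Because `get_text_nodes` pairs the i-th page of markdown with the i-th image file, the numeric ordering is what keeps text and image aligned for documents with ten or more pages.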

Index the documents.

In [None]:
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
llm = OpenAI("gpt-4o")

Settings.llm = llm
Settings.embed_model = embed_model

if not os.path.exists("storage_insurance"):
    index = VectorStoreIndex(text_nodes, embed_model=embed_model)
    index.storage_context.persist(persist_dir="./storage_insurance")
else:
    ctx = StorageContext.from_defaults(persist_dir="./storage_insurance")
    index = load_index_from_storage(ctx)

query_engine = index.as_query_engine()
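
The cell above follows a build-once, load-afterwards pattern: check for a persisted directory, build and persist if it is missing, and load otherwise. The sketch below shows the same pattern in isolation with a JSON file instead of an index. The `load_or_build` helper, the `cache.json` file name, and the `storage_demo` directory are hypothetical and not part of LlamaIndex.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(persist_dir, build_fn):
    """Load a cached value from persist_dir, or build it and persist it there."""
    cache_file = Path(persist_dir) / "cache.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text()), "loaded"
    value = build_fn()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(value))
    return value, "built"


with tempfile.TemporaryDirectory() as tmp:
    persist_dir = Path(tmp) / "storage_demo"
    first = load_or_build(persist_dir, lambda: {"pages": 3})   # nothing on disk yet
    second = load_or_build(persist_dir, lambda: {"pages": 3})  # reuses the saved copy
    print(first[1], second[1])
```

The point to carry over is that the existence check and the persist call must refer to the same path; otherwise the expensive build step runs on every execution.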

Example queries are shown below.

In [None]:
from IPython.display import display, Markdown

response = query_engine.query("Who filed the insurance claim?")
display(Markdown(str(response)))

Michael De Santa filed the insurance claim.

In [None]:
response = query_engine.query("Where did the accident happen?")
display(Markdown(str(response)))

The accident happened at the intersection of Eclipse Boulevard and Marlow Drive in Los Angeles, CA.

In [None]:
response = query_engine.query("How was the red sedan damaged?")
display(Markdown(str(response)))

The red sedan sustained significant damage in the car accident. The initial collision with the blue pickup truck in the intersection caused substantial front-end damage, including crumpled components like the hood, fenders, and bumper. The force of this impact then caused the sedan to spin and strike a nearby traffic pole, further damaging the side and rear of the vehicle. The combination of these two collisions resulted in severe structural damage, likely compromising the frame, suspension, and other critical systems, rendering the vehicle inoperable.

In [None]:
response = query_engine.query("Who was in the blue pickup?")
display(Markdown(str(response)))

The blue pickup had three occupants: Walter White, Skylar White, and Walter White Jr.

In [None]:
response = query_engine.query("Who owns the blue pickup?")
display(Markdown(str(response)))

The blue pickup is owned by Walter White.

In [None]:
response = query_engine.query("Who are some witnesses and how can we contact them?")
display(Markdown(str(response)))

One of the witnesses is Franklin Clinton. He can be contacted at 3671 Whispy Mound Drive, Los Angeles, CA 90068, with the phone number 3285550156.
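
Since each node was created with a `page_num` in its metadata, an answer can be traced back to the pages it was drawn from through the response's `source_nodes`. The sketch below is one way to do that. The `cited_pages` helper is hypothetical, and the mock response built with `SimpleNamespace` only imitates the `source_nodes`, `node`, and `metadata` attributes so that the example runs without any API calls.

```python
from types import SimpleNamespace


def cited_pages(response):
    """Return the sorted, de-duplicated page numbers behind a response."""
    pages = set()
    for source in response.source_nodes:
        page = source.node.metadata.get("page_num")
        if page is not None:
            pages.add(page)
    return sorted(pages)


def _mock_source(page_num, score):
    """Build a stand-in for a retrieved node with a score (illustration only)."""
    node = SimpleNamespace(metadata={"page_num": page_num})
    return SimpleNamespace(node=node, score=score)


# Mock response: page 2 is retrieved twice (e.g. text node and image node).
mock_response = SimpleNamespace(
    source_nodes=[_mock_source(2, 0.81), _mock_source(1, 0.77), _mock_source(2, 0.64)]
)

print(cited_pages(mock_response))
```

With a real query, passing `response` instead of `mock_response` should list the claim pages that the retriever selected, which is useful when an adjuster needs to verify an answer against the original document.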

In [None]:
chat_engine = index.as_chat_engine()
response = chat_engine.chat(
    "Given the context, name a party that is liable for the damages and provide reasoning."
)
display(Markdown(str(response)))

Michael De Santa is likely liable for the damages. The reasoning is based on the description of the accident, which indicates that Michael De Santa, driving a red sedan, collided with a blue pickup truck at an intersection. The severity of the impact and the subsequent collision with a traffic pole suggest that Michael De Santa may have been at fault, leading to significant damage to both vehicles.