# Building a Multimodal RAG Pipeline over an Auto Insurance Claim

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/multimodal/insurance_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This cookbook shows how to use LlamaParse with OpenAI's multimodal GPT-4o model to parse auto insurance claim documents that contain complex tabular data. The example uses an auto insurance claim template form with tabular fields for the accident location, the accident description, both parties' vehicles, and any injuries. The template is shown below.

![Auto Insurance Template](https://github.com/user-attachments/assets/aadbaa5b-16d2-490f-be35-f8ee06571633)

This example demonstrates how LlamaParse can be used on insurance documents, which often contain complex tabular data. We parse these tabular PDF files into markdown-formatted tables, which can then be indexed and queried with a `VectorStoreIndex`. This can help insurance companies gather information about car accidents from claim documents more quickly.

## Install and Setup

Install LlamaIndex and LlamaParse, download the data, and apply `nest_asyncio` so that async calls can run inside the notebook's event loop.

In [None]:
%pip install llama-index llama-parse

In [None]:
!wget https://github.com/user-attachments/files/16536240/claims.zip -O claims.zip
!unzip -o claims.zip
!rm claims.zip

In [None]:
import nest_asyncio

nest_asyncio.apply()

Set up your OpenAI and LlamaCloud keys.

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "<Your OpenAI API Key>"
os.environ["LLAMA_CLOUD_API_KEY"] = "<Your LlamaCloud API Key>"
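
Placeholder keys left in place tend to surface later as confusing authentication errors. The following is a minimal sketch of an early check, assuming the two variable names used above; `missing_keys` is a hypothetical helper written for this notebook, not part of LlamaIndex or LlamaParse.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "LLAMA_CLOUD_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset, empty, or still a placeholder.

    Hypothetical helper for this notebook; not part of any library.
    """
    env = os.environ if env is None else env
    missing = []
    for name in required:
        value = env.get(name, "").strip()
        # The setup cell uses "<Your ... API Key>" placeholders; treat those as unset.
        if not value or value.startswith("<"):
            missing.append(name)
    return missing


# Report what still needs to be filled in before any API call is made.
print("Missing keys:", missing_keys())
```

An empty list means both keys look usable; anything else names the variables that still need a real value.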

## Code Implementation

Set up LlamaParse. We want to parse the PDF files into markdown, translating the tabular data into markdown tables. To improve parsing accuracy on these tables, we will use the GPT-4o multimodal model to parse the PDFs.

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    parsing_instruction="This is an auto insurance claim document.",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
    show_progress=True,
)

CLAIMS_DIR = "claims"


def get_claims_files(claims_dir=CLAIMS_DIR) -> list[str]:
    files = []
    for f in os.listdir(claims_dir):
        fname = os.path.join(claims_dir, f)
        if os.path.isfile(fname):
            files.append(fname)
    return files


files = get_claims_files()  # get all files from the claims/ directory
md_json_objs = parser.get_json_result(
    files
)  # extract markdown data for insurance claim document
parser.get_images(
    md_json_objs, download_path="data_images"
)  # extract images from PDFs and save them to ./data_images/

In [None]:
# extract list of pages for insurance claim doc
md_json_list = []
for obj in md_json_objs:
    md_json_list.extend(obj["pages"])
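
Flattening every document's pages into one list discards which document a page came from. The sketch below illustrates the same flattening on mock data while keeping that information. It assumes only the two keys the code above relies on (`"pages"` on each result and `"md"` on each page); `flatten_pages`, `doc_idx`, and `page_in_doc` are hypothetical names introduced for illustration.

```python
def flatten_pages(json_objs):
    """Flatten per-document results into one list of page dicts.

    Each page is copied and tagged with the index of its source document
    and its 1-based position inside that document.
    """
    pages = []
    for doc_idx, obj in enumerate(json_objs):
        for page_in_doc, page in enumerate(obj["pages"], start=1):
            pages.append({**page, "doc_idx": doc_idx, "page_in_doc": page_in_doc})
    return pages


# Mock data shaped like the parts of the result that this notebook uses.
mock_results = [
    {"pages": [{"md": "| Field | Value |"}, {"md": "Accident description"}]},
    {"pages": [{"md": "| Field | Value |"}]},
]

flat = flatten_pages(mock_results)
for page in flat:
    print(page["doc_idx"], page["page_in_doc"], page["md"])
```

Carrying the document index along makes it possible to tell apart pages that share the same page number in different claims.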

Create helper functions that build a `TextNode` and a matching `ImageNode` for each parsed page, to feed into the `VectorStoreIndex`.

In [None]:
import re
from pathlib import Path
import typing as t
from llama_index.core.schema import TextNode, ImageNode


def get_page_number(file_name):
    """Gets page number of images using regex on file names"""
    match = re.search(r"-page-(\d+)\.jpg$", str(file_name))
    if match:
        return int(match.group(1))
    return 0


def _get_sorted_image_files(image_dir):
    """Get image files sorted by page."""
    raw_files = [f for f in list(Path(image_dir).iterdir()) if f.is_file()]
    sorted_files = sorted(raw_files, key=get_page_number)
    return sorted_files


def get_text_nodes(json_dicts, image_dir) -> t.List[TextNode]:
    """Creates nodes from json + images"""

    nodes = []

    docs = [doc["md"] for doc in json_dicts]  # extract text
    image_files = _get_sorted_image_files(image_dir)  # extract images

    for idx, doc in enumerate(docs):
        # adds both a text node and the corresponding image node (jpg of the page) for each page
        node = TextNode(
            text=doc,
            metadata={"image_path": str(image_files[idx]), "page_num": idx + 1},
        )
        image_node = ImageNode(
            image_path=str(image_files[idx]),
            metadata={"page_num": idx + 1, "text_node_id": node.id_},
        )
        nodes.extend([node, image_node])

    return nodes


text_nodes = get_text_nodes(md_json_list, "data_images")
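
The page-number regex matters because a plain string sort would put page 10 before page 2. The sketch below reproduces the same pattern on made-up file names (the `job-page-N.jpg` names are assumptions for illustration, not the names LlamaParse necessarily writes) to show how the sort key behaves.

```python
import re
from pathlib import Path

PAGE_RE = re.compile(r"-page-(\d+)\.jpg$")


def page_number(file_name):
    """Return the page number encoded in the file name, or 0 if there is none."""
    match = PAGE_RE.search(str(file_name))
    return int(match.group(1)) if match else 0


# Hypothetical file names, deliberately out of order.
names = [
    Path("data_images/job-page-10.jpg"),
    Path("data_images/job-page-2.jpg"),
    Path("data_images/job-page-1.jpg"),
    Path("data_images/notes.txt"),  # no page number, so the key is 0
]

ordered = sorted(names, key=page_number)
print([p.name for p in ordered])
```

Files that do not match the pattern get key 0 and sort first, and files that share a page number tie with each other, so the index-based pairing in `get_text_nodes` relies on the image directory holding exactly one image per parsed page.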

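Index the nodes. The index is persisted to `./storage_insurance`, so later runs load it from disk instead of embedding the nodes again.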

In [None]:
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
llm = OpenAI(model="gpt-4o")

Settings.llm = llm
Settings.embed_model = embed_model

if not os.path.exists("storage_insurance"):
    index = VectorStoreIndex(text_nodes, embed_model=embed_model)
    index.storage_context.persist(persist_dir="./storage_insurance")
else:
    ctx = StorageContext.from_defaults(persist_dir="./storage_insurance")
    index = load_index_from_storage(ctx)

query_engine = index.as_query_engine()
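
Conceptually, a vector index answers a query by embedding it, ranking the stored nodes by similarity to that embedding, and handing the best matches to the LLM. The toy sketch below illustrates only the ranking step with hand-written three-dimensional vectors. It is not LlamaIndex's implementation, and `cosine`, `top_k`, and the vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, node_vecs, k=2):
    """Return the k (node_id, score) pairs most similar to the query, best first."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" for three pages; real embeddings have far more dimensions.
toy_index = {
    "page-1": [1.0, 0.0, 0.0],
    "page-2": [0.7, 0.7, 0.0],
    "page-3": [0.0, 0.0, 1.0],
}

hits = top_k([1.0, 0.1, 0.0], toy_index, k=2)
for node_id, score in hits:
    print(node_id, round(score, 3))
```

Only the retrieved pages reach the LLM, which is why a question about one claim is normally answered from that claim's pages rather than from the whole corpus.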

Example queries are shown below.

In [None]:
from IPython.display import display, Markdown

response = query_engine.query(
    "Who filed the insurance claim for the accident that happened on Sunset Blvd?"
)
display(Markdown(str(response)))

Michael Johnson filed the insurance claim for the accident that happened on Sunset Blvd.
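
To check which pages an answer was drawn from, the retrieved nodes attached to the response can be inspected. The following is a minimal sketch, assuming each entry in `response.source_nodes` exposes `.node.metadata` and `.score` in the way LlamaIndex's `NodeWithScore` does. `summarize_sources` is a hypothetical helper, and the stand-in objects below replace a real response so that the snippet runs without API calls.

```python
from types import SimpleNamespace


def summarize_sources(source_nodes):
    """Return (page_num, image_path, score) tuples for each retrieved node."""
    rows = []
    for item in source_nodes:
        meta = item.node.metadata
        score = round(item.score, 3) if item.score is not None else None
        rows.append((meta.get("page_num"), meta.get("image_path"), score))
    return rows


# Stand-ins that mimic only the attributes used above; not real LlamaIndex objects.
fake_sources = [
    SimpleNamespace(
        node=SimpleNamespace(
            metadata={"page_num": 3, "image_path": "data_images/job-page-3.jpg"}
        ),
        score=0.75,
    ),
    SimpleNamespace(node=SimpleNamespace(metadata={"page_num": 1}), score=None),
]

rows = summarize_sources(fake_sources)
for row in rows:
    print(row)
```

In the notebook, the same helper would be called as `summarize_sources(response.source_nodes)` after a query, using the `page_num` and `image_path` metadata set in `get_text_nodes`.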

In [None]:
response = query_engine.query("How did Ms. Patel's accident happen?")
display(Markdown(str(response)))

Ms. Patel's accident occurred on March 10, 2023, at approximately 9:15 AM in the Boise Towne Square Mall parking lot. She was heading west at a parking space and, after checking her mirrors and blind spots, did not see any approaching vehicles. However, Michael Chen, the driver of another vehicle, was driving too fast through the parking lot and failed to stop in time, resulting in a collision with Ms. Patel's vehicle. This caused significant damage to the rear bumper and trunk of her car.

In [None]:
response = query_engine.query("How was Mr. Johnson's red sedan damaged?")
display(Markdown(str(response)))

Mr. Johnson's red sedan, a 2020 Honda Accord, was damaged on the front passenger side, including a dented fender and a broken headlight. The estimated repair cost is $3,500.

In [None]:
response = query_engine.query("How was Mr. Doe's Honda Accord damaged?")
display(Markdown(str(response)))

Mr. Doe's Honda Accord sustained damage to the front bumper, hood, fenders, head/tail lights, windshield, and doors.

In [None]:
response = query_engine.query(
    "Who are some witnesses for Ms. Patel's accident and how can we contact them?"
)
display(Markdown(str(response)))

The witness for Ms. Patel's accident is Sophia Rodriguez. She can be contacted at 5554567890.

In [None]:
response = query_engine.query(
    "Did Ms. Johnson sustain any injuries? If so, what were those injuries?"
)
display(Markdown(str(response)))

Yes, Ms. Johnson sustained injuries. She experienced minor injuries, including a bruised knee and some whiplash.

In [None]:
chat_engine = index.as_chat_engine()
response = chat_engine.chat(
    "Given the accident that happened on Lombard Street, name a party that is liable for the damages and explain why."
)
display(Markdown(str(response)))

Mark Johnson is liable for the damages from the accident on Lombard Street. He was driving a delivery van that collided with the rear of Emily Rodriguez's vehicle. In rear-end collisions, the driver who hits the vehicle in front is typically at fault because they are expected to maintain a safe distance and be able to stop in time to avoid a collision.