# Building a RAG Pipeline over Legal Documents

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/multimodal/legal_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This example shows how to use LlamaParse and LlamaIndex to parse several types of legal documents, some of which contain complex tabular data, and to answer questions over them. Once the parsed documents are indexed, an LLM can answer a specific legal question using context drawn from the precedents, statutes, and cases in those documents, so a user does not have to read through the often lengthy originals.

In this example, we will be using legal documents from the archive of the Library of Congress ([link to dataset](https://www.loc.gov/item/2020445568/)). These documents vary by format, with some containing pure text and others containing headings, sections, and large tables. The variety shows that LlamaParse can handle differently structured documents while the pipeline still retrieves accurate answers.

The documents in this example include:
- [APA Program Report](https://www.irs.gov/pub/irs-apa/a_2003-19.pdf)
- [2004 Report on the CRA performance of Barre Savings Bank in Barre, MA](https://github.com/user-attachments/files/16536412/barre_savings_bank_evaluation.pdf)
- [2016 Energy Supply/Demand Forecast](https://github.com/user-attachments/files/16536415/energy_supply_demand.pdf)
- [Transcript of Senate Committee Hearing about Foreign Markets](https://github.com/user-attachments/files/16536422/foreign_markets.pdf)
- [A Motion To Stay for an Indiana Court Case](https://github.com/user-attachments/files/16536427/motion_to_stay.pdf)
- [Article About an OC Representative's Bill to Introduce Offshore Drilling to CA](https://github.com/user-attachments/files/16536437/oc_bill_offshore_drilling.pdf)
- [Charter of the Subcommittee on Ocean Science and Technology](https://github.com/user-attachments/files/16536445/ost_subcommittee_charter.pdf)
- [US Immigration Case](https://github.com/user-attachments/files/16536446/us_immigration_case.pdf)

## Setup and Installation

Install LlamaIndex and LlamaParse, download the data, and apply `nest_asyncio` so that async code can run inside the notebook's already-running event loop.

In [None]:
%pip install llama-index llama-parse

In [None]:
!wget https://github.com/user-attachments/files/16447759/data.zip -O data.zip
!unzip -o data.zip
!rm data.zip

In [None]:
import nest_asyncio

nest_asyncio.apply()

Set up your OpenAI and LlamaCloud keys.

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "<Your OpenAI API Key>"
os.environ["LLAMA_CLOUD_API_KEY"] = "<Your LlamaCloud API Key>"
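
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, assuming the keys may instead be exported in the shell before the notebook starts, the check below reports any key that is unset, empty, or still a `<...>` placeholder. The helper `missing_keys` is hypothetical and not part of LlamaIndex or LlamaParse.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "LLAMA_CLOUD_API_KEY")


def missing_keys(env=os.environ, required=REQUIRED_KEYS) -> list[str]:
    """Return names of required keys that are unset, empty, or placeholders."""
    missing = []
    for name in required:
        value = env.get(name, "")
        # Treat "<Your ... Key>" style placeholders as missing (an assumption
        # about how placeholders look; real keys do not start with "<").
        if not value or value.startswith("<"):
            missing.append(name)
    return missing


# Report which keys still need to be set before parsing or indexing.
unset = missing_keys()
if unset:
    print("Missing API keys:", ", ".join(unset))
```

Running this before the parsing step gives a clear message up front rather than an authentication error partway through the pipeline.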

## Code Implementation

Set up LlamaParse. We want to parse the PDF files into Markdown, converting tabular data into Markdown tables. To improve parsing accuracy, we use the GPT-4o multimodal model to parse the PDFs.

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    parsing_instruction="Provided are a series of US legal documents.",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
    show_progress=True,
)

DATA_DIR = "data"


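def get_data_files(data_dir=DATA_DIR) -> list[str]:
    """Return the paths of all regular files directly inside `data_dir`."""
    files = []
    for f in os.listdir(data_dir):
        fname = os.path.join(data_dir, f)
        # Skip subdirectories; only files are passed to the parser.
        if os.path.isfile(fname):
            files.append(fname)
    return files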


files = get_data_files()
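
The helper above returns every file in the directory, in whatever order the operating system lists them. If the data folder might contain non-PDF files, a stricter variant can be sketched with `pathlib`. The name `get_pdf_files` is hypothetical, and the temporary directory exists only to demonstrate the behavior.

```python
import tempfile
from pathlib import Path


def get_pdf_files(data_dir: str = "data") -> list[str]:
    """Return sorted paths of the PDF files directly inside `data_dir`."""
    return sorted(
        str(p)
        for p in Path(data_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demonstration on a throwaway directory with mixed contents.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        Path(tmp, name).touch()
    Path(tmp, "sub").mkdir()
    demo = [Path(p).name for p in get_pdf_files(tmp)]

print(demo)
```

Sorting makes the file order reproducible across runs, which helps when comparing parsing results.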

Parse the files into documents containing the Markdown text extracted from the legal PDFs.

In [None]:
documents = parser.load_data(
    files,
    extra_info={"name": "US legal documents provided by the Library of Congress."},
)

Parsing files: 100%|██████████| 8/8 [01:25<00:00, 10.67s/it]
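
Since several of the source PDFs contain tables, it can be worth checking that the parsed Markdown actually includes them. The sketch below is a rough heuristic, not a full Markdown parser: it counts header separator rows such as `|---|---|`. The function name `count_markdown_tables` and the sample text are made up for illustration.

```python
def count_markdown_tables(markdown: str) -> int:
    """Estimate the number of tables by counting separator rows like |---|:--:|."""
    allowed = set("|-: ")
    count = 0
    for line in markdown.splitlines():
        stripped = line.strip()
        # A separator row uses only pipes, dashes, colons, and spaces,
        # and must contain at least one pipe and one dash.
        if stripped and set(stripped) <= allowed and "|" in stripped and "-" in stripped:
            count += 1
    return count


# Invented sample text standing in for one parsed document.
sample = (
    "# Heading\n\n"
    "| Column A | Column B |\n"
    "|----------|----------|\n"
    "| value 1  | value 2  |\n\n"
    "Plain paragraph.\n\n"
    "---\n"
)
print(count_markdown_tables(sample))

# With the parsed documents from above, one could run (not executed here):
# for doc in documents:
#     print(count_markdown_tables(doc.text))
```

A count of zero for a document known to contain tables would suggest revisiting the parser settings for that file.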


Set up LlamaIndex. Set the default LLM to GPT-4o (a multimodal model) and the embedding model to `text-embedding-3-large`, then build a vector index from the documents and persist it to disk. If an index has already been persisted, load it from storage instead of rebuilding it.

In [None]:
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
llm = OpenAI("gpt-4o")

Settings.llm = llm
Settings.embed_model = embed_model

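PERSIST_DIR = "./storage_legal"

if not os.path.exists(PERSIST_DIR):
    # First run: embed the parsed documents and save the index to disk.
    index = VectorStoreIndex(documents, embed_model=embed_model)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Later runs: reload the saved index instead of re-embedding.
    ctx = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(ctx)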

query_engine = index.as_query_engine()
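
Conceptually, a vector index answers a query by embedding it, finding the stored chunks whose embeddings are most similar, and passing those chunks to the LLM as context. The sketch below illustrates only the similarity step with cosine similarity in plain Python. It does not reproduce LlamaIndex's internals, and the three-dimensional vectors and chunk names are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return (chunk_id, score) pairs for the k chunks most similar to the query."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings"; real embedding vectors have far more dimensions.
chunks = {
    "bank_report": [0.9, 0.1, 0.0],
    "drilling_article": [0.0, 0.2, 0.9],
    "immigration_case": [0.1, 0.9, 0.1],
}
query = [1.0, 0.2, 0.0]

results = top_k(query, chunks, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

Here the query vector points in nearly the same direction as `bank_report`, so that chunk ranks first.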

## Example Queries

In [None]:
from IPython.display import display, Markdown

response = query_engine.query(
    "Where did the majority of Barre Savings Bank's loans go?"
)
display(Markdown(str(response)))

The majority of Barre Savings Bank's loans went to residential real estate, specifically 1-4 family mortgages, which accounted for 78.7 percent of the total loans.

In [None]:
response = query_engine.query(
    "Why does Mr. Kubarych believe foreign markets are so important?"
)
display(Markdown(str(response)))

Mr. Kubarych believes foreign markets are important because they are attractive to foreign investors for the same reasons they are attractive to Americans. The economic data is strong, and the high tech boom has created a positive perception that overshadows longer-term vulnerabilities. Additionally, foreign investors have high expectations for the U.S. to maintain a firm monetary policy in response to inflation and to act as a superpower rather than pursuing narrow nationalist economic policies.

In [None]:
response = query_engine.query(
    "Who is against the proposal of offshore drilling in CA and why?"
)
display(Markdown(str(response)))

House Speaker Nancy Pelosi and the Democratic majority are against the proposal of offshore drilling in California. Pelosi stated that offshore drilling is "off the table," and Democrats have been consistently unwilling to bend environmental rules. They argue that oil companies are not using the 68 million acres of federal lands already leased to them, either because it takes a long time or they lack the necessary equipment.

In [None]:
response = query_engine.query(
    "What is the purpose of the Ocean Science and Technology Subcommittee?"
)
display(Markdown(str(response)))

The purpose of the Ocean Science and Technology Subcommittee (SOST) is to advise and assist the Committee on Environment, Natural Resources, and Sustainability on national issues of ocean science and technology. The SOST aims to contribute to the goals for Federal ocean science and technology by developing coordinated interagency strategies. It also retains the functions of the previously-chartered Joint Subcommittee on Ocean Science and Technology and serves as the Ocean Science and Technology Interagency Policy Committee for the National Ocean Council.

In [None]:
response = query_engine.query("Why is the immigration appeal dismissed?")
display(Markdown(str(response)))

The immigration appeal is dismissed because the petitioner is not a U.S. citizen, and therefore, is not eligible to file a Petition for Alien Fiancé(e) (Form I-129F) on behalf of the beneficiary. The relevant law provides nonimmigrant classification only to aliens who are the fiancé(e)s of U.S. citizens.

In [None]:
response = query_engine.query("What is an advance pricing agreement?")
display(Markdown(str(response)))

An advance pricing agreement (APA) is a binding contract between a taxpayer and the IRS that establishes an approved transfer pricing method (TPM) for specific transactions. This agreement aims to prevent disputes over transfer pricing by ensuring that the taxpayer's tax returns for the covered years are consistent with the agreed TPM. APAs can be unilateral, involving only the taxpayer and the IRS, or bilateral/multilateral, involving agreements with one or more foreign tax authorities to avoid double taxation.
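
To check which passages an answer was grounded in, the response object exposes the retrieved chunks through `source_nodes`, each carrying a `score` and a `node`. The helper below is a small sketch for printing them. The name `summarize_sources` is hypothetical, and the stub object only stands in for a real response so the example runs without API calls.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars: int = 80) -> list[str]:
    """Return one line per retrieved chunk: similarity score and a text preview."""
    lines = []
    for item in response.source_nodes:
        text = item.node.get_content().replace("\n", " ")
        # The score can be None for some retrievers, so handle that case.
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        lines.append(f"score={score} | {text[:max_chars]}")
    return lines


# Stub mimicking the shape of a query response (invented values).
stub = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.8123,
            node=SimpleNamespace(get_content=lambda: "Example\nchunk text"),
        )
    ]
)
summary = summarize_sources(stub)
print("\n".join(summary))

# With a real response from the queries above (not executed here):
# print("\n".join(summarize_sources(response)))
```

Reviewing the sources this way helps confirm that an answer came from the intended document rather than from a similar-sounding passage elsewhere.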