# Building a RAG Pipeline over Legal Documents

<a href="https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/parse/multimodal/legal_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This example shows how to use LlamaParse and LlamaIndex to parse various types of legal documents, some of which contain complex tabular data, and to answer questions about them. With a RAG pipeline, a user can get a specific answer to a legal question, grounded in the precedents, statutes, and cases presented in the given documents, without having to read through the often lengthy source material.

In this example, we will be using legal documents from the archive of the Library of Congress ([link to dataset](https://www.loc.gov/item/2020445568/)). These documents vary by format, with some containing pure text and others containing headings, sections, and large tables. This shows that LlamaParse can handle a wide variety of document layouts while still giving the RAG pipeline what it needs to return accurate answers.

The documents in this example include:
- [APA Program Report](https://www.irs.gov/pub/irs-apa/a_2003-19.pdf)
- [2004 Report on the CRA performance of Barre Savings Bank in Barre, MA](https://github.com/user-attachments/files/16536412/barre_savings_bank_evaluation.pdf)
- [2016 Energy Supply/Demand Forecast](https://github.com/user-attachments/files/16536415/energy_supply_demand.pdf)
- [Transcript of Senate Committee Hearing about Foreign Markets](https://github.com/user-attachments/files/16536422/foreign_markets.pdf)
- [A Motion To Stay for an Indiana Court Case](https://github.com/user-attachments/files/16536427/motion_to_stay.pdf)
- [Article About an OC Representative's Bill to Introduce Offshore Drilling to CA](https://github.com/user-attachments/files/16536437/oc_bill_offshore_drilling.pdf)
- [Charter of the Subcommittee on Ocean Science and Technology](https://github.com/user-attachments/files/16536445/ost_subcommittee_charter.pdf)
- [US Immigration Case](https://github.com/user-attachments/files/16536446/us_immigration_case.pdf)

## Setup and Installation

Install LlamaIndex, download the data, and set your API keys.

In [None]:
%pip install "llama-index>=0.13.0,<0.14.0" llama-cloud-services

In [None]:
!wget https://github.com/user-attachments/files/16447759/data.zip -O data.zip
!unzip -o data.zip
!rm data.zip

Set up your OpenAI and LlamaCloud keys.

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."
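
Hardcoding keys in a notebook cell makes them easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helper below reuses a key that is already exported in the environment and otherwise prompts for it without echoing the input. `ensure_api_key` is a hypothetical name used for illustration; it is not part of LlamaIndex or LlamaCloud.

```python
import os
from getpass import getpass


def ensure_api_key(name: str) -> str:
    """Return the key stored in environment variable `name`, prompting once if it is unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        # getpass hides the typed value so it does not appear in the cell output.
        value = getpass(f"Enter {name}: ").strip()
    if not value:
        raise ValueError(f"{name} is required but was not provided")
    # Store the cleaned value so libraries that read the environment can find it.
    os.environ[name] = value
    return value
```

Calling `ensure_api_key("OPENAI_API_KEY")` and `ensure_api_key("LLAMA_CLOUD_API_KEY")` would then stand in for the two assignments above.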

## Code Implementation

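Set up LlamaParse. We want to parse the PDF files into Markdown while preserving the tabular data. To improve accuracy, we use the agent-based page parsing mode (`parse_page_with_agent`) with the `openai-gpt-4-1-mini` model, enable high-resolution OCR, and emit tables as HTML (`output_tables_as_HTML=True`) inside the Markdown output.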

In [None]:
from llama_cloud_services import LlamaParse

parser = LlamaParse(
    parse_mode="parse_page_with_agent",
    model="openai-gpt-4-1-mini",
    high_res_ocr=True,
    adaptive_long_table=True,
    outlined_table_extraction=True,
    output_tables_as_HTML=True,
)

DATA_DIR = "data"


def get_data_files(data_dir=DATA_DIR) -> list[str]:
    files = []
    for f in os.listdir(data_dir):
        fname = os.path.join(data_dir, f)
        if os.path.isfile(fname):
            files.append(fname)
    return files


files = get_data_files()
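
`get_data_files` returns every regular file in the folder, so a stray hidden file or other non-PDF leftover would also be sent to the parser. A hedged variant that keeps only PDFs and returns them in a stable order could look like the sketch below. The name `get_pdf_files` is an assumption for illustration, as is the premise that all inputs carry a `.pdf` extension.

```python
from pathlib import Path


def get_pdf_files(data_dir: str = "data") -> list[str]:
    """Return the sorted paths of the PDF files directly inside data_dir."""
    return sorted(
        str(path)
        for path in Path(data_dir).iterdir()
        # Skip subdirectories and match the extension case-insensitively.
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Sorting makes the order of submitted parse jobs reproducible across runs, which `os.listdir` does not guarantee.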

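Parse the files and load the results into documents containing the parsed Markdown text of the legal PDFs, one document per page. Each document is also tagged with a short `context` metadata entry describing where the collection comes from.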

In [None]:
results = await parser.aparse(files)

documents = []
for result in results:
    documents.extend(result.get_markdown_documents(split_by_page=True))

for document in documents:
    document.metadata[
        "context"
    ] = "US legal documents provided by the Library of Congress."


Set up LlamaIndex to query the data using RAG.

In [None]:
from llama_index.core import (
    VectorStoreIndex,
    Settings,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
llm = OpenAI(model="gpt-5-mini")

Settings.llm = llm
Settings.embed_model = embed_model

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
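
For intuition about what happens at query time: the question is embedded, the chunks whose embeddings are most similar to it are retrieved, and those chunks are passed to the LLM as context. The toy sketch below illustrates only the retrieval step, using made-up three-dimensional vectors and plain cosine similarity. It is not LlamaIndex's implementation, and the names `cosine_similarity`, `top_k`, `toy_chunks`, and `toy_query` are invented for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Rank chunks by similarity to the query and keep the best k."""
    scored = [
        (name, cosine_similarity(query_vec, vec)) for name, vec in chunk_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up "embeddings"; real embedding vectors have far more dimensions.
toy_chunks = {
    "bank_loans": [0.9, 0.1, 0.0],
    "offshore_drilling": [0.0, 0.2, 0.9],
    "immigration_appeal": [0.1, 0.9, 0.1],
}
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of a question about bank lending

for name, score in top_k(toy_query, toy_chunks, k=2):
    print(f"{name}: {score:.3f}")
```

Here `bank_loans` ranks first because its vector points in nearly the same direction as the query. In LlamaIndex, the number of retrieved chunks can be adjusted with the `similarity_top_k` argument, for example `index.as_query_engine(similarity_top_k=5)`.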

## Example Queries

In [None]:
from IPython.display import display, Markdown

response = query_engine.query(
    "Where did the majority of Barre Savings Bank's loans go?"
)
display(Markdown(str(response)))

The majority went to residential real estate lending—primarily 1–4 family mortgages (about 78.7% of loans, with home equity lines adding another 8.7%, for a total of 87.4%).

In [None]:
response = query_engine.query(
    "Why does Mr. Kubarych believe foreign markets are so important?"
)
display(Markdown(str(response)))

He says foreign markets (especially U.S. markets) are attractive because:
- The underlying economic data are strong.
- The high‑tech boom creates a “halo” that attracts attention and investment.
- There is broad, nearly bipartisan political/economic stability.
- Foreign investors expect sensible foreign‑policy behavior and a firm monetary policy response to any rise in inflation.
- Large foreign institutions (investment funds, insurers, banks) therefore see the markets as a safe, desirable place to put money.

In [None]:
response = query_engine.query(
    "Who is against the proposal of offshore drilling in CA and why?"
)
display(Markdown(str(response)))

House Democrats — including Speaker Nancy Pelosi — and other Democratic lawmakers oppose drilling off the California coast. They say it should be "off the table" for environmental reasons, point out that there are already millions of acres of federal lands leased to oil companies that aren’t being developed, and note oil companies have told Pelosi those leases aren’t being used because development takes a long time or the companies lack the equipment. No Democrats signed on to the proposed bill.

In [None]:
response = query_engine.query(
    "What is the purpose of the Ocean Science and Technology Subcommittee?"
)
display(Markdown(str(response)))

To advise and assist the Committee on Environment, Natural Resources, and Sustainability on national ocean science and technology issues and to advance federal ocean S&T goals by developing coordinated interagency strategies. It also serves as the National Ocean Council’s Ocean Science and Technology Interagency Policy Committee and retains the mandated functions of the prior joint subcommittee. Key roles include fostering national ocean S&T priorities; facilitating interagency coordination of research, technology, infrastructure, education, and observation/mapping programs; expanding fundamental knowledge of the ocean and its links to the Earth system and society; advancing modeling and forecasting; advising on science and technology for ecosystem-based management and stewardship; supporting use of ocean S&T in coastal and marine policy; and recommending scientific and technical assessments.

In [None]:
response = query_engine.query("Why is the immigration appeal dismissed?")
display(Markdown(str(response)))

The appeal was dismissed because the petitioner is not a U.S. citizen, and the K‑1 fiancé(e) classification (Form I‑129F) is available only for fiancés of U.S. citizens. The denial is without prejudice, so the petitioner may file a new I‑129F if he becomes a U.S. citizen.

In [None]:
response = query_engine.query("What is an advance pricing agreement?")
display(Markdown(str(response)))

An advance pricing agreement (APA) is a binding contract between a taxpayer and the IRS that establishes an approved transfer pricing method (TPM) for specified related‑party (covered) transactions and tax years. If the taxpayer files its returns consistent with the agreed TPM, the IRS agrees not to seek an adjustment under IRC § 482 for those transactions. An APA can be unilateral (between the taxpayer and the IRS) or bilateral/multilateral (also agreeing with one or more foreign competent authorities), and is intended to resolve transfer‑pricing disputes in advance and, where bilateral, to reduce the risk of double taxation.
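
Because answers to legal questions should be traceable to the source text, it can help to check which parsed chunks supported a response. LlamaIndex responses expose the retrieved chunks through `source_nodes`. The helper below is a sketch under that assumption, and `describe_sources` is a hypothetical name rather than a library function.

```python
def describe_sources(response, max_chars: int = 200) -> list[str]:
    """Summarize the retrieved chunks attached to a query response.

    Assumes `response.source_nodes` is a list of objects exposing `.score`
    and `.node`, where the node has `.metadata` and `.get_content()`.
    """
    lines = []
    for rank, hit in enumerate(response.source_nodes, start=1):
        # Collapse whitespace so multi-line chunks fit on a single line.
        snippet = " ".join(hit.node.get_content().split())[:max_chars]
        score = "n/a" if hit.score is None else f"{hit.score:.3f}"
        lines.append(
            f"{rank}. score={score} metadata={hit.node.metadata} text={snippet!r}"
        )
    return lines
```

After any of the queries above, `print("\n".join(describe_sources(response)))` would list the supporting chunks, including the `context` metadata attached during parsing.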