# Building a RAG Pipeline over Legal Documents

<a href="https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/multimodal/legal_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This example shows how LlamaParse and LlamaIndex can be used to parse various types of legal documents, which may contain complex tabular data, and to answer questions over them. The benefit is quick retrieval of a specific answer to a legal question together with its context: the precedents, statutes, and cases presented in the given documents. With an LLM, a user can answer or dig into a specific legal question without reading through documents that are often long.

In this example, we will be using legal documents from the archive of the Library of Congress ([link to dataset](https://www.loc.gov/item/2020445568/)). These documents vary by format, with some containing pure text and others containing headings, sections, and large tables. The variety shows how LlamaParse handles different layouts while the pipeline still retrieves accurate results.

The documents in this example include:
- [APA Program Report](https://www.irs.gov/pub/irs-apa/a_2003-19.pdf)
- [2004 Report on the CRA performance of Barre Savings Bank in Barre, MA](https://github.com/user-attachments/files/16536412/barre_savings_bank_evaluation.pdf)
- [2016 Energy Supply/Demand Forecast](https://github.com/user-attachments/files/16536415/energy_supply_demand.pdf)
- [Transcript of Senate Committee Hearing about Foreign Markets](https://github.com/user-attachments/files/16536422/foreign_markets.pdf)
- [A Motion To Stay for an Indiana Court Case](https://github.com/user-attachments/files/16536427/motion_to_stay.pdf)
- [Article About an OC Representative's Bill to Introduce Offshore Drilling to CA](https://github.com/user-attachments/files/16536437/oc_bill_offshore_drilling.pdf)
- [Charter of the Subcommittee on Ocean Science and Technology](https://github.com/user-attachments/files/16536445/ost_subcommittee_charter.pdf)
- [US Immigration Case](https://github.com/user-attachments/files/16536446/us_immigration_case.pdf)

## Setup and Installation

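Install LlamaParse and LlamaIndex, download the data, and apply `nest_asyncio`, which lets LlamaParse's async calls run inside the notebook's already-running event loop.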

In [2]:
%pip install llama-index llama-parse

Note: you may need to restart the kernel to use updated packages.




In [3]:
!pip install wget






In [4]:
# Install requests if needed
%pip install requests

import requests
import zipfile
import os

# Download the file
url = "https://github.com/user-attachments/files/16447759/data.zip"
response = requests.get(url, timeout=60)
response.raise_for_status()  # Stop here if the download failed
with open("data.zip", "wb") as file:
    file.write(response.content)

# Unzip the file
with zipfile.ZipFile("data.zip", 'r') as zip_ref:
    zip_ref.extractall()  # Extract to current directory

# Remove the zip file
os.remove("data.zip")
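
The unzip step trusts that the download produced a valid archive. A minimal sketch of a more defensive extraction, using only the standard library, might look like the following. `extract_archive` is a hypothetical helper name and is not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path: str, dest_dir: str = ".") -> list[str]:
    """Extract zip_path into dest_dir and return the extracted PDF paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # testzip() returns the first corrupt member name, or None if all are fine
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        zip_ref.extractall(dest)
        names = zip_ref.namelist()
    # Keep only PDF entries; directory entries and other files are skipped
    return sorted(str(dest / n) for n in names if n.lower().endswith(".pdf"))
```

Calling `extract_archive("data.zip")` before removing the zip file would return the list of PDFs that were unpacked, which makes a truncated or partial download visible right away.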






In [5]:
import nest_asyncio

nest_asyncio.apply()

Set up your LlamaCloud key. An OpenAI key is only needed if you switch to the commented-out OpenAI models later in the notebook.

In [6]:
!pip install python-dotenv





Load the keys from a `secrets.env` file with python-dotenv.

In [7]:
from dotenv import load_dotenv
import os

# Load the .env file
load_dotenv("secrets.env")

# Now you can access the secrets
#openai_api_key = os.getenv("OPENAI_API_KEY")
llama_cloud_api_key = os.getenv("LLAMA_CLOUD_API_KEY")
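
`load_dotenv` does not raise when the file or a variable is missing, so `os.getenv` can quietly return `None` and the failure only shows up later, during parsing. The sketch below fails early and masks the key for display. It assumes `secrets.env` holds one `NAME=value` pair per line; `require_env` and `mask` are hypothetical helper names.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add a line like {name}=<your key> to secrets.env"
        )
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Show only the first few characters so the key never lands in cell output."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)
```

With these in place, `mask(require_env("LLAMA_CLOUD_API_KEY"))` confirms the key is loaded without writing the secret into the saved notebook.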

In [8]:
# Confirm the key loaded without echoing the secret itself
print("LLAMA_CLOUD_API_KEY set:", bool(llama_cloud_api_key))


LLAMA_CLOUD_API_KEY set: True


Install the LlamaIndex core package and the integrations for Ollama and Hugging Face embeddings.

In [9]:
%pip install llama-index-llms-ollama









In [10]:
%pip install llama-index-embeddings-huggingface


Note: you may need to restart the kernel to use updated packages.




In [11]:
%pip install llama-index-core

Note: you may need to restart the kernel to use updated packages.




## Code Implementation

Set up LlamaParse. We want to parse the PDF files into markdown, translating the tabular data into markdown tables. To ensure accuracy, we will use the GPT-4o multimodal model to parse the PDFs.

In [12]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    parsing_instruction="Provided are a series of US legal documents.",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
    show_progress=True,
)

DATA_DIR = "data"


def get_data_files(data_dir=DATA_DIR) -> list[str]:
    files = []
    for f in os.listdir(data_dir):
        fname = os.path.join(data_dir, f)
        if os.path.isfile(fname):
            files.append(fname)
    return files


files = get_data_files()
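
`os.listdir` returns entries in arbitrary order, and an extracted folder may also contain non-PDF files such as hidden system files. A possible variant that keeps only PDFs and sorts them for a reproducible order is sketched below. `get_pdf_files` is a hypothetical name, not something the notebook defines.

```python
import os


def get_pdf_files(data_dir: str = "data") -> list[str]:
    """Return the PDF paths in data_dir, sorted for a reproducible order."""
    return sorted(
        os.path.join(data_dir, f)
        for f in os.listdir(data_dir)
        if f.lower().endswith(".pdf") and os.path.isfile(os.path.join(data_dir, f))
    )
```

Passing `get_pdf_files()` to the parser instead of `get_data_files()` avoids uploading stray files.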

Load data from parser into documents containing parsed Markdown text from the legal document PDFs.

In [13]:
documents = parser.load_data(
    files,
    extra_info={"name": "US legal documents provided by the Library of Congress."},
)

Parsing files: 100%|█████████████████████████████████████████████████████████████████████| 8/8 [00:12<00:00,  1.50s/it]
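
Before indexing, it can help to check that each parsed document actually contains text and that tables came through as markdown rows. The sketch below assumes each document exposes `.text` and `.metadata`, as LlamaIndex `Document` objects do. `summarize_documents` is a hypothetical helper, and counting lines that start with `|` is only a rough proxy for table content.

```python
def summarize_documents(documents, preview_chars: int = 80) -> list[dict]:
    """Build one summary row per parsed document for a quick sanity check."""
    rows = []
    for doc in documents:
        lines = doc.text.splitlines()
        rows.append(
            {
                "metadata": dict(doc.metadata),
                "chars": len(doc.text),
                # Markdown table rows start with a pipe character
                "table_rows": sum(1 for ln in lines if ln.lstrip().startswith("|")),
                "preview": doc.text[:preview_chars],
            }
        )
    return rows
```

A document with `chars` equal to 0 or an unexpectedly low `table_rows` count is a candidate for re-parsing.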


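Set up LlamaIndex. The original configuration used OpenAI's GPT-4o and `text-embedding-3-large`; this run instead uses local models served by Ollama: `gemma2:2b` as the LLM and `all-minilm` for embeddings. The cell builds an in-memory index from the documents and runs a first query. The commented-out lines show the OpenAI configuration and how to persist the index to `./storage_legal` and reload it on later runs.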
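
The indexing cell below calls a local Ollama server, so both models need to be available before it runs. The following is a sketch of a preflight check, assuming Ollama listens on its default address `http://localhost:11434` and that its `/api/tags` endpoint returns a JSON body with a `models` list whose entries carry a `name` field. The helper names are hypothetical.

```python
import json
import urllib.request


def normalize(model: str) -> str:
    """Ollama lists untagged models as '<name>:latest'."""
    return model if ":" in model else f"{model}:latest"


def missing_models(tags_payload: dict, required: list[str]) -> list[str]:
    """Return the required models absent from an /api/tags response body."""
    installed = {m["name"] for m in tags_payload.get("models", [])}
    return [m for m in required if normalize(m) not in installed]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Query a running Ollama server for its locally available models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Calling `missing_models(fetch_tags(), ["all-minilm", "gemma2:2b"])` should return an empty list when everything is in place; any name it reports can be fetched with `ollama pull`.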

In [26]:
%pip install llama-index-embeddings-ollama llama-index-llms-ollama

Collecting llama-index-embeddings-ollama
  Downloading llama_index_embeddings_ollama-0.3.0-py3-none-any.whl (2.6 kB)
Installing collected packages: llama-index-embeddings-ollama
Successfully installed llama-index-embeddings-ollama-0.3.0
Note: you may need to restart the kernel to use updated packages.





In [28]:
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# Local models served by Ollama: all-minilm for embeddings, gemma2:2b as the LLM
Settings.embed_model = OllamaEmbedding(model_name="all-minilm")
Settings.llm = Ollama(
    model="gemma2:2b",
    temperature=0,
    request_timeout=360.0,
    options={"num_ctx": 4096},
)

# Alternative (not run here): OpenAI models instead of Ollama
# from llama_index.embeddings.openai import OpenAIEmbedding
# from llama_index.llms.openai import OpenAI
# Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
# Settings.llm = OpenAI("gpt-4o")

# Alternative (not run here): Hugging Face embeddings with an Ollama-served llama2
# from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Settings.llm = Ollama(model="llama2", request_timeout=120.0)

# Build an in-memory vector index over the parsed documents
index = VectorStoreIndex.from_documents(documents)

# Alternative (not run here): persist the index and reload it on later runs
# if not os.path.exists("storage_legal"):
#     index = VectorStoreIndex.from_documents(documents)
#     index.storage_context.persist(persist_dir="./storage_legal")
# else:
#     ctx = StorageContext.from_defaults(persist_dir="./storage_legal")
#     index = load_index_from_storage(ctx)

query_engine = index.as_query_engine()

# Smoke-test query
response = query_engine.query("Why was this post written? What event was it at?")
print(response)

This post is a certification document from the Commonwealth of Massachusetts regarding an examination of the Barre Savings Bank. It details that the report on the bank's compliance with consumer and fair lending rules, as well as the Community Reinvestment Act (CRA), has been read to or by the undersigned. 

The event this post likely pertains to is a formal review of the Barre Savings Bank conducted for compliance with relevant regulations.  
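
The persist-and-reload pattern that appears commented out in the cell above can be wrapped in a small function so the embeddings are computed only once. The sketch below uses the same `llama_index.core` calls as that cell. `load_or_build_index` is a hypothetical name, and the imports are placed inside the function only so the helper can be defined before the package is installed.

```python
import os


def load_or_build_index(documents, persist_dir: str = "./storage_legal"):
    """Reload a persisted index if one exists; otherwise build and persist it."""
    # Imported lazily so this helper can be defined before llama-index is installed
    from llama_index.core import (
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if os.path.isdir(persist_dir):
        ctx = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(ctx)

    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```

A reloaded index should be queried with the same embedding model that built it, so the stored vectors and the query vectors are comparable. Delete the directory whenever the documents or the embedding model change.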



## Example Queries

In [20]:
from IPython.display import display, Markdown

response = query_engine.query(
    "Are there documents here?"
)
display(Markdown(str(response)))

Empty Response
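
One way to investigate an `Empty Response` is to look at what the retriever returned for the query. The sketch below assumes the response object exposes `source_nodes`, each with a `score` and a `node` that supports `get_content()`, as in recent LlamaIndex releases. `describe_sources` is a hypothetical helper.

```python
def describe_sources(response, preview_chars: int = 120) -> list[str]:
    """Return one line per retrieved node: similarity score plus a text preview."""
    lines = []
    for source in response.source_nodes:
        score = "n/a" if source.score is None else f"{source.score:.3f}"
        text = source.node.get_content().replace("\n", " ")
        lines.append(f"score={score} | {text[:preview_chars]}")
    return lines
```

An empty list from `describe_sources(response)` suggests that nothing was retrieved, which points at the index or the query rather than at the LLM.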

In [29]:
response = query_engine.query(
    "Why does Mr. Kubarych believe foreign markets are so important?"
)
display(Markdown(str(response)))

Mr. Kubarych believes foreign markets are attractive because they offer a high potential for growth and return on investment. He attributes this to several factors, including the strong economic data of the US, the commitment to fiscal discipline, an independent central bank pursuing monetary policy effectively, and a bipartisan commitment to open markets and an enterprise-driven economy.  


In [None]:
response = query_engine.query(
    "Who is against the proposal of offshore drilling in CA and why?"
)
display(Markdown(str(response)))

In [None]:
response = query_engine.query(
    "What is the purpose of the Ocean Science and Technology Subcommittee?"
)
display(Markdown(str(response)))

In [None]:
response = query_engine.query("Why is the immigration appeal dismissed?")
display(Markdown(str(response)))

In [None]:
response = query_engine.query("What is an advance pricing agreement?")
display(Markdown(str(response)))
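
The query cells above repeat the same pattern, so they can also be run as a batch. The following is a minimal sketch. `run_queries` is a hypothetical helper that only assumes the engine has a `query` method whose result converts to a string.

```python
def run_queries(query_engine, questions: list[str]) -> dict[str, str]:
    """Ask each question in turn and collect the answers as plain strings."""
    return {q: str(query_engine.query(q)) for q in questions}
```

For example, `run_queries(query_engine, ["Why is the immigration appeal dismissed?", "What is an advance pricing agreement?"])` returns a dictionary mapping each question to its answer, which is convenient for comparing runs across models.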