# **Extracting Information from Legal Documents Using RAG**

## **Objective**

The main objective of this assignment is to process and analyse a collection of text files containing legal agreements (e.g., NDAs) to prepare them for implementing a **Retrieval-Augmented Generation (RAG)** system. The work involves the following steps:

* Understand the Cleaned Data: Gain a comprehensive understanding of the structure, content, and context of the cleaned dataset.
* Perform Exploratory Analysis: Conduct bivariate and multivariate analyses to uncover relationships and trends within the cleaned data.
* Create Visualisations: Develop meaningful visualisations to support the analysis and make findings interpretable.
* Derive Insights and Conclusions: Extract valuable insights from the cleaned data and provide clear, actionable conclusions.
* Document the Process: Provide a detailed description of the data, its attributes, and the steps taken during the analysis for reproducibility and clarity.

The ultimate goal is to transform the raw text data into a clean, structured, and analysable format that can be effectively used to build and train a RAG system for tasks like information retrieval, question-answering, and knowledge extraction related to legal agreements.

### **Business Value**  


The project aims to leverage RAG to enhance legal document processing for businesses, law firms, and regulatory bodies. The key business objectives include:

* Faster Legal Research: <br> Reduce the time lawyers and compliance officers spend searching for relevant case laws, precedents, statutes, or contract clauses.
* Improved Contract Analysis: <br> Automatically extract key terms, obligations, and risks from lengthy contracts.
* Regulatory Compliance Monitoring: <br> Help businesses stay updated with legal and regulatory changes by retrieving relevant legal updates.
* Enhanced Decision-Making: <br> Provide accurate and context-aware legal insights to assist in risk assessment and legal strategy.


**Use Cases**
* Legal Chatbots
* Contract Review Automation
* Tracking Regulatory Changes and Compliance Monitoring
* Case Law Analysis of past judgments
* Due Diligence & Risk Assessment

## **1. Data Loading, Preparation and Analysis** <font color=red> [20 marks] </font><br>

### **1.1 Data Understanding**

The dataset contains legal documents and contracts collected from various sources. The documents are present as text files (`.txt`) in the *corpus* folder.

There are four types of documents in the *corpus* folder, divided into four subfolders.
- `contractnli`: contains various non-disclosure and confidentiality agreements
- `cuad`: contains contracts with annotated legal clauses
- `maud`: contains various merger/acquisition contracts and agreements
- `privacy_qa`: a question-answering dataset containing privacy policies

The dataset also contains evaluation files in JSON format in the *benchmarks* folder. The files contain the questions and their answers, along with sources. For each of the above four folders, there is a `json` file: `contractnli.json`, `cuad.json`, `maud.json`, `privacy_qa.json`. The file structure is as follows:

```
{
    "tests": [
        {
            "query": <question1>,
            "snippets": [{
                    "file_path": <source_file1>,
                    "span": [ begin_position, end_position ],
                    "answer": <relevant answer to the question 1>
                },
                {
                    "file_path": <source_file2>,
                    "span": [ begin_position, end_position ],
                    "answer": <relevant answer to the question 1>
                }, ....
            ]
        },
        {
            "query": <question2>,
            "snippets": [{<answer context for que 2>}]
        },
        ... <more queries>
    ]
}
```
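
The structure above can be walked with a few lines of standard-library Python. The following is a minimal sketch rather than part of the assignment code: `iter_benchmark` is a hypothetical helper name, and the only keys it relies on are the ones shown in the structure (`tests`, `query`, `snippets`, `file_path`, `span`, `answer`).

```python
import json


def iter_benchmark(path):
    """Yield (query, file_path, span, answer) for every snippet in one benchmark file."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    for test in data.get("tests", []):
        for snippet in test.get("snippets", []):
            yield (
                test.get("query", ""),
                snippet.get("file_path", ""),
                tuple(snippet.get("span", ())),
                snippet.get("answer", ""),
            )
```

Flattening to one row per snippet keeps every answer span available, which matters when a query is supported by more than one passage.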

### **1.2 Load and Preprocess the data** <font color=red> [5 marks] </font><br>

#### Loading libraries

In [1]:
## The following libraries might be useful
!pip install -q langchain-openai
!pip install -U -q langchain-community
!pip install -U -q langchain-chroma
!pip install -U -q datasets
!pip install -U -q ragas
!pip install -U -q rouge_score

In [2]:
!pip install qdrant-client
!pip install -qU langchain-qdrant


Collecting qdrant-client
  Downloading qdrant_client-1.14.2-py3-none-any.whl.metadata (10 kB)
Collecting portalocker<3.0.0,>=2.7.0 (from qdrant-client)
  Downloading portalocker-2.10.1-py3-none-any.whl.metadata (8.5 kB)
Downloading qdrant_client-1.14.2-py3-none-any.whl (327 kB)
Downloading portalocker-2.10.1-py3-none-any.whl (18 kB)
Installing collected packages: portalocker, qdrant-client
Successfully installed portalocker-2.10.1 qdrant-client-1.14.2


In [3]:
# Import essential libraries
import warnings

# Suppress all warnings
warnings.filterwarnings('ignore')

import os
import openai
from langchain_openai import ChatOpenAI, OpenAI
from langchain_community.document_loaders import DirectoryLoader, TextLoader
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
from langchain_openai import OpenAIEmbeddings

# FAISS vector store (imported from langchain_community, where the vector stores now live)
from langchain_community.vectorstores import FAISS

from langchain.schema import Document
from chromadb.config import Settings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from transformers import AutoTokenizer

import re
import logging
import random
import numpy as np

import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns
import matplotlib.pyplot as plt


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
# import libraries
import glob
import json

In [5]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [6]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams


#### **1.2.1** <font color=red> [3 marks] </font>
Load all `.txt` files from the folders.

You can utilise document loaders from the options provided by the LangChain community.

Optionally, you can also read the files manually, while ensuring proper handling of encoding issues (e.g., utf-8, latin1). In such case, also store the file content along with metadata (e.g., file name, directory path) for traceability.
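
For the manual route, one possible approach is to try each encoding in turn and keep the file name and directory next to the content. This is a sketch under stated assumptions: `read_text_files` and the record keys are hypothetical names, and `latin1` is used as the last resort because it can decode any byte sequence (possibly into the wrong characters).

```python
from pathlib import Path


def read_text_files(root, encodings=("utf-8", "latin1")):
    """Read every .txt file under root, trying each encoding in order."""
    records = []
    for path in sorted(Path(root).rglob("*.txt")):
        for enc in encodings:
            try:
                text = path.read_text(encoding=enc)
            except UnicodeDecodeError:
                continue  # this encoding failed, try the next one
            records.append({
                "content": text,
                "file_name": path.name,
                "directory": str(path.parent),
                "encoding": enc,
            })
            break  # stop at the first encoding that works
    return records
```

Recording which encoding succeeded makes it easier to trace odd characters back to a specific file later.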

In [7]:
# Load the files as documents
# Define root folder
root ="/content/rag_legal"

# Cap on the number of documents to use (keeps token usage manageable)
documents_to_use = 1000

# Number of benchmark questions to use for evaluation
docs_for_test = 100

# Load every .txt file under the root folder as a LangChain document
docloader = DirectoryLoader(
    path=root,
    glob="**/*.txt",
    loader_cls=lambda path: TextLoader(path, encoding="utf-8"),
)

documents = docloader.load()

documents = documents[:documents_to_use]

print(f"{len(documents)} documents loaded from the system.")




258 documents loaded from the system.


#### **1.2.2** <font color=red> [2 marks] </font>
Preprocess the text data to remove noise and prepare it for analysis.

Remove special characters, extra whitespace, and irrelevant content such as email and telephone contact info.
Normalise text (e.g., convert to lowercase, remove stop words).
Handle missing or corrupted data by logging errors and skipping problematic files.
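
The "log and skip" requirement can be met with the standard `logging` module instead of `print`. The sketch below is one way to structure it; `clean_all` and the logger name are hypothetical, and `clean_fn` stands for whatever cleaning function is used.

```python
import logging

logger = logging.getLogger("preprocessing")


def clean_all(named_texts, clean_fn):
    """Apply clean_fn to each (name, text) pair; log and skip anything that fails."""
    cleaned, skipped = [], []
    for name, text in named_texts:
        try:
            if not isinstance(text, str) or not text.strip():
                raise ValueError("missing or invalid content")
            cleaned.append((name, clean_fn(text)))
        except Exception as exc:
            logger.warning("Skipping %s: %s", name, exc)
            skipped.append(name)
    return cleaned, skipped
```

Returning the skipped names alongside the cleaned texts leaves a record of which files were dropped.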

In [8]:
# Clean and preprocess the data: lowercase the text and remove emails, phone numbers, special characters, extra whitespace and stop words
stop_words = set(stopwords.words('english'))

def textpreprocessing(txt):
    txt = txt.lower()

    # Remove email addresses and phone numbers
    txt = re.sub(r'\b[\w\.-]+@[\w\.-]+\.\w{2,4}\b', '', txt)  # Emails
    txt = re.sub(r'\+?\d[\d\s().-]{7,}', '', txt)  # Phone numbers
    # Remove leading section numbers such as "1." or "2.3." at the start of a line
    txt = re.sub(r'^\s*\d+(?:\.\d+)*\.\s+', '', txt, flags=re.MULTILINE)
    # Remove special characters
    txt = re.sub(r'[^a-z0-9\s]', ' ', txt)

    # Remove additional whitespace
    txt = re.sub(r'\s+', ' ', txt).strip()

    # Remove stopwords
    txt = ' '.join([word for word in txt.split() if word not in stop_words])

    return txt


In [9]:
# Process the text, skipping missing or corrupted documents and reporting each skipped file
clean_documents = []

for i, doc in enumerate(documents):
    try:
        # Ensure the document has content and that the content is a string
        if not doc.page_content or not isinstance(doc.page_content, str):
            raise ValueError("Missing or invalid content")

        # Preprocess the document content using the textpreprocessing function defined above
        clean_txt = textpreprocessing(doc.page_content)

        # Create a new document with the cleaned text
        clean_doc = Document(page_content=clean_txt, metadata=doc.metadata)
        clean_documents.append(clean_doc)

    except Exception as e:
        # Report the error and skip the document
        print(f"skipping document {i} ({doc.metadata.get('source', 'unknown')}): {e}")

### **1.3 Exploratory Data Analysis** <font color=red> [10 marks] </font><br>

#### **1.3.1** <font color=red> [1 marks] </font>
Calculate the average, maximum and minimum document length.

In [11]:
# Calculate the average, maximum and minimum document length.
doc_lengths = [len(doc.page_content.split()) for doc in clean_documents]
# Ensure there are valid documents
if doc_lengths:
    average_length = sum(doc_lengths) / len(doc_lengths)
    max_length = max(doc_lengths)
    min_length = min(doc_lengths)

    print(f"average document length: {average_length:.2f} words")
    print(f"maximum document length: {max_length} words")
    print(f"minimum document length: {min_length} words")


average document length: 16355.71 words
maximum document length: 86518 words
minimum document length: 187 words
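
The mean (about 16,356 words) sits far from both the minimum (187) and the maximum (86,518), so the median and quartiles may describe a typical document better than the mean alone. The sketch below shows one way to compute them with the standard library; `length_summary` is a hypothetical helper, and it expects a list such as `doc_lengths` with at least two entries.

```python
import statistics


def length_summary(lengths):
    """Return mean, median and quartiles for a list of word counts."""
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "q1": q1,
        "q3": q3,
    }
```

A large gap between the mean and the median would point to a few very long documents pulling the average up.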


#### **1.3.2** <font color=red> [4 marks] </font>
Analyse the frequency of occurrence of words and find the most and least occurring words.

Find the 20 most common and least common words in the text. Ignore stop words such as articles and prepositions.

In [12]:
# Find frequency of occurrence of words
text_all = " ".join([doc.page_content for doc in clean_documents if isinstance(doc.page_content, str)])

# Tokenize the words in the text corpus
token = nltk.word_tokenize(text_all)

# Keep only alphabetic, non-stop-word tokens longer than 2 characters, since shorter tokens carry little meaning in this context
filteredwords = [word for word in token if word.isalpha() and word not in stop_words and len(word)>2]

# counting words
word_count = Counter(filteredwords)



In [13]:
# finding 20 most common and least common words from the list
most_common_word = word_count.most_common(20)
least_common_word = word_count.most_common()[-20:]

#printing words
print("20 Most Common Words:-----------")
for word, freq in most_common_word:
    print(f"{word}: {freq}")

print("20 Least Common Words:-----------")
for word, freq in least_common_word:
    print(f"{word}: {freq}")

20 Most Common Words:-----------
company: 144593
agreement: 64359
section: 63266
shall: 62579
parent: 60372
merger: 33550
subsidiaries: 32761
date: 30091
material: 30003
time: 26991
stock: 25348
applicable: 24410
party: 23228
respect: 22842
including: 20129
shares: 19356
prior: 18178
effect: 17706
effective: 17546
business: 17018
20 Least Common Words:-----------
strictions: 1
striction: 1
permissi: 1
appro: 1
priate: 1
cluding: 1
cer: 1
tify: 1
pur: 1
quested: 1
fac: 1
simile: 1
ceived: 1
modifi: 1
cation: 1
representa: 1
tives: 1
ity: 1
instru: 1
hwy: 1
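
Several of the rarest tokens (`appro` and `priate`, `modifi` and `cation`, `representa` and `tives`) look like halves of words that were hyphenated across line breaks in the source files. That reading is a guess from the output, not something verified against the corpus. If it holds, rejoining such words before cleaning would remove these fragments. A minimal sketch, with `join_hyphenated_breaks` as a hypothetical helper name:

```python
import re


def join_hyphenated_breaks(text):
    """Rejoin words split as 'appro-<newline>priate' into 'appropriate'."""
    return re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)
```

The trade-off is that a genuine compound that happens to break at its hyphen at the end of a line (for example `non-` followed by `disclosure`) would also be merged. Note too that every word listed above occurs exactly once, so which 20 count-1 words appear is arbitrary.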


#### **1.3.3** <font color=red> [4 marks] </font>
Analyse the similarity of different documents to each other based on TF-IDF vectors.

Transform some documents to TF-IDF vectors and calculate their similarity matrix using a suitable distance function. If contracts contain duplicate or highly similar clauses, similarity calculation can help detect them.

Compute the similarity matrix for the first 10 documents and then for 10 random documents. What do you observe?

In [20]:
# Transform the page contents of documents
vectorizer =  TfidfVectorizer(stop_words='english')

text = [doc.page_content for doc in clean_documents]

tfidf_matrix = vectorizer.fit_transform(text)

# Limit the similarity comparison to at most 10 documents
rows_to_use = min(10, len(clean_documents))

# Cosine similarity matrix
similarity_matrix = cosine_similarity(tfidf_matrix)
similarity_matrix

array([[1.        , 0.27016771, 0.19591596, ..., 0.1135406 , 0.09949609,
        0.11954423],
       [0.27016771, 1.        , 0.25325768, ..., 0.12606569, 0.11531278,
        0.14443233],
       [0.19591596, 0.25325768, 1.        , ..., 0.13055333, 0.12499495,
        0.15479189],
       ...,
       [0.1135406 , 0.12606569, 0.13055333, ..., 1.        , 0.53636702,
        0.5770068 ],
       [0.09949609, 0.11531278, 0.12499495, ..., 0.53636702, 1.        ,
        0.72951188],
       [0.11954423, 0.14443233, 0.15479189, ..., 0.5770068 , 0.72951188,
        1.        ]])

In [21]:
# Take the indices of the first rows_to_use documents
similarity_matrix_first_10_rows = np.arange(rows_to_use)
similarity_matrix_first_10 = similarity_matrix[np.ix_(similarity_matrix_first_10_rows, similarity_matrix_first_10_rows)]
similarity_matrix_first_10


array([[1.        , 0.27016771, 0.19591596, 0.18015   , 0.17652579,
        0.28404434, 0.3186445 , 0.0152269 , 0.02672378, 0.01066024],
       [0.27016771, 1.        , 0.25325768, 0.33072607, 0.25849221,
        0.40711709, 0.45026651, 0.02162569, 0.03022305, 0.0130836 ],
       [0.19591596, 0.25325768, 1.        , 0.2074717 , 0.18222677,
        0.30419919, 0.30319579, 0.01666827, 0.02801776, 0.01481307],
       [0.18015   , 0.33072607, 0.2074717 , 1.        , 0.20952028,
        0.3221088 , 0.35774938, 0.01295162, 0.02134733, 0.00777658],
       [0.17652579, 0.25849221, 0.18222677, 0.20952028, 1.        ,
        0.34138188, 0.24594313, 0.01444668, 0.0214458 , 0.01590115],
       [0.28404434, 0.40711709, 0.30419919, 0.3221088 , 0.34138188,
        1.        , 0.4051166 , 0.02609695, 0.04647384, 0.02175846],
       [0.3186445 , 0.45026651, 0.30319579, 0.35774938, 0.24594313,
        0.4051166 , 1.        , 0.01933755, 0.03289038, 0.01318865],
       [0.0152269 , 0.02162569, 0.0166682

In [22]:
# Compute similarity scores for 10 random documents

random.seed(42)
# Sample indices from clean_documents, which is what the similarity matrix was built from
random_index = random.sample(range(len(clean_documents)), rows_to_use)

similarity_matrix_random_10 = similarity_matrix[np.ix_(random_index, random_index)]
similarity_matrix_random_10

array([[1.        , 0.27197419, 0.51187666, 0.53870919, 0.46461686,
        0.1104136 , 0.52549552, 0.47131064, 0.07826347, 0.04745435],
       [0.27197419, 1.        , 0.11453695, 0.1021807 , 0.09076221,
        0.03157716, 0.09846828, 0.08160185, 0.0417437 , 0.0329441 ],
       [0.51187666, 0.11453695, 1.        , 0.93377923, 0.90652466,
        0.1650362 , 0.92748962, 0.8755106 , 0.11922077, 0.08322543],
       [0.53870919, 0.1021807 , 0.93377923, 1.        , 0.89651092,
        0.16080674, 0.93193786, 0.88314318, 0.10952139, 0.07752906],
       [0.46461686, 0.09076221, 0.90652466, 0.89651092, 1.        ,
        0.16117297, 0.86724993, 0.94072549, 0.11301663, 0.06915391],
       [0.1104136 , 0.03157716, 0.1650362 , 0.16080674, 0.16117297,
        1.        , 0.13300868, 0.13252525, 0.04106448, 0.02340527],
       [0.52549552, 0.09846828, 0.92748962, 0.93193786, 0.86724993,
        0.13300868, 1.        , 0.89592516, 0.10423037, 0.06612767],
       [0.47131064, 0.08160185, 0.8755106
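
In the random sample, several pairs score above 0.87 while most other entries stay below 0.2, which suggests a group of near-duplicate documents. Reading such pairs off a printed matrix is error-prone, so the sketch below ranks them directly. `top_similar_pairs` is a hypothetical helper; it works on any square matrix that supports `matrix[i][j]` indexing, including the NumPy arrays used in the cells above.

```python
def top_similar_pairs(matrix, k=3):
    """Return the k most similar (i, j, score) pairs with i < j, ignoring the diagonal."""
    n = len(matrix)
    pairs = [
        (i, j, matrix[i][j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:k]
```

Only the upper triangle is scanned because a cosine similarity matrix is symmetric and its diagonal is always 1.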

### **1.4 Document Creation and Chunking** <font color=red> [5 marks] </font><br>

#### **1.4.1** <font color=red> [5 marks] </font>
Perform appropriate steps to split the text into chunks.

In [17]:
# Process files and generate chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # Size of each chunk in characters
    chunk_overlap=100    # Optional overlap for context preservation
)

# Split the documents
chunked_docs = text_splitter.split_documents(clean_documents)

# View results for some records
for chunk in chunked_docs[:2]:
    print(f"Content: {chunk.page_content}, Metadata: {chunk.metadata}")


Content: collect information includingpersonal informationandnon identifying information interact us site example access use site register subscribe create account groupon open respond e mails refer friends family others groupon contact customer service use customer support tools provide information enroll participate inother programsprovided behalf together business partners visit page online displays ads content purchase products services site connect link site via social networking sites post comments toonline communities provide information ourvendors privacy statement apply collection information way listed privacy statement apply collection information way listed think benefit personalized experience know like however limit information provide groupon limit communications groupon sends commercial e mails may choose receive commercial e mails us following instructions contained commercial e mails send logging account adjusting e mail preferences please note even unsubscribe commer
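
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window version of chunking. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to cut at separators such as paragraph breaks, newlines and spaces, and because the cleaned text has had all whitespace collapsed to single spaces, only spaces remain as cut points. `sliding_window_chunks` is a hypothetical name used for illustration.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

Each chunk repeats the final `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.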

## **2. Vector Database and RAG Chain Creation** <font color=red> [15 marks] </font><br>

### **2.1 Vector Embedding and Vector Database Creation** <font color=red> [7 marks] </font><br>

#### **2.1.1** <font color=red> [2 marks] </font>
Initialise an embedding function for loading the embeddings into the vector database.

Initialise a function to transform the text to vectors using the OpenAI Embeddings module. You can also use this function to transform the text during vector DB creation itself.

In [25]:
# Initialise an embedding function
model = "text-embedding-3-small"
# API key placeholder (truncated here); avoid committing a real key to a notebook
openai.api_key = 'sk-proj-'
# Expose the key to LangChain through the environment variable
os.environ["OPENAI_API_KEY"] = openai.api_key

embedding_model = OpenAIEmbeddings(api_key=openai.api_key, model=model)
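
A key written into a notebook cell is easy to leak when the notebook is shared. One alternative is to read it from the environment and fail early if it is missing. This is a sketch, not the approach used in this notebook; `require_api_key` is a hypothetical helper, and the `environ` parameter exists only so the function can be tested without touching the real environment.

```python
import os


def require_api_key(env_var="OPENAI_API_KEY", environ=None):
    """Return the key stored in env_var, or raise if it is unset or empty."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key
```

The returned value could then be passed as the `api_key` argument when the embedding model is created.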


#### **2.1.2** <font color=red> [5 marks] </font>
Load the embeddings to a vector database.

Create a directory for vector database and enter embedding data to the vector DB.

In [26]:
# Add Chunks to vector DB

# Define storage directory
persist_dir = "faiss_doc_store"
os.makedirs(persist_dir, exist_ok=True)

# Embed the chunks and store them in a FAISS index
fvectorstore = FAISS.from_documents(chunked_docs, embedding_model)

# Persist to disk
fvectorstore.save_local(folder_path=persist_dir)


In [27]:
!pip install faiss-cpu




In [28]:
fvectorstore = FAISS.load_local(folder_path=persist_dir, embeddings=embedding_model, allow_dangerous_deserialization=True)


In [29]:
# Define storage directory
persist_dir = "qdrant_db"
os.makedirs(persist_dir, exist_ok=True)

# Qdrant configuration (local)
qclient = QdrantClient(path=persist_dir)
collection_name = "legal_collection"

# (Re)create the collection; this drops any existing collection with the same name
qclient.recreate_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),  # text-embedding-3-small returns 1536-dimensional vectors
)

# Load to Qdrant via LangChain wrapper
qvectorstore = QdrantVectorStore(
    client=qclient,
    collection_name=collection_name,
    embedding=embedding_model,
)

In [30]:
# Import uuid library
from uuid import uuid4

# Create document uuids
uuids = [str(uuid4()) for _ in range(len(chunked_docs))]

# Add to vector store
qvectorstore.add_documents(documents=chunked_docs, ids=uuids)

['84323d61-c5a9-4ce5-b49c-1dad7073fbd2',
 '8c01ca18-f0df-41b0-a9b2-92ff15a599a0',
 '7fe344cb-6d65-4bf7-b3ac-454b3aa08f4d',
 '937b48fd-5f85-4718-8d18-16ce063fd9c9',
 '8acceb0d-5a7e-44f0-8b2f-4ff9c21fe8d8',
 '28239849-0eee-4b0d-871d-2089e067faf1',
 '8c4f7501-247a-4c74-9cf2-568ffd097a44',
 '751940f1-b553-4785-a017-9469d1bb3cac',
 '689ae5a2-4c03-4d30-9299-9db835df33d4',
 'd94411e2-e111-41a5-8353-b5bfa7d221cb',
 '505f02f3-a341-486f-9e18-034f6cb267aa',
 'f1aa5e31-6f50-492d-bafe-64dde8374277',
 '8b9be102-7d74-4e1c-bb8b-00b30aa5251a',
 '0affd34a-4f80-45f3-bd80-8bcb6f629503',
 '22bc19c3-269e-4e34-8267-cde3eb1d5ef2',
 '51f4b0f8-ceb6-4d5a-841d-81c22b659ccd',
 'edbb882c-6ed0-4377-8832-b70f4cfa07a5',
 '998ad794-2a1e-48c5-b584-74ccd6f8ca9a',
 '750e4754-adf3-4186-8d48-be7a30533537',
 '9dd74ce2-c17e-4f51-accd-4d009a04bfc5',
 '0f148e7b-bce3-4103-b281-f0f14e2f841c',
 '7c42e6e2-ce2c-46f9-a1a6-161b49ff36fe',
 '8a1448f8-e30f-4b6e-a7ad-3a2a41bdd5ff',
 '03c6ea04-4911-4109-ae9c-3bcff1a76c57',
 '3b70205b-22ad-
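
The Qdrant collection above is configured with `Distance.COSINE`, so retrieval ranks stored chunks by the cosine similarity between their embeddings and the query embedding. The brute-force sketch below illustrates that ranking on plain Python lists. It is a conceptual illustration only: real vector stores add indexing and persistence, and `top_k` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, stored, k=2):
    """stored maps id -> vector; return the k best (id, score) pairs, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]
```

For example, `top_k([1, 0], {"a": [1, 0], "b": [0, 1], "c": [1, 1]})` ranks `a` first and `c` second, because `b` is orthogonal to the query.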

### **2.2 Create RAG Chain** <font color=red> [8 marks] </font><br>

#### **2.2.1** <font color=red> [5 marks] </font>
Create a RAG chain.

In [31]:
# Create a RAG chain

# import RAG chain
from langchain.chains import RetrievalQA

# Initialize LLM
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create RetrievalQA Chain (RAG) for qdrant
qrag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=qvectorstore.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True  # optional: for source tracking
)

# Create RetrievalQA Chain (RAG) for FAISS
frag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=fvectorstore.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True  # optional: for source tracking
)

#### **2.2.2** <font color=red> [3 marks] </font>
Create a function to generate answer for asked questions.

Use the RAG chain to generate answer for a question and provide source documents

In [32]:
# Create a function for question answering

def search_faiss(question):
    # Retrieve the single most similar chunk from FAISS
    results = fvectorstore.similarity_search(question, k=1)

    str_result = ""

    for res in results:
        str_result = str_result + f" {res.page_content} [{res.metadata}] \n"

    return str_result

In [33]:
def search_qdrant(question):
    # Retrieve the single most similar chunk from Qdrant
    results = qvectorstore.similarity_search(question, k=1)

    str_result = ""

    for res in results:
        str_result = str_result + f" {res.page_content} [{res.metadata}] \n"

    return str_result
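
The two search helpers return retrieved chunks rather than generated answers. To get an answer together with its sources, a `RetrievalQA` chain such as `qrag_chain` can be invoked directly. The sketch below assumes the chain's default `query` input key and `result` output key, and that it was built with `return_source_documents=True`; `answer_with_sources` is a hypothetical helper name.

```python
def answer_with_sources(chain, question):
    """Run a RetrievalQA-style chain and return the answer with its source paths."""
    output = chain.invoke({"query": question})
    sources = [
        doc.metadata.get("source", "unknown")
        for doc in output.get("source_documents", [])
    ]
    return {"answer": output["result"], "sources": sources}
```

Passing the chain in as an argument means the same helper works for both the Qdrant-backed and the FAISS-backed chain.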

In [34]:
# Imports for a LangChain Expression Language (LCEL) chain with a prompt template
from langchain.schema.runnable import RunnablePassthrough
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema.output_parser import StrOutputParser

# Function to provide answer based on custom LCEL chain
def qna_template(question):
    # Get retriever
    retriever = qvectorstore.as_retriever()
    # template to be passed to the prompt
    template = """Answer the question based only on the following context:
    {context}

    Question: {question}
    """

    prompt = PromptTemplate.from_template(template)
    model = OpenAI()

    # Context and RunnablePassthrough input to the chain
    chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | model
        | StrOutputParser()
    )

    response = chain.invoke(question)
    return response
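
In `qna_template`, the retriever's output (a list of `Document` objects) goes into the prompt as-is, so the model sees the string representation of the whole list, metadata included. A common refinement is to join only the text of the retrieved chunks. A minimal sketch, with `format_docs` as a hypothetical helper name:

```python
def format_docs(docs):
    """Join the text of retrieved documents into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)
```

In an LCEL chain this could be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, since a plain function piped after a runnable is wrapped automatically.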

In [35]:
# Example question and  answer
question ="Consider the Non-Disclosure Agreement between CopAcc and ToP Mentors; Does the document indicate that the Agreement does not grant the Receiving Party any rights to the Confidential Information?"

# search from FAISS
print(search_faiss(question))
print("---------------------")
# search from Qdrant
print(search_qdrant(question))
print("---------------------")

# Both return the same result, so Qdrant is used going forward

 without prior written consent participants mentor shall require employees access confidential information commit non disclosure agreement protects confidential information least degree agreement mentor shall take reasonable measures protect secrecy avoid unauthorised disclosure use confidential information measures shall include highest degree care mentor utilises protect mentor confidential information similar nature less reasonable care notwithstanding mentor right assess rate ideas participants mentor shall use confidential information third parties purposes shall file intellectual property right protection confidential information parts mentor shall notify organiser immediately writing misuse misappropriation confidential information may come mentor attention mentor agrees segregate confidential information relating agreement confidential information others avoid commingling 6 discontinuation use return materials organiser first request mentor shall discontinue use confidential [{

In [36]:
print(qna_template(question))



No, the document indicates that the Agreement does grant the Receiving Party rights to the Confidential Information, but with certain restrictions and requirements.


## **3. RAG Evaluation** <font color=red> [10 marks] </font><br>

### **3.1 Evaluation and Inference** <font color=red> [10 marks] </font><br>

#### **3.1.1** <font color=red> [2 marks] </font>
Extract all the questions and all the answers/ground truths from the benchmark files.

Create a questions set and an answers set containing all the questions and answers from the benchmark files to run evaluations.

In [43]:
# Create a question set by taking all the questions from the benchmark data
# Also create a ground truth/answer set

# Find all JSON files in the benchmarks folder
benchmarks = glob.glob(root + "/benchmarks/*.json")
benchmarks


['/content/rag_legal/benchmarks/contractnli.json',
 '/content/rag_legal/benchmarks/cuad.json',
 '/content/rag_legal/benchmarks/maud.json',
 '/content/rag_legal/benchmarks/privacy_qa.json']

In [44]:
# Import pandas
import pandas as pd

In [45]:
# Build the question and ground-truth answer sets from the benchmark files
benchmark = []
for file_path in benchmarks:
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

        # Access the "tests" list
        test_cases = data.get("tests", [])

        for entry in test_cases:
            question = entry.get("query", "").strip()
            answer = ""
            snippets = entry.get("snippets", [])
            if snippets:
                answer = snippets[0].get("answer", "").strip()
            benchmark.append({"question": question, "answer": answer})

# Create DataFrame
df_benchmark = pd.DataFrame(benchmark)

df_benchmark.head()


|   | question | answer |
|---|---|---|
| 0 | Consider the Non-Disclosure Agreement between ... | Any and all proprietary rights, including but ... |
| 1 | Consider the Non-Disclosure Agreement between ... | “Confidential Information” means any Idea disc... |
| 2 | Consider the Non-Disclosure Agreement between ... | Notwithstanding the termination of this Agreem... |
| 3 | Consider the Non-Disclosure Agreement between ... | At Organiser’s first request, Mentor shall: |
| 4 | Consider the Non-Disclosure Agreement between ... | Mentor shall not disclose any Confidential Inf... |
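
The loop above keeps only the first snippet's answer for each query. Where a query has several snippets, the reference text is therefore incomplete. One option is to concatenate all snippet answers. Whether a joined reference is the right ground truth for every metric is an assumption, and `reference_answer` is a hypothetical helper name.

```python
def reference_answer(entry, separator=" "):
    """Concatenate the answers of every snippet in one benchmark test entry."""
    parts = [s.get("answer", "").strip() for s in entry.get("snippets", [])]
    return separator.join(p for p in parts if p)
```

Inside the loop, `reference_answer(entry)` could replace the `snippets[0]` lookup. An entry with no snippets yields an empty string, as it does in the cell above.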


#### **3.1.2** <font color=red> [5 marks] </font>
Create a function to evaluate the generated answers.

Evaluate the responses on *Rouge*, *Ragas* and *Bleu* scores.

In [46]:
# Import for BLEU score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
nltk.download('punkt')

# import for ROUGE score
from rouge_score import rouge_scorer

# import for ragas score
from ragas.metrics import answer_relevancy, faithfulness, context_precision
from ragas.evaluation import evaluate
from datasets import Dataset

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [47]:
# Function to evaluate generated answers using BLEU and ROUGE-L scores
def evaluate_bleu_rogue():
    smoothie = SmoothingFunction().method4
    bleu_scores = []
    rouge_scores = []

    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

    df_benchmark['generated_answer'] = None
    # Note: 'search_qdrant' is initialised but never filled; the retrieved
    # text is written to the 'contexts' column inside the loop
    df_benchmark['search_qdrant'] = None


    for _, row in df_benchmark.iterrows():
        ref = row['answer']
        pred = qna_template(row['question'])
        # Store the generated answer and the retrieved context for this row
        df_benchmark.loc[row.name, 'generated_answer'] = pred
        df_benchmark.loc[row.name, 'contexts'] = search_qdrant(row['question'])

        # BLEU on lowercased word tokens
        ref_tokens = nltk.word_tokenize(ref.lower())
        pred_tokens = nltk.word_tokenize(pred.lower())

        score = sentence_bleu([ref_tokens], pred_tokens, smoothing_function=smoothie)
        bleu_scores.append(score)

        # ROUGE-L F-measure between the reference and the generated answer
        rouge_result = scorer.score(ref, pred)
        rouge_scores.append(rouge_result['rougeL'].fmeasure)

    avg_bleu = sum(bleu_scores) / len(bleu_scores)
    avg_rouge = sum(rouge_scores) / len(rouge_scores)

    return {"avg_bleu": avg_bleu, "avg_rouge": avg_rouge }

In [48]:
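# Use a limited number of rows due to token constraints
# (.copy() so that new columns are added to an independent DataFrame)
df_benchmark = df_benchmark[:docs_for_test].copy()

# Evaluate function
print("Average BLEU and ROUGE scores:")
print(evaluate_bleu_rogue())

Average BLEU and ROUGE scores:
{'avg_bleu': 0.016263421539661413, 'avg_rouge': 0.12327049719682956}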
{'avg_bleu': 0.016263421539661413, 'avg_rouge': 0.12327049719682956}


In [52]:
# show dataset
df_benchmark.head(20)

|   | question | answer | generated_answer | search_qdrant | contexts |
|---|---|---|---|---|---|
| 0 | Consider the Non-Disclosure Agreement between ... | Any and all proprietary rights, including but ... | \n\nNo, the document does state that the Recei... |  | without prior written consent participants me... |
| 1 | Consider the Non-Disclosure Agreement between ... | “Confidential Information” means any Idea disc... | \nNo, the document does not state that Confide... |  | without prior written consent participants me... |
| 2 | Consider the Non-Disclosure Agreement between ... | Notwithstanding the termination of this Agreem... | \nYes, the document mentions that some obligat... |  | agreement force effect without liability part... |
| 3 | Consider the Non-Disclosure Agreement between ... | At Organiser’s first request, Mentor shall: | \nNo, the document does not permit the Receivi... |  | without prior written consent participants me... |
| 4 | Consider the Non-Disclosure Agreement between ... | Mentor shall not disclose any Confidential Inf... | \n\nNo, the document does not allow the Receiv... |  | without prior written consent participants me... |
| 5 | Consider the Non-Disclosure Agreement between ... | If Mentor is required by mandatory, non-appeal... | \n\nYes, the document requires the Receiving P... |  | confidentiality restrictions written agreemen... |
| 6 | Consider the Non-Disclosure Agreement between ... | Confidential Information does not include: | \nNo, the document does not allow the Receivin... |  | without prior written consent participants me... |
| 7 | Consider the Non-Disclosure Agreement between ... | Mentor shall not disclose any Confidential Inf... | \nYes |  | without prior written consent participants me... |
| 8 | Consider the Non-Disclosure Agreement between ... | Mentor shall not use any Confidential Informat... | \nYes, the document does restrict the use of C... |  | without prior written consent participants me... |
| 9 | Consider DBT's Mutual Non-Disclosure Agreement... | 5. No Further Rights All Confidential Informat... | \n\nNo, the document does not indicate that th... |  | mutual non disclosure agreement non disclosur... |


#### **3.1.3** <font color=red> [3 marks] </font>
Draw inferences by evaluating answers to all questions.

In [50]:
# Prepare data for Ragas evaluation
def prepare_ragas_dataset():
    df_ragas = pd.DataFrame({
        "question": df_benchmark["question"],
        "reference": df_benchmark["answer"],
        "response": df_benchmark["generated_answer"],
        "retrieved_contexts": df_benchmark["contexts"].apply(lambda x: [x] if isinstance(x, str) else x),
    })
    return Dataset.from_pandas(df_ragas)

# Convert your DataFrame for ragas
ragas_dataset = prepare_ragas_dataset()
ragas_dataset

Dataset({
    features: ['question', 'reference', 'response', 'retrieved_contexts'],
    num_rows: 100
})

To save time and computing power, you can run the evaluation on just the first 100 questions.

In [51]:
# Evaluate the RAG pipeline
# Run ragas evaluation
results = evaluate(ragas_dataset)

print("RAGAS Evaluation Results:")
print(results)

RAGAS Evaluation Results:
{'answer_relevancy': 0.8314, 'context_precision': 0.5900, 'faithfulness': 0.3103, 'context_recall': 0.6550}


## **4. Conclusion** <font color=red> [5 marks] </font><br>

### **4.1 Conclusions and insights** <font color=red> [5 marks] </font><br>

#### **4.1.1** <font color=red> [5 marks] </font>
Conclude with the results here. Include the insights gained about the data, model pipeline, the RAG process and the results obtained.

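This report summarizes the implementation and evaluation of a legal QnA system designed to answer questions from the content of legal documents. The system uses OpenAI’s GPT-3.5-turbo model and the LangChain framework with vector databases to enable RAG.

258 documents were loaded. Word-frequency analysis showed that common legal terms (company, agreement, section, shall) dominate the corpus, and TF-IDF cosine similarity revealed groups of highly similar documents (scores above 0.85 in the random sample) alongside largely unrelated ones. The cleaned documents were split into 1,000-character chunks with a 100-character overlap, embedded with OpenAI's `text-embedding-3-small` model, and stored in FAISS and Qdrant for vector search. The RAG pipeline combined a Qdrant retriever with an OpenAI language model.
On the first 100 benchmark questions, the pipeline scored an average BLEU of 0.016 and an average ROUGE-L of 0.123. The RAGAS scores were 0.8314 for answer relevancy, 0.5900 for context precision, 0.6550 for context recall and 0.3103 for faithfulness.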

**Steps Followed**

* Data Preprocessing
* Data Exploration and Analysis
* Document Chunking
* Embedding and Vector Store Population
* Retrieval-Augmented Generation (RAG)
* Reference Answer Loading and Evaluation


**Insights**:

* Many questions required specific contextual knowledge, showing RAG's strength over raw QA.
* Chunk size and the number of retrieved documents are expected to affect accuracy; only one configuration (1,000-character chunks with a 100-character overlap) was evaluated here.
* This pipeline can be extended by using more advanced language models (e.g., newer GPT models) and richer legal documents.


**Results and Observations**

* Both FAISS and Qdrant returned the same top result for the example query, suggesting similar retrieval quality.
* Due to token constraints imposed by OpenAI’s API, evaluation was limited to the first 100 benchmark questions.
* The system provides relevant answers when the underlying documents contain the right information, but the low faithfulness score (0.3103) indicates that many generated answers are not fully supported by the retrieved context.
* The custom RAG chain successfully combined vector-based retrieval with generative question answering.
* Ingesting more documents and evaluating on more test cases should give a more reliable estimate of accuracy.




