Project Description: Legal Document Analysis and Summarization Tool using RAG

In [None]:
%pip install nltk
%pip install scikit-learn
%pip install beautifulsoup4
%pip install lxml

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
from bs4 import BeautifulSoup
import contractions

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

stemmer = PorterStemmer()


def separate_numbers_and_text(text):
    separated_text = re.sub(r"(\d+)([a-zA-Z]+)|([a-zA-Z]+)(\d+)", r"\1 \2 \3 \4", text)
    return separated_text


def preprocess_text(text):
    text = BeautifulSoup(text, "html.parser").text
    text = text.lower()

    text = text.translate(str.maketrans("", "", string.punctuation))
    text = " ".join(word for word in text.split() if word not in stop_words)
    text = separate_numbers_and_text(text)
    text = " ".join(text.split())
    text = re.sub(r"\s+", " ", text)  # removing extra white space
    text = contractions.fix(
        text
    )  # fixing contractions (ex: can't gets converted to cannot)
    text = re.sub(r"http\S+|www\.\S+|https\S+", "<URL>", text)  # removing URLs
    text = re.sub(r"\S+@\S+", "<EMAIL>", text)  # removing email IDs
    text = re.sub(
        r"\b(not)\s+(\w+)\b", r"\1_\2", text
    )  # handling negations (ex: "not good" becomes "not_good")

    text = " ".join(stemmer.stem(word) for word in text.split())  # stemming

    return text


def read_and_preprocess_data(folder_path):
    judgement_path = os.path.join(folder_path, "judgement")
    summary_path = os.path.join(folder_path, "summary")

    judgement_files = sorted(os.listdir(judgement_path))
    summary_files = sorted(os.listdir(summary_path))

    judgements = []
    summaries = []

    for judgement_file, summary_file in zip(judgement_files, summary_files):
        with open(
            os.path.join(judgement_path, judgement_file), "r", encoding="utf-8"
        ) as jf:
            judgement_text = jf.read()

            judgement_text = preprocess_text(judgement_text)
            judgements.append(judgement_text)

        with open(
            os.path.join(summary_path, summary_file), "r", encoding="utf-8"
        ) as sf:
            summary_text = sf.read()

            summary_text = preprocess_text(summary_text)
            summaries.append(summary_text)

    data = pd.DataFrame({"Judgement": judgements, "Summary": summaries})
    return data


# Paths to train and test folders
train_folder_path = r"D:\Rohan\ML\Datasets\legal_dataset\IN-Abs\train-data-small"
test_folder_path = r"D:\Rohan\ML\Datasets\legal_dataset\IN-Abs\test-data"

# Reading and preprocessing train and test data
train_data = read_and_preprocess_data(train_folder_path)
test_data = read_and_preprocess_data(test_folder_path)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
train_data["Judgement"][0]

'appeal lxvi 1949 appeal high court judicatur bombay refer section 66 indian incom tax act 1022 km munshi n p nathvani appel lant mc setalvad attorney gener india h j umrigar respond 1950 may 26 judgment court deliv mehr chand mahajan j appeal judgment high court judicatur bombay incom tax matter rais question whether munici pal properti tax urban immov properti tax payabl relev bombay act allow deduct section 9 1 iv indian incom tax act assesse compani invest compani deriv incom properti citi bombay assess year 1940 41 net incom assesse head properti comput incom tax offic sum rs 621764 deduct gross rent certain payment compani paid relev year rs 122675 municip properti tax rs 32760 urban properti tax deduct two sum claim provis section 9 act first item deduct sum rs 48572 allow ground item repres tenant burden paid assesse otherwis claim disal low appeal assesse appel sistant commission incom tax appel tribu nal unsuccess tribun howev agre refer two question law high court judicatur 

In [4]:
train_data.head()

Unnamed: 0,Judgement,Summary
0,appeal lxvi 1949 appeal high court judicatur b...,charg creat respect municip properti tax secti...
1,civil appeal no 94 1949 107 834 appeal judgmen...,agreement leas leas indian declar includ must ...
2,xxix 1950 applic articl 32 constitut india wri...,section 7 1 c east punjab public safeti act 19...
3,xxxvii 1950 applic articl 32 constitut india w...,section 4 sub section 1 c east punjab public s...
4,xvi 1950 appli cation articl 32 constitut writ...,held full court overrul preliminari object con...


In [5]:
test_data.head()

Unnamed: 0,Judgement,Summary
0,appeal 101 1959 appeal special leav judgment o...,appel displac person west pakistan grant quasi...
1,appeal 52 1957 appeal judgment decre date apri...,appel respond owner adjoin collieri suit prese...
2,appeal no 45 46 1959 appeal special leav judgm...,respond firm claim exempt sale tax articl 286 ...
3,ion crimin appeal 89 1961 appeal special leav ...,appel tri murder fact establish quarrel appel ...
4,civil appeal 50 1961 appeal special leav award...,employ appel cross cutter saw mill ask show be...


In [6]:
# downloading the RAG model and initialising
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")



In [7]:
train_inputs = [
    tokenizer(
        judgement, return_tensors="pt", padding=True, truncation=True
    ).input_ids.squeeze(0)
    for judgement in train_data["Judgement"]
]
test_inputs = [
    tokenizer(
        judgement, return_tensors="pt", padding=True, truncation=True
    ).input_ids.squeeze(0)
    for judgement in test_data["Judgement"]
]

train_labels = [
    tokenizer(
        summary, return_tensors="pt", padding=True, truncation=True
    ).input_ids.squeeze(0)
    for summary in train_data["Summary"]
]
test_labels = [
    tokenizer(
        summary, return_tensors="pt", padding=True, truncation=True
    ).input_ids.squeeze(0)
    for summary in test_data["Summary"]
]

In [8]:
from torch.utils.data import Dataset, DataLoader


class TextDataset(Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.labels[idx]


from torch.nn.utils.rnn import pad_sequence


def collate_fn(batch):
    inputs, labels = zip(*batch)

    # Pad inputs and labels to the same length
    inputs_padded = pad_sequence(
        inputs, batch_first=True, padding_value=tokenizer.pad_token_id
    )
    labels_padded = pad_sequence(
        labels, batch_first=True, padding_value=tokenizer.pad_token_id
    )

    # Ensure tensors are of type torch.long
    return inputs_padded.long(), labels_padded.long()


# Prepare Datasets
train_dataset = TextDataset(train_inputs, train_labels)
test_dataset = TextDataset(test_inputs, test_labels)

train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn
)
test_loader = DataLoader(
    test_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn
)

In [9]:
import torch.optim as optim
import torch

# Initialize optimizer
optimizer = optim.AdamW(model.parameters(), lr=5e-5)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(3):
    model.train()
    for batch in train_loader:
        inputs, labels = batch
        inputs = inputs.to(device)
        labels = labels.to(device)

        outputs = model(input_ids=inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    print(f"Epoch {epoch + 1} completed")

  attn_output = torch.nn.functional.scaled_dot_product_attention(


OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 14.77 GiB is allocated by PyTorch, and 84.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
"""
1) Dataset is downloaded from https://github.com/Law-AI/summarization from the link: https://zenodo.org/records/7152317#.Yz6mJ9JByC0
IN-Abs data is used. I used a subset of train-data of 10 documents (.txt files) for training and 100 documents (.txt files) for testing
I took a small subset for training since sufficient RAM wasn't available on my local device and Google Colab as well.

2) Imported judgement and summary files from train-data and test-data and preprocessed it

3) facebook/bart-large model is downloaded to work as the Generative component of this project
from Hugging Face

4) Input and output tensors are prepared for training and testing a sequence-to-sequence model
by tokenizing text data.

5) TextDataset class for handling tokenized input and output pairs, and uses a collate_fn function to
pad sequences to uniform length within batches. It then creates DataLoader instances for training
and testing, which handle batching and shuffling of the data. 

6) Model is trained for 3 epochs using GPU. However, training was not possible since sufficient RAM wasn't
available on my local device and Google Colab as well.




"""