# AI-Powered Legal Document Analyzer

## Project Overview
This project aims to develop an AI-powered tool that automates legal document analysis using Natural Language Processing (NLP).
The system will extract key clauses, summarize contracts, and perform Named Entity Recognition (NER) on legal documents.

### **Objectives**
- Summarize contracts using NLP models.
- Extract key clauses (e.g., obligations, indemnity, governing law).
- Implement Named Entity Recognition (NER) to detect important legal terms.


#### Step 1: Downloading and Loading the CUAD Dataset

### Why CUAD (Contract Understanding Atticus Dataset) ?

•	✅ Best choice for contract clause extraction<br>
•	✅ Labeled with 41 key contract clauses (e.g., indemnity, liabilities, governing law)<br>
•	✅ Already used in NLP research for legal document processing<br>


In [1]:
# Install necessary libraries
!pip install datasets jsonlines transformers torch pandas accelerate matplotlib evaluate seqeval



datasets → Load CUAD dataset from Hugging Face.<br>
jsonlines → Handle JSON-based datasets.<br>
transformers → Use NLP models (BERT, GPT).<br>
torch → Deep learning for NLP models.<br>
pandas → Data handling.<br>
matplotlib → Data visualization.<br>

#### Load and explore the data

In [2]:
import os
print(os.getcwd())

/content


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
import json

data_path = '/content/drive/MyDrive/CUAD_v1.json'
with open (data_path, "r", encoding = "utf-8") as file:
    data = json.load(file)

print(f"Number of Contracts : {len(data['data'])}")

print(json.dumps(data["data"][0], indent=2))

Number of Contracts : 510
{
  "title": "LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT",
  "paragraphs": [
    {
      "qas": [
        {
          "answers": [
            {
              "text": "DISTRIBUTOR AGREEMENT",
              "answer_start": 44
            }
          ],
          "id": "LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT__Document Name",
          "question": "Highlight the parts (if any) of this contract related to \"Document Name\" that should be reviewed by a lawyer. Details: The name of the contract",
          "is_impossible": false
        },
        {
          "answers": [
            {
              "text": "Distributor",
              "answer_start": 244
            },
            {
              "text": "Electric City Corp.",
              "answer_start": 148
            },
            {
              "text": "Electric City of Illinois L.L.C.",
              "answer_start": 49574
            },
            {
              "text": "Company",
 

#### Checking if the data is loaded correctly

In [5]:
print(data.keys())

# Check the first contract's title
print("First contract title:", data["data"][0]["title"])

# Check the first contract's text - shows first 500 chars
print("First contract text sample:", data["data"][0]["paragraphs"][0]["context"][:500])

dict_keys(['version', 'data'])
First contract title: LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT
First contract text sample: EXHIBIT 10.6

                              DISTRIBUTOR AGREEMENT

         THIS  DISTRIBUTOR  AGREEMENT (the  "Agreement")  is made by and between Electric City Corp.,  a Delaware  corporation  ("Company")  and Electric City of Illinois LLC ("Distributor") this 7th day of September, 1999.

                                    RECITALS

         A. The  Company's  Business.  The Company is  presently  engaged in the business  of selling an energy  efficiency  device,  which is  referred to as an 


#### Step 2: Convert CUAD Dataset to a Pandas DataFrame

In [6]:
import pandas as pd

contracts = []
for contract in data["data"]:
    title = contract["title"]
    for paragraph in contract["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            question = qa["question"]
            answers = [ans["text"] for ans in qa["answers"]] if qa["answers"] else ["No answers"]

            contracts.append({
                "title" : title,
                "context" : context,
                "question" : question,
                "answers" : answers
            })

df = pd.DataFrame(contracts)

df.head()

Unnamed: 0,title,context,question,answers
0,LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGRE...,EXHIBIT 10.6\n\n ...,Highlight the parts (if any) of this contract ...,[DISTRIBUTOR AGREEMENT]
1,LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGRE...,EXHIBIT 10.6\n\n ...,Highlight the parts (if any) of this contract ...,"[Distributor, Electric City Corp., Electric Ci..."
2,LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGRE...,EXHIBIT 10.6\n\n ...,Highlight the parts (if any) of this contract ...,"[7th day of September, 1999.]"
3,LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGRE...,EXHIBIT 10.6\n\n ...,Highlight the parts (if any) of this contract ...,[The term of this Agreement shall be ten (10...
4,LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGRE...,EXHIBIT 10.6\n\n ...,Highlight the parts (if any) of this contract ...,[The term of this Agreement shall be ten (10...


In [7]:
df.info()
df.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20910 entries, 0 to 20909
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     20910 non-null  object
 1   context   20910 non-null  object
 2   question  20910 non-null  object
 3   answers   20910 non-null  object
dtypes: object(4)
memory usage: 653.6+ KB


Unnamed: 0,title,context,question,answers
9309,"CANOPETROLEUM,INC_12_13_2007-EX-10.1-Sponsorsh...",QuickLinks -- Click here to rapidly navigate t...,Highlight the parts (if any) of this contract ...,"[5th day of December, 2007]"
7541,"HEMISPHERX - Sales, Marketing, Distribution, a...","Exhibit 10.1\n\n\n\nSales, Marketing, Distribu...",Highlight the parts (if any) of this contract ...,[No answers]
11556,"CURAEGISTECHNOLOGIES,INC_05_26_2010-EX-1-CORPO...",CORPORATE SPONSORSHIP AGREEMENT\n\nThis agreem...,Highlight the parts (if any) of this contract ...,"[Upon termination of this Agreement, NEITHER P..."
20497,INGEVITYCORP_05_16_2016-EX-10.5-INTELLECTUAL P...,Exhibit 10.5 INTELLECTUAL PROPERTY AGREEMENT...,Highlight the parts (if any) of this contract ...,[No answers]
5767,InmodeLtd_20190729_F-1A_EX-10.9_11743243_EX-10...,Exhibit 10.9 TURN - KEY MANUFACTURING AGREEMEN...,Highlight the parts (if any) of this contract ...,[No answers]


#### Step 3 Data Preprocessing

##### Remove Duplicates

In [8]:
# Check for duplicate rows based on 'context' and 'question'
print(f"Duplicate entries before removal: {df.duplicated(subset=['context', 'question']).sum()}")

# Remove duplicate rows
df = df.drop_duplicates(subset=['context', 'question'], keep='first')

# Verify duplicate removal
print(f"Duplicate entries after removal: {df.duplicated(subset=['context', 'question']).sum()}")


Duplicate entries before removal: 41
Duplicate entries after removal: 0


##### Clean contract text
Legal contracts may contain excess whitespace, special characters, and inconsistent formatting.

- clean the text by:
    - Removing extra spaces.
    - Converting text to lowercase (optional, depending on model choice).
    - Removing special characters (except legal terms like section symbols).
    - Standardizing newlines.

In [9]:
import re

def clean_text(text):
    text = text.strip()  # Remove leading/trailing spaces
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    text = re.sub(r'[^A-Za-z0-9.,;:()$%&@\-\'"\s]', '', text)  # Remove unwanted special characters
    return text

# Apply text cleaning to 'context' and 'answers'
df["context"] = df["context"].apply(clean_text)
df["answers"] = df["answers"].apply(lambda ans_list: [clean_text(ans) for ans in ans_list])

# Display a sample cleaned contract text
df.sample(3)


Unnamed: 0,title,context,question,answers
7270,PhasebioPharmaceuticalsInc_20200330_10-K_EX-10...,Exhibit 10.21 Certain information has been exc...,Highlight the parts (if any) of this contract ...,"[Notwithstanding the foregoing, nothing herein..."
3407,BIOCEPTINC_08_19_2013-EX-10-COLLABORATION AGRE...,Exhibit 10.13 COLLABORATION AGREEMENT THIS COL...,Highlight the parts (if any) of this contract ...,[The term of this Agreement will commence on t...
3925,SECURIANFUNDSTRUST_05_01_2012-EX-99.28.H.9-NET...,Exhibit 28(h)(9) RESTATED NET INVESTMENT INCOM...,Highlight the parts (if any) of this contract ...,[No answers]


##### Handle missing values
Some contract entries might have missing text, questions, or answers.

- Identify missing values.
    - Fill missing answers with "No answer provided".
    - Drop entries that are completely empty.

In [10]:
# Check for missing values in the dataset
print("Missing values before handling:")
print(df.isnull().sum())

# Fill missing answers with a default value
df["answers"] = df["answers"].apply(lambda ans_list: ans_list if ans_list else ["No answer provided"])

# Drop rows where 'context' or 'question' is completely missing
df = df.dropna(subset=["context", "question"])

# Verify that there are no more missing values
print("Missing values after handling:")
print(df.isnull().sum())

Missing values before handling:
title       0
context     0
question    0
answers     0
dtype: int64
Missing values after handling:
title       0
context     0
question    0
answers     0
dtype: int64


#### Step 4: Normalize Text Length (Trimming & Padding)
Since legal documents can be too long or too short:<br>
    - Trim excessively long contract text to a reasonable length (e.g., 1024 characters).<br>
    - Pad shorter contract text (if necessary) for uniformity

In [11]:
# Define maximum text length
MAX_LENGTH = 1024

def normalize_length(text):
    return text[:MAX_LENGTH]  # Trim text if too long

# Apply length normalization
df["context"] = df["context"].apply(normalize_length)

# Verify the text length normalization
df["text_length"] = df["context"].apply(len)  # Add a column to check lengths
df["text_length"].describe()

Unnamed: 0,text_length
count,20869.0
mean,1022.70334
std,19.565594
min,639.0
25%,1024.0
50%,1024.0
75%,1024.0
max,1024.0


#### Step 5: Tokenize Contract Text for NLP Models

In [12]:
from transformers import AutoTokenizer

# Choose a pretrained model tokenizer (BERT-based for legal documents)
TOKENIZER_MODEL = "nlpaueb/legal-bert-base-uncased"  # Pretrained Legal-BERT model
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_MODEL)

# Tokenize contract text
df["tokenized_context"] = df["context"].apply(lambda text: tokenizer.encode(text, truncation=True, padding="max_length", max_length=512))

# Display tokenized data
df[["context", "tokenized_context"]].sample(3)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Unnamed: 0,context,tokenized_context
20803,Exhibit 26 JOINT FILING AGREEMENT Pursuant to ...,"[101, 675, 556, 857, 995, 232, 294, 211, 212, ..."
625,ENDORSEMENT Contract Number: ENDORSEMENT Effec...,"[101, 3927, 438, 394, 119, 3927, 440, 254, 119..."
15939,"AMENDMENT NO. 2 Dated as of March 27, 2006 Ref...","[101, 543, 230, 117, 199, 777, 221, 210, 624, ..."


### Why Are We Using BERT Instead of GPT or Other NLP Models?

For legal document analysis, we are using a BERT-based model (Legal-BERT) instead of GPT or other NLP models :

1️⃣ **BERT is bidirectional** (reads both left and right context), while GPT reads only in one direction.<br>
2️⃣ **BERT is optimized for Named Entity Recognition (NER) and contract summarization**, whereas GPT is better suited for text generation.<br>
3️⃣ **Legal-BERT is specifically trained on legal texts**, making it better at handling contract clauses, obligations, and liabilities.<br>
4️⃣ **GPT is prone to hallucinations** (generating incorrect legal information), while BERT excels at extracting **factual** legal clauses.<br>

Therefore, we use `nlpaueb/legal-bert-base-uncased` (Legal-BERT) for this project.

### Prepare Data for Named Entity Recognition (NER)

This step processes contract text data to prepare it for Named Entity Recognition (NER) using Legal-BERT. It tokenizes the text, assigns entity labels (B-ENTITY, I-ENTITY, O), and ensures proper alignment. The sequences are truncated to 512 tokens, padded if necessary, and converted into a format compatible with Hugging Face's Dataset library. This dataset will be used to fine-tune a Legal-BERT model for extracting key entities from legal documents.

In [13]:
from datasets import Dataset

# Define max token length for the model
MAX_TOKEN_LENGTH = 512

ner_data = []

for _, row in df.iterrows():
    tokens = tokenizer.tokenize(row["context"])
    token_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Initialize labels as 0 (no entity)
    labels = [0] * len(tokens)

    for answer in row["answers"]:
        answer_tokens = tokenizer.tokenize(answer)
        answer_token_ids = tokenizer.convert_tokens_to_ids(answer_tokens)

        # Find the start index of the answer tokens in the context
        for i in range(len(tokens) - len(answer_tokens) + 1):
            if token_ids[i:i+len(answer_tokens)] == answer_token_ids:
                labels[i] = 1  # Start of entity
                for j in range(1, len(answer_tokens)):
                    labels[i+j] = 2  # Inside the entity
                break

    # Ensure truncation & padding
    tokens = tokens[:MAX_TOKEN_LENGTH - 2]  # Account for [CLS] and [SEP]
    token_ids = token_ids[:MAX_TOKEN_LENGTH - 2]
    labels = labels[:MAX_TOKEN_LENGTH - 2]

    # Add special tokens ([CLS] and [SEP])
    tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token]
    token_ids = [tokenizer.cls_token_id] + token_ids + [tokenizer.sep_token_id]
    labels = [0] + labels + [0]  # No entity for special tokens

    # Pad sequences if necessary
    padding_length = MAX_TOKEN_LENGTH - len(token_ids)

    token_ids += [tokenizer.pad_token_id] * padding_length
    labels += [0] * padding_length  # Padding tokens should have label 0

    ner_data.append({"tokens": tokens, "input_ids": token_ids, "labels": labels})

# Convert to Hugging Face Dataset
ner_dataset = Dataset.from_list(ner_data)

# Display a sample from the dataset
print(ner_dataset[0])


Token indices sequence length is longer than the specified maximum sequence length for this model (535 > 512). Running this sequence through the model will result in indexing errors


{'tokens': ['[CLS]', 'exhibit', '10', '.', '6', 'distributor', 'agreement', 'this', 'distributor', 'agreement', '(', 'the', '"', 'agreement', '"', ')', 'is', 'made', 'by', 'and', 'between', 'electric', 'city', 'corp', '.', ',', 'a', 'delaware', 'corporation', '(', '"', 'company', '"', ')', 'and', 'electric', 'city', 'of', 'illinois', 'llc', '(', '"', 'distributor', '"', ')', 'this', '7', '##th', 'day', 'of', 'september', ',', '1999', '.', 'recitals', 'a', '.', 'the', 'company', "'", 's', 'business', '.', 'the', 'company', 'is', 'present', '##ly', 'engaged', 'in', 'the', 'business', 'of', 'selling', 'an', 'energy', 'efficiency', 'device', ',', 'which', 'is', 'referred', 'to', 'as', 'an', '"', 'energy', 'saver', '"', 'which', 'may', 'be', 'improved', 'or', 'otherwise', 'changed', 'from', 'its', 'present', 'composition', '(', 'the', '"', 'products', '"', ')', '.', 'the', 'company', 'may', 'engage', 'in', 'the', 'business', 'of', 'selling', 'other', 'products', 'or', 'other', 'devices', 'o

### Training a Named Entity Recognition (NER) Model (Fine-tune Legal-BERT)

This section will fine-tune Legal-BERT for Named Entity Recognition (NER). The model is loaded with a token classification head, mapping entity labels (B-ENTITY, I-ENTITY, O). We define training parameters, including batch size, learning rate, and number of epochs. The dataset is split into 90% training / 10% evaluation, and a Hugging Face Trainer is used for fine-tuning. The training process optimizes the model to recognize legal entities in contract text, improving its performance for downstream legal NLP tasks.

In [15]:
!pip install datasets --upgrade



In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
import random
from datasets import Dataset
import numpy as np

# Define label mappings for NER
label2id = {"O": 0, "B-ENTITY": 1, "I-ENTITY": 2}
id2label = {v: k for k, v in label2id.items()}

# Load pre-trained Legal-BERT model for Token Classification
model_name = "nlpaueb/legal-bert-base-uncased"
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id
)

# Load the tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_dataset = ner_dataset

split = train_dataset.train_test_split(test_size=0.2)
train_dataset = split["train"]
eval_dataset = split["test"]

train_sampled = train_dataset.shuffle(seed=42).select([i for i in range(int(0.1 * len(train_dataset)))])

# Define training arguments
training_args = TrainingArguments(
    output_dir="./ner_legalbert",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=2,
    learning_rate=5e-5,
    warmup_steps=500,
    gradient_accumulation_steps=2,
    bf16=True,
    report_to="none",
)

# Data collator to handle padding and batching
def data_collator(features):
    input_ids = [f["input_ids"] for f in features]
    labels = [f["labels"] for f in features]

    # Padding
    padding_length = max(len(ids) for ids in input_ids)

    # Apply padding
    batch = {
        'input_ids': torch.tensor(input_ids, dtype=torch.long),
        'labels': torch.tensor(labels, dtype=torch.long)
    }

    # Pad sequences
    batch['input_ids'] = torch.nn.functional.pad(batch['input_ids'], (0, padding_length - batch['input_ids'].size(1)), value=tokenizer.pad_token_id)
    batch['labels'] = torch.nn.functional.pad(batch['labels'], (0, padding_length - batch['labels'].size(1)), value=-100)

    return batch

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_sampled,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer
)

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.save_model('./fine_tuned_legal_bert')


Some weights of BertForTokenClassification were not initialized from the model checkpoint at nlpaueb/legal-bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(
  ctx_manager = torch.cpu.amp.autocast(cache_enabled=cache_enabled, dtype=self.amp_dtype)
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


### Evaluating NER Model Performance

In [None]:
# Function to compute metrics- precision, recall, F1
metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    preds = np.argmax(predictions, axis=2)
    true_labels = [[id2label[label] for label in label_row] for label_row in labels]
    pred_labels = [[id2label[p] for p, l in zip(pred_row, label_row) if l != -100] for pred_row, label_row in zip(preds, labels)]

    results = metric.compute(predictions=pred_labels, references=true_labels)

    precision = results["overall_precision"]
    recall = results["overall_recall"]
    f1 = results["overall_f1"]
    accuracy = results["overall_accuracy"]

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy
    }

# Evaluate the model
eval_results = trainer.evaluate()
metrics = compute_metrics((eval_results.predictions, eval_results.label_ids))

# Print the evaluation metrics
print(f"Precision: {metrics['precision']:.4f}")
print(f"Recall: {metrics['recall']:.4f}")
print(f"F1-score: {metrics['f1']:.4f}")
print(f"Accuracy: {metrics['accuracy']:.4f}")