# **AI-POWERED MISLEADING LEGAL CLAUSE DETECTION MODEL FOR TERMS & CONDITIONS**

## **Problem Statement**


- Terms and Conditions (T&C) agreements are notoriously long, complicated, and full of dense legal jargon, which makes them difficult for the average user to understand. Within this complexity, companies may insert biased, unfair, or misleading clauses that limit consumer rights, shift liability, or disproportionately favor the service provider.

- Users frequently accept these agreements without reading or fully comprehending them, leading to unintended consequences. Detecting such problematic clauses currently requires manual legal expertise, which is time-consuming, costly, and inaccessible to most users.

- Labeled legal data is scarce, but the Contract Understanding Atticus Dataset (CUAD), curated by The Atticus Project, offers a promising starting point. CUAD comprises over 13,000 expert annotations across 510 commercial legal contracts, covering 41 types of key clauses commonly identified in corporate legal reviews, ranging from governing law and expiration dates to exclusivity and most-favored-nation provisions.

- This project aims to build an AI-powered Legal Assistant that leverages NLP and machine learning models trained on CUAD to:

    - Detect potentially biased, unfair, or misleading clauses in T&C documents.

    - Highlight these clauses for review.

- By building upon the rigorously annotated CUAD dataset, we capitalize on a high-quality benchmark designed explicitly for legal contract review, enabling more accurate detection and interpretation of critical clauses. This approach can significantly reduce the time, cost, and expertise needed, giving individuals and small organizations better access to legal insight and protection.

In [None]:
import zipfile
import os

# Zip File
zip_file = "CUAD_v1.zip"

with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    zip_ref.extractall("CUAD_dataset")  # folder to extract into

In [None]:
# List all files and folders extracted
import os
os.listdir("CUAD_dataset")

['CUAD_v1']
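
The extraction step above assumes `CUAD_v1.zip` is present and valid. A slightly more defensive version could look like the sketch below; `extract_archive` is a hypothetical helper name, not part of the original notebook, and it uses only the standard library.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the sorted top-level entries."""
    zip_path, dest_dir = Path(zip_path), Path(dest_dir)
    # is_zipfile returns False for a missing file as well as for a corrupt one
    if not zipfile.is_zipfile(zip_path):
        raise FileNotFoundError(f"{zip_path} is missing or not a valid zip archive")
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(dest_dir)
    return sorted(p.name for p in dest_dir.iterdir())
```

Calling `extract_archive("CUAD_v1.zip", "CUAD_dataset")` would then fail early with a readable message if the upload did not complete, instead of raising from inside `ZipFile`.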

In [None]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0


In [None]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Downloading pypdf-6.0.0-py3-none-any.whl (310 kB)
Installing collected packages: pypdf
Successfully installed pypdf-6.0.0


In [None]:
import os
import pandas as pd
import torch
import faiss
from transformers import AutoTokenizer, AutoModel, pipeline
from sentence_transformers import SentenceTransformer
from pypdf import PdfReader

## **Load CSV**

In [None]:
# ---------------- STEP 1: Paths ----------------
CSV_PATH = "/content/CUAD_dataset/CUAD_v1/master_clauses.csv"
EMB_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

In [None]:
# ---------------- STEP 2: Load Dataset ----------------
# Load master clauses from CSV
df = pd.read_csv(CSV_PATH)
# Show the column names
df.columns

Index(['Filename', 'Document Name', 'Document Name-Answer', 'Parties',
       'Parties-Answer', 'Agreement Date', 'Agreement Date-Answer',
       'Effective Date', 'Effective Date-Answer', 'Expiration Date',
       'Expiration Date-Answer', 'Renewal Term', 'Renewal Term-Answer',
       'Notice Period To Terminate Renewal',
       'Notice Period To Terminate Renewal- Answer', 'Governing Law',
       'Governing Law-Answer', 'Most Favored Nation',
       'Most Favored Nation-Answer', 'Competitive Restriction Exception',
       'Competitive Restriction Exception-Answer', 'Non-Compete',
       'Non-Compete-Answer', 'Exclusivity', 'Exclusivity-Answer',
       'No-Solicit Of Customers', 'No-Solicit Of Customers-Answer',
       'No-Solicit Of Employees', 'No-Solicit Of Employees-Answer',
       'Non-Disparagement', 'Non-Disparagement-Answer',
       'Termination For Convenience', 'Termination For Convenience-Answer',
       'Rofr/Rofo/Rofn', 'Rofr/Rofo/Rofn-Answer', 'Change Of Control',
      

In [None]:
df.head()

Unnamed: 0,Filename,Document Name,Document Name-Answer,Parties,Parties-Answer,Agreement Date,Agreement Date-Answer,Effective Date,Effective Date-Answer,Expiration Date,...,Liquidated Damages,Liquidated Damages-Answer,Warranty Duration,Warranty Duration-Answer,Insurance,Insurance-Answer,Covenant Not To Sue,Covenant Not To Sue-Answer,Third Party Beneficiary,Third Party Beneficiary-Answer
0,CybergyHoldingsInc_20140520_10-Q_EX-10.27_8605...,['MARKETING AFFILIATE AGREEMENT'],MARKETING AFFILIATE AGREEMENT,"['BIRCH FIRST GLOBAL INVESTMENTS INC.', 'MA', ...","Birch First Global Investments Inc. (""Company""...","['8th day of May 2014', 'May 8, 2014']",5/8/14,['This agreement shall begin upon the date of ...,,['This agreement shall begin upon the date of ...,...,[],No,"[""COMPANY'S SOLE AND EXCLUSIVE LIABILITY FOR T...",Yes,[],No,[],No,[],No
1,EuromediaHoldingsCorp_20070215_10SB12G_EX-10.B...,['VIDEO-ON-DEMAND CONTENT LICENSE AGREEMENT'],VIDEO-ON-DEMAND CONTENT LICENSE AGREEMENT,"['EuroMedia Holdings Corp.', 'Rogers', 'Rogers...","Rogers Cable Communications Inc. (""Rogers""); E...","['July 11 , 2006']",7/11/06,"['July 11 , 2006']",7/11/06,"['The term of this Agreement (the ""Initial Ter...",...,[],No,[],No,[],No,[],No,[],No
2,FulucaiProductionsLtd_20131223_10-Q_EX-10.9_83...,['CONTENT DISTRIBUTION AND LICENSE AGREEMENT'],CONTENT DISTRIBUTION AND LICENSE AGREEMENT,"['Producer', 'Fulucai Productions Ltd.', 'Conv...","CONVERGTV, INC. (“ConvergTV”); Fulucai Product...","['November 15, 2012']",11/15/12,"['November 15, 2012']",11/15/12,[],...,[],No,[],No,[],No,[],No,[],No
3,GopageCorp_20140221_10-K_EX-10.1_8432966_EX-10...,['WEBSITE CONTENT LICENSE AGREEMENT'],WEBSITE CONTENT LICENSE AGREEMENT,"['PSiTech Corporation', 'Licensor', 'Licensee'...","PSiTech Corporation (""Licensor""); Empirical Ve...","['Feb 10, 2014']",2/10/14,"['Feb 10, 2014']",2/10/14,['The initial term of this Agreement commences...,...,[],No,[],No,[],No,[],No,[],No
4,IdeanomicsInc_20160330_10-K_EX-10.26_9512211_E...,['CONTENT LICENSE AGREEMENT'],CONTENT LICENSE AGREEMENT,"['YOU ON DEMAND HOLDINGS, INC.', 'Licensor', '...",Beijing Sun Seven Stars Culture Development Li...,"['December 21, 2015']",12/21/15,"['December 21, 2015']",12/21/15,"['The Term of this Agreement (the ""Term"") shal...",...,[],No,[],No,[],No,[],No,[],No


In [None]:
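# Identify clause text columns (exclude ones ending in "-Answer").
# Caveat: 'Notice Period To Terminate Renewal- Answer' has a space before "Answer",
# so this filter does not exclude it.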
clause_columns = [col for col in df.columns if not col.endswith("-Answer") and col not in ["Filename", "Document Name"]]

# Melt into long format: one clause per row
melted_df = df.melt(value_vars=clause_columns, var_name="label", value_name="clause_text")

# Drop NaN or empty clauses
melted_df = melted_df.dropna(subset=["clause_text"])
melted_df = melted_df[melted_df["clause_text"].str.strip() != ""]

print(melted_df.head())
print(f"Total clauses: {len(melted_df)}")


     label                                        clause_text
0  Parties  ['BIRCH FIRST GLOBAL INVESTMENTS INC.', 'MA', ...
1  Parties  ['EuroMedia Holdings Corp.', 'Rogers', 'Rogers...
2  Parties  ['Producer', 'Fulucai Productions Ltd.', 'Conv...
3  Parties  ['PSiTech Corporation', 'Licensor', 'Licensee'...
4  Parties  ['YOU ON DEMAND HOLDINGS, INC.', 'Licensor', '...
Total clauses: 20501
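
The printed rows show that each `clause_text` cell is a stringified Python list (for example `"['July 11 , 2006']"`), and the `df.head()` output shows empty annotations stored as the string `"[]"`, which `dropna` and the empty-string filter do not remove. One possible way to unpack these cells is sketched below; `parse_clause_cell` is a hypothetical helper and uses only the standard library.

```python
import ast


def parse_clause_cell(cell):
    """Turn a stringified list such as "['a', 'b']" into a list of clean strings."""
    if not isinstance(cell, str):
        return []  # NaN or any other non-string cell
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the cell as one plain-text clause
        text = cell.strip()
        return [text] if text else []
    if not isinstance(parsed, (list, tuple)):
        text = str(parsed).strip()
        return [text] if text else []
    # Keep only non-empty items
    return [str(item).strip() for item in parsed if str(item).strip()]
```

Applied with `melted_df["clause_text"].apply(parse_clause_cell)` and followed by `explode`, this would give one annotated span per row and leave empty lists to be dropped. The clause count would then differ from the total printed above.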


## **Text Extraction**

In [None]:
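pdf_folder = "/content/CUAD_dataset/CUAD_v1/full_contract_pdf"
txt_folder = "/content/CUAD_dataset/CUAD_v1/full_contract_txt"


def extract_text_from_pdf(pdf_path):
    """Concatenate the extracted text of every page in a PDF."""
    text = ""
    with open(pdf_path, "rb") as file:
        # PdfReader comes from `from pypdf import PdfReader`; PyPDF2 is never imported
        reader = PdfReader(file)
        for page in reader.pages:
            # Fall back to "" in case a page yields no text
            text += (page.extract_text() or "") + "\n"
    return text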

# Collect PDF texts
pdf_data = []
for filename in os.listdir(pdf_folder):
    if filename.lower().endswith(".pdf"):
        full_path = os.path.join(pdf_folder, filename)
        text = extract_text_from_pdf(full_path)
        pdf_data.append({"label": "Unknown", "clause_text": text})

# Collect TXT texts
txt_data = []
for filename in os.listdir(txt_folder):
    if filename.lower().endswith(".txt"):
        full_path = os.path.join(txt_folder, filename)
        with open(full_path, "r", encoding="utf-8", errors="ignore") as f:
            text = f.read()
        txt_data.append({"label": "Unknown", "clause_text": text})

# Combine with CSV clauses
pdf_df = pd.DataFrame(pdf_data)
txt_df = pd.DataFrame(txt_data)
full_df = pd.concat([melted_df, pdf_df, txt_df], ignore_index=True)

print(full_df.head())
print(f"Total records: {len(full_df)}")


     label                                        clause_text
0  Parties  ['BIRCH FIRST GLOBAL INVESTMENTS INC.', 'MA', ...
1  Parties  ['EuroMedia Holdings Corp.', 'Rogers', 'Rogers...
2  Parties  ['Producer', 'Fulucai Productions Ltd.', 'Conv...
3  Parties  ['PSiTech Corporation', 'Licensor', 'Licensee'...
4  Parties  ['YOU ON DEMAND HOLDINGS, INC.', 'Licensor', '...
Total records: 21009
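
The record count rose from 20501 to 21009, so the two folder loops added 508 documents in total. `os.listdir` only sees the top level of a folder, so any PDF or TXT files stored in subfolders would be skipped silently. Whether that applies to your copy of the dataset is an assumption worth checking. A recursive search, sketched here with a hypothetical `find_files` helper, would cover both layouts.

```python
import os


def find_files(root, extension):
    """Return sorted paths under root (searched recursively) with the given extension."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Case-insensitive match, mirroring filename.lower().endswith(...)
            if name.lower().endswith(extension.lower()):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Comparing `len(find_files(pdf_folder, ".pdf"))` with the number of PDFs found by the loop above would show whether any files were missed.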


## **Chunking**

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create a text splitter (better for legal clauses)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # adjust based on legal clause length
    chunk_overlap=100,  # keep context overlap
    length_function=len
)

chunks = []
for _, row in full_df.iterrows():
    for chunk in text_splitter.split_text(row["clause_text"]):
        chunks.append({"label": row["label"], "text": chunk})

chunks_df = pd.DataFrame(chunks)
print(f"Total chunks created: {len(chunks_df)}")
print(chunks_df.head())

Total chunks created: 103247
     label                                               text
0  Parties  ['BIRCH FIRST GLOBAL INVESTMENTS INC.', 'MA', ...
1  Parties  ['EuroMedia Holdings Corp.', 'Rogers', 'Rogers...
2  Parties  ['Producer', 'Fulucai Productions Ltd.', 'Conv...
3  Parties  ['PSiTech Corporation', 'Licensor', 'Licensee'...
4  Parties  ['YOU ON DEMAND HOLDINGS, INC.', 'Licensor', '...
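
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified character-window chunker. It is an illustration only: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph breaks and spaces, which this sketch does not attempt. The name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With the settings used above, a 1,200-character text yields three windows starting at offsets 0, 400, and 800, so each window repeats the last 100 characters of the previous one.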


## **Embedding**

In [None]:
# ---------- Embedding ----------
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Load embedding model (no API key needed); reuse the name defined in STEP 1
embed_model = SentenceTransformer(EMB_MODEL_NAME)

# Encode all chunks
embeddings = embed_model.encode(chunks_df["text"].tolist(), convert_to_numpy=True)



## **FAISS Vector Store**

In [None]:
# ---------- FAISS Vector Store ----------
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print(f"FAISS index size: {index.ntotal}")

FAISS index size: 103247
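
`IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. The pure-Python sketch below shows the same computation on plain lists to clarify what the index returns. It is an illustration with a hypothetical function name, not a replacement for FAISS.

```python
def brute_force_l2_search(vectors, query, top_k=5):
    """Return (squared_distance, index) pairs for the top_k vectors nearest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    return scored[:top_k]
```

Because every stored vector is compared with the query, search time grows linearly with the number of chunks.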


## **Defining a Function for Retrieving Similar Chunks**

In [None]:
# ---------- Retrieval Function ----------
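def retrieve_similar_chunks(query, top_k=5):
    """Return the top_k chunks nearest to the query; "score" is the FAISS L2 distance, so lower means more similar."""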
    query_emb = embed_model.encode([query], convert_to_numpy=True)
    distances, indices = index.search(query_emb, top_k)
    results = []
    for i, idx in enumerate(indices[0]):
        results.append({
            "score": float(distances[0][i]),
            "text": chunks_df.iloc[idx]["text"],
            "label": chunks_df.iloc[idx]["label"]
        })
    return results

### **Example**

In [None]:
# ---------- Example Retrieval ----------
query = "termination clause without prior notice"
retrieved = retrieve_similar_chunks(query)

print("\nTop Matching Chunks:")
for r in retrieved:
    print(f"[{r['label']}] {r['text'][:200]}...\nScore: {r['score']}\n")


Top Matching Chunks:
[Unknown] notice thereof to the other, terminate this Agreement as of a date specified in such notice of termination, which...
Score: 0.5261452794075012

[Unknown] 12.3 Other Early Termination.

12.3.1 Either Party shall have the right to terminate this Agreement before the end of the Term for its convenience upon [***] written notice to the other Party (and any...
Score: 0.5616702437400818

[Unknown] is otherwise entitled to receive hereunder in the period from the date of such termination notice until the [ * ]....
Score: 0.6304959058761597

[Unknown] 16. TERMINATION

16.1 Termination events: without prejudice to any other rights under this Agreement and/or at Law, either Party shall be entitled to terminate all or part of this Agreement by Notice ...
Score: 0.637556791305542

[Termination For Convenience] ["Either party hereto may terminate this Agreement after the Initial Period upon at least six (6) months' prior written notice to the other party thereof.", "
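
The scores above are distances, so smaller values indicate closer matches. If the embeddings are unit-normalized (an assumption to verify for the model in use), a squared L2 distance d² relates to cosine similarity as cos = 1 - d²/2. The helpers below are a hypothetical sketch of that conversion together with a norm check.

```python
import math


def is_unit_norm(vec, tol=1e-3):
    """Check whether a vector has length 1 within the given tolerance."""
    return abs(math.sqrt(sum(x * x for x in vec)) - 1.0) <= tol


def squared_l2_to_cosine(squared_distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    # For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
    return 1.0 - squared_distance / 2.0
```

Under that assumption, the top score of 0.526 would correspond to a cosine similarity of roughly 0.74.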

## **Prepare Data For Generative Training**

In [None]:
# Prepare data for generative training
train_data = []
for _, row in chunks_df.iterrows():
    if row["label"] != "Unknown":  # only labeled data for supervised learning
        prompt = f"Explain why the following clause might be misleading:\n\n{row['text']}"
        response = f"This clause is categorized as '{row['label']}' and may be misleading because ..."
        train_data.append({"prompt": prompt, "response": response})

train_df = pd.DataFrame(train_data)
print(train_df.head())


                                              prompt  \
0  Explain why the following clause might be misl...   
1  Explain why the following clause might be misl...   
2  Explain why the following clause might be misl...   
3  Explain why the following clause might be misl...   
4  Explain why the following clause might be misl...   

                                            response  
0  This clause is categorized as 'Parties' and ma...  
1  This clause is categorized as 'Parties' and ma...  
2  This clause is categorized as 'Parties' and ma...  
3  This clause is categorized as 'Parties' and ma...  
4  This clause is categorized as 'Parties' and ma...  


## **Training a Classification Model**

In [None]:
!pip install transformers datasets accelerate



In [None]:
# Disable wandb
import os
os.environ["WANDB_MODE"] = "disabled"

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import Dataset
import pandas as pd

# Small hand-labeled example dataset (this replaces the prompt/response train_df built above)
train_df = pd.DataFrame({
    "clause": [
        "This agreement limits user rights.",
        "The company may terminate without notice.",
        "This clause ensures both parties are protected.",
        "The user can cancel at any time."
    ],
    "label": [1, 1, 0, 0]   # 1 = misleading, 0 = not misleading
})

# HuggingFace Dataset
dataset = Dataset.from_pandas(train_df)

# Load tokenizer & model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

### **Tokenize Data**

In [None]:
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenization
def tokenize(batch):
    return tokenizer(
        batch["clause"],
        padding="max_length",
        truncation=True,
        max_length=256
    )

tokenized_data = dataset.map(tokenize, batched=True)

# Rename labels properly
tokenized_data = tokenized_data.rename_column("label", "labels")
tokenized_data.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Train/Test split
train_test = tokenized_data.train_test_split(test_size=0.2)
train_dataset = train_test["train"]
test_dataset = train_test["test"]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/4 [00:00<?, ? examples/s]

### **Split Dataset into Train and Test**
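
With only four examples and `test_size=0.2`, the random split likely holds out a single row, so the evaluation set can contain only one of the two classes. A stratified split keeps both classes on each side when possible. The sketch below is a hypothetical standard-library helper, not the method used by `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), splitting each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        if len(indices) > 1:
            # Hold out at least one example per class, but never the whole class
            n_test = min(len(indices) - 1, max(1, round(len(indices) * test_fraction)))
        else:
            n_test = 0  # a single example stays in the training set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

For the labels `[1, 1, 0, 0]` this holds out one example of each class. The returned index lists could be passed to `Dataset.select` to build the two subsets.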

### **Train the Model**

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    fp16=torch.cuda.is_available(),  # mixed precision needs a CUDA GPU; torch is imported above
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer
)

trainer.train()



TrainOutput(global_step=3, training_loss=0.6181640625, metrics={'train_runtime': 75.8322, 'train_samples_per_second': 0.119, 'train_steps_per_second': 0.04, 'total_flos': 1183999749120.0, 'train_loss': 0.6181640625, 'epoch': 3.0})
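
The training loss alone says little about how well misleading clauses are detected, and the run above reports no evaluation metrics. A minimal way to score predictions against labels is sketched below. `binary_classification_metrics` is a hypothetical helper written in plain Python, with label 1 treated as "misleading" as in the example dataset.

```python
def binary_classification_metrics(labels, predictions, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    # Guard every division so empty or one-sided inputs return 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

The predictions could come from `trainer.predict(test_dataset)`, taking the argmax of the returned logits. With a handful of examples the numbers are not meaningful, so a larger labeled set would be needed before drawing conclusions.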

## **Save Model**

In [None]:
# Save model and tokenizer
trainer.save_model("./gen_ai_model")  # without a path, save_model() writes to output_dir ("./results")
tokenizer.save_pretrained("./gen_ai_model")


('./gen_ai_model/tokenizer_config.json',
 './gen_ai_model/special_tokens_map.json',
 './gen_ai_model/vocab.txt',
 './gen_ai_model/added_tokens.json',
 './gen_ai_model/tokenizer.json')

## **Download to Local Machine**

In [None]:
# Zip the saved folder and download it to the local machine
!zip -r gen_ai_model.zip ./gen_ai_model
from google.colab import files
files.download("gen_ai_model.zip")

  adding: gen_ai_model/ (stored 0%)
  adding: gen_ai_model/tokenizer_config.json (deflated 75%)
  adding: gen_ai_model/special_tokens_map.json (deflated 42%)
  adding: gen_ai_model/vocab.txt (deflated 53%)
  adding: gen_ai_model/tokenizer.json (deflated 71%)