# **AI-POWERED MISLEADING LEGAL CLAUSE DETECTION MODEL FOR TERMS & CONDITIONS**

## **Problem Statement**


- Terms and Conditions (T&C) agreements are notoriously long, complicated, and full of dense legal jargon—making them difficult for the average user to understand. Within this complexity, companies may insert biased, unfair, or misleading clauses that limit consumer rights, shift liability, or disproportionately favor the service provider.

- Users frequently accept these agreements without reading or fully comprehending them, leading to unintended consequences. Detecting such problematic clauses currently requires manual legal expertise, which is time-consuming, costly, and inaccessible to most users.

- While existing resources are limited, the Contract Understanding Atticus Dataset (CUAD) curated by The Atticus Project offers a promising solution. CUAD comprises over 13,000 expert annotations across 510 commercial legal contracts, covering 41 types of key clauses commonly identified in corporate legal reviews—ranging from governing law and expiration dates to exclusivity and most-favored-nation provisions.

- This project aims to build an AI-powered Legal Assistant that leverages NLP and machine learning models trained on CUAD to:

    - Detect potentially biased, unfair, or misleading clauses in T&C documents.

    - Highlight these clauses for review.

- By building upon the rigorously annotated CUAD dataset, we capitalize on a high-quality benchmark designed explicitly for legal contract review, enabling more accurate detection and interpretation of critical clauses. This approach can significantly reduce the time, cost, and expertise needed—empowering individuals and small organizations with improved access to legal insight and protection.

In [None]:
import zipfile
import os

# Zip File
zip_file = "CUAD_v1.zip"

with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    zip_ref.extractall("CUAD_dataset")  # folder to extract into

In [None]:
# List all files and folders extracted
import os
os.listdir("CUAD_dataset")

In [None]:
!pip install faiss-cpu

In [None]:
!pip install pypdf

In [None]:
import os
import pandas as pd
import torch
import faiss
from transformers import AutoTokenizer, AutoModel, pipeline
from sentence_transformers import SentenceTransformer
from pypdf import PdfReader

## **Load CSV**

In [None]:
# ---------------- STEP 1: Paths ----------------
CSV_PATH = "/content/CUAD_dataset/CUAD_v1/master_clauses.csv"
EMB_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

In [None]:
# ---------------- STEP 2: Load Dataset ----------------
# Load master clauses from CSV
df = pd.read_csv(CSV_PATH)
# Show the column names
df.columns

In [None]:
df.head()

In [None]:
# Identify clause text columns (exclude ones ending in "-Answer")
clause_columns = [col for col in df.columns if not col.endswith("-Answer") and col not in ["Filename", "Document Name"]]

# Melt into long format: one clause per row
melted_df = df.melt(value_vars=clause_columns, var_name="label", value_name="clause_text")

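# Drop NaN or empty clauses. Empty cells may also appear as the literal
# string "[]" (an assumption about how the CSV encodes empty annotations).
melted_df = melted_df.dropna(subset=["clause_text"])
melted_df["clause_text"] = melted_df["clause_text"].astype(str).str.strip()
melted_df = melted_df[~melted_df["clause_text"].isin(["", "[]"])]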

print(melted_df.head())
print(f"Total clauses: {len(melted_df)}")


## **Text Extraction**

In [None]:
pdf_folder = "/content/CUAD_dataset/CUAD_v1/full_contract_pdf"
txt_folder = "/content/CUAD_dataset/CUAD_v1/full_contract_txt"
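def extract_text_from_pdf(pdf_path):
    """Return the text of every page in a PDF, one page per line block."""
    # PdfReader is the class imported above via `from pypdf import PdfReader`
    reader = PdfReader(pdf_path)
    # extract_text() can yield nothing for image-only pages, so fall back to ""
    pages = [(page.extract_text() or "") for page in reader.pages]
    return "".join(page_text + "\n" for page_text in pages)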

# Collect PDF texts (os.walk also finds PDFs nested in subfolders)
pdf_data = []
for root, _, filenames in os.walk(pdf_folder):
    for filename in filenames:
        if filename.lower().endswith(".pdf"):
            full_path = os.path.join(root, filename)
            text = extract_text_from_pdf(full_path)
            pdf_data.append({"label": "Unknown", "clause_text": text})

# Collect TXT texts
txt_data = []
for filename in os.listdir(txt_folder):
    if filename.lower().endswith(".txt"):
        full_path = os.path.join(txt_folder, filename)
        with open(full_path, "r", encoding="utf-8", errors="ignore") as f:
            text = f.read()
        txt_data.append({"label": "Unknown", "clause_text": text})

# Combine with CSV clauses
pdf_df = pd.DataFrame(pdf_data)
txt_df = pd.DataFrame(txt_data)
full_df = pd.concat([melted_df, pdf_df, txt_df], ignore_index=True)

print(full_df.head())
print(f"Total records: {len(full_df)}")


## **Chunking**

In [None]:
# Requires the splitter package: pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a text splitter (better for legal clauses)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # adjust based on legal clause length
    chunk_overlap=100,  # keep context overlap
    length_function=len
)

chunks = []
for _, row in full_df.iterrows():
    for chunk in text_splitter.split_text(row["clause_text"]):
        chunks.append({"label": row["label"], "text": chunk})

chunks_df = pd.DataFrame(chunks)
print(f"Total chunks created: {len(chunks_df)}")
print(chunks_df.head())
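
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of overlapping character windows in plain Python. It is a simplification, not the splitter used above: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, whereas the hypothetical `chunk_text` helper below cuts at fixed offsets.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1,200 characters of dummy text
sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # [500, 500, 400]
```

Each window starts 400 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next. That repetition is what keeps a clause that straddles a boundary visible in at least one chunk.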

## **Embedding**

In [None]:
# ---------- Embedding ----------
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Load embedding model (no API key needed)
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode all chunks
embeddings = embed_model.encode(chunks_df["text"].tolist(), convert_to_numpy=True)

## **FAISS Vector Store**

In [None]:
# ---------- FAISS Vector Store ----------
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print(f"FAISS index size: {index.ntotal}")
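
For intuition, `IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances, so a smaller value means a closer match. The sketch below reproduces that idea with NumPy on toy 2-D vectors. The `l2_search` function and the toy data are illustrative assumptions, not part of the FAISS API.

```python
import numpy as np


def l2_search(database, queries, top_k=5):
    """Brute-force nearest-neighbour search using squared L2 distance."""
    database = np.asarray(database, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    # Pairwise differences, shape (n_queries, n_database, dim)
    diffs = queries[:, None, :] - database[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    top_k = min(top_k, database.shape[0])
    # Smallest distances first
    indices = np.argsort(sq_dists, axis=1, kind="stable")[:, :top_k]
    distances = np.take_along_axis(sq_dists, indices, axis=1)
    return distances, indices


toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
toy_query = np.array([[0.9, 0.0]])
distances, indices = l2_search(toy_vectors, toy_query, top_k=2)
print(indices[0].tolist())  # [1, 0]
```

This brute-force approach is fine for small collections. For larger ones, the pairwise difference array grows with the number of queries times the number of stored vectors, which is one reason to use a dedicated index.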

## **Defining a Function to Retrieve Similar Chunks**

In [None]:
# ---------- Retrieval Function ----------
def retrieve_similar_chunks(query, top_k=5):
    query_emb = embed_model.encode([query], convert_to_numpy=True)
    distances, indices = index.search(query_emb, top_k)
    results = []
    for i, idx in enumerate(indices[0]):
        results.append({
            "score": float(distances[0][i]),  # squared L2 distance: lower means more similar
            "text": chunks_df.iloc[idx]["text"],
            "label": chunks_df.iloc[idx]["label"]
        })
    return results

### **Example**

In [None]:
# ---------- Example Retrieval ----------
query = "termination clause without prior notice"
retrieved = retrieve_similar_chunks(query)

print("\nTop Matching Chunks:")
for r in retrieved:
    print(f"[{r['label']}] {r['text'][:200]}...\nScore: {r['score']}\n")

## **Prepare Data For Generative Training**

In [None]:
# Prepare data for generative training
train_data = []
for _, row in chunks_df.iterrows():
    if row["label"] != "Unknown":  # only labeled data for supervised learning
        prompt = f"Explain why the following clause might be misleading:\n\n{row['text']}"
        response = f"This clause is categorized as '{row['label']}' and may be misleading because ..."
        train_data.append({"prompt": prompt, "response": response})

train_df = pd.DataFrame(train_data)
print(train_df.head())


## **Training a Classification Model (BERT)**

In [None]:
!pip install transformers datasets accelerate

In [None]:
# Disable wandb
import os
os.environ["WANDB_MODE"] = "disabled"

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import Dataset
import pandas as pd

# Toy example dataset (this replaces the train_df built in the previous section)
train_df = pd.DataFrame({
    "clause": [
        "This agreement limits user rights.",
        "The company may terminate without notice.",
        "This clause ensures both parties are protected.",
        "The user can cancel at any time."
    ],
    "label": [1, 1, 0, 0]   # 1 = misleading, 0 = not misleading
})

# HuggingFace Dataset
dataset = Dataset.from_pandas(train_df)

# Load tokenizer & model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

### **Tokenize Data**

In [None]:
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenization
def tokenize(batch):
    return tokenizer(
        batch["clause"],
        padding="max_length",
        truncation=True,
        max_length=256
    )

tokenized_data = dataset.map(tokenize, batched=True)

# Rename labels properly
tokenized_data = tokenized_data.rename_column("label", "labels")
tokenized_data.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Train/Test split
train_test = tokenized_data.train_test_split(test_size=0.2)
train_dataset = train_test["train"]
test_dataset = train_test["test"]

### **Split Dataset into Train and Test**

### **Train the Model**

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    fp16=torch.cuda.is_available(),  # mixed precision needs a CUDA GPU; torch is imported above
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer
)

trainer.train()
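
The `Trainer` above receives an `eval_dataset` but no metric function, so evaluation would report loss only. Below is a minimal sketch of accuracy, precision, recall, and F1 for the "misleading" class (label `1`) in plain Python. The `binary_metrics` helper and the toy labels are hypothetical. Wiring it into training would go through the `Trainer`'s `compute_metrics` argument after taking the argmax of the logits, which is not shown here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 1 = misleading, 0 = not misleading
metrics = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
print(metrics)
```

In this toy case one misleading clause is missed, so precision is 1.0 while recall is 0.5. For a tool meant to flag risky clauses, recall on the positive class is arguably the number to watch.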


## **Save Model**

In [None]:
# Save model and tokenizer to the same folder
# (without a path, save_model() writes to output_dir, i.e. ./results)
trainer.save_model("./gen_ai_model")
tokenizer.save_pretrained("./gen_ai_model")


## **Download to the Local Machine**

In [None]:
# Zip the saved model and download it (works in Google Colab only)
!zip -r gen_ai_model.zip ./gen_ai_model
from google.colab import files
files.download("gen_ai_model.zip")