<a href="https://colab.research.google.com/github/saurabhguptars-cmd/compliance-ai-project/blob/main/Step1Step2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# =========================
# Legal Document Processor — Sample Clauses (Safe)
# =========================

# Install dependencies
import sys
!{sys.executable} -m pip install -q transformers sentence-transformers torch

# Imports
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
import os

import numpy as np
import pandas as pd
import torch

# Create folders
os.makedirs('data/processed', exist_ok=True)

print("Torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# =========================
# Step 1: Sample legal clauses
# =========================
texts = [
    "All customer data must be stored within the European Union.",
    "Employees must comply with the company's cybersecurity policy.",
    "Third-party vendors must sign a data protection agreement.",
    "All financial transactions must be logged and auditable.",
    "Access to sensitive data must be restricted to authorized personnel."
]

print("Extracted", len(texts), "clauses")
print("Sample clause:\n", texts[0])

# =========================
# Step 2: Generate embeddings
# =========================
embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = embedder.encode(texts, convert_to_tensor=True, show_progress_bar=True)

# Save embeddings & clauses
pd.DataFrame({"clause": texts}).to_csv("data/processed/clauses_sample.csv", index=False)
np.save("data/processed/corpus_embeddings.npy", corpus_embeddings.cpu().numpy())
print("Embeddings saved!")

# =========================
# Step 3: Semantic search
# =========================
query = "data residency EU storage location"
query_embedding = embedder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)

print("Top matching clauses:\n")
for h in hits[0]:
    print("Score:", h["score"])
    print("Clause:\n", texts[h["corpus_id"]])
    print("\n---\n")

# =========================
# Step 4: Summarization & Simplification
# =========================
device = 0 if torch.cuda.is_available() else -1

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device=device)
simplifier = pipeline("text2text-generation", model="google/flan-t5-small", device=device)

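# Use the top-ranked clause from the semantic search above
sample_text = texts[hits[0][0]['corpus_id']]

# Note: the clause is only ~13 tokens long, so min_length=30 forces the
# summarizer to pad its output with repeated or invented sentences.
summary = summarizer(sample_text, max_length=120, min_length=30, do_sample=False)[0]['summary_text']
prompt = "Simplify the following legal clause into 3-4 clear action items for a system analyst:\n\n" + sample_text
# Note: this transformers version also sets max_new_tokens=256, which takes precedence over max_length.
items = simplifier(prompt, max_length=200)[0]['generated_text']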

print("Summary:\n", summary)
print("\nAction Items:\n", items)

# =========================
# Step 5: Save outputs
# =========================
with open("data/processed/sample_clause.txt", "w", encoding="utf-8") as f:
    f.write(sample_text)
with open("data/processed/sample_summary.txt", "w", encoding="utf-8") as f:
    f.write(summary)
with open("data/processed/sample_items.txt", "w", encoding="utf-8") as f:
    f.write(items)

print("Saved outputs in data/processed/")


Torch version: 2.8.0+cu126
CUDA available: True
Extracted 5 clauses
Sample clause:
 All customer data must be stored within the European Union.


Embeddings saved!
Top matching clauses:

Score: 0.5762563347816467
Clause:
 All customer data must be stored within the European Union.

---

Score: 0.2944127321243286
Clause:
 Access to sensitive data must be restricted to authorized personnel.

---

Score: 0.2440289556980133
Clause:
 Third-party vendors must sign a data protection agreement.

---



Device set to use cuda:0


Device set to use cuda:0
Your max_length is set to 120, but your input_length is only 13. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=6)
Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summary:
  All customer data must be stored within the European Union . Customers must also have their data stored in the EU . All customers' data must also be stored in Europe .

Action Items:
 A system analyst must store all customer data within the European Union.
Saved outputs in data/processed/
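
The `max_length` warning in the output above arises because the summarizer is asked for 30 to 120 tokens from a 13-token clause. One possible way to avoid this is to derive the bounds from the tokenized input length. The sketch below is illustrative only: `summary_length_bounds` and its `ratio` and `floor` parameters are hypothetical names, not part of transformers, and in practice the token count would come from the pipeline's tokenizer rather than being hard-coded.

```python
def summary_length_bounds(n_input_tokens, ratio=0.6, floor=5):
    """Return (min_length, max_length) scaled to the input size.

    Hypothetical helper: keeps max_length at or below the input length so
    the summarizer is not asked to produce more tokens than it was given.
    """
    if n_input_tokens < 1:
        raise ValueError("n_input_tokens must be positive")
    max_length = max(floor, int(n_input_tokens * ratio))
    # Never request more tokens than the input contains
    max_length = min(max_length, n_input_tokens)
    min_length = max(1, max_length // 2)
    return min_length, max_length


# 13 is the input length reported in the warning above
min_len, max_len = summary_length_bounds(13)
print(min_len, max_len)
```

The resulting values could then be passed as `summarizer(text, min_length=min_len, max_length=max_len, do_sample=False)`. For clauses this short, it may be reasonable to skip summarization entirely.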


In [3]:
# =========================
# Step 2 - Compliance Q&A Engine
# =========================

import json

import numpy as np
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# -------------------------------
# Load embeddings and clauses from Step 1
# -------------------------------
clauses = pd.read_csv("data/processed/clauses_sample.csv")["clause"].tolist()
embeddings = np.load("data/processed/corpus_embeddings.npy")
embeddings = torch.tensor(embeddings)

print(f"Loaded {len(clauses)} clauses and embeddings.")

# Reload the same embedder used in Step 1
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# -------------------------------
# Semantic search function
# -------------------------------
def search(query, top_k=3):
    """Return the top_k clauses most similar to the query, with their cosine scores."""
    query_emb = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, embeddings, top_k=top_k)[0]
    return [
        {"score": float(h["score"]), "clause": clauses[h["corpus_id"]]}
        for h in hits
    ]

# -------------------------------
# Summarizer and Simplifier
# -------------------------------
device = 0 if torch.cuda.is_available() else -1
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device=device)
simplifier = pipeline("text2text-generation", model="google/flan-t5-small", device=device)

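def explain_clause(text):
    """Summarize a clause and rewrite it as action items."""
    # Note: min_length=30 exceeds the ~13-token length of these clauses, so the
    # summarizer pads its output with repeated or invented sentences.
    summary = summarizer(text, max_length=120, min_length=30, do_sample=False)[0]['summary_text']
    prompt = "Simplify the following legal clause into 3-4 clear action items:\n\n" + text
    # Note: this transformers version also sets max_new_tokens=256, which
    # takes precedence over max_length (see the warning in the output).
    items = simplifier(prompt, max_length=200)[0]['generated_text']
    return summary, items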

# -------------------------------
# Example Query
# -------------------------------
query = "Where must customer data be stored?"
results = search(query, top_k=3)

print(f"\nQuery: {query}\n")
for r in results:
    print(f"Score: {r['score']:.3f}")
    print(f"Clause: {r['clause']}\n")
    summary, items = explain_clause(r['clause'])
    print("Summary:", summary)
    print("Action Items:", items)
    print("\n---\n")


Loaded 5 clauses and embeddings.


Device set to use cuda:0
Device set to use cuda:0
Your max_length is set to 120, but your input_length is only 13. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=6)



Query: Where must customer data be stored?

Score: 0.654
Clause: All customer data must be stored within the European Union.



Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Summary:  All customer data must be stored within the European Union . Customers must also have their data stored in the EU . All customers' data must also be stored in Europe .
Action Items: 3-4 clear action items: All customer data must be stored within the European Union.

---

Score: 0.364
Clause: Third-party vendors must sign a data protection agreement.





Summary:  Third-party vendors must sign a data protection agreement . Data protection agreement must be signed by third party vendors . Third party vendors must also sign an agreement with the government .
Action Items: 3-4 clear action items: Third-party vendors must sign a data protection agreement.

---

Score: 0.340
Clause: Access to sensitive data must be restricted to authorized personnel.





Summary:  Access to sensitive data must be restricted to authorized personnel . Use of sensitive data is restricted to limited access to sensitive information . Use this article to help students understand the complexities of the data mining system .
Action Items: Section 3-4 of the Code of Criminal Procedure: Access to sensitive data must be restricted to authorized personnel.

---
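
The scores above drop from 0.654 for the relevant clause to roughly 0.35 for the other two, which suggests that lower-ranked hits may not answer the query at all. The sketch below illustrates the cosine-similarity ranking that `util.semantic_search` applies by default, together with a hypothetical `min_score` cutoff that is not part of the notebook. The 3-dimensional vectors are made-up stand-ins for real embeddings, and `rank_by_cosine` is an illustrative name rather than a library function.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, corpus_vecs, top_k=3, min_score=0.5):
    """Return up to top_k hits with score >= min_score, best first."""
    scored = [
        {"corpus_id": i, "score": cosine(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    # Drop weak matches instead of always returning top_k results
    return [hit for hit in scored[:top_k] if hit["score"] >= min_score]


# Toy vectors standing in for clause embeddings (made up for illustration)
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
toy_query = [1.0, 0.1, 0.0]

for hit in rank_by_cosine(toy_query, toy_corpus):
    print(hit["corpus_id"], round(hit["score"], 3))
```

The 0.5 cutoff is arbitrary; a suitable value would need to be tuned against real queries and clauses.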

