# Constitution of India - Summarization & Risk-classification Notebook

What this notebook does (step-by-step)
1. Loads your Constitution JSON (you said you already have it) and flattens the nested structure into rows (Articles / Clauses).
2. Preprocesses text and creates pseudo-summaries (for quick prototyping). For best results you should supply human-written summaries.
3. Generates a heuristic 'risk' label (0 = low, 1 = high) using keywords. Replace with manual labels if available.
4. Prepares Hugging Face `datasets` objects and tokenizers.
5. Fine-tunes:
   - A T5 model for text simplification/summarization (seq2seq).
   - A DistilBERT classifier for risk prediction (sequence classification).
6. Provides inference utilities that accept PDF input (text or scanned) and run summarization + risk classification.
7. Final cell triggers training (optional - uncomment to run).

WARNING: This notebook contains prototype code. Training transformer models requires GPU and time. Adjust hyperparameters for your machine.


In [21]:
# Install required packages (run once)
# In a notebook environment use %pip so it installs in the notebook kernel.
# You may need to separately install the Tesseract binary for OCR (for scanned PDFs):
# On Ubuntu: sudo apt-get install tesseract-ocr
# On Windows: install from https://github.com/tesseract-ocr/tesseract

%pip install --upgrade pip
%pip install transformers datasets evaluate sentencepiece accelerate torch torchvision torchaudio --quiet
%pip install datasets[torch] pdfplumber pytesseract pdf2image scikit-learn --quiet


Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.




Note: you may need to restart the kernel to use updated packages.


In [2]:
# Standard imports and reproducibility
import os
import json
import re
import math
import random
from pathlib import Path
from typing import List, Dict, Any

import numpy as np
import pandas as pd

# Seed everything for reproducibility
def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except Exception:
        pass

seed_everything(42)

# Display basic environment info
import torch
print('torch', torch.__version__, 'cuda:', torch.cuda.is_available())

torch 2.8.0+cpu cuda: False


In [2]:
# Load and inspect your constitution JSON
# Put your JSON file at path below.
CONSTITUTION_JSON = 'COI.json'

def load_constitution(json_path: str) -> List[Dict[str, Any]]:
    # Load the JSON file and return the parsed object.
    # This function attempts to be tolerant to different nested list shapes used in your example.
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data

# If you already have the file in the notebook workspace uncomment and run:
if os.path.exists(CONSTITUTION_JSON):
    raw = load_constitution(CONSTITUTION_JSON)
    print('Top-level type:', type(raw))
    # print a small preview
    import json as _json
    preview = _json.dumps(raw[:2], indent=2, ensure_ascii=False) if isinstance(raw, list) else str(raw)[:1000]
    print('Preview (truncated):\n', preview)
else:
    print(f'Please place your constitution JSON at: {CONSTITUTION_JSON} and re-run this cell.')

Top-level type: <class 'list'>
Preview (truncated):
 [
  [
    {
      "ArtNo": "0",
      "Name": "PREAMBLE",
      "ArtDesc": "WE, THE PEOPLE OF INDIA, having solemnly resolved to constitute India into a SOVEREIGN SOCIALIST SECULAR DEMOCRATIC REPUBLIC and to secure to all its citizens:\nJUSTICE, social, economic and political;\nLIBERTY of thought, expression, belief, faith and worship;\nEQUALITY of status and of opportunity;\nand to promote among them all\nFRATERNITY assuring the dignity of the individual and the unity and integrity of the Nation;\nIN OUR CONSTITUENT ASSEMBLY this twenty-sixth day of November, 1949, do HEREBY ADOPT, ENACT AND GIVE TO OURSELVES THIS CONSTITUTION."
    },
    {
      "ArtNo": "1",
      "Name": "Name and territory of the Union.",
      "Clauses": [
        {
          "ClauseNo": "1",
          "ClauseDesc": "India, that is Bharat, shall be a Union of States."
        },
        {
          "ClauseNo": "2",
          "ClauseDesc": "The States and the t

In [3]:
# Flatten nested JSON structure into a table (DataFrame)
def flatten_constitution(data: Any) -> pd.DataFrame:
    # Navigate nested lists/dicts and produce rows with fields:
    # - ArtNo, Name, ArtDesc, ClauseNo, ClauseDesc (when present)
    rows = []
    def handle_item(item):
        # item is expected to be a dict representing an article or clause block
        if not isinstance(item, dict):
            return
        art_no = item.get('ArtNo') or item.get('ArticleNo') or item.get('article_no') or ''
        name = item.get('Name') or item.get('Title') or ''
        art_desc = item.get('ArtDesc') or item.get('ArticleDesc') or ''
        clauses = item.get('Clauses') or item.get('clauses') or item.get('Clause') or []
        # if clauses exist, yield each clause as a row
        if isinstance(clauses, list) and clauses:
            for c in clauses:
                if isinstance(c, dict):
                    rows.append({
                        'ArtNo': art_no,
                        'Name': name,
                        'ArtDesc': art_desc,
                        'ClauseNo': c.get('ClauseNo',''),
                        'ClauseDesc': c.get('ClauseDesc', '') or c.get('Text','')
                    })
                else:
                    rows.append({
                        'ArtNo': art_no, 'Name': name, 'ArtDesc': art_desc,
                        'ClauseNo': '', 'ClauseDesc': str(c)
                    })
        else:
            # no clauses: treat the article description as a single row
            rows.append({
                'ArtNo': art_no,
                'Name': name,
                'ArtDesc': art_desc,
                'ClauseNo': '',
                'ClauseDesc': ''
            })

    # data might be nested lists
    def walk(x):
        if isinstance(x, list):
            for e in x:
                walk(e)
        elif isinstance(x, dict):
            handle_item(x)
            # walk nested lists or dict values
            for v in x.values():
                if isinstance(v, (list, dict)):
                    walk(v)
        # else ignore primitives

    walk(data)
    df = pd.DataFrame(rows)
    return df

# Example usage (run after placing JSON)
if os.path.exists(CONSTITUTION_JSON):
    df = flatten_constitution(raw)
    print('Rows:', len(df))
    display(df.head(20))
else:
    print('No JSON found yet; place the file and re-run.')

Rows: 190


Unnamed: 0,ArtNo,Name,ArtDesc,ClauseNo,ClauseDesc
0,0,PREAMBLE,"WE, THE PEOPLE OF INDIA, having solemnly resol...",,
1,1,Name and territory of the Union.,,1.0,"India, that is Bharat, shall be a Union of Sta..."
2,1,Name and territory of the Union.,,2.0,The States and the territories thereof shall b...
3,1,Name and territory of the Union.,,3.0,The territory of India shall comprise
4,,,,,
5,,,,,
6,,,,,
7,,,,,
8,,,,,
9,,,,,


In [4]:
# Preprocessing: create `text`, `summary`, `risk_label`
def first_sentences(text: str, n_sentences: int = 1) -> str:
    # Very lightweight sentence splitter for English-like text.
    if not text:
        return ''
    txt = re.sub(r'\s+', ' ', text.strip())
    sents = re.split(r'(?<=[.!?])\s+', txt)
    sents = [s.strip() for s in sents if s.strip()]
    return ' '.join(sents[:n_sentences])

RISK_KEYWORDS_HIGH = [
    'penalty', 'penal', 'offence', 'offense', 'liable', 'fine', 'punishable', 'imprison', 'arrest',
    'terminate', 'disqualify', 'prohibit', 'forfeit', 'warrant', 'sanction'
]

def heuristic_risk_label(text: str) -> int:
    # Simple keyword-based risk heuristic.
    if not text:
        return 0
    t = text.lower()
    score = sum(1 for kw in RISK_KEYWORDS_HIGH if kw in t)
    return 1 if score > 0 else 0

def prepare_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    # Create 'text', 'summary', 'risk_label' columns used for training.
    def make_text(row):
        parts = []
        if row.get('Name'): parts.append(str(row['Name']))
        if row.get('ArtNo'): parts.append(f"Article {row['ArtNo']}")
        if row.get('ArtDesc'): parts.append(str(row['ArtDesc']))
        if row.get('ClauseDesc'): parts.append(str(row['ClauseDesc']))
        return '\n'.join(parts).strip()

    df2 = df.copy()
    df2['text'] = df2.apply(make_text, axis=1)
    df2['summary'] = df2.apply(lambda r: first_sentences(r['ClauseDesc'] or r['ArtDesc'], n_sentences=2), axis=1)
    df2['risk_label'] = df2['text'].apply(heuristic_risk_label)
    df2 = df2[df2['text'].astype(bool)].reset_index(drop=True)
    return df2

if 'df' in globals():
    df_prepped = prepare_dataframe(df)
    print('Prepared rows:', len(df_prepped))
    display(df_prepped.head(10))
else:
    print('Run the previous cells to load and flatten the JSON first.')

Prepared rows: 91


Unnamed: 0,ArtNo,Name,ArtDesc,ClauseNo,ClauseDesc,text,summary,risk_label
0,0,PREAMBLE,"WE, THE PEOPLE OF INDIA, having solemnly resol...",,,"PREAMBLE\nArticle 0\nWE, THE PEOPLE OF INDIA, ...","WE, THE PEOPLE OF INDIA, having solemnly resol...",0
1,1,Name and territory of the Union.,,1.0,"India, that is Bharat, shall be a Union of Sta...",Name and territory of the Union.\nArticle 1\nI...,"India, that is Bharat, shall be a Union of Sta...",0
2,1,Name and territory of the Union.,,2.0,The States and the territories thereof shall b...,Name and territory of the Union.\nArticle 1\nT...,The States and the territories thereof shall b...,0
3,1,Name and territory of the Union.,,3.0,The territory of India shall comprise,Name and territory of the Union.\nArticle 1\nT...,The territory of India shall comprise,0
4,2,Admission or establishment of new States.,"Parliament may by law admit into the Union, or...",,,Admission or establishment of new States.\nArt...,"Parliament may by law admit into the Union, or...",0
5,2A,Sikkim to be associated with the Union,Omitted by the Constitution,,,Sikkim to be associated with the Union\nArticl...,Omitted by the Constitution,0
6,3,Formation of new States and alteration of area...,Parliament may by law—\n(a) form a new State b...,,,Formation of new States and alteration of area...,Parliament may by law— (a) form a new State by...,0
7,4,Laws made under articles 2 and 3 to provide fo...,,1.0,Any law referred to in article 2 or article 3 ...,Laws made under articles 2 and 3 to provide fo...,Any law referred to in article 2 or article 3 ...,0
8,4,Laws made under articles 2 and 3 to provide fo...,,2.0,No such law as aforesaid shall be deemed to be...,Laws made under articles 2 and 3 to provide fo...,No such law as aforesaid shall be deemed to be...,0
9,5,Citizenship at the commencement of the Constit...,"At the commencement of this Constitution, ever...",,,Citizenship at the commencement of the Constit...,"At the commencement of this Constitution, ever...",0


In [11]:
# Convert pandas DataFrame to Hugging Face Dataset and train-test split
from datasets import Dataset, DatasetDict

def to_hf_datasets(df: pd.DataFrame, test_size: float = 0.1, seed: int = 42) -> DatasetDict:
    ds = Dataset.from_pandas(df[['text','summary','risk_label']].reset_index(drop=True))
    ds = ds.shuffle(seed=seed)
    n = len(ds)
    test_n = max(1, int(n * test_size))
    train_ds = ds.select(range(n - test_n))
    test_ds = ds.select(range(n - test_n, n))
    return DatasetDict({'train': train_ds, 'test': test_ds})

if 'df_prepped' in globals():
    datasets = to_hf_datasets(df_prepped, test_size=0.1)
    print(datasets)
    print('Example train row:')
    display(datasets['train'][0])
else:
    print('Prepare the DataFrame first.')

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'risk_label'],
        num_rows: 82
    })
    test: Dataset({
        features: ['text', 'summary', 'risk_label'],
        num_rows: 9
    })
})
Example train row:


{'text': 'Protection against arrest and detention in certain cases.\nArticle 22\nNothing in clause (5) shall require the authority making any such order as is referred to in that clause to disclose facts which such authority considers to be against the public interest to disclose.',
 'summary': 'Nothing in clause (5) shall require the authority making any such order as is referred to in that clause to disclose facts which such authority considers to be against the public interest to disclose.',
 'risk_label': 1}

In [5]:
# Summarization setup (T5)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
import evaluate

SUMMARIZER_MODEL = 't5-small'  # change to t5-base or a larger model as needed
max_input_length = 512
max_target_length = 128

def prepare_summarization_tokenizer_and_model(model_name=SUMMARIZER_MODEL):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model

def tokenize_for_summarization(batch, tokenizer):
    inputs = tokenizer(batch['text'], truncation=True, padding='max_length', max_length=max_input_length)
    targets = tokenizer(batch['summary'], truncation=True, padding='max_length', max_length=max_target_length)
    batch['input_ids'] = inputs['input_ids']
    batch['attention_mask'] = inputs['attention_mask']
    batch['labels'] = targets['input_ids']
    return batch

rouge = evaluate.load('rouge')

  from .autonotebook import tqdm as notebook_tqdm


In [6]:
# Summarization compute_metrics helper
def postprocess_text(preds, labels):
    preds = [p.strip() for p in preds]
    labels = [l.strip() for l in labels]
    return preds, labels

def compute_rouge(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    # decode (will set summarize_tokenizer before training)
    tokenizer = summarize_tokenizer
    preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    preds, labels = postprocess_text(preds, labels)
    result = rouge.compute(predictions=preds, references=labels, use_stemmer=True)
    return {'rouge1': result['rouge1'], 'rouge2': result['rouge2'], 'rougeL': result['rougeL']}

In [7]:
# Classification setup (DistilBERT)
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
CLASSIFIER_MODEL = 'distilbert-base-uncased'

def prepare_classification_tokenizer_and_model(model_name=CLASSIFIER_MODEL, num_labels=2):
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    return tokenizer, model

def tokenize_for_classification(batch, tokenizer, max_length=256):
    toks = tokenizer(batch['text'], truncation=True, padding='max_length', max_length=max_length)
    batch['input_ids'] = toks['input_ids']
    batch['attention_mask'] = toks['attention_mask']
    batch['labels'] = batch['risk_label']
    return batch

In [8]:
# Save & load utilities for both models
def save_model_and_tokenizer(model, tokenizer, out_dir: str):
    os.makedirs(out_dir, exist_ok=True)
    tokenizer.save_pretrained(out_dir)
    model.save_pretrained(out_dir)

def load_seq2seq_model(out_dir: str):
    tokenizer = AutoTokenizer.from_pretrained(out_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(out_dir)
    return tokenizer, model

def load_classifier(out_dir: str):
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    tokenizer = AutoTokenizer.from_pretrained(out_dir)
    model = AutoModelForSequenceClassification.from_pretrained(out_dir)
    return tokenizer, model

In [9]:
# PDF ingestion utilities: text extraction with fallbacks for scanned pages
def extract_text_from_pdf(pdf_path: str) -> str:
    # Try multiple methods:
    # 1) pdfplumber to extract embedded text
    # 2) PyPDF2 as fallback
    # 3) OCR via pytesseract+pdf2image if text seems missing (scanned)
    text_parts = []
    # 1) pdfplumber
    try:
        import pdfplumber
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                txt = page.extract_text()
                if txt:
                    text_parts.append(txt)
    except Exception as e:
        print('pdfplumber failed:', e)

    joined = '\n'.join(text_parts).strip()
    if len(joined) > 50:
        return joined

    # 2) PyPDF2
    try:
        import PyPDF2
        with open(pdf_path, 'rb') as f:
            reader = PyPDF2.PdfReader(f)
            for p in reader.pages:
                try:
                    txt = p.extract_text()
                except Exception:
                    txt = ''
                if txt:
                    text_parts.append(txt)
        joined = '\n'.join(text_parts).strip()
        if len(joined) > 50:
            return joined
    except Exception as e:
        print('PyPDF2 failed:', e)

    # 3) OCR fallback (requires tesseract binary + pdf2image)
    try:
        from pdf2image import convert_from_path
        import pytesseract
        images = convert_from_path(pdf_path, dpi=200)
        ocr_texts = []
        for img in images:
            txt = pytesseract.image_to_string(img)
            if txt:
                ocr_texts.append(txt)
        joined = '\n'.join(ocr_texts).strip()
        return joined
    except Exception as e:
        print('OCR fallback failed or not installed:', e)
        return joined

In [10]:
# Inference: summarization + risk classification on a piece of text
from transformers import pipeline

def build_inference_pipelines(summarizer_dir: str, classifier_dir: str, device: int = -1):
    # Load models from directories and create HF pipelines.
    # device: -1 for CPU, or GPU device id (0,1,...) if CUDA available.
    sum_tokenizer, sum_model = load_seq2seq_model(summarizer_dir)
    summarizer = pipeline('summarization', model=sum_model, tokenizer=sum_tokenizer, device=device)

    cls_tokenizer, cls_model = load_classifier(classifier_dir)
    classifier = pipeline('text-classification', model=cls_model, tokenizer=cls_tokenizer, device=device, return_all_scores=False)

    return summarizer, classifier

def infer_text(text: str, summarizer, classifier, max_length=150):
    # Return a short summary and a risk prediction for `text`.
    summary_res = summarizer(text, truncation=True, max_length=max_length, min_length=30)
    summary = summary_res[0]['summary_text'] if isinstance(summary_res, list) and summary_res else str(summary_res)

    cls_res = classifier(text)
    label = cls_res[0].get('label')
    score = cls_res[0].get('score')
    return {'summary': summary, 'risk_label': label, 'risk_score': float(score)}

In [13]:
def infer_text(text, summarizer, classifier, max_length=512):
    # Summarize first (summarizers often support longer inputs)
    summary_res = summarizer(text, truncation=True, max_length=max_length, min_length=30)
    summary = summary_res[0]['summary_text'] if isinstance(summary_res, list) and summary_res else str(summary_res)

    # 🔧 Truncate for classifier
    truncated_text = text[:2000]  # or use tokenizer.encode(..., truncation=True)
    cls_res = classifier(truncated_text, truncation=True, max_length=512)
    label = cls_res[0].get('label')
    score = cls_res[0].get('score')

    return {'summary': summary, 'risk_label': label, 'risk_score': score}


In [14]:
from datasets import Dataset, DatasetDict
import json
from sklearn.model_selection import train_test_split

# ===== LOAD CONSTITUTION JSON =====
json_path = "COI.json"  # Change to your file path

with open(json_path, "r", encoding="utf-8") as f:
    raw_data = json.load(f)

# Flatten JSON (assuming structure [[{ArtNo, Name, ArtDesc, Clauses}, ...]])
articles = []
for group in raw_data:
    for item in group:
        art_no = item.get("ArtNo", "")
        name = item.get("Name", "")
        desc = item.get("ArtDesc", "")
        text = f"Article {art_no}: {name}\n{desc}"

        # ---- Simplified synthetic risk scoring (can later be replaced with a human-labeled one)
        if any(word in desc.lower() for word in ["emergency", "suspend", "power", "detention"]):
            risk = 1  # high-risk (arbitrary)
        else:
            risk = 0  # low-risk

        # Construct entry
        articles.append({
            "text": text,
            "summary": desc[:200],  # crude short summary to fine-tune summarizer
            "risk_label": risk
        })

print(f"✅ Loaded {len(articles)} articles from JSON")

# ===== SPLIT TRAIN/TEST =====
train_data, test_data = train_test_split(articles, test_size=0.2, random_state=42)
train_dataset = Dataset.from_list(train_data)
test_dataset = Dataset.from_list(test_data)

datasets = DatasetDict({
    "train": train_dataset,
    "test": test_dataset
})

print("✅ Datasets prepared:", datasets)
print("Columns:", datasets["train"].column_names)


✅ Loaded 46 articles from JSON
✅ Datasets prepared: DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'risk_label'],
        num_rows: 36
    })
    test: Dataset({
        features: ['text', 'summary', 'risk_label'],
        num_rows: 10
    })
})
Columns: ['text', 'summary', 'risk_label']


In [19]:
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import numpy as np
import torch
import transformers
import os, gc, shutil

print("Transformers version:", transformers.__version__)

# ---- CONFIG ----
do_train_summarizer = True
do_train_classifier = True
num_train_epochs = 3
per_device_train_batch_size = 2
per_device_eval_batch_size = 2
output_dir_summarizer = './summarizer_out'
output_dir_classifier = './classifier_out'

# ---- SAFE SAVE FUNCTION ----
def save_model_and_tokenizer(model, tokenizer, out_dir):
    """
    Safely saves a model and tokenizer (fixes Windows safetensors issue).
    """
    os.makedirs(out_dir, exist_ok=True)

    # Free GPU memory and Python handles
    gc.collect()
    torch.cuda.empty_cache()

    # Save tokenizer
    tokenizer.save_pretrained(out_dir)

    # Save model using torch format (avoid safetensors conflict)
    try:
        model.save_pretrained(out_dir, safe_serialization=False)
        print(f"✅ Model and tokenizer saved to {out_dir}")
    except Exception as e:
        print(f"⚠️ Safetensor error encountered: {e}")
        print("Retrying save using fallback...")
        temp_dir = out_dir + "_tmp"
        os.makedirs(temp_dir, exist_ok=True)
        model.save_pretrained(temp_dir, safe_serialization=False)
        for f in os.listdir(temp_dir):
            shutil.move(os.path.join(temp_dir, f), out_dir)
        shutil.rmtree(temp_dir)
        print("✅ Model saved successfully after fallback.")

# ---- Check Dataset ----
if 'datasets' not in globals():
    raise RuntimeError("`datasets` not found. Please load it before training.")

print("Dataset columns:", datasets['train'].column_names)

# ===== SUMMARIZER TRAINING =====
if do_train_summarizer:
    print("\n=== Summarizer training ===")
    summarize_tokenizer, summarize_model = prepare_summarization_tokenizer_and_model()
    print("Loaded summarizer model:", type(summarize_model).__name__)

    tokenized_train = datasets['train'].map(
        lambda batch: tokenize_for_summarization(batch, summarize_tokenizer),
        batched=True,
        remove_columns=datasets['train'].column_names
    )
    tokenized_test = datasets['test'].map(
        lambda batch: tokenize_for_summarization(batch, summarize_tokenizer),
        batched=True,
        remove_columns=datasets['test'].column_names
    )

    data_collator = DataCollatorForSeq2Seq(summarize_tokenizer, model=summarize_model)

    args_dict = dict(
        output_dir=output_dir_summarizer,
        logging_strategy="epoch",
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        predict_with_generate=True,
        num_train_epochs=num_train_epochs,
        save_total_limit=2,
        fp16=torch.cuda.is_available(),
        report_to="none"
    )

    # ✅ Safe initialization with fallback
    try:
        training_args = Seq2SeqTrainingArguments(evaluation_strategy="epoch", **args_dict)
    except TypeError:
        training_args = Seq2SeqTrainingArguments(eval_strategy="epoch", **args_dict)

    trainer = Seq2SeqTrainer(
        model=summarize_model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_test,
        tokenizer=summarize_tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_rouge
    )

    trainer.train()
    save_model_and_tokenizer(summarize_model, summarize_tokenizer, output_dir_summarizer)

# ===== CLASSIFIER TRAINING =====
if do_train_classifier:
    print("\n=== Classifier training ===")
    cls_tokenizer, cls_model = prepare_classification_tokenizer_and_model()
    print("Loaded classifier model:", type(cls_model).__name__)

    possible_labels = ['label', 'labels', 'risk_label', 'target', 'category']
    label_col = next((col for col in possible_labels if col in datasets['train'].column_names), None)

    if not label_col:
        raise KeyError(f"No label column found. Available: {datasets['train'].column_names}")
    print(f"✅ Using label column: '{label_col}'")

    def tokenize_for_classification(batch, tokenizer, max_length=256):
        texts = batch.get('text') or batch.get('input') or batch.get('document')
        toks = tokenizer(
            texts,
            padding="max_length",
            truncation=True,
            max_length=max_length
        )
        toks['labels'] = batch[label_col]
        return toks

    tokenized_train = datasets['train'].map(
        lambda batch: tokenize_for_classification(batch, cls_tokenizer),
        batched=True,
        remove_columns=datasets['train'].column_names
    )
    tokenized_test = datasets['test'].map(
        lambda batch: tokenize_for_classification(batch, cls_tokenizer),
        batched=True,
        remove_columns=datasets['test'].column_names
    )

    data_collator_cls = DataCollatorWithPadding(tokenizer=cls_tokenizer)

    args_dict_cls = dict(
        output_dir=output_dir_classifier,
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        num_train_epochs=num_train_epochs,
        save_total_limit=2,
        fp16=torch.cuda.is_available(),
        report_to="none"
    )

    try:
        training_args_cls = TrainingArguments(evaluation_strategy="epoch", **args_dict_cls)
    except TypeError:
        training_args_cls = TrainingArguments(eval_strategy="epoch", **args_dict_cls)

    def compute_metrics_cls(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        average_type = "binary" if len(set(labels)) == 2 else "macro"
        return {
            "accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds, average=average_type),
            "precision": precision_score(labels, preds, average=average_type, zero_division=0),
            "recall": recall_score(labels, preds, average=average_type, zero_division=0)
        }

    trainer_cls = Trainer(
        model=cls_model,
        args=training_args_cls,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_test,
        tokenizer=cls_tokenizer,
        data_collator=data_collator_cls,
        compute_metrics=compute_metrics_cls
    )

    trainer_cls.train()
    save_model_and_tokenizer(cls_model, cls_tokenizer, output_dir_classifier)


Transformers version: 4.57.0
Dataset columns: ['text', 'summary', 'risk_label']

=== Summarizer training ===
Loaded summarizer model: T5ForConditionalGeneration


Map: 100%|██████████| 36/36 [00:00<00:00, 274.27 examples/s]
Map: 100%|██████████| 10/10 [00:00<00:00, 641.42 examples/s]
  trainer = Seq2SeqTrainer(


Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
1,10.9082,8.255288,25.7396,19.6426,23.316,23.3246
2,5.4795,4.055892,25.2366,21.6667,24.7344,24.746
3,4.0388,2.769382,12.9804,11.8301,12.8302,12.9804




✅ Model and tokenizer saved to ./summarizer_out

=== Classifier training ===


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loaded classifier model: DistilBertForSequenceClassification
✅ Using label column: 'risk_label'


Map: 100%|██████████| 36/36 [00:00<00:00, 631.96 examples/s]
Map: 100%|██████████| 10/10 [00:00<00:00, 881.40 examples/s]
  trainer_cls = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.871505,0.8,0.0,0.0,0.0
2,No log,1.080064,0.8,0.0,0.0,0.0
3,No log,1.077471,0.8,0.0,0.0,0.0




✅ Model and tokenizer saved to ./classifier_out


# Added: OCR + Test Pipeline

**What this adds:**

- Functions to convert PDF → images → OCR text (uses `pdf2image` + `pytesseract`).
- A summarization function that uses Hugging Face `transformers` if available, otherwise a simple fallback summarizer.
- A simple "hidden risk" detector that searches for legal/risk keywords and computes a risk score.
- A test cell to run the pipeline on a PDF, produce a `pandas.DataFrame` with columns: page, original_text, summary, hidden_risks, risk_score, and export to CSV/Excel.

Run the following code cells in order. If a library is missing, uncomment the pip install lines to install it in your environment.


In [20]:
# Dependencies (uncomment to install if missing)
# !pip install pytesseract pdf2image pillow pandas openpyxl transformers sentencepiece --quiet

import os
import io
from PIL import Image
import pytesseract
import pandas as pd
import math
from pdf2image import convert_from_path
import textwrap
print('imports ok')


imports ok


In [21]:
def ocr_pdf_to_page_texts(pdf_path, dpi=300, first_page=None, last_page=None, poppler_path=None):
    """Return list of page text strings for a PDF using pdf2image + pytesseract.
    - pdf_path: path to PDF file
    - dpi: resolution for conversion
    - first_page/last_page: optional page range (1-based)
    - poppler_path: if you're on Windows and have poppler installed, provide path to bin folder
    """
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f'PDF not found: {pdf_path}')
    convert_kwargs = {'dpi': dpi}
    if first_page is not None or last_page is not None:
        convert_kwargs['first_page'] = first_page
        convert_kwargs['last_page'] = last_page
    if poppler_path:
        pages = convert_from_path(pdf_path, poppler_path=poppler_path, **convert_kwargs)
    else:
        pages = convert_from_path(pdf_path, **convert_kwargs)
    texts = []
    for i, page in enumerate(pages, start=1):
        txt = pytesseract.image_to_string(page)
        txt = txt.replace('\x0c','').strip()
        texts.append(txt)
    return texts


In [22]:
def summarize_text(text, max_length=180, min_length=30, model_name='sshleifer/distilbart-cnn-12-6'):
    """Try transformers summarization; if not available, fall back to a simple sentence-based summarizer."""
    try:
        from transformers import pipeline
        summarizer = pipeline('summarization', model=model_name)
        if len(text) < 1000:
            out = summarizer(text, max_length=max_length, min_length=min_length, truncation=True)
            return out[0]['summary_text']
        sentences = text.split('. ')
        chunks = []
        cur = ''
        for s in sentences:
            if len(cur) + len(s) < 800:
                cur += s + '. '
            else:
                chunks.append(cur)
                cur = s + '. '
        if cur:
            chunks.append(cur)
        summaries = []
        for c in chunks:
            out = summarizer(c, max_length=max_length, min_length=20, truncation=True)
            summaries.append(out[0]['summary_text'])
        return ' '.join(summaries)
    except Exception:
        import re
        sents = re.split(r'(?<=[.!?]) +', text)
        if not sents:
            return text[:max_length]
        sents_sorted = sorted(sents, key=lambda s: len(s), reverse=True)
        res = []
        cur_len = 0
        for s in sents_sorted:
            res.append(s.strip())
            cur_len += len(s)
            if cur_len > max_length:
                break
        return ' '.join(res)[:max_length]


In [23]:
RISK_KEYWORDS = [
    'breach','penalty','terminate','termination','liability','indemnity','fine','lawsuit','sue',
    'obligation','breach of contract','default','defaulted','safety','risk','fraud','non-compliance',
    'confidential','unauthorized','exposure','privacy','violation','penalty','interest','charge'
]

def detect_hidden_risks(text):
    lower = text.lower()
    found = []
    for kw in RISK_KEYWORDS:
        if kw in lower:
            found.append(kw)
    uniq_hits = len(set(found))
    score = 1 - math.exp(-0.5 * uniq_hits)
    return found, float(score)


In [29]:
pdf_path = r"C:\Users\hp\PycharmProjects\mini\CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement.pdf"
poppler_path = r"C:\Users\hp\Downloads\Release-25.07.0-0\poppler-25.07.0\Library\bin"




Unnamed: 0,page,original_text,summary,hidden_risks,risk_score
0,1,"Exhibit 10.33\n\nLast Updated: April 6, 2007\n...",Chase will pay Affiliate a fee for each approv...,LABEL_0,0.866553
1,2,¢ Incorporates any materials which infringe or...,Chase reserves the right to terminate this Agr...,LABEL_0,0.861438
2,3,¢ STARBUCKS\n\n¢ SUBARU\n\n¢ TEMPLE UNIVERSITY...,"STARBUCKS SUBARU TEMPLE UNIVERSITY TOYS ""R""...",LABEL_0,0.843836
3,4,¢ If Affiliate manages a sub-affiliate network...,Affiliate may not pay sub-affiliates or other ...,LABEL_0,0.890771
4,5,7. Order Processing\n\nChase will be solely re...,customers may only use the Chase on-line appli...,LABEL_0,0.871817


{'doc_summary': 'Chase will pay Affiliate a fee for each approved credit card account that originates from a link in Affiliate’s Website. AFFILIATE CERTIFIES THAT YOU HAVE READ AND UNDERSTAND THE TERMS SET FORTH BELOW, AND THAT YOU AREAUTHORIZED TO SUBMIT THIS REGISTRATION FORM BY THE NAMED AFFILIATE. AFFILIATE CERTIFIES THAT YOU HAVE READ AND UNDERSTAND THE TERMS', 'doc_risks': 'LABEL_0', 'doc_risk_score': 0.8665527701377869, 'pages': 12}


In [4]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, AutoTokenizer, AutoModelForSequenceClassification
import torch, numpy as np, warnings, logging, contextlib, io
from pdf2image import convert_from_path
import pytesseract
from tqdm import tqdm

# ============================
# 1️⃣ LOAD TRAINED MODELS
# ============================
summarize_model = T5ForConditionalGeneration.from_pretrained("./summarizer_out")
summarize_tokenizer = T5Tokenizer.from_pretrained("./summarizer_out")

cls_model = AutoModelForSequenceClassification.from_pretrained("./classifier_out")
cls_tokenizer = AutoTokenizer.from_pretrained("./classifier_out")

# ============================
# 2️⃣ SUMMARIZATION FUNCTION
# ============================
def summarize_text(text):
    if not text.strip():
        return ""
    inputs = summarize_tokenizer(
        "summarize: " + text,
        return_tensors="pt",
        truncation=True,
        max_length=512
    )
    summary_ids = summarize_model.generate(
        inputs["input_ids"],
        max_length=150,
        min_length=40,
        length_penalty=2.0,
        num_beams=4
    )
    return summarize_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# ============================
# 3️⃣ RISK DETECTION FUNCTION
# ============================
def detect_hidden_risks(text):
    if not text.strip():
        return "No Text", 0.0
    inputs = cls_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = cls_model(**inputs).logits
        probs = torch.softmax(logits, dim=-1).numpy()[0]
        label_id = np.argmax(probs)
        score = float(probs[label_id])
    label_name = cls_model.config.id2label.get(label_id, str(label_id))
    return label_name, score

# ============================
# 4️⃣ OCR CONVERSION FUNCTION
# ============================
def ocr_pdf_to_page_texts(pdf_path, dpi=300, poppler_path=None):
    try:
        pages = convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_path)
    except Exception as e:
        print(f"❌ PDF conversion failed: {e}")
        return []
    texts = []
    for i, page in enumerate(tqdm(pages, desc="OCR Processing"), start=1):
        try:
            text = pytesseract.image_to_string(page)
        except Exception as e:
            print(f"⚠️ OCR failed on page {i}: {e}")
            text = ""
        texts.append(text)
    return texts

# ============================
# 5️⃣ MAIN PIPELINE FUNCTION
# ============================
def run_full_pipeline(pdf_path, dpi=300, poppler_path=None):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        logging.getLogger().setLevel(logging.ERROR)
        with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
            pages_text = ocr_pdf_to_page_texts(pdf_path, dpi=dpi, poppler_path=poppler_path)

    if not pages_text:
        return "(❌ No text extracted from PDF.)"

    # Combine all text from all pages
    full_text = '\n\n'.join(pages_text)

    # Generate document-level summary
    doc_summary = summarize_text(full_text)
    return doc_summary if doc_summary else "(No overall summary generated.)"

# ============================
# 6️⃣ RUN THE PIPELINE
# ============================
pdf_path = r"C:\Users\hp\PycharmProjects\mini\CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement.pdf"
poppler_path = r"C:\Users\hp\Downloads\Release-25.07.0-0\poppler-25.07.0\Library\bin"

final_summary = run_full_pipeline(pdf_path, dpi=300, poppler_path=poppler_path)

# ============================
# 7️⃣ DISPLAY ONLY FINAL SUMMARY
# ============================
print("\n==================== 📄 FINAL DOCUMENT SUMMARY ====================")
print(final_summary)
print("====================================================================\n")



Chase will pay Affiliate a fee for each approved credit card account that originates from a link in Affiliate’s Website. AFFILIATE CERTIFIES THAT YOU HAVE READ AND UNDERSTAND THE TERMS SET FORTH BELOW, AND THAT YOU AREAUTHORIZED TO SUBMIT THIS REGISTRATION FORM BY THE NAMED AFFILIATE. AFFILIATE CERTIFIES THAT YOU HAVE READ AND UNDERSTAND THE TERMS



In [10]:
# ============================
# 6️⃣ RUN THE PIPELINE
# ============================
pdf_path = r"C:\Users\hp\PycharmProjects\mini\CreditcardscomInc_20070810_S-1_EX-10.33_362297_EX-10.33_Affiliate Agreement.pdf"
poppler_path = r"C:\Users\hp\Downloads\Release-25.07.0-0\poppler-25.07.0\Library\bin"

final_summary = run_full_pipeline(pdf_path, dpi=300, poppler_path=poppler_path)

# ============================
# 7️⃣ DISPLAY ONLY FINAL SUMMARY
# ============================
print("\n==================== 📄 FINAL DOCUMENT SUMMARY ====================")
print(final_summary)
print("====================================================================\n")



📜 ORIGINAL TEXT:
Exhibit 10.33

Last Updated: April 6, 2007

CHASE AFFILIATE AGREEMENT

THIS AGREEMENT sets forth the terms and conditions agreed to between Chase Bank USA, N.A. (?Chase?) and you as an “Affiliate” in the Chase
Affiliate Program (the “Affiliate Program’). Once accepted into the Affiliate Program, an Affiliate can establish links from the Affiliate’s Website to
[Chase.com]. Chase will pay Affiliate a fee for each approved credit card account that originates from a link in Affiliate’s Website.

THIS IS A LEGAL AND CONTRACTUALLY BINDING AGREEMENT BETWEEN AFFILIATE AND CHASE. TO APPLY TO THE AFFILIATE
PROGRAM, YOU MUST COMPLETE AND SUBMIT THE AFFILIATE REGISTRATION FORM AND CLICK ON THE “AGREE” BUTTON BELOW
TO INDICATE YOUR WILLINGNESS TO BE BOUND TO CHASE BY THIS AGREEMENT. THIS AGREEMENT WILL TAKE EFFECT IF AND
WHEN CHASE REVIEWS AND ACCEPTS YOUR REGISTRATION FORM AND PROVIDES YOU NOTICE OF ACCEPTANCE. BY SUBMITTING
YOUR REGISTRATION FORM, AFFILIATE CERTIFIES THAT YOU H

## Notes / Troubleshooting

- If `pdf2image.convert_from_path` raises an error on Windows, install Poppler and pass `poppler_path='C:/path/to/poppler/bin'` to the conversion function.
- If `pytesseract` can't be found, install Tesseract-OCR and ensure it's on your PATH (or set `pytesseract.pytesseract.tesseract_cmd` to the tesseract.exe path).
- For better summaries, install `transformers` and allow the notebook to download a model (requires internet). Example model: `sshleifer/distilbart-cnn-12-6`.
- The hidden risk detector is a keyword-based heuristic. For production use, consider an ML classifier or rule engine.


In [2]:
from transformers import pipeline
import textwrap

# Load summarization model (smaller and CPU-friendly)
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device=-1)

# Example: Replace this with your OCR or contract text
ocr_text = """Paste your full legal contract text here..."""

# Split into smaller chunks of ~800 characters each
chunks = textwrap.wrap(ocr_text, width=800)
print(f"[+] Total Chunks: {len(chunks)} — Summarizing, please wait...\n")

summaries = []

for i, chunk in enumerate(chunks, 1):
    result = summarizer(chunk, max_length=300, min_length=100, do_sample=False)
    summaries.append(result[0]['summary_text'])
    print(f"✅ Processed chunk {i}/{len(chunks)}")

# Combine all summaries
final_summary = " ".join(summaries)
print("\n============================================================")
print("📄 SIMPLIFIED LEGAL CONTRACT:\n")
print(final_summary)
print("============================================================")

  from .autonotebook import tqdm as notebook_tqdm
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Device set to use cpu
Your max_length is set to 300, but your input_length is only 10. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)


[+] Total Chunks: 1 — Summarizing, please wait...

✅ Processed chunk 1/1

📄 SIMPLIFIED LEGAL CONTRACT:

 Paste the full legal contract text here... Paste it into your legal contract . Copy the text of the contract text below to help you understand your legal team's decision to sign it up . Paste the text here to see how you sign a legal contract in the U.S. State of the State of New York City. Paste it here to read the full contract text for the first time in a row. Paste the legal text to see if you want to sign a new contract in a new place of business .
