## Step 1: Load the master clause file

In [2]:
# Load master_clauses file
import pandas as pd
master_clauses = pd.read_csv("CUAD_v1/master_clauses.csv")

## Step 2: Load CUAD full documents

In [4]:
import os

# Set paths
label_folder = "CUAD_v1/label_group_xlsx"
all_dfs = []

# Loop through all Excel label files
for file in os.listdir(label_folder):
    if file.endswith(".xlsx"):
        path = os.path.join(label_folder, file)
        df = pd.read_excel(path)
        df = df.melt(id_vars=["Filename"], var_name="Clause_Type", value_name="Clause_Text")
        df = df.dropna(subset=["Clause_Text"])
        all_dfs.append(df)

# Combine all clauses
df_long_all = pd.concat(all_dfs, ignore_index=True)
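
To see what the `melt` step does to a single label file, here is a minimal sketch on a toy frame. The file names, the two clause columns, and the clause texts are made-up stand-ins, not the actual contents of the CUAD label sheets.

```python
import pandas as pd

# Toy stand-in for one wide label sheet (hypothetical values):
# one row per contract, one column per clause type.
wide = pd.DataFrame({
    "Filename": ["contract_a.pdf", "contract_b.pdf"],
    "Governing Law": ["Governed by the laws of New York.", None],
    "Anti-Assignment": [None, "Neither party may assign this Agreement."],
})

# Wide -> long: one row per (contract, clause type) pair.
long_df = wide.melt(id_vars=["Filename"], var_name="Clause_Type", value_name="Clause_Text")

# Drop the pairs where the contract has no text for that clause type.
long_df = long_df.dropna(subset=["Clause_Text"]).reset_index(drop=True)

print(long_df)
```

Each surviving row is one labeled clause, which is the shape the classification steps expect.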

## PHASE 1 - Clause Classification System
Goal: Evaluate BERT & legal-specific models for classifying clauses into 41 CUAD clause types.

## Step 1.1 - Prepare Training Data
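You need labeled clause examples. Then encode the clause types as integers so that they are ready for training text classification models such as BERT or LegalBERT. CUAD defines 41 clause types; if the encoder reports a different number, inspect `le.classes_` for columns that are not clause types before training.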

In [6]:
from sklearn.preprocessing import LabelEncoder

# Prepare for encoding
df_clauses = df_long_all[["Clause_Text", "Clause_Type"]].rename(
    columns={"Clause_Text": "text", "Clause_Type": "label"}
)

# Encode the clause type labels as integers (LabelEncoder assigns ids in sorted order of the label names)
le = LabelEncoder()
df_clauses["label_encoded"] = le.fit_transform(df_clauses["label"])

print("Number of unique clause types:", len(le.classes_))

Number of unique clause types: 47


## Step 1.2 - Train Models
Let's start with:
1. Split your data
2. Train a BERT model using Hugging Face's `transformers` and `datasets`
3. Evaluate it
4. Repeat with LegalBERT (a legal-domain-pretrained version of BERT)

Train and compare:
- BERT-base
- LegalBERT
- Any other specialized model...
Use Hugging Face `Trainer` or `transformers` + `sklearn` pipeline.

## Prepare datasets for model training
Here's a setup using Hugging Face:

In [8]:
from datasets import Dataset
from sklearn.model_selection import train_test_split

# Keep only text and label_encoded
df_clauses_model = df_clauses[["text", "label_encoded"]].copy()

# Force 'text' column to be string type
df_clauses_model["text"] = df_clauses_model["text"].astype(str)

# Split into train/test sets
train_df, test_df = train_test_split(df_clauses_model, 
                                     test_size=0.2, 
                                     stratify=df_clauses_model["label_encoded"], 
                                     random_state=42)

# Convert to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

## Tokenization
Here's what we'll do next:
1. Load a tokenizer (e.g. BERT's tokenizer).
2. Apply the tokenizer to your datasets.
3. Format the output for training.

In [10]:
# Load the tokenizer from a pre-trained model like bert-base-uncased
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the datasets: Define a function to tokenize the text and apply it
def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Rename the label column to 'labels', the name the Trainer expects
train_dataset = train_dataset.rename_column("label_encoded", "labels")
test_dataset = test_dataset.rename_column("label_encoded", "labels")

# Set format to PyTorch, keeping only the columns the model consumes
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])




## Fine-tune BERT base model
We'll use the `Trainer` API from Hugging Face's `transformers`.

In [17]:
import transformers
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Load BERT base model
num_labels = len(le.classes_)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)

# Training arguments
training_args = TrainingArguments(
    output_dir="./bert_results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
)

# Metric function
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    labels = p.label_ids
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro")
    }

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
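
Because `metric_for_best_model` is the macro-averaged F1, rare clause types count as much as frequent ones. The sketch below computes macro-F1 by hand on a made-up, imbalanced label set to show how far it can diverge from accuracy. The helper name `macro_f1` and the toy labels are illustrative only and are not part of the notebook.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy data: class 0 dominates and the classifier always predicts class 0.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.2f}")
```

A model that ignores the rare classes keeps a high accuracy but a low macro-F1, which is why macro-F1 is the more informative selection metric when clause types are unevenly represented.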


In [None]:
from transformers import TrainerCallback

class PrintCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        print(f"Step {state.global_step}: {logs}")

# Instantiate the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    callbacks=[PrintCallback()],
)

print('Training start...')
trainer.train()
print('Training done.')

Training start...


Epoch,Training Loss,Validation Loss


In [1]:
#from transformers import TrainingArguments
#help(TrainingArguments)
#print(TrainingArguments.__module__)

#args_test = TrainingArguments(output_dir="./test")
#print(args_test.output_dir)

import transformers
print(transformers.__version__)
#print(transformers.__file__)


4.37.2


## Step 1.3 - Evaluate
Evaluate using:

In [None]:
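from sklearn.metrics import classification_report
# y_true and y_pred come from the fine-tuned model's predictions on the held-out test set
pred = trainer.predict(test_dataset)
print(classification_report(pred.label_ids, pred.predictions.argmax(axis=1), labels=list(range(len(le.classes_))), target_names=le.classes_))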

You can compare macro-F1, precision, recall per clause type.

## PHASE 2 - Clause Risk Assessment
Goal: Predict severity or risk level of clauses

## Step 2.1 - Define Risk Labels
If your dataset doesn't already have risk labels, you must create them.
Options:
- Use heuristic rules (e.g. clauses containing "terminate", "waive", or "without consent" are high risk)
- Or manually annotate a subset (High, Medium, Low)

Example:

In [None]:
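def simple_risk_heuristic(text):
    # Case-insensitive substring match, so "may" also fires on words like "mayor"
    lowered = text.lower()
    if any(word in lowered for word in ["terminate", "penalty", "waive", "without consent"]):
        return "High"
    elif any(word in lowered for word in ["notice", "require", "may", "prior"]):
        return "Medium"
    else:
        return "Low"

# astype(str) guards against non-string cells (e.g. numbers read from Excel)
df_clauses["risk_level"] = df_clauses["text"].astype(str).apply(simple_risk_heuristic)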

You can refine this over time using patterns or ML models.
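
One way to refine the rule with patterns is to match whole words instead of substrings, so that "may" no longer fires on "mayor" and "prior" no longer fires on "priority". The following is a minimal sketch that reuses the keyword lists from the heuristic. The function name `regex_risk_heuristic` and the inflection-tolerant patterns (`terminat\w*` and so on) are assumptions, not part of the original notebook.

```python
import re

# Word-boundary patterns; \w* lets a stem match its inflections (terminate, terminated, ...).
HIGH_RISK = re.compile(r"\b(terminat\w*|penalt\w*|waive\w*|without consent)\b", re.IGNORECASE)
MEDIUM_RISK = re.compile(r"\b(notice|requir\w*|may|prior)\b", re.IGNORECASE)

def regex_risk_heuristic(text):
    """Return 'High', 'Medium' or 'Low' based on whole-word keyword matches."""
    if HIGH_RISK.search(text):
        return "High"
    if MEDIUM_RISK.search(text):
        return "Medium"
    return "Low"

examples = [
    "Either party may terminate this Agreement.",
    "Licensee may sublicense with prior written notice.",
    "The Mayor's office sets the priority list.",
]
for sentence in examples:
    print(regex_risk_heuristic(sentence), "|", sentence)
```

The third sentence is labeled Low here, whereas a plain substring test would label it Medium because it contains "may" and "prior" inside longer words.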

## Step 2.2 - Train Secondary Model

Train another classifier to predict `risk_level` from clause text. Again, use BERT or LegalBERT.
Input: Clause Text
Target: Risk level (High, Medium, Low)
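
A minimal sketch of preparing the targets for this second classifier, assuming the three risk levels from Step 2.1. The names `RISK_LABELS`, `label2id`, `id2label` and `encode_risk_levels`, as well as the example clauses, are illustrative.

```python
# Ordered from least to most severe so the integer ids are easy to read.
RISK_LABELS = ["Low", "Medium", "High"]
label2id = {name: idx for idx, name in enumerate(RISK_LABELS)}
id2label = {idx: name for name, idx in label2id.items()}

def encode_risk_levels(records):
    """Turn {'text', 'risk_level'} records into {'text', 'labels'} records with integer targets."""
    return [{"text": r["text"], "labels": label2id[r["risk_level"]]} for r in records]

# Hypothetical clauses standing in for rows of df_clauses.
records = [
    {"text": "Either party may terminate without consent.", "risk_level": "High"},
    {"text": "Thirty days prior notice is required.", "risk_level": "Medium"},
]
encoded = encode_risk_levels(records)
print(encoded)
```

With targets in this form, the Phase 1 training setup can be reused with `num_labels=3`.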

## PHASE 3 - Visual Risk Mapping
Goal: Create visualizations like risk heatmaps / color-coded contract sections

## Step 3.1 - Highlight Clauses in Full Text
For each contract file:
- Highlight or color-code clause segments in context

Use libraries like:
- spaCy + displacy for basic HTML highlighting
- Plotly / Dash for heatmaps or interactive views
- Streamlit if you build a UI tool

You could create a dictionary like:

In [None]:
{ "clause_text": "X", "risk": "High", "color": "red" }

Then, render full text with color-coded spans.
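
As a rough sketch of that rendering step, the function below wraps each clause it finds in a colored `<span>` and escapes the rest of the text, using only the standard library. The function name `highlight_clauses`, the sample contract text, and the choice to highlight only the first occurrence of each clause are assumptions made for illustration.

```python
import html

def highlight_clauses(full_text, clauses):
    """Wrap each clause found in full_text in a colored <span>; escape everything else."""
    # Locate the first occurrence of each clause in the contract text.
    spans = []
    for clause in clauses:
        start = full_text.find(clause["clause_text"])
        if start != -1:
            spans.append((start, start + len(clause["clause_text"]), clause["color"]))
    spans.sort()

    parts, cursor = [], 0
    for start, end, color in spans:
        if start < cursor:  # skip a clause that overlaps one already highlighted
            continue
        parts.append(html.escape(full_text[cursor:start]))
        parts.append(
            f'<span style="background-color:{color}">{html.escape(full_text[start:end])}</span>'
        )
        cursor = end
    parts.append(html.escape(full_text[cursor:]))
    return "".join(parts)

contract = "This Agreement is governed by New York law. Either party may terminate at will."
clauses = [{"clause_text": "Either party may terminate at will.", "risk": "High", "color": "red"}]
rendered = highlight_clauses(contract, clauses)
print(rendered)
```

In a notebook, the resulting string can be displayed with `IPython.display.HTML(rendered)`.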

## Step 3.2 - Risk Heatmaps
If you divide contracts into sections (e.g. every 500 tokens), you can visualize clause density per section:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# dummy example
risk_scores = [0.2, 0.7, 0.9, 0.4, 0.1]
sns.heatmap([risk_scores], cmap="Reds", xticklabels=["Sec1", "Sec2", "Sec3", "Sec4", "Sec5"])
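
The hard-coded `risk_scores` in the dummy example could instead be derived from labeled clauses. Here is a minimal sketch, assuming each clause is described by the token index where it starts and its risk level, and that a section's score is the highest risk weight among the clauses starting in it. The weights in `RISK_WEIGHTS` and the function name `section_risk_scores` are illustrative choices.

```python
# Assumed numeric weights for the three risk levels (illustrative values).
RISK_WEIGHTS = {"High": 1.0, "Medium": 0.5, "Low": 0.1}

def section_risk_scores(tokens, clause_spans, section_size=500):
    """Return one score per section of `section_size` tokens.

    clause_spans is a list of (start_token_index, risk_level) pairs; a section's
    score is the maximum weight of any clause that starts inside it.
    """
    n_sections = max(1, -(-len(tokens) // section_size))  # ceiling division
    scores = [0.0] * n_sections
    for start, risk in clause_spans:
        idx = min(start // section_size, n_sections - 1)
        scores[idx] = max(scores[idx], RISK_WEIGHTS[risk])
    return scores

tokens = ["tok"] * 1200  # stand-in for a whitespace-tokenized contract
clause_spans = [(10, "Low"), (480, "Medium"), (620, "High"), (1100, "Medium")]
risk_scores = section_risk_scores(tokens, clause_spans)
print(risk_scores)
```

The resulting list has one entry per section and can be passed to `sns.heatmap([risk_scores], cmap="Reds")` in place of the dummy values.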

## PHASE 4 - Explainability
Goal: Explain why a clause is flagged as risky or belonging to a class.

## Step 4.1 - Use Explainability Tools
Start with:
- LIME (Local Interpretable Model-agnostic Explanations)
- SHAP (SHapley Additive Explanations)
- Attention Visualization for transformers

In [None]:
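# Example: LIME
import torch
from lime.lime_text import LimeTextExplainer

model.eval()
def predict_proba(texts):
    # LIME passes a list of perturbed strings and expects an (n_samples, n_classes) probability array
    enc = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        return torch.softmax(model(**enc).logits, dim=-1).cpu().numpy()

text_instance = test_df["text"].iloc[0]  # any clause string
explainer = LimeTextExplainer(class_names=list(le.classes_))
explanation = explainer.explain_instance(text_instance, predict_proba, top_labels=1, num_samples=200)
explanation.show_in_notebook()

# Example: SHAP (assumes a text-classification pipeline wrapper so SHAP can call the model on raw strings)
import shap
from transformers import pipeline
clf = pipeline("text-classification", model=model, tokenizer=tokenizer, top_k=None)
shap_explainer = shap.Explainer(clf)
shap_values = shap_explainer(["clause text here"])
shap.plots.text(shap_values[0])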

Attention:
Use the `bertviz` library, which works with Hugging Face models, or extract the attention weights yourself (load the model with `output_attentions=True`) to build heatmaps of token importance.