# **Efficient Legal Summarization: Harnessing Transformers for Terms and Conditions**

## **Team Members:**

#### - **Shravan Venkatraman 21BCE1200**
#### - **Pavan Kumar S 21BCE1179**
#### - **Ayush Kumar Lal 21BCE1129**

# **Abstract**

Legal terms and conditions are often **complex and lengthy**, making them **difficult to understand**. Summarizing these documents manually is **time-consuming** and **error-prone**, and mistakes in a summary can lead to **serious problems** later.

To tackle this problem, we propose a **novel** encoder-decoder **NLP transformer architecture**, built on BART, to **automate** the summarization process.

The input document is tokenized and embedded, and the embeddings are processed by our **Transformer-based model**, which uses an **Adaptive Multi-Head Attention (AMHA)** mechanism to generate a **concise summary** of the document.

This approach aims to **improve accessibility** and **comprehension** of legal documents, offering a **more efficient and less error-prone solution**.


# **1. Setup and Imports**

In [1]:
!pip install -q transformers==4.31.0 torch pypdf2

In [2]:
import torch
import torch.nn as nn
from transformers import BartTokenizer, BartConfig
from transformers.models.bart.modeling_bart import (
    BartForConditionalGeneration,
    BartEncoder,
    BartDecoder,
    BartAttention,
    BartEncoderLayer,
    BartDecoderLayer,
)
import PyPDF2
import os
from torch.utils.data import Dataset, DataLoader
from transformers import AdamW, get_linear_schedule_with_warmup


# **2. Define the Adaptive Multi-Head Attention (AMHA)**

In [3]:
class AdaptiveMultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.0, is_decoder=False):
        super(AdaptiveMultiHeadAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads  # Initial number of heads
        self.max_heads = num_heads  # Set max_heads to num_heads for compatibility
        self.head_dim = embed_dim // self.max_heads
        assert (
            self.head_dim * self.max_heads == self.embed_dim
        ), "embed_dim must be divisible by max_heads"

        self.scaling = self.head_dim ** -0.5

        self.is_decoder = is_decoder
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

        # Adaptive parameters
        self.head_selector = nn.Linear(embed_dim, self.max_heads)

        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        hidden_states,
        key_value_states=None,
        past_key_value=None,
        attention_mask=None,
        layer_head_mask=None,
        output_attentions=False,
    ):
        """Input shape: Batch x Time x Channel"""

        is_cross_attention = key_value_states is not None

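        # Adaptive head selection: count the heads whose probability exceeds
        # the uniform baseline per sample, then take the batch maximum so the
        # count does not grow with the batch size
        head_scores = self.head_selector(hidden_states).mean(dim=1)  # [Batch, max_heads]
        head_probs = torch.softmax(head_scores, dim=-1)
        active_heads = int((head_probs > (1 / self.max_heads)).sum(dim=-1).max().item())
        num_heads = max(1, min(active_heads, self.max_heads))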

        # Project queries, keys, and values
        query_states = self.q_proj(hidden_states) * self.scaling

        if is_cross_attention:
            if past_key_value is not None:
                key_states = past_key_value[0]
                value_states = past_key_value[1]
            else:
                key_states = self.k_proj(key_value_states)
                value_states = self.v_proj(key_value_states)
                if self.is_decoder:
                    # If caching is enabled, set up past key value states
                    past_key_value = (key_states, value_states)
        else:
            key_states = self.k_proj(hidden_states)
            value_states = self.v_proj(hidden_states)
            if self.is_decoder and past_key_value is not None:
                # Append to past key value states for caching
                key_states = torch.cat([past_key_value[0], key_states], dim=1)
                value_states = torch.cat([past_key_value[1], value_states], dim=1)
                past_key_value = (key_states, value_states)
            elif self.is_decoder:
                past_key_value = (key_states, value_states)

        # Reshape to [Batch, Time, Num_heads, Head_dim]
        query_states = query_states.view(
            hidden_states.size(0), -1, self.max_heads, self.head_dim
        )[:, :, :num_heads, :]
        key_states = key_states.view(
            hidden_states.size(0), -1, self.max_heads, self.head_dim
        )[:, :, :num_heads, :]
        value_states = value_states.view(
            hidden_states.size(0), -1, self.max_heads, self.head_dim
        )[:, :, :num_heads, :]

        # Transpose for attention calculation
        query_states = query_states.transpose(1, 2)  # [Batch, Num_heads, Time, Head_dim]
        key_states = key_states.transpose(1, 2)
        value_states = value_states.transpose(1, 2)

        # Compute attention scores
        attn_weights = torch.matmul(query_states, key_states.transpose(-1, -2))

        if attention_mask is not None:
            attn_weights += attention_mask

        attn_weights = torch.softmax(attn_weights, dim=-1)

        if layer_head_mask is not None:
            attn_weights = attn_weights * layer_head_mask.view(1, -1, 1, 1)

        attn_probs = self.dropout(attn_weights)

        # Apply attention to values
        attn_output = torch.matmul(attn_probs, value_states)

        # Transpose and reshape to [Batch, Time, Num_heads * Head_dim]
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(
            hidden_states.size(0), -1, num_heads * self.head_dim
        )

        # Pad or truncate attn_output to match embed_dim
        if attn_output.size(-1) < self.embed_dim:
            padding = attn_output.new_zeros(
                attn_output.size(0), attn_output.size(1), self.embed_dim - attn_output.size(-1)
            )
            attn_output = torch.cat([attn_output, padding], dim=-1)
        elif attn_output.size(-1) > self.embed_dim:
            attn_output = attn_output[:, :, : self.embed_dim]

        attn_output = self.out_proj(attn_output)

        outputs = (attn_output,)

        if output_attentions:
            outputs += (attn_weights,)
        else:
            outputs += (None,)

        outputs += (past_key_value,)

        return outputs  # attn_output, attn_weights (or None), past_key_value
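
As a standalone illustration of the head-count rule used in `forward`, the sketch below applies the same softmax-and-threshold test to a single vector of head scores in plain Python. The function name `count_active_heads` and the example scores are hypothetical and not part of the notebook; the class itself works on batched tensors.

```python
import math

def count_active_heads(head_scores):
    """Count heads whose softmax probability exceeds the uniform baseline 1/H."""
    max_heads = len(head_scores)
    shift = max(head_scores)  # subtract the max for numerical stability
    exps = [math.exp(score - shift) for score in head_scores]
    total = sum(exps)
    probs = [value / total for value in exps]
    active = sum(prob > 1.0 / max_heads for prob in probs)
    # Clamp to at least one head, as the attention module does
    return max(1, min(active, max_heads))

# Hypothetical scores for a 4-head layer
print(count_active_heads([2.0, 1.8, 0.0, -1.0]))  # two scores dominate
print(count_active_heads([0.0, 0.0, 0.0, 0.0]))   # uniform scores fall back to one head
```

Because the threshold is the uniform probability `1/H`, perfectly uniform scores activate no head on their own, which is why the clamp to a minimum of one head matters.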


# **3. Create Custom Encoder and Decoder Layers**

## **3.1 Custom Encoder Layer**

In [4]:
class CustomBartEncoderLayer(BartEncoderLayer):
    def __init__(self, config):
        super().__init__(config)
        self.self_attn = AdaptiveMultiHeadAttention(
            embed_dim=config.d_model,
            num_heads=config.encoder_attention_heads,
            dropout=config.attention_dropout,
            is_decoder=False,
        )
        # Re-initialize the layer norm to match dimensions
        self.self_attn_layer_norm = nn.LayerNorm(config.d_model)


## **3.2 Custom Decoder Layer**

In [5]:
class CustomBartDecoderLayer(BartDecoderLayer):
    def __init__(self, config):
        super().__init__(config)
        self.self_attn = AdaptiveMultiHeadAttention(
            embed_dim=config.d_model,
            num_heads=config.decoder_attention_heads,
            dropout=config.attention_dropout,
            is_decoder=True,
        )
        self.self_attn_layer_norm = nn.LayerNorm(config.d_model)

        self.encoder_attn = AdaptiveMultiHeadAttention(
            embed_dim=config.d_model,
            num_heads=config.decoder_attention_heads,
            dropout=config.attention_dropout,
            is_decoder=True,
        )
        self.encoder_attn_layer_norm = nn.LayerNorm(config.d_model)


# **4. Create Custom Encoder and Decoder**

## **4.1 Custom Encoder**

In [6]:
class CustomBartEncoder(BartEncoder):
    def __init__(self, config, embed_tokens):
        super().__init__(config, embed_tokens)
        self.layers = nn.ModuleList(
            [CustomBartEncoderLayer(config) for _ in range(config.encoder_layers)]
        )


## **4.2 Custom Decoder**

In [7]:
class CustomBartDecoder(BartDecoder):
    def __init__(self, config, embed_tokens):
        super().__init__(config, embed_tokens)
        self.layers = nn.ModuleList(
            [CustomBartDecoderLayer(config) for _ in range(config.decoder_layers)]
        )


# **5. Create the Custom BART Model**

In [8]:
class CustomBartModel(BartForConditionalGeneration):
    def __init__(self, config):
        super().__init__(config)
        self.model.encoder = CustomBartEncoder(config, self.model.shared)
        self.model.decoder = CustomBartDecoder(config, self.model.shared)


# **6. Load Pre-trained Weights**

In [9]:
def load_pretrained_weights(custom_model, pretrained_model_name):
    # Load the state dict of the pre-trained model
    pretrained_model = BartForConditionalGeneration.from_pretrained(pretrained_model_name)
    pretrained_state_dict = pretrained_model.state_dict()

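    # Skip the attention weights so the adaptive attention modules keep their
    # fresh initialization. The substring match also skips the
    # self_attn_layer_norm and encoder_attn_layer_norm parameters.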
    filtered_state_dict = {}
    for name, param in pretrained_state_dict.items():
        if "self_attn" not in name and "encoder_attn" not in name:
            filtered_state_dict[name] = param

    # Load the filtered state dict into the custom model
    custom_model.load_state_dict(filtered_state_dict, strict=False)

    # The modified attention layers are re-initialized
    return custom_model
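
To make the effect of the substring filter in `load_pretrained_weights` concrete, the following sketch partitions a handful of parameter names the same way. The helper `split_state_dict_keys` is hypothetical, and the key list is only a small illustrative subset following the BART naming scheme, not the full state dict.

```python
def split_state_dict_keys(keys, skip_substrings=("self_attn", "encoder_attn")):
    """Partition parameter names into (kept, skipped) by substring match."""
    kept, skipped = [], []
    for name in keys:
        if any(sub in name for sub in skip_substrings):
            skipped.append(name)
        else:
            kept.append(name)
    return kept, skipped

# Illustrative subset of parameter names (assumed, not read from a checkpoint)
example_keys = [
    "model.shared.weight",
    "model.encoder.layers.0.self_attn.q_proj.weight",
    "model.encoder.layers.0.self_attn_layer_norm.weight",
    "model.encoder.layers.0.fc1.weight",
    "model.decoder.layers.0.encoder_attn.k_proj.weight",
    "model.decoder.layers.0.encoder_attn_layer_norm.bias",
]

kept, skipped = split_state_dict_keys(example_keys)
print("kept:", kept)
print("skipped:", skipped)
```

Note that the layer-norm parameters land in the skipped group as well, because their names contain `self_attn` or `encoder_attn` as a prefix.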


# **7. Training and Inference**

## **7.1 Initialize Tokenizer and Model**

In [10]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
config = BartConfig.from_pretrained('facebook/bart-base')
custom_model = CustomBartModel(config)
custom_model = load_pretrained_weights(custom_model, 'facebook/bart-base')
custom_model = custom_model.to('cuda' if torch.cuda.is_available() else 'cpu')



## **7.2 Prepare Data**

In [11]:
def show_pdfs_from_directory(directory_path):
    # Print the name of every PDF file in the directory
    for filename in os.listdir(directory_path):
        if filename.endswith('.pdf'):
            print(filename)

pdf_directory = '/content'
show_pdfs_from_directory(pdf_directory)


Airbnb.pdf
SaaS.pdf
HealthApp.pdf
CloudStorage.pdf
EducationalPlatforms.pdf
GamingPlatforms.pdf
ecommerceapp.pdf
socialmediaplatform.pdf
netbankingapp.pdf
fooddeliveryapp.pdf
streamingservice.pdf


In [12]:
import os

def read_pdf(file_path):
    pdf_reader = PyPDF2.PdfReader(file_path)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()
    return text

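def read_pdfs_from_directory(directory_path):
    # os.listdir returns names in arbitrary order, so the summaries list
    # below must follow the same order in which the files are read here
    documents = []
    for filename in os.listdir(directory_path):
        if filename.endswith('.pdf'):
            file_path = os.path.join(directory_path, filename)
            text = read_pdf(file_path)
            documents.append(text)
    return documents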

documents = read_pdfs_from_directory(pdf_directory)

summaries = [
    """
    These terms cover the use of the online booking platform, requiring users to create an account and follow the Platform's conduct guidelines. Booking payments are processed through the Platform, and cancellation policies may vary. Users retain ownership of their content but grant the Platform rights to use it. Hosts are responsible for listing accuracy and guest safety, while guests must respect property and house rules. The Platform disclaims warranties, limits liability for damages, and collects user data per its Privacy Policy. Disputes are resolved through arbitration, and the terms may be updated periodically.
    """,
    """
    These terms govern the use of the SaaS platform, requiring users to register an account and follow usage guidelines. Users must not engage in illegal activities, interfere with the platform's functionality, or misuse services. The Platform grants users a limited license to access its services, while user-generated content remains the user's property. Paid subscriptions may be required for certain features, and fees may change. The platform disclaims warranties and limits liability for damages. Data collection and use comply with the Privacy Policy, but absolute security is not guaranteed. The terms may change, and users agree to resolve disputes through arbitration.
    """,
    """
    The HealthTrack app requires users to agree to its Terms and Conditions, which include accepting all legal obligations. Users must be at least 18 or have parental consent to access the app. HealthTrack reserves the right to modify these terms at any time, and users are responsible for reviewing them regularly. The app is intended for health tracking purposes only and is not a substitute for professional medical advice. Users are accountable for their account security and for complying with legal requirements when sharing content. The app offers both free and paid subscription options, with auto-renewal for paid services unless canceled. HealthTrack takes data privacy seriously but acknowledges that no method is entirely secure. The content within the app is owned by HealthTrack and cannot be modified or sold without permission. The app also integrates with third-party services but is not responsible for their actions. HealthTrack provides its services "as is" without any guarantees of uninterrupted service or freedom from errors, and its liability is limited to the amount paid by the user. Users' access may be terminated for any violations of the Terms. Legal disputes are subject to the laws of the app's jurisdiction and resolved through arbitration. Finally, if any part of these Terms is deemed invalid, the remainder will still apply.
    """,
    """
    These terms govern the use of the cloud storage service, requiring users to create an account and follow the Service's guidelines. Users are responsible for their account security and agree not to engage in prohibited activities. Uploaded content remains the user's property, but users grant the Service a license to manage it as needed. The Service collects and uses personal data per its Privacy Policy, and security measures are in place, though absolute security is not guaranteed. Fees may apply for certain features, and prices can change. The Service disclaims warranties and limits its liability for damages. Terms may be amended, and user accounts can be terminated for violations. Disputes will be resolved through arbitration, and the terms are governed by the laws of the specified jurisdiction.
    """,
    """
    These terms govern the use of the Platform and require users to register for an account. Users are responsible for their actions and must comply with the Platform's guidelines. Content shared on the Platform remains the user’s property, but users grant the Platform a license to use it. Courses may require payment, and refunds are limited. The Platform limits liability and disclaims warranties regarding content accuracy. Personal data is handled according to the Privacy Policy. Terms may change, and continued use signifies acceptance of new terms. Violations may result in termination, and disputes are subject to arbitration under applicable law.
    """,
    """
    These terms govern the use of the gaming platform, requiring users to create an account and adhere to conduct guidelines, which prohibit cheating, illegal activities, and harassment. Users can purchase games, DLC, and other digital content, but only receive a license to use, not own, the content. Subscription services may be available, and refunds are governed by a specific policy. The Platform disclaims warranties and limits liability for damages. Modifications to content, terms, and services may occur at any time. Disputes are resolved through arbitration, and the terms are governed by the laws of the specified jurisdiction.
    """,
    """
    By using flipkart.com, you agree to the terms outlined here. These terms include the process for creating an account, placing orders, and making payments. Prices are subject to change, and once an order is confirmed, you will receive a notification via email. Products are shipped based on availability, and estimated delivery times may vary. Our returns policy allows you to return items within 7 days in their original condition, and refunds will be processed promptly. The website's content is protected by intellectual property laws, and you must not copy or misuse it. Users are expected to act responsibly when using our services, and accounts can be suspended for fraudulent or harmful behavior. Your personal data is protected under our Privacy Policy. These terms may be updated from time to time, and any legal matters will be handled according to the laws of Constitution of the Republic of India. If you have any questions, please reach out to our support team.
    """,
    """
    By using ConnectSphere, you agree to create an account with accurate information and are responsible for maintaining the security of your login details. Users own the content they post but grant ConnectSphere a license to display and use the content within the platform. Harmful or illegal behavior, including harassment, spam, and impersonation, is prohibited, and such actions may result in account suspension or termination. ConnectSphere takes privacy seriously and follows data protection standards, with user data managed according to the Privacy Policy. While users can interact with third-party links on the platform, ConnectSphere is not responsible for the content or security of those third-party services. Users may deactivate their accounts at any time, but ConnectSphere reserves the right to moderate content and terminate accounts that violate the platform's terms. Disputes are governed by the laws of the United States, with any legal matters handled in California courts. For more information or assistance, users can contact ConnectSphere’s support team.
    """,
    """
    By using the EasyBank app, you agree to these Terms and Conditions. You must create an account with accurate information and are responsible for maintaining the security of your login credentials. EasyBank provides a range of banking services, including account management, fund transfers, bill payments, and mobile deposits. Users must have sufficient funds to complete transactions, and certain services may incur fees. EasyBank prioritizes account security, and users are required to use two-factor authentication. You should immediately report any suspicious activity. Transactions may be subject to verification and are not guaranteed until confirmed. If your account is found to have violated these terms, it may be suspended or terminated. EasyBank collects and uses your personal and financial data in accordance with its Privacy Policy. For any disputes, users agree to resolve them through binding arbitration in California.
    """,
    """
    By using the QuickBite app, you agree to create an account with accurate information and are responsible for keeping your login details secure. You will be charged the total amount displayed for your order, including any taxes, fees, and tips, with payments handled through third-party processors. Cancellations and refunds are only available within limited timeframes and are subject to restaurant policies. Delivery times are estimates, and QuickBite cannot guarantee exact delivery times due to external factors like traffic and weather. You must provide accurate delivery information and behave respectfully towards delivery partners and restaurant staff. QuickBite collects and processes your data in accordance with its Privacy Policy. If you violate any terms or engage in fraudulent activity, QuickBite reserves the right to suspend or terminate your account. All disputes are governed by the laws of Illinois, with arbitration in Chicago if needed. For further questions, users can contact QuickBite's support team.
    """,
    """
    By using StreamFlix, you agree to create an account with accurate information and are responsible for keeping your login details secure. Your subscription will automatically renew unless you cancel it before the renewal date. StreamFlix offers various subscription plans, and content availability may vary based on your location. You must not engage in illegal activities or attempt to redistribute or copy content from the platform. Your use of StreamFlix is for personal, non-commercial purposes only, and we may suspend or terminate your account if you violate these terms. StreamFlix strives to provide a high-quality streaming experience, but the quality may vary based on your internet connection. StreamFlix is not responsible for interruptions or errors in the service. For any issues, you can contact our support team. Disputes will be governed by the laws of New York, and users agree to resolve disputes through the courts in New York City.
    """,
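    # Add summaries corresponding to each document
]

# Remove the leading and trailing whitespace left by the triple-quoted strings
summaries = [summary.strip() for summary in summaries]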

assert len(documents) == len(summaries), "Number of documents and summaries must be equal."
print("Assertion tests passed!")


Assertion tests passed!
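
The length check above confirms that the counts match, but not that each summary belongs to its document: the pairing is positional and depends on the order in which `os.listdir` returns the files. One way to make the pairing explicit, sketched here under the assumption that summaries are keyed by filename, is to align both lists through a dictionary. The helper `pair_documents_with_summaries` and the placeholder texts are hypothetical.

```python
def pair_documents_with_summaries(texts_by_file, summaries_by_file):
    """Return (documents, summaries) aligned by filename, in sorted order."""
    missing = sorted(set(texts_by_file) - set(summaries_by_file))
    if missing:
        raise KeyError(f"No summary provided for: {missing}")
    filenames = sorted(texts_by_file)
    documents = [texts_by_file[name] for name in filenames]
    summaries = [summaries_by_file[name] for name in filenames]
    return documents, summaries

# Placeholder texts; real values would come from read_pdf
texts_by_file = {"SaaS.pdf": "saas terms ...", "Airbnb.pdf": "booking terms ..."}
summaries_by_file = {"Airbnb.pdf": "booking summary", "SaaS.pdf": "saas summary"}

docs, sums = pair_documents_with_summaries(texts_by_file, summaries_by_file)
print(list(zip(docs, sums)))
```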


## **7.3 Create Dataset and DataLoader**

In [13]:
# Tokenize and encode the data
def encode_data(documents, summaries, tokenizer, max_length=512):
    inputs = tokenizer(
        documents,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_tensors='pt',
    )
    targets = tokenizer(
        summaries,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors='pt',
    )
    return inputs, targets

inputs, targets = encode_data(documents, summaries, tokenizer)
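
Because the targets are padded to a fixed length, the label ids contain padding tokens, and `encode_data` passes them on unchanged, so they contribute to the training loss. A common practice with Hugging Face sequence-to-sequence models is to replace padding ids in the labels with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below shows the idea on plain lists; the helper `mask_padding` and the token ids are hypothetical, and on the tensors used in this notebook the equivalent would be a masked assignment on a copy of `targets.input_ids`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace padding ids with ignore_index so the loss skips them."""
    return [
        [ignore_index if token == pad_token_id else token for token in row]
        for row in label_ids
    ]

# Hypothetical token ids; 1 is the pad id of the BART tokenizer
labels = [[0, 713, 16, 2, 1, 1], [0, 42, 2, 1, 1, 1]]
print(mask_padding(labels, pad_token_id=1))
```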


In [14]:
from torch.utils.data import Dataset, DataLoader

class LegalSummarizationDataset(Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return self.inputs.input_ids.shape[0]

    def __getitem__(self, idx):
        item = {
            'input_ids': self.inputs.input_ids[idx],
            'attention_mask': self.inputs.attention_mask[idx],
            'labels': self.targets.input_ids[idx],
        }
        return item

dataset = LegalSummarizationDataset(inputs, targets)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)


## **7.4 Define Training Loop**

In [15]:
from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(custom_model.parameters(), lr=5e-5)
num_epochs = 128
total_steps = len(dataloader) * num_epochs

scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_steps
)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

custom_model.train()
print("Training Transformer Model with Adaptive Multi Head Attention")
for epoch in range(num_epochs):
    total_loss = 0
    for batch in dataloader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = custom_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
        )

        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(dataloader)

    if epoch % 10 == 0 or epoch == num_epochs - 1:
        print(f'Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}')




Training Transformer Model with Adaptive Multi Head Attention
Epoch 1/128, Loss: 11.6949
Epoch 11/128, Loss: 3.8078
Epoch 21/128, Loss: 2.3730
Epoch 31/128, Loss: 1.8377
Epoch 41/128, Loss: 1.5218
Epoch 51/128, Loss: 1.4199
Epoch 61/128, Loss: 1.2575
Epoch 71/128, Loss: 1.1723
Epoch 81/128, Loss: 1.0435
Epoch 91/128, Loss: 0.9782
Epoch 101/128, Loss: 0.9427
Epoch 111/128, Loss: 0.8920
Epoch 121/128, Loss: 0.8374
Epoch 128/128, Loss: 0.8427
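
The training cell uses a linear schedule with zero warmup steps, so the learning rate decays from `5e-5` to zero over the whole run. The sketch below reproduces that multiplier in plain Python as an approximation of what `get_linear_schedule_with_warmup` applies; the function name `linear_schedule_factor` is hypothetical. With 11 documents and `batch_size=2` there are 6 batches per epoch, hence 768 steps in total.

```python
def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Learning-rate multiplier for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-5
total_steps = 6 * 128  # 6 batches per epoch, 128 epochs

# Learning rate at the start, the midpoint, and the end of training
for step in (0, 384, 768):
    print(step, base_lr * linear_schedule_factor(step, total_steps))
```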


## **7.5 Inference**

In [16]:
custom_model.eval()

def summarize(text, max_length=128):
    inputs = tokenizer(
        [text],
        max_length=512,
        padding='max_length',
        truncation=True,
        return_tensors='pt',
    )
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)

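    with torch.no_grad():
        summary_ids = custom_model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_length=max_length,
            num_beams=4,
            early_stopping=True,
            # AdaptiveMultiHeadAttention caches states as [Batch, Time, Channel],
            # but BartDecoder reads the past length from dimension 2 of the
            # cache, so cached decoding would use wrong positions
            use_cache=False,
        )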
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example usage
test_document = """The HealthTrack app requires users to agree to its Terms and Conditions, which include accepting all legal obligations. Users must be at least 18 or have parental consent to access the app. HealthTrack reserves the right to modify these terms at any time, and users are responsible for reviewing them regularly. The app is intended for health tracking purposes only and is not a substitute for professional medical advice. Users are accountable for their account security and for complying with legal requirements when sharing content. The app offers both free and paid subscription options, with auto-renewal for paid services unless canceled. HealthTrack takes data privacy seriously but acknowledges that no method is entirely secure. The content within the app is owned by HealthTrack and cannot be modified or sold without permission. The app also integrates with third-party services but is not responsible for their actions. HealthTrack provides its services "as is" without any guarantees of uninterrupted service or freedom from errors, and its liability is limited to the amount paid by the user. Users' access may be terminated for any violations of the Terms. Legal disputes are subject to the laws of the app's jurisdiction and resolved through arbitration. Finally, if any part of these Terms is deemed invalid, the remainder will still apply."""
generated_summary = summarize(test_document)
print("Generated Summary:", generated_summary)


Generated Summary:  Users must be at least 18 or have parental consent to access the right to modify these terms at any time, which include accepting all legal obligations. The app offers both free and for professional medical advice. HealthTrack takes data privacy seriously
