# Possible models to use

## DistilBART

A distilled version of BART that is smaller than the full model while retaining much of its summarization performance, which makes it faster and cheaper to run. The `sshleifer/distilbart-cnn-12-6` checkpoint (12 encoder layers, 6 decoder layers) is fine-tuned for summarization on CNN/DailyMail news articles, so it is a reasonable medium-sized starting point for summarizing legal documents once it has been fine-tuned on legal text.

## T5 (Text-to-Text Transfer Transformer), Small or Base

T5 treats every task as a text-to-text problem, which makes it flexible for summarization. The small and base variants trade some quality for a smaller footprint, so they suit use cases where computational resources are limited.

In [1]:
from datasets import load_dataset

### Here I load the datasets and edit some of the columns prior to tokenizing the datasets

In [20]:
# Load the datasets
ds1_train = load_dataset("joelniklaus/legal_case_document_summarization", split='train')
ds1_train = ds1_train.remove_columns(['dataset_name'])
ds1_train = ds1_train.rename_column('judgement', 'text')
ds1_train = ds1_train.rename_column('summary', 'label')
print(ds1_train)

ds1_test = load_dataset("joelniklaus/legal_case_document_summarization", split='test')
ds1_test = ds1_test.remove_columns(['dataset_name'])
ds1_test = ds1_test.rename_column('judgement', 'text')
ds1_test = ds1_test.rename_column('summary', 'label')

# NOTE: This dataset has only 50 rows, so it may not be worth using.
# NOTE: It does not concatenate cleanly with ds1 (see the concatenation step further down),
# although the summaries themselves appear to be good.
ds2_train = load_dataset("manasvikalyan/legal-documents-summary")
ds2_train = ds2_train['data']
ds2_train = ds2_train.remove_columns(['summary_a2'])
ds2_train = ds2_train.rename_column('summary_a1', 'label')
ds2_train = ds2_train.rename_column('judgement', 'text')
print(ds2_train)

# NOTE: This dataset may not be useful for text summarization; it is a multiple-choice (holding selection) task.
# Context: a given legal scenario or fact pattern.
# Options (holdings): multiple candidate holdings, one of which is correct.
# Labels: the index of the correct holding, used for supervised learning and evaluation.
ds3_train = load_dataset("coastalcph/lex_glue", "case_hold", split='train')
ds3_test = load_dataset("coastalcph/lex_glue", "case_hold", split='test')
print(ds3_train)

ds4_train = load_dataset("coastalcph/lex_glue", "ecthr_a", split='train')
ds4_test = load_dataset("coastalcph/lex_glue", "ecthr_a", split='test')
print(ds4_train)

ds5_train = load_dataset("coastalcph/lex_glue", "ecthr_b", split='train')
ds5_test = load_dataset("coastalcph/lex_glue", "ecthr_b", split='test')
print(ds5_train)

ds6_train = load_dataset("coastalcph/lex_glue", "eurlex", split='train')
ds6_test = load_dataset("coastalcph/lex_glue", "eurlex", split='test')
print(ds6_train)

ds7_train = load_dataset("coastalcph/lex_glue", "ledgar", split='train')
ds7_test = load_dataset("coastalcph/lex_glue", "ledgar", split='test')
print(ds7_train)

ds8_train = load_dataset("coastalcph/lex_glue", "scotus", split='train')
ds8_test = load_dataset("coastalcph/lex_glue", "scotus", split='test')
print(ds8_train)



Repo card metadata block was not found. Setting CardData to empty.


Dataset({
    features: ['text', 'label'],
    num_rows: 7773
})


Repo card metadata block was not found. Setting CardData to empty.


Dataset({
    features: ['text', 'label'],
    num_rows: 50
})
Dataset({
    features: ['context', 'endings', 'label'],
    num_rows: 45000
})
Dataset({
    features: ['text', 'labels'],
    num_rows: 9000
})
Dataset({
    features: ['text', 'labels'],
    num_rows: 9000
})
Dataset({
    features: ['text', 'labels'],
    num_rows: 55000
})
Dataset({
    features: ['text', 'label'],
    num_rows: 60000
})
Dataset({
    features: ['text', 'label'],
    num_rows: 5000
})


### Here I am pre-processing the data for the DistilBART model

In [3]:
from transformers import BartTokenizer

In [4]:
# Load the BART tokenizer
tokenizer = BartTokenizer.from_pretrained('sshleifer/distilbart-cnn-12-6')



In [5]:
# Tokenization function for text and summaries
def tokenize_function(examples):
    # Tokenize the input text (truncated and padded to 512 tokens)
    inputs = tokenizer(examples['text'], max_length=512, truncation=True, padding='max_length')

    # Tokenize the target summaries (truncated and padded to 150 tokens).
    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager.
    labels = tokenizer(text_target=examples['label'], max_length=150, truncation=True, padding='max_length')

    # Store the tokenized summaries as the labels the model is trained to predict
    inputs['labels'] = labels['input_ids']

    return inputs
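
One detail worth checking in the function above: with `padding='max_length'`, the label sequences contain the tokenizer's padding ID, and Hugging Face seq2seq models compute cross-entropy over every label position except those set to `-100`. A common practice is to replace padding IDs in the labels with `-100`. The sketch below shows that step in plain Python. `mask_label_padding` is a hypothetical helper, and the pad ID of 1 is an assumption; in practice it should be read from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_label_padding(label_ids, pad_token_id):
    """Replace padding IDs in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Toy batch of two label sequences padded to length 6 (pad ID 1 is assumed here).
batch = [[0, 713, 16, 2, 1, 1], [0, 250, 2, 1, 1, 1]]
masked = mask_label_padding(batch, pad_token_id=1)
print(masked)
```

Applying this inside `tokenize_function` would change the reported loss values, because padded positions would no longer be counted in the average.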

### Here I am just tokenizing 'ds1' and 'ds2' for DistilBART (ds1_train and ds2_train)

### TODO: Tokenize the test set data later

In [21]:
# Tokenize the datasets for DistilBART
# Training Data
ds1_train_tokenized = ds1_train.map(tokenize_function, batched=True)
ds2_train_tokenized = ds2_train.map(tokenize_function, batched=True)
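# ds3_train_tokenized = ds3_train.map(tokenize_function, batched=True) <-- multiple-choice data
# NOTE: ds4-ds8 are LexGLUE classification datasets, not summarization data.
# In ds4 and ds5 (ecthr_a, ecthr_b) 'text' is a list of paragraphs rather than a single string,
# which is the likely cause of the ValueError raised on ds4 below.
# Their 'label'/'labels' columns hold class IDs, not summaries.
ds4_train_tokenized = ds4_train.map(tokenize_function, batched=True)
ds5_train_tokenized = ds5_train.map(tokenize_function, batched=True)
ds6_train_tokenized = ds6_train.map(tokenize_function, batched=True)
ds7_train_tokenized = ds7_train.map(tokenize_function, batched=True)
ds8_train_tokenized = ds8_train.map(tokenize_function, batched=True)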

# Testing Data


Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

ValueError: too many values to unpack (expected 2)
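
A quick check before calling `map` can catch datasets that do not fit `tokenize_function`. The sketch below is a hypothetical helper, `check_summarization_example`, that inspects one example (as a plain dict) and lists the reasons it would not work. The sample rows are invented to mirror the column layouts printed earlier; they are not taken from the real datasets.

```python
def check_summarization_example(example):
    """Return a list of problems that would stop tokenize_function from working."""
    problems = []
    for column in ("text", "label"):
        if column not in example:
            problems.append(f"missing column '{column}'")
        elif not isinstance(example[column], str):
            kind = type(example[column]).__name__
            problems.append(f"column '{column}' is {kind}, expected str")
    return problems


# Invented rows that mirror the printed column layouts.
summarization_row = {"text": "The appellant filed an appeal.", "label": "Appeal dismissed."}
paragraph_row = {"text": ["Paragraph one.", "Paragraph two."], "labels": [3, 5]}
class_row = {"text": "A contract clause.", "label": 17}

print(check_summarization_example(summarization_row))
print(check_summarization_example(paragraph_row))
print(check_summarization_example(class_row))
```

On real data the same check could be run on the first row of each dataset, for example `check_summarization_example(ds4_train[0])`.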

### TODO: set the other dataset formats later:

### Extra Columns (`input_ids`, `attention_mask`, `labels`)
- **`input_ids`**: Token IDs representing the input text for the model.
- **`attention_mask`**: Identifies which tokens are real and which are padding.
- **`labels`**: Token IDs representing the target summary, used for training.

These columns are essential for the model to properly process inputs, ignore padding, and learn to generate correct summaries during training.
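
To make the relationship between `input_ids` and `attention_mask` concrete, here is a small plain-Python sketch of what padding to a fixed length produces. `pad_to_length` is a hypothetical helper and the token IDs are invented; the real tokenizer does this internally and handles truncation more carefully (for example, it keeps the end-of-sequence token).

```python
def pad_to_length(token_ids, max_length, pad_token_id):
    """Truncate or right-pad token_ids and build the matching attention mask."""
    kept = token_ids[:max_length]
    padding = max_length - len(kept)
    input_ids = kept + [pad_token_id] * padding
    attention_mask = [1] * len(kept) + [0] * padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_to_length([0, 9064, 1200, 2], max_length=6, pad_token_id=1)
print(encoded["input_ids"])       # real tokens followed by padding IDs
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding
```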


In [7]:
# Set the dataset format to PyTorch tensors
# print(ds1_train_tokenized)
ds1_train_tokenized.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
# print(ds1_train_tokenized)

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 7773
})
Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 7773
})


### Here I am concatenating the datasets to use all together

### TODO: concatenate the rest of the datasets later

In [8]:
from datasets import concatenate_datasets

In [9]:
# Concatenate/Merge the datasets
# FIX LATER --> combined_dataset = concatenate_datasets([ds1_train_tokenized, ds2_train_tokenized])

### Splitting the combined dataset into train and validation sets

### TODO: split all datasets here later. Not just ds1

In [10]:
# FIX LATER --> combined_dataset = combined_dataset.train_test_split(test_size=0.2)
ds1_train_tokenized = ds1_train_tokenized.train_test_split(test_size=0.2)
train_dataset = ds1_train_tokenized['train'] 
val_dataset = ds1_train_tokenized['test']

### Load the DistilBART model here

In [11]:
from transformers import BartForConditionalGeneration

In [12]:
# Load the DistilBART model for conditional generation
model = BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-cnn-12-6')

### Setting up training arguments for the model here

### TODO: These can be modified later to improve the model

In [13]:
from transformers import TrainingArguments, Trainer

In [14]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir='./results',            # output directory
    eval_strategy="epoch",       # evaluate at each epoch
    learning_rate=5e-5,                # learning rate
    per_device_train_batch_size=4,     # batch size for training
    per_device_eval_batch_size=4,      # batch size for evaluation
    num_train_epochs=3,                # number of training epochs
    weight_decay=0.01,                 # strength of weight decay
    save_total_limit=2,                # only keep last 2 checkpoints
)

In [15]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

### Training the model here

In [16]:
# Train the model
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 2.1815        | 2.056316        |
| 2     | 1.7665        | 1.985363        |
| 3     | 1.4946        | 1.993097        |


Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}

TrainOutput(global_step=4665, training_loss=1.8569101053546606, metrics={'train_runtime': 45888.2297, 'train_samples_per_second': 0.407, 'train_steps_per_second': 0.102, 'total_flos': 1.4437376021495808e+16, 'train_loss': 1.8569101053546606, 'epoch': 3.0})

In [17]:
# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)

{'eval_loss': 1.9930967092514038, 'eval_runtime': 3726.9892, 'eval_samples_per_second': 0.417, 'eval_steps_per_second': 0.104, 'epoch': 3.0}


### Training and Evaluation Results

After training the DistilBART model for **3 epochs** on the legal case summarization dataset, we achieved the following results:

#### Training Metrics:
- **Training Loss**: **1.8569**
  - This is the cross-entropy loss (the negative log-likelihood of the reference tokens) averaged over all training steps in the run, so it includes the higher losses from early training. The training loss logged at each evaluation fell from **2.1815** after epoch 1 to **1.4946** after epoch 3, which shows that the model is fitting the training data.
  - A lower loss is better, but for generating summaries of long legal texts a value in this range is plausible. Whether the model is overfitting has to be judged against the validation loss, not from the training loss alone.

#### Evaluation Metrics:
- **Evaluation Loss**: **1.9931**
  - The evaluation loss is about 0.14 above the run-average training loss, but about 0.50 above the training loss logged at the end of epoch 3 (**1.4946**).
  - Validation loss fell from **2.0563** to **1.9854** between epochs 1 and 2, then rose slightly to **1.9931** in epoch 3 while training loss kept falling. The model still generalizes moderately well to unseen data, but this pattern is an early sign of overfitting.

- **Evaluation Runtime**: **3,726.99 seconds** (~62 minutes)
  - This is the time taken to evaluate the model on the validation set. For comparison, training took 45,888 seconds (about 12.7 hours).

- **Samples per Second**:
  - **Training**: **0.407** samples per second
  - **Evaluation**: **0.417** samples per second
  - These rates are consistent across training and evaluation. Because every input is truncated to 512 tokens, the low throughput most likely reflects the available hardware rather than the length of the original documents.
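
The logged step count and throughput can be cross-checked with a little arithmetic. The sketch below assumes that the 80/20 split rounds the validation size up and the training size down, and that training ran on a single device with no gradient accumulation. Under those assumptions it reproduces the logged values.

```python
import math

num_rows = 7773          # rows in ds1_train
test_size = 0.2          # fraction held out for validation
batch_size = 4           # per_device_train_batch_size
epochs = 3               # num_train_epochs

# Assumed rounding: validation size rounded up, training size rounded down.
val_rows = math.ceil(num_rows * test_size)
train_rows = math.floor(num_rows * (1 - test_size))

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

train_runtime = 45888.2297   # seconds, from TrainOutput
eval_runtime = 3726.9892     # seconds, from trainer.evaluate()

train_rate = round(train_rows * epochs / train_runtime, 3)
eval_rate = round(val_rows / eval_runtime, 3)

print(train_rows, val_rows)   # 6218 1555
print(total_steps)            # 4665, matches global_step
print(train_rate, eval_rate)  # 0.407 0.417, match the logged samples per second
```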

#### Interpretation of Loss Values:
- **Training Loss and Evaluation Loss**:
  - The run-average **training loss of 1.8569** is close to the **evaluation loss of 1.9931**, but the fairer comparison is with the training loss logged at the end of epoch 3 (**1.4946**). That gap of about 0.5, together with the flat validation loss between epochs 2 and 3, suggests the model has started to overfit mildly.
  - The loss is the cross-entropy averaged per token, so it does not grow with sequence length. Because the labels are padded to 150 tokens and the padding positions are not masked, easily predicted padding tokens are included in the average, which makes the absolute values lower than they would be on real tokens alone. The values are best used for comparing epochs and runs, not as a measure of summary quality.
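
One way to make these loss values easier to read is to convert them to perplexity, the exponential of the mean per-token cross-entropy. The sketch below applies that formula to the logged numbers. Since padded label positions are counted in the loss here, the resulting figures are not directly comparable with perplexities reported elsewhere.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy)


logged_losses = {
    "training loss logged at epoch 3": 1.4946,
    "run-average training loss": 1.8569,
    "validation loss at epoch 3": 1.9931,
}

for name, loss in logged_losses.items():
    print(f"{name}: loss={loss:.4f}, perplexity={perplexity(loss):.2f}")
```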

#### Next Steps for Improvement:
1. **Hyperparameter Tuning**:
   - The `Trainer` already applies linear learning-rate decay by default, so consider a lower peak learning rate or adding warmup steps to help further reduce the training and evaluation loss.
2. **Additional Training Epochs**:
   - Validation loss did not improve between epochs 2 and 3, so training for an additional **1-2 epochs** is only likely to help if overfitting is controlled first.
3. **Regularization Techniques**:
   - **Weight decay** is already set to 0.01 and BART applies **dropout** by default; increasing either may help improve generalization.
4. **Evaluate with ROUGE Metric**:
   - In addition to using loss as a performance measure, evaluating the model with **ROUGE** scores can give a more targeted assessment of how well the summaries capture the important content from the legal texts.
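
Before adding epochs, it may help to check where validation loss stopped improving. The sketch below picks the best epoch from the logged table and applies a simple patience rule. `best_epoch` and `should_stop` are hypothetical helpers written for illustration; they are not part of the `transformers` API, which provides its own `EarlyStoppingCallback`.

```python
# (epoch, training loss, validation loss) as logged by trainer.train()
history = [
    (1, 2.1815, 2.056316),
    (2, 1.7665, 1.985363),
    (3, 1.4946, 1.993097),
]


def best_epoch(history):
    """Return the epoch number with the lowest validation loss."""
    return min(history, key=lambda row: row[2])[0]


def should_stop(history, patience=1):
    """True if validation loss has not improved in the last `patience` epochs."""
    val_losses = [row[2] for row in history]
    best_index = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_index >= patience


print(best_epoch(history))   # 2
print(should_stop(history))  # True
```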

#### Summary:
- The **training and evaluation losses** are plausible for a text generation task involving legal documents. The model is learning, although the flat validation loss between epochs 2 and 3 points to mild overfitting by the end of training.
- Further improvement is more likely to come from hyperparameter tuning and stronger regularization than from additional epochs, and metrics such as **ROUGE** are needed to evaluate the quality of the generated summaries.

The next logical step is to test the quality of the generated summaries by comparing them with the reference summaries and calculating relevant metrics to better understand the model's performance.
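
As a starting point for that comparison, here is a minimal sketch of ROUGE-1 (unigram overlap) in plain Python. It uses lowercasing and whitespace tokenization only, which is an assumption made for brevity. Established implementations add stemming and also report ROUGE-2 and ROUGE-L, so a maintained package is preferable for reported numbers. The example sentences are invented.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Return ROUGE-1 precision, recall and F1 using lowercase whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token is matched at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the court dismissed the appeal"
candidate = "the appeal was dismissed"
scores = rouge1(reference, candidate)
print(scores)
```

In this toy case three of the four candidate tokens appear in the five-token reference, giving a precision of 0.75 and a recall of 0.6.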
