### Environment Setup

In [None]:
# Checking the GPU availability
!nvidia-smi

Sat Apr 20 05:58:24 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

We are using an NVIDIA Tesla T4 GPU with 15360 MiB (about 15 GB) of memory.

In [None]:
# Installing some packages
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

- The `-q` flag runs pip in quiet mode (most output is suppressed)

- **transformers:** This is a popular library for working with state-of-the-art natural language processing (NLP) models, especially transformers like BERT and GPT-2. The `[sentencepiece]` part specifies an optional extra dependency for the SentencePiece tokenizer, which is often used by transformer models.

- **datasets:** This library provides a collection of ready-to-use NLP datasets and tools for loading, processing, and evaluating them.

- **sacrebleu:** This package computes BLEU scores in a standardized, reproducible way. BLEU evaluates machine translation quality by comparing a generated translation to one or more reference translations.

- **rouge_score:** This package calculates ROUGE scores, a family of n-gram overlap metrics most commonly used to evaluate summarization quality.

- **py7zr:** This library handles 7z archives, allowing you to work with datasets compressed in this format.

In [None]:
!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate
!pip install transformers accelerate

Successfully installed accelerate-0.29.3 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.19.3 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.1.105
Found existing installation: transformers 4.38.2
Uninstalling transformers-4.38.2:
  Successfully uninstalled transformers-4.38.2
Found existing installation: accelerate 0.29.3
Uninstalling accelerate-0.29.3:
  Successfully uninstalled accelerate-0.29.3
Collecting transformers
  Downloading transformers-4.40.0-py3-none-any.whl (9.0 MB)
Collecting accelerate
  Using cached accelerate-0.29.3-py3-none-any.whl (297 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Dow

- **!pip install --upgrade accelerate:**
    - This line upgrades the accelerate package using pip.
    - The accelerate package helps with training and running machine learning models on multiple GPUs (Graphics Processing Units).

- **!pip uninstall -y transformers accelerate:**
    - This line uninstalls both transformers and accelerate packages.
    - The -y flag tells pip to proceed without confirmation prompts.

- **!pip install transformers accelerate:**
    - This line reinstalls both transformers and accelerate packages.
  
- **Why uninstall and then reinstall?** The notebook does not state the reason, but two explanations are plausible:
    - **Dependency conflict:** The preinstalled versions of transformers and accelerate may have been incompatible. Uninstalling both and reinstalling them together lets pip resolve a compatible pair (here transformers 4.40.0 and accelerate 0.29.3).
    - **Outdated preinstalled version:** The environment shipped with transformers 4.38.2, and reinstalling pulls the latest release instead.

In either case, the net effect is a fresh, mutually compatible installation of both packages. Note that the reinstall does not pin versions, so rerunning the cell at a later date may install different releases.
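
To confirm which versions actually ended up installed after cells like the one above, the package metadata can be queried directly. This is a minimal sketch using only the standard library; the helper name `installed_version` is an assumption made for illustration and is not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the versions of the two packages that were reinstalled above.
for name in ("transformers", "accelerate"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

In a notebook, a freshly installed version is only picked up by `import` after the runtime is restarted if the old version was already imported.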

### Imports and Setup

In [None]:
import pandas as pd
import torch

from datasets import load_dataset, load_from_disk, load_metric

from transformers import (AutoModelForSeq2SeqLM,
                          AutoTokenizer,
                          DataCollatorForSeq2Seq,
                          pipeline,
                          TrainingArguments,
                          Trainer)

from tqdm import tqdm

**`torch`**: This library is a popular framework for deep learning and scientific computing. It's likely used for tensor operations and model computations in this context.

**From datasets library:**
- `load_dataset`: This function allows you to load various NLP datasets from the Datasets Hub.
- `load_from_disk`: This function can be used to load datasets that were previously downloaded and saved to disk.
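- `load_metric`: This function loads evaluation metrics for NLP tasks, allowing you to assess model performance. It is deprecated in recent `datasets` releases (which is why a warning appears when it is called later in this notebook); the separate `evaluate` library provides `evaluate.load` as its replacement.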

**From transformers library:**

- `AutoModelForSeq2SeqLM`: This allows you to automatically load any pre-trained sequence-to-sequence language model (Seq2SeqLM) supported by the library.
- `AutoTokenizer`: This provides functionality to automatically load the tokenizer associated with the chosen Seq2SeqLM model.
- `DataCollatorForSeq2Seq`: This class builds batches for training or evaluating a Seq2SeqLM model. It dynamically pads `input_ids`, `attention_mask`, and `labels` to a common length within each batch, padding the labels with -100 so that padded positions are ignored by the loss.
- `pipeline`: This function allows you to create pipelines for various NLP tasks offered by the transformers library. This could be useful for deploying the model for real-time summarization tasks.
- `TrainingArguments`: This class allows you to define and configure various training parameters for your model. These parameters can include things like:
    - Number of training epochs
    - Batch size
    - Learning rate
    - Optimizer selection
    - Gradient accumulation steps
    - Whether to perform evaluation during training
    - Output directory for saving checkpoints and logs
- `Trainer`: This class provides a high-level interface for training and evaluating a model using the configurations defined in TrainingArguments. It handles the training loop, logging, checkpointing, and evaluation, simplifying the training process.



In [None]:
# Setting device to GPU if available else CPU (with PyTorch)
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

**`device = "cuda" if torch.cuda.is_available() else "cpu"`:** This line sets the device on which computations will be performed.

- `torch.cuda.is_available()`:
    - This checks if a CUDA-enabled NVIDIA GPU is available on your system.
    - If a GPU is available, "cuda" is assigned to the device variable, indicating calculations will be done on the GPU for faster processing (assuming your model and data fit in GPU memory).
    - If no GPU is found, "cpu" is assigned to device, indicating computations will be done on the CPU.

In [None]:
model_ckpt = "google/pegasus-cnn_dailymail"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

1. **Model Checkpoint Path:**

`model_ckpt = "google/pegasus-cnn_dailymail"`: This line stores the Hugging Face Hub identifier of the checkpoint in the variable `model_ckpt`. The checkpoint is a Pegasus model that has been fine-tuned on the CNN/Daily Mail dataset for summarization.

2. **Loading Tokenizer:**

`tokenizer = AutoTokenizer.from_pretrained(model_ckpt)`: This line uses the `AutoTokenizer` class from transformers to load the tokenizer associated with the checkpoint specified in `model_ckpt`. The tokenizer is an essential part of the pipeline because it converts text into sequences of integer token IDs that the model can process.

3. **Loading and Moving Model:**

`model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)`:

This line performs two actions:
- **Loading the Model:** It uses the `AutoModelForSeq2SeqLM` class from transformers to instantiate the Pegasus architecture and load the pre-trained weights from the checkpoint specified in `model_ckpt`.
- **Moving to Device:** It then uses the .to(device) method to move the loaded model to the device specified by the device variable (defined earlier). Recall that device is either "cuda" (GPU) or "cpu" depending on availability. This ensures computations happen on the chosen device for efficiency.

In essence, this code sets up the necessary components (tokenizer and model) for text processing tasks, likely leveraging the pre-trained Pegasus model for summarization based on the chosen checkpoint.

### Data Import and Analysis

In [None]:
# Download and unzip the dataset for fine tuning
!wget https://github.com/sg13041995/Datasets/raw/main/textSummarizer_samsun.zip
!unzip textSummarizer_samsun.zip

--2024-04-20 06:00:22--  https://github.com/sg13041995/Datasets/raw/main/textSummarizer_samsun.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sg13041995/Datasets/main/textSummarizer_samsun.zip [following]
--2024-04-20 06:00:23--  https://raw.githubusercontent.com/sg13041995/Datasets/main/textSummarizer_samsun.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7903594 (7.5M) [application/zip]
Saving to: ‘textSummarizer_samsun.zip’


2024-04-20 06:00:23 (130 MB/s) - ‘textSummarizer_samsun.zip’ saved [7903594/7903594]

Archive:  textSummarizer_samsun.zip
  inflating: samsum-test.csv         


In [None]:
# Loading the dataset from disk and exploring it
dataset_samsum = load_from_disk('samsum_dataset')
dataset_samsum

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [None]:
type(dataset_samsum)

datasets.dataset_dict.DatasetDict

In [None]:
dataset_samsum.keys()

dict_keys(['train', 'test', 'validation'])

In [None]:
dataset_samsum.values()

dict_values([Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 14732
}), Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 819
}), Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 818
})])

In [None]:
dataset_samsum["train"]

Dataset({
    features: ['id', 'dialogue', 'summary'],
    num_rows: 14732
})

In [None]:
type(dataset_samsum["train"])

In [None]:
dataset_samsum['train'].column_names

['id', 'dialogue', 'summary']

In [None]:
dataset_samsum["train"][0]

{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

In [None]:
type(dataset_samsum["train"][0])

dict

In [None]:
dataset_samsum["train"][0]["id"]

'13818513'

In [None]:
print(dataset_samsum["train"][0]["dialogue"])

Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)


In [None]:
print(dataset_samsum["train"][0]["summary"])

Amanda baked cookies and will bring Jerry some tomorrow.


### Data Preprocessing and Training Setup

In [None]:
# Fine tuning data preparation
def convert_examples_to_features(example_batch):
    # Tokenizing the dialogue
    input_encodings = tokenizer(example_batch['dialogue'] , max_length = 1024, truncation = True )

    # Tokenizing the summary considering them as target
    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch['summary'], max_length = 128, truncation = True )

    return {
        'input_ids' : input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }

**Function Purpose:**

This function takes a batch of examples (likely containing dialogue and corresponding summaries) and converts them into a format suitable for the transformers library's Seq2SeqLM models.

**Argument:**
`example_batch` is a dictionary-like object containing the data for a batch of examples.

**Steps:**
1. Encode Dialogue: `input_encodings = tokenizer(example_batch['dialogue'] , max_length = 1024, truncation = True )`

This line uses the loaded tokenizer (tokenizer) to convert the dialogue text in `example_batch['dialogue']` into numerical representations. `max_length = 1024` specifies the maximum allowed length for the encoded dialogue (sequence of tokens). Longer sequences will be truncated. `truncation = True` indicates that if the dialogue is longer than max_length, it will be shortened (truncated) to fit the limit.

2. Encode Summary (with special target handling):

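`with tokenizer.as_target_tokenizer():` This line enters a context in which the tokenizer treats its input as target (label) text. Some tokenizers process target text differently from source text, for example in multilingual models where the target side uses a different language code or vocabulary. In recent transformers releases this context manager is deprecated in favor of passing the target text through the `text_target` argument of the tokenizer call.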

`target_encodings = tokenizer(example_batch['summary'], max_length = 128, truncation = True )`: This line is similar to the dialogue encoding but uses the tokenizer in "target mode" and applies it to the summary text in `example_batch['summary']`.

3. Creating the Output Dictionary:
- `'input_ids'`: This key stores the encoded dialogue sequence (`input_encodings['input_ids']`).
- `'attention_mask'`: This key stores the attention mask (`input_encodings['attention_mask']`). The mask marks real tokens with 1 and padding with 0, so that the model ignores padded positions.
- `'labels'`: This key stores the encoded summary sequence (`target_encodings['input_ids']`). It is named "labels" because the model is trained to predict the summary tokens given the dialogue, and the loss is computed against this sequence.
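
The three steps above can be sketched without loading a real tokenizer. In this minimal illustration, `toy_tokenize`, `toy_convert`, and the word-level vocabulary are hypothetical stand-ins (the real Pegasus tokenizer works on SentencePiece subwords); only the shape of the returned dictionary and the truncation behaviour mirror the function above.

```python
EOS_ID = 1   # the notebook's outputs show ID 1 as the end-of-sequence token
vocab = {}   # hypothetical toy vocabulary, filled the first time each word is seen


def toy_tokenize(text, max_length):
    """Toy stand-in for a tokenizer: one ID per whitespace-separated word."""
    ids = [vocab.setdefault(word, len(vocab) + 2) for word in text.split()]
    ids = ids[: max_length - 1] + [EOS_ID]   # truncate, then append end-of-sequence
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def toy_convert(example, max_input_length=6, max_target_length=4):
    """Mirror the structure of convert_examples_to_features for one example."""
    input_encodings = toy_tokenize(example["dialogue"], max_input_length)
    target_encodings = toy_tokenize(example["summary"], max_target_length)
    return {
        "input_ids": input_encodings["input_ids"],
        "attention_mask": input_encodings["attention_mask"],
        "labels": target_encodings["input_ids"],
    }


example = {
    "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!",
    "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
}
features = toy_convert(example)
print(features)
```

Both sequences are cut to their respective maximum lengths, and the attention mask has one entry per input token.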

In [None]:
# Testing the function on the first sample from train dataset
tokenized_example = convert_examples_to_features(dataset_samsum["train"][0])

In [None]:
print(tokenized_example["input_ids"])

[12195, 151, 125, 7091, 3659, 107, 842, 119, 245, 181, 152, 10508, 151, 7435, 147, 12195, 151, 125, 131, 267, 650, 119, 3469, 29344, 1]


In [None]:
print(tokenized_example["attention_mask"])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [None]:
print(tokenized_example["labels"])

[12195, 7091, 3659, 111, 138, 650, 10508, 181, 3469, 107, 1]


In [None]:
# Decoding the tokenized sample without including the special tokens

decoded_dialogue = tokenizer.decode(tokenized_example["input_ids"], skip_special_tokens=True)
decoded_summary = tokenizer.decode(tokenized_example["labels"], skip_special_tokens=True)

print(f"Decoded Dialogue: {decoded_dialogue}")
print(f"Decoded Summary: {decoded_summary}")

Decoded Dialogue: Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you tomorrow :-)
Decoded Summary: Amanda baked cookies and will bring Jerry some tomorrow.


In [None]:
# Decoding the tokenized sample including the special tokens

decoded_dialogue = tokenizer.decode(tokenized_example["input_ids"], skip_special_tokens=False)
decoded_summary = tokenizer.decode(tokenized_example["labels"], skip_special_tokens=False)

print(f"Decoded Dialogue: {decoded_dialogue}")
print(f"Decoded Summary: {decoded_summary}")

Decoded Dialogue: Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you tomorrow :-)</s>
Decoded Summary: Amanda baked cookies and will bring Jerry some tomorrow.</s>


In [None]:
# Decoding the attention mask

decoded_attention = tokenizer.decode(tokenized_example["attention_mask"], skip_special_tokens=False)

print(f"Decoded Attention: {decoded_attention}")

Decoded Attention: </s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s>

Note that decoding the attention mask is not meaningful: the mask is not a sequence of token IDs. The decoder simply interprets every 1 as token ID 1, which is the `</s>` end-of-sequence token, hence the repeated `</s>` in the output.


In [None]:
# We can observe that the length of the attention mask is same as the number of input tokens
# There was no padding and so the attention mask is all 1s

print(len(tokenized_example["input_ids"]))
print(len(tokenized_example["labels"]))
print(len(tokenized_example["attention_mask"]))

25
11
25


In [None]:
# Applying the convert_examples_to_features function on the dataset
dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched = True)

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]



Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

In [None]:
dataset_samsum_pt["train"]

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14732
})

In [None]:
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

This line creates an instance of the DataCollatorForSeq2Seq class from the transformers library. This class is specifically designed to handle data preparation for training and evaluation of sequence-to-sequence models (Seq2SeqLM) like the Pegasus model you loaded earlier.

Here's a breakdown of how it's used:

- DataCollatorForSeq2Seq: This is the class name used to create the data collator object.
- tokenizer: The tokenizer loaded earlier (`tokenizer`). The collator uses it to pad each batch, for example to look up the padding token ID.
- model=model_pegasus: The Seq2SeqLM model loaded earlier (`model_pegasus`). This argument is optional; when it is provided, the collator can use the model's `prepare_decoder_input_ids_from_labels` method to build `decoder_input_ids` from the labels.

In essence, this line creates a helper object (seq2seq_data_collator) that will be responsible for batching and preparing your data (dialogue and summaries) in a format suitable for training or evaluating your Pegasus model for summarization tasks.
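
The padding behaviour of such a collator can be illustrated in plain Python. This is a simplified sketch, not the library's implementation: the helper `pad_batch` is hypothetical, it works on lists rather than tensors, and the padding token ID of 0 is an assumption for illustration. The label padding value of -100 is the value that the loss ignores.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels to the longest sequence in the batch."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_input_pad = max_input_len - len(f["input_ids"])
        n_label_pad = max_label_len - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_input_pad)
        # Padded input positions get mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_input_pad)
        # Padded label positions get -100 so they do not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch


features = [
    {"input_ids": [5, 6, 7, 1], "attention_mask": [1, 1, 1, 1], "labels": [8, 1]},
    {"input_ids": [9, 1], "attention_mask": [1, 1], "labels": [10, 11, 12, 1]},
]
batch = pad_batch(features)
print(batch)
```

Because padding is applied per batch, short examples are not padded all the way to the tokenizer's maximum length, which saves memory and compute.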

In [None]:
# Defining the training configuration
trainer_args = TrainingArguments(
    output_dir='pegasus-samsum',
    num_train_epochs=4,
    warmup_steps=200,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    logging_steps=10,
    evaluation_strategy='steps',
    eval_steps=50,
    save_steps=1e6,
    gradient_accumulation_steps=16
)

`output_dir='pegasus-samsum'`: This argument specifies the directory where the trained model, checkpoints, and logging files will be saved. Here, it's set to "pegasus-samsum", suggesting the model is being trained for summarization tasks.

`num_train_epochs=4`: This argument sets the number of times the entire training dataset will be passed through the model. Here, it's set to 4, so the model sees the training data four times.

`warmup_steps=200`: This argument specifies the number of steps during which the learning rate will be gradually increased from zero to its final value. This helps to stabilize the training process at the beginning. Here, the warmup will last for the first 200 training steps.

`per_device_train_batch_size=1`: This argument defines the number of training examples included in each batch processed by the model on a single device (GPU or CPU). Here, a batch size of 1 is used, which means each training step will process a single dialogue-summary pair.

`per_device_eval_batch_size=1`: This argument defines the number of examples included in each batch processed by the model during evaluation on a single device. Similar to training batch size, here it's set to 1, indicating a single example per evaluation step.

`weight_decay=0.01`: This argument controls a regularization technique called weight decay that helps prevent overfitting. Here, a weight decay of 0.01 is applied.

`logging_steps=10`: This argument specifies how often training metrics, such as the training loss and the current learning rate, are logged. Here, they are logged every 10 training steps.

`evaluation_strategy='steps'`: This argument defines the strategy for performing model evaluation during training. Here, "steps" is chosen, indicating evaluation will occur at regular intervals based on the number of training steps.

`eval_steps=50`: This argument defines the frequency of evaluation steps when using the "steps" evaluation strategy. Here, the model will be evaluated every 50 training steps.

`save_steps=1e6`: This argument specifies how often (in training steps) a model checkpoint is saved. A value of 1 million (1e6) is far larger than the length of this run, so in practice no intermediate checkpoints are written.

`gradient_accumulation_steps=16`: This argument accumulates gradients over several forward and backward passes before performing a parameter update. It does not make training faster; its purpose is to emulate a larger batch when GPU memory only allows a small per-device batch. Here, gradients from 16 batches of size 1 are accumulated before each update, giving an effective batch size of 16.
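
The interaction of batch size, gradient accumulation, and epochs can be checked with a little arithmetic. The sketch below assumes a single device and the 819-example split that is passed as `train_dataset` in the next cell; the helper name `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_epochs, num_devices=1):
    """Estimate the number of parameter updates over a whole training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One parameter update happens per `grad_accum_steps` batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


effective_batch = 1 * 16                       # per-device batch size x accumulation steps
total_steps = optimizer_steps(819, 1, 16, 4)   # 819 examples, 4 epochs
print(effective_batch, total_steps)
```

The result of 204 updates agrees with the `global_step=204` reported in the training output further down. It also shows that with `warmup_steps=200` the learning rate is still warming up for almost the entire run.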

In [None]:
# Initializing the trainer object
trainer = Trainer(model=model_pegasus,
                  args=trainer_args,
                  tokenizer=tokenizer,
                  data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt["test"],
                  eval_dataset=dataset_samsum_pt["validation"])

`model=model_pegasus`: This argument specifies the PEGASUS model instance to fine-tune. It was loaded earlier with `AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)`.

`args=trainer_args`: This argument provides the `TrainingArguments` object defined above. It holds the hyperparameters that control the training process, such as the number of epochs, batch sizes, warmup steps, and gradient accumulation steps.

`tokenizer=tokenizer`: This argument specifies the tokenizer loaded earlier with `AutoTokenizer.from_pretrained(model_ckpt)`. Passing it to the `Trainer` lets the tokenizer be saved together with the model whenever the `Trainer` saves a checkpoint.

`data_collator=seq2seq_data_collator`: This argument is the callable that collates individual samples (tokenized dialogues and summaries) into mini-batches suitable for training. It's particularly important for handling variable-length sequences, since it pads each batch to a common length.

`train_dataset=dataset_samsum_pt["test"]`: This argument specifies the dataset used for training. Note that the code passes the 819-example "test" split rather than the 14,732-example "train" split. This keeps the run short, but it means that any later evaluation on the test split measures performance on data the model has already seen. For a fair evaluation, train on `dataset_samsum_pt["train"]` instead.

`eval_dataset=dataset_samsum_pt["validation"]`: This argument specifies the dataset that will be used for evaluation during training. The model will be periodically evaluated on this split to monitor its performance.

### Model Training

In [None]:
# Train the model with 4 epochs based on the previous observation of overfitting after epoch 4
trainer.train()

Step,Training Loss,Validation Loss
50,2.6226,2.072233
100,1.9911,1.747815
150,1.8527,1.630636
200,1.6269,1.580894


TrainOutput(global_step=204, training_loss=2.138618700644549, metrics={'train_runtime': 753.9962, 'train_samples_per_second': 4.345, 'train_steps_per_second': 0.271, 'total_flos': 1252252679675904.0, 'train_loss': 2.138618700644549, 'epoch': 3.9853479853479854})

- In an earlier run we observed overfitting beyond 4 epochs, so training here is stopped at 4 epochs.
- Even so, the model does not perform well after only 4 epochs of training.

### Model Export

In [None]:
# Save model
model_pegasus.save_pretrained("pegasus-samsum-model")

Non-default generation parameters: {'max_length': 128, 'min_length': 32, 'num_beams': 8, 'length_penalty': 0.8, 'forced_eos_token_id': 1}


In [None]:
# Save tokenizer
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/spiece.model',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

### Model Evaluation on Test Data

In [None]:
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """split the dataset into smaller batches that we can process simultaneously
    Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

This code defines a function named generate_batch_sized_chunks that splits a list of elements into smaller batches. Here's a breakdown of how it works:

**Function Purpose:**

This function takes two arguments:
- `list_of_elements`: The list of elements to split into batches.
- `batch_size`: An integer giving the desired size of each batch.

The function iterates through the list and yields sublists of at most `batch_size` elements.

**Steps:**

- Looping through the list: `for i in range(0, len(list_of_elements), batch_size):` iterates over start indices from 0 up to (but not including) the length of the list, in steps of `batch_size`.

- Slicing and yielding batches: `yield list_of_elements[i : i + batch_size]` slices the list from index `i` up to (but not including) `i + batch_size`. Every batch has exactly `batch_size` elements except possibly the last, which holds whatever remains. Because the function uses `yield`, it is a generator: each batch is produced on demand, and the loop resumes where it left off when the next batch is requested.

Overall, this function provides a convenient way to iterate over a large list in smaller, manageable chunks. This can be useful for various tasks, such as training machine learning models in batches or processing large datasets piece by piece.
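
A short usage example makes the behaviour concrete, including the smaller final batch. The function is repeated here so that the snippet runs on its own; the list `dialogues` is made-up sample data.

```python
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]


dialogues = ["d0", "d1", "d2", "d3", "d4", "d5", "d6"]   # 7 made-up items
batches = list(generate_batch_sized_chunks(dialogues, 3))
print(batches)
# [['d0', 'd1', 'd2'], ['d3', 'd4', 'd5'], ['d6']]
```

Seven items with a batch size of 3 give two full batches and one final batch holding the single remaining item.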

In [None]:
def calculate_metric_on_test_ds(dataset,
                                metric,
                                model,
                                tokenizer,
                                batch_size=16,
                                device=device,
                                column_text="article",
                                column_summary="highlights"):

    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        inputs = tokenizer(article_batch, max_length=1024,  truncation=True,
                        padding="max_length", return_tensors="pt")

        # length_penalty is the exponent applied to the sequence length when beam
        # scores are normalized; 0.8 favors long outputs less strongly than 1.0 would.
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device),
                         length_penalty=0.8, num_beams=8, max_length=128)

        # Finally, we decode the generated texts, replace the <n> token, and add the decoded texts with the references to the metric.
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                clean_up_tokenization_spaces=True)
               for s in summaries]

        # Pegasus emits "<n>" as its newline token; turn it into a plain space.
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]

        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    #  Finally compute and return the ROUGE scores.
    score = metric.compute()
    return score

This code defines a function calculate_metric_on_test_ds that likely evaluates a model's performance on a test dataset using a specific metric (ROUGE score in this case) for a summarization task. Here's a breakdown of the function:

**Purpose:**

This function takes several arguments:

- dataset: This is the test dataset containing examples (likely dialogue and corresponding summaries).

- metric: This is a metric object (likely a ROUGE metric) used to evaluate the quality of generated summaries.

- model: This is the pre-trained summarization model (likely Pegasus in this case).

- tokenizer: This is the tokenizer associated with the model used for text processing.

- batch_size: This is an optional argument specifying the number of examples to process in each batch (default 16).

- device: This is an optional argument specifying the device to use for computations (CPU or GPU, likely set earlier).

- column_text: This is an optional argument specifying the column name in the dataset containing the text to be summarized (default "article").

- column_summary: This is an optional argument specifying the column name in the dataset containing the reference summaries (default "highlights").

The function iterates through the test dataset in batches, generates summaries for each dialogue using the model, and then uses the metric object to compare the generated summaries with the reference summaries in the dataset. Finally, it returns the calculated metric score.

**Steps:**

1. Splitting Data into Batches:

article_batches = ... target_batches = ...: These lines use the generate_batch_sized_chunks function (defined earlier) to split the text data (dataset[column_text]) and reference summaries (dataset[column_summary]) from the test dataset into batches of the specified batch_size.

2. Looping Through Batches:

for article_batch, target_batch in tqdm(...):: This loop iterates over the corresponding batches of text data and reference summaries. The tqdm progress bar shows the progress of the loop.

3. Encoding Text Data:

`inputs = tokenizer(article_batch, max_length=1024, truncation=True, padding="max_length", return_tensors="pt")`: This line uses the tokenizer to encode the text in `article_batch` for the model. It applies:
- `max_length=1024`: Maximum allowed length of the encoded text.
- `truncation=True`: Truncates longer text to fit the limit.
- `padding="max_length"`: Pads every sequence to `max_length` (1024 tokens), not just to the longest sequence in the batch.
- `return_tensors="pt"`: Returns the encoded data as PyTorch tensors, which are then moved to the chosen device with `.to(device)`.

4. Generating Summaries:

`summaries = model.generate(...)`: This line calls the `generate` method of the model (Pegasus in this notebook) to produce summary token IDs for the encoded text in `inputs`. Here:
- `input_ids`: Encoded dialogue text (from `inputs`).
- `attention_mask`: Attention mask for the encoded text (from `inputs`).
- `length_penalty=0.8`: The exponent applied to the sequence length when beam scores are normalized. Positive values favor longer sequences; 0.8 favors them less strongly than 1.0 would.
- `num_beams=8`: The number of candidate sequences (beams) kept at each step of beam search.
- `max_length=128`: The maximum allowed length of the generated summaries.

5. Decoding Generated Summaries:

decoded_summaries = [tokenizer.decode(s, ...)]: This line decodes the generated summaries from token IDs back to human-readable text using the tokenizer. It applies:
- skip_special_tokens=True: Drops special tokens (such as padding and end-of-sequence markers) from the decoded text.
- clean_up_tokenization_spaces: Cleans up any extra spaces introduced during tokenization.

6. Replacing Empty Decoded Summaries:

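decoded_summaries = [d.replace("", " ") for d in decoded_summaries]: This line is meant to clean the decoded summaries before they reach the metric. As written, however, str.replace with an empty search string does not target empty summaries: Python inserts the replacement between every character, so "abc" becomes " a b c ". That breaks each summary into single characters and will severely depress word-overlap metrics such as ROUGE. The call is probably a garbled form of d.replace("<n>", " "), which swaps Pegasus's <n> newline token for a space.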
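
One detail of step 6 is worth checking directly: how Python's str.replace behaves when the search string is empty. The snippet below shows that behaviour next to a guard that only substitutes genuinely empty strings. The helper `replace_empty` is hypothetical and not part of the notebook.

```python
def replace_empty(summaries, placeholder=" "):
    """Return summaries with empty strings swapped for placeholder; others unchanged."""
    return [s if s else placeholder for s in summaries]


decoded = ["Hannah needs Betty's number.", ""]

# An empty search string matches between every character, so spaces are inserted everywhere.
print([d.replace("", " ") for d in decoded])

# The guard leaves non-empty summaries untouched.
print(replace_empty(decoded))
```

If the goal is to remove a model-specific token instead, the search string should be that token rather than an empty string.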

7. Adding Summaries and References to Metric:

metric.add_batch(predictions=..., references=...): This line adds the generated summaries (decoded_summaries) as predictions and the reference summaries (target_batch) from the dataset to the metric object (likely ROUGE metric) for evaluation.

8. Calculating and Returning Metric Score:

score = metric.compute(): This line calls the compute method of the metric object to calculate the final ROUGE score based on the accumulated predictions and references added earlier.
return score: The function returns the calculated ROUGE score (score) as the output.

Overall, this function provides a framework for evaluating a summarization model's performance on a test dataset using the ROUGE metric. It iterates through the dataset in batches, generates summaries for the text data, compares them with reference summaries, and calculates the overall ROUGE score.
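
The batching in step 1 can be sketched as a small generator. The notebook's own generate_batch_sized_chunks is defined earlier and may differ; the body below is an assumption about what such a helper typically looks like.

```python
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Yield successive slices of list_of_elements, each at most batch_size long."""
    for start in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[start:start + batch_size]


# Five placeholder dialogues split into batches of two; the last batch is smaller.
dialogues = ["d0", "d1", "d2", "d3", "d4"]
batches = list(generate_batch_sized_chunks(dialogues, batch_size=2))
print(batches)
```

Because the text column and the summary column are chunked with the same batch_size, zipping the two batch lists keeps each dialogue aligned with its reference summary.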

In [None]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_metric = load_metric('rouge')

  rouge_metric = load_metric('rouge')
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

rouge_names: This list contains the names of specific ROUGE variants you want to calculate:

- "rouge1": Refers to ROUGE-1 which considers the overlap of unigrams (single words) between the generated summary and the reference summaries.

- "rouge2": Refers to ROUGE-2 which considers the overlap of bigrams (sequences of two words) between the generated summary and the reference summaries.

- "rougeL": Refers to ROUGE-L which considers the longest common subsequence (LCS) of words between the generated summary and the reference summaries.

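- "rougeLsum": The summary-level variant of ROUGE-L. The text is split into sentences on newline characters, the LCS is computed sentence by sentence, and the results are aggregated over the whole summary. It is not a recall-only score; like the other variants, it reports precision, recall, and F-measure.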

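rouge_metric: This line uses the load_metric function from the datasets library to load the ROUGE metric. No extra configuration is passed, so the metric computes its default ROUGE variants, and rouge_names is used afterwards to pick which scores to report. As the warning in the output indicates, load_metric is on its way out of datasets; the evaluate library is its replacement.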

In essence, this code prepares the ROUGE metric for evaluation, covering unigram overlap, bigram overlap, and longest common subsequences at both the sequence level (rougeL) and the summary level (rougeLsum).
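
To show what these scores measure, here is a simplified pure-Python sketch of ROUGE-1, ROUGE-2, and ROUGE-L F-measures. It assumes lowercased whitespace tokenization and no stemming, so its numbers will not match the rouge_score package exactly. All function names are illustrative.

```python
from collections import Counter


def f_measure(overlap, pred_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def ngrams(tokens, n):
    """All contiguous n-grams of tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(prediction, reference, n):
    """F-measure over clipped n-gram matches between prediction and reference."""
    pred = Counter(ngrams(prediction.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((pred & ref).values())  # each n-gram counted at most min(pred, ref) times
    return f_measure(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """F-measure based on the longest common subsequence of tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f_measure(lcs_length(pred, ref), len(pred), len(ref))


prediction = "Hannah wants Betty's number"
reference = "Hannah needs Betty's number"
print("rouge1:", rouge_n(prediction, reference, 1))
print("rouge2:", rouge_n(prediction, reference, 2))
print("rougeL:", rouge_l(prediction, reference))
```

Three of the four words match, so ROUGE-1 and ROUGE-L are high here, while ROUGE-2 drops because only one of the three bigrams survives the changed word.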

In [None]:
score = calculate_metric_on_test_ds(
    dataset_samsum['test'],
    rouge_metric,
    trainer.model,
    tokenizer,
    batch_size = 2,
    column_text = 'dialogue',
    column_summary= 'summary'
)

# Keep the mid (median of the bootstrap resamples) F-measure for each ROUGE variant
rouge_dict = {rn: score[rn].mid.fmeasure for rn in rouge_names}

pd.DataFrame(rouge_dict, index=['pegasus'])

100%|██████████| 410/410 [13:00<00:00,  1.90s/it]


| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| pegasus | 0.018343 | 0.000285 | 0.018281 | 0.018289 |


### Prediction

In [None]:
# Looking at a specific example from the test dataset

test_example_number = 0
sample_text = dataset_samsum["test"][test_example_number]["dialogue"]
reference = dataset_samsum["test"][test_example_number]["summary"]

print("Input Dialogue:\n\n", sample_text)

print()

print("Summary:\n", reference)

Input Dialogue:

 Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Summary:
 Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.


In [None]:
# Looking at the length (in characters) of the input and target (summary) for the specific example

print(len(sample_text))
print(len(reference))

459
54


In [None]:
# Looking at the length (in characters) of the input and target (summary), and the summary/input ratio, for some examples

for i in range(10):
  test_example_number = i
  sample_text = dataset_samsum["test"][test_example_number]["dialogue"]
  reference = dataset_samsum["test"][test_example_number]["summary"]

  print("Input:", len(sample_text))
  print("Summary", len(reference))
  print("Ratio", len(reference)/len(sample_text))
  print("="*50)

Input: 407
Summary 83
Ratio 0.20393120393120392
Input: 459
Summary 54
Ratio 0.11764705882352941
Input: 592
Summary 150
Ratio 0.2533783783783784
Input: 461
Summary 50
Ratio 0.10845986984815618
Input: 1101
Summary 221
Ratio 0.20072661217075385
Input: 1559
Summary 300
Ratio 0.19243104554201412
Input: 1055
Summary 190
Ratio 0.18009478672985782
Input: 439
Summary 60
Ratio 0.1366742596810934
Input: 479
Summary 138
Ratio 0.2881002087682672
Input: 427
Summary 96
Ratio 0.22482435597189696


In [None]:
# Decided the multipliers based on the above observed ratios
min_length_multiplier = 0.10
max_length_multiplier = 0.25

In [None]:
# Checking the calculated min_length and max_length as per the multipliers

test_example_number = 0
sample_text = dataset_samsum["test"][test_example_number]["dialogue"]
reference = dataset_samsum["test"][test_example_number]["summary"]

print(len(sample_text)*min_length_multiplier)
print(len(sample_text)*max_length_multiplier)

print()

print(len(reference))

40.7
101.75

83


In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("/content/tokenizer")

In [None]:
# Summarization parameter settings

test_example_number = 0
sample_text = dataset_samsum["test"][test_example_number]["dialogue"]
reference = dataset_samsum["test"][test_example_number]["summary"]

min_length = int(len(sample_text)*min_length_multiplier)
max_length = int(len(sample_text)*max_length_multiplier)

gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "min_length": min_length, "max_length": max_length}
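
The min_length and max_length settings of generation are counted in tokens, whereas len(sample_text) counts characters. The sketch below shows how much the derived bounds change with the counting unit. The helper `length_bounds` is hypothetical, and whitespace splitting is only a stand-in for a real token count such as `len(tokenizer(text)["input_ids"])`.

```python
def length_bounds(text, count_units, min_multiplier=0.10, max_multiplier=0.25):
    """Derive (min_length, max_length) from the size of text under count_units.

    count_units is any callable returning a size for the text, for example
    a character count (len) or a token count. Illustrative helper only.
    """
    n_units = count_units(text)
    min_length = max(1, int(n_units * min_multiplier))
    max_length = max(min_length + 1, int(n_units * max_multiplier))
    return min_length, max_length


def whitespace_count(text):
    """Stand-in for a tokenizer: number of whitespace-separated pieces."""
    return len(text.split())


dialogue = "Hannah: Hey, do you have Betty's number?\nAmanda: Lemme check"

char_bounds = length_bounds(dialogue, len)
word_bounds = length_bounds(dialogue, whitespace_count)
print("bounds from characters:", char_bounds)
print("bounds from word count:", word_bounds)
```

Bounds derived from characters are several times larger than bounds derived from a token-like count, so passing them as min_length can force the model to keep generating well past a natural stopping point.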

In [None]:
pipe = pipeline("summarization", model="pegasus-samsum-model", tokenizer=tokenizer)

print("Dialogue:")
print(sample_text)

print("\nReference Summary:")
print(reference)

print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:
Betty's number is Larry's. He called her last time they were at the park together. He's very nice. Hannah would rather she text him instead of finding Betty's number.


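The model summary is distorted. It wrongly states that Betty's number is Larry's, and it runs well past the length of the reference. One likely contributor is that min_length and max_length are computed from character counts (len(sample_text)), while generation interprets them as token counts, so the minimum length forces a longer output than the reference needs.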