## Overview

This notebook examines **text summarization**, the task of condensing a lengthy document into its essential points so that readers can grasp the main ideas without reading the entire text. As the volume of text available online keeps growing, **automated summarization techniques** have become increasingly important in applications ranging from news aggregation to academic research and content curation.

We use **transformer-based models**, which have reshaped NLP through their ability to model context and generate fluent text. Specifically, we evaluate a pre-trained model on a summarization task and investigate whether fine-tuning on a specific dataset improves it. The aim is to examine how effectively these models can be adapted to produce **high-quality summaries**.

## Objective
This project has three objectives:

**Evaluate the Performance of a Pre-trained Model**: We begin by assessing the capabilities of a pre-trained transformer model on a given dataset using ROUGE metrics, which are standard for measuring the quality of summaries. This evaluation will provide a baseline understanding of the model's strengths and weaknesses in summarizing text.

**Fine-tune the Model**: After establishing the baseline performance, we fine-tune the model on a small subset of the XSum training data. This step lets the model adapt to the specific language, style, and context of the data it will be summarizing, which can potentially improve its performance.

**Compare Performance Using ROUGE Scores**: Finally, we compare the ROUGE scores of the pre-trained model with those of the fine-tuned model. This comparison helps us quantify the effect of fine-tuning and provides insight into the effectiveness of transfer learning in text summarization.
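
To make the metric concrete, the sketch below computes ROUGE-2 by hand as the overlap of adjacent word pairs (bigrams) between a reference and a prediction. It is a simplified illustration, not the implementation used by the `rouge_score` or `evaluate` libraries: it assumes plain lowercase whitespace tokenization with no stemming, and the function names and example sentences are made up for this illustration.

```python
from collections import Counter


def bigrams(text):
    """Return a multiset of adjacent lowercase word pairs."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(reference, prediction):
    """Return (precision, recall, f_measure) for bigram overlap."""
    ref, pred = bigrams(reference), bigrams(prediction)
    overlap = sum((ref & pred).values())  # clipped count of shared bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical sentences, chosen only to illustrate the arithmetic.
reference = "the cat sat on the mat"
prediction = "the cat sat on a mat today"
precision, recall, f_measure = rouge2(reference, prediction)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

For this pair, 3 of the 5 reference bigrams are recovered (recall 0.600) and 3 of the 6 predicted bigrams are correct (precision 0.500), giving an F-measure of about 0.545. Recall and F-measure are different quantities, so two ROUGE-2 scores are only comparable when they report the same one.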

## Dataset Description
For this project, we use the XSum dataset, which is designed for extreme summarization tasks. Below is an overview of the dataset, including its structure, purpose, and significance in the context of text summarization.

**Overview of XSum Dataset**
The XSum dataset consists of roughly 226,700 single-document summarization examples, split into 204,045 training, 11,332 validation, and 11,334 test examples. Each example pairs a complete article with a human-written summary. The dataset is notable for its focus on extreme summarization, where the goal is a single sentence that still captures the main point and essence of the article.

**Structure**
The XSum dataset is structured as follows:

*Input Text*: The input is a full-length news article, which can vary significantly in length, topic, and complexity. The articles cover a wide range of subjects, ensuring diversity in the dataset.

*Summary*: Each article is paired with a one-sentence, human-written summary (in XSum, the introductory sentence that accompanies the BBC article), which distills the main idea of the article into a succinct format.

## Training Configuration
Due to limited RAM availability in the current environment, we have opted to use a small training set and a minimal number of training epochs. Specifically, we are utilizing the following settings:

***Small Training Set***: The training set is reduced to the first 7 examples of the XSum training split, and evaluation uses the first 3 test examples. This keeps training within the available memory and lets us run the full pipeline end to end while minimizing resource usage.

***Limited Training Epochs***: We have set the number of training epochs to 1 to keep the run short. In our initial experiments, we attempted to train the model for more epochs, but the increased resource demands caused the environment to become unresponsive.

***Batch Size***: The batch size for both training and evaluation is set to 1. This further reduces memory usage by processing only a single example at a time. While larger batch sizes typically lead to faster training, the chosen configuration ensures that we can run the model without exceeding memory limits.

***Weight Decay***: A weight decay of 0.01 has been implemented to help regularize the model and prevent overfitting, particularly important given the small size of the training set.

***Learning Rate***: We have set the learning rate to 5e-5. This value was chosen based on common practices for fine-tuning transformer models and aims to ensure stable training.

## Challenges Encountered
During our initial attempts to increase the training set size and number of epochs, we faced several challenges related to memory constraints. Configurations with larger batch sizes or more epochs exceeded what the Colab environment could accommodate, leading to crashes and failures.

This necessitated a more conservative approach, focusing on achieving a basic level of training and evaluation within the limits of the available hardware. The results obtained from this configuration will serve as a preliminary assessment of the model's performance.
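
One possible way to ease the memory pressure described above, shown here only as a sketch, is to keep the per-device batch size at 1 while accumulating gradients over several steps. The keys below are existing `TrainingArguments` parameters, but the values are assumptions that were not tested in this notebook, and whether they fit in memory depends on the hardware.

```python
# Hypothetical settings; every value is an assumption to tune, not a measured result.
memory_saving_kwargs = {
    "per_device_train_batch_size": 1,  # keep one example in memory per forward pass
    "gradient_accumulation_steps": 8,  # accumulate gradients over 8 passes before each update
    "gradient_checkpointing": True,    # recompute activations during the backward pass to save memory
    "optim": "adafactor",              # optimizer that keeps less state than AdamW
}

# On a single device, the effective batch size is the product of the two settings.
effective_batch_size = (
    memory_saving_kwargs["per_device_train_batch_size"]
    * memory_saving_kwargs["gradient_accumulation_steps"]
)
print(effective_batch_size)
```

These keyword arguments could be passed as `TrainingArguments(output_dir="./results/", **memory_saving_kwargs)`. Gradient accumulation trades training speed for memory: each update still sees 8 examples, but only one is held in memory at a time.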


In [1]:
# Install the required libraries
!pip install datasets transformers
# This command installs two essential libraries:
# 1. `datasets`: A library to easily load, preprocess, and manage datasets, particularly useful for NLP tasks.
# 2. `transformers`: A library that provides pre-trained models and tools for Natural Language Processing (NLP), enabling users to easily fine-tune models like Pegasus for various tasks, including text summarization.



Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)

In [3]:
# Install the evaluate library
!pip install evaluate
# This command installs the `evaluate` library, which provides a standardized way to compute evaluation metrics for machine learning models.
# It is particularly useful for evaluating the performance of NLP models by calculating metrics like ROUGE, BLEU, and more.
# This library allows for seamless integration with various tasks, enabling users to easily assess the quality of their model's predictions.


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [4]:
# Import necessary libraries
from datasets import load_dataset
# Import load_dataset from the datasets library to easily access and manage datasets, particularly for NLP tasks.

from transformers import (
    PegasusTokenizer,  # Import PegasusTokenizer to tokenize text for the Pegasus model.
    Trainer,          # Import Trainer to simplify the training and evaluation process for transformer models.
    TrainingArguments, # Import TrainingArguments to define the parameters for model training.
    PegasusForConditionalGeneration  # Import the Pegasus model for generating summaries based on input text.
)

import torch  # Import PyTorch, a deep learning framework used for building and training neural networks.

import evaluate  # Import evaluate, a library for computing evaluation metrics such as ROUGE, BLEU, and more for model performance assessment.


In [5]:
# Load the dataset (XSum)
dataset = load_dataset("xsum")
# This line uses the load_dataset function from the datasets library to load the XSum dataset.
# The XSum dataset consists of news articles paired with their corresponding one-sentence summaries, designed for extreme summarization tasks.
# Loading it gives us access to the articles and summaries used to train and evaluate the model.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


xsum.py:   0%|          | 0.00/5.76k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/6.24k [00:00<?, ?B/s]

The repository for xsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/xsum.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


(…)SUM-EMNLP18-Summary-Data-Original.tar.gz:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.72M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

In [6]:
# Loading the Pegasus tokenizer
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
# This line initializes the Pegasus tokenizer from the `google/pegasus-xsum` checkpoint, a Pegasus model that has already been fine-tuned on the XSum dataset.
# The tokenizer converts input text (news articles) into token IDs that the Pegasus model can understand.
# Using the tokenizer that ships with the checkpoint ensures that tokenization matches what the model expects.


tokenizer_config.json:   0%|          | 0.00/87.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.52M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.39k [00:00<?, ?B/s]



In [7]:
# Define the preprocessing function
def preprocess_function(examples):
    # Tokenizing the input documents with a maximum length of 512 tokens, truncating if necessary.
    model_inputs = tokenizer(examples['document'], max_length=512, truncation=True)
    # Tokenizing the summaries with a maximum length of 128 tokens, truncating if necessary.
    labels = tokenizer(examples['summary'], max_length=128, truncation=True)

    # Adding the tokenized labels (summaries) to the model_inputs dictionary under the key 'labels'.
    model_inputs['labels'] = labels['input_ids']

    # Returning the dictionary containing tokenized input and labels for use in model training.
    return model_inputs

# Applying the preprocessing to the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
# This line applies the preprocess_function to the entire dataset, processing it in batches for efficiency.
# The result is a tokenized dataset where each document and its corresponding summary are converted into token IDs, making it ready for input into the Pegasus model.


Map:   0%|          | 0/204045 [00:00<?, ? examples/s]

Map:   0%|          | 0/11332 [00:00<?, ? examples/s]

Map:   0%|          | 0/11334 [00:00<?, ? examples/s]

In [8]:
# Load the Pegasus model
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
# This line loads the Pegasus model for conditional generation from the `google/pegasus-xsum` checkpoint, which has already been fine-tuned on XSum.
# The model generates a summary from the input text it receives.
# Starting from this checkpoint reuses knowledge learned from large datasets, which typically performs far better on summarization than training a model from scratch.


pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/259 [00:00<?, ?B/s]

In [10]:
# Install the rouge_score library
!pip install rouge_score
# This command installs the `rouge_score` library, which provides implementations for calculating ROUGE metrics.
# ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics commonly used to evaluate the quality of text summaries by comparing them to reference summaries. This library computes ROUGE-N (for example ROUGE-1 and ROUGE-2) and ROUGE-L, reporting precision, recall, and F-measure for each, which helps assess the performance of summarization models.


Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=a2010d0ef6a89db4911e77a54460436cd69f996346c4d3dc1d8b9a08df17f458
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [14]:
# Evaluate pre-trained model
print("Evaluating pre-trained model...")  # Print a message indicating that the evaluation process is starting.

# Select a small sample for evaluation
test_data_sample = tokenized_dataset["test"].select(range(3))  # Select the first 3 examples from the test dataset for evaluation.

# Generate predictions using the pre-trained model
inputs = tokenizer(test_data_sample["document"], return_tensors="pt", padding=True, truncation=True)
# Tokenize the input documents (from the selected test samples) into PyTorch tensors (`return_tensors="pt"`).
# Padding and truncation ensure that all inputs have a consistent size, accommodating the model's requirements.
predictions = model.generate(**inputs)
# Passing the attention mask together with the input IDs lets the model ignore padding tokens explicitly.

# Decode the model's predictions into readable summaries
summaries = tokenizer.batch_decode(predictions, skip_special_tokens=True)
# Decode the predicted token IDs back to human-readable strings, omitting any special tokens that may not be relevant in the summaries.

# Compute ROUGE scores for the pre-trained model
rouge = evaluate.load("rouge")  # Load the ROUGE evaluation metrics to assess the quality of the generated summaries.
rouge_scores_pretrained = rouge.compute(
    predictions=summaries,          # Use the generated summaries as predictions for evaluation.
    references=test_data_sample["summary"]  # Use the true summaries from the test data as references to compare against.
)
# Note: the values returned here are F-measures aggregated over the examples, not recall scores.

# Check the structure of the ROUGE score output
print(rouge_scores_pretrained)  # Print the ROUGE scores to inspect their structure and values for debugging purposes.

# Print the ROUGE-2 F-measure
if 'rouge2' in rouge_scores_pretrained:  # Check if the ROUGE-2 score is present in the output.
    print(f"ROUGE-2 F-measure (Pre-trained): {rouge_scores_pretrained['rouge2']:.4f}")
    # Print the ROUGE-2 F-measure formatted to 4 decimal places for clarity.
else:
    print("ROUGE-2 score not found in the results.")  # Notify if the ROUGE-2 score is not available, indicating a potential issue.


Evaluating pre-trained model...
{'rouge1': 0.7483164983164983, 'rouge2': 0.5213818860877685, 'rougeL': 0.7483164983164983, 'rougeLsum': 0.7483164983164983}
ROUGE-2 F-measure (Pre-trained): 0.5214


In [15]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results/",        # Directory where the model outputs, such as checkpoints and logs, will be saved.
    evaluation_strategy="epoch",    # Set the evaluation strategy to evaluate the model at the end of each epoch.
    learning_rate=5e-5,             # Set the learning rate for the optimizer; a smaller value can help with stability and improve training convergence.
    per_device_train_batch_size=1,  # Batch size for training; set to 1 to conserve memory, especially important when working with limited resources (e.g., RAM).
    per_device_eval_batch_size=1,   # Batch size for evaluation; also set to 1 to maintain consistency with the training process.
    num_train_epochs=1,             # Number of training epochs; set to 1 for initial testing; this can be adjusted based on convergence and desired performance.
    weight_decay=0.01,              # Weight decay for regularization; helps prevent overfitting by adding a penalty for larger weights during training.
    gradient_accumulation_steps=1,  # Number of update steps to accumulate before performing a backward/update pass; set to 1 for simplicity in this setup.
)




In [17]:
# Initialize the Trainer
trainer = Trainer(
    model=model,  # The pre-trained model to be fine-tuned
    args=training_args,  # The training arguments defined earlier
    train_dataset=tokenized_dataset["train"].select(range(7)),  # Use a small size (7 samples) for training to conserve memory
    eval_dataset=tokenized_dataset["test"].select(range(3)),    # Use a small size (3 samples) for evaluation
)

# Fine-tune the model
trainer.train()  # Start the training process to fine-tune the model on the training dataset


Epoch,Training Loss,Validation Loss
1,No log,0.965395


Non-default generation parameters: {'max_length': 64, 'num_beams': 8, 'length_penalty': 0.6, 'forced_eos_token_id': 1}


TrainOutput(global_step=7, training_loss=1.0477784020560128, metrics={'train_runtime': 50.2108, 'train_samples_per_second': 0.139, 'train_steps_per_second': 0.139, 'total_flos': 7325243768832.0, 'train_loss': 1.0477784020560128, 'epoch': 1.0})

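## **Insufficient Training Steps:**
The "No log" entry under Training Loss appears because the run was shorter than the logging interval. With 7 training samples, a batch size of 1, and one epoch, training performs only 7 optimizer steps (`global_step=7`), while `TrainingArguments` logs the training loss every 500 steps by default (`logging_steps=500`). Setting a smaller `logging_steps` value would make the training loss appear in the table. The average training loss for the run (about 1.048) is still reported in `TrainOutput`.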
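
The arithmetic behind the missing training-loss entry can be sketched as follows. The helper `optimizer_steps` is a hypothetical name introduced for this illustration, and its rounding is an approximation; the exact handling of a partial final accumulation group may differ between `transformers` versions.

```python
import math


def optimizer_steps(num_examples, batch_size, accumulation_steps, epochs):
    """Roughly estimate how many optimizer updates a training run performs."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return updates_per_epoch * epochs


# Values taken from this notebook's configuration.
steps = optimizer_steps(num_examples=7, batch_size=1, accumulation_steps=1, epochs=1)
logging_steps = 500  # default logging interval of TrainingArguments
print(steps, steps >= logging_steps)
```

The estimate of 7 steps matches the `global_step=7` reported by `trainer.train()`. Because 7 is far below the logging interval of 500, no training-loss entry is written during the run.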

In [37]:
def generate_summary(batch, model, tokenizer):
    # Move model to the correct device (GPU if available, otherwise CPU)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Check for CUDA availability
    model.to(device)  # Transfer the model to the appropriate device

    # Tokenize input and move it to the same device as the model
    inputs = tokenizer(batch['document'], return_tensors="pt", padding=True, truncation=True).to(device)
    # Tokenize the input documents, ensuring they are converted to tensors and appropriately padded and truncated.

    # Generate summaries using the model
    outputs = model.generate(**inputs)  # Use the model to generate outputs based on the input tensors

    # Decode the generated outputs into human-readable summaries
    summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Decode the model's output (token IDs) back into strings, skipping any special tokens

    # Assign the first element of the summaries list to predicted_summary
    # This assumes you want the first summary in the list, as model.generate may return multiple summaries.
    batch['predicted_summary'] = summaries[0]
    return batch  # Return the modified batch with the added predicted summary


In [38]:
# Evaluate fine-tuned model
print("Evaluating fine-tuned model...")  # Indicate the start of evaluation for the fine-tuned model

# Select a small sample for evaluation and generate summaries
# Use the first 3 examples from the test dataset and apply the generate_summary function
test_data_sample_finetuned = tokenized_dataset["test"].select(range(3)).map(
    lambda batch: generate_summary(batch, model, tokenizer),  # Generate summaries for each batch
    batched=False  # Do not process in batches, process one example at a time
)

# Prepare the predictions and references
# Extract the string summaries from the test_data_sample_finetuned Dataset
predictions = test_data_sample_finetuned["predicted_summary"]  # Get the predicted summaries
references = test_data_sample_finetuned["summary"]  # Get the actual summaries for comparison

# Calculate ROUGE-2 recall score using rouge_score
from rouge_score import rouge_scorer  # Import the scorer module explicitly so this cell does not rely on an earlier import

scorer = rouge_scorer.RougeScorer(['rouge2'], use_stemmer=True)  # Initialize the ROUGE scorer for ROUGE-2
scores = []  # List to hold individual ROUGE-2 recall scores

# Loop through each prediction and reference to compute ROUGE-2 scores
for prediction, reference in zip(predictions, references):
    score = scorer.score(reference, prediction)  # Note the order: reference, prediction
    scores.append(score['rouge2'].recall)  # Append the recall score for ROUGE-2 to the scores list

# Average the ROUGE-2 recall scores
avg_rouge2_recall = sum(scores) / len(scores)  # Calculate the average ROUGE-2 recall score

# Print the average ROUGE-2 Recall Score for the Fine-tuned Model
print(f"ROUGE-2 Recall Score (Fine-tuned): {avg_rouge2_recall:.4f}")  # Display the average ROUGE-2 recall score formatted to 4 decimal places


Evaluating fine-tuned model...


Map:   0%|          | 0/3 [00:00<?, ? examples/s]

ROUGE-2 Recall Score (Fine-tuned): 0.4261


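## **Inference:**
The ROUGE-2 score is lower after fine-tuning (0.4261 versus 0.5214), but these two numbers should be compared with caution. The pre-trained score comes from `evaluate.load("rouge")`, which reports an aggregated F-measure, whereas the fine-tuned score is a mean ROUGE-2 recall computed with `rouge_score` and stemming enabled, so the two metrics differ. In addition, the model was trained on only 7 examples for one epoch and evaluated on 3 test examples, and the `google/pegasus-xsum` checkpoint has already been fine-tuned on XSum, which leaves little room for such a small run to help. Fine-tuning can improve performance in principle, but it does not always do so when resources and data are limited. A consistent metric, a larger evaluation set, and further adjustments to the training strategy, dataset size, and hyperparameters would be needed to draw a firm conclusion.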

## **Conclusion**
In conclusion, this notebook presents a detailed approach to fine-tuning the Pegasus model for text summarization using the XSum dataset. We started by installing the necessary libraries and loading the dataset, which consists of documents paired with their corresponding summaries. By defining a preprocessing function, we ensured that our input data was properly tokenized and formatted for the model.

The training process was conducted with carefully set parameters, considering the limitations of our available RAM and computational resources. The model was fine-tuned using a small sample of training data to achieve a balance between performance and resource consumption. After training, we evaluated the model's performance by generating summaries and calculating the ROUGE-2 recall score.

The fine-tuned model scored lower on ROUGE-2 than the pre-trained model, so this experiment does not demonstrate a benefit from fine-tuning under these constraints. It does illustrate the practical workflow for applying transformer models like Pegasus to generate concise summaries, and it highlights the challenges of training and evaluating a model under tight resource limits, where very small training and evaluation samples make the results preliminary.