## Text Summarization using RoBERTa 

In [1]:
import datasets
import transformers
import pandas as pd
from datasets import Dataset, load_dataset
import rouge_score as rouge
import torch

# Tokenizer
from transformers import RobertaTokenizerFast

# Encoder-Decoder Model
from transformers import EncoderDecoderModel

# Training
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


In [2]:
# from huggingface_hub import login
# # from google.colab import userdata

# login()

## Training Phase

### Import CNN Dailymail Dataset

In [33]:
train_dataset = load_dataset("cnn_dailymail", '3.0.0', split='train')
test_dataset  = load_dataset("cnn_dailymail", '3.0.0', split='test')

In [34]:
valid_dataset = load_dataset("cnn_dailymail", '3.0.0', split='validation')

In [35]:
train_dataset = train_dataset.remove_columns('id')
valid_dataset = valid_dataset.remove_columns('id')
test_dataset  = test_dataset.remove_columns('id')

### Import Tokenizer 

In [None]:
# Load the pre-trained RoBERTa tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# Set beginning-of-sentence and end-of-sentence tokens to be the same as CLS and SEP tokens
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token

### Tokenize Dataset

In [6]:
# Define parameters for data processing
batch_size = 32  # Number of examples per batch
encoder_max_length = 256  # Max length for the encoder (article text)
decoder_max_length = 128  # Max length for the decoder (summary text)

# Function to process data into model inputs
def process_data_to_model_inputs(batch):
  # Tokenize the inputs (articles) and outputs (highlights) with appropriate padding and truncation
  inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=encoder_max_length)
  outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=decoder_max_length)

  # Assign tokenized inputs to the batch
  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  # Assign tokenized outputs to the batch, used as labels during training
  batch["decoder_input_ids"] = outputs.input_ids
  batch["decoder_attention_mask"] = outputs.attention_mask
  batch["labels"] = outputs.input_ids.copy()

  # Replace padding token id in labels with -100 so that it is ignored in loss computation
  batch["labels"] = [
      [-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]
  ]

  return batch

# Tokenize the training set loaded above and expose the model columns as PyTorch tensors
train_dataset = train_dataset.map(
    process_data_to_model_inputs,  # Apply the processing function
    batched=True,  # Process in batches
    batch_size=batch_size,  # Define the batch size
    remove_columns=["article", "highlights"]  # Remove the original columns to save memory
)
train_dataset.set_format(
    type="torch",  # Set the format for PyTorch
    columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],  # Define which columns to keep
)

# Process validation data in the same way as training data
valid_dataset = valid_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["article", "highlights"]
)
valid_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)


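The `-100` replacement in the labels works because PyTorch's cross-entropy loss skips that value by default (`ignore_index=-100`). The following is a minimal pure-Python sketch of the idea, not the library implementation; `mask_padding` and `masked_cross_entropy` are hypothetical helper names and the token ids are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_ids, pad_token_id):
    """Replace padding positions with IGNORE_INDEX, as the preprocessing step does."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in label_ids]


def masked_cross_entropy(log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    log_probs[i][v] is the log-probability of vocabulary id v at position i.
    """
    losses = [-log_probs[i][t] for i, t in enumerate(labels) if t != IGNORE_INDEX]
    return sum(losses) / len(losses)


# Toy example (hypothetical ids): vocabulary of 4 tokens, pad id 1 as in RoBERTa.
labels = mask_padding([0, 3, 2, 1, 1], pad_token_id=1)
uniform = [[math.log(0.25)] * 4 for _ in labels]  # a model that guesses uniformly
loss = masked_cross_entropy(uniform, labels)
print(labels)
print(round(loss, 4))
```

Only the three real tokens contribute to the loss, so padding a short summary up to `decoder_max_length` does not dilute or distort the training signal.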

### Construct Model for Summarization

We construct an encoder-decoder model for summarization, following [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345) by Liu and Lapata. Their abstractive model pairs a pre-trained BERT encoder with a decoder that is not pre-trained and is trained from scratch. Later work showed that both the encoder and the decoder can be initialized from pre-trained checkpoints, which tends to yield better performance. Here both sides start from the same RoBERTa checkpoint and share their weights (`tie_encoder_decoder=True`). The resulting model performs abstractive text summarization.

Intuitively, the architecture works as follows:
- The tokenized input is fed to the encoder, which produces an internal representation (a hidden-state vector) for each input token.
- These hidden states serve as context vectors and are passed to the decoder, which attends to them through its cross-attention layers.
- The decoder generates the output text token by token, conditioned on these context vectors.
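The hand-off from encoder to decoder can be illustrated with a toy, single-head dot-product attention in plain Python. This is only a sketch under simplifying assumptions: the real model uses learned query/key/value projections, several heads, and stacked layers, and the vectors below are made-up values. `cross_attention` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, encoder_states):
    """Weight each encoder hidden state by its scaled dot product with the query."""
    d = len(query)
    scores = [
        sum(q * h for q, h in zip(query, state)) / math.sqrt(d)
        for state in encoder_states
    ]
    weights = softmax(scores)
    # The context vector is the attention-weighted average of the encoder states.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(d)
    ]
    return context, weights


# Toy encoder output: one hidden-state vector per input token (hypothetical values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy decoder state for the token currently being generated.
decoder_query = [2.0, 0.0]

context, weights = cross_attention(decoder_query, encoder_states)
print(weights)
print(context)
```

The decoder repeats this lookup at every generation step, so each output token can draw on different parts of the article.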

 

In [50]:
roberta_shared = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "nyu-mll/roberta-med-small-1M-1",
    "nyu-mll/roberta-med-small-1M-1", 
    tie_encoder_decoder=True
)

# Set special tokens
roberta_shared.config.decoder_start_token_id = tokenizer.bos_token_id
roberta_shared.config.eos_token_id = tokenizer.eos_token_id
roberta_shared.config.pad_token_id = tokenizer.pad_token_id

# Set decoding parameters for beam search
roberta_shared.config.max_length = decoder_max_length
roberta_shared.config.early_stopping = True
roberta_shared.config.no_repeat_ngram_size = 3
roberta_shared.config.length_penalty = 2.0
roberta_shared.config.num_beams = 4
roberta_shared.config.vocab_size = roberta_shared.config.encoder.vocab_size


Some weights of RobertaForCausalLM were not initialized from the model checkpoint at nyu-mll/roberta-med-small-1M-1 and are newly initialized: ['roberta.encoder.layer.0.crossattention.output.LayerNorm.bias', 'roberta.encoder.layer.0.crossattention.output.LayerNorm.weight', 'roberta.encoder.layer.0.crossattention.output.dense.bias', 'roberta.encoder.layer.0.crossattention.output.dense.weight', 'roberta.encoder.layer.0.crossattention.self.key.bias', 'roberta.encoder.layer.0.crossattention.self.key.weight', 'roberta.encoder.layer.0.crossattention.self.query.bias', 'roberta.encoder.layer.0.crossattention.self.query.weight', 'roberta.encoder.layer.0.crossattention.self.value.bias', 'roberta.encoder.layer.0.crossattention.self.value.weight', 'roberta.encoder.layer.1.crossattention.output.LayerNorm.bias', 'roberta.encoder.layer.1.crossattention.output.LayerNorm.weight', 'roberta.encoder.layer.1.crossattention.output.dense.bias', 'roberta.encoder.layer.1.crossattention.output.dense.weight', ...
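Two of the decoding parameters set above, `no_repeat_ngram_size` and `length_penalty`, are easier to follow with a small example. The sketch below is a simplified pure-Python illustration of the ideas, not the code path used by `generate`; the helper names and the numbers are hypothetical.

```python
def banned_next_tokens(generated, no_repeat_ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    n = no_repeat_ngram_size
    if n <= 0:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


def length_normalized_score(sum_log_probs, length, length_penalty):
    """Beam score: summed log-probability divided by length ** length_penalty."""
    return sum_log_probs / (length ** length_penalty)


# With no_repeat_ngram_size=3, the sequence "5 6" was already followed by 7,
# so 7 may not follow "5 6" again.
print(banned_next_tokens([5, 6, 7, 8, 5, 6], no_repeat_ngram_size=3))

# Log-probabilities are negative, so dividing by a larger power of the length
# makes longer hypotheses look better.
short = length_normalized_score(-4.0, length=4, length_penalty=2.0)
long_ = length_normalized_score(-6.0, length=8, length_penalty=2.0)
print(short, long_)
```

Under these assumptions, a `length_penalty` of 2.0 favors longer summaries, while `no_repeat_ngram_size=3` prevents any trigram from appearing twice in one summary.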

In [19]:
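from rouge_score import rouge_scorer, scoring

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # Decode predictions and references, dropping special tokens
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # Restore the pad token where labels were masked with -100 so they can be decoded
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    # The rouge_score package has no module-level compute(), so score each pair with
    # RougeScorer and aggregate; .mid is the midpoint of the bootstrap estimate
    scorer = rouge_scorer.RougeScorer(["rouge2"])
    aggregator = scoring.BootstrapAggregator()
    for reference, prediction in zip(label_str, pred_str):
        aggregator.add_scores(scorer.score(reference, prediction))
    rouge_output = aggregator.aggregate()["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }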
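To make the three reported numbers concrete, here is a minimal sketch of ROUGE-2 for a single prediction/reference pair. It assumes plain lowercase whitespace tokenization, which is simpler than what ROUGE libraries do, so its values can differ slightly from theirs; `bigrams` and `rouge2` are hypothetical helper names.

```python
from collections import Counter


def bigrams(text):
    """Count the adjacent word pairs in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, F-measure) of bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


print(rouge2("the cat sat on the mat", "the cat lay on the mat"))
```

Precision is the share of predicted bigrams that appear in the reference, recall is the share of reference bigrams that the prediction recovers, and the F-measure is their harmonic mean.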

In [None]:
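repo_name = "..."  # Hub repository ID; the value is missing in the source notebook

training_args = Seq2SeqTrainingArguments(
    hub_model_id=repo_name,  # repository that push_to_hub=True uploads to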
    output_dir="./",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    do_train=True,
    do_eval=True,
    logging_steps=2,
    save_steps=16,
    eval_steps=200,
    warmup_steps=100,
    overwrite_output_dir=True,
    save_total_limit=1,
    fp16=True,
    push_to_hub=True
)

# instantiate trainer
trainer = Seq2SeqTrainer(
    model=roberta_shared,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
)
trainer.train()

In [None]:
trainer.model

## Evaluation / Inference Phase

In [46]:
def generate_summary(batch):
    # Accept either a dataset batch (with an "article" column) or a single article string.
    single = isinstance(batch, str)
    articles = [batch] if single else batch["article"]

    # Tokenize the articles, padding and truncating to the encoder length used in training.
    # The tokenizer adds the <s> and </s> special tokens automatically.
    inputs = tokenizer(articles, padding="max_length", truncation=True, max_length=256, return_tensors="pt")

    # Move the tokenized inputs to the device the model is on.
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)

    # Generate the token IDs of the predicted summaries.
    outputs = model.generate(input_ids, attention_mask=attention_mask)

    # Decode the generated IDs to text, dropping special tokens such as <s>, </s> and <pad>.
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # A single article returns its summary string; a batch returns a new "pred" column for Dataset.map.
    if single:
        return output_str[0]
    return {"pred": output_str}


### Generate Batch Predictions

In [25]:
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
# model = EncoderDecoderModel.from_pretrained('checkpoint-320') # checkpoint to be added
model = EncoderDecoderModel.from_pretrained('thesergiu/roberta2roberta_daily_cnn_finetuned')
model.to("cuda")
batch_size = 16

results = test_dataset.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["article"])
pred_str = results["pred"]
label_str = results["highlights"]


### Rouge Score

In [45]:
from torchmetrics.text.rouge import ROUGEScore
from pprint import pprint

rouge = ROUGEScore()
pprint(rouge(pred_str, label_str))

{'rouge1_fmeasure': tensor(0.3880),
 'rouge1_precision': tensor(0.3817),
 'rouge1_recall': tensor(0.4183),
 'rouge2_fmeasure': tensor(0.1714),
 'rouge2_precision': tensor(0.1694),
 'rouge2_recall': tensor(0.1842),
 'rougeL_fmeasure': tensor(0.2661),
 'rougeL_precision': tensor(0.2611),
 'rougeL_recall': tensor(0.2879),
 'rougeLsum_fmeasure': tensor(0.3599),
 'rougeLsum_precision': tensor(0.3541),
 'rougeLsum_recall': tensor(0.3879)}
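The `rougeL` rows are based on the longest common subsequence (LCS) between prediction and reference rather than on fixed-size n-grams. The sketch below shows that computation for one pair, assuming lowercase whitespace tokenization; it is an illustration and not the `torchmetrics` implementation, and `lcs_length` and `rouge_l` are hypothetical helper names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """Return (precision, recall, F-measure) based on the LCS length."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    precision = lcs / len(pred) if pred else 0.0
    recall = lcs / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


print(rouge_l("police killed the gunman", "the gunman was killed by police"))
```

Because a subsequence must preserve word order but may skip words, ROUGE-L rewards summaries that keep the reference's ordering even when the wording in between differs.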


### Generate Single Prediction

In [51]:
from transformers import RobertaTokenizerFast, EncoderDecoderModel

# Initialize the tokenizer and model
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = EncoderDecoderModel.from_pretrained('thesergiu/roberta2roberta_daily_cnn_finetuned')
model.to("cuda")

EncoderDecoderModel(
  (encoder): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 512, padding_idx=1)
      (position_embeddings): Embedding(514, 512, padding_idx=1)
      (token_type_embeddings): Embedding(1, 512)
      (LayerNorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=512, out_features=512, bias=True)
              (key): Linear(in_features=512, out_features=512, bias=True)
              (value): Linear(in_features=512, out_features=512, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=512, out_features=512, bias=True)
              ...

#### Article not from CNN

In [47]:
# Example article
article = """Jan 31 (Reuters) - Tech giants on Tuesday talked up how customers are lapping up their generative AI-powered products, but mounting costs of developing the cutting-edge features irked investors hoping for a big boost to sales from the new technology.
Shares of Alphabet (GOOGL.O), opens new tab fell 6%, while those of Microsoft (MSFT.O), opens new tab were down 1%, bringing down heavyweight tech stocks including Apple (AAPL.O), opens new tab, Meta (META.O), opens new tab and Amazon (AMZN.O), opens new tab.
Both Microsoft and Alphabet reported generous increases to their cloud revenue in the December quarter, beating Wall Street estimates, as customers lined up to test new AI features and build their own AI services.
But costs surged as well, highlighting the heavy investments these companies are making in servers, data centers and research as they compete fiercely for new customer dollars.
This hurt investor expectations that were fueled by the promise of AI, which powered a stock rally to record highs in recent months.
"A lofty valuation means even the slightest hint of disappointment will be seized on by investors, and Microsoft's guidance for revenue growth in its cloud division to slacken a little in the current quarter was enough to see the shares dip modestly," said Russ Mould, investment director at AJ Bell.
Gene Munster, a managing partner at Deepwater Asset Management, said he is looking for more from his firm's stakes in Alphabet and Microsoft."""

# Generate summary
summary = generate_summary(article)
print(summary)


Tech giants on Tuesday talked up how customers are lapping up their generative products.
But mounting costs of developing the cutting-edge features irked investors hoping for a big boost to sales from the new technology.
Deal was down 1%, bringing down heavyweight tech stocks including Apple (AAPL)O.


#### Paragraph from Harry Potter

In [56]:
text = """October arrived, spreading a damp chill over the grounds and into the castle. Madam Pomfrey, the nurse, was kept busy by a sudden 
spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several 
hours afterward. Ginny Weasley, who had been looking pale, was bullied into taking some by Percy. The steam pouring from under her vivid hair 
gave the impression that her whole head was on fire. Raindrops the size of bullets thundered on the castle windows for days on end; the lake rose, 
the flower beds turned into muddy streams, and Hagrid's pumpkins swelled to the size of garden sheds. Oliver Wood's enthusiasm for regular 
training sessions, however, was not dampened, which was why Harry was to be found, late one stormy Saturday afternoon a few days before Halloween, 
returning to Gryffindor Tower, drenched to the skin and splattered with mud. Even aside from the rain and wind it hadn't been a happy practice 
session. Fred and George, who had been spying on the Slytherin team, had seen for themselves the speed of those new Nimbus Two Thousand and Ones.
They reported that the Slytherin team was no more than seven greenish blurs, shooting through the air like missiles."""

In [57]:
# Generate summary
summary = generate_summary(text)
print(summary)

Hudan Weasley was bullied into taking some by Percy.
Elderly Pepperup potion worked instantly, though it left the drinker smoking at the ears for several 560 hours afterward.
Fred and George had been spying on the Slytherin team, had seen for themselves the speed of those speed.


## End of Notebook