## Pegasus Text Summarization Tutorial

A text summarization application can be built in the following steps:

1. Choose a pretrained model.
2. Initialize the tokenizer and the model.
3. Create an input text for testing.
4. Tokenize the input text and use the pretrained model to generate a summary.
5. Decode the generated tokens.
6. Display the summary.

In [1]:
# Install Required Libraries
# !pip install transformers torch

In [2]:
from transformers import PegasusTokenizer, PegasusForConditionalGeneration



Select the `llm-env` kernel before running the notebook.

**Note**
Install the Jupyter dependencies, if needed, before using this notebook.
Install torch by following the instructions on its official website, choosing either the CPU or the GPU build.
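Which torch build is installed determines where the model can run. The following is a minimal sketch, assuming you want to fall back to the CPU whenever CUDA is unavailable; `pick_device` is a hypothetical helper name and not part of the original notebook.

```python
def pick_device():
    """Return "cuda" when a CUDA-enabled torch build sees a GPU, else "cpu"."""
    try:
        import torch  # imported lazily so the helper also works before torch is installed
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Using device:", device)
```

Once the model and tokenizer are loaded, calling `model.to(device)` and moving the tokenized tensors with `tokens.to(device)` should keep everything on the same device.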

In [3]:
# Load Pre-trained Pegasus Model and Tokenizer
model_name = "google/pegasus-xsum"  # You can also use "google/pegasus-large" or other variants

`PegasusTokenizer` requires the SentencePiece library, which can be installed with:
`python -m pip install sentencepiece`
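To find out whether SentencePiece is already available before loading the tokenizer, a small check along these lines may help. It is a sketch that uses only the standard library; `has_module` is a hypothetical helper name.

```python
import importlib.util


def has_module(name):
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None


if not has_module("sentencepiece"):
    print("sentencepiece is missing; run: python -m pip install sentencepiece")
```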

In [4]:
# Initialize tokenizer and model
print("Loading tokenizer and model...")
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)
print("Tokenizer and model loaded successfully.")

Loading tokenizer and model...


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Tokenizer and model loaded successfully.


In [5]:
# Input Text for Summarization
input_text = (
    "The COVID-19 pandemic has had a profound impact on global health and the economy. "
    "Countries around the world have implemented measures to curb the spread of the virus, "
    "including lockdowns, social distancing, and vaccination campaigns. Despite these efforts, "
    "many challenges remain, including vaccine distribution, public compliance, and emerging variants."
)

In [6]:
print("Input text:", input_text)

Input text: The COVID-19 pandemic has had a profound impact on global health and the economy. Countries around the world have implemented measures to curb the spread of the virus, including lockdowns, social distancing, and vaccination campaigns. Despite these efforts, many challenges remain, including vaccine distribution, public compliance, and emerging variants.


In [7]:
# Tokenize Input Text
print("Tokenizing input text...")
tokens = tokenizer(input_text, truncation=True, padding="longest", max_length=512, return_tensors="pt")
print("Tokens:", tokens)

Tokenizing input text...
Tokens: {'input_ids': tensor([[  139,  4585, 44078, 11545, 41428,   148,   196,   114,  9092,   979,
           124,  1122,   426,   111,   109,  1968,   107, 23679,   279,   109,
           278,   133,  4440,  2548,   112, 11762,   109,  2275,   113,   109,
          5807,   108,   330, 59851,   116,   108,   525, 82585,   108,   111,
         19138,  4515,   107,  4987,   219,  1645,   108,   223,  1628,  1686,
           108,   330, 10733,  2807,   108,   481,  3529,   108,   111,  4610,
         12565,   107,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
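The output above holds a single row, so `padding="longest"` has nothing to pad and the attention mask is all ones. The toy sketch below illustrates what truncation and padding to the longest sequence mean for a batch of two inputs. It is plain Python for illustration only, not the tokenizer's actual implementation; the helper name `pad_batch` and the padding ID of 0 are assumptions.

```python
def pad_batch(batch, max_length=512, pad_id=0):
    """Truncate each ID list to max_length, then right-pad to the longest one."""
    # Naive truncation: a real tokenizer would also keep the end-of-sequence token.
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


example = pad_batch([[139, 4585, 1], [16882, 3941, 117, 11204, 1]])
print(example["input_ids"])       # [[139, 4585, 1, 0, 0], [16882, 3941, 117, 11204, 1]]
print(example["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```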


In [8]:
# Generate Summary
print("Generating summary...")
# Unpack the tokenizer output so the attention mask is passed along with the input IDs
summary_ids = model.generate(**tokens, max_length=50, num_beams=5, early_stopping=True)
print("Summary IDs:", summary_ids)

Generating summary...
Summary IDs: tensor([[    0,   139,   894,  1300,  7235,   143, 32916,   158,   148,  6130,
           109,  4585, 11545, 41428,   204,   107,     1]])


In [9]:
# Decode Summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Decoded summary:", summary)

Decoded summary: The World Health Organization (WHO) has declared the CO-19 pandemic over.


In [10]:
# Display Summary
print("Original Text:\n", input_text)
print("\nGenerated Summary:\n", summary)


Original Text:
 The COVID-19 pandemic has had a profound impact on global health and the economy. Countries around the world have implemented measures to curb the spread of the virus, including lockdowns, social distancing, and vaccination campaigns. Despite these efforts, many challenges remain, including vaccine distribution, public compliance, and emerging variants.

Generated Summary:
 The World Health Organization (WHO) has declared the CO-19 pandemic over.
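The generated summary says that the WHO declared the pandemic over, which the input text never states. A rough way to flag this kind of unsupported content is to list the summary words that do not occur in the source. The sketch below is a crude heuristic under that assumption, not a faithfulness metric; `unsupported_words` is a hypothetical helper, and the source string is a shortened version of the input text.

```python
import re


def unsupported_words(source, summary):
    """Return summary words (lowercased, in order) that never appear in the source."""

    def words(text):
        # Keep hyphenated tokens such as "covid-19" together.
        return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

    source_vocab = set(words(source))
    flagged = []
    for word in words(summary):
        if word not in source_vocab and word not in flagged:
            flagged.append(word)
    return flagged


source = "The COVID-19 pandemic has had a profound impact on global health and the economy."
summary = "The World Health Organization (WHO) has declared the CO-19 pandemic over."
print(unsupported_words(source, summary))
# ['world', 'organization', 'who', 'declared', 'co-19', 'over']
```

A long list of flagged words does not prove the summary is wrong, since abstractive models legitimately rephrase, but it is a cheap signal that the output deserves a closer read.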


In [11]:
# Optional: Function for Summarizing Multiple Texts
def summarize_texts(texts, model, tokenizer, max_length=50):
    """Summarize each text in `texts` one at a time and return the list of summaries."""
    summaries = []
    for text in texts:
        print("Processing text:", text)
        inputs = tokenizer(text, truncation=True, padding="longest", max_length=512, return_tensors="pt")
        print("Tokenized inputs:", inputs)
        # Unpack the tokenizer output so the attention mask is passed along with the input IDs
        outputs = model.generate(**inputs, max_length=max_length, num_beams=5, early_stopping=True)
        print("Generated outputs:", outputs)
        summaries.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
        print("Decoded summary:", summaries[-1])
    return summaries
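The function above runs one `generate` call per text. A possible batched variant, sketched under the assumption that all texts fit in memory at once, tokenizes the whole list in a single call so that `padding="longest"` actually pads the shorter inputs. It forwards the attention mask to `generate` so that padded positions are ignored. `summarize_batch` is a hypothetical name; `batch_decode` is the tokenizer method for decoding several sequences at once.

```python
def summarize_batch(texts, model, tokenizer, max_length=50):
    """Summarize a list of texts with one tokenizer call and one generate call."""
    inputs = tokenizer(
        list(texts),
        truncation=True,
        padding="longest",  # pads shorter texts to the longest one in the batch
        max_length=512,
        return_tensors="pt",
    )
    # **inputs forwards both input_ids and attention_mask to generate
    outputs = model.generate(**inputs, max_length=max_length, num_beams=5, early_stopping=True)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

It takes the same `model` and `tokenizer` arguments as `summarize_texts`, so `summarize_batch(texts, model, tokenizer)` could be swapped in where the debug prints are not needed.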

In [12]:
# Example: Summarizing Multiple Texts
texts = [
    "Artificial intelligence is transforming industries worldwide, with applications in healthcare, finance, and more.",
    "Climate change is a pressing global issue, requiring immediate action to reduce greenhouse gas emissions.",
]

print("Summarizing multiple texts...")
summaries = summarize_texts(texts, model, tokenizer)
for i, summary in enumerate(summaries):
    print(f"\nText {i+1}:\n{texts[i]}\n\nSummary {i+1}:\n{summary}")

Summarizing multiple texts...
Processing text: Artificial intelligence is transforming industries worldwide, with applications in healthcare, finance, and more.
Tokenized inputs: {'input_ids': tensor([[16882,  3941,   117, 11204,  3217,  2586,   108,   122,  1160,   115,
          2896,   108,  3324,   108,   111,   154,   107,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Generated outputs: tensor([[    0,   222,   150,   679,   113,  3439,   135,  2636,  8755,   108,
         17569,   111, 22803,  6825,  8025,  9541,  9944,   978,   134,   109,
           979,  4958,  3941,   117,   458,   124,   109, 10156,   107,     1]])
Decoded summary: In our series of letters from African journalists, filmmaker and columnist Farai Sevenzo looks at the impact artificial intelligence is having on the continent.
Processing text: Climate change is a pressing global issue, requiring immediate action to reduce greenhouse gas emissions.
Tokenized inputs:

As a next step, we can try different models and evaluate which one we prefer as a summarization model.
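One way to organise that comparison is sketched here with hypothetical names. Each candidate model is wrapped in a function that maps a text to a summary, and all of them are run on the same input so the results can be collected side by side. The `compare_summarizers` helper and the length ratio it reports are illustrative assumptions, not an established evaluation method; judging quality still means reading the summaries or applying a proper metric.

```python
def compare_summarizers(text, summarizers):
    """Run each summarizer on `text` and report its summary and length ratio.

    `summarizers` maps a label (for example a model name) to a callable
    that takes a string and returns a summary string.
    """
    results = {}
    for name, summarize in summarizers.items():
        summary = summarize(text)
        results[name] = {
            "summary": summary,
            # Word-count ratio of summary to input; lower means more compression.
            "length_ratio": len(summary.split()) / max(len(text.split()), 1),
        }
    return results


# Stand-in summarizers so the sketch runs without downloading any model.
demo = compare_summarizers(
    "one two three four",
    {"first-two-words": lambda t: " ".join(t.split()[:2]), "identity": lambda t: t},
)
print(demo)
```

With real models, each callable could wrap `summarize_texts([text], model, tokenizer)[0]` for a model loaded from, for example, `google/pegasus-xsum` or `google/pegasus-large`.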