<a href="https://colab.research.google.com/github/shfarhaan/pegasus_abstractive_summariztion/blob/main/PEGASUS_Notebook_v_0_0_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization**

*Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.*

Paper: [arXiv:1912.08777](https://arxiv.org/pdf/1912.08777.pdf)

## **0. Installing Dependencies**

In [3]:
# Install PyTorch, the backend the model runs on

!pip3 install torch



In [4]:
# Install Hugging Face Transformers, which provides the PEGASUS model and tokenizer

!pip3 install transformers

Collecting transformers
  Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting tokenizers!=0.11.3,>=0.10.1
  Downloading tokenizers-0.11.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.8 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: py

In [5]:
# Install SentencePiece, the tokenization library the PEGASUS tokenizer is built on
!pip3 install SentencePiece

Collecting SentencePiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

## **1. Import and Load Our Model**

In [6]:
# Importing dependencies from transformers
from transformers import PegasusForConditionalGeneration, PegasusTokenizerFast
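
Before loading the model, it can help to confirm which package versions were actually installed, since the tokenizer and model classes depend on all three. The following is a minimal sketch, assuming Python 3.8 or newer (for `importlib.metadata`); `installed_version` is a hypothetical helper written for this notebook, not part of any of the libraries above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the three packages installed in section 0
for package in ("torch", "transformers", "sentencepiece"):
    print(f"{package}: {installed_version(package) or 'not installed'}")
```

If any line reports `not installed`, rerun the matching install cell before continuing.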

In [7]:
# Load Tokenizer
tokenizer = PegasusTokenizerFast.from_pretrained("google/pegasus-xsum")

In [8]:
# Load Model (the checkpoint download is about 2.12 GB)
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

## **2. Perform Abstractive Summarization**

In [9]:
text = """
Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have
not been explored. Furthermore there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training
large Transformer-based encoder-decoder models on massive text corpora with a new selfsupervised objective. In PEGASUS, important
sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar
to an extractive summary. We evaluated our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate it achieves state-of-the-art performance on all 12 downstream datasets measured
by ROUGE scores. Our model also shows surprising performance on low-resource summarization, surpassing previous state-of-the-art results on 6 datasets with only 1000 examples. Finally we
validated our results using human evaluation and show that our model summaries achieve human performance on multiple datasets.
"""

In [10]:
# Tokenize: convert the text into token IDs, the numeric representation the model consumes

tokens = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")
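
To make `padding="longest"` and the accompanying `attention_mask` more concrete, here is a toy, pure-Python sketch of the idea. It is not the library's implementation: `pad_to_longest` is a hypothetical helper, the token IDs are made up, and it assumes padding is appended on the right with pad ID 0 (the ID that decodes to `<pad>` in this notebook).

```python
def pad_to_longest(batch, pad_id=0):
    """Pad every list of token IDs in `batch` to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        # Real tokens are kept as they are; padding is appended on the right
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs, for illustration only
batch = pad_to_longest([[13618, 201, 1133, 1], [3414, 1]])
print(batch["input_ids"])       # [[13618, 201, 1133, 1], [3414, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

With a single input text, as in this notebook, there is nothing to pad, so the attention mask consists entirely of ones.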

In [11]:
tokens

{'input_ids': tensor([[13618,   201,  1133,   121, 18006, 38979,   122,   813,   121, 83465,
          4358,   124,   423,  1352,   110, 88758,   148,  1673,   255,   924,
           173,  1226,   121, 37126,   124, 18030, 36789,  2722,   330,  1352,
          5906,  6520,  3884,   107,   611,   108,  1133,   121, 18006,  4358,
          5516,   118,  7093,  5551,  1352,  5906,  6520,  3884,   133,   146,
           174,  8678,   107,  5689,   186,   117,   114,  1905,   113, 11624,
          4051,   482,  2766,  9982,   107,   222,   136,   201,   108,   145,
         10287,  1133,   121, 18006,   423, 51979,   121,   936, 40753,   121,
          2534, 56636,  1581,   124,  2926,  1352,   110, 88758,   122,   114,
           177,   813, 83465,  4129,   107,   222, 49921, 89637,   108,   356,
          9750,   127,  2515,   191, 33238,   316,   135,   142,  3196,  2199,
           111,   127,  3943,   424,   130,   156,  2940,  5936,   135,   109,
          2756,  9750,   108,   984,  

In [14]:
# Summarize
summary = model.generate(**tokens)

The `**tokens` syntax unpacks the dictionary-like object returned by the tokenizer, so its entries (`input_ids` and `attention_mask`) are passed to `model.generate` as keyword arguments.
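
The same unpacking can be seen in isolation with a plain dictionary. This is a small sketch in which `toy_generate` is a hypothetical stand-in for any function that accepts keyword arguments, and the token IDs are shortened for illustration.

```python
def toy_generate(input_ids=None, attention_mask=None):
    """Hypothetical stand-in for a function that takes keyword arguments."""
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# A plain dict standing in for the tokenizer output
toy_tokens = {"input_ids": [[13618, 201, 1133]], "attention_mask": [[1, 1, 1]]}

# These two calls are equivalent: ** spreads the dict into keyword arguments
unpacked = toy_generate(**toy_tokens)
explicit = toy_generate(
    input_ids=toy_tokens["input_ids"],
    attention_mask=toy_tokens["attention_mask"],
)

print(unpacked == explicit)  # True
```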

In [15]:
{**tokens}

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[13618,   201,  1133,   121, 18006, 38979,   122,   813,   121, 83465,
           4358,   124,   423,  1352,   110, 88758,   148,  1673,   2

In [16]:
# Summary as token IDs
summary

tensor([[    0,  3414,   121, 18006, 36789,  1581,   122,   813,   121, 83465,
          4358,   124,   423,  1352,   110, 88758,   148,  1673,   255,   924,
           173,  1226,   121, 37126,   124, 18030, 36789,  2722,   330,  1352,
          5906,  6520,  3884,   107,     1]])

In [21]:
# Decode Summary
tokenizer.decode(summary[0])

'<pad> Pre-training NLP models with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization.</s>'
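
The decoded string still contains the special tokens `<pad>` and `</s>`. One way to drop them is to pass `skip_special_tokens=True` to `decode`. The sketch below wraps the three steps of this section (tokenize, generate, decode) in a hypothetical helper named `summarize`; it assumes `tokenizer` and `model` behave like the objects loaded in section 1, and it forwards any extra keyword arguments to `model.generate` unchanged.

```python
def summarize(text, tokenizer, model, **generate_kwargs):
    """Return a plain-text summary of `text` without special tokens.

    Hypothetical helper: `tokenizer` and `model` are assumed to be the
    PEGASUS tokenizer and model loaded earlier in this notebook.
    """
    # Step 1: turn the text into token IDs plus an attention mask
    tokens = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")
    # Step 2: generate the summary as a batch of token ID sequences
    summary_ids = model.generate(**tokens, **generate_kwargs)
    # Step 3: decode the first (and only) sequence, dropping <pad> and </s>
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

With the objects from this notebook, the call would be `summarize(text, tokenizer, model)`.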
