<a href="https://colab.research.google.com/github/sujitpal/nlp-deeplearning-ai-examples/blob/master/04_02a_pegasus_summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pegasus Summarizer (Inference only)

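This notebook uses a Pegasus transformer fine-tuned on the XSum dataset (`google/pegasus-xsum`) to generate summaries for the first ten reviews of the Amazon Fine Foods Reviews dataset. Most of the summaries are reasonable and match the labels in general sentiment. The exceptions are the first record, which is off-topic, and the sixth, which degenerates into a single repeated word.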

The model is large (the weights download is roughly 2.3 GB), so it needs a GPU even for inference to get reasonable response times.
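
To put "large" into numbers, here is a rough back-of-the-envelope estimate based on the largest file downloaded when the model is loaded in this notebook (2,275,329,241 bytes). It assumes that file stores the weights as float32, which is an assumption and not something the notebook verifies.

```python
# Size of the largest file downloaded when loading google/pegasus-xsum in this notebook.
checkpoint_bytes = 2_275_329_241

# Assumption: the file stores float32 weights, i.e. 4 bytes per parameter.
BYTES_PER_FLOAT32 = 4

approx_params = checkpoint_bytes / BYTES_PER_FLOAT32
approx_params_millions = approx_params / 1e6
checkpoint_gib = checkpoint_bytes / 2**30

print(f"~{approx_params_millions:.0f}M parameters, {checkpoint_gib:.2f} GiB of weights")
```

This covers only the weights; activations and the decoding search add to the memory actually needed at inference time.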

The model is described in [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777). The idea is to select the sentences that would make good candidates for an extractive summary, mask those sentences out of the input during pre-training, and train the model to generate the masked sentences as a single output sequence.
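
The pre-training objective can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation: `unigram_f1` is a simple stand-in for the ROUGE-based sentence scoring, `MASK` is a placeholder string chosen for readability, and `make_gap_sentence_example` is a hypothetical helper name.

```python
import re
from collections import Counter

MASK = "<mask_1>"  # placeholder mask string, chosen for illustration only


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two strings (a rough stand-in for ROUGE-1 F1)."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def make_gap_sentence_example(sentences, num_masked=1):
    """Mask the sentences that overlap most with the rest of the document.

    Returns (source, target): the document with the selected sentences
    replaced by MASK, and the selected sentences joined in document order.
    """
    scores = []
    for i, sent in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append(unigram_f1(sent, rest))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    masked = sorted(top[:num_masked])
    source = " ".join(MASK if i in masked else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in masked)
    return source, target


doc = [
    "The taffy arrived quickly.",
    "The taffy was soft and the taffy was fresh.",
    "Soft, fresh taffy is the best taffy.",
]
source, target = make_gap_sentence_example(doc, num_masked=1)
print(source)
print(target)
```

The sentence that shares the most words with the rest of the document is treated as the most "summary-like", so it becomes the generation target while the model sees only the remaining text plus the mask.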

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/50/0c/7d5950fcd80b029be0a8891727ba21e0cd27692c407c51261c3c921f6da3/transformers-4.1.1-py3-none-any.whl (1.5MB)

In [2]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/e5/2d/6d4ca4bef9a67070fa1cac508606328329152b1df10bdf31fb6e4e727894/sentencepiece-0.1.94-cp36-cp36m-manylinux2014_x86_64.whl (1.1MB)

In [3]:
import os
import pandas as pd
import sentencepiece
import torch

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

In [5]:
# Mount Google Drive
from google.colab import drive # import drive from google colab

ROOT = "/content/drive"     # default location for the drive
print(ROOT)                 # print content of ROOT (Optional)

drive.mount(ROOT)           # we mount the google drive at /content/drive

/content/drive
Mounted at /content/drive


In [6]:
DATA_DIR = os.path.join(ROOT, "MyDrive", "nlp-deeplearning-ai-data")

REVIEWS_FILE = os.path.join(DATA_DIR, "Reviews.csv")

In [7]:
reviews_df = pd.read_csv(REVIEWS_FILE, nrows=10)[["Text", "Summary"]]
reviews_df.head()

| | Text | Summary |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | Good Quality Dog Food |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | Not as Advertised |
| 2 | This is a confection that has been around a fe... | "Delight" says it all |
| 3 | If you are looking for the secret ingredient i... | Cough Medicine |
| 4 | Great taffy at a great price. There was a wid... | Great taffy |


In [8]:
texts = reviews_df.Text.values.tolist()
texts

['I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.',
 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".',
 'This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.',
 'If you are looking f

In [9]:
len(texts)

10

In [10]:
model_name = 'google/pegasus-xsum'

tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

Downloading (1,912,529 bytes)
Downloading (1,099 bytes)
Downloading (2,275,329,241 bytes)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
dev = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(dev)

In [12]:
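# Tokenize all reviews as one padded batch; over-long inputs are truncated.
# Calling the tokenizer directly avoids prepare_seq2seq_batch, which is
# deprecated in later transformers releases.
batch = tokenizer(
    texts, truncation=True, padding="longest", return_tensors="pt")
batch = batch.to(dev)
# Generate summary token ids with the default decoding settings, then decode.
generated = model.generate(**batch)
summaries = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(summaries)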

['Can you tell me what I am getting for my money?', 'I was sent a box of Jumbo Salted Peanuts and they were not the size I was expecting.', 'This is one of my all-time favourite treats.', 'I have been taking Robitussin for a long time and it has made me feel better.', 'Great taffy at a great price.', 'taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy, taffy', 'This is the best taffy I have ever had.', 'This is the best taffy I have ever had.', "This is the first year I've grown this type of grass in my back garden.", 'This is a good quality food.']
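
One of the generated summaries above collapses into a single repeated word. A minimal sketch of how such output could be flagged automatically is shown below. The helper names and the 0.5 threshold are assumptions made for illustration, and the sample list reuses summaries from the output above, with the repeated one shortened.

```python
import re


def distinct_token_ratio(text):
    """Share of unique word tokens in text; values near 0 mean heavy repetition."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def flag_repetitive(summaries, threshold=0.5):
    """Return the indices of summaries whose distinct-token ratio is below threshold."""
    return [i for i, s in enumerate(summaries)
            if distinct_token_ratio(s) < threshold]


sample = [
    "Great taffy at a great price.",
    "taffy, taffy, taffy, taffy, taffy, taffy",  # shortened repeated summary
    "This is the best taffy I have ever had.",
]
flagged = flag_repetitive(sample)
print(flagged)
```

Flagged rows could then be regenerated with different decoding settings. For example, `generate` accepts a `no_repeat_ngram_size` argument that may reduce this kind of looping, although that was not tried in this notebook.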


In [14]:
reviews_df["pred_summaries"] = summaries
reviews_df.head(10)

| | Text | Summary | pred_summaries |
|---|---|---|---|
| 0 | I have bought several of the Vitality canned d... | Good Quality Dog Food | Can you tell me what I am getting for my money? |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | Not as Advertised | I was sent a box of Jumbo Salted Peanuts and t... |
| 2 | This is a confection that has been around a fe... | "Delight" says it all | This is one of my all-time favourite treats. |
| 3 | If you are looking for the secret ingredient i... | Cough Medicine | I have been taking Robitussin for a long time ... |
| 4 | Great taffy at a great price. There was a wid... | Great taffy | Great taffy at a great price. |
| 5 | I got a wild hair for taffy and ordered this f... | Nice Taffy | taffy, taffy, taffy, taffy, taffy, taffy, taff... |
| 6 | This saltwater taffy had great flavors and was... | Great! Just as good as the expensive brands! | This is the best taffy I have ever had. |
| 7 | This taffy is so good. It is very soft and ch... | Wonderful, tasty taffy | This is the best taffy I have ever had. |
| 8 | Right now I'm mostly just sprouting this so my... | Yay Barley | This is the first year I've grown this type of... |
| 9 | This is a very healthy dog food. Good for thei... | Healthy Dog Food | This is a good quality food. |
