<a href="https://colab.research.google.com/github/utd-hltri/nlp/blob/main/hw2/neural_multi_doc_summarization_primera.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title MIT License
#
# Copyright (c) 2022 Maxwell Weinzierl
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

# Multi-Document Neural Summarization with PRIMERA

This notebook uses PRIMERA (Pyramid-based Masked Sentence Pre-training for
Multi-document Summarization).
The paper that introduces PRIMERA can be found here:
https://openreview.net/pdf?id=xBz8_ZZWM8d

This notebook uses code from the official repository: https://github.com/allenai/PRIMER

# Packages and Libraries
We will use the deep learning library PyTorch this time, as opposed to TensorFlow. PyTorch (https://pytorch.org/) has become the most popular deep learning library for research to date: http://horace.io/pytorch-vs-tensorflow/

![](https://www.assemblyai.com/blog/content/images/2021/12/Fraction-of-Papers-Using-PyTorch-vs.-TensorFlow.png)

## Longformer
Longformer is the long-document transformer architecture upon which PRIMERA is built. We install its reference implementation below.
See https://arxiv.org/pdf/2004.05150.pdf

You may need to restart the notebook after installing these libraries.

In [None]:
!pip install pytorch_lightning==1.3.5 spacy==2.3.3 nltk==3.6.1 tqdm==4.49.0 datasets==1.6.2

In [None]:
!pip install git+https://github.com/allenai/longformer.git

## HuggingFace Transformers

Next we will install the `transformers` library, built by HuggingFace. This library makes it easy to use state-of-the-art (SOTA) neural NLP models with PyTorch. See the HuggingFace website to browse all the publicly available models: https://huggingface.co/models

## HuggingFace Datasets
HuggingFace also provides a library called `datasets` for downloading and utilizing common NLP datasets: https://huggingface.co/datasets

## SentencePiece Tokenizer
The SentencePiece tokenizer library is required by some HuggingFace models, such as PEGASUS. We install it alongside `transformers` below.

## Model Summary
TorchInfo is a nice little library to provide a summary of model sizes and layers. We install it below to visualize the size of our models.

In [None]:
!pip install transformers sentencepiece torchinfo

In [None]:
import torch
import transformers
import datasets
from torchinfo import summary
from textwrap import wrap

In [None]:
print(torch.__version__)
print('CUDA Enabled: ', torch.cuda.is_available())
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
  print(f'  {device} - ' + torch.cuda.get_device_name(0))
else:
  print(f'  {device}')

The above cell should print a torch version containing "+cu...", which denotes that PyTorch is installed with CUDA capabilities. CUDA should be enabled with at least one device. Typically a Tesla K80 is the GPU I get on Google Colab, but others may be assigned as resources are made available. If you are unable to reserve a GPU instance, the device will be "cpu"; the code will still work, but it will run much slower.

# Neural Multi-Document Summarization Model

We will download the PRIMERA model for Multi-XScience and extract the archive as follows:

In [None]:
!wget https://storage.googleapis.com/primer_summ/PRIMER_multixscience.tar.gz

In [None]:
!tar -xvzf PRIMER_multixscience.tar.gz

We then load the tokenizer, configuration, and model weights, move the model to our device (`cuda:0` when a GPU is available), and turn on eval mode to avoid dropout randomness.

Finally, we print a summary of our model.

In [None]:
#@title Model
from transformers import AutoTokenizer
from longformer import LongformerEncoderDecoderForConditionalGeneration
from longformer import LongformerEncoderDecoderConfig

tokenizer = AutoTokenizer.from_pretrained('./PRIMER_multixscience')
config = LongformerEncoderDecoderConfig.from_pretrained('./PRIMER_multixscience')
model = LongformerEncoderDecoderForConditionalGeneration.from_pretrained(
    './PRIMER_multixscience', config=config)

# move model to GPU device
model.to(device)
# turn on EVAL mode so drop-out layers do not randomize outputs
model.eval()
# create model summary
summary(model)

# Summarization Dataset
We will examine the Multi-XScience dataset: https://github.com/yaolu/Multi-XScience

Multi-XScience is a large-scale multi-document summarization dataset created from scientific articles. It introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.

In [None]:
from datasets import load_dataset
import re
#@title Dataset

dataset = 'multi_x_science_sum' #@param ["multi_x_science_sum", "multi_news"]
data = load_dataset(dataset)
ds = data['validation']
data_size = len(ds)
print(ds)

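def preprocess(example):
  # NOTE: this assumes the multi_x_science_sum schema; multi_news uses
  # different field names ("document", "summary") and needs its own handling
  # the source documents are the paper's abstract plus every non-empty
  # abstract of the articles it references
  all_docs = [example["abstract"]]
  for d in example["ref_abstract"]["abstract"]:
    if len(d) > 0:
      all_docs.append(d)
  tgt = example["related_work"]
  # replace every @cite_<number> marker with the plain word "cite"
  tgt = re.sub(r"\@cite_\d+", "cite", tgt)
  ex = {
    "documents": all_docs,
    "summary": tgt,
  }
  return ex

ds = [preprocess(ex) for ex in ds]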
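
As a small illustration of the citation handling in the preprocessing step, the sketch below applies the same substitution to a made-up sentence. The sample text is hypothetical and is not taken from the dataset.

```python
import re

# Hypothetical related-work snippet; not taken from Multi-XScience.
text = "Prior work @cite_12 extends the model of @cite_3 ."

# Same substitution as in the preprocessing cell: each @cite_<number>
# marker is replaced by the literal word "cite".
normalized = re.sub(r"@cite_\d+", "cite", text)
print(normalized)
```

Note that the markers are replaced rather than deleted, so the target summaries still show where a citation occurred.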

## Inspecting the Dataset

We can look at individual examples in the validation split of Multi-XScience to get a feeling for the types of source documents and reference summaries.

In [None]:
#@title Example { run: "auto" }
example_index = 1069 #@param {type:"slider", min:0, max:5065, step:1}
example = ds[example_index]
print('Documents: ')
for doc in example["documents"]:
  for line in wrap(doc, 100):
    print(f'  {line}')
  print()
  
print('Annotated Summary: ')
for line in wrap(example['summary'], 100):
  print(f'  {line}')

# Specific Example
We will use the example below to follow the prediction process of the model.

In [None]:
example_index = 2161
example = ds[example_index]
print('Documents: ')
for doc in example["documents"]:
  for line in wrap(doc, 100):
    print(f'  {line}')
  print()
  
print('Annotated Summary: ')
for line in wrap(example['summary'], 100):
  print(f'  {line}')

## Tokenization

We will tokenize the above example using the HuggingFace tokenizer:

In [None]:
# first we get the token for separating documents:
docsep_token_id = tokenizer.additional_special_tokens_ids[0]
print(docsep_token_id)
print(tokenizer.decode([docsep_token_id]))
pad_token_id = tokenizer.pad_token_id
print(pad_token_id)

In [None]:
max_input_len = 4096
input_ids = []
for doc in example['documents']:
  input_ids.extend(
      tokenizer.encode(
          doc,
          truncation=True,
          max_length=(max_input_len) // len(example['documents']),
      )[1:-1]
  )
  input_ids.append(docsep_token_id)

input_ids = (
    [tokenizer.bos_token_id]
    + input_ids
    + [tokenizer.eos_token_id]
)

input_ids = torch.tensor([input_ids]).to(device)
print(input_ids)

In [None]:
# these are the token ids of the input. We can convert back to text like so:
# (skip_special_tokens=True hides the special tokens; pass False to display them)
input_tokens = tokenizer.decode(input_ids[0], skip_special_tokens=True)
for line in wrap(str(input_tokens), 100):
  print(line)

# Recall that we added a <s> token at the start,
# a </s> token to denote the end of the sequence,
# and a <doc-sep> token after each document
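
The input layout built above can be sketched without the tokenizer. The function below is a simplified illustration on toy integer ids: `build_input` and all id values are hypothetical, and it truncates plain lists instead of calling `tokenizer.encode` (which also counts the two special tokens that the notebook strips with `[1:-1]`).

```python
def build_input(doc_token_ids, max_input_len, bos_id, eos_id, docsep_id):
  """Concatenate documents as <s> doc1 <doc-sep> doc2 <doc-sep> ... </s>."""
  budget = max_input_len // len(doc_token_ids)  # equal share per document
  input_ids = [bos_id]
  for ids in doc_token_ids:
    input_ids.extend(ids[:budget])  # truncate each document to its share
    input_ids.append(docsep_id)     # a separator follows every document
  input_ids.append(eos_id)
  return input_ids

# Toy ids (hypothetical): 0 = <s>, 2 = </s>, 99 = <doc-sep>
docs = [[11, 12, 13, 14, 15], [21, 22], [31, 32, 33]]
layout = build_input(docs, max_input_len=9, bos_id=0, eos_id=2, docsep_id=99)
print(layout)
```

Because every document gets the same budget, short documents leave part of their share unused, and the separators are added on top of the per-document budgets.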

## Running Model

Next we will run the model on the above example.

In [None]:
from longformer.sliding_chunks import pad_to_window_size

In [None]:
# the outputs will contain decoded token ids
# based on the estimated most likely summary sequence
# using greedy decoding
attention_mask = torch.ones(
    input_ids.shape, dtype=torch.long, device=input_ids.device
)
attention_mask[input_ids == pad_token_id] = 0
# global attention on one token for all model params to be used,
# which is important for gradient checkpointing to work
attention_mask[:, 0] = 2
attention_mask[input_ids == docsep_token_id] = 2
# attention_mode == "sliding_chunks":
half_padding_mod = model.config.attention_window[0]

input_ids, attention_mask = pad_to_window_size(
    # ideally, should be moved inside the LongformerModel
    input_ids,
    attention_mask,
    half_padding_mod,
    pad_token_id,
)
summary_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    use_cache=True,
    max_length=1024,
    min_length=0,
    num_beams=1,
    length_penalty=1.0,
    no_repeat_ngram_size=3
)[0, 1:]
print(summary_ids)
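
The attention mask used in the cell above encodes three cases: 0 for padding, 1 for local (sliding-window) attention, and 2 for global attention. The sketch below reproduces that logic on plain Python lists. The helper names are hypothetical, and the padding rule (right-pad to a multiple of twice the one-sided window) reflects our reading of `pad_to_window_size` rather than a guarantee about the library.

```python
def build_attention_mask(input_ids, pad_id, docsep_id):
  """0 = padding (ignored), 1 = local attention, 2 = global attention."""
  mask = [0 if tok == pad_id else 1 for tok in input_ids]
  mask[0] = 2  # the first token (<s>) attends globally
  for i, tok in enumerate(input_ids):
    if tok == docsep_id:
      mask[i] = 2  # every <doc-sep> token attends globally
  return mask

def pad_to_multiple(input_ids, mask, one_sided_window, pad_id):
  """Right-pad both lists so the length is a multiple of 2 * one_sided_window."""
  w = 2 * one_sided_window
  padding_len = (w - len(input_ids) % w) % w
  return input_ids + [pad_id] * padding_len, mask + [0] * padding_len

# Toy ids (hypothetical): 0 = <s>, 1 = <pad>, 2 = </s>, 99 = <doc-sep>
ids = [0, 11, 12, 99, 21, 99, 2]
mask = build_attention_mask(ids, pad_id=1, docsep_id=99)
ids, mask = pad_to_multiple(ids, mask, one_sided_window=2, pad_id=1)
print(ids)
print(mask)
```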

In [None]:
# we can then transform these tokens to a normal string:
summary = tokenizer.decode(summary_ids, skip_special_tokens=True)
print('Annotated Summary: ')
for line in wrap(example['summary'], 100):
  print(f'  {line}')
print('Generated Summary:')
for line in wrap(summary, 100):
  print(f'  {line}')

# Sampling Summaries
We will first define a `run_model` function to do all of the above for an example.

In [None]:
# Re-run this cell when you swap models
def run_model(example, **generate_args):
  # we will tokenize a single example document,
  # and we will move these tensors to the GPU device:
  max_input_len = 4096
  input_ids = []
  for doc in example['documents']:
    input_ids.extend(
        tokenizer.encode(
            doc,
            truncation=True,
            max_length=(max_input_len) // len(example['documents']),
        )[1:-1]
    )
    input_ids.append(docsep_token_id)

  input_ids = (
      [tokenizer.bos_token_id]
      + input_ids
      + [tokenizer.eos_token_id]
  )

  input_ids = torch.tensor([input_ids]).to(device)

  attention_mask = torch.ones(
    input_ids.shape, dtype=torch.long, device=input_ids.device
  )
  attention_mask[input_ids == pad_token_id] = 0
  # global attention on one token for all model params to be used,
  # which is important for gradient checkpointing to work
  attention_mask[:, 0] = 2
  attention_mask[input_ids == docsep_token_id] = 2
  # attention_mode == "sliding_chunks":
  half_padding_mod = model.config.attention_window[0]

  input_ids, attention_mask = pad_to_window_size(
      # ideally, should be moved inside the LongformerModel
      input_ids,
      attention_mask,
      half_padding_mod,
      pad_token_id,
  )
  # the outputs will contain decoded token ids
  # based on the estimated most likely summary sequence
  # using various decoding options
  multi_summary_ids = model.generate(
      input_ids=input_ids,   
      attention_mask=attention_mask,
      use_cache=True,
      max_length=1024,
      min_length=0,
      length_penalty=1.0,
      no_repeat_ngram_size=3,
      **generate_args
  )[:, 1:]
  # converts token ids back to strings for multiple summaries
  summaries = tokenizer.batch_decode(
      multi_summary_ids, 
      skip_special_tokens=True
  )
  return summaries


## Generating Strategies
There are various ways to produce samples from a sequence-generating model.
Above we used greedy search, which picks the highest-probability token at
every step. This can miss tokens that have a lower conditional probability
but lead to a higher joint sequence probability
after further token generation.
The following article summarizes many popular generating strategies: https://huggingface.co/blog/how-to-generate
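
To make the weakness of greedy search concrete, here is a toy sketch with a hand-written table of next-token probabilities. The table `NEXT` and its values are hypothetical (in the spirit of the linked article) and are not produced by PRIMERA.

```python
# Hypothetical next-token probabilities, keyed by the sequence so far.
NEXT = {
  ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
  ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
  ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
  ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}

def greedy(prefix, steps):
  """Pick the single most probable token at every step."""
  seq, prob = tuple(prefix), 1.0
  for _ in range(steps):
    token, p = max(NEXT[seq].items(), key=lambda kv: kv[1])
    seq, prob = seq + (token,), prob * p
  return seq, prob

def exhaustive(prefix, steps):
  """Score every possible continuation and return the most probable one."""
  candidates = [(tuple(prefix), 1.0)]
  for _ in range(steps):
    candidates = [(seq + (tok,), prob * p)
                  for seq, prob in candidates
                  for tok, p in NEXT[seq].items()]
  return max(candidates, key=lambda c: c[1])

print(greedy(["The"], 2))
print(exhaustive(["The"], 2))
```

Greedy search commits to "nice" because it is the best first step, and therefore never reaches "dog has", whose joint probability is higher.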

### Greedy Search
![](https://huggingface.co/blog/assets/02_how-to-generate/greedy_search.png)



In [None]:
summary = run_model(example)[0]

print('Annotated Summary: ')
for line in wrap(example['summary'], 100):
  print(f'  {line}')
  
print('Greedy Summary:')
for line in wrap(summary, 100):
  print(f'  {line}')

### Beam Search
![](https://huggingface.co/blog/assets/02_how-to-generate/beam_search.png)
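
Beam search keeps the `num_beams` most probable partial sequences at every step instead of only one. The following is a minimal sketch over a hypothetical probability table. Real implementations typically work with log-probabilities and handle end-of-sequence tokens and length penalties, which are omitted here.

```python
# Hypothetical next-token probabilities, keyed by the sequence so far.
NEXT = {
  ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
  ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
  ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
  ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}

def beam_search(prefix, steps, num_beams):
  """Return the num_beams best sequences, most probable first."""
  beams = [(tuple(prefix), 1.0)]
  for _ in range(steps):
    # expand every kept sequence by every possible next token
    expanded = [(seq + (tok,), prob * p)
                for seq, prob in beams
                for tok, p in NEXT[seq].items()]
    # keep only the num_beams most probable expansions
    expanded.sort(key=lambda c: c[1], reverse=True)
    beams = expanded[:num_beams]
  return beams

for seq, prob in beam_search(["The"], steps=2, num_beams=2):
  print(' '.join(seq), round(prob, 3))
```

With `num_beams=1` this reduces to greedy search; with two beams the sketch recovers "The dog has", which greedy search would miss.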


In [None]:
summaries = run_model(
    example,
    num_beams=10, 
    num_return_sequences=5, 
    early_stopping=True
)

print('Annotated Summary: ')
for line in wrap(example['summary'], 100):
  print(f'  {line}')

print('Beam Summaries:')
for beam, summary in enumerate(summaries, start=1):
  print(f'  Beam #{beam} Summary:')
  for line in wrap(summary, 100):
    print(f'    {line}')

### Problems with Beam Search
Beam search and all deterministic generating approaches are rarely surprising.
This leads to an almost robotic-sounding result, where only high-probability words are selected.

![](https://blog.fastforwardlabs.com/images/2019/05/Screen_Shot_2019_05_08_at_3_06_36_PM-1557342561886.png)

In reality, language is often surprising, with unlikely words showing up all the time! Therefore, we want to consider approaches that randomly sample from the conditional distribution produced by our model.

### Sampling
Now there are multiple approaches to random sampling from the $P(w_t|w_{1:t-1})$ conditional distribution. The first approach is just to directly sample:

In [None]:
summaries = run_model(
    example,
    do_sample=True, 
    num_return_sequences=5,
    top_k=0
)

print('Annotated Summary: ')
for line in wrap(example['summary'], 100):
  print(f'  {line}')

print('Sampled Summaries:')
for sample, summary in enumerate(summaries, start=1):
  print(f'  Sample #{sample} Summary:')
  for line in wrap(summary, 100):
    print(f'    {line}')
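
Direct sampling, as used in the cell above with `do_sample=True` and `top_k=0`, draws each next token in proportion to its probability. A minimal sketch with a hypothetical three-token distribution:

```python
import random

# Hypothetical next-token distribution; not produced by the model.
probs = {"the": 0.6, "a": 0.3, "zebra": 0.1}

def sample_token(probs, rng):
  """Draw one token with probability proportional to its weight."""
  tokens, weights = zip(*probs.items())
  return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [sample_token(probs, rng) for _ in range(10_000)]
freqs = {tok: draws.count(tok) / len(draws) for tok in probs}
print(freqs)
```

The empirical frequencies approach the underlying probabilities, so even the unlikely token "zebra" is selected in roughly one draw out of ten.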

### Temperature

We can modify $P(w_t|w_{1:t-1})$ to be more or less "surprising" by making the distribution sharper or flatter with the `temperature` parameter. A lower temperature ($t<1.0$) leads to a sharper distribution, which makes high-probability tokens even more likely to be sampled.
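
One common way to apply a temperature is to divide the logits by it before the softmax. The sketch below assumes that formulation and uses hypothetical logits to show how the distribution changes.

```python
import math

def softmax_with_temperature(logits, temperature):
  """Softmax over logits / temperature (max-shifted for numerical stability)."""
  scaled = [x / temperature for x in logits]
  m = max(scaled)
  exps = [math.exp(x - m) for x in scaled]
  total = sum(exps)
  return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
for t in (0.7, 1.0, 1.3):
  print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As the temperature grows, the probability of the top token shrinks and the tail tokens gain mass; as it approaches zero, sampling approaches greedy search.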


In [None]:
summaries = run_model(
    example,
    do_sample=True, 
    num_return_sequences=5,
    top_k=0,
    temperature=0.7
)

print('Annotated Summary: ')
for line in wrap(example['summary'], 100):
  print(f'  {line}')

print('Low Temperature Sampled Summaries:')
for sample, summary in enumerate(summaries, start=1):
  print(f'  Sample #{sample} Summary:')
  for line in wrap(summary, 100):
    print(f'    {line}')

A higher temperature ($t>1.0$) leads to a flatter distribution, which will have a higher probability of sampling from low probability tokens.

In [None]:
summaries = run_model(
    example,
    do_sample=True, 
    num_return_sequences=5,
    top_k=0,
    temperature=1.3
)

print('Annotated Summary: ')
for line in wrap(example['summary'], 100):
  print(f'  {line}')

print('High Temperature Sampled Summaries:')
for sample, summary in enumerate(summaries, start=1):
  print(f'  Sample #{sample} Summary:')
  for line in wrap(summary, 100):
    print(f'    {line}')

### Top-K Sampling
Top-K sampling restricts $P(w_t|w_{1:t-1})$ to the k highest-probability tokens. In effect, the probability mass of all other tokens is removed and redistributed among the top-k tokens, so that only top-k tokens can be sampled. This avoids sampling extremely low-probability tokens, which could ruin the sequence.
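
The filtering step can be sketched on a plain dictionary of probabilities. The function name `top_k_filter` and the example distribution are hypothetical.

```python
def top_k_filter(probs, k):
  """Keep the k most probable tokens and renormalize them to sum to 1."""
  kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
  total = sum(p for _, p in kept)
  return {tok: p / total for tok, p in kept}

probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)
```

Sampling then proceeds as before, but only over the tokens that survive the filter.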

In [None]:
summaries = run_model(
    example,
    do_sample=True, 
    num_return_sequences=5,
    top_k=50,
)

print('Annotated Summary: ')
for line in wrap(example['summary'], 100):
  print(f'  {line}')

print('Top-K Sampling Summaries:')
for sample, summary in enumerate(summaries, start=1):
  print(f'  Sample #{sample} Summary:')
  for line in wrap(summary, 100):
    print(f'    {line}')

### Top-P Sampling
Top-P sampling restricts $P(w_t|w_{1:t-1})$ to the smallest set of highest-probability tokens whose cumulative probability mass reaches p. In other words, the probabilities $P(w_t|w_{1:t-1})$ are sorted from largest to smallest, and tokens are added in that order until their total probability is at least p; only these tokens are available to be sampled from. The probability mass is then redistributed among these top-p tokens.
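
A minimal sketch of this filter on a dictionary of probabilities follows. The function name `top_p_filter` and the example distribution are hypothetical.

```python
def top_p_filter(probs, p):
  """Keep the smallest high-probability prefix with total mass >= p."""
  ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
  kept, cumulative = [], 0.0
  for tok, prob in ranked:
    kept.append((tok, prob))
    cumulative += prob
    if cumulative >= p:
      break  # the nucleus is complete
  # renormalize the kept tokens so they sum to 1
  return {tok: prob / cumulative for tok, prob in kept}

probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}
filtered = top_p_filter(probs, p=0.9)
print(filtered)
```

Unlike top-k, the number of surviving tokens adapts to the shape of the distribution: a peaked distribution keeps few tokens, a flat one keeps many.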

In [None]:
summaries = run_model(
    example,
    do_sample=True, 
    num_return_sequences=5,
    top_p=0.90, 
    top_k=0
)

print('Annotated Summary: ')
for line in wrap(example['summary'], 100):
  print(f'  {line}')

print('Top-P Sampling Summaries:')
for sample, summary in enumerate(summaries, start=1):
  print(f'  Sample #{sample} Summary:')
  for line in wrap(summary, 100):
    print(f'    {line}')

### Top-P and Top-K Sampling
We can also perform both Top-P and Top-K sampling together, which provides multiple constraints on which tokens we can sample from $P(w_t|w_{1:t-1})$.
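
The sketch below combines the two filters in one possible order: top-k first, then top-p on the renormalized result. The function names and the distribution are hypothetical, and the order is an assumption made for illustration.

```python
def top_k_filter(probs, k):
  """Keep the k most probable tokens and renormalize."""
  kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
  total = sum(p for _, p in kept)
  return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
  """Keep the smallest high-probability prefix with mass >= p and renormalize."""
  ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
  kept, cumulative = [], 0.0
  for tok, prob in ranked:
    kept.append((tok, prob))
    cumulative += prob
    if cumulative >= p:
      break
  return {tok: prob / cumulative for tok, prob in kept}

probs = {"the": 0.4, "a": 0.3, "this": 0.15, "that": 0.1, "zebra": 0.05}
filtered = top_p_filter(top_k_filter(probs, k=3), p=0.8)
print(filtered)
```

Here top-k alone would keep three tokens, while the additional top-p constraint narrows the candidates to two.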

In [None]:
summaries = run_model(
    example,
    do_sample=True, 
    num_return_sequences=5,
    top_p=0.90, 
    top_k=50, 
)

print('Annotated Summary: ')
for line in wrap(example['summary'], 100):
  print(f'  {line}')

print('Top-P AND Top-K Sampling Summaries:')
for sample, summary in enumerate(summaries, start=1):
  print(f'  Sample #{sample} Summary:')
  for line in wrap(summary, 100):
    print(f'    {line}')

# Examine Summaries
Now we will perform the above process for a few examples. We will first define a `generate` function that runs the model on an example with a chosen generating strategy.

In [None]:
def generate(example, strategy):
  if strategy == 'greedy':
    summary = run_model(example)[0]
  elif strategy == 'beam':
    summary = run_model(
        example,
        num_beams=10, 
        num_return_sequences=1, 
        early_stopping=True
    )[0]
  elif strategy == 'sample':
    summary = run_model(
        example,
        do_sample=True, 
        num_return_sequences=1,
        top_k=0, 
    )[0]
  elif strategy == 'top-k':
    summary = run_model(
        example,
        do_sample=True, 
        num_return_sequences=1,
        top_k=50, 
    )[0]
  elif strategy == 'top-p':
    summary = run_model(
        example,
        do_sample=True, 
        num_return_sequences=1,
        top_p=0.90, 
        top_k=0, 
    )[0]
  elif strategy == 'top-p-k':
    summary = run_model(
        example,
        do_sample=True, 
        num_return_sequences=1,
        top_p=0.90, 
        top_k=50, 
    )[0]
  else:
    raise ValueError(f'Unknown generator strategy: {strategy}')
  return summary
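
As a design alternative, the strategy branches can be expressed as a lookup table of keyword arguments. This is only a sketch: `STRATEGY_ARGS`, `generate_with_table`, and the stand-in `fake_run_model` are hypothetical names, and the stand-in replaces `run_model` so the snippet runs without the model loaded.

```python
# Keyword arguments per strategy, mirroring the branches of `generate`.
STRATEGY_ARGS = {
  'greedy': {},
  'beam': dict(num_beams=10, num_return_sequences=1, early_stopping=True),
  'sample': dict(do_sample=True, num_return_sequences=1, top_k=0),
  'top-k': dict(do_sample=True, num_return_sequences=1, top_k=50),
  'top-p': dict(do_sample=True, num_return_sequences=1, top_p=0.90, top_k=0),
  'top-p-k': dict(do_sample=True, num_return_sequences=1, top_p=0.90, top_k=50),
}

def generate_with_table(example, strategy, run_model_fn):
  """Look up the strategy's arguments and return the first summary."""
  if strategy not in STRATEGY_ARGS:
    raise ValueError(f'Unknown generator strategy: {strategy}')
  return run_model_fn(example, **STRATEGY_ARGS[strategy])[0]

def fake_run_model(example, **generate_args):
  """Stand-in for `run_model`: reports which arguments it received."""
  return [str(sorted(generate_args))]

print(generate_with_table({}, 'top-p', fake_run_model))
```

In the notebook, passing `run_model` as `run_model_fn` would give the same behavior as the `if`/`elif` chain, and adding a strategy would only require a new table entry.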

## Evaluation

Change the example index and view the model's predictions below for 10 different examples. For each example, compare the results for each strategy. Manually judge whether each strategy for each of the 10 examples is correct, for a total of 60 judgements. 

Discuss how accurately the model summarized the documents and whether the generated summaries lined up with the annotated summaries of the examples. Include this discussion in your report.

In [None]:
#@title Example { run: "auto" }
example_index = 0 #@param {type:"slider", min:0, max:5065, step:1}
strategy = 'top-p-k' #@param ["greedy", "beam", "sample", "top-k", "top-p", "top-p-k"]

example = ds[example_index]
print('Documents: ')
for doc in example["documents"]:
  for line in wrap(doc, 100):
    print(f'  {line}')
  print()
  
print('Annotated Summary: ')
for line in wrap(example['summary'], 100):
  print(f'  {line}')

summary = generate(example, strategy)

print(f'Generated Summary: ')
for line in wrap(summary, 100):
  print(f'  {line}')

## Report Format

You should have the following in your report:

| Strategy      | Accuracy |
| ----------- | ----------- |
| greedy      | ...       |
| beam      | ...       |
| sample      | ...       |
| top-k      | ...       |
| top-p      | ...       |
| top-p-k      | ...       |


Calculate the accuracy of each summary strategy by adding up the number of correct examples (by your own judgement) and dividing by 10 (the total number of examples you should evaluate). 
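
The calculation can be sketched as follows. The judgement values are placeholders for illustration only, not real results, and only two strategies are shown.

```python
# Placeholder judgements (True = judged correct), one entry per example.
# These values are made up; replace them with your own judgements.
judgements = {
  'greedy': [True, False, True, True, False, True, False, True, True, False],
  'beam': [True, True, False, True, False, True, True, True, False, True],
}

def accuracy(flags):
  """Fraction of examples judged correct."""
  return sum(flags) / len(flags)

for strategy, flags in judgements.items():
  print(f'{strategy}: {accuracy(flags):.2f}')
```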

Also include one example together with your judgements, and compare the predictions made by each strategy for that example. Try to find an example where the strategies differ.