<a href="https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## BigBirdPegasus Evaluation 🤗

In this notebook, we evaluate BigBirdPegasus on a summarization task. BigBird was introduced in this [paper](https://arxiv.org/abs/2007.14062) and achieved strong results on long-document summarization.

This notebook shows how to reproduce the official results in 20-some lines of code with 🤗Datasets and 🤗Transformers.

First, let's try to get a GPU with at least 15 GB of memory.

In [None]:
# crash colab to get more RAM
# !kill -9 -1

To check that we have enough GPU memory, we can run the following command.
If the randomly allocated GPU is too small, the cell above can be uncommented and run
to crash the notebook in the hope of getting a better GPU.

In [1]:
!nvidia-smi

Fri Apr 23 20:24:11 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
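
If you prefer to check the memory programmatically rather than reading the table, a minimal sketch could parse the `used / total` column of the `nvidia-smi` output. The helper names and the 15000 MiB threshold are assumptions made for this illustration; they are not part of the notebook.

```python
import re

# Matches the "used / total" memory column of the nvidia-smi table,
# e.g. "0MiB / 15109MiB".
MEMORY_PATTERN = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def gpu_memory_mib(smi_output):
    """Return a list of (used, total) memory pairs in MiB, one per GPU row."""
    return [(int(used), int(total)) for used, total in MEMORY_PATTERN.findall(smi_output)]


def has_enough_memory(smi_output, required_mib=15000):
    """True if at least one listed GPU has `required_mib` MiB in total.

    The default threshold is a hypothetical stand-in for "about 15 GB".
    """
    return any(total >= required_mib for _, total in gpu_memory_mib(smi_output))


# One row copied from the output above.
sample = "| N/A   48C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |"
print(gpu_memory_mib(sample))     # [(0, 15109)]
print(has_enough_memory(sample))  # True
```

In Colab, the text passed to these helpers could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.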

Next, we install 🤗Transformers, 🤗Datasets, and `rouge_score`.



In [19]:
%%capture
!pip3 install datasets
!pip3 install rouge_score
!pip3 install git+https://github.com/vasudevgupta7/transformers@add_bigbird_pegasus
!pip3 install sentencepiece

We will evaluate **BigBirdPegasus** on the **_arxiv_** dataset using the **Rouge-2** metric. Let's 
import the two loading functions `load_dataset` and `load_metric`.

In [3]:
from datasets import load_dataset, load_metric

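Let's download the arxiv dataset ([click to see on 🤗Datasets Hub](https://huggingface.co/datasets/scientific_papers)). We first mount Google Drive and use it as the cache directory, so the downloaded data can be reused later. The download can take a couple of minutes **☕** .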

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
cd /content/drive/MyDrive/

/content/drive/MyDrive


In [7]:
test_dataset = load_dataset("scientific_papers", "arxiv", split="test", cache_dir="/content/drive/MyDrive/arxiv")

Downloading and preparing dataset scientific_papers/arxiv (download: 4.20 GiB, generated: 7.06 GiB, post-processed: Unknown size, total: 11.26 GiB) to /content/drive/MyDrive/arxiv/scientific_papers/arxiv/1.1.1/043e40ed208b8a66ee9e8228c86874946c99d2fc6155a1daee685795851cfdfc...


Dataset scientific_papers downloaded and prepared to /content/drive/MyDrive/arxiv/scientific_papers/arxiv/1.1.1/043e40ed208b8a66ee9e8228c86874946c99d2fc6155a1daee685795851cfdfc. Subsequent calls will reuse this data.


Next, we import the `BigBirdPegasusForConditionalGeneration` model and the `BigBirdPegasusTokenizer` tokenizer.

In [10]:
from transformers import BigBirdPegasusForConditionalGeneration, BigBirdPegasusTokenizer

The official checkpoint "google/bigbird-pegasus-large-arxiv" ([click to see on 🤗Model Hub](https://huggingface.co/google/bigbird-pegasus-large-arxiv)) has already been fine-tuned on arxiv, so in this notebook we are only interested in evaluating the model. The cell below loads the weights from the `vasudevgupta/bigbird-pegasus-large-arxiv` repository.


In [11]:
device = "cuda"

tokenizer = BigBirdPegasusTokenizer.from_pretrained("vasudevgupta/bigbird-pegasus-large-arxiv")
model = BigBirdPegasusForConditionalGeneration.from_pretrained("vasudevgupta/bigbird-pegasus-large-arxiv").to(device)

Some weights of the model checkpoint at vasudevgupta/bigbird-pegasus-large-arxiv were not used when initializing BigBirdPegasusForConditionalGeneration: ['model.encoder.layers.0.self_attn.self.query.bias', 'model.encoder.layers.0.self_attn.self.key.bias', 'model.encoder.layers.0.self_attn.self.value.bias', 'model.encoder.layers.1.self_attn.self.query.bias', 'model.encoder.layers.1.self_attn.self.key.bias', 'model.encoder.layers.1.self_attn.self.value.bias', 'model.encoder.layers.2.self_attn.self.query.bias', 'model.encoder.layers.2.self_attn.self.key.bias', 'model.encoder.layers.2.self_attn.self.value.bias', 'model.encoder.layers.3.self_attn.self.query.bias', 'model.encoder.layers.3.self_attn.self.key.bias', 'model.encoder.layers.3.self_attn.self.value.bias', 'model.encoder.layers.4.self_attn.self.query.bias', 'model.encoder.layers.4.self_attn.self.key.bias', 'model.encoder.layers.4.self_attn.self.value.bias', 'model.encoder.layers.5.self_attn.self.query.bias', 'model.encoder.layers.5.

In [47]:
model.config.block_size

64

Now we can write the evaluation function for BigBirdPegasus.
First, we tokenize each *article* up to a maximum length of 4096 tokens.
We will make use of beam search (with `num_beams=4`) to generate the predicted *abstract* of the *article*. Finally, the predicted *abstract* tokens are decoded and the resulting predicted *abstract* string is saved in the batch.
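
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a toy sketch that operates on plain lists of token ids. It is not the tokenizer's implementation: special tokens are ignored, and `pad_or_truncate` and `pad_id=0` are hypothetical names and values chosen for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut `token_ids` to `max_length` or right-pad it with `pad_id`.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding positions.
    """
    kept = list(token_ids[:max_length])
    num_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * num_pad
    attention_mask = [1] * len(kept) + [0] * num_pad
    return input_ids, attention_mask


# A short input is padded up to the fixed length.
print(pad_or_truncate([5, 6, 7], max_length=5))
# ([5, 6, 7, 0, 0], [1, 1, 1, 0, 0])

# A long input is cut off; everything past max_length is dropped.
print(pad_or_truncate(list(range(1, 10)), max_length=4))
# ([1, 2, 3, 4], [1, 1, 1, 1])
```

With `max_length=4096`, every article ends up with exactly 4096 positions, and any text beyond that limit is not seen by the model.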

In [48]:
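import torch

def generate_answer(batch):
  # `map` is called without `batched=True`, so `batch` is a single example (a dict)
  # tokenize the article, truncating or padding it to exactly 4096 tokens
  inputs_dict = tokenizer(batch["article"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
  inputs_dict = {k: inputs_dict[k].to(device) for k in inputs_dict}

  # generate the summary with beam search
  predicted_abstract_ids = model.generate(**inputs_dict, max_length=512, num_beams=4)
  # decode to text; only one sequence was generated, so keep the first string
  batch["predicted_abstract"] = tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)[0]
  return batch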
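
The `num_beams=4` argument asks `generate` to use beam search. As a rough intuition for what that means, the following toy sketch keeps the `num_beams` best partial sequences at every step. It is a simplified illustration, not the 🤗Transformers implementation (no length penalty, no batching), and the probability table and function names are made up for this example.

```python
import math

# Hypothetical next-token probabilities, keyed by the last token of the prefix.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}


def toy_model(prefix):
    """Return {token: log-probability} for the token that follows `prefix`."""
    return {tok: math.log(p) for tok, p in TABLE[prefix[-1]].items()}


def beam_search(next_log_probs, bos, eos, num_beams=4, max_length=10):
    """Return (tokens, log-probability) of the best sequence found."""
    beams = [([bos], 0.0)]  # active hypotheses: (tokens, summed log-probability)
    finished = []           # hypotheses that already end with `eos`
    for _ in range(max_length):
        # Extend every active hypothesis by every possible next token.
        candidates = [
            (tokens + [tok], score + log_p)
            for tokens, score in beams
            for tok, log_p in next_log_probs(tokens).items()
        ]
        # Keep only the `num_beams` highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(toy_model, "<s>", "</s>", num_beams=1)
beam_tokens, beam_score = beam_search(toy_model, "<s>", "</s>", num_beams=2)
print(greedy_tokens)  # ['<s>', 'a', 'x', '</s>']
print(beam_tokens)    # ['<s>', 'b', 'z', '</s>']
```

With a single beam the search commits to `a` (probability 0.6) and ends with a total probability of 0.3, while two beams keep `b` alive long enough to find the sequence with probability 0.36.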

Because the inputs are very long (up to 4096 tokens each), evaluating the whole test dataset of 6000+ samples would take a long time in this notebook. For the sake of this notebook, we only evaluate on a small subset: the cell below keeps the first 2 examples using 🤗Datasets' convenient `.select()` functionality. Increase the range (for example to 600) for a more reliable estimate.

In [49]:
dataset_small = test_dataset.select(range(2))

In [50]:
dataset_small

Dataset({
    features: ['article', 'abstract', 'section_names'],
    num_rows: 2
})

Alright, let's map each sample to the predicted *abstract*. How long this takes depends on the number of samples and on the GPU you're given.

In [51]:
result_small = dataset_small.map(generate_answer)


In [52]:
result_small["predicted_abstract"][:4]

['the problem of the existence of the 155-day periodicity in the daily sunspot areas, the mean sunspot areas per carrington rotation, the monthly sunspot numbers and their fluctuations, which are obtained after removing the 11-year cycle is considered.<n> two methods of the power spectrum analysis are used : the fast fourier transformation algorithm with the hamming window function ( fft ) and the blackman - tukey ( bt ) method.<n> the fft method consists in the smoothing of a cosine transform of an autocorrelation function using a 3-point weighting average.<n> the bt method consists in the smoothing of a cosine transform of an autocorrelation function using a 3-point weighting average.<n> numerical results of the new method of the diagnosis of an echo - effect for sunspot area data are discussed.<n> it is shown that the sunspot data from 1923 - 1933 ( cycle 16 ) present the 155-day periodicity in the mean sunspot areas per carrington rotation, in the mean sunspot areas per carrington 

In [56]:
result_small["abstract"][:4]

[' the short - term periodicities of the daily sunspot area fluctuations from august 1923 to october 1933 are discussed . for these data \n the correlative analysis indicates negative correlation for the periodicity of about @xmath0 days , but the power spectrum analysis indicates a statistically significant peak in this time interval . \n a new method of the diagnosis of an echo - effect in spectrum is proposed and it is stated that the 155-day periodicity is a harmonic of the periodicities from the interval of @xmath1 $ ] days .    the autocorrelation functions for the daily sunspot area fluctuations and for the fluctuations of the one rotation time interval in the northern hemisphere , separately for the whole solar cycle 16 and for the maximum activity period of this cycle do not show differences , especially in the interval of @xmath2 $ ] days . \n it proves against the thesis of the existence of strong positive fluctuations of the about @xmath0-day interval in the maximum activit

The only thing left to do is to evaluate our predictions with the *rouge* metric. Let's load the metric.

In [54]:
rouge = load_metric("rouge")

Now, we can compute the rouge score on all predicted *abstracts*.

In [55]:
rouge.compute(predictions=result_small["predicted_abstract"], references=result_small["abstract"], rouge_types=["rouge2"])["rouge2"].mid

Score(precision=0.14612751390671802, recall=0.23383838383838385, fmeasure=0.17833297945124682)
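
To see where the precision, recall and F-measure fields come from, here is a simplified sketch of Rouge-2 as bigram overlap between one prediction and one reference. It assumes lowercased whitespace tokenization and does no stemming or aggregation over examples, so its numbers will generally differ from those of the `rouge_score` package used above. The function names are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Count the bigrams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, fmeasure) of the bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    # Each bigram counts at most as often as it occurs in both texts.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


scores = rouge2("the cat sat on the mat", "the cat lay on the mat")
print(tuple(round(s, 3) for s in scores))  # (0.6, 0.6, 0.6)
```

Here 3 of the 5 predicted bigrams also occur in the reference, which gives a precision and a recall of 0.6 each.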

On the 2 samples evaluated above, we get a *Rouge-2* F-measure of about **0.178** 🔥🔥🔥. The [official paper](https://arxiv.org/abs/2007.14062) reports a new state-of-the-art score of **something** on the whole test dataset${}^1$.

The arxiv dataset contains many documents of lengths exceeding 14K tokens, which cannot be handled well by *PEGASUS* and *BigBird* as those models are limited to 1024 and 4096 tokens respectively.
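
As a quick worked example of what these limits mean for a long document, the sketch below computes the fraction of tokens that survives truncation. The 14000-token document length is an illustrative value based on the "14K tokens" mentioned above, and `retained_fraction` is a hypothetical helper.

```python
def retained_fraction(num_tokens, max_length):
    """Fraction of a document's tokens that survives truncation to `max_length`."""
    if num_tokens <= 0:
        return 1.0
    return min(num_tokens, max_length) / num_tokens


for limit in (1024, 4096):
    print(limit, round(retained_fraction(14000, limit), 3))
# 1024 0.073
# 4096 0.293
```

So for a 14K-token article, a 1024-token limit keeps roughly 7% of the text and a 4096-token limit roughly 29%.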

---

${}^1$ The checkpoint was also evaluated on the complete dataset with exactly the same hyperparameters as in this notebook, yielding a score of **something**, which is close enough to the official results to confirm the effectiveness of BigBird.
