<a href="https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🤗's BigBirdPegasus Evaluation

In this notebook, we are going to evaluate the BigBird model on a summarization task. BigBird was first introduced in this [paper](https://arxiv.org/abs/2007.14062) (from google-research) and in this [repository](https://github.com/google-research/bigbird). It has achieved awesome results on long-document summarization thanks to its block sparse attention. You can refer to this [blog post](https://huggingface.co/blog/big-bird) if you want to understand BigBird's block sparse attention.

This notebook shows how to evaluate 🤗's `BigBirdPegasus` (or any 🤗 encoder-decoder model) on a summarization task using 🤗Datasets and 🤗Transformers.

Let's see what GPU we got. We need at least ~14 GB of GPU memory to be able to run this notebook.

In [1]:
!nvidia-smi

Sun May  2 12:51:07 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Next, we will install 🤗Transformers, 🤗Datasets, `rouge_score` & some other dependencies.

In [2]:
%%capture
!pip3 install datasets
!pip3 install rouge_score
!pip3 install git+https://github.com/vasudevgupta7/transformers@add_bigbird_pegasus
!pip3 install sentencepiece

We will evaluate **BigBirdPegasus** on the **_pubmed_** dataset using the **Rouge-2** metric. Let's import the two loading functions `load_dataset` and `load_metric`. Further, we import `BigBirdPegasusForConditionalGeneration` and the `BigBirdPegasusTokenizer` tokenizer.

In [3]:
from datasets import load_dataset, load_metric
import torch
from transformers import BigBirdPegasusForConditionalGeneration, BigBirdPegasusTokenizer

Let's define some variables which will be useful later.

In [6]:
DATASET_NAME = "pubmed"
DEVICE = "cuda"
CACHE_DIR = DATASET_NAME
MODEL_ID = f"vasudevgupta/bigbird-pegasus-large-{DATASET_NAME}"

Let's download the `pubmed` dataset ([click to see on 🤗Datasets Hub](https://huggingface.co/datasets/scientific_papers)). This can take a couple of minutes **☕** .

In [7]:
test_dataset = load_dataset("scientific_papers", DATASET_NAME, split="test", cache_dir=CACHE_DIR)
test_dataset

Downloading and preparing dataset scientific_papers/pubmed (download: 4.20 GiB, generated: 2.33 GiB, post-processed: Unknown size, total: 6.53 GiB) to pubmed/scientific_papers/pubmed/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f...


Dataset scientific_papers downloaded and prepared to pubmed/scientific_papers/pubmed/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f. Subsequent calls will reuse this data.


Dataset({
    features: ['article', 'abstract', 'section_names'],
    num_rows: 6658
})

The official checkpoint `google/bigbird-pegasus-large-pubmed` ([click to see on 🤗Model Hub](https://huggingface.co/google/bigbird-pegasus-large-pubmed)) has already been fine-tuned on pubmed. In this notebook, we are just interested in evaluating the model.

In [8]:
tokenizer = BigBirdPegasusTokenizer.from_pretrained(MODEL_ID)
model = BigBirdPegasusForConditionalGeneration.from_pretrained(MODEL_ID).to(DEVICE)
rouge = load_metric("rouge")

In [9]:
# let's see the encoder attention_type, block_size
model.config.attention_type, model.config.block_size

('block_sparse', 64)
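
As a rough back-of-the-envelope sketch of why `block_sparse` attention matters at 4096 tokens, we can count attention scores per layer. The window, global, and random block counts below are assumptions for illustration (check `model.config.num_random_blocks` for the actual value), and the count ignores that tokens in global blocks attend to every position.

```python
# Illustrative count of attention scores for one head in one layer.
SEQ_LEN = 4096          # maximum input length used in this notebook
BLOCK_SIZE = 64         # value of model.config.block_size printed above
NUM_WINDOW_BLOCKS = 3   # assumption: sliding window of left, own, and right block
NUM_GLOBAL_BLOCKS = 2   # assumption: first and last block are global
NUM_RANDOM_BLOCKS = 3   # assumption: see model.config.num_random_blocks

num_blocks = SEQ_LEN // BLOCK_SIZE
keys_per_query = (NUM_WINDOW_BLOCKS + NUM_GLOBAL_BLOCKS + NUM_RANDOM_BLOCKS) * BLOCK_SIZE

full_scores = SEQ_LEN * SEQ_LEN            # every query attends to every key
sparse_scores = SEQ_LEN * keys_per_query   # every query attends to a few blocks only

print(f"blocks: {num_blocks}, keys per query: {keys_per_query}")
print(f"full / sparse score count: {full_scores / sparse_scores:.1f}x")
```

Under these assumptions each query looks at a fixed number of keys, so the cost grows linearly with sequence length instead of quadratically.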

Now we can write the evaluation function for BigBirdPegasus.
First, we tokenize each *article* up to a maximum length of 4096 tokens.
We will make use of beam search (with `num_beams=5` & `length_penalty=0.8`) to generate the predicted *abstract* of the *article*. Finally, the predicted *abstract* tokens are decoded and the resulting predicted *abstract* string is saved in the batch.

In [10]:
def generate_answer(batch):
  # tokenize the article, padding/truncating it to exactly 4096 tokens
  inputs_dict = tokenizer(batch["article"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
  # move every input tensor to the target device
  inputs_dict = {k: inputs_dict[k].to(DEVICE) for k in inputs_dict}
  # generate the summary with beam search
  predicted_abstract_ids = model.generate(**inputs_dict, max_length=512, num_beams=5, length_penalty=0.8)
  # decode the returned sequence back into a string
  batch["predicted_abstract"] = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
  print(batch["predicted_abstract"])
  return batch
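
To get a feeling for what `length_penalty=0.8` does, here is a minimal sketch. It assumes, as far as we understand 🤗Transformers' beam search, that finished hypotheses are ranked by the sum of their token log-probabilities divided by `length ** length_penalty`. The helper `beam_score` and the two example beams are hypothetical and only serve as illustration.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score (assumed form): sum_logprobs / length**length_penalty."""
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical finished beams with the same average log-prob per token (-0.5).
short_beam = (-50.0, 100)    # (sum of log-probs, number of tokens)
long_beam = (-150.0, 300)

for penalty in (0.0, 0.8, 1.0):
    short_score = beam_score(*short_beam, penalty)
    long_score = beam_score(*long_beam, penalty)
    print(f"length_penalty={penalty}: short={short_score:.3f}, long={long_score:.3f}")
```

With a penalty of `1.0` both beams tie, while values below `1.0` (such as the `0.8` used here) tilt the ranking towards the shorter summary.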

Let's take 2 samples & look at the predictions, just to check that everything is working 🙂.

In [12]:
dataset_small = test_dataset.select(range(2))
result_small = dataset_small.map(generate_answer)

rouge.compute(predictions=result_small["predicted_abstract"], references=result_small["abstract"])

although anxiety is the most prominent and prevalent mood disorder in patients with parkinson's disease ( pd ), few studies have investigated the relationship between anxiety and cognition in pd.<n> the aim of this study was to examine the influence of anxiety on cognition in pd by comparing pd patients with and without anxiety.<n> seventeen pd patients with anxiety ( pda+ ) and thirty - three pd patients without anxiety ( pda ) were included in this study.<n> self - reported anxiety was assessed using the hospital anxiety and depression scale ( hads ).<n> groups were matched for age, disease duration, hoehn and yahr ( h&y ) stages, disease severity, and depression.<n> performance on neuropsychological tests of attention ( digit span forward and backward, trail making test part b, logical memory test, and boston naming test ) and executive function ( verbal fluency and attentional set - shifting ) were compared between groups.<n> pd patients with anxiety demonstrated worse performance 

{'rouge1': AggregateScore(low=Score(precision=0.3181818181818182, recall=0.5139664804469274, fmeasure=0.4226415094339623), mid=Score(precision=0.3802447552447552, recall=0.5715899817964973, fmeasure=0.4490468529081956), high=Score(precision=0.4423076923076923, recall=0.6292134831460674, fmeasure=0.47545219638242897)),
 'rouge2': AggregateScore(low=Score(precision=0.13714285714285715, recall=0.21348314606741572, fmeasure=0.18250950570342206), mid=Score(precision=0.16035886818495515, recall=0.2431052093973442, fmeasure=0.18995605155300974), high=Score(precision=0.18357487922705315, recall=0.2727272727272727, fmeasure=0.1974025974025974)),
 'rougeL': AggregateScore(low=Score(precision=0.19886363636363635, recall=0.2905027932960894, fmeasure=0.26415094339622647), mid=Score(precision=0.22443181818181818, recall=0.34188061013119075, fmeasure=0.26644239676271275), high=Score(precision=0.25, recall=0.39325842696629215, fmeasure=0.268733850129199)),
 'rougeLsum': AggregateScore(low=Score(precis
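
Each entry in the output above is an `AggregateScore` with `low`, `mid` and `high` fields. As far as we understand, `rouge_score` obtains these by bootstrap resampling the per-sample scores and reading off percentiles. The sketch below illustrates that idea in plain Python; the function name, the per-sample values and the exact percentile handling are assumptions for illustration, not the library's implementation.

```python
import random

def bootstrap_interval(scores, n_resamples=1000, confidence=0.95, seed=0):
    """Return (low, mid, high) percentiles of the resampled mean score."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # draw len(scores) samples with replacement and average them
        resample = [rng.choice(scores) for _ in scores]
        means.append(sum(resample) / len(resample))
    means.sort()
    low_idx = int((1 - confidence) / 2 * (n_resamples - 1))
    high_idx = int((1 + confidence) / 2 * (n_resamples - 1))
    mid_idx = (n_resamples - 1) // 2
    return means[low_idx], means[mid_idx], means[high_idx]

# Hypothetical per-sample Rouge-2 F-measures.
per_sample_f = [0.18, 0.22, 0.19, 0.25, 0.21, 0.17, 0.23, 0.20]
low, mid, high = bootstrap_interval(per_sample_f)
print(f"low={low:.3f}, mid={mid:.3f}, high={high:.3f}")
```

With only 2 samples, as in the check above, such an interval is very wide and should not be read as a meaningful estimate.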

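Since this dataset contains sequences longer than 4096 tokens, which `BigBirdPegasus` can't handle without truncation, we will first keep only the samples whose estimated length is at most 4096 tokens. The length is estimated cheaply by assuming roughly 4 characters per token. We will use 🤗Datasets' `filter()` method for that.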

In [13]:
filtered_data = test_dataset.filter(lambda x: len(x['article']) // 4 <= 4096)
filtered_data

Dataset({
    features: ['article', 'abstract', 'section_names'],
    num_rows: 3723
})
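
The filter above never calls the tokenizer; it relies on a characters-per-token heuristic. Here is a small standalone sketch of that check on toy strings. The helper `probably_fits` and the toy articles are hypothetical, and the factor of 4 is only a rough estimate, so some kept samples may still be truncated by the tokenizer.

```python
MAX_TOKENS = 4096
CHARS_PER_TOKEN = 4  # rough heuristic, not an exact tokenizer count

def probably_fits(article: str) -> bool:
    """Estimate the token count from the character count, as in the filter above."""
    return len(article) // CHARS_PER_TOKEN <= MAX_TOKENS

toy_articles = {
    "short": "word " * 1000,             # 5000 chars -> ~1250 tokens
    "borderline": "x" * (4096 * 4 + 3),  # 16387 chars -> ~4096 tokens
    "long": "word " * 5000,              # 25000 chars -> ~6250 tokens
}

kept = [name for name, text in toy_articles.items() if probably_fits(text)]
print(kept)
```

An exact check would tokenize every article and compare the number of input ids, at the cost of a much slower filtering step.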

Because of the very large input size of ~4K tokens, it would take too long to evaluate the whole filtered test dataset in this notebook. For the sake of this notebook, we'll only evaluate on the first 600 examples. Therefore, we cut the 3000+ filtered samples down to just 600 using 🤗Datasets' convenient `.select()` functionality.

In [None]:
filtered_data = filtered_data.select(range(600))

Alright, let's map each sample to the predicted *abstract*. This will take ~ 4 hours if you're given a fast GPU.

In [15]:
result_filtered = filtered_data.map(generate_answer)

microrna ( mirna ) is a class of small non - coding rna that regulates a broad range of cellular processes including cell cycle progression, differentiation, apoptosis, cell proliferation, metastasis, and tumorigenesis.<n> emerging evidence demonstrates that mirnas are involved in breast cancer initiation, progression and metastasis.<n> dysregulated expression of mirnas has implicated components of the non - coding genome as either oncogenes or tumor suppressors of breast cancer.<n> mirnas are involved in the initiation, progression, metastasis and drug resistance of breast cancer.<n> the understanding of how mirnas are involved in breast cancer through regulating the cell cycle remains rudimentary. in this review,<n> we summarize the recent literature and research progress on the mechanism by which mirnas regulate the breast cancer cell cycle and cellular proliferation.<n> the identification of the expression signature of these non - coding small rnas in breast cancer subtypes, and an

The only thing left to do is to evaluate our predictions with the *rouge* metric, computed over all predicted *abstracts*.

In [16]:
rouge.compute(predictions=result_filtered["predicted_abstract"], references=result_filtered["abstract"])

{'rouge1': AggregateScore(low=Score(precision=0.4230771547372464, recall=0.4636134697176454, fmeasure=0.4224151843765313), mid=Score(precision=0.4363360988500864, recall=0.4753080938656439, fmeasure=0.4324591171255295), high=Score(precision=0.45008287285222687, recall=0.4856431915348982, fmeasure=0.4418126715376584)),
 'rouge2': AggregateScore(low=Score(precision=0.1975183080788743, recall=0.21187265032596705, fmeasure=0.19477704148531522), mid=Score(precision=0.21056287954084368, recall=0.22314901454191832, fmeasure=0.20582917314342303), high=Score(precision=0.22387892593850328, recall=0.23380239814084747, fmeasure=0.21691093801620678)),
 'rougeL': AggregateScore(low=Score(precision=0.2693106631249034, recall=0.29258279619949207, fmeasure=0.267363090167935), mid=Score(precision=0.28207259597623335, recall=0.3031483988431694, fmeasure=0.27704097579601833), high=Score(precision=0.29587314489122435, recall=0.3135533735790181, fmeasure=0.2876558352619107)),
 'rougeLsum': AggregateScore(lo

For our 600 samples, we get a *Rouge-2* score (mid F-measure) of **0.206** 🔥🔥🔥.
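
To make the metric more concrete, here is a minimal sketch of what Rouge-2 measures: the overlap of bigrams between a prediction and its reference, reported as precision, recall and F-measure. This is a simplification under stated assumptions: it uses plain whitespace tokenization, whereas the `rouge_score` package has its own tokenizer and optional stemming. The function names and the example sentences are hypothetical.

```python
from collections import Counter

def bigrams(text: str) -> Counter:
    """Count the bigrams of a lower-cased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2(prediction: str, reference: str):
    """Return (precision, recall, fmeasure) of the bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

p, r, f = rouge2("the cat sat on the mat", "the cat lay on the mat")
print(f"precision={p:.2f}, recall={r:.2f}, fmeasure={f:.2f}")
```

In this toy pair, 3 of the 5 bigrams on each side are shared, so precision, recall and F-measure all come out at 0.6.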

In [17]:
## Uncomment the line below in case you want to save the predictions to disk.
# result_filtered.save_to_disk(f"result-filtered-{DATASET_NAME}")

In case you want to evaluate [`google/bigbird-pegasus-large-arxiv`](https://huggingface.co/google/bigbird-pegasus-large-arxiv) on the `arxiv` dataset from [`scientific_papers`](https://huggingface.co/datasets/scientific_papers), you can just change `DATASET_NAME` to `arxiv` in the cell above.

**Note:** You may need to link your Google Drive to this notebook (and change `CACHE_DIR` accordingly) if you are going to run on the `arxiv` dataset.