# Part 2 - Zero-shot summaries

In this part we will use Hugging Face's high-level Pipeline API to create summaries with a pre-trained model. There are three main steps involved when you pass some text to a pipeline:

1) The text is preprocessed into a format the model can understand.

2) The preprocessed inputs are passed to the model.

3) The predictions of the model are post-processed, so you can make sense of them.
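The three steps above can also be carried out by hand, which may help clarify what the pipeline does for us. The following is a minimal sketch, not the pipeline's actual implementation. `summarize_manually` is a hypothetical helper, and it assumes a tokenizer and a sequence-to-sequence model loaded with the `transformers` Auto classes, as shown in the trailing comment.

```python
def summarize_manually(text, tokenizer, model, max_length=20):
    """Run the three pipeline steps explicitly (illustrative helper, not a library function)."""
    # 1) Preprocess: turn the raw text into token IDs the model can read.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # 2) Model: generate the token IDs of the summary.
    output_ids = model.generate(**inputs, max_length=max_length)
    # 3) Post-process: decode the generated IDs back into a readable string.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Possible usage (assumption: downloads the checkpoint on first use):
# from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# name = "sshleifer/distilbart-cnn-12-6"
# tokenizer = AutoTokenizer.from_pretrained(name)
# model = AutoModelForSeq2SeqLM.from_pretrained(name)
# print(summarize_manually("Some long text ...", tokenizer, model))
```

Passing the tokenizer and model in as arguments keeps the helper independent of any particular checkpoint.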

In [1]:
from transformers import pipeline
summarizer = pipeline("summarization")

Downloading: 100%|██████████| 1.80k/1.80k [00:00<00:00, 510kB/s]
Downloading: 100%|██████████| 1.22G/1.22G [01:00<00:00, 20.2MB/s]
Downloading: 100%|██████████| 899k/899k [00:00<00:00, 9.21MB/s]
Downloading: 100%|██████████| 456k/456k [00:00<00:00, 7.41MB/s]
Downloading: 100%|██████████| 26.0/26.0 [00:00<00:00, 13.7kB/s]


The following line shows which model the pipeline uses by default. This information can also be found in the source code for pipelines: https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/__init__.py

In [2]:
summarizer.model.config._name_or_path

'sshleifer/distilbart-cnn-12-6'

The default model for the summarisation task is https://huggingface.co/sshleifer/distilbart-cnn-12-6, whose model card lists two datasets: https://huggingface.co/datasets/cnn_dailymail and https://huggingface.co/datasets/xsum. We will keep using this model, but a different one can be selected by specifying it as shown below. All models trained for summarisation can be viewed here: https://huggingface.co/models?pipeline_tag=summarization&sort=downloads

In [None]:
# summarizer = pipeline("summarization", model='facebook/bart-large-cnn')

In [3]:
import pandas as pd
df_test = pd.read_csv('data/test.csv')
ref_summaries = list(df_test['summary'])
texts = list(df_test['text'])

We first test the pipeline on a single abstract from the test dataset.

In [4]:
texts[0]

'  The coincidence of the set of all nilpotent elements of a ring with its prime radical has a module analogue which occurs when the zero submodule satisfies the radical formula. A ring $R$ is 2-primal if the set of all nilpotent elements of $R$ coincides with its prime radical. This fact motivates our study in this paper, namely, to compare 2-primal submodules and submodules that satisfy the radical formula. A demonstration of the importance of 2-primal modules in bridging the gap between modules over commutative rings and modules over noncommutative rings is done and new examples of rings and modules that satisfy the radical formula are also given. '

In [5]:
summarizer(texts[0], max_length=20)


[{'summary_text': ' A ring $R$ is 2-primal if the set of all nilpot'}]

Next, we run the pipeline over all 2,000 examples. Because this takes a while (around 50 minutes), we print a counter every 100 examples to keep track of the progress.

In [6]:
candidate_summaries = []

for i, text in enumerate(texts):
    if i % 100 == 0:
        print(i)
    candidate = summarizer(text, min_length=5, max_length=20)
    candidate_summaries.append(candidate[0]['summary_text'])

0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
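The loop above calls the pipeline once per text. As an alternative, the texts can be fed in chunks, since the summarization pipeline also accepts a list of strings. The sketch below assumes that such a call returns one dict with a `summary_text` key per input. `chunked` and `summarize_in_batches` are hypothetical helper names, and whether chunking actually reduces the runtime depends on the library version and the hardware.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def summarize_in_batches(summarize, texts, batch_size=8, **generate_kwargs):
    """Summarize `texts` chunk by chunk and return the summaries in input order.

    `summarize` is any callable that maps a list of strings to a list of
    dicts with a 'summary_text' key (assumed behaviour of the pipeline).
    """
    summaries = []
    for batch in chunked(texts, batch_size):
        print(len(summaries))  # progress counter: texts summarized so far
        outputs = summarize(batch, **generate_kwargs)
        summaries.extend(out["summary_text"] for out in outputs)
    return summaries


# Possible usage with the objects defined earlier in this notebook:
# candidate_summaries = summarize_in_batches(
#     summarizer, texts, batch_size=8, min_length=5, max_length=20)
```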


We save the candidate summaries to disk in case we want to investigate them further.

In [8]:
# make sure the "summaries" directory is present
import os
path = "./summaries/"
os.makedirs(path, exist_ok=True)

# write one summary per line; the with-block closes the file even if an error occurs
with open(os.path.join(path, "zero-shot-summaries.txt"), "w", encoding="utf-8") as file:
    for s in candidate_summaries:
        file.write(s + "\n")
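Because the file written above holds one summary per line, it can be read back later without rerunning the pipeline. The sketch below shows one way to do this. `save_summaries` and `load_summaries` are hypothetical helpers. The save step also replaces line breaks inside a summary with spaces, on the assumption that a multi-line summary would otherwise break the pairing of one line per example.

```python
import os


def save_summaries(summaries, path):
    """Write one summary per line, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for s in summaries:
            # Collapse internal line breaks so each summary stays on one line.
            f.write(" ".join(s.splitlines()) + "\n")


def load_summaries(path):
    """Read the summaries back, dropping only the trailing newline of each line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Possible usage:
# save_summaries(candidate_summaries, "summaries/zero-shot-summaries.txt")
# candidate_summaries = load_summaries("summaries/zero-shot-summaries.txt")
```

Only the newline is stripped on loading, so the leading space that the model emits at the start of each summary is preserved.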

In [9]:
candidate_summaries[:5]

[' A ring $R$ is 2-primal if the set of all nilpot',
 ' The $k$ nearest neighbor ($k$NN) query is a fundamental problem in',
 ' For a real number $x$ and set of natural numbers $A$ define $',
 ' A wide class of smooth r-fold quadric bundles over projective n-space',
 ' Plasmonic nanoparticles influence the absorption and emission processes of nearby emitters .']

Finally, we calculate the ROUGE scores of the candidate summaries against the reference summaries.

In [10]:
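# note: newer releases of datasets deprecate load_metric in favour of the separate evaluate library
from datasets import load_metric
metric = load_metric("rouge")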

In [11]:
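def calc_rouge_scores(candidates, references):
    # use_stemmer=True reduces words to their stems before n-grams are matched
    result = metric.compute(predictions=candidates, references=references, use_stemmer=True)
    # keep the mid (central bootstrap) F-measure of each ROUGE variant, as a percentage
    result = {key: round(value.mid.fmeasure * 100, 1) for key, value in result.items()}
    return result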

In [12]:
calc_rouge_scores(candidate_summaries, ref_summaries)

{'rouge1': 29.4, 'rouge2': 13.7, 'rougeL': 25.7, 'rougeLsum': 25.7}
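To make these numbers easier to interpret, the sketch below computes a simplified ROUGE-N F-measure for a single candidate/reference pair. It is not the implementation behind the metric used above: `rouge_n_f1` is a hypothetical helper that tokenizes on whitespace and applies no stemming, so its values will generally differ from the library's. It also does not cover ROUGE-L, which is based on the longest common subsequence.

```python
from collections import Counter


def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F-measure based on clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())  # share of candidate n-grams found in the reference
    recall = overlap / sum(ref.values())      # share of reference n-grams found in the candidate
    return 2 * precision * recall / (precision + recall)


print(round(rouge_n_f1("the cat sat", "the cat sat on the mat"), 3))  # 0.667
```

In this example every candidate unigram appears in the reference (precision 1.0) but only half of the reference is covered (recall 0.5). This is the pattern to expect from summaries that are cut off after a few tokens.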