If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets as well as other dependencies. Run the following cell to install them.

In [29]:
!pip install datasets transformers==4.28.0 rouge-score nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First, get an authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and paste the token when prompted:

In [30]:
from huggingface_hub import notebook_login

notebook_login()


Then you need to install Git-LFS by running the following cell:

In [31]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


Make sure your version of Transformers is at least 4.11.0, since the model-sharing functionality used here was introduced in that version:

In [32]:
import transformers

print(transformers.__version__)

4.28.0
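If you would rather have the notebook fail early than read the version by eye, the comparison can be sketched as below. This is a minimal sketch that is not part of the original notebook: `version_tuple` and `is_at_least` are hypothetical helpers, and the parser only handles plain numeric releases such as `4.28.0` (it stops at the first non-numeric component, so `4.29.0.dev0` is read as `4.29.0`).

```python
def version_tuple(version: str) -> tuple:
    """Turn a release string such as '4.28.0' into a tuple of ints, e.g. (4, 28, 0)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0'
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Return True when `installed` is the same as or newer than `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


# The version printed above, compared against the minimum the notebook asks for.
print(is_at_least("4.28.0", "4.11.0"))  # True
# A plain string comparison would get this wrong, because "4.9.0" > "4.11.0" as text.
print(is_at_least("4.9.0", "4.11.0"))   # False
```

In the notebook itself, the first argument would be `transformers.__version__`.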


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

In [33]:
from transformers.utils import send_example_telemetry

send_example_telemetry("summarization_notebook", framework="pytorch")

# Fine-tuning a model on a summarization task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a summarization task. We will use the [XSum dataset](https://arxiv.org/pdf/1808.08745.pdf) (for extreme summarization), which contains BBC articles accompanied by single-sentence summaries.

![Widget inference on a summarization task](images/summarization.png)

We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the `Trainer` API.

In [6]:
model_checkpoint = "t5-small"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`t5-small`](https://huggingface.co/t5-small) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [34]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")
# ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package
# used for evaluating automatic summarization and machine translation software in natural language processing.
# The metrics compare an automatically produced summary or translation against one or more
# human-produced reference summaries or translations.
# ROUGE-N: overlap of n-grams between the system and reference summaries.
#   ROUGE-1 refers to the overlap of unigrams (single words) between the system and reference summaries.
#   ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.
# ROUGE-L: Longest Common Subsequence (LCS) based statistics. The LCS naturally takes sentence-level
#   structure similarity into account and automatically identifies the longest in-sequence co-occurring n-grams.
# ROUGE-W: weighted LCS-based statistics that favor consecutive LCSes.
# ROUGE-S: skip-bigram based co-occurrence statistics. A skip-bigram is any pair of words in their sentence order.
# ROUGE-SU: skip-bigram plus unigram-based co-occurrence statistics.



  0%|          | 0/3 [00:00<?, ?it/s]
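To make the n-gram overlap described in the comments above concrete, here is a minimal sketch of ROUGE-N in plain Python. It is an illustration, not the implementation behind `load_metric("rouge")`: the `rouge_n` helper is hypothetical, it tokenizes by lowercasing and splitting on whitespace, and it skips the stemming and aggregation that the real metric can apply.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f-measure) of the n-gram overlap between two strings."""
    pred_counts = ngrams(prediction.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram is matched at most as often as it occurs in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    pred_total = sum(pred_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
print(rouge_n(prediction, reference, n=1))  # 5 of 6 unigrams match
print(rouge_n(prediction, reference, n=2))  # 3 of 5 bigrams match
```

Precision divides the overlap by the number of n-grams in the prediction, recall divides it by the number in the reference, and the f-measure is their harmonic mean.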

The `raw_datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets:

In [35]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 204045
    })
    validation: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11332
    })
    test: Dataset({
        features: ['document', 'summary', 'id'],
        num_rows: 11334
    })
})

To access an actual element, you need to select a split first, then give an index:

In [9]:
raw_datasets["train"][0]

 'summary': 'Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.',
 'id': '35232142'}
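Integer and slice indexing return differently shaped results, which matters later when a function receives several examples at once. The sketch below mimics that behaviour with a toy stand-in: `ToySplit` is a hypothetical class written for this illustration (it is not the 🤗 Datasets implementation) and the rows are made up.

```python
class ToySplit:
    """A toy column-oriented table that imitates how a dataset split is indexed."""

    def __init__(self, columns):
        self.columns = columns  # e.g. {"document": [...], "summary": [...], "id": [...]}

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        # An integer gives one value per column, a slice gives a list per column.
        return {name: values[key] for name, values in self.columns.items()}


# Made-up rows using the same three column names as XSum.
splits = {
    "train": ToySplit({
        "document": ["first article", "second article", "third article"],
        "summary": ["first summary", "second summary", "third summary"],
        "id": ["1", "2", "3"],
    })
}

print(splits["train"][0])   # select the split, then one row: a dict of single values
print(splits["train"][:2])  # a slice: a dict of lists, one entry per selected row
```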

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [10]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
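The retry loop in `show_random_elements` draws indices until it has enough distinct ones. As a side note, the standard library can do the same in one call. The sketch below is an alternative, not what the notebook runs, and `pick_unique_indices` is a hypothetical helper name.

```python
import random


def pick_unique_indices(dataset_length, num_examples=5):
    """Draw `num_examples` distinct row indices without a retry loop."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample draws without replacement, so no index can appear twice.
    return random.sample(range(dataset_length), num_examples)


random.seed(0)  # only to make the illustration repeatable
picks = pick_unique_indices(204045)  # 204045 is the size of the training split
print(picks)
```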

In [11]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,document,summary,id
0,"Three routes would be opened and a fourth for armed rebels, Russian Defence Minister Sergei Shoigu said.\nSyria's president has also offered an amnesty for rebels laying down arms and surrendering within three months.\nGovernment forces have encircled Aleppo, cutting off rebel-held areas and severing all supply routes.\nThe offensive has been aided by Russian air power.\nRebel forces fighting the government of President Bashar al-Assad have held eastern parts of the city for the past four years.\nThe story of the Syrian conflict\nAleppo: City facing its last gasp?\nThe UN has warned of a critical situation for about 300,000 people still there.\nUN humanitarian chief Stephen O'Brien said on Monday that ""food supplies are expected to run out in mid-August and many medical facilities continue to be attacked"".\nMr Shoigu described the corridors as a ""large-scale humanitarian operation"".\nHe said the move was ""first and foremost to ensure the safety of Aleppo residents"".\nThe three corridors for civilians and unarmed fighters would have medical posts and food handouts, Mr Shoigu said, adding that he would welcome the co-operation of international aid organisations.\nThe fourth, in the direction of Castello Road, would be for armed militants, although Mr Shoigu complained that the US had not supplied information about how the rebel Free Syrian Army units it supports had separated from jihadist al-Nusra fighters.\nReports on Thursday said that government forces had taken control of more areas of the city, in the Bani Zeid neighbourhood.\nMr Assad's amnesty offer came in a decree issued on Thursday, the state-run Sana news agency reported.\n""Everyone carrying arms... and sought by justice... 
is excluded from full punishment if they hand themselves in and lay down their weapons,"" it quoted the decree as saying.\nThere have been several presidential amnesty offers in recent years.\nThroughout the five years of Syria's war, aid agencies have pleaded for humanitarian access - usually in vain. Only under intense international pressure has the regime allowed a limited number of aid convoys to reach areas under siege. But now, with the rebels surrounded in Aleppo, the Syrian government may feel it can afford to appear magnanimous.\nThe announcement has taken many by surprise but may be modelled on a ceasefire agreement last year in Homs.\nThat deal allowed starving rebels to leave, ceding control of the city to the government. Winning back Aleppo - Syria's biggest city - would be a huge prize for the government. But so far there are no signs of fighters leaving the city. Rebels and civilians alike have reacted to the initiative with intense distrust.\nLast week US Secretary of State John Kerry held marathon talks in Moscow with Russian President Vladimir Putin and his Foreign Minister Sergei Lavrov.\nThey agreed ""concrete steps"" on tackling jihadists in Syria and on trying to reach an effective ceasefire, although proposals have not been made public.\nMore than 280,000 people have been killed and millions displaced since the Syrian conflict began in March 2011.","Corridors are to open to allow unarmed rebels and civilians to leave besieged areas of the Syrian city of Aleppo, Russia - Syria's key ally - has said.",36912718
1,"The team from the University of Aberdeen believe the ancient remains could be one of many along the coast south of Stonehaven.\nIt is the first time an official excavation has been carried out there.\nPictish symbol stones were said to be found on the Dunnicaer sea stack by locals in the 19th Century.\nUntil this latest discovery, it was unclear whether the site held other historical remains.\nThe Aberdeen team believe they have found the remains of a house, a fireplace and ramparts.\nLead archaeologist Dr Gordon Noble said it could be the precursor to Dunnotter Castle, the remains of which lie a quarter of a mile south of the site.\nHe explained: ""We've opened a few trenches so far. This is the site where, in the 19th Century, they found six Pictish stones when a group of youths from Stonehaven came up the sea stack.\n""Here we've got clear evidence of people living on the sea stack at least for part of the year. Certainly people are living here for long enough to create this really nice well-constructed hearth and these lovely floor layers.""\nThe remote location meant the archaeologists needed the assistance of a specialist just to reach the site.\nTheir climbing guide was Duncan Paterson.\nHe said: ""Considering the team themselves had never been on a rope, never been on a harness let alone put a helmet on or climbed up slopes like this - it was a big challenge.""\n""We had tide times to consider. We've got a bit of grassy slope, this conglomerate and mud and turf to deal with.\n""So a big challenge.""\nThe team will continue to dig until the weekend.","Archaeologists have uncovered a ""very significant"" Pictish fort after scaling a remote sea stack off the coast of Aberdeenshire.",32325310
2,"Honnir i Mohammed Ali Ege, 41 oed, orchymyn dau ddyn i ladd dyn oherwydd ffrae am arian, ond aeth y ddau i'r cyfeiriad anghywir a llofruddio llanc 17 oed.\nCafodd Ben Hope a Jason Richards eu carcharu am oes am drywanu Aamir Siddiqi i farwolaeth wrth iddo agor y drws i'r ddau yng Nghaerdydd yn 2010.\nWedi'r llofruddiaeth fe wnaeth Ege ddianc i India ond cafodd ei arestio flwyddyn yn ddiweddarach.\nErs hynny mae e wedi bod yn ddalfa yn disgwyl am wrandawiad i'w estraddodi, ond fe wnaeth ddianc o orsaf heddlu wrth gael ei gludo o un gwrandawiad.\nDywedodd Comander Mahendra Kumar Rathod: ""Roedd y dyn yn cael ei gludo yn ôl i Hyderabad ar drên a gofynnodd i gael mynd i'r tŷ bach, a llwyddodd i ddianc oddi yno drwy dynnu'r bariau o ffenest yno.""\nDoedd Heddlu'r De ddim am wneud sylw am ddihangfa Ege o ofal yr heddlu yn India, ond dywedodd llefarydd:\n""Mae ein swyddogion yn parhau i siarad gyda'r awdurdodau am yr estraddodi ac fe fyddwn yn parhau i siarad gyda theulu Aamir Siddiqi er mwyn rhoi'r wybodaeth berthnasol iddyn nhw.""\nFe gafodd Hope a Richards eu talu £1,000 yr un i ladd dyn arall yn ardal y Rhath, Caerdydd, ond aeth y ddau i'r stryd anghywir a thrywanu Aamir Siddiqi mewn camgymeriad.\nFe gafodd rhieni Aamir hefyd eu trywanu wrth iddyn nhw geisio achub eu mab.\nCafodd y ddau eu carcharu am oes gyda lleiafswm o 40 mlynedd dan glo am lofruddio a cheisio llofruddio.",Mae dyn yr honnir iddo orchymyn lladd dyn yng Nghaerdydd - arweiniodd at farwolaeth myfyriwr - wedi dianc o ofal yr heddlu yn India.,39619906
3,"Canaries left-back Martin Olsson saw red in the second minute for handling on the goalline but Tjaronn Chery sent the subsequent penalty wide.\nConor Washington gave the QPR the lead midway through the first half and Sebastian Polter added a second.\nSteven Naismith pulled a goal back to set up a tense finale but the visitors slipped to a fourth consecutive defeat.\nAlex Neil's Norwich, relegated from the Premier League last season, were second in the table a month ago but a five-match winless streak has seen them slip to sixth in the Championship.\nHolloway, who managed QPR between 2001 and 2006, returned to Loftus Road following the dismissal of Jimmy Floyd Hasselbaink and saw his team make a bright start.\nOlsson's early dismissal handed the hosts the upper hand and, after Chery screwed his spot-kick wide, Rangers took the lead through a close-range finish from Northern Ireland international Washington.\nNorwich winger Jacob Murphy hit the crossbar in the second half and then set up Naismith, whose header gave the visitors hope of a comeback.\nHowever, City were unable to find a late equaliser, allowing Holloway to end Rangers' run of three games without a win on his return to west London.\nQPR manager Ian Holloway: ""We got a lot of things wrong and I could see a lack of belief after Norwich scored. We stopped passing and using the extra man.\n""But I was looking for character and I know we have it. Long may that continue.\n""To win games at this level you always need a bit of luck. All teams will have minutes (of pressure) no matter how many players they have on the pitch.""\nNorwich manager Alex Neil: ""The fans will be frustrated and annoyed and I understand that. My job is to win games. All I can say is that I am working extremely hard to turn things around and so are the players.\n""I thought they worked extremely hard and kept going until the last minute. 
We did better in the second half.\n""You're looking to stick to the task and not concede sloppy goals, and then as the game goes on maybe quieten the crowd. But once we conceded that goal it's difficult.\n""The two goals we conceded in the first half made it too easy for QPR.""\nMatch ends, Queens Park Rangers 2, Norwich City 1.\nSecond Half ends, Queens Park Rangers 2, Norwich City 1.\nAttempt missed. Jacob Murphy (Norwich City) right footed shot from outside the box misses to the left. Assisted by Alexander Tettey.\nCorner, Queens Park Rangers. Conceded by Sebastien Bassong.\nSubstitution, Norwich City. Josh Murphy replaces Steven Naismith.\nOlamide Shodipo (Queens Park Rangers) wins a free kick on the right wing.\nFoul by Cameron Jerome (Norwich City).\nOffside, Queens Park Rangers. Olamide Shodipo tries a through ball, but James Perch is caught offside.\nSubstitution, Queens Park Rangers. Idrissa Sylla replaces Sebastian Polter.\nJoel Lynch (Queens Park Rangers) is shown the yellow card for a bad foul.\nFoul by Joel Lynch (Queens Park Rangers).\nJacob Murphy (Norwich City) wins a free kick on the right wing.\nAttempt blocked. Steven Naismith (Norwich City) right footed shot from outside the box is blocked. Assisted by Graham Dorrans.\nAttempt blocked. Conor Washington (Queens Park Rangers) left footed shot from the left side of the box is blocked. Assisted by Tjaronn Chery.\nGoal! Queens Park Rangers 2, Norwich City 1. Steven Naismith (Norwich City) header from the centre of the box to the bottom right corner. Assisted by Jacob Murphy following a set piece situation.\nFoul by James Perch (Queens Park Rangers).\nSteven Naismith (Norwich City) wins a free kick on the left wing.\nFoul by Joel Lynch (Queens Park Rangers).\nRyan Bennett (Norwich City) wins a free kick in the defensive half.\nSubstitution, Norwich City. Cameron Jerome replaces Nélson Oliveira.\nAttempt blocked. 
Olamide Shodipo (Queens Park Rangers) right footed shot from outside the box is blocked.\nCorner, Queens Park Rangers. Conceded by Ryan Bennett.\nJacob Murphy (Norwich City) hits the bar with a right footed shot from outside the box. Assisted by Sebastien Bassong.\nAttempt blocked. Graham Dorrans (Norwich City) right footed shot from outside the box is blocked.\nCorner, Norwich City. Conceded by James Perch.\nAttempt saved. Grant Hall (Queens Park Rangers) header from the right side of the six yard box is saved in the centre of the goal. Assisted by Tjaronn Chery with a cross.\nCorner, Queens Park Rangers. Conceded by Jacob Murphy.\nAttempt missed. Robbie Brady (Norwich City) left footed shot from outside the box is too high from a direct free kick.\nFoul by Sandro (Queens Park Rangers).\nRussell Martin (Norwich City) wins a free kick in the attacking half.\nCorner, Norwich City. Conceded by Joel Lynch.\nSubstitution, Queens Park Rangers. Olamide Shodipo replaces Massimo Luongo.\nJordan Cousins (Queens Park Rangers) wins a free kick in the defensive half.\nFoul by Nélson Oliveira (Norwich City).\nFoul by Massimo Luongo (Queens Park Rangers).\nGraham Dorrans (Norwich City) wins a free kick in the attacking half.\nCorner, Queens Park Rangers. Conceded by John Ruddy.\nAttempt saved. Conor Washington (Queens Park Rangers) right footed shot from the left side of the box is saved in the centre of the goal.\nOffside, Norwich City. John Ruddy tries a through ball, but Nélson Oliveira is caught offside.\nOffside, Queens Park Rangers. James Perch tries a through ball, but Sebastian Polter is caught offside.",Ian Holloway began his second spell in charge of Queens Park Rangers with a win as his side beat 10-man Norwich.,37958492
4,"The Commons Work and Pensions Committee says the regulator should have the power to impose ""punitive fines"" of as much as £1bn.\nFollowing the collapse of the BHS pension scheme, the MPs say the regulator itself needs to be reformed.\nSpecifically, it should intervene much earlier if a scheme is in trouble.\n""It is difficult to imagine the Pensions Regulator would still be having to negotiate with Sir Philip Green [former owner of BHS] if he had been facing a bill of £1bn, rather than £350m,"" said Frank Field, who chairs the committee.\n""He would have sorted the pension scheme long ago. The measures we set out in this report are intended to reduce the chance of another scheme going down the BHS route,"" the MPS added.\n""We hope and expect that we will never again see a company like BHS be able to come up with a 23-year recovery plan for its pension fund, and certainly not that it would take the regulator two years to really begin to do anything about it,"" he added.\nMr Field has been running a vigorous campaign against Sir Philip over the wealthy retailer's apparent reluctance to put sufficient of his own money into a rescue of the BHS pension scheme.\nThe fund was left with a huge deficit of £571m when BHS went bust in April this year with the loss of 11,000 jobs, just over a year after Sir Philip had sold the department store chain to a former bankrupt, Dominic Chappell, for just £1.\nThe pensions regulator has demanded that Sir Philip pay £350m into the BHS pension fund to ease the worries of its 20,000 members.\nAs well as criticising Sir Philip and Mr Chappell, MPs have also accused the regulator of being slow to act when faced with evidence of a pension scheme with a large deficit and an employer reluctant to shoulder its obligations.\nDrawing on its inquiries into the saga, the Work and Pensions Committee has come up a series of proposals to strengthen the hand of the regulator and scheme trustees if a pension scheme is heading for trouble:\nMr 
Field said: ""It is inconceivable that Sir Philip Green's deal to dispose of BHS and its giant pension deficit for £1 to a dismally unqualified man, with no plan for the pension schemes and no means of financing one, would have evaded or passed any mandatory clearance scheme.\n""To prevent another BHS we need to have the means to nip inevitable disasters like this one in the bud.""\nA Department for Work and Pensions spokesperson said the government would publish a Green Paper on pension funding next year and examine the powers of the pensions regulator.\nThe regulator's chief executive, Lesley Titcomb, welcomed the MPs' report.\n""We continue to discuss options with [ministers] for the legislative and regulatory framework for workplace pensions, and how this might be improved.""",MPs have called for the Pensions Regulator to be given much stronger powers to thwart rogue employers who fail to support their pension schemes.,38382003


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [12]:
metric

Metric(name: "rouge", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}, usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLSum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/datasets/issues/617
    use_stemmer: Bool indicating whether Porter stemmer should be used to strip word suffixes.
    use_aggregator: Return aggregates if this is set to True
Retu

You can call its `compute` method with your predictions and labels, which need to be lists of decoded strings:

In [13]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeL': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)),
 'rougeLsum': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}
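The `rougeL` entry is based on the longest common subsequence (LCS) rather than on fixed-size n-grams. As a rough sketch of the idea (a simplified illustration with a hypothetical `rouge_l` helper, using whitespace tokens and no stemming, not the code the metric runs):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(prediction, reference):
    """Return (precision, recall, f-measure) based on the LCS of the two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Identical strings, as in the cell above, score 1.0 everywhere.
print(rouge_l("hello there", "hello there"))        # (1.0, 1.0, 1.0)
# Swapping the word order keeps every unigram but shortens the common subsequence.
print(rouge_l("kenobi general", "general kenobi"))  # (0.5, 0.5, 0.5)
```

The second call shows why ROUGE-L is sensitive to word order while ROUGE-1 is not.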

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [14]:
!pip install tokenizers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [15]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [16]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here; just know they are required by the model we will instantiate later. You can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.
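For intuition about the two keys shown above, the following toy sketch maps words to IDs, appends an end-of-sequence ID and builds a matching attention mask. Everything in it is an assumption made for illustration: the `toy_vocab` IDs are invented, and real T5 tokenization uses SentencePiece subwords rather than whitespace splitting. The only detail taken from the output above is that each sequence ends with ID `1`, the end-of-sequence token of the T5 checkpoints.

```python
# Invented vocabulary for illustration only; real checkpoints ship their own.
toy_vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "hello": 3, "this": 4, "one": 5, "sentence": 6}


def toy_tokenize(text):
    """Map whitespace-separated words to IDs and append the end-of-sequence ID."""
    words = text.lower().split()
    input_ids = [toy_vocab.get(word, toy_vocab["<unk>"]) for word in words]
    input_ids.append(toy_vocab["</s>"])
    # Every real token gets a 1; padding, added later, would get a 0.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


print(toy_tokenize("hello this one sentence"))
# {'input_ids': [3, 4, 5, 6, 1], 'attention_mask': [1, 1, 1, 1, 1]}
```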

Instead of one sentence, we can pass along a list of sentences:

In [17]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

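To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets. (Newer releases of 🤗 Transformers also accept the targets through the tokenizer's `text_target` argument and deprecate this context manager; with the version installed above, the context manager still works.)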

In [18]:
with tokenizer.as_target_tokenizer():
    print(tokenizer(["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}




If you are using one of the five T5 checkpoints, the inputs have to be prefixed with "summarize:" (the model can also translate, and it needs the prefix to know which task it has to perform).

In [19]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. The padding will be dealt with later on (in a data collator), so we pad examples to the longest length in the batch and not in the whole dataset.

In [20]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [21]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[21603, 10, 37, 423, 583, 13, 1783, 16, 20126, 16496, 6, 80, 13, 8, 844, 6025, 4161, 6, 19, 341, 271, 14841, 5, 7057, 161, 19, 4912, 16, 1626, 5981, 11, 186, 7540, 16, 1276, 15, 2296, 7, 5718, 2367, 14621, 4161, 57, 4125, 387, 5, 15059, 7, 30, 8, 4653, 4939, 711, 747, 522, 17879, 788, 12, 1783, 44, 8, 15763, 6029, 1813, 9, 7472, 5, 1404, 1623, 11, 5699, 277, 130, 4161, 57, 18368, 16, 20126, 16496, 227, 8, 2473, 5895, 15, 147, 89, 22411, 139, 8, 1511, 5, 1485, 3271, 3, 21926, 9, 472, 19623, 5251, 8, 616, 12, 15614, 8, 1783, 5, 37, 13818, 10564, 15, 26, 3, 9, 3, 19513, 1481, 6, 18368, 186, 1328, 2605, 30, 7488, 1887, 3, 18, 8, 711, 2309, 9517, 89, 355, 5, 3966, 1954, 9233, 15, 6, 113, 293, 7, 8, 16548, 13363, 106, 14022, 84, 47, 14621, 4161, 6, 243, 255, 228, 59, 7828, 8, 1249, 18, 545, 11298, 1773, 728, 8, 8347, 1560, 5, 611, 6, 255, 243, 72, 1709, 1528, 161, 228, 43, 118, 4006, 91, 12, 766, 8, 3, 19513, 1481, 410, 59, 5124, 5, 96, 196, 17, 19, 1256, 68, 27, 103, 317, 132
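To see what `truncation=True` combined with `max_length` amounts to in `preprocess_function`, here is a toy sketch on lists of IDs. It assumes, as is the case for the T5 tokenizers, that the end-of-sequence ID (`1`) counts towards `max_length` and is kept at the end of a truncated sequence. `truncate_with_eos` is a hypothetical helper, not a Transformers function, and the long sequence is made up.

```python
EOS_ID = 1  # end-of-sequence ID of the T5 checkpoints, as seen in the tokenizer outputs


def truncate_with_eos(token_ids, max_length):
    """Cut a sequence so that, with the end-of-sequence ID appended, it fits in max_length."""
    kept = token_ids[: max_length - 1]  # leave one slot for the end-of-sequence ID
    return kept + [EOS_ID]


short_ids = [21603, 10, 37, 423]    # first IDs of the output above
long_ids = list(range(100, 2100))   # 2,000 made-up IDs, longer than max_input_length

print(len(truncate_with_eos(short_ids, max_length=1024)))  # 5: nothing is cut
print(len(truncate_with_eos(long_ids, max_length=1024)))   # 1024: the tail is dropped
```

No padding is added at this stage, so short examples keep their own length.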

To apply `preprocess_function` to every example in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This will apply the function to all the elements of all the splits in `raw_datasets`, so our training, validation and testing data will be preprocessed in a single command.

In [22]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Map:   0%|          | 0/204045 [00:00<?, ? examples/s]

Map:   0%|          | 0/11332 [00:00<?, ? examples/s]

Map:   0%|          | 0/11334 [00:00<?, ? examples/s]

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be reused). For instance, it will properly detect if you change the model checkpoint and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This lets us take full advantage of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently.
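To make the effect of `batched=True` concrete, the toy sketch below shows how a function written for a dict of lists can be applied chunk by chunk. `toy_batched_map` is a hypothetical helper for illustration, not the 🤗 Datasets implementation, and it leaves out caching and parallelism. Its default chunk size of 1,000 mirrors the documented default `batch_size` of `map`.

```python
def toy_batched_map(columns, function, batch_size=1000):
    """Apply `function` to successive slices of a dict of equal-length lists."""
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        result = function(batch)  # receives lists, returns lists
        # Keep the original columns and add (or overwrite with) the returned ones.
        for name, values in {**batch, **result}.items():
            output.setdefault(name, []).extend(values)
    return output


def add_lengths(examples):
    """A stand-in for preprocess_function: one new value per example in the batch."""
    return {"num_words": [len(doc.split()) for doc in examples["document"]]}


toy_columns = {"document": ["a b c", "d e", "f", "g h i j", "k"]}
mapped = toy_batched_map(toy_columns, add_lengths, batch_size=2)
print(mapped["num_words"])  # [3, 2, 1, 4, 1]
```

The function is called three times here (batches of 2, 2 and 1 rows) instead of five, which is what lets a batch-aware tokenizer process several texts per call.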

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [23]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/242M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Note that we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [24]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-xsum",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell and customize the weight decay. Since the `Seq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to make three saves maximum. Lastly, we use the `predict_with_generate` option (to properly generate summaries) and activate mixed precision training (to go a bit faster).
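As a quick sanity check on these settings, the arithmetic below estimates how many optimizer steps one epoch takes. It assumes a single device and no gradient accumulation (neither is stated in the notebook); the row counts come from the `DatasetDict` printed earlier.

```python
import math

train_rows = 204045       # size of the training split shown earlier
validation_rows = 11332   # size of the validation split shown earlier
batch_size = 16
num_train_epochs = 1
num_devices = 1           # assumption: one GPU, no gradient accumulation

steps_per_epoch = math.ceil(train_rows / (batch_size * num_devices))
total_training_steps = steps_per_epoch * num_train_epochs
eval_batches = math.ceil(validation_rows / (batch_size * num_devices))

print(steps_per_epoch)       # 12753
print(total_training_steps)  # 12753
print(eval_batches)          # 709
```

With several devices, `num_devices` scales the effective batch size and the step count shrinks accordingly.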

The last argument sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/t5-finetuned-xsum"` or `"huggingface/t5-finetuned-xsum"`).
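
As a small sketch of the naming described above, the snippet below builds the keyword arguments for a local folder name that differs from the Hub repository name. The namespace `my-org` and both folder names are hypothetical placeholders; only `push_to_hub` and `hub_model_id` come from the text.

```python
model_name = "t5-small"  # stand-in for model_checkpoint.split("/")[-1]
namespace = "my-org"     # hypothetical user or organization name

local_dir = f"{model_name}-finetuned-xsum"  # where checkpoints are written locally
repo_name = "t5-finetuned-xsum"             # hypothetical repository name on the Hub

hub_kwargs = {
    "output_dir": local_dir,
    "push_to_hub": True,
    "hub_model_id": f"{namespace}/{repo_name}",  # full name, namespace included
}
print(hub_kwargs["hub_model_id"])  # my-org/t5-finetuned-xsum
```

These entries would be passed to `Seq2SeqTrainingArguments` alongside the other arguments shown in the cell above.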

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [25]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
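
To make the padding behaviour concrete, here is a minimal pure-Python sketch of the idea, not the library's implementation. It assumes right padding, a pad token id of 0, and `-100` as the label padding value so that padded label positions are ignored by the loss. The real collator also returns tensors and handles further details that are left out here.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad input_ids and labels to the longest sequence in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_lab)
    return batch


# Toy token ids, not produced by a real tokenizer.
features = [
    {"input_ids": [21, 22, 23, 1], "labels": [31, 1]},
    {"input_ids": [24, 1], "labels": [32, 33, 34, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[21, 22, 23, 1], [24, 1, 0, 0]]
print(batch["labels"])     # [[31, 1, -100, -100], [32, 33, 34, 1]]
```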

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier, and we have to do a bit of post-processing to decode the predictions and labels into text:

In [26]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
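
The two array manipulations in `compute_metrics` can be checked in isolation. The sketch below uses toy token ids and assumes a pad token id of 0. It shows how `-100` entries in the labels are swapped for the pad id so they can be decoded, and how the generated length counts the non-pad tokens of each prediction.

```python
import numpy as np

pad_token_id = 0  # assumed pad id for this toy example

labels = np.array([[31, 32, 1, -100, -100],
                   [33, 1, -100, -100, -100]])
predictions = np.array([[0, 31, 32, 1, 0],
                        [0, 33, 34, 35, 1]])

# -100 marks ignored label positions and cannot be decoded, so use the pad id instead.
decodable_labels = np.where(labels != -100, labels, pad_token_id)
print(decodable_labels.tolist())  # [[31, 32, 1, 0, 0], [33, 1, 0, 0, 0]]

# Count the non-pad tokens in each generated sequence.
prediction_lens = [int(np.count_nonzero(pred != pad_token_id)) for pred in predictions]
print(prediction_lens)                  # [3, 4]
print(float(np.mean(prediction_lens)))  # 3.5
```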

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [42]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Cloning https://huggingface.co/KSz/t5-small-finetuned-xsum into local empty directory.


We can now fine-tune our model by just calling the `train` method:

In [None]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss


In [40]:
text = "Jose Mourinho will play his second European final in two years after leading Roma to the final of this season’s Europa League.The Serie A side held Bayer Leverkusen to a drab goalless draw in Germany on Thursday night.Roma had edged the first leg 1-0 at home.They will take perennial winners of the competition Sevilla for the trophy on May 31.Erik Lamela’s extra-time header proved to be the deciding goal as the LaLiga side fought back to win 2-1 on the night in Seville. They progressed with a 3-2 aggregate win."
inputs = tokenizer(
    text,
    # max_length=max_input_length,  # would cap the number of input tokens (needs truncation enabled)
    truncation=False,
    return_tensors="pt",
    padding=False,
).to("cuda")
# Length options for the generated text (e.g. min_length) belong to generate(), not to the tokenizer call.
Summary = model.generate(**inputs)

In [41]:
tokenizer.batch_decode(Summary)

['<pad><extra_id_0> 2-1 on the night in Seville. The LaLiga side fought back to']
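
The decoded string above still contains special tokens such as `<pad>` because `batch_decode` was called without `skip_special_tokens=True`. It also stops after roughly 20 tokens, which is likely the default length limit of `generate`. The helper below is a hedged sketch of a cleaner call: the `summarize` function, the `"summarize: "` prefix (the usual T5 task prefix, assumed here) and the `max_new_tokens=60` budget are illustrative choices, not part of the original notebook.

```python
def summarize(text, tokenizer, model, prefix="summarize: ", max_new_tokens=60):
    """Generate a summary for `text` and decode it without special tokens."""
    # Prepend the (assumed) task prefix and truncate over-long inputs.
    inputs = tokenizer(prefix + text, truncation=True, return_tensors="pt")
    # Move the encoded inputs to wherever the model lives (CPU or GPU).
    inputs = inputs.to(model.device)
    # Give generation an explicit budget instead of relying on the default.
    summary_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop <pad>, </s> and similar tokens from the decoded text.
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
```

With the objects defined in this notebook, it could be called as `summarize(text, tokenizer, model)`, where `text` holds the article to summarize. The quality of the result still depends on whether the model has actually been fine-tuned.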