If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets as well as other dependencies. Run the following cell to install them.

In [1]:
! pip install datasets evaluate transformers sentencepiece nltk
! pip install datasets transformers[sentencepiece]
! pip install -U accelerate
! pip install -U transformers
# ! pip install -U torch

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting transformers
  Downloading transformers-4.35.2-py3-none-any.whl (7.9 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and paste your access token when prompted:

In [2]:
from huggingface_hub import notebook_login

notebook_login()
# Paste your access token in the prompt; never hard-code it in the notebook.

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Then you need to install Git-LFS by running the following cell:

In [3]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 8 not upgraded.


Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [4]:
import transformers

print(transformers.__version__)

4.35.2
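If you would rather have the notebook check the version than read it off the output, a minimal sketch follows. It assumes plain numeric `X.Y.Z` version strings (no `.dev0` or `rc` suffixes), and `version_tuple` and `is_at_least` are hypothetical helper names, not part of 🤗 Transformers.

```python
def version_tuple(version):
    """Turn '4.35.2' into (4, 35, 2); assumes purely numeric X.Y.Z strings."""
    return tuple(int(part) for part in version.split("."))


def is_at_least(installed, required):
    """Compare versions numerically, component by component."""
    return version_tuple(installed) >= version_tuple(required)


# Comparing raw strings is wrong: "4.9.0" sorts after "4.11.0" lexicographically.
print("4.9.0" > "4.11.0")                 # True (misleading)
print(is_at_least("4.9.0", "4.11.0"))     # False (correct)
print(is_at_least("4.35.2", "4.11.0"))    # True
```

In the notebook you could pass `transformers.__version__` as the first argument, provided it has the plain numeric form assumed here.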


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

In [5]:
from transformers.utils import send_example_telemetry

send_example_telemetry("summarization_notebook", framework="pytorch")

# Fine-tuning a model on a summarization task

In [6]:
model_checkpoint = "google/mt5-small"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`google/mt5-small`](https://huggingface.co/google/mt5-small) checkpoint.

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [7]:
# !pip install rouge-score

from datasets import load_dataset, DatasetDict, load_metric
from evaluate import load

# raw_datasets = load_dataset("xsum")
raw_datasets = load_dataset("squad")
# metric = load("rouge")
metric = load_metric('bleu')


The `raw_datasets` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split. For SQuAD these are the training and validation sets:

In [8]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

To access an actual element, you need to select a split first, then give an index:

In [9]:
raw_datasets["train"][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
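In the example above, `answer_start` is the character offset of the answer inside `context`. A minimal sketch of how the two fields relate, using a made-up toy example rather than real SQuAD data (`extract_answer` is a hypothetical helper):

```python
def extract_answer(example, index=0):
    """Slice the answer out of the context using its character offset."""
    start = example["answers"]["answer_start"][index]
    text = example["answers"]["text"][index]
    return example["context"][start:start + len(text)]


# Toy example with the same layout as a SQuAD record (assumption: offsets
# count characters from the start of the context string).
context = "Next to the Main Building is the Basilica of the Sacred Heart."
answer = "Basilica of the Sacred Heart"
toy_example = {
    "context": context,
    "answers": {"text": [answer], "answer_start": [context.index(answer)]},
}

print(extract_answer(toy_example) == answer)
```

The same helper should work on `raw_datasets["train"][0]`, since it only relies on the `context` and `answers` fields shown above.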

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [10]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
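The retry loop above draws indices until it has collected enough distinct ones. As an alternative sketch, `random.sample` from the standard library draws distinct indices in a single call. `pick_unique_indices` is a hypothetical helper, and the optional `seed` argument is an addition for reproducibility that the original function does not have.

```python
import random


def pick_unique_indices(dataset_size, num_examples, seed=None):
    """Return `num_examples` distinct indices in [0, dataset_size)."""
    if num_examples > dataset_size:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # local generator, leaves the global RNG untouched
    return rng.sample(range(dataset_size), num_examples)


picks = pick_unique_indices(87599, 5, seed=0)
print(len(picks), len(set(picks)))  # 5 5
```

The resulting list can be used exactly like `picks` in `show_random_elements`, for example as `dataset[picks]`.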

In [11]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,id,title,context,question,answers
0,570e72df0dc6ce1900205090,Melbourne,"The Hoddle Grid (dimensions of 1 by 1⁄2 mile (1.61 by 0.80 km)) forms the centre of Melbourne's central business district. The grid's southern edge fronts onto the Yarra River. Office, commercial and public developments in the adjoining districts of Southbank and Docklands have made these redeveloped areas into extensions of the CBD in all but name. The city centre has a reputation for its historic and prominent lanes and arcades (most notably Block Place and Royal Arcade) which contain a variety of shops and cafés and are a byproduct of the city's layout.",Which edge of the Hoddle Grid fronts onto the Yarra River?,"{'text': ['southern'], 'answer_start': [134]}"
1,5729508d3f37b31900478235,Germans,"The Nazis, led by Adolf Hitler, attempted to unite all the people they claimed were ""Germans"" (Volksdeutsche) into one realm, including ethnic Germans in eastern Europe, many of whom had emigrated more than one hundred fifty years before and developed separate cultures in their new lands. This idea was initially welcomed by many ethnic Germans in Sudetenland, Austria, Poland, Danzig and western Lithuania, particularly the Germans from Klaipeda (Memel). The Swiss resisted the idea. They had viewed themselves as a distinctly separate nation since the Peace of Westphalia of 1648.",Since when had the Swiss viewed themselves as a different nation?,"{'text': ['1648'], 'answer_start': [578]}"
2,5707261a90286e26004fc964,Chihuahua_(state),"During the American occupation of the state, the number of Indian attacks was drastically reduced, but in 1848 the attacks resumed to such a degree that the Mexican officials had no choice but to resume military projects to protect Mexican settlements in the state. Through the next three decades the state faced constant attacks from indigenous on Mexican settlements. After the occupation the people of the state were worried about the potential attack from the hostile indigenous tribes north of the Rio Grande; as a result a decree on July 19, 1848, the state established 18 military colonies along the Rio Grande. The new military colonies were to replace the presidios as population centers to prevent future invasions by indigenous tribes; these policies remained prominent in the state until 1883. Eventually the state replaced the old state security with a state policy to form militias organized with every Mexican in the state capable to serve between the ages of 18 and 55 to fulfill the mandate of having six men defending for every 1000 residents.",How many men per 1000 residents were mandated to defend?,"{'text': ['six men'], 'answer_start': [1018]}"
3,56e822d000c9c71400d775cb,Dialect,"The most common, and most purely linguistic, criterion is that of mutual intelligibility: two varieties are said to be dialects of the same language if being a speaker of one variety confers sufficient knowledge to understand and be understood by a speaker of the other; otherwise, they are said to be different languages. However, this definition becomes problematic in the case of dialect continua, in which it may be the case that dialect B is mutually intelligible with both dialect A and dialect C but dialects A and C are not mutually intelligible with each other. In this case the criterion of mutual intelligibility makes it impossible to decide whether A and C are dialects of the same language or not. Cases may also arise in which a speaker of dialect X can understand a speaker of dialect Y, but not vice versa; the mutual intelligibility criterion flounders here as well.",What trait is the most common way of determining if languages are dialects?,"{'text': ['mutual intelligibility'], 'answer_start': [66]}"
4,570cf238fed7b91900d45b37,Digestion,"In addition to the use of the multiprotein complexes listed above, Gram-negative bacteria possess another method for release of material: the formation of outer membrane vesicles. Portions of the outer membrane pinch off, forming spherical structures made of a lipid bilayer enclosing periplasmic materials. Vesicles from a number of bacterial species have been found to contain virulence factors, some have immunomodulatory effects, and some can directly adhere to and intoxicate host cells. While release of vesicles has been demonstrated as a general response to stress conditions, the process of loading cargo proteins seems to be selective.",What other method does Gram-negative bacters use to release material?,"{'text': ['the formation of outer membrane vesicles'], 'answer_start': [138]}"



The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [12]:
metric

Metric(name: "bleu", features: {'predictions': Sequence(feature=Value(dtype='string', id='token'), length=-1, id='sequence'), 'references': Sequence(feature=Sequence(feature=Value(dtype='string', id='token'), length=-1, id='sequence'), length=-1, id='references')}, usage: """
Computes BLEU score of translated segments against one or more references.
Args:
    predictions: list of translations to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
    max_order: Maximum n-gram order to use when computing BLEU score.
    smooth: Whether or not to apply Lin et al. 2004 smoothing.
Returns:
    'bleu': bleu score,
    'precisions': geometric mean of n-gram precisions,
    'brevity_penalty': brevity penalty,
    'length_ratio': ratio of lengths,
    'translation_length': translation_length,
    'reference_length': reference_length
Examples

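You can call its `compute` method with your predictions and references. For BLEU, each prediction must be a list of tokens, and each prediction is paired with a list of one or more tokenized references:

In [13]:
# Uncomment to try the metric on toy data.
# fake_preds = [["how", "is", "he", "?"], ["is", "this", "question", "good", "?"]]
# fake_labels = [[["how", "is", "he", "?"]], [["is", "this", "question", "good", "?"]]]
# metric.compute(predictions=fake_preds, references=fake_labels)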
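To make the nesting of the BLEU inputs concrete, here is a minimal pure-Python sketch of the clipped (modified) n-gram precision that BLEU is built on. It is an illustration only, not the implementation behind `metric.compute`: the real metric also combines several n-gram orders and applies a brevity penalty. `ngrams` and `clipped_precision` are hypothetical helpers.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction, references, n=1):
    """Fraction of predicted n-grams found in the references, where each
    n-gram is credited at most as often as it occurs in any one reference."""
    pred_counts = Counter(ngrams(prediction, n))
    if not pred_counts:
        return 0.0
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in pred_counts.items())
    return clipped / sum(pred_counts.values())


# One prediction (a list of tokens) against a list of tokenized references.
prediction = ["how", "is", "he", "?"]
references = [["how", "is", "she", "?"]]
print(clipped_precision(prediction, references, n=1))  # 0.75
print(clipped_precision(prediction, references, n=2))  # 0.3333333333333333
```

The clipping is what stops a degenerate prediction such as `["the", "the", "the"]` from scoring a perfect unigram precision against a reference that contains "the" only once.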

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [14]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)  # google/mt5-small

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


By default, the call above will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [15]:
tokenizer("Hello, this one sentence!")


{'input_ids': [30273, 261, 714, 1371, 259, 98923, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later). You can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Instead of one sentence, we can pass along a list of sentences:

In [16]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[30273, 261, 714, 1371, 259, 98923, 309, 1], [1494, 339, 259, 7845, 259, 98923, 260, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}
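The two outputs above have the same keys but different nesting: a single sentence yields flat lists, a list of sentences yields one list per sentence, and every sequence ends with id `1`, the end-of-sequence token. The toy stand-in below mimics only that shape. It is not how the real SentencePiece tokenizer splits text: the whitespace splitting, the `toy_vocab` ids and the `toy_tokenize` name are all assumptions for illustration.

```python
EOS_ID = 1  # end-of-sequence id, matching the trailing 1 in the outputs above
UNK_ID = 2  # id used here for words missing from the toy vocabulary


def toy_tokenize(texts, vocab):
    """Whitespace 'tokenizer' returning the same structure as the real one."""
    single = isinstance(texts, str)
    batch = [texts] if single else list(texts)
    input_ids = [[vocab.get(word, UNK_ID) for word in text.split()] + [EOS_ID]
                 for text in batch]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    if single:
        return {"input_ids": input_ids[0], "attention_mask": attention_mask[0]}
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_vocab = {"hello": 5, "world": 6, "again": 7}  # made-up ids
print(toy_tokenize("hello world", toy_vocab))
print(toy_tokenize(["hello world", "hello again friend"], toy_vocab))
```

The attention mask is all ones here because nothing has been padded yet; padded positions would get a `0`.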

To prepare the targets for our model, we need to tokenize them using the `text_target` parameter. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [17]:
print(tokenizer(text_target=["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[30273, 261, 714, 1371, 259, 98923, 309, 1], [1494, 339, 259, 7845, 259, 98923, 260, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}


If you are using one of the five T5 checkpoints we have to prefix the inputs with "summarize:" (the model can also translate and it needs the prefix to know which task it has to perform).

In [18]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
    # prefix = "question: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. Padding is dealt with later on (in a data collator), so that examples are padded to the longest length in the batch rather than the longest in the whole dataset.

In [19]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = ['questiongeneration: ' + doc for doc in examples["context"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["question"], max_length=max_target_length, truncation=True)
    # labels = tokenizer(labels, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
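The function above truncates but deliberately does not pad. A minimal pure-Python sketch of the two steps, truncation per example and padding per batch, is shown below. It illustrates the idea only and is not the code the tokenizer or the data collator runs. It assumes pad id `0` and end-of-sequence id `1`, and `truncate` and `pad_batch` are hypothetical helpers.

```python
PAD_ID = 0  # assumed pad token id
EOS_ID = 1  # assumed end-of-sequence id


def truncate(ids, max_length):
    """Cut a sequence to max_length while keeping the final EOS token."""
    if len(ids) <= max_length:
        return ids
    return ids[:max_length - 1] + [EOS_ID]


def pad_batch(sequences, pad_value=PAD_ID):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (longest - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return padded, mask


batch = [truncate(ids, max_length=4) for ids in [[5, 1], [5, 6, 7, 8, 9, 1]]]
padded, mask = pad_batch(batch)
print(padded)  # [[5, 1, 0, 0], [5, 6, 7, 1]]
print(mask)    # [[1, 1, 0, 0], [1, 1, 1, 1]]
```

Padding per batch means a batch of short examples stays short, instead of every example being stretched to `max_input_length`.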

In [20]:
# max_input_length = 1024
# max_target_length = 128

# def preprocess_function(examples):
#     inputs = ['summarize: ' + doc for doc in examples["document"]]
#     model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

#     # Setup the tokenizer for targets
#     labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True)

#     model_inputs["labels"] = labels["input_ids"]
#     return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [21]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[7680, 96542, 267, 298, 115957, 484, 261, 287, 5560, 1070, 259, 262, 259, 60355, 28021, 260, 298, 1332, 287, 4437, 29827, 277, 263, 17088, 88346, 339, 259, 262, 74172, 259, 169692, 304, 287, 69774, 15255, 260, 336, 201815, 484, 281, 8224, 304, 287, 4437, 29827, 305, 9106, 347, 609, 261, 339, 259, 262, 259, 110360, 259, 169692, 304, 15685, 514, 259, 75807, 1150, 44930, 285, 514, 287, 34666, 313, 30702, 15782, 4515, 1517, 3548, 1865, 1191, 3557, 288, 287, 4437, 29827, 339, 287, 364, 205267, 304, 287, 89194, 345, 24022, 260, 336, 201815, 484, 259, 25386, 287, 330, 205267, 339, 287, 101091, 476, 261, 259, 262, 36757, 2554, 304, 259, 84956, 305, 259, 131642, 260, 1385, 339, 259, 262, 40422, 304, 287, 259, 161963, 268, 344, 458, 133796, 261, 5263, 259, 3001, 287, 69774, 15255, 121041, 69063, 259, 15484, 345, 288, 5528, 60442, 4549, 16856, 7377, 6836, 281, 259, 116096, 260, 2584, 287, 3162, 304, 287, 4397, 12317, 274, 1963, 281, 259, 262, 3867, 6329, 533, 20492, 263, 3026, 381,

To apply this function to every example, we use the `map` method of a `DatasetDict`. It applies the function to all the elements of all the splits, so our training and validation data are preprocessed with one single command. To keep training time manageable, we first shuffle the data with a fixed seed and keep 8,000 training and 1,000 validation examples:

In [40]:
def reduce_data(raw_datasets, x_train, x_val):
    # Shuffle with a fixed seed so the subsample is reproducible.
    shuffled_dataset = raw_datasets.shuffle(seed=42)
    # SQuAD only provides train and validation splits, so there is no test split to subsample.
    return DatasetDict({
        "train": shuffled_dataset["train"].select(range(x_train)),
        "validation": shuffled_dataset["validation"].select(range(x_val)),
    })

reduced_dataset = reduce_data(raw_datasets, 8000, 1000)

In [41]:
reduced_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 1000
    })
})

In [42]:
tokenized_datasets = reduced_dataset.map(preprocess_function, batched=True)


In [43]:
show_random_elements(tokenized_datasets["train"])

Unnamed: 0,id,title,context,question,answers,input_ids,attention_mask,labels
0,570d672cb3d812140066d83a,Adolescence,"In studying adolescent development, adolescence can be defined biologically, as the physical transition marked by the onset of puberty and the termination of physical growth; cognitively, as changes in the ability to think abstractly and multi-dimensionally; or socially, as a period of preparation for adult roles. Major pubertal and biological changes include changes to the sex organs, height, weight, and muscle mass, as well as major changes in brain structure and organization. Cognitive advances encompass both increases in knowledge and in the ability to think abstractly and to reason more effectively. The study of adolescent development often involves interdisciplinary collaborations. For example, researchers in neuroscience or bio-behavioral health might focus on pubertal changes in brain structure and its effects on cognition or social relations. Sociologists interested in adolescence might focus on the acquisition of social roles (e.g., worker or romantic partner) and how this varies across cultures or social conditions. 
Developmental psychologists might focus on changes in relations with parents and peers as a function of school structure and pubertal status.","Changes to sex organs, height, weight, and muscle mass are examples of which type of change?","{'text': ['biological'], 'answer_start': [335]}","[7680, 96542, 267, 563, 10380, 347, 52369, 10030, 261, 142387, 541, 738, 390, 259, 54628, 107521, 4621, 261, 527, 287, 259, 28223, 259, 10091, 64454, 455, 287, 351, 2325, 304, 186352, 276, 305, 287, 259, 128836, 304, 259, 28223, 259, 20147, 296, 259, 119449, 11469, 261, 527, 259, 25444, 281, 287, 259, 7744, 288, 5231, 83949, 484, 305, 5942, 264, 81321, 484, 296, 631, 2943, 484, 261, 527, 259, 262, 8192, 304, 259, 72075, 332, 13403, 259, 54355, 260, 30313, 186352, 473, 305, 330, 121673, 259, 25444, 9452, 259, 25444, 288, 287, 1528, 10523, 263, 261, 259, 1744, 261, 27923, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","[259, 96890, 288, 1528, 10523, 263, 261, 259, 1744, 261, 27923, 261, 305, 259, 52167, 26555, 418, 259, 95000, 304, 259, 1542, 4054, 304, 6313, 291, 1]"
1,57324575e99e3014001e660d,Dwight_D._Eisenhower,"Angels in the Outfield was Eisenhower's favorite movie. His favorite reading material for relaxation were the Western novels of Zane Grey. With his excellent memory and ability to focus, Eisenhower was skilled at card games. He learned poker, which he called his ""favorite indoor sport,"" in Abilene. Eisenhower recorded West Point classmates' poker losses for payment after graduation, and later stopped playing because his opponents resented having to pay him. A classmate reported that after learning to play contract bridge at West Point, Eisenhower played the game six nights a week for five months.",Where did Eisenhower learn to play poker?,"{'text': ['Abilene'], 'answer_start': [291]}","[7680, 96542, 267, 130042, 281, 287, 7732, 5504, 639, 101958, 17600, 295, 277, 263, 22590, 13194, 260, 13889, 22590, 11807, 5171, 332, 14788, 1300, 2109, 287, 17358, 20233, 263, 304, 1515, 405, 18332, 260, 3126, 1638, 16386, 28246, 305, 259, 7744, 288, 16857, 261, 101958, 17600, 295, 639, 32607, 345, 344, 10168, 10239, 260, 1669, 11869, 345, 12650, 261, 259, 1542, 790, 259, 13075, 1638, 313, 119532, 259, 87311, 4264, 914, 281, 298, 29147, 265, 260, 101958, 17600, 295, 8449, 345, 4300, 15234, 3931, 63038, 277, 12650, 26754, 299, 332, 23082, 3354, 49267, 1300, 261, 305, 13245, 35042, 345, 259, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","[259, 29655, 3031, 101958, 17600, 295, 11869, 288, 5233, 12650, 291, 1]"
2,571aa99d4faf5e1900b8abd4,Umayyad_Caliphate,"The Umayyads have met with a largely negative reception from later Islamic historians, who have accused them of promoting a kingship (mulk, a term with connotations of tyranny) instead of a true caliphate (khilafa). In this respect it is notable that the Umayyad caliphs referred to themselves not as khalifat rasul Allah (""successor of the messenger of God"", the title preferred by the tradition), but rather as khalifat Allah (""deputy of God""). The distinction seems to indicate that the Umayyads ""regarded themselves as God's representatives at the head of the community and saw no need to share their religious power with, or delegate it to, the emergent class of religious scholars."" In fact, it was precisely this class of scholars, based largely in Iraq, that was responsible for collecting and recording the traditions that form the primary source material for the history of the Umayyad period. In reconstructing this history, therefore, it is necessary to rely mainly on sources, such as the histories of Tabari and Baladhuri, that were written in the Abbasid court at Baghdad.",What Arabic term did the Umayyad caliphs use to refer to themselves?,"{'text': ['khalifat Allah'], 'answer_start': [413]}","[7680, 96542, 267, 486, 3048, 79232, 285, 263, 783, 824, 514, 259, 262, 8057, 484, 259, 32588, 60084, 702, 13245, 259, 70355, 259, 173786, 263, 261, 1866, 783, 37979, 345, 2486, 304, 5415, 1821, 259, 262, 259, 6197, 10260, 274, 157201, 261, 259, 262, 10886, 514, 450, 228488, 263, 304, 43883, 321, 680, 271, 281, 15129, 304, 259, 262, 6274, 123043, 88085, 265, 274, 314, 117468, 1518, 483, 563, 714, 7521, 609, 339, 776, 1059, 533, 287, 3048, 79232, 285, 259, 235959, 334, 263, 259, 75568, 288, 259, 39694, 776, 527, 408, 164575, 270, 259, 172244, 3472, 259, 1217, 46327, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","[5126, 259, 62233, 10886, 3031, 287, 3048, 79232, 285, 259, 235959, 334, 263, 2225, 288, 15661, 288, 259, 39694, 291, 1]"
3,57267215dd62a815002e8521,East_India_Company,"In 1838 with the amount of smuggled opium entering China approaching 1,400 tons a year, the Chinese imposed a death penalty for opium smuggling and sent a Special Imperial Commissioner, Lin Zexu, to curb smuggling. This resulted in the First Opium War (1839–42). After the war Hong Kong island was ceded to Britain under the Treaty of Nanking and the Chinese market opened to the opium traders of Britain and other nations. The Jardines and Apcar and Company dominated the trade, although P&O also tried to take a share. A Second Opium War fought by Britain and France against China lasted from 1856 until 1860 and led to the Treaty of Tientsin, which legalised the importation of opium. Legalisation stimulated domestic Chinese opium production and increased the importation of opium from Turkey and Persia. This increased competition for the Chinese market led to India reducing its opium output and diversifying its exports.",when did the first Opium war start?,"{'text': ['1839'], 'answer_start': [253]}","[7680, 96542, 267, 563, 259, 145086, 514, 287, 259, 10617, 304, 259, 220379, 59389, 585, 16558, 259, 111323, 3182, 18996, 347, 259, 188767, 259, 46171, 259, 262, 3721, 261, 287, 17542, 39660, 345, 259, 262, 20862, 19208, 1421, 332, 585, 16558, 259, 220379, 76625, 305, 8407, 259, 262, 12183, 78879, 22407, 295, 261, 11508, 5627, 33601, 261, 288, 14087, 316, 259, 220379, 76625, 260, 1494, 8106, 345, 281, 287, 8875, 3120, 16558, 4576, 274, 89558, 1326, 84350, 260, 11076, 287, 2381, 13658, 12458, 43891, 639, 317, 345, 345, 288, 259, 47117, 1711, 287, 108349, 276, 304, 19399, 6197, 305, 287, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","[259, 1909, 3031, 287, 2262, 3120, 
16558, 2381, 3014, 291, 1]"
4,57345902acc1501500babe25,"Richmond,_Virginia","In 1990 religion and politics intersected to impact the outcome of the Eighth District election in South Richmond. With the endorsements of black power brokers, black clergy and the Richmond Crusade for Voters, South Richmond residents made history, electing Reverend A. Carl Prince to the Richmond City Council. As the first African American Baptist Minister elected to the Richmond City Council, Prince's election paved the way for a political paradigm shift in politics that persist today. Following Prince's election, Reverend Gwendolyn Hedgepeth and the Reverend Leonidas Young, former Richmond Mayor were elected to public office. Prior to Prince's election black clergy made political endorsements and served as appointees to the Richmond School Board and other boards throughout the city. Today religion and politics continues to thrive in the Commonwealth of Virginia. The Honorable Dwight C. Jones, a prominent Baptist pastor and former Chairman of the Richmond School Board and Member of the Virginia House of Delegates serves as Mayor of the City of Richmond.",What political organization supported the city council candidacy of A. 
Carl Prince?,"{'text': ['Richmond Crusade for Voters'], 'answer_start': [182]}","[7680, 96542, 267, 563, 10478, 43443, 305, 77539, 174952, 3678, 288, 10832, 287, 1350, 284, 265, 304, 287, 259, 122583, 334, 16689, 57046, 281, 6174, 259, 87351, 260, 3126, 287, 259, 117200, 42095, 304, 5866, 6665, 26617, 263, 261, 5866, 259, 95770, 5118, 305, 287, 259, 87351, 155023, 2663, 332, 54644, 1207, 261, 6174, 259, 87351, 259, 77902, 3785, 12312, 261, 259, 111966, 1821, 158258, 3043, 298, 260, 14644, 35913, 288, 287, 259, 87351, 2740, 28996, 260, 1477, 287, 2262, 31134, 6369, 259, 84467, 14010, 259, 126894, 288, 287, 259, 87351, 2740, 28996, 261, 35913, 277, 263, 57046, 555, 9999, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","[5126, 259, 28735, 29660, 259, 66337, 287, 9416, 259, 99906, 136965, 2888, 304, 298, 260, 14644, 35913, 291, 1]"


Even better, the 🤗 Datasets library automatically caches the results, so this step does not have to be repeated the next time you run your notebook. The library is normally smart enough to detect when the function you pass to `map` has changed (in which case the cached data must not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This leverages the full benefit of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently.
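To make the calling convention behind `batched=True` concrete, here is a minimal sketch in plain Python. With batching, the function receives a dict whose values are lists covering many examples, so a single tokenizer call can handle all of them. The function `toy_preprocess`, the `"prefix: "` string and the whitespace split are hypothetical stand-ins, not this notebook's preprocessing function or the real tokenizer.

```python
# Hypothetical stand-in for a batched preprocessing function.
def toy_preprocess(batch):
    # Each value in `batch` is a list with one entry per example.
    inputs = ["prefix: " + context for context in batch["context"]]
    # A real fast tokenizer would encode the whole list in one call;
    # splitting on whitespace here is purely for illustration.
    return {"input_tokens": [text.split() for text in inputs]}

batch = {"context": ["Richmond is in Virginia.", "Eisenhower played bridge."]}
out = toy_preprocess(batch)
print(len(out["input_tokens"]))
```

The returned dict follows the same layout: one list entry per example in the incoming batch.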

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [44]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer, MT5Model, MT5ForConditionalGeneration

# model = MT5Model.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# model = MT5ForConditionalGeneration.from_pretrained("google/" + model_checkpoint)

Note that we don't get a warning like in our classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [45]:
import accelerate

accelerate.__version__

# !pip install -U torch

'0.24.1'

In [46]:
batch_size = 8
model_name = model_checkpoint.split("/")[-1]
print(model_name)
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    # fp16=True,
    push_to_hub=True,
)

mt5-small


Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell and customize the weight decay. Since the `Seq2SeqTrainer` saves the model regularly and our dataset is quite large, we tell it to keep at most three checkpoints. Lastly, we use the `predict_with_generate` option so that evaluation calls `generate` and scores actual generated questions. Mixed precision training (`fp16=True`) would make training a bit faster, but it is commented out in the cell above.

The last argument sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that differs from the name of the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/t5-finetuned-xsum"` or `"huggingface/t5-finetuned-xsum"`).
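As a small illustration of the `hub_model_id` option, the sketch below collects the Hub-related keyword arguments in a plain dict. The repository name `"your-username/mt5-small-finetuned-squad"` and the folder name `"local-checkpoints"` are placeholders, not values used in this notebook.

```python
# Keyword arguments that control where the model is pushed.
# The repo name below is a placeholder; replace it with your own namespace.
hub_kwargs = {
    "push_to_hub": True,
    "hub_model_id": "your-username/mt5-small-finetuned-squad",
}

# The local output folder can then use a different name, for example:
# args = Seq2SeqTrainingArguments("local-checkpoints", **hub_kwargs)

# The full name is always "<namespace>/<repo name>".
namespace, repo_name = hub_kwargs["hub_model_id"].split("/")
print(namespace, repo_name)
```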

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [47]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
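
To show what padding both inputs and labels means, here is a simplified pure-Python sketch. It is not the library's implementation: `pad_seq2seq_batch` is a hypothetical helper, and the pad token id of `0` is an assumption. Labels are padded with `-100`, the value the loss function ignores, so padded positions do not contribute to training.

```python
def pad_seq2seq_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids and labels to the longest sequence in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        # Real tokens get mask 1, padding gets mask 0.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        # Labels use -100 instead of the pad token id.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_lab)
    return batch

features = [
    {"input_ids": [5, 6, 7], "labels": [9, 1]},
    {"input_ids": [5, 6], "labels": [9, 8, 7, 1]},
]
batch = pad_seq2seq_batch(features)
print(batch["labels"])
```

This is also why metric code has to replace `-100` in the labels with a real token id before decoding them.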

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this; the version below computes a corpus-level BLEU score with NLTK's `corpus_bleu`, and we have to do a bit of pre-processing to decode the predictions into texts:

In [48]:
import nltk
nltk.download('punkt')
import numpy as np
from nltk.translate.bleu_score import corpus_bleu

def compute_metrics(eval_pred):
    # Unpack the predictions and labels from the input.
    predictions, labels = eval_pred

    # Decode the predictions to human-readable text, skipping special tokens.
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace any instance of -100 in the labels with the pad token id.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decode the labels to human-readable text, skipping special tokens.
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # For BLEU, we need to tokenize the sentences into words.
    # nltk's word_tokenize function can be used for this purpose.
    tokenized_preds = [nltk.word_tokenize(pred.strip().lower()) for pred in decoded_preds]
    # BLEU expects a list of reference translations for each sentence.
    tokenized_labels = [[nltk.word_tokenize(label.strip().lower())] for label in decoded_labels]

    # Compute the BLEU score for the entire corpus (set of predictions).
    # The function corpus_bleu takes a list of list of reference translations and a list of candidate translations.
    bleu_score = corpus_bleu(tokenized_labels, tokenized_preds)

    # Add mean generated length.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    gen_len = np.mean(prediction_lens)

    # Return BLEU score and generated length. corpus_bleu returns a value between 0 and 1, reported here as-is.
    return {"bleu": round(bleu_score, 4), "gen_len": round(gen_len, 4)}


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
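
To give some intuition for the BLEU numbers, the sketch below computes clipped n-gram precision, the building block that BLEU combines (for n = 1 to 4 by default in `corpus_bleu`) together with a brevity penalty. The helper `clipped_precision` and the two example sentences are made up for illustration; this is not NLTK's implementation, which also pools counts over the whole corpus.

```python
from collections import Counter

def clipped_precision(reference, candidate, n):
    """Clipped n-gram precision of `candidate` against one `reference`."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it appears in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "what organization supported carl prince ?".split()
candidate = "what organization supported the candidate ?".split()
for n in range(1, 5):
    print(n, clipped_precision(reference, candidate, n))
```

Precision drops quickly as n grows, which is one reason short generated questions that only partly match the reference obtain low BLEU scores.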


Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [49]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
print()




We can now fine-tune our model by just calling the `train` method:

In [50]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | 4.7092 | 2.713296 | 0.0308 | 11.59 |
| 2 | 3.8496 | 2.602507 | 0.0474 | 11.252 |




TrainOutput(global_step=2000, training_loss=6.009755126953125, metrics={'train_runtime': 1021.5792, 'train_samples_per_second': 15.662, 'train_steps_per_second': 1.958, 'total_flos': 5370568442265600.0, 'train_loss': 6.009755126953125, 'epoch': 2.0})
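
The numbers in this `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so that one optimizer step consumes exactly `batch_size` examples; neither assumption is confirmed by the output itself.

```python
# Values copied from the TrainOutput and the training arguments above.
global_step = 2000
num_epochs = 2
batch_size = 8
train_runtime = 1021.5792       # seconds
samples_per_second = 15.662

steps_per_epoch = global_step // num_epochs
# Upper bound on the training set size; exact if the last batch is full.
approx_train_examples = steps_per_epoch * batch_size
# Cross-check using throughput: samples processed over both epochs.
samples_seen = samples_per_second * train_runtime

print(steps_per_epoch, approx_train_examples, round(samples_seen))
```

Under these assumptions both routes agree: about 8,000 training examples per epoch, seen twice.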

In [75]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def _generate(text):
  # Prepend the task prefix; `text` avoids shadowing the built-in `input`.
  context = "questionsgeneration: " + text
  model_inputs = tokenizer(context, max_length=max_input_length, truncation=True, return_tensors="pt").to(device)
  print(model_inputs)

  output_ids = model.generate(model_inputs["input_ids"], max_length=50, num_beams=4, early_stopping=True)
  decoded_preds = tokenizer.decode(output_ids[0], skip_special_tokens=True)
  # decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in output_ids]
  print(decoded_preds)
  return output_ids

# Define the passage to generate a question from, then run generation on it.
context = "Angels in the Outfield was Eisenhower's favorite movie. His favorite reading material for relaxation were the Western novels of Zane Grey. With his excellent memory and ability to focus, Eisenhower was skilled at card games. He learned poker, which he called his \"favorite indoor sport,\" in Abilene. Eisenhower recorded West Point classmates' poker losses for payment after graduation, and later stopped playing because his opponents resented having to pay him. A classmate reported that after learning to play contract bridge at West Point, Eisenhower played the game six nights a week for five months."
_generate(context)

{'input_ids': tensor([[ 12366,  96542,    267, 130042,    281,    287,   7732,   5504,    639,
         101958,  17600,    295,    277,    263,  22590,  13194,    260,  13889,
          22590,  11807,   5171,    332,  14788,   1300,   2109,    287,  17358,
          20233,    263,    304,   1515,    405,  18332,    260,   3126,   1638,
          16386,  28246,    305,    259,   7744,    288,  16857,    261, 101958,
          17600,    295,    639,  32607,    345,    344,  10168,  10239,    260,
           1669,  11869,    345,  12650,    261,    259,   1542,    790,    259,
          13075,   1638,    313, 119532,    259,  87311,   4264,    914,    281,
            298,  29147,    265,    260, 101958,  17600,    295,   8449,    345,
           4300,  15234,   3931,  63038,    277,  12650,  26754,    299,    332,
          23082,   3354,  49267,   1300,    261,    305,  13245,  35042,    345,
            259,  29095,    259,   3361,   1638,    585, 167445,  59150,   3678,
            25

tensor([[     0,   5126,    639, 101958,  17600,    295,    277,    263,  22590,
          13194,    291,      1]], device='cuda:0')

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
# trainer.push_to_hub()

You can now share this model with all your friends, family and favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"`, for instance:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("sgugger/my-awesome-model")
```