### **Data Preprocessing**

In this section, we show how the data can be pre-processed for training. More importantly, we try to give the reader some insight into the process of deciding how to preprocess the data.

We will need datasets and transformers to be installed.

In [20]:
%%capture
!pip install datasets==1.0.2
!pip install transformers==3.5.0
!pip install torch

Let's start by downloading the *SNLI* dataset.

In [2]:
import datasets
train_data2 = datasets.load_dataset("snli",split="train")

Reusing dataset snli (/home/dimion/.cache/huggingface/datasets/snli/plain_text/1.0.0/bb1102591c6230bd78813e229d5dd4c7fbf4fc478cec28f298761eb69e5b537c)
Neither PyTorch nor TensorFlow >= 2.0 have been found.Models won't be available and only tokenizers, configurationand file/data utilities can be used.


Alright, let's get a first impression of the dataset.
Alternatively, the dataset can also be visualized using the awesome [datasets viewer](https://huggingface.co/nlp/viewer/?dataset=cnn_dailymail&config=3.0.0) online.



In [3]:
train_data2.info.description

'The SNLI corpus (version 1.0) is a collection of 570k human-written English\nsentence pairs manually labeled for balanced classification with the labels\nentailment, contradiction, and neutral, supporting the task of natural language\ninference (NLI), also known as recognizing textual entailment (RTE).\n'

Our input is called *article* and our labels are called *highlights*. Let's now print out the first example of the training data to get a feeling for the data.

In [4]:
import pandas as pd
from IPython.display import display, HTML
from datasets import ClassLabel

df = pd.DataFrame(train_data2[:1])
# del df["id"]
for column, typ in train_data2.features.items():
      if isinstance(typ, ClassLabel):
          df[column] = df[column].transform(lambda i: typ.names[i])
display(HTML(df.to_html()))

Unnamed: 0,premise,hypothesis,label
0,A person on a horse jumps over a broken down airplane.,A person is training his horse for a competition.,neutral



The input data seems to consist of short news articles. Interestingly, the labels appear to be bullet-point-like summaries. At this point, one should probably take a look at a couple of other examples to get a better feeling for the data.

One should also notice here that the text is *case-sensitive*. This means that we have to be careful if we want to use *case-insensitive* models.
As *CNN/Dailymail* is a summarization dataset, the model will be evaluated using the *ROUGE* metric. 
Checking the description of *ROUGE* in 🤗datasets, *cf.* [here](https://huggingface.co/metrics/rouge), we can see that the metric is *case-insensitive*, meaning that *upper case* letters will be normalized to *lower case* letters during evaluation. Thus, we can safely leverage *uncased* checkpoints, such as `bert-base-uncased`.

Cool! Next, let's get a sense of the length of input data and labels. 

As models compute length in *token-length*, we will make use of the `bert-base-uncased` tokenizer to compute the article and summary length.

First, we load the tokenizer.

In [5]:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=466062.0, style=ProgressStyle(descripti…




Next, we make use of `.map()` to compute the length of the article and its summary. Since we know that the maximum length that `bert-base-uncased` can process amounts to 512, we are also interested in the percentage of input samples being longer than the maximum length.
Similarly, we compute the percentage of summaries that are longer than 16, and 32 respectively.

 We can define the `.map()` function as follows.

In [6]:
# map article and summary len to dict as well as if sample is longer than 512 tokens
def map_to_length(x):
  x["hypothesis_len"] = len(tokenizer(x["hypothesis"]).input_ids)
  x["hypothesis_longer_32"] = int(x["hypothesis_len"] > 31)
  x["premise_len"] = len(tokenizer(x["premise"]).input_ids)
  x["premise_longer_16"] = int(x["premise_len"] > 16)
  x["premise_longer_32"] = int(x["premise_len"] > 32)
  return x

It should be sufficient to look at the first 10000 samples. We can speed up the mapping by using multiple processes with `num_proc=4`.

In [7]:
sample_size = 500000
data_stats = train_data2.select(range(sample_size)).map(map_to_length, num_proc=4)


    

HBox(children=(FloatProgress(value=0.0, description='#0', max=125000.0, style=ProgressStyle(description_width=…

HBox(children=(FloatProgress(value=0.0, description='#2', max=125000.0, style=ProgressStyle(description_width=…

HBox(children=(FloatProgress(value=0.0, description='#1', max=125000.0, style=ProgressStyle(description_width=…

HBox(children=(FloatProgress(value=0.0, description='#3', max=125000.0, style=ProgressStyle(description_width=…







Having computed the length for the first 10000 samples, we should now average them together. For this, we can make use of the `.map()` function with `batched=True` and `batch_size=-1` to have access to all 10000 samples within the `.map()` function.

In [8]:
def compute_and_print_stats(x):
  if len(x["premise_len"]) == sample_size:
    print(
        "hypothesis Mean: {}, %-hypotheses > 32:{}, premise Mean:{}, %-premise > 16:{}, %-premise > 32:{}".format(
            sum(x["hypothesis_len"]) / sample_size,
            sum(x["hypothesis_longer_32"]) / sample_size, 
            sum(x["premise_len"]) / sample_size,
            sum(x["premise_longer_16"]) / sample_size,
            sum(x["premise_longer_32"]) / sample_size,
        )
    )

output = data_stats.map(
  compute_and_print_stats, 
  batched=True,
  batch_size=-1,
)

HBox(children=(FloatProgress(value=0.0, max=1.0), HTML(value='')))

hypothesis Mean: 10.5313, %-hypotheses > 32:0.000546, premise Mean:16.62471, %-premise > 16:0.410652, %-premise > 32:0.02394



In [9]:
labels_tokens = ['contradiction', 'entailment', 'neutral']

In [10]:
encoder_max_length=32
decoder_max_length=64

def process_data_to_model_inputs(batch):
  # tokenize the inputs and labels
  H_y = [batch["hypothesis"][i]+ " "+labels_tokens[batch['label'][i]] for i in range(len(batch['hypothesis']))]
  inputs = tokenizer(H_y, padding="max_length", truncation=True, max_length=encoder_max_length)
  outputs = tokenizer(batch["premise"], padding="max_length", truncation=True, max_length=decoder_max_length)

  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  batch["decoder_input_ids"] = outputs.input_ids
  batch["decoder_attention_mask"] = outputs.attention_mask
  batch["labels"] = outputs.input_ids.copy()

  # because BERT automatically shifts the labels, the labels correspond exactly to `decoder_input_ids`. 
  # We have to make sure that the PAD token is ignored
  batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]

  return batch

Alright, let's prepare the training data.

In [11]:
# batch_size = 16
batch_size=64

train_data = train_data2.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    remove_columns=["premise", "hypothesis", "label"]
)

HBox(children=(FloatProgress(value=0.0, max=8597.0), HTML(value='')))




In [12]:
train_data

Dataset(features: {'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'decoder_attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'decoder_input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}, num_rows: 550152)

So far, the data was manipulated using Python's `List` format. Let's convert the data to PyTorch Tensors to be trained on GPU.

In [21]:
train_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

In [22]:
val_data = datasets.load_dataset("snli", split="validation")

Reusing dataset snli (/home/dimion/.cache/huggingface/datasets/snli/plain_text/1.0.0/bb1102591c6230bd78813e229d5dd4c7fbf4fc478cec28f298761eb69e5b537c)


the mapping function is applied,

In [23]:
val_data = val_data.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    remove_columns=["premise", "hypothesis", "label"]
)

Loading cached processed dataset at /home/dimion/.cache/huggingface/datasets/snli/plain_text/1.0.0/bb1102591c6230bd78813e229d5dd4c7fbf4fc478cec28f298761eb69e5b537c/cache-519a29610e4bcd1c.arrow


and, finally, the validation data is also converted to PyTorch tensors.

In [24]:
val_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

Great! Now we can move to warm-starting the `EncoderDecoderModel`.

### **Warm-starting the Encoder-Decoder Model**

This section explains how an Encoder-Decoder model can be warm-started using the `bert-base-cased` checkpoint.

Let's start by importing the `EncoderDecoderModel`. For more detailed information about the `EncoderDecoderModel` class, the reader is advised to take a look at the [documentation](https://huggingface.co/transformers/model_doc/encoderdecoder.html).

In [25]:
from transformers import EncoderDecoderModel

In [26]:
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "bert-base-cased", tie_encoder_decoder=True)


AttributeError: type object 'EncoderDecoderModel' has no attribute 'from_encoder_decoder_pretrained'

As a comparison, we can see that the tied model has much fewer parameters as expected.

We have warm-started a `bert2bert` model, but we have not defined all the relevant parameters used for beam search decoding yet.

Let's start by setting the special tokens.
`bert-base-cased` does not have a `decoder_start_token_id` or `eos_token_id`, so we will use its `cls_token_id` and `sep_token_id` respectively. 
Also, we should define a `pad_token_id` on the config and make sure the correct `vocab_size` is set.

In [None]:
bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.config.eos_token_id = tokenizer.sep_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id
bert2bert.config.vocab_size = bert2bert.config.encoder.vocab_size

In [None]:
bert2bert.config.max_length = 64
bert2bert.config.min_length = 8
bert2bert.config.no_repeat_ngram_size = 3
bert2bert.config.early_stopping = True
bert2bert.config.length_penalty = 2.0
bert2bert.config.num_beams = 4

### **Fine-Tuning Warm-Started Encoder-Decoder Models**

In this section, we will show how one can make use of the `Seq2SeqTrainer` that can be found under [examples/seq2seq/seq2seq_trainer.py](https://github.com/huggingface/transformers/blob/master/examples/seq2seq/seq2seq_trainer.py) to fine-tune a warm-started encoder-decoder model.

Let's first download the `Seq2SeqTrainer` and its training arguments `Seq2SeqTrainingArguments`.

In [None]:
%%capture
!rm seq2seq_trainer.py
!rm seq2seq_training_args.py
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/seq2seq/seq2seq_trainer.py
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/seq2seq/seq2seq_training_args.py

In addition, we need a couple of python packages to make the `Seq2SeqTrainer` work.

In [None]:
%%capture
!pip install git-python==1.0.3
!pip install rouge_score
!pip install sacrebleu

Alright, let's import the `Seq2SeqTrainer` and the `Seq2SeqTrainingArguments`.

In [None]:
from seq2seq_trainer import Seq2SeqTrainer
from seq2seq_training_args import Seq2SeqTrainingArguments

The `Seq2SeqTrainer` extends 🤗Transformer's Trainer for encoder-decoder models.
In short, it allows using the `generate(...)` function during evaluation, which is necessary to validate the performance of encoder-decoder models on most *sequence-to-sequence* tasks, such as *summarization*. 

For more information on the `Trainer`, one should read through [this](https://huggingface.co/transformers/training.html#trainer) short tutorial.

Let's begin by configuring the `Seq2SeqTrainingArguments`.

The argument `predict_with_generate` should be set to `True`, so that the `Seq2SeqTrainer` runs the `generate(...)` on the validation data and passes the generated output as `predictions` to the `compute_metric(...)` function which we will define later.
The additional arguments are derived from `TrainingArguments` and can be read upon [here](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments).
For a complete training run, one should change those arguments as needed. Good default values are commented out below.

For more information on the `Seq2SeqTrainer`, the reader is advised to take a look at the [code](https://github.com/huggingface/transformers/blob/master/examples/seq2seq/seq2seq_trainer.py).

In [None]:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True, 
    output_dir="./",
    # logging_steps=2,
    # save_steps=10,
    # eval_steps=4,
    logging_steps=1000,
    save_steps=500,
    eval_steps=7500,
    warmup_steps=2000,
    save_total_limit=3,
)


Also, we need to define a function to correctly compute the ROUGE score during validation. Since we activated `predict_with_generate`, the `compute_metrics(...)` function expects `predictions` that were obtained using the `generate(...)` function. 
Like most summarization tasks, CNN/Dailymail is typically evaluated using the ROUGE score. 

Let's first load the ROUGE metric using the 🤗datasets library.

In [None]:
rouge = datasets.load_metric("rouge")

Next, we will define the `compute_metrics(...)` function. The `rouge` metric computes the score from two lists of strings. Thus we decode both the `predictions` and `labels` - making sure that `-100` is correctly replaced by the `pad_token_id` and remove all special characters by setting `skip_special_tokens=True`.

In [None]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

In [None]:
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=bert2bert,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()

Awesome, we should now be fully equipped to finetune a warm-started encoder-decoder model. To check the result of our fine-tuning let's take a look at the saved checkpoints.

In [None]:
!ls

bert2bert      checkpoint-20  runs	   seq2seq_trainer.py
checkpoint-10  __pycache__    sample_data  seq2seq_training_args.py


Finally, we can load the checkpoint as usual via the `EncoderDecoderModel.from_pretrained(...)` method.

### **Evaluation**

In a final step, we might want to evaluate the *BERT2BERT* model on the test data.

To start, instead of loading the dummy model, let's load a *BERT2BERT* model that was finetuned on the full training dataset. Also, we load its tokenizer, which is just a copy of `bert-base-cased`'s tokenizer.

In [None]:
test_data = datasets.load_dataset("snli", split="test")

Reusing dataset cnn_dailymail (/root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602)


Now, we can again leverage 🤗dataset's handy `map()` function to generate a summary for each test sample.

For each data sample we:

- first, tokenize the `"article"`,
- second, generate the output token ids, and
- third, decode the output token ids to obtain our predicted summary.

In [None]:
def generate_summary(batch):
    # cut off at BERT max length 512
    inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    outputs = bert2bert.generate(input_ids, attention_mask=attention_mask)

    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    batch["pred_summary"] = output_str

    return batch

Let's run the map function to obtain the *results* dictionary that has the model's predicted summary stored for each sample. Executing the following cell may take *ca.* 10min ☕.

In [None]:
batch_size = 16  # change to 64 for full evaluation

results = test_data.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["article"])

HBox(children=(FloatProgress(value=0.0, max=15.0), HTML(value='')))




Finally, we compute the ROUGE score.

In [None]:
rouge.compute(predictions=results["pred_summary"], references=results["highlights"], rouge_types=["rouge2"])["rouge2"].mid

Score(precision=0.10389454113300968, recall=0.1564771201053348, fmeasure=0.12175271663717585)

Thanks a lot to Sascha Rothe, Shashi Narayan, and Aliaksei Severyn from Google Research, and Victor Sanh, Sylvain Gugger, and Thomas Wolf from 🤗Hugging Face for proof-reading and giving very much appreciated feedback.