# Fine-tuning with the Hugging Face `Transformers` library

In this project we'll learn how to use the Hugging Face [`transformers`](https://huggingface.co/transformers/) library to fine-tune large pretrained transformers for downstream NLU tasks. `transformers` is a popular Python library that provides useful wrappers for pre-training and fine-tuning popular architectures such as BERT, GPT-2, RoBERTa, and XLNet. It integrates with both PyTorch and TensorFlow and is accompanied by a large [library of datasets](https://huggingface.co/docs/datasets/).

Specifically, we will work through an example of fine-tuning a [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html) model on the SQuAD2.0 dataset, a reading comprehension dataset.

In [None]:
!pip install transformers datasets

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting datasets
  Downloading datasets-2.1.0-py3-none-any.whl (325 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.50.tar.gz (880 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)

## Loading data

In this project we'll be using the `datasets` API (https://huggingface.co/docs/datasets/) to load and work with our data. To get an idea of what kinds of datasets are available for us to work with, we can consult the [dataset viewer](https://huggingface.co/datasets/viewer/). For example, let's look at the dataset card for SQuAD2.0.

![](https://drive.google.com/uc?export=view&id=1KHsnQnGilEo87rY3FmWApthoGo-KLjxa)

Now let's load this data and take a look at it. Fortunately, the `load_dataset` function lets us load the entire dataset in a single call.

In [None]:
import pprint as pp
from datasets import load_dataset, load_metric

In [None]:
squad = load_dataset("squad_v2")

Downloading builder script:   0%|          | 0.00/1.87k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

Downloading and preparing dataset squad_v2/squad_v2 (download: 44.34 MiB, generated: 122.41 MiB, post-processed: Unknown size, total: 166.75 MiB) to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/9.55M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/801k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/130319 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11873 [00:00<?, ? examples/s]

Dataset squad_v2 downloaded and prepared to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

`load_dataset()` returns all splits of the data in a dictionary, where the key corresponds to the name of the split. The values of the dictionary are [datasets.Dataset](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset) objects which we can iterate through the same way we would a list.

In [None]:
print(f"Split names: {', '.join(list(squad.keys()))}")
print("An example from SQuAD2.0:")
pp.pprint(squad["train"][0])
pp.pprint(squad["train"][19])

Split names: train, validation
An example from SQuAD2.0:
{'answers': {'answer_start': [269], 'text': ['in the late 1990s']},
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born '
            'September 4, 1981) is an American singer, songwriter, record '
            'producer and actress. Born and raised in Houston, Texas, she '
            'performed in various singing and dancing competitions as a child, '
            'and rose to fame in the late 1990s as lead singer of R&B '
            "girl-group Destiny's Child. Managed by her father, Mathew "
            "Knowles, the group became one of the world's best-selling girl "
            "groups of all time. Their hiatus saw the release of Beyoncé's "
            'debut album, Dangerously in Love (2003), which established her as '
            'a solo artist worldwide, earned five Grammy Awards and featured '
            'the Billboard Hot 100 number-one singles "Crazy in Love" and '
            '"Baby Boy".',


As we can see above, the SQuAD2.0 dataset consists of a set of questions paired with relevant context from Wikipedia articles. Each question is either answered by a span of the context or marked as unanswerable.
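To make the record layout concrete, here is a minimal sketch that reads the answer span out of a SQuAD-style example. The two records are hypothetical stand-ins that mimic the schema printed above (they are not rows from the dataset), and `answer_span` is a helper name introduced only for this illustration.

```python
# Hypothetical records that mimic the SQuAD2.0 schema (not real dataset rows).
context = "Born and raised in Houston, Texas, she performed as a child."
answerable = {
    "question": "Where was she raised?",
    "context": context,
    "answers": {
        "text": ["Houston, Texas"],
        # Character offset of the answer inside the context.
        "answer_start": [context.index("Houston, Texas")],
    },
}
unanswerable = {
    "question": "Which album did she release in 2050?",
    "context": context,
    # Unanswerable questions carry empty answer lists.
    "answers": {"text": [], "answer_start": []},
}


def answer_span(example):
    """Return the (start, end) character span of the first answer, or None."""
    answers = example["answers"]
    if len(answers["answer_start"]) == 0:
        return None  # unanswerable question
    start = answers["answer_start"][0]
    return start, start + len(answers["text"][0])


span = answer_span(answerable)
recovered = answerable["context"][span[0]:span[1]]
```

Slicing the context with the returned span recovers the answer text, which is the property the preprocessing step relies on.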

---

## Preprocessing data
Before we can feed text to the model, we need to tokenize it. The Hugging Face API contains a useful [Tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html) interface that implements common tokenizing methods for a variety of different architectures, including BERT, RoBERTa, and others. (To see the complete list of supported architectures in the transformers library, see [this table](https://huggingface.co/transformers/index.html#bigtable).) The tokenizer handles logic such as:

*   Tokenizing text and encoding/decoding between tokens and token IDs
*   Handling different types of vocabularies, such as byte-pair encodings (BPE) or SentencePiece
*   Handling special tokens such as padding, mask, CLS, or separator tokens.
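As a rough illustration of the first and last bullets, the toy encoder below maps tokens to IDs and wraps them in special tokens. It is a deliberately simplified sketch, not the library's implementation: it splits on whitespace instead of using subwords, and `TOY_VOCAB`, `encode`, and `decode` are names made up for this example. The five IDs are the ones that appear in the tokenizer output printed in this notebook.

```python
# Toy vocabulary: five entries whose IDs match the tokenizer output shown in
# this notebook. A real vocabulary has about 30k subword entries.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "hello": 7592, "world": 2088}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}


def encode(text_a, text_b=None):
    """Lowercase, split on whitespace, and add [CLS]/[SEP] around the text(s)."""
    tokens = ["[CLS]"] + text_a.lower().split() + ["[SEP]"]
    if text_b is not None:
        # A second sequence is appended with its own trailing [SEP].
        tokens += text_b.lower().split() + ["[SEP]"]
    # Words outside the toy vocabulary raise KeyError; a real tokenizer would
    # fall back to subword pieces or an unknown token instead.
    return [TOY_VOCAB[tok] for tok in tokens]


def decode(ids):
    """Map token IDs back to a space-joined string of tokens."""
    return " ".join(ID_TO_TOKEN[idx] for idx in ids)


single = encode("Hello World")
pair = encode("Hello", "World")
```

Here `decode(single)` gives `'[CLS] hello world [SEP]'`, the same layout the real tokenizer produces for this input.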

Since we'll be fine-tuning a DistilBERT model, we will use the `DistilBertTokenizerFast` class to preprocess our data.



In [None]:
from transformers import DistilBertTokenizerFast

# We can load a pre-trained tokenizer from a model name (see the Model Hub for a
# complete list: https://huggingface.co/models) or a filepath. The checkpoint
# should match the tokenizer class: loading "bert-base-uncased" with
# DistilBertTokenizerFast triggers a tokenizer-class mismatch warning.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")


We can call the tokenizer directly on a single text or on a pair of texts. As in BERT, the example is prepended with a special classification token (`[CLS]`) that is used in classification tasks for learning an aggregate representation of the entire sequence. In addition, a separator token (`[SEP]`) is appended to the end of each sequence. Here we'll print out both the encoded and decoded tokens to get an idea of how this works.

In [None]:
import pprint as pp

print("Tokenize as a single sequence:")
pp.pprint(tokenizer("Hello World"))
print("Decoded:")
pp.pprint(tokenizer.decode(tokenizer("Hello World")['input_ids']))

print("\nTokenize as a pair of sequences:")
pp.pprint(tokenizer("Hello", "World"))
print("Decoded:")
pp.pprint(tokenizer.decode(tokenizer("Hello", "World")['input_ids']))

Tokenize as a single sequence:
{'input_ids': [101, 7592, 2088, 102], 'attention_mask': [1, 1, 1, 1]}
Decoded:
'[CLS] hello world [SEP]'

Tokenize as a pair of sequences:
{'input_ids': [101, 7592, 102, 2088, 102], 'attention_mask': [1, 1, 1, 1, 1]}
Decoded:
'[CLS] hello [SEP] world [SEP]'


We can also easily incorporate padding or truncation.

In [None]:
max_seq_len = 4

print("Truncated:")
pp.pprint(tokenizer("Hello", "World", truncation=True, max_length=max_seq_len))

print("\nPadded:")
pp.pprint(tokenizer("Hello", padding="max_length", max_length=max_seq_len))

Truncated:
{'input_ids': [101, 102, 2088, 102], 'attention_mask': [1, 1, 1, 1]}

Padded:
{'input_ids': [101, 7592, 102, 0], 'attention_mask': [1, 1, 1, 0]}


As we can see above, `attention_mask[i]=0` if the token at index `i` corresponds to a padding token. The model uses this attention mask to indicate which tokens should have attention applied to them (the non-padding tokens) and which should not (the padding tokens).
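The sketch below shows, under simplifying assumptions, how padding produces the mask and how a mask can zero out attention to padded positions. `pad_to_length` and `masked_softmax` are hypothetical helpers written for this illustration; setting masked scores to negative infinity before the softmax is one common way to apply such a mask, and is not taken from this notebook.

```python
import math

PAD_ID = 0  # ID of the padding token in the padded output above


def pad_to_length(input_ids, max_length, pad_id=PAD_ID):
    """Right-pad `input_ids` and build the matching attention mask."""
    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("sequence is longer than max_length; truncate first")
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


def masked_softmax(scores, attention_mask):
    """Softmax over `scores` where positions with mask 0 get zero weight.

    Assumes at least one position is unmasked.
    """
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


encoded = pad_to_length([101, 7592, 102], max_length=4)
# Raw attention scores for the four positions; the last one is padding.
weights = masked_softmax([0.5, 1.0, 0.2, 3.0], encoded["attention_mask"])
```

Even though the padded position has the largest raw score, its attention weight is exactly zero and the remaining weights still sum to one.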

Now we write a function to tokenize our entire dataset and featurize our labels. Since the answer to each SQuAD2.0 question is either "unanswerable" or a span in the context passage, we need to provide the model with information about the answer span. The model architecture we'll be using adds a span classification head on top of DistilBERT, which requires the start and end positions of the answer span as token indices in the tokenized input. The dataset only stores character offsets into the context, so we convert them using the tokenizer's offset mapping. If the question is unanswerable (or the answer was truncated out of the context), we set both positions to the index of the `[CLS]` token.

In [None]:
def preprocess_data(examples_batch):
  tokenized_examples = tokenizer(
        examples_batch["question"],
        examples_batch["context"],
        truncation="only_second", # only truncate the context
        max_length=max_length,
        padding="max_length",
        return_offsets_mapping=True,
    )

  # Extract start and end token positions for answers.
  start_pos = []
  end_pos = []
  for i in range(len(examples_batch["question"])):
    input_ids = tokenized_examples['input_ids'][i]
    answer = examples_batch['answers'][i]
    cls_idx = input_ids.index(tokenizer.cls_token_id)
    # Default to [CLS]: used when the question has no answer or when the
    # answer was truncated out of the context.
    start_tok, end_tok = cls_idx, cls_idx
    if len(answer["answer_start"]) > 0:
      # Character span of the first answer within the context.
      start_char = answer["answer_start"][0]
      end_char = start_char + len(answer["text"][0])
      sequence_ids = tokenized_examples.sequence_ids(i)
      offsets = tokenized_examples["offset_mapping"][i]
      # Context tokens (sequence id 1) whose character span overlaps the answer.
      answer_toks = [
          j for j, (s, e) in enumerate(offsets)
          if sequence_ids[j] == 1 and s < end_char and e > start_char
      ]
      if answer_toks:
        start_tok, end_tok = answer_toks[0], answer_toks[-1]
    start_pos.append(start_tok)
    end_pos.append(end_tok)
  tokenized_examples["start_positions"] = start_pos
  tokenized_examples["end_positions"] = end_pos
  return tokenized_examples
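Question-answering heads in `transformers` take `start_positions` and `end_positions` as token indices, while SQuAD stores character offsets, so the two need to be aligned. The following sketch shows that alignment on hand-written data. The `offsets` and `sequence_ids` lists are stand-ins for what a fast tokenizer returns (`offset_mapping` and `sequence_ids()`); they are typed by hand here, and `char_span_to_token_span` is a hypothetical helper.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (first_token, last_token).

    Returns None if no context token overlaps the span, for example when the
    answer was truncated away.
    """
    hits = [
        i
        for i, ((tok_start, tok_end), seq_id) in enumerate(zip(offsets, sequence_ids))
        # Keep only context tokens (sequence id 1) that overlap the span.
        if seq_id == 1 and tok_start < end_char and tok_end > start_char
    ]
    return (hits[0], hits[-1]) if hits else None


# Hand-written stand-ins for the tokenization of:
#   [CLS] who ? [SEP] born in houston [SEP]
context = "born in houston"
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 4), (5, 7), (8, 15), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]

start_char = context.index("houston")
end_char = start_char + len("houston")
token_span = char_span_to_token_span(offsets, sequence_ids, start_char, end_char)
```

The answer starts at character 8 of the context but at token 6 of the input, which is why character offsets cannot be passed to the model directly.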

Now we can use the `map` function (from the `datasets.Dataset` interface) to apply our preprocessing function to the entire dataset in batches.

In [None]:
max_length = 384
tokenized_datasets = squad.map(preprocess_data, batched=True)

  0%|          | 0/131 [00:00<?, ?ba/s]

  0%|          | 0/12 [00:00<?, ?ba/s]
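The totals in the two progress bars (131 and 12) are batch counts, not example counts. A quick check, assuming `map` was left at its default `batch_size` of 1000 and using the split sizes reported when the dataset was generated:

```python
import math


def num_batches(num_examples, batch_size=1000):
    """Number of batches needed to cover `num_examples` (last one may be partial)."""
    return math.ceil(num_examples / batch_size)


train_batches = num_batches(130319)      # size of the train split
validation_batches = num_batches(11873)  # size of the validation split
```

Each call to `preprocess_data` therefore receives a dictionary of lists holding up to 1000 examples.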

In [None]:
print("Original example:")
pp.pprint(squad["train"][0])
print("\nParsed answer indexes:")
print(f"Start position: {tokenized_datasets['train'][0]['start_positions']}")
print(f"End position: {tokenized_datasets['train'][0]['end_positions']}")

Original example:
{'answers': {'answer_start': [269], 'text': ['in the late 1990s']},
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born '
            'September 4, 1981) is an American singer, songwriter, record '
            'producer and actress. Born and raised in Houston, Texas, she '
            'performed in various singing and dancing competitions as a child, '
            'and rose to fame in the late 1990s as lead singer of R&B '
            "girl-group Destiny's Child. Managed by her father, Mathew "
            "Knowles, the group became one of the world's best-selling girl "
            "groups of all time. Their hiatus saw the release of Beyoncé's "
            'debut album, Dangerously in Love (2003), which established her as '
            'a solo artist worldwide, earned five Grammy Awards and featured '
            'the Billboard Hot 100 number-one singles "Crazy in Love" and '
            '"Baby Boy".',
 'id': '56be85543aeaaa14008c9063',
 'qu

And now our data is ready for our model!

---

## Fine-tuning a DistilBERT model for question answering

Once again, we'll be taking advantage of the super convenient Hugging Face `transformers` library to load a pre-trained DistilBERT model. The library also provides a number of variations of the DistilBERT model for fine-tuning on various downstream tasks. We'll be using `DistilBertForQuestionAnswering` here, which adds a span classification head on top of DistilBERT. Currently the weights associated with this head are randomly initialized, but we'll train these during the fine-tuning stage on the SQuAD2.0 data.
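Conceptually, the span classification head is a single linear layer that turns each token's hidden vector into two numbers: a start logit and an end logit. The sketch below illustrates this in plain Python with toy sizes. The weights and hidden states are made-up values, and `span_head` is a hypothetical function, not the library's code, which also applies dropout and runs on tensors.

```python
def span_head(hidden_states, weight, bias):
    """Project each token's hidden vector to a (start_logit, end_logit) pair.

    hidden_states: one vector of length hidden_size per token
    weight:        2 x hidden_size (row 0 -> start, row 1 -> end)
    bias:          2 values
    """
    start_logits, end_logits = [], []
    for h in hidden_states:
        start_logits.append(sum(w * x for w, x in zip(weight[0], h)) + bias[0])
        end_logits.append(sum(w * x for w, x in zip(weight[1], h)) + bias[1])
    return start_logits, end_logits


# Toy sizes: 4 tokens with hidden size 3 (DistilBERT's `dim` is 768).
hidden_states = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
weight = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]  # made-up values
bias = [0.0, 0.1]

start_logits, end_logits = span_head(hidden_states, weight, bias)
# The predicted start is the token with the highest start logit.
best_start = max(range(len(start_logits)), key=lambda i: start_logits[i])
```

Because the head only adds a `2 x dim` weight matrix and a bias, almost all of the model's parameters come from the pretrained DistilBERT encoder.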

In [None]:
from transformers import DistilBertForQuestionAnswering, TrainingArguments, Trainer

# We can override parameters set in the original model config, such as the
# dropout probability.
model = DistilBertForQuestionAnswering.from_pretrained(
    "distilbert-base-uncased",
    dropout=0.1,
    attention_dropout=0.1,
    )

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this mode

Let's take a look at our current model configuration.

In [None]:
pp.pprint(model.config)

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.18.0",
  "vocab_size": 30522
}



In [None]:
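training_args = TrainingArguments(
    output_dir="./models/",
    do_train=True,
    do_eval=True,
    # Note: the `per_gpu_*` names are deprecated; `per_device_train_batch_size`
    # and `per_device_eval_batch_size` are the preferred equivalents.
    per_gpu_train_batch_size=8,
    per_gpu_eval_batch_size=64,
    num_train_epochs=0.5, # due to time/computation constraints
    logging_steps=500,
    logging_first_step=True,
    save_steps=1000, # write a checkpoint every 1000 optimization steps
    evaluation_strategy="epoch", # evaluate at the end of every epoch
    learning_rate=2e-5,
    weight_decay=0.01,
)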

Now we have all the components we need to run training. Doing so is simple with the `Trainer.train()` method; however, training large models such as this one can take quite a bit of time (up to a full GPU day) and memory, so for now we'll only train for half an epoch to demonstrate how the interface works. We'll run the full training script later on the Greene cluster.

In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForQuestionAnswering.forward` and have been ignored: answers, question, id, offset_mapping, context, title. If answers, question, id, offset_mapping, context, title are not expected by `DistilBertForQuestionAnswering.forward`,  you can safely ignore this message.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 130319
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 8145
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future ver

Saving model checkpoint to ./models/checkpoint-1000
Configuration saved in ./models/checkpoint-1000/config.json
Model weights saved in ./models/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./models/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./models/checkpoint-1000/special_tokens_map.json


Saving model checkpoint to ./models/checkpoint-2000
Configuration saved in ./models/checkpoint-2000/config.json
Model weights saved in ./models/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in ./models/checkpoint-2000/tokenizer_config.json
Special tokens file saved in ./models/checkpoint-2000/special_tokens_map.json
Saving model checkpoint to ./models/checkpoint-3000
Configuration saved in ./models/checkpoint-3000/config.json
Model weights saved in ./models/checkpoint-3000/pytorch_model.bin
tokenizer config file saved in ./models/checkpoint-3000/tokenizer_config.json
Special tokens file saved in ./models/checkpoint-3000/special_tokens_map.json
Saving model checkpoint to ./models/checkpoint-4000
Configuration saved in ./models/checkpoint-4000/config.json
Model weights saved in ./models/checkpoint-4000/pytorch_model.bin
tokenizer config file saved in ./models/checkpoint-4000/tokenizer_config.json
Special tokens file saved in ./models/checkpoint-4000/special_tokens_map.jso
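The "Total optimization steps = 8145" line in the training log can be reproduced from the settings above. This is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation, which is what the log reports:

```python
import math

num_examples = 130319   # "Num examples" in the training log
batch_size = 8          # per-device train batch size
num_train_epochs = 0.5
save_steps = 1000

# One optimization step per batch; the last batch of an epoch may be partial.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = math.ceil(steps_per_epoch * num_train_epochs)

# Steps at which a checkpoint directory is written.
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
```

With `save_steps=1000`, a run of this length writes a checkpoint every 1000 steps, from `checkpoint-1000` up to `checkpoint-8000`.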

---
## Model Evaluation
Now that we've trained our model, we want to evaluate how accurately it answers the questions in the validation dataset. For this purpose, we'll use the `squad_v2` metric which is provided together with the dataset itself via the `datasets` library. We can load it and read about its inputs/outputs via the `datasets.load_metric()` function.

In [None]:
# Instead of waiting for training to finish, for now we'll download and unzip
# an already fine-tuned model:
# https://drive.google.com/file/d/1-11ku3cFwP9wBpA0rNFhbzn4rJoqXRxp/view?usp=sharing
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1-11ku3cFwP9wBpA0rNFhbzn4rJoqXRxp' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1-11ku3cFwP9wBpA0rNFhbzn4rJoqXRxp" -O model.zip && rm -rf /tmp/cookies.txt
!unzip 'model.zip'

--2022-05-02 21:43:08--  https://docs.google.com/uc?export=download&confirm=t&id=1-11ku3cFwP9wBpA0rNFhbzn4rJoqXRxp
Resolving docs.google.com (docs.google.com)... 74.125.129.138, 74.125.129.139, 74.125.129.100, ...
Connecting to docs.google.com (docs.google.com)|74.125.129.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-08-c4-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/cgoosmhk557ln8blb36om1atki6jgbsp/1651527750000/17679075865986397708/*/1-11ku3cFwP9wBpA0rNFhbzn4rJoqXRxp?e=download [following]
--2022-05-02 21:43:08--  https://doc-08-c4-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/cgoosmhk557ln8blb36om1atki6jgbsp/1651527750000/17679075865986397708/*/1-11ku3cFwP9wBpA0rNFhbzn4rJoqXRxp?e=download
Resolving doc-08-c4-docs.googleusercontent.com (doc-08-c4-docs.googleusercontent.com)... 142.250.128.132, 2607:f8b0:4001:c32::84
Connecting to doc-08-c4-docs.googleusercontent.com (doc-08-c

In [None]:
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-squad/')
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)

loading configuration file distilbert-squad/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForQuestionAnswering"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.18.0",
  "vocab_size": 30522
}

loading weights file distilbert-squad/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForQuestionAnswering.

All the weights of DistilBertForQuestionAnswering were initialized from the model checkpoint at distilbert-squad/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForQuesti

In [None]:
squadv2_metric = load_metric("squad_v2")
print(squadv2_metric)

Downloading builder script:   0%|          | 0.00/2.25k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.19k [00:00<?, ?B/s]

Metric(name: "squad_v2", features: {'predictions': {'id': Value(dtype='string', id=None), 'prediction_text': Value(dtype='string', id=None), 'no_answer_probability': Value(dtype='float32', id=None)}, 'references': {'id': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}}, usage: """
Computes SQuAD v2 scores (F1 and EM).
Args:
    predictions: List of triple for question-answers to score with the following elements:
        - the question-answer 'id' field as given in the references (see below)
        - the text of the answer
        - the probability that the question has no answer
    references: List of question-answers dictionaries with the following key-values:
            - 'id': id of the question-answer pair (see above),
            - 'answers': a list of Dict {'text': text of the answer as a string}
    no_answer_threshold: float
        Probability threshold
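For intuition about the two scores named in the metric description, here is a simplified sketch of exact match (EM) and token-level F1 for a single prediction and reference. It follows the usual SQuAD-style normalization (lowercase, strip punctuation and English articles, collapse whitespace), but it is an illustration rather than the metric's own code; the real metric also takes the best score over all reference answers for a question.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    if not pred_tokens or not truth_tokens:
        # Two empty answers (a correct "no answer") score 1, otherwise 0.
        return float(pred_tokens == truth_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("in the 10th and 11th centuries", "10th and 11th centuries")
f1 = f1_score("in the 10th and 11th centuries", "10th and 11th centuries")
```

In this example the prediction contains one extra token ("in"), so EM is 0 while F1 still gives partial credit (precision 4/5, recall 1).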

Now let's evaluate our model on the validation dataset!

In [None]:
eval_dataloader = trainer.get_eval_dataloader()
eval_output = trainer.prediction_loop(
                eval_dataloader,
                description="Evaluation",
                prediction_loss_only=False,
            )

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForQuestionAnswering.forward` and have been ignored: answers, question, id, title, context, offset_mapping. If answers, question, id, title, context, offset_mapping are not expected by `DistilBertForQuestionAnswering.forward`,  you can safely ignore this message.
Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future version. Using `--per_device_eval_batch_size` is preferred.
***** Running Evaluation *****
  Num examples = 11873
  Batch size = 64


In [None]:
pp.pprint(eval_output.metrics)


{'eval_loss': 2.5966732501983643}


Our model outputs logits for the start and end indexes of the predicted answer span, but the `squad_v2` metric expects the predictions formatted as a list of triples `(example_id, answer_text, no_answer_probability)`. It also expects the references (the correct answers) to be formatted as a list of `(example_id, answers)`. Let's write a function to reformat our validation dataset and predictions as required.

In [None]:
import numpy as np

def predicted_text_from_logits(examples, features, start_logits, end_logits, n_best_size=10):
  # map feature indices to example IDs
  id2feat = {}
  for i, feature in enumerate(features):
    id2feat[feature["id"]] = i
  # Now extract the text from the contexts for the start and end positions
  # that have the |n_best_size| highest logits. Score these possible answers by
  # the sum of their start and end logit scores. Select the highest scoring one
  # if it has a score higher than the score of the [CLS] token. Otherwise
  # return unanswerable.
  predictions = []
  for example in examples:
    feat_idx = id2feat[example["id"]]
    feature = features[feat_idx]
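    input_ids = feature['input_ids']
    cls_index = input_ids.index(tokenizer.cls_token_id)
    # The context tokens sit strictly between the first and second [SEP].
    # The offset mapping is never None here, so we bound the span explicitly.
    ctx_start = input_ids.index(tokenizer.sep_token_id) + 1
    ctx_end = input_ids.index(tokenizer.sep_token_id, ctx_start)
    null_score = start_logits[feat_idx][cls_index] + end_logits[feat_idx][cls_index]
    starts = list(np.argsort(start_logits[feat_idx])[-1:-n_best_size-1:-1])
    ends = list(np.argsort(end_logits[feat_idx])[-1:-n_best_size-1:-1])
    answers = []
    for start in starts:
      for end in ends:
        # Skip inverted spans and spans that touch the question, the special
        # tokens, or the padding.
        if start > end or start < ctx_start or end >= ctx_end:
          continue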
        context_start = feature["offset_mapping"][start][0]
        context_end = feature["offset_mapping"][end][1]
        answers.append({
            "score": start_logits[feat_idx][start] + end_logits[feat_idx][end],
            "text": example["context"][context_start:context_end],
        })
    if len(answers) > 0:
      final_answer = sorted(answers, key=lambda x: x["score"], reverse=True)[0]
    else:
      final_answer = {"score": 0, "text": ""}
    predictions.append({
        "id": example["id"],
        "prediction_text": final_answer["text"] if final_answer["score"] > null_score else "",
        "no_answer_probability": 0.0
    })
  return predictions

start_logits, end_logits = eval_output.predictions
# We need to reload the tokenized validation data because the Trainer removed
# unnecessary columns such as 'id' during training.
tokenized_validation_data = squad["validation"].map(preprocess_data, batched=True)
final_preds = predicted_text_from_logits(
    squad["validation"], tokenized_validation_data, start_logits, end_logits
)

Loading cached processed dataset at /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d/cache-dd90e338c3fa6072.arrow
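To make the selection rule described in the comments of `predicted_text_from_logits` concrete, here is a toy version that works on a handful of made-up logits. It keeps the same idea (pair the top-scoring starts with the top-scoring ends, score each valid pair by the sum of its logits, and fall back to "no answer" unless the best pair beats the `[CLS]` score) but omits the mapping back to text. `best_span` and the logit values are hypothetical.

```python
def best_span(start_logits, end_logits, cls_index=0, n_best_size=2):
    """Return (start, end) of the best valid span, or None for "no answer"."""

    def top(logits):
        # Indices of the n_best_size largest logits, best first.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best_size]

    null_score = start_logits[cls_index] + end_logits[cls_index]
    candidates = [
        (start_logits[s] + end_logits[e], s, e)
        for s in top(start_logits)
        for e in top(end_logits)
        if s <= e  # a span cannot end before it starts
    ]
    if not candidates:
        return None
    score, start, end = max(candidates)
    # Only answer if the best span beats the [CLS] ("no answer") score.
    return (start, end) if score > null_score else None


# Made-up logits for a 5-token input; index 0 plays the role of [CLS].
confident = best_span([1.0, 0.2, 3.0, 0.1, 0.5], [1.0, 0.1, 0.3, 2.5, 0.4])
abstain = best_span([4.0, 0.2, 1.0, 0.1, 0.5], [4.0, 0.1, 0.3, 1.5, 0.4])
```

In the first call the span from token 2 to token 3 scores 5.5 against a null score of 2.0, so it is returned. In the second call no span beats the `[CLS]` score of 8.0, so the function abstains.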


Now let's take a look at an example of the output of `predicted_text_from_logits`.

In [None]:
preds_with_answers = {idx: pred for idx, pred in enumerate(final_preds) if len(pred['prediction_text']) > 0}

print("Original example:")
pp.pprint(squad["validation"][1])
print("\nPredicted answer:")
pp.pprint(preds_with_answers[1])

Original example:
{'answers': {'answer_start': [94, 87, 94, 94],
             'text': ['10th and 11th centuries',
                      'in the 10th and 11th centuries',
                      '10th and 11th centuries',
                      '10th and 11th centuries']},
 'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: '
            'Normanni) were the people who in the 10th and 11th centuries gave '
            'their name to Normandy, a region in France. They were descended '
            'from Norse ("Norman" comes from "Norseman") raiders and pirates '
            'from Denmark, Iceland and Norway who, under their leader Rollo, '
            'agreed to swear fealty to King Charles III of West Francia. '
            'Through generations of assimilation and mixing with the native '
            'Frankish and Roman-Gaulish populations, their descendants would '
            'gradually merge with the Carolingian-based cultures of West '
            'Francia. The distinc

Now we can use the `squad_v2` metric to evaluate our model. In practice, we'd run this code as a batch job on Greene.

In [None]:
# Compute 'squad_v2' metrics on the validation dataset.
references = [{"id": example["id"], "answers": example["answers"]} for example in squad["validation"]]
pp.pprint(squadv2_metric.compute(predictions=final_preds, references=references))

{'HasAns_exact': 0.11808367071524967,
 'HasAns_f1': 1.6081315887693344,
 'HasAns_total': 5928,
 'NoAns_exact': 95.40790580319596,
 'NoAns_f1': 95.40790580319596,
 'NoAns_total': 5945,
 'best_exact': 50.07159100480081,
 'best_exact_thresh': 0.0,
 'best_f1': 50.07235668399654,
 'best_f1_thresh': 0.0,
 'exact': 47.83121367809315,
 'f1': 48.575170896843595,
 'total': 11873}
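One way to read these numbers is to check how the overall scores relate to the per-group scores. The sketch below assumes the overall `exact` and `f1` are averages over all questions, so they should equal the `HasAns` and `NoAns` scores weighted by group size. The values are copied from the output above and `weighted` is a hypothetical helper.

```python
# Values copied from the metric output above.
scores = {
    "HasAns_exact": 0.11808367071524967,
    "HasAns_f1": 1.6081315887693344,
    "HasAns_total": 5928,
    "NoAns_exact": 95.40790580319596,
    "NoAns_f1": 95.40790580319596,
    "NoAns_total": 5945,
    "exact": 47.83121367809315,
    "f1": 48.575170896843595,
    "total": 11873,
}


def weighted(metric):
    """Average of the HasAns and NoAns scores, weighted by group size."""
    return (
        scores[f"HasAns_{metric}"] * scores["HasAns_total"]
        + scores[f"NoAns_{metric}"] * scores["NoAns_total"]
    ) / scores["total"]


overall_exact = weighted("exact")
overall_f1 = weighted("f1")
# Number of answerable questions answered exactly right.
has_ans_correct = round(scores["HasAns_exact"] / 100 * scores["HasAns_total"])
```

The weighted averages match the reported overall scores, and `has_ans_correct` comes out to 7 of 5928 answerable questions. Together with the high `NoAns` score, this suggests that this checkpoint predicts "no answer" for almost every question, so the roughly 48% overall score mostly reflects the share of unanswerable questions in the validation set.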
