In [None]:
! pip install transformers datasets evaluate accelerate

# Question answering

## Load SQuAD dataset

In [2]:
from datasets import load_dataset

squad = load_dataset("squad", split="train[:5000]")

In [3]:
squad

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 5000
})

Split the dataset's `train` split into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [4]:
squad = squad.train_test_split(test_size=0.2)

In [5]:
squad["train"][0]

{'id': '5733f7b64776f419006615e4',
 'title': 'Genocide',
 'context': 'Jonassohn and Björnson postulate that the major reason why no single generally accepted genocide definition has emerged is because academics have adjusted their focus to emphasise different periods and have found it expedient to use slightly different definitions to help them interpret events. For example, Frank Chalk and Kurt Jonassohn studied the whole of human history, while Leo Kuper and R. J. Rummel in their more recent works concentrated on the 20th century, and Helen Fein, Barbara Harff and Ted Gurr have looked at post World War II events. Jonassohn and Björnson are critical of some of these studies, arguing that they are too expansive, and conclude that the academic discipline of genocide studies is too young to have a canon of work on which to build an academic paradigm.',
 'question': 'The two writers suggested that academics adjusted what in their different definitions to assist them in interpreting events

There are several important fields here:

- `answers`: the starting character position of the answer within the `context`, and the answer text.
- `context`: background information from which the model needs to extract the answer.
- `question`: the question a model should answer.
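
To make the relationship between these fields concrete, here is a minimal sketch using a made-up record in the same layout (the values are invented, not taken from SQuAD). It assumes `answer_start` is a character offset into `context`, which is how the preprocessing code later in this notebook treats it.

```python
# Hypothetical record in the SQuAD layout; all values are invented for illustration.
example = {
    "context": "Ada Lovelace wrote the first published algorithm in 1843.",
    "question": "When was the first algorithm published?",
    "answers": {"text": ["1843"], "answer_start": [52]},
}

start_char = example["answers"]["answer_start"][0]
answer_text = example["answers"]["text"][0]
end_char = start_char + len(answer_text)  # exclusive end offset

# Slicing the context with the character span recovers the answer text.
recovered = example["context"][start_char:end_char]
print(recovered)
```

A check like this is a cheap way to catch misaligned annotations before training.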

## Preprocess

In [6]:
"""
Summary:
  Load the tokenizer for the pre-trained 'distilbert-base-uncased' checkpoint with the Hugging Face Transformers library. The tokenizer converts raw text into the token IDs that the model consumes.
"""
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')



There are a few preprocessing steps particular to question answering tasks you should be aware of:

1. Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the `context` by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original `context` by setting
   `return_offsets_mapping=True`. This returns the character span that each token covers.
3. With the mapping in hand, now you can find the start and end tokens of the answer. Use the [sequence_ids](https://huggingface.co/docs/tokenizers/main/en/api/encoding#tokenizers.Encoding.sequence_ids) method to
   find which part of the offset corresponds to the `question` and which corresponds to the `context`.
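
The idea behind steps 2 and 3 can be sketched without any library. The toy encoder below is a hypothetical stand-in for a real tokenizer (it splits on words and punctuation, which is not how DistilBERT tokenizes), but it returns the same three things the steps rely on: tokens, per-token character offsets, and sequence ids (`None` for special tokens, `0` for the question, `1` for the context).

```python
import re

def toy_encode(question, context):
    """Toy word/punctuation tokenizer returning tokens, character offsets and sequence ids."""
    tokens, offsets, sequence_ids = ["[CLS]"], [(0, 0)], [None]
    for seq_id, text in enumerate((question, context)):
        for match in re.finditer(r"\w+|[^\w\s]", text):
            tokens.append(match.group())
            offsets.append((match.start(), match.end()))  # offsets are relative to `text`
            sequence_ids.append(seq_id)
        tokens.append("[SEP]")
        offsets.append((0, 0))
        sequence_ids.append(None)
    return tokens, offsets, sequence_ids

def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Return the first and last context-token indices covering [start_char, end_char)."""
    context_idx = [i for i, s in enumerate(sequence_ids) if s == 1]
    start_token = next(i for i in context_idx if offsets[i][0] <= start_char < offsets[i][1])
    end_token = next(i for i in context_idx if offsets[i][0] < end_char <= offsets[i][1])
    return start_token, end_token

toy_question = "What is the capital of France?"
toy_context = "France is a country in Europe. Its capital is Paris."
tokens, offsets, sequence_ids = toy_encode(toy_question, toy_context)

# The answer "Paris" occupies characters 46..51 (end exclusive) of the context.
start_token, end_token = char_span_to_token_span(offsets, sequence_ids, 46, 51)
print(tokens[start_token : end_token + 1])
```

Restricting the search to tokens whose sequence id is `1` matters because question tokens have offsets into a different string and could otherwise match by accident.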

Here is how you can create a function to truncate and map the start and end tokens of the `answer` to the `context`:

In [7]:
"""
Summary:
    This code snippet defines a function preprocess_function that is used to preprocess the input data before feeding it into a model. The function takes a dictionary of examples as input, where each example consists of a question, a context, and an answer.

    The function first tokenizes the questions and contexts using the DistilBERT tokenizer, truncating only the context when a pair exceeds 384 tokens and padding every pair to that length. It then extracts the offset mapping from the tokenizer output, which is used to map the token positions back to the original character positions in the text.

    Next, the function iterates over each example and extracts the start and end positions of the answer in the context. If the answer is not fully contained within the context, it is labeled as (0, 0). Otherwise, the start and end positions of the answer are calculated based on the offset mapping.

    Finally, the function adds the start and end positions of the answers to the inputs dictionary and returns it. This preprocessed data can then be used to train or evaluate a question-answering model.
"""
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]  # character offset where the first answer begins
        end_char = start_char + len(answer["text"][0])  # character offset just past the answer's end
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove any columns you don't need:
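
With `batched=True`, the function receives a dictionary of columns (each value is a list) rather than a single row, which is why `preprocess_function` indexes `examples["question"]` as a list. The pure-Python sketch below imitates that calling convention; `toy_batched_map` and `strip_questions` are hypothetical helpers written for illustration and are not part of 🤗 Datasets.

```python
def toy_batched_map(rows, fn, batch_size=2):
    """Call fn on dict-of-lists batches and flatten the results back into rows."""
    out = []
    for i in range(0, len(rows), batch_size):
        chunk = rows[i : i + batch_size]
        # Turn a list of row dicts into one dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        n = len(next(iter(result.values())))
        out.extend({key: values[j] for key, values in result.items()} for j in range(n))
    return out

def strip_questions(batch):
    # Operates on a whole column at once, like preprocess_function does.
    return {"question": [q.strip() for q in batch["question"]]}

rows = [{"question": " a? "}, {"question": "b?  "}, {"question": "  c?"}]
mapped = toy_batched_map(rows, strip_questions)
print(mapped)
```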

In [8]:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Now create a batch of examples using [DefaultDataCollator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DefaultDataCollator). Unlike other data collators in 🤗 Transformers, the [DefaultDataCollator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DefaultDataCollator) does not apply any additional preprocessing such as padding.
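
A collator that does no padding only works when every example already has the same length, which is the case here because the preprocessing step pads to `max_length`. The sketch below illustrates that constraint with a hypothetical `toy_collate` function and invented token ids; it is not the library's implementation.

```python
def toy_collate(features):
    """Hypothetical collator: groups each field across examples without padding."""
    lengths = {len(f["input_ids"]) for f in features}
    if len(lengths) > 1:
        raise ValueError(f"cannot batch sequences of different lengths: {sorted(lengths)}")
    return {key: [f[key] for f in features] for key in features[0]}

# Already padded to a common length (0 stands in for a padding id).
padded = [
    {"input_ids": [101, 7, 102, 0], "start_positions": 1},
    {"input_ids": [101, 8, 9, 102], "start_positions": 2},
]
batch = toy_collate(padded)

# Without padding, the examples cannot be stacked into one rectangular batch.
unpadded = [{"input_ids": [101, 7, 102]}, {"input_ids": [101, 8, 9, 102]}]
try:
    toy_collate(unpadded)
    ragged_ok = True
except ValueError:
    ragged_ok = False

print(batch["input_ids"], ragged_ok)
```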

In [9]:
"""
Summary:
  This code snippet is using the Hugging Face's Transformers library in Python. It creates an instance of the DefaultDataCollator class, which is used to collate input data into batches for training or evaluation.

  The DefaultDataCollator class does not pad or truncate inputs; it only gathers the already-tokenized features into a batch and converts them to tensors. That is sufficient here because the preprocessing function already pads every example to max_length.

  In this particular code snippet, the data_collator variable is initialized with an instance of the DefaultDataCollator class; it is passed to the Trainer later.
"""
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()  # batches features as-is; padding was done during preprocessing

## Train

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load DistilBERT with [AutoModelForQuestionAnswering](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForQuestionAnswering):

In [10]:
"""
Summary:
  This code snippet is using the Hugging Face's Transformers library in Python. It initializes a pre-trained DistilBERT model for question-answering using the AutoModelForQuestionAnswering class.

  The AutoModelForQuestionAnswering class is a wrapper around the underlying DistilBERT model and adds a question-answering head on top of it. The head is responsible for predicting the start and end positions of the answer in the context, given the question and context as input.

  In this particular code snippet, the model variable holds the pre-trained 'distilbert-base-uncased' encoder with a newly initialized question-answering head, so it must be fine-tuned before its predictions are useful. The fine-tuning is done with the Trainer class, which is also provided by the Transformers library.
"""
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")  # the QA head weights are newly initialized



Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
"""
Summary:
    This code snippet is using the Hugging Face's Transformers library in Python to fine-tune a pre-trained DistilBERT model for question-answering on a specific dataset.

    First, the TrainingArguments class is used to define the hyperparameters and settings for the training process. In this particular code snippet, the output directory for the model checkpoints is set to "qa_model", the evaluation strategy is set to "epoch", the learning rate is set to 2e-5, the batch size for training and evaluation is set to 16, the number of training epochs is set to 3, and the weight decay is set to 0.01.

    Next, the Trainer class is used to manage the training and evaluation process. In this particular code snippet, the Trainer instance is initialized with the pre-trained DistilBERT model for question-answering, the TrainingArguments instance, the tokenized training and evaluation datasets, the tokenizer, and the data collator.

    Finally, the train method of the Trainer instance is called to start the training process. The Trainer will automatically handle the data loading, batching, and forward and backward passes, as well as the evaluation and model checkpointing.
"""
training_args = TrainingArguments(
    output_dir="qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Wire the model, training arguments, datasets, tokenizer and data collator into the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad['train'],
    eval_dataset=tokenized_squad['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,2.396365
2,2.813800,1.851492
3,2.813800,1.719938


TrainOutput(global_step=750, training_loss=2.389683390299479, metrics={'train_runtime': 454.4853, 'train_samples_per_second': 26.403, 'train_steps_per_second': 1.65, 'total_flos': 1175877900288000.0, 'train_loss': 2.389683390299479, 'epoch': 3.0})
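
The numbers in the training output can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and a `logging_steps` value of 500 (the `TrainingArguments` default as far as this sketch assumes; it is not set in the cell above), so treat it as a plausibility check rather than an exact account of what the `Trainer` did.

```python
import math

train_examples = 4000   # 80% of the 5,000 loaded rows
batch_size = 16         # per_device_train_batch_size, assuming one device
epochs = 3              # num_train_epochs
logging_steps = 500     # assumption: default value, not set explicitly above

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Number of training-loss log entries written by the end of each epoch.
logs_by_epoch = [(epoch * steps_per_epoch) // logging_steps for epoch in range(1, epochs + 1)]

print(steps_per_epoch, total_steps, logs_by_epoch)
```

Under these assumptions the total matches `global_step=750`, and the absence of any log entry by the end of the first epoch would explain the `No log` cell, with the single entry from step 500 being shown for both later epochs.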

In [15]:
"""
Summary:
  This code snippet is using the Hugging Face's Transformers library in Python to save the tokenizer and the fine-tuned DistilBERT model for question-answering to disk.

  The save_pretrained methods of the tokenizer and the model save their state to the output directory "qa_model". The directory will then contain the files needed to reload both later: the tokenizer configuration and vocabulary files, and the model configuration and weights. Optimizer state is not written by save_pretrained.

  Saving the tokenizer and the model is important for reproducibility and deployment. The saved files can be used to preprocess and predict on new data, or to fine-tune the model further on new datasets.
"""
# Save the tokenizer
tokenizer.save_pretrained("qa_model")

# Save the model
model.save_pretrained("qa_model")

## Evaluate

Evaluation for question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) still calculates the evaluation loss during training so you're not completely in the dark about your model's performance.

If you have more time and you're interested in how to evaluate your model for question answering, take a look at the [Question answering](https://huggingface.co/course/chapter7/7?fw=pt#postprocessing) chapter from the 🤗 Hugging Face Course!

## Inference

In [16]:
"""
Summary:
  Define a question and a context. The answer, 'Paris', spans characters 46 to 51 (end exclusive) of the context.
"""
question = "What is the capital of France?"
context = "France is a country in Europe. Its capital is Paris."

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for question answering with your model, and pass your text to it:

In [18]:
"""
Summary:
  This code snippet is using the Hugging Face's Transformers library in Python to create a question-answering pipeline and use it to answer a question given a context.

  The pipeline function is used to create the question-answering pipeline. In this particular code snippet, the pipeline is initialized with the fine-tuned DistilBERT model for question-answering that was saved to disk in the previous step.

  The question_answerer variable is assigned the resulting pipeline instance. The pipeline is a high-level abstraction that handles the preprocessing, the model's forward pass, and the postprocessing of the input data to generate the final answer.

  Finally, the question_answerer pipeline is called with the question and the context as input. The pipeline will use the fine-tuned DistilBERT model to predict the start and end positions of the answer in the context, and extract and return the corresponding text as the final answer.
"""
from transformers import pipeline

question_answerer = pipeline("question-answering", model="qa_model")  # load the fine-tuned model saved above
question_answerer(question=question, context=context)

{'score': 0.6878398656845093, 'start': 46, 'end': 51, 'answer': 'Paris'}

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return PyTorch tensors:

In [None]:
"""
Summary:
  This code snippet is using the Hugging Face's Transformers library in Python to load the tokenizer for the fine-tuned DistilBERT model for question-answering and tokenize the input question and context.

  The AutoTokenizer class is used to automatically load the tokenizer for the specified pre-trained model. In this particular code snippet, the tokenizer is loaded from the "qa_model" directory, which was saved to disk in a previous step.

  The tokenizer function is then called with the input question and context, and the return_tensors argument is set to "pt" to convert the output to PyTorch tensors. The resulting inputs dictionary contains the tokenized and encoded input data that can be passed to the fine-tuned DistilBERT model for question-answering.
"""
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("qa_model")  # reload the tokenizer saved with the fine-tuned model
inputs = tokenizer(question, context, return_tensors="pt")

Pass your inputs to the model and return the `logits`:

In [27]:
"""
Summary:
    This code snippet is using the Hugging Face's Transformers library in Python to load the fine-tuned DistilBERT model for question-answering and use it to predict the answer to a question given a context.

    The AutoModelForQuestionAnswering class is used to automatically load the model architecture and weights for the specified pre-trained model. In this particular code snippet, the model is loaded from the "qa_model" directory, which was saved to disk in a previous step.

    The torch.no_grad() context manager is used to disable the gradient computation and save memory during the forward pass. The model function is then called with the tokenized and encoded input data that was generated in the previous step.

    The resulting outputs object contains start_logits and end_logits: one unnormalized score per input token for being the start or the end of the answer. The answer span is derived from these scores in the following steps.
"""
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("qa_model")  # reload the fine-tuned model from disk
with torch.no_grad():
    outputs = model(**inputs)  # forward pass only; no gradients are tracked

Get the highest probability from the model output for the start and end positions:

In [28]:
"""
Summary:
  This code snippet is using the PyTorch library in Python to extract the predicted start and end positions of the answer to a question from the output of a fine-tuned DistilBERT model for question-answering.

  The argmax function is used to find the index of the maximum value in the start_logits and end_logits tensors. These tensors hold unnormalized scores (logits), not probabilities, for each input token being the start and end of the answer, respectively; the token with the highest logit is also the most probable one.

  The resulting answer_start_index and answer_end_index variables contain the predicted start and end token indices within the tokenized input. These indices can be used to extract the corresponding tokens and format the final answer.
"""
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()
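
The relationship between logits, probabilities and `argmax` can be sketched in plain Python. The logits below are invented for illustration and do not come from the model; the point is that softmax preserves the ordering, so taking the argmax of the logits selects the same token as taking the argmax of the probabilities.

```python
import math

def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented per-token scores for a five-token input.
start_logits = [0.1, -1.2, 0.3, 4.0, 0.5]
end_logits = [0.0, -0.5, 0.2, 3.5, 1.0]

start_index = max(range(len(start_logits)), key=start_logits.__getitem__)
end_index = max(range(len(end_logits)), key=end_logits.__getitem__)

start_probs = softmax(start_logits)
end_probs = softmax(end_logits)

# Picking start and end independently does not guarantee a valid span.
span_is_valid = start_index <= end_index
print(start_index, end_index, span_is_valid)
```

Because the two indices are chosen independently, a robust implementation would also check that the end does not precede the start, as the last line does.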

Decode the predicted tokens to get the answer:

In [29]:
"""
Summary:
  This code snippet is using the Hugging Face's Transformers library in Python to extract the predicted answer text from the input context using the predicted start and end positions, and then decode the predicted answer tokens back to text.

  The inputs.input_ids tensor is indexed with the first batch element and the predicted start and end positions of the answer to extract the predicted answer tokens. The resulting predict_answer_tokens tensor contains the tokenized representation of the predicted answer.

  The tokenizer.decode function is then called with the predicted answer tokens to decode them back to text. The resulting string is the final predicted answer to the input question.
"""
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

'paris'
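
The manual decode returns `'paris'` while the pipeline returned `'Paris'`. That is consistent with an uncased checkpoint lowercasing text before tokenizing: decoding token ids can only give back the lowercased form, whereas the character offsets reported by the pipeline (`start` and `end`) point into the original context. The sketch below imitates this with plain string operations; it is an illustration of the idea and does not call the tokenizer.

```python
original_context = "France is a country in Europe. Its capital is Paris."

# An uncased tokenizer sees a lowercased copy of the text.
normalized = original_context.lower()
decoded_answer = "paris"  # what decoding the predicted token ids yields

# Locate the answer in the normalized text, then slice the original with the same offsets.
start = normalized.index(decoded_answer)
end = start + len(decoded_answer)
answer_with_casing = original_context[start:end]

print(start, end, answer_with_casing)
```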