Reference: https://huggingface.co/docs/transformers/tasks/question_answering

In [None]:
# !pip install transformers datasets evaluate
# !pip install transformers[torch]
# !pip install accelerate -U
# !pip install datasets
# !pip install evaluate
# !pip install git-lfs

In [1]:
from huggingface_hub import notebook_login

notebook_login() # Enter your token to login

# Load SQuAD dataset

In [2]:
from datasets import load_dataset

squad = load_dataset("squad", split="train[:5000]")

In [3]:
# Split the dataset's train split into a train and test set with the train_test_split method

squad = squad.train_test_split(test_size=0.2)

In [4]:
# Then take a look at an example
squad["train"][0]

{'id': '56bfc420a10cfb14005512c2',
 'title': 'Beyoncé',
 'context': 'Described as being "sexy, seductive and provocative" when performing on stage, Beyoncé has said that she originally created the alter ego "Sasha Fierce" to keep that stage persona separate from who she really is. She described Sasha as being "too aggressive, too strong, too sassy [and] too sexy", stating, "I\'m not like her in real life at all." Sasha was conceived during the making of "Crazy in Love", and Beyoncé introduced her with the release of her 2008 album I Am... Sasha Fierce. In February 2010, she announced in an interview with Allure magazine that she was comfortable enough with herself to no longer need Sasha Fierce. However, Beyoncé announced in May 2012 that she would bring her back for her Revel Presents: Beyoncé Live shows later that month.',
 'question': 'Later what did she say about Sasha?',
 'answers': {'text': ['she would bring her back'], 'answer_start': [679]}}

There are several important fields here:

* answers: the answer text and the character position in the context where the answer starts.
* context: background information from which the model needs to extract the answer.
* question: the question a model should answer.
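To make the answers field concrete: answer_start is a character offset into context, so slicing the context from that offset for the length of the answer text gives the answer back. A minimal sketch with a made-up record (the strings are illustrative and not taken from SQuAD):

```python
# Hypothetical record in the same shape as a SQuAD example.
example = {
    "context": "The Eiffel Tower was completed in 1889 in Paris.",
    "question": "When was the Eiffel Tower completed?",
    "answers": {"text": ["1889"], "answer_start": [34]},
}

start = example["answers"]["answer_start"][0]
text = example["answers"]["text"][0]
end = start + len(text)  # exclusive end offset

# The character span [start, end) of the context is the answer text.
recovered = example["context"][start:end]
print(recovered)  # 1889
```

The same slice on the real example above should return 'she would bring her back'.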

# Preprocess

In [5]:
# The next step is to load a DistilBERT tokenizer to process the question and context fields

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

There are a few preprocessing steps particular to question answering tasks you should be aware of:

* Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the context by setting truncation="only_second".
* Next, map the start and end positions of the answer to the original context by setting return_offsets_mapping=True.
* With the mapping in hand, you can find the start and end tokens of the answer. Use the sequence_ids method to find which part of the offset corresponds to the question and which corresponds to the context.
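The mapping from a character span to a token span can be tried out without a tokenizer. The sketch below uses hand-written offsets and sequence ids for the question "Who?" and the context "Ada wrote code"; the helper name char_span_to_token_span and the toy tokenization are assumptions made for illustration, and the helper labels an answer (0, 0) when it is not fully inside the context.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token).

    offsets: (char_start, char_end) for every token
    sequence_ids: None for special tokens, 0 for question, 1 for context
    """
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == 1]
    context_start, context_end = context_tokens[0], context_tokens[-1]

    # Answer not fully inside the (possibly truncated) context
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token that starts at or before start_char
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after end_char
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Toy tokenization: [CLS] who ? [SEP] ada wrote code [SEP]
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 3), (4, 9), (10, 14), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]

print(char_span_to_token_span(offsets, sequence_ids, 0, 3))    # "Ada" -> (4, 4)
print(char_span_to_token_span(offsets, sequence_ids, 4, 14))   # "wrote code" -> (5, 6)
print(char_span_to_token_span(offsets, sequence_ids, 20, 25))  # outside -> (0, 0)
```

Note that the offsets of question tokens and context tokens both start at 0, which is why sequence_ids is needed to tell them apart.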

In [6]:
# Function to truncate and map the start and end tokens of the answer to the context

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [7]:
# To apply the preprocessing function over the entire dataset, use Datasets map function. You can speed up the map 
# function by setting batched=True to process multiple elements of the dataset at once. Remove any unneeded columns

tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Now create a batch of examples using DefaultDataCollator. Unlike other data collators in 🤗 Transformers, the DefaultDataCollator does not apply any additional preprocessing such as padding.
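Because the collator does not pad, every feature in a batch must already have the same length, which is why preprocess_function pads with padding="max_length". The sketch below illustrates the idea with a hypothetical stack_features helper; it is a plain-Python stand-in written for this note, not the Transformers implementation, and the token ids are made up.

```python
def stack_features(features):
    """Hypothetical stand-in for a collator that only stacks and never pads."""
    batch = {}
    for key in features[0]:
        rows = [f[key] for f in features]
        if isinstance(rows[0], list):
            lengths = {len(r) for r in rows}
            if len(lengths) != 1:
                raise ValueError(f"{key!r} has ragged lengths {sorted(lengths)}; pad first")
        batch[key] = rows
    return batch


# Already padded to the same length: stacking works.
padded = [
    {"input_ids": [101, 2040, 102, 0], "start_positions": 1},
    {"input_ids": [101, 2040, 2001, 102], "start_positions": 2},
]
batch = stack_features(padded)

# Different lengths: stacking fails, so padding has to happen earlier.
ragged = [
    {"input_ids": [101, 2040, 102], "start_positions": 1},
    {"input_ids": [101, 2040, 2001, 102], "start_positions": 2},
]
try:
    stack_features(ragged)
    ragged_ok = True
except ValueError:
    ragged_ok = False

print(ragged_ok)  # False
```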

## Training with PyTorch

In [8]:
# PyTorch

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

In [9]:
# # TensorFlow

# from transformers import DefaultDataCollator

# data_collator = DefaultDataCollator(return_tensors="tf")

# Train

If you aren't familiar with finetuning a model with the Trainer (https://huggingface.co/docs/transformers/v4.32.1/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial here (https://huggingface.co/docs/transformers/training#train-with-pytorch-trainer)!

In [10]:
# Begin training by loading DistilBERT with AutoModelForQuestionAnswering
# (https://huggingface.co/docs/transformers/v4.32.1/en/model_doc/auto#transformers.AutoModelForQuestionAnswering)

from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

* Define your training hyperparameters in TrainingArguments (https://huggingface.co/docs/transformers/v4.32.1/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is output_dir which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
* Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
* Call train() (https://huggingface.co/docs/transformers/v4.32.1/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [11]:
# These two lines are TensorFlow-specific; the PyTorch Trainer run below does not need them
from tensorflow.python.ops.numpy_ops import np_config
np_config.enable_numpy_behavior()

In [12]:
training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,2.121683
2,2.651600,1.631942
3,2.651600,1.558441


TrainOutput(global_step=750, training_loss=2.2196768798828126, metrics={'train_runtime': 15394.3558, 'train_samples_per_second': 0.78, 'train_steps_per_second': 0.049, 'total_flos': 1175877900288000.0, 'train_loss': 2.2196768798828126, 'epoch': 3.0})
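The numbers in this output can be checked by hand. A small sketch, assuming a single device so that the effective batch size equals per_device_train_batch_size:

```python
import math

num_train_examples = 4000  # 80% of the 5000-example slice
batch_size = 16            # per_device_train_batch_size, assuming one device
num_epochs = 3

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 250 750

train_runtime = 15394.3558  # seconds, from the TrainOutput above
samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime
print(round(samples_per_second, 2), round(steps_per_second, 3))  # 0.78 0.049
```

This also likely explains the "No log" entry for epoch 1 and the repeated 2.651600: assuming the default logging_steps of 500, the training loss is first logged at step 500, which falls in epoch 2, and no further log happens before training stops at step 750.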

## Training with TensorFlow

### TBD: the TensorFlow code below still has some errors

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

In [13]:
# from transformers import DefaultDataCollator

# data_collator = DefaultDataCollator(return_tensors="tf")

In [14]:
# from transformers import create_optimizer

# batch_size = 16
# num_epochs = 2
# total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
# optimizer, schedule = create_optimizer(
#     init_lr=2e-5,
#     num_warmup_steps=0,
#     num_train_steps=total_train_steps,
# )
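To see what such a schedule does to the learning rate, here is a plain-Python sketch. It assumes the schedule decays linearly from init_lr to zero over num_train_steps with no warmup; that is my reading of the create_optimizer defaults, so treat it as an assumption rather than a description of the library internals. The helper name linear_decay_lr is hypothetical.

```python
def linear_decay_lr(step, init_lr=2e-5, total_steps=500):
    """Learning rate after `step` updates under linear decay to zero (no warmup)."""
    step = min(step, total_steps)  # stay at the final value once the schedule ends
    return init_lr * (1 - step / total_steps)


# Same formula as in the cell above, with 4000 training examples
total_train_steps = (4000 // 16) * 2
print(total_train_steps)  # 500

for step in (0, 250, 500, 750):
    print(step, linear_decay_lr(step, total_steps=total_train_steps))
```

If the real schedule behaves like this sketch, training for more epochs than num_epochs would spend the extra epochs at a learning rate of zero, so the epochs value passed to fit should match num_epochs.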

Then you can load DistilBERT with TFAutoModelForQuestionAnswering:

In [31]:
# from transformers import TFAutoModelForQuestionAnswering, TFAutoModel

# model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset() (https://huggingface.co/docs/transformers/v4.32.1/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset):

In [16]:
# tf_train_set = model.prepare_tf_dataset(
#     tokenized_squad["train"],
#     shuffle=True,
#     batch_size=16,
#     collate_fn=data_collator,
# )

# tf_validation_set = model.prepare_tf_dataset(
#     tokenized_squad["test"],
#     shuffle=False,
#     batch_size=16,
#     collate_fn=data_collator,
# )

Configure the model for training with compile (https://keras.io/api/models/model_training_apis/#compile-method):

In [23]:
# import tensorflow as tf

# model.compile(optimizer=optimizer)

The last thing to set up before you start training is a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the PushToHubCallback (https://huggingface.co/docs/transformers/v4.32.1/en/main_classes/keras_callbacks#transformers.PushToHubCallback):

In [21]:
# from transformers.keras_callbacks import PushToHubCallback

# callback = PushToHubCallback(
#     output_dir="my_awesome_qa_model",
#     tokenizer=tokenizer,
# )

In [22]:
# model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=[callback])

For a more in-depth example of how to finetune a model for question answering, take a look at the corresponding PyTorch notebook (https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb) or TensorFlow notebook (https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb).

# Evaluate

Evaluation for question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The Trainer (https://huggingface.co/docs/transformers/v4.32.1/en/main_classes/trainer#transformers.Trainer) still calculates the evaluation loss during training so you're not completely in the dark about your model's performance.
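One part of that postprocessing is turning the model's start and end logits into a valid answer span. The sketch below is a simplified illustration, not the official evaluation script: the helper best_span, the max_answer_len value, and the logits are all made up for the example.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) token pair with the highest combined score.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best, best_score = (0, 0), float("-inf")
    for i, start_score in enumerate(start_logits):
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score


# Hypothetical logits for a 5-token input
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1]
end_logits = [0.0, 2.5, 0.3, 2.0, 0.4]

# Taking each argmax independently gives start=2, end=1, which is not a valid span
naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
span, score = best_span(start_logits, end_logits)
print(naive, span, score)  # (2, 1) (2, 3) 5.0
```

A full evaluation additionally maps the token span back to text through the offset mapping and compares it with the reference answers.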

# Inference

Come up with a question and some context you'd like the model to predict on:

In [28]:
question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

The simplest way to try out your finetuned model for inference is to use it in a pipeline() (https://huggingface.co/docs/transformers/v4.32.1/en/main_classes/pipelines#transformers.pipeline). Instantiate a pipeline for question answering with your model, and pass your text to it:

In [29]:
from transformers import pipeline

question_answerer = pipeline("question-answering", model="my_awesome_qa_model")
question_answerer(question=question, context=context)

{'score': 0.15511634945869446,
 'start': 58,
 'end': 95,
 'answer': '46 languages natural languages and 13'}
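The start and end values in this result are character offsets into the context, with end exclusive. A quick check in plain Python, using the result shown above:

```python
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."
result = {"start": 58, "end": 95, "answer": "46 languages natural languages and 13"}

# Slicing the context with the returned offsets reproduces the answer string
span = context[result["start"]:result["end"]]
print(span == result["answer"])  # True

# Where the ideal answer "13" sits in the same context
ideal_start = context.index("13 programming")
print(ideal_start, context[ideal_start:ideal_start + 2])  # 93 13
```

So the predicted span ends at the right place but starts too early: it contains the correct answer "13" along with the preceding phrase.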

In [34]:
# # Automate Question Answering

# from transformers import pipeline

# question_answerer1 = pipeline("question-answering", model="my_awesome_qa_model")
# # The pipeline needs both a question and a context to extract the answer from
# answer = question_answerer1(question=input("Ask your question: "), context=context)