<a href="https://colab.research.google.com/github/srvmishra/Language-Models/blob/main/Extractive_QA_SQuAD_dataset_BERT_base_cased.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notes on QA:
1. Any dataset with `context`, `question`, and `answer` columns can be used for QA.
2. Extractive QA: the answer is a span picked out of the given context, so the model predicts the start and end token indices. Encoder-type models are used for this.
3. Generative QA: the answer is generated as free text, using encoder-decoder models.
4. In SQuAD, each training question has one answer, while a validation question can have multiple answers.
5. Labels for extractive QA mark which context token starts the answer and which token ends it. This makes extractive QA a token classification problem similar to NER, except that the dataset does not contain the text split into words.
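
As a small illustration of note 5, the record below mimics the SQuAD layout. The field names follow the dataset, but the text is invented for this sketch and is not a real SQuAD sample. The answer is stored as text plus a character offset, so the character span is recovered first and only later converted to token indices.

```python
# Hypothetical SQuAD-style record: field names follow the dataset, content is made up.
context = "The notebook fine-tunes BERT on SQuAD for extractive QA."
answer_text = "BERT"
example = {
    "context": context,
    "question": "Which model is fine-tuned?",
    "answers": {"text": [answer_text], "answer_start": [context.index(answer_text)]},
}

# Character span of the answer inside the context (end index is exclusive).
start_char = example["answers"]["answer_start"][0]
end_char = start_char + len(example["answers"]["text"][0])

# Slicing the context with the span must reproduce the answer text exactly.
answer_span = example["context"][start_char:end_char]
print(start_char, end_char, answer_span)
```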

Two steps matter in any NLP task when using HuggingFace: processing the data, and computing metrics specific to the task, dataset, and model.

In [1]:
!pip install evaluate
!pip install datasets



### Imports

In [2]:
import numpy as np
import pandas as pd
import markdown
from tqdm.auto import tqdm
import collections

import torch
import torch.nn as nn

from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import TrainingArguments, Trainer
from huggingface_hub import notebook_login

### Load Dataset

In [3]:
raw_dataset = load_dataset('squad')
raw_dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

### Preprocess Training Samples

We use the `bert-base-cased` model.

Tokenization and answer-label construction are model specific: they depend on which side the tokenizer pads (left vs right), which special tokens it adds (e.g. `[CLS]`, `[SEP]`), what their ids are, and where they are placed in the `input_ids` resulting from tokenization.
1. How the tokenizer is called depends on the task:

   i. QA: question and context are tokenized together, `tokenizer(question, context)`.

   ii. Multilingual summarization: the same tokenizer is used for all languages; each language is tokenized separately and the results are concatenated into a combined dataset.

   iii. Translation: the source-language text is tokenized normally, and the target-language text is passed through the `text_target` argument. Both sentences go into the same tokenizer call; only the keyword argument differs.
2. Dealing with long context:
  i. `return_overflowing_tokens = True` makes the tokenizer return the tokens that overflow `max_length` as additional features instead of discarding them.

  ii. `stride` specifies the token overlap between the multiple cuts obtained from the same string.

  iii. `truncation = "only_second"` truncates only the second sequence, i.e. the context: we assume the question is short while the context can be long.
3. Locating answer indices:
  i. `return_offsets_mapping = True` makes the tokenizer return, for every token, its (start, end) character span in the original text.

  ii. `overflow_to_sample_mapping` records which input sample each feature (chunk) came from.

  iii. The dataset gives the answer text and its starting character index. Adding the answer's length in characters to the start index gives the end character index of the answer inside the context. The `offset_mapping` then converts both character indices to token indices.

  iv. There are 3 cases for a feature. a. The answer is not in this feature's context, so the label is `(0, 0)`, the position of the `[CLS]` token. This may be different for different models. b. The answer lies entirely inside the feature, so the start and end token indices are read off directly from it. c. The answer is cut by the feature boundary, so only part of it is present. The code below labels this case `(0, 0)` as well; because of the `stride` overlap, a neighbouring feature usually contains the full answer.

  v. Since question and context are tokenized in a single call, the `input_ids` contain the question tokens followed by the context tokens. So we first determine where the context tokens start and end inside the `input_ids`, using the `sequence_ids()` method of the tokenizer output. We could have used `token_type_ids`, but not all models provide these.

4. Since most contexts are long and truncation yields several features per sample, we pad everything to `max_length` instead of padding dynamically with a data collator.
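
The effect of `stride` in point 2 can be sketched without a tokenizer. The helper below is a simplified, hypothetical illustration (the name `sliding_windows` and the toy numbers are not part of the notebook): it assumes `window` is the number of context tokens that fit in one feature (for BERT, roughly `max_length` minus the question tokens and the three special tokens) and that `stride` is the number of tokens shared by consecutive chunks. The tokenizer's own handling of the final chunk may differ in detail.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (start, end) token ranges; consecutive ranges share `stride` tokens."""
    if stride >= window:
        raise ValueError("stride must be smaller than the window")
    step = window - stride  # how far each new chunk advances
    spans = []
    start = 0
    while True:
        end = min(start + window, num_tokens)
        spans.append((start, end))
        if end == num_tokens:  # the context is fully covered
            break
        start += step
    return spans

# Toy numbers: 10 context tokens, room for 4 per feature, 2 tokens of overlap.
spans = sliding_windows(num_tokens=10, window=4, stride=2)
print(spans)  # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

Each original sample therefore becomes several features, which is why `overflow_to_sample_mapping` is needed to trace a feature back to its sample.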

In [4]:
model_ckpt = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [5]:
''' Taken from chapter 7.7 Question Answering from the HuggingFace NLP course '''

max_length = 384
stride = 128


def preprocess_training_examples(examples):
    # remove extra spaces that some questions in the dataset have
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [6]:
train_dataset = raw_dataset['train'].map(preprocess_training_examples, batched=True,
                                         remove_columns=raw_dataset['train'].column_names)
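
To see what the labelling loop in `preprocess_training_examples` does, it can help to run the same logic on a hand-written feature. Everything below is a toy sketch: the helper name `label_feature`, the sentence, and the offsets are made up, and the offsets are written by hand rather than produced by the BERT tokenizer.

```python
def label_feature(offsets, sequence_ids, start_char, end_char):
    """Return (start_token, end_token) for one feature, or (0, 0) if the answer is cut off."""
    # Context tokens are the ones whose sequence id is 1.
    context_idx = [i for i, s in enumerate(sequence_ids) if s == 1]
    context_start, context_end = context_idx[0], context_idx[-1]

    # Answer not fully inside this feature's context window: label with [CLS].
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Last context token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # First context token (from the right) that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token

# Toy feature: [CLS] Who ? [SEP] Ada Lovelace wrote it [SEP]
context = "Ada Lovelace wrote it"
start_char, end_char = 0, len("Ada Lovelace")
full = label_feature(
    offsets=[(0, 0), (0, 3), (3, 4), (0, 0), (0, 3), (4, 12), (13, 18), (19, 21), (0, 0)],
    sequence_ids=[None, 0, 0, None, 1, 1, 1, 1, None],
    start_char=start_char, end_char=end_char,
)
# A second chunk of the same sample that only contains "wrote it".
cut = label_feature(
    offsets=[(0, 0), (0, 3), (3, 4), (0, 0), (13, 18), (19, 21), (0, 0)],
    sequence_ids=[None, 0, 0, None, 1, 1, None],
    start_char=start_char, end_char=end_char,
)
print(full, cut)  # (4, 5) (0, 0)
```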

### Preprocess Validation and Test Samples

Processing the validation and test sets:
1. The validation loss by itself says little about answer quality, so we do not compute the labels (the start and end indices of the answer) for the validation and test sets.
2. Model predictions have to be matched back to the provided context. The model predicts the start and end token indices of the answer, so we need the `offset_mapping` for the token-to-character mapping, and a way to map each feature to the original sample it comes from. We store the `id` field of the samples for this, and use `sequence_ids()` and `overflow_to_sample_mapping` to build both.
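
The two bookkeeping steps in point 2 can be sketched on toy data. The features below are hypothetical and hand-written (the name `mask_non_context` and all ids and offsets are assumptions for this sketch). Note that question tokens also carry offsets, but into the question string, so they can collide with context offsets; this is why every non-context offset is replaced with `None`.

```python
from collections import defaultdict

def mask_non_context(offsets, sequence_ids):
    """Keep offsets of context tokens (sequence id 1); replace all others with None."""
    return [o if s == 1 else None for o, s in zip(offsets, sequence_ids)]

# Hypothetical features: two chunks cut from example "q1" and one from "q2".
# Layout of each toy feature: [CLS] question [SEP] context context [SEP]
toy_features = [
    {"example_id": "q1", "sequence_ids": [None, 0, None, 1, 1, None],
     "offset_mapping": [(0, 0), (0, 3), (0, 0), (0, 3), (4, 12), (0, 0)]},
    {"example_id": "q1", "sequence_ids": [None, 0, None, 1, 1, None],
     "offset_mapping": [(0, 0), (0, 3), (0, 0), (13, 18), (19, 21), (0, 0)]},
    {"example_id": "q2", "sequence_ids": [None, 0, None, 1, 1, None],
     "offset_mapping": [(0, 0), (0, 4), (0, 0), (0, 5), (6, 9), (0, 0)]},
]

# Group feature indices by the example they were cut from, masking offsets on the way.
example_to_features = defaultdict(list)
for idx, feature in enumerate(toy_features):
    feature["offset_mapping"] = mask_non_context(
        feature["offset_mapping"], feature["sequence_ids"]
    )
    example_to_features[feature["example_id"]].append(idx)

print(dict(example_to_features))          # {'q1': [0, 1], 'q2': [2]}
print(toy_features[0]["offset_mapping"])  # [None, None, None, (0, 3), (4, 12), None]
```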

In [7]:
''' Taken from chapter 7.7 Question Answering from the HuggingFace NLP course '''

def preprocess_validation_examples(examples):
    # remove extra spaces that some questions in the dataset have
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

In [8]:
validation_dataset = raw_dataset['validation'].map(preprocess_validation_examples, batched=True,
                                                   remove_columns=raw_dataset['validation'].column_names)

### Computing Metrics

1. Since we have padded every sample to a `max_length`, we do not need to use a data collator.
2. The `compute_metrics` function of `Trainer` expects a single tuple-like object of eval predictions that contains the model predictions and the actual labels. The function below takes different arguments, so we cannot use it during training with `Trainer`, and we set `evaluation_strategy="no"` in the `TrainingArguments`.
3. Basic idea for matching:

  i. Model outputs contain `start_logits` and `end_logits`.

  ii. We look for the top k best (start, end) combinations. A span is scored by `start_logit + end_logit`, which within a feature ranks spans the same way as the product of the start and end probabilities.

  iii. We filter out combinations with negative length (`end_index < start_index`), answers that exceed a predefined maximum length, and answers that are not fully inside the context.
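
The matching idea in point 3 can be tried on a single toy feature. This is a minimal pure-Python sketch with made-up logits and offsets (the helper name `best_span` and the small `n_best` and `max_answer_length` values are assumptions chosen for the example). It shows why masking matters: the highest raw score belongs to the `[CLS]` position, but that candidate is skipped because its offset is `None`.

```python
def best_span(start_logit, end_logit, offsets, n_best=3, max_answer_length=3):
    """Return (score, start_index, end_index) of the best valid span, or None."""
    # Indices of the n_best largest start and end logits.
    start_idx = sorted(range(len(start_logit)), key=lambda i: start_logit[i], reverse=True)[:n_best]
    end_idx = sorted(range(len(end_logit)), key=lambda i: end_logit[i], reverse=True)[:n_best]
    best = None
    for s in start_idx:
        for e in end_idx:
            # Skip spans that touch a non-context token.
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans with negative length or longer than max_answer_length.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logit[s] + end_logit[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best

# Toy feature: [CLS] [question] Ada Lovelace wrote [SEP]
context = "Ada Lovelace wrote it"
offsets = [None, None, (0, 3), (4, 12), (13, 18), None]
start_logit = [5.0, 0.1, 4.0, 1.0, 0.2, 0.0]
end_logit = [4.5, 0.0, 0.5, 3.0, 1.0, 0.1]

best = best_span(start_logit, end_logit, offsets)
score, s, e = best
predicted_text = context[offsets[s][0]:offsets[e][1]]
print(best, predicted_text)  # (7.0, 2, 3) Ada Lovelace
```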

In [9]:
''' Taken from chapter 7.7 Question Answering from the HuggingFace NLP course '''

n_best = 20
max_answer_length = 30
metrics = evaluate.load('squad')

def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        # In case there is no matching answer inside context
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metrics.compute(predictions=predicted_answers, references=theoretical_answers)
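
The `squad` metric loaded above reports `exact_match` and `f1`. As a rough guide to what these numbers mean, here is a simplified sketch of the SQuAD v1.1 definitions: answers are normalized (lowercased, punctuation and articles removed), F1 is computed over the tokens shared by prediction and reference, and each score is the maximum over all reference answers. The function names are hypothetical, and the `evaluate` implementation remains the authoritative one (it also reports percentages rather than fractions).

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction, truth):
    """Token-level F1 between a normalized prediction and one reference."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def score(prediction, references):
    """Best exact match and best F1 over all reference answers."""
    em = max(float(normalize(prediction) == normalize(r)) for r in references)
    f1 = max(f1_score(prediction, r) for r in references)
    return em, f1

# Made-up prediction and references, for illustration only.
references = ["Eiffel Tower", "Eiffel Tower in Paris"]
exact = score("the Eiffel Tower", references)
partial = score("Tower", references)
print(exact, partial)
```

Taking the maximum over references is what makes the multiple validation answers per question useful: a prediction only needs to match one of them.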

### Finetuning Model

In [10]:
model = AutoModelForQuestionAnswering.from_pretrained(model_ckpt)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
notebook_login()

In [12]:
batch_size = 64
logging_steps = len(train_dataset)//batch_size
model_name = f'srvmishra832/SQuAD-extractive_QA-{model_ckpt}'

training_arguments = TrainingArguments(output_dir=model_name, learning_rate=2e-5,
                                       num_train_epochs=3, log_level='error',
                                       per_device_train_batch_size=batch_size,
                                       per_device_eval_batch_size=batch_size,
                                       evaluation_strategy='no', disable_tqdm=False,
                                       weight_decay=0.01, save_strategy='epoch',
                                       fp16=True, push_to_hub=True, logging_steps=logging_steps)



In [13]:
trainer = Trainer(model=model, args=training_arguments,
                  train_dataset=train_dataset, eval_dataset=validation_dataset,
                  tokenizer=tokenizer)


In [14]:
trainer.train()
trainer.push_to_hub()


| Step | Training Loss |
|-----:|--------------:|
| 1386 | 1.4623 |
| 2772 | 0.9165 |
| 4158 | 0.7578 |


CommitInfo(commit_url='https://huggingface.co/srvmishra832/SQuAD-extractive_QA-bert-base-cased/commit/467f3ff7775675d5bca3253c0833de640ea17cc1', commit_message='End of training', commit_description='', oid='467f3ff7775675d5bca3253c0833de640ea17cc1', pr_url=None, repo_url=RepoUrl('https://huggingface.co/srvmishra832/SQuAD-extractive_QA-bert-base-cased', endpoint='https://huggingface.co', repo_type='model', repo_id='srvmishra832/SQuAD-extractive_QA-bert-base-cased'), pr_revision=None, pr_num=None)

In [16]:
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions
results = compute_metrics(start_logits, end_logits, validation_dataset, raw_dataset['validation'])
print(pd.DataFrame.from_dict(results, orient='index').to_markdown())

|             |       0 |
|:------------|--------:|
| exact_match | 80.246  |
| f1          | 87.6648 |


More to do:
1. Generative QA
2. Extractive QA with a different model (XLNet)
3. Write a new subclass of `Trainer` to work with the `compute_metrics` function defined here.