# [KerasNLP] question answering by span labeling with BERT

**Author:** [Apoorv Nandan](https://twitter.com/NandanApoorv), updated by [Usha Rengaraju](https://www.linkedin.com/in/usha-rengaraju-b570b7a2/)<br>
**Date created:** 2023/06/21<br>
**Last modified:** 2023/06/21<br>
**Description:** Fine-tune a pretrained BERT model from KerasNLP on SQuAD.

## Introduction

This notebook demonstrates how to find the span of text in a paragraph that answers a question, using KerasNLP. KerasNLP is a highly modular library for natural language processing, with state-of-the-art preset weights and out-of-the-box architectures that can be customized when needed.

In this example, we fine-tune a BERT model to perform text extraction as follows:

1. Feed the context and the question as inputs to BERT.
2. Take two vectors S and T with dimensions equal to that of
   hidden states in BERT.
3. Compute the probability of each token being the start and end of
   the answer span. The probability of a token being the start of
   the answer is given by a dot product between S and the representation
   of the token in the last layer of BERT, followed by a softmax over all tokens.
   The probability of a token being the end of the answer is computed
   similarly with the vector T.
4. Fine-tune BERT and learn S and T along the way.
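
Steps 2 and 3 can be illustrated with a small pure-Python sketch. The hidden states and the vectors `S` and `T` below are made-up toy values, not BERT outputs; the only purpose is to show the dot product followed by a softmax over all tokens.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def span_probabilities(hidden_states, S, T):
    # One score per token: dot product of the token vector with S (or T).
    start_scores = [sum(h * s for h, s in zip(tok, S)) for tok in hidden_states]
    end_scores = [sum(h * t for h, t in zip(tok, T)) for tok in hidden_states]
    return softmax(start_scores), softmax(end_scores)


# Toy example: 4 tokens with hidden size 3 (values are illustrative only).
hidden_states = [
    [0.1, 0.0, 0.2],
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.1, 0.1],
]
S = [2.0, 0.0, 0.0]  # hypothetical learned start vector
T = [0.0, 2.0, 0.0]  # hypothetical learned end vector

start_probs, end_probs = span_probabilities(hidden_states, S, T)
pred_start = start_probs.index(max(start_probs))
pred_end = end_probs.index(max(end_probs))
```

In the Keras model defined later in this notebook, the two `Dense(1)` layers play the role of `S` and `T` (each with an additional bias term).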

In [None]:
!pip install -q keras-nlp


In [None]:
import os
import re
import json
import string
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import keras_nlp

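## Set up the BERT tokenizer
The `BertTokenizer` preset from KerasNLP supplies the WordPiece vocabulary and the special token ids. The text itself is tokenized with `FastWordpieceTokenizer` from TensorFlow Text, which converts strings to token ids and also returns the character offsets of each token.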

In [None]:
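import tensorflow_text as tf_text

max_len = 384
# The KerasNLP preset supplies the vocabulary and the special token ids.
tok = keras_nlp.models.BertTokenizer.from_preset("bert_base_en_uncased", lowercase=True)
# FastWordpieceTokenizer also exposes token character offsets and detokenization.
tokenizer = tf_text.FastWordpieceTokenizer(tok.vocabulary, support_detokenization=True)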
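
WordPiece splits each word greedily into the longest pieces found in the vocabulary and maps a word to `[UNK]` when no split exists. The sketch below is a simplified pure-Python version of that idea with a tiny made-up vocabulary; it is not the `FastWordpieceTokenizer` implementation. It also shows why a vocabulary of lowercase tokens produces `[UNK]` for capitalized words unless the text is lowercased first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: repeatedly take the longest piece in vocab.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no valid split: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Tiny illustrative vocabulary (not the real BERT vocabulary).
toy_vocab = {"super", "bowl", "play", "##ing", "##ed"}

lower = [p for w in "super bowl playing".split() for p in wordpiece(w, toy_vocab)]
cased = [p for w in "Super Bowl".split() for p in wordpiece(w, toy_vocab)]
```

Here `lower` contains only known pieces, while every word in `cased` falls back to `[UNK]` because the toy vocabulary has no capitalized entries.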

## Load the data

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. The SQuAD v1.1 train and dev JSON files are downloaded with `keras.utils.get_file`.
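
The nesting of the SQuAD v1.1 JSON file, as it is traversed by the preprocessing code later in this notebook, looks roughly like the hand-written record below. The article, context, question, and id are invented for illustration; only the key names come from the dataset.

```python
# A hand-written record that mimics the SQuAD v1.1 layout (content is made up).
sample = {
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "Keras is a deep learning API written in Python.",
                    "qas": [
                        {
                            "id": "hypothetical-id-0",
                            "question": "What language is Keras written in?",
                            "answers": [{"text": "Python", "answer_start": 40}],
                        }
                    ],
                }
            ],
        }
    ]
}

# Walk the nesting: article -> paragraph -> question/answer pair.
records = []
for article in sample["data"]:
    for para in article["paragraphs"]:
        context = para["context"]
        for qa in para["qas"]:
            answer = qa["answers"][0]
            start = answer["answer_start"]
            end = start + len(answer["text"])
            # The answer is a character span of the context.
            records.append((qa["question"], context[start:end]))
```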

In [None]:
train_data_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json"
train_path = keras.utils.get_file("train.json", train_data_url)
eval_data_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
eval_path = keras.utils.get_file("eval.json", eval_data_url)


Downloading data from https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
Downloading data from https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json


In [None]:
# No trailing commas here: they would turn each value into a one-element tuple.
start_value = tok.cls_token_id
end_value = tok.sep_token_id
packer = keras_nlp.layers.MultiSegmentPacker(max_len, start_value, end_value)
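
`MultiSegmentPacker` joins the segments into one fixed-length sequence: a start token, the first segment, a separator, the second segment, another separator, then padding. It also returns segment ids that mark which segment each position belongs to. A simplified pure-Python sketch of that layout is shown below. The ids 101 and 102 are assumed to be the `[CLS]` and `[SEP]` ids of the uncased BERT vocabulary, the other token ids are made up, and truncation is left out for brevity (the real layer truncates inputs that are too long).

```python
def pack_two_segments(seg_a, seg_b, max_len, cls_id=101, sep_id=102, pad_id=0):
    # Layout: [CLS] seg_a [SEP] seg_b [SEP] [PAD]...
    token_ids = [cls_id] + seg_a + [sep_id] + seg_b + [sep_id]
    # Segment 0 covers [CLS], seg_a and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(seg_a) + 2) + [1] * (len(seg_b) + 1)
    if len(token_ids) > max_len:
        raise ValueError("this sketch does not implement truncation")
    pad = max_len - len(token_ids)
    token_ids += [pad_id] * pad
    segment_ids += [0] * pad  # padding positions get segment id 0
    padding_mask = [tid != pad_id for tid in token_ids]
    return token_ids, segment_ids, padding_mask


context_ids = [7, 8, 9]  # made-up ids for the context tokens
question_ids = [4, 5]  # made-up ids for the question tokens
token_ids, segment_ids, padding_mask = pack_two_segments(
    context_ids, question_ids, max_len=10
)
```

As in this notebook, the context is packed as the first segment and the question as the second.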

## Preprocess the data

1. Go through the JSON file and store every record as a `SquadExample` object.
2. Go through each `SquadExample` and create `x_train, y_train, x_eval, y_eval`.
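
The core of the preprocessing is converting the character-level answer span from SQuAD into token indices, using the character offsets of each token. The sketch below shows the idea with a whitespace tokenizer standing in for WordPiece; `whitespace_tokens_with_offsets` is a hypothetical helper written for this illustration, and the sentence is made up.

```python
import re


def whitespace_tokens_with_offsets(text):
    # Stand-in tokenizer: each run of non-space characters is a token,
    # returned with its [start, end) character offsets.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(text, start_char, answer):
    end_char = start_char + len(answer)
    tokens = whitespace_tokens_with_offsets(text)
    # A token belongs to the answer if its character range overlaps the answer's.
    hits = [i for i, (_, s, e) in enumerate(tokens) if s < end_char and e > start_char]
    if not hits:
        return None  # no overlapping token; such an example would be skipped
    return hits[0], hits[-1]


context = "The game was played at Levi's Stadium in California."
answer = "Levi's Stadium"
span = char_span_to_token_span(context, context.index(answer), answer)
```

The resulting pair of token indices is what the model is trained to predict as the start and end of the answer.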

In [None]:
from tqdm import tqdm
class SquadExample:
    def __init__(self, question, context, start_char_idx, answer_text, all_answers):
        self.question = question
        self.context = context
        self.start_char_idx = start_char_idx
        self.answer_text = answer_text
        self.all_answers = all_answers
        self.skip = False

    def preprocess(self):
        context = self.context
        question = self.question
        answer_text = self.answer_text
        start_char_idx = self.start_char_idx

        context = " ".join(str(context).split())
        question = " ".join(str(question).split())
        answer = " ".join(str(answer_text).split())

        end_char_idx = start_char_idx + len(answer)
        if end_char_idx >= len(context):
            self.skip = True
            return

        is_char_in_ans = [0] * len(context)
        for idx in range(start_char_idx, end_char_idx):
            is_char_in_ans[idx] = 1

        tokenized_context = tokenizer.tokenize_with_offsets(context)

        ans_token_idx = []
        # A token belongs to the answer if any of its characters lies in the answer span.
        for idx, (_, start, end) in enumerate(
            zip(tokenized_context[0], tokenized_context[1], tokenized_context[2])
        ):
            if sum(is_char_in_ans[start:end]) > 0:
                ans_token_idx.append(idx)

        if len(ans_token_idx) == 0:
            self.skip = True
            return

        start_token_idx = ans_token_idx[0]
        end_token_idx = ans_token_idx[-1]

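        tokenized_question = tokenizer.tokenize_with_offsets(question)

        # Pack as [CLS] context [SEP] question [SEP]; note that the `[1:]`
        # slice leaves out the first question token.
        packed = packer((tokenized_context[0], tokenized_question[0][1:]))
        input_ids = packed[0]
        token_type_ids = packed[1]
        # Padding positions have token id 0.
        padding_mask = input_ids != 0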


        self.input_ids = input_ids
        self.token_type_ids = token_type_ids
        self.padding_mask = padding_mask
        self.start_token_idx = start_token_idx
        self.end_token_idx = end_token_idx
        self.context_token_to_char = tuple(zip(tokenized_context[1],tokenized_context[2]))


with open(train_path) as f:
    raw_train_data = json.load(f)

with open(eval_path) as f:
    raw_eval_data = json.load(f)


def create_squad_examples(raw_data):
    squad_examples = []
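    # Only the first article and its first three paragraphs are used here to
    # keep the demonstration fast; remove the slices to process the full dataset.
    for item in tqdm(raw_data["data"][:1]):
        for para in item["paragraphs"][:3]: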
            context = para["context"]
            for qa in para["qas"]:
                question = qa["question"]
                answer_text = qa["answers"][0]["text"]
                all_answers = [_["text"] for _ in qa["answers"]]
                start_char_idx = qa["answers"][0]["answer_start"]
                squad_eg = SquadExample(
                    question, context, start_char_idx, answer_text, all_answers
                )
                squad_eg.preprocess()
                squad_examples.append(squad_eg)
    return squad_examples


def create_inputs_targets(squad_examples):
    dataset_dict = {
        "input_ids": [],
        "token_type_ids": [],
        "padding_mask": [],
        "start_token_idx": [],
        "end_token_idx": [],
    }
    for item in squad_examples:
        if not item.skip:
            for key in dataset_dict:
                dataset_dict[key].append(getattr(item, key))
    for key in dataset_dict:
        dataset_dict[key] = np.array(dataset_dict[key])

    x = [
        dataset_dict["input_ids"],
        dataset_dict["token_type_ids"],
        dataset_dict["padding_mask"],
    ]
    y = [dataset_dict["start_token_idx"], dataset_dict["end_token_idx"]]
    return x, y


train_squad_examples = create_squad_examples(raw_train_data)
x_train, y_train = create_inputs_targets(train_squad_examples)
print(f"{len(train_squad_examples)} training points created.")

eval_squad_examples = create_squad_examples(raw_eval_data)
x_eval, y_eval = create_inputs_targets(eval_squad_examples)
print(f"{len(eval_squad_examples)} evaluation points created.")

100%|██████████| 1/1 [00:09<00:00,  9.54s/it]


15 training points created.


100%|██████████| 1/1 [00:26<00:00, 26.50s/it]

80 evaluation points created.





Create the question-answering model using `BertBackbone` from KerasNLP, which encodes the input tokens into dense features that can be used in downstream tasks.

In [None]:
def create_model():
    ## BERT encoder
    encoder = keras_nlp.models.BertBackbone.from_preset("bert_base_en_uncased")

    ## QA Model
    input_ids = layers.Input(shape=(max_len,), dtype=tf.int32)
    token_type_ids = layers.Input(shape=(max_len,), dtype=tf.int32)
    attention_mask = layers.Input(shape=(max_len,), dtype=tf.int32)
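    # Pass the inputs by name so each tensor reaches the matching backbone input.
    embedding = encoder(
        {
            "token_ids": input_ids,
            "segment_ids": token_type_ids,
            "padding_mask": attention_mask,
        }
    )["sequence_output"]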

    start_logits = layers.Dense(1, name="start_logit",kernel_initializer='he_uniform')(embedding)
    start_logits = layers.Flatten()(start_logits)

    end_logits = layers.Dense(1, name="end_logit",kernel_initializer='he_uniform')(embedding)
    end_logits = layers.Flatten()(end_logits)

    start_probs = layers.Activation(keras.activations.softmax)(start_logits)
    end_probs = layers.Activation(keras.activations.softmax)(end_logits)

    model = keras.Model(
        inputs=[input_ids, token_type_ids, attention_mask],
        outputs=[start_probs, end_probs],
    )
    loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    optimizer = keras.optimizers.Adam(learning_rate=1e-6, clipnorm=1.0)
    model.compile(optimizer=optimizer, loss=[loss, loss])
    return model
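
The model is compiled with one sparse categorical cross-entropy loss per output, so the total loss for an example is the sum of the negative log-probabilities assigned to the true start and true end tokens. A small pure-Python sketch of that computation, using made-up probabilities rather than model outputs:

```python
import math


def span_loss(start_probs, end_probs, true_start, true_end):
    # Sparse categorical cross-entropy for one example: -log p[true index].
    start_loss = -math.log(start_probs[true_start])
    end_loss = -math.log(end_probs[true_end])
    # With two outputs and default loss weights, the two losses are summed.
    return start_loss + end_loss


# Made-up distributions over 4 token positions.
start_probs = [0.1, 0.7, 0.1, 0.1]
end_probs = [0.1, 0.1, 0.6, 0.2]

good = span_loss(start_probs, end_probs, true_start=1, true_end=2)
bad = span_loss(start_probs, end_probs, true_start=0, true_end=3)
```

The loss is lower (`good`) when the true span coincides with the positions that receive the most probability, and higher (`bad`) otherwise.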

In [None]:
model = create_model()

model.summary()

Model: "model_11"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_34 (InputLayer)       [(None, 384)]                0         []                            
                                                                                                  
 input_35 (InputLayer)       [(None, 384)]                0         []                            
                                                                                                  
 input_36 (InputLayer)       [(None, 384)]                0         []                            
                                                                                                  
 bert_backbone_11 (BertBack  {'sequence_output': (None,   1094822   ['input_34[0][0]',            
 bone)                        None, 768),                 40         'input_35[0][0]',     

## Connect to the TPU

In [None]:
try: # detect TPUs
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect() # TPU detection
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError: # detect GPUs
    strategy = tf.distribute.MirroredStrategy() # for GPU or multi-GPU machines
print("Number of accelerators: ", strategy.num_replicas_in_sync)

# Create model
with strategy.scope():
    model = create_model()

model.summary()

## Create evaluation Callback

This callback will compute the exact match score using the validation data
after every epoch.

In [None]:
def normalize_text(text):
    text = text.lower()

    # Remove punctuations
    exclude = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in exclude)

    # Remove articles
    regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
    text = re.sub(regex, " ", text)

    # Remove extra white space
    text = " ".join(text.split())
    return text


class ExactMatch(keras.callbacks.Callback):
    """Computes the exact match score on the evaluation set after each epoch."""

    def __init__(self, x_eval, y_eval):
        super().__init__()
        self.x_eval = x_eval
        self.y_eval = y_eval

    def on_epoch_end(self, epoch, logs=None):
        pred_start, pred_end = self.model.predict(self.x_eval)
        count = 0
        eval_examples_no_skip = [_ for _ in eval_squad_examples if not _.skip]
        for idx, (start, end) in enumerate(zip(pred_start, pred_end)):
            squad_eg = eval_examples_no_skip[idx]
            offsets = squad_eg.context_token_to_char
            start = np.argmax(start)
            end = np.argmax(end)
            if start >= len(offsets):
                continue
            pred_char_start = offsets[start][0]
            if end < len(offsets):
                pred_char_end = offsets[end][1]
                pred_ans = squad_eg.context[pred_char_start:pred_char_end]
            else:
                pred_ans = squad_eg.context[pred_char_start:]

            normalized_pred_ans = normalize_text(pred_ans)
            normalized_true_ans = [normalize_text(_) for _ in squad_eg.all_answers]
            if normalized_pred_ans in normalized_true_ans:
                count += 1
        acc = count / len(self.y_eval[0])
        print(f"\nepoch={epoch+1}, exact match score={acc:.2f}")
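
As a quick illustration of the metric, the sketch below applies the same normalization steps to a few predictions and computes the exact match score. The predictions and reference answers are invented for this example and do not come from the model.

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_text(text):
    # Lowercase, strip punctuation, drop articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(predictions, references):
    # A prediction is correct if it equals any normalized reference answer.
    hits = sum(
        normalize_text(pred) in [normalize_text(ref) for ref in refs]
        for pred, refs in zip(predictions, references)
    )
    return hits / len(predictions)


predictions = ["The Denver Broncos", "Santa Clara", "gold"]
references = [["Denver Broncos"], ["Santa Clara, California"], ["Gold", "golden"]]
score = exact_match(predictions, references)
```

The first and third predictions match a reference after normalization, while the second is only a partial match and therefore counts as wrong.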

## Train and Evaluate

In [None]:
exact_match_callback = ExactMatch(x_eval, y_eval)
model.fit(
    x_train,
    y_train,
    epochs=1,  # For demonstration, 3 epochs are recommended
    verbose=1,
    batch_size=2,
    callbacks=[exact_match_callback],
)




epoch=1, exact match score=0.00



In [None]:
pred = model.predict(x_eval)



In [None]:
pred_start, pred_end = pred
count = 0
eval_examples_no_skip = [_ for _ in eval_squad_examples if _.skip == False]
for idx, (start, end) in enumerate(zip(pred_start[:5], pred_end[:5])):
    squad_eg = eval_examples_no_skip[idx]
    offsets = squad_eg.context_token_to_char
    start = np.argmax(start)
    end = np.argmax(end)
    if start >= len(offsets):
        continue
    pred_char_start = offsets[start][0]
    if end < len(offsets):
        pred_char_end = offsets[end][1]
        pred_ans = squad_eg.context[pred_char_start:pred_char_end]
    else:
        pred_ans = squad_eg.context[pred_char_start:]
    cont = tokenizer.detokenize(x_eval[0][idx])
    context,question,_ = cont.numpy().decode().split('[SEP]')
    print('context: ',context)
    print('question: ',question)
    print('answer: ',pred_ans)

context:  [CLS] [UNK] [UNK] 50 was an [UNK] football game to determine the champion of the [UNK] [UNK] [UNK] ( [UNK] ) for the 2015 season . [UNK] [UNK] [UNK] [UNK] ( [UNK] ) champion [UNK] [UNK] defeated the [UNK] [UNK] [UNK] ( [UNK] ) champion [UNK] [UNK] 24 – 10 to earn their third [UNK] [UNK] title . [UNK] game was played on [UNK] 7 , 2016 , at [UNK] ' s [UNK] in the [UNK] [UNK] [UNK] [UNK] at [UNK] [UNK] , [UNK] . [UNK] this was the 50th [UNK] [UNK] , the league emphasized the " golden anniversary " with various gold - themed initiatives , as well as temporarily suspending the tradition of naming each [UNK] [UNK] game with [UNK] numerals ( under which the game would have been known as " [UNK] [UNK] [UNK] " ) , so that the logo could prominently feature the [UNK] numerals 50 . 
question:   [UNK] team represented the [UNK] at [UNK] [UNK] 50 ? 
answer:  Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The

## References

https://keras.io/examples/nlp/text_extraction_with_bert/