# SQuAD Model based on BERT

In this project, I will train a SQuAD model based on the BERT architecture. The goal is to fine-tune a pre-trained BERT model on the SQuAD (Stanford Question Answering Dataset) to enable it to answer questions based on a given context.

**Overview of the Training Process:**

Model: We will be using the base version of BERT, specifically bert-base-uncased, which is one of the most popular variants of the BERT architecture. The base version of BERT has 12 layers (transformer blocks), 12 attention heads, and a total of 110 million parameters. 

**Dataset:** 

The SQuAD dataset, a widely used benchmark for training and evaluating question-answering models, will be used in this project. SQuAD contains over 100,000 question-answer pairs, where each question is paired with a specific passage from which the answer can be extracted.

### Set the Directory


In [1]:
import os
# Current Directory
current_directory = os.getcwd()
# Save the Data Path
data_path = os.path.join(current_directory, 'Data')

## Data Processing

### Tokenizer

Loading the Bert Tokenizer which is specifically designed for tokenizing text data in a way that aligns with the BERT model's architecture

In [2]:
from transformers import BertTokenizer

# Loading the Bert Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

### Vocab

Get the vocabulary used for train the Bert Model, it's necessary to process the training and testing data.

In [3]:
# Get the VOCAB from BERT
vocab = tokenizer.get_vocab()

# Save the vocab as txr
with open(os.path.join(data_path, "vocab.txt"), "w", encoding='utf-8') as file:
    for token, index in vocab.items():
        file.write(f"{token}\n")


The  generate_tf_record_from_json_file converts the SQuAD training data and the vocabulary into a TFRecord format, which is a format commonly used in TensorFlow for efficient data storage and input pipeline processing. 


max_seq_length = 384: Defines the maximum sequence length for each input example. The sequences (comprising tokens) that exceed this length will be truncated, and shorter sequences will be padded

In [4]:
from official.nlp.data.squad_lib import generate_tf_record_from_json_file

# Convert train data and vocab into a TFRecord format
input_meta_data = generate_tf_record_from_json_file(
    input_file_path = os.path.join(data_path, "train-v1.1.json"),      # training data
    vocab_file_path = os.path.join(data_path, "vocab.txt"),            # vocabulary
    output_path     = os.path.join(data_path, "train-v1.1.tf_record"), # output file name
    max_seq_length  = 384
)

The create_squad_dataset method is used to construct a training dataset for a BERT-based model using the SQuAD data, which has been previously converted into TFRecord format.

In [5]:
from official.legacy.bert.input_pipeline import create_squad_dataset 

BATCH_SIZE = 4

# Max Secuence (Max number of tokens for context + question) in the training data
max_seq_len = input_meta_data['max_seq_length']

# Construction of the training dataset
dataset = create_squad_dataset(
    file_path   = os.path.join(data_path, "train-v1.1.tf_record"),
    seq_length  = max_seq_len,
    batch_size  = BATCH_SIZE,
    is_training = True
)

Create a custom function to split the dataset into training and validation data.
- To simplify the training cycle, 1500 batches (6000 samples) will be used
- The split is 80% for training and 20% for validation

In [6]:
def train_test_split(dataset, nb_batches, train_prop):
    # Number of training batches
    NB_BATCHES_TRAIN = int(nb_batches * train_prop)
    # Number of validation batches
    NB_BATCHES_VAL = int(nb_batches - NB_BATCHES_TRAIN)
    # Trainig and Validation datasets
    train_dataset = dataset.take(NB_BATCHES_TRAIN)
    val_dataset = dataset.skip(NB_BATCHES_VAL).take(NB_BATCHES_VAL)

    return train_dataset, val_dataset

NB_BATCHES = 1500
TRAIN_PROP = 0.8

NB_BATCHES_TRAIN = int(NB_BATCHES * TRAIN_PROP)

train_dataset, val_dataset = train_test_split(dataset=dataset, nb_batches=NB_BATCHES, train_prop=TRAIN_PROP)


## Model 

For constructing the model, I will use a pre-trained BERT model as the initial layer. The final layer of the model will output two tensors: the first tensor will contain the start_logits, and the second tensor will contain the end_logits. These logits represent the predicted start and end positions for each word in the context, respectively.


### BertSquadLayer
Construction of the final layer

In [8]:
import tensorflow as tf

class BertSquadLayer(tf.keras.layers.Layer):
    def __init__(self):
        super(BertSquadLayer, self).__init__()
        # Final Dense Layer to get the logit for each word in the sentence to be the start and end sentence position 
        self.final_dense = tf.keras.layers.Dense(
              units = 2,
              kernel_initializer = tf.keras.initializers.TruncatedNormal(stddev=0.02)) # Initializate the Params 
    
    def call(self, inputs):
      # Pass throught the final dense the output of the Bert Model 
      logits = self.final_dense(inputs) # (batch_size, seq_len, 2)
    
      # Reshape the logits (the first dim is 2, the firstone for the start position and the second one for the end position)
      logits = tf.transpose(logits, [2, 0, 1]) # (2, batch_size, seq_len)
    
      # Unstack the logits
      unstacked_logits = tf.unstack(logits, axis = 0) # [(batch_size, seq_len), (batch_size, seq_len)]
      
      return unstacked_logits[0], unstacked_logits[1]


### BERTSquad

Is the complete SQuAD Model

In [9]:

from transformers import TFBertModel

class BERTSquad(tf.keras.Model):

    def __init__(self,
                 name="bert_squad"):
        super(BERTSquad, self).__init__(name=name)

        # Create the Bert Layer
        self.bert_layer  = TFBertModel.from_pretrained('bert-base-uncased')
        
        # Final Dense
        self.squad_layer = BertSquadLayer()

    def apply_bert(self, inputs):
        # Obtain Bert Output
        output = self.bert_layer([inputs["input_word_ids"],
                                  inputs["input_mask"],
                                  inputs["input_type_ids"]])
        
        # Obtain the attention for each word in the context
        sequence_output = output.last_hidden_state
        
        return sequence_output

    def call(self, inputs):
        # Apply the bert layer to the input
        seq_output = self.apply_bert(inputs)
        # Get the logits for each context-word after apply the squad_layer layer 
        start_logits, end_logits = self.squad_layer(seq_output)

        return start_logits, end_logits


In [18]:
# Instance of the Bert Model
bert_squad = BERTSquad()

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


### Optimizer

In [19]:
from official.nlp import optimization

# Set the learning rate and the warmup stems
INIT_LR = 5e-5
WARMUP_STEPS = int((NB_BATCHES_TRAIN) * 0.1)

# Initialize the optimizer
optimizer = optimization.create_optimizer(
    init_lr          = INIT_LR,
    num_train_steps  = NB_BATCHES_TRAIN,
    num_warmup_steps = WARMUP_STEPS)

### Loss

The `squad_loss_fn` function calculates the loss for the SQuAD model.

Finally, the `loss` metric is initialized to keep track of the average loss during model training.

In [20]:
# Loss Function
def squad_loss_fn(input, model_outputs):
    # Get the start and end position
    start_positions = input['start_positions']
    end_positions = input['end_positions']
    
    # Get the model output 
    start_logits, end_logits = model_outputs

    # Calculate the loss for the start and end position
    start_loss = tf.keras.backend.sparse_categorical_crossentropy(
        start_positions, start_logits, from_logits = True)   #from_logits = True. Indica que la salida del modelo no es una probabilidad
    
    end_loss = tf.keras.backend.sparse_categorical_crossentropy(
        end_positions, end_logits, from_logits = True)

    # get the final loss
    # reduce_mean: Calculate the mean loss of the start and end tensors
    total_loss = (tf.reduce_mean(start_loss) + tf.reduce_mean(end_loss)) / 2

    return total_loss

# Initialize the loss
loss = tf.keras.metrics.Mean(name = "loss")

In [21]:
# Compile the model
bert_squad.compile(optimizer,
                   squad_loss_fn)

In [22]:
# Function to calculate the time between epochs
def epoch_time(start, end):
    elapsed_time = end - start
    hours = int(elapsed_time // 3600)
    minutes = int((elapsed_time % 3600) // 60)
    seconds = int(elapsed_time % 60)
    time = f'{hours}h {minutes}m {seconds}s'
    
    return time
    

Function to evaluate the model on the validation dataset

In [26]:
def evaluate(test_data, model, loss):

    loss.reset_state()

    for (inputs, targets) in test_data:

        model_outputs = model(inputs, training=False)
        test_loss = squad_loss_fn(targets, model_outputs)

        loss.update_state(test_loss)
    
    return loss.result().numpy()

Custom trainer to train the SQuAD model

In [24]:
def trainer(n_epochs, train_data, model, loss, val_data=None):
    import time

    # Training and Validation loss lists
    train_loss_list = []
    val_loss_list = []    

    
    for epoch in range(n_epochs):
        print(f"Inicio del Epoch {epoch+1}\n")
        
        # Get the initial time for the epoch
        start = time.time()
        # Restore accumulated loss
        loss.reset_state()
        
        for (batch, (inputs, targets)) in enumerate(train_data):
            
            # Compute the gradients
            with tf.GradientTape() as tape:
                # Get the output and loss
                model_outputs = model(inputs, training=True)
                train_loss_batch = squad_loss_fn(targets, model_outputs)

            # Calculate gradients and update parameters
            gradients = tape.gradient(train_loss_batch, model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))

            # Train loss update
            loss.update_state(train_loss_batch)
            # Get the final time for the epoch
            end = time.time()

            # Print results every 50 batches
            if (batch % 50 == 0) & (batch != 0):
                
                print(f"""Batch {batch:4} | Train: Loss {loss.result():.4f}""")

        print("")
        # Get the current training loss 
        train_loss = loss.result().numpy()
        print(f"Tiempo total para entrenar la epoch {epoch+1}: {epoch_time(start, end)}\n ")
        print(f"""- Train: Loss {train_loss:5.4f}""")

        # Restore the loss
        loss.reset_state()

        # Evaluate the model in the validation dataset
        if val_data is not None:
            val_loss = evaluate(val_data, model, loss)
            print(f"""- Val: Loss {val_loss:5.4f}""")
            print("")
            print(f"-----" * 5)
            val_loss_list.append(val_loss)

        train_loss_list.append(train_loss)

    if val_data is not None:
        return train_loss_list, val_loss_list
    else:
        return train_loss_list, []

### Training

Train the model for 3 epochs

In [25]:
import tensorflow as tf

# Set the logger of TensorFlow to avoid warnings
tf.get_logger().setLevel('ERROR')

NB_EPOCHS = 2

train_results, val_results = trainer(n_epochs   = NB_EPOCHS,
                                     train_data = train_dataset, 
                                     model      = bert_squad,
                                     loss       = loss,
                                     val_data   = val_dataset)

Inicio del Epoch 1

Batch   50 | Train: Loss 5.7525
Batch  100 | Train: Loss 5.0917
Batch  150 | Train: Loss 4.4206
Batch  200 | Train: Loss 3.9761
Batch  250 | Train: Loss 3.6454
Batch  300 | Train: Loss 3.4034
Batch  350 | Train: Loss 3.2752
Batch  400 | Train: Loss 3.1275
Batch  450 | Train: Loss 2.9635
Batch  500 | Train: Loss 2.8342
Batch  550 | Train: Loss 2.7228
Batch  600 | Train: Loss 2.6434
Batch  650 | Train: Loss 2.5663
Batch  700 | Train: Loss 2.4961
Batch  750 | Train: Loss 2.4346
Batch  800 | Train: Loss 2.3804
Batch  850 | Train: Loss 2.3368
Batch  900 | Train: Loss 2.2837
Batch  950 | Train: Loss 2.2381
Batch 1000 | Train: Loss 2.1865
Batch 1050 | Train: Loss 2.1396
Batch 1100 | Train: Loss 2.0965
Batch 1150 | Train: Loss 2.0518

Tiempo total para entrenar la epoch 1: 3h 9m 36s
 
- Train: Loss 2.0122
- Val: Loss 0.6834

-------------------------
Inicio del Epoch 2

Batch   50 | Train: Loss 1.4100
Batch  100 | Train: Loss 1.3054
Batch  150 | Train: Loss 1.1690
Batch  20

###  Model Saving

This section of the code ensures that the directory structure for saving the model exists and then saves the model's weights to that directory.

In [43]:
# Create the model path
model_path = "Model/tf2/tensorflow/1"
# Get the parent directory
parent_directory = os.path.dirname(current_directory)

# A Model foldel will be created if doesn't exist
if 'Model' not in os.listdir(parent_directory):
    os.makedirs(os.path.join(parent_directory, model_path))

# Save SQuAD Model
bert_squad.save(os.path.join(parent_directory, model_path))



## Evaluation


In this section, I will test the model using the SQuAD validation dataset to evaluate its performance. Specifically, we will use the `dev-v1.1.json` file, which is the SQuAD validation set containing 10,570 examples. 

The process involves the following steps:
1. **Load the Testing Dataset**: We read and parse the `dev-v1.1.json` file to obtain the examples that will be used for evaluation.
2. **Model Prediction**: The model, which was trained on the SQuAD training set, will generate predictions on these examples.
3. **Evaluation Metrics**: I will then calculate performance metrics such as Exact Match (EM) and F1 score to assess how well the model's predictions match the ground truth answers in the validation set.


Read the testing samples:

In [44]:
from official.nlp.data.squad_lib import read_squad_examples

# Read the all examples
all_examples = read_squad_examples(
    input_file  = os.path.join(data_path, "dev-v1.1.json"),   # Testing dataset file
    is_training = False,                                     
    version_2_with_negative = False)          

Creation fo the TFrecord file to store the samples

In [45]:
from official.nlp.data.squad_lib import FeatureWriter

# Create the Features in the TFRecord eval file (now it's empty)
eval_writer = FeatureWriter(
    filename    = os.path.join(data_path, "eval.tf_record"), # path and name where the TFRecord is saved
    is_training = False)  

The method 'convert_examples_to_features' converts the samples into a features to be processed and ready to use for| the SQuAD model 

In [46]:
from official.nlp.data.squad_lib import convert_examples_to_features

# Function to append features in the eval_writer TFRecord file
# If the feature has padding will be added into eval_writer. Otherwise into the all_features

def _append_feature(feature, is_padding):
    if not is_padding:
        all_features.append(feature)
    eval_writer.process_feature(feature)

BATCH_SIZE_VAL = 4

all_features = []

dataset_size = convert_examples_to_features(
    examples         = all_examples,                       # Input examples to convert in Features 
    tokenizer        = tokenizer,                          # Tokenizer to process the examples
    max_seq_length   = input_meta_data['max_seq_length'],  # Max input seq len
    doc_stride       = 128,                                # doc_stride=128 to split the secuences longer than max_seq_len every 128 tokens
    max_query_length = 64,                                 # Max query lenght
    is_training      = False,                              # Is no training
    output_fn        = _append_feature,                    # Function to save or process the Features
    batch_size       = BATCH_SIZE_VAL)                                  

eval_writer.close()

In [47]:
# Get the number of batches for the Testing Dataset
NB_BATCHES_TEST = dataset_size // BATCH_SIZE_VAL

print(f"The batch size for testing is {NB_BATCHES_TEST}")

The batch size for testing is 2709


In [48]:
BATCH_SIZE = 4

# Create the eval dataset from the TFRecord eval file
eval_dataset = create_squad_dataset(
    file_path   = os.path.join(data_path, "eval.tf_record"), # path where the TFRecord is saved 
    seq_length  = input_meta_data['max_seq_length'],         # Max seq lenght
    batch_size  = BATCH_SIZE,                                          
    is_training = False)

The next code is used to process model predictions and store the results in a structured format.

In [49]:
import collections

# Create a named tuple collection
RawResult = collections.namedtuple("RawResult",
                                   ["unique_id", "start_logits", "end_logits"])

# Create a function to store the tokens and logits in RawResult
def get_raw_results(predictions):
    for unique_ids, start_logits, end_logits in zip(predictions['unique_ids'],
                                                    predictions['start_logits'],
                                                    predictions['end_logits']):
        yield RawResult(
            unique_id    = unique_ids.numpy(),
            start_logits = start_logits.numpy().tolist(),
            end_logits   = end_logits.numpy().tolist())

# Create all_results list to append the tokens and logits
all_results = []
for count, inputs in enumerate(eval_dataset):
    x, _ = inputs
    # Delete the ids from the input 'x'
    unique_ids = x.pop("unique_ids")
    # Get the logits from the squad bert model
    start_logits, end_logits = bert_squad(x, training=False)

    # Save results in a dictionary
    output_dict = dict(
        unique_ids   = unique_ids,
        start_logits = start_logits,
        end_logits   = end_logits
    )

    # Append results in the all_results list
    for result in get_raw_results(output_dict):
        all_results.append(result)
    if count % 100 == 0:
        print(f"{count:4}/{NB_BATCHES_TEST}")

   0/2709
 100/2709
 200/2709
 300/2709
 400/2709
 500/2709
 600/2709
 700/2709
 800/2709
 900/2709
1000/2709
1100/2709
1200/2709
1300/2709
1400/2709
1500/2709
1600/2709
1700/2709
1800/2709
1900/2709
2000/2709
2100/2709
2200/2709
2300/2709
2400/2709
2500/2709
2600/2709
2700/2709


Save the predictions for each sample

In [52]:
from official.nlp.data.squad_lib import write_predictions

# Output file for the predictions
output_prediction_file    = os.path.join(data_path, "predictions.json")
# Output with the number of best answers
output_nbest_file         = os.path.join(data_path, "nbest_predictions.json")
# Output file for the null log odds
output_null_log_odds_file = os.path.join(data_path, "null_odds.json")

# Write the predictions into a JSON file
write_predictions(
    all_examples = all_examples,                              # Raw sample (question + context)
    all_features = all_features,                              # Features (tokens, attention_mask, type_id)
    all_results  = all_results,                               # logits
    n_best_size  = 20,                                        # number of best answers
    max_answer_length = 30,                                   # Max lenght of the answer
    do_lower_case = True,                                     # Convert text to lower
    output_prediction_file = output_prediction_file,          # Save prediction JSON file
    output_nbest_file  = output_nbest_file,                   # Save n best answer for each question
    output_null_log_odds_file = output_null_log_odds_file,    # Save the probability of  don't answer
    verbose = False)

In [1]:
# Test Evaluation
!python ./Data/evaluate-v1.1.py ./Data/dev-v1.1.json ./Data/predictions.json

{"exact_match": 62.87606433301798, "f1": 73.38428365761662}
