# BERT (Graded)

Welcome to your graded programming assignment on BERT! In this task, you will delve into the exciting realm of **Question Answering** by leveraging the power of the BERT model. This hands-on assignment will guide you through building a model that can interpret and answer questions based on a given set of texts.

You will be utilizing the [SQuAD (Stanford Question Answering Dataset)](https://rajpurkar.github.io/SQuAD-explorer/) for this purpose. SQuAD is a widely recognized dataset in the field of natural language processing, consisting of questions posed on a set of Wikipedia articles, where the goal is to extract the answer to these questions from the provided context.

Your task is to create a functional Question Answering system which uses the capabilities of BERT to understand and respond accurately to questions based on contextual information. By the end of this assignment, you will have deepened your understanding of how BERT works and how it can be applied to solve real-world NLP problems.

**Instructions:**
* Do not modify any of the existing code.
* Only write code when prompted. For example, in some sections you will find one of the following markers:
```
# YOUR CODE GOES HERE
# YOUR CODE STARTS HERE
# TODO
```
Only modify those sections of the code.
* You will find **REFLECTION** under a few code cells, where you are asked to write your thoughts on, or interpretations of, the outputs.


**You will learn to:**

* Understand the architecture of BERT and its application in NLP tasks.
* Preprocess datasets to be compatible with BERT inputs.
* Implement a BERT-based question answering system using TensorFlow.
* Fine-tune a pre-trained BERT model on the SQuAD dataset.
* Analyze and reflect on the outputs of your model and the effectiveness of your preprocessing and training steps.

# Question Answering using BERT

In [None]:
import tensorflow as tf
from datasets import load_dataset
import numpy as np
from tqdm.auto import tqdm

from helpers import *
from tests import *

## Data Preparation

### Data loading

In [None]:
from datasets import load_dataset

# Load the SQuAD dataset
squad_dataset = load_dataset("squad")

In [None]:
# Inspect the dataset
print(squad_dataset["train"][0])

#### Dataset Representation
SQuAD consists of a large number of question-answer pairs, where each question is paired with a corresponding context paragraph from Wikipedia. The answer to each question is a span of text within the context paragraph.

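Here's how a SQuAD record is represented when loaded with `load_dataset("squad")`. Note that `answers` holds parallel lists, which is why the preprocessing code indexes them with `[0]`:

```python
{
  "id": "Example ID",
  "title": "Article Title",
  "context": "Context paragraph",
  "question": "Question",
  "answers": {
    "text": ["Answer Text"],
    "answer_start": [123]
  }
}
```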
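
Because `answer_start` is a character offset into `context`, the answer text can be recovered by slicing. The following is a minimal sketch with a made-up record (its contents are illustrative, not taken from SQuAD); it follows the list-valued `answers` layout that the preprocessing code indexes with `answer["answer_start"][0]`. The helper name `answer_char_span` is hypothetical.

```python
# Hypothetical record shaped like one SQuAD example (contents are made up).
example = {
    "context": "Hamlet is a tragedy written by William Shakespeare.",
    "question": "Who wrote Hamlet?",
    "answers": {"text": ["William Shakespeare"], "answer_start": [31]},
}


def answer_char_span(record):
    """Return (start_char, end_char) of the first listed answer."""
    start_char = record["answers"]["answer_start"][0]
    end_char = start_char + len(record["answers"]["text"][0])
    return start_char, end_char


start_char, end_char = answer_char_span(example)
recovered = example["context"][start_char:end_char]
print(start_char, end_char, repr(recovered))
```

The same `start_char` and `end_char` values are what the preprocessing step later converts into token positions.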

### Data Preprocessing

#### Initialize the tokenizer

We're going to initialize the `distilbert-base-uncased` tokenizer.

**DistilBERT:**

* A distilled version of BERT, a powerful language model developed by Google AI.
* Smaller and faster than BERT, while maintaining a significant portion of its performance.
* Well-suited for various NLP tasks like text classification, question answering, and text generation.

```python
DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
```

* `DistilBertTokenizerFast`: This class is a fast tokenizer implementation for DistilBERT. It leverages the tokenizers library for efficient tokenization.
* `from_pretrained('distilbert-base-uncased')`: Loads the pre-trained tokenizer for this checkpoint through the Hugging Face Transformers library. The `'distilbert-base-uncased'` string names the checkpoint whose vocabulary and lowercasing behavior are used.

In [None]:
from transformers import DistilBertTokenizerFast

# Load the pre-trained tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Tokenize a text input
text = "This is a sample text to be tokenized."
tokens = tokenizer(text)

print(tokens)

This prints a dictionary-like `BatchEncoding` with two entries: `input_ids` (the token IDs, including the special `[CLS]` and `[SEP]` tokens) and `attention_mask` (1 for real tokens, 0 for padding).

Loading the pre-trained tokenizer guarantees that text is split and mapped to IDs using the same vocabulary the DistilBERT checkpoint was trained with, so you do not have to build or train a tokenizer yourself.

#### Preprocessing steps

1. **Strip Whitespace:**

   - **Purpose:** Cleans the data by removing any leading or trailing whitespace from each question and context string, ensuring consistency before tokenization.

2. **Tokenization Using BERT Tokenizer:**

   - **Tokenization:** Transforms the text data into tokens that BERT can process.
   - **Settings:**
     - `max_length=384`: Ensures the sequences do not exceed 384 tokens, a practical length that fits within BERT's constraints of 512 tokens (considering special tokens added by BERT).
     - `truncation="only_second"`: Truncates only the second sequence (the context), never the question, if the combined length of the question and context exceeds `max_length`.
     - `stride=128`: Allows a sliding window approach by overlapping context parts by 128 tokens, improving the chance of context coverage.
     - `return_overflowing_tokens=True`: Keeps all splits from truncated contexts, resulting in multiple input sets for contexts longer than `max_length`.
     - `return_offsets_mapping=True`: Returns offsets showing the start and end character positions of each token, crucial for mapping back to original text.
     - `padding="max_length"`: Pads sequences to `max_length`, ensuring uniform input size for model batches.

3. **Offset Mapping and Answer Positioning:**
   ```python
   offset_mapping = inputs.pop("offset_mapping")
   sample_map = inputs.pop("overflow_to_sample_mapping")
   answers = examples["answers"]
   start_positions = []
   end_positions = []
   ```
   - **offset_mapping:** Gives, for each token, the (start, end) character span it covers in the original text, which is used to locate answer positions in the context.
   - **sample_map:** Connects each split of overflowing tokens back to the original sample.
   - **Answers Extraction:** Prepares to determine the start and end token positions of each answer within the tokenized inputs.

4. **Determine Start and End Positions of Answers:**
   ```python
   for i, offset in enumerate(offset_mapping):
       sample_idx = sample_map[i]
       answer = answers[sample_idx]
       start_char = answer["answer_start"][0]
       end_char = start_char + len(answer["text"][0])
       sequence_ids = inputs.sequence_ids(i)
       context_start = sequence_ids.index(1)
       context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)  # index of the last context token
   ```
   - **Sequence IDs:** Identifies which tokens belong to the question and which to the context.
   - **Locate Answers:** Utilize the character offsets to locate the respective tokens within the context span.

5. **Handling Answers Outside the Context Span:**
   ```python
   if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
       start_positions.append(0)  # No answer available in this span
       end_positions.append(0)
   ```
   - **No Overlap Handling:** When the answer does not overlap the current context window at all, both positions are set to 0, the index of the `[CLS]` token, to indicate that this window contains no answer.

6. **Identify Token Indices for Answer Bounds:**
   ```python
   else:
       idx = context_start
       while idx <= context_end and offset[idx][0] <= start_char:
           idx += 1
       start_positions.append(idx - 1)

       idx = context_end
       while idx >= context_start and offset[idx][1] >= end_char:
           idx -= 1
       end_positions.append(idx + 1)
   ```
   - **Token Indices:** Finds the precise start (`start_positions`) and end indices (`end_positions`) of the answer within the tokenized context by iterating over token offsets.

7. **Return Processed Inputs:**

   - **Addition of Positions:** Attaches the calculated start and end positions to the tokenized input data, crucial for model training to learn exact answer boundaries.
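
The sliding-window behavior described for `stride` in step 2 can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: the function name `sliding_windows` and its parameters are hypothetical, and the real tokenizer also reserves room in every window for the question and the special tokens.

```python
def sliding_windows(context_tokens, window_size, stride):
    """Split tokens into windows of at most `window_size` that overlap by `stride`."""
    if not 0 <= stride < window_size:
        raise ValueError("stride must be non-negative and smaller than window_size")
    step = window_size - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + window_size])
        if start + window_size >= len(context_tokens):
            break  # this window already reaches the end of the context
        start += step
    return windows


tokens = list(range(10))  # stand-in for 10 context token IDs
windows = sliding_windows(tokens, window_size=4, stride=2)
for window in windows:
    print(window)
```

Each window repeats the last `stride` tokens of the previous one, so an answer cut off at one window boundary has a chance of appearing whole in the next window.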

In [None]:
# TODO

# Preprocessing function
def preprocess_function(examples):

    # Defining the tokenizer
    tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

    # TODO
    # Step 1: Strip whitespace and prepare lists of questions and contexts
    questions =
    contexts =

    # Implement tokenization using the BERT tokenizer
    # Step 2: Use the tokenizer to tokenize questions and contexts
    inputs = tokenizer(

        questions,
        contexts,
        # TODO: Set max_length to 384
        # TODO: Set truncation to "only_second"
        # TODO: Set stride to 128
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        # TODO: Set padding to "max_length"
    )

    # Step 3: Extract the position of answers
    # Initialize lists to hold start and end positions
    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    # Step 4: Iterate through each offset to find the token indices matching the start and end of answers
    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])

        sequence_ids = inputs.sequence_ids(i)
        context_start = sequence_ids.index(1)
        context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)  # index of the last context token

        # Step 5: Check whether the answer overlaps the current context window
        # If the answer lies entirely outside this window, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Step 6: Find the tokens that correspond to the start and end positions of the answer
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    # TODO
    # Step 7: Add the start and end positions to the model inputs
    inputs["start_positions"] =
    inputs["end_positions"] =


    return inputs




This preprocessing function prepares the SQuAD dataset for DistilBERT: it tokenizes each question-context pair, splits long contexts into overlapping windows, aligns the answer's character positions with token positions, and pads every input to the same length. The resulting `start_positions` and `end_positions` are the labels the model learns to predict.
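
To see the answer-alignment logic of steps 5 and 6 in isolation, here is a small sketch that runs the same comparisons on hand-written offsets. It assumes a toy whitespace tokenization; the function name `char_span_to_token_span` and the example sentence are hypothetical, and no real tokenizer is involved.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map an answer's character span to (start_token, end_token) indices.

    Returns (0, 0) when the answer does not overlap the context window.
    """
    if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
        return 0, 0

    # Walk forward to the last token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Toy offsets for the context "BERT was released in 2018".
# Index 0 stands in for [CLS]; indices 1-5 are the context tokens.
offsets = [(0, 0), (0, 4), (5, 8), (9, 17), (18, 20), (21, 25)]

year_span = char_span_to_token_span(offsets, 1, 5, start_char=21, end_char=25)
phrase_span = char_span_to_token_span(offsets, 1, 5, start_char=5, end_char=17)
outside_span = char_span_to_token_span(offsets, 1, 2, start_char=21, end_char=25)
print(year_span, phrase_span, outside_span)
```

The third call restricts the window to the first two context tokens, so the answer "2018" falls outside it and the function returns the `(0, 0)` placeholder.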

## Model Training and Evaluation







In [None]:
# TODO

# Import necessary libraries

# 2. Create Model and Configure Training
def create_qa_model():
    # TODO
    # Step 1: Load the pre-trained DistilBERT model for question answering from Hugging Face
    model =

    # TODO: Setup an optimizer
    # Step 2: Use create_optimizer compatible with HF model for training
    num_train_steps =  # This should be the total number of training steps
    # - init_lr: The initial learning rate
    # - num_warmup_steps: Gradually increase the learning rate to the target value

    # TODO: Choose and configure the loss function
    # Step 3: Define the SparseCategoricalCrossentropy loss function for the model, ensuring it's suitable for the question-answering task
    loss =

    # TODO: Compiling
    # Step 4: Compile the model with the optimizer and the loss function


    return model

**Reflection**

\<Write your thoughts about the model structure here>

In [None]:
# TODO

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

def train_model(model, train_dataset, validation_dataset):
    # Convert to tf.data.Dataset

    # TODO: Prepare train and validation datasets as TensorFlow datasets
    # Step 1: Use the model's prepare_tf_dataset method to create the training and validation datasets
    #         - Set shuffle=True for randomizing training data
    #         - Batch size should be reasonable to fit in memory (e.g., 16)
    train_set =

    val_set =

    # TODO: Add any extra callbacks necessary for saving models or early stopping
    # Step 2: Configure training callbacks
    callbacks = [
      #  - EarlyStopping stops training when there is no improvement to prevent overfitting
      #  - ModelCheckpoint saves the model at the epoch with the lowest validation loss
    ]

    # Adjust callback hooks if needed
    for callback in callbacks:
        if not hasattr(callback, '_implements_train_batch_hooks'):
            callback._implements_train_batch_hooks = lambda: False
            callback._implements_test_batch_hooks = lambda: False
            callback._implements_predict_batch_hooks = lambda: False
            callback._implements_call_batch_hooks = lambda: False

    # Train the model
    # Step 3: Fit the model to the training data while validating on validation data
    #         - Use the history object to analyze or plot training and validation loss
    history = model.fit(
        train_set,
        validation_data=val_set,
        epochs=1,  # Kept at 1 to limit run time; increase for better results
        callbacks=callbacks
    )

    return history


#### **Key Features**

The `train_model` function configures the training process and fine-tunes the DistilBERT model on the SQuAD dataset. Here's a breakdown of the steps involved:

1. **Dataset Preparation**:

   - **Conversion to TensorFlow Dataset**: Uses the model's `prepare_tf_dataset` method to convert the preprocessed, tokenized dataset into a `tf.data.Dataset` suitable for training.
   - **Batching and Shuffling**:
     - `shuffle=True` for the training set ensures that each epoch sees data in a different order, promoting better model generalization.
     - `batch_size=16` defines the number of samples processed before updating the model, balancing memory use and training speed.

2. **Training Callbacks**:

   - **EarlyStopping**: Halts training if the validation loss does not improve for 2 consecutive epochs, limiting overfitting and saving computation. With `epochs=1`, as set in `train_model`, it cannot trigger; it becomes relevant once you train for more epochs.
   - **ModelCheckpoint**: Saves the model only when it achieves a new lowest validation loss, ensuring the best-performing model is retained.

3. **Model Fitting**:

   - **Training Execution**: The `fit` method iteratively optimizes the model parameters over the training data for the specified number of epochs.
   - **Epochs**: Determines how many times the entire training dataset will pass through the model. Adjust depending on performance and overfitting.
   - **Validation**: Continuously evaluates the model's performance on a separate validation dataset to monitor overfitting and generalization capabilities.

4. **Model History**:
   - **Output Analysis**: The `history` object returned by `fit` stores per-epoch values in `history.history`, such as `loss` and `val_loss` (plus any metrics passed to `compile`), which can be plotted to evaluate training.
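
The early-stopping rule described above can be illustrated without Keras. The sketch below is a simplified re-implementation of the idea with a patience of 2, not the Keras callback itself; the function name and the validation-loss values are made up for illustration.

```python
def simulate_early_stopping(val_losses, patience=2):
    """Return (epochs_run, best_epoch) under a simplified early-stopping rule."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0  # epochs since the last improvement
    epochs_run = 0
    for epoch, loss in enumerate(val_losses, start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return epochs_run, best_epoch


# Made-up per-epoch validation losses, shaped like history.history["val_loss"].
val_losses = [1.90, 1.42, 1.35, 1.38, 1.41, 1.50]
epochs_run, best_epoch = simulate_early_stopping(val_losses, patience=2)
print(f"stopped after epoch {epochs_run}, best epoch was {best_epoch}")
```

Here training stops after epoch 5 because epochs 4 and 5 both fail to beat epoch 3, which is the epoch a checkpoint that tracks the lowest validation loss would have kept.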



In [None]:
def main():

    print("Running all tests...")
    run_all_tests()

    print("Loading and preprocessing dataset...")
    # Apply preprocessing to the dataset
    tokenized_dataset = squad_dataset.map(
        preprocess_function,
        batched=True,
        remove_columns=squad_dataset["train"].column_names,
    )

    print("Creating model...")
    model = create_qa_model()

    print("Starting training...")
    history = train_model(
        model,
        tokenized_dataset["train"],
        tokenized_dataset["validation"]
    )

    # Save the model
    model.save_pretrained("qa_model_saved")

    test_model_loss(history.history['val_loss'])

    return model, tokenizer, history


if __name__ == "__main__":
    # Set random seed for reproducibility
    tf.random.set_seed(42)

    # Run training
    model, tokenizer, history = main()



**Reflection**

\<Write your observations here>

#### Improvement Strategies

Here are some strategies you can consider to improve the model:

* **Hyperparameter Tuning:**

  Adjust learning rates (e.g., 2e-5), batch sizes (e.g., 16 or 32), and epochs (2-4).
* **Data Augmentation:**

  Use paraphrasing to create more data. Ensure long contexts are managed with overlapping windows.
* **Advanced Models:**

  Use larger models like BERT-large or switch to RoBERTa/ALBERT for potentially better performance.
* **Regularization:**

  Increase dropout (e.g., 0.2), employ weight decay, and use early stopping to minimize overfitting.
* **Optimizing Training:**

  Implement learning rate warmup and consider layer-wise learning rate decay.
* **Fine-tuning Tips:**

  Freeze lower layers initially; apply task-specific model tweaks.
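
As an illustration of learning rate warmup, the sketch below computes a schedule that rises linearly to the target rate and then decays linearly to zero. This is one common shape, assumed here for illustration; the function name `learning_rate` and the step counts are hypothetical and not taken from the assignment.

```python
def learning_rate(step, init_lr, num_train_steps, num_warmup_steps):
    """Linear warmup to `init_lr`, then linear decay to zero."""
    if num_warmup_steps > 0 and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / max(num_train_steps - num_warmup_steps, 1)


# Assumed settings: 1000 training steps, the first 100 used for warmup.
schedule = {
    step: learning_rate(step, init_lr=2e-5, num_train_steps=1000, num_warmup_steps=100)
    for step in (0, 50, 100, 550, 1000)
}
for step, lr in schedule.items():
    print(f"step {step:4d}: lr = {lr:.2e}")
```

Starting from a small learning rate avoids large, destabilizing updates to the pre-trained weights during the first steps of fine-tuning.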


# Inference

In [None]:
question = "Who wrote Hamlet?"
context = "Hamlet is a tragedy written by William Shakespeare sometime between 1599 and 1601."
answer = get_answer(question, context, model, tokenizer)
print(f"\nQuestion: {question}")
print(f"Context: {context}")
print(f"Answer: {answer}")
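
`get_answer` is imported from `helpers`, and its implementation is not shown here. One common way to turn a question-answering model's outputs into an answer, sketched below under that assumption, is to pick the start and end token indices with the highest combined score, subject to the start not coming after the end. The tokens and logits are made up, and `best_span` is a hypothetical name.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start + end score with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score, best = score, (start, end)
    return best


tokens = ["[CLS]", "who", "wrote", "hamlet", "?", "[SEP]", "hamlet", "is", "a",
          "tragedy", "written", "by", "william", "shakespeare", "[SEP]"]

# Made-up logits: the start peaks at "william", the end at "shakespeare",
# with a distracting end peak at the "hamlet" in the question.
start_logits = [0.1] * len(tokens)
end_logits = [0.1] * len(tokens)
start_logits[12] = 5.0
end_logits[13] = 4.0
end_logits[3] = 4.5

start, end = best_span(start_logits, end_logits)
answer_text = " ".join(tokens[start:end + 1])
print(start, end, answer_text)
```

Taking the two argmax values independently would pair start index 12 with end index 3, which is not a valid span; searching over pairs with `start <= end` avoids that.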