# SQuAD Question Answering

## 1 Introduction

Welcome to the "Enhancing Question Answering with Transformer Models" project! The goal is to build a model that reads a context passage and answers questions about it, using the SQuAD dataset and Transformer architectures.

The project builds on the Transformer architecture introduced in the "Attention Is All You Need" paper. Its self-attention mechanism, multi-head attention, and feedforward layers let a model capture relationships between words, even across long and complex passages.

The notebook covers data exploration, preprocessing, model selection, fine-tuning, and evaluation. In SQuAD, every answer is a span of the context, so the model is trained to locate the answer in the passage rather than to generate free-form text.

## 2 Project Setup

### 2.1 Packages Installation

In [1]:
try:
    import torch
    import torchvision
    from transformers import pipeline
    print(f"torch version: {torch.__version__}")
    print(f"torchvision version: {torchvision.__version__}")
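except ImportError:
    # Install the missing packages, then retry the imports
    !pip install torch torchvision
    !pip install transformers
    import torch
    import torchvision
    from transformers import pipeline
    print(f"torch version: {torch.__version__}")
    print(f"torchvision version: {torchvision.__version__}")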

torch version: 2.0.1
torchvision version: 0.15.2


In [2]:
classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

### 2.2 Acquiring The SQuAD Dataset

In [3]:
try:
    from datasets import load_dataset
except ImportError:
    # Install the datasets package only if the import fails
    !pip install datasets
    from datasets import load_dataset

# Download (or load from cache) the SQuAD dataset
raw_datasets = load_dataset("squad")

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

## 3 Data Exploration

Data exploration lets us understand the characteristics of the dataset, so that we can make informed decisions during preprocessing and modeling.

### 3.1 Basic Statistics
We start with basic statistics on the training split: the number of examples and the average lengths of contexts, questions, and answers. All lengths are measured in characters, not tokens. These statistics give a high-level overview of the dataset's composition.

In [5]:
# Access the training split
train_data = raw_datasets["train"]

# Basic statistics (all lengths are in characters, not tokens)
num_examples = len(train_data)
context_lengths = [len(example["context"]) for example in train_data]
question_lengths = [len(example["question"]) for example in train_data]
answer_lengths = [len(example["answers"]["text"][0]) for example in train_data]

avg_context_length = sum(context_lengths) / num_examples
avg_question_length = sum(question_lengths) / num_examples
avg_answer_length = sum(answer_lengths) / num_examples

print(f"Number of examples: {num_examples}")
print(f"Average context length: {avg_context_length:.2f}")
print(f"Average question length: {avg_question_length:.2f}")
print(f"Average answer length: {avg_answer_length:.2f}")

Number of examples: 87599
Average context length: 754.36
Average question length: 59.57
Average answer length: 20.15
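
Averages hide how widely the lengths are spread. Below is a minimal sketch of summarizing a distribution with Python's standard `statistics` module. The helper name `summarize_lengths` and the list `sample_lengths` are hypothetical, and the numbers are made up rather than drawn from SQuAD; the same helper could be applied to `answer_lengths`.

```python
import statistics

def summarize_lengths(lengths):
    """Return basic distribution statistics for a list of lengths."""
    ordered = sorted(lengths)
    # With n=4, statistics.quantiles returns the three quartile cut points
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
    }

# Hypothetical lengths, for illustration only
sample_lengths = [3, 5, 8, 12, 20, 26, 41, 90]
summary = summarize_lengths(sample_lengths)
print(summary)
```

A mean far above the median, as in this toy list, is a sign that a few long items dominate the average.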


### 3.2 Sample Viewing
Next, we print the first training example to see the format of the context, the question, and the answer. Each answer stores its text together with `answer_start`, the character offset at which the answer begins in the context.

In [6]:
print("Context: ", train_data[0]["context"], "\n")
print("Question: ", train_data[0]["question"], "\n")
print("Answer: ", train_data[0]["answers"])

Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary. 

Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? 

Answer:  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}
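
The cell above shows only the first training example. Below is a sketch of drawing a reproducible random sample of row indices instead. The list `toy_examples` is a hypothetical stand-in that mimics the `question` and `context` keys, and the seed value is arbitrary.

```python
import random

def sample_indices(num_rows, k, seed=0):
    """Pick k distinct row indices from range(num_rows), reproducibly."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(range(num_rows), k)

# Hypothetical stand-in for a dataset split (a list of dicts)
toy_examples = [
    {"question": f"Question {i}?", "context": f"Context {i}."} for i in range(10)
]

picked = sample_indices(len(toy_examples), k=3)
for i in picked:
    print(i, toy_examples[i]["question"])
```

With the real split, the same indices could be used as `train_data[i]` to inspect a handful of examples from different articles.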


### 3.3 Answer Types and Categories
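Finally, we get a rough picture of the answer types in the dataset using simple surface heuristics: purely numeric strings, all-uppercase strings (a crude proxy for acronyms and some named entities), multi-word answers (labeled descriptive), and everything else. The counts are only indicative, but they can guide our preprocessing and model design.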

In [7]:
import re

# Regular expression patterns to identify answer types
numeric_pattern = re.compile(r'^\d+(\.\d+)?$')

# Initialize counters for different answer types
numeric_answers = 0
named_entities = 0
descriptive_answers = 0
other_answers = 0

# Loop through examples to categorize answers
for example in train_data:
    answer_text = example["answers"]["text"][0]
    
    if re.match(numeric_pattern, answer_text):
        numeric_answers += 1
    elif answer_text.isupper():
        named_entities += 1
    elif len(answer_text.split()) > 1:
        descriptive_answers += 1
    else:
        other_answers += 1

# Print the counts for different answer types
print(f"Numeric Answers: {numeric_answers}")
print(f"Named Entities: {named_entities}")
print(f"Descriptive Answers: {descriptive_answers}")
print(f"Other Answers: {other_answers}")

Numeric Answers: 6912
Named Entities: 1225
Descriptive Answers: 56857
Other Answers: 22605
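
The rules in this cell are surface heuristics rather than a real entity tagger. The sketch below wraps the same checks in a function (the name `categorize_answer` is ours) and applies them to a few hand-picked strings, to show what each branch actually captures.

```python
import re

NUMERIC_PATTERN = re.compile(r"^\d+(\.\d+)?$")

def categorize_answer(answer_text):
    """Apply the same surface rules as the loop above to a single answer."""
    if NUMERIC_PATTERN.match(answer_text):
        return "numeric"
    if answer_text.isupper():
        # True only when every cased character is uppercase, e.g. acronyms
        return "named_entity"
    if len(answer_text.split()) > 1:
        return "descriptive"
    return "other"

examples = ["1858", "3.14", "1,000", "NASA", "Saint Bernadette Soubirous", "France"]
labels = {text: categorize_answer(text) for text in examples}
print(labels)
```

A proper name such as "Saint Bernadette Soubirous" lands in the descriptive bucket, and a formatted number such as "1,000" lands in "other", so the printed counts are best read as rough indicators.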


### 3.4 Dataset Features

In [8]:
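# Keep only examples whose number of reference answers differs from one;
# an empty result means every training example has exactly one answer
train_data.filter(lambda x: len(x["answers"]["text"]) != 1)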

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

## 4 Data Preprocessing

Data preprocessing is the foundation of this project: it turns raw questions and contexts into the structured numeric inputs that Transformer-based models can process and learn from.

Transformers operate at the token level, and each token typically corresponds to a word or subword unit. Preprocessing involves tokenizing the text into these units, allowing the model to understand and process the input. Tokenization is a fundamental step in preparing text data for Transformer models.

Fortunately, we can import AutoTokenizer from the transformers library. AutoTokenizer covers the following steps (positional encodings, by contrast, are added inside the model rather than by the tokenizer):
1. Tokenization
2. Padding and Truncation
3. Adding Special Tokens
4. Encoding and Decoding
5. Batching
6. Mapping to Model Inputs
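
To make the idea of subword units concrete, here is a minimal sketch of a greedy longest-match-first split in the style of WordPiece. The function name and the vocabulary are invented for illustration; the real `bert-base-uncased` tokenizer also normalizes the text and uses a far larger vocabulary, so its splits will differ.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real BERT vocabulary
toy_vocab = {"gr", "##ot", "##to", "basilica", "the"}

print(wordpiece_tokenize("grotto", toy_vocab))
print(wordpiece_tokenize("basilica", toy_vocab))
print(wordpiece_tokenize("lourdes", toy_vocab))
```

Words in the vocabulary stay whole, unseen words are assembled from smaller pieces, and a word that cannot be assembled at all falls back to the unknown token.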

### 4.1 Importing the AutoTokenizer

In [9]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Sample text for tokenization
context = train_data[0]["context"]
question = train_data[0]["question"]
tokenized_data = tokenizer(context, question)

# Print the first tokenized example
tokenizer.decode(tokenized_data["input_ids"])

'[CLS] architecturally, the school has a catholic character. atop the main building\'s gold dome is a golden statue of the virgin mary. immediately in front of the main building and facing it, is a copper statue of christ with arms upraised with the legend " venite ad me omnes ". next to the main building is the basilica of the sacred heart. immediately behind the basilica is the grotto, a marian place of prayer and reflection. it is a replica of the grotto at lourdes, france where the virgin mary reputedly appeared to saint bernadette soubirous in 1858. at the end of the main drive ( and in a direct line that connects through 3 statues and the gold dome ), is a simple, modern stone statue of mary. [SEP] to whom did the virgin mary allegedly appear in 1858 in lourdes france? [SEP]'

### 4.2 Exploring the Tokenizer Properties
Tokenization is not a one-size-fits-all process: the arguments passed to the tokenizer need to match the dataset and the task. SQuAD contexts are often longer than the window we want to feed to the model, so the arguments below control how long inputs are truncated and split.

1. **Sequence Length Management:**
   By setting the `max_length` parameter to 100, we cap each tokenized sequence (question, context, and special tokens combined) at 100 tokens. The value is deliberately small here so that the splitting behavior is easy to see; in practice it must not exceed the model's input limit, which is 512 tokens for `bert-base-uncased`.

2. **Overflowing Tokens:**
   The `return_overflowing_tokens=True` parameter makes the tokenizer return the tokens cut off by truncation as additional sequences instead of discarding them. A long context therefore produces several features, so the answer can still be found even when it lies beyond the first window.

3. **Stride for Overlapping Contexts:**
   The `stride` parameter sets the number of tokens shared between consecutive windows when a long context is split. With `stride=50`, each new window repeats the last 50 context tokens of the previous one. The overlap reduces the chance that an answer is cut in half at a window boundary.

4. **Truncation Strategy:**
   We're using `truncation="only_second"`, which truncates only the second sequence of the pair. Because the question is passed first and the context second, the question is always kept whole and only the context is shortened. Combined with `return_overflowing_tokens=True`, the removed context tokens reappear in the following windows rather than being lost.

5. **Decoding and Understanding Tokens:**
   The loop that decodes and prints the `input_ids` provides insight into how the tokenized sequences are represented. This helps you understand how tokenization affects the structure of the input and provides a way to verify the preprocessing process.
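
The way a long context is split can be sketched in plain Python. This is a simplified model, not the library's implementation: `window` stands for the number of context tokens that fit in one sequence (in the real tokenizer the question and the special tokens also count toward `max_length`), and `overlap` plays the role of `stride`, that is, the number of tokens shared by consecutive windows. The function name is hypothetical.

```python
def sliding_windows(tokens, window, overlap):
    """Split tokens into windows of at most `window` items, where
    consecutive windows share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        stop = min(start + window, len(tokens))
        windows.append(tokens[start:stop])
        if stop == len(tokens):
            break  # the last window reached the end of the sequence
        start += step
    return windows

context_tokens = list(range(10))  # stand-in for 10 context tokens
windows = sliding_windows(context_tokens, window=4, overlap=2)
for w in windows:
    print(w)
```

Each window starts `window - overlap` tokens after the previous one, so every token near a boundary appears in two windows.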

In [10]:
# Tokenize the question and context using the tokenizer
inputs = tokenizer(
    question,                                 # The question to be tokenized
    context,                                  # The context to be tokenized
    max_length=100,                           # Set the maximum combined sequence length
    truncation="only_second",                 # Truncate context if it exceeds max_length
    stride=50,                                # Number of tokens shared by consecutive windows
    return_overflowing_tokens=True,           # Return overflowing tokens
)

In [11]:
# Iterate over the input_ids of the tokenized sequences
for ids in inputs["input_ids"]:
    # Decode and print the tokenized sequence
    print(tokenizer.decode(ids), "\n")

[CLS] to whom did the virgin mary allegedly appear in 1858 in lourdes france? [SEP] architecturally, the school has a catholic character. atop the main building's gold dome is a golden statue of the virgin mary. immediately in front of the main building and facing it, is a copper statue of christ with arms upraised with the legend " venite ad me omnes ". next to the main building is the basilica of the sacred heart. immediately behind the basilica is the gr [SEP] 

[CLS] to whom did the virgin mary allegedly appear in 1858 in lourdes france? [SEP] main building and facing it, is a copper statue of christ with arms upraised with the legend " venite ad me omnes ". next to the main building is the basilica of the sacred heart. immediately behind the basilica is the grotto, a marian place of prayer and reflection. it is a replica of the grotto at lourdes, france where the virgin mary reputedly appeared to saint [SEP] 

[CLS] to whom did the virgin mary allegedly appear in 1858 in lourdes f

In [12]:
# Print the keys in the 'inputs' dictionary
print(inputs.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'overflow_to_sample_mapping'])


### 4.3 Tokenizing More Examples
The next cell tokenizes four training examples at once. Two extra outputs are useful here: `overflow_to_sample_mapping` records which original example each feature came from, and `offset_mapping` gives, for every token, its character span in the original text.

In [13]:
# Tokenize the questions and contexts using the tokenizer
inputs = tokenizer(
    train_data[2:6]["question"],       # List of questions to be tokenized
    train_data[2:6]["context"],        # List of contexts to be tokenized
    max_length=100,                    # Set the maximum combined sequence length
    truncation="only_second",          # Truncate context if it exceeds max_length
    stride=50,                         # Number of tokens shared by consecutive windows
    return_overflowing_tokens=True,    # Return overflowing tokens
    return_offsets_mapping=True,       # Return offsets mapping for the original text
)

# Count the original examples and the features produced from them
num_input_examples = len(train_data[2:6]["question"])
num_features = len(inputs["input_ids"])

# Print both counts and the example each feature comes from
print(f"The {num_input_examples} examples gave {num_features} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")

The 4 examples gave 17 features.
Here is where each comes from: [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3].


### 4.4 Extracting Start and End Token Positions 
For each feature, the next cell converts the answer's character span into start and end token positions inside the tokenized context, using the offset mapping. If the answer does not lie entirely within a feature's context window, the feature is labeled `(0, 0)`, which points at the `[CLS]` token.

In [14]:
# Get the answers from the original dataset for the processed examples
answers = raw_datasets["train"][2:6]["answers"]

# Initialize empty lists to store start and end positions
start_positions = []
end_positions = []

# Loop through each offset mapping in the tokenized inputs
for i, offset in enumerate(inputs["offset_mapping"]):
    # Get the index of the example in the original dataset
    sample_idx = inputs["overflow_to_sample_mapping"][i]
    # Retrieve the answer for the example
    answer = answers[sample_idx]
    # Calculate the start and end character positions of the answer in the context
    start_char = answer["answer_start"][0]
    end_char = answer["answer_start"][0] + len(answer["text"][0])
    # Get sequence IDs for the tokenized example
    sequence_ids = inputs.sequence_ids(i)

    # Find the start and end token positions of the context
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1

    # Check if the answer is fully inside the context
    if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
        # If not, label the start and end positions as (0, 0)
        start_positions.append(0)
        end_positions.append(0)
    else:
        # Otherwise, find the start token position within the context
        idx = context_start
        while idx <= context_end and offset[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)

        # Find the end token position within the context
        idx = context_end
        while idx >= context_start and offset[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)

# Print the lists of start and end positions
print("Start Positions:", start_positions)
print("End Positions:", end_positions)

Start Positions: [81, 49, 17, 0, 0, 57, 19, 33, 0, 0, 0, 63, 27, 0, 0, 0, 0]
End Positions: [83, 51, 19, 0, 0, 63, 25, 39, 0, 0, 0, 64, 28, 0, 0, 0, 0]
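
The core of this alignment can be shown on a tiny example. The sketch below uses a hypothetical whitespace tokenizer to build `(char_start, char_end)` offsets and then maps a character span to inclusive token indices. The function name and the sample sentence are ours, and the sliding-window and sequence-id handling of the cell above is left out.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token indices.

    `offsets` holds one (char_start, char_end) pair per token.
    Returns None when the tokens do not fully cover the span.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # Last token that starts at or before the first answer character
    start_token = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    # First token that ends at or after the last answer character
    end_token = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return start_token, end_token

text = "the virgin mary appeared to saint bernadette"

# Hypothetical whitespace tokenization with character offsets
offsets, pos = [], 0
for word in text.split():
    start = text.index(word, pos)
    offsets.append((start, start + len(word)))
    pos = start + len(word)

answer = "saint bernadette"
start_char = text.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span)
```

When the tokens stop before the answer ends, as happens in a window that does not contain the whole answer, the function returns `None`; the notebook's cell labels that case `(0, 0)` instead.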


### 4.5 Verifying our Approach
To check the labels, we decode the tokens between the start and end positions of the first feature and compare the result with the original answer text. The two should match up to casing, since `bert-base-uncased` lowercases its input.

In [15]:
# Initialize index to select an example
idx = 0

# Obtain the sample index for the current tokenized sequence
sample_idx = inputs["overflow_to_sample_mapping"][idx]

# Retrieve the answer text for the current example
answer = answers[sample_idx]["text"][0]

# Obtain the start and end token positions for the answer span
start = start_positions[idx]
end = end_positions[idx]

# Decode the tokenized answer span using the start and end positions
labeled_answer = tokenizer.decode(inputs["input_ids"][idx][start : end + 1])

# Print the comparison between the theoretical answer and the labeled answer span
print(f"Theoretical answer: {answer}, labels give: {labeled_answer}")


Theoretical answer: the Main Building, labels give: the main building
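
The comparison above is done by eye. Below is a sketch of an automated version that compares the two strings after light normalization. Both helper names are hypothetical, and the rule set (lowercasing and collapsing whitespace) is an assumption: decoded text can also differ in spacing around punctuation, which this sketch does not handle.

```python
def normalize_text(text):
    """Lowercase and collapse whitespace so casing and spacing are ignored."""
    return " ".join(text.lower().split())

def spans_match(reference, decoded):
    """True when the decoded span equals the reference after normalization."""
    return normalize_text(reference) == normalize_text(decoded)

same = spans_match("the Main Building", "the main building")
different = spans_match("the Main Building", "main building")
print(same, different)
```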


### 4.6 Consolidating The Code
This code defines a function called `preprocess_training_examples` that preprocesses training examples for a question answering task using the SQuAD dataset. It tokenizes both questions and contexts using the specified `max_length` and `stride` parameters. It then processes the tokenized inputs to extract answer positions within the context, considering overflow, offset mappings, and sequence IDs. Finally, it adds the start and end positions of the answer spans to the tokenized inputs and returns the processed inputs. 

In [16]:
# Define maximum sequence length and stride for tokenization
max_length = 384
stride = 128

# Define a function to preprocess training examples
def preprocess_training_examples(examples):
    # Strip leading and trailing whitespace from question texts
    questions = [q.strip() for q in examples["question"]]
    
    # Tokenize questions and contexts using the tokenizer
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Extract offset mappings and sample mappings
    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    
    # Extract answers from the examples
    answers = examples["answers"]
    
    # Initialize lists to store start and end positions for answer spans
    start_positions = []
    end_positions = []

    # Iterate over each tokenized example and process answer positions
    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    # Add start and end positions to the tokenized inputs
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
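
The function passes `padding="max_length"`, so every feature is padded to the same length and the features can be stacked into fixed-size batches. Below is a sketch of what padding does to a single feature. The helper name is hypothetical, and `pad_id=0` is an assumption made for the example; the tokenizer exposes the real value as `tokenizer.pad_token_id`.

```python
def pad_to_max_length(input_ids, max_length, pad_id=0):
    """Right-pad input_ids to max_length and build the matching attention mask."""
    if len(input_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    num_pad = max_length - len(input_ids)
    padded = input_ids + [pad_id] * num_pad
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [1] * len(input_ids) + [0] * num_pad
    return padded, attention_mask

ids = [11, 12, 13, 14]  # arbitrary example ids
padded, mask = pad_to_max_length(ids, max_length=8)
print(padded)
print(mask)
```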

### 4.7 Preprocessing the Training Dataset

In [17]:
# Map the preprocess_training_examples function to the training examples
train_dataset = raw_datasets["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

# Get the lengths of the original and processed datasets
original_dataset_length = len(raw_datasets["train"])
processed_dataset_length = len(train_dataset)

# Print the lengths
print(original_dataset_length, processed_dataset_length)

87599 88524
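
The processed dataset has more rows than the original because a context that does not fit in one sequence is split into several features. Below is a back-of-the-envelope sketch of the count for a single example. It assumes a simplified model in which the question tokens and three special tokens are reserved in every sequence and consecutive context windows share `stride` tokens. The function name and the token counts are made up for illustration.

```python
import math

def estimate_num_features(context_len, question_len, max_length, stride, num_special=3):
    """Estimate how many features one example yields under the simplified model."""
    # Context tokens that fit in one sequence next to the question
    window = max_length - question_len - num_special
    if stride >= window:
        raise ValueError("stride must be smaller than the context window")
    if context_len <= window:
        return 1
    # Each extra window covers (window - stride) new context tokens
    return 1 + math.ceil((context_len - window) / (window - stride))

for context_len in (150, 500, 700):
    n = estimate_num_features(context_len, question_len=14, max_length=384, stride=128)
    print(context_len, n)
```

Under these assumptions most contexts fit in a single window at `max_length = 384`, which is consistent with the row count growing only slightly.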


### 4.8 Preprocessing the Validation Dataset

In [18]:
# Map the preprocess_training_examples function to the validation examples
# (the function labels each feature with the first reference answer only)
validation_dataset = raw_datasets["validation"].map(
    preprocess_training_examples,      # Use the same preprocessing function
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)

# Get the lengths of the original and processed validation datasets
original_validation_length = len(raw_datasets["validation"])
processed_validation_length = len(validation_dataset)

# Print the lengths
print(original_validation_length, processed_validation_length)

10570 10784


## 5 Designing a Baseline

### 5.1 Model Selection

### 5.2 Model Finetuning

### 5.3 Training

### 5.4 Evaluation

## 6 Implementing Transformer-based Models

## 7 Using the Best Model

## 8 Conclusion