# SQuAD Question Answering

## 1 Introduction

Welcome to the "Enhancing Question Answering with Transformer Models" project! In this project, I use Natural Language Processing (NLP) to build a model that answers questions based on a given context. Using Transformer architectures, I aim to create a robust system that handles the nuances of human language and returns contextually relevant answers.

The project is built on the Transformer architecture, which reshaped the field of NLP. Following the "Attention Is All You Need" paper, the models rely on self-attention, multi-head attention, and feed-forward layers to capture linguistic relationships, even in long and complex text.

As I progress through this project, I'll cover data preprocessing, model selection, fine-tuning, and evaluation, with the goal of a model that interprets the context and produces accurate answers to questions about passages of text.

## 2 Project Setup

### 2.1 Packages Installation

In [1]:
try:
    import torch
    import torchvision
    from transformers import pipeline
except ImportError:
    # Install the missing dependencies, then retry the imports
    !pip install torch torchvision
    !pip install transformers
    import torch
    import torchvision
    from transformers import pipeline

# Report versions on both code paths
print(f"torch version: {torch.__version__}")
print(f"torchvision version: {torchvision.__version__}")

torch version: 2.0.1
torchvision version: 0.15.2


In [2]:
classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
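
The warning above recommends naming the model explicitly. A minimal sketch of doing so, pinning the checkpoint and revision that the log reports as the default. The helper name `build_sentiment_classifier` is hypothetical, and the import sits inside the function so that defining it does not require `transformers` to be installed yet:

```python
# Checkpoint and revision copied from the pipeline warning above
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_sentiment_classifier(model_name=MODEL_NAME, revision=MODEL_REVISION):
    """Hypothetical helper: create a sentiment pipeline with a pinned model."""
    # Imported lazily so this cell can be defined before the install step runs
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_name, revision=revision)
```

Calling `build_sentiment_classifier()` downloads the checkpoint on first use and should behave like the unpinned `pipeline("sentiment-analysis")` call, without the warning.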

### 2.2 Acquiring The SQuAD Dataset

In [3]:
try:
    from datasets import load_dataset
except ImportError:
    # Install the Hugging Face datasets library if it is missing
    !pip install datasets
    from datasets import load_dataset

raw_datasets = load_dataset("squad")

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

## 3 Data Exploration

Data exploration is a crucial step that allows us to understand the nuances and characteristics of the dataset, enabling us to make informed decisions during preprocessing and modeling.

### 3.1 Basic Statistics
We start with basic statistics: the number of training examples and the average length of contexts, questions, and answers. Lengths are measured in characters, not words or tokens, and each answer length uses the first reference answer. These statistics give us a high-level overview of the dataset's composition.

In [5]:
# Access the training split
train_data = raw_datasets["train"]

# Basic statistics (all lengths are measured in characters, not tokens)
num_examples = len(train_data)
context_lengths = [len(example["context"]) for example in train_data]
question_lengths = [len(example["question"]) for example in train_data]
# Use the first reference answer of each example
answer_lengths = [len(example["answers"]["text"][0]) for example in train_data]

avg_context_length = sum(context_lengths) / num_examples
avg_question_length = sum(question_lengths) / num_examples
avg_answer_length = sum(answer_lengths) / num_examples

print(f"Number of examples: {num_examples}")
print(f"Average context length: {avg_context_length:.2f}")
print(f"Average question length: {avg_question_length:.2f}")
print(f"Average answer length: {avg_answer_length:.2f}")

Number of examples: 87599
Average context length: 754.36
Average question length: 59.57
Average answer length: 20.15
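
The averages above are character counts, while model input limits are usually expressed in tokens. As a rough intermediate view, the sketch below summarizes lengths in both characters and whitespace-separated words, and adds the median and maximum alongside the mean. The `sample_records` list and the `length_summary` helper are hypothetical stand-ins that mimic the SQuAD schema, not real dataset rows:

```python
from statistics import mean, median

# Hypothetical records that mimic the SQuAD schema (not real dataset rows)
sample_records = [
    {
        "context": "Paris is the capital of France.",
        "question": "What is the capital of France?",
        "answers": {"text": ["Paris"], "answer_start": [0]},
    },
    {
        "context": "Water boils at 100 degrees Celsius at sea level.",
        "question": "At what temperature does water boil?",
        "answers": {"text": ["100 degrees Celsius"], "answer_start": [15]},
    },
]


def length_summary(texts):
    """Summarize lengths in characters and in whitespace-separated words."""
    char_lengths = [len(text) for text in texts]
    word_lengths = [len(text.split()) for text in texts]
    return {
        "mean_chars": mean(char_lengths),
        "median_chars": median(char_lengths),
        "max_chars": max(char_lengths),
        "mean_words": mean(word_lengths),
        "max_words": max(word_lengths),
    }


context_summary = length_summary([r["context"] for r in sample_records])
question_summary = length_summary([r["question"] for r in sample_records])
print(context_summary)
print(question_summary)
```

Whitespace splitting is only a proxy: a subword tokenizer will typically produce more tokens than words, so the maximum is the figure to watch when choosing a truncation length.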


### 3.2 Sample Viewing
Next, we print the first training example to see the format of contexts, questions, and answer spans. Each answer stores its text together with `answer_start`, the character offset of the answer within the context. Randomly sampling further examples, and plotting histograms or box plots of question and context lengths, would help identify potential outliers or patterns.

In [8]:
print("Context: ", train_data[0]["context"], "\n")
print("Question: ", train_data[0]["question"], "\n")
print("Answer: ", train_data[0]["answers"])

Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary. 

Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? 

Answer:  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}
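
Since `answer_start` is a character offset into the context, slicing the context at that offset should reproduce the answer text. The sketch below draws a reproducible random sample and runs that check. The `sample_records` list and the `answer_span_matches` helper are hypothetical, and the third record carries a deliberately wrong offset to show a failing case:

```python
import random

# Hypothetical records that mimic the SQuAD schema (not real dataset rows)
sample_records = [
    {
        "context": "Paris is the capital of France.",
        "question": "What is the capital of France?",
        "answers": {"text": ["Paris"], "answer_start": [0]},
    },
    {
        "context": "Water boils at 100 degrees Celsius at sea level.",
        "question": "At what temperature does water boil?",
        "answers": {"text": ["100 degrees Celsius"], "answer_start": [15]},
    },
    {
        # Deliberately misaligned offset: "Nile" actually starts at index 4
        "context": "The Nile flows north.",
        "question": "Which river flows north?",
        "answers": {"text": ["Nile"], "answer_start": [3]},
    },
]


def answer_span_matches(record):
    """Return True if every answer_start offset points at its answer text."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


rng = random.Random(42)  # fixed seed so the same examples are drawn each run
for record in rng.sample(sample_records, k=2):
    print(record["question"], "->", record["answers"]["text"][0],
          "| span ok:", answer_span_matches(record))
```

With the real dataset, the same check can be applied by passing `train_data[i]` for a sampled index `i`.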


### 3.3 Answer Types and Categories
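Finally, we explore the types of answers in the dataset (e.g., numeric answers, named entities, descriptive answers) using simple string heuristics. The categories are rough: an answer counts as a named entity only if it is entirely uppercase, and as descriptive if it contains more than one word. This information can guide our preprocessing and model design.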

In [7]:
import re

# Regular expression for plain integers and decimals (e.g. "1858", "3.5")
numeric_pattern = re.compile(r'^\d+(\.\d+)?$')

# Initialize counters for different answer types
numeric_answers = 0
named_entities = 0
descriptive_answers = 0
other_answers = 0

# Loop through examples to categorize the first reference answer
for example in train_data:
    answer_text = example["answers"]["text"][0]

    if numeric_pattern.match(answer_text):
        numeric_answers += 1
    elif answer_text.isupper():
        # Rough proxy: only all-uppercase answers (mostly acronyms) count here
        named_entities += 1
    elif len(answer_text.split()) > 1:
        # Any remaining multi-word answer is treated as descriptive
        descriptive_answers += 1
    else:
        other_answers += 1

# Print the counts for different answer types
print(f"Numeric Answers: {numeric_answers}")
print(f"Named Entities: {named_entities}")
print(f"Descriptive Answers: {descriptive_answers}")
print(f"Other Answers: {other_answers}")


Numeric Answers: 6912
Named Entities: 1225
Descriptive Answers: 56857
Other Answers: 22605
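
The uppercase check above only catches acronyms, so most proper nouns end up in the descriptive or other buckets. One possible refinement, sketched below on a hypothetical list of answers, treats answers in which every word starts with a capital letter as a separate category and lets the numeric pattern accept thousands separators. The pattern, the category names, and the `categorize_answer` helper are assumptions for illustration, not a substitute for a real named-entity recognizer:

```python
import re
from collections import Counter

# Assumed pattern: plain numbers plus comma-grouped thousands such as "1,000"
NUMERIC_PATTERN = re.compile(r"^\d{1,3}(,\d{3})*(\.\d+)?$|^\d+(\.\d+)?$")


def categorize_answer(answer_text):
    """Assign a rough category to an answer string using string heuristics."""
    text = answer_text.strip()
    if NUMERIC_PATTERN.match(text):
        return "numeric"
    words = text.split()
    # Words that all start with a capital letter are a proxy for proper nouns
    if words and all(word[0].isupper() for word in words):
        return "capitalized"
    if len(words) > 1:
        return "multi_word"
    return "other"


# Hypothetical answers chosen to exercise each branch
sample_answers = [
    "1858",
    "1,000",
    "Saint Bernadette Soubirous",
    "NASA",
    "a copper statue",
    "gold",
]
counts = Counter(categorize_answer(answer) for answer in sample_answers)
print(counts)
```

Capitalization is still a weak signal: it misses names containing lowercase words and miscounts sentence-initial words, so the resulting counts should be read as a coarse profile only.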


## 4 Data Preprocessing

## 5 Designing a Baseline

### 5.1 Model Selection

### 5.2 Model Finetuning

### 5.3 Training

### 5.4 Evaluation

## 6 Implementing Transformer-based Models

## 7 Using the Best Model

## 8 Conclusion