# **1. Setup**

In [5]:
!pip install transformers datasets evaluate



# **2. Load and Explore Dataset**

In [2]:
# Step 2: Load and Explore Dataset
from datasets import load_dataset

# Load the SQuAD v1.1 dataset
dataset = load_dataset("squad")

# Print the dataset structure
print(dataset)

# Print the first training sample to understand its fields
print("\n--- Sample Training Example ---")
print(dataset['train'][0])


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

--- Sample Training Example ---
{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects th

# **3. Tokenization and Model Setup**

In [3]:
# @title ## 3. Tokenization and Model Setup
# We'll load the BERT tokenizer and model, then create a comprehensive
# preprocessing function to prepare the data for QA.

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# This is the model checkpoint we'll use
model_checkpoint = "bert-base-uncased"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Load the model for the Question Answering task
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# --- Configuration for preprocessing ---
# Some contexts are very long. We'll split them into smaller chunks.
max_length = 384  # The maximum length of a feature (question + context)
doc_stride = 128  # The number of overlapping tokens between chunks



Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
# @title ### 3.1. Preprocessing Function
# This is the most complex part of QA.
# We need to:
# 1. Tokenize (question, context) pairs.
# 2. Handle long contexts by splitting them into overlapping chunks (stride).
# 3. Map the character-based answer start/end to token-based start/end positions.

def preprocess_function(examples):
    # Tokenize the questions and contexts together.
    # 'truncation="only_second"' truncates the context, not the question.
    # 'return_overflowing_tokens=True' creates multiple features for long contexts.
    # 'return_offsets_mapping=True' gives us char-to-token mappings.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # This 'sample_mapping' helps us map from a feature back to its original example.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # 'offset_mapping' helps us map from tokens to characters in the context.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # We now label the start and end token positions.
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the [CLS] token's index (0).
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Get the sequence corresponding to this example (to know what is context and what is question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # Get the original example index.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        # If no answers are given, set [CLS] as the answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Get the character start and end of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Find the token start index in the current span.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:  # 1 marks the context
                token_start_index += 1

            # Find the token end index.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is outside the current span.
            if not (
                offsets[token_start_index][0] <= start_char
                and offsets[token_end_index][1] >= end_char
            ):
                # Answer is not in this span, label with [CLS]
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Answer is in this span. Find the exact token start and end.
                while (
                    token_start_index < len(offsets)
                    and offsets[token_start_index][0] <= start_char
                ):
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)

                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
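The character-to-token mapping is the hardest step to picture, so here is a minimal, self-contained sketch of the same two-pointer idea on a toy example. It assumes a whitespace tokenizer (not BERT's WordPiece) purely for illustration; `char_span_to_token_span` and `whitespace_offsets` are hypothetical helper names, not part of any library.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token indices.

    `offsets` holds one (char_start, char_end) pair per token, in the style of
    a fast tokenizer's offset mapping. Returns None when the tokens do not
    fully cover the span (the case labeled with [CLS] in the notebook).
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None

    # Move right past every token that starts at or before the answer start,
    # then step back one to land on the token containing start_char.
    start_tok = 0
    while start_tok < len(offsets) and offsets[start_tok][0] <= start_char:
        start_tok += 1
    start_tok -= 1

    # Move left past every token that ends at or after the answer end,
    # then step forward one to land on the token containing the last character.
    end_tok = len(offsets) - 1
    while end_tok >= 0 and offsets[end_tok][1] >= end_char:
        end_tok -= 1
    end_tok += 1

    return start_tok, end_tok


def whitespace_offsets(text):
    """Toy tokenizer (an assumption for illustration): split on whitespace."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        offsets.append((start, start + len(word)))
        pos = start + len(word)
    return offsets


context = "Albert Einstein developed the theory of relativity."
answer = "theory of relativity"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
span = char_span_to_token_span(offsets, start_char, end_char)
print(span)  # (4, 6): the tokens "theory", "of", "relativity."
```

Passing only the first three offsets (`offsets[:3]`) returns `None`, which mirrors a chunk that does not contain the answer.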

In [5]:
# @title ### 3.2. Apply Preprocessing (on Full Dataset)

# NOTE: This processes the *entire* SQuAD dataset, which takes
# noticeably longer than running on a small subset would.

# Apply the preprocessing function to the full datasets
tokenized_train = dataset["train"].map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

tokenized_validation = dataset["validation"].map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["validation"].column_names,
)

print("\n--- Tokenized Datasets (Full) ---")
print(tokenized_train)
print(tokenized_validation)



--- Tokenized Datasets (Full) ---
Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 88524
})
Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 10784
})
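The row counts above exceed the original example counts (88,524 features from 87,599 training examples, 10,784 from 10,570 validation examples) because long contexts were split into overlapping windows. The sketch below estimates how many features one example yields. It is an approximation under stated assumptions: each window reserves three special tokens ([CLS] and two [SEP]), and consecutive windows overlap by `doc_stride` context tokens. `count_features` is a hypothetical helper, not a library function.

```python
import math


def count_features(num_context_tokens, num_question_tokens,
                   max_length=384, doc_stride=128, num_special_tokens=3):
    """Estimate how many overlapping windows one (question, context) pair needs."""
    # Context tokens that fit in one window next to the question.
    budget = max_length - num_question_tokens - num_special_tokens
    # Each new window advances by the budget minus the overlap.
    step = budget - doc_stride
    if budget <= 0 or step <= 0:
        raise ValueError("max_length is too small for this question and stride")
    if num_context_tokens <= budget:
        return 1
    return 1 + math.ceil((num_context_tokens - budget) / step)


# With a 12-token question, one window holds 369 context tokens.
print(count_features(150, 12))  # 1 (fits in a single window)
print(count_features(400, 12))  # 2
print(count_features(700, 12))  # 3
```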


# **4. Fine-tuning**

In [13]:
!pip install -U transformers accelerate




In [17]:
from transformers import Trainer, TrainingArguments
import transformers.utils.logging as logging
logging.set_verbosity_info()


In [20]:
training_args = TrainingArguments(
    output_dir="./qa_bert_finetuned",
    eval_strategy="epoch",      # named evaluation_strategy in older transformers releases
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_strategy="steps",
    logging_steps=100,
    report_to="none",
    disable_tqdm=False,
    load_best_model_at_end=False,
)


PyTorch: setting up devices


In [22]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_validation,
    tokenizer=tokenizer,  # recent transformers releases deprecate this in favor of processing_class
)

print("\n--- Starting Training ---\n")
trainer.train()
print("\n--- Training Complete ---\n")





--- Starting Training ---



***** Running training *****
  Num examples = 88,524
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Training with DataParallel so batch size has been adjusted to: 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 8,301
  Number of trainable parameters = 108,893,186


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.0368        | 1.018393        |
| 2     | 0.7636        | 0.997935        |
| 3     | 0.5549        | 1.075303        |
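The "Total optimization steps = 8,301" figure in the training log follows directly from the feature count and the batch size. The short calculation below reproduces it. The device count of 2 is an assumption inferred from the log line that adjusts the per-device batch of 16 to 32 under DataParallel.

```python
import math

num_features = 88_524      # tokenized training features (from the log)
per_device_batch = 16      # per_device_train_batch_size
num_devices = 2            # assumption: inferred from "adjusted to: 32"
num_epochs = 3

effective_batch = per_device_batch * num_devices
steps_per_epoch = math.ceil(num_features / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(effective_batch)   # 32
print(steps_per_epoch)   # 2767
print(total_steps)       # 8301
```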


Saving model checkpoint to ./qa_bert_finetuned/checkpoint-500
Configuration saved in ./qa_bert_finetuned/checkpoint-500/config.json
Model weights saved in ./qa_bert_finetuned/checkpoint-500/model.safetensors
tokenizer config file saved in ./qa_bert_finetuned/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./qa_bert_finetuned/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./qa_bert_finetuned/checkpoint-1000
Configuration saved in ./qa_bert_finetuned/checkpoint-1000/config.json
Model weights saved in ./qa_bert_finetuned/checkpoint-1000/model.safetensors
tokenizer config file saved in ./qa_bert_finetuned/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./qa_bert_finetuned/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./qa_bert_finetuned/checkpoint-1500
Configuration saved in ./qa_bert_finetuned/checkpoint-1500/config.json
Model weights saved in ./qa_bert_finetuned/checkpoint-1500/model.safetensors
tokenizer config

SafetensorError: Error while serializing: IoError(Os { code: 28, kind: StorageFull, message: "No space left on device" })
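The disk filled up because the Trainer's default settings write a checkpoint every 500 steps, and each checkpoint stores the model weights plus optimizer state. Over 8,301 steps that is 16 checkpoints. One way to avoid this, sketched below, is to check free space before training and to cap the number of checkpoints kept. `save_strategy` and `save_total_limit` are real `TrainingArguments` parameters; the helper `free_gb` and the dictionary `checkpoint_kwargs` are hypothetical names used only for this illustration, and the chosen values are a suggestion rather than the settings used in this run.

```python
import shutil


def free_gb(path="."):
    """Return the free disk space at `path` in gigabytes."""
    return shutil.disk_usage(path).free / 1e9


# Keyword arguments that could be merged into TrainingArguments(...),
# e.g. TrainingArguments(output_dir=..., **checkpoint_kwargs).
checkpoint_kwargs = {
    "save_strategy": "epoch",  # save once per epoch instead of every 500 steps
    "save_total_limit": 1,     # keep only the most recent checkpoint on disk
}

print(f"Free space: {free_gb():.1f} GB")
print(checkpoint_kwargs)
```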

In [23]:
import shutil, glob

checkpoints = glob.glob("./qa_bert_finetuned/checkpoint-*")
for ck in checkpoints:
    shutil.rmtree(ck, ignore_errors=True)

print("✔ All old checkpoints deleted. Disk space cleared.")


✔ All old checkpoints deleted. Disk space cleared.


In [31]:
trainer.save_model("./qa_bert_finetuned_final")
tokenizer.save_pretrained("./qa_bert_finetuned_final")

print("✔ Final model saved successfully without retraining.")
print("\n--- Training Complete ---\n")


Saving model checkpoint to ./qa_bert_finetuned_final
Configuration saved in ./qa_bert_finetuned_final/config.json
Model weights saved in ./qa_bert_finetuned_final/model.safetensors
tokenizer config file saved in ./qa_bert_finetuned_final/tokenizer_config.json
Special tokens file saved in ./qa_bert_finetuned_final/special_tokens_map.json
tokenizer config file saved in ./qa_bert_finetuned_final/tokenizer_config.json
Special tokens file saved in ./qa_bert_finetuned_final/special_tokens_map.json


✔ Final model saved successfully without retraining.

--- Training Complete ---



# **5. Evaluation**

In [32]:
# @title ## 5. Evaluation & Testing
# The 'Trainer' automatically runs evaluation (computing the loss) on the
# validation set. Now we'll perform the custom tests.

import torch

# Put the model in evaluation mode
model.eval()

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def get_answer(question, context):
    """
    Helper function to get an answer from the fine-tuned model.
    """
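    # 1. Tokenize the input. Truncate only the context so that an
    #    over-long passage cannot exceed the model's input limit.
    inputs = tokenizer(
        question,
        context,
        truncation="only_second",
        max_length=max_length,
        return_tensors="pt",
    ).to(device)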

    # 2. Get model outputs
    with torch.no_grad():
        outputs = model(**inputs)

    # 3. Get the most likely start and end token indices
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    start_index = torch.argmax(start_logits, dim=1).item()
    end_index = torch.argmax(end_logits, dim=1).item()

    # 4. Decode the tokens back to text
    # Ensure start_index is not after end_index
    if start_index <= end_index:
        input_ids = inputs["input_ids"].tolist()[0]
        answer_tokens = input_ids[start_index : end_index + 1]

        # A span that includes [CLS] means the model predicted "no answer here"
        if tokenizer.cls_token_id in answer_tokens:
            return "[Answer in CLS token - likely not found]"

        answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    else:
        answer = "[Could not find a valid answer span]"

    return answer
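`get_answer` takes the argmax of the start and end logits independently, so it can produce an end index that precedes the start index. A common alternative is to score candidate (start, end) pairs jointly and keep the best valid one. The sketch below shows that idea on plain Python lists. `best_span`, `n_best`, and `max_answer_len` are hypothetical names and defaults chosen for illustration, and the logits are made-up numbers, not model output.

```python
def best_span(start_logits, end_logits, n_best=20, max_answer_len=30):
    """Return (start, end, score) with the highest start+end logit sum,
    subject to start <= end and a maximum span length. Returns None if
    no candidate pair is valid."""
    def top_indices(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best]

    best = None
    for s in top_indices(start_logits):
        for e in top_indices(end_logits):
            if e < s or e - s + 1 > max_answer_len:
                continue  # skip reversed or over-long spans
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Made-up logits where the independent argmaxes disagree:
# argmax(start) = 2 but argmax(end) = 1, which is not a valid span.
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1]
end_logits = [0.0, 4.0, 0.2, 3.5, 0.1]

print(best_span(start_logits, end_logits))  # (2, 3, 8.5)
```

In a real pipeline the candidates would also be restricted to context tokens, so that question tokens and special tokens are never returned as an answer.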

In [33]:
# ===========================
#    TEST 1 (Assignment)
# ===========================

question_1 = "Who developed the theory of relativity?"
context_1 = "Albert Einstein developed the theory of relativity in the early 20th century. It is a cornerstone of modern physics."
real_answer_1 = "Albert Einstein"

print(f"Context: {context_1}")
print(f"Question: {question_1}")
print(f"Model Answer: {get_answer(question_1, context_1)}")
print(f"(Real Answer: {real_answer_1})")

print("\n" + "="*80 + "\n")


# ===========================
#    TEST 2 (From SQuAD Val)
# ===========================

validation_sample = dataset["validation"][10]
question_2 = validation_sample["question"]
context_2 = validation_sample["context"]
real_answer_2 = validation_sample["answers"]["text"][0]

print(f"Context: {context_2}")
print(f"Question: {question_2}")
print(f"Model Answer: {get_answer(question_2, context_2)}")
print(f"(Real Answer: {real_answer_2})")

print("\n" + "="*80 + "\n")


# ===========================
#    TEST 3 (General Knowledge)
# ===========================

question_3 = "What is the capital city of France?"
context_3 = "France is a European country known for its culture, cuisine, and history. Its capital city, Paris, is famous for the Eiffel Tower."
real_answer_3 = "Paris"

print(f"Context: {context_3}")
print(f"Question: {question_3}")
print(f"Model Answer: {get_answer(question_3, context_3)}")
print(f"(Real Answer: {real_answer_3})")

print("\n" + "="*80 + "\n")


# ===========================
#    TEST 4 (History)
# ===========================

question_4 = "Who was the first President of the United States?"
context_4 = "The United States was founded in the late 18th century. George Washington served as the first President after the country gained independence."
real_answer_4 = "George Washington"

print(f"Context: {context_4}")
print(f"Question: {question_4}")
print(f"Model Answer: {get_answer(question_4, context_4)}")
print(f"(Real Answer: {real_answer_4})")

print("\n" + "="*80 + "\n")


# ===========================
#    TEST 5 (SQuAD Style)
# ===========================

question_5 = "What is the main ingredient in guacamole?"
context_5 = "Guacamole is a traditional Mexican dip made primarily from mashed avocados. It often includes lime, salt, tomatoes, and onions."
real_answer_5 = "avocados"

print(f"Context: {context_5}")
print(f"Question: {question_5}")
print(f"Model Answer: {get_answer(question_5, context_5)}")
print(f"(Real Answer: {real_answer_5})")

print("\n" + "="*80 + "\n")


# ===========================
#    TEST 6 (Long Context Stress Test)
# ===========================

context_6 = (
    "The Amazon rainforest, located in South America, is one of the most diverse ecosystems on Earth. "
    "It spans multiple countries including Brazil, Peru, and Colombia. "
    "The region plays a significant role in regulating global oxygen levels. "
    "Many scientists call it the 'lungs of the planet' because it produces a large portion of the world's oxygen."
)

question_6 = "Which continent is the Amazon rainforest located in?"
real_answer_6 = "South America"

print(f"Context: {context_6}")
print(f"Question: {question_6}")
print(f"Model Answer: {get_answer(question_6, context_6)}")
print(f"(Real Answer: {real_answer_6})")

print("\n" + "="*80 + "\n")


Context: Albert Einstein developed the theory of relativity in the early 20th century. It is a cornerstone of modern physics.
Question: Who developed the theory of relativity?
Model Answer: albert einstein
(Real Answer: Albert Einstein)


Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature 

In [34]:
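import evaluate

# Loaded for later use; it is not wired into the Trainer here.
metric = evaluate.load("squad")

# No compute_metrics function was passed to the Trainer, so evaluate()
# reports only the validation loss, not Exact Match or F1.
result = trainer.evaluate()
result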



***** Running Evaluation *****
  Num examples = 10784
  Batch size = 32


{'eval_loss': 1.075303077697754}

In [9]:
import evaluate

# Load the SQuAD metric
metric = evaluate.load("squad")

# Define hand-written sample predictions and references in the SQuAD format.
# These are NOT model outputs; they only demonstrate how the metric is called.
sample_predictions = [
    {"id": "5733be284776f41900661182", "prediction_text": "Albert Einstein"},
    {"id": "5733be284776f41900661183", "prediction_text": "February 7, 2016"}
]

sample_references = [
    {"id": "5733be284776f41900661182", "answers": {"answer_start": [0], "text": ["Albert Einstein", "Einstein"]}},
    {"id": "5733be284776f41900661183", "answers": {"answer_start": [0], "text": ["February 7, 2016", "Feb 7, 2016"]}}
]

# Compute the metrics
results = metric.compute(predictions=sample_predictions, references=sample_references)

print("Demonstrating SQuAD Metric Usage (Sample Data):")
print(results)

Demonstrating SQuAD Metric Usage (Sample Data):
{'exact_match': 100.0, 'f1': 100.0}
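To make the two reported numbers concrete, here is a simplified, dependency-free sketch of how SQuAD-style Exact Match and token-level F1 are computed for a single prediction. It follows the usual normalization (lowercasing, removing punctuation and English articles, collapsing whitespace), which is why the lowercase model answer "albert einstein" from Test 1 still counts as an exact match. The function names are hypothetical, and edge cases such as empty answers are handled more simply than in the `evaluate` metric.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    # Multiset intersection counts tokens shared by both answers.
    num_same = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("albert einstein", "Albert Einstein"))  # 1.0
print(exact_match("Einstein", "Albert Einstein"))         # 0.0
print(round(f1_score("Einstein", "Albert Einstein"), 3))  # 0.667
```

With several reference answers per question, the per-question score is typically the maximum over the references, and the dataset score is the mean over questions.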


# **6. Report in Notebook**

## **Introduction: Classification vs. Question Answering**

Text Classification is a task where the model assigns a single, predefined label (or category) to an entire piece of text. For example, in sentiment analysis, the input "I love this movie!" would be classified with the label "Positive." The model's output is a probability score for each possible category, and it answers the question: "What is this text about?" or "What is the sentiment of this text?"

Extractive Question Answering (QA), the task in this project, is fundamentally different. The model receives two inputs: a question and a context (a passage of text). Its goal is not to classify the text but to find the span within the context that answers the question. Instead of predicting one label, the model predicts two positions: the start token and the end token of the answer. This is why we use AutoModelForQuestionAnswering, which places two prediction "heads" on top of the BERT model, one scoring each token as the start of the answer and one scoring it as the end.
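The two "heads" amount to a single small linear layer that maps each token's hidden vector to two numbers, a start logit and an end logit. The loading warning in section 3 names this layer's parameters as `qa_outputs.weight` and `qa_outputs.bias`. The pure-Python sketch below illustrates the computation with random vectors. It assumes BERT-base's hidden size of 768; the values are random stand-ins, not real model activations, and `qa_head` is a hypothetical function name.

```python
import random


def qa_head(hidden_states, weight, bias):
    """Project each token vector to a (start_logit, end_logit) pair.

    hidden_states: list of seq_len vectors, each of length hidden_size
    weight:        two rows of length hidden_size (row 0 = start, row 1 = end)
    bias:          two scalars
    """
    start_logits, end_logits = [], []
    for vec in hidden_states:
        start_logits.append(sum(h * w for h, w in zip(vec, weight[0])) + bias[0])
        end_logits.append(sum(h * w for h, w in zip(vec, weight[1])) + bias[1])
    return start_logits, end_logits


hidden_size = 768  # assumption: BERT-base hidden size
seq_len = 6        # a tiny toy sequence
rng = random.Random(0)

hidden_states = [[rng.gauss(0, 1) for _ in range(hidden_size)] for _ in range(seq_len)]
weight = [[rng.gauss(0, 0.02) for _ in range(hidden_size)] for _ in range(2)]
bias = [0.0, 0.0]

start_logits, end_logits = qa_head(hidden_states, weight, bias)
num_head_params = 2 * hidden_size + 2

print(len(start_logits), len(end_logits))  # 6 6 (one logit of each kind per token)
print(num_head_params)                     # 1538
```

Those 1,538 parameters are the only newly initialized weights; everything else comes from the pretrained checkpoint and is fine-tuned.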

## **Reflection**

This project was a fascinating introduction to a Transformer task beyond simple classification. The most challenging concept by far was the data preprocessing required for Question Answering. Unlike classification, where one input maps to one output, a single long context here had to be split into multiple overlapping "features" to fit within the model's token limit. The most critical step was mapping the character-based answer indices from the SQuAD dataset to token-based indices, using the offset_mapping to find the exact start and end tokens. It was also important to handle spans that do not contain the answer by labeling them with the [CLS] token. Using the Trainer API made fine-tuning itself surprisingly simple, so I could focus on data preparation, clearly the most complex part of the pipeline.