This initial cell prepares the Python environment for the experiment. It first installs two libraries with `pip`: `transformers`, which provides access to the pre-trained models on the Hugging Face Hub, and `torch`, the underlying deep learning framework. The installer's standard output is redirected with `>> /dev/null` to keep the notebook clean. The cell then imports the `time` module, used to measure model inference time, and the `collections` module, used in calculating the F1-score metric.

In [102]:
!pip install transformers torch >> /dev/null
import time
import collections

Here, the specific pre-trained model for the experiment is selected. Three different question-answering models from the Hugging Face Hub are provided as options. For each experimental run, one of the model names is uncommented to set it as the active `model_name`. The selected models represent a range of architectures and sizes: a standard BERT-base model, a smaller DistilBERT model optimized for speed, and a larger BERT-large model known for higher accuracy. This setup allows for a controlled comparison of their performance on the same task.

In [103]:
# 2. MODEL SELECTION
# Model 1:
#model_name = 'deepset/bert-base-cased-squad2'

# Model 2:
#model_name = 'distilbert-base-cased-distilled-squad'

# Model 3:
model_name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
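
As an alternative to commenting and uncommenting lines, the three checkpoints could be kept in a small table and loaded through the `Auto*` classes of `transformers`, which select the matching architecture and tokenizer from the checkpoint itself. The following is a minimal sketch under that assumption, not the notebook's own method: `MODEL_CANDIDATES`, its keys, and `build_qa_pipeline` are hypothetical names introduced here for illustration.

```python
# Hypothetical lookup table; the checkpoint names are the three listed above.
MODEL_CANDIDATES = {
    "bert-base": "deepset/bert-base-cased-squad2",
    "distilbert": "distilbert-base-cased-distilled-squad",
    "bert-large": "bert-large-uncased-whole-word-masking-finetuned-squad",
}


def build_qa_pipeline(model_key):
    """Load the model and its own tokenizer for one entry of MODEL_CANDIDATES."""
    # Imported inside the function so the table above can be inspected
    # without loading the library.
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

    model_name = MODEL_CANDIDATES[model_key]
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return pipeline("question-answering", model=model, tokenizer=tokenizer)


# Example call (downloads the checkpoint on first use):
# qna_pipeline = build_qa_pipeline("distilbert")
```

Loading the tokenizer from the same checkpoint name as the model keeps the two in sync when switching between cased and uncased variants.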

This cell defines the context paragraph that the question-answering model will use as its source of information. The context is a news article concerning the repatriation of Filipino workers from Lebanon, stored as a single, multi-line string in the `context` variable. The model's task is to extract answers exclusively from this provided text.

In [104]:
# 3. DATASET PREPARATION (Context)
context = """
MANILA – The government is arranging chartered flights for the repatriation of more than 200 overseas Filipino workers in Beirut, Lebanon, the Department of Migrant Workers (DMW) said Wednesday. “We are trying to provide for chartered flights. We’re talking to airline companies so that the chartered flights would be able to accommodate for example, no less than 300 overseas Filipino workers from Beirut,” DMW Undersecretary Bernard Olalia said in a Palace press briefing. This was after the scheduled flights of around 15 OFWs on Sept. 25 were cancelled because of the recent bombings in Beirut. Olalia said around 111 OFWs are staying in four temporary shelters in Beirut and waiting for their repatriation. An additional 110 OFWs are applying for exit permits from the Lebanese government, Olalia said. “Apart from the documented OFWs, we have undocumented OFWs who need to secure travel documents and once they’re given travel documents, we will help them in securing also exit visas or exit permits from the Immigration of the Lebanese government,” he said. Olalia, however, said the Philippine government is facing several challenges, including securing landing rights for chartered flights. He said land and sea routes are being considered, in case the situation escalates and makes it “impossible” to take the air route. “The DMW is also studying the possibility of other routes. Apart from air route, we will be assessing the sea and the land route, should the case or the situation there worsen,” Olalia said. He said the DMW, the Overseas Workers Welfare Administration (OWWA), and other concerned agencies will adopt a “whole-of-government assistance" upon the directive of President Ferdinand R. Marcos Jr. He said each repatriated OFW will get PHP150,000 in financial assistance from the DMW and OWWA, as well as psychosocial services. Israel has intensified its airstrikes across the northern border into Lebanon, targeting the Iran-backed militant group Hezbollah. 
Iran fired ballistic missiles in Israel on Tuesday night, following the deadly attacks on Gaza and Lebanon and the recent killings of Hamas, Hezbollah, and Islamic Revolutionary Guard Corps leaders. Olalia said no Filipinos were hurt since the attacks were launched. “We have men on the ground. They work around the clock. At ‘yung mga staff po natin, dinagdagan na po natin (And we augmented our staff) both in Lebanon at (and) nearby posts to be able to provide safest route, to evacuate and ultimately to facilitate the repatriation of our OFWs both either in Lebanon or in Israel,” he said. (PNA)
"""

With the context defined, the next step is to prepare the questions that the model must answer. A Python list named `questions` is created, containing 10 specific questions. Each question is designed to be directly answerable using only the information available in the `context` article, covering a range of details such as names, numbers, and reasons.

In [105]:
# (10 Questions) stored in a list.
questions = [
    "How many overseas Filipino workers are being repatriated from Beirut, Lebanon?",
    "Who is the DMW Undersecretary that was quoted in the article?",
    "Why were the scheduled flights on September 25 cancelled?",
    "How many OFWs are currently staying in temporary shelters in Beirut?",
    "What other routes are being considered if air travel becomes impossible?",
    "Who directed the \"whole-of-government assistance\" for the OFWs?",
    "How much financial assistance will each repatriated OFW receive?",
    "Which militant group is Israel targeting in Lebanon?",
    "According to the article, were any Filipinos hurt in the attacks?",
    "What are the undocumented OFWs applying for from the Lebanese government?"
]

To enable a quantitative evaluation of the model, this cell defines the ground-truth answers. A list named `correct_answers` is created, containing the precise, correct answer for each of the 10 questions. This list is parallel to the `questions` list, ensuring that `correct_answers[i]` is the ground truth for `questions[i]`. These answers will be used later to compare against the model's predictions to calculate the Exact Match and F1-Score metrics.

In [106]:
# Correct answers
correct_answers = [
    "more than 200",
    "Bernard Olalia",
    "the recent bombings in Beirut",
    "around 111",
    "land and sea routes",
    "President Ferdinand R. Marcos Jr.",
    "PHP150,000",
    "Hezbollah",
    "no Filipinos were hurt",
    "exit permits"
]
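
Because the two lists are matched by index and the models can only return spans of the context, a quick sanity check before running the experiment may catch mismatched lengths or reference answers that do not appear in the text. The sketch below is illustrative only: `check_qa_pairs` and the toy data are hypothetical and not part of the original notebook.

```python
def check_qa_pairs(context, questions, answers):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(questions) != len(answers):
        problems.append(f"{len(questions)} questions but {len(answers)} answers")
    for i, answer in enumerate(answers):
        # An extractive model returns a span of the context, so a reference
        # answer that is not a substring cannot be matched exactly.
        if answer.lower() not in context.lower():
            problems.append(f"answer {i + 1} not found in context: {answer!r}")
    return problems


# Toy data for illustration only (not the notebook's article).
toy_context = "Around 111 workers are staying in four temporary shelters."
toy_questions = ["How many workers are in shelters?", "How many shelters are there?"]
toy_answers = ["around 111", "five"]

for problem in check_qa_pairs(toy_context, toy_questions, toy_answers):
    print(problem)  # prints: answer 2 not found in context: 'five'
```

In the notebook, the same function could be called as `check_qa_pairs(context, questions, correct_answers)`.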

This cell loads and initializes the first model, `deepset/bert-base-cased-squad2`, and assumes that Model 1 is the uncommented `model_name` in the model selection cell. It begins by importing the `pipeline` function and the `BertForQuestionAnswering` and `BertTokenizer` classes from the `transformers` library. The `.from_pretrained()` method then downloads and loads the specified model and its tokenizer from the Hugging Face Hub. Finally, these components are assembled into a `question-answering` pipeline. The resulting `qna_pipeline` object takes a question and a context and returns the predicted answer span with a confidence score.

In [87]:
# 4. MODEL 1 - MODEL AND PIPELINE INITIALIZATION
from transformers import pipeline, BertForQuestionAnswering, BertTokenizer

print(f"--- Loading model: {model_name} ---")
model = BertForQuestionAnswering.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
qna_pipeline = pipeline('question-answering', model=model, tokenizer=tokenizer)
print("--- Model loaded successfully. Starting experiment. ---")

--- Loading model: deepset/bert-base-cased-squad2 ---


Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


--- Model loaded successfully. Starting experiment. ---


This cell sets up the second model, `distilbert-base-cased-distilled-squad`. It imports the `DistilBertForQuestionAnswering` and `DistilBertTokenizer` classes and uses the `.from_pretrained()` method to load this smaller, distilled model. Note that the tokenizer is loaded from the base `distilbert-base-cased` checkpoint rather than from `model_name`. The resulting components are then used to create the `qna_pipeline` for this experiment run.

In [99]:
# 4. MODEL 2 - MODEL AND PIPELINE INITIALIZATION
# `pipeline` is imported here as well so the cell runs on its own
from transformers import pipeline, DistilBertForQuestionAnswering, DistilBertTokenizer

print(f"--- Loading model: {model_name} ---")
loaded_model = DistilBertForQuestionAnswering.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
qna_pipeline = pipeline('question-answering', model=loaded_model, tokenizer=tokenizer)
print("--- Model loaded successfully. Starting experiment. ---")

--- Loading model: distilbert-base-cased-distilled-squad ---


Device set to use cpu


--- Model loaded successfully. Starting experiment. ---


This cell sets up the third and final model, `bert-large-uncased-whole-word-masking-finetuned-squad`. Following the established pattern, it imports the `BertForQuestionAnswering` and `BertTokenizer` classes and uses the `.from_pretrained()` method to load this larger, more powerful model and its corresponding tokenizer. The resulting components are then used to create the `qna_pipeline` for this experiment run.

In [110]:
# 4. MODEL 3 - MODEL AND PIPELINE INITIALIZATION
# `pipeline` is imported here as well so the cell runs on its own
from transformers import pipeline, BertForQuestionAnswering, BertTokenizer

model_name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
print(f"--- Loading model: {model_name} ---")
loaded_model = BertForQuestionAnswering.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
qna_pipeline = pipeline('question-answering', model=loaded_model, tokenizer=tokenizer)
print("--- Model loaded successfully. Starting experiment. ---")

--- Loading model: bert-large-uncased-whole-word-masking-finetuned-squad ---


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


--- Model loaded successfully. Starting experiment. ---


This cell defines a helper function, `calculate_f1_score`, for evaluating the model's predictions. The F1-score gives partial credit for overlapping answers, which a binary exact match cannot. The function lowercases both the predicted and ground-truth answers and splits them on whitespace, then counts the tokens they share (as a multiset intersection, so a repeated word is matched only as often as it appears in both). Precision is the shared count divided by the number of predicted tokens, and recall is the shared count divided by the number of ground-truth tokens. The function returns their harmonic mean, or 0 if either answer is empty or no tokens overlap. Punctuation is not stripped, so a token such as `Beirut.` does not match `Beirut`.

In [111]:
# HELPER FUNCTION TO CALCULATE F1-SCORE
def calculate_f1_score(prediction, ground_truth):
    """Calculates the F1-score between a predicted answer and a true answer."""
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()

    # if either is empty, F1 is 0
    if not pred_tokens or not truth_tokens:
        return 0

    common_tokens = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_common = sum(common_tokens.values())

    precision = num_common / len(pred_tokens)
    recall = num_common / len(truth_tokens)

    if precision + recall == 0:
        return 0

    f1 = 2 * (precision * recall) / (precision + recall)
    return f1
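
Because the helper above compares raw whitespace-separated tokens, a trailing period or a leading article lowers the score even when the answer is otherwise the same. The following sketch shows one possible variant that normalizes both strings first, in the spirit of the normalization applied by the official SQuAD evaluation script (lowercasing, removing punctuation and the articles "a", "an", "the", collapsing whitespace). The names `normalize_answer` and `normalized_f1` are assumptions made for this illustration; the notebook's reported numbers were produced without this step.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def normalized_f1(prediction, ground_truth):
    """Token-overlap F1 computed on normalized answers."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    if not pred_tokens or not truth_tokens:
        return 0.0
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


# The trailing period and the missing article no longer cost any credit.
print(normalized_f1("recent bombings in Beirut.", "the recent bombings in Beirut"))  # 1.0
# A genuinely shorter answer still receives only partial credit (about 0.67).
print(normalized_f1("111", "around 111"))
```

Whether to normalize is a design choice: it makes scores more comparable with published SQuAD results, but it also hides formatting differences that may matter in some applications.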

This is the main execution block of the experiment. A `for` loop iterates over the 10 questions, and for each one it records the start time, calls `qna_pipeline` with the question and context, and records the end time to compute `inference_time`. It then extracts the predicted answer text and its confidence score from the result object and calculates two metrics: Exact Match (EM), which is 1 if the prediction equals the ground truth after lowercasing and 0 otherwise, and the F1-score, using the helper function defined previously. The predicted answer, inference time, confidence score, EM, and F1-score are then printed in a structured format for each question.

In [112]:
# 5. EXPERIMENT EXECUTION
for i, question in enumerate(questions):
    print(f"\nQ{i+1}: {question}")

    # Get the correct answer for this question
    true_answer = correct_answers[i]
    print(f"Correct Answer: {true_answer}")

    # Start timer
    start_time = time.time()
    # Get the answer from the pipeline
    answer_obj = qna_pipeline({'question': question, 'context': context})
    # End timer and calculate duration
    end_time = time.time()
    inference_time = end_time - start_time

    # Extract predicted answer text
    predicted_answer = answer_obj['answer']

    # Calculate metrics
    confidence_score = answer_obj['score']
    exact_match = 1 if predicted_answer.lower() == true_answer.lower() else 0
    f1 = calculate_f1_score(predicted_answer, true_answer)

    # Print all the results needed for your Excel sheet
    print(f"Model's Answer: {predicted_answer}")
    print(f"Inference Time (s): {inference_time:.4f}")
    print(f"Confidence Score: {confidence_score:.4f}")
    print(f"Exact Match (EM): {exact_match}")
    print(f"F1-Score: {f1:.4f}")
    print("--------------------------------------------------")


Q1: How many overseas Filipino workers are being repatriated from Beirut, Lebanon?
Correct Answer: more than 200
Model's Answer: 300
Inference Time (s): 28.0460
Confidence Score: 0.1870
Exact Match (EM): 0
F1-Score: 0.0000
--------------------------------------------------

Q2: Who is the DMW Undersecretary that was quoted in the article?
Correct Answer: Bernard Olalia
Model's Answer: Bernard Olalia
Inference Time (s): 35.2427
Confidence Score: 0.9864
Exact Match (EM): 1
F1-Score: 1.0000
--------------------------------------------------

Q3: Why were the scheduled flights on September 25 cancelled?
Correct Answer: the recent bombings in Beirut
Model's Answer: recent bombings in Beirut.
Inference Time (s): 16.9388
Confidence Score: 0.3211
Exact Match (EM): 0
F1-Score: 0.6667
--------------------------------------------------

Q4: How many OFWs are currently staying in temporary shelters in Beirut?
Correct Answer: around 111
Model's Answer: 111
Inference Time (s): 16.6202
Confidence Score: 0.6898
Exact Match (EM): 0
F1-Score: 0.6667
--------------------------------------------------

Q5: What othe