# Assessing the Impact of Question Ambiguity in Question-Answering Systems

**Objective:** Evaluate how effectively LLMs detect and handle ambiguity in questions and analyze the impact on system performance and output uncertainty.

# STEP 4: Model Evaluation with [CLS] Token Masking

- Perform the same procedure as in STEP 3, but mask the special token ([CLS] or <s>) during answer prediction so that it cannot be selected as the answer span.

In [1]:
# Import libraries

!pip install evaluate

import ast
import os
import re
import string

import evaluate
import matplotlib.pyplot as plt
import pandas as pd
import torch
from tqdm import tqdm



## Data Preprocessing

<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 1: Load the Dataset</b> 
</div>

- As input, use the same dataset as in STEP 3.

In [12]:
# Load the dataset
df = pd.read_csv("data/dataset_for_evaluation_task_3.csv", sep=";")
len(df)

1265

In [13]:
df.head()

Unnamed: 0,context,target_word,question,annotated_question,answers
0,Beyoncé's first solo recording was a feature o...,album,"The album, Dangerously in Love achieved what s...",What spot did the album achieve?,"{'text': ['number four'], 'answer_start': [123]}"
1,Following the disbandment of Destiny's Child i...,album,"After her second solo album, what other entert...","After her second album, what other entertainme...","{'text': ['acting'], 'answer_start': [207]}"
2,Beyoncé's first solo recording was a feature o...,album,Beyonce's first album by herself was called what?,Her first album was called what?,"{'text': ['Dangerously in Love'], 'answer_star..."
3,Beyoncé's first solo recording was a feature o...,album,Beyonce's first solo album in the U.S. with wh...,Her first album in the U.S. featured which art...,"{'text': ['Jay Z'], 'answer_start': [48]}"
4,"In November 2003, she embarked on the Dangerou...",album,Destiny's Child's final album was named what?,Their final album was named what?,"{'text': ['Destiny Fulfilled'], 'answer_start'..."


<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 2: Data Normalization</b> 
</div>

Define functions:
- parse_answer() – converts stringified answer dictionaries back into Python dictionaries.
- extract_answer_text() – extracts the raw ground-truth answer text (no normalization is applied at this stage).
  
Apply both functions to the answers column and store the result in a new ground_truth_answer column.

Source for code:

- *[ast.literal_eval](https://docs.python.org/3/library/ast.html#ast.literal_eval)*
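
As a minimal sketch of the parsing step, assuming the answers column stores SQuAD-style dictionaries serialized as strings (the sample value is copied from the first row of the dataset preview), `ast.literal_eval` turns such a string back into a dictionary without evaluating arbitrary code:

```python
import ast

# Sample value copied from the first row of the dataset preview.
raw = "{'text': ['number four'], 'answer_start': [123]}"

# literal_eval only accepts Python literals (dicts, lists, strings, numbers, ...),
# which makes it a safer choice than eval() for parsing stored values.
parsed = ast.literal_eval(raw)

print(type(parsed).__name__)  # dict
print(parsed["text"][0])      # number four

# str() of a list keeps the brackets and quotes, which is why the
# ground_truth_answer column shows values such as "['number four']".
as_text = str(parsed["text"])
print(as_text)                # ['number four']
```

The bracketed string form is harmless for scoring only as long as the metric strips punctuation before comparing answers.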

In [14]:
def parse_answer(value):
    """Convert stringified dicts into Python dictionaries."""
    if isinstance(value, str) and value.strip().startswith("{") and "'text'" in value:
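        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError, TypeError):
            # Leave the value unchanged if it is not a valid Python literal
            return value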
    return value


def extract_answer_text(x):
    """Extract raw answer text (no normalization), whether it's a dict, string, or number."""
    if isinstance(x, dict) and 'text' in x:
        # 'text' is a list, so this yields its string form, e.g. "['number four']"
        return str(x['text'])
    elif isinstance(x, (int, float)):
        return str(x)
    elif isinstance(x, str):
        return x
    else:
        return ""


# Parse answers
df['answers'] = df['answers'].apply(parse_answer)

# Extract raw text
df['ground_truth_answer'] = df['answers'].apply(extract_answer_text)

In [15]:
df[['answers', 'ground_truth_answer']].head()

Unnamed: 0,answers,ground_truth_answer
0,"{'text': ['number four'], 'answer_start': [123]}",['number four']
1,"{'text': ['acting'], 'answer_start': [207]}",['acting']
2,"{'text': ['Dangerously in Love'], 'answer_star...",['Dangerously in Love']
3,"{'text': ['Jay Z'], 'answer_start': [48]}",['Jay Z']
4,"{'text': ['Destiny Fulfilled'], 'answer_start'...",['Destiny Fulfilled']


## Load Model and Tokenizer

In [16]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import BertTokenizer, BertForQuestionAnswering
from transformers import RobertaTokenizer, RobertaForQuestionAnswering

<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 1: Connect to GPU</b> 
</div>

In [17]:
# Connect to GPU (if available)

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla P100-PCIE-16GB


<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 2: Define Models and Tokenizers</b> 
</div>

In [18]:
# BERT
bert_base_tokenizer = AutoTokenizer.from_pretrained("twmkn9/bert-base-uncased-squad2")
bert_base_model = AutoModelForQuestionAnswering.from_pretrained("twmkn9/bert-base-uncased-squad2").to(device)

bert_large_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
bert_large_model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad').to(device)


# RoBERTa
roberta_base_tokenizer = RobertaTokenizer.from_pretrained('deepset/roberta-base-squad2')
roberta_base_model = RobertaForQuestionAnswering.from_pretrained('deepset/roberta-base-squad2').to(device)

roberta_large_tokenizer = RobertaTokenizer.from_pretrained('deepset/roberta-large-squad2')
roberta_large_model = RobertaForQuestionAnswering.from_pretrained('deepset/roberta-large-squad2').to(device)


# DeBERTa
deberta_base_tokenizer = AutoTokenizer.from_pretrained("deepset/deberta-v3-base-squad2")
deberta_base_model = AutoModelForQuestionAnswering.from_pretrained("deepset/deberta-v3-base-squad2").to(device)

deberta_large_tokenizer = AutoTokenizer.from_pretrained("deepset/deberta-v3-large-squad2")
deberta_large_model = AutoModelForQuestionAnswering.from_pretrained("deepset/deberta-v3-large-squad2").to(device)


Some weights of the model checkpoint at twmkn9/bert-base-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [19]:
# Map each model name to its (tokenizer, model) pair
models = {
    "bert-base": (bert_base_tokenizer, bert_base_model),
    "bert-large": (bert_large_tokenizer, bert_large_model),
    "roberta-base": (roberta_base_tokenizer, roberta_base_model),
    "roberta-large": (roberta_large_tokenizer, roberta_large_model),
    "deberta-base": (deberta_base_tokenizer, deberta_base_model),
    "deberta-large": (deberta_large_tokenizer, deberta_large_model)
}

## Answer Prediction

<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 1: Answer Prediction - MASKING [CLS]</b> 
</div>

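For each model, iterate through every row in the dataset and predict an answer for both the original and the ambiguous question. Before the argmax, the start and end logits at position 0 (the [CLS] or <s> token) are set to negative infinity, so the special token can never be selected. The raw predictions are stored in two new columns per model: *{model_name}_pred_orig* and *{model_name}_pred_ambig*.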

Source for code:
- [-float('inf')](https://stackoverflow.com/questions/34264710/what-is-the-point-of-floatinf-in-python)
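
The masking idea can be illustrated without a model. The following is a minimal pure-Python sketch, not the notebook's implementation: `predict_span`, the toy tokens, and the toy logits are all hypothetical and only mimic the shape of a QA model's output.

```python
NEG_INF = -float("inf")


def argmax(values):
    """Return the index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def predict_span(tokens, start_logits, end_logits, mask_first=True):
    """Pick the highest-scoring start and end positions and return the span text."""
    start_logits = list(start_logits)  # copies, so the caller's scores stay untouched
    end_logits = list(end_logits)
    if mask_first:
        # Position 0 holds the special token ([CLS] or <s>); -inf can never win argmax.
        start_logits[0] = NEG_INF
        end_logits[0] = NEG_INF
    start = argmax(start_logits)
    end = argmax(end_logits) + 1  # exclusive end index for slicing
    return " ".join(tokens[start:end])


# Hypothetical toy input in which the special token has the highest raw score.
tokens = ["[CLS]", "what", "spot", "?", "[SEP]", "number", "four", "[SEP]"]
start_logits = [5.0, 0.1, 0.2, 0.0, 0.1, 4.0, 1.0, 0.0]
end_logits = [5.0, 0.1, 0.1, 0.0, 0.2, 1.0, 4.5, 0.0]

print(predict_span(tokens, start_logits, end_logits, mask_first=False))  # [CLS]
print(predict_span(tokens, start_logits, end_logits, mask_first=True))   # number four
```

In this sketch, a best end position that lies before the best start position produces an empty slice. That is one way an empty prediction can still occur after masking.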

In [21]:
# MASKING [CLS]

# Predict answers for original and ambiguous questions for all models

# Iterate through models
for model_name, (tokenizer, model) in models.items():
    print(f"\n Predicting answers with: {model_name}...")
    
    model.eval()  # model to evaluation mode

    # Store answers 
    pred_answers_orig = []
    pred_answers_ambig = []

    # Iterate through each row in a dataset
    for _, row in tqdm(df.iterrows(), total=len(df)):
        context = row['context']
        question_orig = row['question']
        question_ambig = row['annotated_question']

        # Encode the input for the original question (question and context)
        inputs = tokenizer.encode_plus(
            question_orig,
            context,
            return_tensors='pt',
            truncation=True, # ensures the input is no longer than max_length
            max_length=512 # maximum number of tokens
        ) 
        
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = model(**inputs)
            
        # Work on copies of the logits so the model output is not modified
        start_logits = outputs.start_logits.clone()
        end_logits = outputs.end_logits.clone()

        # Set the logit at position 0 ([CLS] or <s>) to negative infinity
        # so the special token can never be selected by argmax
        start_logits[0][0] = -float('inf')
        end_logits[0][0] = -float('inf')

        # Select start and end position of the prediction
        start = torch.argmax(start_logits)
        end = torch.argmax(end_logits) + 1
        answer_orig = tokenizer.convert_tokens_to_string(
            tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][start:end])
        ) 
        pred_answers_orig.append(answer_orig)

        
        # Predict for ambiguous question
        inputs = tokenizer.encode_plus(
            question_ambig,
            context,
            return_tensors='pt',
            truncation=True,
            max_length=512
        )
        
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = model(**inputs)
            
        # Work on copies of the logits so the model output is not modified
        start_logits = outputs.start_logits.clone()
        end_logits = outputs.end_logits.clone()

        # Set the logit at position 0 ([CLS] or <s>) to negative infinity
        # so the special token can never be selected by argmax
        start_logits[0][0] = -float('inf')
        end_logits[0][0] = -float('inf')

        # Select start and end position of the prediction
        start = torch.argmax(start_logits)
        end = torch.argmax(end_logits) + 1
        answer_ambig = tokenizer.convert_tokens_to_string(
            tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][start:end])
        ) 
        pred_answers_ambig.append(answer_ambig)
        

    # Save predictions
    df[f'{model_name}_pred_orig'] = pred_answers_orig
    df[f'{model_name}_pred_ambig'] = pred_answers_ambig


 Predicting answers with: bert-base...


100%|██████████| 1265/1265 [00:27<00:00, 45.93it/s]



 Predicting answers with: bert-large...


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
100%|██████████| 1265/1265 [01:13<00:00, 17.22it/s]



 Predicting answers with: roberta-base...


100%|██████████| 1265/1265 [00:29<00:00, 42.29it/s]



 Predicting answers with: roberta-large...


100%|██████████| 1265/1265 [01:12<00:00, 17.50it/s]



 Predicting answers with: deberta-base...


100%|██████████| 1265/1265 [00:56<00:00, 22.54it/s]



 Predicting answers with: deberta-large...


100%|██████████| 1265/1265 [01:55<00:00, 10.98it/s]


In [23]:
df.to_csv("data/model_answer_predictions_no_cls_4.csv", index=False)
df.head()

Unnamed: 0,context,target_word,question,annotated_question,answers,ground_truth_answer,bert-base_pred_orig,bert-base_pred_ambig,bert-large_pred_orig,bert-large_pred_ambig,roberta-base_pred_orig,roberta-base_pred_ambig,roberta-large_pred_orig,roberta-large_pred_ambig,deberta-base_pred_orig,deberta-base_pred_ambig,deberta-large_pred_orig,deberta-large_pred_ambig
0,Beyoncé's first solo recording was a feature o...,album,"The album, Dangerously in Love achieved what s...",What spot did the album achieve?,"{'text': ['number four'], 'answer_start': [123]}",['number four'],number four,number four,number four,,four,number four,four,atop the Billboard 200,number four,four,four,atop the Billboard 200
1,Following the disbandment of Destiny's Child i...,album,"After her second solo album, what other entert...","After her second album, what other entertainme...","{'text': ['acting'], 'answer_start': [207]}",['acting'],acting,acting,acting,acting,acting,acting,acting,acting,acting,acting,acting,acting
2,Beyoncé's first solo recording was a feature o...,album,Beyonce's first album by herself was called what?,Her first album was called what?,"{'text': ['Dangerously in Love'], 'answer_star...",['Dangerously in Love'],dangerously in love,dangerously in love,dangerously in love,dangerously in love,Dangerously in Love,Dangerously in Love,Dangerously in Love,Dangerously in Love,Dangerously in Love,Dangerously in Love,Dangerously in Love,Dangerously in Love
3,Beyoncé's first solo recording was a feature o...,album,Beyonce's first solo album in the U.S. with wh...,Her first album in the U.S. featured which art...,"{'text': ['Jay Z'], 'answer_start': [48]}",['Jay Z'],jay z,jay z,jay z,jay z,Jay Z,"Jay Z's ""'03 Bonnie & Clyde"" that was release...",Jay Z,Jay Z,Jay Z,Jay Z,Jay Z,Jay Z
4,"In November 2003, she embarked on the Dangerou...",album,Destiny's Child's final album was named what?,Their final album was named what?,"{'text': ['Destiny Fulfilled'], 'answer_start'...",['Destiny Fulfilled'],destiny fulfilled,destiny fulfilled,destiny fulfilled,destiny fulfilled,Destiny Fulfilled,Destiny Fulfilled,Destiny Fulfilled,Destiny Fulfilled,Destiny Fulfilled,Destiny Fulfilled,Destiny Fulfilled,Destiny Fulfilled


<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 2: Save Answer Predictions for Target Words Used in Error Analysis</b> 
</div>

In [24]:
# Save words for error analysis
words = ["cabinet", "cell", "fan", "season"]

df_words_no_cls = df[df["target_word"].isin(words)]
df_words_no_cls.head(3)

Unnamed: 0,context,target_word,question,annotated_question,answers,ground_truth_answer,bert-base_pred_orig,bert-base_pred_ambig,bert-large_pred_orig,bert-large_pred_ambig,roberta-base_pred_orig,roberta-base_pred_ambig,roberta-large_pred_orig,roberta-large_pred_ambig,deberta-base_pred_orig,deberta-base_pred_ambig,deberta-large_pred_orig,deberta-large_pred_ambig
218,Legislative power lies with the Nitijela. The ...,cabinet,How many ministers are in the Presidential Cab...,How many people are in the cabinet?,"{'text': ['ten'], 'answer_start': [250]}",['ten'],ten,ten,ten,ten,ten,ten,ten,ten,ten,ten,ten,ten
219,"Preceding the reform law, in August 1952, comm...",cabinet,How many posts did the Muslim Brotherhood get ...,How many roles did the group hold in the cabinet?,"{'text': ['two'], 'answer_start': [537]}",['two'],two,four,,two,four,four,two,two,two,two of its members,two,two
220,New Haven is the birthplace of former presiden...,cabinet,"Serving in President Obama's cabinet, this man...","Serving in the cabinet, this individual also s...","{'text': ['John Kerry'], 'answer_start': [419]}",['John Kerry'],john kerry,secretary of state john kerry,john kerry,john kerry,John Kerry,John Kerry,John Kerry,Yale Law School,John Kerry,John Kerry,John Kerry,John Kerry


In [25]:
df_words_no_cls.tail(3)

Unnamed: 0,context,target_word,question,annotated_question,answers,ground_truth_answer,bert-base_pred_orig,bert-base_pred_ambig,bert-large_pred_orig,bert-large_pred_ambig,roberta-base_pred_orig,roberta-base_pred_ambig,roberta-large_pred_orig,roberta-large_pred_ambig,deberta-base_pred_orig,deberta-base_pred_ambig,deberta-large_pred_orig,deberta-large_pred_ambig
943,"In season eight, Latin Grammy Award-nominated ...",season,Who was added as a fourth judge in the eighth ...,Who was brought in during the season for a fou...,"{'text': ['Kara DioGuardi'], 'answer_start': [...",['Kara DioGuardi'],kara dioguardi,kara dioguardi,kara dioguardi,kara dioguardi,Kara DioGuardi,Kara DioGuardi,Kara DioGuardi,Kara DioGuardi,Kara DioGuardi,Kara DioGuardi,Kara DioGuardi,Kara DioGuardi
944,The first season was co-hosted by Ryan Seacres...,season,Who was the only host of American Idol after s...,Who hosted the show once season one had passed...,"{'text': ['Ryan Seacrest'], 'answer_start': [34]}",['Ryan Seacrest'],brian dunkleman,brian dunkleman,ryan seacrest,ryan seacrest,Brian Dunkleman,Ryan Seacrest,Ryan Seacrest,Ryan Seacrest,Ryan Seacrest,Ryan Seacrest,Ryan Seacrest,Ryan Seacrest
945,Guest judges may occasionally be introduced. I...,season,Who were the guest judges in season two?,Who joined in as a guest during the second sea...,"{'text': ['Lionel Richie and Robin Gibb'], 'an...",['Lionel Richie and Robin Gibb'],lionel richie and robin gibb,"donna summer, quentin tarantino",lionel richie and robin gibb,lionel richie and robin gibb,Lionel Richie and Robin Gibb,Lionel Richie and Robin Gibb,Lionel Richie and Robin Gibb,Lionel Richie and Robin Gibb,Lionel Richie and Robin Gibb,Lionel Richie and Robin Gibb,Lionel Richie and Robin Gibb,Lionel Richie


In [27]:
df_words_no_cls.to_csv("data/error_analysis_words_no_cls_4.csv", index=False)

## Evaluation

<div style="position:relative;padding:.75rem 1.25rem;margin-bottom:1rem;border:1px solid transparent;border-radius:.25rem;background-color:#dae8fc;border-color:#6c8ebf;color:#0c5460">
<b>Step 1: Overall Evaluation</b> 
</div>

Compute Exact Match (EM) and F1 scores for both question types with the SQuAD metric from the evaluate library, count how often each model returned an empty or special-token prediction, and save the overall results as a DataFrame.

- Source code with SQuAD [metrics](https://huggingface.co/spaces/evaluate-metric/squad_v2)
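
To make the two scores concrete, here is a minimal pure-Python sketch of SQuAD-style EM and F1 for a single prediction. It follows the commonly used SQuAD v1.1 scoring logic (lowercase, strip punctuation and English articles, then compare tokens) and is an illustration rather than the library's own code. The function names are chosen for this example, and scores are on a 0 to 1 scale, whereas the metric used in this notebook reports 0 to 100.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


# Ground truth and predictions taken from the first row of the prediction table.
truth = "['number four']"
print(exact_match("number four", truth))      # 1.0
print(exact_match("four", truth))             # 0.0
print(round(f1_score("four", truth), 4))      # 0.6667
print(f1_score("", truth))                    # 0.0
```

Because normalization removes punctuation, the bracketed ground-truth string "['number four']" still matches the prediction "number four" exactly.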

In [28]:
# Data obtained after answer predictions
df.head(3)

Unnamed: 0,context,target_word,question,annotated_question,answers,ground_truth_answer,bert-base_pred_orig,bert-base_pred_ambig,bert-large_pred_orig,bert-large_pred_ambig,roberta-base_pred_orig,roberta-base_pred_ambig,roberta-large_pred_orig,roberta-large_pred_ambig,deberta-base_pred_orig,deberta-base_pred_ambig,deberta-large_pred_orig,deberta-large_pred_ambig
0,Beyoncé's first solo recording was a feature o...,album,"The album, Dangerously in Love achieved what s...",What spot did the album achieve?,"{'text': ['number four'], 'answer_start': [123]}",['number four'],number four,number four,number four,,four,number four,four,atop the Billboard 200,number four,four,four,atop the Billboard 200
1,Following the disbandment of Destiny's Child i...,album,"After her second solo album, what other entert...","After her second album, what other entertainme...","{'text': ['acting'], 'answer_start': [207]}",['acting'],acting,acting,acting,acting,acting,acting,acting,acting,acting,acting,acting,acting
2,Beyoncé's first solo recording was a feature o...,album,Beyonce's first album by herself was called what?,Her first album was called what?,"{'text': ['Dangerously in Love'], 'answer_star...",['Dangerously in Love'],dangerously in love,dangerously in love,dangerously in love,dangerously in love,Dangerously in Love,Dangerously in Love,Dangerously in Love,Dangerously in Love,Dangerously in Love,Dangerously in Love,Dangerously in Love,Dangerously in Love


In [32]:
# MASKING [CLS]

# Load the metric
metric = evaluate.load("squad")

# Store the results
overall_results_masked = []

for model_name in models.keys():
    print(f"\nEvaluating {model_name}...")

    # Extract predictions for original and ambiguous questions
    preds_orig = [
        {'id': str(i), 'prediction_text': pred}
        for i, pred in enumerate(df[f"{model_name}_pred_orig"])
    ]
    preds_ambig = [
        {'id': str(i), 'prediction_text': pred}
        for i, pred in enumerate(df[f"{model_name}_pred_ambig"])
    ]

    # Count predictions without a real answer
    def special_token_count(preds):
        """Count how many times the special [CLS] token or an empty answer was predicted.
        Returns the number of predictions without a real answer."""
        return sum(
            1 for prediction in preds 
            if str(prediction).strip().lower() in ["", "cls", "[cls]"])

    # Number of empty or special-token predictions for each question type
    cls_count_orig_masked = special_token_count(df[f"{model_name}_pred_orig"])
    cls_count_ambig_masked = special_token_count(df[f"{model_name}_pred_ambig"])

    # Ground truth answer references
    references = [
        {'id': str(i), 'answers': {'text': [ans], 'answer_start': [0]}}
        for i, ans in enumerate(df['ground_truth_answer'])
    ]

    # Compute metrics
    results_orig_masked = metric.compute(predictions=preds_orig, references=references)
    results_ambig_masked = metric.compute(predictions=preds_ambig, references=references)

    # Print results
    print(f"{model_name}:")
    print(f"  Original Question - EM: {results_orig_masked['exact_match']:.2f}, F1: {results_orig_masked['f1']:.2f}")
    print(f"  Ambiguous Question - EM: {results_ambig_masked['exact_match']:.2f}, F1: {results_ambig_masked['f1']:.2f}")
    print(f"  Special token predicted: {cls_count_orig_masked} times (orig), {cls_count_ambig_masked} times (ambig)")

    # Save all results
    overall_results_masked.append({
        "model": model_name,
        "em_orig": results_orig_masked['exact_match'],
        "f1_orig": results_orig_masked['f1'],
        "em_ambig": results_ambig_masked['exact_match'],
        "f1_ambig": results_ambig_masked['f1'],
        "cls_count_orig": cls_count_orig_masked,
        "cls_count_ambig": cls_count_ambig_masked
    })

# Save to file
overall_df_masked = pd.DataFrame(overall_results_masked)
overall_df_masked.to_csv("data/all_models_em_f1_results_no_cls_4.csv", sep=',', index=False)


Evaluating bert-base...
bert-base:
  Original Question - EM: 83.56, F1: 89.33
  Ambiguous Question - EM: 65.38, F1: 74.20
  Special token predicted: 5 times (orig), 31 times (ambig)

Evaluating bert-large...
bert-large:
  Original Question - EM: 81.34, F1: 88.84
  Ambiguous Question - EM: 66.80, F1: 76.87
  Special token predicted: 5 times (orig), 21 times (ambig)

Evaluating roberta-base...
roberta-base:
  Original Question - EM: 83.87, F1: 92.07
  Ambiguous Question - EM: 67.83, F1: 78.13
  Special token predicted: 9 times (orig), 24 times (ambig)

Evaluating roberta-large...
roberta-large:
  Original Question - EM: 93.83, F1: 97.26
  Ambiguous Question - EM: 72.81, F1: 80.94
  Special token predicted: 4 times (orig), 14 times (ambig)

Evaluating deberta-base...
deberta-base:
  Original Question - EM: 97.55, F1: 98.75
  Ambiguous Question - EM: 74.07, F1: 80.50
  Special token predicted: 3 times (orig), 28 times (ambig)

Evaluating deberta-large...
deberta-large:
  Original Question

## Plotting

- Compare model performance with and without special token masking.

In [33]:
# Load the results without masking (from STEP 3)
df = pd.read_csv("data/all_models_em_f1_results_3.csv")
df

Unnamed: 0,model,em_orig,f1_orig,em_ambig,f1_ambig,cls_count_orig,cls_count_ambig
0,bert-base,83.162055,88.909852,61.027668,69.10788,13,134
1,bert-large,81.343874,88.841362,66.798419,76.872061,5,22
2,roberta-base,82.213439,90.053486,61.264822,70.539917,10,20
3,roberta-large,93.596838,96.959785,64.822134,71.488967,4,15
4,deberta-base,97.470356,98.675843,68.774704,74.048739,4,162
5,deberta-large,91.225296,96.366815,64.664032,72.073065,4,190


In [34]:
df.columns

Index(['model', 'em_orig', 'f1_orig', 'em_ambig', 'f1_ambig', 'cls_count_orig',
       'cls_count_ambig'],
      dtype='object')

In [35]:
# Load the results with [CLS] masking (this step)
df_masked = pd.read_csv("data/all_models_em_f1_results_no_cls_4.csv")

df_masked

Unnamed: 0,model,em_orig,f1_orig,em_ambig,f1_ambig,cls_count_orig,cls_count_ambig
0,bert-base,83.557312,89.327695,65.375494,74.198342,5,31
1,bert-large,81.343874,88.841362,66.798419,76.872061,5,21
2,roberta-base,83.873518,92.070437,67.826087,78.127571,9,24
3,roberta-large,93.833992,97.26018,72.806324,80.943769,4,14
4,deberta-base,97.549407,98.745013,74.071146,80.503608,3,28
5,deberta-large,91.225296,96.366815,73.438735,81.904492,3,15


In [36]:
df_masked.columns

Index(['model', 'em_orig', 'f1_orig', 'em_ambig', 'f1_ambig', 'cls_count_orig',
       'cls_count_ambig'],
      dtype='object')

In [40]:
# Merge datasets for comparison
df = df.merge(df_masked, on="model", suffixes=("_unmasked", "_masked"))
df

Unnamed: 0,model,em_orig_unmasked,f1_orig_unmasked,em_ambig_unmasked,f1_ambig_unmasked,cls_count_orig_unmasked,cls_count_ambig_unmasked,em_orig_masked,f1_orig_masked,em_ambig_masked,f1_ambig_masked,cls_count_orig_masked,cls_count_ambig_masked
0,bert-base,83.162055,88.909852,61.027668,69.10788,13,134,83.557312,89.327695,65.375494,74.198342,5,31
1,bert-large,81.343874,88.841362,66.798419,76.872061,5,22,81.343874,88.841362,66.798419,76.872061,5,21
2,roberta-base,82.213439,90.053486,61.264822,70.539917,10,20,83.873518,92.070437,67.826087,78.127571,9,24
3,roberta-large,93.596838,96.959785,64.822134,71.488967,4,15,93.833992,97.26018,72.806324,80.943769,4,14
4,deberta-base,97.470356,98.675843,68.774704,74.048739,4,162,97.549407,98.745013,74.071146,80.503608,3,28
5,deberta-large,91.225296,96.366815,64.664032,72.073065,4,190,91.225296,96.366815,73.438735,81.904492,3,15


In [41]:
df.columns

Index(['model', 'em_orig_unmasked', 'f1_orig_unmasked', 'em_ambig_unmasked',
       'f1_ambig_unmasked', 'cls_count_orig_unmasked',
       'cls_count_ambig_unmasked', 'em_orig_masked', 'f1_orig_masked',
       'em_ambig_masked', 'f1_ambig_masked', 'cls_count_orig_masked',
       'cls_count_ambig_masked'],
      dtype='object')

In [42]:
# Compute the F1 difference between the masked and unmasked runs
df["delta_f1_orig"]  = df["f1_orig_masked"]  - df["f1_orig_unmasked"]
df["delta_f1_ambig"] = df["f1_ambig_masked"] - df["f1_ambig_unmasked"]

# Compute the difference in special-token prediction counts between the masked and unmasked runs
df["delta_cls_orig"]  = df["cls_count_orig_masked"]  - df["cls_count_orig_unmasked"]
df["delta_cls_ambig"] = df["cls_count_ambig_masked"] - df["cls_count_ambig_unmasked"]


In [43]:
df

Unnamed: 0,model,em_orig_unmasked,f1_orig_unmasked,em_ambig_unmasked,f1_ambig_unmasked,cls_count_orig_unmasked,cls_count_ambig_unmasked,em_orig_masked,f1_orig_masked,em_ambig_masked,f1_ambig_masked,cls_count_orig_masked,cls_count_ambig_masked,delta_f1_orig,delta_f1_ambig,delta_cls_orig,delta_cls_ambig
0,bert-base,83.162055,88.909852,61.027668,69.10788,13,134,83.557312,89.327695,65.375494,74.198342,5,31,0.417843,5.090461,-8,-103
1,bert-large,81.343874,88.841362,66.798419,76.872061,5,22,81.343874,88.841362,66.798419,76.872061,5,21,0.0,0.0,0,-1
2,roberta-base,82.213439,90.053486,61.264822,70.539917,10,20,83.873518,92.070437,67.826087,78.127571,9,24,2.016951,7.587654,-1,4
3,roberta-large,93.596838,96.959785,64.822134,71.488967,4,15,93.833992,97.26018,72.806324,80.943769,4,14,0.300395,9.454802,0,-1
4,deberta-base,97.470356,98.675843,68.774704,74.048739,4,162,97.549407,98.745013,74.071146,80.503608,3,28,0.06917,6.454869,-1,-134
5,deberta-large,91.225296,96.366815,64.664032,72.073065,4,190,91.225296,96.366815,73.438735,81.904492,3,15,0.0,9.831427,-1,-175


In [45]:
df.to_csv("data/df_masked_unmasked_cls_compare_4.csv", index=False)
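
One possible way to visualize the comparison is a grouped bar chart of F1 on ambiguous questions. The sketch below is an illustration under stated assumptions, not the notebook's original plot. The values are copied from the f1_ambig_unmasked and f1_ambig_masked columns of the merged table, while the output file name and the Agg backend are assumptions made so the block runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); not needed inside a notebook
import matplotlib.pyplot as plt

# Values copied from the merged comparison table (F1 on ambiguous questions).
model_names = ["bert-base", "bert-large", "roberta-base",
               "roberta-large", "deberta-base", "deberta-large"]
f1_ambig_unmasked = [69.10788, 76.872061, 70.539917, 71.488967, 74.048739, 72.073065]
f1_ambig_masked = [74.198342, 76.872061, 78.127571, 80.943769, 80.503608, 81.904492]

positions = list(range(len(model_names)))
width = 0.4

fig, ax = plt.subplots(figsize=(8, 4))
# Two bars per model, shifted left and right of the tick position.
ax.bar([p - width / 2 for p in positions], f1_ambig_unmasked, width, label="unmasked")
ax.bar([p + width / 2 for p in positions], f1_ambig_masked, width, label="masked")

ax.set_xticks(positions)
ax.set_xticklabels(model_names, rotation=30)
ax.set_ylabel("F1 (ambiguous questions)")
ax.set_title("F1 with and without special token masking")
ax.legend()
fig.tight_layout()

# Hypothetical output file name; in a notebook, plt.show() would display the figure.
fig.savefig("f1_ambig_masked_vs_unmasked.png")
```

The same layout could be reused for the EM columns or for the special-token counts by swapping the two value lists.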