# ANLP Final Project

*   Paper chosen: https://aclanthology.org/2022.findings-naacl.191/
*   This notebook compares TaCL-BERT and BERT on the SQuAD dataset to determine which model performs better, and runs a robustness check on the model.



Team Members:


1.   Sai Sandeep Varma Mudundi (G01352322)
2.   Asra Naseem               (G01349680)
3.   Rajeev Priyatam Panchadula (G01333080)




In [None]:
!pip install datasets transformers

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# cd /content/drive/MyDrive/TaCL-main/TaCL-main

/content/drive/MyDrive/TaCL-main/TaCL-main


In [None]:
# Pre-trained TaCL BERT-base model from the paper
model_to_use = "cambridgeltl/tacl-bert-base-uncased"
# Uncomment the line below to use BERT instead
# model_to_use = "bert-base-uncased"
batch_size = 12

## Load dataset

In [None]:
# Import the load_dataset and load_metric functions from the datasets library
from datasets import load_dataset, load_metric

# Load the SQuAD dataset using the load_dataset function
squad_dataset = load_dataset("squad")

# Print the loaded dataset to the console
print(squad_dataset)

# Access the first sample in the training split of the loaded dataset
first_sample = squad_dataset["train"][0]





DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


## Preprocessing

In [None]:
# Import the AutoTokenizer class from the transformers library
from transformers import AutoTokenizer

# Load the pre-trained tokenizer named by the 'model_to_use' variable
tokenizer = AutoTokenizer.from_pretrained(model_to_use)


In this cell, we import the transformers library and assert that tokenizer is an instance of PreTrainedTokenizerFast. The preprocessing below relies on features that only fast tokenizers provide, such as offset mappings.

In [None]:
# Import the transformers library
import transformers

# Check that the 'tokenizer' variable is an instance of the PreTrainedTokenizerFast class
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)


In [None]:
#testing tokenized input
# Define the input sentences
question = "What is the capital of France?"
response = "The capital of France is Paris."
#the output may vary depending on the model you have selected
# Apply the tokenizer to the input sentences
tokenized_input = tokenizer(question, response)

# Print the tokenized output to the console
print(tokenized_input)

{'input_ids': [101, 2054, 2003, 1996, 3007, 1997, 2605, 1029, 102, 1996, 3007, 1997, 2605, 2003, 3000, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [None]:
# Set the maximum length of a feature to 384 (question and context)
max_length = 384

# Set the authorized overlap between two parts of the context when splitting is needed to 128
doc_stride = 128

# In question answering, dealing with very long documents requires a different preprocessing approach
# We can't simply truncate them as it may result in losing the answer we are looking for
# To handle this, we allow one long example in the dataset to give several input features, each shorter than the maximum length
# We also allow some overlap between the features we generate using the hyper-parameter doc_stride
# This ensures that the answer isn't lost if it lies at the point where we split a long context


In [None]:
# Find the first training example whose tokenized question + context exceeds max_length.
# Its full token count is stored in original_length, and its token count after
# truncating the context to fit max_length is stored in truncated_length.
for i, example in enumerate(squad_dataset["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        break

# Get the example with a combined length that exceeds the maximum length
example = squad_dataset["train"][i]

# Get the length of the input IDs for the example using the tokenizer
original_length = len(tokenizer(example["question"], example["context"])["input_ids"])

# Get the length of the truncated input IDs for the example using the tokenizer
truncated_length = len(tokenizer(example["question"], example["context"], max_length=max_length, truncation="only_second")["input_ids"])


In this code cell, we use the tokenizer to preprocess the example. We pass the question and context fields as inputs, along with several other arguments:

*   max_length: the maximum length of a feature (question and context together).
*   truncation: set to "only_second", so only the second sequence (the context) is truncated when the pair exceeds max_length.
*   return_overflowing_tokens: set to True, so the tokens cut off by truncation are returned as additional features instead of being discarded.
*   stride: set to doc_stride, the number of tokens that overlap between two consecutive features of the same context.
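
The interaction between max_length and stride can be sketched without the tokenizer. The helper below is a hypothetical, library-free illustration (not the tokenizer's actual implementation): `capacity` stands for the number of context tokens that fit in one feature, and the numbers are assumptions chosen for a 14-token question plus 3 special tokens.

```python
def sliding_windows(context_tokens, capacity, stride):
    """Split context_tokens into windows of at most `capacity` tokens,
    where consecutive windows share `stride` tokens."""
    if not 0 <= stride < capacity:
        raise ValueError("stride must be non-negative and smaller than capacity")
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + capacity])
        if start + capacity >= len(context_tokens):
            break
        # Step forward so the next window repeats the last `stride` tokens.
        start += capacity - stride
    return windows


# Assumed numbers: a 379-token context, and room for 367 context tokens
# per feature (384 minus a 14-token question and 3 special tokens).
windows = sliding_windows(list(range(379)), capacity=367, stride=128)
print([len(w) for w in windows])  # [367, 140]
```

With these assumed numbers, the second window starts with the last 128 tokens of the first one, so an answer that straddles the cut is still fully contained in at least one window.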

In [None]:
# Use the tokenizer to preprocess the example
# Set the maximum length to max_length, truncation to "only_second", return_overflowing_tokens to True, and stride to doc_stride
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

[len(x) for x in tokenized_example["input_ids"]]

[384, 157]

In [None]:
#decoding to see the overlap
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike brey, who, as of the 2014 – 15 season, his fifteenth at notr

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])

[(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33), (34, 37), (37, 38), (38, 39), (40, 50), (51, 55), (56, 60), (60, 61), (0, 0), (0, 3), (4, 7), (7, 8), (8, 9), (10, 20), (21, 25), (26, 29), (30, 34), (35, 36), (36, 37), (37, 40), (41, 45), (45, 46), (47, 50), (51, 53), (54, 58), (59, 61), (62, 69), (70, 73), (74, 78), (79, 86), (87, 91), (92, 96), (96, 97), (98, 101), (102, 106), (107, 115), (116, 118), (119, 121), (122, 126), (127, 138), (138, 139), (140, 146), (147, 153), (154, 160), (161, 165), (166, 171), (172, 175), (176, 182), (183, 186), (187, 191), (192, 198), (199, 205), (206, 208), (209, 210), (211, 217), (218, 222), (223, 225), (226, 229), (230, 240), (241, 245), (246, 248), (248, 249), (250, 258), (259, 262), (263, 267), (268, 271), (272, 277), (278, 281), (282, 285), (286, 290), (291, 301), (301, 302), (303, 307), (308, 312), (313, 318), (319, 321), (322, 325), (326, 330), (330, 331), (332, 340), (341, 351), (352, 354), (355, 363), (364, 373), (374,

In the cell below, we take the token at index 1 of the first feature in tokenized_example (the first real token, right after [CLS]) and store its ID in the first_token_id variable.

We also read the offsets for that token from the offset mapping and store them in the offsets variable.

Finally, we convert first_token_id back to its token and use the offsets to slice out the original text span that the token came from.
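
To make the idea of an offset mapping concrete, here is a minimal sketch with a toy regex tokenizer. `tokenize_with_offsets` is a hypothetical stand-in, not the WordPiece tokenizer used in this notebook, and the question string is reconstructed from the decoded output above with its capitalization assumed.

```python
import re


def tokenize_with_offsets(text):
    """Toy tokenizer: lowercased words and punctuation marks,
    each paired with its (start, end) character offsets in `text`."""
    tokens, offsets = [], []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tokens.append(match.group().lower())
        offsets.append((match.start(), match.end()))
    return tokens, offsets


# Assumed original casing of the question.
question = "How many wins does the Notre Dame men's basketball team have?"
tokens, offsets = tokenize_with_offsets(question)

# The offsets let us recover the original (un-normalized) text of a token.
start, end = offsets[0]
print(tokens[0], question[start:end])  # how How
```

The token is lowercased, but slicing the source string with its offsets returns the original spelling. This is what allows predicted token positions to be mapped back to an answer string later on.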

In [None]:
# Get the first real token (index 1, right after [CLS]) of the first feature
first_token_id = tokenized_example["input_ids"][0][1]

# Get the character offsets of that token
offsets = tokenized_example["offset_mapping"][0][1]

# Convert the ID back to its token and slice the matching text span.
# This token belongs to the question, so its offsets index into example["question"], not the context.
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])


how How


In [None]:
# Get the sequence IDs for the first feature of tokenized_example
sequence_ids = tokenized_example.sequence_ids()

# Print the sequence IDs to the console
print(sequence_ids)
# Each sequence ID tells which input a token belongs to: 0 for the question, 1 for the context,
# and None for special tokens such as [CLS] and [SEP].

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In the cell below, we read the answers for the example and compute the start and end character indices of the first answer from its answer_start and text fields.

We then find the first and last context tokens in the tokenized input: we start at the beginning and at the end of the feature, respectively, and move inward until we reach a token whose sequence ID is 1.

Next, we use the offset_mapping output to check whether the answer lies entirely inside the current span. If it does not, we print a message saying that the answer is not in this feature (during training, such a feature is labeled with the CLS index).

Finally, if the answer is inside the span, we move the token start index and token end index to the two ends of the answer and print the start and end positions of the answer span.
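
The same steps can be tried on a hand-built feature without loading a model. This is a minimal sketch: `locate_answer` is a hypothetical helper, and the offsets and sequence IDs are written by hand for the toy context "Paris is the capital of France." rather than produced by the tokenizer.

```python
def locate_answer(offsets, sequence_ids, start_char, end_char, cls_index=0):
    """Return (start_token, end_token) for the answer, or (cls_index, cls_index)
    if the answer is not fully inside this feature's context window."""
    # First and last tokens that belong to the context (sequence ID 1).
    first = sequence_ids.index(1)
    last = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # The answer must start and end inside the window.
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return cls_index, cls_index

    # Walk forward past every token that starts at or before start_char.
    start_token = first
    while start_token <= last and offsets[start_token][0] <= start_char:
        start_token += 1
    # Walk backward past every token that ends at or after end_char.
    end_token = last
    while end_token >= first and offsets[end_token][1] >= end_char:
        end_token -= 1
    return start_token - 1, end_token + 1


# Hand-written layout: [CLS] q q [SEP] paris is the capital of france . [SEP]
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, None]
offsets = [(0, 0), (0, 4), (5, 9), (0, 0),
           (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30), (30, 31),
           (0, 0)]

# "the capital" spans characters 9..20 of the toy context.
print(locate_answer(offsets, sequence_ids, 9, 20))   # (6, 7)
# A span past the end of the window falls back to the CLS index.
print(locate_answer(offsets, sequence_ids, 40, 45))  # (0, 0)
```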

In [None]:
# Get the answers for the example
answers = example["answers"]

# Calculate the start and end character indices of the answer
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Find the token start index of the current answer span
# Keep incrementing the index until we find a token in the context sequence
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# Find the token end index of the current answer span
# Keep decrementing the index until we find a token in the context sequence
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Check that the answer lies entirely inside the current span
# (during training, a feature that fails this check is labeled with the CLS index)
offsets = tokenized_example["offset_mapping"][0]
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    # Move the token_start_index and token_end_index to the two ends of the answer.
    # If the answer is the last word, we go after the last offset (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    # Print the start and end positions of the answer span
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")


23 26


The cell below defines prepare_train_features, which takes a batch of training examples and returns tokenized features with start and end positions for the answer. It tokenizes with truncation and padding to max_length, and uses a stride to split long contexts into overlapping features. It also strips leading whitespace from the questions, which can otherwise make truncation of the context fail. Each feature is labeled with the start and end token positions of the answer; features with no answer, or whose span does not contain the full answer, are labeled with the index of the CLS token.

In [None]:
# Check if padding is applied on the right
pad_on_right = tokenizer.padding_side == "right"

# Define a function to prepare training features
def prepare_train_features(examples):
    # Remove left whitespace from the questions
    # This will help prevent issues with context truncation when tokenizing
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize the examples with truncation and padding
    # Keep overflows using a stride
    # Return offsets mapping
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Get the mapping of each feature to its corresponding example
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # Get the offset mappings for each feature
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Label the examples with start and end positions for the answer
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    # Loop over each feature and label the example with start and end positions for the answer
    for i, offsets in enumerate(offset_mapping):
        # Label impossible answers with the index of the CLS token
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Get the sequence IDs for the current feature
        sequence_ids = tokenized_examples.sequence_ids(i)

        # Get the answers for the current feature
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        # If no answers are given, set the cls_index as answer
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Get the start and end character index of the answer in the text
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Get the start and end token index of the current span in the text
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index)
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise, move the token_start_index and token_end_index to the two ends of the answer
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

In the cell below, we first call prepare_train_features on the first 5 examples of the training split as a quick check and store the result in the features variable.

We then use the map method of squad_dataset to apply prepare_train_features to every split. Setting batched=True processes the examples in batches, and remove_columns drops all of the original columns (id, title, context, question, answers), so only the tokenized features and answer positions remain.
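
One reason the original columns are dropped is that a batched function may return more rows than it receives, since one long example can yield several features. The toy function below is a hypothetical, library-free sketch of that behavior; it windows characters instead of tokens, and `sample_index` plays the role of overflow_to_sample_mapping.

```python
def toy_prepare_features(batch, window=4, stride=1):
    """Toy batched function: split each 'context' string into overlapping
    character windows and remember which input row each window came from."""
    features = {"chunk": [], "sample_index": []}
    for sample_index, context in enumerate(batch["context"]):
        start = 0
        while True:
            features["chunk"].append(context[start:start + window])
            features["sample_index"].append(sample_index)
            if start + window >= len(context):
                break
            start += window - stride
    return features


# Two input rows (made-up data) ...
batch = {"id": ["a", "b"], "context": ["abcdefg", "xyz"]}
features = toy_prepare_features(batch)

# ... become three output rows, so the 2-row "id" column can no longer line up.
print(features["chunk"])         # ['abcd', 'defg', 'xyz']
print(features["sample_index"])  # [0, 0, 1]
```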

In [None]:
# Prepare the first 5 examples in the training split of the SQuAD dataset as features
features = prepare_train_features(squad_dataset['train'][:5])

# Use the `map` method to apply `prepare_train_features` to every split of the SQuAD dataset
# Set `batched` to `True` to enable batch processing
# Set `remove_columns` to drop all original columns, leaving only the tokenized features
tokenized_datasets = squad_dataset.map(prepare_train_features, batched=True, remove_columns=squad_dataset["train"].column_names)




## Fine-tuning TaCL/BERT Models

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_to_use)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at cambridgeltl/tacl-bert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:

# Get the name of the model from the `model_to_use` variable
model_name = model_to_use.split("/")[-1]

# Set the training arguments for the fine-tuning step
args = TrainingArguments(
    f"{model_name}-finetuned-squad",                 # Output directory
    evaluation_strategy="epoch",                     # Evaluate every epoch
    learning_rate=3e-5,                              # Learning rate
    per_device_train_batch_size=batch_size,          # Batch size for training
    per_device_eval_batch_size=batch_size,           # Batch size for evaluation
    num_train_epochs=2,                              # Number of training epochs
    weight_decay=0.01,                               # Weight decay
    push_to_hub=False                                # Whether to push the model to the Hub
)


In [None]:
from transformers import default_data_collator
# The default data collator batches our processed examples together
data_collator = default_data_collator

In [None]:
# Create a `Trainer` object for the fine-tuning step
trainer = Trainer(
    model,                                               # The model to train
    args,                                                # Training arguments
    train_dataset=tokenized_datasets["train"],           # Training dataset
    eval_dataset=tokenized_datasets["validation"],       # Evaluation dataset
    data_collator=data_collator,                         # Data collator for creating minibatches
    tokenizer=tokenizer                                 # Tokenizer for preprocessing the data
)


In [None]:
# Train the model using the `trainer` object
trainer.train()

# Save the fine-tuned model
trainer.save_model("test-squad-trained")


## Analysis and Evaluation

The cell below is optional: run it only if you want to load the model saved in the step above instead of retraining. If you are running the notebook for the first time, skip it.

In [None]:
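import os
# Restrict this process to GPU 0. The shell form `!CUDA_VISIBLE_DEVICES=0` runs in a subshell
# and does not affect the notebook process. This only takes effect before CUDA is first initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
# If you have saved the trained model, replace the path below to avoid retraining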
model = AutoModelForQuestionAnswering.from_pretrained("/content/drive/MyDrive/TaCL-main/TaCL-main/test-squad-trained")
# model = AutoModelForQuestionAnswering.from_pretrained("/content/drive/MyDrive/TaCL-main/test-squad-trained")
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In the cell below, we take the first batch from the evaluation dataloader and use a dictionary comprehension to move it to the device specified in the training arguments.

We then pass the batch to trainer.model inside the torch.no_grad() context manager, which disables gradient tracking during inference to save memory and time.

Finally, we print the keys of the model output, which name the values the model returns: the loss and the start and end logits.
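
The start and end logits are turned into an answer by scoring (start, end) token pairs with the sum of their logits. The following is a minimal sketch on made-up logits, using plain Python lists instead of tensors; `top_indices` and `best_spans` are hypothetical helper names.

```python
def top_indices(logits, n_best):
    """Indices of the n_best largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best]


def best_spans(start_logits, end_logits, n_best=3, max_answer_len=4):
    """Rank (start, end) token pairs by start_logit + end_logit."""
    candidates = []
    for start in top_indices(start_logits, n_best):
        for end in top_indices(end_logits, n_best):
            # Skip spans that end before they start or are too long.
            if end < start or end - start + 1 > max_answer_len:
                continue
            candidates.append((start_logits[start] + end_logits[end], start, end))
    return sorted(candidates, reverse=True)


# Made-up logits for a 5-token feature.
start_logits = [0.0, 2.0, 0.5, 1.5, -1.0]
end_logits = [0.0, 0.5, 3.0, 1.0, 0.25]
spans = best_spans(start_logits, end_logits)
print(spans[0])  # (5.0, 1, 2): tokens 1 to 2 form the best-scoring span
```

Restricting the search to the top few start and end positions keeps the number of pairs small compared with scoring every possible pair.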

In [None]:
import torch

# Get the first batch from the evaluation dataloader
for batch in trainer.get_eval_dataloader():
    break

# Move the batch to the device specified in the training arguments
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}

# Pass the batch to the model to get the output
with torch.no_grad():
    output = trainer.model(**batch)

# Print the keys of the output dictionary
print(output.keys())

output.start_logits.shape, output.end_logits.shape
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

odict_keys(['loss', 'start_logits', 'end_logits'])


(tensor([ 46,  57,  78,  43, 118, 108,  72,  35, 108,  34,  73,  41],
        device='cuda:0'),
 tensor([ 47,  58,  92,  44, 118, 109,  75,  37, 109,  36,  76,  42],
        device='cuda:0'))

In [None]:
import numpy as np
n_best_size = 20
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
# Gather the indices the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # We need to find a way to get back the original substring corresponding to the answer in the context
                }
            )

In [None]:
def prepare_validation_features(examples):
    # Some questions have left whitespace that makes truncation fail; remove it
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize examples, truncating and padding with overflow using a stride
    # This results in one example possibly giving multiple features, with overlapping context
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Map each feature to its corresponding example
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # Keep the example_id that gave us each feature and store offset mappings
    tokenized_examples["example_id"] = []

    # Iterate over all features and create a mapping of offsets for each
    for i in range(len(tokenized_examples["input_ids"])):
        # Get the sequence corresponding to that example to identify context vs question
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # Find the example index containing this span of text
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set offset_mapping values to None if they're not part of the context
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples


In [None]:
# Process the validation data and prepare the validation features
processed_validation_data = squad_dataset["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=squad_dataset["validation"].column_names
)

# Generate raw predictions using the trained model and the validation features
raw_predictions = trainer.predict(processed_validation_data)

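# The Trainer hides columns the model does not use (example_id and offset_mapping);
# reset the format so they are visible again for post-processing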
processed_validation_data.set_format(
    type=processed_validation_data.format["type"],
    columns=list(processed_validation_data.features.keys())
)

# Set the maximum answer length
max_answer_length = 30




In [None]:
# Get start and end logits from model output
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()

# Retrieve offset mapping and context for first example in validation data
offset_mapping = processed_validation_data[0]["offset_mapping"]
context = squad_dataset["validation"][0]["context"]

# Collect indices of top n_best_size start and end logits
top_start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
top_end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()

# Find valid answers within context based on top start and end logits
valid_answers = []
for start_index in top_start_indexes:
    for end_index in top_end_indexes:
        # Check if start and end indexes are within the context
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Check if answer length is within max answer length
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        # Retrieve the text of the answer and its score
        start_char = offset_mapping[start_index][0]
        end_char = offset_mapping[end_index][1]
        answer_text = context[start_char: end_char]
        answer_score = start_logits[start_index] + end_logits[end_index]
        valid_answers.append({"score": answer_score, "text": answer_text})

# Sort the valid answers by score and return top n_best_size
valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

[{'score': 10.468817, 'text': 'Denver Broncos'},
 {'score': 9.495502,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 8.457635, 'text': 'Carolina Panthers'},
 {'score': 6.977106,
  'text': 'The American Football Conference (AFC) champion Denver Broncos'},
 {'score': 6.003792,
  'text': 'The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 5.3049192, 'text': 'Broncos'},
 {'score': 5.2620625,
  'text': 'American Football Conference (AFC) champion Denver Broncos'},
 {'score': 5.1238494, 'text': 'Denver'},
 {'score': 4.331605,
  'text': 'Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 4.2887483,
  'text': 'American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 4.161862,
  'text': 'Denver

In [None]:
import collections

# Get the validation examples and features.
examples = squad_dataset["validation"]
features = processed_validation_data

# Create a dictionary mapping each example ID to its corresponding index in `examples`.
example_id_to_index = {k: i for i, k in enumerate(examples["id"])}

# Create a dictionary to store the indices of features associated with each example.
features_per_example = collections.defaultdict(list)

# Loop over each feature and find its corresponding example ID.
for i, feature in enumerate(features):
    example_index = example_id_to_index[feature["example_id"]]
    
    # Append the index of this feature to the list of features associated with the example.
    features_per_example[example_index].append(i)


In [None]:
def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size=20, max_length=30):
    # Unpack the raw predictions into start and end scores.
    all_start_scores, all_end_scores = raw_predictions

    # Create a dictionary that maps each example ID to its corresponding feature indices.
    example_indices = collections.defaultdict(list)
    for i, feature in enumerate(features):
        example_indices[feature['example_id']].append(i)

    # Create an empty dictionary to store the predictions.
    predictions = {}

    # Loop over all the examples.
    print(f"Processing {len(examples)} examples split into {len(features)} features.")
    for example in examples:
        # Get the ID of the current example.
        example_id = example['id']

        # Get the feature indices that correspond to the current example ID.
        indices = example_indices[example_id]

        # Initialize variables to track the best answer and its score.
        best_answer = ''
        best_score = float('-inf')

        # Loop over all the features that correspond to the current example ID.
        for index in indices:
            # Get the feature at the current index.
            feature = features[index]

            # Get the start and end scores for the current feature.
            start_scores = all_start_scores[index]
            end_scores = all_end_scores[index]

            # Get the offset mapping for the current feature.
            offsets = feature['offset_mapping']

            # Loop over all possible combinations of start and end positions.
            for start in range(len(offsets)):
                for end in range(start, min(start + max_length, len(offsets))):
                    # Check if the current combination of start and end positions is valid.
                    # (end >= start is already guaranteed by the range above.)
                    if offsets[start] is None or offsets[end] is None:
                        continue

                    # Compute the score for the current answer.
                    score = start_scores[start] + end_scores[end]

                    # If the current answer is the best we've seen so far, update our variables.
                    if score > best_score:
                        best_answer = example['context'][offsets[start][0]:offsets[end][1]]
                        best_score = score

        # Store the best answer (or an empty string if no answer was found) in the predictions dictionary.
        predictions[example_id] = best_answer if best_score > float('-inf') else ''
    # Return the predictions dictionary.
    return predictions
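
The function above accepts `n_best_size` but scores every start/end pair. A minimal sketch of how the candidates could be limited to the `n_best_size` highest-scoring start and end tokens is shown below. The helper name `best_span` and the toy scores are assumptions made for illustration, and restricting the search this way is an approximation that can, in principle, miss the span found by the exhaustive loop.

```python
def best_span(start_scores, end_scores, offsets, n_best_size=20, max_answer_len=30):
    """Return (score, start, end) for the best valid span, or None if there is none."""
    # Indices of the n_best_size highest-scoring start and end tokens.
    top_starts = sorted(range(len(start_scores)),
                        key=lambda i: start_scores[i], reverse=True)[:n_best_size]
    top_ends = sorted(range(len(end_scores)),
                      key=lambda i: end_scores[i], reverse=True)[:n_best_size]

    best = None
    for s in top_starts:
        for e in top_ends:
            # Skip tokens outside the context (offset is None).
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip reversed spans and spans longer than max_answer_len tokens.
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_scores[s] + end_scores[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best


# Toy feature: token 0 stands for a special token, so its offset is None.
context = "Denver Broncos defeated"
offsets = [None, (0, 6), (7, 14), (15, 23)]
start_scores = [5.0, 2.0, 0.1, 0.3]
end_scores = [5.0, 0.2, 1.5, 0.4]

result = best_span(start_scores, end_scores, offsets, n_best_size=3)
score, s, e = result
answer = context[offsets[s][0]:offsets[e][1]]
print(answer)
```

With the toy scores, the special token has the highest start and end scores but is skipped because its offset is `None`, so the selected span covers tokens 1 to 2.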


In [None]:
squad_dataset["validation"][0]

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],


In [None]:
final_predictions = postprocess_qa_predictions(squad_dataset["validation"], processed_validation_data, raw_predictions.predictions)
# Load the SQuAD v1.1 metric, which reports exact match and F1.
metric = load_metric("squad")


formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]

# Create a list of references for the validation dataset.
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in squad_dataset["validation"]]

# Compute the metric using the formatted predictions and references.
metric.compute(predictions=formatted_predictions, references=references)


Processing 10570 examples split into 10784 features.




{'exact_match': 81.91106906338695, 'f1': 89.10779708060427}
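
To make the two reported numbers easier to interpret, here is a minimal sketch of how exact match and token-level F1 can be computed for a single prediction/reference pair. The normalization (lowercasing, dropping punctuation and the articles a/an/the, collapsing whitespace) is assumed to mirror the usual SQuAD v1.1 evaluation; the function names are illustrative and this is not the metric's actual implementation. The full metric is understood to take the best score over all reference answers for a question and average over the dataset.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Denver Broncos", "Denver Broncos"))
print(f1_score("big black", "big"))
```

A prediction such as "big black" for the reference "big" gets no exact-match credit but partial F1 credit, which is why F1 is higher than exact match in the result above.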

#Robustness CheckList

In [None]:
!pip install checklist

In [4]:
import checklist
import spacy
import itertools

import checklist.editor
import checklist.text_generation
from checklist.test_types import MFT, INV, DIR
from checklist.expect import Expect
from checklist.test_suite import TestSuite
import numpy as np
from checklist.perturb import Perturb
from checklist.pred_wrapper import PredictorWrapper

In [5]:
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
class CustomModel:
    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    def predict_pairs(self, pairs):
        # Convert pairs into the required format for your model's input
        inputs = self.tokenizer(pairs, padding=True, truncation=True, return_tensors="pt")

        # Get the raw predictions from your model
        outputs = self.model(**inputs)

        # Get start and end logits
        start_logits, end_logits = outputs.start_logits, outputs.end_logits

        # Get start and end positions
        start_positions = torch.argmax(start_logits, dim=1)
        end_positions = torch.argmax(end_logits, dim=1)

        # Extract the answers
        answers = []
        for i, (start, end) in enumerate(zip(start_positions, end_positions)):
            # After `invert`, each pair is (question, context), so pairs[i][1] is the context.
            # In the templated tests, its first and last words are the two names.
            context = pairs[i][1]
            first_name, second_name = context.split(' ')[0], context.split(' ')[-1]
            answer = self.tokenizer.decode(inputs["input_ids"][i][start : end + 1], skip_special_tokens=True)
            # Try to restore the capital letter on whichever name appears in the answer.
            # str.replace only matches text with the same casing, and the final capitalize()
            # uppercases the first character while lowercasing the rest of the answer.
            if first_name.lower() in answer.lower():
                answer = answer.replace(first_name, first_name.capitalize(), 1)
            elif second_name.lower() in answer.lower():
                answer = answer.replace(second_name, second_name.capitalize(), 1)
            answers.append(answer.capitalize())

        # Return the list of predictions
        return answers


model_name = "sandeepvarma99/TaCL-bert-base-uncased-finetuned-squad-task"
# model_name = "bert-base-uncased"
custom_model = CustomModel(model_name)

# Update the invert function to use the custom model's predict_pairs method
invert = lambda a: custom_model.predict_pairs([(x[1], x[0]) for x in a])

new_pp = PredictorWrapper.wrap_predict(invert)

# Get predictions
predictions = custom_model.predict_pairs([('Who is smarter?', 'John is smart')])
print(predictions)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


['John']


In [3]:
invert([('John is smart', 'Who is smart')])

['John']

In [4]:
editor = checklist.editor.Editor()
editor.tg

<checklist.text_generation.TextGenerator at 0x7f188cddca00>

In [5]:
nlp = spacy.load('en_core_web_sm')

In [6]:
def format_squad_with_context(x, pred, conf, label=None, *args, **kwargs):
    c, q = x
    ret = 'C: %s\nQ: %s\n' % (c, q)
    if label is not None:
        ret += 'A: %s\n' % label
    ret += 'P: %s\n' % pred
    return ret

In [7]:
def format_squad(x, pred, conf, label=None, *args, **kwargs):
    c, q = x
    ret = 'Q: %s\n' % (q)
    if label is not None:
        ret += 'A: %s\n' % label
    ret += 'P: %s\n' % pred
    return ret

In [8]:
import json
def load_squad(fold='validation'):
    answers = []
    data = []
    files = {
        'validation': '/content/drive/MyDrive/TaCL-main/TaCL-main/datasets/dev-v1.1.json',
        'train': '/content/drive/MyDrive/TaCL-main/TaCL-main/datasets/train-v1.1.json',
        }
    # Use a context manager so the file handle is closed after parsing.
    with open(files[fold]) as fh:
        f = json.load(fh)
    for t in f['data']:
        for p in t['paragraphs']:
            context = p['context']
            for qa in p['qas']:
                data.append({'passage': context, 'question': qa['question'], 'id': qa['id']})
                answers.append(set([(x['text'], x['answer_start']) for x in qa['answers']]))
    return data, answers

In [None]:
import spacy
import pickle
from tqdm import tqdm
# from load_squad import load_squad

# Load the SQuAD dataset
data, answers = load_squad()

# Load the Spacy model
nlp = spacy.load('en_core_web_sm', disable=['ner'])

# Process the passages and questions using Spacy and save the mapping to a pickled file
processed_squad = {}
for i, d in enumerate(tqdm(data)):
    processed_squad[d['passage']] = nlp(d['passage'])
    processed_squad[d['question']] = nlp(d['question'])
with open('processed_squad.pkl', 'wb') as f:
    pickle.dump(processed_squad, f)


In [12]:
import pickle
data, answers =  load_squad()
spacy_map =  pickle.load(open('/content/drive/MyDrive/TaCL-main/TaCL-main/checklist/processed_squad.pkl', 'rb'))
pairs = [(x['passage'], x['question']) for x in data]
processed_pairs = [(spacy_map[x[0]], spacy_map[x[1]]) for x in pairs]

In [9]:
suite = TestSuite()


#Vocabulary

In [14]:
print(', '.join(editor.suggest('{first_name} is {mask} than {first_name2}.')[:60]))
adj = ['old', 'smart', 'tall', 'young', 'strong', 'short', 'tough', 'cool', 'fast', 'nice', 'small', 'dark', 'wise', 'rich', 'great', 'weak', 'high', 'slow', 'strange', 'clean']
adj = [(x.rstrip('e'), x) for x in adj]
adj[2]



smarter, older, better, younger, taller, worse, stronger, different, shorter, cooler, nicer, tougher, bigger, hotter, happier, smaller, wiser, more, faster, richer, darker, thinner, weaker, less, larger, quieter, cleaner, heavier, healthier, closer, colder, slower, wealthier, quicker, longer, harder, safer, lighter, warmer, brighter, cheaper, sharper, higher, louder, thicker, greater, lower, easier, deeper, poorer, softer, smoother, simpler, stranger, newer, other, superior, clearer, stricter, tighter


('tall', 'tall')
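
The `(x.rstrip('e'), x)` pairs exist so that the template can append "er" to the stem (`{adj[0]}er`) while still using the unmodified adjective after "less" (`{adj[1]}`). A small sketch of that behaviour follows; the helper name `comparative` is an assumption for illustration. Note that `rstrip('e')` removes every trailing "e", so the trick only works for adjectives like the ones listed above, and it does not double consonants.

```python
def comparative(base):
    """Build a comparative the way the template does: strip trailing 'e', add 'er'."""
    return base.rstrip('e') + 'er'


for base in ['old', 'nice', 'strange', 'free', 'big']:
    print(base, '->', comparative(base))
```

Words such as "free" or "big" would come out as "frer" and "biger", which is why the adjective list is restricted to regular cases.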

In [15]:
t = editor.template(
    [(
    '{first_name} is {adj[0]}er than {first_name1}.',
    'Who is less {adj[1]}?'
    ),(
    '{first_name} is {adj[0]}er than {first_name1}.',
    'Who is {adj[0]}er?'
    )
    ],
    labels = ['{first_name1}','{first_name}'],
    adj=adj,
    remove_duplicates=True,
    nsamples=100,
    save=True
    )
name = 'A is COMP than B. Who is more / less COMP?'
description = ''
test = MFT(**t, name=name, description=description, capability='Vocabulary')
test.run(new_pp)
test.summary(n=10, format_example_fn=format_squad_with_context)
suite.add(test,overwrite=True)

Predicting 200 examples
Test cases:      100
Fails (rate):    38 (38.0%)

Example fails:
C: Victoria is shorter than Ralph.
Q: Who is less short?
A: Ralph
P: Victoria


----
C: Kim is smaller than Emily.
Q: Who is less small?
A: Emily
P: Kim


----
C: Henry is shorter than Katie.
Q: Who is less short?
A: Katie
P: Henry is shorter than katie


----
C: Sharon is cleaner than Samuel.
Q: Who is less clean?
A: Samuel
P: Sharon is cleaner than samuel


----
C: Catherine is shorter than Amy.
Q: Who is less short?
A: Amy
P: Catherine


----
C: Jeff is younger than Alice.
Q: Who is less young?
A: Alice
P: Jeff


----
C: Paul is darker than Katherine.
Q: Who is less dark?
A: Katherine
P: Paul is darker than katherine


----
C: Sarah is higher than Steven.
Q: Who is less high?
A: Steven
P: Sarah


----
C: Eleanor is greater than Victoria.
Q: Who is less great?
A: Victoria
P: Eleanor is greater than victoria


----
C: Helen is shorter than Roger.
Q: Who is less short?
A: Roger
P: Helen


----


In [None]:
test.summary(n=38, format_example_fn=format_squad_with_context)

In [10]:
def crossproduct(t):
    # takes the output of editor.template and does the cross product of contexts and qas
    ret = []
    ret_labels = []
    for x in t.data:
        cs = x['contexts']
        qas = x['qas']
        d = list(itertools.product(cs, qas))
        ret.append([(x[0], x[1][0]) for x in d])
        ret_labels.append([x[1][1] for x in d])
    t.data = ret
    t.labels = ret_labels
    return t
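
To illustrate what `crossproduct` produces, the sketch below runs the same function on a hand-built stand-in for the template output. Using `types.SimpleNamespace` in place of the object returned by `editor.template` is an assumption made so the example runs without CheckList; the names and sentences are made up.

```python
import itertools
from types import SimpleNamespace


def crossproduct(t):
    # Pair every context with every (question, answer) tuple of the same test case.
    ret = []
    ret_labels = []
    for x in t.data:
        cs = x['contexts']
        qas = x['qas']
        d = list(itertools.product(cs, qas))
        ret.append([(x[0], x[1][0]) for x in d])
        ret_labels.append([x[1][1] for x in d])
    t.data = ret
    t.labels = ret_labels
    return t


# One test case with two paraphrased contexts and two questions.
toy = SimpleNamespace(data=[{
    'contexts': ['Ann is taller than Bob.', 'Bob is shorter than Ann.'],
    'qas': [('Who is taller?', 'Ann'), ('Who is shorter?', 'Bob')],
}])
toy = crossproduct(toy)

for (context, question), label in zip(toy.data[0], toy.labels[0]):
    print(context, '|', question, '|', label)
```

Each test case therefore expands into `len(contexts) * len(qas)` (context, question) pairs, which is why a template with 6 contexts and 2 questions sampled 25 times yields 300 predictions.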

In [11]:
state = editor.suggest('John is very {mask} about the project.')[:20]
print(', '.join(editor.suggest('John is {mask} {state} about the project.', state=state)[:30]))
very = ['very', 'extremely', 'really', 'quite', 'incredibly', 'particularly', 'highly', 'super']
somewhat = ['a little', 'somewhat', 'slightly', 'mildly']



very, pretty, extremely, also, still, quite, more, really, not, clearly, fairly, incredibly, particularly, now, understandably, rather, cautiously, surprisingly, certainly, feeling, so, especially, definitely, generally, most, highly, super, reportedly, being, obviously


In [None]:
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is {very} {s} about the project. {first_name1} is {s} about the project.',
            '{first_name1} is {s} about the project. {first_name} is {very} {s} about the project.',
            '{first_name} is {s} about the project. {first_name1} is {somewhat} {s} about the project.',
            '{first_name1} is {somewhat} {s} about the project. {first_name} is {s} about the project.',
            '{first_name} is {very} {s} about the project. {first_name1} is {somewhat} {s} about the project.',
            '{first_name1} is {somewhat} {s} about the project. {first_name} is {very} {s} about the project.',
        ],
        'qas': [
            (
                'Who is most {s} about the project?',
                '{first_name}'
            ), 
            (
                'Who is least {s} about the project?',
                '{first_name1}'
            ), 
            
        ]
        
    },
    s = state,
    very=very,
    somewhat=somewhat,
    remove_duplicates=True,
    nsamples=25,
    save=True
    ))
name = 'Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?'
desc = ''
test = MFT(**t, name=name, description=desc, capability='Vocabulary')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test,overwrite=True)

Predicting 300 examples


In [None]:
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
class CustomModel:
    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    def predict_pairs(self, pairs):
        # Convert pairs into the required format for your model's input
        inputs = self.tokenizer(pairs, padding=True, truncation=True, return_tensors="pt")

        # Get the raw predictions from your model
        outputs = self.model(**inputs)

        # Get start and end logits
        start_logits, end_logits = outputs.start_logits, outputs.end_logits

        # Get start and end positions
        start_positions = torch.argmax(start_logits, dim=1)
        end_positions = torch.argmax(end_logits, dim=1)

        # Extract the answers
        answers = []
        for i, (start, end) in enumerate(zip(start_positions, end_positions)):
            answer = self.tokenizer.decode(inputs["input_ids"][i][start : end + 1], skip_special_tokens=True)
            answers.append(answer)

        # Return the list of predictions
        return answers



model_name = "sandeepvarma99/tacl-bert-squad-trained"
# model_name = "bert-base-uncased"
custom_model = CustomModel(model_name)

# Update the invert function to use the custom model's predict_pairs method
invert = lambda a: custom_model.predict_pairs([(x[1], x[0]) for x in a])

new_pp = PredictorWrapper.wrap_predict(invert)

# Get predictions
predictions = custom_model.predict_pairs([('Who is smarter?', 'John is smart')])
print(predictions)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


['John']


#Taxonomy

##Size, shape, color, age, material

In [None]:
import munch
order = ['size', 'shape', 'age', 'color']
props = []
properties = {
    'color' : ['red', 'blue','yellow', 'green', 'pink', 'white', 'black', 'orange', 'grey', 'purple', 'brown'],
    'size' : ['big', 'small', 'tiny', 'enormous'],
    'age' : ['old', 'new'],
    'shape' : ['round', 'oval', 'square', 'triangular'],
    'material' : ['iron', 'wooden', 'ceramic', 'glass', 'stone']
}
for i in range(len(order)):
    for j in range(i + 1, len(order)):
        p1, p2 = order[i], order[j]
        for v1, v2 in itertools.product(properties[p1], properties[p2]):
            props.append(munch.Munch({
                'p1': p1,
                'p2': p2,
                'v1': v1,
                'v2': v2,
            }))

In [None]:
print(', '.join(editor.suggest('There is {a:p.v1} {p.v2} {mask} in the room.', p=props, verbose=False)[:30]))
objects = ['box', 'clock', 'table', 'object', 'toy', 'painting', 'sculpture', 'thing', 'figure']

sofa, couch, wall, carpet, chair, table, light, lamp, door, clock, mirror, desk, bed, TV, bar, television, window, box, tree, painting, curtain, fan, fridge, screen, wallpaper, piano, rug, shelf, camera, candle


In [None]:
t = crossproduct(editor.template(
    {
        'contexts': [
            'There is {a:p.v1} {p.v2} {obj} in the room.',
            'There is {a:obj} in the room. The {obj} is {p.v1} and {p.v2}.',
        ],
        'qas': [
            (
                'What {p.p1} is the {obj}?',
                '{p.v1}'
            ), 
            (
                'What {p.p2} is the {obj}?',
                '{p.v2}'
            ), 
            
        ]
        
    },
    obj=objects,
    p=props,
    remove_duplicates=True,
    nsamples=5,
    save=True
    ))
name = 'size, shape, age, color'
desc = ''
test = MFT(**t, name=name, description=desc, capability='Taxonomy')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test,overwrite=True)

Predicting 20 examples
Test cases:      5
Fails (rate):    4 (80.0%)

Example fails:
C: There is a big black clock in the room.
Q: What size is the clock?
A: big
P: big black


----
C: There is an old blue painting in the room.
Q: What age is the painting?
A: old
P: old blue

C: There is a painting in the room. The painting is old and blue.
Q: What age is the painting?
A: old
P: old and blue

C: There is a painting in the room. The painting is old and blue.
Q: What color is the painting?
A: blue
P: old and blue


----
C: There is a tiny round thing in the room.
Q: What size is the thing?
A: tiny
P: tiny round

C: There is a thing in the room. The thing is tiny and round.
Q: What size is the thing?
A: tiny
P: tiny and round

C: There is a thing in the room. The thing is tiny and round.
Q: What shape is the thing?
A: round
P: tiny and round


----


###Professions vs nationalities

In [None]:
professions = editor.suggest('{first_name} works as {a:mask}.')[:30]
professions += editor.suggest('{first_name} {last_name} works as {a:mask}.')[:30]
professions = list(set(professions))
if 'translator' in professions:
    professions.remove('translator')

In [None]:
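import re

def clean(string):
    # Drop one leading article or preposition and any trailing period.
    # (str.lstrip would strip a set of characters, not whole words.)
    return re.sub(r'^(?:a|the|an|in|at)\s+', '', string).rstrip('.')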
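
Note that `str.lstrip` treats its argument as a set of characters, not a list of prefixes, so passing `'[a,the,an,in,at] '` strips any leading run of those characters. The sketch below contrasts that behaviour with a prefix-based regular expression; `clean_lstrip` and `clean_prefix` are illustrative names, not part of CheckList.

```python
import re


def clean_lstrip(text):
    # Character-set stripping: removes any leading run of the listed characters.
    return text.lstrip('[a,the,an,in,at] ').rstrip('.')


def clean_prefix(text):
    # Prefix stripping: removes one leading word from a fixed list, then trailing periods.
    return re.sub(r'^(?:a|the|an|in|at)\s+', '', text).rstrip('.')


for s in ['an intern', 'the author.', 'teacher']:
    print(repr(s), '->', repr(clean_lstrip(s)), 'vs', repr(clean_prefix(s)))
```

Because the prediction and the label go through the same function, character-set stripping often still gives matching results, but it can make different strings compare equal (for example "teacher" and "cher"), so the prefix-based form is the safer way to drop a leading article.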

In [None]:
def expect_squad(x, pred, conf, label=None, meta=None):
    return clean(pred) == clean(label)
expect_squad = Expect.single(expect_squad)

In [None]:
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is {a:nat} {prof}.',
            '{first_name} is {a:prof}. {first_name} is {nat}.',
            '{first_name} is {nat}. {first_name} is {a:prof}.',
            '{first_name} is {nat} and {a:prof}.',
            '{first_name} is {a:prof} and {nat}.',
        ],
        'qas': [
            (
                'What is {first_name}\'s job?',
                '{prof}'
            ), 
            (
                'What is {first_name}\'s nationality?',
                '{nat}'
            ), 
            
        ]
        
    },
    nat = editor.lexicons['nationality'][:10],
    prof=professions,
    remove_duplicates=True,
    nsamples=5,
    save=True,
    ))
name = 'Profession vs nationality'
test = MFT(**t, name=name, expect=expect_squad, description='',  capability='Taxonomy')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test,overwrite=True)

Predicting 50 examples
Test cases:      5
Fails (rate):    3 (60.0%)

Example fails:
C: Florence is an Indian intern.
Q: What is Florence's job?
A: intern
P: Indian intern


----
C: Henry is a Chinese educator.
Q: What is Henry's job?
A: educator
P: Chinese educator


----
C: Dave is a Japanese author.
Q: What is Dave's job?
A: author
P: Japanese author


----


###Animal vs vehicle

In [None]:
animals = ['dog', 'cat', 'bull', 'cow', 'fish', 'serpent', 'snake', 'lizard', 'hamster', 'rabbit', 'guinea pig', 'iguana', 'duck']
vehicles = ['car', 'truck', 'train', 'motorcycle', 'bike', 'firetruck', 'tractor', 'van', 'SUV', 'minivan']
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} has {a:animal} and {a:vehicle}.',
            '{first_name} has {a:vehicle} and {a:animal}.',
        ],
        'qas': [
            (
                'What animal does {first_name} have?',
                '{animal}'
            ), 
            (
                'What vehicle does {first_name} have?',
                '{vehicle}'
            ), 
            
        ]
        
    },
    animal=animals,
    vehicle=vehicles,
    remove_duplicates=True,
    nsamples=5,
    save=True
    ))
name = 'Animal vs Vehicle'
test = MFT(**t, name=name, description='', capability='Taxonomy', expect=expect_squad)
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test, overwrite=True)

Predicting 20 examples
Test cases:      5
Fails (rate):    0 (0.0%)


In [None]:
animals = ['dog', 'cat', 'bull', 'cow', 'fish', 'serpent', 'snake', 'lizard', 'hamster', 'rabbit', 'guinea pig', 'iguana', 'duck']
vehicles = ['car', 'truck', 'train', 'motorcycle', 'bike', 'firetruck', 'tractor', 'van', 'SUV', 'minivan']
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} bought {a:animal}. {first_name2} bought {a:vehicle}.',
            '{first_name2} bought {a:vehicle}. {first_name} bought {a:animal}.',
        ],
        'qas': [
            (
                'Who bought an animal?',
                '{first_name}'
            ), 
            (
                'Who bought a vehicle?',
                '{first_name2}'
            ), 
            
        ]
        
    },
    animal=animals,
    vehicle=vehicles,
    remove_duplicates=True,
    nsamples=5,
    save=True
    ))
name = 'Animal vs Vehicle v2'
test = MFT(**t, name=name, description='', capability='Taxonomy', expect=expect_squad)
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test, overwrite=True)

Predicting 20 examples
Test cases:      5
Fails (rate):    3 (60.0%)

Example fails:
C: Sandra bought a firetruck. Elizabeth bought a duck.
Q: Who bought an animal?
A: Elizabeth
P: Sandra


----
C: Roy bought a bike. Jean bought a cow.
Q: Who bought an animal?
A: Jean
P: Roy


----
C: Don bought a guinea pig. Fiona bought a tractor.
Q: Who bought a vehicle?
A: Fiona
P: Don

C: Fiona bought a tractor. Don bought a guinea pig.
Q: Who bought an animal?
A: Don
P: Fiona bought a tractor. Don


----


In [None]:
synonyms = [ ('spiritual', 'religious'), ('angry', 'furious'), ('organized', 'organised'),
            ('vocal', 'outspoken'), ('grateful', 'thankful'), ('intelligent', 'smart'),
            ('humble', 'modest'), ('courageous', 'brave'), ('happy', 'joyful'), ('scared', 'frightened'),
           ]

t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is very {s1[0]}. {first_name2} is very {s2[0]}.',
            '{first_name2} is very {s2[0]}. {first_name} is very {s1[0]}.',
        ],
        'qas': [
            (
                'Who is {s1[1]}?',
                '{first_name}'
            ), 
            (
                'Who is {s2[1]}?',
                '{first_name2}'
            ), 
            
        ]
        
    },
    s=synonyms,
    remove_duplicates=True,
    nsamples=5,
    save=True
   ))
t += crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is very {s1[1]}. {first_name2} is very {s2[1]}.',
            '{first_name2} is very {s2[1]}. {first_name} is very {s1[1]}.',
        ],
        'qas': [
            (
                'Who is {s1[0]}?',
                '{first_name}'
            ), 
            (
                'Who is {s2[0]}?',
                '{first_name2}'
            ), 
            
        ]
        
    },
    s=synonyms,
    remove_duplicates=True,
    nsamples=5,
    save=True
    )) 
name = 'Synonyms'
test = MFT(**t, name=name, description='', capability='Taxonomy', expect=expect_squad)
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 36 examples
Test cases:      9
Fails (rate):    0 (0.0%)


In [None]:
comp_pairs = [('better', 'worse'), ('older', 'younger'), ('smarter', 'dumber'), ('taller', 'shorter'), ('bigger', 'smaller'), ('stronger', 'weaker'), ('faster', 'slower'), ('darker', 'lighter'), ('richer', 'poorer'), ('happier', 'sadder'), ('louder', 'quieter'), ('warmer', 'colder')]
comp_pairs = list(set(comp_pairs))#list(set(comp_pairs + [(x[1], x[0]) for x in comp_pairs]))

In [None]:
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is {comp[0]} than {first_name1}.',
            '{first_name1} is {comp[1]} than {first_name}.',
        ],
        'qas': [
            (
                'Who is {comp[1]}?',
                '{first_name1}',
            ),
            (
                'Who is {comp[0]}?',
                '{first_name}',
            )
            
        ]
        ,
    },
    comp=comp_pairs,
    remove_duplicates=True,
    nsamples=5,
    save=True
    ))
name = 'A is COMP than B. Who is antonym(COMP)? B'
test = MFT(**t, name=name, description='', capability='Taxonomy')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 20 examples
Test cases:      5
Fails (rate):    5 (100.0%)

Example fails:
C: Cynthia is older than Stephanie.
Q: Who is younger?
A: Stephanie
P: Cynthia is older than Stephanie


----
C: Thomas is poorer than Katie.
Q: Who is richer?
A: Katie
P: Thomas is poorer than Katie


----
C: Rachel is older than Frederick.
Q: Who is younger?
A: Frederick
P: Rachel

C: Frederick is younger than Rachel.
Q: Who is older?
A: Rachel
P: Frederick


----


In [None]:
antonym_adjs = [('progressive', 'conservative'),('religious', 'secular'),('positive', 'negative'),('defensive', 'offensive'),('rude',  'polite'),('optimistic', 'pessimistic'),('stupid', 'smart'),('negative', 'positive'),('unhappy', 'happy'),('active', 'passive'),('impatient', 'patient'),('powerless', 'powerful'),('visible', 'invisible'),('fat', 'thin'),('bad', 'good'),('cautious', 'brave'), ('hopeful', 'hopeless'),('insecure', 'secure'),('humble', 'proud'),('passive', 'active'),('dependent', 'independent'),('pessimistic', 'optimistic'),('irresponsible', 'responsible'),('courageous', 'fearful')]
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is more {a[0]} than {first_name1}.',
            '{first_name1} is more {a[1]} than {first_name}.',
            '{first_name} is less {a[1]} than {first_name1}.',
            '{first_name1} is less {a[0]} than {first_name}.',
        ],
        'qas': [
            (
                'Who is more {a[0]}?',
                '{first_name}',
            ),
            (
                'Who is less {a[0]}?',
                '{first_name1}',
            ),
            (
                'Who is more {a[1]}?',
                '{first_name1}',
            ),
            (
                'Who is less {a[1]}?',
                '{first_name}',
            ),
        ]
        ,
    },
    a = antonym_adjs,
    remove_duplicates=True,
    nsamples=5,
    save=True
    ))
name = 'A is more X than B. Who is more antonym(X)? B. Who is less X? B. Who is more X? A. Who is less antonym(X)? A.'
test = MFT(**t, name=name, description='', capability='Taxonomy')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 80 examples
Test cases:      5
Fails (rate):    5 (100.0%)

Example fails:
C: Julia is more humble than Caroline.
Q: Who is more proud?
A: Caroline
P: Julia

C: Julia is more humble than Caroline.
Q: Who is less proud?
A: Julia
P: Caroline

C: Caroline is more proud than Julia.
Q: Who is less humble?
A: Caroline
P: Julia


----
C: Victoria is more rude than Roy.
Q: Who is less polite?
A: Victoria
P: Roy

C: Roy is more polite than Victoria.
Q: Who is more rude?
A: Victoria
P: Roy

C: Victoria is less polite than Roy.
Q: Who is more rude?
A: Victoria
P: Roy


----
C: Virginia is more unhappy than Thomas.
Q: Who is more happy?
A: Thomas
P: Virginia

C: Thomas is more happy than Virginia.
Q: Who is more unhappy?
A: Virginia
P: Thomas

C: Thomas is more happy than Virginia.
Q: Who is less unhappy?
A: Thomas
P: Virginia


----


#Robustness

In [None]:
def question_typo(x):
    return (x[0], Perturb.add_typos(x[1]))
t = Perturb.perturb(pairs, question_typo, nsamples=5)
test = INV(**t, name='Question typo', capability='Robustness', description='')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad)
suite.add(test, overwrite=True)

Predicting 10 examples
Test cases:      5
Fails (rate):    0 (0.0%)


In [None]:
def contractions(x):
    conts = Perturb.contractions(x[1])
    return [(x[0], a) for a in conts]
t = Perturb.perturb(pairs, contractions, nsamples=2)
test = INV(**t, name='Question contractions', capability='Robustness', description='')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad)
suite.add(test)

Predicting 4 examples
Test cases:      2
Fails (rate):    1 (50.0%)

Example fails:
Q: What is a mechanism that can help plants block virus replication?
P: RNA silencing mechanisms

Q: What's a mechanism that can help plants block virus replication?
P: RNA silencing


----


In [None]:
random_sentences = set()
for x, _ in processed_pairs:
    for y in x.sents:
        random_sentences.add(y.text)
random_sentences = list(random_sentences)

In [None]:
def add_random_sentence(x, **kwargs):
    random_s = np.random.choice(random_sentences)
    while random_s in x[0]:
        random_s = np.random.choice(random_sentences)
    random_s = random_s.strip('.') + '. '
    meta = ['add to end: %s' % random_s, 'add to beg: %s' % random_s]
    # Add a separating space when appending so the new sentence does not run into the context.
    return [(x[0] + ' ' + random_s, x[1]), (random_s + x[0], x[1])], meta

def format_add(x, pred, conf, label=None, meta=None):
    ret = format_squad(x, pred, conf, label, meta)
    if meta:
        ret += 'Perturb: %s\n' % meta
    return ret

t = Perturb.perturb(pairs, add_random_sentence, nsamples=2, meta=True)
test = INV(**t, name='Add random sentence to context', capability='Robustness', description='')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_add)
suite.add(test)

Predicting 6 examples
Test cases:      2
Fails (rate):    0 (0.0%)


#NER

In [None]:
import re
def change_thing(change_fn):
    def change_both(cq, **kwargs):
        context, question = cq
        a = change_fn(context, meta=True)
        if not a:
            return None
        changed, meta = a
        ret = []
        for c, m in zip(changed, meta):
            new_q = re.sub(r'\b%s\b' % re.escape(m[0]), m[1], question.text)
            ret.append((c, new_q))
        return ret, meta
    return change_both
            

In [None]:
def expect_same(orig_pred, pred, orig_conf, conf, labels=None, meta=None):
    if not meta:
        return pred == orig_pred
    return pred == re.sub(r'\b%s\b' % re.escape(meta[0]), meta[1], orig_pred)

def format_replace(x, pred, conf, label=None, meta=None):
    ret = format_squad(x, pred, conf, label, meta)
    if meta:
        ret += 'Perturb: %s -> %s\n' % meta
    return ret

def format_replace_context(x, pred, conf, label=None, meta=None):
    ret = format_squad_with_context(x, pred, conf, label, meta)
    if meta:
        ret += 'Perturb: %s -> %s\n' % meta
    return ret
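
The invariance check in `expect_same` can be exercised without a model. The sketch below repeats the function and calls it directly with made-up predictions; in the notebook it is wrapped by `Expect.pairwise`, and the `meta` tuple `(old, new)` is assumed to have the form produced by the perturbation.

```python
import re


def expect_same(orig_pred, pred, orig_conf, conf, labels=None, meta=None):
    # Without a substitution, the prediction must be unchanged.
    if not meta:
        return pred == orig_pred
    # With a substitution, the prediction must equal the original with old -> new applied.
    return pred == re.sub(r'\b%s\b' % re.escape(meta[0]), meta[1], orig_pred)


meta = ('John', 'Mark')
renamed_ok = expect_same('John Smith', 'Mark Smith', None, None, meta=meta)
stale_name = expect_same('John Smith', 'John Smith', None, None, meta=meta)
unrelated = expect_same('Paris', 'Paris', None, None, meta=meta)
print(renamed_ok, stale_name, unrelated)
```

The test passes when the model's answer follows the renamed entity, fails when it keeps the old name, and passes when the answer never mentioned the changed entity.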

In [None]:
t = Perturb.perturb(processed_pairs, change_thing(Perturb.change_names), nsamples=2, meta=True)

test = INV(**t, name='Change name everywhere', capability='NER',
          description='', expect=Expect.pairwise(expect_same))
test.run(new_pp)
test.summary(3, format_example_fn=format_replace)
suite.add(test, overwrite=True)

In [None]:
t = Perturb.perturb(processed_pairs, change_thing(Perturb.change_location), nsamples=1, meta=True)

test = INV(**t, name='Change location everywhere', capability='NER',
          description='', expect=Expect.pairwise(expect_same))
test.run(new_pp)
test.summary(3, format_example_fn=format_replace)
suite.add(test, overwrite=True)

#Temporal

In [None]:
t = crossproduct(editor.template(
    {
        'contexts': [
            'Both {first_name} and {first_name2} were {prof1}s, but there was a change in {first_name}, who is now {a:prof2}.',
            'Both {first_name2} and {first_name} were {prof1}s, but there was a change in {first_name}, who is now {a:prof2}.',
        ],
        'qas': [
            (
                'Who is {a:prof2}?',
                '{first_name}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=2,
    ))
name = 'There was a change in profession'
test = MFT(**t, expect=expect_squad, capability='Temporal', name=name, description='' )
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 4 examples
Test cases:      2
Fails (rate):    2 (100.0%)

Example fails:
C: Both Scott and Alex were accountants, but there was a change in Scott, who is now an escort.
Q: Who is an escort?
A: Scott
P: Scott and Alex were accountants, but there was a change in Scott

C: Both Alex and Scott were accountants, but there was a change in Scott, who is now an escort.
Q: Who is an escort?
A: Scott
P: Scott were accountants, but there was a change in Scott


----
C: Both Marie and Robin were interpreters, but there was a change in Robin, who is now an agent.
Q: Who is an agent?
A: Robin
P: Robin were interpreters, but there was a change in Robin


----


In [None]:
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} became a {prof} before {first_name2} did.',
            '{first_name2} became a {prof} after {first_name} did.',
        ],
        'qas': [
            (
                'Who became a {prof} first?',
                '{first_name}'
            ), 
            (
                'Who became a {prof} last?',
                '{first_name2}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=2,
    ))
name = 'Understanding before / after -> first / last.'
test = MFT(**t, expect=expect_squad, capability='Temporal', name=name, description='' )
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 8 examples
Test cases:      2
Fails (rate):    2 (100.0%)

Example fails:
C: Carolyn became a waitress before Alexander did.
Q: Who became a waitress last?
A: Alexander
P: Carolyn

C: Alexander became a waitress after Carolyn did.
Q: Who became a waitress last?
A: Alexander
P: Carolyn


----
C: Kathy became a economist before Nick did.
Q: Who became a economist last?
A: Nick
P: Kathy


----


#Negation

In [None]:
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is not {a:prof}. {first_name2} is.',
            '{first_name2} is {a:prof}. {first_name} is not.',
        ],
        'qas': [
            (
                'Who is {a:prof}?',
                '{first_name2}'
            ), 
            (
                'Who is not {a:prof}?',
                '{first_name}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=2,
    ))
name = 'Negation in context, may or may not be in question'
test = MFT(**t, expect=expect_squad, capability='Negation', name=name, description='' )
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 8 examples
Test cases:      2
Fails (rate):    1 (50.0%)

Example fails:
C: Heather is not an investor. Arthur is.
Q: Who is an investor?
A: Arthur
P: Heather


----


In [None]:
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is {a:prof}. {first_name2} is {a:prof2}.',
            '{first_name2} is {a:prof2}. {first_name} is {a:prof}.',
        ],
        'qas': [
            (
                'Who is {a:prof}?',
                '{first_name}'
            ), 
            (
                'Who is not {a:prof}?',
                '{first_name2}'
            ), 
            (
                'Who is {a:prof2}?',
                '{first_name2}'
            ), 
            (
                'Who is not {a:prof2}?',
                '{first_name}'
            ), 
        ]
        
    },
    prof=professions,
    remove_duplicates=True,
    nsamples=2,
    ))
name = 'Negation in question only.'
test = MFT(**t, expect=expect_squad, capability='Negation', name=name, description='' )
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 16 examples
Test cases:      2
Fails (rate):    2 (100.0%)

Example fails:
C: Jane is an analyst. Andrew is an interpreter.
Q: Who is not an analyst?
A: Andrew
P: Jane

C: Jane is an analyst. Andrew is an interpreter.
Q: Who is not an interpreter?
A: Jane
P: Andrew

C: Andrew is an interpreter. Jane is an analyst.
Q: Who is not an analyst?
A: Andrew
P: Jane


----
C: Sophie is an economist. Martin is an attorney.
Q: Who is not an economist?
A: Martin
P: Sophie

C: Sophie is an economist. Martin is an attorney.
Q: Who is not an attorney?
A: Sophie
P: Martin

C: Martin is an attorney. Sophie is an economist.
Q: Who is not an economist?
A: Martin
P: Sophie


----


#Fairness spinoff

In [None]:
import collections
fewer_profs = ['doctor', 'nurse', 'secretary', 'CEO']
t = editor.template(
    [
        ('{male} is not {a:prof}, {female} is.', 'Who is {a:prof}?', '{female}', 'woman', '{prof}'),
        ('{female} is not {a:prof}, {male} is.', 'Who is {a:prof}?', '{male}', 'man', '{prof}'),
    ],
#     prof=professions + ['doctor'],
    prof=fewer_profs,
    remove_duplicates=True,
    nsamples=10,
    unroll=True,
    save=True,
    )
data = [(d[0], d[1]) for d in t.data]
labels = [d[2] for d in t.data]
meta = [(d[3], d[4]) for d in t.data]

test = MFT(data, expect=expect_squad, labels=labels, meta=meta, templates=t.templates,
          name='M/F failure rates should be similar for different professions', capability='Fairness',
          description='Using negation in context.')
test.run(new_pp)

def print_fair(test):
    # Examples per (sex, profession) pair, and how many of them failed
    c = collections.Counter(test.meta)
    fail = collections.Counter([tuple(x) for x in np.array(test.meta)[test.fail_idxs()]])
    # Only professions with at least one failure are reported
    profs = set()
    for sex, prof in fail:
        profs.add(prof)
    prof_fail = {}
    # Failure rate for a pair; 0.0 if that pair was never sampled
    get_fail = lambda f: fail[f] / c[f] if c[f] else 0.0
    for prof in profs:
        fail_m = get_fail(('man', prof))
        fail_f = get_fail(('woman', prof))
        prof_fail[prof] = (fail_m, fail_f)
    print('%-13s fail_men fail_women (count)' % 'profession')
    # Sort by the larger of the two failure rates, highest first
    for prof, vs in sorted(prof_fail.items(), key=lambda x:max(x[1][0], x[1][1]), reverse=True):
        fail_m, fail_f = vs
        print('%-13s   %.1f      %.1f     (%d)' % (prof, 100 * fail_m, 100 * fail_f, c[('man', prof)]))
print_fair(test)
suite.add(test)

Predicting 20 examples
profession    fail_men fail_women (count)
CEO             0.0      100.0     (4)
nurse           40.0      40.0     (5)


#Coref

In [None]:
if 'actress' in professions:
    professions.remove('actress')

In [None]:
t = crossproduct(editor.template(
    {
        'contexts': [
            '{male} and {female} are friends. He is {a:prof1}, and she is {a:prof2}.',
            '{female} and {male} are friends. He is {a:prof1}, and she is {a:prof2}.',
            '{male} and {female} are friends. She is {a:prof2}, and he is {a:prof1}.',
            '{female} and {male} are friends. She is {a:prof2}, and he is {a:prof1}.',
        ],
        'qas': [
            (
                'Who is {a:prof1}?',
                '{male}'
            ), 
            (
                'Who is {a:prof2}?',
                '{female}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=2,
    ))
name = 'Basic coref, he / she'
test = MFT(**t, expect=expect_squad, name=name, description='', capability='Coref')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 16 examples
Test cases:      2
Fails (rate):    2 (100.0%)

Example fails:
C: Eric and Carolyn are friends. He is a model, and she is a DJ.
Q: Who is a model?
A: Eric
P: Eric and Carolyn

C: Carolyn and Eric are friends. He is a model, and she is a DJ.
Q: Who is a DJ?
A: Carolyn
P: Eric

C: Eric and Carolyn are friends. She is a DJ, and he is a model.
Q: Who is a model?
A: Eric
P: Carolyn


----
C: Walter and Annie are friends. He is a secretary, and she is an architect.
Q: Who is a secretary?
A: Walter
P: Walter and Annie

C: Annie and Walter are friends. He is a secretary, and she is an architect.
Q: Who is an architect?
A: Annie
P: Walter

C: Walter and Annie are friends. She is an architect, and he is a secretary.
Q: Who is a secretary?
A: Walter
P: Annie


----


In [None]:
t = crossproduct(editor.template(
    {
        'contexts': [
            '{male} and {female} are friends. His mom is {a:prof}.',
            '{female} and {male} are friends. His mom is {a:prof}.',
        ],
        'qas': [
            (
                'Whose mom is {a:prof}?',
                '{male}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=2,
    ))
t += crossproduct(editor.template(
    {
        'contexts': [
            '{male} and {female} are friends. Her mom is {a:prof}.',
            '{female} and {male} are friends. Her mom is {a:prof}.',
        ],
        'qas': [
            (
                'Whose mom is {a:prof}?',
                '{female}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=2,
    ))

name = 'Basic coref, his / her'
test = MFT(**t, expect=expect_squad, name=name, description='', capability='Coref')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 8 examples
Test cases:      4
Fails (rate):    4 (100.0%)

Example fails:
C: Richard and Melissa are friends. His mom is an editor.
Q: Whose mom is an editor?
A: Richard
P: Melissa


----
C: Fiona and Louis are friends. Her mom is an editor.
Q: Whose mom is an editor?
A: Fiona
P: Fiona and Louis


----
C: Anna and Roger are friends. Her mom is an editor.
Q: Whose mom is an editor?
A: Anna
P: Roger


----


In [None]:
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} and {first_name2} are friends. The former is {a:prof1}.',
            '{first_name2} and {first_name} are friends. The latter is {a:prof1}.',
            '{first_name} and {first_name2} are friends. The former is {a:prof1} and the latter is {a:prof2}.',
            '{first_name2} and {first_name} are friends. The former is {a:prof2} and the latter is {a:prof1}.',
        ],
        'qas': [
            (
                'Who is {a:prof1}?',
                '{first_name}'
            ), 
        ]
        
    },
    prof=professions,
    remove_duplicates=True,
    nsamples=2,
    save=True
    ))
name = 'Former / Latter'
test = MFT(**t, expect=expect_squad, name=name, description='', capability='Coref')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 8 examples
Test cases:      2
Fails (rate):    2 (100.0%)

Example fails:
C: Sophie and Rebecca are friends. The former is an editor.
Q: Who is an editor?
A: Sophie
P: Rebecca

C: Sophie and Rebecca are friends. The former is an editor and the latter is an author.
Q: Who is an editor?
A: Sophie
P: Sophie and Rebecca


----
C: Wendy and Karen are friends. The former is an executive.
Q: Who is an executive?
A: Wendy
P: Karen

C: Wendy and Karen are friends. The former is an executive and the latter is an organizer.
Q: Who is an executive?
A: Wendy
P: Karen


----


#SRL

In [None]:
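import pattern
import pattern.en
# Base verbs used to build active / passive sentence pairs
pverb = ['love', 'hate', 'like', 'remember', 'recognize', 'trust', 'deserve', 'understand', 'blame', 'dislike', 'prefer', 'follow', 'notice', 'hurt', 'bother', 'support', 'believe', 'accept', 'attack']
# Tense signatures taken from two example words: 'loves' for the active form, 'stolen' for the participle
a = pattern.en.tenses('loves')[0]
b = pattern.en.tenses('stolen')[0]
# Each entry becomes an (active form, participle) pair, so templates can use {v[0]} and {v[1]}
pverb = [(pattern.en.conjugate(v, *a), pattern.en.conjugate(v, *b)) for v in pverb]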

t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} {v[0]} {first_name2}.',
            '{first_name2} is {v[1]} by {first_name}.',
        ],
        'qas': [
            (
                'Who {v[0]}?',
                '{first_name}'
            ), 
            (
                'Who is {v[1]}?',
                '{first_name2}'
            ), 
        ]
        
    },
    v=pverb,
    remove_duplicates=True,
    nsamples=2,
    ))
name = 'Agent / object distinction'
test = MFT(**t, expect=expect_squad, name=name, description='', capability='SRL')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

In [None]:
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} {v[0]} {first_name2}. {first_name2} {v[0]} {first_name3}.',
            '{first_name} {v[0]} {first_name2}. {first_name3} is {v[1]} by {first_name2}.',
            '{first_name2} is {v[1]} by {first_name}. {first_name2} {v[0]} {first_name3}.',
            '{first_name2} is {v[1]} by {first_name}. {first_name3} is {v[1]} by {first_name2}.',
        ],
        'qas': [
            (
                'Who {v[0]} {first_name2}?',
                '{first_name}'
            ), 
            (
                'Who {v[0]} {first_name3}?',
                '{first_name2}'
            ), 
            (
                'Who is {v[1]} by {first_name}?',
                '{first_name2}'
            ), 
            (
                'Who is {v[1]} by {first_name2}?',
                '{first_name3}'
            ), 
        ]
        
    },
    save=True,
    v=pverb,
    remove_duplicates=True,
    nsamples=1,
    ))
name = 'Agent / object distinction with 3 agents'
test = MFT(**t, expect=expect_squad, name=name, description='', capability='SRL')
test.run(new_pp)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 16 examples
Test cases:      1
Fails (rate):    1 (100.0%)

Example fails:
C: Bob p Judy. Judy p Melissa.
Q: Who p Judy?
A: Bob
P: Bob p Judy. Judy p Melissa

C: Bob p Judy. Melissa is r by Judy.
Q: Who p Melissa?
A: Judy
P: Bob

C: Bob p Judy. Melissa is r by Judy.
Q: Who is r by Bob?
A: Judy
P: Melissa


----
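
In the output above, the verb slots render as `p` and `r`. That is consistent with `{v[0]}` and `{v[1]}` indexing a plain string such as `'prefer'` instead of a (present, participle) pair, which would happen if `pverb` still held the raw verb list when the template was filled. A small guard like the one below, run before `editor.template`, would catch this. `check_verb_pairs` is a hypothetical helper, and the pairs are hand-written stand-ins for the `pattern.en.conjugate` results, so this is only a sketch.

```python
def check_verb_pairs(verbs):
    """Raise if any entry is not a (present, participle) pair of strings."""
    for v in verbs:
        ok = (
            isinstance(v, tuple)
            and len(v) == 2
            and all(isinstance(form, str) and len(form) > 1 for form in v)
        )
        if not ok:
            raise ValueError(f"expected (present, participle) pair, got {v!r}")
    return verbs

# Hand-written stand-ins for the conjugated pairs (assumption, not pattern output)
good = [("loves", "loved"), ("prefers", "preferred")]
check_verb_pairs(good)

# Indexing a bare string reproduces the single letters seen in the output
print("{v[0]} {v[1]}".format(v="prefer"))
```

Passing the unconjugated list (for example `["prefer"]`) raises a `ValueError`, so the test would stop before producing contexts like `Bob p Judy`.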


In [None]:
suite.summary(n=3, format_example_fn=format_squad_with_context)

Robustness

Question contractions
Test cases:      2
Fails (rate):    1 (50.0%)

Example fails:
C: Unlike animals, plants lack phagocytic cells, but many plant immune responses involve systemic chemical signals that are sent through a plant. Individual plant cells respond to molecules associated with pathogens known as Pathogen-associated molecular patterns or PAMPs. When a part of a plant becomes infected, the plant produces a localized hypersensitive response, whereby cells at the site of infection undergo rapid apoptosis to prevent the spread of the disease to other parts of the plant. Systemic acquired resistance (SAR) is a type of defensive response used by plants that renders the entire plant resistant to a particular infectious agent. RNA silencing mechanisms are particularly important in this systemic response as they can block virus replication.
Q: What is a mechanism that can help plants block virus replication?
P: RNA silencing mechanisms

C: Unlike animals, plants lack phag

In [None]:
print_fair(suite.tests['M/F failure rates should be similar for different professions'])

profession    fail_men fail_women (count)
CEO             0.0      100.0     (4)
nurse           40.0      40.0     (5)


#Error Analysis

In [None]:

# Get the incorrect predictions
incorrect_predictions = []
for pred, ref in zip(formatted_predictions, references):
    if pred['prediction_text'] not in ref['answers']['text']:
        incorrect_predictions.append((pred['id'], pred['prediction_text'], ref['answers']['text']))
        
# Print the incorrect predictions
print("Incorrect Predictions:")
for i, (id_, pred, refs) in enumerate(incorrect_predictions[:3]):
    print(f"Example {i+1} - ID: {id_}")
    print()

Incorrect Predictions:
Example 1 - ID: 56be4db0acb8001400a502ee

Example 2 - ID: 56be8e613aeaaa14008c90d1

Example 3 - ID: 56bea9923aeaaa14008c91b9



In [None]:
# Index references and questions by example id so the lookup does not depend on list order
references_by_id = {ref['id']: ref for ref in references}
questions_by_id = {ex['id']: ex['question'] for ex in squad_dataset["validation"]}

incorrect_predictions = []
for prediction in formatted_predictions:
    pred_text = prediction['prediction_text'].strip()
    example_id = prediction['id']
    example = references_by_id.get(example_id)
    if example is None:
        continue
    answer_texts = example['answers']['text']
    # Raw string comparison: a prediction counts as incorrect unless it equals a reference exactly
    if pred_text not in answer_texts:
        incorrect_predictions.append({
            "question": questions_by_id.get(example_id),
            "incorrect_answer": pred_text,
            "correct_answer": answer_texts[0],  # only the first reference is shown
        })
print("Number of incorrect predictions:", len(incorrect_predictions))
print()
for pred in incorrect_predictions[0:20]:
    print("Question:", pred["question"])
    print("answer:", pred["incorrect_answer"])
    print("Correct answer:", pred["correct_answer"])
    print("--------------------------------------------------")


Number of incorrect predictions: 2288

Question: Where did Super Bowl 50 take place?
answer: Levi's Stadium in the San Francisco Bay Area at Santa Clara, California
Correct answer: Santa Clara, California
--------------------------------------------------
Question: What was the theme of Super Bowl 50?
answer: golden anniversary
Correct answer: "golden anniversary"
--------------------------------------------------
Question: What was the theme of Super Bowl 50?
answer: golden anniversary
Correct answer: "golden anniversary"
--------------------------------------------------
Question: Where was Super Bowl 50 held?
answer: Levi's Stadium in the San Francisco Bay Area at Santa Clara, California
Correct answer: Santa Clara, California.
--------------------------------------------------
Question: The name of the NFL championship game is?
answer: Super Bowl 50
Correct answer: Super Bowl
--------------------------------------------------
Question: What 2015 NFL team one the AFC playoff?
answer
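
Several of the mismatches listed above differ from a reference only in punctuation (for example `golden anniversary` against `"golden anniversary"`), because the loop compares raw strings. The sketch below shows one way to filter such cases with SQuAD-style answer normalization (lowercase, strip punctuation, drop articles, collapse whitespace) before counting an error. The helper names `normalize_answer` and `is_exact_match` are illustrative and are not defined elsewhere in this notebook.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def is_exact_match(prediction, reference_texts):
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in reference_texts)

# Two mismatches taken from the output above
print(is_exact_match("golden anniversary", ['"golden anniversary"']))
print(is_exact_match("Super Bowl 50", ["Super Bowl"]))
```

The first pair matches after normalization while the second does not, so only the second would remain in the error list. Replacing `pred_text not in answer_texts` with `not is_exact_match(pred_text, answer_texts)` would leave genuine span errors for manual inspection.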

We conducted an error analysis of TACL-BERT and BERT, both fine-tuned on SQuAD v1.1, and identified several recurring causes of incorrect predictions: imprecise context understanding, answers that are too specific or not specific enough, difficulty handling ambiguity, misinterpretation of questions, and sensitivity to phrasing and paraphrasing.

TACL model:

1. Context Understanding: The model fails to understand the context accurately or completely, leading to incorrect answers. Example: It answered "reciprocating steam engines" instead of "steam turbines" for 20th-century ship propulsion.
2. Specificity Issues: The model produces answers that are too specific or not specific enough. Example: It answered "open loop system" instead of "open loop" and "hydrogen and helium" instead of "helium" in two different questions.
3. Ambiguity Handling: The model struggles with ambiguous questions or those with multiple valid answers. Example: It answered "Richard Trevithick and, separately, Oliver Evans" instead of "Oliver Evans" for the creator of an engine using high-pressure steam in 1801.
4. Misinterpretation or Misrepresentation: The model misunderstands questions, leading to incorrect predictions. Example: It answered "edge railed rack and pinion Middleton Railway" instead of "Middleton Railway" for the railroad where Salamanca was used.
5. Implicit Information Capture: The model struggles to capture implicit information or to answer questions that require deeper context understanding. Example: It answered "relatively little work is required to drive the pump" instead of the more specific answer about the Rankine cycle's compression stage.
6. Sensitivity to Phrasing and Paraphrasing: The model may have difficulty with differently phrased or paraphrased questions, due to biases or an inability to generalize. Example: It answered "turbine type steam engines" instead of "turbine" for the steam engine producing most electricity today.

BERT Model:

1. Context understanding: BERT struggles with context, leading to incorrect answers. Example: Answering "the source of most of the chemical energy" instead of "chemical energy" for oxygen's role in combustion.
2. Incorrect information extraction: BERT extracts the wrong information from the text. Example: Answering "light sky-blue color" for the clarity of liquid oxygen, instead of "clear."

Areas where both models fail, but TACL performs better:

1. Disambiguation: TACL outperforms BERT in disambiguating concepts. Example: TACL correctly answers "methane" as the primary component of natural gas, while BERT lists multiple components.
2. Answer specificity: TACL provides more specific answers than BERT. Example: TACL answers "to generate ATP through cellular respiration" for mitochondria's role, while BERT answers "energy production."

Both models have trouble with ambiguity: neither correctly answers the question about the creator of a high-pressure steam engine in 1801. Misinterpreted questions also cause errors, as when the model returned an overly technical span for the question about the railroad on which Salamanca was used.

Additional analysis the paper could have included:

*   Implicit information: The authors could have investigated the model's performance on questions that require implicit information or context. A more granular breakdown of questions by the amount of context they require might have shown where the model needs improvement.
*   Paraphrase sensitivity: The authors could have created multiple versions of the same question with different linguistic structures. Performance on these variations would help expose biases in the training data or areas where the model struggles to generalize.
*   Data splits: The authors could have compared performance across data splits with varying levels of question similarity between the training and evaluation sets. This would help reveal overfitting and show how well the model generalizes to new questions.

Despite these limitations, TACL-BERT generally outperforms BERT-base-uncased in context understanding, information extraction, disambiguation, and answer specificity. We attribute its behavior, both the failures and the improvements over BERT, to several factors. First, TACL-BERT shares BERT's architecture, so the differences come from its token-aware contrastive pre-training, which appears to support more effective context comprehension on complex questions. Second, it seems better at identifying and focusing on the relevant part of the passage. Third, it disambiguates more reliably between similar concepts when choosing a response. Lastly, it tends to select more specific answer spans, which helps in some cases but can also cause over-specificity errors. Together these attributes explain why TACL-BERT outperforms BERT in several areas while still struggling with ambiguity and certain question types.

In conclusion, errors in both TACL-BERT and BERT can stem from factors such as preprocessing, tokenization, model architecture, training data, and hyperparameter tuning. Both models have limitations in understanding context, extracting the correct information, disambiguating concepts, and giving answers at the right level of specificity, but TACL-BERT typically performs better in these areas. Further investigation and optimization could make both models more accurate and reliable on question-answering tasks.
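
The paraphrase-sensitivity analysis suggested above could be set up roughly as follows. This is a sketch under stated assumptions: `answer_fn` stands for any callable that maps `(context, question)` to an answer string (for example a wrapper around the fine-tuned model), and `paraphrase_consistency` and `toy_answer_fn` are hypothetical names used only for illustration.

```python
from collections import Counter

def paraphrase_consistency(answer_fn, context, paraphrases):
    """Fraction of paraphrased questions that receive the most common answer."""
    answers = [answer_fn(context, q).strip().lower() for q in paraphrases]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers), answers

# Toy stand-in for a QA model (assumption): returns the first capitalized word
# of the context, or the last one if the question contains the word "last".
def toy_answer_fn(context, question):
    names = [w.strip(".,") for w in context.split() if w[0].isupper()]
    return names[-1] if "last" in question.lower() else names[0]

context = "Carolyn became a waitress before Alexander did."
paraphrases = [
    "Who became a waitress first?",
    "Who was the first to become a waitress?",
    "Who did not become a waitress last?",
]
score, answers = paraphrase_consistency(toy_answer_fn, context, paraphrases)
print(score, answers)
```

A score well below 1.0 for questions that should share one answer would point to sensitivity to surface phrasing rather than to the content of the passage.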


## Unit Test on TaCL-Bert vs BERT

In [None]:
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel

# Load TACL model
tacl_tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/tacl-bert-base-uncased")
tacl_model = AutoModel.from_pretrained("cambridgeltl/tacl-bert-base-uncased")

# Load BERT model
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModel.from_pretrained("bert-base-uncased")

# Choose a sentence to compare
sentence = "The quick brown fox jumps over the lazy dog."

# Encode the sentence and get the last-layer representation of every token for both models
# (the tokenizer adds the [CLS] and [SEP] special tokens, which are included in the comparison)
with torch.no_grad():
    tacl_inputs = tacl_tokenizer(sentence, return_tensors="pt")
    bert_inputs = bert_tokenizer(sentence, return_tensors="pt")
    tacl_outputs = tacl_model(**tacl_inputs).last_hidden_state[0]
    bert_outputs = bert_model(**bert_inputs).last_hidden_state[0]

# Calculate self-similarity for both models
def self_similarity(outputs):
    cos_sim = torch.nn.CosineSimilarity(dim=-1)
    similarity = []
    for i in range(outputs.shape[0]):
        for j in range(i+1, outputs.shape[0]):
            similarity.append(cos_sim(outputs[i], outputs[j]).item())
    return np.mean(similarity)

tacl_self_sim = self_similarity(tacl_outputs)
bert_self_sim = self_similarity(bert_outputs)

# Print results
print("TACL self-similarity:", tacl_self_sim)
print("BERT self-similarity:", bert_self_sim)
if tacl_self_sim < bert_self_sim:
    print("TACL has more discriminative token representations than BERT.")
else:
    print("BERT has more discriminative token representations than TACL.")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


TACL self-similarity: 0.3019088080547976
BERT self-similarity: 0.4077328010038896
TACL has more discriminative token representations than BERT.


In this sample test, we fed the sentence "The quick brown fox jumps over the lazy dog." to both TACL and BERT. The sentence is a well-known pangram and serves here only as a short, convenient example. Self-similarity is computed as the mean cosine similarity between the last-layer representations of every pair of distinct tokens in the sentence, including the special tokens added by the tokenizer; lower values mean the token representations are easier to tell apart. TACL scored 0.3019 and BERT scored 0.4077. The lower score is consistent with TACL's token-aware contrastive training objective, which is designed to produce more discriminative and more isotropic token representations. A single sentence is not enough to support a general conclusion, so this result should be read as an illustration rather than as evidence of better downstream performance.
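
The pairwise loop in `self_similarity` can also be written as a single matrix product, which makes it easy to leave out the `[CLS]` and `[SEP]` positions. The sketch below uses a random tensor in place of model outputs so that it runs without downloading anything. `self_similarity_vectorized` and its `skip_special` flag are assumptions introduced here and are not part of the notebook.

```python
import torch

def self_similarity_vectorized(outputs, skip_special=False):
    """Mean cosine similarity over all distinct token pairs in one sequence.

    outputs: tensor of shape (seq_len, hidden_size).
    skip_special: if True, drop the first and last positions, assuming they
    hold [CLS] and [SEP] as in BERT-style single-sentence inputs.
    """
    if skip_special:
        outputs = outputs[1:-1]
    normed = torch.nn.functional.normalize(outputs, dim=-1)
    sims = normed @ normed.T  # (n, n) matrix of cosine similarities
    n = sims.shape[0]
    # Keep each unordered pair once by selecting the strict upper triangle
    upper = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    return sims[upper].mean().item()

# Random stand-in for a (seq_len, hidden_size) output tensor
torch.manual_seed(0)
fake_outputs = torch.randn(12, 768)
print("all tokens:        ", self_similarity_vectorized(fake_outputs))
print("without [CLS]/[SEP]:", self_similarity_vectorized(fake_outputs, skip_special=True))
```

Applied to `tacl_outputs` and `bert_outputs`, the default call should reproduce the loop-based scores up to floating-point error, and `skip_special=True` shows how much the special tokens contribute to the gap.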

##References


*  https://github.com/yxuansu/TaCL/tree/main/english_benchmark 
*   https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering
*   https://github.com/marcotcr/checklist/tree/master
