# Running the CheckList test suite for SQuAD
Source: adapted, with minor changes, from https://github.com/marcotcr/checklist/blob/115f123de47ab015b2c3a6baebaffb40bab80c9f/notebooks/tutorials/5.%20Testing%20transformer%20pipelines.ipynb

In [1]:
%load_ext autoreload
%autoreload 2

import itertools

import numpy as np
import spacy
import datasets
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    AutoModelForQuestionAnswering, Trainer, TrainingArguments, HfArgumentParser,
    pipeline,
)

import checklist
import checklist.editor
import checklist.text_generation
from checklist.test_types import MFT, INV, DIR
from checklist.expect import Expect
from checklist.test_suite import TestSuite
from checklist.perturb import Perturb


In [2]:

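# Load the question-answering model and its tokenizer from a local checkpoint directory.
model_name = "trained_model_squad1/"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Wrap both in a question-answering pipeline.
# Note: from here on, `model` refers to the pipeline, not the raw model.
model = pipeline('question-answering', model=model, tokenizer=tokenizer)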


In [3]:
suite_path = 'squad_suite.pkl'
suite = TestSuite.from_file(suite_path)

In [4]:
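def predconfs(context_question_pairs):
    """Return (predictions, confidences) for a list of (context, question) pairs."""
    preds = []
    confs = []
    for c, q in context_question_pairs:
        try:
            p = model(question=q, context=c, truncation=True)
        except Exception:
            # On failure, record a placeholder and move on to the next pair,
            # so that preds and confs stay aligned with the inputs.
            print('Failed', q)
            preds.append(' ')
            confs.append(1)
            continue
        preds.append(p['answer'])
        confs.append(p['score'])
    return preds, np.array(confs)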
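
The prediction function has to return exactly one prediction and one confidence per input pair, including for inputs where the model call fails. The following is a minimal sketch of that contract using a hypothetical stand-in model (`fake_qa` and `predict_all` are illustrative names, not part of CheckList or transformers), so it runs without a trained checkpoint:

```python
def fake_qa(question, context):
    """Hypothetical stand-in for the QA pipeline: answers with the first word of the context."""
    if not context:
        raise ValueError("empty context")
    return {"answer": context.split()[0], "score": 0.9}


def predict_all(pairs, qa=fake_qa):
    """Return one prediction and one confidence per (context, question) pair."""
    preds, confs = [], []
    for context, question in pairs:
        try:
            out = qa(question=question, context=context)
        except ValueError:
            # Placeholder values keep the outputs aligned with the inputs.
            preds.append(" ")
            confs.append(1.0)
            continue
        preds.append(out["answer"])
        confs.append(out["score"])
    return preds, confs


pairs = [
    ("Frank is cleaner than Charlotte.", "Who is less clean?"),
    ("", "Who is taller?"),  # this pair makes the stand-in model fail
]
preds, confs = predict_all(pairs)
print(len(pairs), len(preds), len(confs))
```

Both output lists have the same length as the input, so each result can be matched back to the pair that produced it.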

In [5]:
suite.run(predconfs, n=100, overwrite=True)

Running A is COMP than B. Who is more COMP?
Predicting 100 examples


Running A is COMP than B. Who is less COMP?
Predicting 100 examples
Running Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?
Predicting 1200 examples
Running size, shape, age, color
Predicting 400 examples
Running Profession vs nationality
Predicting 1000 examples
Running Animal vs Vehicle
Predicting 400 examples
Running Animal vs Vehicle v2
Predicting 400 examples
Running Synonyms
Predicting 400 examples
Running A is COMP than B. Who is antonym(COMP)? B
Predicting 400 examples
Running A is more X than B. Who is more antonym(X)? B. Who is less X? B. Who is more X? A. Who is less antonym(X)? A.
Predicting 1600 examples
Running Question typo
Predicting 200 examples
Running Question contractions
Predicting 200 examples
Running Add random sentence to context
Predicting 300 examples
Running Change name everywhere
Predicting 1100 examples
Running Change location everywhere
Predicting 1100 examples
Running There was a change in profession
Predicting 200 examples
Runn
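
The log reports more predicted examples than sampled test cases (for instance, 1200 examples for a test run with `n=100`), which suggests that a single test case can bundle several (context, question) pairs. The sketch below illustrates that idea under this assumption; the data and the `regroup` helper are hypothetical, and this is not CheckList's actual implementation:

```python
from itertools import chain

# Hypothetical test cases: each inner list holds the pairs of one test case.
test_cases = [
    [("A is taller than B.", "Who is taller?"),
     ("B is shorter than A.", "Who is taller?")],
    [("C is older than D.", "Who is older?"),
     ("D is younger than C.", "Who is older?")],
]

# Flatten so that a prediction function sees one flat list of pairs.
flat_pairs = list(chain.from_iterable(test_cases))


def regroup(flat_results, cases):
    """Split a flat list of results back into per-test-case groups."""
    grouped, start = [], 0
    for case in cases:
        grouped.append(flat_results[start:start + len(case)])
        start += len(case)
    return grouped


# Use each pair's index as a stand-in for a model prediction.
grouped = regroup(list(range(len(flat_pairs))), test_cases)
print(len(test_cases), len(flat_pairs), grouped)
```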

In [6]:
def format_squad_with_context(x, pred, conf, label=None, *args, **kwargs):
    c, q = x
    ret = 'C: %s\nQ: %s\n' % (c, q)
    if label is not None:
        ret += 'A: %s\n' % label
    ret += 'P: %s\n' % pred
    return ret
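
As a quick check of the formatter, the sketch below repeats the function so the block runs on its own and applies it to one of the failing examples from the summary. The confidence value `0.87` is made up for illustration; the formatter ignores it:

```python
def format_squad_with_context(x, pred, conf, label=None, *args, **kwargs):
    c, q = x
    ret = 'C: %s\nQ: %s\n' % (c, q)
    if label is not None:
        ret += 'A: %s\n' % label
    ret += 'P: %s\n' % pred
    return ret


example = ("Frank is cleaner than Charlotte.", "Who is less clean?")

# With a label, the expected answer (A) is printed before the prediction (P).
with_label = format_squad_with_context(example, pred="Frank", conf=0.87, label="Charlotte")
print(with_label)

# Without a label, the "A:" line is omitted.
without_label = format_squad_with_context(example, pred="Frank", conf=0.87)
print(without_label)
```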

In [7]:
suite.summary(format_example_fn=format_squad_with_context)

Vocabulary

A is COMP than B. Who is more COMP?
Test cases:      499
Test cases run:  100
Fails (rate):    0 (0.0%)


A is COMP than B. Who is less COMP?
Test cases:      497
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Frank is cleaner than Charlotte.
Q: Who is less clean?
A: Charlotte
P: Frank

----
C: Tony is richer than Pamela.
Q: Who is less rich?
A: Pamela
P: Tony

----
C: Nancy is taller than Susan.
Q: Who is less tall?
A: Susan
P: Nancy

----


Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?
Test cases:      498
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Joseph is highly upbeat about the project. Martin is upbeat about the project.
Q: Who is least upbeat about the project?
A: Martin
P: Joseph

C: Martin is upbeat about the project. Joseph is highly upbeat about the project.
Q: Who is least upbeat about the project?
A: Martin
P: Joseph

C: Joseph is highly upbeat about the project. Martin is mildly up