# Running checklist test suite for SQuAD
source: code from https://github.com/marcotcr/checklist/blob/115f123de47ab015b2c3a6baebaffb40bab80c9f/notebooks/tutorials/5.%20Testing%20transformer%20pipelines.ipynb with minor changes

This notebook runs the CheckList test suite for SQuAD. It loads a question-answering pipeline, runs the suite's test cases against it, and summarizes the results.

In [1]:
%load_ext autoreload
%autoreload 2

import checklist
import spacy
import itertools

import checklist.editor
import checklist.text_generation
from checklist.test_types import MFT, INV, DIR
from checklist.expect import Expect
from checklist.test_suite import TestSuite

import numpy as np
from checklist.perturb import Perturb
import datasets
from transformers import AutoTokenizer, AutoModelForSequenceClassification, \
    AutoModelForQuestionAnswering, Trainer, TrainingArguments, HfArgumentParser
from transformers import pipeline 


In [2]:

model = pipeline('question-answering')
model({
    'context': 'A new strain of flu that has the potential to become a pandemic has been identified by scientists.',
    'question': 'What has been discovered by scientists?'
})

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

{'score': 0.38112929463386536,
 'start': 0,
 'end': 19,
 'answer': 'A new strain of flu'}

In [3]:
suite_path = 'squad_suite.pkl'  # path to the pickled test suite
suite = TestSuite.from_file(suite_path)

In [4]:
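# Takes a list of (context, question) pairs and returns two aligned sequences:
# the predicted answers and their confidence scores, one entry per input pair.
def predconfs(context_question_pairs):
    preds = []
    confs = []
    for c, q in context_question_pairs:
        try:
            p = model(question=q, context=c, truncation=True)
        except Exception:
            # Record a placeholder and skip to the next pair so that the
            # outputs stay aligned with the inputs.
            print('Failed', q)
            preds.append(' ')
            confs.append(1)
            continue
        preds.append(p['answer'])
        confs.append(p['score'])
    return preds, np.array(confs)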
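
The suite relies on the wrapper returning exactly one prediction and one confidence per input pair, even when the model raises an error. The following is a minimal sketch of that contract, not the notebook's actual code: `fake_model` and `predconfs_sketch` are hypothetical names, the stand-in model simply answers with the first word of the context, and confidences are kept in a plain list rather than a NumPy array so the block has no dependencies.

```python
def fake_model(question, context, **kwargs):
    """Hypothetical stand-in for the QA pipeline (an assumption, not the real model)."""
    if not context:
        raise ValueError("empty context")
    # Answer with the first word of the context and a fixed score.
    return {"answer": context.split()[0], "score": 0.5}


def predconfs_sketch(context_question_pairs, model=fake_model):
    preds, confs = [], []
    for c, q in context_question_pairs:
        try:
            p = model(question=q, context=c, truncation=True)
        except Exception:
            # Placeholder entry, then move on so outputs stay aligned with inputs.
            preds.append(" ")
            confs.append(1.0)
            continue
        preds.append(p["answer"])
        confs.append(p["score"])
    return preds, confs


pairs = [
    ("Rose is taller than Julia.", "Who is taller?"),
    ("", "Who is taller?"),  # makes the stand-in model raise
]
preds, confs = predconfs_sketch(pairs)
print(preds, confs)
```

Without the `continue` in the error branch, a failed pair would contribute two entries (the placeholder plus whatever `p` last held), and the predictions would drift out of step with the inputs.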

In [5]:
# Run every test in the suite on a sample of up to 100 test cases each,
# using predconfs to get predictions; overwrite any previous results.
suite.run(predconfs, n=100, overwrite=True)
# suite.run()

Running A is COMP than B. Who is more COMP?
Predicting 100 examples
Running A is COMP than B. Who is less COMP?
Predicting 100 examples
Running Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?
Predicting 1200 examples
Running size, shape, age, color
Predicting 400 examples
Running Profession vs nationality
Predicting 1000 examples
Running Animal vs Vehicle
Predicting 400 examples
Running Animal vs Vehicle v2
Predicting 400 examples
Running Synonyms
Predicting 400 examples
Running A is COMP than B. Who is antonym(COMP)? B
Predicting 400 examples
Running A is more X than B. Who is more antonym(X)? B. Who is less X? B. Who is more X? A. Who is less antonym(X)? A.
Predicting 1600 examples
Running Question typo
Predicting 200 examples
Running Question contractions
Predicting 201 examples
Running Add random sentence to context
Predicting 300 examples
Running Change name everywhere
Predicting 1100 examples
Running Change location everywhere
Predicting 1100 examples
R

In [6]:
# Format one example (context, question, optional label, prediction) for display in the summary
def format_squad_with_context(x, pred, conf, label=None, *args, **kwargs):
    c, q = x
    ret = 'C: %s\nQ: %s\n' % (c, q)
    if label is not None:
        ret += 'A: %s\n' % label
    ret += 'P: %s\n' % pred
    return ret
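
To see what the formatter produces, it can be called directly. The sketch below repeats the function so the block runs on its own and feeds it the first failing example from the summary output; the confidence value `0.9` is made up for illustration and is ignored by the formatter.

```python
def format_squad_with_context(x, pred, conf, label=None, *args, **kwargs):
    # Same formatter as in the cell above, repeated so this block is self-contained.
    c, q = x
    ret = 'C: %s\nQ: %s\n' % (c, q)
    if label is not None:
        ret += 'A: %s\n' % label
    ret += 'P: %s\n' % pred
    return ret


# Context and question from the summary output; conf is an illustrative value.
x = ('Rose is taller than Julia.', 'Who is taller?')

with_label = format_squad_with_context(x, pred='Julia', conf=0.9, label='Rose')
without_label = format_squad_with_context(x, pred='Julia', conf=0.9)

print(with_label)     # includes the expected answer on an "A:" line
print(without_label)  # the "A:" line is omitted when no label is given
```

Tests without a gold label, such as invariance tests, would render without the `A:` line.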

In [22]:
# Summarize the results, showing up to three failing examples per test,
# formatted with `format_squad_with_context()`
suite_summary = suite.summary(n=3, format_example_fn=format_squad_with_context)

Vocabulary

A is COMP than B. Who is more COMP?
Test cases:      498
Test cases run:  100
Fails (rate):    2 (2.0%)

Example fails:
C: Rose is taller than Julia.
Q: Who is taller?
A: Rose
P: Julia

----
C: Sharon is higher than Donna.
Q: Who is higher?
A: Sharon
P: Sharon is higher than Donna

----


A is COMP than B. Who is less COMP?
Test cases:      497
Test cases run:  100
Fails (rate):    99 (99.0%)

Example fails:
C: Tom is smarter than Lawrence.
Q: Who is less smart?
A: Lawrence
P: Tom

----
C: William is stranger than Gary.
Q: Who is less strange?
A: Gary
P: William

----
C: Alex is faster than Julie.
Q: Who is less fast?
A: Julie
P: Alex

----


Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?
Test cases:      499
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Margaret is extremely hopeful about the project. Donald is hopeful about the project.
Q: Who is most hopeful about the project?
A: Margaret
P: Donald

C: Donald is hopeful

In [33]:

# from SuiteSummarizerModel import SuiteSummarizerModel

display(suite.visual_summary_table())
# suite.visual_summary(format_example_fn=format_squad_with_context)


Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'A is COMP than B. Wh…