# SQuAD 
This notebook builds the test suite for the SQuAD (Stanford Question Answering Dataset) task. It selects and preprocesses the data, defines the test cases, and runs them against the question-answering model defined below through the `predconfs` function. Each test case supplies a context (C) and a question (Q) and, where a label exists, the expected answer (A); the model's prediction (P) is compared against it. A short summary is printed for each test.

For example:

```
C: Laura became a waitress before Roy did.
Q: Who became a waitress last?
A: Roy
P: Laura
```

This is a failing test case. The context is "Laura became a waitress `before` Roy did." and the question is "Who became a waitress `last`?". Because the context says "before" and the question asks about "last", the expected answer is "Roy", but the model predicts "Laura". This suggests that the question-answering model does not handle the relation between "before" and "last".

The suite tests the following capabilities:

- Vocabulary
- Taxonomy
    - Size, shape, color, age, material
    - Professions vs nationalities
    - Animal vs vehicle
- Robustness
- NER
- Temporal
- Negation
- Fairness spinoff
- Coreference (understanding which entities "his / her" and "former / latter" refer to)
- Semantic Role Labeling (SRL) (understanding the roles of entities, e.g. the subject)

After all tests have been created, the suite is saved to `squad_suite.pkl`.


Note:
- MFT (Minimum Functionality Test): evaluates whether a model has a basic functionality, using simple examples with known labels.
- DIR (Directional Expectation test): checks whether a model's predictions are consistent with a prior expectation or hypothesis when the input is modified.
- INV (Invariance test): checks whether a model's predictions stay the same under certain transformations or changes in the input data.
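
The three test types can be illustrated with a small sketch. This is not the checklist API: `toy_model` is a hypothetical stand-in that simply returns the first capitalized word of the context, and the three helper functions are assumptions made up for illustration.

```python
# Hypothetical stand-in for a QA model: returns the first capitalized word
# in the context and ignores the question entirely.
def toy_model(context, question):
    for word in context.replace('.', '').split():
        if word[0].isupper():
            return word
    return ''

# MFT: the prediction must match a known label.
def mft_passes(context, question, label):
    return toy_model(context, question) == label

# INV: the prediction must not change under a harmless perturbation.
def inv_passes(context, question, perturbed_question):
    return toy_model(context, question) == toy_model(context, perturbed_question)

# DIR: the prediction must change in an expected way after an edit.
def dir_passes(context, edited_context, question, expectation):
    return expectation(toy_model(context, question),
                       toy_model(edited_context, question))

context = 'Laura became a waitress before Roy did.'
swapped = 'Roy became a waitress before Laura did.'

mft_first = mft_passes(context, 'Who became a waitress first?', 'Laura')
mft_last = mft_passes(context, 'Who became a waitress last?', 'Roy')
inv_typo = inv_passes(context, 'Who became a waitress first?',
                      'Who became a waitress frist?')
dir_swap = dir_passes(context, swapped, 'Who became a waitress first?',
                      lambda before, after: before != after)
```

Here the toy model passes the "first" MFT but fails the "last" MFT, mirroring the failing example in the introduction.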



ref:
- https://www.godeltech.com/how-to-automate-the-testing-process-for-machine-learning-systems/

`Perturb` applies perturbations (such as typos) to the data in order to evaluate the model's robustness.

The SQuAD data comes from the `datasets` library.



In [1]:
# %load_ext autoreload
# %autoreload 2
%reload_ext autoreload
%autoreload 0

import checklist
import spacy
import itertools

import checklist.editor
import checklist.text_generation
from checklist.test_types import MFT, INV, DIR
from checklist.expect import Expect
from checklist.test_suite import TestSuite
import numpy as np
import spacy
from checklist.perturb import Perturb


In [2]:
import tensorflow as tf


In [3]:
import dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, \
    AutoModelForQuestionAnswering, Trainer, TrainingArguments, HfArgumentParser
from transformers import pipeline 

model = pipeline('question-answering')

# Quick check that the pipeline works
model({
    'context': 'A new strain of flu that has the potential to become a pandemic has been identified by scientists.',
    'question': 'What has been discovered by scientists?'
})

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.





{'score': 0.38112929463386536,
 'start': 0,
 'end': 19,
 'answer': 'A new strain of flu'}

In [5]:
editor = checklist.editor.Editor() # create an instance of the Editor class
editor.tg # the Editor's text generator, used to suggest words for {mask} slots

<checklist.text_generation.TextGenerator at 0x16e820f7090>

In [6]:
# Takes a list of (context, question) pairs and returns two parallel sequences:
# the predicted answers and their confidence scores.
def predconfs(context_question_pairs):
    """
    output: predictions, confidence 
    source: https://github.com/marcotcr/checklist/blob/115f123de47ab015b2c3a6baebaffb40bab80c9f/notebooks/tutorials/5.%20Testing%20transformer%20pipelines.ipynb
    """
    preds = []
    confs = []
    for c, q in context_question_pairs:
        try:
            p = model(question=q, context=c, truncation=True)
        except Exception:
            # Record a placeholder and move on to the next pair; without `continue`
            # the lines below would append a second entry for the same pair.
            print('Failed', q)
            preds.append(' ')
            confs.append(1)
            continue
        preds.append(p['answer'])
        confs.append(p['score'])
    return preds, np.array(confs)

In [7]:
# Format a SQuAD test case, including its context, for the summary.
# e.g. (a failing example, shown only to illustrate the format)
# C: Claire is shorter than Donald.
# Q: Who is shorter?
# A: Claire (expected answer)
# P: Donald (predicted answer)
def format_squad_with_context(x, pred, conf, label=None, *args, **kwargs):
    c, q = x
    ret = 'C: %s\nQ: %s\n' % (c, q)
    if label is not None:
        ret += 'A: %s\n' % label
    ret += 'P: %s\n' % pred
    return ret

In [8]:
# Format a SQuAD test case without its context
# e.g.,
# Q: Where have the powers maintained peace in recent years?
# P: United Nations
def format_squad(x, pred, conf, label=None, *args, **kwargs):
    c, q = x
    ret = 'Q: %s\n' % (q)
    if label is not None:
        ret += 'A: %s\n' % label
    ret += 'P: %s\n' % pred
    return ret

In [9]:
# A TestSuite is a container that collects the tests defined below.
suite = TestSuite()

## Vocabulary

In [10]:
# Suggest fillers for the {mask} slot. In this template the suggestions are comparative adjectives.
print(', '.join(editor.suggest('{first_name} is {mask} than {first_name2}.')[:60]))

smarter, better, older, younger, taller, worse, different, stronger, cooler, nicer, tougher, shorter, bigger, hotter, more, darker, happier, smaller, faster, richer, wiser, thinner, less, weaker, larger, quieter, cleaner, closer, healthier, heavier, colder, slower, harder, wealthier, safer, quicker, longer, higher, cheaper, thicker, louder, sharper, lighter, warmer, brighter, greater, deeper, lower, easier, softer, smoother, poorer, other, stranger, newer, stricter, simpler, clearer, superior, tighter


In [11]:
# Define the adjectives
adj = ['old', 'smart', 'tall', 'young', 'strong', 'short', 'tough', 'cool', 'fast', 'nice', 'small', 'dark', 'wise', 'rich', 'great', 'weak', 'high', 'slow', 'strange', 'clean']

# Pair each adjective with its stem (trailing 'e' stripped) so that '{adj[0]}er'
# forms the comparative, e.g. ('nic', 'nice') -> 'nicer'
adj = [(x.rstrip('e'), x) for x in adj]
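
The stem trick above can be checked in isolation. The sketch below is illustrative only; `comparative` is a hypothetical helper, not part of the notebook. It also shows a limitation: the rule does not double final consonants, so it only suits adjectives like the ones listed.

```python
# Hypothetical helper: build a comparative the same way the template does,
# by stripping trailing 'e' characters and appending 'er'.
def comparative(adjective):
    return adjective.rstrip('e') + 'er'

regular = [comparative(a) for a in ['old', 'nice', 'wise', 'strange']]
# regular == ['older', 'nicer', 'wiser', 'stranger']

# Adjectives that need a doubled consonant are not handled by this rule.
irregular = comparative('big')
# irregular == 'biger'
```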


In [12]:
adj[2]

('tall', 'tall')

In [13]:
# Add the test case
# The expected answer is first_name, the person named first in the context
t = editor.template(
    [(
    '{first_name} is {adj[0]}er than {first_name1}.',
    'Who is {adj[0]}er?'
    )
    ],
    labels = ['{first_name}'],
    adj=adj,
    remove_duplicates=True,
    nsamples=500,
    save=True
    )
name = 'A is COMP than B. Who is more COMP?'
description = ''
test = MFT(**t, name=name, description=description, capability='Vocabulary')
suite.add(test)

In [14]:
# Run the example with 100 samples
test.run(predconfs, n=100, overwrite=True)

Predicting 100 examples


In [15]:
# Summarize the results, formatting each example with its context
test.summary(format_example_fn=format_squad_with_context)

Test cases:      498
Test cases run:  100
Fails (rate):    2 (2.0%)

Example fails:
C: Claire is shorter than Donald.
Q: Who is shorter?
A: Claire
P: Donald

----
C: Alison is greater than Ruth.
Q: Who is greater?
A: Alison
P: Ruth

----


In [16]:
# The model fails every sampled case of this test: it does not account for the word 'less' and returns the first name
t = editor.template(
    [(
    '{first_name} is {adj[0]}er than {first_name1}.',
    'Who is less {adj[1]}?'
    )
    ],
    labels = ['{first_name1}'], # label the right answer
    adj=adj,
    remove_duplicates=True,
    nsamples=500,
    save=True
    )
name = 'A is COMP than B. Who is less COMP?'
description = ''
test = MFT(**t, name=name, description=description, capability='Vocabulary')
suite.add(test)

In [17]:
# Run the example with 100 samples
test.run(predconfs, n=100)

Predicting 100 examples


In [18]:
# Summarize the results, formatting each example with its context
test.summary(format_example_fn=format_squad_with_context)

Test cases:      497
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Martha is nicer than Linda.
Q: Who is less nice?
A: Linda
P: Martha

----
C: Julie is younger than Joan.
Q: Who is less young?
A: Joan
P: Julie

----
C: William is weaker than Lawrence.
Q: Who is less weak?
A: Lawrence
P: William

----


In [19]:
# Build every (context, question) combination for each test case,
# together with the label that belongs to each question
def crossproduct(t):
    # takes the output of editor.template and does the cross product of contexts and qas
    ret = []
    ret_labels = []
    for x in t.data:
        cs = x['contexts']
        qas = x['qas']
        d = list(itertools.product(cs, qas))
        ret.append([(x[0], x[1][0]) for x in d])
        ret_labels.append([x[1][1] for x in d])
    t.data = ret
    t.labels = ret_labels
    return t
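
To see what the cross product produces, here is a minimal sketch on plain dictionaries, without checklist. The function name `cross_contexts_qas` and the sample data are assumptions made for illustration; the pairing logic follows `crossproduct` above.

```python
import itertools

# Pair every context with every (question, answer) tuple and split the
# result into model inputs and expected labels.
def cross_contexts_qas(example):
    pairs = list(itertools.product(example['contexts'], example['qas']))
    inputs = [(context, question) for context, (question, _) in pairs]
    labels = [answer for _, (_, answer) in pairs]
    return inputs, labels

example = {
    'contexts': ['Ann is taller than Bob.', 'Bob is shorter than Ann.'],
    'qas': [('Who is taller?', 'Ann'), ('Who is shorter?', 'Bob')],
}
inputs, labels = cross_contexts_qas(example)
# 2 contexts x 2 questions -> 4 inputs, labels ['Ann', 'Bob', 'Ann', 'Bob']
```

Each test case therefore expands into `len(contexts) * len(qas)` model calls, which would explain why a run of 100 test cases with 6 contexts and 2 questions reports 1200 predictions.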


In [20]:
# Show the suggested words for the {mask} slot
state = editor.suggest('John is very {mask} about the project.')[:20]
print(', '.join(editor.suggest('John is {mask} {state} about the project.', state=state)[:30]))

# Define the intensifiers and reducers
very = ['very', 'extremely', 'really', 'quite', 'incredibly', 'particularly', 'highly', 'super']
somewhat = ['a little', 'somewhat', 'slightly', 'mildly']

very, pretty, extremely, also, still, quite, more, really, not, clearly, fairly, incredibly, particularly, now, understandably, rather, cautiously, surprisingly, certainly, feeling, so, especially, definitely, generally, most, highly, super, reportedly, being, obviously


In [21]:
# Use crossproduct to pair every context with every (question, answer) of the test case
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is {very} {s} about the project. {first_name1} is {s} about the project.',
            '{first_name1} is {s} about the project. {first_name} is {very} {s} about the project.',
            '{first_name} is {s} about the project. {first_name1} is {somewhat} {s} about the project.',
            '{first_name1} is {somewhat} {s} about the project. {first_name} is {s} about the project.',
            '{first_name} is {very} {s} about the project. {first_name1} is {somewhat} {s} about the project.',
            '{first_name1} is {somewhat} {s} about the project. {first_name} is {very} {s} about the project.',
        ],
        'qas': [
            (
                'Who is most {s} about the project?',
                '{first_name}'
            ), 
            (
                'Who is least {s} about the project?',
                '{first_name1}'
            ), 
            
        ]
        
    },
    s = state,
    very=very,
    somewhat=somewhat,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))
name = 'Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?'
desc = ''
test = MFT(**t, name=name, description=desc, capability='Vocabulary')
suite.add(test)


In [22]:
test.run(predconfs, n=100) # run 100 sampled test cases
test.summary(n=3, format_example_fn=format_squad_with_context) # summarize the test and show three example failures


Predicting 1200 examples
Test cases:      499
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Catherine is a little bullish about the project. Adam is bullish about the project.
Q: Who is least bullish about the project?
A: Catherine
P: Adam

C: Catherine is a little bullish about the project. Adam is super bullish about the project.
Q: Who is least bullish about the project?
A: Catherine
P: Adam

C: Catherine is bullish about the project. Adam is super bullish about the project.
Q: Who is least bullish about the project?
A: Catherine
P: Adam


----
C: Jim is upbeat about the project. Lucy is somewhat upbeat about the project.
Q: Who is most upbeat about the project?
A: Jim
P: Lucy

C: Jim is extremely upbeat about the project. Lucy is upbeat about the project.
Q: Who is most upbeat about the project?
A: Jim
P: Lucy

C: Jim is extremely upbeat about the project. Lucy is somewhat upbeat about the project.
Q: Who is most upbeat about the project?
A: Jim
P: Lucy


--

## Taxonomy

### Size, shape, color, age, material

In [23]:
import munch
# Initialize the data
order = ['size', 'shape', 'age', 'color'] # property names to pair up; each one is a key of `properties`
props = []
properties = {
    'color' : ['red', 'blue','yellow', 'green', 'pink', 'white', 'black', 'orange', 'grey', 'purple', 'brown'],
    'size' : ['big', 'small', 'tiny', 'enormous'],
    'age' : ['old', 'new'],
    'shape' : ['round', 'oval', 'square', 'triangular'],
    'material' : ['iron', 'wooden', 'ceramic', 'glass', 'stone']
}

# For every pair of properties in `order`, append each combination of their values to props
for i in range(len(order)):
    for j in range(i + 1, len(order)):
        p1, p2 = order[i], order[j]
        for v1, v2 in itertools.product(properties[p1], properties[p2]):
            props.append(munch.Munch({
                'p1': p1,
                'p2': p2,
                'v1': v1,
                'v2': v2,
            }))
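
As a sanity check on the size of `props`, the nested loops can be restated with `itertools.combinations`. This is a sketch for illustration, not part of the original notebook; it repeats the relevant data so that it runs on its own.

```python
import itertools

order = ['size', 'shape', 'age', 'color']
properties = {
    'color': ['red', 'blue', 'yellow', 'green', 'pink', 'white', 'black',
              'orange', 'grey', 'purple', 'brown'],
    'size': ['big', 'small', 'tiny', 'enormous'],
    'age': ['old', 'new'],
    'shape': ['round', 'oval', 'square', 'triangular'],
}

# combinations(order, 2) visits the same (p1, p2) pairs as the nested i < j loops.
property_pairs = list(itertools.combinations(order, 2))

# Number of value combinations contributed by each property pair.
counts = {
    (p1, p2): len(properties[p1]) * len(properties[p2])
    for p1, p2 in property_pairs
}
total = sum(counts.values())
# 16 + 8 + 44 + 8 + 44 + 22 = 142 entries
```

Note that `material` is defined in `properties` but does not appear in `order`, so it contributes no entries.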


In [24]:
# Suggest nouns for the {mask} slot of the template
print(', '.join(editor.suggest('There is {a:p.v1} {p.v2} {mask} in the room.', p=props, verbose=False)[:30]))

# Objects used in the test cases
objects = ['box', 'clock', 'table', 'object', 'toy', 'painting', 'sculpture', 'thing', 'figure']


sofa, couch, wall, carpet, chair, table, light, lamp, door, clock, mirror, desk, bed, TV, bar, television, window, box, tree, painting, curtain, fan, fridge, screen, wallpaper, piano, rug, shelf, camera, candle


In [25]:
# Taxonomy test case.
# crossproduct pairs every context with every (question, answer).
# Both questions are asked of each context, which contains all the needed information,
# to check that the answer changes correctly when the question changes.
t = crossproduct(editor.template(
    {
        'contexts': [
            'There is {a:p.v1} {p.v2} {obj} in the room.',
            'There is {a:obj} in the room. The {obj} is {p.v1} and {p.v2}.',
        ],
        'qas': [
            (
                'What {p.p1} is the {obj}?',
                '{p.v1}'
            ), 
            (
                'What {p.p2} is the {obj}?',
                '{p.v2}'
            ), 
            
        ]
        
    },
    obj=objects,
    p=props,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))
name = 'size, shape, age, color'
desc = ''
test = MFT(**t, name=name, description=desc, capability='Taxonomy')
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 400 examples
Test cases:      500
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: There is a painting in the room. The painting is oval and yellow.
Q: What shape is the painting?
A: oval
P: oval and yellow

C: There is an oval yellow painting in the room.
Q: What shape is the painting?
A: oval
P: oval yellow

C: There is a painting in the room. The painting is oval and yellow.
Q: What color is the painting?
A: yellow
P: oval and yellow


----
C: There is a toy in the room. The toy is big and pink.
Q: What size is the toy?
A: big
P: big and pink

C: There is a big pink toy in the room.
Q: What size is the toy?
A: big
P: pink


----
C: There is a clock in the room. The clock is tiny and orange.
Q: What size is the clock?
A: tiny
P: tiny and orange

C: There is a tiny orange clock in the room.
Q: What size is the clock?
A: tiny
P: orange


----


### Professions vs nationalities

`editor.suggest(...)` generates suggestions for the masked slot of a template string; here the template contains a first name and a profession.
- `{first_name}` is a placeholder that is filled with a person's first name.
- `{a:mask}` is the slot for a job title. `mask` marks the slot for which suggestions are generated, and the `a:` prefix adds the matching article ("a" or "an").


`[:30]` keeps only the first 30 items of the returned list.


`editor.suggest('{first_name} {last_name} works as {a:mask}.')` is similar to the first call, but the template also includes `{last_name}`. Adding the last name gives a more specific context than the first name alone, so the suggestions can differ. These suggestions are appended to `professions`; because the two lists can overlap, duplicates are removed afterwards.


In [26]:
professions = editor.suggest('{first_name} works as {a:mask}.')[:30]
professions += editor.suggest('{first_name} {last_name} works as {a:mask}.')[:30]
professions = list(set(professions)) # use a set to remove duplicate elements, then convert back into a list
if 'translator' in professions:
    professions.remove('translator') # drop this profession

In [27]:
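# Normalize an answer before comparison: strip leading article/preposition characters and a trailing period.
# Note: str.lstrip treats its argument as a set of characters, not as a list of prefixes.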
def clean(string):
    return string.lstrip('[a,the,an,in,at] ').rstrip('.')
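
Because `str.lstrip` strips any leading run of the given characters, `clean` also eats the start of words such as "teacher". Both the prediction and the label go through the same function, so the comparison still works in most cases. If whole-word stripping is wanted, one possible alternative is a regular expression. The sketch below is an assumption, not the notebook's method, and `clean_prefix` is a hypothetical name; swapping it in could change the pass/fail counts reported in this notebook.

```python
import re

# Matches one leading article or preposition followed by whitespace.
_LEADING_WORD = re.compile(r'^(?:a|an|the|in|at)\s+', flags=re.IGNORECASE)

# Hypothetical alternative to `clean`: removes a whole leading word only.
def clean_prefix(string):
    return _LEADING_WORD.sub('', string.strip()).rstrip('.')

# Behaviour of the character-set based version, for comparison.
charset_result = 'teacher'.lstrip('[a,the,an,in,at] ')
# charset_result == 'cher'

prefix_results = [clean_prefix(s) for s in
                  ['a model', 'an engineer.', 'teacher', 'attorney']]
# prefix_results == ['model', 'engineer', 'teacher', 'attorney']
```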

In [28]:
# Expectation: the cleaned prediction must equal the cleaned label
def expect_squad(x, pred, conf, label=None, meta=None):
    return clean(pred) == clean(label)
expect_squad = Expect.single(expect_squad)

In [29]:
# Profession vs nationality test case.
# crossproduct pairs every context with every (question, answer).
# Checks that the model can tell the profession from the nationality when the question changes.
# In the failing examples, the prediction includes the nationality along with the profession.
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is {a:nat} {prof}.',
            '{first_name} is {a:prof}. {first_name} is {nat}.',
            '{first_name} is {nat}. {first_name} is {a:prof}.',
            '{first_name} is {nat} and {a:prof}.',
            '{first_name} is {a:prof} and {nat}.',
        ],
        'qas': [
            (
                'What is {first_name}\'s job?',
                '{prof}'
            ), 
            (
                'What is {first_name}\'s nationality?',
                '{nat}'
            ), 
            
        ]
        
    },
    nat = editor.lexicons['nationality'][:10],
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    save=True,
    ))
name = 'Profession vs nationality'
test = MFT(**t, name=name, expect=expect_squad, description='',  capability='Taxonomy')
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 1000 examples
Test cases:      500
Test cases run:  100
Fails (rate):    15 (15.0%)

Example fails:
C: Emily is a Pakistani model.
Q: What is Emily's job?
A: model
P: Pakistani model


----
C: Ken is a Russian model.
Q: What is Ken's job?
A: model
P: Russian model


----
C: Philip is an Indian model.
Q: What is Philip's job?
A: model
P: Indian model


----


### Animal vs vehicle

In [30]:
# Animal vs vehicle test case.
# crossproduct pairs every context with every (question, answer).
# Checks that the model can tell the animal from the vehicle when the question changes.
# In the failing examples, the prediction returns both items joined by `and`
# instead of only the animal or only the vehicle.
animals = ['dog', 'cat', 'bull', 'cow', 'fish', 'serpent', 'snake', 'lizard', 'hamster', 'rabbit', 'guinea pig', 'iguana', 'duck']
vehicles = ['car', 'truck', 'train', 'motorcycle', 'bike', 'firetruck', 'tractor', 'van', 'SUV', 'minivan']
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} has {a:animal} and {a:vehicle}.',
            '{first_name} has {a:vehicle} and {a:animal}.',
        ],
        'qas': [
            (
                'What animal does {first_name} have?',
                '{animal}'
            ), 
            (
                'What vehicle does {first_name} have?',
                '{vehicle}'
            ), 
            
        ]
        
    },
    animal=animals,
    vehicle=vehicles,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))
name = 'Animal vs Vehicle'
test = MFT(**t, name=name, description='', capability='Taxonomy', expect=expect_squad)
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test, overwrite=True)


Predicting 400 examples
Test cases:      500
Test cases run:  100
Fails (rate):    47 (47.0%)

Example fails:
C: Jeff has a hamster and a van.
Q: What vehicle does Jeff have?
A: van
P: hamster and a van

C: Jeff has a van and a hamster.
Q: What vehicle does Jeff have?
A: van
P: van and a hamster


----
C: Claire has an iguana and a van.
Q: What vehicle does Claire have?
A: van
P: iguana and a van


----
C: Benjamin has a rabbit and a car.
Q: What vehicle does Benjamin have?
A: car
P: a rabbit and a car

C: Benjamin has a car and a rabbit.
Q: What vehicle does Benjamin have?
A: car
P: a car and a rabbit


----


In [31]:
# Animal vs vehicle test case, version 2.
# crossproduct pairs every context with every (question, answer).
# Checks that the model can tell who bought the animal and who bought the vehicle.
# In the failing examples, the prediction names the wrong buyer.

animals = ['dog', 'cat', 'bull', 'cow', 'fish', 'serpent', 'snake', 'lizard', 'hamster', 'rabbit', 'guinea pig', 'iguana', 'duck']
vehicles = ['car', 'truck', 'train', 'motorcycle', 'bike', 'firetruck', 'tractor', 'van', 'SUV', 'minivan']
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} bought {a:animal}. {first_name2} bought {a:vehicle}.',
            '{first_name2} bought {a:vehicle}. {first_name} bought {a:animal}.',
        ],
        'qas': [
            (
                'Who bought an animal?',
                '{first_name}'
            ), 
            (
                'Who bought a vehicle?',
                '{first_name2}'
            ), 
            
        ]
        
    },
    animal=animals,
    vehicle=vehicles,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))
name = 'Animal vs Vehicle v2'
test = MFT(**t, name=name, description='', capability='Taxonomy', expect=expect_squad)
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test, overwrite=True)

Predicting 400 examples
Test cases:      500
Test cases run:  100
Fails (rate):    71 (71.0%)

Example fails:
C: Betty bought a minivan. Kevin bought a snake.
Q: Who bought an animal?
A: Kevin
P: Betty


----
C: Rose bought a duck. Lauren bought a tractor.
Q: Who bought an animal?
A: Rose
P: Lauren

C: Lauren bought a tractor. Rose bought a duck.
Q: Who bought an animal?
A: Rose
P: Lauren


----
C: George bought a motorcycle. Kim bought a snake.
Q: Who bought an animal?
A: Kim
P: George


----


In [32]:
# Synonyms test case.
# crossproduct pairs every context with every (question, answer).
# The context uses one word and the question asks with its synonym.
# e.g. context: `Lisa is very humble. Jennie is very thankful.` question: `Who is modest?`
# The expected answer is Lisa.

synonyms = [ ('spiritual', 'religious'), ('angry', 'furious'), ('organized', 'organised'),
            ('vocal', 'outspoken'), ('grateful', 'thankful'), ('intelligent', 'smart'),
            ('humble', 'modest'), ('courageous', 'brave'), ('happy', 'joyful'), ('scared', 'frightened'),
           ]

t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is very {s1[0]}. {first_name2} is very {s2[0]}.',
            '{first_name2} is very {s2[0]}. {first_name} is very {s1[0]}.',
        ],
        'qas': [
            (
                'Who is {s1[1]}?',
                '{first_name}'
            ), 
            (
                'Who is {s2[1]}?',
                '{first_name2}'
            ), 
            
        ]
        
    },
    s=synonyms,
    remove_duplicates=True,
    nsamples=250,
    save=True
   ))
t += crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is very {s1[1]}. {first_name2} is very {s2[1]}.',
            '{first_name2} is very {s2[1]}. {first_name} is very {s1[1]}.',
        ],
        'qas': [
            (
                'Who is {s1[0]}?',
                '{first_name}'
            ), 
            (
                'Who is {s2[0]}?',
                '{first_name2}'
            ), 
            
        ]
        
    },
    s=synonyms,
    remove_duplicates=True,
    nsamples=250,
    save=True
    )) 
name = 'Synonyms'
test = MFT(**t, name=name, description='', capability='Taxonomy', expect=expect_squad)
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 400 examples
Test cases:      443
Test cases run:  100
Fails (rate):    13 (13.0%)

Example fails:
C: Chris is very outspoken. Diana is very modest.
Q: Who is vocal?
A: Chris
P: Diana


----
C: Philip is very organized. Nick is very intelligent.
Q: Who is smart?
A: Nick
P: Philip


----
C: Samuel is very happy. Tim is very humble.
Q: Who is joyful?
A: Samuel
P: Tim


----


In [33]:
# Pairs of opposite comparatives
comp_pairs = [('better', 'worse'), ('older', 'younger'), ('smarter', 'dumber'), ('taller', 'shorter'), ('bigger', 'smaller'), ('stronger', 'weaker'), ('faster', 'slower'), ('darker', 'lighter'), ('richer', 'poorer'), ('happier', 'sadder'), ('louder', 'quieter'), ('warmer', 'colder')]
comp_pairs = list(set(comp_pairs))#list(set(comp_pairs + [(x[1], x[0]) for x in comp_pairs]))

In [34]:
# Taxonomy comparison test case.
# crossproduct pairs every context with every (question, answer).
# The context uses one comparative and the question asks with its opposite.

t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is {comp[0]} than {first_name1}.',
            '{first_name1} is {comp[1]} than {first_name}.',
        ],
        'qas': [
            (
                'Who is {comp[1]}?',
                '{first_name1}',
            ),
            (
                'Who is {comp[0]}?',
                '{first_name}',
            )
            
        ]
        ,
    },
    comp=comp_pairs,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))
name = 'A is COMP than B. Who is antonym(COMP)? B'
test = MFT(**t, name=name, description='', capability='Taxonomy')
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 400 examples
Test cases:      498
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Donald is taller than Judith.
Q: Who is shorter?
A: Judith
P: Donald

C: Judith is shorter than Donald.
Q: Who is taller?
A: Donald
P: Judith


----
C: Alfred is colder than Jean.
Q: Who is warmer?
A: Jean
P: Alfred

C: Jean is warmer than Alfred.
Q: Who is colder?
A: Alfred
P: Jean


----
C: Roger is worse than Elaine.
Q: Who is better?
A: Elaine
P: Roger

C: Elaine is better than Roger.
Q: Who is worse?
A: Roger
P: Elaine


----


In [35]:
# Antonym test case: checks whether the model handles more/less combined with antonyms.
# crossproduct pairs every context with every (question, answer).

# e.g. context: `Mary is more hopeful than Julie.` question: `Who is less hopeful?`
# The expected answer is Julie, but the model answers Mary (a failed test case).


antonym_adjs = [('progressive', 'conservative'),('religious', 'secular'),('positive', 'negative'),('defensive', 'offensive'),('rude',  'polite'),('optimistic', 'pessimistic'),('stupid', 'smart'),('negative', 'positive'),('unhappy', 'happy'),('active', 'passive'),('impatient', 'patient'),('powerless', 'powerful'),('visible', 'invisible'),('fat', 'thin'),('bad', 'good'),('cautious', 'brave'), ('hopeful', 'hopeless'),('insecure', 'secure'),('humble', 'proud'),('passive', 'active'),('dependent', 'independent'),('pessimistic', 'optimistic'),('irresponsible', 'responsible'),('courageous', 'fearful')]
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is more {a[0]} than {first_name1}.',
            '{first_name1} is more {a[1]} than {first_name}.',
            '{first_name} is less {a[1]} than {first_name1}.',
            '{first_name1} is less {a[0]} than {first_name}.',
        ],
        'qas': [
            (
                'Who is more {a[0]}?',
                '{first_name}',
            ),
            (
                'Who is less {a[0]}?',
                '{first_name1}',
            ),
            (
                'Who is more {a[1]}?',
                '{first_name1}',
            ),
            (
                'Who is less {a[1]}?',
                '{first_name}',
            ),
        ]
        ,
    },
    a = antonym_adjs,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))
name = 'A is more X than B. Who is more antonym(X)? B. Who is less X? B. Who is more X? A. Who is less antonym(X)? A.'
test = MFT(**t, name=name, description='', capability='Taxonomy')
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 1600 examples
Test cases:      498
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Mary is more hopeful than Julie.
Q: Who is less hopeful?
A: Julie
P: Mary

C: Julie is more hopeless than Mary.
Q: Who is less hopeless?
A: Mary
P: Julie

C: Mary is less hopeless than Julie.
Q: Who is more hopeless?
A: Julie
P: Mary


----
C: Andrew is more positive than Katherine.
Q: Who is less positive?
A: Katherine
P: Andrew

C: Andrew is more positive than Katherine.
Q: Who is more negative?
A: Katherine
P: Andrew

C: Katherine is more negative than Andrew.
Q: Who is less negative?
A: Andrew
P: Katherine


----
C: Mark is less conservative than Sophie.
Q: Who is more conservative?
A: Sophie
P: Mark

C: Mark is more progressive than Sophie.
Q: Who is less progressive?
A: Sophie
P: Mark

C: Mark is more progressive than Sophie.
Q: Who is more conservative?
A: Sophie
P: Mark


----


## Robustness

typos
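
The typos seen in the failing examples of this section (e.g. "inr ecent", "i sthe") look like swaps of two adjacent characters. A minimal sketch of such a perturbation is shown below, using only the standard library. It is an illustration under that assumption, not checklist's actual `Perturb.add_typos` implementation, and `swap_adjacent` is a hypothetical name.

```python
import random

# Swap one randomly chosen pair of adjacent characters.
# Strings shorter than two characters are returned unchanged.
def swap_adjacent(text, rng):
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)

rng = random.Random(0)  # seeded so the sketch is repeatable
question = 'Where in Houston is the University of Houston campus located?'
typo_question = swap_adjacent(question, rng)
```

The perturbed question keeps the same characters and length, so a robust model would be expected to return the same answer for both versions, which is what the invariance test below checks.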

In [36]:
#------------------------NOT USED--------------------------------------------------------
# original code from https://github.com/marcotcr/checklist/blob/115f123de47ab015b2c3a6baebaffb40bab80c9f/notebooks/SQuAD.ipynb

# import pickle
# data, answers =  load_squad()
# spacy_map =  pickle.load(open('/home/marcotcr/tmp/processed_squad.pkl', 'rb'))
# pairs = [(x['passage'], x['question']) for x in data]
# processed_pairs = [(spacy_map[x[0]], spacy_map[x[1]]) for x in pairs]
#--------------------------------------------------------------------------------
# Note: the preprocessed file referenced above could not be found, so the 'squad' dataset from the datasets library is used instead

In [37]:
!pip install datasets



In [38]:
import datasets
# from datasets import load_dataset
dataset = datasets.load_dataset('squad')
pairs = [(x['context'], x['question']) for x in dataset['train']] # use the train split of the squad dataset

In [39]:
# source: https://github.com/marcotcr/checklist/blob/115f123de47ab015b2c3a6baebaffb40bab80c9f/notebooks/QQP.ipynb
# all_questions = list(all_questions)
# parsed_questions = list(nlp.pipe(all_questions))
# spacy_map = dict([(x, y) for x, y in zip(all_questions, parsed_questions)])

# Install the model first: python -m spacy download en_core_web_sm

# spaCy model used to parse the questions and contexts below.
nlp = spacy.load('en_core_web_sm')
all_questions = set() # all unique questions and contexts

# add data in dataset to all_question
for x in dataset['train']:
    all_questions.add(x['question'])
    all_questions.add(x['context'])

#turn question into pipeline then convert to list
parsed_questions = list(nlp.pipe(all_questions)) 

# map the original question and question that turned into list
spacy_map = dict([(x, y) for x, y in zip(all_questions, parsed_questions)])



In [40]:
processed_pairs = [(spacy_map[x[0]], spacy_map[x[1]]) for x in pairs] # look up the parsed spaCy Doc for each (context, question) pair

In [41]:
# sanity check: named entities found in the first context
spacy_map[pairs[0][0]].ents

(Catholic,
 the Main Building's,
 the Virgin Mary.,
 the Main Building,
 Venite Ad Me Omnes,
 the Main Building,
 the Sacred Heart,
 Grotto,
 Marian,
 Lourdes,
 France,
 the Virgin Mary,
 Saint Bernadette Soubirous,
 1858,
 3,
 the Gold Dome,
 Mary)

In [42]:
# Perturbation: add a typo to the question and leave the context unchanged
def question_typo(x):
    """
    x[0]: context
    x[1]: question
    Returns the pair with a typo added to the question by Perturb.add_typos.
    """
    return (x[0], Perturb.add_typos(x[1]))

t = Perturb.perturb(pairs, question_typo, nsamples=500) # perturbed data used to evaluate the robustness of the BERT model to typos
test = INV(**t, name='Question typo', capability='Robustness', description='')
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad) # print 3 example failures
suite.add(test, overwrite=True)

Predicting 200 examples
Test cases:      500
Test cases run:  100
Fails (rate):    12 (12.0%)

Example fails:
Q: Where have the powers maintained peace in recent years?
P: United Nations

Q: Where have the powers maintained peace inr ecent years?
P: United Nations and other forums of international discussion


----
Q: Where in Houston is the University of Houston campus located?
P: southeast

Q: Where in Houston i sthe University of Houston campus located?
P: southeast Houston


----
Q: Who is the only one who has the authority to stop the game when something is wron?
P: referee

Q: Who is the only one who has the authority to stop the game when somethign is wron?
P: the referee


----
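
The failing questions above ("inr ecent", "i sthe") look like two neighbouring characters were swapped. The block below is a minimal sketch of one way to produce such a typo. It is not CheckList's implementation of `Perturb.add_typos`; the helper name `swap_adjacent` and the fixed seed are assumptions made for illustration.

```python
import random

def swap_adjacent(text, rng):
    """Return `text` with one randomly chosen pair of neighbouring characters swapped."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
question = 'Where in Houston is the University of Houston campus located?'
typo_question = swap_adjacent(question, rng)
print(typo_question)
```

A swap keeps the length and the set of characters of the question unchanged, so a robust model would be expected to return the same answer span.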


Contractions

In [43]:
# Perturbation: contract (or expand) contractions in the question
def contractions(x):
    conts = Perturb.contractions(x[1])
    return [(x[0], a) for a in conts]
t = Perturb.perturb(pairs, contractions, nsamples=500)
test = INV(**t, name='Question contractions', capability='Robustness', description='')
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad)
suite.add(test)

Predicting 200 examples
Test cases:      500
Test cases run:  100
Fails (rate):    7 (7.0%)

Example fails:
Q: What are some of the most widely known poultry tournaments?
P: national and regional poultry shows

Q: What're some of the most widely known poultry tournaments?
P: National Championship Show


----
Q: Where did Japanese warriors come to literary maturity?
P: the Heike Monogatari

Q: Where'd Japanese warriors come to literary maturity?
P: Heike Monogatari


----
Q: Who is Martha Ann Ricks?
P: One of the most well-known Liberian quilters

Q: Who's Martha Ann Ricks?
P: Ellen Johnson Sirleaf


----
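
To make the perturbation above concrete, here is a small sketch that contracts a few question openers with a regular expression. The table `CONTRACTIONS` and the function `contract` are hypothetical and much smaller than what `Perturb.contractions` covers; they only illustrate the kind of rewrite shown in the example failures.

```python
import re

# Hypothetical, deliberately tiny table; a real perturbation would cover many more forms.
CONTRACTIONS = {
    'what are': "what're",
    'where did': "where'd",
    'who is': "who's",
}

def contract(question):
    """Return contracted variants of `question` (empty list if nothing matches)."""
    variants = []
    for full, short in CONTRACTIONS.items():
        pattern = re.compile(r'\b%s\b' % re.escape(full), flags=re.IGNORECASE)
        match = pattern.search(question)
        if match:
            # Keep the capitalisation of the first letter of the matched text.
            replacement = match.group(0)[0] + short[1:]
            variants.append(question[:match.start()] + replacement + question[match.end():])
    return variants

print(contract('Who is Martha Ann Ricks?'))
```

Returning a list mirrors the `contractions` wrapper in the cell above, which builds one perturbed (context, question) pair per variant.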


Add random sentence

In [44]:
# collect candidate sentences to use as random distractors
random_sentences = set()

# gather every sentence from the parsed contexts, then store them as a list
for x, _ in processed_pairs:
    for y in x.sents:
        random_sentences.add(y.text)
random_sentences = list(random_sentences)

In [45]:
# sanity check: sentences found in the second question
for y in spacy_map[pairs[1][1]].sents:
    print(y.text)

What is in front of the Notre Dame Main Building?


In [46]:
len(random_sentences)

92328

In [47]:
# Perturbation: add a random sentence to the end or the beginning of the context
def add_random_sentence(x, **kwargs):
    random_s = np.random.choice(random_sentences)
    while random_s in x[0]:
        random_s = np.random.choice(random_sentences)
    random_s = random_s.strip('.') + '. '
    meta = ['add to end: %s' % random_s, 'add to beg: %s' % random_s]
    return [(x[0] + random_s, x[1]), (random_s + x[0], x[1])], meta

# format an example and append a description of the perturbation
def format_add(x, pred, conf, label=None, meta=None):
    ret = format_squad(x, pred, conf, label, meta)
    if meta:
        ret += 'Perturb: %s\n' % meta
    return ret

# INV test: adding a random sentence to the context should not change the answer
t = Perturb.perturb(pairs, add_random_sentence, nsamples=500, meta=True)
test = INV(**t, name='Add random sentence to context', capability='Robustness', description='')
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_add)
suite.add(test)

Predicting 300 examples
Test cases:      500
Test cases run:  100
Fails (rate):    16 (16.0%)

Example fails:
Q: What did the Chinese media focus on as far as human rights protesters?
P: disruptive protesters

Q: What did the Chinese media focus on as far as human rights protesters?
P: more disruptive protesters
Perturb: add to beg: In 1992, Gomes directed Udju Azul di Yonta, which was screened in the Un Certain Regard section at the 1992 Cannes Film Festival. 


----
Q: What began in 1254 
P: The construction of the present Gothic building was begun in 1254

Q: What began in 1254 
P: The construction of the present Gothic building
Perturb: add to beg: The term Muslim world, also known as Islamic world and the Ummah (Arabic: أمة‎, meaning "nation" or "community") has different meanings. 


----
Q: John remains a recurring character within what culture?
P: Western

Q: John remains a recurring character within what culture?
P: Western popular culture
Perturb: add to beg: Downtown Boston'
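
The three robustness tests above are invariance tests: each test case holds the original input plus its perturbed versions (here 100 test cases gave 300 examples, one original and two perturbed each). The sketch below shows the pass/fail logic under a strict assumption, namely that a test case fails as soon as any perturbed prediction differs from the original one. The function `inv_fails` and the toy predictions are illustrative and are not taken from CheckList or from the model output.

```python
def inv_fails(predictions):
    """predictions[0] is the answer for the original input; the rest are for perturbed inputs.

    Under the strict invariance assumption used here, the test case fails
    if any perturbed prediction differs from the original one.
    """
    original, perturbed = predictions[0], predictions[1:]
    return any(p != original for p in perturbed)

# Toy predictions for two test cases (original, add to end, add to beginning).
test_cases = [
    ['Western', 'Western', 'Western popular culture'],  # one perturbed answer changed
    ['referee', 'referee', 'referee'],                  # stable under perturbation
]
failures = [inv_fails(case) for case in test_cases]
fail_rate = sum(failures) / len(failures)
print(failures, fail_rate)
```

Because the comparison is on the raw answer string, a change in span boundaries alone (for example "Western" versus "Western popular culture") already counts as a failure.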

## NER

In [48]:
import re
# apply change_fn to the context and make the same replacement in the question
def change_thing(change_fn):
    def change_both(cq, **kwargs):
        context, question = cq
        a = change_fn(context, meta=True)
        if not a:
            return None
        changed, meta = a
        ret = []
        for c, m in zip(changed, meta):
            new_q = re.sub(r'\b%s\b' % re.escape(m[0]), m[1], question.text) # apply the same substitution to the question text
            ret.append((c, new_q))
        return ret, meta
    return change_both
            

In [49]:
# Expectation and formatting helpers for the NER tests
def expect_same(orig_pred, pred, orig_conf, conf, labels=None, meta=None):
    if not meta:
        return pred == orig_pred
    return pred == re.sub(r'\b%s\b' % re.escape(meta[0]), meta[1], orig_pred)

def format_replace(x, pred, conf, label=None, meta=None):
    ret = format_squad(x, pred, conf, label, meta)
    if meta:
        ret += 'Perturb: %s -> %s\n' % meta
    return ret

def format_replace_context(x, pred, conf, label=None, meta=None):
    ret = format_squad_with_context(x, pred, conf, label, meta)
    if meta:
        ret += 'Perturb: %s -> %s\n' % meta
    return ret

In [50]:
# Replace person names in both the context and the question.
# The summary shows each replacement in the form `Perturb: %s -> %s`.
t = Perturb.perturb(processed_pairs, change_thing(Perturb.change_names), nsamples=500, meta=True)

test = INV(**t, name='Change name everywhere', capability='NER',
          description='', expect=Expect.pairwise(expect_same))
test.run(predconfs, n=100)
test.summary(3, format_example_fn=format_replace)
suite.add(test, overwrite=True)

Predicting 1100 examples
Test cases:      500
Test cases run:  100
Fails (rate):    1 (1.0%)

Example fails:
Q: What was the subject under consideration in discussions in 1980?
P: patriation of the Canadian constitution

Q: What was the subject under consideration in discussions in 1980?
P: Canadian constitution
Perturb: Queen -> Alicia

Q: What was the subject under consideration in discussions in 1980?
P: the patriation of the Canadian constitution
Perturb: Pierre Trudeau -> William Jones


----
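
The `expect_same` helper used by this test accepts a perturbed prediction only if it equals the original prediction with the same name substitution applied. The example below copies that logic so it runs on its own and applies it to two cases: a made-up one where the answer is the replaced name, and the span change from the failure shown above. The confidence values `0.9` are placeholders.

```python
import re

def expect_same(orig_pred, pred, orig_conf, conf, labels=None, meta=None):
    # Same logic as the notebook's helper, copied so the example is self-contained.
    if not meta:
        return pred == orig_pred
    return pred == re.sub(r'\b%s\b' % re.escape(meta[0]), meta[1], orig_pred)

# Made-up case: the answer is the replaced name, so the prediction should follow it.
ok = expect_same('Pierre Trudeau', 'William Jones', 0.9, 0.9,
                 meta=('Pierre Trudeau', 'William Jones'))

# From the failure above: the name is not in the answer, but the span boundary moved.
bad = expect_same('patriation of the Canadian constitution',
                  'Canadian constitution', 0.9, 0.9,
                  meta=('Queen', 'Alicia'))
print(ok, bad)
```

The second case shows why a test can fail even when the perturbed answer is arguably still correct: the check is an exact string comparison after substitution.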


In [51]:
# Replace location names in both the context and the question.
# The summary shows each replacement in the form `Perturb: %s -> %s`.
t = Perturb.perturb(processed_pairs, change_thing(Perturb.change_location), nsamples=500, meta=True)

test = INV(**t, name='Change location everywhere', capability='NER',
          description='', expect=Expect.pairwise(expect_same))
test.run(predconfs, n=100)
test.summary(3, format_example_fn=format_replace)
suite.add(test, overwrite=True)

Predicting 1100 examples
Test cases:      500
Test cases run:  100
Fails (rate):    4 (4.0%)

Example fails:
Q: What roles was George Warren Brown known?
P: a St. Louis philanthropist and co-founder of the Brown Shoe Company

Q: What roles was George Warren Brown known?
P: philanthropist and co-founder of the Brown Shoe Company
Perturb: St. Louis -> Lexington-Fayette

Q: What roles was George Warren Brown known?
P: San Diego philanthropist and co-founder of the Brown Shoe Company
Perturb: St. Louis -> San Diego


----
Q: What are some of the target countries?
P: Bangladesh, Brazil, China, Egypt, India, Indonesia, Mexico, Nigeria

Q: What are some of the target countries?
P: Bangladesh, Brazil, Japan, Egypt
Perturb: China -> Japan

Q: What are some of the target countries?
P: Bangladesh, Brazil, China, Egypt
Perturb: Nigeria -> France


----
Q: What is the 3rd oldest paper in the nation?
P: Philadelphia Inquirer

Q: What is the 3rd oldest paper in the nation?
P: The Anaheim Inquirer
Per

## Temporal

In [52]:
# Temporal test: a change of profession.
# crossproduct pairs every context with every question.
# Both people start with the same profession; only {first_name} changes to a new one.
t = crossproduct(editor.template(
    {
        'contexts': [
            'Both {first_name} and {first_name2} were {prof1}s, but there was a change in {first_name}, who is now {a:prof2}.',
            'Both {first_name2} and {first_name} were {prof1}s, but there was a change in {first_name}, who is now {a:prof2}.',
        ],
        'qas': [
            (
                'Who is {a:prof2}?',
                '{first_name}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    ))
name = 'There was a change in profession'
test = MFT(**t, expect=expect_squad, capability='Temporal', name=name, description='' )
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 200 examples


Test cases:      480
Test cases run:  100
Fails (rate):    0 (0.0%)
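
The `crossproduct` helper is defined earlier in the notebook and is not shown in this section. Judging from the counts above (100 test cases with 2 contexts and 1 question each gave 200 examples), it pairs every context with every question. The sketch below illustrates that idea on a plain dictionary; the function name `cross_contexts_and_qas` and the filled-in names are assumptions, and the real helper operates on the object returned by `editor.template`.

```python
import itertools

def cross_contexts_and_qas(sample):
    """Pair every context with every (question, answer) tuple of one filled template."""
    pairs = list(itertools.product(sample['contexts'], sample['qas']))
    inputs = [(context, question) for context, (question, _) in pairs]
    labels = [answer for _, (_, answer) in pairs]
    return inputs, labels

# Hypothetical filled template with made-up names and professions.
sample = {
    'contexts': [
        'Both Ann and Tom were editors, but there was a change in Ann, who is now a nurse.',
        'Both Tom and Ann were editors, but there was a change in Ann, who is now a nurse.',
    ],
    'qas': [('Who is a nurse?', 'Ann')],
}
inputs, labels = cross_contexts_and_qas(sample)
print(len(inputs), labels)
```

Grouping all pairs of one filled template into a single test case means the test case fails if any of its context and question combinations is answered incorrectly.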


In [53]:
# Test whether the model can map before/after in the context to first/last in the question

t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} became a {prof} before {first_name2} did.',
            '{first_name2} became a {prof} after {first_name} did.',
        ],
        'qas': [
            (
                'Who became a {prof} first?',
                '{first_name}'
            ), 
            (
                'Who became a {prof} last?',
                '{first_name2}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    ))
name = 'Understanding before / after -> first / last.'
test = MFT(**t, expect=expect_squad, capability='Temporal', name=name, description='' )
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)


Predicting 400 examples
Test cases:      497
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Laura became a waitress before Roy did.
Q: Who became a waitress last?
A: Roy
P: Laura

C: Roy became a waitress after Laura did.
Q: Who became a waitress first?
A: Laura
P: Roy


----
C: Paul became a editor before Alexandra did.
Q: Who became a editor last?
A: Alexandra
P: Paul

C: Alexandra became a editor after Paul did.
Q: Who became a editor first?
A: Paul
P: Alexandra


----
C: Daniel became a agent before Bobby did.
Q: Who became a agent last?
A: Bobby
P: Daniel

C: Bobby became a agent after Daniel did.
Q: Who became a agent first?
A: Daniel
P: Bobby


----


## Negation

In context

In [54]:
# Negation test.
# crossproduct pairs every context with every question.
# The negation (not) appears in the context.
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is not {a:prof}. {first_name2} is.',
            '{first_name2} is {a:prof}. {first_name} is not.',
        ],
        'qas': [
            (
                'Who is {a:prof}?',
                '{first_name2}'
            ), 
            (
                'Who is not {a:prof}?',
                '{first_name}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    ))
name = 'Negation in context, may or may not be in question'
test = MFT(**t, expect=expect_squad, capability='Negation', name=name, description='' )
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 400 examples
Test cases:      497
Test cases run:  100
Fails (rate):    94 (94.0%)

Example fails:
C: Mike is not an attorney. Tim is.
Q: Who is an attorney?
A: Tim
P: Mike

C: Tim is an attorney. Mike is not.
Q: Who is not an attorney?
A: Mike
P: Tim


----
C: Michelle is not a historian. Ed is.
Q: Who is a historian?
A: Ed
P: Michelle


----
C: Steven is not a waitress. Joan is.
Q: Who is a waitress?
A: Joan
P: Steven


----


Not in context:

In [55]:
# Negation test.
# crossproduct pairs every context with every question.
# The negation (not) appears only in the question.
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} is {a:prof}. {first_name2} is {a:prof2}.',
            '{first_name2} is {a:prof2}. {first_name} is {a:prof}.',
        ],
        'qas': [
            (
                'Who is {a:prof}?',
                '{first_name}'
            ), 
            (
                'Who is not {a:prof}?',
                '{first_name2}'
            ), 
            (
                'Who is {a:prof2}?',
                '{first_name2}'
            ), 
            (
                'Who is not {a:prof2}?',
                '{first_name}'
            ), 
        ]
        
    },
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    ))
name = 'Negation in question only.'
test = MFT(**t, expect=expect_squad, capability='Negation', name=name, description='' )
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 800 examples
Test cases:      479
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Anne is a producer. Suzanne is an interpreter.
Q: Who is not a producer?
A: Suzanne
P: Anne

C: Anne is a producer. Suzanne is an interpreter.
Q: Who is not an interpreter?
A: Anne
P: Suzanne

C: Suzanne is an interpreter. Anne is a producer.
Q: Who is not a producer?
A: Suzanne
P: Anne


----
C: Florence is a producer. Annie is an economist.
Q: Who is not a producer?
A: Annie
P: Florence

C: Florence is a producer. Annie is an economist.
Q: Who is not an economist?
A: Florence
P: Annie

C: Annie is an economist. Florence is a producer.
Q: Who is not a producer?
A: Annie
P: Florence


----
C: Carol is a journalist. Alice is an engineer.
Q: Who is not a journalist?
A: Alice
P: Carol

C: Carol is a journalist. Alice is an engineer.
Q: Who is not an engineer?
A: Carol
P: Alice

C: Alice is an engineer. Carol is a journalist.
Q: Who is not a journalist?
A: Alice
P: Carol


---

## Fairness spinoff

In [56]:
import collections

# Fairness test
# uses negation in the context, with a small set of professions
fewer_profs = ['doctor', 'nurse', 'secretary', 'CEO']
t = editor.template(
    [
        ('{male} is not {a:prof}, {female} is.', 'Who is {a:prof}?', '{female}', 'woman', '{prof}'),
        ('{female} is not {a:prof}, {male} is.', 'Who is {a:prof}?', '{male}', 'man', '{prof}'),
    ],
#     prof=professions + ['doctor'],
    prof=fewer_profs,
    remove_duplicates=True,
    nsamples=1000,
    unroll=True,
    save=True,
    )
data = [(d[0], d[1]) for d in t.data]
labels = [d[2] for d in t.data]
meta = [(d[3], d[4]) for d in t.data]

test = MFT(data, expect=expect_squad, labels=labels, meta=meta, templates=t.templates,
          name='M/F failure rates should be similar for different professions', capability='Fairness',
          description='Using negation in context.')
test.run(predconfs, n=100)

# print failure rates per profession, split by gender
def print_fair(test):
    c = collections.Counter(test.meta)
    fail = collections.Counter([tuple(x) for x in np.array(test.meta)[test.fail_idxs()]])
    profs = set()
    for sex, prof in fail:
        profs.add(prof)
    prof_fail = {}
    get_fail = lambda f:fail[f] / c[f]
    for prof in profs:
        fail_m = get_fail(('man', prof))
        fail_f = get_fail(('woman', prof))
        prof_fail[prof] = (fail_m, fail_f)
    print('%-13s fail_men fail_women (count)' % 'profession')
    for prof, vs in sorted(prof_fail.items(), key=lambda x:max(x[1][0], x[1][1]), reverse=True):
        fail_m, fail_f = vs
        print('%-13s   %.1f      %.1f     (%d)' % (prof, 100 * fail_m, 100 * fail_f, c[('man', prof)]))
print_fair(test)

# add test to the suite
suite.add(test)

Predicting 100 examples
profession    fail_men fail_women (count)
doctor          5.7      5.7     (247)
secretary       4.0      4.8     (249)
CEO             4.0      4.0     (248)
nurse           2.7      3.5     (256)
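
The table above is a per-group failure rate: failures divided by the number of test cases for each (gender, profession) pair. The sketch below reproduces that computation on toy data with `collections.Counter`. The function `failure_rates` and the inputs are illustrative and do not come from the model run.

```python
import collections

def failure_rates(meta, failed):
    """meta: list of (group, profession); failed: parallel list of booleans."""
    total = collections.Counter(meta)
    fails = collections.Counter(m for m, f in zip(meta, failed) if f)
    return {key: fails[key] / total[key] for key in total}

# Toy data, not results from the model.
meta = [('man', 'doctor'), ('man', 'doctor'), ('woman', 'doctor'), ('woman', 'doctor')]
failed = [True, False, False, False]
rates = failure_rates(meta, failed)
print(rates)
```

Unlike `print_fair`, which only lists professions that have at least one failure, this sketch reports every group, including those with a rate of zero.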


## Coref

Basic coref

In [58]:
# This test checks basic he/she coreference.
# The context says that a male and a female are friends, then states that he is {a:prof1} and she is {a:prof2}.
# The questions ask who has each profession, so the model must resolve "he" and "she" to the right names.
t = crossproduct(editor.template(
    {
        'contexts': [
            '{male} and {female} are friends. He is {a:prof1}, and she is {a:prof2}.',
            '{female} and {male} are friends. He is {a:prof1}, and she is {a:prof2}.',
            '{male} and {female} are friends. She is {a:prof2}, and he is {a:prof1}.',
            '{female} and {male} are friends. She is {a:prof2}, and he is {a:prof1}.',
        ],
        'qas': [
            (
                'Who is {a:prof1}?',
                '{male}'
            ), 
            (
                'Who is {a:prof2}?',
                '{female}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    ))
name = 'Basic coref, he / she'
test = MFT(**t, expect=expect_squad, name=name, description='', capability='Coref')
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 800 examples
Test cases:      489
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Richard and Emma are friends. He is an organizer, and she is an administrator.
Q: Who is an organizer?
A: Richard
P: Richard and Emma

C: Richard and Emma are friends. He is an organizer, and she is an administrator.
Q: Who is an administrator?
A: Emma
P: Richard and Emma

C: Emma and Richard are friends. He is an organizer, and she is an administrator.
Q: Who is an organizer?
A: Richard
P: Emma and Richard


----
C: Roy and Nancy are friends. He is an economist, and she is a musician.
Q: Who is an economist?
A: Roy
P: Roy and Nancy

C: Roy and Nancy are friends. He is an economist, and she is a musician.
Q: Who is a musician?
A: Nancy
P: Roy and Nancy

C: Nancy and Roy are friends. He is an economist, and she is a musician.
Q: Who is an economist?
A: Roy
P: Nancy and Roy


----
C: Larry and Jane are friends. He is a secretary, and she is an escort.
Q: Who is a secretary?


In [59]:
# Basic his/her coreference test.
# Example failure:
# C: Patrick and Charlotte are friends. His mom is an accountant.
# Q: Whose mom is an accountant?
# A: Patrick
# P: Patrick and Charlotte
# In my opinion, adding chain-of-thought reasoning might reduce the failure rate.
t = crossproduct(editor.template(
    {
        'contexts': [
            '{male} and {female} are friends. His mom is {a:prof}.',
            '{female} and {male} are friends. His mom is {a:prof}.',
        ],
        'qas': [
            (
                'Whose mom is {a:prof}?',
                '{male}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=250,
    ))
t += crossproduct(editor.template(
    {
        'contexts': [
            '{male} and {female} are friends. Her mom is {a:prof}.',
            '{female} and {male} are friends. Her mom is {a:prof}.',
        ],
        'qas': [
            (
                'Whose mom is {a:prof}?',
                '{female}'
            ), 
        ]
        
    },
    save=True,
    prof=professions,
    remove_duplicates=True,
    nsamples=250,
    ))

name = 'Basic coref, his / her'
test = MFT(**t, expect=expect_squad, name=name, description='', capability='Coref')
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 200 examples
Test cases:      500
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Patrick and Charlotte are friends. His mom is an accountant.
Q: Whose mom is an accountant?
A: Patrick
P: Patrick and Charlotte

C: Charlotte and Patrick are friends. His mom is an accountant.
Q: Whose mom is an accountant?
A: Patrick
P: Charlotte and Patrick


----
C: Ralph and Ann are friends. His mom is an entrepreneur.
Q: Whose mom is an entrepreneur?
A: Ralph
P: Ralph and Ann

C: Ann and Ralph are friends. His mom is an entrepreneur.
Q: Whose mom is an entrepreneur?
A: Ralph
P: Ann and Ralph


----
C: Ralph and Pamela are friends. Her mom is a waitress.
Q: Whose mom is a waitress?
A: Pamela
P: Ralph and Pamela

C: Pamela and Ralph are friends. Her mom is a waitress.
Q: Whose mom is a waitress?
A: Pamela
P: Pamela and Ralph


----
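
In the coreference failures above the model returns both names ("Patrick and Charlotte") while the label is a single name. The sketch below shows why such an answer counts as a failure under an exact-match expectation. The real `expect_squad` is defined earlier in the notebook and may normalize differently; the functions `normalize` and `exact_match` and their rules are assumptions made for illustration.

```python
def normalize(answer):
    """Lower-case, trim, and drop a trailing period and one leading article (assumed rules)."""
    answer = answer.strip().rstrip('.').lower()
    for article in ('the ', 'a ', 'an '):
        if answer.startswith(article):
            answer = answer[len(article):]
            break
    return answer

def exact_match(prediction, label):
    """True only if prediction and label are identical after normalization."""
    return normalize(prediction) == normalize(label)

print(exact_match('Patrick and Charlotte', 'Patrick'))
print(exact_match('Patrick.', 'patrick'))
```

An exact-match check gives no partial credit, so a span that contains the right name together with the wrong one is scored the same as a completely wrong answer.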


Former, latter

In [60]:
# This test checks whether the model resolves "the former" (first mentioned) and "the latter" (second mentioned)
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} and {first_name2} are friends. The former is {a:prof1}.',
            '{first_name2} and {first_name} are friends. The latter is {a:prof1}.',
            '{first_name} and {first_name2} are friends. The former is {a:prof1} and the latter is {a:prof2}.',
            '{first_name2} and {first_name} are friends. The former is {a:prof2} and the latter is {a:prof1}.',
        ],
        'qas': [
            (
                'Who is {a:prof1}?',
                '{first_name}'
            ), 
        ]
        
    },
    prof=professions,
    remove_duplicates=True,
    nsamples=500,
    save=True
    ))
name = 'Former / Latter'
test = MFT(**t, expect=expect_squad, name=name, description='', capability='Coref')
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 400 examples
Test cases:      481
Test cases run:  100
Fails (rate):    99 (99.0%)

Example fails:
C: Sarah and Harriet are friends. The former is an agent.
Q: Who is an agent?
A: Sarah
P: Sarah and Harriet

C: Harriet and Sarah are friends. The latter is an agent.
Q: Who is an agent?
A: Sarah
P: Harriet and Sarah

C: Sarah and Harriet are friends. The former is an agent and the latter is an economist.
Q: Who is an agent?
A: Sarah
P: Sarah and Harriet


----
C: Simon and Michael are friends. The former is an assistant.
Q: Who is an assistant?
A: Simon
P: Simon and Michael

C: Michael and Simon are friends. The latter is an assistant.
Q: Who is an assistant?
A: Simon
P: Michael and Simon

C: Simon and Michael are friends. The former is an assistant and the latter is an entrepreneur.
Q: Who is an assistant?
A: Simon
P: Simon and Michael


----
C: Sharon and Linda are friends. The former is an attorney.
Q: Who is an attorney?
A: Sharon
P: Sharon and Linda

C: Linda and Sharon a

## SRL

In [62]:
import pattern
import pattern.en
pverb = ['love', 'hate', 'like', 'remember', 'recognize', 'trust', 'deserve', 'understand', 'blame', 'dislike', 'prefer', 'follow', 'notice', 'hurt', 'bother', 'support', 'believe', 'accept', 'attack']
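# a and b are the tense descriptions of 'loves' and 'stolen'; they are reused to conjugate every verb
a = pattern.en.tenses('loves')[0]
b = pattern.en.tenses('stolen')[0]
# each entry becomes a pair of forms, used as {v[0]} (active, e.g. 'hates') and {v[1]} (passive, e.g. 'hated')
pverb = [(pattern.en.conjugate(v, *a), pattern.en.conjugate(v, *b)) for v in pverb]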

# Test the agent/object distinction. Each test case has an active-voice context and a passive-voice context, and the questions ask about both the agent and the object.
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} {v[0]} {first_name2}.',
            '{first_name2} is {v[1]} by {first_name}.',
        ],
        'qas': [
            (
                'Who {v[0]}?',
                '{first_name}'
            ), 
            (
                'Who is {v[1]}?',
                '{first_name2}'
            ), 
        ]
        
    },
    v=pverb,
    remove_duplicates=True,
    nsamples=500,
    ))
name = 'Agent / object distinction'
test = MFT(**t, expect=expect_squad, name=name, description='', capability='SRL')
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)

Predicting 400 examples
Test cases:      497
Test cases run:  100
Fails (rate):    68 (68.0%)

Example fails:
C: Mary hates Robin.
Q: Who is hated?
A: Robin
P: Mary


----
C: Philip deserves Florence.
Q: Who is deserved?
A: Florence
P: Philip deserves Florence


----
C: Mary bothers Paul.
Q: Who bothers?
A: Mary
P: Paul

C: Paul is bothered by Mary.
Q: Who bothers?
A: Mary
P: Paul


----
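
The template above fills `{v[0]}` with the active form of a verb and `{v[1]}` with its participle. The sketch below builds the same contexts and questions with plain string formatting, so the structure of one test case is easy to see. The hand-written `verb_forms` list and the function `agent_object_cases` are hypothetical stand-ins for the `pattern.en` conjugation and the CheckList template.

```python
# Hypothetical hand-written (third-person present, past participle) pairs.
verb_forms = [('loves', 'loved'), ('hates', 'hated'), ('remembers', 'remembered')]

def agent_object_cases(agent, patient, verb):
    """Build active and passive contexts plus the two questions for one verb pair."""
    present, participle = verb
    contexts = [
        '%s %s %s.' % (agent, present, patient),
        '%s is %s by %s.' % (patient, participle, agent),
    ]
    qas = [
        ('Who %s?' % present, agent),
        ('Who is %s?' % participle, patient),
    ]
    return contexts, qas

contexts, qas = agent_object_cases('Mary', 'Robin', verb_forms[1])
print(contexts)
print(qas)
```

Both contexts describe the same event, so the correct answer to each question does not depend on which voice the context uses.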


In [63]:
# Test the agent/object distinction with three people.
# Each context has two sentences, and each sentence is in either active or passive voice (all four combinations).
# The questions ask about each relation in both active and passive form.
# The failures suggest that the model does not understand the roles and only matches surface patterns between
# context and question: when the sentence order or voice differs, it tends to pick the wrong person.
t = crossproduct(editor.template(
    {
        'contexts': [
            '{first_name} {v[0]} {first_name2}. {first_name2} {v[0]} {first_name3}.',
            '{first_name} {v[0]} {first_name2}. {first_name3} is {v[1]} by {first_name2}.',
            '{first_name2} is {v[1]} by {first_name}. {first_name2} {v[0]} {first_name3}.',
            '{first_name2} is {v[1]} by {first_name}. {first_name3} is {v[1]} by {first_name2}.',
        ],
        'qas': [
            (
                'Who {v[0]} {first_name2}?',
                '{first_name}'
            ), 
            (
                'Who {v[0]} {first_name3}?',
                '{first_name2}'
            ), 
            (
                'Who is {v[1]} by {first_name}?',
                '{first_name2}'
            ), 
            (
                'Who is {v[1]} by {first_name2}?',
                '{first_name3}'
            ), 
        ]
        
    },
    save=True,
    v=pverb,
    remove_duplicates=True,
    nsamples=500,
    ))
name = 'Agent / object distinction with 3 agents'
test = MFT(**t, expect=expect_squad, name=name, description='', capability='SRL')
test.run(predconfs, n=100)
test.summary(n=3, format_example_fn=format_squad_with_context)
suite.add(test)


Predicting 1600 examples
Test cases:      496
Test cases run:  100
Fails (rate):    100 (100.0%)

Example fails:
C: Bob remembers Steve. Sarah is remembered by Steve.
Q: Who remembers Sarah?
A: Steve
P: Bob

C: Bob remembers Steve. Sarah is remembered by Steve.
Q: Who is remembered by Bob?
A: Steve
P: Sarah

C: Steve is remembered by Bob. Sarah is remembered by Steve.
Q: Who is remembered by Bob?
A: Steve
P: Sarah


----
C: Rachel hates Amanda. Amanda hates Roy.
Q: Who is hated by Rachel?
A: Amanda
P: Roy

C: Rachel hates Amanda. Roy is hated by Amanda.
Q: Who hates Roy?
A: Amanda
P: Rachel

C: Rachel hates Amanda. Roy is hated by Amanda.
Q: Who is hated by Rachel?
A: Amanda
P: Roy


----
C: Larry loves Melissa. Bobby is loved by Melissa.
Q: Who loves Bobby?
A: Melissa
P: Larry

C: Larry loves Melissa. Bobby is loved by Melissa.
Q: Who is loved by Larry?
A: Melissa
P: Bobby

C: Melissa is loved by Larry. Melissa loves Bobby.
Q: Who is loved by Melissa?
A: Bobby
P: Larry


----


In [64]:
path = 'squad_suite.pkl' # output path for the test suite
suite.save(path) # save the whole test suite to that path