# QQP
This notebook prepares the Quora Question Pairs (QQP) test suite. After building the suite, it saves it to the `qqp_suite.pkl` file, which is used to run the test cases in `test-qqp-bert` and `test-qqp-roberta`. QQP is a binary task: given two questions, the model decides whether they are duplicates or not. The suite contains several kinds of tests. For example, one Vocabulary test generates question pairs from templates with the placeholders `first_name`, `last_name`, `adj`, and `noun`; the expected label is non-duplicate (0), so a pair that the model labels as 1 counts as a failed test case. The capabilities tested for this NLP task are Vocabulary, Taxonomy, Robustness, NER (named entity recognition), Temporal, Negation, Coreference, SRL (semantic role labeling), and Logic. Fairness is not covered.



Note:
- MFT (Minimum Functionality Test): checks whether a model has a basic capability by comparing its predictions on simple examples against known labels.
- DIR (Directional Expectation test): checks whether a model's predictions change in the expected direction after the input is modified.
- INV (Invariance test): checks whether a model's predictions stay the same under transformations of the input that should not affect the label.



ref:
- https://www.godeltech.com/how-to-automate-the-testing-process-for-machine-learning-systems/
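
The three test types in the note above can be sketched with a toy predictor instead of a real model. The names `toy_predict`, `mft_fails`, `inv_fails`, and `dir_fails` are hypothetical and not part of CheckList; they only illustrate the pass/fail logic under that assumption.

```python
# Toy stand-in for a QQP model: returns 1 (duplicate) when both questions
# contain the same set of lowercase words, else 0. Purely illustrative.
def toy_predict(q1, q2):
    return int(set(q1.lower().split()) == set(q2.lower().split()))

def mft_fails(pair, expected_label):
    """MFT: the prediction must equal a fixed, known label."""
    return toy_predict(*pair) != expected_label

def inv_fails(original, perturbed):
    """INV: the prediction must not change after a label-preserving edit."""
    return toy_predict(*original) != toy_predict(*perturbed)

def dir_fails(perturbed, expected_after):
    """DIR: after the edit, the prediction must match the expected value."""
    return toy_predict(*perturbed) != expected_after

pair = ("Is Mary Smith a nurse?", "Is Mary Smith a good nurse?")
# The word sets differ, so the toy prediction is 0 and the MFT does not fail.
print(mft_fails(pair, expected_label=0))
```

The only difference between the three checks is what the prediction is compared against: a fixed label, the prediction on the unperturbed input, or an expected value after the change.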

Whether a test case passes or fails depends on the `labels` argument given in a line like this:
```test = MFT(**t, labels=0, name=name, capability = 'Vocabulary',description=desc)```
P.S. How the expectation is specified depends on the test type (MFT, DIR, or INV) selected.

For QQP:
- 1 means duplicate
- 0 means not duplicate

## Installation

In [None]:
# `re` and `itertools` are part of the Python standard library and do not need to be installed
!pip install checklist

In [1]:
%load_ext autoreload
%autoreload 2

import checklist
import spacy
import itertools

import checklist.editor
import checklist.text_generation
from checklist.test_types import MFT, INV, DIR
from checklist.expect import Expect
from checklist.test_suite import TestSuite
import numpy as np
from checklist.perturb import Perturb

In [2]:
editor = checklist.editor.Editor()  # create an instance of the Editor class
editor.tg # access the text generator used to suggest fill-ins (the language model is loaded on first use)


<checklist.text_generation.TextGenerator at 0x1ce4be9c1d0>

In [4]:
# spaCy model used to parse the questions (see `parsed_questions`).
# More explanation follows below.
nlp = spacy.load('en_core_web_sm')

In [5]:
# containers for the data read from dev.tsv
qs = []                 # (question1, question2) pairs
labels = []             # duplicate label of each pair
all_questions = set()   # every distinct question

# read every row after the header; the last three columns are q1, q2 and the label
with open('dev.tsv') as f:
    lines = f.readlines()[1:]
for x in lines:
    try:
        q1, q2, label = x.strip().split('\t')[3:]
    except ValueError:
        # malformed row (wrong number of fields): print it and skip it
        print(x)
        continue
    all_questions.add(q1)
    all_questions.add(q2)
    qs.append((q1, q2))
    labels.append(label)
labels = np.array(labels).astype(int)
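
The loader above keeps the last three tab-separated fields of each row. A minimal sketch of that step on an in-memory sample, assuming the GLUE QQP column layout (`id`, `qid1`, `qid2`, `question1`, `question2`, `is_duplicate`); the sample rows are made up for illustration.

```python
import io

# Made-up rows in the assumed GLUE QQP layout.
sample = io.StringIO(
    "id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "0\t10\t11\tHow do I learn Python?\tHow can I learn Python?\t1\n"
    "1\t12\t13\tbroken row with too few fields\n"
    "2\t14\t15\tIs tea healthy?\tIs coffee healthy?\t0\n"
)

pairs, labels = [], []
for line in sample.readlines()[1:]:          # skip the header row
    fields = line.strip().split("\t")[3:]    # keep question1, question2, label
    if len(fields) != 3:                     # malformed row: skip it
        continue
    q1, q2, label = fields
    pairs.append((q1, q2))
    labels.append(int(label))

print(pairs)
print(labels)
```

Checking the field count explicitly has the same effect as catching the unpacking error: rows with the wrong number of columns are skipped.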

The first line of the next cell converts the `all_questions` set into a list. Each question is then run through the spaCy pipeline, and the resulting `Doc` objects are stored in `parsed_questions`. Finally, `spacy_map` is a dictionary that maps each original question in `all_questions` to its processed counterpart in `parsed_questions`, so that later steps can look up the parsed version of any question.

The difference between `all_questions` and `parsed_questions` is that
- `all_questions` contains the raw question strings
- `parsed_questions` contains the same questions after processing with spaCy

The `parsed_qs` variable is a list of tuples. Each tuple holds the spaCy-processed documents for one pair of questions in `qs`. This enables additional processing or analysis on the spaCy representations of the questions.

`parsed_qs` is used to build INV (invariance) tests, which check that the model's prediction does not change when the input is perturbed in a way that should not affect the label.
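
The pattern described above (deduplicate, parse once, look up by raw string) can be sketched without spaCy. Here `fake_parse` is a hypothetical stand-in for the `nlp` pipeline and simply tokenizes on whitespace.

```python
# Hypothetical stand-in for the spaCy pipeline: "parsing" is just tokenizing.
def fake_parse(text):
    return tuple(text.split())

qs = [("Is tea healthy?", "Is coffee healthy?"),
      ("Is tea healthy?", "Is tea good for you?")]

# Collect each distinct question once, so shared questions are parsed once.
all_questions = list({q for pair in qs for q in pair})
parsed_questions = [fake_parse(q) for q in all_questions]

# Raw string -> parsed object.
parse_map = dict(zip(all_questions, parsed_questions))

# Rebuild the pairs in their original order, now holding parsed objects.
parsed_qs = [(parse_map[q1], parse_map[q2]) for q1, q2 in qs]
print(len(all_questions), len(parsed_qs))
```

Parsing the deduplicated list and mapping back avoids running the parser more than once on questions that appear in several pairs.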

In [6]:
all_questions = list(all_questions)                
parsed_questions = list(nlp.pipe(all_questions))    
spacy_map = dict([(x, y) for x, y in zip(all_questions, parsed_questions)])

In [7]:
parsed_qs = [(spacy_map[q[0]], spacy_map[q[1]]) for q in qs]

In [8]:
# A TestSuite is a container that holds all the tests defined below.
suite = TestSuite()

## Vocabulary
Add vocabulary tests to the test suite, starting with word lists such as professions (careers).

Use `editor.suggest(...)` to generate suggestions for the masked position of the input template (here, a profession for `{a:mask}`).
- `{first_name}` is a placeholder that is filled with a person's first name.
- `{a:mask}` is the masked slot, here a job title, for which suggestions are generated. The `a:` prefix adds the matching indefinite article (a/an) in front of the suggested word.


[:30] specifies that only the first 30 items from that list are provided.
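
A rough sketch of the effect of the `a:` prefix. The helper `with_article` is hypothetical and uses a simple first-letter rule; CheckList's own logic may differ in edge cases.

```python
def with_article(word):
    """Prepend 'an' before a vowel letter and 'a' otherwise (first-letter rule)."""
    article = "an" if word[0].lower() in "aeiou" else "a"
    return f"{article} {word}"

for job in ["nurse", "engineer", "accountant"]:
    print(f"Mary works as {with_article(job)}.")
```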

In [9]:
professions = editor.suggest('{first_name} works as {a:mask}.')[:30]
print(', '.join(professions))


journalist, secretary, nurse, waitress, historian, accountant, engineer, model, attorney, editor, artist, architect, teacher, photographer, interpreter, analyst, escort, actor, assistant, actress, economist, intern, administrator, agent, DJ, organizer, investigator, auditor, author, investor



`editor.suggest('{first_name} {last_name} works as {a:mask}.')` is similar to the first call, but the template also includes `{last_name}`. Filling in both a first and a last name gives the language model a more specific context than the first name alone, so the suggested professions can differ. These suggestions are appended to the first list in `professions`. Because the two lists can contain the same profession, the combined list is passed through `set` to remove duplicates.


In [10]:
professions = editor.suggest('{first_name} works as {a:mask}.')[:30]
professions += editor.suggest('{first_name} {last_name} works as {a:mask}.')[:30]
professions = list(set(professions)) # use a set to remove duplicate elements, then convert back into a list
print(professions)

['educator', 'photographer', 'nurse', 'escort', 'artist', 'actor', 'economist', 'entrepreneur', 'historian', 'author', 'DJ', 'producer', 'assistant', 'editor', 'attorney', 'investor', 'intern', 'waitress', 'architect', 'activist', 'teacher', 'accountant', 'analyst', 'actress', 'organizer', 'administrator', 'agent', 'reporter', 'engineer', 'journalist', 'interpreter', 'investigator', 'executive', 'secretary', 'model']
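
Because `set` does not preserve order, the printed list can come out in a different order on each run. A small sketch of an order-preserving alternative, using made-up input lists:

```python
first = ["nurse", "editor", "artist"]
second = ["editor", "producer", "nurse"]

# dict keys keep insertion order, so duplicates drop out without reshuffling.
professions = list(dict.fromkeys(first + second))
print(professions)
```

This is optional for the test suite itself, since the templates sample from the list anyway, but it makes the printed output reproducible.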


In [10]:
print(', '.join(editor.suggest('{first_name} {last_name} is a good {mask}.')[:30]))

writer, example, guy, player, read, actor, friend, person, pick, teacher, character, one, choice, kid, poet, artist, name, candidate, fighter, student, singer, judge, shot, comedian, reader, listener, book, story, reporter, bet


In [11]:
other_nouns = ['player', 'person', 'friend', 'kid', 'candidate'] # additional generic nouns
nouns = list(set(professions + other_nouns)) # combine the professions and the other nouns, without duplicates
print(nouns)

['intern', 'entrepreneur', 'musician', 'economist', 'editor', 'agent', 'educator', 'player', 'waitress', 'executive', 'actress', 'nurse', 'investor', 'producer', 'candidate', 'DJ', 'investigator', 'organizer', 'secretary', 'administrator', 'friend', 'analyst', 'attorney', 'escort', 'interpreter', 'kid', 'accountant', 'model', 'photographer', 'author', 'reporter', 'architect', 'artist', 'actor', 'historian', 'person', 'assistant', 'engineer', 'activist', 'writer', 'journalist']


In [12]:
print(', '.join(editor.suggest('Is {first_name} {last_name} {a:mask} {noun}?', noun=nouns)[:50])) 

effective, good, important, American, excellent, actual, active, ethical, average, experienced, outstanding, independent, honest, accomplished, elite, bad, established, Australian, interested, influential, ordinary, international, better, unusual, real, outside, great, aggressive, efficient, interesting, new, ideal, underrated, decent, NBA, top, OK, attractive, incompetent, intelligent, evil, exceptional, unethical, amazing, Irish, innovative, equal, successful, impressive, art


In [13]:
adjs = ['effective', 'actual', 'American', 'active', 'honest', 'excellent', 'elite', 'accomplished', 'official', 'outstanding', 'experienced', 'independent', 'international', 'aspiring', 'average', 'good', 'amazing', 'exceptional', 'successful', 'accredited', 'English', 'real', 'bad', 'terrible', 'fake', 'unusual', 'influential', 'incompetent']

Add the test to the suite

In [15]:
# Generate question pairs from templates with the placeholders `first_name`, `last_name`, `adj`, and `noun`.
# Duplicates are removed and 1000 samples are generated.

t = editor.template(('Is {first_name} {last_name} {a:noun}?', 'Is {first_name} {last_name} {a:adj} {noun}?'),
                noun=nouns,
                adj=adjs,
                remove_duplicates=True, 
                nsamples=1000)

# Define an MFT (basic functionality test); the expected label is 0 (not duplicate)
test = MFT(**t, labels=0, name='Modifier: adj', capability='Vocabulary', 
          description = 'Adding an adjective makes questions non-duplicates')

# Add the test to the suite
suite.add(test)

Is person {a1, a2}?

In [16]:
print(', '.join(editor.suggest('Is John Wayne {mask}?'))) # suggestions for the bare masked slot
print()
print(', '.join(editor.suggest('Is John Wayne {a:mask}?')[:50])) # `a:` puts an article before the mask, so the suggestions are words that follow a/an

dead, gay, right, Dead, alive, back, crazy, mad, wrong, real, insane, OK, Jewish, Right, dying, lying, evil, straight, interested, gone, okay, Back, sane, ready, happy, angry, Alone, racist, sick, cheating, bisexual, correct, Wrong, next, resurrected, alone, doomed, shot, pregnant, innocent, finished, DEAD, guilty, cursed, assassinated, returning, safe, ok, joking, related, murdered, immortal, relevant, listening, God, serious, Missing, sincere, missing, alright, fired, free, homosexual, President, King, killed, Crazy, involved, toast, out, Real, famous, here, dreaming, Black, corrupt, coming, delusional, Evil, Mad, special, playing, Muslim, responsible, White, drunk, kidding, suicidal, haunted, retiring, smiling, legit, Satan, done, dangerous, reborn, awake, Batman, forgiven, depressed, radioactive, cool, Coming, single, changed, different, psychic, jealous, human, doping

racist, atheist, asshole, idiot, actor, ape, inspiration, outlaw, American, orphan, Christian, anomaly, artist, a

In [17]:
# attributes chosen so that no two of them overlap in meaning
adjs_without_overlap = ['dead', 'gay', 'Jewish', 'Christian', 'American', 'mad', 'immortal', 'evil', 'famous', 'racist', 'Muslim', 'white', 'black', 'English', 'autistic', 'Australian', 'trustworthy', 'an atheist', 'an anarchist', 'an inventor', 'Indian', 'Armenian', 'an astronaut', 'an immigrant']

In [18]:
# Add a test with the same person and two different attributes
t = editor.template((
    'Is {first_name} {last_name} {adj1}?',
    'Is {first_name} {last_name} {adj2}?',
    ),
    adj=adjs_without_overlap,
    remove_duplicates=True, 
    nsamples=1000)
test = MFT(**t, labels=0, name='different adjectives', capability = 'Vocabulary',
          description='Same first and last name, different adjectives')
suite.add(test)
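
In the template above, `{adj1}` and `{adj2}` are both filled from the same `adj` list. A plain-Python sketch of that expansion, assuming that `remove_duplicates=True` drops cases where both slots receive the same value. This is a simplified stand-in, not CheckList's implementation, and the name is fixed here rather than sampled.

```python
import itertools
import random

adjs = ["dead", "famous", "Jewish", "American"]
templates = ("Is John Smith {adj1}?", "Is John Smith {adj2}?")

# Ordered pairs of distinct fill-ins: (a, b) with a != b.
combos = list(itertools.permutations(adjs, 2))

rng = random.Random(0)                       # fixed seed for repeatability
chosen = rng.sample(combos, k=min(5, len(combos)))

pairs = [
    (templates[0].format(adj1=a), templates[1].format(adj2=b))
    for a, b in chosen
]
for q1, q2 in pairs:
    print(q1, "|", q2)
```

Every generated pair asks about the same person with two different attributes, which is why the expected label for the whole test is 0.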

Different animals (add animal words to the test)

In [19]:
print(', '.join(editor.suggest('I have a pet {mask}.')))

cat, rabbit, turtle, dog, spider, rat, now, goat, lizard, squirrel, too, monkey, dragon, pig, owl, tiger, girl, wolf, name, bird, named, bear, snake, duck, friend, lab, python, Shepherd, also, deer, here, boy, …, problem, fish, there, Labrador, elephant, carrier, zoo, mix, called, Shiva, dinosaur, bunny, companion, kitten, farm, cow, lobster, shark, ring, mouse, bug, snail, Persian, owner, killer, project, animal, puppy, chicken, monitor, horse, tree, myself, dish, lover, shop, boxer, seal, gorilla, phone, syndrome, watch, one, fetish, bull, carriage, toy, gun, frog, Einstein, boyfriend, beetle, show, doll, once, sometimes, car, store, brother, piano, today, already, mom, computer, somewhere, door, lion, fan, goose, mosquito, ghost, robot, fox, girlfriend, bass, family, bully


In [20]:
animals = ['cat', 'dog', 'rabbit', 'turtle', 'spider', 'rat', 'goat', 'lizard', 'pig', 'monkey', 'squirrel', 'owl', 'snake', 'fish', 'lobster', 'snail', 'chicken']

In [21]:
print(', '.join(editor.suggest('Can I feed my {an} {mask}?', an=animals)))

again, food, now, today, eggs, pellets, water, tonight, properly, worms, too, here, dinner, outside, poop, meat, something, back, this, fish, instead, rice, more, steak, directly, live, anything, treats, milk, formula, alive, blood, that, rabbit, nuts, grass, regularly, inside, breakfast, soup, feed, honey, myself, there, correctly, better, some, anymore, enough, bugs, once, poison, larvae, lunch, urine, free, chocolate, home, well, tomorrow, cheese, right, scraps, sugar, salad, juice, yet, greens, shit, yesterday, indoors, online, money, bones, carrots, oil, cookies, peanuts, chicken, liver, cat, candy, seeds, anyway, corn, raw, antibiotics, bird, seed, protein, alone, …, crow, Rice, NOW, toys, babies, yogurt, cereal, tail, feces, Milo, spinach, grain, bacon, cats, hay, bamboo, mix, one, butter, twice, egg, biscuits, pancakes, meal, balls, safely, it, mate, crap, friend, naturally, another, privately, stuff, tails, fly, bananas, tuna, bread, chicks, salmon, saliva, birds, friends, tea

In [22]:
# define the food words used in the animal questions
food = ['eggs', 'water',  'worms', 'meat', 'poop', 'milk', 'rice', 'nuts', 'steak', 'formula',  'soup', 'bugs', 'oil', 'chocolate', 'corn', 'cereal', 'sugar', 'seeds', 'liver', 'cookies', 'carrots', 'yogurt', 'salad', 'greens', 'rice', 'bananas', 'tuna', 'apples', 'salmon', 'butter', 'insulin', 'soy']

In [23]:
# Add the animal test: ask the same feeding question about two different pet animals

t = editor.template((
    'Can I feed my {animal1} {food}?',
    'Can I feed my {animal2} {food}?',
    ),
    animal=animals,
    food=food,
    remove_duplicates=True, 
    nsamples=1000)
name = 'Different animals' 
desc = 'Ask the same question about two different pet animals, expect prediction to be 0'
test = MFT(**t, labels=0, name=name, capability = 'Vocabulary',
          description=desc)
suite.add(test)

Modifiers that don't matter

In [24]:
# This test case adds an irrelevant adverb such as really, truly, or actually to the second question.
# The place mentioned in the question (couch, bed, ...) is also varied.

action = editor.suggest('Is that {animal} really {mask} on the couch?', animal=animals)[:30]

editor.suggest('Is that {animal} {mask} {action} on the couch?', animal=animals, action=action)
non_changing_modifier = ['really', 'truly', 'actually', 'indeed', 'in fact', 'currently', 'literally', 'somehow']
t = editor.template((
    'Is that {animal} {action} on the {place}?',
    'Is that {animal} {mod2} {action} on the {place}?',
    ),
    action=action,
    animal=animals,
    mod=non_changing_modifier,
    place =['couch','bed', 'sofa', 'table'],
    remove_duplicates=False, 
    nsamples=1000)
name = 'Irrelevant modifiers - animals' 
desc = 'Add modifiers that preserve question semantics (e.g. \'really\')'
test = MFT(**t, labels=1, name=name, capability = 'Vocabulary',
          description=desc)
suite.add(test)

In [25]:
# The same kind of test with people instead of animals.
# The question asks about one person's action toward another, with one of the adverbs above added.

action = editor.suggest('Is {first_name1} {mask} to {first_name2}?')[:30]
editor.suggest('Is {first_name1} {mask} {a} to {first_name2}?', a=action)
non_changing_modifier = ['really', 'truly', 'actually', 'indeed', 'in fact']
t = editor.template((
    'Is {first_name1} {action} to {first_name2}?',
    'Is {first_name1} {mod2} {action} to {first_name2}?',
    ),
    action=action,
    mod=non_changing_modifier,
    remove_duplicates=True, 
    nsamples=1000)
name = 'Irrelevant modifiers - people' 
desc = 'Add modifiers that preserve question semantics (e.g. \'really\')'
test = MFT(**t, labels=1, name=name, capability = 'Vocabulary',
          description=desc)
suite.add(test)

In [26]:
# This test case asks a question about animals in general, preceded by a preamble that names a specific pet.
# The two questions are expected to be duplicates, because only the example animal in the preamble differs.
t = editor.template((
    'My pet {animal1} eats {food}. Is it normal for animals to eat {food}?',
    'My pet {animal2} eats {food}. Is it normal for animals to eat {food}?',
    ),
    animal=animals,
    food=food,
    remove_duplicates=True, 
    nsamples=1000)
name = 'Irrelevant preamble with different examples.' 
desc = 'Ask questions about animals in general, but with different examples in the preamble. Expect duplicate.'
test = MFT(**t, labels=1, name=name, capability = 'Vocabulary',
          description=desc)
# test.run(new_pp)
# test.summary(n=3)
suite.add(test)

In [31]:
# Body-part nouns used by the injury test below (this overwrites the earlier `nouns` list)
nouns = ['finger', 'forearm', 'feet', 'head', 'foot', 'elbow', 'nose', 'back', 'heart', 'shoulder', 'skull', 'toe', 'hip', 'neck', 'body', 'thumb', 'rib', 'knee', 'jaw', 'heel', 'thigh', 'ankle', 'arm', 'wrist', 'eye', 'spine', 'leg', 'butt', 'teeth', 'skin', 'bone', 'hand', 'face']

In [32]:
# This test case asks about different injuries from playing sports.
# Example: "I hurt my bone last time I played football. Should I never play again?"
# The follow-up question is the same, but the injuries differ, so the pair is expected to be non-duplicate.
t = editor.template((
    'I hurt my {n} last time I played {sport}. {followup}',
    'I hurt my {n2} last time I played {sport}. {followup}',
    ),
    n=nouns,
    sport=['tennis', 'golf', 'soccer', 'football'],
    followup=['Is this going to impact my performance?',
              'Should I never play again?',
              'Is this a common injury?',
             'Is it normal to hurt this part of the body?',],
    remove_duplicates=True, 
    nsamples=1000)
name = 'Preamble is relevant (different injuries)' 
desc = 'Question preamble mentions different injuries, which makes the questions themselves non-duplicates.'
test = MFT(**t, labels=0, name=name, capability = 'Vocabulary',
          description=desc)
suite.add(test)

## Taxonomy

Synonyms

In [33]:
tmp = []  # collected synonym groups

x = editor.suggest('How can I become more {mask}?')   # candidates from the "more" template
x += editor.suggest('How can I become less {mask}?')  # candidates from the "less" template
for a in set(x):
    # synonyms of `a` that fit the sentence, tried with both 'more' and 'less'
    e = editor.synonyms('How can I become {moreless} %s?' % a, a, moreless=['more', 'less'])
    if e:
        tmp.append([a] + e)  # store the word followed by its synonyms
print(',\n'.join([str(tuple(x)) for x in tmp]))

('organised', 'organized', 'direct', 'engineer'),
('resilient', 'live'),
('radical', 'revolutionary'),
('knowledgeable', 'learned', 'intimate'),
('critical', 'decisive', 'vital'),
('mindful', 'aware'),
('evil', 'vicious'),
('nervous', 'anxious'),
('aware', 'mindful'),
('confident', 'positive'),
('clear', 'open', 'clean', 'light', 'clearly'),
('capable', 'able', 'open'),
('bad', 'sorry', 'tough', 'risky', 'spoiled', 'defective'),
('anxious', 'nervous'),
('difficult', 'hard'),
('ambitious', 'challenging'),
('strict', 'rigid', 'stern'),
('understanding', 'savvy'),
('thankful', 'grateful'),
('suspicious', 'wary', 'suspect'),
('organized', 'organised', 'direct'),
('important', 'significant', 'authoritative'),
('alone', 'solitary', 'lonely'),
('tolerant', 'resistant', 'liberal', 'kind'),
('upset', 'worried', 'broken', 'confused', 'distressed', 'disturbed'),
('disconnected', 'confused', 'fragmented'),
('lonely', 'alone', 'solitary'),
('kind', 'tolerant'),
('miserable', 'poor', 'suffering', 'l

In [34]:
# Hand-picked synonym pairs
synonyms = [ ('spiritual', 'religious'), ('angry', 'furious'), ('organized', 'organised'),
            ('vocal', 'outspoken'), ('grateful', 'thankful'), ('intelligent', 'smart'),
            ('humble', 'modest'), ('courageous', 'brave'), ('happy', 'joyful'), ('scared', 'frightened'),
           ]

Antonyms

In [35]:
# Same as above, but collecting antonyms (opposite words) instead.

opps = []  # collected antonym groups
x = editor.suggest('How can I become more {mask}?')
x += editor.suggest('How can I become less {mask}?')
for a in set(x):
    e = editor.antonyms('How can I become {moreless} %s?' % a, a, moreless=['more', 'less'])
    if e:
#         print(a, [b[0][0] for b in e] )
        opps.append([a] + e)
#         opps.append((a, e[0][0][0]))
print(','.join([str(tuple(x)) for x in opps]))

('powerless', 'powerful'),('evil', 'good'),('organic', 'functional'),('bad', 'good'),('uncomfortable', 'comfortable'),('negative', 'positive'),('difficult', 'easy'),('invisible', 'visible'),('stupid', 'smart', 'intelligent'),('emotional', 'intellectual'),('positive', 'negative'),('dependent', 'independent'),('humble', 'proud'),('rude', 'civil', 'polite'),('unhappy', 'happy'),('insecure', 'secure'),('specific', 'general'),('cautious', 'brave'),('pessimistic', 'optimistic'),('active', 'passive'),('corrupt', 'straight'),('smart', 'stupid'),('conspicuous', 'invisible'),('visible', 'invisible'),('fat', 'lean', 'thin'),('passive', 'active'),('impatient', 'patient'),('shy', 'confident'),('conservative', 'progressive', 'liberal'),('individual', 'common'),('irresponsible', 'responsible'),('progressive', 'conservative'),('hopeful', 'hopeless'),('hungry', 'thirsty'),('courageous', 'fearful'),('optimistic', 'pessimistic'),('defensive', 'offensive')


In [36]:
antonyms = [('progressive', 'conservative'),('religious', 'secular'),('positive', 'negative'),('defensive', 'offensive'),('rude',  'polite'),('optimistic', 'pessimistic'),('stupid', 'smart'),('negative', 'positive'),('unhappy', 'happy'),('active', 'passive'),('impatient', 'patient'),('powerless', 'powerful'),('visible', 'invisible'),('fat', 'thin'),('bad', 'good'),('cautious', 'brave'), ('hopeful', 'hopeless'),('insecure', 'secure'),('humble', 'proud'),('passive', 'active'),('dependent', 'independent'),('pessimistic', 'optimistic'),('irresponsible', 'responsible'),('courageous', 'fearful')]

In [37]:
# Add synonym test cases built from several different templates
t = editor.template([
    (
    'How can I become more {x[0]}?',
    'How can I become more {x[1]}?',
    ),
    (
    'How can I become more {x[1]}?',
    'How can I become more {x[0]}?',
    ),
    (
    'How can I become less {x[0]}?',
    'How can I become less {x[1]}?',
    ),
    (
    'How can I become less {x[1]}?',
    'How can I become less {x[0]}?',
    ),
    (
    'How can I become {a:x[0]} person?',
    'How can I become {a:x[1]} person?',
    ),
    (
    'How can I become {a:x[1]} person?',
    'How can I become {a:x[0]} person?',
    ),
],
    unroll=True,
    x=synonyms,
    remove_duplicates=True, 
    nsamples=1000)
name = 'How can I become more {synonym}?' 
desc = 'different (simple) templates where words are replaced with their synonyms'
test = MFT(**t, labels=1, name=name, capability = 'Taxonomy',
          description=desc)
# test.run(new_pp)
# test.summary(n=3)
suite.add(test, overwrite=True)

In [11]:

import re
# replace_pairs: returns a function that swaps each word of a pair for the other word
def replace_pairs(pairs):
    def replace_z(text):
        ret = []
        for x, y in pairs:
            # forward direction: x -> y (whole words only)
            t = re.sub(r'\b%s\b' % re.escape(x), y, text)
            if t != text:
                ret.append(t)
            # 'smart' is never replaced in the reverse direction
            if y == 'smart':
                continue
            # reverse direction: y -> x
            t = re.sub(r'\b%s\b' % re.escape(y), x, text)
            if t != text:
                ret.append(t)
        return list(set(ret))
    return replace_z

# apply_and_pair: pairs the original text with each modified text, to keep track of the change
def apply_and_pair(fn):
    def ret_fn(text):
        ret = fn(text)
        return [(text, r) for r in ret]
    return ret_fn
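
A short usage sketch of the two helpers above. The block repeats a condensed version of them so that it runs on its own; the example sentences and the two-pair list are made up.

```python
import re

def replace_pairs(pairs):
    # Condensed version of the helper above, so that this block runs on its own.
    def replace_z(text):
        ret = []
        for x, y in pairs:
            t = re.sub(r'\b%s\b' % x, y, text)
            if t != text:
                ret.append(t)
            if y == 'smart':
                continue
            t = re.sub(r'\b%s\b' % y, x, text)
            if t != text:
                ret.append(t)
        return list(set(ret))
    return replace_z

def apply_and_pair(fn):
    return lambda text: [(text, r) for r in fn(text)]

pairs = [('grateful', 'thankful'), ('intelligent', 'smart')]
make_pairs = apply_and_pair(replace_pairs(pairs))

print(make_pairs('How can I be more thankful?'))
print(make_pairs('Is a smart person happier?'))    # reverse swap is skipped
print(make_pairs('What is the capital of Peru?'))  # no match at all
```

Questions that contain none of the listed words yield an empty list, so only a fraction of the real questions contribute to the test.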

In [39]:
synonyms # show the synonym pairs

[('spiritual', 'religious'),
 ('angry', 'furious'),
 ('organized', 'organised'),
 ('vocal', 'outspoken'),
 ('grateful', 'thankful'),
 ('intelligent', 'smart'),
 ('humble', 'modest'),
 ('courageous', 'brave'),
 ('happy', 'joyful'),
 ('scared', 'frightened')]

In [40]:
# test case name and description
name = '(question, f(question)) where f(question) replaces synonyms?'  
desc = 'Expect 1, should be easy because it\'s individual word changes'

# Build (question, perturbed question) pairs by swapping synonyms, then wrap them in an invariance test.
t = Perturb.perturb(list(all_questions), apply_and_pair(replace_pairs(synonyms)), nsamples=1000, keep_original=False)
test = INV(t.data, threshold=0.1, name=name, description=desc, capability='Taxonomy')
suite.add(test, overwrite=True) # overwrite=True replaces an existing test with the same name


In [41]:
def apply_to_each_and_product(fn):
    def apply_to_one(x):
        p = fn(x)  # apply the perturbation function
        if not p:
            p = []
        return list(set([x] + p))
    def ret_fn(pair):
        # apply the function to the first and second question of the pair
        p1 = apply_to_one(pair[0])
        p2 = apply_to_one(pair[1])
        return [x for x in itertools.product(p1, p2) if x != pair]
    return ret_fn
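
A usage sketch of the helper above with a made-up perturbation. The helper is repeated in compact form so that the block is self-contained; `swap_happy` is hypothetical.

```python
import itertools

def apply_to_each_and_product(fn):
    # Same logic as the helper above, repeated so that this block runs alone.
    def apply_to_one(x):
        return list(set([x] + (fn(x) or [])))
    def ret_fn(pair):
        p1 = apply_to_one(pair[0])
        p2 = apply_to_one(pair[1])
        return [x for x in itertools.product(p1, p2) if x != pair]
    return ret_fn

# Hypothetical perturbation: swap 'happy' for 'joyful' when present.
def swap_happy(text):
    return [text.replace('happy', 'joyful')] if 'happy' in text else []

perturb = apply_to_each_and_product(swap_happy)
result = perturb(('How can I be happy?', 'How do I become happy?'))
for q1, q2 in sorted(result):
    print(q1, '|', q2)
```

With one variant per question, the product has four combinations; the unchanged original pair is filtered out, which leaves three perturbed pairs.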

In [42]:
# Test case: replace synonyms in real question pairs from the dataset
name = 'Replace synonyms in real pairs'
desc = ''
t = Perturb.perturb(qs, apply_to_each_and_product(replace_pairs(synonyms)), nsamples=1000, keep_original=True)
test = INV(t.data, threshold=0.1, name=name, description=desc, capability='Taxonomy')
suite.add(test, overwrite=True)

In [43]:
# Test case: "more X" and "less X" have different meanings, so the pair is expected to be non-duplicate.
t = editor.template([(
    'How can I become more {x[0]}?',
    'How can I become less {x[0]}?',
    ),
    (
    'How can I become less {x[1]}?',
    'How can I become more {x[1]}?',
    )],
    unroll=True,
    x=antonyms,
    remove_duplicates=True, 
    nsamples=1000)
name = 'How can I become more X != How can I become less X' 
desc = ''
test = MFT(**t, labels=0, name=name, capability = 'Vocabulary',
          description=desc)
# test.run(new_pp, n=500, seed=1)
# test.summary(n=3)
suite.add(test, overwrite=True)

In [44]:
# Test case combining "more"/"less" with antonyms.
# For example, "more X" and "less antonym(X)" are close in meaning, so the pair is treated as a duplicate.
t = editor.template([(
    
    'How can I become more {x[0]}?',
    'How can I become less {x[1]}?',
    ),
    (
    'How can I become less {x[0]}?',
    'How can I become more {x[1]}?',
    )],
    unroll=True,
    x=antonyms,
    remove_duplicates=True, 
    nsamples=1000)
name = 'How can I become more X = How can I become less antonym(X)' 
desc = ''
test = MFT(**t, labels=1, name=name, capability = 'Taxonomy',
          description=desc)
suite.add(test, overwrite=True)

DIR version (kinda bad, won't add to suite)

In [45]:
t = Perturb.perturb(list(all_questions), apply_and_pair(replace_pairs(antonyms)), nsamples=200, keep_original=False)
test = DIR(t.data, expect=Expect.eq(0), agg_fn='all')
# Directional Expectation test: expects label 0 after an antonym swap

## Robustness

In [46]:
def wrap_apply_to_each(fn, both=False, *args, **kwargs):
    def new_fn(qs, *args, **kwargs):
        q1, q2 = qs
        ret = []
        fnq1 = fn(q1, *args, **kwargs)
        fnq2 = fn(q2, *args, **kwargs)
        if not isinstance(fnq1, list):
            fnq1 = [fnq1]
        if not isinstance(fnq2, list):
            fnq2 = [fnq2]
        # perturb one question at a time, keeping the other as the original text
        ret.extend([(x, str(q2)) for x in fnq1])
        ret.extend([(str(q1), x) for x in fnq2])
        if both:
            ret.extend([(x, x2) for x, x2 in itertools.product(fnq1, fnq2)])
        return [x for x in ret if x[0] and x[1]]
    return new_fn
def wrap_apply_to_both(fn, *args, **kwargs):
    def new_fn(qs, *args, **kwargs):
        q1, q2 = qs
        ret = []
        fnq1 = fn(q1, *args, **kwargs)
        fnq2 = fn(q2, *args, **kwargs)
        if not isinstance(fnq1, list):
            fnq1 = [fnq1]
        if not isinstance(fnq2, list):
            fnq2 = [fnq2]
        # pair every perturbed version of q1 with every perturbed version of q2
        ret.extend([(x, x2) for x, x2 in itertools.product(fnq1, fnq2)])
        return [x for x in ret if x[0] and x[1]]
    return new_fn
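
The difference between the two wrappers above can be sketched in a simplified form. The functions `each`, `both`, `as_list`, and `shout` are hypothetical stand-ins, not the notebook's helpers, and the `both=True` option and empty-value filtering are left out.

```python
import itertools

# Hypothetical perturbation that returns a single string.
def shout(text):
    return text.upper()

def as_list(value):
    return value if isinstance(value, list) else [value]

def each(fn, pair):
    """Perturb one question at a time, leaving the other untouched."""
    q1, q2 = pair
    out = [(x, q2) for x in as_list(fn(q1))]
    out += [(q1, x) for x in as_list(fn(q2))]
    return out

def both(fn, pair):
    """Perturb both questions and combine every variant of each."""
    q1, q2 = pair
    return list(itertools.product(as_list(fn(q1)), as_list(fn(q2))))

pair = ('is tea healthy?', 'is coffee healthy?')
print(each(shout, pair))
print(both(shout, pair))
```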

Typos

In [47]:
# add the typo test to the suite under the Robustness capability (category)
t = Perturb.perturb(qs, wrap_apply_to_each(Perturb.add_typos), nsamples=500) # add_typos is a function in Perturb
test = INV(t.data, name='add one typo', capability='Robustness', description='')
suite.add(test, overwrite=True)

Contractions

In [48]:
# add the contraction test to the suite under the Robustness capability (category)
t = Perturb.perturb(qs, wrap_apply_to_each(Perturb.contractions, both=True), nsamples=500) # contractions is a function in Perturb
test = INV(**t, name='contrations', capability='Robustness', description='')
suite.add(test)

Paraphrases

In [49]:
import itertools
# switch first person to second person: I -> you, my -> your, mine -> yours
def me_to_you(text):
    t = re.sub(r'\bI\b', 'you', text)
    t = re.sub(r'\bmy\b', 'your', t)
    return re.sub(r'\bmine\b', 'yours', t)

# generate paraphrases for questions that start with a known prefix
def paraphrases(text):
    ts = ['How do I ', 'How can I ', 'What is a good way to ', 'How should I ']

    # templates into which the remainder of a matching question is re-inserted
    templates1 = ['How do I {x}?', 'How can I {x}?', 'What is a good way to {x}?', 'If I want to {x}, what should I do?',
                'In order to {x}, what should I do?']
    ts2 = ['Can you ', 'Can I ']#, 'Do I']
    ts3 = ['Do I ']
    templates2 = ['Can you {x}?', 'Can I {x}?', 'Do you think I can {x}?', 'Do you think you can {x}?',]
    templates3 = ['Do I {x}?', 'Do you think I {x}?']
    ret = []
    for i, (tsz, templates) in enumerate(zip([ts, ts2, ts3], [templates1, templates2, templates3])):
        for t in tsz:
            if text.startswith(t):
                x = text[len(t):].strip('?')
                ts = editor.template(templates, x=x).data[0]
                if i <= 1:
                    ts = ts + [me_to_you(x) for x in ts]
                ret += ts
    return ret  # list of paraphrased sentences (empty if no prefix matches)

# Returns every ordered pair (p1, p2) of paraphrases of the same text.
# If the text matches none of the prefixes, the result is an empty list.
def paraphrases_product(text):
    pr = paraphrases(text)
    return list(itertools.product(pr, pr))

# Returns every combination of a paraphrase of the first question with a paraphrase of the second,
# e.g. ('How do I X?', 'How can I Y?'); the result is empty if either question has no paraphrases.
def paraphrase_each(pair):
    p1 = paraphrases(pair[0])
    p2 = paraphrases(pair[1])
    return list(itertools.product(p1, p2))
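
A reduced sketch of the paraphrasing step above: strip a known prefix, re-insert the remainder into each template, then add second-person variants. It uses `str.format` in place of `editor.template`, and `toy_paraphrases` with its shortened prefix and template lists is hypothetical.

```python
import re

def me_to_you(text):
    t = re.sub(r'\bI\b', 'you', text)
    t = re.sub(r'\bmy\b', 'your', t)
    return re.sub(r'\bmine\b', 'yours', t)

prefixes = ['How do I ', 'How can I ']
templates = ['How do I {x}?', 'How can I {x}?', 'What is a good way to {x}?']

def toy_paraphrases(text):
    """Strip a known prefix, then re-wrap the remainder in each template."""
    out = []
    for prefix in prefixes:
        if text.startswith(prefix):
            x = text[len(prefix):].strip('?')
            filled = [t.format(x=x) for t in templates]
            out += filled + [me_to_you(f) for f in filled]
    return out

for p in toy_paraphrases('How can I improve my English?'):
    print(p)
```

A question that starts with none of the prefixes produces no paraphrases, so it contributes nothing to the paraphrase tests.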

In [50]:
# This test checks that a question paired with its own paraphrases is predicted as a duplicate.
t = Perturb.perturb(list(all_questions), paraphrases_product, nsamples=200, keep_original=False)
name = '(q, paraphrase(q))'
desc = 'For questions that start with "How do I X", "How can I X", etc'
test = DIR(t.data, expect=Expect.eq(1), agg_fn='all', name=name, description=desc, capability='Robustness')
suite.add(test, overwrite=True)

In [51]:
# This test checks that paraphrasing both questions of a real pair does not change the prediction. It is a more complex case than the one above.

t = Perturb.perturb(qs, paraphrase_each, nsamples=100, keep_original=True)
name = 'Product of paraphrases(q1) * paraphrases(q2)'
desc = 'For questions that start with "How do I X", "How can I X", etc'
test = INV(t.data, name=name, description=desc, capability='Robustness')
suite.add(test)


## NER (Named Entity Recognition)

### Change same name, number, location in both

Names

person1 and person2 are different by first and last name

In [52]:
# This test case uses the same adjective with two different people. It is an MFT on basic functionality.
t = editor.template((
    'Is {first_name1} {last_name1} {adj}?',
    'Is {first_name2} {last_name2} {adj}?',
    ),
    adj=adjs_without_overlap,
    remove_duplicates=True, 
    nsamples=1000)
test = MFT(**t, labels=0, name='same adjectives, different people', capability = 'NER',
          description='Different first and last name, same adjectives')
suite.add(test)

person1 and person2 are different by first name only

In [53]:
# Same idea, but only the first name differs.
t = editor.template((
    'Is {first_name} {last_name} {adj}?',
    'Is {first_name2} {last_name} {adj}?',
    ),
    adj=adjs_without_overlap,
    remove_duplicates=True, 
    nsamples=1000)
test = MFT(**t, labels=0, name='same adjectives, different people v2', capability = 'NER',
          description='Different first name, same adjective and last name')
suite.add(test)

person1 and person2 are different by last name only

In [54]:
# Same idea, but only the last name differs.

t = editor.template((
    'Is {first_name} {last_name} {adj}?',
    'Is {first_name} {last_name2} {adj}?',
    ),
    adj=adjs_without_overlap,
    remove_duplicates=True, 
    nsamples=1000)
test = MFT(**t, labels=0, name='same adjectives, different people v3', capability = 'NER',
          description='Different last name, same adjective and first name')
# test.run(new_pp)
# test.summary(n=5)
suite.add(test)

In [55]:
# Wrap a perturbation so that the same change (same seed, same metadata) is applied to both questions.
# Used for the INV (invariance) tests below.
def change_both_wrapper(fn):
    def change_both(qs):
        q1, q2 = qs
        seed = np.random.randint(100)
        c1 = fn(q1, seed=seed, meta=True)
        c2 = fn(q2, seed=seed, meta=True)
        if not c1 or not c2:
            return
        c1, m1 = c1
        c2, m2 = c2
        return [(q1, q2) for q1, q2, m1, m2 in zip(c1, c2, m1, m2) if m1 == m2]
    return change_both

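# Wrap a perturbation so that it changes only one question of the pair, and only when
# the replaced entity (assumed to be meta[0]) also appears in the other question.
def change_each_wrapper(fn):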
    def change_one(qs, **kwargs):
        q1, q2 = qs
        seed = np.random.randint(100)
        c1 = fn(q1, seed=seed, meta=True, **kwargs)
        c2 = fn(q2, seed=seed, meta=True, **kwargs)
        if not c1 or not c2:
            return
        c1, m1 = c1
        c2, m2 = c2
        ret = []
        ret.extend([(q1_, str(q2)) for q1_, m1_ in zip(c1, m1) if m1_[0] in str(q2)])
        ret.extend([(str(q1), q2_) for q2_, m2_ in zip(c2, m2) if m2_[0] in str(q1)])
        return ret
    return change_one

# INV test: change the same name in both questions; the prediction should stay the same.
t = Perturb.perturb(parsed_qs, change_both_wrapper(Perturb.change_names), nsamples=500)
test = INV(**t, name='Change same name in both questions', capability='NER',
          description='')
suite.add(test)
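
The key idea in `change_both_wrapper` is that both questions are perturbed with the same seed, and a pair is kept only if both received the same change. A minimal sketch of that idea, assuming a made-up `toy_change_name` in place of `Perturb.change_names` (the name list and questions are invented):

```python
import random

CANDIDATE_NAMES = ["Maria", "Priya", "Elena"]  # hypothetical replacement names

def toy_change_name(text, old_name, seed):
    # Hypothetical perturbation: returns the new text and (old, new) metadata.
    rng = random.Random(seed)
    new_name = rng.choice(CANDIDATE_NAMES)
    return text.replace(old_name, new_name), (old_name, new_name)

def toy_change_both(pair, old_name, seed):
    # Same seed for both questions, so both draw the same replacement.
    q1, m1 = toy_change_name(pair[0], old_name, seed)
    q2, m2 = toy_change_name(pair[1], old_name, seed)
    # Keep the pair only when the metadata agrees.
    return (q1, q2) if m1 == m2 else None

pair = ("Is Alice a good doctor?", "Is Alice a good physician?")
changed = toy_change_both(pair, "Alice", seed=7)
print(changed)
```

Because the name changes identically in both questions, whatever the model predicted for the original pair should still hold, which is what the INV test checks.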

Locs

In [56]:
# INV test: change the same location in both questions.
t = Perturb.perturb(parsed_qs, change_both_wrapper(Perturb.change_location), nsamples=500)
test = INV(**t, name='Change same location in both questions', capability='NER',
          description='')
# test.run(new_pp)
# test.summary(3)
suite.add(test)

Numbers

In [57]:
# INV test: change the same number in both questions.

t = Perturb.perturb(parsed_qs, change_both_wrapper(Perturb.change_number), nsamples=500)
test = INV(**t, name='Change same number in both questions', capability='NER',
          description='')
suite.add(test)

### Change name, loc, number in only one where orig prediction is duplicate
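
The DIR tests in this subsection combine `Expect.eq(0)` with `Expect.slice_orig(..., lambda orig, *args: orig == 1)`. A simplified sketch of the intended logic is shown here; the function name and the use of `None` for "not applicable" are assumptions made for illustration, not CheckList's internal conventions.

```python
def check_changed_pair(orig_pred, new_pred):
    """Return True/False for pass/fail, or None when the test does not apply."""
    if orig_pred != 1:
        # Only pairs originally predicted as duplicates are considered.
        return None
    # After changing an entity in one question, expect "not duplicate".
    return new_pred == 0

print(check_changed_pair(1, 0))  # originally duplicate, now non-duplicate
print(check_changed_pair(1, 1))  # originally duplicate, still duplicate
print(check_changed_pair(0, 1))  # originally non-duplicate: skipped
```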

Changing only first names

In [58]:
# DIR test: change the first name in only one of the two questions.


t = Perturb.perturb(parsed_qs, change_each_wrapper(Perturb.change_names), nsamples=500, first_only=True)
expect_fn = Expect.eq(0)
expect_fn = Expect.slice_orig(expect_fn, lambda orig, *args: orig == 1)
name = 'Change first name in one of the questions'
desc = 'Take pairs that are originally predicted as duplicates, change first name in one of them and expect new prediction to be non-duplicate'
test = DIR(**t, expect=expect_fn, name=name, description=desc, capability='NER')
suite.add(test)


Changing first and last names

In [59]:
# DIR test: change the first and last name in only one of the two questions (reuses `expect_fn` from the previous cell).
t = Perturb.perturb(parsed_qs, change_each_wrapper(Perturb.change_names), nsamples=1500)
name = 'Change first and last name in one of the questions'
desc = 'Take pairs that are originally predicted as duplicates, change first and last name in one of them and expect new prediction to be non-duplicate'
test = DIR(**t, expect=expect_fn, name=name, description=desc, capability='NER')
# test.run(new_pp)
# test.summary(3)
suite.add(test)


Locs

In [60]:
# DIR test: change the location in only one of the two questions.
t = Perturb.perturb(parsed_qs, change_each_wrapper(Perturb.change_location), nsamples=1500)
name = 'Change location in one of the questions'
desc = 'Take pairs that are originally predicted as duplicates, change location in one of them and expect new prediction to be non-duplicate'
test = DIR(**t, expect=expect_fn, name=name, description=desc, capability='NER')
# test.run(new_pp)
# test.summary(3)
suite.add(test)

numbers

In [61]:
# DIR test: change the number in only one of the two questions.
t = Perturb.perturb(parsed_qs, change_each_wrapper(Perturb.change_number), nsamples=1500)
name = 'Change numbers in one of the questions'
desc = 'Take pairs that are originally predicted as duplicates, change number in one of them and expect new prediction to be non-duplicate'
test = DIR(**t, expect=expect_fn, name=name, description=desc, capability='NER')
suite.add(test, overwrite=True)

### Keep entities, fill in with BERT gibberish
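
The helper in the next cell keeps the question word and the entities of a question and puts a `{mask}` slot between them, which the language model then fills in. The string-building step can be sketched on its own; `build_mask_template` and the example entities are made up for illustration.

```python
def build_mask_template(question_word, entities):
    # Copy the entity list and close the question on the last entity.
    ents = list(entities)
    ents[-1] = ents[-1] + "?"
    # Put one {mask} slot between the question word and each entity.
    return " {mask} ".join([question_word] + ents)

template = build_mask_template("Why", ["Obama", "Chicago"])
print(template)
```

Each `{mask}` is later replaced by a model suggestion, so the new question shares the entities with the original but usually asks something different.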

In [63]:
# Helpers that build a new question which keeps the original's entities and fills the gaps between them with language-model suggestions.

def mask_gibberish(question):
    ents = question.ents
    if not ents:
        return None
    # Take the question's wh-word (tags WP, WRB, WDT); fall back to its first token
    wp = [x.text for x in question if x.tag_ in ['WP', 'WRB', 'WDT']]
    if not wp:
        wp = question[0].text
    else:
        wp = wp[0]
    ents = [x.text for x in ents]
    ents[-1] = ents[-1] + '?'
    template = ' {mask} '.join([wp] + ents)
    gibberish = editor.template(template).data[:5]
#     return gibberish
    ret = [(question.text, x) for x in gibberish if question.text.lower() != x.lower() ]
    return ret


def gibberish_both(qs):
    q1, q2 = qs # extract qs into q1 and q2
    ret = []
    x1 = mask_gibberish(q1) #add gibberish into 1st question
    if x1:
        ret.extend(x1)
    x2 = mask_gibberish(q2) #add gibberish into 2nd question
    if x2:
        ret.extend(x2)
    return ret

In [64]:
# DIR test: a question rebuilt from the same entities plus filler words should not be a duplicate of the original.
t = Perturb.perturb(parsed_qs, gibberish_both, nsamples=500)
expect_false = Expect.eq(0)
name = 'Keep entitites, fill in with gibberish'
desc = 'Fill in between entitites with BERT, expect result to not be duplicate with original questions'
test = DIR(**t, expect=expect_false, name=name, description=desc, capability='NER')
suite.add(test, overwrite=True)

## Temporal

Is != used to be
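
The templates in this section use `{a:noun}`, which, as far as I understand CheckList, fills in the noun together with a matching indefinite article. A simplified stand-in is sketched here; `with_article` uses a naive first-letter rule and is not the library's implementation, and the name and nouns are invented.

```python
def with_article(noun):
    """Prefix a noun with 'a' or 'an' using a simple first-letter rule."""
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"

def temporal_pair(name, noun):
    # Present state vs. past state: the two questions are not duplicates.
    phrase = with_article(noun)
    return (f"Is {name} {phrase}?", f"Did {name} use to be {phrase}?")

print(temporal_pair("Mary Smith", "engineer"))
print(temporal_pair("Mary Smith", "player"))
```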

In [66]:
# MFT: a template pair testing that `is` does not mean the same as `used to be`
other_nouns = ['player', 'person', 'friend', 'kid', 'candidate'] # define nouns
nouns = list(set(professions + other_nouns)) 
t = editor.template(('Is {first_name} {last_name} {a:noun}?', 'Did {first_name} {last_name} use to be {a:noun}?'),
                noun=nouns,
                adj=adjs,
                remove_duplicates=True, 
                nsamples=1000)
name = 'Is person X != Did person use to be X'
test = MFT(**t, labels=0, name=name, description='', capability='Temporal')
suite.add(test)

Is != becoming

In [70]:
# MFT: a template pair testing that `is` does not mean the same as `is becoming`
t = editor.template(('Is {first_name} {last_name} {a:noun}?', 'Is {first_name} {last_name} becoming {a:noun}?'),
                noun=nouns,
                adj=adjs,
                remove_duplicates=True, 
                nsamples=1000)
name = 'Is person X != Is person becoming X'
test = MFT(**t, labels=0, name=name, description='', capability='Temporal')
# test.run(new_pp)
# test.summary(n=3)
suite.add(test)

Before != after

In [71]:
# MFT: a template pair testing that `before` does not mean the same as `after`
t = editor.template((
    'What was {first_name} {last_name}\'s life before becoming {a:noun}?',
    'What was {first_name} {last_name}\'s life after becoming {a:noun}?'
),
                noun=nouns,
                adj=adjs,
                remove_duplicates=True, 
                nsamples=1000)
name = 'What was person\'s life before becoming X != What was person\'s life after becoming X'
test = MFT(**t, labels=0, name=name, description='', capability='Temporal')
# test.run(new_pp)
# test.summary(n=3)
suite.add(test)

In [72]:
# Collect up to 200 suggested verb pairs for the two masks, skipping any pair in which either word contains 'kill'.
action12 = [x for x in editor.suggest('Do you have to {mask} your cat before {mask} it?') if 'kill' not in x[0] and 'kill' not in x[1]][:200]

In [73]:
# MFT: `before` vs. `after` in questions about pets 🐶🐈🐹; the two questions are not duplicates
t = editor.template((
    'Do you have to {a[0]} your {an} before {a[1]} it?',
    'Do you have to {a[0]} your {an} after {a[1]} it?'
),
    an = ['cat', 'dog', 'hamster'],  # animals used to fill {an}
    a=action12,
    remove_duplicates=True, 
    nsamples=1000)
name = 'Do you have to X your dog before Y it != Do you have to X your dog after Y it.'
test = MFT(**t, labels=0, name=name, description='', capability='Temporal')
# test.run(new_pp)
# test.summary(n=3)
suite.add(test)

In [74]:
# Ask the language model for words that fit {mask}; useful ones are hand-picked in the next cell.
print(', '.join(editor.suggest('Is is {mask} to eat after 10pm?'))) 

illegal, OK, safe, legal, okay, ok, dangerous, acceptable, wrong, proper, possible, unhealthy, unsafe, enough, allowed, normal, appropriate, healthy, advisable, hard, permissible, right, wise, anything, safer, reasonable, alright, necessary, fine, better, rude, risky, important, lawful, best, smart, polite, difficult, unlawful, kosher, good, sensible, mandatory, fair, ethical, inappropriate, forbidden, taboo, healthier, something, unethical, time, customary, weird, cool, harder, compulsory, common, unusual, impossible, strange, easier, banned, permitted, harmful, safest, sufficient, feasible, supposed, prohibited, trendy, immoral, recommended, cheaper, moral, fun, fashionable, bad, easy, expensive, prudent, disrespectful, hot, going, food, wiser, unreasonable, suitable, sinful, what, legitimate, popular, realistic, criminal, disgusting, advised, improper, correct, tough, much, funny, unacceptable, mean, fit, odd, smarter, cheating, usual, abnormal, sick


In [75]:
# Words hand-picked from the suggestions above
mid = ['normal', 'ok', 'safe', 'dangerous', 'acceptable', 'reasonable', 'proper', 'wrong', 'healthy', 'important']

In [76]:
# Ask the language model for verbs that fit {mask}, given the adjectives in `mid`.
print(', '.join(editor.suggest('Is is {mid} to {mask} after 10pm?', mid=mid)))

sleep, work, eat, drink, leave, drive, vote, go, smoke, call, talk, stay, fish, visit, post, study, write, retire, travel, tweet, read, rise, strike, return, shop, disappear, stop, start, speak, pray, celebrate, party, live, play, continue, die, fly, watch, cook, text, wait, move, be, act, do, finish, close, exercise, ask, walk, answer, bed, dance, marry, blog, gamble, swim, enter, propose, operate, rest, quit, report, wake, check, relax, search, pee, change, think, nap, meet, run, come, feed, look, fight, dress, vanish, comment, disturb, gather, kill, arrive, chat, remain, dinner, function, protest, queue, publish, withdraw, attend, commute, linger, end, perform, worry, exist, shower, crash, cancel, cry, shave, open, resign, respond, book, paint, exit, refuse, clean, begin, kiss, awake, hunt, intervene, binge, snack, know, cycle, park, date, skate, vomit, demonstrate, panic, fire, occur, practice, phone, reply, vacuum, yawn, campaign, lie, breathe, argue, plan, happen, tip, emerge, co

In [77]:
# Activities hand-picked from the suggestions above
activity = ['drink', 'sleep', 'drive', 'work', 'eat', 'smoke', 'walk', 'read', 'party', 'talk', 'exercise', 'celebrate', 'text', 'tweet', 'run', 'dance', 'swim', 'cook', 'pray', 'pee', 'rest']

In [78]:
# MFT: `before` vs. `after` again, this time with a clock time as context.
t = editor.template(('Is it {mid} to {activity} before {hour}{ampm}?','Is it {mid} to {activity} after {hour}{ampm}?'),
                activity=activity,
                mid=mid,
                hour=[str(x) for x in range(1, 12)],
                ampm=['am', 'pm'],
                remove_duplicates=True, 
                nsamples=1000)
name = 'Is it {ok, dangerous, ...} to {smoke, rest, ...} after != before'
test = MFT(**t, labels=0, name=name, description='', capability='Temporal')
suite.add(test)
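
To get a feel for how many distinct question pairs this template can produce before `nsamples=1000` picks from them, the fill-in lists can be crossed by hand. This is plain `itertools`, independent of CheckList; the lists are copied from the cells above.

```python
import itertools

mid = ['normal', 'ok', 'safe', 'dangerous', 'acceptable', 'reasonable',
       'proper', 'wrong', 'healthy', 'important']
activity = ['drink', 'sleep', 'drive', 'work', 'eat', 'smoke', 'walk', 'read',
            'party', 'talk', 'exercise', 'celebrate', 'text', 'tweet', 'run',
            'dance', 'swim', 'cook', 'pray', 'pee', 'rest']
hours = [str(x) for x in range(1, 12)]  # '1' .. '11'; range() excludes 12
ampm = ['am', 'pm']

combos = list(itertools.product(mid, activity, hours, ampm))
print(len(combos))

m, act, hour, half = combos[0]
print(f"Is it {m} to {act} before {hour}{half}?")
print(f"Is it {m} to {act} after {hour}{half}?")
```

Note that `range(1, 12)` stops at 11, so questions about 12am or 12pm are never generated.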

## Negation

In [79]:
# `mid2` extends `mid` with a few more adjectives
mid2 = mid + ['legal', 'awkward', 'socially acceptable']

In [80]:
# suggest the verb
print(', '.join(editor.suggest('Is it {mid} to {mask} in {country}?', mid=mid2)[:100]))

live, work, vote, drive, marry, be, travel, study, eat, fight, fly, smoke, exist, speak, pray, protest, die, stay, swim, campaign, write, meet, squat, gamble, surf, fish, hunt, shop, demonstrate, march, murder, kill, pee, visit, paint, rape, race, perform, drink, dance, strike, hide, gather, preach, gay, hate, spy, sleep, lie, worship, migrate, report, move, date, act, fire, party, remain, reside, serve, talk, cook, operate, teach, propose, experiment, settle, play, barbecue, ask, practice, breed, call, bomb, search, offend, black, arrive, discriminate, communicate, walk, survive, shoot, ski, camp, burn, queer, train, farm, sit, broadcast, blog, come, invest, sue, think, tan, shave, transgender, film


In [81]:
# define selected words
things = ['work', 'vote', 'travel', 'marry', 'drive', 'study', 'protest', 'campaign', 'fight', 'gamble', 'hunt', 'pray', 'smoke', 'fish', 'murder', 'invest', 'pee', 'march', 'worship', 'volunteer', 'surf', 'shoot', 'dance', 'camp', 'preach', 'spy', 'be gay', 'lie', 'divorce', 'discriminate']

In [82]:
# MFT for negation: adding `not` changes the meaning, so the pair is not a duplicate.
tmp = editor.suggest(('How can I become a person who is {mask}', 'How can I become a person who is not {mask}?'))
tmp.remove('differently')
t = editor.template((
    'How can I become {a:x} person?',
    'How can I become a person who is not {x}?',
    ),
    x=tmp,
    remove_duplicates=True, 
    nsamples=1000)
name = 'How can I become a X person != How can I become a person who is not X' 
desc = ''
test = MFT(**t, labels=0, name=name, capability = 'Negation',
          description=desc)
suite.add(test, overwrite=True)

In [83]:
# MFT for negation: `Is it {mid} to ...` and `Is it {mid} not to ...` are not duplicates.
t = editor.template(('Is it {mid} to {activity} in {country}?','Is it {mid} not to {activity} in {country}?'),
                activity=things,
                mid=mid2,
                remove_duplicates=True, 
                nsamples=1000)
name = 'Is it {ok, dangerous, ...} to {smoke, rest, ...} in country != Is it {ok, dangerous, ...} not to {smoke, rest, ...} in country'
test = MFT(**t, labels=0, name=name, description='', capability='Negation')
suite.add(test)

In [84]:
# MFT for negation: `should` and `should not` are not duplicates.
t = editor.template((
    'What are things {a:noun} should worry about?',
    'What are things {a:noun} should not worry about?',
),
                noun=nouns,
                remove_duplicates=True, 
                nsamples=1000)
name = 'What are things a {noun} should worry about != should not worry about.'
test = MFT(**t, labels=0, name=name, description='', capability='Negation')
suite.add(test)

In [85]:
# MFT for negation with antonyms: here the pair IS a duplicate (label 1).
# e.g., How can I become a X person == How can I become a person who is not antonym(X)
# In other words, `X` should be treated the same as `not antonym(X)`; compare with the `not X` test above.
t = editor.template([(
    'How can I become {a:x[0]} person?',
    'How can I become a person who is not {x[1]}?',
    ),
    (
    'How can I become {a:x[1]} person?',
    'How can I become a person who is not {x[0]}?',
    ),
],
    unroll=True,
    x=antonyms,
    remove_duplicates=True, 
    nsamples=1000)
name = 'How can I become a X person == How can I become a person who is not antonym(X)' 
desc = ''
test = MFT(**t, labels=1, name=name, capability = 'Negation',
          description=desc)
suite.add(test, overwrite=True)
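
Each antonym pair yields two question pairs, one per direction, and both are expected to be duplicates. A small sketch of that expansion, using an invented antonym list in place of the notebook's `antonyms` and plain f-strings in place of `editor.template` (the article is hard-coded as "a", so the toy words start with consonants):

```python
# Hypothetical antonym pairs; the notebook's real `antonyms` list is defined elsewhere.
toy_antonyms = [("happy", "sad"), ("brave", "cowardly")]

def antonym_question_pairs(pair):
    x0, x1 = pair
    return [
        (f"How can I become a {x0} person?",
         f"How can I become a person who is not {x1}?"),
        (f"How can I become a {x1} person?",
         f"How can I become a person who is not {x0}?"),
    ]

examples = [qp for pair in toy_antonyms for qp in antonym_question_pairs(pair)]
labels = [1] * len(examples)  # every pair is expected to be a duplicate
for qp in examples:
    print(qp)
```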

## Coref
pronoun test

In [86]:
# MFT for coreference: swapping who rejects whom (`he ... her` vs. `she ... him`) gives a different question.
t = editor.template(
    [(
        'If {male} and {female} were alone, do you think he would reject her?',
        'If {male} and {female} were alone, do you think she would reject him?',
    ),
        (
        'If {female} and {male} were alone, do you think he would reject her?',
        'If {female} and {male} were alone, do you think she would reject him?',
    )
    ],
    remove_duplicates=True, 
    nsamples=1000,
    unroll=True)
name = 'Simple coref: he and she'
desc = '' 
test = MFT(**t, labels=0, name=name, description=desc, capability='Coref')
# test.run(new_pp)
# test.summary(n=3)
suite.add(test)

In [87]:
# MFT for coreference: `his` / `her` must resolve to the right person, so asking about the other person's family is a different question.
t = editor.template(
    [(
        'If {male} and {female} were married, would his family be happy?',
        'If {male} and {female} were married, would {female}\'s family be happy?',
    ),(
        'If {male} and {female} were married, would her family be happy?',
        'If {male} and {female} were married, would {male}\'s family be happy?',
    ),
    ]
        ,
    unroll=True,
    remove_duplicates=True, 
    nsamples=1000)
name = 'Simple coref: his and her'
desc = '' 
test = MFT(**t, labels=0, name=name, description=desc, capability='Coref')
# test.run(new_pp)
# test.summary(n=3)
suite.add(test)

## SRL

In [88]:
# suggest the word from the context provided
print(', '.join(editor.suggest('Who is the best {mask} in the world?')))

boxer, footballer, player, athlete, magician, wrestler, quarterback, actor, coach, rapper, singer, cyclist, fighter, hacker, shooter, gamer, chef, goalkeeper, writer, journalist, dancer, drummer, cook, robot, trainer, goalie, photographer, musician, person, comedian, referee, man, defender, fisherman, runner, lawyer, doctor, DJ, pitcher, teacher, baker, sniper, guy, programmer, manager, football, dog, philosopher, guitarist, kicker, leader, thief, surgeon, driver, soldier, engineer, assassin, artist, QB, student, horse, striker, mathematician, team, worker, politician, vegetarian, scientist, hunter, catcher, vegan, fan, broadcaster, hitter, guard, receiver, human, businessman, diver, blogger, actress, AI, negotiator, basketball, psychologist, bomber, clown, poet, pizza, economist, judge, astronaut, pirate, salesman, pilot, accountant, linebacker, translator, spy, CEO, reporter, computer, farmer, sport, liar, goaltender, server, warrior, gun, friend


In [89]:
# Words hand-picked from the suggestions above
thing = ['chef', 'boxer', 'player', 'footballer', 'athlete', 'rapper', 'actor', 'singer', 'cook', 'magician', 'coach', 'cyclist', 'wrestler', 'drummer', 'musician', 'quarterback', 'hacker', 'baker', 'fighter', 'journalist', 'teacher', 'doctor', 'gamer', 'husband', 'DJ', 'person', 'man', 'woman', 'surgeon', 'comedian', 'trainer', 'programmer', 'guitarist', 'goalkeeper']

In [90]:
# suggest the word from the context provided
print(', '.join(editor.suggest('Who do {mask} think is the the best {thing} in the world?', thing=thing)))

you, YOU, people, we, they, readers, I, fans, You, your, guys, ya, u, Americans, experts, others, scientists, voters, the, some, men, conservatives, students, folks, critics, players, Canadians, everyone, authors, historians, analysts, friends, celebrities, all, judges, respondents, women, i, most, coaches, liberals, supporters, members, audiences, viewers, journalists, researchers, Australians, many, editors, gamers, Republicans, comedians, he, artists, reporters, participants, writers, parents, atheists, U, consumers, yo, veterans, teachers, independents, millennials, commentators, kids, YOUR, pros, scholars, feminists, users, those, anyone, each, individuals, athletes, pundits, economists, ye, both, stars, philosophers, reviewers, doctors, Christians, two, yours, listeners, candidates, politicians, insiders, adults, any, leaders, Mormons, skeptics, polls, archaeologists, Democrats, teenagers, guests, teams, ladies, psychologists, investigators, contestants, legends, humans, not, wre

In [91]:
# Subjects hand-picked from the suggestions above

subjects = ['you', 'people', 'readers', 'guys', 'fans', 'experts', 'scientists', 'Americans', 'students', 'men', 'voters', 'authors', 'conservatives', 'women', 'Canadians', 'analysts', 'critics', 'judges', 'artists', 'researchers', 'liberals', 'historians', 'Australians', 'journalists', 'Republicans', 'coaches', 'parents', 'kids', 'economists', 'reporters', 'consumers', 'veterans', 'doctors']

In [92]:
# suggest the word from the context provided
print(', '.join(editor.suggest('Who do {subjects} think is the the {mask} {thing} in the world?', thing=thing, subjects=subjects)[:50]))

best, greatest, worst, smartest, top, finest, strongest, toughest, biggest, fastest, deadliest, coolest, hottest, better, happiest, hardest, oldest, richest, safest, great, elite, brightest, superior, premier, ultimate, leading, Greatest, youngest, favorite, busiest, largest, BEST, highest, weakest, easiest, professional, perfect, most, newest, Best, foremost, outstanding, first, dominant, next, premiere, wealthiest, only, star, quickest


In [93]:
# Superlatives hand-picked from the suggestions above
best = ['best', 'greatest', 'worst', 'top', 'smartest', 'strongest', 'finest', 'happiest', 'coolest', 'richest', 'leading', 'brightest', 'premier', 'ultimate', 'dominant']

In [94]:
# MFT for SRL: moving the subject (`Who do X think ...` vs. `... according to X`) keeps the meaning, so the pair is a duplicate.
t = editor.template((
    'Who do {subjects} think is the {best} {thing} in the world?',
    'Who is the {best} {thing} in the world according to {subjects}?'
),
    subjects=subjects,
    best=best,
    thing=thing,
    remove_duplicates=True, 
    nsamples=1000)
name = 'Who do X think - Who is the ... according to X'
desc = '' 
test = MFT(**t, labels=1, name=name, description=desc, capability='SRL')
suite.add(test)

In [95]:
# suggest the word from the context provided
print(', '.join([str(x) for x in editor.suggest('Are {mask} smaller than {a}?', a=['bananas', 'dogs', 'cars', 'cats', 'elephants'])][:100]))

humans, cats, you, dogs, people, mice, pigs, birds, sheep, cows, rats, chickens, fish, bears, we, elephants, rabbits, lions, monkeys, they, snakes, bees, spiders, bats, puppies, dolphins, babies, kittens, children, frogs, ants, butterflies, insects, turtles, trees, ducks, whales, robots, animals, bugs, kids, crabs, carrots, dragons, mosquitoes, cars, sharks, dinosaurs, horses, tigers, wolves, primates, cattle, men, goats, chimpanzees, deer, apes, balls, reptiles, rodents, worms, mammals, flies, apples, ponies, mushrooms, plants, seals, potatoes, ticks, women, Lions, vampires, computers, stones, Pokémon, twins, things, eggs, pets, dwarves, bulls, Indians, boys, elves, calves, toys, beetles, bananas, flowers, trolls, houses, girls, burgers, beans, diamonds, guys, coins, demons


In [96]:
# Use the first 100 suggestions as `things` (this replaces the `things` list from the Negation section)
things = editor.suggest('Are {mask} smaller than {a}?',a=['bananas', 'dogs', 'cars', 'cats', 'elephants'] )[:100]

In [97]:
# suggest the word from the context provided
print(', '.join([str(x) for x in editor.suggest('Are {a} {mask} than {a2}?', a=things)][:100]))

smarter, better, bigger, faster, different, worse, stronger, cooler, tougher, smaller, safer, wiser, larger, nicer, healthier, more, happier, weaker, cheaper, greater, less, slower, harder, cleaner, quicker, easier, quieter, hotter, darker, heavier, lighter, brighter, older, closer, taller, colder, higher, sharper, shorter, louder, warmer, simpler, longer, lesser, lower, younger, other, deeper, softer, thicker, richer, important, superior, fewer, clearer, stranger, broader, wider, intelligent, rather, real, finer, differently, dangerous, inferior, smoother, poorer, thinner, valuable, stupid, tighter, farther, smart, dumb, stricter, related, harsher, interesting, sooner, normal, weird, rare, equal, funny, wealthier, poisonous, bad, beautiful, anymore, wild, racist, evil, similar, good, strange, safe, crazy, true, fat, newer


In [98]:
# Comparatives hand-picked from the suggestions above
comp = ['better', 'worse', 'cheaper', 'bigger', 'louder', 'longer', 'larger', 'smaller', 'warmer', 'colder', 'thicker', 'lighter', 'heavier']

Order doesn't matter for comparison

In [99]:
# MFT for SRL: reordering the two things being compared should still give a duplicate question.
t = editor.template([
    (
    'Are {t1} {comp} than {t2}?',
    'What is {comp}, {t2} or {t1}?'
    ),
    (
    'Are {t1} {comp} than {t2}?',
    'Are {t2} {comp} than {t1}?',
    ),
    (
    'Are {t1} {comp} than {t2}?',
    'What is {comp}, {t1} or {t2}?',
    )
]
    ,
    t = things,
    comp = comp,
    remove_duplicates=True, 
    nsamples=1000)
name = 'Order does not matter for comparison'
desc = '' 
test = MFT(**t, labels=1, name=name, description=desc, capability='SRL')
suite.add(test)

In [101]:
# suggest the word from the context provided
print(', '.join(editor.suggest('Is {first_name1} {mask} to {first_name2}?', remove_duplicates=True)[:100]))
print()
print(', '.join(editor.suggest('Is {first_name1} {mask} {first_name2}?', remove_duplicates=True)[:100]))

married, related, close, engaged, closer, talking, attracted, lying, connected, linked, speaking, going, referring, loyal, on, writing, listening, tied, important, up, committed, proposing, responding, turning, faithful, Married, dead, similar, bound, closest, next, gay, true, returning, attached, addicted, allergic, coming, abusive, straight, crazy, hostile, indebted, mean, reacting, devoted, right, getting, good, opposed, new, nice, dangerous, back, real, friendly, attractive, trying, clinging, off, bisexual, available, happening, happy, moving, nicer, kin, drawn, Close, known, proposed, alive, entitled, superior, kind, equal, used, answering, cruel, supposed, cheating, truthful, special, relating, out, tending, written, father, unfair, headed, guilty, dedicated, lied, calling, down, cold, wed, relevant, heir, threatening

or, really, and, Really, dating, still, a, with, actually, like, not, the, leaving, after, &, meeting, now, REALLY, another, from, marrying, even, for, /, becoming

In [102]:
# Symmetric relations hand-picked from the suggestions above
symmetric = ['dating', 'married to', 'close to', 'engaged to', 'connected to', 'friends with', 'related to', 'an acquaintance of']

In [103]:
# MFT for SRL: for a symmetric relation, swapping the two names does not change the question.
t = editor.template((
    'Is {first_name1} {s} {first_name2}?',
    'Is {first_name2} {s} {first_name1}?',
),
    s = symmetric,
    remove_duplicates=True, 
    nsamples=1000)
name = 'Order does not matter for symmetric relations'
desc = 'e.g. dating, married to, close to, engaged to, etc' 
test = MFT(**t, labels=1, name=name, description=desc, capability='SRL')
suite.add(test)

Order matters for asymmetric relations
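
The contrast with the symmetric case can be sketched without CheckList: the swapped question pair is built the same way in both cases, and only the expected label differs. The relation lists below are shortened copies of the notebook's lists; `swapped_pair` and `expected_label` are hypothetical helpers.

```python
symmetric_relations = ['dating', 'married to', 'friends with']
asymmetric_relations = ['lying to', 'indebted to', 'protecting']

def swapped_pair(name1, name2, relation):
    # Same relation, with the two names exchanged.
    return (f"Is {name1} {relation} {name2}?", f"Is {name2} {relation} {name1}?")

def expected_label(relation):
    # 1 = duplicate (order does not matter), 0 = not duplicate (order matters).
    return 1 if relation in symmetric_relations else 0

for relation in ['married to', 'indebted to']:
    print(swapped_pair("Anna", "Tom", relation), expected_label(relation))
```

"Is Anna married to Tom?" and "Is Tom married to Anna?" ask the same thing, while "Is Anna indebted to Tom?" and "Is Tom indebted to Anna?" do not.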

In [104]:
# Asymmetric relations, chosen by hand

asymmetric = ['hurting', 'lying to', 'loyal to', 'faithful to', 'proposing to', 'indebted to', 'abusive to', 'using', 'expecting', 'beating', 'punching', 'raising', 'poisoning', 'protecting', 'kidnapping']

In [105]:
# MFT for SRL: for an asymmetric relation, swapping the two names changes the question, so the pair is not a duplicate.
t = editor.template((
    'Is {first_name1} {s} {first_name2}?',
    'Is {first_name2} {s} {first_name1}?',
),
    s = asymmetric,
    remove_duplicates=True, 
    nsamples=1000)
name = 'Order does matter for asymmetric relations'
desc = 'e.g. hurting, lying to, faithful to, etc'
test = MFT(**t, labels=0, name=name, description=desc, capability='SRL')
suite.add(test)

More traditional SRL

In [106]:
# suggest the word from the context provided
print(', '.join(editor.suggest('Did John buy the {mask}?', remove_duplicates=True)[:100]))

house, farm, property, land, rights, ticket, stake, boat, book, company, tickets, gun, yacht, island, papers, newspaper, horse, car, ranch, phone, paper, shotgun, plot, books, estate, team, idea, contract, ship, business, plane, game, shares, franchise, painting, Bible, rifle, time, guns, tractor, building, place, horses, tract, cattle, home, stock, newspapers, beer, piece, club, ring, campaign, station, film, church, store, money, Ark, watch, dog, sword, deal, suit, title, factory, story, manuscript, rest, pizza, truck, castle, radio, seat, diamond, machine, cows, telephone, plan, letter, submarine, loan, weapons, mortgage, tapes, equipment, letters, chickens, project, election, cow, pieces, site, insurance, magazine, twins, chicken, fish, plant, paintings


In [107]:
# Objects hand-picked from the suggestions above
obj = ['farm', 'house', 'property', 'company', 'land', 'ticket', 'newspaper', 'book', 'island', 'estate', 'ranch', 'boat', 'horse', 'paper', 'business', 'gun', 'game', 'factory', 'castle', 'painting', 'rifle', 'car', 'school', 'building']

In [108]:
# suggest the word from the context provided
print(', '.join(editor.suggest('Did John {mask} the {obj}?', obj=obj, remove_duplicates=True)[:100]))

buy, sell, take, get, lose, have, own, leave, see, want, use, keep, run, need, steal, win, share, miss, purchase, seize, find, receive, abandon, manage, break, inherit, control, return, recover, save, raise, drop, move, kill, enjoy, hit, tip, touch, throw, like, know, rob, fix, pull, hold, bring, handle, finish, claim, destroy, crash, flip, sign, shoot, call, grab, remember, start, wreck, burn, forget, build, rebuild, reclaim, make, join, clear, enter, lead, give, remove, name, pick, secure, split, watch, fire, play, quit, settle, carry, close, survive, donate, accept, nail, change, eat, acquire, stop, rent, complete, forfeit, borrow, retain, deliver, catch, retrieve, raid, ruin


In [110]:
import pattern.en  # `import pattern` alone may not load the `en` submodule
# Pair each selected verb with its past participle (the tense of 'stolen')
verbs = ['buy', 'purchase', 'sell', 'leave', 'own', 'take', 'keep', 'want', 'lose', 'destroy', 'inherit', 'find', 'use', 'need', 'receive', 'return', 'like', 'enjoy', 'abandon', 'manage', 'remember', 'miss', 'move', 'seize', 'steal']
a = pattern.en.tenses('stolen')[0]
verbs = [(v, pattern.en.conjugate(v, *a)) for v in verbs]
verbs[3] = ('leave', 'left')
verbs

[('buy', 'bought'),
 ('purchase', 'purchased'),
 ('sell', 'sold'),
 ('leave', 'left'),
 ('own', 'owned'),
 ('take', 'taken'),
 ('keep', 'kept'),
 ('want', 'wanted'),
 ('lose', 'lost'),
 ('destroy', 'destroyed'),
 ('inherit', 'inherited'),
 ('find', 'found'),
 ('use', 'used'),
 ('need', 'needed'),
 ('receive', 'received'),
 ('return', 'returned'),
 ('like', 'liked'),
 ('enjoy', 'enjoyed'),
 ('abandon', 'abandoned'),
 ('manage', 'managed'),
 ('remember', 'remembered'),
 ('miss', 'missed'),
 ('move', 'moved'),
 ('seize', 'seized'),
 ('steal', 'stolen')]

In [111]:
# MFT for SRL: a correct active-to-passive rewrite keeps the meaning, so the pair is a duplicate.
t = editor.template((
    'Did {first_name} {verb[0]} the {obj}?',
    'Was the {obj} {verb[1]} by {first_name}?'
),
    verb=verbs,
    obj=obj,
    remove_duplicates=True, 
    nsamples=1000)
name = 'traditional SRL: active / passive swap'
desc = ''
test = MFT(**t, labels=1, name=name, description=desc, capability='SRL')
suite.add(test)

In [112]:
# MFT for SRL: an incorrect passive rewrite swaps agent and object, so the pair is not a duplicate.
t = editor.template((
    'Did {first_name} {verb[0]} the {obj}?',
    'Was {first_name} {verb[1]} by the {obj}?'
),
    verb=verbs,
    obj=obj,
    remove_duplicates=True, 
    nsamples=1000)
name = 'traditional SRL: wrong active / passive swap'
desc = ''
test = MFT(**t, labels=0, name=name, description=desc, capability='SRL')
suite.add(test)
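
To make the two SRL templates above concrete, here is a minimal sketch in plain Python (no CheckList dependency) of how one `(base, participle)` verb tuple expands into a correct and an incorrect passive paraphrase. The helper `make_pairs`, the name "Mary" and the object "car" are hypothetical and chosen only for illustration; in the notebook, `editor.template` does the actual sampling.

```python
# Hypothetical helper (not part of CheckList): expands one verb tuple into both question pairs.
def make_pairs(first_name, verb, obj):
    base, participle = verb
    active = f"Did {first_name} {base} the {obj}?"
    # Correct swap: the object becomes the passive subject (expected label 1).
    correct = f"Was the {obj} {participle} by {first_name}?"
    # Wrong swap: agent and patient roles are exchanged (expected label 0).
    wrong = f"Was {first_name} {participle} by the {obj}?"
    return (active, correct, 1), (active, wrong, 0)

duplicate, non_duplicate = make_pairs("Mary", ("steal", "stolen"), "car")
print(duplicate)
print(non_duplicate)
```

The first tuple is a pair a model should label as duplicate, the second a pair it should label as non-duplicate, which mirrors `labels=1` and `labels=0` in the two cells above.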

With people

In [113]:
# Suggest verbs that fit the masked slot between two first names
print(', '.join(editor.suggest('Does {first_name} {mask} {first_name2}?', remove_duplicates=True)[:100]))

know, hate, love, like, want, need, kill, miss, remember, have, see, marry, blame, Know, understand, mean, trust, date, forgive, recognize, deserve, tell, get, dislike, bother, Love, find, meet, Want, hurt, beat, Remember, notice, save, believe, murder, mention, fear, leave, prefer, help, resemble, or, follow, choose, shoot, own, support, resent, use, survive, and, eat, play, attack, visit, Like, owe, despise, respect, hit, scare, call, recognise, regret, fancy, kiss, lose, hear, threaten, Hate, haunt, LIKE, ask, suspect, protect, take, touch, mind, chase, kidnap, WANT, Need, represent, accept, control, hire, affect, include, See, divorce, KILL, reject, pity, NEED, possess, influence, stalk, Kill, KNOW


In [114]:
# Verbs that take a person as their object, paired with their past participles
pverb = ['love', 'hate', 'like', 'remember', 'recognize', 'trust', 'deserve', 'understand', 'blame', 'dislike', 'prefer', 'follow', 'notice', 'hurt', 'bother', 'support', 'believe', 'accept', 'attack']
a = pattern.en.tenses('stolen')[0]
pverb = [(v, pattern.en.conjugate(v, *a)) for v in pverb]

# SRL test with people: an active question and its correct passive paraphrase (names swapped along with the voice) should be duplicates.
t = editor.template((
    'Does {first_name} {verb[0]} {first_name2}?',
    'Is {first_name2} {verb[1]} by {first_name}?',
),
    verb=pverb,
    obj=obj,
    remove_duplicates=True, 
    nsamples=1000)
name = 'traditional SRL: active / passive swap with people'
desc = ''
test = MFT(**t, labels=1, name=name, description=desc, capability='SRL')
suite.add(test)

In [115]:
pverb = ['love', 'hate', 'like', 'remember', 'recognize', 'trust', 'deserve', 'understand', 'blame', 'dislike', 'prefer', 'follow', 'notice', 'hurt', 'bother', 'support', 'believe', 'accept', 'attack']
a = pattern.en.tenses('stolen')[0]
pverb = [(v, pattern.en.conjugate(v, *a)) for v in pverb]

# SRL test with people: a passive paraphrase that keeps the names in the same positions reverses the roles, so the pair should not be duplicates.
t = editor.template((
    'Does {first_name} {verb[0]} {first_name2}?',
    'Is {first_name} {verb[1]} by {first_name2}?',
),
    verb=pverb,
    obj=obj,
    remove_duplicates=True, 
    nsamples=1000)
# data = [tuple(np.random.choice(x, 2, replace=False)) for x in data]
name = 'traditional SRL: wrong active / passive swap with people'
desc = ''
test = MFT(**t, labels=0, name=name, description=desc, capability='SRL')
# test.run(new_pp)
# test.summary(n=3)
suite.add(test)

## Logic

In [116]:
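# Logic test: the same person asked about two different professions should not be duplicates.
# Note: this test is built without a name or capability and is not added to the suite.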
t = editor.template((
    'Is {first_name} {last_name} {a:p1} ?',
    'Is {first_name} {last_name} {a:p3}?',
),
    p=professions,
    remove_duplicates=True, 
    nsamples=1000)
test = MFT(**t, labels=0)


In [117]:
# Logic test: `A or B` is not the same question as `C and D` (four different professions)
t = editor.template((
    'Is {first_name} {last_name} {a:p1} or {a:p2}?',
    'Is {first_name} {last_name} simultaneously {a:p3} and {a:p4}?',
),
    p=professions,
    remove_duplicates=True, 
    nsamples=1000)
name = 'A or B is not the same as C and D'
desc = ''
test = MFT(**t, labels=0, name=name, description=desc, capability='Logic')
suite.add(test)


In [118]:
# Logic test: `A or B` is not the same question as `A and B` (the same two professions)
t = editor.template((
    'Is {first_name} {last_name} {a:p1} or {a:p2}?',
    'Is {first_name} {last_name} simultaneously {a:p1} and {a:p2}?',
),
    p=professions,
    remove_duplicates=True, 
    nsamples=1000)
name = 'A or B is not the same as A and B'
desc = ''
test = MFT(**t, labels=0, name=name, description=desc, capability='Logic')
# test.run(new_pp)
# test.summary(n=3)
suite.add(test)

In [119]:
# Logic test: `A and B` is the same as `B and A`, and `A or B` is the same as `B or A`
# (the two professions are swapped, the connective stays the same)
t = editor.template((
    'Is {first_name} {last_name} {a:p1} {andor} {a:p2}?',
    'Is {first_name} {last_name} {a:p2} {andor} {a:p1}?',
),
    andor=['and', 'or'],
    p=professions,
    remove_duplicates=True, 
    nsamples=1000)
name = 'A and / or B is the same as B and / or A'
desc = ''
test = MFT(**t, labels=1, name=name, description=desc, capability='Logic')
suite.add(test)
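
The expected labels in the `or` / `and` tests above follow from propositional logic. As a small sanity check, independent of CheckList and of any model, the sketch below enumerates all truth assignments for two propositions; the variable names are illustrative only.

```python
from itertools import product

# Enumerate every truth assignment for two propositions A and B.
assignments = list(product([False, True], repeat=2))

# "A or B" and "A and B" disagree on some assignments, so the questions differ (label 0).
or_equals_and = all((a or b) == (a and b) for a, b in assignments)

# Swapping the operands never changes the result, so the questions match (label 1).
or_commutes = all((a or b) == (b or a) for a, b in assignments)
and_commutes = all((a and b) == (b and a) for a, b in assignments)

print(or_equals_and, or_commutes, and_commutes)  # False True True
```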

In [120]:
# Logic test: a nationality adjective plus a profession is equivalent to their conjunction,
# e.g. 'Is X a Thai journalist?' = 'Is X a journalist and Thai?'
# (the adjective and the profession are separated)
t = editor.template((
    'Is {first_name} {last_name} {a:nat} {p1}?',
    'Is {first_name} {last_name} {a:p1} and {nat}?',
),
    nat=editor.lexicons['nationality'][:20],
    p=professions,
    remove_duplicates=True, 
    nsamples=1000)
name = 'a {nationality} {profession} = a {profession} and {nationality}'
desc = ''
test = MFT(**t, labels=1, name=name, description=desc, capability='Logic')
suite.add(test)

Reflexivity

In [121]:
# Reflexivity: a question paired with itself, (q, q), should always be labelled duplicate
t = Perturb.perturb(list(all_questions), lambda x:(x, x), nsamples=1000, keep_original=False)
name = 'Reflexivity: (q, q) should be duplicate'
desc = ''
test = MFT(**t, labels=1, name=name, description=desc, capability='Logic')
suite.add(test)

Symmetry

In [122]:
# Symmetry: the prediction for (a, b) should equal the prediction for (b, a)

t = Perturb.perturb(qs, lambda x:(x[1], x[0]), nsamples=500, keep_original=True)
name = 'Symmetry: f(a, b) = f(b, a)'
desc = ''
test = INV(t.data, name=name, description=desc, capability='Logic')
suite.add(test)
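
The reflexivity and symmetry properties can be illustrated without a real model. The sketch below uses a toy predictor, `toy_predict`, which is purely an assumption for demonstration (it treats two questions as duplicates when they contain the same lowercased words) and is not the model tested in this notebook.

```python
# Toy stand-in for a duplicate-question model (an assumption, not the notebook's model):
# two questions count as duplicates when they share the same set of lowercased words.
def toy_predict(pair):
    q1, q2 = pair
    return int(set(q1.lower().split()) == set(q2.lower().split()))

pairs = [
    ("Is Tom a doctor?", "Is Tom a nurse?"),
    ("How do I learn Python?", "how do i learn python?"),
]

# Reflexivity: (q, q) must always be predicted as duplicate.
reflexive_ok = all(toy_predict((q, q)) == 1 for pair in pairs for q in pair)

# Symmetry: swapping the two questions must not change the prediction.
symmetric_ok = all(toy_predict((a, b)) == toy_predict((b, a)) for a, b in pairs)

print(reflexive_ok, symmetric_ok)
```

A real model gives no such guarantee, which is why the notebook checks reflexivity with an `MFT` (fixed label 1) and symmetry with an `INV` (prediction unchanged after the swap).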

In [125]:
# Derive implied labels for question pairs that are not in the data:
# if b and c each share a labelled pair with the same question x, infer the label of (b, c).
import collections
def extract_unknown_implications(pairs, labels):
    graph = collections.defaultdict(lambda: set())
    ls = {}
    for x, y in zip(pairs, labels):
        graph[x[0]].add(x[1])
        graph[x[1]].add(x[0])
        t = tuple(sorted(x))
        ls[t] = y

    d = []
    l = []
    for x in graph:
        # A question with a single neighbour cannot form a triplet
        if len(graph[x]) == 1:
            continue
        # All unordered pairs of x's neighbours whose label is not already known
        new = list(set([tuple(sorted(a)) for a in itertools.product(list(graph[x]), list(graph[x])) if a[0] != a[1]]))
        new = [a for a in new if a not in ls]
        for b, c in new:
            t1 = tuple(sorted((x, b)))
            t2 = tuple(sorted((x, c)))
            l1 = ls[t1]
            l2 = ls[t2]
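            if l1 + l2 == 2:
                l3 = 1  # both pairs are duplicates, so (b, c) must be a duplicate
            elif l1 + l2 == 1:
                l3 = 0  # exactly one pair is a duplicate, so (b, c) cannot be one
            else:
                continue  # two non-duplicate pairs imply nothing about (b, c)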
            new_x = [(x, b), (x, c), (b, c)]
            new_l = np.array([l1, l2, l3])
            d.append(new_x)
            l.append(new_l)
    return d, l

In [126]:
data, ls = extract_unknown_implications(qs, labels)  # qs and labels come from dev.tsv, loaded earlier in the notebook
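
To see what the extraction produces, here is a condensed, self-contained sketch of the same idea run on three made-up pairs. `unknown_implications` is a hypothetical re-statement written for this example: unlike the notebook function it returns plain lists instead of NumPy arrays and iterates in sorted order so that the output is deterministic. The question ids `q1` to `q4` are placeholders.

```python
import itertools
from collections import defaultdict

def unknown_implications(pairs, labels):
    """Condensed sketch of the idea behind extract_unknown_implications."""
    neighbours = defaultdict(set)
    known = {}
    for (a, b), y in zip(pairs, labels):
        neighbours[a].add(b)
        neighbours[b].add(a)
        known[tuple(sorted((a, b)))] = y

    triplets, triplet_labels = [], []
    for x, ns in neighbours.items():
        for b, c in itertools.combinations(sorted(ns), 2):
            if (b, c) in known:
                continue  # (b, c) is already labelled, nothing to infer
            l1 = known[tuple(sorted((x, b)))]
            l2 = known[tuple(sorted((x, c)))]
            if l1 + l2 == 0:
                continue  # two non-duplicates imply nothing about (b, c)
            l3 = 1 if l1 + l2 == 2 else 0
            triplets.append([(x, b), (x, c), (b, c)])
            triplet_labels.append([l1, l2, l3])
    return triplets, triplet_labels

# q1 is a duplicate of q2 and q3, but not of q4.
toy_pairs = [("q1", "q2"), ("q1", "q3"), ("q1", "q4")]
toy_labels = [1, 1, 0]
toy_data, toy_ls = unknown_implications(toy_pairs, toy_labels)
for triplet, label in zip(toy_data, toy_ls):
    print(triplet, label)
```

Each triplet holds the two known pairs followed by the inferred pair. Here `(q2, q3)` is inferred as duplicate, while `(q2, q4)` and `(q3, q4)` are inferred as non-duplicate.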

In [127]:
# Expectation for a triplet of predictions on [(x, a), (x, b), (a, b)].
# Returns passing values when the third prediction follows from the first two,
# negative values (failure) when it does not, and None when no implication applies.
def expect_triplet(xs, preds, confs, labels, meta=None):
    # f(x, a) = 1 and f(x, b) = 1  =>  f(a, b) must be 1
    if (preds[0] + preds[1]) == 2:
        if preds[2] != 1:
            return np.array([-3, -2, -1])
        else:
            return np.array([True, True, True])
    # exactly one of f(x, a), f(x, b) is 1  =>  f(a, b) must be 0
    if (preds[0] + preds[1]) == 1:
        if preds[2] != 0:
            return np.array([-3, -2, -1])
        else:
            return np.array([True, True, True])
    return None

# Wrap `expect_triplet` as a per-test-case expectation and store it in `expect`
expect = Expect.testcase(expect_triplet)

In [128]:
# Build a DIR test from the extracted triplets, using the `expect` expectation
name = 'Testing implications'
desc = 'f(x, a) = 1 and f(x, b) = 1 => f(a, b) = 1\nf(x, a) = 1 and f(x, b) = 0 => f(a, b) = 0\n Only used (x, a, b) such that (x, a) and (x, b) in val dataset and (a, b) is not.\n Expectation function filters out examples where f(x, a) or f(x, b) are incorrect'
test = DIR(data, expect, labels=ls, name=name, description=desc, capability='Logic')
suite.add(test)
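
The implication rule stated in `desc` can be written down as a tiny standalone check. The function below, `check_triplet`, is a hypothetical illustration of that rule only; it is not the CheckList expectation API and it ignores confidences and gold labels.

```python
def check_triplet(preds):
    """Return True/False for consistent/inconsistent triplets, None if the rule does not apply."""
    p_xa, p_xb, p_ab = preds
    if p_xa + p_xb == 2:
        return p_ab == 1  # both premises are duplicates => the conclusion must be duplicate
    if p_xa + p_xb == 1:
        return p_ab == 0  # exactly one premise is duplicate => the conclusion must not be
    return None  # neither premise is duplicate: nothing can be inferred

for preds in [(1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1)]:
    print(preds, check_triplet(preds))
```

The second and fourth triplets are the inconsistent ones: the model contradicts what its own first two predictions imply.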

In [129]:
path = 'qqp_suite.pkl'  # output file for the test suite
suite.save(path)  # save the whole test suite to that file