# QQP test with RoBERTa

This notebook tests the performance of the RoBERTa model `JeremiahZ/roberta-base-mrpc` on the QQP dataset. It loads the pre-trained model, runs it on the test data, and analyzes the results.

## Import the libraries

In [1]:
%load_ext autoreload
%autoreload 2

import itertools

import numpy as np
import spacy

import checklist
import checklist.editor
import checklist.text_generation
from checklist.expect import Expect
from checklist.perturb import Perturb
from checklist.test_suite import TestSuite
from checklist.test_types import MFT, INV, DIR

## Import the model

In [3]:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("JeremiahZ/roberta-base-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("JeremiahZ/roberta-base-mrpc")
model.eval()


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

In [12]:
# Split a list into batches so the model processes at most n items at a time
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
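
As a small illustration of how `chunks` splits its input, the sketch below batches five placeholder question pairs two at a time. The pair strings are hypothetical and only stand in for real QQP questions.

```python
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

# Hypothetical data: five placeholder question pairs.
pairs = [("q1a", "q1b"), ("q2a", "q2b"), ("q3a", "q3b"),
         ("q4a", "q4b"), ("q5a", "q5b")]

# Batch them two at a time; the last batch holds the leftover pair.
batches = list(chunks(pairs, 2))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [2, 2, 1]
```

The last batch is simply shorter than the others, so no pair is dropped when the data size is not a multiple of the batch size.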

In [13]:
# Run the RoBERTa model on (question1, question2) pairs, one batch at a time
def batch_qqp(data, batch_size=128):
    ret = []
    for d in chunks(data, batch_size):
        # Tokenize each pair together and pad to the longest pair in the batch
        t = tokenizer([a[0] for a in d], [a[1] for a in d], return_tensors='pt', padding=True)
        with torch.no_grad():
            # model(**t)[0] holds the logits; softmax turns them into class probabilities
            probs = torch.softmax(model(**t)[0], dim=1).numpy()
        ret.append(probs)
    return np.vstack(ret)
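
The softmax-and-stack step in `batch_qqp` can be tried without loading the model. The sketch below is only an illustration: the logits are made up, and a small NumPy `softmax` helper (hypothetical, standing in for `torch.softmax`) is used so the example runs on its own.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax; a NumPy stand-in for torch.softmax(x, dim=1)."""
    shifted = x - x.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# Made-up logits for two batches (2 pairs, then 1 pair); not real model output.
fake_batches = [
    np.array([[2.0, -1.0], [0.5, 0.5]]),
    np.array([[-3.0, 3.0]]),
]

ret = []
for logits in fake_batches:
    ret.append(softmax(logits))  # one (batch_size, 2) probability array per batch

# Stack the per-batch arrays into a single (num_pairs, 2) array.
all_probs = np.vstack(ret)
print(all_probs.shape)  # (3, 2)
```

Each row of the stacked array sums to 1, which is the format the next cell wraps for CheckList.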
    

In [14]:
from checklist.pred_wrapper import PredictorWrapper
# wrapped_pp returns a tuple of (predictions, softmax confidences)
wrapped_pp = PredictorWrapper.wrap_softmax(batch_qqp)  # wrap batch_qqp so its output has the format CheckList expects
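
To make the `(predictions, softmax confidences)` tuple concrete, here is a rough sketch of the behavior described in the comment above. It is not CheckList's actual implementation: `wrap_softmax_sketch` and `fake_batch_qqp` are hypothetical names, and the probabilities are made up.

```python
import numpy as np

def wrap_softmax_sketch(softmax_fn):
    """Hypothetical stand-in: turn a probability function into one
    that returns (predicted labels, probabilities)."""
    def pred_and_conf(inputs):
        confs = np.asarray(softmax_fn(inputs))
        preds = np.argmax(confs, axis=1)  # most probable class per row
        return preds, confs
    return pred_and_conf

def fake_batch_qqp(data):
    # Made-up probabilities, one row per input pair (not real model output).
    return np.array([[0.9, 0.1], [0.2, 0.8]])[:len(data)]

wrapped = wrap_softmax_sketch(fake_batch_qqp)
preds, confs = wrapped([("a", "b"), ("c", "d")])
print(preds)  # [0 1]
```

The predicted label is the index of the largest probability in each row, and the full probability rows are returned alongside it.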

In [15]:
# Quick sanity check on two sentence pairs
s0 = "The restaurant 'Haidilao' serves chinese hotpot."
s1 = "Mala soup is very delicious"
s2 = "Located in wide area, Haidilao is known for good services."
batch_qqp([(s0, s1), (s0, s2)])

array([[0.9924535 , 0.00754646],
       [0.99132866, 0.0086714 ]], dtype=float32)
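
The probabilities above can be turned into labels with an argmax. The sketch below copies the two output rows and assumes that index 0 means "not duplicate" and index 1 means "duplicate"; that label order is an assumption and should be checked against the model's configuration.

```python
import numpy as np

# Output rows copied from the cell above.
probs = np.array([[0.9924535, 0.00754646],
                  [0.99132866, 0.0086714]], dtype=np.float32)

# Assumed label order (not confirmed by the notebook): 0 = not duplicate, 1 = duplicate.
label_names = ["not duplicate", "duplicate"]

preds = probs.argmax(axis=1)      # predicted class per pair
confidence = probs.max(axis=1)    # probability of the predicted class

for p, c in zip(preds, confidence):
    print(label_names[p], round(float(c), 4))
```

Under that assumption, both pairs are predicted as class 0 with over 99% confidence.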

In [16]:
suite_path = 'qqp_suite.pkl'  # path to the pickled test suite
suite = TestSuite.from_file(suite_path)

In [17]:
# Run each test on 10 test cases. With n=100, the run took many hours.
# overwrite=True replaces the results of any previous run.
suite.run(wrapped_pp, n=10, overwrite=True)

Running Modifier: adj
Predicting 10 examples
Running different adjectives
Predicting 10 examples
Running Different animals
Predicting 10 examples
Running Irrelevant modifiers - animals
Predicting 10 examples
Running Irrelevant modifiers - people
Predicting 10 examples
Running Irrelevant preamble with different examples.
Predicting 10 examples
Running Preamble is relevant (different injuries)
Predicting 10 examples
Running How can I become more {synonym}?
Predicting 10 examples
Running (question, f(question)) where f(question) replaces synonyms?
Predicting 10 examples
Running Replace synonyms in real pairs
Predicting 30 examples
Running How can I become more X != How can I become less X
Predicting 10 examples
Running How can I become more X = How can I become less antonym(X)
Predicting 10 examples
Running add one typo
Predicting 30 examples
Running contrations
Predicting 28 examples
Running (q, paraphrase(q))
Predicting 964 examples
Running Product of paraphrases(q1) * paraphrases(q2)
P

In [18]:
suite.summary()  # summarize the results of each test

Vocabulary

Modifier: adj
Test cases:      1000
Test cases run:  10
Fails (rate):    10 (100.0%)

Example fails:
1.0 ('Is Elizabeth Miller an intern?', 'Is Elizabeth Miller a good intern?')
----
1.0 ('Is Albert Brooks a secretary?', 'Is Albert Brooks an exceptional secretary?')
----
1.0 ('Is Amanda Thomas an actor?', 'Is Amanda Thomas a good actor?')
----


different adjectives
Test cases:      963
Test cases run:  10
Fails (rate):    6 (60.0%)

Example fails:
0.9 ('Is Sarah Harrison gay?', 'Is Sarah Harrison trustworthy?')
----
0.6 ('Is Rachel Sullivan gay?', 'Is Rachel Sullivan famous?')
----
0.9 ('Is Emma Robertson white?', 'Is Emma Robertson mad?')
----


Different animals
Test cases:      942
Test cases run:  10
Fails (rate):    10 (100.0%)

Example fails:
1.0 ('Can I feed my turtle chocolate?', 'Can I feed my rabbit chocolate?')
----
1.0 ('Can I feed my rabbit cookies?', 'Can I feed my turtle cookies?')
----
1.0 ('Can I feed my goat insulin?', 'Can I feed my rat insulin?')
----

