# QQP test with BERT
This notebook tests the performance of the BERT model `bert-base-cased-finetuned-mrpc` on the QQP dataset. It loads the pre-trained model, runs it on the test data, and evaluates the results.

## Installation

In [1]:
%load_ext autoreload
%autoreload 2

import itertools

import numpy as np
import spacy
import checklist
import checklist.editor
import checklist.text_generation
from checklist.test_types import MFT, INV, DIR
from checklist.expect import Expect
from checklist.test_suite import TestSuite
from checklist.perturb import Perturb


## Import the model

In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
model.eval()




BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [3]:
# Split a list into fixed-size chunks so the model can process it in batches
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
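
As a quick sanity check of the helper above, the sketch below redefines `chunks` so that it runs on its own and splits a five-element list into chunks of two. The sample list is made up for illustration; note that the last chunk is shorter when the list length is not a multiple of `n`.

```python
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

# Made-up sample data, standing in for a list of question pairs
items = ["a", "b", "c", "d", "e"]
result = list(chunks(items, 2))
print(result)  # [['a', 'b'], ['c', 'd'], ['e']]
```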

In [4]:
# Run the BERT model on (question1, question2) pairs and return class probabilities
def batch_qqp(data, batch_size=128):
    ret = []
    for d in chunks(data, batch_size):
        # Tokenize each pair jointly; padding aligns sequence lengths within the batch
        t = tokenizer([a[0] for a in d], [a[1] for a in d], return_tensors='pt', padding=True)
        # On a GPU, move the inputs to the device instead:
        # t = tokenizer([a[0] for a in d], [a[1] for a in d], return_tensors='pt', padding=True).to('cuda')
        with torch.no_grad():
            # Softmax turns the logits into per-class probabilities
            probs = torch.softmax(model(**t)[0], dim=1).numpy()
            # probs = torch.softmax(model(**t)[0], dim=1).cpu().numpy()
        ret.append(probs)
    return np.vstack(ret)
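
The softmax step in `batch_qqp` converts each row of raw model scores into probabilities that sum to 1. The sketch below shows the same row-wise operation in plain Python, assuming made-up logits; `softmax_rows` is a hypothetical helper for illustration and is not what `torch.softmax` runs internally.

```python
import math

def softmax_rows(logits):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    out = []
    for row in logits:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Made-up logits for two question pairs and two classes
example_logits = [[2.0, -1.0], [0.0, 0.0]]
probs = softmax_rows(example_logits)
print(probs)
```

Equal logits give equal probabilities, and a larger logit always gets the larger probability, so taking the argmax of the probabilities picks the same class as taking the argmax of the logits.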
    

In [5]:
from checklist.pred_wrapper import PredictorWrapper
# wrapped_pp returns a tuple of (predictions, softmax confidences),
# which is the format that suite.run expects
wrapped_pp = PredictorWrapper.wrap_softmax(batch_qqp)
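
To illustrate the shape of that return value, here is a minimal sketch of what a softmax wrapper does: it calls the prediction function, takes the argmax of each row as the predicted label, and returns both. `wrap_softmax_sketch` and `fake_batch_qqp` are hypothetical names used only for illustration. They are not part of CheckList, and the real `PredictorWrapper.wrap_softmax` may differ in detail (for example, it works with NumPy arrays rather than plain lists).

```python
def wrap_softmax_sketch(softmax_fn):
    """Return a function that yields (predictions, confidences)."""
    def wrapped(inputs):
        confs = softmax_fn(inputs)
        # Predicted label = index of the largest probability in each row
        preds = [max(range(len(row)), key=row.__getitem__) for row in confs]
        return preds, confs
    return wrapped

def fake_batch_qqp(pairs):
    """Hypothetical stand-in for batch_qqp that returns fixed probabilities."""
    return [[0.2, 0.8] if a == b else [0.9, 0.1] for a, b in pairs]

wrapped = wrap_softmax_sketch(fake_batch_qqp)
preds, confs = wrapped([("q1", "q2"), ("q1", "q1")])
print(preds)  # [0, 1]
```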

In [6]:
# Test the output
s0 = "The company HuggingFace is based in New York City"
s1 = "Apples are especially bad for your health"
s2 = "HuggingFace's headquarters are situated in Manhattan"
batch_qqp([(s0, s1), (s0, s2)]) 

array([[0.94038326, 0.05961677],
       [0.09536293, 0.90463704]], dtype=float32)
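
Each row of this output holds the class probabilities for one sentence pair. Assuming index 1 is the "paraphrase" class, as in the usual MRPC labelling (an assumption worth checking against `model.config.id2label`), the unrelated pair `(s0, s1)` is predicted as class 0 and the paraphrase pair `(s0, s2)` as class 1. A small sketch of that reading, using the values printed above:

```python
# Probabilities copied from the output above
probs = [
    [0.94038326, 0.05961677],  # (s0, s1): unrelated sentences
    [0.09536293, 0.90463704],  # (s0, s2): paraphrases
]

# Predicted class = index of the largest probability in each row
labels = [max(range(len(row)), key=row.__getitem__) for row in probs]
print(labels)  # [0, 1]
```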

In [7]:
suite_path = 'qqp_suite.pkl'  # path to the pickled test suite
suite = TestSuite.from_file(suite_path)

In [8]:
# Run each test on only 10 test cases; a run with n=100 took many hours.
# overwrite=True replaces any results stored from a previous run.
suite.run(wrapped_pp, n=10, overwrite=True)


Running Modifier: adj
Predicting 10 examples
Running different adjectives
Predicting 10 examples
Running Different animals
Predicting 10 examples
Running Irrelevant modifiers - animals
Predicting 10 examples
Running Irrelevant modifiers - people
Predicting 10 examples
Running Irrelevant preamble with different examples.
Predicting 10 examples
Running Preamble is relevant (different injuries)
Predicting 10 examples
Running How can I become more {synonym}?
Predicting 10 examples
Running (question, f(question)) where f(question) replaces synonyms?
Predicting 10 examples
Running Replace synonyms in real pairs
Predicting 28 examples
Running How can I become more X != How can I become less X
Predicting 10 examples
Running How can I become more X = How can I become less antonym(X)
Predicting 10 examples
Running add one typo
Predicting 30 examples
Running contrations
Predicting 26 examples
Running (q, paraphrase(q))
Predicting 964 examples
Running Product of paraphrases(q1) * paraphrases(q2)
P

In [9]:
suite.summary()  # summarize the test results

Vocabulary

Modifier: adj
Test cases:      1000
Test cases run:  10
Fails (rate):    10 (100.0%)

Example fails:
0.9 ('Is Laura Richardson an assistant?', 'Is Laura Richardson an active assistant?')
----
0.9 ('Is Colin Reynolds an accountant?', 'Is Colin Reynolds an elite accountant?')
----
0.9 ('Is Victoria Clark an organizer?', 'Is Victoria Clark a successful organizer?')
----


different adjectives
Test cases:      963
Test cases run:  10
Fails (rate):    8 (80.0%)

Example fails:
0.9 ('Is Virginia Martin American?', 'Is Virginia Martin immortal?')
----
0.7 ('Is Elaine Bell Christian?', 'Is Elaine Bell an inventor?')
----
0.9 ('Is Angela Green Australian?', 'Is Angela Green English?')
----


Different animals
Test cases:      942
Test cases run:  10
Fails (rate):    10 (100.0%)

Example fails:
0.9 ('Can I feed my lobster meat?', 'Can I feed my goat meat?')
----
0.9 ('Can I feed my goat meat?', 'Can I feed my lobster meat?')
----
0.9 ('Can I feed my squirrel formula?', 'Can I feed my