## GLUE benchmark

### Imports and utils

In [70]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import nltk

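def predict_and_print(logits, labels=None):
    """Print the predicted class per example, the true labels (if given), and the scores."""
    # Element-wise sigmoid: every score lies in (0, 1), but a row does not sum to 1.
    # Sigmoid is monotonic, so argmax over the scores equals argmax over the raw logits.
    scores = torch.sigmoid(logits)
    predictions = torch.argmax(scores, dim=1).tolist()
    print(f"Predictions:   {predictions}")
    if labels is not None:
        print(f"Actual labels: {labels}")
    print(f"\nScores:\n{scores}")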
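
The helper above scores each class with an independent sigmoid, which is why the rows in the printed score tensors do not add up to 1. If normalized class probabilities are wanted instead, softmax is the usual choice. The following is a minimal sketch in plain Python (so it runs without torch or a model download); the function names and the logit values are hypothetical and chosen only for illustration.

```python
import math

def sigmoid_scores(logits):
    """Independent per-class scores in (0, 1); they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

def softmax_scores(logits):
    """Normalized class probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one example with three classes
example_logits = [-2.0, 2.5, 0.3]

sig = sigmoid_scores(example_logits)
soft = softmax_scores(example_logits)

print("sigmoid:", [round(s, 4) for s in sig], "sum =", round(sum(sig), 4))
print("softmax:", [round(s, 4) for s in soft], "sum =", round(sum(soft), 4))

# Both transformations are monotonic, so the predicted class is the same
predicted_sigmoid = sig.index(max(sig))
predicted_softmax = soft.index(max(soft))
```

Either transformation yields the same predictions; only the interpretation of the scores differs.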

### CoLA - Corpus of Linguistic Acceptability
It is used to evaluate models on the task of linguistic acceptability. Each sentence is labeled as either:<br>
0 - incorrect - unacceptable<br>
1 - correct - acceptable


In [72]:
# Use a BERT fine-tuned on CoLA
model_checkpoint = "textattack/bert-base-uncased-CoLA"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

cola_sentences = [
    "The boy is playing.",
    "The dog run fast."
]
cola_labels = [1, 0]

encodings = tokenizer(cola_sentences, truncation=True, padding=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**encodings)

predict_and_print(outputs.logits, cola_labels)

Predictions:   [1, 0]
Actual labels: [1, 0]

Scores:
tensor([[0.1056, 0.8955],
        [0.8404, 0.1765]])


### QNLI - Question Natural Language Inference
This dataset, derived from SQuAD, tests whether a sentence contains the answer to a question. Each pair consists of a question and a sentence, labeled as either:<br>
0 - entailment - acceptable<br>
1 - not_entailment - unacceptable

In [73]:
# Use a BERT fine-tuned on QNLI
model_checkpoint = "textattack/bert-base-uncased-QNLI"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

qnli_questions = [
    'How many people live in Berlin?', 
    'What is the size of New York?'
]
qnli_sentences = [
    'Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 
    'New York City is famous for the Metropolitan Museum of Art.'
]
qnli_labels = [0, 1]

encodings = tokenizer(qnli_questions, qnli_sentences, truncation=True, padding=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**encodings)
        
predict_and_print(outputs.logits, qnli_labels)

Predictions:   [0, 1]
Actual labels: [0, 1]

Scores:
tensor([[0.9419, 0.0948],
        [0.0635, 0.9644]])


### MNLI - Multi-Genre Natural Language Inference
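This dataset is used to evaluate models on natural language inference. Each premise-hypothesis pair gets one of three labels. The numbering below is the one this notebook assumes for the checkpoint; the GLUE dataset itself encodes 0 = entailment, 1 = neutral, 2 = contradiction, so the mapping should be verified before relying on it:<br>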
0 - contradiction - unacceptable<br>
1 - neutral - acceptable<br>
2 - entailment - acceptable

In [74]:
# Use a BERT fine-tuned on MNLI
model_checkpoint = "textattack/bert-base-uncased-MNLI"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

mnli_premises = [
    "Sleep is a vital component of our overall health and well-being",
    "Sleep is a vital component of our overall health and well-being",
    "Sleep is a vital component of our overall health and well-being"
]
mnli_hypotheses = [
    "The boy is sleeping.",
    "Sleep is not a luxury but a necessity.",
    "Sleep isn't important."
]
mnli_labels = [2, 1, 0]

encodings = tokenizer(mnli_premises, mnli_hypotheses, truncation=True, padding=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**encodings)
        
predict_and_print(outputs.logits, mnli_labels)

Predictions:   [2, 1, 0]
Actual labels: [2, 1, 0]

Scores:
tensor([[0.4699, 0.0906, 0.8886],
        [0.3002, 0.6163, 0.5773],
        [0.9957, 0.0394, 0.1105]])


To check a chatbot's answer for internal consistency with the MNLI model, we first split the text into individual sentences. Next, we score each pair of consecutive sentences. In a coherent short answer, every pair should be classified as either 'neutral' or 'entailment', never 'contradiction'.

In [75]:
text = ("Sleep is crucial for maintaining physical health, as it allows the body to repair and rejuvenate itself."
        " It plays a vital role in cognitive functions, including memory consolidation and learning."
        " Adequate sleep supports emotional well-being by helping to regulate mood and stress levels."
        " Consistent, quality sleep strengthens the immune system, making it easier to fend off illnesses."
        " Additionally, proper sleep is essential for maintaining energy levels and overall productivity throughout the day.")

# First run only: download the tokenizer data (newer NLTK versions may need 'punkt_tab').
# nltk.download('punkt')
mnli_sentences = nltk.sent_tokenize(text)
for sentence in mnli_sentences:
    print(sentence)

encodings = tokenizer(mnli_sentences[:-1], mnli_sentences[1:], truncation=True, padding=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**encodings)

print("")
predict_and_print(outputs.logits)

Sleep is crucial for maintaining physical health, as it allows the body to repair and rejuvenate itself.
It plays a vital role in cognitive functions, including memory consolidation and learning.
Adequate sleep supports emotional well-being by helping to regulate mood and stress levels.
Consistent, quality sleep strengthens the immune system, making it easier to fend off illnesses.
Additionally, proper sleep is essential for maintaining energy levels and overall productivity throughout the day.

Predictions:   [2, 2, 2, 2]

Scores:
tensor([[0.2991, 0.1641, 0.8924],
        [0.2571, 0.1057, 0.9436],
        [0.0662, 0.1924, 0.9620],
        [0.0338, 0.2374, 0.9779]])
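
The consistency check described above can be turned into a small filter that reports only the suspicious pairs. The sketch below assumes that label 0 means 'contradiction', as in the label list used in this notebook; `find_contradictions` and the sample data are hypothetical names and values for illustration, not part of any library.

```python
# Assumption: label 0 means "contradiction", following this notebook's label list.
CONTRADICTION = 0

def find_contradictions(sentences, predictions, contradiction_label=CONTRADICTION):
    """Return the consecutive sentence pairs that were predicted as contradictory."""
    expected = max(len(sentences) - 1, 0)
    if len(predictions) != expected:
        raise ValueError("expected one prediction per consecutive sentence pair")
    pairs = zip(sentences[:-1], sentences[1:])
    return [pair for pair, label in zip(pairs, predictions)
            if label == contradiction_label]

# Hypothetical sentences and predictions, standing in for the model output
sample_sentences = [
    "Sleep is vital.",
    "Sleep isn't important.",
    "Rest helps recovery.",
]
sample_predictions = [0, 2]

flagged = find_contradictions(sample_sentences, sample_predictions)
for first, second in flagged:
    print(f"Possible contradiction: {first!r} vs. {second!r}")
```

With the real model, `predictions` would be the list produced by `torch.argmax(...).tolist()` for the consecutive pairs. An empty result means no pair was flagged.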


### MRPC - Microsoft Research Paraphrase Corpus
This dataset is commonly used to evaluate paraphrase detection models. It consists of pairs of sentences, each labeled as either:<br>
0 - not_paraphrases - unacceptable<br>
1 - paraphrases - acceptable

In [76]:
# Use a BERT fine-tuned on MRPC
model_checkpoint = "textattack/bert-base-uncased-MRPC"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

mrpc_1 = [
    "The company announced its quarterly earnings today.",
    "She loves to read books during her free time.",
    "The new policy will affect all employees starting next month."
]
mrpc_2 = [
    "The company revealed its quarterly results today",
    "He enjoys playing soccer with his friends after school.",
    "Starting next month, the new policy will impact all staff members."
]
mrpc_labels = [1, 0, 1]

encodings = tokenizer(mrpc_2, mrpc_1, truncation=True, padding=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**encodings)
        
predict_and_print(outputs.logits, mrpc_labels)

Predictions:   [1, 0, 1]
Actual labels: [1, 0, 1]

Scores:
tensor([[0.0937, 0.9303],
        [0.6801, 0.1622],
        [0.0947, 0.9326]])
