# Evaluation metrics

| Task | Accuracy | F1 | BLUE | Perplexity | ROUGE | EM | METEOR |
| --- |:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Text classification |&check;|&check;| | | | | |
| Text generation | | |&check;|&check;| | | |
| Summarization | | |&check;| |&check;| | |
| Translation | | |&check;| | | |&check;|
| Extractive QA | |&check;| | | |&check;| |
| Generative QA | |&check;| | |&check;| | |

In [1]:
import evaluate

In [13]:
metric = evaluate.load("f1")

In [14]:
print(metric.description)


The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation:
F1 = 2 * (precision * recall) / (precision + recall)



In [15]:
# required inputs
print(metric.features)

{'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}


In [16]:
true_labels = [0,0,1]
predicted_labels = [1,0,1]

In [17]:
metric.compute(predictions=predicted_labels, references=true_labels)

{'f1': 0.6666666666666666}

## BLUE

BLEU and ROUGE compare generated text to reference texts and evaluate its quality more closely with how humans perceive language.

**BLEU (Bilingual Evaluation Understudy)** compares the generated text with a reference text by examining the occurrence of n-grams. 

In a sentence like 'the cat is on the mat', the 1-grams or uni-grams are each individual word, the 2-grams or bi-grams are 'the cat', 'cat is', and so on. The more the generated n-grams match the reference n-grams, the higher the BLEU score. A perfect match results in a score of 1-point-0, while zero would mean no match.

In [18]:
from torchmetrics.text import BLEUScore

In [32]:
generated_text = ['the cat is on the mat']

In [37]:
real_text = [['a cat is on the mat', 'there is a cat on mat']]

In [38]:
blue = BLEUScore()

In [39]:
blue_score = blue(generated_text, real_text)

In [40]:
blue_score.item()

0.7598357200622559

## ROUGE

**ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** assesses generated text against reference text in two ways: 

* examines overlapping n-grams, with N representing the n-gram order
* checks for the longest common subsequence (LCS), the longest shared word sequence between the generated and reference text

ROUGE has three metrics:

* F-measure is the harmonic mean of precision and recall. 

* Precision checks matches of n-grams in the generated text that are in the reference text (how many selected items are relevant). 

* Recall checks for matches of n-grams in the reference text that appear in the generated text (how many selected items are relevant). 

The prefixes 'rouge1', 'rouge2', and 'rougeL' specify the n-gram order or LCS.

In [41]:
from torchmetrics.text import ROUGEScore

In [42]:
rouge = ROUGEScore()

In [43]:
rouge_score = rouge(generated_text, real_text)

In [45]:
rouge_score

{'rouge1_fmeasure': tensor(0.8333),
 'rouge1_precision': tensor(0.8333),
 'rouge1_recall': tensor(0.8333),
 'rouge2_fmeasure': tensor(0.8000),
 'rouge2_precision': tensor(0.8000),
 'rouge2_recall': tensor(0.8000),
 'rougeL_fmeasure': tensor(0.8333),
 'rougeL_precision': tensor(0.8333),
 'rougeL_recall': tensor(0.8333),
 'rougeLsum_fmeasure': tensor(0.8333),
 'rougeLsum_precision': tensor(0.8333),
 'rougeLsum_recall': tensor(0.8333)}

## Perplexity

It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.

This is a measurement of how well probability distribution or probability model predicts a sample. A lower perplexity score indicates better generalization performance.

In [53]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [54]:
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

In [52]:
sample_text = """
Alice opened the door and found that it led into a small passage, not much larger than a rat-hole: 
she knelt down and looked along the passage into the loveliest garden you ever saw. 
How she longed to get out of that dark hall, and wander about among those beds of bright flowers and those cool fountains, 
but she could not even get her head though the doorway; `and even if my head would go through,' thought poor Alice,
`it would be of very little use without my shoulders. Oh, how I wish I could shut up like a telescope! 
I think I could, if I only know how to begin.' For, you see, so many out-of-the-way things had happened lately, 
that Alice had begun to think that very few things indeed were really impossible.
"""

In [64]:
inputs = tokenizer.encode(sample_text, return_tensors="pt", max_length=10, truncation=True)
outputs = model.generate(inputs, max_length=25)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Alice opened the door and found that it led to a small room with a large table. She sat down on it and


In [65]:
perplexity = evaluate.load("perplexity", module_type="metric")

In [66]:
perplexity.features

{'predictions': Value(dtype='string', id=None)}

In [68]:
results = perplexity.compute(predictions=generated_text, model_id=model_name)

  0%|          | 0/7 [00:00<?, ?it/s]

Perplexity is calculated by output logit distributions returned by the model to generate each next token.

When multiple generated text predictions are passed, the average perplexity is beign used.

In [70]:
results['mean_perplexity']

3786.449084772647