# Evaluation metrics

| Task | Accuracy | F1 | BLEU | Perplexity | ROUGE | EM | METEOR |
| --- |:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Text classification |&check;|&check;| | | | | |
| Text generation | | |&check;|&check;| | | |
| Summarization | | |&check;| |&check;| | |
| Translation | | |&check;| | | |&check;|
| Extractive QA | |&check;| | | |&check;| |
| Generative QA | |&check;| | |&check;| | |

In [None]:
import evaluate

In [2]:
metric = evaluate.load("f1")

In [3]:
print(metric.description)


The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation:
F1 = 2 * (precision * recall) / (precision + recall)



In [4]:
# required inputs
print(metric.features)

{'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}


In [5]:
true_labels = [0,0,1]
predicted_labels = [1,0,1]

In [6]:
metric.compute(predictions=predicted_labels, references=true_labels)

{'f1': 0.6666666666666666}

## BLEU

BLEU and ROUGE compare generated text to reference texts and evaluate its quality more closely with how humans perceive language.

**BLEU (Bilingual Evaluation Understudy)** compares the generated text with a reference text by examining the occurrence of n-grams. 

In a sentence like 'the cat is on the mat', the 1-grams or uni-grams are each individual word, the 2-grams or bi-grams are 'the cat', 'cat is', and so on. The more the generated n-grams match the reference n-grams, the higher the BLEU score. A perfect match results in a score of 1-point-0, while zero would mean no match.

In [None]:
from torchmetrics.text import BLEUScore

In [8]:
generated_text = ['the cat is on the mat']

In [9]:
real_text = [['a cat is on the mat', 'there is a cat on mat']]

In [10]:
blue = BLEUScore()

In [11]:
blue_score = blue(generated_text, real_text)

In [12]:
blue_score.item()

0.7598357200622559

## BLEU translation task

In translation tasks, BLEU measures translation quality of LLM outputs against provided references.

The example below loads the BLEU score and a Spanish to English translation pipeline to evaluate the generated output against two references. Notice how the single translated text is encapsulated as a list before being passed as the predictions argument of the metric's compute method. 

BLEU reports an overall similarity score as well as several domain-specific measurements, such as precisions for n-grams of different length.

In [None]:
blue = evaluate.load("bleu")

In [None]:
from transformers import pipeline

model_name = "Helsinki-NLP/opus-mt-es-en"
translator = pipeline("translation", model=model_name)

In [35]:
sample_text = "Ha estado lloviendo todo el dia"
references = [["It's raining all day", "It's been raining all day"]]

In [33]:
translated_output = translator(sample_text)
sentence = translated_output[0]["translation_text"]
print(sentence)

It's been raining all day.


In [37]:
results = blue.compute(predictions=[sentence], references=references)
results

{'bleu': 0.7598356856515925,
 'precisions': [0.8333333333333334, 0.8, 0.75, 0.6666666666666666],
 'brevity_penalty': 1.0,
 'length_ratio': 1.5,
 'translation_length': 6,
 'reference_length': 4}

## Perplexity

It is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`.

This is a measurement of how well probability distribution or probability model predicts a sample. A lower perplexity score indicates better generalization performance.

In [13]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [14]:
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

In [39]:
sample_text = """
Alice opened the door and found that it led into a small passage, not much larger than a rat-hole: 
she knelt down and looked along the passage into the loveliest garden you ever saw. 
How she longed to get out of that dark hall, and wander about among those beds of bright flowers and those cool fountains, 
but she could not even get her head though the doorway; `and even if my head would go through,' thought poor Alice,
`it would be of very little use without my shoulders. Oh, how I wish I could shut up like a telescope! 
I think I could, if I only know how to begin.' For, you see, so many out-of-the-way things had happened lately, 
that Alice had begun to think that very few things indeed were really impossible.
"""

In [16]:
inputs = tokenizer.encode(sample_text, return_tensors="pt", max_length=10, truncation=True)
outputs = model.generate(inputs, max_length=25)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Alice opened the door and found that it led to a small room with a large table. She sat down on it and


In [17]:
perplexity = evaluate.load("perplexity", module_type="metric")

In [18]:
perplexity.features

{'predictions': Value(dtype='string', id=None)}

In [19]:
results = perplexity.compute(predictions=generated_text, model_id=model_name)

  0%|          | 0/7 [00:00<?, ?it/s]

Perplexity is calculated by output logit distributions returned by the model to generate each next token.

When multiple generated text predictions are passed, the average perplexity is beign used.

In [20]:
results['mean_perplexity']

3786.449084772647

## ROUGE

**ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** assesses generated text against reference text in two ways: 

* examines overlapping n-grams, with N representing the n-gram order
* checks for the longest common subsequence (LCS), the longest shared word sequence between the generated and reference text (overlapping)

ROUGE has three metrics:

* F-measure is the harmonic mean of precision and recall. 

* Precision checks matches of n-grams in the generated text that are in the reference text (how many selected items are relevant). 

* Recall checks for matches of n-grams in the reference text that appear in the generated text (how many selected items are relevant). 

The prefixes 'rouge1', 'rouge2', and 'rougeL' specify the n-gram order or LCS.

In [21]:
from torchmetrics.text import ROUGEScore

In [22]:
rouge = ROUGEScore()

In [23]:
rouge_score = rouge(generated_text, real_text)

In [24]:
rouge_score

{'rouge1_fmeasure': tensor(0.2069),
 'rouge1_precision': tensor(0.1304),
 'rouge1_recall': tensor(0.5000),
 'rouge2_fmeasure': tensor(0.),
 'rouge2_precision': tensor(0.),
 'rouge2_recall': tensor(0.),
 'rougeL_fmeasure': tensor(0.1379),
 'rougeL_precision': tensor(0.0870),
 'rougeL_recall': tensor(0.3333),
 'rougeLsum_fmeasure': tensor(0.1379),
 'rougeLsum_precision': tensor(0.0870),
 'rougeLsum_recall': tensor(0.3333)}

## METEOR translation task

Metric for Evaluation of Translation with Explicit Ordering (METEOR) score is a metric that measures the quality of generated text based on the alignment between the generated text and the reference text.

METEOR was proposed to overcome some limitations in ROUGE and BLEU by incorporating more linguistic aspects in the evaluation, such as stemming to deal with morphological variations, capturing words with similar meanings, and penalizing errors in word order. 

## EM (exact match) question answering

EM returns 1 when the model output exactly matches its associated reference answer, and 0 otherwise.

# Metrics for analyzing bias

**TOXICITY** is a metric to quantify language toxicity by using a pre-trained classification LLM for detecting hate speech. 

It takes a list of one or more texts as input, and calculates a toxicity score between 0 and 1 per input, or returns the maximum of the inputs' toxicity scores if the argument 'aggregation="maximum"' is specified, as shown in this example. Alternatively, it can also return the percentage of input predictions with a toxicity score above 0.5.

In [None]:
toxicity_metric = evaluate.load('toxicity')

In [66]:
text_1 = ["Everyone likes sunny days", "Everyone would relate to this"]
text_2 = ["The random person", "This person is opionated"]

In [67]:
results = toxicity_metric.compute(predictions=text_1, aggregation="maximum")
results

{'max_toxicity': 0.00017157204274553806}

In [68]:
results = toxicity_metric.compute(predictions=text_2, aggregation="maximum")
results

{'max_toxicity': 0.06041119620203972}

**REGARD** quantifies language polarity and biased perception towards certain demographics or groups.

In [None]:
regard_metric = evaluate.load('regard')

In [74]:
results = regard_metric.compute(data=text_1)

In [72]:
results = regard_metric.compute(data=text_2)

In [73]:
results

{'regard': [[{'label': 'neutral', 'score': 0.9326518177986145},
   {'label': 'negative', 'score': 0.03637849912047386},
   {'label': 'positive', 'score': 0.017817307263612747},
   {'label': 'other', 'score': 0.01315248478204012}],
  [{'label': 'neutral', 'score': 0.9218319654464722},
   {'label': 'negative', 'score': 0.04694298654794693},
   {'label': 'other', 'score': 0.01583946868777275},
   {'label': 'positive', 'score': 0.015385557897388935}]]}

The second sentence in the last sample has higher negativity score.