### [BLEU](https://huggingface.co/spaces/evaluate-metric/bleu)
* BLEU (Bilingual Evaluation Understudy) is a metric originally designed for evaluating machine translation models by comparing generated text to reference translations. It measures the overlap of n-grams between the generated output and the reference texts while applying a brevity penalty to discourage excessively short translations. BLEU ranges from 0 to 1, where a higher score indicates better similarity to the reference.

* n-gram:
  - max_order (int): Maximum n-gram order to use when computing BLEU score. Defaults to 4.

#### Tutorial
* [What is BLEU metric?](https://www.youtube.com/watch?v=M05L1DhFqcw)


* [BLEU examples with step-by-step calculation](https://docs.kolena.com/metrics/bleu/)

* ```python
   {
     'bleu': 0.40548013303822666,   # brevity_penalty * geometric mean of the listed n-gram precisions
     'precisions': [0.6666666666666666, 0.4, 0.25],  # 1-gram, 2-gram, 3-gram (three orders shown)
     'brevity_penalty': 1.0,
     'length_ratio': 2.0,           # translation_length / reference_length
     'translation_length': 6,
     'reference_length': 3
   }
  ```
#### Limitations of BLEU
* BLEU compares overlap in tokens from the predictions and references, instead of comparing meaning. This can lead to discrepancies between BLEU scores and human ratings.

* BLEU does not take sentence structure into account.

* Shorter predicted translations achieve higher scores than longer ones, simply due to how the score is calculated. A brevity penalty is introduced to attempt to counteract this.

* BLEU favors precision over recall.
  - Given the reference "The quick brown fox jumps over the lazy dog. It was a sunny day." and the candidate "The quick brown fox jumps over the lazy dog.", BLEU scores the candidate well on the n-grams it does produce, but it does not directly penalize the missing second sentence. Only the brevity penalty lowers the score.

* BLEU scores are not comparable across different datasets, nor are they comparable across different languages.

* BLEU scores can vary greatly depending on which parameters are used to generate the scores, especially when different tokenization and normalization techniques are used. It is therefore not possible to compare BLEU scores generated using different parameters, or when these parameters are unknown. For more discussion around this topic, see the following [issue](https://github.com/huggingface/datasets/issues/137).
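
To make the precision-versus-recall point concrete, here is a minimal sketch that computes clipped unigram precision and recall for the two sentences from the list above. The helper `unigram_precision_recall` and its whitespace tokenization are illustrative assumptions, not part of the `evaluate` library.

```python
from collections import Counter


def unigram_precision_recall(candidate: str, reference: str) -> tuple[float, float]:
    """Clipped unigram precision and recall using naive whitespace tokens."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return precision, recall


reference = "The quick brown fox jumps over the lazy dog. It was a sunny day."
candidate = "The quick brown fox jumps over the lazy dog."

precision, recall = unigram_precision_recall(candidate, reference)
print(f"precision={precision:.3f} recall={recall:.3f}")  # precision=1.000 recall=0.643
```

Every candidate token appears in the reference, so precision is perfect, while recall shows that roughly a third of the reference was never produced. BLEU is built on the first quantity only.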


In [2]:
import evaluate
from nltk.tokenize import word_tokenize

bleu = evaluate.load("bleu")

In [3]:
# Generated text from LLM
predictions = ["The quick brown fox jumps over the lazy dog."]

# Reference texts (ground truth responses)
references = [["The quick brown fox jumps over the lazy dog. It is a good day to fly."]]

bleu.compute(predictions=predictions, references=references, tokenizer=word_tokenize)

{'bleu': 0.4493289641172217,
 'precisions': [1.0, 1.0, 1.0, 1.0],
 'brevity_penalty': 0.4493289641172217,
 'length_ratio': 0.5555555555555556,
 'translation_length': 10,
 'reference_length': 18}
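
The result above can be reproduced by hand. The sketch below is a simplified single-reference BLEU, not the library's implementation: `tokenize` is a regex stand-in that is assumed to split these particular sentences the same way `word_tokenize` does, and no smoothing is applied.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: words and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_counts(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_order: int = 4) -> float:
    cand, ref = tokenize(candidate), tokenize(reference)
    log_precision_sum = 0.0
    for n in range(1, max_order + 1):
        cand_ngrams, ref_ngrams = ngram_counts(cand, n), ngram_counts(ref, n)
        matched = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = sum(cand_ngrams.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precision_sum += math.log(matched / total) / max_order
    # Brevity penalty: 1 when the candidate is longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision_sum)


score = simple_bleu(
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog. It is a good day to fly.",
)
print(round(score, 4))  # 0.4493
```

All four precisions are 1.0, so the whole score comes from the brevity penalty `exp(1 - 18/10)`.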

#### Sacrebleu
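* SacreBLEU is recommended for addressing BLEU's tokenization problem: it applies its own standardized tokenization, so scores do not depend on how the caller tokenized the text.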

* ```python
  {'score': 32.46679154750991,  # Ranges between 0 and 100, with 100 being identical.
   'counts': [4, 2, 1, 0],
   'totals': [6, 5, 4, 3],
   'precisions': [66.66666666666667, 40.0, 25.0, 16.666666666666668],
   'bp': 1.0,
   'sys_len': 6,
   'ref_len': 5}
  ```
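
The example output above has a 4-gram count of 0 but a non-zero 4-gram precision. A plausible explanation, assumed here rather than taken from the source, is SacreBLEU's default "exp" smoothing, which replaces a zero count with a value that halves for each zero encountered. The hypothetical helper `bleu_from_counts` below recomputes the score from the reported statistics under that assumption.

```python
import math


def bleu_from_counts(counts, totals, sys_len, ref_len):
    """Recompute a 0-100 BLEU score from per-order match statistics."""
    smooth = 1.0
    precisions = []
    for matched, total in zip(counts, totals):
        if matched == 0:
            # Assumed "exp" smoothing: each zero count doubles the divisor.
            smooth *= 2
            precisions.append(100.0 / (smooth * total))
        else:
            precisions.append(100.0 * matched / total)
    bp = 1.0 if sys_len > ref_len else math.exp(1 - ref_len / sys_len)
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean), precisions


score, precisions = bleu_from_counts([4, 2, 1, 0], [6, 5, 4, 3], sys_len=6, ref_len=5)
print(round(score, 4))  # 32.4668
```

The smoothed 4-gram precision comes out as `100 / (2 * 3)`, which matches the 16.67 in the output above.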

In [4]:
# Generated text from LLM
predictions = ["The quick brown fox jumps over the lazy dog."]

# Reference texts (ground truth responses)
references = [["The quick brown fox jumps over the lazy dog. It is a good day to fly."]]

sacrebleu = evaluate.load("sacrebleu")
sacrebleu.compute(predictions=predictions, references=references)  # No need to pass in tokenizer. sacrebleu has its own tokenizer.

{'score': 44.932896411722176,
 'counts': [10, 9, 8, 7],
 'totals': [10, 9, 8, 7],
 'precisions': [100.0, 100.0, 100.0, 100.0],
 'bp': 0.44932896411722156,
 'sys_len': 10,
 'ref_len': 18}

### [METEOR](https://huggingface.co/spaces/evaluate-metric/meteor)
* METEOR (Metric for Evaluation of Translation with Explicit ORdering) is an NLP evaluation metric designed to improve upon BLEU by incorporating semantic matching, stemming, synonyms, and recall in addition to n-gram precision. Unlike BLEU, which focuses purely on n-gram overlap, METEOR allows for flexible matching by considering variations of words.

#### Tutorial
* [METEOR: A metric for Machine Translation](https://www.youtube.com/watch?v=FqQbrlEh_b0)

* [METEOR examples with step-by-step calculation](https://docs.kolena.com/metrics/meteor/)

#### Limitations and Biases
Although METEOR was created to address some of the major limitations of BLEU, it still comes with its own limitations.

* METEOR can fail on context. If we have two sentences "I am a big fan of Taylor Swift" (Reference) and "Fan of Taylor Swift I am big" (Candidate), METEOR would yield a good score. However, the candidate sentence makes little sense and intuitively shouldn't be given a good score. This is a limitation with all n-gram metrics, and not specific to METEOR.
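
To see why the scrambled sentence still scores well, here is a simplified METEOR sketch. It is not the library's implementation: it uses exact word matches only (no stemming or synonyms), a naive greedy alignment, and parameter defaults (`alpha=0.9`, `beta=3.0`, `gamma=0.5`) that are assumed to mirror NLTK's.

```python
def simple_meteor(candidate: str, reference: str,
                  alpha: float = 0.9, beta: float = 3.0, gamma: float = 0.5) -> float:
    cand = candidate.lower().split()
    ref = reference.lower().split()

    # Greedy exact-match alignment: each candidate token takes the first
    # unused reference position holding the same word.
    used = set()
    alignment = []  # (candidate_index, reference_index) pairs
    for i, token in enumerate(cand):
        for j, ref_token in enumerate(ref):
            if j not in used and token == ref_token:
                used.add(j)
                alignment.append((i, j))
                break
    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(cand)
    recall = matches / len(ref)
    # Recall-weighted harmonic mean of precision and recall.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both sentences.
    chunks = 1
    for (i_prev, j_prev), (i, j) in zip(alignment, alignment[1:]):
        if i != i_prev + 1 or j != j_prev + 1:
            chunks += 1
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


score = simple_meteor("Fan of Taylor Swift I am big", "I am a big fan of Taylor Swift")
print(round(score, 4))  # 0.8512
```

The seven matched words fall into only three chunks ("fan of taylor swift", "i am", "big"), so the fragmentation penalty is about 0.04 and barely dents the high F-mean.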

In [5]:
meteor = evaluate.load("meteor")


In [6]:
# Generated text from LLM
predictions = ["The quick brown fox jumps over the lazy dog."]

# Reference texts (ground truth responses)
references = [["The quick brown fox jumps over the lazy dog. It is a good day to fly."]]

meteor.compute(predictions=predictions, references=references)  # No tokenizer argument needed; the metric tokenizes the text itself.

{'meteor': 0.5790697674418605}

In [7]:
# Generated text from LLM
predictions = ["Fan of Taylor Swift I am big"]

# Reference texts (ground truth responses)
references = [["I am a big fan of Taylor Swift"]]

meteor.compute(predictions=predictions, references=references)  # No tokenizer argument needed; the metric tokenizes the text itself.

{'meteor': 0.8512012399896667}

### [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)
* ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics designed to evaluate text summarization and text generation models by comparing the overlap of n-grams, word sequences, and longest common subsequences between the generated text and reference texts. Unlike BLEU, which is precision-focused, ROUGE emphasizes recall, making it more suitable for tasks where capturing the full meaning is important.

* Note that ROUGE is case-insensitive, meaning that upper case letters are treated the same way as lower case letters.

* Valid rouge_types:
  - "rouge1": unigram (1-gram) based scoring
  - "rouge2": bigram (2-gram) based scoring
  - "rougeL": Longest common subsequence based scoring.
  - "rougeLsum": summary-level variant of "rougeL"; splits text into sentences using "\n" before computing the longest common subsequence.

#### Tutorial
* [What is ROUGE metric?](https://www.youtube.com/watch?v=TMshhnrEXlg)

* [ROUGE-N examples with step-by-step calculation](https://docs.kolena.com/metrics/rouge-n/)

* [Mastering ROUGE Matrix: Your Guide to Large Language Model Evaluation for Summarization with Examples](https://dev.to/aws-builders/mastering-rouge-matrix-your-guide-to-large-language-model-evaluation-for-summarization-with-examples-jjg)

#### Limitations and Biases
ROUGE-N, like any other n-gram based metric, suffers from the following limitations:

* Unlike BERTScore, ROUGE-N is not able to consider order, context, or semantics when calculating a score. Since it relies only on overlapping n-grams, it cannot tell when a synonym is being used or whether the placement of two matching n-grams affects the meaning of the sentence. As a result, the metric reflects the "likeness" of the n-grams in two sentences rather than the quality of the text. Take, for example, "This is an example of text" and "Is an example of text this". ROUGE-1 gives this pair a perfect score and ROUGE-2 a nearly perfect one (4 of 5 bigrams match), but the second sentence makes absolutely no sense!

* ROUGE-N cannot capture global coherence. For a long paragraph, a very large N would not return a meaningful score, while a reasonable value such as N = 3 cannot capture the flow of the text. The score might look good even though the paragraph does not flow smoothly at all. This is a weakness of n-gram based metrics, as they are limited to short context windows.

* See [Schluter (2017)](https://aclanthology.org/E17-2007/) for an in-depth discussion of many of ROUGE’s limits.
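
The word-order blind spot described above is easy to check with a minimal ROUGE-N sketch. The helper `rouge_n_f1` is an illustrative assumption (lowercased whitespace tokens, F1 over clipped n-gram overlap), not the `evaluate` implementation.

```python
from collections import Counter


def rouge_n_f1(candidate: str, reference: str, n: int) -> float:
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "This is an example of text"
candidate = "Is an example of text this"

rouge1 = rouge_n_f1(candidate, reference, 1)
rouge2 = rouge_n_f1(candidate, reference, 2)
print(f"ROUGE-1={rouge1:.2f} ROUGE-2={rouge2:.2f}")  # ROUGE-1=1.00 ROUGE-2=0.80
```

Moving "this" to the end changes only one bigram out of five, so the scores stay high even though the sentence no longer reads correctly.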

In [8]:
rouge = evaluate.load("rouge")

In [9]:
# Generated text from LLM
predictions = ["The quick brown fox jumps over the lazy dog."]

# Reference texts (ground truth responses)
references = [["The quick brown fox jumps over the lazy dog. It is a good day to fly."]]

rouge.compute(predictions=predictions, references=references)   # returns a dictionary of f1 scores

{'rouge1': 0.72,
 'rouge2': 0.6956521739130436,
 'rougeL': 0.72,
 'rougeLsum': 0.72}

In [10]:
# Generated text from LLM
predictions = ["Fan of Taylor Swift I am big"]

# Reference texts (ground truth responses)
references = [["I am a big fan of Taylor Swift"]]

rouge.compute(predictions=predictions, references=references)   # returns a dictionary of f1 scores

{'rouge1': 0.9333333333333333,
 'rouge2': 0.6153846153846153,
 'rougeL': 0.5333333333333333,
 'rougeLsum': 0.5333333333333333}
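
The drop from `rouge1` to `rougeL` in the output above comes from the longest common subsequence (LCS), which respects word order. The sketch below recomputes the `rougeL` F1 by hand; the helper names and the lowercased whitespace tokenization are assumptions made for illustration.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # Dynamic programme: table[i][j] is the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("Fan of Taylor Swift I am big", "I am a big fan of Taylor Swift")
print(round(score, 4))  # 0.5333
```

The longest in-order match is "fan of taylor swift" (4 words), giving precision 4/7 and recall 4/8.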

### BERTScore
* BERTScore is an NLP evaluation metric that uses deep contextual embeddings (from models like BERT) to compare generated text with reference text. Unlike BLEU, ROUGE, or METEOR, which rely on n-gram overlap, BERTScore captures semantic similarity by computing token embeddings and measuring their alignment.
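
The alignment step can be sketched as greedy matching over cosine similarities. The example below uses made-up 2-dimensional vectors in place of real model embeddings, and `greedy_match_scores` is a hypothetical helper. The actual metric also involves details this sketch leaves out, such as optional idf weighting and baseline rescaling.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy "embeddings"; a real run would take one vector per token from a model.
candidate_vectors = [[1.0, 0.0], [0.0, 1.0]]
reference_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_scores(candidate_vectors, reference_vectors)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because matching is soft, the third reference vector still earns partial credit (cosine of about 0.71) instead of counting as a complete miss, which is how the metric rewards near-synonyms.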

#### Tutorial
* [BERTScore For LLM Evaluation](https://www.comet.com/site/blog/bertscore-for-llm-evaluation/)

* [BERTScore examples with step-by-step calculation](https://docs.kolena.com/metrics/bertscore/)

#### Paper
* [BERTScore: Evaluating Text Generation with BERT](https://iclr.cc/virtual_2020/poster_SkeHuCVFDr.html)

#### Limitations and Biases
BERTScore, originally designed as a replacement for BLEU and other n-gram similarity metrics, is a powerful metric that closely aligns with human judgement. However, it comes with limitations.

* BERTScore is computationally expensive. The default model (roberta-large) used to calculate BERTScore requires 1.4GB of weights to be stored, and requires a forward pass through the model in order to calculate the score. This may be computationally expensive for large datasets, compared to n-gram-based metrics which are straightforward and easy to compute. However, smaller distilled models like distilbert-base-uncased can be used instead to reduce the computational cost, at the cost of reduced alignment with human judgement.

* BERTScore is calculated using a black-box pretrained model. The score cannot be easily explained, as the embedding space of BERT is a dense and complex representation that is only understood by the model. Though the metric provides a numerical score, it does not explain how or why the particular score was assigned. In contrast, n-gram-based metrics can easily be calculated by inspection.


In [11]:
bertscore = evaluate.load("bertscore")

In [12]:
# device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")    # use CUDA if available, otherwise fall back to CPU (requires `import torch`)

# Generated text from LLM
predictions = ["The quick brown fox jumps over the lazy dog."]

# Reference texts (ground truth responses)
references = [["The quick brown fox jumps over the lazy dog. It is a good day to fly."]]

bertscore.compute(predictions=predictions, references=references, model_type="distilbert-base-uncased", device='mps')   # mps is Mac M family GPU

{'precision': [0.987000584602356],
 'recall': [0.872553825378418],
 'f1': [0.9262553453445435],
 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.55.0)'}

In [13]:
# Generated text from LLM
predictions = ["Fan of Taylor Swift I am big"]

# Reference texts (ground truth responses)
references = [["I am a big fan of Taylor Swift"]]

bertscore.compute(predictions=predictions, references=references, model_type="distilbert-base-uncased", device='mps')   # mps is Mac M family GPU

{'precision': [0.8867292404174805],
 'recall': [0.8649751543998718],
 'f1': [0.8757171034812927],
 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.55.0)'}

### **Comparison: BERTScore vs. BLEU vs. ROUGE vs. METEOR**

#### **1. Overview**
| **Metric**  | **Focus** | **Best For** | **Key Features** |
|------------|----------|-------------|------------------|
| **BLEU**   | Precision | Machine Translation, Structured Outputs | Measures **n-gram overlap**, penalizes extra words, lacks recall. |
| **ROUGE**  | Recall   | Summarization, Keyphrase Extraction | Measures **n-gram & longest common subsequence (LCS) overlap**, prioritizes recall. |
| **METEOR** | Precision + Recall | Machine Translation, LLM Evaluation, Paraphrase Detection | Uses **stemming, synonyms, paraphrases**, recall-weighted F-score, penalizes word order errors. |
| **BERTScore** | Semantic Similarity | Text Generation, Summarization, Paraphrasing, LLM Evaluation | Uses **pre-trained embeddings (BERT/RoBERTa)** for **meaning-based** comparison. |

---

#### **2. How They Work**
| **Feature**  | **BLEU** | **ROUGE** | **METEOR** | **BERTScore** |
|------------|--------|--------|--------|------------|
| **N-gram Matching** | ✅ Yes (1-4 grams) | ✅ Yes (1-gram, 2-gram, etc.) | ⚠️ Unigrams only (exact, stem, and synonym matches) | ❌ No |
| **Semantic Matching (Stemming, Synonyms, Paraphrases)** | ❌ No | ❌ No | ✅ Yes | ✅ Yes (via contextual embeddings) |
| **Precision-Based** | ✅ Yes | ⚠️ Reported, but recall is the focus | ✅ Yes | ✅ Yes |
| **Recall-Based** | ❌ No | ✅ Yes | ✅ Yes (weighted) | ✅ Yes |
| **Word Order Sensitivity** | ⚠️ Partial (local order via 2-4 grams) | ⚠️ Partial (only in ROUGE-L) | ✅ Yes | ✅ Captures reordering effects |
| **Contextual Understanding** | ❌ No | ❌ No | ⚠️ Limited | ✅ Yes (learns from BERT-like models) |
| **Sentence-Level Evaluation** | ❌ No (works best at corpus level) | ⚠️ Some variants (ROUGE-L) | ✅ Yes | ✅ Yes |

---

#### **3. When to Use Each Metric**
| **Use Case** | **Best Metric** | **Why?** |
|-------------|--------------|--------|
| **Machine Translation** | METEOR or BERTScore | METEOR handles synonyms; BERTScore captures meaning. |
| **Summarization** | ROUGE or BERTScore | ROUGE captures recall; BERTScore understands reworded summaries. |
| **LLM Response Evaluation** | BERTScore | BERTScore understands sentence meaning better than n-grams. |
| **Code Generation / Structured Outputs** | BLEU | BLEU enforces **exact matches** for structured formats. |

---