## BLEU SCORE

BLEU (Bilingual Evaluation Understudy) scores a machine translation by how closely its n-grams match one or more human reference translations. It combines modified n-gram precision with a brevity penalty that discourages overly short output.

In [2]:
# pip install evaluate

In [1]:
import evaluate

bleu = evaluate.load("bleu")

In [6]:
prediction_text = ["They cancelled the match because it was raining"]
reference_text = ["They cancelled the match because of bad weather"]

In [7]:
results = bleu.compute(predictions=prediction_text, references=reference_text)

In [8]:
results

{'bleu': 0.5169731539571706,
 'precisions': [0.625, 0.5714285714285714, 0.5, 0.4],
 'brevity_penalty': 1.0,
 'length_ratio': 1.0,
 'translation_length': 8,
 'reference_length': 8}
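
The numbers above can be reproduced by hand, which makes the roles of the clipped n-gram precisions and the brevity penalty easier to see. The following is a minimal sketch, not the library's implementation: the helper names (`ngrams`, `modified_precision`, `simple_bleu`) are hypothetical, it handles a single reference only, and it assumes plain whitespace tokenization, which happens to give the same tokens as the library's tokenizer for this punctuation-free pair.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(pred, ref, n):
    """Clipped n-gram precision: each predicted n-gram counts at most
    as often as it occurs in the reference."""
    pred_counts = Counter(ngrams(pred, n))
    ref_counts = Counter(ngrams(ref, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return clipped / total if total else 0.0


def simple_bleu(prediction, reference, max_n=4):
    """Single-reference BLEU sketch with whitespace tokenization (an assumption)."""
    pred, ref = prediction.split(), reference.split()
    precisions = [modified_precision(pred, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0, precisions
    # Brevity penalty: 1 when the prediction is longer than the reference,
    # exp(1 - r/c) otherwise (which is also 1 when the lengths are equal).
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return bp * geo_mean, precisions


score, precisions = simple_bleu(
    "They cancelled the match because it was raining",
    "They cancelled the match because of bad weather",
)
print(precisions)  # 5/8, 4/7, 3/6, 2/5
print(score)
```

Both sentences have 8 tokens, so the brevity penalty is 1 and the score is simply the geometric mean of the four precisions.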

#### BLEU Score Limitations

- No semantic understanding: synonyms such as "called off" and "cancelled" count as different words.
- Ignores word importance: every token carries equal weight, including function words such as prepositions.
- Limited word-order sensitivity: short n-grams capture local order but not long-range sentence structure.
- Exact matches only: inflected forms such as "rain" and "raining" do not match.
- Precision-focused: it checks that what was produced appears in the reference, not that everything in the reference was produced. The brevity penalty only partly compensates.
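
Two of these limitations are easy to see with a few lines of code. The sketch below uses a hypothetical helper, `unigram_precision_recall`, with lowercased whitespace tokens (an assumption, not the library's tokenizer). It shows that a truncated translation keeps perfect precision while losing half the recall, and that a synonym is scored as a plain mismatch.

```python
import math
from collections import Counter


def unigram_precision_recall(candidate, reference):
    """Clipped unigram overlap divided by candidate length (precision)
    and by reference length (recall)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return overlap / len(cand), overlap / len(ref)


reference = "They cancelled the match because of bad weather"
truncated = "They cancelled the match"

# Every word of the truncated candidate is in the reference, so precision
# is perfect even though half of the reference content is missing.
precision, recall = unigram_precision_recall(truncated, reference)

# BLEU's only defence against this is the brevity penalty exp(1 - r/c).
brevity_penalty = math.exp(1 - len(reference.split()) / len(truncated.split()))

# "called off" and "cancelled" mean the same thing but share no tokens.
synonym_precision, _ = unigram_precision_recall(
    "They called off the match", "They cancelled the match"
)

print(precision, recall, round(brevity_penalty, 4), synonym_precision)
```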

## ROUGE SCORE

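ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate text summarization and machine translation by measuring the overlap between machine-generated and reference texts. ROUGE-N counts shared n-grams (ROUGE-1 for unigrams, ROUGE-2 for bigrams), while ROUGE-L is based on the longest common subsequence.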

In [7]:
# pip install rouge_score

In [4]:
rouge = evaluate.load("rouge")

In [5]:
prediction_text = ["They cancelled the match because it was raining"]
reference_text = ["They cancelled the match because of bad weather"]

In [6]:
rouge.compute(predictions=prediction_text, references=reference_text)

{'rouge1': 0.625,
 'rouge2': 0.5714285714285714,
 'rougeL': 0.625,
 'rougeLsum': 0.625}
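
These values can be checked by hand. The sketch below is an illustration under stated assumptions rather than the `rouge_score` implementation: the helper names (`f_measure`, `rouge_n`, `lcs_length`, `rouge_l`) are hypothetical, tokens come from lowercasing and splitting on whitespace, no stemming is applied, and each score is reported as the F-measure of precision and recall.

```python
from collections import Counter


def f_measure(matches, pred_total, ref_total):
    """Harmonic mean of precision and recall for a given match count."""
    if matches == 0:
        return 0.0
    precision = matches / pred_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(pred, ref, n):
    """ROUGE-N sketch: clipped n-gram overlap scored as an F-measure."""
    pred_counts = Counter(zip(*[pred[i:] for i in range(n)]))
    ref_counts = Counter(zip(*[ref[i:] for i in range(n)]))
    overlap = sum((pred_counts & ref_counts).values())
    return f_measure(overlap, sum(pred_counts.values()), sum(ref_counts.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(pred, ref):
    """ROUGE-L sketch: LCS length scored as an F-measure."""
    return f_measure(lcs_length(pred, ref), len(pred), len(ref))


pred = "They cancelled the match because it was raining".lower().split()
ref = "They cancelled the match because of bad weather".lower().split()

rouge1 = rouge_n(pred, ref, 1)  # 5 shared unigrams out of 8 on each side
rouge2 = rouge_n(pred, ref, 2)  # 4 shared bigrams out of 7 on each side
rougeL = rouge_l(pred, ref)     # LCS "they cancelled the match because" has length 5
print(rouge1, rouge2, rougeL)
```

Because the two sentences have the same length, precision and recall coincide, so each F-measure equals the raw overlap ratio.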

#### ROUGE Score Limitations
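
- No semantic understanding: synonyms such as "huge" and "massive" are not recognized.
- Limited word-order sensitivity: ROUGE-1 ignores order entirely, and short n-grams capture only local structure.
- No explicit length penalty: there is no counterpart to BLEU's brevity penalty, so summaries that are too short or padded with extra details are penalized only indirectly, through the balance of precision and recall.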

## METEOR SCORE

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a machine translation evaluation metric that improves upon BLEU by considering synonyms, stemming (word variations), and word order in its scoring.

- Recognizes synonyms (happy vs. joyful)
- Handles word variations (run vs. running)
- Considers word order (penalizes incorrect structure)
- Balances precision and recall, unlike BLEU, which mainly focuses on precision

In [8]:
meteor = evaluate.load('meteor')



In [10]:
predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
references = ["It is a guide to action that ensures that the military will forever heed Party commands"]


In [11]:
results = meteor.compute(predictions=predictions, references=references)

In [12]:
results

{'meteor': 0.6944444444444445}
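
To see where this number comes from, the final scoring step can be sketched separately from the word alignment. The function below, `meteor_from_counts`, is a hypothetical helper. The weights `alpha=0.9`, `beta=3.0`, and `gamma=0.5` are assumed to match the defaults of the NLTK implementation that `evaluate` wraps. The counts passed in (12 matched words forming 6 contiguous chunks, for an 18-word prediction and a 16-word reference) were derived by hand for this sentence pair and should be treated as an assumption about how the aligner pairs the words.

```python
def meteor_from_counts(matches, chunks, pred_len, ref_len,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR scoring step given alignment statistics.

    matches: number of aligned unigrams
    chunks:  number of contiguous, same-order runs those matches form
    """
    if matches == 0:
        return 0.0
    precision = matches / pred_len
    recall = matches / ref_len
    # Weighted harmonic mean; alpha = 0.9 weights recall far more than precision.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: more, shorter chunks mean worse word order.
    penalty = gamma * (chunks / matches) ** beta
    return (1 - penalty) * f_mean


# Hand-derived counts for the example above (an assumption, see lead-in).
meteor_score = meteor_from_counts(matches=12, chunks=6, pred_len=18, ref_len=16)
print(meteor_score)
```

With these counts the weighted mean is 20/27 and the penalty is 0.5 × (6/12)³ = 0.0625, giving 25/36 ≈ 0.6944, which agrees with the reported score.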