<a href="https://colab.research.google.com/github/wesslen/llm-evaluations/blob/main/notebooks/01_llm_evaluations_reference_based.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install bert-score rouge-score sacrebleu transformers moverscore pytorch-pretrained-bert pyemd unbabel-comet

## Reference-based Metrics

### BLEU

In [2]:
# Machine translation evaluation metric
import sacrebleu
# Note: sacreBLEU expects references as a list of reference streams, where each
# stream holds one reference per candidate (extra streams are alternative references)

# List of texts for multi-sample evaluation
candidates = ["The house is small", "The cat sits on the mat"]
references = ["The house is tiny", "There is a cat on the mat"]

bleu = sacrebleu.corpus_bleu(
    candidates,
    [[ref] for ref in references]  # Wrap each reference in a list
)

print("BLEU Score Results:")
print(f"BLEU Score: {bleu.score:.3f}")
print(f"Precisions: {[f'{p:.1f}' for p in bleu.precisions]}")
print(f"Brevity Penalty: {bleu.bp:.3f}")
print("\n")

BLEU Score Results:
BLEU Score: 59.460
Precisions: ['75.0', '66.7', '50.0', '50.0']
Brevity Penalty: 1.000





#### Input Texts

```text
Candidate 1: "The house is small"      (4 words)
Reference 1: "The house is tiny"       (4 words)
Candidate 2: "The cat sits on the mat" (6 words)
Reference 2: "There is a cat on the mat" (7 words)
```

#### Detailed Breakdown

**Precisions [75.0, 66.7, 50.0, 50.0]**

`corpus_bleu` reads its second argument as a list of reference streams, where each stream holds one reference per candidate. Wrapping each reference in its own list therefore produces two streams of length one, and the numbers above are consistent with sacreBLEU scoring only Candidate 1 ("The house is small") against both reference sentences as alternatives. Candidate 2 does not contribute. To score both pairs, pass `[references]` instead.

1. Unigrams (1-gram) = 75.0%

- "The", "house", "is" match; "small" does not
- P1 = 3/4 = 0.75 or 75%

2. Bigrams (2-gram) = 66.7%

- "The house", "house is" match; "is small" does not
- P2 = 2/3 ≈ 0.667 or 66.7%

3. Trigrams (3-gram) = 50.0%

- "The house is" matches; "house is small" does not
- P3 = 1/2 = 0.50 or 50%

4. 4-grams = 50.0%

- The only candidate 4-gram, "The house is small", has no match, so the raw precision is 0/1
- sacreBLEU's default exponential smoothing replaces the zero count, giving 1/(2 × 1) = 0.50; this value is a smoothing artifact, not a real match

**Brevity Penalty (BP) = 1.000**

- BP penalizes translations that are too short
- Calculated as: `min(1, exp(1 - reference_length/candidate_length))`
- `Candidate length = 4 words`
- `Reference length = 4 words` (the reference closest in length to the candidate, "The house is tiny")
- `Reference/Candidate ratio = 1.0`, so BP = `exp(0)` = 1.000 (no penalty applied)
- Had both pairs been scored, the lengths would be 10 and 11 words and BP would be `exp(1 - 11/10) ≈ 0.905`, not 1.000

**Final BLEU Score = 59.460**

- Calculated as: `BP × exp(mean(log(precision_n)))`
- In this case: `1.000 × geometric_mean([0.75, 0.667, 0.50, 0.50])`
- Each of the four precisions has weight `0.25` in the geometric mean
- Final calculation: `1.000 × (0.75 × 0.667 × 0.50 × 0.50)^0.25 ≈ 0.5946`, reported on a 0-100 scale as `59.460`
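The arithmetic behind these numbers can be checked with a small plain-Python sketch. It is not sacreBLEU's implementation: `corpus_bleu_sketch` and `ngram_counts` are hypothetical helpers, tokenization is a plain whitespace split, and the smoothing mimics the exponential scheme by halving the precision for each order with zero matches. Note that the sketch takes one list of references per candidate, which is the transpose of sacreBLEU's stream layout.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(candidates, reference_sets, max_n=4):
    """Corpus BLEU over whitespace tokens.

    reference_sets[i] lists the references for candidates[i]
    (hypothetical layout, chosen for readability).
    """
    correct = [0] * max_n
    total = [0] * max_n
    cand_len = ref_len = 0
    for cand_text, ref_texts in zip(candidates, reference_sets):
        cand = cand_text.split()
        refs = [r.split() for r in ref_texts]
        cand_len += len(cand)
        # Use the reference length closest to the candidate (shorter wins ties)
        ref_len += min((len(r) for r in refs),
                       key=lambda n_tok: (abs(n_tok - len(cand)), n_tok))
        for n in range(1, max_n + 1):
            cand_counts = ngram_counts(cand, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngram_counts(r, n)  # element-wise max over references
            # Clip each candidate n-gram count by its maximum reference count
            correct[n - 1] += sum((cand_counts & max_ref).values())
            total[n - 1] += sum(cand_counts.values())

    precisions = []
    smooth = 1.0
    for c, t in zip(correct, total):
        if t == 0:
            raise ValueError("candidates are too short for max_n")
        if c == 0:
            smooth *= 2  # exponential smoothing for orders with no match
            precisions.append(1.0 / (smooth * t))
        else:
            precisions.append(c / t)

    bp = 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)
    score = bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
    return 100 * score, [100 * p for p in precisions], bp


# Candidate 1 scored against both reference sentences as alternatives
score, precisions, bp = corpus_bleu_sketch(
    ["The house is small"],
    [["The house is tiny", "There is a cat on the mat"]],
)
print(f"{score:.3f}", [f"{p:.1f}" for p in precisions], f"{bp:.3f}")
# prints: 59.460 ['75.0', '66.7', '50.0', '50.0'] 1.000

# Each candidate paired with its own reference
paired_score, paired_precisions, paired_bp = corpus_bleu_sketch(
    ["The house is small", "The cat sits on the mat"],
    [["The house is tiny"], ["There is a cat on the mat"]],
)
print(f"{paired_score:.3f}", f"{paired_bp:.3f}")
```

The first call reproduces the recorded output when only Candidate 1 is scored against both references. The second call shows how the sketch behaves when each candidate is paired with its own reference, where the brevity penalty drops below 1 because the candidates total 10 words against 11 reference words.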

#### Interpretation

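The score of `59.460` describes Candidate 1 only, because the reference layout passed to `corpus_bleu` left Candidate 2 unscored. For that sentence it indicates:

- Strong word-level matching (75% unigram precision): three of four words match
- Decent phrase matching (67% bigram precision)
- Partial preservation of longer sequences (50% trigram precision)
- A 4-gram precision of 50% that comes from smoothing rather than from a real match
- No length penalty (BP = 1.000)

The candidate differs from its reference by a single word ("small" vs "tiny"), and BLEU cannot credit the synonym. With only four tokens, each n-gram shifts the precisions by a large step, so the score is best read as an illustration of the calculation rather than as a stable quality estimate.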

### ROUGE

In [3]:
# Calculate ROUGE scores for text similarity based on n-gram overlap
from rouge_score import rouge_scorer

# Sample texts for evaluation
candidate = "The quick brown fox jumps over the lazy dog"
reference = "A quick brown fox jumped over a lazy dog"

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, candidate)

print("ROUGE Score Results:")
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-2: {scores['rouge2'].fmeasure:.3f}")
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")
print("\n")

ROUGE Score Results:
ROUGE-1: 0.778
ROUGE-2: 0.625
ROUGE-L: 0.778
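
The three F-measures above can be traced by hand. The sketch below is not the `rouge_score` implementation: `TOY_STEMS` is a hypothetical stand-in for the Porter stemmer enabled by `use_stemmer=True`, covering only the word forms in this example, and `tokenize`, `rouge_n`, `lcs_length`, and `rouge_l` are illustrative helpers.

```python
from collections import Counter

# Hypothetical stand-in for the Porter stemmer: only the forms used here
TOY_STEMS = {"jumps": "jump", "jumped": "jump"}


def tokenize(text):
    """Lowercase, split on whitespace, and apply the toy stemmer."""
    return [TOY_STEMS.get(word, word) for word in text.lower().split()]


def f_measure(overlap, cand_total, ref_total):
    """Harmonic mean of precision (overlap/cand) and recall (overlap/ref)."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n):
    """ROUGE-N F-measure from clipped n-gram overlap."""
    cand = ngram_counts(tokenize(candidate), n)
    ref = ngram_counts(tokenize(reference), n)
    overlap = sum((cand & ref).values())
    return f_measure(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F-measure from the longest common subsequence."""
    cand, ref = tokenize(candidate), tokenize(reference)
    return f_measure(lcs_length(cand, ref), len(cand), len(ref))


candidate = "The quick brown fox jumps over the lazy dog"
reference = "A quick brown fox jumped over a lazy dog"

print(f"ROUGE-1: {rouge_n(candidate, reference, 1):.3f}")  # 7 of 9 unigrams overlap
print(f"ROUGE-2: {rouge_n(candidate, reference, 2):.3f}")  # 5 of 8 bigrams overlap
print(f"ROUGE-L: {rouge_l(candidate, reference):.3f}")     # LCS has 7 tokens
```

Stemming is what lets "jumps" and "jumped" count as a match; the articles ("The"/"the" vs "A"/"a") are the only tokens that differ after normalization.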




### BERTScore

In [4]:
# Evaluates text similarity using BERT embeddings
from bert_score import score

# List of texts for multi-sample evaluation
candidates = ["The house is small", "The cat sits on the mat"]
references = ["The house is tiny", "There is a cat on the mat"]

P, R, F1 = score(candidates, references, lang='en', verbose=True)

print("BERT Score Results:")
print(f"Precision: {P.mean().item():.3f}")
print(f"Recall: {R.mean().item():.3f}")
print(f"F1: {F1.mean().item():.3f}")
print("\n")

BERT Score Results:
Precision: 0.969
Recall: 0.965
F1: 0.967
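
BERTScore pairs each token with its most similar token in the other sentence (greedy matching) and averages those similarities. The sketch below illustrates that step only, under stated assumptions: the two-dimensional vectors are made-up toy embeddings rather than model outputs, `cosine` and `greedy_match_prf` are hypothetical helpers, and IDF weighting and baseline rescaling are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_prf(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Precision: each candidate token takes its best match among reference tokens
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token takes its best match among candidate tokens
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings (assumption): two candidate tokens, three reference tokens
cand_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_prf(cand_vecs, ref_vecs)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1: {f1:.3f}")
```

Here every candidate token has an exact counterpart, so precision is 1, while the third reference token is only partially covered and pulls recall down.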




### BARTScore

In [5]:
# Uses BART model for scoring
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
bart_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)



In [6]:

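def compute_bartscore(source, target):
    """Return the mean per-token log-likelihood BART assigns to the encoded pair."""
    # Tokenize source and target together as a single sequence pair
    batch = bart_tokenizer(source, target, return_tensors="pt", padding=True, truncation=True)
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)

    # Score the pair by how well BART reconstructs it (labels == inputs).
    # Note: the original BARTScore formulation instead feeds one text to the
    # encoder and scores the other text as the labels.
    with torch.no_grad():
        outputs = bart_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=input_ids
        )

    # loss is the mean negative log-likelihood, so negating it gives the mean
    # log-likelihood: values closer to 0 indicate a more likely sequence
    return -outputs.loss.item()

bart_scores = [
    compute_bartscore(ref, cand)
    for ref, cand in zip(references, candidates)
]

print("BARTScore Results:")
# The loop variable is not named `score`, to avoid shadowing bert_score's `score`
for i, bart_score in enumerate(bart_scores):
    print(f"Text pair {i+1}: {bart_score:.3f}")
print("\n")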

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


BARTScore Results:
Text pair 1: -2.283
Text pair 2: -1.591




### MoverScore

In [7]:
from collections import defaultdict  # required for the IDF dictionaries below

from moverscore import word_mover_score  # MoverScore

# Uses Word Mover's Distance with contextual embeddings
# Higher scores indicate better similarity
# Uniform IDF weights: every token gets weight 1.0
idf_dict_hyp = defaultdict(lambda: 1.)
idf_dict_ref = defaultdict(lambda: 1.)

mover_scores = word_mover_score(
    references, candidates,
    idf_dict_ref, idf_dict_hyp,
    stop_words=[], n_gram=1,
    remove_subwords=True
)

print("MoverScore Results:")
# The loop variable is not named `score`, to avoid shadowing bert_score's `score`
for i, mover_score in enumerate(mover_scores):
    print(f"Text pair {i+1}: {mover_score:.3f}")
print("\n")

### COMET

In [8]:
from comet import download_model, load_from_checkpoint  # COMET

# Download and load COMET model
model_path = download_model("wmt20-comet-da")
model = load_from_checkpoint(model_path)



In [9]:
# Prepare data for COMET

# input / source -- what should be translated
source_texts = [
    "Hello, my dog is cute",
    "The weather is nice today",
    "I love programming"
]

# hypothetical model predictions, one string per source text
candidates = [
    "Salut, mon chien est mignon",
    "Le temps est beau aujourd'hui",
    "J'adore la programmation"
]

# gold standard (references), one string per source text
references = [
    "Bonjour, mon chien est mignon",
    "Le temps est beau aujourd'hui",
    "J'aime la programmation"
]

# COMET expects one dict per sample, with string values for "src", "mt", and "ref"
comet_data = [
    {"src": src, "mt": cand, "ref": ref}
    for src, cand, ref in zip(source_texts, candidates, references)
]

# Get COMET scores
comet_scores = model.predict(comet_data, batch_size=8, gpus=1)

comet_scores



### Perplexity

In [None]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Prepare text
text = "This is a sample text to evaluate perplexity."
encodings = tokenizer(text, return_tensors='pt')

# Calculate perplexity
with torch.no_grad():
    outputs = model(encodings.input_ids, labels=encodings.input_ids)
    ppl = torch.exp(outputs.loss)

print("Perplexity Results:")
print(f"Perplexity: {ppl.item():.3f}")
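
Perplexity is the exponential of the mean negative log-likelihood per token, which is why the cell above applies `torch.exp` to the model loss. The sketch below shows the same formula on made-up token probabilities; the `perplexity` helper and the probability values are illustrative assumptions, not GPT-2 outputs.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    # Mean negative log-likelihood per token (what a language-model loss reports)
    mean_nll = sum(-math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that spreads probability evenly over 4 options at every step
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is confident about each observed token
confident_ppl = perplexity([0.9, 0.8, 0.95])

print(f"Uniform model perplexity: {uniform_ppl:.3f}")
print(f"Confident model perplexity: {confident_ppl:.3f}")
```

A perplexity of about 4 for the uniform case reads as "as uncertain as choosing among four equally likely tokens"; lower values mean the model found the text more predictable.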

## Libraries

In [None]:
# Now check versions
import sys
from importlib.metadata import PackageNotFoundError, version

import torch

def get_package_details():
    """Print the Python version and the installed versions of selected packages."""
    packages_to_check = [
        'torch',
        'transformers',
        'sacrebleu',
        'bert-score',
        'rouge-score',
        'moverscore',
        'bleurt',
        'unbabel-comet',  # Distribution that provides the `comet` module
        'numpy',
        'sentencepiece'  # Often used by transformers
    ]

    print("Python version:", sys.version.split()[0])
    print("\nPackage versions:")
    print("-" * 50)

    for package in packages_to_check:
        try:
            print(f"{package:<15} {version(package)}")
        except PackageNotFoundError:
            print(f"{package:<15} Not installed")

# Check CUDA availability for PyTorch
print("\nCUDA Status:")
print("-" * 50)
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Current GPU: {torch.cuda.get_device_name()}")

# Run the check
get_package_details()