# Evaluation Metrics for LLMs

---

## Notebook Structure

1. Metadata
2. Title and Objective
3. Model & Dataset Overview
4. Concept Explanation
   - What are Evaluation Metrics for LLMs?
   - Why & Where do we use them?
   - Overview of BLEU, ROUGE, Perplexity, and Human Evaluation
5. Mathematical Intuition & Formulas
   - BLEU
   - ROUGE
   - Perplexity
6. Implementation and Examples
   - Sample Candidate and Reference Texts
   - Evaluating using BLEU, ROUGE, and Perplexity
   - Comparing the Fine-Tuned Model vs. the Base Model
7. Analysis of Results and Discussion on Limitations
8. Conclusion and Learnings
9. Acknowledgements


## 1. Metadata

- **Topic**: Evaluation Metrics for Large Language Models (LLMs)
- **Metrics Covered**: BLEU, ROUGE, Perplexity
- **Additional Discussion**: Human Evaluation vs. Automated Metrics, Limitations of Metrics
- **Models Evaluated**: A fine-tuned model and a base (pre-trained) model
- **Tech Stack**: Python, NLTK, Hugging Face Transformers (if applicable), Pandas, Matplotlib


## 2. Title and Objective

### Title
Evaluation Metrics for LLMs: Measuring Performance with BLEU, ROUGE, and Perplexity

### Objective
- **What?** Assess the performance of LLMs using automated evaluation metrics.
- **Why?** To ensure models meet quality standards and to compare a fine-tuned model against a base model.
- **How?** By computing BLEU, ROUGE, and Perplexity scores on sample outputs.
- **Where?** These metrics are widely used in machine translation, summarization, and language modeling tasks.


## 3. Model & Dataset Overview

In this notebook, we assume:
- We have a **base model** (e.g., a pre-trained language model).
- We have a **fine-tuned model** (e.g., further trained on a specific dataset).
- For demonstration, we use a set of sample candidate texts generated by each model and corresponding reference texts (ground truth).

*Note:* In a production setting, you would evaluate on a larger test set of model outputs.


## 4. Concept Explanation

### What are Evaluation Metrics for LLMs?
Evaluation metrics are quantitative measures used to assess the performance and quality of language model outputs. They help us determine how well a model generates text compared to a reference (ground truth).

### Why & Where Do We Use Them?
- **Why?**  
  - To compare different models or model versions.
  - To guide model improvements and fine-tuning.
- **Where?**  
  - In machine translation (BLEU),
  - In summarization tasks (ROUGE),
  - In language modeling (Perplexity).

### Overview of the Metrics
- **BLEU (Bilingual Evaluation Understudy)**:  
  Compares the overlap between n-grams of candidate and reference texts.
- **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**:  
  Focuses on recall by comparing overlapping units such as n-grams (ROUGE-N) or the longest common subsequence (ROUGE-L).
- **Perplexity**:  
  Measures how well a probability model predicts a sample; lower perplexity indicates a better model.

### Human Evaluation vs. Automated Metrics
- **Human Evaluation**:  
  Involves subjective assessments of fluency, coherence, and relevance.
- **Automated Metrics**:  
  Provide quick, reproducible measurements but may miss nuances that human evaluators capture.


## 5. Mathematical Intuition & Formulas

### BLEU
A simplified explanation:
- **Formula**:  
  `BLEU = BP * exp( sum( w_n * log(p_n) ) )`
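  - **BP**: Brevity penalty (penalizes candidates that are shorter than the reference)
  - **p_n**: Modified n-gram precision (the share of candidate n-grams that also appear in the reference, with counts clipped to the reference counts)
  - **w_n**: Weight for each n-gram level (typically equal weights, e.g. 0.25 each for n = 1 to 4)

*Plain terms:*  
It takes a weighted geometric mean of the n-gram precisions between candidate and reference and applies a penalty if the candidate is shorter than the reference.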
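
To make the formula concrete, here is a minimal pure-Python sketch of single-reference BLEU restricted to unigrams and bigrams with equal weights. The function names (`modified_precision`, `brevity_penalty`, `simple_bleu`) and the toy sentences are illustrative assumptions, not part of NLTK, and the sketch applies no smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-token windows of the sequence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    # Clip each candidate n-gram count to its count in the reference
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def brevity_penalty(candidate_len, reference_len):
    # No penalty when the candidate is longer than the reference
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

def simple_bleu(candidate, reference, max_n=2):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; real implementations smooth instead
    log_mean = sum(math.log(p) for p in precisions) / max_n  # equal weights w_n = 1/max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)

reference = "the cat sat on the mat".split()
candidate = "the cat sat on mat".split()
score = simple_bleu(candidate, reference)
print(round(score, 4))
```

Here the unigram precision is 5/5, the bigram precision is 3/4, and the brevity penalty is exp(1 - 6/5), so the score is the product of that penalty and the square root of 0.75.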

### ROUGE
A simplified explanation:
- **Commonly used ROUGE-N**:  
  `ROUGE-N = (Number of matching n-grams) / (Total n-grams in the reference)`
  
*Plain terms:*  
It measures the recall (how much of the reference is covered) by comparing overlapping n-grams.
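
The ROUGE-N recall formula can be sketched in a few lines of plain Python. The helper name `rouge_n_recall` and the toy sentences are assumptions for illustration; the `rouge-score` package adds its own tokenization and optional stemming.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-token windows of the sequence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=1):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # A reference n-gram counts as matched at most as often as the candidate contains it
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()

rouge1 = rouge_n_recall(candidate, reference, n=1)  # 5 of 6 reference unigrams matched
rouge2 = rouge_n_recall(candidate, reference, n=2)  # 3 of 5 reference bigrams matched
print(rouge1, rouge2)
```

A single substituted word costs one unigram but two bigrams, which is why ROUGE-2 drops faster than ROUGE-1 under paraphrase.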

### Perplexity
- **Formula**:  
  `Perplexity = exp( - (1/N) * sum( log(P(word_i)) ) )`
  - **N**: Number of words
  - **P(word_i)**: The probability assigned by the model to the i-th word

*Plain terms:*  
Lower perplexity means the model is better at predicting the next word; it’s the exponentiation of the negative average log-probability.
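
As a small worked example of the formula, the sketch below computes perplexity directly from a list of per-token probabilities. The probabilities are made-up values chosen so the arithmetic is easy to check, and `perplexity_from_probs` is a hypothetical helper name.

```python
import math

def perplexity_from_probs(token_probs):
    # exp of the negative average log-probability over the N tokens
    n = len(token_probs)
    avg_log_prob = sum(math.log(p) for p in token_probs) / n
    return math.exp(-avg_log_prob)

# Product of probabilities is 1/64, so perplexity is the cube root of 64
mixed = perplexity_from_probs([0.5, 0.25, 0.125])

# A model that assigns 0.1 to every token is as uncertain as a uniform 10-way choice
uniform = perplexity_from_probs([0.1] * 5)

print(round(mixed, 4), round(uniform, 4))
```

Perplexity can be read as an effective branching factor: a value of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.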


## 6. Implementation and Examples

In [1]:


# For demonstration, we create sample candidate and reference texts.
# In practice, these would come from your model outputs.

# Sample texts for the base model and fine-tuned model
reference_texts = [
    "The weather is nice today and perfect for a walk in the park.",
    "Machine learning techniques are revolutionizing the field of artificial intelligence."
]

candidate_texts_base = [
    "The weather is good today and ideal for a stroll in the park.",
    "Machine learning methods are changing the field of AI."
]

candidate_texts_finetuned = [
    "Today the weather is pleasant and it is a perfect day for a park walk.",
    "Techniques in machine learning are transforming the realm of artificial intelligence."
]

# Display sample texts
print("Reference Texts:")
for text in reference_texts:
    print("-", text)

print("\nCandidate Texts (Base Model):")
for text in candidate_texts_base:
    print("-", text)

print("\nCandidate Texts (Fine-Tuned Model):")
for text in candidate_texts_finetuned:
    print("-", text)


Reference Texts:
- The weather is nice today and perfect for a walk in the park.
- Machine learning techniques are revolutionizing the field of artificial intelligence.

Candidate Texts (Base Model):
- The weather is good today and ideal for a stroll in the park.
- Machine learning methods are changing the field of AI.

Candidate Texts (Fine-Tuned Model):
- Today the weather is pleasant and it is a perfect day for a park walk.
- Techniques in machine learning are transforming the realm of artificial intelligence.


### Explanation of Sample Texts

- **Reference Texts:** These are the ground-truth sentences we expect the models to generate.
- **Candidate Texts (Base Model):** Output from the base model.  
- **Candidate Texts (Fine-Tuned Model):** Output from the fine-tuned model.

Reviewing these texts gives us context before calculating the evaluation scores. Differences in wording, synonym usage, or structure will influence the BLEU and ROUGE scores, and would influence perplexity when it is computed with a real language model.


In [3]:
# 6.1 Evaluating BLEU Score using NLTK

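import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# word_tokenize needs NLTK's Punkt tokenizer data: 'punkt_tab' on recent NLTK
# releases (older releases use nltk.download('punkt') instead).
nltk.download('punkt_tab')

# method1 smoothing replaces zero n-gram match counts with a small epsilon,
# so a short sentence with no 4-gram match does not collapse to a BLEU of 0.
smooth_fn = SmoothingFunction().method1

def compute_bleu(candidate, reference):
    # Tokenize the sentences (lowercased so matching is case-insensitive)
    candidate_tokens = nltk.word_tokenize(candidate.lower())
    # sentence_bleu expects a list of references, each a list of tokens
    reference_tokens = [nltk.word_tokenize(reference.lower())]
    return sentence_bleu(reference_tokens, candidate_tokens, smoothing_function=smooth_fn)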

# Evaluate BLEU scores for base and fine-tuned models
bleu_scores_base = [compute_bleu(c, r) for c, r in zip(candidate_texts_base, reference_texts)]
bleu_scores_finetuned = [compute_bleu(c, r) for c, r in zip(candidate_texts_finetuned, reference_texts)]

print("BLEU Scores (Base Model):", bleu_scores_base)
print("BLEU Scores (Fine-Tuned Model):", bleu_scores_finetuned)


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


BLEU Scores (Base Model): [0.31314224813827346, 0.12927595103001124]
BLEU Scores (Fine-Tuned Model): [0.0932304601086144, 0.2790159393585827]


### BLEU Score Explanation

- **What the Code Does:**  
  - The `compute_bleu` function tokenizes the candidate and reference sentences, then computes the BLEU score using NLTK’s `sentence_bleu`.
  - We calculate BLEU scores for each pair of candidate and reference texts for both models.
  
- **Interpreting the Scores:**  
  - **Higher BLEU Score:** Indicates a greater overlap of n‑grams (words, bi-grams, etc.) between candidate and reference, suggesting closer similarity to the ground truth.
  - Compare the BLEU scores of the base and fine-tuned models to assess which one produces outputs closer to the reference texts.

### Interpretation of BLEU Scores

- **Base Model BLEU Scores:** [0.3131, 0.1293]  
- **Fine-Tuned Model BLEU Scores:** [0.0932, 0.2790]

**What this means:**  
BLEU measures the overlap of n‑grams between the candidate and reference texts. For the first sample, the base model achieves a higher BLEU score (0.3131) compared to the fine-tuned model (0.0932), indicating that its output has more n‑gram similarity with the reference. Conversely, for the second sample, the fine-tuned model (0.2790) outperforms the base model (0.1293), suggesting better overlap with the reference text.  

**Overall:**  
The mixed BLEU scores across samples imply that while one model might perform better on a specific instance, a larger evaluation set is needed to conclusively determine which model consistently produces outputs closer to the references.
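
One simple way to summarize per-sample scores is to average them per model. The sketch below does this for the four BLEU values printed above, copied in as literals so the snippet runs on its own. Note that a mean of sentence-level BLEU scores is not the same quantity as corpus-level BLEU, and with only two samples the averages are illustrative rather than conclusive.

```python
from statistics import mean

# Per-sample BLEU scores taken from the output above
bleu_scores_base = [0.31314224813827346, 0.12927595103001124]
bleu_scores_finetuned = [0.0932304601086144, 0.2790159393585827]

summary = {
    "base": mean(bleu_scores_base),
    "fine-tuned": mean(bleu_scores_finetuned),
}

for model_name, avg_score in summary.items():
    print(f"{model_name}: mean sentence BLEU = {avg_score:.4f}")
```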


In [5]:
!pip install rouge-score



In [6]:
# 6.2 Evaluating ROUGE Score using the 'rouge-score' package
# You might need to install the package: pip install rouge-score

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

def compute_rouge(candidate, reference):
    scores = scorer.score(reference, candidate)
    # For demonstration, we return the mean of the ROUGE-1 and ROUGE-L F-measures
    rouge1 = scores['rouge1'].fmeasure
    rougeL = scores['rougeL'].fmeasure
    return (rouge1 + rougeL) / 2

# Evaluate ROUGE scores for base and fine-tuned models
rouge_scores_base = [compute_rouge(c, r) for c, r in zip(candidate_texts_base, reference_texts)]
rouge_scores_finetuned = [compute_rouge(c, r) for c, r in zip(candidate_texts_finetuned, reference_texts)]

print("ROUGE Scores (Base Model):", rouge_scores_base)
print("ROUGE Scores (Fine-Tuned Model):", rouge_scores_finetuned)


ROUGE Scores (Base Model): [0.7692307692307693, 0.631578947368421]
ROUGE Scores (Fine-Tuned Model): [0.6428571428571428, 0.7142857142857143]


### ROUGE Score Explanation

- **What the Code Does:**  
  - The code uses the `rouge_scorer` from the `rouge-score` package to compute ROUGE-1 and ROUGE-L scores for each candidate-reference pair.
  - It then averages the F-measures of both ROUGE-1 and ROUGE-L to obtain a single score per sentence.
  
- **Interpreting the Scores:**  
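  - **Higher ROUGE Score:** Indicates more overlap with the reference. Because the code reports the F-measure, the score balances recall (how much of the reference the candidate covers) against precision (how much of the candidate is supported by the reference).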
  - Use these scores to compare how well each model’s outputs capture the essence of the reference texts.
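
To show what sits behind the two F-measures, here is a hedged pure-Python sketch of ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence) for the first base-model sample. The tokenizer and helper names are assumptions for illustration, and no stemming is applied, so results may differ from `rouge-score` on other inputs.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphanumeric runs only (no stemming in this sketch)
    return re.findall(r"[a-z0-9]+", text.lower())

def f_measure(matches, candidate_len, reference_len):
    if matches == 0:
        return 0.0
    precision = matches / candidate_len
    recall = matches / reference_len
    return 2 * precision * recall / (precision + recall)

def rouge1_f(candidate, reference):
    cand_counts, ref_counts = Counter(candidate), Counter(reference)
    matches = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return f_measure(matches, len(candidate), len(reference))

def lcs_length(a, b):
    # Dynamic programming over two rows: longest common subsequence length
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rougeL_f(candidate, reference):
    return f_measure(lcs_length(candidate, reference), len(candidate), len(reference))

reference = tokenize("The weather is nice today and perfect for a walk in the park.")
candidate = tokenize("The weather is good today and ideal for a stroll in the park.")

combined = (rouge1_f(candidate, reference) + rougeL_f(candidate, reference)) / 2
print(round(combined, 4))
```

Both sentences have 13 tokens, 10 of which match in the same order, so ROUGE-1 and ROUGE-L each come out at 10/13, in line with the 0.7692 reported for this sample.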

### Interpretation of ROUGE Scores

- **Base Model ROUGE Scores:** [0.7692, 0.6316]  
- **Fine-Tuned Model ROUGE Scores:** [0.6429, 0.7143]

**What this means:**  
ROUGE is recall-oriented: it asks how many of the reference's n‑grams or phrases appear in the candidate text. The scores here are the mean of the ROUGE-1 and ROUGE-L F-measures, so they also reflect precision. For sample 1, the base model's ROUGE score (0.7692) is higher than that of the fine-tuned model (0.6429), suggesting that the base model’s output overlaps more with the reference. For sample 2, the fine-tuned model scores higher (0.7143) compared to the base model (0.6316), indicating better overlap with the reference content for that sample.

**Overall:**  
These results demonstrate that performance can vary by sample, and the choice between models might depend on the specific content or context. Averaging results over a larger test set is recommended for a robust evaluation.


In [7]:
# 6.3 Evaluating Perplexity
# For perplexity, assume we have a language model that can score text.
# We use a dummy function for demonstration. In practice, use your LM's API.

import math

def compute_perplexity(text, model_log_prob_func):
    """
    Compute perplexity given a text and a function that returns log probability for each token.
    model_log_prob_func should accept a list of tokens and return the sum of log probabilities.
    """
    tokens = nltk.word_tokenize(text.lower())
    N = len(tokens)
    if N == 0:
        return float('inf')
    log_prob_sum = model_log_prob_func(tokens)
    avg_log_prob = log_prob_sum / N
    perplexity = math.exp(-avg_log_prob)
    return perplexity

# Dummy log probability function: for demonstration, we assume a fixed probability for each token.
def dummy_log_prob(tokens):
    # Let's assume each token has a probability of 0.1 (log(0.1) = -2.3026)
    return -2.3026 * len(tokens)

# Evaluate perplexity for candidate texts
perplexity_base = [compute_perplexity(c, dummy_log_prob) for c in candidate_texts_base]
perplexity_finetuned = [compute_perplexity(c, dummy_log_prob) for c in candidate_texts_finetuned]

print("Perplexity (Base Model):", perplexity_base)
print("Perplexity (Fine-Tuned Model):", perplexity_finetuned)


Perplexity (Base Model): [10.000149071170647, 10.000149071170643]
Perplexity (Fine-Tuned Model): [10.000149071170643, 10.000149071170643]


### Perplexity Explanation

- **What the Code Does:**  
  - The `compute_perplexity` function tokenizes the text and uses a provided log probability function to compute the average log probability.
  - We then exponentiate the negative average to obtain the perplexity.
  - Here, we use a `dummy_log_prob` function for demonstration that assumes a constant probability for each token.
  
- **Interpreting the Scores:**  
  - **Lower Perplexity:** Indicates that the model is more confident in its predictions (i.e., the candidate text is more “predictable” by the model).
  - Compare the perplexity values between the base and fine-tuned models to see which one is better at predicting the text.

### Interpretation of Perplexity Scores

- **Base Model Perplexity Scores:** [10.00015, 10.00015]  
- **Fine-Tuned Model Perplexity Scores:** [10.00015, 10.00015]

**What this means:**  
Perplexity quantifies how well a model predicts the next word in a sequence, and a lower perplexity indicates better predictive performance. In this demonstration every value is approximately 10 by construction: the dummy function assigns each token a probability of 0.1, so the perplexity is exp(2.3026) ≈ 10.00015 regardless of the text. The numbers therefore say nothing about either model.

**Overall:**  
Since we used a dummy log probability function for demonstration purposes, the perplexity scores do not differentiate the models. In a real-world scenario, each model would supply its own log probabilities for the same held-out text, and differences in perplexity would then reflect the models' predictive capabilities.
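
To see perplexity respond to the text being scored, the dummy function can be swapped for any function that returns text-dependent log probabilities. The sketch below uses a toy add-one-smoothed unigram model as a stand-in; this is an assumption made purely for illustration, not the model used in this notebook, and a real evaluation would use the language model's own token log probabilities. The returned function follows the same contract as `model_log_prob_func`: it takes a list of tokens and returns the sum of their log probabilities.

```python
import math
from collections import Counter

def make_unigram_log_prob(training_tokens):
    # Toy model: add-one smoothed unigram counts, with one extra slot for unseen words
    counts = Counter(training_tokens)
    total = len(training_tokens)
    vocab_size = len(counts) + 1

    def log_prob(tokens):
        return sum(math.log((counts[tok] + 1) / (total + vocab_size)) for tok in tokens)

    return log_prob

def perplexity(tokens, log_prob_func):
    if not tokens:
        return float("inf")
    return math.exp(-log_prob_func(tokens) / len(tokens))

training = "the weather is nice today and perfect for a walk in the park".split()
unigram_log_prob = make_unigram_log_prob(training)

seen_ppl = perplexity("the weather is nice today".split(), unigram_log_prob)
unseen_ppl = perplexity("quantum computers factor large integers".split(), unigram_log_prob)
print(round(seen_ppl, 2), round(unseen_ppl, 2))
```

Text made of words the toy model has seen gets a lower perplexity than text made entirely of unseen words, which is the kind of contrast the constant-probability dummy cannot produce.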


## 7. Analysis of Results and Discussion on Limitations

- **BLEU**:  
  Measures n-gram precision. While higher scores indicate more overlap, BLEU may not capture semantic meaning fully.
  
- **ROUGE**:  
  Focuses on recall and captures overlapping phrases. It is useful in summarization tasks but may penalize valid paraphrases.
  
- **Perplexity**:  
  Indicates how well a language model predicts a text. Lower perplexity is better; however, it is sensitive to the underlying probability estimates.
  
- **Human vs. Automated Evaluation**:  
  Automated metrics provide quick, reproducible results but can miss nuanced language qualities that human evaluators would notice.

- **Limitations**:  
  - **BLEU/ROUGE**: Can be insensitive to meaning and may reward word overlap without semantic correctness.
  - **Perplexity**: Depends on the quality of the language model; not always correlated with human judgments.
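
The insensitivity to meaning is easy to reproduce. In the hedged sketch below, a candidate that negates the reference still gets perfect unigram recall and high precision; the sentences and the helper name `unigram_overlap` are made up for illustration.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    # Returns (precision, recall) based on clipped unigram matches
    cand_counts, ref_counts = Counter(candidate), Counter(reference)
    matches = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return matches / len(candidate), matches / len(reference)

reference = "the movie was good".split()
candidate = "the movie was not good".split()  # opposite meaning

precision, recall = unigram_overlap(candidate, reference)
print(precision, recall)
```

Every reference word appears in the candidate, so recall is 1.0 and precision is 4/5 even though the two sentences disagree, which is one reason overlap metrics are usually paired with human review.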
