# **NLP Evaluation Metrics: BLEU, ROUGE, and Perplexity**
---

Evaluation metrics are essential for assessing NLP models: they quantify how closely a model's output matches the expected result. This project focuses on three widely used evaluation metrics:

- **BLEU (Bilingual Evaluation Understudy)**
- **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**
- **Perplexity**

Each metric serves a different purpose: BLEU is mainly used for machine translation, ROUGE for text summarization, and perplexity for language modeling.

Before diving into the implementation, ensure that you have the necessary libraries installed. You can install them using `pip`.

```python
!pip install nltk
!pip install rouge
!pip install torch
!pip install transformers
```

```python
# Importing necessary libraries
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import math

nltk.download('punkt')
```



## **BLEU Score**

### **What is BLEU?**

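**BLEU** (Bilingual Evaluation Understudy) is a metric for evaluating the quality of text that has been machine-translated from one language to another. It measures how closely a candidate translation matches one or more reference translations by computing clipped (modified) n-gram precision: the fraction of the candidate's n-grams that also appear in a reference, with each n-gram credited no more often than it occurs there.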

**Key Points:**
- Primarily used for machine translation.
- Combines precisions for n-grams up to length 4 by default (a geometric mean of the 1-gram to 4-gram precisions).
- Applies a brevity penalty to prevent short translations from scoring artificially high.
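
The two ingredients named above, clipped n-gram precision and the brevity penalty, can be sketched in plain Python. This is a minimal illustration rather than NLTK's implementation: the helper names (`ngrams`, `modified_precision`, `brevity_penalty`, `simple_bleu`) are hypothetical, no smoothing is applied, and the score is simply zero whenever any n-gram order has no matches.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, candidate, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(references, candidate):
    """Penalize candidates that are shorter than the closest reference."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties are broken toward the shorter reference.
    r = min((len(ref) for ref in references),
            key=lambda ref_len: (abs(ref_len - c), ref_len))
    return 1.0 if c >= r else math.exp(1 - r / c)


def simple_bleu(references, candidate, max_n=4):
    """Unsmoothed BLEU: brevity penalty times the geometric mean of the
    1..max_n clipped precisions (zero if any precision is zero)."""
    precisions = [modified_precision(references, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(references, candidate) * math.exp(log_mean)


ref_tokens = "the cat is on the mat".split()

# Clipping: "the" occurs twice in the reference, so only 2 of 4 are credited.
print(modified_precision([ref_tokens], "the the the the".split(), 1))  # 0.5

# A three-word candidate against a six-word reference is penalized.
print(round(brevity_penalty([ref_tokens], "the cat is".split()), 4))   # 0.3679

# An exact match scores 1.0.
print(simple_bleu([ref_tokens], ref_tokens))                           # 1.0
```

Because this sketch has no smoothing, short sentences with no matching 4-gram score exactly zero, which is why the NLTK examples in this notebook pass a smoothing function.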

### **Implementing BLEU in Python**

Let's implement BLEU score calculation using NLTK.


```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Example reference and candidate sentences
reference = [
    'The cat is on the mat'.split(),
    'There is a cat on the mat'.split()
]
candidate = 'The cat is on the mat'.split()

# Calculate BLEU score
bleu_score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU Score: {bleu_score:.4f}")
```

```text
BLEU Score: 1.0000
```


**Explanation:**
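- A BLEU score of `1.0` means the candidate matches a reference exactly; here it is identical to the first reference.
- BLEU always lies between `0` and `1`. Real translations rarely come close to `1`, because even good translations differ in wording from the references.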

#### **Practical Example: Machine Translation**

Let's consider a simple example of evaluating machine-translated sentences.


```python
# References (human translations)
references = [
    ['this', 'is', 'a', 'test'],
    ['this', 'is', 'test']
]

# Candidate (machine translation)
candidate = ['it', 'is', 'a', 'test']

# Calculate BLEU score
bleu = sentence_bleu(references, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU Score: {bleu:.4f}")
```

```text
BLEU Score: 0.3976
```
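
The score of 0.3976 can be checked by hand. The candidate matches 3 of 4 unigrams, 2 of 3 bigrams, 1 of 2 trigrams, and 0 of 1 four-grams. The sketch below assumes that `SmoothingFunction().method1` replaces a zero match count with a small epsilon of 0.1 (believed to be NLTK's default), and that no brevity penalty applies because the candidate and the closest reference both have four tokens.

```python
import math

# Clipped n-gram precisions for the candidate ['it', 'is', 'a', 'test'].
# The 4-gram match count is zero, so epsilon (assumed 0.1) stands in for it.
epsilon = 0.1
precisions = [3 / 4, 2 / 3, 1 / 2, epsilon / 1]

# Candidate length equals the closest reference length, so no penalty.
bp = 1.0

# BLEU is the brevity penalty times the geometric mean of the precisions.
bleu_by_hand = bp * math.exp(
    sum(math.log(p) for p in precisions) / len(precisions)
)
print(f"BLEU by hand: {bleu_by_hand:.4f}")  # BLEU by hand: 0.3976
```

The single missing 4-gram dominates the result: with only four tokens there is exactly one 4-gram, and one wrong word (`it` instead of `this`) is enough to break it.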


## **ROUGE Score**

### **What is ROUGE?**

**ROUGE** (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. It compares an automatically produced summary or translation against one or more reference summaries, which are typically written by humans.

**Key Variants:**
- **ROUGE-N**: Overlap of n-grams.
- **ROUGE-L**: Longest Common Subsequence.
- **ROUGE-W**: Weighted longest common subsequence.
- **ROUGE-S**: Skip-bigram.
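
To make the first two variants concrete, here is a minimal from-scratch sketch of ROUGE-N and ROUGE-L. It is illustrative only: the function names (`ngram_counts`, `rouge_n`, `lcs_length`, `rouge_l`) are hypothetical, tokens are lowercased words with punctuation removed by hand, and library implementations may tokenize or count differently.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return (recall, precision) based on clipped n-gram overlap."""
    ref_counts = ngram_counts(reference, n)
    cand_counts = ngram_counts(candidate, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    return recall, precision


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (recall, precision) based on the longest common subsequence."""
    lcs = lcs_length(reference, candidate)
    return lcs / max(len(reference), 1), lcs / max(len(candidate), 1)


ref_tokens = "the cat is on the mat".split()
cand_tokens = "there is a cat on the mat".split()

for name, (recall, precision) in [
    ("ROUGE-1", rouge_n(ref_tokens, cand_tokens, 1)),
    ("ROUGE-2", rouge_n(ref_tokens, cand_tokens, 2)),
    ("ROUGE-L", rouge_l(ref_tokens, cand_tokens)),
]:
    print(f"{name}: recall={recall:.4f}, precision={precision:.4f}")
# ROUGE-1: recall=0.8333, precision=0.7143
# ROUGE-2: recall=0.4000, precision=0.3333
# ROUGE-L: recall=0.6667, precision=0.5714
```

Unlike n-gram overlap, the longest common subsequence does not require matched words to be adjacent, only to appear in the same order (here `cat on the mat`).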

### **Implementing ROUGE in Python**

We'll use the `rouge` library, which reports ROUGE-1, ROUGE-2, and ROUGE-L.


```python
from rouge import Rouge

# Initialize Rouge
rouge = Rouge()

# Example summaries
reference_summary = "The cat is on the mat."
candidate_summary = "There is a cat on the mat."

# Calculate ROUGE scores
scores = rouge.get_scores(candidate_summary, reference_summary)
print(scores)
```

```text
[{'rouge-1': {'r': 0.8333333333333334, 'p': 0.7142857142857143, 'f': 0.7692307642603551}, 'rouge-2': {'r': 0.4, 'p': 0.3333333333333333, 'f': 0.36363635867768596}, 'rouge-l': {'r': 0.6666666666666666, 'p': 0.5714285714285714, 'f': 0.6153846104142012}}]
```



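**Explanation:**
- **ROUGE-1**: Unigram overlap.
- **ROUGE-2**: Bigram overlap.
- **ROUGE-L**: Longest common subsequence.
- For each variant, `r` is recall (how much of the reference is covered by the candidate), `p` is precision (how much of the candidate appears in the reference), and `f` is the F1 score that combines the two.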
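
The `f` value in each entry is the harmonic mean of `p` and `r`. As a quick check, the sketch below recomputes it from the ROUGE-1 precision and recall printed above. The helper name `f1_score` is hypothetical, and the library's value differs in the later decimal places, presumably because it adds a small constant for numerical stability.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ROUGE-1 precision and recall copied from the printed scores.
p, r = 0.7142857142857143, 0.8333333333333334
print(f"{f1_score(p, r):.4f}")  # 0.7692
```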

#### **Practical Example: Text Summarization**


```python
# Initialize Rouge
rouge = Rouge()

# Reference summary (human-generated)
reference = "Natural language processing enables computers to understand human language."

# Candidate summary (machine-generated)
candidate = "NLP allows machines to comprehend human languages."

# Calculate ROUGE scores
scores = rouge.get_scores(candidate, reference)
for metric, score in scores[0].items():
    print(f"{metric} - Precision: {score['p']:.4f}, Recall: {score['r']:.4f}, F1-Score: {score['f']:.4f}")
```

```text
rouge-1 - Precision: 0.2857, Recall: 0.2500, F1-Score: 0.2667
rouge-2 - Precision: 0.0000, Recall: 0.0000, F1-Score: 0.0000
rouge-l - Precision: 0.2857, Recall: 0.2500, F1-Score: 0.2667
```


## **Perplexity**

### **What is Perplexity?**

**Perplexity** measures how well a probability model predicts a sample. For a language model, it quantifies how well the model predicts each next token in a sequence. Lower perplexity indicates better performance.

**Key Points:**
- Commonly used to evaluate language models.
- Equal to the exponential of the model's average negative log-likelihood (cross-entropy) per token.
- Lower perplexity means the model assigns higher probability to the text it actually observes.
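
The link between perplexity and per-token probabilities can be shown without any model. The sketch below is a minimal illustration, assuming we already know the probability the model assigned to each observed token; the function name `perplexity_from_probs` and the example probabilities are made up for this purpose.

```python
import math


def perplexity_from_probs(token_probs):
    """Exponential of the average negative log-probability per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that is always torn between 4 equally likely tokens.
print(f"{perplexity_from_probs([0.25] * 5):.2f}")         # 4.00

# A model that assigns high probability to every observed token.
print(f"{perplexity_from_probs([0.9, 0.8, 0.95]):.2f}")   # 1.13
```

A perplexity of 4 can be read as the model being as uncertain as if it were choosing uniformly among four options at every step; a perfect model that always assigns probability 1 would reach the minimum of 1.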

### **Implementing Perplexity in Python**

We'll use the `transformers` library with a pre-trained GPT-2 model to calculate perplexity.


```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
import math

# Load pre-trained model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

# Example sentence
sentence = "The cat is sitting on the mat."

# Tokenize input
inputs = tokenizer.encode(sentence, return_tensors='pt')

# Calculate loss
with torch.no_grad():
    outputs = model(inputs, labels=inputs)
    loss = outputs.loss
    perplexity = torch.exp(loss)

print(f"Perplexity: {perplexity.item():.2f}")
```

```text
Perplexity: 45.53
```


**Explanation:**
- The perplexity value indicates how well the model predicts the given sentence.
- A lower value suggests better predictive performance.

#### **Practical Example: Comparing Perplexity**

Let's compare the perplexity of two different sentences.

```python
# Function to calculate perplexity
def calculate_perplexity(sentence):
    inputs = tokenizer.encode(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(inputs, labels=inputs)
        loss = outputs.loss
        perplexity = torch.exp(loss)
    return perplexity.item()

# Sentences
sentence1 = "The quick brown fox jumps over the lazy dog."
sentence2 = "asdfghjkl qwertyuiop zxcvbnm."

# Calculate perplexities
perplexity1 = calculate_perplexity(sentence1)
perplexity2 = calculate_perplexity(sentence2)

print(f"Sentence 1 Perplexity: {perplexity1:.2f}")
print(f"Sentence 2 Perplexity: {perplexity2:.2f}")
```

```text
Sentence 1 Perplexity: 162.47
Sentence 2 Perplexity: 386.45
```


**Explanation:**
- The first sentence is grammatical and coherent, so GPT-2 assigns it the lower perplexity.
- The second sentence consists of random keyboard characters, so its perplexity is more than twice as high (386.45 versus 162.47), indicating text the model finds hard to predict.

---

## **Conclusion**

In this mini project, we've explored three fundamental NLP evaluation metrics:

1. **BLEU**: Suitable for evaluating machine translation by measuring n-gram overlaps between candidate and reference translations.
2. **ROUGE**: Ideal for assessing text summarization by comparing overlaps in n-grams, longest common subsequences, etc.
3. **Perplexity**: Essential for evaluating language models by quantifying how well a model predicts a sample.

Understanding and implementing these metrics is crucial for developing and fine-tuning NLP models effectively. By leveraging libraries like NLTK, Rouge, and Transformers, you can seamlessly integrate these evaluation techniques into your NLP projects.

Feel free to expand this project by applying these metrics to larger datasets, experimenting with different models, or exploring additional evaluation metrics!

---

## **References**

- [NLTK Documentation](https://www.nltk.org/)
- [ROUGE Paper](https://www.aclweb.org/anthology/W04-1013/)
- [Transformers Documentation](https://huggingface.co/transformers/)
- [BLEU Score](https://en.wikipedia.org/wiki/BLEU)