# Module 23: NLP Evaluation & Monitoring

**Measuring What Matters in Production**

---

## 1. Objectives

- ✅ Master NLP evaluation metrics
- ✅ Understand task-specific metrics
- ✅ Implement production monitoring
- ✅ Handle model drift detection

## 2. Prerequisites

- [Module 22: NLP Model Deployment](../22_deployment/22_deployment.ipynb)

## 3. NLP Metrics Overview

### Task-Specific Metrics

| Task | Primary Metrics |
|------|----------------|
| Classification | Accuracy, F1, Precision, Recall |
| NER | Entity F1, Exact Match |
| Translation | BLEU, METEOR, chrF |
| Summarization | ROUGE-1/2/L |
| QA | EM, F1, MRR |
| Generation | Perplexity, BLEU, Human eval |
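
The QA row is the one set of metrics in this table that the cells in this module do not demonstrate. As a minimal sketch, assuming plain lowercase whitespace tokenization (the official SQuAD scoring script also strips punctuation and articles), EM, token-level F1, and MRR could be computed like this. The function names are illustrative and do not come from a library.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def mean_reciprocal_rank(ranks) -> float:
    """ranks: 1-based rank of the first correct answer per query, None if absent."""
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)

print(f"EM:  {exact_match('Paris', 'paris'):.3f}")
print(f"F1:  {token_f1('in Paris France', 'Paris'):.3f}")
print(f"MRR: {mean_reciprocal_rank([1, 2, None]):.3f}")
```

EM rewards only a perfect answer, token F1 gives partial credit for overlapping words, and MRR scores a ranked list by how early the first correct answer appears.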

In [1]:
# Install: pip install evaluate rouge-score sacrebleu

import numpy as np
from collections import Counter
from typing import List, Dict

## 4. Classification Metrics

In [2]:
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix
import numpy as np

# Example predictions
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]

# Calculate metrics
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='binary'
)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 Score: {f1:.3f}")
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_true, y_pred))

Precision: 0.800
Recall: 0.800
F1 Score: 0.800

Confusion Matrix:
[[4 1]
 [1 4]]
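
To connect the confusion matrix to the scores printed above, the following sketch recomputes them by hand. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and copies the four counts from the output.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[4, 1],
      [1, 4]]

tn, fp = cm[0]  # true class 0: correctly rejected, wrongly flagged as positive
fn, tp = cm[1]  # true class 1: missed positives, correctly found positives

precision = tp / (tp + fp)            # of everything predicted positive, how much was right
recall = tp / (tp + fn)               # of everything truly positive, how much was found
f1 = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of precision and recall
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```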


In [3]:
# Multi-class metrics
y_true_multi = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred_multi = [0, 2, 2, 0, 1, 1, 0, 1, 2]

# Macro vs Micro vs Weighted
for avg in ['macro', 'micro', 'weighted']:
    p, r, f, _ = precision_recall_fscore_support(
        y_true_multi, y_pred_multi, average=avg
    )
    print(f"{avg.capitalize():8s} - P: {p:.3f}, R: {r:.3f}, F1: {f:.3f}")

Macro    - P: 0.778, R: 0.778, F1: 0.778
Micro    - P: 0.778, R: 0.778, F1: 0.778
Weighted - P: 0.778, R: 0.778, F1: 0.778
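
The three averages coincide above because every class has the same number of examples. The sketch below uses a small, made-up imbalanced dataset to show how they separate. The helper `f1_by_class` is hypothetical and written in plain Python so the arithmetic is visible.

```python
def f1_by_class(y_true, y_pred):
    """Return {label: (f1, support)} computed from raw counts."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = (2 * tp / denom if denom else 0.0, tp + fn)
    return scores

# Hypothetical data: 8 examples of class 0, 2 of class 1, one minority example missed
y_true_imb = [0] * 8 + [1] * 2
y_pred_imb = [0] * 8 + [0, 1]

scores = f1_by_class(y_true_imb, y_pred_imb)
n = len(y_true_imb)

macro_f1 = sum(f for f, _ in scores.values()) / len(scores)       # every class counts equally
weighted_f1 = sum(f * support for f, support in scores.values()) / n  # classes weighted by support
micro_f1 = sum(t == p for t, p in zip(y_true_imb, y_pred_imb)) / n    # equals accuracy for single-label tasks

print(f"Macro    F1: {macro_f1:.3f}")
print(f"Weighted F1: {weighted_f1:.3f}")
print(f"Micro    F1: {micro_f1:.3f}")
```

Macro F1 is the lowest of the three here because the weak minority class counts as much as the majority class, which is why it is the usual choice when rare classes matter.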


## 5. BLEU Score (Translation/Generation)

In [4]:
def compute_bleu(reference: List[str], candidate: List[str], n: int = 4) -> float:
    """Simplified BLEU score computation."""

    def ngrams(tokens, n):
        return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

    precisions = []
    for i in range(1, n + 1):
        ref_ngrams = Counter(ngrams(reference, i))
        cand_ngrams = Counter(ngrams(candidate, i))

        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = sum(cand_ngrams.values())

        precisions.append(overlap / total if total > 0 else 0)

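    # A single zero precision zeroes the geometric mean (and log(0) is undefined)
    if 0 in precisions:
        return 0.0

    # Geometric mean of the n-gram precisions
    bleu = np.exp(np.mean(np.log(precisions)))

    # Brevity penalty: 1 when the candidate is at least as long as the reference,
    # otherwise exp(1 - len(reference) / len(candidate))
    bp = min(1, np.exp(1 - len(reference) / len(candidate)))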

    return bp * bleu

# Example
ref = "the cat sat on the mat".split()
cand = "the cat is on the mat".split()

print(f"BLEU Score: {compute_bleu(ref, cand):.4f}")

BLEU Score: 0.0000
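
The score is zero because the candidate shares no 4-gram with the reference, and one zero precision zeroes the geometric mean. Sentence-level BLEU is therefore usually smoothed. The sketch below applies add-one smoothing to the orders n >= 2, which is only one of several smoothing schemes. The function names are hypothetical, and the early return for very short candidates is a simplification.

```python
import math
from collections import Counter

def clipped_counts(reference, candidate, max_n=4):
    """Return [(overlap, total)] of clipped n-gram matches for n = 1..max_n."""
    counts = []
    for n in range(1, max_n + 1):
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        counts.append((sum((cand & ref).values()), sum(cand.values())))
    return counts

def smoothed_bleu(reference, candidate, max_n=4):
    """Sentence BLEU with add-one smoothing on orders n >= 2."""
    if len(candidate) < max_n:
        return 0.0  # simplification: candidate too short to form every n-gram order
    counts = clipped_counts(reference, candidate, max_n)
    if counts[0][0] == 0:
        return 0.0  # no unigram overlap at all
    log_sum = 0.0
    for n, (overlap, total) in enumerate(counts, start=1):
        if n > 1:  # add one to numerator and denominator so no term is zero
            overlap, total = overlap + 1, total + 1
        log_sum += math.log(overlap / total)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_sum / max_n)

ref_tokens = "the cat sat on the mat".split()
cand_tokens = "the cat is on the mat".split()

print(clipped_counts(ref_tokens, cand_tokens))  # [(5, 6), (3, 5), (1, 4), (0, 3)]
print(f"Smoothed BLEU: {smoothed_bleu(ref_tokens, cand_tokens):.4f}")
```

Smoothed and unsmoothed scores are not comparable, so whichever variant is used should be reported alongside the number.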


In [6]:
# Install required libraries
!pip install evaluate rouge-score sacrebleu

# Using HuggingFace evaluate
import evaluate

bleu = evaluate.load("bleu")
results = bleu.compute(
    predictions=["the cat is on the mat"],
    references=[["the cat sat on the mat"]]
)
print(f"BLEU (evaluate): {results['bleu']:.4f}")

BLEU (evaluate): 0.0000


## 6. ROUGE Score (Summarization)

In [7]:
rouge = evaluate.load("rouge")

predictions = ["The quick brown fox jumps over the lazy dog"]
references = ["A fast brown fox leaps over a sleepy dog"]

results = rouge.compute(
    predictions=predictions,
    references=references
)

print("ROUGE Scores:")
for key, value in results.items():
    print(f"  {key}: {value:.4f}")

ROUGE Scores:
  rouge1: 0.4444
  rouge2: 0.1250
  rougeL: 0.4444
  rougeLsum: 0.4444
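
To show where these numbers come from, here is a hand-rolled sketch of ROUGE-1 and ROUGE-L as F-measures. It assumes lowercase whitespace tokens and skips the stemming and tokenization options of the real package, so it is an illustration rather than a replacement. All function names are hypothetical.

```python
from collections import Counter

def f_measure(matches, pred_len, ref_len):
    """Harmonic mean of precision (matches/pred_len) and recall (matches/ref_len)."""
    if matches == 0:
        return 0.0
    precision, recall = matches / pred_len, matches / ref_len
    return 2 * precision * recall / (precision + recall)

def rouge_1(pred_tokens, ref_tokens):
    """Unigram overlap, clipped by counts."""
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    return f_measure(overlap, len(pred_tokens), len(ref_tokens))

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(pred_tokens, ref_tokens):
    """Longest common subsequence, so word order matters but gaps are allowed."""
    return f_measure(lcs_length(pred_tokens, ref_tokens), len(pred_tokens), len(ref_tokens))

pred_tokens = "the quick brown fox jumps over the lazy dog".split()
ref_tokens = "a fast brown fox leaps over a sleepy dog".split()

print(f"ROUGE-1: {rouge_1(pred_tokens, ref_tokens):.4f}")
print(f"ROUGE-L: {rouge_l(pred_tokens, ref_tokens):.4f}")
```

Four of the nine words match ("brown", "fox", "over", "dog") and they appear in the same order, so both scores come out at 4/9.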


## 7. Entity-Level Evaluation (NER)

In [9]:
!pip install seqeval

from seqeval.metrics import classification_report, f1_score

# BIO format
y_true = [['O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'O']]
y_pred = [['O', 'B-PER', 'I-PER', 'O', 'B-ORG', 'O']]

print("NER Classification Report:")
print(classification_report(y_true, y_pred))
print(f"\nEntity F1: {f1_score(y_true, y_pred):.3f}")

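
Entity-level scoring counts a prediction as correct only when both the span and the type match exactly. The sketch below reimplements that idea for a single BIO-tagged sentence. It is a simplified illustration, not seqeval's implementation: the helper names are hypothetical, and stray `I-` tags with no preceding `B-` are ignored here, which may differ from the library's behaviour.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO-tagged sentence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            entities.add((etype, start, i - 1))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def entity_f1(true_tags, pred_tags):
    """Exact-match entity F1 for one sentence."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

true_tags = ['O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'O']
pred_tags = ['O', 'B-PER', 'I-PER', 'O', 'B-ORG', 'O']

token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
print(f"Token accuracy: {token_accuracy:.3f}")
print(f"Entity F1:      {entity_f1(true_tags, pred_tags):.3f}")
```

Five of six tokens are tagged correctly, yet only one of the two entities is right, so entity F1 (0.500) is far lower than token accuracy (0.833). That gap is the reason NER is evaluated at the entity level.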


## 8. Perplexity (Language Models)

In [10]:
import torch
import torch.nn.functional as F

def compute_perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Compute perplexity from model logits."""
    # Cross entropy loss
    ce_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        reduction='mean'
    )
    return torch.exp(ce_loss).item()

# Example
vocab_size = 1000
seq_len = 10

logits = torch.randn(1, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (1, seq_len))

ppl = compute_perplexity(logits, targets)
print(f"Perplexity: {ppl:.2f}")
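# A uniform distribution over the vocabulary scores exactly vocab_size;
# the random logits used here are not uniform, so they typically score worse.
print(f"(Lower = better, uniform baseline = {vocab_size})")

Perplexity: 1080.49
(Lower = better, uniform baseline = 1000)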
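
The baseline figure can be checked without a model. Perplexity is the exponential of the mean negative log-probability assigned to the correct tokens, so a model that spreads probability evenly over the vocabulary scores exactly `vocab_size`. The sketch below works from probabilities rather than logits, and `perplexity_from_probs` is a hypothetical helper.

```python
import math

def perplexity_from_probs(target_probs):
    """exp(mean negative log-probability) over the probabilities given to the correct tokens."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)

vocab_size = 1000

# Uniform guesser: every correct token receives probability 1 / vocab_size
uniform_ppl = perplexity_from_probs([1 / vocab_size] * 10)

# A model that gives each correct token probability 0.5
confident_ppl = perplexity_from_probs([0.5] * 10)

print(f"Uniform guesser: {uniform_ppl:.1f}")
print(f"p = 0.5 model:   {confident_ppl:.1f}")
```

A perplexity of 2 can be read as the model being, on average, as uncertain as a fair choice between two tokens.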


## 9. Production Monitoring

### Key Metrics to Track

| Category | Metrics |
|----------|--------|
| Performance | Latency (p50, p99), throughput |
| Quality | Accuracy, confidence distribution |
| Drift | Input distribution, prediction shift |
| Business | User feedback, conversion rates |

In [11]:
import time
from dataclasses import dataclass, field
from typing import List

@dataclass
class ModelMonitor:
    """Simple production monitor."""
    latencies: List[float] = field(default_factory=list)
    predictions: List[int] = field(default_factory=list)
    confidences: List[float] = field(default_factory=list)

    def log_prediction(self, latency: float, pred: int, conf: float):
        self.latencies.append(latency)
        self.predictions.append(pred)
        self.confidences.append(conf)

    def get_stats(self) -> Dict:
        return {
            "latency_p50": np.percentile(self.latencies, 50),
            "latency_p99": np.percentile(self.latencies, 99),
            "avg_confidence": np.mean(self.confidences),
            "pred_distribution": dict(Counter(self.predictions))
        }

# Simulate monitoring
monitor = ModelMonitor()

for _ in range(100):
    monitor.log_prediction(
        latency=np.random.exponential(50),  # ms
        pred=np.random.choice([0, 1], p=[0.3, 0.7]),
        conf=np.random.beta(5, 1)
    )

print("Monitor Stats:")
for k, v in monitor.get_stats().items():
    print(f"  {k}: {v}")

Monitor Stats:
  latency_p50: 31.67470930869834
  latency_p99: 177.65922914044606
  avg_confidence: 0.8232331767613418
  pred_distribution: {np.int64(0): 32, np.int64(1): 68}
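
The monitor above keeps every observation forever and only reports statistics. A long-running service would more likely track a recent window and raise alerts. The sketch below is one possible shape for that, under stated assumptions: `RollingMonitor`, the window size, and both thresholds are hypothetical, and it uses a nearest-rank percentile, so values can differ slightly from `np.percentile`, which interpolates.

```python
import math
from collections import deque
from typing import List

class RollingMonitor:
    """Keeps only the most recent `window` requests and checks simple thresholds."""

    def __init__(self, window: int = 1000):
        self.latencies = deque(maxlen=window)    # old entries drop off automatically
        self.confidences = deque(maxlen=window)

    def log_prediction(self, latency: float, conf: float) -> None:
        self.latencies.append(latency)
        self.confidences.append(conf)

    def latency_percentile(self, q: float) -> float:
        """Nearest-rank percentile; returns 0.0 when nothing has been logged."""
        if not self.latencies:
            return 0.0
        ordered = sorted(self.latencies)
        rank = max(1, math.ceil(q / 100 * len(ordered)))
        return ordered[rank - 1]

    def check_alerts(self, max_p99_ms: float = 200.0,
                     min_avg_conf: float = 0.6) -> List[str]:
        """Return a list of human-readable alerts; empty when all is well."""
        alerts: List[str] = []
        if not self.latencies:
            return alerts
        if self.latency_percentile(99) > max_p99_ms:
            alerts.append("latency_p99 above threshold")
        if sum(self.confidences) / len(self.confidences) < min_avg_conf:
            alerts.append("average confidence below threshold")
        return alerts

rolling = RollingMonitor(window=3)
for latency, conf in [(40, 0.9), (55, 0.8), (300, 0.5), (320, 0.4), (310, 0.45)]:
    rolling.log_prediction(latency, conf)

print(list(rolling.latencies))   # only the last three requests remain
print(rolling.check_alerts())
```

Because the window holds only the three slow, low-confidence requests, both alerts fire even though the first two requests were healthy.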


## 10. Drift Detection

In [12]:
from scipy import stats

def detect_drift(baseline: List[float], current: List[float],
                 threshold: float = 0.05) -> Dict:
    """Detect drift with a two-sample KS test; `threshold` is the significance level."""
    statistic, p_value = stats.ks_2samp(baseline, current)

    return {
        "statistic": statistic,
        "p_value": p_value,
        "drift_detected": p_value < threshold
    }

# Example: baseline vs drifted distribution
baseline_conf = np.random.beta(5, 1, 1000)  # High confidence
current_conf = np.random.beta(2, 2, 1000)   # Lower confidence

result = detect_drift(baseline_conf, current_conf)
print(f"Drift Detection Results:")
print(f"  Statistic: {result['statistic']:.4f}")
print(f"  P-value: {result['p_value']:.4f}")
print(f"  Drift Detected: {result['drift_detected']}")

Drift Detection Results:
  Statistic: 0.6280
  P-value: 0.0000
  Drift Detected: True
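
The KS test is not the only option. The Population Stability Index (PSI), which the interview answers later in this module also mention, compares binned proportions and yields an effect size rather than a p-value. The sketch below assumes scores in the range [0, 1], ten equal-width bins, and a small floor to avoid taking the log of zero. These choices and the function name `psi` are illustrative. A commonly quoted rule of thumb treats values above roughly 0.25 as a large shift, but such cut-offs are conventions, not guarantees.

```python
import math

def psi(baseline, current, n_bins=10, lo=0.0, hi=1.0, eps=1e-6):
    """Population Stability Index between two non-empty samples of scores in [lo, hi]."""
    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / (hi - lo) * n_bins)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # Floor each fraction at eps so empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    base_frac = bin_fractions(baseline)
    curr_frac = bin_fractions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base_frac, curr_frac))

# Deterministic illustrative data: evenly spread scores vs. scores squeezed into [0.5, 1.0]
baseline_scores = [(i + 0.5) / 100 for i in range(100)]
shifted_scores = [0.5 + (i + 0.5) / 200 for i in range(100)]

print(f"PSI (no change): {psi(baseline_scores, baseline_scores):.4f}")
print(f"PSI (shifted):   {psi(baseline_scores, shifted_scores):.4f}")
```

With large samples the KS p-value can flag shifts too small to matter in practice, so an effect-size measure such as PSI is often tracked alongside it.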


## 11. Interview Questions

**Q1: When would you use BLEU vs ROUGE?**
<details><summary>Answer</summary>

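- **BLEU**: Translation. Precision-oriented: the share of the candidate's n-grams that also appear in the reference, with a brevity penalty for short outputs.
- **ROUGE**: Summarization. Recall-oriented: the share of the reference's n-grams (or its longest common subsequence) that the summary captures.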
</details>

**Q2: How do you handle class imbalance in evaluation?**
<details><summary>Answer</summary>

1. Use macro F1 (treats classes equally)
2. Report per-class metrics
3. Use stratified splits
4. Consider precision-recall curves over accuracy
</details>

**Q3: What is model drift and how do you detect it?**
<details><summary>Answer</summary>

Drift = change in input distribution or model behavior over time.
Detection:
- Statistical tests (KS, PSI) on input features
- Monitor prediction distribution
- Track confidence scores
- Compare with ground truth when available
</details>

## 12. Summary

- **Classification**: F1, Precision, Recall (macro for imbalanced)
- **Translation**: BLEU (precision-focused)
- **Summarization**: ROUGE (recall-focused)
- **NER**: Entity-level F1 with seqeval
- **LM**: Perplexity
- **Production**: Latency, confidence, drift detection

## 13. References

- [HuggingFace Evaluate](https://huggingface.co/docs/evaluate/)
- [BLEU Paper](https://aclanthology.org/P02-1040/)
- [ROUGE Paper](https://aclanthology.org/W04-1013/)
- [seqeval](https://github.com/chakki-works/seqeval)

---
**🎉 Congratulations! You've completed the NLP with PyTorch curriculum!**

Return to [Module 00: NLP Pipeline Overview](../00_nlp_pipeline/00_nlp_pipeline_overview.ipynb)