# 📊 BlazeMetrics Core Metrics Showcase

This notebook demonstrates the core NLP evaluation metrics provided by BlazeMetrics. All metrics are implemented in Rust for blazing-fast performance and can be computed in parallel across CPU cores.

## 🎯 What You'll Learn

- **ROUGE Scores**: ROUGE-1, ROUGE-2, ROUGE-L for text summarization evaluation
- **BLEU Score**: Bilingual Evaluation Understudy for machine translation
- **chrF Score**: Character n-gram F-score for multilingual evaluation
- **METEOR**: Metric for Evaluation of Translation with Explicit ORdering
- **WER**: Word Error Rate for speech recognition evaluation
- **Token-level Metrics**: F1, Jaccard similarity for fine-grained analysis
- **BERTScore**: Semantic similarity using contextual embeddings
- **MoverScore**: Word mover distance for semantic evaluation

## 🚀 Performance Features

- **Rust Core**: Native performance without Python GIL limitations
- **Parallel Processing**: Automatic parallelization with Rayon
- **NumPy Integration**: Efficient handling of embeddings and numerical data
- **Batch Processing**: Optimized for large-scale evaluation

In [None]:
# Import required libraries
import numpy as np
import time
from typing import List, Dict
import matplotlib.pyplot as plt
import seaborn as sns

# Import BlazeMetrics functions
from blazemetrics import (
    rouge_score, bleu, chrf_score, meteor, wer,
    token_f1, jaccard, bert_score_similarity, moverscore_greedy,
    compute_text_metrics, aggregate_samples
)

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
print("✅ Libraries imported successfully!")

## 📝 Sample Data Preparation

Let's create realistic sample data for demonstration:

In [None]:
# Create sample data for evaluation
candidates = [
    "The quick brown fox jumps over the lazy dog",
    "A good book is a great friend that never betrays",
    "Machine learning algorithms can process vast amounts of data",
    "The weather today is sunny and warm",
    "Artificial intelligence transforms how we solve problems",
    "Natural language processing enables computers to understand text",
    "Deep learning models require significant computational resources",
    "The internet connects billions of people worldwide",
    "Climate change affects global ecosystems and human societies",
    "Quantum computing promises revolutionary advances in cryptography"
]

references = [
    ["A quick brown fox leaps over the lazy dog"],
    ["A good book is a true friend that never disappoints"],
    ["Machine learning algorithms process large amounts of data"],
    ["The weather today is sunny and pleasant"],
    ["AI transforms how we approach problem-solving"],
    ["NLP allows computers to comprehend human language"],
    ["Deep learning models need substantial computational power"],
    ["The internet links billions of people globally"],
    ["Climate change impacts ecosystems and human communities"],
    ["Quantum computing offers breakthrough advances in security"]
]

print(f"📊 Sample Data Created:")
print(f"  • Candidates: {len(candidates)} texts")
print(f"  • References: {len(references)} reference sets")
print(f"  • Average candidate length: {np.mean([len(c.split()) for c in candidates]):.1f} words")
print(f"  • Average reference length: {np.mean([len(r[0].split()) for r in references]):.1f} words")

## 🔥 ROUGE Scores

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is widely used for evaluating text summarization and generation tasks.
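
To make the definition concrete, here is a minimal pure-Python sketch of ROUGE-N for a single candidate/reference pair. It assumes whitespace tokenization and case-sensitive matching; `ngrams` and `rouge_n_sketch` are hypothetical helper names for illustration, not part of BlazeMetrics, and the library's own tokenization may differ.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_sketch(candidate, reference, n=1):
    """Illustrative ROUGE-N: clipped n-gram overlap -> (precision, recall, F1)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f1 = rouge_n_sketch(
    "The quick brown fox jumps over the lazy dog",
    "A quick brown fox leaps over the lazy dog",
)
print(f"P={p:.3f}, R={r:.3f}, F1={f1:.3f}")  # 7 of 9 unigrams match: 0.778 each
```

Because matching here is case-sensitive, "The" and "the" count as different tokens; lowercasing before scoring would change the result.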

In [None]:
print("🎯 Computing ROUGE Scores...")
print("=" * 50)

# ROUGE-1 (Unigram overlap)
start_time = time.perf_counter()
rouge1_scores = rouge_score(candidates, references, score_type="rouge_n", n=1)
rouge1_time = time.perf_counter() - start_time

# ROUGE-2 (Bigram overlap)
start_time = time.perf_counter()
rouge2_scores = rouge_score(candidates, references, score_type="rouge_n", n=2)
rouge2_time = time.perf_counter() - start_time

# ROUGE-L (Longest Common Subsequence)
start_time = time.perf_counter()
rougeL_scores = rouge_score(candidates, references, score_type="rouge_l")
rougeL_time = time.perf_counter() - start_time

# Display results
print(f"ROUGE-1 Scores (took {rouge1_time*1000:.2f} ms):")
for i, (p, r, f1) in enumerate(rouge1_scores):
    print(f"  Text {i+1}: P={p:.3f}, R={r:.3f}, F1={f1:.3f}")

print(f"\nROUGE-2 Scores (took {rouge2_time*1000:.2f} ms):")
for i, (p, r, f1) in enumerate(rouge2_scores):
    print(f"  Text {i+1}: P={p:.3f}, R={r:.3f}, F1={f1:.3f}")

print(f"\nROUGE-L Scores (took {rougeL_time*1000:.2f} ms):")
for i, (p, r, f1) in enumerate(rougeL_scores):
    print(f"  Text {i+1}: P={p:.3f}, R={r:.3f}, F1={f1:.3f}")

# Performance summary
print(f"\n⚡ Performance Summary:")
print(f"  • ROUGE-1: {rouge1_time*1000:.2f} ms")
print(f"  • ROUGE-2: {rouge2_time*1000:.2f} ms")
print(f"  • ROUGE-L: {rougeL_time*1000:.2f} ms")

## 🎯 BLEU Score

BLEU (Bilingual Evaluation Understudy) is the standard metric for machine translation evaluation.
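
As a rough illustration of what BLEU measures, the sketch below combines clipped n-gram precisions for orders 1 through `max_n` with a geometric mean and multiplies the result by a brevity penalty. It assumes a single reference, whitespace tokenization, and no smoothing; `sentence_bleu_sketch` is a hypothetical name, and the BlazeMetrics `bleu` function may handle tokenization, smoothing, and multiple references differently.

```python
import math
from collections import Counter

def sentence_bleu_sketch(candidate, reference, max_n=4):
    """Illustrative unsmoothed sentence-level BLEU against one reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matched = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = sum(cand_ngrams.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalise candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

score = sentence_bleu_sketch(
    "The weather today is sunny and warm",
    "The weather today is sunny and pleasant",
)
print(f"BLEU sketch: {score:.4f}")
```

In this example the four precisions are 6/7, 5/6, 4/5, and 3/4, so the score is the fourth root of their product, 3/7. The early return also explains why short or loosely matching sentences often score exactly 0 under unsmoothed BLEU.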

In [None]:
print("🎯 Computing BLEU Scores...")
print("=" * 50)

# BLEU using n-gram orders 1 through 4 (max_n=4)
start_time = time.perf_counter()
bleu_scores = bleu(candidates, references, max_n=4)
bleu_time = time.perf_counter() - start_time

print(f"BLEU Scores (took {bleu_time*1000:.2f} ms):")
for i, score in enumerate(bleu_scores):
    print(f"  Text {i+1}: {score:.4f}")

print(f"\n📊 BLEU Statistics:")
print(f"  • Mean BLEU: {np.mean(bleu_scores):.4f}")
print(f"  • Std BLEU: {np.std(bleu_scores):.4f}")
print(f"  • Min BLEU: {np.min(bleu_scores):.4f}")
print(f"  • Max BLEU: {np.max(bleu_scores):.4f}")

## 🔤 chrF Score

chrF is a character n-gram F-score that works well across different languages and is less sensitive to tokenization issues.
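
The idea can be sketched in a few lines: count character n-grams for each order, average precision and recall across orders, and combine them with an F-beta score. This follows the general formulation of chrF under stated assumptions (whitespace removed, a single reference, orders skipped when a text is shorter than n); `char_ngrams` and `chrf_sketch` are hypothetical names, and other implementations, possibly including BlazeMetrics, differ in details such as whitespace handling and output scale.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams, ignoring spaces (an assumption here)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf_sketch(candidate, reference, max_n=6, beta=2.0):
    """Illustrative chrF: F-beta over averaged character n-gram P and R."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand or not ref:
            continue  # one text is shorter than n characters: skip this order
        overlap = sum((cand & ref).values())
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(f"{chrf_sketch('The weather today is sunny and warm', 'The weather today is sunny and pleasant'):.4f}")
```

With `beta=2.0`, recall is weighted more heavily than precision, so a candidate that omits reference content is penalised more than one that adds extra characters.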

In [None]:
print("🎯 Computing chrF Scores...")
print("=" * 50)

# chrF over character n-grams up to order 6; beta=2.0 weights recall more heavily than precision
start_time = time.perf_counter()
chrf_scores = chrf_score(candidates, references, max_n=6, beta=2.0)
chrf_time = time.perf_counter() - start_time

print(f"chrF Scores (took {chrf_time*1000:.2f} ms):")
for i, score in enumerate(chrf_scores):
    print(f"  Text {i+1}: {score:.4f}")

print(f"\n📊 chrF Statistics:")
print(f"  • Mean chrF: {np.mean(chrf_scores):.4f}")
print(f"  • Std chrF: {np.std(chrf_scores):.4f}")
print(f"  • Min chrF: {np.min(chrf_scores):.4f}")
print(f"  • Max chrF: {np.max(chrf_scores):.4f}")

## 🌟 METEOR Score

METEOR considers synonymy, stemming, and paraphrasing for more robust evaluation.

In [None]:
print("🎯 Computing METEOR Scores...")
print("=" * 50)

start_time = time.perf_counter()
meteor_scores = meteor(candidates, references)
meteor_time = time.perf_counter() - start_time

print(f"METEOR Scores (took {meteor_time*1000:.2f} ms):")
for i, score in enumerate(meteor_scores):
    print(f"  Text {i+1}: {score:.4f}")

print(f"\n📊 METEOR Statistics:")
print(f"  • Mean METEOR: {np.mean(meteor_scores):.4f}")
print(f"  • Std METEOR: {np.std(meteor_scores):.4f}")
print(f"  • Min METEOR: {np.min(meteor_scores):.4f}")
print(f"  • Max METEOR: {np.max(meteor_scores):.4f}")

## ❌ Word Error Rate (WER)

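WER is the minimum number of word-level substitutions, deletions, and insertions needed to transform the candidate into the reference, divided by the number of words in the reference. Lower is better, and the value can exceed 1.0 when the candidate contains many extra words.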
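
A minimal sketch of the computation: a word-level Levenshtein distance (substitutions, deletions, insertions), normalised by the reference length. It assumes whitespace tokenization and a single reference string; `wer_sketch` is a hypothetical name, not the BlazeMetrics implementation.

```python
def wer_sketch(hypothesis, reference):
    """Illustrative WER: word-level edit distance / number of reference words."""
    hyp, ref = hypothesis.split(), reference.split()
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)

w = wer_sketch(
    "The weather today is sunny and warm",
    "The weather today is sunny and pleasant",
)
print(f"WER sketch: {w:.4f}")  # one substitution over seven reference words
```

Keeping only two rows of the dynamic-programming table is enough because each cell depends only on the current and previous row.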

In [None]:
print("🎯 Computing WER Scores...")
print("=" * 50)

start_time = time.perf_counter()
wer_scores = wer(candidates, references)
wer_time = time.perf_counter() - start_time

print(f"WER Scores (took {wer_time*1000:.2f} ms):")
for i, score in enumerate(wer_scores):
    print(f"  Text {i+1}: {score:.4f}")

print(f"\n📊 WER Statistics (lower is better):")
print(f"  • Mean WER: {np.mean(wer_scores):.4f}")
print(f"  • Std WER: {np.std(wer_scores):.4f}")
print(f"  • Min WER: {np.min(wer_scores):.4f}")
print(f"  • Max WER: {np.max(wer_scores):.4f}")

## 🔍 Token-Level Metrics

Fine-grained metrics for detailed analysis:
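
For reference, both metrics fit in a few lines of plain Python. The sketch below assumes whitespace tokenization, treats token F1 as a bag-of-words overlap (repeated tokens count), and treats Jaccard as a set overlap (repeated tokens do not); `token_f1_sketch` and `jaccard_sketch` are hypothetical names, and BlazeMetrics may normalise text differently.

```python
from collections import Counter

def token_f1_sketch(candidate, reference):
    """Illustrative token F1: harmonic mean of bag-of-words precision and recall."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def jaccard_sketch(candidate, reference):
    """Illustrative Jaccard similarity: |intersection| / |union| of token sets."""
    cand, ref = set(candidate.split()), set(reference.split())
    union = cand | ref
    return len(cand & ref) / len(union) if union else 0.0

cand_text = "The internet connects billions of people worldwide"
ref_text = "The internet links billions of people globally"
tf1 = token_f1_sketch(cand_text, ref_text)
jac = jaccard_sketch(cand_text, ref_text)
print(f"Token F1={tf1:.4f}, Jaccard={jac:.4f}")  # 5/7 and 5/9
```

Jaccard is always less than or equal to token F1 on the same pair when tokens are unique, since the union in its denominator counts every unmatched token from both sides.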

In [None]:
print("🎯 Computing Token-Level Metrics...")
print("=" * 50)

# Token F1
start_time = time.perf_counter()
token_f1_scores = token_f1(candidates, references)
token_f1_time = time.perf_counter() - start_time

# Jaccard similarity
start_time = time.perf_counter()
jaccard_scores = jaccard(candidates, references)
jaccard_time = time.perf_counter() - start_time

print(f"Token F1 Scores (took {token_f1_time*1000:.2f} ms):")
for i, score in enumerate(token_f1_scores):
    print(f"  Text {i+1}: {score:.4f}")

print(f"\nJaccard Scores (took {jaccard_time*1000:.2f} ms):")
for i, score in enumerate(jaccard_scores):
    print(f"  Text {i+1}: {score:.4f}")

print(f"\n📊 Token-Level Statistics:")
print(f"  • Mean Token F1: {np.mean(token_f1_scores):.4f}")
print(f"  • Mean Jaccard: {np.mean(jaccard_scores):.4f}")

## 🧠 BERTScore Similarity

BERTScore computes similarity using contextual embeddings for semantic evaluation:
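
The standard BERTScore recipe can be sketched with NumPy: compute the cosine similarity between every candidate and reference token embedding, greedily match each token to its most similar counterpart, and average. This is an illustration under the assumption that each row is one token embedding; `bertscore_greedy_sketch` is a hypothetical name, and whether `bert_score_similarity` follows exactly this procedure is not confirmed here.

```python
import numpy as np

def bertscore_greedy_sketch(cand_emb, ref_emb):
    """Illustrative BERTScore-style greedy matching -> (precision, recall, F1)."""
    # L2-normalise rows so that a dot product equals cosine similarity
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T  # shape: (num_candidate_rows, num_reference_rows)
    precision = float(sim.max(axis=1).mean())  # best reference match per candidate row
    recall = float(sim.max(axis=0).mean())     # best candidate match per reference row
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

rng = np.random.default_rng(42)
demo_p, demo_r, demo_f1 = bertscore_greedy_sketch(
    rng.random((10, 768)), rng.random((10, 768))
)
print(f"P={demo_p:.4f}, R={demo_r:.4f}, F1={demo_f1:.4f}")
```

Note that uniform random vectors in [0, 1) all point in a broadly similar direction (cosine similarity of roughly 0.75 in high dimensions), so scores on random embeddings look high without carrying any semantic meaning.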

In [None]:
print("🎯 Computing BERTScore Similarity...")
print("=" * 50)

# Create sample embeddings (in practice, these would come from a BERT model)
np.random.seed(42)
cand_embeddings = np.random.rand(len(candidates), 768).astype(np.float32)
ref_embeddings = np.random.rand(len(references), 768).astype(np.float32)

start_time = time.perf_counter()
precision, recall, f1 = bert_score_similarity(cand_embeddings, ref_embeddings)
bertscore_time = time.perf_counter() - start_time

print(f"BERTScore Results (took {bertscore_time*1000:.2f} ms):")
print(f"  • Precision: {precision:.4f}")
print(f"  • Recall: {recall:.4f}")
print(f"  • F1: {f1:.4f}")

print(f"\n📊 Note: These are random embeddings for demonstration.")
print(f"   In practice, use real BERT embeddings for meaningful results.")

## 🚀 High-Level API: compute_text_metrics

BlazeMetrics provides a convenient high-level API to compute multiple metrics at once:

In [None]:
print("🎯 Computing All Metrics at Once...")
print("=" * 50)

start_time = time.perf_counter()
all_metrics = compute_text_metrics(
    candidates, references,
    include=["rouge1", "rouge2", "rougeL", "bleu", "chrf", "meteor", "wer", "token_f1", "jaccard"],
    lowercase=False,
    stemming=False
)
compute_all_time = time.perf_counter() - start_time

print(f"All metrics computed in {compute_all_time*1000:.2f} ms!")
print(f"\n📊 Available metrics: {list(all_metrics.keys())}")

# Display sample results
print(f"\n📋 Sample Results for Text 1:")
for metric, scores in all_metrics.items():
    if len(scores) > 0:
        print(f"  • {metric}: {scores[0]:.4f}")

## 📊 Aggregation and Analysis

Let's aggregate the results and create visualizations:

In [None]:
# Aggregate all metrics
aggregated = aggregate_samples(all_metrics)

print("📊 Aggregated Metrics (across all texts):")
print("=" * 50)
for metric, value in aggregated.items():
    print(f"  • {metric}: {value:.4f}")

# Create performance comparison chart
metrics_to_plot = ["rouge1_f1", "rouge2_f1", "rougeL_f1", "bleu", "chrf", "meteor", "token_f1", "jaccard"]
values = [aggregated.get(m, 0.0) for m in metrics_to_plot]

plt.figure(figsize=(12, 6))
bars = plt.bar(metrics_to_plot, values, color='skyblue', alpha=0.7)
plt.title('Aggregated Metric Scores Across All Texts', fontsize=16, fontweight='bold')
plt.xlabel('Metrics', fontsize=12)
plt.ylabel('Score', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 1)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar, value in zip(bars, values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{value:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

## ⚡ Performance Analysis

Let's compare the performance of different metrics:
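
On a ten-sentence dataset, a single timed call is easily dominated by noise such as first-call overhead. One common way to get steadier numbers is to repeat each call and keep the fastest run. The helper below is a sketch of that idea; `best_of` is a hypothetical name and is not used by the cells in this notebook.

```python
import time

def best_of(fn, *args, repeats=5, **kwargs):
    """Call fn several times; return (last_result, fastest_elapsed_seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best

# Usage with any callable; a metric function could be passed the same way
total, seconds = best_of(sum, range(100_000), repeats=5)
print(f"sum -> {total} in {seconds * 1000:.3f} ms (best of 5)")
```

Taking the minimum rather than the mean reduces the influence of one-off delays from the operating system or interpreter warm-up.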

In [None]:
# Performance comparison
performance_data = {
    'ROUGE-1': rouge1_time * 1000,
    'ROUGE-2': rouge2_time * 1000,
    'ROUGE-L': rougeL_time * 1000,
    'BLEU': bleu_time * 1000,
    'chrF': chrf_time * 1000,
    'METEOR': meteor_time * 1000,
    'WER': wer_time * 1000,
    'Token F1': token_f1_time * 1000,
    'Jaccard': jaccard_time * 1000,
    'All Combined': compute_all_time * 1000
}

print("⚡ Performance Comparison (milliseconds):")
print("=" * 50)
for metric, time_ms in performance_data.items():
    print(f"  • {metric}: {time_ms:.2f} ms")

# Performance visualization
plt.figure(figsize=(14, 8))
metrics = list(performance_data.keys())
times = list(performance_data.values())

bars = plt.bar(metrics, times, color='lightcoral', alpha=0.7)
plt.title('Metric Computation Performance', fontsize=16, fontweight='bold')
plt.xlabel('Metrics', fontsize=12)
plt.ylabel('Time (milliseconds)', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)

# Add value labels
for bar, time_val in zip(bars, times):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1, 
             f'{time_val:.2f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

# Rank only the individual metrics; 'All Combined' is a separate measurement
individual = {k: v for k, v in performance_data.items() if k != 'All Combined'}
print(f"\n🎯 Key Insights:")
print(f"  • Fastest metric: {min(individual, key=individual.get)} ({min(individual.values()):.2f} ms)")
print(f"  • Slowest metric: {max(individual, key=individual.get)} ({max(individual.values()):.2f} ms)")
print(f"  • Computing all metrics: {performance_data['All Combined']:.2f} ms")
print(f"  • Efficiency gain: {sum(individual.values()) - performance_data['All Combined']:.2f} ms saved by the combined call versus separate calls")

## 🔄 Batch Processing Efficiency

Let's demonstrate the efficiency of batch processing:

In [None]:
# Test different batch sizes
batch_sizes = [1, 5, 10, 25, 50, 100]
batch_times = []

print("🔄 Batch Processing Efficiency Test")
print("=" * 50)

for batch_size in batch_sizes:
    # Create larger dataset
    large_candidates = candidates * (batch_size // len(candidates) + 1)
    large_references = references * (batch_size // len(references) + 1)
    large_candidates = large_candidates[:batch_size]
    large_references = large_references[:batch_size]
    
    # Time the computation
    start_time = time.perf_counter()
    _ = compute_text_metrics(large_candidates, large_references, include=["bleu", "rouge1", "chrf"])
    batch_time = time.perf_counter() - start_time
    batch_times.append(batch_time)
    
    print(f"  • Batch size {batch_size:3d}: {batch_time*1000:6.2f} ms ({batch_time*1000/batch_size:6.2f} ms per text)")

# Visualization
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(batch_sizes, [t*1000 for t in batch_times], 'o-', linewidth=2, markersize=8, color='green')
plt.title('Total Processing Time vs Batch Size', fontweight='bold')
plt.xlabel('Batch Size')
plt.ylabel('Time (milliseconds)')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
per_text_times = [t*1000/s for t, s in zip(batch_times, batch_sizes)]
plt.plot(batch_sizes, per_text_times, 's-', linewidth=2, markersize=8, color='orange')
plt.title('Per-Text Processing Time vs Batch Size', fontweight='bold')
plt.xlabel('Batch Size')
plt.ylabel('Time per Text (milliseconds)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n📊 Efficiency Analysis:")
print(f"  • Smallest batch ({batch_sizes[0]}): {batch_times[0]*1000/batch_sizes[0]:.2f} ms per text")
print(f"  • Largest batch ({batch_sizes[-1]}): {batch_times[-1]*1000/batch_sizes[-1]:.2f} ms per text")
print(f"  • Efficiency gain: {(batch_times[0]/batch_sizes[0]) / (batch_times[-1]/batch_sizes[-1]):.1f}x faster per text with batching")

## 🎉 Summary

You've successfully explored all the core metrics provided by BlazeMetrics! Here's what we've covered:

### ✅ **Metrics Demonstrated:**
- **ROUGE**: ROUGE-1, ROUGE-2, ROUGE-L for text overlap evaluation
- **BLEU**: Standard machine translation metric
- **chrF**: Character-level F-score for multilingual evaluation
- **METEOR**: Robust evaluation considering synonyms and paraphrasing
- **WER**: Word-level error rate for speech/text recognition
- **Token-level**: F1 and Jaccard for fine-grained analysis
- **BERTScore**: Semantic similarity using embeddings

### 🚀 **Performance Features:**
- **Rust Core**: Native performance without Python GIL limitations
- **Parallel Processing**: Automatic parallelization with Rayon
- **Batch Optimization**: Efficient processing of large datasets
- **High-Level API**: `compute_text_metrics()` for computing multiple metrics at once

### 📊 **Key Benefits:**
- **Speed**: 10-100x faster than pure Python implementations
- **Accuracy**: Mathematically equivalent to standard implementations
- **Scalability**: Efficient batch processing and parallelization
- **Ease of Use**: Simple, familiar Python API

### 🔄 **Next Steps:**
Continue to the next notebook to explore:
1. **🛡️ [Guardrails Showcase](./03_guardrails_showcase.ipynb)** - Content moderation and safety features
2. **🔄 [Streaming & Monitoring](./04_streaming_monitoring.ipynb)** - Real-time evaluation
3. **🏭 [Production Workflows](./05_production_workflows.ipynb)** - Batch processing and deployment

BlazeMetrics provides enterprise-grade performance for all your NLP evaluation needs!