# Similarity Measurements Tutorial

## Cost Warning

- **DEMO MODE**: $0.20-0.50 (10 queries)
- **FULL MODE**: $0.80-1.20 (50+ queries)

## Learning Objectives

By the end of this tutorial, you will:

1. Understand different similarity measurement techniques
2. Implement exact match, fuzzy match, BLEU score, and semantic similarity
3. Compare trade-offs between computational efficiency and semantic understanding
4. Apply appropriate similarity metrics for different evaluation scenarios
5. Analyze when to use each similarity measurement approach

In [None]:
# Configuration
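DEMO_MODE = True  # Set to False for the full dataset
# Note: the sample list below holds only 10 examples, so a NUM_QUERIES above 10
# has no effect unless you extend test_examples.
NUM_QUERIES = 10 if DEMO_MODE else 50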

In [None]:
# Imports
import os
import re

import numpy as np
import pandas as pd
from Levenshtein import ratio
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

## Sample Data

We'll use recipe-related test examples with queries, reference answers, and candidate answers to evaluate different similarity metrics.

In [None]:
# 10 recipe test examples
test_examples = [
    {
        "query": "How do I make chocolate chip cookies?",
        "reference": "Mix butter, sugar, eggs, flour, and chocolate chips. Bake at 350F for 12 minutes.",
        "candidate": "Combine butter, sugar, eggs, flour, and chocolate chips. Bake at 350 degrees for 12 minutes."
    },
    {
        "query": "What ingredients do I need for pasta carbonara?",
        "reference": "You need pasta, eggs, parmesan cheese, pancetta, and black pepper.",
        "candidate": "The ingredients are spaghetti, eggs, parmesan, bacon, and pepper."
    },
    {
        "query": "How long should I roast a chicken?",
        "reference": "Roast at 425F for 20 minutes per pound, plus an additional 15 minutes.",
        "candidate": "Cook at 425 degrees for approximately 20 minutes per pound."
    },
    {
        "query": "What's the best way to cook rice?",
        "reference": "Use a 2:1 water to rice ratio. Bring to boil, then simmer covered for 18 minutes.",
        "candidate": "Use 2 cups water for 1 cup rice. Boil, then cover and simmer for 18 minutes."
    },
    {
        "query": "How do I make scrambled eggs?",
        "reference": "Beat eggs with milk, pour into hot buttered pan, and stir gently until set.",
        "candidate": "Whisk eggs with a splash of milk, cook in butter over medium heat, stirring constantly."
    },
    {
        "query": "What temperature should I bake bread?",
        "reference": "Bake at 375F for 30-35 minutes until golden brown.",
        "candidate": "Preheat oven to 375 degrees and bake for 30-35 minutes."
    },
    {
        "query": "How do I make pizza dough?",
        "reference": "Mix flour, yeast, water, salt, and olive oil. Knead for 10 minutes and let rise for 1 hour.",
        "candidate": "Combine flour, yeast, warm water, salt, and oil. Knead well and allow to rise."
    },
    {
        "query": "What's the secret to fluffy pancakes?",
        "reference": "Don't overmix the batter and let it rest for 5 minutes before cooking.",
        "candidate": "Mix the batter gently and let it sit briefly before making pancakes."
    },
    {
        "query": "How do I properly season a steak?",
        "reference": "Generously apply salt and pepper to both sides at least 40 minutes before cooking.",
        "candidate": "Season liberally with salt and pepper on both sides, preferably 40+ minutes ahead."
    },
    {
        "query": "What's the ideal temperature for brewing coffee?",
        "reference": "Water should be between 195-205F for optimal extraction.",
        "candidate": "The water temperature should range from 195 to 205 degrees Fahrenheit."
    }
]

# Limit to NUM_QUERIES
test_examples = test_examples[:NUM_QUERIES]
print(f"Loaded {len(test_examples)} test examples")

## Step 1: Exact Match

The simplest similarity metric asks one question: are the strings exactly the same after normalization (here, lowercasing and collapsing whitespace)?
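
As a quick illustration, the sketch below applies the same kind of normalization to a few hand-written pairs. The helper name `normalized_equal` is hypothetical and is not part of the tutorial code; it only shows which differences normalization absorbs and which it does not.

```python
import re


def normalized_equal(a: str, b: str) -> bool:
    """Compare two strings after lowercasing and collapsing whitespace.

    Hypothetical helper, for illustration only.
    """
    def norm(s: str) -> str:
        return re.sub(r"\s+", " ", s.lower().strip())

    return norm(a) == norm(b)


# Case and spacing differences are absorbed by normalization.
case_and_space = normalized_equal("Bake at  350F", "bake at 350f ")

# Punctuation and wording differences are not.
punctuation = normalized_equal("Bake at 350F.", "Bake at 350F")
paraphrase = normalized_equal("Bake at 350F", "Bake at 350 degrees")

print(case_and_space, punctuation, paraphrase)
```

Only the first pair matches. A trailing period or a unit spelled differently is enough to break an exact match, which is why this metric tends to score low on free-form answers.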

In [None]:
def normalize_text(text: str) -> str:
    """Normalize text by lowercasing and removing extra whitespace.
    
    Args:
        text: Input text to normalize
        
    Returns:
        Normalized text string
    """
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    
    # Lowercase and remove extra whitespace
    normalized = re.sub(r'\s+', ' ', text.lower().strip())
    return normalized


def exact_match(reference: str, candidate: str) -> bool:
    """Check if two strings match exactly after normalization.
    
    Args:
        reference: Reference string
        candidate: Candidate string to compare
        
    Returns:
        True if strings match exactly, False otherwise
    """
    if not isinstance(reference, str) or not isinstance(candidate, str):
        raise TypeError("Both reference and candidate must be strings")
    
    return normalize_text(reference) == normalize_text(candidate)


# Calculate exact match rate
exact_matches = [exact_match(ex["reference"], ex["candidate"]) for ex in test_examples]
exact_match_rate = sum(exact_matches) / len(exact_matches) * 100

print(f"Exact Match Rate: {exact_match_rate:.1f}%")
print(f"Matches: {sum(exact_matches)}/{len(exact_matches)}")

## Step 2: Fuzzy Match (Levenshtein Ratio)

Fuzzy matching measures string similarity based on edit distance. The Levenshtein ratio ranges from 0 (completely different) to 1 (identical).
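
To make the score less of a black box, here is a minimal pure-Python sketch of a normalized edit-based ratio. It assumes the ratio is the normalized indel similarity, `1 - distance / (len(a) + len(b))`, where only insertions and deletions are counted. To my understanding that is how recent releases of the `Levenshtein` package define `ratio`, but treat it as an assumption and check the documentation for your installed version. The function name `indel_ratio` is made up for this example.

```python
def indel_ratio(a: str, b: str) -> float:
    """Normalized indel similarity: 1 - indel_distance / (len(a) + len(b)).

    The indel distance counts insertions and deletions only, so it equals
    len(a) + len(b) - 2 * LCS(a, b), where LCS is the length of the longest
    common subsequence.
    """
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical

    # Dynamic-programming table for the longest common subsequence,
    # kept to two rows to save memory.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]

    distance = total - 2 * lcs
    return 1.0 - distance / total


score = indel_ratio("kitten", "sitting")
print(f"{score:.3f}")
```

For "kitten" and "sitting" the longest common subsequence is "ittn" (length 4), so the distance is 6 + 7 - 8 = 5 and the ratio is 8/13. The score depends only on shared characters in order, not on what the words mean.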

In [None]:
def fuzzy_match(reference: str, candidate: str, threshold: float = 0.8) -> tuple[bool, float]:
    """Calculate fuzzy match using Levenshtein ratio.
    
    Args:
        reference: Reference string
        candidate: Candidate string to compare
        threshold: Minimum ratio to consider a match (0.0-1.0)
        
    Returns:
        Tuple of (is_match, similarity_score)
        
    Raises:
        TypeError: If inputs are not strings
        ValueError: If threshold is not between 0 and 1
    """
    if not isinstance(reference, str) or not isinstance(candidate, str):
        raise TypeError("Both reference and candidate must be strings")
    
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    
    # Calculate Levenshtein ratio
    similarity = ratio(normalize_text(reference), normalize_text(candidate))
    is_match = similarity >= threshold
    
    return is_match, similarity


# Test with different thresholds
thresholds = [0.7, 0.8, 0.9]

for thresh in thresholds:
    results = [fuzzy_match(ex["reference"], ex["candidate"], thresh) for ex in test_examples]
    matches = sum(match for match, _ in results)
    avg_score = np.mean([score for _, score in results])
    
    print(f"\nThreshold {thresh}:")
    print(f"  Match Rate: {matches/len(results)*100:.1f}%")
    print(f"  Average Score: {avg_score:.3f}")

# Store raw scores for later comparison (the threshold does not affect the score)
fuzzy_scores = [fuzzy_match(ex["reference"], ex["candidate"], 0.8)[1] for ex in test_examples]

## Step 3: BLEU Score

BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between reference and candidate text. It was originally designed for machine translation evaluation.
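
The two main ingredients of BLEU, clipped n-gram precision and the brevity penalty, can be sketched in a few lines of plain Python. This is a simplified illustration for a single reference, not a replacement for `sentence_bleu`; the function names below are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference: list[str], candidate: list[str], n: int) -> float:
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    ref_counts = ngrams(reference, n)
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(reference: list[str], candidate: list[str]) -> float:
    """Penalize candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


reference = "the cat is on the mat".split()
candidate = "the the the the".split()

p1 = modified_precision(reference, candidate, 1)
bp = brevity_penalty(reference, candidate)
print(p1, round(bp, 3))
```

Here "the" appears four times in the candidate but only twice in the reference, so the unigram precision is clipped to 2/4. The candidate is also shorter than the reference (4 vs. 6 tokens), so the brevity penalty falls below 1. Full BLEU combines the geometric mean of the 1- to 4-gram precisions with this penalty.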

In [None]:
def calculate_bleu(reference: str, candidate: str) -> float:
    """Calculate BLEU score between reference and candidate.
    
    Args:
        reference: Reference string
        candidate: Candidate string to compare
        
    Returns:
        BLEU score (0.0-1.0)
        
    Raises:
        TypeError: If inputs are not strings
    """
    if not isinstance(reference, str) or not isinstance(candidate, str):
        raise TypeError("Both reference and candidate must be strings")
    
    # Tokenize by whitespace (punctuation stays attached, so "chips," and "chips." are different tokens)
    reference_tokens = normalize_text(reference).split()
    candidate_tokens = normalize_text(candidate).split()
    
    if not candidate_tokens:
        return 0.0
    
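    # Smoothing (method1) adds a small epsilon to zero n-gram counts so that
    # short texts with no higher-order n-gram matches do not collapse to 0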
    smoothing = SmoothingFunction().method1
    score = sentence_bleu([reference_tokens], candidate_tokens, smoothing_function=smoothing)
    
    return score


# Calculate BLEU scores for all examples
bleu_scores = [calculate_bleu(ex["reference"], ex["candidate"]) for ex in test_examples]

print(f"Average BLEU Score: {np.mean(bleu_scores):.3f}")
print(f"Min BLEU Score: {np.min(bleu_scores):.3f}")
print(f"Max BLEU Score: {np.max(bleu_scores):.3f}")

# Show individual scores
print("\nIndividual BLEU Scores:")
for i, (ex, score) in enumerate(zip(test_examples, bleu_scores), 1):
    print(f"{i}. {score:.3f} - {ex['query'][:50]}...")

## Step 4: Semantic Similarity

Semantic similarity uses embeddings to capture meaning, not just surface-level text overlap. This can identify semantically equivalent text even with different wording.
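
Before calling the API, it can help to see what cosine similarity does on vectors small enough to check by hand. The sketch below uses made-up 2-D vectors as stand-ins for real embeddings, which have many more dimensions, and a hypothetical helper named `cosine`.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy 2-D "embeddings" (made-up values, not real model output).
same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])      # unrelated directions
opposite = cosine([1.0, 0.0], [-1.0, 0.0])       # opposite directions

# Rescale from [-1, 1] to [0, 1] with (s + 1) / 2.
rescaled = [(s + 1) / 2 for s in (same_direction, orthogonal, opposite)]
print(rescaled)
```

The three cases come out at roughly 1, 0, and -1 before rescaling, and roughly 1, 0.5, and 0 after. Note the consequence of the `(s + 1) / 2` mapping: vectors with no relationship at all land at 0.5, so rescaled semantic scores sit on a different scale from fuzzy or BLEU scores and should not be compared to them number for number.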

In [None]:
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Get OpenAI embedding for text.
    
    Args:
        text: Text to embed
        model: OpenAI embedding model name
        
    Returns:
        Embedding vector as list of floats
        
    Raises:
        TypeError: If text is not a string
        ValueError: If text is empty or whitespace-only
    """
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    
    if not text.strip():
        raise ValueError("text cannot be empty")
    
    response = client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding


def cosine_similarity(vec1: list[float], vec2: list[float]) -> float:
    """Calculate cosine similarity between two vectors.
    
    Args:
        vec1: First vector
        vec2: Second vector
        
    Returns:
        Cosine similarity score (-1.0 to 1.0)
        
    Raises:
        TypeError: If inputs are not lists
        ValueError: If vectors have different lengths
    """
    if not isinstance(vec1, list) or not isinstance(vec2, list):
        raise TypeError("Both vectors must be lists")
    
    if len(vec1) != len(vec2):
        raise ValueError("Vectors must have the same length")
    
    if not vec1 or not vec2:
        raise ValueError("Vectors cannot be empty")
    
    # Convert to numpy arrays
    v1 = np.array(vec1)
    v2 = np.array(vec2)
    
    # Calculate cosine similarity
    dot_product = np.dot(v1, v2)
    norm_product = np.linalg.norm(v1) * np.linalg.norm(v2)
    
    if norm_product == 0:
        return 0.0
    
    # Cast to a plain Python float to match the declared return type
    return float(dot_product / norm_product)


def semantic_similarity(reference: str, candidate: str) -> float:
    """Calculate semantic similarity using embeddings.
    
    Args:
        reference: Reference string
        candidate: Candidate string to compare
        
    Returns:
        Semantic similarity score (0.0-1.0)
    """
    if not isinstance(reference, str) or not isinstance(candidate, str):
        raise TypeError("Both reference and candidate must be strings")
    
    # Get embeddings
    ref_embedding = get_embedding(reference)
    cand_embedding = get_embedding(candidate)
    
    # Calculate similarity
    similarity = cosine_similarity(ref_embedding, cand_embedding)
    
    # Rescale from [-1, 1] to [0, 1]; a cosine of 0 (unrelated vectors) maps to 0.5
    return (similarity + 1) / 2


# Calculate semantic similarity for all examples
print("Calculating semantic similarities (this may take a moment)...")
semantic_scores = []

for i, ex in enumerate(test_examples, 1):
    score = semantic_similarity(ex["reference"], ex["candidate"])
    semantic_scores.append(score)
    print(f"  {i}/{len(test_examples)} - Score: {score:.3f}")

print(f"\nAverage Semantic Similarity: {np.mean(semantic_scores):.3f}")
print(f"Min Semantic Similarity: {np.min(semantic_scores):.3f}")
print(f"Max Semantic Similarity: {np.max(semantic_scores):.3f}")

## Step 5: Comparison Analysis

Let's compare all methods side-by-side to understand their strengths and weaknesses.

In [None]:
# Create comparison DataFrame
comparison_data = []

for i, ex in enumerate(test_examples):
    comparison_data.append({
        "Query": ex["query"][:40] + "...",
        "Exact Match": exact_matches[i],
        "Fuzzy": fuzzy_scores[i],  # raw ratio; no threshold applied
        "BLEU": bleu_scores[i],
        "Semantic": semantic_scores[i]
    })

df = pd.DataFrame(comparison_data)

print("\n" + "="*80)
print("COMPARISON OF SIMILARITY METHODS")
print("="*80 + "\n")
print(df.to_string(index=False))

# Summary statistics
print("\n" + "="*80)
print("SUMMARY STATISTICS")
print("="*80 + "\n")

summary = pd.DataFrame({
    "Metric": ["Exact Match", "Fuzzy Match", "BLEU", "Semantic"],
    "Mean": [
        sum(exact_matches) / len(exact_matches),
        np.mean(fuzzy_scores),
        np.mean(bleu_scores),
        np.mean(semantic_scores)
    ],
    "Std Dev": [
        np.std([float(x) for x in exact_matches]),
        np.std(fuzzy_scores),
        np.std(bleu_scores),
        np.std(semantic_scores)
    ],
    "Min": [
        min([float(x) for x in exact_matches]),
        min(fuzzy_scores),
        min(bleu_scores),
        min(semantic_scores)
    ],
    "Max": [
        max([float(x) for x in exact_matches]),
        max(fuzzy_scores),
        max(bleu_scores),
        max(semantic_scores)
    ]
}).round(3)

print(summary.to_string(index=False))

## Summary and Recommendations

### Method Comparison

| Method | Pros | Cons | Best For |
|--------|------|------|----------|
| **Exact Match** | Fast, deterministic, no API cost | Too strict, misses paraphrases | Classification tasks, code validation |
| **Fuzzy Match** | Fast, handles typos, no API cost | Surface-level only, requires threshold tuning | Near-duplicates, typo detection |
| **BLEU** | Established in MT, rewards n-gram overlap | Doesn't capture semantics, scores are hard to interpret in isolation | Translation, text generation |
| **Semantic** | Captures meaning, robust to paraphrasing | API cost, slower, needs embedding model | QA evaluation, semantic search |

### Key Insights

1. **Exact match** is too strict for most NLP tasks: it rarely matches even when answers are semantically identical
2. **Fuzzy match** is useful for handling minor variations but doesn't understand meaning
3. **BLEU score** correlates with word overlap but can miss semantic equivalence
4. **Semantic similarity** best captures meaning but requires API calls and is slower

### Recommendations

- **For evaluation pipelines**: Use semantic similarity as the primary metric
- **For fast screening**: Start with fuzzy match, then apply semantic similarity to uncertain cases
- **For retrieval**: Use semantic similarity for ranking
- **For exact requirements**: Use exact match only when truly necessary (e.g., code, IDs)
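
The fast-screening recommendation above can be sketched as a two-stage cascade. This is a minimal illustration under stated assumptions: the cut-off values are hypothetical and need tuning on labelled data, the standard library's `difflib.SequenceMatcher` stands in for the Levenshtein ratio so the example runs without extra packages, and `semantic_fn` is a placeholder for any function that returns a similarity score, such as an embedding-based one.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical cut-offs; tune them on labelled data.
ACCEPT_ABOVE = 0.9
REJECT_BELOW = 0.3
SEMANTIC_THRESHOLD = 0.8


def cascade_match(
    reference: str,
    candidate: str,
    semantic_fn: Callable[[str, str], float],
) -> tuple[bool, str]:
    """Return (is_match, stage), calling semantic_fn only for uncertain pairs."""
    cheap = SequenceMatcher(None, reference.lower(), candidate.lower()).ratio()
    if cheap >= ACCEPT_ABOVE:
        return True, "fuzzy"
    if cheap < REJECT_BELOW:
        return False, "fuzzy"
    return semantic_fn(reference, candidate) >= SEMANTIC_THRESHOLD, "semantic"


calls = []


def fake_semantic(a: str, b: str) -> float:
    """Stub that records each call and returns a fixed, made-up score."""
    calls.append((a, b))
    return 0.85


near_identical = cascade_match("Bake at 375F", "bake at 375f", fake_semantic)
uncertain = cascade_match(
    "Bake at 375F for 30 minutes",
    "Bake at 375 degrees for 30 minutes",
    fake_semantic,
)
unrelated = cascade_match("Bake at 375F", "zzz", fake_semantic)

print(near_identical, uncertain, unrelated, len(calls))
```

Only the middle pair falls between the two cheap cut-offs, so the stub is called once for three comparisons. With a real embedding function in place of `fake_semantic`, that is where the API cost would be saved.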

### Trade-offs

- **Speed vs. Accuracy**: Exact/Fuzzy are fast but miss paraphrases; Semantic is slower but better at recognizing equivalent meaning
- **Cost vs. Quality**: Exact/Fuzzy/BLEU run locally with no API cost; Semantic requires paid API calls
- **Simplicity vs. Sophistication**: Simpler methods are easier to debug; Semantic is more powerful but harder to interpret

Choose the right metric based on your specific use case, balancing speed, cost, and accuracy requirements.