# Using evals.log for LLM Improvement (Composable App Tutorial)

## Learning Objectives
By completing this tutorial, you will:
- Understand why systematic evaluation is critical for LLM applications
- Learn to parse and analyze `evals.log` for quality insights
- Implement quantitative metrics (ideal count, diversity scoring)
- Use embedding variance to measure semantic diversity
- Apply statistical analysis to identify improvement opportunities

## Prerequisites
- **Python**: Intermediate proficiency with JSON, dataclasses, NumPy
- **Statistics**: Basic understanding of mean, variance, descriptive statistics
- **Setup**: Composable app installed with sample `evals.log` (generated from running the app)

## Estimated Time
30-35 minutes (reading + execution)

## Cost Estimate
✅ **Free** - This tutorial reads local logs, no API calls required

> **Book Reference**: This pattern is detailed in *Generative AI Design Patterns*
> (Lakshmanan & Hapke, 2025), Chapter 16: "LLM Evaluation" and Chapter 30: "Observability".

---

## Why Evaluation Matters

**Task 4.2.2**: Conceptual section - Why evaluation matters, LLM-as-judge vs. metrics

### The Evaluation Challenge

LLM outputs are **non-deterministic** and **difficult to assess**:
- Traditional unit tests (`assert output == expected`) don't work for generative AI
- Human evaluation is expensive and doesn't scale
- Quality metrics vary by use case (accuracy, helpfulness, tone, citations)

### Two Evaluation Approaches

#### 1. **Quantitative Metrics** (This Tutorial)
- **Measurable**: Ideal count, length, diversity, response time
- **Fast**: Run on every output, no LLM calls needed
- **Objective**: Same input always produces same score
- **Limitations**: May miss semantic quality issues

**Example**: Keyword quality evaluation
```python
# Ideal: 5 keywords per article
score = 1.0 - abs(len(keywords) - 5) / 5.0
# 5 keywords → score = 1.0 (perfect)
# 3 keywords → score = 0.6 (penalty for too few)
```

#### 2. **LLM-as-Judge** (Covered in llm_as_judge_tutorial.ipynb)
- **Semantic**: Evaluates meaning, correctness, helpfulness
- **Flexible**: Can assess subjective qualities (tone, creativity)
- **Costly**: Requires LLM API calls for each evaluation
- **Bias risk**: Judge LLM may have preferences or blind spots

**Example**: Correctness evaluation
```python
judge_prompt = f"""Rate the accuracy of this answer (0-10):
Question: {question}
Answer: {llm_answer}
Ground Truth: {reference_answer}
"""
score = await judge_llm.evaluate(judge_prompt)
```

### Composable App Use Case

The composable app evaluates **keyword quality** for generated articles:
- **Metric 1**: Ideal count (5 keywords is optimal for search indexing)
- **Metric 2**: Diversity (keywords should cover different aspects)
- **Source**: `evals.log` records every AI-generated draft
- **Goal**: Identify systematic issues (e.g., GenAIWriter always generates 8 keywords)

**Code Location**: [`evals/evaluate_keywords.py`](../../evals/evaluate_keywords.py)

---

## Setup Cell

**Task 4.2.1**: Setup cell with imports, no API calls needed (reads logs)

In [None]:
# Add project root to path for imports
import sys
sys.path.insert(0, '../..')  # Navigate to composable_app/ root

# Standard library imports
import json
import os
from typing import List, Dict, Any

# Data science imports
import numpy as np
from scipy import stats
from sentence_transformers import SentenceTransformer

# Project imports
from agents.article import Article  # Needed for eval() to reconstruct Article objects

print("✅ Setup complete")
print("✅ No API calls needed - this tutorial reads local logs")
print("\n⚠️ Note: On first run, SentenceTransformer will download ~90MB model from HuggingFace")

---

## Reading evals.log

**Task 4.2.3**: Code section - Reading logs/evals.log (JSON lines format)

### evals.log Format

The `utils/save_for_eval.py` module records every AI response:

```python
# From save_for_eval.py:5-10
async def record_ai_response(target, ai_input, ai_response):
    logger.info(f"AI Response", extra={
        "target": target,        # What was evaluated (e.g., "initial_draft")
        "ai_input": ai_input,    # Input to LLM (e.g., topic)
        "ai_response": ai_response,  # Output from LLM (Article object)
    })
```

**Log Format**: JSON Lines (one JSON object per line)
```json
{"target": "initial_draft", "ai_input": "Photosynthesis", "ai_response": "Article(title='...')"}
{"target": "initial_draft", "ai_input": "Cell Division", "ai_response": "Article(...)"}
```

### Why JSON Lines?
- **Append-friendly**: Each run adds lines, no need to rewrite file
- **Streaming**: Can process large logs without loading everything into memory
- **Resilient**: Corrupted line doesn't break entire file
- **Standard**: Widely supported by log analysis tools (jq, Logfire, Splunk)
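
The streaming and resilience properties above can be sketched with a small generator. This is a minimal illustration rather than the app's own code: `iter_jsonl` is a hypothetical helper name, and the in-memory sample stands in for a real `evals.log`.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per line, skipping blank or malformed lines."""
    for line in stream:  # File objects iterate lazily, one line at a time
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # A corrupted line does not stop the rest of the file


# Hypothetical sample log: two valid records around one corrupted line
sample_log = io.StringIO(
    '{"target": "initial_draft", "ai_input": "Photosynthesis"}\n'
    '{"target": "initial_draft", "ai_inp\n'
    '{"target": "revised_draft", "ai_input": "Cell Division"}\n'
)
records = list(iter_jsonl(sample_log))
print([r["target"] for r in records])
```

Because the generator never holds more than one line at a time, the same loop should work on a log that is too large to load at once.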

In [None]:
def get_records(target: str = "initial_draft", log_path: str = "../../evals.log") -> List[Article]:
    """Parse evals.log and extract Article objects for given target.
    
    Args:
        target: Filter for specific evaluation target (e.g., "initial_draft", "revised_draft")
        log_path: Path to evals.log file
    
    Returns:
        List of Article objects recorded in log
    
    Note:
        Uses eval() to reconstruct Article objects from string representation.
        This works because Article is a dataclass with a proper __repr__.
    """
    records = []
    
    # Check if log file exists
    if not os.path.exists(log_path):
        print(f"⚠️ Log file not found: {log_path}")
        print("   Run the composable app to generate evals.log")
        return records
    
    # Parse JSON lines
    with open(log_path, encoding="utf-8") as ifp:
        # Iterate the file object directly so lines are streamed, not loaded all at once
        for line_num, line in enumerate(ifp, 1):
            try:
                obj = json.loads(line)
                
                # Filter by target
                if obj.get('target') == target:
                    # Reconstruct Article object from string representation
                    article = eval(obj['ai_response'])
                    records.append(article)
                    
            except json.JSONDecodeError as e:
                print(f"⚠️ Skipping malformed JSON at line {line_num}: {e}")
            except (KeyError, NameError, SyntaxError, TypeError) as e:
                # TypeError covers reprs whose fields no longer match the Article constructor
                print(f"⚠️ Skipping invalid record at line {line_num}: {e}")
    
    print(f"✅ Loaded {len(records)} records for target='{target}'")
    return records

# Demo: Load initial drafts from evals.log
articles = get_records(target="initial_draft")

if articles:
    print(f"\n📋 Sample Article:")
    print(f"   Title: {articles[0].title}")
    print(f"   Keywords: {articles[0].index_keywords}")
    print(f"   Text length: {len(articles[0].full_text)} chars")
else:
    print("\n⚠️ No articles found. Generate sample data by running the app.")

---

## Walkthrough: evaluate_keywords.py

**Task 4.2.4**: Code section - Walkthrough of evaluate_keywords.py (line-by-line)

### Evaluation Strategy

The `evaluate_keywords.py` script implements a **composite metric** for keyword quality:

```python
# From evals/evaluate_keywords.py:29-39
def evaluate(keywords: List[str], embedding_model) -> float:
    # Metric 1: Ideal count penalty
    # If we have 5 keywords, it is ideal. Anything more or less is penalized
    score = 1.0 - (np.abs(len(keywords) - 5) / 5.0)
    score = np.clip(score, 0.0, 1.0)  # Clamp to [0, 1]
    
    # Metric 2: Diversity bonus
    # The more diverse the set of keywords, the better
    # We calculate diversity as variance of the embeddings
    embeds = [np.mean(embedding_model.encode(keyword)) for keyword in keywords]
    score += np.var(embeds)
    
    return score
```

### Metric 1: Ideal Count Penalty

**Goal**: Prefer exactly 5 keywords (common best practice for search indexing)

**Formula**: `penalty = abs(len(keywords) - 5) / 5.0`

**Examples**:
- `5 keywords`: penalty = 0.0 → score = 1.0 ✅ (perfect)
- `3 keywords`: penalty = 0.4 → score = 0.6 (too few)
- `8 keywords`: penalty = 0.6 → score = 0.4 (too many)
- `0 keywords`: penalty = 1.0 → score = 0.0 ❌ (worst)

**Why penalize?**
- **Too few**: Under-indexed, poor search discoverability
- **Too many**: Keyword stuffing, dilutes relevance signals

### Metric 2: Diversity Bonus

**Goal**: Keywords should cover different semantic aspects, not duplicates

**Formula**: `diversity = variance(keyword_embeddings)`

**Why variance?**
- **High variance**: Keywords are semantically distant (good diversity)
  - Example: `["photosynthesis", "chloroplast", "light", "oxygen", "glucose"]`
  - Covers: process, organelle, input, outputs
- **Low variance**: Keywords are semantically similar (redundant)
  - Example: `["photosynthesis", "photosynthetic", "photosynthesize", "photosystem"]`
  - All variations of same concept

**Implementation Detail**: Uses `np.mean(embedding)` to collapse each embedding vector (384-dim for `all-MiniLM-L6-v2`) to a scalar
- This is a **simplification** for demonstration
- Production version should use full embedding variance or pairwise cosine distance
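
Why the mean-collapse is only a simplification can be seen with a toy case. The sketch below uses made-up 4-dim vectors instead of real embeddings and the standard library's `pvariance`, which divides by n like `np.var` does by default.

```python
from statistics import pvariance

# Two toy 4-dim "embeddings" pointing in orthogonal directions (made-up values)
emb_a = [1.0, 0.0, 0.0, 0.0]
emb_b = [0.0, 0.0, 1.0, 0.0]

# Collapse each vector to its mean, as the simplified metric does
mean_a = sum(emb_a) / len(emb_a)
mean_b = sum(emb_b) / len(emb_b)

# Variance of the collapsed scalars: both means are equal, so diversity looks like zero
collapsed_variance = pvariance([mean_a, mean_b])

# Per-dimension variance, averaged: still sees that the vectors differ
per_dim = [pvariance([x, y]) for x, y in zip(emb_a, emb_b)]
full_variance = sum(per_dim) / len(per_dim)

print(collapsed_variance, full_variance)
```

The two vectors share no direction at all, yet the collapsed metric cannot tell them apart, which is the information loss the full-variance and pairwise approaches avoid.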

### Composite Score

**Final score** = `ideal_count_score + diversity_bonus`

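**Range**: Not normalized. The count term lies in [0.0, 1.0] and the diversity bonus is added on top
- Higher is better
- Allows comparing relative quality across articles
- With the mean-collapse variance the bonus is usually very small (see Common Pitfalls), so scores tend to sit close to the count term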

---

## Metric 1: Keyword Count Evaluation

**Task 4.2.5**: Code section - Keyword quality evaluation (ideal count, diversity)

In [None]:
def evaluate_keyword_count(keywords: List[str], ideal_count: int = 5) -> float:
    """Evaluate keyword list based on ideal count.
    
    Args:
        keywords: List of keywords from Article.index_keywords
        ideal_count: Target number of keywords (default: 5)
    
    Returns:
        Score from 0.0 (worst) to 1.0 (perfect)
    """
    # Calculate penalty for deviation from ideal count
    penalty = np.abs(len(keywords) - ideal_count) / ideal_count
    score = 1.0 - penalty
    
    # Clamp to valid range [0.0, 1.0]
    score = np.clip(score, 0.0, 1.0)
    
    return score

# Demo: Evaluate keyword counts
if articles:
    print("📊 Keyword Count Analysis:\n")
    print(f"{'Title':<30} {'Count':<8} {'Score':<8} {'Status'}")
    print("-" * 70)
    
    count_scores = []
    for article in articles[:10]:  # Show first 10
        count = len(article.index_keywords)
        score = evaluate_keyword_count(article.index_keywords)
        count_scores.append(score)
        
        # Status indicator
        if score >= 0.8:
            status = "✅ Good"
        elif score >= 0.5:
            status = "⚠️ Fair"
        else:
            status = "❌ Poor"
        
        # Truncate title for display
        title = article.title[:28] + ".." if len(article.title) > 30 else article.title
        print(f"{title:<30} {count:<8} {score:<8.2f} {status}")
    
    # Summary statistics
    print("\n📈 Summary Statistics:")
    print(f"   Mean score: {np.mean(count_scores):.2f}")
    print(f"   Std dev: {np.std(count_scores):.2f}")
    print(f"   Min: {np.min(count_scores):.2f}, Max: {np.max(count_scores):.2f}")
    
    # Insight
    avg_count = np.mean([len(a.index_keywords) for a in articles])
    print(f"\n💡 Insight: Average keyword count is {avg_count:.1f}")
    if avg_count > 6:
        print("   ⚠️ Consider tuning prompts to reduce keyword count")
    elif avg_count < 4:
        print("   ⚠️ Consider tuning prompts to increase keyword count")
    else:
        print("   ✅ Keyword count is well-calibrated")
else:
    print("⚠️ No articles to evaluate. Run the app to generate evals.log")

---

## Metric 2: Keyword Diversity with Embeddings

**Task 4.2.6**: Code section - Embedding variance with SentenceTransformer

### Why Use Embeddings for Diversity?

**Problem**: How to measure if keywords are "different"?
- **Lexical distance** (edit distance, n-grams): Misses semantic similarity
  - "car" and "automobile" are lexically different but semantically identical
- **Embedding distance**: Captures semantic meaning
  - "photosynthesis" and "cellular respiration" have moderate distance (related processes)
  - "photosynthesis" and "economics" have large distance (unrelated)

**Approach**: Calculate variance of keyword embeddings
- **High variance**: Keywords are semantically diverse (good)
- **Low variance**: Keywords are semantically similar (redundant)
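
To make "embedding distance" concrete, here is a minimal sketch of average pairwise cosine distance in plain Python. The 3-dim vectors are invented stand-ins for keyword embeddings, and `cosine_distance` and `mean_pairwise_distance` are hypothetical helper names, not functions from the app.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors: Sequence[Sequence[float]]) -> float:
    """Average cosine distance over all unordered pairs (0.0 if fewer than 2)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Made-up 3-dim vectors standing in for keyword embeddings
redundant = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.1, 0.0]]
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(mean_pairwise_distance(redundant) < mean_pairwise_distance(diverse))
```

Vectors that point the same way score near 0 and orthogonal vectors score 1, so a redundant keyword set ends up with a much lower average than a diverse one.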

### SentenceTransformer Model

**Model**: `all-MiniLM-L6-v2`
- **Size**: ~90MB download (first run only)
- **Dimensions**: 384
- **Speed**: ~1000 sentences/second on CPU
- **Quality**: Good for keyword similarity tasks

**Alternative models**:
- `all-mpnet-base-v2`: Higher quality, slower (768 dims)
- `paraphrase-MiniLM-L3-v2`: Faster, lower quality (384 dims)

In [None]:
# Load embedding model (downloads ~90MB on first run)
print("📥 Loading SentenceTransformer model...")
print("   (First run: downloads ~90MB from HuggingFace)")

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

print("✅ Model loaded: all-MiniLM-L6-v2 (384 dimensions)")

def evaluate_keyword_diversity(keywords: List[str], embedding_model) -> float:
    """Evaluate keyword diversity using embedding variance.
    
    Args:
        keywords: List of keywords from Article.index_keywords
        embedding_model: SentenceTransformer model for encoding
    
    Returns:
        Diversity score (higher = more diverse)
        
    Note:
        This is a simplified metric. Production version should:
        - Use full embedding variance (not just mean)
        - Normalize by number of keywords
        - Consider pairwise cosine distances
    """
    if len(keywords) == 0:
        return 0.0
    
    # Encode keywords to embeddings
    embeddings = embedding_model.encode(keywords)  # Shape: (num_keywords, 384)
    
    # Simplified diversity: variance of mean embeddings
    # (Original evaluate_keywords.py uses this for demonstration)
    mean_embeddings = [np.mean(emb) for emb in embeddings]
    diversity = np.var(mean_embeddings)
    
    return diversity

def evaluate_keyword_diversity_full(keywords: List[str], embedding_model) -> Dict[str, float]:
    """Advanced diversity metrics using full embeddings.
    
    Returns multiple diversity measures for comparison.
    """
    if len(keywords) <= 1:
        return {"mean_variance": 0.0, "pairwise_distance": 0.0, "full_variance": 0.0}
    
    # Encode keywords
    embeddings = embedding_model.encode(keywords)  # Shape: (num_keywords, 384)
    
    # Metric 1: Mean embedding variance (original)
    mean_embeddings = [np.mean(emb) for emb in embeddings]
    mean_variance = np.var(mean_embeddings)
    
    # Metric 2: Full embedding variance (better)
    full_variance = np.mean(np.var(embeddings, axis=0))
    
    # Metric 3: Average pairwise cosine distance (best)
    from sklearn.metrics.pairwise import cosine_distances
    distances = cosine_distances(embeddings)
    # Get upper triangle (avoid diagonal and duplicates)
    triu_indices = np.triu_indices_from(distances, k=1)
    pairwise_distance = np.mean(distances[triu_indices])
    
    return {
        "mean_variance": mean_variance,
        "full_variance": full_variance,
        "pairwise_distance": pairwise_distance
    }

# Demo: Evaluate diversity for sample articles
if articles:
    print("\n📊 Keyword Diversity Analysis:\n")
    print(f"{'Title':<30} {'Simple':<10} {'Full':<10} {'Pairwise':<10}")
    print("-" * 70)
    
    for article in articles[:5]:  # Show first 5
        metrics = evaluate_keyword_diversity_full(article.index_keywords, embedding_model)
        
        title = article.title[:28] + ".." if len(article.title) > 30 else article.title
        print(f"{title:<30} {metrics['mean_variance']:<10.4f} {metrics['full_variance']:<10.4f} {metrics['pairwise_distance']:<10.4f}")
    
    print("\n💡 Metric Comparison:")
    print("   - Simple: Original evaluate_keywords.py (variance of mean embeddings)")
    print("   - Full: Variance across all embedding dimensions (more accurate)")
    print("   - Pairwise: Average cosine distance between all keyword pairs (best)")
else:
    print("⚠️ No articles to evaluate")

---

## Statistical Analysis with scipy.stats

**Task 4.2.7**: Code section - Statistical analysis with scipy.stats

### Descriptive Statistics

Use `scipy.stats.describe()` to get comprehensive summary:
- **nobs**: Number of observations
- **minmax**: Range of scores
- **mean**: Average quality
- **variance**: Spread of scores (consistency)
- **skewness**: Distribution symmetry (negative = left-skewed)
- **kurtosis**: Tail heaviness (outliers)

### Interpretation Guide

**Mean score**:
- `> 1.5`: Excellent keyword quality ✅
- `1.0 - 1.5`: Good quality, room for improvement
- `< 1.0`: Poor quality, needs prompt tuning ❌

**Variance**:
- **Low variance** (< 0.1): Consistent quality (good for production)
- **High variance** (> 0.3): Inconsistent (indicates prompt sensitivity)

**Skewness**:
- **Negative skew**: Most outputs good, few poor outliers
- **Positive skew**: Most outputs poor, few good outliers
- **Near zero**: Symmetric distribution
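
The sign convention for skewness is easy to mix up, so here is a small sketch that computes it by hand. It uses the biased moment estimator (third central moment divided by variance to the power 1.5), which to my understanding is what `scipy.stats.describe` reports by default. The score lists are invented for illustration.

```python
from statistics import fmean


def skewness(values):
    """Biased sample skewness: third central moment / variance**1.5."""
    mu = fmean(values)
    m2 = fmean([(x - mu) ** 2 for x in values])
    m3 = fmean([(x - mu) ** 3 for x in values])
    return m3 / m2 ** 1.5


# Hypothetical composite scores, invented for illustration
mostly_good = [1.0, 1.0, 0.9, 1.0, 0.9, 1.0, 0.2]  # one poor outlier
mostly_poor = [0.2, 0.3, 0.2, 0.3, 0.2, 0.2, 1.0]  # one good outlier

print(f"mostly_good skew: {skewness(mostly_good):+.2f}")
print(f"mostly_poor skew: {skewness(mostly_poor):+.2f}")
```

A single low outlier pulls the tail to the left and gives a negative value; a single high outlier does the opposite.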

In [None]:
def evaluate_composite(keywords: List[str], embedding_model) -> float:
    """Composite evaluation combining count and diversity.
    
    This replicates the exact logic from evaluate_keywords.py:29-39
    """
    # Metric 1: Ideal count penalty
    score = 1.0 - (np.abs(len(keywords) - 5) / 5.0)
    score = np.clip(score, 0.0, 1.0)
    
    # Metric 2: Diversity bonus
    embeds = [np.mean(embedding_model.encode(keyword)) for keyword in keywords]
    score += np.var(embeds)
    
    return score

# Evaluate all articles
if articles:
    print("🔬 Evaluating all articles...\n")
    
    scores = []
    for article in articles:
        score = evaluate_composite(article.index_keywords, embedding_model)
        scores.append(score)
    
    # Statistical analysis
    stats_result = stats.describe(scores)
    
    print("📊 Statistical Summary:")
    print("=" * 70)
    print(f"Number of articles: {stats_result.nobs}")
    print(f"Mean score: {stats_result.mean:.4f}")
    print(f"Variance: {stats_result.variance:.4f}")
    print(f"Std deviation: {np.sqrt(stats_result.variance):.4f}")
    print(f"Min score: {stats_result.minmax[0]:.4f}")
    print(f"Max score: {stats_result.minmax[1]:.4f}")
    print(f"Skewness: {stats_result.skewness:.4f}")
    print(f"Kurtosis: {stats_result.kurtosis:.4f}")
    print("=" * 70)
    
    # Interpretation
    print("\n💡 Interpretation:")
    
    # Mean score interpretation
    if stats_result.mean > 1.5:
        print("   ✅ Excellent keyword quality (mean > 1.5)")
    elif stats_result.mean > 1.0:
        print("   ⚠️ Good quality with room for improvement (1.0 < mean < 1.5)")
    else:
        print("   ❌ Poor keyword quality (mean < 1.0) - tune prompts!")
    
    # Variance interpretation
    if stats_result.variance < 0.1:
        print("   ✅ Low variance - consistent quality across articles")
    elif stats_result.variance < 0.3:
        print("   ⚠️ Moderate variance - some inconsistency")
    else:
        print("   ❌ High variance - inconsistent quality (prompt sensitivity)")
    
    # Skewness interpretation
    if abs(stats_result.skewness) < 0.5:
        print("   ✅ Symmetric distribution (balanced quality)")
    elif stats_result.skewness < -0.5:
        print("   📊 Negative skew - mostly good with few poor outliers")
    else:
        print("   📊 Positive skew - mostly poor with few good outliers")
    
    # Identify outliers (scores > 2 std devs from mean)
    mean = stats_result.mean
    std = np.sqrt(stats_result.variance)
    outliers = [(i, s, articles[i].title) for i, s in enumerate(scores) 
                if abs(s - mean) > 2 * std]
    
    if outliers:
        print(f"\n⚠️ Outliers detected ({len(outliers)} articles > 2 std devs):")
        for idx, score, title in outliers[:5]:  # Show first 5
            deviation = (score - mean) / std
            print(f"   - '{title[:40]}': {score:.2f} ({deviation:+.1f}σ)")
    else:
        print("\n✅ No outliers detected (all scores within 2 std devs)")
    
else:
    print("⚠️ No articles to analyze")

---

## Exercise: Design a New Evaluation Metric

**Task 4.2.8**: Exercise - Design a new evaluation metric (response length, citation count)

### Challenge

Design and implement a new evaluation metric for the composable app. Choose one:

1. **Response Length Evaluation**
   - Goal: Prefer articles between 300-500 words
   - Penalty for too short (<200) or too long (>800)
   - Normalize score to [0, 1]

2. **Citation Count Evaluation** (for GenAIWriter)
   - Goal: Ensure articles have page citations
   - Parse `article.full_text` for "See pages:" pattern
   - Score based on number of unique pages cited

3. **Key Lesson Quality**
   - Goal: Evaluate clarity and conciseness of `article.key_lesson`
   - Metrics: Length (ideal 10-20 words), starts with action verb

### Template Code

In [None]:
def evaluate_response_length(article: Article, ideal_min: int = 300, ideal_max: int = 500) -> float:
    """Evaluate article based on word count.
    
    TODO: Implement this function
    
    Args:
        article: Article object with full_text
        ideal_min: Minimum ideal word count
        ideal_max: Maximum ideal word count
    
    Returns:
        Score from 0.0 to 1.0
        - 1.0: Word count in [ideal_min, ideal_max]
        - 0.0: Word count < ideal_min/2 or > ideal_max*2
    """
    # Your code here
    pass

def evaluate_citations(article: Article) -> Dict[str, Any]:
    """Evaluate citation quality for GenAIWriter articles.
    
    TODO: Implement this function
    
    Args:
        article: Article object with full_text containing citations
    
    Returns:
        Dict with:
        - has_citations: bool
        - num_pages: int (number of unique pages cited)
        - citation_text: str (extracted citation string)
    
    Hint: Look for pattern "See pages: 42, 87, 103"
    """
    # Your code here
    pass

# Test your implementation
if articles:
    print("🧪 Testing custom evaluation metrics:\n")
    
    for article in articles[:3]:
        print(f"Article: {article.title}")
        
        # Test response length (uncomment after implementing)
        # length_score = evaluate_response_length(article)
        # print(f"  Length score: {length_score:.2f}")
        
        # Test citations (uncomment after implementing)
        # citation_metrics = evaluate_citations(article)
        # print(f"  Citations: {citation_metrics}")
        
        print()
else:
    print("⚠️ No articles to test")

print("💡 Bonus Challenge: Combine multiple metrics into a composite score")
print("   Example: 0.4*keyword_quality + 0.3*length_score + 0.3*citation_score")

---

## Common Pitfalls

**Task 4.2.9**: Common Pitfalls section

### ❌ Error: "FileNotFoundError: evals.log"
**Cause**: Log file doesn't exist yet

**Solution**: Run the composable app to generate logs
```bash
cd composable_app
streamlit run Main.py
# Generate at least 3-5 articles to have meaningful data
```

---

### ❌ Error: "JSONDecodeError: Expecting value"
**Cause**: Malformed JSON line in evals.log (corrupted write)

**Solution**: Gracefully skip bad lines
```python
try:
    obj = json.loads(line)
except json.JSONDecodeError as e:
    print(f"Skipping malformed line {line_num}: {e}")
    continue
```

---

### ❌ Error: "NameError: name 'Article' is not defined"
**Cause**: Using `eval()` without importing Article class

**Solution**: Import Article before calling `eval()`
```python
from agents.article import Article  # Required for eval() to work
article = eval(obj['ai_response'])
```

**Security Note**: `eval()` is dangerous with untrusted input. Only use on logs you generated.

---

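### ⚠️ Warning: SentenceTransformer downloads 90MB on first run
**Cause**: The model is not cached locally yet. After the first download it is reused from the local cache (`~/.cache/torch/sentence_transformers/` in older `sentence-transformers` releases; newer releases typically use the Hugging Face cache under `~/.cache/huggingface/`)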

**Solutions**:
1. **Wait for download** (one-time, ~30 seconds on broadband)
2. **Use smaller model**: `paraphrase-MiniLM-L3-v2` (33MB)
3. **Offline mode**: Pre-download model to shared cache

---

### ⚠️ Warning: "Mean variance is very small (< 0.01)"
**Cause**: Using `np.mean(embedding)` collapses 384-dim vector to scalar, losing information

**Solution**: Use full embedding variance or pairwise distances
```python
# Better: Full variance across all dimensions
full_variance = np.mean(np.var(embeddings, axis=0))

# Best: Average pairwise cosine distance
from sklearn.metrics.pairwise import cosine_distances
distances = cosine_distances(embeddings)
diversity = np.mean(distances[np.triu_indices_from(distances, k=1)])
```

---

### 💡 Tip: Incremental evaluation during development

**Problem**: Running full evaluation on every prompt change is slow

**Solution**: Evaluate on sample during dev, full dataset before deploy
```python
# Development: Quick iteration
sample_articles = articles[:10]  # Evaluate on 10 articles

# Pre-deployment: Comprehensive
all_articles = get_records()  # Evaluate on all articles
```
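
One caveat with `articles[:10]` is that it always picks the oldest log entries. A seeded random sample is a possible alternative, sketched below with stand-in strings in place of real `Article` objects; the seed value and the `all_items` name are arbitrary choices for illustration.

```python
import random

# Stand-in for the list returned by get_records(); real entries would be Article objects
all_items = [f"article_{i}" for i in range(50)]

rng = random.Random(42)  # Fixed seed so repeated dev runs score the same subset
sample_size = min(10, len(all_items))  # Guard against logs with fewer than 10 entries
sample_articles = rng.sample(all_items, sample_size)

print(sample_articles[:3])
```

Keeping the seed fixed means score changes between runs come from the prompt change rather than from a different sample.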

---

### 💡 Tip: Version your evaluation metrics

**Problem**: Hard to compare evaluations across time if metrics change

**Solution**: Tag metrics with version in output
```python
result = {
    "metric_version": "v1.0",
    "timestamp": "2025-11-05T10:00:00Z",
    "mean_score": 1.42,
    "variance": 0.18
}
```
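
A result like the one above could be produced by a small helper. This is a sketch under assumptions: `build_result` and `METRIC_VERSION` are hypothetical names, the extra `n` field is an addition for illustration, and the scores are invented.

```python
import json
from datetime import datetime, timezone

METRIC_VERSION = "v1.0"  # Bump whenever the scoring formula changes


def build_result(scores, metric_version=METRIC_VERSION):
    """Summarize scores and tag them with the metric version and a UTC timestamp."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)  # population variance
    return {
        "metric_version": metric_version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "n": len(scores),
        "mean_score": round(mean, 4),
        "variance": round(variance, 4),
    }


# Hypothetical scores, invented for illustration
result = build_result([1.0, 0.8, 0.6])
print(json.dumps(result))  # One JSON line, ready to append to a results file
```

Writing each summary as one JSON line keeps the results file in the same append-friendly format as `evals.log`.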

---

## Self-Assessment

**Task 4.2.10**: Self-assessment questions with answers

### Question 1: Concept Check
**What's the difference between quantitative metrics and LLM-as-judge evaluation?**

<details>
<summary>Click to reveal answer</summary>

**Quantitative Metrics**:
- **What**: Measurable properties (count, length, diversity, latency)
- **Cost**: Free (no LLM calls)
- **Speed**: Fast (can run on every output)
- **Limitations**: May miss semantic quality issues
- **Example**: `score = 1.0 - abs(len(keywords) - 5) / 5.0`

**LLM-as-Judge**:
- **What**: Semantic evaluation (correctness, helpfulness, tone)
- **Cost**: Expensive (requires LLM API calls)
- **Speed**: Slow (latency per evaluation)
- **Limitations**: Bias, inconsistency, cost
- **Example**: `judge_llm.evaluate("Rate the accuracy of this answer (0-10): ...")`

**Best Practice**: Use quantitative metrics for continuous monitoring, LLM-as-judge for spot checks and deep analysis.
</details>

---

### Question 2: Implementation
**Why does `evaluate_keywords.py` use `eval()` to reconstruct Article objects? Is this safe?**

<details>
<summary>Click to reveal answer</summary>

**Why `eval()`?**

The log stores Article objects as their `__repr__` string:
```python
# In evals.log
{"ai_response": "Article(title='Photosynthesis', full_text='...', ...)"}
```

Using `eval()` reconstructs the Article from this string:
```python
article = eval("Article(title='Photosynthesis', ...)")  # Creates Article object
```

**Is it safe?**

❌ **No, `eval()` is dangerous with untrusted input** - can execute arbitrary code

✅ **Acceptable here** because:
1. Logs are self-generated (not user input)
2. Running locally (not production service)
3. Each record is expected to be a plain `Article(...)` constructor call (note that importing `Article` does not sandbox `eval()`; built-ins stay reachable)

**Better alternatives for production**:
```python
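# Option 1: Log a JSON dict (e.g., via dataclasses.asdict) and rebuild the dataclass
obj = json.loads(line)
article = Article(**obj['ai_response'])  # Safer; assumes ai_response was logged as a dict

# Option 2: ast.literal_eval, which accepts only Python literals
# It rejects calls such as "Article(...)", so the log must store a literal dict
import ast
fields = ast.literal_eval(obj['ai_response'])  # e.g., "{'title': '...', ...}"
article = Article(**fields)

# Avoid pickle here: unpickling untrusted data can execute arbitrary code,
# and pickle bytes do not fit in a JSON string without extra encoding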
```
</details>

---

### Question 3: Metrics Design
**The diversity metric uses `np.mean(embedding)` which collapses 384 dimensions to 1 scalar. Why is this problematic?**

<details>
<summary>Click to reveal answer</summary>

**Problem**: Information loss

**Example**:
```python
# Keyword 1: "photosynthesis"
emb1 = [0.5, -0.3, 0.8, ..., 0.1]  # 384 dims
mean1 = np.mean(emb1)  # Collapses to ~0.15

# Keyword 2: "chloroplast" (semantically related)
emb2 = [0.4, -0.2, 0.7, ..., 0.0]  # 384 dims
mean2 = np.mean(emb2)  # Collapses to ~0.12
```

**Issue**: The 384-dim vectors collapse to nearly identical scalars (0.15 vs 0.12), and the same can happen for vectors that point in very different directions
- **Result**: Diversity appears low even if keywords are actually diverse
- **Variance of means**: `var([0.15, 0.12]) = 0.000225` (tiny!)

**Better approaches**:

1. **Full embedding variance**:
```python
# Variance across all 384 dimensions
full_variance = np.mean(np.var(embeddings, axis=0))
```

2. **Pairwise cosine distance** (best):
```python
from sklearn.metrics.pairwise import cosine_distances
distances = cosine_distances(embeddings)
# Average distance between all keyword pairs
diversity = np.mean(distances[np.triu_indices_from(distances, k=1)])
```

**Why original code uses mean**: Simplicity for educational demo. Production code should use pairwise distances.
</details>

---

### Question 4: Interpretation
**You run evaluation and get: mean=1.2, variance=0.5, skewness=+1.8. What does this tell you?**

<details>
<summary>Click to reveal answer</summary>

**Analysis**:

1. **Mean = 1.2** (moderate quality)
   - Above 1.0 (passing) but below 1.5 (excellent)
   - Interpretation: System is functional but needs improvement

2. **Variance = 0.5** (high variance)
   - Std dev = √0.5 ≈ 0.71
   - Large spread: Some outputs score ~0.5, others score ~1.9
   - Interpretation: **Inconsistent quality** - prompt is sensitive to input

3. **Skewness = +1.8** (strong positive skew)
   - Distribution has long tail to the right
   - **Most outputs are low-scoring** with a few high-scoring outliers
   - Interpretation: System **usually fails** but occasionally produces good results

**Diagnosis**:
```
Score Distribution:

       |██████                     ← Most articles (poor quality)
Count  |███
       |██
       |█
       |      █   █   ██           ← Few articles (good quality)
       +---------------------------
        0.5  1.0  1.5  2.0  2.5    Score
```

**Action Items**:
1. **Investigate outliers**: What makes the high-scoring articles good? Can we replicate?
2. **Reduce variance**: Add few-shot examples to prompt for consistency
3. **Shift distribution right**: Improve base prompt to raise mean score
4. **Root cause**: Check if certain topics/writers perform worse (use metadata filtering)

**Compare to ideal stats**:
- Ideal: `mean=1.8, variance=0.05, skewness=0.0` (high, consistent, symmetric)
</details>

---

### Question 5: Advanced
**How would you detect if prompt changes improved or degraded quality?**

<details>
<summary>Click to reveal answer</summary>

**Approach: A/B Testing with Statistical Significance**

**Step 1: Baseline evaluation**
```python
# Before prompt change
baseline_scores = [evaluate(a.index_keywords, embedding_model) for a in baseline_articles]
baseline_mean = np.mean(baseline_scores)  # e.g., 1.2
```

**Step 2: Experiment evaluation**
```python
# After prompt change
experiment_scores = [evaluate(a.index_keywords, embedding_model) for a in experiment_articles]
experiment_mean = np.mean(experiment_scores)  # e.g., 1.5
```

**Step 3: Statistical test** (t-test for mean difference)
```python
from scipy.stats import ttest_ind

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = ttest_ind(baseline_scores, experiment_scores, equal_var=False)

if p_value < 0.05:  # Statistically significant
    if experiment_mean > baseline_mean:
        print("✅ Prompt change IMPROVED quality (p < 0.05)")
    else:
        print("❌ Prompt change DEGRADED quality (p < 0.05)")
else:
    print("⚠️ No significant difference (p ≥ 0.05) - need more data")
```

**Step 4: Effect size** (how much better?)
```python
# Cohen's d: Standardized mean difference
pooled_std = np.sqrt((np.var(baseline_scores, ddof=1) + np.var(experiment_scores, ddof=1)) / 2)  # sample variances
cohens_d = (experiment_mean - baseline_mean) / pooled_std

# Interpretation:
# |d| < 0.2: Negligible effect
# |d| = 0.5: Medium effect
# |d| > 0.8: Large effect

print(f"Effect size (Cohen's d): {cohens_d:.2f}")
```

**Best Practices**:
1. **Same topics**: Evaluate both prompts on identical set of topics (controlled)
2. **Sufficient sample**: As a rule of thumb, use ≥30 articles per prompt version for a reliable t-test
3. **Multiple metrics**: Don't just compare means - check variance, outliers, failure modes
4. **Version control**: Tag evals.log entries with prompt version for retrospective analysis
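The t-test in Step 3 treats the two score lists as independent samples. When both prompts are evaluated on identical topics (practice 1), a paired analysis of per-topic differences is usually more sensitive. The sketch below swaps in a sign-flip permutation test built from the standard library rather than SciPy's paired t-test (`scipy.stats.ttest_rel`); the function name, the resample count, and the scores are all illustrative assumptions.

```python
import random
import statistics

def paired_permutation_test(baseline, experiment, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on per-topic score differences."""
    if len(baseline) != len(experiment):
        raise ValueError("Both score lists must cover the same topics, in the same order")
    diffs = [e - b for b, e in zip(baseline, experiment)]
    observed = statistics.fmean(diffs)
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value

# Hypothetical scores for the same eight topics under two prompt versions
baseline_scores = [1.1, 1.3, 0.9, 1.2, 1.4, 1.0, 1.2, 1.1]
experiment_scores = [1.4, 1.5, 1.2, 1.3, 1.8, 1.3, 1.4, 1.5]
mean_diff, p = paired_permutation_test(baseline_scores, experiment_scores)
print(f"Mean per-topic improvement: {mean_diff:.2f}, p = {p:.4f}")
```

Because every topic serves as its own control, topic difficulty cancels out of the differences, which is why pairing can detect smaller improvements than an independent-samples test on the same data.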

**Example versioning**:
```python
# In save_for_eval.py
record_ai_response(target, ai_input, ai_response, metadata={
    "prompt_version": "v2.1",
    "timestamp": "2025-11-05T10:00:00Z"
})
```
</details>

---

### Avoiding Judge Bias

**Task 4.3.3**: Avoiding judge bias (randomization, calibration)

**Problem**: LLM judges have systematic biases that can skew evaluations

### Common Biases

#### 1. **Position Bias**
**Issue**: Judge prefers first or last option in comparisons

**Example**:
```python
# Bad: Position influences score
prompt = f"Rate A vs B: A={article1.text}, B={article2.text}"
# Judge tends to favor A (primacy bias) or B (recency bias)
```

**Solution**: Randomize presentation order
```python
import random

# Randomly swap order
if random.random() < 0.5:
    prompt = f"Rate A vs B: A={article1.text}, B={article2.text}"
    swap = False
else:
    prompt = f"Rate A vs B: A={article2.text}, B={article1.text}"
    swap = True

result = await judge.run(prompt)

# Unswap scores if needed
if swap:
    result.score_A, result.score_B = result.score_B, result.score_A
```
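Randomizing the order spreads position bias across many pairs, but it does not reveal the bias for any single pair. A complementary check is to ask the judge about each pair in both orders and accept a verdict only when the two runs agree. The sketch below assumes a hypothetical `judge(first, second)` callable that returns `"first"` or `"second"`; the stub judges stand in for real LLM calls.

```python
def compare_both_orders(judge, text_1, text_2):
    """Ask the judge twice, once per presentation order, and reconcile the verdicts.

    `judge(first, second)` is a hypothetical callable that returns "first" or
    "second" for whichever presented text it prefers.
    """
    verdict_a = judge(text_1, text_2)  # text_1 shown first
    verdict_b = judge(text_2, text_1)  # text_2 shown first
    # Map positional verdicts back to the underlying texts
    winner_a = "text_1" if verdict_a == "first" else "text_2"
    winner_b = "text_2" if verdict_b == "first" else "text_1"
    if winner_a == winner_b:
        return winner_a
    return "inconsistent"  # the preference followed the position, not the content

# Stub judges for demonstration only (no LLM call)
def always_first(first, second):
    return "first"  # pure position bias

def prefers_citations(first, second):
    return "first" if "[1]" in first else "second"  # content-based preference

plain = "Plants make food."
cited = "Plants make food via photosynthesis [1]."
print(compare_both_orders(always_first, plain, cited))
print(compare_both_orders(prefers_citations, plain, cited))
```

The cost is two judge calls per pair, so this check is best reserved for comparisons where a wrong verdict matters.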

#### 2. **Length Bias**  
**Issue**: Judge favors longer responses (assumes more detail = better quality)

**Example**: 
- Article A: 200 words, high quality
- Article B: 800 words, low quality (verbose)
- Judge gives B higher score due to length alone

**Solution**: 
1. **Blind evaluation**: Don't expose word counts or other explicit length metadata to the judge
2. **Explicit instructions**: "Judge quality, not quantity. Penalize verbosity."
3. **Control group**: Include known reference articles of varying lengths
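A control group also makes length bias measurable: if judge scores track word count much more closely than human scores do, the judge is probably rewarding length itself. The following is a minimal sketch with a hand-written Pearson correlation; all numbers are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical control set: word counts, judge scores, and human scores
word_counts = [200, 350, 500, 650, 800]
judge_scores = [5, 6, 7, 8, 9]
human_scores = [8, 6, 7, 5, 6]

r_judge = pearson_r(word_counts, judge_scores)
r_human = pearson_r(word_counts, human_scores)
print(f"length vs judge: r={r_judge:.2f}; length vs human: r={r_human:.2f}")
# A judge correlation far above the human one suggests the judge rewards length
```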

#### 3. **Self-Preference Bias**
**Issue**: Judge (same model as generator) favors its own outputs

**Example**:
- Generator: Gemini 2.0
- Judge: Gemini 2.0 (same model)
- Result: Gemini judges its own style favorably, penalizes other models

**Solution**: Use different model for judge
```python
# Generator
writer_agent = Agent('gemini-2.0-flash', ...)

# Judge: Use different model family
judge_agent = Agent('claude-3-5-sonnet-20241022', ...)  # Different model
```

#### 4. **Prompt Sensitivity**
**Issue**: Small wording changes in judge prompt cause large score swings

**Example**:
- Prompt A: "Rate quality (0-10)" → Mean score: 6.5
- Prompt B: "Rate excellence (0-10)" → Mean score: 4.2 (stricter interpretation)

**Solution**: Calibrate prompts with ground truth dataset
```python
from scipy.stats import pearsonr

# Step 1: Create calibration set with human ratings
calibration = [
    {"article": article1, "human_score": 8},
    {"article": article2, "human_score": 5},
    # ... 20-50 examples
]
human_scores = [item["human_score"] for item in calibration]

# Step 2: Test prompt variants (assumes judge.evaluate returns a numeric score here)
for prompt_version in ["v1", "v2", "v3"]:
    judge_scores = [await judge.evaluate(item["article"], prompt_version)
                    for item in calibration]

    # Step 3: Measure correlation with human judgment
    correlation, p_value = pearsonr(judge_scores, human_scores)

    print(f"{prompt_version}: r={correlation:.3f} (p={p_value:.3f})")
    # Choose prompt with highest correlation to human judgment
```

### Best Practices for Reducing Bias

1. **Randomization**: Shuffle presentation order, use multiple judges
2. **Calibration**: Validate against human ratings on sample dataset
3. **Cross-validation**: Use multiple judge models, average scores
4. **Explicit criteria**: Provide concrete rubric (see structured outputs above)
5. **Anchoring**: Include reference examples ("Score 5 looks like..., Score 10 looks like...")
6. **Blind evaluation**: Remove identifying information (author, model name)

### Example: Multi-Judge Consensus

**Reduce bias by averaging scores from multiple judge models:**

```python
# Use 3 different judge models
judges = [
    Agent('gemini-2.0-flash'),
    Agent('gpt-4o'),
    Agent('claude-3-5-sonnet-20241022')
]

# Get scores from all judges
scores = []
for judge in judges:
    result = await judge.evaluate(article)
    scores.append(result.overall_score)

# Average scores (reduces individual model bias)
consensus_score = np.mean(scores)
variance = np.var(scores)

if variance > 2.0:
    print(f"⚠️ High variance ({variance:.1f}) - judges disagree!")
else:
    print(f"✅ Consensus score: {consensus_score:.1f} (low variance)")
```

**When to use multi-judge**:
- ✅ High-stakes evaluations (production deployment decisions)
- ✅ Subjective qualities (tone, creativity, helpfulness)
- ❌ Cost-sensitive scenarios (3x API costs)
- ❌ Simple objective metrics (use quantitative instead)

In [None]:
from pydantic import BaseModel, Field
from typing import List

class ArticleEvaluation(BaseModel):
    """Structured evaluation result for article quality."""
    
    correctness_score: int = Field(..., ge=0, le=10, description="Factual accuracy (0-10)")
    helpfulness_score: int = Field(..., ge=0, le=10, description="Usefulness for target audience (0-10)")
    tone_score: int = Field(..., ge=0, le=10, description="Tone appropriateness (0-10)")
    
    strengths: List[str] = Field(..., min_length=1, description="What the article does well")
    weaknesses: List[str] = Field(default_factory=list, description="Areas for improvement")
    
    overall_score: int = Field(..., ge=0, le=10, description="Combined quality score (0-10)")
    reasoning: str = Field(..., min_length=10, description="Brief explanation of scores")

# Example: How to use with Pydantic AI (pseudocode - requires API key)
"""
from pydantic_ai import Agent

judge_agent = Agent(
    'gemini-2.0-flash',
    output_type=ArticleEvaluation,  # Enforces structured output (pairs with result.output below)
    system_prompt='''
    You are an expert evaluator for K-12 educational content.
    Rate articles on correctness, helpfulness, and tone.
    Provide specific, actionable feedback.
    '''
)

# Evaluate an article
result = await judge_agent.run(f'''
Evaluate this article about "{article.title}":

{article.full_text}

Rate on three dimensions (0-10 each):
1. Correctness: Factual accuracy
2. Helpfulness: Usefulness for K-12 students  
3. Tone: Age-appropriate and engaging

Provide strengths, weaknesses, overall score, and reasoning.
''')

evaluation: ArticleEvaluation = result.output

print(f"Correctness: {evaluation.correctness_score}/10")
print(f"Helpfulness: {evaluation.helpfulness_score}/10")
print(f"Tone: {evaluation.tone_score}/10")
print(f"Overall: {evaluation.overall_score}/10")
print(f"\\nStrengths: {', '.join(evaluation.strengths)}")
print(f"Weaknesses: {', '.join(evaluation.weaknesses)}")
print(f"\\nReasoning: {evaluation.reasoning}")
"""

# Demo: Manual structured evaluation (no API call)
print("📋 Example Structured Evaluation:\n")

sample_eval = ArticleEvaluation(
    correctness_score=9,
    helpfulness_score=7,
    tone_score=8,
    strengths=[
        "Factually accurate with good citations",
        "Clear explanations of complex concepts",
        "Engaging examples (photosynthesis in everyday plants)"
    ],
    weaknesses=[
        "Could use more visual aids or diagrams",
        "Conclusion feels rushed",
        "Missing connection to real-world applications"
    ],
    overall_score=8,
    reasoning="Strong factual foundation with accessible tone. Main weakness is lack of visual aids and deeper real-world connections, which would enhance student engagement."
)

print(f"Correctness: {sample_eval.correctness_score}/10")
print(f"Helpfulness: {sample_eval.helpfulness_score}/10")
print(f"Tone: {sample_eval.tone_score}/10")
print(f"Overall: {sample_eval.overall_score}/10")
print(f"\nStrengths:")
for s in sample_eval.strengths:
    print(f"  ✅ {s}")
print(f"\nWeaknesses:")
for w in sample_eval.weaknesses:
    print(f"  ⚠️ {w}")
print(f"\nReasoning: {sample_eval.reasoning}")

print("\n💡 Benefits of Structured Outputs:")
print("  1. Machine-readable: Easy to aggregate scores across evaluations")
print("  2. Consistent format: Always get same fields, no parsing errors")
print("  3. Type-safe: Pydantic validates ranges (0-10), required fields")
print("  4. Actionable: Separate strengths/weaknesses guide improvements")

---

### Structured Judge Outputs

**Task 4.3.2**: Structured judge outputs (scores + reasoning)

**Problem**: Free-text judge responses are hard to parse and analyze

**Bad Example** (unstructured):
```
Judge output: "This article is pretty good, I'd give it an 8 out of 10.
The tone is appropriate and it covers the main points, but it could
use more examples. Also the conclusion feels rushed."
```

Problems:
- Score buried in text (hard to extract)
- No machine-readable reasoning
- Inconsistent format across evaluations

**Better: Structured Output with Pydantic**

Use Pydantic AI's structured output type to enforce structure, as the `ArticleEvaluation` model in the code cell above does.
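The underlying idea, validating the judge's reply against a schema before using it, can also be sketched with only the standard library. The example below is a stdlib analogue and not Pydantic AI itself; the `JudgeResult` schema, the `parse_judge_reply` helper, and the sample reply are hypothetical.

```python
import json
from dataclasses import dataclass

@dataclass
class JudgeResult:
    """Hypothetical minimal schema: one score plus the judge's reasoning."""
    overall_score: int
    reasoning: str

    def __post_init__(self):
        # bool is a subclass of int, so reject it explicitly
        if isinstance(self.overall_score, bool) or not isinstance(self.overall_score, int):
            raise ValueError("overall_score must be an integer")
        if not 0 <= self.overall_score <= 10:
            raise ValueError("overall_score must be between 0 and 10")
        if not isinstance(self.reasoning, str) or not self.reasoning.strip():
            raise ValueError("reasoning must be a non-empty string")

def parse_judge_reply(raw):
    """Parse a JSON reply from the judge; raises ValueError if it is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"overall_score", "reasoning"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return JudgeResult(data["overall_score"], data["reasoning"])

reply = '{"overall_score": 8, "reasoning": "Appropriate tone; conclusion feels rushed."}'
result = parse_judge_reply(reply)
print(result.overall_score, "-", result.reasoning)
```

A reply that fails validation can be retried or logged as a judge failure instead of silently entering the score aggregates.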

### Metric 2: Helpfulness Evaluation

**Goal**: Evaluate if article addresses the user's actual information need

**Judge Prompt**:
```python
judge_prompt = f"""
Rate how helpful this Article is for a K-12 student learning about "{topic}" (0-10):

Article: {article.full_text}

Criteria:
- Answers the core question about {topic}: +3 points
- Provides practical examples or analogies: +3 points
- Appropriate depth for K-12 level (not too simple, not too advanced): +2 points
- Well-structured (intro, body, conclusion): +2 points

Score (0-10):
Brief reasoning:
"""
```

**Key insight**: Helpfulness depends on **user context** (K-12 student vs. researcher vs. general public)
- Always specify target audience in judge prompt
- Consider creating persona-specific judges (e.g., "conservative parent" perspective)
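One way to make a point-based rubric like this auditable is to have the judge report points per criterion and do the addition in code, rather than relying on the model's arithmetic. The sketch below assumes the judge returns a mapping of criterion to awarded points; the criterion keys and the `rubric_score` helper are hypothetical names, while the maximum points mirror the helpfulness criteria above.

```python
# Maximum points per criterion, taken from the helpfulness rubric above
HELPFULNESS_RUBRIC = {
    "answers_core_question": 3,
    "practical_examples": 3,
    "appropriate_depth": 2,
    "well_structured": 2,
}

def rubric_score(awarded, rubric=HELPFULNESS_RUBRIC):
    """Sum per-criterion points after checking each against its maximum."""
    total = 0
    for criterion, max_points in rubric.items():
        points = awarded.get(criterion, 0)  # an omitted criterion earns nothing
        if not 0 <= points <= max_points:
            raise ValueError(f"{criterion}: {points} is outside 0..{max_points}")
        total += points
    return total

# Hypothetical per-criterion points returned by a judge
awarded = {
    "answers_core_question": 3,
    "practical_examples": 1,
    "appropriate_depth": 2,
    "well_structured": 2,
}
print(f"Helpfulness: {rubric_score(awarded)}/{sum(HELPFULNESS_RUBRIC.values())}")
```

Keeping the per-criterion points also shows which part of the rubric an article fails, not only that its total is low.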

### Metric 3: Tone Evaluation

**Goal**: Evaluate if tone is appropriate for target audience (K-12 educational)

**Judge Prompt**:
```python
judge_prompt = f"""
Rate the tone appropriateness of this Article for K-12 students (0-10):

Article: {article.full_text}

Criteria:
- Engaging and accessible (not dry or overly academic): +3 points
- Age-appropriate vocabulary (explains technical terms): +2 points
- Respectful and inclusive language: +2 points
- Encouraging learning (not condescending): +3 points

Score (0-10):
Issues found (if any):
"""
```

**Multi-dimensional tone**:
- **Formality**: Casual ↔ Academic
- **Complexity**: Simple ↔ Technical
- **Emotion**: Neutral ↔ Enthusiastic
- **Respect**: Condescending ↔ Empowering

**Tip**: Use multiple judges with different persona prompts to catch tone issues from diverse perspectives (see ReviewerPanel)

---

### Metric 1: Correctness Evaluation

**Task 4.3.1**: LLM-as-judge metrics (correctness, helpfulness, tone)

**Goal**: Evaluate if article content is factually accurate

**Challenge**: Need ground truth or reference answer for comparison

**Approaches**:

1. **Reference-based**: Compare to known correct answer
```python
judge_prompt = f"""
Rate the factual correctness of the Answer compared to the Reference (0-10):

Question: {topic}
Answer: {article.full_text}
Reference: {ground_truth_content}

Criteria:
- All facts in Answer are present in Reference: +4 points
- No hallucinations or false claims in Answer: +4 points  
- Answer includes key details from Reference: +2 points

Score (0-10): 
"""
```

2. **Self-consistency**: Generate multiple answers, check agreement
```python
# Generate 5 answers to same question
answers = [await writer.write_about(topic) for _ in range(5)]

# Judge: Do all 5 answers agree on key facts?
consistency_score = judge_llm.check_consistency(answers)
# High consistency → likely correct
# Low consistency → prompt ambiguity or knowledge gap
```

3. **Retrieval-augmented judge**: Give judge access to source documents
```python
judge_prompt = f"""
Check if this Article is grounded in the Source Documents:

Article: {article.full_text}
Source Documents: {retrieved_chunks}

For each claim in Article:
1. Is it supported by Source Documents? (yes/no/partial)
2. If no, mark as potential hallucination

Hallucination count:
Grounding score (0-10):
"""
```

**Production Tip**: Use retrieval-augmented judge for RAG applications (like GenAIWriter)
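If the retrieval-augmented judge returns one verdict per claim, the hallucination count and grounding score can be derived in code instead of being requested from the model. The following is a sketch under stated assumptions: the `grounding_report` helper is hypothetical, and counting a "partial" verdict as half credit is an arbitrary choice, not something the tutorial prescribes.

```python
def grounding_report(verdicts):
    """Aggregate per-claim verdicts ("yes", "partial", "no") into summary numbers."""
    credit = {"yes": 1.0, "partial": 0.5, "no": 0.0}  # assumed weighting
    unknown = set(verdicts) - credit.keys()
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    if not verdicts:
        # No claims were checked, so no score can be given
        return {"hallucination_count": 0, "grounding_score": None}
    supported = sum(credit[v] for v in verdicts)
    return {
        "hallucination_count": verdicts.count("no"),
        "grounding_score": round(10 * supported / len(verdicts), 1),
    }

# Hypothetical verdicts for five claims extracted from one article
verdicts = ["yes", "yes", "partial", "no", "yes"]
print(grounding_report(verdicts))
```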

---

## Advanced: LLM-as-Judge Evaluation

**Task 4.3**: Evaluation methodologies - LLM-as-judge, structured outputs, bias avoidance

### When to Use LLM-as-Judge

**Quantitative metrics** (covered above) work well for measurable properties, but fail for semantic quality:
- ❌ **Correctness**: Is the answer factually accurate?
- ❌ **Helpfulness**: Does it address the user's actual question?
- ❌ **Tone**: Is it appropriate for K-12 audience?
- ❌ **Creativity**: Is the explanation engaging?

**LLM-as-Judge** uses an LLM to evaluate another LLM's output:

```python
# Pseudocode
judge_prompt = f"""
Rate the {quality} of this response (0-10):
Input: {user_query}
Output: {llm_response}
"""
score = await judge_llm.run(judge_prompt)
```

### Cost-Quality Trade-off

| Evaluation Method | Cost | Speed | Semantic Quality | Best For |
|-------------------|------|-------|------------------|----------|
| **Quantitative** | $0 | Fast | ❌ Misses meaning | Continuous monitoring |
| **LLM-as-Judge** | $0.01-0.10/eval | Slow | ✅ Understands meaning | Spot checks, deep analysis |
| **Human** | $5-20/eval | Very slow | ✅✅ Gold standard | Final validation |

**Best Practice**: Combine approaches
1. **Quantitative**: Run on every output (fast feedback loop)
2. **LLM-as-Judge**: Sample 10% of outputs (quality assurance)
3. **Human**: Review edge cases and failures (ground truth)
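The 10% sample for the LLM judge can be made deterministic by hashing a stable record identifier, so the same records are selected on every run and results stay comparable over time. The sketch below assumes each record has such an identifier; the `should_judge` helper and the `article-N` IDs are hypothetical.

```python
import hashlib

def should_judge(record_id, sample_rate=0.10):
    """Deterministically select roughly `sample_rate` of records for LLM-as-judge review."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# Hypothetical record IDs; any stable identifier in the log would do
record_ids = [f"article-{i}" for i in range(10_000)]
selected = [rid for rid in record_ids if should_judge(rid)]
print(f"Selected {len(selected)} of {len(record_ids)} records for judging")
```

Unlike `random.random() < 0.1`, this selection does not change between runs, which makes it possible to re-judge the same subset after a prompt change.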

**Code Location**: See [`llm_as_judge_tutorial.ipynb`](llm_as_judge_tutorial.ipynb) for guardrails implementation

---

## Next Steps

### Continue Learning
1. **[LLM-as-Judge Tutorial](llm_as_judge_tutorial.ipynb)** - Semantic evaluation with guardrails
2. **[Horizontal Services](../concepts/horizontal_services.md)** - Learn about evaluation recording architecture
3. **[Advanced Patterns](advanced_patterns.ipynb)** - Cost and latency optimization

### Hands-On Practice
1. **Complete Exercise**: Implement response length and citation evaluation metrics
2. **Custom metric**: Design evaluation for your specific use case
3. **A/B testing**: Compare two prompt versions using t-test
4. **Visualization**: Plot score distributions with matplotlib/seaborn

### Advanced Challenges
1. **Real-time dashboard**: Stream evals.log to Streamlit dashboard with live metrics
2. **Regression detection**: Alert when mean score drops below threshold
3. **Multi-dimensional analysis**: Cluster articles by topic and compare quality across clusters
4. **Combine metrics**: Create composite score (e.g., 0.4*keywords + 0.3*length + 0.3*citations)

### Production Considerations
1. **Sampling**: Evaluate on random sample during high traffic (cost/latency)
2. **Versioning**: Tag all logs with model version, prompt version, timestamp
3. **Storage**: Rotate logs, archive to S3/GCS for historical analysis
4. **Alerting**: Set up alerts for quality degradation (PagerDuty, Slack)

---

**Congratulations!** You've learned how to systematically evaluate LLM outputs using quantitative metrics. You can now:
- Parse and analyze `evals.log` for quality insights
- Implement composite metrics (count + diversity)
- Use statistical analysis to detect improvements or regressions
- Design custom evaluation metrics for your use case

**Tutorial Version**: 1.0  
**Last Updated**: 2025-11-05  
**Estimated Time**: 30-35 minutes  
**Cost**: $0.00 (reads local logs, downloads ~90MB model)