# Notebook 03: Reference-Based Evaluation with Unitxt

## 🎯 Objectives

This notebook demonstrates how to:
1. **Load** the ground truth dataset using Unitxt
2. **Perform N-gram comparisons** (BLEU, ROUGE) against reference close notes
3. **Perform semantic comparisons** using embedding similarity
4. **Implement LLM-as-a-Judge** for structured multi-dimensional evaluation
5. **Compare** different models and prompts using comprehensive metrics
6. **Visualize** evaluation results and metrics

---

## 📋 Overview

**Phase 2: Reference-Based Evaluation**

This phase uses Unitxt to evaluate how well LLM-generated close notes match the reference (ground truth) notes, using three evaluation approaches:
- **N-gram Metrics**: BLEU, ROUGE (ROUGE-1, ROUGE-2, ROUGE-L) for surface-level similarity
- **Semantic Similarity**: Embedding-based comparison using BGE-M3 model
- **LLM-as-a-Judge**: Structured evaluation across 6 ITSM-specific dimensions

**Key Components:**
- **Unitxt**: Standardized evaluation framework
- **Ground Truth**: `data/gt_close_notes.csv` (26 high-quality examples)
- **Embeddings**: Pre-computed BGE-M3 embeddings (`gt_close_notes_embeddings.npy`)
- **Evaluation Metrics**: N-gram, semantic, and LLM-based scoring

---

## 🔧 Prerequisites

**⚠️ REQUIRED:**
- **Unitxt installed** (`pip install unitxt` or `uv pip install unitxt`) - **The notebook will not run without it**
- Ground truth dataset: `data/gt_close_notes.csv`
- Pre-computed embeddings: `data/gt_close_notes_embeddings.npy`

**Optional:**
- LLM for generating close notes (Ollama or other provider)
- LLM for LLM-as-a-Judge evaluation


In [None]:
# Install Unitxt if not available
try:
    import unitxt
    print("✅ Unitxt is already installed")
except ImportError:
    print("📦 Unitxt not found. Installing...")
    import subprocess
    import sys
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "unitxt>=1.0.0"], 
                            stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
        print("✅ Unitxt installed successfully")
        # Reload the module after installation
        import importlib
        importlib.invalidate_caches()
    except subprocess.CalledProcessError as e:
        raise ImportError(
            "❌ Failed to install Unitxt automatically.\n"
            "   Please install it manually with: pip install unitxt\n"
            "   Or using uv: uv pip install unitxt\n"
            "   The notebook cannot proceed without Unitxt."
        ) from e

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys
import json
from typing import Dict, List, Optional, Tuple
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Add src directory to path
sys.path.append(str(Path("../src").resolve()))

from utils import (
    load_ground_truth_embeddings, 
    compute_semantic_similarity,
    find_most_similar_close_note
)
from prompts import get_prompt_variants

# Unitxt for evaluation - REQUIRED
try:
    import unitxt
    from unitxt import load_dataset, Metric
    UNITXT_AVAILABLE = True
    print("✅ Unitxt imported successfully")
except ImportError:
    raise ImportError(
        "❌ Unitxt import failed after installation attempt.\n"
        "   Please restart the kernel and try again, or install manually:\n"
        "   pip install unitxt\n"
        "   The notebook cannot proceed without Unitxt."
    )

# Evaluation metrics
try:
    from rouge_score import rouge_scorer
    ROUGE_AVAILABLE = True
except ImportError:
    ROUGE_AVAILABLE = False
    print("⚠️ rouge-score not available. Install with: pip install rouge-score")

try:
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from nltk.tokenize import word_tokenize
    import nltk
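    # word_tokenize needs the 'punkt' data; newer NLTK releases look for 'punkt_tab' instead
    for _resource in ("punkt", "punkt_tab"):
        try:
            nltk.data.find(f"tokenizers/{_resource}")
        except LookupError:
            nltk.download(_resource, quiet=True)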
    BLEU_AVAILABLE = True
except ImportError:
    BLEU_AVAILABLE = False
    print("⚠️ nltk not available. Install with: pip install nltk")

# Set up plotting style
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    try:
        plt.style.use('seaborn')
    except OSError:
        plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

print("✅ Libraries imported successfully!")


## 1. Configuration

Set up paths and configuration for evaluation.


In [None]:
# Configuration
DATA_DIR = Path("../data")
GT_DATASET_PATH = DATA_DIR / "gt_close_notes.csv"
EMBEDDINGS_PATH = DATA_DIR / "gt_close_notes_embeddings.npy"
EMBEDDINGS_METADATA_PATH = DATA_DIR / "gt_close_notes_embeddings_metadata.pkl"

# Evaluation settings
EVAL_RANDOM_STATE = 42
SMOOTHING_FUNCTION = SmoothingFunction().method1 if BLEU_AVAILABLE else None

print("📁 Configuration:")
print(f"   Data directory: {DATA_DIR}")
print(f"   Ground truth dataset: {GT_DATASET_PATH}")
print(f"   Embeddings: {EMBEDDINGS_PATH}")
print(f"\n✅ Configuration loaded")


## Step 1: Load Ground Truth Dataset

Load the ground truth close notes dataset that we'll use for evaluation. We'll prepare it in a format suitable for Unitxt evaluation.


In [None]:
# Load ground truth dataset
if not GT_DATASET_PATH.exists():
    raise FileNotFoundError(f"Ground truth dataset not found at {GT_DATASET_PATH}")

gt_df = pd.read_csv(GT_DATASET_PATH)
print(f"✅ Loaded ground truth dataset: {len(gt_df)} examples")
print(f"\nDataset columns: {list(gt_df.columns)}")
print(f"\nFirst few rows:")
display(gt_df.head())

# Show basic statistics
print(f"\n📊 Dataset Statistics:")
print(f"   Total examples: {len(gt_df)}")
print(f"   Categories: {gt_df['category'].value_counts().to_dict()}")
print(f"   Average info_score: {gt_df['info_score'].mean():.2f}")
print(f"   Info score range: {gt_df['info_score'].min():.2f} - {gt_df['info_score'].max():.2f}")

# Prepare dataset for evaluation
# Format: source (incident content) -> target (reference close notes)
eval_data = []
for _, row in gt_df.iterrows():
    eval_data.append({
        "source": row["content"],
        "target": row["close_notes_ref"],
        "incident_number": row["number"],
        "category": row["category"],
        "subcategory": row["subcategory"],
        "contact_type": row["contact_type"],
        "info_score": row["info_score"]
    })

print(f"\n✅ Prepared {len(eval_data)} examples for evaluation")
print(f"\nExample entry:")
print(json.dumps(eval_data[0], indent=2, ensure_ascii=False)[:500] + "...")
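
The preparation loop above indexes columns such as `content` and `close_notes_ref` directly, so a missing column would only surface as a `KeyError` partway through. A minimal sketch of an up-front check follows, assuming the column names used in this cell; `find_missing_columns` is a hypothetical helper, not part of Unitxt or the project's `utils` module.

```python
from typing import Iterable, List

# Columns the preparation loop reads from gt_df (taken from the cell above).
REQUIRED_COLUMNS = (
    "content", "close_notes_ref", "number", "category",
    "subcategory", "contact_type", "info_score",
)


def find_missing_columns(
    columns: Iterable[str],
    required: Iterable[str] = REQUIRED_COLUMNS,
) -> List[str]:
    """Return the required column names that are absent, in declaration order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Illustrative header with two columns missing (made-up example, not the real CSV).
example_columns = ["number", "content", "close_notes_ref", "category", "info_score"]
missing = find_missing_columns(example_columns)
print(missing)
```

In the notebook this could be called as `find_missing_columns(gt_df.columns)` right after `pd.read_csv`, raising an error if the returned list is non-empty.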


## Step 2: N-gram Comparisons

Perform n-gram based comparisons between predicted and reference close notes using BLEU and ROUGE metrics.


In [None]:
# N-gram evaluation functions

def compute_bleu_score(reference: str, candidate: str) -> Dict[str, float]:
    """
    Compute BLEU scores (1-4 grams) between reference and candidate texts.
    
    Args:
        reference: Reference text
        candidate: Candidate/predicted text
    
    Returns:
        Dictionary with BLEU scores for different n-grams
    """
    if not BLEU_AVAILABLE:
        return {"bleu_1": 0.0, "bleu_2": 0.0, "bleu_3": 0.0, "bleu_4": 0.0}
    
    try:
        ref_tokens = word_tokenize(reference.lower())
        cand_tokens = word_tokenize(candidate.lower())
        
        smoothing = SmoothingFunction().method1
        
        scores = {}
        for n in range(1, 5):
            weights = tuple([1.0/n] * n + [0.0] * (4-n))
            score = sentence_bleu([ref_tokens], cand_tokens, weights=weights, smoothing_function=smoothing)
            scores[f"bleu_{n}"] = float(score)
        
        return scores
    except Exception as e:
        print(f"⚠️ Error computing BLEU: {e}")
        return {"bleu_1": 0.0, "bleu_2": 0.0, "bleu_3": 0.0, "bleu_4": 0.0}


def compute_rouge_scores(reference: str, candidate: str) -> Dict[str, float]:
    """
    Compute ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) between reference and candidate texts.
    
    Args:
        reference: Reference text
        candidate: Candidate/predicted text
    
    Returns:
        Dictionary with ROUGE scores
    """
    if not ROUGE_AVAILABLE:
        return {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    
    try:
        scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        scores = scorer.score(reference, candidate)
        
        return {
            "rouge1": scores['rouge1'].fmeasure,
            "rouge2": scores['rouge2'].fmeasure,
            "rougeL": scores['rougeL'].fmeasure,
            "rouge1_precision": scores['rouge1'].precision,
            "rouge1_recall": scores['rouge1'].recall,
            "rouge2_precision": scores['rouge2'].precision,
            "rouge2_recall": scores['rouge2'].recall,
            "rougeL_precision": scores['rougeL'].precision,
            "rougeL_recall": scores['rougeL'].recall,
        }
    except Exception as e:
        print(f"⚠️ Error computing ROUGE: {e}")
        return {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}


def evaluate_ngram_metrics(reference: str, candidate: str) -> Dict[str, float]:
    """
    Compute all n-gram based metrics (BLEU + ROUGE).
    
    Args:
        reference: Reference text
        candidate: Candidate/predicted text
    
    Returns:
        Dictionary with all n-gram metrics
    """
    bleu_scores = compute_bleu_score(reference, candidate)
    rouge_scores = compute_rouge_scores(reference, candidate)
    
    return {**bleu_scores, **rouge_scores}


# Test on a sample
if len(eval_data) > 0:
    sample = eval_data[0]
    print("🧪 Testing n-gram metrics on sample:")
    print(f"\nReference (first 200 chars):\n{sample['target'][:200]}...")
    print(f"\nCandidate (using reference as candidate for demo):\n{sample['target'][:200]}...")
    
    metrics = evaluate_ngram_metrics(sample['target'], sample['target'])
    print(f"\n📊 N-gram Metrics:")
    for metric, value in metrics.items():
        print(f"   {metric}: {value:.4f}")
else:
    print("⚠️ No evaluation data available")
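
For intuition about what these scores measure, the sketch below computes two of the underlying quantities in plain Python: the clipped n-gram precision at the core of BLEU, and an LCS-based F-measure in the spirit of ROUGE-L. It is a simplification, not a replacement for `nltk` or `rouge_score`: it omits BLEU's brevity penalty, smoothing, and geometric mean, and it does no stemming. The function names and the two example sentences are made up for illustration.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(reference: List[str], candidate: List[str], n: int) -> float:
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: List[str], candidate: List[str]) -> float:
    """Harmonic mean of LCS-based precision and recall."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


ref_tokens = "restarted the print spooler service and confirmed printing works".split()
cand_tokens = "restarted the spooler service and verified printing works".split()

print(f"unigram precision: {clipped_precision(ref_tokens, cand_tokens, 1):.4f}")
print(f"bigram precision:  {clipped_precision(ref_tokens, cand_tokens, 2):.4f}")
print(f"ROUGE-L style F1:  {rouge_l_f1(ref_tokens, cand_tokens):.4f}")
```

The bigram precision drops well below the unigram precision for the same pair, which is why BLEU-4 tends to be harsh on short, paraphrased close notes.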


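## Step 3: Semantic Comparisons

Compare predicted and reference close notes by embedding-based semantic similarity, which captures meaning beyond surface n-gram overlap.


In [None]: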
# Load pre-computed embeddings
try:
    embeddings, embeddings_metadata = load_ground_truth_embeddings(data_dir=str(DATA_DIR))
    print(f"✅ Loaded embeddings: shape {embeddings.shape}")
    print(f"   Metadata keys: {list(embeddings_metadata.keys())}")
except FileNotFoundError as e:
    print(f"⚠️ Embeddings not found: {e}")
    print("   Will compute embeddings on-the-fly if needed")
    embeddings = None
    embeddings_metadata = None


def evaluate_semantic_similarity(reference: str, candidate: str) -> Dict[str, float]:
    """
    Compute semantic similarity between reference and candidate texts.
    
    Args:
        reference: Reference text
        candidate: Candidate/predicted text
    
    Returns:
        Dictionary with semantic similarity scores
    """
    try:
        similarity = compute_semantic_similarity(reference, candidate)
        return {
            "semantic_similarity": similarity,
            "cosine_similarity": similarity  # Alias for clarity
        }
    except Exception as e:
        print(f"⚠️ Error computing semantic similarity: {e}")
        return {"semantic_similarity": 0.0, "cosine_similarity": 0.0}


# Test semantic similarity on a sample
if len(eval_data) > 0:
    sample = eval_data[0]
    print("🧪 Testing semantic similarity on sample:")
    print(f"\nReference (first 200 chars):\n{sample['target'][:200]}...")
    
    # Test with the same text (should be 1.0)
    semantic_scores = evaluate_semantic_similarity(sample['target'], sample['target'])
    print(f"\n📊 Semantic Metrics (self-comparison):")
    for metric, value in semantic_scores.items():
        print(f"   {metric}: {value:.4f}")
    
    # Test with a modified version (shorter, should be < 1.0)
    if len(sample['target']) > 100:
        modified = sample['target'][:len(sample['target'])//2] + " (truncated)"
        semantic_scores_modified = evaluate_semantic_similarity(sample['target'], modified)
        print(f"\n📊 Semantic Metrics (truncated comparison):")
        for metric, value in semantic_scores_modified.items():
            print(f"   {metric}: {value:.4f}")
else:
    print("⚠️ No evaluation data available")


## Step 4: LLM-as-a-Judge

Implement structured evaluation using an LLM as a judge to assess quality across 6 ITSM-specific dimensions.


In [None]:
# LLM-as-a-Judge prompt template

LLM_JUDGE_PROMPT_TEMPLATE = """You are an expert in IT Service Management and incident documentation.
Your task is to evaluate how accurately and completely a *generated close note* describes the resolution of an incident, compared to a *reference note*.

Compare the following texts:

* **Reference (ground truth) close note:**
{close_notes_ref}

* **Generated close note:**
{close_notes_pred}

Evaluate the generated note according to the following criteria.
For each, assign a **score from 0 to 5** and include a one-sentence explanation.

1. **Incident coverage (0–5)**: Does it address the same issue and context?
2. **Technical steps & resolution actions (0–5)**: Are the main diagnostic and corrective actions consistent and complete?
3. **Accuracy of facts (0–5)**: Does it avoid inventing systems, errors, or results?
4. **Customer/system context (0–5)**: Does it correctly reference the affected service, device, or user?
5. **Clarity & structure (0–5)**: Is it readable, logically ordered, and professionally written?
6. **Resolution summary (0–5)**: Does it clearly describe the outcome or confirmation of resolution?

Then compute:

* `"general_score"`: the average of the six scores
* `"general_score_explanation"`: a brief summary of your overall judgment

Return the evaluation as valid JSON only:

{{
  "check_incident_coverage": 5,
  "check_incident_coverage_explanation": "...",
  "check_technical_steps": 5,
  "check_technical_steps_explanation": "...",
  "check_accuracy_of_facts": 5,
  "check_accuracy_of_facts_explanation": "...",
  "check_customer_context": 5,
  "check_customer_context_explanation": "...",
  "check_clarity_structure": 4,
  "check_clarity_structure_explanation": "...",
  "check_resolution_summary": 5,
  "check_resolution_summary_explanation": "...",
  "general_score": 4.83,
  "general_score_explanation": "The generated close note accurately covers the same incident, includes consistent troubleshooting steps, and provides a clear resolution summary with no invented facts."
}}
"""


def create_llm_judge_prompt(reference: str, candidate: str) -> str:
    """
    Create LLM-as-a-Judge prompt with reference and candidate texts.
    
    Args:
        reference: Reference close note
        candidate: Generated/predicted close note
    
    Returns:
        Formatted prompt string
    """
    return LLM_JUDGE_PROMPT_TEMPLATE.format(
        close_notes_ref=reference,
        close_notes_pred=candidate
    )


def parse_llm_judge_response(response: str) -> Dict:
    """
    Parse LLM judge response (JSON) into structured format.
    
    Args:
        response: LLM response string (should contain JSON)
    
    Returns:
        Dictionary with parsed scores and explanations
    """
    try:
        # Try to extract JSON from response
        if "{" in response and "}" in response:
            start = response.find("{")
            end = response.rfind("}") + 1
            json_str = response[start:end]
            result = json.loads(json_str)
            return result
        else:
            # Fallback: try parsing entire response
            return json.loads(response)
    except json.JSONDecodeError as e:
        print(f"⚠️ Error parsing LLM judge response: {e}")
        print(f"   Response: {response[:200]}...")
        return {
            "check_incident_coverage": 0,
            "check_technical_steps": 0,
            "check_accuracy_of_facts": 0,
            "check_customer_context": 0,
            "check_clarity_structure": 0,
            "check_resolution_summary": 0,
            "general_score": 0,
            "error": str(e)
        }


# Example prompt generation
if len(eval_data) > 0:
    sample = eval_data[0]
    example_prompt = create_llm_judge_prompt(
        reference=sample['target'],
        candidate=sample['target']  # Using same as candidate for demo
    )
    print("📝 Example LLM-as-a-Judge Prompt (first 500 chars):")
    print(example_prompt[:500] + "...")
    print("\n✅ LLM-as-a-Judge prompt template ready")
else:
    print("⚠️ No evaluation data available")
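
The prompt asks the judge to compute `general_score` itself, but an LLM can return out-of-range scores or an average that does not match the six dimensions. One possible safeguard, sketched here under that assumption, is to clamp each score to the 0–5 range and recompute the average locally. `normalize_judge_scores` is a hypothetical helper, and treating a missing or non-numeric score as 0 is a design choice rather than something the notebook prescribes.

```python
from typing import Any, Dict

# The six dimension keys from the JSON schema in the prompt template above.
SCORE_KEYS = [
    "check_incident_coverage",
    "check_technical_steps",
    "check_accuracy_of_facts",
    "check_customer_context",
    "check_clarity_structure",
    "check_resolution_summary",
]


def normalize_judge_scores(parsed: Dict[str, Any]) -> Dict[str, Any]:
    """Clamp each dimension score to [0, 5] and recompute general_score as their mean."""
    result = dict(parsed)  # keep explanations and any extra fields untouched
    scores = []
    for key in SCORE_KEYS:
        try:
            value = float(parsed.get(key, 0))
        except (TypeError, ValueError):
            value = 0.0  # assumption: unparseable scores count as 0
        value = min(5.0, max(0.0, value))
        result[key] = value
        scores.append(value)
    result["general_score"] = round(sum(scores) / len(scores), 2)
    return result


# Made-up judge output with one out-of-range score and an inconsistent average.
raw = {
    "check_incident_coverage": 5,
    "check_technical_steps": 5,
    "check_accuracy_of_facts": 5,
    "check_customer_context": 5,
    "check_clarity_structure": 4,
    "check_resolution_summary": 7,
    "general_score": 9.9,
}
clean = normalize_judge_scores(raw)
print(clean["check_resolution_summary"], clean["general_score"])
```

Applied to the output of `parse_llm_judge_response`, this keeps downstream averages comparable across judge models that follow the format with different degrees of care.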


## Step 5: Comprehensive Evaluation Pipeline

Combine all evaluation methods (N-gram, Semantic, LLM-as-a-Judge) into a comprehensive evaluation pipeline.


In [None]:
def comprehensive_evaluation(
    reference: str,
    candidate: str,
    compute_llm_judge: bool = False,
    llm_provider=None
) -> Dict[str, object]:
    """
    Perform comprehensive evaluation combining all metrics.
    
    Args:
        reference: Reference close note
        candidate: Generated/predicted close note
        compute_llm_judge: Whether to compute LLM-as-a-Judge scores (requires LLM)
        llm_provider: Optional LLM provider function (e.g., Ollama, OpenAI)
    
    Returns:
        Dictionary with all evaluation metrics
    """
    results = {
        "reference": reference[:200] + "..." if len(reference) > 200 else reference,
        "candidate": candidate[:200] + "..." if len(candidate) > 200 else candidate,
    }
    
    # N-gram metrics
    ngram_metrics = evaluate_ngram_metrics(reference, candidate)
    results.update({f"ngram_{k}": v for k, v in ngram_metrics.items()})
    
    # Semantic similarity
    semantic_metrics = evaluate_semantic_similarity(reference, candidate)
    results.update({f"semantic_{k}": v for k, v in semantic_metrics.items()})
    
    # LLM-as-a-Judge (optional)
    if compute_llm_judge and llm_provider:
        try:
            prompt = create_llm_judge_prompt(reference, candidate)
            llm_response = llm_provider(prompt)
            llm_scores = parse_llm_judge_response(llm_response)
            results.update({f"llm_judge_{k}": v for k, v in llm_scores.items()})
        except Exception as e:
            print(f"⚠️ LLM-as-a-Judge evaluation failed: {e}")
            results["llm_judge_error"] = str(e)
    
    return results


# Example: Evaluate a sample with all metrics
if len(eval_data) > 0:
    sample = eval_data[0]
    print("🔍 Comprehensive Evaluation Example:")
    print(f"\nIncident: {sample['incident_number']}")
    print(f"Category: {sample['category']}")
    
    # For demo, use reference as candidate (perfect match scenario)
    # In real usage, candidate would be LLM-generated
    eval_results = comprehensive_evaluation(
        reference=sample['target'],
        candidate=sample['target'],  # Replace with actual LLM output
        compute_llm_judge=False  # Set to True if LLM provider is available
    )
    
    print("\n📊 Evaluation Results:")
    print(json.dumps(eval_results, indent=2, ensure_ascii=False)[:800] + "...")
else:
    print("⚠️ No evaluation data available")
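
`comprehensive_evaluation` returns one flat dictionary per reference/candidate pair. To compare models or prompts, those per-example dictionaries need to be summarised. A minimal sketch, assuming the `ngram_` and `semantic_` key prefixes produced above: average every numeric field across a list of results and skip text fields. `aggregate_results` is a hypothetical helper and the sample values are made up.

```python
from statistics import mean
from typing import Any, Dict, List


def aggregate_results(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Average each numeric field across per-example result dictionaries."""
    collected: Dict[str, List[float]] = {}
    for row in results:
        for key, value in row.items():
            # Skip strings (reference/candidate previews, explanations) and booleans.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                collected.setdefault(key, []).append(float(value))
    return {key: mean(values) for key, values in collected.items()}


# Made-up results for two examples, shaped like comprehensive_evaluation output.
fake_results = [
    {"candidate": "note A", "ngram_rougeL": 0.5, "semantic_semantic_similarity": 0.75},
    {"candidate": "note B", "ngram_rougeL": 0.25, "semantic_semantic_similarity": 0.25},
]
summary = aggregate_results(fake_results)
for metric, value in summary.items():
    print(f"{metric}: {value:.3f}")
```

Running this once per model or prompt variant gives one summary dictionary each, which can then be laid side by side in a DataFrame for the comparison and plots.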
