# Notebook 05: LLM-as-a-Judge Evaluation

## 🎯 What is This Notebook About?

This notebook evaluates close notes quality using **LLM-as-a-Judge** - an automated evaluation method that uses a Large Language Model (LLM) to assess close notes based on structured criteria.

**Context:**
1. We have **two datasets:**
   - **Reference Dataset** (good close notes) - High-quality examples
   - **Other Incidents Dataset** (bad/regular close notes) - Standard examples
   
2. We want to **evaluate close notes** using multiple quality criteria:
   - Does it provide useful information?
   - Is it specific and detailed?
   - Is it complete?
   - Does it avoid generic phrases?
   - Is it clear and well-written?

**This notebook's purpose:**
- **Set up evaluation criteria** - Define what makes a good close note
- **Evaluate close notes** - Use LLM to judge quality across multiple dimensions
- **Compare results** - See how good vs bad close notes score differently
- **Understand scoring** - Learn what the scores mean and how to interpret them

**What we'll learn:**
- LLM-as-a-Judge provides structured, explainable evaluation
- Good close notes score higher across all criteria
- Bad close notes score lower, especially on specificity and completeness
- This evaluation method can be used to assess AI-generated close notes

---

## 📚 Key Concepts Explained

### What is LLM-as-a-Judge?

**LLM-as-a-Judge** is a method where we use a Large Language Model (like Llama) to evaluate text quality, similar to how a human judge would evaluate it.

**Think of it like this:**
- **Human judge:** Reads a close note and gives it a score based on criteria
- **LLM judge:** Does the same thing, but uses AI to be consistent and scalable

**How it works:**
1. We define **evaluation criteria** (what to look for)
2. We provide the **close note** and **incident context** to the LLM
3. The LLM **assesses** the close note against each criterion
4. The LLM **selects an option** (e.g., "Excellent", "Acceptable", "Bad")
5. We get a **score** (0.0 to 1.0) and **reasoning** (why that score was given)

**Why this matters:**
- Provides **consistent evaluation** (same criteria applied to all notes)
- Gives **explainable scores** (we know why a score was given)
- Can **scale** to evaluate many close notes automatically
- Helps **identify** what makes a close note good or bad

### What are Evaluation Criteria?

**Evaluation criteria** are specific questions we ask about a close note's quality.

**Our 5 criteria:**
1. **Informativeness** - Does it provide useful information?
2. **Specificity** - Does it include specific details?
3. **Completeness** - Does it cover all key aspects?
4. **No Generic Statements** - Does it avoid generic phrases?
5. **Clarity** - Is it well-written and clear?

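**Each criterion has options that map to scores:**
- **Excellent** (score: 1.0) - Meets the criterion fully
- **Acceptable** (score: 0.75) - Good but could be better
- **Could be Improved** (score: 0.4) - Needs improvement
- **Bad** (score: 0.0) - Doesn't meet the criterion

These four options belong to Informativeness. The other four criteria use their own three-level scales (for example, "Highly Specific" / "Somewhat Specific" / "Vague"), with a top score of 1.0, a middle score of 0.4-0.6, and a bottom score of 0.0-0.2.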

**Why multiple criteria?**
- One score isn't enough - we need to understand **what** makes a close note good
- Different close notes may be strong in different areas
- Helps identify **specific improvements** needed

### How Does Scoring Work?

**Scoring process:**
1. LLM reads the close note and incident context
2. For each criterion, LLM evaluates the close note
3. LLM selects an option (e.g., "Excellent")
4. Option is converted to a numeric score (e.g., 1.0)
5. We get scores for all 5 criteria
6. We calculate an **average score** across all criteria

**Score interpretation:**
- **0.8 - 1.0** = Excellent close note (high quality)
- **0.6 - 0.8** = Good close note (acceptable quality)
- **0.4 - 0.6** = Needs improvement
- **0.0 - 0.4** = Poor close note (low quality)

**Example:**
- Close note scores: Informativeness=1.0, Specificity=0.6, Completeness=1.0, No Generic=1.0, Clarity=1.0
- Average: **0.92** = Excellent quality close note!
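
The scoring steps above can be sketched in plain Python. This is a minimal illustration, not the Unitxt implementation: the option maps are copied from the criteria defined in section 4, the helper names `score_note` and `interpret` are hypothetical, and the selected options are invented for the example. The interpretation bands overlap at their edges (0.8, 0.6, 0.4), so the sketch assumes a boundary value belongs to the higher band.

```python
from statistics import mean

# Option-to-score maps, copied from the criteria defined in section 4.
OPTION_MAPS = {
    "Informativeness": {"Excellent": 1.0, "Acceptable": 0.75, "Could be Improved": 0.4, "Bad": 0.0},
    "Specificity": {"Highly Specific": 1.0, "Somewhat Specific": 0.6, "Vague": 0.2},
    "Completeness": {"Complete": 1.0, "Partially Complete": 0.5, "Incomplete": 0.0},
    "No Generic Statements": {"No Generic Phrases": 1.0, "Few Generic Phrases": 0.4, "Too Generic": 0.0},
    "Clarity": {"Clear": 1.0, "Somewhat Clear": 0.6, "Unclear": 0.0},
}


def score_note(selected_options):
    """Convert the judge's selected options into per-criterion scores and an average."""
    scores = {name: OPTION_MAPS[name][option] for name, option in selected_options.items()}
    return scores, mean(scores.values())


def interpret(average):
    """Map an average score onto the interpretation bands (boundary goes to the higher band)."""
    if average >= 0.8:
        return "Excellent"
    if average >= 0.6:
        return "Good"
    if average >= 0.4:
        return "Needs improvement"
    return "Poor"


# Hypothetical judge output for one close note (illustrative, not a real result).
selected = {
    "Informativeness": "Excellent",
    "Specificity": "Somewhat Specific",
    "Completeness": "Complete",
    "No Generic Statements": "No Generic Phrases",
    "Clarity": "Clear",
}
scores, average = score_note(selected)
print(f"average={average:.2f} -> {interpret(average)}")
```

Because every score comes from an option map, per-criterion scores can only take the values listed in those maps; the average is the only number that varies continuously.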

---

## 🎯 Objectives

This notebook will:
1. **Load** reference and other incidents datasets
2. **Set up evaluation criteria** - Define 5 quality dimensions
3. **Configure LLM-as-a-Judge** - Set up Ollama and Unitxt
4. **Evaluate close notes** - Score close notes from both datasets
5. **Compare results** - Analyze differences between good and bad close notes
6. **Visualize scores** - Create charts showing score distributions
7. **Interpret results** - Understand what the scores mean

---

## 📋 What We're Evaluating

**Datasets:**
- **Reference Dataset** (`reference_close_notes.csv`) - Good close notes (ground truth)
- **Other Incidents Dataset** (`other_incidents.csv`) - Bad/regular close notes

**What we'll evaluate:**
- Close notes from both datasets
- Using 5 evaluation criteria
- With incident context (`content` field) for better evaluation
- Get scores and reasoning for each evaluation

**Expected results:**
- Reference close notes should score **higher** (0.7-1.0 average)
- Other incidents should score **lower** (0.3-0.6 average)
- Differences should be most obvious in **Specificity** and **No Generic Statements**

---

## 🚀 Getting Started

Let's start by importing the necessary libraries and setting up our evaluation environment.



## 1. Import Libraries and Setup

**What we're doing:** Importing the libraries we need for LLM-as-a-Judge evaluation.

**Libraries:**
- `pandas` - For working with datasets
- `unitxt` - For LLM-as-a-Judge evaluation framework
- `matplotlib` and `seaborn` - For visualizations
- `numpy` - For numerical operations


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Unitxt imports for LLM-as-a-Judge
from unitxt.api import create_dataset, evaluate
from unitxt.inference import CrossProviderInferenceEngine
from unitxt.llm_as_judge import LLMJudgeDirect
from unitxt.llm_as_judge_constants import CriteriaWithOptions

# Set up plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("✅ Libraries imported successfully!")
print(f"📊 Working directory: {Path.cwd()}")


## 2. Load Datasets

**What we're doing:** Loading the reference dataset (good close notes) and other incidents dataset (bad/regular close notes) that we created in Notebook 02.

**Files:**
- `data/reference_close_notes.csv` - High-quality close notes (ground truth)
- `data/other_incidents.csv` - Standard close notes (for comparison)

**Key fields:**
- `content` - Incident description (context for evaluation)
- `close_notes_ref` or `close_notes` - The close note to evaluate
- `category` - Incident category (for filtering/grouping)


In [None]:
# Load datasets
data_dir = Path("../data")

reference_df = pd.read_csv(data_dir / "reference_close_notes.csv")
other_incidents_df = pd.read_csv(data_dir / "other_incidents.csv")

print("="*80)
print("DATASETS LOADED")
print("="*80)
print(f"\n📊 Reference Dataset (Good Close Notes):")
print(f"   - Total records: {len(reference_df)}")
print(f"   - Columns: {list(reference_df.columns)}")

print(f"\n📊 Other Incidents Dataset (Bad/Regular Close Notes):")
print(f"   - Total records: {len(other_incidents_df)}")
print(f"   - Columns: {list(other_incidents_df.columns)}")

# Check for required fields
print(f"\n🔍 Checking required fields...")
required_ref_fields = ['content', 'close_notes_ref']
required_other_fields = ['content', 'close_notes']

missing_ref = [f for f in required_ref_fields if f not in reference_df.columns]
missing_other = [f for f in required_other_fields if f not in other_incidents_df.columns]

if missing_ref:
    print(f"   ⚠️  Missing in reference dataset: {missing_ref}")
else:
    print(f"   ✅ Reference dataset has all required fields")

if missing_other:
    print(f"   ⚠️  Missing in other incidents dataset: {missing_other}")
else:
    print(f"   ✅ Other incidents dataset has all required fields")

print("="*80)


## 3. Prepare Sample Data for Evaluation

**What we're doing:** Selecting a sample of close notes from both datasets to evaluate. We'll start with a small sample to test the evaluation, then can expand.

**Why sample?**
- LLM evaluation takes time and resources
- Starting small helps us verify everything works
- We can evaluate more later if needed

**Selection strategy:**
- Take the first `SAMPLE_SIZE` rows that pass the filters (the cell prints the category mix so we can check diversity)
- Ensure we have incident context (`content` field)
- Filter out empty or very short close notes
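
The cell below takes the first rows that pass the filters, so a mix of categories is not guaranteed. If category diversity matters, one option is a round-robin pick across categories. The sketch below is an assumption-laden illustration using only the standard library: `diverse_sample` is a hypothetical helper, and the rows are invented (in the notebook they could come from `reference_df.to_dict("records")` after filtering).

```python
from collections import defaultdict
from itertools import zip_longest


def diverse_sample(rows, sample_size, key="category"):
    """Pick up to sample_size rows, cycling through categories in round-robin order."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row.get(key, "Unknown")].append(row)

    picked = []
    # Each round takes one row per category; None pads categories that ran out.
    for round_rows in zip_longest(*by_category.values()):
        for row in round_rows:
            if row is not None and len(picked) < sample_size:
                picked.append(row)
    return picked


# Hypothetical rows for illustration only.
rows = [
    {"number": "INC-1", "category": "Network"},
    {"number": "INC-2", "category": "Network"},
    {"number": "INC-3", "category": "Network"},
    {"number": "INC-4", "category": "Software"},
    {"number": "INC-5", "category": "Hardware"},
]
sample = diverse_sample(rows, sample_size=3)
print([row["category"] for row in sample])
```

With these rows, plain "first three" selection would return three Network incidents, while the round-robin pick returns one incident from each category.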


In [None]:
# Prepare sample data for evaluation
# Start with a small sample (e.g., 5-10 from each dataset) for testing
SAMPLE_SIZE = 5

# Filter reference dataset: ensure we have content and close_notes_ref
reference_sample = reference_df[
    (reference_df['content'].notna()) & 
    (reference_df['content'].astype(str).str.strip() != '') &
    (reference_df['close_notes_ref'].notna()) & 
    (reference_df['close_notes_ref'].astype(str).str.strip() != '') &
    (reference_df['close_notes_ref'].astype(str).str.len() > 20)  # Minimum length
].head(SAMPLE_SIZE).copy()

# Filter other incidents dataset: ensure we have content and close_notes
other_sample = other_incidents_df[
    (other_incidents_df['content'].notna()) & 
    (other_incidents_df['content'].astype(str).str.strip() != '') &
    (other_incidents_df['close_notes'].notna()) & 
    (other_incidents_df['close_notes'].astype(str).str.strip() != '') &
    (other_incidents_df['close_notes'].astype(str).str.len() > 10)  # Minimum length
].head(SAMPLE_SIZE).copy()

print("="*80)
print("SAMPLE DATA PREPARED")
print("="*80)
print(f"\n📊 Reference Sample (Good Close Notes):")
print(f"   - Sample size: {len(reference_sample)}")
if 'category' in reference_sample.columns:
    print(f"   - Categories: {reference_sample['category'].value_counts().to_dict()}")

print(f"\n📊 Other Incidents Sample (Bad/Regular Close Notes):")
print(f"   - Sample size: {len(other_sample)}")
if 'category' in other_sample.columns:
    print(f"   - Categories: {other_sample['category'].value_counts().to_dict()}")

# Show example
print(f"\n📝 Example Reference Close Note:")
if len(reference_sample) > 0:
    example_idx = 0
    print(f"   Content (first 150 chars): {reference_sample.iloc[example_idx]['content'][:150]}...")
    print(f"   Close Note: {reference_sample.iloc[example_idx]['close_notes_ref'][:200]}...")

print("="*80)


## 4. Define Evaluation Criteria

**What we're doing:** Defining the 5 evaluation criteria that will be used to judge close note quality.

**Each criterion:**
- Has a **name** and **description** (what we're looking for)
- Has **options** (e.g., "Excellent", "Acceptable", "Bad")
- Has an **option_map** (converts options to numeric scores 0.0-1.0)

**Why these criteria?**
- They cover the key aspects of a good close note
- They're specific enough to be evaluated consistently
- They help identify what makes a close note good or bad
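
Since each criterion pairs a list of options with an `option_map`, a typo in either one could leave an option without a score. The sketch below is a hypothetical pre-flight check on the plain dictionary (the same shape passed to `CriteriaWithOptions.from_obj` in this notebook); `validate_criterion` is not part of Unitxt, and Unitxt may perform its own validation.

```python
def validate_criterion(criterion):
    """Return a list of problems found in a criterion definition (empty if none)."""
    problems = []
    option_names = [option["name"] for option in criterion["options"]]
    option_map = criterion["option_map"]

    # Every option needs a score.
    for name in option_names:
        if name not in option_map:
            problems.append(f"option '{name}' has no score")

    # Every score must belong to a known option and stay within 0.0-1.0.
    for name, score in option_map.items():
        if name not in option_names:
            problems.append(f"score given for unknown option '{name}'")
        if not 0.0 <= score <= 1.0:
            problems.append(f"score {score} for '{name}' is outside 0.0-1.0")
    return problems


# Same content as the Clarity criterion defined in this notebook.
clarity_spec = {
    "name": "Clarity",
    "description": "Is the close note well-written, clear, and easy to understand?",
    "options": [
        {"name": "Clear", "description": "Well-structured, easy to follow, and professional."},
        {"name": "Somewhat Clear", "description": "Understandable but could be better organized or more concise."},
        {"name": "Unclear", "description": "Difficult to understand or poorly structured."},
    ],
    "option_map": {"Clear": 1.0, "Somewhat Clear": 0.6, "Unclear": 0.0},
}
print(validate_criterion(clarity_spec))
```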


In [None]:
# Define 5 evaluation criteria for close notes quality

informativeness = CriteriaWithOptions.from_obj({
    "name": "Informativeness",
    "description": "Does the close note provide useful, specific information about what happened and how it was resolved?",
    "options": [
        {"name": "Excellent", "description": "Highly informative with specific details about problem, cause, and resolution."},
        {"name": "Acceptable", "description": "Provides useful information but could include more specific details."},
        {"name": "Could be Improved", "description": "Some information present but vague or incomplete."},
        {"name": "Bad", "description": "Little or no useful information (e.g., just 'Issue resolved')."},
    ],
    "option_map": {"Excellent": 1.0, "Acceptable": 0.75, "Could be Improved": 0.4, "Bad": 0.0},
})

specificity = CriteriaWithOptions.from_obj({
    "name": "Specificity",
    "description": "Does the close note include specific details such as error messages, specific actions taken, browser versions, or exact resolutions?",
    "options": [
        {"name": "Highly Specific", "description": "Includes concrete details like error codes, specific steps taken, browser versions, or exact outcomes."},
        {"name": "Somewhat Specific", "description": "Includes some details but could be more precise."},
        {"name": "Vague", "description": "Lacks specific details; too general."},
    ],
    "option_map": {"Highly Specific": 1.0, "Somewhat Specific": 0.6, "Vague": 0.2},
})

completeness = CriteriaWithOptions.from_obj({
    "name": "Completeness",
    "description": "Does the close note cover the key aspects: what the problem was, what was done to resolve it, and the outcome?",
    "options": [
        {"name": "Complete", "description": "Covers problem, actions taken, and outcome clearly."},
        {"name": "Partially Complete", "description": "Covers some aspects but missing important details."},
        {"name": "Incomplete", "description": "Significant gaps in information; missing key aspects."},
    ],
    "option_map": {"Complete": 1.0, "Partially Complete": 0.5, "Incomplete": 0.0},
})

no_generic_statements = CriteriaWithOptions.from_obj({
    "name": "No Generic Statements",
    "description": "Does the close note avoid generic, unhelpful phrases like 'Issue resolved', 'No changes noted', or 'Resolved per user' without explanation?",
    "options": [
        {"name": "No Generic Phrases", "description": "No generic statements; all content is specific and informative."},
        {"name": "Few Generic Phrases", "description": "Mostly specific but includes some generic statements."},
        {"name": "Too Generic", "description": "Primarily or entirely generic statements without explanation."},
    ],
    "option_map": {"No Generic Phrases": 1.0, "Few Generic Phrases": 0.4, "Too Generic": 0.0},
})

clarity = CriteriaWithOptions.from_obj({
    "name": "Clarity",
    "description": "Is the close note well-written, clear, and easy to understand?",
    "options": [
        {"name": "Clear", "description": "Well-structured, easy to follow, and professional."},
        {"name": "Somewhat Clear", "description": "Understandable but could be better organized or more concise."},
        {"name": "Unclear", "description": "Difficult to understand or poorly structured."},
    ],
    "option_map": {"Clear": 1.0, "Somewhat Clear": 0.6, "Unclear": 0.0},
})

print("="*80)
print("EVALUATION CRITERIA DEFINED")
print("="*80)
print("\n✅ Created 5 evaluation criteria:")
print("   1. Informativeness - Does it provide useful information?")
print("   2. Specificity - Does it include specific details?")
print("   3. Completeness - Does it cover all key aspects?")
print("   4. No Generic Statements - Does it avoid generic phrases?")
print("   5. Clarity - Is it well-written and clear?")
print("="*80)


## 5. Configure LLM-as-a-Judge

**What we're doing:** Setting up the LLM judge using Ollama (local LLM) and Unitxt framework.

**Configuration:**
- **Model:** Llama 3.2 3B Instruct (via Ollama)
- **Provider:** Ollama (runs locally)
- **Context:** We'll pass the incident description (`content`) as context
- **Metrics:** One metric per criterion (5 metrics total)

**Important:** Make sure Ollama is running (`ollama serve`) and the model is pulled (`ollama pull llama3.2:3b`) before running this notebook.
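
Before running the evaluation, it can help to confirm both prerequisites programmatically. The sketch below assumes Ollama is listening on its default address (`http://localhost:11434`) and uses its `/api/tags` endpoint, which lists locally pulled models. The helper names `model_names` and `check_ollama` are hypothetical, and the address should be adjusted if your setup differs.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # Assumed default address; change if needed


def model_names(tags_payload):
    """Extract model names from the JSON returned by Ollama's /api/tags endpoint."""
    return [model["name"] for model in tags_payload.get("models", [])]


def check_ollama(model="llama3.2:3b", base_url=OLLAMA_URL, timeout=3):
    """Return (server_up, model_pulled) without raising if the server is unreachable."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (URLError, OSError, ValueError):
        return False, False
    return True, model in model_names(payload)
```

Calling `check_ollama()` returns `(False, False)` when the server is not running, and `(True, False)` when the server is up but `llama3.2:3b` has not been pulled yet.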


In [None]:
# Create metrics for each criterion
# Each criterion gets its own LLMJudgeDirect metric instance (and its own inference engine)

criteria_list = [informativeness, specificity, completeness, no_generic_statements, clarity]

metrics = [
    LLMJudgeDirect(
        inference_engine=CrossProviderInferenceEngine(
            model="llama3.2:3b",  # Ollama model name
            max_tokens=1024,
            data_classification_policy=["private"],
            provider="ollama",
        ),
        criteria=criterion,
        context_fields=["question"],  # Will contain incident context
        criteria_field="criteria",
    )
    for criterion in criteria_list
]

print("="*80)
print("LLM-AS-A-JUDGE CONFIGURED")
print("="*80)
print(f"\n✅ Created {len(metrics)} evaluation metrics (one per criterion)")
print(f"   Model: llama3.2:3b (via Ollama)")
print(f"   Provider: Ollama (local)")
print(f"\n⚠️  Prerequisites:")
print(f"   1. Ollama server running: ollama serve")
print(f"   2. Model pulled: ollama pull llama3.2:3b")
print("="*80)


## 6. Prepare Data for Evaluation

**What we're doing:** Formatting the close notes and incident context into the format expected by Unitxt.

**Format:**
- Each item needs a `question` field containing:
  - An instruction to write a close note
  - The full incident context (from the `content` field)
  - The close note being evaluated
- Together these give the LLM judge the information it needs

**We'll prepare:**
- Reference dataset close notes (good examples)
- Other incidents close notes (bad/regular examples)


In [None]:
# Prepare data for evaluation
# Format: question field contains instruction + incident context

def prepare_evaluation_data(df, close_notes_col='close_notes_ref', dataset_type='reference'):
    """Prepare data in format expected by Unitxt."""
    data = []
    for idx, row in df.iterrows():
        content = str(row['content']) if pd.notna(row['content']) else ""
        close_note = str(row[close_notes_col]) if pd.notna(row[close_notes_col]) else ""
        
        # Create question with incident context
        question = f"""Write a close note for this incident:

{content}

The close note being evaluated is:
{close_note}"""
        
        data.append({
            "question": question,
            "dataset_type": dataset_type,
            "incident_id": row.get('number', f"INC-{idx}"),
            "category": row.get('category', 'Unknown'),
        })
    
    return data

# Prepare reference dataset
reference_data = prepare_evaluation_data(
    reference_sample, 
    close_notes_col='close_notes_ref',
    dataset_type='reference'
)

# Prepare other incidents dataset
other_data = prepare_evaluation_data(
    other_sample,
    close_notes_col='close_notes',
    dataset_type='other'
)

print("="*80)
print("DATA PREPARED FOR EVALUATION")
print("="*80)
print(f"\n📊 Reference Data:")
print(f"   - Records: {len(reference_data)}")
print(f"   - Example question length: {len(reference_data[0]['question'])} chars")

print(f"\n📊 Other Incidents Data:")
print(f"   - Records: {len(other_data)}")
print(f"   - Example question length: {len(other_data[0]['question'])} chars")

print("="*80)


## 7. Evaluate Close Notes

**What we're doing:** Running the LLM-as-a-Judge evaluation on our close notes.

**Process:**
1. Create a dataset with our close notes
2. For each close note, the LLM evaluates it against all 5 criteria
3. We get scores (0.0-1.0) and reasoning for each criterion
4. Results are stored for analysis

**Note:** This may take a few minutes as the LLM processes each close note for each criterion.
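
To get a feel for the runtime before starting, the workload can be estimated with simple arithmetic. The sketch below assumes at least one LLM request per (close note, criterion) pair and sequential execution; the judge may issue more than one request per pair, so treat the result as a lower bound. The function name and the 6-second latency are placeholders, not measurements.

```python
def estimate_judge_calls(num_notes, num_criteria, calls_per_pair=1, seconds_per_call=6.0):
    """Return (total LLM calls, estimated minutes) for a sequential run."""
    calls = num_notes * num_criteria * calls_per_pair
    minutes = calls * seconds_per_call / 60
    return calls, minutes


# 5 reference + 5 other close notes, judged on 5 criteria.
calls, minutes = estimate_judge_calls(num_notes=10, num_criteria=5)
print(f"{calls} calls, roughly {minutes:.0f} minutes")
```

The call count grows linearly with the sample size, which is why the notebook starts with `SAMPLE_SIZE = 5` per dataset.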


In [None]:
# Combine all data for evaluation
all_data = reference_data + other_data

# Create dataset
print("Creating dataset...")
dataset = create_dataset(
    task="tasks.qa.open",
    test_set=all_data,
    metrics=metrics,
    split="test"
)

print(f"✅ Dataset created with {len(dataset)} examples")

# Prepare predictions (the close notes to evaluate)
# For each item in data, extract the close note from the question
predictions = []
for item in all_data:
    # Extract close note from question (it's after "The close note being evaluated is:")
    question_parts = item['question'].split("The close note being evaluated is:")
    if len(question_parts) > 1:
        close_note = question_parts[1].strip()
    else:
        # Fallback: extract from original data
        if item['dataset_type'] == 'reference':
            idx = reference_data.index(item) if item in reference_data else 0
            close_note = reference_sample.iloc[idx]['close_notes_ref']
        else:
            idx = other_data.index(item) if item in other_data else 0
            close_note = other_sample.iloc[idx]['close_notes']
    predictions.append(close_note)

print(f"✅ Prepared {len(predictions)} predictions for evaluation")

# Run evaluation
print("\n" + "="*80)
print("RUNNING LLM-AS-A-JUDGE EVALUATION")
print("="*80)
print("⏳ This may take a few minutes...")
print("   Evaluating each close note against 5 criteria...\n")

results = evaluate(predictions=predictions, data=dataset)

print("✅ Evaluation completed!")
print("="*80)


## 8. Extract and Analyze Results

**What we're doing:** Extracting scores from the evaluation results and organizing them for analysis.

**We'll extract:**
- Score for each criterion (0.0-1.0)
- Selected option for each criterion (e.g., "Excellent", "Acceptable")
- Reasoning for each evaluation (why that score was given)
- Overall average score across all criteria

**Then we'll:**
- Compare scores between reference (good) and other (bad) close notes
- Identify which criteria show the biggest differences
- Visualize the results
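
Identifying the criteria with the biggest differences comes down to comparing per-criterion means between the two groups. The sketch below shows the idea on plain dictionaries with made-up scores; `criterion_gaps` is a hypothetical helper, and the column names in the notebook's `results_df` may differ from the keys used here.

```python
from statistics import mean


def criterion_gaps(rows, criteria):
    """Mean reference score minus mean other score per criterion, largest gap first."""
    gaps = {}
    for criterion in criteria:
        ref = [r[criterion] for r in rows if r["dataset_type"] == "reference"]
        other = [r[criterion] for r in rows if r["dataset_type"] == "other"]
        if ref and other:  # Skip criteria that lack scores in either group
            gaps[criterion] = mean(ref) - mean(other)
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for illustration only.
rows = [
    {"dataset_type": "reference", "Specificity": 1.0, "Clarity": 1.0},
    {"dataset_type": "reference", "Specificity": 0.6, "Clarity": 1.0},
    {"dataset_type": "other", "Specificity": 0.2, "Clarity": 0.6},
    {"dataset_type": "other", "Specificity": 0.2, "Clarity": 1.0},
]
gaps = criterion_gaps(rows, ["Specificity", "Clarity"])
for criterion, gap in gaps:
    print(f"{criterion}: {gap:+.2f}")
```

A large positive gap means the reference notes outscore the other notes on that criterion; a gap near zero means the criterion does little to separate the two datasets.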


In [None]:
# Extract results and organize into a DataFrame
results_list = []

if hasattr(results, 'instance_scores') and isinstance(results.instance_scores, list):
    for i, instance in enumerate(results.instance_scores):
        if isinstance(instance, dict):
            # Get metadata from original data
            metadata = all_data[i]
            
            # Extract scores for each criterion
            result_row = {
                'dataset_type': metadata['dataset_type'],
                'incident_id': metadata['incident_id'],
                'category': metadata.get('category', 'Unknown'),
            }
            
            # Find all criteria scores
            score_keys = [k for k in instance.keys() if k.endswith('_selected_option')]
            all_scores = []
            
            for score_key in score_keys:
                base_name = score_key.replace('_selected_option', '')
                score = instance.get(base_name, None)
                selected_option = instance.get(f'{base_name}_selected_option', 'N/A')
                
                if score is not None:
                    # Clean criterion name (remove underscores, title case)
                    criterion_name = base_name.replace('_', ' ').title()
                    result_row[f'{criterion_name}_score'] = score
                    result_row[f'{criterion_name}_option'] = selected_option
                    all_scores.append(score)
            
            # Calculate average score
            if all_scores:
                result_row['average_score'] = sum(all_scores) / len(all_scores)
            else:
                result_row['average_score'] = None
            
            results_list.append(result_row)

# Create DataFrame
results_df = pd.DataFrame(results_list)

print("="*80)
print("RESULTS EXTRACTED")
print("="*80)
print(f"\n📊 Total evaluations: {len(results_df)}")
print(f"   - Reference (good): {len(results_df[results_df['dataset_type'] == 'reference'])}")
print(f"   - Other (bad/regular): {len(results_df[results_df['dataset_type'] == 'other'])}")

if len(results_df) > 0:
    print(f"\n📋 Sample results:")
    print(results_df[['dataset_type', 'average_score']].head())
    
    # Show average scores by dataset type
    print(f"\n📈 Average Scores by Dataset Type:")
    avg_by_type = results_df.groupby('dataset_type')['average_score'].agg(['mean', 'std', 'min', 'max'])
    print(avg_by_type)

print("="*80)


## 9. Visualize Results

**What we're doing:** Creating visualizations to compare scores between good and bad close notes.

**Charts we'll create:**
1. **Average Score Comparison** - Box plot showing score distributions
2. **Criterion-by-Criterion Comparison** - See which criteria show the biggest differences
3. **Score Distribution** - Histogram showing how scores are distributed
4. **Summary Statistics** - Table of count, mean, standard deviation, min, and max per dataset

**What to look for:**
- Reference (good) close notes should score **higher** overall
- Biggest differences likely in **Specificity** and **No Generic Statements**
- Scores should cluster: good notes 0.7-1.0, bad notes 0.3-0.6


In [None]:
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('LLM-as-a-Judge Evaluation Results: Good vs Bad Close Notes', fontsize=16, fontweight='bold')

# 1. Average Score Comparison
ax1 = axes[0, 0]
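results_df.boxplot(column='average_score', by='dataset_type', ax=ax1)
# pandas replaces the figure suptitle when `by` is used, so restore ours
fig.suptitle('LLM-as-a-Judge Evaluation Results: Good vs Bad Close Notes', fontsize=16, fontweight='bold')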
ax1.set_title('Average Score Distribution')
ax1.set_xlabel('Dataset Type')
ax1.set_ylabel('Average Score (0.0 - 1.0)')
ax1.set_ylim(0, 1.1)
ax1.grid(True, alpha=0.3)

# 2. Criterion Scores Comparison
ax2 = axes[0, 1]
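# Exclude 'average_score' so the overall average is not plotted as if it were a criterion
criterion_cols = [col for col in results_df.columns if col.endswith('_score') and col != 'average_score']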
if criterion_cols:
    comparison_data = []
    for criterion in criterion_cols:
        criterion_name = criterion.replace('_score', '')
        for dataset_type in ['reference', 'other']:
            subset = results_df[results_df['dataset_type'] == dataset_type]
            if len(subset) > 0:
                comparison_data.append({
                    'Criterion': criterion_name,
                    'Dataset': 'Good (Reference)' if dataset_type == 'reference' else 'Bad/Regular (Other)',
                    'Score': subset[criterion].mean()
                })
    
    if comparison_data:
        comparison_df = pd.DataFrame(comparison_data)
        comparison_pivot = comparison_df.pivot(index='Criterion', columns='Dataset', values='Score')
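        # Map colors by column name: pivot sorts columns alphabetically, so a
        # positional list would paint the "Bad" bars green and the "Good" bars red
        comparison_pivot.plot(
            kind='bar',
            ax=ax2,
            color={'Good (Reference)': '#2ecc71', 'Bad/Regular (Other)': '#e74c3c'},
        )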
        ax2.set_title('Average Scores by Criterion')
        ax2.set_xlabel('Criterion')
        ax2.set_ylabel('Average Score')
        ax2.legend(title='Dataset Type')
        ax2.tick_params(axis='x', rotation=45)
        ax2.grid(True, alpha=0.3, axis='y')

# 3. Score Distribution Histogram
ax3 = axes[1, 0]
reference_scores = results_df[results_df['dataset_type'] == 'reference']['average_score'].dropna()
other_scores = results_df[results_df['dataset_type'] == 'other']['average_score'].dropna()

if len(reference_scores) > 0 and len(other_scores) > 0:
    ax3.hist(reference_scores, bins=10, alpha=0.6, label='Good (Reference)', color='#2ecc71')
    ax3.hist(other_scores, bins=10, alpha=0.6, label='Bad/Regular (Other)', color='#e74c3c')
    ax3.set_title('Score Distribution')
    ax3.set_xlabel('Average Score')
    ax3.set_ylabel('Frequency')
    ax3.legend()
    ax3.grid(True, alpha=0.3, axis='y')

# 4. Summary Statistics Table
ax4 = axes[1, 1]
ax4.axis('off')

# Create summary table
summary_data = []
for dataset_type in ['reference', 'other']:
    subset = results_df[results_df['dataset_type'] == dataset_type]
    if len(subset) > 0 and 'average_score' in subset.columns:
        summary_data.append({
            'Dataset': 'Good (Reference)' if dataset_type == 'reference' else 'Bad/Regular (Other)',
            'Count': len(subset),
            'Mean': subset['average_score'].mean(),
            'Std': subset['average_score'].std(),
            'Min': subset['average_score'].min(),
            'Max': subset['average_score'].max()
        })

if summary_data:
    summary_df = pd.DataFrame(summary_data).round(3)  # Round so the table cells stay readable
    table = ax4.table(cellText=summary_df.values,
                     colLabels=summary_df.columns,
                     cellLoc='center',
                     loc='center',
                     bbox=[0, 0, 1, 1])
    table.auto_set_font_size(False)
    table.set_fontsize(10)
    table.scale(1, 2)
    ax4.set_title('Summary Statistics', pad=20)

plt.tight_layout()
plt.show()

print("✅ Visualizations created!")


## 9b. Additional Visualizations: Heatmap and Radar Chart

**What we're doing:** Creating two additional charts that provide deeper insights into the evaluation results.

**New charts:**
1. **Heatmap** - Visual comparison of scores across all criteria (quick overview)
2. **Radar Chart** - Shows the "profile" of good vs bad notes (strengths/weaknesses)

**Why these charts help:**
- **Heatmap:** See all scores at once - which criteria show the biggest gaps?
- **Radar Chart:** Understand the "shape" of quality - are good notes strong across all criteria or just some?

**Think of it like:** 
- Heatmap = A color-coded report card showing where good notes excel
- Radar Chart = A "spider web" showing the quality profile of each dataset
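
A radar chart places each criterion at an evenly spaced angle and then repeats the first point so the outline closes. The sketch below shows only that geometry, using the standard library; `radar_polygon` is a hypothetical helper and the scores are illustrative.

```python
import math


def radar_polygon(values):
    """Return (angles, closed_values) for plotting one dataset on a radar chart."""
    n = len(values)
    angles = [2 * math.pi * i / n for i in range(n)]
    # Repeat the first point so the plotted line returns to where it started.
    return angles + angles[:1], list(values) + list(values[:1])


# Illustrative mean scores for five criteria.
angles, closed = radar_polygon([1.0, 0.6, 1.0, 0.4, 1.0])
print(len(angles), len(closed))
```

Both lists end up one element longer than the number of criteria, which is what a polar `plot` call needs to draw a closed shape.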


In [None]:
# Create additional visualizations: Heatmap and Radar Chart
# These charts provide deeper insights into the evaluation results

# Extract criterion columns (exclude average_score)
criterion_cols = [
    col
    for col in results_df.columns
    if col.endswith("_score") and col != "average_score"
]
criterion_names = [
    col.replace("_score", "").replace("_", " ").title() for col in criterion_cols
]

if len(criterion_cols) > 0:
    # Create figure with 2 subplots
    fig2, (ax5, ax6) = plt.subplots(1, 2, figsize=(18, 7))
    fig2.suptitle(
        "Additional Insights: Heatmap and Criterion Profile",
        fontsize=16,
        fontweight="bold",
        y=1.02,
    )

    # ========================================================================
    # CHART 1: HEATMAP - Visual comparison across all criteria
    # ========================================================================
    # Prepare data for heatmap (row labels are collected with the data so they
    # stay aligned even if one dataset has no results)
    heatmap_data = []
    heatmap_labels = []
    dataset_labels = {"reference": "Good (Reference)", "other": "Bad/Regular (Other)"}
    for dataset_type, label in dataset_labels.items():
        subset = results_df[results_df["dataset_type"] == dataset_type]
        if len(subset) > 0:
            heatmap_data.append([subset[criterion].mean() for criterion in criterion_cols])
            heatmap_labels.append(label)

    if heatmap_data:
        heatmap_df = pd.DataFrame(
            heatmap_data,
            index=heatmap_labels,
            columns=criterion_names,
        )

        # Create heatmap with better colormap
        im = ax5.imshow(heatmap_df.values, cmap="RdYlGn", aspect="auto", vmin=0, vmax=1)

        # Set ticks and labels
        ax5.set_xticks(np.arange(len(heatmap_df.columns)))
        ax5.set_yticks(np.arange(len(heatmap_df.index)))
        ax5.set_xticklabels(heatmap_df.columns, rotation=45, ha="right", fontsize=11)
        ax5.set_yticklabels(heatmap_df.index, fontsize=12, fontweight="bold")

        # Add text annotations with better formatting
        for i in range(len(heatmap_df.index)):
            for j in range(len(heatmap_df.columns)):
                score = heatmap_df.iloc[i, j]
                # Each label sits on a semi-opaque white box, so black text stays readable at any score
                text_color = "black"
                ax5.text(
                    j,
                    i,
                    f"{score:.2f}",
                    ha="center",
                    va="center",
                    color=text_color,
                    fontsize=11,
                    fontweight="bold",
                    bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.7),
                )

        ax5.set_title(
            "Score Heatmap Across All Criteria\n(Darker Green = Higher Score)",
            fontsize=14,
            fontweight="bold",
            pad=15,
        )

        # Add colorbar with better styling
        cbar = plt.colorbar(im, ax=ax5, fraction=0.046, pad=0.04)
        cbar.set_label(
            "Score (0.0 = Poor, 1.0 = Excellent)", fontsize=11, fontweight="bold"
        )
        cbar.ax.tick_params(labelsize=10)

    # ========================================================================
    # CHART 2: RADAR CHART - Criterion Profile Comparison
    # ========================================================================
    if len(criterion_cols) >= 3:
        try:
            # Calculate average scores for each criterion
            ref_means = [
                results_df[results_df["dataset_type"] == "reference"][col].mean()
                for col in criterion_cols
            ]
            other_means = [
                results_df[results_df["dataset_type"] == "other"][col].mean()
                for col in criterion_cols
            ]

            # Number of criteria
            N = len(criterion_cols)

            # Compute angle for each criterion
            angles = [n / float(N) * 2 * np.pi for n in range(N)]
            angles += angles[:1]  # Complete the circle

            # Add values to complete the circle
            ref_means += ref_means[:1]
            other_means += other_means[:1]

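            # Create radar chart: swap the Cartesian placeholder for a polar axes
            # in the same slot so the two axes do not overlap
            ax6.remove()
            ax6 = fig2.add_subplot(1, 2, 2, projection="polar")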
            ax6.plot(
                angles,
                ref_means,
                "o-",
                linewidth=3,
                label="Good (Reference)",
                color="#2ecc71",
                markersize=8,
            )
            ax6.fill(angles, ref_means, alpha=0.25, color="#2ecc71")
            ax6.plot(
                angles,
                other_means,
                "o-",
                linewidth=3,
                label="Bad/Regular (Other)",
                color="#e74c3c",
                markersize=8,
            )
            ax6.fill(angles, other_means, alpha=0.25, color="#e74c3c")

            # Add criterion labels with better positioning
            ax6.set_xticks(angles[:-1])
            ax6.set_xticklabels(criterion_names, fontsize=10)
            ax6.set_ylim(0, 1)
            ax6.set_yticks([0.2, 0.4, 0.6, 0.8, 1.0])
            ax6.set_yticklabels(["0.2", "0.4", "0.6", "0.8", "1.0"], fontsize=9)
            ax6.grid(True, alpha=0.3, linestyle="--")
            ax6.set_title(
                "Criterion Profile Comparison\n(Radar Chart - Shows Strengths/Weaknesses)",
                fontsize=14,
                fontweight="bold",
                pad=20,
            )
            ax6.legend(
                loc="upper right",
                bbox_to_anchor=(1.3, 1.1),
                fontsize=11,
                framealpha=0.9,
            )

            # Add score annotations at each point
            for angle, ref_val, other_val in zip(
                angles[:-1], ref_means[:-1], other_means[:-1]
            ):
                ax6.text(
                    angle,
                    ref_val + 0.1,
                    f"{ref_val:.2f}",
                    ha="center",
                    va="center",
                    fontsize=8,
                    color="#2ecc71",
                    fontweight="bold",
                )
                ax6.text(
                    angle,
                    other_val - 0.1,
                    f"{other_val:.2f}",
                    ha="center",
                    va="center",
                    fontsize=8,
                    color="#e74c3c",
                    fontweight="bold",
                )
        except Exception as e:
            ax6.text(
                0.5,
                0.5,
                f"Radar chart unavailable\n({str(e)[:40]})",
                ha="center",
                va="center",
                transform=ax6.transAxes,
                fontsize=11,
            )
            ax6.set_title(
                "Criterion Profile (Radar Chart)", fontsize=14, fontweight="bold"
            )
    else:
        ax6.text(
            0.5,
            0.5,
            "Radar chart requires at least 3 criteria",
            ha="center",
            va="center",
            transform=ax6.transAxes,
            fontsize=11,
        )
        ax6.set_title("Criterion Profile (Radar Chart)", fontsize=14, fontweight="bold")

    plt.tight_layout()
    plt.show()

    print("\n✅ Additional visualizations created!")
    print("   📊 Charts: Heatmap (all criteria at once), Radar Chart (quality profile)")
    print("=" * 80)
else:
    print("⚠️  No criterion columns found for additional visualizations")

## 10. Detailed Results Per Close Note

**What we're doing:** Showing detailed evaluation results for each close note, including scores and reasoning for each criterion.

**This helps us:**
- Understand why each close note scored the way it did
- See which criteria are strengths/weaknesses for each note
- Learn what makes a close note good or bad
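
The cell below reads each instance score as a dictionary with three keys per criterion: `<criterion>` (the numeric score), `<criterion>_selected_option`, and `<criterion>_assessment`. As a minimal sketch of that layout, assuming a hand-written example dictionary (the criterion names, values, and the `flatten_instance` helper are hypothetical and not part of the notebook), one instance can be flattened into one row per criterion:

```python
import pandas as pd

# Hypothetical instance score, shaped like the dicts the next cell reads.
example_instance = {
    "specificity": 0.5,
    "specificity_selected_option": "Acceptable",
    "specificity_assessment": "Names the affected service but not the fix.",
    "completeness": 1.0,
    "completeness_selected_option": "Excellent",
    "completeness_assessment": "Covers cause, resolution, and verification.",
}


def flatten_instance(instance: dict) -> pd.DataFrame:
    """Turn one instance-score dict into one row per criterion."""
    suffix = "_selected_option"
    rows = []
    for key in instance:
        if not key.endswith(suffix):
            continue
        base = key[: -len(suffix)]  # criterion name without the suffix
        rows.append(
            {
                "criterion": base,
                "score": instance.get(base),
                "selected_option": instance[key],
                "assessment": instance.get(f"{base}_assessment", ""),
            }
        )
    return pd.DataFrame(rows)


flat = flatten_instance(example_instance)
# Lowest-scoring criteria first, so weaknesses are easy to spot.
print(flat.sort_values("score"))
```

A table like this makes it easier to sort or filter criteria for a single note than reading the printed report line by line.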


In [None]:
# Display detailed results for each close note
print("="*80)
print("DETAILED EVALUATION RESULTS")
print("="*80)

if hasattr(results, 'instance_scores') and isinstance(results.instance_scores, list):
    for i, instance in enumerate(results.instance_scores):
        metadata = all_data[i]
        prediction = predictions[i]
        
        print(f"\n{'='*80}")
        print(f"CLOSE NOTE {i+1} - {metadata['dataset_type'].upper()}")
        print('='*80)
        print("\n📝 Close Note:")
        print(f"{prediction[:300]}..." if len(prediction) > 300 else prediction)
        
        if isinstance(instance, dict):
            # Extract all criteria scores
            score_keys = [k for k in instance.keys() if k.endswith('_selected_option')]
            
            if score_keys:
                print("\n📊 Scores Across All Criteria:")
                print("-" * 80)
                
                all_scores = []
                for score_key in score_keys:
                    base_name = score_key.replace('_selected_option', '')
                    score = instance.get(base_name, None)
                    selected_option = instance.get(f'{base_name}_selected_option', 'N/A')
                    assessment = instance.get(f'{base_name}_assessment', '')
                    
                    if score is not None:
                        criterion_name = base_name.replace('_', ' ').title()
                        print(f"\n🔍 {criterion_name}:")
                        print(f"   Score: {score:.2f} ({selected_option})")
                        
                        if assessment:
                            # Show first 200 chars of reasoning
                            if len(assessment) > 200:
                                print(f"   Reasoning: {assessment[:200]}...")
                            else:
                                print(f"   Reasoning: {assessment}")
                        
                        all_scores.append(score)
                
                # Show average
                if all_scores:
                    avg_score = sum(all_scores) / len(all_scores)
                    print(f"\n{'='*80}")
                    print(f"📈 Overall Average Score: {avg_score:.2f} / 1.0")
                    print("   Quality Level: ", end="")
                    if avg_score >= 0.8:
                        print("✅ Excellent")
                    elif avg_score >= 0.6:
                        print("✅ Good")
                    elif avg_score >= 0.4:
                        print("⚠️  Needs Improvement")
                    else:
                        print("❌ Poor")

print("\n" + "="*80)
print("✅ DETAILED RESULTS DISPLAYED")
print("="*80)


## 11. Comparison and Interpretation

**What we're doing:** Comparing results between good (reference) and bad (other) close notes to understand the differences.

**Key questions:**
- Do good close notes score higher? (Expected: Yes)
- Which criteria show the biggest differences?
- What can we learn about what makes a close note good?

**Interpretation guide:**
- **Large difference (above 0.3)** = This criterion strongly distinguishes good from bad
- **Moderate difference (above 0.15, up to 0.3)** = This criterion distinguishes good from bad
- **Small difference (0.15 or less)** = This criterion doesn't distinguish well
- **Consistent pattern** = Good notes score higher across all criteria
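
To answer "which criteria show the biggest differences?" at a glance, the per-criterion gaps can be ranked in one table. The following is a rough sketch that uses the same cut-offs as the comparison cell in this section (above 0.3 is large, above 0.15 is moderate). The `toy_df` data and the `label_gap` helper are hypothetical stand-ins; in the notebook the same steps would run on `results_df`.

```python
import pandas as pd

# Hypothetical scores; the notebook would use results_df instead.
toy_df = pd.DataFrame(
    {
        "dataset_type": ["reference", "reference", "other", "other"],
        "specificity_score": [1.0, 0.75, 0.25, 0.5],
        "clarity_score": [0.75, 0.75, 0.75, 0.5],
    }
)


def label_gap(gap: float) -> str:
    """Label a reference-minus-other gap with the comparison cell's cut-offs."""
    if gap > 0.3:
        return "large"
    if gap > 0.15:
        return "moderate"
    return "small"


# Mean score per criterion for each dataset type.
means = toy_df.groupby("dataset_type").mean(numeric_only=True)

# Gap = good (reference) minus bad/regular (other), largest first.
gaps = (means.loc["reference"] - means.loc["other"]).sort_values(ascending=False)
summary = pd.DataFrame({"gap": gaps, "label": gaps.map(label_gap)})
print(summary)
```

Note that a negative gap (bad notes scoring higher) also falls into the "small" bucket under these cut-offs, so it is worth checking the sign of the gap as well as its label.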


In [None]:
# Compare results between reference and other datasets
print("="*80)
print("COMPARISON: GOOD vs BAD CLOSE NOTES")
print("="*80)

if len(results_df) > 0:
    reference_scores = results_df[results_df['dataset_type'] == 'reference']
    other_scores = results_df[results_df['dataset_type'] == 'other']
    
    if len(reference_scores) > 0 and len(other_scores) > 0:
        print("\n📊 Overall Average Scores:")
        print(f"   Good (Reference): {reference_scores['average_score'].mean():.2f}")
        print(f"   Bad/Regular (Other): {other_scores['average_score'].mean():.2f}")
        print(f"   Difference: {reference_scores['average_score'].mean() - other_scores['average_score'].mean():.2f}")
        
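        # Compare by criterion (skip the aggregate 'average_score' column,
        # which also ends in '_score' but is not a criterion)
        criterion_cols = [
            col for col in results_df.columns
            if col.endswith('_score') and col != 'average_score'
        ]
        if criterion_cols:
            print("\n📋 Criterion-by-Criterion Comparison:")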
            print("-" * 80)
            
            for criterion_col in criterion_cols:
                criterion_name = criterion_col.replace('_score', '').replace('_', ' ').title()
                ref_mean = reference_scores[criterion_col].mean()
                other_mean = other_scores[criterion_col].mean()
                difference = ref_mean - other_mean
                
                print(f"\n{criterion_name}:")
                print(f"   Good: {ref_mean:.2f}")
                print(f"   Bad/Regular: {other_mean:.2f}")
                print(f"   Difference: {difference:.2f}", end="")
                
                if difference > 0.3:
                    print(" ✅ Large difference - strongly distinguishes quality")
                elif difference > 0.15:
                    print(" ✅ Moderate difference - distinguishes quality")
                else:
                    print(" ⚠️  Small difference - less distinguishing")
        
        # Summary
        print(f"\n{'='*80}")
        print("📈 SUMMARY")
        print("="*80)
        print("\n✅ Key Findings:")
        print(f"   - Good close notes score {'higher' if reference_scores['average_score'].mean() > other_scores['average_score'].mean() else 'lower'} overall")
        print(f"   - Average difference: {abs(reference_scores['average_score'].mean() - other_scores['average_score'].mean()):.2f} points")
        
        if reference_scores['average_score'].mean() > other_scores['average_score'].mean():
            print("\n💡 Interpretation:")
            print("   LLM-as-a-Judge successfully distinguishes between good and bad close notes.")
            print("   This evaluation method can be used to assess close note quality.")
        else:
            print("\n⚠️  Note:")
            print("   Unexpected results - good notes should score higher.")
            print("   This may indicate issues with the evaluation or sample selection.")

print("="*80)


## 12. Conclusion and Next Steps

**What we learned:**
- ✅ LLM-as-a-Judge provides structured, explainable evaluation
- ✅ Good close notes score higher across multiple criteria
- ✅ Specific criteria (like Specificity and No Generic Statements) strongly distinguish quality
- ✅ This method can be used to evaluate AI-generated close notes

**Next steps:**
- **Improvement:** Use evaluation results to guide LLM prompt engineering
- **Scaling:** Evaluate more close notes to build a comprehensive quality assessment

**How to use this:**
- Generate close notes using an LLM (Notebook 06)
- Evaluate them using this same method
- Compare scores to identify areas for improvement
- Iterate on prompts to improve quality
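
One way to make the compare-and-iterate loop concrete is to track per-criterion means for each prompt version. This is only a sketch under assumptions: the `prompt_version` column, the score values, and the `runs` table are hypothetical, and real scores would come from evaluating generated close notes with the method in this notebook.

```python
import pandas as pd

# Hypothetical per-note scores for two prompt versions.
runs = pd.DataFrame(
    {
        "prompt_version": ["v1", "v1", "v2", "v2"],
        "specificity_score": [0.25, 0.5, 0.75, 0.75],
        "completeness_score": [0.5, 0.5, 0.5, 0.75],
    }
)

# Mean score per criterion for each prompt version.
per_version = runs.groupby("prompt_version").mean(numeric_only=True)

# How much each criterion moved from v1 to v2.
delta = per_version.loc["v2"] - per_version.loc["v1"]

# The lowest-scoring criterion in the latest version is the next target.
weakest = per_version.loc["v2"].idxmin()

print(per_version)
print(delta)
print(f"Focus next prompt revision on: {weakest}")
```

With only a handful of notes per version, small changes in the means may be noise, so it helps to evaluate a larger sample before drawing conclusions from the deltas.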
