# Evaluation Framework

This notebook implements the LLM-as-a-Judge evaluation framework to assess domain name suggestions based on relevance, memorability, appropriateness, and availability-style plausibility.

## Objectives
- Evaluate domain suggestions using LLM-as-a-Judge.
- Score suggestions on four metrics.
- Aggregate results and generate summary statistics.
- Compare performance across model versions.

## Setup
Before running the cells below, make sure the project dependencies are installed and the API keys for OpenAI/Anthropic are configured.
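
A quick pre-flight check can catch a missing key before any judge call is made. The sketch below assumes the keys are exposed through the conventional `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables; the variable names this project actually reads are not stated here, and `missing_api_keys` is a hypothetical helper.

```python
import os

# Assumed variable names; adjust to whatever the judge configuration actually reads.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")

def missing_api_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]

missing = missing_api_keys()
if missing:
    print(f"Missing API keys: {', '.join(missing)}")
else:
    print("All API keys found.")
```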

In [None]:
import sys
import os
from pathlib import Path

# Add src to path
sys.path.append(str(Path.cwd().parent / 'src'))

from evaluation.llm_judge import LLMJudge
from utils.config import load_config
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import asyncio
import logging

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

## Load Configuration

Load the evaluation configuration from `config/evaluation_config.yaml`.

In [None]:
CONFIG_PATH = 'config/evaluation_config.yaml'
config = load_config(CONFIG_PATH)

EVAL_DATASET_PATH = config['datasets']['test_set']['path']
OUTPUT_PATH = 'data/evaluation/results.json'

logger.info(f'Loaded evaluation configuration from {CONFIG_PATH}')
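
The lookup `config['datasets']['test_set']['path']` raises a bare `KeyError` if any level is missing from the YAML file. As a sketch, a small helper can report which part of the path was not found. `get_nested` is a hypothetical name, and the example dictionary only mirrors the keys used above, not the real contents of the configuration file.

```python
def get_nested(mapping, *keys):
    """Walk nested dictionaries, naming the missing key path on failure."""
    current = mapping
    for depth, key in enumerate(keys):
        if not isinstance(current, dict) or key not in current:
            path = ".".join(keys[: depth + 1])
            raise KeyError(f"Missing configuration key: {path}")
        current = current[key]
    return current

# Illustrative stand-in for the loaded config (values are made up).
example_config = {"datasets": {"test_set": {"path": "data/test_set.json"}}}
print(get_nested(example_config, "datasets", "test_set", "path"))
```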

## Initialize LLM Judge

Initialize the LLM-as-a-Judge evaluator.

In [None]:
judge = LLMJudge(CONFIG_PATH)
logger.info('LLM-as-a-Judge initialized')

## Load Test Dataset

Load the evaluation dataset.

In [None]:
with open(EVAL_DATASET_PATH, 'r', encoding='utf-8') as f:
    dataset = json.load(f)

logger.info(f'Loaded evaluation dataset with {len(dataset)} samples')

## Evaluate Dataset

Run the evaluation on the test dataset, with `max_samples=100` passed to `evaluate_dataset`, and save the results to `OUTPUT_PATH`.

In [None]:
async def run_evaluation():
    results = await judge.evaluate_dataset(EVAL_DATASET_PATH, OUTPUT_PATH, max_samples=100)
    logger.info(f'Evaluation completed. Results saved to {OUTPUT_PATH}')
    return results

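# Jupyter already runs an event loop, so asyncio.run() would raise RuntimeError here.
# Use top-level await instead.
results = await run_evaluation()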
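
`asyncio.run()` raises `RuntimeError` when an event loop is already running, which is the case inside a Jupyter kernel. If the same code also needs to run as a plain script, one option is a small wrapper that handles both situations. This is a sketch, not part of the project: `run_sync` and `fake_evaluation` are hypothetical names, and the thread-based fallback assumes the coroutine does not depend on objects bound to the notebook's own event loop.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def run_sync(coro):
    """Run a coroutine to completion from a script or from inside a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain script): asyncio.run() is safe.
        return asyncio.run(coro)
    # A loop is already running (e.g. Jupyter): run the coroutine on a
    # fresh loop in a worker thread and block until it finishes.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()

async def fake_evaluation():
    # Stand-in for an awaitable such as run_evaluation().
    await asyncio.sleep(0)
    return {"status": "done"}

print(run_sync(fake_evaluation()))
```

In a notebook, `results = await run_evaluation()` remains the simpler choice.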

## Analyze Results

Analyze the evaluation results and visualize performance metrics.
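
The analysis code reads `sample_id`, `results`, `evaluations`, `score`, and `overall_score` from each entry. The sketch below flattens a record with that shape into one row per sample, suggestion, and metric. The example values and the `flatten_scores` helper are made up for illustration and are not output from the judge.

```python
import pandas as pd

def flatten_scores(results):
    """Turn nested judge results into one row per (sample, suggestion, metric)."""
    rows = []
    for sample in results:
        for index, result in enumerate(sample["results"]):
            for metric, evaluation in result["evaluations"].items():
                rows.append({
                    "Sample": sample["sample_id"],
                    "Suggestion": index,
                    "Metric": metric,
                    "Score": evaluation["score"],
                })
    return pd.DataFrame(rows)

# Made-up record mirroring the fields the analysis code accesses.
example_results = [{
    "sample_id": "s1",
    "results": [{
        "evaluations": {"relevance": {"score": 8}, "memorability": {"score": 6}},
        "overall_score": 7.0,
    }],
}]

example_df = flatten_scores(example_results)
print(example_df)
```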

In [None]:
# Load results
with open(OUTPUT_PATH, 'r', encoding='utf-8') as f:
    results = json.load(f)

# Extract scores
scores = []
for sample in results:
    for result in sample['results']:
        for metric, eval_result in result['evaluations'].items():
            scores.append({
                'Sample': sample['sample_id'],
                'Metric': metric,
                'Score': eval_result['score']
            })

scores_df = pd.DataFrame(scores)

# Plot score distribution
plt.figure(figsize=(10, 6))
sns.boxplot(data=scores_df, x='Metric', y='Score')
plt.title('Score Distribution by Metric')
plt.tight_layout()
plt.show()

# Calculate summary statistics
summary = scores_df.groupby('Metric')['Score'].agg(['mean', 'std', 'min', 'max'])
print('Summary Statistics:')
print(summary)

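# Plot overall score distribution across every suggestion, matching the
# per-metric plot above (indexing results[0] fails on samples with no results)
overall_scores = [
    result['overall_score']
    for sample in results
    for result in sample['results']
]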
plt.figure(figsize=(8, 6))
sns.histplot(overall_scores, bins=20)
plt.title('Overall Score Distribution')
plt.xlabel('Overall Score')
plt.tight_layout()
plt.show()
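
Comparing model versions is listed as an objective but is not covered by the cells above. One possible approach, assuming a separate results table per model version, is to tag each table with a version label and pivot the mean score per metric. The version labels `v1` and `v2`, the scores, and the `compare_versions` helper are all illustrative.

```python
import pandas as pd

def compare_versions(scores_by_version):
    """Mean score per metric (rows) and model version (columns)."""
    tagged = [
        df.assign(Version=version) for version, df in scores_by_version.items()
    ]
    combined = pd.concat(tagged, ignore_index=True)
    return combined.pivot_table(
        index="Metric", columns="Version", values="Score", aggfunc="mean"
    )

# Made-up scores in the same long format as scores_df.
v1 = pd.DataFrame({"Metric": ["relevance", "relevance", "memorability"],
                   "Score": [6, 8, 5]})
v2 = pd.DataFrame({"Metric": ["relevance", "relevance", "memorability"],
                   "Score": [8, 9, 7]})

comparison = compare_versions({"v1": v1, "v2": v2})
print(comparison)
```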

## Conclusion

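This notebook scored domain suggestions on four metrics (relevance, memorability, appropriateness, and availability-style plausibility) and summarized the per-metric and overall score distributions. The per-metric summary statistics and box plots show which metrics score lowest and vary most, which points to areas for improvement. The next step is to analyze edge cases.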