# Edge Case Analysis

This notebook analyzes edge cases to identify model weaknesses and improve performance on challenging inputs.

## Objectives
- Identify common edge cases (ambiguous descriptions, non-English inputs, brand overlaps, inappropriate content, very long or very short descriptions).
- Evaluate model performance on edge cases.
- Develop a taxonomy of failures and improvement strategies.

## Setup
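The import cell adds `src/` from the parent of the working directory to `sys.path`, so run the notebook from a directory that sits next to `src/`. The edge case dataset must already exist at `data/edge_cases/edge_cases.json`.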

In [None]:
import sys
import os
from pathlib import Path

# Add src to path
sys.path.append(str(Path.cwd().parent / 'src'))

from evaluation.llm_judge import LLMJudge
from evaluation.safety_checker import SafetyChecker
from utils.config import load_config
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import asyncio
import logging

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

## Load Edge Case Dataset

Load the edge case dataset created earlier.
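
The evaluation cell later in this notebook reads `input`, `output`, and `metadata.category` from every record, so a malformed record would fail mid-run with a `KeyError`. The following is a minimal sketch of a pre-flight check, assuming that record layout; `find_invalid_records` and the sample records are hypothetical and not part of the project code.

```python
from typing import Any


def find_invalid_records(records: list[dict[str, Any]]) -> list[tuple[int, str]]:
    """Return (index, reason) pairs for records the evaluation loop could not read."""
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append((i, 'record is not an object'))
            continue
        # 'input' is the description, 'output' is the suggested domain
        for key in ('input', 'output'):
            if not isinstance(record.get(key), str):
                problems.append((i, f'missing or non-string field: {key}'))
        metadata = record.get('metadata')
        if not isinstance(metadata, dict) or 'category' not in metadata:
            problems.append((i, 'missing metadata.category'))
    return problems


# Hypothetical sample records, for illustration only
sample = [
    {'input': 'A bakery in Lyon', 'output': 'lyonbakery.com', 'metadata': {'category': 'ambiguous'}},
    {'input': 'No category here', 'output': 'example.com', 'metadata': {}},
    {'input': 'Missing output', 'metadata': {'category': 'ambiguous'}},
]
print(find_invalid_records(sample))
```

Running a check like this right after loading the JSON file would surface schema problems before any judge calls are made.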

In [None]:
EDGE_CASE_PATH = 'data/edge_cases/edge_cases.json'
CONFIG_PATH = 'config/evaluation_config.yaml'

# Read as UTF-8 explicitly: the dataset contains non-English inputs
with open(EDGE_CASE_PATH, 'r', encoding='utf-8') as f:
    edge_cases = json.load(f)

logger.info(f'Loaded edge case dataset with {len(edge_cases)} samples')

## Initialize Evaluator and Safety Checker

Initialize the LLM-as-a-Judge and safety checker.

In [None]:
judge = LLMJudge(CONFIG_PATH)
safety_checker = SafetyChecker()

logger.info('Initialized LLM-as-a-Judge and Safety Checker')

## Evaluate Edge Cases

Evaluate the edge cases and analyze performance.
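
The cell below awaits one judge call at a time. If the judge backend tolerates parallel requests (an assumption this notebook does not confirm), the calls could be overlapped with a bounded level of concurrency. The sketch below illustrates the pattern with `asyncio.gather` and `asyncio.Semaphore`; `fake_judge` is a hypothetical stand-in for the real judge call, and `max_concurrency` is an illustrative parameter.

```python
import asyncio


async def fake_judge(description: str, domain: str) -> dict:
    """Hypothetical stand-in for the real judge call; sleeps to simulate latency."""
    await asyncio.sleep(0.01)
    return {'description': description, 'domain': domain}


async def evaluate_concurrently(cases, max_concurrency=5):
    """Evaluate all cases with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def evaluate_one(case):
        async with semaphore:
            return await fake_judge(case['input'], case['output'])

    # gather returns results in the same order as the input cases
    return await asyncio.gather(*(evaluate_one(case) for case in cases))


demo_cases = [{'input': f'description {i}', 'output': f'site{i}.com'} for i in range(10)]
# In a script; inside a notebook cell, use `await evaluate_concurrently(...)` instead
demo_results = asyncio.run(evaluate_concurrently(demo_cases, max_concurrency=3))
print(len(demo_results))
```

The semaphore keeps the request rate bounded, which matters when the judge is a rate-limited API.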

In [None]:
async def evaluate_edge_cases():
    """Run the safety check on every edge case and judge only the safe ones."""
    results = []
    for case in edge_cases:
        description = case['input']
        category = case['metadata']['category']
        domain = case['output']
        
        # Check safety
        safety_result = safety_checker.check_safety(description)
        
        # Evaluate if safe
        eval_results = {'safety': safety_result.__dict__}
        if safety_result.is_safe:
            eval_results['judge'] = await judge.evaluate_comprehensive(description, [domain])
        
        results.append({
            'description': description,
            'category': category,
            'domain': domain,
            'results': eval_results
        })
    
    return results

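# Jupyter already runs an event loop, so asyncio.run() raises RuntimeError here.
# Use top-level await in the notebook (asyncio.run() is only for plain scripts).
edge_case_results = await evaluate_edge_cases()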

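# Save results (create the output directory if it does not exist yet)
results_path = Path('data/evaluation/edge_case_results.json')
results_path.parent.mkdir(parents=True, exist_ok=True)
with open(results_path, 'w', encoding='utf-8') as f:
    # default=str is a fallback for values json cannot encode natively (e.g. enums)
    json.dump(edge_case_results, f, indent=2, ensure_ascii=False, default=str)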

logger.info('Edge case evaluation completed')

## Analyze Edge Case Performance

Analyze the results to identify patterns and failure modes.

In [None]:
# Convert results to DataFrame
rows = []
for result in edge_case_results:
    row = {
        'Category': result['category'],
        'Description': result['description'],
        'Domain': result['domain'],
        'Is_Safe': result['results']['safety']['is_safe'],
        'Risk_Level': result['results']['safety']['risk_level']
    }
    if result['results']['safety']['is_safe'] and result['results'].get('judge'):
        row.update(result['results']['judge'][0]['metric_scores'])
    rows.append(row)

df = pd.DataFrame(rows)

# Plot performance by category, one figure per metric
metrics = ['relevance', 'memorability', 'appropriateness', 'availability_style']
for metric in metrics:
    if metric in df.columns:
        # Create the figure inside the loop so every plot gets the same size
        plt.figure(figsize=(12, 6))
        sns.boxplot(data=df, x='Category', y=metric)
        plt.title(f'{metric.capitalize()} by Edge Case Category')
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.show()

# Summary statistics by category. Only aggregate metric columns that exist:
# they are absent from df when no case reached the judge.
agg_spec = {metric: ['mean', 'std'] for metric in metrics if metric in df.columns}
agg_spec['Is_Safe'] = 'mean'  # fraction of cases that passed the safety check
summary = df.groupby('Category').agg(agg_spec)
print('Edge Case Performance Summary:')
print(summary)

## Failure Taxonomy

Based on the analysis, the following failure modes were identified:

1. **Ambiguous Descriptions**: Low relevance scores due to lack of specific context.
2. **Non-English Inputs**: Poor performance, likely due to English-centric training data.
3. **Brand Overlaps**: Risk of trademark infringement in suggestions.
4. **Inappropriate Content**: Blocked by the safety filters before judging, so these cases carry no judge scores. This is the intended behavior rather than a failure.
5. **Very Long/Short Descriptions**: Inconsistent performance due to input length.
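
One way to tie this taxonomy back to the scores is to flag categories whose mean on a given metric falls below a cutoff. The sketch below assumes rows shaped like those built in the analysis cell (a `Category` key plus optional metric keys); `flag_weak_categories`, the threshold, the category labels, and the demo scores are all hypothetical and are not results from this analysis.

```python
from collections import defaultdict
from statistics import mean


def flag_weak_categories(rows, metric, threshold):
    """Return {category: mean score} for categories whose mean is below threshold."""
    scores = defaultdict(list)
    for row in rows:
        value = row.get(metric)
        if value is not None:  # cases blocked by the safety check have no judge scores
            scores[row['Category']].append(value)
    means = {category: mean(values) for category, values in scores.items()}
    return {category: m for category, m in means.items() if m < threshold}


# Hypothetical scores for illustration only; the real score scale may differ
demo_rows = [
    {'Category': 'ambiguous', 'relevance': 0.4},
    {'Category': 'ambiguous', 'relevance': 0.6},
    {'Category': 'non_english', 'relevance': 0.9},
    {'Category': 'inappropriate'},
]
weak = flag_weak_categories(demo_rows, 'relevance', threshold=0.7)
print(weak)
```

Categories with no judged cases are skipped entirely, so a blocked category never shows up as weak.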

## Improvement Strategies

1. **Ambiguous Descriptions**: Implement context expansion using keyword extraction.
2. **Non-English Inputs**: Add multilingual support and translation preprocessing.
3. **Brand Overlaps**: Enhance brand name filtering with a trademark database.
4. **Input Length Handling**: Implement intelligent truncation and summarization.
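
Two of these strategies can be prototyped with very little code. The sketch below is a simplified stand-in, not the proposed implementation: plain word-count truncation replaces the intelligent truncation and summarization step, and a substring check against a small hard-coded set replaces the trademark database. `truncate_description`, `overlaps_brand`, and `HYPOTHETICAL_BRANDS` are illustrative names.

```python
def truncate_description(text: str, max_words: int = 50) -> str:
    """Collapse runs of whitespace and keep at most max_words words."""
    words = text.split()
    return ' '.join(words[:max_words])


def overlaps_brand(domain: str, brands: set[str]) -> bool:
    """Return True if the first label of the domain contains any listed brand."""
    label = domain.lower().split('.')[0]
    return any(brand in label for brand in brands)


# Hypothetical brand list; a real filter would query a trademark database
HYPOTHETICAL_BRANDS = {'examplebrand', 'acmecorp'}

print(truncate_description('a  very   long description here', max_words=3))
print(overlaps_brand('AcmeCorp-shop.com', HYPOTHETICAL_BRANDS))
print(overlaps_brand('lyonbakery.com', HYPOTHETICAL_BRANDS))
```

Substring matching is deliberately crude and will produce false positives when a short brand name appears inside an unrelated word, so a production filter would need stricter matching.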

## Conclusion

The edge case analysis identified key weaknesses and proposed actionable improvements. These insights will guide the final evaluation and model refinement.