# Concept 2: Evaluating AI Agents

**Objective**: Evaluate an insurance claims assistant using industry-standard agentic AI metrics with LlamaIndex RAG integration.

**Top 3 Agentic RAG Metrics**:
- 🎯 **Factual Accuracy** (40% weight): LLM-based correctness scoring
- 📝 **Citation/Source Compliance** (30% weight): Source attribution and evidence quality
- 🔍 **Retrieval Relevance** (30% weight): Quality of document retrieval using LlamaIndex

**Time**: ~15-20 minutes
**Domain**: Insurance claims processing with persistent memory
**Dataset**: 10 labeled golden-standard Q&A pairs
**RAG Framework**: LlamaIndex for document retrieval and indexing
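
The three weights listed above combine into a single composite score per question. A minimal sketch of that arithmetic, assuming each metric is reported on a 0-100 scale; the names `WEIGHTS` and `composite_score` are illustrative and are not part of the notebook's own classes:

```python
# Weights taken from the metric list above; names here are illustrative.
WEIGHTS = {
    "factual_accuracy": 0.40,
    "citation_compliance": 0.30,
    "retrieval_relevance": 0.30,
}


def composite_score(scores: dict) -> float:
    """Weighted sum of per-metric scores, each expected on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing metric scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


example = {
    "factual_accuracy": 100,
    "citation_compliance": 100,
    "retrieval_relevance": 50,
}
print(round(composite_score(example), 2))  # 85.0
```

Because the weights sum to 1.0, the composite stays on the same 0-100 scale as its inputs.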

## 🎯 Learning Objectives

By the end of this demonstration, you will understand how to:
1. Evaluate agentic AI systems using production-ready metrics
2. Implement RAG evaluation with LlamaIndex document retrieval
3. Measure the top 3 agentic AI metrics used in financial services
4. Generate performance reports with retrieval analytics
5. Apply evaluation best practices to memory-enabled RAG agents

In [1]:
# Import required libraries
import os
import sys
import json
import csv
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass
import re
from pathlib import Path

# Data handling
import pandas as pd
import numpy as np

# OpenAI for LLM
from openai import OpenAI

# LlamaIndex for RAG
from llama_index.core import VectorStoreIndex, Document, Settings
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI as LlamaOpenAI
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# Environment variables
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Initialize OpenAI client
client = OpenAI(
    base_url="https://openai.vocareum.com/v1",
    api_key=os.getenv("OPENAI_API_KEY")
)

# Configure LlamaIndex settings
# (the LlamaIndex OpenAI wrappers take the endpoint as `api_base`)
Settings.llm = LlamaOpenAI(
    model="gpt-4o-mini",
    api_key=os.getenv("OPENAI_API_KEY"),
    api_base="https://openai.vocareum.com/v1"
)
Settings.embed_model = OpenAIEmbedding(
    api_key=os.getenv("OPENAI_API_KEY"),
    api_base="https://openai.vocareum.com/v1"
)

print("🔧 Evaluation System Setup:")
print(f"   ✅ OpenAI API Key: {'✓ Configured' if os.getenv('OPENAI_API_KEY') else '❌ Missing'}")
print("   🏥 Domain: Insurance claims processing with persistent memory")
print("   🔍 RAG Framework: LlamaIndex for document retrieval")
print("   📊 Focus: Top 3 agentic AI metrics")
print("   🎯 Metrics: Factual accuracy, citation compliance, retrieval relevance")
print("   📈 Dataset: 10 golden standard labeled examples")
print("   🔗 Integration: Memory-enabled claims assistant")



🔧 Evaluation System Setup:
   ✅ OpenAI API Key: ✓ Configured
   🏥 Domain: Insurance claims processing with persistent memory
   🔍 RAG Framework: LlamaIndex for document retrieval
   📊 Focus: Top 3 agentic AI metrics
   🎯 Metrics: Factual accuracy, citation compliance, retrieval relevance
   📈 Dataset: 10 golden standard labeled examples
   🔗 Integration: Memory-enabled claims assistant


## 🧠 Insurance Claims Assistant with Memory

Create a simplified claims assistant with persistent memory that integrates with LlamaIndex RAG for evaluation.

In [2]:
# Insurance claims assistant with memory integration
from datetime import datetime, timedelta
import uuid

class MemoryEntry:
    """Memory entry for insurance policy information"""
    def __init__(self, topic: str, fact_text: str, source: str, weight: float = 1.0):
        self.id = str(uuid.uuid4())
        self.topic = topic
        self.fact_text = fact_text
        self.source = source
        self.weight = weight
        self.created_at = datetime.now()
        self.updated_at = datetime.now()
        self.frequency_count = 1
        self.pinned = False

class InsuranceMemoryManager:
    """Memory manager for insurance policy information"""
    
    def __init__(self):
        self.memories = {}
        self.client = client
    
    def add_policy_knowledge(self, policy_documents: List[Dict]):
        """Add insurance policy documents to memory"""
        for doc in policy_documents:
            memory = MemoryEntry(
                topic=doc['category'],
                fact_text=f"{doc['title']}: {doc['content']}",
                source=doc['doc_id'],
                weight=2.0  # Higher weight for policy documents
            )
            self.memories[memory.id] = memory
    
    def retrieve_relevant_memories(self, query: str, top_k: int = 3) -> List[MemoryEntry]:
        """Keyword-based memory retrieval"""
        query_lower = query.lower()
        scored_memories = []
        
        for memory in self.memories.values():
            score = 0
            fact_lower = memory.fact_text.lower()
            
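            # Simple keyword matching: tokenize with a regex so trailing
            # punctuation (e.g. "insurance?") does not block a match
            for word in re.findall(r"[\w'-]+", query_lower):
                if len(word) > 3 and word in fact_lower:
                    score += memory.weight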
            
            if score > 0:
                scored_memories.append((memory, score))
        
        # Sort by score and return top-k
        scored_memories.sort(key=lambda x: x[1], reverse=True)
        return [mem for mem, score in scored_memories[:top_k]]

class EvaluationClaimsAssistant:
    """Insurance claims assistant for evaluation"""
    
    def __init__(self, memory_manager: InsuranceMemoryManager, llamaindex_retriever):
        self.memory = memory_manager
        self.llamaindex_retriever = llamaindex_retriever
        self.client = client
    
    def answer_question_with_memory_and_rag(self, question: str) -> Dict[str, Any]:
        """Answer questions using both memory and LlamaIndex RAG"""
        
        # 1. Retrieve from persistent memory
        memory_results = self.memory.retrieve_relevant_memories(question, top_k=2)
        
        # 2. Retrieve using LlamaIndex RAG
        rag_results = self.llamaindex_retriever.retrieve(question)
        
        # 3. Combine context from both sources
        memory_context = "\n".join([f"Memory: {mem.fact_text}" for mem in memory_results])
        rag_context = "\n".join([f"Document: {node.text}" for node in rag_results])
        
        combined_context = f"""Memory Context:\n{memory_context}\n\nDocument Context:\n{rag_context}"""
        
        # 4. Generate answer using LLM
        prompt = f"""
You are an insurance claims assistant with access to persistent memory and current policy documents.

Context from memory and documents:
{combined_context}

Question: {question}

Instructions:
- Provide an accurate answer based on the context above
- Include source references in [brackets] for factual claims
- If you don't have the information, say "I don't have that information"
- Keep responses concise and professional
"""
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.1
            )
            
            answer = response.choices[0].message.content
            
            return {
                "question": question,
                "answer": answer,
                "memory_sources": [mem.source for mem in memory_results],
                "rag_sources": [node.metadata.get('source', 'unknown') for node in rag_results],
                "memory_count": len(memory_results),
                "rag_count": len(rag_results),
                "context_length": len(combined_context),
                "tokens_used": response.usage.total_tokens,
                "retrieved_nodes": rag_results
            }
            
        except Exception as e:
            return {
                "question": question,
                "answer": f"Error: {str(e)}",
                "memory_sources": [],
                "rag_sources": [],
                "memory_count": 0,
                "rag_count": 0,
                "context_length": 0,
                "tokens_used": 0,
                "retrieved_nodes": []
            }

print("🧠 Insurance Claims Assistant Components Ready:")
print("   ✅ MemoryEntry class for persistent storage")
print("   ✅ InsuranceMemoryManager for memory operations")
print("   ✅ EvaluationClaimsAssistant with memory + RAG integration")
print("   🔗 Ready to integrate with insurance policy documents")

🧠 Insurance Claims Assistant Components Ready:
   ✅ MemoryEntry class for persistent storage
   ✅ InsuranceMemoryManager for memory operations
   ✅ EvaluationClaimsAssistant with memory + RAG integration
   🔗 Ready to integrate with insurance policy documents
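
To see how the keyword-based memory retrieval ranks entries, the scoring idea can be run on its own without any API calls. This is a standalone sketch rather than the notebook's `InsuranceMemoryManager`: the helper names `keyword_score` and `top_k_facts` are hypothetical, the fact texts are abridged from the policy documents, and it tokenizes with a regular expression so that trailing punctuation does not block a match.

```python
import re
from typing import Dict, List, Tuple


def keyword_score(query: str, fact_text: str, weight: float = 1.0) -> float:
    """Add `weight` for every query token longer than 3 chars found in the fact."""
    fact_lower = fact_text.lower()
    tokens = re.findall(r"[\w'-]+", query.lower())
    return sum(weight for tok in tokens if len(tok) > 3 and tok in fact_lower)


def top_k_facts(query: str, facts: Dict[str, str], top_k: int = 2) -> List[Tuple[str, float]]:
    """Score every fact, drop zero scores, and return the best `top_k`."""
    scored = [(doc_id, keyword_score(query, text, weight=2.0)) for doc_id, text in facts.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Abridged fact texts, shortened from the policy documents for illustration.
facts = {
    "AUTO_001": "Auto Insurance Coverage Limits: collision coverage with $500 deductible",
    "HOME_002": "Homeowners Claims Deductibles: Standard homeowners deductible is $1,000 for most claims",
    "HEALTH_001": "Health Insurance Claim Submission: claims must be submitted within 90 days of service date",
}
query = "What is the deductible for collision coverage on auto insurance?"

ranked = top_k_facts(query, facts)
print(ranked[0])  # ('AUTO_001', 10.0)
```

Substring matching is cheap but coarse: "deductible" also matches inside "deductibles", so unrelated entries can pick up a small score. That is one reason the assistant pairs this memory lookup with embedding-based retrieval.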


## 📊 Load Golden Dataset & Initialize LlamaIndex RAG

Create synthetic insurance claims data and set up LlamaIndex for document retrieval.

In [3]:
# Create synthetic insurance claims evaluation dataset
def create_insurance_evaluation_data() -> Tuple[List[Dict], List[Dict]]:
    """Create golden Q&A dataset and policy documents for insurance claims"""
    
    # Policy documents
    policy_docs = [
        {
            "doc_id": "AUTO_001",
            "title": "Auto Insurance Coverage Limits",
            "content": "Standard auto insurance policy includes liability coverage up to $300,000 per accident, collision coverage with $500 deductible, and comprehensive coverage with $250 deductible. Rental car reimbursement covers up to $40 per day for maximum 30 days.",
            "category": "auto_insurance",
            "relevance_keywords": ["auto", "car", "vehicle", "liability", "collision", "comprehensive", "deductible"]
        },
        {
            "doc_id": "AUTO_002",
            "title": "Auto Claims Processing Time",
            "content": "Auto insurance claims are typically processed within 5-7 business days for straightforward cases. Complex claims involving accidents with multiple parties may take 14-21 business days. Emergency rental car approval can be provided within 24 hours.",
            "category": "claims_processing",
            "relevance_keywords": ["processing", "time", "days", "approval", "rental", "emergency"]
        },
        {
            "doc_id": "HOME_001",
            "title": "Homeowners Insurance Coverage",
            "content": "Homeowners insurance covers dwelling up to policy limit, personal property at 50% of dwelling coverage, liability protection up to $500,000, and additional living expenses during repairs. Water damage from burst pipes is covered, but flood damage requires separate flood insurance.",
            "category": "home_insurance",
            "relevance_keywords": ["home", "property", "dwelling", "water", "flood", "liability", "living expenses"]
        },
        {
            "doc_id": "HOME_002",
            "title": "Homeowners Claims Deductibles",
            "content": "Standard homeowners deductible is $1,000 for most claims. Wind and hail damage has a separate 2% deductible based on dwelling coverage amount. Flood insurance through NFIP has separate deductibles: $1,000 for building and $1,000 for contents.",
            "category": "deductibles",
            "relevance_keywords": ["deductible", "wind", "hail", "flood", "amount", "building", "contents"]
        },
        {
            "doc_id": "HEALTH_001",
            "title": "Health Insurance Claim Submission",
            "content": "Health insurance claims must be submitted within 90 days of service date. Claims can be filed online, by mail, or through provider direct billing. Required documentation includes itemized bills, explanation of services, and physician notes for procedures over $5,000.",
            "category": "health_insurance",
            "relevance_keywords": ["health", "medical", "submission", "deadline", "documentation", "bills", "physician"]
        },
        {
            "doc_id": "HEALTH_002",
            "title": "Health Insurance Out-of-Pocket Maximums",
            "content": "Annual out-of-pocket maximum for individual coverage is $8,500 and $17,000 for family coverage. After reaching this limit, insurance covers 100% of eligible expenses. Deductible is $2,000 individual / $4,000 family. Copays and coinsurance count toward out-of-pocket max.",
            "category": "health_costs",
            "relevance_keywords": ["out-of-pocket", "maximum", "deductible", "copay", "coinsurance", "family", "individual"]
        }
    ]
    
    # Golden Q&A dataset
    qa_dataset = [
        {
            "question_id": "Q001",
            "question": "What is the deductible for collision coverage on auto insurance?",
            "correct_answer": "$500 deductible for collision coverage",
            "relevant_doc_ids": ["AUTO_001"],
            "category": "auto_insurance",
            "difficulty": "easy",
            "should_have_citation": True,
            "expected_retrieval_keywords": ["auto", "collision", "deductible"]
        },
        {
            "question_id": "Q002",
            "question": "How long does it take to process a standard auto insurance claim?",
            "correct_answer": "5-7 business days for straightforward cases",
            "relevant_doc_ids": ["AUTO_002"],
            "category": "claims_processing",
            "difficulty": "easy",
            "should_have_citation": True,
            "expected_retrieval_keywords": ["processing", "time", "auto"]
        },
        {
            "question_id": "Q003",
            "question": "Does homeowners insurance cover flood damage?",
            "correct_answer": "No, flood damage requires separate flood insurance policy",
            "relevant_doc_ids": ["HOME_001"],
            "category": "home_insurance",
            "difficulty": "medium",
            "should_have_citation": True,
            "expected_retrieval_keywords": ["flood", "damage", "homeowners"]
        },
        {
            "question_id": "Q004",
            "question": "What is the wind and hail deductible for homeowners insurance?",
            "correct_answer": "2% of dwelling coverage amount as separate deductible",
            "relevant_doc_ids": ["HOME_002"],
            "category": "deductibles",
            "difficulty": "medium",
            "should_have_citation": True,
            "expected_retrieval_keywords": ["wind", "hail", "deductible"]
        },
        {
            "question_id": "Q005",
            "question": "What is the deadline for submitting health insurance claims?",
            "correct_answer": "Within 90 days of service date",
            "relevant_doc_ids": ["HEALTH_001"],
            "category": "health_insurance",
            "difficulty": "easy",
            "should_have_citation": True,
            "expected_retrieval_keywords": ["health", "deadline", "submission"]
        },
        {
            "question_id": "Q006",
            "question": "What is the individual out-of-pocket maximum for health insurance?",
            "correct_answer": "$8,500 annual out-of-pocket maximum for individual coverage",
            "relevant_doc_ids": ["HEALTH_002"],
            "category": "health_costs",
            "difficulty": "easy",
            "should_have_citation": True,
            "expected_retrieval_keywords": ["out-of-pocket", "maximum", "individual"]
        },
        {
            "question_id": "Q007",
            "question": "How much rental car coverage is provided after an auto accident?",
            "correct_answer": "Up to $40 per day for maximum 30 days",
            "relevant_doc_ids": ["AUTO_001"],
            "category": "auto_insurance",
            "difficulty": "medium",
            "should_have_citation": True,
            "expected_retrieval_keywords": ["rental", "car", "reimbursement"]
        },
        {
            "question_id": "Q008",
            "question": "What documentation is required for health insurance claims over $5,000?",
            "correct_answer": "Itemized bills, explanation of services, and physician notes",
            "relevant_doc_ids": ["HEALTH_001"],
            "category": "health_insurance",
            "difficulty": "medium",
            "should_have_citation": True,
            "expected_retrieval_keywords": ["documentation", "required", "physician"]
        },
        {
            "question_id": "Q009",
            "question": "What is covered under additional living expenses in homeowners insurance?",
            "correct_answer": "Living expenses during repairs when home is uninhabitable",
            "relevant_doc_ids": ["HOME_001"],
            "category": "home_insurance",
            "difficulty": "medium",
            "should_have_citation": True,
            "expected_retrieval_keywords": ["living", "expenses", "repairs"]
        },
        {
            "question_id": "Q010",
            "question": "Do copays count toward the out-of-pocket maximum?",
            "correct_answer": "Yes, copays and coinsurance count toward out-of-pocket maximum",
            "relevant_doc_ids": ["HEALTH_002"],
            "category": "health_costs",
            "difficulty": "easy",
            "should_have_citation": True,
            "expected_retrieval_keywords": ["copay", "out-of-pocket", "count"]
        }
    ]
    
    return qa_dataset, policy_docs

def setup_llamaindex_rag(policy_documents: List[Dict]) -> VectorIndexRetriever:
    """Set up LlamaIndex RAG system with insurance policy documents"""
    
    # Convert policy documents to LlamaIndex Document objects
    documents = []
    for doc in policy_documents:
        document = Document(
            text=f"{doc['title']}\n\n{doc['content']}",
            metadata={
                'doc_id': doc['doc_id'],
                'title': doc['title'],
                'category': doc['category'],
                'source': doc['doc_id'],
                'keywords': ','.join(doc['relevance_keywords'])
            }
        )
        documents.append(document)
    
    # Create vector index
    print("   🔍 Building LlamaIndex vector store...")
    vector_index = VectorStoreIndex.from_documents(documents)
    
    # Create retriever
    retriever = VectorIndexRetriever(
        index=vector_index,
        similarity_top_k=3
    )
    
    print(f"   ✅ LlamaIndex RAG initialized with {len(documents)} documents")
    return retriever

# Create the datasets
print("📊 Creating Insurance Claims Evaluation Data...")
GOLDEN_QA_DATASET, POLICY_DOCUMENTS = create_insurance_evaluation_data()

# Set up LlamaIndex RAG
print("🔍 Setting up LlamaIndex RAG...")
llamaindex_retriever = setup_llamaindex_rag(POLICY_DOCUMENTS)

# Initialize memory manager and add policy knowledge
print("🧠 Initializing Insurance Memory Manager...")
memory_manager = InsuranceMemoryManager()
memory_manager.add_policy_knowledge(POLICY_DOCUMENTS)

# Create evaluation assistant
evaluation_assistant = EvaluationClaimsAssistant(memory_manager, llamaindex_retriever)

# Display dataset overview
qa_df = pd.DataFrame(GOLDEN_QA_DATASET)
print("\n📊 Evaluation Setup Complete:")
print("=" * 40)
print(f"   📋 Total questions: {len(GOLDEN_QA_DATASET)}")
print(f"   📄 Policy documents: {len(POLICY_DOCUMENTS)}")
print(f"   🧠 Memory entries: {len(memory_manager.memories)}")
print(f"   🔍 LlamaIndex retriever: ✅ Ready")
print(f"   📊 Categories: {qa_df['category'].nunique()}")
print(f"   🎯 Difficulty levels: {qa_df['difficulty'].nunique()}")

print(f"\n📋 Category Distribution:")
category_counts = qa_df['category'].value_counts()
for category, count in category_counts.items():
    print(f"   {category}: {count} questions")

# Test the integrated system
print(f"\n🧪 Testing Integrated System:")
test_question = "What is the deductible for collision coverage on auto insurance?"
test_response = evaluation_assistant.answer_question_with_memory_and_rag(test_question)

print(f"   Question: {test_response['question']}")
print(f"   Answer: {test_response['answer']}")
print(f"   Memory sources: {test_response['memory_count']} entries")
print(f"   RAG sources: {test_response['rag_count']} documents")
print(f"   Context length: {test_response['context_length']} chars")
print(f"   Tokens used: {test_response['tokens_used']}")

📊 Creating Insurance Claims Evaluation Data...
🔍 Setting up LlamaIndex RAG...
   🔍 Building LlamaIndex vector store...
   ✅ LlamaIndex RAG initialized with 6 documents
🧠 Initializing Insurance Memory Manager...

📊 Evaluation Setup Complete:
   📋 Total questions: 10
   📄 Policy documents: 6
   🧠 Memory entries: 6
   🔍 LlamaIndex retriever: ✅ Ready
   📊 Categories: 6
   🎯 Difficulty levels: 2

📋 Category Distribution:
   auto_insurance: 2 questions
   home_insurance: 2 questions
   health_insurance: 2 questions
   health_costs: 2 questions
   claims_processing: 1 questions
   deductibles: 1 questions

🧪 Testing Integrated System:
   Question: What is the deductible for collision coverage on auto insurance?
   Answer: The deductible for collision coverage on auto insurance is $500 [Memory Context].
   Memory sources: 2 entries
   RAG sources: 3 documents
   Context length: 1502 chars
   Tokens used: 420
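
Before running an evaluation, it can help to confirm that every golden question points at a policy document that actually exists, since a mistyped ID would silently drive retrieval recall to zero. A small sketch of such a check, using the same `question_id`, `relevant_doc_ids`, and `doc_id` keys as the dataset above; the function name `find_unknown_doc_refs` and the deliberately broken `QX` row are hypothetical:

```python
from typing import Dict, List


def find_unknown_doc_refs(qa_items: List[Dict], docs: List[Dict]) -> Dict[str, List[str]]:
    """Map question_id -> referenced doc IDs that are absent from the document set."""
    known_ids = {doc["doc_id"] for doc in docs}
    problems: Dict[str, List[str]] = {}
    for item in qa_items:
        unknown = [doc_id for doc_id in item["relevant_doc_ids"] if doc_id not in known_ids]
        if unknown:
            problems[item["question_id"]] = unknown
    return problems


# Trimmed-down stand-ins for POLICY_DOCUMENTS and GOLDEN_QA_DATASET.
docs = [{"doc_id": "AUTO_001"}, {"doc_id": "HOME_001"}]
qa_items = [
    {"question_id": "Q001", "relevant_doc_ids": ["AUTO_001"]},
    {"question_id": "Q003", "relevant_doc_ids": ["HOME_001"]},
    {"question_id": "QX", "relevant_doc_ids": ["HOME_009"]},  # hypothetical bad reference
]

print(find_unknown_doc_refs(qa_items, docs))  # {'QX': ['HOME_009']}
```

In the notebook, the same call would be `find_unknown_doc_refs(GOLDEN_QA_DATASET, POLICY_DOCUMENTS)`, and an empty dictionary means all references resolve.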


## 🎯 Top 3 Agentic RAG Evaluation Metrics

Implement the industry-standard evaluation metrics for agentic RAG systems in insurance claims processing.
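
The factual-accuracy metric in the next cell asks the judge model to return a JSON object and parses the reply directly. Models sometimes wrap JSON in a code fence or add a sentence around it, in which case a direct `json.loads` fails and the score falls back to 0. The helper below is one possible way to tolerate that; `parse_judge_json` is a hypothetical name and the sample reply is invented for illustration, not captured from a real run.

```python
import json
import re
from typing import Any, Dict


def parse_judge_json(raw: str) -> Dict[str, Any]:
    """Extract the outermost {...} object from a judge reply and clamp its score."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    parsed = json.loads(match.group(0))
    # Keep the score inside the 0-100 range that the rubric defines.
    parsed["accuracy_score"] = max(0.0, min(100.0, float(parsed["accuracy_score"])))
    return parsed


# Invented example of a reply wrapped in a Markdown code fence.
fence = "`" * 3
reply = (
    f"{fence}json\n"
    '{"accuracy_score": 95, "reasoning": "Matches the $500 deductible.", '
    '"key_facts_missing": []}\n'
    f"{fence}"
)

result = parse_judge_json(reply)
print(result["accuracy_score"])  # 95.0
```

The regular expression is greedy on purpose, so it spans from the first `{` to the last `}` and keeps nested objects intact. It assumes the reply contains a single top-level object.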

In [4]:
class AgenticRAGEvaluationMetrics:
    """Top 3 evaluation metrics for agentic RAG systems"""
    
    def __init__(self):
        self.client = client
    
    def evaluate_factual_accuracy(self, agent_answer: str, correct_answer: str, question: str) -> Dict[str, Any]:
        """Metric 1: Factual Accuracy (40% weight) - LLM-based correctness scoring"""
        
        prompt = f"""
Evaluate the factual accuracy of the agent's answer compared to the correct answer.

Question: {question}
Correct Answer: {correct_answer}
Agent Answer: {agent_answer}

Score the factual accuracy on a scale of 0-100:
- 90-100: All key facts correct, comprehensive
- 70-89: Most facts correct, minor missing details
- 50-69: Some correct facts, significant gaps
- 30-49: Few correct facts, mostly incorrect
- 0-29: Incorrect or completely missing information

Return only a JSON object:
{{"accuracy_score": <number>, "reasoning": "<explanation>", "key_facts_missing": ["<fact1>", "<fact2>"]}}
"""
        
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.1
            )
            
            result = json.loads(response.choices[0].message.content)
            return {
                "accuracy_score": result["accuracy_score"],
                "accuracy_reasoning": result["reasoning"],
                "key_facts_missing": result.get("key_facts_missing", [])
            }
            
        except Exception as e:
            return {
                "accuracy_score": 0,
                "accuracy_reasoning": f"Evaluation error: {str(e)}",
                "key_facts_missing": []
            }
    
    def evaluate_citation_compliance(self, answer: str, should_have_citation: bool, sources_used: List[str]) -> Dict[str, Any]:
        """Metric 2: Citation/Source Compliance (30% weight) - Source attribution quality"""
        
        # Check for citation patterns
        citation_patterns = [
            r'\[.*?\]',
            r'according to',
            r'source:',
            r'reference:',
            r'policy states',
            r'document shows',
            r'as stated in'
        ]
        
        citations_found = []
        for pattern in citation_patterns:
            matches = re.findall(pattern, answer, re.IGNORECASE)
            citations_found.extend(matches)
        
        has_citations = len(citations_found) > 0
        
        # Calculate compliance score
        if should_have_citation and has_citations:
            compliance_score = 100
            compliance_status = "Correct: Citations present when required"
        elif not should_have_citation and not has_citations:
            compliance_score = 100
            compliance_status = "Correct: No citations when not required"
        elif should_have_citation and not has_citations:
            compliance_score = 0
            compliance_status = "Missing: Citations required but not provided"
        else:
            compliance_score = 80
            compliance_status = "Acceptable: Citations provided when not strictly required"
        
        # Bonus points for citing correct sources
        source_accuracy_bonus = 0
        sources_mentioned = []
        if has_citations and sources_used:
            answer_lower = answer.lower()
            # Keep the sources that the answer text actually names
            sources_mentioned = [source for source in sources_used if source.lower() in answer_lower]
            if sources_mentioned:
                source_accuracy_bonus = min(20, len(sources_mentioned) * 10)
        
        final_score = min(100, compliance_score + source_accuracy_bonus)
        
        return {
            "citation_compliance_score": final_score,
            "citations_found": citations_found,
            "citation_expected": should_have_citation,
            "citation_present": has_citations,
            "compliance_status": compliance_status,
            "source_accuracy_bonus": source_accuracy_bonus,
            "sources_mentioned": sources_mentioned
        }
    
    def evaluate_retrieval_relevance(self, question: str, retrieved_nodes: List, expected_doc_ids: List[str]) -> Dict[str, Any]:
        """Metric 3: Retrieval Relevance (30% weight) - Quality of LlamaIndex document retrieval"""
        
        if not retrieved_nodes:
            return {
                "retrieval_relevance_score": 0,
                "retrieved_doc_ids": [],
                "expected_doc_ids": expected_doc_ids,
                "precision": 0.0,
                "recall": 0.0,
                "relevance_reasoning": "No documents retrieved"
            }
        
        # Extract retrieved document IDs
        retrieved_doc_ids = []
        for node in retrieved_nodes:
            doc_id = node.metadata.get('doc_id', node.metadata.get('source', 'unknown'))
            retrieved_doc_ids.append(doc_id)
        
        # Calculate precision and recall
        if expected_doc_ids:
            expected_set = set(expected_doc_ids)
            retrieved_set = set(retrieved_doc_ids)
            
            precision = len(expected_set.intersection(retrieved_set)) / len(retrieved_set) if retrieved_set else 0
            recall = len(expected_set.intersection(retrieved_set)) / len(expected_set) if expected_set else 0
            f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
            relevance_score = f1_score * 100
            
            reasoning = f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1_score:.2f}"
        else:
            precision = 0.0
            recall = 1.0 if not retrieved_doc_ids else 0.0
            relevance_score = 100 if not retrieved_doc_ids else 50
            reasoning = "No expected documents for this question"
        
        return {
            "retrieval_relevance_score": relevance_score,
            "retrieved_doc_ids": retrieved_doc_ids,
            "expected_doc_ids": expected_doc_ids,
            "precision": precision,
            "recall": recall,
            "relevance_reasoning": reasoning
        }
    
    def evaluate_complete_response(self, agent_response: Dict[str, Any], gold_item: Dict[str, Any]) -> Dict[str, Any]:
        """Complete evaluation using all three metrics"""
        
        # Metric 1: Factual Accuracy (40% weight)
        accuracy_eval = self.evaluate_factual_accuracy(
            agent_response["answer"],
            gold_item["correct_answer"],
            gold_item["question"]
        )
        
        # Metric 2: Citation Compliance (30% weight)
        all_sources = agent_response["memory_sources"] + agent_response["rag_sources"]
        citation_eval = self.evaluate_citation_compliance(
            agent_response["answer"],
            gold_item["should_have_citation"],
            all_sources
        )
        
        # Metric 3: Retrieval Relevance (30% weight)
        retrieval_eval = self.evaluate_retrieval_relevance(
            gold_item["question"],
            agent_response["retrieved_nodes"],
            gold_item["relevant_doc_ids"]
        )
        
        # Calculate weighted composite score
        composite_score = (
            accuracy_eval["accuracy_score"] * 0.40 +
            citation_eval["citation_compliance_score"] * 0.30 +
            retrieval_eval["retrieval_relevance_score"] * 0.30
        )
        
        return {
            "question_id": gold_item["question_id"],
            "question": gold_item["question"],
            "category": gold_item["category"],
            "difficulty": gold_item["difficulty"],
            "agent_answer": agent_response["answer"],
            "correct_answer": gold_item["correct_answer"],
            "factual_accuracy_score": accuracy_eval["accuracy_score"],
            "accuracy_reasoning": accuracy_eval["accuracy_reasoning"],
            "key_facts_missing": accuracy_eval["key_facts_missing"],
            "citation_compliance_score": citation_eval["citation_compliance_score"],
            "citations_found": citation_eval["citations_found"],
            "compliance_status": citation_eval["compliance_status"],
            "retrieval_relevance_score": retrieval_eval["retrieval_relevance_score"],
            "retrieval_precision": retrieval_eval["precision"],
            "retrieval_recall": retrieval_eval["recall"],
            "retrieved_doc_ids": retrieval_eval["retrieved_doc_ids"],
            "expected_doc_ids": retrieval_eval["expected_doc_ids"],
            "composite_score": composite_score,
            "tokens_used": agent_response["tokens_used"],
            "memory_sources_used": agent_response["memory_count"],
            "rag_sources_used": agent_response["rag_count"]
        }

print("🎯 Agentic RAG Evaluation Metrics Initialized:")
print("   📊 Metric 1: Factual Accuracy (40% weight) - LLM-based scoring")
print("   📝 Metric 2: Citation Compliance (30% weight) - Source attribution quality")
print("   🔍 Metric 3: Retrieval Relevance (30% weight) - LlamaIndex retrieval quality")
print("   ⚖️ Composite scoring with industry-standard weightings")
print("   📈 Precision/Recall metrics for retrieval evaluation")
print("   🏥 Optimized for insurance claims compliance requirements")

🎯 Agentic RAG Evaluation Metrics Initialized:
   📊 Metric 1: Factual Accuracy (40% weight) - LLM-based scoring
   📝 Metric 2: Citation Compliance (30% weight) - Source attribution quality
   🔍 Metric 3: Retrieval Relevance (30% weight) - LlamaIndex retrieval quality
   ⚖️ Composite scoring with industry-standard weightings
   📈 Precision/Recall metrics for retrieval evaluation
   🏥 Optimized for insurance claims compliance requirements
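
The retrieval relevance score is the F1 of retrieved versus expected document IDs, scaled to 0-100. One consequence is worth seeing in numbers: with `similarity_top_k=3` and a single expected document per question, precision cannot exceed 1/3, so F1 tops out at 0.5 even when the right document is found. The sketch below reproduces that arithmetic; `retrieval_prf` is an illustrative name, and the two extra retrieved IDs are assumed for the example rather than taken from a recorded run.

```python
from typing import List, Tuple


def retrieval_prf(retrieved: List[str], expected: List[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 over document IDs."""
    retrieved_set, expected_set = set(retrieved), set(expected)
    hits = len(retrieved_set & expected_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Three retrieved documents (top_k=3), one of which is the expected one.
p3, r3, f3 = retrieval_prf(["AUTO_001", "AUTO_002", "HOME_002"], ["AUTO_001"])
print(f"{p3:.2f} {r3:.2f} {f3:.2f}")  # 0.33 1.00 0.50

# Retrieving only the expected document gives a perfect score.
p1, r1, f1 = retrieval_prf(["AUTO_001"], ["AUTO_001"])
print(f"{p1:.2f} {r1:.2f} {f1:.2f}")  # 1.00 1.00 1.00
```

A retrieval score of 50 under these settings therefore signals that the expected document was retrieved alongside two others, not that retrieval missed. Reporting recall separately, or matching `similarity_top_k` to the expected number of relevant documents, makes the metric easier to interpret.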


## 🧪 Run Comprehensive Evaluation

Evaluate the insurance claims assistant using our golden dataset and top 3 agentic RAG metrics.

In [5]:
# Initialize the evaluator
evaluator = AgenticRAGEvaluationMetrics()

# Run evaluation on the dataset
print("🧪 Running Agentic RAG Evaluation Suite...")
print("=" * 50)

evaluation_results = []

for i, gold_item in enumerate(GOLDEN_QA_DATASET, 1):
    print(f"   [{i}/{len(GOLDEN_QA_DATASET)}] Evaluating: {gold_item['question_id']} ({gold_item['category']})")
    
    # Get agent response using memory + RAG
    agent_response = evaluation_assistant.answer_question_with_memory_and_rag(gold_item["question"])
    
    # Evaluate response using all three metrics
    eval_result = evaluator.evaluate_complete_response(agent_response, gold_item)
    evaluation_results.append(eval_result)
    
    # Show brief progress
    print(f"      Accuracy: {eval_result['factual_accuracy_score']:.0f} | "
          f"Citation: {eval_result['citation_compliance_score']:.0f} | "
          f"Retrieval: {eval_result['retrieval_relevance_score']:.0f} | "
          f"Composite: {eval_result['composite_score']:.0f}")

print("\n✅ Evaluation Complete!")

# Convert to DataFrame for analysis
results_df = pd.DataFrame(evaluation_results)

# Display summary results
print("\n📊 Evaluation Results Summary:")
summary_cols = [
    'question_id', 'category', 'difficulty',
    'factual_accuracy_score', 'citation_compliance_score', 
    'retrieval_relevance_score', 'composite_score'
]
print(results_df[summary_cols].to_string(index=False))

🧪 Running Agentic RAG Evaluation Suite...
   [1/10] Evaluating: Q001 (auto_insurance)
      Accuracy: 100 | Citation: 100 | Retrieval: 50 | Composite: 85
   [2/10] Evaluating: Q002 (claims_processing)
      Accuracy: 100 | Citation: 100 | Retrieval: 50 | Composite: 85
   [3/10] Evaluating: Q003 (home_insurance)
      Accuracy: 100 | Citation: 100 | Retrieval: 50 | Composite: 85
   [4/10] Evaluating: Q004 (deductibles)
      Accuracy: 100 | Citation: 100 | Retrieval: 50 | Composite: 85
   [5/10] Evaluating: Q005 (health_insurance)
      Accuracy: 100 | Citation: 100 | Retrieval: 50 | Composite: 85
   [6/10] Evaluating: Q006 (health_costs)
      Accuracy: 100 | Citation: 100 | Retrieval: 50 | Composite: 85
   [7/10] Evaluating: Q007 (auto_insurance)
      Accuracy: 100 | Citation: 100 | Retrieval: 50 | Composite: 85
   [8/10] Evaluating: Q008 (health_insurance)
      Accuracy: 100 | Citation: 100 | Retrieval: 50 | Composite: 85
   [9/10] Evaluating: Q009 (home_insurance)
      Accurac
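
The composite value on each progress line follows from the stated 40/30/30 weighting. A minimal sketch of that arithmetic is below. The name `composite_score` is illustrative, and the evaluator's own implementation (defined earlier in the notebook) may differ in details such as rounding.

```python
# Illustrative helper, not the evaluator's actual code.
# Weights are the 40/30/30 split stated for the three metrics, in percent.
WEIGHTS = {"accuracy": 40, "citation": 30, "retrieval": 30}

def composite_score(accuracy: float, citation: float, retrieval: float) -> float:
    """Weighted average of three 0-100 metric scores."""
    total = (WEIGHTS["accuracy"] * accuracy
             + WEIGHTS["citation"] * citation
             + WEIGHTS["retrieval"] * retrieval)
    return total / 100

# Scores taken from the progress lines above: 100, 100, 50.
print(composite_score(100, 100, 50))  # 85.0
```

With accuracy and citation already at 100, the retrieval score of 50 alone holds the composite at 85.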

## 📈 Performance Analysis & Insights

Analyze performance across the top 3 agentic RAG metrics and identify improvement opportunities.

In [6]:
# Calculate comprehensive performance metrics
performance_metrics = {
    "avg_factual_accuracy": results_df['factual_accuracy_score'].mean(),
    "avg_citation_compliance": results_df['citation_compliance_score'].mean(),
    "avg_retrieval_relevance": results_df['retrieval_relevance_score'].mean(),
    "avg_composite_score": results_df['composite_score'].mean(),
    "avg_retrieval_precision": results_df['retrieval_precision'].mean(),
    "avg_retrieval_recall": results_df['retrieval_recall'].mean(),
    "avg_tokens_per_question": results_df['tokens_used'].mean(),
    "total_tokens_used": results_df['tokens_used'].sum(),
    "avg_memory_sources": results_df['memory_sources_used'].mean(),
    "avg_rag_sources": results_df['rag_sources_used'].mean(),
    "questions_evaluated": len(results_df),
    "categories_covered": results_df['category'].nunique(),
    "perfect_scores": len(results_df[results_df['composite_score'] >= 95]),
    "needs_improvement": len(results_df[results_df['composite_score'] < 70])
}

print("📈 Comprehensive Performance Analysis:")
print("=" * 50)
print(f"\n🎯 Core Metric Performance:")
print(f"   Composite Score:           {performance_metrics['avg_composite_score']:.1f}/100")
print(f"   Factual Accuracy (40%):    {performance_metrics['avg_factual_accuracy']:.1f}/100")
print(f"   Citation Compliance (30%): {performance_metrics['avg_citation_compliance']:.1f}/100")
print(f"   Retrieval Relevance (30%):  {performance_metrics['avg_retrieval_relevance']:.1f}/100")

print(f"\n🔍 Retrieval Analytics:")
print(f"   Average Precision:         {performance_metrics['avg_retrieval_precision']:.3f}")
print(f"   Average Recall:            {performance_metrics['avg_retrieval_recall']:.3f}")
print(f"   Memory sources per Q:      {performance_metrics['avg_memory_sources']:.1f}")
print(f"   RAG sources per Q:         {performance_metrics['avg_rag_sources']:.1f}")

print(f"\n💰 Efficiency Metrics:")
print(f"   Tokens per question:       {performance_metrics['avg_tokens_per_question']:.0f}")
print(f"   Total tokens used:         {performance_metrics['total_tokens_used']:,}")

print(f"\n📊 Coverage & Quality:")
print(f"   Questions evaluated:       {performance_metrics['questions_evaluated']}")
print(f"   Categories covered:        {performance_metrics['categories_covered']}")
print(f"   Perfect scores (≥95):      {performance_metrics['perfect_scores']}")
print(f"   Needs improvement (<70):   {performance_metrics['needs_improvement']}")

# Performance by category
print(f"\n📋 Performance by Category:")
category_performance = results_df.groupby('category')[['factual_accuracy_score', 'citation_compliance_score', 'retrieval_relevance_score', 'composite_score']].mean()
for category in category_performance.index:
    scores = category_performance.loc[category]
    print(f"   {category:18s}: Composite {scores['composite_score']:.0f} | "
          f"Accuracy {scores['factual_accuracy_score']:.0f} | "
          f"Citation {scores['citation_compliance_score']:.0f} | "
          f"Retrieval {scores['retrieval_relevance_score']:.0f}")

# Performance by difficulty
print(f"\n🎯 Performance by Difficulty:")
difficulty_performance = results_df.groupby('difficulty')[['composite_score', 'factual_accuracy_score']].mean()
for difficulty in ['easy', 'medium', 'hard']:
    if difficulty in difficulty_performance.index:
        scores = difficulty_performance.loc[difficulty]
        print(f"   {difficulty:6s}: Composite {scores['composite_score']:.0f} | Accuracy {scores['factual_accuracy_score']:.0f}")

# Identify problem areas
print(f"\n🔍 Problem Areas Analysis:")
low_accuracy = results_df[results_df['factual_accuracy_score'] < 70]
poor_citations = results_df[results_df['citation_compliance_score'] < 70]
poor_retrieval = results_df[results_df['retrieval_relevance_score'] < 70]

if len(low_accuracy) > 0:
    print(f"   ⚠️ Low accuracy questions ({len(low_accuracy)}): {', '.join(low_accuracy['question_id'].tolist())}")
if len(poor_citations) > 0:
    print(f"   📝 Poor citation compliance ({len(poor_citations)}): {', '.join(poor_citations['question_id'].tolist())}")
if len(poor_retrieval) > 0:
    print(f"   🔍 Poor retrieval relevance ({len(poor_retrieval)}): {', '.join(poor_retrieval['question_id'].tolist())}")

if len(low_accuracy) == 0 and len(poor_citations) == 0 and len(poor_retrieval) == 0:
    print(f"   ✅ No major issues identified across all three metrics!")

# Best and worst performing questions
print(f"\n🏆 Best Performing Question:")
best_q = results_df.loc[results_df['composite_score'].idxmax()]
print(f"   {best_q['question_id']}: {best_q['question'][:60]}...")
print(f"   Composite Score: {best_q['composite_score']:.0f} (A:{best_q['factual_accuracy_score']:.0f}, C:{best_q['citation_compliance_score']:.0f}, R:{best_q['retrieval_relevance_score']:.0f})")

print(f"\n⚠️ Needs Most Improvement:")
worst_q = results_df.loc[results_df['composite_score'].idxmin()]
print(f"   {worst_q['question_id']}: {worst_q['question'][:60]}...")
print(f"   Composite Score: {worst_q['composite_score']:.0f} (A:{worst_q['factual_accuracy_score']:.0f}, C:{worst_q['citation_compliance_score']:.0f}, R:{worst_q['retrieval_relevance_score']:.0f})")

📈 Comprehensive Performance Analysis:

🎯 Core Metric Performance:
   Composite Score:           84.6/100
   Factual Accuracy (40%):    99.0/100
   Citation Compliance (30%): 100.0/100
   Retrieval Relevance (30%):  50.0/100

🔍 Retrieval Analytics:
   Average Precision:         0.333
   Average Recall:            1.000
   Memory sources per Q:      1.9
   RAG sources per Q:         3.0

💰 Efficiency Metrics:
   Tokens per question:       417
   Total tokens used:         4,174

📊 Coverage & Quality:
   Questions evaluated:       10
   Categories covered:        6
   Perfect scores (≥95):      0
   Needs improvement (<70):   0

📋 Performance by Category:
   auto_insurance    : Composite 85 | Accuracy 100 | Citation 100 | Retrieval 50
   claims_processing : Composite 85 | Accuracy 100 | Citation 100 | Retrieval 50
   deductibles       : Composite 85 | Accuracy 100 | Citation 100 | Retrieval 50
   health_costs      : Composite 83 | Accuracy 95 | Citation 100 | Retrieval
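
The retrieval figures above (precision 0.333, recall 1.000, 3.0 RAG sources per question) are what set-based precision and recall produce when three documents are retrieved and exactly one of them is the expected document. The sketch below uses the standard definitions. The document IDs are hypothetical, and the evaluator's own formula is not shown in this section. The F1 of these two values is 0.5, which would be consistent with the reported Retrieval Relevance of 50, though that link is an inference rather than something the notebook states.

```python
from typing import Iterable, Tuple

def precision_recall_f1(retrieved: Iterable[str],
                        expected: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 over document IDs."""
    retrieved_set, expected_set = set(retrieved), set(expected)
    hits = len(retrieved_set & expected_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical IDs: three documents retrieved, one of them expected.
p, r, f1 = precision_recall_f1(["doc_a", "doc_b", "doc_c"], ["doc_a"])
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Read this way, the retriever is finding the right document every time but returning two extra ones alongside it, which is why precision, not recall, is the number to work on.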

## 💡 Actionable Improvement Recommendations

Generate specific recommendations based on the top 3 agentic RAG metrics performance.

In [7]:
def generate_agentic_rag_recommendations(metrics: Dict[str, float], results_df: pd.DataFrame) -> List[str]:
    """Generate improvement recommendations based on agentic RAG evaluation results"""
    
    recommendations = []
    
    # Factual Accuracy Recommendations (Metric 1 - 40% weight)
    if metrics["avg_factual_accuracy"] < 80:
        recommendations.append(
            "🎯 **Improve Factual Accuracy (Critical - 40% weight)**: "
            "Consider expanding the memory knowledge base, improving document quality, "
            "or fine-tuning the LLM with insurance-specific training data."
        )
    
    # Citation Compliance Recommendations (Metric 2 - 30% weight)
    if metrics["avg_citation_compliance"] < 80:
        recommendations.append(
            "📝 **Enhance Citation Compliance (Important - 30% weight)**: "
            "Modify prompts to consistently include source references. "
            "Critical for regulatory compliance in insurance claims."
        )
    
    # Retrieval Relevance Recommendations (Metric 3 - 30% weight)
    if metrics["avg_retrieval_relevance"] < 70:
        recommendations.append(
            "🔍 **Optimize Retrieval Relevance (Important - 30% weight)**: "
            "Improve LlamaIndex embeddings, document chunking strategy, "
            "or implement hybrid retrieval with keyword + semantic search."
        )
    
    # Precision/Recall specific recommendations
    if metrics["avg_retrieval_precision"] < 0.7:
        recommendations.append(
            "🎯 **Improve Retrieval Precision**: "
            "Too many irrelevant documents retrieved. Consider increasing similarity thresholds "
            "or improving document metadata and keywords."
        )
    
    if metrics["avg_retrieval_recall"] < 0.7:
        recommendations.append(
            "📚 **Improve Retrieval Recall**: "
            "Missing relevant documents. Consider lowering similarity thresholds, "
            "expanding document coverage, or using query expansion techniques."
        )
    
    # Memory integration recommendations
    if metrics["avg_memory_sources"] < 1.0:
        recommendations.append(
            "🧠 **Enhance Memory Integration**: "
            "Memory system underutilized. Improve memory retrieval algorithms "
            "or expand the persistent knowledge base."
        )
    
    # Token efficiency recommendations
    if metrics["avg_tokens_per_question"] > 1000:
        recommendations.append(
            "💰 **Optimize Token Efficiency**: "
            "High token usage detected. Consider shorter prompts, "
            "better context filtering, or smaller model variants for cost optimization."
        )
    
    # Category-specific recommendations
    category_performance = results_df.groupby('category')['composite_score'].mean()
    worst_categories = category_performance[category_performance < 70].index.tolist()
    if worst_categories:
        recommendations.append(
            f"📊 **Address Category Weaknesses**: "
            f"Poor performance in {', '.join(worst_categories)}. "
            f"Consider domain-specific training data or specialized retrieval strategies."
        )
    
    # Success case recommendations
    if metrics["avg_composite_score"] >= 85:
        recommendations.append(
            "🎉 **Strong Performance Detected**: "
            "Agent performs well across all three metrics. Consider testing on "
            "more complex scenarios, expanding to additional insurance domains, "
            "or implementing A/B testing for production deployment."
        )
    
    # Memory + RAG integration recommendations
    if metrics["avg_rag_sources"] > 2.5 and metrics["avg_memory_sources"] < 1.0:
        recommendations.append(
            "⚖️ **Balance Memory and RAG Sources**: "
            "Over-reliance on RAG retrieval vs. persistent memory. "
            "Consider adjusting the memory retrieval scoring or expanding memory coverage."
        )
    
    return recommendations

# Generate comprehensive recommendations
recommendations = generate_agentic_rag_recommendations(performance_metrics, results_df)

print("💡 Agentic RAG Improvement Recommendations:")
print("=" * 60)
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec}\n")

# Detailed example analysis
print("🔍 Detailed Example Analysis:")
print("=" * 30)
sample_result = results_df.iloc[0]
print(f"Question: {sample_result['question']}")
print(f"Category: {sample_result['category']} | Difficulty: {sample_result['difficulty']}")
print(f"\nAgent Answer: {sample_result['agent_answer']}")
print(f"Expected Answer: {sample_result['correct_answer']}")

print(f"\n📊 Metric Breakdown:")
print(f"   Factual Accuracy: {sample_result['factual_accuracy_score']:.0f}/100")
print(f"   Reasoning: {sample_result['accuracy_reasoning']}")
print(f"   \nCitation Compliance: {sample_result['citation_compliance_score']:.0f}/100")
print(f"   Status: {sample_result['compliance_status']}")
print(f"   Citations Found: {sample_result['citations_found']}")
print(f"   \nRetrieval Relevance: {sample_result['retrieval_relevance_score']:.0f}/100")
print(f"   Precision: {sample_result['retrieval_precision']:.3f} | Recall: {sample_result['retrieval_recall']:.3f}")
print(f"   Retrieved: {sample_result['retrieved_doc_ids']}")
print(f"   Expected: {sample_result['expected_doc_ids']}")

print(f"\n🎯 Overall Performance:")
print(f"   Composite Score: {sample_result['composite_score']:.1f}/100")
print(f"   Tokens Used: {sample_result['tokens_used']}")
print(f"   Memory Sources: {sample_result['memory_sources_used']} | RAG Sources: {sample_result['rag_sources_used']}")

💡 Agentic RAG Improvement Recommendations:
1. 🔍 **Optimize Retrieval Relevance (Important - 30% weight)**: Improve LlamaIndex embeddings, document chunking strategy, or implement hybrid retrieval with keyword + semantic search.

2. 🎯 **Improve Retrieval Precision**: Too many irrelevant documents retrieved. Consider increasing similarity thresholds or improving document metadata and keywords.

🔍 Detailed Example Analysis:
Question: What is the deductible for collision coverage on auto insurance?
Category: auto_insurance | Difficulty: easy

Agent Answer: The deductible for collision coverage on auto insurance is $500 [Memory Context].
Expected Answer: $500 deductible for collision coverage

📊 Metric Breakdown:
   Factual Accuracy: 100/100
   Reasoning: The agent's answer accurately states the deductible for collision coverage on auto insurance as $500, which matches the correct answer provided.
   
Citation Compliance: 100/100
   Status: Correct: Citations present when req
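
The precision recommendation above suggests raising the similarity threshold. The sketch below shows the idea in plain Python on hypothetical `(doc_id, score)` pairs. The scores and the 0.75 cutoff are invented for illustration and are not taken from this notebook's retriever. In LlamaIndex, a comparable filter is available as `SimilarityPostprocessor(similarity_cutoff=...)`.

```python
from typing import List, Tuple

def filter_by_similarity(results: List[Tuple[str, float]], cutoff: float) -> List[str]:
    """Keep only document IDs whose similarity score meets the cutoff."""
    return [doc_id for doc_id, score in results if score >= cutoff]

# Hypothetical retrieval results for one question: (doc_id, similarity score).
retrieved = [("doc_a", 0.86), ("doc_b", 0.71), ("doc_c", 0.64)]
expected = {"doc_a"}

# Compare no cutoff against an illustrative cutoff of 0.75.
for cutoff in (0.0, 0.75):
    kept = filter_by_similarity(retrieved, cutoff)
    hits = len(set(kept) & expected)
    precision = hits / len(kept) if kept else 0.0
    recall = hits / len(expected)
    print(f"cutoff={cutoff:.2f} kept={kept} precision={precision:.3f} recall={recall:.3f}")
```

A cutoff set too high starts dropping expected documents, so recall should be checked alongside precision on the full golden dataset before settling on a value.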

## 🎯 Summary & Key Learnings

### ✅ What We Demonstrated

- ✅ Integrated a memory-enabled insurance claims assistant with LlamaIndex RAG
- ✅ Implemented three agentic RAG evaluation metrics and a weighted composite score
- ✅ Evaluated 10 insurance claims questions across 6 categories, with per-category and per-difficulty analytics
- ✅ Generated actionable improvement recommendations

### 📊 Top 3 Agentic RAG Metrics

1. **Factual Accuracy** (40% weight) - LLM-based correctness scoring
2. **Citation/Source Compliance** (30% weight) - Critical for insurance regulatory compliance
3. **Retrieval Relevance** (30% weight) - LlamaIndex document retrieval quality with precision/recall

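### 🔑 Key Insights

- **Memory + RAG Integration**: The agent drew on both sources, averaging 1.9 memory sources and 3.0 RAG sources per question
- **Compliance Requirements**: Citation compliance is critical for insurance applications; the agent averaged 100/100 on this metric
- **Retrieval Quality**: Retrieval was the weakest metric (50.0/100). Recall was 1.000 but precision only 0.333, so the expected documents were retrieved alongside irrelevant ones
- **Reusable Framework**: The same evaluator can be rerun to A/B test agent configurations and track continuous improvement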

### 🚀 Next Steps for Production

1. **Expand Dataset**: Scale to 500+ questions across more insurance domains
2. **Optimize Retrieval**: Implement hybrid keyword + semantic search
3. **Memory Enhancement**: Expand persistent knowledge base with claims history
4. **A/B Testing**: Compare different agent configurations
5. **Real-time Monitoring**: Deploy evaluation metrics in production
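
For the A/B testing step, one simple approach is a paired comparison: run two agent configurations on the same golden questions and compare composite scores question by question. The sketch below assumes each run has been reduced to a `question_id -> composite_score` mapping, for example with `dict(zip(results_df['question_id'], results_df['composite_score']))`. The function name `compare_runs` and the scores shown are placeholders, not measured results.

```python
from statistics import mean
from typing import Dict

def compare_runs(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> dict:
    """Paired comparison of composite scores over the questions both runs share."""
    shared = sorted(set(scores_a) & set(scores_b))
    deltas = [scores_b[q] - scores_a[q] for q in shared]
    return {
        "questions_compared": len(shared),
        "mean_delta": mean(deltas) if deltas else 0.0,  # positive favors run B
        "b_wins": sum(d > 0 for d in deltas),
        "a_wins": sum(d < 0 for d in deltas),
    }

# Placeholder scores for illustration only.
run_a = {"Q001": 85.0, "Q002": 85.0, "Q003": 83.0}
run_b = {"Q001": 90.0, "Q002": 85.0, "Q003": 88.0}
summary = compare_runs(run_a, run_b)
print(summary)
```

Pairing by `question_id` keeps question difficulty from skewing the comparison. With only a handful of questions, a mean difference of a few points is weak evidence, which is one reason to expand the dataset first.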

### 💼 Production Considerations

For a production insurance claims system, consider:
- **Audit Trail**: Complete logging of all claim decisions with timestamps
- **Compliance**: Ensure citation compliance meets regulatory requirements
- **Performance**: Monitor token usage and optimize for cost efficiency
- **Accuracy**: Maintain high factual accuracy to prevent claim errors
- **Scalability**: Design for handling thousands of claims evaluations daily

This evaluation framework shows how an agentic AI system can be tested against weighted, repeatable metrics, giving a measurable basis for deciding whether it meets production quality requirements for financial services! 🎉