## Activity #1: Retriever Evaluation with RAGAS

This notebook compares six retrieval methods on a test set of questions generated with RAGAS synthetic data generation.

### Objectives:
1. Create a "golden dataset" using RAGAS Synthetic Data Generation
2. Evaluate 6 different retrievers on combined CSV + PDF data
3. Compare performance, cost, and latency
4. Provide recommendations

### Data Sources:
- **CSV Data**: Consumer complaint narratives
- **PDF Data**: Federal Student Aid handbooks

### Retrievers to Evaluate:
- Naive Retrieval (Embedding-based)
- BM25 Retriever
- Multi-Query Retriever
- Parent-Document Retriever
- Contextual Compression (Reranking)
- Ensemble Retriever

## Step 1: Setup and Dependencies

In [69]:
import os
import time
import pandas as pd
from datetime import datetime
import getpass

# Set up API keys
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")
os.environ["COHERE_API_KEY"] = getpass.getpass("Enter your Cohere API Key:")

# Optional: Set up LangSmith for advanced evaluation
try:
    os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your LangSmith API Key (optional, press Enter to skip):")
    if os.environ["LANGCHAIN_API_KEY"]:
        os.environ["LANGCHAIN_TRACING_V2"] = "true"
        os.environ["LANGCHAIN_PROJECT"] = "Retriever-Evaluation"
        print("✅ LangSmith tracing enabled")
    else:
        print("⚠️  LangSmith skipped")
except Exception:
    print("⚠️  LangSmith skipped")

print("✅ API keys configured")

✅ LangSmith tracing enabled
✅ API keys configured


## Step 2: Load Data

In [70]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

# Load CSV data
loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
        "Date received", "Product", "Sub-product", "Issue", "Sub-issue", 
        "Consumer complaint narrative", "Company", "State", "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

# Set page content to complaint narrative
for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

print(f"✅ Loaded {len(loan_complaint_data)} complaint documents from CSV")

# Load PDF data
path = "data/"
pdf_loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
pdf_docs = pdf_loader.load()

print(f"✅ Loaded {len(pdf_docs)} PDF documents")

# Combine all documents
all_docs = loan_complaint_data + pdf_docs
print(f"✅ Total documents: {len(all_docs)} (CSV: {len(loan_complaint_data)}, PDF: {len(pdf_docs)})")

print(f"\nSample complaint: {loan_complaint_data[0].page_content[:150]}...")
print(f"Sample PDF content: {pdf_docs[0].page_content[:150]}...")

✅ Loaded 825 complaint documents from CSV
✅ Loaded 269 PDF documents
✅ Total documents: 1094 (CSV: 825, PDF: 269)

Sample complaint: The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans current...
Sample PDF content: Volume 3
Academic Calendars, Cost of Attendance, and
Packaging
Introduction
This volume of the Federal Student Aid (FSA) Handbook discusses the academ...


## Step 3: Create Golden Dataset using RAGAS

In [71]:
# RAGAS setup
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Initialize models for RAGAS
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

print("✅ RAGAS models initialized")

✅ RAGAS models initialized


In [72]:
# Generate synthetic dataset using abstracted SDG
from ragas.testset import TestsetGenerator

print("Generating synthetic test dataset using RAGAS...")

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Use subset for cost efficiency
# Try PDF docs first as they tend to work better with RAGAS
testset_docs = pdf_docs[:15] + loan_complaint_data[:10]  # Mixed approach

golden_dataset = generator.generate_with_langchain_docs(
    testset_docs, 
    testset_size=8
)

print(f"✅ Generated {len(golden_dataset)} synthetic QA pairs")

# Convert to pandas for easier viewing
df = golden_dataset.to_pandas()
print(f"\nDataset columns: {list(df.columns)}")
print(f"Dataset shape: {df.shape}")

# Show sample questions
print("\nSample questions from the dataset:")
if 'question' in df.columns:
    question_col = 'question'
elif 'user_input' in df.columns:
    question_col = 'user_input'
elif len(df.columns) > 0:
    question_col = df.columns[0]  # Use first column as fallback
    print(f"Using column '{question_col}' as questions:")
else:
    print("No columns found in dataset!")
    question_col = None

if question_col:
    for i in range(min(3, len(df))):
        print(f"{i+1}. {df.iloc[i][question_col]}")

Generating synthetic test dataset using RAGAS...


Applying HeadlinesExtractor:   0%|          | 0/12 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/25 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/21 [00:00<?, ?it/s]

Property 'summary' already exists in node '14cdd6'. Skipping!
Property 'summary' already exists in node '11fcb0'. Skipping!
Property 'summary' already exists in node '86bf6b'. Skipping!
Property 'summary' already exists in node '61dc5f'. Skipping!
Property 'summary' already exists in node '69767f'. Skipping!
Property 'summary' already exists in node '9d7765'. Skipping!
Property 'summary' already exists in node '83f2b3'. Skipping!
Property 'summary' already exists in node 'c25e73'. Skipping!
Property 'summary' already exists in node '5b50f1'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/31 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '61dc5f'. Skipping!
Property 'summary_embedding' already exists in node '11fcb0'. Skipping!
Property 'summary_embedding' already exists in node '14cdd6'. Skipping!
Property 'summary_embedding' already exists in node '5b50f1'. Skipping!
Property 'summary_embedding' already exists in node 'c25e73'. Skipping!
Property 'summary_embedding' already exists in node '83f2b3'. Skipping!
Property 'summary_embedding' already exists in node '9d7765'. Skipping!
Property 'summary_embedding' already exists in node '86bf6b'. Skipping!
Property 'summary_embedding' already exists in node '69767f'. Skipping!


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/8 [00:00<?, ?it/s]

✅ Generated 8 synthetic QA pairs

Dataset columns: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name']
Dataset shape: (8, 4)

Sample questions from the dataset:
1. Wht is the significnce of the Academic Year in financial aid?
2. What 34 CFR 668.3(a) say about academic year minimums?
3. Can you explain what is the criteria for including clinical work in a education program standard term?
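
The column lookup in the cell above is repeated in Steps 3.5 and 6. A small helper could hold that logic in one place. This is a minimal sketch: `pick_column` is a hypothetical name, and its fallback (the first column) is a simplification of the notebook's answer-column fallback.

```python
from typing import Optional, Sequence


def pick_column(columns: Sequence[str], candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate present in `columns`, else the first column, else None."""
    for name in candidates:
        if name in columns:
            return name
    # Fall back to the first column, mirroring the notebook's question-column fallback.
    return columns[0] if len(columns) > 0 else None


# Column names taken from the dataset output above.
columns = ["user_input", "reference_contexts", "reference", "synthesizer_name"]
question_col = pick_column(columns, ["question", "user_input"])
answer_col = pick_column(columns, ["answer", "reference"])
print(question_col, answer_col)
```

With the columns RAGAS produced here, the helper selects `user_input` for questions and `reference` for answers.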


## Step 3.5: Create LangSmith Dataset (Optional)

In [73]:
# Create LangSmith dataset for advanced evaluation (if LangSmith is available)
try:
    from langsmith import Client
    
    if os.environ.get("LANGCHAIN_API_KEY"):
        print("Creating LangSmith dataset...")
        
        client = Client()
        dataset_name = f"Retriever-Evaluation-{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        
        # Create dataset
        langsmith_dataset = client.create_dataset(
            dataset_name=dataset_name,
            description="Synthetic data for retriever evaluation using RAGAS"
        )
        
        # Add examples to LangSmith dataset
        df = golden_dataset.to_pandas()
        
        # Find the correct column names
        if 'question' in df.columns:
            question_col = 'question'
        elif 'user_input' in df.columns:
            question_col = 'user_input'
        else:
            question_col = df.columns[0]
            
        if 'answer' in df.columns:
            answer_col = 'answer'
        elif 'reference' in df.columns:
            answer_col = 'reference'
        else:
            answer_col = df.columns[1] if len(df.columns) > 1 else question_col
        
        # Add examples
        for idx, row in df.iterrows():
            client.create_example(
                inputs={
                    "question": row[question_col]
                },
                outputs={
                    "answer": row[answer_col] if answer_col != question_col else "Generated answer"
                },
                metadata={
                    "source": "ragas_synthetic",
                    "retriever_evaluation": True
                },
                dataset_id=langsmith_dataset.id
            )
        
        print(f"✅ Created LangSmith dataset: {dataset_name}")
        print(f"📊 Added {len(df)} examples to dataset")
        
        # Store for later use
        LANGSMITH_DATASET_NAME = dataset_name
        USE_LANGSMITH = True
    else:
        print("⚠️  LangSmith API key not found, skipping dataset creation")
        USE_LANGSMITH = False
        LANGSMITH_DATASET_NAME = None
        
except ImportError:
    print("⚠️  LangSmith not available, install with: pip install langsmith")
    USE_LANGSMITH = False
    LANGSMITH_DATASET_NAME = None
except Exception as e:
    print(f"⚠️  LangSmith setup failed: {e}")
    USE_LANGSMITH = False
    LANGSMITH_DATASET_NAME = None

Creating LangSmith dataset...
✅ Created LangSmith dataset: Retriever-Evaluation-20250727_190006
📊 Added 8 examples to dataset


## Step 4: Set Up Retrievers

In [74]:
from langchain_community.vectorstores import Qdrant
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers import ParentDocumentRetriever, EnsembleRetriever
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# Initialize models
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
chat_model = ChatOpenAI(model="gpt-4o-mini")

print("Setting up retrievers...")

Setting up retrievers...


In [75]:
# 1. Naive Retriever
vectorstore = Qdrant.from_documents(
    all_docs,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)
naive_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
print("✅ 1. Naive retriever ready")

# 2. BM25 Retriever
bm25_retriever = BM25Retriever.from_documents(all_docs)
bm25_retriever.k = 5
print("✅ 2. BM25 retriever ready")

# 3. Multi-Query Retriever
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
print("✅ 3. Multi-query retriever ready")

# 4. Parent Document Retriever
parent_docs = all_docs
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

# Create new QdrantClient and collection
client = QdrantClient(location=":memory:")
client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", 
    embedding=embeddings, 
    client=client
)

store = InMemoryStore()
parent_document_retriever = ParentDocumentRetriever(
    vectorstore=parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

parent_document_retriever.add_documents(parent_docs, ids=None)
print("✅ 4. Parent document retriever ready")

# 5. Contextual Compression Retriever
compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
print("✅ 5. Contextual compression retriever ready")

# 6. Ensemble Retriever
retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)
print("✅ 6. Ensemble retriever ready")

print("\n✅ All retrievers initialized successfully!")

✅ 1. Naive retriever ready
✅ 2. BM25 retriever ready
✅ 3. Multi-query retriever ready
✅ 4. Parent document retriever ready
✅ 5. Contextual compression retriever ready
✅ 6. Ensemble retriever ready

✅ All retrievers initialized successfully!
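
The ensemble above merges five ranked result lists with equal weights. LangChain's `EnsembleRetriever` is documented as using Reciprocal Rank Fusion for this. The sketch below shows the idea on plain document IDs, assuming the commonly used constant `c = 60`. The function name and the IDs are hypothetical, and this is an illustration rather than the library's code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(rankings: Sequence[List[str]], weights: Sequence[float], c: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs; a higher fused score ranks first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (rank + c) for every document it returns.
            scores[doc_id] += weight / (rank + c)
    # Sort by score descending; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical document IDs from three retrievers.
bm25 = ["a", "b", "c"]
dense = ["b", "d", "a"]
rerank = ["b", "a"]
fused = weighted_rrf([bm25, dense, rerank], weights=[1 / 3] * 3)
print(fused)
```

The fused list is the union of all inputs, which is one reason an ensemble returns more documents per query than any single member.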


## Step 5: Evaluation Function

In [76]:
def evaluate_retriever_simple(retriever, retriever_name, questions):
    """
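    Run every question through the retriever and record operational metrics:
    success rate (queries that return without raising), average documents
    per query, and wall-clock latency. Relevance of the results is not measured.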
    """
    print(f"\nEvaluating {retriever_name}...")
    
    start_time = time.time()
    total_docs_retrieved = 0
    successful_retrievals = 0
    
    for i, question in enumerate(questions):
        try:
            # Retrieve documents
            docs = retriever.invoke(question)
            total_docs_retrieved += len(docs)
            successful_retrievals += 1
            
        except Exception as e:
            print(f"  Error on question {i+1}: {e}")
    
    end_time = time.time()
    
    # Calculate metrics
    avg_docs_per_query = total_docs_retrieved / len(questions) if questions else 0
    success_rate = successful_retrievals / len(questions) if questions else 0
    latency = end_time - start_time
    
    results = {
        'retriever_name': retriever_name,
        'success_rate': success_rate,
        'avg_docs_per_query': avg_docs_per_query,
        'total_latency': latency,
        'avg_latency_per_query': latency / len(questions) if questions else 0
    }
    
    print(f"  ✅ Success rate: {success_rate:.2%}")
    print(f"  ✅ Avg docs per query: {avg_docs_per_query:.1f}")
    print(f"  ✅ Latency: {latency:.2f}s")
    
    return results

def estimate_cost(retriever_name, num_queries):
    """Estimate API costs per retriever type"""
    cost_per_query = {
        'Naive': 0.002,  # OpenAI embedding calls
        'BM25': 0.0,     # No API calls
        'Multi-Query': 0.008,  # Multiple LLM calls + embeddings
        'Parent Document': 0.003,  # Embeddings + some overhead
        'Contextual Compression': 0.015,  # Cohere rerank + embeddings
        'Ensemble': 0.020,  # All of the above combined
    }
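    # Caution: the lookup uses only the first word of the name, so "Parent Document"
    # and "Contextual Compression" miss the table and fall back to the 0.005 default.
    return cost_per_query.get(retriever_name.split()[0], 0.005) * num_queries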

print("✅ Evaluation functions ready")

✅ Evaluation functions ready
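
`estimate_cost` looks up `retriever_name.split()[0]`, so the two-word names "Parent Document" and "Contextual Compression" never match their table entries and receive the 0.005 default. A sketch of a lookup keyed on the full name follows. `estimate_cost_by_name` is a hypothetical name, and the per-query figures are the same rough estimates used in the cell above.

```python
COST_PER_QUERY = {
    "Naive": 0.002,
    "BM25": 0.0,
    "Multi-Query": 0.008,
    "Parent Document": 0.003,
    "Contextual Compression": 0.015,
    "Ensemble": 0.020,
}
DEFAULT_COST = 0.005  # fallback for names that are not in the table


def estimate_cost_by_name(retriever_name: str, num_queries: int) -> float:
    """Estimate cost using the full retriever name as the key."""
    return COST_PER_QUERY.get(retriever_name, DEFAULT_COST) * num_queries


for name in ("Parent Document", "Contextual Compression"):
    first_word_cost = COST_PER_QUERY.get(name.split()[0], DEFAULT_COST) * 8
    full_name_cost = estimate_cost_by_name(name, 8)
    print(f"{name}: first-word lookup {first_word_cost:.3f}, full-name lookup {full_name_cost:.3f}")
```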


## Step 6: Run Evaluations

In [77]:
# Extract questions from RAGAS dataset
df = golden_dataset.to_pandas()

# Find the correct question column
if 'question' in df.columns:
    question_col = 'question'
elif 'user_input' in df.columns:
    question_col = 'user_input'
elif len(df.columns) > 0:
    question_col = df.columns[0]  # Use first column as fallback
    print(f"Using column '{question_col}' as questions")
else:
    raise ValueError("No suitable question column found in RAGAS dataset!")

questions = df[question_col].tolist()

print(f"Running evaluation on {len(questions)} questions...")
print("="*60)

# Define retrievers to evaluate
retrievers_to_test = [
    (naive_retriever, "Naive"),
    (bm25_retriever, "BM25"),
    (multi_query_retriever, "Multi-Query"),
    (parent_document_retriever, "Parent Document"),
    (compression_retriever, "Contextual Compression"),
    (ensemble_retriever, "Ensemble")
]

# Run evaluations
results = []
for retriever, name in retrievers_to_test:
    result = evaluate_retriever_simple(retriever, name, questions)
    result['estimated_cost'] = estimate_cost(name, len(questions))
    results.append(result)

print("\n✅ All evaluations completed!")

Running evaluation on 8 questions...

Evaluating Naive...
  ✅ Success rate: 100.00%
  ✅ Avg docs per query: 5.0
  ✅ Latency: 2.66s

Evaluating BM25...
  ✅ Success rate: 100.00%
  ✅ Avg docs per query: 5.0
  ✅ Latency: 0.04s

Evaluating Multi-Query...
  ✅ Success rate: 100.00%
  ✅ Avg docs per query: 7.4
  ✅ Latency: 17.18s

Evaluating Parent Document...
  ✅ Success rate: 100.00%
  ✅ Avg docs per query: 2.8
  ✅ Latency: 2.14s

Evaluating Contextual Compression...
  ✅ Success rate: 100.00%
  ✅ Avg docs per query: 3.0
  ✅ Latency: 3.67s

Evaluating Ensemble...
  ✅ Success rate: 100.00%
  ✅ Avg docs per query: 11.5
  ✅ Latency: 25.76s

✅ All evaluations completed!
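
A 100% success rate here only means that no query raised an exception; it does not show that the right documents came back. The golden dataset includes a `reference_contexts` column, so a rough relevance check could compare retrieved text against it. The sketch below uses token overlap with an arbitrary 0.5 threshold. It is a crude proxy under those assumptions, not a RAGAS metric, and the function names and toy strings are hypothetical.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric word tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_hit(retrieved: List[str], reference_contexts: List[str], threshold: float = 0.5) -> bool:
    """True if any retrieved text covers at least `threshold` of a reference context's tokens."""
    for ref in reference_contexts:
        ref_tokens = tokens(ref)
        if not ref_tokens:
            continue
        for text in retrieved:
            overlap = len(ref_tokens & tokens(text)) / len(ref_tokens)
            if overlap >= threshold:
                return True
    return False


def hit_rate(all_retrieved: List[List[str]], all_references: List[List[str]]) -> float:
    """Fraction of queries with at least one hit."""
    if not all_retrieved:
        return 0.0
    hits = sum(is_hit(r, ref) for r, ref in zip(all_retrieved, all_references))
    return hits / len(all_retrieved)


# Toy strings standing in for retriever output and the reference_contexts column.
retrieved = [
    ["The academic year must include at least 30 weeks of instruction."],
    ["Unrelated complaint about a loan servicer."],
]
references = [
    ["An academic year must include at least 30 weeks of instructional time."],
    ["Clinical work may be included in a standard term."],
]
print(hit_rate(retrieved, references))
```

In the notebook, `retrieved` would hold the `page_content` of each document returned per question, and `references` the matching `reference_contexts` rows.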


## Step 7: Analyze Results

In [78]:
# Create results dataframe
results_df = pd.DataFrame(results)

# Filter successful retrievers
successful_results = results_df[results_df['success_rate'] > 0.8].copy()

if len(successful_results) == 0:
    print("⚠️  No retrievers achieved >80% success rate. Showing all results:")
    successful_results = results_df.copy()

# Display main metrics
display_cols = ['retriever_name', 'success_rate', 'avg_docs_per_query', 
               'total_latency', 'estimated_cost']
print("\n📈 Performance Summary:")
print(successful_results[display_cols].round(4).to_string(index=False))

# Find best performers
fastest = successful_results.loc[successful_results['total_latency'].idxmin()]
cheapest = successful_results.loc[successful_results['estimated_cost'].idxmin()]
most_docs = successful_results.loc[successful_results['avg_docs_per_query'].idxmax()]

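# Calculate combined score (weighted sum; the latency and cost terms are not
# normalized, so a zero-cost retriever gets 0.3 / 0.001 = 300 from the cost term)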
successful_results = successful_results.copy()
successful_results['combined_score'] = (
    0.4 * successful_results['success_rate'] + 
    0.3 * (1 / (successful_results['total_latency'] + 1)) + 
    0.3 * (1 / (successful_results['estimated_cost'] + 0.001))
)

best_overall = successful_results.loc[successful_results['combined_score'].idxmax()]

print("\n🏆 WINNERS:")
print(f"⚡ Fastest: {fastest['retriever_name']} ({fastest['total_latency']:.2f}s)")
print(f"💰 Cheapest: {cheapest['retriever_name']} (${cheapest['estimated_cost']:.4f})")
print(f"📚 Most Comprehensive: {most_docs['retriever_name']} ({most_docs['avg_docs_per_query']:.1f} docs/query)")
print(f"🎖️  Best Overall: {best_overall['retriever_name']} ({best_overall['combined_score']:.3f})")


📈 Performance Summary:
        retriever_name  success_rate  avg_docs_per_query  total_latency  estimated_cost
                 Naive           1.0               5.000         2.6639           0.016
                  BM25           1.0               5.000         0.0387           0.000
           Multi-Query           1.0               7.375        17.1759           0.064
       Parent Document           1.0               2.750         2.1368           0.040
Contextual Compression           1.0               3.000         3.6677           0.040
              Ensemble           1.0              11.500        25.7602           0.160

🏆 WINNERS:
⚡ Fastest: BM25 (0.04s)
💰 Cheapest: BM25 ($0.0000)
📚 Most Comprehensive: Ensemble (11.5 docs/query)
🎖️  Best Overall: BM25 (300.689)
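
The combined score of 300.689 for BM25 comes almost entirely from the cost term: with an estimated cost of 0, `0.3 * 1 / (0 + 0.001)` equals 300, which swamps the success and latency terms. The sketch below reproduces that number and then shows one possible alternative, min-max normalization, which keeps every term in [0, 1]. The normalization is an assumption chosen for illustration, not the notebook's method; the figures are copied from the summary table above.

```python
rows = {
    # name: (success_rate, total_latency_s, estimated_cost_usd), from the summary table
    "Naive": (1.0, 2.6639, 0.016),
    "BM25": (1.0, 0.0387, 0.000),
    "Multi-Query": (1.0, 17.1759, 0.064),
    "Parent Document": (1.0, 2.1368, 0.040),
    "Contextual Compression": (1.0, 3.6677, 0.040),
    "Ensemble": (1.0, 25.7602, 0.160),
}


def original_score(success, latency, cost):
    """The notebook's formula; the cost term can reach 0.3 / 0.001 = 300."""
    return 0.4 * success + 0.3 / (latency + 1) + 0.3 / (cost + 0.001)


def min_max(value, low, high):
    """Scale value into [0, 1]; returns 0.0 when all values are equal."""
    return 0.0 if high == low else (value - low) / (high - low)


latencies = [r[1] for r in rows.values()]
costs = [r[2] for r in rows.values()]


def normalized_score(success, latency, cost):
    """Weighted sum where lower latency and lower cost score closer to 1."""
    speed = 1.0 - min_max(latency, min(latencies), max(latencies))
    thrift = 1.0 - min_max(cost, min(costs), max(costs))
    return 0.4 * success + 0.3 * speed + 0.3 * thrift


for name, row in rows.items():
    print(f"{name:24s} original={original_score(*row):8.3f} normalized={normalized_score(*row):.3f}")
```

BM25 still ranks first under the normalized score, but the gap to the other retrievers becomes comparable across terms instead of being set by a single reciprocal.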


## Step 8: Final Analysis and Recommendations

In [79]:
if len(successful_results) > 0:
    print("\n💡 RECOMMENDATIONS BY USE CASE:")
    print(f"\n1. ⚡ For Speed: {fastest['retriever_name']}")
    print(f"   - Fastest response time: {fastest['total_latency']:.2f}s")
    print(f"   - Good for: Real-time applications, high-throughput systems")
    
    print(f"\n2. 💰 For Cost Efficiency: {cheapest['retriever_name']}")
    print(f"   - Lowest cost: ${cheapest['estimated_cost']:.4f}")
    print(f"   - Good for: Budget-conscious deployments, high-volume usage")
    
    print(f"\n3. 📚 For Comprehensive Results: {most_docs['retriever_name']}")
    print(f"   - Most documents per query: {most_docs['avg_docs_per_query']:.1f}")
    print(f"   - Good for: Research applications, thorough analysis")
    
    print(f"\n4. ⚖️  For Balanced Performance: {best_overall['retriever_name']}")
    print(f"   - Best combined score: {best_overall['combined_score']:.3f}")
    print(f"   - Good for: General-purpose applications, balanced requirements")
    
    print("\n🔍 KEY INSIGHTS:")
    print("- RAGAS provides realistic test questions based on actual data")
    print("- BM25 is typically fastest and cheapest (no API calls)")
    print("- Embedding-based methods provide better semantic understanding")
    print("- Multi-query retrieval improves recall but increases cost")
    print("- Ensemble methods balance different strengths")
    print("- Compression/reranking improves quality but adds latency")
    print("- Parent-document retrievers provide more context per result")
    
    print("\n📈 EVALUATION METRICS:")
    print("- Success Rate: Percentage of queries processed successfully")
    print("- Docs Per Query: Average number of documents retrieved")
    print("- Latency: Time to retrieve and process documents")
    print("- Cost: Estimated API usage costs")
    
else:
    print("\n⚠️  All retrievers had issues. Check your setup and data.")

print(f"\n📊 EVALUATION COMPLETED: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


💡 RECOMMENDATIONS BY USE CASE:

1. ⚡ For Speed: BM25
   - Fastest response time: 0.04s
   - Good for: Real-time applications, high-throughput systems

2. 💰 For Cost Efficiency: BM25
   - Lowest cost: $0.0000
   - Good for: Budget-conscious deployments, high-volume usage

3. 📚 For Comprehensive Results: Ensemble
   - Most documents per query: 11.5
   - Good for: Research applications, thorough analysis

4. ⚖️  For Balanced Performance: BM25
   - Best combined score: 300.689
   - Good for: General-purpose applications, balanced requirements

🔍 KEY INSIGHTS:
- RAGAS provides realistic test questions based on actual data
- BM25 is typically fastest and cheapest (no API calls)
- Embedding-based methods provide better semantic understanding
- Multi-query retrieval improves recall but increases cost
- Ensemble methods balance different strengths
- Compression/reranking improves quality but adds latency
- Parent-document retrievers provide more context per result

📈 EVALUATION METRICS:
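- Success Rate: Percentage of queries processed successfully
- Docs Per Query: Average number of documents retrieved
- Latency: Time to retrieve and process documents
- Cost: Estimated API usage costs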

## Step 9: LangSmith Advanced Evaluation for ALL Retrievers

In [80]:
if USE_LANGSMITH:
    print("\n🔬 Running LangSmith evaluation for ALL retrievers...")
    
    try:
        from langsmith.evaluation import LangChainStringEvaluator, evaluate
        from langchain_core.prompts import ChatPromptTemplate
        from langchain_core.output_parsers import StrOutputParser
        from operator import itemgetter
        
        # Create RAG chain for evaluation
        RAG_PROMPT = """Given the provided context and question, answer the question based only on the context.
If you cannot answer based on the context, say "I don't know".

Context: {context}
Question: {question}"""
        
        rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)
        eval_llm = ChatOpenAI(model="gpt-4o-mini")
        
        # QA evaluator (following example.ipynb pattern)
        qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm})
        
        # Labeled helpfulness evaluator (following example.ipynb pattern)
        labeled_helpfulness_evaluator = LangChainStringEvaluator(
            "labeled_criteria",
            config={
                "criteria": {
                    "helpfulness": (
                        "Is this submission helpful to the user,"
                        " taking into account the correct reference answer?"
                    )
                },
                "llm": eval_llm
            },
            prepare_data=lambda run, example: {
                "prediction": run.outputs["output"],
                "reference": example.outputs["answer"],
                "input": example.inputs["question"],
            }
        )
        
        # Empathy evaluator (following example.ipynb pattern)
        empathy_evaluator = LangChainStringEvaluator(
            "criteria",
            config={
                "criteria": {
                    "empathy": "Is this response empathetic? Does it make the user feel like they are being heard?",
                },
                "llm": eval_llm
            }
        )
        
        # Define all retrievers to evaluate
        all_retrievers_to_evaluate = [
            (naive_retriever, "Naive"),
            (bm25_retriever, "BM25"),
            (multi_query_retriever, "Multi-Query"),
            (parent_document_retriever, "Parent-Document"),
            (compression_retriever, "Contextual-Compression"),
            (ensemble_retriever, "Ensemble")
        ]
        
        print(f"📊 Evaluating {len(all_retrievers_to_evaluate)} retrievers with LangSmith...")
        print("🔍 Evaluators: QA Accuracy, Helpfulness, Empathy")
        
        # Evaluate each retriever
        for retriever, name in all_retrievers_to_evaluate:
            print(f"\n🔍 Evaluating {name} retriever...")
            
            try:
                # Create RAG chain for this retriever
                rag_chain = (
                    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
                    | rag_prompt | eval_llm | StrOutputParser()
                )
                
                # Run evaluation for this retriever
                experiment_results = evaluate(
                    rag_chain.invoke,
                    data=LANGSMITH_DATASET_NAME,
                    evaluators=[
                        qa_evaluator,
                        labeled_helpfulness_evaluator,
                        empathy_evaluator
                    ],
                    metadata={
                        "retriever_type": name, 
                        "evaluation_run": "all_retrievers",
                        "evaluators": "qa_helpfulness_empathy"
                    },
                    experiment_prefix=f"retriever_{name.lower().replace(' ', '_').replace('-', '_')}"
                )
                
                print(f"✅ {name} evaluation completed successfully")
                
                # Add rate limiting delay between retrievers
                time.sleep(3)  # 3 second delay between retrievers
                
            except Exception as e:
                print(f"❌ {name} evaluation failed: {e}")
                continue
        
        print("\n🎯 All retriever evaluations completed!")
        print("📊 Check LangSmith dashboard for detailed comparison results!")
        print("🔍 Each retriever has been evaluated for: QA Accuracy, Helpfulness, Empathy")
        
    except Exception as e:
        print(f"❌ LangSmith evaluation failed: {e}")
        
else:
    print("\n⚠️  Skipping LangSmith evaluation (not configured)")


🔬 Running LangSmith evaluation for ALL retrievers...
📊 Evaluating 6 retrievers with LangSmith...
🔍 Evaluators: QA Accuracy, Helpfulness, Empathy

🔍 Evaluating Naive retriever...
View the evaluation results for experiment: 'retriever_naive-58d23808' at:
https://smith.langchain.com/o/52c4cf8c-3e10-4738-ae6a-bc186d787252/datasets/81c8a79c-b026-4d96-a5b9-08dfd82ac30f/compare?selectedSessions=9de82c1f-a6cc-4a97-8522-ec662ed34081




✅ Naive evaluation completed successfully

🔍 Evaluating BM25 retriever...
View the evaluation results for experiment: 'retriever_bm25-5f0e18c1' at:
https://smith.langchain.com/o/52c4cf8c-3e10-4738-ae6a-bc186d787252/datasets/81c8a79c-b026-4d96-a5b9-08dfd82ac30f/compare?selectedSessions=aecc04e8-75db-40ff-bbe4-2af347aaa0a6




✅ BM25 evaluation completed successfully

🔍 Evaluating Multi-Query retriever...
View the evaluation results for experiment: 'retriever_multi_query-4e8be06f' at:
https://smith.langchain.com/o/52c4cf8c-3e10-4738-ae6a-bc186d787252/datasets/81c8a79c-b026-4d96-a5b9-08dfd82ac30f/compare?selectedSessions=60c18c72-cda8-402d-af1b-a661538a9f90




✅ Multi-Query evaluation completed successfully

🔍 Evaluating Parent-Document retriever...
View the evaluation results for experiment: 'retriever_parent_document-b1d87665' at:
https://smith.langchain.com/o/52c4cf8c-3e10-4738-ae6a-bc186d787252/datasets/81c8a79c-b026-4d96-a5b9-08dfd82ac30f/compare?selectedSessions=cfb819df-fec9-4089-960e-e55faa75ad1c




✅ Parent-Document evaluation completed successfully

🔍 Evaluating Contextual-Compression retriever...
View the evaluation results for experiment: 'retriever_contextual_compression-370372a6' at:
https://smith.langchain.com/o/52c4cf8c-3e10-4738-ae6a-bc186d787252/datasets/81c8a79c-b026-4d96-a5b9-08dfd82ac30f/compare?selectedSessions=39c2a652-01bf-4a40-8655-73c2889fa947




✅ Contextual-Compression evaluation completed successfully

🔍 Evaluating Ensemble retriever...
View the evaluation results for experiment: 'retriever_ensemble-c22124af' at:
https://smith.langchain.com/o/52c4cf8c-3e10-4738-ae6a-bc186d787252/datasets/81c8a79c-b026-4d96-a5b9-08dfd82ac30f/compare?selectedSessions=a4759aad-d30e-444e-bad3-9e365561c572




✅ Ensemble evaluation completed successfully

🎯 All retriever evaluations completed!
📊 Check LangSmith dashboard for detailed comparison results!
🔍 Each retriever has been evaluated for: QA Accuracy, Helpfulness, Empathy
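
In the Step 9 chain, the retriever's output (a list of documents) is passed directly into `{context}`, so the prompt receives the list's string representation, metadata included. A common alternative is to join only the page text first. The sketch below uses a hypothetical `Doc` stand-in with a `page_content` attribute so that it runs without LangChain; in the notebook, a function like `format_docs` could be piped after the retriever, since LCEL accepts plain functions in a chain.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document with a page_content attribute."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def format_docs(docs: List[Doc], separator: str = "\n\n") -> str:
    """Join the text of retrieved documents into one context string."""
    return separator.join(doc.page_content.strip() for doc in docs)


docs = [
    Doc("First retrieved passage.", {"source": "a.pdf"}),
    Doc("  Second retrieved passage.  ", {"source": "b.pdf"}),
]
context = format_docs(docs)
print(context)
```

Keeping metadata out of the prompt shortens the context and avoids the model quoting file paths or IDs back to the user.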


## Retriever Evaluation: Final Analysis

### Overview

We evaluated six retrieval methods on a combined corpus of consumer complaint narratives and Federal Student Aid handbook pages, using 8 RAGAS-generated questions. Step 6 measured operational metrics (success rate, documents per query, latency), and Step 9 scored answer quality with LangSmith evaluators.

### Performance Results

| Retriever | Success Rate | Docs/Query | Total Latency (s, 8 queries) | When to Use |
|-----------|--------------|------------|-------------|-------------|
| BM25 | 100% | 5.0 | 0.04 | Speed + cost efficiency |
| Naive | 100% | 5.0 | 2.66 | Semantic understanding |
| Parent Document | 100% | 2.8 | 2.14 | Rich context needed |
| Contextual Compression | 100% | 3.0 | 3.67 | Quality over quantity |
| Multi-Query | 100% | 7.4 | 17.18 | Comprehensive research |
| Ensemble | 100% | 11.5 | 25.76 | Maximum coverage |
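
The latency figures in this table are total wall-clock seconds across all 8 questions (`total_latency` in Step 5). A short conversion to per-query milliseconds makes them easier to compare against a response-time budget. This is a minimal sketch using the rounded values from the table.

```python
NUM_QUESTIONS = 8  # size of the golden dataset

# Total seconds for all questions, copied from the table above.
total_latency_s = {
    "BM25": 0.04,
    "Naive": 2.66,
    "Parent Document": 2.14,
    "Contextual Compression": 3.67,
    "Multi-Query": 17.18,
    "Ensemble": 25.76,
}

per_query_ms = {name: total / NUM_QUESTIONS * 1000 for name, total in total_latency_s.items()}

# Print from fastest to slowest.
for name, ms in sorted(per_query_ms.items(), key=lambda item: item[1]):
    print(f"{name:24s} {ms:8.1f} ms/query")
```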

### Quality Assessment

The LangSmith experiments scored each retriever's answers for QA correctness, helpfulness, and empathy. All six retrievers produced accurate answers to regulatory questions about academic year requirements, clinical work criteria, and Title IV compliance, and the answers handled CFR citations and regulatory detail well. Per-experiment scores are available through the LangSmith links in the Step 9 output.

### When Each Retriever Makes Sense

**BM25 Retriever** fits when you need fast responses with no API cost. It suits production environments with high query volume, especially for structured regulatory content that contains specific terminology and citations. It answered all 8 questions in 0.04 seconds total (about 5 ms per query), which makes it a good match for real-time student support systems.

**Naive Retriever** works best when your users ask questions in natural language rather than using exact regulatory terms. It provides solid semantic understanding while maintaining reasonable speed and cost. This approach handles conversational queries well and serves as an excellent general-purpose solution when you need flexibility across different types of content.

**Parent Document Retriever** excels when context is king. It provides fewer documents but each one contains complete, rich information rather than fragments. This method works particularly well for applications where users need to understand the full policy context around their question, making it valuable for compliance officers and detailed policy research.

**Contextual Compression Retriever** delivers precision through reranking. It took 3.67 seconds for the 8 questions, against 2.66 seconds for the Naive retriever it wraps, and returned 3.0 documents per query instead of 5.0. This approach makes sense when answer quality matters more than speed, particularly for applications where users prefer a few highly relevant results over many potentially tangential ones.

**Multi-Query Retriever** serves researchers and complex analysis scenarios exceptionally well. It automatically generates multiple query variations to ensure comprehensive coverage, retrieving 7.4 documents per query on average. The higher latency and cost are worthwhile when thoroughness matters more than speed, such as for policy analysis or detailed regulatory research.

**Ensemble Retriever** merges the results of the other five retrievers with equal weights. It returned the most documents, 11.5 per query, at the highest latency (25.76 seconds for the 8 questions). It is best suited for applications where missing relevant information is not an option and the higher resource use is acceptable.

### Key Insights

The evaluation suggests that regulatory content works well with keyword-based approaches, likely because of its structured nature and specific terminology. The right method still depends on the use case: real-time support systems favor speed, research applications favor coverage, and precision-focused tools benefit from reranking.

All methods produced high-quality answers for Federal Student Aid questions, suggesting that the choice between them should focus on operational requirements like speed, cost, and coverage rather than answer quality alone.


![LangSmith Screenshot](./img/ls.png)