# 02: RAGAS Evaluation (Standardized Dataset)

## 🎯 Objective
Assess the RAG pipeline with the RAGAS framework on a standardized dataset of 15 test cases, using four key metrics: faithfulness, answer relevancy, context precision, and context recall.

### **Evaluation Dataset:**
- **15 Business Specifications**: Complete domain coverage
- **15 SQL Pipelines**: Comprehensive pipeline coverage  
- **Comprehensive Lineage**: 40+ nodes, 35+ edges
- **15 Standardized Test Cases**: Covering all business domains

### **RAGAS Metrics:**
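- **Faithfulness** - Whether the claims in the answer are supported by the retrieved context
- **Answer Relevancy** - Relevance of answers to questions
- **Context Precision** - Whether the relevant retrieved chunks are ranked near the top
- **Context Recall** - How much of the ground truth is covered by the retrieved context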
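
To make the three ratio-style metrics concrete, here is a minimal sketch of the arithmetic behind them. It assumes the relevance and support verdicts are already known and passed in as booleans; RAGAS itself obtains those verdicts from an LLM judge, so the function names and inputs below are illustrative and not the library's API.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_recall_score(reference_claim_covered: List[bool]) -> float:
    """Fraction of ground-truth claims that can be attributed to the retrieved context."""
    if not reference_claim_covered:
        return 0.0
    return sum(reference_claim_covered) / len(reference_claim_covered)


def context_precision_score(chunk_relevant: List[bool]) -> float:
    """Mean of precision@k taken over the ranks k that hold a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / k  # precision@k at this relevant rank
    return precision_sum / hits if hits else 0.0


# Hand-written verdicts, for illustration only.
print(faithfulness_score([True, True, False, False]))
print(context_precision_score([False, True]))
print(context_precision_score([True, False, True]))
```

Because context precision rewards ranking, a single relevant chunk at rank 2 scores lower than the same chunk at rank 1, even though the retrieved set is identical.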

### **Expected Outcomes:**
- **Comprehensive Performance Table**: Detailed metrics across all domains
- **Domain-Specific Analysis**: Performance by business area
- **Actionable Recommendations**: Specific improvement areas

## Test Coverage
- **10 Business Domains**: Sales Orders, Customer Analytics, Inventory Management, Financial Reporting, Marketing Attribution, Supply Chain, HR Analytics, Product Analytics, Risk Management, Compliance Monitoring
- **5 Operational Guides**: Incident Playbook, Data Quality Standards, Troubleshooting Guide, SLA Definitions, Escalation Procedures


## 1. Setup and Imports


In [24]:
import os
import sys
import json
import pandas as pd
import numpy as np
from pathlib import Path
from typing import List, Dict, Any
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Verify API keys
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set. Create a .env file or export it in your shell.")

print("✅ Environment setup complete")


✅ Environment setup complete


In [25]:
# Add src to path for imports
sys.path.insert(0, str(Path.cwd().parent / "src"))

# Import RAGAS components
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Import our Traceback system
from tracebackcore.core import traceback_graph, lineage_retriever, AgentState, initialize_system

print("✅ Imports complete")


✅ Imports complete


In [26]:
# Load the standardized golden test data (path is relative to the notebook directory)
golden_test_data_path = Path("../data/golden_test_data.json")
with open(golden_test_data_path, "r") as f:
    golden_test_data = json.load(f)

print(f"✅ Loaded standardized golden test dataset with {len(golden_test_data)} comprehensive test cases")
print("Test cases cover all 15 business domains:")
for i, test_case in enumerate(golden_test_data, 1):
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"  {i:2d}. {test_case['question'][:60]}...")


✅ Loaded standardized golden test dataset with 15 comprehensive test cases
Test cases cover all 15 business domains:
   1. What should I do if the sales orders pipeline fails?...
   2. How does the customer analytics pipeline segment customers?...
   3. What are the stock level management rules for inventory?...
   4. What financial controls are implemented in the reporting pip...
   5. What attribution models are used for marketing campaigns?...
   6. How is supplier performance measured in the supply chain?...
   7. What employee metrics are tracked in the HR analytics pipeli...
   8. What product usage metrics are monitored in real-time?...
   9. What risk management controls are implemented?...
  10. What compliance monitoring procedures are in place?...
  11. What are the escalation procedures for data pipeline inciden...
  12. What are the data quality standards and monitoring?...
  13. What are the SLA definitions and monitoring procedures?...
  14. What are the troubleshooti
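
The rest of the notebook reads only the `question` and `ground_truth` fields of each test case. A small validation pass, sketched below, can catch a malformed entry before any LLM calls are spent on it. The helper name `find_invalid_cases` and the sample entries are hypothetical and not part of the Traceback code base.

```python
from typing import Any, Dict, List

# The two fields this notebook reads from every golden test case.
REQUIRED_KEYS = ("question", "ground_truth")


def find_invalid_cases(test_cases: List[Dict[str, Any]]) -> List[str]:
    """Return one message per missing or empty required field."""
    problems = []
    for i, case in enumerate(test_cases, start=1):
        for key in REQUIRED_KEYS:
            value = case.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"case {i}: missing or empty '{key}'")
    return problems


# Made-up entries for illustration; the real file holds 15 cases.
sample_cases = [
    {"question": "What should I do if the sales orders pipeline fails?",
     "ground_truth": "placeholder reference answer"},
    {"question": "", "ground_truth": "placeholder reference answer"},
]

for message in find_invalid_cases(sample_cases):
    print(message)
```

In the notebook, calling `find_invalid_cases(golden_test_data)` right after loading the file and raising when the list is non-empty would be one way to use it.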

## 2. Initialize Traceback System


In [27]:
# Initialize the Traceback system
print("🚀 Initializing Traceback system...")
initialize_system()
print("✅ Traceback system initialized successfully")


🚀 Initializing Traceback system...
🚀 Initializing Traceback system...
📚 Loading all specifications and SQL pipelines...
✅ Loaded 30 documents (15 specs, 15 SQL files)
✅ Loaded comprehensive lineage data: 13 nodes, 13 edges
✅ Traceback system initialized successfully
✅ Traceback system initialized successfully


## 3. Create Golden Test Dataset

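The golden test dataset was already loaded from `data/golden_test_data.json` in Section 1. Each test case pairs a question with a ground-truth answer, and the same file is reused by the Advanced Retrieval evaluation so that both evaluations score identical inputs.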


In [28]:
# Note: Golden test data is now loaded from standardized JSON file in cell above
# This ensures consistency between RAGAS evaluation and Advanced Retrieval evaluation

print("✅ Using standardized golden test data from data/golden_test_data.json")
print(f"📊 Test cases loaded: {len(golden_test_data)}")
print("🎯 Both evaluations now use identical test data for fair comparison")


✅ Using standardized golden test data from data/golden_test_data.json
📊 Test cases loaded: 15
🎯 Both evaluations now use identical test data for fair comparison


## 4. Generate Responses Using Traceback System


In [29]:
def generate_traceback_response(question: str) -> Dict[str, Any]:
    """Generate response using our Traceback system."""
    try:
        # Create initial state
        initial_state = AgentState(
            question=question,
            context=[],
            impact_assessment=None,
            blast_radius=None,
            recommended_actions=None,
            incident_brief=None,
            current_step="supervisor",
            error=None
        )
        
        # Run the workflow
        result = traceback_graph.invoke(initial_state)
        
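        # Retrieve context with a separate retriever call. Note: this may not be
        # exactly the context the graph used internally when writing the brief.
        retrieved_docs = lineage_retriever.search_with_lineage(question, k=5)
        retrieved_contexts = [doc.page_content for doc in retrieved_docs]
        
        # Extract relevant information
        return {
            # Fall back to a placeholder when the brief is missing or None
            "answer": result.get("incident_brief") or "No response generated",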
            "context": retrieved_contexts,  # Use actual retrieved context
            "blast_radius": result.get("blast_radius", []),
            "impact_assessment": result.get("impact_assessment", {})
        }
    except Exception as e:
        return {
            "answer": f"Error generating response: {str(e)}",
            "context": [],
            "blast_radius": [],
            "impact_assessment": {}
        }

print("✅ Response generation function defined")


✅ Response generation function defined


In [30]:
# Generate responses for all test cases
print("🔄 Generating responses using Traceback system...")

evaluation_data = []
for i, test_case in enumerate(golden_test_data):
    print(f"Processing test case {i+1}/{len(golden_test_data)}: {test_case['question'][:50]}...")
    
    # Generate response
    response = generate_traceback_response(test_case["question"])
    
    # Prepare data for RAGAS evaluation
    # Use actual retrieved context from our system, not predefined context
    evaluation_data.append({
        "question": test_case["question"],
        "answer": response["answer"],
        "contexts": response["context"],  # Use actual retrieved context
        "ground_truth": test_case["ground_truth"]
    })

print(f"✅ Generated responses for {len(evaluation_data)} test cases")


🔄 Generating responses using Traceback system...
Processing test case 1/15: What should I do if the sales orders pipeline fail...
Processing test case 2/15: How does the customer analytics pipeline segment c...
Processing test case 3/15: What are the stock level management rules for inve...
Processing test case 4/15: What financial controls are implemented in the rep...
Processing test case 5/15: What attribution models are used for marketing cam...
Processing test case 6/15: How is supplier performance measured in the supply...
Processing test case 7/15: What employee metrics are tracked in the HR analyt...
Processing test case 8/15: What product usage metrics are monitored in real-t...
Processing test case 9/15: What risk management controls are implemented?...
Processing test case 10/15: What compliance monitoring procedures are in place...
Processing test case 11/15: What are the escalation procedures for data pipeli...
Processing test case 12/15: What are the data quality standard
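
When `generate_traceback_response` catches an exception, it returns an answer beginning with "Error generating response:" and an empty context list, and that row is still appended to `evaluation_data`. Scoring such rows would pull the metrics down for reasons unrelated to answer quality. The sketch below separates them out first; `split_failed_rows` is a hypothetical helper, and the demo rows are made up.

```python
from typing import Any, Dict, List, Tuple

# Prefix used by generate_traceback_response() when it catches an exception.
ERROR_PREFIX = "Error generating response:"

Row = Dict[str, Any]


def split_failed_rows(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Separate rows that can be scored from rows where generation or retrieval failed."""
    scorable, failed = [], []
    for row in rows:
        answer = row.get("answer") or ""
        if answer.startswith(ERROR_PREFIX) or not row.get("contexts"):
            failed.append(row)
        else:
            scorable.append(row)
    return scorable, failed


# Made-up rows for illustration only.
demo_rows = [
    {"question": "q1", "answer": "# Incident Brief ...", "contexts": ["chunk"], "ground_truth": "gt"},
    {"question": "q2", "answer": "Error generating response: timeout", "contexts": [], "ground_truth": "gt"},
]

scorable_rows, failed_rows = split_failed_rows(demo_rows)
print(f"scorable: {len(scorable_rows)}, failed: {len(failed_rows)}")
```

Reporting the failed count alongside the metrics keeps the evaluation honest: the scores then describe only the answers the system actually produced.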

## 5. RAGAS Evaluation


In [31]:
# Convert to RAGAS Dataset format
ragas_dataset = Dataset.from_list(evaluation_data)

print(f"📊 RAGAS dataset created with {len(ragas_dataset)} samples")
print(f"Dataset columns: {ragas_dataset.column_names}")

# Verify the data format
print("\n🔍 Sample data format verification:")
sample = ragas_dataset[0]
print(f"Question: {sample['question'][:50]}...")
print(f"Answer length: {len(sample['answer'])} characters")
print(f"Contexts count: {len(sample['contexts'])}")
print(f"Contexts type: {type(sample['contexts'])}")
print(f"First context: {sample['contexts'][0][:50]}...")
print(f"Ground truth length: {len(sample['ground_truth'])} characters")


📊 RAGAS dataset created with 15 samples
Dataset columns: ['question', 'answer', 'contexts', 'ground_truth']

🔍 Sample data format verification:
Question: What should I do if the sales orders pipeline fail...
Answer length: 3775 characters
Contexts count: 5
Contexts type: <class 'list'>
First context: -- Sales Orders Pipeline
-- Purpose: Transform raw...
Ground truth length: 432 characters


### Note: RAGAS EvaluationResult Object
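RAGAS returns an `EvaluationResult` object, not a dictionary. To access the results, use:
- `result.to_pandas()` to get a DataFrame with one row per sample
- `result.scores` to get the per-sample metric scores as a list of dictionaries

In the RAGAS version used here, the DataFrame renames the input columns: `question` becomes `user_input`, `contexts` becomes `retrieved_contexts`, `answer` becomes `response`, and `ground_truth` becomes `reference`.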


In [32]:
# Test with a smaller subset first to verify everything works
print("🧪 Testing RAGAS evaluation with first 2 samples...")

# Create a small test dataset
test_dataset = Dataset.from_list(evaluation_data[:2])

# Define metrics to evaluate
metrics = [
    faithfulness,      # Are the answer's claims supported by the retrieved context?
    answer_relevancy,  # How relevant are the responses to the questions?
    context_precision, # Are relevant chunks ranked near the top of the retrieved context?
    context_recall     # How well does the context cover the ground truth?
]

print("🔄 Running RAGAS evaluation on test subset...")
print("This may take a few minutes...")

# Run evaluation on test subset
test_result = evaluate(
    test_dataset,
    metrics=metrics
)

print("✅ RAGAS test evaluation completed!")
print(f"Test result type: {type(test_result)}")

# Convert to pandas DataFrame to see the results
test_df = test_result.to_pandas()
print(f"Test results shape: {test_df.shape}")
print(f"Test results columns: {list(test_df.columns)}")
print("\n📊 Test Results Summary:")
for metric in ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']:
    if metric in test_df.columns:
        print(f"{metric}: {test_df[metric].mean():.3f}")


🧪 Testing RAGAS evaluation with first 2 samples...
🔄 Running RAGAS evaluation on test subset...
This may take a few minutes...


Evaluating:   0%|          | 0/8 [00:00<?, ?it/s]

✅ RAGAS test evaluation completed!
Test result type: <class 'ragas.dataset_schema.EvaluationResult'>
Test results shape: (2, 8)
Test results columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']

📊 Test Results Summary:
faithfulness: 0.381
answer_relevancy: 0.923
context_precision: 1.000
context_recall: 0.524
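
The column list printed above shows that the results DataFrame uses different names from the ones supplied in `evaluation_data`. If later analysis code expects the original names, one option is to rename the columns back. The mapping below is read off the two column lists printed in this notebook; the helper name and the toy DataFrame are assumptions for illustration.

```python
import pandas as pd

# RAGAS output column -> column name originally supplied in evaluation_data.
RAGAS_TO_INPUT_COLUMNS = {
    "user_input": "question",
    "retrieved_contexts": "contexts",
    "response": "answer",
    "reference": "ground_truth",
}


def restore_input_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the results with the original input column names."""
    return df.rename(columns=RAGAS_TO_INPUT_COLUMNS)


# Toy frame with placeholder values, standing in for test_result.to_pandas().
toy_df = pd.DataFrame({
    "user_input": ["q1"],
    "retrieved_contexts": [["chunk"]],
    "response": ["a1"],
    "reference": ["gt1"],
    "faithfulness": [0.5],
})

print(list(restore_input_column_names(toy_df).columns))
```

`DataFrame.rename` ignores mapping keys that are absent, so the helper is safe to call even if a future RAGAS version keeps some of the original names.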


In [33]:
# Display detailed test results
print("📋 Detailed Test Results:")
print("=" * 50)
display(test_df)

# Sanity check: the mean of each answer-level metric should be greater than 0
test_passed = all(test_df[metric].mean() > 0 for metric in ['faithfulness', 'answer_relevancy'] if metric in test_df.columns)
if test_passed:
    print("\n✅ Test PASSED: Ready for full evaluation")
else:
    print("\n❌ Test FAILED: Inspect the scores above before running the full evaluation")


📋 Detailed Test Results:


| | user_input | retrieved_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | What should I do if the sales orders pipeline ... | [-- Sales Orders Pipeline\n-- Purpose: Transfo... | # Incident Brief: Sales Orders Pipeline Failur... | If the sales orders pipeline fails: 1) Check p... | 0.0 | 0.910627 | 1.0 | 0.714286 |
| 1 | How does the customer analytics pipeline segme... | [# Customer Analytics Pipeline Specification\n... | # Incident Brief: Customer Analytics Pipeline ... | The customer analytics pipeline segments custo... | 0.761905 | 0.935875 | 1.0 | 0.333333 |



✅ Test PASSED: Ready for full evaluation


In [34]:
# Run full evaluation on all samples
print("🚀 Running full RAGAS evaluation on all samples...")
print("This may take several minutes...")

# Run evaluation on full dataset
result = evaluate(
    ragas_dataset,
    metrics=metrics
)

print("✅ RAGAS evaluation completed!")


🚀 Running full RAGAS evaluation on all samples...
This may take several minutes...


Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

✅ RAGAS evaluation completed!


## 6. Results Analysis


In [35]:
# Extract results
results_df = result.to_pandas()

print("📊 RAGAS Evaluation Results:")
print("=" * 50)

# Display overall metrics
overall_metrics = {
    "Faithfulness": results_df['faithfulness'].mean(),
    "Answer Relevancy": results_df['answer_relevancy'].mean(),
    "Context Precision": results_df['context_precision'].mean(),
    "Context Recall": results_df['context_recall'].mean()
}

print("\n🎯 Overall Performance Metrics:")
for metric, score in overall_metrics.items():
    print(f"{metric:20}: {score:.3f}")

print("\n📋 Detailed Results:")
display(results_df)


📊 RAGAS Evaluation Results:

🎯 Overall Performance Metrics:
Faithfulness        : 0.494
Answer Relevancy    : 0.889
Context Precision   : 0.717
Context Recall      : 0.436

📋 Detailed Results:


| | user_input | retrieved_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | What should I do if the sales orders pipeline ... | [-- Sales Orders Pipeline\n-- Purpose: Transfo... | # Incident Brief: Sales Orders Pipeline Failur... | If the sales orders pipeline fails: 1) Check p... | 0.0 | 0.910627 | 1.0 | 0.714286 |
| 1 | How does the customer analytics pipeline segme... | [# Customer Analytics Pipeline Specification\n... | # Incident Brief: Customer Analytics Pipeline ... | The customer analytics pipeline segments custo... | 0.893617 | 0.935875 | 1.0 | 0.333333 |
| 2 | What are the stock level management rules for ... | [-- Inventory Management Pipeline\n-- Purpose:... | # Incident Brief: Inventory Stock Level Manage... | Inventory stock level management rules include... | 0.947368 | 0.902031 | 0.0 | 0.0 |
| 3 | What financial controls are implemented in the... | [# Financial Reporting Pipeline Specification\... | # Incident Brief: Financial Controls in Report... | Financial reporting pipeline controls include:... | 0.928571 | 0.946052 | 1.0 | 0.3 |
| 4 | What attribution models are used for marketing... | [# Marketing Attribution Pipeline Specificatio... | # Incident Brief: Marketing Attribution Models... | Marketing attribution models include: 1) First... | 0.928571 | 0.896943 | 1.0 | 0.2 |
| 5 | How is supplier performance measured in the su... | [-- Supply Chain Analytics Pipeline\n-- Purpos... | # Incident Brief: Supplier Performance Measure... | Supplier performance measurement includes: 1) ... | 0.25 | 0.887601 | 1.0 | 0.142857 |
| 6 | What employee metrics are tracked in the HR an... | [-- HR Analytics Pipeline\n-- Purpose: Employe... | # Incident Brief: HR Analytics Pipeline Disrup... | HR analytics tracks employee metrics including... | 0.295455 | 0.888966 | 0.0 | 0.142857 |
| 7 | What product usage metrics are monitored in re... | [-- Product Analytics Pipeline\n-- Purpose: Co... | # Incident Brief: Product Usage Metrics Monito... | Real-time product usage metrics include: 1) DA... | 0.448276 | 0.890654 | 0.5 | 0.8 |
| 8 | What risk management controls are implemented? | [# Risk Management Pipeline Specification\n\n#... | # Incident Brief: Traceback Incident Triage\n\... | Risk management controls include: 1) Credit ri... | 0.571429 | 0.855866 | 0.0 | 0.3 |
| 9 | What compliance monitoring procedures are in p... | [# Compliance Monitoring Pipeline Specificatio... | # Incident Brief: Compliance Monitoring Proced... | Compliance monitoring procedures include: 1) R... | 0.0 | 0.893557 | 1.0 | 0.7 |


In [36]:
# Debug: Check what columns are available in results_df
print("🔍 Debugging Results DataFrame:")
print("=" * 40)
print(f"DataFrame shape: {results_df.shape}")
print(f"Available columns: {list(results_df.columns)}")
print(f"DataFrame head:")
display(results_df.head())


🔍 Debugging Results DataFrame:
DataFrame shape: (15, 8)
Available columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
DataFrame head:


| | user_input | retrieved_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | What should I do if the sales orders pipeline ... | [-- Sales Orders Pipeline\n-- Purpose: Transfo... | # Incident Brief: Sales Orders Pipeline Failur... | If the sales orders pipeline fails: 1) Check p... | 0.0 | 0.910627 | 1.0 | 0.714286 |
| 1 | How does the customer analytics pipeline segme... | [# Customer Analytics Pipeline Specification\n... | # Incident Brief: Customer Analytics Pipeline ... | The customer analytics pipeline segments custo... | 0.893617 | 0.935875 | 1.0 | 0.333333 |
| 2 | What are the stock level management rules for ... | [-- Inventory Management Pipeline\n-- Purpose:... | # Incident Brief: Inventory Stock Level Manage... | Inventory stock level management rules include... | 0.947368 | 0.902031 | 0.0 | 0.0 |
| 3 | What financial controls are implemented in the... | [# Financial Reporting Pipeline Specification\... | # Incident Brief: Financial Controls in Report... | Financial reporting pipeline controls include:... | 0.928571 | 0.946052 | 1.0 | 0.3 |
| 4 | What attribution models are used for marketing... | [# Marketing Attribution Pipeline Specificatio... | # Incident Brief: Marketing Attribution Models... | Marketing attribution models include: 1) First... | 0.928571 | 0.896943 | 1.0 | 0.2 |


In [37]:
# Create a summary table
summary_table = pd.DataFrame({
    "Metric": ["Faithfulness", "Answer Relevancy", "Context Precision", "Context Recall"],
    "Score": [overall_metrics["Faithfulness"], overall_metrics["Answer Relevancy"], 
              overall_metrics["Context Precision"], overall_metrics["Context Recall"]],
    "Interpretation": [
        "How factually accurate are the responses?",
        "How relevant are the responses to the questions?",
        "How precise is the retrieved context?",
        "How well does the context cover the ground truth?"
    ]
})

print("\n📊 RAGAS Evaluation Summary Table:")
print("=" * 80)
display(summary_table)



📊 RAGAS Evaluation Summary Table:


| | Metric | Score | Interpretation |
|---|---|---|---|
| 0 | Faithfulness | 0.493814 | How factually accurate are the responses? |
| 1 | Answer Relevancy | 0.889233 | How relevant are the responses to the questions? |
| 2 | Context Precision | 0.717037 | How precise is the retrieved context? |
| 3 | Context Recall | 0.435556 | How well does the context cover the ground truth? |


In [38]:
# Safe performance analysis (handles missing columns)
print("🔍 Safe Performance Analysis:")
print("=" * 50)

# Check what metrics are available
available_metrics = [col for col in ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall'] 
                    if col in results_df.columns]

print(f"Available metrics: {available_metrics}")

if available_metrics:
    # Calculate overall performance
    overall_performance = {}
    for metric in available_metrics:
        overall_performance[metric] = results_df[metric].mean()
    
    print("\n📊 Overall Performance Metrics:")
    for metric, score in overall_performance.items():
        print(f"{metric:20}: {score:.3f}")
    
    # Calculate performance statistics
    print("\n📈 Performance Statistics:")
    for metric in available_metrics:
        print(f"\n{metric}:")
        print(f"  Mean: {results_df[metric].mean():.3f}")
        print(f"  Std:  {results_df[metric].std():.3f}")
        print(f"  Min:  {results_df[metric].min():.3f}")
        print(f"  Max:  {results_df[metric].max():.3f}")
else:
    print("❌ No metrics found in results DataFrame")


🔍 Safe Performance Analysis:
Available metrics: ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']

📊 Overall Performance Metrics:
faithfulness        : 0.494
answer_relevancy    : 0.889
context_precision   : 0.717
context_recall      : 0.436

📈 Performance Statistics:

faithfulness:
  Mean: 0.494
  Std:  0.360
  Min:  0.000
  Max:  0.947

answer_relevancy:
  Mean: 0.889
  Std:  0.040
  Min:  0.775
  Max:  0.946

context_precision:
  Mean: 0.717
  Std:  0.404
  Min:  0.000
  Max:  1.000

context_recall:
  Mean: 0.436
  Std:  0.295
  Min:  0.000
  Max:  1.000


In [39]:
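# Question-type analysis with a fallback to overall metrics.
# Note: RAGAS renames 'question' to 'user_input' in its output, so the
# fallback branch runs unless the column is renamed back first.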
print("🔍 SAFE Performance Analysis:")
print("=" * 50)

# Check what columns are actually available
print(f"Available columns: {list(results_df.columns)}")

# Check if question column exists
if 'question' in results_df.columns:
    print("✅ Question column found - can do question type analysis")
    # Add question categories safely
    results_df['question_type'] = results_df['question'].apply(lambda x: 
        'Impact Analysis' if 'impacted' in x.lower() else
        'Troubleshooting' if 'troubleshoot' in x.lower() or 'should i do' in x.lower() else
        'Dependency Analysis' if 'depend' in x.lower() else
        'SLA Query' if 'sla' in x.lower() else
        'General'
    )
    
    # Group by question type
    available_metrics = [col for col in ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall'] 
                        if col in results_df.columns]
    
    if available_metrics:
        performance_by_type = results_df.groupby('question_type')[available_metrics].mean().round(3)
        print("\n📊 Performance by Question Type:")
        display(performance_by_type)
    else:
        print("❌ No metrics found for grouping")
else:
    print("⚠️ Question column not found - showing overall performance only")
    
    # Show overall performance
    available_metrics = [col for col in ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall'] 
                        if col in results_df.columns]
    
    if available_metrics:
        print("\n📊 Overall Performance Summary:")
        for metric in available_metrics:
            print(f"{metric}: {results_df[metric].mean():.3f}")
    else:
        print("❌ No performance metrics found")


🔍 SAFE Performance Analysis:
Available columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
⚠️ Question column not found - showing overall performance only

📊 Overall Performance Summary:
faithfulness: 0.494
answer_relevancy: 0.889
context_precision: 0.717
context_recall: 0.436
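
The question-type breakdown was skipped above because the results DataFrame stores the question under `user_input` and not under `question`. A minimal sketch of the same grouping keyed on `user_input` follows. The keyword rules are copied from the cell above; the toy scores are placeholders and not results from this evaluation.

```python
import pandas as pd


def classify_question(text: str) -> str:
    """Keyword-based question category, using the same rules as the cell above."""
    lowered = text.lower()
    if "impacted" in lowered:
        return "Impact Analysis"
    if "troubleshoot" in lowered or "should i do" in lowered:
        return "Troubleshooting"
    if "depend" in lowered:
        return "Dependency Analysis"
    if "sla" in lowered:
        return "SLA Query"
    return "General"


# Questions taken from the golden set; scores are placeholders for illustration.
toy_results = pd.DataFrame({
    "user_input": [
        "What should I do if the sales orders pipeline fails?",
        "What are the SLA definitions and monitoring procedures?",
        "What risk management controls are implemented?",
    ],
    "faithfulness": [0.2, 0.4, 0.6],
})

by_type = (
    toy_results
    .assign(question_type=toy_results["user_input"].map(classify_question))
    .groupby("question_type")[["faithfulness"]]
    .mean()
)
print(by_type)
```

With only 15 test cases, most categories will contain one or two questions, so per-type means are better read as pointers to individual cases than as stable estimates.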


## 7. Performance Analysis and Conclusions


In [40]:
# Identify strengths and weaknesses (with safety checks)
print("\n🎯 Performance Strengths and Weaknesses:")
print("=" * 50)

# Use the safe overall_performance from previous cell
if 'overall_performance' in locals() and overall_performance:
    # Find best and worst performing metrics
    best_metric = max(overall_performance.items(), key=lambda x: x[1])
    worst_metric = min(overall_performance.items(), key=lambda x: x[1])

    print(f"\n✅ Strongest Performance: {best_metric[0]} ({best_metric[1]:.3f})")
    print(f"❌ Weakest Performance: {worst_metric[0]} ({worst_metric[1]:.3f})")

    # Calculate overall score
    overall_score = np.mean(list(overall_performance.values()))
    print(f"\n📊 Overall Pipeline Score: {overall_score:.3f}")

    # Performance interpretation
    if overall_score >= 0.8:
        performance_level = "Excellent"
    elif overall_score >= 0.7:
        performance_level = "Good"
    elif overall_score >= 0.6:
        performance_level = "Fair"
    else:
        performance_level = "Needs Improvement"

    print(f"🎯 Performance Level: {performance_level}")
else:
    print("⚠️ No performance data available for analysis")



🎯 Performance Strengths and Weaknesses:

✅ Strongest Performance: answer_relevancy (0.889)
❌ Weakest Performance: context_recall (0.436)

📊 Overall Pipeline Score: 0.634
🎯 Performance Level: Fair
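
The overall score is the unweighted mean of the four metric means, mapped to a label with fixed thresholds. The sketch below reproduces that arithmetic from the summary table values so the 0.634 figure can be checked by hand; the function name `performance_level_for` is illustrative.

```python
# Metric means copied from the summary table above.
metric_means = {
    "faithfulness": 0.493814,
    "answer_relevancy": 0.889233,
    "context_precision": 0.717037,
    "context_recall": 0.435556,
}


def performance_level_for(score: float) -> str:
    """Map an overall score to the labels used in this notebook."""
    if score >= 0.8:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.6:
        return "Fair"
    return "Needs Improvement"


# Unweighted mean of the four metric means.
overall = sum(metric_means.values()) / len(metric_means)
print(f"{overall:.3f} -> {performance_level_for(overall)}")
```

An unweighted mean lets the high answer relevancy (0.889) offset the low context recall (0.436), so the single "Fair" label is best read alongside the per-metric scores.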


## 8. Detailed Conclusions and Recommendations


In [41]:
print("📋 Detailed Conclusions and Recommendations:")
print("=" * 60)

print("\n🔍 Key Findings:")
print("-" * 20)

# Faithfulness analysis
faithfulness_score = overall_metrics["Faithfulness"]
if faithfulness_score >= 0.8:
    faithfulness_conclusion = "The system generates highly factual and accurate responses."
elif faithfulness_score >= 0.6:
    faithfulness_conclusion = "The system generates mostly accurate responses with some factual inconsistencies."
else:
    faithfulness_conclusion = "The system has significant factual accuracy issues that need attention."

print(f"1. Faithfulness ({faithfulness_score:.3f}): {faithfulness_conclusion}")

# Answer relevancy analysis
relevancy_score = overall_metrics["Answer Relevancy"]
if relevancy_score >= 0.8:
    relevancy_conclusion = "Responses are highly relevant to the questions asked."
elif relevancy_score >= 0.6:
    relevancy_conclusion = "Responses are generally relevant but may sometimes miss the mark."
else:
    relevancy_conclusion = "Responses often lack relevance to the specific questions asked."

print(f"2. Answer Relevancy ({relevancy_score:.3f}): {relevancy_conclusion}")

# Context precision analysis
precision_score = overall_metrics["Context Precision"]
if precision_score >= 0.8:
    precision_conclusion = "The system retrieves highly precise and relevant context."
elif precision_score >= 0.6:
    precision_conclusion = "The system retrieves reasonably precise context with some noise."
else:
    precision_conclusion = "The system retrieves context with significant noise and irrelevance."

print(f"3. Context Precision ({precision_score:.3f}): {precision_conclusion}")

# Context recall analysis
recall_score = overall_metrics["Context Recall"]
if recall_score >= 0.8:
    recall_conclusion = "The system retrieves comprehensive context that covers ground truth well."
elif recall_score >= 0.6:
    recall_conclusion = "The system retrieves adequate context but may miss some important information."
else:
    recall_conclusion = "The system often misses important context needed for accurate responses."

print(f"4. Context Recall ({recall_score:.3f}): {recall_conclusion}")


📋 Detailed Conclusions and Recommendations:

🔍 Key Findings:
--------------------
1. Faithfulness (0.494): The system has significant factual accuracy issues that need attention.
2. Answer Relevancy (0.889): Responses are highly relevant to the questions asked.
3. Context Precision (0.717): The system retrieves reasonably precise context with some noise.
4. Context Recall (0.436): The system often misses important context needed for accurate responses.


In [42]:
print("\n💡 Recommendations for Improvement:")
print("-" * 40)

recommendations = []

# Faithfulness recommendations
if faithfulness_score < 0.8:
    recommendations.append("• Improve factual accuracy by enhancing the knowledge base and fact-checking mechanisms")

# Relevancy recommendations
if relevancy_score < 0.8:
    recommendations.append("• Enhance response relevance by improving question understanding and response generation")

# Precision recommendations
if precision_score < 0.8:
    recommendations.append("• Improve context precision by refining retrieval algorithms and filtering mechanisms")

# Recall recommendations
if recall_score < 0.8:
    recommendations.append("• Enhance context recall by expanding the knowledge base and improving retrieval coverage")

# General recommendations
if overall_score < 0.8:
    recommendations.extend([
        "• Consider fine-tuning the LLM on domain-specific data",
        "• Implement feedback loops to continuously improve performance",
        "• Add more diverse test cases to the evaluation dataset",
        "• Consider ensemble methods for better response quality"
    ])

if recommendations:
    for rec in recommendations:
        print(rec)
else:
    print("• System performance is excellent - consider monitoring for consistency")
    print("• Expand the test dataset to cover more edge cases")
    print("• Implement A/B testing for continuous improvement")



💡 Recommendations for Improvement:
----------------------------------------
• Improve factual accuracy by enhancing the knowledge base and fact-checking mechanisms
• Improve context precision by refining retrieval algorithms and filtering mechanisms
• Enhance context recall by expanding the knowledge base and improving retrieval coverage
• Consider fine-tuning the LLM on domain-specific data
• Implement feedback loops to continuously improve performance
• Add more diverse test cases to the evaluation dataset
• Consider ensemble methods for better response quality


In [43]:
print("\n🎯 Overall Pipeline Effectiveness Assessment:")
print("=" * 50)

print(f"\n📊 Summary Statistics:")
print(f"• Total Test Cases: {len(evaluation_data)}")
print(f"• Overall Score: {overall_score:.3f}")
print(f"• Performance Level: {performance_level}")
print(f"• Best Metric: {best_metric[0]} ({best_metric[1]:.3f})")
print(f"• Worst Metric: {worst_metric[0]} ({worst_metric[1]:.3f})")

print(f"\n🔍 Key Insights:")
print(f"• The Traceback system demonstrates {'strong' if overall_score >= 0.7 else 'moderate' if overall_score >= 0.6 else 'weak'} performance across all RAGAS metrics")
print(f"• {'The system excels at' if best_metric[1] >= 0.8 else 'The system shows good performance in'} {best_metric[0].lower()}")
print(f"• {'Significant improvement needed in' if worst_metric[1] < 0.6 else 'Some improvement possible in'} {worst_metric[0].lower()}")

print(f"\n✅ Conclusion:")
if overall_score >= 0.8:
    conclusion = "The Traceback pipeline is highly effective and ready for production deployment."
elif overall_score >= 0.7:
    conclusion = "The Traceback pipeline shows good effectiveness with room for targeted improvements."
elif overall_score >= 0.6:
    conclusion = "The Traceback pipeline demonstrates fair effectiveness but requires significant improvements."
else:
    conclusion = "The Traceback pipeline needs substantial improvements before production deployment."

print(conclusion)



🎯 Overall Pipeline Effectiveness Assessment:

📊 Summary Statistics:
• Total Test Cases: 15
• Overall Score: 0.634
• Performance Level: Fair
• Best Metric: answer_relevancy (0.889)
• Worst Metric: context_recall (0.436)

🔍 Key Insights:
• The Traceback system demonstrates moderate performance across all RAGAS metrics
• The system excels at answer_relevancy
• Significant improvement needed in context_recall

✅ Conclusion:
The Traceback pipeline demonstrates fair effectiveness but requires significant improvements.


## 9. Save Results


In [44]:
# Save detailed results
results_output_path = Path.cwd().parent / "data" / "ragas_evaluation_results.json"

evaluation_summary = {
    "evaluation_date": pd.Timestamp.now().isoformat(),
    "total_test_cases": len(evaluation_data),
    "overall_metrics": overall_metrics,
    "overall_score": float(overall_score),
    "performance_level": performance_level,
    "detailed_results": results_df.to_dict('records'),
    "recommendations": recommendations if recommendations else ["System performance is excellent"]
}

with open(results_output_path, 'w') as f:
    json.dump(evaluation_summary, f, indent=2)

print(f"✅ Results saved to: {results_output_path}")
print(f"📊 Evaluation completed successfully!")
print(f"🎯 Overall Pipeline Score: {overall_score:.3f} ({performance_level})")


✅ Results saved to: /Users/sandeepgogineni/ai-engineering/bootcamp/Traceback/data/ragas_evaluation_results.json
📊 Evaluation completed successfully!
🎯 Overall Pipeline Score: 0.634 (Fair)
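
One caveat when saving: Python's `json.dump` writes a float NaN as the bare token `NaN`, which strict JSON parsers reject. None of the scores shown in this notebook are NaN, but as a precaution (assuming a metric could come back as NaN if its computation fails) the summary can be sanitized before writing. The helper `to_json_safe` is hypothetical.

```python
import json
import math
from typing import Any


def to_json_safe(value: Any) -> Any:
    """Recursively replace float NaN with None so the output is strict JSON."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    return value


# Made-up summary fragment for illustration.
demo_summary = {"faithfulness": float("nan"), "scores": [1.0, float("nan")]}

# allow_nan=False makes json raise if any NaN slipped through.
print(json.dumps(to_json_safe(demo_summary), allow_nan=False))
```

In the save cell, this would mean passing `to_json_safe(evaluation_summary)` to `json.dump` together with `allow_nan=False`.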
