# Task 5: RAGAS Evaluation - Golden Test Data Set

## ⚠️ Important Note
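If the question-type analysis cell in Section 7 raises `KeyError: 'question'`, **skip that cell** and use the "SAFE ALTERNATIVE" cell in Section 6 instead. The RAGAS results DataFrame does not keep the original input column names: in this run `question` came back as `user_input`, `answer` as `response`, `contexts` as `retrieved_contexts`, and `ground_truth` as `reference`.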

## Objective
Assess the Traceback pipeline using the RAGAS framework with key metrics:
- **Faithfulness**: Are the claims in the generated response supported by the retrieved context?
- **Response Relevance**: How directly does the response address the question? (RAGAS metric `answer_relevancy`)
- **Context Precision**: Are the relevant retrieved chunks ranked ahead of the irrelevant ones?
- **Context Recall**: How much of the ground truth can be attributed to the retrieved context?
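
Faithfulness and context recall are both ratio-style scores: statements judged as supported, divided by all statements considered. The sketch below only illustrates that arithmetic. It is not the RAGAS implementation (RAGAS uses an LLM to extract and judge the statements), and `supported_ratio` and the verdict list are hypothetical.

```python
from typing import List


def supported_ratio(verdicts: List[bool]) -> float:
    """Fraction of statements judged as supported; 0.0 for an empty list."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts: 2 of 7 ground-truth statements are covered by the context.
example_verdicts = [True, True, False, False, False, False, False]
example_score = supported_ratio(example_verdicts)
print(f"{example_score:.6f}")  # 0.285714
```

A score such as 0.285714, which appears in the results later in this notebook, is what 2 supported statements out of 7 would produce.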

## Methodology
1. Create a golden test dataset with ground truth answers
2. Generate responses using our Traceback system
3. Evaluate using RAGAS metrics
4. Analyze results and draw conclusions about pipeline effectiveness


## 1. Setup and Imports


In [30]:
import os
import sys
import json
import pandas as pd
import numpy as np
from pathlib import Path
from typing import List, Dict, Any
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Verify API keys
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set. Create a .env file or export it in your shell.")

print("✅ Environment setup complete")


✅ Environment setup complete


In [31]:
# Add src to path for imports
sys.path.insert(0, str(Path.cwd().parent / "src"))

# Import RAGAS components
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Import our Traceback system
from tracebackcore.core import traceback_graph, lineage_retriever, AgentState, initialize_system

print("✅ Imports complete")


✅ Imports complete


## 2. Initialize Traceback System


In [32]:
# Initialize the Traceback system
print("🚀 Initializing Traceback system...")
initialize_system()
print("✅ Traceback system initialized successfully")


🚀 Initializing Traceback system...
🚀 Initializing Traceback system...
✅ Traceback system initialized successfully
✅ Traceback system initialized successfully


## 3. Create Golden Test Dataset

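We create a small golden test dataset of five incident scenarios (impact analysis, data-quality remediation, dashboard dependencies, troubleshooting, and SLA lookup), each with a ground truth answer and hand-written reference context.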


In [33]:
# Define our golden test dataset
golden_test_data = [
    {
        "question": "Job curated.sales_orders failed — who's impacted?",
        "ground_truth": "The curated.sales_orders job failure impacts curated.revenue_summary and analytics.customer_behavior tables. These downstream systems depend on sales order data for revenue calculations and customer behavior analysis. The blast radius includes all analytics dashboards and reporting tools that rely on this data.",
        "context": [
            "Sales orders pipeline processes raw order data into curated datasets for analytics and reporting.",
            "SELECT * FROM curated.sales_orders WHERE order_date >= CURRENT_DATE - 1",
            "Data pipeline incident response procedures: 1. Acknowledge incident 2. Assess impact 3. Determine blast radius 4. Notify stakeholders"
        ]
    },
    {
        "question": "What should I do if raw.sales_orders has quality issues?",
        "ground_truth": "If raw.sales_orders has quality issues, you should: 1) Acknowledge the incident and notify stakeholders, 2) Assess the impact on downstream systems (curated.sales_orders, curated.revenue_summary, analytics.customer_behavior), 3) Investigate the root cause by reviewing data quality checks and logs, 4) Implement data cleansing procedures, 5) Reprocess the affected data through the pipeline, 6) Validate the corrected data, and 7) Monitor for future quality issues.",
        "context": [
            "Data pipeline incident response procedures: 1. Acknowledge incident 2. Assess impact 3. Determine blast radius 4. Notify stakeholders",
            "Sales orders pipeline processes raw order data into curated datasets for analytics and reporting.",
            "Data quality standards require validation of order completeness, customer information accuracy, and product details consistency."
        ]
    },
    {
        "question": "Which dashboards depend on curated.revenue_summary?",
        "ground_truth": "Dashboards that depend on curated.revenue_summary include: 1) Revenue Performance Dashboard - shows daily/monthly revenue trends, 2) Sales Forecasting Dashboard - uses revenue data for predictive analytics, 3) Executive Financial Summary - provides high-level revenue metrics, 4) Monthly Revenue Reports - detailed revenue breakdowns by product/customer, and 5) Customer Analytics Dashboard - analyzes revenue per customer segment.",
        "context": [
            "Revenue summary aggregates sales data for executive reporting and business intelligence.",
            "SELECT SUM(revenue) FROM curated.revenue_summary GROUP BY month",
            "Executive dashboards require real-time revenue data for strategic decision making."
        ]
    },
    {
        "question": "How do I troubleshoot a data pipeline failure?",
        "ground_truth": "To troubleshoot a data pipeline failure: 1) Check system logs for error messages and stack traces, 2) Verify data source availability and connectivity, 3) Validate input data quality and format, 4) Check resource utilization (CPU, memory, disk space), 5) Review pipeline configuration and dependencies, 6) Test individual pipeline components in isolation, 7) Check for schema changes or data format modifications, 8) Verify authentication and permissions, 9) Monitor downstream system health, and 10) Document findings and implement preventive measures.",
        "context": [
            "Data pipeline incident response procedures: 1. Acknowledge incident 2. Assess impact 3. Determine blast radius 4. Notify stakeholders",
            "Pipeline monitoring includes checking data freshness, quality metrics, and processing times.",
            "Common pipeline failures include data quality issues, resource constraints, and configuration errors."
        ]
    },
    {
        "question": "What is the SLA for curated.sales_orders data freshness?",
        "ground_truth": "The SLA for curated.sales_orders data freshness is 4 hours from source data availability. This means that once raw sales order data is available, the curated.sales_orders table should be updated within 4 hours. The SLA includes: 1) Data processing completion within 4 hours, 2) Data quality validation within 6 hours, 3) Downstream system updates within 8 hours, and 4) 99.5% uptime for the processing pipeline.",
        "context": [
            "Sales orders pipeline processes raw order data into curated datasets for analytics and reporting.",
            "Data freshness SLA requires updates within 4 hours of source data availability.",
            "Pipeline monitoring includes checking data freshness, quality metrics, and processing times."
        ]
    }
]

print(f"✅ Created golden test dataset with {len(golden_test_data)} test cases")


✅ Created golden test dataset with 5 test cases
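
Before spending LLM calls on evaluation, it can help to check that every test case is well-formed. The following is a minimal sketch under the assumption that each case carries the three keys used above; `validate_test_cases` and `REQUIRED_KEYS` are hypothetical names, not part of the Traceback code.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("question", "ground_truth", "context")


def validate_test_cases(cases: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means the data looks well-formed."""
    problems = []
    for i, case in enumerate(cases):
        missing = [key for key in REQUIRED_KEYS if key not in case]
        if missing:
            problems.append(f"case {i}: missing keys {missing}")
            continue
        for key in ("question", "ground_truth"):
            if not isinstance(case[key], str) or not case[key].strip():
                problems.append(f"case {i}: '{key}' must be a non-empty string")
        ctx = case["context"]
        if not isinstance(ctx, list) or not ctx or not all(isinstance(c, str) for c in ctx):
            problems.append(f"case {i}: 'context' must be a non-empty list of strings")
    return problems


# Illustrative input only; in the notebook you would pass golden_test_data.
sample_cases = [
    {"question": "What is the SLA?", "ground_truth": "4 hours.", "context": ["SLA is 4 hours."]},
    {"question": "Who is impacted?"},
]
print(validate_test_cases(sample_cases))
```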


## 4. Generate Responses Using Traceback System


In [34]:
def generate_traceback_response(question: str) -> Dict[str, Any]:
    """Generate response using our Traceback system."""
    try:
        # Create initial state
        initial_state = AgentState(
            question=question,
            context=[],
            impact_assessment=None,
            blast_radius=None,
            recommended_actions=None,
            incident_brief=None,
            current_step="supervisor",
            error=None
        )
        
        # Run the workflow
        result = traceback_graph.invoke(initial_state)
        
        # Retrieve context with a separate call so RAGAS can score it.
        # NOTE: this may differ from the context the graph used internally.
        retrieved_docs = lineage_retriever.search_with_lineage(question, k=5)
        retrieved_contexts = [doc.page_content for doc in retrieved_docs]
        
        # Extract relevant information; `or` also covers keys that are present but None
        return {
            "answer": result.get("incident_brief") or "No response generated",
            "context": retrieved_contexts,  # Use actual retrieved context
            "blast_radius": result.get("blast_radius") or [],
            "impact_assessment": result.get("impact_assessment") or {}
        }
    except Exception as e:
        return {
            "answer": f"Error generating response: {str(e)}",
            "context": [],
            "blast_radius": [],
            "impact_assessment": {}
        }

print("✅ Response generation function defined")


✅ Response generation function defined


In [35]:
# Generate responses for all test cases
print("🔄 Generating responses using Traceback system...")

evaluation_data = []
for i, test_case in enumerate(golden_test_data):
    print(f"Processing test case {i+1}/{len(golden_test_data)}: {test_case['question'][:50]}...")
    
    # Generate response
    response = generate_traceback_response(test_case["question"])
    
    # Prepare data for RAGAS evaluation
    # Use actual retrieved context from our system, not predefined context
    evaluation_data.append({
        "question": test_case["question"],
        "answer": response["answer"],
        "contexts": response["context"],  # Use actual retrieved context
        "ground_truth": test_case["ground_truth"]
    })

print(f"✅ Generated responses for {len(evaluation_data)} test cases")


🔄 Generating responses using Traceback system...
Processing test case 1/5: Job curated.sales_orders failed — who's impacted?...
Processing test case 2/5: What should I do if raw.sales_orders has quality i...
Processing test case 3/5: Which dashboards depend on curated.revenue_summary...
Processing test case 4/5: How do I troubleshoot a data pipeline failure?...
Processing test case 5/5: What is the SLA for curated.sales_orders data fres...
✅ Generated responses for 5 test cases
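
Because `generate_traceback_response` turns exceptions into an answer string with empty context, a failed generation would otherwise be scored as if it were a real response. The sketch below separates such rows before evaluation. `split_failed_rows` is a hypothetical helper, and the demo rows are illustrative, not output from the Traceback system.

```python
from typing import Any, Dict, List, Tuple

# Prefix produced by the except branch of generate_traceback_response
ERROR_PREFIX = "Error generating response:"


def split_failed_rows(
    rows: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split rows into (usable, failed): failed rows have an error answer or no contexts."""
    usable, failed = [], []
    for row in rows:
        answer = str(row.get("answer") or "")
        if answer.startswith(ERROR_PREFIX) or not row.get("contexts"):
            failed.append(row)
        else:
            usable.append(row)
    return usable, failed


demo_rows = [
    {"question": "q1", "answer": "# Incident Brief: ...", "contexts": ["doc a"], "ground_truth": "g1"},
    {"question": "q2", "answer": "Error generating response: timeout", "contexts": [], "ground_truth": "g2"},
]
usable, failed = split_failed_rows(demo_rows)
print(len(usable), len(failed))  # 1 1
```

In the notebook, `split_failed_rows(evaluation_data)` would give the rows to pass to `Dataset.from_list` and the rows to investigate separately.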


## 5. RAGAS Evaluation


In [36]:
# Convert to RAGAS Dataset format
ragas_dataset = Dataset.from_list(evaluation_data)

print(f"📊 RAGAS dataset created with {len(ragas_dataset)} samples")
print(f"Dataset columns: {ragas_dataset.column_names}")

# Verify the data format
print("\n🔍 Sample data format verification:")
sample = ragas_dataset[0]
print(f"Question: {sample['question'][:50]}...")
print(f"Answer length: {len(sample['answer'])} characters")
print(f"Contexts count: {len(sample['contexts'])}")
print(f"Contexts type: {type(sample['contexts'])}")
print(f"First context: {sample['contexts'][0][:50]}...")
print(f"Ground truth length: {len(sample['ground_truth'])} characters")


📊 RAGAS dataset created with 5 samples
Dataset columns: ['question', 'answer', 'contexts', 'ground_truth']

🔍 Sample data format verification:
Question: Job curated.sales_orders failed — who's impacted?...
Answer length: 3128 characters
Contexts count: 4
Contexts type: <class 'list'>
First context: SELECT * FROM curated.sales_orders WHERE order_dat...
Ground truth length: 312 characters


### Note: RAGAS EvaluationResult Object
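RAGAS returns an `EvaluationResult` object, not a dictionary. To work with the results:
- `result.to_pandas()` returns a DataFrame with one row per sample and one column per metric.
- The DataFrame uses the RAGAS column names (`user_input`, `retrieved_contexts`, `response`, `reference`), not the `question`, `answer`, `contexts`, and `ground_truth` names passed in.
- `result.scores` holds the per-sample scores as a list of dictionaries (attribute name as of RAGAS 0.2.x; verify against your installed version).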
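
The results DataFrame in this notebook comes back with `user_input`, `retrieved_contexts`, `response`, and `reference` instead of the input column names, which is what later triggers `KeyError: 'question'`. A minimal sketch of mapping the names back is shown below. `build_rename_map` is a hypothetical helper, and the mapping is inferred from the column lists printed in this notebook rather than taken from RAGAS documentation.

```python
from typing import Dict, Iterable

# RAGAS output name -> name used when building the dataset in this notebook
RAGAS_TO_ORIGINAL = {
    "user_input": "question",
    "retrieved_contexts": "contexts",
    "response": "answer",
    "reference": "ground_truth",
}


def build_rename_map(columns: Iterable[str]) -> Dict[str, str]:
    """Map RAGAS column names back to the input names, skipping names that would collide."""
    present = set(columns)
    return {
        old: new
        for old, new in RAGAS_TO_ORIGINAL.items()
        if old in present and new not in present
    }


demo_columns = ["user_input", "retrieved_contexts", "response", "reference", "faithfulness"]
print(build_rename_map(demo_columns))
```

With pandas, the map can be applied as `results_df = results_df.rename(columns=build_rename_map(results_df.columns))`, after which `results_df['question']` resolves again.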


In [37]:
# Test with a smaller subset first to verify everything works
print("🧪 Testing RAGAS evaluation with first 2 samples...")

# Create a small test dataset
test_dataset = Dataset.from_list(evaluation_data[:2])

# Define metrics to evaluate
metrics = [
    faithfulness,       # Are the response's claims supported by the retrieved context?
    answer_relevancy,   # Does the response address the question?
    context_precision,  # Are relevant chunks ranked above irrelevant ones?
    context_recall      # How much of the ground truth is covered by the context?
]

print("🔄 Running RAGAS evaluation on test subset...")
print("This may take a few minutes...")

# Run evaluation on test subset
test_result = evaluate(
    test_dataset,
    metrics=metrics
)

print("✅ RAGAS test evaluation completed!")
print(f"Test result type: {type(test_result)}")

# Convert to pandas DataFrame to see the results
test_df = test_result.to_pandas()
print(f"Test results shape: {test_df.shape}")
print(f"Test results columns: {list(test_df.columns)}")
print("\n📊 Test Results Summary:")
for metric in ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']:
    if metric in test_df.columns:
        print(f"{metric}: {test_df[metric].mean():.3f}")


🧪 Testing RAGAS evaluation with first 2 samples...
🔄 Running RAGAS evaluation on test subset...
This may take a few minutes...



✅ RAGAS test evaluation completed!
Test result type: <class 'ragas.dataset_schema.EvaluationResult'>
Test results shape: (2, 8)
Test results columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']

📊 Test Results Summary:
faithfulness: 0.492
answer_relevancy: 0.900
context_precision: 0.333
context_recall: 0.476


In [38]:
# Display detailed test results
print("📋 Detailed Test Results:")
print("=" * 50)
display(test_df)

# Smoke check: the generation metrics must have non-zero means before the full run
test_passed = all(test_df[metric].mean() > 0 for metric in ['faithfulness', 'answer_relevancy'] if metric in test_df.columns)
if test_passed:
    print("\n✅ Test PASSED: Ready for full evaluation")
else:
    print("\n❌ Test FAILED: Fix the pipeline before running the full evaluation")


📋 Detailed Test Results:


| | user_input | retrieved_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | Job curated.sales_orders failed — who's impacted? | [SELECT * FROM curated.sales_orders WHERE orde... | # Incident Brief: Failure of Job `curated.sale... | The curated.sales_orders job failure impacts c... | 0.166667 | 0.892469 | 0.25 | 0.666667 |
| 1 | What should I do if raw.sales_orders has quali... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Quality Issues in `raw.sales... | If raw.sales_orders has quality issues, you sh... | 0.818182 | 0.906665 | 0.416667 | 0.285714 |



✅ Test PASSED: Ready for full evaluation


In [39]:
# Run full evaluation on all samples
print("🚀 Running full RAGAS evaluation on all samples...")
print("This may take several minutes...")

# Run evaluation on full dataset
result = evaluate(
    ragas_dataset,
    metrics=metrics
)

print("✅ RAGAS evaluation completed!")


🚀 Running full RAGAS evaluation on all samples...
This may take several minutes...



✅ RAGAS evaluation completed!


## 6. Results Analysis


In [40]:
# Extract results
results_df = result.to_pandas()

print("📊 RAGAS Evaluation Results:")
print("=" * 50)

# Display overall metrics
overall_metrics = {
    "Faithfulness": results_df['faithfulness'].mean(),
    "Answer Relevancy": results_df['answer_relevancy'].mean(),
    "Context Precision": results_df['context_precision'].mean(),
    "Context Recall": results_df['context_recall'].mean()
}

print("\n🎯 Overall Performance Metrics:")
for metric, score in overall_metrics.items():
    print(f"{metric:20}: {score:.3f}")

print("\n📋 Detailed Results:")
display(results_df)


📊 RAGAS Evaluation Results:

🎯 Overall Performance Metrics:
Faithfulness        : 0.655
Answer Relevancy    : 0.873
Context Precision   : 0.117
Context Recall      : 0.190

📋 Detailed Results:


| | user_input | retrieved_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | Job curated.sales_orders failed — who's impacted? | [SELECT * FROM curated.sales_orders WHERE orde... | # Incident Brief: Failure of Job `curated.sale... | The curated.sales_orders job failure impacts c... | 0.83871 | 0.892469 | 0.25 | 0.666667 |
| 1 | What should I do if raw.sales_orders has quali... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Quality Issues in `raw.sales... | If raw.sales_orders has quality issues, you sh... | 0.806452 | 0.907872 | 0.333333 | 0.285714 |
| 2 | Which dashboards depend on curated.revenue_sum... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Curated Revenue Summary Depe... | Dashboards that depend on curated.revenue_summ... | 0.0 | 0.859911 | 0.0 | 0.0 |
| 3 | How do I troubleshoot a data pipeline failure? | [Data pipeline incident response procedures: 1... | # Incident Brief: Data Pipeline Failure - Sale... | To troubleshoot a data pipeline failure: 1) Ch... | 0.771429 | 0.82641 | 0.0 | 0.0 |
| 4 | What is the SLA for curated.sales_orders data ... | [SELECT * FROM curated.sales_orders WHERE orde... | # Incident Brief: Curated Sales Orders Data Fr... | The SLA for curated.sales_orders data freshnes... | 0.857143 | 0.880744 | 0.0 | 0.0 |
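
The first two samples were scored twice: once in the two-sample test run and once in the full run. The metrics are judged by an LLM, so scores for the same sample can move between runs; sample 0's faithfulness went from 0.166667 to 0.83871. The sketch below flags such movements. `score_drift` and the 0.05 tolerance are hypothetical choices, and only two of the four metrics are copied in to keep the example short.

```python
from typing import Dict, List, Tuple

Scores = Dict[int, Dict[str, float]]

# Scores copied from the two result tables in this notebook (samples 0 and 1)
subset_run: Scores = {
    0: {"faithfulness": 0.166667, "context_precision": 0.25},
    1: {"faithfulness": 0.818182, "context_precision": 0.416667},
}
full_run: Scores = {
    0: {"faithfulness": 0.83871, "context_precision": 0.25},
    1: {"faithfulness": 0.806452, "context_precision": 0.333333},
}


def score_drift(a: Scores, b: Scores, tolerance: float = 0.05) -> List[Tuple[int, str, float]]:
    """Return (sample, metric, delta) for scores that moved by more than `tolerance`."""
    drifts = []
    for sample in sorted(a.keys() & b.keys()):
        for metric in sorted(a[sample].keys() & b[sample].keys()):
            delta = b[sample][metric] - a[sample][metric]
            if abs(delta) > tolerance:
                drifts.append((sample, metric, round(delta, 3)))
    return drifts


for sample, metric, delta in score_drift(subset_run, full_run):
    print(f"sample {sample}: {metric} changed by {delta:+.3f}")
```

Averaging over repeated runs, or reporting the spread alongside the mean, would make conclusions from a five-sample dataset less sensitive to this variation.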


In [41]:
# Debug: Check what columns are available in results_df
print("🔍 Debugging Results DataFrame:")
print("=" * 40)
print(f"DataFrame shape: {results_df.shape}")
print(f"Available columns: {list(results_df.columns)}")
print(f"DataFrame head:")
display(results_df.head())


🔍 Debugging Results DataFrame:
DataFrame shape: (5, 8)
Available columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
DataFrame head:


| | user_input | retrieved_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | Job curated.sales_orders failed — who's impacted? | [SELECT * FROM curated.sales_orders WHERE orde... | # Incident Brief: Failure of Job `curated.sale... | The curated.sales_orders job failure impacts c... | 0.83871 | 0.892469 | 0.25 | 0.666667 |
| 1 | What should I do if raw.sales_orders has quali... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Quality Issues in `raw.sales... | If raw.sales_orders has quality issues, you sh... | 0.806452 | 0.907872 | 0.333333 | 0.285714 |
| 2 | Which dashboards depend on curated.revenue_sum... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Curated Revenue Summary Depe... | Dashboards that depend on curated.revenue_summ... | 0.0 | 0.859911 | 0.0 | 0.0 |
| 3 | How do I troubleshoot a data pipeline failure? | [Data pipeline incident response procedures: 1... | # Incident Brief: Data Pipeline Failure - Sale... | To troubleshoot a data pipeline failure: 1) Ch... | 0.771429 | 0.82641 | 0.0 | 0.0 |
| 4 | What is the SLA for curated.sales_orders data ... | [SELECT * FROM curated.sales_orders WHERE orde... | # Incident Brief: Curated Sales Orders Data Fr... | The SLA for curated.sales_orders data freshnes... | 0.857143 | 0.880744 | 0.0 | 0.0 |


In [42]:
# Create a summary table
summary_table = pd.DataFrame({
    "Metric": ["Faithfulness", "Answer Relevancy", "Context Precision", "Context Recall"],
    "Score": [overall_metrics["Faithfulness"], overall_metrics["Answer Relevancy"], 
              overall_metrics["Context Precision"], overall_metrics["Context Recall"]],
    "Interpretation": [
        "How factually accurate are the responses?",
        "How relevant are the responses to the questions?",
        "How precise is the retrieved context?",
        "How well does the context cover the ground truth?"
    ]
})

print("\n📊 RAGAS Evaluation Summary Table:")
print("=" * 80)
display(summary_table)



📊 RAGAS Evaluation Summary Table:


| | Metric | Score | Interpretation |
|---|---|---|---|
| 0 | Faithfulness | 0.654747 | How factually accurate are the responses? |
| 1 | Answer Relevancy | 0.873481 | How relevant are the responses to the questions? |
| 2 | Context Precision | 0.116667 | How precise is the retrieved context? |
| 3 | Context Recall | 0.190476 | How well does the context cover the ground truth? |


In [43]:
# Safe performance analysis (handles missing columns)
print("🔍 Safe Performance Analysis:")
print("=" * 50)

# Check what metrics are available
available_metrics = [col for col in ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall'] 
                    if col in results_df.columns]

print(f"Available metrics: {available_metrics}")

if available_metrics:
    # Calculate overall performance
    overall_performance = {}
    for metric in available_metrics:
        overall_performance[metric] = results_df[metric].mean()
    
    print("\n📊 Overall Performance Metrics:")
    for metric, score in overall_performance.items():
        print(f"{metric:20}: {score:.3f}")
    
    # Calculate performance statistics
    print("\n📈 Performance Statistics:")
    for metric in available_metrics:
        print(f"\n{metric}:")
        print(f"  Mean: {results_df[metric].mean():.3f}")
        print(f"  Std:  {results_df[metric].std():.3f}")
        print(f"  Min:  {results_df[metric].min():.3f}")
        print(f"  Max:  {results_df[metric].max():.3f}")
else:
    print("❌ No metrics found in results DataFrame")


🔍 Safe Performance Analysis:
Available metrics: ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']

📊 Overall Performance Metrics:
faithfulness        : 0.655
answer_relevancy    : 0.873
context_precision   : 0.117
context_recall      : 0.190

📈 Performance Statistics:

faithfulness:
  Mean: 0.655
  Std:  0.367
  Min:  0.000
  Max:  0.857

answer_relevancy:
  Mean: 0.873
  Std:  0.032
  Min:  0.826
  Max:  0.908

context_precision:
  Mean: 0.117
  Std:  0.162
  Min:  0.000
  Max:  0.333

context_recall:
  Mean: 0.190
  Std:  0.294
  Min:  0.000
  Max:  0.667


In [None]:
# SAFE ALTERNATIVE: use this cell instead of the question-type analysis in Section 7
print("🔍 SAFE Performance Analysis:")
print("=" * 50)

# Check what columns are actually available
print(f"Available columns: {list(results_df.columns)}")

# RAGAS renames `question` to `user_input`, so accept either column name
question_col = next((c for c in ('user_input', 'question') if c in results_df.columns), None)
if question_col is not None:
    print(f"✅ Question column found ({question_col!r}) - can do question type analysis")
    # Add question categories safely
    results_df['question_type'] = results_df[question_col].apply(lambda x:
        'Impact Analysis' if 'impacted' in x.lower() else
        'Troubleshooting' if 'troubleshoot' in x.lower() or 'should i do' in x.lower() else
        'Dependency Analysis' if 'depend' in x.lower() else
        'SLA Query' if 'sla' in x.lower() else
        'General'
    )
    
    # Group by question type
    available_metrics = [col for col in ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall'] 
                        if col in results_df.columns]
    
    if available_metrics:
        performance_by_type = results_df.groupby('question_type')[available_metrics].mean().round(3)
        print("\n📊 Performance by Question Type:")
        display(performance_by_type)
    else:
        print("❌ No metrics found for grouping")
else:
    print("⚠️ Question column not found - showing overall performance only")
    
    # Show overall performance
    available_metrics = [col for col in ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall'] 
                        if col in results_df.columns]
    
    if available_metrics:
        print("\n📊 Overall Performance Summary:")
        for metric in available_metrics:
            print(f"{metric}: {results_df[metric].mean():.3f}")
    else:
        print("❌ No performance metrics found")
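
The chained conditional inside the `lambda` is easier to read and test as a named function. The sketch below applies the same keyword rules in the same order; `categorize_question` is a hypothetical name. Note that these are plain substring matches, so a rule like `'sla'` would also match unrelated words such as "translate".

```python
def categorize_question(question: str) -> str:
    """Assign a coarse category using the same keyword rules as the notebook cell."""
    q = question.lower()
    if "impacted" in q:
        return "Impact Analysis"
    if "troubleshoot" in q or "should i do" in q:
        return "Troubleshooting"
    if "depend" in q:
        return "Dependency Analysis"
    if "sla" in q:
        return "SLA Query"
    return "General"


golden_questions = [
    "Job curated.sales_orders failed — who's impacted?",
    "What should I do if raw.sales_orders has quality issues?",
    "Which dashboards depend on curated.revenue_summary?",
    "How do I troubleshoot a data pipeline failure?",
    "What is the SLA for curated.sales_orders data freshness?",
]
for q in golden_questions:
    print(f"{categorize_question(q):20} <- {q}")
```

With five questions, every category except Troubleshooting holds a single sample, so per-type means are indicative at best.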


## 7. Performance Analysis and Conclusions


In [44]:
# Analyze performance by question type
print("🔍 Performance Analysis by Question Type:")
print("=" * 50)

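# Add question categories
# NOTE: the next line raises KeyError because RAGAS returned the question
# column as `user_input` (see the column list printed in Section 6)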
results_df['question_type'] = results_df['question'].apply(lambda x: 
    'Impact Analysis' if 'impacted' in x.lower() else
    'Troubleshooting' if 'troubleshoot' in x.lower() or 'should i do' in x.lower() else
    'Dependency Analysis' if 'depend' in x.lower() else
    'SLA Query' if 'sla' in x.lower() else
    'General'
)

# Group by question type
performance_by_type = results_df.groupby('question_type').agg({
    'faithfulness': 'mean',
    'answer_relevancy': 'mean',
    'context_precision': 'mean',
    'context_recall': 'mean'
}).round(3)

print("\n📊 Performance by Question Type:")
display(performance_by_type)


🔍 Performance Analysis by Question Type:


KeyError: 'question'

In [None]:
# Identify strengths and weaknesses (with safety checks)
print("\n🎯 Performance Strengths and Weaknesses:")
print("=" * 50)

# Use the safe overall_performance from previous cell
if 'overall_performance' in locals() and overall_performance:
    # Find best and worst performing metrics
    best_metric = max(overall_performance.items(), key=lambda x: x[1])
    worst_metric = min(overall_performance.items(), key=lambda x: x[1])

    print(f"\n✅ Strongest Performance: {best_metric[0]} ({best_metric[1]:.3f})")
    print(f"❌ Weakest Performance: {worst_metric[0]} ({worst_metric[1]:.3f})")

    # Calculate overall score
    overall_score = np.mean(list(overall_performance.values()))
    print(f"\n📊 Overall Pipeline Score: {overall_score:.3f}")

    # Performance interpretation
    if overall_score >= 0.8:
        performance_level = "Excellent"
    elif overall_score >= 0.7:
        performance_level = "Good"
    elif overall_score >= 0.6:
        performance_level = "Fair"
    else:
        performance_level = "Needs Improvement"

    print(f"🎯 Performance Level: {performance_level}")
else:
    print("⚠️ No performance data available for analysis")
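
The cell above was not executed in this run. As a worked example of the same thresholds, the sketch below applies them to the four mean scores reported in the summary table. `score_to_level` is a hypothetical helper, and the generation/retrieval split is an extra view added for illustration, not part of the original analysis.

```python
from statistics import mean


def score_to_level(score: float) -> str:
    """Map a 0-1 score to the performance labels used in this notebook."""
    if score >= 0.8:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.6:
        return "Fair"
    return "Needs Improvement"


# Mean scores copied from the summary table
scores = {
    "faithfulness": 0.654747,
    "answer_relevancy": 0.873481,
    "context_precision": 0.116667,
    "context_recall": 0.190476,
}

overall = mean(scores.values())
print(f"{overall:.3f} -> {score_to_level(overall)}")  # 0.459 -> Needs Improvement

# A single average hides the gap between the generation and retrieval metrics
generation = mean([scores["faithfulness"], scores["answer_relevancy"]])
retrieval = mean([scores["context_precision"], scores["context_recall"]])
print(f"generation={generation:.3f} retrieval={retrieval:.3f}")  # generation=0.764 retrieval=0.154
```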


## 8. Detailed Conclusions and Recommendations


In [None]:
print("📋 Detailed Conclusions and Recommendations:")
print("=" * 60)

print("\n🔍 Key Findings:")
print("-" * 20)

# Faithfulness analysis
faithfulness_score = overall_metrics["Faithfulness"]
if faithfulness_score >= 0.8:
    faithfulness_conclusion = "Responses are well grounded in the retrieved context."
elif faithfulness_score >= 0.6:
    faithfulness_conclusion = "Responses are mostly grounded in the retrieved context, with some unsupported claims."
else:
    faithfulness_conclusion = "Many response claims are not supported by the retrieved context."

print(f"1. Faithfulness ({faithfulness_score:.3f}): {faithfulness_conclusion}")

# Answer relevancy analysis
relevancy_score = overall_metrics["Answer Relevancy"]
if relevancy_score >= 0.8:
    relevancy_conclusion = "Responses are highly relevant to the questions asked."
elif relevancy_score >= 0.6:
    relevancy_conclusion = "Responses are generally relevant but may sometimes miss the mark."
else:
    relevancy_conclusion = "Responses often lack relevance to the specific questions asked."

print(f"2. Answer Relevancy ({relevancy_score:.3f}): {relevancy_conclusion}")

# Context precision analysis
precision_score = overall_metrics["Context Precision"]
if precision_score >= 0.8:
    precision_conclusion = "The system retrieves highly precise and relevant context."
elif precision_score >= 0.6:
    precision_conclusion = "The system retrieves reasonably precise context with some noise."
else:
    precision_conclusion = "The system retrieves context with significant noise and irrelevance."

print(f"3. Context Precision ({precision_score:.3f}): {precision_conclusion}")

# Context recall analysis
recall_score = overall_metrics["Context Recall"]
if recall_score >= 0.8:
    recall_conclusion = "The system retrieves comprehensive context that covers ground truth well."
elif recall_score >= 0.6:
    recall_conclusion = "The system retrieves adequate context but may miss some important information."
else:
    recall_conclusion = "The system often misses important context needed for accurate responses."

print(f"4. Context Recall ({recall_score:.3f}): {recall_conclusion}")


In [None]:
print("\n💡 Recommendations for Improvement:")
print("-" * 40)

recommendations = []

# Faithfulness recommendations
if faithfulness_score < 0.8:
    recommendations.append("• Improve faithfulness by constraining responses to claims supported by the retrieved context")

# Relevancy recommendations
if relevancy_score < 0.8:
    recommendations.append("• Enhance response relevance by improving question understanding and response generation")

# Precision recommendations
if precision_score < 0.8:
    recommendations.append("• Improve context precision by refining retrieval algorithms and filtering mechanisms")

# Recall recommendations
if recall_score < 0.8:
    recommendations.append("• Enhance context recall by expanding the knowledge base and improving retrieval coverage")

# General recommendations
if overall_score < 0.8:
    recommendations.extend([
        "• Consider fine-tuning the LLM on domain-specific data",
        "• Implement feedback loops to continuously improve performance",
        "• Add more diverse test cases to the evaluation dataset",
        "• Consider ensemble methods for better response quality"
    ])

if recommendations:
    for rec in recommendations:
        print(rec)
else:
    print("• System performance is excellent - consider monitoring for consistency")
    print("• Expand the test dataset to cover more edge cases")
    print("• Implement A/B testing for continuous improvement")


In [None]:
print("\n🎯 Overall Pipeline Effectiveness Assessment:")
print("=" * 50)

print(f"\n📊 Summary Statistics:")
print(f"• Total Test Cases: {len(evaluation_data)}")
print(f"• Overall Score: {overall_score:.3f}")
print(f"• Performance Level: {performance_level}")
print(f"• Best Metric: {best_metric[0]} ({best_metric[1]:.3f})")
print(f"• Worst Metric: {worst_metric[0]} ({worst_metric[1]:.3f})")

print(f"\n🔍 Key Insights:")
print(f"• The Traceback system demonstrates {'strong' if overall_score >= 0.7 else 'moderate' if overall_score >= 0.6 else 'weak'} performance on average across the four RAGAS metrics")
print(f"• {'The system excels at' if best_metric[1] >= 0.8 else 'The system shows good performance in'} {best_metric[0].lower()}")
print(f"• {'Significant improvement needed in' if worst_metric[1] < 0.6 else 'Some improvement possible in'} {worst_metric[0].lower()}")

print(f"\n✅ Conclusion:")
if overall_score >= 0.8:
    conclusion = "The Traceback pipeline is highly effective and ready for production deployment."
elif overall_score >= 0.7:
    conclusion = "The Traceback pipeline shows good effectiveness with room for targeted improvements."
elif overall_score >= 0.6:
    conclusion = "The Traceback pipeline demonstrates fair effectiveness but requires significant improvements."
else:
    conclusion = "The Traceback pipeline needs substantial improvements before production deployment."

print(conclusion)


## 9. Save Results


In [None]:
# Save detailed results
results_output_path = Path.cwd().parent / "data" / "ragas_evaluation_results.json"

evaluation_summary = {
    "evaluation_date": pd.Timestamp.now().isoformat(),
    "total_test_cases": len(evaluation_data),
    "overall_metrics": overall_metrics,
    "overall_score": float(overall_score),
    "performance_level": performance_level,
    "detailed_results": results_df.to_dict('records'),
    "recommendations": recommendations if recommendations else ["System performance is excellent"]
}

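# Create the output directory if needed; default=str guards against non-JSON-native values
results_output_path.parent.mkdir(parents=True, exist_ok=True)
with open(results_output_path, 'w', encoding='utf-8') as f:
    json.dump(evaluation_summary, f, indent=2, default=str)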

print(f"✅ Results saved to: {results_output_path}")
print(f"📊 Evaluation completed successfully!")
print(f"🎯 Overall Pipeline Score: {overall_score:.3f} ({performance_level})")
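
After saving, reading the file back is a cheap way to confirm that the summary is valid JSON and that nothing was lost on the way to disk. The sketch below does this round trip in a temporary directory. `save_and_reload` and the demo summary are hypothetical, and the file name only mirrors the one used above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_and_reload(summary: Dict[str, Any], path: Path) -> Dict[str, Any]:
    """Write `summary` as JSON, then read it back to confirm it round-trips."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2, default=str)
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Illustrative summary; in the notebook you would pass evaluation_summary.
demo_summary = {
    "total_test_cases": 5,
    "overall_metrics": {"Faithfulness": 0.655, "Answer Relevancy": 0.873},
    "recommendations": ["Add more diverse test cases"],
}

with tempfile.TemporaryDirectory() as tmp:
    reloaded = save_and_reload(demo_summary, Path(tmp) / "data" / "ragas_evaluation_results.json")

print(reloaded == demo_summary)  # True
```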
