# Task 5: RAGAS Evaluation - Comprehensive Golden Test Data Set

## ✅ Updated for 15-Spec System
This notebook has been updated to evaluate the comprehensive 15-spec Traceback system covering all business domains and operational guides.

## Objective
Assess the Traceback pipeline using the RAGAS framework with key metrics:
- **Faithfulness**: Are the claims in the generated response supported by the retrieved context?
- **Response Relevance**: How relevant are the responses to the questions? (`answer_relevancy` in the code)
- **Context Precision**: How much of the retrieved context is actually relevant, and is it ranked near the top?
- **Context Recall**: How well does the retrieved context cover the ground truth?
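
As a rough illustration of how two of these metrics reduce to ratios, the sketch below computes a "supported fraction" from hand-set verdicts. In RAGAS the verdicts come from an LLM judge; the verdict lists, the helper name `supported_ratio`, and the choice to return 0.0 for an empty list are assumptions made for this example.

```python
from typing import Sequence


def supported_ratio(verdicts: Sequence[bool]) -> float:
    """Fraction of statements judged as supported; 0.0 when there are none."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts, hand-set for illustration (RAGAS obtains them from an LLM judge).
# Faithfulness: is each claim in the answer supported by the retrieved context?
answer_claim_verdicts = [True, True, False, True]
# Context recall: can each ground-truth statement be attributed to the retrieved context?
ground_truth_verdicts = [True, False, False, False]

faithfulness_score = supported_ratio(answer_claim_verdicts)
context_recall_score = supported_ratio(ground_truth_verdicts)
print(f"faithfulness={faithfulness_score:.2f}, context_recall={context_recall_score:.2f}")
```

A low context recall with a moderate faithfulness, as in this toy input, would point at retrieval rather than generation.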

## Methodology
1. Create a comprehensive golden test dataset covering all 15 specs (10 business domains and 5 operational guides)
2. Generate responses using our enhanced Traceback system
3. Evaluate using RAGAS metrics
4. Analyze results and draw conclusions about pipeline effectiveness

## Test Coverage
- **10 Business Domains**: Sales Orders, Customer Analytics, Inventory Management, Financial Reporting, Marketing Attribution, Supply Chain, HR Analytics, Product Analytics, Risk Management, Compliance Monitoring
- **5 Operational Guides**: Incident Playbook, Data Quality Standards, Troubleshooting Guide, SLA Definitions, Escalation Procedures


## 1. Setup and Imports


In [86]:
import os
import sys
import json
import pandas as pd
import numpy as np
from pathlib import Path
from typing import List, Dict, Any
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Verify API keys
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set. Create a .env file or export it in your shell.")

print("✅ Environment setup complete")


✅ Environment setup complete


In [87]:
# Add src to path for imports
sys.path.insert(0, str(Path.cwd().parent / "src"))

# Import RAGAS components
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Import our Traceback system
from tracebackcore.core import traceback_graph, lineage_retriever, AgentState, initialize_system

print("✅ Imports complete")


✅ Imports complete


## 2. Initialize Traceback System


In [88]:
# Initialize the Traceback system
print("🚀 Initializing Traceback system...")
initialize_system()
print("✅ Traceback system initialized successfully")


🚀 Initializing Traceback system...
🚀 Initializing Traceback system...
✅ Traceback system initialized successfully
✅ Traceback system initialized successfully


## 3. Create Golden Test Dataset

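We'll create a comprehensive test dataset covering various incident scenarios. Each test case pairs a question with a ground-truth answer and a hand-written reference `context`. Only the question and ground truth feed the evaluation; the contexts scored by RAGAS are the ones the system retrieves at run time.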


In [89]:
# Updated: Comprehensive Golden Test Dataset (10 Business Domains + 5 Operational Guides)

# Replace the old test data with comprehensive coverage
golden_test_data = [
    # Sales Orders Domain
    {
        "question": "What should I do if the sales orders pipeline fails?",
        "ground_truth": "If the sales orders pipeline fails, follow these steps: 1) Assess business impact and determine blast radius affecting curated.sales_orders, curated.revenue_summary, and analytics.customer_behavior, 2) Check pipeline logs for error messages, 3) Verify data source availability (raw.sales_orders, raw.customers, raw.products), 4) Validate data quality metrics, 5) Test individual pipeline components, 6) Review recent changes or deployments. The pipeline has 99.9% uptime SLA and 2-hour freshness requirement. Escalate to data-sales team for P0/P1 incidents.",
        "context": [
            "Sales orders pipeline processes raw order data into curated datasets for analytics and reporting.",
            "SLA commitments: 99.9% uptime, 2-hour freshness, <0.1% error rate",
            "Downstream dependencies: curated.revenue_summary, bi.daily_sales, analytics.customer_behavior",
            "Common failure patterns: data quality issues, dependency failures, performance degradation"
        ]
    },
    
    # Customer Analytics Domain
    {
        "question": "How does the customer analytics pipeline segment customers?",
        "ground_truth": "The customer analytics pipeline segments customers based on lifetime value: VIP (>$10,000), High Value ($5,000-$10,000), Medium Value ($1,000-$5,000), and Low Value (<$1,000). It also calculates behavioral metrics including engagement scores based on interaction frequency, churn risk using ML models, and purchase propensity for next 30-day purchases. The pipeline processes curated.sales_orders, raw.customer_interactions, raw.marketing_campaigns, and raw.support_tickets with 99.5% uptime SLA and daily updates by 6 AM.",
        "context": [
            "Customer analytics pipeline for segmentation, lifetime value calculation, and behavioral analysis",
            "Customer segmentation: VIP, High Value, Medium Value, Low Value based on lifetime value",
            "Behavioral metrics: engagement score, churn risk, purchase propensity",
            "Data sources: curated.sales_orders, raw.customer_interactions, raw.marketing_campaigns, raw.support_tickets"
        ]
    },
    
    # Inventory Management Domain
    {
        "question": "What are the stock level management rules for inventory?",
        "ground_truth": "Inventory stock levels are managed with these rules: Critical (<10 units remaining), Low (10-50 units remaining), Normal (50-200 units remaining), and High (>200 units remaining). The reorder logic includes auto-reorder when stock falls below reorder point, manual approval for high-value items, and seasonal adjustments based on historical patterns. The system uses real-time updates (<5 minutes) with 99.9% uptime SLA and <0.01% error rate, processing raw.inventory_transactions, raw.warehouse_locations, raw.supplier_data, and raw.demand_forecasts.",
        "context": [
            "Real-time inventory tracking and management for warehouse operations and demand forecasting",
            "Stock level management: Critical, Low, Normal, High based on remaining units",
            "Reorder logic: auto-reorder, manual approval, seasonal adjustments",
            "Real-time updates with 99.9% uptime SLA and <0.01% error rate"
        ]
    },
    
    # Financial Reporting Domain
    {
        "question": "What financial controls are implemented in the reporting pipeline?",
        "ground_truth": "The financial reporting pipeline implements comprehensive financial controls including daily bank reconciliation, month-end accrual processing, asset depreciation calculations, and automated tax computations. It ensures SOX compliance with segregation of duties, follows GAAP standards, and maintains complete audit trails for all transactions. The pipeline processes raw.general_ledger, raw.accounts_payable, raw.accounts_receivable, and raw.budget_data with 99.95% uptime SLA, daily processing by 8 AM, and <0.001% error rate for financial precision.",
        "context": [
            "Comprehensive financial data processing for regulatory compliance and management reporting",
            "Financial controls: reconciliation, accruals, depreciation, tax calculations",
            "Compliance requirements: SOX compliance, GAAP standards, audit trails",
            "High precision requirements: 99.95% uptime, <0.001% error rate"
        ]
    },
    
    # Marketing Attribution Domain
    {
        "question": "What attribution models are used for marketing campaigns?",
        "ground_truth": "The marketing attribution pipeline uses multiple attribution models: First Touch (credit to first interaction), Last Touch (credit to final interaction), Linear (equal credit to all touchpoints), and Time Decay (more credit to recent interactions). It calculates ROI metrics including Campaign ROI (Revenue/Campaign Cost), Channel ROI (Revenue/Channel Investment), and Customer LTV (Lifetime Value). The pipeline processes raw.marketing_touchpoints, raw.campaign_performance, raw.conversion_events, and raw.customer_journey with 99.0% uptime SLA and weekly updates by Monday 9 AM.",
        "context": [
            "Multi-touch attribution modeling for marketing campaign effectiveness and ROI analysis",
            "Attribution models: First Touch, Last Touch, Linear, Time Decay",
            "ROI calculations: Campaign ROI, Channel ROI, Customer LTV",
            "Weekly processing with 99.0% uptime SLA"
        ]
    },
    
    # Supply Chain Domain
    {
        "question": "How is supplier performance measured in the supply chain?",
        "ground_truth": "Supplier performance is measured using key metrics: on-time delivery (>95% target), quality score (>98% target), cost efficiency with budget variance tracking, and risk assessment for supplier stability. The system also optimizes logistics with route optimization for cost and time minimization, strategic inventory positioning, and ML-based demand forecasting. The pipeline processes raw.supplier_performance, raw.logistics_data, raw.procurement_data, and raw.quality_metrics with 99.5% uptime SLA and daily updates by 7 AM.",
        "context": [
            "End-to-end supply chain visibility and optimization for cost reduction and efficiency",
            "Supplier performance metrics: on-time delivery, quality score, cost efficiency, risk assessment",
            "Logistics optimization: route optimization, inventory positioning, demand forecasting",
            "Daily processing with 99.5% uptime SLA"
        ]
    },
    
    # HR Analytics Domain
    {
        "question": "What employee metrics are tracked in the HR analytics pipeline?",
        "ground_truth": "The HR analytics pipeline tracks comprehensive employee metrics including retention rate (annual turnover calculations), performance scores (quarterly evaluations), engagement metrics (survey-based indicators), and career progression (promotion and growth tracking). It also includes predictive analytics with ML-based churn prediction for retention modeling, performance forecasting for future performance prediction, and skill gap analysis for training needs identification. The pipeline processes raw.employee_data, raw.performance_reviews, raw.attendance_data, and raw.learning_records with 99.0% uptime SLA and monthly updates by 5th of month.",
        "context": [
            "Employee lifecycle analytics for talent management, retention, and performance optimization",
            "Employee metrics: retention rate, performance scores, engagement metrics, career progression",
            "Predictive analytics: churn prediction, performance forecasting, skill gap analysis",
            "Monthly processing with 99.0% uptime SLA"
        ]
    },
    
    # Product Analytics Domain
    {
        "question": "What product usage metrics are monitored in real-time?",
        "ground_truth": "The product analytics pipeline monitors real-time usage metrics including DAU/MAU (Daily and Monthly Active Users), feature adoption rates for new features, session analytics for user journey analysis, and conversion funnels for step-by-step conversion tracking. It also tracks product KPIs including engagement scores (user activity level), retention rates (user return behavior), feature stickiness (feature retention metrics), and NPS tracking (Net Promoter Score monitoring). The pipeline processes raw.user_interactions, raw.feature_usage, raw.user_feedback, and raw.performance_metrics with 99.5% uptime SLA and real-time updates (<1 minute).",
        "context": [
            "Comprehensive product usage analytics for feature optimization and user experience",
            "Usage metrics: DAU/MAU, feature adoption, session analytics, conversion funnels",
            "Product KPIs: engagement score, retention rate, feature stickiness, NPS tracking",
            "Real-time processing with <1 minute updates"
        ]
    },
    
    # Risk Management Domain
    {
        "question": "How is risk scoring calculated in the risk management system?",
        "ground_truth": "Risk scoring is calculated across multiple dimensions: credit risk (customer creditworthiness), operational risk (process failure probability), market risk (external market volatility), and compliance risk (regulatory violation probability). The system provides real-time monitoring with threshold-based notifications, executive reporting dashboards, historical risk pattern analysis, and risk reduction measure tracking. The pipeline processes raw.transaction_data, raw.customer_data, raw.market_data, and raw.compliance_data with 99.95% uptime SLA and real-time updates (<30 seconds) for critical compliance requirements.",
        "context": [
            "Comprehensive risk assessment and monitoring for operational, financial, and compliance risks",
            "Risk scoring: credit risk, operational risk, market risk, compliance risk",
            "Risk monitoring: real-time alerts, dashboards, trend analysis, mitigation tracking",
            "Critical compliance requirements with <30 second updates"
        ]
    },
    
    # Compliance Monitoring Domain
    {
        "question": "What regulatory compliance standards are monitored?",
        "ground_truth": "The compliance monitoring system monitors multiple regulatory standards including GDPR (Data privacy and protection), SOX (Financial controls and reporting), PCI DSS (Payment card data security), and HIPAA (Healthcare data protection). It implements monitoring rules for data access (unauthorized access detection), data retention (compliance with retention policies), data quality (accuracy and completeness checks), and audit trails (complete activity logging). The pipeline processes raw.audit_logs, raw.transaction_data, raw.customer_data, and raw.employee_data with 99.99% uptime SLA and real-time monitoring (<10 seconds) for critical compliance requirements.",
        "context": [
            "Automated compliance monitoring and reporting for regulatory requirements and internal policies",
            "Regulatory compliance: GDPR, SOX, PCI DSS, HIPAA",
            "Monitoring rules: data access, data retention, data quality, audit trails",
            "Critical compliance with <10 second monitoring"
        ]
    },
    
    # Operational Guides
    {
        "question": "What are the escalation procedures for data pipeline incidents?",
        "ground_truth": "Escalation procedures follow a structured path: Level 1 (Data Engineer, 0-30 min), Level 2 (Senior Data Engineer, 30-60 min), Level 3 (Data Engineering Lead, 60-120 min), Level 4 (Engineering Manager, 120+ min). Escalation triggers include SLA breach imminent or occurred, multiple downstream systems affected, business-critical functionality impacted, and no resolution within expected timeframe. Communication protocols include immediate Slack alerts to #data-incidents, 15-minute email to stakeholders, 30-minute status page update, and 60-minute executive notification for P0/P1 incidents.",
        "context": [
            "Incident escalation procedures for data pipeline incidents",
            "Escalation paths: Level 1-4 with specific timeframes and roles",
            "Escalation triggers: SLA breach, multiple systems affected, critical functionality",
            "Communication protocols: Slack, email, status page, executive notification"
        ]
    },
    
    # Data Quality Standards
    {
        "question": "What are the data quality monitoring thresholds?",
        "ground_truth": "Data quality monitoring uses automated checks including schema validation, data freshness monitoring, anomaly detection, and statistical quality metrics. Alerting thresholds are set at Critical (>1% data quality issues), Warning (>0.1% data quality issues), and Info (quality metrics trending). The monitoring framework covers completeness (no missing values in critical fields, all expected records present, referential integrity maintained), accuracy (data matches source systems, business rules validated, calculated fields verified), consistency (format standards applied, naming conventions followed, data types consistent), and timeliness (data available within SLA windows, processing delays monitored, stale data alerts configured).",
        "context": [
            "Data quality standards and monitoring framework",
            "Quality dimensions: completeness, accuracy, consistency, timeliness",
            "Automated checks: schema validation, freshness monitoring, anomaly detection",
            "Alerting thresholds: Critical, Warning, Info levels"
        ]
    },
    
    # SLA Definitions
    {
        "question": "What are the different SLA tiers for data freshness?",
        "ground_truth": "Data freshness SLAs are tiered as follows: Real-time (<5 minutes delay), Near real-time (<1 hour delay), Batch (<4 hours delay), and Historical (<24 hours delay). Availability SLAs include Critical Systems (99.9% uptime), Important Systems (99.5% uptime), and Standard Systems (99.0% uptime). Recovery Time Objectives (RTO) are P0 Incidents (<1 hour), P1 Incidents (<4 hours), P2 Incidents (<24 hours), and P3 Incidents (<72 hours). Data accuracy requirements vary by type: Financial Data (<0.01% error rate), Operational Data (<0.1% error rate), and Analytical Data (<1% error rate).",
        "context": [
            "Service Level Agreement definitions for data pipeline operations",
            "Data freshness SLAs: Real-time, Near real-time, Batch, Historical",
            "Availability SLAs: Critical, Important, Standard systems",
            "Recovery Time Objectives: P0-P3 incident classifications"
        ]
    },
    
    # Troubleshooting Guide
    {
        "question": "What are the common failure patterns in data pipelines?",
        "ground_truth": "Common failure patterns include Data Quality Issues (symptoms: null values, invalid formats, constraint violations; root causes: source system changes, data corruption, schema drift; solutions: data validation, schema enforcement, source monitoring), Performance Degradation (symptoms: slow queries, timeouts, resource exhaustion; root causes: data volume growth, inefficient queries, resource constraints; solutions: query optimization, resource scaling, partitioning), and Dependency Failures (symptoms: missing upstream data, broken references; root causes: upstream pipeline failures, API outages, network issues; solutions: dependency monitoring, fallback mechanisms, retry logic). Diagnostic procedures include checking pipeline logs, verifying data source availability, validating data quality metrics, testing individual components, and reviewing recent changes.",
        "context": [
            "Data pipeline troubleshooting guide for common failure patterns",
            "Failure patterns: data quality issues, performance degradation, dependency failures",
            "Diagnostic procedures: logs, data sources, quality metrics, components, changes",
            "Solutions: validation, optimization, monitoring, fallback mechanisms"
        ]
    },
    
    # Incident Playbook
    {
        "question": "What are the severity levels for data pipeline incidents?",
        "ground_truth": "Data pipeline incidents are classified into four severity levels: P0 (Critical business impact, revenue loss), P1 (High impact, SLA breach risk), P2 (Medium impact, degraded service), and P3 (Low impact, minor issues). Response procedures include initial assessment (0-15 minutes) with incident acknowledgment, business impact assessment, blast radius determination, and stakeholder notification. Common actions include rollback (revert to last known good state), hotfix (apply targeted fix), backfill (reprocess affected data), and skip (bypass failed step if non-critical). The escalation matrix involves Data Engineering Lead for P0/P1 incidents, Platform Team for infrastructure issues, and Product Manager for business impact assessment.",
        "context": [
            "Data pipeline incident response playbook with severity classifications",
            "Severity levels: P0 (Critical), P1 (High), P2 (Medium), P3 (Low)",
            "Response procedures: initial assessment, common actions, escalation matrix",
            "Timeline: 0-15 minutes for initial assessment"
        ]
    }
]

print(f"✅ Updated golden test dataset with {len(golden_test_data)} comprehensive test cases")
print("📊 Coverage: 10 business domains + 5 operational guides")
print("🎯 RAGAS Improvement: Enhanced test coverage for better evaluation")
print("🚀 Ready for comprehensive RAGAS evaluation!")


✅ Updated golden test dataset with 15 comprehensive test cases
📊 Coverage: 10 business domains + 5 operational guides
🎯 RAGAS Improvement: Enhanced test coverage for better evaluation
🚀 Ready for comprehensive RAGAS evaluation!
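
Before generating responses, it can help to confirm that every test case has the fields the later cells rely on. The following is a minimal sketch of such a check; `validate_test_cases`, `REQUIRED_KEYS`, and the two-case sample are hypothetical and not part of the notebook.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("question", "ground_truth", "context")


def validate_test_cases(cases: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means every case is well-formed."""
    problems: List[str] = []
    seen_questions = set()
    for index, case in enumerate(cases):
        for key in REQUIRED_KEYS:
            if not case.get(key):  # catches missing keys, empty strings, and empty lists
                problems.append(f"case {index}: missing or empty '{key}'")
        question = case.get("question")
        if question and question in seen_questions:
            problems.append(f"case {index}: duplicate question")
        if question:
            seen_questions.add(question)
    return problems


# Hypothetical two-case sample; the second case is deliberately broken.
sample_cases = [
    {"question": "What are the severity levels?", "ground_truth": "P0 to P3.",
     "context": ["Severity levels: P0-P3"]},
    {"question": "What are the severity levels?", "ground_truth": "", "context": []},
]
for problem in validate_test_cases(sample_cases):
    print(problem)
```

Calling `validate_test_cases(golden_test_data)` in the notebook should return an empty list if the dataset above is intact.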


## 4. Generate Responses Using Traceback System


In [90]:
def generate_traceback_response(question: str) -> Dict[str, Any]:
    """Generate response using our Traceback system."""
    try:
        # Create initial state
        initial_state = AgentState(
            question=question,
            context=[],
            impact_assessment=None,
            blast_radius=None,
            recommended_actions=None,
            incident_brief=None,
            current_step="supervisor",
            error=None
        )
        
        # Run the workflow
        result = traceback_graph.invoke(initial_state)
        
        # Get retrieved context with a separate retriever call so RAGAS can score it
        # (this may differ from the documents the graph used internally)
        retrieved_docs = lineage_retriever.search_with_lineage(question, k=5)
        retrieved_contexts = [doc.page_content for doc in retrieved_docs]
        
        # Extract relevant information
        return {
            "answer": result.get("incident_brief", "No response generated"),
            "context": retrieved_contexts,  # Use actual retrieved context
            "blast_radius": result.get("blast_radius", []),
            "impact_assessment": result.get("impact_assessment", {})
        }
    except Exception as e:
        return {
            "answer": f"Error generating response: {str(e)}",
            "context": [],
            "blast_radius": [],
            "impact_assessment": {}
        }

print("✅ Response generation function defined")


✅ Response generation function defined
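
Note that `generate_traceback_response` turns an exception into an answer string with an empty context list, so a failed run would still be scored by RAGAS as if it were a real answer. One option is to keep failures out of the scored set. The sketch below shows the idea with a hypothetical `collect_responses` helper and a hypothetical `fake_generate` stand-in that raises instead of returning an error string; neither is part of the Traceback code base.

```python
from typing import Any, Callable, Dict, List, Tuple


def collect_responses(
    questions: List[str],
    generate: Callable[[str], Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, str]]]:
    """Split questions into scorable responses and failures that should not be scored."""
    scorable: List[Dict[str, Any]] = []
    failures: List[Dict[str, str]] = []
    for question in questions:
        try:
            response = generate(question)
        except Exception as exc:  # keep going, but remember why this case failed
            failures.append({"question": question, "reason": f"error: {exc}"})
            continue
        if not response.get("context"):
            failures.append({"question": question, "reason": "no retrieved context"})
            continue
        scorable.append({"question": question, **response})
    return scorable, failures


def fake_generate(question: str) -> Dict[str, Any]:
    """Hypothetical stand-in for the real pipeline call; it raises on one input."""
    if "fail" in question:
        raise RuntimeError("graph invocation failed")
    return {"answer": f"Answer to: {question}", "context": ["some retrieved text"]}


scorable, failures = collect_responses(["How is risk scored?", "Please fail"], fake_generate)
print(len(scorable), "scorable;", len(failures), "failed")
```

Reporting the failure count next to the metric means makes it clear how many of the 15 cases the scores actually cover.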


In [91]:
# Generate responses for all test cases
print("🔄 Generating responses using Traceback system...")

evaluation_data = []
for i, test_case in enumerate(golden_test_data):
    print(f"Processing test case {i+1}/{len(golden_test_data)}: {test_case['question'][:50]}...")
    
    # Generate response
    response = generate_traceback_response(test_case["question"])
    
    # Prepare data for RAGAS evaluation
    # Use actual retrieved context from our system, not predefined context
    evaluation_data.append({
        "question": test_case["question"],
        "answer": response["answer"],
        "contexts": response["context"],  # Use actual retrieved context
        "ground_truth": test_case["ground_truth"]
    })

print(f"✅ Generated responses for {len(evaluation_data)} test cases")


🔄 Generating responses using Traceback system...
Processing test case 1/15: What should I do if the sales orders pipeline fail...


Processing test case 2/15: How does the customer analytics pipeline segment c...
Processing test case 3/15: What are the stock level management rules for inve...
Processing test case 4/15: What financial controls are implemented in the rep...
Processing test case 5/15: What attribution models are used for marketing cam...
Processing test case 6/15: How is supplier performance measured in the supply...
Processing test case 7/15: What employee metrics are tracked in the HR analyt...
Processing test case 8/15: What product usage metrics are monitored in real-t...
Processing test case 9/15: How is risk scoring calculated in the risk managem...
Processing test case 10/15: What regulatory compliance standards are monitored...
Processing test case 11/15: What are the escalation procedures for data pipeli...
Processing test case 12/15: What are the data quality monitoring thresholds?...
Processing test case 13/15: What are the different SLA tiers for data freshnes...
Processing test case 14/15

## 5. RAGAS Evaluation


In [92]:
# Convert to RAGAS Dataset format
ragas_dataset = Dataset.from_list(evaluation_data)

print(f"📊 RAGAS dataset created with {len(ragas_dataset)} samples")
print(f"Dataset columns: {ragas_dataset.column_names}")

# Verify the data format
print("\n🔍 Sample data format verification:")
sample = ragas_dataset[0]
print(f"Question: {sample['question'][:50]}...")
print(f"Answer length: {len(sample['answer'])} characters")
print(f"Contexts count: {len(sample['contexts'])}")
print(f"Contexts type: {type(sample['contexts'])}")
print(f"First context: {sample['contexts'][0][:50]}...")
print(f"Ground truth length: {len(sample['ground_truth'])} characters")


📊 RAGAS dataset created with 15 samples
Dataset columns: ['question', 'answer', 'contexts', 'ground_truth']

🔍 Sample data format verification:
Question: What should I do if the sales orders pipeline fail...
Answer length: 3512 characters
Contexts count: 3
Contexts type: <class 'list'>
First context: Sales orders pipeline processes raw order data int...
Ground truth length: 557 characters


### Note: RAGAS EvaluationResult Object
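RAGAS returns an `EvaluationResult` object, not a dictionary. To access the results, use:
- `result.to_pandas()` to get a DataFrame with one row per sample and one column per metric
- `result.scores` to get the per-sample scores (attribute name may vary between RAGAS versions)
- the metric columns of that DataFrame to get the metric names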
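
If the per-sample scores are available as a list of dictionaries (for example, built from the metric columns of `result.to_pandas()`), the overall score per metric is a plain mean. The sketch below assumes that shape and skips NaN entries, which can appear when a metric could not be computed for a sample; the helper name and the sample values are hypothetical.

```python
import math
from typing import Dict, List


def mean_scores(per_sample: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over samples, skipping NaN entries."""
    collected: Dict[str, List[float]] = {}
    for row in per_sample:
        for metric, value in row.items():
            if not math.isnan(value):
                collected.setdefault(metric, []).append(value)
    return {metric: sum(values) / len(values) for metric, values in collected.items()}


# Hypothetical per-sample scores; the NaN marks a metric that could not be computed.
example_scores = [
    {"faithfulness": 0.8, "context_recall": 0.5},
    {"faithfulness": 0.4, "context_recall": float("nan")},
]
for metric, score in mean_scores(example_scores).items():
    print(f"{metric}: {score:.3f}")
```

A metric whose every entry is NaN is simply absent from the returned dictionary, which is a design choice of this sketch rather than RAGAS behavior.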


In [93]:
# Test with a smaller subset first to verify everything works
print("🧪 Testing RAGAS evaluation with first 2 samples...")

# Create a small test dataset
test_dataset = Dataset.from_list(evaluation_data[:2])

# Define metrics to evaluate
metrics = [
    faithfulness,       # Are the answer's claims supported by the retrieved context?
    answer_relevancy,   # How relevant are the responses to the questions?
    context_precision,  # How precise is the retrieved context?
    context_recall      # How well does the context cover the ground truth?
]

print("🔄 Running RAGAS evaluation on test subset...")
print("This may take a few minutes...")

# Run evaluation on test subset
test_result = evaluate(
    test_dataset,
    metrics=metrics
)

print("✅ RAGAS test evaluation completed!")
print(f"Test result type: {type(test_result)}")

# Convert to pandas DataFrame to see the results
test_df = test_result.to_pandas()
print(f"Test results shape: {test_df.shape}")
print(f"Test results columns: {list(test_df.columns)}")
print("\n📊 Test Results Summary:")
for metric in ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']:
    if metric in test_df.columns:
        print(f"{metric}: {test_df[metric].mean():.3f}")


🧪 Testing RAGAS evaluation with first 2 samples...
🔄 Running RAGAS evaluation on test subset...
This may take a few minutes...



✅ RAGAS test evaluation completed!
Test result type: <class 'ragas.dataset_schema.EvaluationResult'>
Test results shape: (2, 8)
Test results columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']

📊 Test Results Summary:
faithfulness: 0.408
answer_relevancy: 0.444
context_precision: 0.250
context_recall: 0.062


In [94]:
# Display detailed test results
print("📋 Detailed Test Results:")
print("=" * 50)
display(test_df)

# Sanity check: mean faithfulness and answer relevancy are both above 0
core_metrics = ['faithfulness', 'answer_relevancy']
test_passed = all(test_df[metric].mean() > 0 for metric in core_metrics if metric in test_df.columns)
if test_passed:
    print("\n✅ Test PASSED: Ready for full evaluation")
else:
    print("\n❌ Test FAILED: Fix the pipeline before running the full evaluation")


📋 Detailed Test Results:


| | user_input | retrieved_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | What should I do if the sales orders pipeline ... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Sales Orders Pipeline Failur... | If the sales orders pipeline fails, follow the... | 0.763158 | 0.0 | 0.5 | 0.125 |
| 1 | How does the customer analytics pipeline segme... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Customer Analytics Pipeline ... | The customer analytics pipeline segments custo... | 0.052632 | 0.88719 | 0.0 | 0.0 |



✅ Test PASSED: Ready for full evaluation
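
The check above only requires the mean to be above zero, so it passes even when individual samples score exactly 0 on a metric. A per-sample view, sketched below with a hypothetical `zero_score_cells` helper and the two rows from the test table, shows which sample and metric pairs deserve a closer look.

```python
from typing import Dict, List, Sequence, Tuple

METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def zero_score_cells(
    rows: List[Dict[str, float]], metrics: Sequence[str] = METRICS
) -> List[Tuple[int, str]]:
    """Return (row index, metric name) for every score that is exactly 0."""
    return [
        (index, metric)
        for index, row in enumerate(rows)
        for metric in metrics
        if row.get(metric) == 0.0
    ]


# Metric values copied from the two-sample test results shown above.
test_rows = [
    {"faithfulness": 0.763158, "answer_relevancy": 0.0,
     "context_precision": 0.5, "context_recall": 0.125},
    {"faithfulness": 0.052632, "answer_relevancy": 0.88719,
     "context_precision": 0.0, "context_recall": 0.0},
]
for index, metric in zero_score_cells(test_rows):
    print(f"sample {index}: {metric} is 0")
```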


In [95]:
# Run full evaluation on all samples
print("🚀 Running full RAGAS evaluation on all samples...")
print("This may take several minutes...")

# Run evaluation on full dataset
result = evaluate(
    ragas_dataset,
    metrics=metrics
)

print("✅ RAGAS evaluation completed!")


🚀 Running full RAGAS evaluation on all samples...
This may take several minutes...



✅ RAGAS evaluation completed!


## 6. Results Analysis


In [96]:
# Extract results
results_df = result.to_pandas()

print("📊 RAGAS Evaluation Results:")
print("=" * 50)

# Display overall metrics
overall_metrics = {
    "Faithfulness": results_df['faithfulness'].mean(),
    "Answer Relevancy": results_df['answer_relevancy'].mean(),
    "Context Precision": results_df['context_precision'].mean(),
    "Context Recall": results_df['context_recall'].mean()
}

print("\n🎯 Overall Performance Metrics:")
for metric, score in overall_metrics.items():
    print(f"{metric:20}: {score:.3f}")

print("\n📋 Detailed Results:")
display(results_df)


📊 RAGAS Evaluation Results:

🎯 Overall Performance Metrics:
Faithfulness        : 0.469
Answer Relevancy    : 0.787
Context Precision   : 0.033
Context Recall      : 0.017

📋 Detailed Results:


| | user_input | retrieved_contexts | response | reference | faithfulness | answer_relevancy | context_precision | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | What should I do if the sales orders pipeline ... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Sales Orders Pipeline Failur... | If the sales orders pipeline fails, follow the... | 0.571429 | 0.0 | 0.5 | 0.0 |
| 1 | How does the customer analytics pipeline segme... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Customer Analytics Pipeline ... | The customer analytics pipeline segments custo... | 0.078947 | 0.89255 | 0.0 | 0.0 |
| 2 | What are the stock level management rules for ... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Inventory Stock Level Manage... | Inventory stock levels are managed with these ... | 0.088235 | 0.89484 | 0.0 | 0.0 |
| 3 | What financial controls are implemented in the... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Reporting Pipeline Disruptio... | The financial reporting pipeline implements co... | 0.631579 | 0.845754 | 0.0 | 0.0 |
| 4 | What attribution models are used for marketing... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Marketing Campaign Attributi... | The marketing attribution pipeline uses multip... | 0.482759 | 0.89389 | 0.0 | 0.0 |
| 5 | How is supplier performance measured in the su... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Supplier Performance Measure... | Supplier performance is measured using key met... | 0.111111 | 0.879565 | 0.0 | 0.0 |
| 6 | What employee metrics are tracked in the HR an... | [Sales orders pipeline processes raw order dat... | # Incident Brief: HR Analytics Pipeline Disrup... | The HR analytics pipeline tracks comprehensive... | 0.027778 | 0.898012 | 0.0 | 0.0 |
| 7 | What product usage metrics are monitored in re... | [Sales orders pipeline processes raw order dat... | # Incident Brief: Sales Orders Pipeline Disrup... | The product analytics pipeline monitors real-t... | 0.95122 | 0.786488 | 0.0 | 0.0 |
| 8 | How is risk scoring calculated in the risk man... | [Data pipeline incident response procedures: 1... | # Incident Brief\n\n## 1. Incident Summary\nOn... | Risk scoring is calculated across multiple dim... | 0.705882 | 0.755445 | 0.0 | 0.0 |
| 9 | What regulatory compliance standards are monit... | [Data pipeline incident response procedures: 1... | # Incident Brief: Sales Orders Pipeline Disrup... | The compliance monitoring system monitors mult... | 0.277778 | 0.748935 | 0.0 | 0.0 |
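
In the rows shown above, the `retrieved_contexts` preview starts with the same sales orders text for rows 0 to 7 and with the same incident response text for rows 8 and 9, regardless of the question's domain. Counting the leading retrieved document per question is a quick way to surface this kind of pattern. The sketch below uses a hypothetical `leading_context_counts` helper and context lists rebuilt from the truncated previews in the table, so the strings are abbreviations rather than the real documents.

```python
from collections import Counter
from typing import List


def leading_context_counts(contexts_per_question: List[List[str]]) -> Counter:
    """Count how often each document appears as the first retrieved context."""
    return Counter(contexts[0] for contexts in contexts_per_question if contexts)


# Rebuilt from the truncated previews in the table above (abbreviated, for illustration).
sales_preview = "Sales orders pipeline processes raw order dat..."
incident_preview = "Data pipeline incident response procedures: 1..."
retrieved = [[sales_preview]] * 8 + [[incident_preview]] * 2

counts = leading_context_counts(retrieved)
for text, count in counts.most_common():
    print(f"{count:2d} x {text}")
```

When one document leads the results for most questions, near-zero context precision and recall are more likely a retrieval issue than a generation issue.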


In [97]:
# Debug: Check what columns are available in results_df
print("🔍 Debugging Results DataFrame:")
print("=" * 40)
print(f"DataFrame shape: {results_df.shape}")
print(f"Available columns: {list(results_df.columns)}")
print("DataFrame head:")
display(results_df.head())


🔍 Debugging Results DataFrame:
DataFrame shape: (15, 8)
Available columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
DataFrame head:


Unnamed: 0,user_input,retrieved_contexts,response,reference,faithfulness,answer_relevancy,context_precision,context_recall
0,What should I do if the sales orders pipeline ...,[Sales orders pipeline processes raw order dat...,# Incident Brief: Sales Orders Pipeline Failur...,"If the sales orders pipeline fails, follow the...",0.571429,0.0,0.5,0.0
1,How does the customer analytics pipeline segme...,[Sales orders pipeline processes raw order dat...,# Incident Brief: Customer Analytics Pipeline ...,The customer analytics pipeline segments custo...,0.078947,0.89255,0.0,0.0
2,What are the stock level management rules for ...,[Sales orders pipeline processes raw order dat...,# Incident Brief: Inventory Stock Level Manage...,Inventory stock levels are managed with these ...,0.088235,0.89484,0.0,0.0
3,What financial controls are implemented in the...,[Sales orders pipeline processes raw order dat...,# Incident Brief: Reporting Pipeline Disruptio...,The financial reporting pipeline implements co...,0.631579,0.845754,0.0,0.0
4,What attribution models are used for marketing...,[Sales orders pipeline processes raw order dat...,# Incident Brief: Marketing Campaign Attributi...,The marketing attribution pipeline uses multip...,0.482759,0.89389,0.0,0.0
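The preview above shows several different questions whose first retrieved context begins with the same "Sales orders pipeline..." text. A quick way to check whether retrieval keeps returning the same chunk is to count the first-ranked context per question. The sketch below is only an illustration: `demo_df` and `top_context_counts` are hypothetical names, and the strings are placeholders rather than the real retrieved text.

```python
import pandas as pd

# Hypothetical stand-in for results_df; the strings are placeholders, not real data.
demo_df = pd.DataFrame({
    "user_input": ["q1", "q2", "q3"],
    "retrieved_contexts": [
        ["Sales orders pipeline ...", "other chunk"],
        ["Sales orders pipeline ...", "another chunk"],
        ["Data pipeline incident response ...", "third chunk"],
    ],
})

def top_context_counts(df: pd.DataFrame, column: str = "retrieved_contexts") -> pd.Series:
    """Count how often each first-ranked context appears across all questions."""
    first = df[column].apply(lambda ctxs: ctxs[0] if len(ctxs) > 0 else None)
    return first.value_counts(dropna=False)

counts = top_context_counts(demo_df)
print(counts)
```

If one context dominates the counts for unrelated questions, that is consistent with the near-zero context precision and recall scores and points at the retriever rather than the generator.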


In [98]:
# Create a summary table
summary_table = pd.DataFrame({
    "Metric": ["Faithfulness", "Answer Relevancy", "Context Precision", "Context Recall"],
    "Score": [overall_metrics["Faithfulness"], overall_metrics["Answer Relevancy"], 
              overall_metrics["Context Precision"], overall_metrics["Context Recall"]],
    "Interpretation": [
        "How factually accurate are the responses?",
        "How relevant are the responses to the questions?",
        "How precise is the retrieved context?",
        "How well does the context cover the ground truth?"
    ]
})

print("\n📊 RAGAS Evaluation Summary Table:")
print("=" * 80)
display(summary_table)



📊 RAGAS Evaluation Summary Table:


Unnamed: 0,Metric,Score,Interpretation
0,Faithfulness,0.469189,How factually accurate are the responses?
1,Answer Relevancy,0.786585,How relevant are the responses to the questions?
2,Context Precision,0.033333,How precise is the retrieved context?
3,Context Recall,0.016667,How well does the context cover the ground truth?


In [99]:
# Safe performance analysis (handles missing columns)
print("🔍 Safe Performance Analysis:")
print("=" * 50)

# Check what metrics are available
available_metrics = [col for col in ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall'] 
                    if col in results_df.columns]

print(f"Available metrics: {available_metrics}")

if available_metrics:
    # Calculate overall performance
    overall_performance = {}
    for metric in available_metrics:
        overall_performance[metric] = results_df[metric].mean()
    
    print("\n📊 Overall Performance Metrics:")
    for metric, score in overall_performance.items():
        print(f"{metric:20}: {score:.3f}")
    
    # Calculate performance statistics
    print("\n📈 Performance Statistics:")
    for metric in available_metrics:
        print(f"\n{metric}:")
        print(f"  Mean: {results_df[metric].mean():.3f}")
        print(f"  Std:  {results_df[metric].std():.3f}")
        print(f"  Min:  {results_df[metric].min():.3f}")
        print(f"  Max:  {results_df[metric].max():.3f}")
else:
    print("❌ No metrics found in results DataFrame")


🔍 Safe Performance Analysis:
Available metrics: ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']

📊 Overall Performance Metrics:
faithfulness        : 0.469
answer_relevancy    : 0.787
context_precision   : 0.033
context_recall      : 0.017

📈 Performance Statistics:

faithfulness:
  Mean: 0.469
  Std:  0.334
  Min:  0.028
  Max:  0.951

answer_relevancy:
  Mean: 0.787
  Std:  0.224
  Min:  0.000
  Max:  0.898

context_precision:
  Mean: 0.033
  Std:  0.129
  Min:  0.000
  Max:  0.500

context_recall:
  Mean: 0.017
  Std:  0.065
  Min:  0.000
  Max:  0.250
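The same statistics can be collected into one table with `DataFrame.agg`, which is easier to compare across runs than printed text. This is a minimal sketch, assuming the metric column names shown above; `metric_stats` and `demo_df` are illustrative names, not part of the notebook.

```python
import pandas as pd

METRICS = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]

def metric_stats(df: pd.DataFrame, metrics=METRICS) -> pd.DataFrame:
    """Return mean/std/min/max per metric, one row per metric column that exists."""
    present = [m for m in metrics if m in df.columns]
    if not present:
        # Nothing to aggregate; return an empty table with the expected columns.
        return pd.DataFrame(columns=["mean", "std", "min", "max"])
    return df[present].agg(["mean", "std", "min", "max"]).T.round(3)

# Tiny illustrative frame (not the real evaluation results).
demo_df = pd.DataFrame({"faithfulness": [0.5, 1.0], "answer_relevancy": [0.8, 0.9]})
stats = metric_stats(demo_df)
print(stats)
```

In the notebook this would be called as `metric_stats(results_df)`; missing metric columns are skipped rather than raising a `KeyError`.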


In [100]:
# Performance by question type, falling back to overall means when no question column exists
print("🔍 SAFE Performance Analysis:")
print("=" * 50)

# Check what columns are actually available
print(f"Available columns: {list(results_df.columns)}")

# Check for a 'question' column (this results frame names it 'user_input', so the else branch runs)
if 'question' in results_df.columns:
    print("✅ Question column found - can do question type analysis")
    # Add question categories safely
    results_df['question_type'] = results_df['question'].apply(lambda x: 
        'Impact Analysis' if 'impacted' in x.lower() else
        'Troubleshooting' if 'troubleshoot' in x.lower() or 'should i do' in x.lower() else
        'Dependency Analysis' if 'depend' in x.lower() else
        'SLA Query' if 'sla' in x.lower() else
        'General'
    )
    
    # Group by question type
    available_metrics = [col for col in ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall'] 
                        if col in results_df.columns]
    
    if available_metrics:
        performance_by_type = results_df.groupby('question_type')[available_metrics].mean().round(3)
        print("\n📊 Performance by Question Type:")
        display(performance_by_type)
    else:
        print("❌ No metrics found for grouping")
else:
    print("⚠️ Question column not found - showing overall performance only")
    
    # Show overall performance
    available_metrics = [col for col in ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall'] 
                        if col in results_df.columns]
    
    if available_metrics:
        print("\n📊 Overall Performance Summary:")
        for metric in available_metrics:
            print(f"{metric}: {results_df[metric].mean():.3f}")
    else:
        print("❌ No performance metrics found")


🔍 SAFE Performance Analysis:
Available columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
⚠️ Question column not found - showing overall performance only

📊 Overall Performance Summary:
faithfulness: 0.469
answer_relevancy: 0.787
context_precision: 0.033
context_recall: 0.017
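The question-type breakdown was skipped because the results frame stores the question text under `user_input` rather than `question`. One possible way to make the grouping work with either name is sketched below. It reuses the keyword rules from the cell above; `classify_question`, `question_column`, and `demo_df` are hypothetical names, and the demo rows are made up for illustration.

```python
import pandas as pd

def classify_question(text: str) -> str:
    """Apply the same keyword rules as the cell above, in the same order."""
    t = text.lower()
    if "impacted" in t:
        return "Impact Analysis"
    if "troubleshoot" in t or "should i do" in t:
        return "Troubleshooting"
    if "depend" in t:
        return "Dependency Analysis"
    if "sla" in t:
        return "SLA Query"
    return "General"

def question_column(df: pd.DataFrame):
    """Return the first question-like column present, or None if there is none."""
    for name in ("user_input", "question"):
        if name in df.columns:
            return name
    return None

# Illustrative rows only; these are not the real evaluation results.
demo_df = pd.DataFrame({
    "user_input": [
        "What should I do if the sales orders pipeline fails?",
        "How is risk scoring calculated?",
        "Which SLA applies to the finance report?",
    ],
    "faithfulness": [0.5, 0.7, 0.9],
})

col = question_column(demo_df)
# assign() returns a new frame, so the original results are left untouched.
typed = demo_df.assign(question_type=demo_df[col].apply(classify_question))
by_type = typed.groupby("question_type")[["faithfulness"]].mean().round(3)
print(by_type)
```

Note that the `"sla"` rule is a plain substring match, so it would also fire on any word that happens to contain those letters; a word-boundary regex may be safer on a larger test set.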


## 7. Performance Analysis and Conclusions


In [101]:
# Identify strengths and weaknesses (with safety checks)
print("\n🎯 Performance Strengths and Weaknesses:")
print("=" * 50)

# Reuse overall_performance from the performance analysis cell above;
# at notebook top level, locals() is the global namespace
if 'overall_performance' in locals() and overall_performance:
    # Find best and worst performing metrics
    best_metric = max(overall_performance.items(), key=lambda x: x[1])
    worst_metric = min(overall_performance.items(), key=lambda x: x[1])

    print(f"\n✅ Strongest Performance: {best_metric[0]} ({best_metric[1]:.3f})")
    print(f"❌ Weakest Performance: {worst_metric[0]} ({worst_metric[1]:.3f})")

    # Calculate overall score
    overall_score = np.mean(list(overall_performance.values()))
    print(f"\n📊 Overall Pipeline Score: {overall_score:.3f}")

    # Performance interpretation
    if overall_score >= 0.8:
        performance_level = "Excellent"
    elif overall_score >= 0.7:
        performance_level = "Good"
    elif overall_score >= 0.6:
        performance_level = "Fair"
    else:
        performance_level = "Needs Improvement"

    print(f"🎯 Performance Level: {performance_level}")
else:
    print("⚠️ No performance data available for analysis")



🎯 Performance Strengths and Weaknesses:

✅ Strongest Performance: answer_relevancy (0.787)
❌ Weakest Performance: context_recall (0.017)

📊 Overall Pipeline Score: 0.326
🎯 Performance Level: Needs Improvement
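The overall score is the unweighted mean of the four metric means, so the two near-zero context metrics pull it down heavily. The sketch below reproduces that arithmetic from the summary table values. It also averages the context metrics and the response metrics separately, which is an assumption about how one might read the scores, not something the notebook computes.

```python
import numpy as np

# Metric means copied from the RAGAS summary table above.
metric_means = {
    "faithfulness": 0.469189,
    "answer_relevancy": 0.786585,
    "context_precision": 0.033333,
    "context_recall": 0.016667,
}

# Unweighted mean of all four metrics, as in the cell above.
overall = float(np.mean(list(metric_means.values())))

# Assumed split: context metrics describe retrieval, the other two describe the response.
retrieval = float(np.mean([metric_means["context_precision"], metric_means["context_recall"]]))
generation = float(np.mean([metric_means["faithfulness"], metric_means["answer_relevancy"]]))

print(f"overall:    {overall:.3f}")
print(f"retrieval:  {retrieval:.3f}")
print(f"generation: {generation:.3f}")
```

Read this way, the retrieval side averages 0.025 while the response side averages about 0.628, which suggests the retriever is the first place to look.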


## 8. Detailed Conclusions and Recommendations


In [102]:
print("📋 Detailed Conclusions and Recommendations:")
print("=" * 60)

print("\n🔍 Key Findings:")
print("-" * 20)

# Faithfulness analysis (RAGAS scores how well the response's claims are supported by the retrieved context)
faithfulness_score = overall_metrics["Faithfulness"]
if faithfulness_score >= 0.8:
    faithfulness_conclusion = "The system generates highly factual and accurate responses."
elif faithfulness_score >= 0.6:
    faithfulness_conclusion = "The system generates mostly accurate responses with some factual inconsistencies."
else:
    faithfulness_conclusion = "The system has significant factual accuracy issues that need attention."

print(f"1. Faithfulness ({faithfulness_score:.3f}): {faithfulness_conclusion}")

# Answer relevancy analysis
relevancy_score = overall_metrics["Answer Relevancy"]
if relevancy_score >= 0.8:
    relevancy_conclusion = "Responses are highly relevant to the questions asked."
elif relevancy_score >= 0.6:
    relevancy_conclusion = "Responses are generally relevant but may sometimes miss the mark."
else:
    relevancy_conclusion = "Responses often lack relevance to the specific questions asked."

print(f"2. Answer Relevancy ({relevancy_score:.3f}): {relevancy_conclusion}")

# Context precision analysis
precision_score = overall_metrics["Context Precision"]
if precision_score >= 0.8:
    precision_conclusion = "The system retrieves highly precise and relevant context."
elif precision_score >= 0.6:
    precision_conclusion = "The system retrieves reasonably precise context with some noise."
else:
    precision_conclusion = "The system retrieves context with significant noise and irrelevance."

print(f"3. Context Precision ({precision_score:.3f}): {precision_conclusion}")

# Context recall analysis
recall_score = overall_metrics["Context Recall"]
if recall_score >= 0.8:
    recall_conclusion = "The system retrieves comprehensive context that covers ground truth well."
elif recall_score >= 0.6:
    recall_conclusion = "The system retrieves adequate context but may miss some important information."
else:
    recall_conclusion = "The system often misses important context needed for accurate responses."

print(f"4. Context Recall ({recall_score:.3f}): {recall_conclusion}")


📋 Detailed Conclusions and Recommendations:

🔍 Key Findings:
--------------------
1. Faithfulness (0.469): The system has significant factual accuracy issues that need attention.
2. Answer Relevancy (0.787): Responses are generally relevant but may sometimes miss the mark.
3. Context Precision (0.033): The system retrieves context with significant noise and irrelevance.
4. Context Recall (0.017): The system often misses important context needed for accurate responses.
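The four `if`/`elif`/`else` blocks above share the same 0.8 and 0.6 cut-offs and differ only in their messages. A small helper could express that pattern once. This is a sketch only: `pick_band` is a hypothetical name, and the bands shown are copied from the answer relevancy block and the performance-level thresholds used earlier.

```python
from typing import Sequence, Tuple

def pick_band(score: float, bands: Sequence[Tuple[float, str]], default: str) -> str:
    """Return the message of the first band whose threshold the score meets.

    `bands` must be ordered from highest threshold to lowest.
    """
    for threshold, message in bands:
        if score >= threshold:
            return message
    return default

# Bands copied from the answer relevancy block above.
relevancy_bands = [
    (0.8, "Responses are highly relevant to the questions asked."),
    (0.6, "Responses are generally relevant but may sometimes miss the mark."),
]
relevancy_text = pick_band(
    0.787,
    relevancy_bands,
    default="Responses often lack relevance to the specific questions asked.",
)

# Bands copied from the performance-level thresholds in section 7.
level = pick_band(0.326, [(0.8, "Excellent"), (0.7, "Good"), (0.6, "Fair")], "Needs Improvement")

print(relevancy_text)
print(level)
```

Keeping the thresholds in data rather than in branches makes it harder for the cut-offs to drift apart between cells.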


In [103]:
print("\n💡 Recommendations for Improvement:")
print("-" * 40)

recommendations = []

# Faithfulness recommendations
if faithfulness_score < 0.8:
    recommendations.append("• Improve factual accuracy by enhancing the knowledge base and fact-checking mechanisms")

# Relevancy recommendations
if relevancy_score < 0.8:
    recommendations.append("• Enhance response relevance by improving question understanding and response generation")

# Precision recommendations
if precision_score < 0.8:
    recommendations.append("• Improve context precision by refining retrieval algorithms and filtering mechanisms")

# Recall recommendations
if recall_score < 0.8:
    recommendations.append("• Enhance context recall by expanding the knowledge base and improving retrieval coverage")

# General recommendations
if overall_score < 0.8:
    recommendations.extend([
        "• Consider fine-tuning the LLM on domain-specific data",
        "• Implement feedback loops to continuously improve performance",
        "• Add more diverse test cases to the evaluation dataset",
        "• Consider ensemble methods for better response quality"
    ])

if recommendations:
    for rec in recommendations:
        print(rec)
else:
    print("• System performance is excellent - consider monitoring for consistency")
    print("• Expand the test dataset to cover more edge cases")
    print("• Implement A/B testing for continuous improvement")



💡 Recommendations for Improvement:
----------------------------------------
• Improve factual accuracy by enhancing the knowledge base and fact-checking mechanisms
• Enhance response relevance by improving question understanding and response generation
• Improve context precision by refining retrieval algorithms and filtering mechanisms
• Enhance context recall by expanding the knowledge base and improving retrieval coverage
• Consider fine-tuning the LLM on domain-specific data
• Implement feedback loops to continuously improve performance
• Add more diverse test cases to the evaluation dataset
• Consider ensemble methods for better response quality


In [105]:
print("\n🎯 Overall Pipeline Effectiveness Assessment:")
print("=" * 50)

print(f"\n📊 Summary Statistics:")
print(f"• Total Test Cases: {len(evaluation_data)}")
print(f"• Overall Score: {overall_score:.3f}")
print(f"• Performance Level: {performance_level}")
print(f"• Best Metric: {best_metric[0]} ({best_metric[1]:.3f})")
print(f"• Worst Metric: {worst_metric[0]} ({worst_metric[1]:.3f})")

print(f"\n🔍 Key Insights:")
print(f"• The Traceback system demonstrates {'strong' if overall_score >= 0.7 else 'moderate' if overall_score >= 0.6 else 'weak'} performance across all RAGAS metrics")
print(f"• {'The system excels at' if best_metric[1] >= 0.8 else 'The system shows good performance in'} {best_metric[0].lower()}")
print(f"• {'Significant improvement needed in' if worst_metric[1] < 0.6 else 'Some improvement possible in'} {worst_metric[0].lower()}")

print(f"\n✅ Conclusion:")
if overall_score >= 0.8:
    conclusion = "The Traceback pipeline is highly effective and ready for production deployment."
elif overall_score >= 0.7:
    conclusion = "The Traceback pipeline shows good effectiveness with room for targeted improvements."
elif overall_score >= 0.6:
    conclusion = "The Traceback pipeline demonstrates fair effectiveness but requires significant improvements."
else:
    conclusion = "The Traceback pipeline needs substantial improvements before production deployment."

print(conclusion)



🎯 Overall Pipeline Effectiveness Assessment:

📊 Summary Statistics:
• Total Test Cases: 15
• Overall Score: 0.326
• Performance Level: Needs Improvement
• Best Metric: answer_relevancy (0.787)
• Worst Metric: context_recall (0.017)

🔍 Key Insights:
• The Traceback system demonstrates weak performance across all RAGAS metrics
• The system shows good performance in answer_relevancy
• Significant improvement needed in context_recall

✅ Conclusion:
The Traceback pipeline needs substantial improvements before production deployment.


## 9. Save Results


In [106]:
# Save detailed results
results_output_path = Path.cwd().parent / "data" / "ragas_evaluation_results.json"

evaluation_summary = {
    "evaluation_date": pd.Timestamp.now().isoformat(),
    "total_test_cases": len(evaluation_data),
    "overall_metrics": overall_metrics,
    "overall_score": float(overall_score),
    "performance_level": performance_level,
    "detailed_results": results_df.to_dict('records'),
    "recommendations": recommendations if recommendations else ["System performance is excellent"]
}

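# Create the output directory if it does not exist, then write UTF-8 JSON
results_output_path.parent.mkdir(parents=True, exist_ok=True)
with open(results_output_path, 'w', encoding='utf-8') as f:
    json.dump(evaluation_summary, f, indent=2)

print(f"✅ Results saved to: {results_output_path}")
print("📊 Evaluation completed successfully!")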
print(f"🎯 Overall Pipeline Score: {overall_score:.3f} ({performance_level})")


✅ Results saved to: /Users/sandeepgogineni/ai-engineering/bootcamp/Traceback/data/ragas_evaluation_results.json
📊 Evaluation completed successfully!
🎯 Overall Pipeline Score: 0.326 (Needs Improvement)
