# FusionGraph Research-Grade Evaluation

This notebook evaluates the FusionGraph multimodal RAG system along three groups of research metrics:

1. **Retrieval Quality Assessment** - NDCG@K, MRR, Precision@K, Recall@K
2. **Factual Consistency & Grounding** - NLI-based verification, claim support
3. **Multimodal Integration Quality** - Cross-modal alignment, OCR accuracy

Each group yields scores that can be compared across runs, both for research reporting and for production monitoring.

In [None]:
import sys
import os
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Add project root to path
sys.path.append('..')

# Import evaluation framework
from evaluation.evaluation_framework import (
    FusionGraphEvaluationSuite,
    RetrievalQualityEvaluator, 
    FactualConsistencyEvaluator,
    MultimodalGroundingEvaluator
)

print("📊 FusionGraph Evaluation Framework Loaded")
print("✅ Ready for research-grade evaluation")

## Test 1: Retrieval Quality Assessment

Measures how well the system retrieves relevant documents using information retrieval metrics:
- **NDCG@K**: Normalized Discounted Cumulative Gain at rank K
- **MRR**: Mean Reciprocal Rank, the reciprocal of the rank of the first relevant result, averaged over queries
- **Precision@K**: Fraction of top-K results that are relevant
- **Recall@K**: Fraction of relevant documents found in top-K
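
The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation inside `RetrievalQualityEvaluator`: the function names, the `gains` dictionary of graded relevance judgments, and the document ids are all hypothetical, and the ranked list is assumed to contain no duplicate ids.

```python
import math

def precision_at_k(relevant, ranked_ids, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(relevant, ranked_ids, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(relevant, ranked_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(gains, ranked_ids, k):
    """DCG of the top-k ranking divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(ideal_gains, start=1)
    )
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical graded judgments for one query (higher = more relevant)
gains = {'doc_a': 3, 'doc_c': 2, 'doc_e': 1}
relevant = set(gains)
ranked = ['doc_a', 'doc_b', 'doc_c', 'doc_d']

for k in (1, 3):
    print(f"k={k}",
          f"precision={precision_at_k(relevant, ranked, k):.3f}",
          f"recall={recall_at_k(relevant, ranked, k):.3f}",
          f"ndcg={ndcg_at_k(gains, ranked, k):.3f}")
print(f"reciprocal rank={reciprocal_rank(relevant, ranked):.3f}")
```

MRR is then the mean of `reciprocal_rank` over all evaluated queries; with a single query, as in this notebook, it equals that query's reciprocal rank.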

In [None]:
# Initialize retrieval quality evaluator
retrieval_eval = RetrievalQualityEvaluator()

# Sample retrieval results for demonstration
sample_retrieval_results = [
    {'source_id': 'ai_healthcare_2024', 'score': 0.92},
    {'source_id': 'cv_deep_learning_2024', 'score': 0.75}, 
    {'source_id': 'knowledge_graphs_2024', 'score': 0.68},
    {'source_id': 'nlp_llm_2024', 'score': 0.45}
]

# Evaluate retrieval quality
retrieval_metrics = retrieval_eval.evaluate_retrieval_quality(
    sample_retrieval_results, 
    query_id="q1",  # "How does AI help in medical diagnosis?"
    k_values=[1, 3, 5]
)

print("🔍 Retrieval Quality Metrics:")
for metric, score in retrieval_metrics.items():
    print(f"   {metric}: {score:.3f}")

# Visualize retrieval metrics
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# NDCG at different K values
ndcg_metrics = {k: v for k, v in retrieval_metrics.items() if 'ndcg' in k}
k_vals = [int(k.split('_')[-1]) for k in ndcg_metrics.keys()]
ndcg_scores = list(ndcg_metrics.values())

axes[0].bar(k_vals, ndcg_scores, color='skyblue', alpha=0.7)
axes[0].set_xlabel('K (Top-K Results)')
axes[0].set_ylabel('NDCG Score')
axes[0].set_title('NDCG@K Scores')
axes[0].set_ylim(0, 1)

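# Precision at different K values (K is parsed from these keys rather than
# reused from the NDCG keys, so each bar lines up with its own K)
prec_metrics = {k: v for k, v in retrieval_metrics.items() if 'precision' in k}
prec_k_vals = [int(k.split('_')[-1]) for k in prec_metrics.keys()]
prec_scores = list(prec_metrics.values())

axes[1].bar(prec_k_vals, prec_scores, color='lightcoral', alpha=0.7)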
axes[1].set_xlabel('K (Top-K Results)')
axes[1].set_ylabel('Precision Score')
axes[1].set_title('Precision@K Scores')
axes[1].set_ylim(0, 1)

plt.tight_layout()
plt.show()

print(f"📈 MRR (Mean Reciprocal Rank): {retrieval_metrics.get('mrr', 0):.3f}")

## Test 2: Factual Consistency & Grounding

Evaluates whether generated answers are factually consistent with retrieved sources:
- **Claim Verification**: Uses NLI models to verify claims against sources
- **Support Ratio**: Fraction of claims supported by retrieved evidence
- **Hallucination Detection**: Identifies contradicted or unsupported claims
- **Entity Consistency**: Checks factual entity alignment
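
To make the ratio metrics concrete, the sketch below aggregates per-claim NLI labels into ratios. It assumes the NLI step has already labeled each claim as `entailment`, `neutral`, or `contradiction`; the function name and the `neutral_ratio` key are hypothetical. How `FactualConsistencyEvaluator` treats neutral claims is not stated here, so this sketch counts only contradictions toward the hallucination ratio.

```python
from collections import Counter

def summarize_claim_labels(labels):
    """Turn a list of per-claim NLI labels into support/neutral/hallucination ratios."""
    if not labels:
        # No claims extracted: report zeros rather than dividing by zero
        return {'support_ratio': 0.0, 'neutral_ratio': 0.0, 'hallucination_ratio': 0.0}
    counts = Counter(labels)
    total = len(labels)
    return {
        'support_ratio': counts['entailment'] / total,
        'neutral_ratio': counts['neutral'] / total,
        # Assumption: only contradicted claims count as hallucinations
        'hallucination_ratio': counts['contradiction'] / total,
    }

# Hypothetical labels for four claims extracted from one generated answer
claim_labels = ['entailment', 'entailment', 'neutral', 'contradiction']
print(summarize_claim_labels(claim_labels))
```

Under this convention the three ratios sum to 1, which is what a pie chart of claim support expects.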

In [None]:
# Initialize factual consistency evaluator
factual_eval = FactualConsistencyEvaluator()

# Sample generated answer and sources
generated_answer = """
Artificial intelligence significantly improves medical diagnosis by analyzing medical images 
with higher accuracy than human doctors in some cases. AI systems can detect early-stage 
cancer, diabetic retinopathy, and cardiovascular conditions. Machine learning algorithms 
process electronic health records to identify patterns and risk factors.
"""

source_texts = [
    "AI-powered diagnostic tools have shown remarkable success in detecting diseases like cancer, diabetic retinopathy, and cardiovascular conditions at early stages.",
    "Studies have demonstrated that AI systems can achieve diagnostic accuracy comparable to or exceeding that of specialist physicians in certain domains.",
    "Machine learning algorithms can analyze medical images, electronic health records, and genomic data to identify patterns."
]

# Evaluate factual consistency
factual_metrics = factual_eval.evaluate_factual_consistency(generated_answer, source_texts)

print("✅ Factual Consistency Metrics:")
for metric, score in factual_metrics.items():
    if isinstance(score, float):
        print(f"   {metric}: {score:.3f}")
    else:
        print(f"   {metric}: {score}")

# Visualize factual consistency results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

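# Support vs contradiction ratio
support_ratio = factual_metrics.get('support_ratio', 0)
hallucination_ratio = factual_metrics.get('hallucination_ratio', 0)
# Clamp at 0: pie() rejects negative wedge sizes, which rounding or overlapping ratios could produce
neutral_ratio = max(0.0, 1 - support_ratio - hallucination_ratio)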

labels = ['Supported', 'Neutral', 'Contradicted']
sizes = [support_ratio, neutral_ratio, hallucination_ratio]
colors = ['lightgreen', 'lightgray', 'lightcoral']

ax1.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
ax1.set_title('Claim Support Distribution')

# Bucket the average verification score into a single low/medium/high category
verification_score = factual_metrics.get('avg_verification_score', 0.5)
score_categories = ['Low (0-0.3)', 'Medium (0.3-0.7)', 'High (0.7-1.0)']

if verification_score < 0.3:
    score_dist = [1, 0, 0]
elif verification_score < 0.7:
    score_dist = [0, 1, 0]
else:
    score_dist = [0, 0, 1]

ax2.bar(score_categories, score_dist, color=['red', 'orange', 'green'])
ax2.set_ylabel('Proportion')
ax2.set_title(f'Verification Score Category\n(Avg: {verification_score:.3f})')
ax2.set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

## Test 3: Multimodal Integration Quality

Assesses how well the system integrates visual and textual information:
- **Cross-Modal Alignment**: Semantic coherence between text and image sources
- **OCR Accuracy**: Quality of text extraction from images
- **Visual QA Correctness**: Accuracy of visual question answering
- **Source Relevance**: Relevance of both text and image sources to queries
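
When ground-truth transcriptions are available, OCR accuracy is often reported as one minus the character error rate (CER), where CER is the edit distance between the extracted and reference text divided by the reference length. The sketch below shows that calculation. It is an assumption for illustration only: how `MultimodalGroundingEvaluator` computes `ocr_accuracy` is not described here, and since the sample image sources carry no reference text, it may well use a reference-free proxy instead. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]

def ocr_accuracy(predicted, reference):
    """1 - character error rate, clamped to [0, 1]."""
    if not reference:
        return 1.0 if not predicted else 0.0
    cer = edit_distance(predicted, reference) / len(reference)
    return max(0.0, 1.0 - cer)

# Hypothetical OCR output with one wrong character against its reference
print(f"{ocr_accuracy('chest X-roy', 'chest X-ray'):.3f}")
```

The clamp matters because CER can exceed 1 when the OCR output is much longer than the reference.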

In [None]:
# Initialize multimodal grounding evaluator
multimodal_eval = MultimodalGroundingEvaluator()

# Sample multimodal query and sources
query = "How do computer vision models analyze medical images?"

text_sources = [
    {
        'text': 'Computer vision models use convolutional neural networks to analyze medical images for disease detection.',
        'score': 0.89
    },
    {
        'text': 'Deep learning algorithms can process X-rays, MRI scans, and CT images with high accuracy.',
        'score': 0.82
    }
]

image_sources = [
    {
        'ocr_text': 'Medical image analysis using CNN architecture for chest X-ray classification',
        'caption': 'Diagram showing CNN layers processing medical images',
        'score': 0.78
    },
    {
        'ocr_text': 'Deep learning model performance on radiology datasets',
        'caption': 'Performance metrics table for medical image AI',
        'score': 0.71
    }
]

# Evaluate multimodal grounding
multimodal_result = multimodal_eval.evaluate_multimodal_grounding(
    query, text_sources, image_sources
)

print("🎭 Multimodal Integration Metrics:")
print(f"   Text Source Relevance: {multimodal_result.text_source_relevance:.3f}")
print(f"   Image Source Relevance: {multimodal_result.image_source_relevance:.3f}")
print(f"   Cross-Modal Alignment: {multimodal_result.cross_modal_alignment:.3f}")
print(f"   OCR Accuracy: {multimodal_result.ocr_accuracy:.3f}")
print(f"   Visual QA Correctness: {multimodal_result.visual_qa_correctness:.3f}")
print(f"   Overall Grounding Quality: {multimodal_result.grounding_quality}")

# Visualize multimodal metrics
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart for multimodal metrics
metrics = {
    'Text Relevance': multimodal_result.text_source_relevance,
    'Image Relevance': multimodal_result.image_source_relevance,
    'Cross-Modal\nAlignment': multimodal_result.cross_modal_alignment,
    'OCR Accuracy': multimodal_result.ocr_accuracy,
    'Visual QA\nCorrectness': multimodal_result.visual_qa_correctness
}

labels = list(metrics.keys())
values = list(metrics.values())

ax1.bar(range(len(labels)), values, color='lightblue', alpha=0.7)
ax1.set_xticks(range(len(labels)))
ax1.set_xticklabels(labels, rotation=45, ha='right')
ax1.set_ylabel('Score')
ax1.set_title('Multimodal Integration Metrics')
ax1.set_ylim(0, 1)

# Overall quality distribution
quality_levels = ['Poor', 'Fair', 'Good', 'Excellent']
quality_scores = [0, 0, 0, 0]

quality_map = {'poor': 0, 'fair': 1, 'good': 2, 'excellent': 3}
quality_idx = quality_map.get(multimodal_result.grounding_quality.lower(), 1)
quality_scores[quality_idx] = 1

colors = ['red', 'orange', 'lightgreen', 'green']
ax2.bar(quality_levels, quality_scores, color=colors, alpha=0.7)
ax2.set_ylabel('Assessment')
ax2.set_title('Overall Grounding Quality')
ax2.set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

## Comprehensive Evaluation Dashboard

Combined view of all evaluation metrics for research reporting.

In [None]:
# Create comprehensive evaluation summary
evaluation_summary = {
    'Retrieval Quality': {
        'NDCG@3': retrieval_metrics.get('ndcg_at_3', 0),
        'Precision@3': retrieval_metrics.get('precision_at_3', 0),
        'MRR': retrieval_metrics.get('mrr', 0)
    },
    'Factual Consistency': {
        'Support Ratio': factual_metrics.get('support_ratio', 0),
        'Avg Verification Score': factual_metrics.get('avg_verification_score', 0),
        'Hallucination Ratio': factual_metrics.get('hallucination_ratio', 0)
    },
    'Multimodal Integration': {
        'Text Relevance': multimodal_result.text_source_relevance,
        'Image Relevance': multimodal_result.image_source_relevance,
        'Cross-Modal Alignment': multimodal_result.cross_modal_alignment
    }
}

# Create comprehensive dashboard
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('FusionGraph Research-Grade Evaluation Dashboard', fontsize=16, fontweight='bold')

# 1. Retrieval Quality Overview
ret_metrics = list(evaluation_summary['Retrieval Quality'].keys())
ret_scores = list(evaluation_summary['Retrieval Quality'].values())

axes[0,0].bar(ret_metrics, ret_scores, color='skyblue', alpha=0.8)
axes[0,0].set_title('Retrieval Quality Metrics')
axes[0,0].set_ylabel('Score')
axes[0,0].set_ylim(0, 1)
axes[0,0].tick_params(axis='x', rotation=45)

# 2. Factual Consistency Overview
fact_metrics = list(evaluation_summary['Factual Consistency'].keys())
fact_scores = list(evaluation_summary['Factual Consistency'].values())

axes[0,1].bar(fact_metrics, fact_scores, color='lightgreen', alpha=0.8)
axes[0,1].set_title('Factual Consistency Metrics')
axes[0,1].set_ylabel('Score')
axes[0,1].set_ylim(0, 1)
axes[0,1].tick_params(axis='x', rotation=45)

# 3. Multimodal Integration Overview
mm_metrics = list(evaluation_summary['Multimodal Integration'].keys())
mm_scores = list(evaluation_summary['Multimodal Integration'].values())

axes[1,0].bar(mm_metrics, mm_scores, color='lightcoral', alpha=0.8)
axes[1,0].set_title('Multimodal Integration Metrics')
axes[1,0].set_ylabel('Score')
axes[1,0].set_ylim(0, 1)
axes[1,0].tick_params(axis='x', rotation=45)

# 4. Overall System Performance Heatmap
all_metrics = []
all_scores = []

for category, metrics in evaluation_summary.items():
    for metric, score in metrics.items():
        all_metrics.append(f"{category}\n{metric}")
        all_scores.append(score)

# Create heatmap data
heatmap_data = [[score] for score in all_scores]
im = axes[1,1].imshow(heatmap_data, cmap='RdYlGn', aspect='auto', vmin=0, vmax=1)

axes[1,1].set_yticks(range(len(all_metrics)))
axes[1,1].set_yticklabels(all_metrics, fontsize=8)
axes[1,1].set_xticks([])
axes[1,1].set_title('Overall Performance Heatmap')

# Add colorbar
cbar = plt.colorbar(im, ax=axes[1,1], shrink=0.8)
cbar.set_label('Score', rotation=270, labelpad=15)

plt.tight_layout()
plt.show()

# Print summary table
print("\n📊 Research-Grade Evaluation Summary:")
print("=" * 60)

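# Build a long-form table: passing the nested dict straight to DataFrame would
# produce a mostly-NaN grid, because each category has different metric names
summary_df = pd.DataFrame(
    [
        (category, metric, score)
        for category, scores in evaluation_summary.items()
        for metric, score in scores.items()
    ],
    columns=['Category', 'Metric', 'Score']
).round(3)
print(summary_df.to_string(index=False))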

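# Calculate overall system score.
# Hallucination Ratio is lower-is-better, so invert it before averaging
oriented_scores = [
    1 - score if metric == 'Hallucination Ratio' else score
    for scores in evaluation_summary.values()
    for metric, score in scores.items()
]
overall_score = sum(oriented_scores) / len(oriented_scores)
print(f"\n🎯 Overall System Performance: {overall_score:.3f}")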

if overall_score >= 0.8:
    grade = "A (Excellent)"
elif overall_score >= 0.7:
    grade = "B (Good)" 
elif overall_score >= 0.6:
    grade = "C (Satisfactory)"
else:
    grade = "D (Needs Improvement)"

print(f"📈 Research Grade: {grade}")

## Research Applications

These quantitative metrics enable:

### 📈 **Benchmarking & Comparison**
- Compare against other RAG systems using standardized metrics
- Track performance improvements over time
- Validate system reliability for production deployment

### 🔬 **Research Publications**
- Provide a rigorous evaluation methodology for academic papers
- Enable reproducible experiments with quantitative results
- Support claims about system effectiveness with evidence

### 🎯 **System Optimization**
- Identify specific areas for improvement (retrieval vs. generation)
- A/B test different components with quantitative measures
- Optimize hyperparameters based on evaluation metrics

### ✅ **Quality Assurance**
- Continuous monitoring of system performance
- Automated alerts when metrics fall below thresholds
- Regression testing for system updates
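
As one way to turn the threshold alerts and regression tests above into code, the sketch below compares a flat dictionary of metric values against per-metric thresholds. The function name and the threshold values are hypothetical, not recommendations. The metric keys mirror those used earlier in this notebook.

```python
def find_regressions(metrics, thresholds, lower_is_better=()):
    """Return {metric: (value, threshold)} for every metric that violates its threshold.

    Metrics listed in lower_is_better fail when they rise above the threshold;
    all others fail when they drop below it. Metrics missing from `metrics`
    are skipped.
    """
    failures = {}
    for name, threshold in thresholds.items():
        value = metrics.get(name)
        if value is None:
            continue
        if name in lower_is_better:
            violated = value > threshold
        else:
            violated = value < threshold
        if violated:
            failures[name] = (value, threshold)
    return failures

# Hypothetical metric values and thresholds for one evaluation run
current_metrics = {'ndcg_at_3': 0.62, 'support_ratio': 0.85, 'hallucination_ratio': 0.20}
thresholds = {'ndcg_at_3': 0.70, 'support_ratio': 0.80, 'hallucination_ratio': 0.10}

failures = find_regressions(
    current_metrics, thresholds, lower_is_better={'hallucination_ratio'}
)
for name, (value, threshold) in failures.items():
    print(f"ALERT {name}: {value:.3f} (threshold {threshold:.3f})")
```

In a CI job, a non-empty `failures` dictionary could fail the build, which gives regression testing for system updates without manual inspection of the plots.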