# Disagreement Analysis in RAG Systems

This notebook demonstrates how to analyze disagreements between different retrievers in a Retrieval-Augmented Generation (RAG) system using the `autorag-live` library.

## Overview

Disagreement analysis helps you understand:
- How different retrievers rank the same documents
- Which retrievers are most consistent
- Where retrievers disagree and why
- How to optimize hybrid retriever weights

## Setup

In [None]:
# Install required packages
# !pip install autorag-live sentence-transformers scipy

import sys
import os
sys.path.append('..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List

# Import autorag-live components
from autorag_live.retrievers import bm25, dense, hybrid
from autorag_live.disagreement import metrics, report
from autorag_live.pipeline.hybrid_optimizer import grid_search_hybrid_weights

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

print("Setup complete!")

## Sample Data

Let's create a small corpus of documents and a handful of queries to demonstrate disagreement analysis.

In [None]:
# Sample corpus for demonstration
CORPUS = [
    "The sky is blue and beautiful during the day.",
    "The sun rises in the east and sets in the west.",
    "The sun is bright and provides light to Earth.",
    "The sun in the sky is very bright during daytime.",
    "We can see the shining sun, the bright sun in the sky.",
    "The quick brown fox jumps over the lazy dog.",
    "A lazy fox is usually sleeping in its den.",
    "The fox is a mammal that belongs to the canine family.",
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with multiple layers.",
    "Natural language processing helps computers understand text.",
    "Computer vision enables machines to interpret visual information.",
    "Data science combines statistics, programming, and domain expertise.",
    "Python is a popular programming language for data science.",
    "Jupyter notebooks provide an interactive environment for coding."
]

QUERIES = [
    "bright sun in the sky",
    "fox jumping over dog",
    "machine learning and AI",
    "programming with Python"
]

print(f"Corpus size: {len(CORPUS)} documents")
print(f"Number of queries: {len(QUERIES)}")
print("\nSample documents:")
for i, doc in enumerate(CORPUS[:3]):
    print(f"{i+1}. {doc}")

## Retriever Setup

Let's initialize different retrievers and see how they perform on our queries.

In [None]:
# Initialize retrievers
print("Initializing retrievers...")

# BM25 retriever (lexical)
bm25_retriever = bm25

# Dense retriever (semantic)
dense_retriever = dense

# Hybrid retriever (combination)
hybrid_retriever = hybrid

print("✅ Retrievers initialized")

# Test each retriever on a sample query
test_query = QUERIES[0]
k = 5

print(f"\nTesting retrievers on query: '{test_query}'")
print("=" * 60)

# Get results from each retriever
bm25_results = bm25_retriever.bm25_retrieve(test_query, CORPUS, k)
dense_results = dense_retriever.dense_retrieve(test_query, CORPUS, k)
hybrid_results = hybrid_retriever.hybrid_retrieve(test_query, CORPUS, k)

print("BM25 Results:")
for i, doc in enumerate(bm25_results):
    print(f"{i+1}. {doc}")

print("\nDense Results:")
for i, doc in enumerate(dense_results):
    print(f"{i+1}. {doc}")

print("\nHybrid Results:")
for i, doc in enumerate(hybrid_results):
    print(f"{i+1}. {doc}")

## Disagreement Analysis

Now let's analyze the disagreements between different retrievers.
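
For intuition, the two metrics used below can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation behind `metrics.jaccard_at_k` or `metrics.kendall_tau_at_k`, which may treat empty lists or documents that appear in only one list differently. The helper names `jaccard_sketch` and `kendall_tau_sketch` and the document IDs are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def jaccard_sketch(a: Sequence[str], b: Sequence[str]) -> float:
    """Set overlap of two result lists: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Assumption: two empty lists are treated as fully agreeing.
    return len(set_a & set_b) / len(union) if union else 1.0


def kendall_tau_sketch(a: Sequence[str], b: Sequence[str]) -> float:
    """Kendall tau restricted to documents that appear in both lists."""
    rank_b = {doc: i for i, doc in enumerate(b)}
    shared = [doc for doc in a if doc in rank_b]  # kept in a's order
    if len(shared) < 2:
        return 0.0  # assumption: too few shared documents to compare order
    concordant = discordant = 0
    for x, y in combinations(shared, 2):
        # x precedes y in list a; check whether list b agrees.
        if rank_b[x] < rank_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


same_order = ["d1", "d2", "d3"]
reversed_order = ["d3", "d2", "d1"]
partial = ["d1", "d4", "d5"]

print(jaccard_sketch(same_order, reversed_order))      # 1.0
print(kendall_tau_sketch(same_order, reversed_order))  # -1.0
print(jaccard_sketch(same_order, partial))             # 0.2
```

The first two calls compare lists that contain the same documents in opposite order, so the set overlap is perfect while the rank correlation is as low as it can be.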

In [None]:
def analyze_disagreements(query: str, corpus: List[str], k: int = 5):
    """Analyze disagreements between retrievers for a given query."""
    
    # Get results from all retrievers
    bm25_results = bm25_retriever.bm25_retrieve(query, corpus, k)
    dense_results = dense_retriever.dense_retrieve(query, corpus, k)
    hybrid_results = hybrid_retriever.hybrid_retrieve(query, corpus, k)
    
    results = {
        "BM25": bm25_results,
        "Dense": dense_results,
        "Hybrid": hybrid_results
    }
    
    # Calculate disagreement metrics
    disagreement_metrics = {
        "jaccard_bm25_vs_dense": metrics.jaccard_at_k(bm25_results, dense_results),
        "kendall_tau_bm25_vs_dense": metrics.kendall_tau_at_k(bm25_results, dense_results),
        "jaccard_bm25_vs_hybrid": metrics.jaccard_at_k(bm25_results, hybrid_results),
        "kendall_tau_bm25_vs_hybrid": metrics.kendall_tau_at_k(bm25_results, hybrid_results),
        "jaccard_dense_vs_hybrid": metrics.jaccard_at_k(dense_results, hybrid_results),
        "kendall_tau_dense_vs_hybrid": metrics.kendall_tau_at_k(dense_results, hybrid_results),
    }
    
    return results, disagreement_metrics

# Analyze disagreements for all queries
all_results = {}
all_metrics = {}

print("Analyzing disagreements across all queries...")
for query in QUERIES:
    results, disagreement_metrics = analyze_disagreements(query, CORPUS)
    all_results[query] = results
    all_metrics[query] = disagreement_metrics
    
    print(f"\nQuery: '{query}'")
    print("-" * 40)
    for metric_name, value in disagreement_metrics.items():
        print(f"{metric_name}: {value:.3f}")

print("\n✅ Disagreement analysis complete!")

## Visualization of Disagreements

Let's create some visualizations to better understand the disagreements.

In [None]:
# Create a summary DataFrame of disagreement metrics
metrics_data = []
for query, metrics_dict in all_metrics.items():
    for metric_name, value in metrics_dict.items():
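        # Keys look like "<metric>_<a>_vs_<b>"; strip the metric prefix to get the pair
        is_jaccard = metric_name.startswith('jaccard_')
        prefix = 'jaccard_' if is_jaccard else 'kendall_tau_'
        retriever_pair = metric_name[len(prefix):].replace('_vs_', ' vs ')
        metric_type = 'Jaccard' if is_jaccard else 'Kendall Tau'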
        metrics_data.append({
            'Query': query,
            'Retriever Pair': retriever_pair,
            'Metric': metric_type,
            'Value': value
        })

df_metrics = pd.DataFrame(metrics_data)

# Plot disagreement metrics
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Disagreement Analysis Across Queries', fontsize=16)

# Jaccard similarity heatmap
jaccard_data = df_metrics[df_metrics['Metric'] == 'Jaccard'].pivot(
    index='Query', columns='Retriever Pair', values='Value'
)
sns.heatmap(jaccard_data, annot=True, cmap='YlOrRd', ax=axes[0,0])
axes[0,0].set_title('Jaccard Similarity (Higher = More Agreement)')

# Kendall Tau correlation heatmap
kendall_data = df_metrics[df_metrics['Metric'] == 'Kendall Tau'].pivot(
    index='Query', columns='Retriever Pair', values='Value'
)
sns.heatmap(kendall_data, annot=True, cmap='RdYlBu_r', ax=axes[0,1])
axes[0,1].set_title('Kendall Tau Correlation (Higher = More Agreement)')

# Average metrics by retriever pair
avg_metrics = df_metrics.groupby(['Retriever Pair', 'Metric'])['Value'].mean().reset_index()
sns.barplot(data=avg_metrics, x='Retriever Pair', y='Value', hue='Metric', ax=axes[1,0])
axes[1,0].set_title('Average Agreement by Retriever Pair')
axes[1,0].tick_params(axis='x', rotation=45)

# Distribution of disagreement metrics
sns.boxplot(data=df_metrics, x='Metric', y='Value', ax=axes[1,1])
axes[1,1].set_title('Distribution of Disagreement Metrics')

plt.tight_layout()
plt.show()

# Print summary statistics
print("\nSummary Statistics:")
print("=" * 50)
print(df_metrics.groupby(['Metric', 'Retriever Pair'])['Value'].describe())

## Generate HTML Report

Let's generate an HTML report of the disagreement analysis for the first query.

In [None]:
# Generate HTML report for the first query
test_query = QUERIES[0]
results = all_results[test_query]
disagreement_metrics = all_metrics[test_query]

# Ensure reports directory exists
os.makedirs('reports', exist_ok=True)

# Generate the report
report_path = f"reports/disagreement_analysis_{test_query.replace(' ', '_')}.html"
report.generate_disagreement_report(
    test_query, 
    results, 
    disagreement_metrics, 
    report_path
)

print(f"✅ HTML report generated: {report_path}")
print("\nOpen the HTML file in your browser to view the detailed report.")

## Hybrid Weight Optimization

Based on our disagreement analysis, let's optimize the hybrid retriever weights.
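
To see what a hybrid weight changes, here is a minimal sketch of one common fusion scheme: a weighted sum of min-max-normalized scores. This notebook does not show how `hybrid_retrieve` or `grid_search_hybrid_weights` combine scores internally, so the scheme, the helper names `min_max` and `fuse`, and the toy scores are all assumptions made for illustration.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(
    bm25_scores: Dict[str, float],
    dense_scores: Dict[str, float],
    bm25_weight: float,
    k: int,
) -> List[str]:
    """Rank documents by a weighted sum of normalized scores."""
    bm25_n, dense_n = min_max(bm25_scores), min_max(dense_scores)
    dense_weight = 1.0 - bm25_weight
    docs = set(bm25_n) | set(dense_n)
    fused = {
        doc: bm25_weight * bm25_n.get(doc, 0.0) + dense_weight * dense_n.get(doc, 0.0)
        for doc in docs
    }
    # Sort by fused score (descending), then by doc id so ties are deterministic.
    return sorted(fused, key=lambda d: (-fused[d], d))[:k]


# Toy scores, invented for illustration only.
toy_bm25 = {"d1": 7.2, "d2": 3.1, "d3": 0.4}
toy_dense = {"d1": 0.35, "d2": 0.52, "d3": 0.81}

grid_size = 5
for step in range(grid_size):
    w = step / (grid_size - 1)
    print(f"bm25_weight={w:.2f} -> {fuse(toy_bm25, toy_dense, w, k=3)}")
```

At `bm25_weight=0.0` the ranking follows the dense scores alone, and at `1.0` it follows BM25 alone. A grid search evaluates an objective at each intermediate weight and keeps the best one.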

In [None]:
# Optimize hybrid weights using grid search
print("Optimizing hybrid retriever weights...")

# Use a subset of queries for optimization
optimization_queries = QUERIES[:2]  # Use first 2 queries

# Perform grid search
optimal_weights, best_score = grid_search_hybrid_weights(
    optimization_queries, 
    CORPUS, 
    k=5, 
    grid_size=5  # Try 5x5 grid of weight combinations
)

print(f"\nOptimal weights found:")
print(f"BM25 weight: {optimal_weights.bm25_weight:.3f}")
print(f"Dense weight: {optimal_weights.dense_weight:.3f}")
print(f"Best diversity score: {best_score:.3f}")

# Test the optimized hybrid retriever
print("\nTesting optimized hybrid retriever:")
print("-" * 40)

# Save the optimized weights to the hybrid config
from autorag_live.pipeline.hybrid_optimizer import save_hybrid_config
save_hybrid_config(optimal_weights)

# Test on a query that was not used for optimization
test_results = hybrid_retriever.hybrid_retrieve(QUERIES[2], CORPUS, 5)
print(f"Results for query '{QUERIES[2]}':")
for i, doc in enumerate(test_results):
    print(f"{i+1}. {doc}")

## Key Insights

From this disagreement analysis, we can observe:

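1. **Agreement Patterns**: Agreement varies by retriever pair; the heatmaps and bar chart show which pairs return the most similar top-k results
2. **Query Sensitivity**: Some queries produce more disagreement than others, which shows up as row-to-row variation in the heatmaps
3. **Optimization Potential**: Where BM25 and dense retrieval disagree, a hybrid retriever can draw on the strengths of both
4. **Evaluation Metrics**: Jaccard measures overlap in which documents are retrieved, while Kendall Tau measures agreement in how they are ordered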
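
One way to make query sensitivity concrete is to average a metric per query and sort. The sketch below assumes a nested dictionary shaped like `all_metrics` (query, then metric name, then value). The helper `mean_agreement` is hypothetical, and the numbers are placeholders, not outputs of this notebook.

```python
from statistics import mean
from typing import Dict


def mean_agreement(per_query: Dict[str, Dict[str, float]], prefix: str) -> Dict[str, float]:
    """Average every metric whose name starts with `prefix`, per query."""
    return {
        query: mean(v for name, v in scores.items() if name.startswith(prefix))
        for query, scores in per_query.items()
    }


# Placeholder values in the shape of `all_metrics`; not real results.
toy_metrics = {
    "query A": {"jaccard_bm25_vs_dense": 0.25, "jaccard_bm25_vs_hybrid": 0.75},
    "query B": {"jaccard_bm25_vs_dense": 1.0, "jaccard_bm25_vs_hybrid": 1.0},
}

by_query = mean_agreement(toy_metrics, "jaccard_")
# Lowest mean agreement first: these are the queries worth inspecting by hand.
for query, score in sorted(by_query.items(), key=lambda item: item[1]):
    print(f"{query}: mean Jaccard = {score:.2f}")
```

Passing `all_metrics` in place of `toy_metrics` would rank the notebook's own queries the same way.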

## Next Steps

- Try different corpora and query types
- Experiment with different retriever combinations
- Implement more sophisticated optimization algorithms
- Add more evaluation metrics and analysis techniques
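
When experimenting with more retrievers, writing out every pair by hand gets tedious. A minimal sketch of a pairwise comparison over any number of result lists follows. The helper `pairwise_jaccard`, the document IDs, and the fourth retriever name `Reranked` are hypothetical, and the overlap formula is the plain set-based Jaccard rather than a confirmed match for `metrics.jaccard_at_k`.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_jaccard(results: Dict[str, List[str]]) -> Dict[str, float]:
    """Jaccard overlap for every unordered pair of retrievers."""
    scores = {}
    for (name_a, docs_a), (name_b, docs_b) in combinations(results.items(), 2):
        set_a, set_b = set(docs_a), set(docs_b)
        union = set_a | set_b
        # Assumption: two empty result lists count as full agreement.
        overlap = len(set_a & set_b) / len(union) if union else 1.0
        scores[f"{name_a} vs {name_b}"] = overlap
    return scores


# Placeholder result lists; "Reranked" is a hypothetical fourth retriever.
toy_results = {
    "BM25": ["d1", "d2", "d3"],
    "Dense": ["d2", "d3", "d4"],
    "Hybrid": ["d1", "d2", "d4"],
    "Reranked": ["d2", "d1", "d3"],
}

for pair, score in pairwise_jaccard(toy_results).items():
    print(f"{pair}: {score:.2f}")
```

With four retrievers this yields six pairs, and the same loop extends to any number of result lists without further changes.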

This notebook demonstrates the core disagreement analysis capabilities of the `autorag-live` system.