# Focused Learning 4: Tied Ranking Metrics (MTRR & TMHits@K)

## Paper: Don't Forget to Connect! Improving RAG with Graph-based Reranking

---

## 📚 Learning Objectives

This notebook examines the tie-aware evaluation metrics introduced in the G-RAG paper, from definition to practical use:

1. **Mean Tied Reciprocal Rank (MTRR)**: A ranking metric that handles tied scores by averaging the reciprocal rank over every ordering of the tied documents
2. **Tied Mean Hits@K (TMHits@K)**: An extension of Hits@K that averages the hit indicator over the same tie-consistent orderings
3. **Implementation and Analysis**: Complete implementation with visualization and comparison to standard metrics
4. **Real-world Applications**: Understanding when and why these metrics are crucial for LLM-based reranking

---

## 🎯 Why Tied Ranking Metrics Matter

### The Problem with Traditional Metrics

When Large Language Models (LLMs) are used as rerankers, they often produce **tied scores** for multiple documents. Traditional ranking metrics like Mean Reciprocal Rank (MRR) and Hits@K assume a strict ordering and leave tie resolution to the implementation, leading to:

- **Unfair evaluation**: Tied documents get arbitrary rankings
- **Inconsistent results**: Same model performance varies based on tie-breaking strategy
- **Misleading comparisons**: Score differences between models can reflect their tie patterns rather than their ranking quality
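
The inconsistency described above can be reproduced in a few lines. The following is a minimal sketch with made-up scores; the helper name `reciprocal_rank` and the two tie-breaking rules are illustrative assumptions, not taken from the paper.

```python
from typing import List

def reciprocal_rank(order: List[int], relevance: List[bool]) -> float:
    """Reciprocal rank of the first relevant document in `order` (0.0 if none)."""
    for rank, idx in enumerate(order, start=1):
        if relevance[idx]:
            return 1.0 / rank
    return 0.0

scores = [0.8, 0.8, 0.8, 0.6]            # three documents tied at the top
relevance = [False, False, True, False]  # only the third document is relevant

# Tie-break 1: keep the original order among equal scores.
by_index = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
# Tie-break 2: reverse the original order among equal scores.
by_reverse_index = sorted(range(len(scores)), key=lambda i: (-scores[i], -i))

rr_index = reciprocal_rank(by_index, relevance)
rr_reverse = reciprocal_rank(by_reverse_index, relevance)
print(f"index tie-break: {rr_index:.3f}, reversed tie-break: {rr_reverse:.3f}")
```

The scores are identical in both cases; only the order in which tied documents are listed changes, yet the relevant document moves from rank 3 to rank 1.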

### The G-RAG Solution

The paper introduces metrics that:
- **Average over all possible tie-breaking arrangements**
- **Provide fair evaluation** regardless of tie-breaking strategy
- **Enable consistent model comparison** across different LLM rerankers

In [None]:
# Environment Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Optional
from collections import defaultdict, Counter
import itertools
from dataclasses import dataclass
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("📊 Environment setup complete!")
print("🎯 Ready to explore tied ranking metrics!")

## 📐 Mathematical Foundation

### Traditional Mean Reciprocal Rank (MRR)

For a set of queries $Q$, where $r_i$ is the rank of the first relevant document for query $i$:

$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{r_i}$$
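
As a quick worked example with assumed ranks (three queries whose first relevant documents appear at ranks 1, 2, and 4):

$$MRR = \frac{1}{3}\left(\frac{1}{1} + \frac{1}{2} + \frac{1}{4}\right) = \frac{7}{12} \approx 0.583$$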

### Mean Tied Reciprocal Rank (MTRR)

When documents have tied scores, we consider every ranking consistent with those scores (all orderings within each tied group) and average the reciprocal rank over them:

$$MTRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{|\Pi_i|} \sum_{\pi \in \Pi_i} \frac{1}{r_i^\pi}$$

Where:
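- $\Pi_i$ is the set of all rankings for query $i$ that are consistent with the scores, i.e. every combination of orderings within its tied groups
- $r_i^\pi$ is the rank of the first relevant document under ranking $\pi$ (the term is taken as $0$ when no document is relevant, as in the implementation below)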
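
As a worked example under assumed inputs (one query, five documents, with the single relevant document tied with one other document for positions 2 and 3), $\Pi_i$ contains two rankings: one places the relevant document at rank 2, the other at rank 3. The per-query term is then:

$$\frac{1}{|\Pi_i|} \sum_{\pi \in \Pi_i} \frac{1}{r_i^\pi} = \frac{1}{2}\left(\frac{1}{2} + \frac{1}{3}\right) = \frac{5}{12} \approx 0.417$$

A traditional MRR computation would report either $0.5$ or $0.333$ for this query, depending on which tied document happens to be listed first.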

### Tied Mean Hits@K (TMHits@K)

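$$TMHits@K = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{|\Pi_i|} \sum_{\pi \in \Pi_i} \mathbb{1}[r_i^\pi \leq K]$$

Where $\mathbb{1}[\cdot]$ is the indicator function, so each query contributes the fraction of its tie-consistent rankings that place a relevant document in the top $K$.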
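
To make the averaging concrete, here is a small brute-force sketch for a single query. It assumes three documents tied at the top with exactly one of them relevant; the document labels and the helper `hit_at_k` are hypothetical.

```python
from itertools import permutations

tied_docs = ["a", "b", "c"]  # three documents sharing the top score
relevant = {"b"}             # assumed ground truth: only "b" is relevant

def hit_at_k(order, k):
    """1.0 if any of the first k documents in `order` is relevant, else 0.0."""
    return 1.0 if any(doc in relevant for doc in order[:k]) else 0.0

# Every ordering of the tied documents is one tie-consistent ranking.
orders = list(permutations(tied_docs))

# Average the hit indicator over all orderings for several cutoffs.
tmhits = {k: sum(hit_at_k(o, k) for o in orders) / len(orders) for k in (1, 2, 3)}
for k, value in tmhits.items():
    print(f"TMHits@{k} = {value:.3f}")
```

With one relevant document among three tied ones, the relevant document is equally likely to land in each of the three positions, so the per-query value is $K/3$ for $K \leq 3$.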

In [None]:
@dataclass
class RankingResult:
    """Represents a document ranking with potential ties"""
    doc_ids: List[str]
    scores: List[float]
    relevance: List[bool]  # True if document is relevant
    
    def __post_init__(self):
        assert len(self.doc_ids) == len(self.scores) == len(self.relevance)
        
    def get_tied_groups(self) -> List[List[int]]:
        """Group document indices by their scores (tied documents)"""
        score_to_indices = defaultdict(list)
        for i, score in enumerate(self.scores):
            score_to_indices[score].append(i)
        
        # Sort by score (descending) and return groups
        sorted_scores = sorted(score_to_indices.keys(), reverse=True)
        return [score_to_indices[score] for score in sorted_scores]

class TiedRankingMetrics:
    """Implementation of tied ranking metrics from G-RAG paper"""
    
    @staticmethod
    def generate_all_rankings(tied_groups: List[List[int]]) -> List[List[int]]:
        """Generate all possible rankings given tied groups"""
        # Generate all permutations within each tied group
        group_permutations = []
        for group in tied_groups:
            group_permutations.append(list(itertools.permutations(group)))
        
        # Combine all group permutations
        all_rankings = []
        for combination in itertools.product(*group_permutations):
            ranking = []
            for group_perm in combination:
                ranking.extend(group_perm)
            all_rankings.append(ranking)
            
        return all_rankings
    
    @staticmethod
    def compute_mrr_single(ranking: List[int], relevance: List[bool]) -> float:
        """Compute the reciprocal rank of the first relevant document in a single ranking (0.0 if none)"""
        for rank, doc_idx in enumerate(ranking, 1):
            if relevance[doc_idx]:
                return 1.0 / rank
        return 0.0
    
    @staticmethod
    def compute_hits_k_single(ranking: List[int], relevance: List[bool], k: int) -> float:
        """Compute Hits@K for a single ranking"""
        # Only the top-k positions can produce a hit, so scan just those
        for doc_idx in ranking[:k]:
            if relevance[doc_idx]:
                return 1.0
        return 0.0
    
    @classmethod
    def compute_mtrr(cls, result: RankingResult) -> float:
        """Compute Mean Tied Reciprocal Rank"""
        tied_groups = result.get_tied_groups()
        all_rankings = cls.generate_all_rankings(tied_groups)
        
        if not all_rankings:
            return 0.0
        
        total_rr = 0.0
        for ranking in all_rankings:
            total_rr += cls.compute_mrr_single(ranking, result.relevance)
        
        return total_rr / len(all_rankings)
    
    @classmethod
    def compute_tmhits_k(cls, result: RankingResult, k: int) -> float:
        """Compute Tied Mean Hits@K"""
        tied_groups = result.get_tied_groups()
        all_rankings = cls.generate_all_rankings(tied_groups)
        
        if not all_rankings:
            return 0.0
        
        total_hits = 0.0
        for ranking in all_rankings:
            total_hits += cls.compute_hits_k_single(ranking, result.relevance, k)
        
        return total_hits / len(all_rankings)
    
    @classmethod
    def compute_traditional_mrr(cls, result: RankingResult) -> float:
        """Compute traditional MRR for one fixed tie-breaking choice"""
        # Sort by score descending; ties are broken by original index, which is
        # an arbitrary but deterministic choice
        sorted_indices = sorted(range(len(result.scores)), 
                              key=lambda i: (-result.scores[i], i))
        return cls.compute_mrr_single(sorted_indices, result.relevance)
    
    @classmethod
    def compute_traditional_hits_k(cls, result: RankingResult, k: int) -> float:
        """Compute traditional Hits@K (ties broken by original index)"""
        sorted_indices = sorted(range(len(result.scores)), 
                              key=lambda i: (-result.scores[i], i))
        return cls.compute_hits_k_single(sorted_indices, result.relevance, k)

print("📐 Tied ranking metrics implementation complete!")
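
The enumeration above grows factorially with tie-group size, and the coarse score grids used later in this notebook can produce tie groups large enough to make it impractical. Since the rank of the first relevant document depends only on the first tied group that contains a relevant document, the same average can be obtained in closed form. The sketch below is one way to do that, assuming exact score equality defines a tie (as in `get_tied_groups`); the function names are hypothetical and this is not presented as the paper's implementation. For a group of $n$ tied documents of which $m$ are relevant, the first relevant document falls at in-group position $j$ with probability $\binom{n-j}{m-1} / \binom{n}{m}$.

```python
from collections import defaultdict
from math import comb
from typing import List, Tuple

def first_relevant_distribution(scores: List[float],
                                relevance: List[bool]) -> List[Tuple[int, float]]:
    """(rank, probability) pairs for the rank of the first relevant document,
    assuming every ordering inside each tied group is equally likely."""
    groups = defaultdict(list)
    for idx, score in enumerate(scores):
        groups[score].append(idx)

    offset = 0  # number of documents ranked strictly above the current group
    for score in sorted(groups, reverse=True):
        members = groups[score]
        n = len(members)
        m = sum(1 for i in members if relevance[i])
        if m > 0:
            # In-group position j is the first relevant one when j is relevant
            # and the other m - 1 relevant documents sit in the n - j later slots.
            return [(offset + j, comb(n - j, m - 1) / comb(n, m))
                    for j in range(1, n - m + 2)]
        offset += n
    return []  # no relevant document anywhere

def mtrr_closed_form(scores: List[float], relevance: List[bool]) -> float:
    """Expected reciprocal rank over all tie-consistent rankings."""
    return sum(p / rank for rank, p in first_relevant_distribution(scores, relevance))

def tmhits_closed_form(scores: List[float], relevance: List[bool], k: int) -> float:
    """Fraction of tie-consistent rankings with a relevant document in the top k."""
    return sum(p for rank, p in first_relevant_distribution(scores, relevance) if rank <= k)

# Three documents tied at the top, one of them relevant.
demo_scores = [0.8, 0.8, 0.8, 0.6, 0.5]
demo_relevance = [False, True, False, False, False]
print(f"MTRR     = {mtrr_closed_form(demo_scores, demo_relevance):.3f}")
print(f"TMHits@1 = {tmhits_closed_form(demo_scores, demo_relevance, 1):.3f}")
```

The cost is linear in the number of documents (plus a sort over distinct scores), so it stays tractable for tie groups of any size.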

## 🧪 Demonstrating the Problem

Let's create scenarios in which the traditional metrics depend on how ties happen to be broken, while the tied metrics do not.

In [None]:
def create_demo_scenarios():
    """Create demonstration scenarios showing tied ranking issues"""
    
    scenarios = {
        "No Ties": RankingResult(
            doc_ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
            scores=[0.9, 0.8, 0.7, 0.6, 0.5],
            relevance=[False, True, False, False, False]  # Relevant doc at rank 2
        ),
        
        "Simple Tie": RankingResult(
            doc_ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
            scores=[0.9, 0.8, 0.8, 0.6, 0.5],  # doc2 and doc3 tied
            relevance=[False, True, False, False, False]  # Relevant doc tied at rank 2-3
        ),
        
        "Complex Tie": RankingResult(
            doc_ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
            scores=[0.8, 0.8, 0.8, 0.6, 0.5],  # Three docs tied at top
            relevance=[False, True, False, False, False]  # Relevant doc in top tie
        ),
        
        "Multiple Ties": RankingResult(
            doc_ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
            scores=[0.9, 0.8, 0.8, 0.6, 0.6],  # Two separate tie groups
            relevance=[False, False, True, False, False]  # Relevant doc in first tie
        )
    }
    
    return scenarios

# Analyze each scenario
scenarios = create_demo_scenarios()
metrics = TiedRankingMetrics()

results_df = []

for name, scenario in scenarios.items():
    # Compute traditional metrics
    trad_mrr = metrics.compute_traditional_mrr(scenario)
    trad_hits5 = metrics.compute_traditional_hits_k(scenario, 5)
    trad_hits3 = metrics.compute_traditional_hits_k(scenario, 3)
    
    # Compute tied metrics
    mtrr = metrics.compute_mtrr(scenario)
    tmhits5 = metrics.compute_tmhits_k(scenario, 5)
    tmhits3 = metrics.compute_tmhits_k(scenario, 3)
    
    # Count tied groups
    tied_groups = scenario.get_tied_groups()
    num_ties = sum(1 for group in tied_groups if len(group) > 1)
    max_tie_size = max(len(group) for group in tied_groups)
    
    results_df.append({
        'Scenario': name,
        'Scores': str(scenario.scores),
        'Relevance': str(scenario.relevance),
        'Num_Ties': num_ties,
        'Max_Tie_Size': max_tie_size,
        'Traditional_MRR': f"{trad_mrr:.3f}",
        'MTRR': f"{mtrr:.3f}",
        'Traditional_Hits@3': f"{trad_hits3:.3f}",
        'TMHits@3': f"{tmhits3:.3f}",
        'Traditional_Hits@5': f"{trad_hits5:.3f}",
        'TMHits@5': f"{tmhits5:.3f}"
    })

results_df = pd.DataFrame(results_df)
print("📊 Scenario Analysis Results:")
print("=" * 80)
display(results_df)

## 🔍 Detailed Analysis: Understanding the Differences

Let's dive deeper into how tied rankings affect the metrics and why the tied versions provide fairer evaluation.

In [None]:
def analyze_ranking_permutations(scenario: RankingResult, scenario_name: str):
    """Analyze all possible rankings for a tied scenario"""
    print(f"\n🔍 Detailed Analysis: {scenario_name}")
    print("=" * 50)
    
    # Show original data
    print(f"📊 Scores: {scenario.scores}")
    print(f"🎯 Relevance: {scenario.relevance}")
    
    # Get tied groups
    tied_groups = scenario.get_tied_groups()
    print(f"\n🔗 Tied Groups:")
    for i, group in enumerate(tied_groups):
        if len(group) > 1:
            group_docs = [scenario.doc_ids[idx] for idx in group]
            group_scores = [scenario.scores[idx] for idx in group]
            group_rel = [scenario.relevance[idx] for idx in group]
            print(f"   Group {i+1}: {group_docs} (score={group_scores[0]}, relevant={group_rel})")
        else:
            doc_idx = group[0]
            print(f"   Single: {scenario.doc_ids[doc_idx]} (score={scenario.scores[doc_idx]}, relevant={scenario.relevance[doc_idx]})")
    
    # Generate all rankings
    all_rankings = metrics.generate_all_rankings(tied_groups)
    print(f"\n📈 Total possible rankings: {len(all_rankings)}")
    
    if len(all_rankings) <= 10:  # Only show details for manageable number
        print("\n🎲 All possible rankings:")
        ranking_mrrs = []
        ranking_hits3 = []
        
        for i, ranking in enumerate(all_rankings):
            ranking_docs = [scenario.doc_ids[idx] for idx in ranking]
            ranking_rel = [scenario.relevance[idx] for idx in ranking]
            
            # Find first relevant doc position
            first_rel_pos = next((pos + 1 for pos, rel in enumerate(ranking_rel) if rel), None)
            
            mrr = metrics.compute_mrr_single(ranking, scenario.relevance)
            hits3 = metrics.compute_hits_k_single(ranking, scenario.relevance, 3)
            
            ranking_mrrs.append(mrr)
            ranking_hits3.append(hits3)
            
            print(f"   Ranking {i+1}: {ranking_docs}")
            print(f"     First relevant at position: {first_rel_pos}")
            print(f"     MRR: {mrr:.3f}, Hits@3: {hits3:.3f}")
        
        print(f"\n📊 Summary:")
        print(f"   MRR range: {min(ranking_mrrs):.3f} - {max(ranking_mrrs):.3f}")
        print(f"   Average MRR (MTRR): {np.mean(ranking_mrrs):.3f}")
        print(f"   Hits@3 range: {min(ranking_hits3):.3f} - {max(ranking_hits3):.3f}")
        print(f"   Average Hits@3 (TMHits@3): {np.mean(ranking_hits3):.3f}")
    else:
        print(f"   (Too many rankings to show details - {len(all_rankings)} total)")

# Analyze the most interesting scenarios
interesting_scenarios = ["Simple Tie", "Complex Tie"]
for name in interesting_scenarios:
    analyze_ranking_permutations(scenarios[name], name)

## 📈 Visualization: Impact of Ties on Evaluation

Let's visualize how the number and size of ties affect the difference between traditional and tied metrics.

In [None]:
def generate_synthetic_scenarios(n_scenarios: int = 100) -> List[RankingResult]:
    """Generate synthetic scenarios with varying tie patterns"""
    np.random.seed(42)
    scenarios = []
    
    for i in range(n_scenarios):
        n_docs = np.random.randint(5, 15)  # 5-14 documents
        
        # Generate scores with some probability of ties
        unique_scores = np.random.uniform(0.1, 1.0, size=np.random.randint(3, n_docs+1))
        scores = np.random.choice(unique_scores, size=n_docs, replace=True)
        
        # Add continuous noise, which in practice breaks every tie in the scenario
        if np.random.random() > 0.3:  # 70% of scenarios get noise and end up tie-free
            noise = np.random.normal(0, 0.01, size=n_docs)
            scores = scores + noise
        
        # Generate relevance (1-3 relevant docs)
        n_relevant = np.random.randint(1, min(4, n_docs+1))
        relevance = [False] * n_docs
        relevant_indices = np.random.choice(n_docs, size=n_relevant, replace=False)
        for idx in relevant_indices:
            relevance[idx] = True
        
        doc_ids = [f"doc_{i}_{j}" for j in range(n_docs)]
        
        scenarios.append(RankingResult(
            doc_ids=doc_ids,
            scores=scores.tolist(),
            relevance=relevance
        ))
    
    return scenarios

# Generate and analyze synthetic scenarios
synthetic_scenarios = generate_synthetic_scenarios(200)
analysis_results = []

for i, scenario in enumerate(synthetic_scenarios):
    # Count ties
    tied_groups = scenario.get_tied_groups()
    num_ties = sum(1 for group in tied_groups if len(group) > 1)
    max_tie_size = max(len(group) for group in tied_groups) if tied_groups else 1
    total_tied_docs = sum(len(group) for group in tied_groups if len(group) > 1)
    
    # Compute metrics
    trad_mrr = metrics.compute_traditional_mrr(scenario)
    mtrr = metrics.compute_mtrr(scenario)
    trad_hits3 = metrics.compute_traditional_hits_k(scenario, 3)
    tmhits3 = metrics.compute_tmhits_k(scenario, 3)
    
    analysis_results.append({
        'scenario_id': i,
        'n_docs': len(scenario.doc_ids),
        'n_relevant': sum(scenario.relevance),
        'num_ties': num_ties,
        'max_tie_size': max_tie_size,
        'total_tied_docs': total_tied_docs,
        'tie_ratio': total_tied_docs / len(scenario.doc_ids),
        'traditional_mrr': trad_mrr,
        'mtrr': mtrr,
        'mrr_diff': abs(trad_mrr - mtrr),
        'traditional_hits3': trad_hits3,
        'tmhits3': tmhits3,
        'hits3_diff': abs(trad_hits3 - tmhits3)
    })

analysis_df = pd.DataFrame(analysis_results)

print(f"📊 Analyzed {len(synthetic_scenarios)} synthetic scenarios")
print(f"📈 Scenarios with ties: {(analysis_df['num_ties'] > 0).sum()} ({(analysis_df['num_ties'] > 0).mean()*100:.1f}%)")
print(f"📊 Average tie ratio: {analysis_df['tie_ratio'].mean():.3f}")

In [None]:
# Create comprehensive visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Impact of Tied Rankings on Evaluation Metrics', fontsize=16, fontweight='bold')

# 1. Tie ratio vs MRR difference
axes[0, 0].scatter(analysis_df['tie_ratio'], analysis_df['mrr_diff'], alpha=0.6, s=50)
axes[0, 0].set_xlabel('Tie Ratio (Fraction of Tied Documents)')
axes[0, 0].set_ylabel('|Traditional MRR - MTRR|')
axes[0, 0].set_title('MRR Difference vs Tie Ratio')
axes[0, 0].grid(True, alpha=0.3)

# 2. Max tie size vs MRR difference
axes[0, 1].scatter(analysis_df['max_tie_size'], analysis_df['mrr_diff'], alpha=0.6, s=50)
axes[0, 1].set_xlabel('Maximum Tie Size')
axes[0, 1].set_ylabel('|Traditional MRR - MTRR|')
axes[0, 1].set_title('MRR Difference vs Max Tie Size')
axes[0, 1].grid(True, alpha=0.3)

# 3. Distribution of MRR differences
axes[0, 2].hist(analysis_df['mrr_diff'], bins=30, alpha=0.7, edgecolor='black')
axes[0, 2].set_xlabel('|Traditional MRR - MTRR|')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].set_title('Distribution of MRR Differences')
axes[0, 2].grid(True, alpha=0.3)

# 4. Tie ratio vs Hits@3 difference
axes[1, 0].scatter(analysis_df['tie_ratio'], analysis_df['hits3_diff'], alpha=0.6, s=50, color='orange')
axes[1, 0].set_xlabel('Tie Ratio (Fraction of Tied Documents)')
axes[1, 0].set_ylabel('|Traditional Hits@3 - TMHits@3|')
axes[1, 0].set_title('Hits@3 Difference vs Tie Ratio')
axes[1, 0].grid(True, alpha=0.3)

# 5. Max tie size vs Hits@3 difference
axes[1, 1].scatter(analysis_df['max_tie_size'], analysis_df['hits3_diff'], alpha=0.6, s=50, color='orange')
axes[1, 1].set_xlabel('Maximum Tie Size')
axes[1, 1].set_ylabel('|Traditional Hits@3 - TMHits@3|')
axes[1, 1].set_title('Hits@3 Difference vs Max Tie Size')
axes[1, 1].grid(True, alpha=0.3)

# 6. Correlation between MRR and Hits@3 differences
axes[1, 2].scatter(analysis_df['mrr_diff'], analysis_df['hits3_diff'], alpha=0.6, s=50, color='green')
axes[1, 2].set_xlabel('|Traditional MRR - MTRR|')
axes[1, 2].set_ylabel('|Traditional Hits@3 - TMHits@3|')
axes[1, 2].set_title('MRR vs Hits@3 Difference Correlation')
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary statistics
print("\n📊 Summary Statistics:")
print("=" * 40)
print(f"Mean MRR difference: {analysis_df['mrr_diff'].mean():.4f} (±{analysis_df['mrr_diff'].std():.4f})")
print(f"Max MRR difference: {analysis_df['mrr_diff'].max():.4f}")
print(f"Mean Hits@3 difference: {analysis_df['hits3_diff'].mean():.4f} (±{analysis_df['hits3_diff'].std():.4f})")
print(f"Max Hits@3 difference: {analysis_df['hits3_diff'].max():.4f}")
print(f"\nCorrelation between tie ratio and MRR diff: {analysis_df['tie_ratio'].corr(analysis_df['mrr_diff']):.3f}")
print(f"Correlation between max tie size and MRR diff: {analysis_df['max_tie_size'].corr(analysis_df['mrr_diff']):.3f}")

## 🤖 LLM Reranking Simulation

Let's simulate how LLM-based rerankers might produce tied scores and why tied metrics are crucial for fair evaluation.
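
One mechanism behind such ties is coarse scoring: when a reranker can only emit a handful of distinct values, many documents must share a score. The following is a minimal sketch of that effect, using made-up uniform scores and the hypothetical helpers `snap_to_grid` and `count_tied_docs`. It is a toy model, not a measurement of any real LLM.

```python
import numpy as np

def snap_to_grid(scores: np.ndarray, levels: int) -> np.ndarray:
    """Round each score to the nearest of `levels` evenly spaced values in [0, 1]."""
    grid = np.linspace(0.0, 1.0, levels)
    nearest = np.abs(scores[:, None] - grid[None, :]).argmin(axis=1)
    return grid[nearest]

def count_tied_docs(scores: np.ndarray) -> int:
    """Number of documents that share their score with at least one other document."""
    _, counts = np.unique(scores, return_counts=True)
    return int(counts[counts > 1].sum())

rng = np.random.default_rng(0)
raw_scores = rng.uniform(0.0, 1.0, size=20)  # toy continuous scores for 20 documents

print("continuous scores, tied docs:", count_tied_docs(raw_scores))
for levels in (20, 10, 5, 3):
    tied = count_tied_docs(snap_to_grid(raw_scores, levels))
    print(f"{levels:>2} score levels, tied docs: {tied}")
```

With 20 documents and only 3 score levels, at least 18 documents necessarily end up tied, whatever the raw scores are.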

In [None]:
class LLMRerankerSimulator:
    """Simulates different LLM reranking behaviors"""
    
    def __init__(self, model_name: str, tie_probability: float = 0.3, score_discretization: int = 10):
        self.model_name = model_name
        self.tie_probability = tie_probability
        self.score_discretization = score_discretization
    
    def rerank_documents(self, query: str, documents: List[str], true_relevance: List[bool]) -> RankingResult:
        """Simulate LLM reranking with potential ties"""
        n_docs = len(documents)
        
        # Simulate base relevance scores (with some correlation to true relevance)
        base_scores = np.random.uniform(0.1, 0.9, n_docs)
        
        # Boost relevant documents (but not perfectly)
        for i, is_relevant in enumerate(true_relevance):
            if is_relevant:
                if np.random.random() > 0.1:  # 90% chance to boost relevant docs
                    base_scores[i] += np.random.uniform(0.1, 0.3)
        
        # Clip scores to [0, 1]
        base_scores = np.clip(base_scores, 0, 1)
        
        # Discretize scores to simulate LLM output patterns
        if self.score_discretization > 0:
            score_levels = np.linspace(0, 1, self.score_discretization)
            discretized_scores = []
            for score in base_scores:
                # Find closest level
                closest_level = score_levels[np.argmin(np.abs(score_levels - score))]
                discretized_scores.append(closest_level)
            base_scores = np.array(discretized_scores)
        
        # Introduce additional ties based on tie_probability
        if self.tie_probability > 0:
            for i in range(n_docs):
                if np.random.random() < self.tie_probability:
                    # Find another document to tie with
                    other_idx = np.random.choice([j for j in range(n_docs) if j != i])
                    base_scores[other_idx] = base_scores[i]
        
        doc_ids = [f"doc_{i}" for i in range(n_docs)]
        
        return RankingResult(
            doc_ids=doc_ids,
            scores=base_scores.tolist(),
            relevance=true_relevance
        )

# Simulate different LLM rerankers (labels are illustrative; tie rates are synthetic, not measured from these models)
rerankers = {
    "GPT-4 (Low Ties)": LLMRerankerSimulator("GPT-4", tie_probability=0.1, score_discretization=20),
    "PaLM-2 (Medium Ties)": LLMRerankerSimulator("PaLM-2", tie_probability=0.3, score_discretization=10),
    "Claude (High Ties)": LLMRerankerSimulator("Claude", tie_probability=0.5, score_discretization=5),
    "Local LLM (Very High Ties)": LLMRerankerSimulator("Local", tie_probability=0.7, score_discretization=3)
}

def evaluate_llm_rerankers(n_queries: int = 50):
    """Evaluate different LLM rerankers on synthetic queries"""
    np.random.seed(42)
    
    results = []
    
    for query_id in range(n_queries):
        # Generate synthetic query scenario
        n_docs = np.random.randint(8, 20)
        n_relevant = np.random.randint(1, min(5, n_docs))
        
        documents = [f"Document {i} about topic X" for i in range(n_docs)]
        true_relevance = [False] * n_docs
        relevant_indices = np.random.choice(n_docs, size=n_relevant, replace=False)
        for idx in relevant_indices:
            true_relevance[idx] = True
        
        query = f"Query {query_id} about topic X"
        
        # Evaluate each reranker
        for reranker_name, reranker in rerankers.items():
            ranking_result = reranker.rerank_documents(query, documents, true_relevance)
            
            # Compute metrics
            trad_mrr = metrics.compute_traditional_mrr(ranking_result)
            mtrr = metrics.compute_mtrr(ranking_result)
            trad_hits5 = metrics.compute_traditional_hits_k(ranking_result, 5)
            tmhits5 = metrics.compute_tmhits_k(ranking_result, 5)
            
            # Analyze ties
            tied_groups = ranking_result.get_tied_groups()
            num_ties = sum(1 for group in tied_groups if len(group) > 1)
            tie_ratio = sum(len(group) for group in tied_groups if len(group) > 1) / len(documents)
            
            results.append({
                'query_id': query_id,
                'reranker': reranker_name,
                'n_docs': n_docs,
                'n_relevant': n_relevant,
                'num_ties': num_ties,
                'tie_ratio': tie_ratio,
                'traditional_mrr': trad_mrr,
                'mtrr': mtrr,
                'mrr_diff': abs(trad_mrr - mtrr),
                'traditional_hits5': trad_hits5,
                'tmhits5': tmhits5,
                'hits5_diff': abs(trad_hits5 - tmhits5)
            })
    
    return pd.DataFrame(results)

# Run evaluation
llm_results = evaluate_llm_rerankers(100)

print("🤖 LLM Reranker Evaluation Complete!")
print(f"📊 Evaluated {len(rerankers)} rerankers on {llm_results['query_id'].nunique()} queries")

In [None]:
# Analyze LLM reranker performance
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('LLM Reranker Performance: Traditional vs Tied Metrics', fontsize=16, fontweight='bold')

# 1. Mean performance by reranker (MRR)
mrr_summary = llm_results.groupby('reranker').agg({
    'traditional_mrr': 'mean',
    'mtrr': 'mean',
    'tie_ratio': 'mean'
}).reset_index()

x = np.arange(len(mrr_summary))
width = 0.35

axes[0, 0].bar(x - width/2, mrr_summary['traditional_mrr'], width, label='Traditional MRR', alpha=0.8)
axes[0, 0].bar(x + width/2, mrr_summary['mtrr'], width, label='MTRR', alpha=0.8)
axes[0, 0].set_xlabel('LLM Reranker')
axes[0, 0].set_ylabel('Mean MRR')
axes[0, 0].set_title('Mean MRR Comparison')
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(mrr_summary['reranker'], rotation=45, ha='right')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Tie ratio by reranker
tie_summary = llm_results.groupby('reranker')['tie_ratio'].mean()
axes[0, 1].bar(range(len(tie_summary)), tie_summary.values, color='orange', alpha=0.8)
axes[0, 1].set_xlabel('LLM Reranker')
axes[0, 1].set_ylabel('Mean Tie Ratio')
axes[0, 1].set_title('Average Tie Ratio by Reranker')
axes[0, 1].set_xticks(range(len(tie_summary)))
axes[0, 1].set_xticklabels(tie_summary.index, rotation=45, ha='right')
axes[0, 1].grid(True, alpha=0.3)

# 3. MRR difference vs tie ratio
for reranker in rerankers.keys():
    subset = llm_results[llm_results['reranker'] == reranker]
    axes[1, 0].scatter(subset['tie_ratio'], subset['mrr_diff'], 
                      label=reranker, alpha=0.6, s=30)

axes[1, 0].set_xlabel('Tie Ratio')
axes[1, 0].set_ylabel('|Traditional MRR - MTRR|')
axes[1, 0].set_title('MRR Difference vs Tie Ratio by Reranker')
axes[1, 0].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
axes[1, 0].grid(True, alpha=0.3)

# 4. Distribution of metric differences
for reranker in rerankers.keys():
    subset = llm_results[llm_results['reranker'] == reranker]
    axes[1, 1].hist(subset['mrr_diff'], bins=20, alpha=0.5, label=reranker, density=True)

axes[1, 1].set_xlabel('|Traditional MRR - MTRR|')
axes[1, 1].set_ylabel('Density')
axes[1, 1].set_title('Distribution of MRR Differences')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary table
summary_stats = llm_results.groupby('reranker').agg({
    'tie_ratio': ['mean', 'std'],
    'traditional_mrr': ['mean', 'std'],
    'mtrr': ['mean', 'std'],
    'mrr_diff': ['mean', 'max'],
    'traditional_hits5': ['mean', 'std'],
    'tmhits5': ['mean', 'std'],
    'hits5_diff': ['mean', 'max']
}).round(4)

print("\n📊 LLM Reranker Summary Statistics:")
print("=" * 60)
display(summary_stats)

## 🔧 Implementation Guidelines & Best Practices

### When to Use Tied Metrics

1. **LLM-based rerankers**: When using large language models for document reranking
2. **Discrete scoring systems**: When rerankers produce scores from a limited set of values
3. **Model comparison**: When comparing different reranking approaches with varying tie patterns
4. **Production systems**: When evaluation consistency is crucial for system monitoring

### Implementation Considerations

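1. **Computational complexity**: Exhaustive enumeration costs the product of the factorials of the tie-group sizes; a single group of 10 tied documents already yields 10! = 3,628,800 rankings
2. **Approximation methods**: For tie groups too large to enumerate, estimate the metric from randomly sampled tie-consistent rankings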
3. **Reporting**: Always report both traditional and tied metrics for comprehensive evaluation

In [None]:
class OptimizedTiedMetrics:
    """Optimized implementation for large-scale evaluation"""
    
    @staticmethod
    def compute_mtrr_efficient(result: RankingResult, max_permutations: int = 10000) -> float:
        """Compute MTRR with optimization for large tie groups"""
        tied_groups = result.get_tied_groups()
        
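        # Calculate total permutations (product of factorials of tie-group sizes).
        # math.factorial is imported locally because np.math was removed in NumPy 2.0.
        from math import factorial
        total_perms = 1
        for group in tied_groups:
            if len(group) > 1:
                total_perms *= factorial(len(group))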
        
        if total_perms <= max_permutations:
            # Use exact computation
            return TiedRankingMetrics.compute_mtrr(result)
        else:
            # Use sampling approximation
            return OptimizedTiedMetrics._compute_mtrr_sampling(
                result, tied_groups, max_permutations
            )
    
    @staticmethod
    def _compute_mtrr_sampling(result: RankingResult, tied_groups: List[List[int]], 
                              n_samples: int) -> float:
        """Approximate MTRR using random sampling of permutations"""
        total_rr = 0.0
        
        for _ in range(n_samples):
            # Generate random ranking
            ranking = []
            for group in tied_groups:
                if len(group) > 1:
                    # Random permutation of tied group
                    shuffled_group = list(group)
                    np.random.shuffle(shuffled_group)
                    ranking.extend(shuffled_group)
                else:
                    ranking.extend(group)
            
            # Compute MRR for this ranking
            total_rr += TiedRankingMetrics.compute_mrr_single(ranking, result.relevance)
        
        return total_rr / n_samples
    
    @staticmethod
    def analyze_computational_complexity(tied_groups: List[List[int]]) -> Dict:
        """Analyze computational complexity of tied ranking evaluation"""
        # math.factorial is imported locally because np.math was removed in NumPy 2.0
        from math import factorial
        total_perms = 1
        tie_sizes = []
        
        for group in tied_groups:
            if len(group) > 1:
                tie_sizes.append(len(group))
                total_perms *= factorial(len(group))
        
        return {
            'total_permutations': total_perms,
            'num_tie_groups': len(tie_sizes),
            'tie_sizes': tie_sizes,
            'max_tie_size': max(tie_sizes) if tie_sizes else 0,
            'is_tractable': total_perms <= 10000,
            'recommended_sampling': max(1000, min(10000, total_perms // 10))
        }

# Test computational efficiency
def test_computational_efficiency():
    """Test computational efficiency of tied metrics"""
    test_cases = [
        {"name": "Small ties", "tie_sizes": [2, 2]},
        {"name": "Medium ties", "tie_sizes": [3, 3, 2]},
        {"name": "Large tie", "tie_sizes": [5]},
        {"name": "Very large tie", "tie_sizes": [8]},
        {"name": "Multiple large ties", "tie_sizes": [4, 4, 3]}
    ]
    
    print("⚡ Computational Complexity Analysis")
    print("=" * 50)
    
    for case in test_cases:
        # Create mock tied groups
        tied_groups = []
        current_idx = 0
        
        for tie_size in case["tie_sizes"]:
            tied_groups.append(list(range(current_idx, current_idx + tie_size)))
            current_idx += tie_size
        
        # Add remaining single documents
        for i in range(current_idx, current_idx + 3):
            tied_groups.append([i])
        
        complexity = OptimizedTiedMetrics.analyze_computational_complexity(tied_groups)
        
        print(f"\n📊 {case['name']}:")
        print(f"   Tie sizes: {case['tie_sizes']}")
        print(f"   Total permutations: {complexity['total_permutations']:,}")
        print(f"   Tractable: {'✅' if complexity['is_tractable'] else '❌'}")
        if not complexity['is_tractable']:
            print(f"   Recommended sampling: {complexity['recommended_sampling']:,}")

test_computational_efficiency()
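
The sampling fallback above returns only a point estimate. A possible refinement, sketched here under the assumption that tie-consistent rankings are sampled uniformly and independently, also reports a standard error so the number of samples can be judged. The function name `sample_mtrr` and its parameters are hypothetical; it uses NumPy's `default_rng` with an explicit seed for reproducibility.

```python
import numpy as np
from collections import defaultdict
from typing import List, Tuple

def sample_mtrr(scores: List[float], relevance: List[bool],
                n_samples: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Monte Carlo estimate of the tie-averaged reciprocal rank and its standard error."""
    rng = np.random.default_rng(seed)

    # Group document indices by score, highest score first.
    groups = defaultdict(list)
    for idx, score in enumerate(scores):
        groups[score].append(idx)
    ordered_groups = [groups[s] for s in sorted(groups, reverse=True)]

    samples = np.zeros(n_samples)  # stays 0.0 when no document is relevant
    for t in range(n_samples):
        # Shuffle each tied group independently, keeping groups in score order.
        ranking = [int(i) for group in ordered_groups for i in rng.permutation(group)]
        for rank, idx in enumerate(ranking, start=1):
            if relevance[idx]:
                samples[t] = 1.0 / rank
                break

    estimate = float(samples.mean())
    std_error = float(samples.std(ddof=1) / np.sqrt(n_samples))
    return estimate, std_error

# Three documents tied at the top, one of them relevant; exact value is 11/18.
estimate, std_error = sample_mtrr([0.8, 0.8, 0.8, 0.6, 0.5],
                                  [False, True, False, False, False])
print(f"estimate = {estimate:.4f} +/- {std_error:.4f} (exact = {11 / 18:.4f})")
```

The standard error shrinks with the square root of `n_samples`, so halving it requires roughly four times as many sampled rankings.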

## 🎯 Practical Implementation for G-RAG

Let's implement a complete evaluation framework that integrates tied metrics into the G-RAG system.

In [None]:
class GRAGEvaluator:
    """Complete evaluation framework for G-RAG with tied metrics"""
    
    def __init__(self, use_tied_metrics: bool = True, max_permutations: int = 10000):
        self.use_tied_metrics = use_tied_metrics
        self.max_permutations = max_permutations
        self.tied_metrics = TiedRankingMetrics()
        self.optimized_metrics = OptimizedTiedMetrics()
    
    def evaluate_single_query(self, ranking_result: RankingResult) -> Dict:
        """Evaluate a single query ranking"""
        results = {}
        
        # Traditional metrics
        results['traditional_mrr'] = self.tied_metrics.compute_traditional_mrr(ranking_result)
        results['traditional_hits_1'] = self.tied_metrics.compute_traditional_hits_k(ranking_result, 1)
        results['traditional_hits_3'] = self.tied_metrics.compute_traditional_hits_k(ranking_result, 3)
        results['traditional_hits_5'] = self.tied_metrics.compute_traditional_hits_k(ranking_result, 5)
        results['traditional_hits_10'] = self.tied_metrics.compute_traditional_hits_k(ranking_result, 10)
        
        if self.use_tied_metrics:
            # Tied metrics
            results['mtrr'] = self.optimized_metrics.compute_mtrr_efficient(
                ranking_result, self.max_permutations
            )
            results['tmhits_1'] = self.tied_metrics.compute_tmhits_k(ranking_result, 1)
            results['tmhits_3'] = self.tied_metrics.compute_tmhits_k(ranking_result, 3)
            results['tmhits_5'] = self.tied_metrics.compute_tmhits_k(ranking_result, 5)
            results['tmhits_10'] = self.tied_metrics.compute_tmhits_k(ranking_result, 10)
            
            # Tie analysis
            tied_groups = ranking_result.get_tied_groups()
            complexity = self.optimized_metrics.analyze_computational_complexity(tied_groups)
            
            results['num_tie_groups'] = complexity['num_tie_groups']
            results['max_tie_size'] = complexity['max_tie_size']
            results['total_permutations'] = complexity['total_permutations']
            results['is_exact_computation'] = complexity['is_tractable']
        
        return results
    
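    def evaluate_query_set(self, ranking_results: List[RankingResult]) -> Tuple[Dict, List[Dict]]:
        """Evaluate a set of query rankings.

        Returns (aggregated statistics, per-query results).
        """
        if not ranking_results:
            raise ValueError("ranking_results must contain at least one query")

        # Evaluate each query independently
        individual_results = [self.evaluate_single_query(r) for r in ranking_results]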
        
        # Aggregate results
        aggregated = {}
        
        # Calculate means for all metrics
        for key in individual_results[0].keys():
            if isinstance(individual_results[0][key], (int, float)):
                values = [r[key] for r in individual_results]
                aggregated[f'mean_{key}'] = np.mean(values)
                aggregated[f'std_{key}'] = np.std(values)
        
        # Add summary statistics
        aggregated['num_queries'] = len(ranking_results)
        
        if self.use_tied_metrics:
            # Percentage of queries with ties
            queries_with_ties = sum(1 for r in individual_results if r['num_tie_groups'] > 0)
            aggregated['queries_with_ties_pct'] = queries_with_ties / len(ranking_results) * 100
            
            # Average differences between traditional and tied metrics
            mrr_diffs = [abs(r['traditional_mrr'] - r['mtrr']) for r in individual_results]
            aggregated['mean_mrr_difference'] = np.mean(mrr_diffs)
            aggregated['max_mrr_difference'] = np.max(mrr_diffs)
        
        return aggregated, individual_results
    
    def generate_evaluation_report(self, ranking_results: List[RankingResult], 
                                 model_name: str = "G-RAG") -> str:
        """Generate comprehensive evaluation report"""
        aggregated, individual = self.evaluate_query_set(ranking_results)
        
        report = f"""
🎯 G-RAG Evaluation Report: {model_name}
{'=' * 60}

📊 Dataset Summary:
   Number of queries: {aggregated['num_queries']}
"""
        
        if self.use_tied_metrics:
            report += f"""
   Queries with ties: {aggregated['queries_with_ties_pct']:.1f}%
   Avg tie groups per query: {aggregated['mean_num_tie_groups']:.1f}
   Avg max tie size per query: {aggregated['mean_max_tie_size']:.1f}
"""
        
        report += f"""

📈 Traditional Metrics:
   MRR: {aggregated['mean_traditional_mrr']:.4f} (±{aggregated['std_traditional_mrr']:.4f})
   Hits@1: {aggregated['mean_traditional_hits_1']:.4f} (±{aggregated['std_traditional_hits_1']:.4f})
   Hits@3: {aggregated['mean_traditional_hits_3']:.4f} (±{aggregated['std_traditional_hits_3']:.4f})
   Hits@5: {aggregated['mean_traditional_hits_5']:.4f} (±{aggregated['std_traditional_hits_5']:.4f})
   Hits@10: {aggregated['mean_traditional_hits_10']:.4f} (±{aggregated['std_traditional_hits_10']:.4f})
"""
        
        if self.use_tied_metrics:
            report += f"""

🔄 Tied Metrics (Fair Evaluation):
   MTRR: {aggregated['mean_mtrr']:.4f} (±{aggregated['std_mtrr']:.4f})
   TMHits@1: {aggregated['mean_tmhits_1']:.4f} (±{aggregated['std_tmhits_1']:.4f})
   TMHits@3: {aggregated['mean_tmhits_3']:.4f} (±{aggregated['std_tmhits_3']:.4f})
   TMHits@5: {aggregated['mean_tmhits_5']:.4f} (±{aggregated['std_tmhits_5']:.4f})
   TMHits@10: {aggregated['mean_tmhits_10']:.4f} (±{aggregated['std_tmhits_10']:.4f})

🔍 Metric Stability Analysis:
   Mean |MRR - MTRR|: {aggregated['mean_mrr_difference']:.4f}
   Max |MRR - MTRR|: {aggregated['max_mrr_difference']:.4f}
   
✅ Recommendation: {'Use tied metrics for fair comparison' if aggregated['queries_with_ties_pct'] > 10 else 'Traditional metrics sufficient'}
"""
        
        return report

# Demonstrate with G-RAG evaluation
def demo_grag_evaluation():
    """Demonstrate G-RAG evaluation with tied metrics"""
    
    # Create synthetic G-RAG results (simulating different reranker behaviors)
    grag_scenarios = []
    
    # Simulate G-RAG results on various query types
    query_types = [
        {"name": "Simple factual", "tie_prob": 0.2, "performance": 0.8},
        {"name": "Complex reasoning", "tie_prob": 0.4, "performance": 0.6},
        {"name": "Multi-hop", "tie_prob": 0.5, "performance": 0.7},
        {"name": "Ambiguous", "tie_prob": 0.6, "performance": 0.5}
    ]
    
    np.random.seed(42)
    
    for query_type in query_types:
        for _ in range(25):  # 25 queries per type
            n_docs = np.random.randint(10, 20)
            n_relevant = np.random.randint(1, 4)
            
            # Simulate G-RAG scores with ties
            scores = np.random.uniform(0.1, 1.0, n_docs)
            
            # Boost relevant documents based on performance
            relevance = [False] * n_docs
            relevant_indices = np.random.choice(n_docs, size=n_relevant, replace=False)
            for idx in relevant_indices:
                relevance[idx] = True
                if np.random.random() < query_type["performance"]:
                    scores[idx] += np.random.uniform(0.2, 0.4)
            
            # Introduce ties
            if np.random.random() < query_type["tie_prob"]:
                # Create some tied scores
                tie_value = np.random.choice(scores)
                n_tied = np.random.randint(2, min(5, n_docs))
                tied_indices = np.random.choice(n_docs, size=n_tied, replace=False)
                for idx in tied_indices:
                    scores[idx] = tie_value
            
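            # Clip to [0, 1]. Boosted scores above 1.0 all collapse to 1.0,
            # so this step can create extra ties beyond those added via tie_prob.
            scores = np.clip(scores, 0, 1)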
            doc_ids = [f"doc_{i}" for i in range(n_docs)]
            
            grag_scenarios.append(RankingResult(
                doc_ids=doc_ids,
                scores=scores.tolist(),
                relevance=relevance
            ))
    
    # Evaluate with tied metrics
    evaluator = GRAGEvaluator(use_tied_metrics=True)
    report = evaluator.generate_evaluation_report(grag_scenarios, "G-RAG")
    
    print(report)
    
    return grag_scenarios, evaluator

# Run demonstration
grag_scenarios, evaluator = demo_grag_evaluation()

## 🎓 Key Insights and Takeaways

### 1. **The Tied Ranking Problem**
- LLM-based rerankers frequently produce tied scores
- Traditional metrics give inconsistent results depending on tie-breaking
- This inconsistency makes model comparison unreliable

### 2. **Mathematical Foundation**
- **MTRR**: Averages reciprocal ranks across all possible tie arrangements
- **TMHits@K**: Extends Hits@K to handle tied rankings fairly
- Both metrics provide stable, tie-breaking-independent evaluation
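
To make the averaging concrete, here is a minimal sketch that enumerates every tie-break ordering of a toy ranking and averages the reciprocal rank of the first relevant document. The helper names (`reciprocal_rank`, `tie_groups`, `all_tie_break_orders`) and the toy scores are illustrative assumptions, not part of the notebook's `TiedRankingMetrics` class. The sketch follows this notebook's description of MTRR as an average over tie arrangements rather than quoting the paper's formula.

```python
from itertools import permutations, product


def reciprocal_rank(order, relevant):
    """Return 1 / rank of the first relevant document in `order` (0.0 if none)."""
    for rank, doc in enumerate(order, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def tie_groups(scores):
    """Group document ids by score, highest score first (hypothetical helper)."""
    groups = {}
    for doc, score in scores.items():
        groups.setdefault(score, []).append(doc)
    return [groups[s] for s in sorted(groups, reverse=True)]


def all_tie_break_orders(scores):
    """Yield every full ordering that is consistent with the scores."""
    per_group = [list(permutations(group)) for group in tie_groups(scores)]
    for combo in product(*per_group):
        yield [doc for group in combo for doc in group]


# Toy example: documents b, c and d share the same score.
scores = {"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.7, "e": 0.2}
relevant = {"c"}

rr_values = [reciprocal_rank(order, relevant) for order in all_tie_break_orders(scores)]
best, worst = max(rr_values), min(rr_values)
averaged = sum(rr_values) / len(rr_values)

print(f"Orderings enumerated: {len(rr_values)}")
print(f"Reciprocal rank range across tie-breaks: {worst:.3f} to {best:.3f}")
print(f"Average over all tie-breaks: {averaged:.3f}")
```

Here the relevant document `c` can land at rank 2, 3, or 4 depending on how the three-way tie is broken, so a traditional reciprocal rank could be reported as anything from 0.25 to 0.5. The averaged value is 13/36 ≈ 0.361 regardless of which tie-break the evaluator happens to use.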

### 3. **Computational Considerations**
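- Exact computation becomes intractable for large tie groups, because the number of tie arrangements grows factorially with group size
- Sampling tie-break permutations approximates the exact value at lower cost, with an error that shrinks as the sample count grows
- The sample count controls the trade-off between precision and computational efficiency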
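
The trade-off above can be sketched on a single tie group. The example assumes one relevant document inside a tie group of 12 documents that starts at rank 1; under that assumption the exact average has a simple closed form, which serves as a reference for a sampled estimate. The function names `exact_mean_rr` and `sampled_mean_rr` are hypothetical and are not the notebook's `OptimizedTiedMetrics` API.

```python
import math
import random


def exact_mean_rr(start_rank, tie_size):
    """Exact average of 1/rank when a single relevant document is equally
    likely to occupy any of the `tie_size` positions starting at `start_rank`."""
    return sum(1.0 / (start_rank + j) for j in range(tie_size)) / tie_size


def sampled_mean_rr(start_rank, tie_size, n_samples, seed=0):
    """Estimate the same average by shuffling the tie group `n_samples` times."""
    rng = random.Random(seed)
    group = list(range(tie_size))  # document 0 plays the role of the relevant one
    total = 0.0
    for _ in range(n_samples):
        rng.shuffle(group)
        total += 1.0 / (start_rank + group.index(0))
    return total / n_samples


tie_size = 12
n_orders = math.factorial(tie_size)  # orderings a brute-force enumeration would visit
exact = exact_mean_rr(start_rank=1, tie_size=tie_size)
approx = sampled_mean_rr(start_rank=1, tie_size=tie_size, n_samples=20_000)

print(f"Orderings to enumerate exactly: {n_orders:,}")
print(f"Closed-form average reciprocal rank: {exact:.4f}")
print(f"Sampled estimate (20,000 shuffles): {approx:.4f}")
print(f"Absolute error: {abs(exact - approx):.4f}")
```

With several relevant documents in the same tie group the closed form no longer applies directly, which is one reason a sampling fallback is useful when enumeration is too expensive.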

### 4. **Practical Impact**
- Tied metrics provide fairer comparison between different LLM rerankers
- Essential for production systems where consistent evaluation is crucial
- Help distinguish models that are truly better from those that only appear better because of tie-breaking

### 5. **Integration with G-RAG**
- Tied metrics complement the G-RAG architecture
- Enable fair evaluation of the graph-based reranking approach
- Support better model selection and hyperparameter tuning

---

## 🚀 Research Extensions

1. **Adaptive Tie Handling**: Develop metrics that weight ties based on confidence scores
2. **Hierarchical Ties**: Handle nested tie structures in complex ranking scenarios
3. **Multi-objective Ties**: Extend to scenarios with multiple ranking criteria
4. **Online Evaluation**: Adapt tied metrics for streaming/online evaluation settings

## 🎯 Conclusion

**Mean Tied Reciprocal Rank (MTRR)** and **Tied Mean Hits@K (TMHits@K)** make the evaluation of retrieval systems fairer, particularly for systems that use LLMs as rerankers. These metrics:

- ✅ **Eliminate tie-breaking bias** in evaluation
- ✅ **Enable consistent model comparison** across different systems
- ✅ **Provide a theoretical foundation** for fair ranking evaluation
- ✅ **Support practical implementation** with computational optimizations

For the G-RAG system and other modern retrieval-augmented generation approaches, tied metrics are essential for accurate performance assessment and meaningful progress measurement.

In [None]:
print("🎯 Focused Learning 4: Tied Ranking Metrics - Complete!")
print("📚 Key concepts covered:")
print("   ✅ Mean Tied Reciprocal Rank (MTRR)")
print("   ✅ Tied Mean Hits@K (TMHits@K)")
print("   ✅ Computational optimization strategies")
print("   ✅ LLM reranker evaluation")
print("   ✅ G-RAG integration framework")
print("\n🚀 Ready to apply fair evaluation metrics in your RAG systems!")