# Focused Learning 1: Evaluation Metrics for Semantic Search

## 🎯 Learning Objectives
- Master the mathematical foundations of nDCG, MRR, and mAP metrics
- Understand why these metrics are crucial for semantic search evaluation
- Implement metrics from scratch with step-by-step explanations
- Analyze how different ranking scenarios affect metric scores

## 📚 Paper Context
**Section 3.2**: "Assessing the proficiency and effectiveness of various semantic search methodologies necessitates the use of specific evaluation metrics. Our attention centers on the pivotal metrics: Normalized Discounted Cumulative Gain (nDCG), Mean Reciprocal Rank (MRR), and Mean Average Precision (mAP)."

## 🔬 Why These Metrics Matter
Unlike simple accuracy metrics, these evaluation measures account for:
- **Ranking Quality**: Position matters in search results
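- **Graded Relevance**: Documents can be highly relevant, somewhat relevant, or irrelevant (nDCG uses the grades directly; MRR and mAP, as implemented here, threshold them into relevant vs. not relevant)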
- **User Experience**: Earlier results are more valuable to users
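
To make the contrast concrete, here is a minimal sketch comparing two hypothetical rankings that return the same three documents in a different order. The helper names `precision_at_k` and `dcg_at_k` are illustrative and are not taken from the paper. A position-blind measure such as precision@3 scores both rankings identically, while DCG (using the exponential gain introduced in Section 1) rewards the ranking that puts the very relevant document first.

```python
import math

def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k results with relevance > 0 (order is ignored)."""
    top_k = ranked_relevance[:k]
    return sum(1 for rel in top_k if rel > 0) / k

def dcg_at_k(ranked_relevance, k):
    """Discounted cumulative gain with exponential gain 2^rel - 1."""
    return sum((2 ** rel - 1) / math.log2(pos + 1)
               for pos, rel in enumerate(ranked_relevance[:k], start=1))

good_order = [2, 1, 0]  # very relevant result first
bad_order = [0, 1, 2]   # same results, very relevant one last

# Precision@3 is identical for both orderings
print(precision_at_k(good_order, 3), precision_at_k(bad_order, 3))

# DCG@3 differs because earlier positions are discounted less
print(round(dcg_at_k(good_order, 3), 4), round(dcg_at_k(bad_order, 3), 4))
```

Both rankings contain two relevant documents out of three, so only the rank-aware score separates them.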

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("📊 Evaluation Metrics Learning Environment Ready!")

## 1. Normalized Discounted Cumulative Gain (nDCG)

### Mathematical Foundation
From the paper (Equations 1-2):

$$nDCG@k = \frac{DCG@k}{IDCG@k}$$

$$DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

Where:
- `DCG@k`: Discounted Cumulative Gain at rank k
- `IDCG@k`: Ideal DCG, i.e. the DCG@k obtained when the results are sorted by decreasing relevance (the maximum possible DCG@k)
- `rel_i`: Graded relevance of result at position i
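
As a hand-worked check of the two formulas, take the ranked relevance list $[2, 1, 0, 1, 2]$ (the same list used in the code cell that follows) with $k = 3$. The ideal ordering of these grades is $[2, 2, 1, 1, 0]$. Values are rounded to four decimals.

$$DCG@3 = \frac{2^2 - 1}{\log_2 2} + \frac{2^1 - 1}{\log_2 3} + \frac{2^0 - 1}{\log_2 4} = 3 + 0.6309 + 0 \approx 3.6309$$

$$IDCG@3 = \frac{2^2 - 1}{\log_2 2} + \frac{2^2 - 1}{\log_2 3} + \frac{2^1 - 1}{\log_2 4} = 3 + 1.8928 + 0.5 \approx 5.3928$$

$$nDCG@3 = \frac{3.6309}{5.3928} \approx 0.6733$$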

In [None]:
class NDCGCalculator:
    """Step-by-step nDCG calculation with detailed explanations"""
    
    def __init__(self):
        self.calculation_steps = []
    
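    def calculate_dcg(self, relevance_scores: List[int], k: int = None, verbose: bool = True) -> Tuple[float, List[Dict]]:
        """Calculate Discounted Cumulative Gain with detailed steps.

        `relevance_scores` must be listed in ranked order (position 1 first).
        Returns the DCG value together with the per-position breakdown.
        """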
        if k is None:
            k = len(relevance_scores)
        
        k = min(k, len(relevance_scores))
        dcg = 0.0
        calculation_details = []
        
        if verbose:
            print(f"\n🔢 Calculating DCG@{k}:")
            print("Position | Relevance | 2^rel - 1 | log₂(pos+1) | Contribution")
            print("-" * 60)
        
        for i in range(k):
            position = i + 1
            relevance = relevance_scores[i]
            
            # DCG formula components
            numerator = (2 ** relevance) - 1
            denominator = np.log2(position + 1)
            contribution = numerator / denominator
            
            dcg += contribution
            
            calculation_details.append({
                'position': position,
                'relevance': relevance,
                'numerator': numerator,
                'denominator': denominator,
                'contribution': contribution
            })
            
            if verbose:
                print(f"{position:8d} | {relevance:9d} | {numerator:9.0f} | {denominator:10.3f} | {contribution:12.4f}")
        
        if verbose:
            print(f"\n📊 Total DCG@{k} = {dcg:.4f}")
        
        return dcg, calculation_details
    
    def calculate_ideal_dcg(self, relevance_scores: List[int], k: int = None, verbose: bool = True) -> float:
        """Calculate Ideal DCG (best possible ranking)"""
        ideal_scores = sorted(relevance_scores, reverse=True)
        
        if verbose:
            print(f"\n⭐ Ideal ranking: {ideal_scores}")
        
        ideal_dcg, _ = self.calculate_dcg(ideal_scores, k, verbose)
        return ideal_dcg
    
    def calculate_ndcg(self, relevance_scores: List[int], k: int = None, verbose: bool = True) -> Dict:
        """Calculate nDCG with complete breakdown"""
        if verbose:
            print(f"\n🎯 nDCG Calculation for relevance scores: {relevance_scores}")
            print("=" * 70)
        
        # Calculate actual DCG
        dcg, dcg_details = self.calculate_dcg(relevance_scores, k, verbose)
        
        # Calculate ideal DCG
        ideal_dcg = self.calculate_ideal_dcg(relevance_scores, k, verbose)
        
        # Calculate nDCG
        if ideal_dcg == 0:
            ndcg = 0.0
        else:
            ndcg = dcg / ideal_dcg
        
        if verbose:
            print(f"\n🏆 Final nDCG@{k or len(relevance_scores)} = {dcg:.4f} / {ideal_dcg:.4f} = {ndcg:.4f}")
        
        return {
            'ndcg': ndcg,
            'dcg': dcg,
            'ideal_dcg': ideal_dcg,
            'dcg_details': dcg_details
        }

# Demonstrate nDCG calculation
ndcg_calc = NDCGCalculator()

# Example from paper: relevance scores (0=irrelevant, 1=somewhat relevant, 2=very relevant)
example_relevance = [2, 1, 0, 1, 2]  # Ranking: very, somewhat, irrelevant, somewhat, very
result = ndcg_calc.calculate_ndcg(example_relevance, k=3, verbose=True)

## 2. Mean Reciprocal Rank (MRR)

### Mathematical Foundation
From the paper (Equation 3):

$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$$

Where:
- `|Q|`: Total number of queries
- `rank_i`: Rank position of the first very relevant document for query i (if no such document is retrieved, the reciprocal rank for that query is taken as 0)
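
The averaging step in the formula can be sketched in a few lines. This is a compact illustration rather than the paper's implementation: the function names and the three toy queries are made up, each list holds relevance grades already in ranked order, and "very relevant" is assumed to mean a grade of at least 2.

```python
from typing import List

def reciprocal_rank(ranked_relevance: List[int], min_relevance: int = 2) -> float:
    """Return 1/rank of the first result with relevance >= min_relevance, else 0.0."""
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel >= min_relevance:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries: List[List[int]], min_relevance: int = 2) -> float:
    """Average the reciprocal ranks over all queries."""
    return sum(reciprocal_rank(q, min_relevance) for q in queries) / len(queries)

# Hypothetical ranked relevance lists for three queries
queries = [
    [2, 0, 1],  # first very relevant result at rank 1 -> RR = 1
    [0, 1, 2],  # first very relevant result at rank 3 -> RR = 1/3
    [1, 0, 0],  # no very relevant result              -> RR = 0
]

mrr = mean_reciprocal_rank(queries)
print(f"MRR = {mrr:.4f}")  # (1 + 1/3 + 0) / 3
```

Note that the third query still counts in the denominator, so queries with no qualifying result pull the mean down.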

In [None]:
class MRRCalculator:
    """Mean Reciprocal Rank calculation with detailed analysis"""
    
    def find_first_relevant_rank(self, relevance_scores: List[int], 
                                predictions: List[float], 
                                min_relevance: int = 2) -> int:
        """Find rank of first highly relevant document"""
        # Sort by predictions (highest first)
        sorted_indices = np.argsort(predictions)[::-1]
        
        for rank, idx in enumerate(sorted_indices, 1):
            if relevance_scores[idx] >= min_relevance:
                return rank
        
        return float('inf')  # No relevant document found
    
    def calculate_mrr_single_query(self, relevance_scores: List[int], 
                                  predictions: List[float], 
                                  verbose: bool = True) -> Dict:
        """Calculate MRR for a single query with detailed breakdown"""
        
        # Find first relevant rank
        first_relevant_rank = self.find_first_relevant_rank(relevance_scores, predictions)
        
        # Calculate reciprocal rank
        if first_relevant_rank == float('inf'):
            rr = 0.0
        else:
            rr = 1.0 / first_relevant_rank
        
        if verbose:
            # Show ranking process
            sorted_indices = np.argsort(predictions)[::-1]
            print(f"\n🔍 MRR Calculation:")
            print("Rank | Doc Index | Prediction | Relevance | Very Relevant?")
            print("-" * 55)
            
            for rank, idx in enumerate(sorted_indices, 1):
                is_very_relevant = "✅ YES" if relevance_scores[idx] >= 2 else "❌ No"
                print(f"{rank:4d} | {idx:9d} | {predictions[idx]:10.3f} | {relevance_scores[idx]:9d} | {is_very_relevant}")
                
                if relevance_scores[idx] >= 2 and rank == first_relevant_rank:
                    print(f"     ⬆️ First very relevant document found at rank {rank}")
                    break
            
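            if rr == 0.0:
                # Avoid printing "1/inf" when nothing qualifies
                print("\n📊 No very relevant document found: Reciprocal Rank = 0.0000")
            else:
                print(f"\n📊 Reciprocal Rank = 1/{first_relevant_rank} = {rr:.4f}")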
        
        return {
            'reciprocal_rank': rr,
            'first_relevant_rank': first_relevant_rank,
            'sorted_ranking': list(np.argsort(predictions)[::-1])
        }
    
    def calculate_mrr_multiple_queries(self, queries_data: List[Dict], verbose: bool = True) -> Dict:
        """Calculate MRR across multiple queries"""
        reciprocal_ranks = []
        query_details = []
        
        if verbose:
            print(f"\n📋 MRR Calculation for {len(queries_data)} Queries")
            print("=" * 60)
        
        for i, query_data in enumerate(queries_data):
            if verbose:
                print(f"\n🔍 Query {i+1}: {query_data.get('query', f'Query {i+1}')}")
            
            result = self.calculate_mrr_single_query(
                query_data['relevance'], 
                query_data['predictions'], 
                verbose=verbose
            )
            
            reciprocal_ranks.append(result['reciprocal_rank'])
            query_details.append(result)
        
        mean_rr = np.mean(reciprocal_ranks)
        
        if verbose:
            print(f"\n🏆 Mean Reciprocal Rank = {mean_rr:.4f}")
            print(f"📊 Individual RRs: {[f'{rr:.3f}' for rr in reciprocal_ranks]}")
        
        return {
            'mrr': mean_rr,
            'reciprocal_ranks': reciprocal_ranks,
            'query_details': query_details
        }

# Demonstrate MRR calculation
mrr_calc = MRRCalculator()

# Example single query
example_relevance = [0, 2, 1, 0, 1]  # Second document is very relevant
example_predictions = [0.9, 0.3, 0.7, 0.1, 0.5]  # Our model's confidence scores

single_result = mrr_calc.calculate_mrr_single_query(example_relevance, example_predictions, verbose=True)

## 3. Mean Average Precision (mAP)

### Mathematical Foundation
From the paper (Equations 4-5):

$$AP = \frac{\sum_{k=1}^{n} \left( P(k) \times rel_k \right)}{\text{number of relevant documents}}$$

$$mAP = \frac{\sum_{q=1}^{Q} AP_q}{Q}$$

Where:
- `P(k)`: Precision at cutoff k
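- `rel_k`: Binary relevance indicator at rank k (1 if the document at rank k is relevant, 0 otherwise; the code below treats any grade above 0 as relevant)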
- `AP_q`: Average Precision for query q
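
A compact sketch of the AP formula, assuming the relevance grades are already in ranked order and that any grade above 0 counts as relevant. The function name and the toy list are illustrative only.

```python
def average_precision(ranked_relevance):
    """Average Precision over a ranked list; relevance > 0 counts as relevant."""
    total_relevant = sum(1 for rel in ranked_relevance if rel > 0)
    if total_relevant == 0:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel > 0:
            hits += 1
            # P(k) is only added at ranks where a relevant document appears
            precision_sum += hits / k
    return precision_sum / total_relevant

# Relevant documents sit at ranks 1, 3 and 5
ranked = [1, 0, 2, 0, 1]
ap = average_precision(ranked)
print(f"AP = {ap:.4f}")  # (1/1 + 2/3 + 3/5) / 3
```

Because the grades are thresholded, the grade-2 document at rank 3 contributes exactly as much as a grade-1 document would.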

In [None]:
class MAPCalculator:
    """Mean Average Precision calculation with comprehensive analysis"""
    
    def calculate_precision_at_k(self, relevance_scores: List[int], 
                                sorted_indices: List[int], 
                                k: int) -> float:
        """Calculate precision at cutoff k"""
        relevant_count = 0
        
        for i in range(min(k, len(sorted_indices))):
            idx = sorted_indices[i]
            if relevance_scores[idx] > 0:  # Any relevance > 0
                relevant_count += 1
        
        return relevant_count / k if k > 0 else 0.0
    
    def calculate_ap_single_query(self, relevance_scores: List[int], 
                                 predictions: List[float], 
                                 k: int = None, 
                                 verbose: bool = True) -> Dict:
        """Calculate Average Precision for single query"""
        
        if k is None:
            k = len(relevance_scores)
        
        # Sort by predictions (highest first)
        sorted_indices = list(np.argsort(predictions)[::-1])
        total_relevant = sum(1 for rel in relevance_scores if rel > 0)
        
        if total_relevant == 0:
            return {'ap': 0.0, 'precision_at_k': [], 'calculation_steps': []}
        
        if verbose:
            print(f"\n📊 Average Precision Calculation (k={k}):")
            print(f"Total relevant documents: {total_relevant}")
            print("\nRank | Doc | Pred | Rel | P@k | Rel? | Contribution")
            print("-" * 60)
        
        precision_sum = 0.0
        precision_at_k_values = []
        calculation_steps = []
        
        for rank in range(1, min(k + 1, len(sorted_indices) + 1)):
            idx = sorted_indices[rank - 1]
            relevance = relevance_scores[idx]
            prediction = predictions[idx]
            
            # Calculate precision at this rank
            precision_at_rank = self.calculate_precision_at_k(relevance_scores, sorted_indices, rank)
            precision_at_k_values.append(precision_at_rank)
            
            # Add to precision sum if document is relevant
            is_relevant = relevance > 0
            contribution = precision_at_rank if is_relevant else 0.0
            precision_sum += contribution
            
            calculation_steps.append({
                'rank': rank,
                'doc_idx': idx,
                'prediction': prediction,
                'relevance': relevance,
                'precision_at_k': precision_at_rank,
                'is_relevant': is_relevant,
                'contribution': contribution
            })
            
            if verbose:
                relevant_symbol = "✅" if is_relevant else "❌"
                print(f"{rank:4d} | {idx:3d} | {prediction:.2f} | {relevance:3d} | {precision_at_rank:.3f} | {relevant_symbol:2s} | {contribution:.4f}")
        
        # Calculate final Average Precision
        ap = precision_sum / total_relevant
        
        if verbose:
            print(f"\n🎯 Average Precision = {precision_sum:.4f} / {total_relevant} = {ap:.4f}")
        
        return {
            'ap': ap,
            'precision_at_k': precision_at_k_values,
            'calculation_steps': calculation_steps,
            'precision_sum': precision_sum,
            'total_relevant': total_relevant
        }
    
    def calculate_map_multiple_queries(self, queries_data: List[Dict], 
                                      k: int = None, 
                                      verbose: bool = True) -> Dict:
        """Calculate Mean Average Precision across multiple queries"""
        
        ap_scores = []
        query_details = []
        
        if verbose:
            print(f"\n📋 mAP Calculation for {len(queries_data)} Queries")
            print("=" * 70)
        
        for i, query_data in enumerate(queries_data):
            if verbose:
                print(f"\n🔍 Query {i+1}: {query_data.get('query', f'Query {i+1}')}")
            
            result = self.calculate_ap_single_query(
                query_data['relevance'], 
                query_data['predictions'], 
                k=k, 
                verbose=verbose
            )
            
            ap_scores.append(result['ap'])
            query_details.append(result)
        
        mean_ap = np.mean(ap_scores)
        
        if verbose:
            print(f"\n🏆 Mean Average Precision (mAP@{k or 'all'}) = {mean_ap:.4f}")
            print(f"📊 Individual APs: {[f'{ap:.3f}' for ap in ap_scores]}")
        
        return {
            'map': mean_ap,
            'ap_scores': ap_scores,
            'query_details': query_details
        }

# Demonstrate mAP calculation
map_calc = MAPCalculator()

# Example with more complex relevance pattern
example_relevance = [1, 0, 2, 1, 0]  # Mixed relevance levels
example_predictions = [0.8, 0.2, 0.6, 0.9, 0.1]  # Our model's predictions

ap_result = map_calc.calculate_ap_single_query(example_relevance, example_predictions, k=3, verbose=True)

## 4. Comparative Analysis: Understanding Metric Behaviors

In [None]:
def compare_ranking_scenarios():
    """Compare how different ranking scenarios affect all three metrics"""
    
    # Define different ranking scenarios
    scenarios = {
        'Perfect_Ranking': {
            'relevance': [2, 2, 1, 1, 0],
            'predictions': [0.9, 0.8, 0.7, 0.6, 0.1],
            'description': 'Perfect ranking: highly relevant docs first'
        },
        'Worst_Ranking': {
            'relevance': [2, 2, 1, 1, 0],
            'predictions': [0.1, 0.2, 0.3, 0.4, 0.9],
            'description': 'Worst ranking: irrelevant docs first'
        },
        'Mixed_Ranking': {
            'relevance': [2, 2, 1, 1, 0],
            'predictions': [0.7, 0.3, 0.9, 0.1, 0.5],
            'description': 'Mixed ranking: some relevant docs early'
        },
        'Late_Success': {
            'relevance': [2, 2, 1, 1, 0],
            'predictions': [0.1, 0.2, 0.3, 0.9, 0.8],
            'description': 'Late success: relevant docs at end'
        }
    }
    
    results = []
    
    print("🔄 Comparative Analysis of Ranking Scenarios")
    print("=" * 80)
    
    for scenario_name, scenario_data in scenarios.items():
        print(f"\n📊 Scenario: {scenario_name}")
        print(f"Description: {scenario_data['description']}")
        print(f"Relevance: {scenario_data['relevance']}")
        print(f"Predictions: {scenario_data['predictions']}")
        print("-" * 50)
        
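        # nDCG expects relevance in ranked order, so reorder the labels by
        # predicted score (highest first) before scoring
        ranking = np.argsort(scenario_data['predictions'])[::-1]
        ranked_relevance = [scenario_data['relevance'][i] for i in ranking]
        
        # Calculate all metrics
        ndcg_result = ndcg_calc.calculate_ndcg(ranked_relevance, k=3, verbose=False)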
        mrr_result = mrr_calc.calculate_mrr_single_query(scenario_data['relevance'], scenario_data['predictions'], verbose=False)
        map_result = map_calc.calculate_ap_single_query(scenario_data['relevance'], scenario_data['predictions'], k=3, verbose=False)
        
        results.append({
            'Scenario': scenario_name,
            'nDCG@3': ndcg_result['ndcg'],
            'MRR': mrr_result['reciprocal_rank'],
            'mAP@3': map_result['ap'],
            'Description': scenario_data['description']
        })
        
        print(f"nDCG@3: {ndcg_result['ndcg']:.4f}")
        print(f"MRR:    {mrr_result['reciprocal_rank']:.4f}")
        print(f"mAP@3:  {map_result['ap']:.4f}")
    
    return pd.DataFrame(results)

# Run comparative analysis
comparison_df = compare_ranking_scenarios()

# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

metrics = ['nDCG@3', 'MRR', 'mAP@3']
colors = ['skyblue', 'lightcoral', 'lightgreen']

for i, metric in enumerate(metrics):
    ax = axes[i]
    bars = ax.bar(comparison_df['Scenario'], comparison_df[metric], color=colors[i], alpha=0.8)
    ax.set_title(f'{metric} Across Scenarios', fontsize=14, fontweight='bold')
    ax.set_ylabel('Score', fontsize=12)
    ax.set_ylim(0, 1)
    ax.tick_params(axis='x', rotation=45)
    ax.grid(True, alpha=0.3)
    
    # Add value labels on bars
    for bar, value in zip(bars, comparison_df[metric]):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
                f'{value:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.savefig('metric_comparison_scenarios.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n📊 Summary Table:")
print(comparison_df.round(4))

## 5. Interactive Metric Explorer

In [None]:
def interactive_metric_calculator():
    """Interactive tool to understand how metrics respond to different inputs"""
    
    print("🎮 Interactive Metric Calculator")
    print("=" * 50)
    print("Experiment with different relevance and prediction patterns!")
    print("\nRelevance scale: 0=Irrelevant, 1=Somewhat Relevant, 2=Very Relevant")
    
    # Pre-defined experiments
    experiments = [
        {
            'name': 'Early Success',
            'relevance': [2, 1, 0, 0, 1],
            'predictions': [0.9, 0.8, 0.3, 0.2, 0.7]
        },
        {
            'name': 'Late Discovery', 
            'relevance': [0, 0, 1, 2, 1],
            'predictions': [0.2, 0.3, 0.9, 0.8, 0.7]
        },
        {
            'name': 'Mixed Results',
            'relevance': [1, 0, 2, 1, 0],
            'predictions': [0.6, 0.8, 0.4, 0.7, 0.5]
        }
    ]
    
    results_summary = []
    
    for exp in experiments:
        print(f"\n🧪 Experiment: {exp['name']}")
        print(f"Relevance:   {exp['relevance']}")
        print(f"Predictions: {exp['predictions']}")
        
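        # nDCG expects relevance in ranked order, so reorder the labels by
        # predicted score (highest first) before scoring
        ranked_relevance = [exp['relevance'][i] for i in np.argsort(exp['predictions'])[::-1]]
        
        # Calculate metrics
        ndcg_res = ndcg_calc.calculate_ndcg(ranked_relevance, k=3, verbose=False)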
        mrr_res = mrr_calc.calculate_mrr_single_query(exp['relevance'], exp['predictions'], verbose=False)
        map_res = map_calc.calculate_ap_single_query(exp['relevance'], exp['predictions'], k=3, verbose=False)
        
        results_summary.append({
            'Experiment': exp['name'],
            'nDCG@3': ndcg_res['ndcg'],
            'MRR': mrr_res['reciprocal_rank'],
            'mAP@3': map_res['ap']
        })
        
        print(f"📊 Results: nDCG@3={ndcg_res['ndcg']:.3f}, MRR={mrr_res['reciprocal_rank']:.3f}, mAP@3={map_res['ap']:.3f}")
    
    return pd.DataFrame(results_summary)

interactive_results = interactive_metric_calculator()

# Create radar chart for metric comparison
fig, ax = plt.subplots(figsize=(10, 8), subplot_kw=dict(projection='polar'))

metrics = ['nDCG@3', 'MRR', 'mAP@3']
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]  # Complete the circle

colors = ['red', 'blue', 'green']

for i, (_, row) in enumerate(interactive_results.iterrows()):
    values = [row['nDCG@3'], row['MRR'], row['mAP@3']]
    values += values[:1]  # Complete the circle
    
    ax.plot(angles, values, 'o-', linewidth=2, label=row['Experiment'], color=colors[i])
    ax.fill(angles, values, alpha=0.25, color=colors[i])

ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)
ax.set_title('Metric Performance Comparison\n(Radar Chart)', size=16, fontweight='bold', pad=20)
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))
ax.grid(True)

plt.tight_layout()
plt.savefig('metric_radar_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n📊 Interactive Results Summary:")
print(interactive_results.round(4))

## 6. Key Insights and Learning Summary

### 🔍 What We've Learned

1. **nDCG (Normalized Discounted Cumulative Gain)**:
   - Emphasizes **graded relevance** and **position importance**
   - Higher scores for relevant documents at top positions
   - Normalizes by ideal ranking for fair comparison
   - Best for scenarios with multiple relevance levels

2. **MRR (Mean Reciprocal Rank)**:
   - Focuses on **first highly relevant result**
   - Critical for applications where users need quick answers
   - Sensitive to top-ranked results
   - Ideal for question-answering systems

3. **mAP (Mean Average Precision)**:
   - Considers **all relevant documents**
   - Balances precision across different cutoff points
   - Good for comprehensive retrieval evaluation
   - Useful when recall matters as much as precision
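
The different emphases become visible when all three metrics score the same ranking. The sketch below uses illustrative helper functions and a hypothetical ranked list in which a somewhat relevant document precedes the very relevant one. It assumes the same conventions as the rest of this notebook: exponential gain for nDCG, a grade of at least 2 for MRR, and any grade above 0 for AP.

```python
import math

def ndcg_at_k(ranked, k):
    """nDCG@k with exponential gain 2^rel - 1."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 1)
                   for i, r in enumerate(rels[:k], start=1))
    ideal = dcg(sorted(ranked, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0

def reciprocal_rank(ranked, min_relevance=2):
    """1/rank of the first result with relevance >= min_relevance, else 0.0."""
    for rank, rel in enumerate(ranked, start=1):
        if rel >= min_relevance:
            return 1.0 / rank
    return 0.0

def average_precision(ranked):
    """AP treating any relevance > 0 as relevant."""
    total = sum(1 for r in ranked if r > 0)
    if total == 0:
        return 0.0
    hits, acc = 0, 0.0
    for k, r in enumerate(ranked, start=1):
        if r > 0:
            hits += 1
            acc += hits / k
    return acc / total

ranked = [1, 2, 0]  # somewhat relevant, very relevant, irrelevant
scores = {
    'nDCG@3': ndcg_at_k(ranked, 3),
    'RR': reciprocal_rank(ranked),
    'AP': average_precision(ranked),
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

AP is perfect because both relevant documents precede the irrelevant one, RR drops to 0.5 because the very relevant document is second, and nDCG lands in between because it penalizes the swapped grades without discarding the partial credit.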

### 🎯 Why These Metrics Matter for Arabic Semantic Search

As highlighted in the paper, the Arabic language presents unique challenges:
- **Complex morphology**: Same root can have many forms
- **Dialectal variations**: Different regions use different expressions
- **Limited resources**: Fewer labeled datasets available

These metrics help evaluate how well semantic search systems handle these challenges by providing nuanced performance measures that go beyond simple accuracy.

### 🚀 Practical Applications

- **Customer Support**: MRR critical for quick problem resolution
- **Document Search**: mAP important for comprehensive retrieval
- **Question Answering**: nDCG valuable for ranked answer lists
- **RAG Systems**: All three metrics crucial for retrieval quality assessment

In [None]:
# Create final comprehensive visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

# 1. Metric sensitivity analysis
position_effect = []
for pos in range(1, 6):
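    # Place a single very relevant document at rank `pos`. Prediction scores
    # decrease with the index, so document index and rank position coincide
    # (giving the relevant document the top score would always rank it first).
    relevance = [0] * 5
    relevance[pos-1] = 2
    predictions = [0.9 - 0.1 * i for i in range(5)]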
    
    ndcg_res = ndcg_calc.calculate_ndcg(relevance, k=5, verbose=False)
    mrr_res = mrr_calc.calculate_mrr_single_query(relevance, predictions, verbose=False)
    map_res = map_calc.calculate_ap_single_query(relevance, predictions, k=5, verbose=False)
    
    position_effect.append({
        'Position': pos,
        'nDCG@5': ndcg_res['ndcg'],
        'MRR': mrr_res['reciprocal_rank'],
        'mAP@5': map_res['ap']
    })

pos_df = pd.DataFrame(position_effect)

ax1.plot(pos_df['Position'], pos_df['nDCG@5'], 'o-', label='nDCG@5', linewidth=2, markersize=8)
ax1.plot(pos_df['Position'], pos_df['MRR'], 's-', label='MRR', linewidth=2, markersize=8)
ax1.plot(pos_df['Position'], pos_df['mAP@5'], '^-', label='mAP@5', linewidth=2, markersize=8)
ax1.set_xlabel('Position of Relevant Document')
ax1.set_ylabel('Metric Score')
ax1.set_title('Position Sensitivity Analysis')
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Relevance level impact
relevance_levels = [0, 1, 2]
relevance_impact = []

for rel_level in relevance_levels:
    relevance = [rel_level, 0, 0, 0, 0]
    predictions = [0.9, 0.1, 0.1, 0.1, 0.1]
    
    ndcg_res = ndcg_calc.calculate_ndcg(relevance, k=3, verbose=False)
    mrr_res = mrr_calc.calculate_mrr_single_query(relevance, predictions, verbose=False)
    map_res = map_calc.calculate_ap_single_query(relevance, predictions, k=3, verbose=False)
    
    relevance_impact.append({
        'Relevance Level': rel_level,
        'nDCG@3': ndcg_res['ndcg'],
        'MRR': mrr_res['reciprocal_rank'],
        'mAP@3': map_res['ap']
    })

rel_df = pd.DataFrame(relevance_impact)

width = 0.25
x = np.arange(len(relevance_levels))
ax2.bar(x - width, rel_df['nDCG@3'], width, label='nDCG@3', alpha=0.8)
ax2.bar(x, rel_df['MRR'], width, label='MRR', alpha=0.8)
ax2.bar(x + width, rel_df['mAP@3'], width, label='mAP@3', alpha=0.8)
ax2.set_xlabel('Relevance Level')
ax2.set_ylabel('Metric Score')
ax2.set_title('Impact of Relevance Levels')
ax2.set_xticks(x)
ax2.set_xticklabels(['Irrelevant (0)', 'Somewhat (1)', 'Very (2)'])
ax2.legend()
ax2.grid(True, alpha=0.3)

# 3. Cutoff sensitivity (k parameter)
k_values = range(1, 6)
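relevance = [2, 1, 0, 1, 2]
predictions = [0.9, 0.7, 0.3, 0.6, 0.8]
# nDCG expects relevance in ranked order, so sort the labels by predicted score
ranked_relevance = [relevance[i] for i in np.argsort(predictions)[::-1]]

k_sensitivity = []
for k in k_values:
    ndcg_res = ndcg_calc.calculate_ndcg(ranked_relevance, k=k, verbose=False)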
    map_res = map_calc.calculate_ap_single_query(relevance, predictions, k=k, verbose=False)
    
    k_sensitivity.append({
        'k': k,
        'nDCG@k': ndcg_res['ndcg'],
        'mAP@k': map_res['ap']
    })

k_df = pd.DataFrame(k_sensitivity)

ax3.plot(k_df['k'], k_df['nDCG@k'], 'o-', label='nDCG@k', linewidth=2, markersize=8)
ax3.plot(k_df['k'], k_df['mAP@k'], 's-', label='mAP@k', linewidth=2, markersize=8)
ax3.set_xlabel('Cutoff k')
ax3.set_ylabel('Metric Score')
ax3.set_title('Cutoff Sensitivity Analysis')
ax3.legend()
ax3.grid(True, alpha=0.3)

# 4. Summary insights
ax4.axis('off')
insights_text = """
📊 KEY INSIGHTS FROM ANALYSIS

🎯 Position Matters:
• MRR most sensitive to top positions
• nDCG gradually decreases with position
• mAP considers all relevant documents

📈 Relevance Levels:
• nDCG best captures graded relevance
• MRR thresholded: only very relevant (2) counts
• mAP binary: any relevance > 0 counts equally

⚙️ Cutoff Effects:
• Lower k values favor precision
• Higher k values include more context
• Choice depends on application needs

🔍 For Arabic Semantic Search:
• Use all three metrics together
• nDCG for ranking quality
• MRR for user experience
• mAP for comprehensive evaluation
"""

ax4.text(0.05, 0.95, insights_text, transform=ax4.transAxes, fontsize=11, 
         verticalalignment='top', fontfamily='monospace',
         bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))

plt.tight_layout()
plt.savefig('comprehensive_metrics_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n" + "="*80)
print("🎓 EVALUATION METRICS MASTERY COMPLETED!")
print("="*80)
print("""
✅ What you've mastered:
• Deep understanding of nDCG, MRR, and mAP calculations
• Implementation of metrics from mathematical foundations
• Analysis of metric behaviors in different scenarios
• Practical insights for Arabic semantic search evaluation
• Interactive exploration of metric sensitivities

🚀 Ready to apply these metrics to real-world semantic search systems!
""")