# 08. Evaluation Metrics for IR

## Table of Contents
1. [Introduction](#introduction)
2. [Precision, Recall, F1](#basic-metrics)
3. [Mean Average Precision (MAP)](#map)
4. [Mean Reciprocal Rank (MRR)](#mrr)
5. [Normalized Discounted Cumulative Gain (NDCG)](#ndcg)

---

## 1. Introduction <a name="introduction"></a>
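Evaluating an Information Retrieval (IR) system means measuring how well its results match what users actually need. We start with set-based precision, recall, and F1, then go beyond them to metrics that score the quality of a ranked list: MAP, MRR, and NDCG.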

## 2. Precision, Recall, F1 <a name="basic-metrics"></a>
Precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that were retrieved, and F1 is the harmonic mean of the two.

In [5]:
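import math  # used by the NDCG cells later on

def calculate_precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    # Accept any iterable of doc IDs; duplicates are ignored.
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0, 0.0
    intersection = len(retrieved & relevant)
    precision = intersection / len(retrieved)
    recall = intersection / len(relevant) if relevant else 0.0
    return precision, recall

def calculate_f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)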
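
To bridge the set-based metrics above and the ranked metrics that follow, the same ratios can be computed on only the top k results. This is a minimal sketch: the helper names `precision_at_k` and `recall_at_k` and the sample data are assumptions made for illustration, and dividing precision by k (rather than by the number of results actually returned) is one common convention among several.

```python
def precision_at_k(ranked_list, relevant, k):
    """Fraction of the top-k slots filled by relevant docs (hypothetical helper)."""
    if k <= 0:
        return 0.0
    # Assumes ranked_list contains no duplicate doc IDs.
    hits = sum(1 for doc_id in ranked_list[:k] if doc_id in relevant)
    # Dividing by k means a list shorter than k is penalized for empty slots.
    return hits / k

def recall_at_k(ranked_list, relevant, k):
    """Fraction of all relevant docs found in the top k (hypothetical helper)."""
    if not relevant or k <= 0:
        return 0.0
    hits = len(set(ranked_list[:k]) & set(relevant))
    return hits / len(relevant)

# Illustrative data (assumed): 'd7' is relevant but never retrieved.
ranked = ['d1', 'd2', 'd3', 'd4']
relevant = {'d1', 'd3', 'd7'}

for k in (1, 2, 4):
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    print(f"k={k}: P@k={p:.4f}  R@k={r:.4f}")
```

As k grows, recall can only stay flat or rise, while precision may fall. That trade-off is what the ranked metrics in the following sections summarize in a single number.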

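## 3. Mean Average Precision (MAP) <a name="map"></a>
MAP provides a single-figure measure of quality across recall levels. For one query, Average Precision (AP) averages the precision measured at each rank where a relevant document appears, dividing by the total number of relevant documents so that relevant documents that are never retrieved contribute zero. MAP is the mean of AP over all queries. Among evaluation measures, MAP has been shown to have especially good discrimination and stability.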

In [6]:
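def average_precision(ranked_list, relevant_docs):
    """Average Precision (AP) of one ranked list."""
    if not relevant_docs:
        return 0.0
    pk_sum = 0.0   # running sum of precision@k at each relevant hit
    num_rel = 0    # relevant documents seen so far
    for rank, doc_id in enumerate(ranked_list, start=1):
        if doc_id in relevant_docs:
            num_rel += 1
            pk_sum += num_rel / rank
    # Dividing by all relevant docs penalizes those never retrieved.
    return pk_sum / len(relevant_docs)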
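
The function above scores a single query, while MAP is the mean of those scores over a query set. The sketch below shows one way to aggregate, mirroring the dictionary-based inputs used in the MRR section. The name `mean_average_precision` and the sample data are assumptions for illustration; `average_precision` is restated in compact form only so that the cell runs on its own.

```python
def average_precision(ranked_list, relevant_docs):
    """AP for one query, restated so this cell is self-contained."""
    if not relevant_docs:
        return 0.0
    hits, pk_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_list, start=1):
        if doc_id in relevant_docs:
            hits += 1
            pk_sum += hits / rank
    return pk_sum / len(relevant_docs)

def mean_average_precision(query_results, query_relevance):
    """Mean of AP over all queries (hypothetical helper name)."""
    if not query_results:
        return 0.0
    ap_sum = sum(
        # Queries with no relevance judgments contribute an AP of 0.
        average_precision(ranked, query_relevance.get(qid, set()))
        for qid, ranked in query_results.items()
    )
    return ap_sum / len(query_results)

# Illustrative data (assumed, not from a real test collection)
map_results = {'q1': ['d1', 'd2', 'd3'], 'q2': ['d2', 'd3', 'd1']}
map_relevance = {'q1': {'d1', 'd3'}, 'q2': {'d1'}}

print(f"MAP: {mean_average_precision(map_results, map_relevance):.4f}")
```

For `q1` the relevant hits are at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 = 5/6. For `q2` the only relevant document is at rank 3, so AP = 1/3. Their mean is 7/12.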

## 4. Mean Reciprocal Rank (MRR) <a name="mrr"></a>

MRR is a statistical measure for evaluating any process that produces, for each query in a sample, a list of possible responses ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer (1 for first place, 1/2 for second, and so on), or 0 if no correct answer is returned. MRR is the mean of the reciprocal ranks over all queries $Q$:

$$ MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i} $$

In [7]:
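def reciprocal_rank(ranked_list, relevant_docs):
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    for i, doc_id in enumerate(ranked_list):
        if doc_id in relevant_docs:
            return 1.0 / (i + 1)  # enumerate() is 0-based, ranks are 1-based
    return 0.0

def mean_reciprocal_rank(query_results, query_relevance):
    """Mean reciprocal rank over all queries in query_results."""
    rr_sum = 0.0
    for qid, ranked_list in query_results.items():
        # Queries with no relevance judgments contribute 0.
        rr_sum += reciprocal_rank(ranked_list, query_relevance.get(qid, set()))
    return rr_sum / len(query_results) if query_results else 0.0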


# Example
q_res = {'q1': ['d1', 'd2', 'd3'], 'q2': ['d2', 'd3', 'd1']}
q_rel = {'q1': {'d3'}, 'q2': {'d2'}}

mrr = mean_reciprocal_rank(q_res, q_rel)
print(f"MRR Score: {mrr:.4f}")

MRR Score: 0.6667


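## 5. Normalized Discounted Cumulative Gain (NDCG) <a name="ndcg"></a>

NDCG measures the usefulness, or gain, of a document based on its position in the result list. Gain is accumulated from the top of the result list to the bottom, and the gain of each result is discounted logarithmically with its rank, so a highly relevant document that appears late contributes less.

$$ DCG_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)} $$

Dividing by the ideal DCG ($IDCG_p$, the DCG of the best possible ranking, with documents sorted by decreasing relevance) normalizes the score to $[0, 1]$:

$$ NDCG_p = \frac{DCG_p}{IDCG_p} $$

A common variant, used by `calculate_ndcg` in the code for this section, replaces the linear gain $rel_i$ with the exponential gain $2^{rel_i} - 1$, which puts more weight on highly relevant documents:

$$ DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)} $$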

In [8]:
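def dcg_at_k(r, k):
    """Linear-gain DCG of the first k relevance scores in r (given in rank order)."""
    r = list(r)[:k]
    if not r:
        return 0.0
    # enumerate() is 0-based, so the document at rank i + 1 is discounted by log2(i + 2).
    # Standard implementations often use the exponential gain instead:
    # sum((2**rel - 1) / log2(i + 2))
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(r))
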
def calculate_ndcg(ranked_list, relevant_scores, k):
    """NDCG@k with exponential gain; relevant_scores maps doc_id -> graded relevance."""
    # DCG of the ranking as returned (unjudged documents count as 0)
    dcg = 0.0
    for i, doc_id in enumerate(ranked_list[:k]):
        rel = relevant_scores.get(doc_id, 0)
        dcg += (2**rel - 1) / math.log2(i + 2)

    # Ideal DCG: all judged documents sorted by score, descending
    ideal_scores = sorted(relevant_scores.values(), reverse=True)
    idcg = 0.0
    for i, rel in enumerate(ideal_scores[:k]):
        idcg += (2**rel - 1) / math.log2(i + 2)

    if idcg == 0:
        return 0.0
    return dcg / idcg

# Example with Graded Relevance (3=High, 2=Medium, 1=Low, 0=Non-relevant)
ranked_docs = ['d1', 'd2', 'd3', 'd4', 'd5']
relevance_scores = {'d1': 3, 'd2': 2, 'd3': 3, 'd4': 0, 'd5': 1}

ndcg_score = calculate_ndcg(ranked_docs, relevance_scores, k=5)
print(f"NDCG@5: {ndcg_score:.4f}")

NDCG@5: 0.9575
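
The choice of gain function changes the score, so it helps to compare the linear gain from the DCG formula with the exponential gain used in `calculate_ndcg` on the same data. The sketch below does that with a pluggable gain function. The helper name `ndcg_at_k` and its `gain` parameter are assumptions for illustration, not part of the code above.

```python
import math

def ndcg_at_k(ranked_list, relevance_scores, k, gain=lambda rel: rel):
    """NDCG@k with a pluggable gain function (hypothetical helper)."""
    def dcg(scores):
        # enumerate() is 0-based, so rank i + 1 is discounted by log2(i + 2)
        return sum(gain(rel) / math.log2(i + 2) for i, rel in enumerate(scores[:k]))

    # Relevance of each returned document, in rank order (unjudged -> 0)
    actual = [relevance_scores.get(doc_id, 0) for doc_id in ranked_list]
    # Best possible ordering of all judged documents
    ideal = sorted(relevance_scores.values(), reverse=True)
    idcg = dcg(ideal)
    return dcg(actual) / idcg if idcg > 0 else 0.0

# Same graded-relevance example as above
ranked_docs = ['d1', 'd2', 'd3', 'd4', 'd5']
relevance_scores = {'d1': 3, 'd2': 2, 'd3': 3, 'd4': 0, 'd5': 1}

linear = ndcg_at_k(ranked_docs, relevance_scores, k=5)
exponential = ndcg_at_k(ranked_docs, relevance_scores, k=5,
                        gain=lambda rel: 2**rel - 1)
print(f"NDCG@5 (linear gain):      {linear:.4f}")
print(f"NDCG@5 (exponential gain): {exponential:.4f}")
```

On this ranking the exponential variant scores lower than the linear one, because it penalizes placing the grade-3 document `d3` behind the grade-2 document `d2` more heavily.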
