# 📊 Retrieval Evaluation – RAG Pipeline

This notebook evaluates document retrieval quality in a RAG pipeline using four ranking metrics:

- **Precision@k**
- **Recall@k**
- **Mean Reciprocal Rank (MRR)**
- **nDCG (normalized Discounted Cumulative Gain)**
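
For reference, the usual binary-relevance forms of these metrics can be written as below. The notation is introduced here for illustration and is not taken from the notebook: $R$ is the set of relevant documents for a query, $\mathrm{top}_k$ is the set of the first $k$ retrieved documents, $rel_i \in \{0, 1\}$ is the relevance of the document at rank $i$, and $n$ is the length of the ranked list.

```latex
\begin{align}
\mathrm{Precision@}k &= \frac{|\mathrm{top}_k \cap R|}{k} &
\mathrm{Recall@}k &= \frac{|\mathrm{top}_k \cap R|}{|R|} \\
\mathrm{RR} &= \frac{1}{\text{rank of the first relevant document}} &
\mathrm{MRR} &= \text{mean of } \mathrm{RR} \text{ over all queries} \\
\mathrm{DCG} &= \sum_{i=1}^{n} \frac{rel_i}{\log_2(i + 1)} &
\mathrm{nDCG} &= \frac{\mathrm{DCG}}{\mathrm{IDCG}}
\end{align}
```

RR is taken as 0 when no relevant document is retrieved, and IDCG is the DCG of the best possible ordering.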

We compare the performance of:
1. Vector search only (baseline)
2. Vector search + TF-IDF reranker (hybrid)
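
The reranker itself is not shown in this notebook. As a rough idea of what the hybrid step could look like, here is a minimal pure-Python sketch that reorders vector-search candidates by TF-IDF cosine similarity to the query. The function name `tfidf_rerank`, the tokenizer, the smoothed IDF formula, and the candidate texts are all assumptions made for illustration; a real pipeline would more likely use a library vectorizer.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphanumeric runs (a deliberately simple tokenizer)
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rerank(query, candidates):
    """Reorder (doc_id, text) pairs by TF-IDF cosine similarity to the query."""
    tokenized = [tokenize(text) for _, text in candidates]
    n_docs = len(tokenized)
    # Document frequency: number of candidates containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant; other formulas are equally valid)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    def vectorize(tokens):
        # Raw term counts weighted by IDF; terms unseen in the candidates are dropped
        counts = Counter(t for t in tokens if t in idf)
        return {t: c * idf[t] for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(tokenize(query))
    scored = [
        (cosine(query_vec, vectorize(tokens)), doc_id)
        for (doc_id, _), tokens in zip(candidates, tokenized)
    ]
    # The sort is stable, so ties keep their original vector-search order
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in ranked]

# Hypothetical candidate texts, listed in vector-search order
candidates = [
    ("doc_sleep.txt", "Sleep duration and sleep stages across a night."),
    ("doc_hrv.txt", "Low HRV is associated with stress and poor recovery."),
    ("doc_noise.txt", "Ambient noise levels in open offices."),
]
reranked = tfidf_rerank("What are the effects of low HRV?", candidates)
print(reranked)
```

Only the HRV text shares terms with the query, so it moves to the top while the other two keep their original relative order.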

In [1]:
import numpy as np

# Simulated ground truth: each query maps to the documents judged relevant
ground_truth = {
    "What are the effects of low HRV?": ["doc_hrv.txt", "doc_recovery.txt"]
}

def precision_at_k(predicted_docs, relevant_docs, k):
    # Fraction of the top-k results that are relevant
    predicted_k = predicted_docs[:k]
    return len(set(predicted_k) & set(relevant_docs)) / k

def recall_at_k(predicted_docs, relevant_docs, k):
    # Fraction of all relevant documents that appear in the top k
    predicted_k = predicted_docs[:k]
    return len(set(predicted_k) & set(relevant_docs)) / len(relevant_docs)

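def mrr(predicted_docs, relevant_docs):
    # Reciprocal rank for a single query; averaging it over queries gives MRR
    for rank, doc in enumerate(predicted_docs, start=1):
        if doc in relevant_docs:
            return 1 / rank
    return 0  # no relevant document was retrieved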

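def dcg(scores):
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on
    return sum((score / np.log2(idx + 2)) for idx, score in enumerate(scores))

def ndcg(predicted_docs, relevant_docs):
    # Binary gains: 1 for a relevant document, 0 otherwise
    relevance = [1 if doc in relevant_docs else 0 for doc in predicted_docs]
    # The ideal ordering is built from the retrieved list only, so relevant
    # documents that were never retrieved do not lower the score
    ideal_dcg = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal_dcg if ideal_dcg > 0 else 0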

In [2]:
# Simulated outputs for a query
baseline_results = ["doc_sleep.txt", "doc_hrv.txt", "doc_noise.txt"]
reranked_results = ["doc_hrv.txt", "doc_recovery.txt", "doc_sleep.txt"]

query = "What are the effects of low HRV?"
relevant = ground_truth[query]

In [3]:
for name, results in [("Baseline", baseline_results), ("Reranked", reranked_results)]:
    print(f"📌 {name}")
    print(f"Precision@3: {precision_at_k(results, relevant, 3):.2f}")
    print(f"Recall@3:    {recall_at_k(results, relevant, 3):.2f}")
    print(f"MRR:         {mrr(results, relevant):.2f}")
    print(f"nDCG:        {ndcg(results, relevant):.2f}")
    print("-" * 30)

📌 Baseline
Precision@3: 0.33
Recall@3:    0.50
MRR:         0.50
nDCG:        0.63
------------------------------
📌 Reranked
Precision@3: 0.67
Recall@3:    1.00
MRR:         1.00
nDCG:        1.00
------------------------------
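
The `ndcg` function above derives its ideal ranking by sorting the retrieved list, so a relevant document that is missing from the results (here `doc_recovery.txt` in the baseline) does not count against the score. A stricter variant, sketched below under the assumption of binary relevance and a fixed cutoff `k`, builds the ideal ranking from the full relevant set instead. The name `ndcg_full_ideal` is hypothetical and not part of the notebook.

```python
import math

def dcg(gains):
    # Same discount as the notebook: position idx (0-based) is divided by log2(idx + 2)
    return sum(gain / math.log2(idx + 2) for idx, gain in enumerate(gains))

def ndcg_full_ideal(predicted_docs, relevant_docs, k):
    """nDCG@k where the ideal ranking places every relevant document first."""
    relevant = set(relevant_docs)
    gains = [1 if doc in relevant else 0 for doc in predicted_docs[:k]]
    # The ideal list has one gain of 1 per relevant document, capped at k
    ideal_dcg = dcg([1] * min(len(relevant), k))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

relevant = ["doc_hrv.txt", "doc_recovery.txt"]
baseline_results = ["doc_sleep.txt", "doc_hrv.txt", "doc_noise.txt"]
reranked_results = ["doc_hrv.txt", "doc_recovery.txt", "doc_sleep.txt"]

print(f"Baseline nDCG@3: {ndcg_full_ideal(baseline_results, relevant, 3):.2f}")  # 0.39
print(f"Reranked nDCG@3: {ndcg_full_ideal(reranked_results, relevant, 3):.2f}")  # 1.00
```

Under this variant the baseline drops from 0.63 to 0.39 because the missing relevant document now counts against it, while the reranked list stays at 1.00.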


## 🔍 Summary

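On this single simulated query, the reranked results score higher on all four metrics: Precision@3 (0.67 vs. 0.33), Recall@3 (1.00 vs. 0.50), MRR (1.00 vs. 0.50), and nDCG (1.00 vs. 0.63). This illustrates the value of combining dense retrieval (FAISS) with shallow re-ranking (TF-IDF cosine similarity).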

Next steps:
- Try different reranking strategies (e.g., BERTScore, LLM-assisted)
- Test across multiple queries and document types
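
As a starting point for the multi-query step, the sketch below averages per-query scores across a small query set. The second query, its relevance judgments, and its ranking are hypothetical placeholders, and the helper name `evaluate` is an assumption rather than existing notebook code.

```python
from statistics import mean

def reciprocal_rank(predicted_docs, relevant_docs):
    # 1 / rank of the first relevant document, or 0.0 if none is retrieved
    for rank, doc in enumerate(predicted_docs, start=1):
        if doc in relevant_docs:
            return 1 / rank
    return 0.0

def recall_at_k(predicted_docs, relevant_docs, k):
    # Fraction of all relevant documents that appear in the top k
    return len(set(predicted_docs[:k]) & set(relevant_docs)) / len(relevant_docs)

def evaluate(results_by_query, ground_truth, k):
    """Average reciprocal rank and Recall@k over every query in ground_truth."""
    rr = [reciprocal_rank(results_by_query[q], rel) for q, rel in ground_truth.items()]
    rec = [recall_at_k(results_by_query[q], rel, k) for q, rel in ground_truth.items()]
    return {"MRR": mean(rr), f"Recall@{k}": mean(rec)}

ground_truth = {
    "What are the effects of low HRV?": ["doc_hrv.txt", "doc_recovery.txt"],
    # Hypothetical second query, added only for illustration
    "How does noise affect sleep?": ["doc_sleep.txt", "doc_noise.txt"],
}
baseline_by_query = {
    "What are the effects of low HRV?": ["doc_sleep.txt", "doc_hrv.txt", "doc_noise.txt"],
    # Hypothetical ranking for the second query
    "How does noise affect sleep?": ["doc_noise.txt", "doc_hrv.txt", "doc_sleep.txt"],
}

scores = evaluate(baseline_by_query, ground_truth, k=3)
print(scores)  # {'MRR': 0.75, 'Recall@3': 0.75}
```

Averaging over queries is what turns the per-query reciprocal rank into a true MRR; the same pattern extends to Precision@k and nDCG.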