# RAG Evaluation Lab

This notebook implements a small, self-contained RAG evaluation framework:

1. Build a synthetic document corpus and ground-truth Q&A set.
2. Create embeddings and a vector index for retrieval.
3. Evaluate retrieval quality with:
   - Recall@k
   - Precision@k
   - MRR (Mean Reciprocal Rank)
4. (Optional) Generate answers with an LLM using retrieved context.
5. (Optional) Use an LLM-as-judge to evaluate answer correctness and faithfulness.

You can adapt this structure to real datasets later. For now, the goal is to demonstrate *evaluation thinking* around RAG.

In [25]:
import os
from typing import List, Dict, Any, Tuple
import numpy as np
import pandas as pd

from sklearn.neighbors import NearestNeighbors

# Uncomment if you want to use OpenAI embeddings / LLM-as-judge
# from openai import OpenAI
# client = OpenAI()

# Optional: use SentenceTransformers locally instead of OpenAI embeddings
try:
    from sentence_transformers import SentenceTransformer
    HAS_SENTENCE_TRANSFORMERS = True
except ImportError:
    HAS_SENTENCE_TRANSFORMERS = False

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# If you use OpenAI:
# OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# assert OPENAI_API_KEY, "Please set OPENAI_API_KEY in your environment."

## 1. Define a synthetic document corpus

We’ll create a small, focused corpus that simulates developer documentation for an internal tool or API.
Each document has an `id` and `text`.

In [26]:
documents = [
    {
        "id": "doc_1",
        "text": (
            "The Developer Telemetry API collects events such as suggestion_accept, "
            "suggestion_reject, and compile_error. Events are batched and sent every 5 seconds."
        ),
    },
    {
        "id": "doc_2",
        "text": (
            "To enable code generation in the IDE, developers must install the Codex Assistant plugin "
            "and authenticate using their API key in the settings panel."
        ),
    },
    {
        "id": "doc_3",
        "text": (
            "Recall@k is a retrieval evaluation metric that measures how often at least one "
            "relevant document appears in the top-k results for a query."
        ),
    },
    {
        "id": "doc_4",
        "text": (
            "The embeddings service supports multiple models. The default model is optimized for "
            "semantic search over short text like code comments, function names, and log messages."
        ),
    },
    {
        "id": "doc_5",
        "text": (
            "Latency budgets for AI-assisted coding should keep P95 completion under 500 milliseconds "
            "to avoid interrupting the developer's flow."
        ),
    },
]


docs_df = pd.DataFrame(documents)
docs_df

Unnamed: 0,id,text
0,doc_1,The Developer Telemetry API collects events su...
1,doc_2,"To enable code generation in the IDE, develope..."
2,doc_3,Recall@k is a retrieval evaluation metric that...
3,doc_4,The embeddings service supports multiple model...
4,doc_5,Latency budgets for AI-assisted coding should ...


## 2. Define ground-truth Q&A pairs

Each question has:
- `question`
- `answer`
- `relevant_docs`: list of document IDs that contain the necessary information.

This lets us compute retrieval metrics.

In [27]:
qa_pairs = [
    {
        "id": "q_1",
        "question": "Which events does the Developer Telemetry API collect?",
        "answer": "It collects suggestion_accept, suggestion_reject, and compile_error events.",
        "relevant_docs": ["doc_1"],
    },
    {
        "id": "q_2",
        "question": "How does a developer enable code generation in the IDE?",
        "answer": "Install the Codex Assistant plugin and authenticate with their API key in settings.",
        "relevant_docs": ["doc_2"],
    },
    {
        "id": "q_3",
        "question": "What is Recall@k in retrieval evaluation?",
        "answer": "Recall@k measures how often at least one relevant document appears in the top-k results.",
        "relevant_docs": ["doc_3"],
    },
    {
        "id": "q_4",
        "question": "What is the default embeddings model optimized for?",
        "answer": "Semantic search over short text like code comments, function names, and log messages.",
        "relevant_docs": ["doc_4"],
    },
    {
        "id": "q_5",
        "question": "What should P95 latency be for AI-assisted coding?",
        "answer": "Under 500 milliseconds so the developer's flow is not interrupted.",
        "relevant_docs": ["doc_5"],
    },
]


qa_df = pd.DataFrame(qa_pairs)
qa_df

Unnamed: 0,id,question,answer,relevant_docs
0,q_1,Which events does the Developer Telemetry API ...,"It collects suggestion_accept, suggestion_reje...",[doc_1]
1,q_2,How does a developer enable code generation in...,Install the Codex Assistant plugin and authent...,[doc_2]
2,q_3,What is Recall@k in retrieval evaluation?,Recall@k measures how often at least one relev...,[doc_3]
3,q_4,What is the default embeddings model optimized...,Semantic search over short text like code comm...,[doc_4]
4,q_5,What should P95 latency be for AI-assisted cod...,Under 500 milliseconds so the developer's flow...,[doc_5]


## 3. Create embeddings for documents and questions

You can use either:

- **SentenceTransformers** (local, no API key), or  
- **OpenAI embeddings** (commented out here, easy to switch in).

For this lab, SentenceTransformers is a good default if installed.

In [28]:
EMBEDDING_DIM = 384  # default for all-MiniLM-L6-v2

if HAS_SENTENCE_TRANSFORMERS:
    print("Using SentenceTransformers for embeddings.")
    st_model = SentenceTransformer("all-MiniLM-L6-v2")

    def embed_texts(texts: List[str]) -> np.ndarray:
        return st_model.encode(texts, show_progress_bar=False)
else:
    print("SentenceTransformers not found. Falling back to random embeddings (for structure only).")
    print("Install with: pip install sentence-transformers")

    def embed_texts(texts: List[str]) -> np.ndarray:
        # This is just a structural placeholder. Replace with real embeddings.
        return np.random.normal(size=(len(texts), EMBEDDING_DIM))


# If you want to use OpenAI embeddings instead, you could do:
#
# def embed_texts(texts: List[str]) -> np.ndarray:
#     response = client.embeddings.create(
#         model="text-embedding-3-small",
#         input=texts,
#     )
#     return np.array([item.embedding for item in response.data])

SentenceTransformers not found. Falling back to random embeddings (for structure only).
Install with: pip install sentence-transformers


In [29]:
doc_embeddings = embed_texts(docs_df["text"].tolist())
doc_embeddings.shape

# Build a simple nearest-neighbor index using cosine similarity
nn_model = NearestNeighbors(
    n_neighbors=5,
    metric="cosine"
)
nn_model.fit(doc_embeddings)

doc_id_to_index = {doc_id: i for i, doc_id in enumerate(docs_df["id"].tolist())}
index_to_doc_id = {i: doc_id for doc_id, i in doc_id_to_index.items()}

## 4. Define a retrieval function

Given a question, we embed it and retrieve the top-k most similar documents.

In [30]:
def retrieve_top_k(question: str, k: int = 3) -> List[Tuple[str, float]]:
    q_vec = embed_texts([question])
    distances, indices = nn_model.kneighbors(q_vec, n_neighbors=k)
    distances = distances[0]
    indices = indices[0]

    # cosine distance -> similarity (1 - distance)
    sims = 1 - distances

    results = []
    for idx, sim in zip(indices, sims):
        doc_id = index_to_doc_id[idx]
        results.append((doc_id, float(sim)))
    return results


# Quick sanity check
for q in qa_pairs:
    print(q["id"], q["question"])
    print(retrieve_top_k(q["question"], k=3))
    print("-----")

q_1 Which events does the Developer Telemetry API collect?
[('doc_3', 0.03358895374010673), ('doc_4', -0.02766602495166426), ('doc_1', -0.043924410386471546)]
-----
q_2 How does a developer enable code generation in the IDE?
[('doc_5', 0.008848852463883006), ('doc_2', 0.006211998633499194), ('doc_1', -0.012547493967877132)]
-----
q_3 What is Recall@k in retrieval evaluation?
[('doc_2', 0.0916082885693893), ('doc_5', 0.0025658563007698865), ('doc_3', -0.015039476951177688)]
-----
q_4 What is the default embeddings model optimized for?
[('doc_3', 0.05250199903712438), ('doc_1', 0.04694633891020672), ('doc_2', -0.0015075832319872973)]
-----
q_5 What should P95 latency be for AI-assisted coding?
[('doc_5', 0.08730096197955117), ('doc_2', 0.050156622319949884), ('doc_3', 0.015292664700832237)]
-----


## 5. Compute retrieval metrics

We will compute:

- **Recall@k** – fraction of queries where *any* relevant doc appears in the top-k.  
- **Precision@k** – fraction of retrieved docs that are relevant.  
- **MRR@k** – Mean Reciprocal Rank of first relevant doc in the top-k.

In [31]:
def compute_retrieval_metrics(
    qa_df: pd.DataFrame,
    k: int = 3
) -> Dict[str, float]:
    recalls = []
    precisions = []
    reciprocal_ranks = []

    for _, row in qa_df.iterrows():
        q_id = row["id"]
        question = row["question"]
        relevant_docs = set(row["relevant_docs"])

        retrieved = retrieve_top_k(question, k=k)
        retrieved_ids = [doc_id for doc_id, _ in retrieved]

        # Recall@k: 1 if any relevant doc is in top-k, else 0
        hit = int(bool(relevant_docs.intersection(retrieved_ids)))
        recalls.append(hit)

        # Precision@k: (# relevant in top-k) / k
        num_rel = len(relevant_docs.intersection(retrieved_ids))
        precisions.append(num_rel / k)

        # MRR@k: 1/rank_of_first_relevant if found, else 0
        rr = 0.0
        for rank, doc_id in enumerate(retrieved_ids, start=1):
            if doc_id in relevant_docs:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)

    return {
        f"Recall@{k}": float(np.mean(recalls)),
        f"Precision@{k}": float(np.mean(precisions)),
        f"MRR@{k}": float(np.mean(reciprocal_ranks)),
    }


metrics_k3 = compute_retrieval_metrics(qa_df, k=3)
metrics_k5 = compute_retrieval_metrics(qa_df, k=5)
metrics_k3, metrics_k5

({'Recall@3': 0.8,
  'Precision@3': 0.26666666666666666,
  'MRR@3': 0.6666666666666666},
 {'Recall@5': 1.0, 'Precision@5': 0.2, 'MRR@5': 0.33999999999999997})

## 6. (Optional) Generate answers using an LLM

For each question, we:
1. Retrieve top-k documents.
2. Build a context string from retrieved docs.
3. Ask an LLM to answer the question using that context.

> ⚠️ This step requires an OpenAI API key or another LLM provider.

In [32]:
USE_LLM_GENERATION = False  # Set to True if you want to run this and have an API key


def build_context_from_docs(doc_ids: List[str]) -> str:
    """Build a text block from retrieved document IDs."""
    texts = []
    for doc_id in doc_ids:
        text = docs_df.loc[docs_df["id"] == doc_id, "text"].values[0]
        texts.append(f"[{doc_id}] {text}")
    return "\n".join(texts)


def generate_answer_with_llm(question: str, context: str) -> str:
    """
    Example with OpenAI client.
    Replace with your actual LLM call as needed.
    """

    # Example with OpenAI (commented out)
    #
    # response = client.chat.completions.create(
    #     model="gpt-4o-mini",
    #     messages=[
    #         {
    #             "role": "system",
    #             "content": (
    #                 "You are a helpful assistant that answers strictly based on "
    #                 "the provided context. Do NOT hallucinate."
    #             ),
    #         },
    #         {
    #             "role": "user",
    #             "content": (
    #                 f"Context:\n{context}\n\n"
    #                 f"Question: {question}\n"
    #                 "Answer briefly and accurately:"
    #             ),
    #         },
    #     ],
    # )
    # return response.choices[0].message.content.strip()

    # Placeholder for offline use
    return f"(Mock answer) Based on the context, here's an answer to: {question}"


generated_answers = []

if USE_LLM_GENERATION:
    for _, row in qa_df.iterrows():
        q_id = row["id"]
        question = row["question"]

        retrieved = retrieve_top_k(question, k=3)
        retrieved_ids = [doc_id for doc_id, _ in retrieved]
        context = build_context_from_docs(retrieved_ids)

        answer = generate_answer_with_llm(question, context)

        generated_answers.append(
            {
                "id": q_id,
                "question": question,
                "generated_answer": answer,
                "retrieved_docs": retrieved_ids,
            }
        )

    gen_df = pd.DataFrame(generated_answers)
    display(gen_df.head())
else:
    print("Skipping LLM generation. Set USE_LLM_GENERATION = True to enable.")


Skipping LLM generation. Set USE_LLM_GENERATION = True to enable.


## 7. (Optional) LLM-as-judge: correctness & faithfulness

Given:
- Question
- Ground-truth answer
- Generated answer
- Retrieved context

We can ask an LLM to rate:
- **Correctness**: Does the generated answer match the ground truth?
- **Faithfulness**: Is it supported by the retrieved context or hallucinated?

This mirrors how many teams evaluate RAG systems today.

In [33]:
USE_LLM_JUDGE = False  # set to True if you're ready to call the LLM

def score_answer_with_llm_judge(
    question: str,
    ground_truth: str,
    generated_answer: str,
    context: str,
) -> Dict[str, Any]:
    """
    Example prompt for LLM-as-judge. Returns parsed numeric scores.
    """
    prompt = f"""
You are evaluating a RAG system.

Question: {question}

Ground truth answer:
{ground_truth}

Generated answer:
{generated_answer}

Retrieved context:
{context}

Please answer in JSON with the following fields:
- correctness: integer from 1 to 5
- faithfulness: integer from 1 to 5
- explanation: short text explanation
"""
    # Example with OpenAI:
    # response = client.chat.completions.create(
    #     model="gpt-4o-mini",
    #     messages=[{"role": "user", "content": prompt}],
    # )
    # raw = response.choices[0].message.content
    # Here you'd parse JSON from `raw`.

    # Mock result structure
    return {
        "correctness": 4,
        "faithfulness": 4,
        "explanation": "Mocked score. Replace with real LLM judge call.",
    }


judge_results = []

if USE_LLM_JUDGE and USE_LLM_GENERATION:
    for _, row in qa_df.iterrows():
        q_id = row["id"]
        question = row["question"]
        ground_truth = row["answer"]

        gen_row = gen_df.loc[gen_df["id"] == q_id].iloc[0]
        generated_answer = gen_row["generated_answer"]
        retrieved_ids = gen_row["retrieved_docs"]
        context = build_context_from_docs(retrieved_ids)

        scores = score_answer_with_llm_judge(
            question=question,
            ground_truth=ground_truth,
            generated_answer=generated_answer,
            context=context,
        )
        scores["id"] = q_id
        judge_results.append(scores)

    judge_df = pd.DataFrame(judge_results)
    judge_df
else:
    print("Skipping LLM-as-judge evaluation. Enable USE_LLM_JUDGE and USE_LLM_GENERATION to run.")

Skipping LLM-as-judge evaluation. Enable USE_LLM_JUDGE and USE_LLM_GENERATION to run.


## 8. Summary & Next Steps

In this lab, we:

- Defined a small but structured corpus and ground-truth Q&A pairs.
- Built embeddings and a vector index for retrieval.
- Computed retrieval metrics: Recall@k, Precision@k, MRR.
- (Optionally) generated answers with an LLM using retrieved context.
- (Optionally) structured an LLM-as-judge evaluation for correctness and faithfulness.

Next steps to make this *production-grade*:

- Swap synthetic data with real docs (API docs, internal knowledge base, etc.).
- Replace mock embeddings with real models (OpenAI, Cohere, local SentenceTransformers).
- Plug in real LLM calls for answer generation and judging.
- Log results over time to detect regressions when models, prompts, or retrieval strategies change.