# Saucode Retrieval Evaluation Notebook

This notebook implements the **full (ideal) option** for evaluating retrieval:

1. Define a list of evaluation queries (`EVAL_QUERIES`).
2. Use the same Qdrant instance and TF-IDF vectorizer as the API to fetch the **top-k** documents.
3. Generate a `retrieval_results.csv` file (at the path set by `RETRIEVAL_RESULTS_CSV`) with `is_relevant` left **unannotated**.
4. Annotate relevance manually (`is_relevant` = 0/1).
5. Return to this notebook to compute **Precision@5** and **nDCG@5** for each retriever.

In [2]:
import os
from dataclasses import dataclass
from typing import List, Dict, Any

import pandas as pd
import math

try:
    from qdrant_client import QdrantClient
    from qdrant_client.models import SparseVector
except ImportError:
    QdrantClient = None
    SparseVector = None
    print("[WARN] qdrant-client is not installed in this environment.\n"
          "Install it with: pip install qdrant-client")

## Configuration

In [29]:
# === Qdrant configuration ===
QDRANT_URL = "http://localhost:6333"
QDRANT_API_KEY = None
QDRANT_COLLECTION = "code_knowledge"

# === Benchmark parameters ===
TOP_K = 5

EVALS_DIR = "retriever_result"
RETRIEVAL_RESULTS_CSV = os.path.join(EVALS_DIR, "retrieval_results.csv")
TFIDF_VECTORIZER_PATH = "../infra/vectorizer.pkl"

## Evaluation queries (`EVAL_QUERIES`)

- `query_id` is a short identifier.
- `query_text` describes the information need.

In [37]:
EVAL_QUERIES = [
    {"query_id": "q001", "query_text": "long method with many if statements in Python"},
    {"query_id": "q002", "query_text": "refactoring guidelines for high cyclomatic complexity"},
    {"query_id": "q003", "query_text": "clean code practices for data preprocessing pipelines"},
    {"query_id": "q004", "query_text": "code smell: long parameter list in service functions"},
    {"query_id": "q005", "query_text": "best practices to reduce nested loops and branches"},
    {"query_id": "q006", "query_text": "how to refactor duplicated code across multiple modules"},
    {"query_id": "q007", "query_text": "improving readability in deeply nested try except blocks"},
    {"query_id": "q008", "query_text": "patterns to simplify large python classes with too many responsibilities"},
    {"query_id": "q009", "query_text": "refactoring strategy for functions doing IO and business logic together"},
    {"query_id": "q010", "query_text": "how to remove tight coupling between components in python applications"},
    {"query_id": "q011", "query_text": "best practices for organizing utility functions in python projects"},
    {"query_id": "q012", "query_text": "refactor excessive boolean flags in function signatures"},
    {"query_id": "q013", "query_text": "replace long switch or if chains with polymorphism patterns"},
    {"query_id": "q014", "query_text": "identify and eliminate dead code in python codebases"},
    {"query_id": "q015", "query_text": "refactoring large dictionary-driven functions with complex rules"},
    {"query_id": "q016", "query_text": "testing strategy for functions with side effects and global state"},
    {"query_id": "q017", "query_text": "refactor large data transformation functions into smaller steps"},
    {"query_id": "q018", "query_text": "design patterns to improve extensibility in ETL pipelines"},
    {"query_id": "q019", "query_text": "how to reduce dependency on global variables in python scripts"},
    {"query_id": "q020", "query_text": "refactoring methods that perform validation, transformation, and storage"},
    {"query_id": "q021", "query_text": "clean architecture guidelines for python backend services"},
    {"query_id": "q022", "query_text": "refactoring monolithic functions into command-query separation"},
    {"query_id": "q023", "query_text": "how to restructure code to improve unit test coverage"},
    {"query_id": "q024", "query_text": "strategies for eliminating temporal coupling in workflows"},
    {"query_id": "q025", "query_text": "refactoring python scripts to improve maintainability and modularity"},
    {"query_id": "q026", "query_text": "techniques to simplify callback-heavy asynchronous code"},
    {"query_id": "q027", "query_text": "how to handle magic numbers and inline configuration values"},
    {"query_id": "q028", "query_text": "refactor python functions that mix computation and logging"},
    {"query_id": "q029", "query_text": "best practices to break down god classes into cohesive components"},
    {"query_id": "q030", "query_text": "refactor long list comprehensions into readable steps"}
]

len(EVAL_QUERIES)

30

## TF-IDF encoding function

In [31]:
import pickle

_vectorizer = None

# Load the fitted TF-IDF vectorizer from disk.
with open(TFIDF_VECTORIZER_PATH, "rb") as f:
    _vectorizer = pickle.load(f)

def encode_query_to_sparse_vector(text: str):
    """Return (indices, values) of the non-zero TF-IDF entries for `text`."""
    vec = _vectorizer.transform([text])
    coo = vec.tocoo()
    return coo.col.tolist(), coo.data.tolist()

## Qdrant client

Helper function that creates the Qdrant client from the configuration above.

In [33]:
def build_qdrant_client() -> QdrantClient:
    if QdrantClient is None:
        raise ImportError(
            "qdrant-client is not available. Install it with: pip install qdrant-client"
        )
    return QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)


## Retrieval function for Qdrant

This function runs a `TOP_K` search using **sparse vectors**. To also test
a dense retriever, add a similar block that queries a dense vector field
in your collection.

In [34]:
from qdrant_client.http.models import NamedSparseVector, SparseVector

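# Name of the sparse vector field queried in the collection.
VECTOR_NAME = "text"

def retrieve_top_k_sparse(client, query_text: str, top_k: int = 5):
    indices, values = encode_query_to_sparse_vector(query_text)

    # Note: recent qdrant-client releases mark `search` as deprecated
    # in favor of `query_points`.
    results = client.search(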
        collection_name=QDRANT_COLLECTION,
        query_vector=NamedSparseVector(
            name=VECTOR_NAME,
            vector=SparseVector(
                indices=indices,
                values=values
            )
        ),
        limit=top_k,
        with_payload=True
    )

    rows = []
    for r in results:
        rows.append({
            "chunk_id": r.payload.get("chunk_id", r.id),
            "score": r.score,
            "payload": r.payload
        })
    return rows

## Generating `retrieval_results.csv`

This step:

1. Iterates over all `EVAL_QUERIES`.
2. Calls the **sparse** retriever for each query (a dense one can be added later).
3. Builds a `DataFrame` with the columns:
   - `query_id`, `query_text`
   - `retriever` (for now: `sparse`)
   - `rank` (1..TOP_K)
   - `chunk_id`, `text`, `score`
   - `is_relevant` (initially `None`)
4. Writes the file to the path in `RETRIEVAL_RESULTS_CSV`.

After this step, open the CSV in your editor and set `is_relevant` to 0 or 1
for each row.

In [38]:
def build_retrieval_results_dataframe() -> pd.DataFrame:
    client = build_qdrant_client()
    rows = []

    for q in EVAL_QUERIES:
        qid = q["query_id"]
        qtext = q["query_text"]

        # --- Sparse retriever ---
        sparse_results = retrieve_top_k_sparse(client, qtext, top_k=TOP_K)

        for rank, item in enumerate(sparse_results, start=1):
            rows.append(
                {
                    "query_id": qid,
                    "query_text": qtext,
                    "retriever": "sparse",
                    "rank": rank,
                    "chunk_id": item["chunk_id"],
                    "text": item["payload"].get("text", ""),
                    "score": item["score"],
                    "is_relevant": None,
                }
            )

    df = pd.DataFrame(rows)
    return df


In [39]:
# Run this cell to generate the results CSV (is_relevant not yet annotated).

os.makedirs(EVALS_DIR, exist_ok=True)
df_results = build_retrieval_results_dataframe()
df_results.to_csv(RETRIEVAL_RESULTS_CSV, index=False)
df_results.head()

| | query_id | query_text | retriever | rank | chunk_id | text | score | is_relevant |
|---|---|---|---|---|---|---|---|---|
| 0 | q001 | long method with many if statements in Python | sparse | 1 | Fluent.Python.2nd.Edition.(z-lib.org).pdf:p928_c1 | EAFP Easier to ask for forgiveness than permis... | 0.091903 | |
| 1 | q001 | long method with many if statements in Python | sparse | 2 | cc_knowledge_book.pdf:p141_c2 | problems upon our callers. All it takes is one... | 0.073481 | |
| 2 | q001 | long method with many if statements in Python | sparse | 3 | Fluent.Python.2nd.Edition.(z-lib.org).pdf:p950_c1 | Chapter Summary This chapter started easily en... | 0.063037 | |
| 3 | q001 | long method with many if statements in Python | sparse | 4 | cc_knowledge_book.pdf:p293_c1 | 262 Chapter 15: JUnit Internals We replaced th... | 0.062122 | |
| 4 | q001 | long method with many if statements in Python | sparse | 5 | Fluent.Python.2nd.Edition.(z-lib.org).pdf:p924_c1 | Chapter 18. Context Managers and else Blocks A... | 0.060011 | |


## Computing Precision@5 and nDCG@5

Once you have edited the results CSV and filled in the `is_relevant`
column with 0/1 values, use the following cells to compute the metrics
per retriever.

In [None]:
def precision_at_k(rels: List[int], k: int = 5) -> float:
    rels_k = rels[:k]
    return sum(rels_k) / float(k) if k > 0 else 0.0

def dcg_at_k(rels: List[int], k: int = 5) -> float:
    dcg = 0.0
    for i, rel in enumerate(rels[:k], start=1):
        dcg += rel / math.log2(i + 1)
    return dcg

def ndcg_at_k(rels: List[int], k: int = 5) -> float:
    dcg = dcg_at_k(rels, k)
    ideal = sorted(rels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg / idcg if idcg > 0 else 0.0
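
As a sanity check on the two metrics, here is a small worked example on a single hypothetical relevance list (the values are made up for illustration, not taken from the real annotations). The functions are repeated so the snippet runs on its own.

```python
import math
from typing import List

def precision_at_k(rels: List[int], k: int = 5) -> float:
    # Fraction of the first k results that are relevant.
    return sum(rels[:k]) / float(k) if k > 0 else 0.0

def dcg_at_k(rels: List[int], k: int = 5) -> float:
    # Each relevant hit is discounted by log2(rank + 1).
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels: List[int], k: int = 5) -> float:
    # The ideal ranking is the same list sorted with relevant items first.
    idcg = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0.0

# Hypothetical annotations for one query: relevant hits at ranks 1 and 3.
rels = [1, 0, 1, 0, 0]

p5 = precision_at_k(rels)   # 2 relevant out of 5 -> 0.4
n5 = ndcg_at_k(rels)        # DCG = 1 + 1/2 = 1.5; IDCG = 1 + 1/log2(3)
print(f"P@5 = {p5:.4f}, nDCG@5 = {n5:.4f}")  # P@5 = 0.4000, nDCG@5 = 0.9197
```

The ideal ranking is built only from the retrieved items, so nDCG here measures how well the top-k is ordered, not whether relevant chunks outside the top-k were missed.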


In [None]:
def load_annotated_results(csv_path: str = RETRIEVAL_RESULTS_CSV) -> pd.DataFrame:
    df = pd.read_csv(csv_path)
    if "is_relevant" not in df.columns:
        raise ValueError("The CSV has no 'is_relevant' column. Add it before computing metrics.")
    return df
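
`compute_metrics_per_retriever` fills missing `is_relevant` values with 0, so rows that were never annotated silently count as not relevant. Below is a minimal sketch of a pre-check that lists such rows before the metrics are computed. `find_unannotated_rows` and the sample CSV are hypothetical, and the check assumes annotations are stored as the literal strings `0` or `1`.

```python
import csv
import io
from typing import List, Tuple

def find_unannotated_rows(csv_text: str) -> List[Tuple[str, str]]:
    """Return (query_id, rank) for rows whose is_relevant is not '0' or '1'."""
    bad = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get("is_relevant") or "").strip()
        if value not in {"0", "1"}:
            bad.append((row["query_id"], row["rank"]))
    return bad

# Hypothetical sample: one row left empty, one row with an invalid label.
sample = (
    "query_id,retriever,rank,chunk_id,is_relevant\n"
    "q001,sparse,1,doc_a,1\n"
    "q001,sparse,2,doc_b,\n"
    "q001,sparse,3,doc_c,0\n"
    "q002,sparse,1,doc_d,yes\n"
)

missing = find_unannotated_rows(sample)
print(missing)  # [('q001', '2'), ('q002', '1')]
```

To check the real file, read its contents with `open(RETRIEVAL_RESULTS_CSV, encoding="utf-8").read()` and pass the text to the helper.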


In [None]:
def compute_metrics_per_retriever(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
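    # Work on a copy so the caller's DataFrame is not modified, and make sure
    # is_relevant is an integer 0/1. Missing annotations are treated as 0.
    df = df.copy()
    df["is_relevant"] = df["is_relevant"].fillna(0).astype(int)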

    rows = []

    for retriever in df["retriever"].unique():
        df_r = df[df["retriever"] == retriever]
        p_at_ks = []
        ndcg_at_ks = []

        for qid in df_r["query_id"].unique():
            df_q = df_r[df_r["query_id"] == qid].sort_values("rank")
            rels = df_q["is_relevant"].tolist()

            p_at_ks.append(precision_at_k(rels, k=k))
            ndcg_at_ks.append(ndcg_at_k(rels, k=k))

        rows.append(
            {
                "retriever": retriever,
                f"P@{k}": sum(p_at_ks) / len(p_at_ks),
                f"nDCG@{k}": sum(ndcg_at_ks) / len(ndcg_at_ks),
                "num_queries": len(p_at_ks),
            }
        )

    return pd.DataFrame(rows)


In [None]:
# Once the CSV has been annotated with is_relevant, run this:
df_annotated = load_annotated_results()
metrics_df = compute_metrics_per_retriever(df_annotated, k=5)
metrics_df

## Next steps

- Add a second retriever (`dense`) that reuses your dense encoder and your hybrid collection.
- Also save payload information (for example, document title and section) in the CSV
  to make manual relevance annotation easier.
- Try different `TOP_K` values (for example 3 and 10) and compare the metrics.
- Integrate this notebook with your other Saucode evaluation notebooks for a complete
  experimentation workflow.
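
One possible way to compare cutoffs is to retrieve and annotate once at the largest `k`, then truncate each ranked list for the smaller values. The sketch below assumes that approach. The `annotated` dictionary and the `mean_precision` helper are hypothetical, with made-up 0/1 labels.

```python
from typing import Dict, List

def precision_at_k(rels: List[int], k: int) -> float:
    # Divides by k, so a list shorter than k is penalized for the missing ranks.
    return sum(rels[:k]) / k if k > 0 else 0.0

def mean_precision(runs: Dict[str, List[int]], k: int) -> float:
    # Average P@k over all queries.
    return sum(precision_at_k(rels, k) for rels in runs.values()) / len(runs)

# Hypothetical annotations: one 0/1 list per query, in rank order,
# as if retrieved with TOP_K = 10.
annotated: Dict[str, List[int]] = {
    "q001": [1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    "q002": [0, 1, 1, 1, 0, 0, 0, 0, 0, 1],
}

for k in (3, 5, 10):
    print(f"P@{k} = {mean_precision(annotated, k):.3f}")
# P@3 = 0.667
# P@5 = 0.500
# P@10 = 0.350
```

Because `precision_at_k` divides by `k`, evaluating at `k = 10` on a CSV generated with `TOP_K = 5` would understate precision. The results file needs at least `k` rows per query.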