# 05 — Evaluation (TF‑IDF & optional SBERT)

Goal: quantitatively evaluate our mini search stack from week 1.

**What we measure**
- *MRR* (Mean Reciprocal Rank)
- *Precision@k* (k=1,3,5)
- *MAP* (Mean Average Precision)
- *Coverage* (how often a system places at least one relevant document in the top k)
- *Latency* (milliseconds per query)
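
For reference, the ranking metrics listed above are commonly defined as follows. The notation is an assumption of this sketch and does not come from the notebook: $Q$ is the set of queries, $R_q$ the set of relevant documents for query $q$, $\mathrm{rel}_q(i) \in \{0, 1\}$ indicates whether the document at rank $i$ is relevant, $\mathrm{rank}_q$ is the rank of the first relevant document (with $1/\mathrm{rank}_q := 0$ if none is retrieved), and $n$ is the length of the retrieved list.

$$
\mathrm{P@}k(q) = \frac{1}{k}\sum_{i=1}^{k}\mathrm{rel}_q(i),
\qquad
\mathrm{MRR} = \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{\mathrm{rank}_q}
$$

$$
\mathrm{AP}(q) = \frac{1}{|R_q|}\sum_{i=1}^{n}\mathrm{P@}i(q)\,\mathrm{rel}_q(i),
\qquad
\mathrm{MAP} = \frac{1}{|Q|}\sum_{q \in Q}\mathrm{AP}(q)
$$

$$
\mathrm{cov@}k = \frac{1}{|Q|}\sum_{q \in Q}\mathbb{1}\!\left[\sum_{i=1}^{k}\mathrm{rel}_q(i) > 0\right]
$$

The notebook's `precision_at_k` divides by the number of documents actually returned, which equals $k$ whenever the ranker returns at least $k$ results.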

Optionally, we compare TF‑IDF against SBERT (if it is available).

In [18]:
import warnings

In [19]:
from tqdm import TqdmExperimentalWarning
warnings.filterwarnings("ignore", category=TqdmExperimentalWarning)

In [20]:
import json, time, warnings, importlib.util
from pathlib import Path
from typing import List, Dict, Tuple
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

DATA = Path("data"); DATA.mkdir(exist_ok=True)

def load_corpus() -> List[str]:
    p = DATA/"sample_corpus.json"
    if p.exists():
        with p.open("r", encoding="utf-8") as f:
            x = json.load(f)
            if isinstance(x, list) and all(isinstance(t, str) for t in x):
                return x
    return [
        "Die Snare ist zu laut und harsch",
        "Kick zu weich, es fehlt der Punch",
        "Vocals klingen nasal, 800 Hz absenken",
        "Bass maskiert die Kick, Sidechain nötig",
        "S-Laute sind scharf, De-Esser einsetzen",
    ]

corpus = load_corpus()
len(corpus), corpus[:2]

(5, ['Die Snare ist zu laut und harsch', 'Kick zu weich, es fehlt der Punch'])

## Setup & data
We load the corpus from `data/sample_corpus.json`. If the file is missing or does not contain a JSON list of strings, we use a small built-in fallback.  
A CPU-only installation is sufficient for SBERT; if the import fails, we evaluate TF‑IDF only.
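
`load_corpus` only accepts a JSON file whose top-level value is a list of strings. A minimal sketch of writing and re-reading such a file follows. It deliberately writes to a temporary directory instead of `data/` so nothing gets overwritten, and the two documents are copied from the fallback corpus.

```python
import json
import tempfile
from pathlib import Path

docs = [
    "Die Snare ist zu laut und harsch",
    "Kick zu weich, es fehlt der Punch",
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_corpus.json"
    # ensure_ascii=False keeps umlauts readable in the file
    path.write_text(json.dumps(docs, ensure_ascii=False, indent=2), encoding="utf-8")

    loaded = json.loads(path.read_text(encoding="utf-8"))
    # Same shape check the loader applies: a list whose items are all str
    is_valid = isinstance(loaded, list) and all(isinstance(t, str) for t in loaded)

print(is_valid, len(loaded))
```

To use a corpus of your own, the same `json.dumps` call can target `data/sample_corpus.json`.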

## Defining the rankers
We build a TF‑IDF ranker and optionally try to load an SBERT ranker. Both return `(indices, scores)`, sorted by descending similarity and truncated to the top `k`.

In [21]:
tfidf = TfidfVectorizer(lowercase=True, ngram_range=(1,2), min_df=1)
X = tfidf.fit_transform(corpus)

def rank_tfidf(query: str, k: int = 5):
    qv = tfidf.transform([query])
    sims = linear_kernel(qv, X).ravel()
    order = np.argsort(-sims)
    topk = order[:k]
    return topk.tolist(), sims[topk].tolist()

# Sanity-Check
rank_tfidf("snare zu laut", k=min(5, len(corpus)))

([0, 1, 2, 3, 4], [0.5447735663555926, 0.09712682146733126, 0.0, 0.0, 0.0])

In [22]:
def try_sbert(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"):
    try:
        if importlib.util.find_spec("sentence_transformers") is None:
            return None

        from sentence_transformers import SentenceTransformer
        import warnings
        from tqdm import TqdmExperimentalWarning
        warnings.filterwarnings("ignore", category=TqdmExperimentalWarning)

        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            model = SentenceTransformer(model_name, device="cpu")

        doc_emb = model.encode(corpus, normalize_embeddings=True)

        def rank(query: str, k: int = 5):
            qv = model.encode([query], normalize_embeddings=True)
            sims = (qv @ doc_emb.T).ravel()
            order = np.argsort(-sims)
            topk = order[:k]
            return topk.tolist(), sims[topk].tolist()

        return rank
    except Exception as e:
        print("SBERT not available:", e)
        return None
        
rank_sbert = try_sbert()
rank_sbert

<function __main__.try_sbert.<locals>.rank(query: str, k: int = 5)>

## Ground truth (mini set)
A small mapping from each query to the indices of its relevant documents. Feel free to extend the list.
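
When extending the mapping, it is easy to introduce an index that does not exist in the corpus; such an entry can never be retrieved and would silently lower AP for that query. A small check can catch this early. The helper `validate_ground_truth` below is hypothetical (it is not part of the notebook) and is only a sketch:

```python
from typing import Dict, List

def validate_ground_truth(gt: Dict[str, List[int]], n_docs: int) -> List[str]:
    """Return human-readable problems; an empty list means the mapping looks fine."""
    problems = []
    for query, rel in gt.items():
        if not rel:
            problems.append(f"{query!r}: no relevant documents")
        for idx in rel:
            if not 0 <= idx < n_docs:
                problems.append(f"{query!r}: index {idx} outside 0..{n_docs - 1}")
        if len(set(rel)) != len(rel):
            problems.append(f"{query!r}: duplicate indices")
    return problems

# Example with one deliberately wrong index (7) for a 5-document corpus
example_gt = {"snare zu laut": [0], "kick zu weich": [1, 7]}
print(validate_ground_truth(example_gt, n_docs=5))
```

In the notebook, this could be called as `validate_ground_truth(GT, len(corpus))` right after defining `GT`.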

In [23]:
GT: Dict[str, List[int]] = {
    "snare zu laut": [0],
    "kick zu weich": [1],
    "vocals nasal 800 hz": [2],
    "bass maskiert kick sidechain": [3],
    "s-laute scharf de-esser": [4],
}
GT

{'snare zu laut': [0],
 'kick zu weich': [1],
 'vocals nasal 800 hz': [2],
 'bass maskiert kick sidechain': [3],
 's-laute scharf de-esser': [4]}

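## Metrics
We implement MRR, Precision@k, AP/MAP, and coverage. Per query, the `MRR` column holds the reciprocal rank and `AP@k` the average precision; averaging over all queries yields MRR and MAP.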

In [24]:
def precision_at_k(relevants: List[int], retrieved: List[int], k=3) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    R = set(relevants)
    topk = retrieved[:k]
    hits = sum(1 for i in topk if i in R)
    # Divides by the number of documents actually returned (at most k)
    return hits / max(1, len(topk))

def reciprocal_rank(relevants: List[int], retrieved: List[int]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    R = set(relevants)
    for r, idx in enumerate(retrieved, 1):
        if idx in R:
            return 1.0 / r
    return 0.0

def average_precision(relevants: List[int], retrieved: List[int], k=None) -> float:
    """Sum of precision values at each relevant hit, divided by the number of relevant documents."""
    R = set(relevants)
    if not R:
        return 0.0
    ap_sum, hits = 0.0, 0
    cut = len(retrieved) if k is None else min(k, len(retrieved))
    for r in range(1, cut+1):
        if retrieved[r-1] in R:
            hits += 1
            ap_sum += hits / r
    return ap_sum / max(1, len(R))

def evaluate_run(run_name: str, rank_fn, ks=(1,3,5)) -> Tuple[pd.DataFrame, pd.DataFrame]:
    rows = []
    for q, rel in GT.items():
        t0 = time.perf_counter()
        idxs, _ = rank_fn(q, k=max(ks))
        latency_ms = (time.perf_counter() - t0) * 1000
        row = {
            "run": run_name,
            "query": q,
            "MRR": reciprocal_rank(rel, idxs),
            "AP@k": average_precision(rel, idxs, k=max(ks)),
            "latency_ms": latency_ms,
        }
        for k_ in ks:
            row[f"P@{k_}"] = precision_at_k(rel, idxs, k_)
            row[f"cov@{k_}"] = float(any(i in set(rel) for i in idxs[:k_]))
        rows.append(row)
    df = pd.DataFrame(rows)
    agg = df.drop(columns=["query"]).groupby("run").mean(numeric_only=True).round(3)
    return df, agg

# Example: evaluate TF‑IDF
tf_row, tf_agg = evaluate_run("tfidf", rank_tfidf)
display(tf_agg); tf_row

| run | MRR | AP@k | latency_ms | P@1 | cov@1 | P@3 | cov@3 | P@5 | cov@5 |
|---|---|---|---|---|---|---|---|---|---|
| tfidf | 1.0 | 1.0 | 0.394 | 1.0 | 1.0 | 0.333 | 1.0 | 0.2 | 1.0 |


|  | run | query | MRR | AP@k | latency_ms | P@1 | cov@1 | P@3 | cov@3 | P@5 | cov@5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tfidf | snare zu laut | 1.0 | 1.0 | 0.6085 | 1.0 | 1.0 | 0.333333 | 1.0 | 0.2 | 1.0 |
| 1 | tfidf | kick zu weich | 1.0 | 1.0 | 0.37675 | 1.0 | 1.0 | 0.333333 | 1.0 | 0.2 | 1.0 |
| 2 | tfidf | vocals nasal 800 hz | 1.0 | 1.0 | 0.346334 | 1.0 | 1.0 | 0.333333 | 1.0 | 0.2 | 1.0 |
| 3 | tfidf | bass maskiert kick sidechain | 1.0 | 1.0 | 0.322667 | 1.0 | 1.0 | 0.333333 | 1.0 | 0.2 | 1.0 |
| 4 | tfidf | s-laute scharf de-esser | 1.0 | 1.0 | 0.315292 | 1.0 | 1.0 | 0.333333 | 1.0 | 0.2 | 1.0 |
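
The P@3 and P@5 values of 0.333 and 0.2 do not indicate a weak ranker: every query in the ground truth has exactly one relevant document, so P@k cannot exceed 1/k. The ceiling can be sketched with a hypothetical helper, `max_precision_at_k`, which is an illustration and not part of the notebook:

```python
def max_precision_at_k(n_relevant: int, k: int) -> float:
    """Best achievable P@k when only n_relevant documents are relevant."""
    return min(n_relevant, k) / k

# One relevant document per query, as in the mini ground truth
for k in (1, 3, 5):
    print(k, round(max_precision_at_k(1, k), 3))
```

On single-answer ground truth like this, MRR and coverage are usually the more informative numbers; P@k becomes more meaningful once queries have several relevant documents.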


In [25]:
if rank_sbert:
    sb_row, sb_agg = evaluate_run("sbert", rank_sbert)
    display(sb_agg); sb_row

| run | MRR | AP@k | latency_ms | P@1 | cov@1 | P@3 | cov@3 | P@5 | cov@5 |
|---|---|---|---|---|---|---|---|---|---|
| sbert | 1.0 | 1.0 | 15.5 | 1.0 | 1.0 | 0.333 | 1.0 | 0.2 | 1.0 |


## Comparison (optionally with a small visualization)
We combine the aggregates (provided SBERT is available).

In [26]:
agg_list = [tf_agg]
if 'sb_agg' in globals():
    agg_list.append(sb_agg)
combined = pd.concat(agg_list)
combined

| run | MRR | AP@k | latency_ms | P@1 | cov@1 | P@3 | cov@3 | P@5 | cov@5 |
|---|---|---|---|---|---|---|---|---|---|
| tfidf | 1.0 | 1.0 | 0.394 | 1.0 | 1.0 | 0.333 | 1.0 | 0.2 | 1.0 |
| sbert | 1.0 | 1.0 | 15.5 | 1.0 | 1.0 | 0.333 | 1.0 | 0.2 | 1.0 |


## Export
We save the per-query rows to `data/eval_details_<run>.csv` and the aggregates to `data/eval_summary_<run>.csv`, where `<run>` is `tfidf` or, if SBERT was evaluated, `sbert`.

In [27]:
DATA.mkdir(exist_ok=True)
tf_row.to_csv(DATA/"eval_details_tfidf.csv", index=False)
tf_agg.to_csv(DATA/"eval_summary_tfidf.csv")
if 'sb_row' in globals():
    sb_row.to_csv(DATA/"eval_details_sbert.csv", index=False)
    sb_agg.to_csv(DATA/"eval_summary_sbert.csv")
print("Saved to data/…")

Saved to data/…


## Exercises
1. Extend the ground truth with new queries and relevant documents.
2. Vary the TF‑IDF parameters (`ngram_range`, stop words, `min_df`) and observe MRR/precision.
3. Generate noisy queries (e.g. with spelling errors) and compare the robustness of TF‑IDF vs. SBERT.
4. Log latencies over 100 repetitions (warm/cold) and compute quantiles (P50/P90/P99).
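
For exercise 4, one possible starting point is sketched below. It uses only the standard library. The names `measure_latencies`, `summarize`, and the `dummy_rank` stand-in are hypothetical and introduced for illustration; in the notebook you would pass `rank_tfidf` or `rank_sbert` instead. Reporting the first call separately is only a rough proxy for a cold start, an assumption of this sketch rather than a rigorous warm/cold protocol.

```python
import statistics
import time
from typing import Callable, Dict, List

def measure_latencies(rank_fn: Callable, query: str, repeats: int = 100) -> List[float]:
    """Call rank_fn(query) `repeats` times and return per-call latencies in ms."""
    latencies = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        rank_fn(query)
        latencies.append((time.perf_counter() - t0) * 1000)
    return latencies

def summarize(latencies: List[float]) -> Dict[str, float]:
    """First call vs. P50/P90/P99 of the remaining (warm) calls."""
    warm = latencies[1:]
    # quantiles(n=100) returns the 99 cut points P1..P99
    cuts = statistics.quantiles(warm, n=100, method="inclusive")
    return {
        "first_ms": latencies[0],
        "p50_ms": cuts[49],
        "p90_ms": cuts[89],
        "p99_ms": cuts[98],
    }

def dummy_rank(query: str, k: int = 5):
    # Stand-in for a real ranker; returns (indices, scores)
    return list(range(k)), [0.0] * k

stats = summarize(measure_latencies(dummy_rank, "snare zu laut"))
print(stats)
```

Quantiles are more robust than the mean here, because a single slow call (for example model loading or garbage collection) can dominate an average over few queries.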
