A calibrated falsification harness for retrieval evaluation. Bring your own retrieval system, your own corpus, your own metric.
Most retrieval-system papers report a single aggregate metric (nDCG@k, MRR) and call it a contribution. Three failure modes make this practice unsafe at any benchmark size and dangerous on small ones:
- Null-distribution silence. A learned ranker can absorb the shape of the gold-label distribution without learning the underlying query–document relevance. A constant predictor matched to the empirical class marginal can score non-trivially without using the query at all (a minimal sketch follows this list).
- Corpus drift between commits. ALTER TABLE migrations and feedback-loop side effects mutate runtime artifacts without changing source code. A "score-neutral" annotation can be true about the source diff while false about the runnable system.
- Small-sample claims masquerading as significance. A +0.02 metric gain on N < 50 queries usually sits inside the bench's noise floor.
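A minimal sketch of that first failure mode, on entirely hypothetical data (nothing here is library code): a predictor that always returns the benchmark's most frequent gold label never reads the query, yet earns a non-trivial hit@1 whenever the label distribution is skewed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 50-query benchmark with a skewed gold marginal:
# label 0 accounts for 60% of queries, four other labels share the rest.
gold = rng.choice(np.arange(5), size=50, p=[0.6, 0.1, 0.1, 0.1, 0.1])

# A "retriever" that ignores the query and always ranks the majority label first.
majority_label = np.bincount(gold).argmax()
hit_at_1 = float(np.mean(gold == majority_label))

print(f"constant predictor hit@1 = {hit_at_1:.2f}")  # ≈ 0.6, without ever reading a query
```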
This library implements a four-part falsification harness that addresses all
three, in <1000 lines of Python with only numpy as a runtime dependency:
| Component | What it does |
|---|---|
| `four_null_gate` | Tests engine output against four null distributions: gold-permuted (A), gold-uniform-random (B), random-retrieval (C), and gold-marginal-matched (D). The gate PASSes when Δ ≥ τ on all four. Null D is original to this work; it correctly rejects predictors matched to the empirical gold marginal, which nulls A and B can pass as false positives. |
| `lock_state` / `verify_state` | Walks a directory, hashes every binary artifact with SHA-256, and binds the result to the current git commit. CI-enforceable verification exits non-zero on any drift. |
| `bootstrap_ci`, `paired_permutation_p`, `cohens_d_paired` | Statistical reporting requirements: bootstrap confidence intervals on means and paired differences, paired permutation tests, and effect-size reporting. |
| `power_n_required` | Computes the N required to detect an observed Δ at α=0.05, power=0.80. Quantifies why an UNDER-NS verdict is UNDER-NS. |
The methodology is corpus-agnostic, metric-agnostic, and engine-agnostic. The public interface accepts any callable that returns top-K rankings.
```bash
git clone https://github.com/<your-handle>/falsify-eval.git
cd falsify-eval
pip install -e .
python3 examples/synthetic_demo.py
```

The demo runs the four-null gate on three systems against a 50-query synthetic benchmark with 5 labels: a constant predictor (deliberately broken), a mock plausible engine (70% top-1 hit rate), and an oracle (perfect top-1).
Expected output:
```text
=== System: constant_predictor (deliberately broken) ===
real mean nDCG@5 = 0.X
Null A: Δ ≈ +0 ✓ but small
Null B: Δ ≈ +0 ✓ but small
Null D: Δ = exactly 0.0000 ✗ FAIL ← marginal-matched null catches it
GATE: ✗ FAIL (correctly rejected)
=== System: mock_engine (plausible retrieval) ===
real mean nDCG@5 = 0.6X
Δ ≥ +0.4 across all 4 nulls ✓ PASS
GATE: ✓ PASS (correctly accepted)
=== System: oracle (upper bound) ===
real mean nDCG@5 = 1.0
GATE: ✓ PASS by maximum margin
```
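To make the four null constructions concrete, here is a self-contained toy sketch using hit@1 for brevity instead of nDCG@5. Everything here is hypothetical (the data, the mock engine, and the single-draw nulls); the library's internal implementation may differ, and the real gate averages each null over n_trials draws.

```python
import numpy as np

rng = np.random.default_rng(2026)

# Toy benchmark: 50 queries, 5 items/labels, skewed gold marginal.
item_pool = np.arange(5)
gold = rng.choice(item_pool, size=50, p=[0.6, 0.1, 0.1, 0.1, 0.1])

def mock_engine(g):
    """Puts the gold item first 70% of the time, otherwise returns a random order."""
    ranking = rng.permutation(item_pool)
    if rng.random() < 0.7:
        ranking = np.concatenate(([g], ranking[ranking != g]))
    return ranking

retrieved = [mock_engine(g) for g in gold]
hit1 = lambda ranking, g: float(ranking[0] == g)
score = lambda R, G: float(np.mean([hit1(r, g) for r, g in zip(R, G)]))

real = score(retrieved, gold)
nulls = {
    "A": score(retrieved, rng.permutation(gold)),                  # gold-permuted
    "B": score(retrieved, rng.choice(item_pool, size=len(gold))),  # gold-uniform-random
    "C": score([rng.permutation(item_pool) for _ in gold], gold),  # random-retrieval
    "D": score(retrieved, rng.choice(gold, size=len(gold))),       # gold-marginal-matched
}
for name, null_mean in nulls.items():
    print(f"Null {name}: Δ = {real - null_mean:+.3f}")
```

Null D is the one a marginal-matched constant predictor cannot beat: resampling gold from its own empirical marginal leaves such a predictor's expected score unchanged, so its Δ against null D collapses to roughly zero even when nulls A and B look fine.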
```python
import numpy as np
from falsify_eval import four_null_gate, lock_state, verify_state
from falsify_eval import bootstrap_ci, paired_permutation_p, cohens_d_paired
# 1. The four-null gate
def my_metric(retrieved_ids, gold, rel):
    """Your scoring function. Return float in [0, 1]."""
    return 1.0 if gold in retrieved_ids[:5] else 0.0
result = four_null_gate(
    retrieved_lists=engine_top_k_per_query,  # list of lists of item-ids
    gold_list=gold_label_per_query,          # list of class labels
    rel_list=relevance_per_query,            # list of relevance grades
    metric_fn=my_metric,
    item_pool=all_item_ids_in_corpus,        # for null C
    k=5, n_trials=50, tau=0.05, seed=2026,
)
assert result["gate_passes"], f"Engine failed gate: {result['deltas']}"
# 2. Cryptographic state lock
import json
lock = lock_state("./corpus", git_repo=".", bench_score=result["real_mean"])
with open("corpus.lock.json", "w") as f:
    json.dump(lock, f, indent=2, sort_keys=True)
# Later (in CI or before reporting a score):
diff = verify_state("corpus.lock.json", "./corpus")
assert diff["matches"], f"Corpus drift: {diff}"
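# In a CI job, an explicit non-zero exit is clearer than the assert above (and it
# survives `python -O`, which strips asserts). Illustrative wiring only, not a
# documented entry point of the library.
import sys
if not diff["matches"]:
    sys.exit(f"Corpus drift: {diff}")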
# 3. Statistical reporting
mean, ci_lo, ci_hi, sd = bootstrap_ci(per_query_scores)
delta_mean, p = paired_permutation_p(scores_with_feature, scores_without)
d = cohens_d_paired(scores_with_feature, scores_without)
verdict = (
    "HIT" if delta_mean > 0 and p < 0.05 else
    "UNDER-NS" if delta_mean > 0 else
    "MISS"
)
```

If the four-null gate PASSes (Δ ≥ τ on all four nulls) at N_trials = 50 and τ = 0.05, then with Bonferroni-corrected confidence ≥ 0.95 (a sketch of the underlying bound follows this list):
- The engine is not equivalent to a label-permutation-invariant ranker (rejected by G_A).
- The engine is not achieving its score solely via the uniform-class-prior assumption (rejected by G_B).
- The engine is not equivalent to a uniform-random retriever (rejected by G_C).
- The engine is not equivalent to a gold-marginal-matched predictor (rejected by G_D — new in this work).
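For intuition about where the Bonferroni-corrected 0.95 comes from, one plausible shape of the argument is sketched below; the precise statement and constants are Proposition 1 in PREPRINT.md, not this sketch. Each null trial's score lies in [0, 1], so Hoeffding bounds how far the empirical null mean can sit above its expectation, and a union (Bonferroni) bound over the four nulls controls the overall error:

```math
\Pr\left[\bar{M}_{\mathrm{null}} \ge \mathbb{E}[M_{\mathrm{null}}] + \varepsilon\right] \le \exp\left(-2\,N_{\mathrm{trials}}\,\varepsilon^{2}\right),
\qquad
\Pr\left[\text{any of the four nulls deviates by } \varepsilon\right] \le 4\exp\left(-2\,N_{\mathrm{trials}}\,\varepsilon^{2}\right).
```

Choosing ε so that the right-hand side is at most 0.05 yields the stated confidence level.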
A passing gate is a necessary condition for credible reporting, not sufficient. It does not prove (a) the engine learned the actual relevance signal, only that it learned something beyond the four enumerated trivial classes; (b) the engine generalises beyond the evaluation set; (c) per-feature contribution claims are significant (handled separately by the statistical reporting helpers); (d) the engine is free of bench-developer overfitting in query phrasing.
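Of the statistical helpers, power_n_required is the only one not exercised in the usage example above. Its exact signature is not documented in this README, so the call below is an assumption (observed effect and per-query spread in, required query count out), shown only to indicate where it fits in the workflow.

```python
import numpy as np
from falsify_eval import power_n_required

# Hypothetical per-query paired differences from a feature ablation.
rng = np.random.default_rng(0)
diffs = rng.normal(loc=0.02, scale=0.15, size=40)  # small effect, noisy, only 40 queries

# Assumed signature: effect size and standard deviation in, queries required at
# alpha=0.05, power=0.80 out. Check the library docstring for the real interface.
n_needed = power_n_required(delta=float(diffs.mean()), sd=float(diffs.std(ddof=1)),
                            alpha=0.05, power=0.80)
print(f"≈{n_needed} queries needed to detect Δ={diffs.mean():+.3f}; at N=40, UNDER-NS is the honest verdict.")
```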
The companion methodology paper is included in this repository:
- PREPRINT.md — Calibrated Falsification Harnesses for Retrieval Evaluation (v7, with the N=10,000 validation, the broken-predictor suite, the sensitivity grid, and the soundness proposition for the four-null gate).
- SUPPLEMENTARY.md — extended tables, ablations, and the bench-size calibration curve.
Submission to arXiv is pending. The DOI will be added to CITATION.cff on acceptance. In the interim, the markdown is the canonical source; both files are immutable for v0.1.0 (verifiable via lock_state against the v0.1.0 tag).
| Capability | DVC | MLflow | W&B Artifacts | Ragas | TruLens | falsify-eval |
|---|---|---|---|---|---|---|
| Vendor-free | ✓ | ✓ (server) | ✗ | ✓ | partial | ✓ |
| Pure-text human-readable lock | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Couples artifact hash + verified score | ✗ | ✗ | partial | ✗ | partial | ✓ |
| Falsification gate (CI-enforced) | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Marginal-matched null | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Positive-control self-validation | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
If this library helps your work, please cite the methodology paper:
```bibtex
@article{sharma2026calibrated,
title = {Calibrated Falsification Harnesses for Retrieval Evaluation},
author = {Sharma, Sparsh},
year = {2026},
eprint = {<arxiv-id-when-published>},
archivePrefix = {arXiv},
primaryClass = {cs.IR}
}
```

The methodology paper offers a $2000 bounty for any of the following:
- A retrieval system that PASSes the four-null gate (τ=0.05, N_trials=50) and whose top-K output can be shown via separate evidence to not actually use the query.
- A counterexample to Proposition 1's Hoeffding + Bonferroni argument.
- A reproducible drift between the demo's published numbers and a third-party run on identical artifacts.
Submit issues tagged bug-bounty against this repository.
Issues and PRs welcome. The reference implementation is intentionally minimal; the goal is for the protocol to be small enough that adopters audit the entire library before depending on it.
v0.1.0 — initial public release. The interfaces are documented in the methodology paper and intended to be stable; minor implementation refinements expected through v0.x; v1.0 ships when the broken-predictor suite is extended to cover at least 10 distinct broken systems.
Independent work. No institutional affiliation. Apache 2.0.