SAGEBench

Simulated Bruker .d fixtures and a small benchmark harness for the Sage database search engine.

The packaged datasets (CI smoke, HeLa 150K G30M, HLA 10K G40, HLA 100K G60, plus a text-only results bundle) are archived on Zenodo at https://zenodo.org/records/20089515 - DOI 10.5281/zenodo.20089515.

A note on simulated data. Everything here is generated by TimSim. There is no guarantee that the simulated .d files are close enough to real Bruker DDA data to draw firm conclusions about FDR calibration of any particular search engine. Treat the numbers in this repo as useful for catching regressions and for relative comparisons under controlled, ground-truth-backed conditions - not as a substitute for entrapment or real-data benchmarks.

The aim is two-fold:

1. Provide CI-grade Bruker test data for Sage so regressions in the timsTOF code path (e.g. the timsrust URL/canonicalize bug from lazear/sage#228) get caught before release.
2. Provide a larger, ground-truth-backed evaluation set so anyone can compute true FDR / TPR for Sage on simulated DDA data, in the same spirit as timsim-bench does for DIA.

All data is generated with TimSim. Because TimSim records the exact peptides, charges, RTs and intensities it injected into the synthetic run, ground truth is exact rather than inherited from another search engine.

Datasets

Slot	Source	Peptides	Gradient	Reps	Approx. size
HeLa CI smoke	generated here (`scripts/generate_hela_ci_smoke.sh`)	1 500	5 min	2	~235 MB / rep
HeLa 150K eval	reused from `SUBMISSION/MBR/data/150K_G30M`	150 000	30 min	3	~6 GB / rep
HLA 10K eval	reused from `SUBMISSION/zenodo/.../thunder-10K`	10 000	40 min	3	~825 MB / rep
HLA 100K eval	reused from `SUBMISSION/zenodo/.../thunder-100K`	100 000	60 min	3	~3 GB / rep

Symlinks in data/refs/ point at the existing copies on this machine so the harness can run against them without duplicating multi-GB trees. The CI-smoke .d lives directly under data/ci-smoke/.

External users: download the bundles directly from the Zenodo record above (sagebench-ci-smoke.tar.gz, sagebench-hela-150k-g30m.tar.gz, sagebench-hla-10k-g40.tar.gz, sagebench-hla-100k-g3600.tar.gz). Each archive expands to a self-contained directory with the .d, its synthetic_data.db, and the TimSim config that produced it.
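
After unpacking a bundle, a quick sanity check can confirm that the three pieces described above are present. The sketch below is not part of the harness: `check_bundle` is a hypothetical helper, and it assumes the TimSim config ships as a `*.toml` file (the actual file name inside each archive is not stated here).

```python
from pathlib import Path


def check_bundle(root: Path) -> dict[str, list[Path]]:
    """Locate the .d directories, truth databases and TOML configs under root.

    Raises FileNotFoundError if any of the three kinds is absent.
    """
    found = {
        # Bruker .d "files" are directories on disk.
        "d_dirs": sorted(p for p in root.rglob("*.d") if p.is_dir()),
        "truth_dbs": sorted(root.rglob("synthetic_data.db")),
        # Assumption: the bundled TimSim config uses a .toml extension.
        "configs": sorted(root.rglob("*.toml")),
    }
    missing = [kind for kind, paths in found.items() if not paths]
    if missing:
        raise FileNotFoundError(f"{root} is missing: {', '.join(missing)}")
    return found
```

Calling `check_bundle(Path("path/to/extracted/bundle"))` returns the matching paths, which can then be passed on to the sage runner and the eval step.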

Layout

SAGEBench/
├── configs/             TimSim configs (smoke is novel; the rest document existing runs)
├── seeds/               Findings-mode seed CSV for the smoke fixture
├── scripts/             Regen, sage runner, Zenodo bundling
├── sagebench/           Python harness: ground-truth join + FDR / TPR
├── data/refs/           Symlinks into the existing zenodo / MBR data
├── data/ci-smoke/       Regenerated tiny .d (gitignored)
├── notebooks/           Demo notebook for sage-on-simulated
├── tests/               Harness tests
└── github/              Draft comment + PR materials for lazear/sage

First-run results

Measured end-to-end on this host with sage 0.15.0-beta.2 plus a local one-line fix to read_tdf (see github/ISSUE_228_COMMENT.md). All numbers are at 1 % peptide-level FDR (peptide_q ≤ 0.01):

Dataset	Precursors	True FDR	TPR
HeLa CI smoke	871	0.23 %	60.0 %
HeLa 150K G07M (rep001)	1 005	4.11 %	0.31 %
HeLa 150K G15M (rep001)	1 913	3.14 %	0.61 %
HeLa 150K G30M (rep001)	3 572	3.81 %	1.09 %
HeLa 150K G30M (rep002)	3 636	3.00 %	1.12 %
HeLa 150K G30M (rep003)	3 732	3.64 %	1.14 %
HeLa 300K G30M (rep001)	3 773	3.58 %	1.20 %
HLA 10K G40	2 233	2.42 %	13.3 %

The HeLa TPR is low because TimSim's truth set is the full 150 000 peptides injected, but DDA topN fragments only a fraction of them on a 30-minute gradient (~113 000 PASEF selections vs 315 000 simulated precursors, i.e. at most ~36 %). A "TPR over the fragmentable subset" lands higher: about 2× on G30M and closer to an order of magnitude on the shortest gradient, as quantified in the three-level evaluation below. The table above reports TPR vs full truth so the metric is unambiguous.

Three-level evaluation (HeLa 150K G30M rep001, q ≤ 0.01)

Level	Reported	True FDR	TPR (full)	TPR (frag)
ion	3 572	3.81 %	1.09 %	2.03 %
peptide	3 230	3.59 %	2.00 %	3.03 %
protein	2 061	0.05 %	11.4 %	12.1 %

TPR (full) is over the entire TimSim truth (every peptide injected). TPR (frag) is over the fragmentable subset - entries whose (charge, m/z) match an MS2 selection event in pasef_meta. For DDA we can only identify what got fragmented, so the fragmentable denominator is the honest one. The lift is ~2× on G30M (54 % of ions are fragmentable); on shorter gradients (e.g. G07M, where only ~14 % of ions are fragmentable), the lift is closer to an order of magnitude.
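
The relationship between the two denominators can be checked with a few lines of arithmetic. This is a worked sketch using only figures quoted in this README; it assumes that the ion-level truth size is the 315 000 simulated precursors and that every true positive belongs to the fragmentable subset.

```python
# Figures quoted in this README for HeLa 150K G30M rep001, ion level.
reported = 3572        # ions reported at q <= 0.01
true_fdr = 0.0381      # fraction of reported ions absent from the truth
truth_size = 315_000   # simulated precursors (assumed ion-level denominator)
frag_fraction = 0.54   # share of truth ions with a matching MS2 selection

# Reported ions that are present in the ground truth.
true_positives = reported * (1 - true_fdr)

tpr_full = true_positives / truth_size
tpr_frag = true_positives / (truth_size * frag_fraction)

print(f"TPR (full): {tpr_full:.2%}")
print(f"TPR (frag): {tpr_frag:.2%}")
# The lift is simply the inverse of the fragmentable share.
print(f"lift at 54 % fragmentable: {1 / frag_fraction:.1f}x")
print(f"lift at 14 % fragmentable: {1 / 0.14:.1f}x")
```

Under these assumptions the script gives about 1.09 % and 2.02 %, in line with the ion row of the table (the small gap to 2.03 % comes from the rounded 54 % share). The lift works out to roughly 1.9× at 54 % and 7.1× at 14 %.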

Protein-level true FDR (0.05 %) is far lower than at the peptide or ion level (3.59 % and 3.81 %), because aggregation absorbs single misidentifications. The drift sage exhibits is at the peptide/precursor level.

TDC analysis: where does the q-value drift come from?

Re-scoring HeLa 150K G30M via sagepy.qfdr.tdc points to the LDA discriminant as the source of sage's calibration drift. At q ≤ 0.01, peptide-level:

Estimator	Reported	True FDR
sage peptide_q (native)	3 230	3.59 %
hyperscore + PSM-level TDC	2 892	1.80 %
hyperscore + peptide double comp.	2 862	1.57 %
LDA disc. + PSM-level TDC	3 563	3.93 %
LDA disc. + peptide double comp.	3 539	3.67 %

Raw hyperscore-driven TDC stays much closer to the 1 % target (1.6-1.8 % true FDR), while LDA-driven TDC drifts to ~3.7-3.9 %. PSM-level vs double competition has only a minor effect within either score type. See runs/REPORT.html for the calibration plots and Venns at all three levels.
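
For readers unfamiliar with target-decoy competition, the sketch below shows a generic textbook q-value estimator. It is not the sagepy.qfdr.tdc implementation (its signature and exact estimator are not reproduced here), and it covers only the plain PSM-level case, not peptide double competition. The function name and input layout are illustrative.

```python
def tdc_q_values(psms):
    """Estimate q-values by target-decoy competition.

    psms: list of (score, is_decoy) tuples; a higher score is better.
    Returns a list of q-values aligned with the input order.
    """
    # Rank PSMs from best to worst score.
    order = sorted(range(len(psms)), key=lambda i: psms[i][0], reverse=True)

    # Running FDR estimate at each rank: decoys / targets seen so far.
    fdr_at_rank = []
    targets = decoys = 0
    for i in order:
        if psms[i][1]:
            decoys += 1
        else:
            targets += 1
        fdr_at_rank.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this rank or any lower-scoring rank.
    q_values = [0.0] * len(psms)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdr_at_rank[rank])
        q_values[order[rank]] = min(running_min, 1.0)
    return q_values


# Toy input: three targets and two decoys.
example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, True)]
example_q = tdc_q_values(example)
```

Swapping the score column (hyperscore vs LDA discriminant) while keeping the estimator fixed is what isolates the score as the source of drift in the table above.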

Build the HTML report

python -m sagebench report \
    --config sagebench-report.toml \
    --out runs/REPORT.html

The resulting runs/REPORT.html is self-contained (inline SVG plots), so it can be embedded or shared without any other assets.

See runs/RESULTS.md for FP categorisation and discussion.

Quick start

# 1. Install the harness in a TimSim-capable venv (Python 3.11+).
poetry install

# 2. Generate the CI smoke fixture (~minutes; 2 replicates).
#    Needs a TimSim build with `from_findings` support (rustims
#    feature/simulate-from-findings, commit 03398133+). The MHCbench
#    venv on this host already has it:
TIMSIM=/scratch/TMAlign/MHCbench/.venv/bin/timsim \
    bash scripts/generate_hela_ci_smoke.sh

# 3. Run sage against both replicates (multi-file pass).
SAGE=/path/to/sage bash scripts/run_sage.sh \
    runs/ci-smoke/ \
    data/ci-smoke/SAGEBENCH-CI-HELA-SMOKE-001/SAGEBENCH-CI-HELA-SMOKE-001.d \
    data/ci-smoke/SAGEBENCH-CI-HELA-SMOKE-002/SAGEBENCH-CI-HELA-SMOKE-002.d

# 4. Score against ground truth.
python -m sagebench eval \
    --sage-out runs/ci-smoke/results.sage.parquet \
    --truth   data/ci-smoke/SAGEBENCH-CI-HELA-SMOKE-001/synthetic_data.db \
              data/ci-smoke/SAGEBENCH-CI-HELA-SMOKE-002/synthetic_data.db

Conventions

- Configs use the post-refactor TimSim TOML schema ([paths], [experiment], …) with from_findings = true for the smoke fixture and full-FASTA digestion for the larger eval sets.
- Reference blanks come from SUBMISSION/zenodo/blanks/ (HLA Thunder) or /scratch/raw/dda/blanks/ (HeLa K240723).
- The harness consumes Sage's parquet output and joins on (sequence_unimod, charge) against TimSim's synthetic_data.db to compute true FDR per the definition in si1_mm.tex of the paper submission.
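
The ground-truth join can be pictured as set arithmetic over (sequence_unimod, charge) keys. The sketch below is illustrative rather than the harness code: `evaluate` is a hypothetical name, the definition from si1_mm.tex is not reproduced here, and true FDR is assumed to be the share of reported keys with no match in the truth.

```python
def evaluate(reported, truth):
    """Score reported identifications against ground truth.

    reported: iterable of (sequence_unimod, charge) keys passing the q-value cut.
    truth:    iterable of the same keys taken from the simulator's ground truth.
    """
    reported_keys = set(reported)
    truth_keys = set(truth)

    true_pos = reported_keys & truth_keys    # reported and actually injected
    false_pos = reported_keys - truth_keys   # reported but never injected

    n_reported = len(reported_keys)
    return {
        "reported": n_reported,
        # Assumed definition: false positives over everything reported.
        "true_fdr": len(false_pos) / n_reported if n_reported else 0.0,
        # TPR over the full truth set.
        "tpr": len(true_pos) / len(truth_keys) if truth_keys else 0.0,
    }


# Toy data: four truth entries, three reported, one of them wrong.
toy_truth = [("PEPTIDEK", 2), ("PEPTIDEK", 3), ("ELVISLIVESK", 2), ("SAMPLER", 2)]
toy_reported = [("PEPTIDEK", 2), ("ELVISLIVESK", 2), ("WRONGPEPK", 2)]
toy_result = evaluate(toy_reported, toy_truth)
```

Restricting `truth` to the fragmentable subset before the call yields the TPR (frag) variant without changing the function.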

Why a separate project?

timsim-bench is DIA-leaning and tied to the paper's metric pipeline. SAGEBench is DDA-only, sage-only, and meant to live independently so external users (incl. the Sage maintainer) can pick it up without pulling in the rest of the submission package.

Citation

If you use these datasets, please cite the Zenodo deposit:

SAGEBench datasets: simulated TIMS-TOF DDA fixtures for the Sage search engine. Zenodo. https://doi.org/10.5281/zenodo.20089515

Please also cite the TimSim preprint, which describes the simulator that produced the underlying .d files:

Teschner D, Xiao Z, Maier T, et al. Complete Simulation of timsTOF PASEF Raw Datasets with Timsim Enables Precise Evaluation of False Discovery and Phosphosite Localization Error Rates. Research Square (preprint). https://www.researchsquare.com/article/rs-9032301/v1

The TimSim source itself lives at https://github.com/theGreatHerrLebert/rustims.
