A training-free benchmark for zero-shot content identity in single-source audio embeddings.
VocSim probes the intrinsic geometric quality of frozen audio embeddings: how well do general-purpose audio models map acoustically variable instances of the same content to nearby points, without any task-specific training? The benchmark aggregates 125K single-source clips from 19 corpora — human speech, animal vocalizations, environmental sounds — and evaluates representations with retrieval-based metrics (P@1, P@5) and a calibrated Global Separation Rate (GSR).
Basha, M., Zai, A. T., Stoll, S., & Hahnloser, R. H. R. (2026). VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio. ICML 2026. arXiv:2512.10120
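P@k here means: treat each clip as a query, retrieve its k nearest neighbors by embedding distance (excluding the clip itself), and score the fraction that share the query's content label, averaged over all queries. The following NumPy sketch illustrates that standard retrieval reading of P@k — it is not the benchmark's implementation (which additionally calibrates GSR with permutation tests):

```python
import numpy as np

def precision_at_k(embeddings: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """Mean fraction of each query's k nearest neighbors (cosine,
    self-match excluded) that share the query's label."""
    # L2-normalize so dot products equal cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)           # exclude self-retrieval
    topk = np.argsort(-sims, axis=1)[:, :k]   # k nearest neighbors per query
    hits = labels[topk] == labels[:, None]
    return float(hits.mean())

# Toy example: P@1 and P@5 on random 2-class embeddings
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 8)), np.repeat([0, 1], 10)
print(precision_at_k(X, y, k=1), precision_at_k(X, y, k=5))
```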
| Path | Purpose |
|---|---|
| `features/` | Frozen-encoder feature extractors (Whisper, WavLM, CLAP, BEATs, …) |
| `distances/` | Pairwise distance backends (cosine, Euclidean, Spearman) |
| `benchmarks/` | Evaluation suite (P@k, GSR with permutation calibration, clustering) |
| `trainers/` | AE/VAE baselines |
| `vocsim/` | Pipeline runner |
| `configs/` | Default configs for datasets, features, distances, benchmarks |
| `reproducibility/` | Configs + scripts to reproduce paper results end-to-end |
```bash
git clone https://github.com/vocsim/benchmark.git
cd benchmark

# Conda
conda env create -f environment.yml
conda activate vocsim_env

# or pip
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install .
```

A CUDA-capable GPU is strongly recommended.
The pipeline runner orchestrates dataset loading, optional training, feature extraction, distance computation, and benchmarking. To reproduce the main paper table:
```bash
python reproducibility/scripts/vocsim.py --steps all
```

Available stages: `train`, `features`, `distances`, `benchmarks`, `all`. Cached outputs are reused when you skip a step.
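Stages can also be run one at a time (assuming the runner accepts a single stage name, as the list above suggests), picking up cached artifacts from earlier runs:

```bash
# Reuse cached features from a previous run; recompute the rest
python reproducibility/scripts/vocsim.py --steps distances
python reproducibility/scripts/vocsim.py --steps benchmarks
```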
Edit `configs/base.yaml` to set `project_root`, `data_dir`, `results_dir`, `features_dir`, and `models_dir` before the first run.
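A minimal `configs/base.yaml` might look like the following — the five keys are the ones named above, but the paths are illustrative placeholders, not shipped defaults:

```yaml
# configs/base.yaml — placeholder paths, adjust to your machine
project_root: /home/user/benchmark
data_dir: /data/vocsim/datasets
results_dir: /data/vocsim/results
features_dir: /data/vocsim/features
models_dir: /data/vocsim/models
```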
All datasets used in the paper are hosted on the Hugging Face Hub:
| Dataset | Use | Link |
|---|---|---|
| VocSim (public) | Core 15-subset benchmark (114,641 clips) | vocsim/public |
| Avian Perception | Alignment with zebra-finch perceptual judgments | vocsim/avian-perception-benchmark |
| Mouse Strain | C57 vs DBA USV classification | vocsim/mouse-strain-classification-benchmark |
| Mouse Identity | Individual mouse identification from USVs | vocsim/mouse-identity-classification-benchmark |
Blind out-of-distribution sets (low-resource speech in Shipibo-Conibo and Chintang) are held out for server-side evaluation via the leaderboard.
The official leaderboard lives at huggingface.co/spaces/vocsim/VocSim. Submit through the 🚀 Submit your Embedding! tab; results on the blind sets are computed server-side and published with P@1, P@5, and GSR reported separately.
The pipeline is modular — adding a new component is local to a single file.
**Feature extractor.** Subclass `features.base.FeatureExtractor`, implement `_initialize` and `extract`, then register it in `configs/features.yaml` (see the sketch after this list):
```yaml
- name: MyExtractor
  module: features.my_extractor
  class: MyExtractorClass
  params:
    model_path: "path/to/model.pt"
```

**Distance metric.** Subclass `distances.base.DistanceCalculator`, implement `_initialize` and `compute_pairwise`, then register it in `configs/distances.yaml`.
**Benchmark.** Subclass `benchmarks.base.Benchmark`, implement `_initialize` and `evaluate`, then register it in `configs/benchmarks.yaml`.
**Dataset.** Prepare a Hugging Face-format dataset with `audio` and `label` columns and add an entry to `configs/datasets.yaml`.
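As referenced above, here is a minimal sketch of a custom feature extractor. Only the base-class path and the `_initialize`/`extract` hook names come from this README; the signatures, argument names, and tensor conventions below are assumptions — check `features/base.py` for the real interface:

```python
import torch
from features.base import FeatureExtractor  # base class named in this README

class MyExtractorClass(FeatureExtractor):
    """Toy extractor: log-magnitude STFT bands from a mono waveform.
    Hook signatures here are illustrative, not the repo's actual API."""

    def _initialize(self, model_path=None, n_bands: int = 64):
        # Values arrive from the `params:` block in configs/features.yaml;
        # model_path would be used here to load pretrained weights.
        self.n_bands = n_bands

    def extract(self, audio: torch.Tensor, sample_rate: int) -> torch.Tensor:
        # audio: (num_samples,) mono waveform -> features: (frames, n_bands)
        spec = torch.stft(
            audio, n_fft=1024, hop_length=512,
            window=torch.hann_window(1024), return_complex=True,
        ).abs()
        return torch.log1p(spec[: self.n_bands].T)
```

Once registered in `configs/features.yaml`, the pipeline runner can locate the class by its `module` and `class` entries.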
`reproducibility/` contains the configs and scripts used for every table and figure in the paper:

- `configs/` — `vocsim_paper.yaml`, `avian_paper.yaml`, `mouse_strain_paper.yaml`, `mouse_identity_paper.yaml`
- `scripts/` — `vocsim.py`, `avian_perception.py`, `mouse_strain.py`, `mouse_identity.py`
- `scripts/misc/` — table generation, plots, dataset statistics
Update the paths in the config files, then run the corresponding script with `--steps all`.
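For example, to regenerate the mouse-strain results after fixing the paths:

```bash
python reproducibility/scripts/mouse_strain.py --steps all
```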
```bibtex
@inproceedings{basha2026vocsim,
  title     = {VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio},
  author    = {Basha, Maris and Zai, Anja T. and Stoll, Sabine and Hahnloser, Richard H. R.},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
  doi       = {10.48550/arXiv.2512.10120}
}
```

Code: MIT (see `LICENSE`). Datasets and pretrained models retain their original licenses; see Appendix A.1.1 of the paper for a per-source breakdown.
We thank the maintainers of every dataset and pretrained model VocSim builds on. Issues and pull requests are welcome at github.com/vocsim/benchmark/issues.