A training-free benchmark for zero-shot content identity in single-source audio embeddings.
VocSim probes the intrinsic geometric quality of frozen audio embeddings: how well do general-purpose audio models map acoustically variable instances of the same content to nearby points, without any task-specific training? The benchmark aggregates 125K single-source clips from 19 corpora — human speech, animal vocalizations, environmental sounds — and evaluates representations with retrieval-based metrics (P@1, P@5) and a calibrated Global Separation Rate (GSR).
Basha, M., Zai, A. T., Stoll, S., & Hahnloser, R. H. R. (2026). VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio. ICML 2026. arXiv:2512.10120
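P@k here means: treat each clip as a query, retrieve its k nearest neighbors by embedding distance (excluding the clip itself), and score the fraction that share the query's content label, averaged over all queries. The following NumPy sketch illustrates that standard retrieval reading of P@k — it is not the benchmark's implementation (which additionally calibrates GSR with permutation tests):

```python
import numpy as np

def precision_at_k(embeddings: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """Mean fraction of each query's k nearest neighbors (cosine,
    self-match excluded) that share the query's label."""
    # L2-normalize so dot products equal cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)           # exclude self-retrieval
    topk = np.argsort(-sims, axis=1)[:, :k]   # k nearest neighbors per query
    hits = labels[topk] == labels[:, None]
    return float(hits.mean())

# Toy example: P@1 and P@5 on random 2-class embeddings
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 8)), np.repeat([0, 1], 10)
print(precision_at_k(X, y, k=1), precision_at_k(X, y, k=5))
```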
| Path | Purpose |
|---|---|
| `features/` | Frozen-encoder feature extractors (Whisper, WavLM, CLAP, BEATs, …) |
| `distances/` | Pairwise distance backends (cosine, Euclidean, Spearman) |
| `benchmarks/` | Evaluation suite (P@k, GSR with permutation calibration, clustering) |
| `trainers/` | AE/VAE baselines |
| `vocsim/` | Pipeline runner |
| `configs/` | Default configs for datasets, features, distances, benchmarks |
| `reproducibility/` | Configs + scripts to reproduce paper results end-to-end |
```bash
git clone https://github.com/vocsim/benchmark.git
cd benchmark

# Conda
conda env create -f environment.yml
conda activate vocsim_env

# or pip
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install .
```

A CUDA-capable GPU is strongly recommended.
The pipeline runner orchestrates dataset loading, optional training, feature extraction, distance computation, and benchmarking. To reproduce the main paper table:
```bash
python reproducibility/scripts/vocsim.py --steps all
```

Available stages: `train`, `features`, `distances`, `benchmarks`, `all`. Cached outputs are reused when you skip a step.
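Stages can also be run one at a time (assuming the runner accepts a single stage name, as the list above suggests), picking up cached artifacts from earlier runs:

```bash
# Reuse cached features from a previous run; recompute the rest
python reproducibility/scripts/vocsim.py --steps distances
python reproducibility/scripts/vocsim.py --steps benchmarks
```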
Edit `configs/base.yaml` to set `project_root`, `data_dir`, `results_dir`, `features_dir`, and `models_dir` before the first run.
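A minimal `configs/base.yaml` might look like the following — the five keys are the ones named above, but the paths are illustrative placeholders, not shipped defaults:

```yaml
# configs/base.yaml — placeholder paths, adjust to your machine
project_root: /home/user/benchmark
data_dir: /data/vocsim/datasets
results_dir: /data/vocsim/results
features_dir: /data/vocsim/features
models_dir: /data/vocsim/models
```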
All datasets used in the paper are hosted on the Hugging Face Hub:
| Dataset | Use | Link |
|---|---|---|
| VocSim (public) | Core 15-subset benchmark (114,641 clips) | vocsim/public |
| Avian Perception | Alignment with zebra-finch perceptual judgments | vocsim/avian-perception-benchmark |
| Mouse Strain | C57 vs DBA USV classification | vocsim/mouse-strain-classification-benchmark |
| Mouse Identity | Individual mouse identification from USVs | vocsim/mouse-identity-classification-benchmark |
Blind out-of-distribution sets (low-resource speech in Shipibo-Conibo and Chintang) are held out for server-side evaluation via the leaderboard.
The official leaderboard lives at huggingface.co/spaces/vocsim/VocSim. Submit through the 🚀 Submit your Embedding! tab; results on the blind sets are computed server-side and published with P@1, P@5, and GSR reported separately.
The pipeline is modular — adding a new component is local to a single file.
**Feature extractor.** Subclass `features.base.FeatureExtractor`, implement `_initialize` and `extract`, then register it in `configs/features.yaml` (see the sketch after this list):
```yaml
- name: MyExtractor
  module: features.my_extractor
  class: MyExtractorClass
  params:
    model_path: "path/to/model.pt"
```

**Distance metric.** Subclass `distances.base.DistanceCalculator`, implement `_initialize` and `compute_pairwise`, then register it in `configs/distances.yaml`.
**Benchmark.** Subclass `benchmarks.base.Benchmark`, implement `_initialize` and `evaluate`, then register it in `configs/benchmarks.yaml`.
**Dataset.** Prepare a Hugging Face-format dataset with `audio` and `label` columns and add an entry to `configs/datasets.yaml`.
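As referenced above, here is a minimal sketch of a custom feature extractor. Only the base-class path and the `_initialize`/`extract` hook names come from this README; the signatures, argument names, and tensor conventions below are assumptions — check `features/base.py` for the real interface:

```python
import torch
from features.base import FeatureExtractor  # base class named in this README

class MyExtractorClass(FeatureExtractor):
    """Toy extractor: log-magnitude STFT bands from a mono waveform.
    Hook signatures here are illustrative, not the repo's actual API."""

    def _initialize(self, model_path=None, n_bands: int = 64):
        # Values arrive from the `params:` block in configs/features.yaml;
        # model_path would be used here to load pretrained weights.
        self.n_bands = n_bands

    def extract(self, audio: torch.Tensor, sample_rate: int) -> torch.Tensor:
        # audio: (num_samples,) mono waveform -> features: (frames, n_bands)
        spec = torch.stft(
            audio, n_fft=1024, hop_length=512,
            window=torch.hann_window(1024), return_complex=True,
        ).abs()
        return torch.log1p(spec[: self.n_bands].T)
```

Once registered in `configs/features.yaml`, the pipeline runner can locate the class by its `module` and `class` entries.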
`reproducibility/` contains the configs and scripts used for every table and figure in the paper:

- `configs/` — `vocsim_paper.yaml`, `avian_paper.yaml`, `mouse_strain_paper.yaml`, `mouse_identity_paper.yaml`
- `scripts/` — `vocsim.py`, `avian_perception.py`, `mouse_strain.py`, `mouse_identity.py`
- `scripts/misc/` — table generation, plots, dataset statistics
Update the paths in the config files, then run the corresponding script with `--steps all`.
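For example, to regenerate the mouse-strain results after fixing the paths:

```bash
python reproducibility/scripts/mouse_strain.py --steps all
```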
```bibtex
@inproceedings{basha2026vocsim,
  title     = {VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio},
  author    = {Basha, Maris and Zai, Anja T. and Stoll, Sabine and Hahnloser, Richard H. R.},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
  doi       = {10.48550/arXiv.2512.10120}
}
```

Code: MIT (see `LICENSE`). Datasets and pretrained models retain their original licenses; see Appendix A.1.1 of the paper for a per-source breakdown.
We thank the maintainers of every dataset and pretrained model VocSim builds on. Issues and pull requests are welcome at github.com/vocsim/benchmark/issues.