VocSim

A training-free benchmark for zero-shot content identity in single-source audio embeddings.

License: MIT · Dataset · Leaderboard · Paper

VocSim probes the intrinsic geometric quality of frozen audio embeddings: how well do general-purpose audio models map acoustically variable instances of the same content to nearby points, without any task-specific training? The benchmark aggregates 125K single-source clips from 19 corpora — human speech, animal vocalizations, environmental sounds — and evaluates representations with retrieval-based metrics (P@1, P@5) and a calibrated Global Separation Rate (GSR).

Basha, M., Zai, A. T., Stoll, S., & Hahnloser, R. H. R. (2026). VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio. ICML 2026. arXiv:2512.10120

What's in this repo

Path              Purpose
features/         Frozen-encoder feature extractors (Whisper, WavLM, CLAP, BEATs, …)
distances/        Pairwise distance backends (cosine, Euclidean, Spearman)
benchmarks/       Evaluation suite (P@k, GSR with permutation calibration, clustering)
trainers/         AE/VAE baselines
vocsim/           Pipeline runner
configs/          Default configs for datasets, features, distances, benchmarks
reproducibility/  Configs + scripts to reproduce paper results end-to-end

Installation

git clone https://github.com/vocsim/benchmark.git
cd benchmark

# Conda
conda env create -f environment.yml
conda activate vocsim_env

# or pip
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install .

A CUDA-capable GPU is strongly recommended.

Quick start

The pipeline runner orchestrates dataset loading, optional training, feature extraction, distance computation, and benchmarking. To reproduce the main paper table:

python reproducibility/scripts/vocsim.py --steps all

Available stages: train, features, distances, benchmarks, all. If you skip a stage, its cached outputs from a previous run are reused.
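
For example, to re-run only the evaluation stage against cached features and distances:

python reproducibility/scripts/vocsim.py --steps benchmarks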

Edit configs/base.yaml to set project_root, data_dir, results_dir, features_dir, and models_dir before the first run.
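
A minimal sketch, assuming a flat key layout (all values are placeholders for your local paths; check the shipped configs/base.yaml for the exact structure):

# configs/base.yaml -- illustrative values only
project_root: /path/to/benchmark
data_dir: /path/to/data
results_dir: /path/to/results
features_dir: /path/to/features
models_dir: /path/to/models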

Datasets

All datasets used in the paper are hosted on the Hugging Face Hub:

Dataset           Use                                              Link
VocSim (public)   Core 15-subset benchmark (114,641 clips)         vocsim/public
Avian Perception  Alignment with zebra-finch perceptual judgments  vocsim/avian-perception-benchmark
Mouse Strain      C57 vs DBA USV classification                    vocsim/mouse-strain-classification-benchmark
Mouse Identity    Individual mouse identification from USVs        vocsim/mouse-identity-classification-benchmark

Blind out-of-distribution sets (low-resource speech in Shipibo-Conibo and Chintang) are held out for server-side evaluation via the leaderboard.

Submitting to the leaderboard

The official leaderboard lives at huggingface.co/spaces/vocsim/VocSim. Submit through the 🚀 Submit your Embedding! tab; results on the blind sets are computed server-side and published with P@1, P@5, and GSR reported separately.

Extending VocSim

The pipeline is modular — adding a new component is local to a single file.

Feature extractor. Subclass features.base.FeatureExtractor, implement _initialize and extract, then register it in configs/features.yaml:

- name: MyExtractor
  module: features.my_extractor
  class: MyExtractorClass
  params:
    model_path: "path/to/model.pt"
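
A minimal extractor sketch to pair with that entry, assuming _initialize receives the params dict and extract receives a waveform tensor with its sample rate; check features/base.py for the exact signatures:

# features/my_extractor.py  (hypothetical module matching the YAML entry above)
import torch
from features.base import FeatureExtractor

class MyExtractorClass(FeatureExtractor):
    def _initialize(self, model_path: str, **kwargs):
        # Load the frozen encoder once; the benchmark never fine-tunes it.
        self.model = torch.jit.load(model_path).eval()

    @torch.no_grad()
    def extract(self, audio: torch.Tensor, sample_rate: int) -> torch.Tensor:
        # Return one embedding per clip (or a frame sequence for downstream pooling).
        return self.model(audio.unsqueeze(0)).squeeze(0)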

Distance metric. Subclass distances.base.DistanceCalculator, implement _initialize and compute_pairwise, register in configs/distances.yaml.
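
A matching sketch, assuming compute_pairwise takes a stacked (N, D) feature matrix and returns an (N, N) distance matrix; see distances/base.py for the actual contract:

# distances/my_distance.py  (hypothetical)
import torch
from distances.base import DistanceCalculator

class MyDistanceClass(DistanceCalculator):
    def _initialize(self, **kwargs):
        pass  # stateless backend

    def compute_pairwise(self, features: torch.Tensor) -> torch.Tensor:
        # Euclidean distances between all pairs of embeddings.
        return torch.cdist(features, features, p=2)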

Benchmark. Subclass benchmarks.base.Benchmark, implement _initialize and evaluate, register in configs/benchmarks.yaml.
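
And a benchmark sketch, assuming evaluate receives the precomputed distance matrix and the per-clip labels; the real base class in benchmarks/base.py may pass additional context:

# benchmarks/my_benchmark.py  (hypothetical: precision@1 from a distance matrix)
import torch
from benchmarks.base import Benchmark

class MyBenchmarkClass(Benchmark):
    def _initialize(self, **kwargs):
        pass

    def evaluate(self, distances: torch.Tensor, labels: list) -> dict:
        d = distances.clone()
        d.fill_diagonal_(float("inf"))      # a clip must not retrieve itself
        nearest = d.argmin(dim=1).tolist()  # nearest neighbour per clip
        hits = sum(labels[i] == labels[j] for i, j in enumerate(nearest))
        return {"P@1": hits / len(labels)}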

Dataset. Prepare a Hugging Face-format dataset with audio and label columns and add an entry to configs/datasets.yaml.
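
The audio and label columns can be produced with the Hugging Face datasets library; a sketch with placeholder file paths and repository name:

# build_my_dataset.py  (hypothetical helper)
from datasets import Dataset, Audio

ds = Dataset.from_dict({
    "audio": ["clips/call_001.wav", "clips/call_002.wav"],
    "label": ["individual_A", "individual_B"],
}).cast_column("audio", Audio(sampling_rate=16_000))

ds.push_to_hub("my-org/my-vocsim-dataset")  # then reference this repo in configs/datasets.yaml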

Reproducing the paper

reproducibility/ contains the configs and scripts used for every table and figure in the paper:

  • configs/vocsim_paper.yaml, avian_paper.yaml, mouse_strain_paper.yaml, mouse_identity_paper.yaml
  • scripts/vocsim.py, avian_perception.py, mouse_strain.py, mouse_identity.py
  • scripts/misc/ — table generation, plots, dataset statistics

Update the paths in the config files, then run the corresponding script with --steps all.
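
For example, to regenerate the mouse-strain results:

python reproducibility/scripts/mouse_strain.py --steps all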

Citation

@inproceedings{basha2026vocsim,
  title     = {VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio},
  author    = {Basha, Maris and Zai, Anja T. and Stoll, Sabine and Hahnloser, Richard H. R.},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
  doi       = {10.48550/arXiv.2512.10120}
}

License

Code: MIT (see LICENSE). Datasets and pretrained models retain their original licenses; see Appendix A.1.1 of the paper for a per-source breakdown.

Acknowledgements

We thank the maintainers of every dataset and pretrained model VocSim builds on. Issues and pull requests are welcome at github.com/vocsim/benchmark/issues.
