vectominist/usad

USAD: Universal Speech and Audio Representation via Distillation

USAD arXiv 🤗 USAD on HuggingFace

USAD2 arXiv 🤗 USAD2 on HuggingFace

USAD

We introduce Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that combines speech, sound, and music into one model. USAD uses layer-to-layer distillation from domain-specific models to train a student model, achieving competitive performance across multiple benchmarks and tasks with a single encoder.
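
To make the idea of layer-to-layer distillation concrete, here is a minimal PyTorch sketch. It is not the repository's training criterion: the class name `LayerToLayerDistillLoss`, the per-layer linear heads, and the L1 plus cosine objective are all assumptions chosen for illustration, and the sketch uses a single teacher where USAD distills from several domain-specific ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerToLayerDistillLoss(nn.Module):
    """Illustrative layer-to-layer distillation loss (hypothetical, not the repo's criterion).

    Each selected student layer is projected to the teacher's width and
    pulled toward its paired teacher layer with an L1 + cosine distance.
    """

    def __init__(self, student_dim: int, teacher_dim: int, num_pairs: int):
        super().__init__()
        # One linear head per (student layer, teacher layer) pair.
        self.heads = nn.ModuleList(
            [nn.Linear(student_dim, teacher_dim) for _ in range(num_pairs)]
        )

    def forward(self, student_states, teacher_states):
        # Both arguments are lists of (batch, frames, dim) tensors, already paired up.
        assert len(student_states) == len(teacher_states) == len(self.heads)
        total = student_states[0].new_zeros(())
        for head, s, t in zip(self.heads, student_states, teacher_states):
            pred = head(s)
            t = t.detach()  # the teacher is frozen; no gradient flows into it
            l1 = F.l1_loss(pred, t)
            cos = 1.0 - F.cosine_similarity(pred, t, dim=-1).mean()
            total = total + l1 + cos
        return total / len(self.heads)


# Toy shapes, made up for the example.
torch.manual_seed(0)
loss_fn = LayerToLayerDistillLoss(student_dim=8, teacher_dim=16, num_pairs=3)
student = [torch.randn(2, 5, 8, requires_grad=True) for _ in range(3)]
teacher = [torch.randn(2, 5, 16) for _ in range(3)]
loss = loss_fn(student, teacher)
loss.backward()
```

With more than one teacher, one plausible extension is to keep a separate set of heads per teacher and sum the resulting losses; the actual formulation is the one described in the paper.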

[Figure: Proposed USAD framework]

[Figure: Speech and audio evaluation]

[Figure: HEAR benchmark]

Please cite our paper if you find this repository and/or the paper useful.

@inproceedings{chang2025usad,
  title={{USAD}: Universal Speech and Audio Representation via Distillation},
  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
  booktitle={ASRU},
  year={2025}
}

USAD 2.0

USAD 2.0 is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing (HEAR and MARBLE) and LLM-based evaluations (XARES-LLM).

[Figure: Proposed USAD 2.0 framework]

[Figure: USAD 2.0 evaluation results on HEAR, MARBLE, and XARES-LLM]

Please cite our paper if you find this repository and/or the paper useful.

@inproceedings{chang2026usad2,
  title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
  author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
  booktitle={Interspeech},
  year={2026}
}

Instructions

Inference (Minimal Installation)

Follow the instructions in usad_inference to download the model weights and extract features for downstream tasks. Only PyTorch and TorchAudio are required.

Inference (🤗 HuggingFace)

Load pre-trained USAD or USAD 2.0 models from HuggingFace.

import torch
from transformers import AutoModel

# Pick a checkpoint: `MIT-SLS/USAD-Small` / `MIT-SLS/USAD-Base` / `MIT-SLS/USAD-Large`
model_id = "MIT-SLS/USAD-Small"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).cuda().eval()
wav = model.load_audio("path/to/audio").unsqueeze(0)  # add a batch dimension
with torch.no_grad():  # inference only, so skip gradient tracking
    results = model(wav)  # keys: "x", "mel", "hidden_states", "ffn"
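
For downstream probing, a common next step is to turn frame-level features into one vector per clip. The sketch below mean-pools a single layer over time. The helper `pool_hidden_states` is hypothetical, and it assumes each entry of `results["hidden_states"]` has shape `(batch, frames, dim)`, which this README does not state; dummy tensors stand in for real model outputs.

```python
import torch


def pool_hidden_states(hidden_states, layer=-1):
    """Mean-pool one layer over time to get a fixed-size clip embedding.

    Assumes each entry of `hidden_states` has shape (batch, frames, dim).
    """
    frames = hidden_states[layer]
    return frames.mean(dim=1)  # (batch, dim)


# Dummy stand-in for results["hidden_states"]; the shapes are made up.
fake_hidden_states = [torch.randn(1, 50, 384) for _ in range(4)]
embedding = pool_hidden_states(fake_hidden_states)
print(embedding.shape)  # torch.Size([1, 384])
```

Which layer works best usually depends on the task, so it can be worth trying several values of `layer` or a learned weighted sum across layers.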

Training and Fine-tuning

Follow the instructions in usad_fairseq to train USAD from scratch or fine-tune it for audio tagging. A fairseq installation is required. The training log can be found here.

🚧 The current implementation doesn't support USAD 2.0 training, but you can refer to criterions/usad2.py for the proposed domain-aware distillation.

Acknowledgement

Our implementation is based on the awesome facebookresearch/fairseq, cwx-worst-one/EAT, and sooftware/conformer repositories.

Contact

Please open an issue or email me (hengjui [at] mit.edu) if you have any questions 😊

⚠️ It's known that facebookresearch/fairseq has a lot of issues, so please don't contact me for any fairseq-related problems 😊

About

Official implementation of "USAD: Universal Speech and Audio Representation via Distillation"
