We introduce Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that combines speech, sound, and music into one model. USAD uses layer-to-layer distillation from domain-specific models to train a student model, achieving competitive performance across multiple benchmarks and tasks with a single encoder.
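As a rough illustration of layer-to-layer distillation, the sketch below compares selected student layers with selected teacher layers frame by frame and averages the result. The loss form (mean L1 distance plus one minus cosine similarity) is an assumption chosen because it is common in feature distillation; it is not necessarily the objective used in the paper. The names `frame_loss`, `layer_to_layer_loss`, and `layer_map` are hypothetical. Plain Python lists stand in for tensors to keep the example dependency-free.

```python
import math


def frame_loss(student_vec, teacher_vec, eps=1e-8):
    """Assumed loss for one frame: mean L1 distance + (1 - cosine similarity)."""
    l1 = sum(abs(s - t) for s, t in zip(student_vec, teacher_vec)) / len(teacher_vec)
    dot = sum(s * t for s, t in zip(student_vec, teacher_vec))
    norm_s = math.sqrt(sum(s * s for s in student_vec))
    norm_t = math.sqrt(sum(t * t for t in teacher_vec))
    cos = dot / (norm_s * norm_t + eps)
    return l1 + (1.0 - cos)


def layer_to_layer_loss(student_layers, teacher_layers, layer_map):
    """Average the frame loss over all mapped (student layer, teacher layer) pairs.

    student_layers / teacher_layers: list of layers; each layer is a list of
    frames; each frame is a list of floats of equal dimension.
    layer_map: list of (student_idx, teacher_idx) pairs (hypothetical mapping).
    """
    total, count = 0.0, 0
    for s_idx, t_idx in layer_map:
        for s_frame, t_frame in zip(student_layers[s_idx], teacher_layers[t_idx]):
            total += frame_loss(s_frame, t_frame)
            count += 1
    return total / count


# Toy example: 2 student layers, 4 teacher layers, 2 frames of dimension 2.
teacher = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(4)]
student = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(2)]
# Match student layer 0 to teacher layer 1, and student layer 1 to teacher layer 3.
loss = layer_to_layer_loss(student, teacher, layer_map=[(0, 1), (1, 3)])
```

With several domain-specific teachers, one plausible setup (again an assumption) is to compute this loss once per teacher and sum the results. A real implementation would operate on batched PyTorch tensors and would typically project student features to each teacher's dimension first.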
USAD Framework
Speech and Audio Evaluation
HEAR Benchmark
Please cite our paper if you find this repository and/or the paper useful.
@inproceedings{chang2025usad,
title={{USAD}: Universal Speech and Audio Representation via Distillation},
author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
booktitle={ASRU},
year={2025}
}
USAD 2.0 is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing (HEAR and MARBLE) and LLM-based evaluations (XARES-LLM).
USAD 2.0 Framework
Benchmarks: HEAR, MARBLE, XARES-LLM
Please cite our paper if you find this repository and/or the paper useful.
@inproceedings{chang2026usad2,
title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
booktitle={Interspeech},
year={2026}
}
Follow the instructions in usad_inference to download model weights and extract features for downstream tasks. Only PyTorch and TorchAudio are required.
Load pre-trained USAD or USAD 2.0 models from HuggingFace.
from transformers import AutoModel
model_id = "MIT-SLS/USAD-Small" # `MIT-SLS/USAD-Base` / `MIT-SLS/USAD-Large`
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).cuda().eval()
wav = model.load_audio("path/to/audio").unsqueeze(0)
results = model(wav) # keys: "x", "mel", "hidden_states", "ffn"
Follow the instructions in usad_fairseq to train USAD from scratch or fine-tune it for audio tagging. A fairseq installation is required. The training log can be found here.
🚧 The current implementation doesn't support USAD 2.0 training, but you can refer to criterions/usad2.py for the proposed domain-aware distillation.
Our implementation is based on the awesome facebookresearch/fairseq, cwx-worst-one/EAT, and sooftware/conformer repositories.
Please open an issue or email me (hengjui [at] mit.edu) if you have any questions 😊




