USAD: Universal Speech and Audio Representation via Distillation

USAD
USAD 2.0
Instructions
Acknowledgement
Contact

USAD

We introduce Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that combines speech, sound, and music into one model. USAD uses layer-to-layer distillation from domain-specific models to train a student model, achieving competitive performance across multiple benchmarks and tasks with a single encoder.
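To make the idea of layer-to-layer distillation concrete, here is a minimal PyTorch sketch. It is not the repository's training code: the function name `layer_to_layer_distillation_loss`, the one-to-one pairing of student and teacher layers, and the choice of an L1-plus-cosine distance are all assumptions made for illustration. Any projection heads that would map student features to the teacher's dimension are omitted, so both sides are assumed to share the same shape.

```python
import torch
import torch.nn.functional as F


def layer_to_layer_distillation_loss(student_layers, teacher_layers):
    """Average an L1-plus-cosine distance over paired student/teacher layers.

    Both arguments are lists of tensors shaped (batch, time, dim).
    The i-th student layer is paired with the i-th teacher layer; this
    pairing and the distance itself are illustrative assumptions.
    """
    if len(student_layers) != len(teacher_layers):
        raise ValueError("expected one teacher layer per student layer")
    total = 0.0
    for s, t in zip(student_layers, teacher_layers):
        t = t.detach()  # the teacher is frozen; no gradient flows into it
        l1 = F.l1_loss(s, t)
        cos = F.cosine_similarity(s, t, dim=-1).mean()
        total = total + l1 + (1.0 - cos)
    return total / len(student_layers)


# Toy check: random features stand in for real model outputs.
torch.manual_seed(0)
teacher = [torch.randn(2, 50, 16) for _ in range(4)]
student = [(t + 0.1 * torch.randn_like(t)).requires_grad_() for t in teacher]
loss = layer_to_layer_distillation_loss(student, teacher)
loss.backward()  # gradients reach only the student features
```

With several domain-specific teachers, one plausible extension is to compute this loss once per teacher and sum the results; the actual weighting is defined by the code in usad_fairseq.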

USAD Framework

Speech and Audio Evaluation

HEAR Benchmark

Please cite our paper if you find this repository and/or the paper useful.

@inproceedings{chang2025usad,
  title={{USAD}: Universal Speech and Audio Representation via Distillation},
  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
  booktitle={ASRU},
  year={2025}
}

USAD 2.0

USAD 2.0 is a bidirectional transformer-based universal audio encoder that extracts useful representations across multiple audio domains (speech/sound/music) by distilling from SSL/supervised audio foundation models without labeled data. USAD 2.0 achieves strong or state-of-the-art performance across probing (HEAR and MARBLE) and LLM-based evaluations (XARES-LLM).

USAD 2.0 Framework

Benchmarks: HEAR, MARBLE, XARES-LLM

Please cite our paper if you find this repository and/or the paper useful.

@inproceedings{chang2026usad2,
  title={{USAD 2.0}: Scaling Representation Distillation for Universal Audio Understanding},
  author={Chang, Heng-Jui and Liu, Alexander H. and Bhati, Saurabhchand and Athi, Mrudula and Ratnarajah, Anton and Chhetri, Amit and Glass, James},
  booktitle={Interspeech},
  year={2026}
}

Instructions

Inference (Minimal Installation)

Follow the instructions in usad_inference to download the model weights and extract features for downstream tasks. Only PyTorch and TorchAudio are required.

Inference (🤗 HuggingFace)

Load pre-trained USAD or USAD 2.0 models from HuggingFace.

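import torch
from transformers import AutoModel

model_id = "MIT-SLS/USAD-Small"  # or "MIT-SLS/USAD-Base" / "MIT-SLS/USAD-Large"
# trust_remote_code=True runs the model code shipped with the Hugging Face repository
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).cuda().eval()
wav = model.load_audio("path/to/audio").unsqueeze(0)  # add a batch dimension
with torch.no_grad():  # inference only, so no gradients are needed
    results = model(wav)  # keys: "x", "mel", "hidden_states", "ffn"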
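A common next step is to turn the per-layer outputs into one fixed-size embedding per clip. The sketch below shows one way to do that, assuming `results["hidden_states"]` is a sequence of tensors shaped (batch, time, dim); that shape is an assumption, not something stated above. `pool_hidden_states` is a hypothetical helper, and the layer count and feature size in the toy example are arbitrary.

```python
import torch


def pool_hidden_states(hidden_states, layer_weights=None):
    """Turn per-layer frame features into one utterance-level vector per clip.

    hidden_states: sequence of tensors, each assumed to be (batch, time, dim).
    layer_weights: optional 1-D tensor of per-layer logits; uniform if None.
    """
    stacked = torch.stack(list(hidden_states), dim=0)  # (layers, batch, time, dim)
    if layer_weights is None:
        layer_weights = torch.zeros(stacked.size(0))
    weights = torch.softmax(layer_weights, dim=0)
    weights = weights.to(device=stacked.device, dtype=stacked.dtype)
    mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (batch, time, dim)
    return mixed.mean(dim=1)  # average over time -> (batch, dim)


# Random tensors stand in for results["hidden_states"].
fake_hidden_states = [torch.randn(1, 100, 384) for _ in range(12)]
embedding = pool_hidden_states(fake_hidden_states)
print(embedding.shape)  # torch.Size([1, 384])
```

Passing a learnable `layer_weights` tensor lets a downstream probe decide which layers matter for a task, while the default gives a plain average over layers and time.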

Training and Fine-tuning

Follow the instructions in usad_fairseq to train USAD from scratch or fine-tune it for audio tagging. A fairseq installation is required. The training log can be found here.

🚧 The current implementation doesn't support USAD 2.0 training, but you can refer to criterions/usad2.py for the proposed domain-aware distillation.

Acknowledgement

Our implementation is based on the awesome facebookresearch/fairseq, cwx-worst-one/EAT, and sooftware/conformer repositories.

Contact

Please open an issue or email me (hengjui [at] mit.edu) if you have any questions 😊

⚠️ It's known that facebookresearch/fairseq has a lot of issues, so please don't contact me for any fairseq-related problems 😊
