Code for the paper Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer @ NAACL 2022.
AudioSet can be downloaded and preprocessed via this tool.
See AudioSet, which describes our customized index files for pre-training on AudioSet.
See AudioTxt, which describes our curation methods and customized index files for audio-text fine-tuning.
Check out the running script bash/run_bimodal_va.sh.
Check out the running script bash/run_bimodal_at.sh. Fine-tuning starts with a VA pre-trained audio encoder.
We provide the best-performing checkpoint for each task.
Model | AudioCaps | Clotho (18s) | Clotho (10s) |
---|---|---|---|
VIP-ANT | 00051623 | 00043681 | 00043681 |
+AT w/ GC | 00006210 | 00006900 | 00004140 |
Model | ESC50 (w/ prompt) | US8K (w/ prompt) |
---|---|---|
VIP-ANT | 00083391 | 00079420 |
+AT w/ GC | 00004140 | 00004140 |
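As a rough illustration of what the "(w/ prompt)" columns above might involve, here is a minimal sketch of prompt-based zero-shot audio classification. It assumes that class names are wrapped in a text template and that the audio clip is assigned to the class whose text embedding is most similar to the audio embedding. The template string, the function names, and the toy 3-dimensional embeddings are all hypothetical; in practice the embeddings would come from the repo's audio and text encoders, which are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_prompts(labels, template="the sound of {}"):
    """Wrap each class name in a text prompt (the template is an assumption)."""
    return [template.format(label) for label in labels]


def zero_shot_classify(audio_emb, text_embs, labels):
    """Return the label whose prompt embedding is closest to the audio embedding."""
    scores = [cosine(audio_emb, t) for t in text_embs]
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best], scores


# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
labels = ["dog", "rain", "siren"]
prompts = build_prompts(labels)
toy_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_audio_emb = [0.1, 0.9, 0.2]

predicted, scores = zero_shot_classify(toy_audio_emb, toy_text_embs, labels)
print(prompts[1], "->", predicted)
```

No class-specific training is involved in this sketch: changing the label set only requires re-encoding the prompts.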
Dockerfile defines the minimum dependencies of the repo.
MIT