In this paper, we propose learning the alignment between audio and lyrics via contrastive learning to produce higher-quality music captions.
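At its core, the alignment objective can be written as a symmetric contrastive (InfoNCE-style) loss over paired audio and lyrics embeddings. The following is a minimal sketch of that idea, not the exact ALCAP implementation; the function name, temperature value, and embedding shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def audio_lyrics_contrastive_loss(audio_emb, lyrics_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb, lyrics_emb: (batch, dim) outputs of the audio and lyrics
    encoders. Row i of each tensor is a matched pair; all other rows in
    the batch serve as in-batch negatives.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    lyrics_emb = F.normalize(lyrics_emb, dim=-1)
    # (batch, batch) cosine-similarity logits, sharpened by the temperature.
    logits = audio_emb @ lyrics_emb.t() / temperature
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Average the audio-to-lyrics and lyrics-to-audio directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```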
Due to copyright considerations, we can only provide the Song Interpretation dataset, not the NetEase dataset.
- Download the metadata to data/music4all.
- Download the song waveforms to data/music4all/audios.
- (Optional) Download the song embeddings to data/music4all/audios. If they are not downloaded, the code will generate the embeddings from scratch.
- (Optional) Download the CNN music encoder to ckp/. A quick sanity check of this layout is sketched after this list.
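Before training, verifying the layout above can save a failed run. This is only a convenience sketch; the directory names come from the list above, and nothing else is assumed about the files inside them.

```python
from pathlib import Path

# Directories from the download steps above. ckp/ and the precomputed
# embeddings are optional: missing embeddings are generated from scratch.
required = [Path("data/music4all"), Path("data/music4all/audios")]
optional = [Path("ckp")]

for p in required:
    assert p.is_dir(), f"missing required directory: {p}"
for p in optional:
    if not p.is_dir():
        print(f"optional directory not found: {p}")
```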
python run_train.py
Try different corpora and random seeds.
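One convenient way to sweep corpora and seeds is a small driver script. The `--dataset` and `--seed` flags below are hypothetical; replace them with however run_train.py actually takes its configuration.

```python
import itertools
import subprocess

corpora = ["music4all"]  # extend with other corpora as available
seeds = [0, 1, 2]

for corpus, seed in itertools.product(corpora, seeds):
    # Hypothetical flags -- adapt to run_train.py's real interface.
    subprocess.run(
        ["python", "run_train.py", "--dataset", corpus, "--seed", str(seed)],
        check=True,
    )
```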
python run_eval.py
As with training, try different corpora and random seeds.
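When evaluating across several seeds, reporting the mean and standard deviation is a common convention. The scores below are placeholders, and how run_eval.py emits its metrics is an assumption.

```python
import statistics

# Placeholder per-seed scores collected from separate run_eval.py runs.
scores_by_seed = {0: 0.412, 1: 0.405, 2: 0.419}

values = list(scores_by_seed.values())
print(f"mean={statistics.mean(values):.3f} "
      f"std={statistics.stdev(values):.3f} over {len(values)} seeds")
```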
@inproceedings{he2023alcap,
title={ALCAP: Alignment-Augmented Music Captioner},
author={He, Zihao and Hao, Weituo and Lu, Wei-Tsung and Chen, Changyou and Lerman, Kristina and Song, Xuchen},
booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
pages={16501--16512},
year={2023}
}