This repo contains the code for the following paper: *MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response*.
You can also refer to our Huggingface collection for quick start.
To get started, clone the repo and install the required dependencies using the following commands:

```shell
git clone https://github.com/zihaod/MusiLingo
cd MusiLingo
conda create -n musilingo python=3.10.6
conda activate musilingo
pip install -r requirements.txt
```
The MusiLingo model is pre-trained on the LP-MusicCaps-MSD dataset; this repo provides only the annotations, under `data/music_data/msd`. The audio is part of the Million Song Dataset (MSD), which may not be easy to download from the Internet.
We provide a copy of the MusicCaps annotations under `data/music_data/MusicCaps_ann`. The audio is part of Google's AudioSet and you can download it from YouTube.
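Since the clips come from YouTube, one common approach is to fetch them with the third-party `yt-dlp` tool. The helper below is a hypothetical sketch: the output directory layout and the use of `yt-dlp` are assumptions, not part of this repo.

```python
# Hypothetical helper: build a yt-dlp command line that fetches the audio
# for one MusicCaps YouTube id as WAV. yt-dlp must be installed separately;
# the output directory is an illustrative placeholder.
def ytdlp_command(ytid: str, out_dir: str = "data/music_data/audio"):
    return [
        "yt-dlp", "-x", "--audio-format", "wav",
        "-o", f"{out_dir}/%(id)s.%(ext)s",
        f"https://www.youtube.com/watch?v={ytid}",
    ]

# Run it with, e.g., subprocess.run(ytdlp_command("SOME_YTID"), check=True)
```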
We developed the MusicInstruct (MI) dataset, saved under `data/music_data/MusicInstruct`. The audio clips are identical to those of MusicCaps. You can also find more information on the Huggingface page.
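The annotations pair MusicCaps audio with questions and answers. A minimal sketch of consuming such a record follows; the field names (`ytid`, `question`, `answer`) are illustrative assumptions, not the repo's actual schema.

```python
import json

# Hypothetical MI-style record; the real files under
# data/music_data/MusicInstruct may use different field names.
raw = ('{"ytid": "abc123", '
       '"question": "What instruments are playing?", '
       '"answer": "A solo acoustic guitar."}')

record = json.loads(raw)
# Format the pair as a simple instruction-tuning prompt.
prompt = f"Q: {record['question']}\nA: {record['answer']}"
```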
You can download the MusicQA dataset from Huggingface to `data/music_data/MusicQA`.
You need to prepare the pretrained Vicuna weights following the instructions here. Once you have the weights, put them under the `model/7B_vicuna` folder. The final contents should look like this:
```
Vicuna_7B
├── config.json
├── generation_config.json
├── pytorch_model-00001-of-00002.bin
└── pytorch_model-00002-of-00002.bin
```
Currently we use Vicuna 7B v0 by default. Finally, go to `musilingo/configs/models/musilingo.yaml` and set `llama_model` to `PATH/TO/Vicuna_7B/`. Alternatively, you can set it to the Huggingface path `lmsys/vicuna-7b-delta-v0`.
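The relevant entry in `musilingo/configs/models/musilingo.yaml` would then look roughly like this (the surrounding keys are omitted and the nesting is assumed; only the `llama_model` value comes from the text above):

```yaml
model:
  llama_model: "PATH/TO/Vicuna_7B/"  # or the hub path "lmsys/vicuna-7b-delta-v0"
```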
We use MERT-v1-330M as the music encoder for the MusiLingo model. You can download it from its Huggingface page to `model/MERT-v1-330M`.
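To sanity-check the downloaded encoder, here is a sketch using the Hugging Face `transformers` API (the import is deferred so the snippet parses without `transformers` installed; `trust_remote_code=True` is needed because MERT ships custom model code):

```python
def load_mert(path="model/MERT-v1-330M"):
    """Load the MERT music encoder from a local folder or the hub id
    m-a-p/MERT-v1-330M. Sketch only; assumes transformers is installed."""
    from transformers import AutoModel, Wav2Vec2FeatureExtractor
    model = AutoModel.from_pretrained(path, trust_remote_code=True)
    processor = Wav2Vec2FeatureExtractor.from_pretrained(path, trust_remote_code=True)
    return model, processor
```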
Run the following command to pretrain the model. Set `NUM_GPU` to the actual number of available GPUs.

```shell
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/musilingo_stage1_pretrain.yaml
```
For each dataset, run the command provided in the corresponding section. Again, set `NUM_GPU` to the actual number of available GPUs.
We can use instruction tuning on MusicCaps to perform captioning tasks by giving a default question prompt.

```shell
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/musilingo_stage2_finetune_musiccaps.yaml
```
We can run instruction tuning on the whole MI dataset, or only on either the long or the short questions.

For the whole MI dataset:

```shell
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/musilingo_stage2_finetune_cmi.yaml
```

For the short question version:

```shell
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/musilingo_stage2_finetune_cmi_short.yaml
```

For the long question version:

```shell
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/musilingo_stage2_finetune_cmi_long.yaml
```
Run the following command to finetune on MusicQA:

```shell
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/musilingo_stage2_finetune_musicqa.yaml
```
To run inference on the MusicInstruct dataset, use the following commands:

```shell
python qa.py --qa_type short
python qa.py --qa_type long
```
If you cannot download the checkpoints from this repo, you can download the pretrained model checkpoints below and place each one in the listed folder:

| MusiLingo (long) | MusiLingo (MusicQA) | MusiLingo (short) |
|---|---|---|
| Download | Download | Download |
| `model/ckpt/long/` | `model/ckpt/musicqa/` | `model/ckpt/short/` |
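A hedged sketch for loading one of these checkpoint files follows; the `"model"` wrapper key is an assumption based on common PyTorch training setups, not verified against this repo.

```python
def load_state_dict(path):
    """Load a MusiLingo checkpoint file and return its weight dict.
    Sketch only: assumes a torch-serialized file, possibly wrapping the
    weights under a "model" key as many trainers do."""
    import torch  # deferred so the sketch parses without torch installed
    ckpt = torch.load(path, map_location="cpu")
    return ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
```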
If you find the work useful for your research, please consider citing it using the following BibTeX entry:
```
@inproceedings{deng2024musilingo,
  title={MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response},
  author={Deng, Zihao and Ma, Yinghao and Liu, Yudong and Guo, Rongchen and Zhang, Ge and Chen, Wenhu and Huang, Wenhao and Benetos, Emmanouil},
  booktitle={Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024)},
  year={2024},
  organization={Association for Computational Linguistics}
}
```