This repo contains the code for the paper MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response.

You can also refer to our Huggingface collection for a quick start.

Environment setup

To get started, git clone the repo and install the required dependencies using the following commands:

git clone
cd MusiLingo
conda create -n musilingo python=3.10.6
conda activate musilingo
pip install -r requirements.txt

Data Preparation

1. LP-MusicCaps-MSD

The MusiLingo model is pre-trained on the LP-MusicCaps-MSD dataset. This repo provides only the dataset's annotations, under data/music_data/msd. The audio is part of the Million Song Dataset (MSD), which may not be easy to download from the Internet.

2. MusicCaps

We provide a copy of the MusicCaps annotations under data/music_data/MusicCaps_ann. The audio is part of Google's AudioSet and can be downloaded from YouTube.

3. MusicInstruct (MI)

We developed the MI dataset, which is saved under data/music_data/MusicInstruct. The audio clips are identical to those of MusicCaps. More information is available on the Huggingface page.

4. MusicQA

You can download the MusicQA dataset from Huggingface to data/music_data/MusicQA.
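Before training, it can help to confirm that the four dataset directories are where the configs expect them. The following is a minimal sketch, assuming it is run from the repo root; the function name `missing_data_dirs` and the dictionary labels are illustrative and not part of this repo.

```python
from pathlib import Path

# Dataset directories named in this section, relative to the repo root.
# The dictionary keys are illustrative labels only.
EXPECTED_DATA_DIRS = {
    "LP-MusicCaps-MSD": "data/music_data/msd",
    "MusicCaps": "data/music_data/MusicCaps_ann",
    "MusicInstruct": "data/music_data/MusicInstruct",
    "MusicQA": "data/music_data/MusicQA",
}


def missing_data_dirs(repo_root="."):
    """Return the labels of datasets whose directory is absent under repo_root."""
    root = Path(repo_root)
    return [
        name
        for name, rel_path in EXPECTED_DATA_DIRS.items()
        if not (root / rel_path).is_dir()
    ]


if __name__ == "__main__":
    for name in missing_data_dirs():
        print(f"missing: {name} -> {EXPECTED_DATA_DIRS[name]}")
```

This only checks that the directories exist; it does not verify the annotation files or the audio.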

Model Preparation


You need to prepare the pretrained Vicuna weights following the instructions here. Once you have the weights, put them under the model/7B_vicuna folder. The final contents should look like this:

├── config.json
├── generation_config.json
├── pytorch_model-00001-of-00002.bin

Currently we use Vicuna 7B v0 by default. Finally, go to musilingo/configs/models/musilingo.yaml and set llama_model to PATH/TO/Vicuna_7B/. Alternatively, you can set it to the Huggingface path lmsys/vicuna-7b-delta-v0.


We use MERT-v1-330M as the music encoder for the MusiLingo model. You can download it from its Huggingface page to model/MERT-v1-330M.
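A quick way to catch a misplaced download is to check the two model folders before launching a run. The sketch below assumes the default locations model/7B_vicuna and model/MERT-v1-330M and checks only the Vicuna files shown in the listing above (the real folder may contain more); `check_model_dirs` is a hypothetical helper, not part of this repo.

```python
from pathlib import Path

# Only the files shown in the listing above; the actual folder may hold more.
VICUNA_FILES_SHOWN = [
    "config.json",
    "generation_config.json",
    "pytorch_model-00001-of-00002.bin",
]


def check_model_dirs(repo_root="."):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    root = Path(repo_root)
    problems = []

    vicuna_dir = root / "model" / "7B_vicuna"
    for file_name in VICUNA_FILES_SHOWN:
        if not (vicuna_dir / file_name).is_file():
            problems.append(f"Vicuna file not found: {vicuna_dir / file_name}")

    mert_dir = root / "model" / "MERT-v1-330M"
    if not mert_dir.is_dir():
        problems.append(f"MERT directory not found: {mert_dir}")

    return problems


if __name__ == "__main__":
    for problem in check_model_dirs():
        print(problem)
```

If llama_model is set to a Huggingface path instead of a local folder, the Vicuna part of this check does not apply.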


1. Pretraining with LP-MusicCaps-MSD

Run the following command to pretrain the model. Set NUM_GPU to the number of GPUs actually available.

torchrun --nproc-per-node NUM_GPU --cfg-path train_configs/musilingo_stage1_pretrain.yaml

2. Instruction Finetuning

For each dataset, run the command provided in the corresponding section. Again, set NUM_GPU to the number of GPUs actually available.

2.1 MusicCaps

We can use instruction tuning on MusicCaps to perform captioning tasks by giving a default question prompt.

torchrun --nproc-per-node NUM_GPU --cfg-path train_configs/musilingo_stage2_finetune_musiccaps.yaml

2.2 MusicInstruct

We can run instruction tuning on the whole MI dataset, or only on either the long or the short questions.

For the whole MI dataset:

torchrun --nproc-per-node NUM_GPU --cfg-path train_configs/musilingo_stage2_finetune_cmi.yaml

For the short question version:

torchrun --nproc-per-node NUM_GPU --cfg-path train_configs/musilingo_stage2_finetune_cmi_short.yaml

For the long question version:

torchrun --nproc-per-node NUM_GPU --cfg-path train_configs/musilingo_stage2_finetune_cmi_long.yaml

2.3 MusicQA

Run the following command to finetune on MusicQA:

torchrun --nproc-per-node NUM_GPU --cfg-path train_configs/musilingo_stage2_finetune_musicqa.yaml
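When scripting several finetuning runs, the config paths above can be kept in one lookup table. This is a minimal sketch: the paths are copied from the commands in this section, while the short keys (`"mi_short"` and so on) and the function `finetune_config` are hypothetical names chosen for illustration.

```python
# Stage-2 config paths as they appear in the commands above.
# The keys are illustrative labels, not names used by the repo.
STAGE2_CONFIGS = {
    "musiccaps": "train_configs/musilingo_stage2_finetune_musiccaps.yaml",
    "mi": "train_configs/musilingo_stage2_finetune_cmi.yaml",
    "mi_short": "train_configs/musilingo_stage2_finetune_cmi_short.yaml",
    "mi_long": "train_configs/musilingo_stage2_finetune_cmi_long.yaml",
    "musicqa": "train_configs/musilingo_stage2_finetune_musicqa.yaml",
}


def finetune_config(dataset):
    """Return the --cfg-path value for a dataset label, or raise KeyError."""
    try:
        return STAGE2_CONFIGS[dataset]
    except KeyError:
        valid = ", ".join(sorted(STAGE2_CONFIGS))
        raise KeyError(f"unknown dataset {dataset!r}; expected one of: {valid}")


if __name__ == "__main__":
    for label in STAGE2_CONFIGS:
        print(f"{label}: --cfg-path {finetune_config(label)}")
```

The returned path is meant to be passed as the --cfg-path argument of the torchrun commands shown in this section.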


To run inference on the MusicInstruct dataset, use the following commands:

Python --qa_type short
Python --qa_type long

Model Checkpoints

If you cannot download the checkpoints stored in this repo, you can download the pretrained model checkpoints from the links below:

| MusiLingo (long) | MusiLingo (MusicQA) | MusiLingo (short) |
| --- | --- | --- |
| Download | Download | Download |
| model/ckpt/long/ | model/ckpt/musicqa/ | model/ckpt/short/ |

Citing This Work

If you find the work useful for your research, please consider citing it using the following BibTeX entry:

  title={MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response},
  author={Deng, Zihao and Ma, Yinghao and Liu, Yudong and Guo, Rongchen and Zhang, Ge and Chen, Wenhu and Huang, Wenhao and Benetos, Emmanouil},
  booktitle={Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024)},
  organization={Association for Computational Linguistics}

