Demo 🎶 | Paper (coming soon)
YuE-s1-7B-anneal-en-cot 🤗 | YuE-s1-7B-anneal-en-icl 🤗 | YuE-s1-7B-anneal-jp-kr-cot 🤗
YuE-s1-7B-anneal-jp-kr-icl 🤗 | YuE-s1-7B-anneal-zh-cot 🤗 | YuE-s1-7B-anneal-zh-icl 🤗
YuE-s2-1B-general 🤗 | YuE-upsampler 🤗
Our model's name is YuE (乐). In Chinese, the word means "music" and "happiness." Some of you may find words that start with Yu hard to pronounce. If so, you can just call it "yeah." We wrote a song with our model's name, see here.
YuE is a groundbreaking series of open-source foundation models designed for music generation, specifically for transforming lyrics into full songs (lyrics2song). It can generate a complete song, lasting several minutes, that includes both a catchy vocal track and an accompaniment track. YuE is capable of modeling diverse genres, languages, and vocal techniques. Please visit the Demo Page to hear its vocal performance.
- 2025.01.29: We have updated the license description. We ENCOURAGE artists and content creators to sample and incorporate outputs generated by our model into their own works, and even monetize them. The only requirement is to credit our name: YuE by M-A-P.
- 2025.01.28 🫶: Thanks to Fahd for creating a tutorial on how to quickly get started with YuE. Here is his demonstration.
- 2025.01.26 🔥: We have released the YuE series.
- Support dual-track ICL mode.
- Support gradio interface.
- Support transformers tensor parallel.
- Online serving on Hugging Face Space.
- Example finetune code for enabling BPM control using 🤗 Transformers.
YuE requires significant GPU memory for generating long sequences. Below are the recommended configurations:
- For GPUs with 24GB memory or less: Run up to 2 sessions concurrently to avoid out-of-memory (OOM) errors.
- For full song generation (many sessions, e.g., 4 or more): Use GPUs with at least 80GB memory, e.g. an H800, an A100, or multiple RTX 4090s with tensor parallelism.
To customize the number of sessions, the interface allows you to specify the desired session count. By default, the model runs 2 sessions (1 verse + 1 chorus) to avoid OOM issues.
On an H800 GPU, generating 30s of audio takes 150 seconds. On an RTX 4090 GPU, generating 30s of audio takes approximately 360 seconds.
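If you are unsure how many sessions your GPU can handle, a quick check of total GPU memory can help you pick between the configurations above. This is our suggestion rather than a repo-specific step; any tool that reports VRAM works equally well:

```bash
# Print each GPU's name and total memory to decide how many sessions to run.
nvidia-smi --query-gpu=name,memory.total --format=csv
```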
Quick start VIDEO TUTORIAL by Fahd: Link here. We recommend watching this video if you are not familiar with machine learning or the command line.
Make sure FlashAttention 2 is properly installed to reduce VRAM usage.
```bash
# We recommend using conda to create a new environment.
conda create -n yue python=3.8  # Python >= 3.8 is recommended.
conda activate yue

# Install CUDA >= 11.8 and PyTorch.
conda install pytorch torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install -r requirements.txt

# FlashAttention 2 is mandatory for saving GPU memory.
# Without it, long audio may lead to out-of-memory (OOM) errors.
# Be careful to match the CUDA version with your flash-attn version.
pip install flash-attn --no-build-isolation
```
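As an optional sanity check (our suggestion, not part of the repo's instructions), you can confirm that flash-attn imports cleanly against your CUDA/PyTorch build before running inference:

```bash
# If this prints the CUDA version and flash-attn version without errors,
# FlashAttention 2 is installed and importable.
python -c "import torch, flash_attn; print(torch.version.cuda, flash_attn.__version__)"
```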
```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://github.com/multimodal-art-projection/YuE.git
cd YuE/inference/
git clone https://huggingface.co/m-a-p/xcodec_mini_infer
```
Now generate music with YuE using 🤗 Transformers. Make sure steps 1 and 2 above are properly set up.
Note:
- Set `--run_n_segments` to the number of lyric sections if you want to generate a full song. Additionally, you can increase `--stage2_batch_size` based on your available GPU memory.
- You may customize the prompt in `genre.txt` and `lyrics.txt`. See the prompt engineering guide here.
- LM ckpts will be automatically downloaded from Hugging Face.
```bash
# This is the CoT mode.
cd YuE/inference/
python infer.py \
    --stage1_model m-a-p/YuE-s1-7B-anneal-en-cot \
    --stage2_model m-a-p/YuE-s2-1B-general \
    --genre_txt genre.txt \
    --lyrics_txt lyrics.txt \
    --run_n_segments 2 \
    --stage2_batch_size 4 \
    --output_dir ./output \
    --cuda_idx 0 \
    --max_new_tokens 3000
```
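To pick `--run_n_segments` for a full song, one option is to count the structure labels in your lyrics file. The one-liner below is a small sketch of ours (not part of `infer.py`) and assumes `lyrics.txt` follows the label format described in the prompt engineering guide:

```bash
# Count lines starting with a structure label such as [verse] or [chorus];
# use the result as the value for --run_n_segments.
grep -c '^\[' lyrics.txt
```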
If you want to use music in-context learning (provide a reference song), enable `--use_audio_prompt`, and use `--prompt_start_time` and `--prompt_end_time` to specify the audio segment.
Note:
- ICL requires a different ckpt, e.g. `m-a-p/YuE-s1-7B-anneal-en-icl`.
- Music ICL generally requires a 30s audio segment. The model will write new songs with a similar style to the provided audio, which may improve musicality.
- We have 4 modes for ICL: mix, vocal, instrumental, and dual-track.
- We currently only support mix mode.
- Dual-track mode works the best; it will be supported in the inference code soon.
```bash
# This is the ICL mode. Currently only mix-ICL is supported.
cd YuE/inference/
python infer.py \
    --stage1_model m-a-p/YuE-s1-7B-anneal-en-icl \
    --stage2_model m-a-p/YuE-s2-1B-general \
    --genre_txt genre.txt \
    --lyrics_txt lyrics.txt \
    --run_n_segments 2 \
    --stage2_batch_size 4 \
    --output_dir ./output \
    --cuda_idx 0 \
    --max_new_tokens 3000 \
    --audio_prompt_path {YOUR_AUDIO_FILE} \
    --prompt_start_time 0 \
    --prompt_end_time 30
```
The prompt consists of three parts: genre tags, lyrics, and an optional reference audio.

**Genre tagging prompt**
- An example genre tagging prompt can be found here.
- A stable tagging prompt usually consists of five components: genre, instrument, mood, gender, and timbre. All five should be included if possible, separated by spaces (space delimiter).
- Although our tags have an open vocabulary, we have provided the top 200 most commonly used tags. It is recommended to select tags from this list for more stable results.
- The order of the tags is flexible. For example, a stable genre tagging prompt might look like: "inspiring female uplifting pop airy vocal electronic bright vocal vocal". A minimal `genre.txt` sketch is shown after this list.
- Additionally, we have introduced the "Mandarin" and "Cantonese" tags to distinguish between Mandarin and Cantonese, as their lyrics often share similarities.
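As a concrete illustration (our sketch, not a file shipped with the repo), a minimal `genre.txt` can hold a single line of space-separated tags; the tag string below reuses the example from the list above:

```bash
# Write an illustrative single-line genre.txt.
cat > genre.txt << 'EOF'
inspiring female uplifting pop airy vocal electronic bright vocal vocal
EOF
```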
**Lyrics prompt**
- An example lyric prompt can be found here.
- We support multiple languages, including but not limited to English, Mandarin Chinese, Cantonese, Japanese, and Korean. The default top language distribution during the annealing phase is revealed in issue 12. A language ID on a specific annealing checkpoint indicates that we have adjusted the mixing ratio to enhance support for that language.
- The lyrics prompt should be divided into sessions, with structure labels (e.g., [verse], [chorus], [bridge], [outro]) prepended. Each session should be separated by two newline characters ("\n\n"). A minimal `lyrics.txt` sketch is shown after this list.
- Do NOT put too many words in a single segment, since each session is around 30s (`--max_new_tokens 3000` by default).
- We find that the [intro] label is less stable, so we recommend starting with [verse] or [chorus].
- For generating music with no vocals, see issue 18.
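To make the layout concrete, here is a minimal `lyrics.txt` sketch following the rules above; the lyric lines are placeholders, so replace them with real lyrics (see the linked example lyric prompt):

```bash
# Write an illustrative lyrics.txt: structure labels prepended,
# sessions separated by a blank line ("\n\n").
cat > lyrics.txt << 'EOF'
[verse]
Placeholder line one of the first verse
Placeholder line two of the first verse

[chorus]
Placeholder line one of the chorus
Placeholder line two of the chorus
EOF
```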
**Audio prompt**
- The audio prompt is optional. Providing a reference audio for ICL usually increases the good-case rate, but results in less diversity, since the generated token space is bounded by the reference audio. CoT only (no reference) results in more diverse output.
- We find that dual-track ICL mode gives the best musicality and prompt following. We will support this mode soon.
- Using the chorus part of the music as the prompt results in better musicality.
- Our models are licensed under Creative Commons Attribution Non Commercial 4.0, meaning the model weights themselves CANNOT be used for commercial purposes.
- However, we ENCOURAGE artists and content creators to sample and incorporate outputs generated by our model into their own works, and even monetize them. The only requirement is to credit our name: YuE by M-A-P.
- We DO NOT assume any responsibility for any misuse of this model, including but not limited to illegal, malicious, or unethical activities.
- Users are solely responsible for any content generated with the model and any consequences arising from its use.
If you find our paper and code useful in your research, please consider giving a star ⭐ and a citation :)
```bibtex
@misc{yuan2025yue,
title={YuE: Open Music Foundation Models for Full-Song Generation},
author={Ruibin Yuan and Hanfeng Lin and Shawn Guo and Ge Zhang and Jiahao Pan and Yongyi Zang and Haohe Liu and Xingjian Du and Xeron Du and Zhen Ye and Tianyu Zheng and Yinghao Ma and Minghao Liu and Lijun Yu and Zeyue Tian and Ziya Zhou and Liumeng Xue and Xingwei Qu and Yizhi Li and Tianhao Shen and Ziyang Ma and Shangda Wu and Jun Zhan and Chunhui Wang and Yatian Wang and Xiaohuan Zhou and Xiaowei Chi and Xinyue Zhang and Zhenzhu Yang and Yiming Liang and Xiangzhou Wang and Shansong Liu and Lingrui Mei and Peng Li and Yong Chen and Chenghua Lin and Xie Chen and Gus Xia and Zhaoxiang Zhang and Chao Zhang and Wenhu Chen and Xinyu Zhou and Xipeng Qiu and Roger Dannenberg and Jiaheng Liu and Jian Yang and Stephen Huang and Wei Xue and Xu Tan and Yike Guo},
howpublished={\url{https://github.com/multimodal-art-projection/YuE}},
year={2025},
note={GitHub repository}
}
```