State of the art of music generation with deep learning and AI, mainly focused on text-to-music generation.
- Personalized Music Composition: Music generation models may become increasingly adept at creating personalized music based on a user’s emotions and preferences.
- Music Accompaniment Applications: With the demand from industries such as gaming, film, and advertising, music generation models can provide real-time, automated background scores for these media.
- Music Education and Compositional Aid: Music generation technology can help musicians and composers come up with new musical ideas, or be used in music education to demonstrate compositional techniques.
- Interactive Performance: Using music generation models in live performances to interact with the audience and create a unique performance experience.
- Music Analysis and Reconstruction: Utilizing models to analyze the style of musical works and reconstruct or generate new pieces based on that analysis.
AI generation of text, video, and images has surged since 2023, but audio generation has lagged somewhat behind. This repository tracks the latest advancements in audio generation technology; both closed- and open-source achievements are included!
Model | Sample Rate | Length | Input | Output | Example | Inference Time | Training Data | Model Architecture |
---|---|---|---|---|---|---|---|---|
WaveNet (2016) | 16kHz@1 | Seconds | None | Piano or speech | Piano | = audio length | 260 | |
Jukebox (2020), OpenAI | 44.1kHz@1 | Minutes | Lyrics, author, etc. | Song with the lyrics | Song | Hours | 70k | |
RAVE (2021) | 48kHz@2 | Seconds | Latent | Single-genre music | Strings | = audio length | 100 | |
AudioLM (2022), Google | 16kHz@1 | Minutes | Text prompt | Piano or speech | Piano | Minutes | 40k | |
Musika (2022) | 22.5kHz@1 | Seconds | Context vector | Single-genre music | Piano | = audio length | 1k | |
Riffusion (2022) | 44.1kHz@1 | 5s | Text (genre, author, etc.) | Music of any genre | Jazzy clarinet | Minutes | - | |
MusicLM (2023), Google | 24kHz@2 | Up to 5 minutes | Text prompt or image | Music | Music | - | 5.5k (MusicCaps) | Transformer-based multi-stage autoregressive modeling |
MusicGen (2023), Meta | 32kHz@2 | Minutes | Text (a phrase/sentence) | Music | Music | Depends on model size | 20k hours (licensed music) | |
Moûsai (2023) | 48kHz@1 | Minutes | Text (genre, author, etc.) | Music of any genre | African drums | = audio length | 2.5k | |
JEN-1 (2023.8) | 48kHz@2 | Minutes | Text prompt | Piano / instrumental music | Piano | - | 5k hours (private) | Autoencoder with autoregressive and non-autoregressive diffusion + Transformer |
JEN-1 Composer (2023.11) | 48kHz@4 | Minutes | Text prompt | Music | Music | - | - | Autoencoder with autoregressive and non-autoregressive diffusion + Transformer |
Transformer-based large language models (autoregressive models) first convert words/symbols into tokens via embeddings, then use an attention-based decoder to capture dependencies across both short and long contexts. Diffusion-based image generation (non-autoregressive models), by contrast, produces an entire image at once rather than pixel by pixel. The structural difference is that the Transformer generates token by token, which is not very efficient, while diffusion generates the whole output jointly through iterative denoising. How to combine the two into a model that is both efficient and high-quality is therefore a key question.
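To make the contrast concrete, here is a toy sketch of the two sampling loops; random numbers stand in for the actual networks, and the function names are illustrative, not from any library. The autoregressive loop pays one model call per token, while the diffusion loop pays a fixed number of denoising passes over the whole output.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Autoregressive sketch: one token per step, each step conditioned on the prefix ---
def autoregressive_sample(num_steps=8, vocab_size=16):
    tokens = []
    for _ in range(num_steps):                      # one forward pass per generated token
        logits = rng.normal(size=vocab_size)        # stand-in for a Transformer decoder call
        probs = np.exp(logits) / np.exp(logits).sum()
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return tokens                                   # cost grows linearly with sequence length

# --- Diffusion sketch: the whole sample is refined jointly over a fixed number of steps ---
def diffusion_sample(sample_shape=(64,), num_denoise_steps=10):
    x = rng.normal(size=sample_shape)               # start from pure noise covering the full output
    for _ in range(num_denoise_steps):              # every step updates *all* positions at once
        predicted_noise = 0.1 * x                   # stand-in for a denoising network call
        x = x - predicted_noise                     # simplistic update; real samplers follow a noise schedule
    return x

print(autoregressive_sample())
print(diffusion_sample()[:5])
```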
Judging from OpenAI's Sora model, future audio (music and voice) generation may achieve breakthroughs with a similar structure. Analogous to OpenAI's visual patches, there would be audio patches: representation units of audio, similar to words in a language model, though not of equal duration in time. Just as Sora's video encoder maps video into a latent space, an audio encoder would map audio into latent-space patches (not necessarily RVQ; such learned representations of the physical signal are currently not human-interpretable, but RVQ could serve as a substitute). These spacetime latent patches would then play the role of tokens in a large language model, except that the LLM's Transformer would be replaced with a diffusion transformer.
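As a purely illustrative sketch of the "audio patches" idea (all sizes, the fixed-length chunking, and the random projection are assumptions, not taken from Sora or any audio paper), the bookkeeping might look like this: chop the waveform into chunks, map each chunk to a latent vector, and treat the resulting sequence as the tokens a diffusion transformer would denoise jointly.

```python
import numpy as np

SAMPLE_RATE = 32_000     # Hz (assumed)
PATCH_SAMPLES = 1_600    # 50 ms of audio per patch (assumed; real patches need not be equal-length)
LATENT_DIM = 64          # size of each latent patch vector (assumed)

def audio_to_latent_patches(waveform: np.ndarray) -> np.ndarray:
    """Split a mono waveform into chunks and project each chunk to a latent vector.

    A real system would use a learned audio encoder (or an RVQ codec as a stand-in);
    here a fixed random projection takes its place.
    """
    num_patches = len(waveform) // PATCH_SAMPLES
    chunks = waveform[: num_patches * PATCH_SAMPLES].reshape(num_patches, PATCH_SAMPLES)
    projection = np.random.default_rng(0).normal(size=(PATCH_SAMPLES, LATENT_DIM))
    return chunks @ projection   # (num_patches, LATENT_DIM): the "audio patch" token sequence

waveform = np.random.default_rng(1).normal(size=SAMPLE_RATE * 2)   # 2 seconds of fake audio
patches = audio_to_latent_patches(waveform)
print(patches.shape)   # (40, 64): these latents are what a diffusion transformer would denoise
```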
Suno AI V3 has been released, and V4 is already under development:
- makes full, two-minute songs in seconds
- text-to-music generation
- supports many styles and genres
- website: https://app.suno.ai/
MusicGen, open-sourced by Meta: https://github.com/facebookresearch/audiocraft/tree/main. AudioCraft provides the code and models for MusicGen, a simple and controllable model for music generation. MusicGen is a single-stage autoregressive Transformer model trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods such as MusicLM, MusicGen doesn't require a self-supervised semantic representation, and it generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, the authors show they can be predicted in parallel, leaving only 50 autoregressive steps per second of audio. Check out the sample page or test the available demo (a usage sketch follows the notes below).
- 20K hours of licensed music were used to train MusicGen: an internal dataset of 10K high-quality music tracks plus ShutterStock and Pond5 music data.
- The released models come in three sizes: small (300M), medium (1.5B), and large (3.3B) parameters.
Improves on the approach of Google's original MusicLM paper; AudioCraft also supports Multi-Band Diffusion decoding.
The SoundStream neural audio codec uses residual vector quantization (RVQ).
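RVQ encodes a vector as a sum of codewords drawn from a stack of codebooks, where each codebook quantizes the residual left over by the previous stage. A minimal NumPy sketch (random codebooks and hypothetical sizes, not SoundStream's actual trained codec):

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CODEBOOKS, CODEBOOK_SIZE, DIM = 4, 256, 32         # hypothetical sizes
codebooks = rng.normal(size=(NUM_CODEBOOKS, CODEBOOK_SIZE, DIM))

def rvq_encode(x: np.ndarray) -> list[int]:
    """Return one code index per codebook; each stage quantizes the remaining residual."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]                   # pass what is left to the next stage
    return codes

def rvq_decode(codes: list[int]) -> np.ndarray:
    """Reconstruction is simply the sum of the selected codewords."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

x = rng.normal(size=DIM)
codes = rvq_encode(x)
print(codes, np.linalg.norm(x - rvq_decode(codes)))     # error shrinks as codebooks are added
```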
MuLan, which MusicLM uses for text conditioning, is a two-tower, joint audio-text embedding model trained on 44 million music recordings (370K hours) with weakly-associated, free-form text annotations.
More details: https://ai.meta.com/resources/models-and-libraries/audiocraft/
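As a quick way to try MusicGen, here is a minimal text-to-music sketch following the usage shown in the AudioCraft repository (the API may change between versions, so treat the exact calls as a guide rather than a guarantee):

```python
# pip install audiocraft   (requires PyTorch; see the AudioCraft repository for details)
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the small (300M) checkpoint; 'medium' and 'large' trade speed for quality.
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)          # generate 8 seconds of audio

descriptions = ['lo-fi hip hop beat with mellow piano']
wav = model.generate(descriptions)               # returns a batch of waveforms at 32 kHz

for i, one_wav in enumerate(wav):
    # Save each sample as a .wav file with loudness normalization.
    audio_write(f'musicgen_sample_{i}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```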
SOUNDRAW's AI
- similar to Suno, also driven by text prompts: https://soundraw.io/
JEN-1 can perform music generation, inpainting, and continuation.
It combines the autoregressive mode (from LLMs) and the non-autoregressive mode (from image generation, e.g. Stable Diffusion) into a single omnidirectional diffusion model.
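Setting JEN-1's actual implementation aside, the general idea that one masked diffusion model can cover all three tasks can be sketched as follows; everything here (shapes, masks, the decay stand-in for the denoiser) is illustrative, not JEN-1's method:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_FRAMES, LATENT_DIM = 100, 32          # hypothetical latent sequence for one music clip

def masked_denoise(known: np.ndarray, mask: np.ndarray, num_steps: int = 20) -> np.ndarray:
    """Toy masked-diffusion loop: frames where mask == 1 are kept fixed to the known
    latents; frames where mask == 0 are iteratively 'denoised' (here: a decay update
    stands in for a real denoising network)."""
    x = rng.normal(size=known.shape)                        # start from noise everywhere
    for _ in range(num_steps):
        x = np.where(mask[:, None] == 1, known, 0.9 * x)    # re-anchor known frames, refine the rest
    return x

known = rng.normal(size=(NUM_FRAMES, LATENT_DIM))           # pretend these are encoded audio latents

generation_mask   = np.zeros(NUM_FRAMES)                    # nothing given: plain text-to-music generation
continuation_mask = np.r_[np.ones(50), np.zeros(50)]        # first half given: continue the clip
inpainting_mask   = np.r_[np.ones(40), np.zeros(20), np.ones(40)]  # fill in a missing middle section

for name, mask in [('generation', generation_mask),
                   ('continuation', continuation_mask),
                   ('inpainting', inpainting_mask)]:
    out = masked_denoise(known, mask)
    print(name, out.shape)
```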
AudioLM: https://arxiv.org/pdf/2209.03143.pdf
MusicLM: https://arxiv.org/pdf/2301.11325.pdf
The dataset (28 hours of music from Google) is not open-sourced for now.
Audio examples: https://google-research.github.io/seanet/musiclm/examples/ and https://google-research.github.io/seanet/audiolm/examples/