# **MusicGen** : Simple and Controllable Music Generation

paper: https://arxiv.org/abs/2306.05284 \
code: https://github.com/facebookresearch/audiocraft \
docs: https://huggingface.co/docs/transformers/main/en/model_doc/musicgen \
musicgen api info: https://replicate.com/facebookresearch/musicgen/api

Audiocraft provides the code and models for MusicGen, [a simple and controllable model for music generation](https://arxiv.org/abs/2306.05284). MusicGen is a single-stage auto-regressive
Transformer model trained over a 32 kHz <a href="https://github.com/facebookresearch/encodec">EnCodec tokenizer</a> with 4 codebooks sampled at 50 Hz. Unlike existing methods such as [MusicLM](https://arxiv.org/abs/2301.11325), MusicGen doesn't require a self-supervised semantic representation, and it generates
all 4 codebooks in one pass. By introducing a small delay between the codebooks, the authors show that the codebooks can be predicted
in parallel, which leaves only 50 auto-regressive steps per second of audio.
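The delay idea can be illustrated with a toy sketch. It assumes the simplest possible pattern, in which codebook `k` is shifted right by `k` steps. The names `delay_pattern` and `PAD` are made up for illustration and are not audiocraft's API.

```python
PAD = None  # marks positions where a codebook has no token yet (illustrative)

def delay_pattern(codes):
    """Shift codebook k right by k steps.

    codes: list of K lists, each holding T token ids (one list per codebook).
    Returns T + K - 1 decoding steps; each step holds one entry per codebook.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    steps = []
    for s in range(num_frames + num_codebooks - 1):
        step = []
        for k in range(num_codebooks):
            t = s - k  # frame index that codebook k emits at step s
            step.append(codes[k][t] if 0 <= t < num_frames else PAD)
        steps.append(step)
    return steps

# One second of audio: 4 codebooks at 50 Hz.
frame_rate, num_codebooks = 50, 4
codes = [[f"c{k}_t{t}" for t in range(frame_rate)] for k in range(num_codebooks)]
steps = delay_pattern(codes)

flattened_steps = frame_rate * num_codebooks  # one token per step
delayed_steps = len(steps)                    # all codebooks per step
print(flattened_steps, delayed_steps)  # 200 53
```

Under this assumption, one second of audio costs 50 steps plus a constant offset of 3, compared with 200 steps if every token were predicted one at a time.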

### **API**
Audiocraft provides a simple API and four pre-trained models:
- `small`: 300M model, text to music only - [🤗 Hub](https://huggingface.co/facebook/musicgen-small)
- `medium`: 1.5B model, text to music only - [🤗 Hub](https://huggingface.co/facebook/musicgen-medium)
- `melody`: 1.5B model, text to music and text+melody to music - [🤗 Hub](https://huggingface.co/facebook/musicgen-melody)
- `large`: 3.3B model, text to music only - [🤗 Hub](https://huggingface.co/facebook/musicgen-large)
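
As a rough guide for choosing a checkpoint, the memory needed for the weights alone can be estimated from the parameter counts listed above. This is a back-of-the-envelope sketch: it ignores the EnCodec and text-encoder weights, activations, and framework overhead, so real usage will be higher. The helper name `weight_gib` is made up for illustration.

```python
# Parameter counts taken from the model list above.
PARAMS = {"small": 300e6, "medium": 1.5e9, "melody": 1.5e9, "large": 3.3e9}

def weight_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30

for name, n in PARAMS.items():
    fp32 = weight_gib(n, 4)  # 4 bytes per parameter
    fp16 = weight_gib(n, 2)  # 2 bytes per parameter
    print(f"{name:>6}: fp32 ~{fp32:.1f} GiB, fp16 ~{fp16:.1f} GiB")
```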

In [None]:
!head -n 3 /proc/meminfo

MemTotal:       13294252 kB
MemFree:         5584756 kB
MemAvailable:   12197240 kB


### **Setting**

In [None]:
!python3 -m pip install -U git+https://github.com/facebookresearch/audiocraft#egg=audiocraft

In [None]:
from audiocraft.models import musicgen
from audiocraft.utils.notebook import display_audio
import torch

In [None]:
!unzip melody.zip

Archive:  melody.zip
   creating: melody/
  inflating: __MACOSX/._melody       
  inflating: melody/melody1.mp3      
  inflating: __MACOSX/melody/._melody1.mp3  
  inflating: melody/melody2.wav      
  inflating: __MACOSX/melody/._melody2.wav  
  inflating: melody/melody3.wav      
  inflating: __MACOSX/melody/._melody3.wav  


### **Text to Music (Text Conditioning)**

In [None]:
model = musicgen.MusicGen.get_pretrained('medium', device='cuda')
model.set_generation_params(duration=8)

Downloading (…)ssion_state_dict.bin:   0%|          | 0.00/236M [00:00<?, ?B/s]

Downloading state_dict.bin:   0%|          | 0.00/3.68G [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

In [None]:
text_prompt = 'crazy EDM, heavy bang'

res = model.generate([text_prompt], progress=True)
display_audio(res, 32000)



In [None]:
text_prompt_list = ['crazy EDM, heavy bang',
                    'classic reggae track with an electronic guitar solo',
                    'lofi slow bpm electro chill with organic samples',
                    'rock with saturated guitars, a heavy bass line and crazy drum break and fills.',
                    'earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves',
]

res = model.generate(text_prompt_list, progress=True)
display_audio(res, 32000)


### **Text to Music (Melody & Text Conditioning)**
We now experiment with chroma-based melody conditioning. We condition on famous melodies from classical music along with a new text description, which lets the model reinterpret the melody in any genre or style. This section uses the 1.5B `melody` model, which supports both melody and text conditioning.
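
For readers unfamiliar with chroma features: a chromagram folds pitch into 12 classes (C, C#, ..., B) and discards the octave, which is roughly why a melody can survive a change of genre or instrumentation. The sketch below shows only that folding step for individual frequencies. It is not audiocraft's extractor, and the helper name `pitch_class` is made up for illustration.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to its nearest pitch class, ignoring the octave."""
    # MIDI note 69 is A4; each semitone is a factor of 2**(1/12).
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return NOTE_NAMES[midi % 12]

# The same note in three octaves lands in one chroma bin.
print([pitch_class(f) for f in (220.0, 440.0, 880.0)])     # ['A', 'A', 'A']
# Notes of a C-major arpeggio (C4, E4, G4).
print([pitch_class(f) for f in (261.63, 329.63, 392.00)])  # ['C', 'E', 'G']
```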





In [None]:
import torchaudio

In [None]:
model = musicgen.MusicGen.get_pretrained('melody', device='cuda')
model.set_generation_params(duration=8)

Downloading (…)ssion_state_dict.bin:   0%|          | 0.00/236M [00:00<?, ?B/s]

Downloading state_dict.bin:   0%|          | 0.00/2.77G [00:00<?, ?B/s]

Downloading: "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/955717e8-8726e21a.th" to /root/.cache/torch/hub/checkpoints/955717e8-8726e21a.th
100%|██████████| 80.2M/80.2M [00:00<00:00, 153MB/s]


In [None]:
# melody.zip was extracted to melody/ above, so load the file from there
melody, sr = torchaudio.load('/content/melody/melody1.mp3')
descriptions = ['happy rock', 'energetic EDM', 'sad jazz']

In [None]:
# Add a batch dimension and repeat the melody once per description
res = model.generate_with_chroma(descriptions, melody[None].expand(len(descriptions), -1, -1), sr)
display_audio(res, 32000)

# **Bark** : Generating Multilingual Speech

code: https://github.com/suno-ai/bark \
docs: https://huggingface.co/docs/transformers/main/en/model_doc/bark

Bark is a transformer-based text-to-audio model created by [Suno](https://suno.ai). Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints, which are ready for inference and available for commercial use.

### **Huggingface Demo**

In [None]:
# https://huggingface.co/spaces/suno/bark

### **Setting**

In [None]:
!pip install git+https://github.com/suno-ai/bark.git

In [None]:
!pip uninstall -y torch torchvision torchaudio

Found existing installation: torch 2.0.1+cu118
Uninstalling torch-2.0.1+cu118:
  Successfully uninstalled torch-2.0.1+cu118
Found existing installation: torchvision 0.15.2+cu118
Uninstalling torchvision-0.15.2+cu118:
  Successfully uninstalled torchvision-0.15.2+cu118
Found existing installation: torchaudio 2.0.2+cu118
Uninstalling torchaudio-2.0.2+cu118:
  Successfully uninstalled torchaudio-2.0.2+cu118


In [None]:
!pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

Looking in indexes: https://download.pytorch.org/whl/nightly/cu118
Collecting torch
  Downloading https://download.pytorch.org/whl/nightly/cu118/torch-2.1.0.dev20230726%2Bcu118-cp310-cp310-linux_x86_64.whl (2319.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 GB 570.8 kB/s eta 0:00:00
Collecting torchvision
  Downloading https://download.pytorch.org/whl/nightly/cu118/torchvision-0.16.0.dev20230726%2Bcu118-cp310-cp310-linux_x86_64.whl (6.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 68.8 MB/s eta 0:00:00
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/nightly/cu118/torchaudio-2.1.0.dev20230726%2Bcu118-cp310-cp310-linux_x86_64.whl (4.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.8/4.8 MB 57.9 MB/s eta 0:00:00
Collecting pytorch-triton==2.1.0+9e3e10c5ed (from torch)
  Downloading https://download.pytor

In [None]:
!unzip voice_presets.zip

Archive:  voice_presets.zip
   creating: voice_presets/
  inflating: __MACOSX/._voice_presets  
  inflating: voice_presets/.DS_Store  
  inflating: __MACOSX/voice_presets/._.DS_Store  
  inflating: voice_presets/ko_speaker_0.mp3  
  inflating: __MACOSX/voice_presets/._ko_speaker_0.mp3  
  inflating: voice_presets/ko_speaker_1.mp3  
  inflating: __MACOSX/voice_presets/._ko_speaker_1.mp3  
  inflating: voice_presets/en_speaker_1.npz  
  inflating: __MACOSX/voice_presets/._en_speaker_1.npz  
  inflating: voice_presets/ko_speaker_0.npz  
  inflating: __MACOSX/voice_presets/._ko_speaker_0.npz  
  inflating: voice_presets/ko_speaker_1.npz  
  inflating: __MACOSX/voice_presets/._ko_speaker_1.npz  


In [None]:
from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

preload_models()

Downloading text_2.pt:   0%|          | 0.00/5.35G [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading coarse_2.pt:   0%|          | 0.00/3.93G [00:00<?, ?B/s]

Downloading fine_2.pt:   0%|          | 0.00/3.74G [00:00<?, ?B/s]

Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /root/.cache/torch/hub/checkpoints/encodec_24khz-d7cc33bc.th
100%|██████████| 88.9M/88.9M [00:00<00:00, 147MB/s]


### **Text to Speech (Text conditioning)**
Bark picks up on the following cues in the prompt text (some known examples):
- `[laughter]`
- `[laughs]`
- `[sighs]`
- `[music]`
- `[gasps]`
- `[clears throat]`
- `—` or `...` for hesitations
- `♪` for song lyrics
- CAPITALIZATION for emphasis of a word
- `[MAN]` and `[WOMAN]` to bias Bark toward male and female speakers, respectively
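
The conventions listed above are plain text, so prompts can be assembled programmatically. The helper below is a hypothetical convenience: neither `build_prompt` nor `KNOWN_TAGS` is part of Bark. It only concatenates strings and checks tags against the list in this section.

```python
# Non-speech tags from the list above; Bark may respond to others as well.
KNOWN_TAGS = {"[laughter]", "[laughs]", "[sighs]", "[music]", "[gasps]", "[clears throat]"}
SPEAKER_TAGS = {"[MAN]", "[WOMAN]"}

def build_prompt(text, speaker=None, lead_tag=None, trail_tag=None, sing=False):
    """Assemble a Bark-style text prompt from plain parts."""
    for tag in (lead_tag, trail_tag):
        if tag is not None and tag not in KNOWN_TAGS:
            raise ValueError(f"unknown non-speech tag: {tag}")
    if speaker is not None and speaker not in SPEAKER_TAGS:
        raise ValueError(f"unknown speaker tag: {speaker}")
    if sing:
        text = f"♪ {text} ♪"  # musical notes mark song lyrics
    parts = [speaker, lead_tag, text, trail_tag]
    return " ".join(p for p in parts if p)

prompt = build_prompt("uh ... and I like pizza.", speaker="[WOMAN]", trail_tag="[laughs]")
print(prompt)  # [WOMAN] uh ... and I like pizza. [laughs]
```

The resulting string can be passed to `generate_audio` in the same way as the hand-written prompts in this notebook.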

In [None]:
text_prompt = """
     Hello, my name is james. And, uh — and I like pizza. [laughs]
     But I also have other interests such as playing tic tac toe.
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 603/603 [00:10<00:00, 58.59it/s]
100%|██████████| 31/31 [00:28<00:00,  1.08it/s]


In [None]:
text_prompt = """
     [WOMAN] uh — and I like pizza. [laughs]
     But I also have other interests such as playing tic tac toe.
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 508/508 [00:07<00:00, 66.88it/s]
100%|██████████| 26/26 [00:22<00:00,  1.15it/s]


### **Text to Speech (Voice & Text Conditioning)**

In [None]:
# without voice condition
text_prompt = """
    I have a silky smooth voice, and today I will tell you about
    the exercise regimen of the common sloth.
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 359/359 [00:06<00:00, 53.83it/s]
100%|██████████| 18/18 [00:15<00:00,  1.15it/s]


In [None]:
text_prompt = """
    I have a silky smooth voice, and today I will tell you about
    the exercise regimen of the common sloth.
"""
# with voice condition: history_prompt points at a preset .npz file
audio_array = generate_audio(text_prompt, history_prompt="/content/voice_presets/en_speaker_1.npz")
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 423/423 [00:04<00:00, 94.87it/s]
100%|██████████| 22/22 [00:20<00:00,  1.08it/s]


In [None]:
# Korean prompt, roughly: "It's Hype Boy by NewJeans!"
text_prompt = """
    뉴진스의 하입보이요!
"""
audio_array = generate_audio(text_prompt, history_prompt="/content/voice_presets/ko_speaker_0.npz")
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 134/134 [00:01<00:00, 82.24it/s]
100%|██████████| 7/7 [00:07<00:00,  1.13s/it]


### **Multilingual Speech Generation**

| Language | Status |
| --- | :---: |
| English (en) | ✅ |
| German (de) | ✅ |
| Spanish (es) | ✅ |
| French (fr) | ✅ |
| Hindi (hi) | ✅ |
| Italian (it) | ✅ |
| Japanese (ja) | ✅ |
| Korean (ko) | ✅ |
| Polish (pl) | ✅ |
| Portuguese (pt) | ✅ |
| Russian (ru) | ✅ |
| Turkish (tr) | ✅ |
| Chinese, simplified (zh) | ✅ |

In [None]:
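# Korean prompt, roughly: "Chuseok is my favorite holiday. I can rest for a
# few days and spend time with friends and family."
text_prompt = """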
    추석은 내가 가장 좋아하는 명절이다. 나는 며칠 동안 휴식을 취하고 친구 및 가족과 시간을 보낼 수 있습니다.
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 530/530 [00:07<00:00, 68.66it/s]
100%|██████████| 27/27 [00:26<00:00,  1.03it/s]


In [None]:
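# Japanese prompt, roughly: "Ginkaku-ji is a splendid Buddhist temple in
# Higashiyama Ward, Kyoto, designated as part of Japan's cultural heritage.
# Its gardens and buildings are especially beautiful and famous for it."
text_prompt = """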
    銀閣寺は、京都市東山区に位置する見事な仏教寺院で、日本の文化遺産に指定されています。庭園や建築物が特に美しく、その美しさで有名です。
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 642/642 [00:07<00:00, 86.22it/s]
100%|██████████| 33/33 [00:30<00:00,  1.08it/s]


### **Non-verbal Sound Generation**

In [None]:
text_prompt = "[clears throat] Hello uh ..., my dog is cute [laughter]"

audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 643/643 [00:08<00:00, 73.46it/s]
100%|██████████| 33/33 [00:29<00:00,  1.11it/s]


### **Music Generation**

In [None]:
text_prompt = """
    ♪ classic reggae track with an electronic guitar solo ♪
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 502/502 [00:05<00:00, 90.55it/s]
100%|██████████| 26/26 [00:23<00:00,  1.09it/s]


### **Audio AI Timeline**

https://github.com/archinetai/audio-ai-timeline