# **MusicGen** : Simple and Controllable Music Generation

paper: https://arxiv.org/abs/2306.05284 \
code: https://github.com/facebookresearch/audiocraft \
docs: https://huggingface.co/docs/transformers/main/en/model_doc/musicgen \
musicgen api info: https://replicate.com/facebookresearch/musicgen/api

Audiocraft provides the code and models for MusicGen, [a simple and controllable model for music generation](https://arxiv.org/abs/2306.05284). MusicGen is a single-stage auto-regressive
Transformer model trained over a 32 kHz [EnCodec tokenizer](https://github.com/facebookresearch/encodec) with 4 codebooks sampled at 50 Hz. Unlike existing methods such as [MusicLM](https://arxiv.org/abs/2301.11325), MusicGen doesn't require a self-supervised semantic representation, and it generates
all 4 codebooks in one pass. By introducing a small delay between the codebooks, the authors show that the codebooks can be predicted
in parallel, which leaves only 50 auto-regressive steps per second of audio.
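
The effect of the codebook delay on the number of decoding steps can be illustrated with a small, self-contained sketch. It assumes the simplest layout, in which codebook `k` is shifted right by `k` steps. The function name `delay_pattern` is made up for this illustration and is not part of audiocraft.

```python
from typing import List, Optional


def delay_pattern(num_codebooks: int, num_frames: int) -> List[List[Optional[int]]]:
    """For each decoding step, list the frame index that each codebook predicts.

    Codebook k lags k steps behind codebook 0; None marks a padding slot.
    (Illustrative layout only, assumed for this sketch.)
    """
    num_steps = num_frames + num_codebooks - 1
    steps = []
    for step in range(num_steps):
        row = []
        for k in range(num_codebooks):
            frame = step - k  # frame handled by codebook k at this step
            row.append(frame if 0 <= frame < num_frames else None)
        steps.append(row)
    return steps


pattern = delay_pattern(num_codebooks=4, num_frames=5)
for step, row in enumerate(pattern):
    print(step, row)

# Step counts for one second of audio at the 50 Hz frame rate:
frames_per_second = 50
delayed_steps = len(delay_pattern(4, frames_per_second))  # frames + (codebooks - 1)
flattened_steps = 4 * frames_per_second                   # one token per step
print(delayed_steps, flattened_steps)
```

The extra `num_codebooks - 1` steps are paid once per sequence, so for long clips the cost approaches 50 steps per second, compared with 200 if the four codebooks were flattened into a single token stream.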

### **Models**
- `small`: 300M model, text to music only - [🤗 Hub](https://huggingface.co/facebook/musicgen-small)
- `medium`: 1.5B model, text to music only - [🤗 Hub](https://huggingface.co/facebook/musicgen-medium)
- `melody`: 1.5B model, text to music and text+melody to music - [🤗 Hub](https://huggingface.co/facebook/musicgen-melody)
- `large`: 3.3B model, text to music only - [🤗 Hub](https://huggingface.co/facebook/musicgen-large)

In [1]:
!head -n 3 /proc/meminfo

MemTotal:       13294252 kB
MemFree:        10341576 kB
MemAvailable:   12443988 kB
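
To read these numbers more easily when deciding which checkpoint fits on the machine, the output can be converted to GiB. The sketch below is only a convenience helper; `parse_meminfo` is a hypothetical name, and the sample text is the output shown above.

```python
def parse_meminfo(text: str) -> dict:
    """Parse '/proc/meminfo'-style lines into a {field: GiB} mapping."""
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        parts = value.split()
        if not parts:
            continue
        kib = int(parts[0])  # values are reported in kB (1024-byte units)
        result[key.strip()] = kib / (1024 ** 2)
    return result


sample = """MemTotal:       13294252 kB
MemFree:        10341576 kB
MemAvailable:   12443988 kB"""

mem = parse_meminfo(sample)
for name, gib in mem.items():
    print(f"{name}: {gib:.2f} GiB")
```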


### **Setting**

In [2]:
!python3 -m pip install -U git+https://github.com/facebookresearch/audiocraft#egg=audiocraft

Collecting audiocraft
  Cloning https://github.com/facebookresearch/audiocraft to /tmp/pip-install-18musrwx/audiocraft_2b2147b27bff4f5f911158dc5f2a56aa
  Running command git clone --filter=blob:none --quiet https://github.com/facebookresearch/audiocraft /tmp/pip-install-18musrwx/audiocraft_2b2147b27bff4f5f911158dc5f2a56aa
  Resolved https://github.com/facebookresearch/audiocraft to commit e96018613ac82b1afe0f0cce7861dfe08ba2b3bf
  Preparing metadata (setup.py) ... done
Collecting av (from audiocraft)
  Downloading av-10.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (31.0 MB)
Collecting einops (from audiocraft)
  Downloading einops-0.6.1-py3-none-any.whl (42 kB)
Collecting flashy>=0.0.1 (from audiocraft)
  Dow

In [3]:
from audiocraft.models import musicgen
from audiocraft.utils.notebook import display_audio
import torch

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [12]:
%ls

MyDrive/


In [13]:
%cd MyDrive
%ls

/content/drive/MyDrive
 2023년강의일정표.gsheet     논문/
 ai_expert/                  연금_2022-12-14.gsheet
'Colab Notebooks'/           체중.gsheet
'가계부 2022-10-31.gsheet'  '투자포트폴리오 2022-12-20.gsheet'


In [14]:
%cd ai_expert

/content/drive/MyDrive/ai_expert


In [15]:
%ls

'13. 한보형 교수님'/  '16_윤성의 교수님'/  '20_유승주 교수님'/
'14. 김선주 교수님'/  '17_주한별 교수님'/   21_multimodal/
'15_서홍석 교수님'/   '18_홍승훈 교수님'/


In [16]:
%cd 21_multimodal

/content/drive/MyDrive/ai_expert/21_multimodal


In [17]:
%ls

 DocumentAI_OCR.ipynb          SAM_application.pdf
 HF_Agent.pdf                  SAMSUNG/
 melody.zip                    Transformers_Agent_예시.ipynb
 Multimodal_Audio.ipynb        voice_presets.zip
 Multi-modal-with-SAM.ipynb    원본/
'Remove_Fill Anything.ipynb'


In [18]:
%cd SAMSUNG

/content/drive/MyDrive/ai_expert/21_multimodal/SAMSUNG


In [19]:
%ls

 DocumentAI_OCR.ipynb   Multimodal_Audio.ipynb        voice_presets.zip
 DocumentAI_OCR.zip    'Remove_Fill Anything.ipynb'
 melody_files.zip       sam.zip


In [20]:
!unzip melody_files.zip

Archive:  melody_files.zip
   creating: melody_files/
  inflating: __MACOSX/._melody_files  
   creating: melody_files/melody_files/
  inflating: melody_files/melody_files/melody1.mp3  
  inflating: melody_files/melody_files/melody2.wav  
  inflating: melody_files/melody_files/melody3.wav  


### **Text to Music (Text Conditioning)**

In [21]:
model = musicgen.MusicGen.get_pretrained('medium', device='cuda')
model.set_generation_params(duration=8)



Downloading state_dict.bin:   0%|          | 0.00/3.68G [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

Downloading (…)ssion_state_dict.bin:   0%|          | 0.00/236M [00:00<?, ?B/s]

In [22]:
text_prompt = 'crazy EDM, heavy bang'

res = model.generate([text_prompt], progress=True)
display_audio(res, 32000)
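
The `duration=8` setting and the `32000` passed to `display_audio` fix the size of what the model has to produce. The arithmetic can be sketched as follows, using the 32 kHz sample rate, 50 Hz frame rate, and 4 codebooks quoted in the introduction. The helper name `generation_budget` is hypothetical, and the mono `[batch, 1, samples]` waveform shape is an assumption about the returned tensor rather than something shown in this notebook.

```python
SAMPLE_RATE_HZ = 32_000  # EnCodec sample rate quoted in the introduction
FRAME_RATE_HZ = 50       # codebook frame rate quoted in the introduction
NUM_CODEBOOKS = 4


def generation_budget(duration_s: float, batch_size: int = 1) -> dict:
    """Estimate frames, tokens, and samples for a clip of the given duration."""
    frames = int(duration_s * FRAME_RATE_HZ)
    samples = int(duration_s * SAMPLE_RATE_HZ)
    return {
        "frames": frames,                    # time steps of the token grid
        "tokens": frames * NUM_CODEBOOKS,    # one token per codebook per frame
        "samples": samples,                  # audio samples per clip
        "waveform_shape": (batch_size, 1, samples),  # assumed mono layout
    }


budget = generation_budget(duration_s=8)
print(budget)
```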



In [23]:
text_prompt_list = ['crazy EDM, heavy bang',
                    'classic reggae track with an electronic guitar solo',
                    'lofi slow bpm electro chill with organic samples',
                    'rock with saturated guitars, a heavy bass line and crazy drum break and fills.',
                    'earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves',
]

res = model.generate(text_prompt_list, progress=True)
display_audio(res, 32000)


### **Text to Music (Melody & Text Conditioning)**
We now experiment with MusicGen's chroma-based melody conditioning. We condition on famous melodies from classical music along with a new text description to obtain interpretations in any genre or style, using the 1.5B MusicGen model with melody and text conditioning.
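
As a rough intuition for what a chroma representation keeps, the sketch below folds note frequencies onto the 12 pitch classes and discards the octave. It is a simplified illustration under that assumption: MusicGen derives its chromagram from the audio signal itself, not from a list of note frequencies, and the function names here are made up for the example.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz: float, a4_hz: float = 440.0) -> str:
    """Fold a frequency onto one of the 12 pitch classes, ignoring the octave."""
    midi = 69 + 12 * math.log2(freq_hz / a4_hz)  # fractional MIDI note number
    return PITCH_CLASSES[round(midi) % 12]


def chroma_vector(freqs_hz) -> list:
    """Count how often each pitch class occurs (a 12-bin histogram)."""
    vec = [0] * 12
    for freq in freqs_hz:
        vec[PITCH_CLASSES.index(pitch_class(freq))] += 1
    return vec


melody_hz = [261.63, 523.25, 329.63, 392.00, 440.0]  # C4, C5, E4, G4, A4
classes = [pitch_class(f) for f in melody_hz]
chroma = chroma_vector(melody_hz)
print(classes)
print(chroma)
```

C4 and C5 land in the same bin, which is why a chroma-conditioned model can follow a melody while leaving timbre, register, and style to the text prompt.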





In [None]:
import torchaudio

In [None]:
model = musicgen.MusicGen.get_pretrained('melody', device='cuda')
model.set_generation_params(duration=8)

In [None]:
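# melody_files.zip was unpacked to melody_files/melody_files/ in the current
# working directory (see the unzip output above); adjust the path if yours differs.
melody, sr = torchaudio.load('melody_files/melody_files/melody1.mp3')
descriptions = ['happy rock', 'energetic EDM', 'sad jazz']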

In [None]:
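# Repeat the single melody once per description so the batch sizes match.
melody_batch = melody[None].expand(len(descriptions), -1, -1)
res = model.generate_with_chroma(descriptions, melody_batch, sr)
display_audio(res, 32000)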

# **Bark** : Generating Multilingual Speech

code: https://github.com/suno-ai/bark \
docs: https://huggingface.co/docs/transformers/main/en/model_doc/bark

Bark is a transformer-based text-to-audio model created by [Suno](https://suno.ai). Bark can generate highly realistic, multilingual speech as well as other audio, including music, background noise, and simple sound effects. The model can also produce nonverbal sounds such as laughing, sighing, and crying. To support the research community, Suno provides pretrained model checkpoints that are ready for inference and available for commercial use.

### **Hugging Face Demo**

In [1]:
# https://huggingface.co/spaces/suno/bark

### **Setting**

In [2]:
!pip install git+https://github.com/suno-ai/bark.git

Collecting git+https://github.com/suno-ai/bark.git
  Cloning https://github.com/suno-ai/bark.git to /tmp/pip-req-build-1wn5or8s
  Running command git clone --filter=blob:none --quiet https://github.com/suno-ai/bark.git /tmp/pip-req-build-1wn5or8s
  Resolved https://github.com/suno-ai/bark.git to commit 56b0ba13f7c281cbffa07ea9abf7b30273a60b6a
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting boto3 (from suno-bark==0.0.1a0)
  Downloading boto3-1.28.28-py3-none-any.whl (135 kB)
Collecting funcy (from suno-bark==0.0.1a0)
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Collecting botocore<1.32.0,>=1.31.28 (from boto3->suno-bark==0.0.1a0)
  Downloading botocore-1.31.28-py3-none-any.w

In [3]:
!pip uninstall -y torch torchvision torchaudio

Found existing installation: torch 2.0.1+cu118
Uninstalling torch-2.0.1+cu118:
  Successfully uninstalled torch-2.0.1+cu118
Found existing installation: torchvision 0.15.2+cu118
Uninstalling torchvision-0.15.2+cu118:
  Successfully uninstalled torchvision-0.15.2+cu118
Found existing installation: torchaudio 2.0.2+cu118
Uninstalling torchaudio-2.0.2+cu118:
  Successfully uninstalled torchaudio-2.0.2+cu118


In [4]:
!pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

Looking in indexes: https://download.pytorch.org/whl/nightly/cu118
Collecting torch
  Downloading https://download.pytorch.org/whl/nightly/cu118/torch-2.1.0.dev20230816%2Bcu118-cp310-cp310-linux_x86_64.whl (2321.4 MB)
Collecting torchvision
  Downloading https://download.pytorch.org/whl/nightly/cu118/torchvision-0.16.0.dev20230816%2Bcu118-cp310-cp310-linux_x86_64.whl (6.2 MB)
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/nightly/cu118/torchaudio-2.1.0.dev20230816%2Bcu118-cp310-cp310-linux_x86_64.whl (3.3 MB)
Collecting pytorch-triton==2.1.0+e6216047b8 (from torch)
  Downloading https://download.pytor

In [5]:
!unzip voice_presets.zip

Archive:  voice_presets.zip
   creating: voice_presets/
  inflating: __MACOSX/._voice_presets  
  inflating: voice_presets/.DS_Store  
  inflating: __MACOSX/voice_presets/._.DS_Store  
  inflating: voice_presets/ko_speaker_0.mp3  
  inflating: __MACOSX/voice_presets/._ko_speaker_0.mp3  
  inflating: voice_presets/ko_speaker_1.mp3  
  inflating: __MACOSX/voice_presets/._ko_speaker_1.mp3  
  inflating: voice_presets/en_speaker_1.npz  
  inflating: __MACOSX/voice_presets/._en_speaker_1.npz  
  inflating: voice_presets/ko_speaker_0.npz  
  inflating: __MACOSX/voice_presets/._ko_speaker_0.npz  
  inflating: voice_presets/ko_speaker_1.npz  
  inflating: __MACOSX/voice_presets/._ko_speaker_1.npz  


In [6]:
from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

preload_models()

Downloading text_2.pt:   0%|          | 0.00/5.35G [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading coarse_2.pt:   0%|          | 0.00/3.93G [00:00<?, ?B/s]

Downloading fine_2.pt:   0%|          | 0.00/3.74G [00:00<?, ?B/s]

Downloading: "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th" to /root/.cache/torch/hub/checkpoints/encodec_24khz-d7cc33bc.th
100%|██████████| 88.9M/88.9M [00:00<00:00, 155MB/s]


### **Text to Speech (Text Conditioning)**
Bark responds to the following cues embedded in the prompt text:
- `[laughter]`
- `[laughs]`
- `[sighs]`
- `[music]`
- `[gasps]`
- `[clears throat]`
- `—` or `...` for hesitations
- `♪` for song lyrics
- CAPITALIZATION for emphasis of a word
- `[MAN]` and `[WOMAN]` to bias Bark toward male and female speakers, respectively
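
When prompts are assembled programmatically, it can help to check which of the cues listed above a prompt actually contains. The following is a minimal sketch built only from that list; `find_cues` is a hypothetical helper, and a tag reported as "unknown" simply means it is not in the list above, not that Bark is certain to ignore it.

```python
import re

KNOWN_CUES = ["[laughter]", "[laughs]", "[sighs]", "[music]", "[gasps]", "[clears throat]"]
SPEAKER_TAGS = ["[MAN]", "[WOMAN]"]
EM_DASH = "\u2014"  # the long dash listed above as a hesitation marker


def find_cues(prompt: str) -> dict:
    """Report which documented cues appear in a Bark prompt."""
    bracketed = re.findall(r"\[[^\[\]]+\]", prompt)
    return {
        "cues": [t for t in bracketed if t in KNOWN_CUES],
        "speaker_tags": [t for t in bracketed if t in SPEAKER_TAGS],
        "unknown": [t for t in bracketed if t not in KNOWN_CUES + SPEAKER_TAGS],
        "has_hesitation": EM_DASH in prompt or "..." in prompt,
        "has_lyrics": "\u266a" in prompt,  # the musical note symbol
    }


report = find_cues("[WOMAN] uh ... and I like pizza. [laughs] [cough]")
print(report)
```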

In [7]:
text_prompt = """
     Hello, my name is james. And, uh — and I like pizza. [laughs]
     But I also have other interests such as playing tic tac toe.
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 612/612 [00:08<00:00, 74.81it/s]
100%|██████████| 31/31 [00:29<00:00,  1.04it/s]


In [8]:
text_prompt = """
     [WOMAN] uh — and I like pizza. [laughs]
     But I also have other interests such as playing tic tac toe.
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 548/548 [00:06<00:00, 88.49it/s]
100%|██████████| 28/28 [00:26<00:00,  1.06it/s]


### **Text to Speech (Voice & Text Conditioning)**

In [9]:
# Baseline: no voice preset (history_prompt) is given
text_prompt = """
    I have a silky smooth voice, and today I will tell you about
    the exercise regimen of the common sloth.
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 563/563 [00:08<00:00, 66.55it/s]
100%|██████████| 29/29 [00:27<00:00,  1.06it/s]


In [10]:
text_prompt = """
    I have a silky smooth voice, and today I will tell you about
    the exercise regimen of the common sloth.
"""
audio_array = generate_audio(text_prompt, history_prompt="/content/voice_presets/en_speaker_1.npz")
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 377/377 [00:06<00:00, 54.74it/s]
100%|██████████| 19/19 [00:19<00:00,  1.02s/it]


In [11]:
text_prompt = """
    뉴진스의 하입보이요!
"""
audio_array = generate_audio(text_prompt, history_prompt="/content/voice_presets/ko_speaker_0.npz")
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 75/75 [00:02<00:00, 35.79it/s]
100%|██████████| 4/4 [00:07<00:00,  1.87s/it]


In [12]:
# ffmpeg
# yt-dlp

### **Multilingual Speech Generation**

| Language | Status |
| --- | :---: |
| English (en) | ✅ |
| German (de) | ✅ |
| Spanish (es) | ✅ |
| French (fr) | ✅ |
| Hindi (hi) | ✅ |
| Italian (it) | ✅ |
| Japanese (ja) | ✅ |
| Korean (ko) | ✅ |
| Polish (pl) | ✅ |
| Portuguese (pt) | ✅ |
| Russian (ru) | ✅ |
| Turkish (tr) | ✅ |
| Chinese, simplified (zh) | ✅ |

In [13]:
text_prompt = """
    추석은 내가 가장 좋아하는 명절이다. 나는 며칠 동안 휴식을 취하고 친구 및 가족과 시간을 보낼 수 있습니다.
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 542/542 [00:07<00:00, 72.00it/s]
100%|██████████| 28/28 [00:28<00:00,  1.00s/it]


In [14]:
text_prompt = """
    銀閣寺は、京都市東山区に位置する見事な仏教寺院で、日本の文化遺産に指定されています。庭園や建築物が特に美しく、その美しさで有名です。
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 748/748 [00:11<00:00, 62.97it/s]
100%|██████████| 38/38 [00:37<00:00,  1.02it/s]


### **Non-verbal Sound Generation**

In [15]:
text_prompt = "[clears throat] Hello uh ..., my dog is cute [laughter]"

audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 746/746 [00:08<00:00, 84.86it/s]
100%|██████████| 38/38 [00:37<00:00,  1.01it/s]


### **Music Generation**

In [16]:
text_prompt = """
    ♪ classic reggae track with an electronic guitar solo ♪
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)

100%|██████████| 486/486 [00:08<00:00, 58.36it/s]
100%|██████████| 25/25 [00:22<00:00,  1.10it/s]


In [None]:
# # wav to npy / npz
# from scipy.io.wavfile import read
# import numpy as np
# sample_rate, samples = read("adios.wav")    # read() returns (sample_rate, data)
# a_numpy = np.array(samples, dtype=float)    # raw samples as a float array
# np.save("save_file_name.npy", a_numpy)      # single array -> .npy
# np.savez("save_file_name.npz", a_numpy)     # archive -> .npz, stored under "arr_0"
# # Note: these files hold raw samples only; they are not in the same format
# # as the speaker presets passed to history_prompt above.
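
If SciPy is not available, a 16-bit mono WAV file can also be read with the standard library alone. The sketch below writes a short test tone to a temporary file and reads it back as floats in the range [-1, 1]. It is an illustrative alternative rather than the approach used in this notebook, and the function names are made up for the example.

```python
import math
import os
import struct
import tempfile
import wave


def write_sine_wav(path, freq_hz=440.0, num_samples=240, sample_rate=24_000):
    """Write a short 16-bit mono sine tone so there is a file to read back."""
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(num_samples)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(frames)


def read_wav_as_floats(path):
    """Read a 16-bit mono WAV file and scale its samples to [-1, 1]."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono files")
        sample_rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    return sample_rate, [v / 32768.0 for v in ints]


with tempfile.TemporaryDirectory() as tmp:
    wav_path = os.path.join(tmp, "tone.wav")
    write_sine_wav(wav_path)
    sample_rate, samples = read_wav_as_floats(wav_path)

print(sample_rate, len(samples))
```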

### **Audio AI Timeline**

https://github.com/archinetai/audio-ai-timeline