In [1]:
#!pip install --upgrade --quiet pip
#!pip install --quiet transformers datasets[audio]

In [2]:
from transformers import MusicgenForConditionalGeneration
import torch


## Load the Model

The pre-trained MusicGen small, medium and large checkpoints can be loaded from the [pre-trained weights](https://huggingface.co/models?search=facebook/musicgen-) on the Hugging Face Hub. Change the repo id to match the checkpoint size you wish to load. We'll default to the small checkpoint, which is the fastest of the three but has the lowest audio quality:

In [3]:
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

We can then move the model to an accelerator device if one is available, or leave it on the CPU otherwise:

In [4]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model.to(device);

## Generation

MusicGen supports two generation modes: greedy and sampling. In practice, sampling gives significantly
better results than greedy decoding, so we encourage using sampling mode where possible. Sampling is enabled by setting
`do_sample=True` in the call to `MusicgenForConditionalGeneration.generate` (see below).
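
To make the difference between the two modes concrete, here is a minimal standard-library sketch of how a single token could be chosen from a vector of logits under each mode. It is an illustration only: the helper names (`greedy_pick`, `sample_pick`) and the four-entry `toy_logits` are hypothetical and are not part of the Transformers API, and the real `.generate` method does considerably more work per step.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, rng):
    """Sampling: draw a token at random, weighted by its probability."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical logits over a four-token vocabulary (illustration only)
toy_logits = [2.0, 1.5, 0.3, -1.0]
rng = random.Random(0)  # seeded so the sketch is repeatable

greedy_tokens = [greedy_pick(toy_logits) for _ in range(5)]
sampled_tokens = [sample_pick(toy_logits, rng) for _ in range(5)]

print(greedy_tokens)   # greedy picks the same token on every draw
print(sampled_tokens)  # sampled tokens can differ from draw to draw
```

Greedy decoding is deterministic, so repeated calls on the same logits always return the same token, whereas sampling can explore lower-probability tokens.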

### Unconditional Generation

The inputs for unconditional (or 'null') generation can be obtained through the method `MusicgenForConditionalGeneration.get_unconditional_inputs`. We can then run auto-regressive generation using the `.generate` method, specifying `do_sample=True` to enable sampling mode:

In [5]:
unconditional_inputs = model.get_unconditional_inputs(num_samples=1, max_new_tokens=256)

audio_values = model.generate(**unconditional_inputs, do_sample=True)

The audio outputs are a three-dimensional Torch tensor of shape `(batch_size, num_channels, sequence_length)`. To listen
to the generated audio samples, you can either play them in a Jupyter notebook:

In [6]:
from IPython.display import Audio

sampling_rate = model.config.audio_encoder.sampling_rate
Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)

Or save them as a `.wav` file using a third-party library, e.g. `scipy`:

In [8]:
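import scipy.io.wavfile  # import the submodule explicitly; a bare `import scipy` may not expose `scipy.io` on older versions

scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())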
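
The `max_new_tokens=256` setting used above controls how long the generated clip is. As a rough sketch, the duration can be estimated by dividing the token count by the frame rate of the audio encoder. The constants below (a 50 Hz frame rate and a 32 kHz sampling rate) are assumed values for the small checkpoint; in a real session they would be read from `model.config.audio_encoder.frame_rate` and `model.config.audio_encoder.sampling_rate`. The helper names are hypothetical.

```python
# Assumed values for facebook/musicgen-small (illustration only); read the
# real ones from model.config.audio_encoder in an actual session.
FRAME_RATE_HZ = 50          # assumed: audio-code frames per second of audio
SAMPLING_RATE_HZ = 32_000   # assumed: sample rate of the output waveform


def tokens_to_seconds(max_new_tokens, frame_rate=FRAME_RATE_HZ):
    """Estimate the clip duration (in seconds) for a given token budget."""
    return max_new_tokens / frame_rate


def seconds_to_tokens(seconds, frame_rate=FRAME_RATE_HZ):
    """Estimate the token budget needed for a target duration."""
    return int(seconds * frame_rate)


duration_s = tokens_to_seconds(256)
# Integer arithmetic avoids floating-point round-off in the sample count
approx_samples = 256 * SAMPLING_RATE_HZ // FRAME_RATE_HZ

print(duration_s)             # 5.12
print(approx_samples)         # 163840
print(seconds_to_tokens(10))  # 500
```

These figures are estimates: the exact `sequence_length` of the returned tensor may differ by a few frames.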

### Text-Conditional Generation

The model can generate audio samples conditioned on a text prompt by using the `MusicgenProcessor` to pre-process
the inputs. The pre-processed inputs can then be passed to the `.generate` method to generate text-conditional audio samples.
Again, we enable sampling mode by setting `do_sample=True`:

In [10]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)

Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)

The `guidance_scale` is used in classifier-free guidance (CFG). It sets the weighting between the conditional logits
(which are predicted from the text prompts) and the unconditional logits (which are predicted from an unconditional or
'null' prompt). A higher guidance scale encourages the model to generate samples that are more closely linked to the input
prompt, usually at the expense of audio quality. CFG is enabled by setting `guidance_scale > 1`. For best results,
use `guidance_scale=3` for text- and audio-conditional generation.
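
As a sketch of the weighting described above, one common formulation of CFG starts from the unconditional logits and moves towards the conditional logits by a factor of `guidance_scale`. The helper `apply_cfg` and the toy logit values are hypothetical, and the formula is an assumption based on the usual CFG definition rather than something taken from the MusicGen source.

```python
def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    """Combine logits as: uncond + guidance_scale * (cond - uncond).

    Assumed formulation of classifier-free guidance, for illustration only.
    """
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


# Hypothetical logits over a three-token vocabulary
cond = [2.0, 0.0, -1.0]    # predicted with the text prompt
uncond = [1.0, 0.5, -1.0]  # predicted with the 'null' prompt

no_guidance = apply_cfg(cond, uncond, guidance_scale=1)
guided = apply_cfg(cond, uncond, guidance_scale=3)

print(no_guidance)  # [2.0, 0.0, -1.0]
print(guided)       # [4.0, -1.0, -1.0]
```

Under this formulation, `guidance_scale=1` reproduces the conditional logits unchanged, while larger values amplify whatever the text prompt changed relative to the null prompt.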

### Audio-Prompted Generation

The same `MusicgenProcessor` can be used to pre-process an audio prompt that is used for audio continuation. In the
following example, we load an audio file using the 🤗 Datasets library, pre-process it using the processor class, 
and then forward the inputs to the model for generation:

In [11]:
from datasets import load_dataset

dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)
sample = next(iter(dataset))["audio"]

# take the first half of the audio sample
sample["array"] = sample["array"][: len(sample["array"]) // 2]

inputs = processor(
    audio=sample["array"],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)

Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)

To demonstrate batched audio-prompted generation, we'll slice our sample audio by two different proportions to give two audio samples of different lengths.
Since the input audio prompts vary in length, they will be *padded* to the length of the longest audio sample in the batch before being passed to the model.

To recover the final audio samples, the `audio_values` generated can be post-processed to remove padding by using the processor class once again:

In [13]:
sample = next(iter(dataset))["audio"]

# take the first quarter of the audio sample
sample_1 = sample["array"][: len(sample["array"]) // 4]

# take the first half of the audio sample
sample_2 = sample["array"][: len(sample["array"]) // 2]

inputs = processor(
    audio=[sample_1, sample_2],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs.to(device), do_sample=True, guidance_scale=3, max_new_tokens=256)

# post-process to remove padding from the batched audio
audio_values = processor.decode_audio(audio_values, padding_mask=inputs.padding_mask)

Audio(audio_values[0], rate=sampling_rate)
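
The padding and un-padding steps described above can be illustrated with a small standard-library sketch. It is not the processor's implementation: the helpers `pad_batch` and `strip_padding` are hypothetical, and both the right-side padding and the mask convention (`1` for real samples, `0` for padding) are assumptions made for this example.

```python
def pad_batch(clips, pad_value=0.0):
    """Right-pad every clip to the longest length in the batch.

    Returns the padded clips plus a mask per clip (1 = real sample, 0 = padding).
    """
    longest = max(len(clip) for clip in clips)
    padded = [list(clip) + [pad_value] * (longest - len(clip)) for clip in clips]
    masks = [[1] * len(clip) + [0] * (longest - len(clip)) for clip in clips]
    return padded, masks


def strip_padding(padded, masks):
    """Keep only the positions that the mask marks as real samples."""
    return [
        [value for value, keep in zip(clip, mask) if keep]
        for clip, mask in zip(padded, masks)
    ]


# Two toy "audio" clips of different lengths (illustration only)
clips = [[0.1, 0.2], [0.3, 0.4, 0.5, 0.6]]

padded, masks = pad_batch(clips)
recovered = strip_padding(padded, masks)

print(padded[0])   # [0.1, 0.2, 0.0, 0.0]
print(masks[0])    # [1, 1, 0, 0]
print(recovered)   # [[0.1, 0.2], [0.3, 0.4, 0.5, 0.6]]
```

Keeping the mask alongside the padded batch is what allows each sample to be restored to its own length after generation.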