<img src="https://wandb.me/logo-im-png" width="400" alt="Weights & Biases" />
<!--- @wandbcode{audiocraft} -->

# 🎸 Generating Music using [Audiocraft](https://github.com/facebookresearch/audiocraft) and W&B 🐝

<!--- @wandbcode{musicgen-colab} -->

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/audiocraft/AudioCraft.ipynb)

In this notebook, we demonstrate how to generate music and other types of audio from text prompts, or generate new music from existing music, using state-of-the-art models such as [MusicGen](https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md) and [AudioGen](https://github.com/facebookresearch/audiocraft/blob/main/docs/AUDIOGEN.md) from [Audiocraft](https://github.com/facebookresearch/audiocraft). We then play and visualize the generated audio using [Weights & Biases](https://wandb.ai/site).

If you want to know more about the underlying architectures for MusicGen and AudioGen and explore some cool audio samples generated by these models, you can check out [this W&B report](http://wandb.me/audiocraft_2mp).

In [None]:
# @title Install AudioCraft + WandB
!pip install -U git+https://git@github.com/facebookresearch/audiocraft#egg=audiocraft
!pip install -qq -U wandb

In [None]:
# @title
import os
import random
from tempfile import TemporaryDirectory

from scipy import signal
from scipy.io import wavfile

import torchaudio
from audiocraft.models import AudioGen, MusicGen, MultiBandDiffusion
from audiocraft.data.audio import audio_write

import wandb
import numpy as np
from tqdm.auto import tqdm
from google.colab import files
import matplotlib.pyplot as plt

In [None]:
# @title ## Audio Generation Configs

# @markdown In this section, you can use the form to choose the model, the prompts, and the other settings used to generate audio. Executing this cell initializes a [wandb run](https://docs.wandb.ai/guides/runs), which automatically logs all the generated audio along with the prompts and configs, so that your AI-generated music is never lost and your experiments are easy to reproduce and share.

# @markdown **Note:** If you provide prompts, you will also be asked to upload an audio file to condition the model on. If you don't want to supply an audio file as an additional condition, click the `cancel` button.

# @markdown ---
# @markdown WandB Project Name
project_name = "audiocraft" # @param {type:"string"}

wandb.init(project=project_name, job_type="musicgen/inference")

config = wandb.config

# @markdown Select the Model for audio generation supported by [AudioCraft](https://github.com/facebookresearch/audiocraft). You can select either the MusicGen model variants (great for generating music) or the AudioGen model variants (great for generating non-musical audio). Also note that you can run all variants of MusicGen except the `large` one on the free-tier Colab GPU.
model_name = "musicgen-small" # @param ["musicgen-small", "musicgen-medium", "musicgen-large", "musicgen-melody", "audiogen-medium"]
config.model_name = "facebook/" + model_name if model_name == "audiogen-medium" else model_name

# @markdown Whether to enable [MultiBand Diffusion](https://github.com/facebookresearch/audiocraft/blob/main/docs/MBD.md) or not. MultiBand diffusion is a collection of 4 models that can decode tokens from EnCodec tokenizer into waveform audio. Note that enabling this increases the time required to generate the audio.
enable_multi_band_diffusion = True # @param {type:"boolean"}
# Multi-Band Diffusion decodes MusicGen tokens, so warn and disable it
# only when it was requested for a non-MusicGen model.
if enable_multi_band_diffusion and "musicgen" not in model_name:
    wandb.termwarn("Multi-Band Diffusion is only available for MusicGen")
    enable_multi_band_diffusion = False
config.enable_multi_band_diffusion = enable_multi_band_diffusion

# @markdown ---
# @markdown ## Conditional Generation Configs

# @markdown The prompt for generating audio. You can give multiple prompts separated by `|` in the input. You can also leave it blank for unconditional generation.
config.prompts = "happy rock | energetic EDM | sad jazz" # @param {type:"string"}

descriptions = [prompt.strip() for prompt in config.prompts.split("|")]
config.is_unconditional = config.prompts.strip() == ""

input_audio, input_sampling_rate, wandb_input_audio = None, None, None
if not config.is_unconditional:
    input_audio_file = files.upload()
    if input_audio_file != {}:
        if config.model_name == "facebook/audiogen-medium":
            error = f"{config.model_name} does not support audio-based conditioning"
            raise ValueError(error)
        wandb_input_audio = wandb.Audio(list(input_audio_file.keys())[0])
        input_audio, input_sampling_rate = torchaudio.load(
            list(input_audio_file.keys())[0]
        )
        config.input_audio_available = True
    else:
        config.input_audio_available = False
else:
    if config.model_name == "facebook/audiogen-medium":
        error = f"{config.model_name} does not support unconditional generation"
        raise ValueError(error)

# @markdown Number of audio samples to generate. This is relevant only for unconditional generation, i.e., if `config.prompts` is left blank.
config.num_samples = 4 # @param {type:"slider", min:1, max:10, step:1}

# @markdown Specify the random seed
seed = None # @param {type:"raw"}

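max_seed = int(1024 * 1024 * 1024)
# Draw a random seed when none (or a non-integer) is given, then fold it
# into the range [0, max_seed).
if not isinstance(seed, int):
    seed = random.randint(1, max_seed)
seed = abs(seed) % max_seed
config.seed = seed

# Seed PyTorch's global RNG so that sampling uses the logged seed.
import torch
torch.manual_seed(seed)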

# @markdown ---
# @markdown ## Generation Parameters
# @markdown Use sampling if True, else do argmax decoding
config.use_sampling = True # @param {type:"boolean"}

# @markdown `top_k` used for sampling; restricts sampling to the `k` most likely tokens.
config.top_k = 250 # @param {type:"slider", min:0, max:1000, step:1}

# @markdown `top_p` used for sampling; restricts sampling to the smallest set of top tokens whose cumulative probability reaches `p`. When set to 0, `top_k` is used instead.
config.top_p = 0.0 # @param {type:"slider", min:0, max:1.0, step:0.01}

# @markdown Softmax temperature parameter
config.temperature = 1.0 # @param {type:"slider", min:0, max:1.0, step:0.01}

# @markdown Duration of the generated waveform
config.duration = 10 # @param {type:"slider", min:1, max:30, step:1}

# @markdown Coefficient used for classifier free guidance
config.cfg_coef = 3 # @param {type:"slider", min:1, max:100, step:1}

# @markdown Whether to perform two forward passes for classifier-free guidance instead of batching the conditional and unconditional inputs together. This affects how inputs are padded but seems to have little impact in practice.
config.two_step_cfg = False # @param {type:"boolean"}

# @markdown When doing extended generation (i.e., more than 30 seconds), by how much the audio should be extended at each step. Larger values mean less context is preserved, while smaller values require extra computation.
config.extend_stride = 0 # @param {type:"slider", min:0, max:30, step:1}
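
The trade-off described for `extend_stride` can be made concrete with a small, self-contained sketch. It assumes a sliding-window scheme in which every step generates at most 30 seconds and the window start advances by the stride each time; the function name `extension_windows` and its bookkeeping are illustrative assumptions, not AudioCraft's API or implementation.

```python
def extension_windows(duration, max_duration=30.0, extend_stride=18.0):
    """List the (start, end) windows, in seconds, needed to cover `duration`.

    Hypothetical helper: each window is at most `max_duration` long and
    starts `extend_stride` seconds after the previous one, so the overlap
    between consecutive windows acts as context for the next step.
    """
    if not 0 < extend_stride < max_duration:
        raise ValueError("extend_stride must be in (0, max_duration)")
    windows = []
    offset = 0.0   # start of the current window
    context = 0.0  # seconds of earlier audio already covering this window
    while offset + context < duration:
        chunk = min(duration - offset, max_duration)
        windows.append((offset, offset + chunk))
        context = chunk - extend_stride  # overlap carried into the next step
        offset += extend_stride
    return windows


# A 60-second clip with an 18-second stride needs three windows,
# each sharing 12 seconds of context with its predecessor.
for start, end in extension_windows(60, extend_stride=18):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

In this sketch, a smaller stride keeps more overlap between windows but needs more steps to cover the same duration, and a stride of 0 is rejected because the window would never advance.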

In [None]:
# @title Generate Audio using MusicGen or AudioGen

# @markdown In this section, the audio is generated using the configs specified in the previous section. If you wish to peek behind the curtain and check out the code, click on the `Show Code` button. To learn about the different APIs for audio generation, visit the [official Audiocraft documentation](https://facebookresearch.github.io/audiocraft/api_docs/audiocraft/index.html).

model = None
if config.model_name == "facebook/audiogen-medium":
    model = AudioGen.get_pretrained(config.model_name)
elif "musicgen" in config.model_name:
    model = MusicGen.get_pretrained(config.model_name.split("-")[-1])

multi_band_diffusion = None
if config.enable_multi_band_diffusion:
    multi_band_diffusion = MultiBandDiffusion.get_mbd_musicgen()

model.set_generation_params(
    use_sampling=config.use_sampling,
    top_k=config.top_k,
    top_p=config.top_p,
    temperature=config.temperature,
    duration=config.duration,
    cfg_coef=config.cfg_coef,
    two_step_cfg=config.two_step_cfg,
    extend_stride=config.extend_stride
)

generated_wav, tokens = None, None
if config.is_unconditional:
    # No prompts: draw `num_samples` clips without conditioning.
    # AudioGen was rejected in the config cell, so this is always MusicGen.
    generated_wav, tokens = model.generate_unconditional(
        num_samples=config.num_samples,
        progress=True,
        return_tokens=True
    )
elif input_audio is not None:
    # Prompts plus an uploaded file: condition on both the text and the
    # melody, repeating the input audio once per prompt. Audio-based
    # conditioning was rejected for AudioGen in the config cell.
    generated_wav, tokens = model.generate_with_chroma(
        descriptions,
        input_audio[None].expand(len(descriptions), -1, -1),
        input_sampling_rate,
        return_tokens=True
    )
elif "musicgen" in config.model_name:
    # Text-only conditioning; the tokens are kept for Multi-Band Diffusion.
    generated_wav, tokens = model.generate(
        descriptions,
        progress=True,
        return_tokens=True
    )
else:
    # AudioGen: text-only conditioning, no tokens requested.
    generated_wav = model.generate(
        descriptions,
        progress=True,
    )

generated_wav_diffusion = None
if config.enable_multi_band_diffusion:
    generated_wav_diffusion = multi_band_diffusion.tokens_to_wav(tokens)
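
To give some intuition for the values passed to `set_generation_params` above, here is a minimal pure-Python sketch of how `cfg_coef`, `use_sampling`, `temperature`, `top_k` and `top_p` could shape the choice of a single token. It uses the common textbook formulations of classifier-free guidance and top-k/top-p filtering; every function name here (`apply_cfg`, `pick_token`, and so on) is hypothetical, and the sketch is not AudioCraft's actual decoding code.

```python
import math
import random


def apply_cfg(cond_logits, uncond_logits, cfg_coef):
    """Classifier-free guidance: extrapolate from unconditional toward conditional logits."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature must be > 0 in this sketch."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero every index outside `keep` and rescale the rest to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def keep_top_k(probs, k):
    """Restrict the distribution to the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))


def keep_top_p(probs, p):
    """Restrict the distribution to the smallest set of top tokens with mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return _renormalize(probs, keep)


def pick_token(cond_logits, uncond_logits, cfg_coef=3.0, use_sampling=True,
               temperature=1.0, top_k=250, top_p=0.0, rng=random):
    """Choose one token index from a pair of conditional/unconditional logits."""
    logits = apply_cfg(cond_logits, uncond_logits, cfg_coef)
    if not use_sampling:
        return max(range(len(logits)), key=lambda i: logits[i])  # argmax decoding
    probs = softmax(logits, temperature)
    if top_p > 0:      # assumption: top_p takes precedence when it is non-zero
        probs = keep_top_p(probs, top_p)
    elif top_k > 0:    # assumption: top_k == 0 leaves the distribution untouched
        probs = keep_top_k(probs, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


cond = [2.0, 1.0, 0.5, -1.0]    # toy logits with the prompt
uncond = [0.0, 0.0, 0.0, 0.0]   # toy logits without the prompt
print(pick_token(cond, uncond, use_sampling=False))
print(pick_token(cond, uncond, top_k=2))
```

With `cfg_coef=1` the guided logits equal the conditional ones; larger coefficients push the distribution further toward what the prompt favors, while a lower temperature concentrates probability on the most likely tokens.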

In [None]:
# @title Log Audio to Weights & Biases Dashboard

# @markdown In this section, we log the generated audio to Weights & Biases, where you can listen to it and visualize it using an interactive audio player and waveform visualizer. Also, shoutout to [Atanu Sarkar](https://github.com/mratanusarkar) for building the spectrogram visualization function, which lets you view the spectrogram of the generated audio inside a [`wandb.Table`](https://docs.wandb.ai/guides/tables/tables-walkthrough).

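def get_spectrogram(audio_file, output_file):
    """Render the spectrogram of a WAV file as a borderless PNG and wrap it in a `wandb.Image`."""
    sample_rate, samples = wavfile.read(audio_file)
    frequencies, times, Sxx = signal.spectrogram(samples, sample_rate)

    # Convert power to decibels; the small offset avoids log10(0).
    log_Sxx = 10 * np.log10(Sxx + 1e-10)
    # Clip the color scale to the 5th-95th percentile range to suppress outliers.
    vmin = np.percentile(log_Sxx, 5)
    vmax = np.percentile(log_Sxx, 95)

    # Crop the frequency axis: from 20 Hz up to the highest band whose
    # mean level exceeds the 5th percentile of the mean spectrum.
    mean_spectrum = np.mean(log_Sxx, axis=1)
    threshold_low = np.percentile(mean_spectrum, 5)

    freq_indices = np.where(mean_spectrum > threshold_low)
    freq_min = 20
    freq_max = frequencies[freq_indices].max()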

    fig, ax = plt.subplots()
    cmap = plt.get_cmap('magma')

    ax.pcolormesh(
        times,
        frequencies,
        log_Sxx,
        shading='gouraud',
        cmap=cmap,
        vmin=vmin,
        vmax=vmax
    )
    ax.axis('off')
    ax.set_ylim([freq_min, freq_max])

    plt.subplots_adjust(left=0, right=1, top=1, bottom=0)
    plt.savefig(
        output_file, format='png', bbox_inches='tight', pad_inches=0
    )
    plt.close()

    return wandb.Image(output_file)


temp_dir = TemporaryDirectory()
columns = ["Model", "Prompt", "Generated-Audio", "Spectrogram", "Seed"]
if input_audio is not None:
    columns.insert(2, "Input-Audio")
if config.enable_multi_band_diffusion:
    columns.insert(4, "Generated-Audio-Diffusion")
    columns.insert(5, "Spectrogram-Diffusion")
wandb_table = wandb.Table(columns=columns)

for idx, wav in enumerate(generated_wav):

    file_name = os.path.join(temp_dir.name, str(idx))
    audio_write(
        file_name,
        wav.cpu(),
        model.sample_rate,
        strategy="loudness",
        loudness_compressor=True,
    )
    wandb_audio = wandb.Audio(file_name +  ".wav")
    wandb.log({"Generated-Audio": wandb_audio}, commit=False)

    file_name_diffusion, wandb_diffusion_audio = None, None
    if config.enable_multi_band_diffusion:
        file_name_diffusion = os.path.join(
            temp_dir.name, str(idx) + "_diffusion"
        )
        audio_write(
            file_name_diffusion,
            generated_wav_diffusion[idx].cpu(),
            model.sample_rate,
            strategy="loudness",
            loudness_compressor=True,
        )
        wandb_diffusion_audio = wandb.Audio(file_name_diffusion +  ".wav")
        wandb.log(
            {"Generated-Audio-Diffusion": wandb_diffusion_audio},
            commit=False
        )

    wandb.log({}, commit=True)

    desc = descriptions[idx] if len(descriptions) > 1 else config.prompts
    wandb_table_row = [
        model_name,
        desc,
        wandb_audio,
        get_spectrogram(
            audio_file=file_name +  ".wav",
            output_file=os.path.join(temp_dir.name, str(idx) + ".png")
        ),
        config.seed
    ]
    if input_audio is not None:
        wandb_table_row.insert(2, wandb_input_audio)
    if config.enable_multi_band_diffusion:
        wandb_table_row.insert(4, wandb_diffusion_audio)
        wandb_table_row.insert(
            5,
            get_spectrogram(
                audio_file=file_name_diffusion +  ".wav",
                output_file=os.path.join(
                    temp_dir.name, str(idx) + "_diffusion.png"
                )
            )
        )
    wandb_table.add_data(*wandb_table_row)

wandb.log({"Generated-Audio-Table": wandb_table})

wandb.finish()
temp_dir.cleanup()
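
The table layout depends on which optional columns are enabled, because both the column list and every row are built with positional `insert` calls. The following standalone sketch reproduces just that column logic so the resulting order can be checked without running a generation; `build_columns` is a hypothetical helper introduced for illustration, not part of the notebook or of W&B.

```python
def build_columns(has_input_audio, use_diffusion):
    """Mirror the notebook's positional inserts and return the final column order."""
    columns = ["Model", "Prompt", "Generated-Audio", "Spectrogram", "Seed"]
    if has_input_audio:
        columns.insert(2, "Input-Audio")
    if use_diffusion:
        columns.insert(4, "Generated-Audio-Diffusion")
        columns.insert(5, "Spectrogram-Diffusion")
    return columns


# Print the layout for every combination of the two optional features.
for has_input_audio in (False, True):
    for use_diffusion in (False, True):
        print(has_input_audio, use_diffusion, build_columns(has_input_audio, use_diffusion))
```

Because the indices are fixed, enabling both options places the two diffusion columns between `Generated-Audio` and `Spectrogram`, whereas without input audio they follow `Spectrogram`. Rows use the same indices, so the values stay aligned with their headers in either case.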

This is what the W&B Table looks like, with the interactive audio player, waveform visualizer and spectrogram visualization alongside the prompts and other configs. Note that the notebook picks a random seed if you leave it blank and records it in the run config, so you can reuse it in a later run.

![](https://github.com/wandb/examples/blob/example/audiocraft/colabs/audiocraft/assets/music_gen.png?raw=1)
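
The prompt and seed values shown in the table come from two small normalization steps in the config cell. They are sketched below as standalone functions using only the standard library; the names `parse_prompts` and `normalize_seed` are hypothetical wrappers around the notebook's inline logic, added here so the behavior can be tried in isolation.

```python
import random

MAX_SEED = 1024 * 1024 * 1024


def parse_prompts(raw):
    """Split a `|`-separated prompt string; blank input means unconditional generation."""
    descriptions = [part.strip() for part in raw.split("|")]
    is_unconditional = raw.strip() == ""
    return descriptions, is_unconditional


def normalize_seed(seed, rng=random):
    """Return a non-negative seed below MAX_SEED, drawing a random one if none is given."""
    if not isinstance(seed, int):
        seed = rng.randint(1, MAX_SEED)
    return abs(seed) % MAX_SEED


print(parse_prompts("happy rock | energetic EDM | sad jazz"))
print(normalize_seed(-42))
print(normalize_seed(None))
```

Passing the logged seed back into the form's seed field is what lets a later run start from the same value.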
