<a href="https://colab.research.google.com/github/sathyasubrahamanaya/100-pandas-puzzles/blob/master/MusicGen_Dreamboothing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MusicGen Dreamboothing

## Goal of this notebook

This notebook will teach you how to easily fine-tune MusicGen in a Dreambooth-like fashion, using 🤗 Transformers, 🤗 PEFT and annotation tools.




## TL;DR pointers

1. [Installation in one line](#installation) -> `!pip install --quiet git+https://github.com/ylacombe/musicgen-dreamboothing git+https://github.com/huggingface/transformers demucs msclap`
2. [Loading and cleaning the dataset]()
3. [Labeling the dataset](#s2st)
4. [Pre-processing the dataset](#s2tt)
5. [Training](#t2ts)

## Resources

1. [Original repository](https://github.com/facebookresearch/audiocraft)
2. [Dreamboothing MusicGen Repository](https://github.com/ylacombe/musicgen-dreamboothing)
3. [Musicgen docs in 🤗 Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen)
4. [Musicgen Melody docs in 🤗 Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen_melody)


**This notebook is a demonstration. Many more features are showcased in the [Dreamboothing MusicGen Repository](https://github.com/ylacombe/musicgen-dreamboothing), notably wandb tracking, e.g. [here](https://wandb.ai/ylacombe/musicgen_finetuning_experiments/runs/lk6x8k4u?nw=nwuserylacombe).**

## Presentation of the model

[MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen) is a state-of-the-art controllable text-to-music model, made of a single stage auto-regressive Transformer model trained over a 32kHz [Encodec](https://huggingface.co/facebook/encodec_32khz) tokenizer with 4 codebooks sampled at 50 Hz.
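To make these numbers concrete, here is a small sketch of the arithmetic that relates generated frames to audio duration. It assumes only the 32 kHz sampling rate and 50 Hz frame rate quoted above; the helper names are illustrative and not part of any library.

```python
SAMPLING_RATE_HZ = 32_000  # Encodec sampling rate quoted above
FRAME_RATE_HZ = 50         # codebook frames per second quoted above
NUM_CODEBOOKS = 4          # codes produced for each frame


def tokens_to_seconds(num_frames: int) -> float:
    """Approximate audio duration for a given number of generated frames."""
    return num_frames / FRAME_RATE_HZ


def seconds_to_tokens(seconds: float) -> int:
    """Approximate number of frames needed for a given duration."""
    return round(seconds * FRAME_RATE_HZ)


# Each frame covers 32000 / 50 = 640 raw audio samples.
samples_per_frame = SAMPLING_RATE_HZ // FRAME_RATE_HZ

print(tokens_to_seconds(256))  # 5.12
print(seconds_to_tokens(30))   # 1500
```

This is why `max_new_tokens=256`, used later in this notebook, yields roughly five seconds of audio.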

## Prepare the Environment

Throughout this tutorial, we'll use a GPU. The runtime is already configured to use the free 16GB T4 GPU provided through Google Colab Free Tier, so all you need to do is hit "Connect T4" in the top right-hand corner of the screen.

##### <a name="installation"> We just need to install the 🤗 Transformers package from the main branch and other requirements from the MusicGen Dreamboothing repository:</a>

In [1]:
!pip install --quiet git+https://github.com/ylacombe/musicgen-dreamboothing demucs msclap
!pip install --upgrade --force-reinstall msclap


You should link your Hugging Face account so that you can push model repositories to the Hub. This will allow you to save your trained models on the Hub and share them with the community.

Run the command below and then enter an authentication token from https://huggingface.co/settings/tokens. Create a new token if you do not have one already. You should make sure that this token has "write" privileges.

In [1]:
!git config --global credential.helper store
!huggingface-cli login



## Loading and cleaning the dataset


For this notebook, I'll use a tiny subset of this [royalty-free music dataset from Kaggle](https://www.kaggle.com/competitions/kaggle-pog-series-s01e02/overview).

This subset consists of 54 samples of around 30 seconds each, filtered from the original dataset to contain only "punk" music.

I then loaded this dataset locally using the [`datasets`](https://huggingface.co/docs/datasets/v2.17.0/en/index) library and pushed it to the Hugging Face Hub as [ylacombe/tiny-punk](https://huggingface.co/datasets/ylacombe/tiny-punk).


**Note:** You can find [here](https://github.com/ylacombe/musicgen-dreamboothing/tree/main?tab=readme-ov-file#how-do-i-use-audio-files-that-i-have-with-your-training-code) a tiny guide on how to push your own local music dataset on the HuggingFace hub.




In [2]:
from datasets import load_dataset

dataset = load_dataset("ylacombe/tiny-punk", split="clean")



Note how I've been able to load a specific split of the dataset in just one line.

Now that the dataset is ready, let's listen to some samples. In this dataset, the audio column is simply called `"audio"`.

In [3]:
from IPython.display import Audio

Audio(dataset[0]["audio"]["array"], rate=dataset[0]["audio"]["sampling_rate"])

By listening to the samples, you'll quickly notice that they contain both vocals and instrumentals.

However, MusicGen was **only trained on instrumentals and therefore can't be trained on vocals**. Luckily, Meta AI also open-sourced an audio separation model called [demucs](https://github.com/adefossez/demucs/tree/main).

Let's use it to keep only the instruments.

In [4]:
from demucs import pretrained
from demucs.apply import apply_model
from demucs.audio import convert_audio
from datasets import Audio
import torch

demucs = pretrained.get_model("htdemucs")
if torch.cuda.device_count() > 0:
    demucs.to("cuda:0")

audio_column_name = "audio"

def wrap_audio(audio, sr):
    return {"array": audio.cpu().numpy(), "sampling_rate": sr}

def filter_stems(batch, rank=None):
    device = "cpu" if torch.cuda.device_count() == 0 else "cuda:0"

    wavs = [
        convert_audio(
            torch.tensor(audio["array"][None], device=device).to(
                torch.float32
            ),
            audio["sampling_rate"],
            demucs.samplerate,
            demucs.audio_channels,
        ).T
        for audio in batch["audio"]
    ]
    wavs_length = [audio.shape[0] for audio in wavs]

    wavs = torch.nn.utils.rnn.pad_sequence(
        wavs, batch_first=True, padding_value=0.0
    ).transpose(1, 2)
    stems = apply_model(demucs, wavs)

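    # htdemucs returns stems in the order drums, bass, other, vocals:
    # drop the last stem (vocals), sum the rest and average channels to mono
    batch[audio_column_name] = [
        wrap_audio(s[:-1, :, :length].sum(0).mean(0), demucs.samplerate)
        for (s, length) in zip(stems, wavs_length)
    ]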

    return batch

num_proc = 1

dataset = dataset.map(
    filter_stems,
    batched=True,
    batch_size=8,
    with_rank=True,
    num_proc=num_proc,
)
dataset = dataset.cast_column(audio_column_name, Audio())

del demucs

That's great! In just a few lines of code, we:
- loaded the demucs model
- defined a function that applies it to a batch of samples and filters out vocals
- applied it to the whole dataset thanks to [`Dataset.map`](https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/main_classes#datasets.Dataset.map).
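As a minimal sketch of the mixing step inside `filter_stems`, the snippet below reproduces the "drop the last stem, sum the rest, average the channels" logic on plain Python lists instead of tensors. The function name and the toy values are illustrative assumptions; only the order of operations mirrors the cell above.

```python
def mix_instrumental(stems):
    """Sum every stem except the last one, then average channels to mono.

    `stems` is a nested list indexed as [source][channel][sample]; the last
    source is assumed to be the vocals, as in the cell above.
    """
    kept = stems[:-1]  # drop the vocals stem
    num_channels = len(kept[0])
    num_samples = len(kept[0][0])
    mono = []
    for t in range(num_samples):
        # sum over the kept sources for each channel, then average channels
        per_channel = [sum(src[c][t] for src in kept) for c in range(num_channels)]
        mono.append(sum(per_channel) / num_channels)
    return mono


# Toy input: 3 sources, 2 channels, 2 samples each.
toy_stems = [
    [[1.0, 2.0], [3.0, 4.0]],  # drums
    [[0.5, 0.5], [0.5, 0.5]],  # bass
    [[9.0, 9.0], [9.0, 9.0]],  # vocals (discarded)
]
print(mix_instrumental(toy_stems))  # [2.5, 3.5]
```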

Now let's listen to the filtered audio!

In [5]:
from IPython.display import Audio

Audio(dataset[0]["audio"]["array"], rate=dataset[0]["audio"]["sampling_rate"])

## Labeling the dataset


The previous step was mandatory, but the dataset is still missing music descriptions.


Here, we'll use [librosa](https://librosa.org/doc/latest/index.html) to estimate tempo and key, and CLAP similarity to pick a genre, an instrument and a mood.
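The key estimate used in this section is a simple heuristic: sum the chroma energy of each of the 12 pitch classes over time and keep the strongest one. Here is a minimal sketch of that idea on a toy chroma matrix written as plain lists (the values are made up for illustration). Note that this finds the dominant pitch class, which is only a rough proxy for the musical key.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def estimate_key(chroma):
    """Return the pitch class with the highest total chroma energy.

    `chroma` has one row per pitch class and one column per time frame.
    """
    energy = [sum(row) for row in chroma]
    strongest = max(range(len(energy)), key=energy.__getitem__)
    return PITCH_CLASSES[strongest]


# Toy chroma: every pitch class is quiet except G# (row index 8).
toy_chroma = [[0.1, 0.1, 0.1] for _ in PITCH_CLASSES]
toy_chroma[8] = [0.9, 0.8, 0.7]
print(estimate_key(toy_chroma))  # G#
```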

In [6]:
from utils import instrument_classes, genre_labels, mood_theme_classes
print("Genres", genre_labels)
print("Instruments:", instrument_classes)
print("Moods", mood_theme_classes)

Genres ['Blues, Boogie Woogie', 'Blues, Chicago Blues', 'Blues, Country Blues', 'Blues, Delta Blues', 'Blues, Electric Blues', 'Blues, Harmonica Blues', 'Blues, Jump Blues', 'Blues, Louisiana Blues', 'Blues, Modern Electric Blues', 'Blues, Piano Blues', 'Blues, Rhythm & Blues', 'Blues, Texas Blues', 'Brass & Military, Brass Band', 'Brass & Military, Marches', 'Brass & Military, Military', "Children's, Educational", "Children's, Nursery Rhymes", "Children's, Story", 'Classical, Baroque', 'Classical, Choral', 'Classical, Classical', 'Classical, Contemporary', 'Classical, Impressionist', 'Classical, Medieval', 'Classical, Modern', 'Classical, Neo-Classical', 'Classical, Neo-Romantic', 'Classical, Opera', 'Classical, Post-Modern', 'Classical, Renaissance', 'Classical, Romantic', 'Electronic, Abstract', 'Electronic, Acid', 'Electronic, Acid House', 'Electronic, Acid Jazz', 'Electronic, Ambient', 'Electronic, Bassline', 'Electronic, Beatdown', 'Electronic, Berlin-School', 'Electronic, Big Be

In [8]:
from msclap import CLAP
import librosa
import tempfile
import torchaudio
import random
import numpy as np
import os

clap_model = CLAP(version="2023", use_cuda=True)
instrument_embeddings = clap_model.get_text_embeddings(instrument_classes)
genre_embeddings = clap_model.get_text_embeddings(genre_labels)
mood_embeddings = clap_model.get_text_embeddings(mood_theme_classes)

def enrich_text(batch):
    audio, sampling_rate = (
        batch["audio"]["array"],
        batch["audio"]["sampling_rate"],
    )

    tempo, _ = librosa.beat.beat_track(y=audio, sr=sampling_rate)
    tempo = f"{str(np.round(tempo))} bpm"  # rough estimate; tempo is an array here, so it renders as e.g. "[103.] bpm"
    chroma = librosa.feature.chroma_stft(y=audio, sr=sampling_rate)
    key = np.argmax(np.sum(chroma, axis=1))
    key = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"][key]

    with tempfile.TemporaryDirectory() as tempdir:
        path = os.path.join(tempdir, "tmp.wav")
        torchaudio.save(path, torch.tensor(audio).unsqueeze(0), sampling_rate)
        audio_embeddings = clap_model.get_audio_embeddings([path])

    instrument = clap_model.compute_similarity(
        audio_embeddings, instrument_embeddings
    ).argmax(dim=1)[0]
    genre = clap_model.compute_similarity(
        audio_embeddings, genre_embeddings
    ).argmax(dim=1)[0]
    mood = clap_model.compute_similarity(
        audio_embeddings, mood_embeddings
    ).argmax(dim=1)[0]

    instrument = instrument_classes[instrument]
    genre = genre_labels[genre]
    mood = mood_theme_classes[mood]

    metadata = [key, tempo, instrument, genre, mood]

    random.shuffle(metadata)
    batch["metadata"] = ", ".join(metadata)
    return batch

dataset = dataset.map(
    enrich_text,
    desc="add metadata",
)

del clap_model, instrument_embeddings, genre_embeddings, mood_embeddings



Now let's look at what the metadata looks like:

In [9]:
print(dataset[0]["metadata"])

acousticbassguitar, [103.] bpm, groovy, Rock, Garage Rock, G#
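The instrument, genre and mood tags come from a zero-shot classification step: the audio embedding is compared with the text embedding of every candidate label, and the closest label wins. The sketch below illustrates that selection with cosine similarity on made-up two-dimensional vectors. It is an assumption-laden toy, not the internals of `msclap`, whose `compute_similarity` may scale the scores differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def pick_label(audio_embedding, label_embeddings):
    """Return the label whose embedding is most similar to the audio embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine_similarity(audio_embedding, label_embeddings[label]),
    )


# Hypothetical embeddings, for illustration only.
toy_labels = {"guitar": [1.0, 0.0], "piano": [0.0, 1.0]}
print(pick_label([0.9, 0.1], toy_labels))  # guitar
```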


**Note:**
The previous steps are resource-intensive and only need to be done once.
If you want to keep track of the enriched dataset, you can push it to the Hub under the name that you want (let's say `punk-data`):

`dataset.push_to_hub("punk-data")`

The dataset will be available under your Hugging Face handle. Mine is ylacombe, so the dataset would be available at `ylacombe/punk-data`.

## Dreamboothing

Now it's time to train your own music model!

We first need to load the processor, which takes care of tokenization, and the MusicGen model that we'll use.
Here, we're going to use [MusicGen-Melody](https://huggingface.co/facebook/musicgen-melody), a 1.5B version of MusicGen that can be conditioned on audio chroma.

In [10]:
from transformers import (
    AutoProcessor,
    AutoModelForTextToWaveform,
)

processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
model = AutoModelForTextToWaveform.from_pretrained("facebook/musicgen-melody")

model.freeze_text_encoder()
model.freeze_audio_encoder()


Let's listen to how the model does with `punk` music. This will be a great way to show what the model learns through the Dreamboothing process!

In [11]:
from IPython.display import Audio
import torch

device = torch.device("cuda:0" if torch.cuda.device_count()>0 else "cpu")

model.to(device)

inputs = processor(
    text=["80s punk and pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt",
).to(device)

audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)


Audio(audio_values.cpu().numpy().squeeze(), rate=32000)

The next steps are a bunch of pre-processing steps, namely:
1. Resample the audio samples if needed.
2. Add an instance prompt - the famous anchor term that will allow your model to learn the feature you want it to learn. Here, it'll be `punk` as I want my model to be better at generating punk songs like the ones from my dataset.
3. Tokenize the music descriptions
4. Encode the audio samples using Encodec.
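Step 2 amounts to prepending the anchor term to each description before tokenization. A minimal sketch, using the metadata string printed earlier as input (the helper name is illustrative):

```python
def build_prompt(instance_prompt, metadata):
    """Prepend the instance prompt (anchor term) to a metadata description."""
    return f"{instance_prompt}, {metadata}"


example_metadata = "acousticbassguitar, [103.] bpm, groovy, Rock, Garage Rock, G#"
print(build_prompt("punk", example_metadata))
# punk, acousticbassguitar, [103.] bpm, groovy, Rock, Garage Rock, G#
```

Because every training description starts with the same term, using that term in a prompt at inference time is what steers generation toward the fine-tuned style.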

In [12]:
from transformers import AutoFeatureExtractor
from datasets import Audio

instance_prompt = "punk"

# take audio_encoder_feature_extractor
audio_encoder_feature_extractor = AutoFeatureExtractor.from_pretrained(
    model.config.audio_encoder._name_or_path,
)

# resample audio if necessary
dataset_sampling_rate = dataset[0]["audio"]["sampling_rate"]

if dataset_sampling_rate != audio_encoder_feature_extractor.sampling_rate:
    dataset = dataset.cast_column(
        "audio",
        Audio(
            sampling_rate=audio_encoder_feature_extractor.sampling_rate
        ),
    )


# Preprocessing the datasets.
# We need to read the audio files as arrays and tokenize the targets.
def prepare_audio_features(batch):
    # build the text prompt: instance prompt followed by the metadata

    metadata = batch["metadata"]
    metadata = f"{instance_prompt}, {metadata}"
    batch["input_ids"] = processor.tokenizer(metadata)["input_ids"]

    # load audio
    target_sample = batch["audio"]
    labels = audio_encoder_feature_extractor(
        target_sample["array"], sampling_rate=target_sample["sampling_rate"]
    )
    batch["labels"] = labels["input_values"]

    # take length of raw audio waveform
    batch["target_length"] = len(target_sample["array"].squeeze())
    return batch


dataset = dataset.map(
    prepare_audio_features,
    remove_columns=dataset.column_names,
    num_proc=2,
    desc="preprocess datasets",
)


In [13]:

audio_decoder = model.audio_encoder
num_codebooks = model.decoder.config.num_codebooks
audio_encoder_pad_token_id = model.config.decoder.pad_token_id

pad_labels = torch.ones((1, 1, num_codebooks, 1)) * audio_encoder_pad_token_id

if torch.cuda.device_count() == 1:
    audio_decoder.to("cuda")

def apply_audio_decoder(batch):

    with torch.no_grad():
        labels = audio_decoder.encode(
            torch.tensor(batch["labels"]).to(audio_decoder.device)
        )["audio_codes"]

    # add pad token column
    labels = torch.cat(
        [pad_labels.to(labels.device).to(labels.dtype), labels], dim=-1
    )

    labels, delay_pattern_mask = model.decoder.build_delay_pattern_mask(
        labels.squeeze(0),
        audio_encoder_pad_token_id,
        labels.shape[-1] + num_codebooks,
    )

    labels = model.decoder.apply_delay_pattern_mask(labels, delay_pattern_mask)

    # the first timestep is associated with a row full of BOS, let's get rid of it
    batch["labels"] = labels[:, 1:].cpu()
    return batch

# Encodec doesn't truly support batching,
# so pass samples one by one to the GPU
dataset = dataset.map(
    apply_audio_decoder,
    num_proc=1,
    desc="Apply encodec",
)



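The cell above relies on MusicGen's delay pattern, in which codebook `k` is shifted `k` steps to the right so that the codebooks of one frame are predicted at successive decoding steps. The following is a simplified sketch of that shift on plain lists. It is not the `build_delay_pattern_mask` implementation, and the `-1` padding value is an arbitrary stand-in for the model's pad token id.

```python
PAD = -1  # stand-in for the pad token id (assumption for illustration)


def apply_delay_pattern(codes, pad=PAD):
    """Shift codebook k by k steps to the right, padding the gaps.

    `codes` is a list of K codebook rows, each of length T.
    Returns K rows of length T + K - 1.
    """
    num_codebooks = len(codes)
    delayed = []
    for k, row in enumerate(codes):
        left = [pad] * k                        # delay of k steps
        right = [pad] * (num_codebooks - 1 - k)  # keep all rows the same length
        delayed.append(left + list(row) + right)
    return delayed


toy_codes = [[1, 2, 3], [4, 5, 6]]
print(apply_delay_pattern(toy_codes))  # [[1, 2, 3, -1], [-1, 4, 5, 6]]
```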

We'll also add [LoRA adapters](https://huggingface.co/docs/peft/en/developer_guides/lora) on top of the model, thanks to [PEFT](https://huggingface.co/docs/peft). This is what allows us to train fast and with low GPU resources!



In [14]:
from peft import LoraConfig, get_peft_model

# TODO(YL): add modularity here
target_modules = (
    [
        "enc_to_dec_proj",
        "audio_enc_to_dec_proj",
        "k_proj",
        "v_proj",
        "q_proj",
        "out_proj",
        "fc1",
        "fc2",
    ]
    # one LM head and one embedding table per codebook
    + [f"lm_heads.{i}" for i in range(len(model.decoder.lm_heads))]
    + [f"embed_tokens.{i}" for i in range(len(model.decoder.lm_heads))]
)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=target_modules,
    lora_dropout=0.05,
    bias="none",
)
model.enable_input_require_grads()
model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 21,754,112 || all params: 1,577,038,658 || trainable%: 1.3794
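To see where such a small trainable fraction comes from, recall that LoRA adds two low-rank matrices to each targeted linear layer, of shapes `(r, d_in)` and `(d_out, r)`. The sketch below computes that overhead for a hypothetical 1024 x 1024 layer (the layer size is an assumption, not the actual MusicGen dimension) and re-derives the percentage printed above.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by LoRA to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


# Hypothetical layer size, chosen only for illustration.
print(lora_extra_params(1024, 1024, r=16))  # 32768, versus 1048576 frozen weights

# Figures reported by print_trainable_parameters() above.
trainable_params = 21_754_112
all_params = 1_577_038_658
trainable_pct = 100 * trainable_params / all_params
print(round(trainable_pct, 4))  # 1.3794
```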


There are two last steps before we finally move on to training!

1. Define the Trainer, i.e. the class that will take care of the training under the hood.
2. Define a collator, i.e. a class that will pad samples and assemble them into batches.
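The collator defined in the next cell pads label sequences of different lengths with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions do not contribute to the loss. A minimal sketch of that padding on plain lists (the helper name is illustrative):

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_label_sequences(sequences, pad_value=IGNORE_INDEX):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


print(pad_label_sequences([[1, 2, 3], [4]]))  # [[1, 2, 3], [4, -100, -100]]
```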


In [15]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
from dataclasses import dataclass

class MusicgenTrainer(Seq2SeqTrainer):
    def _pad_tensors_to_max_len(self, tensor, max_length):
        if self.tokenizer is not None and hasattr(self.tokenizer, "pad_token_id"):
            # If PAD token is not defined at least EOS token has to be defined
            pad_token_id = (
                self.tokenizer.pad_token_id
                if self.tokenizer.pad_token_id is not None
                else self.tokenizer.eos_token_id
            )
        else:
            if self.model.config.pad_token_id is not None:
                pad_token_id = self.model.config.pad_token_id
            else:
                raise ValueError(
                    "Pad_token_id must be set in the configuration of the model, in order to pad tensors"
                )

        padded_tensor = pad_token_id * torch.ones(
            (tensor.shape[0], max_length, tensor.shape[2]),
            dtype=tensor.dtype,
            device=tensor.device,
        )
        length = min(max_length, tensor.shape[1])
        padded_tensor[:, :length] = tensor[:, :length]
        return padded_tensor


@dataclass
class DataCollatorMusicGenWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.AutoProcessor`)
            The processor used for processing the data.
    """

    processor: AutoProcessor

    def __call__(
        self, features
    ):
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        labels = [
            torch.tensor(feature["labels"]).transpose(0, 1) for feature in features
        ]
        # (bsz, seq_len, num_codebooks)
        labels = torch.nn.utils.rnn.pad_sequence(
            labels, batch_first=True, padding_value=-100
        )

        input_ids = [{"input_ids": feature["input_ids"]} for feature in features]
        input_ids = self.processor.tokenizer.pad(input_ids, return_tensors="pt")

        batch = {"labels": labels, **input_ids}

        return batch

# Instantiate custom data collator
data_collator = DataCollatorMusicGenWithPadding(
    processor=processor,
)



Finally, we'll train our model!

We define how we train and what to do with the final model through the training arguments. Here:
- we'll train for 4 epochs using a learning rate of 2e-4
- we'll use gradient checkpointing and gradient accumulation to use less GPU memory
- we'll push the final model to the Hub

In [18]:

training_args = Seq2SeqTrainingArguments(
    output_dir="./output/",
    num_train_epochs=4,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    per_device_train_batch_size=2,
    learning_rate=2e-4,
    weight_decay=0.1,
    adam_beta2=0.99,
    fp16=True,
    dataloader_num_workers=2,
    logging_steps=2,
    report_to="none",
    push_to_hub=True,
    hub_model_id="musicgen-melody-lora_joel",  # replaces the deprecated push_to_hub_model_id
)


# Initialize MusicgenTrainer
trainer = MusicgenTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    tokenizer=processor,
)

train_result = trainer.train()

trainer.save_model()
trainer.save_state()


kwargs = {
    "finetuned_from": "facebook/musicgen-melody",
    "tasks": "text-to-audio",
    "tags": ["text-to-audio", "tiny-punk"],
    "dataset": "ylacombe/tiny-punk",
}

trainer.push_to_hub(**kwargs)


{'loss': 58.309, 'grad_norm': 7.1028218269348145, 'learning_rate': 0.0001666666666666667, 'epoch': 0.5925925925925926}
{'loss': 39.9536, 'grad_norm': 2.6882870197296143, 'learning_rate': 0.00013333333333333334, 'epoch': 1.0}
{'loss': 57.7466, 'grad_norm': 4.853188514709473, 'learning_rate': 0.0001, 'epoch': 1.5925925925925926}
{'loss': 39.5859, 'grad_norm': 1.8834244012832642, 'learning_rate': 6.666666666666667e-05, 'epoch': 2.0}
{'loss': 57.5936, 'grad_norm': 3.0074303150177, 'learning_rate': 3.3333333333333335e-05, 'epoch': 2.5925925925925926}
{'loss': 39.4602, 'grad_norm': 2.651796340942383, 'learning_rate': 0.0, 'epoch': 3.0}
{'train_runtime': 275.2239, 'train_samples_per_second': 0.785, 'train_steps_per_second': 0.044, 'train_loss': 48.77482604980469, 'epoch': 3.0}


CommitInfo(commit_url='https://huggingface.co/sathyavgc/musicgen-melody-lora_joel/commit/8cd52b0f0e56e00c85196a12e497717edcfbe579', commit_message='End of training', commit_description='', oid='8cd52b0f0e56e00c85196a12e497717edcfbe579', pr_url=None, repo_url=RepoUrl('https://huggingface.co/sathyavgc/musicgen-melody-lora_joel', endpoint='https://huggingface.co', repo_type='model', repo_id='sathyavgc/musicgen-melody-lora_joel'), pr_revision=None, pr_num=None)
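The learning rates in the log above are consistent with a linear decay from 2e-4 to zero, with no warmup. The sketch below reproduces them under the assumption of 12 optimizer steps in total, which is inferred from the six log entries and `logging_steps=2` rather than read from the Trainer itself.

```python
PEAK_LR = 2e-4
TOTAL_STEPS = 12  # inferred: 6 log entries x logging_steps=2 (assumption)


def linear_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS):
    """Learning rate after `step` optimizer steps of a linear decay to zero."""
    return peak_lr * max(0.0, (total_steps - step) / total_steps)


# Compare with the 'learning_rate' values in the training log.
for step in (2, 4, 6, 8, 10, 12):
    print(step, linear_lr(step))
```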

## Inference
Now that we have a trained model, we can use it quite easily with the following snippet!

In [27]:
from peft import PeftConfig, PeftModel
from transformers import AutoModelForTextToWaveform, AutoProcessor
import torch

device = torch.device("cuda:0" if torch.cuda.device_count()>0 else "cpu")

repo_id = "sathyavgc/musicgen-melody-lora-punk-colab"

config = PeftConfig.from_pretrained(repo_id)
model = AutoModelForTextToWaveform.from_pretrained(config.base_model_name_or_path, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, repo_id).to(device)

processor = AutoProcessor.from_pretrained(repo_id)

inputs = processor(
    text=["80s punk and pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt",
).to(device)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)



In [28]:
from IPython.display import Audio

Audio(audio_values.cpu().numpy().squeeze(), rate=32000)

## Conclusion

In this notebook, we've quickly shown how to do dreamboothing with MusicGen and 30 minutes of songs!

Everything that has been shown here can also be done using the [Dreamboothing MusicGen Repository](https://github.com/ylacombe/musicgen-dreamboothing) which supports many other features (multi-GPUs fine-tuning, logging on wandb, etc.)!

## What's next?

You can dreambooth your own MusicGen and quickly share and publish the resulting models! Let your imagination flourish!
