# Training a GPT-2 Language Model on the POP909 Dataset

This notebook trains a language model on the POP909 dataset using the GPT-2 architecture. It begins by installing the necessary libraries and importing the required modules. It then loads the preprocessed dataset and tokenizes it with a Hugging Face tokenizer.

The script proceeds to define the GPT-2 model architecture and configure it based on specified parameters such as the number of layers, attention heads, and embedding dimension. It calculates and prints the size of the GPT-2 model in terms of parameters.

Next, it prepares the data for training by creating a data collator that handles batch preparation and creates language model labels. It applies the data collator to a small subset of the training data to verify its functionality.

The script then sets up the training environment, including logging configurations for Weights & Biases (WandB) integration and Hugging Face Hub login. It defines a custom trainer class that extends the Trainer class provided by the transformers library. This custom trainer adds functionality to generate audio samples and log them during evaluation.

Training hyperparameters and configuration parameters are defined, and a WandB run is initiated to monitor training progress and log metrics.

The model training loop is executed using the Trainer object, which handles training epochs, batch processing, and evaluation. During evaluation, the custom trainer logs generated audio samples for qualitative analysis.

Once training is complete, the WandB run is finished, and the model checkpoint is saved to the specified output directory. Overall, this notebook provides a comprehensive pipeline for training a GPT-2 language model on the POP909 dataset, including data preprocessing, model setup, training, and result logging.

In [1]:
import locale
print(locale.getpreferredencoding())

UTF-8


In [2]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [3]:
# Install necessary libraries
!pip install -U protobuf==4.21.2
!pip install datasets
!pip install wandb
!pip install note_seq
!pip install transformers[torch]
!pip install accelerate -U



In [4]:
# Check if the code is running in Google Colab environment
if "google.colab" in str(get_ipython()):
    # Inform the user about installing dependencies in Colab
    print("Installing dependencies...")

    # Install fluidsynth and its development libraries using apt-get
    !apt-get install fluidsynth
    !apt-get install -qq libasound2-dev libjack-dev

    # Install the pyfluidsynth library using pip
    !pip install -qU pyfluidsynth

Installing dependencies...
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
fluidsynth is already the newest version (2.2.5-1).
0 upgraded, 0 newly installed, 0 to remove and 32 not upgraded.


In [5]:
# Import necessary libraries and modules
import os
from argparse import Namespace

import note_seq
import numpy as np
import wandb
from datasets import load_dataset
from huggingface_hub import notebook_login
from transformers import AutoTokenizer, AutoConfig, GPT2LMHeadModel, DataCollatorForLanguageModeling, set_seed, Trainer, TrainingArguments

import matplotlib.pyplot as plt


In [6]:
# Set the Protocol Buffers Python implementation to "python"
# This line is used to resolve compatibility issues related to Protocol Buffers (protobuf)
# It explicitly selects the pure Python implementation of the protobuf library
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

In [7]:
# Set parameters for WandB (Weights & Biases) integration
wandb_project = "pop909_musicgen"
entity = "musicgen"
data_processed = "pop909_processed"

## Download Dataset and tokenizer from Hugging Face

In the pretokenization notebook, we trained a tokenizer. We'll use it here first to do some basic EDA to understand our data and what type of model size is better (number of layers, heads, etc.)

In [8]:
# Load a dataset named "aimusicgen/pop909_clean_data" using the Hugging Face datasets library
# The split parameter is set to "train" to load the training split of the dataset
ds = load_dataset("aimusicgen/pop909_clean_data", split="train")

# Split the loaded dataset into training and testing sets using train_test_split method
# The test_size parameter specifies the fraction of the dataset to include in the test split (here, 10%)
# The shuffle parameter is set to True to shuffle the data before splitting
raw_datasets = ds.train_test_split(test_size=0.1, shuffle=True)

# Instantiate a tokenizer using the AutoTokenizer class from the transformers library
# The argument is the Hub repository of the tokenizer trained in the previous notebook
tokenizer = AutoTokenizer.from_pretrained("aimusicgen/pop909_tokenizer")

# Display the raw datasets, which now include both training and testing splits
# The raw_datasets variable is a DatasetDict with two keys: "train" and "test"
raw_datasets


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 29930
    })
    test: Dataset({
        features: ['text'],
        num_rows: 3326
    })
})

### Why is the data shuffled before it is split into training and testing?

The data is shuffled to introduce **randomness** into the split. This is important because it prevents any inherent order in the dataset from influencing the learning algorithm. If the data is ordered in a certain way (e.g., all samples of one class followed by another), shuffling helps ensure that the model sees a representative mix of samples from all classes during both training and testing.

Some algorithms might perform differently or learn **biased** patterns if trained on data with a specific order. By shuffling the data, you reduce the risk of the model learning patterns based on the order of the examples.

Shuffling contributes to better **generalization** of the model. If the model is exposed to diverse examples during training (rather than learning specific patterns related to the order of the data), it is more likely to perform well on new, unseen data.

When splitting a dataset into training and testing sets, shuffling ensures that both sets contain a representative mix of examples. This is important, especially in scenarios like **cross-validation**, where you repeatedly split the data into different training and testing sets.
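
The effect of shuffling before a split can be illustrated with a minimal plain-Python sketch. The helper `split_dataset` and the toy labels `song_a` / `song_b` are hypothetical stand-ins for illustration; this is not how the `datasets` library implements `train_test_split`.

```python
import random

def split_dataset(items, test_size=0.1, shuffle=True, seed=0):
    """Split a list into (train, test), optionally shuffling a copy first."""
    items = list(items)
    if shuffle:
        random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_size)
    return items[:-n_test], items[-n_test:]

# Hypothetical ordered data: 90 segments from one song, then 10 from another.
data = ["song_a"] * 90 + ["song_b"] * 10

# Without shuffling, the last 10% is taken as-is: the training split
# never contains "song_b" and the test split contains nothing else.
train, test = split_dataset(data, shuffle=False)
print(set(train), set(test))

# With shuffling, both sources are spread across the splits at random.
train, test = split_dataset(data, shuffle=True)
print(len(train), len(test))
```

With an ordered dataset, the unshuffled split evaluates the model only on material it never saw any example of, which says little about how well it learned the rest.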

In [9]:
# Define the context length for tokenization
# Sequences longer than 512 tokens are truncated; shorter ones are left as-is (padding happens later in the data collator)
context_length = 512

# Define a tokenization function named "tokenize" that takes an element as input
def tokenize(element):
    # Use the tokenizer to process the "text" field of the input element
    # truncation=True cuts sequences longer than max_length (context_length)
    # padding=False leaves shorter sequences at their original length;
    # the data collator pads each batch later
    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=context_length,
        padding=False
    )

    # Return a dictionary containing the "input_ids" field from the tokenizer outputs
    return {"input_ids": outputs["input_ids"]}


In [10]:
# Select an example from the training split of the raw dataset (index 1000)
selected_example = raw_datasets["train"][1000]

# Tokenize the selected example using the tokenize function defined earlier
# The tokenize function processes the "text" field of the input example
# and returns a dictionary containing the "input_ids" field with tokenized representation
tk_sample = tokenize(selected_example)

In [11]:
# Print the length of the 'input_ids' field in the tokenized sample
# This indicates the number of tokens in the tokenized representation of the input text
print(f"Len of tk_sample ids {len(tk_sample['input_ids'])}")

# Print the entire tokenized sample
# The tokenized sample is a dictionary containing the 'input_ids' field
print(f"tk_sample {tk_sample}")

Len of tk_sample ids 512
tk_sample {'input_ids': [100, 62, 60, 12, 9, 31, 6, 35, 13, 25, 8, 24, 6, 31, 17, 14, 30, 19, 5, 30, 7, 18, 6, 33, 14, 32, 6, 47, 8, 46, 10, 64, 8, 63, 7, 70, 8, 69, 13, 85, 7, 84, 9, 16, 9, 34, 11, 12, 9, 104, 7, 103, 14, 31, 9, 45, 13, 44, 64, 8, 19, 6, 30, 6, 63, 70, 45, 51, 13, 44, 6, 45, 17, 14, 44, 75, 69, 13, 16, 9, 50, 9, 18, 11, 12, 9, 19, 45, 13, 18, 44, 6, 64, 19, 13, 18, 63, 64, 49, 15, 17, 7, 48, 8, 49, 14, 16, 10, 48, 63, 45, 47, 13, 46, 5, 44, 47, 19, 45, 51, 14, 18, 44, 6, 46, 45, 19, 45, 17, 8, 18, 44, 7, 44, 9, 50, 5, 16, 11, 12, 9, 45, 19, 45, 8, 18, 7, 44, 44, 19, 19, 13, 18, 49, 49, 5, 18, 13, 31, 10, 48, 8, 30, 6, 48, 47, 31, 13, 46, 6, 30, 49, 19, 13, 48, 5, 18, 33, 77, 33, 14, 32, 7, 33, 35, 14, 32, 75, 32, 10, 34, 9, 76, 11, 61, 62, 60, 12, 9, 31, 6, 35, 13, 25, 8, 24, 6, 31, 17, 14, 30, 19, 5, 30, 7, 18, 6, 33, 14, 32, 6, 47, 8, 46, 10, 64, 8, 63, 7, 70, 8, 69, 13, 85, 7, 84, 9, 16, 9, 34, 11, 12, 9, 104, 7, 103, 14, 31, 9, 45, 13, 44,

The tokenized sample contains 512 tokens, which equals *context_length*, so the original text was at least that long and anything beyond 512 tokens was cut off by *truncation=True*. The index 1,000 only selects which training example is tokenized; it has no bearing on the sequence length. Because *padding=False* is set in the *tokenize* function, an example shorter than *max_length* would come back with fewer than 512 tokens rather than being padded to the maximum length.
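
The difference between truncating and padding can be sketched in plain Python. The helpers below are hypothetical illustrations operating on lists of token IDs, not the tokenizer's own code, and the example lengths (700 and 300) are made up.

```python
def truncate_only(ids, max_length):
    """Cap a sequence at max_length; shorter sequences are returned unchanged."""
    return ids[:max_length]

def truncate_and_pad(ids, max_length, pad_id):
    """Cap a sequence at max_length, then right-pad it up to max_length."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

long_example = list(range(700))   # hypothetical 700-token example
short_example = list(range(300))  # hypothetical 300-token example

print(len(truncate_only(long_example, 512)))                # 512
print(len(truncate_only(short_example, 512)))               # 300
print(len(truncate_and_pad(short_example, 512, pad_id=3)))  # 512
```

Only the second helper guarantees a fixed length. The `tokenize` function in this notebook behaves like the first one.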

In [12]:
# Apply the tokenize function to both splits of the raw dataset using the map method
# The tokenize function is applied in a batched manner, improving efficiency
# remove_columns drops the original columns (here only "text") from both splits;
# the column names are read from the "train" split
tokenized_datasets = raw_datasets.map(
    tokenize,                # The tokenization function to be applied
    batched=True,            # Tokenize in batches for efficiency
    remove_columns=raw_datasets["train"].column_names  # Remove unnecessary columns after tokenization
)

# Display the resulting tokenized datasets
# The tokenized_datasets variable now contains both training and testing splits with tokenized representations
tokenized_datasets

Map:   0%|          | 0/29930 [00:00<?, ? examples/s]

Map:   0%|          | 0/3326 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 29930
    })
    test: Dataset({
        features: ['input_ids'],
        num_rows: 3326
    })
})

## Train the model

In [13]:
# Define hyperparameters for model architecture
# These parameters can be adjusted based on the size of the data and the specific requirements of the task

n_layer = 6 # Number of layers in the transformer model
n_head = 8 # Number of attention heads in each transformer layer
n_emb = 512 # Dimensionality of the embedding layer

In [14]:
# Use Hugging Face's AutoConfig to create a configuration for the GPT-2 model
# The configuration is based on the "gpt2" pre-trained model, and some parameters are customized

# Define the configuration using AutoConfig
config = AutoConfig.from_pretrained(
    "gpt2",                               # Base model: "gpt2"
    vocab_size=len(tokenizer),            # Vocabulary size based on the tokenizer
    n_positions=context_length,           # Maximum position embeddings (context length)
    n_layer=n_layer,                      # Number of transformer layers
    n_head=n_head,                        # Number of attention heads in each layer
    pad_token_id=tokenizer.pad_token_id,  # ID of the padding token
    bos_token_id=tokenizer.bos_token_id,  # ID of the beginning-of-sequence token
    eos_token_id=tokenizer.eos_token_id,  # ID of the end-of-sequence token
    n_embd=n_emb                           # Dimensionality of the embedding layer
)

# Create an instance of the GPT-2 language model from the configuration
# The weights are randomly initialized; no pretrained GPT-2 weights are loaded
model = GPT2LMHeadModel(config)


In [15]:
# Calculate and print the size of the GPT-2 model in terms of parameters
# The size is measured in millions (M) of parameters

# Calculate the total number of parameters in the GPT-2 model
model_size = sum(t.numel() for t in model.parameters())

# Print the size of the GPT-2 model in millions of parameters
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")

GPT-2 size: 19.3M parameters
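
As a rough cross-check of the reported size, the parameter count can be worked out by hand. The sketch below assumes the standard GPT-2 layout (a 4x MLP expansion, biases on all linear layers, and an output head that shares its weights with the token embedding). The vocabulary size of 256 is a hypothetical placeholder, since the real value comes from `len(tokenizer)` and is not shown in this notebook.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Count trainable parameters under the assumptions stated above."""
    wte = vocab_size * n_embd      # token embeddings (shared with the output head)
    wpe = n_positions * n_embd     # position embeddings
    layer_norm = 2 * n_embd        # scale + bias
    # Attention: fused Q/K/V projection plus output projection, each with bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back, each with bias
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attn + mlp
    return wte + wpe + n_layer * block + layer_norm  # plus the final layer norm

# Everything except the token embedding, for the settings used in this notebook
print(f"{gpt2_param_count(0, 512, 512, 6):,}")              # 19,177,472

# With a hypothetical vocabulary of 256 tokens
print(f"{gpt2_param_count(256, 512, 512, 6) / 1e6:.1f}M")   # 19.3M
```

The number of attention heads does not appear in the count: heads only partition the embedding dimension (512 / 8 = 64 per head here), so changing `n_head` leaves the model size unchanged. Almost all parameters sit in the six transformer blocks, and the small music vocabulary contributes very little.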


Create a data collator that will take care of preparing the data:

"Before we can start training, we need to set up a data collator that will take care of creating the batches. We can use the DataCollatorForLanguageModeling collator, which is designed specifically for language modeling (as the name subtly suggests). Besides stacking and padding batches, it also takes care of creating the language model labels — in causal language modeling the inputs serve as labels too (just shifted by one element), and this data collator creates them on the fly during training so we don’t need to duplicate the input_ids.

Note that DataCollatorForLanguageModeling supports both masked language modeling (MLM) and causal language modeling (CLM). By default it prepares data for MLM, but we can switch to CLM by setting the argument mlm=False:"

In [16]:
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

In [17]:
# Test the data collator on a small subset of the training data
out = data_collator([tokenized_datasets["train"][i] for i in range(5)])

# Print the shapes of the collated outputs
for key in out:
    print(f"{key} shape: {out[key].shape}")

# Display the collated outputs
print(f"Collated outputs: {out}")

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


input_ids shape: torch.Size([5, 512])
attention_mask shape: torch.Size([5, 512])
labels shape: torch.Size([5, 512])
Collated outputs: {'input_ids': tensor([[100,  62,  60,  ...,   3,   3,   3],
        [100,  62,  60,  ...,   3,   3,   3],
        [100,  62,  60,  ...,   3,   3,   3],
        [100,  62,  60,  ...,   3,   3,   3],
        [100,  62,  60,  ...,  11,  12,   9]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]]), 'labels': tensor([[ 100,   62,   60,  ..., -100, -100, -100],
        [ 100,   62,   60,  ..., -100, -100, -100],
        [ 100,   62,   60,  ..., -100, -100, -100],
        [ 100,   62,   60,  ..., -100, -100, -100],
        [ 100,   62,   60,  ...,   11,   12,    9]])}
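
The structure of the collated output above (padded `input_ids`, an `attention_mask` of ones and zeros, and `labels` that copy the inputs but use -100 at padded positions) can be reproduced with a simplified plain-Python sketch. The function `collate_for_clm` is a hypothetical illustration assuming right-padding, not the library's implementation; the pad ID of 3 is taken from the output above.

```python
def collate_for_clm(examples, pad_id, ignore_index=-100):
    """Pad a batch to its longest sequence and build causal-LM labels."""
    max_len = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels are a copy of the inputs; padded positions get ignore_index
        # so that they do not contribute to the loss.
        batch["labels"].append(ids + [ignore_index] * n_pad)
    return batch

batch = collate_for_clm([[100, 62, 60], [100, 62, 60, 11, 12]], pad_id=3)
for key, rows in batch.items():
    print(key, rows)
```

The labels are not shifted here. In the transformers causal language models, the one-position shift between inputs and targets is applied inside the model when it computes the loss.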


In [18]:
# Login into wandb
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mnaomitunstead[0m ([33mmusicgen[0m). Use [1m`wandb login --relogin`[0m to force relogin


True

In [19]:
# Set the WandB environment variable to log model checkpoints
# No quotes: %env stores everything after "=" verbatim, so quotes would become part of the value
%env WANDB_LOG_MODEL=checkpoint

env: WANDB_LOG_MODEL=checkpoint


In [20]:
# Login into Hugging Face
notebook_login()


In [21]:
# Create the args for our trainer

# Output directory for checkpoints, and the interval (in steps) shared by evaluation, logging, and saving
output_path = "output"
steps = 350

# Define training configuration parameters
config = {"output_dir": output_path, # Specify the directory where output will be saved
          "num_train_epochs": 1, # Number of epochs for training
          "per_device_train_batch_size": 8, # Batch size for training on each device
          "per_device_eval_batch_size": 4, # Batch size for evaluation on each device
          "evaluation_strategy": "steps", # Strategy for evaluation during training (e.g., 'steps' means evaluate every certain number of steps)
          "save_strategy": "steps",  # Strategy for saving checkpoints during training
          "eval_steps": steps, # Number of steps before evaluation
          "logging_steps":steps, # Number of steps before logging
          "logging_first_step": True, # Log the first step
          "save_total_limit": 5, # Limit on the total number of checkpoints to save
          "save_steps": steps, # Number of steps before saving a checkpoint
          "lr_scheduler_type": "cosine", # Type of learning rate scheduler (e.g., 'cosine' for cosine annealing)
          "learning_rate":5e-4, # Initial learning rate
          "warmup_ratio": 0.01, # Ratio of warmup steps to total training steps
          "weight_decay": 0.01, # Weight decay coefficient for regularization
          "seed": 1, # Random seed for reproducibility
          "load_best_model_at_end": True, # Whether to load the best model at the end of training
          "report_to": "wandb"} # Report the training progress to Weights & Biases

# Create Namespace object with configuration
args = Namespace(**config)
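
To get a feel for what `lr_scheduler_type="cosine"` with `warmup_ratio=0.01` means for this run, the schedule can be sketched in plain Python: a linear warmup from zero to the base learning rate, followed by a half-cosine decay to zero. The function `lr_at_step` is a hypothetical illustration of that shape, and the step count assumes a single device with no gradient accumulation.

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-4, warmup_ratio=0.01):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# 29,930 training examples, batch size 8, one epoch (single device assumed)
total_steps = math.ceil(29930 / 8)
print(total_steps)                        # 3742
print(math.ceil(total_steps * 0.01))      # 38 warmup steps
print(total_steps // 350)                 # 10 evaluations at eval_steps=350

for step in (0, 19, 38, total_steps // 2, total_steps):
    print(step, f"{lr_at_step(step, total_steps):.6f}")
```

Under these assumptions, the learning rate peaks at 5e-4 after 38 steps and then falls smoothly, and the model is evaluated (and a checkpoint saved) about ten times during the epoch.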

In [22]:
# Set the random seed for reproducibility
set_seed(args.seed)

In [23]:
# Initialize and start a new WandB run for training
run = wandb.init(project=wandb_project, job_type="training", config=args)

In [24]:
# Code for converting token sequences to NoteSequences with audio-related information

# Constants for note durations
NOTE_LENGTH_16TH_120BPM = 0.25 * 60 / 120
BAR_LENGTH_120BPM = 4.0 * 60 / 120

def token_sequence_to_note_sequence(token_sequence, use_program=True, use_drums=True, instrument_mapper=None, only_piano=False):
    """
    Convert a token sequence to a NoteSequence with audio-related information.
    Args:
        token_sequence (list or str): Token sequence representing musical information.
        use_program (bool): Whether to use program information for instruments.
        use_drums (bool): Whether to include drums in the output.
        instrument_mapper (dict): Mapping of instrument names to MIDI program numbers.
        only_piano (bool): Whether to include only piano instruments in the output.
    Returns:
        note_sequence (NoteSequence): Converted NoteSequence.
    """

    if isinstance(token_sequence, str):
        token_sequence = token_sequence.split()

    note_sequence = empty_note_sequence()

    # Render all notes.
    current_program = 1
    current_is_drum = False
    current_instrument = 0
    track_count = 0
    current_bar_index = 0
    # Defined up front so that a note token before the first BAR_START does not raise a NameError
    current_time = 0.0
    current_notes = {}
    for token_index, token in enumerate(token_sequence):

        if token == "PIECE_START":
            pass
        elif token == "PIECE_END":
            print("The end.")
            break
        elif token == "TRACK_START":
            current_bar_index = 0
            track_count += 1
            pass
        elif token == "TRACK_END":
            pass
        elif token == "KEYS_START":
            pass
        elif token == "KEYS_END":
            pass
        elif token.startswith("KEY="):
            pass
        elif token.startswith("INST"):
            instrument = token.split("=")[-1]
            if instrument != "DRUMS" and use_program:
                if instrument_mapper is not None:
                    if instrument in instrument_mapper:
                        instrument = instrument_mapper[instrument]
                current_program = int(instrument)
                current_instrument = track_count
                current_is_drum = False
            if instrument == "DRUMS" and use_drums:
                current_instrument = 0
                current_program = 0
                current_is_drum = True
        elif token == "BAR_START":
            current_time = current_bar_index * BAR_LENGTH_120BPM
            current_notes = {}
        elif token == "BAR_END":
            current_bar_index += 1
            pass
        elif token.startswith("NOTE_ON"):
            pitch = int(token.split("=")[-1])
            note = note_sequence.notes.add()
            note.start_time = current_time
            note.end_time = current_time + 4 * NOTE_LENGTH_16TH_120BPM
            note.pitch = pitch
            note.instrument = current_instrument
            note.program = current_program
            note.velocity = 80
            note.is_drum = current_is_drum
            current_notes[pitch] = note
        elif token.startswith("NOTE_OFF"):
            pitch = int(token.split("=")[-1])
            if pitch in current_notes:
                note = current_notes[pitch]
                note.end_time = current_time
        elif token.startswith("TIME_DELTA"):
            delta = float(token.split("=")[-1]) * NOTE_LENGTH_16TH_120BPM
            current_time += delta
        elif token.startswith("DENSITY="):
            pass
        elif token == "[PAD]":
            pass
        else:
            #print(f"Ignored token {token}.")
            pass

    # Make the instruments right.
    instruments_drums = []
    for note in note_sequence.notes:
        pair = [note.program, note.is_drum]
        if pair not in instruments_drums:
            instruments_drums += [pair]
        note.instrument = instruments_drums.index(pair)

    if only_piano:
        for note in note_sequence.notes:
            if not note.is_drum:
                note.instrument = 0
                note.program = 0

    return note_sequence

def empty_note_sequence(qpm=120.0, total_time=0.0):
    """
    Create an empty NoteSequence with specified tempo and total time.
    Args:
        qpm (float): Quarter notes per minute (tempo).
        total_time (float): Total time of the NoteSequence.
    Returns:
        note_sequence (NoteSequence): Empty NoteSequence.
    """
    note_sequence = note_seq.protobuf.music_pb2.NoteSequence()
    note_sequence.tempos.add().qpm = qpm
    note_sequence.ticks_per_quarter = note_seq.constants.STANDARD_PPQ
    note_sequence.total_time = total_time
    return note_sequence
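
The timing arithmetic in the converter can be followed without `note_seq` using a stripped-down sketch. At 120 BPM a sixteenth note lasts 0.125 s and a 4/4 bar lasts 2.0 s; each `BAR_START` resets the clock to the start of its bar, and each `TIME_DELTA` advances it in sixteenth-note units. The function `note_times` and the token list are hypothetical, and unlike the converter above, this sketch handles a single track and drops notes that never receive a `NOTE_OFF`.

```python
NOTE_LENGTH_16TH_120BPM = 0.25 * 60 / 120  # 0.125 seconds
BAR_LENGTH_120BPM = 4.0 * 60 / 120         # 2.0 seconds

def note_times(tokens):
    """Return (pitch, start, end) tuples in seconds for a single track."""
    bar_index, current_time = 0, 0.0
    open_notes, notes = {}, []
    for token in tokens:
        if token == "BAR_START":
            current_time = bar_index * BAR_LENGTH_120BPM
            open_notes = {}
        elif token == "BAR_END":
            bar_index += 1
        elif token.startswith("NOTE_ON="):
            open_notes[int(token.split("=")[-1])] = current_time
        elif token.startswith("NOTE_OFF="):
            pitch = int(token.split("=")[-1])
            if pitch in open_notes:
                notes.append((pitch, open_notes.pop(pitch), current_time))
        elif token.startswith("TIME_DELTA="):
            current_time += float(token.split("=")[-1]) * NOTE_LENGTH_16TH_120BPM
    return notes

tokens = ("BAR_START NOTE_ON=60 TIME_DELTA=4 NOTE_OFF=60 BAR_END "
          "BAR_START NOTE_ON=64 TIME_DELTA=2 NOTE_OFF=64 BAR_END").split()
print(note_times(tokens))  # [(60, 0.0, 0.5), (64, 2.0, 2.25)]
```

The first note lasts four sixteenths (0.5 s) and the second starts exactly at the second bar (2.0 s), regardless of how much time elapsed inside the first bar.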

In [26]:
# Create a custom trainer that generates and logs audio samples during evaluation
# Set the sample rate for audio processing
SAMPLE_RATE=44100

class CustomTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def evaluation_loop(
        self,
        dataloader,
        description,
        prediction_loss_only=None,
        ignore_keys=None,
        metric_key_prefix="eval",
    ):
        # Import pyfluidsynth module here
        import fluidsynth as fs

        # Adjust FluidSynth polyphony
        fluidsynth_settings = {
            "synth.polyphony": 128,  # Increase polyphony to 128 voices
            # Add more FluidSynth settings as needed
        }

        # Create a FluidSynth instance
        fluidsynth_instance = fs.Synth()

        # Initialize FluidSynth with modified settings
        for setting, value in fluidsynth_settings.items():
            fluidsynth_instance.setting(setting, value)


        # Call super class method to get the eval outputs
        eval_output = super().evaluation_loop(
            dataloader,
            description,
            prediction_loss_only,
            ignore_keys,
            metric_key_prefix,
        )

        # Generate audio samples and log them using the `wandb.Audio` method.
        if wandb.run is not None:
            # Encode a starting token to begin the generation
            # Move the input to the model's device instead of hard-coding CUDA
            input_ids = self.tokenizer.encode("PIECE_START", return_tensors="pt").to(self.model.device)

            # Generate more tokens for each voice
            for voice_num in range(1, 5):
                generated_ids = self.model.generate(
                    input_ids,
                    max_length=512,
                    do_sample=True,
                    temperature=0.75, # Set temperature for sampling (higher values for more randomness, lower for more determinism)
                    # top_p = 0.8, # Set top-p sampling parameters (nucleus sampling) to control diversity
                    # top_k = 50, # Set top-k sampling parameters to restrict generation to the top-k most likely tokens
                    eos_token_id=self.tokenizer.encode("TRACK_END")[0]
                )

                # Decode the generated tokens into a token sequence
                token_sequence = self.tokenizer.decode(generated_ids[0])

                # Convert the token sequence into a NoteSequence
                note_sequence = token_sequence_to_note_sequence(token_sequence)

                # Synthesize the audio from the NoteSequence
                synth = note_seq.fluidsynth
                array_of_floats = synth(note_sequence, sample_rate=SAMPLE_RATE)

                # Convert the float audio samples to int16 format
                int16_data = note_seq.audio_io.float_samples_to_int16(array_of_floats)

                # Log the generated audio using the wandb.Audio method
                wandb.log({"Generated_audio_voice_" + str(voice_num): wandb.Audio(int16_data, SAMPLE_RATE)})

        # Return the evaluation output
        return eval_output
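
The `temperature=0.75` argument passed to `generate` rescales the model's logits before sampling. A minimal plain-Python sketch of that idea follows; the helper names and the logit values are hypothetical, and the library applies the same scaling to tensors over the full vocabulary.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=0.75, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens

for t in (1.0, 0.75):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

print(sample_token(logits, rng=random.Random(0)))
```

A temperature below 1 concentrates probability on the highest-scoring tokens, so the generated music stays closer to the model's most likely continuations while still varying between the four sampled voices.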

In [27]:
# Create TrainingArguments object with training configuration
train_args = TrainingArguments(**config)

In [28]:
# Initialize the custom trainer for training the model
trainer = CustomTrainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"]
)


In [29]:
# Train the model.
trainer.train()

Step,Training Loss,Validation Loss


TypeError: 'NoneType' object is not iterable

In [None]:
# call wandb.finish() to finish the run
wandb.finish()