# NesGen: With Note-Duration Representation
This notebook presents a complete example of music generation with a Transformer model.

Notebook presented for the Deep Learning project, academic year 2023/2024.

Group members: 
* Valerio Di Zio - valerio.dizio@studio.unibo.it
* Francesco Magnani - mail@difresh.it
* Luca Rubboli - mail@diluca.it


# Installing the libraries and downloading the dataset
The case study uses the Maestro dataset (v3.0.0), a collection of piano performances in MIDI format that can be parsed and processed directly for model training.

The pretty_midi library is used to parse the MIDI files and extract note information from them.

In [1]:
!pip install miditok
!pip install symusic
!pip install torch
!pip install transformers
!pip install accelerate
!pip install evaluate
!pip install tensorboard
!pip install scikit-learn
!pip install pretty_midi

!wget https://storage.googleapis.com/magentadata/datasets/maestro/v3.0.0/maestro-v3.0.0-midi.zip
!unzip 'maestro-v3.0.0-midi.zip'
!rm 'maestro-v3.0.0-midi.zip'
!mv 'maestro-v3.0.0' 'Maestro'

from copy import deepcopy
from pathlib import Path
from random import shuffle

from evaluate import load as load_metric
from miditok import REMI, TokenizerConfig
from miditok.pytorch_data import DatasetMIDI, DataCollator
from miditok.utils import split_files_for_training
from miditok.data_augmentation import augment_dataset
from torch import Tensor, argmax
from torch.utils.data import DataLoader
from torch.cuda import is_available as cuda_available, is_bf16_supported
from torch.backends.mps import is_available as mps_available
from transformers import AutoModelForCausalLM, MistralConfig, Trainer, TrainingArguments, GenerationConfig
from transformers.trainer_utils import set_seed
from tqdm import tqdm
import json
import pretty_midi
import os


# Tokenization
Tokenization is one of the most important steps in the pipeline.

This example uses the following representation, in which each token combines a note name with its duration (in the code below, the duration in seconds rounded to one decimal place).

**note-duration**
* Example: "C4-1.0, C4-1.0, G4-1.0, G4-1.0, A4-1.0, A4-1.0, G4-2.0, F4-1.0, F4-1.0, E4-1.0, E4-1.0, D4-1.0, D4-1.0, C4-2.0, G4-1.0, G4-1.0, F4-1.0, F4-1.0, E4-1.0, E4-1.0, D4-2.0, G4-1.0, G4-1.0, F4-1.0, F4-1.0, E4-1.0, E4-1.0, D4-2.0"

This way each token corresponds to exactly one note.

A MIDI file is considerably more complex than this, and with a finer-grained tokenization there is a risk of generating token sequences that do not make musical sense.

Restricting the model to one note per token makes the subsequent training and generation easier. The trade-off is that velocity, onset time and any overlap between notes are discarded.
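
As a minimal illustration of the one-note-per-token idea, the sketch below builds note-duration tokens from a MIDI pitch number and the start and end times of a note. The helper names pitch_to_name and to_token are hypothetical, and the pitch naming is re-implemented in plain Python under the assumption that pitch 60 is written as C4; the notebook itself relies on pretty_midi for this step.

```python
# Hypothetical helpers, not part of the notebook: they only illustrate the token format.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_to_name(pitch):
    """Maps a MIDI pitch number to a name such as "C4" (assumes pitch 60 is C4)."""
    octave = pitch // 12 - 1
    return f"{NOTE_NAMES[pitch % 12]}{octave}"

def to_token(pitch, start, end):
    """Builds one note-duration token, with the duration rounded to one decimal."""
    return f"{pitch_to_name(pitch)}-{end - start:.1f}"

# Three notes given as (pitch, start, end), with times in seconds.
notes = [(60, 0.0, 1.02), (67, 1.02, 2.0), (69, 2.0, 2.26)]
tokens = [to_token(*note) for note in notes]
print(", ".join(tokens))
```

Because durations are rounded, notes whose lengths differ only slightly share the same token, which helps keep the vocabulary small.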

In [2]:
def find_midi_files(directory):
    """Recursively finds all MIDI files in the directory."""
    midi_files = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith((".mid", ".midi")):
                midi_files.append(os.path.join(root, file))
    return midi_files

def midi_to_note_representation(file_path):
    """Converts a MIDI file into a note-duration representation."""
    try:
        midi_data = pretty_midi.PrettyMIDI(file_path)
        note_events = []

        for instrument in midi_data.instruments:
            for note in instrument.notes:
                # Convert pitch to note name
                note_name = pretty_midi.note_number_to_name(note.pitch)
                duration = note.end - note.start
                note_events.append(f"{note_name}-{duration:.1f}")

        return ", ".join(note_events)
    except Exception as e:
        print(f"Error in file conversion {file_path}: {e}")
        return None

def create_dataset_from_midi(directory, output_file):
    """Creates a JSON dataset with the representation of notes from MIDI files."""
    dataset = {}
    midi_files = find_midi_files(directory)

    for midi_file in tqdm(midi_files):
        note_representation = midi_to_note_representation(midi_file)
        if note_representation:
            dataset[midi_file] = note_representation

    with open(output_file, "w") as json_file:
        json.dump(dataset, json_file, indent=4)

    print(f"Dataset created and saved in {output_file}")

midi_directory = "./Maestro"
output_dataset_file = "midi_dataset.json"


create_dataset_from_midi(midi_directory, output_dataset_file)

100%|██████████| 1276/1276 [05:34<00:00,  3.82it/s]


Dataset created and saved in midi_dataset.json


In [3]:
dataset_file = "midi_dataset.json"

with open(dataset_file, "r") as json_file:
    dataset = json.load(json_file)

In [4]:
import numpy as np
maestro_dataset = list(dataset.values())

In [5]:
with open("midi_dataset_nolabel.json", "w") as json_file:
    json.dump(maestro_dataset, json_file, indent=4)

## Text Tokenization

For the actual tokenization, i.e. converting the text into integer indices, we relied on the Tokenizer class provided by tensorflow.keras.

In [6]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(filters="", lower=False, split=",")
tokenizer.fit_on_texts(maestro_dataset)
tokenized_melodies = tokenizer.texts_to_sequences(maestro_dataset)
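
The idea behind this step can be sketched in plain Python: every distinct note-duration string receives an integer id, with more frequent tokens getting smaller ids and numbering starting at 1. This is only an approximation of what the Keras Tokenizer does, not its implementation; the function names fit_vocab and encode are hypothetical, and stripping whitespace around each token is a simplification made for this sketch.

```python
from collections import Counter

def fit_vocab(texts, sep=","):
    """Assigns ids by descending frequency, starting at 1 (id 0 is left unused)."""
    counts = Counter(
        tok.strip() for text in texts for tok in text.split(sep) if tok.strip()
    )
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common(), start=1)}

def encode(text, vocab, sep=","):
    """Converts one melody string into a list of integer ids, skipping unknown tokens."""
    return [vocab[tok.strip()] for tok in text.split(sep) if tok.strip() in vocab]

melodies = ["C4-1.0, G4-1.0, G4-1.0", "G4-1.0, A4-0.5"]
vocab = fit_vocab(melodies)
encoded = [encode(melody, vocab) for melody in melodies]
print(vocab)
print(encoded)
```

Since the ids start at 1, the largest id equals the number of distinct tokens, so a model consuming these ids needs an embedding table with one more row than the vocabulary has entries.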

In [7]:
pre_processed_dataset = np.concatenate(tokenized_melodies)

# Train, Validation and Test split
One of the problems faced while developing this solution was the variable length of the MIDI tracks.

The following solutions were explored:
1. add “null” padding to all tracks to reach the length of the longest one;
    * This makes the dataset much larger and fills it with empty notes, which increases
      complexity without bringing any benefit.
2. cut all sequences to a predefined length, adding padding where needed;
    * Cutting the sequences alleviates the previous problem, since much less padding is required,
      but it introduces a new one: the loss of information that would be useful for training the model.
3. concatenate all the tracks into a single one and split it into sequences as needed.
    * In our opinion this is the best solution: it requires no padding and preserves all the
      information. The only drawback is that some sequences span the boundary between two
      different pieces, which we consider an acceptable compromise.
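
The third option can be sketched with plain Python lists: the tokenized tracks are joined into one stream, the stream is cut into windows of seq_length + 1 tokens, and each window yields an input sequence and a target sequence shifted by one position. The function name make_training_pairs and the toy data are hypothetical; the notebook performs the same steps with NumPy and tf.data.

```python
from itertools import chain

def make_training_pairs(tracks, seq_length):
    """Concatenates tracks and cuts the stream into (input, target) pairs."""
    stream = list(chain.from_iterable(tracks))
    window = seq_length + 1  # one extra token so the target can be shifted by one
    pairs = []
    # Step through the stream in whole windows; an incomplete tail is dropped.
    for start in range(0, len(stream) - window + 1, window):
        chunk = stream[start:start + window]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs

# Three toy "tracks" of different lengths, already converted to token ids.
tracks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
pairs = make_training_pairs(tracks, seq_length=3)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

In this toy run the first window mixes tokens from the first two tracks, which is exactly the boundary effect mentioned above, and the last token is discarded because it does not fill a whole window.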

In [8]:
train_pct = 0.7  # 70% training
val_pct = 0.2    # 20% validation
test_pct = 0.1   # 10% test

n = len(pre_processed_dataset)
train_end = int(train_pct * n)
val_end = train_end + int(val_pct * n)

train_data = pre_processed_dataset[:train_end]
val_data = pre_processed_dataset[train_end:val_end]
test_data = pre_processed_dataset[val_end:]

In [9]:
import tensorflow as tf

In [10]:
ids_dataset_train = tf.data.Dataset.from_tensor_slices(train_data)
ids_dataset_val = tf.data.Dataset.from_tensor_slices(val_data)
ids_dataset_test = tf.data.Dataset.from_tensor_slices(test_data)
seq_length = 1024 

sequences_train = ids_dataset_train.batch(seq_length+1, drop_remainder=True)
sequences_val = ids_dataset_val.batch(seq_length+1, drop_remainder=True)
sequences_test = ids_dataset_test.batch(seq_length+1, drop_remainder=True)

def split_input_target(sequence):
    input_seq = tf.cast(sequence[:-1], tf.int32)
    target_seq = tf.cast(sequence[1:], tf.int32)
    return input_seq, target_seq

train_ds = sequences_train.map(split_input_target)
val_ds = sequences_val.map(split_input_target)
test_ds = sequences_test.map(split_input_target)

BATCH_SIZE = 16


# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).

BUFFER_SIZE = 10000

train_ds = (
    train_ds
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

val_ds = (
    val_ds
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

test_ds = (
    test_ds
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
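
The buffer-based shuffling described in the comment above can be approximated in a few lines of plain Python. This is a simplified sketch of the idea, not the tf.data implementation: a fixed-size buffer is filled first, then each incoming element replaces a randomly chosen buffered element, which is emitted. The name buffered_shuffle and its seed parameter are hypothetical.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yields items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and put the new element in its place
    rng.shuffle(buffer)          # flush what is left at the end
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
print(shuffled)
```

An element can only be emitted once it has entered the buffer, so a small buffer keeps elements close to their original position, while a buffer at least as large as the dataset gives a full shuffle.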

# Training
The model is based on the Transformer architecture: a Mistral configuration is instantiated through the TFAutoModelForCausalLM class of the Hugging Face transformers library and trained from scratch.

In [11]:
!pip install transformers

In [12]:
from transformers import TFAutoModelForCausalLM, MistralConfig

# Define the model configuration
model_config = MistralConfig(
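    # Keras Tokenizer ids start at 1, so the largest id equals len(tokenizer.word_index);
    # vocab_size=len(tokenizer.word_index) + 1 would be needed to cover that id.
    # The outputs recorded below were produced with the value used here.
    vocab_size=len(tokenizer.word_index),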
    hidden_size=512,
    intermediate_size=1024,
    num_hidden_layers=8,
    num_attention_heads=8,
    num_key_value_heads=4,
    sliding_window=256,
    max_position_embeddings=8192,
)

# Initialize the TensorFlow model
model = TFAutoModelForCausalLM.from_config(model_config)
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss=loss,
              optimizer="adam",
              weighted_metrics=["sparse_categorical_accuracy"],
              jit_compile=True,
              )
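
To get a feeling for the size of the network defined above, the following back-of-the-envelope sketch counts the weights implied by the configuration. It assumes the usual Mistral layout (bias-free attention and gated MLP projections, two RMSNorm layers per block, and an output head that is not tied to the input embedding). The function name is hypothetical, and the vocabulary size of 6663 is taken from the logits shape printed later in the notebook.

```python
def estimate_mistral_params(vocab_size, hidden_size, intermediate_size,
                            num_hidden_layers, num_attention_heads,
                            num_key_value_heads):
    """Rough weight count for a Mistral-style decoder (assumptions in the text above)."""
    head_dim = hidden_size // num_attention_heads
    kv_dim = num_key_value_heads * head_dim
    # Query and output projections are hidden x hidden; key and value use fewer heads.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # Gated MLP: gate, up and down projections.
    mlp = 3 * hidden_size * intermediate_size
    norms = 2 * hidden_size
    per_layer = attention + mlp + norms
    # Input embedding plus a separate output head, and the final norm.
    embeddings = 2 * vocab_size * hidden_size
    return num_hidden_layers * per_layer + embeddings + hidden_size

total = estimate_mistral_params(
    vocab_size=6663, hidden_size=512, intermediate_size=1024,
    num_hidden_layers=8, num_attention_heads=8, num_key_value_heads=4,
)
print(f"Estimated parameters: {total:,}")
```

This is only an estimate; the exact figure can be read from the built model, for example with model.summary().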

In [13]:
for input_example_batch, target_example_batch in train_ds.take(1):

  example_batch_predictions = model(input_example_batch)
  logits = example_batch_predictions.logits
  print(logits.shape, "# (batch_size, sequence_length, vocab_size)")



# Check shapes
print("Prediction shape:", logits.shape)
print("Target shape:", target_example_batch.shape)

# Ensure reduction is feasible
predicted_classes = tf.argmax(logits, axis=-1)  # (batch_size, seq_length)
print("Reduced prediction shape:", predicted_classes.shape)

# Compare shapes after reduction
if predicted_classes.shape == target_example_batch.shape:
    print("Shapes are compatible for comparison.")
else:
    print("Shapes are NOT compatible for comparison.")

# Verify dtype compatibility
print("Prediction dtype:", logits.dtype)
print("Target dtype:", target_example_batch.dtype)

(16, 1024, 6663) # (batch_size, sequence_length, vocab_size)
Prediction shape: (16, 1024, 6663)
Target shape: (16, 1024)
Reduced prediction shape: (16, 1024)
Shapes are compatible for comparison.
Prediction dtype: <dtype: 'float32'>
Target dtype: <dtype: 'int32'>


In [14]:
EPOCHS = 20

history = model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=val_ds,
)

Epoch 1/20
I0000 00:00:1732467006.972612     144 service.cc:145] XLA service 0x7bb0c0005370 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1732467006.972668     144 service.cc:153]   StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0
W0000 00:00:1732467007.440365     144 assert_op.cc:38] Ignoring Assert operator tf_mistral_for_causal_lm/model/assert_less/Assert/Assert
I0000 00:00:1732467017.994268     144 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.




Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


# Generation
Generation starts from a seed, i.e. an initial sequence of notes that the model continues by predicting the next ones. The seed is a sequence taken from a shuffled batch of the test set.

In [15]:
def get_seed():
    """Returns the first input sequence of one (shuffled) test batch."""
    for seed_ids, _ in test_ds.take(1):
        seed = seed_ids
    return seed[0]

In [16]:
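dump_seed = False

seed = get_seed()
input_ids = tf.convert_to_tensor(seed)  # seed is a 1D tensor of token IDs
input_ids = tf.expand_dims(input_ids, 0)  # Add a batch dimension: (1, seq_length)

if dump_seed:
    # Decode the seed ids back to note-duration text with the Keras tokenizer
    # and write them to a MIDI file. create_midi_from_notes is defined in a
    # later cell, so that cell must be run first when dump_seed is True.
    seed_melody = tokenizer.sequences_to_texts([seed.numpy().tolist()])[0]
    create_midi_from_notes(seed_melody, "seed.mid")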

# Generate continuation
outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=512,  # Maximum length of generated sequence
    num_return_sequences=1,  # Number of sequences to return
    do_sample=True,  # Use sampling (True) or greedy decoding (False)
    temperature=0.7,  # Sampling temperature (lower is more conservative)
    eos_token_id=-1
)

input_length = input_ids.shape[1]
generated_tokens = outputs[:, input_length:] # skip seed

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:-1 for open-end generation.
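
The effect of the temperature parameter used in the call above can be illustrated with a small, self-contained sketch. It is not the code path used by model.generate, only the underlying idea: the logits are divided by the temperature before the softmax, and the next token is drawn from the resulting distribution. The function names and the toy logits are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turns logits into probabilities after scaling them by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=0.7, rng=random):
    """Draws one token id according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.7))
print(sample_next_token(logits, temperature=0.7, rng=random.Random(0)))
```

Temperatures below 1 concentrate probability on the most likely tokens, which is why lower values give more conservative continuations, while values above 1 flatten the distribution.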


In [18]:
generated_sequence_array = generated_tokens.numpy()
generated_melody = tokenizer.sequences_to_texts(
    generated_sequence_array
)[0]

In [20]:
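def parse_note_string(note_string):
    """
    Converts a note string to a list of tuples (note, duration).
    Example: "C4-1.0, G4-1.0" -> [("C4", 1.0), ("G4", 1.0)]
    """
    notes = []
    # Accept both comma-separated input and the space-separated text
    # returned by tokenizer.sequences_to_texts.
    for note in note_string.replace(",", " ").split():
        # Split on the last "-" only, so a name with a negative octave (e.g. "C-1") still parses.
        note_name, duration = note.rsplit("-", 1)
        notes.append((note_name, float(duration)))
    return notes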

def create_midi_from_notes(note_string, output_file):
    """
    Creates a MIDI file from a string of notes.
    """
    # Parse the string into a list of notes.
    notes = parse_note_string(note_string)

    # Create a PrettyMIDI object and an instrument
    midi = pretty_midi.PrettyMIDI()
    instrument = pretty_midi.Instrument(program=0)  # 0 = piano

    current_time = 0.0  # Initial time

    for note_name, duration in notes:
        # Convert the note to a MIDI number
        note_number = pretty_midi.note_name_to_number(note_name)
        # Create MIDI note
        note = pretty_midi.Note(velocity=100, pitch=note_number,
                                start=current_time, end=current_time + duration)
        # Add note to instrument
        instrument.notes.append(note)
        # Update current time
        current_time += duration

    # Add instrument to MIDI
    midi.instruments.append(instrument)

    # Save the MIDI file
    midi.write(output_file)
    print(f"MIDI file created: {output_file}")
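
The timing logic of create_midi_from_notes can be checked without writing a file. The sketch below reproduces only the scheduling step, in which every note starts where the previous one ends; the helper name schedule_notes is hypothetical.

```python
def schedule_notes(notes):
    """Returns (name, start, end) triples, laying the notes out one after another."""
    schedule = []
    current_time = 0.0
    for name, duration in notes:
        schedule.append((name, current_time, current_time + duration))
        current_time += duration
    return schedule

timeline = schedule_notes([("C4", 1.0), ("G4", 0.5), ("A4", 2.0)])
for name, start, end in timeline:
    print(f"{name}: {start:.1f}s -> {end:.1f}s")
```

Because onset times are not part of the token, notes that sounded together in the original performance are rendered as consecutive notes, so the output is always a single melodic line.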