### What is happening in the preprocessing

#### `preprocess_midi` function

- Takes a file path as input.
- Loads the MIDI file using `pyd.PrettyMIDI`.
- Extracts notes from all instruments in the MIDI file.
- Rounds the start and end times of each note to two decimal places.
- Sorts the notes based on their start times.
- Uses `MidiEventProcessor` to encode the notes into a representation sequence.
- Updates a global variable `total` by adding the length of the representation sequence.
- Returns the representation sequence.
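
The rounding and sorting steps above can be sketched without any MIDI library. This is a minimal illustration, not the notebook's code: the `Note` dataclass is a hypothetical stand-in for `pretty_midi.Note`, and `round_and_sort` is a name invented here.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """Hypothetical stand-in for pretty_midi.Note (times in seconds)."""
    start: float
    end: float
    pitch: int


def round_and_sort(notes, ndigits=2):
    """Round note boundaries to `ndigits` decimals, then sort by start time."""
    rounded = [
        Note(round(n.start, ndigits), round(n.end, ndigits), n.pitch)
        for n in notes
    ]
    # sorted() is stable, so notes that share a start time keep their input order.
    return sorted(rounded, key=lambda n: n.start)


notes = [Note(1.2345, 1.7891, 64), Note(0.5011, 1.0049, 60), Note(0.5049, 0.7512, 67)]
result = round_and_sort(notes)
print([(n.start, n.end, n.pitch) for n in result])
```

Note that rounding can make two notes start at the same time (pitches 60 and 67 here), so the order among simultaneous notes then depends only on their original order.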

#### `preprocess_pop909` function

- Takes the root folder of the POP909 dataset (`midi_root`) and a directory to save processed data (`save_dir`) as inputs.
- Iterates through MIDI files in the `midi_root` directory.
- Calls `preprocess_midi` for each MIDI file to obtain the representation sequence.
- Collects the representation sequences into a NumPy array (`save_py`).
- Saves the NumPy array as a file named "pop909-event-token.npy" using `np.save`.
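
The directory traversal step can be sketched with the standard library alone. The helper name `find_midi_files` and the throwaway folder layout are assumptions made for this example. The explicit `isdir` check is there because `os.walk` silently yields nothing when the root folder does not exist, which makes a wrong path look like an empty dataset.

```python
import os
import tempfile


def find_midi_files(midi_root):
    """Return the sorted paths of all .mid/.midi files below `midi_root`."""
    # os.walk yields nothing for a missing directory, so check explicitly.
    if not os.path.isdir(midi_root):
        raise FileNotFoundError(f"MIDI root not found: {midi_root}")
    paths = []
    for subdir, _dirs, files in os.walk(midi_root):
        for filename in files:
            if filename.lower().endswith((".mid", ".midi")):
                paths.append(os.path.join(subdir, filename))
    return sorted(paths)


# Demo on a temporary folder with one sub-folder per song (001/001.mid, ...).
with tempfile.TemporaryDirectory() as root:
    for name in ("001", "002"):
        os.makedirs(os.path.join(root, name))
        open(os.path.join(root, name, f"{name}.mid"), "wb").close()
    open(os.path.join(root, "README.txt"), "wb").close()  # non-MIDI file, ignored

    found = [os.path.basename(p) for p in find_midi_files(root)]

    missing_raises = False
    try:
        find_midi_files(os.path.join(root, "does-not-exist"))
    except FileNotFoundError:
        missing_raises = True

print(found, missing_raises)
```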

#### Data processing

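- In the reference notebook, the representation sequence is a list of event tokens that `MidiEventProcessor` derives from the notes; the exact event vocabulary depends on that class. The cells below do not use `MidiEventProcessor` and instead store each note as a plain tuple, `(start, end, pitch)` at first and later with duration and velocity added.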
- The total length of the representation sequences is accumulated in the global variable `total`.
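
To make "a sequence of events" concrete, here is one simple performance-style encoding of note tuples. It is an illustrative sketch only, not the `MidiEventProcessor` implementation: the function name, the event names, and the 10 ms time step are all assumptions, and velocity is left out.

```python
def notes_to_events(notes, step=0.01):
    """Encode (start, end, pitch) tuples as note_on / note_off / time_shift events.

    Times are in seconds; `step` is the assumed time-shift resolution.
    """
    # Each boundary is (tick, order, name, pitch); order 0 < 1 makes a
    # note_off sort before a note_on that falls on the same tick.
    boundaries = []
    for start, end, pitch in notes:
        boundaries.append((round(start / step), 1, "note_on", pitch))
        boundaries.append((round(end / step), 0, "note_off", pitch))
    boundaries.sort()

    events, now = [], 0
    for tick, _order, name, pitch in boundaries:
        if tick > now:
            # Advance the clock before emitting the next note event.
            events.append(("time_shift", tick - now))
            now = tick
        events.append((name, pitch))
    return events


events = notes_to_events([(0.0, 0.5, 60), (0.5, 1.0, 64)])
print(events)
```

Summing `len(events)` over all files gives a token count comparable to what the global `total` accumulates.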

#### Execution

- The cells below run with the POP909 dataset located in the "POP909_v3" folder and save the processed data in the "midi_data/" directory (later cells write to "midi_data_v2/").


ref: https://github.com/music-x-lab/POP909-Dataset/blob/master/data_process/data_process.ipynb

In [14]:
# '''
# This is the data processing script for POP909: A Pop-song Dataset for Music Arrangement Generation
# ============
# It lets you quickly convert the POP909 MIDI files into Google Magenta's music representation,
#     as used by [Music Transformer](https://magenta.tensorflow.org/music-transformer)
#     and [Performance RNN](https://magenta.tensorflow.org/performance-rnn).
# '''


In [11]:
import os
import pickle
import numpy as np
import pretty_midi as pyd
from music21 import converter, metadata, environment

total = 0

def preprocess_midi(path):
    global total
    data = pyd.PrettyMIDI(path)
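    main_notes = []  # never filled in this version, so main_events is always empty
    acc_notes = []
    # Collect the notes of every instrument into one list
    for ins in data.instruments:
        acc_notes.extend(ins.notes)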
    for i in range(len(main_notes)):
        main_notes[i].start = round(main_notes[i].start, 2)
        main_notes[i].end = round(main_notes[i].end, 2)
    for i in range(len(acc_notes)):
        acc_notes[i].start = round(acc_notes[i].start, 2)
        acc_notes[i].end = round(acc_notes[i].end, 2)
    main_notes.sort(key=lambda x: x.start)
    acc_notes.sort(key=lambda x: x.start)
    
    # Replace the following lines with your own processing logic
    main_events = [(note.start, note.end, note.pitch) for note in main_notes]
    acc_events = [(note.start, note.end, note.pitch) for note in acc_notes]
    
    total += len(main_events) + len(acc_events)
    return main_events, acc_events

def is_midi_file(filename):
    return filename.lower().endswith(".mid")

def preprocess_pop909(midi_root, save_dir):
    save_py = []

    try:
        # Create the "midi_data/" directory if it doesn't exist
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)

        # Traverse subdirectories in "POP909_v3"
        for subdir, dirs, files in os.walk(midi_root):
            for filename in files:
                file_path = os.path.join(subdir, filename)
                if is_midi_file(file_path):
                    try:
                        main_events, acc_events = preprocess_midi(file_path)
                        save_py.append((main_events, acc_events))
                    except KeyboardInterrupt:
                        print(' Abort')
                        return
                    except Exception as e:
                        print(f'Error processing file {file_path}: {e}')

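        # The event lists differ in length per file, so request an object array explicitly
        # (recent NumPy versions refuse to build ragged arrays implicitly)
        save_py = np.array(save_py, dtype=object)
        print(save_py.size)
        # Object arrays are pickled on save; load them with np.load(..., allow_pickle=True)
        np.save(os.path.join(save_dir, "pop909-event-token.npy"), save_py)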

    except Exception as e:
        print(f'An error occurred: {e}')

# Specify the paths
midi_root = "POP909_v3"
save_dir = "midi_data/"

# Call the preprocessing function
preprocess_pop909(midi_root, save_dir)



1756


In [1]:
import os
import pickle
import numpy as np
import pretty_midi as pyd

total = 0

def preprocess_midi(path):
    global total
    data = pyd.PrettyMIDI(path)
    main_notes = []  # never filled in this version, so main_events is always empty
    acc_notes = []
    
    # Extract all notes from all instruments
    for ins in data.instruments:
        acc_notes.extend(ins.notes)
    
    # Snap note boundaries to a fixed time grid
    for i in range(len(acc_notes)):
        acc_notes[i].start = round(acc_notes[i].start, 2)
        acc_notes[i].end = round(acc_notes[i].end, 2)
        # pretty_midi times are in seconds, so this snaps to a 0.25 s grid;
        # that equals a 16th note only at 60 BPM
        acc_notes[i].start = round(acc_notes[i].start * 4) / 4
        acc_notes[i].end = round(acc_notes[i].end * 4) / 4
    
    acc_notes.sort(key=lambda x: x.start)
    
    # Replace the following lines with your own processing logic
    main_events = [(note.start, note.end, note.pitch, note.duration, note.velocity) for note in main_notes]
    acc_events = [(note.start, note.end, note.pitch, note.duration, note.velocity) for note in acc_notes]
    
    total += len(main_events) + len(acc_events)
    return main_events, acc_events

def is_midi_file(filename):
    return filename.lower().endswith(".mid")

def preprocess_pop909(midi_root, save_dir):
    save_py = []

    try:
        # Create the "midi_data/" directory if it doesn't exist
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)

        # Traverse subdirectories in "POP909_v3"
        for subdir, dirs, files in os.walk(midi_root):
            for filename in files:
                file_path = os.path.join(subdir, filename)
                if is_midi_file(file_path):
                    try:
                        main_events, acc_events = preprocess_midi(file_path)
                        save_py.append((main_events, acc_events))
                    except KeyboardInterrupt:
                        print(' Abort')
                        return
                    except Exception as e:
                        print(f'Error processing file {file_path}: {e}')

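        # The event lists differ in length per file, so request an object array explicitly
        # (recent NumPy versions refuse to build ragged arrays implicitly)
        save_py = np.array(save_py, dtype=object)
        print(save_py.size)
        # Object arrays are pickled on save; load them with np.load(..., allow_pickle=True)
        np.save(os.path.join(save_dir, "pop909-event-token.npy"), save_py)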

    except Exception as e:
        print(f'An error occurred: {e}')

# Specify the paths
midi_root = "POP909_v3"
save_dir = "midi_data/"

# Call the preprocessing function
preprocess_pop909(midi_root, save_dir)



1756
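
The cell above snaps times with `round(t * 4) / 4`, which is a fixed 0.25-second grid. A tempo-aware variant could look like the sketch below. It assumes a constant tempo and a quarter-note beat; the function name and parameters are made up for this example, and the tempo itself would have to come from the file (for instance via `PrettyMIDI.get_tempo_changes()`).

```python
def quantize_to_grid(time_sec, bpm, steps_per_beat=4):
    """Snap a time in seconds to the nearest grid step (16th notes by default)."""
    step = 60.0 / bpm / steps_per_beat  # duration of one grid step in seconds
    return round(time_sec / step) * step


at_120 = quantize_to_grid(0.40, bpm=120)  # 16th note = 0.125 s
at_60 = quantize_to_grid(0.40, bpm=60)    # 16th note = 0.25 s
fixed_grid = round(0.40 * 4) / 4          # what the cell above computes

print(at_120, at_60, fixed_grid)
```

The fixed grid agrees with the 60 BPM result only; at 120 BPM the same onset lands on a different 16th-note position.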


In [2]:
import os
import numpy as np
import pretty_midi as pyd

total = 0

def preprocess_midi(path):
    global total
    data = pyd.PrettyMIDI(path)
    main_notes = []  # never filled in this version, so main_events is always empty
    acc_notes = []
    
    # Extract all notes from all instruments
    for ins in data.instruments:
        acc_notes.extend(ins.notes)
    
    # Snap note boundaries to a fixed time grid
    for i in range(len(acc_notes)):
        acc_notes[i].start = round(acc_notes[i].start, 2)
        acc_notes[i].end = round(acc_notes[i].end, 2)
        # pretty_midi times are in seconds, so this snaps to a 0.25 s grid;
        # that equals a 16th note only at 60 BPM
        acc_notes[i].start = round(acc_notes[i].start * 4) / 4
        acc_notes[i].end = round(acc_notes[i].end * 4) / 4
    
    acc_notes.sort(key=lambda x: x.start)
    
    # Replace the following lines with your own processing logic
    main_events = [(note.start, note.end, note.pitch, note.end - note.start, note.velocity) for note in main_notes]
    acc_events = [(note.start, note.end, note.pitch, note.end - note.start, note.velocity) for note in acc_notes]
    
    total += len(main_events) + len(acc_events)
    return main_events, acc_events


def is_midi_file(filename):
    return filename.lower().endswith(".mid")

def preprocess_pop909(midi_root, save_dir):
    save_py = []

    try:
        # Create the "midi_data_v2/" directory if it doesn't exist
        save_dir_v2 = save_dir + "_v2"
        if not os.path.exists(save_dir_v2):
            os.makedirs(save_dir_v2)

        # Traverse subdirectories in "POP909_v3"
        for subdir, dirs, files in os.walk(midi_root):
            for filename in files:
                file_path = os.path.join(subdir, filename)
                if is_midi_file(file_path):
                    try:
                        main_events, acc_events = preprocess_midi(file_path)
                        save_py.append((main_events, acc_events))
                    except KeyboardInterrupt:
                        print(' Abort')
                        return
                    except Exception as e:
                        print(f'Error processing file {file_path}: {e}')

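        # The event lists differ in length per file, so request an object array explicitly
        # (recent NumPy versions refuse to build ragged arrays implicitly)
        save_py = np.array(save_py, dtype=object)
        print(save_py.size)
        # Object arrays are pickled on save; load them with np.load(..., allow_pickle=True)
        np.save(os.path.join(save_dir_v2, "pop909-event-token.npy"), save_py)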

    except Exception as e:
        print(f'An error occurred: {e}')

# Specify the paths
midi_root = "POP909_v3"
save_dir = "midi_data"

# Call the preprocessing function
preprocess_pop909(midi_root, save_dir)



1756


In [None]:
# Reference: miditok.MIDITokenizer

import os
import numpy as np
import pretty_midi as pyd
from miditok import REMI, TokenizerConfig
from pathlib import Path

def is_midi_file(filename):
    return filename.lower().endswith(".mid")

def tokenize_midi(midi_path, tokenizer):
    data = pyd.PrettyMIDI(midi_path)

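    # Extract tempo information
    # NOTE: REMI exposes no `tokenize_tempo` method (see the errors logged below), so this
    # call raises AttributeError for every file and the rest of the function never runs
    tempo_changes = data.get_tempo_changes()
    tempo_tokens = tokenizer.tokenize_tempo(tempo_changes[0], tempo_changes[1])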

    # Extract note information
    notes = []
    for instrument in data.instruments:
        notes.extend(instrument.notes)

    note_tokens = tokenizer.tokenize_notes(notes)

    # Extract chord, pitch, duration, and velocity information
    chord_tokens = tokenizer.tokenize_chords(data)
    pitch_tokens, duration_tokens, velocity_tokens = tokenizer.tokenize_notes_details(notes)

    return tempo_tokens, note_tokens, chord_tokens, pitch_tokens, duration_tokens, velocity_tokens

def preprocess_pop909(midi_root, save_dir):
    save_tokens = []

    try:
        # Create the "midi_data_v2/" directory if it doesn't exist
        save_dir_v2 = save_dir + "_v2"
        if not os.path.exists(save_dir_v2):
            os.makedirs(save_dir_v2)

        # Creating a multitrack tokenizer configuration
        config = TokenizerConfig(num_velocities=16, use_chords=True, use_programs=True)
        tokenizer = REMI(config)

        # Traverse subdirectories in "POP909_v3"
        for subdir, dirs, files in os.walk(midi_root):
            for filename in files:
                file_path = os.path.join(subdir, filename)
                if is_midi_file(file_path):
                    try:
                        tempo_tokens, note_tokens, chord_tokens, pitch_tokens, duration_tokens, velocity_tokens = tokenize_midi(file_path, tokenizer)
                        save_tokens.append((tempo_tokens, note_tokens, chord_tokens, pitch_tokens, duration_tokens, velocity_tokens))
                    except KeyboardInterrupt:
                        print(' Abort')
                        return
                    except Exception as e:
                        print(f'Error processing file {file_path}: {e}')

        # dtype=object keeps NumPy from rejecting per-file token lists of different lengths
        save_tokens = np.array(save_tokens, dtype=object)
        np.save(os.path.join(save_dir_v2, "pop909-tokenized.npy"), save_tokens)

        # Tokenize the whole dataset and save it as JSON files
        midi_paths = list(Path(midi_root).rglob("*.mid"))
        data_augmentation_offsets = [2, 1, 1]
        tokenizer.tokenize_midi_dataset(midi_paths, Path(save_dir_v2, "tokens_noBPE"), data_augment_offsets=data_augmentation_offsets)

        # Constructs the vocabulary with BPE, from the token files
        tokenizer.learn_bpe(
            vocab_size=10000,
            tokens_paths=list(Path(save_dir_v2, "tokens_noBPE").rglob("**/*.json")),
            start_from_empty_voc=False,
        )

        # Saving the tokenizer configuration
        tokenizer.save_params(Path(save_dir_v2, "tokenizer.json"))

    except Exception as e:
        print(f'An error occurred: {e}')

# Specify the paths
midi_root = "POP909_v3"
save_dir = "midi_data"

# Call the preprocessing function
preprocess_pop909(midi_root, save_dir)


Error processing file POP909_v3\001\001.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\002\002.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\003\003.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\004\004.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\005\005.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\006\006.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\007\007.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\008\008.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\009\009.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\010\010.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\011\011.mid: 'REMI' object has no attr

Error processing file POP909_v3\099\099.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\100\100.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\101\101.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\103\103.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\104\104.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\105\105.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\106\106.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\107\107.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\108\108.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\109\109.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\110\110.mid: 'REMI' object has no attr

Error processing file POP909_v3\190\190.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\191\191.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\192\192.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\193\193.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\194\194.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\195\195.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\196\196.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\197\197.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\198\198.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\199\199.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\200\200.mid: 'REMI' object has no attr

Error processing file POP909_v3\282\282.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\283\283.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\284\284.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\285\285.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\286\286.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\287\287.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\288\288.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\289\289.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\290\290.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\291\291.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\292\292.mid: 'REMI' object has no attr

Error processing file POP909_v3\373\373.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\374\374.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\375\375.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\376\376.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\377\377.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\378\378.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\379\379.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\380\380.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\381\381.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\382\382.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\383\383.mid: 'REMI' object has no attr

Error processing file POP909_v3\465\465.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\466\466.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\467\467.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\468\468.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\469\469.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\470\470.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\471\471.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\472\472.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\473\473.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\474\474.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\475\475.mid: 'REMI' object has no attr

Error processing file POP909_v3\562\562.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\564\564.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\565\565.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\566\566.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\567\567.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\568\568.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\569\569.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\570\570.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\571\571.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\572\572.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\573\573.mid: 'REMI' object has no attr

Error processing file POP909_v3\656\656.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\657\657.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\658\658.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\659\659.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\660\660.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\661\661.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\662\662.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\663\663.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\664\664.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\665\665.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\666\666.mid: 'REMI' object has no attr

Error processing file POP909_v3\763\763.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\764\764.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\765\765.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\766\766.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\767\767.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\768\768.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\769\769.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\770\770.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\771\771.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\772\772.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\773\773.mid: 'REMI' object has no attr

Error processing file POP909_v3\853\853.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\854\854.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\855\855.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\856\856.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\857\857.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\858\858.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\859\859.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\860\860.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\861\861.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\862\862.mid: 'REMI' object has no attribute 'tokenize_tempo'
Error processing file POP909_v3\863\863.mid: 'REMI' object has no attr

Tokenizing MIDIs (midi_data_v2/tokens_noBPE): 100%|██████████████████████████████████| 878/878 [01:24<00:00, 10.38it/s]
Performing data augmentation: 100%|██████████████████████████████████████████████████| 878/878 [00:39<00:00, 22.21it/s]
Loading token files: 100%|████████████████████████████████████████████████████████| 6129/6129 [00:40<00:00, 153.04it/s]
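
The log above repeats the same `AttributeError` for every file because the per-file `except Exception` also swallows programming errors. One way to separate the two cases is sketched below: errors that almost certainly indicate a bug in the calling code are re-raised on the first file, while file-specific errors are collected. The names `process_all` and `FakeTokenizer` are hypothetical and exist only for this example.

```python
def process_all(paths, process_one):
    """Run `process_one` on every path; skip bad files but surface code bugs."""
    results, failed = [], []
    for path in paths:
        try:
            results.append(process_one(path))
        except (AttributeError, NameError, TypeError):
            # Almost certainly a bug in our own code, not in the file: stop now.
            raise
        except Exception as exc:
            failed.append((path, str(exc)))
    return results, failed


def parse(path):
    """Toy per-file step that fails on one 'corrupt' file."""
    if "bad" in path:
        raise ValueError("corrupt file")
    return path.upper()


results, failed = process_all(["a.mid", "bad.mid", "c.mid"], parse)


class FakeTokenizer:
    """Hypothetical tokenizer that lacks the method we try to call."""


calls = []


def broken(path):
    calls.append(path)
    return FakeTokenizer().tokenize_tempo(path)  # raises AttributeError


try:
    process_all(["a.mid", "b.mid"], broken)
    stopped_early = False
except AttributeError:
    stopped_early = True

print(results, failed, calls, stopped_early)
```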


mmm-lmd
https://colab.research.google.com/drive/1KLbe-ZnIyvpPypVqYapBRs-o5Q1E7a9R?usp=sharing#scrollTo=ex9Lt0yWw5Ud

In [6]:
import os
from copy import deepcopy
from math import ceil

from miditoolkit import MidiFile
from tqdm import tqdm

MAX_NB_BAR = 8
MIN_NB_NOTES = 20
dataset = "POP909"  # Change to your dataset name

merged_out_dir = os.path.join("C:/Users/naomi/Thesis/Thesis/Thesis-main/output", f"{dataset}-chunked")
os.makedirs(merged_out_dir, exist_ok=True)

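# Adjust the base path to your dataset
# NOTE: the path below already ends in "POP909", so appending `dataset` yields ".../POP909/POP909";
# the "Found 0 MIDI files." output suggests that folder does not exist or holds no MIDI files
base_path = os.path.join("C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909", dataset)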
midi_paths = [os.path.join(root, file) for root, dirs, files in os.walk(base_path) for file in files if file.endswith(('.mid', '.midi'))]

print(f"Found {len(midi_paths)} MIDI files.")

for i, midi_path in enumerate(tqdm(midi_paths, desc="CHUNKING MIDIS")):
    try:
        # Determine the output directory for this file
        relative_path = os.path.relpath(midi_path, base_path)
        output_dir = os.path.join(merged_out_dir, os.path.dirname(relative_path))
        os.makedirs(output_dir, exist_ok=True)

        # Check if chunks already exist
        midi_filename = os.path.splitext(os.path.basename(midi_path))[0]
        chunk_paths = [f for f in os.listdir(output_dir) if f.startswith(f"{midi_filename}_") and f.endswith('.mid')]
        if len(chunk_paths) > 0:
            print(f"Chunks for {midi_path} already exist, skipping...")
            continue

        # Load the MIDI and split it into chunks of MAX_NB_BAR bars (assumes 4 beats per bar)
        midi = MidiFile(midi_path)
        ticks_per_cut = MAX_NB_BAR * midi.ticks_per_beat * 4
        nb_cuts = ceil(midi.max_tick / ticks_per_cut)
        if nb_cuts < 2:
            continue

        print(f"Processing {midi_path}")
        midis = [deepcopy(midi) for _ in range(nb_cuts)]

        for j, track in enumerate(midi.instruments):  # notes are not always stored in order, so sort them first
            track.notes.sort(key=lambda x: x.start)
            for midi_short in midis:  # empty this track in every chunk copy before refilling it
                midi_short.instruments[j].notes = []
            for note in track.notes:
                cut_id = note.start // ticks_per_cut
                note_copy = deepcopy(note)
                note_copy.start -= cut_id * ticks_per_cut
                note_copy.end -= cut_id * ticks_per_cut
                midis[cut_id].instruments[j].notes.append(note_copy)

        # Saving MIDIs
        for j, midi_short in enumerate(midis):
            if sum(len(track.notes) for track in midi_short.instruments) < MIN_NB_NOTES:
                continue
            output_filename = f"{midi_filename}_{j}.mid"
            output_path = os.path.join(output_dir, output_filename)
            midi_short.dump(output_path)

    except Exception as e:
        print(f"An error occurred while processing {midi_path}: {e}")


Found 0 MIDI files.


CHUNKING MIDIS: 0it [00:00, ?it/s]
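
The chunking arithmetic in the cell above can be checked in isolation. The sketch below works on plain `(start, end)` tick pairs instead of `miditoolkit` objects. The helper name `assign_chunks`, the 480 ticks per beat, and the 4/4 meter are assumptions for illustration.

```python
from math import ceil


def assign_chunks(note_spans, ticks_per_beat, max_bars=8, beats_per_bar=4):
    """Group (start, end) tick spans into chunks of `max_bars` bars.

    Returns {chunk_id: [(start, end), ...]} with times relative to the chunk start.
    """
    ticks_per_cut = max_bars * ticks_per_beat * beats_per_bar
    chunks = {}
    for start, end in sorted(note_spans):
        cut_id = start // ticks_per_cut  # the chunk is chosen by the note's start
        offset = cut_id * ticks_per_cut
        chunks.setdefault(cut_id, []).append((start - offset, end - offset))
    return chunks


spans = [(0, 480), (15000, 15600), (15360, 15840), (31000, 31200)]
ticks_per_cut = 8 * 480 * 4                                     # 15360 ticks per chunk
nb_cuts = ceil(max(end for _, end in spans) / ticks_per_cut)    # 3 chunks
chunks = assign_chunks(spans, ticks_per_beat=480)
print(nb_cuts, chunks)
```

A note that starts in one chunk and ends in the next, such as `(15000, 15600)`, stays whole in the chunk where it starts, so its end can exceed the chunk length.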


In [23]:
import os
from copy import deepcopy
from math import ceil
from miditoolkit import MidiFile
from tqdm import tqdm

MAX_NB_BAR = 8
MIN_NB_NOTES = 20
dataset = "POP909"  # Change to your dataset name

merged_out_dir = os.path.join("C:/Users/naomi/Thesis/Thesis/Thesis-main/output", f"{dataset}-chunked")
os.makedirs(merged_out_dir, exist_ok=True)

# Adjust the root folder to your dataset
root_folder = 'C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909'

# Iterate over folders and MIDI files
for folder_name in os.listdir(root_folder):
    folder_path = os.path.join(root_folder, folder_name)

    # Check if the item in the directory is a folder
    if os.path.isdir(folder_path):
        print(f"Processing folder: {folder_name}")

        # Iterate over MIDI files in the folder
        for file_name in os.listdir(folder_path):
            if file_name.endswith('.mid'):
                midi_file_path = os.path.join(folder_path, file_name)

                try:
                    # Determine the output directory for this file
                    relative_path = os.path.relpath(midi_file_path, root_folder)
                    output_dir = os.path.join(merged_out_dir, os.path.dirname(relative_path))
                    os.makedirs(output_dir, exist_ok=True)

                    # Check if chunks already exist
                    midi_filename = os.path.splitext(os.path.basename(midi_file_path))[0]
                    chunk_paths = [f for f in os.listdir(output_dir) if f.startswith(f"{midi_filename}_") and f.endswith('.mid')]
                    if len(chunk_paths) > 0:
                        print(f"Chunks for {midi_file_path} already exist, skipping...")
                        continue

                    # Load the MIDI and split it into chunks of MAX_NB_BAR bars (assumes 4 beats per bar)
                    midi = MidiFile(midi_file_path)
                    ticks_per_cut = MAX_NB_BAR * midi.ticks_per_beat * 4
                    nb_cuts = ceil(midi.max_tick / ticks_per_cut)
                    if nb_cuts < 2:
                        continue

                    print(f"Processing {midi_file_path}")
                    midis = [deepcopy(midi) for _ in range(nb_cuts)]

                    for j, track in enumerate(midi.instruments):  # notes are not always stored in order, so sort them first
                        track.notes.sort(key=lambda x: x.start)
                        for midi_short in midis:  # empty this track in every chunk copy before refilling it
                            midi_short.instruments[j].notes = []
                        for note in track.notes:
                            cut_id = note.start // ticks_per_cut
                            note_copy = deepcopy(note)
                            note_copy.start -= cut_id * ticks_per_cut
                            note_copy.end -= cut_id * ticks_per_cut
                            midis[cut_id].instruments[j].notes.append(note_copy)

                    # Saving MIDIs
                    for j, midi_short in enumerate(midis):
                        if sum(len(track.notes) for track in midi_short.instruments) < MIN_NB_NOTES:
                            continue
                        output_filename = f"{midi_filename}_{j}.mid"
                        output_path = os.path.join(output_dir, output_filename)
                        midi_short.dump(output_path)

                except Exception as e:
                    print(f"An error occurred while processing {midi_file_path}: {e}")


Processing folder: 001
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\001\001.mid
Processing folder: 002
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\002\002.mid
Processing folder: 003
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\003\003.mid
Processing folder: 004
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\004\004.mid
Processing folder: 005
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\005\005.mid
Processing folder: 006
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\006\006.mid
Processing folder: 007
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\007\007.mid
Processing folder: 008
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\008\008.mid
Processing folder: 009
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\009\009.mid
Processing folder: 010
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\010\010.mid
Processing folder: 011
Processing C:/Users/naomi/Thesis/Thes

Processing folder: 089
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\089\089.mid
Processing folder: 090
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\090\090.mid
Processing folder: 091
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\091\091.mid
Processing folder: 092
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\092\092.mid
Processing folder: 093
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\093\093.mid
Processing folder: 094
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\094\094.mid
Processing folder: 095
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\095\095.mid
Processing folder: 096
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\096\096.mid
Processing folder: 097
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\097\097.mid
Processing folder: 098
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\098\098.mid
Processing folder: 099
Processing C:/Users/naomi/Thesis/Thes

Processing folder: 177
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\177\177.mid
Processing folder: 178
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\178\178.mid
Processing folder: 179
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\179\179.mid
Processing folder: 180
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\180\180.mid
Processing folder: 181
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\181\181.mid
Processing folder: 182
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\182\182.mid
Processing folder: 183
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\183\183.mid
Processing folder: 184
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\184\184.mid
Processing folder: 185
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\185\185.mid
Processing folder: 186
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\186\186.mid
Processing folder: 187
Processing C:/Users/naomi/Thesis/Thes

Processing folder: 265
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\265\265.mid
Processing folder: 266
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\266\266.mid
Processing folder: 267
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\267\267.mid
Processing folder: 268
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\268\268.mid
Processing folder: 269
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\269\269.mid
Processing folder: 270
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\270\270.mid
Processing folder: 271
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\271\271.mid
Processing folder: 272
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\272\272.mid
Processing folder: 273
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\273\273.mid
Processing folder: 274
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\274\274.mid
Processing folder: 275
Processing C:/Users/naomi/Thesis/Thes

Processing folder: 353
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\353\353.mid
Processing folder: 354
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\354\354.mid
Processing folder: 355
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\355\355.mid
Processing folder: 356
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\356\356.mid
Processing folder: 357
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\357\357.mid
Processing folder: 358
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\358\358.mid
Processing folder: 359
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\359\359.mid
Processing folder: 360
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\360\360.mid
Processing folder: 361
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\361\361.mid
Processing folder: 362
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\362\362.mid
Processing folder: 363
Processing C:/Users/naomi/Thesis/Thes

Processing folder: 441
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\441\441.mid
Processing folder: 442
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\442\442.mid
Processing folder: 443
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\443\443.mid
Processing folder: 444
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\444\444.mid
Processing folder: 445
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\445\445.mid
Processing folder: 446
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\446\446.mid
Processing folder: 447
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\447\447.mid
Processing folder: 448
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\448\448.mid
Processing folder: 449
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\449\449.mid
Processing folder: 450
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\450\450.mid
Processing folder: 451
Processing C:/Users/naomi/Thesis/Thes

Processing folder: 529
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\529\529.mid
Processing folder: 530
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\530\530.mid
Processing folder: 531
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\531\531.mid
Processing folder: 532
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\532\532.mid
Processing folder: 533
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\533\533.mid
Processing folder: 534
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\534\534.mid
Processing folder: 535
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\535\535.mid
Processing folder: 536
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\536\536.mid
Processing folder: 537
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\537\537.mid
Processing folder: 538
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\538\538.mid
Processing folder: 539
Processing C:/Users/naomi/Thesis/Thes

Processing folder: 617
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\617\617.mid
Processing folder: 618
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\618\618.mid
Processing folder: 619
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\619\619.mid
Processing folder: 620
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\620\620.mid
Processing folder: 621
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\621\621.mid
Processing folder: 622
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\622\622.mid
Processing folder: 623
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\623\623.mid
Processing folder: 624
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\624\624.mid
Processing folder: 625
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\625\625.mid
Processing folder: 626
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\626\626.mid
Processing folder: 627
Processing C:/Users/naomi/Thesis/Thes

Processing folder: 705
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\705\705.mid
Processing folder: 706
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\706\706.mid
Processing folder: 707
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\707\707.mid
Processing folder: 708
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\708\708.mid
Processing folder: 709
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\709\709.mid
Processing folder: 710
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\710\710.mid
Processing folder: 711
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\711\711.mid
Processing folder: 712
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\712\712.mid
Processing folder: 713
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\713\713.mid
Processing folder: 714
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\714\714.mid
Processing folder: 715
Processing C:/Users/naomi/Thesis/Thes

Processing folder: 793
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\793\793.mid
Processing folder: 794
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\794\794.mid
Processing folder: 795
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\795\795.mid
Processing folder: 796
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\796\796.mid
Processing folder: 797
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\797\797.mid
Processing folder: 798
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\798\798.mid
Processing folder: 799
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\799\799.mid
Processing folder: 800
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\800\800.mid
Processing folder: 801
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\801\801.mid
Processing folder: 802
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\802\802.mid
Processing folder: 803
Processing C:/Users/naomi/Thesis/Thes

Processing folder: 881
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\881\881.mid
Processing folder: 882
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\882\882.mid
Processing folder: 883
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\883\883.mid
Processing folder: 884
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\884\884.mid
Processing folder: 885
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\885\885.mid
Processing folder: 886
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\886\886.mid
Processing folder: 887
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\887\887.mid
Processing folder: 888
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\888\888.mid
Processing folder: 889
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\889\889.mid
Processing folder: 890
Processing C:/Users/naomi/Thesis/Thesis/Thesis-main/POP909\890\890.mid
Processing folder: 891
Processing C:/Users/naomi/Thesis/Thes