# Tokenization

The script below uses the Musicaiz library to tokenize MIDI files from the POP909 dataset, producing token sequences suitable for deep learning tasks. It first installs the library with pip, then imports os for file system access and MMMTokenizer and MMMTokenizerArguments from Musicaiz for tokenization.

Two variables, input_directory and output_directory, set where the MIDI files are read from and where the tokenized outputs are saved. Tokenization parameters, such as windowing, time unit, window size, and tempo, are configured using MMMTokenizerArguments.

The script walks the input directory recursively and tokenizes every file ending in .mid with the configured tokenizer. Each tokenized output is appended to a list, and the whitespace-separated tokens it contains are added to a vocabulary set, which keeps only unique tokens.
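
The vocabulary update can be illustrated in isolation. The snippet below is a minimal sketch that assumes each tokenized output is a single whitespace-separated string; the token strings are made up for illustration and are not real MMMTokenizer output.

```python
# Illustrative token strings (assumed format, not actual Musicaiz output)
tokenized_outputs = [
    "PIECE_START TRACK_START INST=0 BAR_START NOTE_ON=60 TIME_DELTA=4 NOTE_OFF=60 BAR_END TRACK_END",
    "PIECE_START TRACK_START INST=0 BAR_START NOTE_ON=62 TIME_DELTA=4 NOTE_OFF=62 BAR_END TRACK_END",
]

vocabulary = set()
for tokenized_output in tokenized_outputs:
    # str.split() with no arguments splits on any run of whitespace
    vocabulary.update(tokenized_output.split())

# Tokens shared by both sequences are stored only once
print(len(vocabulary))
print(sorted(vocabulary))
```

Because a set discards duplicates, the vocabulary grows only when a file contains a token that has not been seen before.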

After processing all MIDI files, the script writes the tokenized outputs to a single file (all_tokenized_outputs.txt), with the outputs of individual MIDI files separated by newlines, and writes the vocabulary to a second file (vocabulary.txt), one token per line. Together these two files are the input for later deep learning steps, for example mapping tokens to integer ids before training a sequence model.
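
As one possible next step, the vocabulary file can be turned into a token-to-id mapping. The sketch below is not part of the original script: the helper names load_vocabulary and encode are hypothetical, and the demo writes a tiny made-up vocabulary file to a temporary directory instead of reading the real vocabulary.txt.

```python
import os
import tempfile


def load_vocabulary(path):
    """Read one token per line and map each token to an integer id."""
    with open(path, encoding="utf-8") as vocab_file:
        tokens = [line.strip() for line in vocab_file if line.strip()]
    # Sorting makes the ids reproducible regardless of the order in the file
    return {token: idx for idx, token in enumerate(sorted(tokens))}


def encode(sequence, token_to_id):
    """Convert a whitespace-separated token string into a list of ids."""
    return [token_to_id[token] for token in sequence.split()]


# Demo with a made-up vocabulary file (hypothetical contents)
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocabulary.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("NOTE_ON=60\nBAR_START\nBAR_END\n")
    token_to_id = load_vocabulary(vocab_path)

ids = encode("BAR_START NOTE_ON=60 BAR_END", token_to_id)
print(ids)
```

Sorting before assigning ids is a design choice for this sketch: a Python set has no guaranteed order, so a vocabulary file written from a set may list tokens in a different order on each run.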

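If tokenization of a file raises an exception, the script catches it, prints the file path together with the error message, and moves on to the next file. A single bad file therefore does not stop the run, but files that fail are left out of the saved outputs and the vocabulary. The errors logged in the output below include "Start time must be lower than the end time." and "list index out of range". **Overall, this script facilitates the conversion of MIDI data into tokenized representations, laying the groundwork for further analysis and modeling.**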
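
The skip-and-continue pattern can be extended to count failures by message, which makes it easier to see how many files were dropped and why. This is a sketch under stated assumptions: tokenize_all and fake_tokenize are hypothetical names, and fake_tokenize is a stand-in that replaces the real Musicaiz call so the example runs without the library or the dataset.

```python
from collections import Counter


def tokenize_all(paths, tokenize_fn):
    """Tokenize each path, skipping failures and counting them by message."""
    outputs = []
    failures = Counter()
    for path in paths:
        try:
            outputs.append(tokenize_fn(path))
        except Exception as exc:
            # Report the failure and keep going with the next file
            print(f"Error processing {path}: {exc}")
            failures[str(exc)] += 1
    return outputs, failures


def fake_tokenize(path):
    """Stand-in tokenizer (hypothetical): fails on paths containing 'bad'."""
    if "bad" in path:
        raise ValueError("Start time must be lower than the end time.")
    return "PIECE_START TRACK_START TRACK_END"


outputs, failures = tokenize_all(["a.mid", "bad_1.mid", "bad_2.mid"], fake_tokenize)
print(f"{len(outputs)} tokenized, {sum(failures.values())} failed")
for message, count in failures.most_common():
    print(f"{count}x {message}")
```

A per-message count gives a quick summary of a long error log without changing which files end up in the output.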

In [None]:
!pip install musicaiz

In [None]:
# Import required modules
import os  # Operating system interaction
from pathlib import Path  # File path handling
from musicaiz.tokenizers import MMMTokenizer, MMMTokenizerArguments  # Tokenization modules

In [1]:
# Input and output directories
input_directory = r"C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked"
output_directory = r"C:\Users\naomi\Thesis\Thesis\Thesis-main\tokenized_output_v2"

# Tokenization parameters
args = MMMTokenizerArguments(
    windowing=True,       # Indicates whether windowing is enabled for tokenization
    time_unit="SIXTEENTH",  # Defines the time unit for tokenization (e.g., "SIXTEENTH" note)
    num_programs=None,     # Number of MIDI programs to consider (None means all)
    shuffle_tracks=True,   # Determines if tracks are shuffled before tokenization
    track_density=False,   # Indicates whether to consider track density for tokenization
    window_size=4,         # Size of the window for tokenization (in time units)
    hop_length=1,          # Hop length between consecutive windows (in time units)
    time_sig=False,        # Indicates whether time signature is considered for tokenization
    velocity=False,        # Determines if velocity information is used for tokenization
    quantize=False,        # Determines if quantization is applied to note timings
    tempo=True,            # Indicates whether tempo information is considered for tokenization
)


# Initialize variables to store tokenized outputs and vocabulary
all_tokenized_outputs = []
vocabulary = set()

# Iterate through MIDI files in the input directory
for root, dirs, files in os.walk(input_directory):
    for file in files:
        if file.endswith(".mid"):
            midi_path = os.path.join(root, file)

            try:
                # Tokenize file
                tokenizer = MMMTokenizer(midi_path, args)
                tokenized_output = tokenizer.tokenize_file()

                # Append tokenized output to the list
                all_tokenized_outputs.append(tokenized_output)

                # Update vocabulary set
                vocabulary.update(tokenized_output.split())

            except Exception as e:
                print(f"Error processing {midi_path}: {e}")
                continue

# Create the output directory if it does not exist yet
os.makedirs(output_directory, exist_ok=True)

# Save all tokenized outputs into one file, separated by newlines
all_tokenized_path = os.path.join(output_directory, "all_tokenized_outputs.txt")
with open(all_tokenized_path, 'w', encoding='utf-8') as all_tokenized_file:
    all_tokenized_file.write('\n'.join(all_tokenized_outputs))

# Save vocabulary, sorted so the file is identical across runs
vocabulary_path = os.path.join(output_directory, "vocabulary.txt")
with open(vocabulary_path, 'w', encoding='utf-8') as vocab_file:
    for word in sorted(vocabulary):
        vocab_file.write(f"{word}\n")


Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\002\002_0.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\002\002_1.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\002\002_2.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\002\002_3.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\002\002_4.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\002\002_5.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\002\002_6.mid: Start time must be lower than the end time.
Error processing C:\Users\n

Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\094\094_0.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\094\094_1.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\094\094_2.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\094\094_3.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\094\094_4.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\094\094_5.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\094\094_6.mid: Start time must be lower than the end time.
Error processing C:\Users\n

Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\185\185_11.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\185\185_12.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\185\185_13.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\185\185_14.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\185\185_2.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\185\185_3.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\185\185_4.mid: Start time must be lower than the end time.
Error processing C:\Use

Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\224\224_0.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\224\224_1.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\224\224_10.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\224\224_2.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\224\224_3.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\224\224_4.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\224\224_5.mid: Start time must be lower than the end time.
Error processing C:\Users\

Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\266\266_4.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\266\266_5.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\266\266_6.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\266\266_7.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\266\266_8.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\266\266_9.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\289\289_0.mid: list index out of range
Error processing C:\Users\naomi\Thesis\Thesis\T

Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\560\560_2.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\560\560_3.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\560\560_4.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\560\560_5.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\560\560_6.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\560\560_7.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\560\560_8.mid: Start time must be lower than the end time.
Error processing C:\Users\n

Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\773\773_2.mid: list index out of range
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\791\791_0.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\791\791_1.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\791\791_2.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\791\791_3.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\791\791_4.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\Thesis-main\output\POP909-chunked\791\791_5.mid: Start time must be lower than the end time.
Error processing C:\Users\naomi\Thesis\Thesis\T

Reference: https://github.com/carlosholivan/musicaiz?tab=readme-ov-file