In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import time
import numpy as np
import os
from pathlib import Path
import pretty_midi
import json
import seaborn as sns
import matplotlib.pyplot as plt

### Loading MIDI Token Data
This section loads the tokenized MIDI data, which represents musical sequences for instrument 29 (an electric guitar program in the General MIDI standard). The data is stored in a JSON file in the raw_data directory. The code uses pathlib to build the file path in a platform-independent way and json to parse the file into a Python object, ready for further processing in the melody transformer model.

In [2]:
# Define the file path for the tokenized MIDI data
# Path("raw_data") creates a path object (nothing is created on disk), and / "29.json" appends the file name
file_path = Path("raw_data") / "29.json"

# Load the data
# Open the JSON file in read mode ("r") and parse its contents into a Python object
with file_path.open("r") as f:
    tokens = json.load(f)  # tokens now contains the deserialized JSON data (e.g., list or dict of MIDI tokens)
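The structure of the loaded object is not stated in this cell. The vocabulary analysis output further down prints tokens such as [53, 4.14, 4.45, 68], which suggests a list of four-element records. The sketch below checks that shape on a small in-memory sample. The helper name looks_like_note_record and the field interpretation (pitch, start, end, velocity) are assumptions based on that output, not something the notebook confirms.

```python
import json

# Hypothetical sample in the same shape as the tokens printed by the vocabulary analysis.
sample_json = "[[53, 4.14, 4.45, 68], [48, 4.14, 4.46, 70]]"
sample_tokens = json.loads(sample_json)


def looks_like_note_record(token):
    """Return True if token is a list of four numbers.

    Assumed field order: pitch, start time, end time, velocity.
    """
    return (
        isinstance(token, list)
        and len(token) == 4
        and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in token)
    )


all_valid = all(looks_like_note_record(t) for t in sample_tokens)
print(f"{len(sample_tokens)} tokens, all well-formed: {all_valid}")
```

Running a check like this on the real tokens list before building the vocabulary would surface malformed records early, before they become opaque vocabulary entries.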

### Device Configuration for Model Training
This section configures the computational device for training the melody transformer model. It checks for the availability of a CUDA-enabled GPU using the PyTorch library and selects it as the primary device if available; otherwise, it defaults to the CPU. This ensures the model can leverage GPU acceleration for faster training when possible, while maintaining compatibility with CPU-only environments. The chosen device is printed for verification.

In [3]:
# Select device: use CUDA (GPU) if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print the selected device for confirmation
print(f"Using device: {device}")  # Prints "Using device: cuda" if a GPU is available, else "Using device: cpu"

Using device: cuda


### Hyperparameter Configuration for the Melody Transformer

This section defines the key hyperparameters for the melody transformer model, which are critical for configuring its architecture and training behavior. These parameters include the dimensionality of token embeddings, the number of attention heads in the multi-head attention mechanism, the number of stacked transformer encoder layers, the size of the feed-forward network within each layer, and the dropout probability for regularization. These values are chosen to balance model capacity, computational efficiency, and generalization when generating musical sequences from MIDI data for instrument 29 (electric guitar).

In [4]:
# The dimensionality of the token embeddings. This is the size of the vector
# that will represent each token in the input sequence, capturing its semantic features.
EMBEDDING_DIM = 64  # 64-dimensional embeddings, a small model suited to a small dataset

# The number of attention heads in the multi-head attention mechanism.
# Each head processes a portion of the embedding, and EMBEDDING_DIM must be divisible by NUM_HEADS.
NUM_HEADS = 4  # 4 heads attend in parallel; 64 / 4 = 16 dimensions per head

# The number of Transformer encoder layers to stack.
# More layers increase model depth and capacity but also computational cost.
NUM_ENCODER_LAYERS = 2  # 2 layers keep the model shallow and inexpensive to train

# The dimension of the feed-forward network within each Transformer layer.
# This defines the size of the intermediate layer in the feed-forward subnetwork.
FF_DIM = 256  # 256 units, i.e. 4x the embedding dimension

# The dropout probability to be applied in the model.
# Dropout helps prevent overfitting by randomly setting a fraction of activations to zero during training.
DROPOUT = 0.1  # 10% dropout for regularization, a standard value for transformer models
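The divisibility constraint mentioned in the comments can be checked before the model is built. The following is a minimal sketch in plain Python; the helper name head_dim is hypothetical, and the constants are copied from the cell above.

```python
# Values copied from the hyperparameter cell.
EMBEDDING_DIM = 64
NUM_HEADS = 4
FF_DIM = 256


def head_dim(embedding_dim, num_heads):
    """Return the per-head dimensionality, or raise if the split is uneven."""
    if embedding_dim % num_heads != 0:
        raise ValueError(
            f"embedding_dim={embedding_dim} is not divisible by num_heads={num_heads}"
        )
    return embedding_dim // num_heads


per_head = head_dim(EMBEDDING_DIM, NUM_HEADS)
ff_ratio = FF_DIM / EMBEDDING_DIM
print(f"Dimensions per head: {per_head}, feed-forward expansion: {ff_ratio:g}x")
```

Failing early with a clear message is usually easier to debug than the assertion raised inside the attention module when the two values do not divide evenly.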

### Training Hyperparameter Configuration
This section sets the hyperparameters that govern training of the melody transformer model: the batch size (number of sequences processed in parallel), the sequence length (the context window the model sees for each training example), the number of epochs (full passes over the dataset), the learning rate (step size for weight updates), and the logging interval (how often progress is reported, in batches). The values are chosen to balance computational cost against model performance on the MIDI data for instrument 29 (electric guitar).

In [5]:
# The number of independent sequences to process in parallel.
# Larger batch sizes can improve training stability but require more memory.
BATCH_SIZE = 32  # 32 sequences per batch, suitable for moderate GPU memory

# The length of the subsequences used for training, i.e. the context window.
# (The name "backpropagation through time" (BPTT) length is inherited from recurrent models.)
# Shorter sequences reduce memory usage but may limit the model's ability to capture long-term dependencies.
SEQUENCE_LENGTH = 64  # 64 tokens, appropriate for capturing musical patterns in MIDI data

# The number of epochs to train the model for.
# Each epoch represents one full pass through the training dataset.
EPOCHS = 5  # 5 epochs, a reasonable starting point for convergence on MIDI data

# The learning rate for the optimizer.
# Controls the step size for updating model weights during optimization.
LEARNING_RATE = 0.001  # 0.001 is Adam's default learning rate and a reasonable starting point for a small model

# How often to log training progress (in batches).
# Logging provides insights into training dynamics without excessive output.
LOG_INTERVAL = 200  # Log progress every 200 batches for monitoring training

### Token Conversion to String Representations
This section defines a utility function, tokens_to_strings, which converts an array of tokenized MIDI data into a list of string representations. This function is essential for processing or analyzing the tokenized MIDI sequences (representing musical events for instrument 29, electric guitar) in a human-readable format, facilitating tasks such as comparison, debugging, or logging during the training or evaluation of the melody transformer model. The function ensures compatibility with downstream processes that may require string-based token representations.

In [6]:
def tokens_to_strings(tokens):
    """Convert array tokens to string representations for comparison.
    
    Args:
        tokens: A list or array of tokens, typically integers or symbols representing MIDI events.
    
    Returns:
        A list of strings, where each string is the string representation of a token.
    """
    # Convert each token in the input array to its string representation using a list comprehension
    return [str(token) for token in tokens]  # Returns a list of stringified tokens

### Vocabulary Analysis of MIDI Token Dataset
This section implements the analyze_vocabulary function, which performs a statistical analysis of the tokenized MIDI dataset for instrument 29 (electric guitar). The function calculates key metrics, such as the vocabulary size (number of unique tokens), total token count, and average token frequency, to provide insights into the dataset's characteristics. It also identifies and displays the most and least frequent tokens, along with their occurrence percentages, to highlight common and rare musical events. These statistics are crucial for understanding the dataset's complexity and informing the design of the melody transformer model, particularly for tasks like embedding layer configuration and handling rare tokens during training.

In [7]:
def analyze_vocabulary(data_tokens):
    """
    Analyzes the dataset to determine vocabulary characteristics.
    Returns vocabulary size and other statistics.
    
    Args:
        data_tokens: A list or array of tokens representing MIDI events.
    
    Returns:
        tuple: (vocab_size, stats_dict)
            - vocab_size: Integer, the number of unique tokens in the dataset.
            - stats_dict: Dictionary containing unique_strings, token_counts, and total_tokens.
    """
    # Print a message to indicate the start of vocabulary analysis
    print("Analyzing vocabulary from the dataset...")
    
    # Convert tokens to string representations for consistent comparison
    # Uses the tokens_to_strings function defined earlier
    token_strings = tokens_to_strings(data_tokens)
    
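    # Create a set of unique token strings to compute vocabulary size
    # Note: set iteration order for strings can differ between interpreter runs,
    # so the order of unique_strings (and any indices derived from it) is not reproducible across sessions
    unique_strings = list(set(token_strings))
    vocab_size = len(unique_strings)  # Number of unique tokens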
    
    # Count the frequency of each token in the dataset
    token_counts = {}
    for token_str in token_strings:
        # Increment the count for the token, initializing to 0 if not present
        token_counts[token_str] = token_counts.get(token_str, 0) + 1
    
    # Calculate the total number of tokens in the dataset
    total_tokens = len(data_tokens)
    
    # Sort tokens by frequency in descending order for analysis
    sorted_tokens = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)
    
    # Print summary statistics for the vocabulary
    print("Vocabulary Analysis Results:")
    print(f"  - Total unique tokens (vocab size): {vocab_size}")
    print(f"  - Total tokens in dataset: {total_tokens}")
    print(f"  - Average token frequency: {total_tokens / vocab_size:.2f}")
    
    # Display the top 10 most frequent tokens and their percentages
    print("  - Top 10 most frequent tokens:")
    for i in range(min(10, len(sorted_tokens))):
        token_str, count = sorted_tokens[i]
        percentage = (count / total_tokens) * 100  # Calculate percentage of total tokens
        print(f"    {token_str}: {count} occurrences ({percentage:.2f}%)")
    
    # Display up to 5 of the least frequent tokens and their percentages
    print("  - Some rare tokens (bottom 5):")
    for i in range(max(0, len(sorted_tokens) - 5), len(sorted_tokens)):
        token_str, count = sorted_tokens[i]
        percentage = (count / total_tokens) * 100  # Calculate percentage of total tokens
        print(f"    {token_str}: {count} occurrences ({percentage:.2f}%)")
    
    # Return the vocabulary size and a dictionary of statistics for further use
    return vocab_size, {
        'unique_strings': unique_strings,  # List of unique token strings
        'token_counts': token_counts,      # Dictionary of token frequencies
        'total_tokens': total_tokens       # Total number of tokens
    }
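The counting loop above can also be written with collections.Counter from the standard library. The sketch below is an alternative formulation, not the notebook's implementation; the function name vocabulary_stats and the extra "singletons" statistic are additions for illustration, and the sample records are made up.

```python
from collections import Counter


def vocabulary_stats(data_tokens):
    """Compute basic vocabulary statistics using Counter."""
    token_strings = [str(t) for t in data_tokens]
    counts = Counter(token_strings)
    return {
        "vocab_size": len(counts),
        "total_tokens": len(token_strings),
        "most_common": counts.most_common(10),
        # Tokens that occur exactly once cannot be learned from repetition.
        "singletons": sum(1 for c in counts.values() if c == 1),
    }


# Made-up records in the same shape as the dataset's tokens.
sample = [[60, 0.0, 0.5, 90], [62, 0.5, 1.0, 90], [60, 0.0, 0.5, 90]]
stats = vocabulary_stats(sample)
print(stats)
```

The singleton count is a useful companion to the average token frequency: when most tokens appear only once, the model has very few repeated examples to learn from.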

### Token Mapping for Efficient Embedding
This section defines the create_token_mapping function, which generates bidirectional mappings between the tokenized MIDI data and contiguous integer indices for instrument 29 (electric guitar). The function creates two dictionaries: one mapping token strings to indices (token_to_idx) and another mapping indices back to the original tokens (idx_to_token). This mapping is essential for converting the dataset's tokens into a format suitable for the melody transformer’s embedding layer, ensuring efficient processing and compatibility with the model’s input requirements. The function also provides feedback on the number of unique tokens mapped, aiding in debugging and verification.

In [8]:
def create_token_mapping(unique_strings, original_tokens):
    """
    Creates a mapping from original array tokens to contiguous indices.
    
    Args:
        unique_strings: List of unique token strings from the dataset.
        original_tokens: List or array of original tokens from the dataset.
    
    Returns:
        tuple: (token_to_idx, idx_to_token)
            - token_to_idx: Dictionary mapping token strings to contiguous indices.
            - idx_to_token: Dictionary mapping indices back to original tokens.
    """
    # Print a message to indicate the start of token mapping
    print("Creating token mapping for efficient embedding...")
    
    # Create a dictionary mapping string representations to their original token values
    str_to_token = {}
    for token in original_tokens:
        token_str = str(token)  # Convert token to string for consistent keying
        if token_str not in str_to_token:
            str_to_token[token_str] = token  # Store the original token for each string
    
    # Create a dictionary mapping token strings to contiguous indices (0 to vocab_size-1)
    token_to_idx = {token_str: idx for idx, token_str in enumerate(unique_strings)}
    
    # Create a dictionary mapping indices back to original tokens
    idx_to_token = {idx: str_to_token[token_str] for idx, token_str in enumerate(unique_strings)}
    
    # Print confirmation of the number of unique tokens mapped
    print(f"  - Mapped {len(unique_strings)} unique array tokens to indices 0-{len(unique_strings)-1}")
    
    # Return the bidirectional mappings for use in token processing
    return token_to_idx, idx_to_token
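Because the list of unique strings is built from a set, the index assigned to each token can change from one interpreter session to the next. If indices need to match a saved checkpoint, a deterministic ordering helps. The sketch below assigns indices in order of first appearance; build_mapping is a hypothetical alternative to create_token_mapping, and the sample records are made up.

```python
def build_mapping(original_tokens):
    """Map each distinct token to an index in order of first appearance."""
    # dict.fromkeys keeps insertion order and drops duplicates.
    first_seen = dict.fromkeys(str(t) for t in original_tokens)
    token_to_idx = {s: i for i, s in enumerate(first_seen)}

    # Keep the first original token seen for each index.
    idx_to_token = {}
    for t in original_tokens:
        idx_to_token.setdefault(token_to_idx[str(t)], t)
    return token_to_idx, idx_to_token


sample = [[60, 0.0, 0.5, 90], [62, 0.5, 1.0, 90], [60, 0.0, 0.5, 90]]
token_to_idx, idx_to_token = build_mapping(sample)

# Round trip: encode to indices, then decode back to the original tokens.
encoded = [token_to_idx[str(t)] for t in sample]
decoded = [idx_to_token[i] for i in encoded]
print(encoded, decoded == sample)
```

The round trip at the end is a quick way to confirm that the two dictionaries are true inverses of each other on the data.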

### Dataset Remapping to Contiguous Indices
This section defines the remap_dataset function, which transforms the original MIDI token dataset for instrument 29 (electric guitar) into a tensor of contiguous integer indices using the token_to_idx mapping. This remapping is crucial for preparing the data for input to the melody transformer model, as it ensures that tokens are represented as sequential integers suitable for the model's embedding layer. The function converts each token to its string representation, maps it to the corresponding index, and creates a PyTorch tensor with the remapped indices. It also provides diagnostic output, including the total number of tokens remapped and the range of indices, to verify the transformation process.

In [9]:
def remap_dataset(data_tokens, token_to_idx):
    """
    Remaps the dataset to use contiguous indices instead of original token arrays.
    
    Args:
        data_tokens: List or array of original tokens from the MIDI dataset.
        token_to_idx: Dictionary mapping token strings to contiguous indices.
    
    Returns:
        torch.Tensor: A tensor of remapped indices with dtype torch.long, suitable for model input.
    """
    # Print a message to indicate the start of dataset remapping
    print("Remapping dataset to use contiguous indices...")
    
    # Initialize an empty list to store the remapped indices
    remapped_indices = []
    # Iterate through each token in the dataset
    for token in data_tokens:
        token_str = str(token)  # Convert the token to its string representation
        # Append the corresponding index from the token_to_idx mapping
        remapped_indices.append(token_to_idx[token_str])
    
    # Convert the list of indices to a PyTorch tensor with long integer type
    remapped_data = torch.tensor(remapped_indices, dtype=torch.long)
    
    # Print diagnostic information about the remapping process
    print(f"  - Remapped {len(data_tokens)} array tokens")
    # Report the range of indices in the remapped dataset
    print(f"  - Remapped token range: {remapped_data.min().item()} to {remapped_data.max().item()}")
    
    # Return the remapped dataset as a tensor
    return remapped_data

### Vocabulary Analysis and Dataset Preparation
This section orchestrates the preprocessing of the MIDI token dataset for instrument 29 (electric guitar) to prepare it for training the melody transformer model. It begins by creating a copy of the raw token data to preserve the original dataset. The analyze_vocabulary function is called to compute the vocabulary size and extract statistics, such as unique tokens and their frequencies, which inform the model's embedding layer design. The create_token_mapping function generates bidirectional mappings between tokens and contiguous indices, enabling efficient embedding. The dataset is then remapped to these indices using remap_dataset, converting the tokens into a PyTorch tensor suitable for model input. The remapped data is transferred to the selected device (GPU or CPU) for training, and a sample of the remapped data is printed for verification. Finally, the dynamic vocabulary size is reported, which is critical for configuring the embedding and output layers of the transformer model.

In [10]:
# Create a shallow copy of the raw token list so that adding, removing, or reordering entries leaves the original list intact
raw_train_tokens = tokens.copy()  # Shallow copy: the individual token records are shared, not duplicated

# Analyze the vocabulary to determine its size and statistics
# Calls analyze_vocabulary to compute unique tokens, total tokens, and frequency statistics
VOCAB_SIZE, vocab_stats = analyze_vocabulary(raw_train_tokens)

# Create bidirectional mappings between token strings and contiguous indices
# token_to_idx maps token strings to indices; idx_to_token maps indices back to original tokens
token_to_idx, idx_to_token = create_token_mapping(vocab_stats['unique_strings'], raw_train_tokens)

# Remap the dataset to use contiguous indices for efficient model input
# Converts the raw tokens into a PyTorch tensor of indices
train_data = remap_dataset(raw_train_tokens, token_to_idx)

# Print a sample of the first 20 remapped indices for verification
print(f"Sample of remapped data: {train_data[:20]}...")

# Move the remapped dataset to the selected device (GPU or CPU) for training
train_data = train_data.to(device)  # Ensures compatibility with the model's computation device

# Print the dynamically determined vocabulary size, which will be used for model configuration
print(f"\n--- DYNAMIC VOCABULARY SIZE: {VOCAB_SIZE} ---")
print("This will be used for embedding and output layer dimensions.")

Analyzing vocabulary from the dataset...
Vocabulary Analysis Results:
  - Total unique tokens (vocab size): 6201
  - Total tokens in dataset: 7254
  - Average token frequency: 1.17
  - Top 10 most frequent tokens:
    [53, 4.14, 4.45, 68]: 3 occurrences (0.04%)
    [48, 4.14, 4.46, 70]: 3 occurrences (0.04%)
    [41, 4.14, 4.46, 74]: 3 occurrences (0.04%)
    [53, 4.53, 4.59, 64]: 3 occurrences (0.04%)
    [48, 4.53, 4.6, 86]: 3 occurrences (0.04%)
    [53, 4.66, 4.73, 64]: 3 occurrences (0.04%)
    [48, 4.66, 4.73, 76]: 3 occurrences (0.04%)
    [53, 4.91, 4.99, 62]: 3 occurrences (0.04%)
    [47, 4.91, 5.0, 96]: 3 occurrences (0.04%)
    [53, 5.17, 5.4, 74]: 3 occurrences (0.04%)
  - Some rare tokens (bottom 5):
    [82, 225.55, 225.7, 127]: 1 occurrences (0.01%)
    [79, 225.74, 225.86, 113]: 1 occurrences (0.01%)
    [82, 225.87, 225.96, 106]: 1 occurrences (0.01%)
    [79, 225.99, 226.11, 119]: 1 occurrences (0.01%)
    [84, 226.09, 226.74, 100]: 1 occurrences (0.01%)
Creating tok

### Batch Generation for Next Token Prediction
This section defines the get_batch function, which generates batches of source and target sequences from the remapped MIDI dataset for training the melody transformer model on instrument 29 (electric guitar). It supports the next token prediction task, in which the model learns to predict each token from the tokens that precede it. The function draws random starting indices, extracts windows of seq_length tokens, and returns a source tensor (x) and a target tensor (y), both of shape (batch_size, seq_length). The target is the same window advanced by one token, so y[t] is the token that follows x[t] in the dataset. A sample batch is generated and printed to show this relationship.

In [11]:
def get_batch(source_data, seq_length, batch_size):
    """
    Generates a batch of source and target sequences for training.
    This is the core of how we set up the "next token prediction" task.
    
    Args:
        source_data: torch.Tensor, the remapped dataset of token indices.
        seq_length: Integer, the length of each sequence (context window).
        batch_size: Integer, the number of sequences in a batch.
    
    Returns:
        tuple: (x, y)
            - x: torch.Tensor, source sequences of shape (batch_size, seq_length).
            - y: torch.Tensor, target sequences of shape (batch_size, seq_length).
    """
    # Get the total length of the dataset
    num_tokens = len(source_data)
    
    # Generate random starting indices for sequences
    # Ensure there's enough room for seq_length + 1 to include the target token
    start_indices = torch.randint(0, num_tokens - seq_length - 1, (batch_size,))
    
    # Create source sequences by stacking slices of source_data
    # Each slice is of length seq_length, starting at a random index
    x = torch.stack([source_data[i : i + seq_length] for i in start_indices])
    
    # Create target sequences, shifted one position to the right
    # For each input token, the target is the next token in the sequence
    y = torch.stack([source_data[i + 1 : i + seq_length + 1] for i in start_indices])
    
    return x, y

# Demonstrate an example batch for clarity
print("\nLet's see an example of a single batch with batch_size=1 and seq_length=5:")
# Generate a sample batch with batch_size=1 and seq_length=5
x_sample, y_sample = get_batch(train_data, 5, 1)
# Convert tensors to lists for readable output, squeezing to remove batch dimension
print(f"Source (x): {x_sample.squeeze().tolist()}")
print(f"Target (y): {y_sample.squeeze().tolist()}")
# Highlight the relationship between source and target sequences
print("Notice that the target is the source sequence shifted one position to the right.")


Let's see an example of a single batch with batch_size=1 and seq_length=5:
Source (x): [1152, 3694, 2462, 156, 5374]
Target (y): [3694, 2462, 156, 5374, 2776]
Notice that the target is the source sequence shifted one position to the right.
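The windowing logic can be illustrated without tensors. The sketch below uses plain Python lists and a hypothetical helper, sample_window, to show which start indices are valid and how the target relates to the source. Note that torch.randint excludes its upper bound, so get_batch as written never selects the very last valid start; that is harmless for training.

```python
import random


def sample_window(data, seq_length, rng):
    """Return (x, y) where y is the window x advanced by one token."""
    # The largest valid start leaves room for seq_length inputs plus one extra target token.
    max_start = len(data) - seq_length - 1
    start = rng.randint(0, max_start)  # random.Random.randint includes both endpoints
    x = data[start : start + seq_length]
    y = data[start + 1 : start + seq_length + 1]
    return x, y


# Stand-in for the remapped token indices: consecutive integers make the shift easy to see.
data = list(range(100, 120))
x, y = sample_window(data, 5, random.Random(0))
print("Source (x):", x)
print("Target (y):", y)
```

Every element of y except the last already appears in x; only the final target token is new information, which is why each window needs seq_length + 1 tokens of data.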


### Positional Encoding for Transformer Input
This section defines the PositionalEncoding class, a critical component of the melody transformer model designed for MIDI data from instrument 29 (electric guitar). The class implements positional encodings to inject information about the token positions in the sequence, as the transformer architecture lacks an inherent sense of order. It follows the standard sinusoidal encoding scheme proposed by Vaswani et al. (2017), where sine and cosine functions are used to create position-specific vectors that are added to the token embeddings. The class supports dropout for regularization and is designed to handle sequences up to a maximum length (max_len). The positional encodings are stored as a buffer, ensuring they are part of the model’s state but not trainable parameters. This implementation ensures that the model can capture the sequential nature of musical tokens during training and generation.

In [12]:
class PositionalEncoding(nn.Module):
    """
    Injects position information into the token embeddings.
    Since the Transformer architecture itself doesn't have a notion of order,
    we add these positional encodings to the input embeddings.
    
    Args:
        d_model: Integer, the dimensionality of the token embeddings.
        dropout: Float, the dropout probability for regularization (default: 0.1).
        max_len: Integer, the maximum sequence length to precompute encodings for (default: 5000).
    """
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()  # Initialize the parent nn.Module class
        self.dropout = nn.Dropout(p=dropout)  # Initialize dropout layer with specified probability

        # Create a matrix for positional encodings of shape (max_len, d_model)
        pe = torch.zeros(max_len, d_model)  # Initialize zero matrix for positional encodings
        
        # Create a position tensor [0, 1, 2, ..., max_len-1] with shape (max_len, 1)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        
        # Calculate the division term for the sine and cosine functions
        # Uses the formula: exp(2i * -log(10000) / d_model) for even indices
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        # Apply sine to even indices (2i) in the positional encoding matrix
        pe[:, 0::2] = torch.sin(position * div_term)
        
        # Apply cosine to odd indices (2i+1) in the positional encoding matrix
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add a batch dimension and transpose to shape (max_len, 1, d_model)
        pe = pe.unsqueeze(0).transpose(0, 1)
        
        # Register 'pe' as a buffer to include it in the model's state without training
        self.register_buffer('pe', pe)
        
        # Print confirmation of successful initialization
        print("Initialized PositionalEncoding module.")

    def forward(self, x):
        """
        Forward pass to add positional encodings to input embeddings.
        
        Args:
            x: Tensor, shape [seq_len, batch_size, embedding_dim], input token embeddings.
        
        Returns:
            Tensor: Input embeddings with positional encodings added, after applying dropout.
        """
        # Add positional encodings to the input tensor, matching the sequence length
        x = x + self.pe[:x.size(0), :]  # Slice pe to match input sequence length
        
        # Apply dropout to the combined embeddings for regularization
        return self.dropout(x)
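The sinusoidal scheme assigns sin(pos / 10000^(i / d_model)) to each even dimension i and the cosine of the same angle to dimension i + 1. The following sketch computes a single position's encoding with the math module only, so the formula can be checked by hand; sinusoidal_encoding is a hypothetical helper, not part of the class above.

```python
import math


def sinusoidal_encoding(pos, d_model):
    """Return the d_model-length sinusoidal encoding for one position."""
    vec = [0.0] * d_model
    for i in range(0, d_model, 2):  # i runs over the even dimensions
        angle = pos / (10000.0 ** (i / d_model))
        vec[i] = math.sin(angle)
        if i + 1 < d_model:
            vec[i + 1] = math.cos(angle)
    return vec


pe0 = sinusoidal_encoding(0, 8)
pe1 = sinusoidal_encoding(1, 8)
print("position 0:", pe0)
print("position 1:", [round(v, 4) for v in pe1])
```

At position 0 every sine term is 0 and every cosine term is 1, and the wavelength grows with the dimension index, so low dimensions vary quickly along the sequence while high dimensions vary slowly.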

### Transformer Model Architecture for Next Token Prediction
This section defines the TransformerModel class, which implements the core architecture of the melody transformer for next token prediction on MIDI data for instrument 29 (electric guitar). The model, built using PyTorch’s nn.Module, integrates several key components: (1) a token embedding layer to map input token indices to dense vectors, (2) a positional encoding layer to incorporate sequence order, (3) a stack of transformer encoder layers for modeling complex dependencies in the musical sequences, and (4) a final linear layer to produce logits over the vocabulary for next token prediction. The model is initialized with hyperparameters such as vocabulary size, embedding dimension, number of attention heads, feed-forward dimension, number of encoder layers, and dropout rate. Weight initialization is performed to ensure stable training, and the forward pass includes detailed logging of tensor shapes for debugging and verification. This architecture is designed to capture the sequential and contextual patterns in MIDI token sequences, enabling the generation of coherent musical output.

In [13]:
class TransformerModel(nn.Module):
    """
    A Transformer model for sequence-to-sequence tasks.
    In our case, it's used for next-token prediction.
    
    Args:
        vocab_size: Integer, the size of the token vocabulary.
        d_model: Integer, the dimensionality of the token embeddings.
        nhead: Integer, the number of attention heads in the multi-head attention mechanism.
        d_hid: Integer, the dimension of the feed-forward network in each transformer layer.
        nlayers: Integer, the number of transformer encoder layers.
        dropout: Float, the dropout probability for regularization (default: 0.1).
    """
    def __init__(self, vocab_size, d_model, nhead, d_hid, nlayers, dropout=0.1):
        super(TransformerModel, self).__init__()  # Initialize the parent nn.Module class
        self.model_type = 'Transformer'  # Identifier for the model type
        self.d_model = d_model  # Store embedding dimension for use in forward pass
        self.vocab_size = vocab_size  # Store vocabulary size for embedding and output layers

        # 1. Token Embedding Layer: Maps input token indices to dense vectors
        self.encoder = nn.Embedding(vocab_size, d_model)  # Embedding layer for tokens
        print(f"Initialized nn.Embedding: maps {vocab_size} tokens to {d_model}-dim vectors.")

        # 2. Positional Encoding: Adds positional information to token embeddings
        self.pos_encoder = PositionalEncoding(d_model, dropout)  # Positional encoding layer

        # 3. Transformer Encoder Layers: Core of the model for sequence modeling
        # Define a single transformer encoder layer with specified parameters
        encoder_layers = nn.TransformerEncoderLayer(d_model, nhead, d_hid, dropout, batch_first=False)
        # Stack multiple encoder layers to form the transformer encoder
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, nlayers)
        print(f"Initialized nn.TransformerEncoder with {nlayers} layers.")

        # 4. Final Linear Layer (Decoder): Maps transformer output to vocabulary logits
        self.decoder = nn.Linear(d_model, vocab_size)  # Linear layer for output logits
        print(f"Initialized final nn.Linear decoder: maps {d_model}-dim vectors to {vocab_size} (vocab size) logits.")

        # Initialize weights for the embedding and linear layers
        self.init_weights()

    def init_weights(self):
        """Initializes weights for the embedding and linear layers."""
        initrange = 0.1  # Define the range for uniform weight initialization
        # Initialize embedding layer weights with uniform distribution
        self.encoder.weight.data.uniform_(-initrange, initrange)
        # Zero the bias of the decoder layer
        self.decoder.bias.data.zero_()
        # Initialize decoder layer weights with uniform distribution
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src, src_mask):
        """
        Forward pass of the model.
        
        Args:
            src: Tensor, shape [seq_len, batch_size], input token indices.
            src_mask: Tensor, shape [seq_len, seq_len], the causal attention mask that prevents each position from attending to later positions.
        
        Returns:
            Tensor: Output logits of shape [seq_len, batch_size, vocab_size].
        """
        # Log the start of the forward pass for debugging
        print("\n--- Inside Model Forward Pass ---")
        print(f"Input `src` shape: {src.shape} [Sequence Length, Batch Size]")

        # 1. Embed the tokens and scale by sqrt(d_model), as in Vaswani et al. (2017), which keeps embedding magnitudes comparable to the positional encodings
        src = self.encoder(src) * math.sqrt(self.d_model)
        print(f"Shape after Embedding and Scaling: {src.shape} [Seq Len, Batch, Embedding Dim]")

        # 2. Add positional encoding to incorporate sequence order
        src = self.pos_encoder(src)
        print(f"Shape after Positional Encoding: {src.shape} [Seq Len, Batch, Embedding Dim]")

        # 3. Pass through the transformer encoder layers to model sequence dependencies
        output = self.transformer_encoder(src, src_mask)
        print(f"Shape after Transformer Encoder: {output.shape} [Seq Len, Batch, Embedding Dim]")

        # 4. Decode the output to produce logits over the vocabulary
        output = self.decoder(output)
        print(f"Shape after Final Decoder Layer: {output.shape} [Seq Len, Batch, Vocab Size]")
        print("--- End of Model Forward Pass ---\n")
        
        return output
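A rough parameter count can be worked out by hand from the layer shapes. The sketch below is arithmetic only and assumes the default layout of nn.TransformerEncoderLayer (a fused query/key/value projection, an output projection, two feed-forward linear layers, and two LayerNorms, all with biases). The positional encodings are a buffer and contribute no trainable parameters. The function names are hypothetical, and the inputs are the vocabulary size reported above and the hyperparameters from the configuration cell.

```python
def encoder_layer_params(d_model, d_hid):
    """Parameter count for one encoder layer under the assumed default layout."""
    attn_in = 3 * d_model * d_model + 3 * d_model  # fused Q, K, V projection
    attn_out = d_model * d_model + d_model         # attention output projection
    feed_forward = (d_model * d_hid + d_hid) + (d_hid * d_model + d_model)
    norms = 2 * (2 * d_model)                      # weight and bias for each LayerNorm
    return attn_in + attn_out + feed_forward + norms


def model_params(vocab_size, d_model, d_hid, nlayers):
    """Embedding + stacked encoder layers + output projection."""
    embedding = vocab_size * d_model
    decoder = d_model * vocab_size + vocab_size
    return embedding + nlayers * encoder_layer_params(d_model, d_hid) + decoder


per_layer = encoder_layer_params(d_model=64, d_hid=256)
total = model_params(vocab_size=6201, d_model=64, d_hid=256, nlayers=2)
print(f"Per encoder layer: {per_layer:,}")
print(f"Estimated total:   {total:,}")
```

Under these assumptions the embedding and output layers account for roughly 800,000 of the estimated 899,897 parameters, because the vocabulary is almost as large as the dataset itself.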

### Causal Mask Generation for Transformer Training
This section defines the generate_square_subsequent_mask function, which creates a square causal mask for the melody transformer model used with MIDI data for instrument 29 (electric guitar). The mask ensures that during training, the model only attends to previous and current tokens in the sequence, preventing it from accessing future tokens, which is critical for the next token prediction task. The function generates a square matrix of size sz (sequence length), where the upper triangular portion (excluding the diagonal) is filled with negative infinity to mask future tokens, and the lower triangular portion (including the diagonal) is filled with zeros to allow attention. This causal masking enforces the autoregressive property of the transformer, ensuring it learns to predict the next token based solely on prior context.

In [14]:
def generate_square_subsequent_mask(sz):
    """
    Generates a square causal mask for the sequence.
    The masked positions are filled with -inf.
    Unmasked positions are 0. This prevents the model from "cheating" by
    looking at future tokens during training.
    
    Args:
        sz: Integer, the size of the square mask (sequence length).
    
    Returns:
        torch.Tensor: A square causal mask of shape [sz, sz], where unmasked positions
                     are 0.0 and masked positions are -inf.
    """
    # Build a boolean lower-triangular matrix: True where column j <= row i
    # triu keeps the upper triangle (1s on and above the diagonal); the transpose
    # flips it so that row i marks the positions token i is allowed to attend to
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    
    # Convert the boolean mask to a float tensor
    # Fill masked positions (False, i.e., future tokens) with -inf
    # Fill unmasked positions (True, i.e., current and past tokens) with 0.0
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    
    return mask  # Return the causal mask tensor
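
To make the mask pattern concrete, the following pure-Python sketch rebuilds the same layout for a sequence length of 4 and shows how adding the mask to attention scores before a softmax drives the weights on future positions to zero. The helper names (`causal_mask`, `masked_softmax`) and the example scores are assumptions made for illustration; they are not part of the notebook.

```python
import math

NEG_INF = float("-inf")

def causal_mask(sz):
    """Return an sz x sz list of lists: 0.0 where j <= i, -inf where j > i."""
    return [[0.0 if j <= i else NEG_INF for j in range(sz)] for i in range(sz)]

def masked_softmax(scores, mask_row):
    """Add the mask to the raw scores, then apply a numerically stable softmax."""
    masked = [s + m for s, m in zip(scores, mask_row)]
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
for row in mask:
    print(row)

# Row 1 (the second token) may attend to positions 0 and 1 only,
# so the large scores at positions 2 and 3 receive zero weight.
weights = masked_softmax([0.5, 0.5, 2.0, 3.0], mask[1])
print(weights)  # [0.5, 0.5, 0.0, 0.0]
```

Each row of the mask corresponds to one query position, which is why the first row allows only a single column and the last row allows all of them.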

### Model Instantiation and Training Setup
This section initializes the melody transformer model for generating musical sequences from MIDI data for instrument 29 (electric guitar). The TransformerModel is instantiated with the dynamically calculated vocabulary size (VOCAB_SIZE) and the predefined hyperparameters for embedding dimension, number of attention heads, feed-forward dimension, number of encoder layers, and dropout rate, then moved to the selected device (GPU or CPU). The loss function is cross-entropy, which suits next-token prediction, and the Adam optimizer is configured with the specified learning rate. The cell output shows the initialization messages printed by the model constructor, followed by the vocabulary size and the total number of model parameters, which gives a quick measure of model size.

In [15]:
# Instantiate the transformer model with the specified hyperparameters
# VOCAB_SIZE is dynamically determined from the dataset
# Other parameters (EMBEDDING_DIM, NUM_HEADS, FF_DIM, NUM_ENCODER_LAYERS, DROPOUT) are predefined
model = TransformerModel(
    VOCAB_SIZE,          # Vocabulary size from dataset analysis
    EMBEDDING_DIM,       # Dimensionality of token embeddings
    NUM_HEADS,           # Number of attention heads in multi-head attention
    FF_DIM,              # Dimension of feed-forward network in transformer layers
    NUM_ENCODER_LAYERS,  # Number of transformer encoder layers
    DROPOUT              # Dropout probability for regularization
).to(device)  # Move the model to the selected device (GPU or CPU)

# Define the loss function for next token prediction
# CrossEntropyLoss combines log softmax and negative log likelihood loss
criterion = nn.CrossEntropyLoss()

# Define the Adam optimizer with the specified learning rate
# Optimizes all trainable parameters of the model
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Print confirmation of model initialization with the dynamic vocabulary size
print(f"Model initialized with dynamic vocabulary size: {VOCAB_SIZE}")

# Calculate and print the total number of trainable parameters in the model
# Summing the number of elements (numel) in each parameter tensor
print(f"Total model parameters: {sum(p.numel() for p in model.parameters()):,}")

Initialized nn.Embedding: maps 6201 tokens to 64-dim vectors.
Initialized PositionalEncoding module.
Initialized nn.TransformerEncoder with 2 layers.
Initialized final nn.Linear decoder: maps 64-dim vectors to 6201 (vocab size) logits.




Model initialized with dynamic vocabulary size: 6201
Total model parameters: 899,897
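
The reported total can be cross-checked by hand. The sketch below adds up the parameters layer by layer under several assumptions that are not confirmed in this section: a feed-forward width of 256 (FF_DIM is set in the hyperparameter section; 256 is simply the value consistent with the printed total), the standard `nn.TransformerEncoderLayer` layout (a combined query/key/value projection, an attention output projection, two linear layers, and two LayerNorms), a positional encoding without trainable parameters, and no final normalization layer on the encoder stack.

```python
# Values printed by the notebook
vocab_size = 6201
d_model = 64
num_layers = 2
# Assumption: feed-forward width inferred from the reported total, not read from the notebook
ff_dim = 256

embedding = vocab_size * d_model                    # nn.Embedding weight
attention = 3 * d_model * d_model + 3 * d_model     # combined Q/K/V projection (weight + bias)
attention += d_model * d_model + d_model            # attention output projection (weight + bias)
feed_forward = (d_model * ff_dim + ff_dim) + (ff_dim * d_model + d_model)
layer_norms = 2 * (2 * d_model)                     # two LayerNorms, weight + bias each
encoder_layer = attention + feed_forward + layer_norms
decoder = d_model * vocab_size + vocab_size         # final nn.Linear weight + bias

total = embedding + num_layers * encoder_layer + decoder
print(f"{total:,}")  # 899,897
```

Under these assumptions the embedding and the output layer account for roughly 800,000 of the parameters, so the vocabulary size, not the encoder depth, dominates the model size. The number of attention heads does not change the count, because the heads split the same projection matrices.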


### Training Loop for the Melody Transformer
This section implements the train function, which runs one training epoch of the melody transformer for next-token prediction on MIDI data for instrument 29 (electric guitar). The function sets the model to training mode, builds a causal mask to enforce autoregressive behavior, and iterates over the dataset in batches. For each batch, it retrieves source and target sequences with get_batch, permutes them into the [Seq Len, Batch] layout the transformer expects, runs the forward pass, and computes the cross-entropy loss. Gradients are then computed, clipped to a global norm of 0.5 to guard against exploding gradients, and applied by the Adam optimizer. The forward pass prints its shape trace only for the first batch of the first epoch; for every other batch, print is temporarily silenced. Every LOG_INTERVAL batches the function logs the loss, perplexity, learning rate, gradient norm, and time per batch. At the end of the epoch it prints a summary and appends the average loss, perplexity, and gradient norm to module-level lists. The two cells that follow plot these per-epoch metrics and call train once per epoch for EPOCHS epochs.

In [16]:
epoch_losses = []
epoch_perplexities = []
epoch_grad_norms = []

def train(epoch):
    """Run one training epoch and record its average loss, perplexity, and gradient norm."""
    model.train()  # Enable dropout and other training-time behavior
    total_loss = 0.        # Running loss since the last log line
    total_grad_norm = 0.   # Running sum of pre-clipping gradient norms for the epoch
    epoch_start_time = time.time()   # Used for the epoch summary
    start_time = epoch_start_time    # Reset after every log line to measure ms/batch
    src_mask = generate_square_subsequent_mask(SEQUENCE_LENGTH).to(device)
    num_batches = len(train_data) // (SEQUENCE_LENGTH * BATCH_SIZE)
    batch_losses = []

    print(f"\n--- Starting Epoch {epoch} ---")
    # The range only sets the number of iterations; `i` is not passed to get_batch
    for batch, i in enumerate(range(0, train_data.size(0) - 1 - SEQUENCE_LENGTH, SEQUENCE_LENGTH)):
        data, targets = get_batch(train_data, SEQUENCE_LENGTH, BATCH_SIZE)
        # Permute from [Batch, Seq Len] to the [Seq Len, Batch] layout the model expects
        data = data.permute(1, 0).to(device)
        targets = targets.permute(1, 0).to(device)

        if batch == 0:
            print(f"Shape of data batch: {data.shape}")
            print(f"Shape of target batch: {targets.shape}")
            print(f"Shape of causal mask: {src_mask.shape}")
            print("Starting batch iterations...")

        optimizer.zero_grad()
        if batch == 0 and epoch == 1:
            # Keep the forward-pass shape trace for the very first batch only
            output = model(data, src_mask)
        else:
            # Silence the print statements inside forward(); the finally block
            # restores print even if the forward pass raises
            _print = __builtins__.print
            __builtins__.print = lambda *args, **kwargs: None
            try:
                output = model(data, src_mask)
            finally:
                __builtins__.print = _print

        # Flatten logits to [Seq Len * Batch, Vocab Size] and targets to [Seq Len * Batch]
        loss = criterion(output.view(-1, VOCAB_SIZE), targets.reshape(-1))
        loss.backward()
        # Clip the global gradient norm to 0.5; the returned value is the norm before clipping
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        total_grad_norm += grad_norm.item()
        batch_losses.append(loss.item())

        if batch % LOG_INTERVAL == 0 and batch > 0:
            lr = optimizer.param_groups[0]['lr']
            ms_per_batch = (time.time() - start_time) * 1000 / LOG_INTERVAL
            cur_loss = total_loss / LOG_INTERVAL
            cur_ppl = math.exp(cur_loss)
            print(f'| epoch {epoch:3d} | {batch:5d}/{num_batches * BATCH_SIZE:5d} batches | '
                  f'lr {lr:02.5f} | ms/batch {ms_per_batch:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {cur_ppl:8.2f} | grad norm {grad_norm:5.2f}')
            total_loss = 0
            start_time = time.time()

    # Epoch summary
    avg_loss = sum(batch_losses) / len(batch_losses)
    avg_ppl = math.exp(avg_loss)
    avg_grad_norm = total_grad_norm / len(batch_losses)
    # Measure from the start of the epoch, not from the last log line
    epoch_time = time.time() - epoch_start_time
    print(f'--- End of Epoch {epoch} | Avg Loss: {avg_loss:.2f} | '
          f'Avg Perplexity: {avg_ppl:.2f} | Avg Grad Norm: {avg_grad_norm:.2f} | '
          f'Time: {epoch_time:.2f}s ---')

    # Store metrics
    epoch_losses.append(avg_loss)
    epoch_perplexities.append(avg_ppl)
    epoch_grad_norms.append(avg_grad_norm)
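
The clipping step in the training loop can be illustrated in isolation. The sketch below is a pure-Python illustration of clipping by global L2 norm, the idea behind `torch.nn.utils.clip_grad_norm_`: one norm is computed over all gradient values and, if it exceeds the threshold, every gradient is scaled by the same factor. The function name and the list-of-lists gradient representation are assumptions for illustration; this is not PyTorch's implementation.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    Returns (clipped_grads, total_norm), where total_norm is measured
    before clipping.
    """
    total_norm = math.sqrt(sum(g * g for group in grads for g in group))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in group] for group in grads]
    return clipped, total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # two hypothetical parameter groups, global norm 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.5)
norm_after = math.sqrt(sum(g * g for group in clipped for g in group))
print(norm_before, norm_after)

# Gradients already below the threshold pass through unchanged.
small, small_norm = clip_by_global_norm([[0.3, 0.0]], max_norm=0.5)
print(small, small_norm)
```

Because the norm is reported before clipping, a logged gradient norm above 0.5 indicates that clipping was active for that batch, while values below 0.5 mean the gradients were applied as computed.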

In [17]:
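def plot_metrics():
    """Save line plots of per-epoch loss, perplexity, and gradient norm to the graphs folder."""
    # Create the output folder first; plt.savefig does not create missing directories
    Path("graphs").mkdir(parents=True, exist_ok=True)
    sns.set(style="whitegrid")
    epochs = list(range(1, len(epoch_losses) + 1))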

    # Plot loss
    plt.figure(figsize=(10, 6))
    sns.lineplot(x=epochs, y=epoch_losses, marker='o', label='Loss')
    plt.title('Training Loss Over Epochs')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.savefig('graphs/loss_plot.png')
    plt.close()

    # Plot perplexity
    plt.figure(figsize=(10, 6))
    sns.lineplot(x=epochs, y=epoch_perplexities, marker='o', label='Perplexity')
    plt.title('Training Perplexity Over Epochs')
    plt.xlabel('Epoch')
    plt.ylabel('Perplexity')
    plt.savefig('graphs/perplexity_plot.png')
    plt.close()

    # Plot gradient norm
    plt.figure(figsize=(10, 6))
    sns.lineplot(x=epochs, y=epoch_grad_norms, marker='o', label='Gradient Norm')
    plt.title('Gradient Norm Over Epochs')
    plt.xlabel('Epoch')
    plt.ylabel('Gradient Norm')
    plt.savefig('graphs/grad_norm_plot.png')
    plt.close()

In [18]:
for epoch in range(1, EPOCHS + 1):
    train(epoch)
plot_metrics()
print("Training completed. Plots saved in 'graphs' folder.")


--- Starting Epoch 1 ---
Shape of data batch: torch.Size([64, 32])
Shape of target batch: torch.Size([64, 32])
Shape of causal mask: torch.Size([64, 64])
Starting batch iterations...

--- Inside Model Forward Pass ---
Input `src` shape: torch.Size([64, 32]) [Sequence Length, Batch Size]
Shape after Embedding and Scaling: torch.Size([64, 32, 64]) [Seq Len, Batch, Embedding Dim]
Shape after Positional Encoding: torch.Size([64, 32, 64]) [Seq Len, Batch, Embedding Dim]
Shape after Transformer Encoder: torch.Size([64, 32, 64]) [Seq Len, Batch, Embedding Dim]
Shape after Final Decoder Layer: torch.Size([64, 32, 6201]) [Seq Len, Batch, Vocab Size]
--- End of Model Forward Pass ---

--- End of Epoch 1 | Avg Loss: 6.95 | Avg Perplexity: 1042.40 | Avg Grad Norm: 0.40 | Time: 0.97s ---

--- Starting Epoch 2 ---
Shape of data batch: torch.Size([64, 32])
Shape of target batch: torch.Size([64, 32])
Shape of causal mask: torch.Size([64, 64])
Starting batch iterations...
--- End of Epoch 2 | Avg Loss
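
The relationship between the logged loss and perplexity can be checked directly: perplexity is the exponential of the average cross-entropy loss (in nats). The sketch below uses the rounded epoch 1 loss from the output above, so its result differs slightly from the printed 1042.40, and compares it with the loss of a model that guesses uniformly over the 6201-token vocabulary. The variable names are illustrative.

```python
import math

vocab_size = 6201   # vocabulary size printed at model initialization
avg_loss = 6.95     # epoch 1 average loss as printed (rounded to two decimals)

perplexity = math.exp(avg_loss)
uniform_loss = math.log(vocab_size)          # loss of a uniform guess over the vocabulary
uniform_perplexity = math.exp(uniform_loss)  # equals the vocabulary size

print(f"perplexity from rounded loss: {perplexity:.1f}")
print(f"uniform-guess loss: {uniform_loss:.2f}, perplexity: {uniform_perplexity:.0f}")
```

A perplexity near 1042 against a uniform baseline of 6201 means that, after one epoch, the model is on average about as uncertain as a uniform choice among roughly a thousand tokens.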

### Sequence Generation with the Melody Transformer
This section defines the predict function, which generates a musical sequence token by token using the trained melody transformer for MIDI data corresponding to instrument 29 (electric guitar). The function takes a seed sequence of token indices and appends max_len newly sampled tokens to it, so the returned list contains the seed followed by max_len generated indices. If the idx_to_token mapping is provided, the seed is also printed in its original token form. The model is set to evaluation mode to disable dropout, and a causal mask sized to the current sequence ensures that each position attends only to itself and earlier tokens. At every step the function applies softmax to the logits of the last position, samples the next token with multinomial sampling, and appends it to the input for the next step. Because the next token is sampled rather than chosen greedily, repeated calls with the same seed can produce different sequences.

In [19]:
def predict(model, seed_sequence, max_len=50, idx_to_token=None):
    """
    Generates a sequence token by token based on a seed.
    
    Args:
        model: The trained TransformerModel instance.
        seed_sequence: List of integers, the initial sequence of token indices.
        max_len: Integer, the number of tokens to generate after the seed (default: 50).
        idx_to_token: Dictionary, mapping indices to original tokens (optional).
    
    Returns:
        List: The seed followed by the generated token indices.
    """
    model.eval()  # Set the model to evaluation mode, disabling dropout
    # Print the input seed sequence for verification
    print(f"Seed sequence (indices): {seed_sequence}")
    # If idx_to_token is provided, convert and print the seed in original token format
    if idx_to_token:
        original_tokens = [idx_to_token[idx] for idx in seed_sequence]
        print(f"Seed sequence (original array tokens): {original_tokens}")
    
    # Convert the seed sequence to a tensor with shape [seq_len, batch_size=1]
    input_tensor = torch.tensor(seed_sequence, dtype=torch.long).unsqueeze(1).to(device)
    
    # Initialize the generated sequence with a copy of the seed
    generated_sequence = seed_sequence.copy()

    # Disable gradient computation for inference to save memory and computation
    with torch.no_grad():
        for step in range(max_len):  # Generate exactly max_len new tokens
            # Get the current sequence length for creating the causal mask
            current_seq_len = input_tensor.size(0)
            # Generate a causal mask to prevent attending to future tokens
            mask = generate_square_subsequent_mask(current_seq_len).to(device)
            
            # Run the model’s forward pass to get output logits
            if step == 0:
                output = model(input_tensor, mask)  # Include logging for the first step
            else:
                # Suppress print statements in the forward pass for cleaner logs;
                # the finally block restores print even if the forward pass raises
                _print = __builtins__.print
                __builtins__.print = lambda *args, **kwargs: None
                try:
                    output = model(input_tensor, mask)
                finally:
                    __builtins__.print = _print
            
            # Extract logits for the last token in the sequence
            last_token_logits = output[-1, 0, :]  # Shape: [vocab_size]
            
            # Apply softmax to convert logits to probabilities
            probabilities = torch.softmax(last_token_logits, dim=-1)
            
            # Sample the next token index from the probability distribution
            next_token = torch.multinomial(probabilities, 1).item()
            
            # Append the predicted token index to the generated sequence
            generated_sequence.append(next_token)
            
            # Update the input tensor by appending the new token
            # Shape becomes [current_seq_len + 1, 1]
            input_tensor = torch.cat([input_tensor, torch.tensor([[next_token]], device=device)], dim=0)

    # Return the complete generated sequence of token indices
    return generated_sequence
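
The sampling step inside predict can be sketched without PyTorch. The example below converts a small list of logits to probabilities with a numerically stable softmax and then draws token indices in proportion to those probabilities, which is the role `torch.softmax` and `torch.multinomial` play in the function above. The logits, the helper names, and the fixed seed are assumptions chosen for illustration.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities, subtracting the maximum for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probabilities, rng):
    """Draw one index with probability proportional to its weight."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
logits = [2.0, 1.0, 0.1, -1.0]    # hypothetical logits for a 4-token vocabulary
probs = softmax(logits)

# Draw many samples to see that frequencies follow the probabilities.
draws = [sample_index(probs, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(logits))]
print(probs)
print(counts)
```

Sampling keeps low-probability tokens reachable, which adds variety compared with always taking the most likely token. For repeatable generation in the notebook itself, seeding PyTorch's random number generator with `torch.manual_seed` before calling predict is one option.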

### Sequence Generation and Output Display
This section applies the predict function to generate a musical sequence for instrument 29 (electric guitar) from a predefined seed of four token indices. The seed, expressed as remapped indices, is passed to predict together with the idx_to_token mapping, and the function appends 100 newly sampled tokens to it. The result is printed both as remapped indices and as original tokens, which makes it possible to inspect the model's output against the MIDI dataset. The generated tokens can then be assessed for musical quality or converted to MIDI for playback.

In [20]:
# Define a seed sequence of token indices in the remapped space
# These indices correspond to tokens in the vocabulary for instrument 29 (electric guitar)
seed_indices = [3456, 2345, 1234, 2333]  # Example indices for the seed sequence

# Generate a sequence using the predict function
# max_len=100 specifies the number of tokens to generate after the seed
# idx_to_token is used to map indices back to original tokens
predicted_sequence = predict(model, seed_indices, max_len=100, idx_to_token=idx_to_token)

# Print the original seed sequence in remapped indices
print(f"Original Seed (indices): {seed_indices}")

# Convert and print the seed sequence to original token representations
print(f"Original Seed (original array tokens): {[idx_to_token[idx] for idx in seed_indices]}")

# Print the full generated sequence in remapped indices
print(f"Generated Sequence (indices): {predicted_sequence}")

# Convert and print the generated sequence to original token representations
print(f"Generated Sequence (original array tokens): {[idx_to_token[idx] for idx in predicted_sequence]}")

Seed sequence (indices): [3456, 2345, 1234, 2333]
Seed sequence (original array tokens): [[41, 12.89, 13.07, 117], [46, 53.56, 53.72, 105], [45, 10.22, 10.55, 127], [57, 162.85, 163.37, 127]]

--- Inside Model Forward Pass ---
Input `src` shape: torch.Size([4, 1]) [Sequence Length, Batch Size]
Shape after Embedding and Scaling: torch.Size([4, 1, 64]) [Seq Len, Batch, Embedding Dim]
Shape after Positional Encoding: torch.Size([4, 1, 64]) [Seq Len, Batch, Embedding Dim]
Shape after Transformer Encoder: torch.Size([4, 1, 64]) [Seq Len, Batch, Embedding Dim]
Shape after Final Decoder Layer: torch.Size([4, 1, 6201]) [Seq Len, Batch, Vocab Size]
--- End of Model Forward Pass ---

Original Seed (indices): [3456, 2345, 1234, 2333]
Original Seed (original array tokens): [[41, 12.89, 13.07, 117], [46, 53.56, 53.72, 105], [45, 10.22, 10.55, 127], [57, 162.85, 163.37, 127]]
Generated Sequence (indices): [3456, 2345, 1234, 2333, 3136, 47, 5711, 1302, 4033, 3134, 3295, 649, 1627, 3456, 4323, 1657, 5

### Saving Generated Sequence to JSON File
This section saves the generated musical sequence for instrument 29 (electric guitar) as a JSON file. The code creates a directory named generated_data if it does not already exist, using the pathlib library so that paths work the same way across operating systems. The generated sequence, held as remapped indices in predicted_sequence, is converted back to its original tokens with the idx_to_token mapping and written to 29.json in that directory. The file shares its name with the input in raw_data but lives in a separate folder, so the source data is not overwritten. Saving the output keeps it available for later steps such as MIDI file generation or evaluation.

In [21]:
# Define the output directory for generated data
folder = Path("generated_data")
# Create the directory, including parent directories if needed, and ignore if it already exists
folder.mkdir(parents=True, exist_ok=True)

# Define the file path for the output JSON file
file_path = folder / "29.json"

# Open the file in write mode and save the generated sequence
# Convert the sequence indices back to original tokens using idx_to_token
with file_path.open("w") as f:
    json.dump([idx_to_token[idx] for idx in predicted_sequence], f)  # Save tokens as a JSON list
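
A quick way to confirm that the saved file is usable is to read it back and compare it with what was written. The sketch below does this round trip in a temporary directory so that it does not touch the notebook's generated_data folder. The two tokens are copied from the seed tokens printed earlier, and the variable names are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

# Two tokens in the same four-number layout as the printed seed tokens
tokens_out = [[41, 12.89, 13.07, 117], [46, 53.56, 53.72, 105]]

with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "generated_data" / "29.json"
    file_path.parent.mkdir(parents=True, exist_ok=True)

    # Write the tokens as a JSON list of lists
    with file_path.open("w") as f:
        json.dump(tokens_out, f)

    # Read them back into Python objects
    with file_path.open("r") as f:
        tokens_in = json.load(f)

print(tokens_in == tokens_out)  # True
```

JSON stores each token as a plain list of numbers, so integers and floats survive the round trip unchanged and the file can be loaded the same way as the input data.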