### chaptGPT specs   

A decoder-only transformer in PyTorch that predicts the next output at each time step. 

Each time step t is represented by a vector of n=4 tokens from the Descript DAC encoder. 
The length of the sequence (context window) is Ti=86 for inference, and Tt=8*Ti for training. That is, the context window for training is 8 times the length of the context window for inference. 
The attention is "causal", looking only back in time, and the maximum look-back for the attention blocks is Ti (even when the sequence is longer, as it is during training). That is, the masking matrix is *banded*: triangular to be causal, and limited in look-back, which results in a diagonal band. Because training sequences are longer than Ti, most positions in a training example see a full Ti-step context; this avoids much of the training on shortened context that would otherwise occur for tokens near the beginning of training examples. 
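
The banded mask described above can be sketched as follows. This is only an illustration, not the `generate_mask` from `utils.utils` (whose implementation is not shown in this notebook). The name `banded_causal_mask` is hypothetical, and the convention that each step sees itself plus the previous `lookback - 1` steps is an assumption.

```python
import torch

def banded_causal_mask(seq_len: int, lookback: int) -> torch.Tensor:
    """Additive attention mask: 0 where attention is allowed, -inf elsewhere."""
    idx = torch.arange(seq_len)
    # rel[i, j] = how many steps key j lies behind query i (negative means future)
    rel = idx[:, None] - idx[None, :]
    # Allowed only if causal AND inside the look-back window (assumed convention)
    allowed = (rel >= 0) & (rel < lookback)
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask[allowed] = 0.0
    return mask

m = banded_causal_mask(seq_len=6, lookback=3)
print(m)
```

With `seq_len` equal to `lookback`, the band covers the whole lower triangle and the mask reduces to an ordinary causal mask.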

The size of the vocabulary (the number of discrete values in each codebook) for each of the n tokens is V=1024. 

As usual, the dataloader supplies batches as triplets (input, target, conditioning info), where the size of each input and target is Tt*n (the sequence length times the number of tokens at each time step). The tokens are indices into the vocabulary, in the range [0, V-1]. The targets are shifted relative to the input sequence by 1, as is typical for networks that learn to predict the output for the next time step. 
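
A minimal sketch of the input/target shift, assuming the dataset slices both from one token sequence that is one step longer than Tt. How `CustomDACDataset` actually builds its pairs is not shown here, and the small `Tt` below is for illustration only.

```python
import torch

V, n, Tt = 1024, 4, 8   # Tt kept small for illustration

# Hypothetical token sequence with one extra step, so input and target both have length Tt
codes = torch.randint(0, V, (Tt + 1, n))

inputs = codes[:-1]    # steps 0 .. Tt-1
targets = codes[1:]    # steps 1 .. Tt, i.e. the "next" step for every input step

print(inputs.shape, targets.shape)
```

In the notebook the batches carry an extra leading batch dimension, giving shapes of the form [batch_size, Tt, n].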

The first layer in the architecture will be a learnable "multiembedding" layer that embeds each of the 4 tokens at each time step as an m-dimensional vector. The n m-dimensional vectors are concatenated to provide the n*m dimensional input embeddings for the transformer blocks at each time step. 
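
One way to realize such a layer is one embedding table per codebook, with the results concatenated along the feature axis. The class below is a hypothetical sketch (the name `MultiEmbeddingSketch` and its arguments are assumptions), not the `MultiEmbedding` used by `RopeCondDACTransformer`. The per-token size m=126 is chosen so that n*m = 504, the `embed_size` reported in this run.

```python
import torch
import torch.nn as nn

class MultiEmbeddingSketch(nn.Module):
    """One embedding table per codebook; outputs are concatenated per time step."""

    def __init__(self, vocab_size: int, per_token_dim: int, num_tokens: int):
        super().__init__()
        self.tables = nn.ModuleList(
            [nn.Embedding(vocab_size, per_token_dim) for _ in range(num_tokens)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, num_tokens) integer indices in [0, vocab_size - 1]
        parts = [table(x[..., i]) for i, table in enumerate(self.tables)]
        return torch.cat(parts, dim=-1)  # (batch, time, num_tokens * per_token_dim)

emb = MultiEmbeddingSketch(vocab_size=1024, per_token_dim=126, num_tokens=4)
x = torch.randint(0, 1024, (2, 10, 4))
out = emb(x)
print(out.shape)
```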

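Positional information is then incorporated for each n*m dimensional vector. For positional encoding, we use Rotary Position Embedding (RoPE), which encodes position by rotating pairs of vector components rather than by adding a separate positional vector.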
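
To make the rotation idea concrete, here is a small sketch of the common "rotate-half" form of RoPE. It is an assumption that this resembles the `RotaryPositionalEmbedding` used in the model; the function name `rope_rotate` and the `base` value are illustrative. The demo checks the defining property: when queries and keys are rotated this way, their dot product depends only on the relative offset between positions.

```python
import torch

def rope_rotate(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate feature pairs of x (batch, time, dim) by position-dependent angles; dim must be even."""
    _, t, d = x.shape
    half = d // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)         # (half,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * inv_freq[None, :]   # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation of each (x1, x2) pair by its angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Same query/key content at every position, so only position can change the scores
q = torch.randn(1, 1, 16).expand(1, 8, 16)
k = torch.randn(1, 1, 16).expand(1, 8, 16)
scores = rope_rotate(q)[0] @ rope_rotate(k)[0].T   # (8, 8)

# scores[i, j] should depend only on i - j, so every diagonal is constant
print(torch.allclose(scores[1:, 1:], scores[:-1, :-1], atol=1e-3))
```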

We use a stack of b standard transformer blocks (using layer norms, a ReLU activation, and a forward expansion factor of 4 for the linear layer). Each transformer block consumes and produces a context-window-length sequence of m*n dimensional vectors. 

After the last transformer block, a linear layer maps the m*n dimensional vectors to the output size, which is V*n (the vocabulary size times the number of tokens stacked at each time step). These are the logits that are fed to the softmax functions (one for each of the n tokens), which provide the probability distribution across the vocabulary. We use the criterion nn.CrossEntropyLoss() to compute the loss against the targets provided by the dataloader, and Adam for the optimizer.
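
The sketch below shows how V*n logits per time step can be scored with a single `nn.CrossEntropyLoss()` call, mirroring the `reshape(-1, vocab_size)` used in the training cell. Random tensors stand in for the model output and targets, and the layout of the V*n axis (n consecutive blocks of V logits) is an assumption. Note that `nn.CrossEntropyLoss` applies log-softmax internally, so no explicit softmax is needed during training.

```python
import torch
import torch.nn as nn

B, T, n, V = 2, 5, 4, 1024
logits = torch.randn(B, T, n * V)          # stand-in for the final linear layer's output
targets = torch.randint(0, V, (B, T, n))   # stand-in for the dataloader's targets

criterion = nn.CrossEntropyLoss()

# Flatten so that every (batch, time, token) position is one V-way classification
loss_flat = criterion(logits.reshape(-1, V), targets.reshape(-1))

# Equivalent view: the mean of n separate per-codebook losses
per_token = logits.view(B, T, n, V)
loss_avg = torch.stack([
    criterion(per_token[:, :, i].reshape(-1, V), targets[:, :, i].reshape(-1))
    for i in range(n)
]).mean()

print(loss_flat.item(), loss_avg.item())
```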

Again, at inference time the fixed-length context window is shorter than the training sequence length, and equal to the maximum look-back of the attention blocks. The inference process takes the output produced at each time step (a stack of n tokens) and shifts it into a sliding window that is used as input for the next time step. The length of the sequences generated during inference is arbitrary and should be settable with a parameter. 
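
The sliding-window generation loop can be sketched as below. The names `generate` and `dummy_model` are hypothetical, the model is assumed to return logits of shape (1, Ti, n, V), and conditioning and the attention mask are left out for brevity. Sampling from the softmax is one possible choice; the spec does not say how the next tokens are picked.

```python
import torch

@torch.no_grad()
def generate(model_fn, seed: torch.Tensor, num_steps: int) -> torch.Tensor:
    """seed: (1, Ti, n) token indices; returns (1, num_steps, n) generated tokens."""
    window = seed.clone()
    generated = []
    for _ in range(num_steps):
        logits = model_fn(window)                             # assumed shape (1, Ti, n, V)
        probs = torch.softmax(logits[:, -1], dim=-1)          # (1, n, V): last time step only
        nxt = torch.multinomial(probs[0], num_samples=1).T    # (1, n): one sample per codebook
        generated.append(nxt)
        # Slide the window: drop the oldest step, append the newest
        window = torch.cat([window[:, 1:], nxt.unsqueeze(1)], dim=1)
    return torch.stack(generated, dim=1)

Ti, n, V = 86, 4, 1024

def dummy_model(window: torch.Tensor) -> torch.Tensor:
    # Stand-in for the trained transformer: random logits of the assumed shape
    return torch.randn(window.shape[0], window.shape[1], n, V)

seed = torch.randint(0, V, (1, Ti, n))
tokens = generate(dummy_model, seed, num_steps=20)
print(tokens.shape)
```

The window length stays fixed at Ti throughout, so `num_steps` can be set to any value.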


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import time

import numpy as np

# and for creating a custom dataset and loader:
from torch.utils.data import DataLoader
import os
import yaml
import shutil

from utils.utils import generate_mask, save_model, writeDACFile, interpolate_vectors

# from DACTransformer.DACTransformer import TransformerDecoder
# from DACTransformer.CondQueryTransformer import ClassConditionedTransformer
# from DACTransformer.CondKeyTransformer import ClassConditionedKeyTransformer
# from DACTransformer.PostNormCondDACTransformer import PostNormCondDACTransformerDecoder
from DACTransformer.RopeCondDACTransformer import RopeCondDACTransformer

from dataloader.dataset import CustomDACDataset, onehot, getNumClasses

In [2]:
from torch.utils.tensorboard import SummaryWriter

### <font color='blue'> Parameters 
</font>

In [3]:
paramfile = 'params.yaml'  # alternative: 'params_sm.yaml'
DEVICE = 'cuda'  # 'cuda' or 'cpu'

In [4]:
# Training data dir

# Load YAML file
with open(paramfile, 'r') as file:
    params = yaml.safe_load(file)


data_dir = params['data_dir']
validator_data_dir = params['validator_data_dir']

# ---------     for the transformer  --------------#
vocab_size = params['vocab_size']
num_tokens = params['num_tokens']

cond_classes = getNumClasses() # 0
cond_params = params['cond_params']
cond_size = cond_classes + cond_params # num_classes + num params - not a FREE parameter!

embed_size = params['tblock_input_size'] - cond_size  # embed_size must be divisible by num_heads and by num_tokens
print(f'embed_size is {embed_size}')

Ti = params['Ti']
Tt = params['Tt']
batch_size = params['batch_size']

sequence_length = Tt  # For training

num_layers = params['num_layers']
num_heads = params['num_heads']
forward_expansion = params['forward_expansion']
dropout_rate = params['dropout_rate']
learning_rate = params['learning_rate']
num_epochs=params['num_epochs']

experiment_name=params['experiment'] 
outdir = 'runs' + '/' + experiment_name
basefname= 'out' + '.e' + str(embed_size) + '.l' + str(num_layers) + '.h' + str(num_heads) 

ErrorLogRate = params['ErrorLogRate'] #10
checkpoint_interval = params['checkpoint_interval']

verboselevel=0

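TransformerClass = globals().get(params['TransformerClass'])  # look up the imported class by name
if TransformerClass is None:
    raise ValueError(f"Unknown TransformerClass '{params['TransformerClass']}'; is it imported above?")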

print(f"using TransformerClass = {params['TransformerClass']}") 
print(f'basefname = {basefname}')
print(f'outdir = {outdir}')

###########################################################################
# Ensure the output directory exists, then copy whatever paramfile was used into it as params.yaml
os.makedirs(outdir, exist_ok=True)
shutil.copy(paramfile, outdir + '/params.yaml')

embed_size is 504
using TransformerClass = RopeCondDACTransformer
basefname = out.e504.l4.h8
outdir = runs/2025.02.19_ROPE_mask86layers4fexp4


'runs/2025.02.19_ROPE_mask86layers4fexp4/params.yaml'

### <font color='blue'> Set up cuda. 
Without it, training runs about 4 times slower  
</font>

In [5]:
if DEVICE == 'cuda':
    # If the Docker container was started with --gpus all, a specific GPU can be chosen here (e.g. cuda:0), or cpu
    device = torch.device(DEVICE)
    print(f'memory on cuda 0 is  {torch.cuda.get_device_properties(0).total_memory/1e9}')
else:
    device = torch.device(DEVICE)
device

memory on cuda 0 is  25.37816064


device(type='cuda')

### <font color='blue'> Load data 
</font>

In [6]:
# Create an instance of the dataset
dataset = CustomDACDataset(data_dir=data_dir)
# Create a DataLoader
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Validator dataset
if validator_data_dir is not None:
    validator_dataset = CustomDACDataset(data_dir=validator_data_dir)
    validator_dataloader = DataLoader(validator_dataset, batch_size=batch_size, shuffle=True)

#---------------------------------------------------------------
# Sanity check: print the shapes of the first batch
for batch_idx, (inputs, targets, cvect) in enumerate(dataloader):
    # inputs:  batch of input data of shape [batch_size, Tt, num_tokens]
    # targets: corresponding batch of target data of shape [batch_size, Tt, num_tokens]
    # cvect:   conditioning vectors of shape [batch_size, cond_size]
    if batch_idx == 0:
        print(f"Batch {batch_idx + 1}")
        print(f"Inputs shape: {inputs.shape}")
        print(f"Targets shape: {targets.shape}")
        print(f"cvect shape: {cvect.shape}")
        print(f'cvect is {cvect}')
        break  # only the first batch is needed for this check

Batch 1
Inputs shape: torch.Size([8, 430, 4])
Targets shape: torch.Size([8, 430, 4])
cvect shape: torch.Size([8, 8])
cvect is tensor([[0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.9000],
        [0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.4000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.1500],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.5500],
        [0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.5000],
        [0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.9000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.3000]])
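
Each row of `cvect` above appears to hold a one-hot class code (7 entries) followed by a single continuous parameter, for `cond_size` = 8 in total. The sketch below shows how such a per-example vector can be expanded across time, as the training cell does with `unsqueeze(1).expand(...)`. The final concatenation with the token embeddings is an assumption, suggested by `embed_size = tblock_input_size - cond_size` (504 + 8 = 512), and may not match what the model does internally.

```python
import torch

batch_size, Tt, embed_size, cond_size = 8, 430, 504, 8

cond_data = torch.rand(batch_size, cond_size)           # stand-in for cvect
token_emb = torch.randn(batch_size, Tt, embed_size)     # stand-in for the embedded tokens

# Repeat the per-example conditioning vector at every time step.
# expand() returns a view, so no data is copied.
cond_expanded = cond_data.unsqueeze(1).expand(-1, Tt, -1)     # (8, 430, 8)

# Assumption: conditioning is concatenated onto each time step's embedding
block_input = torch.cat([token_emb, cond_expanded], dim=-1)   # (8, 430, 512)

print(cond_expanded.shape, block_input.shape)
```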


### <font color='blue'> Instantiate model 
</font>

In [7]:
mask = generate_mask(Tt, Ti).to(device)
print(f'Mask.shape is {mask.shape}')
mask

Mask.shape is torch.Size([430, 430])


tensor([[0., -inf, -inf,  ..., -inf, -inf, -inf],
        [0., 0., -inf,  ..., -inf, -inf, -inf],
        [0., 0., 0.,  ..., -inf, -inf, -inf],
        ...,
        [-inf, -inf, -inf,  ..., 0., -inf, -inf],
        [-inf, -inf, -inf,  ..., 0., 0., -inf],
        [-inf, -inf, -inf,  ..., 0., 0., 0.]], device='cuda:0')

In [8]:
# Instantiate model, put it on the device
#model = TransformerDecoder(embed_size, num_layers, num_heads, forward_expansion, dropout_rate, Tt, num_tokens, vocab_size).to(device)
print(f'Creating model with embed_size={embed_size}, cond_size={cond_size}')

# if TransformerClass == TransformerDecoder :
#     model = TransformerDecoder(embed_size, num_layers, num_heads, forward_expansion, dropout_rate, Tt, cond_classes, num_tokens, vocab_size, cond_size, verboselevel).to(device)
# elif  TransformerClass == ClassConditionedTransformer:
#     model = ClassConditionedTransformer(embed_size, num_layers, num_heads, forward_expansion, dropout_rate, Tt, cond_classes, num_tokens, vocab_size, cond_size, verboselevel).to(device)
# else :
#     model = ClassConditionedKeyTransformer(embed_size, num_layers, num_heads, forward_expansion, dropout_rate, Tt, cond_classes, num_tokens, vocab_size, cond_size, verboselevel).to(device)

### if TransformerClass == PostNormCondDACTransformerDecoder :
###    model = TransformerClass(embed_size+cond_size, embed_size, num_layers, num_heads, forward_expansion, dropout_rate, Tt, cond_classes, num_tokens, vocab_size, cond_size, verboselevel).to(device)
### else :
###    model = TransformerClass(embed_size, num_layers, num_heads, forward_expansion, dropout_rate, Tt, cond_classes, num_tokens, vocab_size, cond_size, verboselevel).to(device)
model = TransformerClass(embed_size, num_layers, num_heads, forward_expansion, dropout_rate, Tt, cond_classes, num_tokens, vocab_size, cond_size, verboselevel).to(device)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# Count the number of parameters
num_params = sum(p.numel() for p in model.parameters())
print(f'Total number of parameters: {num_params}')

Creating model with embed_size=504, cond_size=8
Setting up MultiEmbedding with vocab_size= 1024, embed_size= 504, num_codebooks= 4
Setting up RotaryPositionalEmbedding with embed_size= 504, max_len= 430
Total number of parameters: 15837760


### <font color='blue'> Train !! 
</font>

In [9]:
# Initialize SummaryWriter
writer = SummaryWriter(outdir)


In [10]:

t0 = time.time()

for epoch in range(num_epochs):
    torch.cuda.empty_cache()
    model.train()
    for batch_idx, (input_data, target_data, cond_data) in enumerate(dataloader):
        if verboselevel > 5 :
            print(f' ---- submitting batch with input_data={input_data.shape}, target_data={target_data.shape}, cond_data={cond_data.shape}')
        #print(f"b{batch_idx} ", end='')
        optimizer.zero_grad()

        # Move inputs and targets to the device
        input_data, target_data, cond_data = input_data.to(device), target_data.to(device), cond_data.to(device)
        
        if cond_size==0 :  #Ignore conditioning data
            cond_expanded=None
        else : 
            # for dataset examples, expand the conditioning info across all time steps before passing to models
            cond_expanded = cond_data.unsqueeze(1).expand(-1, input_data.size(1), -1)
        
        #print(f'    after loading a batch,  input_data.shape is {input_data.shape}, and cond_data.shape is {cond_data.shape}')
        #print(f'    after loading a batch,  cond_expanded.shape is {cond_expanded.shape}')
        #print(f'    after loading a batch,  mask.shape is {mask.shape}')
        #print(f' model={model}')
        
        
        output = model(input_data, cond_expanded, mask)
    
        if verboselevel > 5 :
            print(f' TTTTTTTT after training, output shape ={output.shape}')
            print(f' TTTTTTTT Passing to CRITERION with , output.reshape(-1, vocab_size) = {output.reshape(-1, vocab_size).shape} and target_data.reshape(-1) = {target_data.reshape(-1).shape}' )
        
        loss = criterion(output.reshape(-1, vocab_size), target_data.reshape(-1)) # collapses all target_data dimensions into a single dimension

        loss.backward()
        optimizer.step()
    if (epoch+1) % ErrorLogRate == 0:
        print(f'')
        print(f'EPOCH {epoch+1}  ', end='')
        print(f'loss: {loss}')
        # Log the loss to TensorBoard
        writer.add_scalar('Loss/train', loss, epoch)
        
        if validator_data_dir is not None:
            model.eval()
            with torch.no_grad():
                val_loss = 0.0
                for val_inputs, val_targets, cond_data in validator_dataloader:
                    val_inputs, val_targets, cond_data = val_inputs.to(device), val_targets.to(device), cond_data.to(device)
                    
                    if cond_size == 0:  # Ignore conditioning data
                        cond_expanded = None
                    else:
                        # for dataset examples, expand the conditioning info across all time steps before passing to models
                        # (use the validation batch's own sequence length, not the last training batch's)
                        cond_expanded = cond_data.unsqueeze(1).expand(-1, val_inputs.size(1), -1)

                    val_outputs = model(val_inputs, cond_expanded, mask)
                    
                    # collapse all target dimensions into one; .item() accumulates a Python float rather than a tensor
                    val_loss += criterion(val_outputs.reshape(-1, vocab_size), val_targets.reshape(-1)).item()

            print(f'Validation Loss: {val_loss / len(validator_dataloader)}')
            writer.add_scalar('Loss/validation', val_loss / len(validator_dataloader), epoch)

            t1 = time.time()
            train_time = t1-t0
            print(f'train time for {epoch+1} epochs, was {train_time}' )
            
    if (epoch+1) % checkpoint_interval == 0:
        lastbasename = outdir+"/"+basefname+"_chkpt_"+str(epoch+1).zfill(4)
        print(f'save model to : {lastbasename} + ".pth" ')
        save_model(model, Ti,  lastbasename +".pth")
        
    
t1 = time.time()
train_time = t1-t0
print(f'train time for {num_epochs} epochs, was {train_time}' )
print(f'loss  =  {loss}' )

    


EPOCH 10  loss: 4.88798713684082
Validation Loss: 3.5976099967956543
train time for 10 epochs, was 125.13021945953369

EPOCH 20  loss: 3.3256430625915527
Validation Loss: 3.508418083190918
train time for 20 epochs, was 246.873375415802
save model to : runs/2025.02.19_ROPE_mask86layers4fexp4/out.e504.l4.h8_chkpt_0025 + ".pth" 

EPOCH 30  loss: 3.21386981010437
Validation Loss: 3.4533638954162598
train time for 30 epochs, was 369.7977879047394

EPOCH 40  loss: 1.4053882360458374
Validation Loss: 3.3868377208709717
train time for 40 epochs, was 493.34233570098877

EPOCH 50  loss: 2.753322124481201
Validation Loss: 3.482856035232544
train time for 50 epochs, was 614.843718290329
save model to : runs/2025.02.19_ROPE_mask86layers4fexp4/out.e504.l4.h8_chkpt_0050 + ".pth" 

EPOCH 60  loss: 2.1383743286132812
Validation Loss: 3.4238016605377197
train time for 60 epochs, was 736.3487875461578

EPOCH 70  loss: 3.988708257675171
Validation Loss: 3.45188307762146
train time for 70 epochs, was 857.

In [11]:
# Just check that the inference attention mask will look right.
# Actually, the inference mask can be None, since the context window is only as long as the maximum look-back in the training mask.
# That's why mask[:Ti, :Ti] is purely causal (lower-triangular); larger slices would show the banded mask again.
foo=mask[:Ti, :Ti]
foo

tensor([[0., -inf, -inf,  ..., -inf, -inf, -inf],
        [0., 0., -inf,  ..., -inf, -inf, -inf],
        [0., 0., 0.,  ..., -inf, -inf, -inf],
        ...,
        [0., 0., 0.,  ..., 0., -inf, -inf],
        [0., 0., 0.,  ..., 0., 0., -inf],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')

### <font color='blue'> Use CKPT_DAC_Audio7.ipynb  
to see and hear your generated audio   
</font>