### ChatGPT specs

A decoder-only transformer in PyTorch that predicts the 'next output' at each time step. 

Each time step t is represented by a vector of n=4 tokens. 
The length of the sequence (context window) is Ti=86 for inference, and Tt=8*Ti for training. That is, the context window for training is 8 times the length of the context window for inference. 
The attention is "causal", looking only back in time, and the maximum look-back time for the attention blocks is Ti (even when the sequence is longer during training).
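A banded causal mask of this kind can be sketched as follows. This is only an illustration, not the `generate_mask` helper the notebook imports from `utils.utils` (its implementation is not shown here). The name `banded_causal_mask` is hypothetical, and counting the current step as part of the look-back window is an assumption.

```python
import torch

def banded_causal_mask(seq_len: int, lookback: int) -> torch.Tensor:
    """Additive attention mask: 0 where attention is allowed, -inf elsewhere.

    Position i may attend to positions j with i - lookback < j <= i
    (assumption: the window of `lookback` steps includes the current step).
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query index, shape (T, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key index, shape (1, T)
    allowed = (j <= i) & (j > i - lookback)
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask[allowed] = 0.0
    return mask

# Small demo: 6 time steps, look-back of 3 steps.
mask_demo = banded_causal_mask(seq_len=6, lookback=3)
```

When `seq_len` equals `lookback`, the band covers the whole lower triangle and the mask reduces to an ordinary causal mask.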

The size of the vocabulary for each of the n tokens is V=1024. 

As usual, the dataloader supplies batches of (input, target) pairs, where each input and each target has size Tt*n (the sequence length times the number of tokens at each time step). The tokens are vocabulary indices in the range [0, V-1]. The targets are shifted by 1 relative to the input sequence, as is typical for networks that learn to predict the output for the next time step. 

The first layer in the architecture is a learnable "multiembedding" layer that embeds each of the n=4 tokens at each time step as an 8-dimensional vector. The n 8-dimensional vectors are concatenated to provide the 32-dimensional input for the transformer blocks at each time step. 
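One way to read the "multiembedding" description is one embedding table per token slot, with the results concatenated. The sketch below follows that reading; the class name `MultiEmbedding` and its argument names are hypothetical, and the `TransformerDecoder` imported later in this notebook may do it differently.

```python
import torch
import torch.nn as nn

class MultiEmbedding(nn.Module):
    """One embedding table per token slot; the outputs are concatenated."""

    def __init__(self, vocab_size: int = 1024, num_tokens: int = 4, per_token_dim: int = 8):
        super().__init__()
        self.tables = nn.ModuleList(
            [nn.Embedding(vocab_size, per_token_dim) for _ in range(num_tokens)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, num_tokens) integer indices in [0, vocab_size - 1]
        parts = [table(x[..., k]) for k, table in enumerate(self.tables)]
        return torch.cat(parts, dim=-1)  # (batch, T, num_tokens * per_token_dim)

emb = MultiEmbedding()
tokens = torch.randint(0, 1024, (2, 86, 4))  # dummy batch: 2 sequences of 86 steps
vectors = emb(tokens)                        # shape (2, 86, 32)
```

With this scheme the model width is always `num_tokens * per_token_dim`, which would explain why the parameter cell requires `embed_size` to be divisible by the number of tokens.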

Positional information is then supplied with Rotary Position Embedding (RoPE). Note that RoPE is not added to the 32-dimensional vectors; it is applied as a rotation to the query and key vectors inside each attention block.
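What makes RoPE suit a sliding window is that the attention score between two positions depends only on their offset, not on their absolute positions. The sketch below checks this on a single 2-D feature pair using plain Python; the frequency `theta` and the example vectors are arbitrary values chosen for illustration.

```python
import math

def rotate_pair(x0, x1, position, theta):
    """Rotate the 2-D pair (x0, x1) by the angle position * theta."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x0 * c - x1 * s, x0 * s + x1 * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.3                       # arbitrary frequency for this one pair
q, k = (1.0, 2.0), (0.5, -1.5)    # arbitrary query and key features

# Same relative offset (key is 3 steps behind the query) at two absolute positions.
score_a = dot(rotate_pair(*q, 10, theta), rotate_pair(*k, 7, theta))
score_b = dot(rotate_pair(*q, 110, theta), rotate_pair(*k, 107, theta))

# A different offset (5 steps) gives a different score.
score_c = dot(rotate_pair(*q, 10, theta), rotate_pair(*k, 5, theta))
```

Here `score_a` and `score_b` agree up to floating-point error, while `score_c` differs.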

We use a stack of b=8 standard transformer blocks (layer norms, a ReLU activation, and a forward expansion factor of 4 for the feed-forward linear layer). Each transformer block consumes and produces a sequence of 32-dimensional vectors as long as the context window. 
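For orientation, a stack with these dimensions can be assembled from PyTorch's built-in layers, since a decoder-only block is an encoder layer run with a causal mask. This is a rough sketch under assumptions: the head count of 4 is not given in the spec, and the built-in layer does not apply RoPE, so it is not a substitute for the custom `TransformerDecoder` used later in this notebook.

```python
import torch
import torch.nn as nn

d_model, forward_expansion = 32, 4
block = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=4,                                      # assumption: the spec does not fix the head count
    dim_feedforward=forward_expansion * d_model,  # 4x expansion in the feed-forward layer
    dropout=0.0,
    activation="relu",
    batch_first=True,
)
stack = nn.TransformerEncoder(block, num_layers=8)

T = 10
# Plain causal mask: -inf above the diagonal, 0 elsewhere.
causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
x = torch.randn(2, T, d_model)
y = stack(x, mask=causal)   # same shape as x: (2, T, 32)
```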

After the last transformer block, there should be a linear layer that maps the 32-dimensional vectors to the output size, which is V*n (the vocabulary size times the number of tokens stacked at each time step). These are the logits that will be fed to the softmax functions (one for each of the n tokens) that provide the probability distribution across the vocabulary. Use the criterion nn.CrossEntropyLoss() for computing the loss using the targets provided by the dataloader, and Adam for the optimizer.
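The output head and loss described above can be sketched with dummy tensors as follows. The variable names and the small batch and sequence sizes are illustrative only; the reshape mirrors the one used in the training cells of this notebook.

```python
import torch
import torch.nn as nn

B, T, n, V, d = 2, 5, 4, 1024, 32
head = nn.Linear(d, n * V)                 # maps each 32-d vector to n*V logits
criterion = nn.CrossEntropyLoss()

hidden = torch.randn(B, T, d)              # stand-in for the last block's output
targets = torch.randint(0, V, (B, T, n))   # dummy targets in [0, V-1]

logits = head(hidden).view(B, T, n, V)     # one V-way distribution per token slot
# Flatten so every (batch, time, token) entry is an independent V-way classification.
loss = criterion(logits.reshape(-1, V), targets.reshape(-1))
```

`nn.CrossEntropyLoss` applies the log-softmax itself, so the raw logits are passed in and no explicit softmax layer is needed during training.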

Again, at inference time, the fixed-length context window is shorter than the training sequence length and equal to the maximum look-back time of the attention blocks. The inference process should take the output produced at each time step (a stack of n tokens) and shift it into a sliding window that is used as input for the next time step. Please use the "Incremental Token Generation Using Cached States" approach to minimize the computational burden of the shifting input window, and take care that it works correctly with the RoPE positional coding. The length of the sequences generated during inference is arbitrary and should be settable with a parameter. 
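The interaction between cached states and RoPE can be illustrated with a single attention head. In this sketch each key is rotated once, at its absolute position, and then cached; the oldest entries are dropped once the cache exceeds the window. The class `CachedSelfAttention`, the helper `full_window`, and the random weights are all hypothetical and are not the notebook's model code.

```python
import torch

def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x (T, d) by angles proportional to positions (T,)."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = positions.to(torch.float32).unsqueeze(-1) * freqs          # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class CachedSelfAttention:
    """Single-head attention with a sliding key/value cache of at most `window` steps."""

    def __init__(self, dim, window):
        self.wq = torch.randn(dim, dim) / dim ** 0.5
        self.wk = torch.randn(dim, dim) / dim ** 0.5
        self.wv = torch.randn(dim, dim) / dim ** 0.5
        self.window = window
        self.keys, self.values, self.pos = [], [], 0

    def step(self, x):
        # x: (dim,) input for the newest time step
        p = torch.tensor([self.pos])
        q = rope((x @ self.wq).unsqueeze(0), p)
        k = rope((x @ self.wk).unsqueeze(0), p)  # rotated once, at its absolute position
        self.keys.append(k)
        self.values.append((x @ self.wv).unsqueeze(0))
        # Drop entries that have left the look-back window.
        self.keys = self.keys[-self.window:]
        self.values = self.values[-self.window:]
        self.pos += 1
        K, V = torch.cat(self.keys), torch.cat(self.values)
        attn = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
        return (attn @ V).squeeze(0)

def full_window(layer, xs):
    """Reference: recompute the last step over the whole window, positions 0..len-1."""
    pos = torch.arange(xs.shape[0])
    q = rope(xs[-1:] @ layer.wq, pos[-1:])
    K = rope(xs @ layer.wk, pos)
    V = xs @ layer.wv
    attn = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    return (attn @ V).squeeze(0)

torch.manual_seed(0)
dim, window, steps = 8, 4, 10
layer = CachedSelfAttention(dim, window)
xs = torch.randn(steps, dim)
cached_out = [layer.step(x) for x in xs]       # incremental, one step at a time
reference_out = full_window(layer, xs[-window:])  # recomputed from scratch
```

Because RoPE scores depend only on relative offsets, the cached result for the last step matches the recomputed one even though the two use different absolute positions. This holds for a single attention layer. In a deeper stack, the cached keys and values of upper layers were computed from context that has since left the window, so the cached output is not identical to recomputing over the truncated window.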

Include an example (with dummy inputs and targets) of how to call the code for a training step, and an example of inference for a specifiable output length.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import math
import time

import numpy as np

# and for creating a custom dataset and loader:
from torch.utils.data import Dataset, DataLoader
import os
import dac

from utils.utils import generate_mask, save_model, writeDACFile
from DACTransformer.DACTransformer import TransformerDecoder

In [2]:
from torch.utils.tensorboard import SummaryWriter

### <font color='blue'> Parameters 
</font>

In [3]:
# Training data dir

# testsnd="all"
# data_dir= "/scratch/syntex/PisWinAppBee_long_44/dac-train"
# validator_data_dir= "/scratch/syntex/PisWinAppBee_long_44/dac-val"

testsnd="bees" # pistons, wind, applause, bees 
data_dir="/scratch/syntex/PisWinAppBee_long_44/onesnddac/"+testsnd+"-train-small"  ##******* small
validator_data_dir="/scratch/syntex/PisWinAppBee_long_44/onesnddac/"+testsnd+"-val-small"

# ---------     for the transformer  --------------#
vocab_size = 1024
num_tokens = 4
embed_size = 256 # 240 #32  # embed_size must be divisible by num_heads and by num tokens
Ti = 86
Tt = 430 # must match the length of the sequences in the batch
batch_size = 10
sequence_length = Tt  # For training

num_layers=4
num_heads=8 # 8 # embed_size must be divisible by num_heads
forward_expansion=4 #4
dropout_rate=0.1
learning_rate=0.001

top_n = 5   # not used yet

num_epochs=100
experiment_name='scratchbees'  #the higher the embed size (to 240 anyway) the fewer epochs are necessary
outdir = 'runs' + '/' + experiment_name
basefname=testsnd+ '.e' + str(embed_size) + '.l' + str(num_layers) + '.h' + str(num_heads) 

DEVICE='cuda' # My experiments show CUDA is (only) 4 times faster than CPU!
ALSO_RUN_DUMMY_STEP =False # there are two cells that set up a single dummy input and target and take one training step

inference_steps=86*20 # second factor is number of seconds (for 86 tokens/sec)

ErrorLogRate=10
checkpoint_interval=25

print(f'basefname = {basefname}')

basefname = bees.e256.l4.h8


### <font color='blue'> Set up cuda. 
Without it, training runs about 4 times slower  
</font>

In [4]:
torch.cuda.device_count()
torch.cuda.get_device_properties(0).total_memory/1e9

device = torch.device(DEVICE) # if the docker was started with --gpus all, then can choose here with cuda:0 (or cpu)
torch.cuda.device_count()
print(f'memory on cuda 0 is  {torch.cuda.get_device_properties(0).total_memory/1e9}')

device

memory on cuda 0 is  25.216745472


device(type='cuda')

### <font color='blue'> Create a custom dataset 
</font>

In [5]:
class CustomDACDataset(Dataset):
    def __init__(self, data_dir):
        """
        Args:
            data_dir (string): Directory with all the data files.
        """
        self.data_dir = data_dir
        self.file_names = os.listdir(data_dir)

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, idx):
        fpath = os.path.join(self.data_dir, self.file_names[idx])
        dacfile=dac.DACFile.load(fpath)  # Load the data file
        data = dacfile.codes

        # Assuming data is a tensor of shape [1, N, T]
        # We remove the first dimension to get a tensor of shape [N, T]
        data = data.squeeze(0)

        # The input is the data itself
        input_data = data[:, :-1]  # All time steps except the last one
        # The target is the data shifted by one time step
        target_data = data[:, 1:]  # All time steps except the first one

        # Transpose the last dimensions so we get [T, N] for the transformer
        return input_data.transpose(0, 1), target_data.transpose(0, 1)


In [6]:
# Create an instance of the dataset
dataset = CustomDACDataset(data_dir=data_dir)
# Create a DataLoader
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

#Validator data set
if validator_data_dir is not None:
    validator_dataset=CustomDACDataset(data_dir=validator_data_dir)
    validator_dataloader= DataLoader(validator_dataset, batch_size=batch_size, shuffle=True)

#---------------------------------------------------------------
# Sanity check: iterate over the training data once and report the last batch's shapes
for batch_idx, (inputs, targets) in enumerate(dataloader):
    pass
    # Your training code here
    # inputs: batch of input data of shape [batch_size, T-1, N]
    # targets: corresponding batch of target data of shape [batch_size, T-1, N]
    
    #print(f"Batch {batch_idx + 1}")
    #print(f"Inputs shape: {inputs.shape}")
    #print(f"Targets shape: {targets.shape}")
print(f"Batch {batch_idx + 1}")
print(f"Inputs shape: {inputs.shape}")
print(f"Targets shape: {targets.shape}")

Batch 60
Inputs shape: torch.Size([8, 430, 4])
Targets shape: torch.Size([8, 430, 4])


In [7]:

mask = generate_mask(Tt, Ti).to(device)
print(f'Mask.shape is {mask.shape}')
mask

Mask.shape is torch.Size([430, 430])


tensor([[0., -inf, -inf,  ..., -inf, -inf, -inf],
        [0., 0., -inf,  ..., -inf, -inf, -inf],
        [0., 0., 0.,  ..., -inf, -inf, -inf],
        ...,
        [-inf, -inf, -inf,  ..., 0., -inf, -inf],
        [-inf, -inf, -inf,  ..., 0., 0., -inf],
        [-inf, -inf, -inf,  ..., 0., 0., 0.]], device='cuda:0')

In [8]:
# Instantiate model, put it on the device
model = TransformerDecoder(embed_size, num_layers, num_heads, forward_expansion, dropout_rate, Tt, num_tokens, vocab_size).to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

if ALSO_RUN_DUMMY_STEP : 
    # Dummy input and target
    input_data = torch.randint(0, vocab_size, (batch_size, sequence_length, num_tokens)).to(device)
    target_data = torch.randint(0, vocab_size, (batch_size, sequence_length, num_tokens)).to(device)  # Shifted by dataloader typically

    print(f'shape of input_data is {input_data.shape}')
    print(f'shape of target_data is {target_data.shape}')

In [9]:
# DUMMY train one step on dummy input
if ALSO_RUN_DUMMY_STEP : 
    model.train()

    optimizer.zero_grad()

    # Move inputs and targets to the device
    input_data, target_data = input_data.to(device), target_data.to(device) 

    output = model(input_data, mask)
    loss = criterion(output.reshape(-1, vocab_size), target_data.reshape(-1)) # collapses all target_data dimensions into a single dimension

    loss.backward()
    optimizer.step()


    print(f'output.shape is {output.shape}')
    print(f'output[:, -1, :].shape is {output[:, -1, :].shape}')
    print(f'output[:, -1, :].max(-1)[1].shape is {output[:, -1, :].max(-1)[1].shape}')

    # This takes the argmax over the vocabulary independently for each of the 4 tokens
    next_token = output[:, -1, :].max(-1)[1]
    print(f'The indices of the 4 max scoring token are {next_token}')
        

In [11]:
def inference(model, Ti, vocab_size, num_tokens, inference_steps, fname):
    model.eval()
    mask = generate_mask(Ti, Ti).to(device)
    input_data = torch.randint(0, vocab_size, (1, Ti, num_tokens)).to(device)  # Random seed window of inference length Ti
    predictions = []

    t0 = time.time()
    with torch.no_grad():  # no gradients are needed for generation
        for i in range(inference_steps):  # Generate inference_steps time steps
            output = model(input_data, mask)

            # output[:, -1, :, :] is the last vector of the sequence (the new predicted token stack), of size (b, 4, 1024).
            # .max(-1) takes the max across the last dimension (the scores over the vocabulary for each of the 4 tokens).
            # It returns a tuple of tensors: the first holds the max values (one per token), the second holds the
            #        indices in the range of the vocabulary size.
            # THAT IS, the 4 selected "best" tokens are taken independently
            next_token = output[:, -1, :, :].max(-1)[1]  # Greedy decoding for simplicity
            predictions.append(next_token)
            input_data = torch.cat([input_data, next_token.unsqueeze(1)], dim=1)[:, 1:]  # Slide window

    t1 = time.time()
    inf_time = t1-t0
    print(f'inference time for {inference_steps} steps, or {inference_steps/86} seconds of sound is {inf_time}' )

    dacseq = torch.cat(predictions, dim=0).unsqueeze(0).transpose(1, 2)
    if mask is None:
        writeDACFile(fname + '_unmasked', dacseq)
    else:
        writeDACFile(fname, dacseq)
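
The `inference` function above decodes greedily, and the `top_n` parameter set earlier is marked as not used yet. One possible way to use it is to sample each of the 4 tokens from its `top_n` highest-scoring vocabulary entries. The helper `sample_top_n` below is a hypothetical sketch, not part of the notebook's code, and it assumes logits of shape (batch, num_tokens, vocab_size) for the newest time step.

```python
import torch

def sample_top_n(logits: torch.Tensor, top_n: int = 5) -> torch.Tensor:
    """Sample one index per token slot from the top_n highest-scoring entries.

    logits: (batch, num_tokens, vocab_size) scores for the newest time step.
    Returns: (batch, num_tokens) sampled vocabulary indices.
    """
    top_vals, top_idx = logits.topk(top_n, dim=-1)           # (B, n, top_n)
    probs = torch.softmax(top_vals, dim=-1)                  # renormalize over the top_n
    flat = probs.reshape(-1, top_n)
    choice = torch.multinomial(flat, num_samples=1)          # (B*n, 1) positions within top_n
    choice = choice.reshape(*logits.shape[:-1], 1)
    return top_idx.gather(-1, choice).squeeze(-1)            # map back to vocabulary indices

logits = torch.randn(1, 4, 1024)             # dummy scores for one time step
next_token = sample_top_n(logits, top_n=5)   # shape (1, 4)
```

With `top_n=1` this reduces to the greedy choice made in `inference`.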

### <font color='blue'> Train !! 
</font>

In [12]:
# Initialize SummaryWriter
writer = SummaryWriter(outdir)

In [13]:

t0 = time.time()

for epoch in range(num_epochs):
    torch.cuda.empty_cache()
    model.train()
    for batch_idx, (input_data, target_data) in enumerate(dataloader):
        #print(f"b{batch_idx} ", end='')
        optimizer.zero_grad()

        # Move inputs and targets to the device
        input_data, target_data = input_data.to(device), target_data.to(device) 

        output = model(input_data, mask)
        loss = criterion(output.reshape(-1, vocab_size), target_data.reshape(-1)) # collapses all target_data dimensions into a single dimension

        loss.backward()
        optimizer.step()
    if (epoch+1) % ErrorLogRate == 0:
        print(f'')
        print(f'EPOCH {epoch+1}  ', end='')
        print(f'loss: {loss}')
        # Log the loss to TensorBoard
        writer.add_scalar('Loss/train', loss.item(), epoch)
        
        if validator_data_dir is not None:
            model.eval()
            with torch.no_grad():
                val_loss = 0
                for val_inputs, val_targets in validator_dataloader:
                    val_inputs, val_targets = val_inputs.to(device), val_targets.to(device)
                    val_outputs = model(val_inputs, mask)
                    
                    val_loss += criterion(val_outputs.reshape(-1, vocab_size), val_targets.reshape(-1)).item() # collapses all target_data dimensions into a single dimension

            avg_val_loss = val_loss / len(validator_dataloader)  # mean over validation batches
            print(f'Validation Loss: {avg_val_loss}')
            writer.add_scalar('Loss/validation', avg_val_loss, epoch)  # log the same average that is printed
            
    if (epoch+1) % checkpoint_interval == 0:
        lastbasename = outdir+"/"+basefname+"_chkpt_"+str(epoch+1).zfill(4)
        save_model(model, Ti,  lastbasename +".pth")
        #inference(model, mask[:Ti, :Ti], inference_steps, lastbasename) 
        inference(model, Ti, vocab_size, num_tokens, inference_steps, lastbasename) 
            
t1 = time.time()
train_time = t1-t0
print(f'train time for {num_epochs} epochs, was {train_time}' )

    


EPOCH 10  loss: 4.029446601867676
Validation Loss: 5.015473365783691

EPOCH 20  loss: 3.082047700881958
Validation Loss: 5.757481098175049
inference time for 1720 steps, or 20.0 seconds of sound is 3.9363088607788086

EPOCH 30  loss: 2.575639486312866
Validation Loss: 6.420976161956787

EPOCH 40  loss: 2.2692220211029053
Validation Loss: 6.933044910430908

EPOCH 50  loss: 2.11422061920166
Validation Loss: 7.254635810852051
inference time for 1720 steps, or 20.0 seconds of sound is 3.8679521083831787

EPOCH 60  loss: 1.9818941354751587
Validation Loss: 7.465778350830078

EPOCH 70  loss: 1.9187541007995605
Validation Loss: 7.759516716003418
inference time for 1720 steps, or 20.0 seconds of sound is 3.936495304107666

EPOCH 80  loss: 1.816812515258789
Validation Loss: 7.910547256469727

EPOCH 90  loss: 1.7511812448501587
Validation Loss: 8.132528305053711

EPOCH 100  loss: 1.6574862003326416
Validation Loss: 8.234443664550781
inference time for 1720 steps, or 20.0 seconds of sound is 3.9

In [14]:
# Just check that the inference attention mask looks right.
# Actually, the inference mask can be None since we are using a context window only as long as the maximum look-back in the training mask.
# That's why the mask sliced with [:Ti, :Ti] is a plain causal mask (-inf only above the diagonal). Longer dims would show a banded mask again.
foo=mask[:Ti, :Ti]
foo

tensor([[0., -inf, -inf,  ..., -inf, -inf, -inf],
        [0., 0., -inf,  ..., -inf, -inf, -inf],
        [0., 0., 0.,  ..., -inf, -inf, -inf],
        ...,
        [0., 0., 0.,  ..., 0., -inf, -inf],
        [0., 0., 0.,  ..., 0., 0., -inf],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')

### <font color='blue'> Use readDac.ipynb  
to see and hear your generated audio   
</font>