## CNN Attn: GPU Example

### Intro 
This notebook is an example of how to get one of my early audio CNN/Transformer models onto a GPU. It includes some of my thoughts and the issues I have had to consider. The model is an age and gender detector trained on audio samples from the Mozilla Common Voice dataset.

##### Import the relevant libraries

In [3]:
import torch 
import torchaudio
import torchaudio.transforms as T
import torchaudio.functional as F
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from collections import Counter
import numpy as np

##### Setup for possible use of the GPU (any serious tuning will require GPU)

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')
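As a minimal sketch of what the device handle is for: a tensor and a module have to sit on the same device before they can be combined. The layer and the random batch below are throwaway placeholders, not part of the model in this notebook.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

layer = nn.Linear(4, 2).to(device)     # parameters now live on `device`
batch = torch.randn(8, 4).to(device)   # inputs have to be moved as well
out = layer(batch)                     # works because both share a device

print(out.shape, out.device.type)
```

On a machine without CUDA this falls back to the CPU, so the same cell runs in both settings.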

### Data
This notebook is not connected to live data. My initial input was the audio waveforms as PyTorch tensors. These were being transformed into their final form in the Dataset, but this meant an additional calculation inside the training loop. The calculation was not too expensive when running on a CPU, but an inability to set up the transformation on the GPU meant much of the GPU's advantage was lost. 

In [None]:
train_specs = torch.load("train_waveforms.pt")
dev_specs = torch.load("dev_waveforms.pt")
train_ages = torch.load("train_ages.pt")
dev_ages = torch.load("dev_ages.pt")
train_gens = torch.load("train_gens.pt")
dev_gens = torch.load("dev_gens.pt")

After loading, the following code cuts every waveform down to a 3.8 second window (121,600 samples at a sample rate of 32,000 Hz) and transforms it into a 1x128x128 mel spectrogram. The first and second derivatives of the coefficients are concatenated as the 2nd and 3rd input channels of the image. Somewhere along the way the resulting tensor must be transferred to the device to allow GPU processing.  

In [None]:
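transform = T.MelSpectrogram(sample_rate=32000,n_fft=1900)

def to_specs(waveforms, n_samples=121600):
  """Turns N waveforms into an N x 3 x 128 x 128 tensor (mel, delta, delta-delta)"""
  specs = []
  for i in range(waveforms.size(0)):
    form = waveforms[i,:n_samples] # 3.8 s at 32 kHz
    x = transform(form) # 128 mel bands x 129 frames
    x = x[:,:-1] # dropping the 129th frame on the time axis
    delta = F.compute_deltas(x,win_length=7)
    delta2 = F.compute_deltas(delta,win_length=7)
    specs.append(torch.stack((x,delta,delta2))) # 3 x 128 x 128
  return torch.stack(specs).to(device) # one transfer to the device

# train_specs and dev_specs hold the raw waveforms up to this point
train_specs = to_specs(train_specs)
dev_specs = to_specs(dev_specs)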
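The 128x129 shape that gets trimmed to 128x128 follows from the window arithmetic. The sketch below works it out in plain Python, assuming the torchaudio defaults for MelSpectrogram that the cell above relies on (win_length = n_fft, hop_length = win_length // 2, center=True, n_mels=128). The helper name num_frames is mine, not a torchaudio function.

```python
def num_frames(num_samples, n_fft, hop_length=None, center=True):
    """Number of STFT frames for a signal of `num_samples` samples."""
    if hop_length is None:
        hop_length = n_fft // 2          # assumed torchaudio default
    if center:
        # the signal is padded by n_fft // 2 on both sides
        return num_samples // hop_length + 1
    return (num_samples - n_fft) // hop_length + 1

sample_rate = 32000
num_samples = 121600

seconds = num_samples / sample_rate       # length of the window in seconds
frames = num_frames(num_samples, n_fft=1900)

print(seconds, frames)
```

With n_fft=1900 the hop is 950 samples, and 121600 / 950 is exactly 128, so the centered transform yields 129 frames and the last one is dropped to get a square input.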

### Dataset
The Dataset inherits from torch.utils.data.Dataset and requires an \_\_init\_\_, a \_\_getitem\_\_, and a \_\_len\_\_. There is no requirement to do anything fancy here, but it is an opportunity to make sure all tensors are on the GPU. Depending on how you receive the data, the spectrogram transformation can be incorporated into the \_\_getitem\_\_ method. From a computational and complexity point of view, it is better if the datapoints are already transformed. 

In [None]:
class MyDataset(Dataset):
  """
  Creates a PyTorch dataset class
  inputs: spectrogram tensor (N x 3 x 128 x 128), age list, gender list
  outputs: PyTorch Dataset where [i] returns (x, age, gender), x being 3x128x128 Channels x Freq x Time
  with C1 mel energies, C2 first derivatives, C3 second derivatives
  """
  
  
  def __init__(self, specs, age, gender):
    self.waveforms = specs
    self.age = torch.LongTensor(age).to(device)
    self.gender = torch.LongTensor(gender).to(device)

  def __getitem__(self,index):
    x = self.waveforms[index]
    y = self.age[index]
    z = self.gender[index]
    
    return x,y,z
    
  def __len__(self):
    return len(self.waveforms)
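To see what a DataLoader does with such a Dataset, here is a small self-contained sketch. TinySpecDataset is a hypothetical stand-in with the same three methods, and the inputs are random tensors, not Common Voice data.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TinySpecDataset(Dataset):
    """Hypothetical stand-in holding pre-computed 3x128x128 inputs."""

    def __init__(self, specs, age, gender):
        self.specs = specs
        self.age = torch.as_tensor(age, dtype=torch.long)
        self.gender = torch.as_tensor(gender, dtype=torch.long)

    def __getitem__(self, index):
        return self.specs[index], self.age[index], self.gender[index]

    def __len__(self):
        return len(self.specs)

specs = torch.randn(10, 3, 128, 128)      # random placeholder inputs
ages = [i % 6 for i in range(10)]         # six age classes
genders = [i % 2 for i in range(10)]      # two gender classes

loader = DataLoader(TinySpecDataset(specs, ages, genders), batch_size=4, shuffle=False)
x, y, z = next(iter(loader))
print(x.shape, y.shape, z.shape)
```

The loader stacks the per-item tuples into batched tensors, so each batch arrives as (data, age, gender), which is the order the training loop unpacks.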

### Model
Refer to https://rjnclarke.wixsite.com/rich-clarke for full details of the model. Basically this is a shallow CNN with 2 convolutions of 12 and 30 filters. This shallow structure appears well suited to the audio spectrogram. In addition, the primary channel of the image is fed into 3 attention heads, as in "Attention Is All You Need". The inputs are the rows of the spectrogram, so the embedding is the coefficient values at each time interval across a given frequency, and each frequency band is one element of the sequence. The transformer encoder is then interpreting these frequency bands as they relate to each other. The CNN and attention outputs are simply concatenated before being fed into a classifier. As there is minimal reduction in the CNN or the Transformer, the transfer to the fully connected layer is expensive in terms of parameters. This version has 137,000,000. The model returns one result for the age problem and one for the gender problem. 

In [None]:
class GenderAgeattnModel(nn.Module):
    def __init__(self):
        super(GenderAgeattnModel, self).__init__()
        
        # dropouts 
        self.dropout1 = nn.Dropout(0.10) # important to optimize 
        self.dropout2 = nn.Dropout(0.15) # important to optimize 
        
        # dense
        self.batch1 = nn.BatchNorm2d(3)

        self.linear3 = nn.Linear(16384,3000)
        
        self.batch4 = nn.BatchNorm1d(3000)
        self.relu4 = nn.ReLU()
        self.linear4 = nn.Linear(3000,100)
        
        self.batch5 = nn.BatchNorm1d(100)
        self.relu5 = nn.ReLU()
        self.linear_gen = nn.Linear(100,2)
        self.linear_age = nn.Linear(100,6)
        
        
        
        # Attention track 
        # Head #1
        self.mp = nn.MaxPool2d(2) # unused 
        self.k1 = nn.Linear(128, 128) # input_size, k size
        self.v1 = nn.Linear(128, 128) # input_size, v size
        self.q1 = nn.Linear(128, 128) # input_size, q_size 

        # Head #2
        self.k2 = nn.Linear(128, 128)
        self.v2 = nn.Linear(128, 128)
        self.q2 = nn.Linear(128, 128)
        
        # Head #3
        self.k3 = nn.Linear(128, 128)
        self.v3 = nn.Linear(128, 128)
        self.q3 = nn.Linear(128, 128)
        
        self.softmax = nn.Softmax(dim=2)
        self.attention_head_projection = nn.Linear(384, 128) # dim_v * 3 heads
        self.norm_mh = nn.LayerNorm(128)
        self.relu_attn = nn.ReLU() # unused 
       

    def forward(self, x):
        x = self.batch1(x)  
        x = self.dropout1(x)
        
        # attn track 
        ins = x[:,0,:,:]
        
        # Attention Head 1
        qs1 = self.q1(ins)
        ks1 = self.k1(ins)
        vs1 = self.v1(ins)
        sims_1 = torch.bmm(qs1,torch.transpose(ks1,1,2))/ np.sqrt(64)
        soft_1 = self.softmax(sims_1)
        weighted_1 = torch.bmm(soft_1,vs1)
        
        # Attention Head 2
        qs2 = self.q2(ins)
        ks2 = self.k2(ins)
        vs2 = self.v2(ins)
        sims_2 = torch.bmm(qs2,torch.transpose(ks2,1,2))/ np.sqrt(64)
        soft_2 = self.softmax(sims_2)
        weighted_2 = torch.bmm(soft_2,vs2) # B x 128 x 128
        
        
        # Attention Head 3
        qs3 = self.q3(ins)
        ks3 = self.k3(ins)
        vs3 = self.v3(ins)
        sims_3 = torch.bmm(qs3,torch.transpose(ks3,1,2))/ np.sqrt(64)
        soft_3 = self.softmax(sims_3)
        weighted_3 = torch.bmm(soft_3,vs3)
    
        # concat attn heads and project 
        concat = torch.cat((weighted_1,weighted_2,weighted_3),dim=2) #B x 128 x 384 
        projected = self.attention_head_projection(concat) # B x 128 x 128
        normed = self.norm_mh(projected) # B x 128 x 128
        out_attn = self.relu_attn(normed)
        out_attn = out_attn.view(-1,16384) # B x 16384
        
        # dense   
        full = self.dropout2(out_attn)
        x = self.linear3(full)
        
        #Bx3000
        
        x = self.batch4(x)
        x = self.relu4(x)
        x = self.linear4(x)
        
        #Bx100
        
        
        x = self.batch5(x)
        x = self.relu5(x)
        gen = self.linear_gen(x) # B x 2
        age = self.linear_age(x) # B x 6

        return gen, age
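The shape bookkeeping of the attention track can be followed in a few lines of NumPy. This is only a sketch under stated assumptions: the projection weights are random placeholders for the learned q/k/v layers, biases are left out, and the scores are scaled by sqrt(d_k) as in the original paper, whereas the model above divides by sqrt(64).

```python
import numpy as np

rng = np.random.default_rng(0)
B, N_BANDS, N_FRAMES = 2, 128, 128    # batch, frequency rows, time frames
d_k = 128                             # width of the q/k/v projections

# primary (mel) channel only: each frequency row is one sequence element
x = rng.standard_normal((B, N_BANDS, N_FRAMES))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def head(x):
    """One attention head with random placeholder weights."""
    wq, wk, wv = (rng.standard_normal((N_FRAMES, d_k)) / np.sqrt(N_FRAMES)
                  for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv                     # each B x 128 x d_k
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)     # B x 128 x 128
    weights = softmax(scores, axis=2)                    # rows sum to 1
    return weights @ v, weights                          # B x 128 x d_k

outs, attn = zip(*(head(x) for _ in range(3)))
concat = np.concatenate(outs, axis=2)                    # B x 128 x 384
print(concat.shape)
```

Each row of the attention weights says how much one frequency band draws on every other band, and the three heads are concatenated along the last axis before the projection back to 128.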

Keep track of accuracy (to convert back to NumPy, the tensors must be on the CPU).

In [None]:
def accuracy(out_age,out_gen, target_age,target_gen):
    """Fraction of samples where both the age and the gender prediction are correct"""
    batch_size = target_age.shape[0]

    # predicted class = index of the largest logit
    _, pred_age = torch.max(out_age, dim=-1)
    _, pred_gen = torch.max(out_gen, dim=-1)

    # NumPy conversion requires the tensors on the CPU
    pred_age = pred_age.cpu().numpy()
    pred_gen = pred_gen.cpu().numpy()
    target_age = target_age.cpu().numpy()
    target_gen = target_gen.cpu().numpy()
    
    correct_age = pred_age == target_age 
    correct_gen = pred_gen == target_gen
    correct = correct_age & correct_gen # both heads must be right
    
    acc = np.sum(correct) / batch_size

    return acc
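A joint accuracy only counts a sample when both heads are right. The made-up labels below show how the two boolean masks combine, and why comparing the masks with == gives a different number: it also counts samples where both heads are wrong.

```python
import numpy as np

# made-up predictions and targets for four samples
pred_age = np.array([0, 1, 2, 3])
target_age = np.array([0, 1, 5, 4])     # last two age predictions are wrong
pred_gen = np.array([1, 0, 1, 0])
target_gen = np.array([1, 1, 1, 1])     # second and fourth gender predictions are wrong

correct_age = pred_age == target_age    # [True, True, False, False]
correct_gen = pred_gen == target_gen    # [True, False, True, False]

both_right = correct_age & correct_gen       # only the first sample
same_outcome = correct_age == correct_gen    # first and fourth sample

joint_acc = both_right.mean()
agreement = same_outcome.mean()
print(joint_acc, agreement)
```

Here the joint accuracy is 0.25, while the == comparison reports 0.5 because the fourth sample is wrong on both tasks.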

### Training function
The training function takes two loss functions, one for each problem. The weighting between them is another hyperparameter to consider.

In [None]:
def train(epoch, data_loader, model, optimizer, criterion_age,criterion_gender,gen_weight=0.5):
        
    running_loss = []
    running_acc = []
    run_ind = 0
    total_acc = 0
    total_loss = 0 
    
    for idx, (data, age, gender) in enumerate(data_loader):
        
      if torch.cuda.is_available():
          data = data.cuda()
          age = age.cuda()
          gender = gender.cuda()
          

      optimizer.zero_grad()
      
      # get output from data 
      out_gen,out_age = model(data)
      
      # calculate loss and gradient 
      loss_gen = criterion_gender(out_gen, gender) # gender loss
      loss_age = criterion_age(out_age,age)
      loss = gen_weight*loss_gen + (1-gen_weight)*loss_age
      loss.backward()
      running_loss.append(loss.item())
      batch_acc = accuracy(out_age,out_gen,age,gender) # compute once per batch
      running_acc.append(batch_acc)
      run_ind += 1
      # weight each batch by its share of the dataset
      total_acc += batch_acc * (data.size(0)/len(data_loader.dataset))
      total_loss += loss.item() * (data.size(0)/len(data_loader.dataset))

      
      # adjust learning weights 
      optimizer.step()

      if (idx+1) % 32 == 0:
          print(f"epoch:{epoch},run:{run_ind}/{len(data_loader)},loss:{round(np.sum(running_loss)/len(running_loss),4)},Acc:{round(np.sum(running_acc)/len(running_acc),4)}")
          running_loss = []
          running_acc = []
            
    print("##################")
    print(f"Epoch {epoch}, Train Accuracy {round(total_acc,4)},Train Loss {round(total_loss,4)}")
    return total_acc, total_loss
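The effect of gen_weight is easiest to see in isolation. In this sketch the two heads are fresh linear layers on random features (placeholders for the shared 100-dimensional layer of the model), and combine is a hypothetical helper that mirrors the weighted sum in the training loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

features = torch.randn(8, 100)          # stand-in for the shared 100-d layer
head_gen = nn.Linear(100, 2)            # gender head, 2 classes
head_age = nn.Linear(100, 6)            # age head, 6 classes
gender = torch.randint(0, 2, (8,))      # random placeholder labels
age = torch.randint(0, 6, (8,))

criterion = nn.CrossEntropyLoss()
loss_gen = criterion(head_gen(features), gender)
loss_age = criterion(head_age(features), age)

def combine(loss_gen, loss_age, gen_weight=0.5):
    """Convex combination of the two task losses."""
    return gen_weight * loss_gen + (1 - gen_weight) * loss_age

for w in (0.0, 0.5, 1.0):
    print(w, combine(loss_gen, loss_age, w).item())
```

At gen_weight=1.0 only the gender loss drives the gradients and at 0.0 only the age loss does, so the value sets how the shared layers trade off the two problems.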

### Validating

In [None]:
def validate(epoch, val_loader, model, criterion_age,criterion_gender,gen_weight=0.5):
    # evaluation loop: dropout off, batch norm uses its running statistics
    was_training = model.training
    model.eval()

    for idx, (data, age, gender) in enumerate(val_loader):

        if torch.cuda.is_available():
            data = data.cuda()
            age = age.cuda()
            gender = gender.cuda()
            
        with torch.no_grad():
            
            out_gen,out_age = model(data)
            loss_gen = criterion_gender(out_gen, gender) # gender loss
            loss_age = criterion_age(out_age,age) # age loss
            loss = gen_weight*loss_gen + (1-gen_weight)*loss_age
            
            acc = accuracy(out_age,out_gen,age,gender)
            
    model.train(was_training) # restore the previous mode for the next epoch

    # the validation set arrives as one batch, so the last batch is the whole set
    print(f"Epoch {epoch}, Val Accuracy {round(acc,4)}, Val Loss {round(loss.item(),4)}")
    return acc, loss.item()

### Setup
The validation set goes through in one big batch. The model must be moved to the device. 

In [None]:
train_dataset = MyDataset(train_specs,train_ages,train_gens)
dev_dataset = MyDataset(dev_specs,dev_ages,dev_gens)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
dev_loader = DataLoader(dev_dataset,batch_size=len(dev_dataset),shuffle=False)

# define model
model = GenderAgeattnModel().to(device)

Sanity check on training batches and dimensions 64x3x128x128 = Batch x Channels x Height x Width...

In [None]:
examples = iter(train_loader)
example_data,example_age,example_gender = next(examples)
example_data.size()

### Imbalance 
Implemented "Class-Balanced Loss Based on Effective Number of Samples", Cui et al. (2019). This is a nice weighting scheme based on the marginal utility of data for each class, which will be used to bias the loss functions. The calculated weights must go to the GPU.
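For reference, this is the weighting as I read it from the paper and from the reweight function in the next cell. With $n_c$ training samples in class $c$, $C$ classes, and $\beta \in [0, 1)$, the effective number of samples and the normalized class weight are:

$$
E_{n_c} = \frac{1-\beta^{n_c}}{1-\beta}, \qquad w_c = \frac{1/E_{n_c}}{\frac{1}{C}\sum_{j=1}^{C} 1/E_{n_j}}
$$

The normalization makes the weights sum to $C$, so a perfectly balanced dataset gives $w_c = 1$ for every class.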

In [None]:
def reweight(cls_num_list, beta=0.999):
    """
    Takes a list of frequencies of all classes and returns a list of weights 
    per class based on the class-balanced loss (effective number of samples) 
    
    Args:
        cls_num_list: # of samples of each class in training 
        beta: hyperparameter, usually (N-1)/N 

    Returns:
        per_cls_weights: a list of weights, one per class, normalized to sum to the number of classes 
    """
    per_cls_weights = None
    result = []
    normed = []
    C = len(cls_num_list)
    for i in range(C):
        result.append((1-beta)/(1-beta**cls_num_list[i]))
    normalize = np.sum(result) * (1/C)
    for i in range(C):
        normed.append(result[i]/normalize)
    per_cls_weights = normed

    return per_cls_weights

In [None]:
# class frequency
count_age = list(dict(Counter(train_ages)).items())
sorted_age = [y for (x,y) in sorted(count_age, key=lambda x: x[0], reverse=False)]
count_gens = list(dict(Counter(train_gens)).items())
sorted_gens = [y for (x,y) in sorted(count_gens, key=lambda x: x[0], reverse=False)]

# weightings 
per_cls_weights_age = torch.Tensor(reweight(sorted_age)).to(device)
per_cls_weights_gender = torch.Tensor(reweight(sorted_gens)).to(device)

For this case we have imbalance, but it isn't dramatic, so the weightings are just slight nudges. 
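Why the nudges are slight can be checked with a few numbers. The class counts below are hypothetical, not the counts from this dataset, and class_balanced_weights is a plain-Python restatement of the same formula so the cell runs on its own.

```python
def class_balanced_weights(counts, beta=0.999):
    """Class-balanced weights, normalized to sum to the number of classes."""
    raw = [(1 - beta) / (1 - beta ** n) for n in counts]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]

# hypothetical class counts: mild versus severe imbalance
mild = class_balanced_weights([6000, 4000, 3000])
severe = class_balanced_weights([6000, 400, 30])

print([round(w, 3) for w in mild])
print([round(w, 3) for w in severe])
```

With beta=0.999 the effective number of samples saturates near 1/(1-beta) = 1000, so classes with several thousand samples each end up with weights close to 1. Only classes well below that size get a large boost.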

In [None]:
per_cls_weights_age

In [None]:
per_cls_weights_gender

### Loss Functions 
One each, with the weighting schemes included. 

In [None]:
# criteria
criterion_age = nn.CrossEntropyLoss(weight=per_cls_weights_age)
criterion_gender = nn.CrossEntropyLoss(weight=per_cls_weights_gender)

### Optimizer

In [None]:
learning_rate = 0.0007
weight_decay = 0.0005
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate,weight_decay=weight_decay) 

### Training Loop 

In [None]:
tr_accs = []    
tr_losses = []
val_accs = []
val_losses = []

epochs = 10
for epoch in list(range(epochs)):
    tr_acc, tr_loss = train(epoch, train_loader, model, optimizer, criterion_age,criterion_gender)
    val_acc, val_loss = validate(epoch, dev_loader, model, criterion_age,criterion_gender)
    tr_accs.append(tr_acc)
    tr_losses.append(tr_loss)
    val_accs.append(val_acc)
    val_losses.append(val_loss)

### Comments
This isn't the best model, just an early one. It will run in a few hours on a CPU with the 20,000+ datapoints I used to train, but as it needs to be tuned, the CPU becomes impractical. With 137,000,000 parameters, tuning over a modest 64 hyperparameter combinations takes approximately 6 hours on a single GPU. 
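As a rough sketch of what 64 combinations can look like, the grid below crosses three hyperparameters with four values each. The values are hypothetical, not the grid I actually searched. Only the names learning_rate, weight_decay and gen_weight come from the cells above.

```python
from itertools import product

# hypothetical search values, four per hyperparameter
learning_rates = [1e-4, 3e-4, 7e-4, 1e-3]
weight_decays = [0.0, 1e-4, 5e-4, 1e-3]
gen_weights = [0.3, 0.4, 0.5, 0.6]

grid = list(product(learning_rates, weight_decays, gen_weights))

# 6 hours of GPU time spread evenly over the grid
minutes_per_run = 6 * 60 / len(grid)

print(len(grid), minutes_per_run)
```

At roughly 6 hours for the whole grid, each combination gets a little under 6 minutes of GPU time.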