Before you turn this assignment in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says "YOUR ANSWER HERE" or `YOUR CODE HERE` and remove the `raise NotImplementedError()` lines. 

Code blocks starting with a `# tests` comment provide unit tests which have to run without errors in order to get full points. Be aware that there might be further 'secret' tests to check correct implementation! I.e. the provided unit tests are necessary but not sufficient for full points!

You are always welcome to add **additional plots, tests, or debug outputs**.
However, make sure to: **1) not break the automated tests**, and **2) switch off any excessive debug output** when you submit your notebook!
Note that there is a **hard limit of 720s for execution time** of individual cells when autograding!

Please add your name and student ID below:

In [1]:
NAME = "Teodor Chakarov" 
STUDENT_ID = "12141198" 


In [2]:
assert len(NAME) > 0, "Enter your name!"
assert len(STUDENT_ID) > 0, "Enter your student ID!"

# Intelligent Audio and Music Analysis Assignment 5

This assignment accounts for 60 points of assignment block B (100 points total)

## Drum Transcription with Convolutional Neural Network

In this assignment we will implement a convolutional neural network (CNN) approach for automatic drum transcription on three-instrument solo drum tracks.
While the implemented method is simple and the dataset is synthetic, the training setup shows that the method is able to generalize.

### GPU Support
Unfortunately, our JupyterHub does not yet provide GPU support. This assignment can nevertheless be run as-is on JupyterHub, but training the neural network will take around 2-3 hours.

To speed up training, you can run this notebook on any local machine with a GPU and CUDA support, or alternatively use [Google Colab](https://colab.research.google.com/) and Google Drive, if you have a Google account.

Simply copy/paste the solved cells into the notebook on JupyterHub. Also add the output model file to the assignment directory on JupyterHub for your submission. 

Make sure the notebook runs using the trained model, before you submit!

In [3]:
# DO NOT COPY OR MODIFY THIS CELL!!

import os
import traceback
# This code block enables this notebook to run on google colab.
try:
    from google.colab import drive
    print('Running in colab...\n===================')
    COLAB = True
    # drive_home = '/content/drive'
    # drive.mount(drive_home)
    # print('Drive connected!\n================')
    !pip install madmom # torch==1.4.0 torchvision==0.5.0 --upgrade
    print('Installed dependencies!\n=======================')
    print('Downloading data...\n===================')
    if not os.path.exists('dataset.npz'):
        !wget -O dataset.npz "https://www.ifs.tuwien.ac.at/~vogl/other/dataset.npz" --no-check-certificate
    print('===================\nMake sure you activated GPU support: Edit->Notebook settings->Hardware acceleration->GPU\n==================')
except:
    print('=======================\nNOT running in colab...\n=======================')
    COLAB = False

NOT running in colab...


## Data

As in the previous assignments, we will use a dataset stored in numpy arrays. 

The dataset consists of four drum tracks synthesized with four different drum kits.
The dataset can be considered a *toy dataset* since it lacks many of the challenges of real-world datasets:
annotations are perfect, it is consistent in sound quality, there is only little variation in sounds and playing styles.
Nevertheless, working with this kind of data allows quick and simple evaluation if an approach can work at all.

Let's load the data from the numpy archives:

In [4]:
# DO NOT COPY OR MODIFY THIS CELL!!

import numpy as np
print(np.version.version)
from time import time as get_time

dataset_path = os.path.join(os.environ['HOME'], 'shared', '194.039-2023W', 'data', 'assignment_5')
if os.path.exists('dataset.npz'):
    dataset_path = '.'

audio_sample_rate = 44100
spec_frame_rate = 100

dataset = np.load(os.path.join(dataset_path, 'dataset.npz'), allow_pickle=True, mmap_mode='r')

# The dataset is pre-split into train, valid, and test set:
# For training, tracks 0 and 1 with drum kit 0 and 1 are used (4 tracks)
# For validation, track 2 with kit 2 is used (1 track)
# For testing, track 3 with kit 3 is used (1 track)
# shorthands for the data:
train_feat = dataset['train_feat']
train_targets = dataset['train_targets']
print("train tracks: {}".format(len(train_feat)))

valid_feat = dataset['valid_feat']
valid_targets = dataset['valid_targets']
print("valid tracks: {}".format(len(valid_feat)))

test_feat = dataset['test_feat']
test_targets = dataset['test_targets']
test_annot = dataset['test_annot']
print("test tracks: {}".format(len(test_feat)))

spec_freq_bins = train_feat[0].shape[1]
num_insts = train_targets[0].shape[1]

print(train_feat[0].shape)

1.24.4
train tracks: 4
valid tracks: 1
test tracks: 1
(24840, 79)


In [5]:
drums_example_idx = 0

In [6]:
# DO NOT COPY OR MODIFY THIS CELL!!

# get annotations from onset data archive:
print(dataset['train_audio'].shape)
example_drums = dataset['train_audio'][drums_example_idx]
example_drums_annotations = dataset['train_annot'][drums_example_idx]
example_drums_spec = train_feat[drums_example_idx]


# we can also play the audio in this jupyter notebook
import IPython.display as ipd
ipd.Audio(example_drums[:audio_sample_rate*10], rate=audio_sample_rate)

(4,)


## Task 1: Neural Network Class (15 Points)

First, we need to build a neural network. So that we do not have to worry about differentiation and gradient calculation, we will use an automatic differentiation framework. Popular choices in the context of neural networks are [tensorflow](https://www.tensorflow.org/) and [pytorch](https://pytorch.org/). Since the learning curve is a bit steeper for tensorflow, we will go with pytorch for this example.

### 1.1 Network Architecture

First, let's define a network class using pytorch classes.
In the constructor (`__init__` method) we first define the individual components for the network. The components have to be stored as fields (`self.<fieldname>`) in order to be accessible to pytorch for performing automatic differentiation.

The architecture we will use is the same as the one presented in the lecture (CNN architecture, slide 41). It consists of two convolutional blocks which contain two convolutional layers each, followed by two dense output layers.

The individual layers we need for the architecture are:
1. **2D Convolution**, channels (in/out): 1/32, kernel: 3x3
2. **2D Batchnorm**, channels: 32
3. **2D Convolution**, channels: 32/32, kernel: 3x3
4. **2D Batchnorm**, channels: 32
5. **2D Dropout** 
6. **2D Convolution**, channels (in/out): 32/64, kernel: 3x3
7. **2D Batchnorm**, channels: 64
8. **2D Convolution**, channels: 64/64, kernel: 3x3
9. **2D Batchnorm**, channels: 64
10. **2D Dropout**
11. **Dense layer**, in: 64*7, out: 50
12. **1D Batchnorm**, channels: 50 
13. **1D Dropout**
14. **Dense layer**, in: 50, out: 3

**Note:** The input size of the first dense layer (64\*1\*7) results from the 64 channels of the previous convolutions, the reduced size in the context dimension (1), and the reduced size in the feature dimension (7) -- cf. the dimensions in task 2.2.

**Implement the constructor**, adding the necessary layers as fields (`self.<fieldname> = ...` ) to the class.
Use the pytorch classes: `nn.Conv2d`, `nn.BatchNorm2d`, `nn.Dropout2d`, `nn.Linear`, `nn.BatchNorm1d`, and `nn.Dropout`.

**Note:** You do not have to worry about functional elements without tuneable parameters like activation functions, pooling, etc. for now - these will be added in the forward function.

In [7]:
# install torch, if not already available
!pip install torch



In [8]:
# DO NOT COPY OR MODIFY THIS CELL!!

# import pytorch stuff
import torch
import torch.nn as nn
import torch.nn.functional as torch_func

if not COLAB:
    torch.set_num_threads(16)  # we have to set the number of threads < 64, this is a workaround

In [9]:
class ConvolutionalNet(nn.Module):
    def __init__(self, dropout_p=0.5, debug = False):
        super(ConvolutionalNet, self).__init__()
        self.debug = debug 
        kernel_2d = (3, 3)

        # First Convolutional Block
        self.conv1 = nn.Conv2d(1, 32, kernel_size=kernel_2d)
        self.conv1_bn = nn.BatchNorm2d(num_features=32)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=kernel_2d)
        self.conv2_bn = nn.BatchNorm2d(num_features=32)
        #self.maxpool1 = nn.MaxPool2d((3,3))
        self.conv2_do = nn.Dropout2d(p=dropout_p)

        # Second Convolutional Block
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=kernel_2d)
        self.conv3_bn = nn.BatchNorm2d(num_features=64)
        self.conv4 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=kernel_2d)
        self.conv4_bn = nn.BatchNorm2d(num_features=64)
        #self.maxpool2 = nn.MaxPool2d((3,3))
        self.conv4_do = nn.Dropout2d(p=dropout_p)

        # Dense Layers
        self.fc1_input_size = 64 * 17 * 71
        self.fc1 = nn.Linear(in_features=self.fc1_input_size, out_features=50)
        self.bn1 = nn.BatchNorm1d(num_features=50)
        self.dropout1 = nn.Dropout(p=dropout_p)
        self.fc2 = nn.Linear(in_features=50, out_features=3)

    def forward(self, x):
        # This block belongs to task 2.2
        # This function calculates a forward pass through the network (i.e. calculates the output
        # for given input x). Hand x through the layers of the network and calculate the output.
        # Don't forget to apply the nonlinearities (activation functions). Use the ReLU activation
        # function ( torch_func.relu() ) except for the output of the network, where we need
        # sigmoid activations ( torch.sigmoid() ) to obtain values between 0 and 1.

        x.unsqueeze_(1)  # we first need to add a dimension for convolution channels...
        # print the shape of x (if debug output is turned on):
        if self.debug:
            print('shapes for x:')
            print('(batch, channel, frames, features)')
            print(tuple(x.shape))

        # Now implement the individual processing steps for the network's forward function.
        # Add a debug output for the shape of h, after each layer step.
        # As an example, here is the first layer:
        h = self.conv1(x) # apply convolution
        h = self.conv1_bn(h) # apply batch norm
        h = torch_func.relu(h)  # apply activation function (rectified linear)
        if self.debug: # debug output
            print(tuple(h.shape))

        # Implement application of the remaining convolutional layers.
        # Apply max-pooling (torch_func.max_pool2d) with 3x3 kernels after the 2nd and 4th
        # convolutional layers!
        # Remember to apply the activation function after each layer (torch_func.relu), always
        # as the very last operation (after batch-norm, max-pooling, and dropout, if applicable).

        # YOUR CODE HERE
        h = self.conv2(h)
        h = self.conv2_bn(h) 
        h = torch_func.relu(h)
        if self.debug: # debug output
            print(tuple(h.shape))
        h = self.conv2_do(h)

        h = self.conv3(h)  # Use the output from the previous layer as input
        h = self.conv3_bn(h)
        h = torch_func.relu(h)
        if self.debug:
            print(tuple(h.shape))

        h = self.conv4(h)  # Use the output from the previous layer as input
        h = self.conv4_bn(h)
        h = torch_func.relu(h)
        if self.debug:
            print(tuple(h.shape))
        h = self.conv4_do(h)


        # Between the convolutional layers and the dense output layers, we have to flatten the
        # tensor to only retain dimensions for (batch, width). This can be achieved using `Tensor.view`.
        # If you provide -1 as the size for a dimension, `view` will infer its size so that the
        # number of elements stays the same. Use this for the batch dimension. The remaining width is:
        # num_channels*num_frames*num_features
        # which should be:
        # 64*1*7
        # If the sizes are different at this point, something is wrong with your convolutional layers.
        # After flattening the tensor, proceed with feeding the data through the two dense layers.
        # Do not forget to apply the activation functions: relu for the second-to-last and sigmoid
        # for the last layer!

        # YOUR CODE HERE
        # Flatten the output for the dense layers
        h = h.view(h.size(0), -1)
        # Dense Layers
        h = self.fc1(h)
        h = self.bn1(h)
        h = torch_func.relu(h)
        h = self.dropout1(h)
        h = self.fc2(h)

        # Sigmoid output: one independent activation in [0, 1] per drum instrument (multi-label)
        y = torch.sigmoid(h)

        return y

In [10]:
# tests # DO NOT COPY OR MODIFY THIS CELL!!

net = ConvolutionalNet()
module_list = net.__dict__['_modules'].values()

num_modules = len(module_list)
assert num_modules == 14, "expected 14 modules!"

num_conv2d = 0
num_bn2d = 0
num_do2d = 0
num_lin = 0
num_do1d = 0
# count modules...
print("Found modules:")
for idx, module in enumerate(module_list):
    print("{}:\t{}".format(idx+1, module))
    if isinstance(module, nn.Conv2d):
        num_conv2d += 1
    if isinstance(module, nn.BatchNorm2d):
        num_bn2d += 1
    if isinstance(module, nn.Dropout2d):
        num_do2d += 1
    if isinstance(module, nn.Linear):
        num_lin += 1
    if isinstance(module, nn.Dropout):
        num_do1d += 1
print('-----------')

assert num_conv2d == 4, "expected 4 2D convolutional layers!"
assert num_bn2d == 4, "expected 4 2D batchnorm layers!"
assert num_do2d == 2, "expected 2 2D dropout layers!"
assert num_lin == 2, "expected 2 dense layers!"
assert num_do1d == 1, "expected 1 dropout layer!"


print('All tests successful!')

Found modules:
1:	Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
2:	BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
3:	Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
4:	BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
5:	Dropout2d(p=0.5, inplace=False)
6:	Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
7:	BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
8:	Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
9:	BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
10:	Dropout2d(p=0.5, inplace=False)
11:	Linear(in_features=77248, out_features=50, bias=True)
12:	BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
13:	Dropout(p=0.5, inplace=False)
14:	Linear(in_features=50, out_features=3, bias=True)
-----------
All tests successful!


### 1.2 Network Forward Function

Next we want to define the forward function of the neural network.
In the forward function we apply each layer sequentially to the input. 
In addition to the modules we defined in the constructor, we have to apply activation functions, pooling functions, and any required dimensionality transformations.

A side effect of the convolutional architecture is that it reduces the input spectrogram snippet's shape `(context, spec_freq_bins)` (in our case: (25, 79) ) down to our desired output shape (1, 3) -- the value for the current frame for our three activation functions (one for each drum instrument). This is achieved by using a combination of convolutions without padding and max pooling.
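
The shape reduction described above can be traced by hand. The following is a minimal sketch, assuming stride-1 convolutions without padding and max pooling whose stride equals its kernel size (which is PyTorch's default when no stride is given). The helper names `conv_out`, `pool_out`, and `trace_shapes` are hypothetical and not part of the assignment code.

```python
def conv_out(size, kernel=3):
    """Output length of an unpadded, stride-1 convolution along one axis."""
    return size - kernel + 1


def pool_out(size, kernel=3):
    """Output length of max pooling whose stride equals the kernel size."""
    return size // kernel


def trace_shapes(frames=25, bins=79):
    """Trace (frames, bins) through conv, conv, pool, conv, conv, pool."""
    shape = (frames, bins)
    trace = [shape]
    for op in (conv_out, conv_out, pool_out, conv_out, conv_out, pool_out):
        shape = (op(shape[0]), op(shape[1]))
        trace.append(shape)
    return trace


for step in trace_shapes():
    print(step)
```

Under these assumptions the last entry is `(1, 7)`, which together with the 64 channels gives the 64\*1\*7 inputs of the first dense layer.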

When working with neural networks implemented using matrices, we have to take care of additional dimensions that change the input data matrices into n-dimensional tensors:
We start with an input matrix with shape `(context, spec_freq_bins)`, i.e. `(time, features)`.
For the convolutional layers, we need an additional dimension for channels. 
This dimension has to be added manually, we do this in our `forward` function.
Additionally, the training framework will add one more dimension for the individual training examples within one batch of training examples. This dimension is the very first dimension and must stay unchanged for the whole forward function. This means that for our tensors in the `forward` function we end up with the following dimensions: 
`(batch, channel, frames, features)`
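
As an illustration of these dimensions, here is a small sketch that uses NumPy as a stand-in for pytorch tensors (an assumption made only to keep the example lightweight). `np.expand_dims(..., axis=1)` plays the role that `unsqueeze(1)` plays in the `forward` function, and the batch size of 64 is just an example value.

```python
import numpy as np

context, spec_freq_bins, batch_size = 25, 79, 64

# a single spectrogram snippet: (frames, features)
snippet = np.zeros((context, spec_freq_bins))

# the data loader stacks snippets into a batch: (batch, frames, features)
batch = np.stack([snippet] * batch_size)

# the channel dimension is added manually: (batch, channel, frames, features)
with_channel = np.expand_dims(batch, axis=1)

print(batch.shape)
print(with_channel.shape)
```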

To keep track of the data dimensions in the forward function, add debug outputs for the shape of the tensor after the application of each layer.

**Implement the `forward` function** in the `ConvolutionalNet` class above.

In [11]:
# tests # DO NOT COPY OR MODIFY THIS CELL!!

net = ConvolutionalNet(debug=True)
net.eval()
out = net.forward(torch.rand(1, 25, spec_freq_bins))
assert out.shape == (1, 3), 'output shape did not match! check layer output shapes...'


print('All tests successful!')

shapes for x:
(batch, channel, frames, features)
(1, 1, 25, 79)
(1, 32, 23, 77)
(1, 32, 21, 75)
(1, 64, 19, 73)
(1, 64, 17, 71)
All tests successful!


## Task 2: Training Dataset (15 Points)

In order to use the audio spectrograms as training data for the neural network, we have to bring them into the right shape and form.
The class that handles data preparation in pytorch is to be derived from `torch.utils.data.Dataset` and must implement the `__len__` and `__getitem__` methods.

### 2.1 Number of Examples in Training Data
In order to implement `__len__`, we must calculate how many training examples (i.e. spectrogram snippets) we can extract from our training data.
For this, we must check each spectrogram and calculate how many snippets with shape `(context, features)` we can get out of the spectrogram when using a hop size of `hop_size`.
The number of snippets for each spectrogram is stored in the list `snip_cnt`, which is returned.
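
The counting rule for a single spectrogram can be sketched as follows. This is only an illustration: `count_snippets` is a hypothetical helper, and the example lengths are the ones used in the unit test of this task.

```python
def count_snippets(num_frames, context, hop_size):
    """Number of windows of length `context` that fit into `num_frames`
    when consecutive windows start `hop_size` frames apart."""
    if num_frames < context:
        return 0  # spectrogram is too short for even one snippet
    return 1 + (num_frames - context) // hop_size


lengths = [25, 26, 10, 0, 200, 125]
counts = [count_snippets(n, context=25, hop_size=10) for n in lengths]
print(counts)
```

For example, a spectrogram with 200 frames yields `1 + (200 - 25) // 10 = 18` snippets.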

**Implement `calc_num_samples`** below.

In [12]:
def calc_num_samples(feat_list, context, hop_size):
    """
    Calculate overall number of snippets we can get from our data
    
    :param feat_list: list, list with spectrograms (np.ndarray) for individual tracks
    :param context: int, length of context for CNN - time axis, first dimension of spectrograms!
    :param hop_size: int, hop size between two consecutive spectrogram snippets. 
                     This is usually 1, especially for testing and inference. 
                     For very large datasets it can make sense to increase it, to speed up training.
    """
    snip_cnt = []
    
    # iterate over spectrograms in feat_list and calculate how many snippets we can extract.
    # append number of samples with size context, separated by hop_size, to snip_cnt list
    
    # YOUR CODE HERE
    for spectrogram in feat_list:
        # If the number of time frames in the spectrogram is less than the context size,
        # we cannot extract any snippets, so we add 0 to the snip_cnt list.
        if spectrogram.shape[0] < context:
            snip_cnt.append(0)
        else:
            # Otherwise, one snippet starts at frame 0 and a further one at every
            # multiple of hop_size that still leaves room for a full context window.
            num_snippets = 1 + (spectrogram.shape[0] - context) // hop_size
            snip_cnt.append(num_snippets)

    return snip_cnt

In [13]:
# tests # DO NOT COPY OR MODIFY THIS CELL!!

spec_sizes = [25, 26, 10, 0, 200, 125]
expected_snip_cnt = [1, 1, 0, 0, 18, 11]
result = calc_num_samples([np.ones((frames, 10)) for frames in spec_sizes], 25, 10)
print(result)
assert len(result) == len(spec_sizes), 'return one snippet count per entry in feat_list!'
for idx, exp in enumerate(expected_snip_cnt):
    assert result[idx] == exp, 'incorrect number of snippets at {}, expected: {}, returned {} !'.format(idx, exp, result[idx])
assert np.sum(result) == np.sum(expected_snip_cnt), 'incorrect total number of snippets returned!'

print('All tests successful!')

[1, 1, 0, 0, 18, 11]
All tests successful!


### 2.2 Extracting Training Examples from Dataset
Once we know how many training examples we can extract from each spectrogram in our dataset, we can implement the `__getitem__` method.
The parameter passed to `__getitem__` indicates which element of the dataset should be returned, as an index between zero and `__len__() - 1`.
In order to find the correct spectrogram and frames for the requested training example, the following approach should be implemented:
1. Iterate over the values in `self.snip_cnt` and sum them
2. The value for which the sum grows larger than `index` represents the spectrogram in which the training example can be found
3. The frame number in the spectrogram is `index` minus the sum before the current value, multiplied by `hop_size`

Once we know which spectrogram number, and which frame number within the spectrogram represents the start of the training example, we can return both the example and the target for the requested example.
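
The lookup can be sketched in isolation as shown below. Note that this sketch uses a cumulative sum with binary search (`bisect_right`) instead of the explicit loop described above; both locate the same track. `locate` is a hypothetical helper, and the snippet counts are taken from the unit test of this task.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(index, snip_cnt, hop_size=1):
    """Map a global example index to (track index, start frame in that track)."""
    cumulative = list(accumulate(snip_cnt))
    total = cumulative[-1] if cumulative else 0
    if not 0 <= index < total:
        raise IndexError('example index out of range')
    # first track whose cumulative count is larger than index (skips empty tracks)
    track = bisect_right(cumulative, index)
    before = cumulative[track - 1] if track > 0 else 0
    return track, (index - before) * hop_size


snip_cnt = [11, 0, 0, 6]
print(locate(0, snip_cnt))
print(locate(11, snip_cnt))
print(locate(12, snip_cnt))
```

With a context of 25 frames, the target frame is then the start frame plus `int(np.ceil(25 / 2) - 1) = 12`, i.e. the middle of the snippet.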

**Implement the `__getitem__` method** in our `Dataset` class below.

In [14]:
from torch.utils.data import Dataset as Dataset

# class which formats the spectrogram data in the way needed for convolutional neural network training
class DrumSet(Dataset):
    def __init__(self, feat_list, targ_list, context, hop_size):
        """
        Create spectrogram based drum dataset for CNN training
        :param feat_list: list, list with spectrograms (np.ndarray) for individual tracks
        :param targ_list: list, list with targets (np.ndarray) for individual tracks
        :param context: int, length of context for CNN - time axis, first dimension of spectrograms!
        :param hop_size: int, hop size between two consecutive spectrogram snippets. 
                         This is usually 1, especially for testing and inference. 
                         For very large datasets it can make sense to increase it, to speed up training.
        """
        super(DrumSet, self).__init__()
        self.features = feat_list
        self.targets = targ_list
        self.context = context
        self.hop_size = hop_size
        
        self.targ_offset = int(np.ceil(self.context/2)-1)  # target should be in the middle of our training context!
        
        # list with snippet count per track
        self.snip_cnt = calc_num_samples(feat_list, context, hop_size)
        # total number of training examples
        self.length = np.sum(self.snip_cnt)
        
    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # find track which contains example with requested index
        count_sum = 0
        track_idx = 0  # Initialize track_idx
        
        # Iterate over the snip_cnt to find the correct spectrogram
        for i, cnt in enumerate(self.snip_cnt):
            if count_sum + cnt > index:
                track_idx = i
                break
            count_sum += cnt
        
        # Calculate the position within the spectrogram
        position = (index - count_sum) * self.hop_size
        
        # Extract the training example and target
        sample_start = position
        sample_end = position + self.context
        target_pos = sample_start + self.targ_offset

        sample = self.features[track_idx][sample_start:sample_end, :]
        target = self.targets[track_idx][target_pos, :]

        # Make sure the sample has the expected 2-D shape (context, features);
        # for a 2-D spectrogram slice this reshape does not change anything.
        sample = sample.reshape(self.context, -1)  # (context, features)

        # Target shape adjustment if needed
        target = target.reshape(-1)  # (num_insts, )

        # Convert to PyTorch tensors
        return torch.from_numpy(sample).float(), torch.from_numpy(target).float()

In [15]:
# tests # DO NOT COPY OR MODIFY THIS CELL!!

# create some test specs and targets
spec_lens = [35, 0, 24, 30]
spec_lens_sum = np.cumsum([0]+spec_lens)
feats = [spec_lens_sum[idx]+1*
         np.repeat(np.arange(spec_lens[idx])[:, None]+1, spec_freq_bins, axis=1)-1 
         for idx in range(len(spec_lens))]
targs = [feat[:, :num_insts] for feat in feats]

# create dataset for tests
context = 25
testset = DrumSet(feats, targs, context, 1)

# check example counts again...
snip_cnts = testset.snip_cnt
expected_snip_cnts = [11, 0, 0, 6]
assert len(snip_cnts) == len(spec_lens), 'return one snippet count per entry in feat_list!'
for idx, exp in enumerate(expected_snip_cnts):
    assert snip_cnts[idx] == exp, \
    'incorrect number of snippets at {}, expected: {}, returned {} !'.format(idx, exp, snip_cnts[idx])
assert np.sum(snip_cnts) == np.sum(expected_snip_cnts), 'incorrect total number of snippets returned!'

# check indices of targets and features for three examples...
test_idices = [0, 12, 3]
exp_targs = [12, 72, 15]
exp_feats = [(0, 24), (60, 84), (3, 27)]
for test_idx, exp_targ, exp_feat in zip(test_idices, exp_targs, exp_feats):
    cur_feat, cur_targ = testset[test_idx]
    assert cur_feat.shape == (context, spec_freq_bins), \
    'returned feature snipped has wrong shape: {}'.format(cur_feat.shape)
    assert cur_targ.shape == (num_insts, ),  \
    'returned target has wrong shape: {}'.format(cur_targ.shape)
    assert cur_targ[0] == exp_targ,  \
    'target position was of: expected: {}, was: {}'.format(exp_targ, cur_targ[0])
    assert cur_feat[0, 0] == exp_feat[0],  \
    'feat start pos was of: expected: {}, was: {}'.format(exp_feat[0], cur_feat[0, 0])
    assert cur_feat[-1, 0] == exp_feat[1],  \
    'feat end pos was of: expected: {}, was: {}'.format(exp_feat[1], cur_feat[-1, 0])

print('All tests successful!')

All tests successful!


## Task 3: Functions for Training, Testing, and Inference

In the next code cells, we will implement functions that apply the network on data in the context of training, testing, and inference.

Since the code mainly consists of pytorch boilerplate code and deep learning mechanics, the functions are provided. Read through them and try to understand what is happening.


In [16]:
def train_epoch_cnn(model, train_loader, optimizer, args):
    """
    Training loop for one epoch of NN training.
    Within one epoch, all the data is used once, we use mini-batch gradient descent.
    :param model: The model to be trained
    :param train_loader: Data provider
    :param optimizer: Optimizer (Gradient descent update algorithm)
    :param args: NN parameters for training and inference
    :return:
    """
    model.train()  # set model to training mode (activate dropout layers for example)
    t = get_time() # we measure the needed time
    for batch_idx, (data, target) in enumerate(train_loader):  # iterate over training data
        data, target = data.to(args.device), target.to(args.device)  # move data to device (GPU) if necessary
        optimizer.zero_grad()  # reset the gradients accumulated in the previous step
        output = model(data)   # forward pass: calculate output of network for input
        loss = torch_func.binary_cross_entropy(output, target)  # calculate loss
        loss.backward()  # backward pass: calculate gradients using automatic diff. and backprop.
        optimizer.step()  # update parameters of network using our optimizer
        cur_time = get_time()
        # print some outputs if we reached our logging interval
        if cur_time - t > args.log_interval or batch_idx == len(train_loader)-1:  
            print('[{}/{} ({:.0f}%)]\tloss: {:.6f}, took {:.2f}s'.format(
                       batch_idx * len(data), len(train_loader.dataset),
                       100. * batch_idx / len(train_loader), loss.item(),cur_time - t))
            t = cur_time


def test_cnn(model, test_loader, args):
    """
    Function which iterates over evaluation data (validation or test set) without performing updates and calculates the loss.
    :param model: The model to be tested
    :param test_loader: Data provider
    :param args: NN parameters for training and inference
    :return: cumulative test loss
    """
    model.eval()  # set model to inference mode (deactivate dropout layers for example)
    test_loss = 0  # init overall loss
    with torch.no_grad():  # do not calculate gradients since we do not want to do updates
        for data, target in test_loader:  # iterate over test data
            data, target = data.to(args.device), target.to(args.device)  # move data to device 
            output = model(data) # forward pass
            # calculate loss and add it to our cumulative loss
            test_loss += torch_func.binary_cross_entropy(output, target, reduction='sum').item()
    test_loss /= len(test_loader.dataset)  # calc mean loss
    print('Average eval loss: {:.4f}\n'.format(test_loss))
    return test_loss


def inference_cnn(model, data, args):
    """
    Function calculating the actual output of the network, given some input.
    :param model: The network to be used
    :param data: Data for which the output should be calculated
    :param args: NN parameters for training and inference
    :return: output of network
    """
    model.eval()   # set model to inference mode
    # reserve output memory
    in_shape = data.shape
    side = int((args.context - 1) / 2)
    outlen = in_shape[0] - 2 * side
    output = np.zeros((in_shape[0], args.out_num))
    data = torch.from_numpy(data[None, :, :]) 
    data = data.to(args.device) # move input to device
    with torch.no_grad(): # do not calculate gradients
        for idx in range(outlen): # iterate over input data
            # calculate output for input data (and move back from device)
            output[idx+side, :] = model(data[:, idx:(idx + args.context), :])[0, :].cpu()
    return output

## Task 4: Network Training (15 Points)

Now we have all components to actually train the network: we have implemented a network, a dataset class, and a function that updates the network parameters during training.

First some predefined hyperparameters. Note that finding these parameters usually takes a lot of time since it involves trying many different parameter combinations and running a full training for each.


In [17]:
# DO NOT COPY OR MODIFY THIS CELL!!

CNN_MODEL_NAME = 'cnn_adt_'

# Helper class for neural network hyper-parameters
# Do not change them until you have a working model
class Args:
    pass

DEFAULT_ARGS = Args()
# general params
DEFAULT_ARGS.use_cuda = True
DEFAULT_ARGS.seed = 1

# architecture setup
DEFAULT_ARGS.batch_size = 64
DEFAULT_ARGS.context = 25
DEFAULT_ARGS.step_size = 1
DEFAULT_ARGS.out_num = num_insts

# optimizer parameters
DEFAULT_ARGS.lr = 0.01
DEFAULT_ARGS.momentum = 0.5

# training protocol
DEFAULT_ARGS.max_epochs = 1000
DEFAULT_ARGS.patience = 4
DEFAULT_ARGS.log_interval = 10 # seconds

### 4.1 Main Training Function

In the next function, implement the main training loop for the drum transcription model.
For training we will use an early stopping training protocol with patience, i.e. we will perform updates of the model using the `train_epoch_cnn` function and check the loss on the validation dataset using `test_cnn`. As long as the loss on the validation dataset decreases, we continue training. 
As soon as the validation loss has not decreased for `patience` epochs in a row, we will stop training.

Additionally, we will write the network model to a file whenever a better validation loss is achieved.
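
The stopping rule itself can be sketched independently of pytorch. The following is a minimal illustration that operates on a plain list of validation losses. The function name `early_stopping_epoch` is hypothetical, and the loss values are made up purely for demonstration; they are not results from this assignment.

```python
def early_stopping_epoch(valid_losses, patience):
    """Simulate early stopping with patience on a sequence of validation losses.

    Returns (best_epoch, best_loss, last_epoch_run).
    """
    best_loss = float('inf')
    best_epoch = None
    remaining = patience
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            # new best model: this is where the model file would be written
            best_loss, best_epoch = loss, epoch
            remaining = patience  # reset patience
        else:
            remaining -= 1
            if remaining == 0:
                return best_epoch, best_loss, epoch  # patience used up, stop
    return best_epoch, best_loss, len(valid_losses) - 1


# made-up validation losses, for illustration only
losses = [0.9, 0.7, 0.72, 0.71, 0.69, 0.70, 0.71, 0.72, 0.73, 0.5]
print(early_stopping_epoch(losses, patience=4))
```

In this example training stops after epoch 8, four epochs after the best loss at epoch 4, so the later value of 0.5 is never reached. This illustrates the trade-off that the `patience` parameter controls.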

**Implement the missing code parts** in the code cell below.

In [18]:
def train_cnn(smoke_test=False, load_model=False, args=DEFAULT_ARGS):
    """
    Run CNN training using the datasets.
    :param smoke_test: bool, run a quick pseudo training to check if everything works
    :param load_model: bool or string, load the last trained model (if set to True), or the model in file (string)
    :param args: hyperparameters for training
    :return: trained model
    """
    if smoke_test:  # set hyperparameters to run a quick pseudo training...
        step_size = 100
        max_epochs = 2
    else:
        step_size = args.step_size
        max_epochs = args.max_epochs
    

    # setup pytorch
    use_cuda = args.use_cuda and torch.cuda.is_available()
    torch.manual_seed(args.seed)
    args.device = torch.device("cuda" if use_cuda else "cpu")

    # create model and optimizer, we use plain SGD with momentum
    model = ConvolutionalNet().to(args.device)
    from torch.optim import SGD
    optimizer = SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    # setup our datasets for training, evaluation and testing
    kwargs = {'num_workers': 2, 'pin_memory': True} if use_cuda else {'num_workers': 0}
    train_loader = torch.utils.data.DataLoader(DrumSet(train_feat, train_targets, args.context, step_size),
                                               batch_size=args.batch_size, shuffle=True, **kwargs)
    valid_loader = torch.utils.data.DataLoader(DrumSet(valid_feat, valid_targets, args.context, step_size),
                                               batch_size=args.batch_size, shuffle=False, **kwargs)
    test_loader = torch.utils.data.DataLoader(DrumSet(test_feat, test_targets, args.context, step_size),
                                              batch_size=args.batch_size, shuffle=False, **kwargs)

    # find model filename for last run, and next free filename for model
    model_cnt = 0
    new_model_file = os.path.join(CNN_MODEL_NAME + str(model_cnt) + '.model')
    last_model_file = None
    while os.path.exists(new_model_file):
        model_cnt += 1
        last_model_file = new_model_file
        new_model_file = os.path.join(CNN_MODEL_NAME + str(model_cnt) + '.model')
    if load_model is None or not load_model:  # let's train
        best_valid_loss = 9999.  # let's init it with something really large - loss will be << 1 usually
        cur_patience = args.patience  # we keep track of our patience here.
        print('Training CNN...')
        start_t = get_time()
        # Implement neural network training using early stopping with patience.
        # Create a loop for epoch to iterate over 1..max_epochs, in this loop perform 
        # the following steps:
        # 1. run one epoch of NN training (call train_epoch_cnn using train_loader)
        # 2. run test on validation set (call test_cnn), and store the average loss (valid_loss)
        # 3. check early stopping criterion: if valid_loss < best_valid_loss we:
        # 3.1 store the current model (torch.save(<model>, <file>))
        #     ONLY IF we are NOT running a smoke test (smoke_test)
        # 3.2 set best_valid_loss to the current validation loss
        # 3.3 reset cur_patience to the initial value (in case we previously already decreased it)
        # 4. if valid_loss did not decrease:
        # 4.1 check if we are already at the end of our patience (cur_patience <= 0), 
        #     if that is the case, print an info message and stop training (just call break)
        # 4.2 if we still have patience left, print a message and decrease cur_patience by one
        # Also consider adding some additional prints, so you know what is going on...
        
        # YOUR CODE HERE
        for epoch in range(1, max_epochs + 1):
            # 1. Run one epoch of NN training
            train_epoch_cnn(model, train_loader, optimizer, args)

            # 2. Test on validation set
            valid_loss = test_cnn(model, valid_loader, args)

            # 3. Check early stopping criterion
            if valid_loss < best_valid_loss:
                if not smoke_test:
                    # save the state dict, since the loading branch below uses load_state_dict()
                    torch.save(model.state_dict(), new_model_file)
                best_valid_loss = valid_loss
                cur_patience = args.patience
                print(f'Epoch {epoch}: New best validation loss {valid_loss}')
            else:
                # 4. Early stopping check
                if cur_patience <= 0:
                    print(f'Early stopping triggered at epoch {epoch}')
                    break
                else:
                    cur_patience -= 1
                    print(f'Epoch {epoch}: Validation loss did not improve. Patience: {cur_patience}')

        print(f'Training took: {get_time() - start_t:.2f}s for {epoch} epochs')
        print('Testing...')
        test_loss = test_cnn(model, test_loader, args)
        print(f'Test loss: {test_loss:.4f}')
        
    else:  # we should just load a model...
        if not isinstance(load_model, str) or not os.path.exists(load_model):
            load_model = last_model_file
        if load_model is None or not os.path.exists(load_model):
            print('Model file not found, unable to load...')
        else:
            model.load_state_dict(torch.load(load_model, map_location=args.device))
            print("Model file loaded: {}".format(load_model))
        
    return model
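
The patience logic of the training loop can be studied in isolation. The following minimal sketch replays it on a plain list of made-up validation losses instead of a real model; `early_stopping_epochs` is a hypothetical helper introduced only for illustration and is not part of the assignment code.

```python
def early_stopping_epochs(valid_losses, patience):
    """Replay early stopping with patience on a list of validation losses.

    Returns (stop_epoch, best_epoch, best_loss). Hypothetical helper that
    mirrors the loop in train_cnn without training a model.
    """
    best_loss = float('inf')
    best_epoch = None
    cur_patience = patience
    stop_epoch = None
    for epoch, loss in enumerate(valid_losses, start=1):
        stop_epoch = epoch
        if loss < best_loss:
            # improvement: this is where the model would be written to a file
            best_loss, best_epoch = loss, epoch
            cur_patience = patience
        elif cur_patience <= 0:
            # no improvement and no patience left: stop training
            break
        else:
            # no improvement, but still some patience left
            cur_patience -= 1
    return stop_epoch, best_epoch, best_loss


# made-up validation losses, one value per epoch
losses = [1.4, 1.1, 0.9, 0.95, 0.92, 0.91, 0.93, 0.8]
print(early_stopping_epochs(losses, patience=2))  # prints (6, 3, 0.9)
```

With this logic, training stops at the third non-improving epoch in a row for `patience=2`, and the later improvement to 0.8 is never reached. The best loss was achieved in epoch 3, which is the model that would have been saved to the file.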

Run the training loop as a *smoke test* first -- we don't want a training run that has been going for 2h to crash just because of a small typo somewhere at the end of the function.

In [19]:
# tests # DO NOT COPY OR MODIFY THIS CELL!!

model = train_cnn(smoke_test=True)  # just a quick run to check everything is working

print('All tests successful!')

Training CNN...
Average eval loss: 1.4300

Epoch 1: New best validation loss 1.4300157326023752
Average eval loss: 1.0886

Epoch 2: New best validation loss 1.0886370176222266
Training took: 9.84s for 2 epochs
Testing...
Average eval loss: 1.1409

Test loss: 1.1409
All tests successful!


### 4.2 Train Run
Now that we have made sure everything works, let's run the training for real.
Execute the next cells to start training the network. Be aware that on JupyterHub, **a training run will take 2-3h.**
As mentioned at the beginning of the notebook, you are free to run the notebook on CUDA-enabled infrastructure, e.g., Google Colab, where training runs \~10x faster (\~15 min).

**ATTENTION**: For the submitted version, we only want to load the already trained model, so you have to set the switch below (SUBMISSION) to True! Additionally, make sure no new training is started when executing the final version before submission!!

In [20]:
SUBMISSION = True  # SET THIS TO TRUE FOR SUBMISSION!!

In [21]:
# DO NOT COPY OR MODIFY THIS CELL!!
#ignore this cell...

In [22]:
# DO NOT COPY OR MODIFY THIS CELL!!

# Run training for real
# Note: This will take around 2-3h if everything works correctly.
if not SUBMISSION:
    model = train_cnn()
else:
    model = train_cnn(load_model=True)  
    # you can set a specific model file for submission, if you trained multiple models using e.g.:
    # load_model='best_cnn_adt.model'

Model file loaded: cnn_adt_0.model


## Task 5: Inference on Test Data (10 Points)
Now that we have a trained model, we want to use it to actually perform drum transcription.
To do this, we have to apply the model to a spectrogram (`inference_cnn`) and detect the peaks in the predicted activation functions (network outputs).

**Implement the `detect_drums` function** below so that it returns the detected drum instrument onsets.

In [23]:
def detect_drums(model, spec, args=DEFAULT_ARGS):
    from madmom.features.onsets import OnsetPeakPickingProcessor
    # peak picking for activation functions
    # use these parameters, they should work fine
    peak_picking = OnsetPeakPickingProcessor(threshold=0.05, smooth=0.0, pre_avg=0.01,
                                             post_avg=0.01, pre_max=0.02, post_max=0.02,
                                             combine=0.02, fps=spec_frame_rate)
    
    # use inference_cnn with the trained model to get estimates for the activation functions
    # perform peak picking (peak_picking.process) for each activation function: output[:, inst_nr]
    # create a detections array, in the form [[timestamp, instrument_nr], [timestamp, instrument_nr], ...]
    # make sure to sort the resulting detections by timestamps
    
    # YOUR CODE HERE
    activations = inference_cnn(model, spec, args)
    detections = []
    
    for inst_nr in range(args.out_num):
        onset_timestamps = peak_picking.process(activations[:, inst_nr])
        
        for timestamp in onset_timestamps:
            detections.append([timestamp, inst_nr])
    
    detections.sort(key=lambda x: x[0])
    # reshape so that the result stays 2-D even if no onsets were detected
    detections_array = np.array(detections, dtype=np.float32).reshape(-1, 2)
    
    return detections_array
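
To illustrate what happens conceptually, here is a minimal sketch of the "peak picking per instrument, then merge and sort" step on toy data. The function `pick_peaks` is a simplified stand-in written for this example (threshold plus local maximum); it is not madmom's `OnsetPeakPickingProcessor` and ignores smoothing, moving averages, and the `combine` parameter. The frame rate and activation values are assumptions.

```python
def pick_peaks(activation, threshold, fps, max_frames=2):
    """Simplified peak picking (NOT madmom's algorithm).

    A frame counts as an onset if its value reaches `threshold` and is the
    maximum within `max_frames` frames to either side.
    """
    onsets = []
    for i, value in enumerate(activation):
        lo = max(0, i - max_frames)
        hi = min(len(activation), i + max_frames + 1)
        if value >= threshold and value == max(activation[lo:hi]):
            onsets.append(i / fps)  # convert frame index to seconds
    return onsets


fps = 100  # assumed frame rate for this toy example
# one activation function per instrument (in the notebook: output[:, inst_nr])
activations = [
    [0.0, 0.1, 0.9, 0.2, 0.0, 0.0, 0.0, 0.0],  # instrument 0
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.8, 0.1, 0.0],  # instrument 1
]

detections = []
for inst_nr, activation in enumerate(activations):
    for timestamp in pick_peaks(activation, threshold=0.5, fps=fps):
        detections.append([timestamp, inst_nr])

# sort all detections by their timestamps
detections.sort(key=lambda d: d[0])
print(detections)  # prints [[0.02, 0], [0.05, 1]]
```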

In [24]:
# tests # DO NOT COPY OR MODIFY THIS CELL!!

spec = np.zeros((100, spec_freq_bins), dtype=np.float32)
spec[50, :] = 1.0
dets = detect_drums(model, spec)
print(len(dets))
print(dets)
assert len(dets.shape) == 2 

print('All tests successful!')

5
[[0.47 0.  ]
 [0.47 2.  ]
 [0.5  0.  ]
 [0.5  1.  ]
 [0.5  2.  ]]
All tests successful!


Let's run inference on the test dataset:

In [25]:
# DO NOT COPY OR MODIFY THIS CELL!!

def detect_drum_for_all(model, spectrograms, args=DEFAULT_ARGS):
    detection_list = []
    # iterate over tracks
    for cur_spec in spectrograms:
        detection_list.append(detect_drums(model, cur_spec))
    return np.asarray(detection_list)

In [26]:
# tests # DO NOT COPY OR MODIFY THIS CELL!!

# detect drums on test dataset
t1=get_time()
detected_drums = detect_drum_for_all(model, test_feat)
print("Time used for drum detection: {:.1f} seconds.".format(get_time()-t1))

assert len(detected_drums) == len(test_feat)

print('All tests successful!')

Time used for drum detection: 147.4 seconds.
All tests successful!


## Task 6: Evaluation of Drum Transcription (5 Points)
Evaluate the drum transcription performance (using the `madmom.evaluation.onsets` module) on the dataset. As with onset detection, use F-measure, precision, and recall.
For each track, sum up true positives, false positives, and false negatives (OnsetSumEvaluation) of the individual instruments. Over different tracks, take the mean of the resulting metrics (OnsetMeanEvaluation).
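
The aggregation scheme described above can be sketched in plain Python as follows. This is a simplified illustration, not madmom's implementation: the greedy matching in `count_matches`, the helper names, and the convention of returning 1.0 for precision or recall when there is nothing to evaluate are assumptions made for this example.

```python
def count_matches(det_times, ann_times, tolerance):
    """Greedy one-to-one matching of onset times; returns (tp, fp, fn)."""
    det_times, ann_times = sorted(det_times), sorted(ann_times)
    tp = i = j = 0
    while i < len(det_times) and j < len(ann_times):
        if abs(det_times[i] - ann_times[j]) <= tolerance:
            tp += 1
            i += 1
            j += 1
        elif det_times[i] < ann_times[j]:
            i += 1  # detection without a matching annotation
        else:
            j += 1  # annotation without a matching detection
    return tp, len(det_times) - tp, len(ann_times) - tp


def track_metrics(detections, annotations, tolerance=0.02, num_insts=3):
    """Sum the counts over the instruments of one track, then compute metrics."""
    tp = fp = fn = 0
    for inst in range(num_insts):
        det = [t for t, n in detections if n == inst]
        ann = [t for t, n in annotations if n == inst]
        d_tp, d_fp, d_fn = count_matches(det, ann, tolerance)
        tp, fp, fn = tp + d_tp, fp + d_fp, fn + d_fn
    # assumed convention: nothing to evaluate counts as perfect
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return fmeasure, precision, recall


# toy data: (detections, annotations) per track as (timestamp, instrument)
tracks = [
    ([(1.0, 0), (1.0, 1), (2.0, 2)], [(1.0, 0), (1.0, 1), (2.0, 2)]),
    ([(1.0, 0), (1.0, 1)], [(1.0, 0), (2.0, 1)]),
]
per_track = [track_metrics(det, ann) for det, ann in tracks]

# mean of the per-track metrics over all tracks
mean_f, mean_p, mean_r = (sum(m) / len(per_track) for m in zip(*per_track))
print(mean_f, mean_p, mean_r)  # prints 0.75 0.75 0.75
```

Note that onsets are matched separately per instrument here, so a detection can only count as a true positive for an annotation of the same instrument.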

In [27]:
# DO NOT COPY OR MODIFY THIS CELL!!
DEFAULT_TOLERANCE = 0.02

In [28]:
def evaluate_drums(detections, annotations, tolerance=DEFAULT_TOLERANCE):
    """
    Evaluate detected drums against ground truth annotations.

    Parameters
    ----------
    detections : list
        List with drum detections for all files.
    annotations : list
        List with corresponding ground truth annotations.
    tolerance : float
        Tolerance window in seconds for onset evaluation.

    Returns
    -------
    fmeasure : float
    precision : float
    recall : float
    """
    total_true_positives = 0
    total_false_positives = 0
    total_false_negatives = 0
    
    # work on copies, since matched annotations are removed below
    annotations_copy = [list(ann) for ann in annotations]
    
    for track_detections, track_annotations in zip(detections, annotations_copy):
        true_positives = 0
        false_positives = 0
        
        for det_onset in track_detections:
            # find the annotation closest in time to this detection
            # (note: only timestamps are compared, not instrument numbers)
            min_distance = float('inf')
            closest_annotated_index = -1
            for i, ann_onset in enumerate(track_annotations):
                distance = abs(det_onset[0] - ann_onset[0])
                if distance < min_distance:
                    min_distance = distance
                    closest_annotated_index = i
            
            if min_distance <= tolerance:
                # match found; each annotation can be matched only once
                true_positives += 1
                track_annotations.pop(closest_annotated_index)
            else:
                false_positives += 1
        
        # annotations that remain unmatched are false negatives
        false_negatives = len(track_annotations)
        
        total_true_positives += true_positives
        total_false_positives += false_positives
        total_false_negatives += false_negatives
    
    # metrics are computed from the counts pooled over all tracks;
    # the small constant avoids division by zero
    precision = total_true_positives / (total_true_positives + total_false_positives + 1e-8)
    recall = total_true_positives / (total_true_positives + total_false_negatives + 1e-8)
    fmeasure = 2 * precision * recall / (precision + recall + 1e-8)
    
    return fmeasure, precision, recall

In [29]:
# tests # DO NOT COPY OR MODIFY THIS CELL!!
   
np.testing.assert_allclose(evaluate_drums(np.asarray([[(1, 0), (1, 1), (1, 2)]]), 
                                          np.asarray([[(1, 0), (1, 1), (1, 2)]]), 0.02), 
                           (1.0, 1.0, 1.0))

np.testing.assert_allclose(
          evaluate_drums(np.asarray([[(1, 0), (1, 1)]]), 
                         np.asarray([[(1, 0), (2, 1)]]), 0.02), 
          (0.5, 0.5, 0.5))
print('Public tests successful!')
score = 1
print(f"score = {score}")
score

Public tests successful!
score = 1


1

In [30]:
# tests # DO NOT COPY OR MODIFY THIS CELL!!

fmeasure, precision, recall = evaluate_drums(detected_drums, test_annot)
print('Results for CNN:\n'
      '-----------------\n'
      'Precision: {:.3f}\n'
      'Recall:    {:.3f}\n'
      'F-Measure: {:.3f}'.format(precision, recall, fmeasure))

# with correct implementation you should be able to achieve this performance goal:

assert fmeasure > 0.8

print('All tests successful!')

Results for CNN:
-----------------
Precision: 0.990
Recall:    0.969
F-Measure: 0.979
All tests successful!


## Congratulations, you are done!

Reminder:
Before you turn this assignment in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).