# Urban Sound 8k Experiments

## Data Loading and Organization
This section creates an audio datastore of all audio files and attaches the matching labels.

Load the workspace if it has already been saved; this skips the time-consuming processing further down.

In [3]:
load("resnet")

In [5]:
who


Your variables are:

XTrain             dataFolder         miniBatchSize      specSize           
XTrainIms          dropoutProb        names              sset               
XVal               epsilon            net                sz                 
XValIms            frameDuration      newClassLayer      timePoolSize       
YTrain             hopDuration        newLearnableLayer  trainError         
YTrainPred         idxs               numBands           trainedNet         
YVal               imageSize          numClasses         valError           
YValPred           inputSize          numF               valFrequency       
ads                labels             options            
ans                layers             sTrain             
classLayer         learnableLayer     sVal               
classWeights       lgraph             segmentDuration    



In [64]:
pcolor(XTrain(:,:,:,1))
shading flat

In [6]:
imshow(specs2Ims(XTrain(:,:,:,1), [224 224]), [0 255])

In [None]:
save("resnet")

In [1]:
addpath(".")

In [2]:
dataFolder = "~/Music/urbansound8k";

In [3]:
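ads = audioDatastore(strcat(dataFolder,"/Train"));
% File names are integer IDs; sort them numerically so the files line up
% with the label rows (assumes train.csv is ordered by ID).
[~, names, ~] = cellfun(@fileparts, ads.Files, 'UniformOutput', false);
names = cellfun(@str2double, names);
[~, idxs] = sort(names);
ads.Files = ads.Files(idxs);
% Attach the class labels from the CSV file.
labels = readtable(strcat(dataFolder,"/train.csv"));
ads.Labels = categorical(labels.Class);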
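
The numeric sort matters because the file names are integers: a plain text sort would place `10.wav` before `2.wav`, and the files would no longer line up with the label rows. A minimal sketch with made-up file names (the paths are hypothetical, and it assumes the label rows are ordered by ascending ID):

```matlab
% Hypothetical file list, in the text order a directory listing might give.
files = {'/data/1.wav'; '/data/10.wav'; '/data/2.wav'};

% Strip folder and extension, then convert the names to numbers.
[~, names, ~] = cellfun(@fileparts, files, 'UniformOutput', false);
ids = cellfun(@str2double, names);

% Sort numerically and reorder the files to match.
[~, order] = sort(ids);
sortedFiles = files(order);
disp(sortedFiles)
```

After the sort, `2.wav` comes before `10.wav`, which is the order the labels are assumed to follow.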

# Sub-Sampling
This section provides the option to take a subset of the dataset for testing.

In [4]:
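% Optional: keep 15% of each class for quick experiments.
[sset, ~] = splitEachLabel(ads, 0.15);
% 80/20 train/validation split of the full datastore.
% Pass sset instead of ads here to work on the subset.
[sTrain, sVal] = splitEachLabel(ads, 0.8);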

# Neural Networks

Create spectrograms from the data. Each spectrogram has dimensions 40x396 (frequency bands x time frames). Shorter audio clips are padded equally on both sides; see `spectrograms.m` for details.

![spectrogram explained](pres/spec_explained.png)
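
The 396 time frames are consistent with the framing parameters set in the next cell. As a rough check, assuming frames are counted in the usual way (this counting rule is an assumption; `spectrograms.m` may compute the number differently):

```matlab
% Work in integer milliseconds to avoid floating-point rounding.
segmentMs = 4000;  % segmentDuration = 4 s
frameMs   = 50;    % frameDuration = 0.05 s
hopMs     = 10;    % hopDuration = 0.010 s

% Assumed framing rule: one frame at t = 0, then one more per hop for as
% long as the frame still fits inside the segment.
numFrames = floor((segmentMs - frameMs) / hopMs) + 1;
fprintf('%d frames per spectrogram\n', numFrames);
```

With these values the rule gives 396 frames, matching the second dimension of the spectrograms.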

In [20]:
segmentDuration = 4;
frameDuration = 0.05;
hopDuration = 0.010;
numBands = 40;
reset(sTrain);
reset(sVal);
epsilon = 1e-6; % Added so that log doesn't encounter 0
XTrain = log10(spectrograms(sTrain, segmentDuration, frameDuration, hopDuration, numBands)+epsilon);
XVal = log10(spectrograms(sVal, segmentDuration, frameDuration, hopDuration, numBands)+epsilon);

Computing speech spectrograms...
Processed 100 files out of 4348
Processed 200 files out of 4348
Processed 300 files out of 4348
Processed 400 files out of 4348
Processed 500 files out of 4348
Processed 600 files out of 4348
Processed 700 files out of 4348
Processed 800 files out of 4348
Processed 900 files out of 4348
Processed 1000 files out of 4348
Processed 1100 files out of 4348
Processed 1200 files out of 4348
Processed 1300 files out of 4348
Processed 1400 files out of 4348
Processed 1500 files out of 4348
Processed 1600 files out of 4348
Processed 1700 files out of 4348
Processed 1800 files out of 4348
Processed 1900 files out of 4348
Processed 2000 files out of 4348
Processed 2100 files out of 4348
Processed 2200 files out of 4348
Processed 2300 files out of 4348
Processed 2400 files out of 4348
Processed 2500 files out of 4348
Processed 2600 files out of 4348
Processed 2700 files out of 4348
Processed 2800 files out of 4348
Processed 2900 files out of 4348
Processed 3000 file

Create categorical vectors for labels

In [21]:
YTrain = categorical(sTrain.Labels);
YVal = categorical(sVal.Labels);

## Check out dataset items

In [46]:
n = randi(500);  % random index in 1..500 (MATLAB indexing starts at 1)
pcolor(XTrain(:,:,:,n))
title(strrep(string(YTrain(n)), '_', ' '))
shading flat
[samps, sampfreq] = audioread(sTrain.Files{n});
sound(samps, sampfreq)



In [35]:
n = 5;
[samps, sampfreq] = audioread(ads.Files{n});  % use the file's own sample rate
sound(samps, sampfreq)
ads.Files{n}
ads.Labels(n)


ans =

    '/home/zach/Music/urbansound8k/Train/4.wav'


ans = 

  categorical

     dog_bark 



## Conv Net from Scratch
This section uses a simple convolutional neural network to classify spectrograms. With a holdout validation set of 20% of the data, the run recorded below reaches a validation accuracy of about 95.7% (validation error 4.3238%). The network is largely the same as the one in a MathWorks speech recognition tutorial, found [here](https://www.mathworks.com/help/deeplearning/examples/deep-learning-speech-recognition.html?s_tid=mwa_osa_a). The CNN is trained with the [Adam](https://arxiv.org/abs/1412.6980) optimizer at a learning rate of $3 \times 10^{-3}$ and begins to overfit after about 7 epochs.

![from scratch](pres/convnet_25_epochs.png)

In [25]:
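sz = size(XTrain);
specSize = sz(1:2);
imageSize = [specSize 1];  % single-channel "image": bands x frames x 1
% Inverse-frequency class weights, normalized to a mean of 1.
% Note: they are not passed to classificationLayer below, so they do not affect training.
classWeights = 1./countcats(YTrain);
classWeights = classWeights'/mean(classWeights);
numClasses = numel(categories(YTrain));

% Three stride-2 poolings shrink the time axis by about 8x, so this window spans all remaining frames.
timePoolSize = ceil(imageSize(2)/8);
dropoutProb = 0.2;
numF = 12;  % number of filters in the first convolution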
layers = [
    imageInputLayer(imageSize, 'Name', 'Input_Layer')

    convolution2dLayer(3,numF,'Padding','same', 'Name', 'Conv_1')
    batchNormalizationLayer('Name', 'BN_1')
    reluLayer('Name', 'Relu_1')

    maxPooling2dLayer(3,'Stride',2,'Padding','same', 'Name', 'MaxPool_1')

    convolution2dLayer(3,2*numF,'Padding','same', 'Name', 'Conv_2')
    batchNormalizationLayer('Name', 'BN_2')
    reluLayer('Name', 'Relu_2')

    maxPooling2dLayer(3,'Stride',2,'Padding','same', 'Name', 'MaxPool_2')

    convolution2dLayer(3,4*numF,'Padding','same', 'Name', 'Conv_3')
    batchNormalizationLayer('Name', 'BN_3')
    reluLayer('Name', 'Relu_3')

    maxPooling2dLayer(3,'Stride',2,'Padding','same', 'Name', 'MaxPool_3')

    convolution2dLayer(3,4*numF,'Padding','same', 'Name', 'Conv_4')
    batchNormalizationLayer('Name', 'BN_4')
    reluLayer('Name', 'Relu_4')
    convolution2dLayer(3,4*numF,'Padding','same', 'Name', 'Conv_5')
    batchNormalizationLayer('Name', 'BN_5')
    reluLayer('Name', 'Relu_5')

    maxPooling2dLayer([1 timePoolSize], 'Name', 'MaxPool_4')

    dropoutLayer(dropoutProb, 'Name', 'Dropout_1')
    fullyConnectedLayer(numClasses, 'Name', 'FC_1')
    softmaxLayer('Name', 'Softmax_1')
    classificationLayer('Name', 'Classification')];
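
To see how the spectrogram shrinks on its way through the network, the feature-map sizes can be traced by hand. This is a back-of-the-envelope sketch, assuming the 40x396 input reported for this dataset and the usual rule that a `'same'`-padded stride-2 pooling outputs `ceil(n/2)`; `analyzeNetwork(layers)` reports the authoritative sizes.

```matlab
% Input spectrogram size: 40 bands x 396 frames.
sz = [40 396];

% Each of the three 'same'-padded, stride-2 poolings halves both
% dimensions, rounding up. The 'same'-padded convolutions keep the size.
for k = 1:3
    sz = ceil(sz / 2);
end
disp(sz)   % size entering Conv_4 / Conv_5

% MaxPool_4 uses a [1 timePoolSize] window with the default stride of 1
% and no padding, so it collapses the time axis to a single column.
timePoolSize = ceil(396 / 8);
outSz = sz - [1 timePoolSize] + 1;
disp(outSz)
```

The time axis goes from 396 to 198, 99 and 50 frames, and the final pooling reduces it to one, leaving a 5x1 map with 48 channels for the fully connected layer.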

In [5]:
plot(layerGraph(layers))

In [26]:
miniBatchSize = 128;
valFrequency = floor(numel(YTrain)/miniBatchSize);
options = trainingOptions('adam',...
    'InitialLearnRate',3e-3, ...
    'MaxEpochs',25, ...
    'MiniBatchSize',miniBatchSize, ...
    'Shuffle','every-epoch', ...
    'Plots','training-progress', ...
    'Verbose',false, ...
    'ValidationData',{XVal,YVal}, ...
    'ValidationFrequency',valFrequency, ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropFactor',0.1, ...
    'LearnRateDropPeriod',20, ...
    'ExecutionEnvironment', 'gpu');
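
What these options imply can be worked out numerically. The sketch below is for illustration only: the training-set size of 4348 is taken from the spectrogram processing log, and the schedule formula is written out by hand rather than read back from `trainingOptions`.

```matlab
numTrain = 4348;        % training files, from the processing log
miniBatchSize = 128;

% Iterations per epoch; validating at this interval means about once per epoch.
valFrequency = floor(numTrain / miniBatchSize);

% Piecewise schedule: multiply the rate by the drop factor each time
% dropPeriod epochs have passed.
initialRate = 3e-3; dropFactor = 0.1; dropPeriod = 20; maxEpochs = 25;
epochs = 1:maxEpochs;
rates = initialRate * dropFactor .^ floor((epochs - 1) / dropPeriod);

fprintf('validate every %d iterations\n', valFrequency);
fprintf('epoch 20 rate: %g, epoch 21 rate: %g\n', rates(20), rates(21));
```

So validation runs every 33 iterations, and the learning rate stays at 3e-3 for the first 20 epochs before dropping to 3e-4 for the last 5.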

In [28]:
trainedNet = trainNetwork(XTrain, YTrain, layers, options);

In [47]:
YValPred = classify(trainedNet, XVal);
valError = mean(YValPred ~= YVal);
YTrainPred = classify(trainedNet, XTrain);
trainError = mean(YTrainPred ~= YTrain);
disp("Training Error: "+ trainError*100+"%")
disp("Validation Error: "+ valError*100+"%")

Training Error: 0.068997%
Validation Error: 4.3238%


# Visualization of Results

In [48]:
figure('Units','normalized','Position',[0.2 0.2 0.5 0.5]);
cm = confusionchart(YVal,YValPred);
cm.Title = 'Confusion Matrix for Validation Data';
cm.ColumnSummary = 'column-normalized';
cm.RowSummary = 'row-normalized';
% Order the chart by the class names of this dataset.
sortClasses(cm, categories(YVal))


In [33]:
trainedNet.Layers


ans = 

  24x1 Layer array with layers:

     1   'Input_Layer'      Image Input             40x396x1 images with 'zerocenter' normalization
     2   'Conv_1'           Convolution             12 3x3x1 convolutions with stride [1  1] and padding 'same'
     3   'BN_1'             Batch Normalization     Batch normalization with 12 channels
     4   'Relu_1'           ReLU                    ReLU
     5   'MaxPool_1'        Max Pooling             3x3 max pooling with stride [2  2] and padding 'same'
     6   'Conv_2'           Convolution             24 3x3x12 convolutions with stride [1  1] and padding 'same'
     7   'BN_2'             Batch Normalization     Batch normalization with 24 channels
     8   'Relu_2'           ReLU                    ReLU
     9   'MaxPool_2'        Max Pooling             3x3 max pooling with stride [2  2] and padding 'same'
    10   'Conv_3'           Convolution             48 3x3x24 convolutions with stride [1  1] and padding 'same'
    11   'BN_3' 

![Net Architecture](pres/netarch.png)

## Early Conv Layer

![layer 2](pres/layer2.png)

In [49]:
layer = 2;
name = trainedNet.Layers(layer).Name;  % 'Conv_1'
chans = 1:12
I = deepDreamImage(trainedNet, layer, chans, 'PyramidLevels', 1);
figure
I = imtile(I,'ThumbnailSize',[64 64]);
imshow(I)
title(['Layer ',name,' Features'], 'Interpreter', 'none')


chans =

     1     2     3     4     5     6     7     8     9    10    11    12

|  Iteration  |  Activation  |  Pyramid Level  |
|             |   Strength   |                 |
|           1 |         0.07 |               1 |
|           2 |         1.98 |               1 |
|           3 |         4.03 |               1 |
|           4 |         6.08 |               1 |
|           5 |         8.13 |               1 |
|           6 |        10.18 |               1 |
|           7 |        12.23 |               1 |
|           8 |        14.29 |               1 |
|           9 |        16.34 |               1 |
|          10 |        18.39 |               1 |


## Late Conv Layer

![layer 17](pres/layer17.png)

In [None]:
layer = 17;
name = trainedNet.Layers(layer).Name;  % 'Conv_5'
chans = 1:48
I = deepDreamImage(trainedNet, layer, chans, 'PyramidLevels', 1);
figure
I = imtile(I,'ThumbnailSize',[64 64]);
imshow(I)
title(['Layer ',name,' Features'], 'Interpreter', 'none')

# Pre-Trained Resnet
Residual networks (ResNets) consist of a series of "residual blocks" of layers, with bypass connections that skip over each block. The idea is that each residual block can correct for the error in the output of the previous block, so the network learns corrections rather than full results. ResNets generally show improved performance over traditional convolutional networks.
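
In symbols (a standard way of writing this idea, not something taken from the notebook's code), a residual block with input $x$ and learned layers $F$ outputs

$$y = x + F(x)$$

so the layers only need to learn the residual $F(x) = y - x$ rather than the full mapping from $x$ to $y$.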

This section also uses transfer learning, starting from a model that has already been trained to recognize images from the ImageNet dataset. While spectrograms are very different from real-world images, the early layers of the network should still be useful, since they capture low-level features such as edges. To re-train the network, the final layer is replaced to get a 10-class classifier for our dataset. The network is then trained as if the weights had been randomly initialized. One way this process could be improved is to vary the learning rate by layer, so that the early layers adjust more slowly than the final layers, whose pre-trained weights are largely useless for our purposes.

The pre-trained ResNets that MathWorks provides require 224x224-pixel input images, so the spectrograms must be resized before they are fed into the network. The network also expects color images. We add color to our spectrogram representations using the `pcolor()` function. Adding color has been reported to improve network performance (citation needed). The spectrograms are therefore reprocessed as images with the same size as the pre-trained ResNet's input, since MATLAB does not appear to handle differing input sizes automatically in this workflow.
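
A minimal sketch of one way such a conversion could work. This is not the notebook's `specs2Ims` helper, whose implementation is not shown; the random input, the `parula` colormap and the use of `imresize` (Image Processing Toolbox) are all assumptions made for illustration.

```matlab
% Hypothetical stand-in for one log-spectrogram (40 bands x 396 frames).
spec = log10(rand(40, 396) + 1e-6);

% Map the values to integer colormap indices 1..256.
idx = round(rescale(spec, 1, 256));

% Look up an RGB color for every index; the result is 40x396x3 in [0, 1].
rgb = ind2rgb(idx, parula(256));

% Resize to the ResNet input size and convert to the 0..255 range.
im = uint8(255 * imresize(rgb, [224 224]));
disp(size(im))
```

The result is a 224x224x3 `uint8` array, the shape that `net.Layers(1).InputSize` reports for `resnet18`.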

This method achieves 95.86% accuracy on the same validation set as the previous conv net. Performance is very similar; however, the pre-trained ResNet reaches this accuracy in only 3 epochs.

![performance](pres/resnet_3_epochs.png)

In [22]:
load("resnet")

In [23]:
%XValIms = specs2Ims(XVal, [224 224]);
XTrainIms = specs2Ims(XTrain, [224 224]);

Processed 4100
Processed 4150
Processed 4200
Processed 4250
Processed 4300


In [24]:
save("resnet", '-v7.3')

In [55]:
net = resnet18;
inputSize = net.Layers(1).InputSize


inputSize =

   224   224     3



Use the supporting function from this MATLAB [example](https://www.mathworks.com/help/deeplearning/examples/train-deep-learning-network-to-classify-new-images.html) to find the last learnable layer and the classification layer of the network.

In [56]:
lgraph = layerGraph(net);
[learnableLayer,classLayer] = findLayersToReplace(lgraph);

In [57]:
numClasses = 10;
newLearnableLayer = fullyConnectedLayer(numClasses, ...
    'Name','new_fc', ...
    'WeightLearnRateFactor',10, ...
    'BiasLearnRateFactor',10);

Replace the last pre-trained learnable layer and the classification layer with untrained layers sized for the number of classes we require.

In [58]:
lgraph = replaceLayer(lgraph,learnableLayer.Name,newLearnableLayer);
newClassLayer = classificationLayer('Name','new_classoutput');
lgraph = replaceLayer(lgraph,classLayer.Name,newClassLayer);

Optionally, freeze the initial layers of the ResNet here (this is more involved than in fastai).

Train with the same optimizer (Adam) as before, here with an initial learning rate of 3e-4 for 3 epochs.

In [59]:
miniBatchSize = 128;
valFrequency = floor(numel(YTrain)/miniBatchSize);
options = trainingOptions('adam',...
    'InitialLearnRate',3e-4, ...
    'MaxEpochs',3, ...
    'MiniBatchSize',miniBatchSize, ...
    'Shuffle','every-epoch', ...
    'Plots','training-progress', ...
    'Verbose',false, ...
    'ValidationData',{XValIms,YVal}, ...
    'ValidationFrequency',valFrequency, ...
    'LearnRateSchedule','piecewise', ...
    'LearnRateDropFactor',0.1, ...
    'LearnRateDropPeriod',2, ...
    'ExecutionEnvironment', 'gpu');

In [60]:
trainNetwork(XTrainIms, YTrain, lgraph, options)


ans = 

  DAGNetwork with properties:

         Layers: [72x1 nnet.cnn.layer.Layer]
    Connections: [79x2 table]



Fin.