# Original implementation of VGG 16 architecture

Every convolution uses a 3x3 kernel with padding 1 and stride 1.

The input is a 224x224 RGB image.

With these settings each convolution preserves the spatial resolution; only the max-pooling layers reduce it.
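
That the resolution stays the same follows from the standard convolution output-size formula, `floor((size + 2 * padding - kernel) / stride) + 1`. A minimal sketch to check it; the helper name `conv_output_size` is made up for illustration and is not part of the notebook or of PyTorch:

```python
def conv_output_size(size: int, kernel: int = 3, padding: int = 1, stride: int = 1) -> int:
    """Spatial size after one convolution along a single dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# 3x3 kernel, padding 1, stride 1: the resolution is preserved
print(conv_output_size(224))             # 224
# Same kernel without padding: one pixel is lost on each border
print(conv_output_size(224, padding=0))  # 222
```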

The implementation is based on Aladdin Persson's PyTorch VGG [tutorial](https://www.youtube.com/watch?v=ACmuBbuXn20).

# VGGish architecture for genre classification [paper](https://arxiv.org/pdf/1609.09430)


>The only changes we made to VGG (configuration E) [2] were to
the final layer (3087 units with a sigmoid) as well as the use of batch
normalization instead of LRN. While the original network had 144M
weights and 20B multiplies, the audio variant uses 62M weights and
2.4B multiplies. We tried another variant that reduced the initial
strides (as we did with AlexNet), but found that not modifying the
strides resulted in faster training and better performance. With our
setup, parallelizing beyond 10 GPUs did not help significantly, so
we trained with 10 GPUs and 5 parameter servers.


The model was originally trained on the `YouTube-100M` dataset, which is much larger than `GTZAN`.
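
The 144M weight count quoted for configuration E can be roughly reproduced by tallying the layers by hand. The sketch below assumes the ImageNet setup (3 input channels, 224x224 input, 1000 output classes), counts biases, and ignores batch normalization parameters. `VGG19_CONFIG` and `count_vgg_parameters` are hypothetical names used only for this illustration.

```python
# Configuration E (VGG-19) channel layout; "M" marks a 2x2 max-pooling layer.
VGG19_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
                512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]


def count_vgg_parameters(config, in_channels=3, input_size=224, num_classes=1000):
    """Count weights and biases of the 3x3 conv stack plus three linear layers."""
    total = 0
    size = input_size
    for item in config:
        if item == "M":
            size //= 2  # max pooling halves the spatial size
        else:
            total += 3 * 3 * in_channels * item + item  # conv weights + biases
            in_channels = item
    features = in_channels * size * size  # flattened feature vector length
    for out_features in (4096, 4096, num_classes):
        total += features * out_features + out_features  # linear weights + biases
        features = out_features
    return total


print(f"{count_vgg_parameters(VGG19_CONFIG) / 1e6:.1f}M parameters")  # 143.7M parameters
```

Most of the total sits in the first linear layer (25088 x 4096), so the count is sensitive to the input resolution.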


# GTZAN Audio Classification with VGGish Model

I'm using the pre-generated Mel spectrograms from the `GTZAN` `images_original` directory, not `YouTube-100M`.

Changes to VGG described in the paper:
- final layer: 3087 units with a sigmoid
- batch normalization instead of LRN
- 144M weights, 20B multiplies -> 62M weights, 2.4B multiplies
- strides left unmodified
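
A sigmoid output scores every class independently, which suits a multi-label setting. GTZAN assigns one genre per clip, so a softmax over the 10 logits (applied implicitly by `nn.CrossEntropyLoss`) is the more usual choice here. That is my assumption about how this notebook would be trained, not something the paper prescribes. A small pure-Python sketch of the difference, with made-up logits:

```python
import math


def sigmoid(logits):
    """Independent per-class probabilities (multi-label)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Probabilities over mutually exclusive classes (single-label)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]  # hypothetical scores for three classes
print(sum(sigmoid(logits)))  # not constrained to sum to 1
print(sum(softmax(logits)))  # sums to 1 up to rounding
```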

Optimized for macOS on ARM processors via the Metal Performance Shaders (MPS) backend.

In [None]:
# Usage of pre-trained VGGish model

import torch

model = torch.hub.load('harritaylor/torchvggish', 'vggish')
model.eval()

# Download an example audio file
import urllib.request

url, filename = ("http://soundbible.com/grab.php?id=1698&type=wav", "bus_chatter.wav")
urllib.request.urlretrieve(url, filename)

# Run the model on the downloaded WAV file (the hub model takes a file path)
model.forward(filename)

In [2]:
import torch
import torch.nn as nn  # All neural network modules, nn.Linear, nn.Conv2d, BatchNorm, Loss functions


In [3]:
# Integer values - number of channels in the convolutional layers
# M - Maxpooling layer
VGG16_architecture = [ 
    64, 64, "M", 
    128, 128, "M",
    256, 256, 256, "M",
    512, 512, 512, "M",
    512, 512, 512, "M",
    # The feature maps are then flattened
    # and passed through linear layers: 4096 -> 4096 -> 1000
]
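
The input size of the first linear layer follows from this list: each "M" halves the spatial size, and the last integer gives the channel count. A minimal sketch, assuming a 224x224 input; `VGG16_CONFIG` and `flattened_features` are illustrative names that just mirror the list above:

```python
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]


def flattened_features(config, input_size=224):
    """Return (channels, spatial size, flattened length) after the conv stack."""
    size = input_size
    channels = None
    for item in config:
        if item == "M":
            size //= 2       # 2x2 max pooling with stride 2 halves the size
        else:
            channels = item  # 3x3 conv with padding 1 keeps the size
    return channels, size, channels * size * size


print(flattened_features(VGG16_CONFIG))  # (512, 7, 25088)
```

A different input resolution changes the flattened length, so the first `nn.Linear` would need to be resized accordingly.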

In [10]:
class VGGish(nn.Module):
    def __init__(self, in_channels=3, num_classes=1000):
        super().__init__()
        self.in_channels = in_channels
        self.conv_layers = self.create_conv_layers(VGG16_architecture)

        # Classifier head: outputs num_classes logits with no activation
        # (the paper's final layer has 3087 units with a sigmoid)
        self.fcs = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),  # 7 = input size / 2^num_maxpool = 224 / 2^5
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        print(f"Input shape: {x.shape}")
        x = self.conv_layers(x)
        print(f"Shape after conv layers: {x.shape}")
        x = x.reshape(x.shape[0], -1)
        print(f"Shape after flattening: {x.shape}")
        x = self.fcs(x)
        print(f"Shape after fully connected layers: {x.shape}")
        return x

    def create_conv_layers(self, architecture):
        layers = []
        in_channels = self.in_channels

        for x in architecture:
            if isinstance(x, int):
                out_channels = x

                layers += [
                    nn.Conv2d(
                        in_channels=in_channels,
                        out_channels=out_channels,
                        kernel_size=(3, 3),
                        stride=(1, 1),
                        padding=(1, 1),
                    ),
                    nn.BatchNorm2d(x),  # Not in the original VGG paper; the VGGish paper uses batch norm
                    nn.ReLU(),
                ]
                in_channels = x
            elif x == "M":
                layers += [nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))]

        return nn.Sequential(*layers)


In [11]:

device = "mps" if torch.backends.mps.is_available() else "cpu"

num_classes = 10
model = VGGish(in_channels=16, num_classes=num_classes).to(device)
BATCH_SIZE = 3
x = torch.randn(BATCH_SIZE, 16, 224, 224).to(device)  # 3 samples, 16 channels, 224x224
print(x.shape)
assert model(x).shape == torch.Size([BATCH_SIZE, num_classes])
print(model(x).shape)

torch.Size([3, 16, 224, 224])
Input shape: torch.Size([3, 16, 224, 224])
Shape after conv layers: torch.Size([3, 512, 7, 7])
Shape after flattening: torch.Size([3, 25088])
Shape after fully connected layers: torch.Size([3, 10])
Input shape: torch.Size([3, 16, 224, 224])
Shape after conv layers: torch.Size([3, 512, 7, 7])
Shape after flattening: torch.Size([3, 25088])
Shape after fully connected layers: torch.Size([3, 10])
torch.Size([3, 10])
