### VGG

[Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/pdf/1409.1556)

*2024/11/12*

VGG uses small 3x3 convolutional kernels throughout instead of the larger kernels used in earlier networks (11x11 and 5x5 in AlexNet, 7x7 in ZFNet). Stacking small kernels keeps the same receptive field while adding more non-linear transformations and reducing the number of parameters. For example, a stack of three 3x3 convolutional layers with stride 1 has the same receptive field as a single 7x7 layer, but with fewer weights: for $C$ input and output channels, $3 \times C \times (3 \times 3 \times C) = 27C^2$ compared to $C \times (7 \times 7 \times C) = 49C^2$. This saving lets the network go deeper, which improves its ability to learn complex features. VGG also compensates for the spatial information lost in pooling by increasing the number of feature maps after pooling (64, 128, 256, then 512 channels). However, VGG is still expensive: its fully connected layers are very large, holding about 124 million of VGG16's roughly 138 million parameters.
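The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch rather than code from the paper: the helper names are hypothetical, and the layer layout is assumed to be VGG16 (configuration D) with a 224x224 input, a 4096-4096-1000 classifier, and biases counted.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, undilated convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


def conv_weights(kernel_size, channels, num_layers=1):
    """Weights (biases ignored) of conv layers with C input and C output channels."""
    return num_layers * channels * (kernel_size * kernel_size * channels)


# Assumed VGG16 layout: ints are 3x3 conv output channels, "M" is 2x2 max pooling.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_param_split(cfg=VGG16_CFG, in_channels=3, num_classes=1000):
    """Return (conv_params, fc_params), biases included."""
    conv_params, c_in = 0, in_channels
    for v in cfg:
        if v == "M":
            continue  # pooling has no parameters
        conv_params += c_in * v * 3 * 3 + v
        c_in = v
    # Five poolings shrink 224x224 to 7x7, so the classifier sees 512*7*7 features.
    fc_sizes = [c_in * 7 * 7, 4096, 4096, num_classes]
    fc_params = sum(a * b + b for a, b in zip(fc_sizes, fc_sizes[1:]))
    return conv_params, fc_params


print(stacked_receptive_field([3, 3, 3]), stacked_receptive_field([7]))  # 7 7
print(conv_weights(3, 64, num_layers=3), conv_weights(7, 64))  # 110592 200704
conv, fc = vgg16_param_split()
print(conv, fc)  # 14714688 123642856
```

With $C = 64$, the three-layer 3x3 stack needs $27C^2 = 110592$ weights against $49C^2 = 200704$ for one 7x7 layer, and under these assumptions the fully connected layers account for roughly 89% of all parameters.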

*Code*

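```python
import torch
from torchvision.models import vgg16
from torchsummary import summary

# Use the GPU when available so the example also runs on CPU-only machines.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = vgg16().to(device)
x = torch.randn(1, 3, 224, 224, device=device)

# Print per-layer output shapes and parameter counts; input size excludes the batch dimension.
summary(model, tuple(x.shape[1:]), device=device)
```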

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 224, 224]           1,792
              ReLU-2         [-1, 64, 224, 224]               0
            Conv2d-3         [-1, 64, 224, 224]          36,928
              ReLU-4         [-1, 64, 224, 224]               0
         MaxPool2d-5         [-1, 64, 112, 112]               0
            Conv2d-6        [-1, 128, 112, 112]          73,856
              ReLU-7        [-1, 128, 112, 112]               0
            Conv2d-8        [-1, 128, 112, 112]         147,584
              ReLU-9        [-1, 128, 112, 112]               0
        MaxPool2d-10          [-1, 128, 56, 56]               0
           Conv2d-11          [-1, 256, 56, 56]         295,168
             ReLU-12          [-1, 256, 56, 56]               0
           Conv2d-13          [-1, 256, 56, 56]         590,080
             ReLU-14          [-1, 256,