### MobileNet

[MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/pdf/1704.04861)

*2024/11/20*

To build smaller and computationally cheaper models, the paper introduces two hyperparameters.

- **Width Multiplier $(\alpha)$**: Reduces the number of channels in each layer.
- **Resolution Multiplier $(\rho)$**: Reduces the input image size and, consequently, the size of the feature maps throughout the network.

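With both multipliers applied, the computational cost of one depthwise separable layer becomes $D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$, where $D_K$ is the kernel size, $M$ and $N$ are the numbers of input and output channels, and $D_F$ is the feature map size. With $\alpha, \rho \in (0, 1]$, the cost shrinks by roughly $\alpha^2$ and by $\rho^2$, respectively.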
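As a rough sketch, the formula can be evaluated directly to see how each multiplier affects the cost. The function name `scaled_separable_cost` and the layer sizes (a $3 \times 3$ kernel, 256 input and output channels, a $28 \times 28$ feature map) are illustrative assumptions, not values taken from the notes above.

```python
def scaled_separable_cost(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    """Mult-adds of one depthwise separable layer with both multipliers applied."""
    m_thin = alpha * m        # input channels after the width multiplier
    n_thin = alpha * n        # output channels after the width multiplier
    d_f_small = rho * d_f     # feature map side after the resolution multiplier
    depthwise = d_k * d_k * m_thin * d_f_small * d_f_small
    pointwise = m_thin * n_thin * d_f_small * d_f_small
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 256 -> 256 channels, 28x28 feature map.
base = scaled_separable_cost(3, 256, 256, 28)
half_width = scaled_separable_cost(3, 256, 256, 28, alpha=0.5)
half_res = scaled_separable_cost(3, 256, 256, 28, rho=0.5)

# rho = 0.5 cuts the cost to exactly a quarter. alpha = 0.5 leaves slightly
# more than a quarter, because the depthwise term scales only linearly in alpha.
print(half_res / base, half_width / base)
```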

*2024/11/15*

Depthwise separable convolution is the key innovation in MobileNet that significantly reduces the computational cost and parameter count while maintaining reasonable accuracy. This technique decomposes a standard convolution into two separate operations: depthwise convolution and pointwise convolution.

1. **Depthwise Convolution**
   - **Operation**: In depthwise convolution, each input channel is convolved independently with a single filter. This means that if the input has $C$ channels, there will be $C$ filters.
   - **Output**: The output of the depthwise convolution will have the same number of channels as the input. For example, if the input is a 3-channel image, the output will also have three feature maps, each corresponding to one of the input channels.

2. **Pointwise Convolution**
   - **Operation**: Pointwise convolution is a $1 \times 1$ convolution applied to the output of the depthwise convolution. It combines the feature maps from the depthwise convolution to produce a new set of feature maps.
   - **Output**: The number of output feature maps is determined by the number of $1 \times 1$ filters used. If $N$ filters are used, the output will have $N$ feature maps.

Let's consider a 3-channel input image of size $64 \times 64$.

1. **Depthwise Convolution**:
   - Input: $64 \times 64 \times 3$
   - Kernel: three kernels of size $3 \times 3 \times 1$ (one for each input channel)
   - Output: $64 \times 64 \times 3$

2. **Pointwise Convolution**:
   - Input: $64 \times 64 \times 3$
   - Kernel: three kernels of size $1 \times 1 \times 3$ (one per output feature map, assuming we want three)
   - Output: $64 \times 64 \times 3$
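The two steps above can be sketched in plain Python on nested lists. This is a minimal illustration, not an efficient implementation: the helper names `depthwise_conv` and `pointwise_conv` are hypothetical, zero padding of 1 and stride 1 are assumed so that the spatial size is preserved, and a $4 \times 4$ input with hand-picked kernels stands in for the $64 \times 64$ image to keep the numbers easy to follow.

```python
def depthwise_conv(x, kernels):
    """x: [C][H][W], kernels: [C][3][3]. Each channel gets its own 3x3 kernel."""
    out = []
    for channel, kernel in zip(x, kernels):
        h, w = len(channel), len(channel[0])
        fmap = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(3):
                    for dj in range(3):
                        r, c = i + di - 1, j + dj - 1
                        if 0 <= r < h and 0 <= c < w:  # zero padding
                            acc += channel[r][c] * kernel[di][dj]
                fmap[i][j] = acc
        out.append(fmap)
    return out


def pointwise_conv(x, weights):
    """x: [M][H][W], weights: [N][M]. Mixes channels at every pixel (1x1 conv)."""
    h, w = len(x[0]), len(x[0][0])
    out = []
    for row in weights:  # one row of M weights per output feature map
        fmap = [[sum(wt * x[m][i][j] for m, wt in enumerate(row))
                 for j in range(w)] for i in range(h)]
        out.append(fmap)
    return out


# 3-channel 4x4 input filled with ones.
x = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

dw = depthwise_conv(x, [identity, box, box])                  # 3 maps in, 3 maps out
pw = pointwise_conv(dw, [[1, 0, 0], [0, 1, 0], [1, 1, 1]])    # 3 filters, 3 maps out
```

The depthwise step never mixes channels; only the pointwise step combines them, which is why the third output map here is the per-pixel sum of all three depthwise maps.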

The computational efficiency of depthwise separable convolution can be quantified by comparing it to a standard convolution.

- **Standard Convolution**:
  - Input: $D_F \times D_F \times M$
  - Kernel: $D_K \times D_K \times M \times N$
  - Output: $D_F \times D_F \times N$
  - Computation: $D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$

- **Depthwise Separable Convolution**:
  - Depthwise Convolution: $D_K \cdot D_K \cdot M \cdot 1 \cdot D_F \cdot D_F$
  - Pointwise Convolution: $1 \cdot 1 \cdot M \cdot N \cdot D_F \cdot D_F$
  - Total: $D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$
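To make the comparison concrete, the two formulas above can be evaluated for a single layer. The function names and the layer sizes below (a $3 \times 3$ kernel, 512 input and output channels, a $14 \times 14$ feature map) are illustrative assumptions.

```python
def standard_cost(d_k, m, n, d_f):
    """Mult-adds of a standard convolution."""
    return d_k * d_k * m * n * d_f * d_f


def separable_cost(d_k, m, n, d_f):
    """Mult-adds of a depthwise convolution plus a pointwise convolution."""
    depthwise = d_k * d_k * m * d_f * d_f
    pointwise = m * n * d_f * d_f
    return depthwise + pointwise


# Hypothetical layer: D_K = 3, M = N = 512, D_F = 14.
std = standard_cost(3, 512, 512, 14)   # 462,422,016 mult-adds
sep = separable_cost(3, 512, 512, 14)  # 52,283,392 mult-adds
print(std, sep, std / sep)
```

For this layer the pointwise term dominates the separable cost, so the saving comes almost entirely from removing the $D_K \cdot D_K$ factor in front of $M \cdot N$.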

By using depthwise separable convolution, the computational cost is reduced to $\frac{1}{N} + \frac{1}{D_K^2}$ of that of the standard convolution. For example, with $N = 64$ output channels and a kernel size of $D_K = 3$, the ratio is $\frac{1}{64} + \frac{1}{9} \approx 0.127$, so the standard convolution needs about 7.9 times as much computation as the depthwise separable convolution. As $N$ grows, this factor approaches $D_K^2 = 9$.
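The ratio follows from dividing the separable total by the standard cost and cancelling the common factors:

$$
\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F}
= \frac{D_K^2 \cdot M \cdot D_F^2}{D_K^2 \cdot M \cdot N \cdot D_F^2} + \frac{M \cdot N \cdot D_F^2}{D_K^2 \cdot M \cdot N \cdot D_F^2}
= \frac{1}{N} + \frac{1}{D_K^2}
$$

The first term is the share of the depthwise step and the second is the share of the pointwise step, so the result does not depend on $M$ or $D_F$.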

*Code*

```python
import torch
import torch.nn as nn
from torchsummary import summary


class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        # groups=in_channels gives every input channel its own 3x3 filter.
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=stride, padding=1, groups=in_channels),
            nn.BatchNorm2d(in_channels),
            nn.ReLU6(inplace=True)
        )

        # The 1x1 convolution mixes channels and sets the output width.
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU6(inplace=True)
        )

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x


class MobileNet(nn.Module):
    def __init__(self, num_classes=1000, dropout=0.2):
        super().__init__()

        # Feature extractor: one standard convolution, then 13 separable blocks.
        self.bottleneck = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU6(inplace=True),
            DepthwiseSeparableConv(32, 64, stride=1),
            DepthwiseSeparableConv(64, 128, stride=2),
            DepthwiseSeparableConv(128, 128, stride=1),
            DepthwiseSeparableConv(128, 256, stride=2),
            DepthwiseSeparableConv(256, 256, stride=1),
            DepthwiseSeparableConv(256, 512, stride=2),
            DepthwiseSeparableConv(512, 512, stride=1),
            DepthwiseSeparableConv(512, 512, stride=1),
            DepthwiseSeparableConv(512, 512, stride=1),
            DepthwiseSeparableConv(512, 512, stride=1),
            DepthwiseSeparableConv(512, 512, stride=1),
            DepthwiseSeparableConv(512, 1024, stride=2),
            DepthwiseSeparableConv(1024, 1024, stride=1),
            # Assumes a 224x224 input, which leaves a 7x7 map at this point.
            nn.AvgPool2d(kernel_size=7)
        )

        self.classifier = nn.Sequential(
            nn.Dropout(p=dropout),
            nn.Linear(1024, num_classes),
        )

        # Returns class probabilities. Drop this when training with
        # nn.CrossEntropyLoss, which expects raw logits.
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.bottleneck(x)
        x = torch.flatten(x, start_dim=1)
        x = self.classifier(x)
        x = self.softmax(x)
        return x
```

```python
# Requires a CUDA device.
model = MobileNet().cuda()
x = torch.randn(1, 3, 224, 224).cuda()
summary(model, tuple(x.shape[1:]))  # input size without the batch dimension
```

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 32, 112, 112]             896
       BatchNorm2d-2         [-1, 32, 112, 112]              64
             ReLU6-3         [-1, 32, 112, 112]               0
            Conv2d-4         [-1, 32, 112, 112]             320
       BatchNorm2d-5         [-1, 32, 112, 112]              64
             ReLU6-6         [-1, 32, 112, 112]               0
            Conv2d-7         [-1, 64, 112, 112]           2,112
       BatchNorm2d-8         [-1, 64, 112, 112]             128
             ReLU6-9         [-1, 64, 112, 112]               0
DepthwiseSeparableConv-10         [-1, 64, 112, 112]               0
           Conv2d-11           [-1, 64, 56, 56]             640
      BatchNorm2d-12           [-1, 64, 56, 56]             128
            ReLU6-13           [-1, 64, 56, 56]               0
           Conv2d-14          [-1,