https://pytorch.org/hub/pytorch_vision_mobilenet_v2/  

MobileNetV2 is intended to run on mobile devices.  

MobileNetV1 introduced *depth-wise separable convolutions* which reduce computational cost compared to standard convolutions.  

MobileNetV2 introduces *inverted residual blocks* and *linear bottlenecks*.  

1. Depthwise separable convolutions reduce the number of parameters by splitting a convolution into a depthwise convolution and a pointwise convolution.
2. Inverted residuals: shortcut connections link the narrow bottleneck layers rather than the wide expanded layers.
3. Linear bottlenecks (no activation after the projection) between blocks maintain low-dimensional representations.
4. ReLU6 activations.
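
As a quick illustration of item 4, the sketch below compares ReLU with ReLU6 on a hand-picked tensor. The input values are arbitrary and only meant to show that ReLU6 additionally caps activations at 6.

```python
import torch
import torch.nn.functional as F

# Arbitrary example values (an assumption for illustration only).
x = torch.tensor([-3.0, 0.0, 2.5, 6.0, 9.0])

relu_out = F.relu(x)    # clamps negatives to 0, leaves large values alone
relu6_out = F.relu6(x)  # clamps every value into the range [0, 6]

print(relu_out.tolist())   # [0.0, 0.0, 2.5, 6.0, 9.0]
print(relu6_out.tolist())  # [0.0, 0.0, 2.5, 6.0, 6.0]
```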

Architecture
- initial layer uses standard convolution with stride = 2, downsampling and increasing channels from 3 to 32
- inverted residual blocks have three layers: 1) a 1x1 convolution expansion layer followed by a ReLU6 activation, 2) a depthwise convolution with a ReLU6 activation, 3) a 1x1 convolution projection layer that goes back down to a narrow channel count, with no activation function
- when a block uses stride 1 and its input and output channel counts match, a shortcut connection bypasses the three layers and adds the input directly to the output
- final layers: a 1x1 conv layer, an average pooling layer, and an FC layer
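
The three-layer block described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not torchvision's own class: the name `InvertedResidualSketch` is hypothetical, and the default `expand_ratio=6` is an assumption chosen to match the 16 -> 96 expansion visible in the printed model.

```python
import torch
from torch import nn


class InvertedResidualSketch(nn.Module):
    """Hypothetical minimal inverted residual block (not torchvision's class)."""

    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio  # assumed expansion factor
        # The shortcut is only valid when input and output shapes match.
        self.use_shortcut = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1) 1x1 expansion + ReLU6
            nn.Conv2d(in_ch, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 2) 3x3 depthwise convolution (groups == channels) + ReLU6
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride,
                      padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3) 1x1 linear projection, deliberately without an activation
            nn.Conv2d(hidden, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out


x = torch.randn(1, 16, 32, 32)
same = InvertedResidualSketch(16, 16, stride=1)  # shortcut is used
down = InvertedResidualSketch(16, 24, stride=2)  # no shortcut, downsamples
same_out = same(x)
down_out = down(x)
print(same_out.shape)  # torch.Size([1, 16, 32, 32])
print(down_out.shape)  # torch.Size([1, 24, 16, 16])
```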


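Inverted residual with linear bottleneck: this block takes as input a low-dimensional compressed representation, which is first expanded to a high dimension and filtered with a lightweight depthwise convolution. The features are then projected back to a low-dimensional representation with a linear (activation-free) 1x1 convolution.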

[Efficient Image Segmentation Using PyTorch: Part 1 - Dhruv Matani, towardsdatascience.com](https://towardsdatascience.com/efficient-image-segmentation-using-pytorch-part-1-89e8297a0923/)  
[Efficient Image Segmentation Using PyTorch: Part 2 - Dhruv Matani, towardsdatascience.com](https://medium.com/data-science/efficient-image-segmentation-using-pytorch-part-2-bed68cadd7c7)  
[Efficient Image Segmentation Using PyTorch: Part 3 - Dhruv Matani, towardsdatascience.com](https://medium.com/p/3534cf04fb89)  

### CNNs
The convolution, batch-normalization, ReLU block is the most frequently used block of CNN layers.  
A typical CNN progressively reduces the input's spatial dimensions as layers are stacked. This is done by pooling neighboring values with max or average pooling.  

CNNs generally adhere to the following: 
- A conv layer either keeps the number of output channels intact or doubles them
- A conv layer keeps the spatial dimensions intact with stride=1 or reduces them to half with stride=2
- A conv layer's output is commonly pooled to change the spatial dimensions
- Batch normalization is typically applied after a conv layer
- A ReLU activation typically follows the batch normalization
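
The conventions above can be sketched as a small helper. The function name `conv_bn_relu` and the tensor sizes are illustrative assumptions; the point is only to show how stride and pooling change the output shape.

```python
import torch
from torch import nn


def conv_bn_relu(in_ch, out_ch, stride=1):
    """Hypothetical helper: 3x3 conv -> batch norm -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                  padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


x = torch.randn(2, 32, 64, 64)  # (batch, channels, height, width)

keep = conv_bn_relu(32, 32, stride=1)  # channels and spatial size intact
half = conv_bn_relu(32, 64, stride=2)  # channels doubled, spatial size halved

keep_out = keep(x)
half_out = half(x)
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(keep_out)  # pooling halves H and W

print(keep_out.shape)  # torch.Size([2, 32, 64, 64])
print(half_out.shape)  # torch.Size([2, 64, 32, 32])
print(pooled.shape)    # torch.Size([2, 32, 32, 32])
```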

Batch normalization normalizes each channel's values to have zero mean and unit variance, so that channel values share the same distribution. This helps prevent gradients from getting too small and achieves faster convergence.  
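
A small check of the zero-mean, unit-variance claim, assuming a freshly created `nn.BatchNorm2d` in training mode (so it uses batch statistics, with its learnable scale and shift still at their initial values of 1 and 0). The input scale and offset are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Arbitrary input with mean ~10 and standard deviation ~5 in every channel.
x = torch.randn(8, 3, 16, 16) * 5.0 + 10.0

bn = nn.BatchNorm2d(3)  # training mode by default: normalizes with batch stats
y = bn(x)

# Statistics are computed per channel, over batch and spatial dimensions.
mean = y.mean(dim=(0, 2, 3))
var = y.var(dim=(0, 2, 3), unbiased=False)

print(mean)  # each entry is close to 0
print(var)   # each entry is close to 1
```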

ReLU activation clips negative input values to 0, so every output is >= 0.  

Pooling is used to shrink the spatial dimensions of the inputs. Pooling with stride=2 transforms spatial dimensions from (X, Y) to (X/2, Y/2). Max pooling takes the maximum value in a 2x2 (or other size) window and uses it as the single value in the pooled output. The reverse of pooling is an unpooling layer: for a stride of 2, the spatial dimensions are doubled. The location of the max value must be recorded during pooling so the value can be placed back there during unpooling, with zeros in the other locations.  
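
The pooling and unpooling round trip can be illustrated with `nn.MaxPool2d(return_indices=True)` and `nn.MaxUnpool2d`. The 4x4 input below is a made-up example chosen so the result is easy to follow by eye.

```python
import torch
from torch import nn

# Made-up 4x4 single-channel input, shape (batch=1, channels=1, 4, 4).
x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [9., 1., 2., 3.],
                    [1., 1., 4., 0.]]]])

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

pooled, indices = pool(x)           # indices record where each max came from
restored = unpool(pooled, indices)  # maxima go back in place, zeros elsewhere

print(pooled[0, 0])
# tensor([[4., 8.],
#         [9., 4.]])
print(restored[0, 0])
# tensor([[0., 0., 0., 0.],
#         [0., 4., 0., 8.],
#         [9., 0., 0., 0.],
#         [0., 0., 4., 0.]])
```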

### This article uses SegNet
- encoder downsamples input to generate representative features
- decoder upsamples the features to create a per-pixel classification
- each section is made up of a sequence of conv-BN-ReLU blocks with either pooling (encoder) or unpooling (decoder)
- SegNet uses the pooling indices from the encoder's max-pooling to do the decoder's unpooling
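
A heavily reduced sketch of this encoder-decoder idea is shown below. It is not the article's model: `TinySegNetSketch`, its single encoder/decoder stage, and the `width` and `num_classes` values are all assumptions made to keep the example short.

```python
import torch
from torch import nn


def conv_bn_relu(in_ch, out_ch):
    """Hypothetical helper: 3x3 conv -> batch norm -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class TinySegNetSketch(nn.Module):
    """One encoder stage and one decoder stage, for illustration only."""

    def __init__(self, in_ch=3, num_classes=3, width=16):
        super().__init__()
        self.enc = conv_bn_relu(in_ch, width)
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec = conv_bn_relu(width, width)
        self.classify = nn.Conv2d(width, num_classes, kernel_size=1)

    def forward(self, x):
        feats = self.enc(x)
        pooled, idx = self.pool(feats)  # encoder: downsample, keep indices
        up = self.unpool(pooled, idx, output_size=feats.size())  # decoder: reuse indices
        return self.classify(self.dec(up))  # per-pixel class scores


model = TinySegNetSketch()
logits = model(torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 3, 64, 64])
```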

### depthwise separable convolutions
- the number of trainable parameters is a good indication of the size of a model
- the number of tensor ops indicates model complexity/computational cost
- a DSC is made of two convolutions:
  1. a depthwise (grouped) convolution with groups equal to the number of input channels m, producing m output channels, so that each output channel is only affected by one input channel
  2. a pointwise convolution (filter size = 1) which operates like a normal conv, where each of the n filters operates on all m input channels to make a single output value
- a DSC ends up having about 7x fewer parameters than the standard version (the exact ratio depends on kernel size and channel counts)
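
To make the parameter saving concrete, the sketch below counts weights for a standard convolution (m·n·k²) and for the depthwise plus pointwise pair (m·k² + m·n). The sizes m = n = 32 and k = 3 are hypothetical and not taken from the article; with them the ratio happens to come out near 7x.

```python
from torch import nn


def n_params(module):
    """Count trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


m, n, k = 32, 32, 3  # hypothetical input channels, output channels, kernel size

standard = nn.Conv2d(m, n, kernel_size=k, padding=1, bias=False)
depthwise = nn.Conv2d(m, m, kernel_size=k, padding=1, groups=m, bias=False)
pointwise = nn.Conv2d(m, n, kernel_size=1, bias=False)

std_count = n_params(standard)                         # m * n * k * k
dsc_count = n_params(depthwise) + n_params(pointwise)  # m * k * k + m * n
ratio = std_count / dsc_count

print(std_count)        # 9216
print(dsc_count)        # 1312
print(round(ratio, 2))  # 7.02
```

Larger output channel counts push the ratio toward k² (9 for a 3x3 kernel), since the pointwise term m·n then dominates the DSC cost.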

```python
import torch
import torchvision.models as models

# Build MobileNetV2 with randomly initialized weights and print its layers
model = models.mobilenet_v2()
print(model)
```

MobileNetV2(
  (features): Sequential(
    (0): ConvBNActivation(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU6(inplace=True)
    )
    (1): InvertedResidual(
      (conv): Sequential(
        (0): ConvBNActivation(
          (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=32, bias=False)
          (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU6(inplace=True)
        )
        (1): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (2): InvertedResidual(
      (conv): Sequential(
        (0): ConvBNActivation(
          (0): Conv2d(16, 96, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): BatchNorm2d(96, eps=1e-05, momen