$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
$$


# CS236781: Deep Learning
# Bonus Tutorial: Efficient and special CNNs

## Introduction

In this tutorial, we will cover:

- Recap of ResNets
- Batch Normalization
- SqueezeNet
- Depthwise Separable Convolutions
- MobileNet
- MobileNet v2
- MobileNet v3
- ShuffleNet 
- EfficientNet 

In [1]:
# Setup
%matplotlib inline
import os
import sys
import torch
import torchvision
import matplotlib.pyplot as plt

In [2]:
plt.rcParams['font.size'] = 20
data_dir = os.path.expanduser('~/.pytorch-datasets')

## Theory Reminders

### Convolution neural networks (CNNs)

<center><img src="img/arch.png" width="500" /></center>

#### Resnet

<center><img src="img/resnet_block2.png" width="900"/></center>

(Left: basic block; right: bottleneck block).

Here the weight layers are `3x3` or `1x1` convolutions followed by batch-normalization.
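
As a rough illustration of the basic block described above, here is a minimal sketch in PyTorch. The class name `BasicBlock`, the channel count and the input size are assumptions chosen for the example; it covers only the case where the input and output shapes match, so the skip connection is a plain identity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """Hypothetical minimal ResNet basic block: two 3x3 conv + BN layers and an identity skip."""

    def __init__(self, channels):
        super().__init__()
        # bias=False: the batch norm that follows already adds a per-channel shift
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Add the skip connection, then apply the activation
        return F.relu(out + x)


block = BasicBlock(64)
x = torch.randn(2, 64, 56, 56)
y = block(x)
print(y.shape)
```

Because the padding keeps the spatial size and the channel count is unchanged, the output has the same shape as the input, which is what makes the addition with `x` possible.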

<center><img src="img/rn.webp" width="900"/></center>


<center><img src="img/resnet_arch_table.png" width="700"/></center>

### Batch Normalization

Batch normalization is a technique for improving the speed, performance, and stability of deep neural networks.
It normalizes the input of a layer by re-centering and re-scaling the activations.
For feature maps, BN works channel-wise: the statistics of each channel are computed over all pixels of that channel across the batch, so the mean and variance are vectors in $\mathbb{R}^c$.

The original goal was to accelerate training by reducing **Internal Covariate Shift**.
**Covariate Shift** is when the distribution of the input data shifts between the training environment and the live environment.
Reducing the shift, i.e. normalizing not only the input but also the intermediate features, should accelerate the learning process.

This claim was later disputed to some degree, and [other work](https://arxiv.org/pdf/1805.11604.pdf) suggests that BN improves the Lipschitzness of both the loss and the gradients. In other words, it creates a smoother loss landscape over which the hypothesis is easier to optimize.
<img src="img/bn_5.png" width="700" alt="scale">


**Batch Norm** has 4 groups of parameters: two are learnable, and the other two are statistics computed from the data.
<center><img src="img/bn_1.png" width="700" alt="scale"></center>
But at test time, we cannot use the statistics of a batch to classify (why?).

#### During training
<center><img src="img/bn_2.png" width="700" alt="scale"></center>

The momentum is $\alpha = 0.9$, and $\epsilon = 10^{-5}$ is added to avoid division by zero.
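
To make the training-time computation concrete, the sketch below re-implements the normalization by hand and compares it with `nn.BatchNorm2d`. It also checks the running statistics that are stored for later use at test time. The batch size and tensor shapes are assumptions for the example. Note that PyTorch's `momentum=0.1` default weights the *new* batch statistic, which corresponds to $\alpha = 0.9$ on the old running value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 10, 32, 32)           # (batch, channels, height, width)

bn = nn.BatchNorm2d(10)                  # PyTorch defaults: eps=1e-5, momentum=0.1
bn.train()
y_ref = bn(x)                            # also updates the running statistics

# Manual training-mode batch norm: per-channel statistics over batch and pixels
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
gamma = bn.weight.view(1, -1, 1, 1)      # learnable scale
beta = bn.bias.view(1, -1, 1, 1)         # learnable shift
y_manual = (x - mean) / torch.sqrt(var + bn.eps) * gamma + beta
train_matches = torch.allclose(y_ref, y_manual, atol=1e-5)

# Running mean after one step: 0.9 * 0 (initial value) + 0.1 * batch mean
running_mean_matches = torch.allclose(bn.running_mean, bn.momentum * mean.flatten(), atol=1e-6)

# In eval mode the stored running statistics replace the batch statistics
bn.eval()
y_eval = bn(x)
r_mean = bn.running_mean.view(1, -1, 1, 1)
r_var = bn.running_var.view(1, -1, 1, 1)
y_eval_manual = (x - r_mean) / torch.sqrt(r_var + bn.eps) * gamma + beta
eval_matches = torch.allclose(y_eval, y_eval_manual, atol=1e-5)

print(train_matches, running_mean_matches, eval_matches)
```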



#### During test
<center><img src="img/bn_4.png" width="700" alt="scale"></center>

In [3]:
import torch.nn as nn

# A function to count the number of parameters in an nn.Module.
def num_params(layer):
    return sum([p.numel() for p in layer.parameters()])


Let's use PyTorch's `nn.BatchNorm2d`:

In [4]:
# Batch norm over 10 channels: one learnable scale and one learnable shift per channel
bn1 = nn.BatchNorm2d(10)
print(f'bn1: {num_params(bn1)} parameters')

bn1: 20 parameters


What would be the output shape?

In [5]:
input_fm = torch.randn(1,10,32,32)
output_fm = bn1(input_fm)
output_fm.shape

torch.Size([1, 10, 32, 32])
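
The count of 20 covers only the learnable scale and shift. As a small sketch (the variable names are assumptions), the data statistics can be inspected separately, since PyTorch stores them as buffers rather than parameters:

```python
import torch.nn as nn

bn = nn.BatchNorm2d(10)

# Learnable tensors, updated by the optimizer
learnable = {name: tuple(p.shape) for name, p in bn.named_parameters()}
# Statistics tracked from the data, updated during forward passes in training mode
statistics = {name: tuple(b.shape) for name, b in bn.named_buffers()}

print(learnable)
print(statistics)
```

The buffers are saved in the model's `state_dict` but are not returned by `parameters()`, which is why `num_params` does not count them.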

# Efficiency in neural networks

Besides creating better models, the deep learning revolution has reached edge devices such as phones, drones, cars and more.

The race to create compact yet sufficiently accurate models has begun.


## Separable Convolutions

We already know that the convolution operation is a sparse matrix multiplication.

What if we could decompose this matrix?

### Spatially Separable Convolutions

The spatially separable convolution operates on the 2D spatial dimensions of images, i.e. height and width. Conceptually, spatially separable convolution decomposes a convolution into two separate operations. For an example shown below, a Sobel kernel, which is a 3x3 kernel, is divided into a 3x1 and 1x3 kernel.

<center><img src="img/sobel.png" width="500" alt="scale"></center>

Applied to an image, it looks like this:

<center><img src="img/SSC.png" width="600" alt="scale"></center>

A $1\times H \times W$ input image convolved with an $f\times f$ kernel (no padding) requires $(H-f+1)\times(W-f+1)\times f \times f$ multiplications; for $f=3$ this is $(H-2)\times(W-2)\times 3 \times 3$.

A spatially separable convolution requires $(H-f+1) \times W \times f  +  (H-f+1) \times (W-f+1)\times f$ multiplications.

The drawback is that only a restricted family of kernels can be decomposed this way: the kernel must be the outer product of a column vector and a row vector (a rank-1 matrix).
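
The following sketch checks the Sobel decomposition numerically and evaluates the multiplication counts. The 7x7 input size and the helper names `mults_full` and `mults_separable` are assumptions for the example; the counts use an output height of $H-f+1$, which equals $H-2$ for a 3x3 kernel.

```python
import torch
import torch.nn.functional as F

# Sobel kernel written as the outer product of a 3x1 column and a 1x3 row
col = torch.tensor([[1.0], [2.0], [1.0]])    # shape (3, 1)
row = torch.tensor([[-1.0, 0.0, 1.0]])       # shape (1, 3)
sobel = col @ row                            # shape (3, 3)

x = torch.randn(1, 1, 7, 7)                  # one single-channel 7x7 image

# A single 3x3 convolution (no padding)
full = F.conv2d(x, sobel.view(1, 1, 3, 3))
# The same result from a 3x1 convolution followed by a 1x3 convolution
separable = F.conv2d(F.conv2d(x, col.view(1, 1, 3, 1)), row.view(1, 1, 1, 3))
same_result = torch.allclose(full, separable, atol=1e-5)


def mults_full(h, w, f):
    """Multiplications for one f x f kernel on an h x w image, no padding."""
    return (h - f + 1) * (w - f + 1) * f * f


def mults_separable(h, w, f):
    """Multiplications for an f x 1 kernel followed by a 1 x f kernel, no padding."""
    return (h - f + 1) * w * f + (h - f + 1) * (w - f + 1) * f


print(same_result, mults_full(7, 7, 3), mults_separable(7, 7, 3))
```

For the 7x7 image this gives 225 multiplications for the full kernel versus 180 for the separable version; the gap widens as the kernel size grows.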

## Depthwise Separable Convolutions:

Another way to reduce the computational complexity is to divide the input channels among the convolution kernels.

### Depthwise Convolution

Let's look at 128 convolution kernels of 3x3 with input of size 3x7x7:

<center><img src="img/conv1.png" width="700" alt="scale"></center>

We see that each channel of the output contains information from all input channels.

However, we can treat each input channel separately and convolve it with kernels of depth 1:


<center><img src="img/conv_dept.png" width="700" alt="scale"></center>


What seems to be the problem?

In order to aggregate the information and create a rich feature map, we add another step: a 1x1 convolution that expands the feature map:

<center><img src="img/conv_dept2.png" width="700" alt="scale"></center>


In [6]:
class depthwise_separable_conv(nn.Module):
    def __init__(self, nin, kernels_per_layer, nout):
        super().__init__()
        # Depthwise: groups=nin gives every input channel its own set of 3x3 kernels
        self.depthwise = nn.Conv2d(nin, nin * kernels_per_layer, kernel_size=3, padding=1, groups=nin)
        # Pointwise: a 1x1 convolution that mixes information across channels
        self.pointwise = nn.Conv2d(nin * kernels_per_layer, nout, kernel_size=1)

    def forward(self, x):
        out = self.depthwise(x)
        out = self.pointwise(out)
        return out

In [20]:
# Regular 3x3 convolution with the same input/output channels as the DS conv (10 -> 128)
conv = nn.Conv2d(input_fm.shape[1], 128, kernel_size=3, padding=1)
dsc = depthwise_separable_conv(input_fm.shape[1],3,128)
print(f'input shape: {input_fm.shape}')
print(f'DS Conv output shape: {dsc(input_fm).shape}')


input shape: torch.Size([1, 10, 32, 32])
DS Conv output shape: torch.Size([1, 128, 32, 32])


In [21]:
print(f'simple convolution has: {num_params(conv)} parameters')
print(f'depthwise separable conv has: {num_params(dsc)} parameters')

simple convolution has: 11648 parameters
depthwise separable conv has: 4268 parameters
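
To see where the savings come from, here is a small worked count for the earlier illustration of 128 kernels of size 3x3 on a 3x7x7 input (no padding, so a 5x5 output). The helper names are hypothetical, biases are ignored, and one depthwise kernel per input channel is assumed.

```python
def conv_cost(c_in, c_out, k, h_out, w_out):
    """Weights and multiplications of a regular k x k convolution (no bias)."""
    weights = k * k * c_in * c_out
    return weights, weights * h_out * w_out


def depthwise_separable_cost(c_in, c_out, k, h_out, w_out):
    """Weights and multiplications of a depthwise k x k conv plus a pointwise 1x1 conv (no bias)."""
    depthwise = k * k * c_in       # one k x k kernel per input channel
    pointwise = c_in * c_out       # 1x1 kernels that mix the channels
    weights = depthwise + pointwise
    return weights, weights * h_out * w_out


regular = conv_cost(3, 128, 3, 5, 5)
separable = depthwise_separable_cost(3, 128, 3, 5, 5)
print(f'regular:   {regular[0]} weights, {regular[1]} multiplications')
print(f'separable: {separable[0]} weights, {separable[1]} multiplications')
```

In this setting the regular convolution needs 3456 weights and 86400 multiplications, while the depthwise separable version needs 411 weights and 10275 multiplications, roughly 8.4 times fewer.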


### SqueezeNet

SqueezeNet was an early attempt to build a model suited to mobile devices.

The main building block in SqueezeNet is the **Fire module**


<center><img src="img/SqueezeNet.png" width="500" /></center>

It first has a **squeeze layer**. This is a 1×1 convolution that reduces the number of channels, for example from 64 to 16 in the above picture.

This is followed by an **expand block**: two parallel convolution layers: one with a 1×1 kernel, the other with a 3×3 kernel. These conv layers also increase the number of channels again, from 16 back to 64. Their outputs are concatenated, so the output of this fire module has 128 channels in total.

The same idea exists in ResNet bottleneck blocks as well, yet the model has only about 10% of the parameters of ResNet18!
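
A minimal sketch of the Fire module with the channel counts from the description above (64 -> 16 -> 64 + 64). The class name `Fire`, the ReLU placement after each convolution, and the 32x32 input are assumptions for the example rather than a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Fire(nn.Module):
    """Hypothetical Fire module: a 1x1 squeeze followed by parallel 1x1 and 3x3 expand layers."""

    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)

    def forward(self, x):
        s = F.relu(self.squeeze(x))
        # Concatenate the two expand branches along the channel dimension
        return torch.cat([F.relu(self.expand1x1(s)), F.relu(self.expand3x3(s))], dim=1)


fire = Fire(64, 16, 64)
out = fire(torch.randn(1, 64, 32, 32))
fire_params = sum(p.numel() for p in fire.parameters())

# For comparison: a plain 3x3 convolution with the same input/output channels
plain = nn.Conv2d(64, 128, kernel_size=3, padding=1)
plain_params = sum(p.numel() for p in plain.parameters())

print(out.shape, fire_params, plain_params)
```

With these sizes the Fire module has 11408 parameters, compared with 73856 for the plain 3x3 convolution that produces the same 128 output channels.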

#### Performance (ImageNet 1K):

* Accuracy: 57.5%

* Parameters: 1.25M

For comparison with other networks: ResNet18 has 11.75M parameters (accuracy 69.758%), ResNet50 has 23.5M (accuracy 80.8%), and VGG16 has 134.7M (accuracy ~64%).

## MobileNet

The main idea was to use **Depthwise Separable Convolutions** instead of the expensive regular ones.

MobileNet v1 consists of 13 convolution blocks in a row. It does not use max pooling to reduce the spatial dimensions, but some of the depthwise layers have stride 2.

At the end is a global average pooling layer followed by a fully-connected layer.

ReLU6 is often used as the activation function instead of plain ReLU, due to its increased robustness when used with low-precision computation.

In [9]:
def relu(x):
    return max(0, x)
def relu6(x):
    return min(max(0, x), 6)
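
Putting the pieces together, one MobileNet v1 convolution block might look like the sketch below: a depthwise 3x3 convolution followed by a pointwise 1x1 convolution, each with batch norm and ReLU6. The function name `mobilenet_v1_block`, the channel counts and the input size are assumptions for the example.

```python
import torch
import torch.nn as nn


def mobilenet_v1_block(in_ch, out_ch, stride=1):
    """Hypothetical MobileNet v1 style block: depthwise 3x3 + pointwise 1x1, each with BN and ReLU6."""
    return nn.Sequential(
        # Depthwise: one 3x3 filter per input channel; the stride handles downsampling
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU6(inplace=True),
        # Pointwise: a 1x1 convolution that mixes channels and sets the output width
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU6(inplace=True),
    )


block = mobilenet_v1_block(32, 64, stride=2)
y = block(torch.randn(1, 32, 112, 112))
print(y.shape)
```

With `stride=2` in the depthwise layer, the spatial size is halved without any max pooling, and the ReLU6 at the end keeps all activations in the range [0, 6].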

#### Performance (ImageNet 1K):

* Accuracy: 70.9%

* Parameters: 4.2M


### MobileNet V2

Recall that in ResNet, we had bottleneck blocks.

Those go wide->narrow->wide in order to reduce the computational complexity.

In MobileNet, depthwise separable convolutions already break down the computational complexity, so the network does not benefit much from this setup.

MobileNet V2 introduced **Inverted Residuals**, in which the blocks go narrow->wide->narrow:


<center><img src="img/IRB.png" width="800" /></center>

The authors describe this idea as an inverted residual block because the skip connections exist between the narrow parts of the network, which is the opposite of how an original residual connection works.

#### Linear bottlenecks:

The reason we use non-linear activation functions in neural networks is that, without them, multiple matrix multiplications could be collapsed into a single linear operation. The non-linearity is what allows us to build neural networks with multiple meaningful layers. At the same time, the ReLU activation function, which is commonly used in neural networks, discards values that are smaller than 0. This loss of information can be tackled by increasing the number of channels in order to increase the capacity of the network.

The authors introduced the idea of a linear bottleneck, where the last convolution of a residual block has a linear output (no activation) before it is added to the initial activations.
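
The inverted residual block with a linear bottleneck can be sketched as follows. The class name `InvertedResidual`, the expansion ratio of 6, and the channel counts are assumptions for the example; the skip connection is only used when the input and output shapes match.

```python
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    """Hypothetical narrow -> wide -> narrow block with a linear (activation-free) projection."""

    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 expansion: narrow -> wide
            nn.Conv2d(in_ch, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution on the wide representation
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 projection: wide -> narrow, with no activation (linear bottleneck)
            nn.Conv2d(hidden, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        # The skip connection links the two narrow ends of the block
        return x + out if self.use_skip else out


x = torch.randn(1, 24, 56, 56)
same = InvertedResidual(24, 24)(x)
down = InvertedResidual(24, 32, stride=2)(x)
print(same.shape, down.shape)
```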

<center><img src="img/IRB_linear.png" width="800" /></center>


#### Performance (ImageNet 1K):

* Accuracy: 71.8%

* Parameters: 3.47M

### MobileNet V3

An improvement over V2 with some tricks:

* Also uses [Squeeze-and-Excitation Networks](https://arxiv.org/pdf/1709.01507.pdf)
* Neural Architecture Search for block-wise search
* [NetAdapt](https://arxiv.org/pdf/1804.03230.pdf) for layer-wise search
* Network improvements: layer removal and H-swish

$\mathrm{Hswish}(x) = x \cdot \frac{\mathrm{ReLU6}(x+3)}{6}$
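
As a quick check of the formula, the sketch below implements H-swish directly and compares it with PyTorch's built-in `nn.Hardswish` and with the smooth swish function $x \cdot \sigma(x)$ that it approximates. The function names and the evaluation grid are assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def hswish(x):
    """H-swish as defined above: x * ReLU6(x + 3) / 6."""
    return x * F.relu6(x + 3.0) / 6.0


def swish(x):
    """The smooth swish function that H-swish approximates."""
    return x * torch.sigmoid(x)


x = torch.linspace(-6.0, 6.0, steps=25)
matches_builtin = torch.allclose(hswish(x), nn.Hardswish()(x), atol=1e-6)
max_gap = (hswish(x) - swish(x)).abs().max().item()

print(matches_builtin, round(max_gap, 3))
```

H-swish avoids computing a sigmoid, which is relatively expensive on mobile hardware, while staying close to swish over the whole range.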

<center><img src="img/swish.png" width="600" /></center>


#### Performance (ImageNet 1K):

#### small:
* Accuracy: 67.5%
* Parameters: 2.9M

#### large:
* Accuracy: 75.2%
* Parameters: 5.4M

## ShuffleNet

Many modern architectures use lots of dense 1×1 convolutions, also known as pointwise convolutions, but they can be relatively expensive. To bring down this cost, we can use group convolutions on those layers. Group convolutions have a side effect (what is it?), which can be mitigated using a channel shuffle operation.

A group-wise convolution divides the input feature maps into two or more groups in the channel dimension, and performs convolution separately on each group. It is the same as slicing the input into several feature maps of smaller depth, and then running a different convolution on each.

<center><img src="img/ChannelShuffle.png" width="600" /></center>


You can read more [here](https://arxiv.org/pdf/1707.01083.pdf).
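
A minimal sketch of the channel shuffle operation, implemented with a reshape and a transpose so that the next group convolution sees channels from every group. The function name `channel_shuffle` and the channel counts are assumptions for the example.

```python
import torch
import torch.nn as nn


def channel_shuffle(x, groups):
    """Interleave the channels of the different groups."""
    n, c, h, w = x.shape
    assert c % groups == 0, "the channel count must be divisible by the number of groups"
    x = x.view(n, groups, c // groups, h, w)   # split the channels into groups
    x = x.transpose(1, 2).contiguous()         # swap the group and per-group axes
    return x.view(n, c, h, w)                  # flatten back to (n, c, h, w)


# Six channels labelled 0..5, in two groups: [0, 1, 2] and [3, 4, 5]
x = torch.arange(6, dtype=torch.float32).view(1, 6, 1, 1)
order = channel_shuffle(x, groups=2).flatten().tolist()
print(order)

# A grouped 1x1 convolution has `groups` times fewer weights than a dense one
dense = nn.Conv2d(240, 240, kernel_size=1, bias=False)
grouped = nn.Conv2d(240, 240, kernel_size=1, groups=3, bias=False)
dense_params = sum(p.numel() for p in dense.parameters())
grouped_params = sum(p.numel() for p in grouped.parameters())
print(dense_params, grouped_params)
```

After the shuffle the channel order is `[0, 3, 1, 4, 2, 5]`, so each group of the following layer receives channels that originated in both groups.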


#### Performance (ImageNet 1K):

#### V1:
* Accuracy: 69.4%
* Parameters: 2.3M

#### V2 large:
* Accuracy: 77.1%
* Parameters: 6.7M

## EfficientNet: The King of fixed solutions

Built with an **AutoML NAS** framework.

The architecture search optimizes for maximum accuracy, but also penalizes networks that are computationally heavy.

A network is also penalized for slow inference, i.e. when it takes a long time to make predictions. The architecture uses a mobile inverted bottleneck convolution similar to MobileNet V2 but is much larger due to the increase in FLOPS. This baseline model is scaled up to obtain the family of EfficientNets.

  <center><img src="img/eff.png" width="800" /></center>


And the second version:

  <center><img src="img/effv2.png" width="2500" /></center>



The model was built using more advanced NAS methods.

Original code for [V1](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet) and [V2](https://github.com/google/automl/tree/master/efficientnetv2)

Recommended PyTorch implementations (since Google doesn't like us...) can be found in
[Timm](https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientnet.py) or [Luke Melas-Kyriazi](https://github.com/lukemelas/EfficientNet-PyTorch).


##### Custom solutions

Today, we have a good understanding that edge devices have different hardware architectures, so we might want to build custom solutions.

Many works use NAS with specific hardware constraints in order to find a fast and accurate solution.


Other methods exist for accelerating inference or training time.


Low-rank factorization | Quantization
- | -
![alt](img/low_rank.png) | ![alt](img/ste.png)
Pruning | Knowledge distillation
![alt](img/prune.png) | ![alt](img/kd.png)

#### Thanks!

**Credits**

This tutorial was written by [Moshe Kimhi](https://mkimhi.github.io/).<br>
To re-use, please provide attribution and link to the original.

Some images in this tutorial were taken and/or adapted from the following sources:

- Sebastian Raschka, https://sebastianraschka.com/
- Deep Learning, Goodfellow, Bengio and Courville, MIT Press, 2016
- Fundamentals of Deep Learning, Nikhil Buduma, O'Reilly 2017
- Deep Learning with Python, François Chollet, Manning 2018
- Stanford cs231n course notes by Andrej Karpathy
- Ketan Doshi on Medium
- "A Survey of Quantization Methods for Efficient Neural Network Inference", A. Gholami 2021
- "Learning both Weights and Connections for Efficient Neural Networks", Song Han, 2015
- "Rethinking the Knowledge Distillation From the Perspective of Model Calibration" Lehan Yang 2021