# Week 3 - Compact Network Design and Knowledge Distillation

This week, you will learn about the design of compact neural networks and knowledge distillation. Both methods are highly important when deploying deep neural networks on edge devices because they reduce the memory, computation, and energy requirements of neural networks!

# 1. Notations

## 1.1 Neural Network Compression

## 1.2 Compact Network Design (1/3)

1. SqueezeNet
- Fire module:
    - Strategy: design a CNN architecture with few parameters while maintaining competitive accuracy
    - Replace 3x3 filters with 1x1 filters (9x fewer parameters per filter)
    - Decrease the number of input channels to 3x3 filters
    - Downsample late in the network so that convolution layers have large activation maps. The intuition is that large activation maps (due to delayed downsampling, for example postponing layers with stride > 1) can lead to higher classification accuracy. This strategy is about maximizing accuracy on a limited budget of parameters.
    
- ReLU is applied to activation from squeeze and expand layers
- Dropout with 50% applied after fire9
- Note the lack of fully-connected layers

- Although the model is smaller, its computational complexity was ignored
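
The parameter savings behind the strategies above can be checked with a little arithmetic. The sketch below is only illustrative: the helper names `conv_params` and `fire_params` are made up for these notes, and the channel sizes in the usage example (96 inputs, 16 squeeze filters, 64 + 64 expand filters) are assumed to correspond to the first Fire module of SqueezeNet.

```python
def conv_params(in_channels, out_channels, kernel_size, bias=True):
    """Learnable parameters of a standard 2D convolution."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def fire_params(in_channels, squeeze, expand1x1, expand3x3):
    """Parameters of a Fire module: a 1x1 squeeze layer followed by
    parallel 1x1 and 3x3 expand layers that both read the squeezed output."""
    squeeze_layer = conv_params(in_channels, squeeze, 1)
    expand_1x1 = conv_params(squeeze, expand1x1, 1)
    expand_3x3 = conv_params(squeeze, expand3x3, 3)
    return squeeze_layer + expand_1x1 + expand_3x3


if __name__ == "__main__":
    # A 3x3 filter has 9 weights per input channel, a 1x1 filter has 1.
    print(conv_params(64, 64, 3, bias=False) // conv_params(64, 64, 1, bias=False))
    # Fire module versus a plain 3x3 convolution with the same input/output width.
    print(fire_params(96, 16, 64, 64), conv_params(96, 128, 3))
```

The squeeze layer is what makes the 3x3 filters cheap: they only see 16 input channels instead of 96.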

2. Xception Network

- Introduces the concept of Depthwise Separable Convolution (Section 14.4.3 of Probabilistic Machine Learning (Murphy))
- Number of groups = Number of channels -> results are concatenated after a 1x1 convolution
- Maximizes the decoupling of spatial correlation and channel correlation
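
As a rough illustration of why decoupling spatial and channel correlation saves parameters, the following sketch compares the weight count of a standard convolution with that of a depthwise separable one (biases ignored). The function names are hypothetical and the 256-channel example is an arbitrary choice.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a depthwise k x k convolution (groups = c_in, one filter
    per channel) followed by a 1x1 pointwise convolution (no bias)."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    c_in, c_out, k = 256, 256, 3
    full = standard_conv_params(c_in, c_out, k)
    separable = depthwise_separable_params(c_in, c_out, k)
    # The ratio equals 1 / c_out + 1 / k**2.
    print(full, separable, separable / full)
```

For 3x3 kernels and a reasonably wide layer, the separable version needs roughly one ninth of the weights.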

## 1.3 Compact Network Design (2/3)

1. MobileNet V1
- Description of two model shrinking hyperparameters (unified network-dimension scaling mechanism)
    - Width multiplier (thinner models)
    - Resolution multiplier (reduced representation)
- Depthwise Separable Convolution
    - Splits convolutions into two layers: filtering and combining
    - Effect: drastically reducing computation and model size
- ReLU6 instead of ReLU
- Very little or no weight decay (L2 regularization) on the depthwise filters
- Concentrates almost all computation into dense 1x1 convolution layers
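
The effect of the two multipliers can be sketched by counting multiply-adds for one depthwise separable layer. This is a simplified model under stated assumptions: `alpha` stands for the width multiplier and `rho` for the resolution multiplier, the function name is made up, and the layer size in the example (512 channels on a 14x14 feature map) is only illustrative.

```python
def separable_mult_adds(kernel, c_in, c_out, feature_size, alpha=1.0, rho=1.0):
    """Multiply-adds of one depthwise separable layer.

    alpha thins the layer (fewer channels), rho shrinks the feature map.
    Returns (depthwise_cost, pointwise_cost).
    """
    m = int(alpha * c_in)           # thinned input channels
    n = int(alpha * c_out)          # thinned output channels
    f = int(rho * feature_size)     # reduced spatial size
    depthwise = kernel * kernel * m * f * f
    pointwise = m * n * f * f
    return depthwise, pointwise


if __name__ == "__main__":
    dw, pw = separable_mult_adds(3, 512, 512, 14)
    print("share of 1x1 convolutions:", pw / (dw + pw))
    dw_thin, pw_thin = separable_mult_adds(3, 512, 512, 14, alpha=0.5)
    print("cost with alpha = 0.5:", (dw_thin + pw_thin) / (dw + pw))
```

The pointwise term dominates, which matches the observation that almost all computation ends up in the 1x1 convolutions, and the total cost falls roughly with `alpha` squared and `rho` squared.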

2. MobileNet V2
- Significantly decreases the number of operations and the memory needed
- Novel layer module: the inverted residual with linear bottleneck
- Inverted bottleneck design:
    - Linear bottleneck:
        - Attempt to preserve the information of the manifold of interest
        - Is crucial to prevent non-linearities from destroying too much information
        - The ratio between the size of the input bottleneck and the inner size is referred to as the expansion ratio.
    - Inverted residuals
        - Improve the ability of a gradient to propagate across multiplier layers
        - More memory efficient
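
To make the expansion ratio concrete, the sketch below traces the channel widths through one inverted residual block and counts its weights (biases and batch norm ignored). The function names are hypothetical, and the default expansion ratio of 6 and the 24-channel example are assumptions chosen for illustration.

```python
def inverted_residual_layers(c_in, c_out, expansion=6, stride=1):
    """Describe the three layers of an inverted residual block as
    (name, input_channels, output_channels) and report whether the
    shortcut between the two narrow bottlenecks can be used."""
    hidden = c_in * expansion
    layers = [
        ("1x1 expand + ReLU6", c_in, hidden),
        ("3x3 depthwise + ReLU6", hidden, hidden),
        ("1x1 project (linear, no activation)", hidden, c_out),
    ]
    use_shortcut = stride == 1 and c_in == c_out
    return layers, use_shortcut


def inverted_residual_params(c_in, c_out, expansion=6, kernel=3):
    """Weights of the block: expand + depthwise + linear projection."""
    hidden = c_in * expansion
    return c_in * hidden + hidden * kernel * kernel + hidden * c_out


if __name__ == "__main__":
    layers, shortcut = inverted_residual_layers(24, 24)
    for name, cin, cout in layers:
        print(f"{name}: {cin} -> {cout}")
    print("shortcut:", shortcut, "params:", inverted_residual_params(24, 24))
```

The shortcut connects the thin bottleneck tensors rather than the wide expanded ones, which is why the design is called "inverted" and why it needs less memory for the tensors kept alive by the skip connection.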

3. MobileNet V3
- Starts the exploration of how automated search algorithms and network design can work together
- Combination of layers used in MobileNet V1 and V2 as building blocks
- Layers are also upgraded with modified swish nonlinearities (h-swish: the sigmoid is replaced by a hard sigmoid, which is cheaper to compute while maintaining accuracy)
- Use platform-aware NAS to search global network structure
- Use NetAdapt algorithm to search per layer for number of filters
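
The modified swish nonlinearity (h-swish) can be written in a few lines. The sketch below uses scalar Python functions for readability; the function names are chosen for these notes, and the piecewise-linear form `x * ReLU6(x + 3) / 6` is the commonly cited definition.

```python
import math


def relu6(x):
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def hard_sigmoid(x):
    """Piecewise-linear approximation of the sigmoid."""
    return relu6(x + 3.0) / 6.0


def hard_swish(x):
    """h-swish: swish with the sigmoid replaced by the hard sigmoid."""
    return x * hard_sigmoid(x)


def swish(x):
    """Original swish, x * sigmoid(x), for comparison."""
    return x / (1.0 + math.exp(-x))


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(x, round(swish(x), 4), round(hard_swish(x), 4))
```

The hard version avoids the exponential, which matters on mobile hardware, while staying close to the smooth curve.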

## 1.4 Compact Network Design (3/3)

1. EfficientNet
- Carefully balancing network depth, width, and resolution can lead to better performance
- New scaling method that uniformly scales all dimensions of depth/width/resolution using a simple compound coefficient (compound scaling method)
- Baseline network developed by leveraging a multi-objective neural architecture search that optimizes accuracy and FLOPS
- Main building block is mobile inverted bottleneck MBConv
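
Compound scaling can be sketched as follows: a single coefficient `phi` raises fixed base factors for depth, width, and resolution to the same power. The base values `alpha = 1.2`, `beta = 1.1`, `gamma = 1.15` are assumed here from the EfficientNet paper's grid search, and the function name is made up for these notes.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution, flops) multipliers for a given phi.

    FLOPs grow linearly with depth and quadratically with width and
    resolution, so they scale with (alpha * beta**2 * gamma**2) ** phi.
    The base factors are chosen so that this product is close to 2.
    """
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    flops = (alpha * beta ** 2 * gamma ** 2) ** phi
    return depth, width, resolution, flops


if __name__ == "__main__":
    for phi in range(4):
        d, w, r, f = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
              f"resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Each increment of `phi` therefore roughly doubles the compute budget while keeping the three dimensions balanced.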


2. ShuffleNet V2
- Proposes two principles for effective network architecture design
    - The direct metric (e.g. speed) should be used instead of indirect ones (e.g. FLOPS)
    - Such metric should be evaluated on the target platform
- ShuffleNet V1 architecture
    - Pointwise group convolution
        - Increases MAC (memory access cost)
    - Bottleneck-like structures
        - Increase MAC
    - Channel shuffle operation: enables information exchange between different groups of channels
        - Occupies a considerable amount of time, especially on GPU
- Channel split and ShuffleNet V2:
    - Channels are split into two branches at the beginning of each unit
    - The element-wise add operation of ShuffleNet V1 no longer exists; the two branches are concatenated instead
    - Depthwise convolutions exist only in one branch
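
The channel shuffle operation mentioned above is usually described as "reshape, transpose, flatten". A minimal sketch on a plain list of channel indices (the function name is chosen for these notes; a real implementation would apply the same permutation to the channel axis of a tensor):

```python
def channel_shuffle(channels, groups):
    """Interleave channels from different groups.

    Conceptually: view the channel list as a (groups, channels_per_group)
    matrix, transpose it, and flatten it again.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


if __name__ == "__main__":
    # Channels 0-2 belong to group 0, channels 3-5 to group 1.
    print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))
```

After the shuffle, every group of the next grouped convolution sees channels that originated in all groups of the previous one.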

3. GhostNet

4. AsymmNet
- Based on proposed asymmetrical bottleneck design
- Follow the basic network architecture of MobileNetV3
- The main building blocks of AsymmNet consist of a sequence of stacked asymmetrical bottleneck blocks
    - Gradually downsample the feature map resolution and increase the channel number
    

5. RepVGG Net



## 1.5 Knowledge Distillation (1/2)

- According to the quiz question, energy consumption is reduced because KD makes it necessary to deploy only the student model

## 1.6 Knowledge Distillation (2/2)

- Loss function (distillation loss): Kullback-Leibler divergence
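
A minimal sketch of a prediction-based distillation loss, assuming the common formulation in which teacher and student logits are softened with the same temperature and compared with the Kullback-Leibler divergence. The function names, the default temperature of 4, and the scaling by the squared temperature are assumptions for illustration, not necessarily the exact setup used in the lecture.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer outputs."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL divergence between softened teacher and student predictions,
    scaled by temperature**2 to keep gradient magnitudes comparable."""
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, q_student)


if __name__ == "__main__":
    teacher = [8.0, 2.0, 1.0]
    student = [5.0, 3.0, 0.5]
    print(softmax(teacher, 1.0))
    print(softmax(teacher, 4.0))
    print(distillation_loss(teacher, student))
```

Raising the temperature moves probability mass to the non-target classes, which is how the student gets to learn from informative negative labels.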


# 2. Quizzes 

## 2.1 Neural Network Compression

**Question 1**

Why do we have to run models such as ResNet-152 on powerful servers in the cloud?

- [ ] Edge devices are typically not allowed to draw the amount of current necessary to run deep neural networks
    **Current is not a problem since with less processing power, we would only need to wait longer**
- [x] Small edge devices are not powerful enough to compute 11.3 billion floating point operations with multiple frames per second **True**
- [ ] Small edge devices typically use a different instruction set that cannot be used to run deep neural networks **Not true**
- [x] The storage needs cannot be fulfilled by small edge devices **Yep**

**Question 2**

Which of the following methods can be used to squeeze deep models?

- [x] model quantization/binarization
- [ ] model parallelization
- [x] knowledge distillation
- [x] model pruning

## 2.2 Compact Network Design (1/3)

**Question 1**

What is the name of the block architecture introduced by SqueezeNet?

- [ ] Ice 
- [ ] Water 
- [ ] Soil
- [x] Fire

**Question 2**

Why is SqueezeNet architecture important?

- [x] It was one of the first models that achieved AlexNet-level accuracy with 50x fewer parameters and less than 0.5MB model size
- [ ] It was one of the first network architectures that solely consisted of 1x1 convolutions, which have fewer learnable parameters and use less storage. Thus, SqueezeNet was able to achieve AlexNet accuracy with a model size of less than 0.25MB
- [ ] It was one of the first functioning image classification models with less than 0.5MB model size 

**Question 3**

How many groups $g$ does a group convolution used as depthwise separable convolution have if the number of channels is defined as $c$?

- [x] g = c
- [ ] it doesn't matter
- [ ] g = 2c
- [ ] g = c/2 

## 2.3 Compact Network Design (2/3)

**Question 1**

Which of these are important building blocks of MobileNet V1?

- [x] ReLU6
- [ ] nearly all computations happen in 3x3 convolutions
- [x] very little or no weight decay (**on depthwise filters**)
- [ ] usage of residual connections

**Question 2**

Which of the following convolution functions and concepts are not required for a 1x1 convolution?

- [ ] memory optimizations
- [x] im2col
- [ ] GEMM 

**Question 3**

MobileNet V1 continues the leading position in DNN backbone design of which company?

- [ ] Amazon
- [ ] Facebook/Meta
- [ ] Alibaba
- [x] Google 

**Question 4**

Which additional activation function is used in MobileNetV3?

- [ ] GELU
- [ ] Mish
- [ ] PReLU
- [x] h-swish

## 2.4 Compact Network Design (3/3)

**Question 1**

Which of the following parameters can be scaled in EfficientNet?

- [ ] Height
- [x] Resolution
- [x] Depth
- [x] Width

**Question 2**

Given a description of the following characteristics, which network architecture are we describing? "Feature reuse at inverted bottleneck blocks by copying old features and computing new features. Enhancing the expressiveness by using depth-wise layers to extend the width."

- [ ] EfficientNet
- [ ] ShuffleNetv2
- [x] AsymmNet
- [ ] GhostNet 

**Question 3**

Which techniques are the main characteristics of RepVGG?

- [ ] channel shuffling
- [ ] shortcut connections
- [x] over parameterization
- [x] linear combination 

**Question 4**

Which of these convolution kernel sizes is the most friendly one for accelerators?

- [ ] 9x9
- [ ] 5x5
- [x] 3x3
- [ ] 7x7 


## 2.5 Knowledge Distillation (1/2)

**Question 1**

Which of the following statements about knowledge distillation are true?

- [x] Knowledge distillation can be used to compress the knowledge of an ensemble of networks into a student network
- [ ] Knowledge Distillation always involves the usage of multiple teacher models
- [x] Knowledge distillation is a form of model compression where the knowledge of a large teacher model is compressed into a smaller student model
- [ ] Knowledge distillation is the process of scaling the knowledge of a small teacher network to a larger student network 

**Question 2**

Why do we use soft labels instead of hard labels for knowledge distillation?

- [ ] soft labels provide more sparse information about the label, leading to more confident predictions
- [ ] soft labels are simpler to annotate, leading to larger training datasets
- [x] soft labels have less gradient variance, leading to smoother training
- [ ] soft labels have a lower entropy, thus allowing the model to learn from a stronger supervision. 

**Question 3**

Why can knowledge distillation be used to reduce the energy consumption of AI computing?

- [x] the models train faster
- [x] it is only necessary to deploy the student model
- [ ] the student model is automatically quantized, thus requiring less memory
- [ ] the overall training process for the models requires less energy 

## 2.6 Knowledge Distillation (2/2)

**Question 1**

When do we set a higher temperature for knowledge distillation?

- [ ] If we have a model with fewer parameters
- [ ] If we want to avoid being affected by noise in negative labels
- [x] If we want to learn from negative labels that are informative

**Question 2**

What kind of knowledge can be distilled?

- [ ] Inheritance based knowledge
- [x] Feature based knowledge
- [x] Prediction based knowledge
- [ ] Weights based knowledge

**Question 3**

Which loss function is used to distill feature based knowledge?

- [ ] Soft Cross Entropy
- [x] Mean Squared Error
- [ ] Kullback-Leibler divergence
- [ ] Mean Absolute Error

**Question 4**

Which distillation scheme is especially effective if the training labels are noisy?

- [ ] Offline distillation
- [ ] Online distillation
- [ ] Multi-Teacher distillation
- [x] Self distillation