# Week 4 - Advanced Deep Compression Methods

## 1. Notes

### 1.1 Network Pruning

1. Pruning method:
    - Method for model compression
    
    - Dropout has a similar effect, but it removes units only temporarily during training and serves regularization rather than compression
    
    - Basic pruning methods:
        - Individual weight pruning (unstructured pruning method)
            - Rank the weights by absolute value (L1 norm)
            - Set the lowest-ranked x% of the weights to zero
            
        - Neuron (channel) pruning (structured pruning method)
            - Rank the weight columns by their L2 norm
            - Set the lowest-ranked x% of the weight columns to zero
            - This is equivalent to deleting the corresponding output neurons
            
    - Unstructured pruning
        - Prunes individual parameters
        
    - Structured pruning
        - Considers parameters in groups, removing entire neurons, filters, or channels
            
    - Iterative pruning
        - Alternates pruning and fine-tuning in small steps to avoid performance degradation
        
        - Disadvantages
            - Repeated rounds of pruning and fine-tuning are time-consuming
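
The two basic pruning methods above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the `[inputs][outputs]` weight layout (so that a column corresponds to one output neuron), and the example matrix `W` are assumptions made for this sketch.

```python
import math

def prune_unstructured(weights, fraction):
    """Zero the given fraction of individual weights with the smallest |w|."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * fraction)      # number of weights to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[k - 1]            # largest magnitude that is pruned
    # Note: ties at the threshold are pruned as well.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

def prune_structured(weights, fraction):
    """Zero the given fraction of columns (output neurons) with the smallest L2 norm."""
    n_cols = len(weights[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in weights)) for j in range(n_cols)]
    k = int(n_cols * fraction)               # number of columns to remove
    dropped = set(sorted(range(n_cols), key=lambda j: norms[j])[:k])
    return [[0.0 if j in dropped else w for j, w in enumerate(row)] for row in weights]

# Hypothetical 2-input, 4-output weight matrix.
W = [[0.1, -2.0, 0.3, 0.2],
     [-0.4, 1.0, -0.05, -2.5]]

unstructured = prune_unstructured(W, 0.5)    # zeros scattered across the matrix
structured = prune_structured(W, 0.5)        # whole columns 0 and 2 are zeroed
```

With the same 50% budget, the unstructured variant keeps `-0.4` but removes `0.2`, while the structured variant removes whole columns and therefore whole output neurons.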

### 1.2 Dynamic Network

- Dynamic neural networks can adapt their structures or parameters to different inputs

- Advantages:
    - Efficiency (computational and data)
    - Representation power (an enlarged parameter space)
    - Adaptiveness (trade-off between accuracy and efficiency)
    - Compatibility (applicable to most advanced deep learning techniques)
    - Generality (wide range of applications)
    - Interpretability
    
- Three main categories:
    - Sample-wise dynamic
    - Spatial-wise dynamic
    - Temporal-wise dynamic
    
- AutoSlim
    - Slimmable network: 

### 1.3 Supernet-Subnets Architecture Search

1. Single Stage: BigNAS 
    - Train a single model that can be sliced according to the resource budget without any post-processing
    - Several problems had to be addressed for this to work: initialization, regularization, convergence behavior, and batch norm calibration.

2. Single Path One-Shot

3. Once-for-all
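
The idea of slicing one trained model into smaller ones can be illustrated with a single fully connected layer. This is only a sketch under assumptions: taking the leading inputs and outputs of the shared weight matrix is one common weight-sharing convention, and `slice_layer`, `linear`, `SUPER_W`, and `SUPER_B` are hypothetical names, not part of any of the methods listed above.

```python
def slice_layer(weights, bias, n_in, n_out):
    """Return the sub-layer that uses the first n_in inputs and first n_out outputs."""
    sub_weights = [row[:n_out] for row in weights[:n_in]]
    sub_bias = bias[:n_out]
    return sub_weights, sub_bias

def linear(x, weights, bias):
    """Plain fully connected layer: y[j] = sum_i x[i] * weights[i][j] + bias[j]."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

# Hypothetical supernet layer with 3 inputs and 4 outputs.
SUPER_W = [[1.0, 2.0, 3.0, 4.0],
           [0.5, -1.0, 0.0, 2.0],
           [-2.0, 0.0, 1.0, 1.0]]
SUPER_B = [0.0, 0.1, 0.2, 0.3]

x = [1.0, 1.0, 1.0]
y_full = linear(x, SUPER_W, SUPER_B)                  # full-width layer, 4 outputs

# A narrower subnet reuses the same parameters, so no extra copy is trained.
w_small, b_small = slice_layer(SUPER_W, SUPER_B, n_in=3, n_out=2)
y_small = linear(x, w_small, b_small)                 # 2 outputs
```

Because the subnet shares parameters with the supernet, its outputs equal the first outputs of the full layer; the budget only decides how much of the layer is evaluated.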

### 1.4 Neural Network Quantization 

- Converting the weights from floating point to integers. Example: FP32 -> INT8 (case study: post-training quantization)
- Types of quantization methods:
    - Post-training (static) quantization
    - Quantization aware training
    
- Disadvantage: quantization layers are not differentiable, since rounding has a zero gradient almost everywhere
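
A small worked example of the FP32 -> INT8 conversion may help. The sketch below assumes symmetric, per-tensor quantization with a single scale factor, which is only one of several possible schemes; the function names and the example weights are made up for illustration.

```python
def symmetric_scale(values, n_bits=8):
    """Scale that maps the largest |value| onto the largest signed integer."""
    qmax = 2 ** (n_bits - 1) - 1             # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(values, scale, n_bits=8):
    """Round to the nearest integer step and clip to the symmetric range."""
    qmax = 2 ** (n_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in q_values]

weights = [0.5, -1.27, 0.0, 1.0]             # hypothetical FP32 weights
scale = symmetric_scale(weights)             # about 0.01
q = quantize(weights, scale)                 # [50, -127, 0, 100]
restored = dequantize(q, scale)              # close to the original weights
```

The rounding error per weight is at most half a step (`scale / 2`), which is the information loss that post-training quantization accepts and quantization-aware training tries to compensate for.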
    
### 1.5 Binary Neural Network (1/2)

- Benefits:
    - Energy saving
    - Smaller model
    - Faster
    
- Activation function: sign(x)
    - Its derivative is zero almost everywhere, so it does not help backpropagation
    - The Straight-Through Estimator (STE) is used instead as a derivative approximation
    
- BNN Inference:
    Given a full-precision model:
    1. Convert the full-precision weights to binary
    2. Binarize the input
    3. Apply XNOR and popcount operations (convolution in the binary world) **Why xnor?**
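
One way to see where XNOR comes from: if +1 is stored as bit 1 and -1 as bit 0, the product of two values is +1 exactly when their bits agree, which is what XNOR computes. A dot product of length n then equals `2 * popcount(xnor) - n`. The sketch below illustrates this together with the sign function and a straight-through gradient. The clipping of the gradient to `|x| <= 1` and the mapping of 0 to +1 are assumptions (common choices, not taken from these notes), and all names are hypothetical.

```python
def sign(x):
    """Binarize to +1 / -1 (0 is mapped to +1 here by convention)."""
    return 1 if x >= 0 else -1

def ste_grad(x, upstream):
    """Straight-through estimate: pass the gradient where |x| <= 1, block it elsewhere."""
    return upstream if abs(x) <= 1 else 0.0

def pack(values):
    """Encode a +1/-1 vector as an integer bit mask (bit i set means +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")   # popcount(xnor)
    return 2 * agree - n                                # agreements minus disagreements

w = [0.7, -0.2, 0.1, -1.5]                   # hypothetical full-precision weights
x = [0.3, -0.9, 0.4, 2.0]                    # hypothetical input
wb = [sign(v) for v in w]                    # [1, -1, 1, -1]
xb = [sign(v) for v in x]                    # [1, -1, 1, 1]

binary_dot = xnor_popcount_dot(pack(wb), pack(xb), len(wb))
reference = sum(a * b for a, b in zip(wb, xb))          # same value, computed arithmetically
```

Both `binary_dot` and `reference` evaluate to 2 here: three positions agree and one disagrees.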

### 1.6 Binary Neural Network (2/2)

1. Binary Dense Net
    - BNN
        - Reduced computational costs
            - Up to 32x memory saving
            - 58x speedup
        - Substantial accuracy degradation
    - Improvement over XNOR-Net and Bi-Real Net
    - Previous works did not consider the specifics of the binary space and reused architectures designed for the floating-point (real-valued) space
    - Divides the efforts for binarization and compression into three categories: 
        - Compact network design
        - Networks with quantized weights
        - Networks with quantized weights and activations
            - BNN: efficient computation of the equivalent of matrix multiplication using xnor and bitcount operations
            - XNOR-Net: introduced a channel-wise scaling factor to reduce the approximation error of full-precision parameters
            - ABC-Nets: used multiple weight bases and activation bases to approximate their full-precision counterparts
            - Bi-Real Net: applies an extremely sophisticated training strategy (full-precision pre-training, multi-step initialization, and custom gradients)
     
    - Golden Rules for Architecture Design
        1. Maintain a rich information flow through the network
        2. Compact network designs are not well suited for BNNs
        3. Bottleneck designs should be eliminated
        4. Consider using full-precision downsampling
        5. Use shortcut connections to avoid bottlenecks of information flow
        6. To overcome bottlenecks of information flow, increase the network width
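
The "up to 32x memory saving" figure above follows directly from storing one bit per weight instead of a 32-bit float. The snippet below checks that arithmetic on a made-up weight vector; the helper names are hypothetical, and real BNNs usually keep some layers in full precision, so the overall saving of a whole model is smaller.

```python
import struct

def float32_bytes(weights):
    """Size of the weights when each one is stored as a 32-bit float."""
    return len(struct.pack(f"{len(weights)}f", *weights))

def binary_bytes(weights):
    """Size of the weights when only the sign bit is stored, 8 weights per byte."""
    packed = bytearray((len(weights) + 7) // 8)
    for i, w in enumerate(weights):
        if w >= 0:
            packed[i // 8] |= 1 << (i % 8)
    return len(packed)

# 1024 hypothetical weights with alternating signs.
weights = [(-1) ** i * 0.01 * (i + 1) for i in range(1024)]

full_size = float32_bytes(weights)           # 4096 bytes
binary_size = binary_bytes(weights)          # 128 bytes
ratio = full_size / binary_size              # 32.0
```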
    

## 2. Quizzes 

### 2.1 Network Pruning

**Question 1**

Which of the following statements about pruning are true?

- [ ] Dropout has the same effect and purpose as pruning.
- [x] Neural network pruning is inspired by the synaptic pruning of biological neural systems.
- [x] It utilizes the fact that large models are often overparameterized.
- [x] Pruning is a method for model regularization. 

**Question 2**

Which of the following are commonly used pruning methods?

- [x] Unstructured pruning
- [x] Weight pruning
- [x] Neuron pruning
- [x] Structured pruning 

### 2.2 Dynamic Network 

**Question 1**

Which of the following statements about dynamic neural networks are **wrong**?

- [x] A dynamic neural network can have dynamic width, depth, and path, but it must rely on a pre-trained network.
- [ ] A dynamic neural network learns a specific network structure for each sample.
- [ ] The dynamic neural network applies a decision-making mechanism for structure prediction.
- [x] A dynamic neural network has a fixed network architecture but adaptive weight intensities. 

**Question 2**

Dynamic neural networks have advantages because...

- [ ] The human brain processes information statically.
- [x] Dynamic models are able to achieve a desired trade-off between accuracy and efficiency for dealing with varying computational budgets on the fly.
- [x] It allocates resources on demand at test time.
- [x] It enlarges parameter space and improves representation power. 

**Question 3**

Which of the following statements about early exits are true?

- [x] Adding early exits will slightly increase the training complexity and time. (**Why will it increase complexity and time?**)
- [ ] Early exiting is a dynamic-width method.
- [ ] We can maximally have three early exits in a dynamic neural network. 


### 2.3 Supernet-Subnets Architecture Search 

**Question 1**

The supernet can be regarded as a teacher for all subnets, using knowledge distillation to train the subnets.

- [x] yes
- [ ] no

**Question 2**

What are the benefits when the supernet shares weights with all subnets?

- [x] It can be considered as a type of ensemble learning.
- [x] It significantly reduces the training complexity.
- [ ] The trained supernet is high-precision and highly compact. 

**Question 3**

BigNAS uses the sandwich rule to train N randomly sampled models.
The initial weights of a small sub-network are inherited from a well-trained larger sub-network. Is this statement correct?

- [ ] yes
- [x] no **The initialization problem was resolved by initializing the output of each residual block to zero**

### 2.4 Neural Network Quantization

**Question 1**

Generally speaking, a quantized neural network will have better accuracy because the information loss caused by quantization can have a regularization effect. Therefore, the lower the bit width, the higher the accuracy. Is this correct?

- [x] no
- [ ] yes

**Question 2**

What are the advantages of neural network quantization?

- [ ] It can enrich the distribution range of neural networks and improve the expressiveness of the model.
- [x] The quantized models can support more applications of low-power devices.
- [x] It can save memory and improve inference speed. 

**Question 3**

What are the differences between post-training quantization and quantization aware training?

- [ ] Post-training quantization needs to retrain the model.
- [x] Post-training quantization does not require retraining the model.
- [x] Quantization-aware training takes the effect of information loss into account during training, so it sacrifices only a little inference accuracy.

### 2.5 Binary Neural Network (1/2)

**Question 1**

Which of the following statements about binary neural networks are correct?

- [x] A binary neural network uses 1-bit values for both weights and activations.
- [ ] A binary neural network uses 8-bit weights and 1-bit activations.
- [ ] A binary neural network uses 1-bit weights and 8-bit activations. 

**Question 2**

Which of the following statements are correct?

- [x] The sign function binarizes weights and activations to +1 and -1.
- [x] Binary neural networks apply more efficient bitwise operators, e.g., XNOR and bitcount, instead of arithmetic operations.
- [ ] Differentiating the sign function does not help backpropagation since we get gradient values "1" almost everywhere.
- [ ] The Straight-Through Estimator (STE) is the derivative function of the sign function.

### 2.6 Binary Neural Network (2/2)

**Question 1**

Which of the following are **not** challenges for BNNs?

- [ ] Lack of tailor-made optimizer for BNN
- [ ] Balancing accuracy and energy consumption for AI accelerators
- [ ] Loss of accuracy compared to 32-bit networks
- [x] Lack of support from the community and industry 

**Question 2**

Which of the following are **not** the main contributions of Bi-Real Net?

- [x] Progressively approximated sign function
- [ ] Two-stage training method
- [ ] Binary-Real valued information flow
- [x] Channel-wise scaling factor for activations and weights

**Question 3**

What are the main problems of BNN optimization?

- [x] SGD cannot be applied to binary weights.
- [x] Gradient mismatching problem
- [x] Unnecessary computation and memory cost caused by full-precision latent weights.
- [ ] We cannot use full-precision latent weights for inference. 