# Model Compression

Model compression in deep learning aims to deploy complex, high-performing neural networks on resource-constrained edge devices by reducing model size and latency. Size reduction involves minimizing the number and size of model parameters, resulting in decreased memory requirements during execution. Latency reduction focuses on speeding up model predictions, which, in turn, reduces energy consumption. Both aspects are interconnected, with smaller models demanding fewer memory resources, and faster models being more energy-efficient, making them suitable for deployment on devices with strict constraints. The primary goal of model compression is to simplify models while retaining their high accuracy and performance, thus enabling their efficient deployment in resource-limited environments.

This section covers two popular model compression methods: **Pruning** and **Quantization**.

## 1. Pruning

Pruning is a methodology used to induce sparsity in neural networks by removing less important weights and activations, making the model more efficient.


**Motivation:**

|Age|Number of Connections|Stage|
|---|---------------------|-----|
|at birth|50 Trillion|newly formed|
|1 year old|1000 Trillion|peak|
|10 year old|500 Trillion|pruned and stabilized|

- The table above shows how synaptic pruning proceeds during human brain development.

- **Pruning** removes redundant connections in the brain as it matures.

Pruning can be broadly classified into two types:

### Structured Pruning
Structured Pruning removes entire channels or filters from a neural network. Pruning a channel in one layer also removes the corresponding channels in the layers connected to it, so whole slices of the weight tensors disappear. Which channels to remove is decided by evaluating their importance. Methods range from scoring channels by the sum of their weights to more sophisticated techniques such as global and dynamic pruning, which aim to balance network size against performance. Structured pruning has been applied to image segmentation and object detection, improving network efficiency and reducing the computational resources required.

![pic](https://i.postimg.cc/6QLCyKCw/image.png)

### Unstructured Pruning
Unstructured Pruning, on the other hand, targets individual parameters, which are ranked by criteria such as weight magnitude, gradients, or Hessian statistics. The "optimal brain damage" approach uses second-derivative information to identify unimportant weights and removes them from the network. Other methods, like the "train–prune–retrain" strategy, learn only the important connections, which can significantly improve efficiency without affecting accuracy. Unstructured pruning can substantially reduce the number of parameters and computations, but it does not remove whole neurons from the network, so the resulting sparsity is irregular. How to exploit unstructured pruning effectively on current hardware architectures still needs further investigation.

![pic](https://i.postimg.cc/P51wRH3Z/image.png)

Various pruning techniques:
1. **Weight Pruning (model pruning)**: Weight connections whose values fall below a predefined threshold are pruned (zeroed out).
2. **Neuron Pruning**: Entire neurons are pruned instead of individual weights, since removing weights one by one can be time-consuming.
3. **Filter Pruning**: Filters are ranked according to their importance, and the least important filters are removed from the network. The importance of a filter can be measured by its L1 or L2 norm.
4. **Layer Pruning**: Entire layers are removed from the network.
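
Filter pruning from the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `prune_filters_l1`, the `keep_ratio` parameter, and the weight layout `(out_channels, in_channels, kh, kw)` are assumptions made for the example.

```python
import numpy as np

def prune_filters_l1(conv_w, next_conv_w, keep_ratio=0.5):
    """Remove the lowest-L1-norm filters of one conv layer.

    conv_w:      weights of the pruned layer, shape (out_ch, in_ch, kh, kw)
    next_conv_w: weights of the following layer, shape (out2, out_ch, kh, kw)
    """
    # Importance of each filter = sum of its absolute weights (L1 norm).
    importance = np.abs(conv_w).sum(axis=(1, 2, 3))
    n_keep = max(1, int(round(keep_ratio * conv_w.shape[0])))
    # Indices of the most important filters, kept in their original order.
    keep = np.sort(np.argsort(importance)[-n_keep:])
    # Dropping a filter removes an output channel here and the matching
    # input channel of the next layer.
    return conv_w[keep], next_conv_w[:, keep], keep

rng = np.random.default_rng(0)
w1 = rng.normal(size=(8, 3, 3, 3))
w2 = rng.normal(size=(16, 8, 3, 3))
w1[[1, 4]] *= 0.01  # make two filters clearly unimportant
p1, p2, kept = prune_filters_l1(w1, w2, keep_ratio=0.75)
print(p1.shape, p2.shape, kept.tolist())
```

Because whole filters are removed, the pruned tensors stay dense and simply become smaller, which is why this kind of pruning maps well onto ordinary hardware.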

**One question**: Should you prune large networks or build small dense networks?

- Large-sparse models consistently outperform small-dense models and achieve up to a 10x reduction in the number of non-zero parameters with minimal loss in accuracy.

### Weight Pruning
Weights pruning aims to increase the sparsity of a network's weight tensors, reducing the number of parameters in the model.

**Sparsity Definition**: Sparsity is defined as a measure of how many elements in a tensor are exact zeros relative to the tensor's size.

**Pruning Weights, Biases, and Activations**: Pruning can be applied to weights, biases, and activations. Biases are pruned less often: there are few of them, and each contributes substantially to a layer's output.

**Iterative Pruning and Fine-Tuning**: Pruning can be performed iteratively, with fine-tuning between iterations to recover from pruning and maintain accuracy.

**Pruning Criteria**: The most common pruning criteria involve comparing the absolute values of elements to a threshold. If an element's absolute value is below the threshold, it is set to zero (pruned).
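
As a small illustration of the sparsity definition and the threshold criterion, the sketch below zeroes every element whose absolute value is under a threshold and reports the sparsity before and after. The helper names `sparsity` and `threshold_prune` and the threshold of 0.1 are chosen for this example only.

```python
import numpy as np

def sparsity(t):
    """Fraction of elements that are exactly zero."""
    return float(np.count_nonzero(t == 0)) / t.size

def threshold_prune(t, threshold):
    """Zero every element whose absolute value is below the threshold."""
    mask = np.abs(t) >= threshold
    return np.where(mask, t, 0.0), mask

w = np.array([[0.8, -0.05, 0.3],
              [0.02, -0.6, 0.09]])
pruned, mask = threshold_prune(w, threshold=0.1)
print(pruned)
print(sparsity(w), sparsity(pruned))  # 0.0 before, 0.5 after
```

The returned mask is worth keeping: during fine-tuning it can be reapplied after every update so that pruned elements stay at zero.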

**Pruning Schedule**: The pruning schedule defines when, how often, and which tensors are pruned during the iterative process.
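
One possible schedule, shown here only as an assumption and not something the text prescribes, ramps the target sparsity from an initial to a final value along a cubic curve, so that most pruning happens early while the network still has plenty of redundancy. The function name and its parameters are hypothetical.

```python
def target_sparsity(step, start_step, end_step, initial=0.0, final=0.9):
    """Cubic ramp from `initial` to `final` sparsity between two steps."""
    if step <= start_step:
        return initial
    if step >= end_step:
        return final
    progress = (step - start_step) / (end_step - start_step)
    return final + (initial - final) * (1.0 - progress) ** 3

# Print the target sparsity at a few training steps.
for step in range(0, 1001, 250):
    print(step, round(target_sparsity(step, 0, 1000), 3))
```

A training loop would call such a function every few hundred steps and prune each selected tensor up to the returned sparsity.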

**Pruning Granularity**: Pruning can be fine-grained (element-wise pruning) or coarse-grained, such as filter pruning, which removes entire groups of elements.

**Sensitivity Analysis**: Sensitivity analysis helps rank tensors by their sensitivity to pruning, determining the impact of pruning on different layers in the network.
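
A sensitivity scan can be sketched as follows: prune one layer at a time to several sparsity levels and record how much the output changes. The tiny random MLP and the use of relative output change as a stand-in for accuracy loss are simplifying assumptions; a real analysis would evaluate a trained model on a validation set.

```python
import numpy as np

def prune_to_sparsity(w, s):
    """Zero the fraction `s` of smallest-magnitude elements of w."""
    k = int(s * w.size)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)

def forward(weights, x):
    """Tiny ReLU MLP used as a stand-in for a real model."""
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)
    return x @ weights[-1]

def sensitivity(weights, x, levels=(0.25, 0.5, 0.75)):
    """Relative output change when each layer is pruned on its own."""
    ref = forward(weights, x)
    report = {}
    for i, w in enumerate(weights):
        for s in levels:
            trial = list(weights)
            trial[i] = prune_to_sparsity(w, s)
            err = np.linalg.norm(forward(trial, x) - ref) / np.linalg.norm(ref)
            report[(i, s)] = float(err)
    return report

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 32)), rng.normal(size=(32, 4))]
x = rng.normal(size=(64, 16))
report = sensitivity(weights, x)
for (layer, s), err in report.items():
    print(f"layer {layer} sparsity {s:.2f} -> relative change {err:.3f}")
```

Layers whose output change grows quickly with sparsity are the sensitive ones and would normally be assigned a lower pruning target.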

### Magnitude-based method: Iterative pruning + Retraining

![pic](https://i.postimg.cc/4dYtw9r4/image.png)

Magnitude-Based Pruning is a popular and straightforward pruning technique where you remove the least important weights in a neural network based on their magnitudes. The idea is to identify the connections (weights) with small absolute values, considering them less critical to the network's performance. These small weights are then pruned, effectively reducing the network's size.
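
The prune-then-retrain loop can be sketched on a toy problem. The example below is an illustration under stated assumptions: a linear regression stands in for a neural network, and the helper names, the pruning fraction of 40% per round, and the learning rate are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 20))
w_true = np.zeros(20)
w_true[:4] = [2.0, -3.0, 1.5, 4.0]   # only 4 weights really matter
y = X @ w_true

def train(w, mask, steps=300, lr=0.05):
    """Gradient descent on mean squared error; pruned weights stay at zero."""
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def magnitude_mask(w, mask, prune_frac):
    """Prune the smallest-magnitude `prune_frac` of the surviving weights."""
    alive = np.flatnonzero(mask)
    n_prune = int(prune_frac * alive.size)
    drop = alive[np.argsort(np.abs(w[alive]))[:n_prune]]
    new_mask = mask.copy()
    new_mask[drop] = 0.0
    return new_mask

mask = np.ones(20)
w = train(rng.normal(size=20) * 0.1, mask)           # 1. train the dense model
for _ in range(3):
    mask = magnitude_mask(w, mask, prune_frac=0.4)   # 2. prune small weights
    w = train(w * mask, mask)                        # 3. retrain the survivors
    loss = np.mean((X @ w - y) ** 2)
    print(f"kept {int(mask.sum())}/20 weights, loss {loss:.2e}")
```

Pruning a modest fraction per round and retraining in between gives the remaining weights a chance to compensate, which is usually gentler than removing the same total amount in one shot.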

### Lottery Ticket Hypothesis

**"A randomly-initialized, dense neural network contains a subnetwork that is initialized such that - when trained in isolation - it can match the test accuracy of the original network after training for at most the same number of iterations."**

The Lottery Ticket Hypothesis is a concept in deep learning that suggests within large, over-parameterized neural networks, there exist subnetworks (winning tickets) that, when trained in isolation, can match or even exceed the performance of the original network. These winning tickets are sparse, meaning they contain a small fraction of the original parameters.

The Lottery Ticket Hypothesis has a close relationship with pruning for model compression:

1. **Pruning Criteria**: The Lottery Ticket Hypothesis suggests that subnetworks with the potential for high performance are embedded within over-parameterized networks. Magnitude-Based Pruning aligns with this idea by selecting weights with small magnitudes for removal. It assumes that these small weights contribute less to the network's performance and can be pruned without significant accuracy loss.

2. **Discovery of Winning Tickets**: In the context of the Lottery Ticket Hypothesis, the subnetworks that exhibit high performance are akin to "winning tickets." Magnitude-Based Pruning helps identify such winning tickets by removing the unimportant weights, effectively revealing subnetworks that have the potential to achieve good performance when retrained.

3. **Iterative Pruning**: The Lottery Ticket Hypothesis also supports the idea of iterative pruning, where you prune a portion of the network and then retrain it to recover performance. This is in line with Magnitude-Based Pruning, where you can apply pruning iteratively, gradually removing less important weights and retraining the network to maintain or improve its performance.

4. **Size and Efficiency**: Both methods aim to reduce the size of the neural network, making it more compact and efficient for deployment. Magnitude-Based Pruning, by removing small weights, contributes to the creation of smaller subnetworks (winning tickets) that align with the Lottery Ticket Hypothesis.
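
A rough sketch of how a winning ticket might be searched for is given below. It assumes the commonly described procedure of training, pruning by magnitude, and resetting the surviving weights to their initial values; the reset step is not spelled out in the text above, so treat it as an assumption. A toy linear regression replaces the neural network, and `find_winning_ticket` and `train_fn` are hypothetical names.

```python
import numpy as np

def find_winning_ticket(w_init, train_fn, rounds=3, prune_frac=0.3):
    """Iterative magnitude pruning with a reset to the initial weights.

    w_init is a flat weight vector. train_fn(w, mask) is a hypothetical
    callback that trains the masked model and returns the trained weights.
    """
    mask = np.ones_like(w_init)
    for _ in range(rounds):
        w_trained = train_fn(w_init * mask, mask)
        alive = np.flatnonzero(mask)
        n_prune = int(prune_frac * alive.size)
        # Drop the surviving weights that ended up smallest after training.
        drop = alive[np.argsort(np.abs(w_trained[alive]))[:n_prune]]
        mask[drop] = 0.0
    # The ticket is the ORIGINAL initialization restricted to the mask.
    return w_init * mask, mask

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 20))
w_true = np.zeros(20)
w_true[:4] = [2.0, -3.0, 1.5, 4.0]
y = X @ w_true

def train_fn(w, mask, steps=300, lr=0.05):
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

w_init = rng.normal(size=20) * 0.1
ticket, mask = find_winning_ticket(w_init, train_fn)
w_sparse = train_fn(ticket, mask)          # train the ticket in isolation
w_dense = train_fn(w_init, np.ones(20))    # baseline: the dense model
print(int(mask.sum()), mse(w_sparse), mse(w_dense))
```

The point of the comparison at the end is the one the hypothesis makes: the sparse subnetwork, started from its original initialization, is trained on its own and its error is compared with that of the dense model.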

## 2. Quantization

Quantization is the process of reducing the number of bits that represent a number. Deep learning research has worked to reduce the resource demands of models, leading to the use of lower-precision numerical formats such as 8-bit integers (INT8) in place of the predominant 32-bit floating point (FP32) for more efficient inference without significant accuracy loss. Even lower bit-widths, such as 4, 2, or 1 bit, are also actively researched and have shown progress in improving inference efficiency.

- **Aggressive Quantization**: INT4 and Lower
- **Conservative Quantization**: INT8

*Comparison of represented value and storage bits of different data types:*

|Data type|Represented value of π|Storage bits|
|---|---|---|
|**FP64**|3.141592653589793|64 bits|
|**FP32**|3.1415927|32 bits|
|**FP16**|3.140625|16 bits|
|**INT8**|3|8 bits|
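
The effect of each data type can be checked directly by casting π with NumPy. The helper `stored` is a name made up for this sketch; it returns the value that actually ends up in memory, which for the floating-point types is the nearest representable number and not a truncated decimal.

```python
import numpy as np

def stored(dtype):
    """Return pi as stored in `dtype`, plus the storage width in bits."""
    value = np.array(np.pi).astype(dtype).item()
    return value, np.dtype(dtype).itemsize * 8

for dtype in (np.float64, np.float32, np.float16, np.int8):
    value, bits = stored(dtype)
    print(f"{np.dtype(dtype).name:>8}: {value} ({bits} bits)")
```

Note that the plain integer cast simply truncates. Practical quantization schemes first rescale values so that the 8-bit range covers the tensor's actual value range.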

**Benefits**:
- significantly reduced bandwidth and storage
- integer compute is faster than floating-point compute
- more area and energy efficient

|Algorithm: Parameter quantization|
| -------------------------------- |
|**Step 1:** Compute the min_value and max_value of the input data (weights or activation values)|
|**Step 2:** Choose the appropriate quantization type: symmetric (INT8) or asymmetric (UINT8)|
|**Step 3:** Calculate the quantization parameters, zero point Z and scale S, from the quantization type, min_value, and max_value|
|**Step 4:** Quantize the model from FP32 to INT8 using the calibration data|
|**Step 5:** Verify the performance of the quantized model; if the result is not good, try a different way of calculating S and Z and repeat the steps above|
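
The five steps can be sketched for a single tensor as follows. This is one common way to compute S and Z, shown under assumptions: the symmetric variant uses the restricted range [-127, 127], the asymmetric variant extends the range to include 0.0, the tensor is assumed not to be constant (so the scale is non-zero), and all function names are hypothetical.

```python
import numpy as np

def quant_params(x, symmetric):
    """Steps 1-3: value range, quantization type, scale S and zero point Z."""
    lo, hi = float(x.min()), float(x.max())
    if symmetric:                        # int8, zero point fixed at 0
        qmin, qmax = -127, 127
        scale = max(abs(lo), abs(hi)) / qmax
        zero_point = 0
    else:                                # uint8, range shifted by Z
        qmin, qmax = 0, 255
        lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep 0.0 exactly representable
        scale = (hi - lo) / (qmax - qmin)
        zero_point = int(round(qmin - lo / scale))
    return scale, zero_point, qmin, qmax

def quantize(x, scale, zero_point, qmin, qmax):
    """Step 4: map FP32 values to 8-bit integers."""
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.int8 if qmin < 0 else np.uint8)

def dequantize(q, scale, zero_point):
    """Map integers back to approximate FP32 values."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
results = {}
for symmetric in (True, False):
    s, z, qmin, qmax = quant_params(w, symmetric)
    q = quantize(w, s, z, qmin, qmax)
    err = float(np.abs(dequantize(q, s, z) - w).max())  # Step 5: check error
    results[symmetric] = (q, s, z, err)
    print(q.dtype, f"scale={s:.5f} zero_point={z} max_error={err:.5f}")
```

For values inside the calibrated range, the round-trip error is bounded by about half the scale, so a narrower value range directly translates into finer resolution.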

### Quantization-Aware Training (QAT)

Quantization-Aware Training helps neural networks adapt to lower bit-width representations during the training phase. The forward and backward passes run on the quantized model, so the network sees the effect of quantization while it trains. The weights themselves, however, are updated in floating-point precision and re-quantized after each gradient update. The backward pass is also carried out in floating point, because accumulating gradients in quantized precision can lead to high errors, particularly with low-precision quantization. In this way, QAT lets the model converge to a better loss point, compensating for the perturbations introduced by quantization.
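
The training loop can be sketched with simulated ("fake") quantization on a toy linear model. The example assumes a straight-through estimator, meaning the rounding step is treated as the identity when gradients are passed back; that choice, the 4-bit setting, and all names are assumptions of this sketch and are not taken from a specific framework.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Quantize to `bits` and map back to float (simulated quantization)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 8))
y = X @ rng.normal(size=8)

w = rng.normal(size=8) * 0.1            # full-precision weights
loss_init = float(np.mean((X @ fake_quant(w) - y) ** 2))

for step in range(500):
    w_q = fake_quant(w)                 # forward pass sees quantized weights
    grad = 2.0 * X.T @ (X @ w_q - y) / len(y)
    # Straight-through estimator: treat round() as identity in the backward
    # pass, so the gradient w.r.t. w_q is applied to the float weights.
    w -= 0.05 * grad

loss_qat = float(np.mean((X @ fake_quant(w) - y) ** 2))
print(f"loss before: {loss_init:.4f}, after QAT: {loss_qat:.4f}")
```

The float weights only serve as an accumulator for small updates. What gets deployed is `fake_quant(w)`, the same quantized values the loss was computed on during training.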

### Post-Training Quantization (PTQ)

In contrast to QAT, Post-Training Quantization is applied after the model has been trained with full precision (32-bit floating-point numbers). PTQ quantizes the trained model and adjusts the weights without the need for extensive fine-tuning. This process offers a cost-effective and low-overhead solution for quantization. A significant advantage of PTQ is that it can be applied with limited or no labeled data, making it highly practical. However, PTQ may lead to a reduction in accuracy, especially with low-precision quantization. Researchers have devised various methods to address this accuracy drop, such as bias correction, weight range balancing, and outlier handling. PTQ is a fast way to quantize neural network models, but it typically achieves lower accuracy compared to QAT.
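
A minimal PTQ sketch for a single linear layer is shown below. It assumes symmetric per-tensor INT8 quantization for both weights and activations and uses random data in place of a trained layer and a real calibration set; the names `int8_scale` and `to_int8` are hypothetical. No labels are involved: the calibration inputs are only used to estimate the activation range.

```python
import numpy as np

def int8_scale(x):
    """Symmetric per-tensor scale that maps max |x| to 127."""
    return float(np.abs(x).max()) / 127.0

def to_int8(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32)).astype(np.float32)       # stands in for trained FP32 weights
calib = rng.normal(size=(128, 64)).astype(np.float32)  # unlabeled calibration inputs
test = rng.normal(size=(16, 64)).astype(np.float32)

w_scale = int8_scale(W)        # weight range comes from the weights themselves
x_scale = int8_scale(calib)    # activation range comes from calibration data
W_q = to_int8(W, w_scale)

# Integer matmul with int32 accumulation, then one rescale back to float.
acc = to_int8(test, x_scale).astype(np.int32) @ W_q.astype(np.int32)
y_q = acc.astype(np.float32) * (w_scale * x_scale)
y_fp = test @ W
rel_err = float(np.linalg.norm(y_q - y_fp) / np.linalg.norm(y_fp))
print(f"relative output error after PTQ: {rel_err:.4f}")
```

Inputs that fall outside the calibrated range are clipped, which is one reason the choice of calibration data and the handling of outliers matter for PTQ accuracy.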