*Note: Work in Progress*

<font color='#2980B9'><center><h2>All The Ways You Can Compress Transformers</h2></center></font>
<font color='#2980B9'><center>- A foray into strategies for reducing model size and increasing inference speed.</center></font>

Model compression is a technique that shrinks trained neural networks. Compressed models often perform similarly to the original while using a fraction of the computational resources. The bottleneck in many applications, however, turns out to be training the original, large neural network before compression.

<center><img src="https://www.datocms-assets.com/30881/1608731085-pruning-1.png" height="100" width="500"></center>

Model compression reduces redundancy in a trained neural network. This is useful since large Transformer models barely fit on a single GPU (and the largest variants do not). Improved memory and inference efficiency can also save costs at scale.

High parameter counts and a large computational footprint mean that production deployment of Transformers and friends remains difficult. Thankfully, the past 3 years have seen the development of a diverse variety of techniques to ease the pain and yield faster prediction times.

In particular, this notebook focuses on the following suite of methods applied after base model pre-training to reduce the computational cost of prediction:

 - [**Numeric Precision Reduction**](#section1): yielding speedups through reduced floating-point precision and quantization.

 - [**Knowledge Distillation**](#section2): efficiently training smaller student models to mimic the behavior of more expressive and expensive teachers.

 - **Pruning**: identifying and removing non-essential portions of a network.

 - [**Dynamic Inference Acceleration**](#section4): reducing computational overhead at inference time.

 - [**Matrix Decomposition**](#section5): approximating parameter matrices by factorizing them into a product of two smaller matrices.

 - [**Operation Fusion**](#section6): numerical tricks to merge select nodes in computational graphs.

 - **Module Replacement**: reducing model complexity or depth via a replacement curriculum.

 - **Others**: several one-of-a-kind methods that are effective for reducing model size and inference time.
 
 
*The competition has a GPU runtime limit of 5 hrs. To build a stronger ensemble we need to fit more diverse models within that limit, and model compression strategies let us compress larger models into smaller ones and improve inference speed.*
 
<font color='#2980B9'><a id="section1"><h2>Numeric Precision Reduction</h2></a></font>

Perhaps the most general method for yielding prediction-time speedups is numeric precision reduction. In past years poor support for float16 operations on GPU hardware meant that reducing the precision of weights and activations was often counter-productive, but the introduction of the NVIDIA Volta and Turing architectures with Tensor Cores means modern GPUs are now well equipped for efficient float16 arithmetic.

<font color='#2980B9'><h3>Mixed Precision Training</h3></font>
Floating point types store three pieces of numeric information: the sign, exponent, and fraction. Traditional float32 representations use 1 sign bit, 8 exponent bits, and 23 fraction bits. Traditional float16 representations (the format used on NVIDIA hardware) roughly halve both the exponent and fraction components, using 5 exponent bits and 10 fraction bits. TPUs use a variant called bfloat16 that shifts some bits from the fraction to the exponent (8 exponent bits and 7 fraction bits), trading some precision for the ability to represent a broader range of values.
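
To make the trade-off between range and precision concrete, here is a minimal sketch that uses only Python's standard `struct` module. The helper names (`to_float16`, `to_bfloat16`, `float16_overflows`) are illustrative, and the bfloat16 emulation simply truncates the low 16 bits of a float32, whereas real hardware typically rounds to nearest.

```python
import struct

def to_float16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (5 exponent bits, 10 fraction bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 (8 exponent bits, 7 fraction bits).

    Assumption of this sketch: we keep only the top 16 bits of the float32
    encoding, i.e. we truncate instead of rounding to nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def float16_overflows(x: float) -> bool:
    """Return True if x is too large to be stored as a float16."""
    try:
        struct.pack("<e", x)
        return False
    except OverflowError:
        return True

# Precision: float16 keeps more fraction bits, so it lands closer to 0.1.
print("float16 :", to_float16(0.1))
print("bfloat16:", to_bfloat16(0.1))

# Range: 70000 does not fit in float16, but bfloat16 can still represent it coarsely.
print("float16 overflows on 70000:", float16_overflows(70000.0))
print("bfloat16(70000):", to_bfloat16(70000.0))
```

The same value is stored more precisely in float16, while bfloat16 tolerates magnitudes that float16 cannot hold at all.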

Most of a transformer network can be naively converted to float16 weights and activations with no accuracy penalty.

<center><img src="https://www.pragmatic.ml/content/images/2020/04/image-2.png" height="100" width="600"></center>

Small portions of the network – in particular, portions of the softmax operation – must remain in float32.  This is because the sum of a large number of small values (the exponentiated logits in the softmax denominator) can be a source of accumulated error. Because both float16 and float32 values are used, this method is often referred to as "mixed-precision".
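
The accumulation problem can be reproduced without a GPU. The sketch below emulates a float16 accumulator with the standard `struct` module (the helper names are illustrative) and sums many small values. It is a toy model of the issue, not a reproduction of how any particular framework implements softmax.

```python
import struct

def to_float16(x: float) -> float:
    """Round x to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def sum_in_float16(values):
    """Accumulate values while rounding the running total to float16 after every addition."""
    total = 0.0
    for v in values:
        total = to_float16(total + to_float16(v))
    return total

values = [1e-4] * 10_000          # the true sum is 1.0

half_sum = sum_in_float16(values)  # accumulator kept in float16
full_sum = sum(values)             # accumulator kept in Python's double precision

print("float16 accumulator:", half_sum)
print("high-precision accumulator:", full_sum)
```

Once the running total is large enough that a single small addend is less than half the spacing between adjacent float16 values, every further addition rounds back to the same number and the sum stalls far below the true value. Keeping the accumulation in a wider type avoids this.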

Less precise numeric representations enable speedups from two sources.

 - Native half-precision instructions
 - Larger batch sizes thanks to more compact representations
 
<font color='#2980B9'><h4>How to use?</h4></font>
Mixed precision primarily benefits Tensor Core-enabled architectures (Volta, Turing, Ampere). Automatic mixed precision (AMP) shows a significant (2-3x) speedup on those architectures. On earlier architectures (Kepler, Maxwell, Pascal), one may observe only a modest speedup. Running `!nvidia-smi` in a notebook cell displays the GPU model, from which the architecture can be determined.

 - [NVIDIA-apex](https://github.com/NVIDIA/apex) - NVIDIA has published a rather extensive suite of benchmarks relating to floating point precision reduction – in practice this method yields speedups up to 3x.

 - [Torch - torch.cuda.amp](https://pytorch.org/docs/stable/amp.html) - In the PyTorch 1.6 release, developers at NVIDIA and Facebook moved mixed precision functionality into PyTorch core as the AMP package, `torch.cuda.amp`, which is more flexible and intuitive than `apex.amp`.

<font color='#2980B9'><h4>Special Cases</h4></font>
 
 - A model can be converted to fp16 for inference using `model.half().cuda()`.

 - With small batches, one should be careful with mixed precision, because it can unexpectedly slow training down when there is not enough computation to perform. For example, I tried to train XLM-RoBERTa Large with a batch size of 4 and the training always failed.
 

<font color='#2980B9'><h3>Integer Quantization</h3></font>
Quantization of float32 to int8 values is also possible but requires more nuanced application.  In particular, a post-training calibration step is necessary to ensure that the int8 computation is as close as possible to the computation performed with float32 values.

Quantization truncates floating point numbers to use only a few bits (which causes round-off error). The quantization values can also be learned, either during or after training.

If you know what range of values a network's activations will likely occupy, you can divide up that range into 256 discrete chunks and assign each to an integer.  As long as you store the scale factor and the range occupied, you can use the integer approximations for your matrix multiplies and recover a floating point value at the output.
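
The mapping described above can be sketched in a few lines of plain Python. The function names (`calibrate`, `quantize`, `dequantize`) and the sample calibration values are illustrative, and the sketch assumes the calibration values are not all identical (otherwise the scale would be zero).

```python
def calibrate(values, q_min=-128, q_max=127):
    """Derive a scale and zero point from the observed range of calibration values."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (q_max - q_min)      # width of one of the 256 chunks (255 steps)
    zero_point = round(q_min - lo / scale)   # integer code that represents 0.0
    return scale, zero_point

def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """Map a float to its int8 code, saturating at the ends of the range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))

def dequantize(q, scale, zero_point):
    """Recover an approximate float from an int8 code."""
    return (q - zero_point) * scale

calibration = [-1.0, -0.5, 0.0, 0.25, 3.0]   # hypothetical activation samples
scale, zero_point = calibrate(calibration)

codes = [quantize(x, scale, zero_point) for x in calibration]
recovered = [dequantize(q, scale, zero_point) for q in codes]

for x, q, r in zip(calibration, codes, recovered):
    print(f"{x:+.4f} -> {q:4d} -> {r:+.4f}")
```

For values inside the calibrated range, the round-trip error is at most half of one chunk, which is why a tighter range (smaller `scale`) gives better precision.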

<center><img src="https://www.pragmatic.ml/content/images/2020/04/image-5.png" height="100" width="600"></center>

Naively, you could select a scale and offset such that every floating point activation observed on a set of calibration inputs fits within the range of int8 values (-128 to 127) without being clipped.  However, in doing so we sacrifice some precision in order to accommodate extreme values.  

Instead, frameworks like TensorRT select scale and offset values that minimize the KL divergence between the output activations of the float32 version and int8 version of the model. This allows us to balance the tradeoff between range and precision in a principled way.  As KL divergence can be viewed as a measure of information loss under a different encoding, it's a natural fit.
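
The range-versus-precision trade-off can be illustrated with a much simpler stand-in for the KL-based procedure: sweep a set of clipping thresholds and keep the one with the lowest mean absolute reconstruction error on the calibration data. This is not TensorRT's algorithm. The error metric, the candidate grid, and the synthetic activations are all assumptions of this sketch.

```python
def fake_quantize(x, threshold):
    """Symmetric int8 quantize-then-dequantize, clipping values beyond +/- threshold."""
    scale = threshold / 127
    q = max(-127, min(127, round(x / scale)))
    return q * scale

def mean_abs_error(values, threshold):
    """Average reconstruction error of the calibration values for a given threshold."""
    return sum(abs(v - fake_quantize(v, threshold)) for v in values) / len(values)

# Hypothetical calibration activations: 1000 values spread over [-1, 1] plus one outlier.
activations = [-1 + 2 * i / 999 for i in range(1000)] + [20.0]

max_abs = max(abs(v) for v in activations)               # the naive "no clipping" threshold
candidates = [max_abs * k / 20 for k in range(1, 21)]    # 1.0, 2.0, ..., 20.0

best = min(candidates, key=lambda t: mean_abs_error(activations, t))

print("naive threshold:", max_abs, "error:", mean_abs_error(activations, max_abs))
print("chosen threshold:", best, "error:", mean_abs_error(activations, best))
```

Clipping the single outlier costs a large error on one value but buys much finer resolution for the bulk of the activations. A different error measure can favor a different threshold, which is exactly why the choice of criterion (KL divergence in TensorRT's case) matters.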


<font color='#2980B9'><h4>How to use?</h4></font>
Information on how to apply int8 quantization to your own models using NVIDIA's TensorRT is available below:

 - [TensorFlow TensorRT User Guide](https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html)
 - [TensorFlow TensorRT GitHub](https://github.com/tensorflow/tensorrt)
 
 
<font color='#2980B9'><a id="section2"><h2>Knowledge Distillation</h2></a></font>
A method first conceived by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their 2015 work "Distilling the Knowledge in a Neural Network", knowledge distillation involves transferring the knowledge contained within one network (the "teacher") to another (the "student") via a modified loss.  

Observations about knowledge distillation for natural language processing can be summarized as follows.

- Knowledge distillation provides efficient and effective lightweight deep language models. The large-capacity teacher model can transfer the rich knowledge learned from a large amount of diverse language data to a small student model, so that the student can quickly complete many language tasks with effective performance.

- Teacher-student knowledge transfer can easily and effectively solve many multilingual tasks, considering that knowledge from multilingual models can be transferred and shared among them.

- In deep language models, the sequence knowledge can be effectively transferred from large networks into small networks.


In knowledge distillation, `knowledge types`, `distillation strategies` and the `teacher-student architectures` play crucial roles in student learning. We will look into each of these, since each is important for getting a better understanding of the knowledge distillation landscape.

<font color='#2980B9'><h3>Knowledge Types</h3></font>

<font color='#2980B9'><h4>1. Response-Based Knowledge</h4></font>
Response-based knowledge usually refers to the neural response of the last output layer of the teacher model. The main idea is to directly mimic the final prediction of the teacher model. The response-based knowledge distillation is simple yet effective for model compression, and has been widely used in different tasks and applications.

<center><img src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11263-021-01453-z/MediaObjects/11263_2021_1453_Fig4_HTML.png" height="100" width="400"></center>

The most popular response-based knowledge for image classification is known as soft targets (Hinton et al., 2015). As stated in the paper, soft targets contain the informative dark knowledge from the teacher model.
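
As a minimal sketch of how soft targets enter the loss, the snippet below softens teacher and student logits with a temperature and combines the resulting KL term with the usual cross-entropy on the hard label. The function names, the example logits, and the `temperature` and `alpha` values are illustrative assumptions. The `temperature ** 2` factor follows the scaling suggested by Hinton et al. to keep the soft-target gradients comparable across temperatures.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature yields a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and the hard-label cross-entropy."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_loss = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    hard_loss = -math.log(softmax(student_logits)[label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

teacher_logits = [5.0, 2.0, 0.5]   # hypothetical teacher outputs for one example
student_logits = [2.0, 1.5, 0.2]   # hypothetical student outputs for the same example

print("teacher, T=1:", softmax(teacher_logits, 1.0))
print("teacher, T=4:", softmax(teacher_logits, 4.0))
print("loss:", distillation_loss(student_logits, teacher_logits, label=0))
```

At a higher temperature the teacher distribution spreads more probability onto the non-argmax classes, which is where the "dark knowledge" about class similarity lives.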

However, the response-based knowledge usually relies on the output of the last layer, e.g., soft targets, and thus fails to address the intermediate-level supervision from the teacher model, which turns out to be very important for representation learning using very deep neural networks.

<font color='#2980B9'><h4>2. Feature-Based Knowledge</h4></font>
Deep neural networks are good at learning multiple levels of feature representation with increasing abstraction. This is known as representation learning (Bengio et al., 2013). Therefore, both the output of the last layer and the output of intermediate layers, i.e., feature maps, can be used as the knowledge to supervise the training of the student model. 

<center><img src="https://www.researchgate.net/profile/Jianping-Gou/publication/342094012/figure/fig9/AS:950418217652225@1603608769121/The-generic-feature-based-knowledge-distillation.png" height="100" width="500"></center>

Specifically, feature-based knowledge from the intermediate layers is a good extension of response-based knowledge, especially for the training of thinner and deeper networks. The feature types used include neuron selectivity patterns, attention maps, shared parameters, feature representations, feature maps, and feature aggregation.
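
One common way to use intermediate features is to penalize the distance between a teacher feature vector and the corresponding student feature vector. The sketch below uses a mean squared error and, because the student is usually narrower than the teacher, a linear projection to match dimensions. The projection, the function names, and the numbers are assumptions made for illustration. In practice the projection weights would be learned together with the student.

```python
def project(features, weights):
    """Linear map from student width to teacher width.

    weights has one row per teacher dimension and one column per student dimension.
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def feature_loss(teacher_features, student_features, weights):
    """Mean squared error between teacher features and projected student features."""
    projected = project(student_features, weights)
    squared = [(t - s) ** 2 for t, s in zip(teacher_features, projected)]
    return sum(squared) / len(squared)

teacher_features = [1.0, 2.0, 3.0, 4.0]   # hypothetical 4-dimensional teacher activation
student_features = [1.0, 2.0]             # hypothetical 2-dimensional student activation

good_projection = [[1, 0], [0, 1], [1, 1], [0, 2]]
poor_projection = [[1, 0], [0, 1], [0, 0], [0, 0]]

print(feature_loss(teacher_features, student_features, good_projection))
print(feature_loss(teacher_features, student_features, poor_projection))
```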


<font color='#2980B9'><h4>3. Relation-Based Knowledge</h4></font>
Both response-based and feature-based knowledge use the outputs of specific layers in the teacher model. Relation-based knowledge further explores the relationships between different layers or data samples.

<center><img src="https://www.researchgate.net/profile/Jianping-Gou/publication/342094012/figure/fig10/AS:950418217648129@1603608769145/The-generic-instance-relation-based-knowledge-distillation.png" height="100" width="500"></center>

Different relation-based features include FSP Matrix, Instance Relation, Similarity matrix, Representation graph etc.
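
To illustrate the similarity-matrix flavor of relation-based knowledge, the sketch below compares how a teacher and a student relate the samples of a batch to one another, using pairwise cosine similarity. The function names and feature values are illustrative assumptions, and this is only one of many possible relation definitions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(batch):
    """Pairwise similarities between all samples in a batch."""
    return [[cosine(u, v) for v in batch] for u in batch]

def relation_loss(teacher_batch, student_batch):
    """Mean squared difference between teacher and student similarity matrices."""
    t = similarity_matrix(teacher_batch)
    s = similarity_matrix(student_batch)
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Three samples: 3-dimensional teacher features, 2-dimensional student features.
teacher_batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
aligned_student = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
collapsed_student = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

print(relation_loss(teacher_batch, aligned_student))
print(relation_loss(teacher_batch, collapsed_student))
```

Because only the relationships between samples are compared, the teacher and student features do not need to have the same dimensionality.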

Although some types of relation-based knowledge have been proposed recently, how to model the relation information from feature maps or data samples as knowledge still deserves further study.

<font color='#2980B9'><h3>Distillation Schemes</h3></font>
According to whether the teacher model is updated simultaneously with the student model or not, the learning schemes of knowledge distillation can be divided into three main categories: offline distillation, online distillation and self-distillation.

*There have been many previous Kaggle competitions where winning solutions have used the self-distillation technique*.


<center><img src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11263-021-01453-z/MediaObjects/11263_2021_1453_Fig8_HTML.png" height="100" width="300"></center>

<font color='#2980B9'><h4>1. Offline Distillation</h4></font>

The knowledge is transferred from a pre-trained teacher model into a student model. Therefore, the whole training process has two stages, namely: 
  1. The large teacher model is first trained on a set of training samples before distillation; and   
  2. The teacher model is used to extract the knowledge in the forms of logits or the intermediate features, which are then used to guide the training of the student model during distillation.

The main advantage of offline methods is that they are simple and easy to implement. However, the long training time of the complex, high-capacity teacher model cannot be avoided, even though the training of the student model in offline distillation is usually efficient under the guidance of the teacher. Moreover, the capacity gap between the large teacher and the small student always exists, and the student often relies heavily on the teacher.

<font color='#2980B9'><h4>2. Online Distillation</h4></font>
To overcome the limitation of offline distillation, online distillation is proposed to further improve the performance of the student model, especially when a large-capacity high performance teacher model is not available. In online distillation, both the teacher model and the student model are updated simultaneously, and the whole knowledge distillation framework is end-to-end trainable. 

In deep mutual learning, multiple neural networks work in a collaborative way. Any one network can be the student model and the other models can be the teacher during the training process. Co-distillation trains multiple models with the same architecture in parallel, and any one model is trained by transferring the knowledge from the other models.

However, existing online methods (e.g., mutual learning) usually fail to address the high-capacity teacher in online settings, making it an interesting topic to further explore the relationships between the teacher and student model in online settings.
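
A minimal sketch of the mutual learning idea for two peer networks: each network is trained on its own cross-entropy plus a KL term that pulls its prediction toward its peer's. The function names and logits are illustrative assumptions. A real implementation would compute these losses on mini-batches and update both networks by gradient descent.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_learning_losses(logits_a, logits_b, label):
    """Each peer minimizes its supervised loss plus the divergence from the other peer."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    loss_a = cross_entropy(p_a, label) + kl_divergence(p_b, p_a)   # peer B teaches A
    loss_b = cross_entropy(p_b, label) + kl_divergence(p_a, p_b)   # peer A teaches B
    return loss_a, loss_b

logits_a = [2.0, 0.5, 0.1]   # hypothetical outputs of network A
logits_b = [1.0, 1.2, 0.3]   # hypothetical outputs of network B

print(mutual_learning_losses(logits_a, logits_b, label=0))
```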

<font color='#2980B9'><h4>3. Self Distillation</h4></font>

In self-distillation, the same network is used for both the teacher and the student model, so it can be regarded as a special case of online distillation. There are several variants of self-distillation:
   1. [Zhang et al. (2019b)](https://arxiv.org/abs/1905.08094) proposed a self-distillation method in which knowledge from the deeper sections of the network is distilled into its shallow sections.
   2. Snapshot distillation ([Yang et al., 2019b](https://arxiv.org/abs/1812.00123)) is a special variant of self-distillation, in which knowledge in the earlier epochs of the network (teacher) is transferred into its later epochs (student) to support a supervised training process within the same network.
   3. [Yuan et al. (2020)](https://arxiv.org/abs/1909.11723) proposed teacher-free knowledge distillation methods based on the analysis of label smoothing regularization.
   4. [Hahn and Choi (2019)](https://arxiv.org/abs/1908.01851) proposed a self-knowledge distillation method in which the self-knowledge consists of the predicted probabilities instead of traditional soft probabilities. These predicted probabilities are defined by the feature representations of the training model and reflect the similarities of data in the feature embedding space.
   
Self-distillation is a powerful and mysterious phenomenon. Here is an excerpt from a [Microsoft Research blog post](https://www.microsoft.com/en-us/research/blog/three-mysteries-in-deep-learning-ensemble-knowledge-distillation-and-self-distillation/):

> *Note that knowledge distillation at least intuitively makes sense: the teacher ensemble model has 84.8% test accuracy, so the student individual model can achieve 83.8%. The following phenomenon, called self-distillation (or “Be Your Own Teacher”), is completely astonishing—by performing knowledge distillation against an individual model of the same architecture, test accuracy can also be improved. (See Figure 2 above.) Consider this: if training an individual model only gives 81.5% test accuracy, then how come “training the same model again using itself as the teacher” suddenly boosts the test accuracy consistently to 83.5%?*


<font color='#2980B9'><h3>Teacher Student Architecture</h3></font>
In knowledge distillation, the teacher-student architecture is a generic carrier for knowledge transfer. How to select or design proper structures for the teacher and the student is a very important but difficult problem.

<center><img src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11263-021-01453-z/MediaObjects/11263_2021_1453_Fig9_HTML.png" height="100" width="400"></center>

Knowledge distillation was originally designed to compress an ensemble of deep neural networks (Hinton et al., 2015). The complexity of deep neural networks mainly comes from two dimensions: depth and width. It is usually required to transfer knowledge from deeper and wider neural networks to shallower and thinner neural networks.

The student network is usually chosen to be: 
   1) A simplified version of a teacher network with fewer layers and fewer channels in each layer.   
   2) A quantized version of a teacher network in which the structure of the network is preserved.   
   3) A small network with efficient basic operations.   
   4) A small network with optimized global network structure.   
   5) The same network as the teacher.   
   
   
<font color='#2980B9'><h3>Knowledge Distillation in NLP</h3></font>
As a language representation model, BERT has attracted a lot of attention in natural language understanding, but it is also a cumbersome deep model that is not easy to deploy. To address this problem, several lightweight variations of BERT using knowledge distillation (an approach called BERT model compression) have been proposed.

<font color='#2980B9'><h4>1. Knowledge Distillation to Similar Model Architectures</h4></font>
In "Structured Pruning of a BERT-based Question Answering Model", knowledge distillation is used to transfer the knowledge contained in an unpruned teacher model to a pruned student.  

On the Natural Questions dataset, teacher performance sits at 70.3 and 58.8 F1 for Long Answer and Short Answer questions respectively. After pruning around 50% of the attention heads and feed-forward activations, performance drops to 67.8 and 55.5 F1 respectively, a decrease of 2.5 and 3.3 F1.  If a distillation loss is used in place of a cross-entropy loss during finetuning, they recover 1.5 and 2.9 F1 points and reach scores of 69.3 and 58.4.

<center><img src="https://www.pragmatic.ml/content/images/2020/04/Pruning-Performance-on-NQ-Dataset.png" height="100" width="500"></center>

1. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" by Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf of Hugging Face performs knowledge distillation from BERT-base to a 6-layer BERT student during a secondary pre-training step on a masked language modeling objective.  The student model (trained in a task-agnostic manner) retains an impressive 97% of model performance on the GLUE benchmark while reducing prediction time by 60%.

2. In "TinyBERT: Distilling BERT for Natural Language Understanding", Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu perform distillation from a BERT teacher to a 4-layer transformer student with hidden size 312.  They distill both at pre-training time and finetuning time to yield a model that achieves 96% of BERT-base's performance on the GLUE benchmark with a 7.5x smaller package and nearly 10x faster inference times.

3. In "Patient Knowledge Distillation for BERT Model Compression", Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu apply a knowledge distillation loss to many of the intermediate representations of a 12-layer BERT teacher and 6-layer BERT student to yield increased accuracy on 5/6 GLUE tasks when compared to a baseline that only applies the knowledge distillation loss to the model's logits.

<font color='#2980B9'><h4>2. Knowledge Distillation to Dissimilar Model Architectures</h4></font>

In the papers discussed so far, the teacher and student models share the same basic architecture, and teacher weights are often used to initialize the student model's weights. However, a knowledge distillation loss can be applied even in scenarios where teacher and student model architectures differ dramatically.

1. In "Training Compact Models for Low Resource Entity Tagging using Pre-trained Language Models", Peter Izsak, Shira Guskin, and Moshe Wasserblat of Intel AI Lab distill a BERT teacher (330M params) trained on a named entity recognition task into a dramatically more compact and efficient CNN-LSTM student (3M params).  In doing so they achieve speedups of up to 2 orders of magnitude on CPU hardware at minimal accuracy loss.

2. In "Distilling Transformers into Simple Neural Networks with Unlabeled Transfer Data", Subhabrata Mukherjee and Ahmed Hassan Awadallah distill BERT-Base and BERT-Large teachers into a BiLSTM student, matching the performance of the teacher in 4 classification tasks (Ag News, IMDB, Elec, and DBPedia) with a fraction of the parameter count (13M).  They also observe major gains in sample efficiency thanks to distillation, requiring only 500 labeled examples per task to hit parity with the teacher (provided sufficient unlabeled data).

3. In "Distilling Task-Specific Knowledge from BERT into Simple Neural Networks", Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin report similar findings on a variety of sentence pair tasks (QQP, MNLI, etc.) using a single-layer BiLSTM with less than 1M parameters.

4. In "Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models", Linqing Liu, Huan Wang, Jimmy Lin, Richard Socher, and Caiming Xiong combine multi-task learning with knowledge distillation methods to transfer the knowledge from a transformer teacher into a deep LSTM with attention.  They report that the benefits from distillation stack well with generalization benefits from multi-task learning frameworks and record prediction speeds of 30X that of Patient Knowledge Distillation and 7X that of TinyBERT.

<font color='#2980B9'><h4>Conclusion</h4></font>
Distillation is all the rage these days and it's clear why – it's shaping up to be a likely antidote to the ever-increasing parameter counts of transformer-based language models.  If we are to actualize the benefits of these GPU-hungry giants we need methods like distillation to keep prediction throughput high.

<font color='#2980B9'><h4>How to use?</h4></font>
There are several out-of-the-box distillation tools that can be used for implementing knowledge distillation:

   1. [Neural Network Distiller](https://github.com/IntelLabs/distiller): A Python Package For DNN Compression Research.
   2. [TextBrewer](https://github.com/airaria/TextBrewer): An Open-Source Knowledge Distillation Toolkit for Natural Language Processing.
   3. [torchdistill](https://github.com/yoshitomo-matsubara/torchdistill): A Modular, Configuration-Driven Framework for Knowledge Distillation.
   4. [KD-Lib](https://github.com/SforAiDl/KD_Lib): A PyTorch library for Knowledge Distillation, Pruning and Quantization.
   5. [Knowledge-Distillation-Zoo](https://github.com/AberHu/Knowledge-Distillation-Zoo): PyTorch implementation of various Knowledge Distillation (KD) methods.
   

<!-- <font color='#2980B9'><a id="section3"><h2>Pruning</h2></a></font> -->

<font color='#2980B9'><a id="section4"><h2>Dynamic Inference Acceleration</h2></a></font>
Besides directly compressing the model, there are methods that focus on reducing computational overhead at inference time, catering to individual input examples and dynamically changing the amount of computation required. 

<center><img src="https://drive.google.com/uc?export=view&id=1oyYsAa2Sx_zMa0pCLVkcyNlRwJuTjPyU" height="100" width="400"></center>

<font color='#2980B9'><h3>Early Exit Ramps</h3></font>
One way to speed up inference is to create intermediary exit points in the model. Since the classification layers are the least parameter-heavy part of BERT, a separate classifier can be trained on the output of each encoder unit. 

This gives the model a dynamic inference time that varies with the input. These classifiers can be trained either from scratch or by distilling the output of the final classifier. Below are the most prominent papers that use this technique:

 - [DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference](https://aclanthology.org/2020.acl-main.204/)
 - [BERT Loses Patience: Fast and Robust Inference with Early Exit](https://arxiv.org/abs/2006.04152)
 - [Accelerating BERT Inference for Sequence Labeling via Early-Exit](https://arxiv.org/abs/2105.13878)
 - [FastBERT: a Self-distilling BERT with Adaptive Inference Time](https://arxiv.org/abs/2004.02178)
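
A minimal sketch of the exit decision, assuming an entropy-based confidence criterion: the model walks through the per-layer classifiers and stops at the first one whose prediction entropy falls below a threshold. The function names, the threshold value, and the logits are illustrative assumptions, and the papers above differ in the exact exit rule they use. The input is consumed lazily, so in a real model the deeper layers would never be computed after an exit.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; low entropy means a confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def predict_with_early_exit(layer_logits, threshold=0.3):
    """Return (predicted_label, layers_used).

    layer_logits is a non-empty iterable yielding the logits of each layer's
    classifier in order. If no classifier is confident enough, the last one decides.
    """
    for depth, logits in enumerate(layer_logits, start=1):
        probs = softmax(logits)
        if entropy(probs) < threshold:
            break
    return probs.index(max(probs)), depth

easy_input = [[3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]   # confident at the first exit
hard_input = [[0.1, 0.0], [0.5, 0.0], [4.0, 0.0]]   # confident only at the last exit

print(predict_with_early_exit(iter(easy_input)))
print(predict_with_early_exit(iter(hard_input)))
```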

<font color='#2980B9'><h3>Progressive Word Vector Elimination</h3></font>
Another way to accelerate inference is to reduce the number of word vectors processed at each encoder level. Since we use only the final output corresponding to the [CLS] token as a representation of the complete sentence, the information of the entire sentence must have been fused into that one token. The idea is that such a fusion cannot be sudden and must happen progressively across the encoder levels. We can use this observation to lighten the later encoder units by reducing the sentence length through word vector elimination at each step.

 - [PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination](https://arxiv.org/abs/2001.08950)
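
The elimination step can be sketched as follows, assuming each word vector is scored by the total attention it receives from all positions and only the highest-scoring vectors are passed on to the next encoder unit. The function name, the scoring over a single attention matrix, and the choice to always keep position 0 (the [CLS] token) are assumptions of this sketch rather than a faithful reproduction of the paper.

```python
def eliminate_word_vectors(hidden_states, attention, keep):
    """Keep the `keep` most significant word vectors, preserving their original order.

    attention[i][j] is the attention weight that position i pays to position j.
    Position 0 ([CLS]) is always retained; keep must be at least 1.
    """
    n = len(hidden_states)
    # Significance of position j: total attention it receives from every position.
    significance = [sum(attention[i][j] for i in range(n)) for j in range(n)]
    others = sorted(range(1, n), key=lambda j: significance[j], reverse=True)[:keep - 1]
    kept = sorted([0] + others)
    return [hidden_states[j] for j in kept], kept

# Hypothetical 4-token sequence with 2-dimensional hidden states.
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
attention = [
    [0.1, 0.6, 0.1, 0.2],
    [0.2, 0.5, 0.1, 0.2],
    [0.1, 0.7, 0.1, 0.1],
    [0.3, 0.4, 0.1, 0.2],
]

shorter, kept = eliminate_word_vectors(hidden_states, attention, keep=3)
print(kept, shorter)
```

Each later encoder unit then processes a shorter sequence, which reduces both the attention and feed-forward cost.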


<font color='#2980B9'><a id="section5"><h2>Matrix Decomposition</h2></a></font>
Matrix Decomposition (or parameter factorization) is a technique that decomposes higher-rank tensors into lower-rank tensors simplifying memory access and compressing model size. It works by breaking large layers into many smaller ones, thereby reducing the number of computations. 

Parameter factorization can be applied either to token embeddings (which saves a lot of memory on disk) or to parameters in feed-forward / self-attention layers (for some speed improvements).

<font color='#2980B9'><h3>Weight Matrix Decomposition</h3></font>
One way to reduce the computational overhead of the model is weight matrix factorization, which replaces the original A × B weight matrix with the product of two smaller matrices (A × C and C × B). The reduction in model size as well as runtime memory usage is sizable if C << A, B.

   - [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942)
   - [Compressing Pre-trained Language Models by Matrix Decomposition](https://aclanthology.org/2020.aacl-main.88/)
   - [LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression](https://arxiv.org/abs/2004.04124)
   - [EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference](https://arxiv.org/abs/2011.14203)
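
The savings from replacing an A × B matrix with A × C and C × B factors are easy to check numerically. The sketch below counts parameters for an embedding-sized example (the sizes are illustrative) and verifies on a tiny case that applying the two factors in sequence gives the same result as applying their product. The helper names are hypothetical.

```python
def parameter_counts(a, b, c):
    """Return (full, factored) parameter counts for an a x b matrix and its a x c, c x b factors."""
    return a * b, c * (a + b)

def matmul(x, y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]

# Illustrative sizes: a 30000-entry vocabulary, hidden size 768, bottleneck 128.
full, factored = parameter_counts(30000, 768, 128)
print(f"full: {full:,}  factored: {factored:,}  ratio: {full / factored:.1f}x")

# Tiny check that the factors can be applied one after the other.
u = [[1, 2], [3, 4], [5, 6]]          # 3 x 2 factor
v = [[1, 0, 2, 1], [0, 1, 1, 3]]      # 2 x 4 factor
x = [[1, 1, 1]]                       # one input row

via_product = matmul(x, matmul(u, v)) # multiply by the reconstructed 3 x 4 matrix
via_factors = matmul(matmul(x, u), v) # multiply by each factor in turn
print(via_product, via_factors)
```

Applying the factors in sequence means the full A × B matrix never has to be materialized, which is where the memory saving at runtime comes from.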


<font color='#2980B9'><h3>Attention Decomposition</h3></font>
<center><img src="https://drive.google.com/uc?export=view&id=1megDH-lqPUhpJQXh1rQ4YMEPI9BlMcWj" height="100" width="400"></center>

The importance of attention calculation over the entire sentence has been explored, revealing a large number of redundant computations. One way to resolve this issue is to calculate attention in smaller groups, binning tokens by spatial locality, by magnitude-based locality, or with an adaptive attention span.

Moreover, since the outputs are calculated independently, local attention methods also enable a higher degree of parallel processing, and individual representations can be saved during inference for multiple uses. Since the attention score computation itself contains no weights (unlike the query, key, and value projections), these methods improve only runtime memory costs and execution speed, not model size.

 - [Synthesizer: Rethinking Self-Attention in Transformer Models](https://arxiv.org/abs/2005.00743)
 - [DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering](https://arxiv.org/abs/2005.00697)
 - [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451)
 - [Linformer: Self-Attention with Linear Complexity](https://arxiv.org/abs/2006.04768)
 - [Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection](https://arxiv.org/abs/1912.11637)
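
As a rough illustration of the spatial-locality variant, the sketch below counts how many query-key score computations are needed when tokens only attend within fixed-size contiguous blocks, compared with full attention. The function names and the block size are illustrative assumptions, and the papers above use more elaborate grouping schemes.

```python
def full_attention_pairs(n):
    """Every token attends to every token."""
    return n * n

def block_local_pairs(n, block_size):
    """Tokens attend only to tokens inside their own contiguous block."""
    pairs = 0
    for start in range(0, n, block_size):
        size = min(block_size, n - start)   # the last block may be shorter
        pairs += size * size
    return pairs

def block_mask(n, block_size):
    """mask[i][j] is True when token i is allowed to attend to token j."""
    return [[i // block_size == j // block_size for j in range(n)] for i in range(n)]

n, block_size = 512, 64
print("full attention scores :", full_attention_pairs(n))
print("block-local scores    :", block_local_pairs(n, block_size))
print("reduction factor      :", full_attention_pairs(n) / block_local_pairs(n, block_size))
```

The number of score computations drops from quadratic in the sequence length to linear for a fixed block size, while the number of model parameters stays unchanged.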


<!-- <font color='#2980B9'><a id="section6"><h2>Operation Fusion</h2></a></font>

<font color='#2980B9'><a id="section7"><h2>Module Replacement</h2></a></font>

<font color='#2980B9'><a id="section8"><h2>Others</h2></a></font> -->