High-speed 8-bit 3x3 Convolutions and Linear Layers for x86 VNNI (256-bit vectors) and Arm v8.6-a instructions.
This repo provides fast and scalable kernels for quantized int8 convolutions and linear layers. I noticed that quantized PyTorch networks run well below the attainable throughput on modern Arm and x86 machines, and I wanted to see whether more efficient implementations (individual kernels and full networks) are possible. For example, for a VGG network the speed comparison is:
| Platform                      | PyTorch                | This library   |
|-------------------------------|------------------------|----------------|
| x86: Intel Core Ultra 7 155H  | 326 GFLOP/s            | 818 GFLOP/s    |
| Arm: AWS Graviton4 (8 cores)  | 290 GFLOP/s (QNNPACK)  | 1.5 TFLOP/s    |
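Both targets named in the title expose an 8-bit dot-product instruction that kernels of this kind build on. The snippet below only illustrates those instructions (it is not code from this library); the intrinsics shown are the standard ones from the respective ISAs.

```cpp
// Illustration only: the 8-bit dot-product instructions int8 kernels build on.
#include <cstdint>

#if defined(__AVXVNNI__)
#include <immintrin.h>

// x86 AVX-VNNI: vpdpbusd accumulates 32 u8*s8 products (4 per lane) into 8 i32 lanes.
static inline __m256i dot_accumulate(__m256i acc, __m256i a_u8, __m256i b_s8) {
    return _mm256_dpbusd_epi32(acc, a_u8, b_s8);
}

#elif defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>

// Arm dotprod (available on v8.6-a cores): sdot accumulates 16 s8*s8 products
// (4 per lane) into 4 i32 lanes.
static inline int32x4_t dot_accumulate(int32x4_t acc, int8x16_t a_s8, int8x16_t b_s8) {
    return vdotq_s32(acc, a_s8, b_s8);
}
#endif
```

Note the asymmetry on x86: `vpdpbusd` multiplies an unsigned 8-bit operand with a signed one, which typically shapes how activations and weights are laid out in VNNI kernels.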
The implementation in this repo is fast because it:
- uses efficient, scalable kernels,
- avoids memory allocations at runtime,
- lets the compiler optimize for the deployment sizes of the model, as is common for inference-only frameworks (see the sketch after this list).
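As a sketch of the last point, here is a hypothetical example (not this library's actual API) of how baking the deployment-time shapes into template parameters lets the compiler unroll and vectorize a kernel for exactly those sizes:

```cpp
// Hypothetical sketch (not this library's API): with the layer shape fixed at
// compile time, all loop bounds are constants and the compiler can fully
// unroll the 3x3 taps and vectorize the channel loop.
#include <cstdint>

template <int Channels, int Height, int Width>
void conv3x3_acc(const int8_t* input, const int8_t* kernel, int32_t* output) {
    for (int y = 0; y < Height - 2; ++y) {
        for (int x = 0; x < Width - 2; ++x) {
            int32_t acc = 0;
            for (int c = 0; c < Channels; ++c)
                for (int ky = 0; ky < 3; ++ky)
                    for (int kx = 0; kx < 3; ++kx)
                        acc += int32_t(input[((y + ky) * Width + (x + kx)) * Channels + c]) *
                               int32_t(kernel[(ky * 3 + kx) * Channels + c]);
            output[y * (Width - 2) + x] = acc;
        }
    }
}

// One explicit instantiation per layer of the deployed model, e.g.:
template void conv3x3_acc<64, 224, 224>(const int8_t*, const int8_t*, int32_t*);
```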
A few comparisons with libtorch might be interesting:
- This library does far fewer heap-memory allocations (and none during inference): 140k (libtorch) vs. 151 (this library).
- It has roughly 3x fewer L1 cache misses: 3.5% (libtorch) vs. 1.2% (this library).
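One possible way to reproduce such an allocation count (illustrative only; not necessarily how the numbers above were measured) is to override the global `operator new` and count calls around the inference loop:

```cpp
// Counts C++ heap allocations (not raw malloc) around an inference call.
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>

static std::atomic<std::size_t> g_alloc_count{0};

void* operator new(std::size_t size) {
    g_alloc_count.fetch_add(1, std::memory_order_relaxed);
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

int main() {
    // ... build the network and warm it up here ...
    const std::size_t before = g_alloc_count.load();
    // run_inference(network, input);   // hypothetical call
    const std::size_t after = g_alloc_count.load();
    std::printf("heap allocations during inference: %zu\n", after - before);
}
```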
Note that a build requires at least gcc 15 or clang 19. The code can be built with CMake as:
```
cmake -S . -B build
cmake --build build
```
For comparison, a quantized VGG libtorch model is built as well.
Because there are no pre-built libtorch binaries for Arm, libtorch has to be built from source.
The benchmarks are located in the bench folder. For example, one can run:
```
./bench/benchmark_torch PATH_TO_INT8_PT_FILE
./bench/benchmark_network PATH_TO_INT8_FILE PATH_TO_INPUT_DATA
```
The quantized PyTorch weights (in TorchScript format) can be downloaded here: https://drive.google.com/file/d/1aeedspAvXb7UJOMJcu_UCnrPXP5b5-ay/view?usp=drive_link
The weights in FBS format can be downloaded here: https://drive.google.com/file/d/1tPdKA3pPHBj5c6Oc_bB__eAOrhjvpSOP/view?usp=drive_link