LIBXSMM/MKL JIT backend

Fastor is a stand-alone library and does not depend on any external BLAS for its linear algebra routines. However, given that Fastor's primary focus is on speeding up operations on small tensors, it also provides the ability to switch backends to libxsmm or MKL JIT for small matrix-matrix and matrix-vector multiplications, should the need arise. This is specifically useful in cases where performance portability is important.

To activate the libxsmm backend, first download and build libxsmm as a static or shared library, then compile your Fastor code by passing the -DFASTOR_USE_LIBXSMM flag to the compiler. You then need to link to libxsmm with -L/path/to/libxsmm/lib/ -lxsmm -lblas -ldl. Fastor will automatically switch to libxsmm for all matmul routines.
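
As a minimal sketch of what such a build could look like, the program below multiplies two matrices large enough to cross the default dispatch threshold (the file name, compiler invocation, standard flag and matrix size are illustrative assumptions, not prescriptions):

```
// main.cpp -- a minimal sketch. A compile line along these lines (compiler,
// standard flag and paths are placeholders) picks up the backend:
//   g++ -O3 -std=c++17 -DFASTOR_USE_LIBXSMM main.cpp \
//       -L/path/to/libxsmm/lib/ -lxsmm -lblas -ldl
#include <Fastor/Fastor.h>
using namespace Fastor;

int main() {
    // 24x24 exceeds the default BLAS switch size of 16, so this matmul
    // is dispatched to libxsmm when FASTOR_USE_LIBXSMM is defined
    Tensor<double, 24, 24> A; A.random();
    Tensor<double, 24, 24> B; B.random();
    Tensor<double, 24, 24> C = matmul(A, B);
    print(C);
    return 0;
}
```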

To activate the MKL JIT backend, first download and install the Intel MKL library, then compile your Fastor code by passing the -DFASTOR_USE_MKL flag to the compiler. You then need to link to MKL with -L/path/to/mkl/lib/ -lmkl_rt. Fastor will automatically switch to MKL for all matmul routines.
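
The build for the MKL path is analogous; only the macro and the link line change. A sketch (compiler invocation and paths are again placeholders):

```
// Same main.cpp as above; only the build flags differ:
//   g++ -O3 -std=c++17 -DFASTOR_USE_MKL main.cpp -L/path/to/mkl/lib/ -lmkl_rt
// With FASTOR_USE_MKL defined, the same matmul(A, B) call is JIT-compiled
// and dispatched through MKL instead.
```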

The switch can also be configured based on the matrix size using the compiler flag FASTOR_BLAS_SWITCH_MATRIX_SIZE. The default value is 16 for square matrices; that is, matrix multiplications with M=N=K>16 will be dispatched to BLAS if one is available. For non-square matrices, the default dispatch condition is cbrt(M*N*K)>16, i.e. M*N*K>4096. It is best to experiment with this value on your own architecture.
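
For instance, to raise the threshold so that only larger multiplications go to the BLAS backend, the switch can be defined at compile time (the value 32 below is purely illustrative):

```
// Raise the dispatch threshold from the default 16 to 32 (illustrative value):
//   g++ -O3 -std=c++17 -DFASTOR_USE_LIBXSMM -DFASTOR_BLAS_SWITCH_MATRIX_SIZE=32 \
//       main.cpp -L/path/to/libxsmm/lib/ -lxsmm -lblas -ldl
//
// With this setting, a square matmul goes to BLAS only when M=N=K>32, and a
// non-square one only when cbrt(M*N*K) > 32, i.e. M*N*K > 32768; everything
// smaller stays on Fastor's built-in SIMD kernels.
```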

Here is a Google Benchmark run of a complex Kalman filter problem implemented in Fastor (which uses a lot of matmul operations on square and non-square matrices), comparing the built-in matmul against the libxsmm-dispatched matmul:

BUILT-IN MATMUL:

```
Run on (8 X 2300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 1.31, 1.23, 1.06
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
CovarianceUpdateFastor<8>                        245 ns          244 ns      2574429
CovarianceUpdateFastor<16>                       948 ns          948 ns       614202
CovarianceUpdateFastor<32>                      5483 ns         5480 ns        99416
CovarianceUpdateFastor<64>                     28338 ns        28306 ns        14543
CovarianceUpdateFastor<128>                   221356 ns       221054 ns         1912
```
LIBXSMM DISPATCHED MATMUL:

```
Run on (8 X 2300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 1.31, 1.23, 1.06
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
CovarianceUpdateFastor<8>                        254 ns          254 ns      2614574
CovarianceUpdateFastor<16>                       955 ns          955 ns       620507
CovarianceUpdateFastor<32>                      5553 ns         5550 ns       123622
CovarianceUpdateFastor<64>                     30319 ns        30296 ns        23204
CovarianceUpdateFastor<128>                   218788 ns       218571 ns         3583
```

The default switch values were used for dispatching in this analysis. Notice that Fastor is well tuned for small matrix-matrix multiplications; in this case libxsmm only takes over marginally at size 128.