benchmark

op

naive C with OpenMP

for for for
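A minimal sketch of what the three nested loops might look like. The outline does not name the op being benchmarked, so this assumes a 3x3 convolution with stride 1 and no padding, where each channel has its own 3x3 kernel. The function name `conv3x3_naive` and the memory layout (channels x h x w, row-major) are hypothetical.

```c
#include <stddef.h>

/* Hypothetical op: 3x3 convolution, stride 1, no padding.
 * in     : channels x h x w
 * kernel : channels x 9 (one 3x3 kernel per channel)
 * out    : channels x (h-2) x (w-2) */
void conv3x3_naive(const float *in, const float *kernel, float *out,
                   int channels, int h, int w)
{
    const int outh = h - 2;
    const int outw = w - 2;

    /* Channels are independent, so they can be split across threads.
     * Without -fopenmp the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for
    for (int c = 0; c < channels; c++) {
        const float *src = in + (size_t)c * h * w;
        const float *k = kernel + (size_t)c * 9;
        float *dst = out + (size_t)c * outh * outw;

        for (int y = 0; y < outh; y++) {
            for (int x = 0; x < outw; x++) {
                float sum = 0.f;
                for (int ky = 0; ky < 3; ky++)
                    for (int kx = 0; kx < 3; kx++)
                        sum += src[(y + ky) * w + (x + kx)] * k[ky * 3 + kx];
                dst[y * outw + x] = sum;
            }
        }
    }
}
```

The "for for for" here is channel, row, column; the two small inner loops only walk the 3x3 window.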

unroll, first try

h
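One possible reading of "h" is unrolling along the output height, so that two output rows are produced per iteration and the input rows they share are reused. The sketch below assumes that reading and the same hypothetical 3x3, stride 1, no padding convolution on a single channel; `conv3x3_unroll_h2` is an illustrative name, not something from the original material.

```c
/* Hypothetical single-channel 3x3 convolution, stride 1, no padding,
 * unrolled by two along the output height. */
void conv3x3_unroll_h2(const float *src, const float *k, float *dst,
                       int h, int w)
{
    const int outh = h - 2;
    const int outw = w - 2;
    int y = 0;

    /* Two output rows per iteration: they need input rows y..y+3,
     * and rows y+1 and y+2 are shared between them. */
    for (; y + 1 < outh; y += 2) {
        const float *r0 = src + (y + 0) * w;
        const float *r1 = src + (y + 1) * w;
        const float *r2 = src + (y + 2) * w;
        const float *r3 = src + (y + 3) * w;
        float *d0 = dst + (y + 0) * outw;
        float *d1 = dst + (y + 1) * outw;

        for (int x = 0; x < outw; x++) {
            float s0 = 0.f, s1 = 0.f;
            for (int kx = 0; kx < 3; kx++) {
                s0 += r0[x + kx] * k[kx] + r1[x + kx] * k[3 + kx]
                    + r2[x + kx] * k[6 + kx];
                s1 += r1[x + kx] * k[kx] + r2[x + kx] * k[3 + kx]
                    + r3[x + kx] * k[6 + kx];
            }
            d0[x] = s0;
            d1[x] = s1;
        }
    }

    /* Leftover row when the output height is odd. */
    for (; y < outh; y++) {
        const float *r0 = src + (y + 0) * w;
        const float *r1 = src + (y + 1) * w;
        const float *r2 = src + (y + 2) * w;
        float *d0 = dst + y * outw;

        for (int x = 0; x < outw; x++) {
            float s0 = 0.f;
            for (int kx = 0; kx < 3; kx++)
                s0 += r0[x + kx] * k[kx] + r1[x + kx] * k[3 + kx]
                    + r2[x + kx] * k[6 + kx];
            d0[x] = s0;
        }
    }
}
```

Each pair of output rows touches four input rows instead of six, which is where the saving would come from under this assumption.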

register allocation

kernels

unroll, second try

simd

NEON intrinsics

optional
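A small sketch of a multiply-accumulate written with NEON intrinsics, with a scalar fallback so the same file still builds where NEON is not available. The function `mla_scalar_weight` and the choice of op (dst += src * k with a single weight k) are assumptions made for illustration; `vdupq_n_f32`, `vld1q_f32`, `vmlaq_f32` and `vst1q_f32` are standard `arm_neon.h` intrinsics.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] += src[i] * k for i in [0, n). */
void mla_scalar_weight(float *dst, const float *src, float k, int n)
{
    int i = 0;

#if defined(__ARM_NEON)
    /* Broadcast the weight once, then process four floats per iteration. */
    const float32x4_t vk = vdupq_n_f32(k);
    for (; i + 3 < n; i += 4) {
        float32x4_t vsrc = vld1q_f32(src + i);
        float32x4_t vdst = vld1q_f32(dst + i);
        vdst = vmlaq_f32(vdst, vsrc, vk);   /* vdst + vsrc * vk */
        vst1q_f32(dst + i, vdst);
    }
#endif

    /* Remaining elements, or the whole array when NEON is not available. */
    for (; i < n; i++)
        dst[i] += src[i] * k;
}
```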

naive NEON assembly with pld

asm

pipeline optimization, first try

use more registers for load and mla

pipeline optimization, second try

interleave load and mla

pipeline optimization, third try

loop tail
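The shape of a software-pipelined loop with an explicit tail can be sketched in plain C. This is only an illustration of the structure (first load before the loop, next load issued ahead of the current multiply-accumulate, last element finished after the loop); it is not the assembly from the talk, and a compiler is free to reorder it. The function `mla_pipelined` is hypothetical and assumes `dst` and `src` do not overlap.

```c
/* Hypothetical helper: dst[i] += src[i] * k, written in pipelined form. */
void mla_pipelined(float *dst, const float *src, float k, int n)
{
    if (n <= 0)
        return;

    /* Prologue: issue the first load before entering the loop. */
    float cur = src[0];

    /* Steady state: the load for element i+1 is written ahead of the
     * multiply-accumulate for element i, so the two can overlap. */
    for (int i = 0; i + 1 < n; i++) {
        float next = src[i + 1];
        dst[i] += cur * k;
        cur = next;
    }

    /* Loop tail: the last element is already loaded; finish it here
     * without loading past the end of src. */
    dst[n - 1] += cur * k;
}
```

Handling the final element outside the loop is what keeps the early load from reading one element past the end of the input.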

usual practice, load/save

233

usual practice, unroll

233

usual practice, save register

233