benchmark
    op
naive C with openmp
    for for for
unroll, first try
    h
register allocation
    kernels
unroll, second try
    simd
neon intrinsics
    optional
naive neon assembly with pld
    asm
pipeline optimize, first try
    more register load mla
pipeline optimize, second try
    interleave load mla
pipeline optimize, third try
    loop tail
usual practice, load/save
    233
usual practice, unroll
    233
usual practice, save register
    233
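
As a rough illustration of the first two steps in the outline (a naive C loop with OpenMP, then a first unrolling attempt that needs a loop tail), here is a minimal sketch in portable C. The operation (scale and bias over a flat float array) and the function names `scale_bias_naive` and `scale_bias_unroll4` are assumptions chosen for illustration; the outline does not say which op is being optimized. No NEON intrinsics or assembly are used, so it builds on any target.

```c
#include <stddef.h>

/* Hypothetical op, not taken from the outline: out[i] = in[i] * scale + bias. */

/* Naive version: one multiply-add per iteration.
 * The pragma only takes effect when built with OpenMP (e.g. -fopenmp);
 * otherwise the compiler ignores it and the loop runs serially. */
void scale_bias_naive(const float *in, float *out, int n, float scale, float bias)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * scale + bias;
    }
}

/* Unrolled by 4: the main loop handles groups of four elements,
 * and a scalar loop tail handles the remaining n % 4 elements. */
void scale_bias_unroll4(const float *in, float *out, int n, float scale, float bias)
{
    int i = 0;

    for (; i + 3 < n; i += 4) {
        out[i + 0] = in[i + 0] * scale + bias;
        out[i + 1] = in[i + 1] * scale + bias;
        out[i + 2] = in[i + 2] * scale + bias;
        out[i + 3] = in[i + 3] * scale + bias;
    }

    /* Loop tail: leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        out[i] = in[i] * scale + bias;
    }
}
```

Both functions should produce the same output for any `n`, including lengths that are not a multiple of 4. The group-of-four body is the shape that maps naturally onto a 4-lane float SIMD register in the later steps, which is why 4 was chosen as the unroll factor here.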