I see NI is enabled by a flag (and CPU test). I've previously made a trampoline for this type of case. If you are sure turning on NI and running on a non-NI CPU doesn't harm performance then we can delete this issue, if not then I will find time to make a trampoline patch.
It seems a bit noisy, but my previous AES work, based on an older version of your NI and gladmans AES with a trampoline, does seem to show an improvement so perhaps I'll port a patch.
AES128_tmd: 120.4667 ns 126.5365 ns 174.2186 ns 349.0955 ns 588.7610 ns 2.029187 us 7.677496 us
126.7 M/s 241.2 M/s 700.7 M/s 1398.7 M/s 1658.7 M/s 1925.0 M/s 2035.2 M/s
AES128: 169.6261 ns 179.7410 ns 207.4656 ns 358.6883 ns 541.5221 ns 1.690435 us 6.132310 us
90.0 M/s 169.8 M/s 588.4 M/s 1361.3 M/s 1803.4 M/s 2310.8 M/s 2548.0 M/s
i'm not sure what your benchmark is benchmarking exactly, but 2 things on my list are:
is showing the performance impact of my main difference from your
cipher-aes pacakage is the trampolines. By 'trampolines' I am referring
to the C routines and function pointers that first detect if NI is
available then set the function pointer to always use NI (or not) without
any further checks. This is in contrast to dynamically checking for NI on every single AES call.
void *(*aes_encrypt_ecb_ptr(key *k, uint8_t* out, uint8_t* in, uint32_t nrBlocks)
if(has_ni()) aes_encrypt_ecb_ptr = &aes_ni_encrypt_ecb;
After updating to use your latest NI code, the trampoline method compares
favorably to dynamically checking at every single call. In the below the
most sensitive benchmark is the 16byte case (because the time to check for
NI is the only difference in the computation, the actual cipher routine is
cipher| 16 bytes 32 bytes 64 bytes 512 bytes 1024 bytes 4096 bytes 16384 bytes
AES128_tmd: 123.2306 ns 125.3117 ns 161.5419 ns 312.6124 ns 510.0473 ns 1.688639 us 6.348553 us
123.8 M/s 243.5 M/s 755.7 M/s 1561.9 M/s 1914.7 M/s 2313.3 M/s 2461.2 M/s
AES128: 179.7151 ns 180.8886 ns 215.2212 ns 367.9218 ns 569.0574 ns 1.758038 us 6.448365 us
84.9 M/s 168.7 M/s 567.2 M/s 1327.1 M/s 1716.1 M/s 2221.9 M/s 2423.1 M/s
If we wish to update cipher-aes, we need to change the way it calls the NI
routines. Right now it performs a dynamic check on the size of the AES
key then dispatch based on that check. This is needed because we have one
type for all routines (Key could mean 128,192, or 256 bits). If we had
separate keys in Haskell, calling separate (but mostly identical) routines
in C, then this dynamic check wouldn't be needed and we could reap our
An alternate, short-term, option is that I flesh-out and release the
current code as "cipher-aes128". I'd have to add useful modes of
operation, such as GCM, but I already have (and the benchmark used) the
I've implemented the code I would use and made a repo: https://github.com/TomMD/cipher-aes128
This is still painfully slow for GCM, but at least the basic framework is there, which saves almost 50ns (33%) when dealing with small payloads.
I've been looking at your trampoline code, and will definitely add something similar.
Althought I can't reasonably export more C functions (for support reason) that do almost the same thing just to differentiate the key size, and I can't remove the support for 192/256 bits support; so i need something slightly different. One thing that's preventing a simpler trampoline is the lack of support for 192 and 256 bits with aesni; maybe this is where i'll carry on.
For GCM, cipher-aes itself is slow too, and i think this is mainly due to my crappy GF implementation (and no pcmulqdq implementation yet).
i think it's mostly fixed in 0.2.x now. Using more table, with a much shorter initialization phase.
Yes, it is done and is a nice solution. Good work. I ran some benchmarks a little while ago and everything appeared to be competitive.