Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Need a trampoline to pick between NI/gladman #8

TomMD opened this Issue Dec 22, 2012 · 7 comments


None yet
2 participants

TomMD commented Dec 22, 2012

I see NI is enabled by a flag (and CPU test). I've previously made a trampoline for this type of case. If you are sure turning on NI and running on a non-NI CPU doesn't harm performance then we can delete this issue, if not then I will find time to make a trampoline patch.


TomMD commented Dec 22, 2012

It seems a bit noisy, but my previous AES work, based on an older version of your NI and gladmans AES with a trampoline, does seem to show an improvement so perhaps I'll port a patch.

AES128_tmd:  120.4667 ns  126.5365 ns  174.2186 ns  349.0955 ns  588.7610 ns  2.029187 us  7.677496 us
               126.7 M/s    241.2 M/s    700.7 M/s   1398.7 M/s   1658.7 M/s   1925.0 M/s   2035.2 M/s
AES128:  169.6261 ns  179.7410 ns  207.4656 ns  358.6883 ns  541.5221 ns  1.690435 us  6.132310 us
                90.0 M/s    169.8 M/s    588.4 M/s   1361.3 M/s   1803.4 M/s   2310.8 M/s   2548.0 M/s

vincenthz commented Dec 22, 2012

i'm not sure what your benchmark is benchmarking exactly, but 2 things on my list are:

  • gladman version is faster than my handrolled version. There's some low hanging fruits that i haven't time to pick up, but probably some more complicated one too.
  • I've unoptimised some of the aesni/non aesni path differences since it was causing some problem in windows and macosx.

TomMD commented Dec 31, 2012

The benchmark

is showing the performance impact of my main difference from your
cipher-aes pacakage is the trampolines. By 'trampolines' I am referring
to the C routines and function pointers that first detect if NI is
available then set the function pointer to always use NI (or not) without
any further checks. This is in contrast to dynamically checking for NI on every single AES call.

In pseduo-C:

void *(*aes_encrypt_ecb_ptr(key *k, uint8_t* out, uint8_t* in, uint32_t nrBlocks)
                    = &detect_ni_and_encrypt;

    if(has_ni()) aes_encrypt_ecb_ptr = &aes_ni_encrypt_ecb;


After updating to use your latest NI code, the trampoline method compares
favorably to dynamically checking at every single call. In the below the
most sensitive benchmark is the 16byte case (because the time to check for
NI is the only difference in the computation, the actual cipher routine is

  cipher|     16 bytes     32 bytes     64 bytes    512 bytes   1024 bytes   4096 bytes  16384 bytes
AES128_tmd:  123.2306 ns  125.3117 ns  161.5419 ns  312.6124 ns  510.0473 ns  1.688639 us  6.348553 us
             123.8 M/s    243.5 M/s    755.7 M/s   1561.9 M/s   1914.7 M/s   2313.3 M/s   2461.2 M/s
AES128:  179.7151 ns  180.8886 ns  215.2212 ns  367.9218 ns  569.0574 ns  1.758038 us  6.448365 us
              84.9 M/s    168.7 M/s    567.2 M/s   1327.1 M/s   1716.1 M/s   2221.9 M/s   2423.1 M/s

Moving Forward

If we wish to update cipher-aes, we need to change the way it calls the NI
routines. Right now it performs a dynamic check on the size of the AES
key then dispatch based on that check. This is needed because we have one
type for all routines (Key could mean 128,192, or 256 bits). If we had
separate keys in Haskell, calling separate (but mostly identical) routines
in C, then this dynamic check wouldn't be needed and we could reap our

An alternate, short-term, option is that I flesh-out and release the
current code as "cipher-aes128". I'd have to add useful modes of
operation, such as GCM, but I already have (and the benchmark used) the
crypto-api interface.


TomMD commented Dec 31, 2012

I've implemented the code I would use and made a repo: https://github.com/TomMD/cipher-aes128

This is still painfully slow for GCM, but at least the basic framework is there, which saves almost 50ns (33%) when dealing with small payloads.


vincenthz commented Jan 3, 2013

I've been looking at your trampoline code, and will definitely add something similar.

Althought I can't reasonably export more C functions (for support reason) that do almost the same thing just to differentiate the key size, and I can't remove the support for 192/256 bits support; so i need something slightly different. One thing that's preventing a simpler trampoline is the lack of support for 192 and 256 bits with aesni; maybe this is where i'll carry on.

For GCM, cipher-aes itself is slow too, and i think this is mainly due to my crappy GF implementation (and no pcmulqdq implementation yet).


vincenthz commented Aug 13, 2013

i think it's mostly fixed in 0.2.x now. Using more table, with a much shorter initialization phase.

@vincenthz vincenthz closed this Aug 13, 2013


TomMD commented Aug 14, 2013

Yes, it is done and is a nice solution. Good work. I ran some benchmarks a little while ago and everything appeared to be competitive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment