[wip] IP checksum in AVX2 assembler (prototype rewrite) #899
This branch rewrites the AVX2 checksum routine with DynASM assembler instead of GCC intrinsics.
My motivations and perceived benefits are a little subjective:
I will be satisfied if the assembler version is at least as short as the C version and also at least as fast.
Looks promising so far. The version here is basically working but seems to be missing a carry bit somewhere (off-by-one on some tests). The code is a little over half the size of the C version. The performance seems at least as good.
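For context on where a carry can go missing: the Internet checksum is a ones-complement sum, so any carry out of the low 16 bits has to be added back in (the "end-around carry"). Below is a minimal scalar reference in the style of RFC 1071 that a SIMD routine could be compared against. It is a sketch for illustration, not the code from this branch, and the name `cksum_reference` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not from this PR): ones-complement
   checksum over big-endian 16-bit words, in the style of RFC 1071. */
static uint16_t cksum_reference(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    /* Add up complete 16-bit words in network byte order. */
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];

    /* An odd trailing byte is treated as if padded with a zero byte. */
    if (i < len)
        sum += (uint32_t)data[i] << 8;

    /* Fold carries back into the low 16 bits until none remain.
       Dropping one of these carries gives a result that is off by one. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

For the example bytes in RFC 1071 (`00 01 f2 03 f4 f5 f6 f7`) the folded sum is `0xddf2` and this function returns `0x220d`.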
I just have simple microbenchmarks for now. I would like to get a more thorough performance test like #755 upstream to verify this change. Meanwhile, here is how it looks in comparison to the C code based on the microbenchmark included on the PR (1 million iterations on the same input). One case with a 150-byte input:
Here is a little table with some other values:
EDIT: Marked with
So: fun and encouraging so far but more work to be done. Could still be some big mistakes that invalidate this code and/or results for now.
Fixed the bug with 4993c37. This code works fine up to 128KB inputs in casual testing. That limitation seems okay to me, i.e. not worth writing more code to raise it, because packets are not that big.
The next step is to integrate and test/benchmark more extensively both with synthetic benchmarks and end-to-end tests (offloading checksums from QEMU VMs).
This is intended to be fairly harsh and realistic. The input sizes, contents, and alignments are randomized. I draw the input sizes from a log-uniform distribution which is my current favorite for packet sizes (mostly small but also including large and jumbo sizes). I also update the assembler routine to have the same interface as the others (added the
The assembler routine shows the best results by far. The older AVX routine is likely suffering from the logic that selectively falls back on the generic routine on small inputs (both for the unpredictable branch and because the current implementation of generic checksum is beyond awful -- should fix that as a matter of principle even if we are using the SIMD one in practice.)
This is starting to look like effort well spent! IP checksum is the main hotspot for Virtio-net with client/server workloads e.g. running iperf in VM. Cycles saved here should translate directly into extra capacity for the NFV application.
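As a rough illustration of the randomized input sizes described for the benchmark above, here is one way to draw approximately log-uniform sizes in C. It is a sketch under stated assumptions, not the benchmark code from this branch: the helper name `random_size` and its bucket parameters are hypothetical, and it approximates a log-uniform distribution by choosing a power-of-two bucket uniformly and then a size uniformly inside that bucket.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not from this PR): draw a size that is
   approximately log-uniform over [2^min_log2, 2^(max_log2+1) - 1].
   Each power-of-two bucket is equally likely, so small sizes dominate
   while large and jumbo sizes still occur.
   Assumes min_log2 <= max_log2 <= 14 so that every bucket fits within
   the minimum RAND_MAX the C standard guarantees (32767). */
static size_t random_size(unsigned min_log2, unsigned max_log2)
{
    unsigned buckets = max_log2 - min_log2 + 1;
    unsigned k = min_log2 + (unsigned)rand() % buckets; /* pick a bucket */
    size_t base = (size_t)1 << k;                       /* bucket start  */
    return base + (size_t)rand() % base;                /* offset inside */
}
```

For example, `random_size(6, 13)` yields sizes from 64 to 16383 bytes. A random alignment could be drawn the same way, for instance with `rand() % 64` as a byte offset into an over-allocated buffer. The modulo introduces a slight bias, which is usually acceptable for benchmark inputs.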
This branch is taking a little bit of a different turn:
I found a fairly straightforward formulation of checksum in C that GCC is able to automatically vectorize when compiled with
I have retained the AVX2 assembler variant as this is still the fastest by a significant margin.
The next step is to eliminate either the C/AVX2 implementation or the assembler one. The open problem for the assembler one right now is the wart that it temporarily overwrites the memory trailing the input, which is a different and more complex interface that may not suit all usages. The open problem for the C/AVX2 implementation is that it is slower than the assembler.
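To give a feel for what a vectorizer-friendly formulation can look like, here is a minimal sketch. It is not the C code from this branch; the name `cksum_simple` is hypothetical. It sums the high and low bytes of each 16-bit word separately, so the loop is a plain pair of reductions with no byte-order or alignment concerns. Whether a given compiler actually vectorizes it depends on the compiler version and flags, and is worth checking (GCC can report this with `-fopt-info-vec`).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch (not from this PR): checksum written as two
   simple reductions that an auto-vectorizer may be able to handle. */
static uint16_t cksum_simple(const uint8_t *data, size_t len)
{
    uint64_t hi = 0, lo = 0;   /* sums of high bytes and low bytes */
    size_t nwords = len / 2;

    for (size_t i = 0; i < nwords; i++) {
        hi += data[2 * i];       /* high byte of big-endian word i */
        lo += data[2 * i + 1];   /* low byte of big-endian word i  */
    }

    /* An odd trailing byte counts as the high byte of a final word. */
    if (len & 1)
        hi += data[len - 1];

    /* Recombine: sum of words = (sum of high bytes << 8) + sum of low bytes. */
    uint64_t sum = (hi << 8) + lo;

    /* Fold carries back into the low 16 bits (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Unlike the assembler variant discussed above, this form only reads the input and never touches memory past `len`, at the cost of relying on the compiler for speed.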
My usual worry with auto-vectorization for critical hotspots is that at some point it'll fail to trigger for whatever reason, and you'll end up silently running the naive version of the code. It's ok in a closed environment where the compiler versions are guaranteed to be fixed and only get upgraded at very specific points in time, but more problematic when you as the author have little control over how this gets compiled.
Though the simplicity of that new C code is pretty sweet.
@jsnell Yes, I know what you mean. In Snabb we pin exact versions of LuaJIT and DynASM so that we can "geek out" on them but have tried to be conservative with gcc, glibc, etc. I am not especially comfortable with either the auto-vectorized or the vectorized-with-C-intrinsics versions.
I imagine it is interesting in other projects where they are dedicated C hackers and pick one compiler (ICC, CLANG, or GCC) and geek out on its features for vectorization etc. at least until the distros come along and decide to use a different compiler, compile for a generic target architecture, skip the performance tests, etc :).
Questions I am struggling with now are: