Testing several ways to sum up floats stored in 128-bit SSE and 256-bit AVX registers

Tested with gcc 4.8.4 on Ubuntu Linux 14; the CPU is an Intel(R) Core(TM) i7-4700MQ @ 2.40GHz. Type make and run the two binaries. See also my blog post for a short explanation. Note that some other compilers will not accept C-style casts of registers; you may need to use the following cast pseudo-intrinsics.
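
As a rough illustration of what a cast pseudo-intrinsic looks like, the sketch below reinterprets a float register as an integer register with `_mm_castps_si128` and back with `_mm_castsi128_ps`, instead of a C-style cast such as `(__m128i)v`. The helper name `swap_halves` is hypothetical and is not taken from the test programs; the sketch assumes an x86 target with SSE2.

```cpp
#include <emmintrin.h>  // SSE2: __m128i, _mm_shuffle_epi32, cast pseudo-intrinsics

// Hypothetical helper (not from testsum128.cc): swaps the lower and upper
// 64-bit halves of a float register using an integer shuffle.
inline __m128 swap_halves(__m128 v) {
    // Portable replacement for the C-style cast (__m128i)v.
    // The cast only changes the type; it generates no instruction.
    __m128i bits = _mm_castps_si128(v);
    // Result lanes are (v2, v3, v0, v1), listed from lane 0 upward.
    __m128i swapped = _mm_shuffle_epi32(bits, _MM_SHUFFLE(1, 0, 3, 2));
    // Portable replacement for the C-style cast (__m128)swapped.
    return _mm_castsi128_ps(swapped);
}
```

The 256-bit AVX header provides analogous casts, such as `_mm256_castps_si256`.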

Current performance numbers for the 128-bit SSE registers (see the code for details):

./testsum128 
Function: scalar                	Total time: 302.783 (ms)  Ignore: 7.03693e+13
Function: vector via _mm_hadd_ps	Total time: 469.195 (ms)  Ignore: 7.03693e+13
Function: vector via shuffles1  	Total time: 440.41 (ms)  Ignore: 7.03693e+13
Function: vector via shuffles2  	Total time: 407.912 (ms)  Ignore: 7.03693e+13

For 256-bit AVX registers:

./testsum256 
Function: scalar          	Total time: 617.095 (ms)  Ignore: 1.40741e+14
Function: part. vectorized	Total time: 939.864 (ms)  Ignore: 1.40741e+14

In short, the most naive implementation, which converts the register to an array of scalars and sums them, works best.
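
For readers who want to see the shape of the approaches being compared, here is a minimal sketch of two 128-bit horizontal sums: one that stores the register to a float array and adds the scalars, and one that uses shuffles. The function names `sum_scalar` and `sum_shuffle` are hypothetical, and the exact shuffle sequences in testsum128.cc may differ; this only assumes an x86 target with SSE.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps, _mm_shuffle_ps, _mm_movehl_ps

// Scalar approach: store the register to memory and add the four lanes.
inline float sum_scalar(__m128 v) {
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

// Shuffle approach (one possible variant): two shuffle-and-add steps.
inline float sum_shuffle(__m128 v) {
    // Swap adjacent lanes: (v1, v0, v3, v2), listed from lane 0 upward.
    __m128 swapped = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    // Pairwise sums: (v0+v1, v0+v1, v2+v3, v2+v3).
    __m128 pairs = _mm_add_ps(v, swapped);
    // Move the upper two lanes down so lane 0 holds v2+v3.
    __m128 high = _mm_movehl_ps(pairs, pairs);
    // Add lane 0 of both registers and extract it as a float.
    return _mm_cvtss_f32(_mm_add_ss(pairs, high));
}
```

Both functions return the same value for inputs whose sums are exactly representable; in general the two may differ in the last bits because the additions are performed in a different order.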