Skip to content
This repository has been archived by the owner on Feb 26, 2020. It is now read-only.

Use libdivide for divison #489

Closed
ryao opened this issue Oct 27, 2015 · 1 comment
Closed

Use libdivide for divison #489

ryao opened this issue Oct 27, 2015 · 1 comment
Labels

Comments

@ryao
Copy link
Contributor

ryao commented Oct 27, 2015

The hardware division units are non-pipelined and have horrific latencies:

https://gmplib.org/~tege/x86-timing.pdf

A few quick tests using libdivide's benchmark shows that it is about twice as fast at 64-bit division as the native implement:

richard@desktop ~/devel/libdivide $ ./benchmark s64
     #  system  scalar  scl_us  vector  vec_us   gener  algo
...
 -8009  16.323  10.931   7.856   0.000   0.000  51.834     4
  8010  15.606  10.902   7.837   0.000   0.000  51.300     4
 -8010  16.319  10.895   7.828   0.000   0.000  51.331     4
  8011  15.612  10.899   7.801   0.000   0.000  51.315     4
 -8011  16.306  10.889   7.801   0.000   0.000  51.300     4
  8012  15.635  12.232   8.345   0.000   0.000  54.016     2
 -8012  16.296  12.295   8.915   0.000   0.000  54.062     3
  8013  15.646  10.891   7.812   0.000   0.000  51.285     4
 -8013  16.293  10.897   7.811   0.000   0.000  51.437     4

That is after modifying CFLAGS to use -O2 -m32 on my Haswell E3-1276 v3 and killing the SSE2 code (hence why the vector column is zero). The system and scl_us columns are the ones that are relevant to us, where performance is twice as fast.

http://libdivide.com/

We should look into switching to libdivide for 64-bit division on 32-bit platforms and if there is a way to get the compiler to use our own function for 64-bit division on 64-bit platforms, we would likely be better off switching to it. libdivide is under the zlib license, which allows us to use it.

This is not a high priority idea from a performance standpoint (how often do we do division?), but it is something I want to put out there for anyone new to contributing that is interested in working on it.

A couple related thoughts of possible interest are "does division being non-pipelined has an effect on another thread in Intel's SMT implementation?" and "is hardware division preemptible?". These questions might worth asking the realtime Linux developers.

@behlendorf
Copy link
Contributor

Closing. This still might be a fun project though.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

2 participants