Yes and even better: if your numerator is N-1 bits, then we only need an N bit magic number, and we can do without the add path entirely:
uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
return q >> denom->more;
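To make the shift-only path concrete, here is a minimal self-contained sketch. It is not libdivide code: the names `u63_divider`, `u63_gen`, `u63_do` and `mullhi_u64` are hypothetical, it assumes a GCC/Clang-style `unsigned __int128` and `__builtin_clzll`, and it only covers divisors with 2 <= d < 2^63 and numerators below 2^63.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types and names, for illustration only. */
typedef struct {
    uint64_t magic; /* full 64-bit magic number, no implicit 65th bit */
    uint8_t shift;  /* ceil(log2(d)) - 1 */
} u63_divider;

/* High 64 bits of the 128-bit product (assumes unsigned __int128). */
static uint64_t mullhi_u64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Precompute magic and shift. Requires 2 <= d < 2^63. */
static u63_divider u63_gen(uint64_t d) {
    assert(d >= 2 && d < (1ULL << 63));
    /* l = ceil(log2(d)), in the range 1..63 */
    unsigned l = 64u - (unsigned)__builtin_clzll(d - 1);
    /* magic = ceil(2^(63+l) / d); this always fits in 64 bits here */
    unsigned __int128 pow = (unsigned __int128)1 << (63 + l);
    u63_divider r;
    r.magic = (uint64_t)((pow + d - 1) / d);
    r.shift = (uint8_t)(l - 1);
    return r;
}

/* floor(n / d) for n < 2^63: one multiply-high and one shift. */
static uint64_t u63_do(uint64_t n, const u63_divider *den) {
    return mullhi_u64(den->magic, n) >> den->shift;
}
```

For `d = 3` the generator yields magic `0xAAAAAAAAAAAAAAAB` and shift 1, so `u63_do(10, &den)` evaluates to 3. Division by 1 is excluded in this sketch because it would need a shift of -1.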
I'm not sure how to evaluate whether N-1 bit division is broadly useful. The API surface area is already large and it's becoming unwieldy to maintain. N-1 bit division would nearly double the size.
That's fair. I'll close this thread. If N-1 bit division is indeed broadly useful, someone will eventually re-open this thread and describe their use case.
Just for the record, I was just playing around with the idea of keeping the API the same and making a dynamic decision whether to use N-1 bit division. In order to minimize branch mispredictions, this decision would have to be sticky: once we've seen a numerator with bit N-1 set, the next 100 or so invocations of the division function would use the full N bit division, even if the numerators were small. The necessary state would be opportunistically kept on the stack (hoping that it would be preserved across function invocations, which should be true in hot loops, and shouldn't matter otherwise). Unfortunately, this approach turned out to generate way too much overhead (at least in my implementation).
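The sticky decision described above might look roughly like the sketch below. This is an illustration of the selection logic only, not the implementation that was tried: `sticky_divider`, `sticky_divide` and `STICKY_CALLS` are hypothetical names, the state lives in an explicit struct rather than opportunistically on the stack, and both paths use plain hardware division as placeholders for the real N-1 bit and N bit routines.

```c
#include <stdint.h>

/* "100 or so" in the comment above; the exact value is a guess. */
#define STICKY_CALLS 100

/* Hypothetical state; the counters exist only to observe path selection. */
typedef struct {
    uint64_t divisor;
    uint32_t full_left;  /* calls still forced onto the full 64-bit path */
    uint64_t fast_calls; /* how often the 63-bit path was chosen */
    uint64_t full_calls; /* how often the 64-bit path was chosen */
} sticky_divider;

static uint64_t sticky_divide(sticky_divider *s, uint64_t n) {
    if (n >> 63) {
        /* Bit 63 is set: use the full path now and for the
           next STICKY_CALLS invocations. */
        s->full_left = STICKY_CALLS + 1;
    }
    if (s->full_left > 0) {
        s->full_left--;
        s->full_calls++;
        return n / s->divisor; /* placeholder for the full 64-bit path */
    }
    s->fast_calls++;
    return n / s->divisor;     /* placeholder for the 63-bit fast path */
}
```

The extra load, compare and store on every call give a sense of where the overhead comes from, even when the branch itself predicts well.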
Hi,
I was wondering whether it would make sense to provide special support for 63-bit division. The motivation is that such divisions can, I think, be done a bit faster than 64-bit divisions, and in all cases in which I personally needed fast division, 63 bits were enough.
For example, the current branch-free 64-bit division code is
On x86-64 the last two lines of that function compile to something like this:
These are four 1-cycle instructions that have sequential data dependencies (via rax), so they have a combined latency of 4 cycles.
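For reference, a branch-free 64-bit path with this four-step tail can be sketched as follows. The `magic` and `more` field names follow the thread, but everything else (`u64_bf_divider`, `u64_bf_gen`, `u64_bf_do`, `mullhi_u64`) is a hypothetical stand-in rather than libdivide's actual API, and the generator assumes a GCC/Clang-style `unsigned __int128` and `__builtin_clzll`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, for illustration only. */
typedef struct {
    uint64_t magic; /* low 64 bits of a 65-bit magic number */
    uint8_t more;   /* final shift: ceil(log2(d)) - 1 */
} u64_bf_divider;

/* High 64 bits of the 128-bit product (assumes unsigned __int128). */
static uint64_t mullhi_u64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Requires d >= 2; d == 1 would need a final shift of -1. */
static u64_bf_divider u64_bf_gen(uint64_t d) {
    assert(d >= 2);
    /* l = ceil(log2(d)), in the range 1..64 */
    unsigned l = 64u - (unsigned)__builtin_clzll(d - 1);
    /* magic = ceil(2^(64+l) / d) - 2^64 = ceil(2^64 * (2^l - d) / d) */
    unsigned __int128 rest = (((unsigned __int128)1 << l) - d) << 64;
    u64_bf_divider r;
    r.magic = (uint64_t)((rest + d - 1) / d);
    r.more = (uint8_t)(l - 1);
    return r;
}

/* floor(n / d) for any 64-bit n. */
static uint64_t u64_bf_do(uint64_t n, const u64_bf_divider *den) {
    uint64_t q = mullhi_u64(den->magic, n);
    /* n + q can overflow 64 bits, so average via (n - q) / 2 + q:
       roughly a sub, a shr and an add ... */
    uint64_t t = ((n - q) >> 1) + q;
    /* ... followed by one more shr, each depending on the previous result. */
    return t >> den->more;
}
```

The subtract, shift, add, shift sequence exists only to avoid overflow in `n + q`, which is what makes the 63-bit case attractive.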
If both the numerator and denominator were only 63 bits, I think this could be improved to
Here the last line should compile to something like this, which should take 2 cycles rather than 4.
As a bonus, division by 1 would work, by setting `magic` to 0 or 1, and `more_plus_one` to 0.
On the other hand, there are code maintenance and API clarity drawbacks, which should be weighed against this proposal.
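One way to get a two-step tail that is consistent with the `magic` = 0, `more_plus_one` = 0 remark is sketched below. It is an assumption about the shape of such a path, not necessarily the code proposed here: `u63_add_divider`, `u63_add_gen`, `u63_add_do` and `mullhi_u64` are hypothetical names, the generator assumes a GCC/Clang-style `unsigned __int128` and `__builtin_clzll`, and both numerator and divisor must be below 2^63.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout; only the field names come from the issue text. */
typedef struct {
    uint64_t magic;        /* low 64 bits of a 65-bit magic number */
    uint8_t more_plus_one; /* ceil(log2(d)) */
} u63_add_divider;

/* High 64 bits of the 128-bit product (assumes unsigned __int128). */
static uint64_t mullhi_u64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Requires 1 <= d < 2^63. */
static u63_add_divider u63_add_gen(uint64_t d) {
    assert(d >= 1 && d < (1ULL << 63));
    /* l = ceil(log2(d)); 0 for d == 1 */
    unsigned l = (d == 1) ? 0u : 64u - (unsigned)__builtin_clzll(d - 1);
    /* magic = ceil(2^(64+l) / d) - 2^64 = ceil(2^64 * (2^l - d) / d) */
    unsigned __int128 rest = (((unsigned __int128)1 << l) - d) << 64;
    u63_add_divider r;
    r.magic = (uint64_t)((rest + d - 1) / d);
    r.more_plus_one = (uint8_t)l;
    return r;
}

/* floor(n / d) for n < 2^63. */
static uint64_t u63_add_do(uint64_t n, const u63_add_divider *den) {
    /* q <= n < 2^63, so q + n cannot overflow: one add, one shift. */
    uint64_t q = mullhi_u64(den->magic, n);
    return (q + n) >> den->more_plus_one;
}
```

For `d = 1` this generator produces `magic` = 0 and `more_plus_one` = 0, so `u63_add_do` returns the numerator unchanged.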