Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Special support for 63-bit division (unsigned)? #95

Closed
zielaj opened this issue Mar 30, 2022 · 2 comments
Closed

Special support for 63-bit division (unsigned)? #95

zielaj opened this issue Mar 30, 2022 · 2 comments

Comments

@zielaj
Copy link

zielaj commented Mar 30, 2022

Hi,

I was wondering whether it would make sense to provide special support for 63-bit division. The motivation is that such divisions can, I think, be done a bit faster than 64-bit divisions, and in all cases in which I personally needed fast division, 63 bits were enough.

For example, the current branch-free 64-bit division code is

uint64_t libdivide_u64_branchfree_do(
    uint64_t numer, const struct libdivide_u64_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    uint64_t t = ((numer - q) >> 1) + q;
    return t >> denom->more;
}

On x86-64 the last two lines of that function compile to something like this:

sub    rax,rdi           ; numer - q
shr    rax,1             ; >> 1
add    rax,rdi           ; + q
shrx   rax,rax,r10       ; >> denom->more

These are four 1-cycle instructions that have sequential data dependencies (via rax), so they have a combined latency of 4 cycles.

If both the numerator and denominator were only 63-bits, I think this can be improved to

uint64_t libdivide_u63_branchfree_do(
    uint64_t numer, const struct libdivide_u63_branchfree_t *denom) {
    uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
    return (numer + q) >> denom->more_plus_1;   // more_plus_1 == more + 1
}

Here the last line should compile to something like this, which should take 2 cycles rather than 4.

add    rax,rdi           ; numer + q
shrx   rax,rax,r10       ; >> denom->more_plus_one

As a bonus, division by 1 would work, by setting magic to 0 or 1, and more_plus_one to 0.

On the other hand, there are code maintenance and API clarity drawbacks, which should be weighed against this proposal.

@ridiculousfish
Copy link
Owner

Yes and even better: if your numerator is N-1 bits, then we only need an N bit magic number, and we can do without the add path entirely:

uint64_t q = libdivide_mullhi_u64(denom->magic, numer);
return t >> denom->more;

I'm not sure how to evaluate whether N-1 bit division is broadly useful. The API surface area is already large and it's becoming unwieldy to maintain. N-1 bit division would nearly double the size.

@zielaj
Copy link
Author

zielaj commented Mar 31, 2022

That's fair. I'll close this thread. If N-1 bit division is indeed broadly useful, someone will eventually re-open this thread and describe their use case.

Just for the record, I was just playing around with the idea of keeping the API the same and making a dynamic decision whether to use N-1 bit division. In order to minimize branch mispredictions, this decision would have to be sticky: once we've seen a numerator with bit N-1 set, the next 100 or so invocations of the division function would use the full N bit division, even if the numerators were small. The necessary state would be opportunistically kept on the stack (hoping that it would be preserved across function invocations, with should be true in hot loops, and shouldn't matter otherwise). Unfortunately, this approach turned out to generate way too much overhead (at least in my implementation).

@zielaj zielaj closed this as completed Mar 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants