-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
math/bits: faster way to compute OnesCount64 #46188
Comments
This implementation is based on the C function from "Faster Population Counts Using AVX2 Instructions" paper, Figure 3, available at https://arxiv.org/pdf/1612.07612.pdf More details and benchmark results are available in the golang#46188. Closes golang#46188
Change https://golang.org/cl/320112 mentions this issue: |
This could be useful in general, but I think it would be better to double-check it's a win on the architectures that actually use the pure Go version. Suppose the new code is actually 10% slower on some GOARCH without a POPCNT... then this change would be a net loss, because amd64 won't actually benefit from the 10% speedup you measured (since all processors made after 2006 have POPCNT), but the GOARCH that does use the pure Go version will be slowed down. |
Thank you for the feedback, @ALTree! Do you happen to have any particular architecture in mind? New algorithm assembly is shorter on both ARM:
and ARM64:
The only reason to suspect potential performance regression on these platforms is the usage of multiplication but it's hard to tell for sure without having access to the target platform. |
On arm64 we always use the intrinsic, so the speed of this backup code shouldn't matter there (unless it is faster than the intrinsic!). |
On the Raspberry Pi 4B running in 32-bit mode (ARM Cortex-A72), the WWG version is significantly faster:
OnesCount64 of uint64(i) inside the loop shows similar performance.
|
Thank you for checking the 32bit ARM performance, @greatroar! @randall77, does this address your concern about arm perf impact? |
Seems fine. A 4% performance improvement doesn't seem great though. That's well within the "code alignment could have caused it" performance range. |
@andig This change shouldn't affect arm64 at all - it already uses an intrinsic, not the fallback code being discussed here. |
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
I've ported Wilkes-Wheeler-Gill algorithm from "Faster Population Counts Using AVX2 Instructions" paper:
and benchmarked it against the version used by
math.bits.OnesCount64
in casePOPCNT
is not supported:The benchmarks:
For
var x uint64 = 0xffffffffffffffff
:and for
var x uint64 = 0
What did you expect to see?
math.bits.OnesCount64
's version to be fasterWhat did you see instead?
Wilkes-Wheeler-Gill algorithm was faster by ~10% and, given its implementation is also more concise, it makes sense to replace with it existing implementation.
The text was updated successfully, but these errors were encountered: