Open
Description
What version of Go are you using (go version
)?
$ go version go version go1.16.3 darwin/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env
)?
go env
Output
$ go env GO111MODULE="" GOARCH="amd64" GOBIN="" GOCACHE="/Users/ttsugrii/Library/Caches/go-build" GOENV="/Users/ttsugrii/Library/Application Support/go/env" GOEXE="" GOFLAGS="" GOHOSTARCH="amd64" GOHOSTOS="darwin" GOINSECURE="" GOMODCACHE="/Users/ttsugrii/go/pkg/mod" GONOPROXY="" GONOSUMDB="" GOOS="darwin" GOPATH="/Users/ttsugrii/go" GOPRIVATE="" GOPROXY="https://proxy.golang.org,direct" GOROOT="/usr/local/Cellar/go/1.16.3/libexec" GOSUMDB="sum.golang.org" GOTMPDIR="" GOTOOLDIR="/usr/local/Cellar/go/1.16.3/libexec/pkg/tool/darwin_amd64" GOVCS="" GOVERSION="go1.16.3" GCCGO="gccgo" AR="ar" CC="clang" CXX="clang++" CGO_ENABLED="1" GOMOD="/Users/ttsugrii/projects/go/testing/go.mod" CGO_CFLAGS="-g -O2" CGO_CPPFLAGS="" CGO_CXXFLAGS="-g -O2" CGO_FFLAGS="-g -O2" CGO_LDFLAGS="-g -O2" PKG_CONFIG="pkg-config" GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/kx/wp6rcc054zv_xyf4lpmlztr00000gn/T/go-build1168578044=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
I've ported Wilkes-Wheeler-Gill algorithm from "Faster Population Counts Using AVX2 Instructions" paper:
const c1 = 0x5555555555555555
const c2 = 0x3333333333333333
const c4 = 0x0F0F0F0F0F0F0F0F
//go:noinline
func wilkes_wheeler_gill(x uint64) int {
x -= (x >> 1) & c1
x = ((x >> 2) & c2) + (x & c2)
x = (x + (x >> 4)) & c4
x *= 0x0101010101010101
return int(x >> 56)
}
and benchmarked it against the version used by math.bits.OnesCount64
in case POPCNT
is not supported:
const m0 = 0x5555555555555555 // 01010101 ...
const m1 = 0x3333333333333333 // 00110011 ...
const m2 = 0x0f0f0f0f0f0f0f0f // 00001111 ...
//go:noinline
func parallel_summing(x uint64) int {
const m = 1<<64 - 1
x = x>>1&(m0&m) + x&(m0&m)
x = x>>2&(m1&m) + x&(m1&m)
x = (x>>4 + x) & (m2 & m)
x += x >> 8
x += x >> 16
x += x >> 32
return int(x) & (1<<7 - 1)
}
The benchmarks:
var x uint64 = <the input value>
var result int
func BenchmarkPopCountParallelSumming(b *testing.B) {
var cnt int
for i := 0; i < b.N; i++ {
cnt += parallel_summing(x)
}
result += cnt
}
func BenchmarkPopCountWilkesWheelerGill(b *testing.B) {
var cnt int
for i := 0; i < b.N; i++ {
cnt += wilkes_wheeler_gill(x)
}
result += cnt
}
For var x uint64 = 0xffffffffffffffff
:
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz BenchmarkPopCountParallelSumming-16 360195423 3.094 ns/op BenchmarkPopCountWilkesWheelerGill-16 418102376 2.798 ns/op
and for var x uint64 = 0
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz BenchmarkPopCountParallelSumming-16 379682172 3.126 ns/op BenchmarkPopCountWilkesWheelerGill-16 431421498 2.870 ns/op
What did you expect to see?
math.bits.OnesCount64
's version to be faster
What did you see instead?
Wilkes-Wheeler-Gill algorithm was faster by ~10% and, given its implementation is also more concise, it makes sense to replace with it existing implementation.