Skip to content

math/bits: faster way to compute OnesCount64 #46188

Open
@ttsugriy

Description

@ttsugriy

What version of Go are you using (go version)?

$ go version
go version go1.16.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/ttsugrii/Library/Caches/go-build"
GOENV="/Users/ttsugrii/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/ttsugrii/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/ttsugrii/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.16.3/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.16.3/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/ttsugrii/projects/go/testing/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/kx/wp6rcc054zv_xyf4lpmlztr00000gn/T/go-build1168578044=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I've ported Wilkes-Wheeler-Gill algorithm from "Faster Population Counts Using AVX2 Instructions" paper:

const c1 = 0x5555555555555555
const c2 = 0x3333333333333333
const c4 = 0x0F0F0F0F0F0F0F0F

//go:noinline
func wilkes_wheeler_gill(x uint64) int {
	x -= (x >> 1) & c1
	x = ((x >> 2) & c2) + (x & c2)
	x = (x + (x >> 4)) & c4
	x *= 0x0101010101010101
	return int(x >> 56)
}

and benchmarked it against the version used by math.bits.OnesCount64 in case POPCNT is not supported:

const m0 = 0x5555555555555555 // 01010101 ...
const m1 = 0x3333333333333333 // 00110011 ...
const m2 = 0x0f0f0f0f0f0f0f0f // 00001111 ...

//go:noinline
func parallel_summing(x uint64) int {
	const m = 1<<64 - 1
	x = x>>1&(m0&m) + x&(m0&m)
	x = x>>2&(m1&m) + x&(m1&m)
	x = (x>>4 + x) & (m2 & m)
	x += x >> 8
	x += x >> 16
	x += x >> 32
	return int(x) & (1<<7 - 1)
}

The benchmarks:

var x uint64 = <the input value>
var result int

func BenchmarkPopCountParallelSumming(b *testing.B) {
	var cnt int
	for i := 0; i < b.N; i++ {
		cnt += parallel_summing(x)
	}
	result += cnt
}

func BenchmarkPopCountWilkesWheelerGill(b *testing.B) {
	var cnt int
	for i := 0; i < b.N; i++ {
		cnt += wilkes_wheeler_gill(x)
	}
	result += cnt
}

For var x uint64 = 0xffffffffffffffff:

cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkPopCountParallelSumming-16      	360195423	         3.094 ns/op
BenchmarkPopCountWilkesWheelerGill-16    	418102376	         2.798 ns/op

and for var x uint64 = 0

cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkPopCountParallelSumming-16      	379682172	         3.126 ns/op
BenchmarkPopCountWilkesWheelerGill-16    	431421498	         2.870 ns/op

What did you expect to see?

math.bits.OnesCount64's version to be faster

What did you see instead?

Wilkes-Wheeler-Gill algorithm was faster by ~10% and, given its implementation is also more concise, it makes sense to replace with it existing implementation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    NeedsInvestigationSomeone must examine and confirm this is a valid issue and not a duplicate of an existing one.Performance

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions