
add benchmarks for std.AutoHashMap #2

Open · andrewrk opened this issue May 19, 2020 · 11 comments

@andrewrk
Member

No description provided.

@squeek502
Contributor

Relevant, but not very fleshed out (and not updated to latest Zig) so maybe not that useful: https://github.com/squeek502/zig-hash-map-bench

The links in the readme should be useful, though.

@squeek502
Contributor

One thing I'm unsure about here is whether a single benchmark can be representative of the speed of a hash map (if you look at the links in the repo above, all benchmarks are presented as graphs). Hash maps can have very different performance characteristics depending on their size, so a benchmark like 'add 1,000,000 elements, get 1,000,000 times, then remove all the elements one by one' would represent only a single point along the continuum of possible hash map sizes, and would fail to catch performance regressions at other sizes. Running that same benchmark for every size from 1 to 1,000,000 and measuring the total time would still underrepresent the small sizes, and so would still let regressions in small hash maps slip by unnoticed.

Unless I'm thinking about it wrong, this doesn't seem solvable without breaking the benchmarking into multiple benchmarks differentiated by the size of the hash map being benchmarked. And if that's true, it's not clear how many different sizes should be benchmarked or what they should be (would only 'small' and 'huge' be necessary? 'small', 'medium', 'large', and 'huge'? way more?).
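
As a concrete sketch of the multi-size approach (written against the same Zig-era API used in the code later in this thread, e.g. getValue; the sizes chosen are arbitrary), each point on the continuum could get its own measurement:

const std = @import("std");
const AutoHashMap = std.AutoHashMap;

// Run one put/get/remove pass for a given map size; return elapsed nanoseconds.
fn benchSize(alloc: *std.mem.Allocator, n: u64) !u64 {
    var timer = try std.time.Timer.start();
    var map = AutoHashMap(u64, u64).init(alloc);
    defer map.deinit();
    var i: u64 = 0;
    while (i < n) : (i += 1) try map.put(i, i);
    i = 0;
    while (i < n) : (i += 1) std.debug.assert(map.getValue(i).? == i);
    i = 0;
    while (i < n) : (i += 1) _ = map.remove(i);
    return timer.read();
}

pub fn main() !void {
    // Report each size separately rather than one aggregate total, so a
    // regression at small sizes can't hide behind the large ones.
    const sizes = [_]u64{ 10, 1000, 100000, 1000000 };
    for (sizes) |n| {
        std.debug.warn("n={}: {}ns\n", .{ n, try benchSize(std.heap.page_allocator, n) });
    }
}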

@andrewrk
Member Author

andrewrk commented Jun 21, 2020

The idea here would be to create multiple benchmarks, one for each "use case" you want to keep a performance history on. This should hopefully be reasonable to maintain, since the manifest JSON file lets you reuse directories across multiple benchmarks, so you only have to fulfill one tiny function per benchmark.

My intuition tells me that 3-5 benchmarks should be enough to get a reasonable performance picture of this API.

Another way to approach this would be to create benchmarks that represent real-world usage of the API. If you have a project that uses the API, make a benchmark that simulates how your project would use it. This ensures the performance of your use case is explicitly monitored for regressions. As you can imagine, this will become relevant for making sure the self-hosted compiler is, and stays, fast.
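
To illustrate the structure (a hypothetical sketch; the function names and harness interface here are mine, not this repo's actual manifest schema), each benchmark could be a tiny wrapper over shared workload code:

const std = @import("std");

// Shared workload, parameterized by size; each benchmark entry point is a
// thin function over it, so one directory can back several manifest entries.
fn putGetRemove(alloc: *std.mem.Allocator, n: u64) !void {
    var map = std.AutoHashMap(u64, u64).init(alloc);
    defer map.deinit();
    var i: u64 = 0;
    while (i < n) : (i += 1) try map.put(i, i);
    i = 0;
    while (i < n) : (i += 1) _ = map.remove(i);
}

pub fn benchSmallMap(alloc: *std.mem.Allocator) !void {
    try putGetRemove(alloc, 100);
}

pub fn benchLargeMap(alloc: *std.mem.Allocator) !void {
    try putGetRemove(alloc, 1000000);
}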

andrewrk changed the title from "add benchmark for std.AutoHashMap" to "add benchmarks for std.AutoHashMap" on Jun 21, 2020
@lemmi

lemmi commented Jul 2, 2020

So I was toying around with Zig and implemented Project Euler problem 14 for fun. Since it was easy, I ported it directly to Go and ran some benchmarks. This makes heavy use of the HashMap, so maybe it's a decent benchmark for some use cases.

TL;DR:

  • for both implementations the map is preallocated, so only one allocation takes place
  • it looks like Zig uses about twice the memory of Go
  • Zig is almost as fast as the Go implementation with -Drelease-fast
  • Zig takes about twice as long with -Drelease-safe
  • tests were run on an AMD FX(tm)-8350

Code

Zig implementation:

const std = @import("std");
const AutoHashMap = std.AutoHashMap;

fn step(x: u64) u64 {
    if (x & 1 > 0) {
        return 3 * x + 1;
    } else {
        return x / 2;
    }
}

fn length(cache: *AutoHashMap(u64, u64), x: u64) anyerror!u64 {
    if (x <= 1) return 0;
    if (cache.getValue(x)) |e| {
        return e;
    } else {
        const next = step(x);
        const len = 1 + try length(cache, next);
        try cache.putNoClobber(x, len);
        return len;
    }
}

pub fn main() anyerror!void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const alloc = &arena.allocator;
    var cache = AutoHashMap(u64, u64).init(alloc);
    defer cache.deinit();

    try cache.ensureCapacity(2000000);

    var x: u64 = 0;
    var maxx: u64 = 0;
    var maxl: u64 = 0;
    while (x < 1000000) : (x += 1) {
        const l = try length(&cache, x);
        if (l > maxl) {
            maxl = l;
            maxx = x;
        }
    }
    std.debug.warn("{} {}\n", .{ maxx, maxl });
}
Go implementation:

package main

import "fmt"

func step(x uint64) uint64 {
	if x&1 > 0 {
		return 3*x + 1
	} else {
		return x / 2
	}
}

func length(cache map[uint64]uint64, x uint64) uint64 {
	if x <= 1 {
		return 0
	}
	if e, ok := cache[x]; ok {
		return e
	} else {
		next := step(x)
		l := 1 + length(cache, next)
		cache[x] = l
		return l
	}
}

func main() {
	cache := make(map[uint64]uint64, 2000000)

	var maxx uint64 = 0
	var maxl uint64 = 0
	for x := uint64(0); x < 1000000; x++ {
		l := length(cache, x)
		if l > maxl {
			maxl = l
			maxx = x
		}
	}

	fmt.Println(maxx, maxl)
}

Timings

go build -o collaz
/usr/bin/time -v ./collaz
837799 524
	Command being timed: "./collaz"
	User time (seconds): 0.35
	System time (seconds): 0.02
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.37
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 82208
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1689
	Voluntary context switches: 147
	Involuntary context switches: 105
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
zig build -Drelease-safe
/usr/bin/time -v ./zig-cache/bin/collaz
837799 524
	Command being timed: "./zig-cache/bin/collaz"
	User time (seconds): 0.57
	System time (seconds): 0.07
	Percent of CPU this job got: 98%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.65
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 163748
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 40987
	Voluntary context switches: 1
	Involuntary context switches: 179
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
zig build -Drelease-fast
/usr/bin/time -v ./zig-cache/bin/collaz
837799 524
	Command being timed: "./zig-cache/bin/collaz"
	User time (seconds): 0.38
	System time (seconds): 0.08
	Percent of CPU this job got: 98%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.47
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 163692
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 40984
	Voluntary context switches: 1
	Involuntary context switches: 140
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

@andrewrk
Member Author

andrewrk commented Jul 6, 2020

Thanks @lemmi for this benchmark; it's now added to this repo. Before closing this issue we should probably come up with 2-3 more benchmarks that cover other hash map use cases: maybe one that deals with string keys, and one that creates and destroys many small hash maps in memory. Any other suggestions are welcome too.
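
As a starting point for the many-small-maps idea, here is a minimal sketch (same era API as the code above; the constants are arbitrary) in which per-map setup and teardown cost dominates the measurement:

const std = @import("std");

pub fn main() !void {
    var timer = try std.time.Timer.start();
    var round: usize = 0;
    while (round < 100000) : (round += 1) {
        // Create, fill, probe, and destroy one small map per iteration.
        var map = std.AutoHashMap(u32, u32).init(std.heap.page_allocator);
        defer map.deinit();
        var i: u32 = 0;
        while (i < 8) : (i += 1) try map.put(i, i);
        std.debug.assert(map.getValue(3) != null);
    }
    std.debug.warn("elapsed: {}ns\n", .{timer.read()});
}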

@Sahnvour
Contributor

Sahnvour commented Jul 6, 2020

Just saw Andrew's tweets, good work everyone.
In case it's of any interest, I had some hash map benchmarks at https://github.com/Sahnvour/zig-containers/blob/master/benchmarks/martinus_map.zig that tried to reproduce the ones mentioned by @squeek502 at https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/ .

The code is a bit old, however.

@lemmi

lemmi commented Jul 6, 2020

A short update: while @andrewrk was integrating the test, I was in the middle of benchmarking my implementation of a Swiss-table-style hash map that uses SIMD for (hopefully) faster access times.

Code: https://gist.github.com/lemmi/6576efbfd9fcc7fbf74371e42e3146a8

Once I've ported my code over to the new API, I'll run the tests again with the new HashMap from current master.


Brief explanation: the last number indicates the SIMD width, so 32, for example, means 32-byte-wide SIMD is used for the metadata storage. With 16 the compiler emits SSE code; with 32 I can actually see AVX code. It's still rather puzzling to me why the scalar version is the fastest on all architectures. Maybe loading a whole vector and iterating through the bitmask is too costly compared to the cheap comparison of a 64-bit key, which will almost never be a miss with this kind of table.
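
For readers unfamiliar with the layout, the probe being vectorized works roughly like this scalar sketch (identifier names are mine; the SIMD variants replace this loop with one wide compare plus a bitmask scan):

// Each slot stores one byte of hash metadata; a probe first scans a group
// of those bytes for candidate slots before comparing full keys.
fn scanGroup(meta: []const u8, h2: u8) ?usize {
    for (meta) |m, i| {
        if (m == h2) return i;
    }
    return null;
}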

AMD FX(tm)-8350

smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,1)                           505ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,2)                           562ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,4)                           523ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,8)                           528ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,16)                          529ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,32)                          546ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,64)                          624ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,128)                         720ms
std.hash_map.HashMap(u64,u64,std.hash_map.getAutoHashFn(u64).hash,std.hash_map.getAutoEqlFn(u64).eql) 416ms

Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz

smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,1)                           383ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,2)                           408ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,4)                           390ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,8)                           393ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,16)                          401ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,32)                          421ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,64)                          469ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,128)                         545ms
std.hash_map.HashMap(u64,u64,std.hash_map.getAutoHashFn(u64).hash,std.hash_map.getAutoEqlFn(u64).eql) 431ms

AMD Ryzen 5 3600X 6-Core Processor

smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,1)                           400ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,2)                           458ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,4)                           432ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,8)                           435ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,16)                          435ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,32)                          448ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,64)                          461ms
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,128)                         497ms
std.hash_map.HashMap(u64,u64,std.hash_map.getAutoHashFn(u64).hash,std.hash_map.getAutoEqlFn(u64).eql) 373ms

@Sahnvour
Contributor

Sahnvour commented Jul 6, 2020

@lemmi In my experience, going with a design similar to Google's hash table (8-bit metadata per element) but with scalar code was faster. I think SIMD is beneficial to them because they aim for an extremely high load factor (97.5% IIRC), which implies quite a lot of probing, especially since they use C++'s standard hash, which is of very poor quality.
With a good-quality hash function, the bottleneck is more likely the initial cache miss than the probing.

In your implementation, I see you're using modulo (%) operations in the groupIter code. This is very slow; your code would probably be faster with a bitmask, since you already work with power-of-two sizes.
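
Concretely, the difference is (a minimal sketch; the names are mine):

const std = @import("std");

// General slot selection: `%` compiles to a hardware integer divide.
fn slotIndexMod(hash: u64, capacity: u64) u64 {
    return hash % capacity;
}

// Equivalent for power-of-two capacities: a single AND instruction.
fn slotIndexMask(hash: u64, capacity: u64) u64 {
    std.debug.assert(capacity != 0 and (capacity & (capacity - 1)) == 0);
    return hash & (capacity - 1);
}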

@lemmi

lemmi commented Jul 6, 2020

@Sahnvour thanks a lot for your input. Your comment made me want to measure memory usage behaviour when not preallocating. Getting rid of the modulo operations did indeed gain me a couple tens of milliseconds compared to the numbers above.
The Google-style hash map can indeed be pushed to much higher load factors before losing performance, and now the numbers look very different. Since the old std map needs to keep a lot of free space, more allocations, and therefore more copying/rehashing, are necessary. It also uses quite a bit more memory, which is consistent with my earlier findings when comparing to the Go implementation.

Benchmarks without preallocation

The third column is the maximum allocated memory. I left out the worse-performing vector sizes.

Code: https://gist.github.com/lemmi/c4e9dbedc5424a82e300c0000cf46728

AMD FX(tm)-8350

smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,1)                             737ms 104448K
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,16)                            809ms 104448K
std.hash_map.HashMap(u64,u64,std.hash_map.getAutoHashFn(u64).hash,std.hash_map.getAutoEqlFn(u64).eql)   838ms 196608K

Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz

smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,1)                             499ms 104448K
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,16)                            527ms 104448K
std.hash_map.HashMap(u64,u64,std.hash_map.getAutoHashFn(u64).hash,std.hash_map.getAutoEqlFn(u64).eql)   696ms 196608K

AMD Ryzen 5 3600X 6-Core Processor

smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,1)                             418ms 104448K
smap.Map(u64,u64,smap.getAutoHashFn(u64).hash,smap.getAutoEqlFn(u64).eql,16)                            447ms 104448K
std.hash_map.HashMap(u64,u64,std.hash_map.getAutoHashFn(u64).hash,std.hash_map.getAutoEqlFn(u64).eql)   507ms 196608K

I hope I'm not taking this too far off topic.

@andrewrk
Member Author

andrewrk commented Jul 6, 2020

I hope I'm not taking this too far off topic.

This is very much on topic!

@Sahnvour
Contributor

I'm adding some benchmarks that can take up to a few tens of seconds. Should we make them as atomic as possible, or is it OK to have them all in std-hash-map?
