Speed up tilebuilding by avoiding some STL bottlenecks #2349
Conversation
@danpat there are a couple of interesting but lesser-used ones in osmdata.h as well that you might want to hit. They are used by what are, relatively speaking, pretty infrequently occurring tagging combinations, but it may still be useful to future-proof them.
@danpat i took a pass at getting this build working in CI (resolving conflicts and adding a hash function to the `Turn::Type`)
@danpat @kevinkreiser it sounded like there were concerns about the difficulty of integrating absl, the project from which the header-only parallel hash map dependency in the PR is actually derived. I put up a branch which shows that it's actually a pretty straightforward update to our CMake configuration to only include the dependency where needed. I didn't include it in the branch, but I verified locally that the project builds correctly and tests pass when you include it. @kevinkreiser also raised a concern about binary size; here's what I'm getting on a release build:

Update: the master release binaries are about the same size. The Debug builds seem about 5-10% bigger. Revisiting the question from this morning: do we want to replace our STL hash map with what, on the surface, seems like someone's personal project on GitHub? That doesn't seem like an obvious choice to me. Why not use the project from which it's derived?
…ed notes. For small sizes, a flat memory layout in L1 cache will usually outperform a fragmented container like unordered_set, in this case, by about 5x.
…ually few pairs, and the values are small, so linear search in L1 cache combined with short-string-optimization will typically outperform std::unordered_map with its fragmented memory layout.
Force-pushed from e78cd0e to dae1280.
```cpp
expandset.erase(expandset.begin());
// Expand edges from this node. Post-increment the index to the next item.
const GraphId expandnode = expandset[expand_pos++];
```
Wonder if you can use `GraphId.value` as a key to your bloom filter? (Will they be unique in this context?)
@mandeepsandhu I had that initially - but I think because many edges will be within the same tile, and not many bits change between values, we would end up with bloom filter collisions really quickly. I added hashing to shake things around a bit.
Or do you mean still hash, but only use the `.value()` part of the GraphId? That might collide if we're exploring a location that crosses tile boundaries. We definitely don't need to be perfectly collision-free here, just better than nothing.
No, I meant using `GraphId.value` directly, without running it through the hash function. But yeah, if there are a lot of collisions it will cause extra hash lookups, which might not be as effective.
valhalla/midgard/small_flat_map.h
```cpp
// Works when all your data will fit in L1 cache and it's faster to just loop
// over everything than it is to fetch fragmented records from main memory.
template <typename KeyT, typename ValueT>
struct SmallFlatMap : public std::vector<std::pair<KeyT, ValueT>> {
```
For a small number of entries, won't a `std::unordered_map` fit in the L1/L2 cache too? (I'd be surprised if it were otherwise.) Out of curiosity, up to how many entries is the linear search here faster than a hash lookup?
`std::unordered_map` is implemented as linked-list buckets, which usually means lookups are almost always cache misses, as the values don't reside near the map itself in memory. This is why there are so many "better" implementations floating around; the first thing most of them do is switch to open addressing to avoid the cache-miss problem.
I don't have direct data to answer your second question, but the napkin math is as follows:
1. Most OSM keys/values are short - just a few chars. This means they'll usually take advantage of small string optimization, i.e. the string chars will be stored inside the `std::string` object itself, not in a buffer pointed to by the object.
2. `std::pair` is basically just a fancy wrapper around a struct with two data members.
3. Combining (1) and (2) in a `std::vector` means you'll get key/values stored contiguously in memory, using just a few bytes each.
4. If you read down http://norvig.com/21-days.html#answers you'll see that an L1 cache hit is about 0.5 ns, and a main memory lookup (cache miss) is about 100 ns. This means you can do about 200 L1 operations in the time of a single main-memory access.
Based on the above, the back-of-the-napkin math tells me that it's probably somewhere in the order of 10s to maybe low 100s of key comparisons before it's better to use a hash to jump directly to a memory page that's not in L1 cache already.
OSM averages 2.3 tags per way: https://taginfo.openstreetmap.org/reports/database_statistics
Many of the replacement hashmaps use open-addressing for this reason - jump to a hash spot, then linear scan in a cache-friendly way. This beats out things like cuckoo-hash which avoids hash collisions, but ends up jumping all over the place causing cache misses.
It's very specific to our known use-case. If someone lumps thousands of very long key/values into OSM, then this approach won't perform.
See also https://github.com/osmcode/libosmium/blob/2709257c3a61c121533c7b465bfc2b2f9661d7f0/include/osmium/osm/tag.hpp#L121-L141 - I've spoken to @joto about this in the past and he's confident in the approach for fast handling of OSM tag/value pairs with typical OSM-like data.
There's also a nice histogram of OSM key lengths here: https://taginfo.openstreetmap.org/reports/key_lengths#histogram - a good solid chunk of all keys are <20 chars.
@danpat: I had the same question @mandeepsandhu had: I feel like we're trying to outsmart the compiler here. Did we confirm (via profiling or measurement) that any performance impact in the change is due to this custom hash map?
I just wrote a benchmark in dd3c3ed constructing maps of different sizes with very small string keys and then accessing the middle element. As @mandeepsandhu suggested might be the case, the STL map and `SmallFlatMap` seem pretty close to each other up to about 16 elements, but then quickly diverge. Under 16 elements there's definitely some benefit, but <10 ns (a strong possibility this is measurement noise / timing-resolution error). Maybe there would be some additional perf wins if we ordered the tag map by frequency of use within Valhalla.
Here's the benchmark:
$ ./bench/midgard/benchmark-flat_map
2020-05-04 18:30:21
Running ./bench/midgard/benchmark-flat_map
Run on (4 X 3500 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x2)
L1 Instruction 32 KiB (x2)
L2 Unified 256 KiB (x2)
L3 Unified 4096 KiB (x1)
Load Average: 2.57, 2.33, 2.63
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
BM_VFlatMapAccess/1 87.2 ns 86.0 ns 7669216
BM_VFlatMapAccess/2 83.0 ns 82.5 ns 8142091
BM_VFlatMapAccess/4 86.3 ns 85.2 ns 8546382
BM_VFlatMapAccess/8 89.7 ns 88.4 ns 7869236
BM_VFlatMapAccess/16 91.1 ns 90.0 ns 7923213
BM_VFlatMapAccess/32 108 ns 106 ns 6397485
BM_VFlatMapAccess/64 125 ns 124 ns 5687219
BM_VFlatMapAccess/128 147 ns 146 ns 4806801
BM_VFlatMapAccess/256 276 ns 273 ns 2619584
BM_VFlatMapAccess/512 409 ns 403 ns 1715337
BM_VFlatMapAccess/1024 755 ns 726 ns 1044605
...
BM_STLUnorderedMapAccess/1 90.1 ns 89.5 ns 7853168
BM_STLUnorderedMapAccess/2 90.9 ns 90.7 ns 7914881
BM_STLUnorderedMapAccess/4 94.9 ns 94.8 ns 6541322
BM_STLUnorderedMapAccess/8 96.5 ns 96.3 ns 7318195
BM_STLUnorderedMapAccess/16 97.2 ns 96.9 ns 7243752
BM_STLUnorderedMapAccess/32 101 ns 101 ns 6901311
BM_STLUnorderedMapAccess/64 99.3 ns 99.1 ns 6996642
BM_STLUnorderedMapAccess/128 99.5 ns 99.1 ns 6832935
BM_STLUnorderedMapAccess/256 111 ns 111 ns 6177197
BM_STLUnorderedMapAccess/512 102 ns 102 ns 6613944
BM_STLUnorderedMapAccess/1024 113 ns 112 ns 6157960
Happy to update this if I did something wrong (see the linked commit)! Entirely possible I made a benchmarking mistake, or that my computer isn't reflective of reality.
/cc @kevinkreiser
@mookerji The main gains in this PR come from the changes to `IsNotThruEdge` - probably 90% of the improvement is there. The `SmallFlatMap` shaves a few additional seconds off my test runs.

However, at least running locally, the improvement from `SmallFlatMap` is pretty consistent: it cuts 5-10 seconds off a 150-160 second test run (I'm running `valhalla_build_tiles` with a 100MB OSM PBF file of Bulgaria). I wonder if it behaves better under cache pressure than `unordered_map` - do you know if there's a pattern to apply cache pressure in Google Benchmark? Clobbering the L1/L2 cache each iteration might better match all the other work that the real code is doing.
A couple of other things that are worth considering:

- What do the insertion benchmarks look like? I don't think calls to `[]` dominate the usage of the class; lots of these objects are built and destroyed as well.
- What does range iteration look like - `for (const auto& element : tags) {}` in a benchmark? We loop over the complete tagset in quite a few places, so it could be that the main gains are coming from that, rather than only from usage of `operator[]`.
Overall, `SmallFlatMap` is a minor piece of this PR - the main benefit comes from the other changes.
Oh, one other thing - you probably want to move the `std::to_string()` calls out of the hot loop in those benchmarks - they'll dominate the timings, particularly for the smaller values. I quickly did that and now get this for lookups:
BM_VFlatMapAccess/1 4.80 ns 4.77 ns 145741334
BM_VFlatMapAccess/2 4.79 ns 4.76 ns 144731864
BM_VFlatMapAccess/4 6.20 ns 6.17 ns 101122459
BM_VFlatMapAccess/8 9.59 ns 9.44 ns 75012323
BM_VFlatMapAccess/16 12.7 ns 12.7 ns 54992537
BM_VFlatMapAccess/32 27.7 ns 27.6 ns 25056825
BM_VFlatMapAccess/64 60.5 ns 60.3 ns 11405295
BM_VFlatMapAccess/128 86.3 ns 86.0 ns 8018695
BM_VFlatMapAccess/256 243 ns 241 ns 2785393
BM_VFlatMapAccess/512 407 ns 402 ns 1770417
BM_VFlatMapAccess/1024 755 ns 748 ns 916110
BM_STLUnorderedMapAccess/1 9.93 ns 9.89 ns 69545865
BM_STLUnorderedMapAccess/2 10.0 ns 9.98 ns 70417576
BM_STLUnorderedMapAccess/4 16.5 ns 16.5 ns 42466452
BM_STLUnorderedMapAccess/8 16.5 ns 16.4 ns 42089567
BM_STLUnorderedMapAccess/16 16.6 ns 16.5 ns 42153946
BM_STLUnorderedMapAccess/32 17.5 ns 17.5 ns 39942939
BM_STLUnorderedMapAccess/64 17.6 ns 17.5 ns 39955934
BM_STLUnorderedMapAccess/128 18.1 ns 17.9 ns 39609787
BM_STLUnorderedMapAccess/256 23.8 ns 23.7 ns 29252352
BM_STLUnorderedMapAccess/512 18.9 ns 18.7 ns 38226508
BM_STLUnorderedMapAccess/1024 24.0 ns 23.9 ns 29206217
In percentage terms, a clear win IMO. In absolute terms, I do struggle to see how we'd be making enough of these calls for a 7 ns improvement (my ballpark for the gain at 3 entries per map) to manifest as a macro improvement to the process runtime. Maybe this map is being used even more than I thought. We'd have to make something like 150 million calls to `operator[]` for 1 second to be gained....
Another interesting observation - we have a lot of calls to `.find()` on this object type. I added a benchmark to your script that compares that. I wasn't expecting much difference from `operator[]`, but:
BM_VFlatMapFind/1 2.97 ns 2.95 ns 236505663
BM_VFlatMapFind/2 2.95 ns 2.94 ns 238443170
BM_VFlatMapFind/4 4.79 ns 4.77 ns 146934115
BM_VFlatMapFind/8 9.07 ns 9.04 ns 76705604
BM_VFlatMapFind/16 11.5 ns 11.4 ns 60524313
BM_VFlatMapFind/32 35.4 ns 35.3 ns 19926103
BM_VFlatMapFind/64 80.6 ns 79.7 ns 6927812
BM_VFlatMapFind/128 115 ns 115 ns 5988280
BM_VFlatMapFind/256 316 ns 314 ns 2244036
BM_VFlatMapFind/512 572 ns 570 ns 1182033
BM_VFlatMapFind/1024 1116 ns 1108 ns 634173
BM_STLUnorderedMapFind/1 10.1 ns 9.80 ns 65048508
BM_STLUnorderedMapFind/2 9.69 ns 9.62 ns 71446068
BM_STLUnorderedMapFind/4 16.1 ns 16.0 ns 43037726
BM_STLUnorderedMapFind/8 16.1 ns 16.0 ns 43643074
BM_STLUnorderedMapFind/16 16.1 ns 16.0 ns 43600396
BM_STLUnorderedMapFind/32 17.1 ns 16.9 ns 41560292
BM_STLUnorderedMapFind/64 16.8 ns 16.8 ns 41083670
BM_STLUnorderedMapFind/128 16.8 ns 16.8 ns 41179862
BM_STLUnorderedMapFind/256 24.1 ns 23.5 ns 30205093
BM_STLUnorderedMapFind/512 17.7 ns 17.5 ns 39801674
BM_STLUnorderedMapFind/1024 23.4 ns 23.3 ns 30081392
Again, these numbers are so small I struggle to see how they'd have noticeable impact, but they do seem to....
re: `std::to_string()` - that's fair, but the string conversion will be a common factor between comparisons at a given size. You could also pre-prepare the benchmark data across all benchmark cases, but my point was to show relative performance between maps, not an absolute benchmark (and also not to overthink it).
What was the typical key space size on one of these maps again? https://taginfo.openstreetmap.org/reports/key_lengths#histogram makes it seem like there could be a lot of them, but I'm not familiar enough (yet) with how this particular map is used.
Addressing your other points:

- Yes, there's a thing to flush to memory: `benchmark::ClobberMemory()`. I haven't used it.
- Benchmarking a key-space scan makes sense, but I expect we'd see the same performance story as with these other benchmarks.
- I also added insertion benchmarking, but didn't include those numbers because I thought (perhaps incorrectly) that our workloads were read-intensive, not write-intensive. I also left out the absl benchmarks, which seem to be better than STL for 'large' maps. Anyway, here they are:
BM_VFlatMapInsert/1 207 ns 201 ns 3050016
BM_VFlatMapInsert/2 416 ns 398 ns 1565334
BM_VFlatMapInsert/4 822 ns 804 ns 741541
BM_VFlatMapInsert/8 1624 ns 1594 ns 438866
BM_VFlatMapInsert/16 3510 ns 3414 ns 208204
BM_VFlatMapInsert/32 8127 ns 7509 ns 90333
BM_VFlatMapInsert/64 18418 ns 16659 ns 42500
BM_VFlatMapInsert/128 38339 ns 35216 ns 18561
BM_VFlatMapInsert/256 94164 ns 92460 ns 7767
BM_VFlatMapInsert/512 283171 ns 264208 ns 2768
BM_VFlatMapInsert/1024 925983 ns 900776 ns 782
BM_STLUnorderedMapInsert/1 245 ns 197 ns 3766722
BM_STLUnorderedMapInsert/2 373 ns 368 ns 1891462
BM_STLUnorderedMapInsert/4 821 ns 797 ns 874497
BM_STLUnorderedMapInsert/8 1579 ns 1556 ns 446588
BM_STLUnorderedMapInsert/16 3873 ns 3576 ns 191874
BM_STLUnorderedMapInsert/32 7139 ns 6801 ns 104181
BM_STLUnorderedMapInsert/64 13956 ns 13749 ns 48900
BM_STLUnorderedMapInsert/128 26959 ns 26627 ns 26020
BM_STLUnorderedMapInsert/256 55176 ns 54341 ns 13137
BM_STLUnorderedMapInsert/512 117416 ns 115000 ns 6232
BM_STLUnorderedMapInsert/1024 241274 ns 236532 ns 3040
> `std::unordered_map` is implemented as linked-list buckets, which usually means lookups are almost always cache misses, as the values don't reside near the map itself in memory. This is why there are so many "better" implementations floating around; the first thing most of them do is switch to open addressing to avoid the cache-miss problem.
My comment was poorly worded, I meant the hashed keys (buckets) will fit in cache. But you're right that any attempt to access the value or resolve a collision would go to main memory due to pointer indirection.
> I don't have direct data to answer your second question, but the napkin math is as follows:
>
> 1. Most OSM keys/values are short - just a few chars. This means they'll usually take advantage of small string optimization, i.e. the string chars will be stored inside the `std::string` object itself, not in a buffer pointed to by the object.
A string comparison would require all characters to be compared, as opposed to looking up an array index, although the latter has the added overhead of calculating the hash of the string (I've heard Murmur and FNV-1 are good for hashing strings and produce an even distribution). At this point I don't know which one would be faster, TBH. It's best to profile this and find out.
> 2. `std::pair` is basically just a fancy wrapper around a struct with two data members.
> 3. Combining (1) and (2) in a `std::vector` means you'll get key/values stored contiguously in memory, using just a few bytes each.
> 4. If you read down http://norvig.com/21-days.html#answers you'll see that an L1 cache hit is about 0.5 ns, and a main memory lookup (cache miss) is about 100 ns. This means you can do about 200 L1 operations in the time of a single main-memory access.
>
> Based on the above, the back-of-the-napkin math tells me that it's probably somewhere in the order of 10s to maybe low 100s of key comparisons before it's better to use a hash to jump directly to a memory page that's not in L1 cache already.
>
> OSM averages 2.3 tags per way: https://taginfo.openstreetmap.org/reports/database_statistics
>
> Many of the replacement hashmaps use open-addressing for this reason - jump to a hash spot, then linear scan in a cache-friendly way. This beats out things like cuckoo-hash, which avoids hash collisions but ends up jumping all over the place, causing cache misses.
I agree open addressing is much more cache-friendly (and consequently better performing), but it suffers at high load factors due to the need for rehashing (which is probably a non-issue here, since we're storing few keys).
> It's very specific to our known use-case. If someone lumps thousands of very long key/values into OSM, then this approach won't perform.
>
> See also https://github.com/osmcode/libosmium/blob/2709257c3a61c121533c7b465bfc2b2f9661d7f0/include/osmium/osm/tag.hpp#L121-L141 - I've spoken to @joto about this in the past and he's confident in the approach for fast handling of OSM tag/value pairs with typical OSM-like data.
Ok - so it looks like the evidence shows that the improvement from the `SmallFlatMap` is marginal at best. I've pulled it from this PR.
… case, and a bigger one isn't faster than unordered_set.find() for this case.
…e are usually few pairs, and the values are small, so linear search in L1 cache combined with short-string-optimization will typically outperform std::unordered_map with its fragmented memory layout." This reverts commit 134f145.
nevermind, i saw
@mandeepsandhu I ended up pulling the bloom filter out. I realized I hadn't sized it right to ensure a low hit rate: 64 bits isn't much space to ensure a low collision rate across the couple of thousand nodes we'd usually insert, so the initial version was probably not actually preventing very many lookups. I re-wrote it locally to use a much larger bloom filter - more like 2k bits.
…halla/valhalla into danpat_parallel_hashmap_tilebuilding
Lol, ok! When I was writing the comment, the diff still showed the bloom filter in use (and you had removed the flat map).
Okay, that's good to know. The simpler the better. And either way, this was a good learning exercise!
Issue
Speeds up `valhalla_build_tiles` by about 25% (on my test maps) by replacing some instances of `std::unordered_map` and `std::unordered_set` with more efficient approaches for the context.

The primary changes are:

- Avoid `unordered_set.erase(set.begin())` in a tight loop during `IsNotThruEdge` - this triggers constant reallocation. It's faster to use a flat vector buffer as a queue instead.
- In `IsNotThruEdge`, avoid doing constant `unordered_set.find()` operations by adding a pre-check bloom filter. As what we're interested in in this case is "is the node not in the visited set", a bloom filter works well. Because the set is relatively small (a few hundred entries), a simple filter using a `std::uint64_t` is quite effective.
- Replace the `Tags` type in `osmpbfparser` and `luatagtransform` with a flat list of key/value pairs. Performance is better than `*map` types (even good ones) because the list of key/value pairs is very short, and the keys and values themselves are also very short, allowing a linear search of the list to occur completely in L1 cache. This outperforms approaches which store the data fragmented in RAM and incur memory-lookup delays.

Tasklist
Requirements / Relations
Link any requirements here. Other pull requests this PR is based on?