Speed up tilebuilding by avoiding some STL bottlenecks #2349
Conversation
@danpat there are a couple of interesting but lesser-used ones in osmdata.h as well that you might want to hit. They are used by what are, relatively speaking, pretty infrequently occurring tagging combinations, but it may still be useful to future-proof them.
@danpat i took a pass at getting this build working in CI (resolving conflicts and adding a hash function to the `Turn::Type`)
@danpat @kevinkreiser it sounded like there were concerns about the difficulty of integrating absl, the project from which the header-only parallel hash map dependency in the PR is actually derived. I put up a branch which shows that it's actually a pretty straightforward update to our CMake configuration to only include the dependency where needed. I didn't include it in the branch, but I verified locally that the project builds correctly and tests pass when you include it. @kevinkreiser also raised a concern about binary size; here's what I'm getting on a release build:

Update: the master release binaries are about the same size. The Debug builds seem about 5-10% bigger. Revisiting the question from this morning: do we want to replace our STL hash map with what, on the surface, seems like someone's personal project on GitHub? That doesn't seem like an obvious choice to me. Why not use the project from which it's derived?
…ed notes. For small sizes, a flat memory layout in L1 cache will usually outperform a fragmented container like unordered_set, in this case, by about 5x.
…ually few pairs, and the values are small, so linear search in L1 cache combined with short-string-optimization will typically outperform std::unordered_map with its fragmented memory layout.
Force-pushed from e78cd0e to dae1280.
```cpp
expandset.erase(expandset.begin());
// Expand edges from this node. Post-increment the index to the next item.
const GraphId expandnode = expandset[expand_pos++];
```
Wonder if you can use `GraphId.value` as a key to your bloom filter? (Will they be unique in this context?)
@mandeepsandhu I had that initially - but I think because many edges will be within the same tile, and not many bits change between values, we would end up with bloom filter collisions really quickly. I added hashing to shake things around a bit.
Or do you mean still hash, but only use the `.value()` part of the GraphId? That might collide if we're exploring a location that crosses tile boundaries. We definitely don't need to be perfectly collision-free here, just better than nothing.
No, I meant using `GraphId.value` directly, without running it through the hash function. But yeah, if there are a lot of collisions it will cause extra hash lookups, which might not be as effective.
valhalla/midgard/small_flat_map.h
```cpp
// Works when all your data will fit in L1 cache and it's faster to just loop
// over everything than it is to fetch fragmented records from main memory.
template <typename KeyT, typename ValueT>
struct SmallFlatMap : public std::vector<std::pair<KeyT, ValueT>> {
```
For a small number of entries, won't a `std::unordered_map` fit in the L1/L2 cache too? (I'd be surprised if it were otherwise.) Out of curiosity, up to how many entries is the linear search here faster than a hash lookup?
`std::unordered_map` is implemented as linked-list buckets, which usually means lookups are almost always cache misses, as the values don't reside near the map itself in memory. This is why there are so many "better" implementations floating around; the first thing most of them do is switch to open addressing to avoid the cache-miss problem.
I don't have direct data to answer your second question, but the napkin math is as follows:
1. Most OSM keys/values are short - just a few chars. This means they'll usually take advantage of small string optimization, i.e. the string chars will be stored inside the `std::string` object itself, not in a buffer pointed to by the object.
2. `std::pair` is basically just a fancy wrapper around a struct with two data members.
3. Combining (1) and (2) in a `std::vector` means you'll get key/values stored contiguously in memory, using just a few bytes each.
4. If you read down http://norvig.com/21-days.html#answers you'll see that an L1 cache hit is about 0.5 ns, and a main memory lookup (cache miss) is about 100 ns. This means you can do about 200 L1 operations in the time of a single main-memory access.
Based on the above, the back-of-the-napkin math tells me that it's probably somewhere in the order of 10s to maybe low 100s of key comparisons before it's better to use a hash to jump directly to a memory page that's not in L1 cache already.
OSM averages 2.3 tags per way: https://taginfo.openstreetmap.org/reports/database_statistics
Many of the replacement hashmaps use open-addressing for this reason - jump to a hash spot, then linear scan in a cache-friendly way. This beats out things like cuckoo-hash which avoids hash collisions, but ends up jumping all over the place causing cache misses.
It's very specific to our known use-case. If someone lumps thousands of very long key/values into OSM, then this approach won't perform.
See also https://github.com/osmcode/libosmium/blob/2709257c3a61c121533c7b465bfc2b2f9661d7f0/include/osmium/osm/tag.hpp#L121-L141 - I've spoken to @joto about this in the past and he's confident in the approach for fast handling of OSM tag/value pairs with typical OSM-like data.
There's also a nice histogram of OSM key lengths here: https://taginfo.openstreetmap.org/reports/key_lengths#histogram - a good solid chunk of all keys are <20 chars.
@danpat: I had the same question @mandeepsandhu had: I feel like we're trying to outsmart the compiler here. Did we confirm (via profiling or measurement) that any performance impact in the change is due to this custom hash map?
I just wrote a benchmark in dd3c3ed constructing maps of different sizes with very small string keys and then accessing the middle element. As @mandeepsandhu suggested might be the case, the STL map and `SmallFlatMap` seem pretty close to each other up to about 16 elements, but then quickly diverge. Under 16 elements there's definitely some benefit, but <10 ns (a strong possibility this is measurement noise / timing-resolution error). Maybe there would be some additional perf wins if we ordered the tag map by frequency of use within Valhalla.
Here's the benchmark:
$ ./bench/midgard/benchmark-flat_map
2020-05-04 18:30:21
Running ./bench/midgard/benchmark-flat_map
Run on (4 X 3500 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x2)
L1 Instruction 32 KiB (x2)
L2 Unified 256 KiB (x2)
L3 Unified 4096 KiB (x1)
Load Average: 2.57, 2.33, 2.63
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
BM_VFlatMapAccess/1 87.2 ns 86.0 ns 7669216
BM_VFlatMapAccess/2 83.0 ns 82.5 ns 8142091
BM_VFlatMapAccess/4 86.3 ns 85.2 ns 8546382
BM_VFlatMapAccess/8 89.7 ns 88.4 ns 7869236
BM_VFlatMapAccess/16 91.1 ns 90.0 ns 7923213
BM_VFlatMapAccess/32 108 ns 106 ns 6397485
BM_VFlatMapAccess/64 125 ns 124 ns 5687219
BM_VFlatMapAccess/128 147 ns 146 ns 4806801
BM_VFlatMapAccess/256 276 ns 273 ns 2619584
BM_VFlatMapAccess/512 409 ns 403 ns 1715337
BM_VFlatMapAccess/1024 755 ns 726 ns 1044605
...
BM_STLUnorderedMapAccess/1 90.1 ns 89.5 ns 7853168
BM_STLUnorderedMapAccess/2 90.9 ns 90.7 ns 7914881
BM_STLUnorderedMapAccess/4 94.9 ns 94.8 ns 6541322
BM_STLUnorderedMapAccess/8 96.5 ns 96.3 ns 7318195
BM_STLUnorderedMapAccess/16 97.2 ns 96.9 ns 7243752
BM_STLUnorderedMapAccess/32 101 ns 101 ns 6901311
BM_STLUnorderedMapAccess/64 99.3 ns 99.1 ns 6996642
BM_STLUnorderedMapAccess/128 99.5 ns 99.1 ns 6832935
BM_STLUnorderedMapAccess/256 111 ns 111 ns 6177197
BM_STLUnorderedMapAccess/512 102 ns 102 ns 6613944
BM_STLUnorderedMapAccess/1024 113 ns 112 ns 6157960
Happy to update this if I did something wrong (see the linked commit)! Entirely possible I made a benchmarking mistake, or that my computer isn't reflective of reality.
/cc @kevinkreiser
@mookerji The main gains in this PR come from the changes to `IsNotThruEdge` - probably 90% of the improvement is there. The `SmallFlatMap` shaves a few additional seconds off my test runs.

However, at least running locally, the improvement from `SmallFlatMap` is pretty consistent: it cuts 5-10 seconds off a 150-160 second test run (I'm running `valhalla_build_tiles` with a 100MB OSM PBF file of Bulgaria). I wonder if it behaves better under cache pressure than `unordered_map` - do you know if there's a pattern to apply cache pressure in Google Benchmark? Clobbering the L1/L2 cache each iteration might better match all the other work that the real code is doing.
A couple of other things that are worth considering:

- What do the insertion benchmarks look like? I don't think calls to `[]` dominate the usage of the class; lots of these objects are built and destroyed as well.
- What does range iteration look like - `for (const auto& element : tags) {}` in a benchmark? We loop over the complete tagset in quite a few places, so it could be that the main gains are coming from that, rather than only from usage of `operator[]`.
Overall, `SmallFlatMap` is a minor piece of this PR - the main benefit comes from the other changes.
Oh, one other thing - you probably want to move the `std::to_string()` calls out of the hot loop in those benchmarks - they'll dominate the timings, particularly for the smaller values. I quickly did that and now get this for lookups:
BM_VFlatMapAccess/1 4.80 ns 4.77 ns 145741334
BM_VFlatMapAccess/2 4.79 ns 4.76 ns 144731864
BM_VFlatMapAccess/4 6.20 ns 6.17 ns 101122459
BM_VFlatMapAccess/8 9.59 ns 9.44 ns 75012323
BM_VFlatMapAccess/16 12.7 ns 12.7 ns 54992537
BM_VFlatMapAccess/32 27.7 ns 27.6 ns 25056825
BM_VFlatMapAccess/64 60.5 ns 60.3 ns 11405295
BM_VFlatMapAccess/128 86.3 ns 86.0 ns 8018695
BM_VFlatMapAccess/256 243 ns 241 ns 2785393
BM_VFlatMapAccess/512 407 ns 402 ns 1770417
BM_VFlatMapAccess/1024 755 ns 748 ns 916110
BM_STLUnorderedMapAccess/1 9.93 ns 9.89 ns 69545865
BM_STLUnorderedMapAccess/2 10.0 ns 9.98 ns 70417576
BM_STLUnorderedMapAccess/4 16.5 ns 16.5 ns 42466452
BM_STLUnorderedMapAccess/8 16.5 ns 16.4 ns 42089567
BM_STLUnorderedMapAccess/16 16.6 ns 16.5 ns 42153946
BM_STLUnorderedMapAccess/32 17.5 ns 17.5 ns 39942939
BM_STLUnorderedMapAccess/64 17.6 ns 17.5 ns 39955934
BM_STLUnorderedMapAccess/128 18.1 ns 17.9 ns 39609787
BM_STLUnorderedMapAccess/256 23.8 ns 23.7 ns 29252352
BM_STLUnorderedMapAccess/512 18.9 ns 18.7 ns 38226508
BM_STLUnorderedMapAccess/1024 24.0 ns 23.9 ns 29206217
In percentage terms, a clear win IMO. In absolute terms, I do struggle to see how we'd be making enough of these calls for a 7 ns improvement (my ballpark for the gain at 3 entries per map) to manifest as a macro improvement to the process runtime. Maybe this map is being used even more than I thought. We'd have to make something like 150 million calls to `operator[]` for 1 second to be gained....
Another interesting observation - we have a lot of calls to `.find()` on this object type. I added a benchmark to your script that compares that. I wasn't expecting much difference from `operator[]`, but:
BM_VFlatMapFind/1 2.97 ns 2.95 ns 236505663
BM_VFlatMapFind/2 2.95 ns 2.94 ns 238443170
BM_VFlatMapFind/4 4.79 ns 4.77 ns 146934115
BM_VFlatMapFind/8 9.07 ns 9.04 ns 76705604
BM_VFlatMapFind/16 11.5 ns 11.4 ns 60524313
BM_VFlatMapFind/32 35.4 ns 35.3 ns 19926103
BM_VFlatMapFind/64 80.6 ns 79.7 ns 6927812
BM_VFlatMapFind/128 115 ns 115 ns 5988280
BM_VFlatMapFind/256 316 ns 314 ns 2244036
BM_VFlatMapFind/512 572 ns 570 ns 1182033
BM_VFlatMapFind/1024 1116 ns 1108 ns 634173
BM_STLUnorderedMapFind/1 10.1 ns 9.80 ns 65048508
BM_STLUnorderedMapFind/2 9.69 ns 9.62 ns 71446068
BM_STLUnorderedMapFind/4 16.1 ns 16.0 ns 43037726
BM_STLUnorderedMapFind/8 16.1 ns 16.0 ns 43643074
BM_STLUnorderedMapFind/16 16.1 ns 16.0 ns 43600396
BM_STLUnorderedMapFind/32 17.1 ns 16.9 ns 41560292
BM_STLUnorderedMapFind/64 16.8 ns 16.8 ns 41083670
BM_STLUnorderedMapFind/128 16.8 ns 16.8 ns 41179862
BM_STLUnorderedMapFind/256 24.1 ns 23.5 ns 30205093
BM_STLUnorderedMapFind/512 17.7 ns 17.5 ns 39801674
BM_STLUnorderedMapFind/1024 23.4 ns 23.3 ns 30081392
Again, these numbers are so small I struggle to see how they'd have noticeable impact, but they do seem to....
re: `std::to_string()` - that's fair, but the string conversion will be a common factor between comparisons at a given size. You could also pre-prepare the benchmark data across all benchmark cases, but my point was to show relative performance between maps, not an absolute benchmark (and also not to overthink it).
What was the typical key space size on one of these maps again? https://taginfo.openstreetmap.org/reports/key_lengths#histogram makes it seem like there could be a lot of them, but I'm not familiar enough (yet) with how this particular map is used.
Addressing your other points:

- Yes, there's a thing to flush to memory: `benchmark::ClobberMemory()`. I haven't used it.
- Benchmarking a key-space scan makes sense, but I expect we'd see the same performance story as with these other benchmarks.
- I also added insertion benchmarking, but didn't include those numbers because I thought (perhaps incorrectly) that our workloads were read-intensive, not write-intensive. I also left out the absl benchmarks, which seem to be better than STL for 'large' maps. Anyway, here they are:
BM_VFlatMapInsert/1 207 ns 201 ns 3050016
BM_VFlatMapInsert/2 416 ns 398 ns 1565334
BM_VFlatMapInsert/4 822 ns 804 ns 741541
BM_VFlatMapInsert/8 1624 ns 1594 ns 438866
BM_VFlatMapInsert/16 3510 ns 3414 ns 208204
BM_VFlatMapInsert/32 8127 ns 7509 ns 90333
BM_VFlatMapInsert/64 18418 ns 16659 ns 42500
BM_VFlatMapInsert/128 38339 ns 35216 ns 18561
BM_VFlatMapInsert/256 94164 ns 92460 ns 7767
BM_VFlatMapInsert/512 283171 ns 264208 ns 2768
BM_VFlatMapInsert/1024 925983 ns 900776 ns 782
BM_STLUnorderedMapInsert/1 245 ns 197 ns 3766722
BM_STLUnorderedMapInsert/2 373 ns 368 ns 1891462
BM_STLUnorderedMapInsert/4 821 ns 797 ns 874497
BM_STLUnorderedMapInsert/8 1579 ns 1556 ns 446588
BM_STLUnorderedMapInsert/16 3873 ns 3576 ns 191874
BM_STLUnorderedMapInsert/32 7139 ns 6801 ns 104181
BM_STLUnorderedMapInsert/64 13956 ns 13749 ns 48900
BM_STLUnorderedMapInsert/128 26959 ns 26627 ns 26020
BM_STLUnorderedMapInsert/256 55176 ns 54341 ns 13137
BM_STLUnorderedMapInsert/512 117416 ns 115000 ns 6232
BM_STLUnorderedMapInsert/1024 241274 ns 236532 ns 3040
> `std::unordered_map` is implemented as linked-list buckets, which usually means lookups are almost always cache misses, as the values don't reside near the map itself in memory. This is why there are so many "better" implementations floating around; the first thing most of them do is switch to open addressing to avoid the cache-miss problem.
My comment was poorly worded, I meant the hashed keys (buckets) will fit in cache. But you're right that any attempt to access the value or resolve a collision would go to main memory due to pointer indirection.
> I don't have direct data to answer your second question, but the napkin math is as follows:
>
> 1. Most OSM keys/values are short - just a few chars. This means they'll usually take advantage of small string optimization, i.e. the string chars will be stored inside the `std::string` object itself, not in a buffer pointed to by the object.
A string comparison would require all characters to be compared, as opposed to looking up an array index, although the latter has the added overhead of calculating the hash of the string (I've heard Murmur and FNV-1 are good for hashing strings and produce an even distribution). At this point I don't know which one would be faster, TBH. It's best to profile this and find out.
> 2. `std::pair` is basically just a fancy wrapper around a struct with two data members.
> 3. Combining (1) and (2) in a `std::vector` means you'll get key/values stored contiguously in memory, using just a few bytes each.
> 4. If you read down http://norvig.com/21-days.html#answers you'll see that an L1 cache hit is about 0.5 ns, and a main memory lookup (cache miss) is about 100 ns. This means you can do about 200 L1 operations in the time of a single main-memory access.
>
> Based on the above, the back-of-the-napkin math tells me that it's probably somewhere in the order of 10s to maybe low 100s of key comparisons before it's better to use a hash to jump directly to a memory page that's not in L1 cache already.
>
> OSM averages 2.3 tags per way: https://taginfo.openstreetmap.org/reports/database_statistics
>
> Many of the replacement hashmaps use open-addressing for this reason - jump to a hash spot, then linear scan in a cache-friendly way. This beats out things like cuckoo-hash, which avoids hash collisions but ends up jumping all over the place, causing cache misses.
I agree open addressing is much more cache-friendly (and consequently better performing), but it suffers at high load factors due to the need for rehashing (which is probably a non-issue here, since we're storing few keys).
> It's very specific to our known use-case. If someone lumps thousands of very long key/values into OSM, then this approach won't perform.
>
> See also https://github.com/osmcode/libosmium/blob/2709257c3a61c121533c7b465bfc2b2f9661d7f0/include/osmium/osm/tag.hpp#L121-L141 - I've spoken to @joto about this in the past and he's confident in the approach for fast handling of OSM tag/value pairs with typical OSM-like data.
Ok - so it looks like the evidence shows that the improvement from the `SmallFlatMap` is marginal at best. I've pulled it from this PR.
… case, and a bigger one isn't faster than unordered_set.find() for this case.
…e are usually few pairs, and the values are small, so linear search in L1 cache combined with short-string-optimization will typically outperform std::unordered_map with its fragmented memory layout." This reverts commit 134f145.
nevermind, i saw
@mandeepsandhu I ended up pulling the bloom filter out. I realized I hadn't sized it right to ensure a low hit rate: 64 bits isn't much space to ensure a low collision rate across the couple of thousand nodes we'd usually insert, so the initial version was probably not actually preventing very many lookups. I re-wrote it locally to use a much larger bloom filter - more like 2k bits.
…halla/valhalla into danpat_parallel_hashmap_tilebuilding
Lol, ok! When I was writing the comment, the diff still showed the bloom filter in use (and you had removed the flat map).
Okay, that's good to know. The simpler the better. And either way, this was a good learning exercise!
Issue
Speeds up `valhalla_build_tiles` by about 25% (on my test maps) by replacing some instances of `std::unordered_map` and `std::unordered_set` with more efficient approaches for the context.

The primary changes are:

- Avoid `unordered_set.erase(set.begin())` in a tight loop during `IsNotThruEdge` - this triggers constant reallocation. It's faster to use a flat vector buffer as a queue instead.
- In `IsNotThruEdge`, avoid doing constant `unordered_set.find()` operations by adding a pre-check bloom filter. As what we're interested in in this case is "is the node not in the visited set", a bloom filter works well. Because the set is relatively small (a few hundred entries), a simple filter using a `std::uint64_t` is quite effective.
- Replace the `Tags` type in `osmpbfparser` and `luatagtransform` with a flat list of key/value pairs. Performance is better than `*map` types (even good ones) because the list of key/value pairs is very short, and the keys and values themselves are also very short, allowing a linear search of the list to occur completely in L1 cache. This outperforms approaches which store the data fragmented in RAM and incur memory-lookup delays.

Tasklist
Requirements / Relations
Link any requirements here. Other pull requests this PR is based on?