
Speed up tilebuilding by avoiding some STL bottlenecks #2349

Merged
merged 11 commits into master from danpat_parallel_hashmap_tilebuilding on May 5, 2020

Conversation

danpat
Member

@danpat danpat commented May 4, 2020

Issue

Speeds up valhalla_build_tiles by about 25% (on my test maps) by replacing some instances of std::unordered_map and std::unordered_set with more efficient approaches for the context.

# Before
$ /usr/bin/time ./valhalla_build_tiles -c config.json bulgaria-latest.osm.pbf
      149.53 real       213.47 user        13.32 sys

# After
$ /usr/bin/time ./valhalla_build_tiles -c config.json bulgaria-latest.osm.pbf
      118.93 real       156.03 user         9.16 sys

The primary changes are:

  1. Avoid doing unordered_set.erase(set.begin()) in a tight loop during IsNotThruEdge - this triggers constant reallocation. It's faster to use a flat vector buffer as a queue instead (see the sketch after this list).
  2. Also in IsNotThruEdge, avoid constant unordered_set.find() operations by adding a pre-check bloom filter. Since the question we're asking is "is this node not in the visited set?", a bloom filter fits well. Because the set is relatively small (a few hundred entries), a simple filter using a std::uint64_t is quite effective.
  3. In a similar vein, replace the Tags type in osmpbfparser and luatagtransform with a flat list of key/value pairs. Performance is better than *map types (even good ones) because the list of key/value pairs is very short, and the keys and values themselves are also very short, so a linear search of the list happens entirely in L1 cache. This outperforms approaches that store the data fragmented in RAM and incur memory-lookup delays.
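
To make changes (1) and (2) concrete, here is a minimal sketch of the pattern - hypothetical, simplified names, not the actual PR code (a plain integer stands in for GraphId):

#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

using GraphId = std::uint64_t; // stand-in for valhalla::baldr::GraphId

void ExpandSketch(const GraphId start) {
  std::vector<GraphId> expandset = {start}; // flat buffer used as a queue
  std::unordered_set<GraphId> visitedset;
  std::uint64_t bloom = 0;                  // 64-bit pre-check filter
  std::size_t expand_pos = 0;               // read cursor replaces erase(begin())
  while (expand_pos < expandset.size()) {
    const GraphId expandnode = expandset[expand_pos++];
    const std::uint64_t bit = 1ULL << (std::hash<GraphId>{}(expandnode) & 63);
    // If the bit is unset, the node is definitely not in visitedset, so the
    // unordered_set lookup (a likely cache miss) can be skipped entirely.
    if ((bloom & bit) != 0 && visitedset.count(expandnode) != 0) {
      continue; // already visited
    }
    bloom |= bit;
    visitedset.insert(expandnode);
    // ... expand edges from expandnode and push_back() neighbours onto
    // expandset here ...
  }
}

The vector is never erased from; the cursor just walks forward, so the hot loop does no rehashing or bucket reallocation.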

Tasklist

  • Add tests
  • Add #fixes with the issue number that this PR addresses
  • Generally use squash merge to rebase and clean comments before merging
  • Update the changelog

Requirements / Relations

Link any requirements here. Other pull requests this PR is based on?

@kevinkreiser
Member

@danpat there are a couple of interesting but lesser-used ones in osmdata.h as well that you might want to hit. they are used by relatively infrequently occurring tagging combinations, but it may still be useful to future-proof them.

@kevinkreiser
Member

@danpat i took a pass at getting this build working in CI (resolving conflicts and adding a hash function to the Turn::Type)

@danpat danpat added this to the Sprint 9 - Left Turn milestone May 4, 2020
@mookerji
Contributor

mookerji commented May 4, 2020

@danpat @kevinkreiser sounded like there were concerns about the difficulty of integrating absl, the project from which the header-only parallel hash map dependency in the PR is actually derived. I put up a branch here which shows that it's actually a pretty straightforward update to our CMake configuration to only include the dependency for absl::flat_hash_map: master...mookerji:mookerji-hashmaps. These specific targets are actually exported by Absl's CMake file and pretty clearly documented here: https://github.com/abseil/abseil-cpp/tree/master/CMake#available-abseil-cmake-public-targets.

I didn't include it in the branch, but I verified locally that the project builds correctly and tests pass when you add #include <absl/container/flat_hash_map.h> and update a locally-scoped hash map to use it. More extensive changes around interfaces are already covered in this PR.

@kevinkreiser also raised a concern about binary size; here's what I'm getting on a release build:

$ cmake -DENABLE_PYTHON_BINDINGS=OFF -DENABLE_NODE_BINDINGS=OFF -DENABLE_BENCHMARKS=OFF -DCMAKE_BUILD_TYPE=Release ../
$ make -j4
$ ls
-rwxr-xr-x    1 buromookerji  staff   2.1M May  4 10:50 valhalla_add_predicted_traffic*
-rwxr-xr-x    1 buromookerji  staff   106K May  4 10:50 valhalla_benchmark_adjacency_list*
-rwxr-xr-x    1 buromookerji  staff   4.0M May  4 10:50 valhalla_benchmark_admins*
-rwxr-xr-x    1 buromookerji  staff   5.7M May  4 10:50 valhalla_benchmark_loki*
-rwxr-xr-x    1 buromookerji  staff   234K May  4 10:50 valhalla_benchmark_skadi*
-rwxr-xr-x    1 buromookerji  staff   4.1M May  4 10:49 valhalla_build_admins*
-rwxr-xr-x    1 buromookerji  staff   1.9M May  4 10:50 valhalla_build_connectivity*
-rwxr-xr-x    1 buromookerji  staff   4.2M May  4 10:50 valhalla_build_statistics*
-rwxr-xr-x    1 buromookerji  staff   4.0M May  4 10:49 valhalla_build_tiles*
-rwxr-xr-x    1 buromookerji  staff   4.2M May  4 10:50 valhalla_convert_transit*
-rwxr-xr-x    1 buromookerji  staff   1.8M May  4 10:50 valhalla_expand_bounding_box*
-rwxr-xr-x    1 buromookerji  staff   1.8M May  4 10:50 valhalla_export_edges*
-rwxr-xr-x    1 buromookerji  staff   4.2M May  4 10:49 valhalla_fetch_transit*
-rwxr-xr-x    1 buromookerji  staff   6.3M May  4 10:49 valhalla_loki_worker*
-rwxr-xr-x    1 buromookerji  staff   6.7M May  4 10:49 valhalla_odin_worker*
-rwxr-xr-x    1 buromookerji  staff   6.0M May  4 10:50 valhalla_path_comparison*
-rwxr-xr-x    1 buromookerji  staff   1.8M May  4 10:49 valhalla_query_transit*
-rwxr-xr-x    1 buromookerji  staff   5.8M May  4 10:50 valhalla_run_isochrone*
-rwxr-xr-x    1 buromookerji  staff   5.8M May  4 10:50 valhalla_run_map_match*
-rwxr-xr-x    1 buromookerji  staff   6.5M May  4 10:50 valhalla_run_matrix*
-rwxr-xr-x    1 buromookerji  staff   7.8M May  4 10:50 valhalla_run_route*
-rwxr-xr-x    1 buromookerji  staff   9.2M May  4 10:49 valhalla_service*
-rwxr-xr-x    1 buromookerji  staff   7.1M May  4 10:49 valhalla_thor_worker*
-rwxr-xr-x    1 buromookerji  staff   4.1M May  4 10:50 valhalla_validate_transit*
-rwxr-xr-x    1 buromookerji  staff   1.9M May  4 10:49 valhalla_ways_to_edges*

Update: the master release binaries are about the same size. The Debug builds seem about 5-10% bigger.

Revisiting the question from this morning: do we want to replace our STL hash map with what, on the surface, seems like someone's personal project on GitHub? That doesn't seem like an obvious choice to me. Why not use the project from which it's derived?

danpat added 3 commits May 4, 2020 14:34
…ed notes. For small sizes, a flat memory layout in L1 cache will usually outperform a fragmented container like unordered_set, in this case, by about 5x.
…ually few pairs, and the values are small, so linear search in L1 cache combined with short-string-optimization will typically outperform std::unordered_map with its fragmented memory layout.
@danpat danpat force-pushed the danpat_parallel_hashmap_tilebuilding branch from e78cd0e to dae1280 May 4, 2020 21:35
@danpat danpat changed the title Faster tilebuilding by replacing std::unordered_map Speed up tilebuilding by avoiding some STL library bottlenecks May 4, 2020
@danpat danpat changed the title Speed up tilebuilding by avoiding some STL library bottlenecks Speed up tilebuilding by avoiding some STL bottlenecks May 4, 2020
- expandset.erase(expandset.begin());
+ // Expand edges from this node. Post-increment the index to the next
+ // item.
+ const GraphId expandnode = expandset[expand_pos++];
Contributor

Wonder if you can use GraphId.value as a key to your bloom filter? (will they be unique in this context?)

Member Author

@mandeepsandhu I had that initially - but I think because many edges will be within the same tile, and not many bits change between values, we would end up with bloom filter collisions really quickly. I added hashing to shake things around a bit.

Member Author

Or do you mean still hash, but only use the .value() part of the GraphId?

That might collide if we're exploring a location that crosses tile boundaries. We definitely don't need to be perfectly collision-free here, just better than nothing.

Contributor

No I meant using GraphId.value directly without running through the hash function. But yeah, if there are a lot of collisions it will cause extra hash lookups which might not be as effective.

// Works when all your data will fit in L1 cache and it's faster to just loop
// over everything than it is to fetch fragmented records from main memory.
template <typename KeyT, typename ValueT>
struct SmallFlatMap : public std::vector<std::pair<KeyT, ValueT>> {
Contributor

For a small number of entries, won't a std::unordered_map also fit in the L1/L2 cache? (I'd be surprised if it were otherwise.)

Out of curiosity, up to how many entries is the linear search here faster than a hash lookup?

Member Author

@danpat danpat May 4, 2020

std::unordered_map is implemented as linked-list buckets, which usually means lookups are almost always cache misses, as the values don't reside near the map itself in memory. This is why there are so many "better" implementations floating around, the first thing most of them do is switch to open addressing to avoid the cache miss problem.

I don't have direct data to answer your second question, but the napkin math is as follows:

  1. Most OSM keys/values are short - just a few chars. This means they'll usually take advantage of small string optimization - i.e. the string chars will be stored inside the std::string object itself, not in a buffer pointed to by the object.
  2. std::pair is basically just a fancy wrapper around a struct with two data members.
  3. Combining (1) and (2) in a std::vector means key/values get stored contiguously in memory, using just a few bytes each.
  4. If you read down http://norvig.com/21-days.html#answers you'll see that an L1 cache hit is about 0.5ns, and a main memory lookup (cache miss) is about 100ns. This means that you can do about 200 L1 operations in the time a single main memory access is done.

Based on the above, the back-of-the-napkin math tells me it's probably somewhere on the order of tens to maybe low hundreds of key comparisons before it's better to use a hash to jump directly to a memory page that's not already in L1 cache.

OSM averages 2.3 tags per way: https://taginfo.openstreetmap.org/reports/database_statistics

Many of the replacement hashmaps use open-addressing for this reason - jump to a hash spot, then linear scan in a cache-friendly way. This beats out things like cuckoo-hash which avoids hash collisions, but ends up jumping all over the place causing cache misses.

It's very specific to our known use-case. If someone lumps thousands of very long key/values into OSM, then this approach won't perform.

See also https://github.com/osmcode/libosmium/blob/2709257c3a61c121533c7b465bfc2b2f9661d7f0/include/osmium/osm/tag.hpp#L121-L141 - I've spoken to @joto about this in the past and he's confident in the approach for fast handling of OSM tag/value pairs with typical OSM-like data.
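
For illustration, a minimal sketch (assumed shape, not the exact code in this PR) of what the linear-scan lookup over that flat layout can look like:

#include <string>
#include <utility>
#include <vector>

template <typename KeyT, typename ValueT>
struct SmallFlatMap : public std::vector<std::pair<KeyT, ValueT>> {
  using Base = std::vector<std::pair<KeyT, ValueT>>;
  // Linear scan - contiguous pairs plus small-string-optimized keys keep
  // the whole search inside L1 for typical OSM tag counts (~2-3 per way).
  typename Base::iterator find(const KeyT& key) {
    auto it = this->begin();
    while (it != this->end() && it->first != key)
      ++it;
    return it;
  }
  ValueT& operator[](const KeyT& key) {
    auto it = find(key);
    if (it != this->end())
      return it->second;
    this->emplace_back(key, ValueT{});
    return this->back().second;
  }
};

// e.g. SmallFlatMap<std::string, std::string> tags; tags["highway"] = "residential";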

Member Author

There's also a nice histogram of OSM key lengths here:

https://taginfo.openstreetmap.org/reports/key_lengths#histogram

a good solid chunk of all keys are <20 chars.

Contributor

@danpat: I had the same question @mandeepsandhu had: I feel like we're trying to outsmart the compiler here. Did we confirm (via profiling or measurement) that any performance impact in the change is due to this custom hash map?

I just wrote a benchmark in dd3c3ed constructing maps of different sizes for very small string keys and then accessing the middle element. As @mandeepsandhu has suggested might be the case, the STL map and SmallFlatMap seem pretty close to each other up to about 16 elements, but then quickly diverge. Under 16 elements, there's definitely some benefit, but <10 nsec (strong possibility this is measurement noise / timing resolution error). Maybe there would be some additional perf wins if we ordered the tag map by frequency of use within Valhalla.
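
The real code is in the linked commit; as a rough sketch of the shape of such an access benchmark (note the key is built inside the timed loop here, which comes up below):

#include <benchmark/benchmark.h>
#include <cstdint>
#include <string>
#include <unordered_map>

static void BM_STLUnorderedMapAccess(benchmark::State& state) {
  std::unordered_map<std::string, std::string> map;
  for (int64_t i = 0; i < state.range(0); ++i)
    map[std::to_string(i)] = std::to_string(i);
  for (auto _ : state) {
    // std::to_string() runs inside the timed loop - see the discussion
    // further down about hoisting it out.
    benchmark::DoNotOptimize(map[std::to_string(state.range(0) / 2)]);
  }
}
BENCHMARK(BM_STLUnorderedMapAccess)->RangeMultiplier(2)->Range(1, 1024);
BENCHMARK_MAIN();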

Here's the benchmark output:

$ ./bench/midgard/benchmark-flat_map
2020-05-04 18:30:21
Running ./bench/midgard/benchmark-flat_map
Run on (4 X 3500 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 2.57, 2.33, 2.63
------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations
------------------------------------------------------------------------
BM_VFlatMapAccess/1                 87.2 ns         86.0 ns      7669216
BM_VFlatMapAccess/2                 83.0 ns         82.5 ns      8142091
BM_VFlatMapAccess/4                 86.3 ns         85.2 ns      8546382
BM_VFlatMapAccess/8                 89.7 ns         88.4 ns      7869236
BM_VFlatMapAccess/16                91.1 ns         90.0 ns      7923213
BM_VFlatMapAccess/32                 108 ns          106 ns      6397485
BM_VFlatMapAccess/64                 125 ns          124 ns      5687219
BM_VFlatMapAccess/128                147 ns          146 ns      4806801
BM_VFlatMapAccess/256                276 ns          273 ns      2619584
BM_VFlatMapAccess/512                409 ns          403 ns      1715337
BM_VFlatMapAccess/1024               755 ns          726 ns      1044605
...
BM_STLUnorderedMapAccess/1          90.1 ns         89.5 ns      7853168
BM_STLUnorderedMapAccess/2          90.9 ns         90.7 ns      7914881
BM_STLUnorderedMapAccess/4          94.9 ns         94.8 ns      6541322
BM_STLUnorderedMapAccess/8          96.5 ns         96.3 ns      7318195
BM_STLUnorderedMapAccess/16         97.2 ns         96.9 ns      7243752
BM_STLUnorderedMapAccess/32          101 ns          101 ns      6901311
BM_STLUnorderedMapAccess/64         99.3 ns         99.1 ns      6996642
BM_STLUnorderedMapAccess/128        99.5 ns         99.1 ns      6832935
BM_STLUnorderedMapAccess/256         111 ns          111 ns      6177197
BM_STLUnorderedMapAccess/512         102 ns          102 ns      6613944
BM_STLUnorderedMapAccess/1024        113 ns          112 ns      6157960

Happy to update this if I did something wrong (see the linked commit)! Entirely possible I made a benchmarking mistake, or that my computer isn't reflective of reality.

/cc @kevinkreiser

Member Author

@danpat danpat May 5, 2020

@mookerji The main gains in this PR come from the changes to IsNotThruEdge - probably 90% of the improvement is there. The SmallFlatMap shaves a few additional seconds off my test runs.

However, at least running locally, the improvement from SmallFlatMap is pretty consistent: it cuts 5-10 seconds off a 150-160 second test run (I'm running valhalla_build_tiles with a 100MB OSM PBF file of Bulgaria). I wonder if it behaves better under cache pressure than unordered_map - do you know if there's a pattern for applying cache pressure in Google Benchmark? Clobbering the L1/L2 cache each iteration might better match all the other work the real code is doing.

A couple of other things that are worth considering:

  1. What do the insertion benchmarks look like? I don't think calls to [] dominate the usage of the class; lots of these objects are built and destroyed as well.
  2. What does range iteration look like - for (const auto& element : tags) {} in a benchmark? We loop over the complete tagset in quite a few places, so it could be that the main gains come from that rather than only from operator[].

Overall, SmallFlatMap is a minor piece of this PR - the main benefit comes from the other changes.

Member Author

@danpat danpat May 5, 2020

Oh, one other thing - you probably want to move the std::to_string() calls out of the hot loop in those benchmarks - they'll dominate timings, particularly for the smaller values. I quickly did that and now get this for lookups:

BM_VFlatMapAccess/1                 4.80 ns         4.77 ns    145741334
BM_VFlatMapAccess/2                 4.79 ns         4.76 ns    144731864
BM_VFlatMapAccess/4                 6.20 ns         6.17 ns    101122459
BM_VFlatMapAccess/8                 9.59 ns         9.44 ns     75012323
BM_VFlatMapAccess/16                12.7 ns         12.7 ns     54992537
BM_VFlatMapAccess/32                27.7 ns         27.6 ns     25056825
BM_VFlatMapAccess/64                60.5 ns         60.3 ns     11405295
BM_VFlatMapAccess/128               86.3 ns         86.0 ns      8018695
BM_VFlatMapAccess/256                243 ns          241 ns      2785393
BM_VFlatMapAccess/512                407 ns          402 ns      1770417
BM_VFlatMapAccess/1024               755 ns          748 ns       916110
BM_STLUnorderedMapAccess/1          9.93 ns         9.89 ns     69545865
BM_STLUnorderedMapAccess/2          10.0 ns         9.98 ns     70417576
BM_STLUnorderedMapAccess/4          16.5 ns         16.5 ns     42466452
BM_STLUnorderedMapAccess/8          16.5 ns         16.4 ns     42089567
BM_STLUnorderedMapAccess/16         16.6 ns         16.5 ns     42153946
BM_STLUnorderedMapAccess/32         17.5 ns         17.5 ns     39942939
BM_STLUnorderedMapAccess/64         17.6 ns         17.5 ns     39955934
BM_STLUnorderedMapAccess/128        18.1 ns         17.9 ns     39609787
BM_STLUnorderedMapAccess/256        23.8 ns         23.7 ns     29252352
BM_STLUnorderedMapAccess/512        18.9 ns         18.7 ns     38226508
BM_STLUnorderedMapAccess/1024       24.0 ns         23.9 ns     29206217

In percentage terms, a clear win IMO. In absolute terms, I do struggle to see how we'd be making enough of these calls for a 7ns (my ballpark for the gain at 3 entries per map) improvement to manifest in a macro improvement to the process runtime. Maybe this map is being used even more than I thought. We have to make something like 150 million calls to operator[] for 1 second to be gained....
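
The hoist itself, sketched against the benchmark shape above (hypothetical names; SmallFlatMap as sketched earlier):

static void BM_VFlatMapAccess(benchmark::State& state) {
  SmallFlatMap<std::string, std::string> map;
  for (int64_t i = 0; i < state.range(0); ++i)
    map[std::to_string(i)] = std::to_string(i);
  const std::string key = std::to_string(state.range(0) / 2); // hoisted out
  for (auto _ : state)
    benchmark::DoNotOptimize(map[key]); // only the lookup is timed now
}
BENCHMARK(BM_VFlatMapAccess)->RangeMultiplier(2)->Range(1, 1024);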

Member Author

Another interesting observation - we have a lot of calls to .find() on this object type. I added a benchmark to your script that compares that. I wasn't expecting much difference from operator[], but:

BM_VFlatMapFind/1                   2.97 ns         2.95 ns    236505663
BM_VFlatMapFind/2                   2.95 ns         2.94 ns    238443170
BM_VFlatMapFind/4                   4.79 ns         4.77 ns    146934115
BM_VFlatMapFind/8                   9.07 ns         9.04 ns     76705604
BM_VFlatMapFind/16                  11.5 ns         11.4 ns     60524313
BM_VFlatMapFind/32                  35.4 ns         35.3 ns     19926103
BM_VFlatMapFind/64                  80.6 ns         79.7 ns      6927812
BM_VFlatMapFind/128                  115 ns          115 ns      5988280
BM_VFlatMapFind/256                  316 ns          314 ns      2244036
BM_VFlatMapFind/512                  572 ns          570 ns      1182033
BM_VFlatMapFind/1024                1116 ns         1108 ns       634173
BM_STLUnorderedMapFind/1            10.1 ns         9.80 ns     65048508
BM_STLUnorderedMapFind/2            9.69 ns         9.62 ns     71446068
BM_STLUnorderedMapFind/4            16.1 ns         16.0 ns     43037726
BM_STLUnorderedMapFind/8            16.1 ns         16.0 ns     43643074
BM_STLUnorderedMapFind/16           16.1 ns         16.0 ns     43600396
BM_STLUnorderedMapFind/32           17.1 ns         16.9 ns     41560292
BM_STLUnorderedMapFind/64           16.8 ns         16.8 ns     41083670
BM_STLUnorderedMapFind/128          16.8 ns         16.8 ns     41179862
BM_STLUnorderedMapFind/256          24.1 ns         23.5 ns     30205093
BM_STLUnorderedMapFind/512          17.7 ns         17.5 ns     39801674
BM_STLUnorderedMapFind/1024         23.4 ns         23.3 ns     30081392

Again, these numbers are so small I struggle to see how they'd have noticeable impact, but they do seem to....

Contributor

@mookerji mookerji May 5, 2020

re: std::to_string(). That's fair, but that string conversion will be a common factor between comparisons of a given size. You could also pre-prepare the benchmark data across all benchmark cases, but my point was to show relative performance between maps, not an absolute benchmark (and also not to overthink it).

What was the typical key space size on one of these maps again? https://taginfo.openstreetmap.org/reports/key_lengths#histogram makes it seem like there could be a lot of them, but I'm not familiar enough (yet) with how this particular map is used.

Addressing your other points:

  • Yes, there's a thing to flush to memory: benchmark::ClobberMemory(). I haven't used it (see the sketch after the numbers below).
  • Benchmarking a key-space scan makes sense, but I expect we'd see the same performance story as with these other benchmarks.
  • I also added insertion benchmarking, but didn't include those numbers because I thought (perhaps incorrectly) that our workloads were read-intensive rather than write-intensive. I also left out the absl benchmarks, which seem to be better than STL for 'large' maps. Anyway, here they are:
BM_VFlatMapInsert/1                  207 ns          201 ns      3050016
BM_VFlatMapInsert/2                  416 ns          398 ns      1565334
BM_VFlatMapInsert/4                  822 ns          804 ns       741541
BM_VFlatMapInsert/8                 1624 ns         1594 ns       438866
BM_VFlatMapInsert/16                3510 ns         3414 ns       208204
BM_VFlatMapInsert/32                8127 ns         7509 ns        90333
BM_VFlatMapInsert/64               18418 ns        16659 ns        42500
BM_VFlatMapInsert/128              38339 ns        35216 ns        18561
BM_VFlatMapInsert/256              94164 ns        92460 ns         7767
BM_VFlatMapInsert/512             283171 ns       264208 ns         2768
BM_VFlatMapInsert/1024            925983 ns       900776 ns          782
BM_STLUnorderedMapInsert/1           245 ns          197 ns      3766722
BM_STLUnorderedMapInsert/2           373 ns          368 ns      1891462
BM_STLUnorderedMapInsert/4           821 ns          797 ns       874497
BM_STLUnorderedMapInsert/8          1579 ns         1556 ns       446588
BM_STLUnorderedMapInsert/16         3873 ns         3576 ns       191874
BM_STLUnorderedMapInsert/32         7139 ns         6801 ns       104181
BM_STLUnorderedMapInsert/64        13956 ns        13749 ns        48900
BM_STLUnorderedMapInsert/128       26959 ns        26627 ns        26020
BM_STLUnorderedMapInsert/256       55176 ns        54341 ns        13137
BM_STLUnorderedMapInsert/512      117416 ns       115000 ns         6232
BM_STLUnorderedMapInsert/1024     241274 ns       236532 ns         3040
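
For completeness, a sketch of how benchmark::ClobberMemory() could be combined with deliberate cache clobbering between iterations (assumptions: an 8MB scratch buffer exceeds L2/L3 on the test machine; the eviction pass dominates each iteration, so in practice you'd time a batch of lookups per iteration):

#include <benchmark/benchmark.h>
#include <vector>

static void BM_LookupUnderCachePressure(benchmark::State& state) {
  // ... build the map under test here ...
  std::vector<char> scratch(8 * 1024 * 1024);
  for (auto _ : state) {
    // Touch one byte per cache line to evict the map from cache, then
    // force the writes to be observable so the loop isn't elided.
    for (std::size_t i = 0; i < scratch.size(); i += 64)
      scratch[i]++;
    benchmark::ClobberMemory();
    // ... perform and DoNotOptimize() the lookup(s) under test ...
  }
}
BENCHMARK(BM_LookupUnderCachePressure);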

Contributor

std::unordered_map is implemented as linked-list buckets, which usually means lookups are almost always cache misses, as the values don't reside near the map itself in memory. This is why there are so many "better" implementations floating around, the first thing most of them do is switch to open addressing to avoid the cache miss problem.

My comment was poorly worded; I meant the hashed keys (buckets) will fit in cache. But you're right that any attempt to access the value or resolve a collision would go to main memory due to pointer indirection.

I don't have direct data to answer your second question, but the napkin math is as follows:

  1. Most OSM keys/values are short - just a few chars. This means they'll usually take advantage of small string optimization - i.e. the string chars will be stored inside the std::string object itself, not in a buffer pointed to by the object.

A string comparison requires comparing all the characters, as opposed to looking up an array index - although the hash approach has the added overhead of computing a hash of the string (I've heard Murmur or FNV-1 are good for hashing strings and produce an even distribution). At this point I don't know which one would be faster, TBH. It's best to profile this and find out.
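
(For reference, FNV-1a is only a few lines - a sketch of the 64-bit variant:

#include <cstdint>
#include <string>

std::uint64_t fnv1a_64(const std::string& s) {
  std::uint64_t h = 14695981039346656037ULL; // FNV offset basis
  for (const unsigned char c : s) {
    h ^= c;
    h *= 1099511628211ULL; // FNV prime
  }
  return h;
}

Murmur processes several bytes per step, so it tends to win on longer keys, but for the few-character keys in question the difference is likely small.)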

  2. std::pair is basically just a fancy wrapper around a struct with two data members.
  3. Combining (1) and (2) in a std::vector means key/values get stored contiguously in memory, using just a few bytes each.
  4. If you read down http://norvig.com/21-days.html#answers you'll see that an L1 cache hit is about 0.5ns, and a main memory lookup (cache miss) is about 100ns. This means that you can do about 200 L1 operations in the time a single main memory access is done.

Based on the above, the back-of-the-napkin math tells me it's probably somewhere on the order of tens to maybe low hundreds of key comparisons before it's better to use a hash to jump directly to a memory page that's not already in L1 cache.

OSM averages 2.3 tags per way: https://taginfo.openstreetmap.org/reports/database_statistics

Many of the replacement hashmaps use open-addressing for this reason - jump to a hash spot, then linear scan in a cache-friendly way. This beats out things like cuckoo-hash which avoids hash collisions, but ends up jumping all over the place causing cache misses.

I agree open addressing is much more cache-friendly (and consequently better performing), but it suffers at high load factors due to the need for rehashing (which is probably a non-issue here, since we're storing few keys).

It's very specific to our known use-case. If someone lumps thousands of very long key/values into OSM, then this approach won't perform.

See also https://github.com/osmcode/libosmium/blob/2709257c3a61c121533c7b465bfc2b2f9661d7f0/include/osmium/osm/tag.hpp#L121-L141 - I've spoken to @joto about this in the past and he's confident in the approach for fast handling of OSM tag/value pairs with typical OSM-like data.

Member Author

Ok - so it looks like the evidence shows that the improvement from the SmallFlatMap is marginal at best. I've pulled it from this PR.

danpat added 5 commits May 5, 2020 08:00
… case, and a bigger one isn't faster than unordered_set.find() for this case.
…e are usually few pairs, and the values are small, so linear search in L1 cache combined with short-string-optimization will typically outperform std::unordered_map with its fragmented memory layout."

This reverts commit 134f145.
mandeepsandhu
mandeepsandhu previously approved these changes May 5, 2020
Contributor

@mandeepsandhu mandeepsandhu left a comment

LGTM. The use of a bloom filter to short-circuit a cache miss is quite clever, @danpat!

And thanks to @mookerji & you for the various benchmarks to visualize the improvements! There are some good learnings for me in this PR! 👍

kevinkreiser
kevinkreiser previously approved these changes May 5, 2020
@kevinkreiser
Member

kevinkreiser commented May 5, 2020

at first it's a bit unintuitive because doing vector::find a lot can be slow, but it's important to note that the function starts on a low-class road and does an expansion only on adjacent low-class roads. and importantly, the function bails early if it encounters a high-class road without going back through the node it started on. the only time this will be slow is if there is a large area of low-class roads but no intersecting higher-class roads. it will be interesting to see what this does on the planet.

nevermind, i saw visitedset and thought it was expandset. now i see the change just skips removing the already-visited nodes from the expandset, which is quite straightforward

@danpat
Member Author

danpat commented May 5, 2020

@mandeepsandhu I ended up pulling the bloom filter out. I realized I hadn't sized it right to ensure a low false-positive rate - 64 bits isn't much space to keep the collision rate low across the couple of thousand nodes we'd usually insert. The initial version was probably not actually preventing very many lookups.

I re-wrote it locally to use a much larger bloom filter - more like 2k bits using std::vector<bool> - but there was no appreciable impact on runtime for our use. I just stripped it out to keep the code simpler. The big gain really comes down to using a std::vector as a simple queue for the expandset, rather than a std::unordered_set; the unordered_set was suffering from lots of .erase()/.insert() operations.
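
For what it's worth, the standard bloom-filter estimate backs this up: with m bits, n inserted keys, and k hash probes, the false-positive rate is roughly (1 - e^(-kn/m))^k. At m = 64 and n ≈ 2000, even k = 1 gives 1 - e^(-31) ≈ 1 - the filter saturates and rejects almost nothing. Even at m = 2048 bits with k = 1 it's still 1 - e^(-2000/2048) ≈ 0.62, which fits the lack of measurable impact.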

@danpat danpat dismissed stale reviews from kevinkreiser and mandeepsandhu via 723ca48 May 5, 2020 18:00
@mandeepsandhu
Contributor

@mandeepsandhu I ended up pulling the bloom filter out. I realized I hadn't sized it right to ensure a low false-positive rate - 64 bits isn't much space to keep the collision rate low across the couple of thousand nodes we'd usually insert. The initial version was probably not actually preventing very many lookups.

Lol, ok! When I was writing the comment, the diff still showed the bloom filter in use (and you had removed the flat map).

I re-wrote it locally to use a much larger bloom filter - more like 2k bits using std::vector<bool> - but there was no appreciable impact on runtime for our use. I just stripped it out to keep the code simpler. The big gain really comes down to using a std::vector as a simple queue for the expandset, rather than a std::unordered_set; the unordered_set was suffering from lots of .erase()/.insert() operations.

Okay, that's good to know. The simpler the better. And either way, this was a good learning exercise!

@danpat danpat merged commit 5dbd7c0 into master May 5, 2020
@danpat danpat deleted the danpat_parallel_hashmap_tilebuilding branch May 5, 2020 18:34