use protozero #623

Merged · 72 commits into systemed:master · Dec 28, 2023

Conversation

cldellow
Contributor

This is meant to be merged after #618. Until then, you can see the protozero-specific diffs here: https://github.com/cldellow/tilemaker/compare/planet-on-32gb...cldellow:protozero?expand=1

As mentioned in that PR, I'm just queuing up the PRs - no urgency to review/merge them.

Use protozero instead of libprotobuf for reading OSM PBFs. We still use libprotobuf for reading/writing vector tiles.

PbfReader exposes an API very similar to the one produced by libprotobuf for osmformat.proto. There are some differences:

  1. Better names than libprotobuf's mechanically produced ones, exposed via range-based for loops, e.g.:

```cpp
for (auto& group : block.groups()) {
  for (auto& node : group.nodes()) { ... }
  for (auto& way : group.ways()) { ... }
  for (auto& rel : group.relations()) { ... }
}
```
  2. We avoid memory copies, which means the lifetime of objects returned from PbfReader is limited. You cannot nest iteration, so code such as this is invalid:

```cpp
for (auto& node1 : group.nodes()) {
  for (auto& node2 : group.nodes()) {
    // compare node1 and node2
  }
}
```
  3. PbfReader handles delta-decoding, so read_pbf's methods no longer need to maintain the running sums.
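
As an illustration of what delta-decoding means here (a generic sketch, not PbfReader's actual internals): packed IDs and coordinates in a PBF PrimitiveGroup are stored as differences from the previous value, and the reader keeps the running sum so callers only ever see absolute values.

```cpp
#include <cstdint>
#include <vector>

// Generic delta-decoding: each stored value is an offset from the previous one.
std::vector<int64_t> deltaDecode(const std::vector<int64_t>& deltas) {
    std::vector<int64_t> absolute;
    absolute.reserve(deltas.size());
    int64_t running = 0;
    for (int64_t d : deltas) {
        running += d;
        absolute.push_back(running);
    }
    return absolute;
}
```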

Also some other, unrelated things I noticed:

  • tweak decompress_string so the caller provides the output buffer. Re-use the output buffer to minimize allocations and copying (see the sketch below)
  • fix calculateCentroid for multipolygons - it wasn't using the cached multipolygon
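
For the decompress_string change, the shape of the API could look roughly like this (a hedged sketch using zlib directly; the actual signature and error handling in tilemaker may differ):

```cpp
#include <stdexcept>
#include <string>
#include <zlib.h>

// The caller owns `output` and passes the same string on every call, so its capacity
// grows once and is then re-used, avoiding a fresh allocation per block.
void decompress_string(std::string& output, const char* input, size_t inputSize) {
    z_stream zs{};
    if (inflateInit2(&zs, 32 + MAX_WBITS) != Z_OK)   // accept zlib or gzip headers
        throw std::runtime_error("inflateInit2 failed");
    zs.next_in = reinterpret_cast<Bytef*>(const_cast<char*>(input));
    zs.avail_in = static_cast<uInt>(inputSize);

    output.clear();                                   // keeps existing capacity
    char buffer[65536];
    int ret;
    do {
        zs.next_out = reinterpret_cast<Bytef*>(buffer);
        zs.avail_out = sizeof(buffer);
        ret = inflate(&zs, Z_NO_FLUSH);
        output.append(buffer, sizeof(buffer) - zs.avail_out);
    } while (ret == Z_OK);

    inflateEnd(&zs);
    if (ret != Z_STREAM_END)
        throw std::runtime_error("decompression failed");
}
```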

Memory use: I had hoped the memory savings would scale with the input PBF, but it appears to be more like a fixed savings of 250MB. Ah, well.

Timings to read the PBF, planet-on-32gb vs this branch:

france, planet-on-32gb: 2m46s vs 2m28s (-11%)
great-britain, planet-on-32gb: 58s vs 52s (-10%)

Questions:

  1. I used PbfReader as the class name for the low-level PBF reader that sits on top of protozero. After implementing it and getting its tests passing, I tried to build the whole project...and realized that read_pbf.cpp uses this name already.

I just renamed read_pbf's class to PbfProcessor, in an attempt to capture that it's a bit higher-level. I'm happy to rename pbf_reader's class, though.

Any preferences? My default would be to keep the names as they are, and rename read_pbf.{h,cpp} to pbf_processor.{h,cpp}

For the planet, we need 1.3B output objects, 12 bytes per, so ~15GB
of RAM.
For GB, ~0.3% of objects are visible at low zooms.

I noticed in previous planet runs that fetching the objects for tiles in
the low zooms was quite slow - I think it's because we're scanning 1.3B
objects each time, only to discard most of them. Now we'll only be
scanning ~4M objects per tile, which is still an absurd number, but
should mitigate most of the speed issue without having to properly
index things.

This will also help us maintain performance for memory-constrained
users, as we won't be scanning all 15GB of data on disk, just a smaller
~45MB chunk.
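
Purely as an illustration of the structure described above (names and types are hypothetical, not tilemaker's): the low-zoom objects can live in small per-z6-tile lists, so rendering a low-zoom tile only scans the relevant chunk rather than all 1.3B output objects.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using OutputObjectID = uint64_t;
using Z6Tile = std::pair<uint32_t, uint32_t>;   // x, y at zoom 6

// Hypothetical index: objects visible at low zooms, grouped by the z6 tile they fall in.
std::map<Z6Tile, std::vector<OutputObjectID>> lowZoomIndex;

void indexLowZoomObject(Z6Tile tile, OutputObjectID id) {
    lowZoomIndex[tile].push_back(id);
}

const std::vector<OutputObjectID>& objectsForLowZoomTile(Z6Tile tile) {
    static const std::vector<OutputObjectID> empty;
    auto it = lowZoomIndex.find(tile);
    return it == lowZoomIndex.end() ? empty : it->second;
}
```
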
For Points stored via Layer(...) calls, store the node ID in the
OSM store, unless `--materialize-geometries` is present.

This saves ~200MB of RAM for North America, so perhaps 1 GB for the
planet if NA has similar characteristics to the planet as a whole.

Also fix the OSM_ID(...) macro - it was lopping off many more bits
than needed, due to some previous experiments. Now that we want to track
nodes, we need at least 34 bits.

This may pose a problem down the road when we try to address thrashing.
The mechanism I hoped to use was to divide the OSM stores into multiple
stores covering different low zoom tiles. Ideally, we'd be able to
recall which store to look in -- but we only have 36 bits, we need 34
to store the Node ID, so that leaves us with 1.5 bits => can divide into
3 stores.

Since the node store for the planet is 44GB, dividing into 3 stores
doesn't give us very much headroom on a 32 GB box. Ah well, we can
sort this out later.
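
To make the bit budget concrete (constants here are illustrative, not the exact macro in the code): the packed ID field keeps only the low 36 bits, and node IDs need at least 34 of them.

```cpp
#include <cstdint>

// Illustrative only: pack an OSM ID into a 36-bit field, as described above.
constexpr uint64_t ID_BITS     = 36;
constexpr uint64_t OSM_ID_MASK = (1ULL << ID_BITS) - 1;

// Current node IDs exceed 2^33, so at least 34 bits are needed for the ID itself,
// leaving very little of the 36 bits to tag which sub-store an object lives in.
constexpr uint64_t packOsmId(uint64_t rawId) { return rawId & OSM_ID_MASK; }
```
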
On g++, this reduces the size from 48 bytes to 34 bytes.

There aren't _that_ many attribute pairs, even on the planet scale, but
this plus a better encoding of string attributes might save us ~2GB at
the planet level, which is meaningful for a 32GB box
Not used by anything yet. Given Tilemaker's limited needs, we can get
away with a stripped-down string class that is less flexible than
std::string, in exchange for memory savings.

The key benefits - 16 bytes, not 32 bytes (g++) or 24 bytes (clang).

When it does allocate (for strings longer than 15 bytes), it allocates
from a pool so there's less per-allocation overhead.
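
A minimal sketch of the shape of such a class (not the actual implementation; the pool here is just a global string arena): 16 bytes total, strings of 15 bytes or fewer stored inline, longer ones referenced by a 32-bit offset and length into the pool.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

class TinyString {
    // data_[15] holds the inline length (0..15), or 0xFF to mean "stored in the pool".
    uint8_t data_[16];
    static std::string pool;   // stand-in for a real pooled allocator
public:
    explicit TinyString(const std::string& s) {
        if (s.size() <= 15) {
            std::memcpy(data_, s.data(), s.size());
            data_[15] = static_cast<uint8_t>(s.size());
        } else {
            uint32_t offset = static_cast<uint32_t>(pool.size());
            uint32_t length = static_cast<uint32_t>(s.size());
            pool.append(s);
            std::memcpy(data_, &offset, sizeof(offset));
            std::memcpy(data_ + 4, &length, sizeof(length));
            data_[15] = 0xFF;
        }
    }
    std::string str() const {
        if (data_[15] != 0xFF)
            return std::string(reinterpret_cast<const char*>(data_), data_[15]);
        uint32_t offset, length;
        std::memcpy(&offset, data_, sizeof(offset));
        std::memcpy(&length, data_ + 4, sizeof(length));
        return pool.substr(offset, length);
    }
};
std::string TinyString::pool;

static_assert(sizeof(TinyString) == 16, "fits the 16-byte target");
```
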
...I'm going to replace the string implementation, so let's have some
backstop to make sure I don't break things
Break dependency on AttributePair, just work on std::string
...this will be useful for doing map lookups when testing if an
AttributePair has already been created with the given value.
AttributePair has now been trimmed from 48 bytes to 18 bytes. There are
40M AttributeSets for the planet. That suggests there's probably ~30M AttributePairs,
so hopefully this is a savings of ~900MB at the planet level.

Runtime doesn't seem affected.

There's a further opportunity for savings if we can make more strings
qualify for the short string optimization. Only about 40% of strings
fit in the 15 byte short string optimization.

Of the remaining 60%, many are Latin-alphabet title cased strings like
`Wellington Avenue` -- this could be encoded using 5 bits per letter,
saving us an allocation.

Even in the most optimistic case where:

- there are 30M AttributePairs
- of these, 90% are strings (= 27M)
- of these, 60% don't fit in SSO (= 16M)
- of these, we can make 100% fit in SSO

...we only save about 256MB at the planet level, but at some significant
complexity cost. So probably not worth pursuing at the moment.
When doing the planet, especially on a box with limited memory, there
are long periods with no output. Show some output so the user doesn't
think things are hung.

This also might be useful in detecting perf regressions more granularly.
When using --store, deque is nice because growing doesn't require
invalidating the old storage and copying it to a new location.

However, it's also bad, because deque allocates in 512-byte chunks,
which causes each 4KB OS page to have data from different z6 tiles.

Instead, use our own container that tries to get the best of both worlds.

Writing a random access iterator is new for me, so I don't trust this
code that much. The saving grace is that the container is very limited,
so errors in the iterator implementation may not get exercised in
practice.
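
The idea, roughly (a simplified sketch, not the class added in this commit; the chunk size is illustrative): allocate in large fixed-size chunks that never move, so existing elements keep their addresses like a deque, but each chunk is big enough that an OS page holds data from only one region.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

template <class T, std::size_t ChunkBytes = 65536>
class ChunkedVector {
    static constexpr std::size_t PerChunk = ChunkBytes / sizeof(T);
    std::vector<std::unique_ptr<T[]>> chunks;   // chunks never move once allocated
    std::size_t count = 0;
public:
    void push_back(const T& value) {
        if (count == chunks.size() * PerChunk)
            chunks.push_back(std::make_unique<T[]>(PerChunk));
        chunks[count / PerChunk][count % PerChunk] = value;
        ++count;
    }
    T& operator[](std::size_t i) { return chunks[i / PerChunk][i % PerChunk]; }
    const T& operator[](std::size_t i) const { return chunks[i / PerChunk][i % PerChunk]; }
    std::size_t size() const { return count; }
    // A random-access iterator over (chunk, offset) pairs is what the real class adds.
};
```
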
This adds three methods to the stores:

- `shard()` returns which shard this store represents
- `shards()` returns how many shards total
- `contains(shard, id)` returns whether or not shard N has an item with
  id X

SortedNodeStore/SortedWayStore are not implemented yet, that'll come in
a future commit.

This will allow us to create a `ShardedNodeStore` and `ShardedWayStore`
that contain N stores. We will try to ensure that each store has data
that is geographically close to each other.

Then, when reading, we'll do multiple passes of the PBF to populate each store.
This should let us reduce the working set used to populate the stores,
at the cost of additional linear scans of the PBF. Linear scans of disk
are much less painful than random scans, so that should be a good trade.
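
In interface terms, the three methods would look something like this (illustrative signatures; the actual store classes in tilemaker differ):

```cpp
#include <cstddef>
#include <cstdint>

using NodeID = uint64_t;

class ShardedStoreInterface {
public:
    virtual ~ShardedStoreInterface() = default;
    virtual std::size_t shard() const = 0;    // which shard this store instance is
    virtual std::size_t shards() const = 0;   // how many shards exist in total
    // whether shard `n` holds an item with the given id
    virtual bool contains(std::size_t n, NodeID id) const = 0;
};
```
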
I'm going to rejig the innards of this class, so let's have some tests.
In order to shard the stores, we need to have multiple instances
of the class.

Two things block this currently: atomics at file-level, and
thread-locals.

Moving the atomics to the class is easy.

Making the thread-locals per-class will require an approach similar
to that adopted in
https://github.com/systemed/tilemaker/blob/52b62dfbd5b6f8e4feb6cad4e3de86ba27874b3a/include/leased_store.h#L48,
where we have a container that tracks the per-class data.
Still only supports 1 class, but this is a step along the path.
D'oh, this "worked" due to two bugs cancelling each other:

(a) the code to find things in the low zoom list never found anything,
    because it assumed a base z6 tile of 0/0

(b) we weren't returning early, so the normal code still ran

Rejigged to actually do what I was intending
Do a single pass,  rather than one pass per zoom.
This distributes nodes into one of 8 shards, trying to roughly group
parts of the globe by complexity.

This should help with locality when writing tiles.

A future commit will add a ShardedWayStore and teach read_pbf to read in
a locality-aware manner, which should help when reading ways.
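
As a rough illustration of "distribute nodes by where they are on the globe" (the mapping below is hypothetical; the real shard assignment groups z6 tiles by complexity rather than hashing):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Web-Mercator z6 tile coordinates for a lon/lat pair.
uint32_t z6TileX(double lon) {
    return static_cast<uint32_t>((lon + 180.0) / 360.0 * 64.0);
}
uint32_t z6TileY(double lat) {
    const double pi = 3.14159265358979323846;
    double latRad = lat * pi / 180.0;
    return static_cast<uint32_t>((1.0 - std::asinh(std::tan(latRad)) / pi) / 2.0 * 64.0);
}

// Hypothetical: map the z6 tile to one of 8 shards. The real code picks groups of
// tiles so each shard gets a comparable amount of data.
std::size_t shardForNode(double lon, double lat, std::size_t shardCount = 8) {
    uint32_t tile = z6TileY(lat) * 64 + z6TileX(lon);
    return tile % shardCount;
}
```
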
We'd like to have different defaults based on whether `--store` is
present. Now that option parsing will have some more complex logic,
let's pull it into its own class so it can be more easily tested.
This has no performance impact as we never put anything in the 7th
shard, and so we skip doing the 7th pass in the ReadPhase::Ways and
ReadPhase::Relations phases.

The benefit is only to avoid emitting a noisy log about how the 7th store
has 0 entries in it.

Timings with 6 shards on Vultr's 16-core machine here: https://gist.github.com/cldellow/77991eb4074f6a0f31766cf901659efb

The new peak memory is ~12.2GB.

I am a little perplexed -- the runtime on a 16-core server was
previously:

```
$ time tilemaker --store /tmp/store --input planet-latest.osm.pbf --output tiles.mbtiles --shard-stores
real	195m7.819s
user	2473m52.322s
sys	73m13.116s
```

But with the most recent commits on this branch, it was:

```
real	118m50.098s
user	1531m13.026s
sys	34m7.252s
```

This is incredibly suspicious. I also tried re-running commit
bbf0957, and got:

```
real	123m15.534s
user	1546m25.196s
sys	38m17.093s
```

...so I can't explain why the earlier runs took 195 min.

Ideas:

- the planet changed between runs, and a horribly broken geometry was
  fixed

- Vultr gives quite different machines for the same class of server

- perhaps most likely: I failed to click "CPU-optimized" when picking
  the earlier server, and got a slow machine the first time, and a fast
  machine the second time. I'm pretty sure I paid the same $, so I'm
  not sure I believe this.

I don't think I really believe that a 33% reduction in runtime is
explained by any of those, though. Anyway, just another thing to
be befuddled by.
I did some experiments on a Hetzner 48-core box with 192GB of RAM:

--store, materialize geometries:

```
real 65m34.327s
user 2297m50.204s
sys 65m0.901s
```

The process often failed to use 100% of CPU--if you naively divide
user+sys/real you get ~36, whereas the ideal would be ~48.

Looking at stack traces, it seemed to coincide with calls to Boost's
rbtree_best_fit allocator.

Maybe:

- we're doing disk I/O, and it's just slower than recomputing the geometries
- we're using the Boost mmap library suboptimally -- maybe there's
  some other allocator we could be using. I think we use the mmap
  allocator like a simple bump allocator, so I don't know why we'd need
  a red-black tree

--store, lazy geometries:

```
real 55m33.979s
user 2386m27.294s
sys 23m58.973s
```

Faster, but still some overhead (user+sys/real => ~43)

no --store, materialize geometries: OOM

no --store, lazy geometries (used 175GB):

```
real 51m27.779s
user 2306m25.309s
sys 16m34.289s
```

This was almost 100% CPU (user+sys/real => ~45)

From this, I infer:

- `--store` should always default to lazy geometries in order to
  minimize the I/O burden

- `--materialize-geometries` is a good default for non-store usage,
  but it's still useful to be able to override and use lazy geometries,
  if it then means you can fit the data entirely in memory
@cldellow mentioned this pull request Dec 26, 2023
@cldellow marked this pull request as draft December 26, 2023 05:11
@systemed
Owner

Looks excellent (and the code is neater than before)!

> I just renamed read_pbf's class to PbfProcessor, in an attempt to capture that it's a bit higher-level. I'm happy to rename pbf_reader's class, though.
> Any preferences? My default would be to keep the names as they are, and rename read_pbf.{h,cpp} to pbf_processor.{h,cpp}

100% happy with PbfProcessor.

> The NA case is heavily affected by the Hudson Bay relation being a straggler.

I'm wondering about this. Is this relation 9441240? None of the tags match (we no longer look for natural=bay since #611) so we shouldn't be assembling the geometry for Hudson Bay. But I've not traced it through.

@cldellow
Contributor Author

> 100% happy with PbfProcessor.

Done! (...now I see that this loses the symmetry between read_pbf and read_shp. Hmmm.)

> The NA case is heavily affected by the Hudson Bay relation being a straggler.

> I'm wondering about this. Is this relation 9441240? None of the tags match (we no longer look for natural=bay since #611) so we shouldn't be assembling the geometry for Hudson Bay. But I've not traced it through.

Oops, this was an assumption & error on my part. You're absolutely right--it's not Hudson Bay, it's Lake Huron (1205151). In another branch, I dug a bit further.

Lake Huron is the long pole for relations - 120s. The whole read phase takes only 137s, so it really hurts us.

The thrust of my experiments in that branch was to find out if we could save time by starting Lake Huron (and other slow relations) earlier--we control the order in which we iterate the blocks, after all. But at 120 seconds, it dominates the next-slowest relation (Lake Superior, 4039486) by a factor of roughly 10x. Thus, even if you immediately start processing Lake Huron in ReadPhase::Relations, you'll still be spending the majority of the phase waiting on a single core to finish Lake Huron.

I'm sure there's something that could be done there (optimize correction code? be able to parallelize correction at the per-relation level? ship a corrected Lake Huron?), but I was hoping for something relatively easy with no messy implications, so I gave up. :)

I suspect the most likely next source of perf wins will be vtzero and figuring out a way to reuse simplification results.

@systemed
Owner

> I'm sure there's something that could be done there (optimize correction code? be able to parallelize correction at the per-relation level? ship a corrected Lake Huron?), but I was hoping for something relatively easy with no messy implications, so I gave up. :)

Geometry algos like that are strictly "here be dragons" territory. ;) It would be interesting to profile where it's spending most of its time - but I'd not expect any quick wins.

(One possibility for the scenario where you're running tilemaker repeatedly over the same area, and there are complex but rarely-changing geometries like lakes, could be to cache corrected polygons between runs. But that's a lot of over-complication for a pretty niche use case!)

@cldellow
Contributor Author

> could be to cache corrected polygons between runs

That's a little reminiscent of #588. I agree, it's a big hammer for a relatively minor nit, and likely isn't a good trade-off in developer and end-user complexity. I probably overweight the impact of these things due to living in Ontario, which has several bonkers geometries that take > 10 seconds to process. (Contrast the UK, where I think the worst relation is the UK boundary itself, which takes 1 second.)

> Geometry algos like that are strictly "here be dragons" territory. ;) It would be interesting to profile where it's spending most of its time - but I'd not expect any quick wins.

I've poked around a little bit. It is indeed very daunting!

I loosely group the opportunities into three areas:

  1. Optimize the simplification process itself, maintaining the current result

This is very daunting to me. It resides at the nexus of knowledge of boost::geometry, spatial indexing strategies, insights about geometry generally, and historic issues with simplification (e.g. issues like #428 and #602). It is a field of rakes waiting to be stepped on.

  2. Apply C++ tricks

The stacks often show that time is spent in allocation/deallocation. Simplification seems to create many temporary objects that get discarded almost immediately. If we made boost use a bump allocator that re-used a fixed, long-lived buffer, I think we'd see some runtime improvement.

Given that the boost geometries work happily with the mmap allocator, I believe this ought to be possible. I think the impact on the code could be quite small--a few lines of code, mostly outside of the simplify code. It "just" needs someone who knows what they're doing when it comes to custom allocators, which isn't yet me.
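
One way this could look, as an assumption rather than anything tilemaker currently does: C++17's std::pmr::monotonic_buffer_resource is essentially a bump allocator over a long-lived buffer, and the remaining work would be threading such an allocator into the Boost.Geometry calls.

```cpp
#include <memory_resource>
#include <utility>
#include <vector>

using Point = std::pair<double, double>;

// Sketch only: a fixed, thread-local buffer reused across calls. Allocation is a pointer
// bump; release() resets the arena in O(1) instead of freeing objects one by one.
void correctGeometry(/* ...geometry arguments... */) {
    static thread_local char buffer[1 << 20];
    static thread_local std::pmr::monotonic_buffer_resource arena(buffer, sizeof(buffer));
    {
        std::pmr::vector<Point> scratch(&arena);   // temporaries bump-allocate from the arena
        scratch.reserve(1024);
        // ... build and discard intermediate rings/points here ...
    }
    arena.release();   // hand the whole buffer back for the next call
}
```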

  3. Cheat

When the problem in front of me is hard, I often try to ignore it and solve a different problem. Here, I think an approach like this could work:

A. For high zooms, write unsimplified geometries as usual.
B. For zooms where simplification is required, operate on a minimally pre-simplified geometry (say, a geometry that had been simplified to 0.0001).

The spirit of the idea is that we currently repeatedly simplify the full-fidelity base geometry to generate the low-zoom geometry. Most of the cost is in throwing out the first 80% of the points.

But once you're at, say, z10, maybe it's fine to use a lower-fidelity geometry as your input for simplification.
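
To make the two-stage idea concrete (a sketch under assumed tolerances, not the code in the prototype linked below):

```cpp
#include <boost/geometry.hpp>
#include <boost/geometry/geometries/linestring.hpp>
#include <boost/geometry/geometries/point_xy.hpp>

namespace bg = boost::geometry;
using Point = bg::model::d2::point_xy<double>;
using Linestring = bg::model::linestring<Point>;

// Done once per object: a cheap pass that throws away most of the points.
Linestring preSimplify(const Linestring& original) {
    Linestring out;
    bg::simplify(original, out, 0.0001);   // tolerance is illustrative
    return out;
}

// Done per low zoom: simplify the already-thinned geometry instead of the full one.
Linestring simplifyForZoom(const Linestring& preSimplified, double zoomTolerance) {
    Linestring out;
    bg::simplify(preSimplified, out, zoomTolerance);
    return out;
}
```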

I did a little prototyping of this idea in cldellow@146b692.

The prototyping was just to test the idea, not to integrate it into tilemaker. It seemed to show a dramatic improvement in runtime. It does produce different geometries; they look similar to those the current approach would generate, but since they're not identical, it's not yet clear to me whether this would be acceptable.

It also means that the geometry gets simplified before being converted to tile coordinates, so the issue fixed in #428 might crop up again.

@cldellow
Contributor Author

Oops, I somehow veered from geometry correction to geometry simplification in that comment.

That comment still stands, but is not relevant to the current discussion. :) (Unless, perhaps, the initial simplification pass could happen in the same pass as the correction pass. This would give the other cores something to do when they would otherwise be sitting idle.)

@cldellow marked this pull request as ready for review December 28, 2023 15:44
@systemed mentioned this pull request Dec 28, 2023
@systemed merged commit 12ed241 into systemed:master Dec 28, 2023
5 checks passed
@systemed
Owner

Another great one - thank you. Duly merged!

On the (side) issue of performance with large geometries:

> 1. Optimize the simplification process itself, maintaining the current result
> This is very daunting to me. It resides at the nexus of knowledge of boost::geometry, spatial indexing strategies, insights about geometry generally, and historic issues with simplification (e.g. issues like Scale before simplifying #428 and Avoid creating small artefacts when scaling #602). It is a field of rakes waiting to be stepped on.

Yep. I absolutely wouldn't suggest getting deep into the algorithms. There are a few places in correct.hpp where I suspect we may be spending a lot of time: the loop in dissolve_generate_rings, and I have historic suspicions about the most efficient way to call boost::geometry::union_ in result_combine. But even if we do look at this - and it's not an immediate priority - then I'd suggest concentrating on any aspects of the implementation that can be fine-tuned, rather than the algorithms themselves.

> 2. Apply C++ tricks
> The stacks often show that time is spent in allocation/deallocation. Simplification seems to create many temporary objects that get discarded almost immediately. If we made boost use a bump allocator that re-used a fixed, long-lived buffer, I think we'd see some runtime improvement.

👍

> 3. Cheat
> [...] I did a little prototyping of this idea in cldellow@146b692.
> Oops, I somehow veered from geometry correction to geometry simplification in that comment.

Interesting nonetheless! Broadly I think the principle of simplifying already-simplified geometries is sound: yes, they'll differ slightly in output, but the overall fidelity to the original should be similar.
