3.0 release planning #622

Closed · 18 tasks done
systemed opened this issue Dec 21, 2023 · 14 comments · Fixed by #643

Comments

@systemed
Owner

systemed commented Dec 21, 2023

We're not far off being able to put together a 3.0 release. 🎉

Changes merged into master:

Breaking changes to merge into v3 branch:

@cldellow - does this sound good to you or is there more you'd like to include?

@cldellow
Contributor

Exciting!

Depending on your timeline, there are two other things that seem promising:

  1. use protozero for PBF reading - the motivation for this is touched on in Alternative protobuf library? #621
  2. write a custom map class for AttributePair and AttributeSet

The motivation for the AttributePair/AttributeSet change: I think we can save ~600MB of RAM in a planet build. Between that and (1), I think we might get away with fewer, bigger splits when doing the planet on a constrained-memory build, so it's worth doing.

The idea is: we currently refer to AttributePairs and AttributeSets compactly, e.g. by a uint32_t that references their index in a deque<AttributePair>, rather than by a full pointer.

When reading a new attribute, we need to know if we've seen it already, and if so, what its index is. To do this, we have a boost::container::flat_map<AttributePair*, uint32_t>.

It has a custom less-than comparator so that the pointer is compared by its logical contents, not its memory address. This lets us ask "is this new AttributePair, which may be stored in a different memory location than a previously-seen version of it, already logically present in the map?"

But there is some waste: we effectively store the pointer in the map twice. The first copy is the key, which is literally a pointer. The second copy is the index value -- the uint32_t identifies the offset in the deque, which, if you squint, is a pointer to an AttributePair. Deques don't invalidate memory locations when they grow, so the pointer obtained by indexing to that location will be valid throughout the life of the program.

Thus, where today the flat map stores a vector<pair<AttributePair*, uint32_t>>, I think if we write a small wrapper, we ought to be able to get away with a custom container that stores a vector<uint32_t> and supports the same operations. I believe there are ~40M AttributeSets and ~30M AttributePairs, so I expect a saving of 70M * 8 bytes => 560MB. Our task is simpler than a general-purpose map, since we're only interested in insert + find -- iteration, deletion, etc. aren't needed.
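
For what it's worth, a minimal sketch of the kind of wrapper I have in mind (hypothetical names, assuming T is ordered and equality-comparable - not the eventual implementation):

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

// Sketch: the values live once in the deque; the "map" is just a vector of
// 4-byte indices kept sorted by the value each index refers to.
template <class T>
class IndexedDeque {
public:
    // Return the index of value, inserting it first if it isn't present yet.
    uint32_t findOrAdd(const T& value) {
        // Binary search over indices, comparing the deque entries they point at.
        auto it = std::lower_bound(sorted.begin(), sorted.end(), value,
            [&](uint32_t idx, const T& v) { return items[idx] < v; });
        if (it != sorted.end() && items[*it] == value)
            return *it;
        uint32_t idx = static_cast<uint32_t>(items.size());
        items.push_back(value); // deque growth never moves existing elements
        sorted.insert(it, idx); // keep indices ordered by the values they refer to
        return idx;
    }

    const T& at(uint32_t idx) const { return items[idx]; }

private:
    std::deque<T> items;          // index -> value, stable addresses
    std::vector<uint32_t> sorted; // 4 bytes per entry instead of pointer + index
};
```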

@systemed
Owner Author

Both sound good. I'm not in an inordinate hurry to do the release so happy to wait for those!

@systemed
Owner Author

Incidentally, there might be some possibilities for memory saving in the .pmtiles branch I've just merged in - it creates a big map/vector (depending on size) of the location of each tile within the file, for writing out into the pmtiles directories at the end of the process. I wondered whether this could potentially be mmaped after #618 - writing by z6 tile means that we'll be filling up nearby entries most of the time.

cldellow added a commit to cldellow/tilemaker that referenced this issue Dec 23, 2023
This implements the idea in systemed#622 (comment)

Rather than storing a `deque<T>` and a `flat_map<T*, uint32_t>`,
store a `deque<T>` and `vector<uint32_t>`, to save 8 bytes per
AttributePair and AttributeSet.
@cldellow
Contributor

> Incidentally, there might be some possibilities for memory saving in the .pmtiles branch I've just merged in - it creates a big map/vector (depending on size) of the location of each tile within the file, for writing out into the pmtiles directories at the end of the process. I wondered whether this could potentially be mmaped after #618 - writing by z6 tile means that we'll be filling up nearby entries most of the time.

Oh, good thought, denseIndex does look like a candidate for being mmap-able. I'm guessing the savings would be ~640MB? When I build the planet without shapefiles, I get ~80M tiles, so I'm treating that as the upper bound on the number of tiles with "interesting" things in them, and TileOffset is an 8-byte struct.

It might also be possible to approach it from a different angle -- maybe they could be flushed from memory to the pmtiles archive earlier? If I understand the pmtiles spec, I think things can be scattered throughout the archive willy-nilly when non-clustered mode is used. It might complicate the bookkeeping, and be a little tricky to do without imposing a lot of locking overhead.

@systemed
Owner Author

> When I build the planet without shapefiles, I get ~80M tiles, so I'm treating that as the upper bound on the number of tiles with "interesting" things in them, and TileOffset is an 8-byte struct.

I think it'll be more than that - there are many thousands of sea tiles when building with shapefiles, and we need an index for each of them.

> It might also be possible to approach it from a different angle -- maybe they could be flushed from memory to the pmtiles archive earlier? If I understand the pmtiles spec, I think things can be scattered throughout the archive willy-nilly when non-clustered mode is used. It might complicate the bookkeeping, and be a little tricky to do without imposing a lot of locking overhead.

Each pmtiles leaf directory is a series of file offsets for contiguously numbered tiles (using pmtiles' Hilbert tile numbering). The leaf directories are all together in the .pmtiles archive.

What this means in practice:

  • Either we write all the leaf directories together at the end of file (as we do now), or we reserve space for them at the start of the file, and write them as we go along. But if we do the latter we have to estimate how big the leaf directories are going to be before we start writing tiles.
  • If we have space reserved at the start of the file, we can flush a leaf directory to disk once it's complete. Directory completion is obviously not linear though - some tiles take longer than others, and we don't build the tiles in pmtiles Hilbert order. So there'd still be several (maybe many?) incomplete directories in memory before we flush.

With that in mind I suspect mmaping denseIndex is possibly easiest - but I haven't tried it!

systemed pushed a commit that referenced this issue Dec 28, 2023
* move OutputObjects to mmap store

For the planet, we need 1.3B output objects, 12 bytes per, so ~15GB
of RAM.

* treat objects at low zoom specially

For GB, ~0.3% of objects are visible at low zooms.

I noticed in previous planet runs that fetching the objects for tiles in
the low zooms was quite slow - I think it's because we're scanning 1.3B
objects each time, only to discard most of them. Now we'll only be
scanning ~4M objects per tile, which is still an absurd number, but
should mitigate most of the speed issue without having to properly
index things.

This will also help us maintain performance for memory-constrained
users, as we won't be scanning all 15GB of data on disk, just a smaller
~45MB chunk.

* make more explicit that this is unexpected

* extend --materialize-geometries to nodes

For Points stored via Layer(...) calls, store the node ID in the
OSM store, unless `--materialize-geometries` is present.

This saves ~200MB of RAM for North America, so perhaps 1 GB for the
planet, if NA is representative of the planet as a whole.
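
As a hedged illustration of the control flow (names are invented; the actual change tracks the node ID via the OSM_ID(...) handling mentioned below rather than a std::variant, which would be too large, and the node store lookup is assumed):

```cpp
#include <cstdint>
#include <variant>

struct LatpLon { int32_t latp, lon; };  // fixed-point coordinate pair
struct NodeRef { uint64_t id; };        // node IDs need at least 34 bits today

// Either a materialized point or a node ID to be resolved lazily.
using PointSource = std::variant<LatpLon, NodeRef>;

template <class NodeStore>
LatpLon resolvePoint(const PointSource& p, const NodeStore& nodes) {
    if (auto* ll = std::get_if<LatpLon>(&p))
        return *ll;                          // --materialize-geometries path
    // Lazy path: re-read the coordinate from the node store at output time.
    return nodes.at(std::get<NodeRef>(p).id);
}
```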

Also fix the OSM_ID(...) macro - it was lopping off many more bits
than needed, due to some previous experiments. Now that we want to track
nodes, we need at least 34 bits.

This may pose a problem down the road when we try to address thrashing.
The mechanism I hoped to use was to divide the OSM stores into multiple
stores covering different low zoom tiles. Ideally, we'd be able to
recall which store to look in -- but we only have 36 bits, we need 34
to store the Node ID, so that leaves us with 1.5 bits => can divide into
3 stores.

Since the node store for the planet is 44GB, dividing into 3 stores
doesn't give us very much headroom on a 32 GB box. Ah well, we can
sort this out later.

* rejig AttributePair layout

On g++, this reduces the size from 48 bytes to 34 bytes.

There aren't _that_ many attribute pairs, even at planet scale, but
this plus a better encoding of string attributes might save us ~2GB at
the planet level, which is meaningful for a 32GB box.

* fix initialization order warning

* add PooledString

Not used by anything yet. Given Tilemaker's limited needs, we can get
away with a stripped-down string class that is less flexible than
std::string, in exchange for memory savings.

The key benefit: 16 bytes, not 32 bytes (g++) or 24 bytes (clang).

When it does allocate (for strings longer than 15 bytes), it allocates
from a pool so there's less per-allocation overhead.
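
Purely to illustrate the size argument, a hypothetical layout along these lines (field names invented; not the actual PooledString):

```cpp
#include <cstdint>

struct PooledStringSketch {
    struct Short {
        uint8_t lengthAndFlag; // top bit set => inline; low bits = length (<= 15)
        char data[15];         // inline bytes, not NUL-terminated
    };
    struct Long {
        uint8_t flag;          // top bit clear => bytes live in a pool block
        uint8_t block;         // which pool block
        uint16_t length;
        uint32_t offset;       // offset of the bytes within that block
        uint64_t unused;       // pads the union out to 16 bytes
    };
    union {
        Short shortStr;
        Long longStr;
    };
};
static_assert(sizeof(PooledStringSketch) == 16,
              "16 bytes, vs 32 for libstdc++ std::string or 24 for libc++");
```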

* add tests for attribute store

...I'm going to replace the string implementation, so let's have some
backstop to make sure I don't break things

* rejig isHot

Break dependency on AttributePair, just work on std::string

* teach PooledString to work with std::string

...this will be useful for doing map lookups when testing if an
AttributePair has already been created with the given value.

* use PooledString in AttributePair

AttributePair has now been trimmed from 48 bytes to 18 bytes. There are
40M AttributeSets for the planet. That suggests there's probably ~30M AttributePairs,
so hopefully this is a savings of ~900MB at the planet level.

Runtime doesn't seem affected.

There's a further opportunity for savings if we can make more strings
qualify for the short string optimization. Only about 40% of strings
fit in the 15 byte short string optimization.

Of the remaining 60%, many are Latin-alphabet title cased strings like
`Wellington Avenue` -- this could be encoded using 5 bits per letter,
saving us an allocation.

Even in the most optimistic case where:

- there are 30M AttributePairs
- of these, 90% are strings (= 27M)
- of these, 60% don't fit in SSO (=16m)
- of these, we can make 100% fit in SSO

...we only save about 256MB at the planet level, but at some significant
complexity cost. So probably not worth pursuing at the moment.
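
For illustration only, a sketch of that dismissed 5-bits-per-letter idea (helper name made up; lowercasing first and restoring title case on decode is one possible scheme):

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Pack lowercase a-z plus space at 5 bits per character; anything else means
// the string isn't eligible and we fall back to a pooled allocation. Title
// case would be restored on decode by capitalising the first letter of each
// word. "wellington avenue" is 17 chars = 85 bits = 11 bytes, so it would fit
// in the 15-byte short-string buffer.
std::optional<std::vector<uint8_t>> pack5bit(const std::string& s) {
    std::vector<uint8_t> out((s.size() * 5 + 7) / 8, 0);
    std::size_t bit = 0;
    for (char c : s) {
        uint8_t code;
        if (c >= 'a' && c <= 'z') code = c - 'a'; // 0..25
        else if (c == ' ') code = 26;
        else return std::nullopt;
        out[bit / 8] |= static_cast<uint8_t>(code << (bit % 8));
        if (bit % 8 > 3) // the 5 bits straddle a byte boundary
            out[bit / 8 + 1] |= static_cast<uint8_t>(code >> (8 - bit % 8));
        bit += 5;
    }
    return out;
}
```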

* log timings

When doing the planet, especially on a box with limited memory, there
are long periods with no output. Show some output so the user doesn't
think things are hung.

This also might be useful in detecting perf regressions more granularly.

* AppendVector: an append-only chunked vector

When using --store, deque is nice because growing doesn't require
invalidating the old storage and copying it to a new location.

However, it's also bad, because deque allocates in 512-byte chunks,
which causes each 4KB OS page to have data from different z6 tiles.

Instead, use our own container that tries to get the best of both worlds.

Writing a random access iterator is new for me, so I don't trust this
code that much. The saving grace is that the container is very limited,
so errors in the iterator implementation may not get exercised in
practice.
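
A minimal sketch of the chunked-container idea, leaving out the iterator (chunk size and names are assumptions; the real AppendVector also provides a random-access iterator and is tuned for the --store case described above):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Fixed, large chunks: growth allocates a new chunk rather than moving old
// elements, and (unlike deque's 512-byte blocks) a chunk spans whole OS pages.
template <class T, std::size_t ChunkBytes = 8192> // chunk size is an assumption
class AppendOnlyVector {
    static constexpr std::size_t PerChunk = ChunkBytes / sizeof(T);
    std::vector<std::unique_ptr<T[]>> chunks;
    std::size_t count = 0;
public:
    void push_back(const T& value) {
        if (count == chunks.size() * PerChunk)
            chunks.push_back(std::make_unique<T[]>(PerChunk)); // grow by a whole chunk
        chunks[count / PerChunk][count % PerChunk] = value;
        ++count;
    }
    T& operator[](std::size_t i) { return chunks[i / PerChunk][i % PerChunk]; }
    const T& operator[](std::size_t i) const { return chunks[i / PerChunk][i % PerChunk]; }
    std::size_t size() const { return count; }
};
```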

* fix progress when --store present

* mutex on RelationScan progress output

* make NodeStore/WayStore shardable

This adds three methods to the stores:

- `shard()` returns which shard you are
- `shards()` returns how many shards total
- `contains(shard, id)` returns whether or not shard N has an item with
  id X

SortedNodeStore/SortedWayStore are not implemented yet; that'll come in
a future commit.

This will allow us to create a `ShardedNodeStore` and `ShardedWayStore`
that contain N stores. We will try to ensure that each store has data
that is geographically close to each other.

Then, when reading, we'll do multiple passes of the PBF to populate each store.
This should let us reduce the working set used to populate the stores,
at the cost of additional linear scans of the PBF. Linear scans of disk
are much less painful than random scans, so that should be a good trade.
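
As a sketch of how those three methods might compose into a sharded store (the method names come from the list above; the types and lookup wiring are assumptions, not the actual classes):

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <vector>

using NodeID = uint64_t;
struct LatpLon { int32_t latp, lon; };

// The per-shard interface described above.
class ShardableStore {
public:
    virtual ~ShardableStore() = default;
    virtual std::size_t shard() const = 0;                        // which shard this store is
    virtual std::size_t shards() const = 0;                       // how many shards exist in total
    virtual bool contains(std::size_t shard, NodeID id) const = 0;
    virtual LatpLon at(NodeID id) const = 0;
};

// A sharded store holds N of them, each covering a geographically coherent
// slice of the planet, and routes lookups to whichever shard reports the ID.
class ShardedNodeStoreSketch {
    std::vector<std::shared_ptr<ShardableStore>> stores;
public:
    LatpLon at(NodeID id) const {
        for (std::size_t s = 0; s < stores.size(); s++)
            if (stores[s]->contains(s, id))
                return stores[s]->at(id);
        throw std::out_of_range("node not found in any shard");
    }
};
```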

* add minimal SortedNodeStore test

I'm going to rejig the innards of this class, so let's have some tests.

* stop using internal linkage for atomics

In order to shard the stores, we need to have multiple instances
of the class.

Two things block this currently: atomics at file-level, and
thread-locals.

Moving the atomics to the class is easy.

Making the thread-locals per-class will require an approach similar
to that adopted in
https://github.com/systemed/tilemaker/blob/52b62dfbd5b6f8e4feb6cad4e3de86ba27874b3a/include/leased_store.h#L48,
where we have a container that tracks the per-class data.

* SortedNodeStore: abstract TLS behind storage()

Still only supports 1 class, but this is a step along the path.

* SortedWayStore: abstract TLS behind storage()

* SortedNodeStore: support multiple instances

* SortedWayStorage: support multiple instances

* actually fix the low zoom object collection

D'oh, this "worked" due to two bugs cancelling each other:

(a) the code to find things in the low zoom list never found anything,
    because it assumed a base z6 tile of 0/0

(b) we weren't returning early, so the normal code still ran

Rejigged to actually do what I was intending

* AppendVector tweaks

* more low zoom fixes

* implement SortedNodeStore::contains

* implement SortedWayStore::contains

* use TileCoordinatesSet

* faster covered tile enumeration

Do a single pass, rather than one pass per zoom.

* add ShardedNodeStore

This distributes nodes into one of 8 shards, trying to roughly group
parts of the globe by complexity.

This should help with locality when writing tiles.

A future commit will add a ShardedWayStore and teach read_pbf to read in
a locality-aware manner, which should help when reading ways.

* add ShardedWayStore

Add `--shard-stores` flag.

It's not clear yet whether this'll be a win; will need to benchmark.

The cost of reading the PBF blocks repeatedly is a bit higher than I was
expecting. It might be worth seeing if we can index the blocks to skip
fruitless reads.

* fewer, more balanced shards

* skip ReadPhase::Ways passes if node store is empty

* support multiple passes for ReadPhase::Relations

* fix check for first way

* adjust shards

With this distribution, no node shard is more than ~8.5GB.

* Relations: fix effectiveShards > 1 check

Oops - a bug that modestly affected performance in the non-`--shard-stores`
case.

* extend --materialize-geometries to LayerAsCentroid

It turns out that about 20% of LayerAsCentroid calls are for nodes,
which this branch could already do.

The remaining calls are predominantly ways, e.g. housenumbers.

We always materialize relation centroids, as they're expensive to
compute.

In GB, this saves about 6.4M points, ~102MB. Scaled to the planet, it's
perhaps a 4.5GB savings, which should let us use a more aggressive shard
strategy.

It seems to add 3-4 seconds to the time to process GB.

* add `DequeMap`, change AttributeStore to use it

This implements the idea in #622 (comment)

Rather than storing a `deque<T>` and a `flat_map<T*, uint32_t>`,
store a `deque<T>` and `vector<uint32_t>`, to save 8 bytes per
AttributePair and AttributeSet.

* capture s(this)

Seems to save ~1.5 seconds on GB

* fix warning

* fix warning, really

* fewer shards

Shard 1 (North America) is ~4.8GB of nodes, shard 4 (some of Europe) is
3.7GB. Even ignoring the memory savings in the recent commits, these
could be merged.

* extract option parsing to own file

We'd like to have different defaults based on whether `--store` is
present. Now that option parsing will have some more complex logic,
let's pull it into its own class so it can be more easily tested.

* use sensible defaults based on presence of --store

* improve test coverage

* fixes

* update number of shards to 6

This has no performance impact, as we never put anything in the 7th
shard, and so we skip the 7th pass in the ReadPhase::Ways and
ReadPhase::Relations phases.

The benefit is only to avoid emitting a noisy log about how the 7th store
has 0 entries in it.

Timings with 6 shards on Vultr's 16-core machine here: https://gist.github.com/cldellow/77991eb4074f6a0f31766cf901659efb

The new peak memory is ~12.2GB.

I am a little perplexed -- the runtime on a 16-core server was
previously:

```
$ time tilemaker --store /tmp/store --input planet-latest.osm.pbf --output tiles.mbtiles --shard-stores
real	195m7.819s
user	2473m52.322s
sys	73m13.116s
```

But with the most recent commits on this branch, it was:

```
real	118m50.098s
user	1531m13.026s
sys	34m7.252s
```

This is incredibly suspicious. I also tried re-running commit
bbf0957, and got:

```
real	123m15.534s
user	1546m25.196s
sys	38m17.093s
```

...so I can't explain why the earlier runs took 195 min.

Ideas:

- the planet changed between runs, and a horribly broken geometry was
  fixed

- Vultr gives quite different machines for the same class of server

- perhaps most likely: I failed to click "CPU-optimized" when picking
  the earlier server, and got a slow machine the first time, and a fast
  machine the second time. I'm pretty sure I paid the same $, so I'm
  not sure I believe this.

I don't think I really believe that a 33% reduction in runtime is
explained by any of those, though. Anyway, just another thing to
be befuddled by.

* --store uses lazy geometries; permit overriding

I did some experiments on a Hetzner 48-core box with 192GB of RAM:

--store, materialize geometries:
real 65m34.327s
user 2297m50.204s
sys 65m0.901s

The process often failed to use 100% of CPU--if you naively divide
user+sys/real you get ~36, whereas the ideal would be ~48.

Looking at stack traces, it seemed to coincide with calls to Boost's
rbtree_best_fit allocator.

Maybe:

- we're doing disk I/O, and it's just slower than recomputing the geometries
- we're using the Boost mmap library suboptimally -- maybe there's
  some other allocator we could be using. I think we use the mmap
  allocator like a simple bump allocator, so I don't know why we'd need
  a red-black tree

--store, lazy geometries:
real 55m33.979s
user 2386m27.294s
sys 23m58.973s

Faster, but still some overhead (user+sys/real => ~43)

no --store, materialize geometries: OOM

no --store, lazy geometries (used 175GB):
real 51m27.779s
user 2306m25.309s
sys 16m34.289s

This was almost 100% CPU (user+sys/real => ~45).

From this, I infer:

- `--store` should always default to lazy geometries in order to
  minimize the I/O burden

- `--materialize-geometries` is a good default for non-store usage,
  but it's still useful to be able to override and use lazy geometries,
  if it then means you can fit the data entirely in memory
@cldellow
Contributor

> I think it'll be more than that - there are many thousands of sea tiles when building with shapefiles, and we need an index for each of them.

Ah, right. I misunderstood how isSparse worked.

And re-reading the PMTiles spec, I see that I was hallucinating, and it's not valid to intersperse tile data with leaf directories. Let's pretend I didn't comment. :)

@systemed systemed pinned this issue Jan 1, 2024
@systemed
Owner Author

systemed commented Jan 10, 2024

I've created a v3 branch: so far it has master + #626, #629, #636 and a bit of tidying. I think I've merged #629 correctly but you might want to check.

@cldellow
Contributor

> I think I've merged #629 correctly but you might want to check.

I did a smoke test, seems good to me.

@systemed
Owner Author

A bit of benchmarking for v3 on my current machine:

Great Britain (no --store)

default:

  • Elapsed (wall clock) time (h:mm:ss or m:ss): 4:18.12
  • Maximum resident set size (kbytes): 12210088

--fast:

  • Elapsed (wall clock) time (h:mm:ss or m:ss): 4:17.14
  • Maximum resident set size (kbytes): 12119416

--lazy-geometries:

  • Elapsed (wall clock) time (h:mm:ss or m:ss): 4:25.12
  • Maximum resident set size (kbytes): 9272368

So when running without --store, I'm inclined to default to --lazy-geometries (for the significant memory saving), but turn it off with --fast.

planet (with --store on SSD)

default:

  • peak memory ~18.3GB (note this is quite an old planet)
  • Elapsed (wall clock) time (h:mm:ss or m:ss): 4:24:27
  • Maximum resident set size (kbytes): 131827336

--fast:

  • peak memory similar
  • Elapsed (wall clock) time (h:mm:ss or m:ss): 4:09:23
  • Maximum resident set size (kbytes): 139364168

I haven't yet tried with --materialize-geometries on the planet.

@leshak

leshak commented Jan 14, 2024

Today I built tilemaker from the v3 branch on Ubuntu 22.04.

Latest planet (74327MB):

  • default options + --fast
  • scripts from v3 +
    -- preferred_language = "ru"
    -- preferred_language_attribute = "name:ru"
  • VM: 48 cores (2GHz) + 768GB RAM + SSD
  • peak memory ~345GB + 158GB (cache/buffers) by top
  • time:
real	84m9.543s
user	3596m55.728s
sys     109m37.462s

@systemed systemed mentioned this issue Jan 14, 2024
@systemed systemed linked a pull request Jan 14, 2024 that will close this issue
@systemed
Owner Author

Release done!

> real 84m9.543s

That's amazing - thank you for running that as a benchmark.

@systemed systemed unpinned this issue Jan 15, 2024
@cldellow
Contributor

👍 No more excuses for me to avoid working on my hobby map project now, I guess. :)

Thanks, @systemed, for your patience in answering my questions and reviewing PRs over the past few months. I really appreciate it.

The in memoriam you added for Wouter van Kleunen is very thoughtful. I stumbled across many of his contributions and discussions here and in the boost geometry repos, learning something from him each time.

@systemed
Owner Author

> Thanks, @systemed, for your patience in answering my questions and reviewing PRs over the past few months. I really appreciate it.

Not at all - you've made massive improvements, so thank you!

> The in memoriam you added for Wouter van Kleunen is very thoughtful. I stumbled across many of his contributions and discussions here and in the boost geometry repos, learning something from him each time.

A lot of his code was really inspired - particularly the really intense geometry stuff such as intersection-aware simplification and the dissolve/correct algorithm. I hope he's in a better place.

@StephenAtty

Just wanted to say thanks for all the hard work. Generated the North America tile set in 50 minutes on my 32GB machine, and it never used more than 1GB of swap, whereas v2 used several tens of GB. Just throwing Europe through it now.
