some memory and concurrency improvements #612
Conversation
Some housekeeping: extract clip_cache.cpp
This provides a very small benefit. I think the reason is two-fold: there aren't many multilinestrings (relative to multipolygons), and clipping them is less expensive. Still, it did seem to provide a small boost, so leaving it in.
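For context, a minimal sketch of the clip-cache idea (the key layout and types here are illustrative, not tilemaker's actual structures): remember the result of clipping an object to a tile, so later work on that tile or its descendants can reuse the already-clipped geometry instead of re-clipping the full object.

```cpp
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

struct Point { double x, y; };
using Geometry = std::vector<Point>;
using TileCoords = std::tuple<uint8_t, uint32_t, uint32_t>; // zoom, x, y

class ClipCache {
public:
	// Returns the cached clip for (tile, object), or nullptr on a miss.
	const Geometry* get(TileCoords tile, uint64_t objectId) const {
		auto it = cache.find({ tile, objectId });
		return it == cache.end() ? nullptr : &it->second;
	}
	void put(TileCoords tile, uint64_t objectId, Geometry geom) {
		cache[{ tile, objectId }] = std::move(geom);
	}
private:
	std::map<std::pair<TileCoords, uint64_t>, Geometry> cache;
};
```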
This isn't super useful to end users, but is useful for developers. If it's not OK to leave it in, let me know & I'll revert it. You can then process the log:

```bash
$ for x in {0..14}; do echo -n "z$x "; cat log-write-node-attributes.txt | grep ' took ' | sort -nk3 | grep z$x/ | awk 'BEGIN { min = 999999; max = 0; }; { n += 1; t += $3; if ($3 > max) { max = $3; max_id = $1; } } END { print n, t, t/n, max " (" max_id ")" }'; done
z0 1 7.04769 7.04769 7.047685 (z0/0/0)
z1 1 9.76067 9.76067 9.760671 (z1/0/0)
z2 1 9.98514 9.98514 9.985141 (z2/1/1)
z3 1 9.98514 9.98514 9.985141 (z3/2/2)
z4 2 14.4699 7.23493 8.610035 (z4/5/5)
z5 2 20.828 10.414 13.956526 (z5/10/11)
z6 5 6464.05 1292.81 3206.252711 (z6/20/23)
z7 13 11306.4 869.727 3275.475707 (z7/40/46)
z8 35 15787.1 451.061 2857.506681 (z8/81/92)
z9 86 20723.8 240.974 1605.788985 (z9/162/186)
z10 277 25456.8 91.9018 778.311785 (z10/331/369)
z11 960 28851.3 30.0534 627.351078 (z11/657/735)
z12 3477 24031.6 6.91158 451.122972 (z12/1315/1471)
z13 13005 13763.7 1.05834 156.074701 (z13/2631/2943)
z14 50512 24214.7 0.479385 106.358450 (z14/5297/5916)
```

This shows, for each zoom: the number of tiles, total time, average time, and worst-case time (and the tile that caused it).

In general, lower zooms are slower than higher zooms. This seems intuitively reasonable, as a lower-zoom tile often contains all of the objects of the higher-zoom tiles it covers. I would have guessed that a lower zoom would cost 4x the next higher zoom on a per-tile basis. That's sort of the case for `z12->z13`, `z11->z12`, `z10->z11`, and `z9->z10`, but not so for other zooms, where it's more like a 2x cost. Looking at `z5->z6`, we see a big jump from 10ms/tile to 1,292ms/tile. This is probably because `water` has a minzoom of 6.

This all makes me think that the next big gain will be from re-using simplifications. This is sort of the mirror image of the clip cache:

- the clip cache avoids redundant clipping, and needs to be computed from lower zooms to higher zooms
- a simplification cache could make simplifying cheaper, but needs to be computed from higher zooms to lower zooms

The simplification cache also has two other wrinkles:

1. Is it even valid? e.g. is `simplify(object, 4)` the same as `simplify(simplify(object, 2), 2)`? Maybe it doesn't have to be the same, because users are already accepting that we're losing accuracy when we simplify.
2. Rendering an object at `z - 1` needs to (potentially) stitch together that object from 4 tiles at `z`. If those have each been simplified, we may introduce odd seams where the terminal points don't line up.
This saves a little bit of time but, more importantly, tees up lazily evaluating the nodes of a way.
Rather than locking on each store call, threads lease a range of the ID space for points/lines/multilines/polygons. When a thread ends, it returns the lease. This has some implications:

- the IDs are no longer guaranteed to be contiguous
- shapefiles are a bit weird, as they're loaded on the main thread -- so their lease won't be returned until program end. This is fine, just pointing it out.

This didn't actually seem to affect runtime that much on my 16-core machine, but I think it'd help on machines with more cores.
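A minimal sketch of the leasing idea (class, method names, and block size are illustrative, not tilemaker's actual API):

```cpp
#include <cstdint>
#include <mutex>
#include <utility>

class IdLeaser {
public:
	static constexpr uint64_t BlockSize = 100000; // hypothetical block size

	// Rare slow path: the lock is taken once per block, not once per store
	// call. The thread then hands out IDs from [start, end) with no further
	// synchronization.
	std::pair<uint64_t, uint64_t> lease() {
		std::lock_guard<std::mutex> lock(mutex);
		uint64_t start = next;
		next += BlockSize;
		return { start, next };
	}

private:
	std::mutex mutex;
	uint64_t next = 0;
};
```

When a thread exits with part of its block unused, that tail is simply abandoned, which is why the IDs are no longer guaranteed to be contiguous.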
When scaling to 32+ cores, this shows up as an issue. Try a really blunt hammer fix.
`std::cout` has some internal locks -- instead, let's synchronize explicitly outside of it so we control the contention. If a worker fails to get the lock, just skip that worker's update.
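A minimal sketch of that pattern, assuming progress output guarded by a shared `std::mutex`:

```cpp
#include <cstdint>
#include <iostream>
#include <mutex>

std::mutex progressMutex;

void reportProgress(uint64_t done, uint64_t total) {
	// try_lock never blocks: if another worker is already printing, this
	// worker drops its update and goes back to real work.
	if (progressMutex.try_lock()) {
		std::cout << "\rtiles: " << done << "/" << total << std::flush;
		progressMutex.unlock();
	}
}
```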
If a worker can't get the lock, just skip their update.
This reverts commit 3e7b9b6. This commit came about from some experiments that I had done pre-SortedNodeStore. In that world, lazily evaluating the nodes of a way provided a meaningful savings if the way was ultimately discarded by the Lua code. Post-SortedNodeStore, it doesn't seem to matter as much. Which is great, as it means the store is much faster, but also means this commit is just noise. You can see the POC code in https://github.com/cldellow/tilemaker/tree/lazy-way-nodes
Tilemaker previously stored the 2D geometries it produced from ways. This commit makes Tilemaker use the OSM way store to generate linestrings and polygons that originated with an OSM way. You can get the old behaviour with `--materialize-geometries`, which is a sensible choice if you are not memory constrained.

For GB:

- before (available via `--materialize-geometries`): 2m11s, 9600MB
- this commit: 2m20s, 6400MB

So ~8% slower, but 33% less memory. I think it's probably reasonable for this to be the default, which has nice symmetry with compressed nodes and compressed ways being the default. Building NA with `--store` still seems OK - 36min.

I was concerned that the increased node store lookups could be vulnerable to thrashing. I do see some stuttering during tile writing, but I think the decreased read iops from a smaller geometry store balance out the increased read iops from looking up nodes. A future investigation might be to have SortedWayStore store latplons rather than node IDs -- a bit more memory, but should be less CPU and less vulnerable to thrashing.
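An illustrative sketch of the on-demand path (plain maps stand in for the real stores; tilemaker's actual NodeStore/WayStore interfaces differ):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// tilemaker stores a projected latitude ("latp") and longitude per node.
struct LatpLon { int32_t latp, lon; };

using NodeStore = std::unordered_map<uint64_t, LatpLon>;              // nodeId -> location
using WayStore = std::unordered_map<uint64_t, std::vector<uint64_t>>; // wayId -> node IDs

// Rebuild a way's linestring at write time. Each point costs a node-store
// lookup: less memory than materializing every geometry up front, at the
// price of more lookups (the ~8% slowdown measured above).
std::vector<LatpLon> buildLinestring(const WayStore& ways, const NodeStore& nodes, uint64_t wayId) {
	std::vector<LatpLon> line;
	for (uint64_t nodeId : ways.at(wayId))
		line.push_back(nodes.at(nodeId));
	return line;
}
```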
Before writing, we compute the set of tiles to be written. There were two opportunities for improvement here:

- determining which tiles were covered by our objects: we previously used a `std::set`, which has poor memory and runtime behaviour. Instead, use a fixed-size `std::vector<bool>` -- this takes 64MB at z14, but gives much faster insert/query times (see the sketch after this list)
- determining if tiles were covered by the clipping box: we used Boost's intersect algorithm before, which required constructing a TileBbox and was a bit slow. In the case where the tile is contained in a z6 tile that is wholly covered by the clipping box, we can short-circuit

This has the most impact when the set of objects or tiles is very large -- e.g. Antarctica, North America or bigger.
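A simplified sketch of the coverage bitmap (tilemaker's actual structure may differ; this just shows the `std::vector<bool>` trade-off):

```cpp
#include <cstdint>
#include <vector>

class TileCoverage {
public:
	// One entry per z14 tile: 2^14 * 2^14 = 2^28 slots, bit-packed.
	TileCoverage() : covered(uint64_t(1) << 28) {}

	void set(uint32_t x, uint32_t y) { covered[index(x, y)] = true; }
	bool test(uint32_t x, uint32_t y) const { return covered[index(x, y)]; }

private:
	static uint64_t index(uint32_t x, uint32_t y) { return (uint64_t(x) << 14) | y; }

	// Insert/query are O(1) with no per-entry allocation, unlike std::set.
	std::vector<bool> covered;
};
```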
On a 48-core server, I noticed lock contention on the mmap allocator. So let's just always use pools of memory, and pick a bigger pool size. This means we'll sometimes allocate memory that we don't use. In the extreme case of Monaco, we only need like 200KB, but we'll allocate several megs. As you scale to larger PBFs, the waste trends to 0%, so this should be fine in practice.
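A sketch of the pooling idea, with `std::malloc` standing in for the lock-protected mmap-backed allocator (pool size and names are illustrative):

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>

// Slow path: shared and lock-protected.
std::mutex allocatorMutex;
void* lockedAllocate(size_t bytes) {
	std::lock_guard<std::mutex> lock(allocatorMutex);
	return std::malloc(bytes);
}

// Fast path: each thread carves small allocations out of its own pool,
// touching the shared lock only once per PoolSize bytes. The unused tail
// of a thread's last pool is the wasted memory mentioned above.
constexpr size_t PoolSize = 8 * 1024 * 1024; // hypothetical pool size
thread_local char* pool = nullptr;
thread_local size_t offset = PoolSize;

void* pooledAllocate(size_t bytes) {
	if (offset + bytes > PoolSize) { // assumes bytes is much smaller than PoolSize
		pool = static_cast<char*>(lockedAllocate(PoolSize));
		offset = 0;
	}
	void* result = pool + offset;
	offset += bytes;
	return result;
}
```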
D'oh, clock_gettime is Linux-ish. `std::chrono` may have a cross-platform option, but it's not clear. For now, just omit this functionality on Windows. If we want to expose it, we can explore something in std::chrono or make a wrapper that calls QueryPerformanceCounter on Windows.
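For reference, `std::chrono::steady_clock` is one portable option for wall-clock intervals -- though it's monotonic wall time, not a per-thread CPU clock, which may be part of why `std::chrono` isn't a clear replacement here:

```cpp
#include <chrono>
#include <iostream>

int main() {
	auto start = std::chrono::steady_clock::now();
	// ... work to be timed ...
	auto end = std::chrono::steady_clock::now();
	double ms = std::chrono::duration<double, std::milli>(end - start).count();
	std::cout << "took " << ms << " ms\n";
}
```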
That's superb! With North America I get an even bigger memory saving - down from 20.1GB to 13.2GB (using ...).
On my usual machine, 40.2GB (with ...). It was sitting on one thread for about 10-15 minutes at the end, which I'm guessing was crunching through a particularly complex set of geometries (very possibly at the poles).
Oh, wow. I hadn't thought too critically about this, but I didn't expect it to use quite that much memory. The cache size does increase with more cores, so perhaps this is expected - for 24 cores, 4 GB is "only" 160MB/core. I don't often test with the shapefiles; I should probably start doing that, especially as I think the runtime overhead is much smaller now (and especially after #614 lands).
I imagined the LRU cache would help with runtime, but not with memory usage. The benefit I was hoping for is that we'd more often have relevant things in the cache, but I wasn't planning on shrinking the cache. The cache might still be oversized, though. I didn't really experiment at all to find where the point of diminishing returns started, I just picked a number, observed an improvement and went with it.
I also noticed this, and I noticed that GB had a similar straggler on the Hetzner box (Ubuntu 22.04). Interestingly, the GB straggler doesn't appear when I run locally (Ubuntu 20.04) or when I ran on the Hetzner box prior to the #611 change. So I think I concluded that something in #611 causes the slowdown, but only on more modern Boost versions than the one in Ubuntu 20.04. According to https://packages.ubuntu.com/search?keywords=libboost-, 20.04 has 1.71 and 22.04 has 1.74. Nothing in Boost's changelog jumps out at me, though.
I did make a very brief attempt at downsizing the cache and being a bit pickier about what we put in it (e.g. no simple polygons with <20 vertices), but it removed the performance gains. There may still be possibilities along those lines, though.
In order of likelihood:
This looks good to merge to me - ok with that?
Yup!
Excellent - thank you again!
Quick issue you might be able to help me with... I've been playing around with a branch that can load GeoJSON files in the same way that we use shapefiles. (GeoJSON is easy to generate, and I quite often find myself writing little Ruby scripts that turn some data or other into GeoJSON.) At the same time, I've moved the shapefile code into its own class. It's at master...geojson On shutdown, after the mbtiles/pmtiles file has been written, it terminates (on Mac) with:
I've run valgrind over it and that shows:
So I think it's something to do with what you've described here:
...and that I've missed something obvious. I can't see what that is (but then it is 1.30am). Any thoughts?
I gave that a quick whirl and didn't repro a crash (granted, on Linux). If it's easy to replicate the command you're running, I can try that. Speculation: the LeasedStore destructor is running after the ShpMemTiles destructor. In the OSM case, the LeasedStore destructor runs when the threads for the various ReadPhases are terminated. In this case, if the reading of GeoJSON and SHP is happening on the main tilemaker thread, it'll run when the thread ends -- which might be after ShpMemTiles has been destroyed. The overall design is a bit fragile/footgun-y at the moment. If that's the case, possible fixes could include:
Ah, I repro a different error if I try that branch with some random geojson file:
If I then wrap systemed:ae1981b...systemed:a1976d2 in the world's most pointless thread pool:

```cpp
void GeoJSONProcessor::read(class LayerDef &layer, uint layerNum) {
	boost::asio::thread_pool pool(1);
	boost::asio::post(pool, [&]() {
		// Parse the JSON file into a RapidJSON document
		...
	});
	pool.join();
}
```

it runs as expected. I think the precise behaviour you'd get would depend on the order your shapefiles or geojson files were processed (which I guess depends on your layer orders). So I think that confirms that the leases either need to be manually relinquished (if running on the same thread), or the reading needs to happen in a separate thread to ensure the leases get returned.
Thanks! That fixes it - I've used the same thread pool approach as with shapefiles and it's fine now. Will do a bit more testing.
This PR has a few things:
The memory reduction comes from leaving most ways in the WayStore and generating the linestring or polygon on-demand at writing time. There are a few tricks to mitigate the cost of rebuilding the geometry -- we only generate on-demand those ways that didn't need to be corrected, and each thread keeps a thread-local cache of linestrings that they recently constructed.
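A rough sketch of the thread-local cache idea (names, size bound, and eviction policy are illustrative, not the PR's actual implementation):

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct LatpLon { int32_t latp, lon; };
using Linestring = std::vector<LatpLon>;

// Stand-in for the expensive path: rebuilding the geometry from the way's
// node IDs via the node store.
Linestring buildFromWayStore(uint64_t wayId) {
	return Linestring{};
}

const Linestring& cachedLinestring(uint64_t wayId) {
	// thread_local: no locking needed, and entries stay hot for the nearby
	// tiles this worker will process next.
	thread_local std::map<uint64_t, Linestring> cache;
	auto it = cache.find(wayId);
	if (it == cache.end()) {
		if (cache.size() > 10000) cache.clear(); // crude bound, purely illustrative
		it = cache.emplace(wayId, buildFromWayStore(wayId)).first;
	}
	return it->second;
}
```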
I explored extending this to relations as well, but it seemed like there was much less opportunity with relations: each individual relation is bigger, so reconstructing it is more costly; they're more often in need of correction; and in absolute terms, they're not as big of a memory hog as all the ways. My bag of tricks wasn't able to mitigate the runtime impact. There might be something there, but I'm going to admit defeat for now.
GB, master:
GB, this PR:
GB, this PR with `--materialize-geometries`:
planet-latest, this PR on a 48-core Hetzner CCX63:
tilemaker used 42GB of RAM and 249GB of virtual memory to generate the planet -- so still a bit out of reach for people on smaller machines.