some memory and concurrency improvements #612
Conversation
Some housekeeping: extract clip_cache.cpp
This provides a very small benefit. I think the reason is two-fold: there aren't many multilinestrings (relative to multipolygons), and clipping them is less expensive. Still, it did seem to provide a small boost, so leaving it in.
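For context, a minimal sketch of the clip-cache idea (the key layout and types here are illustrative, not tilemaker's actual structures): remember the result of clipping an object to a tile, so later work on that tile or its descendants can reuse the already-clipped geometry instead of re-clipping the full object.

```cpp
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

struct Point { double x, y; };
using Geometry = std::vector<Point>;
using TileCoords = std::tuple<uint8_t, uint32_t, uint32_t>; // zoom, x, y

class ClipCache {
public:
	// Returns the cached clip for (tile, object), or nullptr on a miss.
	const Geometry* get(TileCoords tile, uint64_t objectId) const {
		auto it = cache.find({ tile, objectId });
		return it == cache.end() ? nullptr : &it->second;
	}
	void put(TileCoords tile, uint64_t objectId, Geometry geom) {
		cache[{ tile, objectId }] = std::move(geom);
	}
private:
	std::map<std::pair<TileCoords, uint64_t>, Geometry> cache;
};
```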
This isn't super useful to end users, but is useful for developers. If it's not OK to leave it in, let me know & I'll revert it. You can then process the log:

```bash
$ for x in {0..14}; do echo -n "z$x "; cat log-write-node-attributes.txt | grep ' took ' | sort -nk3 | grep z$x/ | awk 'BEGIN { min = 999999; max = 0; }; { n += 1; t += $3; if ($3 > max) { max = $3; max_id = $1; } } END { print n, t, t/n, max " (" max_id ")" }'; done
z0 1 7.04769 7.04769 7.047685 (z0/0/0)
z1 1 9.76067 9.76067 9.760671 (z1/0/0)
z2 1 9.98514 9.98514 9.985141 (z2/1/1)
z3 1 9.98514 9.98514 9.985141 (z3/2/2)
z4 2 14.4699 7.23493 8.610035 (z4/5/5)
z5 2 20.828 10.414 13.956526 (z5/10/11)
z6 5 6464.05 1292.81 3206.252711 (z6/20/23)
z7 13 11306.4 869.727 3275.475707 (z7/40/46)
z8 35 15787.1 451.061 2857.506681 (z8/81/92)
z9 86 20723.8 240.974 1605.788985 (z9/162/186)
z10 277 25456.8 91.9018 778.311785 (z10/331/369)
z11 960 28851.3 30.0534 627.351078 (z11/657/735)
z12 3477 24031.6 6.91158 451.122972 (z12/1315/1471)
z13 13005 13763.7 1.05834 156.074701 (z13/2631/2943)
z14 50512 24214.7 0.479385 106.358450 (z14/5297/5916)
```

This shows, for each zoom: the number of tiles, total time, average time, and worst-case time (and the tile that caused it).

In general, lower zooms are slower than higher zooms. This seems intuitively reasonable, as a lower-zoom tile often contains all of the objects of the higher-zoom tiles it covers. I would have guessed that a lower zoom would cost 4x the next higher zoom on a per-tile basis. That's sort of the case for `z12->z13`, `z11->z12`, `z10->z11`, and `z9->z10`, but not so for other zooms, where it's more like a 2x cost. Looking at `z5->z6`, we see a big jump from 10ms/tile to 1,292ms/tile. This is probably because `water` has a minzoom of 6.

This all makes me think that the next big gain will be from re-using simplifications. This is sort of the mirror image of the clip cache:

- the clip cache avoids redundant clipping, and needs to be computed from lower zooms to higher zooms
- a simplification cache could make simplifying cheaper, but needs to be computed from higher zooms to lower zooms

The simplification cache also has two other wrinkles:

1. Is it even valid? e.g. is `simplify(object, 4)` the same as `simplify(simplify(object, 2), 2)`? Maybe it doesn't have to be the same, because users are already accepting that we're losing accuracy when we simplify.
2. Rendering an object at `z - 1` needs to (potentially) stitch together that object from 4 tiles at `z`. If those have each been simplified, we may introduce odd seams where the terminal points don't line up.
This saves a little bit of time but, more importantly, tees up lazily evaluating the nodes of a way.
Rather than locking on each store call, threads lease a range of the ID space for points/lines/multilines/polygons. When a thread ends, it returns the lease. This has some implications:

- the IDs are no longer guaranteed to be contiguous
- shapefiles are a bit weird, as they're loaded on the main thread -- so their lease won't be returned until program end. This is fine, just pointing it out.

This didn't actually seem to affect runtime that much on my 16-core machine, but I think it'd help on machines with more cores.
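A minimal sketch of the leasing idea (class, method names, and block size are illustrative, not tilemaker's actual API):

```cpp
#include <cstdint>
#include <mutex>
#include <utility>

class IdLeaser {
public:
	static constexpr uint64_t BlockSize = 100000; // hypothetical block size

	// Rare slow path: the lock is taken once per block, not once per store
	// call. The thread then hands out IDs from [start, end) with no further
	// synchronization.
	std::pair<uint64_t, uint64_t> lease() {
		std::lock_guard<std::mutex> lock(mutex);
		uint64_t start = next;
		next += BlockSize;
		return { start, next };
	}

private:
	std::mutex mutex;
	uint64_t next = 0;
};
```

When a thread exits with part of its block unused, that tail is simply abandoned, which is why the IDs are no longer guaranteed to be contiguous.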
When scaling to 32+ cores, this shows up as an issue. Try a really blunt hammer fix.
`std::cout` has some internal locks -- instead, let's synchronize explicitly outside of it so we control the contention. If a worker fails to get the lock, just skip that worker's update.
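A minimal sketch of that pattern, assuming progress output guarded by a shared `std::mutex`:

```cpp
#include <cstdint>
#include <iostream>
#include <mutex>

std::mutex progressMutex;

void reportProgress(uint64_t done, uint64_t total) {
	// try_lock never blocks: if another worker is already printing, this
	// worker drops its update and goes back to real work.
	if (progressMutex.try_lock()) {
		std::cout << "\rtiles: " << done << "/" << total << std::flush;
		progressMutex.unlock();
	}
}
```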
If a worker can't get the lock, just skip their update.
This reverts commit 3e7b9b6. This commit came about from some experiments that I had done pre-SortedNodeStore. In that world, lazily evaluating the nodes of a way provided a meaningful savings if the way was ultimately discarded by the Lua code. Post-SortedNodeStore, it doesn't seem to matter as much. Which is great, as it means the store is much faster, but also means this commit is just noise. You can see the POC code in https://github.com/cldellow/tilemaker/tree/lazy-way-nodes
Tilemaker previously stored the 2D geometries it produced from ways. This commit makes Tilemaker use the OSM way store to generate linestrings and polygons that originated with an OSM way. You can get the old behaviour with `--materialize-geometries`, which is a sensible choice if you are not memory constrained.

For GB:

- before (available via `--materialize-geometries`): 2m11s, 9600MB
- this commit: 2m20s, 6400MB

So ~8% slower, but 33% less memory. I think it's probably reasonable for this to be the default, which has nice symmetry with compressed nodes and compressed ways being the default. Building NA with `--store` still seems OK - 36min.

I was concerned that the increased node store lookups could be vulnerable to thrashing. I do see some stuttering during tile writing, but I think the decreased read iops from a smaller geometry store balance out the increased read iops from looking up nodes. A future investigation might be to have SortedWayStore store latplons rather than node IDs -- a bit more memory, but should be less CPU and less vulnerable to thrashing.
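An illustrative sketch of the on-demand path (plain maps stand in for the real stores; tilemaker's actual NodeStore/WayStore interfaces differ):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// tilemaker stores a projected latitude ("latp") and longitude per node.
struct LatpLon { int32_t latp, lon; };

using NodeStore = std::unordered_map<uint64_t, LatpLon>;              // nodeId -> location
using WayStore = std::unordered_map<uint64_t, std::vector<uint64_t>>; // wayId -> node IDs

// Rebuild a way's linestring at write time. Each point costs a node-store
// lookup: less memory than materializing every geometry up front, at the
// price of more lookups (the ~8% slowdown measured above).
std::vector<LatpLon> buildLinestring(const WayStore& ways, const NodeStore& nodes, uint64_t wayId) {
	std::vector<LatpLon> line;
	for (uint64_t nodeId : ways.at(wayId))
		line.push_back(nodes.at(nodeId));
	return line;
}
```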
Before writing, we compute the set of tiles to be written. There were two opportunities for improvement here:

- determining which tiles were covered by our objects: we previously used a `std::set`, which has poor memory and runtime behaviour. Instead, use a fixed-size `std::vector<bool>` -- this takes 64MB at z14, but gives much faster insert/query times (see the sketch after this list)
- determining if tiles were covered by the clipping box: we used Boost's intersect algorithm before, which required constructing a TileBbox and was a bit slow. In the case where the tile is contained in a z6 tile that is wholly covered by the clipping box, we can short-circuit

This has the most impact when the set of objects or tiles is very large -- e.g. Antarctica, North America or bigger.
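A simplified sketch of the coverage bitmap (tilemaker's actual structure may differ; this just shows the `std::vector<bool>` trade-off):

```cpp
#include <cstdint>
#include <vector>

class TileCoverage {
public:
	// One entry per z14 tile: 2^14 * 2^14 = 2^28 slots, bit-packed.
	TileCoverage() : covered(uint64_t(1) << 28) {}

	void set(uint32_t x, uint32_t y) { covered[index(x, y)] = true; }
	bool test(uint32_t x, uint32_t y) const { return covered[index(x, y)]; }

private:
	static uint64_t index(uint32_t x, uint32_t y) { return (uint64_t(x) << 14) | y; }

	// Insert/query are O(1) with no per-entry allocation, unlike std::set.
	std::vector<bool> covered;
};
```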
On a 48-core server, I noticed lock contention on the mmap allocator. So let's just always use pools of memory, and pick a bigger pool size. This means we'll sometimes allocate memory that we don't use. In the extreme case of Monaco, we only need like 200KB, but we'll allocate several megs. As you scale to larger PBFs, the waste trends to 0%, so this should be fine in practice.
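A sketch of the pooling idea, with `std::malloc` standing in for the lock-protected mmap-backed allocator (pool size and names are illustrative):

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>

// Slow path: shared and lock-protected.
std::mutex allocatorMutex;
void* lockedAllocate(size_t bytes) {
	std::lock_guard<std::mutex> lock(allocatorMutex);
	return std::malloc(bytes);
}

// Fast path: each thread carves small allocations out of its own pool,
// touching the shared lock only once per PoolSize bytes. The unused tail
// of a thread's last pool is the wasted memory mentioned above.
constexpr size_t PoolSize = 8 * 1024 * 1024; // hypothetical pool size
thread_local char* pool = nullptr;
thread_local size_t offset = PoolSize;

void* pooledAllocate(size_t bytes) {
	if (offset + bytes > PoolSize) { // assumes bytes is much smaller than PoolSize
		pool = static_cast<char*>(lockedAllocate(PoolSize));
		offset = 0;
	}
	void* result = pool + offset;
	offset += bytes;
	return result;
}
```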
D'oh, clock_gettime is Linux-ish. `std::chrono` may have a cross-platform option, but it's not clear. For now, just omit this functionality on Windows. If we want to expose it, we can explore something in std::chrono or make a wrapper that calls QueryPerformanceCounter on Windows.
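For reference, `std::chrono::steady_clock` is one portable option for wall-clock intervals -- though it's monotonic wall time, not a per-thread CPU clock, which may be part of why `std::chrono` isn't a clear replacement here:

```cpp
#include <chrono>
#include <iostream>

int main() {
	auto start = std::chrono::steady_clock::now();
	// ... work to be timed ...
	auto end = std::chrono::steady_clock::now();
	double ms = std::chrono::duration<double, std::milli>(end - start).count();
	std::cout << "took " << ms << " ms\n";
}
```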
That's superb! With North America I get an even bigger memory saving - down from 20.1GB to 13.2GB (using ...).
On my usual machine, 40.2GB (with ...). It was sitting on one thread for about 10-15 minutes at the end, which I'm guessing was crunching through a particularly complex set of geometries (very possibly at the poles).
Oh, wow. I hadn't thought too critically about this, but I didn't expect it to use quite that much memory. The cache size does increase with more cores, so perhaps this is expected - for 24 cores, 4 GB is "only" 160MB/core. I don't often test with the shapefiles; I should probably start doing that, especially as I think the runtime overhead is much smaller now (and especially after #614 lands).
I imagined the LRU cache would help with runtime, but not with memory usage. The benefit I was hoping for is that we'd more often have relevant things in the cache, but I wasn't planning on shrinking the cache. The cache might still be oversized, though. I didn't really experiment at all to find where the point of diminishing returns started, I just picked a number, observed an improvement and went with it.
I also noticed this, and I noticed that GB had a similar straggler on the Hetzner box (Ubuntu 22.04). Interestingly, the GB straggler doesn't appear when I run locally (Ubuntu 20.04) or when I ran on the Hetzner box prior to the #611 change. So I think I concluded that something in #611 causes the slowdown, but only on more modern Boost versions than the one in Ubuntu 20.04. According to https://packages.ubuntu.com/search?keywords=libboost-, 20.04 has 1.71 and 22.04 has 1.74. Nothing in Boost's changelog jumps out at me, though.
I did make a very brief attempt at downsizing the cache and being a bit pickier about what we put in it (e.g. no simple polygons with <20 vertices), but it removed the performance gains. There may still be possibilities along those lines, though.
In order of likelihood:
This looks good to merge to me - ok with that?
Yup!
Excellent - thank you again!
Quick issue you might be able to help me with... I've been playing around with a branch that can load GeoJSON files in the same way that we use shapefiles. (GeoJSON is easy to generate, and I quite often find myself writing little Ruby scripts that turn some data or other into GeoJSON.) At the same time, I've moved the shapefile code into its own class. It's at master...geojson On shutdown, after the mbtiles/pmtiles file has been written, it terminates (on Mac) with:
I've run valgrind over it and that shows:
So I think it's something to do with what you've described here:
...and that I've missed something obvious. I can't see what that is (but then it is 1.30am). Any thoughts?
I gave that a quick whirl and didn't repro a crash (granted, on Linux). If it's easy to replicate the command you're running, I can try that. Speculation: the LeasedStore destructor is running after the ShpMemTiles destructor. In the OSM case, the LeasedStore destructor runs when the threads for the various ReadPhases are terminated. In this case, if the reading of GeoJSON and SHP is happening on the main tilemaker thread, it'll run when the thread ends -- which might be after ShpMemTiles has been destroyed. The overall design is a bit fragile/footgun-y at the moment. If that's the case, possible fixes could include:
Ah, I repro a different error if I try that branch with some random geojson file:
If I then wrap systemed:ae1981b...systemed:a1976d2 in the world's most pointless thread pool:

```cpp
void GeoJSONProcessor::read(class LayerDef &layer, uint layerNum) {
	boost::asio::thread_pool pool(1);
	boost::asio::post(pool, [&]() {
		// Parse the JSON file into a RapidJSON document
		...
	});
	pool.join();
}
```

it runs as expected. I think the precise behaviour you'd get would depend on the order your shapefiles or geojson files were processed (which I guess depends on your layer orders). So I think that confirms that the leases either need to be manually relinquished (if running on the same thread), or the reading needs to happen in a separate thread to ensure the leases get returned.
Thanks! That fixes it - I've used the same thread pool approach as with shapefiles and it's fine now. Will do a bit more testing.
This PR has a few things:
The memory reduction comes from leaving most ways in the WayStore and generating the linestring or polygon on-demand at writing time. There are a few tricks to mitigate the cost of rebuilding the geometry -- we only generate on-demand those ways that didn't need to be corrected, and each thread keeps a thread-local cache of linestrings that they recently constructed.
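A rough sketch of the thread-local cache idea (names, size bound, and eviction policy are illustrative, not the PR's actual implementation):

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct LatpLon { int32_t latp, lon; };
using Linestring = std::vector<LatpLon>;

// Stand-in for the expensive path: rebuilding the geometry from the way's
// node IDs via the node store.
Linestring buildFromWayStore(uint64_t wayId) {
	return Linestring{};
}

const Linestring& cachedLinestring(uint64_t wayId) {
	// thread_local: no locking needed, and entries stay hot for the nearby
	// tiles this worker will process next.
	thread_local std::map<uint64_t, Linestring> cache;
	auto it = cache.find(wayId);
	if (it == cache.end()) {
		if (cache.size() > 10000) cache.clear(); // crude bound, purely illustrative
		it = cache.emplace(wayId, buildFromWayStore(wayId)).first;
	}
	return it->second;
}
```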
I explored extending this to relations as well, but it seemed like there was much less opportunity with relations: each individual relation is bigger, so reconstructing it is more costly; they're more often in need of correction; and in absolute terms, they're not as big of a memory hog as all the ways. My bag of tricks wasn't able to mitigate the runtime impact. There might be something there, but I'm going to admit defeat for now.
GB, master:
GB, this PR:
GB, this PR with `--materialize-geometries`:
planet-latest, this PR on a 48-core Hetzner CCX63:
tilemaker used 42GB of RAM and 249GB of virtual memory to generate the planet -- so still a bit out of reach for people on smaller machines.