Filter out ways when importing #190

ttsirkia · 2020-12-06T19:19:38Z

If I have understood correctly, tilemaker stores all ways in memory while reading the .pbf file. Depending on the number of ways, memory consumption can be high.

What if I only needed, say, motorways or land-use information? Would it be possible to have some kind of Lua filter function to skip ways that are not relevant? Does it already work like that for nodes with the node_keys list?

I think this would reduce memory consumption quite a lot when the full data set is not needed. Are there any drawbacks, or is there something I didn't understand correctly about the import process?

ttsirkia · 2020-12-06T22:16:17Z

OK. I ran some experiments with Estonian map data, as it is neither tiny nor too big.

With the shipped OSM process files:
Stored 10045947 nodes, 1165834 ways, 3396 relations

With the shipped almost blank process files:
Stored 10045947 nodes, 1092984 ways, 303 relations

With an empty process file which does not include anything:
Stored 10045947 nodes, 84422 ways, 0 relations

So the node count is always the same, while the number of ways drops. However, it is not zero in the last test. The .osm.pbf file is 83 MB; memory consumption was around 1 GB in the first two tests and around 500 MB in the last. Is there anything that could be done in the import phase to drop unnecessary data prior to writing the mbtiles file?

systemed · 2020-12-07T09:19:26Z

Yes, tilemaker reads all nodes in the .pbf first, then (multipolygon) relations, then ways. It needs to read nodes before ways so that, when processing a way, it has the location of the way available for functions like way:Intersects (and I'd like to add more functions like this, as per #167).

node_keys is just a speed optimisation that avoids calling the Lua code for uninteresting nodes - it doesn't affect memory usage.

I suspect the reason that the OMT and example process files have similar results is that both of them include buildings, which are the biggest contributor to OSM data bloat.

You can eliminate unwanted nodes from the .pbf in advance by using tools such as osmfilter/osmconvert or osmium. You can also potentially reduce memory requirements by preprocessing the .pbf using mapsplit.

Conceivably we could implement a way_keys function that did an extra first pass to read ways and, if they match the tags, note their waynodes. Only then would we read the node lat/lon on the next pass. That's not something I have any plan to implement (osmfilter/osmconvert does the same and works for me) but I wouldn't reject a PR implementing it.
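For illustration, the extra first pass described above could look roughly like the following in Python. This is a toy sketch over in-memory lists, not tilemaker code: the `Node` and `Way` tuples and the helper names `collect_needed_nodes` and `store_nodes` are made up for the example, and the filter only checks for the presence of a key.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical stand-ins for parsed .pbf content (not tilemaker's types).
Node = Tuple[int, float, float]               # (id, lat, lon)
Way = Tuple[int, Dict[str, str], List[int]]   # (id, tags, node ids)


def collect_needed_nodes(ways: Iterable[Way], way_keys: Set[str]) -> Set[int]:
    """Extra first pass: note the waynodes of ways that carry a wanted key."""
    needed: Set[int] = set()
    for _way_id, tags, node_ids in ways:
        if any(key in tags for key in way_keys):
            needed.update(node_ids)
    return needed


def store_nodes(nodes: Iterable[Node], needed: Set[int]) -> Dict[int, Tuple[float, float]]:
    """Node pass: keep lat/lon only for nodes noted in the first pass."""
    return {nid: (lat, lon) for nid, lat, lon in nodes if nid in needed}


nodes = [(1, 59.43, 24.75), (2, 59.44, 24.76), (3, 58.38, 26.72), (4, 58.39, 26.73)]
ways = [
    (10, {"highway": "motorway"}, [1, 2]),
    (11, {"building": "yes"}, [3, 4]),
]

needed = collect_needed_nodes(ways, {"highway"})
node_store = store_nodes(nodes, needed)
print(sorted(node_store))  # only the nodes of the motorway are stored
```

The saving comes from the second function: nodes that belong only to uninteresting ways never enter the node store at all.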

Another possibility I've considered is adding a runtime switch that states all nodes in the .pbf are consecutively numbered (i.e. 1,2,3,4... without any gaps). You can produce .pbfs like this with osmium renumber. This would allow us to use a vector (array) rather than a sparse_map (hash) for node storage, which would lead to a useful memory saving.
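To make the vector-versus-hash idea concrete, here is a minimal Python sketch, assuming node IDs run from 1 to n with no gaps (as after `osmium renumber`). `DenseNodeStore` is a hypothetical name for the example, and storing coordinates as doubles is a simplification of my own, not a statement about tilemaker's storage format.

```python
from array import array
from typing import Tuple


class DenseNodeStore:
    """Node store for IDs 1..n without gaps (hypothetical example class)."""

    def __init__(self, node_count: int) -> None:
        # Two packed arrays; the node ID itself is the index, so no keys
        # need to be stored, unlike in a hash map.
        self._lat = array("d", [0.0]) * node_count
        self._lon = array("d", [0.0]) * node_count

    def set(self, node_id: int, lat: float, lon: float) -> None:
        self._check(node_id)
        self._lat[node_id - 1] = lat
        self._lon[node_id - 1] = lon

    def get(self, node_id: int) -> Tuple[float, float]:
        self._check(node_id)
        return (self._lat[node_id - 1], self._lon[node_id - 1])

    def _check(self, node_id: int) -> None:
        if not 1 <= node_id <= len(self._lat):
            raise KeyError(node_id)


store = DenseNodeStore(3)
store.set(1, 59.43, 24.75)
store.set(2, 58.38, 26.72)
store.set(3, 58.39, 26.73)

# The equivalent sparse (hash) store has to keep every ID as a key.
sparse = {1: (59.43, 24.75), 2: (58.38, 26.72), 3: (58.39, 26.73)}
print(store.get(2) == sparse[2])
```

The dense layout only works if the "no gaps" promise holds, which is why it would have to be an explicit runtime switch rather than the default.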

ttsirkia · 2020-12-07T09:25:12Z

Thanks for the very detailed answer! I'll take a look at these options.

ttsirkia · 2020-12-07T09:36:40Z

Osmium seems to be the easiest option, as it can read and write .osm.pbf files directly: https://osmcode.org/osmium-tool/manual.html#filtering-by-tags

This PR generalizes the idea of `node_keys`, adds `way_keys`, and fixes systemed#402.

I'm not too sure if this is generally useful - it's useful for one of my use cases, and I see someone asking about it in systemed#190 and, elsewhere, in onthegomap/planetiler#99. If you feel it complicates the maintainer story too much, please reject.

The goal is to reduce memory usage for users doing thematic extracts by not indexing nodes that are only used by uninteresting ways.

For example, North America has ~1.8B nodes, needing 9.7GB of RAM for its node store. By contrast, if your interest is only to build a railway map, you require only ~8M nodes, needing 70MB of RAM. Or, to build a map of national/provincial parks, 12M nodes and ~120MB of RAM.

Currently, a user can achieve this by pre-filtering their PBF using osmium-tool. If you know exactly what you want, this is a good long-term solution. But if you're me, flailing about in the OSM data model, it's convenient to be able to tweak something in the Lua script and observe the results without having to re-filter the PBF and update your tilemaker command to use the new PBF.

Sample use cases:

```lua
-- Building a map without building polygons, ~ excludes ways whose
-- only tags are matched by the filter.
way_keys = {"~building"}
```

```lua
-- Building a railway map
way_keys = {"railway"}
```

```lua
-- Building a map of major roads
way_keys = {"highway=motorway", "highway=trunk", "highway=primary", "highway=secondary"}
```

Nodes used in ways which are used in relations (as identified by `relation_scan_function`) will always be indexed, regardless of `node_keys` and `way_keys` settings that might exclude them.

A concrete example, given a Lua script like:

```lua
function way_function()
  if Find("railway") ~= "" then
    Layer("lines", false)
  end
end
```

it takes 13GB of RAM and 100 seconds to process North America. If you add:

```lua
way_keys = {"railway"}
```

it takes 2GB of RAM and 47 seconds.

Notes:

1. This is based on `lua-interop-3`, as it interacts with files that are changed by that. I can rebase against master after lua-interop-3 is merged.
2. The names `node_keys` and `way_keys` are perhaps out of date, as they can now express conditions on the values of tags in addition to their keys. Leaving them as-is is nice, as it's not a breaking change. But if breaking changes are OK, maybe these should be `node_filters` and `way_filters`?
3. Maybe the value for `node_keys` in the OMT profile should be expressed in terms of a negation, e.g. `node_keys = {"~created_by"}`? This would avoid issues like systemed#337.
4. This also adds a SIGUSR1 handler during OSM processing, which prints the ID of the object currently being processed. This is helpful for tracking down slow geometries.
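As a reading aid, the three filter forms in the sample use cases (`key`, `key=value` and `~key`) can be modelled in a few lines of Python. This is only my interpretation of the description above, in particular of "~ excludes ways whose only tags are matched by the filter"; it is not the PR's implementation, and `matches_filter` and `is_interesting` are hypothetical names.

```python
from typing import Dict, List


def matches_filter(tags: Dict[str, str], expr: str) -> bool:
    """Evaluate one filter expression against an object's tags."""
    if expr.startswith("~"):
        excluded = expr[1:]
        # Assumed meaning: accept only if some tag other than the
        # excluded key is present.
        return any(key != excluded for key in tags)
    if "=" in expr:
        key, value = expr.split("=", 1)
        return tags.get(key) == value
    return expr in tags


def is_interesting(tags: Dict[str, str], filters: List[str]) -> bool:
    """An object is kept if any one of the filter expressions accepts it."""
    return any(matches_filter(tags, expr) for expr in filters)


major_roads = ["highway=motorway", "highway=trunk", "highway=primary", "highway=secondary"]

print(is_interesting({"railway": "rail"}, ["railway"]))
print(is_interesting({"highway": "residential"}, major_roads))
print(is_interesting({"building": "yes"}, ["~building"]))
print(is_interesting({"building": "yes", "amenity": "school"}, ["~building"]))
```

Under this reading, a way tagged only `building=yes` is dropped by `{"~building"}`, while a building that also carries another tag is still kept.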

ttsirkia closed this as completed Dec 7, 2020

cldellow mentioned this issue Dec 29, 2023

generalize node_keys; add way_keys #629

Closed
