Filter out ways when importing #190

Closed
ttsirkia opened this issue Dec 6, 2020 · 4 comments

Comments

ttsirkia commented Dec 6, 2020

If I have understood correctly, tilemaker stores all the ways in memory while reading the pbf file. Depending on the number of ways found, memory consumption can be high.

What if I needed only, say, motorways or land-use information? Would it be possible to have some kind of Lua filter function to skip the ways that are not relevant? Does it already work like that for nodes with the node_keys list?

I think this would reduce memory consumption quite a lot when the whole dataset is not needed. Are there drawbacks, or is there something I haven't understood correctly about the import process?

Author

ttsirkia commented Dec 6, 2020

OK. I ran some experiments with Estonian map data, as it is neither tiny nor too big.

With the shipped OSM process files:
Stored 10045947 nodes, 1165834 ways, 3396 relations

With the shipped almost blank process files:
Stored 10045947 nodes, 1092984 ways, 303 relations

With an empty process file which does not include anything:
Stored 10045947 nodes, 84422 ways, 0 relations

So the node count is always the same, but the number of ways drops. However, it is not zero in the last test. The .osm.pbf file is 83 MB; memory consumption was around 1 GB in the first two tests and around 500 MB in the last. Is there anything that could be reduced in the import phase to drop unnecessary data prior to writing the mbtiles file?

Owner

systemed commented Dec 7, 2020

Yes, tilemaker reads all nodes in the .pbf first, then (multipolygon) relations, then ways. It needs to read nodes before ways so that, when processing a way, it has the location of the way available for functions like way:Intersects (and I'd like to add more functions like this, as per #167).

node_keys is just a speed optimisation that avoids calling the Lua code for uninteresting nodes - it doesn't affect memory usage.

I suspect the reason that the OMT and example process files have similar results is because both of them include buildings, which are the biggest contributor to OSM data bloat.

You can eliminate unwanted nodes from the .pbf in advance by using tools such as osmfilter/osmconvert or osmium. You can also potentially reduce memory requirements by preprocessing the .pbf using mapsplit.

Conceivably we could implement a way_keys function that did an extra first pass to read ways and, if they match the tags, note their waynodes. Only then would we read the node lat/lon on the next pass. That's not something I have any plan to implement (osmfilter/osmconvert does the same and works for me) but I wouldn't reject a PR implementing it.
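
To make the two-pass idea concrete, here is a minimal plain-Lua sketch over a toy in-memory dataset. It is not tilemaker code: `nodes`, `ways`, `collect_needed_nodes` and `store_needed_nodes` are hypothetical names invented for illustration, and a real implementation would stream the .pbf rather than hold everything in tables.

```lua
-- Toy data standing in for a .pbf (hypothetical; real input would be streamed).
local nodes = {
  { id = 1, lat = 59.43, lon = 24.75 },
  { id = 2, lat = 59.44, lon = 24.76 },
  { id = 3, lat = 58.38, lon = 26.72 },
  { id = 4, lat = 58.39, lon = 26.73 },
}
local ways = {
  { id = 100, tags = { highway = "motorway" }, refs = { 1, 2 } },
  { id = 101, tags = { building = "yes" },     refs = { 3, 4 } },
}

-- Pass 1: read ways only, noting the node IDs of ways that carry a wanted key.
local function collect_needed_nodes(way_list, wanted_keys)
  local needed = {}
  for _, way in ipairs(way_list) do
    for _, key in ipairs(wanted_keys) do
      if way.tags[key] ~= nil then
        for _, ref in ipairs(way.refs) do needed[ref] = true end
        break
      end
    end
  end
  return needed
end

-- Pass 2: read nodes, storing lat/lon only for the IDs noted in pass 1.
local function store_needed_nodes(node_list, needed)
  local store, count = {}, 0
  for _, node in ipairs(node_list) do
    if needed[node.id] then
      store[node.id] = { lat = node.lat, lon = node.lon }
      count = count + 1
    end
  end
  return store, count
end

local needed = collect_needed_nodes(ways, { "highway" })
local store, count = store_needed_nodes(nodes, needed)
print(count)  -- 2: only the motorway's nodes are stored
```

The cost of this approach is an extra read of the way data; the saving is that nodes used only by unwanted ways (here, the building's nodes 3 and 4) never enter the node store.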

Another possibility I've considered is adding a runtime switch that states all nodes in the .pbf are consecutively numbered (i.e. 1,2,3,4... without any gaps). You can produce .pbfs like this with osmium renumber. This would allow us to use a vector (array) rather than a sparse_map (hash) for node storage, which would lead to a useful memory saving.
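
The difference between the two storage schemes can be sketched in plain Lua as follows. This only illustrates the indexing: Lua tables do not reproduce the memory layout of a C++ vector or sparse_map, and `sparse_put`, `sparse_get`, `dense_put` and `dense_get` are hypothetical names, not tilemaker functions.

```lua
-- Sparse store: works for arbitrary IDs, but each entry is looked up by key.
local sparse = {}
local function sparse_put(id, lat, lon) sparse[id] = { lat, lon } end
local function sparse_get(id)
  local entry = sparse[id]
  if entry then return entry[1], entry[2] end
end

-- Dense store: valid only if IDs run 1..N without gaps (e.g. after renumbering).
-- The ID itself is the position, so no key needs to be stored alongside the data.
local dense = {}
local function dense_put(id, lat, lon)
  dense[2 * id - 1] = lat
  dense[2 * id] = lon
end
local function dense_get(id)
  return dense[2 * id - 1], dense[2 * id]
end

sparse_put(4123456789, 59.43, 24.75)  -- arbitrary OSM-style ID
dense_put(1, 59.43, 24.75)            -- the same node after renumbering
print(sparse_get(4123456789))
print(dense_get(1))
```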

Author

ttsirkia commented Dec 7, 2020

Thanks for the very detailed answer! I'll take a look at these options.

ttsirkia closed this as completed Dec 7, 2020
Author

ttsirkia commented Dec 7, 2020

Osmium seems to be the easiest option, as it can read and write .osm.pbf files directly: https://osmcode.org/osmium-tool/manual.html#filtering-by-tags

cldellow added a commit to cldellow/tilemaker that referenced this issue Dec 29, 2023
This PR generalizes the idea of `node_keys`, adds `way_keys`, and fixes systemed#402.

I'm not too sure if this is generally useful - it's useful for one of my
use cases, and I see someone asking about it in systemed#190
and, elsewhere, in onthegomap/planetiler#99

If you feel it complicates the maintainer story too much, please reject.

The goal is to reduce memory usage for users doing thematic extracts by
not indexing nodes that are only used by uninteresting ways.

For example, North America has ~1.8B nodes, needing 9.7GB of RAM for its node
store. By contrast, if your interest is only to build a railway map, you
require only ~8M nodes, needing 70MB of RAM.

Currently, a user can achieve this by pre-filtering their PBF using
osmium-tool. If you know exactly what you want, this is a good
long-term solution. But for one-offs and experimenting, it's a bit
cumbersome to iterate.

Sample use cases:

```lua
-- Building a map without building polygons - exclude them
way_keys = {"~building"}
```

```lua
-- Building a railway map
way_keys = {"railway"}
```

```lua
-- Building a map of major roads
way_keys = {"highway=motorway", "highway=trunk", "highway=primary", "highway=secondary"}
```

Nodes used in ways which are used in relations (as identified by
`relation_scan_function`) will always be indexed, regardless of
`node_keys` and `way_keys` settings that might exclude them.

Notes:

1. This is based on `lua-interop-3`, as it interacts with files that are
   changed by that. I can rebase against master after lua-interop-3 is
   merged.

2. The names `node_keys` and `way_keys` are perhaps out of date, as they
   can now express conditions on the values of tags in addition to their
   keys. Leaving them as-is is nice, as it's not a breaking change.
   But if breaking changes are OK, maybe these should be
   `node_filters` and `way_filters`?
cldellow added a commit to cldellow/tilemaker that referenced this issue Dec 29, 2023
This PR generalizes the idea of `node_keys`, adds `way_keys`, and fixes systemed#402.

I'm not too sure if this is generally useful - it's useful for one of my
use cases, and I see someone asking about it in systemed#190
and, elsewhere, in onthegomap/planetiler#99

If you feel it complicates the maintainer story too much, please reject.

The goal is to reduce memory usage for users doing thematic extracts by
not indexing nodes that are only used by uninteresting ways.

For example, North America has ~1.8B nodes, needing 9.7GB of RAM for its node
store. By contrast, if your interest is only to build a railway map, you
require only ~8M nodes, needing 70MB of RAM. Or, to build a map of
national/provincial parks, 12M nodes and ~120MB of RAM.

Currently, a user can achieve this by pre-filtering their PBF using
osmium-tool. If you know exactly what you want, this is a good
long-term solution. But if you're me, flailing about in the OSM data
model, it's convenient to be able to tweak something in the Lua script
and observe the results without having to re-filter the PBF and update
your tilemaker command to use the new PBF.

Sample use cases:

```lua
-- Building a map without building polygons, ~ excludes ways whose
-- only tags are matched by the filter.
way_keys = {"~building"}
```

```lua
-- Building a railway map
way_keys = {"railway"}
```

```lua
-- Building a map of major roads
way_keys = {"highway=motorway", "highway=trunk", "highway=primary", "highway=secondary"}
```

Nodes used in ways which are used in relations (as identified by
`relation_scan_function`) will always be indexed, regardless of
`node_keys` and `way_keys` settings that might exclude them.
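
As an illustration of how filter strings like those above could be interpreted, here is a plain-Lua sketch. It is an assumption based only on the wording in this message, not the actual implementation: `filter_matches` and `keep` are hypothetical names, and the rule for combining positive and negated filters is a guess.

```lua
-- Does a single filter ("key" or "key=value") match one tag?
local function filter_matches(filter, key, value)
  local fkey, fvalue = filter:match("^([^=]+)=(.*)$")
  if fkey then
    return key == fkey and value == fvalue
  end
  return key == filter
end

-- Decide whether an object with the given tags should be kept.
-- Assumed rule: keep it if any tag matches a positive filter, or, when
-- negated ("~") filters are present, if any tag is NOT matched by them.
local function keep(tags, filters)
  local positive, negative = {}, {}
  for _, f in ipairs(filters) do
    if f:sub(1, 1) == "~" then
      negative[#negative + 1] = f:sub(2)
    else
      positive[#positive + 1] = f
    end
  end
  for key, value in pairs(tags) do
    for _, f in ipairs(positive) do
      if filter_matches(f, key, value) then return true end
    end
    if #negative > 0 then
      local excluded = false
      for _, f in ipairs(negative) do
        if filter_matches(f, key, value) then excluded = true; break end
      end
      if not excluded then return true end
    end
  end
  return false
end

print(keep({ railway = "rail" }, { "railway" }))                -- true
print(keep({ highway = "residential" }, { "highway=primary" })) -- false
print(keep({ building = "yes" }, { "~building" }))              -- false
```

Under this reading, a way tagged only `building=yes` is dropped by `{"~building"}`, while a way that also carries another tag is kept.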

A concrete example, given a Lua script like:

```lua
function way_function()
  if Find("railway") ~= "" then
    Layer("lines", false)
  end
end
```

it takes 13GB of RAM and 100 seconds to process North America.

If you add:

```lua
way_keys = {"railway"}
```

It takes 2GB of RAM and 47 seconds.

Notes:

1. This is based on `lua-interop-3`, as it interacts with files that are
   changed by that. I can rebase against master after lua-interop-3 is
   merged.

2. The names `node_keys` and `way_keys` are perhaps out of date, as they
   can now express conditions on the values of tags in addition to their
   keys. Leaving them as-is is nice, as it's not a breaking change.
   But if breaking changes are OK, maybe these should be
   `node_filters` and `way_filters`?

3. Maybe the value for `node_keys` in the OMT profile should be
   expressed in terms of a negation, e.g. `node_keys = {"~created_by"}`?
   This would avoid issues like systemed#337

4. This also adds a SIGUSR1 handler during OSM processing, which prints
   the ID of the object currently being processed. This is helpful for
   tracking down slow geometries.