
Raster data footprint determination support #307

Merged: 12 commits merged into main from pv/generate-data-footprint on Aug 1, 2022

Conversation

@philvarner (Collaborator) commented Jun 9, 2022

Related Issue(s):

Description:

These changes add a new module raster_footprint that provides parameterized methods for generating data footprints from raster data. This includes a method to update the geometry of an Item based on the footprint determined from the data assets in the Item.
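
A minimal usage sketch of the intended workflow (the function and parameter names here are illustrative of the API under review, not guaranteed to match the merged code exactly):

```python
from pystac import Item

from stactools.core.utils import raster_footprint

# Read an existing Item, replace its geometry with a footprint derived from
# the raster data in its assets, and save it back out. `asset_names` and the
# function name are illustrative of the API discussed in this PR.
item = Item.from_file("item.json")
raster_footprint.update_geometry_from_asset_footprint(item, asset_names=["B01"])
item.save_object()
```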

PR checklist:

  • Code is formatted (run scripts/format).
  • Code lints properly (run scripts/lint).
  • Tests pass (run scripts/test).
  • Documentation has been updated to reflect changes, if applicable.
  • Changes are added to the CHANGELOG.

@pjhartzell (Collaborator)

Unsolicited opinions on the data_extent function:

  • I'm not a pyproj expert, but after looking at the docs, I don't think the accuracy option involves densification. It looks more like a guarantee on coordinate computation accuracy.
  • I think the check should be arr != src.nodata rather than against 0.
  • Checking against nodata will likely produce a lot of small polygons in certain rasters, so rasterio.features.sieve might be worth a look.
  • I'm interested in whether the scale parameter is beneficial in any way.
  • I think reprojecting the raster to wgs84 before rasterio.features.shapes and shapely.geometry.simplify might be worth looking at. If you reproject to wgs84 after pulling out the shapes, you start having to think about densification. Might be easier to just avoid that (see the sketch after this list).
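
For context, a minimal sketch of the shapes-then-simplify flow being discussed, assuming a single-band raster readable by rasterio; the file name, tolerance, and the fallback to 0 are illustrative only:

```python
import rasterio
import rasterio.features
from shapely.geometry import shape
from shapely.ops import unary_union

with rasterio.open("scene.tif") as src:
    arr = src.read(1)
    nodata = src.nodata if src.nodata is not None else 0  # illustrative fallback
    mask = (arr != nodata).astype("uint8")  # 1 = data, 0 = nodata
    data_polygons = [
        shape(geometry)
        for geometry, value in rasterio.features.shapes(mask, transform=src.transform)
        if value == 1
    ]

footprint = unary_union(data_polygons)
# The tolerance is in the raster's CRS units and has to be tuned per dataset.
footprint = footprint.simplify(tolerance=100, preserve_topology=False)
```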

@matthewhanson (Member)

Note that the issue for this is #85, where the initial draft code came from, and there are some examples of the code in action there.

Some responses to @pjhartzell:

  • [rasterio.features.sieve](https://github.com/rasterio/rasterio/blob/master/examples/sieve.py) does look like it would be worth using instead, with a parameter the user can pass through to it (see the sketch after this list).
  • Scale is likely not beneficial. The original idea was that you could generate a reasonable footprint by just reading an overview of the image rather than the whole array, but in some initial testing the results were too coarse. I recommend additional testing before ditching it, though.
  • Regarding densification, my experience is that the opposite problem tends to be the case, where the geometries are way too detailed. rasterio.features.shapes can create highly detailed geometries, and simplification is then required. The 2nd animated gif in issue #85 (Add method for generating data footprints) was simplified; the original polygon was massive.
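
A small sketch of how sieve could be exposed via a pass-through parameter, assuming the same kind of uint8 data mask as in the sketch above (the function and parameter names are hypothetical):

```python
import numpy as np
import rasterio.features


def sieved_data_mask(mask: np.ndarray, sieve_size: int = 0) -> np.ndarray:
    """Remove connected regions smaller than `sieve_size` pixels from a uint8
    data/nodata mask (1 = data, 0 = nodata). A size of 0 disables sieving and
    returns the mask unchanged."""
    if sieve_size > 0:
        mask = rasterio.features.sieve(mask, size=sieve_size)
    return mask
```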

@matthewhanson (Member)

I'll also add in here that while it's use-case driven whether overestimating or underestimating the data geometry is preferable, I think the default method should slightly underestimate the data geometry.

As a user, I don't want to get matches against scenes that end up having only nodata under my polygon; whereas if I miss a scene because it happened to have a handful of pixels on the edge that intersected with my polygon, I just don't see that as much of a big deal.

@pjhartzell (Collaborator)

@matthewhanson Agreed that shapely's simplify should be applied. My thought on densification (actually on how to avoid it) relates to the problem that will exist for, e.g., a clean raster where all data is valid (no nodata pixels). If you extract a polygon with rasterio's features.shapes, it's going to be just the four corners. If you then reproject that shape to wgs84, you can end up with a parallelogram that does not closely follow the edge of valid raster data when the raster is also projected to wgs84 (it only agrees at the corners). This is the case for MODIS and VIIRS data. But if you start by projecting the raster data to wgs84, and then follow with shape extraction and simplification, that issue can be avoided.

Imagine the raster is all valid data. Blue line is bad, Green line is better. We can get green without heuristic densification by projecting the raster to wgs84 before extracting the valid data shape.
[image: sketch of the two footprints over the raster]
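
A hedged sketch of that order of operations, reading the data through a rasterio WarpedVRT so the array is already in EPSG:4326 before the valid-data shape is extracted; this illustrates the idea, not the PR's actual implementation:

```python
import rasterio
import rasterio.features
from rasterio.crs import CRS
from rasterio.vrt import WarpedVRT
from shapely.geometry import shape
from shapely.ops import unary_union

with rasterio.open("scene.tif") as src, WarpedVRT(src, crs=CRS.from_epsg(4326)) as vrt:
    arr = vrt.read(1)
    nodata = vrt.nodata if vrt.nodata is not None else 0  # illustrative fallback
    mask = (arr != nodata).astype("uint8")
    footprint = unary_union(
        [
            shape(geometry)
            for geometry, value in rasterio.features.shapes(mask, transform=vrt.transform)
            if value == 1
        ]
    )

# The polygon edges already follow the warped data in WGS 84, so simplification
# can be applied directly without a densification step.
footprint = footprint.simplify(tolerance=0.001, preserve_topology=False)
```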

@matthewhanson (Member)

Ah, gotcha @pjhartzell that makes perfect sense.

Knowing whether the scaling factor is usable is even more important, then: if we can get away with using even a 1/2 overview, we greatly reduce computation. Not a big deal when doing a handful of images, but over a million files this could amount to some sizable $.

@pjhartzell (Collaborator)

Yeah, the computational aspect of this is a definite drawback.

@TomAugspurger (Collaborator)

One meta-comment: this is assuming that you're cataloging raster data that can be read with rasterio, and wouldn't work for tabular or Zarr / NetCDF assets (#218). I think the names data_extent and data_extents_for_data_assets, or the namespace these functions are in, should reflect that it's for raster data.

@gadomski linked an issue Jun 23, 2022 that may be closed by this pull request
@philvarner changed the title from "initial go at data footprint" to "Raster data footprint determination support" on Jun 24, 2022
@philvarner (Collaborator, Author)

To address some of these comments:

@pjhartzell

I'm not a pyproj expert, but after looking at the docs, I don't think the accuracy option involves densification. Looks more like coordinate computation accuracy guarantee.

That code isn't used, so not relevant.

I think the check should be arr != src.nodata rather than against 0

Yes -- this has been fixed. The code now either pulls nodata from the src metadata or requires the user to explicitly define the nodata value. I chose not to default it to 0: while I think 0 is the most common nodata value, it's more apparent behavior to always get the entire image extent when nodata is undefined than to silently default to 0 when your nodata value isn't actually 0.
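
A short sketch of that nodata resolution behavior (the names are illustrative, not the exact code in the PR):

```python
from typing import Optional

import numpy as np
import rasterio


def data_mask(href: str, no_data: Optional[float] = None) -> np.ndarray:
    """Return a uint8 mask where 1 marks data pixels.

    The nodata value is taken from the source metadata unless explicitly
    provided; if neither is defined, every pixel is treated as data, so the
    footprint falls back to the full image extent rather than assuming 0.
    """
    with rasterio.open(href) as src:
        arr = src.read(1)
        nodata = no_data if no_data is not None else src.nodata
    if nodata is None:
        return np.ones(arr.shape, dtype="uint8")
    return (arr != nodata).astype("uint8")
```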

Checking against nodata will likely produce a lot of small polygons in certain rasters, so looking at rasterio.features.sieve might be something to look at.

sieve wasn't the right option here -- sieve will connect the small nodata polygons, but then you just get larger complex polygons within the footprint. Using the convex hull gives what I would consider a footprint. Future work might be to attempt to calculate the alpha shape (a sort of "concave hull"), but I don't see a clear benefit in using the more complex alpha shape vs. the simpler convex hull. I think it's only beneficial when there are large, highly concave nodata areas that intersect only one side of an image, which aren't common. I could see it being useful for another custom property of an Item, but not for the Item geometry, where you generally want something simpler and less accurate rather than something larger and more accurate.
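
For illustration, the convex hull step looks roughly like this (a sketch, not the PR's code):

```python
from shapely.geometry import mapping
from shapely.ops import unary_union


def convex_hull_footprint(data_polygons) -> dict:
    """Collapse the possibly numerous, hole-riddled data polygons into a
    single simple footprint by taking the convex hull of their union, and
    return it as a GeoJSON-style dict suitable for an Item geometry."""
    return mapping(unary_union(data_polygons).convex_hull)
```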

Interested whether the scale is beneficial in any way

I left it in, but I'm still not sure what this is supposed to be used for.

I think reprojecting the raster to wgs84 before rasterio.features.shapes and shapely.geometry.simplify might be worth looking at. If you reproject to wgs84 after pulling out the shapes, you start having to think about densification. Might be easier to just avoid that.

We already had some out-of-band conversation about this.

@philvarner (Collaborator, Author)

@matthewhanson

Regarding densification, my experience is that the opposite problem tends to be the case, where the geometries are way too detailed. rasterio.features.shapes can create highly detailed geometries and simplification is then required. The 2nd animated gif in issue #85 was simplified; the original polygon was massive.

I think I've balanced the densification/simplification concerns with the method parameters and the documentation around them. I explicitly put in the docs that users should empirically validate what the footprint looks like, that they need to adjust the knobs to work for their particular data, and what those knobs mean. I think this is actually useful compared with a lot of the docs for the libraries I've used for this, which state what they do but don't really explain how to configure things for different types of data.

@philvarner (Collaborator, Author)

@matthewhanson

I'll also add in here that while it's use-case driven whether overestimating or underestimating the data geometry is preferable, I think the default method should slightly underestimate the data geometry.

I think we get a bit of each here -- using the convex hull for the data area will overestimate the data area if there are lots of no-data concavities. I'm not really sure what the general effect of densification/simplification is along the edges -- I think in many cases the over/under balances out in terms of total area, and there will be some edges that are a little over and some a little under.

@philvarner (Collaborator, Author)

@TomAugspurger

One meta-comment: this is assuming that you're cataloging raster data that can be read with rasterio, and wouldn't work for tabular or Zarr / NetCDF assets (#218). I think the names data_extent and data_extents_for_data_assets, or the namespace these functions are in, should reflect that it's for raster data.

Renamed the module to "raster_footprint".

@philvarner (Collaborator, Author)

@matthewhanson @pjhartzell

Thinking about this a little more, since I don't even understand how scale is supposed to be used, it should probably be documented better. Does someone want to explain it to me?

@philvarner philvarner marked this pull request as ready for review June 24, 2022 14:54
@pjhartzell (Collaborator)

sieve wasn't the right option here -- sieve will connect the small nodata polygons, but then you just get larger complex polygons within the footprint. Using the convex hull gives what I would consider a footprint. Future work might be to attempt to calculate the alpha shape (a sort of "concave hull"), but I don't see a clear benefit in using the more complex alpha shape vs. the simpler convex hull. I think it's only beneficial when there are large, highly concave nodata areas that intersect only one side of an image, which aren't common. I could see it being useful for another custom property of an Item, but not for the Item geometry, where you generally want something simpler and less accurate rather than something larger and more accurate.

I'm +1 on keeping things simple. I would push back a little in that if a nodata value is used to generate a polygon footprint, I think most users would expect the concave regions of nodata values to be excluded. So the use of a convex hull should be clear in the docs of the primary function(s), or maybe even in the function name.

Agreed that sieve will connect the small nodata polygons into larger complex polygons within the footprint. But those larger complex polygons can then be simplified, no? Holes, or inner polygons, will be an issue, but there might be some ways to remove them (see the sketch below). I'm open to leaving this to future work, though.
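
One possible approach to dropping inner polygons, left as a sketch for that future work (nothing like this is in the PR):

```python
from shapely.geometry import MultiPolygon, Polygon


def fill_holes(geometry):
    """Rebuild polygons from their exterior rings only, discarding any
    interior rings (holes) left behind by interior nodata regions."""
    if isinstance(geometry, Polygon):
        return Polygon(geometry.exterior)
    return MultiPolygon([Polygon(part.exterior) for part in geometry.geoms])
```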

I think I've balanced the densification/simplification concerns with the method parameters and the documentation around them. I explicitly put in the docs that users should empirically validate what the footprint looks like, that they need to adjust the knobs to work for their particular data, and what those knobs mean. I think this is actually useful compared with a lot of the docs for the libraries I've used for this, which state what they do but don't really explain how to configure things for different types of data.

I guess my concern with this is that projection distortion is not the same everywhere. So a savvy user will pick an area with the worst distortion and tune things there. Or just use a very high densification factor (easier). But if just using a very high densification factor will do the trick, is there any reason we can't do this (with a bit more intelligence than just a very high value) for the user, and avoid users having to empirically tune the densification? Again, perhaps a future improvement.

Speaking of densification, does densification by distance rather than by factor make sense? When densifying by factor, you can end up with a lot of very closely spaced points where the original polygon has very close vertices, and vice versa where the original vertices are far apart.

@philvarner (Collaborator, Author)

The convexity and inclusion of nodata should be sufficiently documented. Let me know if you think it needs more.

I guess my concern with this is that projection distortion is not the same everywhere. So a savvy user will pick an area with the worst distortion and tune things there. Or just use a very high densification factor (easier). But if just using a very high densification factor will do the trick, is there any reason we can't do this (with a bit more intelligence than just a very high value) for the user, and avoid users having to empirically tune the densification? Again, perhaps a future improvement.

This is a good point -- we may want to just default the densification factor to 10. This makes reprojection 10x slower, but the additional points get cleaned up by the simplification, so the resulting geometry isn't necessarily 10x larger.

Speaking of densification, does densification by distance rather than by factor make sense? When densifying by factor, you can end up with a lot of very closely spaced points where the original polygon has very close vertices, and vice versa where the original vertices are far apart.

I think it would - we'd have to write our own code for this, as shapely only does equidistant densification.
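
A rough sketch of what densification by distance could look like, as opposed to densification by factor; this illustrates the idea only, and is neither the linked function nor the PR's _densify:

```python
import numpy as np


def densify_by_distance(coords, max_spacing: float):
    """Insert vertices so that no two consecutive points along an edge are
    more than `max_spacing` apart (in the coordinates' units). Long edges get
    many new points and short edges get few or none, unlike by-factor
    densification, which treats every edge the same."""
    densified = []
    for (x0, y0), (x1, y1) in zip(coords[:-1], coords[1:]):
        length = float(np.hypot(x1 - x0, y1 - y0))
        # Number of segments needed so that each is no longer than max_spacing.
        segments = max(int(np.ceil(length / max_spacing)), 1)
        xs = np.linspace(x0, x1, segments, endpoint=False)
        ys = np.linspace(y0, y1, segments, endpoint=False)
        densified.extend(zip(xs, ys))
    densified.append(tuple(coords[-1]))
    return densified
```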

@pjhartzell (Collaborator)

we may want to just default the densification factor to 10. This makes reprojection 10x slower.

Point reprojection should be pretty cheap, though. At least compared to reprojecting an image. So I think it'd be OK to add more points.

as shapely only does equidistant densification

I think a custom function (_densify) is currently being used for the densification. Here's a shameless link to a by-distance densification function.

@pjhartzell (Collaborator) left a review comment:

Other than a suggestion to remove the explicit preserve_topology setting in densify_reproject_simplify, I think this is good to go. We can look at handling concavity and perhaps a refined or automatic "densification" method going forward.

Review comments were left on src/stactools/core/utils/raster_footprint.py and tests/core/utils/test_raster_footprint.py (all resolved).
@philvarner (Collaborator, Author)

@pjhartzell @gadomski @matthewhanson that should resolve all the outstanding issues.

@gadomski gadomski self-requested a review July 19, 2022 19:55
@gadomski gadomski mentioned this pull request Jul 25, 2022
@gadomski (Member) commented Jul 28, 2022

@philvarner @pjhartzell I've been playing with this branch a bit this morning, and when I checked it against the landsat test data, I was surprised to find that the raster footprint did not include all pixels:

[image: Landsat raster footprint excluding some pixels]

Is this expected behavior? I think this has to do with the densification parameters (e.g. the MODIS example Preston posted in this thread), but I just wanted to make sure.

@gadomski gadomski requested a review from pjhartzell July 28, 2022 19:35
@gadomski (Member) commented Jul 28, 2022

@philvarner @pjhartzell I've made a couple of additions/changes to this PR:

  1. Added a command line interface, so you can now do stac update-geometry item.json
  2. Reduced the size of the landsat test file -- it was ~95MB, so I cropped it down to ~3MB

I re-requested your review, @pjhartzell -- if you wouldn't mind, look at the command line interface in particular, since no one else has put eyes on that.

NB: I've done a rebase of this branch (which is how I found the large file), so you'll have to force-reset any local branches. I plan on a rebase-with-partial-squash after approval to remove the large landsat file from the final history.

@pjhartzell (Collaborator)

@gadomski Still working through the review, but in answer to your question, the reason all the pixels are not included is that DEFAULT_PRECISION = 3 is very low for geodetic coordinates. Probably a leftover from thinking about things in meters. As a surveyor, my default precision for geodetic coordinates is 8, which is roughly a millimeter at the equator. A rule of thumb for geodetic precision (decimal places vs. approximate ground distance) is:

  • 0 = 100,000m
  • 1 = 10,000m
  • 2 = 1,000m
  • 3 = 100m
  • 4 = 10m
  • 5 = 1m
  • 6 = 0.1m

and so on. Here's a snapshot of the footprint with a precision of 3 (red) and 8 (green). I did not apply any simplification, so a precision of 8 wraps around each pixel:
[image: footprint at precision 3 (red) vs. precision 8 (green)]

Of course, this leads to a very large geometry coordinate list. If we keep a precision of 8 and set a simplification tolerance of roughly twice the pixel size (30 m x 2 = 60 m --> ~0.0006 degrees for this Landsat band), we get something closer to what we expect (blue line):
[image: footprint with precision 8 and a ~60 m simplification tolerance (blue)]
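
For reference, the arithmetic behind that tolerance, using the rough approximation of 111,320 m per degree at the equator:

```python
# Twice the 30 m Landsat pixel size, converted from meters to degrees using
# the approximate length of one degree at the equator.
pixel_size_m = 30
tolerance_m = 2 * pixel_size_m          # 60 m
tolerance_deg = tolerance_m / 111_320   # ~0.000539
print(f"{tolerance_deg:.6f}")           # roughly 0.0006 degrees
```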

@gadomski gadomski requested a review from pjhartzell July 29, 2022 15:37
@pjhartzell (Collaborator) left a review comment:

Ran the CLI on some files. Works well. Looks like the updated precision broke the tests, though. 🙁

A review comment was left on src/stactools/core/utils/raster_footprint.py (resolved).
@pjhartzell (Collaborator) left a review comment:

Looks good.

@gadomski gadomski merged commit 9e81ffe into main Aug 1, 2022
@gadomski gadomski deleted the pv/generate-data-footprint branch August 1, 2022 17:59
Successfully merging this pull request may close: Add method for generating data footprints (#85).