
Raster data footprint determination support #307

Merged: 12 commits merged into main from pv/generate-data-footprint on Aug 1, 2022

Conversation

@philvarner (Collaborator) commented Jun 9, 2022

Related Issue(s):

Description:

These changes add a new module raster_footprint that provides parameterized methods for generating data footprints from raster data. This includes a method to update the geometry of an Item based on the footprint determined from the data assets in the Item.
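
A minimal usage sketch of the intended workflow (the function and parameter names here are illustrative of the API under review, not guaranteed to match the merged code exactly):

```python
from pystac import Item

from stactools.core.utils import raster_footprint

# Read an existing Item, replace its geometry with a footprint derived from
# the raster data in its assets, and save it back out. `asset_names` and the
# function name are illustrative of the API discussed in this PR.
item = Item.from_file("item.json")
raster_footprint.update_geometry_from_asset_footprint(item, asset_names=["B01"])
item.save_object()
```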

PR checklist:

  • Code is formatted (run scripts/format).
  • Code lints properly (run scripts/lint).
  • Tests pass (run scripts/test).
  • Documentation has been updated to reflect changes, if applicable.
  • Changes are added to the CHANGELOG.

@pjhartzell (Collaborator)

Unsolicited opinions on the data_extent function:

  • I'm not a pyproj expert, but after looking at the docs, I don't think the accuracy option involves densification. It looks more like a guarantee on coordinate computation accuracy.
  • I think the check should be arr != src.nodata rather than against 0.
  • Checking against nodata will likely produce a lot of small polygons in certain rasters, so rasterio.features.sieve might be worth a look.
  • I'm interested in whether the scale parameter is beneficial in any way.
  • I think reprojecting the raster to wgs84 before rasterio.features.shapes and shapely.geometry.simplify might be worth looking at. If you reproject to wgs84 after pulling out the shapes, you start having to think about densification. Might be easier to just avoid that (see the sketch after this list).
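
For context, a minimal sketch of the shapes-then-simplify flow being discussed, assuming a single-band raster readable by rasterio; the file name, tolerance, and the fallback to 0 are illustrative only:

```python
import rasterio
import rasterio.features
from shapely.geometry import shape
from shapely.ops import unary_union

with rasterio.open("scene.tif") as src:
    arr = src.read(1)
    nodata = src.nodata if src.nodata is not None else 0  # illustrative fallback
    mask = (arr != nodata).astype("uint8")  # 1 = data, 0 = nodata
    data_polygons = [
        shape(geometry)
        for geometry, value in rasterio.features.shapes(mask, transform=src.transform)
        if value == 1
    ]

footprint = unary_union(data_polygons)
# The tolerance is in the raster's CRS units and has to be tuned per dataset.
footprint = footprint.simplify(tolerance=100, preserve_topology=False)
```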

@matthewhanson (Member)

Note that the issue for this is #85, where the initial draft code came from, and there are some examples of the code in action there.

Some responses to @pjhartzell:

  • [rasterio.features.sieve](https://github.com/rasterio/rasterio/blob/master/examples/sieve.py) does look like it would be worth using instead, with a parameter the user can pass through to it (see the sketch after this list).
  • Scale is likely not beneficial. The original idea was that you could generate a reasonable footprint by just reading an overview of the image rather than the whole array, but in some initial testing the results were too coarse. I recommend additional testing before ditching it, though.
  • Regarding densification, my experience is that the opposite problem tends to be the case, where the geometries are way too detailed. rasterio.features.shapes can create highly detailed geometries, and simplification is then required. The 2nd animated gif in issue #85 (Add method for generating data footprints) was simplified; the original polygon was massive.
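
A small sketch of how sieve could be exposed via a pass-through parameter, assuming the same kind of uint8 data mask as in the sketch above (the function and parameter names are hypothetical):

```python
import numpy as np
import rasterio.features


def sieved_data_mask(mask: np.ndarray, sieve_size: int = 0) -> np.ndarray:
    """Remove connected regions smaller than `sieve_size` pixels from a uint8
    data/nodata mask (1 = data, 0 = nodata). A size of 0 disables sieving and
    returns the mask unchanged."""
    if sieve_size > 0:
        mask = rasterio.features.sieve(mask, size=sieve_size)
    return mask
```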

@matthewhanson (Member)

I'll also add in here that while it's use-case driven whether overestimating or underestimating the data geometry is preferable, I think the default method should slightly underestimate the data geometry.

As a user, I don't want to get matches against scenes that end up having only nodata under my polygon; whereas if I miss a scene because it happened to have a handful of pixels on the edge that intersected with my polygon, I just don't see that as much of a big deal.

@pjhartzell (Collaborator)

@matthewhanson Agreed that shapely's simplify should be applied. My thought on densification (actually on how to avoid it) relates to the problem that will exist for, e.g., a clean raster where all data is valid (no nodata pixels). If you extract a polygon with rasterio's features.shapes, it's going to be just the four corners. If you then reproject that shape to wgs84, you can end up with a parallelogram that does not closely follow the edge of valid raster data when the raster is also projected to wgs84 (it only agrees at the corners). This is the case for MODIS and VIIRS data. But if you start by projecting the raster data to wgs84, and then follow with shape extraction and simplification, that issue can be avoided.

Imagine the raster is all valid data. Blue line is bad, Green line is better. We can get green without heuristic densification by projecting the raster to wgs84 before extracting the valid data shape.
[image: sketch of the two footprints over the raster]
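
A hedged sketch of that order of operations, reading the data through a rasterio WarpedVRT so the array is already in EPSG:4326 before the valid-data shape is extracted; this illustrates the idea, not the PR's actual implementation:

```python
import rasterio
import rasterio.features
from rasterio.crs import CRS
from rasterio.vrt import WarpedVRT
from shapely.geometry import shape
from shapely.ops import unary_union

with rasterio.open("scene.tif") as src, WarpedVRT(src, crs=CRS.from_epsg(4326)) as vrt:
    arr = vrt.read(1)
    nodata = vrt.nodata if vrt.nodata is not None else 0  # illustrative fallback
    mask = (arr != nodata).astype("uint8")
    footprint = unary_union(
        [
            shape(geometry)
            for geometry, value in rasterio.features.shapes(mask, transform=vrt.transform)
            if value == 1
        ]
    )

# The polygon edges already follow the warped data in WGS 84, so simplification
# can be applied directly without a densification step.
footprint = footprint.simplify(tolerance=0.001, preserve_topology=False)
```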

@matthewhanson (Member)

Ah, gotcha @pjhartzell that makes perfect sense.

Knowing whether the scaling factor is usable is even more important, then: if we can get away with using even a 1/2 overview, we greatly reduce computation. Not a big deal when doing a handful of images, but over a million files this could amount to some sizable $.

@pjhartzell (Collaborator)

Yeah, the computational aspect of this is a definite drawback.

@TomAugspurger (Collaborator)

One meta-comment: this is assuming that you're cataloging raster data that can be read with rasterio, and wouldn't work for tabular or Zarr / NetCDF assets (#218). I think the names data_extent and data_extents_for_data_assets, or the namespace these functions are in, should reflect that it's for raster data.

@gadomski linked an issue Jun 23, 2022 that may be closed by this pull request
@philvarner changed the title from "initial go at data footprint" to "Raster data footprint determination support" on Jun 24, 2022
@philvarner (Collaborator, Author)

To address some of these comments:

@pjhartzell

I'm not a pyproj expert, but after looking at the docs, I don't think the accuracy option involves densification. Looks more like coordinate computation accuracy guarantee.

That code isn't used, so not relevant.

I think the check should be arr != src.nodata rather than against 0

Yes -- this has been fixed. The code now either pulls nodata from the src metadata or requires the user to explicitly define the nodata value. I chose not to default it to 0: while I think 0 is the most common nodata value, it's more apparent behavior to always get the entire image extent when nodata is undefined than to silently default to 0 when your nodata value isn't actually 0.
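
A short sketch of that nodata resolution behavior (the names are illustrative, not the exact code in the PR):

```python
from typing import Optional

import numpy as np
import rasterio


def data_mask(href: str, no_data: Optional[float] = None) -> np.ndarray:
    """Return a uint8 mask where 1 marks data pixels.

    The nodata value is taken from the source metadata unless explicitly
    provided; if neither is defined, every pixel is treated as data, so the
    footprint falls back to the full image extent rather than assuming 0.
    """
    with rasterio.open(href) as src:
        arr = src.read(1)
        nodata = no_data if no_data is not None else src.nodata
    if nodata is None:
        return np.ones(arr.shape, dtype="uint8")
    return (arr != nodata).astype("uint8")
```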

Checking against nodata will likely produce a lot of small polygons in certain rasters, so looking at rasterio.features.sieve might be something to look at.

sieve wasn't the right option here -- sieve will connect the small nodata polygons, but then you just get larger complex polygons within the footprint. Using the convex hull gives what I would consider a footprint. Future work might be to attempt to calculate the alpha shape (a sort of "concave hull"), but I don't see a clear benefit in using the more complex alpha shape vs. the simpler convex hull. I think it's only beneficial when there are large, highly concave nodata areas that intersect only one side of an image, which aren't common. I could see it being useful for another custom property of an Item, but not for the Item geometry, where you generally want something simpler and less accurate rather than something larger and more accurate.
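
For illustration, the convex hull step looks roughly like this (a sketch, not the PR's code):

```python
from shapely.geometry import mapping
from shapely.ops import unary_union


def convex_hull_footprint(data_polygons) -> dict:
    """Collapse the possibly numerous, hole-riddled data polygons into a
    single simple footprint by taking the convex hull of their union, and
    return it as a GeoJSON-style dict suitable for an Item geometry."""
    return mapping(unary_union(data_polygons).convex_hull)
```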

Interested whether the scale is beneficial in any way

I left it in, but I'm still not sure what this is supposed to be used for.

I think reprojecting the raster to wgs84 before rasterio.features.shapes and shapely.geometry.simplify might be worth looking at. If you reproject to wgs84 after pulling out the shapes, you start having to think about densification. Might be easier to just avoid that.

We already had some out-of-band conversation about this.

@philvarner (Collaborator, Author)

@matthewhanson

Regarding densification, my experience is that the opposite problem tends to be the case, where the geometries are way too detailed. rasterio.features.shapes can create highly detailed geometries and simplification is then required. The 2nd animated gif in issue #85 was simplified; the original polygon was massive.

I think I've balanced the densification/simplification concerns with the method parameters and the documentation around them. I explicitly put in the docs that users should empirically validate what the footprint looks like, that they need to adjust the knobs to work for their particular data, and what those knobs mean. I think this is actually useful compared with a lot of the docs for the libraries I've used for this, which state what they do but don't really explain how to configure things for different types of data.

@philvarner (Collaborator, Author)

@matthewhanson

I'll also add in here that while it's use-case driven whether overestimating or underestimating the data geometry is preferable, I think the default method should slightly underestimate the data geometry.

I think we get a bit of each here -- using the convex hull for the data area will overestimate the data area if there are lots of no-data concavities. I'm not really sure what the general effect of densification/simplification is along the edges -- I think in many cases the over/under balances out in terms of total area, and there will be some edges that are a little over and some a little under.

@philvarner (Collaborator, Author)

@TomAugspurger

One meta-comment: this is assuming that you're cataloging raster data that can be read with rasterio, and wouldn't work for tabular or Zarr / NetCDF assets (#218). I think the names data_extent and data_extents_for_data_assets, or the namespace these functions are in, should reflect that it's for raster data.

Renamed the module to "raster_footprint".

@philvarner (Collaborator, Author)

@matthewhanson @pjhartzell

Thinking about this a little more, since I don't even understand how scale is supposed to be used, it should probably be documented better. Does someone want to explain it to me?

@philvarner philvarner marked this pull request as ready for review June 24, 2022 14:54
@pjhartzell (Collaborator)

sieve wasn't the right option here -- sieve will connect the small nodata polygons, but then you just get larger complex polygons within the footprint. Using the convex hull gives what I would consider a footprint. Future work might be to attempt to calculate the alpha shape (a sort of "concave hull"), but I don't see a clear benefit in using the more complex alpha shape vs. the simpler convex hull. I think it's only beneficial when there are large, highly concave nodata areas that intersect only one side of an image, which aren't common. I could see it being useful for another custom property of an Item, but not for the Item geometry, where you generally want something simpler and less accurate rather than something larger and more accurate.

I'm +1 on keeping things simple. I would push back a little in that if a nodata value is used to generate a polygon footprint, I think most users would expect the concave regions of nodata values to be excluded. So the use of a convex hull should be clear in the docs of the primary function(s), or maybe even in the function name.

Agreed that sieve will connect the small nodata polygons into larger complex polygons within the footprint. But those larger complex polygons can then be simplified, no? Holes, or inner polygons, will be an issue, but there might be some ways to remove them (see the sketch below). I'm open to leaving this to future work, though.
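
One possible approach to dropping inner polygons, left as a sketch for that future work (nothing like this is in the PR):

```python
from shapely.geometry import MultiPolygon, Polygon


def fill_holes(geometry):
    """Rebuild polygons from their exterior rings only, discarding any
    interior rings (holes) left behind by interior nodata regions."""
    if isinstance(geometry, Polygon):
        return Polygon(geometry.exterior)
    return MultiPolygon([Polygon(part.exterior) for part in geometry.geoms])
```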

I think I've balanced the densification/simplification concerns with the method parameters and the documentation around them. I explicitly put in the docs that users should empirically validate what the footprint looks like, that they need to adjust the knobs to work for their particular data, and what those knobs mean. I think this is actually useful compared with a lot of the docs for the libraries I've used for this, which state what they do but don't really explain how to configure things for different types of data.

I guess my concern with this is that projection distortion is not the same everywhere. So a savvy user will pick an area with the worst distortion and tune things there. Or just use a very high densification factor (easier). But if just using a very high densification factor will do the trick, is there any reason we can't do this (with a bit more intelligence than just a very high value) for the user, and avoid users having to empirically tune the densification? Again, perhaps a future improvement.

Speaking of densification, does densification by distance rather than by factor make sense? When densifying by factor, you can end up with a lot of very closely spaced points where the original polygon has very close vertices, and vice versa where the original vertices are far apart.

@philvarner (Collaborator, Author)

The convexity and inclusion of nodata should be sufficiently documented. Let me know if you think it needs more.

I guess my concern with this is that projection distortion is not the same everywhere. So a savvy user will pick an area with the worst distortion and tune things there. Or just use a very high densification factor (easier). But if just using a very high densification factor will do the trick, is there any reason we can't do this (with a bit more intelligence than just a very high value) for the user, and avoid users having to empirically tune the densification? Again, perhaps a future improvement.

This is a good point -- we may want to just default the densification factor to 10. This makes reprojection 10x slower, but the additional points get cleaned up by the simplification, so the resulting geometry isn't necessarily 10x larger.

Speaking of densification, does densification by distance rather than by factor make sense? When densifying by factor, you can end up with a lot of very closely spaced points where the original polygon has very close vertices, and vice versa where the original vertices are far apart.

I think it would - we'd have to write our own code for this, as shapely only does equidistant densification.
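
A rough sketch of what densification by distance could look like, as opposed to densification by factor; this illustrates the idea only, and is neither the linked function nor the PR's _densify:

```python
import numpy as np


def densify_by_distance(coords, max_spacing: float):
    """Insert vertices so that no two consecutive points along an edge are
    more than `max_spacing` apart (in the coordinates' units). Long edges get
    many new points and short edges get few or none, unlike by-factor
    densification, which treats every edge the same."""
    densified = []
    for (x0, y0), (x1, y1) in zip(coords[:-1], coords[1:]):
        length = float(np.hypot(x1 - x0, y1 - y0))
        # Number of segments needed so that each is no longer than max_spacing.
        segments = max(int(np.ceil(length / max_spacing)), 1)
        xs = np.linspace(x0, x1, segments, endpoint=False)
        ys = np.linspace(y0, y1, segments, endpoint=False)
        densified.extend(zip(xs, ys))
    densified.append(tuple(coords[-1]))
    return densified
```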

@pjhartzell (Collaborator)

we may want to just default the densification factor to 10. This makes reprojection 10x slower.

Point reprojection should be pretty cheap, though. At least compared to reprojecting an image. So I think it'd be OK to add more points.

as shapely only does equidistant densification

I think a custom function (_densify) is currently being used for the densification. Here's a shameless link to a by-distance densification function.

@pjhartzell (Collaborator) left a review comment:

Other than a suggestion to remove the explicit preserve_topology setting in densify_reproject_simplify, I think this is good to go. We can look at handling concavity and perhaps a refined or automatic "densification" method going forward.

Review comments were left on src/stactools/core/utils/raster_footprint.py and tests/core/utils/test_raster_footprint.py (all resolved).
@philvarner (Collaborator, Author)

@pjhartzell @gadomski @matthewhanson that should resolve all the outstanding issues.

@gadomski gadomski self-requested a review July 19, 2022 19:55
@gadomski gadomski mentioned this pull request Jul 25, 2022
@gadomski (Member) commented Jul 28, 2022

@philvarner @pjhartzell I've been playing with this branch a bit this morning, and when I checked it against the landsat test data, I was surprised to find that the raster footprint did not include all pixels:

[image: Landsat raster footprint excluding some pixels]

Is this expected behavior? I think this has to do with the densification parameters (e.g. the MODIS example Preston posted in this thread), but I just wanted to make sure.

@gadomski gadomski requested a review from pjhartzell July 28, 2022 19:35
@gadomski (Member) commented Jul 28, 2022

@philvarner @pjhartzell I've made a couple of additions/changes to this PR:

  1. Added a command line interface, so you can now do stac update-geometry item.json
  2. Reduced the size of the landsat test file -- it was ~95MB, so I cropped it down to ~3MB

I re-requested your review, @pjhartzell -- if you wouldn't mind, look at the command line interface in particular, since no one else has put eyes on that.

NB: I've done a rebase of this branch (which is how I found the large file), so you'll have to force-reset any local branches. I plan on a rebase-with-partial-squash after approval to remove the large landsat file from the final history.

@pjhartzell (Collaborator)

@gadomski Still working through the review, but in answer to your question, the reason all the pixels are not included is that DEFAULT_PRECISION = 3 is very low for geodetic coordinates. Probably a leftover from thinking about things in meters. As a surveyor, my default precision for geodetic coordinates is 8, which is roughly a millimeter at the equator. A rule of thumb for geodetic precision (decimal places vs. approximate ground distance) is:

  • 0 = 100,000m
  • 1 = 10,000m
  • 2 = 1,000m
  • 3 = 100m
  • 4 = 10m
  • 5 = 1m
  • 6 = 0.1m

and so on. Here's a snapshot of the footprint with a precision of 3 (red) and 8 (green). I did not apply any simplification, so a precision of 8 wraps around each pixel:
[image: footprint at precision 3 (red) vs. precision 8 (green)]

Of course, this leads to a very large geometry coordinate list. If we keep a precision of 8 and set a simplification tolerance of roughly twice the pixel size (30 m x 2 = 60 m --> ~0.0006 degrees for this Landsat band), we get something closer to what we expect (blue line):
[image: footprint with precision 8 and a ~60 m simplification tolerance (blue)]
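
For reference, the arithmetic behind that tolerance, using the rough approximation of 111,320 m per degree at the equator:

```python
# Twice the 30 m Landsat pixel size, converted from meters to degrees using
# the approximate length of one degree at the equator.
pixel_size_m = 30
tolerance_m = 2 * pixel_size_m          # 60 m
tolerance_deg = tolerance_m / 111_320   # ~0.000539
print(f"{tolerance_deg:.6f}")           # roughly 0.0006 degrees
```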

@gadomski gadomski requested a review from pjhartzell July 29, 2022 15:37
@pjhartzell (Collaborator) left a review comment:

Ran the CLI on some files. Works well. Looks like the updated precision broke the tests, though. 🙁

A review comment was left on src/stactools/core/utils/raster_footprint.py (resolved).
@pjhartzell (Collaborator) left a review comment:

Looks good.

@gadomski gadomski merged commit 9e81ffe into main Aug 1, 2022
@gadomski gadomski deleted the pv/generate-data-footprint branch August 1, 2022 17:59
Successfully merging this pull request may close: Add method for generating data footprints (#85).