Initial Ideas for a spec #28

cholmes · 2023-09-26T20:47:08Z

This probably will make sense to live somewhere else other than this repo, but starting it here for discussion. It needs more work, but wanted to put up progress.

TomAugspurger

Thanks @cholmes!

I wonder if we can provide any guidance around partitioning. Maybe we run into the same issue as with geoparquet generally, that it's hard to do in general.

But for the Planetary Computer, all of our STAC-geoparquet datasets are partitioned by time. We aim for a time-based partitioning (so weeks, months, quarters, etc.) that gives parquet files of ~50-500MB.
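
For illustration only (this is not the Planetary Computer's actual pipeline), time-based bucketing of items can be sketched in plain Python. The function names, the list of candidate granularities, and the per-item size estimate are all assumptions made for this sketch; only the ~50-500MB target comes from the comment above.

```python
from collections import defaultdict
from datetime import datetime

# Candidate bucket sizes, finest to coarsest (an assumption for this sketch).
GRANULARITIES = ("week", "month", "quarter", "year")


def partition_key(dt, granularity):
    """Return a label for the time bucket that `dt` falls into."""
    if granularity == "year":
        return f"{dt.year}"
    if granularity == "quarter":
        return f"{dt.year}-Q{(dt.month - 1) // 3 + 1}"
    if granularity == "month":
        return f"{dt.year}-{dt.month:02d}"
    if granularity == "week":
        iso = dt.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    raise ValueError(f"unknown granularity: {granularity!r}")


def partition_items(items, granularity):
    """Group STAC-like item dicts by the bucket of their `datetime` property."""
    groups = defaultdict(list)
    for item in items:
        raw = item["properties"]["datetime"]
        dt = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        groups[partition_key(dt, granularity)].append(item)
    return dict(groups)


def pick_granularity(items, bytes_per_item,
                     min_bytes=50_000_000, max_bytes=500_000_000):
    """Return the finest granularity whose average partition size is in range.

    `bytes_per_item` is a caller-supplied estimate; returns None if no
    candidate granularity lands in the target range.
    """
    total = len(items) * bytes_per_item
    for granularity in GRANULARITIES:
        n_parts = len(partition_items(items, granularity))
        if n_parts and min_bytes <= total / n_parts <= max_bytes:
            return granularity
    return None
```

Each group returned by `partition_items` would then be written out as its own parquet file with whatever writer you already use.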

TomAugspurger · 2023-09-27T01:51:03Z

spec/stac-geoparquet-spec.md

+
+| Field           | GeoParquet Type    | Required | Details                                            |
+| --------------- | ------------------ | ---------|--------------------------------------------------- |
+| type            | String             | Optional | This is just needed for GeoJSON, so it is optional and not recommended to include in GeoParquet |


I'm trying to decide on whether or not this should be recommended to be included. I agree it's not really useful, since it's constant. But in terms of storage, this can hopefully be very small with a dictionary encoding.

I also wasn't sure on whether it should be included or not. I agree it should be trivial on disk and in memory (if dictionary encoded in Arrow as well).

I lean towards no, mostly to communicate that the core idea behind this is not being able to naively roundtrip from a GeoJSON file. I view this 'spec' as a GeoParquet instantiation of the core fields of an abstract STAC idea. The only reason `type` is in STAC is to make it valid GeoJSON. GeoParquet uses a different mechanism (Parquet's metadata) to make a parquet file 'geo'. So philosophically I don't think we should just include it. If anything I lean towards something even stronger, like not even listing it as an 'optional field', but mentioning that it's ok to include if you must.
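
To make the storage point above concrete, here is a toy dictionary encoder in plain Python. It is a conceptual sketch, not Parquet's or Arrow's actual encoder, and both function names are made up for this example. It shows why a column holding the same value on every row collapses to almost nothing once the dictionary indices are run-length encoded.

```python
def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values, in first-seen order
    index_of = {}     # value -> position in `dictionary`
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(indices):
    """Collapse consecutive repeats into (index, run_length) pairs."""
    runs = []
    for index in indices:
        if runs and runs[-1][0] == index:
            runs[-1][1] += 1
        else:
            runs.append([index, 1])
    return [tuple(run) for run in runs]


# A constant `type` column: one dictionary entry and a single run.
dictionary, indices = dictionary_encode(["Feature"] * 1000)
runs = run_length_encode(indices)
```

For the constant column, `dictionary` holds one string and `runs` holds one pair, regardless of the number of rows.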

kylebarron · 2023-09-27T15:19:42Z

spec/stac-geoparquet-spec.md

+
+Generally most all the fields in a STAC Item should be mapped to a row in GeoParquet. We embrace Parquet structures where possible, mapping
+from JSON into nested structures. We do pull the properties to the top level, so that it is easier to query and use them. The names of the
+most of the fields should be the same in STAC and in GeoParquet.


Are there any fields whose names are not the same in STAC and GeoParquet?

I started the table with separate columns for the STAC name and the GeoParquet name, but it seemed useful to just always have them be the same. I think there's a better way to do the table, though.
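
The "pull the properties to the top level" mapping from the diff above can be sketched as a small function. This is a minimal illustration in plain Python, assuming items are ordinary dicts; `item_to_row` is a hypothetical name, not a function from the stac-geoparquet library, and raising on a name clash is a choice made for this sketch only.

```python
def item_to_row(item):
    """Flatten a STAC Item dict into a single row-like dict.

    Top-level fields keep their names; each entry of `properties` is
    promoted to the top level under its original name.
    """
    row = {key: value for key, value in item.items() if key != "properties"}
    for name, value in item.get("properties", {}).items():
        if name in row:
            # A property that shadows a top-level field would be ambiguous.
            raise ValueError(f"property {name!r} clashes with a top-level field")
        row[name] = value
    return row


example_item = {
    "id": "item-1",
    "bbox": [0.0, 0.0, 1.0, 1.0],
    "properties": {"datetime": "2023-09-26T20:47:08Z", "eo:cloud_cover": 12},
}
example_row = item_to_row(example_item)
```

After flattening, `datetime` and `eo:cloud_cover` sit next to `id` and `bbox`, which is what makes them directly queryable as columns.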

kylebarron · 2023-09-27T15:29:56Z

> But for the Planetary Computer, all of our STAC-geoparquet datasets are partitioned by time. We aim for a time-based partitioning (so weeks, months, quarters, etc.) that gives parquet files of ~50-500MB.

I think time-based partitioning generally makes sense for STAC, because you know the data will be static after the current time period has ended

cholmes · 2023-09-27T18:32:15Z

> > But for the Planetary Computer, all of our STAC-geoparquet datasets are partitioned by time. We aim for a time-based partitioning (so weeks, months, quarters, etc.) that gives parquet files of ~50-500MB.
>
> I think time-based partitioning generally makes sense for STAC, because you know the data will be static after the current time period has ended

I think it makes sense for most data that is in STAC, but I could see data in STAC that doesn't make sense to partition by time. But yeah, I think some general recommendations on partitioning parquet data likely make sense here. Ideally we'd point at some GeoParquet partitioning best practices and recommend things from there. GeoParquet should have general recommendations along the lines of: if you have large amounts of data and it's common to query over time, then a time partition makes sense, with advice on how to do that. But it seems fine to start some of that here.

TomAugspurger · 2023-10-07T14:12:38Z

I pushed a few changes in 9f3cfff.

I'm pretty happy with where this is at. I think it'd be good to take some of the examples from the stac-spec repo and show what a stac-geoparquet representation would look like (either code to make the actual parquet files, or maybe just showing the parquet / Arrow schema you'd get).

TomAugspurger · 2023-10-15T15:55:23Z

I think this is in a good spot. I'll merge it in a couple of days, unless there's any more conversation.

I agree with Chris' original comment:

> This probably will make sense to live somewhere else other than this repo, but starting it here for discussion.

So we should treat this as a temporary home, and nothing should be considered set in stone.

m-mohr · 2023-10-16T22:55:22Z

spec/stac-geoparquet-spec.md

+| bbox               | Struct of Floats     | Required | Can be a 4 or 6 value struct, depending on dimension of the data                                                               |
+| links              | List of Link structs | Required | See [Link Struct](#link-struct) for more info                                                                                  |
+| assets             | An Assets struct     | Required | See [Asset Struct](#asset-struct) for more info                                                                                |
+| collection         | String               | Required | The ID of the collection this Item is a part of                                                                                |


Does this field make sense to be required? In the usual case where you have all items from one collection, it's pretty much the same as with type. A column with all the same values.

Good point. I don't know how to treat this one.

It can technically vary if we're just strongly recommending a single collection per dataset, which differentiates it a bit from type.

For both this and the type, I'd recommend we remove them from the table. My thoughts:

* Users can add additional fields if they want (and will, for columns coming from STAC extensions; we should document this).
* Consumers of stac-geoparquet who wish to generate STAC items will already need custom code (to move columns to properties). Adding type and extracting a collection ID out of the parquet metadata doesn't seem like too much extra work.
* By removing collection as a required field and putting collection metadata in the geoparquet metadata, we really are focusing on a single parquet dataset per collection. I think people are aligned on that, but wanted to confirm.

I think collection is fine to be optional, in case there are situations where it's hard for people to write parquet file metadata (e.g. from data warehouses). But in most situations we can say: as long as you have the Collection in the file metadata, and you only have one collection, you don't need it as a column.
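
The "custom code" a consumer would need to regenerate STAC items from rows can be sketched roughly as follows. This is a hedged illustration in plain Python: `row_to_item` and `TOP_LEVEL_FIELDS` are hypothetical names, and the collection ID is passed in as a plain argument because how it is stored in and read from the file metadata is not settled in this thread.

```python
# Fields that live at the top level of a STAC Item; every other column is
# treated as a property (an assumption made for this sketch).
TOP_LEVEL_FIELDS = (
    "type", "stac_version", "stac_extensions", "id",
    "geometry", "bbox", "links", "assets", "collection",
)


def row_to_item(row, collection_id=None):
    """Rebuild a STAC Item dict from a flattened row dict.

    `collection_id` stands in for a value the caller has already read from
    the file-level metadata; it is only used when the row has no
    `collection` column of its own.
    """
    item = {"type": "Feature"}  # constant for STAC Items, so it need not be stored
    properties = {}
    for name, value in row.items():
        if name in TOP_LEVEL_FIELDS:
            item[name] = value
        else:
            properties[name] = value
    item["properties"] = properties
    if "collection" not in item and collection_id is not None:
        item["collection"] = collection_id
    return item


example_row = {
    "id": "item-1",
    "bbox": [0.0, 0.0, 1.0, 1.0],
    "datetime": "2023-09-26T20:47:08Z",
    "eo:cloud_cover": 12,
}
example_item = row_to_item(example_row, collection_id="my-collection")
```

A `collection` column, when present, takes precedence over the file-level value, which matches the case where a file does mix several collections.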

> Users can add additional fields if they want (and will, for columns coming from STAC extensions; we should document this).

Documentation on how to extend this spec would be much appreciated! I've been looking into using stac-geoparquet to convert label extension items to ML ready label datasets, using EuroSat-STAC as an example.

I think we can update this document with additional sections for each extension that gives the name of the field and the parquet type.

TomAugspurger · 2023-11-10T03:30:50Z

Merging, thanks everyone!

jiayuasu · 2023-11-11T04:24:46Z

@kylebarron @TomAugspurger @cholmes

Thanks for the great work on integrating stac into GeoParquet. I just want to bring this to your attention: we @wherobots recently developed this Havasu spatial table format. It builds on top of Apache Iceberg and adds table metadata / transactions to vector and raster parquet files (https://docs.wherobots.services/1.2.0/references/havasu/spec/#parquet-data-type-mappings). Its vector data parquet storage model is geoparquet and we developed our own raster data parquet storage model raster-on-geoparquet.

The raster data parquet storage is very similar to stac-geoparquet, though we focus more on storing the raster geo-reference and try to stay close to the PostGIS out-db storage layout (https://postgis.net/docs/RT_reference.html). Currently, we are working on supporting import/export of STAC items into/from raster-on-geoparquet.

I believe some synergy can be added here. We can work together to build the ecosystem and avoid duplicate work. We can even schedule a Zoom meeting to further discuss this.

Havasu table format intro: https://docs.wherobots.services/1.2.0/references/havasu/introduction/
Havasu table spec: https://docs.wherobots.services/1.2.0/references/havasu/spec/

CC @Kontinuation @rbavery

cholmes · 2023-11-15T22:09:06Z

@jiayuasu - working together on the ecosystem sounds great. I'm not sure if we'll have critical mass for a Zoom meeting (I'm slammed the next few weeks and this particular topic is low on my priorities), but do you want to open a discussion explaining potential synergies and how to coordinate? In general not duplicating effort makes sense, but I'm not sure of the practical differences, or whether our goal would be to try to merge the efforts or just be sure they're compatible, etc.

cholmes added 3 commits September 26, 2023 13:45

initial spec skeleton

a75cc14

spacing fixes

1f9efaa

more spacing fixes

a834fe5

TomAugspurger reviewed Sep 27, 2023

kylebarron reviewed Sep 27, 2023

kylebarron reviewed Sep 27, 2023

cholmes and others added 3 commits September 27, 2023 10:28

Update spec/stac-geoparquet-spec.md

ae9d966

Co-authored-by: Tom Augspurger <taugspurger@microsoft.com>

Update spec/stac-geoparquet-spec.md

0af604b

Co-authored-by: Tom Augspurger <taugspurger@microsoft.com>

Update spec/stac-geoparquet-spec.md

1121145

Co-authored-by: Tom Augspurger <taugspurger@microsoft.com>

cholmes and others added 2 commits September 28, 2023 05:33

updates based on PR conversations

db63c03

Fixups

9f3cfff

* wording
* links to STAC spec
* Added Asset Structoo
* formatting

Updates

b0d9a6a

* Move Use Cases to the Top
* Added section on Collection JSON
* Added note on accessing fields

TomAugspurger marked this pull request as ready for review October 15, 2023 15:45

m-mohr reviewed Oct 16, 2023

TomAugspurger reviewed Oct 20, 2023

Update note on collection, removed type

e3471ec

TomAugspurger merged commit d791973 into main Nov 10, 2023
1 check passed

TomAugspurger deleted the spec branch November 10, 2023 03:30
