Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Ideas for a spec #28

Merged
merged 10 commits into from Nov 10, 2023
Merged

Initial Ideas for a spec #28

merged 10 commits into from Nov 10, 2023

Conversation

cholmes
Copy link
Contributor

@cholmes cholmes commented Sep 26, 2023

This probably will make sense to live somewhere else other than this repo, but starting it here for discussion. It needs more work, but wanted to put up progress.

Copy link
Collaborator

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cholmes!

I wonder if we can provide any guidance around partitioning. Maybe we run into the same issue as with geoparquet generally, that it's hard to do in general.

But for the Planetary Computer, all of our STAC-geoparquet datasets are partitioned by time. We aim for a time-based partitioning (so weeks, months, quarters, etc.) that gives parquet files of ~50-500MB.

spec/stac-geoparquet-spec.md Outdated Show resolved Hide resolved
spec/stac-geoparquet-spec.md Outdated Show resolved Hide resolved

| Field | GeoParquet Type | Required | Details |
| --------------- | ------------------ | ---------|--------------------------------------------------- |
| type | String | Optional | This is just needed for GeoJSON, so it is optional and not recommended to include in GeoParquet |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to decide on whether or not this should be recommended to be included. I agree it's not really useful, since it's constant. But in terms of storage, this can hopefully be very small with a dictionary encoding.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wasn't sure on whether it should be included or not. I agree it should be trivial on disk and in memory (if dictionary encoded in Arrow as well).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I lean towards no, mostly to communicate that the core idea behind this is not being able to naively roundtrip from a GeoJSON file. Like I view this 'spec' as GeoParquet instantiation of the core fields in like an abstract STAC idea. The only reason it's in STAC is to make it valid GeoJSON. GeoParquet uses a different mechanism (Parquets metadata) to make a parquet file 'geo'. So philosophically I don't think we should just include it. If anything I lean towards something even stronger, like not even listing it as an 'optional field', but mentioning that it's ok to include if you must.

spec/stac-geoparquet-spec.md Outdated Show resolved Hide resolved
spec/stac-geoparquet-spec.md Outdated Show resolved Hide resolved
spec/stac-geoparquet-spec.md Outdated Show resolved Hide resolved
spec/stac-geoparquet-spec.md Outdated Show resolved Hide resolved
spec/stac-geoparquet-spec.md Outdated Show resolved Hide resolved
spec/stac-geoparquet-spec.md Outdated Show resolved Hide resolved

Generally most all the fields in a STAC Item should be mapped to a row in GeoParquet. We embrace Parquet structures where possible, mapping
from JSON into nested structures. We do pull the properties to the top level, so that it is easier to query and use them. The names of the
most of the fields should be the same in STAC and in GeoParquet.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any fields whose names are not the same in STAC and GeoParquet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started the table as stac name and geoparquet, but it seemed useful to just always have them be the same. But I think there's a better way to do the table.

@kylebarron
Copy link
Collaborator

But for the Planetary Computer, all of our STAC-geoparquet datasets are partitioned by time. We aim for a time-based partitioning (so weeks, months, quarters, etc.) that gives parquet files of ~50-500MB.

I think time-based partitioning generally makes sense for STAC, because you know the data will be static after the current time period has ended

cholmes and others added 3 commits September 27, 2023 10:28
Co-authored-by: Tom Augspurger <taugspurger@microsoft.com>
Co-authored-by: Tom Augspurger <taugspurger@microsoft.com>
Co-authored-by: Tom Augspurger <taugspurger@microsoft.com>
@cholmes
Copy link
Contributor Author

cholmes commented Sep 27, 2023

But for the Planetary Computer, all of our STAC-geoparquet datasets are partitioned by time. We aim for a time-based partitioning (so weeks, months, quarters, etc.) that gives parquet files of ~50-500MB.

I think time-based partitioning generally makes sense for STAC, because you know the data will be static after the current time period has ended

I think it makes sense for most data that is in STAC, but could see data in STAC that doesn't make sense to partition by time. But yeah, I think some general recommendations on partitioning parquet data likely makes sense here. Ideally I think we'd point at some GeoParquet partitioning best practices, and recommend stuff from there - like there should be general recommendations in GeoParquet if you have large amounts of data and it's common to query over time then a time partition makes sense, with advice on how to do that. But seems fine to start some of that here.

cholmes and others added 2 commits September 28, 2023 05:33
* wording
* links to STAC spec
* Added Asset Structoo
* formatting
@TomAugspurger
Copy link
Collaborator

I pushed a few changes in 9f3cfff.

I'm pretty happy with where this is at. I think it'd be good to take some of the examples from the stac-spec repo and show what a stac-geoparquet representation would look like (either code to make the actual parquet files, or maybe just showing the parquet / Arrow schema you'd get).

* Move Use Cases to the Top
* Added section on Collection JSON
* Added note on accessing fields
@TomAugspurger TomAugspurger marked this pull request as ready for review October 15, 2023 15:45
@TomAugspurger
Copy link
Collaborator

I think this is in a good spot. I'll merge it in a couple of days, unless there's any more conversation.

I agree with Chris' original comment:

This probably will make sense to live somewhere else other than this repo, but starting it here for discussion.

So we should treat this as a temporary home, and nothing should be considered set in stone.

| bbox | Struct of Floats | Required | Can be a 4 or 6 value struct, depending on dimension of the data |
| links | List of Link structs | Required | See [Link Struct](#link-struct) for more info |
| assets | An Assets struct | Required | See [Asset Struct](#asset-struct) for more info |
| collection | String | Required | The ID of the collection this Item is a part of |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this field make sense to be required? In the usual case where you have all items from one collection, it's pretty much the same as with type. A column with all the same values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I don't know how to treat this one.

It can technically vary if we're just strongly recommending a single collection per datasets, which differentiates it a bit from type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For both this and the type, I'd recommend we remove them from the table. My thoughts:

  1. Users can add additional fields if they want (and will, for columns coming from STAC extensions; we should document this).
  2. Consumers of stac-geoparquet who wish to generate STAC items will already need custom code (to move columns to properties). Adding type and extracting a collection ID out of the parquet metadata doesn't seem like too much extra work.

By removing collection as a required field and putting collection metadata in the geoparquet metadata, we really are focusing on a single parquet dataset per collection. I think people are aligned on that, but wanted to confirm.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think collection is fine to be optional, in case there are situations where it's hard for people to write parquet file metadata (e.g. from data warehouses). But in most situations we can say: as long as you have the Collection in the file metadata, and you only have one collection, you don't need it as a column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can add additional fields if they want (and will, for columns coming from STAC extensions; we should document this).

Documentation on how to extend this spec would be much appreciated! I've been looking into using stac-geoparquet to convert label extension items to ML ready label datasets, using EuroSat-STAC as an example.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can update this document with additional sections for each extension that gives the name of the field and the parquet type.

@TomAugspurger
Copy link
Collaborator

Merging, thanks everyone!

@TomAugspurger TomAugspurger merged commit d791973 into main Nov 10, 2023
1 check passed
@jiayuasu
Copy link

@kylebarron @TomAugspurger @cholmes

Thanks for the great work on integrating stac into GeoParquet. I just want to bring this to your attention: we @wherobots recently developed this Havasu spatial table format. It builds on top of Apache Iceberg and adds table metadata / transactions to vector and raster parquet files (https://docs.wherobots.services/1.2.0/references/havasu/spec/#parquet-data-type-mappings). Its vector data parquet storage model is geoparquet and we developed our own raster data parquet storage model raster-on-geoparquet.

The raster data parquet storage is very similar to the stac-geoparquet while we focus more on storing raster geo-reference and try to stay close to PostGIS out-db storage layout (https://postgis.net/docs/RT_reference.html). Currently, we are working on supporting import/export stac-items into/from raster-on-geoparquet

I believe some synergy can be added here. We can work together to build the ecosystem and avoid duplicate work. We can even schedule a Zoom meeting to further discuss this.

Havaus table format intro: https://docs.wherobots.services/1.2.0/references/havasu/introduction/
Havasu table spec: https://docs.wherobots.services/1.2.0/references/havasu/spec/

CC @Kontinuation @rbavery

@cholmes
Copy link
Contributor Author

cholmes commented Nov 15, 2023

@jiayuasu - working together on the ecosystem sounds great. I'm not sure if we'll have critical mass for a zoom meeting (I'm slammed the next few weeks and this particular topic is low on my priorities), but you want to open a discussion explaining potential synergies and how to coordinate? In general not duplicating effort makes sense, but I'm not sure of the practical differences, if our goal would be to try to merge the efforts or just be sure they're compatible, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants