New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial Ideas for a spec #28
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cholmes!
I wonder if we can provide any guidance around partitioning. Maybe we run into the same issue as with geoparquet generally, that it's hard to do in general.
But for the Planetary Computer, all of our STAC-geoparquet datasets are partitioned by time. We aim for a time-based partitioning (so weeks, months, quarters, etc.) that gives parquet files of ~50-500MB.
spec/stac-geoparquet-spec.md
Outdated
|
||
| Field | GeoParquet Type | Required | Details | | ||
| --------------- | ------------------ | ---------|--------------------------------------------------- | | ||
| type | String | Optional | This is just needed for GeoJSON, so it is optional and not recommended to include in GeoParquet | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to decide on whether or not this should be recommended to be included. I agree it's not really useful, since it's constant. But in terms of storage, this can hopefully be very small with a dictionary encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wasn't sure on whether it should be included or not. I agree it should be trivial on disk and in memory (if dictionary encoded in Arrow as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I lean towards no, mostly to communicate that the core idea behind this is not being able to naively roundtrip from a GeoJSON file. Like I view this 'spec' as GeoParquet instantiation of the core fields in like an abstract STAC idea. The only reason it's in STAC is to make it valid GeoJSON. GeoParquet uses a different mechanism (Parquets metadata) to make a parquet file 'geo'. So philosophically I don't think we should just include it. If anything I lean towards something even stronger, like not even listing it as an 'optional field', but mentioning that it's ok to include if you must.
|
||
Generally most all the fields in a STAC Item should be mapped to a row in GeoParquet. We embrace Parquet structures where possible, mapping | ||
from JSON into nested structures. We do pull the properties to the top level, so that it is easier to query and use them. The names of the | ||
most of the fields should be the same in STAC and in GeoParquet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any fields whose names are not the same in STAC and GeoParquet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started the table as stac name and geoparquet, but it seemed useful to just always have them be the same. But I think there's a better way to do the table.
I think time-based partitioning generally makes sense for STAC, because you know the data will be static after the current time period has ended |
Co-authored-by: Tom Augspurger <taugspurger@microsoft.com>
Co-authored-by: Tom Augspurger <taugspurger@microsoft.com>
Co-authored-by: Tom Augspurger <taugspurger@microsoft.com>
I think it makes sense for most data that is in STAC, but could see data in STAC that doesn't make sense to partition by time. But yeah, I think some general recommendations on partitioning parquet data likely makes sense here. Ideally I think we'd point at some GeoParquet partitioning best practices, and recommend stuff from there - like there should be general recommendations in GeoParquet if you have large amounts of data and it's common to query over time then a time partition makes sense, with advice on how to do that. But seems fine to start some of that here. |
I pushed a few changes in 9f3cfff. I'm pretty happy with where this is at. I think it'd be good to take some of the examples from the stac-spec repo and show what a stac-geoparquet representation would look like (either code to make the actual parquet files, or maybe just showing the parquet / Arrow schema you'd get). |
I think this is in a good spot. I'll merge it in a couple of days, unless there's any more conversation. I agree with Chris' original comment:
So we should treat this as a temporary home, and nothing should be considered set in stone. |
spec/stac-geoparquet-spec.md
Outdated
| bbox | Struct of Floats | Required | Can be a 4 or 6 value struct, depending on dimension of the data | | ||
| links | List of Link structs | Required | See [Link Struct](#link-struct) for more info | | ||
| assets | An Assets struct | Required | See [Asset Struct](#asset-struct) for more info | | ||
| collection | String | Required | The ID of the collection this Item is a part of | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this field make sense to be required? In the usual case where you have all items from one collection, it's pretty much the same as with type. A column with all the same values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I don't know how to treat this one.
It can technically vary if we're just strongly recommending a single collection per datasets, which differentiates it a bit from type
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For both this and the type
, I'd recommend we remove them from the table. My thoughts:
- Users can add additional fields if they want (and will, for columns coming from STAC extensions; we should document this).
- Consumers of stac-geoparquet who wish to generate STAC items will already need custom code (to move columns to properties). Adding
type
and extracting a collection ID out of the parquet metadata doesn't seem like too much extra work.
By removing collection
as a required field and putting collection metadata in the geoparquet metadata, we really are focusing on a single parquet dataset per collection. I think people are aligned on that, but wanted to confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think collection is fine to be optional, in case there are situations where it's hard for people to write parquet file metadata (e.g. from data warehouses). But in most situations we can say: as long as you have the Collection
in the file metadata, and you only have one collection, you don't need it as a column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users can add additional fields if they want (and will, for columns coming from STAC extensions; we should document this).
Documentation on how to extend this spec would be much appreciated! I've been looking into using stac-geoparquet to convert label extension items to ML ready label datasets, using EuroSat-STAC as an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can update this document with additional sections for each extension that gives the name of the field and the parquet type.
Merging, thanks everyone! |
@kylebarron @TomAugspurger @cholmes Thanks for the great work on integrating stac into GeoParquet. I just want to bring this to your attention: we @wherobots recently developed this Havasu spatial table format. It builds on top of Apache Iceberg and adds table metadata / transactions to vector and raster parquet files (https://docs.wherobots.services/1.2.0/references/havasu/spec/#parquet-data-type-mappings). Its vector data parquet storage model is The raster data parquet storage is very similar to the I believe some synergy can be added here. We can work together to build the ecosystem and avoid duplicate work. We can even schedule a Zoom meeting to further discuss this. Havaus table format intro: https://docs.wherobots.services/1.2.0/references/havasu/introduction/ |
@jiayuasu - working together on the ecosystem sounds great. I'm not sure if we'll have critical mass for a zoom meeting (I'm slammed the next few weeks and this particular topic is low on my priorities), but you want to open a discussion explaining potential synergies and how to coordinate? In general not duplicating effort makes sense, but I'm not sure of the practical differences, if our goal would be to try to merge the efforts or just be sure they're compatible, etc. |
This probably will make sense to live somewhere else other than this repo, but starting it here for discussion. It needs more work, but wanted to put up progress.