Data Cube Extension: Variables and more #713
Re 1: GEE uses the following fields for variables:
Thanks for opening the discussion @m-mohr. I'll link to esm-collection-spec here: https://github.com/NCAR/esm-collection-spec/ This is something we hacked together to provide a STAC-inspired catalog of our cloud-based Zarr climate model data. This is how we are currently cataloging the Google Cloud CMIP6 data. (More technical blog post here.) In the long term, we would love for this to actually become a valid STAC catalog. I'd welcome anyone's thoughts on the best roadmap to achieve this.
@rabernat I think this will take us multiple steps and probably two extensions or so. First we would probably work together to align the data cube extension to be flexible enough for your use case, and then add another extension for additional domain-specific and/or format-specific things. One primary question to answer is probably whether your data would better be a STAC Item or Collection. For the data cube extension I'd need to understand your requirements and why you did what you did in the ESM spec. I guess the attributes are what would translate to what I call "variables" here, but I don't really understand that vocabulary thing in the ESM spec. Also, why is it "external"? The other fields would probably be part of another extension, which would probably be mostly copy & paste with some restructuring. What is this CSV file about? Is there a reason for it being CSV instead of JSON?
I'd really love to get ESM and STAC aligned, and I think the time is now as we're going to go 1.0-beta soon. I agree with the path @m-mohr lays out - first get the data cube extension to be flexible enough, and then add another extension for the specifics. @rabernat - could you help Matthias understand your requirements? Perhaps we could jump on a call sometime soon and try to sort it? Take a crack at a STAC+datacube+extension version of ESM? From our discussions earlier, it seems hard to really map to an Item. But we don't have assets at the collection level, so I could see something like a collection that has one item which has the asset links, since conceptually it feels more at the level of a collection, but one with a very expansive Item.
In the coming weeks I'll have a look at
and probably some more data cube things, and will try to align them. Any help is highly appreciated. I'm not very familiar with many of these formats.
Thanks for keeping this discussion alive, and sorry for my slow responses. I'll tag a few more Pangeo folks to help move the conversation along: @jhamman, @andersy005 (who maintain the ESM collection spec and Intake-ESM). Let's use the CMIP6 Google Cloud data as a representative use case. This dataset consists of about 100,000 distinct Zarr groups, each formatted following the NetCDF data model. Opening a single group in Xarray returns something like this. Here the main data variable is This object, with dimensions

A principal contrast between CMIP6 and most other datasets I've seen in STAC is that all of the CMIP6 data is completely global in extent. It's not at all interesting or important for us to know the bounding box of the data. Furthermore, the time range may use the non-standard calendars common in climate modeling, such as 360-day, no-leap, etc., which are impossible to encode using STAC. The challenge for users of CMIP6 data is not finding a particular spatial or temporal extent; rather, it is filtering the >100,000 datasets by variable, scenario, modeling center, etc. There is no inherent hierarchy to these attributes, so they can't be nested. The ESGF uses a custom search API to solve this problem. For the cloud data, the simplest solution we could come up with is a flat table, stored as a .csv file, with all of the relevant attributes for each dataset, e.g. (Screenshot from https://catalog.pangeo.io/browse/master/climate/cmip6_gcs/.)
Because it is not hierarchical and therefore cannot be nested, size is a major concern. These CSV files are already ~50 MB. JSON is much larger to store and slower to parse for this sort of flat, tidy data. But supporting JSON would certainly be possible. We would be happy to set up a call to discuss the details. I have some availability Wednesday morning (EST) and Friday afternoon.
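To make the flat-table idea above concrete, here is a minimal sketch of facet-style filtering over such a catalog CSV, using only Python's standard library. The column names follow the CMIP6 attributes named later in this thread (activity_id, variable_id, etc.), but the sample rows and the `search` helper are invented for illustration, not part of the ESM spec.

```python
import csv
import io

# Tiny invented stand-in for the ~50 MB Pangeo CMIP6 catalog CSV.
SAMPLE = """activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label
CMIP,NCAR,CESM2,historical,r1i1p1f1,Amon,tas,gn
CMIP,NCAR,CESM2,historical,r1i1p1f1,Amon,pr,gn
ScenarioMIP,NCAR,CESM2,ssp585,r1i1p1f1,Amon,tas,gn
"""

def search(catalog_csv, **facets):
    """Return catalog rows matching every given facet exactly (ESGF-search style)."""
    rows = csv.DictReader(io.StringIO(catalog_csv))
    return [r for r in rows
            if all(r.get(key) == value for key, value in facets.items())]

hits = search(SAMPLE, variable_id="tas", experiment_id="historical")
```

The point of the CSV design is exactly this: filtering is a flat row scan over facet columns, with no hierarchy to walk.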
I'd also be happy to join a discussion to see how we can move these efforts along/together. Wednesday afternoon (except 1-2p PT) and Friday after 1p PT are free for me.
This is possible now following NCAR/esm-collection-spec#15. I find this single-file collection to be the easiest example to comprehend: https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/examples/sample-collection-with-catalog-dict.json
Thanks for all the information. I'll use it in the coming days and try to figure out how to combine both specs without breaking too much on either side. I don't think we need to put the catalog into JSON though, I guess this could simply be a new link type. I'm happy to join a call, but 1pm PDT is 10pm here in Europe, so that's too late unfortunately. It would be better to find something in the morning (PDT) next week or so.
I could do monday 8am-10am pacific, or tuesday 7:30-8:30am pacific.
Both of these time slots work for me.
I tried to come up with an example to base discussions upon, based on:
For more details see also the comments in the code. {
// STAC collection fields
"stac_version": "0.9.0",
"stac_extensions": [
"asset",
"datacube",
"esm" // A new extension based on the ESM collection spec
],
"id": "pangeo-cmip6",
"title": "Google CMIP6",
"description": "This is an ESM collection for CMIP6 Zarr data residing in Pangeo's Google Storage.",
"extent": {
"spatial": {
"bbox": [[-180, -90, 180, 90]]
},
"temporal": {
"interval": [["1850-01-15T12:00:00Z", "2014-12-15T12:00:00Z"]]
}
},
"providers": [
{
"name": " World Climate Research Programme",
"roles": ["producer","licensor"],
"url": "https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6"
},
{
"name": "The Pangeo Project",
"roles": ["processor"],
"url": "https://console.cloud.google.com/pangeo.io"
},
{
"name": "Google",
"roles": ["host"],
"url": "https://console.cloud.google.com/marketplace/details/noaa-public/cmip6"
}
],
"license": "proprietary",
"links": [
{
"href": "https://pcmdi.llnl.gov/CMIP6/TermsOfUse/TermsOfUse6-1.html",
"type": "text/html",
"rel": "license",
"title": "CMIP6: Terms of Use"
}
],
"summaries": {
// Could hold additional metadata as defined for STAC Items, not sure what could be relevant.
},
// Data Cube extension, see https://github.com/radiantearth/stac-spec/tree/master/extensions/datacube
"cube:dimensions": {
"lon": {
"type": "spatial",
"axis": "x",
"extent": [0,360],
"reference_system": 0 // Placeholder, Which is it here?
},
"lat": {
"type": "spatial",
"axis": "y",
"extent": [-90,90],
"reference_system": 0 // Placeholder, Which is it here?
},
"time": {
"type": "temporal",
"extent": ["1850-01-15T12:00:00Z", "2014-12-15T12:00:00Z"],
"step": "P30D" // Random placeholder
},
// Could probably be moved to cube:variables
"tas": {
"type": "variable",
"description": "Surface air temperature",
"extent": [-70, 70], // Placeholder
"unit": "°C"
}
},
// This is not part of STAC yet and needs to be defined, probably similar to dimensions with objects like https://github.com/radiantearth/stac-spec/tree/master/extensions/datacube#additional-dimension-object
"cube:variables": {
"tas": {
// to be defined...
"type": "variable",
"description": "Surface air temperature",
"extent": [-70, 70], // Placeholder
"unit": "°C"
},
"time_bnds": {
// to be defined
},
"lat_bnds": {
// to be defined
},
"lon_bnds": {
// to be defined
}
},
// Asset extension, extended by ESM extension to support asset-level metadata (adds the `href` property), ESM also defines "column_name" and specific roles ("catalog", "attribute").
"assets": {
"catalog": {
// Optional, otherwise specify esm:catalog below
"roles": ["catalog"],
"type": "application/vnd.zarr", // Previously assets.format - is there a ZARR media type?
"column_name": "path",
"title": "Catalog",
"description": "Path to a the CSV file with the catalog contents.",
"href": "sample-pangeo-cmip6-zarr-stores.csv"
},
// All attributes / vocabulary files, we may also move these out of the assets, depending on whether there's usually a "href" set or not. If not, it could simply be moved to a field "esm:attributes" with the same structure as in the ESM spec.
"activity_id": {
"roles": ["attribute"],
"type": "application/json",
"column_name": "activity_id",
"href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
},
"source_id": {
"roles": ["attribute"],
"type": "application/json",
"column_name": "source_id",
"href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
},
"institution_id": {
"roles": ["attribute"],
"type": "application/json",
"column_name": "institution_id",
"href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_institution_id.json"
},
"experiment_id": {
"roles": ["attribute"],
"type": "application/json",
"column_name": "experiment_id",
"href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_experiment_id.json"
},
"member_id": {
"roles": ["attribute"],
"type": "application/json",
"column_name": "member_id"
},
"table_id": {
"roles": ["attribute"],
"type": "application/json",
"column_name": "table_id",
"href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_table_id.json"
},
"variable_id": {
"roles": ["attribute"],
"type": "application/json",
"column_name": "variable_id"
},
"grid_label": {
"roles": ["attribute"],
"type": "application/json",
"column_name": "grid_label",
"href": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_grid_label.json"
}
},
// ESM extension fields
"esm:catalog": {}, // Optional, previously the "catalog dict" if no "catalog" asset is available
"esm:aggregation_control": {
// As defined by the ESM spec
}
}
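As a rough sanity check on the example above, here is a sketch of how a client might consume the proposed `cube:dimensions` / `cube:variables` fields. The `summarize` helper is hypothetical, and the collection dict is a trimmed copy of the example; nothing here is defined by the extension yet.

```python
# Trimmed copy of the proposed collection example above.
collection = {
    "id": "pangeo-cmip6",
    "cube:dimensions": {
        "lon":  {"type": "spatial", "axis": "x", "extent": [0, 360]},
        "lat":  {"type": "spatial", "axis": "y", "extent": [-90, 90]},
        "time": {"type": "temporal",
                 "extent": ["1850-01-15T12:00:00Z", "2014-12-15T12:00:00Z"]},
    },
    "cube:variables": {
        "tas": {"description": "Surface air temperature", "unit": "°C"},
    },
}

def summarize(coll):
    """Hypothetical client helper: map dimension names to their types,
    and list the variable names the cube exposes."""
    dims = {name: d["type"] for name, d in coll.get("cube:dimensions", {}).items()}
    variables = sorted(coll.get("cube:variables", {}))
    return dims, variables

dims, variables = summarize(collection)
```

Keeping variables in a separate `cube:variables` object (rather than as `"type": "variable"` entries inside `cube:dimensions`) makes this kind of client code trivially simple, which is one argument for the split discussed above.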
Thanks for all the information, it helped to work on the example above.
That seems pretty complex to model. We need to check closely whether we can abstract it into the data cube extension or add this to the ESM extension if desired.
Understood, although STAC collections require you to specify the extent. So you could always set it to worldwide.
It still has a temporal extent it covers, right? So that can be specified in the extents. And then for the non-standard calendars we can probably use the data cube extension (it has the option to do so) or add something in the ESM extension. Although I couldn't find any information on dates etc in the ESM spec so not sure whether you want to include it at all...
You don't need to search by spatial or temporal extent in a STAC API. We "just" need to figure out how you search for variable, scenario, modeling center etc. and how we can map this to STAC. But the example above only looks at a JSON encoding for now. API is a completely different beast.
As said above, I guess you still need a custom API or an extension for the STAC API. That's nothing we can support out of the box yet. Let's start with aligning the JSON encodings first.
We don't need to store it as JSON, I think. We can just reference the CSV file(s) as assets.
Agreed, and I think even though it's not interesting or important to you, it is important for a general person looking for geospatial information to know that this particular set of data is global. But setting it to worldwide in all cases I think is totally valid. And it does seem like there are multidimensional cube / netCDF datasets that aren't global sometimes?
+1 - I see the win here as getting the 'overview' in the same 'world' as STAC, so we have interoperability at the 'collection' level. The filtering of datasets can be a totally different 'thing'. And I think the links to the CSV files seem like a nice cloud-native way to let diverse tools filter datasets.
+1 - I too was curious to know why it is CSV, but the reasoning makes sense, and having STAC refer to assets that aren't JSON is totally normal.
This is more-or-less what the ESM collection spec does now. At the top level, we still have a json file, which points to this csv file.
This is all covered by CF conventions: http://cfconventions.org/cf-conventions/cf-conventions.html#calendar An important thing to note here is that our field has a very comprehensive and widely adopted set of metadata conventions called CF conventions. But all the CF metadata live inside the netCDF / Zarr file (and in Zarr it is stored in json). A question we would need to resolve is how much metadata to pull out and store in a STAC collection. @m-mohr, in your example above, you are essentially duplicating much of that metadata, so it's starting to look very similar to the zarr metadata file itself. An example: https://storage.googleapis.com/cmip6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/clt/gn/.zmetadata:
This is essentially CF conventions encoded in the Zarr v2 spec. Perhaps one path forward is to make STAC more aware of Zarr objects. There is already considerable compatibility: Zarr uses JSON files to provide metadata about a collection of binary data objects which together comprise a full multidimensional array.
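To illustrate the overlap, here is a sketch of the Zarr v2 consolidated-metadata layout (`.zmetadata`) mentioned above, with a small helper that recovers each array's named dimensions from xarray's `_ARRAY_DIMENSIONS` attribute. The metadata dict is a trimmed, hand-written stand-in for the linked CMIP6 file, not a verbatim copy.

```python
# Hand-written stand-in for a Zarr v2 consolidated metadata file (.zmetadata),
# following the layout Zarr v2 actually uses: a "metadata" mapping whose keys
# are store paths like "clt/.zarray" and "clt/.zattrs".
zmetadata = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "clt/.zarray": {"shape": [1800, 192, 288],
                        "chunks": [600, 192, 288], "dtype": "<f4"},
        "clt/.zattrs": {"_ARRAY_DIMENSIONS": ["time", "lat", "lon"],
                        "standard_name": "cloud_area_fraction", "units": "%"},
        "time/.zarray": {"shape": [1800], "chunks": [1800], "dtype": "<f8"},
        "time/.zattrs": {"_ARRAY_DIMENSIONS": ["time"],
                         "calendar": "noleap",
                         "units": "days since 0001-01-01"},
    },
}

def array_dimensions(consolidated):
    """Map each array path to its named dimensions via _ARRAY_DIMENSIONS."""
    suffix = "/.zattrs"
    return {key[:-len(suffix)]: attrs["_ARRAY_DIMENSIONS"]
            for key, attrs in consolidated["metadata"].items()
            if key.endswith(suffix) and "_ARRAY_DIMENSIONS" in attrs}

dims = array_dimensions(zmetadata)
```

This is essentially the same information a STAC collection with `cube:dimensions` would carry, which is why the duplication question above matters.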
I can do Monday 9 AM PDT (12 PM EDT). Shall we settle on this?
Sure, let's do it then. Here are the Zoom details:
On my calendar - thanks!
Yes, I had looked at the spec when coming up with the example. It tries to borrow as much as possible from there.
Interesting. We need to figure out what needs to be exposed and how. Would someone search on this data? Maybe it's as simple as adding a field "calendar" with the options defined in the link?
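To show why a plain "calendar" field matters, here is a minimal sketch (my own, not from CF tooling such as cftime) of decoding dates under the CF `360_day` calendar, in which every year has 12 months of exactly 30 days:

```python
def decode_360day(days_since_ref, ref_year=1850):
    """Decode 'days since <ref_year>-01-01' under the CF 360_day calendar:
    every year has 12 months of exactly 30 days (360 days per year)."""
    year = ref_year + days_since_ref // 360
    day_of_year = days_since_ref % 360
    month = day_of_year // 30 + 1
    day = day_of_year % 30 + 1
    return (year, month, day)
```

Decoding day 59 yields February 30th, a perfectly valid `360_day` date that neither Python's `datetime` nor a plain STAC datetime string can represent, which is why the extension would need to record the calendar explicitly.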
Yes, STAC doesn't replace the original metadata files, but exposes the search/discovery-related data in JSON. So there is indeed some intentional duplication. We need to figure out to what extent this is needed for Zarr; for example, all the data cube related things are optional, and then the example is shorter and you have a bare minimum, I think. I don't think we can or should go beyond that point of "simplification".
Not 100% sure how we could do that, but we can link to it for sure. Is there a media type for zarr (metadata) other than application/json?
A Zarr array or group is not a single file/object. It is a collection of JSON files and binary objects, stored with a standard layout. So it is not meaningful to talk about a media type for Zarr. In other words, a Zarr array or group is analogous to a STAC collection, and the individual chunks are analogous to STAC items. This overlap in scope is one reason I have a hard time mapping Zarr to STAC.
Aha, I wasn't aware of that. That makes some of my considerations above obsolete. Let's see how we can go forward... |
Over the past year, I've spent quite a bit of time reading and trying to understand the STAC spec and its extensions. In order to make Monday's meeting as productive as possible, I encourage you to do the same with the Zarr spec. For context, Zarr recently received a CZI EOSS Grant. It is growing rapidly, not only in geoscience but also in bioinformatics / bioimaging. Thinking carefully about how to best integrate STAC and Zarr is an important task.
Just catching up on this - I'll join the call on Monday as well. To @rabernat's point about Zarr being multiple files, while that's true you don't reference the individual pieces....that is they aren't individually addressable as files. You reference the entire Zarr dataset, so in that sense I think that maybe a media type might make sense for Zarr. One idea that had been brought up before was a collection-level assets, which there's an issue for now ( #779 ), but it does go against the central concept of STAC....Items are the things that are searched. Some questions to consider and to talk about on Monday:
If the answer to the above questions is 'yes' or 'possibly', then that implies that it is advantageous to be able to specifically address a portion of a Zarr dataset, and to determine that ahead of time by some separate query. This goes against my point above that pieces of a Zarr dataset aren't individually addressable....they aren't in the traditional sense, but Zarr chunks are read in pieces...so maybe the "URLs" to get to a piece of a Zarr dataset are really just arguments to some function. In this case maybe multiple STAC Items representing a complete Zarr dataset make sense, with the Collection representing the entire dataset. If the answer to these questions is no, then Items do not really make sense. We could still align STAC and ESM - either by adding assets to a Collection, with the Collection representing the entire dataset, or maybe a single STAC Item is a Zarr dataset.
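The "arguments to some function" framing above can be made concrete: in Zarr v2, the storage key of a chunk is derived from the array path and the chunk grid indices. A sketch (my own helper, shapes invented to resemble a monthly CMIP6 field):

```python
def chunk_key(array_path, index, shape, chunks, sep="."):
    """Zarr v2-style storage key of the chunk containing element `index`
    of an array with the given shape and chunk shape."""
    assert len(index) == len(shape) == len(chunks)
    assert all(0 <= i < s for i, s in zip(index, shape))
    grid = [i // c for i, c in zip(index, chunks)]
    return f"{array_path}/{sep.join(str(g) for g in grid)}"

# Element (time=1234, lat=100, lon=50) of an (1800, 192, 288) array
# chunked as (600, 192, 288) lives in the chunk keyed "tas/2.0.0".
key = chunk_key("tas", (1234, 100, 50), (1800, 192, 288), (600, 192, 288))
```

So a "URL" to a piece of a Zarr dataset really is just `store_root + "/" + chunk_key(...)`, computed client-side, which is what makes per-chunk STAC Items feel redundant.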
The meeting details above were not correct. It said it's Sun, 19th, but of course we meet on Mon, 20th. |
As we are not going to use it in the ESM spec, I'm delaying the work on this until beta 2 and will focus on other more pressing issues first.
Related to the ESM collection spec alignment with STAC (NCAR/esm-collection-spec#21)
@cholmes Isn't it important that each item is searchable through the STAC API's search endpoint itself, rather than relying on a client side CSV lookup? I mean, it's clear that storing as JSON would be a heavier data representation than what we find in the CSV file, but the server side search capability would benefit from it.
Based on the nature of the climate data, which is accessed by variable rather than by spatial or temporal extent, it makes sense to use a CSV to easily find datasets. However, I wonder if the actual goal of STAC search is reached if it only provides the CSV and all operations are client-side. What about having the actual data attributes and values self-contained in the JSON using the data cube extension or custom metadata fields using custom schemas, instead of relying on CSV? I personally can't describe CSV files as a cloud-native way to store this data. I imagine consuming the STAC API through a light client (i.e., a web browser) would be painful if a 10 MB CSV needs to be downloaded beforehand.
I agree that's the ideal. But my main thought with ESM / Zarr in STAC now is to not let the great be the enemy of the good. Originally STAC was very focused on 'Items', but OpenEO and Google Earth Engine both find a lot of value in just using it for 'collections' (as they don't have items; each layer is a full composite, abstracting out the 'scenes'). It's just more modern metadata for those who don't want to use one of the older XML standards, and it works in STAC tools. I still don't have my head fully around Zarr, so it feels like STAC can at least 'help' with collection-level search. We get that 'win', and then we can further investigate the 'item' level. What you say makes sense to me, but I still don't have a real feel for what exactly should be an 'item'. But I do think once we get that first win we should try to dive deep. I think it'll also be easier once the ecosystem for STAC is a bit more mature, as we'll be able to see what putting things in items or collections actually results in. Right now it all just feels too abstract.
This does make sense to me, and I fully agree the full goal of STAC search will be better reached if we map the CSV into structures that result in fuller 'search' in STAC. But I think we can start with 'STAC collection search' (which, to be fair, we don't really do yet in APIs, as we are waiting to align with OGC API - Records, but we know the 'static' structure that will power the APIs). But I'm also up to spend some time to figure out the 'full' solution, though I'd also want to understand what that would mean for existing clients of the CSV.
The ESM Collection spec work seems to have stopped at some point, it was never finished as far as I know, but I think it was replaced with a more generic Zarr mapping for STAC. Maybe @rabernat can weigh in here?
Thanks for re-opening the discussion. Looking back at these issues in retrospect, I see that we muddied the water a lot by conflating two separate issues. These separate issues are
Going forward, I think the most important issue to resolve is how best to point to and describe non-imagery file formats in STAC. Zarr is a particularly hard case, because it is not even a single file but rather has its own internal nested hierarchy of objects. Rather than trying to do this ESM collection / csv idea from inside STAC, our current view is that we should simply generate a static, deeply nested STAC catalog for our data and then index it with various search tools. In this context, the csv file is simply one of those indexes. Our community is quite into open search and elastic search as well, so those could be other options. This is discussed quite a bit in pangeo-forge/cmip6-pipeline#7. Coincidentally, a group of folks from Pangeo and ESGF has just been getting ready to reach out to the STAC community again to discuss how we can start to align better with STAC. We are still stuck on some high level conceptual questions that would benefit from real-time discussion. We would love to set up a meeting with @m-mohr, @cholmes, @HamedAlemo, and anyone else interested. This meeting would include some folks from the ESGF leadership and so would be a chance to really advance STAC adoption in the climate world. Would you be interested in such a meeting, and, if so, what's the best way to schedule? |
@rabernat Also the points you mention in #366 (comment) are pertinent and should be part of the discussion. The data pipeline around STAC which is responsible to bridge with existing TDS data is one important piece of the STAC ecosystem. When is the next meeting? |
@rabernat thanks for the clarification on the two issues. I think a meeting would be helpful. If you can also share a sample small Zarr file (something much smaller than a CMIP6 output e.g.) before the meeting it would be great. That can help better understand the Zarr hierarchy and how it can work with STAC. |
I just discovered a bunch of weather data (grib) cataloged in STAC! https://api.weather.gc.ca/ I think @tomkralidis is responsible for this! |
Thanks @rabernat. Yes, powered by pygeoapi, this is an experimental capability atop some of our MSC Datamart real-time data. Note that we are working on providing our climate data via STAC in a future release, so it would be great to move forward climate data in STAC. |
The data cube extension has been moved to another repository: https://github.com/stac-extensions/datacube This issue has been moved to: stac-extensions/datacube#1 Closing here. |
Two things came up recently that could be integrated into the data cube extension:
Add variables in addition to dimensions. Some data cubes expose variables, some don't. We don't need this for openEO (yet?), but Google Earth Engine @simonff would probably use them. I'm also looking at netCDF and other formats, which afaik support variables in addition to dimensions. Maybe there's space for alignment also with the ESM collection spec "fork" from @rabernat.
For dimensions it might be useful to specify the number of cells (see openEO UDF discussions).
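For regularly spaced dimensions, the cell count proposed above can often be derived from the extent and step rather than stored explicitly. A sketch (hypothetical helper, not part of the extension):

```python
def cell_count(extent, step):
    """Number of regularly spaced cells covering [extent[0], extent[1]),
    assuming the extent is an exact multiple of the step."""
    low, high = extent
    n, remainder = divmod(high - low, step)
    if remainder:
        raise ValueError("extent is not an integer multiple of step")
    return int(n)

# A global longitude dimension at 0.25° resolution has 1440 cells.
n_lon = cell_count((0, 360), 0.25)
```

For irregular dimensions (or enumerated `values`), the count can't be derived this way, which is an argument for allowing an explicit field as well.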