Datacube extension #361
As discussed in #366, I think STAC can have a big impact on the broader climate science community, via things like this datacube extension. But I would strongly encourage you to engage with the netCDF / CF conventions community on this topic. (CF Conventions repo: https://github.com/cf-convention/cf-conventions). There are decades of expertise on metadata to describe multidimensional gridded geographic datasets. We have also been discussing json summaries of datasets in xarray (see pydata/xarray#2656). Specifically, it might be good to ask someone from CF Conventions to review this PR at some point. |
@rabernat Could you forward it to the corresponding people you are speaking about, please? I personally don't need much more than what is currently in the data cube extension, so it is up to the communities (e.g. netCDF, openDataCube, WCS, ...) to communicate their needs so we can add them here. |
I can try to help bring it to the attention of those communities. However, this is a bit of a chicken-vs-egg problem. They mostly don't yet realize that they need STAC and are currently living in The quicker option is of course to just move forward with what current STAC contributors personally need. |
I'm happy to help with some outreach to those communities. I do think it's worth trying to figure out a 'bridge' of some JSON in STAC that helps expose the world they live in to a wider audience of users. Perhaps a good point of collaboration would be one of the Earth on AWS datasets - https://aws.amazon.com/earth/ Maybe the UK Met Office Atmospheric Deterministic and Probabilistic Forecasts one? Or if there's another good open data set that we could get AWS to host, and try out STAC around it? The ideal to me is to get the data rendered and interactive in some way in STAC Browser. Like https://planet-stac.netlify.com/ but when you navigate down you get a cool interactive 5d thing instead of our normal web tiles. |
The ERA-5 data set from the Copernicus Climate Change Service C3S is on AWS Cheers |
Dear Mathias
Good to start, but thinking of weather and climate models, I could see something more being needed.
An example is where there are 2 different kinds of horizontal axes: layers, meaning something between 2 levels, and precise levels. For example, ECMWF ERA-5 and earlier ERAs have 4 layers of soil, with 0-7cm for layer 1, 7-28cm for 2, 28-100cm for 3 and 100-255cm for 4. There are then temperature, moisture etc. variables for these layers. Into the atmosphere there are over 100 model levels following the same principle: close to the surface the levels are close together, and with increasing height the distance between levels grows. For most users the atmospheric information is useful on equal pressure levels: 1000, 925, 850, 700, 500, 300. I believe that Mathias' proposal works with these if we have either a more complex extent or another field defining these levels or layers. An extent allowing a list of single values or two-element arrays would, in my view, solve it beautifully: the soil layers would mean "extent": [0,7],[7,28],[28,100],[100,255] unit cm; while for example pressure levels would be "extent":1000,925,850,...
And as a very general comment on data cubes; for me they should be a place to be able to mix numerical models including atmosphere and EO data ;) would make most sense like that!
Dear @mstrahl, thanks for commenting. Unfortunately, I currently don't have the time to explore other data sets on my own and find out how they do things, so I'd need some guidance. Nevertheless, your second comment is very valuable. Not sure whether I understood everything correctly, but based on your example we could try to figure out how to describe your data with the data cube extension and change the extension if required. The current proposal heavily focuses on dimensions, so having multiple horizontal axes is not a problem, but it seems that the fields need some adjustments. I'm not sure yet whether we should go with multi-dimensional arrays... in a data cube world you'd basically just have a dimension with four values, one for each soil layer, right? The actual groups (0-7cm) would be an attribute IMHO. Could you make an example based on this extension and fill in your data as you need it? Feel free to change whatever is required and then we can discuss it in-depth.
That's also what we want to achieve in the long run (with openEO). |
Would it be something like this? Just to get started.... "cube:dimensions": {
"x": {
"type": "spatial",
"axis": "x",
"extent": [-180,180],
"crs": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
},
"y": {
"type": "spatial",
"axis": "y",
"extent": [-90,90],
"crs": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
},
"soil_layer": {
"type": "spatial",
"axis": "z",
"values": [1,2,3,4], // ordinal scale? 1 = 0-7cm, 2 = 7-28cm, 3 = 28-100cm, 4 = 100-255cm
"labels": ["0-7cm", "7-28cm", "28-100cm", "100-255cm"]
},
"pressure": {
"type": "pressure",
"extent": [0,1000], // are these "grouped" as the soil layers or not? I expect here that they are not...
"unit": "Pa"
},
"temperature": {
"type": "temperature",
"extent": [0,273.15],
"unit": "K"
}
} |
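To make the soil-layer idea above concrete, here is a minimal Python sketch that derives the ordinal `values` and human-readable `labels` from a list of layer bounds. The function name `layered_dimension` and its arguments are hypothetical; only the field names `type`, `axis`, `values` and `labels` come from the draft example above, and nothing here is part of the extension itself.

```python
from typing import Dict, List, Tuple


def layered_dimension(bounds: List[Tuple[float, float]], unit: str) -> Dict:
    """Build a vertical dimension entry from contiguous (top, bottom) layer bounds.

    Hypothetical helper: the layout mirrors the "soil_layer" draft above.
    """
    # Each layer must start where the previous one ends.
    for (_, bottom), (next_top, _) in zip(bounds, bounds[1:]):
        if bottom != next_top:
            raise ValueError("layers must be contiguous")
    return {
        "type": "spatial",
        "axis": "z",
        # Ordinal identifiers 1..n, one per layer.
        "values": list(range(1, len(bounds) + 1)),
        # Labels such as "0-7cm" carry the actual layer bounds.
        "labels": [f"{top:g}-{bottom:g}{unit}" for top, bottom in bounds],
    }


soil_layer = layered_dimension([(0, 7), (7, 28), (28, 100), (100, 255)], "cm")
```

With the ERA-5 soil layers as input, this reproduces the `values` and `labels` of the `soil_layer` entry in the draft.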
Dear Mathias
The main point really was that in weather and climate data sets the dimensions
up or down from the surface are not uniform. The data might represent not
single points, but the average for a column.
This means that dimensions with just a start and end extent are poor
representations. I believe that the datacube extension needs to describe this,
and I would like the extent as a list of lists type, but maybe having, in
addition to the minimum and maximum, another parameter like steps. These could
then hold lists or lists of extents. For uniform dimensions a simple single
value for how many would suffice.
In the end extent, unit and steps would be enough for both a small and simple as
well as a precise description.
Cheers
Mikko
|
Dear @mstrahl , good points for sure, thanks again for participating. I think changing the extent to include the internal structure of the dimension would not be correct and needs separate fields. An extent should simply be an extent, which means minimum and maximum bounds and that's it. So in addition, it could be a set of "groups" and/or steps that need to get a separate field. For the steps I'm not sure yet how to describe them properly. Irregularly spaced steps are somewhat difficult to describe with ISO8601, for example. Are there any existing standards that have a solution for this? It would be very valuable if we could take one or two datasets and describe them in STAC so that we get a better understanding of what is currently possible and what is lacking. Do you or anybody else have time to do so? Maybe also together in a call or so... Maybe ECMWF ERA-5 and one of the datasets Chris mentioned? Best, |
Just to complicate things further, I just learned about cf-json: http://cf-json.org/specification Here is what a cf-json object looks like: {
"attributes": {
"source": "cf-json.org",
"description": "Example wind data on a grid",
"timestamp": "2000-01-01T00:00:00Z"
},
"dimensions": {
"latitude": 8,
"longitude": 10
},
"variables": {
"longitude": {
"shape": ["longitude"],
"type": "float",
"attributes": {
"units": "degrees_east"
},
"data": [ 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
},
"latitude": {
"shape": ["latitude"],
"type": "float",
"attributes": {
"units": "degrees_north"
},
"data":[ 30.2, 30.4, 30.6, 30.8, 31.0, 31.2, 31.4, 31.6]
},
"wind_east": {
"shape": ["latitude", "longitude"],
"type": "float",
"attributes": {
"units": "ms^{-1}",
"long_name": "Easterly component of wind",
"standard_name": "eastward_wind"
},
"data":[
[ 5.3, 2.2, 2.2, 5.2, 1.6, 5.2, 6.7, 9.9, 8.4, 1.5],
[ 7.1, 1.9, 7.8, 6.8, 1.7, 2.3, 6.8, 2.6, 3.5, 4.5],
[ 1.5, 4.4, 5.9, 0.3, 7.6, 1.0, 6.6, 0.8, 2.8, 3.0],
[ 4.3, 4.0, 5.3, 1.1, 0.6, 7.9, 8.3, 9.0, 6.9, 3.5],
[ 5.1, 6.6, 4.5, 4.8, 2.7, 7.3, 9.3, 1.2, 4.2, 1.9],
[ 9.3, 9.1, 5.5, 5.2, 2.5, 0.1, 6.4, 9.5, 5.6, 5.9],
[ 0.1, 5.2, 0.8, 8.4, 3.8, 3.1, 8.7, 0.7, 1.0, 2.8],
[ 7.5, 2.7, 5.9, 6.8, 4.2, 9.1, 9.8, 4.7, 1.8, 6.9]
]
},
"wind_north": {
"shape": ["latitude", "longitude"],
"type": "float",
"attributes": {
"units": "ms^{-1}",
"long_name": "Northerly component of wind",
"standard_name": "northward_wind"
},
"data": [
[ 8.9, 2.1, 9.0, 6.0, 9.2, 8.6, 7.3, 7.2, 5.6, 6.0],
[ 8.4, 9.1, 0.2, 9.2, 6.4, 9.4, 6.3, 4.1, 1.3, 3.7],
[ 4.3, 5.3, 2.3, 8.5, 9.1, 9.8, 7.4, 2.5, 9.0, 0.9],
[ 8.1, 5.8, 2.7, 2.9, 7.6, 5.5, 4.8, 5.0, 3.9, 9.6],
[ 5.9, 7.2, 3.5, 4.7, 8.4, 9.3, 0.9, 9.6, 5.5, 5.8],
[ 7.9, 6.1, 2.2, 9.6, 9.8, 7.5, 0.1, 6.6, 0.0, 2.7],
[ 7.0, 6.0, 6.5, 1.1, 8.0, 9.0, 9.7, 1.6, 5.0, 6.6],
[ 6.0, 4.4, 5.0, 8.6, 5.6, 3.5, 1.9, 2.3, 7.2, 3.1]
]
}
}
} |
Could make sense to be able to refer to CF-JSON as an 'asset', and then to also use them as direct inspiration for the field names / structure to describe things like the irregularly spaced steps? Or does it not quite handle it either? |
I think cf-json is designed to hold the whole dataset, not just the metadata. So in STAC terms, it would definitely be an asset. I personally would be interested in a cf-json-like representation minus the actual data (which could live in a more scalable container like zarr). At that point, you're pretty close to the zarr metadata itself though, so not sure cf-json has any extra added value. I think netcdf-ld is probably closer to what we want: https://binary-array-ld.github.io/netcdf-ld/ |
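As a rough illustration of the "cf-json minus the actual data" idea, the following Python sketch drops the `data` arrays from a cf-json-like object and keeps dimensions, shapes and attributes. The function name `strip_data` is hypothetical, and the small input object is a shortened stand-in for the wind example above, not an official cf-json fixture.

```python
import copy
from typing import Any, Dict


def strip_data(cf_json: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a cf-json-like object without the variables' "data" arrays."""
    summary = copy.deepcopy(cf_json)  # leave the caller's object untouched
    for variable in summary.get("variables", {}).values():
        variable.pop("data", None)
    return summary


# Shortened stand-in for the wind example (assumed structure, 2 x 3 grid).
example = {
    "dimensions": {"latitude": 2, "longitude": 3},
    "variables": {
        "wind_east": {
            "shape": ["latitude", "longitude"],
            "type": "float",
            "attributes": {"units": "ms^{-1}", "standard_name": "eastward_wind"},
            "data": [[5.3, 2.2, 2.2], [7.1, 1.9, 7.8]],
        }
    },
}

summary = strip_data(example)
```

The resulting summary could be embedded as metadata, while the full file (or a zarr store) would be linked as an asset.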
That's a good hint, I basically forgot about cf-json and CovJSON, @rabernat. netcdf-ld is also interesting to look at. We also have a related discussion in openEO with a proprietary proposal. So, I took some time to dig into them. First of all, all specs except netCDF-LD include the data itself, which we don't want. So we could only use a subset of any of the specifications and link to those "full" files as an asset. This is basically what @cholmes already wrote, too. What I also noticed is that netcdf-ld and cf-json have a different understanding of dimensions than I have. In my approach, I'd basically merge dimensions (CovJSON: axes) with the variables (CovJSON: parameters). None of the specs seem to handle irregularly spaced steps and actually I generally don't know any spec yet that really handles them, @cholmes . ;-) If anybody could lead me to one, please do so! The cf-json spec is very compact and simple, which I like, but is only in version 0.2 (is it stable and used?) and lacks some details (e.g. no specification on units). Also, if we remove things that are out of scope for STAC (data, missing_values, type, dimension length, (shape?) are not really interesting for search) there is not much left that's not already in the current proposal. We could actually add a human-readable name/title and/or description to the dimensions. CovJSON on the other hand is a quite long specification (also in some pre-version "0.2-draft", but implemented in Hyrax), tackling a lot of special cases etc. Their domain property pretty much describes what we already have in this PR; of course I ignore the actual data again and also don't consider i18n. In their domain property, they describe axes (what I call dimensions in this PR) incl. extents and reference systems. Their parameters are a bit more complex, but include units, textual fields, categories and groups. They have many domain and axis types that encode different types of geospatial rasters, which we don't need I think. 
It's not important for searching and makes things overly complex. I'm struggling a bit with netcdf-ld as I couldn't find an actual specification? The homepage only lists examples. Can someone guide me to the spec? @rabernat ? Is it this (incomplete?) repo? Looking at the examples it seems they do exactly what I do in this PR. They have Dimensions with a length of the data and variables (again I'd merge these, but not sure whether that's meaningful for all domains). The variables have names, units, axis, ranges/grids/extents and some more things that are not so relevant: fill value, offset, scale factor (not so useful for search, see above). So what do I conclude from this?
|
covjson calls steps axis values, and uses netcdf encodes the ordinate positions (axis values) in separate 1-D variables called coordinate variables that have the same name as the corresponding dimension, and are enumerated along that dimension e.g.: "dimensions": {
"time": 3,
// ...
},
"variables": {
"time": {
"shape": ["time"],
"type": "float",
"attributes": {
"units": "days since 2000-01-01"
},
"data": [0.0, 1.0, 2.0] The recommendation is for the variable to be strictly monotonic, but it doesn't have to be equally spaced. |
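A small Python sketch of how such a coordinate variable could be read: it checks the strict-monotonicity recommendation and decodes offsets for the single unit form used in the example. Both function names are hypothetical, the parser only handles `days since YYYY-MM-DD`, and it assumes the standard calendar; full CF time handling covers many more unit strings and calendars.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def is_strictly_monotonic(values: Sequence[float]) -> bool:
    """True if the values strictly increase or strictly decrease."""
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


def decode_days_since(units: str, data: Sequence[float]) -> List[datetime]:
    """Decode offsets for units of the form "days since YYYY-MM-DD" only."""
    prefix = "days since "
    if not units.startswith(prefix):
        raise ValueError(f"unsupported units: {units!r}")
    origin = datetime.strptime(units[len(prefix):], "%Y-%m-%d")
    return [origin + timedelta(days=offset) for offset in data]


times = decode_days_since("days since 2000-01-01", [0.0, 1.0, 2.0])
regular_ok = is_strictly_monotonic([0.0, 1.0, 2.0])
irregular_ok = is_strictly_monotonic([0.0, 1.0, 4.5])  # irregular spacing is still fine
```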
@mkadunc I hoped there would have been something more "compact" to be included in the metadata instead of listing all the actual values there. I'm not quite sure what I thought it could be, so this might actually be the only valuable solution except for just stating "irregular" in the metadata. |
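One possible compromise between listing every value and only stating "irregular" can be sketched in Python: emit a compact `extent` plus `step` when the spacing is regular, and fall back to the explicit `values` otherwise. The function name `describe_axis` and the tolerance are assumptions for illustration; the field names `extent`, `step` and `values` follow the terms used in this discussion, not a finalized specification.

```python
import math
from typing import Dict, Sequence


def describe_axis(values: Sequence[float], rel_tol: float = 1e-9) -> Dict:
    """Describe an axis compactly if regularly spaced, otherwise list its values."""
    if len(values) < 2:
        raise ValueError("need at least two values to determine spacing")
    extent = [min(values), max(values)]
    diffs = [b - a for a, b in zip(values, values[1:])]
    # Regular spacing: every difference matches the first one within tolerance.
    if all(math.isclose(d, diffs[0], rel_tol=rel_tol) for d in diffs):
        return {"extent": extent, "step": diffs[0]}  # step keeps the sign of the spacing
    return {"extent": extent, "values": list(values)}


regular = describe_axis([0, 100, 200, 300])
pressure = describe_axis([1000, 925, 850, 700, 500, 300])
```

For the pressure levels mentioned earlier in the thread, the spacing is irregular, so the sketch falls back to the full list of values.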
Or you could use covjson's |
Dear Mathias
Tomorrow we will make an example at FMI describing a HIRLAM model result
with STAC. I'll add it to GitHub once it is ready.
Cheers
Mikko
|
I updated the draft to be more flexible. It now allows specifying a set of values, a step etc. It should (mostly) allow you to describe your dataset, @mstrahl, does it? By the way, did you have time to come up with an example? Otherwise it may be easier now with the updated draft. What do you all think? Here's the new example enriched with some comments:
|
It is hard for me to wrap my head around "temperature" being called a dimension, but I am new to the stac spec. I imagine your example re-written as follows: {
"stac_version": "0.6.0",
"id": "datacube",
"description": "Multi-dimensional data cube.",
"links": [
{
"rel": "self",
"href": "catalog.json"
}
],
"properties": {
"cube:dimensions": {
"x": {
"type": "spatial",
// Change "number" to "axis" and explicitly state which axis
"axis": "x",
"extent": [-180, 180],
"step": 2,
// Change "reference_system" to "crs"
"crs": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
},
"y": {
"type": "spatial",
// Change "number" to "axis" and explicitly state which axis
"axis": "y",
"extent": [-90, 90],
"step": 1,
// Change "reference_system" to "crs"
"crs": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
},
"pressure_levels": {
"type": "spatial",
// Change "number" to "axis" and explicitly state which axis
"axis": "z",
// Extent/step values
"extent": [0, 1000],
"step": 100,
"unit": "http://www.opengis.net/def/uom/SI/Pa",
// Vertical dimensions need a "crs" as well
"crs": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
},
"metered_levels": {
"type": "spatial",
// Change "number" to "axis" and explicitly state which axis
"axis": "z",
// Specific dimension values
"values": [0, 10, 25, 50, 100, 1000],
"unit": "http://www.opengis.net/def/uom/SI/metre",
// Vertical dimensions need a "crs" as well
"crs": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
},
"time": {
"type": "temporal",
"extent": ["2015-01-01T00:00:00Z", "2018-12-31T23:59:59Z"],
// Support time intervals in ISO 8601
"step": "PT1H"
}
},
"cube:data": {
"temperature_metered": {
"extent": [0, 273.15],
// Specify the dimensions of this "cube" of data using the dimension names
"dimensions": ["time", "x", "y", "metered_levels"],
"unit": "http://www.opengis.net/def/uom/SI/kelvin"
},
"temperature_pressure": {
"extent": [0, 273.15],
// Specify the dimensions of this "cube" of data using the dimension names
"dimensions": ["time", "x", "y", "pressure_levels"],
"unit": "http://www.opengis.net/def/uom/SI/kelvin"
}
}
}
} It opens up lots of questions for me - are multiple cubes per definition allowed? You may find this interesting: UGRID is a way to describe any "cube" (UGRID calls it a "mesh") of unstructured or structured data. It follows the CF specification and has been proposed for incorporation into CF in the (hopefully near) future. https://github.com/ugrid-conventions/ugrid-conventions |
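If variables were to reference dimensions by name, as in the suggested `cube:data` block, a consumer would probably want to check that every referenced name is actually declared. A minimal Python sketch follows; note that `cube:data` is only the suggestion made in this comment, not a field of the extension, and the function name `undeclared_dimensions` is hypothetical.

```python
from typing import Any, Dict, List


def undeclared_dimensions(properties: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each variable to the dimension names it uses that are not declared."""
    declared = set(properties.get("cube:dimensions", {}))
    problems: Dict[str, List[str]] = {}
    for name, variable in properties.get("cube:data", {}).items():
        missing = [d for d in variable.get("dimensions", []) if d not in declared]
        if missing:
            problems[name] = missing
    return problems


# Deliberately incomplete: "pressure_levels" is referenced but not declared.
properties = {
    "cube:dimensions": {"x": {}, "y": {}, "time": {}, "metered_levels": {}},
    "cube:data": {
        "temperature_metered": {"dimensions": ["time", "x", "y", "metered_levels"]},
        "temperature_pressure": {"dimensions": ["time", "x", "y", "pressure_levels"]},
    },
}

problems = undeclared_dimensions(properties)
```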
@kwilcox Thanks for your comments, highly appreciated. Most of your suggestions actually have been in the proposal before, but I changed them as I now inherit the fields from the Dimension Object (see the actual specification, not just the example, please). So crs is just another name for reference_system here and basically the same applies to number / axis. Both crs/trs and axis are specific, just other names. The actual meaning is the same and doesn't change, but it allows inheriting the fields. If we want to use the specific names we need to duplicate all the other fields in the Dimension Object, which feels messy and duplicates several definitions. If there are strong reasons for switching to crs/trs and axis, we could still do it. In this case please write down the reasons. ;-) Temperature is not a good example, I agree. I need to find a better one. You added cube:data, but I'm not sure what you want to express with it. What's the difference to cube:dimensions? We don't want the actual data in the STAC files. I'll add ISO8601 steps for temporal dimensions. I haven't yet worked with meshes or the like, I only work with one cube at a time. So if you need more than one cube, you'd need to come up with a proposal to be added here. |
Removed From my side this is ready to be reviewed and merged for 0.7. IMHO further Feedback can better be discussed in focused PRs. |
The only thing that still bugs me with this PR is that the Z axis is often quite different to the X and Y axes, so the specifications for Z axes are usually not covered by the Spatial Dimension Object, but are rather more just a Dimension Object with an additional axis field. So maybe we need to separate the Z axis from X and Y somehow, but I don't have a good and clean idea at the moment. |
In geodesy, the vertical equivalent of the "usual" 2D coordinates is a "height" axis, which usually has its own reference system, of type Vertical CS - see e.g. https://epsg.io/?q=vertical%20CS%20kind%3ACS . The height axis is usually abbreviated with The abbreviation |
Let's get this merged. We discussed it on the STAC call, and it seems like a good idea to get it in as a 'proposal', so we can do PRs on parts of the spec instead of piling everything into this massive PR.
cc: @mojodna and @matthewhanson - would be great for one of you to approve so we can merge and continue work there.
This is a first and very basic draft for a datacube extension.
The initial aim was to describe the dimensions differently from what we currently have implemented as "extents" in the collections, as that is not a very good representation of dimensions as used for datacubes. For example, spatial dimensions could be represented as one, two or three dimensions, there could be multiple time dimensions etc.
Files to be added once the draft is reviewed and more stable:
We need this for openEO, and I think there are other organizations that could also be interested in it, for example Geoscience Australia / openDataCube (@omad), and I hope to get some ideas from there as well.