CF expansion and alignment #8
Conversation
|
Hi @Fred-Leclercq @m-mohr |
|
Why is it so important to follow the CF property names exactly? Can't we just use name and unit for backward compatibility? Are all these properties required in the metadata? I'm not a fan of dumping everything into STAC without a justifying use case. If there's a good use case for every field, it's fine to have them all. Otherwise, the STAC philosophy is to keep things as small and simple as possible. |
|
Thanks for your feedback @m-mohr . We need a STAC catalogue that we can search for the correct field. "air_temperature" as a standard_name is insufficient: it can be an instantaneous measurement, an hourly mean/maximum/minimum, or a 24-hour mean/maximum/minimum. The same goes for rainfall and many other fields. The cell_methods field allows filtering for the required "air_temperature" fields. Not everybody is deeply familiar with reading the cell_methods field, so the long name is a detailed layman's description of what the element represents; this describes the data to a wider audience than CF experts. Staying with air_temperature as an example, the vertical dimension can be in metres (above ground, above model level, above mean sea level), in pressure units (Pascal) or in model sigma levels. We need to find the data that has elements at the right vertical height, and these additional fields allow filtering for this. As suggested, I'm happy to add this structure as a cf:elements or cf:exact next to the cf:parameter definition to maintain backward compatibility. |
|
I’m inclined to accept this pull request, but I’d like to review it in more detail. Most likely by the end of this week or early next week. @m-mohr any objections or remarks? |
|
Yeah, I've never been a fan of dumping every piece of information possible into STAC without any kind of alignment. For example, STAC already defines common metadata for several of these fields. Similarly for long_name: that seems to be meant as a title or description ("for example, be used for labeling plots"), and we have description for that. standard_name breaks compatibility just for the sake of a different name. Why does name not work, to keep compatibility with existing implementations? The values can be as in CF; it's really just the key in the JSON. vertical_dimension looks like a use case for the datacube extension. asset_variable_name: if something only applies to a specific asset, the metadata should be in the asset. (Then there's weirdness creeping in from CF's design: if cell_methods "compris[es] a list", why is it a string and not an array?) So right now I just see CF information being dumped into STAC without an attempt to properly align with STAC, so in its current state I'd be -1 on this PR, sorry. I'm happy to add the information that is required for your use case, but it should fit into the STAC ecosystem. If people don't align, in the end everybody just has a JSON encoding of their proprietary stuff, and STAC becomes more or less useless. The intention is that everyone can easily become familiar with what we describe in STAC, not just people familiar with CF. |
|
An alternative idea that just came to mind is to define searchable fields directly, since search was mentioned in the PR introduction. For example, we could define the following fields
And reuse existing fields:
asset_variable_name is not needed any longer. These fields can now be used in any place, for example in items, bands, the datacube extension, assets, etc. So if a specific item uses one specific standard name, just embed cf:standard_name in the properties. Easily searchable. I want to enable your use cases; I'm just not sure whether the proposed approach is really well-suited for STAC. I think it could be easier to discuss and try this in a call. Maybe the STAC community call? |
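To make the idea concrete, here is a minimal, purely illustrative sketch (the item contents and the helper function are invented for this example, not part of any STAC library): with the CF fields embedded flat in properties, a plain property comparison is all a search needs.

```python
# Illustrative only: two fake items carrying flat "cf:" properties.
items = [
    {"id": "a", "properties": {"cf:standard_name": "air_temperature",
                               "cf:cell_methods": "time: mean"}},
    {"id": "b", "properties": {"cf:standard_name": "wind_speed",
                               "cf:cell_methods": "time: minimum"}},
]

def find_by_standard_name(items, name):
    """Return the items whose properties carry the given cf:standard_name."""
    return [i for i in items if i["properties"].get("cf:standard_name") == name]

print([i["id"] for i in find_by_standard_name(items, "wind_speed")])  # ['b']
```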
|
Thanks for the feedback @m-mohr . Very helpful to get your insights into the considerations for STAC extensions as a whole. A few points for further discussion:
Maybe I can list a set of variables that we typically find in a dataset, for thoughts on how best to catalogue it in STAC:
- f10 1h min
- f10 1h mean
- f10 24h min
- f10 24h mean
- f60
- t2m
- t850hPa
- t500hPa

Hopefully this relevant example can help guide a STAC solution. |
The STAC community calls are every other Monday, 17:00 CEST. Join https://groups.google.com/a/cloudnativegeo.org/g/stac-community to get an invite.
That's the point, I guess. All extensions should be general-purpose in the core. If a more specific variant is needed, it should inherit from the general-purpose extension. I'm not a fan of the xmip6, xarray and landsat extensions.
Yes, I fully agree with you. The xarray extension is not well-designed, it's just a lazy way to dump "proprietary" stuff into STAC. If every programming language starts doing this we end up with open_kwargs, open_args_for_r, parameters_in_js, etc. Not good.
There's no ambiguity if the extension is written properly and specifies a mapping to the existing scheme. I appreciate the examples. Are any of the variables a separate asset? In that case I think my previous proposal would work well, listing the properties independently, not in an array. So if, for example, f10 1h min and t500hPa are each an asset, it could/should look more like this: {
"assets": {
"f10_1h_min": {
"href": "f10_1h_min.nc",
"type": "application/netcdf",
"cf:standard_name": "wind_speed",
"cf:height": 10, # this should probably be generalized, height is not CF specific
"description": "minimum wind speed in 1 hour at 10 m agl",
"unit": "kt",
"cf:cell_methods": "time: minimum (interval: -1 h)"
},
"t500hPa": {
"href": "t500hPa.nc",
"type": "application/netcdf",
"cf:standard_name": "air_temperature",
"cf:air_pressure": 500, # this should probably be generalized, air pressure is not CF specific
"description": "air temperature at 500 hPa",
"unit": "degC",
}
}
}

If it's a single netCDF asset, you'd use the datacube extension and provide the fields similarly in variables instead of in assets. Datacube extension variables are open to being extended by the CF extension. A bit of discussion is probably needed for height and air_pressure; they should probably not be defined in the CF extension. |
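As a hedged sketch of how a client could scan such per-asset metadata (the asset dictionaries below are trimmed copies of the example above; the helper function is hypothetical, not part of any STAC tooling):

```python
# Trimmed copy of the asset example above, keeping only the CF fields.
assets = {
    "f10_1h_min": {"cf:standard_name": "wind_speed",
                   "cf:cell_methods": "time: minimum (interval: -1 h)"},
    "t500hPa": {"cf:standard_name": "air_temperature"},
}

def assets_with_standard_name(assets, name):
    """Keys of the assets whose cf:standard_name matches."""
    return [key for key, asset in assets.items()
            if asset.get("cf:standard_name") == name]

print(assets_with_standard_name(assets, "air_temperature"))  # ['t500hPa']
```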
The different variables all sit in a single NetCDF file. But even if they didn't, it's probably not ideal to define the different vertical dimensions with a "cf:" prefix, since there are many (and there could be more in the future), and they are all defined as standard_names themselves anyway. However, trying to use the datacube extension, I get something like the following to define the elements, their vertical positions, and the time interval that cell_methods is applied over. What do you think of the below, @m-mohr ? |
Sure, that's fine and would work similarly to what you are showing in your example.
I'm not sure I understand this sentence...
The example looks good, except that you embedded the dimension information directly into the variables; those are listed separately in the datacube extension. I'm confused how the dimensions can have the same name but different values and units at the same time.
No, but it would also not work with the current CF extension or your proposal in this PR. STAC search has a hard time searching through arrays of objects or assets. So neither of the proposed solutions would work well for search yet. |
The name of the vertical definition is meaningless; only the CF fields within the object define the vertical dimension precisely. I gave these neutral names below to reflect this. But how would I pull out the time or vertical dimension, which is specific to each of the cube:variables? If I pull this out, then I would need a field to link the vertical dimension to the variable, and that would make a search even more difficult, since the label that can be used for the vertical dimension is meaningless without the CF fields that define it. Hope you can help, @m-mohr . |
|
Not sure whether this solves your issue, but this is at least compliant to the datacube extension: {
"datetime": "2020-12-11T22:38:32Z",
"cube:dimensions": {
"time_interval1": {
"type": "temporal",
"description": "time interval that cell_methods is applied over",
"values": [-24],
"unit": "h"
},
"vertical_dimension1": {
"type": "spatial",
"axis": "z",
"cf:standard_name": "height",
"description": "Height above ground level",
"unit": "m",
"values": [10]
},
"time_interval2": {
"type": "temporal",
"description": "time interval that cell_methods is applied over",
"values": [-60],
"unit": "min"
},
"vertical_dimension2": {
"type": "spatial",
"axis": "z",
"cf:standard_name": "air_pressure",
"description": "Air pressure",
"unit": "hPa",
"values": [500]
}
},
"cube:variables": {
    "sea_surface_temperature": {
      "type": "data",
      "cf:standard_name": "sea_surface_temperature",
      "description": "Average temperature on sea surface for preceding 24 hours",
      "unit": "K",
      "cf:cell_methods": "time: mean",
      "dimensions": ["time_interval1"]
    },
    "wind_speed_at_10m": {
      "type": "data",
      "cf:standard_name": "wind_speed",
      "description": "minimum wind speed in 1 hour at 10 m agl",
      "unit": "kt",
      "cf:cell_methods": "time: minimum",
      "dimensions": ["vertical_dimension1", "time_interval2"]
    },
    "temp_at_500hPa": {
      "type": "data",
      "cf:standard_name": "air_temperature",
      "description": "air temperature at 500 hPa",
      "unit": "degC",
      "dimensions": ["vertical_dimension2"]
    }
  }
} |
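To illustrate the lookup this layout implies, here is a hedged Python sketch (the dictionaries are a trimmed copy of the datacube example above; the helper function is invented for illustration): answering "which variables sit at 10 m above ground?" means following each variable's dimension labels back into cube:dimensions.

```python
# Trimmed copies of the cube:dimensions / cube:variables objects above.
cube_dimensions = {
    "vertical_dimension1": {"axis": "z", "cf:standard_name": "height",
                            "unit": "m", "values": [10]},
    "vertical_dimension2": {"axis": "z", "cf:standard_name": "air_pressure",
                            "unit": "hPa", "values": [500]},
}
cube_variables = {
    "wind_speed_at_10m": {"cf:standard_name": "wind_speed",
                          "dimensions": ["vertical_dimension1"]},
    "temp_at_500hPa": {"cf:standard_name": "air_temperature",
                       "dimensions": ["vertical_dimension2"]},
}

def variables_at(standard_name, value, unit):
    """Variables whose z dimension matches the given CF name, value, unit."""
    hits = []
    for var_name, var in cube_variables.items():
        for dim_label in var.get("dimensions", []):
            # The label itself is meaningless; the CF fields decide the match.
            dim = cube_dimensions.get(dim_label, {})
            if (dim.get("axis") == "z"
                    and dim.get("cf:standard_name") == standard_name
                    and dim.get("unit") == unit
                    and value in dim.get("values", [])):
                hits.append(var_name)
    return hits

print(variables_at("height", 10, "m"))  # ['wind_speed_at_10m']
```

The two-step indirection (variable, then dimension label, then dimension fields) is exactly what makes this layout awkward for a flat property search.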
|
Sorry for not coming back for a while. I'm doing multiple roles at work at the moment. Agree that normalising the vertical and time dimensions into separate definitions is cleaner and in line with the datacube extension. |
|
Depends on your database implementation; the STAC API specification itself would allow it via custom queryables. Alternatively, duplicate the information into the existing cf:parameter property; I think that was the primary purpose of the original extension anyway. |
|
Hm. But then we've come back full circle to where we started, since the original extension neither provides the ability to define a time aggregation dimension for cell_methods nor allows specifying the vertical dimension and value. Together, these are essential for finding the right environmental variable. Unless I misunderstand how you are suggesting to use cf:parameter, @m-mohr . Re: API custom queryables, could you sketch how the custom query would look? It would somehow need to search for the variable first, then read the dimension labels from there, and then look up the dimensions with the matching labels to filter for the relevant time and height parameters. And since the labels are arbitrary, this does not look easy to specify. How would you see this working? |
No, I don't think so. Due to the common metadata model, you can use the fields in multiple places and as such can summarize at a higher level than the variables for search purposes.
Could you sketch how it would be working with your proposal? I'm not 100% sure I understand it yet. |
Co-authored-by: Emmanuel Mathot <emmanuel.mathot@gmail.com>
I'll try, @m-mohr . To find which collections have a minimum wind speed in 1 hour at 10 metres above ground level, I would need to extract information in the following order:
What would a queryable look like that embeds this kind of logic and extracts some of the necessary information (like the label of the variables object and the dimension labels) per collection and datacube?
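For reference, one hedged guess at what such a request could look like, assuming (hypothetically) that the server summarizes the CF fields to item level and exposes queryables named cf:standard_name and cf:cell_methods. The dict below is a CQL2-JSON filter as defined by the STAC API Filter extension:

```python
# Hypothetical CQL2-JSON filter for a STAC API /search POST body.
# The queryable names "cf:standard_name" / "cf:cell_methods" are
# assumptions; real queryables depend on the server's /queryables response.
cf_filter = {
    "op": "and",
    "args": [
        {"op": "=", "args": [{"property": "cf:standard_name"}, "wind_speed"]},
        {"op": "=", "args": [{"property": "cf:cell_methods"}, "time: minimum"]},
    ],
}
body = {"filter-lang": "cql2-json", "filter": cf_filter}
print(body["filter"]["op"])  # and
```

Note that this presumes the flat, item-level layout; CQL2 cannot easily reach through cube:variables and arbitrary dimension labels, which is exactly the difficulty with the datacube layout.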
|
Thanks for the feedback @m-mohr |
|
@drandyziegler I can't push my changes to your branch.
(I'm not sure if you can enable this after creating the PR; in the worst case I can just merge it into a branch here, make my edits, and create a new PR from that branch.) |
|
@m-mohr Can't find this option in this PR. It might be something that needs to be set when creating the PR, as you suspected. |
|
@m-mohr I added you as a collaborator to the project. Hope this helped. |
|
Thanks, the CI errors are fixed. Remaining todos:
|
Thanks @m-mohr . |
|
Thanks. I can try to do that in the next days. Can you add a bit more info about what null means in this case? |
|
Yes @m-mohr . cell_methods is applied over a dimension. @emmanuelmathot suggested in the comment here #8 (comment) having this as an array that lines up with the order of the dimensions array. I added a bit more to the README to explain this better. This array approach has two shortcomings:
In these cases a |
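As a sketch of the positional alignment described in that comment (the labels are illustrative; Python's None stands in for JSON null, marking a dimension that no cell method is applied over):

```python
# cell_methods entries line up positionally with the dimensions array;
# null/None means "no cell method applied over this dimension".
dimensions = ["time_interval1", "vertical_dimension1"]
cell_methods = ["time: minimum", None]

paired = dict(zip(dimensions, cell_methods))
print(paired["time_interval1"])    # time: minimum
print(paired["vertical_dimension1"])  # None
```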
m-mohr
left a comment
I'm approving for now; I still need to update the schema, but will do that in a separate PR.
|
Hi @emmanuelmathot |
emmanuelmathot
left a comment
Fine with me for now as well. Need a schema update for a proper release
|
Okay, let's merge this for now. The JSON Schema issue is tracked in #9. It is assigned to me, but I'm pretty busy for the next few months, so if anyone gets to it before me, feel free to take over. |
Addressed the following
Updated the schema accordingly, but noticed that the test does not pick up on typos in the "vertical_dimension" object. No idea why.