List non-common fields in Collection (Summaries) #413

Closed
m-mohr opened this issue Mar 5, 2019 · 19 comments
Assignees
Labels
prio: should-have would be very good to have in the release stac-catalog stac-collection
Milestone

Comments

@m-mohr
Collaborator

m-mohr commented Mar 5, 2019

In openEO we need to list also the non-common fields in Collections, i.e. all properties that are available in the Items but have different values. We don't necessarily need to list the actual values (or extents), but we need to inform the user what he can query against. This is probably similar to what we plan to add for the assets as "asset definition" or "asset schema". So a collection with an "item schema" could be something as simple as:

{
    "id": "s2",
    "title": "Sentinel-2",
    "description": "...",
    "stac_version": "0.6.2",
    "extent": {
        "spatial": [-180,-90,180,90],
        "temporal": ["2013-06-01",null]
    },
    "properties": {
        "eo:gsd": 10,
        "eo:platform": "sentinel-2",
        "eo:instrument": "MSI",
        "eo:bands": []
    },
    "varying:properties": [
        "eo:epsg",
        "eo:cloud_cover",
        "eo:off_nadir"
    ],
    "links": []
}

I need to define this anyway for openEO in the next few days, so the question I have is whether that could be something that is useful also for other users so we can combine efforts and standardize. Otherwise I'd probably start off as a proprietary extension.

This could be something that may also be useful for STAC API users that don't want to look into Items to find out what they could query against.
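
As a rough illustration of the proposal above, here is a minimal Python sketch that derives the fixed-value "properties" and the list of varying property names from a set of Items. The function name `split_properties` and the sample Items are hypothetical and not part of STAC or openEO; the sketch only assumes that each Item carries a "properties" object.

```python
from __future__ import annotations

from typing import Any


def split_properties(items: list[dict[str, Any]]) -> tuple[dict[str, Any], list[str]]:
    """Split Item properties into common values and names of varying properties.

    A property is "common" if every Item has it with the same value.
    Everything else (different values, or missing in some Items) is "varying".
    """
    common: dict[str, Any] = {}
    varying: set[str] = set()
    for index, item in enumerate(items):
        props = item.get("properties", {})
        if index == 0:
            # Start by assuming every property of the first Item is common.
            common = dict(props)
            continue
        # Demote properties that are missing or differ in this Item.
        for key in list(common):
            if key not in props or props[key] != common[key]:
                varying.add(key)
                del common[key]
        # Properties not (or no longer) in the common set are varying.
        for key in props:
            if key not in common:
                varying.add(key)
    return common, sorted(varying)


# Hypothetical sample Items, for illustration only.
sample_items = [
    {"properties": {"eo:gsd": 10, "eo:platform": "sentinel-2", "eo:cloud_cover": 5}},
    {"properties": {"eo:gsd": 10, "eo:platform": "sentinel-2", "eo:cloud_cover": 40, "eo:epsg": 32601}},
]
common_properties, varying_properties = split_properties(sample_items)
```

With the sample above, `common_properties` would hold the fields that go into "properties" and `varying_properties` the names that go into the proposed list field.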

@matthewhanson
Collaborator

I think this is useful. I've always seen a Collection as a definition for the Items that are in it. So a user should be able to look at a Collection and get all the info they need to know about the Items without having to get Items to look at the structure.

I think a name like "available_properties" is better than "varying" though, and don't think it needs a prefix.

@m-mohr
Collaborator Author

m-mohr commented Mar 5, 2019

Regarding the name: I understand "available_properties" as "all properties available", but I currently only list the properties that are not actually listed with fixed values in "properties", so that's why I chose varying. But I'm really open to a new name, because varying was just a synonym I googled for uncommon, which felt even worse. (Non-native speaker issues, I guess)... Happy to remove the prefix.

Edit: other_properties was another idea I had earlier.

@jbants
Collaborator

jbants commented Mar 12, 2019

How about extended_properties or additional_properties?

@cholmes
Contributor

cholmes commented Mar 12, 2019

Just spent a good 15 minutes thinking about this...

It seems to me like the ideal answer might be to have just a 'properties' object, that can contain:

  • 'locked' properties (the common properties), just one value
  • 'range' properties, that give the range you can query, or an enum.
  • 'open' properties, that just have the property name and take any value.

So you could just have:

"properties": {
  "eo:gsd": 10,
  "eo:platform": {
      "values": [
        "Sentinel-2A",
        "Sentinel-2B"
      ]
    },
    "eo:epsg": {
      "values": [
        32601,
        ...
      ]
    },
    "my:property": [],
     ...
}

Though then I'd maybe want to rename it 'properties_definition' instead of 'properties'.

That obviously doesn't work as an extension. If we do want to start as an extension (which in general I think is a good idea), then I think I actually like the original varying:properties suggestion the best (or variable:properties which is what I thought up before seeing the original). other_properties also works for me. I don't like the name of the extension as 'non-common' though. I'd call it 'varying properties definition extension' if we go with varying:properties. If we go with other_properties then I'd probably call it the 'other properties in collections' extension or something like that. I think the name of the field we use should be the same or else quite close to the name of the extension (and both should describe as best they can what the extension is about).
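
To make the three kinds of entries in the combined 'properties' object concrete, here is a small Python sketch that classifies each entry as 'locked', 'range' or 'open'. The detection rules are an assumption read off the example above (an object with a "values" key is a range/enum, an empty list is open, anything else is locked); the function name `classify_property` is hypothetical.

```python
def classify_property(value):
    """Classify one entry of the combined 'properties' object.

    Assumed heuristic, based only on the example in this comment:
    - an object with a "values" key is a 'range' (or enum) property,
    - an empty list is an 'open' property that takes any value,
    - anything else is a 'locked' property with one fixed value.
    """
    if isinstance(value, dict) and "values" in value:
        return "range"
    if isinstance(value, list) and not value:
        return "open"
    return "locked"


# Hypothetical input modelled on the example above.
example_properties = {
    "eo:gsd": 10,
    "eo:platform": {"values": ["Sentinel-2A", "Sentinel-2B"]},
    "my:property": [],
}
kinds = {name: classify_property(value) for name, value in example_properties.items()}
```

A heuristic like this cannot tell a range entry apart from an extension-defined object that happens to contain "values", so a real design would need an unambiguous marker.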

@m-mohr
Collaborator Author

m-mohr commented Mar 13, 2019

Regarding the naming: I have released the openEO API based on the draft we have currently so it's now baked in there for some time and we'll try it out in the coming months to see how it works. So for now we are using other_properties, but I'm not quite happy with it. It doesn't describe the field very well. varying_properties or variable_properties would be much more intuitive, I guess. So I'd be happy to revert the change. Based on the field name we can then find a name for the extension.

@cholmes I had similar ideas when I was drafting the extensions. It feels better to have them all on the same level and I would love to define it that way, but I scrapped it as it has some major drawbacks:

  1. The commons extension was designed in a way that providers could simply merge the collection properties back into the items. That makes it very easy to construct a full item and would get lost with the new approach.
  2. Validation of collections won't work anymore as it would fail on the new objects, which are not compliant. Also, it requires the field to be named 'properties'; 'properties_definition' would kill the validation of collections, too. I ignore issues with the required properties for now as it is an issue we still have to solve globally, see Validating extensions with required fields fails when used in collection properties #421.
  3. It's hard to distinguish between Summary Objects and objects that are defined in extensions. For example, is an object containing "values" a Summary Object or a Dimension Object from the Data Cube extension? Both define it in a very similar way.
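
The merge mentioned in the first drawback can be sketched in a few lines of Python. This is a minimal sketch, not the normative behaviour of the commons extension: the function name `merge_commons` is hypothetical, and the rule that Item values win over Collection values on conflict is an assumption.

```python
import copy


def merge_commons(collection, item):
    """Return a full Item by merging Collection-level properties into it.

    Assumption: when a property exists in both places, the Item value wins.
    Neither input is modified.
    """
    merged = copy.deepcopy(item)
    properties = copy.deepcopy(collection.get("properties", {}))
    properties.update(merged.get("properties", {}))
    merged["properties"] = properties
    return merged


# Hypothetical Collection and Item, for illustration only.
example_collection = {"id": "s2", "properties": {"eo:gsd": 10, "eo:platform": "sentinel-2"}}
example_item = {"id": "item-1", "properties": {"eo:cloud_cover": 5}}
full_item = merge_commons(example_collection, example_item)
```

The merge only works this simply while every Collection property is a plain value. Once 'properties' also holds summary objects, a client would first have to filter those out, which is the point of the first drawback.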

@cholmes
Contributor

cholmes commented Mar 13, 2019

Hmmm... yeah, those are some major drawbacks.

I do like the idea of a lightweight way to say what properties are in a collection. But the other route is to define a fuller item definition. Of course when we say 'item schema', then it seems to beg the question of why not just use JSON Schema? It feels to me that the 'heavyweight' solution would be that a collection can/should reference the JSON Schema that is valid for the collection, like how the boundless server generates the schema for its collections (https://stac.boundlessgeo.io/stac/schema/landsat-8-l1). A static catalog would just make that a static reference. It'd be the combination of the extension schemas, that would fully describe all the fields.

In JSON Schema I imagine you can also specify the 'commons' constraints - like just say that the "eo:gsd" is 10, instead of any number.

I think the common properties would still make sense to have, as to me the purpose of that is to reduce the repeating fields in each item. But when the purpose is to let clients know what to query against then maybe the heavier JSON Schema definition makes sense?

@m-mohr
Collaborator Author

m-mohr commented Mar 13, 2019

I'm personally not interested in an Item Schema and so I guess that term was not correct. What I tried to achieve here is probably better called an Item Summary: what values does a collection offer in its items? Give a summary about the content, either a range or the values I can expect and query for. I fear that JSON Schema makes things very complicated as it allows a whole lot of stuff. Of course, it offers enum, minimum and maximum for what we currently describe as extent and range, but do we really want complex schemas with anyOf, not, etc. in the collection? That's not easily parseable and readable.
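
A minimal Python sketch of such an Item Summary: numeric properties are summarized as a range and other primitive properties as a list of distinct values. The function name `summarize` and the output keys "min", "max" and "values" are assumptions for illustration, not field names agreed in this thread. Non-primitive values are only listed by name here.

```python
def summarize(items):
    """Summarize Item properties as ranges or lists of distinct values."""
    collected = {}
    for item in items:
        for key, value in item.get("properties", {}).items():
            collected.setdefault(key, []).append(value)

    summaries = {}
    for key, values in collected.items():
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            # Numbers: report the extent that can be queried.
            summaries[key] = {"min": min(values), "max": max(values)}
        elif all(isinstance(v, (str, int, float, bool)) for v in values):
            # Other primitives: distinct values in first-seen order.
            summaries[key] = {"values": list(dict.fromkeys(values))}
        else:
            # Non-primitive values (objects, arrays): only the name is listed.
            summaries[key] = None
    return summaries


# Hypothetical Items, for illustration only.
summary_items = [
    {"properties": {"eo:cloud_cover": 5, "eo:platform": "Sentinel-2A", "eo:bands": [{"name": "B1"}]}},
    {"properties": {"eo:cloud_cover": 40, "eo:platform": "Sentinel-2B", "eo:bands": [{"name": "B1"}]}},
    {"properties": {"eo:cloud_cover": 12, "eo:platform": "Sentinel-2A", "eo:bands": [{"name": "B1"}]}},
]
item_summary = summarize(summary_items)
```

Treating every numeric field as a range is a simplification: a code-like field such as eo:epsg would arguably be better served by a list of values.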

@cholmes
Contributor

cholmes commented Mar 13, 2019

Yeah, I hear you on json schema getting complicated. I just fear that it's a slippery slope from item summary to item schema. Like someone will want to express anyOf / not in their summary, and we'll reinvent json schema. Alternatively I suppose we could say that only simple schemas are allowed? Like just use a subset of JSON Schema.

I'm in no way set on JSON Schema for this, and like I said, I do see a space for a lightweight thing, an item summary. Do you feel the summary needs to be inside the collection json? It could potentially be cleaner to just have an Item Summary as a separate file, that has all the properties (common and varying). Or I suppose you could have Item Summary in the collection, but have it include all the properties.

I guess the crux of the issue is whether it's better to not repeat, leveraging the common properties as the way to define part of the Item Summary, or it's better to have a clean Item Summary that lists all the values, with some duplication of the common fields.

I think I lean a bit to the latter (but likely could be convinced otherwise), that an Item Summary extension should be stand alone, and should define all that is needed, doing its thing instead of trying to mix purposes.
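
To illustrate the "subset of JSON Schema" idea, here is a Python sketch that maps a simple summary onto the JSON Schema keywords const, enum, minimum and maximum and nothing else. The input layout (a plain value for a fixed property, an object with "values" for an enum, an object with "min" and "max" for a range) and the function name `summary_to_schema` are assumptions made for this sketch.

```python
def summary_to_schema(summaries):
    """Translate a simple summary into a restricted JSON Schema object.

    Only const, enum, minimum and maximum are produced, so combinators
    such as anyOf or not can never appear in the result.
    """
    properties = {}
    for name, summary in summaries.items():
        if isinstance(summary, dict) and "values" in summary:
            properties[name] = {"enum": list(summary["values"])}
        elif isinstance(summary, dict) and "min" in summary and "max" in summary:
            properties[name] = {"minimum": summary["min"], "maximum": summary["max"]}
        else:
            # A plain value is a fixed ("common") property.
            properties[name] = {"const": summary}
    return {"type": "object", "properties": properties}


# Hypothetical summary, for illustration only.
example_summary = {
    "eo:gsd": 10,
    "eo:platform": {"values": ["Sentinel-2A", "Sentinel-2B"]},
    "eo:cloud_cover": {"min": 0, "max": 100},
}
simple_schema = summary_to_schema(example_summary)
```

The mapping only goes one way, which shows the trade-off under discussion: a summary can always be expressed as a schema, but an arbitrary schema cannot be reduced to a summary.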

@m-mohr
Collaborator Author

m-mohr commented Mar 13, 2019

Both in Data Cube and the Non-Commons extension I would strictly only allow extents and a set of values for Summaries. Summaries should be easy to consume. If somebody wants to define a full Item Schema, that should be a different extension and that could then just refer to JSON Schema.

Yes, I need the Summary in the Collections. I need an easy way that users can look up what they can query for. That's what I designed the extension for. Whether they are merged with the common properties or not, is a design decision we can discuss. I see pros and cons for both. Probably depends highly on how we want the commons extension to work. Or we duplicate stuff, as you explained. In this case, I could drop the commons extension completely for openEO, as we could just specify everything as Summary ;-)

Adding JSON Schema or specifying things in a separate file would mean that I'd not use the extension and just use what we have now as proprietary extension specifically designed for openEO. It's just that I need what we have now. We can discuss details like names and improve smaller bits or add a new field, but I don't want to invent a big allrounder extension for Item Schemas here.

@cholmes
Contributor

cholmes commented Mar 13, 2019

Cool. I'm definitely happy for a lightweight extension in collections. And yes, I figured that openEO dropping the commons extension would be the logical outcome, and I think that may be ok.

Thanks for humoring me - I just want to be sure that we clearly define the scope, as I can quite easily see someone coming and asking for just a little bit more in the Summary. And then that happens two or three times and suddenly we're making our own schema language.

Let's talk through the pros and cons of aligning with commons in the next working session.

@m-mohr
Collaborator Author

m-mohr commented Mar 14, 2019

Dropping the commons extension for collection-only STACs would be a logical next step as currently we don't have any other place to describe the collections further with EO/SAR/... fields. But then we have two places to look for the data; currently it is just one place, which is easier to implement. That's quite a tough decision we shouldn't take lightly and should discuss in a bigger round.

I'll have an eye on not reinventing the wheel (JSON Schema).

@cholmes
Contributor

cholmes commented Mar 25, 2019

We discussed this on the call (with @matthewhanson @hgs-msmith @jbants @joshfix @mojodna ) and agreed the best way forward would be to make the 'Item Summary' extension, which handles both common and non-common properties. And accept some duplication with the 'commons in core' functionality of 'merging'.

We didn't go deep into having 'two places to look for data', but my reaction is that they are for different purposes. The summary provides what fields to query on, and the 'properties' (commons in core) provides the fields to 'merge' in. I also am not sure if there will be huge overlap, as I'd see static catalogs going for the commons stuff, and dynamic catalogs going for the Item Summary. In a dynamic one the server will do the merge. And in the static one there are fewer people needing a 'summary' to figure out what to query, since there's no querying.

@matthewhanson
Collaborator

I've been perusing through old issues and came across this one:
#278

And it occurred to me that defining the extensions used in the collection solves this problem, as it defines what properties would be in use. If my collection uses the eo and sci extensions then I know what fields to expect.

  • It doesn't explicitly tell you which of these fields are in the collection
  • It would include optional fields that belong to an extension but may not be used

Knowing which extensions are used in a Collection can definitely be useful, the question is if you had that info would you still feel you would need a summary of properties in the Items?

@jbants
Collaborator

jbants commented Mar 27, 2019

There have been a couple references to an extensions property for the spec along the lines of extensions: ['eo', 'sar']. Personally, I like this idea.

@m-mohr raised a good point about links to specific extensions. If someone defined a custom eo extension, how would a user know which extension spec it refers to? Can we follow something like the links schema to refer to extensions? Or is that too cumbersome?

"extensions": [
    {
      "href": "https://github.com/radiantearth/stac-spec/blob/master/extensions/eo/schema.json",
      "name": "eo"
    },
    {
      "href": "https://github.com/radiantearth/stac-spec/blob/master/extensions/sar/schema.json",
      "name": "sar"
    },
    {
      "href": "https://github.com/radiantearth/stac-spec/blob/master/extensions/scientific/schema.json",
      "name": "sci"
    }
  ]

@matthewhanson
Collaborator

matthewhanson commented Mar 27, 2019

We could also just put them in the links section with a rel type of "extension"

"links": [
   {
      "title": "eo",
      "rel": "extension",
      "href": "https://github.com/radiantearth/stac-spec/blob/v0.6.2/extensions/eo/schema.json",
      "type": "application/json"
   }
]

If you defined a custom extension you would need to provide a link to it.
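
A client could pick up extensions declared this way with a few lines of Python. This is a sketch under the assumption that the rel type "extension" suggested above is used; that rel type is only a proposal in this thread, and the function name `extension_links` is hypothetical.

```python
def extension_links(collection):
    """Map extension titles to schema URLs found in a Collection's links.

    Links without a title are keyed by their href instead.
    """
    return {
        link.get("title", link["href"]): link["href"]
        for link in collection.get("links", [])
        if link.get("rel") == "extension"
    }


# Hypothetical Collection; the self link is a made-up URL for illustration.
linked_collection = {
    "id": "s2",
    "links": [
        {"rel": "self", "href": "https://example.com/collections/s2"},
        {
            "title": "eo",
            "rel": "extension",
            "href": "https://github.com/radiantearth/stac-spec/blob/v0.6.2/extensions/eo/schema.json",
            "type": "application/json",
        },
    ],
}
declared_extensions = extension_links(linked_collection)
```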

@m-mohr
Collaborator Author

m-mohr commented Mar 28, 2019

I feel we are drifting a bit off here and discussing also things that should be discussed in #278, also because my initial use case changed slightly. I wrote in the first post:

In openEO we need to list also the non-common fields in Collections, i.e. all properties that are available in the Items but have different values. We don't necessarily need to list the actual values (or extents), but we need to inform the user what he can query against.

After implementing this, I realized that it is in fact quite useful to list the actual values or extents, but it's not easy for non-primitive data types. Additionally, querying in openEO terms means to query using openEO processes for processing decisions, not querying against the STAC search API. Also, we don't necessarily have (public) Items, but we still need to query against fields that can't be moved to the collections (commons extension) as they are not common.

I'm all in to link to solve #278 (may it be schemas or something different), but I still need the Summary extension #416. For me it could also just be named "Allow all fields in collections so that collections are more useful standalone Extension" ;-) but it seemed like others have a similar need for querying against the API so we could just make it more universally usable by calling it Item Summary extension (or so). Now we need to decide whether we need to split forces for #413 and #278 or not. I could also just make #416 a proprietary openEO extension if that seems to make more sense and you could work on #278 by defining how to link schemas.

Regarding linking schemas: We shouldn't link to GitHub, but make more permanent links such as https://www.stac.cloud/schemas/v0.6.2/extensions/eo.json or so. The master branch changes with every version. And maybe this is getting a bit too complicated with schemas? Not all fields in the schema may be available? This extension is aimed at giving the user easy access to what to expect behind a collection; the schema is very rough and a user would usually just read the written spec instead. Not sure which client software would parse the actual schemas and make sense of them.

Edit: Oh, I missed @cholmes' comment. Have you spoken about a name for the field? I'm fine with also allowing common fields as summaries. We still need to figure out how to encode that well.

Edit 2: Having item summaries would solve issues such as #216 to some extent for collections, too.

@m-mohr
Collaborator Author

m-mohr commented Mar 28, 2019

I'm just trying to update the PR and it is not too easy as I'm wearing two hats here.

  1. The collection-level viewpoint from openEO and GEE, which only have collections but don't expose items. For us, it is finally a nice way to fully describe collections. Previously, we were basically "misusing" the commons spec and that limited us to just include collection properties where the properties had a single value. If you had a mixed collection with two different sensors, we were lost (and that led to this issue).
  2. The item-level viewpoint from most other STAC collaborators, which use this extension more as a list of fields that are available in items and queryable in the STAC APIs. Summaries for the items are a nice bonus.

Writing the extension spec to suit both needs is not so easy. Take the field name, for example. For (1) properties sounds good, but conflicts with the commons extension. For (2) summaries could be an option or something like queryable_properties. These terms though are way off for (1).

There are more issues here, but I'd like to discuss it in a call instead of writing a whole paper here about it. Seems like a bigger strategic decision, which is also influenced by discussions regarding the common metadata model, extensions supported, commons extension etc.

@m-mohr
Collaborator Author

m-mohr commented Jun 6, 2019

Discussed on sprint, see notes: https://docs.google.com/document/d/1evZHrn1kOdLTIOFaJ2_Z3G3C7MwB8N-ORG8xKCtpAD0/edit

I'll come up with a PR.

m-mohr changed the title from "List non-common fields in Collection" to "List non-common fields in Collection (Summaries)" on Jun 6, 2019
m-mohr added the "prio: should-have" (would be very good to have in the release), "stac-catalog" and "PR available" labels and removed the "new extension" and "question" labels on Jul 22, 2019
@m-mohr
Collaborator Author

m-mohr commented Aug 14, 2019

Merged to dev.


4 participants