Extended Metadata Extension #757

Closed
cholmes opened this issue Apr 6, 2020 · 8 comments

Comments

@cholmes
Contributor

cholmes commented Apr 6, 2020

One idea discussed recently has been to specify a common way for STAC implementors to add extra fields in a familiar JSON structure without needing to put it in the core STAC record.

Details may shift as we get into it, but the general outline is something like:

  • Specify an 'extended_metadata.json' asset that can be included as one of the assets in a STAC record.
  • It is a stac-like json structure, that mirrors the properties and lets users put in any fields they want.
  • Recommend that people put in fields that are not used for search, but are useful for loading / processing data.
  • Specify some way for STAC APIs to combine extended and core STAC metadata - perhaps a 'full' request, or perhaps let users add 'fields' from not just the stac record but also the extended metadata record.
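
To make the outline above a bit more concrete, here is a minimal sketch in Python. The asset key `extended_metadata`, the URL, the example field names (including the made-up `tiles:max_zoom`) and the helper `select_fields` are all assumptions for illustration; none of them are defined by the STAC spec.

```python
# Hypothetical core STAC Item: only fields worth searching on live in "properties".
core_item = {
    "type": "Feature",
    "id": "example-scene",
    "properties": {"datetime": "2020-04-06T00:00:00Z", "eo:cloud_cover": 12},
    "assets": {
        # Assumed asset key and URL; the role "metadata" marks it as a metadata file.
        "extended_metadata": {
            "href": "https://example.com/example-scene/extended_metadata.json",
            "type": "application/json",
            "roles": ["metadata"],
        }
    },
}

# Hypothetical content of extended_metadata.json: it mirrors "properties" and
# holds fields that help with loading / processing but are not used for search.
extended = {
    "properties": {"proj:shape": [7801, 7641], "tiles:max_zoom": 14}
}


def select_fields(item, extended_doc, names):
    """Return the requested fields, looking in core properties first, then extended."""
    core_props = item.get("properties", {})
    ext_props = extended_doc.get("properties", {})
    result = {}
    for name in names:
        if name in core_props:
            result[name] = core_props[name]
        elif name in ext_props:
            result[name] = ext_props[name]
    return result


# A 'fields'-style request that spans both records.
selected = select_fields(core_item, extended, ["eo:cloud_cover", "proj:shape"])
```

Letting core properties win on a name clash is just one possible choice here; how an API should resolve clashes is left open.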

This structure can help STAC really focus on its mission, which is search, while enabling the reuse of its nice JSON structure for those who want STAC to serve as their full metadata. Implementors could choose which fields they want where - we'd not try to dictate where things should be, but provide rough guidelines. This can also help providers save on indexing costs - only STAC core would need to be indexed / searchable. Extended metadata would be attached to the core json, but not searchable.

@cholmes cholmes added this to the 1.0.0-beta1 milestone Apr 6, 2020
@m-mohr
Collaborator

m-mohr commented Apr 7, 2020

From a client perspective this sounds like a bad idea. If you are looking for some specific properties, you must load all data anyway and then search both files. That adds complexity and I'm not sure what's the concrete benefit. What's the advantage of splitting? I don't get it yet from the text above. The files are not overly complex or large at the moment and we don't have extensions yet that add like hundreds of kilobytes to it. If there's any extension that adds such large chunks of data, I'd prefer that the extension defines a new asset role or link rel type for the bigger chunks of data. So for now I'm strongly against this, but maybe I didn't get the idea right.

@cholmes
Contributor Author

cholmes commented Apr 7, 2020

Well, it's really for data fields that we've been saying 'those don't really belong in STAC', and giving people a place to put them. Right now the only fields that really qualify, I believe, are in the proj extension - necessary for loading data. And the ones we're talking about in #688 - shape (rows and columns) and transform. I could see things like radiance coefficients going in there too, and other satellite / sensor information that helps with processing, but that you wouldn't want to search on. And another one is additional information for tile servers, like max / min zoom levels that the data represents.

So I suppose we could make a stronger recommendation - that it's just for things that are for loading / processing data. That to me is the purpose. And it's not about hundreds of kilobytes - from the provider perspective, if it's even 5 fields that I don't have to index in Elasticsearch, across hundreds of millions of records, that's thousands of dollars in savings. It's about the fields that you want to provide to the client, but don't want to have to index.

And I think the idea is that this extension does define a new asset role, and then specifies a bit more of what is in that asset, in a structure that is similar to STAC proper. The thing it replaces in my mind is old xml metadata files - we have those at Planet, and lots of people have them. They're mostly used for loading data into GIS / remote sensing tools - it'd be great if that could be replaced by a JSON structure that complements STAC.

I guess theoretically we could do this not as an extension but as a companion spec....

@cholmes
Contributor Author

cholmes commented Apr 14, 2020

@m-mohr - any more thoughts? Or we can carve out time in our next call. I'm happy to write this up, but won't if it is a really bad idea.

The use case for me is to put all the information in a Landsat MTL file into a modern json structure. Landsat needs backwards compatibility, so will stick with the MTL. They did translate the MTL directly into JSON, see https://prd-wret.s3.us-west-2.amazonaws.com/assets/palladium/production/atoms/files/LSDS-1388-Landsat-Cloud-Optimized-GeoTIFF_DFCB-v2.0.pdf

The goal for me with the extended metadata is a place to put the additional 180 lines that they didn't select to have in their STAC catalog, in a way that is potentially more interoperable than the dump of the MTL.

@cholmes cholmes self-assigned this Apr 14, 2020
@m-mohr
Collaborator

m-mohr commented Apr 14, 2020

Yes, sorry, I forgot to comment here.

My biggest concern is that things will be mixed between the files and clients won't know where to get data from. If you can freely move fields in properties around between the files, you'll end up with STAC clients needing to load both files most of the time anyway, and then there's no real benefit. You could simply put it directly into the STAC file. For example, if some users put data cube information in core STAC and some in extended STAC files, you never know where to look for it. So either no extended metadata, or define for each extension where to place the fields, either core or extended. But then you need to ensure that extensions select the correct place. I wouldn't even be sure where to put the data cube fields at the moment. It's not something you can easily search as it's a complex object, but still I'd probably not want to have that hidden in an extended metadata file. The same probably goes for some fields in the label extension etc.

Also, I'm not sure whether this is really necessary to define anything in the spec. Depending on whether we see this as assets or links, we already have everything here to add such data:

  • Assets: Just specify an asset with type application/json (or any applicable other type, not sure whether there's one for MTL) and roles metadata and there's your additional metadata file.
  • Works similarly with links: Set type to application/json (default anyway?!) and rel type could be any of appendix, describes / describedby or so. The rel type is probably the only thing we would need to clarify. And that could probably be done directly in core.
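
As a rough illustration of the two options above, using only fields that Items already have: the keys, the URLs, the helper `metadata_hrefs` and the choice of `describedby` as the rel type are assumptions for the sake of the example, not something the spec prescribes.

```python
# Option 1: the extra metadata file is just another asset with role "metadata".
metadata_asset = {
    "href": "https://example.com/scene/metadata.json",  # assumed URL
    "type": "application/json",
    "roles": ["metadata"],
}

# Option 2: the extra metadata file is a link; the rel type is the open question.
metadata_link = {
    "href": "https://example.com/scene/metadata.json",  # assumed URL
    "type": "application/json",
    "rel": "describedby",  # assumption: one of the candidate rel types
}

item = {
    "type": "Feature",
    "id": "scene",
    "properties": {"datetime": "2020-04-14T00:00:00Z"},
    "assets": {"metadata": metadata_asset},
    "links": [metadata_link],
}


def metadata_hrefs(item):
    """Collect hrefs of metadata files referenced via assets or via links."""
    hrefs = [
        asset["href"]
        for asset in item.get("assets", {}).values()
        if "metadata" in asset.get("roles", [])
    ]
    hrefs += [
        link["href"]
        for link in item.get("links", [])
        if link.get("rel") == "describedby"
    ]
    return hrefs
```

Either way a client needs nothing beyond the existing asset and link structures to find the file.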

Looking at the MTL example, it even seems there are some fields that you potentially could want to search on, e.g. CLOUD_COVER_LAND seems quite useful to search for but then it would be hidden in the extended metadata. In contrast you probably don't want to search for ANGLE_COEFFICIENT_FILE_NAME.

I feel this is driven by the fear to "overload" the STAC files, but I don't see a big issue here. Have there ever been any problems with too large STAC files yet?

I've had small issues with rendering Google's large visualization objects in collections, but the solution was easy: Render unknown complex fields only on demand. Usually, you can't properly display them anyway as you don't know the structure. Any other issues so far? What would be the drawback to put MTL directly into STAC or just refer to them in assets and links as described above? You won't be able to search on them anyway if they are in extended metadata. I'm not sure about the Elastic index issue, because I guess I just wouldn't index fields that are unknown anyway (i.e. specify what fields to index), but yes, that may be an issue for some.
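
On the Elastic point, one way to "specify what fields to index" is an explicit mapping with dynamic mapping switched off. The sketch below only builds the index body as a Python dict and serializes it; the chosen field names and types are assumptions, and it is not tied to any particular deployment.

```python
import json

# Index body for Elasticsearch. With "dynamic" set to False, fields that are not
# listed in the mapping stay in the stored document (_source) but are not indexed.
# Note the nesting: the outer and inner "properties" keys are mapping syntax,
# the middle one is the STAC Item's own "properties" object.
index_body = {
    "mappings": {
        "dynamic": False,
        "properties": {
            "properties": {
                "properties": {
                    "datetime": {"type": "date"},
                    "eo:cloud_cover": {"type": "float"},
                }
            }
        },
    }
}

# JSON payload that could be sent when creating the index.
request_json = json.dumps(index_body)
```

Unlisted fields would still be returned to clients with the document, they just could not be queried.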

So I'd personally try to solve this as light as possible in the spec after answering all the questions that are still open (and for me it feels there are a lot still open).

@cholmes
Contributor Author

cholmes commented Apr 14, 2020

Cool. Overall I agree that the spec has all the main things we need, and am fine to experiment outside the spec. If Planet goes down this route then we can just publish what we do in a separate repo and see if it evolves to something useful.

For some specifics:

> I feel this is driven by the fear to "overload" the STAC files, but I don't see a big issue here. Have there ever been any problems with too large STAC files yet?

So for Planet this really comes down to cost. We're actively removing some of the fields in our data api because every single field has to be fully indexed in a large Elasticsearch cluster when there are 800 million records, and in our analysis many aren't being used to search on. It's much less about a client rendering, and more about the index. I could see another route being a way to specify in STAC API that only certain fields are 'searchable' - I think the WFS 'queryable' thing was maybe exploring some of that. I think we have an implicit contract right now that all fields are searchable, and this proposal is one way to make it clear that some aren't. And there's also just some cost on egress for returning much larger payloads (though I'm not sure if that is actually a meaningful difference).

> What would be the drawback to put MTL directly into STAC

Well, I think that does go in the face of one of our consistent messages about STAC, that it is focused on search, and you should select the fields you actually want to be searched on. I think it's valuable to have providers be a bit more thoughtful about the fields they select for STAC records, as the ones that users will see in search results and actually want to filter on, instead of just dumping their huge metadata records into it.

> or just refer to them in assets and links as described above?

To be clear, that was going to be how the extension would work, in my mind - I guess I did a poor job of explaining it. It was going to be something like extended_metadata.json, with role = metadata. But instead of it being arbitrary json it'd be a recommendation of how to organize the json, perhaps reusing the 'properties' object. But it'd absolutely just be another asset in the asset list. The hope would just be to put forth a common way for those who are dumping their full metadata into a new json to do something a bit more interoperable.

> So I'd personally try to solve this as light as possible in the spec after answering all the questions that are still open (and for me it feels there are a lot still open).

Cool, I can just write up something outside of the spec repo for people to look at. That may help make the debate a bit less abstract. Though I also do hear your core concern about splitting stuff, but I also do see a case to standardize a bit on additional metadata as people will use it. Perhaps we just encourage OGC or ISO to make updated JSON versions of their specs, that have room for providers to put all that they want.

I suppose one other route is to focus this on the STAC API, since that's where the index cost comes in. Though I think the one thing that could be nice in STAC core is some indicator for APIs on fields that the provider does not think are valuable to index, so that each STAC API implementor doesn't need to decide for themselves. On the API side I could see some nice construct where users can request a STAC record plus the extended metadata in one call.
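
One way to picture such an indicator is a provider-supplied list of property names that an API may store but skip when indexing. This is only a sketch: the name `non_indexed_fields` and the helper `split_for_indexing` are purely hypothetical and exist in no STAC spec.

```python
def split_for_indexing(properties, non_indexed_fields):
    """Split properties into (indexed, stored_only) using the provider's hint."""
    skip = set(non_indexed_fields)
    indexed = {k: v for k, v in properties.items() if k not in skip}
    stored_only = {k: v for k, v in properties.items() if k in skip}
    return indexed, stored_only


properties = {
    "datetime": "2020-04-14T00:00:00Z",
    "eo:cloud_cover": 3.5,
    "proj:shape": [7801, 7641],
}

# Hypothetical hint from the provider: useful for loading data, not for search.
non_indexed_fields = ["proj:shape"]

indexed, stored_only = split_for_indexing(properties, non_indexed_fields)
```

An API could send `indexed` to its search backend and still return the full record to clients.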

@m-mohr
Collaborator

m-mohr commented Apr 15, 2020

> I think the WFS 'queryable' thing was maybe exploring some of that.

Yes, I think that's a good idea. See radiantearth/stac-api-spec#19

> Well, I think that does go in the face of one of our consistent messages about STAC, that it is focused on search, and you should select the fields you actually want to be searched on.

Indeed, although the same applies to the proj extension, for example. The issue is probably that some people put into the "main" item what is useful for their use case and others put it into the "extended" file. So you end up with many places you need to look at, which feels like a nightmare to implement and defeats the purpose of the extended metadata, as you may always need to download both files, which doesn't help with download cost (the example you have above). I hope people just link to their large original metadata files, but we have everything in place one may need for that.

> But instead of it being arbitrary json it'd be a recommendation of how to organize the json, perhaps reusing the 'properties' object.

I never got what the advantage is to have the "properties" again. Why not just put everything at top-level there? It doesn't make merging easier and I'm not sure we even want clients to merge it...

> I suppose one other route is to focus this on the STAC API

I guess adding queryables is a good start and then it could be simply a best practice and some tweaks or good defaults for the fields extension.
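
A minimal sketch of how queryables could make the "not every field is searchable" contract explicit. The shape of the queryables set, the query structure and the helper `check_query` are assumptions for illustration; the actual definition is what radiantearth/stac-api-spec#19 is about.

```python
def check_query(query, queryables):
    """Return the names used in a query that the API has not declared as searchable."""
    return sorted(name for name in query if name not in queryables)


# Assumed: the API advertises the property names it is willing to search on.
queryables = {"datetime", "eo:cloud_cover"}

# Assumed query shape: property name mapped to an operator/value pair.
query = {"eo:cloud_cover": {"lt": 10}, "proj:shape": {"eq": [7801, 7641]}}

rejected = check_query(query, queryables)
```

Fields outside the queryables could still be returned in responses, for instance through the fields extension.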

What do others think? This is just my and your opinion so far...

@m-mohr m-mohr mentioned this issue Apr 24, 2020
@cholmes cholmes modified the milestones: 1.0.0-beta1, 1.0.0-beta2 Apr 27, 2020
@m-mohr m-mohr modified the milestones: 1.0.0-beta.2, 1.0.0-beta.3 Jun 15, 2020
@m-mohr m-mohr modified the milestones: 1.0.0-beta.3, future Jan 4, 2021
@m-mohr
Collaborator

m-mohr commented Feb 23, 2021

@m-mohr m-mohr modified the milestones: future, new extensions May 4, 2021
@PowerChell

Closing due to inactivity.


3 participants