Make summaries 'strongly recommended' in Collections #965

Closed
m-mohr opened this issue Feb 3, 2021 · 7 comments

@m-mohr
Collaborator

m-mohr commented Feb 3, 2021

We've recently discussed how important summaries are for discovery of data, but on the other hand most people don't actually implement summaries. Therefore, I propose to make summaries required in Collections, with a minimum number of properties set to 1 in JSON Schema. Summaries in Catalogs would still be optional! This issue is to spur discussion.
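
For illustration, the proposal could look roughly like the following. This is only a sketch: the schema fragment is an assumption about how the rule might be written, not the actual STAC Collection schema, and `summaries_errors` is a hypothetical hand-rolled check rather than a real validator.

```python
# Hypothetical schema fragment (not the real STAC schema): "summaries" is
# required and must contain at least one property.
COLLECTION_SUMMARIES_RULE = {
    "type": "object",
    "required": ["summaries"],
    "properties": {
        "summaries": {"type": "object", "minProperties": 1},
    },
}


def summaries_errors(collection):
    """Return a list of error messages; an empty list means the rule is met.

    Hand-rolled check that mirrors the fragment above, so no validator
    library is needed for this sketch.
    """
    if "summaries" not in collection:
        return ["'summaries' is a required property"]
    summaries = collection["summaries"]
    if not isinstance(summaries, dict):
        return ["'summaries' must be an object"]
    rule = COLLECTION_SUMMARIES_RULE["properties"]["summaries"]
    if len(summaries) < rule["minProperties"]:
        return ["'summaries' needs at least %d property" % rule["minProperties"]]
    return []
```

Note that an empty `summaries` object would fail this check, which is exactly what `minProperties: 1` is meant to enforce.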

@m-mohr m-mohr added this to the 1.0.0-RC.1 milestone Feb 3, 2021
@m-mohr m-mohr self-assigned this Feb 3, 2021
@cholmes
Contributor

cholmes commented Feb 7, 2021

In general I like the thought a lot, and was all for it when I first thought about it. They're important to have, and making them required is a good nudge towards getting them used more.

On the other hand I think we're already introducing more change than I was hoping we would for RC1. If we were to do another beta I'd probably be more into it. But I think it's really important for STAC momentum to get something stable soon.

If all the tooling supported making summaries very easily then that would make it easier to 'require' as well. Admittedly that's a bit 'chicken and egg' - obviously the tools will figure out how to support it if the spec makes it required. But I worry that many may just do the 'bare minimum'.

So my current thinking is to really prioritize the tooling, so that it is very easy to create summaries, to put stronger recommendations in the spec, and to add a best practices section that also really emphasizes their importance. I think this can be done with #820 - I wasn't planning quite so strong recommendations, but easily can. And then in any funding of the ecosystem I can prioritize funding their support.

It might be interesting to also think about ways to make summaries look 'cool' in STAC Browser, like doing graphical representations of some of them. For example, gsd could have a little line graph on an absolute (or log?) scale where you'd see the resolution relative to others. And the various view angles could actually show the range of possibilities, etc.

@m-mohr
Collaborator Author

m-mohr commented Feb 8, 2021

Hmm, as we are going to do the RC1 soon, it's the last chance to get it in any time soon. Collections and Items need to be regenerated anyway with the "type" change. Thus, it seems the right time to solve the chicken and egg issue by just requiring it in the spec. If you want to get funding for those functions anyway, isn't that even more of an argument to require it? I don't understand the stability issue to be honest. How does requiring a field make it less stable after we are breaking all collections anyway?

@jjrom
Contributor

jjrom commented Feb 8, 2021

From an implementation point of view, I'm not very happy to have a required "summaries" for catalogs. For instance, the resto server follows the rule "one item must belong to one and only one collection but can belong to any number of catalogs". Thus, the collection summaries property is stored in a resto.collection table in the database. When an item is added to or removed from a collection (e.g. Landsat), resto recomputes the statistics for the collection (i.e. new spatiotemporal extent, number of items, etc.). On the other hand, a catalog is a group of items belonging to different collections (e.g. "optical images in France"). It is a "live" view dynamically made from an SQL query on the items table (i.e. the resto.features table). The summaries property cannot be computed this way unless it is computed "on the fly", which would degrade the performance to an unusable level.

So I suspect that this change will have a negative impact on implementations that would provide "dynamic catalog generation".

@m-mohr
Collaborator Author

m-mohr commented Feb 8, 2021

> I'm not very happy to have a required "summaries" for catalogs

@jjrom Maybe there's a misunderstanding here: Summaries would only be required for Collections, not for catalogs. Thus couldn't you just hook the summary generation for Collections into the part that recomputes the statistics also for extents etc.?
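
As a rough sketch of what such a hook could compute: the function below builds a summaries object from the Items of one Collection. It is not resto's code; `summarize` and its arguments are hypothetical, and the output shape (a `minimum`/`maximum` range for numeric fields, a list of distinct values otherwise) is an assumption about the summaries format.

```python
from numbers import Real


def summarize(items, fields):
    """Build a summaries dict for the chosen Item property names."""
    summaries = {}
    for field in fields:
        values = [
            item["properties"][field]
            for item in items
            if field in item.get("properties", {})
        ]
        if not values:
            continue  # no Item carries this field, so nothing to summarize
        # Numeric fields (booleans excluded) become a range.
        if all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
            summaries[field] = {"minimum": min(values), "maximum": max(values)}
        else:
            # Other fields become the distinct values, in first-seen order.
            distinct = []
            for v in values:
                if v not in distinct:
                    distinct.append(v)
            summaries[field] = distinct
    return summaries
```

The caller picks the fields, so the result can be stored next to the recomputed extent whenever Items are added or removed. After a removal the range has to be recomputed from the remaining Items, since a minimum or maximum cannot simply be decremented.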

@m-mohr m-mohr changed the title from "Make summaries required" to "Make summaries required in Collections" Feb 8, 2021
@jjrom
Contributor

jjrom commented Feb 8, 2021

@m-mohr Ok, sorry for the misunderstanding. Reading the discussion on summaries in other threads, I had the feeling that summaries would end up being required for catalogs as well, which I definitely want to avoid :)
On the other hand, I totally agree with making summaries required for collections.

@cholmes
Contributor

cholmes commented Feb 8, 2021

So I'm still going back and forth in my head on this one, but I'll present my arguments for not doing it.

The one I've been thinking further about is how few things we absolutely require. I recently wrote 'The core spec is designed to be as flexible as possible, so that it is not too rigid and unable to handle unanticipated needs.' I worry that locking in summaries makes it less flexible. Do all data types have something to summarize? OpenEO I know does, and I guess from that netcdf type n-dimensional things do as well. But are there some potential data types that don't make sense to summarize anything?

There were many things in the past that I wanted to make required, but got talked out of it into STRONGLY RECOMMENDED for many things, so that people could use the spec in ways other than 'the normal'. I worry that having summaries truly required locks us in. And/or leads to people doing 'null' summaries, which I always find frustrating - why require something that people have to just ignore?

> I don't understand the stability issue to be honest. How does requiring a field make it less stable after we are breaking all collections anyway?

The difference for me comes down to the ease for implementors. Changing an implementation to add a type field is trivial. Implementing summaries is not. It's also not necessarily something that really should be fully automated, as I don't think we want summaries for every item field in a collection. So even if the tooling were to all be there every collection would need work to get it into a good state with summaries.

So I think I more mean stability for catalog maintainers, than I do for tool authors. With RC1 as it stands any implementor will be able to run an 'upgrade tool', and their catalog will be set. With summaries we will tell them that they must select the fields to summarize and put them into summaries. Or (more likely) we get a lot of people 'skipping' it and doing the bare minimum.
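
To make the contrast concrete, here is a minimal sketch of the "trivial" half of an upgrade tool. `upgrade_collection` is a hypothetical helper, and the value `"Collection"` for the new `type` field is my assumption of what RC1 expects.

```python
import copy


def upgrade_collection(collection):
    """Return a copy with a 'type' field; the input dict is left untouched."""
    upgraded = copy.deepcopy(collection)
    # Only set 'type' when it is missing, so an existing value is kept.
    upgraded.setdefault("type", "Collection")
    return upgraded
```

Nothing comparable exists for summaries: a generic tool cannot know which fields are worth summarizing, so that choice stays with the catalog maintainer for each collection.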

> If you want to get funding for those functions anyway, isn't that even more an argument to require it?

I'm not sure - I don't anticipate problems getting funding for it. But if it's required then we go out the door at 1.0.0 with a number of implementations not working. I'd like us to have the tools work, and I'd also not like to delay the release for a month or two while we wait for them to all work.

I also think building the implementations after the spec requires something is a recipe for getting something wrong and needing to update the spec. This to me is the classic issue with OGC specs, where once it's out people notice little things wrong, and it's a big pain to fix them. If all the tools implemented summaries I'd feel great about bringing it in. But if pystac goes to implement it then it could raise some issue we're not quite thinking of. I think the chances are low, but I'm just not sure.

If implementors (of software and of actual production catalogs) are all into this change and commit to getting their implementations up to date asap then I can definitely be convinced. Or if there is more use of summaries than I think - like if 90% of catalogs on stacindex are already using them.

But yeah, I guess just philosophically I've expressed that things should not be mandated, that 'the market' will decide (yes, I concede I don't always stick to this 100%). If summaries are valuable then they will be used, but requiring them in the spec makes things a wee bit less flexible for future data types for which summaries may not make sense - where there are no implicit values to summarize, or where doing so is for some reason very hard.

Mostly I'm curious what others think, happy to be swayed on this, but have had a tingly sense in the last couple days that it may not be the right move. And to reiterate, I'm in full support of getting summaries super used - I just think recommendations, examples, descriptions and tooling are the way to go; requiring them in the spec feels like a very blunt and forceful instrument.

@cholmes cholmes changed the title from "Make summaries required in Collections" to "Make summaries 'strongly recommended' in Collections" Feb 8, 2021
@cholmes
Contributor

cholmes commented Feb 8, 2021

Discussed on a call today, with @lossyrob @matthewhanson @m-mohr and @jbants - Decision was to upgrade summaries to be 'STRONGLY RECOMMENDED' like other STAC fields, and to commit to explaining them well and building support into all the tools. Really work to get them in all catalogs, but to not make it required.
