Filter Ext: Dynamic queryables difficult for item-search #182

Closed
rsmith013 opened this issue Jul 28, 2021 · 11 comments

rsmith013 commented Jul 28, 2021

One feature we are really interested in is the ability to request from the server the possible queryables and their accepted values.

As defined, there are two routes:

  • /queryables - The global intersect of all collection queryables
  • /collections/{collection_id}/queryables - Collection-specific queryables

As noted in the documentation, this falls short in providing a useful interface where an API presents diverse content.

Issues ogcapi-582 and ogcapi-576 go some way toward addressing this by requesting wildcard schema definitions and /queryables?collections=collection1,collection2, but I feel this still leaves limitations, some of which are discussed in the issues themselves.

  1. Wildcard definitions have issues because you lose the information about possible queryable properties and values. We are likely to end up with millions of items and 100s if not 1000s of collections. It is unreasonable to expect a user to search these to find some possible attributes they might wish to search on.
  2. Adding the ?collections parameter requires the client to know up-front which collections they are interested in, whereas one of the benefits of item-search is cross-collection search.

I wonder whether a more useful approach would be to allow the same query parameters as /search on /queryables.
The implementation can then search for the list of results that match and provide the intersect of queryables to the user for further refinement.

e.g.

Return the list of items which match the filter expression
/search?filter=sentinel:data_coverage > 50 OR eo:cloud_cover < 10

Return the queryables available for the results which match the current filter expression
/queryables?filter=sentinel:data_coverage > 50 OR eo:cloud_cover < 10

The /queryables?collections=collection1,collection2 approach requires the API to return a list of relevant collections to be useful IMO. Perhaps allow the context extension to return an array of collections in the response which are relevant for the current search.

Proposed solutions:

  1. Accept the /search query parameters on the /queryables endpoint to dynamically build the intersect
  2. Return collection ids in the context response; these can then be used via the proposed /queryables?collections=... parameter.
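
A minimal sketch of proposed solution 1, assuming a toy in-memory catalogue rather than a real STAC backend. The names ITEMS, COLLECTION_QUERYABLES, dynamic_queryables and example_filter are hypothetical, and the filter is hand-written in Python to mirror the expression above rather than parsed from CQL:

```python
# Hypothetical stand-ins for a STAC catalogue: each item records its
# collection and properties; each collection lists its queryables.
ITEMS = [
    {"collection": "sentinel2",
     "properties": {"eo:cloud_cover": 5, "sentinel:data_coverage": 80}},
    {"collection": "landsat8", "properties": {"eo:cloud_cover": 8}},
    {"collection": "cmip6", "properties": {"cmip6:source_id": "model-a"}},
]
COLLECTION_QUERYABLES = {
    "sentinel2": {"datetime", "platform", "eo:cloud_cover", "sentinel:data_coverage"},
    "landsat8": {"datetime", "platform", "eo:cloud_cover"},
    "cmip6": {"datetime", "cmip6:source_id"},
}


def example_filter(props):
    # Mirrors: sentinel:data_coverage > 50 OR eo:cloud_cover < 10
    # A missing property never satisfies its comparison.
    return (props.get("sentinel:data_coverage", 0) > 50
            or props.get("eo:cloud_cover", 100) < 10)


def dynamic_queryables(predicate):
    """Intersect the queryables of every collection with a matching item."""
    matched = {i["collection"] for i in ITEMS if predicate(i["properties"])}
    if not matched:
        return set()
    return set.intersection(*(COLLECTION_QUERYABLES[c] for c in matched))


result = dynamic_queryables(example_filter)
```

Here only the sentinel2 and landsat8 items match, so the intersect is built from those two collections and the cmip6 terms never enter it.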
@rsmith013 rsmith013 mentioned this issue Sep 17, 2021
@rsmith013
Author

I have created an extension to the context extension to add collections into the response.
https://github.com/cedadev/stac-context-collections

In my implementation, at most 10 collections are returned in this response, because the more collections are returned, the less likely they are to share an intersect. The value from this response is then used via /queryables?collections=..., as suggested.
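
As a rough illustration of that flow (not the code in the linked repository), the sketch below builds a context object with a capped list of collection ids and turns it into a /queryables request path. The "collections" key, the function names, and the sample items are assumptions made for the example:

```python
from urllib.parse import urlencode

MAX_COLLECTIONS = 10  # cap described in the comment above


def context_with_collections(items, limit=MAX_COLLECTIONS):
    """Build a context object listing the distinct collections of the results."""
    seen = []
    for item in items:
        cid = item["collection"]
        if cid not in seen:  # keep first-seen order, drop duplicates
            seen.append(cid)
    # "collections" is an assumed key name for this sketch.
    return {"returned": len(items), "collections": seen[:limit]}


def queryables_path(context):
    """Turn the context collections into a /queryables?collections=... path."""
    query = urlencode({"collections": ",".join(context["collections"])}, safe=",")
    return "/queryables?" + query


# Hypothetical search results
results = [
    {"collection": "sentinel1"},
    {"collection": "sentinel3"},
    {"collection": "sentinel1"},
]
context = context_with_collections(results)
path = queryables_path(context)
```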

Open to thoughts on this approach. I appreciate the filter extension based on OGC OA Feat: Part 3 is still a moving target.

@dwilson1988

dwilson1988 commented Oct 26, 2021

@rsmith013

I'm looking into this now. Has anyone proposed a different response to those working on OGC OA Feat: Part 3 for global queryables?

Instead of /queryables returning the intersection, why not separate them out by collection similar to the collections endpoint?
e.g.:

GET /queryables

{
  "queryables": [
    {
      "$id": "collection1",
      "title": "Collection 1",
      ...
    }, {
      "$id": "collection2",
      "title": "Collection 2",
      ...
    }
  ]
}

GET /queryables?collections=collection1

{
  "queryables": [
    {
      "$id": "collection1",
      "title": "Collection 1",
      ...
    }
  ]
}

I'm struggling to understand the logic of using the intersection of queryables given the collections could be RADICALLY different (in our use case, they certainly are).
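
A small sketch of how a server might assemble that per-collection response, assuming the queryables schemas are already available in memory. SCHEMAS, its contents, and grouped_queryables are hypothetical names for illustration; nothing here is defined by the spec:

```python
# Hypothetical per-collection queryables schemas, keyed by collection id.
SCHEMAS = {
    "collection1": {
        "$id": "collection1",
        "title": "Collection 1",
        "properties": {"platform": {"type": "string"}},
    },
    "collection2": {
        "$id": "collection2",
        "title": "Collection 2",
        "properties": {"model": {"type": "string"}},
    },
}


def grouped_queryables(collections=None):
    """Return one queryables entry per collection instead of an intersection.

    `collections` mimics the ?collections=... parameter; None means all.
    Unknown ids are skipped in this sketch.
    """
    if collections is None:
        ids = list(SCHEMAS)
    else:
        ids = [c for c in collections if c in SCHEMAS]
    return {"queryables": [SCHEMAS[c] for c in ids]}


all_queryables = grouped_queryables()
one_collection = grouped_queryables(["collection1"])
```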

@rsmith013
Author

@dwilson1988 We are expecting to have many hundreds of collections so I am not sure this would be a scalable approach either. We are working under the assumption that each collection will have its own set of queryables. These could be very different. We are then working to create a separate service that will create global search facets (queryables) and make a mapping between the collection-specific terms and the global set.

I am not sure which area you're working in but we are coming from an earth system modeling perspective (climate models mixed with earth observation remote and local sensing)

As an example:

CMIP6

cmip6:source_id -> global:model
global:general_data_type

CMIP5

cmip5:model -> global:model
global:general_data_type

Sentinel 1

global:general_data_type
platform
processing_level

Sentinel 3

global:general_data_type
platform
processing_level

So although I agree that, at the top level, the intersection will likely tend to zero for heterogeneous STAC collections, we are hoping to have a few key search facets available at the top level (e.g. data type, permitted_use, inspire_theme, gemet_topic...) as well as incorporating free-text search. We hope that this will allow the user to narrow their search sufficiently that an intersection will return a meaningful overlap. The context collections extension I have proposed, based on thoughts in another issue, allows clients to submit a set of collections to the queryables endpoint and get back their intersection.

So to use the above examples, if you were looking for satellite data, then it is likely that processing_level would bubble up from a queryables search. Or if you were looking for model data then model would appear as a queryable.
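
The mapping idea can be sketched as follows, using only the example terms listed above. The data layout and the function names (global_terms, shared_queryables) are assumptions for illustration, not our actual service:

```python
# Collection-specific terms that map onto a global facet (from the example above).
TERM_MAP = {
    "cmip6": {"cmip6:source_id": "global:model"},
    "cmip5": {"cmip5:model": "global:model"},
}

# Queryables exposed by each collection (from the example above).
COLLECTION_TERMS = {
    "cmip6": ["cmip6:source_id", "global:general_data_type"],
    "cmip5": ["cmip5:model", "global:general_data_type"],
    "sentinel1": ["global:general_data_type", "platform", "processing_level"],
    "sentinel3": ["global:general_data_type", "platform", "processing_level"],
}


def global_terms(collection_id):
    """Rewrite a collection's terms into the global vocabulary where mapped."""
    mapping = TERM_MAP.get(collection_id, {})
    return {mapping.get(term, term) for term in COLLECTION_TERMS[collection_id]}


def shared_queryables(collection_ids):
    """Intersect the globally mapped terms of the given collections."""
    term_sets = [global_terms(cid) for cid in collection_ids]
    return set.intersection(*term_sets) if term_sets else set()


satellite = shared_queryables(["sentinel1", "sentinel3"])
models = shared_queryables(["cmip5", "cmip6"])
everything = shared_queryables(list(COLLECTION_TERMS))
```

With these inputs, processing_level survives the intersect for the two Sentinel collections, global:model survives for the two CMIP collections, and only global:general_data_type is shared by all four.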

I would be interested to hear other thoughts and approaches, or feedback on our proposed answer to the issue.

@dwilson1988

@rsmith013 We definitely have a lot of earth observation datasets as well, plus a few others. Definitely a wide breadth of data types and very heterogeneous field names, except for a few that have imposed consistency like start_datetime, end_datetime, datetime, etc.

Not sure I see a scaling issue with the approach I suggested. Unless there are dozens of fields on every dataset, I wouldn't expect the response to be much larger than a /collections response. In the end, it's probably not a huge deal, but it would be nice to be able to get the queryables for everything without first making a request to collections and then to each queryables endpoint.

Free text search is at least partially implemented in CQL/CQL2, I would expect a function there might become the canonical way to do it. I do really like the idea of the context collections extension - I might implement that on our side.

@rsmith013
Author

I am also thinking from a UI perspective. If you return the queryables for each collection in isolation, how do you display that to a user? Even some kind of post-processing, such as the term mapping I am suggesting, would still result in an unusable interface (assuming you have many varied vocabs).
To some extent, having a very small number of high-level queryables and expanding these as the user query gets more specific would be a more user-friendly experience, IMO.

The filters extension is developing, so will keep an eye on that. Always good to remove code.

How were you thinking you would handle the client/UI side of things with separated queryables for each collection?

@dwilson1988

Well, queryables by collection are already in the OGC API Features Part 3 spec, but our primary usage is not exactly UI driven; it is part API client and part machine to machine. We use the collection-specific queryables endpoint (collections/{collection_id}/queryables) with a Dataset (in our usage, a superset of a Collection) object in our API, so a user of that object can check which fields they can use to query it. Our user interface allows a user to browse these datasets individually, so queryables would just be displayed as a list of sorts in that Dataset's display.

@rsmith013
Author

If I understand correctly you have a structure like this:

[Image: Blank diagram - Page 2 (1)]

Where collection queryables are an aggregation of the item properties and dataset queryables are an aggregation of the collections associated with it. I guess you want all the queryables in one hit and would then sort them by dataset yourself? I can see why you would want to lose the intersect. I would think something like Memcached would be an answer here, as you don't need all the queryables every time, just a subset for each dataset.

Use Memcached (other caches do exist) to store the response from /collections/{collection_id}/queryables; the lookup for each member collection should then be lightning fast. Assuming also that these things change infrequently, caching the dataset response as well (i.e. the aggregation of all the /collections/{collection_id}/queryables responses) would be even better.
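
A rough sketch of that caching pattern. To keep it self-contained it uses Python's in-process functools.lru_cache as a stand-in for Memcached, and BACKEND, fetch_collection_queryables and dataset_queryables are hypothetical names; a real deployment would swap in a Memcached client and an HTTP request:

```python
from functools import lru_cache

# Hypothetical backend data standing in for the per-collection queryables responses.
BACKEND = {
    "collection_a": {"datetime", "platform"},
    "collection_b": {"datetime", "processing_level"},
}
CALLS = []  # records backend lookups so cache hits are visible


def fetch_collection_queryables(collection_id):
    """Stand-in for the (slow) per-collection queryables request."""
    CALLS.append(collection_id)
    return frozenset(BACKEND[collection_id])


@lru_cache(maxsize=1024)  # in-process stand-in for Memcached
def cached_queryables(collection_id):
    return fetch_collection_queryables(collection_id)


def dataset_queryables(collection_ids):
    """Aggregate (union) the cached queryables of a dataset's member collections."""
    merged = set()
    for cid in collection_ids:
        merged |= cached_queryables(cid)
    return merged


first = dataset_queryables(["collection_a", "collection_b"])
second = dataset_queryables(["collection_a", "collection_b"])  # served from cache
```

The second call does not touch the backend, so CALLS holds only the two lookups from the first call.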

Although the global intersect is flawed where you have heterogeneous collection queryables, I am defending the idea because I think that from a STAC client (UI or otherwise) perspective, it is more useful. It sounds like your Dataset concept already reduces the number of collections needed per request and I would think that caching would help with performance.

@dwilson1988

No, I wasn't very clear - Dataset is just a STAC Collection + more stuff, not multiple STAC Collections.

I'm just looking for flexibility, but querying individual collections isn't a huge burden, just a potential annoyance.

@rsmith013
Author

ok, I'm with you now.

@rsmith013
Author

Thinking deeper on this issue, and having played with my original suggestion (context-collections), it still doesn't quite do the job. The issue is that once you enter a collection, there is no further refinement. As you need to know the search context to generate the facets, the /queryables endpoint does not seem the most sensible place to do this (you would have to perform the search on the queryables endpoint, e.g. /queryables?datetime=...&bbox=...&filter=...).

Other solutions, such as Google Custom Search, place facet counts and further refinements in the search context.
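
A minimal sketch of what facet counts in the search context could look like, assuming in-memory items and a Python predicate in place of a real filter. The "facets" key, the function name and the sample items are hypothetical and not part of any published extension:

```python
from collections import Counter


def search_with_facets(items, predicate, facet_fields):
    """Return matching items plus per-field value counts for those matches."""
    matched = [item for item in items if predicate(item["properties"])]
    facets = {}
    for field in facet_fields:
        # Count each value of this field across the matched items only.
        counts = Counter(
            item["properties"][field]
            for item in matched
            if field in item["properties"]
        )
        facets[field] = dict(counts)
    # "facets" is a hypothetical context key used only in this sketch.
    return {
        "features": matched,
        "context": {"matched": len(matched), "facets": facets},
    }


# Hypothetical items
SAMPLE_ITEMS = [
    {"properties": {"platform": "sentinel-1a", "processing_level": "L1"}},
    {"properties": {"platform": "sentinel-1a", "processing_level": "L2"}},
    {"properties": {"platform": "sentinel-3b", "processing_level": "L2"}},
    {"properties": {"cmip6:source_id": "model-a"}},
]

response = search_with_facets(
    SAMPLE_ITEMS,
    lambda props: "platform" in props,
    ["platform", "processing_level"],
)
```

Because the counts are computed from the same result set as the search, they narrow automatically as the query gets more specific, including inside a single collection.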

@philvarner
Collaborator

Moved to: stac-api-extensions/filter#9

@philvarner philvarner closed this as not planned Jan 30, 2023