Federated search #42

Open
matthewhanson opened this issue Apr 19, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@matthewhanson
Member

A big advantage of STAC is being able to use data from multiple sources.
It would be a nice feature to be able to search multiple STAC endpoints and combine the results into a single FeatureCollection.

@matthewhanson matthewhanson added this to the 1.0.0 milestone Apr 19, 2021
@matthewhanson matthewhanson removed this from the 1.0.0 milestone Sep 20, 2021
@matthewhanson matthewhanson added this to the 0.5.0 milestone Apr 19, 2022
@gadomski gadomski modified the milestones: 0.5.0, 0.6.0 Aug 30, 2022
@gadomski gadomski added the enhancement New feature or request label Nov 9, 2022
@gadomski gadomski modified the milestones: 0.6.0, 0.7.0 Jan 27, 2023
@gadomski gadomski assigned gadomski and unassigned gadomski Jun 7, 2023
@gadomski gadomski removed this from the 0.7.0 milestone Jun 7, 2023
@gadomski
Member

gadomski commented Jun 7, 2023

I have questions. First, would this be enough to support your use case, @matthewhanson?

from pystac_client import Client

client_a = Client.open("http://stac-api-a.test")
client_b = Client.open("http://stac-api-b.test")

search_a = client_a.search(collections=["foo"], datetime="2023-06-07")
search_b = client_b.search(collections=["bar"], datetime="2023-06-07")

items = search_a.item_collection()
items.extend(search_b.item_collection())

If that's enough, then we just need to add an .extend() method to ItemCollection in pystac.
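
To make the intended result concrete, here is a minimal sketch of the combining step on plain GeoJSON dicts rather than pystac objects. The function name `combine_feature_collections` and the sample items are made up for illustration; nothing here is existing pystac or pystac-client API.

```python
from typing import Any, Dict, Iterable, List

FeatureCollection = Dict[str, Any]


def combine_feature_collections(
    collections: Iterable[FeatureCollection],
) -> FeatureCollection:
    """Concatenate the features of several GeoJSON FeatureCollections.

    Hypothetical helper: features are appended in input order, with no
    de-duplication and no re-sorting.
    """
    features: List[Dict[str, Any]] = []
    for fc in collections:
        features.extend(fc.get("features", []))
    return {"type": "FeatureCollection", "features": features}


# Made-up results standing in for the responses of two STAC APIs.
results_a = {
    "type": "FeatureCollection",
    "features": [{"type": "Feature", "id": "a-1", "collection": "foo"}],
}
results_b = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "b-1", "collection": "bar"},
        {"type": "Feature", "id": "b-2", "collection": "bar"},
    ],
}

combined = combine_feature_collections([results_a, results_b])
combined_ids = [feature["id"] for feature in combined["features"]]
print(combined_ids)
```

This simply appends one result set to the other, so any ordering requested from the individual APIs is not preserved across the combined collection.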

If that's not enough, I'm at a bit of a loss. Each STAC API tends to be so different that it doesn't seem realistic to, e.g., use the same collection IDs across clients. If you want to re-use the same set of parameters, it's pretty trivial to do this:

query = {
   "datetime": "2023-06-07",
   "bbox": [-73.21, 43.99, -73.12, 44.05],
}
items = client_a.search(collections=["foo"], **query).item_collection()
items.extend(client_b.search(collections=["bar"], **query).item_collection())
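
If the shared-parameters pattern turns out to be common, it could be wrapped in a small helper. The following is only a sketch: `federated_search`, `FakeClient`, and `FakeSearch` are hypothetical names, not part of pystac-client, and the stand-in classes exist only so the snippet runs without network access. It assumes each client exposes `search(...)` returning an object with `items_as_dicts()`.

```python
from itertools import chain


def federated_search(targets, **query):
    """Run the same query against several (client, collection_ids) pairs.

    Hypothetical helper: results are concatenated in the order the
    targets are given, without any merging by sort order.
    """
    return list(
        chain.from_iterable(
            client.search(collections=collections, **query).items_as_dicts()
            for client, collections in targets
        )
    )


class FakeSearch:
    """Stand-in for a search result object (illustration only)."""

    def __init__(self, items):
        self._items = items

    def items_as_dicts(self):
        return iter(self._items)


class FakeClient:
    """Stand-in for a STAC API client (illustration only)."""

    def __init__(self, items):
        self._items = items

    def search(self, collections=None, **query):
        # Only filters by collection; other query parameters are ignored.
        wanted = set(collections or [])
        return FakeSearch([i for i in self._items if i["collection"] in wanted])


client_a = FakeClient(
    [{"id": "foo-1", "collection": "foo"}, {"id": "other-1", "collection": "other"}]
)
client_b = FakeClient([{"id": "bar-1", "collection": "bar"}])

query = {"datetime": "2023-06-07", "bbox": [-73.21, 43.99, -73.12, 44.05]}
items = federated_search([(client_a, ["foo"]), (client_b, ["bar"])], **query)
print([item["id"] for item in items])
```

Passing the collection IDs per client keeps the point made above: collection names are not expected to match across APIs, only the remaining parameters are shared.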

@matthewhanson, can you sketch out what you had in mind, if it's more than what I've described?

@bitner

bitner commented Nov 2, 2023

The important thing here would be to ensure that, if a sort order is specified in the search, the results from the different endpoints are interleaved according to that order rather than simply concatenated.
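
As a rough illustration of interleaving, here is a sketch for the simplest case: a single sort key, `properties.datetime`, in descending order. It assumes each endpoint already returns its items sorted that way and that all datetimes are ISO 8601 strings in the same format and timezone, so they compare correctly as strings. The function name and sample items are made up.

```python
import heapq


def merge_sorted_results(*result_lists, descending=True):
    """Interleave result lists that are each already sorted by datetime.

    Hypothetical helper: heapq.merge only produces a correctly ordered
    stream if every input is itself sorted in the requested direction.
    """
    return list(
        heapq.merge(
            *result_lists,
            key=lambda item: item["properties"]["datetime"],
            reverse=descending,
        )
    )


# Made-up items, each list sorted newest first.
api_a = [
    {"id": "a2", "properties": {"datetime": "2023-06-07T12:00:00Z"}},
    {"id": "a1", "properties": {"datetime": "2023-06-07T08:00:00Z"}},
]
api_b = [
    {"id": "b2", "properties": {"datetime": "2023-06-07T10:00:00Z"}},
    {"id": "b1", "properties": {"datetime": "2023-06-07T06:00:00Z"}},
]

merged = merge_sorted_results(api_a, api_b)
merged_ids = [item["id"] for item in merged]
print(merged_ids)  # ['a2', 'b2', 'a1', 'b1']
```

Because `heapq.merge` is lazy, the same idea works on paged iterators without loading every result first; the `list(...)` call here is only for the demonstration.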

@bitner

bitner commented Nov 2, 2023

Quick and dirty proof of concept for a federated search that merges records according to their sortby settings.

from pystac_client import Client
import morecantile
import heapq
from functools import reduce, cmp_to_key

# Look up a dotted path such as "properties.datetime" in a nested dict.
dot_get = lambda p, d: reduce(dict.get, p.split('.'), d)

def ogc_sort_func(sorts, a, b, depth=0):
    """cmp-style comparator: negative if a sorts before b, positive if after, 0 if tied."""
    if depth >= len(sorts):
        # Every sort field compared equal, so the two items tie.
        return 0
    sort = sorts[depth]
    field = sort.get('field')
    direction = sort.get('direction', 'asc')
    # +1 for descending, -1 for ascending; this flips the comparison below.
    desc = 1 if direction.lower()[0] == 'd' else -1
    av = dot_get(field, a)
    bv = dot_get(field, b)
    if (av is None and bv is None) or av == bv:
        # Tie on this field: fall through to the next sort key.
        return ogc_sort_func(sorts, a, b, depth=depth+1)
    elif av is None:
        out = -1
    elif bv is None:
        out = 1
    elif av < bv:
        out = 1
    else:
        out = -1
    return desc * out

tms = morecantile.tms.get("WebMercatorQuad")
x, y, z = tms.tile(-93,45,5)
bbox = list(tms.bounds(morecantile.Tile(x, y, z)))
print(bbox)

sortby = [{"field":"properties.datetime","direction":"desc"},{"field":"id","direction":"desc"}]
datetime=["2020-10-10","2020-10-10T18:00:00Z"]
catalog = Client.open('https://planetarycomputer.microsoft.com/api/stac/v1')
results = catalog.search(
    limit=100,
    max_items=1000,
    bbox=bbox,
    collections=["naip"],
    datetime=datetime,
    sortby=sortby
)
a=results.items_as_dicts()

results = catalog.search(
    limit=100,
    max_items=1000,
    bbox=bbox,
    datetime=datetime,
    collections=["landsat-c2-l2"],
    sortby=sortby
)

b=results.items_as_dicts()

results = catalog.search(
    limit=100,
    max_items=1000,
    bbox=bbox,
    datetime=datetime,
    collections=["sentinel-2-l2a"],
    sortby=sortby
)

c=results.items_as_dicts()

keyfunc = lambda l, r: ogc_sort_func(sortby, l, r)

print('merging')
g=heapq.merge(a,b,c, key=cmp_to_key(keyfunc))

print('cycling')
# zip stops cleanly if the merged stream has fewer than 100 items.
for _, row in zip(range(100), g):
    print(dot_get('properties.datetime', row), row.get('id'), row.get('collection'))

@bitner

bitner commented Nov 2, 2023

For that, I did the sorting on the items as plain dicts, but if we were to actually implement this, we could use Item objects and either create a new subclass or monkeypatch an __lt__ method onto the class.
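
A rough sketch of the `__lt__` idea, using a small wrapper around item dicts instead of a real pystac `Item` subclass. `SortableItem` and the sample data are hypothetical, and the choice to sort missing values last is an assumption made here, not something taken from the STAC or OGC specifications.

```python
import heapq
from functools import reduce


def dot_get(path, data):
    """Look up a dotted path such as 'properties.datetime' in nested dicts."""
    return reduce(
        lambda d, key: d.get(key) if isinstance(d, dict) else None,
        path.split("."),
        data,
    )


class SortableItem:
    """Wrap an item dict so it orders itself according to a sortby list."""

    def __init__(self, item, sortby):
        self.item = item
        self.sortby = sortby

    def __lt__(self, other):
        for sort in self.sortby:
            a = dot_get(sort["field"], self.item)
            b = dot_get(sort["field"], other.item)
            if a == b:
                continue  # tie on this field, try the next sort key
            if a is None:
                return False  # assumption: missing values sort last
            if b is None:
                return True
            descending = sort.get("direction", "asc").lower().startswith("d")
            return a > b if descending else a < b
        return False  # equal on every sort key


sortby = [
    {"field": "properties.datetime", "direction": "desc"},
    {"field": "id", "direction": "desc"},
]
# Made-up results, each list already sorted according to sortby.
a = [
    {"id": "a2", "properties": {"datetime": "2020-10-10T12:00:00Z"}},
    {"id": "a1", "properties": {"datetime": "2020-10-10T06:00:00Z"}},
]
b = [
    {"id": "b1", "properties": {"datetime": "2020-10-10T12:00:00Z"}},
    {"id": "b0", "properties": {"datetime": "2020-10-10T01:00:00Z"}},
]

merged = heapq.merge(
    (SortableItem(i, sortby) for i in a),
    (SortableItem(i, sortby) for i in b),
)
merged_ids = [wrapped.item["id"] for wrapped in merged]
print(merged_ids)  # ['b1', 'a2', 'a1', 'b0']
```

With `__lt__` defined on the objects themselves, `heapq.merge` needs no `key` or `cmp_to_key` argument, which is the main appeal of this approach.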


3 participants