
Paginated search queries now don't return a token on the last page #243

Merged
merged 25 commits into from
May 8, 2024

Conversation

pedro-cf
Collaborator

@pedro-cf pedro-cf commented May 5, 2024

Related Issue(s):

Merge dependency(ies):

Description:

  • Paginated search queries now don't return a token on the last page.
  • Made some fixes to the respective tests. In particular, test_pagination_token_idempotent had an indentation issue.
  • Improved execute_search to make use of es_response["hits"]["total"]["value"]

PR Checklist:

  • Code is formatted and linted (run pre-commit run --all-files)
  • Tests pass (run make test)
  • Documentation has been updated to reflect changes, if applicable
  • Changes are added to the changelog

@StijnCaerts
Collaborator

es_response["hits"]["total"]["value"] might not be accurate if there are more than 10,000 hits. Then a lower bound will be indicated with the "gte" relation in es_response["hits"]["total"]["relation"].

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-your-data.html#track-total-hits

So I'd suggest using the count from the search response only if es_response["hits"]["total"]["relation"] == "eq". The async count tasks are still useful when the count exceeds the default threshold.
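
The relation check described above can be sketched as follows (a minimal illustration with a hypothetical response fragment, not the PR's actual code):

```python
# Hypothetical Elasticsearch response fragment: with the default
# track_total_hits=10000, totals above the threshold are reported as a
# lower bound ("gte") rather than an exact count ("eq").
es_response = {"hits": {"total": {"value": 10000, "relation": "gte"}}}

total = es_response["hits"]["total"]
if total["relation"] == "eq":
    matched = total["value"]  # exact count, safe to use directly
else:
    matched = None  # fall back to the separate count request
```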

@jonhealy1
Collaborator

Maybe we can add to this test or do something similar to make sure that the last page isn't returning a token? #244

@pedro-cf
Collaborator Author

pedro-cf commented May 6, 2024

Maybe we can add to this test or do something similar to make sure that the last page isn't returning a token? #244

Greetings, a test for this already exists: test_pagination_item_collection.

It used to expect 7 requests for 6 items; now I've fixed it to expect 6 requests for 6 items.

https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/243/files#diff-a38bef06b3f69317891c409bc81044f53cca52841ea63ec9a6821ea08dea98f6L593-L605

test_item_search_temporal_window_timezone_get and test_pagination_post also test this.

These are all tests I've fixed up.

@pedro-cf
Collaborator Author

pedro-cf commented May 6, 2024

es_response["hits"]["total"]["value"] might not be accurate if there are more than 10,000 hits. Then a lower bound will be indicated with the "gte" relation in es_response["hits"]["total"]["relation"].

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-your-data.html#track-total-hits

So I'd suggest using the count from the search response only if es_response["hits"]["total"]["relation"] == "eq". The async count tasks are still useful when the count exceeds the default threshold.

Greetings @StijnCaerts , thank you so much for the feedback.

Do you think this approach is correct?

        search_task = asyncio.create_task(
            self.client.search(
                index=index_param,
                ignore_unavailable=ignore_unavailable,
                body=search_body,
                size=limit,
            )
        )

        count_task = asyncio.create_task(
            self.client.count(
                index=index_param,
                ignore_unavailable=ignore_unavailable,
                body=search.to_dict(count=True),
            )
        )

        try:
            es_response = await search_task
        except exceptions.NotFoundError:
            raise NotFoundError(f"Collections '{collection_ids}' do not exist")

        hits = es_response["hits"]["hits"]
        items = (hit["_source"] for hit in hits)

        matched = es_response["hits"]["total"]["value"]
        if es_response["hits"]["total"]["relation"] != "eq":
            if count_task.done():
                try:
                    matched = count_task.result().get("count")
                except Exception as e:
                    logger.error(f"Count task failed: {e}")
        else:
            count_task.cancel()

  1. Assume matched = es_response["hits"]["total"]["value"] as the default.
  2. If es_response["hits"]["total"]["relation"] != "eq", use the count_task; if the count fails, log the error but keep the default as a backup.
  3. Otherwise use the default and cancel the count_task if it's still running (I assume count_task.cancel() is safe to call).

@StijnCaerts
Collaborator

I think either getting the correct count or no count at all would be the preferred behaviour.
The context extension is deprecated at this point; I don't know if there is an alternative available. Maybe we should handle this in a separate PR.

https://github.com/stac-api-extensions/context
radiantearth/stac-api-spec#396

@pedro-cf
Collaborator Author

pedro-cf commented May 6, 2024

I think either getting the correct count or no count at all would be the preferred behaviour.

Are there situations where the count can fail? Is this why count_task.result().get("count") is surrounded by a try/except?

With the addition of the page in the token, it's critical to get a matched value every time.

@StijnCaerts
Collaborator

Are there situations where the count can fail? Is this why count_task.result().get("count") is surrounded by a try/except?

Probably for the same reasons a search request could fail, e.g. an invalid collection, a bad query, ...

Without an accurate count, it is impossible to tell if we're on the last page. The only case where you are sure you're on the last page is when the current page size is less than the limit.
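
The last-page condition described above can be sketched as a tiny helper (hypothetical, not from the PR):

```python
def is_definitely_last_page(page_hits, limit):
    # Without an accurate total count, the only certain signal that this
    # is the last page is receiving fewer hits than were requested.
    return len(page_hits) < limit
```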

@jonhealy1
Collaborator

I don't know why it's in a try/except myself. A lot of the db stuff was done by an old contributor.

@jonhealy1
Collaborator

jonhealy1 commented May 6, 2024

es_response["hits"]["total"]["value"] is accurate up to 10,000 results. If the actual count_task fails, which is probably unlikely, we could maybe fall back on this value, because most people are not going to paginate through more than 10,000 results.

@StijnCaerts
Collaborator

Indeed, if you are paging through more than 10,000 hits, I think there is no harm in one extra request with an empty response 😉

@StijnCaerts
Collaborator

Another option would be to implement this workaround: https://stackoverflow.com/a/67200853/9339603

Pros:

  • uniform handling regardless of the number of hits ($.hits.total.relation: eq / gte)
  • no need to track the count in the pagination token

Cons:

  • edge cases like limit=0
  • performance impact?

@pedro-cf
Collaborator Author

pedro-cf commented May 6, 2024

Another option would be to implement this workaround: https://stackoverflow.com/a/67200853/9339603

@StijnCaerts I've tried implementing this approach

        search_after = None

        if token:
            search_after = urlsafe_b64decode(token.encode()).decode().split(",")

        query = search.query.to_dict() if search.query else None

        index_param = indices(collection_ids)

        search_task = asyncio.create_task(
            self.client.search(
                index=index_param,
                ignore_unavailable=ignore_unavailable,
                query=query,
                sort=sort or DEFAULT_SORT,
                search_after=search_after,
                size=limit + 1,  # Fetch one more result than the limit
            )
        )

        count_task = asyncio.create_task(
            self.client.count(
                index=index_param,
                ignore_unavailable=ignore_unavailable,
                body=search.to_dict(count=True),
            )
        )

        try:
            es_response = await search_task
        except exceptions.NotFoundError:
            raise NotFoundError(f"Collections '{collection_ids}' do not exist")

        hits = es_response["hits"]["hits"]
        items = (hit["_source"] for hit in hits[:limit])

        next_token = None
        if len(hits) > limit:
            if hits and (sort_array := hits[limit - 1].get("sort")):
                next_token = urlsafe_b64encode(
                    ",".join([str(x) for x in sort_array]).encode()
                ).decode()

        matched = None
        if count_task.done():
            try:
                matched = count_task.result().get("count")
            except Exception as e:
                logger.error(f"Count task failed: {e}")

        return items, matched, next_token

but I'm getting these errors on these tests:

FAILED stac_fastapi/tests/api/test_api.py::test_app_query_extension_limit_gt10000 - elasticsearch.BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or e...
FAILED stac_fastapi/tests/api/test_api.py::test_app_query_extension_limit_10000 - elasticsearch.BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or e...

@pedro-cf
Collaborator Author

pedro-cf commented May 6, 2024

As I understand it, Elasticsearch itself doesn't allow limits above 10,000 by default, so when you pass 10,000 + 1 it responds with a bad request.

I was a bit confused about why these tests weren't failing on the main branch, but it's because of this entry from the stac-fastapi changelog:

  • Limit values above 10,000 are now replaced with 10,000 instead of returning a 400 error (#526)

@pedro-cf
Collaborator Author

pedro-cf commented May 6, 2024

Ended up applying this #243 (comment) with edge-case handling, in this case for the 10,000 limit.

@pedro-cf
Collaborator Author

pedro-cf commented May 7, 2024

what do you think @jonhealy1 ?

I don't really like this hardcoded 10,000, but it's basically set by:

  • the Elasticsearch setting, which only allows requests with up to 10,000 results by default.
  • stac-fastapi.api, which replaces values above the limit with the limit value (10,000).
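
The interplay of the two limits above can be sketched like this (max_result_window is hardcoded here so the sketch stands alone; in the PR it would come from the stac-fastapi limit instead):

```python
# Elasticsearch's default max_result_window; hardcoded for this sketch.
MAX_RESULT_WINDOW = 10_000

def effective_size(limit):
    # Over-fetch by one to detect whether a next page exists, but never
    # ask Elasticsearch for more than the result window allows.
    return min(limit + 1, MAX_RESULT_WINDOW)
```

At limit=10,000 the over-fetch disappears, which is why one extra empty-page request can occur at the window boundary.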

@jonhealy1
Collaborator

Can we import the max limit from stac-fastapi?

@pedro-cf
Collaborator Author

pedro-cf commented May 7, 2024

Can we import the max limit from stac-fastapi?

Yes, we can use this:

import stac_fastapi.types.search
max_result_window = stac_fastapi.types.search.Limit.le

or would you prefer this:

from stac_fastapi.types.search import Limit
max_result_window = Limit.le

I prefer option 1 since it makes it clear where the limit comes from.

@jonhealy1
Collaborator

I prefer option one too

@jonhealy1 jonhealy1 self-requested a review May 7, 2024 14:51
@pedro-cf
Collaborator Author

pedro-cf commented May 7, 2024

I prefer option one too

added

@jonhealy1 jonhealy1 self-requested a review May 7, 2024 14:58
Collaborator

@jonhealy1 jonhealy1 left a comment

Nice work here Pedro. I will wait until tomorrow to merge just in case anyone else has any thoughts.

@StijnCaerts
Collaborator

I just left one small remark. Otherwise it looks good to me, thanks for the contribution!

@jonhealy1 jonhealy1 self-requested a review May 8, 2024 18:07
Collaborator

@jonhealy1 jonhealy1 left a comment

Approved!

@jonhealy1 jonhealy1 merged commit 55dd87e into stac-utils:main May 8, 2024
4 checks passed