
Paginated search queries always return a next token, even if it's the last page of the pagination. #242

Closed
pedro-cf opened this issue May 4, 2024 · 4 comments


@pedro-cf
Collaborator

pedro-cf commented May 4, 2024

Describe the bug
Paginated search queries always return a next token, even if it's the last page of the pagination.

To Reproduce
Steps to reproduce the behavior:

  1. Run the app: `docker-compose up`
  2. Create a test collection
  3. Create 1 test item
  4. Run a search query with `"limit": 1` -> the response will contain a "next" link
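
For concreteness, the reproduction boils down to POSTing a search body like the one below (the collection id is a placeholder, not taken from the issue):

```python
import json

# Hypothetical minimal /search request body reproducing the bug:
# one matching item in the collection, page size of one.
search_body = {
    "collections": ["test-collection"],  # placeholder collection id
    "limit": 1,
}

# The response to this request is a FeatureCollection that should NOT
# contain a {"rel": "next"} link, since the single item fits on one page.
print(json.dumps(search_body))
```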

Expected behavior
I assume the last page of the pagination should not include a "next" link.

@jonhealy1
Collaborator

It's true. If there are no more items, there shouldn't be a next link.

@pedro-cf
Collaborator Author

pedro-cf commented May 5, 2024

```python
async def execute_search(
    self,
    search: Search,
    limit: int,
    token: Optional[str],
    sort: Optional[Dict[str, Dict[str, str]]],
    collection_ids: Optional[List[str]],
    ignore_unavailable: bool = True,
) -> Tuple[Iterable[Dict[str, Any]], Optional[int], Optional[str]]:
    """Execute a search query with limit and other optional parameters.

    Args:
        search (Search): The search query to be executed.
        limit (int): The maximum number of results to be returned.
        token (Optional[str]): The token used to return the next set of results.
        sort (Optional[Dict[str, Dict[str, str]]]): Specifies how the results should be sorted.
        collection_ids (Optional[List[str]]): The collection ids to search.
        ignore_unavailable (bool, optional): Whether to ignore unavailable collections. Defaults to True.

    Returns:
        Tuple[Iterable[Dict[str, Any]], Optional[int], Optional[str]]: A tuple containing:
            - An iterable of search results, where each result is a dictionary with keys and values
              representing the fields and values of each document.
            - The total number of results (if the count could be computed), or None if the count
              could not be computed.
            - The token to be used to retrieve the next set of results, or None if there are no
              more results.

    Raises:
        NotFoundError: If the collections specified in `collection_ids` do not exist.
    """
    search_after = None
    if token:
        search_after = urlsafe_b64decode(token.encode()).decode().split(",")

    query = search.query.to_dict() if search.query else None

    index_param = indices(collection_ids)

    search_task = asyncio.create_task(
        self.client.search(
            index=index_param,
            ignore_unavailable=ignore_unavailable,
            query=query,
            sort=sort or DEFAULT_SORT,
            search_after=search_after,
            size=limit,
        )
    )

    count_task = asyncio.create_task(
        self.client.count(
            index=index_param,
            ignore_unavailable=ignore_unavailable,
            body=search.to_dict(count=True),
        )
    )

    try:
        es_response = await search_task
    except exceptions.NotFoundError:
        raise NotFoundError(f"Collections '{collection_ids}' do not exist")

    hits = es_response["hits"]["hits"]
    items = (hit["_source"] for hit in hits)

    next_token = None
    if hits and (sort_array := hits[-1].get("sort")):
        next_token = urlsafe_b64encode(
            ",".join([str(x) for x in sort_array]).encode()
        ).decode()

    # (1) count should not block returning results, so don't wait for it to be done
    # (2) don't cancel the task so that it will populate the ES cache for subsequent counts
    maybe_count = None
    if count_task.done():
        try:
            maybe_count = count_task.result().get("count")
        except Exception as e:
            logger.error(f"Count task failed: {e}")

    return items, maybe_count, next_token
```

The issue is here, but I'm not quite sure how to fix it, because I'm not so familiar with Elasticsearch.
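
One common way to detect the last page with `search_after` (a suggestion on my part, not necessarily the fix the maintainers will choose) is to over-fetch by one document, i.e. request `size=limit + 1`, and only emit a next token when more than `limit` hits come back. The trimming logic can be sketched standalone:

```python
from base64 import urlsafe_b64encode
from typing import Any, Dict, List, Optional, Tuple


def page_hits(
    hits: List[Dict[str, Any]], limit: int
) -> Tuple[List[Dict[str, Any]], Optional[str]]:
    """Trim an over-fetched hit list (from size=limit+1) and build the next token.

    A hit here mimics the Elasticsearch response shape: a dict with
    "_source" and "sort" keys.
    """
    has_more = len(hits) > limit  # the extra hit proves another page exists
    page = hits[:limit]
    next_token = None
    if has_more and page and (sort_array := page[-1].get("sort")):
        next_token = urlsafe_b64encode(
            ",".join(str(x) for x in sort_array).encode()
        ).decode()
    return page, next_token


# limit=1, two hits returned -> there is a next page, so a token is built
hits = [
    {"_source": {"id": "a"}, "sort": [1714800000, "a"]},
    {"_source": {"id": "b"}, "sort": [1714800001, "b"]},
]
page, token = page_hits(hits, limit=1)

# limit=1, one hit returned -> last page, no token
page2, token2 = page_hits(hits[:1], limit=1)
```

This avoids any extra count request: the presence of the (limit+1)-th hit alone decides whether a "next" link is warranted.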

@pedro-cf
Collaborator Author

pedro-cf commented May 5, 2024

I can't seem to figure out a good way to obtain the "page" number in Elasticsearch.

We could pass a page number inside the token, but sadly this would cause issues in other places.

My example:

    async def execute_search(
...
    ) -> Tuple[Iterable[Dict[str, Any]], Optional[int], Optional[str]]:
        """
...
        """
        search_after = None
        page = 1

        if token:
            decoded_token = urlsafe_b64decode(token.encode()).decode().split(",")
            search_after = decoded_token[:-1]
            page = int(decoded_token[-1]) + 1

        query = search.query.to_dict() if search.query else None

        index_param = indices(collection_ids)

        search_task = asyncio.create_task(
            self.client.search(
                index=index_param,
                ignore_unavailable=ignore_unavailable,
                query=query,
                sort=sort or DEFAULT_SORT,
                search_after=search_after,
                size=limit,
            )
        )

        try:
            es_response = await search_task
        except exceptions.NotFoundError:
            raise NotFoundError(f"Collections '{collection_ids}' do not exist")

        matched = es_response["hits"]["total"]["value"]
        hits = es_response["hits"]["hits"]
        items = (hit["_source"] for hit in hits)

        next_token = None
        if matched > page * limit:
            if hits and (sort_array := hits[-1].get("sort")):
                next_token = urlsafe_b64encode(
                    ",".join([str(x) for x in sort_array] + [str(page)]).encode()
                ).decode()

        return items, matched, next_token
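
The token round-trip in the example above can be shown in isolation (the sort values below are hypothetical): the last hit's sort values plus the current page number are joined with commas and base64-encoded, then split back apart on decode.

```python
from base64 import urlsafe_b64decode, urlsafe_b64encode

# Hypothetical sort values from the last hit of the current page
sort_array = [1714800000, "item-1"]
page = 1

# Encode: sort values plus the page number, comma-joined, urlsafe base64
token = urlsafe_b64encode(
    ",".join([str(x) for x in sort_array] + [str(page)]).encode()
).decode()

# Decode: everything but the last element feeds search_after,
# the last element is the page counter
decoded = urlsafe_b64decode(token.encode()).decode().split(",")
search_after = decoded[:-1]       # ["1714800000", "item-1"]
next_page = int(decoded[-1]) + 1  # 2
```

One caveat with this scheme: a sort value that itself contains a comma would break the `split(",")` on decode, so it relies on the sort fields never containing that character.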

Note that this version of the function also makes use of `es_response["hits"]["total"]["value"]`, which, as I understand it, represents the number of matched documents and could therefore replace the separate count request:

```python
count_task = asyncio.create_task(
    self.client.count(
        index=index_param,
        ignore_unavailable=ignore_unavailable,
        body=search.to_dict(count=True),
    )
)

maybe_count = None
if count_task.done():
    try:
        maybe_count = count_task.result().get("count")
    except Exception as e:
        logger.error(f"Count task failed: {e}")

return items, maybe_count, next_token
```

@jonhealy1
Collaborator

I added another test related to this here: #244

jonhealy1 added a commit that referenced this issue May 8, 2024

**Related Issue(s):**

- #242

**Merge dependencies:**

- #241

**Description:**
- Paginated search queries now don't return a token on the last page.
- Made some fixes to the respective tests. In particular,
`test_pagination_token_idempotent` had an indentation issue
- Improved `execute_search` to make use of
`es_response["hits"]["total"]["value"]`

**PR Checklist:**

- [x] Code is formatted and linted (run `pre-commit run --all-files`)
- [x] Tests pass (run `make test`)
- [x] Documentation has been updated to reflect changes, if applicable
- [x] Changes are added to the changelog

---------

Co-authored-by: Jonathan Healy <jonathan.d.healy@gmail.com>