
Limit number of pages on search pagination #1604

Merged
sonalkr132 merged 1 commit into rubygems:master from result-window-too-large on May 10, 2017

Conversation

sonalkr132
Member

Fixes: https://app.honeybadger.io/projects/40972/faults/33345740
It was erroring for large page numbers:

```
org.elasticsearch.search.query.QueryPhaseExecutionException:
Result window is too large, from + size must be less than or equal to: [10000] but was [74520].
```

```
Parameters: {"utf8"=>"✓", "query"=>"%", "page"=>"2484"}
Rubygem Search (196.7ms) {index: "rubygems-development", type: "rubygem",
  body: {query: ...}, size: 30, from: 74490}  # `from` > 10_000
```

`terminate_after` was considered but it doesn't seem to take precedence over `from`.

cc: @karmi

@jvanbaarsen
Contributor

@sonalkr132 What is the reason you are picking 333? The error seems to say that you can go up to 1000, or am I reading this wrong?

@sonalkr132
Member Author

`from` is calculated as `(page - 1) * per_page`. We have:

```
from + size <= 10_000
((page - 1) * per_page) + per_page <= 10_000
page * per_page <= 10_000
page <= 10_000 / 30   # per_page = 30
page <= 333.33
```

so the last page that still fits in the result window is 333.
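In other words (a trivial sketch; the constant names are only illustrative):

```ruby
ES_MAX_RESULT_WINDOW = 10_000 # Elasticsearch's default index.max_result_window
PER_PAGE = 30

LAST_PAGE = ES_MAX_RESULT_WINDOW / PER_PAGE # => 333
```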

```diff
@@ -1,8 +1,10 @@
 class SearchesController < ApplicationController
   before_action :set_page, only: :show
+  LAST_PAGE = 333
```
Member

The reason for this number is a little hard to understand just looking at the code. Maybe a comment or link to this PR would be helpful for future contributors.

@janko
Contributor

janko commented Apr 26, 2017

At my company we also had to solve the same problem after upgrading Elasticsearch. To avoid limiting the last page, I would recommend scrolling through pages instead, using a scroll page size of `[page * per_page, 10_000].min`. If page * per_page is lower than 10,000 you wouldn't scroll past the first page, and if it's higher you would scroll in pages of 10,000 documents (which had the best performance for us) until you reach the ES page that contains the requested documents. In the process you would keep a scroll offset variable, so that you don't have to keep all retrieved documents in memory, just the current ES page.

@sonalkr132
Member Author

Hi @janko-m, thank you for having a look ✌️
Are you recommending use of the scroll API? If yes, what about the warning against using it for real-time user requests:

Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.

@janko
Contributor

janko commented May 1, 2017

Yes, I was recommending the scroll API. I wasn't aware of this warning, but we are using scrolling for serving real time requests in production; in my benchmarks fetching the first 10,000 documents using the scrolling API had equal performance to fetching the first 10,000 documents without using the scrolling API, so there should be no performance downsides.

Naturally, fetching a second page of 10,000 documents is going to be less performant than fetching the first page, but that's usually the case with pagination. Considering it's not possible to fetch the second page of 10,000 documents with from + size anyway, and that it's unlikely folks will go past the 333rd page, I think it's a good way to go.

Note that if you'll be using the scroll API, it's a good idea to manually clear the scroll cursor after the requested documents have been fetched. In that case it's probably best to combine from + size and the scroll API: use the former for the first 10,000 documents and the latter for documents after the 10,000th; that way the scroll context is created only when it's needed.
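For illustration, a minimal sketch of that combined approach with the elasticsearch-ruby client (scroll method signatures vary a bit between client versions, and all other names here are made up):

```ruby
# Sketch only: ordinary from/size pagination inside the 10_000-document window,
# scrolling in large chunks for anything deeper, then clearing the scroll.
MAX_WINDOW = 10_000

def fetch_page(client, query, page, per_page)
  from = (page - 1) * per_page

  if from + per_page <= MAX_WINDOW
    # Cheap path: a normal paginated query.
    return client.search(index: "rubygems", body: { query: query },
                         from: from, size: per_page)["hits"]["hits"]
  end

  # Deep page: scroll in the largest chunk that is still a whole number of
  # pages, so a requested page never straddles two chunks, and keep only an
  # offset instead of all retrieved documents.
  chunk     = (MAX_WINDOW / per_page) * per_page
  result    = client.search(index: "rubygems", body: { query: query },
                            size: chunk, scroll: "1m")
  scroll_id = result["_scroll_id"]
  offset    = from

  while offset >= result["hits"]["hits"].length
    break if result["hits"]["hits"].empty?       # ran out of results
    offset   -= result["hits"]["hits"].length
    result    = client.scroll(scroll_id: scroll_id, scroll: "1m")
    scroll_id = result["_scroll_id"]
  end

  result["hits"]["hits"][offset, per_page] || []
ensure
  # Free the scroll context as soon as we're done with it.
  client.clear_scroll(scroll_id: scroll_id) if scroll_id
end
```

Hitting the scroll path only for deep pages keeps the common case as cheap as it is today.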

@karmi

karmi commented May 4, 2017

Hi all, sorry for being so late to the party.

First of all, I wonder what the rationale is for allowing people to load the 1,000th (or even 100th) page of results. I would argue that people look at a couple of the first pages at most, and the majority don't even click through to the second page (though website analytics would give a more realistic answer). Loading the 10,000th page is very costly for Elasticsearch, since it has to fetch all the results and then skip and throw away all the preceding "pages".

This is why, for example, Kaminari has the max_pages configuration option, which sets a hard limit on paging.
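For reference, setting that option would look roughly like this (an illustrative Kaminari initializer; rubygems.org uses will_paginate, so this is not what the PR changes):

```ruby
# config/initializers/kaminari.rb -- illustrative only
Kaminari.configure do |config|
  config.max_pages = 100 # hard cap on the number of pages exposed to users
end
```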

I guess there are valid reasons for using the "Scroll" API for returning the whole set of results, as @janko-m suggests, but in my experience this is limited to a narrow set of use-cases. The "Scroll" API is indeed not intended for a user-facing operation like paging the search results on a webpage, but for something like exporting the results to CSV, for example.

So, in practical terms, I would set a hard limit on the 100th page of results, in the spirit of Kaminari's max_pages variable. With 30 results per page, that's still 3,000 search results, presumably a lot for the user to find what they are looking for — or to change their query.

Again, sorry for the late reply, please ping me in any Rubygems.org issue related to Elasticsearch, I'll do my best to help.

@dwradcliffe
Member

dwradcliffe commented May 4, 2017 via email

@sonalkr132
Member Author

I'm totally on board with setting a reasonable limit short of 1000 pages.

This PR does the same. I doubt we need to tinker with will_paginate to set this limit everywhere.
I have added the requested comment. Let me know if this needs any more changes.

@karmi

karmi commented May 6, 2017

@sonalkr132 I still think 333 pages is maybe too much; I would cap it at 100 pages.


```ruby
def show
  return unless params[:query] && params[:query].is_a?(String)
  @page = LAST_PAGE if @page > LAST_PAGE
```
Member

I know this is a real edge case, but what about showing a proper warning or error message?

@sonalkr132
Member Author

Updated it to limit at 100 pages.

what about to show proper warning or error message?

@simi I doubt we need to inform users about things going wrong in a feature/workflow we don't support. It reminds me of this.

@dwradcliffe
Member

100 pages is good. 👍 I think we need to return an error and fail the request if the page is higher than that. We don't want to return a 200, and we don't want the same results returned for pages > 100. It looks like GitHub returns a 404 for this case, which is what I would do too.
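A rough sketch of what that could look like in the controller (hypothetical helper names; the merged change clamps @page to LAST_PAGE rather than rejecting the request):

```ruby
# Hypothetical variant of set_page that rejects out-of-range pages with a 404.
def set_page
  @page = params.fetch(:page, 1).to_i
  render_page_not_found if @page > LAST_PAGE
end

def render_page_not_found
  render plain: "This page is unavailable.", status: :not_found
end
```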

Fixes: https://app.honeybadger.io/projects/40972/faults/33345740
It was erroring for large page numbers:
```
org.elasticsearch.search.query.QueryPhaseExecutionException: Result window is too large, from + size must be less than or equal to: [10000] but was [74520].
```
`terminate_after` was considered but it doesn't seem to take precedence
over `from`.
@sonalkr132
Member Author

Updated! @simi Sorry for not paying heed to your suggestion 👷

@sonalkr132 sonalkr132 merged commit 2afa406 into rubygems:master May 10, 2017
@sonalkr132 sonalkr132 deleted the result-window-too-large branch May 10, 2017 05:05
@simi
Member

simi commented May 10, 2017

@sonalkr132 don't worry. You're doing amazing work.
