
Have query return generator to reduce server stress #64

Closed · j08lue opened this issue Nov 1, 2016 · 2 comments

j08lue (Contributor) commented Nov 1, 2016

@valgur, in the context of the last PR you suggested that load_query should return a generator over the pages of long (> 100 results) queries, and that the download function should advance it only after downloading each page. This might reduce server stress, because the queries would not all be sent right after one another.

Changing load_query() to return a generator instead of a full list might be a good idea. As I see it, the motivation on the SciHub side for limiting the maximum query size was to limit load spikes caused by large queries. Using a generator and issuing another sub-query only when the current one's results have all been consumed would spread out the queries, e.g. when iterating over the query results in download_all().
There is one minor problem with that approach, though. SciHub's OpenSearch API provides the number of rows a full query result would contain, but Python generators do not support len(), so that information would need to be exposed by some other means.
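For illustration, a minimal sketch of what such a generator-based load_query could look like. This is not the actual sentinelsat code; the load_subquery helper, the parameter names, and the page size of 100 are all assumptions:

```python
def load_query(load_subquery, query, start_row=0, page_size=100):
    """Yield query results page by page, sending the next sub-query only
    after the previous page has been fully consumed.

    load_subquery(query, start, rows) is assumed to perform a single
    OpenSearch sub-query and return (products, total_result_count).
    """
    # The first sub-query also tells us how many rows the full result
    # contains -- since a generator cannot expose this through len(),
    # it would have to be surfaced some other way (e.g. stored on the
    # API object).
    products, total = load_subquery(query, start_row, page_size)
    yield from products

    row = start_row + page_size
    while row < total:
        products, _ = load_subquery(query, row, page_size)
        yield from products
        row += page_size
```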

I am not sure how big the benefit on the server side would be, but I think this would be good practice. It only works, though, when the user does not need to list the details of all products returned by the query before downloading them. So in the CLI search function we would still need to load all the pages immediately.

kr-stn (Member) commented Nov 4, 2016

I do think it is a good idea to try to reduce the stress on the server, but:

  • I have had cases where the OpenSearch API was down while downloading with known scene IDs still worked.
  • This might run into issues with very large queries when a new product that also fulfills the search parameters is published mid-query. Due to the pagination, the last product of query n might then reappear as the first product of query n+1 (one way to guard against this is sketched after this list).
  • As you said, if a user wants to know the length or list all products, we still need to fire off all the queries in short succession.
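An illustrative sketch of guarding against that duplication while consuming a paginated generator: skip products whose identifier has already been seen. The "id" key as a unique product identifier is an assumption about the result structure, not necessarily what the OpenSearch API returns.

```python
def iter_unique(products):
    """Yield each product once, skipping any that reappear when a newly
    published product shifts the pagination between sub-queries."""
    seen = set()
    for product in products:
        uid = product["id"]  # hypothetical unique identifier field
        if uid in seen:
            continue
        seen.add(uid)
        yield product
```

Note this only catches duplicates; it cannot recover a product that the shifted pagination skips over entirely, e.g. if another product is removed mid-query.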

Overall I see the point of this improvement, but I don't think it is too important. It will have the largest impact on very large queries and the resulting long-running download jobs.

valgur (Member) commented Jan 27, 2017

It's about time I followed up on the topic since I initiated it.

I agree with all of your points and now think that this would not be a practical or reasonable approach. It would make the API more complicated without providing much benefit. If the queries are spread out over a longer timeframe, e.g. when downloading a long list of products, then the chance of the queries being affected by server downtime becomes quite high, given SciHub's unfortunate unreliability. On the other hand, if the first sub-query succeeds, then any consecutive ones are quite likely to succeed as well, so it is better to take advantage of that.

@willemarcel or @j08lue, I think we have reached a consensus here and you can safely close this issue.

@j08lue j08lue closed this as completed Jan 31, 2017