
Have query return generator to reduce server stress #64

Closed · j08lue opened this issue Nov 1, 2016 · 2 comments

j08lue (Contributor) commented Nov 1, 2016

@valgur, in the context of the last PR you suggested that load_query should return a generator over the pages of long (> 100 results) queries, and that the download function should advance it only after downloading each page. This might reduce server stress, because the queries would not all be sent right after one another.

Changing load_query() to return a generator instead of a full list might be a good idea. As I see it, the motivation on the SciHub side for limiting the maximum query size was to limit load spikes caused by large queries. Using a generator and issuing another sub-query only when the current one's results have all been consumed would spread out the queries, e.g. when iterating over the query results in download_all().
There is one minor problem with that approach, though. SciHub's OpenSearch API provides the number of rows a full query result would contain, but Python generators do not support len(), so that information would need to be exposed by some other means.
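For illustration, a minimal sketch of what such a generator-based load_query could look like. This is not the actual sentinelsat code; the load_subquery helper, the parameter names, and the page size of 100 are all assumptions:

```python
def load_query(load_subquery, query, start_row=0, page_size=100):
    """Yield query results page by page, sending the next sub-query only
    after the previous page has been fully consumed.

    load_subquery(query, start, rows) is assumed to perform a single
    OpenSearch sub-query and return (products, total_result_count).
    """
    # The first sub-query also tells us how many rows the full result
    # contains -- since a generator cannot expose this through len(),
    # it would have to be surfaced some other way (e.g. stored on the
    # API object).
    products, total = load_subquery(query, start_row, page_size)
    yield from products

    row = start_row + page_size
    while row < total:
        products, _ = load_subquery(query, row, page_size)
        yield from products
        row += page_size
```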

I am not sure how big the benefit on the server side would be, but I think this would be good practice. It only works, though, when the user does not need to list the details of all products returned by the query before downloading them. So in the CLI search function we would still need to load all the pages immediately.

kr-stn (Member) commented Nov 4, 2016

I do think it is a good idea to try to reduce the stress on the server, but:

  • I have had cases where the OpenSearch API was down while downloading with known scene IDs still worked.
  • This might run into issues with very large queries when a new product that also fulfills the search parameters is published mid-query. Due to the pagination, the last product of query n might then reappear as the first product of query n+1 (one way to guard against this is sketched after this list).
  • As you said, if a user wants to know the length or list all products, we still need to fire off all the queries in short succession.
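An illustrative sketch of guarding against that duplication while consuming a paginated generator: skip products whose identifier has already been seen. The "id" key as a unique product identifier is an assumption about the result structure, not necessarily what the OpenSearch API returns.

```python
def iter_unique(products):
    """Yield each product once, skipping any that reappear when a newly
    published product shifts the pagination between sub-queries."""
    seen = set()
    for product in products:
        uid = product["id"]  # hypothetical unique identifier field
        if uid in seen:
            continue
        seen.add(uid)
        yield product
```

Note this only catches duplicates; it cannot recover a product that the shifted pagination skips over entirely, e.g. if another product is removed mid-query.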

Overall I see the point of this improvement, but I don't think it is too important. It will have the largest impact on very large queries and the resulting long-running download jobs.

valgur (Member) commented Jan 27, 2017

It's about time I followed up on the topic since I initiated it.

I agree with all of your points and now think that this would not be a practical or reasonable approach. It would make the API more complicated without providing much benefit. If the queries are spread out over a longer timeframe, e.g. when downloading a long list of products, then the chance of the queries being affected by server downtime becomes quite high, given SciHub's unfortunate unreliability. On the other hand, if the first sub-query succeeds, then any consecutive ones are quite likely to succeed as well, so it is better to take advantage of that.

@willemarcel or @j08lue, I think we have reached a consensus here and you can safely close this issue.

@j08lue j08lue closed this as completed Jan 31, 2017