
Implement a retry request function #4902

Merged: 14 commits, merged on Apr 1, 2021

Conversation

@Gallaecio (Member) commented Nov 23, 2020

Alternative to #3685 for implementing request retry from spider callbacks.

I prefer this approach because the implementation is simpler (basically a refactoring of existing code), while still providing a similar API, and it stays in line with the current implementation of retries as a downloader middleware by not modifying Request or Response in any way.

With this implementation, there is a function which can be used to retry a request from a spider callback:

# …
from scrapy import Spider
from scrapy.downloadermiddlewares.retry import get_retry_request

class MySpider(Spider):
    # …
    def parse(self, response):
        if not response.text:
            new_request = get_retry_request(response.request, reason='empty', spider=self)
            if new_request:
                yield new_request
            return

To do:

  • Choose the best API subset, remove the rest or make it private.
  • Add documentation.
  • Add tests.

Resolves #3590, closes #3685

@noviluni (Member) left a comment:


I would prefer the first/second approach over the RetrySpiderMixin; the pattern is easier if you don't need to add the mixin to the spider. However, I understand that having to pass spider=self is not so friendly, so I can understand/accept the third/fourth approach.

@ejulio (Contributor) commented Feb 9, 2021

@Gallaecio, I was thinking about this recently: what are your thoughts on reusing RetryMiddleware in the following manner?

class RetryMiddleware: # scrapy class

    def open_spider(self, spider):
        get_retry_reason = getattr(spider, 'get_retry_reason', lambda resp: None)
        self.get_retry_reason = get_retry_reason

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
        else:
            reason = self.get_retry_reason(response)
        if reason:
            return self._retry(request, reason, spider) or response
        return response


class MySpider(Spider):

    def get_retry_reason(self, response):
         return "A custom reason"

Basically, we'd be using the current middleware but adding an extension point to it.
This would make it easier to specify new retry conditions, and it sits at the proper place in Scrapy's middleware priority chain.
Also, the code stays out of the parse methods, making them a bit cleaner, while still allowing access to the spider.
Finally, as it doesn't end up in parse, it might help reduce side effects (say, running spider middlewares for something that should have been retried).

Another option would be to allow a custom_retry_condition meta key (or similar), but I'm not sure if that would be better.
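For concreteness, here is a rough sketch of what the meta-key idea might look like inside the retry middleware. The custom_retry_condition key and this process_response body are assumptions made up for illustration, not part of this pull request:

from scrapy.utils.response import response_status_message

class RetryMiddleware:  # sketch only, not the actual Scrapy class

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        # hypothetical meta key holding a callable that decides whether to retry
        condition = request.meta.get('custom_retry_condition')
        if condition is not None and condition(response):
            return self._retry(request, 'custom condition', spider) or response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response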

@Gallaecio (Member, Author) commented:

@ejulio I agree that it would be best to adhere to the regular flow of Scrapy (middlewares, extensions, etc.) for this, and that it would be best to keep this in the retry middleware to avoid unnecessary processing. I had not even thought about the latter.

However, I’m not sure asking to subclass the middleware would be better. Since we are talking about enabling spider-specific conditions for retry, I think it makes sense for that code to be defined in the spider.

Your custom meta key idea, on the other hand… I think that could work. It would not require a middleware, it would be in line with the Scrapy flow, it would not let those responses go past the retry middleware… It could even be modified by middlewares in a similar fashion to decorators. The more I think about it, the more I like it.

@ejulio (Contributor) commented Feb 10, 2021

@Gallaecio, we shouldn't subclass Scrapy's middleware.
The standard Scrapy middleware should subscribe to open_spider and pick up get_retry_reason if it is defined in the spider.
If it's not defined, the middleware works as is; if it is defined, it is a hook the spider can use to handle custom retries.

I was thinking that we could go even further, but that could be a bit drastic.
We could introduce:

RETRY_CONDITIONS = {
    some_function: {'RETRY_TIMES': 3},  # plus other per-condition settings…
}

Then, in the spider

custom_settings = {
    'RETRY_CONDITIONS': {
        self.custom_retry_condition: {'RETRY_TIMES': 3},
        self.other_custom_retry_condition: {'RETRY_TIMES': 3},
    }
}

In this scheme, the fallback to RETRY_HTTP_CODES would be to insert a common function if the list is defined.
But I think it's a bigger change, and I'm not sure it's better than a single function with some conditions.
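A rough sketch of how the middleware side of that proposal might look. The RETRY_CONDITIONS setting, the ConditionalRetryMiddleware class and the per-condition RETRY_TIMES key are all assumptions taken from this discussion, not existing Scrapy API:

from scrapy.downloadermiddlewares.retry import RetryMiddleware


class ConditionalRetryMiddleware(RetryMiddleware):

    def __init__(self, settings):
        super().__init__(settings)
        # mapping of condition callables to their per-condition configuration
        self.retry_conditions = settings.get('RETRY_CONDITIONS') or {}

    def process_response(self, request, response, spider):
        for condition, conf in self.retry_conditions.items():
            if condition(response):
                # honor a per-condition retry limit, if one was configured
                if 'RETRY_TIMES' in conf:
                    request.meta.setdefault('max_retry_times', conf['RETRY_TIMES'])
                return self._retry(request, 'custom condition', spider) or response
        # fall back to the standard RETRY_HTTP_CODES behaviour
        return super().process_response(request, response, spider)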

@Gallaecio (Member, Author) commented:

I’m not convinced about the open_spider approach. It seems too convoluted to me. And if you want different retry logic for different parts of the same spider (e.g. different callbacks), it would require you to use something like meta to pass information from the callback to get_retry_reason to condition the logic.

I agree that having custom retry times can be useful, but I think that should be done (or maybe is already possible) through meta.
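For reference, the retry downloader middleware already reads a max_retry_times key from Request.meta (it shows up in the diff context quoted later in this review), so a per-request retry count can be set roughly like this (spider name and URL are placeholders):

import scrapy


class MySpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # per-request retry limit via the max_retry_times meta key
        yield scrapy.Request('https://example.com', meta={'max_retry_times': 5})

    def parse(self, response):
        self.logger.info('Got %s', response.url)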

@ejulio (Contributor) commented Feb 11, 2021

Makes sense then.
Having granular control on the request itself over the number of retries and the custom condition.

Currently we don't support a custom retry count in meta, but it should be fairly easy to add.

The custom condition is, I think, harder to do.
If we put a function in meta, it might make the request non-serializable, disabling the disk queues.
So we need to be careful about it.
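A minimal illustration of that serialization concern (not Scrapy code): callables such as lambdas generally cannot be pickled, and pickling is what the disk queues rely on.

import pickle

# a retry condition expressed as a lambda, as it might be stored in meta
condition = lambda response: response.status == 503

try:
    pickle.dumps(condition)
except Exception as exc:
    # typically raises pickle.PicklingError, so a request carrying this in
    # meta could not be written to a JOBDIR disk queue
    print(f'cannot serialize the condition: {type(exc).__name__}')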

@Gallaecio (Member, Author) commented Feb 11, 2021

Good point, I hadn’t even thought of serialization. I’ll mention #4945 here as relatively relevant (how would someone extend request serialization to support functions in meta?).

@codecov bot commented Feb 22, 2021

Codecov Report

Merging #4902 (e0a4fc4) into master (fa3ebb1) will increase coverage by 0.18%.
The diff coverage is 91.23%.

❗ Current head e0a4fc4 differs from pull request most recent head d458ccf. Consider uploading reports for the commit d458ccf to get more accurate results

@@            Coverage Diff             @@
##           master    #4902      +/-   ##
==========================================
+ Coverage   88.02%   88.21%   +0.18%     
==========================================
  Files         158      162       +4     
  Lines        9723    10314     +591     
  Branches     1433     1503      +70     
==========================================
+ Hits         8559     9098     +539     
- Misses        910      947      +37     
- Partials      254      269      +15     
Impacted Files Coverage Δ
scrapy/utils/log.py 89.24% <ø> (ø)
scrapy/pipelines/images.py 90.35% <80.00%> (-1.47%) ⬇️
scrapy/core/http2/protocol.py 83.41% <83.41%> (ø)
scrapy/core/downloader/contextfactory.py 87.03% <84.61%> (-2.97%) ⬇️
scrapy/core/http2/stream.py 91.37% <91.37%> (ø)
scrapy/core/http2/agent.py 96.38% <96.38%> (ø)
scrapy/core/downloader/handlers/http11.py 94.00% <100.00%> (+1.53%) ⬆️
scrapy/core/downloader/handlers/http2.py 100.00% <100.00%> (ø)
scrapy/downloadermiddlewares/retry.py 100.00% <100.00%> (ø)
scrapy/selector/unified.py 100.00% <100.00%> (ø)
... and 9 more

@Gallaecio (Member, Author) commented:

I’ve completed the pull request as originally intended.

@Gallaecio changed the title from "Implement retry request functions and mixin" to "Implement a retry request function" on Feb 25, 2021
Commit: Allow logger/stats customization in get_retry_request
@elacuesta added this to the 2.5 milestone on Mar 4, 2021

@pawelmhm (Contributor) commented Mar 5, 2021

Good idea! Having the option to retry in Scrapy is super useful.

            new_request = get_retry_request(response.request, reason='empty', spider=self)
            if new_request:
                yield new_request
            return

this is still a lot of code to write: 4 lines. How about simplifying it to some exception?

raise RetryRequest(message)

This is much simpler. Other benefits aside from simplicity: you can keep track of retries in stats and always log a proper message on retry, automatically from the middleware that handles it. We use something like this in our project and it works nicely.
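A rough sketch of how the exception idea could be wired up as a spider middleware on top of the function added in this pull request. The RetryRequest exception and RetryRequestMiddleware below are hypothetical names, not something that exists in Scrapy:

from scrapy.downloadermiddlewares.retry import get_retry_request


class RetryRequest(Exception):
    """Hypothetical exception a callback raises to ask for a retry."""

    def __init__(self, reason='unspecified'):
        super().__init__(reason)
        self.reason = reason


class RetryRequestMiddleware:
    """Hypothetical spider middleware that turns the exception into a retry."""

    def process_spider_exception(self, response, exception, spider):
        if not isinstance(exception, RetryRequest):
            return None  # let other middlewares handle the exception
        new_request = get_retry_request(
            response.request, reason=exception.reason, spider=spider)
        # get_retry_request takes care of stats and logging, including the
        # "gave up retrying" case, in which it returns None
        return [new_request] if new_request else []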

@Gallaecio (Member, Author) commented:

The main problem I see with raising an exception is that it prevents you from doing additional stuff in the callback, especially when the maximum number of retries is exceeded.

Imagine the request you are retrying is to fetch additional data from an item (passed in Request.meta), and that you want to yield the item without the additional data if the request to fetch that additional data reaches the maximum number of retries. I don’t see a clear way to do that with the raise approach.
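For example, with the function approach that fallback can live in the callback. A sketch of such a spider callback, assuming the item travels in Request.meta under an 'item' key and the extra data fills an 'extra' field:

from scrapy.downloadermiddlewares.retry import get_retry_request

def parse_extra_data(self, response):
    item = response.meta['item']
    if not response.text:
        new_request = get_retry_request(response.request, reason='empty', spider=self)
        if new_request:
            yield new_request
        else:
            # retries exhausted: yield the item without the additional data
            yield item
        return
    item['extra'] = response.text
    yield item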

@pawelmhm (Contributor) commented Mar 5, 2021

Imagine the request you are retrying is to fetch additional data from an item (passed in Request.meta), and that you want to yield the item without the additional data if the request to fetch that additional data reaches the maximum number of retries. I don’t see a clear way to do that with the raise approach

good point. You would probably have to add some errback argument to the retry exception, like

raise RetryException(errback=self.return_item)

@Gallaecio (Member, Author) commented:

I’m not completely convinced yet, but I’m starting to see quite a few benefits to the API you propose. I’m not completely sold on errback being better for something that will not happen asynchronously, but in addition to a shorter syntax I see other benefits:

  • Using a spider middleware (which is how the raise approach can be implemented) seems more Scrapyish.
  • We will not need to pass the response and the spider as parameters, process_spider_exception already gets those.
  • We can use the errback from the request instead of requiring it as a parameter.
  • We can probably keep a single, optional parameter, reason, and make any other customization available through settings and meta keys, or by making it easier to subclass the new retry spider middleware and the existing retry downloader middleware.

@pawelmhm (Contributor) commented Mar 5, 2021

We can use the errback from the request instead of requiring it as a parameter

yes, good idea!

If you'd like, I can create a PR reusing some code we have in another project, and we can continue the discussion there and decide which approach is better.

@Gallaecio (Member, Author) commented:

@pawelmhm After I add typing to this pull request, we will go forward with it, as the function can be easier for certain usage patterns (e.g. middlewares).

This does not mean a no to the exception approach. In fact, the exception spider middleware could use that function.

So it looks like Scrapy 2.5 will offer the function, but we may replace it with an exception in the next Scrapy version, following your work, and leave the function around only for middleware usage and corner cases.

@ejulio (Contributor) left a comment:

@Gallaecio, I get the idea and it's fine.
However, my takes are:

  1. If we are already considering a new API/change (exception-based), maybe we shouldn't move this forward. For this I can invoke the "only explicit way to do it" principle, and considering we usually take some time to deprecate a feature, we might need to live with two APIs for the same thing for a while.

  2. To some extent, I don't like the get_retry_request interface so much. It plays with the spider, settings and stats, so there's a lot of state going on. Not that it's a problem, but I wonder if there could be a better way. Probably the exception approach would handle this.

The code under discussion in the next inline comment (from scrapy/downloadermiddlewares/retry.py):

logger.error("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
             {'request': request, 'retries': retries, 'reason': reason},
             extra={'spider': spider})
max_retry_times = request.meta.get('max_retry_times', self.max_retry_times)

A Contributor commented on this code:

This is a weird change in the API.
From the middleware point of view, we allow max_retry_times in meta.
But from parse in the spider, we allow it via that function parameter.
So, who is responsible for setting the required number of retries: the function making the request or the function handling the response?

@Gallaecio (Member, Author) commented Mar 17, 2021

The algorithm of the function is: use meta, else settings, else the function parameter. So I first considered not passing anything from the middleware for this and letting the function handle it, since the function has access to both meta and settings and, at first sight, follows the same algorithm as the middleware.

However, then I thought of backward compatibility and inheritance. self.max_retry_times is not a private attribute of the middleware; it’s public. So there may be subclasses out there that override this attribute, for example setting a different hardcoded value or reading the value from a different source (a different setting name, multiple settings, a configuration file, etc.). In order to keep backward compatibility, I wrote this logic here and pass the value explicitly to the function.
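For instance, the kind of subclass this keeps working might look like the following, where MY_PROJECT_RETRY_TIMES is a made-up setting name used only for illustration:

from scrapy.downloadermiddlewares.retry import RetryMiddleware


class ProjectRetryMiddleware(RetryMiddleware):

    def __init__(self, settings):
        super().__init__(settings)
        # overrides the public max_retry_times attribute, reading it from a
        # project-specific setting instead of RETRY_TIMES
        self.max_retry_times = settings.getint('MY_PROJECT_RETRY_TIMES', 5)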

@Gallaecio (Member, Author) commented:

  1. If we are already considering a new API/change (exception-based), maybe we shouldn't move this forward. […] we usually take some time to deprecate a feature.

The idea is not to eventually deprecate the function in favor of the exception. The idea is to add the exception, recommend the exception (i.e. the documentation will use the exception whenever possible), but keep the function.

So, in 2.5 we would recommend the function to retry from a callback; in 2.6 or later we would change that recommendation to the exception approach, probably mentioning the function as an alternative where an exception is not desirable for whatever reason (mostly middlewares, but some people may prefer not to use errback to handle the exhaustion of retries).

I would understand if the purpose was to deprecate the function, but given it’s not, I don’t see a strong reason why we should postpone the function implementation until we also provide the exception approach.

  2. To some extent, I don't like the get_retry_request interface so much. It plays with the spider, settings and stats, so there's a lot of state going on. Not that it's a problem, but I wonder if there could be a better way. Probably the exception approach would handle this.

Yes, that’s the reason @pawelmhm’s approach convinced me as superior. But talking it over with @kmike, @wRAR and @elacuesta, they brought up the point that there may still be use cases for the function approach, so it would be better to provide both approaches even if we recommend the exception approach for most scenarios.

@Gallaecio closed this on Mar 19, 2021
@Gallaecio reopened this on Mar 19, 2021
@kmike merged commit d0e2348 into scrapy:master on Apr 1, 2021
@kmike (Member) commented Apr 1, 2021

Thanks @Gallaecio, @elacuesta, @pawelmhm, @ejulio and @noviluni!

Linked issue (may be closed by this pull request): Ability to retry a request from inside a spider callback