Ability to retry a request from inside a spider callback #3590

Closed
kasun opened this issue Jan 19, 2019 · 21 comments · Fixed by #4902

Comments

@kasun (Contributor) commented Jan 19, 2019

There are situations where websites return 200 responses but the content is not available due to bans or temporary issues, which can be fixed by retrying the request.

There should be an easier way to retry requests inside spider callbacks, which should ideally reuse the code in Retry downloader middleware.

I see two approaches for this.

  1. Introduce a new exception, RetryRequest, which can be raised inside a spider callback to indicate a retry. I personally prefer this, but the implementation is a little untidy due to this bug: process_spider_exception() not invoked for generators #220

    from scrapy.exceptions import RetryRequest
    
    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            raise RetryRequest('Missing content')
    
  2. Introduce a new class, RetryRequest, which wraps a request that needs to be retried. A RetryRequest can be yielded from a spider callback to indicate a retry:

    from scrapy.http import RetryRequest
    
    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            yield RetryRequest(response.request, reason='Missing content')
    

I will be sending two PRs, one for each approach. Happy to hear about any other alternatives too.

@Gallaecio (Member):

What aspects of the current approach don’t you like? It can be done with a one-liner (yield response.request.replace(dont_filter=True)).
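
For reference, inside a callback that one-liner looks like this (a minimal sketch; dont_filter=True is what keeps the duplicate filter from dropping the repeated URL):

def parse(self, response):
    if response.xpath('//title[text()="Content not found"]'):
        # Re-issue the same request, bypassing the duplicate filter
        yield response.request.replace(dont_filter=True)
        return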

@kasun (Contributor, Author) commented Jan 21, 2019

@Gallaecio I miss everything the retry middleware currently does.

You can't do this infinitely; you need to count the retries and give up after a certain number of attempts.
You also need stats and log messages.

IMO there should be an easier way to do the above than re-implementing the same code in every project/spider.
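
For illustration, this is roughly what each project currently has to re-implement by hand (a sketch only; the retry_count meta key and the hard-coded limit are made up here, while the built-in retry middleware already handles counting, settings, stats and logging):

def parse(self, response):
    if response.xpath('//title[text()="Content not found"]'):
        retries = response.meta.get('retry_count', 0)
        if retries < 3:  # hard-coded limit instead of the RETRY_TIMES setting
            self.logger.info('Retrying %s (attempt %d): missing content',
                             response.url, retries + 1)
            yield response.request.replace(
                dont_filter=True,
                meta={**response.meta, 'retry_count': retries + 1},
            )
        else:
            self.crawler.stats.inc_value('custom_retry/max_reached')
            self.logger.error('Gave up retrying %s', response.url)
        return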

@ejulio (Contributor) commented Feb 15, 2019

I think the idea is nice, though you should focus on only one approach.
IMHO, number 1 is better because it makes more sense to retry when a specific exception happens.
I don't think a special kind of Request would be the best approach here.

Also, I think we should focus on only one of them, because "There should be one-- and preferably only one --obvious way to do it."

@apalala commented Feb 15, 2019

I frequently, comfortably, and successfully use the raise RetryRequest("a reason") approach. It counts and limits retries, logs the reason, and, in general, does what it should without much work from the developer. As the OP mentioned, corrupted content over 200 responses is frequent enough.

@stav (Contributor) commented Feb 15, 2019

What about attaching the middleware routine to the response:

import scrapy
import scrapy.downloadermiddlewares.retry


class RetryMiddleware(scrapy.downloadermiddlewares.retry.RetryMiddleware):

    def process_response(self, request, response, spider):
        """Attach a .retry() method to every response"""
        response = super().process_response(request, response, spider)

        def retry(**kwargs):
            reason = kwargs.pop('reason', 'response_retry_called')
            retryreq = request.replace(**kwargs)
            return self._retry(retryreq, reason, spider)

        if isinstance(response, scrapy.http.Response):
            response.retry = retry

        return response

Then it could be used like:

def parse(self, response):
    if response.xpath('//title[text()="Content not found"]'):
        yield response.retry(reason='Missing content')

@ejulio (Contributor) commented Feb 15, 2019

@stav, from the API user's point of view, I don't know if response.retry conveys clearly enough that we are actually retrying the request.

@stav (Contributor) commented Feb 15, 2019

What about:

        yield response.request.retry(reason='Missing content')

@apalala commented Feb 15, 2019

I like yield response.request.retry(reason='Missing content'), except it will often require a return afterwards.

At any rate, the solution should accept the same parameters as request.replace(), do whatever is necessary so the new request is not detected as a duplicate, and keep a retry count and limit.

It could be both, with raise RetryRequest() implemented in terms of Request.retry().
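
A rough sketch of how the two could layer (RetryRequest and Request.retry() are hypothetical names from this discussion, not existing Scrapy API): the exception would only carry the reason, and a spider middleware would translate it into the retried request:

# Hypothetical sketch; neither RetryRequest nor Request.retry() exists in Scrapy.
class RetryRequest(Exception):
    def __init__(self, reason='unspecified'):
        super().__init__(reason)
        self.reason = reason


class RetryRequestMiddleware:
    """Spider middleware that turns the exception into a retried request."""

    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, RetryRequest):
            # Delegate to the (also hypothetical) Request.retry() helper
            return [response.request.retry(reason=exception.reason)]
        return None  # let other exceptions propagate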

@stav (Contributor) commented Feb 15, 2019

it will often require a return afterwards

We could also just have:

return [response.request.retry(reason='Missing content')]

It could be both, with raise RetryRequest()

What benefit does allowing parse to raise an exception give us? I'm always confused about whose responsibility it is to catch exceptions, so this is probably just my ignorance 😕

@apalala commented Feb 16, 2019

What benefit does allowing parse to raise an exception give us?

Good question!

While implementing session pooling I wrote a session-aware retry_request(), and I never missed having an exception for that functionality.

@GeorgeA92 (Contributor):

We can call the _retry method of the Retry middleware from spider code:

class Myspider(scrapy.Spider):
    # ...

    def start_requests(self):
        downloader_middlewares = self.crawler.engine.downloader.middleware.middlewares
        self.RetryMiddleware = [middleware for middleware in downloader_middlewares
                                if "RetryMiddleware" in str(type(middleware))][0]
        # ...

    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            yield self.RetryMiddleware._retry(response.request, "some reason", self)
            return

@ejulio (Contributor) commented Feb 21, 2019

@GeorgeA92, I don't like this approach :(
My problem is that we are calling a private method from the outside.

@GeorgeA92 (Contributor):

@ejulio I agree.
There are two non-private methods from which the _retry method is called.

process_response calls _retry if the response code is listed in retry_http_codes (from the RETRY_HTTP_CODES setting).
This is not our case, because a 200 response is not in retry_http_codes, and the response needs to pass through all middlewares (including RetryMiddleware) in order to reach the spider callback (where we want to retry it).

process_exception calls _retry if the exception argument is listed in EXCEPTIONS_TO_RETRY, which is stored as a tuple:

EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
                       ConnectionRefusedError, ConnectionDone, ConnectError,
                       ConnectionLost, TCPTimedOutError, ResponseFailed,
                       IOError, TunnelError)

We can use the non-private process_exception method instead of _retry in spider code, after the following steps:
1. Define a new exception.
2. Add our custom exception to EXCEPTIONS_TO_RETRY on the RetryMiddleware object.
3. Call the process_exception method with the new exception as an argument.

The code is slightly different:

class ContentNotFoundException(Exception):
    """
    ContentNotFound
    """


class Myspider(scrapy.Spider):
    # ...

    def start_requests(self):
        downloader_middlewares = self.crawler.engine.downloader.middleware.middlewares
        self.RetryMiddleware = [middleware for middleware in downloader_middlewares
                                if "RetryMiddleware" in str(type(middleware))][0]
        self.RetryMiddleware.EXCEPTIONS_TO_RETRY = tuple(
            list(self.RetryMiddleware.EXCEPTIONS_TO_RETRY) + [ContentNotFoundException])
        # ...

    def parse(self, response):
        if response.xpath('//title[text()="Content not found"]'):
            yield self.RetryMiddleware.process_exception(response.request, ContentNotFoundException(), self)
            return

@kasun (Contributor, Author) commented Mar 15, 2019

After going through the different ideas, I opted for Response.retry_request() over raising a RetryException() to indicate a retry.

This new method just returns a request that is marked to be retried; the actual modification needed to retry the request is handled in a new spider middleware. The reason is that you need access to the spider settings to determine whether the request has reached the maximum number of retry attempts, etc.

Please have a look at #3685.
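
For readers who want the general shape before opening the PR, the spider-middleware side might look roughly like this (a hypothetical sketch only; the retry_marker meta key and the class name are made up, and the real implementation in #3685 may differ):

from scrapy import Request

# Hypothetical sketch; see #3685 for the actual implementation.
class RetryRequestsMiddleware:

    def __init__(self, settings):
        self.max_retry_times = settings.getint('RETRY_TIMES')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, Request) and obj.meta.get('retry_marker'):
                retries = obj.meta.get('retry_times', 0) + 1
                if retries > self.max_retry_times:
                    spider.logger.debug('Gave up retrying %s', obj.url)
                    continue  # drop the request once the limit is reached
                obj.meta['retry_times'] = retries
            yield obj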

@yuanfeiz:

Any progress on this?

@ShanikaEdiriweera:

Any update on this?

I also encountered the issue mentioned by @kasun.

@usfstdd commented May 5, 2020

Use try/except to handle retrying the request, and to count the number of retries set a variable in the request meta and update it in the except block.

Is that wrong? @kasun

@fellipegpbotelho:

Any update on this?

@psdon commented Feb 10, 2021

Any updates on this?

@Gallaecio (Member):

The latest is the API design discussion in #4902.

@DeflateAwning (Contributor) commented Apr 25, 2023

For anyone stumbling on this thread in the future:

As of Scrapy 2.5.0, the preferred solution is get_retry_request().

From the docs:

from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
    if not response.text:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='empty',
        )
        return new_request_or_none
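
Note that get_retry_request() returns None once the maximum number of retries has been exhausted, so returning its result directly, as in the snippet above, simply drops the page in that case.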
