Description
There are situations where a website returns a 200 response but the content is not available, due to bans or temporary issues that can be fixed by retrying the request.
There should be an easier way to retry requests from inside spider callbacks, ideally reusing the code in the Retry downloader middleware.
I see two approaches for this.
- Introduce a new exception called `RetryRequest`, which can be raised inside a spider callback to indicate a retry. I personally prefer this, but its implementation is a little untidy due to this bug: process_spider_exception() not invoked for generators #220

  ```python
  from scrapy.exceptions import RetryRequest

  def parse(self, response):
      if response.xpath('//title[text()="Content not found"]'):
          raise RetryRequest('Missing content')
  ```

- Introduce a new class `RetryRequest` that wraps a request which needs to be retried. A `RetryRequest` can be yielded from a spider callback to indicate a retry:

  ```python
  from scrapy.http import RetryRequest

  def parse(self, response):
      if response.xpath('//title[text()="Content not found"]'):
          yield RetryRequest(response.request, reason='Missing content')
  ```
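For context, the bookkeeping that the Retry downloader middleware already performs, and that either approach would want to reuse, boils down to a retry counter stored in the request's `meta` plus a cap. Below is a minimal sketch of that logic; the frozen dataclass is a simplified stand-in for `scrapy.http.Request` (an assumption for illustration, so the snippet is self-contained), while the `retry_times` meta key, priority adjustment, and `dont_filter` behavior follow what the middleware does:

```python
from dataclasses import dataclass, field, replace

# Simplified stand-in for scrapy.http.Request: only the fields the
# retry bookkeeping touches. A real implementation would reuse the
# existing Request.replace() and the Retry middleware's settings.
@dataclass(frozen=True)
class Request:
    url: str
    meta: dict = field(default_factory=dict)
    priority: int = 0
    dont_filter: bool = False

def build_retry_request(request, reason, max_retry_times=2, priority_adjust=-1):
    """Return a retried copy of `request`, or None once retries are exhausted.

    Mirrors the Retry downloader middleware's bookkeeping: the attempt
    count lives in meta['retry_times'], retried requests get a lower
    priority, and dont_filter=True skips the duplicate filter.
    """
    retries = request.meta.get('retry_times', 0) + 1
    if retries > max_retry_times:
        return None  # give up; the middleware would log `reason` here
    return replace(
        request,
        meta={**request.meta, 'retry_times': retries},
        priority=request.priority + priority_adjust,
        dont_filter=True,  # a retried request must not be deduplicated
    )

req = Request('https://example.com/item')
first = build_retry_request(req, reason='Missing content')
second = build_retry_request(first, reason='Missing content')
third = build_retry_request(second, reason='Missing content')
print(first.meta['retry_times'], second.meta['retry_times'], third)  # 1 2 None
```

Whichever API is chosen, a spider middleware (or the scheduler) would run this logic when it sees the raised exception or the yielded wrapper, and either reschedule the returned copy or drop the request once `None` comes back.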
I will be sending two PRs, one for each approach. Happy to hear about any other alternatives too.