
Retry request from spider callback #3685

Closed
kasun wants to merge 2 commits

Conversation

@kasun (Contributor) commented Mar 15, 2019

New approach for #3590

This:

  1. adds a retry_request() method to the Response class that returns a request instance marked to be retried, and
  2. adds a spider middleware (RetryMiddleware) that checks whether such a request can be retried, makes the necessary modifications, and retries the request.

Resolves #3590
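
For illustration, a minimal sketch of how the proposed API could be used from a spider callback (the spider and selectors are made up for the example; retry_request() is the method this PR adds):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        if not response.css('div.product'):
            # The page came back without the expected content, so yield
            # a request marked for retry; the proposed RetryMiddleware
            # picks it up and reschedules it.
            yield response.retry_request(reason='product list missing')
            return
        for href in response.css('div.product a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        yield {'url': response.url}
```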

codecov bot commented Mar 15, 2019

Codecov Report

Merging #3685 into master will increase coverage by 0.14%.
The diff coverage is 97.67%.

```
@@            Coverage Diff             @@
##           master    #3685      +/-   ##
==========================================
+ Coverage   84.52%   84.67%   +0.14%
==========================================
  Files         167      169       +2
  Lines        9410     9566     +156
  Branches     1397     1425      +28
==========================================
+ Hits         7954     8100     +146
- Misses       1199     1206       +7
- Partials      257      260       +3
```
| Impacted Files | Coverage Δ |
| --- | --- |
| scrapy/settings/default_settings.py | 98.63% <ø> (ø) |
| scrapy/http/response/__init__.py | 90.90% <60.00%> (-2.54%) ⬇️ |
| scrapy/downloadermiddlewares/retry.py | 94.59% <100.00%> (-1.24%) ⬇️ |
| scrapy/http/request/__init__.py | 100.00% <100.00%> (ø) |
| scrapy/spidermiddlewares/retry.py | 100.00% <100.00%> (ø) |
| scrapy/utils/retry.py | 100.00% <100.00%> (ø) |
| scrapy/crawler.py | 91.60% <0.00%> (-0.48%) ⬇️ |
| scrapy/linkextractors/sgml.py | 96.80% <0.00%> (ø) |
| scrapy/linkextractors/__init__.py | 100.00% <0.00%> (ø) |

... and 3 more


@ejulio (Contributor) left a comment:


hey @kasun.
Great work here 👏
I also left some comments on things I think can be improved, but feel free to give your opinion as well.

```python
for x in result:
    if isinstance(x, Request):
        if x.is_marked_for_retry():
            if not self.enabled:
```

Contributor:

This could be the first if statement of the method.
Maybe it would be even better to raise NotConfigured in __init__ if RETRY_ENABLED is False.
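
For reference, a minimal sketch of that suggestion, assuming the middleware reads its settings through the usual from_crawler pattern (only RetryMiddleware and the setting names come from this PR):

```python
from scrapy.exceptions import NotConfigured


class RetryMiddleware(object):
    def __init__(self, settings):
        # Raising NotConfigured removes the middleware from the chain
        # entirely when retries are disabled globally.
        if not settings.getbool('RETRY_ENABLED'):
            raise NotConfigured
        self.max_retry_times = settings.getint('RETRY_TIMES')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)
```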

Author (@kasun):

This was mostly done for the debug message, in case a user disables request retrying globally and then tries to retry a request.

```python
        self.max_retry_times = spider.settings.getint('RETRY_TIMES')
        self.priority_adjust = spider.settings.getint('RETRY_PRIORITY_ADJUST')

    def is_exhausted(self):
```

Contributor:

Maybe a property.

```python
        self.meta.pop('_retry_requested', None)
        self.meta.pop('_retry_request_reason', None)

    def is_marked_for_retry(self):
```

Contributor:

Maybe a property here and in the following ones.
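
A sketch of the property version, using the _retry_requested meta key from this PR's diff (the surrounding Request class is simplified here):

```python
class Request(object):
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}

    @property
    def is_marked_for_retry(self):
        # Read as an attribute: `if request.is_marked_for_retry:`
        return self.meta.get('_retry_requested', False)
```

The same shape would apply to is_exhausted and the other checks.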

```diff
@@ -96,3 +96,31 @@ def replace(self, *args, **kwargs):
             kwargs.setdefault(x, getattr(self, x))
         cls = kwargs.pop('cls', self.__class__)
         return cls(*args, **kwargs)
+
+    def mark_for_retry(self, reason=None, **kwargs):
```

Contributor:

I think return req.retry() reads better than req.mark_for_retry().
Besides that, "mark" reads to me as something that changes the instance rather than creating a new one.

Contributor:

Besides that, I don't know if this behavior should be here.
Should a request know about the configuration other modules need to retry it?
Or should the retrying module set this configuration on the request?

Comment:

I agree with req.retry(), which is similar to req.replace().

```python
        :param reason: Reason for retrying. Optional, defaults to None.
        :type reason: str
        """
        request = self.request
```

Contributor:

Not sure if the variable is required.

```python
            retry_handler.record_retry_failure(reason)
        else:
            new_req = retry_handler.make_retry_request()
            retry_handler.record_retry(new_req, reason)
```

Contributor:

I think this should be called inside make_retry_request.
Otherwise, I could create a retry without recording it.
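
A sketch of that suggestion; _build_retry_request() is a hypothetical helper standing in for the request-construction logic:

```python
def make_retry_request(self, reason):
    # Build and record in one step, so callers cannot create a retry
    # request without it being recorded.
    new_req = self._build_retry_request()  # hypothetical helper
    self.record_retry(new_req, reason)
    return new_req
```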


```python
        reason = x.get_retry_reason() or 'Retry Requested'
        retry_handler = RetryHandler(spider, x)
        if retry_handler.is_exhausted():
```

Contributor:

Maybe make_retry_request should return None if it is exhausted.
Then you can remove this if statement from here.
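
Under that suggestion, the call site could collapse to something like this sketch (names follow the snippet above):

```python
new_req = retry_handler.make_retry_request()  # None when retries are exhausted
if new_req is None:
    retry_handler.record_retry_failure(reason)
else:
    retry_handler.record_retry(new_req, reason)
    yield new_req
```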

```python
class RetryHandler(object):
    """utilities to handle retries"""

    def __init__(self, spider, request):
```

Contributor:

For me, it seems that the most important dependency is the request.
So I would change the parameter order to put the request first.

```python
    def __init__(self, spider, request):
        self.spider = spider
        self.request = request
        self.max_retry_times = spider.settings.getint('RETRY_TIMES')
```

Contributor:

Maybe setting default values, or raising a specific exception if these settings aren't available.

Comment:

It should be possible to override max_retry_times per request.
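
A sketch of that per-request override; the max_retry_times meta key is an assumption for illustration, not part of this PR:

```python
    def __init__(self, spider, request):
        self.spider = spider
        self.request = request
        # Let request.meta override the global RETRY_TIMES setting
        # ('max_retry_times' is an assumed meta key).
        self.max_retry_times = request.meta.get(
            'max_retry_times',
            spider.settings.getint('RETRY_TIMES'),
        )
```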

```python
from scrapy.utils.python import global_object_name


class RetryHandler(object):
```

Contributor:

This handler does not require any state, so it could be a set of functions.
At least, it could be instantiated only once to store the settings, and then the requests could be passed as arguments to its methods.
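
A sketch of the second option: instantiate the handler once with the settings and pass each request as an argument (the retry_times meta key is assumed bookkeeping, not from this PR's diff):

```python
class RetryHandler(object):
    def __init__(self, spider):
        # Settings are read once, when the handler is created.
        self.spider = spider
        self.max_retry_times = spider.settings.getint('RETRY_TIMES')
        self.priority_adjust = spider.settings.getint('RETRY_PRIORITY_ADJUST')

    def is_exhausted(self, request):
        # The request is a method argument instead of instance state.
        return request.meta.get('retry_times', 0) >= self.max_retry_times
```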


```diff
@@ -133,3 +133,23 @@ def follow(self, url, callback=None, method='GET', headers=None, body=None,
                        priority=priority,
                        dont_filter=dont_filter,
                        errback=errback)
+
+    def retry_request(self, reason=None, **kwargs):
```

Comment:

Here too, there's Response.follow() which returns a Request. Why not Response.retry()?


@ShanikaEdiriweera

Any update?

@fellipegpbotelho

Any updates?

@kmike closed this in #4902 on Apr 1, 2021

Successfully merging this pull request may close these issues.

Ability to retry a request from inside a spider callback