Retry request from spider callback #3685
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3685 +/- ##
==========================================
+ Coverage 84.52% 84.67% +0.14%
==========================================
Files 167 169 +2
Lines 9410 9566 +156
Branches 1397 1425 +28
==========================================
+ Hits 7954 8100 +146
- Misses 1199 1206 +7
- Partials 257 260 +3
Hey @kasun.
Great work here 👏
I also left some comments on things I think can be improved, but feel free to give your opinion as well.
for x in result:
    if isinstance(x, Request):
        if x.is_marked_for_retry():
            if not self.enabled:
This could be the first if statement of the method.
Maybe it would be even better to raise NotConfigured in __init__ if RETRY_ENABLED is False.
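The reviewer's suggestion can be sketched roughly as follows. NotConfigured is stubbed locally here (in Scrapy it lives in scrapy.exceptions), and the settings object is a plain dict stand-in:

```python
class NotConfigured(Exception):
    """Stand-in for scrapy.exceptions.NotConfigured."""

class RetryRequestMiddleware:
    def __init__(self, settings):
        # Raising NotConfigured makes the framework skip this middleware
        # entirely, so process_spider_output no longer needs an
        # `if not self.enabled` guard on every request.
        if not settings.get('RETRY_ENABLED', True):
            raise NotConfigured('retrying is disabled via RETRY_ENABLED')
        self.max_retry_times = settings.get('RETRY_TIMES', 2)

try:
    RetryRequestMiddleware({'RETRY_ENABLED': False})
    raised = False
except NotConfigured:
    raised = True
```

This trades the per-request check for a one-time decision at construction, which is the usual pattern for optional Scrapy middlewares.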
This was mostly done for the debug message, just in case a user disables request retrying globally and then tries to retry a request.
self.max_retry_times = spider.settings.getint('RETRY_TIMES')
self.priority_adjust = spider.settings.getint('RETRY_PRIORITY_ADJUST')

def is_exhausted(self):
Maybe a property.
self.meta.pop('_retry_requested', None)
self.meta.pop('_retry_request_reason', None)

def is_marked_for_retry(self):
Maybe a property here and in the following ones.
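A minimal sketch of the property suggestion, using a toy class (the attribute names mirror the PR, but this is not the PR's code). Argument-free, side-effect-free checks read naturally as properties:

```python
class RetryState:
    def __init__(self, retry_times, max_retry_times):
        self.retry_times = retry_times
        self.max_retry_times = max_retry_times

    @property
    def is_exhausted(self):
        # As a property the call site becomes `if state.is_exhausted:`
        # rather than `if state.is_exhausted():`.
        return self.retry_times >= self.max_retry_times
```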
@@ -96,3 +96,31 @@ def replace(self, *args, **kwargs):
            kwargs.setdefault(x, getattr(self, x))
        cls = kwargs.pop('cls', self.__class__)
        return cls(*args, **kwargs)

    def mark_for_retry(self, reason=None, **kwargs):
I think that return req.retry() reads better than req.mark_for_retry().
Besides that, "mark", for me, reads as something that changes the instance rather than creating a new one.
Besides that, I don't know if this behavior should be here.
Should a request know about the configuration other modules need to retry it?
Or should the retrying module set this configuration on the request?
I agree with req.retry(), which is similar to req.replace().
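The naming the reviewers converge on can be sketched like this, with a toy Request class (not Scrapy's): retry(), like replace(), returns a new instance instead of mutating the original. The meta keys match the ones quoted in the PR:

```python
class Request:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = dict(meta or {})

    def replace(self, **kwargs):
        kwargs.setdefault('url', self.url)
        kwargs.setdefault('meta', dict(self.meta))
        return self.__class__(**kwargs)

    def retry(self, reason=None):
        # Built on replace(), so retry() also returns a fresh request,
        # marked so a downstream middleware can pick it up.
        meta = dict(self.meta)
        meta['_retry_requested'] = True
        meta['_retry_request_reason'] = reason
        return self.replace(meta=meta)

req = Request('http://example.com')
new_req = req.retry(reason='empty response')
```

The original request is left untouched, which matches the replace() semantics the reviewer points at.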
:param reason: Reason for retrying. Optional, defaults to None.
:type reason: str
"""
request = self.request
Not sure if the variable is required.
    retry_handler.record_retry_failure(reason)
else:
    new_req = retry_handler.make_retry_request()
    retry_handler.record_retry(new_req, reason)
I think this should be called inside make_retry_request.
Otherwise, I could create a retry without recording it.
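A sketch of that point: if make_retry_request() records the retry itself, a caller cannot create a retry that goes unrecorded. Requests and stats are plain dicts here instead of Scrapy's Request and stats collector:

```python
class RetryHandler:
    def __init__(self, request, stats):
        self.request = request
        self.stats = stats

    def make_retry_request(self, reason='unspecified'):
        retries = self.request.get('retry_times', 0) + 1
        new_req = dict(self.request, retry_times=retries)
        # Recording happens here, not in the caller, so creating a
        # retry and recording it cannot be separated.
        key = 'retry/reason_count/%s' % reason
        self.stats[key] = self.stats.get(key, 0) + 1
        return new_req

stats = {}
handler = RetryHandler({'url': 'http://example.com'}, stats)
req = handler.make_retry_request(reason='timeout')
```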
reason = x.get_retry_reason() or 'Retry Requested'
retry_handler = RetryHandler(spider, x)
if retry_handler.is_exhausted():
Maybe make_retry_request should return None if it is exhausted.
Then you can remove this if statement from here.
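Folding the exhaustion check into the factory, as suggested, might look like this (dict stand-ins for requests, with an assumed max_retry_times parameter):

```python
def make_retry_request(request, max_retry_times=2):
    retries = request.get('retry_times', 0) + 1
    if retries > max_retry_times:
        # Exhausted: signal with None so the caller records a
        # failure instead of branching on a separate is_exhausted().
        return None
    return dict(request, retry_times=retries)

fresh = make_retry_request({'url': 'http://example.com'})
spent = make_retry_request({'url': 'http://example.com', 'retry_times': 2})
```

The caller then collapses to a single `if new_req is None:` branch.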
class RetryHandler(object):
    """utilities to handle retries"""

    def __init__(self, spider, request):
For me, it seems that the most important dependency is the request.
So I would change the parameter order to put the request first.
def __init__(self, spider, request):
    self.spider = spider
    self.request = request
    self.max_retry_times = spider.settings.getint('RETRY_TIMES')
Maybe set default values, or raise a specific exception if these settings aren't available.
It should be possible to override max_retry_times per request.
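A per-request override could be sketched as below, with the request's meta taking precedence over the global setting (this mirrors how Scrapy's built-in retry middleware treats a max_retry_times meta key; dicts stand in for meta and settings):

```python
def get_max_retry_times(request_meta, settings):
    # Per-request value wins; fall back to the project-wide setting.
    return request_meta.get('max_retry_times',
                            settings.get('RETRY_TIMES', 2))

global_settings = {'RETRY_TIMES': 3}
```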
from scrapy.utils.python import global_object_name


class RetryHandler(object):
This handler does not require any state, so it could be a set of functions.
At least, it could be instantiated only once to store the settings, with the requests passed as arguments to its methods.
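The second alternative could look roughly like this: the handler is built once from settings, and each method takes the request as an argument (dict stand-ins for settings and requests):

```python
class RetryHandler:
    def __init__(self, settings):
        # Constructed once; only configuration is kept as state.
        self.max_retry_times = settings.get('RETRY_TIMES', 2)

    def is_exhausted(self, request):
        # The request is an argument, not instance state, so one
        # handler can serve every request the spider yields.
        return request.get('retry_times', 0) >= self.max_retry_times

handler = RetryHandler({'RETRY_TIMES': 2})
```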
@@ -133,3 +133,23 @@ def follow(self, url, callback=None, method='GET', headers=None, body=None,
                       priority=priority,
                       dont_filter=dont_filter,
                       errback=errback)

    def retry_request(self, reason=None, **kwargs):
Here too, there's Response.follow(), which returns a Request. Why not Response.retry()?
Any update?
Any updates?
New approach for #3590.
Resolves #3590.