Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Retry stats #2543

Merged
merged 2 commits into from Feb 27, 2017
Merged

[MRG+1] Retry stats #2543

merged 2 commits into from Feb 27, 2017

Conversation

@kmike
Copy link
Member

@kmike kmike commented Feb 7, 2017

What do you think about providing retry counts in scrapy stats?

This PR is not backwards compatible for anyone who used RetryMiddleware.__init__ directly.

@@ -70,8 +72,10 @@ def _retry(self, request, reason, spider):
retryreq.meta['retry_times'] = retries
retryreq.dont_filter = True
retryreq.priority = request.priority + self.priority_adjust
self.stats.inc_value('retry/count')

This comment has been minimized.

@redapple

redapple Feb 7, 2017
Contributor

what about including the retry reason in the stats? a bit like DownloaderStats for exceptions

This comment has been minimized.

@kmike

kmike Feb 7, 2017
Author Member

I thought that it could be redundant - you'll get the same exception names as in downloader, e.g. 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 155 and 'retry/twisted.internet.error.ConnectionRefusedError': 155; downloader/response_status_count/ will be also duplicated - it could be 20x lines in stats with very little new information.

This comment has been minimized.

@kmike

kmike Feb 7, 2017
Author Member

I can see though how it makes it more obvious for the user what caused these retries

This comment has been minimized.

@redapple

redapple Feb 7, 2017
Contributor

It would tell you explicitly that these exceptions are retried, while downloader/exception "only" tells you they happen. Then you may want to disable retries or retry less.

This comment has been minimized.

@kmike

kmike Feb 7, 2017
Author Member

RIght, but all these exceptions are listed above, in EXCEPTIONS_TO_RETRY list. We're not documenting this list though, and names could be different for subclasses.

This comment has been minimized.

@kmike

kmike Feb 7, 2017
Author Member

Yeah, this is true. My point is that you can spot it without adding this information to stats: status codes and exceptions to retry are static, they don't depend on website - you can check settings and EXCEPTIONS_TO_RETRY list. You check retry/count, notice that the value is high (=> there is some issue with crawling this website), and then investigate it further: check response status codes and downloader exceptions. So this PR does this minimal change which allows to spot and debug the problem.

I'm not opposed to adding more information to stats, it is just a bit more controversial IMHO (#2173 becomes more verbose). But more stats could make it easier to debug retry issues, that's true.

This comment has been minimized.

@redapple

redapple Feb 7, 2017
Contributor

I don't mind having verbose stats if it's to debug problems. Having to look at the code for EXCEPTIONS_TO_RETRY and default RETRY_HTTP_CODES (and settings to double check if it's not the default) feels not-straightforward.
I'm not sure I would spot retry/count more than high downloader/exception_type_count values.

This comment has been minimized.

@kmike

kmike Feb 7, 2017
Author Member

Do you want per-reason counts for retry/stopped as well?

This comment has been minimized.

@redapple

redapple Feb 7, 2017
Contributor

Do you want per-reason counts for retry/stopped as well?

I don't think so, no

This comment has been minimized.

@kmike

kmike Feb 7, 2017
Author Member

@redapple done!

return retryreq
else:
self.stats.inc_value('retry/stopped')

This comment has been minimized.

@redapple

redapple Feb 7, 2017
Contributor

I have doubts on using "stopped" here, but I can't find a satisfying alternative (retry/givingup, retry/max_times_reached...)

This comment has been minimized.

@kmike

kmike Feb 7, 2017
Author Member

Yeah, a better name is welcome.

This comment has been minimized.

@redapple

redapple Feb 8, 2017
Contributor

what about retry/max_reached?

@codecov-io
Copy link

@codecov-io codecov-io commented Feb 7, 2017

Codecov Report

Merging #2543 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2543      +/-   ##
==========================================
+ Coverage   83.59%   83.61%   +0.02%     
==========================================
  Files         161      161              
  Lines        8812     8824      +12     
  Branches     1296     1297       +1     
==========================================
+ Hits         7366     7378      +12     
  Misses       1200     1200              
  Partials      246      246
Impacted Files Coverage Δ
scrapy/utils/misc.py 91.66% <ø> (ø)
scrapy/downloadermiddlewares/retry.py 95.65% <100%> (+0.91%)
scrapy/utils/python.py 85.62% <100%> (+0.17%)
scrapy/downloadermiddlewares/stats.py 92.3% <100%> (+0.3%)
scrapy/utils/response.py 85.41% <100%> (+0.31%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update afac3fd...39df675. Read the comment docs.

@kmike
Copy link
Member Author

@kmike kmike commented Feb 8, 2017

@redapple I tried running scrapy with this patch on a crawl, and having per-error stats look useful indeed.

@redapple
Copy link
Contributor

@redapple redapple commented Feb 8, 2017

Cool. Do you have sample logs to share showing how it looks?

@kmike
Copy link
Member Author

@kmike kmike commented Feb 8, 2017

In my case the log is a bit unusual because most times exceptions are handled by another middleware before they get to standard retry middleware, so there is not a lot of duplication:

 'downloader/exception_count': 49828,
 'downloader/exception_type_count/builtins.UnicodeEncodeError': 31,
 'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 2,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 3370,
 'downloader/exception_type_count/twisted.internet.error.NoRouteError': 248,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 578,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 35726,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 296,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 9577,
 'downloader/request_bytes': 215175880,
 'downloader/request_count': 530706,
 'downloader/request_method_count/GET': 530706,
 'downloader/response_bytes': 2823160968,
 'downloader/response_count': 480778,
 'downloader/response_status_count/200': 462638,
 'downloader/response_status_count/301': 12610,
 'downloader/response_status_count/302': 25,
 'downloader/response_status_count/307': 146,
 'downloader/response_status_count/401': 6,
 'downloader/response_status_count/403': 1462,
 'downloader/response_status_count/404': 513,
 'downloader/response_status_count/500': 835,
 'downloader/response_status_count/502': 339,
 'downloader/response_status_count/503': 1842,
 'downloader/response_status_count/504': 324,
 'downloader/response_status_count/520': 30,
 'downloader/response_status_count/525': 8,
 'dupefilter/filtered': 174522,
 'dupefilter2/seen': 1150356,
 'item_scraped_count': 462637,
 'links/ad': 48946,
 'links/page': 429,
 'log_count/INFO': 9176,
 'pages/ad': 462059,
 'pages/category': 578,
 'request_depth_max': 9,
 'response_received_count': 462637,
 'retry/count': 5,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 1,
 'retry/reason_count/twisted.internet.error.TimeoutError': 2,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 2,
 'scheduler/dequeued': 530706,
 'scheduler/dequeued/disk': 530706,
 'scheduler/enqueued': 112089,
 'scheduler/enqueued/disk': 112089,
@kmike kmike force-pushed the retry-stats branch from d74c7e0 to 11c0fd9 Feb 8, 2017
@@ -101,7 +101,9 @@ def test_https_noconnect_auth_error(self):
self._assert_got_response_code(407, l)

def _assert_got_response_code(self, code, log):
print(log)

This comment has been minimized.

@redapple

redapple Feb 8, 2017
Contributor

Is this leftover from debugging?

This comment has been minimized.

@kmike

kmike Feb 8, 2017
Author Member

Not exactly a leftover; stdout is hidden by default, but pytest shows it if tests fail. So by printing log it becomes easier to debug the issue in case of failures.

@redapple redapple added this to the v1.4 milestone Feb 8, 2017
@redapple redapple changed the title retry stats [MRG+1] Retry stats Feb 8, 2017
@kmike kmike force-pushed the retry-stats branch from 11c0fd9 to e285b1d Feb 14, 2017
@kmike
Copy link
Member Author

@kmike kmike commented Feb 14, 2017

I've searched our internal codebase and there are several examples of RetryMiddleware subclasses with __init__ method overridden.

@kmike
Copy link
Member Author

@kmike kmike commented Feb 27, 2017

Merging it, given that #2566 implementation is the same and I had @redapple's +1 for the previous backwards incompatible version.

@kmike kmike merged commit c72ba07 into master Feb 27, 2017
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@kmike kmike deleted the retry-stats branch Feb 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants