Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Retry stats #2543

Merged
merged 2 commits into from Feb 27, 2017
Merged

[MRG+1] Retry stats #2543

merged 2 commits into from Feb 27, 2017

Conversation

kmike
Copy link
Member

@kmike kmike commented Feb 7, 2017

What do you think about providing retry counts in scrapy stats?

This PR is not backwards compatible for anyone who used RetryMiddleware.__init__ directly.

@@ -70,8 +72,10 @@ def _retry(self, request, reason, spider):
retryreq.meta['retry_times'] = retries
retryreq.dont_filter = True
retryreq.priority = request.priority + self.priority_adjust
self.stats.inc_value('retry/count')
Copy link
Contributor

@redapple redapple Feb 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about including the retry reason in the stats? a bit like DownloaderStats for exceptions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that it could be redundant - you'll get the same exception names as in downloader, e.g. 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 155 and 'retry/twisted.internet.error.ConnectionRefusedError': 155; downloader/response_status_count/ will be also duplicated - it could be 20x lines in stats with very little new information.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see though how it makes it more obvious for the user what caused these retries

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would tell you explicitly that these exceptions are retried, while downloader/exception "only" tells you they happen. Then you may want to disable retries or retry less.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RIght, but all these exceptions are listed above, in EXCEPTIONS_TO_RETRY list. We're not documenting this list though, and names could be different for subclasses.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is true. My point is that you can spot it without adding this information to stats: status codes and exceptions to retry are static, they don't depend on website - you can check settings and EXCEPTIONS_TO_RETRY list. You check retry/count, notice that the value is high (=> there is some issue with crawling this website), and then investigate it further: check response status codes and downloader exceptions. So this PR does this minimal change which allows to spot and debug the problem.

I'm not opposed to adding more information to stats, it is just a bit more controversial IMHO (#2173 becomes more verbose). But more stats could make it easier to debug retry issues, that's true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind having verbose stats if it's to debug problems. Having to look at the code for EXCEPTIONS_TO_RETRY and default RETRY_HTTP_CODES (and settings to double check if it's not the default) feels not-straightforward.
I'm not sure I would spot retry/count more than high downloader/exception_type_count values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want per-reason counts for retry/stopped as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want per-reason counts for retry/stopped as well?

I don't think so, no

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@redapple done!

return retryreq
else:
self.stats.inc_value('retry/stopped')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have doubts on using "stopped" here, but I can't find a satisfying alternative (retry/givingup, retry/max_times_reached...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a better name is welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about retry/max_reached?

@codecov-io
Copy link

codecov-io commented Feb 7, 2017

Codecov Report

Merging #2543 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2543      +/-   ##
==========================================
+ Coverage   83.59%   83.61%   +0.02%     
==========================================
  Files         161      161              
  Lines        8812     8824      +12     
  Branches     1296     1297       +1     
==========================================
+ Hits         7366     7378      +12     
  Misses       1200     1200              
  Partials      246      246
Impacted Files Coverage Δ
scrapy/utils/misc.py 91.66% <ø> (ø)
scrapy/downloadermiddlewares/retry.py 95.65% <100%> (+0.91%)
scrapy/utils/python.py 85.62% <100%> (+0.17%)
scrapy/downloadermiddlewares/stats.py 92.3% <100%> (+0.3%)
scrapy/utils/response.py 85.41% <100%> (+0.31%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update afac3fd...39df675. Read the comment docs.

@kmike
Copy link
Member Author

kmike commented Feb 8, 2017

@redapple I tried running scrapy with this patch on a crawl, and having per-error stats look useful indeed.

@redapple
Copy link
Contributor

redapple commented Feb 8, 2017

Cool. Do you have sample logs to share showing how it looks?

@kmike
Copy link
Member Author

kmike commented Feb 8, 2017

In my case the log is a bit unusual because most times exceptions are handled by another middleware before they get to standard retry middleware, so there is not a lot of duplication:

 'downloader/exception_count': 49828,
 'downloader/exception_type_count/builtins.UnicodeEncodeError': 31,
 'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 2,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 3370,
 'downloader/exception_type_count/twisted.internet.error.NoRouteError': 248,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 578,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 35726,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 296,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 9577,
 'downloader/request_bytes': 215175880,
 'downloader/request_count': 530706,
 'downloader/request_method_count/GET': 530706,
 'downloader/response_bytes': 2823160968,
 'downloader/response_count': 480778,
 'downloader/response_status_count/200': 462638,
 'downloader/response_status_count/301': 12610,
 'downloader/response_status_count/302': 25,
 'downloader/response_status_count/307': 146,
 'downloader/response_status_count/401': 6,
 'downloader/response_status_count/403': 1462,
 'downloader/response_status_count/404': 513,
 'downloader/response_status_count/500': 835,
 'downloader/response_status_count/502': 339,
 'downloader/response_status_count/503': 1842,
 'downloader/response_status_count/504': 324,
 'downloader/response_status_count/520': 30,
 'downloader/response_status_count/525': 8,
 'dupefilter/filtered': 174522,
 'dupefilter2/seen': 1150356,
 'item_scraped_count': 462637,
 'links/ad': 48946,
 'links/page': 429,
 'log_count/INFO': 9176,
 'pages/ad': 462059,
 'pages/category': 578,
 'request_depth_max': 9,
 'response_received_count': 462637,
 'retry/count': 5,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 1,
 'retry/reason_count/twisted.internet.error.TimeoutError': 2,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 2,
 'scheduler/dequeued': 530706,
 'scheduler/dequeued/disk': 530706,
 'scheduler/enqueued': 112089,
 'scheduler/enqueued/disk': 112089,

@@ -101,7 +101,9 @@ def test_https_noconnect_auth_error(self):
self._assert_got_response_code(407, l)

def _assert_got_response_code(self, code, log):
print(log)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this leftover from debugging?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly a leftover; stdout is hidden by default, but pytest shows it if tests fail. So by printing log it becomes easier to debug the issue in case of failures.

@redapple redapple added this to the v1.4 milestone Feb 8, 2017
@redapple redapple changed the title retry stats [MRG+1] Retry stats Feb 8, 2017
@kmike
Copy link
Member Author

kmike commented Feb 14, 2017

I've searched our internal codebase and there are several examples of RetryMiddleware subclasses with __init__ method overridden.

@kmike
Copy link
Member Author

kmike commented Feb 27, 2017

Merging it, given that #2566 implementation is the same and I had @redapple's +1 for the previous backwards incompatible version.

@kmike kmike merged commit c72ba07 into master Feb 27, 2017
@kmike kmike deleted the retry-stats branch February 27, 2017 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants