
DupeFilter: add settings for verbose logging and filtered requests stats #553

Merged
dangra merged 2 commits into scrapy:master from redapple:dupefilter-verbose on Feb 18, 2014

Conversation

@redapple (Contributor) commented Jan 21, 2014

A large number of duplicate requests is often not a good sign (a buggy spider or a misbehaving website) -- except for CrawlSpider, obviously.

Here I propose adding one setting for the default RFPDupeFilter class:

  • DUPEFILTER_DEBUG: to log all duplicate requests and not only the first one

This patch also adds a stats counter with key dupefilter/filtered to count the number of duplicate requests.
This counter is always enabled.
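
For illustration, wiring this up in a project would look roughly like the sketch below (using the setting name and stats key proposed in this PR):

    # settings.py -- sketch: log every filtered duplicate request,
    # not only the first one
    DUPEFILTER_DEBUG = True

The dupefilter/filtered counter is then incremented once per filtered request and shows up in the crawl stats alongside the other counters.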

Question: is spider.crawler.stats the proper way to access the crawler's stats collector? Is it safe?

@nramirezuy (Contributor) commented Jan 22, 2014

@redapple This is relative, because CrawlSpiders will extract all URLs and attempt to crawl them, but I like the stat.

+1

@redapple (Contributor, Author) commented Jan 22, 2014

@nramirezuy, yes, I guess for CrawlSpider this would mean a lot of log spam. But in my case, in the context of a hand-written (Base)Spider, I had an issue where the next-page link never got the next products (it always returned the same page of paginated results), and I couldn't figure it out until I logged all duplicate requests.

Updated the PR description regarding CrawlSpider :)

@kmike (Member) commented Jan 22, 2014

I think DUPEFILTER_STATS (or DUPEFILTER_STATS_ENABLED or DUPEFILTER_ENABLE_STATS) is a better name, because this option turns stats on/off rather than changing their verbosity. Also, there should be docs for this feature. Besides that, this PR looks good. I haven't needed this feature personally, but I don't see why we shouldn't provide it.

@redapple (Contributor, Author) commented Jan 22, 2014

@kmike done.

@kmike (Member) commented Jan 22, 2014

Hmm, do we need an option for this at all? Maybe just collect these stats?

@redapple (Contributor, Author) commented Jan 22, 2014

Collecting dupe stats could indeed be the default. /cc @dangra @pablohoffman ?

@dangra (Member) commented Jan 23, 2014

Collecting dupe stats could indeed be the default

I agree.

@dangra (Member) commented Jan 23, 2014

And I don't see the point of disabling its stats.

@kmike (Member) commented Jan 28, 2014

This looks good to me.

@redapple (Contributor, Author) commented Jan 28, 2014

No :) but that's why there's code review, to discuss ;)

@dangra (Member) commented Jan 28, 2014

@redapple: indeed :) Can you amend + squash? I'll merge as soon as it's ready.

if self.verbose_log:
    fmt = "Filtered duplicate request: %(request)s"
    log.msg(format=fmt, request=request, level=self._log_level, spider=spider)
elif self.logdupes:
    fmt = "Filtered duplicate request: %(request)s - no more duplicates will be shown (see DUPEFILTER_CLASS)"

@pablohoffman (Member) commented Jan 28, 2014

We should probably mention the new setting here (instead of DUPEFILTER_CLASS).

@redapple (Contributor, Author) commented Jan 28, 2014

good point
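
For reference, the revised message would presumably point at the new debug setting instead of DUPEFILTER_CLASS, roughly:

    fmt = ("Filtered duplicate request: %(request)s"
           " - no more duplicates will be shown"
           " (see DUPEFILTER_DEBUG to show all duplicates)")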

    self.logdupes = False

if self._stats_enabled:

@pablohoffman (Member) commented Jan 28, 2014

I would remove this check and class attribute.

@redapple (Contributor, Author) commented Jan 28, 2014

indeed.

@@ -28,17 +28,22 @@ def log(self, request, spider): # log that a request has been filtered
class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    _log_level = log.DEBUG

    def __init__(self, path=None):

@pablohoffman (Member) commented Jan 28, 2014

Are these class attributes added just to support subclasses that override this functionality?

@redapple (Contributor, Author) commented Jan 28, 2014

Yep

@redapple (Contributor, Author) commented Feb 1, 2014

@pablohoffman, do you prefer that I remove it?
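
For context, the point of keeping these as class attributes is that a subclass can tweak the behavior without reimplementing __init__. A hypothetical example (not part of this PR; module paths as of this Scrapy era):

    from scrapy import log
    from scrapy.dupefilter import RFPDupeFilter


    class NoisyDupeFilter(RFPDupeFilter):
        # hypothetical subclass: surface filtered duplicates at INFO instead of DEBUG
        _log_level = log.INFO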


    self.logdupes = False

if self._stats_enabled:
    spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)

@pablohoffman (Member) commented Jan 28, 2014

Maybe worth caching the stats object to avoid two attribute resolutions on every request (duplicate or not). Or am I micro-optimizing here?

@redapple (Contributor, Author) commented Jan 28, 2014

Well, it's done that way in DepthMiddleware, but there self.stats is also tested before calling self.stats.inc_value() (maybe for the same reason as #566?)
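
A sketch of what that caching could look like, modeled on how DepthMiddleware keeps a self.stats reference (hypothetical; the stats argument is not part of this PR's diff):

    from scrapy import log
    from scrapy.dupefilter import RFPDupeFilter


    class CachedStatsDupeFilter(RFPDupeFilter):
        """Hypothetical variant: resolve the stats collector once instead of
        looking up spider.crawler.stats on every filtered request."""

        def __init__(self, path=None, stats=None):
            super(CachedStatsDupeFilter, self).__init__(path)
            self.stats = stats  # e.g. crawler.stats, injected by whatever builds the filter

        def log(self, request, spider):
            fmt = "Filtered duplicate request: %(request)s"
            log.msg(format=fmt, request=request, level=log.DEBUG, spider=spider)
            if self.stats:  # guarded before use, as DepthMiddleware does
                self.stats.inc_value('dupefilter/filtered', spider=spider)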

    if path:
        self.file = open(os.path.join(path, 'requests.seen'), 'a+')
        self.fingerprints.update(x.rstrip() for x in self.file)

@classmethod
def from_settings(cls, settings):
    return cls(job_dir(settings))
    verbose_log = settings.getbool('DUPEFILTER_VERBOSE_LOG')

@pablohoffman (Member) commented Jan 28, 2014

I'm not sure about the setting name; we've been using the _DEBUG suffix for enabling per-component extra logging -- see DOWNLOADER_DEBUG, COOKIES_DEBUG, AUTOTHROTTLE_DEBUG. Under that convention, DUPEFILTER_DEBUG would be the most appropriate one.

@redapple (Contributor, Author) commented Jan 28, 2014

I'm good with that. I took DEPTH_STATS_VERBOSE as an example, but that has a different meaning.

@redapple (Contributor, Author) commented Jan 28, 2014

Gotta recap all comments now :)

@pablohoffman (Member) commented Feb 3, 2014

Does it make sense to add this functionality at the scheduler level and not as something specific to one type of dupe filter (RFPDupeFilter)?

@redapple (Contributor, Author) commented Feb 4, 2014

I haven't looked at the specifics and don't know the history of it, but is there a reason the dupefilter is not a middleware?

@nramirezuy (Contributor) commented Feb 4, 2014

@redapple I think it's something older that almost nobody wants to disable, so nobody asked before.
As far as I can see, it could be implemented as a SpiderMiddleware.
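
For illustration only (not part of this PR), a duplicate filter written as a spider middleware might look roughly like the sketch below; note @dangra's caveat further down that redirected requests would bypass it, since redirects happen in the downloader:

    from scrapy.http import Request
    from scrapy.utils.request import request_fingerprint


    class DupeFilterSpiderMiddleware(object):
        """Hypothetical spider middleware dropping requests whose
        fingerprint has already been seen."""

        def __init__(self):
            self.fingerprints = set()

        def process_spider_output(self, response, result, spider):
            for entry in result:
                if isinstance(entry, Request):
                    fp = request_fingerprint(entry)
                    if fp in self.fingerprints:
                        continue  # duplicate request: drop it
                    self.fingerprints.add(fp)
                yield entry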

@redapple (Contributor, Author) commented Feb 5, 2014

@pablohoffman @dangra @kmike @nramirezuy, could the potential refactoring at the scheduler level be done in another PR?
Is something missing before merging?

@dangra (Member) commented Feb 5, 2014

I haven't looked at the specifics and don't know the history of it, but is there a reason the dupefilter is not a middleware?

We never agreed on making it a spider middleware because redirected requests wouldn't be filtered.

It was a scheduler middleware (yeah, that existed at some point), but it was the only scheduler middleware for a long time, so we removed it in one of the big refactors and made the dupefilter a built-in component.

@dangra (Member) commented Feb 5, 2014

is something missing before merging?

It looks good to merge to me; +1 to submitting another PR for the scheduler refactor.

@redapple (Contributor, Author) commented Feb 18, 2014

Anything preventing this from getting merged?

dangra added a commit that referenced this pull request Feb 18, 2014
DupeFilter: add settings for verbose logging and filtered requests stats
dangra merged commit 70a57dc into scrapy:master on Feb 18, 2014
1 check passed: the Travis CI build passed
redapple deleted the redapple:dupefilter-verbose branch on Jul 8, 2016
5 participants