Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make dupefilter easily subclassable #533

Merged
merged 2 commits into from Jan 15, 2014

Conversation

@chekunkov
Copy link
Contributor

@chekunkov chekunkov commented Jan 15, 2014

If I need to subclass dupefilter to implement different duplicate check - I should overwrite entire request_seen method, which looks good to me and possibly could be changed in future, while only change I need - change the way request fingerprint is calculated. So I propose to add request_fingerprint method to dupefilter classes which will allow easily override the way duplicates are checked without touching request_seen method.

…be easily subclassed without need to override entire request_seen method.
@@ -14,15 +15,19 @@ def from_settings(cls, settings):
def request_seen(self, request):
return False

def request_fingerprint(self, request):

This comment has been minimized.

@kmike

kmike Jan 15, 2014
Member

I think this method does not belong to BaseDupeFilter, because dupe filters may not use request fingerprints, and it is Request Fingerprint duplicates filter (RFPDupeFilter) who needs request fingerprint. Let's just remove this method from here.

This comment has been minimized.

@chekunkov

chekunkov Jan 15, 2014
Author Contributor

You're right, I'll remove it.

if fp in self.fingerprints:
return True
self.fingerprints.add(fp)
if self.file:
self.file.write(fp + os.linesep)

def request_fingerprint(self, request):
"""Override this method to implement a different duplicate check.

This comment has been minimized.

@kmike

kmike Jan 15, 2014
Member

I think this docstring is meaningless - it doesn't tell what is method doing, and it is obvious that you can override a method. I'd just remove the docstring; this would also make code more consistent with other code in this module.

This comment has been minimized.

@chekunkov

chekunkov Jan 15, 2014
Author Contributor

I wrote this docstring by anology with OffsiteMiddleware.get_host_regex docstring so it is obvious to developer what method he should override in first place:

https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spidermiddleware/offsite.py#L42

If you still think this comment is useless - please tell me, I'll remove it too.

This comment has been minimized.

@kmike

kmike Jan 15, 2014
Member

I don't have strong preference. If you like it, then let's keep it (with a formatting fixed - it's better to remove extra lines).

I think the main extension point is still request_seen method, so if we add a comment here it's better to add comment to BaseDupeFilter.request_seen method as well.

@kmike
Copy link
Member

@kmike kmike commented Jan 15, 2014

+1 for this PR - it is an useful refactoring. I recall wanting to override just fingerprint calculations at least once.

kmike added a commit that referenced this pull request Jan 15, 2014
…sable

Make dupefilter easily subclassable
@kmike kmike merged commit af16fa3 into scrapy:master Jan 15, 2014
1 check passed
1 check passed
default The Travis CI build passed
Details
if fp in self.fingerprints:
return True
self.fingerprints.add(fp)
if self.file:
self.file.write(fp + os.linesep)

def request_fingerprint(self, request):
return request_fingerprint(request)

This comment has been minimized.

@pablohoffman

pablohoffman Jan 15, 2014
Member

wouldn't this cause a conflict between the imported function and method name?

This comment has been minimized.

@kmike

kmike Jan 15, 2014
Member

There is a conflict, but only in class definition scope (not in scope of methods themselves).

This comment has been minimized.

@kmike

kmike Jan 15, 2014
Member

If class (not instance) is already created there are request_fingerprint and self.request_fingerprint, and they are different.

This comment has been minimized.

@chekunkov

chekunkov Jan 15, 2014
Author Contributor

If it's critical - could be changed to get_request_fingerprint or something like that. But I don't know if I could update it now.

This comment has been minimized.

@chekunkov

chekunkov Jan 15, 2014
Author Contributor

@kmike is right about the scopes.

This comment has been minimized.

@pablohoffman

pablohoffman Jan 15, 2014
Member

No need to change, thanks about the clarification. Good refactoring!

This comment has been minimized.

@pablohoffman

pablohoffman Jan 15, 2014
Member

Btw, in order to become more broadly useful (and officially supported), we should document this ability to override the dupe filter fingerprint.

This comment has been minimized.

@chekunkov

chekunkov Jan 15, 2014
Author Contributor

I'm not sure I can write a proper documentation (I think my language skill is not high enough). What I was thinking of is adding a sentense to http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class:

In order to change the way duplicates are checked you could subclass RFPDupeFilter and override it request_fingerprint method.

Please tell me if I should somehow commit this (or make another PR for documentation, or maybe? update??? current PR), or someone else will edit documentation more properly.

@dangra
Copy link
Member

@dangra dangra commented Jan 16, 2014

I'd like to see a testcase if request_fingerprint method is going to be part of default DupeFilter api

@chekunkov
Copy link
Contributor Author

@chekunkov chekunkov commented Jan 16, 2014

@dangra What that testcase should test? How output of request_seen method changes while overriding request_fingerprint method - did I get it right?

And If I write one - how can I submit it? New PR?

@dangra
Copy link
Member

@dangra dangra commented Jan 16, 2014

@chekunkov Yes, it should test that extending the class and overriding the method actually uses the method. The goal is to prevent that a later change (for simplification sake) doesn't remove this method without notice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants