Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add from_crawler support to dupefilters #2956

Merged
merged 5 commits into from Jul 26, 2018

Conversation

elacuesta
Copy link
Member

Fixes #2940

@codecov
Copy link

codecov bot commented Oct 9, 2017

Codecov Report

Merging #2956 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2956      +/-   ##
==========================================
+ Coverage   84.34%   84.35%   +0.01%     
==========================================
  Files         167      167              
  Lines        9359     9359              
  Branches     1388     1388              
==========================================
+ Hits         7894     7895       +1     
  Misses       1209     1209              
+ Partials      256      255       -1
Impacted Files Coverage Δ
scrapy/core/scheduler.py 62.13% <100%> (ø) ⬆️
scrapy/utils/trackref.py 86.48% <0%> (+2.7%) ⬆️

@@ -40,6 +40,10 @@ def __init__(self, path=None, debug=False):
self.fingerprints.update(x.rstrip() for x in self.file)

@classmethod
def from_crawler(cls, crawler):
return cls.from_settings(crawler.settings)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this method needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @kmike! Thanks for the quick response.
First I thought about leaving only from_crawler, but to keep this backwards compatible (and also because of your comment in the original issue: "add from_crawler support to dupefilters in addition to from_settings") I kept both methods. It follows the same chain of precedence that can be found in https://github.com/scrapy/scrapy/blob/1.4/scrapy/middleware.py#L35-L40

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to follow the same logic, but if you remove from_crawler method from this class, it would do exactly the same (i.e. from_crawler doesn't seem to be needed here).

Also, if you subclass from a dupefilter and override from_setting, overridden method won't be called after this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I see your point now, thanks! I'll remove it 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmike: Related to this: is it possible that the MiddlewareManager is affected by this too? If I'm not mistaken, all of its subclasses (DownloaderMiddlewareManager, SpiderMiddlewareManager, ExtensionManager, ItemPipelineManager) are created using from_crawler, and none of them are supposed to be overridden by the user (except maybe the pipeline manager using the ITEM_PROCESSOR setting, but that it's not even documented). The only occurrence of the manager using from_settings is in the tests/test_middleware.py file (test_enabled_from_settings)
What do you think about some cleanup in https://github.com/scrapy/scrapy/blob/1.4/scrapy/middleware.py#L28-L58 to keep only the from_crawler method? In a different PR, of course.

return df

crawler = get_crawler(settings_dict={'DUPEFILTER_DEBUG': True, 'USER_AGENT': 'test ua'})
dupefilter = FromCrawlerRFPDupeFilter.from_crawler(crawler)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't check that Scheduler calls from_crawler method. I think this tests would pass even before all changes in this PR, so it is not really testing the new feature.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, but to be fair, the current tests don't check the way the Scheduler creates the dupefilter either :-)

I can add a test for that.

@johtso
Copy link

johtso commented Mar 22, 2018

Any update on this?



class RFPDupeFilterTest(unittest.TestCase):

def test_from_crawler_scheduler(self):
settings = {'DUPEFILTER_DEBUG': True, 'METHOD': 'from_crawler',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to set method in dupefilter itself - if I'm not mistaken, currently these tests would pass if you change the dupefilter class (if a wrong method is called).

@kmike
Copy link
Member

kmike commented Mar 22, 2018

@johtso thanks for the ping :)

@elacuesta I think this PR looks good, it needs just a small testing tweak. It'd be also good to add a test for dupefilters without from_crawler/from_settings methods. What's the reason for supporting them, by the way?

@elacuesta
Copy link
Member Author

Hello Mikhail! I think I addressed your latest comments (setting the string to compare in the class itself, add a test for dupefilters created without from_crawler/from_settings methods). Not sure what would be the case for the directly created dupefilters, maybe someone needs to do some initialization but doesn't need the crawler nor the settings and just a child class with a custom constructor is enough?

@kmike
Copy link
Member

kmike commented Mar 23, 2018

Thanks @elacuesta! Test coverage is not complete because tests still don't check that dupefilter without from_crawler / from_settings methods work: dupefilter you're using inherits from a base class which has these methods.

@elacuesta
Copy link
Member Author

I didn't realize that before, thanks! I changed the test case, the new class doesn't implement any of the dupefilter methods but I think that's not the point of the test, it's just to ensure the right class is used.

@elacuesta
Copy link
Member Author

Ping @kmike 😇

@kmike
Copy link
Member

kmike commented Jul 19, 2018

This looks good 👍

However, I'd prefer to avoid copy-paste, and use the same function to create middlewares/extensions and dupefilters. Such function can be found in #1605, which is almost ready to merge as well; I wonder if you're up to finishing it, and using as a base for your PR :)

@elacuesta
Copy link
Member Author

Updated to use scrapy.utils.misc.create_instance. The diff shows unrelated changes but that should go away after merging #3348

@kmike kmike added this to the v1.6 milestone Jul 25, 2018
@dangra
Copy link
Member

dangra commented Jul 25, 2018

hi @elacuesta, can you rebase now that #3348 is merged? thanks

@elacuesta
Copy link
Member Author

@dangra Rebased 👍

@dangra dangra merged commit 93afe18 into scrapy:master Jul 26, 2018
@elacuesta elacuesta deleted the dupefilter_from_crawler branch July 26, 2018 20:03
@dangra
Copy link
Member

dangra commented Jul 26, 2018

Merge +1 sorry 🙄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants