Add from_crawler support to dupefilters #2956

elacuesta · 2017-10-09T15:11:56Z

codecov · 2017-10-09T15:31:59Z

Codecov Report

Merging #2956 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2956      +/-   ##
==========================================
+ Coverage   84.34%   84.35%   +0.01%     
==========================================
  Files         167      167              
  Lines        9359     9359              
  Branches     1388     1388              
==========================================
+ Hits         7894     7895       +1     
  Misses       1209     1209              
+ Partials      256      255       -1

| Impacted Files | Coverage Δ | |
|---|---|---|
| scrapy/core/scheduler.py | `62.13% <100%> (ø)` | ⬆️ |
| scrapy/utils/trackref.py | `86.48% <0%> (+2.7%)` | ⬆️ |

kmike · 2017-10-09T16:17:14Z

scrapy/dupefilters.py

@@ -40,6 +40,10 @@ def __init__(self, path=None, debug=False):
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
+    def from_crawler(cls, crawler):
+        return cls.from_settings(crawler.settings)


why is this method needed?

Hey @kmike! Thanks for the quick response.
First I thought about leaving only from_crawler, but to keep this backwards compatible (and also because of your comment in the original issue: "add from_crawler support to dupefilters in addition to from_settings") I kept both methods. It follows the same chain of precedence that can be found in https://github.com/scrapy/scrapy/blob/1.4/scrapy/middleware.py#L35-L40
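For readers who don't have the linked lines at hand, the chain of precedence can be sketched roughly as follows. This is a standalone illustration, not the code from this PR or from scrapy/middleware.py: the `build` helper, the three example classes, and the `SimpleNamespace` stand-in for a crawler are all hypothetical names made up for the example.

```python
from types import SimpleNamespace


def build(cls, settings, crawler=None):
    """Hypothetical helper: pick the most specific constructor cls offers."""
    if crawler is not None and hasattr(cls, 'from_crawler'):
        return cls.from_crawler(crawler)
    if hasattr(cls, 'from_settings'):
        return cls.from_settings(settings)
    # Neither classmethod is defined: fall back to the plain constructor.
    return cls()


class CrawlerAware:
    """Defines both classmethods; from_crawler should win when a crawler exists."""

    def __init__(self, source):
        self.source = source

    @classmethod
    def from_crawler(cls, crawler):
        return cls('from_crawler')

    @classmethod
    def from_settings(cls, settings):
        return cls('from_settings')


class SettingsOnly:
    """Defines only from_settings."""

    def __init__(self, source):
        self.source = source

    @classmethod
    def from_settings(cls, settings):
        return cls('from_settings')


class Plain:
    """Defines neither classmethod."""
    source = 'constructor'


settings = {'DUPEFILTER_DEBUG': True}
crawler = SimpleNamespace(settings=settings)  # stand-in, not a real Crawler

print(build(CrawlerAware, settings, crawler).source)  # from_crawler
print(build(SettingsOnly, settings, crawler).source)  # from_settings
print(build(Plain, settings, crawler).source)         # constructor
```

Without a crawler, `build(CrawlerAware, settings)` falls through to `from_settings`, which is the backwards-compatible path.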

It makes sense to follow the same logic, but if you remove the from_crawler method from this class, the behavior would be exactly the same (i.e. from_crawler doesn't seem to be needed here).

Also, if you subclass a dupefilter and override from_settings, the overridden method won't be called after this PR.

You're right, I see your point now, thanks! I'll remove it 👍

@kmike: Related to this: is it possible that the MiddlewareManager is affected by this too? If I'm not mistaken, all of its subclasses (DownloaderMiddlewareManager, SpiderMiddlewareManager, ExtensionManager, ItemPipelineManager) are created using from_crawler, and none of them are supposed to be overridden by the user (except maybe the pipeline manager using the ITEM_PROCESSOR setting, but that's not even documented). The only occurrence of the manager using from_settings is in the tests/test_middleware.py file (test_enabled_from_settings).
What do you think about some cleanup in https://github.com/scrapy/scrapy/blob/1.4/scrapy/middleware.py#L28-L58 to keep only the from_crawler method? In a different PR, of course.

kmike · 2017-10-09T16:19:17Z

tests/test_dupefilters.py

+                return df
+
+        crawler = get_crawler(settings_dict={'DUPEFILTER_DEBUG': True, 'USER_AGENT': 'test ua'})
+        dupefilter = FromCrawlerRFPDupeFilter.from_crawler(crawler)


This test doesn't check that the Scheduler calls the from_crawler method. I think this test would pass even before all the changes in this PR, so it is not really testing the new feature.

You're right, but to be fair, the current tests don't check the way the Scheduler creates the dupefilter either :-)

I can add a test for that.

johtso · 2018-03-22T14:31:38Z

Any update on this?

kmike · 2018-03-22T20:43:36Z

tests/test_dupefilters.py



 class RFPDupeFilterTest(unittest.TestCase):

+    def test_from_crawler_scheduler(self):
+        settings = {'DUPEFILTER_DEBUG': True, 'METHOD': 'from_crawler',


I think it makes sense to set the method in the dupefilter itself. If I'm not mistaken, these tests would currently still pass if you changed the dupefilter class (i.e. if the wrong method were called).
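One way to read this suggestion, as a minimal sketch: have each construction path stamp its own name on the instance, so the assertion depends on which path actually ran rather than on a value passed in through settings. `MarkedDupeFilter` and `buggy_build` are hypothetical names used only for this illustration, and the `SimpleNamespace` object is a stand-in for a real crawler.

```python
from types import SimpleNamespace


class MarkedDupeFilter:
    """Hypothetical test double: every construction path records its own name."""

    def __init__(self):
        self.method = 'n/a'

    @classmethod
    def from_crawler(cls, crawler):
        df = cls()
        df.method = 'from_crawler'
        return df

    @classmethod
    def from_settings(cls, settings):
        df = cls()
        df.method = 'from_settings'
        return df


def buggy_build(cls, crawler):
    # Stand-in for a scheduler that wrongly ignores from_crawler.
    return cls.from_settings(crawler.settings)


crawler = SimpleNamespace(settings={'DUPEFILTER_DEBUG': True})

right = MarkedDupeFilter.from_crawler(crawler)
wrong = buggy_build(MarkedDupeFilter, crawler)

# The marker is set inside the class, so calling the wrong method is visible.
assert right.method == 'from_crawler'
assert wrong.method == 'from_settings'
```

Because the marker lives in the class rather than in the settings dict, a test asserting `method == 'from_crawler'` would fail against `buggy_build`.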

kmike · 2018-03-22T20:45:48Z

@johtso thanks for the ping :)

@elacuesta I think this PR looks good, it needs just a small testing tweak. It'd be also good to add a test for dupefilters without from_crawler/from_settings methods. What's the reason for supporting them, by the way?

elacuesta · 2018-03-23T16:33:45Z

Hello Mikhail! I think I addressed your latest comments (setting the string to compare in the class itself, adding a test for dupefilters created without from_crawler/from_settings methods). I'm not sure what the use case for directly created dupefilters would be. Maybe someone needs to do some initialization but doesn't need the crawler or the settings, so a child class with a custom constructor is enough?

kmike · 2018-03-23T18:21:26Z

Thanks @elacuesta! Test coverage is not complete because the tests still don't check that a dupefilter without from_crawler / from_settings methods works: the dupefilter you're using inherits from a base class which has these methods.
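The gap described here comes from how attribute lookup works in Python: `hasattr` follows the inheritance chain, so a subclass of a base that defines `from_settings` never reaches a "no classmethods" branch. A small sketch, with hypothetical class names that are not the ones used in this PR's tests:

```python
class BaseFilter:
    """Hypothetical base class that provides from_settings."""

    @classmethod
    def from_settings(cls, settings):
        return cls()


class InheritingFilter(BaseFilter):
    """Defines nothing itself, but inherits from_settings."""


class StandaloneFilter:
    """No base class, so neither classmethod exists."""


# hasattr sees the inherited classmethod on the subclass...
assert hasattr(InheritingFilter, 'from_settings')
# ...so only a class with no such base exercises the plain-constructor path.
assert not hasattr(StandaloneFilter, 'from_settings')
assert not hasattr(StandaloneFilter, 'from_crawler')
```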

elacuesta · 2018-03-24T00:21:56Z

I didn't realize that before, thanks! I changed the test case: the new class doesn't implement any of the dupefilter methods, but I think that's not the point of the test, which is just to ensure the right class is used.

elacuesta · 2018-06-17T19:15:15Z

Ping @kmike 😇

kmike · 2018-07-19T22:45:17Z

This looks good 👍

However, I'd prefer to avoid copy-paste and use the same function to create middlewares/extensions and dupefilters. Such a function can be found in #1605, which is almost ready to merge as well; I wonder if you're up to finishing it and using it as a base for your PR :)

elacuesta · 2018-07-21T03:13:15Z

Updated to use scrapy.utils.misc.create_instance. The diff shows unrelated changes but that should go away after merging #3348

dangra · 2018-07-25T14:59:58Z

hi @elacuesta, can you rebase now that #3348 is merged? thanks

elacuesta · 2018-07-26T19:26:14Z

@dangra Rebased 👍

dangra · 2018-07-26T20:03:50Z

Merge +1 sorry 🙄

kmike reviewed Oct 9, 2017

elacuesta force-pushed the dupefilter_from_crawler branch from e237114 to cbf25af Compare October 9, 2017 17:49

kmike reviewed Mar 22, 2018

kmike mentioned this pull request Jul 19, 2018

Add from_crawler constructor for feed exporters and storages #1605

Merged

elacuesta force-pushed the dupefilter_from_crawler branch from 7184b11 to 139da9e Compare July 21, 2018 01:18

kmike added this to the v1.6 milestone Jul 25, 2018

elacuesta added 5 commits July 26, 2018 16:24

Add from_crawler support to dupefilters

701cd2f

Test dupefilter creation by the Scheduler

d306fe3

Add test for direct creation of dupefilter (no from_crawler/from_settings)

0089a4a

Fix test for dupefilter

9e14f8c

Simplify dupefilter creation

999341b

elacuesta force-pushed the dupefilter_from_crawler branch from 139da9e to 999341b Compare July 26, 2018 19:25

dangra merged commit 93afe18 into scrapy:master Jul 26, 2018

elacuesta deleted the dupefilter_from_crawler branch July 26, 2018 20:03

Gallaecio mentioned this pull request Jul 8, 2019

Scrapy - Retrieve spider object in dupefilter #1489

Closed
