New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add from_crawler support to dupefilters #2956
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2956 +/- ##
==========================================
+ Coverage 84.34% 84.35% +0.01%
==========================================
Files 167 167
Lines 9359 9359
Branches 1388 1388
==========================================
+ Hits 7894 7895 +1
Misses 1209 1209
+ Partials 256 255 -1
|
scrapy/dupefilters.py
Outdated
@@ -40,6 +40,10 @@ def __init__(self, path=None, debug=False): | |||
self.fingerprints.update(x.rstrip() for x in self.file) | |||
|
|||
@classmethod | |||
def from_crawler(cls, crawler): | |||
return cls.from_settings(crawler.settings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this method needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @kmike! Thanks for the quick response.
First I thought about leaving only from_crawler
, but to keep this backwards compatible (and also because of your comment in the original issue: "add from_crawler support to dupefilters in addition to from_settings") I kept both methods. It follows the same chain of precedence that can be found in https://github.com/scrapy/scrapy/blob/1.4/scrapy/middleware.py#L35-L40
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to follow the same logic, but if you remove from_crawler method from this class, it would do exactly the same (i.e. from_crawler doesn't seem to be needed here).
Also, if you subclass from a dupefilter and override from_setting, overridden method won't be called after this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I see your point now, thanks! I'll remove it 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kmike: Related to this: is it possible that the MiddlewareManager
is affected by this too? If I'm not mistaken, all of its subclasses (DownloaderMiddlewareManager, SpiderMiddlewareManager, ExtensionManager, ItemPipelineManager) are created using from_crawler
, and none of them are supposed to be overridden by the user (except maybe the pipeline manager using the ITEM_PROCESSOR
setting, but that it's not even documented). The only occurrence of the manager using from_settings
is in the tests/test_middleware.py
file (test_enabled_from_settings
)
What do you think about some cleanup in https://github.com/scrapy/scrapy/blob/1.4/scrapy/middleware.py#L28-L58 to keep only the from_crawler
method? In a different PR, of course.
tests/test_dupefilters.py
Outdated
return df | ||
|
||
crawler = get_crawler(settings_dict={'DUPEFILTER_DEBUG': True, 'USER_AGENT': 'test ua'}) | ||
dupefilter = FromCrawlerRFPDupeFilter.from_crawler(crawler) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test doesn't check that Scheduler calls from_crawler method. I think this tests would pass even before all changes in this PR, so it is not really testing the new feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, but to be fair, the current tests don't check the way the Scheduler creates the dupefilter either :-)
I can add a test for that.
e237114
to
cbf25af
Compare
Any update on this? |
tests/test_dupefilters.py
Outdated
|
||
|
||
class RFPDupeFilterTest(unittest.TestCase): | ||
|
||
def test_from_crawler_scheduler(self): | ||
settings = {'DUPEFILTER_DEBUG': True, 'METHOD': 'from_crawler', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to set method in dupefilter itself - if I'm not mistaken, currently these tests would pass if you change the dupefilter class (if a wrong method is called).
@johtso thanks for the ping :) @elacuesta I think this PR looks good, it needs just a small testing tweak. It'd be also good to add a test for dupefilters without from_crawler/from_settings methods. What's the reason for supporting them, by the way? |
Hello Mikhail! I think I addressed your latest comments (setting the string to compare in the class itself, add a test for dupefilters created without from_crawler/from_settings methods). Not sure what would be the case for the directly created dupefilters, maybe someone needs to do some initialization but doesn't need the crawler nor the settings and just a child class with a custom constructor is enough? |
Thanks @elacuesta! Test coverage is not complete because tests still don't check that dupefilter without from_crawler / from_settings methods work: dupefilter you're using inherits from a base class which has these methods. |
I didn't realize that before, thanks! I changed the test case, the new class doesn't implement any of the dupefilter methods but I think that's not the point of the test, it's just to ensure the right class is used. |
Ping @kmike 😇 |
This looks good 👍 However, I'd prefer to avoid copy-paste, and use the same function to create middlewares/extensions and dupefilters. Such function can be found in #1605, which is almost ready to merge as well; I wonder if you're up to finishing it, and using as a base for your PR :) |
7184b11
to
139da9e
Compare
Updated to use |
hi @elacuesta, can you rebase now that #3348 is merged? thanks |
139da9e
to
999341b
Compare
@dangra Rebased 👍 |
Merge +1 sorry 🙄 |
Fixes #2940