
Add from_crawler support to dupefilters #2956

Merged
5 commits merged into scrapy:master from elacuesta:dupefilter_from_crawler on Jul 26, 2018

Conversation

@elacuesta (Member) commented on Oct 9, 2017

Fixes #2940
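For context, #2940 asks that dupefilters be able to receive the crawler object rather than only the settings. A minimal sketch of the kind of custom dupefilter this enables (the class name, constructor signature and use of crawler.stats are illustrative, not taken from the PR):

```python
from scrapy.dupefilters import RFPDupeFilter


class StatsAwareDupeFilter(RFPDupeFilter):
    """Hypothetical dupefilter that needs crawler-level objects, not just settings."""

    def __init__(self, path=None, debug=False, stats=None):
        super(StatsAwareDupeFilter, self).__init__(path, debug)
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # With from_crawler support in the scheduler, the dupefilter can reach
        # the crawler (stats, signals, spider), not only crawler.settings.
        debug = crawler.settings.getbool('DUPEFILTER_DEBUG')
        return cls(debug=debug, stats=crawler.stats)
```

Such a class would be enabled through the usual DUPEFILTER_CLASS setting.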

@codecov codecov bot commented on Oct 9, 2017

Codecov Report

Merging #2956 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2956      +/-   ##
==========================================
+ Coverage   84.34%   84.35%   +0.01%     
==========================================
  Files         167      167              
  Lines        9359     9359              
  Branches     1388     1388              
==========================================
+ Hits         7894     7895       +1     
  Misses       1209     1209              
+ Partials      256      255       -1
Impacted Files             Coverage   Δ
scrapy/core/scheduler.py   62.13%     <100%> (ø) ⬆️
scrapy/utils/trackref.py   86.48%     <0%> (+2.7%) ⬆️
@@ -40,6 +40,10 @@ def __init__(self, path=None, debug=False):
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_crawler(cls, crawler):
        return cls.from_settings(crawler.settings)

@kmike (Member) commented on Oct 9, 2017

Why is this method needed?

@elacuesta (Member, Author) commented on Oct 9, 2017

Hey @kmike! Thanks for the quick response.
At first I thought about keeping only from_crawler, but to stay backwards compatible (and also because of your comment in the original issue: "add from_crawler support to dupefilters in addition to from_settings") I kept both methods. It follows the same chain of precedence found in https://github.com/scrapy/scrapy/blob/1.4/scrapy/middleware.py#L35-L40
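For reference, the precedence chain linked above boils down to roughly this (a paraphrased sketch of the middleware.py logic; build_from_crawler is a made-up name, not a Scrapy API):

```python
def build_from_crawler(objcls, settings, crawler=None):
    # Same precedence as scrapy/middleware.py: prefer from_crawler,
    # fall back to from_settings, and finally to the plain constructor.
    if crawler and hasattr(objcls, 'from_crawler'):
        return objcls.from_crawler(crawler)
    if hasattr(objcls, 'from_settings'):
        return objcls.from_settings(settings)
    return objcls()
```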

@kmike (Member) commented on Oct 9, 2017

It makes sense to follow the same logic, but if you remove the from_crawler method from this class, it would do exactly the same thing (i.e. from_crawler doesn't seem to be needed here).

Also, if you subclass a dupefilter and override from_settings, the overridden method won't be called after this PR.

@elacuesta (Member, Author) commented on Oct 9, 2017

You're right, I see your point now, thanks! I'll remove it 👍

@elacuesta (Member, Author) commented on Oct 9, 2017

@kmike: Related to this: is it possible that the MiddlewareManager is affected by this too? If I'm not mistaken, all of its subclasses (DownloaderMiddlewareManager, SpiderMiddlewareManager, ExtensionManager, ItemPipelineManager) are created using from_crawler, and none of them are supposed to be overridden by the user (except maybe the pipeline manager via the ITEM_PROCESSOR setting, but that's not even documented). The only occurrence of the manager being created with from_settings is in tests/test_middleware.py (test_enabled_from_settings).
What do you think about some cleanup in https://github.com/scrapy/scrapy/blob/1.4/scrapy/middleware.py#L28-L58 to keep only the from_crawler method? In a different PR, of course.

                return df

        crawler = get_crawler(settings_dict={'DUPEFILTER_DEBUG': True, 'USER_AGENT': 'test ua'})
        dupefilter = FromCrawlerRFPDupeFilter.from_crawler(crawler)

@kmike (Member) commented on Oct 9, 2017

This test doesn't check that the Scheduler calls the from_crawler method. I think this test would pass even before the changes in this PR, so it is not really testing the new feature.

@elacuesta (Member, Author) commented on Oct 9, 2017

You're right, but to be fair, the current tests don't check the way the Scheduler creates the dupefilter either :-)

I can add a test for that.
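A test along those lines could look roughly like this (a sketch only; MarkedDupeFilter, the test name and the method attribute are illustrative, not the code that ended up in the PR):

```python
import unittest

from scrapy.core.scheduler import Scheduler
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.test import get_crawler


class MarkedDupeFilter(RFPDupeFilter):
    """Records which factory method built it, so the test can tell them apart."""
    method = None

    @classmethod
    def from_crawler(cls, crawler):
        df = cls.from_settings(crawler.settings)
        df.method = 'from_crawler'
        return df


class SchedulerDupeFilterTest(unittest.TestCase):

    def test_scheduler_uses_from_crawler(self):
        crawler = get_crawler(settings_dict={
            'DUPEFILTER_CLASS': __name__ + '.MarkedDupeFilter'})
        scheduler = Scheduler.from_crawler(crawler)
        # The scheduler should build the dupefilter via from_crawler.
        self.assertIsInstance(scheduler.df, MarkedDupeFilter)
        self.assertEqual(scheduler.df.method, 'from_crawler')
```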

@elacuesta elacuesta force-pushed the elacuesta:dupefilter_from_crawler branch from e237114 to cbf25af Oct 9, 2017
@johtso commented on Mar 22, 2018

Any update on this?



class RFPDupeFilterTest(unittest.TestCase):

    def test_from_crawler_scheduler(self):
        settings = {'DUPEFILTER_DEBUG': True, 'METHOD': 'from_crawler',

@kmike (Member) commented on Mar 22, 2018

I think it makes sense to set the method in the dupefilter itself; if I'm not mistaken, these tests would currently pass if you change the dupefilter class (i.e. if a wrong method is called).

@kmike (Member) commented on Mar 22, 2018

@johtso thanks for the ping :)

@elacuesta I think this PR looks good, it just needs a small testing tweak. It'd also be good to add a test for dupefilters without from_crawler/from_settings methods. What's the reason for supporting them, by the way?

@elacuesta (Member, Author) commented on Mar 23, 2018

Hello Mikhail! I think I addressed your latest comments (setting the string to compare in the class itself, and adding a test for dupefilters created without from_crawler/from_settings methods). I'm not sure what the use case for directly created dupefilters would be; maybe someone needs to do some initialization but doesn't need the crawler or the settings, and a child class with a custom constructor is enough?

@kmike (Member) commented on Mar 23, 2018

Thanks @elacuesta! Test coverage is not complete because the tests still don't check that a dupefilter without from_crawler / from_settings methods works: the dupefilter you're using inherits from a base class which has these methods.

@elacuesta (Member, Author) commented on Mar 24, 2018

I didn't realize that before, thanks! I changed the test case: the new class doesn't implement any of the dupefilter methods, but I think that's not the point of the test; it's just to ensure the right class is used.
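For illustration, the "no factory methods" case could be exercised with something like the class below (a hypothetical example, not the one added in the PR); since it defines neither from_crawler nor from_settings and doesn't inherit them from a base dupefilter, the scheduler has to fall back to calling the constructor directly:

```python
class DirectDupeFilter(object):
    """Implements the dupefilter interface but no from_crawler/from_settings,
    so it can only be created by calling the class directly."""

    def request_seen(self, request):
        return False

    def open(self):
        pass

    def close(self, reason):
        pass

    def log(self, request, spider):
        pass
```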

@elacuesta (Member, Author) commented on Jun 17, 2018

Ping @kmike 😇

@kmike (Member) commented on Jul 19, 2018

This looks good 👍

However, I'd prefer to avoid copy-paste and use the same function to create middlewares/extensions and dupefilters. Such a function can be found in #1605, which is almost ready to merge as well; I wonder if you're up to finishing it and using it as a base for your PR :)

@elacuesta elacuesta force-pushed the elacuesta:dupefilter_from_crawler branch from 7184b11 to 139da9e Jul 21, 2018
@elacuesta (Member, Author) commented on Jul 21, 2018

Updated to use scrapy.utils.misc.create_instance. The diff shows unrelated changes, but those should go away after #3348 is merged.
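Roughly, the scheduler-side call then becomes a single helper invocation instead of a hand-rolled precedence chain (a simplified sketch; build_dupefilter is a made-up wrapper name, not code from the PR):

```python
from scrapy.utils.misc import create_instance, load_object


def build_dupefilter(crawler):
    # create_instance applies the usual precedence: from_crawler if a
    # crawler is given, then from_settings, then the plain constructor.
    dupefilter_cls = load_object(crawler.settings['DUPEFILTER_CLASS'])
    return create_instance(dupefilter_cls, crawler.settings, crawler)
```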

@kmike kmike added this to the v1.6 milestone Jul 25, 2018
@dangra (Member) commented on Jul 25, 2018

Hi @elacuesta, can you rebase now that #3348 is merged? Thanks

@elacuesta elacuesta force-pushed the elacuesta:dupefilter_from_crawler branch from 139da9e to 999341b Jul 26, 2018
@elacuesta (Member, Author) commented on Jul 26, 2018

@dangra Rebased 👍

@dangra dangra merged commit 93afe18 into scrapy:master Jul 26, 2018
2 checks passed:
- codecov/patch: 100% of diff hit (target 84.34%)
- continuous-integration/travis-ci/pr: The Travis CI build passed
@elacuesta elacuesta deleted the elacuesta:dupefilter_from_crawler branch Jul 26, 2018
@dangra (Member) commented on Jul 26, 2018

Merge +1 sorry 🙄
