-
Notifications
You must be signed in to change notification settings - Fork 10.8k
Extensions, spider and settings initialization order #1305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Moving settings handling out of Swapping So +1 to this change. |
good point. my justification for this difference is - spider works like settings producer and it should be able to change them, other components are settings consumers, they shouldn't be able to change settings. |
@chekunkov Scrapy Cloud already implemented settings on schedule. No idea about ScrapyRT. If we decide on implementing this:
I would rather see it implemented on I don't think the init order really matters, we can supply different settings scopes and sort it from there. |
This is what @jdemaeyer's add-on PR is about, to provide a consistent interface for components to change settings of other components. |
Don't you think that by allowing one component to change another component's settings it's possible to end up in situation where project can be misconfigured and completely broken because of some non-obvious settings conflict and it will be hard to debug such issues? |
That's what settings priorities (command line > |
@chekunkov Besides you can always use standards to define where you want to set settings |
what about deprecating extensions? as explained in the documentation, it only connects to signals which can be done with any other component (Middleware, Pipeline, even the own spider). |
@eLRuLL You're right, it seems that all extensions features can be emulated with middleware if we make a change suggested in this ticket. Hm, but deprecating extension would mean that if you only need to connect signals you have to put it in a middleware or a spider; in spider it can't be enabled via an option; in middleware you need to figure out where to put this middleware (to downloader middlewares? to spider middlewares? what priority to use?). |
Current initialization order of spider, settings and all sorts of middlewares and extensions:
Crawler.__init__
(see https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L32):self.spidercls.update_settings(self.settings)
__init__
self.settings.freeze()
Crawler.crawl
(see https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L70):Spider.__init__
__init__
__init__
__init__
It's not clear - why extensions are initialized during
Crawler.__init__
and not inCrawler.crawl
? Is it some legacy code untouched from times when it was possible to run several spider through the same set of extensions and middlewares?I'm asking this because sometimes I feel like I want to change some crawl settings after spider initialization and initialize middlewares only after that. For example I got request from customer to make it possible to set
CLOSESPIDER_TIMEOUT
based on passed spider argument. Due toCloseSpider
implementation to support this I need to override it, disable default and set custom extension in settings. If initialization order wasthat task would be as easy as set CLOSESPIDER_TIMEOUT in
custom_settings
.I don't speak about command line usage,
-s
does work in command line, but spiders often started not via command line - in Scrapy Cloud, ScrapyRT - it's not always possible to set per crawl settings in cases like that. It could also happen that spider has some logic to decide whether or not some setting should be set based on spider arguments - this is also the case when-s
doesn't work well.Based on above arguments I would like to propose different initialization order:
Crawler.__init__
:Crawler.crawl
:Spider.__init__
spider.update_settings(self.settings)
- notice that in this case it isn't required forupdate_settings
to be a@classmethod
self.settings.freeze()
__init__
__init__
__init__
__init__
What do you think about this proposal?
Discussion on this issue was originally started in #1276 (comment)
The text was updated successfully, but these errors were encountered: