Extensions, spider and settings initialization order #1305

Closed
chekunkov opened this issue Jun 15, 2015 · 9 comments · Fixed by #6038
@chekunkov
Contributor

Current initialization order of the spider, the settings, and the various middlewares and extensions:

Crawler.__init__ (see https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L32):

  1. self.spidercls.update_settings(self.settings)
  2. All extensions __init__
  3. self.settings.freeze()

Crawler.crawl (see https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L70):

  1. Spider.__init__
  2. All downloader middlewares __init__
  3. All spider middlewares __init__
  4. All pipelines __init__
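
Condensed into one place, that flow looks roughly like this (a paraphrased sketch of the linked crawler.py; signal/stats setup and engine details omitted):

```python
from scrapy.extension import ExtensionManager

# Paraphrased sketch, not the verbatim source.
class Crawler:
    def __init__(self, spidercls, settings):
        self.spidercls = spidercls
        self.settings = settings.copy()
        self.spidercls.update_settings(self.settings)          # spider's custom_settings
        self.extensions = ExtensionManager.from_crawler(self)  # extensions built early
        self.settings.freeze()                                 # final before the spider exists

    def crawl(self, *args, **kwargs):
        self.spider = self.spidercls.from_crawler(self, *args, **kwargs)  # Spider.__init__
        # ... engine creation follows, building downloader middlewares,
        # spider middlewares and item pipelines from the frozen settings
```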

It's not clear why extensions are initialized during Crawler.__init__ rather than in Crawler.crawl. Is this legacy code left over from the times when it was possible to run several spiders through the same set of extensions and middlewares?

I'm asking because sometimes I want to change some crawl settings after the spider is initialized, and initialize the middlewares only after that. For example, I got a request from a customer to make it possible to set CLOSESPIDER_TIMEOUT based on a spider argument. Due to how CloseSpider is implemented, supporting this means overriding it: disabling the default extension and enabling a custom one in the settings. If the initialization order were

'spider init' -> 'update settings' -> 'settings.freeze' -> 'middlewares init' 

that task would be as easy as setting CLOSESPIDER_TIMEOUT in custom_settings.
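
To make that concrete, a hypothetical spider under the proposed order could look like the sketch below (it assumes update_settings becomes an instance method that applies self.custom_settings after __init__, which is the proposal, not current Scrapy):

```python
import scrapy

class TimeoutSpider(scrapy.Spider):
    name = "timeout_spider"

    def __init__(self, timeout=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if timeout is not None:
            # e.g. started with -a timeout=3600; under the proposed order this
            # runs before update_settings, so the value below would take effect
            self.custom_settings = dict(self.custom_settings or {},
                                        CLOSESPIDER_TIMEOUT=int(timeout))
```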

I'm not talking about command-line usage; -s works fine on the command line, but spiders are often started in other ways - in Scrapy Cloud, or via ScrapyRT - and in those cases it's not always possible to set per-crawl settings. It can also happen that a spider has logic deciding whether some setting should be set based on its arguments; -s doesn't work well there either.

Based on the arguments above, I would like to propose a different initialization order:

Crawler.__init__:

  1. self.settings = settings.copy()

Crawler.crawl:

  1. Spider.__init__
  2. spider.update_settings(self.settings) - note that in this case update_settings isn't required to be a @classmethod
  3. self.settings.freeze()
  4. All extensions __init__
  5. All downloader middlewares __init__
  6. All spider middlewares __init__
  7. All pipelines __init__
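
The same sketch rearranged to the proposed order (hypothetical code, not current Scrapy behaviour):

```python
from scrapy.extension import ExtensionManager

# Hypothetical rearrangement; update_settings is now an instance method.
class Crawler:
    def __init__(self, spidercls, settings):
        self.spidercls = spidercls
        self.settings = settings.copy()   # nothing else happens here

    def crawl(self, *args, **kwargs):
        self.spider = self.spidercls.from_crawler(self, *args, **kwargs)  # Spider.__init__
        self.spider.update_settings(self.settings)  # spider arguments can shape settings
        self.settings.freeze()
        self.extensions = ExtensionManager.from_crawler(self)  # built from final settings
        # ... downloader middlewares, spider middlewares and pipelines follow
```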

What do you think about this proposal?

Discussion on this issue was originally started in #1276 (comment)

@kmike
Member

kmike commented Oct 30, 2015

Moving settings handling out of Crawler.__init__ makes sense; I think it should add flexibility without breaking anything. It'd also make it possible to fix #1280.

Swapping Spider.__init__ with .update_settings is a more serious change: spiders won't be able to read and use the final settings in their __init__ method, and their from_crawler method will become different from the from_crawler methods of all other components. But I don't think that's a serious downside; it's worth changing.

So +1 to this change.

@chekunkov
Contributor Author

spiders won't be able to read and use the final settings in their __init__ method, and their from_crawler method will become different from the from_crawler methods of all other components

Good point. My justification for this difference: the spider acts as a settings producer and should be able to change them, while the other components are settings consumers and shouldn't be able to change settings.

@nramirezuy
Contributor

@chekunkov Scrapy Cloud already implements settings on schedule; no idea about ScrapyRT.
I think this is more an issue of the platform than of Scrapy, but I agree it has its uses.

If we decide on implementing this:

Crawler.crawl:

  1. All pipelines from_crawler
  2. All downloader middlewares from_crawler
  3. All spider middlewares from_crawler
  4. All extensions from_crawler
  5. spider.update_settings(self.settings) - note that in this case update_settings isn't required to be a @classmethod
  6. Spider.from_crawler
  7. self.settings.freeze()

I would rather see it implemented in Spider.from_crawler; it isn't a lot of change, but it kinda helps with sorting - you can remove the arguments from kwargs before they reach __init__ if you want to (see the sketch after this comment). I also think every component should be allowed to change settings: I would like to be sure that CookiesMware is disabled when my CustomCookiesMware is enabled.

I don't think the init order really matters; we can supply different settings scopes and sort it out from there.
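
A sketch of that from_crawler variant (hypothetical spider; it assumes the proposed ordering, where settings are still mutable when Spider.from_crawler runs):

```python
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Pop settings-only arguments so they never reach __init__.
        timeout = kwargs.pop("timeout", None)
        if timeout is not None:
            # Assumes settings are not frozen yet at this point.
            crawler.settings.set("CLOSESPIDER_TIMEOUT", int(timeout),
                                 priority="spider")
        return super().from_crawler(crawler, *args, **kwargs)
```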

@kmike
Member

kmike commented Nov 5, 2015

I also think every component should be allowed to change settings

This is what @jdemaeyer's add-on PR is about: providing a consistent interface for components to change the settings of other components.

@chekunkov
Contributor Author

Don't you think that by allowing one component to change another component's settings, we could end up in a situation where a project is misconfigured and completely broken because of some non-obvious settings conflict, and such issues would be hard to debug?

@jdemaeyer
Contributor

Don't you think that by allowing one component to change another component's settings, we could end up in a situation where a project is misconfigured and completely broken because of some non-obvious settings conflict, and such issues would be hard to debug?

That's what settings priorities (command line > settings.py > settings set by add-ons) and the check_configuration() add-on callback are for :)
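
For reference, here is how the existing settings priorities resolve a conflict (this uses the real Settings API; check_configuration belongs to the add-on proposal and isn't shown):

```python
from scrapy.settings import Settings

s = Settings()
s.set("CLOSESPIDER_TIMEOUT", 100, priority="project")  # e.g. settings.py
s.set("CLOSESPIDER_TIMEOUT", 300, priority="cmdline")  # e.g. -s on the command line
s.set("CLOSESPIDER_TIMEOUT", 50, priority="spider")    # lower priority: ignored
assert s.getint("CLOSESPIDER_TIMEOUT") == 300          # the highest priority wins
```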

@nramirezuy
Contributor

@chekunkov Besides, you can always adopt standards for where settings should be set.

@eLRuLL
Member

eLRuLL commented Mar 4, 2016

What about deprecating extensions? As explained in the documentation, an extension only connects to signals, which can be done from any other component (a middleware, a pipeline, even the spider itself).
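
For context, a typical extension is little more than a signal hookup, as in this sketch of the documented pattern (the class name is illustrative); the same hookup works verbatim from a middleware, a pipeline or a spider:

```python
from scrapy import signals

class SpiderOpenedLogger:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # All the extension does is connect a callback to a signal.
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s", spider.name)
```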

@kmike
Copy link
Member

kmike commented Mar 4, 2016

@eLRuLL You're right, it seems that all extension features could be emulated with middlewares if we make the change suggested in this ticket. Hmm, but deprecating extensions would mean that if you only need to connect signals, you'd have to put that code in a middleware or a spider; in a spider it can't be enabled via a setting, and with a middleware you need to figure out where to put it (in the downloader middlewares? the spider middlewares? with what priority?).
