Extensions, spider and settings initialization order #1305

Closed
@chekunkov


Current initialization order of the spider, the settings, and the various middlewares and extensions:

Crawler.__init__ (see https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L32):

  1. self.spidercls.update_settings(self.settings)
  2. All extensions __init__
  3. self.settings.freeze()

Crawler.crawl (see https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L70):

  1. Spider.__init__
  2. All downloader middlewares __init__
  3. All spider middlewares __init__
  4. All pipelines __init__

It's not clear why extensions are initialized in Crawler.__init__ rather than in Crawler.crawl. Is this legacy code left over from the time when it was possible to run several spiders through the same set of extensions and middlewares?

I'm asking because I sometimes want to change crawl settings after spider initialization and initialize middlewares only after that. For example, a customer asked me to make it possible to set CLOSESPIDER_TIMEOUT from a spider argument. Because of how the CloseSpider extension is implemented, supporting this means I have to override it, disable the default extension, and register the custom one in settings. If the initialization order were

'spider init' -> 'update settings' -> 'settings.freeze' -> 'middlewares init' 

that task would be as easy as setting CLOSESPIDER_TIMEOUT in custom_settings.
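To make the problem concrete, here is a toy model of the current order in plain Python. It is only a sketch: ToySettings, ToyCloseSpider, ToySpider and ToyCrawler are hypothetical stand-ins, not Scrapy's real classes, and the extension is assumed to read its timeout once at construction time.

```python
# Toy model of the CURRENT order. Every class here is a simplified,
# hypothetical stand-in and does not reproduce Scrapy's real code.

class ToySettings(dict):
    """A dict that rejects item assignment once frozen."""
    frozen = False

    def __setitem__(self, key, value):
        if self.frozen:
            raise TypeError("settings are frozen")
        super().__setitem__(key, value)

    def freeze(self):
        self.frozen = True


class ToyCloseSpider:
    """Stand-in extension: reads its timeout once, when constructed."""

    def __init__(self, settings):
        self.timeout = settings.get("CLOSESPIDER_TIMEOUT", 0)


class ToySpider:
    custom_settings = {}

    @classmethod
    def update_settings(cls, settings):
        # Class-level only: no spider instance (and no arguments) exists yet.
        for key, value in cls.custom_settings.items():
            settings[key] = value

    def __init__(self, timeout=None):
        self.timeout = timeout


class ToyCrawler:
    def __init__(self, spidercls, settings):
        self.spidercls = spidercls
        self.settings = ToySettings(settings)
        self.spidercls.update_settings(self.settings)   # 1. update settings
        self.extension = ToyCloseSpider(self.settings)  # 2. extensions init
        self.settings.freeze()                          # 3. freeze

    def crawl(self, **spider_kwargs):
        self.spider = self.spidercls(**spider_kwargs)   # spider init comes last
        try:
            # Too late: settings are frozen and the extension already read them.
            self.settings["CLOSESPIDER_TIMEOUT"] = self.spider.timeout
        except TypeError:
            pass
        return self.extension.timeout


crawler = ToyCrawler(ToySpider, {})
effective_timeout = crawler.crawl(timeout=60)
```

In this model the spider argument never reaches the extension: by the time the spider instance exists, the settings are frozen and the extension has already been built.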

I'm not talking about command-line usage: -s does work there. But spiders are often started by other means, such as Scrapy Cloud or ScrapyRT, where it's not always possible to set per-crawl settings. A spider may also have logic that decides whether to set some setting based on spider arguments; -s doesn't work well in that case either.

Based on the arguments above, I would like to propose a different initialization order:

Crawler.__init__:

  1. self.settings = settings.copy()

Crawler.crawl:

  1. Spider.__init__
  2. spider.update_settings(self.settings) - note that in this case update_settings no longer needs to be a @classmethod
  3. self.settings.freeze()
  4. All extensions __init__
  5. All downloader middlewares __init__
  6. All spider middlewares __init__
  7. All pipelines __init__
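The proposed order can be sketched with a toy model in plain Python. This is an illustration under stated assumptions, not a patch: ToySettings, ToyCloseSpider, ToySpider and ToyCrawler are hypothetical stand-ins for the real Scrapy classes, and only one extension is modelled in place of steps 4-7.

```python
# Toy model of the PROPOSED order. Every class here is a simplified,
# hypothetical stand-in and does not reproduce Scrapy's real code.

class ToySettings(dict):
    """A dict that rejects item assignment once frozen."""
    frozen = False

    def __setitem__(self, key, value):
        if self.frozen:
            raise TypeError("settings are frozen")
        super().__setitem__(key, value)

    def freeze(self):
        self.frozen = True


class ToyCloseSpider:
    """Stand-in extension: reads its timeout once, when constructed."""

    def __init__(self, settings):
        self.timeout = settings.get("CLOSESPIDER_TIMEOUT", 0)


class ToySpider:
    def __init__(self, timeout=None):
        # Per-instance settings derived from spider arguments.
        self.custom_settings = {}
        if timeout is not None:
            self.custom_settings["CLOSESPIDER_TIMEOUT"] = float(timeout)

    def update_settings(self, settings):
        # A plain instance method: it can see the spider's arguments.
        for key, value in self.custom_settings.items():
            settings[key] = value


class ToyCrawler:
    def __init__(self, spidercls, settings):
        self.spidercls = spidercls
        self.settings = ToySettings(settings)           # 1. copy settings

    def crawl(self, **spider_kwargs):
        self.spider = self.spidercls(**spider_kwargs)   # 1. spider init
        self.spider.update_settings(self.settings)      # 2. update settings
        self.settings.freeze()                          # 3. freeze
        self.extension = ToyCloseSpider(self.settings)  # 4. extensions init
        return self.extension.timeout


project_settings = {"CLOSESPIDER_TIMEOUT": 0}
crawler = ToyCrawler(ToySpider, project_settings)
effective_timeout = crawler.crawl(timeout="60")
```

Here the extension sees the value derived from the spider argument, the project-level settings object is left untouched, and the crawler's copy is frozen before any component reads it.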

What do you think about this proposal?

Discussion on this issue was originally started in #1276 (comment)
