Description
Current initialization order of the spider, settings, middlewares and extensions:
- `Crawler.__init__` (see https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L32):
  - `self.spidercls.update_settings(self.settings)`
  - All extensions: `__init__`
  - `self.settings.freeze()`
- `Crawler.crawl` (see https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L70):
  - `Spider.__init__`
  - All downloader middlewares: `__init__`
  - All spider middlewares: `__init__`
  - All pipelines: `__init__`
It's not clear why extensions are initialized in `Crawler.__init__` and not in `Crawler.crawl`. Is this legacy code left over from the time when it was possible to run several spiders through the same set of extensions and middlewares?
I'm asking because I sometimes want to change crawl settings after the spider is initialized and initialize the middlewares only after that. For example, a customer asked me to make it possible to set `CLOSESPIDER_TIMEOUT` from a spider argument. Because of how `CloseSpider` is implemented, supporting this means overriding the extension, disabling the default one and registering the custom one in settings. If the initialization order were 'spider init' -> 'update settings' -> 'settings.freeze' -> 'middlewares init', the task would be as easy as setting `CLOSESPIDER_TIMEOUT` in `custom_settings`.
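To make the ordering problem concrete, here is a toy model of the current order in plain Python. It does not use Scrapy: `Settings`, `TimeoutExtension`, `Spider` and `Crawler` below are simplified, hypothetical stand-ins written only for illustration. They mirror the order listed above, nothing more.

```python
class Settings(dict):
    """Toy stand-in for a settings object that can be frozen (not Scrapy's class)."""
    frozen = False

    def set(self, name, value):
        if self.frozen:
            raise TypeError("settings are frozen")
        self[name] = value

    def freeze(self):
        self.frozen = True


class TimeoutExtension:
    """Toy extension: reads its timeout once, when it is constructed."""

    def __init__(self, settings):
        self.timeout = settings.get("CLOSESPIDER_TIMEOUT", 0)


class Spider:
    custom_settings = {}

    def __init__(self, timeout=None):
        # Spider argument, only known once an instance exists.
        self.timeout = timeout

    @classmethod
    def update_settings(cls, settings):
        # Class-level hook: it cannot see instance arguments.
        for name, value in cls.custom_settings.items():
            settings.set(name, value)


class Crawler:
    """Toy crawler following the current order."""

    def __init__(self, spidercls, settings):
        self.spidercls = spidercls
        self.settings = Settings(settings)
        self.spidercls.update_settings(self.settings)
        # The extension is built before any spider instance exists.
        self.extension = TimeoutExtension(self.settings)
        self.settings.freeze()

    def crawl(self, **kwargs):
        self.spider = self.spidercls(**kwargs)
        return self.spider


crawler = Crawler(Spider, {"CLOSESPIDER_TIMEOUT": 0})
spider = crawler.crawl(timeout=60)
# The spider knows the requested timeout, but the extension never sees it,
# and the settings can no longer be changed.
```

In this model the spider argument arrives after the extension has already read the setting and after the settings are frozen, which is why a custom extension is needed as a workaround.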
I'm not talking about command-line usage: `-s` works there. But spiders are often started by other means, such as Scrapy Cloud or ScrapyRT, where it's not always possible to set per-crawl settings. A spider may also have logic that decides, based on its arguments, whether some setting should be set; `-s` doesn't work well in that case either.
Based on the arguments above, I would like to propose a different initialization order:
- `Crawler.__init__`:
  - `self.settings = settings.copy()`
- `Crawler.crawl`:
  - `Spider.__init__`
  - `spider.update_settings(self.settings)` (notice that in this case `update_settings` is not required to be a `@classmethod`)
  - `self.settings.freeze()`
  - All extensions: `__init__`
  - All downloader middlewares: `__init__`
  - All spider middlewares: `__init__`
  - All pipelines: `__init__`
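A rough sketch of how the proposed order could behave, again as a toy model in plain Python rather than actual Scrapy code. All classes here (`Settings`, `TimeoutExtension`, `ArgSpider`, `ProposedCrawler`) are hypothetical stand-ins, and the instance-level `update_settings` is the assumption from the proposal, not the existing API.

```python
class Settings(dict):
    """Toy stand-in for a settings object that can be frozen (not Scrapy's class)."""
    frozen = False

    def set(self, name, value):
        if self.frozen:
            raise TypeError("settings are frozen")
        self[name] = value

    def freeze(self):
        self.frozen = True


class TimeoutExtension:
    """Toy extension: reads its timeout once, when it is constructed."""

    def __init__(self, settings):
        self.timeout = settings.get("CLOSESPIDER_TIMEOUT", 0)


class ArgSpider:
    def __init__(self, timeout=None):
        self.timeout = timeout

    def update_settings(self, settings):
        # Instance method (assumption from the proposal): spider arguments
        # are available here, so they can drive settings.
        if self.timeout is not None:
            settings.set("CLOSESPIDER_TIMEOUT", float(self.timeout))


class ProposedCrawler:
    """Toy crawler following the proposed order."""

    def __init__(self, spidercls, settings):
        self.spidercls = spidercls
        self.settings = Settings(settings)  # copy only, nothing is built yet

    def crawl(self, **kwargs):
        self.spider = self.spidercls(**kwargs)       # 1. spider init
        self.spider.update_settings(self.settings)   # 2. update settings
        self.settings.freeze()                       # 3. freeze
        self.extension = TimeoutExtension(self.settings)  # 4. components init
        return self.spider


crawler = ProposedCrawler(ArgSpider, {"CLOSESPIDER_TIMEOUT": 0})
crawler.crawl(timeout="60")
# The extension now sees the value derived from the spider argument.
```

With this order the stock extension would pick up the per-crawl value without being overridden, since every component is constructed only after the settings are final.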
What do you think about this proposal?
Discussion on this issue was originally started in #1276 (comment)