
Change extensions/spiders/settings initialisation order #1580

Closed

Conversation

jdemaeyer
Contributor

This is an implementation of @chekunkov's suggested changes from #1305. I will be uploading the commits one by one over the next few minutes so we can see which tests break at which point.

@jdemaeyer
Contributor Author

1: Moving extensions' __init__ into Crawler.crawl()

Two tests fail because they essentially test the implementation that we just changed. The functionality they're testing (spider settings) should still be intact.

@jdemaeyer
Contributor Author

2: Moving Spider.update_settings() into Crawler.crawl()

Three new tests fail because we're trying to close a Downloader that is not running. This could also be an issue with the new ExecutionEngine.close() method?

One of the previously failing tests now fails earlier, because the spider settings are not applied in __init__() anymore.

It is no longer possible to change the STATS_CLASS from the spider at this point.
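For illustration, a spider like this (the STATS_CLASS path is hypothetical) would no longer get its custom stats collector, because the stats object is created in Crawler.__init__(), before update_settings() runs:

from scrapy import Spider

class MySpider(Spider):
    custom_settings = {
        # applied too late to affect the crawler's stats object
        'STATS_CLASS': 'myproject.stats.MyStatsCollector',
    }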

@jdemaeyer
Contributor Author

3: Initialise spider before calling its update_settings() method

I have not yet made update_settings() an instance method.

No new failing tests.

@jdemaeyer
Contributor Author

4: Allow update_settings() to be either class or instance method

No new failing tests.

@jdemaeyer
Contributor Author

The tests that fail with "Tried to stop a LoopingCall that was not running." all have two calls to Crawler.crawl().

I think the real problem is that we're trying to write to frozen settings (or that we try to freeze settings twice). The error then manifests because we hit the except clause where the self.engine.close() call tries to close a pre-existing (and already finished) engine.

These three tests pass when I comment out the self.settings.freeze() line.
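For reference, frozen settings reject any further writes (this uses Scrapy's actual Settings API):

from scrapy.settings import Settings

settings = Settings()
settings.set('FOO', 'bar')   # fine while still mutable
settings.freeze()
settings.set('FOO', 'baz')   # raises TypeError: settings are immutable once frozen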

@jdemaeyer jdemaeyer changed the title [WIP] Change extensions/spiders/settings initialisation order Change extensions/spiders/settings initialisation order Nov 2, 2015
@jdemaeyer
Contributor Author

This initialisation order would be great for #1442 (add-on callbacks for spiders) because the spider arguments could easily be mapped into the spider's add-on config.

@chekunkov
Contributor

It is no longer possible to change the STATS_CLASS from the spider at this point.

The same goes for LOG_LEVEL and LOG_FORMATTER. I don't think they are often changed from the spider, but this is inconsistent behaviour, and it sucks. Is there any better way?

@jdemaeyer
Contributor Author

The same goes for LOG_LEVEL and LOG_FORMATTER. I don't think they are often changed from the spider, but this is inconsistent behaviour, and it sucks. Is there any better way?

SPIDER_LOADER_CLASS is another setting that already cannot be changed from the spider. (Of course, that seems more intuitive than not being able to set log settings.)

I don't really see a smart way to keep supporting changing these settings from the spider. It makes sense to set up stats and logging as soon as the crawler is initialised, so I don't think we can move that to crawl(). On the other hand, we cannot move the spider initialisation out of crawl(), because we don't have access to the spider args before then.

For backwards compatibility, we could check in __init__() whether update_settings() is a class method. If it is, we call it there; if not, we call it in crawl(). Not really fond of this though :/
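A minimal sketch of that check (a hypothetical helper built on Python 3's inspect.getattr_static; not code from this PR):

import inspect

def update_settings_is_classmethod(spidercls):
    # True when the spider keeps the classic classmethod form
    return isinstance(
        inspect.getattr_static(spidercls, 'update_settings'), classmethod)

# Crawler.__init__() would call spidercls.update_settings(settings) right
# away when this returns True, and otherwise leave the call to crawl(),
# after the spider instance exists.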

@kmike
Member

kmike commented Nov 3, 2015

What happens between Crawler.__init__ and crawler.crawl() in the common case (creating a crawler in CrawlerRunner/CrawlerProcess)? Is there anything serious that prevents us from moving things from __init__ to crawl freely?

@kmike
Member

kmike commented Nov 3, 2015

In other words, why does the crawler need a logger and stats as soon as it is initialized? Can it do anything useful before the .crawl() call?

@jdemaeyer
Contributor Author

The ExecutionEngine copies crawler.logformatter, so problems arise when the engine is created outside of crawl(), as scrapy shell does.

There are many failing tests when I move the stats & log init into crawl() (see the last commit). Quite a few of them test downloader middlewares: they create a crawler instance, immediately pass it to the middleware's from_crawler() method, and then fail because there is no crawler.stats attribute. In the real Scrapy flow, however, stats would exist by the time from_crawler() is called.
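The failing pattern looks roughly like this (paraphrased; MyDownloaderMiddleware stands in for the middlewares under test):

from scrapy import Spider
from scrapy.utils.test import get_crawler

crawler = get_crawler(Spider)   # runs Crawler.__init__() only
middleware = MyDownloaderMiddleware.from_crawler(crawler)
# from_crawler() reads crawler.stats, which would not exist until
# crawler.crawl() if the stats init moved there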

@chekunkov
Contributor

In other words, why does the crawler need a logger and stats as soon as it is initialized?

@kmike One thing that looks important to me: setting up logging before Spider.__init__. I still remember times when it wasn't possible to log messages during spider initialisation, and that wasn't good - some spiders do quite complex configuration setup there, and logging is essential.

On the other hand, we cannot move the spider initialisation out of crawl(), because we don't have access to the spider args before then.

@jdemaeyer Maybe it makes sense to pass the spider args to Crawler.__init__, so that the spider can be instantiated and the settings filled and frozen there, and Crawler.crawl() would be responsible only for creating the engine and starting the spider with start_requests? crawl() accepts spider args for historical reasons - it used to be possible to run several spiders with the same Crawler instance. That is no longer supported.

@kmike
Member

kmike commented Nov 4, 2015

One thing that looks important to me: setting up logging before Spider.__init__

Spider.__init__ is called in Crawler.crawl, so if we move logging setup to Crawler.crawl then Spider.__init__ can still log messages.

@chekunkov
Contributor

Yes, but if we keep the logging setup before Spider.__init__, we cannot change LOG_LEVEL and LOG_FORMATTER from the spider. If we move the logging setup after Spider.__init__, we can change those settings, but logging won't work from Spider.__init__. I think the first option is better, but I don't like the idea of having special-case settings that cannot be changed from Spider.__init__ because they are used to set up some object before spider creation. On the other hand, @jdemaeyer is right that we already have such special cases (SPIDER_LOADER_CLASS), so maybe that's not a big problem.

I might regret saying this, but now I'm not sure it's a good idea to do settings configuration during/after Spider.__init__ at all. That said, I have an idea to share. If we have access to the spider arguments in Crawler.__init__, we can pass them to a classmethod update_settings(settings, spider_args). A developer could then override Spider.update_settings and change settings based on the spider arguments - that's basically what I wanted in #1305. Does that work for add-ons?
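As a sketch, such a spider might look like this (the spider_args parameter is hypothetical - Scrapy's actual update_settings() receives only the settings object):

from scrapy import Spider

class MySpider(Spider):
    @classmethod
    def update_settings(cls, settings, spider_args):
        settings.setdict(cls.custom_settings or {}, priority='spider')
        # tune settings based on how the spider was invoked
        if 'close_after' in spider_args:
            settings.set('CLOSESPIDER_TIMEOUT',
                         int(spider_args['close_after']), priority='spider')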

@chekunkov
Contributor

Another option is to do a better job of separating "core" settings which must not be changed by the spider from extensions/middlewares/pipelines/addons settings that can be changed by the spider.

@jdemaeyer
Contributor Author

If we have access to the spider arguments in Crawler.__init__, we can pass them to a classmethod update_settings(settings, spider_args). A developer could then override Spider.update_settings and change settings based on the spider arguments - that's basically what I wanted in #1305. Does that work for add-ons?

That would work for add-ons. I think changing the Crawler API is quite delicate, though. If we don't do it backwards-compatibly, it's going to break everything that runs spiders by means other than the Scrapy command-line tool. If we do keep it backwards-compatible, e.g. by giving the spider args to both __init__() and crawl(), it feels somewhat dirty: spiders that use the spider args in their update_settings() class method will behave differently depending on whether the running tool implements the new API or not.

Another option is to do a better job of separating "core" settings which must not be changed by the spider from extensions/middlewares/pipelines/addons settings that can be changed by the spider.

An issue with this is that the core settings could then not be changed in stand-alone spiders at all (except through the command line). This is still the option I like most, though.

crawl() accepts spider args for historical reasons - it used to be possible to run several spiders with the same Crawler instance. That is no longer supported.

Hmm, but running the same spider twice (in the same crawler) with different spider args is still supported (at least some tests do it).

@nramirezuy
Contributor

Some settings don't even make sense to change on the Spider. LOG_LEVEL: you can create a new logger and use that. SPIDER_LOADER_CLASS: you are already in the spider. Having core settings is too much effort for something that doesn't make sense to change at that stage.

Maintaining backwards compatibility for people using Scrapy as a library doesn't make much sense to me, and most of it isn't even documented. I feel these people don't need to update too often either, because they aren't using most of the sugar we add. As for patches on the Downloader and things like that, they can easily monkey-patch those.

Running two spiders on the same Crawler is something we don't support; if there are hacks to do so, we certainly shouldn't care about them.

@kmike
Member

kmike commented Nov 6, 2015

Another option is to do a better job of separating "core" settings which must not be changed by the spider from extensions/middlewares/pipelines/addons settings that can be changed by the spider.

I like the idea of separating settings, or at least documenting what can be changed per-spider. There are a few other settings that are global, e.g. the thread pool size.

Running two spiders on the same Crawler is something we don't support; if there are hacks to do so, we certainly shouldn't care about them.

+1

@codecov-io

Current coverage is 82.95%

Merging #1580 into master will increase coverage by +0.07% as of e238733


@jdemaeyer
Contributor Author

If we can settle for:

  • some settings being unchangeable from the spider (ones that shouldn't really be per-spider anyway, essentially only the logging and stats ones), and
  • a little backwards incompatibility (Crawler.extensions no longer being available right after Crawler.__init__()),

then initialising the spider and extensions in crawl(), as originally proposed in #1305, seems the most elegant solution to me. Bonus points: there are no changes to the call signatures of Crawler. The PR is now back to this state (i.e. it only moves three lines from __init__() to crawl() in Crawler).
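Roughly, crawl() then ends up looking like this (a paraphrase of the change against the Crawler of the time, not the literal diff):

@defer.inlineCallbacks
def crawl(self, *args, **kwargs):
    self.crawling = True
    self.spider = self._create_spider(*args, **kwargs)
    self.spider.update_settings(self.settings)              # moved from __init__()
    self.settings.freeze()                                  # moved from __init__()
    self.extensions = ExtensionManager.from_crawler(self)   # moved from __init__()
    self.engine = self._create_engine()
    start_requests = iter(self.spider.start_requests())
    yield self.engine.open_spider(self.spider, start_requests)
    yield defer.maybeDeferred(self.engine.start)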

I've overhauled the tests that had multiple calls to the same crawler's crawl() (see #1587), and updated the spider settings tests.

I wonder if we should make Spider.update_settings() an instance method instead of a class method. This:

class MySpider(Spider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # ignored by update_settings(), which reads the class's custom_settings
        self.custom_settings.update({'XY': 'blah'})

won't work otherwise, since update_settings() will read the class's custom_settings, not the instance's.

@nramirezuy
Contributor

@jdemaeyer That last change would totally fix using custom_settings when subclassing.

@chekunkov
Contributor

some settings being unchangeable from the spider

Sounds OK if we document this properly.

a little backwards incompatibility (Crawler.extensions no longer being available right after Crawler.__init__())

Sounds OK as well - I can't think of a case where crawler.extensions might be needed before crawler.crawl().

I wonder if we should make Spider.update_settings() an instance method instead of a class method

In addition to your example, it will be possible to update settings this way:

class MySpider(Spider):
    def update_settings(self, settings):
        # for simplicity - imagine that value is validated in __init__
        if self.close_after:
            # intentionally omitted priority='spider'
            settings.set('CLOSESPIDER_TIMEOUT', self.close_after)

so one can ignore custom_settings and work with the settings object directly - I would definitely use that instead of custom_settings, because it's simpler and gives much more flexibility.

@felipeboffnunes
Member

@kmike this is fairly stale at this point. Given the seven years of changes to the tool, is this still a relevant PR, or could we close it?

@jdemaeyer
Contributor Author

Wow, what a sweet rush of nostalgia washing over me 😅
