
Change extensions/spiders/settings initialisation order #1580

Closed

Conversation

jdemaeyer
Contributor

This is an implementation of @chekunkov's suggested changes from #1305. I will be uploading the commits one by one over the next few minutes so we can see which tests break at which point.

@jdemaeyer
Contributor Author

1: Moving extensions' __init__ into Crawler.crawl()

Two tests fail because they essentially test the implementation that we just changed. The functionality they're testing (spider settings) should still be intact.

@jdemaeyer
Contributor Author

2: Moving Spider.update_settings() into Crawler.crawl()

Three new tests fail because we're trying to close a Downloader that is not running. This could also be an issue with the new ExecutionEngine.close() method?

One of the previously failing tests now fails earlier, because the spider settings are not applied in __init__() anymore.

It is no longer possible to change the STATS_CLASS from the spider at this point.
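For illustration, a spider like this (the STATS_CLASS path is hypothetical) would no longer get its custom stats collector, because the stats object is created in Crawler.__init__(), before update_settings() runs:

from scrapy import Spider

class MySpider(Spider):
    custom_settings = {
        # applied too late to affect the crawler's stats object
        'STATS_CLASS': 'myproject.stats.MyStatsCollector',
    }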

@jdemaeyer
Contributor Author

3: Initialise spider before calling its update_settings() method

I have not yet made update_settings() an instance method.

No new failing tests.

@jdemaeyer
Contributor Author

4: Allow update_settings() to be either class or instance method

No new failing tests.

@jdemaeyer
Contributor Author

The tests that fail with "Tried to stop a LoopingCall that was not running." all have two calls to Crawler.crawl().

I think the real problem is that we're trying to write to frozen settings (or that we try to freeze settings twice). The error then manifests because we hit the except clause where the self.engine.close() call tries to close a pre-existing (and already finished) engine.

These three tests pass when I comment out the self.settings.freeze() line.
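For reference, frozen settings reject any further writes (this uses Scrapy's actual Settings API):

from scrapy.settings import Settings

settings = Settings()
settings.set('FOO', 'bar')   # fine while still mutable
settings.freeze()
settings.set('FOO', 'baz')   # raises TypeError: settings are immutable once frozen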

@jdemaeyer jdemaeyer changed the title [WIP] Change extensions/spiders/settings initialisation order Change extensions/spiders/settings initialisation order Nov 2, 2015
@jdemaeyer
Contributor Author

This initialisation order would be great for #1442 (add-on callbacks for spiders) because the spider arguments could easily be mapped into the spider's add-on config.

@chekunkov
Contributor

It is no longer possible to change the STATS_CLASS from the spider at this point.

The same goes for LOG_LEVEL and LOG_FORMATTER. I don't think they are often changed from the spider, but this is inconsistent behaviour, and it sucks. Is there any better way?

@jdemaeyer
Contributor Author

The same goes for LOG_LEVEL and LOG_FORMATTER. I don't think they are often changed from the spider, but this is inconsistent behaviour, and it sucks. Is there any better way?

SPIDER_LOADER_CLASS is another setting that already cannot be changed from the spider. (Of course, that seems more intuitive than not being able to set log settings.)

I don't really see a smart way to keep supporting changing these settings from the spider. It makes sense to set up stats and logging as soon as the crawler is initialised, so I don't think we can move that to crawl(). On the other hand, we cannot move the spider initialisation out of crawl(), because we don't have access to the spider args before then.

For backwards compatibility, we could check in __init__() whether update_settings() is a class method. If it is, we call it there; if not, we call it in crawl(). Not really fond of this though :/
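A minimal sketch of that check (a hypothetical helper built on Python 3's inspect.getattr_static; not code from this PR):

import inspect

def update_settings_is_classmethod(spidercls):
    # True when the spider keeps the classic classmethod form
    return isinstance(
        inspect.getattr_static(spidercls, 'update_settings'), classmethod)

# Crawler.__init__() would call spidercls.update_settings(settings) right
# away when this returns True, and otherwise leave the call to crawl(),
# after the spider instance exists.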

@kmike
Member

kmike commented Nov 3, 2015

What happens between Crawler.__init__ and crawler.crawl() in the common case (creating a crawler in CrawlerRunner/CrawlerProcess)? Is there anything serious that prevents us from moving things from __init__ to crawl freely?

@kmike
Member

kmike commented Nov 3, 2015

In other words, why does the crawler need a logger and stats as soon as it is initialized? Can it do anything useful before the .crawl() call?

@jdemaeyer
Contributor Author

The ExecutionEngine copies crawler.logformatter, so problems arise when the engine is created outside of crawl(), as scrapy shell does.

There are many failing tests when I move the stats & log init into crawl() (see the last commit). Quite a few of them test downloader middlewares: they create a crawler instance, immediately pass it to the middleware's from_crawler() method, and then fail because there is no crawler.stats attribute. In the real Scrapy flow, however, stats would exist by the time from_crawler() is called.
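The failing pattern looks roughly like this (paraphrased; MyDownloaderMiddleware stands in for the middlewares under test):

from scrapy import Spider
from scrapy.utils.test import get_crawler

crawler = get_crawler(Spider)   # runs Crawler.__init__() only
middleware = MyDownloaderMiddleware.from_crawler(crawler)
# from_crawler() reads crawler.stats, which would not exist until
# crawler.crawl() if the stats init moved there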

@chekunkov
Contributor

In other words, why does the crawler need a logger and stats as soon as it is initialized?

@kmike One thing that looks important to me: setting up logging before Spider.__init__. I still remember times when it wasn't possible to log messages during spider initialisation, and that wasn't good - some spiders do quite complex configuration setup there, and logging is essential.

On the other hand, we cannot move the spider initialisation out of crawl(), because we don't have access to the spider args before then.

@jdemaeyer Maybe it makes sense to pass the spider args to Crawler.__init__, so that the spider can be instantiated and the settings filled and frozen there, and Crawler.crawl() would be responsible only for creating the engine and starting the spider with start_requests? crawl() accepts spider args for historical reasons - it used to be possible to run several spiders with the same Crawler instance. That is no longer supported.

@kmike
Member

kmike commented Nov 4, 2015

One thing that looks important to me: setting up logging before Spider.__init__

Spider.__init__ is called in Crawler.crawl, so if we move logging setup to Crawler.crawl then Spider.__init__ can still log messages.

@chekunkov
Contributor

Yes, but if we keep the logging setup before Spider.__init__, we cannot change LOG_LEVEL and LOG_FORMATTER from the spider. If we move the logging setup after Spider.__init__, we can change those settings, but logging won't work from Spider.__init__. I think the first option is better, but I don't like the idea of having special-case settings that cannot be changed from Spider.__init__ because they are used to set up some object before spider creation. On the other hand, @jdemaeyer is right that we already have such special cases (SPIDER_LOADER_CLASS), so maybe that's not a big problem.

I might regret saying this, but now I'm not sure it's a good idea to do settings configuration during/after Spider.__init__ at all. That said, I have an idea to share. If we have access to the spider arguments in Crawler.__init__, we can pass them to a classmethod update_settings(settings, spider_args). A developer could then override Spider.update_settings and change settings based on the spider arguments - that's basically what I wanted in #1305. Does that work for add-ons?
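As a sketch, such a spider might look like this (the spider_args parameter is hypothetical - Scrapy's actual update_settings() receives only the settings object):

from scrapy import Spider

class MySpider(Spider):
    @classmethod
    def update_settings(cls, settings, spider_args):
        settings.setdict(cls.custom_settings or {}, priority='spider')
        # tune settings based on how the spider was invoked
        if 'close_after' in spider_args:
            settings.set('CLOSESPIDER_TIMEOUT',
                         int(spider_args['close_after']), priority='spider')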

@chekunkov
Contributor

Another option is to do a better job of separating "core" settings which must not be changed by the spider from extensions/middlewares/pipelines/addons settings that can be changed by the spider.

@jdemaeyer
Contributor Author

If we have access to the spider arguments in Crawler.__init__, we can pass them to a classmethod update_settings(settings, spider_args). A developer could then override Spider.update_settings and change settings based on the spider arguments - that's basically what I wanted in #1305. Does that work for add-ons?

That would work for add-ons. I think changing the Crawler API is quite delicate, though. If we don't do it backwards-compatibly, it's going to break everything that runs spiders by means other than the Scrapy command-line tool. If we do keep it backwards-compatible, e.g. by giving the spider args to both __init__() and crawl(), it feels somewhat dirty: spiders that use the spider args in their update_settings() class method will behave differently depending on whether the running tool implements the new API or not.

Another option is to do a better job of separating "core" settings which must not be changed by the spider from extensions/middlewares/pipelines/addons settings that can be changed by the spider.

An issue with this is that the core settings could then not be changed in stand-alone spiders at all (except through the command line). This is still the option I like most, though.

crawl() accepts spider args for historical reasons - it used to be possible to run several spiders with the same Crawler instance. That is no longer supported.

Hmm, but running the same spider twice (in the same crawler) with different spider args is still supported (at least some tests do it).

@nramirezuy
Contributor

Some settings don't even make sense to change on the Spider. LOG_LEVEL: you can create a new logger and use that. SPIDER_LOADER_CLASS: you are already in the spider. Having core settings is too much effort for something that doesn't make sense to change at that stage.

Maintaining backwards compatibility for people using Scrapy as a library doesn't make much sense to me, and most of it isn't even documented. I feel these people don't need to update too often either, because they aren't using most of the sugar we add. As for patches on the Downloader and things like that, they can easily monkey-patch those.

Running two spiders on the same Crawler is something we don't support; if there are hacks to do so, we certainly shouldn't care about them.

@kmike
Member

kmike commented Nov 6, 2015

Another option is to do a better job of separating "core" settings which must not be changed by the spider from extensions/middlewares/pipelines/addons settings that can be changed by the spider.

I like the idea of separating settings, or at least documenting what can be changed per-spider. There are a few other settings that are global, e.g. the thread pool size.

Running two spiders on the same Crawler is something we don't support; if there are hacks to do so, we certainly shouldn't care about them.

+1

@codecov-io

Current coverage is 82.95%

Merging #1580 into master will increase coverage by +0.07% as of e238733


@jdemaeyer
Contributor Author

If we can settle for:

  • some settings being unchangeable from the spider (ones that shouldn't really be per-spider anyway, essentially only the logging and stats ones), and
  • a little backwards incompatibility (Crawler.extensions no longer being available right after Crawler.__init__()),

then initialising the spider and extensions in crawl(), as originally proposed in #1305, seems the most elegant solution to me. Bonus points: there are no changes to the call signatures of Crawler. The PR is now back to this state (i.e. it only moves three lines from __init__() to crawl() in Crawler).
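Roughly, crawl() then ends up looking like this (a paraphrase of the change against the Crawler of the time, not the literal diff):

@defer.inlineCallbacks
def crawl(self, *args, **kwargs):
    self.crawling = True
    self.spider = self._create_spider(*args, **kwargs)
    self.spider.update_settings(self.settings)              # moved from __init__()
    self.settings.freeze()                                  # moved from __init__()
    self.extensions = ExtensionManager.from_crawler(self)   # moved from __init__()
    self.engine = self._create_engine()
    start_requests = iter(self.spider.start_requests())
    yield self.engine.open_spider(self.spider, start_requests)
    yield defer.maybeDeferred(self.engine.start)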

I've overhauled the tests that had multiple calls to the same crawler's crawl() (see #1587), and updated the spider settings tests.

I wonder if we should make Spider.update_settings() an instance method instead of a class method. This:

class MySpider(Spider):
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # ignored by update_settings(), which reads the class's custom_settings
        self.custom_settings.update({'XY': 'blah'})

won't work otherwise, since update_settings() will read the class's custom_settings, not the instance's.

@nramirezuy
Contributor

@jdemaeyer That last change would totally fix using custom_settings when subclassing.

@chekunkov
Contributor

some settings being unchangeable from the spider

Sounds OK if we document this properly.

a little backwards incompatibility (Crawler.extensions no longer being available right after Crawler.__init__())

Sounds OK as well - I can't think of a case where crawler.extensions might be needed before crawler.crawl().

I wonder if we should make Spider.update_settings() an instance method instead of a class method

In addition to your example, it will be possible to update settings this way:

class MySpider(Spider):
    def update_settings(self, settings):
        # for simplicity - imagine that value is validated in __init__
        if self.close_after:
            # intentionally omitted priority='spider'
            settings.set('CLOSESPIDER_TIMEOUT', self.close_after)

so one can ignore custom_settings and work with the settings object directly - I would definitely use that instead of custom_settings, because it's simpler and gives much more flexibility.

@felipeboffnunes
Member

@kmike this is fairly stale at this point. Given the seven years of changes to the tool, is this still a relevant PR, or could we close it?

@jdemaeyer
Contributor Author

Wow, what a sweet rush of nostalgia washing over me 😅
