
Change extensions/spiders/settings initialisation order, v2 #6038

Merged
merged 30 commits into scrapy:master from change-init-order on Sep 14, 2023

Conversation

@wRAR (Member) commented Sep 5, 2023

Continuation of #1580, without code changes for now.

Several tests fail because you can no longer set TWISTED_REACTOR and LOG_FILE/LOG_LEVEL in custom_settings, which is a problem. We can try moving the reactor installation to crawl(), but the logging problem was discussed in the old PR and doesn't have a good solution.

Also, I think we should deprecate running Crawler.crawl() multiple times in a separate, uncontroversial PR, as proposed in the old one. Done in #6040.

Closes #1580, fixes #1305, fixes #2392, fixes #3663.

codecov bot commented Sep 5, 2023

Codecov Report

Merging #6038 (7da3964) into master (da39fbd) will increase coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head 7da3964 differs from pull request most recent head 6428356. Consider uploading reports for the commit 6428356 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6038      +/-   ##
==========================================
+ Coverage   88.92%   88.95%   +0.02%     
==========================================
  Files         163      163              
  Lines       11538    11567      +29     
  Branches     1877     1877              
==========================================
+ Hits        10260    10289      +29     
  Misses        969      969              
  Partials      309      309              
Files Changed                                 Coverage Δ
scrapy/addons.py                              83.33% <ø> (ø)
scrapy/settings/__init__.py                   95.45% <ø> (ø)
scrapy/commands/shell.py                      93.61% <100.00%> (+0.13%) ⬆️
scrapy/core/engine.py                         85.66% <100.00%> (+0.27%) ⬆️
scrapy/core/scraper.py                        85.18% <100.00%> (+0.15%) ⬆️
scrapy/crawler.py                             88.20% <100.00%> (+0.70%) ⬆️
scrapy/downloadermiddlewares/httpcache.py     94.04% <100.00%> (+0.07%) ⬆️
scrapy/downloadermiddlewares/retry.py         100.00% <100.00%> (ø)
scrapy/dupefilters.py                         83.13% <100.00%> (+0.41%) ⬆️
scrapy/extensions/httpcache.py                95.47% <100.00%> (+0.01%) ⬆️
... and 2 more

@wRAR (Member Author) commented Sep 5, 2023

> We can try moving the reactor installation to crawl()

This seems to work, at least when nothing imports twisted.internet.reactor between creating a Crawler instance (or a CrawlerProcess instance) and calling crawl() on it.

@wRAR (Member Author) commented Sep 6, 2023

Direct link to the logging situation: #1580 (comment)

In my experience, changing logging per spider is fairly common, e.g. in a large project where only some spiders need additional debugging (see the sketch below). It can also be done via -s, and that should still be supported here.
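
For illustration, the per-spider pattern at stake looks like this (a minimal sketch; the spider name is made up):

import scrapy

class NoisySpider(scrapy.Spider):
    # hypothetical spider: only this one runs with verbose logging,
    # the per-spider override that breaks if LOG_LEVEL can't be set here
    name = "noisy_spider"
    custom_settings = {"LOG_LEVEL": "DEBUG"}

    def start_requests(self):
        self.logger.debug("per-spider DEBUG logging is active")
        return []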

@wRAR (Member Author) commented Sep 6, 2023

Maybe we just need both: a classmethod that is called early, for backward compatibility and for modifying things that need to be set early, and a new instance method that is called late.

In any case, it's not clear when the add-ons should modify the settings.

@wRAR (Member Author) commented Sep 6, 2023

How about this:

  • we don't touch update_settings(): it's still a classmethod, it's still called early, and it can still change any settings, including ADDONS and LOG_LEVEL;
  • reactor initialization is kept in Crawler.__init__(), unless we also want to be able to affect it in spider __init__();
  • component initialization and settings freezing are still moved to Crawler.crawl();
  • spider __init__() can now change the settings based on arguments, instance-specific logic etc., and it happens before component initialization, so most settings can be changed without problems.

I'm not sure where the add-on processing should happen here; probably after spider __init__(), in which case all code paths can modify ADDONS. I'm also not sure how bad it is that settings modified by the add-ons themselves wouldn't be visible in all code paths.
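
A rough sketch of that order, heavily simplified (helper names like _install_reactor() and _create_engine() are illustrative, not the actual PR code):

class Crawler:
    def __init__(self, spidercls, settings):
        self.settings = settings.copy()
        # still early, still a classmethod: may change any setting,
        # including ADDONS and LOG_LEVEL
        spidercls.update_settings(self.settings)
        self.spidercls = spidercls
        # reactor initialization stays here, unless spider __init__()
        # should also be able to affect it
        self._install_reactor()  # illustrative helper

    def crawl(self, *args, **kwargs):
        # spider __init__() runs first, so it can change settings based
        # on arguments, instance-specific logic etc.
        self.spider = self.spidercls.from_crawler(self, *args, **kwargs)
        # add-on processing would go somewhere around here
        # only now: initialize components and freeze the settings
        self.settings.freeze()
        self.engine = self._create_engine()  # illustrative helper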

@Gallaecio (Member) commented Sep 6, 2023

It sounds good to me in general, although I am not 100% sure.

I imagine some existing __init__ code might expect some components to exist beforehand, to access them from the crawler, but I am not sure how common that is. And since signals can be used to get another spider method called later on, maybe it is worth the backward-incompatible change?

Also, changing arguments based on final settings would not be possible before __init__ with this approach, since __init__ would be the one taking care of changing settings based on arguments. If we consider argument processing out of scope, then this is a non-issue. But what we choose here could limit what we can do when implementing argument processing support later on.

@wRAR (Member Author) commented Sep 7, 2023

> I imagine some existing __init__ code might expect some components to exist beforehand

Note that this doesn't work in the current PR code either, as ExtensionManager is initialized later.

> changing arguments based on final settings would not be possible before __init__ with this approach, since __init__ would be the one taking care of changing settings based on arguments

Not sure what this use case is?

@Gallaecio (Member) commented Sep 7, 2023

> changing arguments based on final settings would not be possible before __init__ with this approach, since __init__ would be the one taking care of changing settings based on arguments
>
> Not sure what this use case is?

I am thinking of a future where we implement argument validation and conversion as an opt-in feature by exposing a setting for a new type of Scrapy component, in line with the fingerprinter setting, e.g. ARG_PARSER. The default parser could keep the current behavior, i.e. leave the arguments as they are, while a new parser could implement something like pydantic to error out in case of bad input, or convert strings to dict or int.

My thinking is that this setting-defined component, which could have settings of its own to control its behavior (e.g. a setting to disable exiting on error and instead only log a warning), should be executed before __init__, so that __init__ gets those arguments already processed, or is not called at all if argument processing exits early.

To enable such a future scenario, I am thinking we should have the following order (a sketch follows the list):

  1. existing settings editing
  2. (future) argument parsing based on settings
  3. setting editing based on (parsed) arguments
  4. __init__ getting final settings and arguments.
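
A purely hypothetical sketch of such a component, just to make the idea concrete (neither ARG_PARSER, ARG_PARSER_STRICT, nor any of these names exist in Scrapy today):

class PydanticArgParser:
    """Hypothetical component resolved from an ARG_PARSER setting,
    analogous to how the fingerprinter component is resolved."""

    @classmethod
    def from_crawler(cls, crawler):
        # e.g. a setting of its own to warn instead of erroring out
        return cls(strict=crawler.settings.getbool("ARG_PARSER_STRICT", True))

    def __init__(self, strict=True):
        self.strict = strict

    def parse_args(self, spidercls, kwargs):
        model = getattr(spidercls, "ArgsModel", None)  # hypothetical opt-in hook
        if model is None:
            return kwargs  # default behavior: leave the arguments as they are
        # pydantic-v1-style validation: raises on bad input, converts
        # strings to int, dict, etc.
        return model(**kwargs).dict()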

@wRAR (Member Author) commented Sep 7, 2023

Another thing I've just found: Spider.__init__() does not have access to the crawler, since spider.crawler and spider.settings are set in from_crawler() after the instance is created. So the place to access them is an overridden from_crawler(), not an overridden __init__() (or some new method that the default from_crawler() could call on the instance at the end).
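
For example, settings-dependent spider initialization currently needs to look like this (a minimal sketch; the spider and attribute names are made up):

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"  # hypothetical

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider.crawler and spider.settings exist only from this point
        # on, not inside __init__()
        spider.timeout = spider.settings.getint("DOWNLOAD_TIMEOUT")
        return spider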

@wRAR (Member Author) commented Sep 7, 2023

There is a problem with moving the fingerprinter initialization inside crawl(): too many tests assume that crawler.request_fingerprinter exists immediately after the crawler is created.

I wonder if we should split out a "prepare" method of Crawler that creates the spider instance etc. but doesn't start the engine, though I don't know whether it would be useful outside tests.

@wRAR (Member Author) commented Sep 13, 2023

PyPy tests fail because on PyPy exceptions raised in Crawler.crawl() are silently discarded or handled somewhere, while on CPython they are unhandled and show up in the log. This happens on master as well.

        if get_scrapy_root_handler() is not None:
            # scrapy root handler already installed: update it with new settings
            install_scrapy_root_handler(self.settings)

    def _apply_settings(self) -> None:
        if self._settings_loaded:
Member: Can it be simplified/improved to if self.settings.frozen?

Member: Also, if it's called when the settings are frozen, should it be a warning?

Member Author: Or an error?

Member: Right, an error is even better, if there is no use case for running this method twice.

Member Author: Though when designing it I thought that we needed a check so that we could safely run it twice, e.g. directly and then via crawl(). This only matters in tests, as users shouldn't run _apply_settings() directly, but either a warning or an error would affect those tests: we do have tests that call get_crawler() (which calls _apply_settings() so that you get an initialized crawler) and then call crawl().

Member: This method is private, so it looks OK.
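
For reference, the guard under discussion boils down to something like this (a simplified sketch, not the verbatim PR code):

def _apply_settings(self) -> None:
    if self._settings_loaded:
        # already applied, e.g. directly via get_crawler() in a test and
        # then again via crawl(); return silently so both paths are safe
        return
    # ... initialize components from self.settings ...
    self.settings.freeze()
    self._settings_loaded = True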

@wRAR (Member Author) commented Sep 13, 2023

> PyPy tests fail because on PyPy exceptions raised in Crawler.crawl() are silently discarded or handled somewhere, while on CPython they are unhandled and show up in the log. This happens on master as well.

We rely on the "Unhandled error in Deferred" messages that are printed when such Deferreds are garbage-collected (we call CrawlerProcess.crawl() without storing the Deferred it returns), and on PyPy garbage collection works differently, so they are not printed. In tests where we expect an unhandled error we should therefore add an explicit errback instead.

@wRAR closed this Sep 13, 2023
@wRAR reopened this Sep 13, 2023
@@ -24,5 +24,6 @@ def start_requests(self):
             "ASYNCIO_EVENT_LOOP": "uvloop.Loop",
         }
     )
-    process.crawl(NoRequestsSpider)
+    d = process.crawl(NoRequestsSpider)
+    d.addErrback(lambda failure: failure.printTraceback())
Member: I wonder if we should do this by default for CrawlerRunner, maybe with a CrawlerRunner.__init__ flag to disable the printing.

Member: Though let's not solve it here :)

Member: The issue here is that we're not following https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script in the tests because it's problematic. So it may be problematic for our users as well.

Member Author: I agree!

@kmike (Member) commented Sep 13, 2023

The PR looks good to me, but we should make the tests pass.

@wRAR (Member Author) commented Sep 13, 2023

failure.printTraceback() doesn't work on Twisted 18.9.0 on Python 3.8+. This was fixed in Twisted 19.7.0, and the first Twisted version that officially supports Python 3.8 is even higher, 21.2.0. We could replace the logging code in these tests, or we could bump the minimal Twisted version, even though spiders do work with 18.9.0.

@wRAR (Member Author) commented Sep 13, 2023

Though it looks like twisted.python.log.err() still prints the exception we need, so we can just switch to it.
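
The change in the tests would then look roughly like this (sketch):

from twisted.python import log

d = process.crawl(NoRequestsSpider)
# log.err() accepts a Failure directly, so it works as an errback and,
# unlike Failure.printTraceback(), also behaves on Twisted 18.9.0
d.addErrback(log.err)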

@wRAR (Member Author) commented Sep 13, 2023

The tests now pass.

@kmike (Member) left a review: 🚀

@wRAR merged commit dba3767 into scrapy:master on Sep 14, 2023
26 checks passed
@wRAR deleted the change-init-order branch on September 14, 2023 07:44