
Change extensions/spiders/settings initialisation order, v2 #6038

Merged
merged 30 commits into scrapy:master from change-init-order on Sep 14, 2023

Conversation

@wRAR (Member) commented Sep 5, 2023

Continuation of #1580, without code changes for now.

Several tests fail because you can no longer set TWISTED_REACTOR and LOG_FILE/LOG_LEVEL in custom_settings, which is a problem. We can try moving the reactor installation to crawl(), but the logging problem was discussed in the old PR and doesn't have a good solution.

Also, I think we should deprecate running Crawler.crawl() multiple times in a separate, uncontroversial PR, as proposed in the old one. Done in #6040.

Closes #1580, fixes #1305, fixes #2392, fixes #3663.

codecov bot commented Sep 5, 2023

Codecov Report

Merging #6038 (7da3964) into master (da39fbd) will increase coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head 7da3964 differs from pull request most recent head 6428356. Consider uploading reports for the commit 6428356 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6038      +/-   ##
==========================================
+ Coverage   88.92%   88.95%   +0.02%     
==========================================
  Files         163      163              
  Lines       11538    11567      +29     
  Branches     1877     1877              
==========================================
+ Hits        10260    10289      +29     
  Misses        969      969              
  Partials      309      309              
Files Changed                                 Coverage Δ
scrapy/addons.py                              83.33% <ø> (ø)
scrapy/settings/__init__.py                   95.45% <ø> (ø)
scrapy/commands/shell.py                      93.61% <100.00%> (+0.13%) ⬆️
scrapy/core/engine.py                         85.66% <100.00%> (+0.27%) ⬆️
scrapy/core/scraper.py                        85.18% <100.00%> (+0.15%) ⬆️
scrapy/crawler.py                             88.20% <100.00%> (+0.70%) ⬆️
scrapy/downloadermiddlewares/httpcache.py     94.04% <100.00%> (+0.07%) ⬆️
scrapy/downloadermiddlewares/retry.py         100.00% <100.00%> (ø)
scrapy/dupefilters.py                         83.13% <100.00%> (+0.41%) ⬆️
scrapy/extensions/httpcache.py                95.47% <100.00%> (+0.01%) ⬆️
... and 2 more

@wRAR (Member Author) commented Sep 5, 2023

> We can try moving the reactor installation to crawl()

This seems to work, at least when nothing imports twisted.internet.reactor between creating a Crawler instance (or a CrawlerProcess instance) and calling crawl() on it.

@wRAR (Member Author) commented Sep 6, 2023

Direct link to the logging situation: #1580 (comment)

In my experience, changing logging per spider is fairly common, e.g. in a large project where only some spiders need additional debugging (see the sketch below). It can also be done via -s, and that should still be supported here.
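
For illustration, the per-spider pattern at stake looks like this (a minimal sketch; the spider name is made up):

import scrapy

class NoisySpider(scrapy.Spider):
    # hypothetical spider: only this one runs with verbose logging,
    # the per-spider override that breaks if LOG_LEVEL can't be set here
    name = "noisy_spider"
    custom_settings = {"LOG_LEVEL": "DEBUG"}

    def start_requests(self):
        self.logger.debug("per-spider DEBUG logging is active")
        return []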

@wRAR (Member Author) commented Sep 6, 2023

Maybe we just need both: a classmethod that is called early, for backward compatibility and for modifying things that need to be set early, and a new instance method that is called late.

In any case, it's not clear when the add-ons should modify the settings.

@wRAR (Member Author) commented Sep 6, 2023

How about this:

  • we don't touch update_settings(): it's still a classmethod, it's still called early, and it can still change any settings, including ADDONS and LOG_LEVEL;
  • reactor initialization is kept in Crawler.__init__(), unless we also want to be able to affect it in spider __init__();
  • component initialization and settings freezing are still moved to Crawler.crawl();
  • spider __init__() can now change the settings based on arguments, instance-specific logic etc., and it happens before component initialization, so most settings can be changed without problems.

I'm not sure where the add-on processing should happen here; probably after spider __init__(), in which case all code paths can modify ADDONS. I'm also not sure how bad it is that settings modified by the add-ons themselves wouldn't be visible in all code paths.
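
A rough sketch of that order, heavily simplified (helper names like _install_reactor() and _create_engine() are illustrative, not the actual PR code):

class Crawler:
    def __init__(self, spidercls, settings):
        self.settings = settings.copy()
        # still early, still a classmethod: may change any setting,
        # including ADDONS and LOG_LEVEL
        spidercls.update_settings(self.settings)
        self.spidercls = spidercls
        # reactor initialization stays here, unless spider __init__()
        # should also be able to affect it
        self._install_reactor()  # illustrative helper

    def crawl(self, *args, **kwargs):
        # spider __init__() runs first, so it can change settings based
        # on arguments, instance-specific logic etc.
        self.spider = self.spidercls.from_crawler(self, *args, **kwargs)
        # add-on processing would go somewhere around here
        # only now: initialize components and freeze the settings
        self.settings.freeze()
        self.engine = self._create_engine()  # illustrative helper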

@Gallaecio (Member) commented Sep 6, 2023

It sounds good to me in general, although I am not 100% sure.

I imagine some existing __init__ code might expect some components to exist beforehand, to access them from the crawler, but I am not sure how common that is. And since signals can be used to get another spider method called later on, maybe it is worth the backward-incompatible change?

Also, changing arguments based on final settings would not be possible before __init__ with this approach, since __init__ would be the one taking care of changing settings based on arguments. If we consider argument processing out of scope, then this is a non-issue. But what we choose here could limit what we can do when implementing argument processing support later on.

@wRAR (Member Author) commented Sep 7, 2023

> I imagine some existing __init__ code might expect some components to exist beforehand

Note that this doesn't work in the current PR code either, as ExtensionManager is initialized later.

> changing arguments based on final settings would not be possible before __init__ with this approach, since __init__ would be the one taking care of changing settings based on arguments

Not sure what this use case is?

@Gallaecio (Member) commented Sep 7, 2023

> changing arguments based on final settings would not be possible before __init__ with this approach, since __init__ would be the one taking care of changing settings based on arguments
>
> Not sure what this use case is?

I am thinking of a future where we implement argument validation and conversion as an opt-in feature by exposing a setting for a new type of Scrapy component, in line with the fingerprinter setting, e.g. ARG_PARSER. The default parser could keep the current behavior, i.e. leave the arguments as they are, while a new parser could implement something like pydantic to error out in case of bad input, or convert strings to dict or int.

My thinking is that this setting-defined component, which could have settings of its own to control its behavior (e.g. a setting to disable exiting on error and instead only log a warning), should be executed before __init__, so that __init__ gets those arguments already processed, or is not called at all if argument processing exits early.

To enable such a future scenario, I am thinking we should have the following order (a sketch follows the list):

  1. existing settings editing
  2. (future) argument parsing based on settings
  3. setting editing based on (parsed) arguments
  4. __init__ getting final settings and arguments.
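
A purely hypothetical sketch of such a component, just to make the idea concrete (neither ARG_PARSER, ARG_PARSER_STRICT, nor any of these names exist in Scrapy today):

class PydanticArgParser:
    """Hypothetical component resolved from an ARG_PARSER setting,
    analogous to how the fingerprinter component is resolved."""

    @classmethod
    def from_crawler(cls, crawler):
        # e.g. a setting of its own to warn instead of erroring out
        return cls(strict=crawler.settings.getbool("ARG_PARSER_STRICT", True))

    def __init__(self, strict=True):
        self.strict = strict

    def parse_args(self, spidercls, kwargs):
        model = getattr(spidercls, "ArgsModel", None)  # hypothetical opt-in hook
        if model is None:
            return kwargs  # default behavior: leave the arguments as they are
        # pydantic-v1-style validation: raises on bad input, converts
        # strings to int, dict, etc.
        return model(**kwargs).dict()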

@wRAR (Member Author) commented Sep 7, 2023

Another thing I've just found: Spider.__init__() does not have access to the crawler, since spider.crawler and spider.settings are set in from_crawler() after the instance is created. So the place to access them is an overridden from_crawler(), not an overridden __init__() (or some new method that the default from_crawler() could call on the instance at the end).
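
For example, settings-dependent spider initialization currently needs to look like this (a minimal sketch; the spider and attribute names are made up):

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"  # hypothetical

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider.crawler and spider.settings exist only from this point
        # on, not inside __init__()
        spider.timeout = spider.settings.getint("DOWNLOAD_TIMEOUT")
        return spider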

@wRAR (Member Author) commented Sep 7, 2023

There is a problem with moving the fingerprinter initialization inside crawl(): too many tests assume that crawler.request_fingerprinter exists immediately after the crawler is created.

I wonder if we should split out a "prepare" method of Crawler that creates the spider instance etc. but doesn't start the engine, though I don't know whether it would be useful outside tests.

@wRAR (Member Author) commented Sep 13, 2023

PyPy tests fail because on PyPy exceptions raised in Crawler.crawl() are silently discarded or handled somewhere, while on CPython they are unhandled and show up in the log. This happens on master as well.

        if get_scrapy_root_handler() is not None:
            # scrapy root handler already installed: update it with new settings
            install_scrapy_root_handler(self.settings)

    def _apply_settings(self) -> None:
        if self._settings_loaded:
Member: Can it be simplified/improved to if self.settings.frozen?

Member: Also, if it's called when the settings are frozen, should it be a warning?

Member Author: Or an error?

Member: Right, an error is even better, if there is no use case for running this method twice.

Member Author: Though when designing it I thought that we needed a check so that we could safely run it twice, e.g. directly and then via crawl(). This only matters in tests, as users shouldn't run _apply_settings() directly, but either a warning or an error would affect those tests: we do have tests that call get_crawler() (which calls _apply_settings() so that you get an initialized crawler) and then call crawl().

Member: This method is private, so it looks OK.
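
For reference, the guard under discussion boils down to something like this (a simplified sketch, not the verbatim PR code):

def _apply_settings(self) -> None:
    if self._settings_loaded:
        # already applied, e.g. directly via get_crawler() in a test and
        # then again via crawl(); return silently so both paths are safe
        return
    # ... initialize components from self.settings ...
    self.settings.freeze()
    self._settings_loaded = True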

@wRAR (Member Author) commented Sep 13, 2023

> PyPy tests fail because on PyPy exceptions raised in Crawler.crawl() are silently discarded or handled somewhere, while on CPython they are unhandled and show up in the log. This happens on master as well.

We rely on the "Unhandled error in Deferred" messages that are printed when such Deferreds are garbage-collected (we call CrawlerProcess.crawl() without storing the Deferred it returns), and on PyPy garbage collection works differently, so they are not printed. In tests where we expect an unhandled error we should therefore add an explicit errback instead.

@wRAR closed this Sep 13, 2023
@wRAR reopened this Sep 13, 2023
@@ -24,5 +24,6 @@ def start_requests(self):
             "ASYNCIO_EVENT_LOOP": "uvloop.Loop",
         }
     )
-    process.crawl(NoRequestsSpider)
+    d = process.crawl(NoRequestsSpider)
+    d.addErrback(lambda failure: failure.printTraceback())
Member: I wonder if we should do this by default for CrawlerRunner, maybe with a CrawlerRunner.__init__ flag to disable the printing.

Member: Though let's not solve it here :)

Member: The issue here is that we're not following https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script in the tests because it's problematic. So it may be problematic for our users as well.

Member Author: I agree!

@kmike (Member) commented Sep 13, 2023

The PR looks good to me, but we should make the tests pass.

@wRAR (Member Author) commented Sep 13, 2023

failure.printTraceback() doesn't work on Twisted 18.9.0 on Python 3.8+. This was fixed in Twisted 19.7.0, and the first Twisted version that officially supports Python 3.8 is even higher, 21.2.0. We could replace the logging code in these tests, or we could bump the minimal Twisted version, even though spiders do work with 18.9.0.

@wRAR (Member Author) commented Sep 13, 2023

Though it looks like twisted.python.log.err() still prints the exception we need, so we can just switch to it.
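
The change in the tests would then look roughly like this (sketch):

from twisted.python import log

d = process.crawl(NoRequestsSpider)
# log.err() accepts a Failure directly, so it works as an errback and,
# unlike Failure.printTraceback(), also behaves on Twisted 18.9.0
d.addErrback(log.err)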

@wRAR (Member Author) commented Sep 13, 2023

The tests now pass.

@kmike (Member) left a review: 🚀

@wRAR merged commit dba3767 into scrapy:master on Sep 14, 2023
26 checks passed
@wRAR deleted the change-init-order branch on September 14, 2023 07:44