Change extensions/spiders/settings initialisation order, v2 #6038
Conversation
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #6038 +/- ##
==========================================
+ Coverage 88.92% 88.95% +0.02%
==========================================
Files 163 163
Lines 11538 11567 +29
Branches 1877 1877
==========================================
+ Hits 10260 10289 +29
Misses 969 969
Partials 309 309
This seems to work, at least when nothing imports twisted.internet.reactor between creating a Crawler instance (or a CrawlerProcess instance) and calling crawl() on it.
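A minimal sketch of the ordering being described, with a placeholder spider and settings (not code from this PR): placing the reactor import between constructing the process and calling crawl() installs the default reactor first, so the TWISTED_REACTOR setting can no longer take effect.

```python
from scrapy import Spider
from scrapy.crawler import CrawlerProcess


class NoRequestsSpider(Spider):
    name = "no_requests"

    def start_requests(self):
        return iter(())


process = CrawlerProcess(
    settings={
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }
)

# Importing the reactor between creating the process and calling crawl()
# installs the default reactor, which conflicts with the setting above.
from twisted.internet import reactor  # noqa: E402,F401

process.crawl(NoRequestsSpider)
process.start()
```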
Direct link to the logging situation: #1580 (comment) In my experience changing logging per-spider is often done, e.g. if you have a large project where only some spiders need additional debugging. It can also be done via
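For illustration, the per-spider logging override typically looks like this (spider names are placeholders):

```python
from scrapy import Spider


class RegularSpider(Spider):
    name = "regular"
    # uses the project-wide LOG_LEVEL


class DebuggedSpider(Spider):
    name = "debugged"
    # only this spider needs extra debugging output
    custom_settings = {
        "LOG_LEVEL": "DEBUG",
        "LOG_FILE": "debugged.log",
    }
```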
Maybe we just need to have both a classmethod that is called early, for backward compatibility and for modifying things that need to be set early, and a new instance method that is called late. In any case it's not clear when the addons should modify the settings.
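Roughly, that split could look like the sketch below; the early classmethod corresponds to the existing update_settings() hook, while the late instance method's name and signature are made up for illustration:

```python
from scrapy import Spider


class MySpider(Spider):
    name = "example"

    @classmethod
    def update_settings(cls, settings):
        # early hook: runs before logging and the reactor are set up,
        # so LOG_FILE, TWISTED_REACTOR, etc. can still be changed here
        super().update_settings(settings)
        settings.set("LOG_LEVEL", "DEBUG", priority="spider")

    def update_settings_late(self, settings):
        # hypothetical late hook: runs once the spider instance exists,
        # so it can look at spider arguments, but some settings have
        # already been applied by then
        if getattr(self, "slow", False):
            settings.set("CONCURRENT_REQUESTS", 1, priority="spider")
```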
How about this:
Not sure where the addon processing should happen here, probably after the spider
It sounds good to me in general, although I am not 100% sure. I imagine some existing
Also, changing arguments based on final settings would not be possible before
Note that this doesn't work in the current PR code either, as
Not sure what this use case is?
I am thinking of a future where we implement argument validation and conversion as an opt-in feature by exposing a setting for a new type of Scrapy component, in line with the fingerprinter setting, e.g.

My thinking is that this setting-defined component, which could have settings of its own to control its behavior (e.g. a setting to disable exiting on error to instead only log a warning), should be executed before

To enable such a future scenario, I am thinking we should have the following order:
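As a rough illustration of such a component (the setting names, class and method below are hypothetical, not existing Scrapy API, merely in the spirit of the fingerprinter setting):

```python
class SpiderArgumentHandler:
    """Hypothetical component that a made-up setting such as
    SPIDER_ARGUMENT_HANDLER could point to."""

    def __init__(self, settings):
        # it could have settings of its own, e.g. to log a warning
        # instead of exiting on error
        self.exit_on_error = settings.getbool("SPIDER_ARGUMENT_EXIT_ON_ERROR", True)

    def process_arguments(self, spidercls, kwargs):
        # validate and convert spider arguments before the spider is built
        if "max_pages" in kwargs:
            try:
                kwargs["max_pages"] = int(kwargs["max_pages"])
            except ValueError:
                if self.exit_on_error:
                    raise
                # otherwise only log a warning and keep the raw value
        return kwargs
```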
Another thing I've just found:
There is a problem with moving the fingerprinter initialization inside
I wonder if we want to split out a "prepare" method of
PyPy tests fail because on PyPy exceptions raised in
scrapy/crawler.py (Outdated)
        if get_scrapy_root_handler() is not None:
            # scrapy root handler already installed: update it with new settings
            install_scrapy_root_handler(self.settings)

    def _apply_settings(self) -> None:
        if self._settings_loaded:
Can it be simplified/improved to if self.settings.frozen?
Also, if it's called when the settings are frozen, should it be a warning?
Or an error?
Right, an error is even better, if there is no use case for running this method twice.
Though when designing it I thought that we need a check so that we can safely run it twice, e.g. directly and then via crawl(). This only matters in tests, as users shouldn't run _apply_settings() directly, and either a warning or an error would affect them, as we indeed have tests that call get_crawler() (which calls _apply_settings() so that you can get an initialized crawler) and then call crawl().
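For context, a sketch of the test pattern being referred to (the spider and test body are placeholders; get_crawler() initializing the crawler eagerly is per the comment above):

```python
from twisted.internet.defer import inlineCallbacks
from twisted.trial.unittest import TestCase

from scrapy import Spider
from scrapy.utils.test import get_crawler


class ExampleSpider(Spider):
    name = "example"

    def start_requests(self):
        return iter(())


class CrawlTest(TestCase):
    @inlineCallbacks
    def test_crawl(self):
        # get_crawler() hands back an initialized crawler, i.e. settings
        # have already been applied once...
        crawler = get_crawler(ExampleSpider, {"LOG_LEVEL": "INFO"})
        # ...and crawl() goes through the same initialization path again,
        # so _apply_settings() needs to tolerate a second call.
        yield crawler.crawl()
```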
This method is private, so it looks ok.
We rely on the "Unhandled error in Deferred" messages that are printed when such Deferreds are garbage collected (we call
@@ -24,5 +24,6 @@ def start_requests(self):
         "ASYNCIO_EVENT_LOOP": "uvloop.Loop",
     }
 )
-process.crawl(NoRequestsSpider)
+d = process.crawl(NoRequestsSpider)
+d.addErrback(lambda failure: failure.printTraceback())
I wonder if we should do it by default for CrawlerRunner, maybe with a CrawlerRunner.__init__ flag to disable printing.
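Something along these lines, as a sketch only (the class and flag names are made up):

```python
from scrapy.crawler import CrawlerRunner


class TracebackCrawlerRunner(CrawlerRunner):
    """Hypothetical runner that prints crawl() failures by default."""

    def __init__(self, settings=None, print_failures=True):
        super().__init__(settings)
        self._print_failures = print_failures

    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        d = super().crawl(crawler_or_spidercls, *args, **kwargs)
        if self._print_failures:
            # without this, errors only surface as "Unhandled error in
            # Deferred" when the Deferred is garbage collected
            d.addErrback(lambda failure: failure.printTraceback())
        return d
```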
though let's not solve it here :)
The issue here is that we're not following https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script in the tests because it's problematic. So, it may be problematic for our users as well.
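For reference, the documented pattern is roughly the following (the spider is a placeholder), with no errback attached to the Deferred returned by crawl():

```python
from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class MySpider(Spider):
    name = "my_spider"


process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.start()  # blocks here until the crawling is finished
```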
I agree!
The PR looks good to me, but we should make the tests pass.
Though it looks like using
The tests now pass.
🚀
Continuation of #1580, without code changes for now.

Several tests fail, because you can no longer set TWISTED_REACTOR and LOG_FILE/LOG_LEVEL in custom_settings, which is a problem. We can try moving the reactor installation to crawl(), but the logging problem was discussed in the old PR and doesn't have a good solution.

Also, I think we should deprecate running Crawler.crawl() multiple times as a separate uncontroversial PR, as proposed in the old one. Done: #6040.

Closes #1580, fixes #1305, fixes #2392, fixes #3663.