[WIP] Changes to crawler API #873
Conversation
I think SpiderManager is quite different from CrawlerRunner: SpiderManager is a class for loading spiders by name, somewhat similar to unittest.TestLoader, while CrawlerRunner is currently a class for combining multiple crawlers. I'd rename SpiderManager to SpiderLoader or SpiderFinder to make it clear; IMHO "Manager" doesn't provide any insight into what the class is actually for.

I don't understand why CrawlerRunner is not merged with CrawlerProcess. CrawlerRunner is not useful by itself: its only use in Scrapy is scrapy.utils.test.get_crawler, and the fact that the only function which uses CrawlerRunner needs to call a private method, and that CrawlerRunner is only used to get a Crawler, shows that the CrawlerRunner API doesn't make sense. It seems this function instantiates Crawler via CrawlerRunner only because CrawlerRunner sets up logging; this should be unnecessary once we switch to stdlib logging or move logging initialization to the crawler class (as you suggested).
It'd be great to reduce SpiderManager use as much as possible. I think that the task of loading a Spider by name from a Scrapy project is not essential; internals should work with Spider classes directly as much as possible. If I read SEP-19 correctly, this idea was there :) What about removing CrawlerRunner.spiders? Caller code would be responsible for loading the spider manager class from settings (optional), instantiating it (optional) and getting spider classes.
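To make the caller-side responsibility concrete, here is a minimal sketch of loading a class from a dotted path stored in settings. The `SPIDER_LOADER_CLASS` setting name and the `load_object` helper are illustrative stand-ins, not Scrapy's actual API at the time of this discussion:

```python
# Minimal sketch: the caller loads the spider manager class from a
# dotted path in its settings, instead of CrawlerRunner doing it.
# 'SPIDER_LOADER_CLASS' and load_object() are illustrative names.
from importlib import import_module


def load_object(path):
    """Import and return an object given its dotted path."""
    module_path, _, name = path.rpartition('.')
    return getattr(import_module(module_path), name)


# stand-in class, used only so the example is self-contained
settings = {'SPIDER_LOADER_CLASS': 'collections.OrderedDict'}
loader_cls = load_object(settings['SPIDER_LOADER_CLASS'])
```

With this split, CrawlerRunner never needs to know about spider-by-name lookup at all; it only ever receives spider classes.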
@kmike I think CrawlerRunner is there to support multiple Crawlers some day. It is mostly there to be used as a library; scrapy.cmd will always support a single crawler.
2, 3, 4. I agree on these; they will allow easier usage from a script.
Basically @dangra is proposing a mutation of
@nramirezuy You're right that CrawlerRunner is a documented way to run Crawlers, I've missed that. It is documented like this:
If I'm not mistaken, this is how it can be written without CrawlerRunner:
@kmike: because CrawlerRunner also:

1. Creates the Crawler after merging spider settings

I think the above are common enough to be grouped into a single helper object.
What doesn't belong in the crawler module is the reactor handling; it is only used by the command line so far.
I have no better name for CrawlerRunner, and I admit that CrawlerManager's similarity to SpiderManager is a weak excuse :)
Internals work with Spider classes and especially with their instances. The command line and logs use the spider name, but only because the name is a required attribute of the spider object.
We agree on moving to stdlib logging; the idea is to create a logger per crawler and attach a handler to the root.
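That logging plan can be sketched with the stdlib `logging` module alone; the logger names below are illustrative, not the actual hierarchy Scrapy ended up with:

```python
# Sketch of "a logger per crawler, one handler attached to the root",
# using only stdlib logging. Logger names are illustrative.
import logging

# a single handler on the root logger receives records from all crawlers
root_handler = logging.StreamHandler()
logging.getLogger().addHandler(root_handler)

# each crawler gets its own child logger; records propagate up to the root
crawler_logger = logging.getLogger('scrapy.crawler.example_spider')
crawler_logger.setLevel(logging.INFO)
crawler_logger.info('crawler started')  # emitted through the root handler
```

Because child loggers propagate to the root by default, per-crawler loggers need no handlers of their own, and a user can still filter or redirect a single crawler's output by name.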
/cc @curita
I like SpiderFinder.
Such a helper object makes sense, thanks for the explanation. But IMHO (2), "instantiate the spidermanager and hold a reference", doesn't belong in CrawlerRunner, because SpiderManager assumes there is a Scrapy project, and it'd be great to be able to use CrawlerRunner without one.
You have a point here. I'll keep it in mind.
There are a lot of suggestions; I'll try to summarize them:

CrawlerRunner:

CrawlerProcess:

SpiderManager:
What do you think about allowing setting overrides in
@nramirezuy I don't know how much granularity of configuration we want, but it's possible. There's an issue about that on the spider settings, though. If settings are overridden in the
Something like this may work:

```python
SETTINGS_PRIORITIES = {
    'default': 0,
    'command': 10,
    'project': 20,
    'downloader_middleware': 27,
    'spider_middleware': 28,
    'extension': 29,
    'spider': 30,
    'cmdline': 40,
}
```
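One way to act on such a mapping is to keep, per setting, the value set at the highest priority. A small sketch under that assumption; the `SettingsAttribute` class here is illustrative, not the actual implementation:

```python
# Sketch: each setting remembers the priority it was set at, and a new
# value only sticks if its priority is at least as high.
SETTINGS_PRIORITIES = {
    'default': 0, 'command': 10, 'project': 20,
    'downloader_middleware': 27, 'spider_middleware': 28,
    'extension': 29, 'spider': 30, 'cmdline': 40,
}


class SettingsAttribute:
    """Holds a value plus the priority it was set at (illustrative)."""

    def __init__(self):
        self.value = None
        self.priority = -1

    def set(self, value, priority_name):
        priority = SETTINGS_PRIORITIES[priority_name]
        if priority >= self.priority:  # higher (or equal) priority wins
            self.value = value
            self.priority = priority


attr = SettingsAttribute()
attr.set(16, 'default')
attr.set(32, 'spider')
attr.set(8, 'project')   # ignored: 'project' (20) < 'spider' (30)
print(attr.value)  # -> 32
```

With numeric gaps left between the priority levels, new layers (like the per-middleware ones proposed above) can be slotted in later without renumbering everything.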