[WIP] Changes to crawler API #873

Closed
wants to merge 1 commit into from

Conversation

@dangra (Member) commented Sep 2, 2014

open discussion.

  • Rename references to the SPIDER_MANAGER_CLASS instance from spiders to spidermanager.
  • Rename CrawlerRunner to CrawlerManager for consistency with Spider -> SpiderManager.
  • Correctly detect when all crawlers have finished so the reactor can be stopped (new public method CrawlerManager.join()).
  • Allow passing Crawler instances directly to CrawlerManager.crawl().

more proposed changes:

  • Use SPIDERMANAGER_CLASS (one underscore fewer) to point to the implementer of the new interface.
  • Show a more explanatory error message when SPIDER_MANAGER_CLASS is set (there is no support for the old interface).
  • Find a better name for CrawlerProcess and, if possible, do not extend CrawlerManager.
  • Move crawler logging initialization to the Crawler class.

@kmike (Member) commented Sep 3, 2014

I think SpiderManager is quite different from CrawlerRunner: SpiderManager is a class for loading spiders by name, somewhat similar to unittest.TestLoader, while CrawlerRunner is currently a class for combining multiple crawlers. I'd rename SpiderManager to SpiderLoader or SpiderFinder to make this clear. IMHO "Manager" doesn't provide any insight into what the class is actually for.

I don't understand why CrawlerRunner is not merged with CrawlerProcess; CrawlerRunner is not useful by itself. Its only use in Scrapy is scrapy.utils.test.get_crawler, and the fact that the only function using CrawlerRunner needs to call a private method, and that CrawlerRunner is only used to get a Crawler, shows that the CrawlerRunner API doesn't make sense. It seems this function instantiates Crawler via CrawlerRunner only because CrawlerRunner sets up logging. This should be unnecessary once we switch to stdlib logging or move logging initialization to the crawler class (as you suggested).

@kmike (Member) commented Sep 3, 2014

It'd be great to reduce SpiderManager use as much as possible. I think that the task of loading a Spider by name from a Scrapy project is not essential; internals should work with Spider classes directly as much as possible. If I read SEP-19 correctly, this idea was there :) What about removing CrawlerRunner.spiders? Caller code would be responsible for loading the spider manager class from settings (optional), instantiating it (optional) and getting spider classes, as in the sketch below.
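A rough sketch of what that caller code might look like, assuming the manager class keeps a from_settings constructor and that the new interface's load() returns a spider class (both assumptions, not settled in this thread):

from scrapy.utils.misc import load_object
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

# Optional: load and instantiate the spider manager configured in settings.
SpiderManagerCls = load_object(settings['SPIDER_MANAGER_CLASS'])
spidermanager = SpiderManagerCls.from_settings(settings)

# Get the spider class by name; internals would then receive the class directly.
spider_cls = spidermanager.load('followall')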

@nramirezuy (Contributor)

@kmike I think CrawlerRunner is there to support multiple Crawlers some day. It is mostly for using Scrapy as a library; scrapy.cmd will always support a single crawler.

  1. Rename spiders to spidermanager. +1

2, 3, 4. I agree on these; they will allow easier usage from a script.

  • Find a better name for CrawlerProcess and if possible do not extend CrawlerManager. -> CrawlerManager using an instance of CrawlerProcess. Isn't "Process" there because of a Twisted interface?

Basically @dangra is proposing a mutation of CrawlerRunner to CrawlerManager.

@kmike (Member) commented Sep 3, 2014

@nramirezuy You're right that CrawlerRunner is a documented way to run Crawlers; I had missed that.

It is documented like this:

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()

If I'm not mistaken, this is how it can be written without CrawlerRunner:

settings = get_project_settings()

@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield Crawler(FollowAllSpider, settings).crawl(domain=domain)
    reactor.stop()
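For reference, a sketch of the imports both snippets assume (FollowAllSpider stands in for a spider class defined in the project, and the reactor still has to be started with reactor.run()):

from twisted.internet import defer, reactor
from scrapy.crawler import Crawler, CrawlerRunner
from scrapy.utils.project import get_project_settings
# FollowAllSpider would be imported from the project's own spiders module.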

@dangra (Member, Author) commented Sep 3, 2014

@kmike: because CrawlerRunner also

1- Creates the Crawler after merging spider settings.
2- It instantiates the spidermanager and holds a reference.
3- It implements the correct logic to wait for all crawlers (.join()).
4- It implements the logic to stop all running crawlers.

I think the above are common enough to be grouped into a single helper object.
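A rough sketch of how such a helper object would be used from a script, assuming the proposed join() returns a Deferred that fires once every started crawl has finished (the names follow the proposal in this PR, not a released API, and the spider names are made up):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
runner.crawl('spider_one')
runner.crawl('spider_two')

d = runner.join()            # proposed: fires when all crawlers have finished
d.addBoth(lambda _: reactor.stop())
reactor.run()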

@dangra (Member, Author) commented Sep 3, 2014

What doesn't belong in the crawler module is the reactor handling; it is only used by the command line so far.

@dangra (Member, Author) commented Sep 3, 2014

I have no better name for CrawlerRunner, and I admit that CrawlerManager's similarity to SpiderManager is a weak excuse :)

@dangra (Member, Author) commented Sep 3, 2014

It'd be great to reduce SpiderManager use as much as possible. I think that the task of loading Spider by name from a Scrapy project is not essential; internals should work with Spider classes directly as much as possible.

Internals work with Spider classes and especially with their instances. The command line and logs use the spider name, but only because the name is a required attribute of the spider object.

@dangra (Member, Author) commented Sep 3, 2014

It seems this function instantiates Crawler via CrawlerRunner only because CrawlerRunner sets up logging. This should be unnecessary once we switch to stdlib logging or move logging initialization to crawler class (as you suggested).

We agree on moving to stdlib logging; the idea is to create a logger per crawler and attach a handler to the root.
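A minimal sketch of that idea with stdlib logging: one named logger per crawler, with a single handler attached to the root logger (the logger name and format are illustrative):

import logging

# One handler on the root logger; per-crawler records propagate up to it.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s [%(name)s] %(levelname)s: %(message)s'))
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

# Each crawler gets its own named logger.
logger = logging.getLogger('scrapy.crawler.followall')
logger.info('crawler started')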

@dangra (Member, Author) commented Sep 3, 2014

/cc @curita

@dangra (Member, Author) commented Sep 3, 2014

I'd rename SpiderManager to SpiderLoader or SpiderFinder to make this clear. IMHO "Manager" doesn't provide any insight into what the class is actually for.

I like SpiderFinder.

@kmike (Member) commented Sep 4, 2014

CrawlerRunner also
1- Creates the Crawler after merging spider settings.
2- It instantiates the spidermanager and holds a reference.
3- It implements the correct logic to wait for all crawlers (.join()).
4- It implements the logic to stop all running crawlers.
I think the above are common enough to be grouped into a single helper object.

What doesn't belong in the crawler module is the reactor handling; it is only used by the command line so far.

Such a helper object makes sense, thanks for the explanation. tornado.gen supports "join" quite naturally: you yield a list and get results when all tasks are ready, but it seems it is more complicated in Twisted and inlineCallbacks. The stopping logic can also be useful.
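For reference, the closest Twisted counterpart of that tornado pattern is DeferredList, which can also be yielded from inlineCallbacks (a generic sketch, not code from this PR):

from twisted.internet import defer

@defer.inlineCallbacks
def wait_for_all(deferreds):
    # Fires once every deferred in the list has fired, similar to yielding a list in tornado.gen.
    results = yield defer.DeferredList(deferreds)
    defer.returnValue(results)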

But IMHO (2), instantiating the spidermanager and holding a reference, doesn't belong in CrawlerRunner, because SpiderManager assumes there is a Scrapy project, and it'd be great to be able to use CrawlerRunner without one.

@dangra (Member, Author) commented Sep 5, 2014

But IMHO (2), instantiating the spidermanager and holding a reference, doesn't belong in CrawlerRunner, because SpiderManager assumes there is a Scrapy project, and it'd be great to be able to use CrawlerRunner without one.

You have a point here. I'll keep it in mind.

@curita (Member) commented Sep 7, 2014

There are a lot of suggestions; I'll try to summarize them:

CrawlerRunner:

  • Rename references to SPIDER_MANAGER_CLASS instance from spiders to spidermanager
    Yes, it's kind of misleading otherwise.
  • Rename CrawlerRunner to CrawlerManager for consistency with Spider->SpiderManager
    Both are equally bad for different reasons :P I like CrawlerRunner slightly more but I don't have a strong preference.
  • Remove CrawlerRunner.spiders
    It's true that loading spiders isn't necessarily a task for CrawlerRunner and it could live somewhere else, but I think keeping it there simplifies the user API by saving an additional step. SpiderManager assumes there is a project to look spiders up from (at least if the user didn't provide a different manager, which is possible), but if there is no project it doesn't interfere with providing spider instances directly.
  • Correctly detect when all crawlers finished so reactor can be stopped (new public method CrawlerManager.join())
    Agreed; it should return a DeferredList with the crawl_deferreds (like it's done in the _start_reactor method of CrawlerProcess); see the sketch after this list.
  • Allow passing Crawler instances directly to CrawlerManager.crawl()
    +1, but crawlers created outside CrawlerRunner won't have their spider settings populated at initialization time unless it's done manually; maybe that should be documented to avoid unexpected behavior.
  • Move crawler logging initialization to crawler class
    +1
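A possible shape for that join(), assuming the runner keeps the deferreds returned by crawl() in a crawl_deferreds attribute (the attribute name comes from the _start_reactor logic mentioned in the list; this is a sketch, not the final implementation):

from twisted.internet import defer

class CrawlerRunner(object):
    # ... existing __init__ and crawl() that append each crawl's deferred
    # to self.crawl_deferreds ...

    def join(self):
        """Return a Deferred that fires once all running crawls have finished."""
        return defer.DeferredList(self.crawl_deferreds)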

CrawlerProcess:

  • Find a better name for CrawlerProcess and if possible do not extend CrawlerManager
    +1, and I like @nramirezuy's idea of using an instance of CrawlerRunner in CrawlerProcess instead of inheriting it.
  • Move CrawlerProcess to another module, it is only used by the command line so far
    If we change its inheritance from CrawlerRunner, I agree, since there will be no strong dependency, and it's true that it is more of a helper for the ScrapyCommands and the command line.

SpiderManager:

  • Rename SpiderManager to SpiderLoader or SpiderFinder to make it clear
    I'll go with SpiderLoader since it actually has a load method, but I think both options are more descriptive than the current name.
  • Use SPIDERMANAGER_CLASS (one underscore fewer) to point to the implementer of the new interface
    Not exactly sure about this; I prefer keeping its name and providing a meaningful error if the spider manager doesn't respect the new interface.
  • Show a more explanatory error message when SPIDER_MANAGER_CLASS is set (there is no support for the old interface)
    Yes, it should be checked somewhere that the manager implements the zope.interface (maybe when loading it in CrawlerRunner?).

@nramirezuy (Contributor)

What do you think about allowing settings overrides in from_crawler?
It is clearer there, and you can do it from every middleware; we just have to define a new setting scope for each, and it also allows you to define the middleware defaults in the same file. The spider receives its arguments in from_crawler, so it allows changing settings based on spider arguments. This was the idea in SEP-19.

@curita (Member) commented Sep 14, 2014

@nramirezuy I don't know how much granularity in configuration we want, but it's possible. There's an issue with that regarding the spider settings though. If settings are overridden in the spider's from_crawler method, they won't be available when setting up other components while creating crawlers, since spiders are instantiated later in crawler.crawl. Actually, we could avoid what I mentioned about spider settings not being populated when creating crawlers outside CrawlerRunner by moving this population to the crawler's __init__ method. If we add another configuration level we should consider carefully when and for which components these new settings will be available.

@nramirezuy (Contributor)

Something like this may work:

SETTINGS_PRIORITIES = {
    'default': 0,
    'command': 10,
    'project': 20,
    'downloader_middleware': 27,
    'spider_middleware': 28,
    'extension': 29,
    'spider': 30,
    'cmdline': 40,
}
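A hypothetical sketch of how a middleware might use one of those proposed scopes, assuming settings.set() keeps accepting a priority name registered in SETTINGS_PRIORITIES (the middleware class and the overridden setting are made up for illustration):

class MyDownloaderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Middleware-level default; 'spider' and 'cmdline' priorities would still override it.
        crawler.settings.set('DOWNLOAD_DELAY', 0.5, priority='downloader_middleware')
        return cls()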

@dangra deleted the crawler-api branch August 10, 2015 19:13