[WIP] Changes to crawler API #873

Closed
wants to merge 1 commit into from

Conversation

@dangra (Member) commented Sep 2, 2014

open discussion.

  • Rename references to the SPIDER_MANAGER_CLASS instance from spiders to spidermanager.
  • Rename CrawlerRunner to CrawlerManager for consistency with Spider -> SpiderManager.
  • Correctly detect when all crawlers have finished so the reactor can be stopped (new public method CrawlerManager.join()).
  • Allow passing Crawler instances directly to CrawlerManager.crawl().

more proposed changes:

  • Use SPIDERMANAGER_CLASS (one underscore fewer) to point to the implementer of the new interface.
  • Show a more explanatory error message when SPIDER_MANAGER_CLASS is set (there is no support for the old interface).
  • Find a better name for CrawlerProcess and, if possible, do not extend CrawlerManager.
  • Move crawler logging initialization to the Crawler class.

@kmike (Member) commented Sep 3, 2014

I think SpiderManager is quite different from CrawlerRunner: SpiderManager is a class for loading spiders by name, somewhat similar to unittest.TestLoader, while CrawlerRunner is currently a class for combining multiple crawlers. I'd rename SpiderManager to SpiderLoader or SpiderFinder to make this clear. IMHO "Manager" doesn't provide any insight into what the class is actually for.

I don't understand why CrawlerRunner is not merged with CrawlerProcess; CrawlerRunner is not useful by itself. Its only use in Scrapy is scrapy.utils.test.get_crawler, and the fact that the only function using CrawlerRunner needs to call a private method, and that CrawlerRunner is only used to get a Crawler, shows that the CrawlerRunner API doesn't make sense. It seems this function instantiates Crawler via CrawlerRunner only because CrawlerRunner sets up logging. This should be unnecessary once we switch to stdlib logging or move logging initialization to the crawler class (as you suggested).

@kmike (Member) commented Sep 3, 2014

It'd be great to reduce SpiderManager use as much as possible. I think that the task of loading a Spider by name from a Scrapy project is not essential; internals should work with Spider classes directly as much as possible. If I read SEP-19 correctly, this idea was there :) What about removing CrawlerRunner.spiders? Caller code would be responsible for loading the spider manager class from settings (optional), instantiating it (optional) and getting spider classes, as in the sketch below.
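A rough sketch of what that caller code might look like, assuming the manager class keeps a from_settings constructor and that the new interface's load() returns a spider class (both assumptions, not settled in this thread):

from scrapy.utils.misc import load_object
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

# Optional: load and instantiate the spider manager configured in settings.
SpiderManagerCls = load_object(settings['SPIDER_MANAGER_CLASS'])
spidermanager = SpiderManagerCls.from_settings(settings)

# Get the spider class by name; internals would then receive the class directly.
spider_cls = spidermanager.load('followall')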

@nramirezuy (Contributor)

@kmike I think CrawlerRunner is there to support multiple Crawlers some day. It is mostly for using Scrapy as a library; scrapy.cmd will always support a single crawler.

  1. Rename spiders to spidermanager. +1

2, 3, 4. I agree on these; they will allow easier usage from a script.

  • Find a better name for CrawlerProcess and if possible do not extend CrawlerManager. -> CrawlerManager using an instance of CrawlerProcess. Isn't "Process" there because of a Twisted interface?

Basically @dangra is proposing a mutation of CrawlerRunner to CrawlerManager.

@kmike (Member) commented Sep 3, 2014

@nramirezuy You're right that CrawlerRunner is a documented way to run Crawlers; I had missed that.

It is documented like this:

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()

If I'm not mistaken, this is how it can be written without CrawlerRunner:

settings = get_project_settings()

@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield Crawler(FollowAllSpider, settings).crawl(domain=domain)
    reactor.stop()
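For reference, a sketch of the imports both snippets assume (FollowAllSpider stands in for a spider class defined in the project, and the reactor still has to be started with reactor.run()):

from twisted.internet import defer, reactor
from scrapy.crawler import Crawler, CrawlerRunner
from scrapy.utils.project import get_project_settings
# FollowAllSpider would be imported from the project's own spiders module.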

@dangra (Member, Author) commented Sep 3, 2014

@kmike: because CrawlerRunner also

1- Creates the Crawler after merging spider settings.
2- It instantiates the spidermanager and holds a reference.
3- It implements the correct logic to wait for all crawlers (.join()).
4- It implements the logic to stop all running crawlers.

I think the above are common enough to be grouped into a single helper object.
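A rough sketch of how such a helper object would be used from a script, assuming the proposed join() returns a Deferred that fires once every started crawl has finished (the names follow the proposal in this PR, not a released API, and the spider names are made up):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
runner.crawl('spider_one')
runner.crawl('spider_two')

d = runner.join()            # proposed: fires when all crawlers have finished
d.addBoth(lambda _: reactor.stop())
reactor.run()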

@dangra (Member, Author) commented Sep 3, 2014

What doesn't belong in the crawler module is the reactor handling; it is only used by the command line so far.

@dangra (Member, Author) commented Sep 3, 2014

I have no better name for CrawlerRunner, and I admit that CrawlerManager's similarity to SpiderManager is a weak excuse :)

@dangra (Member, Author) commented Sep 3, 2014

It'd be great to reduce SpiderManager use as much as possible. I think that the task of loading Spider by name from a Scrapy project is not essential; internals should work with Spider classes directly as much as possible.

Internals work with Spider classes and especially with their instances. The command line and logs use the spider name, but only because the name is a required attribute of the spider object.

@dangra (Member, Author) commented Sep 3, 2014

It seems this function instantiates Crawler via CrawlerRunner only because CrawlerRunner sets up logging. This should be unnecessary once we switch to stdlib logging or move logging initialization to crawler class (as you suggested).

We agree on moving to stdlib logging; the idea is to create a logger per crawler and attach a handler to the root.
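A minimal sketch of that idea with stdlib logging: one named logger per crawler, with a single handler attached to the root logger (the logger name and format are illustrative):

import logging

# One handler on the root logger; per-crawler records propagate up to it.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s [%(name)s] %(levelname)s: %(message)s'))
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

# Each crawler gets its own named logger.
logger = logging.getLogger('scrapy.crawler.followall')
logger.info('crawler started')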

@dangra (Member, Author) commented Sep 3, 2014

/cc @curita

@dangra (Member, Author) commented Sep 3, 2014

I'd rename SpiderManager to SpiderLoader or SpiderFinder to make this clear. IMHO "Manager" doesn't provide any insight into what the class is actually for.

I like SpiderFinder.

@kmike (Member) commented Sep 4, 2014

CrawlerRunner also
1- Creates the Crawler after merging spider settings.
2- It instantiates the spidermanager and holds a reference.
3- It implements the correct logic to wait for all crawlers (.join()).
4- It implements the logic to stop all running crawlers.
I think the above are common enough to be grouped into a single helper object.

What doesn't belong in the crawler module is the reactor handling; it is only used by the command line so far.

Such a helper object makes sense, thanks for the explanation. tornado.gen supports "join" quite naturally: you yield a list and get results when all tasks are ready, but it seems it is more complicated in Twisted and inlineCallbacks. The stopping logic can also be useful.
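For reference, the closest Twisted counterpart of that tornado pattern is DeferredList, which can also be yielded from inlineCallbacks (a generic sketch, not code from this PR):

from twisted.internet import defer

@defer.inlineCallbacks
def wait_for_all(deferreds):
    # Fires once every deferred in the list has fired, similar to yielding a list in tornado.gen.
    results = yield defer.DeferredList(deferreds)
    defer.returnValue(results)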

But IMHO (2), instantiating the spidermanager and holding a reference, doesn't belong in CrawlerRunner, because SpiderManager assumes there is a Scrapy project, and it'd be great to be able to use CrawlerRunner without one.

@dangra (Member, Author) commented Sep 5, 2014

But IMHO (2), instantiating the spidermanager and holding a reference, doesn't belong in CrawlerRunner, because SpiderManager assumes there is a Scrapy project, and it'd be great to be able to use CrawlerRunner without one.

You have a point here. I'll keep it in mind.

@curita (Member) commented Sep 7, 2014

There are a lot of suggestions; I'll try to summarize them:

CrawlerRunner:

  • Rename references to SPIDER_MANAGER_CLASS instance from spiders to spidermanager
    Yes, it's kind of misleading otherwise.
  • Rename CrawlerRunner to CrawlerManager for consistency with Spider->SpiderManager
    Both are equally bad for different reasons :P I like CrawlerRunner slightly more but I don't have a strong preference.
  • Remove CrawlerRunner.spiders
    It's true that loading spiders isn't necessarily a task for CrawlerRunner and it could live somewhere else, but I think keeping it there simplifies the user API by saving an additional step. SpiderManager assumes there is a project to look spiders up from (at least if the user didn't provide a different manager, which is possible), but if there is no project it doesn't interfere with providing spider instances directly.
  • Correctly detect when all crawlers finished so reactor can be stopped (new public method CrawlerManager.join())
    Agreed; it should return a DeferredList with the crawl_deferreds (like it's done in the _start_reactor method of CrawlerProcess); see the sketch after this list.
  • Allow passing Crawler instances directly to CrawlerManager.crawl()
    +1, but crawlers created outside CrawlerRunner won't have their spider settings populated at initialization time unless it's done manually; maybe that should be documented to avoid unexpected behavior.
  • Move crawler logging initialization to crawler class
    +1
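A possible shape for that join(), assuming the runner keeps the deferreds returned by crawl() in a crawl_deferreds attribute (the attribute name comes from the _start_reactor logic mentioned in the list; this is a sketch, not the final implementation):

from twisted.internet import defer

class CrawlerRunner(object):
    # ... existing __init__ and crawl() that append each crawl's deferred
    # to self.crawl_deferreds ...

    def join(self):
        """Return a Deferred that fires once all running crawls have finished."""
        return defer.DeferredList(self.crawl_deferreds)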

CrawlerProcess:

  • Find a better name for CrawlerProcess and if possible do not extend CrawlerManager
    +1, and I like @nramirezuy's idea of using an instance of CrawlerRunner in CrawlerProcess instead of inheriting it.
  • Move CrawlerProcess to another module, it is only used by the command line so far
    If we change its inheritance from CrawlerRunner, I agree, since there will be no strong dependency, and it's true that it is more of a helper for the ScrapyCommands and the command line.

SpiderManager:

  • Rename SpiderManager to SpiderLoader or SpiderFinder to make it clear
    I'll go with SpiderLoader since it actually has a load method, but I think both options are more descriptive than the current name.
  • Use SPIDERMANAGER_CLASS (one underscore fewer) to point to the implementer of the new interface
    Not exactly sure about this; I prefer keeping its name and providing a meaningful error if the spider manager doesn't respect the new interface.
  • Show a more explanatory error message when SPIDER_MANAGER_CLASS is set (there is no support for the old interface)
    Yes, it should be checked somewhere that the manager implements the zope.interface (maybe when loading it in CrawlerRunner?).

@nramirezuy (Contributor)

What do you think about allowing settings overrides in from_crawler?
It is clearer there, and you can do it from every middleware; we just have to define a new setting scope for each, and it also allows you to define the middleware defaults in the same file. The spider receives its arguments in from_crawler, so it allows changing settings based on spider arguments. This was the idea in SEP-19.

@curita (Member) commented Sep 14, 2014

@nramirezuy I don't know how much granularity in configuration we want, but it's possible. There's an issue with that regarding the spider settings though. If settings are overridden in the spider's from_crawler method, they won't be available when setting up other components while creating crawlers, since spiders are instantiated later in crawler.crawl. Actually, we could avoid what I mentioned about spider settings not being populated when creating crawlers outside CrawlerRunner by moving this population to the crawler's __init__ method. If we add another configuration level we should consider carefully when and for which components these new settings will be available.

@nramirezuy (Contributor)

Something like this may work:

SETTINGS_PRIORITIES = {
    'default': 0,
    'command': 10,
    'project': 20,
    'downloader_middleware': 27,
    'spider_middleware': 28,
    'extension': 29,
    'spider': 30,
    'cmdline': 40,
}
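A hypothetical sketch of how a middleware might use one of those proposed scopes, assuming settings.set() keeps accepting a priority name registered in SETTINGS_PRIORITIES (the middleware class and the overridden setting are made up for illustration):

class MyDownloaderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Middleware-level default; 'spider' and 'cmdline' priorities would still override it.
        crawler.settings.set('DOWNLOAD_DELAY', 0.5, priority='downloader_middleware')
        return cls()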

@dangra deleted the crawler-api branch August 10, 2015 19:13