This repository has been archived by the owner on Apr 20, 2019. It is now read-only.

Workers

temoto edited this page Sep 13, 2010 · 9 revisions

A worker is the Heroshi crawling unit.

Worker workflow:

  1. contact the manager and get a list of Links to crawl from the queue manager
  2. crawl the links
  3. as results come in, report each one back to the manager immediately
  4. repeat
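The loop above can be sketched in Python. The function names `get_queue`, `fetch`, and `report` are hypothetical stand-ins for illustration, not Heroshi's actual API:

```python
def get_queue():
    # Stand-in: the real worker asks the manager over HTTP for Links to crawl.
    return ["http://example.com/"]

def fetch(url):
    # Stand-in: the real worker performs the HTTP GET and parses the page here.
    return {"url": url, "status": 200, "links": []}

def report(result):
    # Stand-in: the real worker sends each result back to the manager
    # as soon as it is available.
    return True

def crawl_once():
    results = []
    for url in get_queue():   # 1. get links to crawl from the queue manager
        result = fetch(url)   # 2. crawl each link
        report(result)        # 3. report the result immediately
        results.append(result)
    return results            # 4. the caller repeats the whole cycle
```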

Workers communicate with the manager through an HTTP RESTful API. At the time of writing, the manager's location is predefined in the worker's configuration.
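As a rough illustration of that API, a worker might build its requests like this. The manager URL, the `crawl-queue` path, and the report body layout are all assumptions for the sketch, not documented Heroshi endpoints:

```python
import json
from urllib.parse import urljoin

# Assumption: the manager location comes from the worker's configuration.
MANAGER_URL = "http://localhost:8080/"

def build_queue_request(manager_url, limit=10):
    # URL for a GET that fetches up to `limit` Links to crawl
    # (the "crawl-queue" path is hypothetical).
    return urljoin(manager_url, "crawl-queue?limit=%d" % limit)

def build_report_body(url, status, links):
    # JSON body a worker might POST back with one crawl result,
    # including every URL found on the page.
    return json.dumps({"url": url, "status": status, "links": links})
```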

When a worker finds a URL on a just-retrieved page, that URL is obviously new to the worker, but it may already have been visited by the system. Workers are therefore restricted to crawling only the set of URLs received from the queue manager, and they report every URL they find back to the manager. This approach gives full control over which URLs are crawled and when. The disadvantage is a lot of traffic (Link lists) between workers and the manager.
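The manager-side deduplication this enables can be sketched as a simple set membership check. The `visited` set and `filter_new` helper are illustrative only; the real manager persists this state rather than keeping it in memory:

```python
def filter_new(found_urls, visited):
    # Manager-side sketch: only URLs never seen before enter the crawl
    # queue, which is why workers must report everything they find.
    # `visited` is a set of already-known URLs (simplified in-memory model).
    fresh = [u for u in found_urls if u not in visited]
    visited.update(fresh)
    return fresh
```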

Workers use:

  • async network I/O via the wonderful eventlet library.
  • BeautifulSoup to parse HTML and search for more links.
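The link-extraction part of that job looks roughly like the following. BeautifulSoup offers a richer API than this; the sketch uses the standard library's `HTMLParser` so it runs without third-party packages, but the idea, collecting every `href` on a page, is the same:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collects href values from <a> tags: the same job the worker
    # delegates to BeautifulSoup when it searches a page for more links.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

Every URL extracted this way would then be reported back to the manager, which decides whether it enters the crawl queue.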