Skip to content

Latest commit

 

History

History
40 lines (30 loc) · 1.66 KB

glossary.rst

File metadata and controls

40 lines (30 loc) · 1.66 KB

Glossary

spider log A stream of encoded messages from spiders. Each message is product of extraction from document content. Most of the time it is links, scores, classification results.

scoring log

Contains score updating events and scheduling flag (if link needs to be scheduled for download) going from strategy worker to db worker.

spider feed

A stream of messages from db worker to spiders containing new batches of documents to crawl.

strategy worker

Special type of worker, running the crawling strategy code: scoring the links, deciding if link needs to be scheduled (consults state cache) and when to stop crawling. That type of worker is sharded.

db worker

Is responsible for communicating with storage DB, and mainly saving metadata and content along with retrieving new batches to download.

state cache

In-memory data structure containing information about state of documents, whatever they were scheduled or not. Periodically synchronized with persistent storage.

message bus

Transport layer abstraction mechanism. Provides interface for transport layer abstraction and several implementations.

spider

A process retrieving and extracting content from the Web, using spider feed as incoming queue and storing results to spider log. In this documentation fetcher is used as synonym.

crawling strategy

A class containing crawling logic covering seeds addition, processing of downloaded content and scheduling of new requests to crawl.