Glossary

spider log: A stream of encoded messages from spiders. Each message is the product of extraction from document content, most often links, scores, or classification results.

scoring log: Contains score update events and a scheduling flag (set if the link needs to be scheduled for download), sent from the strategy worker to the db worker.
spider feed: A stream of messages from the db worker to spiders, containing new batches of documents to crawl.
strategy worker: A special type of worker that runs the crawling strategy code: it scores links, decides whether a link needs to be scheduled (consulting the state cache), and decides when to stop crawling. This type of worker is sharded.
db worker: Responsible for communicating with the storage DB, mainly saving metadata and content and retrieving new batches to download.
state cache: An in-memory data structure containing information about the state of documents, whether they were scheduled or not. Periodically synchronized with persistent storage.
message bus: A transport layer abstraction. It provides a common interface to the transport layer along with several implementations.
spider: A process that retrieves and extracts content from the Web, using the spider feed as its incoming queue and storing results to the spider log. In this documentation, "fetcher" is used as a synonym.
crawling strategy: A class containing the crawling logic, covering the addition of seeds, the processing of downloaded content, and the scheduling of new requests to crawl.
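
To see how these terms fit together, the flow between the components can be sketched in a few lines of Python. This is only an illustrative toy, not the project's actual API: the queue names mirror the glossary terms, but every function name, the message layout, the `pending` list standing in for the storage DB, and the URL-length score are assumptions made up for this example.

```python
from collections import deque

# Hypothetical in-memory stand-ins for the three message streams.
spider_log = deque()    # spiders -> strategy worker
scoring_log = deque()   # strategy worker -> db worker
spider_feed = deque()   # db worker -> spiders

# State cache: URL -> True once the document has been scheduled.
state_cache = {}

# Stand-in for the storage DB queue kept by the db worker.
pending = []


def spider_emit_links(links):
    """A spider writes extracted links to the spider log."""
    for url in links:
        spider_log.append({"type": "link", "url": url})


def strategy_worker_step():
    """Score each link and flag it for scheduling unless already scheduled."""
    while spider_log:
        msg = spider_log.popleft()
        url = msg["url"]
        # Consult the state cache so a link is scheduled only once.
        schedule = not state_cache.get(url, False)
        state_cache[url] = True
        # Toy score: shorter URLs rank higher (purely illustrative).
        score = 1.0 / len(url)
        scoring_log.append({"url": url, "score": score, "schedule": schedule})


def db_worker_step(batch_size=2):
    """Store scheduled links, then send the best-scored batch to the spider feed."""
    while scoring_log:
        event = scoring_log.popleft()
        if event["schedule"]:
            pending.append(event)
    pending.sort(key=lambda e: e["score"], reverse=True)
    batch = [e["url"] for e in pending[:batch_size]]
    del pending[:batch_size]
    if batch:
        spider_feed.append(batch)
    return batch
```

Calling `spider_emit_links(...)`, then `strategy_worker_step()`, then `db_worker_step()` moves links through all three streams in order. In a real deployment each stream would go through the message bus and the workers would run as separate processes.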
