Frontier Backend <frontera.core.components.Backend>
is where the crawling logic and policies live; it is essentially the brain of your crawler. Queue <frontera.core.components.Queue>
, Metadata <frontera.core.components.Metadata>
and States <frontera.core.components.States>
are the classes where all low-level code is meant to be placed, whereas Backend operates at a higher level. Frontera is bundled with database and in-memory implementations of Queue, Metadata and States, which can be combined in your custom backends or used standalone by directly instantiating FrontierManager <frontera.core.manager.FrontierManager>
and Backend.
Backend methods are called by the FrontierManager after Middleware <frontera.core.components.Middleware>
, using hooks for Request <frontera.core.models.Request>
and Response <frontera.core.models.Response>
processing according to frontier data flow <frontier-data-flow>
.
Unlike Middleware, which can have many different instances activated, only one Backend can be used per frontier.
To activate the frontier backend component, set it through the BACKEND
setting.
Here’s an example:
BACKEND = 'frontera.contrib.backends.memory.FIFO'
Keep in mind that some backends may need additional configuration through their own settings. See the backends documentation <frontier-built-in-backend>
for more info.
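As a sketch of a backend that needs extra configuration, a settings module selecting the SQLAlchemy FIFO backend might look like the following. The engine URL and database file name are illustrative assumptions; check the settings reference for the options your Frontera version supports.

```python
# settings.py (illustrative sketch; the values below are assumptions)

# Dotted path of the backend class the frontier should load.
BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'

# SQLAlchemy connection string. A file-based SQLite database is assumed
# here so that the frontier state survives a restart.
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'
```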
Each backend component is a single Python class inherited from Backend <frontera.core.components.Backend>
or DistributedBackend <frontera.core.components.DistributedBackend>
and using one or all of Queue
, Metadata
and States
.
FrontierManager
will communicate with the active backend through the methods described below.
frontera.core.components.Backend
Methods
frontera.core.components.Backend.frontier_start
- return
None.
frontera.core.components.Backend.frontier_stop
- return
None.
frontera.core.components.Backend.finished
frontera.core.components.Backend.add_seeds
- return
None.
frontera.core.components.Backend.page_crawled
- return
None.
frontera.core.components.Backend.request_error
- return
None.
frontera.core.components.Backend.get_next_requests
Class Methods
frontera.core.components.Backend.from_manager
Properties
frontera.core.components.Backend.queue
frontera.core.components.Backend.states
frontera.core.components.Backend.metadata
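To make the shape of this interface concrete, here is a minimal, self-contained sketch that mirrors the method names listed above without importing Frontera. `SketchRequest` and `SketchFIFOBackend` are hypothetical stand-ins, and the argument lists (for example `page_crawled(response, links)`) are assumptions that may differ between Frontera versions.

```python
from collections import deque


class SketchRequest:
    """Hypothetical stand-in for frontera.core.models.Request."""

    def __init__(self, url):
        self.url = url


class SketchFIFOBackend:
    """Mimics the Backend method names; not the real base class."""

    def __init__(self):
        self._pending = deque()  # requests waiting to be crawled
        self._seen = set()       # URLs that were already scheduled once
        self.started = False

    @classmethod
    def from_manager(cls, manager):
        # The real class method receives the FrontierManager; unused here.
        return cls()

    def frontier_start(self):
        self.started = True

    def frontier_stop(self):
        self.started = False

    def _schedule(self, requests):
        for request in requests:
            if request.url not in self._seen:
                self._seen.add(request.url)
                self._pending.append(request)

    def add_seeds(self, seeds):
        self._schedule(seeds)

    def page_crawled(self, response, links):
        # Assumed signature: the crawled response plus its extracted links.
        self._schedule(links)

    def request_error(self, request, error):
        # A real backend might record the error; this sketch ignores it.
        pass

    def get_next_requests(self, max_n_requests):
        batch = []
        while self._pending and len(batch) < max_n_requests:
            batch.append(self._pending.popleft())
        return batch

    def finished(self):
        return not self._pending


backend = SketchFIFOBackend.from_manager(manager=None)
backend.frontier_start()
backend.add_seeds([SketchRequest('http://example.com/')])
first_batch = backend.get_next_requests(max_n_requests=10)
backend.page_crawled(first_batch[0], links=[
    SketchRequest('http://example.com/a'),
    SketchRequest('http://example.com/'),  # already seen, so it is skipped
])
second_batch = backend.get_next_requests(max_n_requests=10)
backend.frontier_stop()
```

The sketch keeps queueing and de-duplication inside the backend for brevity; in Frontera those concerns are meant to live in the Queue and States components.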
frontera.core.components.DistributedBackend
Inherits all methods of Backend, and has two more class methods, which are called during strategy and db worker instantiation.
frontera.core.components.DistributedBackend.strategy_worker
frontera.core.components.DistributedBackend.db_worker
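The two class methods act as alternative constructors, one per worker type. The following sketch shows the pattern only; `SketchDistributedBackend` is a hypothetical stand-in, and the split of components between the two workers is an assumption made for illustration.

```python
class SketchDistributedBackend:
    """Hypothetical stand-in showing the two alternative constructors."""

    def __init__(self, components):
        # Names of the low-level components this instance was set up with.
        self.components = components

    @classmethod
    def strategy_worker(cls, manager):
        # Assumption: a strategy worker mainly needs link states.
        return cls(components=('states',))

    @classmethod
    def db_worker(cls, manager):
        # Assumption: a db worker mainly needs the queue and metadata.
        return cls(components=('queue', 'metadata'))


sw_backend = SketchDistributedBackend.strategy_worker(manager=None)
db_backend = SketchDistributedBackend.db_worker(manager=None)
```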
The Backend should communicate with low-level storage by means of these classes:
frontera.core.components.Metadata
Methods
frontera.core.components.Metadata.add_seeds
frontera.core.components.Metadata.request_error
frontera.core.components.Metadata.page_crawled
Known implementations are: MemoryMetadata
and sqlalchemy.components.Metadata
.
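As a rough illustration of what a Metadata component is responsible for, the sketch below keeps one record per URL in a plain dictionary. `SketchMetadata`, its record fields and its simplified argument lists are all assumptions; the real methods receive Frontera request and response objects.

```python
class SketchMetadata:
    """Hypothetical in-memory metadata store keyed by URL."""

    def __init__(self):
        self.records = {}

    @staticmethod
    def _new_record(depth):
        return {'depth': depth, 'status': None, 'error': None}

    def add_seeds(self, seeds):
        for url in seeds:
            self.records[url] = self._new_record(depth=0)

    def page_crawled(self, url, status, links):
        parent = self.records.setdefault(url, self._new_record(depth=0))
        parent['status'] = status
        for link in links:
            # Only links not seen before get a new record.
            self.records.setdefault(
                link, self._new_record(depth=parent['depth'] + 1))

    def request_error(self, url, error):
        record = self.records.setdefault(url, self._new_record(depth=0))
        record['error'] = error


metadata = SketchMetadata()
metadata.add_seeds(['http://example.com/'])
metadata.page_crawled('http://example.com/', 200, ['http://example.com/a'])
metadata.request_error('http://example.com/a', 'timeout')
```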
frontera.core.components.Queue
Methods
frontera.core.components.Queue.get_next_requests
frontera.core.components.Queue.schedule
frontera.core.components.Queue.count
Known implementations are: MemoryQueue
and sqlalchemy.components.Queue
.
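The three Queue methods can be sketched with a small priority queue. `SketchQueue` is a hypothetical stand-in, and the batch format used here, a list of `(score, request)` pairs, is a simplifying assumption rather than the format Frontera passes to `schedule`.

```python
import heapq
import itertools


class SketchQueue:
    """Hypothetical priority queue exposing the Queue method names."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so that equal scores keep their insertion order.
        self._counter = itertools.count()

    def schedule(self, batch):
        for score, request in batch:
            # heapq is a min-heap, so the score is negated to pop the
            # highest-scored request first.
            heapq.heappush(
                self._heap, (-score, next(self._counter), request))

    def get_next_requests(self, max_n_requests):
        n = min(max_n_requests, len(self._heap))
        return [heapq.heappop(self._heap)[2] for _ in range(n)]

    def count(self):
        return len(self._heap)


queue = SketchQueue()
queue.schedule([(0.5, 'a'), (1.0, 'b'), (0.5, 'c')])
top_two = queue.get_next_requests(2)
remaining = queue.count()
```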
frontera.core.components.States
Methods
frontera.core.components.States.update_cache
frontera.core.components.States.set_states
frontera.core.components.States.flush
frontera.core.components.States.fetch
Known implementations are: MemoryStates
and sqlalchemy.components.States
.
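One way to read the four States methods is as a write-back cache in front of persistent storage. The sketch below follows that reading; `SketchStates`, the dictionary-based objects, the numeric state constants and the `force_clear` parameter are assumptions chosen for illustration.

```python
class SketchStates:
    """Hypothetical state cache backed by a dictionary 'storage'."""

    NOT_CRAWLED, QUEUED, CRAWLED, ERROR = 0, 1, 2, 3  # illustrative values

    def __init__(self):
        self._cache = {}
        self._storage = {}  # stands in for a database table

    def update_cache(self, objs):
        # Copy the state carried by each object into the cache.
        for obj in objs:
            self._cache[obj['fingerprint']] = obj['state']

    def set_states(self, objs):
        # Fill each object's state from the cache; unknown means new.
        for obj in objs:
            obj['state'] = self._cache.get(
                obj['fingerprint'], self.NOT_CRAWLED)

    def flush(self, force_clear=False):
        # Persist cached states, optionally emptying the cache afterwards.
        self._storage.update(self._cache)
        if force_clear:
            self._cache.clear()

    def fetch(self, fingerprints):
        # Load states that are missing from the cache out of storage.
        for fingerprint in fingerprints:
            if fingerprint not in self._cache and fingerprint in self._storage:
                self._cache[fingerprint] = self._storage[fingerprint]


states = SketchStates()
states.update_cache([{'fingerprint': 'f1', 'state': SketchStates.CRAWLED}])
states.flush(force_clear=True)

before_fetch = {'fingerprint': 'f1'}
states.set_states([before_fetch])  # cache is empty at this point

states.fetch(['f1'])
after_fetch = {'fingerprint': 'f1'}
states.set_states([after_fetch])
```

The usage lines show why `fetch` is needed before `set_states` once the cache has been cleared: without it, a previously crawled document would look new.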
This article describes all backend components that come bundled with Frontera.
To find out which Backend <frontera.core.components.Backend>
is activated by default, check the BACKEND
setting.
Some of the built-in Backend <frontera.core.components.Backend>
objects implement basic algorithms such as FIFO/LIFO or DFS/BFS for page visit ordering.
They differ only in the storage engine used. For instance, memory.FIFO <frontera.contrib.backends.memory.FIFO>
and sqlalchemy.FIFO <frontera.contrib.backends.sqlalchemy.FIFO>
will use the same logic but with different storage engines.
All these backend variations use the same CommonBackend <frontera.contrib.backends.CommonBackend>
class, which implements a one-time visit crawling policy with a priority queue.
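A one-time visit policy with a priority queue can be sketched as follows. This is not the CommonBackend code; `OneTimeVisitFrontier`, its state names and its URL-based interface are hypothetical and only illustrate the idea that a document is scheduled at most once.

```python
import heapq

NOT_CRAWLED, QUEUED, CRAWLED = 'NOT_CRAWLED', 'QUEUED', 'CRAWLED'


class OneTimeVisitFrontier:
    """Hypothetical frontier that schedules every URL at most once."""

    def __init__(self):
        self.states = {}  # url -> state
        self._heap = []   # (-score, insertion index, url)
        self._index = 0

    def schedule(self, url, score=1.0):
        # Anything already queued or crawled is never scheduled again.
        if self.states.get(url, NOT_CRAWLED) != NOT_CRAWLED:
            return False
        self.states[url] = QUEUED
        heapq.heappush(self._heap, (-score, self._index, url))
        self._index += 1
        return True

    def next_url(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def page_crawled(self, url, links):
        self.states[url] = CRAWLED
        for link in links:
            self.schedule(link)


frontier = OneTimeVisitFrontier()
frontier.schedule('seed')
first = frontier.next_url()
# 'seed' was crawled and 'a' appears twice; only one new entry is queued.
frontier.page_crawled('seed', ['a', 'seed', 'a'])
second = frontier.next_url()
third = frontier.next_url()
```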
frontera.contrib.backends.CommonBackend
This set of Backend <frontera.core.components.Backend>
objects will use the heapq module as a queue and native dictionaries as storage for basic algorithms <frontier-backends-basic-algorithms>
.
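The basic algorithms can be viewed as different sort keys applied to the same heap. The sketch below assumes each request carries a `depth` and a `created_at` value and that the orderings are defined as in `ORDER_KEYS`; both are illustrative assumptions, not a copy of the in-memory backends.

```python
import heapq

# Assumed orderings: FIFO/LIFO by creation time, BFS/DFS by link depth.
ORDER_KEYS = {
    'FIFO': lambda r: r['created_at'],
    'LIFO': lambda r: -r['created_at'],
    'BFS': lambda r: (r['depth'], r['created_at']),
    'DFS': lambda r: (-r['depth'], r['created_at']),
}


def crawl_order(requests, algorithm):
    """Return request URLs in the order the given algorithm visits them."""
    key = ORDER_KEYS[algorithm]
    # The index breaks ties so that URLs are never compared directly.
    heap = [(key(r), i, r['url']) for i, r in enumerate(requests)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


requests = [
    {'url': 'seed', 'depth': 0, 'created_at': 1},
    {'url': 'a', 'depth': 1, 'created_at': 2},
    {'url': 'a1', 'depth': 2, 'created_at': 3},
    {'url': 'b', 'depth': 1, 'created_at': 4},
]

fifo = crawl_order(requests, 'FIFO')
lifo = crawl_order(requests, 'LIFO')
bfs = crawl_order(requests, 'BFS')
dfs = crawl_order(requests, 'DFS')
```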
Base class for in-memory Backend <frontera.core.components.Backend>
objects.
In-memory Backend <frontera.core.components.Backend>
implementation of FIFO algorithm.
In-memory Backend <frontera.core.components.Backend>
implementation of LIFO algorithm.
In-memory Backend <frontera.core.components.Backend>
implementation of BFS algorithm.
In-memory Backend <frontera.core.components.Backend>
implementation of DFS algorithm.
In-memory Backend <frontera.core.components.Backend>
implementation of a random selection algorithm.
This set of Backend <frontera.core.components.Backend>
objects will use SQLAlchemy as storage for basic algorithms <frontier-backends-basic-algorithms>
.
By default they use an in-memory SQLite database as the storage engine, but any database supported by SQLAlchemy can be used.
If you need to use your own declarative SQLAlchemy models, you can do so through the SQLALCHEMYBACKEND_MODELS
setting.
This setting takes a dictionary in which each key
represents the name of the model to define and each value
the model to use.
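A sketch of such a dictionary is shown below. The key names and the two `frontera.contrib.backends.sqlalchemy.models` paths are assumed to match the default settings and should be verified against your Frontera version; `myproject.models.CustomMetadataModel` is a hypothetical path to your own model.

```python
# settings.py (illustrative sketch)
SQLALCHEMYBACKEND_MODELS = {
    # Hypothetical custom model replacing the metadata model.
    'MetadataModel': 'myproject.models.CustomMetadataModel',
    # Assumed default model paths, kept unchanged.
    'StateModel': 'frontera.contrib.backends.sqlalchemy.models.StateModel',
    'QueueModel': 'frontera.contrib.backends.sqlalchemy.models.QueueModel',
}
```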
For a complete list of all settings used for SQLAlchemy backends check the settings <frontera-settings>
section.
Base class for SQLAlchemy Backend <frontera.core.components.Backend>
objects.
SQLAlchemy Backend <frontera.core.components.Backend>
implementation of FIFO algorithm.
SQLAlchemy Backend <frontera.core.components.Backend>
implementation of LIFO algorithm.
SQLAlchemy Backend <frontera.core.components.Backend>
implementation of BFS algorithm.
SQLAlchemy Backend <frontera.core.components.Backend>
implementation of DFS algorithm.
SQLAlchemy Backend <frontera.core.components.Backend>
implementation of a random selection algorithm.
This backend is based on a custom SQLAlchemy backend and queue. Crawling starts with the seeds. After the seeds are crawled, every new document is scheduled for immediate crawling. Once fetched, each document is scheduled for recrawling after a fixed interval set by SQLALCHEMYBACKEND_REVISIT_INTERVAL
.
The current implementation of the revisiting backend has no prioritization. During long-term runs the spider could go idle because no documents are available for crawling, even though there are documents waiting for their scheduled revisit time.
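The scheduling rule and the idle period it can cause are easy to see in a small sketch. `RevisitSchedule` is hypothetical, and `REVISIT_INTERVAL` stands in for SQLALCHEMYBACKEND_REVISIT_INTERVAL with an assumed value of one day.

```python
from datetime import datetime, timedelta

# Stand-in for SQLALCHEMYBACKEND_REVISIT_INTERVAL (assumed value).
REVISIT_INTERVAL = timedelta(days=1)


class RevisitSchedule:
    """Hypothetical fixed-interval revisit schedule keyed by URL."""

    def __init__(self, interval=REVISIT_INTERVAL):
        self.interval = interval
        self.due_at = {}  # url -> time at which it may be crawled

    def add_new(self, url, now):
        # New documents are due immediately; known ones keep their time.
        self.due_at.setdefault(url, now)

    def page_crawled(self, url, now):
        # After a fetch the document waits one full interval.
        self.due_at[url] = now + self.interval

    def get_next(self, now, max_n):
        # Ordered by due time only; there is no other prioritization.
        # A URL stays eligible until page_crawled() reschedules it.
        due = sorted((t, u) for u, t in self.due_at.items() if t <= now)
        return [u for _, u in due[:max_n]]


start = datetime(2020, 1, 1)
schedule = RevisitSchedule()
schedule.add_new('http://example.com/', start)
initial = schedule.get_next(start, 10)
schedule.page_crawled('http://example.com/', start)

# One hour later nothing is due, so the spider would sit idle.
idle = schedule.get_next(start + timedelta(hours=1), 10)
# After the full interval the document becomes available again.
revisit = schedule.get_next(start + timedelta(days=1), 10)
```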
Base class for SQLAlchemy Backend <frontera.core.components.Backend>
implementation of the revisiting backend.
frontera.contrib.backends.hbase.HBaseBackend
This backend is more suitable for large-scale web crawlers. The settings reference can be found in hbase-settings
. Consider tuning the block cache so that the states of an average-size website fit within one block. To achieve this, it is recommended to use hostname_local_fingerprint <frontera.utils.fingerprint.hostname_local_fingerprint>
so that documents from the same host are stored close together. This function can be selected with the URL_FINGERPRINT_FUNCTION
setting.
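The locality idea can be illustrated with a fingerprint whose leading bytes depend only on the hostname, so that keys for the same host sort next to each other. The function below is a simplified sketch written for this purpose, not the library implementation; `sketch_host_local_fingerprint` and its exact byte layout are assumptions.

```python
import hashlib
import zlib
from binascii import hexlify
from struct import pack
from urllib.parse import urlparse

# Selecting the bundled function in the settings module.
URL_FINGERPRINT_FUNCTION = 'frontera.utils.fingerprint.hostname_local_fingerprint'


def sketch_host_local_fingerprint(url):
    """Illustrative fingerprint: host checksum prefix + document digest."""
    parts = urlparse(url)
    if not parts.hostname:
        # Without a hostname, fall back to hashing the whole URL.
        return hashlib.sha1(url.encode('utf-8')).hexdigest()
    host_checksum = zlib.crc32(parts.hostname.encode('utf-8')) & 0xffffffff
    rest = parts.path + ';' + parts.params + parts.query + parts.fragment
    doc_digest = hashlib.md5(rest.encode('utf-8')).digest()
    # 4 bytes of host checksum followed by 16 bytes of document digest.
    return hexlify(pack('>I16s', host_checksum, doc_digest)).decode('ascii')


fp_a = sketch_host_local_fingerprint('http://example.com/a')
fp_b = sketch_host_local_fingerprint('http://example.com/b')
fp_other = sketch_host_local_fingerprint('http://example.org/a')
```

Because the first eight hexadecimal characters are shared by every document of a host, a store that sorts rows by key keeps one site's states in adjacent rows.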