The frontier :class:`Backend <crawlfrontier.core.components.Backend>` is where the crawling logic and policies live. It is responsible for receiving all crawl information and selecting the next pages to be crawled. The :class:`FrontierManager <crawlfrontier.core.manager.FrontierManager>` calls it after the :class:`Middleware <crawlfrontier.core.components.Middleware>` components, using hooks for :class:`Request <crawlfrontier.core.models.Request>` and :class:`Response <crawlfrontier.core.models.Response>` processing according to the :ref:`frontier data flow <frontier-data-flow>`.
Unlike :class:`Middleware <crawlfrontier.core.components.Middleware>`, which can have many different instances activated, only one :class:`Backend <crawlfrontier.core.components.Backend>` can be used per frontier.
Depending on the logic they implement, some backends require persistent storage to manage information about :class:`Request <crawlfrontier.core.models.Request>` and :class:`Response <crawlfrontier.core.models.Response>` objects.
To activate the frontier backend component, set it through the :setting:`BACKEND` setting.
Here’s an example::

    BACKEND = 'crawlfrontier.contrib.backends.memory.FIFO'

Keep in mind that some backends may need to be enabled through a particular setting. See :ref:`each backend's documentation <frontier-built-in-backend>` for more info.
Writing your own frontier backend is easy. Each :class:`Backend <crawlfrontier.core.components.Backend>` component is a single Python class inherited from :class:`Component <crawlfrontier.core.components.Component>`.
The :class:`FrontierManager <crawlfrontier.core.manager.FrontierManager>` communicates with the active :class:`Backend <crawlfrontier.core.components.Backend>` through the methods described below.
.. autoclass:: crawlfrontier.core.components.Backend

    **Methods**

    .. automethod:: crawlfrontier.core.components.Backend.frontier_start

        :return: None.

    .. automethod:: crawlfrontier.core.components.Backend.frontier_stop

        :return: None.

    .. automethod:: crawlfrontier.core.components.Backend.add_seeds

        :return: None.

    .. automethod:: crawlfrontier.core.components.Backend.get_next_requests

    .. automethod:: crawlfrontier.core.components.Backend.page_crawled

        :return: None.

    .. automethod:: crawlfrontier.core.components.Backend.request_error

        :return: None.

    **Class Methods**

    .. automethod:: crawlfrontier.core.components.Backend.from_manager

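As a rough illustration of the interface listed above, the following sketch shows a FIFO-style backend kept entirely in memory. It deliberately does not import ``crawlfrontier`` so that it runs on its own: ``SketchBackend``, its attributes, and the plain-dict requests are hypothetical stand-ins, and the method parameters (``seeds``, ``max_n_requests``, ``response``, ``links``, ``page``, ``error``) are assumptions, since only the method names are given here. A real backend would inherit from :class:`Component <crawlfrontier.core.components.Component>` and receive the library's own request and response objects.

```python
from collections import deque


class SketchBackend(object):
    """Hypothetical FIFO backend; mirrors the documented method names only."""

    def __init__(self, manager=None):
        self.manager = manager
        self.queue = deque()   # requests waiting to be crawled
        self.seen = set()      # URLs that were already scheduled
        self.errors = {}       # url -> error description

    @classmethod
    def from_manager(cls, manager):
        # Assumption: the manager passes itself in when building the backend.
        return cls(manager)

    def frontier_start(self):
        pass  # nothing to open for an in-memory queue

    def frontier_stop(self):
        self.queue.clear()

    def _schedule(self, url):
        # Queue each URL at most once.
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append({'url': url})

    def add_seeds(self, seeds):
        for seed in seeds:
            self._schedule(seed['url'])

    def get_next_requests(self, max_n_requests):
        # Hand out the oldest requests first, up to the requested batch size.
        batch = []
        while self.queue and len(batch) < max_n_requests:
            batch.append(self.queue.popleft())
        return batch

    def page_crawled(self, response, links):
        # Newly extracted links become candidates for later batches.
        for link in links:
            self._schedule(link['url'])

    def request_error(self, page, error):
        self.errors[page['url']] = error


backend = SketchBackend.from_manager(manager=None)
backend.frontier_start()
backend.add_seeds([{'url': 'http://example.com/'}])
first = backend.get_next_requests(10)
backend.page_crawled(first[0], [{'url': 'http://example.com/a'},
                                {'url': 'http://example.com/'}])
second = backend.get_next_requests(10)
backend.request_error(second[0], 'timeout')
```

In this sketch the seed URL is returned in the first batch, and the link pointing back to it is ignored on the second pass because it was already scheduled.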
This page describes all :class:`Backend <crawlfrontier.core.components.Backend>` components that come with Crawl Frontier. For information on how to use them and how to write your own backend, see the :ref:`backend usage guide <frontier-writing-backend>`.
To find out which :class:`Backend <crawlfrontier.core.components.Backend>` is activated by default, check the :setting:`BACKEND` setting.
Some of the built-in :class:`Backend <crawlfrontier.core.components.Backend>` objects implement basic algorithms such as FIFO/LIFO or DFS/BFS for page visit ordering.
They differ only in the storage engine used. For instance, :class:`memory.FIFO <crawlfrontier.contrib.backends.memory.FIFO>` and :class:`sqlalchemy.FIFO <crawlfrontier.contrib.backends.sqlalchemy.FIFO>` use the same logic but different storage engines.
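To make the split between ordering logic and storage concrete, here is a minimal sketch of the same FIFO policy running over two stores: a ``deque`` in memory and an SQLite table accessed through the standard-library ``sqlite3`` module. It is not the library's code. The class names ``MemoryFIFOStore`` and ``SQLiteFIFOStore``, the ``drain`` helper, and the table layout are all hypothetical.

```python
import sqlite3
from collections import deque


class MemoryFIFOStore(object):
    """FIFO queue backed by an in-process deque."""

    def __init__(self):
        self._items = deque()

    def push(self, url):
        self._items.append(url)

    def pop(self):
        return self._items.popleft() if self._items else None


class SQLiteFIFOStore(object):
    """Same FIFO contract, backed by an SQLite table."""

    def __init__(self, path=':memory:'):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            'CREATE TABLE queue (id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT)')

    def push(self, url):
        self._conn.execute('INSERT INTO queue (url) VALUES (?)', (url,))

    def pop(self):
        # The lowest id is the oldest row, which gives first-in, first-out.
        row = self._conn.execute(
            'SELECT id, url FROM queue ORDER BY id LIMIT 1').fetchone()
        if row is None:
            return None
        self._conn.execute('DELETE FROM queue WHERE id = ?', (row[0],))
        return row[1]


def drain(store, urls):
    """Push every URL, then pop until the store is empty."""
    for url in urls:
        store.push(url)
    popped = []
    while True:
        url = store.pop()
        if url is None:
            return popped
        popped.append(url)


urls = ['http://example.com/1', 'http://example.com/2', 'http://example.com/3']
memory_order = drain(MemoryFIFOStore(), urls)
sqlite_order = drain(SQLiteFIFOStore(), urls)
```

Both stores return the URLs in insertion order; only the place where the queue lives changes.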
This set of :class:`Backend <crawlfrontier.core.components.Backend>` objects uses a heapq object as storage for the :ref:`basic algorithms <frontier-backends-basic-algorithms>`.
Base class for in-memory heapq :class:`Backend <crawlfrontier.core.components.Backend>` objects.
In-memory heapq :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the FIFO algorithm.
In-memory heapq :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the LIFO algorithm.
In-memory heapq :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the BFS algorithm.
In-memory heapq :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the DFS algorithm.
In-memory heapq :class:`Backend <crawlfrontier.core.components.Backend>` implementation of a random selection algorithm.
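One way a single heap can serve all of these orderings is to vary only the priority key pushed alongside each request. The sketch below is an assumption about how such a scheme could look, not the actual crawlfrontier implementation. ``HeapQueue``, ``KEY_FUNCS``, and ``crawl_order`` are hypothetical names, and BFS/DFS are approximated here by "shallowest first" and "deepest first" on a ``depth`` value.

```python
import heapq
import itertools
import random

# Each function maps (insertion number, depth, rng) to a sortable priority key.
# The insertion number is always part of the key, so keys never tie.
KEY_FUNCS = {
    'FIFO': lambda n, depth, rng: (n,),            # oldest first
    'LIFO': lambda n, depth, rng: (-n,),           # newest first
    'BFS': lambda n, depth, rng: (depth, n),       # shallowest first
    'DFS': lambda n, depth, rng: (-depth, n),      # deepest first
    'RANDOM': lambda n, depth, rng: (rng.random(), n),
}


class HeapQueue(object):
    """Priority queue whose ordering policy is chosen by name."""

    def __init__(self, order, seed=None):
        self._key = KEY_FUNCS[order]
        self._counter = itertools.count()
        self._rng = random.Random(seed)
        self._heap = []

    def push(self, url, depth=0):
        n = next(self._counter)
        heapq.heappush(self._heap, (self._key(n, depth, self._rng), url))

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)


def crawl_order(order, pages):
    """Push all (url, depth) pairs, then pop them in policy order."""
    queue = HeapQueue(order)
    for url, depth in pages:
        queue.push(url, depth)
    return [queue.pop() for _ in range(len(queue))]


pages = [('root', 0), ('a', 1), ('a1', 2), ('b', 1)]
fifo = crawl_order('FIFO', pages)   # ['root', 'a', 'a1', 'b']
lifo = crawl_order('LIFO', pages)   # ['b', 'a1', 'a', 'root']
bfs = crawl_order('BFS', pages)     # ['root', 'a', 'b', 'a1']
dfs = crawl_order('DFS', pages)     # ['a1', 'a', 'b', 'root']
```

Keeping the policy in the key function means the push and pop code is shared by every ordering.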
This set of :class:`Backend <crawlfrontier.core.components.Backend>` objects will use SQLAlchemy as storage for :ref:`basic algorithms <frontier-backends-basic-algorithms>`.
By default it uses an in-memory SQLite database as its storage engine, but any database supported by SQLAlchemy can be used.
:class:`Request <crawlfrontier.core.models.Request>` and :class:`Response <crawlfrontier.core.models.Response>` are represented by a declarative SQLAlchemy model::

    class Page(Base):
        __tablename__ = 'pages'
        __table_args__ = (
            UniqueConstraint('url'),
        )

        class State:
            NOT_CRAWLED = 'NOT CRAWLED'
            QUEUED = 'QUEUED'
            CRAWLED = 'CRAWLED'
            ERROR = 'ERROR'

        url = Column(String(1000), nullable=False)
        fingerprint = Column(String(40), primary_key=True, nullable=False, index=True, unique=True)
        depth = Column(Integer, nullable=False)
        created_at = Column(TIMESTAMP, nullable=False)
        status_code = Column(String(20))
        state = Column(String(10))
        error = Column(String(20))

If you need to create your own models, you can do so through the :setting:`DEFAULT_MODELS` setting::

    DEFAULT_MODELS = {
        'Page': 'crawlfrontier.contrib.backends.sqlalchemy.models.Page',
    }

This setting uses a dictionary where each ``key`` is the name of the model to define and each ``value`` is the model to use.
For instance, if you want to add a model that represents domains::

    DEFAULT_MODELS = {
        'Page': 'crawlfrontier.contrib.backends.sqlalchemy.models.Page',
        'Domain': 'myproject.backends.sqlalchemy.models.Domain',
    }

Models can be accessed through the Backend's ``models`` dictionary attribute.
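How the dotted paths in such a dictionary might be turned into a ``models`` dictionary can be sketched with ``importlib``. This is an assumption about the mechanism, not crawlfrontier's actual loader: ``load_object`` and ``load_models`` are hypothetical helpers, and standard-library classes stand in for the model paths so that the snippet runs without SQLAlchemy or a project package.

```python
from importlib import import_module


def load_object(path):
    """Import and return the object named by a full dotted path."""
    module_path, _, name = path.rpartition('.')
    if not module_path:
        raise ValueError('not a full dotted path: %r' % path)
    return getattr(import_module(module_path), name)


def load_models(setting):
    """Resolve a {model name: dotted path} mapping into {model name: class}."""
    return {name: load_object(path) for name, path in setting.items()}


# Stand-in paths; a real setting would point at SQLAlchemy model classes.
EXAMPLE_MODELS = {
    'Page': 'collections.OrderedDict',
    'Domain': 'decimal.Decimal',
}

models = load_models(EXAMPLE_MODELS)
page_model = models['Page']
```

With a mapping like this, code that needs a model looks it up by name and does not have to know which module provides it.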
For a complete list of the settings used by the SQLAlchemy backends, check the :doc:`settings <frontier-settings>` section.
Base class for SQLAlchemy :class:`Backend <crawlfrontier.core.components.Backend>` objects.
SQLAlchemy :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the FIFO algorithm.
SQLAlchemy :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the LIFO algorithm.
SQLAlchemy :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the BFS algorithm.
SQLAlchemy :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the DFS algorithm.
SQLAlchemy :class:`Backend <crawlfrontier.core.components.Backend>` implementation of a random selection algorithm.