Backends

Frontier :class:`Backend <crawlfrontier.core.components.Backend>` is where the crawling logic and policies live. It's responsible for receiving all the crawl info and selecting the next pages to be crawled. It's called by the :class:`FrontierManager <crawlfrontier.core.manager.FrontierManager>` after :class:`Middleware <crawlfrontier.core.components.Middleware>`, using hooks for :class:`Request <crawlfrontier.core.models.Request>` and :class:`Response <crawlfrontier.core.models.Response>` processing according to the :ref:`frontier data flow <frontier-data-flow>`.

Unlike :class:`Middleware <crawlfrontier.core.components.Middleware>`, which can have many different instances activated, only one :class:`Backend <crawlfrontier.core.components.Backend>` can be used per frontier.

Depending on the logic they implement, some backends require persistent storage to manage :class:`Request <crawlfrontier.core.models.Request>` and :class:`Response <crawlfrontier.core.models.Response>` object info.

Activating a backend

To activate the frontier backend component, set it through the :setting:`BACKEND` setting.

Here’s an example:

BACKEND = 'crawlfrontier.contrib.backends.memory.FIFO'

Keep in mind that some backends may need to be enabled through a particular setting. See :ref:`each backend's documentation <frontier-built-in-backend>` for more info.

Writing your own backend

Writing your own frontier backend is easy. Each :class:`Backend <crawlfrontier.core.components.Backend>` component is a single Python class that inherits from :class:`Component <crawlfrontier.core.components.Component>`.

:class:`FrontierManager <crawlfrontier.core.manager.FrontierManager>` communicates with the active :class:`Backend <crawlfrontier.core.components.Backend>` through the methods described below.

.. autoclass:: crawlfrontier.core.components.Backend

    **Methods**

    .. automethod:: crawlfrontier.core.components.Backend.frontier_start

        :return: None.

    .. automethod:: crawlfrontier.core.components.Backend.frontier_stop

        :return: None.

    .. automethod:: crawlfrontier.core.components.Backend.add_seeds

        :return: None.

    .. automethod:: crawlfrontier.core.components.Backend.get_next_requests

    .. automethod:: crawlfrontier.core.components.Backend.page_crawled

        :return: None.

    .. automethod:: crawlfrontier.core.components.Backend.request_error

        :return: None.

    **Class Methods**

    .. automethod:: crawlfrontier.core.components.Backend.from_manager
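
The hooks listed above can be illustrated with a minimal sketch of a FIFO-style backend. It is not the library's implementation. The class name ``SimpleFIFOBackend``, the method signatures, and the use of plain URL strings in place of :class:`Request <crawlfrontier.core.models.Request>` and :class:`Response <crawlfrontier.core.models.Response>` objects are all assumptions made so that the example runs without Crawl Frontier installed. A real backend would inherit from :class:`Backend <crawlfrontier.core.components.Backend>`.

```python
from collections import deque


class SimpleFIFOBackend(object):
    """Hypothetical stand-in for a Backend subclass (assumed signatures)."""

    def __init__(self, manager):
        self.manager = manager
        self.queue = deque()   # pending URLs, oldest first
        self.seen = set()      # URLs that were already scheduled
        self.errors = []       # (request, error) pairs reported so far

    @classmethod
    def from_manager(cls, manager):
        # Factory hook: build the backend from the frontier manager.
        return cls(manager)

    def frontier_start(self):
        # Nothing to open for an in-memory queue.
        pass

    def frontier_stop(self):
        # Drop anything still pending when the frontier shuts down.
        self.queue.clear()

    def _schedule(self, url):
        # Queue each URL at most once.
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def add_seeds(self, seeds):
        for url in seeds:
            self._schedule(url)

    def get_next_requests(self, max_n_requests):
        # Hand out up to max_n_requests URLs in arrival order.
        batch = []
        while self.queue and len(batch) < max_n_requests:
            batch.append(self.queue.popleft())
        return batch

    def page_crawled(self, response, links):
        # Newly discovered links join the back of the queue.
        for url in links:
            self._schedule(url)

    def request_error(self, request, error):
        self.errors.append((request, error))


# Usage: plain strings stand in for request and response objects.
backend = SimpleFIFOBackend.from_manager(manager=None)
backend.frontier_start()
backend.add_seeds(['http://example.com/a', 'http://example.com/b'])
first = backend.get_next_requests(1)
backend.page_crawled('http://example.com/a',
                     ['http://example.com/c', 'http://example.com/b'])
rest = backend.get_next_requests(10)
backend.request_error('http://example.com/c', 'timeout')
```

Keeping the "already seen" check in one helper means seeds and extracted links go through the same deduplication path.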





Built-in backend reference

This page describes all :class:`Backend <crawlfrontier.core.components.Backend>` components that come with Crawl Frontier. For information on how to use them and how to write your own backend, see the :ref:`backend usage guide <frontier-writing-backend>`.

To find out which :class:`Backend <crawlfrontier.core.components.Backend>` is activated by default, check the :setting:`BACKEND` setting.

Basic algorithms

Some of the built-in :class:`Backend <crawlfrontier.core.components.Backend>` objects implement basic algorithms such as FIFO/LIFO or DFS/BFS for page visit ordering.

They differ only in the storage engine used. For instance, :class:`memory.FIFO <crawlfrontier.contrib.backends.memory.FIFO>` and :class:`sqlalchemy.FIFO <crawlfrontier.contrib.backends.sqlalchemy.FIFO>` use the same logic but with different storage engines.

Memory backends

This set of :class:`Backend <crawlfrontier.core.components.Backend>` objects uses a heapq object as storage for :ref:`basic algorithms <frontier-backends-basic-algorithms>`.

Base class for in-memory heapq :class:`Backend <crawlfrontier.core.components.Backend>` objects.

In-memory heapq :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the FIFO algorithm.

In-memory heapq :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the LIFO algorithm.

In-memory heapq :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the BFS algorithm.

In-memory heapq :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the DFS algorithm.

In-memory heapq :class:`Backend <crawlfrontier.core.components.Backend>` implementation of a random selection algorithm.
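
To make the idea of heapq-backed orderings concrete, here is a minimal sketch of how one heap can yield FIFO, LIFO, BFS or DFS visit orders just by changing the sort key. It is an illustration under assumptions, not the code of the built-in backends: ``HeapQueue``, ``ORDERINGS`` and ``crawl_order`` are hypothetical names, and BFS/DFS are assumed to mean "shallowest depth first" and "deepest depth first" respectively.

```python
import heapq
from itertools import count


class HeapQueue(object):
    """Priority queue whose visit order is decided by a key function."""

    def __init__(self, key):
        self._key = key
        self._heap = []
        self._counter = count()  # insertion order, also used to break ties

    def push(self, url, depth):
        order = next(self._counter)
        # The unique insertion order guarantees URLs are never compared.
        heapq.heappush(self._heap, (self._key(order, depth), order, url))

    def pop(self):
        return heapq.heappop(self._heap)[-1]

    def __len__(self):
        return len(self._heap)


# Smaller keys are popped first.
ORDERINGS = {
    'FIFO': lambda order, depth: order,    # oldest first
    'LIFO': lambda order, depth: -order,   # newest first
    'BFS': lambda order, depth: depth,     # shallowest first
    'DFS': lambda order, depth: -depth,    # deepest first
}


def crawl_order(name, pages):
    """Return the URLs of (url, depth) pairs in the chosen visit order."""
    queue = HeapQueue(ORDERINGS[name])
    for url, depth in pages:
        queue.push(url, depth)
    return [queue.pop() for _ in range(len(queue))]


pages = [('a', 0), ('b', 1), ('c', 2), ('d', 1)]
fifo = crawl_order('FIFO', pages)
lifo = crawl_order('LIFO', pages)
bfs = crawl_order('BFS', pages)
dfs = crawl_order('DFS', pages)
```

Pages with equal depth come out in insertion order because the counter is the second element of each heap entry.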

SQLAlchemy backends

This set of :class:`Backend <crawlfrontier.core.components.Backend>` objects uses SQLAlchemy as storage for :ref:`basic algorithms <frontier-backends-basic-algorithms>`.

By default it uses an in-memory SQLite database as the storage engine, but any database supported by SQLAlchemy can be used.

:class:`Request <crawlfrontier.core.models.Request>` and :class:`Response <crawlfrontier.core.models.Response>` are represented by a declarative SQLAlchemy model:

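from sqlalchemy import Column, Integer, String, TIMESTAMP, UniqueConstraint
from sqlalchemy.ext.declarative import declarative_base

# Declarative base, defined here only so the snippet is self-contained.
Base = declarative_base()


class Page(Base):
    __tablename__ = 'pages'
    __table_args__ = (
        UniqueConstraint('url'),
    )

    class State:
        # Values stored in the ``state`` column below.
        NOT_CRAWLED = 'NOT CRAWLED'
        QUEUED = 'QUEUED'
        CRAWLED = 'CRAWLED'
        ERROR = 'ERROR'

    url = Column(String(1000), nullable=False)
    # The fingerprint is the primary key of the table.
    fingerprint = Column(String(40), primary_key=True, nullable=False, index=True, unique=True)
    depth = Column(Integer, nullable=False)
    created_at = Column(TIMESTAMP, nullable=False)
    status_code = Column(String(20))
    # Note: 'NOT CRAWLED' is 11 characters, one more than this declared length.
    state = Column(String(10))
    error = Column(String(20))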
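
The ``fingerprint`` column above is 40 characters wide, which matches the length of a SHA-1 hex digest. As a minimal sketch, and assuming (the text here does not say so) that the fingerprint is the SHA-1 digest of the page URL, it could be computed like this. ``url_fingerprint`` is a hypothetical helper name.

```python
import hashlib


def url_fingerprint(url):
    """Return the SHA-1 hex digest of a URL (always 40 characters)."""
    return hashlib.sha1(url.encode('utf-8')).hexdigest()


fp = url_fingerprint('http://example.com/')
same = url_fingerprint('http://example.com/')
other = url_fingerprint('http://example.com/about')
```

Because the digest depends only on the URL, the same page always maps to the same primary key, while different URLs get different keys.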

If you need to create your own models, you can do it by using the :setting:`DEFAULT_MODELS` setting:

DEFAULT_MODELS = {
    'Page': 'crawlfrontier.contrib.backends.sqlalchemy.models.Page',
}

This setting is a dictionary in which each key is the name of a model and each value is the path of the model class to use. For instance, to add a model that represents domains:

DEFAULT_MODELS = {
    'Page': 'crawlfrontier.contrib.backends.sqlalchemy.models.Page',
    'Domain': 'myproject.backends.sqlalchemy.models.Domain',
}

Models can be accessed through the ``models`` dictionary attribute of the Backend.

For a complete list of all settings used by the SQLAlchemy backends, check the :doc:`settings <frontier-settings>` section.
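
As a rough illustration of how a name-to-path dictionary like :setting:`DEFAULT_MODELS` can be turned into a dictionary of classes, here is a minimal sketch. It is an assumption about the general technique, not the library's loader: ``load_object`` and ``load_models`` are hypothetical helpers, and standard-library classes stand in for real model classes so the example runs on its own.

```python
from importlib import import_module


def load_object(path):
    """Import and return the object named by a full dotted path."""
    module_path, _, name = path.rpartition('.')
    if not module_path:
        raise ValueError("%r is not a full dotted path" % path)
    return getattr(import_module(module_path), name)


def load_models(setting):
    """Resolve every dotted path in a name-to-path dictionary."""
    return {name: load_object(path) for name, path in setting.items()}


# Standard-library classes stand in for real model classes here.
EXAMPLE_MODELS = {
    'Page': 'collections.OrderedDict',
    'Domain': 'collections.Counter',
}
models = load_models(EXAMPLE_MODELS)
page_model = models['Page']
```

Looking models up by name keeps backend code independent of where the model classes are defined.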

Base class for SQLAlchemy :class:`Backend <crawlfrontier.core.components.Backend>` objects.

SQLAlchemy :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the FIFO algorithm.

SQLAlchemy :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the LIFO algorithm.

SQLAlchemy :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the BFS algorithm.

SQLAlchemy :class:`Backend <crawlfrontier.core.components.Backend>` implementation of the DFS algorithm.

SQLAlchemy :class:`Backend <crawlfrontier.core.components.Backend>` implementation of a random selection algorithm.