Frontera is a web crawling toolbox that lets you build crawlers of any scale and purpose.
Frontera provides a crawl frontier framework that manages when and what to crawl next
and checks whether the crawling goal has been accomplished.
Frontera also provides replication, sharding, and isolation of all crawler components so that the crawler can be scaled and distributed.
Frontera contains the components needed to create a fully operational web crawler with Scrapy. Although it was originally designed for Scrapy, it offers a generic toolbox and can be used with any other crawling framework or system.
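To make the idea of a crawl frontier concrete, here is a minimal, self-contained sketch of one. It is not Frontera's API: the names `ToyFrontier`, `crawl`, `fetch_links` and `FAKE_WEB` are hypothetical, and the breadth-first priority and page limit are assumptions chosen only for illustration. It shows the three responsibilities named above: deciding what to crawl next, deciding when, and checking whether the crawling goal has been reached.

```python
import heapq
from itertools import count


class ToyFrontier:
    """Toy crawl frontier (hypothetical, NOT Frontera's API)."""

    def __init__(self, seeds, max_pages):
        self._heap = []            # entries are (depth, insertion order, url)
        self._seen = set()         # URLs that have already been scheduled
        self._order = count()      # tie-breaker: FIFO among equal depths
        self._fetched = 0
        self._max_pages = max_pages  # assumed crawling goal: a page budget
        for url in seeds:
            self.add(url, depth=0)

    def add(self, url, depth):
        """Schedule a URL at most once; shallower links are crawled first."""
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (depth, next(self._order), url))

    def finished(self):
        """Goal check: stop at the page budget or when nothing is queued."""
        return self._fetched >= self._max_pages or not self._heap

    def next_request(self):
        """Decide what to crawl next: the shallowest, oldest queued URL."""
        depth, _, url = heapq.heappop(self._heap)
        self._fetched += 1
        return url, depth


def crawl(frontier, fetch_links):
    """Drive the frontier with any fetcher that returns a page's links."""
    visited = []
    while not frontier.finished():
        url, depth = frontier.next_request()
        visited.append(url)
        for link in fetch_links(url):
            frontier.add(link, depth + 1)
    return visited


# A fake in-memory "web" stands in for a real fetcher such as Scrapy.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["d", "a"],
    "c": ["d"],
    "d": [],
}


def fake_fetch(url):
    return FAKE_WEB.get(url, [])


visited = crawl(ToyFrontier(["a"], max_pages=3), fake_fetch)
print(visited)
```

Because the fetcher is passed in as a plain function, the same frontier logic could sit behind any crawling framework, which is the separation the paragraph above describes.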
The purpose of this chapter is to introduce you to the concepts behind Frontera so that you can get an idea of how it works and decide if it is suited to your needs.
topics/overview
Understand what Frontera is and how it can help you.
topics/run-modes
High level architecture and Frontera run modes.
topics/quick-start-single
Quick start using Scrapy as a container for running Frontera.
topics/quick-start-distributed
Quick start for distributed mode with SQLite and ZeroMQ.
topics/cluster-setup
Setting up a clustered version of Frontera on multiple machines with HBase and Kafka.
topics/installation
Installation how-to and dependency options.
topics/frontier-objects
Understand the classes used to represent requests and responses.
topics/frontier-middlewares
Filter or alter information for links and documents.
topics/frontier-canonicalsolvers
Identify and make use of the canonical URL of a document.
topics/frontier-backends
Define your own crawling policy and custom storage.
topics/message_bus
Built-in message bus reference.
topics/own_crawling_strategy
Implementing your own crawling strategy for the distributed backend.
topics/scrapy-integration
Learn how to use Frontera with Scrapy.
topics/frontera-settings
Settings reference.
topics/what-is-cf
Learn the theory behind crawl frontiers.
topics/graph-manager
Define fake crawls of websites to test your frontier.
topics/scrapy-recorder
Create Scrapy crawl recordings and reproduce them later.
topics/fine-tuning
Cluster deployment and fine-tuning information.
topics/dns-service
A few words about DNS service setup.
topics/architecture
See how Frontera works and its different components.
topics/frontier-api
Learn how to use the frontier.
topics/requests-integration
Learn how to use Frontera with Requests.
topics/examples
Some example projects and scripts using Frontera.
topics/tests
How to run and write Frontera tests.
topics/loggers
A list of loggers for use with Python's native logging system.
topics/frontier-tester
Test your frontier in an easy way.
topics/faq
Frequently asked questions.
topics/contributing
How to contribute.
topics/glossary
Glossary of terms.