Frontera documentation

Frontera is a web crawling toolbox for building crawlers of any scale and purpose. It includes:

  • a crawl frontier framework that manages when and what to crawl and checks whether the crawling goal has been accomplished,
  • workers, Scrapy wrappers, and data bus components to scale and distribute the crawler.

Frontera contains the components needed to build a fully operational web crawler with Scrapy. Although it was originally designed for Scrapy, it can also be used with any other crawling framework or system.
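To make the crawl frontier idea above concrete, here is a minimal sketch of a frontier that decides what to crawl next and when the crawling goal has been reached. It is an illustration only: the `ToyFrontier` class, the `crawl_order` helper, and the in-memory link graph are hypothetical names invented for this example and are not part of Frontera's API. The crawling goal is assumed to be a simple page budget.

```python
import heapq
from typing import Dict, List, Optional


class ToyFrontier:
    """Hypothetical in-memory crawl frontier; this is NOT Frontera's API."""

    def __init__(self, max_pages: int) -> None:
        self.max_pages = max_pages  # assumed crawling goal: a page budget
        self._queue: list = []      # min-heap of (priority, order, url)
        self._seen: set = set()     # URLs that were already scheduled
        self._order = 0             # tie-breaker that preserves insertion order
        self.crawled = 0            # number of URLs handed out so far

    def add(self, url: str, priority: int = 0) -> None:
        """Schedule a URL once; lower priority values are returned first."""
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._queue, (priority, self._order, url))
        self._order += 1

    def finished(self) -> bool:
        """The goal is met when the budget is spent or nothing is queued."""
        return self.crawled >= self.max_pages or not self._queue

    def next_url(self) -> Optional[str]:
        """Return the next URL to crawl, or None once the crawl is finished."""
        if self.finished():
            return None
        _, _, url = heapq.heappop(self._queue)
        self.crawled += 1
        return url


def crawl_order(seeds: List[str], links: Dict[str, List[str]],
                max_pages: int) -> List[str]:
    """Drive the frontier over a fake link graph and return the visit order."""
    frontier = ToyFrontier(max_pages)
    for seed in seeds:
        frontier.add(seed)
    visited = []
    while not frontier.finished():
        url = frontier.next_url()
        visited.append(url)
        # Stand-in for fetching the page and extracting its links.
        for link in links.get(url, []):
            frontier.add(link)
    return visited


if __name__ == "__main__":
    fake_site = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"]}
    print(crawl_order(["a"], fake_site, max_pages=3))
```

The frontier is kept separate from the fetching loop on purpose: the loop only asks for the next URL and reports discovered links, so the scheduling policy and the stop condition can change without touching the fetcher.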

Introduction

The purpose of this chapter is to introduce you to the concepts behind Frontera so that you can get an idea of how it works and decide if it is suited to your needs.

topics/overview

Understand what Frontera is and how it can help you.

topics/run-modes

High-level architecture and Frontera run modes.

topics/quick-start-single

Quick start for a single process, using Scrapy as a container for running Frontera.

topics/quick-start-distributed

Quick start for distributed mode, with SQLite and ZeroMQ.

topics/cluster-setup

Setting up a clustered version of Frontera on multiple machines with HBase and Kafka.

Using Frontera

topics/installation

Installation how-to and dependency options.

topics/strategies

A list of built-in crawling strategies.

topics/frontier-objects

Understand the classes used to represent requests and responses.

topics/frontier-middlewares

Filter or alter information for links and documents.

topics/frontier-canonicalsolvers

Identify and make use of the canonical URL of a document.

topics/frontier-backends

Built-in backends, and tips on implementing your own.

topics/message_bus

Built-in message bus reference.

topics/custom_crawling_strategy

Implementing your own crawling strategy.

topics/scrapy-integration

Learn how to use Frontera with Scrapy.

topics/frontera-settings

Settings reference.

Advanced usage

topics/what-is-cf

Learn the theory behind the crawl frontier.

topics/graph-manager

Define fake crawls of websites to test your frontier.

topics/scrapy-recorder

Create Scrapy crawl recordings and reproduce them later.

topics/fine-tuning

Cluster deployment and fine-tuning information.

topics/dns-service

A few words about DNS service setup.

Developer documentation

topics/architecture

See how Frontera works and its different components.

topics/frontier-api

Learn how to use the frontier.

topics/requests-integration

Learn how to use Frontera with Requests.

topics/examples

Some example projects and scripts using Frontera.

topics/tests

How to run and write Frontera tests.

topics/loggers

A list of loggers for use with Python's native logging system.

topics/frontier-tester

Test your frontier in an easy way.

topics/contributing

How to contribute.

topics/glossary

Glossary of terms.