Create your Scrapy project as you usually do. Enter a directory where you’d like to store your code and then run:
scrapy startproject tutorial
This will create a tutorial directory with the following contents:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
These are basically:
- scrapy.cfg: the project configuration file
- tutorial/: the project’s Python module; you’ll later import your code from here.
- tutorial/items.py: the project’s items file.
- tutorial/pipelines.py: the project’s pipelines file.
- tutorial/settings.py: the project’s settings file.
- tutorial/spiders/: a directory where you’ll later put your spiders.
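For the rest of this walkthrough, assume a minimal spider saved as tutorial/spiders/myspider.py. The spider name and seed URL below are placeholders; any ordinary Scrapy spider works:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'                    # matched by `scrapy crawl myspider` later
        start_urls = ['http://example.com']  # placeholder seed URL

        def parse(self, response):
            # Follow every link found on the page. Once Frontera is wired in,
            # the order in which these requests are actually crawled is decided
            # by its backend rather than by Scrapy's default scheduler.
            for href in response.css('a::attr(href)').getall():
                yield response.follow(href, callback=self.parse)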
Next, install Frontera (see the installation guide), and integrate your spider with it; the article about integration with Scrapy explains this step in detail.
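In essence, integration means telling Scrapy to use Frontera’s scheduler and middlewares. Here is a sketch of the additions to the project’s settings.py, with class paths as given in the Scrapy integration article (verify them against your installed Frontera version; the FRONTERA_SETTINGS module path is a placeholder):

    # tutorial/settings.py -- additions for Frontera

    SPIDER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    }
    DOWNLOADER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    }

    # Replace Scrapy's scheduler with Frontera's, and point it at the module
    # holding the frontier settings (placeholder path; see the next step).
    SCHEDULER = 'frontera.contrib.scrapy.scheduler.FronteraScheduler'
    FRONTERA_SETTINGS = 'tutorial.frontera_settings'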
Configure frontier settings to use a built-in backend like in-memory BFS:
BACKEND = 'frontera.contrib.backends.memory.BFS'
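This line lives in the frontier settings module that FRONTERA_SETTINGS points to. A sketch of such a module, using the placeholder path from above; the two limits are optional and their names come from Frontera’s settings reference:

    # tutorial/frontera_settings.py  (placeholder; must match FRONTERA_SETTINGS)

    # Breadth-first, in-memory backend: handy for single-process experiments,
    # since nothing is persisted between runs.
    BACKEND = 'frontera.contrib.backends.memory.BFS'

    # Optional limits:
    MAX_REQUESTS = 2000      # stop after this many requests (0 means no limit)
    MAX_NEXT_REQUESTS = 10   # how many requests the frontier hands out per batch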
Run your Scrapy spider as usual from the command line:
scrapy crawl myspider
And that's it! You've got your spider running, integrated with Frontera.
You’ve seen a simple example of how to use Frontera with Scrapy, but this only scratches the surface. Frontera provides many powerful features for making frontier management easy and efficient, such as:
- Built-in support for database storage of crawled pages.
- Easy built-in integration with Scrapy, and with any other crawler through its API.
- Two distributed crawling modes, using ZeroMQ or Kafka and distributed backends.
- Different crawling logic/policies, by defining your own backend.
- Plugging in your own request/response altering logic using middlewares.
- Creating fake sitemaps and reproducing a crawl without a crawler using the Graph Manager.
- Recording your Scrapy crawls and using them later for frontier testing.
- A logging facility that you can hook into for catching errors and debugging your frontiers (a minimal hook is sketched below).
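On the last point: recent Frontera versions emit log records through Python’s standard logging module, so hooking in can be as simple as attaching a handler. The 'frontera' logger name below is an assumption based on the package name; check the logging documentation of your Frontera version for the exact hierarchy:

    import logging

    # Route Frontera's log records to stderr with timestamps attached.
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s'))

    # 'frontera' is an assumed logger name; adjust to your version's hierarchy.
    logger = logging.getLogger('frontera')
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)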