scrapy-crawlera |version| documentation

scrapy-crawlera is a Scrapy Downloader Middleware to interact with Crawlera automatically.

Configuration

.. toctree::
   :caption: Configuration

Add the Crawlera middleware to DOWNLOADER_MIDDLEWARES in your settings.py file:
```
DOWNLOADER_MIDDLEWARES = {
    ...
    'scrapy_crawlera.CrawleraMiddleware': 610
}
```

Then there are two ways to enable it:

Through settings.py:

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'apikey'

Through spider attributes:

class MySpider(scrapy.Spider):
    crawlera_enabled = True
    crawlera_apikey = 'apikey'

(optional) If you are not using the default Crawlera proxy (http://proxy.crawlera.com:8010), for example if you have a dedicated or private instance, make sure to also set CRAWLERA_URL in settings.py, e.g.:
```
CRAWLERA_URL = 'http://myinstance.crawlera.com:8010'
```
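
To keep the API key out of version control, the same settings can be filled in from the environment. The following is a minimal sketch, assuming environment variables named CRAWLERA_APIKEY and CRAWLERA_URL; those variable names are an assumption made for this example, since the middleware itself only reads Scrapy settings:

```python
import os

# settings.py (sketch): the environment variable names below are assumptions
# chosen for this example; the middleware only looks at the Scrapy settings.
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = os.environ.get('CRAWLERA_APIKEY', '')

# Fall back to the default Crawlera proxy when no custom instance is configured.
CRAWLERA_URL = os.environ.get('CRAWLERA_URL', 'http://proxy.crawlera.com:8010')
```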

How to use it

.. toctree::
   :caption: How to use it
   :hidden:

   settings

:doc:`settings`: All configurable Scrapy Settings added by the Middleware.

With the middleware enabled, Crawlera is used automatically: every request goes through Crawlera with no further changes to your spider. To disable Crawlera for a specific Request, set dont_proxy=True in its meta:

scrapy.Request(
    'http://example.com',
    meta={
        'dont_proxy': True,
        ...
    },
)
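
When several requests need to bypass Crawlera, the decision can be kept in one place. The sketch below is one possible approach, not part of scrapy-crawlera: DIRECT_HOSTS and build_meta are hypothetical names introduced here, and the host list is only an example.

```python
from urllib.parse import urlparse

# Hypothetical set of hosts that should be fetched directly, not via Crawlera.
DIRECT_HOSTS = {'localhost', '127.0.0.1'}


def build_meta(url, extra=None):
    """Return a Request.meta dict, adding dont_proxy=True for direct hosts."""
    meta = dict(extra or {})  # copy so the caller's dict is not modified
    if urlparse(url).hostname in DIRECT_HOSTS:
        meta['dont_proxy'] = True
    return meta
```

The result can then be passed to a request as scrapy.Request(url, meta=build_meta(url)).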

Remember that you are now making requests to Crawlera, and the Crawlera service will be the one actually making the requests to the different sites.

If you need to specify special Crawlera headers, set them like normal Scrapy headers.

Here is an example of setting a Crawlera header on a Scrapy request:

scrapy.Request(
    'http://example.com',
    headers={
        'X-Crawlera-Max-Retries': 1,
        ...
    },
)

You can also set headers that apply to all requests by default with DEFAULT_REQUEST_HEADERS.
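
As a sketch of how that could look in settings.py, the fragment below sets one Crawlera header for every request. The effective_headers helper is hypothetical and only models, with plain dictionaries, the expectation that headers set on an individual request take precedence over the defaults; real Scrapy headers are case-insensitive, which this model does not reproduce.

```python
# settings.py (sketch): apply a Crawlera header to every request by default.
DEFAULT_REQUEST_HEADERS = {
    'X-Crawlera-Max-Retries': '1',
}


def effective_headers(request_headers, defaults=DEFAULT_REQUEST_HEADERS):
    """Plain-dict model: defaults only fill in headers the request lacks."""
    merged = dict(request_headers)
    for name, value in defaults.items():
        merged.setdefault(name, value)
    return merged
```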

Note

Crawlera headers are removed from requests when the middleware is activated but Crawlera is disabled. For example, if you accidentally disable Crawlera via crawlera_enabled = False but keep sending X-Crawlera-* headers in your requests, those will be removed from the request headers.
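
The effect described in this note can be pictured with a small stand-alone function. This is an illustration of the outcome only, not the middleware's actual code; strip_crawlera_headers is a hypothetical name, and matching the prefix case-insensitively is an assumption of this sketch.

```python
def strip_crawlera_headers(headers):
    """Return a copy of headers without any X-Crawlera-* entries."""
    return {
        name: value
        for name, value in headers.items()
        # Assumption: header names are compared case-insensitively.
        if not name.lower().startswith('x-crawlera-')
    }
```

Applied to a request that carries both X-Crawlera-Max-Retries and User-Agent, only User-Agent would remain.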

This middleware also adds some configurable Scrapy settings; see :ref:`the complete list here <settings>`.

All the rest

.. toctree::
   :caption: All the rest
   :hidden:

   news

:doc:`news`: See what has changed in recent scrapy-crawlera versions.
