This repository has been archived by the owner on Nov 10, 2017. It is now read-only.

Commit

Merge pull request #9 from scrapinghub/doc-custom-image-contract
Describe Custom Images contract
chekunkov committed Jul 20, 2016
2 parents 3fca8a1 + dce9a08 commit f5ffa2c
Showing 3 changed files with 117 additions and 0 deletions.
3 changes: 3 additions & 0 deletions docs/Makefile
@@ -56,6 +56,9 @@ html:
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

livehtml:
	sphinx-autobuild -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html

dirhtml:
	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
	@echo
113 changes: 113 additions & 0 deletions docs/custom-images-contract.rst
@@ -0,0 +1,113 @@
.. _custom-images-contract:

======================
Custom Images contract
======================

The contract is a set of requirements that any custom crawler Docker image has to comply with to be able to run on Scrapy Cloud.

Scrapy crawler Docker images are already supported via the :ref:`scrapinghub-entrypoint-scrapy <sh-entrypoint-scrapy>` contract implementation. If you want to run crawlers built with a framework or language other than Scrapy/Python, you have to make sure your image follows the contract statements listed below.


Contract statements
-------------------

1. The Docker image should be runnable via the ``start-crawl`` command without arguments.

   .. code-block:: bash

      docker run myscrapyimage start-crawl

2. The Docker image should return the list of its spiders via the ``list-spiders`` command without arguments.

   .. code-block:: bash

      docker run myscrapyimage list-spiders

3. The crawler should be able to get all the parameters it needs from :ref:`system environment variables <environment-variables>` (see the illustrative sketch after this list).
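
For illustration only, a minimal, hypothetical ``start-crawl`` entrypoint for a
non-Scrapy image could be a small Python script like the one below. The
``mycrawler.run_spider`` function is an assumption standing in for your own
crawler code; the only contract-relevant parts are that the script takes no
arguments and reads everything it needs from the environment.

.. code-block:: python

    #!/usr/bin/env python
    # Hypothetical start-crawl entrypoint for a custom (non-Scrapy) image.
    import json
    import os

    # mycrawler.run_spider is an assumed project-specific function,
    # not part of the contract.
    from mycrawler import run_spider


    def main():
        project_id, spider_id, job_id = os.environ["SHUB_JOBKEY"].split("/")
        job_data = json.loads(os.environ.get("SHUB_JOB_DATA", "{}"))
        settings = json.loads(os.environ.get("SHUB_SETTINGS", "{}"))
        run_spider(job_data.get("spider"), job_data=job_data, settings=settings)


    if __name__ == "__main__":
        main()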


.. _environment-variables:

Environment variables
---------------------

SHUB_JOBKEY
^^^^^^^^^^^

Job key in the format ``PROJECT_ID/SPIDER_ID/JOB_ID``.

**Example**:

.. code-block:: text

    123/45/67
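
A crawler can split the key into its three components, for example:

.. code-block:: python

    import os

    # SHUB_JOBKEY has the form PROJECT_ID/SPIDER_ID/JOB_ID, e.g. "123/45/67".
    project_id, spider_id, job_id = os.environ["SHUB_JOBKEY"].split("/")
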
SHUB_JOB_DATA
^^^^^^^^^^^^^

Job arguments, in JSON format.

**Example**:

.. code-block:: javascript

    {"key": "1111112/2/2", "project": 1111112, "version": "version1",
     "spider": "spider-name", "spider_type": "auto", "tags": [],
     "priority": 2, "scheduled_by": "user", "started_by": "admin",
     "pending_time": 1460374516193, "running_time": 1460374557448, ... }
SHUB_SETTINGS
^^^^^^^^^^^^^

Job settings (i.e. organization / project / spider / job settings), in JSON format.

There are several layers of settings, and each serves a different need.

The settings may contain the following sections (dict keys):

- ``organization_settings``
- ``project_settings``
- ``spider_settings``
- ``job_settings``
- ``enabled_addons``

Organization / project / spider / job settings define the same settings at different levels, with different priorities. ``enabled_addons`` contains Scrapinghub addon-specific settings and may have an extended structure.
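
A minimal sketch of how a crawler might merge these layers is shown below. The
precedence used (job settings override spider settings, which override project
settings, which override organization settings) is an assumption, not something
this contract prescribes.

.. code-block:: python

    import json
    import os

    settings = json.loads(os.environ.get("SHUB_SETTINGS", "{}"))

    # Merge the layers from the assumed lowest to highest priority.
    merged = {}
    for layer in ("organization_settings", "project_settings",
                  "spider_settings", "job_settings"):
        merged.update(settings.get(layer, {}))

    # Addon settings may have an extended structure, so keep them separate.
    enabled_addons = settings.get("enabled_addons", {})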

All these settings replicate the response of the Dash API project ``/settings/get.json`` endpoint (except ``job_settings``, if present):

.. code-block:: bash

    http -a APIKEY: http://dash.scrapinghub.com/api/settings/get.json project==PROJECTID

.. note:: All environment variables starting with ``SHUB_`` are reserved for Scrapinghub internal use and shouldn't be used for any other purpose (they will be dropped/replaced when a job starts).


.. _sh-entrypoint-scrapy:

Scrapy entrypoint
-----------------

A base support wrapper, written in Python, that implements the Custom Images contract to run Scrapy-based Python crawlers and scripts on Scrapy Cloud.

The main functions of this wrapper are the following:

- providing the ``start-crawl`` entrypoint
- providing the ``list-spiders`` entrypoint (starting from version ``0.7.0``)
- translating system environment variables into Scrapy ``crawl`` / ``list`` commands (a rough sketch of this translation follows the list)
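
As a rough, hypothetical sketch of that translation (not the wrapper's actual
implementation), the idea is to read the job data from the environment and turn
it into a ``scrapy crawl`` command line; the ``spider_args`` key used below is
an assumption for illustration:

.. code-block:: python

    import json
    import os
    import subprocess


    def build_crawl_command():
        job_data = json.loads(os.environ.get("SHUB_JOB_DATA", "{}"))
        cmd = ["scrapy", "crawl", job_data["spider"]]
        # Pass spider arguments, assumed here to live under "spider_args".
        for name, value in job_data.get("spider_args", {}).items():
            cmd += ["-a", "{}={}".format(name, value)]
        return cmd


    if __name__ == "__main__":
        subprocess.check_call(build_crawl_command())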

Beyond that, the wrapper provides a number of additional features:

- parsing job data from environment
- processing job args and settings
- running a job with Scrapy
- collecting stats
- advanced logging & error handling
- transparent integration with Scrapinghub storage
- custom scripts support

The **scrapinghub-entrypoint-scrapy** package is available on:

- `PyPI <https://pypi.python.org/pypi/scrapinghub-entrypoint-scrapy>`_
- `GitHub <https://github.com/scrapinghub/scrapinghub-entrypoint-scrapy/>`_
1 change: 1 addition & 0 deletions docs/index.rst
@@ -12,6 +12,7 @@ Contents:
   :maxdepth: 2

   deploy-custom-image
   custom-images-contract

Indices and tables
==================
