Merge pull request #4724 from Gallaecio/feed-uri-params
Document FEED_URI_PARAMS
wRAR committed Aug 14, 2020
2 parents 6165341 + 65e0aba commit acb3b44
Showing 1 changed file with 67 additions and 8 deletions.
75 changes: 67 additions & 8 deletions docs/topics/feed-exports.rst
@@ -321,13 +321,14 @@ The following is a list of the accepted keys and the setting that is used
as a fallback value if that key is not provided for a specific feed definition.

* ``format``: the serialization format to be used for the feed.
  See :ref:`topics-feed-format` for possible values.
  Mandatory, no fallback setting
* ``batch_item_count``: falls back to :setting:`FEED_EXPORT_BATCH_ITEM_COUNT`
* ``encoding``: falls back to :setting:`FEED_EXPORT_ENCODING`
* ``fields``: falls back to :setting:`FEED_EXPORT_FIELDS`
* ``indent``: falls back to :setting:`FEED_EXPORT_INDENT`
* ``store_empty``: falls back to :setting:`FEED_STORE_EMPTY`
* ``uri_params``: falls back to :setting:`FEED_URI_PARAMS`

.. setting:: FEED_EXPORT_ENCODING

@@ -500,7 +501,7 @@ generated:
* ``%(batch_time)s`` - gets replaced by a timestamp when the feed is being created
(e.g. ``2020-03-28T14-45-08.237134``)

* ``%(batch_id)d`` - gets replaced by the 1-based sequence number of the batch.

Use :ref:`printf-style string formatting <python:old-string-formatting>` to
alter the number format. For example, to make the batch ID a 5-digit
@@ -517,16 +518,74 @@ And your :command:`crawl` command line is::

The command line above can generate a directory tree like::

    ->projectname
    -->dirname
    --->1-filename2020-03-28T14-45-08.237134.json
    --->2-filename2020-03-28T14-45-09.148903.json
    --->3-filename2020-03-28T14-45-10.046092.json

The first and second files contain exactly 100 items each. The last one
contains 100 items or fewer.
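
The placeholder substitution that produces these file names is ordinary
printf-style formatting, so it can be tried outside Scrapy. The following is a
minimal sketch, assuming a hypothetical URI template and hand-written parameter
values; in a real crawl Scrapy supplies the values itself:

```python
# Hypothetical values standing in for what Scrapy would supply for one batch.
params = {"batch_id": 1, "batch_time": "2020-03-28T14-45-08.237134"}

# Plain sequence number, as in the directory tree above.
plain = "dirname/%(batch_id)d-filename%(batch_time)s.json" % params

# Batch ID zero-padded to 5 digits with a printf-style width specifier.
padded = "dirname/%(batch_id)05d-filename%(batch_time)s.json" % params

print(plain)
print(padded)
```

With ``%(batch_id)05d`` the file names sort lexicographically in batch order,
which plain ``%(batch_id)d`` does not guarantee once there are ten or more
batches.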


.. setting:: FEED_URI_PARAMS

FEED_URI_PARAMS
---------------

Default: ``None``

A string with the import path of a function that returns the parameters to
apply to the feed URI using
:ref:`printf-style string formatting <python:old-string-formatting>`.

The function signature should be as follows:

.. function:: uri_params(params, spider)

   Return a :class:`dict` of key-value pairs to apply to the feed URI using
   :ref:`printf-style string formatting <python:old-string-formatting>`.

   :param params: default key-value pairs

        Specifically:

        -   ``batch_id``: ID of the file batch. See
            :setting:`FEED_EXPORT_BATCH_ITEM_COUNT`.

            If :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` is ``0``, ``batch_id``
            is always ``1``.

        -   ``batch_time``: UTC date and time, in ISO format with ``:``
            replaced with ``-``.

            See :setting:`FEED_EXPORT_BATCH_ITEM_COUNT`.

        -   ``time``: ``batch_time``, with microseconds set to ``0``.
   :type params: dict

   :param spider: source spider of the feed items
   :type spider: scrapy.spiders.Spider
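
To make the shape of ``params`` concrete, the sketch below builds a dictionary
with the three documented keys. The helper ``default_params`` is hypothetical
and only approximates the description above; it is not Scrapy's own code:

```python
from datetime import datetime, timezone


def default_params(batch_id=1):
    """Hypothetical helper approximating the documented default keys."""
    # Current UTC time as a naive datetime, so isoformat() has no offset suffix.
    now = datetime.now(timezone.utc).replace(tzinfo=None)
    return {
        "batch_id": batch_id,
        # ISO format with ":" replaced with "-".
        "batch_time": now.isoformat().replace(":", "-"),
        # Same timestamp with microseconds set to 0.
        "time": now.replace(microsecond=0).isoformat().replace(":", "-"),
    }


params = default_params()
print(params)
```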

For example, to include the :attr:`name <scrapy.spiders.Spider.name>` of the
source spider in the feed URI:

#. Define the following function somewhere in your project::

       # myproject/utils.py
       def uri_params(params, spider):
           return {**params, 'spider_name': spider.name}

#. Point :setting:`FEED_URI_PARAMS` to that function in your settings::

       # myproject/settings.py
       FEED_URI_PARAMS = 'myproject.utils.uri_params'

#. Use ``%(spider_name)s`` in your feed URI::

       scrapy crawl <spider_name> -o "%(spider_name)s.jl"
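
The effect of the steps above can be checked without running a crawl. This is
a rough sketch under stated assumptions: ``FakeSpider`` is a hypothetical
stand-in for a real spider, the default values are made up, and the final
``%`` operation only imitates how the returned dictionary is applied to the
feed URI:

```python
class FakeSpider:
    """Hypothetical stand-in for a scrapy.spiders.Spider instance."""

    name = "quotes"


def uri_params(params, spider):
    # Same function as in myproject/utils.py above.
    return {**params, "spider_name": spider.name}


# Made-up defaults with the documented keys.
defaults = {
    "batch_id": 1,
    "batch_time": "2020-03-28T14-45-08.237134",
    "time": "2020-03-28T14-45-08",
}

# Apply the returned parameters with printf-style formatting.
feed_uri = "%(spider_name)s.jl" % uri_params(defaults, FakeSpider())
print(feed_uri)
```

Returning ``{**params, ...}`` rather than a fresh dictionary keeps the default
keys available, so placeholders such as ``%(time)s`` can be combined with the
custom ``%(spider_name)s`` key in the same URI.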


.. _URIs: https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
.. _Amazon S3: https://aws.amazon.com/s3/
.. _botocore: https://github.com/boto/botocore