Merge branch '2.11' into relnotes-2.11.1
wRAR committed Jan 15, 2024
2 parents 9a834e6 + 09a7efe commit ddb7367
Showing 7 changed files with 53 additions and 20 deletions.
docs/_tests/quotes.html (2 changes: 1 addition & 1 deletion)
@@ -273,7 +273,7 @@ <h2>Top Ten tags</h2>
Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
</p>
<p class="copyright">
-Made with <span class='sh-red'></span> by <a href="https://scrapinghub.com">Scrapinghub</a>
+Made with <span class='sh-red'></span> by <a href="https://www.zyte.com">Zyte</a>
</p>
</div>
</footer>
docs/_tests/quotes1.html (2 changes: 1 addition & 1 deletion)
@@ -273,7 +273,7 @@ <h2>Top Ten tags</h2>
Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
</p>
<p class="copyright">
-Made with <span class='sh-red'></span> by <a href="https://scrapinghub.com">Scrapinghub</a>
+Made with <span class='sh-red'></span> by <a href="https://www.zyte.com">Zyte</a>
</p>
</div>
</footer>
docs/contributing.rst (2 changes: 1 addition & 1 deletion)
@@ -178,7 +178,7 @@ Scrapy:
* We use `black <https://black.readthedocs.io/en/stable/>`_ for code formatting.
There is a hook in the pre-commit config
that will automatically format your code before every commit. You can also
-run black manually with ``tox -e black``.
+run black manually with ``tox -e pre-commit``.

* Don't put your name in the code you contribute; git provides enough
metadata to identify the author of the code.
docs/topics/feed-exports.rst (5 changes: 5 additions & 0 deletions)
@@ -13,6 +13,11 @@ Scrapy provides this functionality out of the box with the Feed Exports, which
allows you to generate feeds with the scraped items, using multiple
serialization formats and storage backends.

+This page provides detailed documentation for all feed export features. If you
+are looking for a step-by-step guide, check out `Zyte’s export guides`_.
+
+.. _Zyte’s export guides: https://docs.zyte.com/web-scraping/guides/export/index.html#exporting-scraped-data
+
.. _topics-feed-format:

Serialization formats
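The feed exports described above are configured through the ``FEEDS`` setting, which maps each output URI to its serialization options. A minimal sketch, assuming you want the same items exported in two formats (the file names are illustrative):

.. code-block:: python

    # settings.py: write scraped items to two feeds at once.
    FEEDS = {
        "output/items.jsonl": {"format": "jsonlines", "encoding": "utf8"},
        "output/items.csv": {"format": "csv"},
    }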
docs/topics/practices.rst (7 changes: 3 additions & 4 deletions)
@@ -288,9 +288,8 @@ Here are some tips to keep in mind when dealing with these kinds of sites:
* use a pool of rotating IPs. For example, the free `Tor project`_ or paid
services like `ProxyMesh`_. An open source alternative is `scrapoxy`_, a
super proxy that you can attach your own proxies to.
-* use a highly distributed downloader that circumvents bans internally, so you
-  can just focus on parsing clean pages. One example of such downloaders is
-  `Zyte Smart Proxy Manager`_
+* use a ban avoidance service, such as `Zyte API`_, which provides a `Scrapy
+  plugin <https://github.com/scrapy-plugins/scrapy-zyte-api>`__

If you are still unable to prevent your bot getting banned, consider contacting
`commercial support`_.
@@ -301,4 +300,4 @@ If you are still unable to prevent your bot getting banned, consider contacting
.. _Common Crawl: https://commoncrawl.org/
.. _testspiders: https://github.com/scrapinghub/testspiders
.. _scrapoxy: https://scrapoxy.io/
-.. _Zyte Smart Proxy Manager: https://www.zyte.com/smart-proxy-manager/
+.. _Zyte API: https://docs.zyte.com/zyte-api/get-started.html
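As a concrete take on the rotating-IP tip above, Scrapy's built-in ``HttpProxyMiddleware`` honours a per-request ``proxy`` meta key, so a small downloader middleware is enough to rotate a proxy pool. A minimal sketch (the proxy addresses are placeholders):

.. code-block:: python

    import random

    # Placeholder addresses; substitute your own pool.
    PROXIES = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]


    class RotatingProxyMiddleware:
        def process_request(self, request, spider):
            # HttpProxyMiddleware routes the request through this proxy.
            request.meta["proxy"] = random.choice(PROXIES)

Enable it through the ``DOWNLOADER_MIDDLEWARES`` setting, with a priority below the built-in ``HttpProxyMiddleware`` so it runs first.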
docs/topics/request-response.rst (53 changes: 41 additions & 12 deletions)
@@ -193,18 +193,47 @@ Request objects
:meth:`replace`.

.. attribute:: Request.meta

-A dict that contains arbitrary metadata for this request. This dict is
-empty for new Requests, and is usually populated by different Scrapy
-components (extensions, middlewares, etc). So the data contained in this
-dict depends on the extensions you have enabled.
-
-See :ref:`topics-request-meta` for a list of special meta keys
-recognized by Scrapy.
-
-This dict is :doc:`shallow copied <library/copy>` when the request is
-cloned using the ``copy()`` or ``replace()`` methods, and can also be
-accessed, in your spider, from the ``response.meta`` attribute.
+:value: {}
+
+A dictionary of arbitrary metadata for the request.
+
+You may extend request metadata as you see fit.
+
+Request metadata can also be accessed through the
+:attr:`~scrapy.http.Response.meta` attribute of a response.
+
+To pass data from one spider callback to another, consider using
+:attr:`cb_kwargs` instead. However, request metadata may be the right
+choice in certain scenarios, such as to maintain some debugging data
+across all follow-up requests (e.g. the source URL).
+
+A common use of request metadata is to define request-specific
+parameters for Scrapy components (extensions, middlewares, etc.). For
+example, if you set ``dont_retry`` to ``True``,
+:class:`~scrapy.downloadermiddlewares.retry.RetryMiddleware` will never
+retry that request, even if it fails. See :ref:`topics-request-meta`.
+
+You may also use request metadata in your custom Scrapy components,
+such as to keep request state information relevant to your component.
+For example,
+:class:`~scrapy.downloadermiddlewares.retry.RetryMiddleware` uses the
+``retry_times`` metadata key to keep track of how many times a request
+has been retried so far.
+
+Copying all the metadata of a previous request into a new, follow-up
+request in a spider callback is a bad practice, because request
+metadata may include metadata set by Scrapy components that is not
+meant to be copied into other requests. For example, copying the
+``retry_times`` metadata key into follow-up requests can lower the
+number of retries allowed for those follow-up requests.
+
+You should only copy all request metadata from one request to another
+if the new request is meant to replace the old request, as is often the
+case when returning a request from a :ref:`downloader middleware
+<topics-downloader-middleware>` method.
+
+Also mind that the :meth:`copy` and :meth:`replace` request methods
+:doc:`shallow-copy <library/copy>` request metadata.

.. attribute:: Request.cb_kwargs

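The spider-facing side of the ``Request.meta`` documentation above can be summed up in one sketch: ``cb_kwargs`` passes data between callbacks, while meta keys such as ``dont_retry`` parameterize components. The URLs and selector here are hypothetical:

.. code-block:: python

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["https://example.com"]  # hypothetical

        def parse(self, response):
            for href in response.css("a.item::attr(href)").getall():
                yield response.follow(
                    href,
                    callback=self.parse_item,
                    # cb_kwargs become keyword arguments of the callback.
                    cb_kwargs={"source_url": response.url},
                    # dont_retry tells RetryMiddleware to give up on failure.
                    meta={"dont_retry": True},
                )

        def parse_item(self, response, source_url):
            yield {"url": response.url, "found_on": source_url}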
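On the component-facing side, a custom downloader middleware can keep per-request state in ``Request.meta``, much as ``RetryMiddleware`` does with ``retry_times``. A minimal sketch with a hypothetical ``timing_start`` key:

.. code-block:: python

    import time


    class TimingMiddleware:
        """Log how long each request took, keeping state in request.meta."""

        def process_request(self, request, spider):
            request.meta["timing_start"] = time.time()  # hypothetical key

        def process_response(self, request, response, spider):
            started = request.meta.pop("timing_start", None)
            if started is not None:
                spider.logger.debug(
                    "Fetched %s in %.2f s", response.url, time.time() - started
                )
            return response

Because ``copy()`` and ``replace()`` shallow-copy ``meta``, a request returned from a middleware via ``request.replace(...)`` keeps such state, which is exactly the replacement scenario the documentation above allows.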
tests/test_feedexport.py (2 changes: 1 addition & 1 deletion)
@@ -2300,7 +2300,7 @@ def run_and_export(self, spider_cls, settings):
content[feed["format"]].append(file.read_bytes())
finally:
self.tearDown()
-defer.returnValue(content)
+return content

@defer.inlineCallbacks
def assertExportedJsonLines(self, items, rows, settings=None):
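The ``defer.returnValue(content)`` to ``return content`` change works because, on Python 3, a generator can return a value directly and ``@defer.inlineCallbacks`` uses it as the result of the returned ``Deferred``. A minimal sketch (the function is illustrative):

.. code-block:: python

    from twisted.internet import defer, reactor, task


    @defer.inlineCallbacks
    def delayed_double(value):
        # Wait asynchronously; a plain `return` replaces defer.returnValue().
        yield task.deferLater(reactor, 0.1, lambda: None)
        return value * 2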