Commit d2f1e00: Merge 2.11.2 changes (#6363)

Gallaecio committed May 14, 2024
1 parent b88f22c

Showing 23 changed files with 1,721 additions and 349 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
-current_version = 2.11.1
+current_version = 2.11.2
commit = True
tag = True
tag_name = {new_version}
40 changes: 18 additions & 22 deletions docs/faq.rst
@@ -138,39 +138,37 @@ See previous question.
How can I prevent memory errors due to many allowed domains?
------------------------------------------------------------

-If you have a spider with a long list of
-:attr:`~scrapy.Spider.allowed_domains` (e.g. 50,000+), consider
-replacing the default
-:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` spider middleware
-with a :ref:`custom spider middleware <custom-spider-middleware>` that requires
-less memory. For example:
+If you have a spider with a long list of :attr:`~scrapy.Spider.allowed_domains`
+(e.g. 50,000+), consider replacing the default
+:class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` downloader
+middleware with a :ref:`custom downloader middleware
+<topics-downloader-middleware-custom>` that requires less memory. For example:

- If your domain names are similar enough, use your own regular expression
-  instead joining the strings in
-  :attr:`~scrapy.Spider.allowed_domains` into a complex regular
-  expression.
+  instead joining the strings in :attr:`~scrapy.Spider.allowed_domains` into
+  a complex regular expression.

- If you can `meet the installation requirements`_, use pyre2_ instead of
Python’s re_ to compile your URL-filtering regular expression. See
:issue:`1908`.

-See also other suggestions at `StackOverflow`_.
+See also `other suggestions at StackOverflow
+<https://stackoverflow.com/q/36440681>`__.

.. note:: Remember to disable
-    :class:`scrapy.spidermiddlewares.offsite.OffsiteMiddleware` when you enable
-    your custom implementation:
+    :class:`scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` when you
+    enable your custom implementation:

    .. code-block:: python

-        SPIDER_MIDDLEWARES = {
-            "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
-            "myproject.middlewares.CustomOffsiteMiddleware": 500,
+        DOWNLOADER_MIDDLEWARES = {
+            "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": None,
+            "myproject.middlewares.CustomOffsiteMiddleware": 50,
        }

.. _meet the installation requirements: https://github.com/andreasvc/pyre2#installation
.. _pyre2: https://github.com/andreasvc/pyre2
.. _re: https://docs.python.org/library/re.html
-.. _StackOverflow: https://stackoverflow.com/q/36440681/939364

Can I use Basic HTTP Authentication in my spiders?
--------------------------------------------------
@@ -206,12 +204,10 @@ I get "Filtered offsite request" messages. How can I fix them?
Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a
problem, so you may not need to fix them.

-Those messages are thrown by the Offsite Spider Middleware, which is a spider
-middleware (enabled by default) whose purpose is to filter out requests to
-domains outside the ones covered by the spider.
-
-For more info see:
-:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware`.
+Those messages are thrown by
+:class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware`, which is a
+downloader middleware (enabled by default) whose purpose is to filter out
+requests to domains outside the ones covered by the spider.

What is the recommended way to deploy a Scrapy crawler in production?
---------------------------------------------------------------------
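
As an editorial aside, the custom downloader middleware that the FAQ entry
above suggests might look like the following minimal sketch. Everything here
is hypothetical, not code from the commit: the ``CustomOffsiteMiddleware``
name is taken from the docs' settings snippet, and the module path and
regular expression are assumptions.

.. code-block:: python

    # Hypothetical myproject/middlewares.py
    import re
    from urllib.parse import urlparse

    from scrapy.exceptions import IgnoreRequest


    class CustomOffsiteMiddleware:
        # A single compiled pattern instead of joining 50,000+ domain
        # strings into one huge regular expression.
        ALLOWED = re.compile(r"^(?:[\w-]+\.)*shop\d+\.example\.com$")

        def process_request(self, request, spider):
            hostname = urlparse(request.url).hostname or ""
            if request.dont_filter or self.ALLOWED.match(hostname):
                return None  # let the request continue down the chain
            raise IgnoreRequest(f"Filtered offsite request: {request.url}")

Enabling it would then follow the note above: set the built-in entry to
``None`` and register the custom class in ``DOWNLOADER_MIDDLEWARES``.
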
116 changes: 113 additions & 3 deletions docs/news.rst
@@ -3,7 +3,6 @@
Release notes
=============


.. _release-VERSION:

Scrapy VERSION (YYYY-MM-DD)
@@ -12,11 +11,122 @@ Scrapy VERSION (YYYY-MM-DD)
Deprecations
~~~~~~~~~~~~

-- :func:`scrapy.core.downloader.Downloader._get_slot_key` is now deprecated.
-  Consider using its corresponding public method get_slot_key() instead.
+- :meth:`scrapy.core.downloader.Downloader._get_slot_key` is deprecated, use
+  :meth:`scrapy.core.downloader.Downloader.get_slot_key` instead.
(:issue:`6340`)


+.. _release-2.11.2:
+
+Scrapy 2.11.2 (2024-05-14)
+--------------------------
+
+Security bug fixes
+~~~~~~~~~~~~~~~~~~
+
+- Redirects to non-HTTP protocols are no longer followed. Please, see the
+  `23j4-mw76-5v7h security advisory`_ for more information. (:issue:`457`)
+
+  .. _23j4-mw76-5v7h security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-23j4-mw76-5v7h
+
+- The ``Authorization`` header is now dropped on redirects to a different
+  scheme (``http://`` or ``https://``) or port, even if the domain is the
+  same. Please, see the `4qqq-9vqf-3h3f security advisory`_ for more
+  information.
+
+  .. _4qqq-9vqf-3h3f security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-4qqq-9vqf-3h3f
+
+- When using system proxy settings that are different for ``http://`` and
+  ``https://``, redirects to a different URL scheme will now also trigger the
+  corresponding change in proxy settings for the redirected request. Please,
+  see the `jm3v-qxmh-hxwv security advisory`_ for more information.
+  (:issue:`767`)
+
+  .. _jm3v-qxmh-hxwv security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-jm3v-qxmh-hxwv
+
+- :attr:`Spider.allowed_domains <scrapy.Spider.allowed_domains>` is now
+  enforced for all requests, and not only requests from spider callbacks.
+  (:issue:`1042`, :issue:`2241`, :issue:`6358`)
+
+- :func:`~scrapy.utils.iterators.xmliter_lxml` no longer resolves XML
+  entities. (:issue:`6265`)
+
+- defusedxml_ is now used to make
+  :class:`scrapy.http.request.rpc.XmlRpcRequest` more secure.
+  (:issue:`6250`, :issue:`6251`)
+
+  .. _defusedxml: https://github.com/tiran/defusedxml
+
+Bug fixes
+~~~~~~~~~
+
+- Restored support for brotlipy_, which had been dropped in Scrapy 2.11.1 in
+  favor of brotli_. (:issue:`6261`)
+
+  .. _brotli: https://github.com/google/brotli
+
+  .. note:: brotlipy is deprecated, both in Scrapy and upstream. Use brotli
+     instead if you can.
+
+- Make :setting:`METAREFRESH_IGNORE_TAGS` ``["noscript"]`` by default. This
+  prevents
+  :class:`~scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware` from
+  following redirects that would not be followed by web browsers with
+  JavaScript enabled. (:issue:`6342`, :issue:`6347`)
+
+- During :ref:`feed export <topics-feed-exports>`, do not close the
+  underlying file from :ref:`built-in post-processing plugins
+  <builtin-plugins>`.
+  (:issue:`5932`, :issue:`6178`, :issue:`6239`)
+
+- :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
+  now properly applies the ``unique`` and ``canonicalize`` parameters.
+  (:issue:`3273`, :issue:`6221`)
+
+- Do not initialize the scheduler disk queue if :setting:`JOBDIR` is an empty
+  string. (:issue:`6121`, :issue:`6124`)
+
+- Fix :attr:`Spider.logger <scrapy.Spider.logger>` not logging custom extra
+  information. (:issue:`6323`, :issue:`6324`)
+
+- ``robots.txt`` files with a non-UTF-8 encoding no longer prevent parsing
+  the UTF-8-compatible (e.g. ASCII) parts of the document.
+  (:issue:`6292`, :issue:`6298`)
+
+- :meth:`scrapy.http.cookies.WrappedRequest.get_header` no longer raises an
+  exception if ``default`` is ``None``.
+  (:issue:`6308`, :issue:`6310`)
+
+- :class:`~scrapy.selector.Selector` now uses
+  :func:`scrapy.utils.response.get_base_url` to determine the base URL of a
+  given :class:`~scrapy.http.Response`. (:issue:`6265`)
+
+- The :meth:`media_to_download` method of :ref:`media pipelines
+  <topics-media-pipeline>` now logs exceptions before stripping them.
+  (:issue:`5067`, :issue:`5068`)
+
+- When passing a callback to the :command:`parse` command, build the callback
+  callable with the right signature.
+  (:issue:`6182`)
+
+Documentation
+~~~~~~~~~~~~~
+
+- Add a FAQ entry about :ref:`creating blank requests <faq-blank-request>`.
+  (:issue:`6203`, :issue:`6208`)
+
+- Document that :attr:`scrapy.selector.Selector.type` can be ``"json"``.
+  (:issue:`6328`, :issue:`6334`)
+
+Quality assurance
+~~~~~~~~~~~~~~~~~
+
+- Make builds reproducible. (:issue:`5019`, :issue:`6322`)
+
+- Packaging and test fixes.
+  (:issue:`6286`, :issue:`6290`, :issue:`6312`, :issue:`6316`, :issue:`6344`)


.. _release-2.11.1:

Scrapy 2.11.1 (2024-02-14)
4 changes: 2 additions & 2 deletions docs/topics/benchmarking.rst
@@ -24,7 +24,8 @@ You should see an output like this::
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
-['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
+['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
+ 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
@@ -37,7 +38,6 @@
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
-'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
44 changes: 43 additions & 1 deletion docs/topics/downloader-middleware.rst
@@ -763,6 +763,44 @@ HttpProxyMiddleware
Keep in mind this value will take precedence over ``http_proxy``/``https_proxy``
environment variables, and it will also ignore ``no_proxy`` environment variable.

+OffsiteMiddleware
+-----------------
+
+.. module:: scrapy.downloadermiddlewares.offsite
+   :synopsis: Offsite Middleware
+
+.. class:: OffsiteMiddleware
+
+   .. versionadded:: 2.11.2
+
+   Filters out Requests for URLs outside the domains covered by the spider.
+
+   This middleware filters out every request whose host names aren't in the
+   spider's :attr:`~scrapy.Spider.allowed_domains` attribute.
+   All subdomains of any domain in the list are also allowed.
+   E.g. the rule ``www.example.org`` will also allow ``bob.www.example.org``
+   but not ``www2.example.com`` nor ``example.com``.
+
+   When your spider returns a request for a domain not belonging to those
+   covered by the spider, this middleware will log a debug message similar to
+   this one::
+
+       DEBUG: Filtered offsite request to 'offsite.example': <GET http://offsite.example/some/page.html>
+
+   To avoid filling the log with too much noise, it will only print one of
+   these messages for each new domain filtered. So, for example, if another
+   request for ``offsite.example`` is filtered, no log message will be
+   printed. But if a request for ``other.example`` is filtered, a message
+   will be printed (but only for the first request filtered).
+
+   If the spider doesn't define an
+   :attr:`~scrapy.Spider.allowed_domains` attribute, or the
+   attribute is empty, the offsite middleware will allow all requests.
+
+   If the request has the :attr:`~scrapy.Request.dont_filter` attribute
+   set, the offsite middleware will allow the request even if its domain is not
+   listed in allowed domains.

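As an editorial aside, the behavior documented for the new
``OffsiteMiddleware`` above can be pictured with a small hypothetical spider
(the domain names are the illustrative ones used in the documentation):

.. code-block:: python

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["www.example.org"]
        start_urls = ["https://www.example.org/"]

        def parse(self, response):
            # Followed: bob.www.example.org is a subdomain of www.example.org.
            yield scrapy.Request("https://bob.www.example.org/a.html")
            # Filtered, with a "Filtered offsite request" debug message.
            yield scrapy.Request("http://offsite.example/some/page.html")
            # Followed despite being offsite, because dont_filter is set.
            yield scrapy.Request("http://offsite.example/b.html", dont_filter=True)
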
RedirectMiddleware
------------------

@@ -882,7 +920,11 @@ Meta tags within these tags are ignored.

.. versionchanged:: 2.0
The default value of :setting:`METAREFRESH_IGNORE_TAGS` changed from
-``['script', 'noscript']`` to ``[]``.
+``["script", "noscript"]`` to ``[]``.

+.. versionchanged:: 2.11.2
+   The default value of :setting:`METAREFRESH_IGNORE_TAGS` changed from
+   ``[]`` to ``["noscript"]``.

.. versionchanged:: VERSION
The default value of :setting:`METAREFRESH_IGNORE_TAGS` changed from
[…]
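
Projects that relied on the old behavior can presumably restore it in their
settings; an editorial sketch, not part of the commit:

.. code-block:: python

    # settings.py: follow <meta refresh> redirects even inside <noscript>
    # tags, as before Scrapy 2.11.2.
    METAREFRESH_IGNORE_TAGS = []
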
2 changes: 1 addition & 1 deletion docs/topics/settings.rst
@@ -674,6 +674,7 @@ Default:
.. code-block:: python
{
"scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": 50,
"scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
"scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300,
"scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware": 350,
@@ -1613,7 +1614,6 @@ Default:
{
"scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
"scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
"scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
"scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": 800,
"scrapy.spidermiddlewares.depth.DepthMiddleware": 900,
11 changes: 9 additions & 2 deletions docs/topics/signals.rst
@@ -343,11 +343,18 @@ request_scheduled
.. signal:: request_scheduled
.. function:: request_scheduled(request, spider)

-Sent when the engine schedules a :class:`~scrapy.Request`, to be
-downloaded later.
+Sent when the engine is asked to schedule a :class:`~scrapy.Request`, to be
+downloaded later, before the request reaches the :ref:`scheduler
+<topics-scheduler>`.

+Raise :exc:`~scrapy.exceptions.IgnoreRequest` to drop a request before it
+reaches the scheduler.

This signal does not support returning deferreds from its handlers.

+.. versionadded:: 2.11.2
+   Allow dropping requests with :exc:`~scrapy.exceptions.IgnoreRequest`.

:param request: the request that reached the scheduler
:type request: :class:`~scrapy.Request` object

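
A handler taking advantage of the new behavior might look like this
hypothetical extension (an editorial sketch; only the signal name, the
handler signature, and raising ``IgnoreRequest`` come from the documentation
above):

.. code-block:: python

    from scrapy import signals
    from scrapy.exceptions import IgnoreRequest


    class DropPdfRequests:
        """Drop requests for .pdf URLs before they reach the scheduler."""

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            crawler.signals.connect(
                ext.request_scheduled, signal=signals.request_scheduled
            )
            return ext

        def request_scheduled(self, request, spider):
            if request.url.endswith(".pdf"):
                raise IgnoreRequest  # supported since Scrapy 2.11.2
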
40 changes: 2 additions & 38 deletions docs/topics/spider-middleware.rst
@@ -51,8 +51,8 @@ value. For example, if you want to disable the off-site middleware:
.. code-block:: python
SPIDER_MIDDLEWARES = {
"myproject.middlewares.CustomSpiderMiddleware": 543,
"scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
"scrapy.spidermiddlewares.referer.RefererMiddleware": None,
"myproject.middlewares.CustomRefererSpiderMiddleware": 700,
}
Finally, keep in mind that some middlewares may need to be enabled through a
@@ -313,42 +313,6 @@ Default: ``False``

Pass all responses, regardless of its status code.

-OffsiteMiddleware
------------------
-
-.. module:: scrapy.spidermiddlewares.offsite
-   :synopsis: Offsite Spider Middleware
-
-.. class:: OffsiteMiddleware
-
-   Filters out Requests for URLs outside the domains covered by the spider.
-
-   This middleware filters out every request whose host names aren't in the
-   spider's :attr:`~scrapy.Spider.allowed_domains` attribute.
-   All subdomains of any domain in the list are also allowed.
-   E.g. the rule ``www.example.org`` will also allow ``bob.www.example.org``
-   but not ``www2.example.com`` nor ``example.com``.
-
-   When your spider returns a request for a domain not belonging to those
-   covered by the spider, this middleware will log a debug message similar to
-   this one::
-
-       DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>
-
-   To avoid filling the log with too much noise, it will only print one of
-   these messages for each new domain filtered. So, for example, if another
-   request for ``www.othersite.com`` is filtered, no log message will be
-   printed. But if a request for ``someothersite.com`` is filtered, a message
-   will be printed (but only for the first request filtered).
-
-   If the spider doesn't define an
-   :attr:`~scrapy.Spider.allowed_domains` attribute, or the
-   attribute is empty, the offsite middleware will allow all requests.
-
-   If the request has the :attr:`~scrapy.Request.dont_filter` attribute
-   set, the offsite middleware will allow the request even if its domain is not
-   listed in allowed domains.


RefererMiddleware
-----------------
3 changes: 2 additions & 1 deletion docs/topics/spiders.rst
@@ -75,7 +75,8 @@ scrapy.Spider
An optional list of strings containing domains that this spider is
allowed to crawl. Requests for URLs not belonging to the domain names
specified in this list (or their subdomains) won't be followed if
-:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` is enabled.
+:class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` is
+enabled.

Let's say your target url is ``https://www.example.com/1.html``,
then add ``'example.com'`` to the list.
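
Concretely, the example in that paragraph corresponds to a spider along these
lines (a hypothetical sketch; only the URL and domain come from the docs):

.. code-block:: python

    import scrapy


    class MySpider(scrapy.Spider):
        name = "my_spider"
        # Covers https://www.example.com/1.html and any other subdomain of
        # example.com; requests elsewhere are filtered by OffsiteMiddleware.
        allowed_domains = ["example.com"]
        start_urls = ["https://www.example.com/1.html"]
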
2 changes: 1 addition & 1 deletion scrapy/VERSION
@@ -1 +1 @@
-2.11.1
+2.11.2
