Commit d2f1e00: Merge 2.11.2 changes (#6363)

Gallaecio committed May 14, 2024
1 parent b88f22c

Showing 23 changed files with 1,721 additions and 349 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
-current_version = 2.11.1
+current_version = 2.11.2
commit = True
tag = True
tag_name = {new_version}
40 changes: 18 additions & 22 deletions docs/faq.rst
@@ -138,39 +138,37 @@ See previous question.
How can I prevent memory errors due to many allowed domains?
------------------------------------------------------------

-If you have a spider with a long list of
-:attr:`~scrapy.Spider.allowed_domains` (e.g. 50,000+), consider
-replacing the default
-:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` spider middleware
-with a :ref:`custom spider middleware <custom-spider-middleware>` that requires
-less memory. For example:
+If you have a spider with a long list of :attr:`~scrapy.Spider.allowed_domains`
+(e.g. 50,000+), consider replacing the default
+:class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` downloader
+middleware with a :ref:`custom downloader middleware
+<topics-downloader-middleware-custom>` that requires less memory. For example:

- If your domain names are similar enough, use your own regular expression
-  instead joining the strings in
-  :attr:`~scrapy.Spider.allowed_domains` into a complex regular
-  expression.
+  instead joining the strings in :attr:`~scrapy.Spider.allowed_domains` into
+  a complex regular expression.

- If you can `meet the installation requirements`_, use pyre2_ instead of
Python’s re_ to compile your URL-filtering regular expression. See
:issue:`1908`.

-See also other suggestions at `StackOverflow`_.
+See also `other suggestions at StackOverflow
+<https://stackoverflow.com/q/36440681>`__.

.. note:: Remember to disable
-    :class:`scrapy.spidermiddlewares.offsite.OffsiteMiddleware` when you enable
-    your custom implementation:
+    :class:`scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` when you
+    enable your custom implementation:

    .. code-block:: python

-        SPIDER_MIDDLEWARES = {
-            "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
-            "myproject.middlewares.CustomOffsiteMiddleware": 500,
+        DOWNLOADER_MIDDLEWARES = {
+            "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": None,
+            "myproject.middlewares.CustomOffsiteMiddleware": 50,
        }

.. _meet the installation requirements: https://github.com/andreasvc/pyre2#installation
.. _pyre2: https://github.com/andreasvc/pyre2
.. _re: https://docs.python.org/library/re.html
-.. _StackOverflow: https://stackoverflow.com/q/36440681/939364

Can I use Basic HTTP Authentication in my spiders?
--------------------------------------------------
@@ -206,12 +204,10 @@ I get "Filtered offsite request" messages. How can I fix them?
Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a
problem, so you may not need to fix them.

-Those messages are thrown by the Offsite Spider Middleware, which is a spider
-middleware (enabled by default) whose purpose is to filter out requests to
-domains outside the ones covered by the spider.
-
-For more info see:
-:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware`.
+Those messages are thrown by
+:class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware`, which is a
+downloader middleware (enabled by default) whose purpose is to filter out
+requests to domains outside the ones covered by the spider.

What is the recommended way to deploy a Scrapy crawler in production?
---------------------------------------------------------------------
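
As an editorial aside, the custom downloader middleware that the FAQ entry
above suggests might look like the following minimal sketch. Everything here
is hypothetical, not code from the commit: the ``CustomOffsiteMiddleware``
name is taken from the docs' settings snippet, and the module path and
regular expression are assumptions.

.. code-block:: python

    # Hypothetical myproject/middlewares.py
    import re
    from urllib.parse import urlparse

    from scrapy.exceptions import IgnoreRequest


    class CustomOffsiteMiddleware:
        # A single compiled pattern instead of joining 50,000+ domain
        # strings into one huge regular expression.
        ALLOWED = re.compile(r"^(?:[\w-]+\.)*shop\d+\.example\.com$")

        def process_request(self, request, spider):
            hostname = urlparse(request.url).hostname or ""
            if request.dont_filter or self.ALLOWED.match(hostname):
                return None  # let the request continue down the chain
            raise IgnoreRequest(f"Filtered offsite request: {request.url}")

Enabling it would then follow the note above: set the built-in entry to
``None`` and register the custom class in ``DOWNLOADER_MIDDLEWARES``.
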
116 changes: 113 additions & 3 deletions docs/news.rst
@@ -3,7 +3,6 @@
Release notes
=============


.. _release-VERSION:

Scrapy VERSION (YYYY-MM-DD)
@@ -12,11 +11,122 @@ Scrapy VERSION (YYYY-MM-DD)
Deprecations
~~~~~~~~~~~~

-- :func:`scrapy.core.downloader.Downloader._get_slot_key` is now deprecated.
-  Consider using its corresponding public method get_slot_key() instead.
+- :meth:`scrapy.core.downloader.Downloader._get_slot_key` is deprecated, use
+  :meth:`scrapy.core.downloader.Downloader.get_slot_key` instead.
(:issue:`6340`)


+.. _release-2.11.2:
+
+Scrapy 2.11.2 (2024-05-14)
+--------------------------
+
+Security bug fixes
+~~~~~~~~~~~~~~~~~~
+
+- Redirects to non-HTTP protocols are no longer followed. Please, see the
+  `23j4-mw76-5v7h security advisory`_ for more information. (:issue:`457`)
+
+  .. _23j4-mw76-5v7h security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-23j4-mw76-5v7h
+
+- The ``Authorization`` header is now dropped on redirects to a different
+  scheme (``http://`` or ``https://``) or port, even if the domain is the
+  same. Please, see the `4qqq-9vqf-3h3f security advisory`_ for more
+  information.
+
+  .. _4qqq-9vqf-3h3f security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-4qqq-9vqf-3h3f
+
+- When using system proxy settings that are different for ``http://`` and
+  ``https://``, redirects to a different URL scheme will now also trigger the
+  corresponding change in proxy settings for the redirected request. Please,
+  see the `jm3v-qxmh-hxwv security advisory`_ for more information.
+  (:issue:`767`)
+
+  .. _jm3v-qxmh-hxwv security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-jm3v-qxmh-hxwv
+
+- :attr:`Spider.allowed_domains <scrapy.Spider.allowed_domains>` is now
+  enforced for all requests, and not only requests from spider callbacks.
+  (:issue:`1042`, :issue:`2241`, :issue:`6358`)
+
+- :func:`~scrapy.utils.iterators.xmliter_lxml` no longer resolves XML
+  entities. (:issue:`6265`)
+
+- defusedxml_ is now used to make
+  :class:`scrapy.http.request.rpc.XmlRpcRequest` more secure.
+  (:issue:`6250`, :issue:`6251`)
+
+  .. _defusedxml: https://github.com/tiran/defusedxml
+
+Bug fixes
+~~~~~~~~~
+
+- Restored support for brotlipy_, which had been dropped in Scrapy 2.11.1 in
+  favor of brotli_. (:issue:`6261`)
+
+  .. _brotli: https://github.com/google/brotli
+
+  .. note:: brotlipy is deprecated, both in Scrapy and upstream. Use brotli
+     instead if you can.
+
+- Make :setting:`METAREFRESH_IGNORE_TAGS` ``["noscript"]`` by default. This
+  prevents
+  :class:`~scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware` from
+  following redirects that would not be followed by web browsers with
+  JavaScript enabled. (:issue:`6342`, :issue:`6347`)
+
+- During :ref:`feed export <topics-feed-exports>`, do not close the
+  underlying file from :ref:`built-in post-processing plugins
+  <builtin-plugins>`.
+  (:issue:`5932`, :issue:`6178`, :issue:`6239`)
+
+- :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
+  now properly applies the ``unique`` and ``canonicalize`` parameters.
+  (:issue:`3273`, :issue:`6221`)
+
+- Do not initialize the scheduler disk queue if :setting:`JOBDIR` is an empty
+  string. (:issue:`6121`, :issue:`6124`)
+
+- Fix :attr:`Spider.logger <scrapy.Spider.logger>` not logging custom extra
+  information. (:issue:`6323`, :issue:`6324`)
+
+- ``robots.txt`` files with a non-UTF-8 encoding no longer prevent parsing
+  the UTF-8-compatible (e.g. ASCII) parts of the document.
+  (:issue:`6292`, :issue:`6298`)
+
+- :meth:`scrapy.http.cookies.WrappedRequest.get_header` no longer raises an
+  exception if ``default`` is ``None``.
+  (:issue:`6308`, :issue:`6310`)
+
+- :class:`~scrapy.selector.Selector` now uses
+  :func:`scrapy.utils.response.get_base_url` to determine the base URL of a
+  given :class:`~scrapy.http.Response`. (:issue:`6265`)
+
+- The :meth:`media_to_download` method of :ref:`media pipelines
+  <topics-media-pipeline>` now logs exceptions before stripping them.
+  (:issue:`5067`, :issue:`5068`)
+
+- When passing a callback to the :command:`parse` command, build the callback
+  callable with the right signature.
+  (:issue:`6182`)
+
+Documentation
+~~~~~~~~~~~~~
+
+- Add a FAQ entry about :ref:`creating blank requests <faq-blank-request>`.
+  (:issue:`6203`, :issue:`6208`)
+
+- Document that :attr:`scrapy.selector.Selector.type` can be ``"json"``.
+  (:issue:`6328`, :issue:`6334`)
+
+Quality assurance
+~~~~~~~~~~~~~~~~~
+
+- Make builds reproducible. (:issue:`5019`, :issue:`6322`)
+
+- Packaging and test fixes.
+  (:issue:`6286`, :issue:`6290`, :issue:`6312`, :issue:`6316`, :issue:`6344`)


.. _release-2.11.1:

Scrapy 2.11.1 (2024-02-14)
4 changes: 2 additions & 2 deletions docs/topics/benchmarking.rst
@@ -24,7 +24,8 @@ You should see an output like this::
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
-['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
+['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
+ 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
@@ -37,7 +38,6 @@
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-16 21:18:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
-'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
44 changes: 43 additions & 1 deletion docs/topics/downloader-middleware.rst
@@ -763,6 +763,44 @@ HttpProxyMiddleware
Keep in mind this value will take precedence over ``http_proxy``/``https_proxy``
environment variables, and it will also ignore ``no_proxy`` environment variable.

+OffsiteMiddleware
+-----------------
+
+.. module:: scrapy.downloadermiddlewares.offsite
+   :synopsis: Offsite Middleware
+
+.. class:: OffsiteMiddleware
+
+   .. versionadded:: 2.11.2
+
+   Filters out Requests for URLs outside the domains covered by the spider.
+
+   This middleware filters out every request whose host names aren't in the
+   spider's :attr:`~scrapy.Spider.allowed_domains` attribute.
+   All subdomains of any domain in the list are also allowed.
+   E.g. the rule ``www.example.org`` will also allow ``bob.www.example.org``
+   but not ``www2.example.com`` nor ``example.com``.
+
+   When your spider returns a request for a domain not belonging to those
+   covered by the spider, this middleware will log a debug message similar to
+   this one::
+
+       DEBUG: Filtered offsite request to 'offsite.example': <GET http://offsite.example/some/page.html>
+
+   To avoid filling the log with too much noise, it will only print one of
+   these messages for each new domain filtered. So, for example, if another
+   request for ``offsite.example`` is filtered, no log message will be
+   printed. But if a request for ``other.example`` is filtered, a message
+   will be printed (but only for the first request filtered).
+
+   If the spider doesn't define an
+   :attr:`~scrapy.Spider.allowed_domains` attribute, or the
+   attribute is empty, the offsite middleware will allow all requests.
+
+   If the request has the :attr:`~scrapy.Request.dont_filter` attribute
+   set, the offsite middleware will allow the request even if its domain is not
+   listed in allowed domains.

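As an editorial aside, the behavior documented for the new
``OffsiteMiddleware`` above can be pictured with a small hypothetical spider
(the domain names are the illustrative ones used in the documentation):

.. code-block:: python

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["www.example.org"]
        start_urls = ["https://www.example.org/"]

        def parse(self, response):
            # Followed: bob.www.example.org is a subdomain of www.example.org.
            yield scrapy.Request("https://bob.www.example.org/a.html")
            # Filtered, with a "Filtered offsite request" debug message.
            yield scrapy.Request("http://offsite.example/some/page.html")
            # Followed despite being offsite, because dont_filter is set.
            yield scrapy.Request("http://offsite.example/b.html", dont_filter=True)
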
RedirectMiddleware
------------------

@@ -882,7 +920,11 @@ Meta tags within these tags are ignored.

.. versionchanged:: 2.0
The default value of :setting:`METAREFRESH_IGNORE_TAGS` changed from
-``['script', 'noscript']`` to ``[]``.
+``["script", "noscript"]`` to ``[]``.

+.. versionchanged:: 2.11.2
+   The default value of :setting:`METAREFRESH_IGNORE_TAGS` changed from
+   ``[]`` to ``["noscript"]``.

.. versionchanged:: VERSION
The default value of :setting:`METAREFRESH_IGNORE_TAGS` changed from
[…]
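
Projects that relied on the old behavior can presumably restore it in their
settings; an editorial sketch, not part of the commit:

.. code-block:: python

    # settings.py: follow <meta refresh> redirects even inside <noscript>
    # tags, as before Scrapy 2.11.2.
    METAREFRESH_IGNORE_TAGS = []
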
2 changes: 1 addition & 1 deletion docs/topics/settings.rst
@@ -674,6 +674,7 @@ Default:
.. code-block:: python
{
"scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": 50,
"scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100,
"scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300,
"scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware": 350,
@@ -1613,7 +1614,6 @@ Default:
{
"scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
"scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
"scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
"scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": 800,
"scrapy.spidermiddlewares.depth.DepthMiddleware": 900,
11 changes: 9 additions & 2 deletions docs/topics/signals.rst
@@ -343,11 +343,18 @@ request_scheduled
.. signal:: request_scheduled
.. function:: request_scheduled(request, spider)

-Sent when the engine schedules a :class:`~scrapy.Request`, to be
-downloaded later.
+Sent when the engine is asked to schedule a :class:`~scrapy.Request`, to be
+downloaded later, before the request reaches the :ref:`scheduler
+<topics-scheduler>`.

+Raise :exc:`~scrapy.exceptions.IgnoreRequest` to drop a request before it
+reaches the scheduler.

This signal does not support returning deferreds from its handlers.

+.. versionadded:: 2.11.2
+   Allow dropping requests with :exc:`~scrapy.exceptions.IgnoreRequest`.

:param request: the request that reached the scheduler
:type request: :class:`~scrapy.Request` object

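
A handler taking advantage of the new behavior might look like this
hypothetical extension (an editorial sketch; only the signal name, the
handler signature, and raising ``IgnoreRequest`` come from the documentation
above):

.. code-block:: python

    from scrapy import signals
    from scrapy.exceptions import IgnoreRequest


    class DropPdfRequests:
        """Drop requests for .pdf URLs before they reach the scheduler."""

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            crawler.signals.connect(
                ext.request_scheduled, signal=signals.request_scheduled
            )
            return ext

        def request_scheduled(self, request, spider):
            if request.url.endswith(".pdf"):
                raise IgnoreRequest  # supported since Scrapy 2.11.2
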
40 changes: 2 additions & 38 deletions docs/topics/spider-middleware.rst
@@ -51,8 +51,8 @@ value. For example, if you want to disable the off-site middleware:
.. code-block:: python
SPIDER_MIDDLEWARES = {
"myproject.middlewares.CustomSpiderMiddleware": 543,
"scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
"scrapy.spidermiddlewares.referer.RefererMiddleware": None,
"myproject.middlewares.CustomRefererSpiderMiddleware": 700,
}
Finally, keep in mind that some middlewares may need to be enabled through a
@@ -313,42 +313,6 @@ Default: ``False``

Pass all responses, regardless of its status code.

-OffsiteMiddleware
------------------
-
-.. module:: scrapy.spidermiddlewares.offsite
-   :synopsis: Offsite Spider Middleware
-
-.. class:: OffsiteMiddleware
-
-   Filters out Requests for URLs outside the domains covered by the spider.
-
-   This middleware filters out every request whose host names aren't in the
-   spider's :attr:`~scrapy.Spider.allowed_domains` attribute.
-   All subdomains of any domain in the list are also allowed.
-   E.g. the rule ``www.example.org`` will also allow ``bob.www.example.org``
-   but not ``www2.example.com`` nor ``example.com``.
-
-   When your spider returns a request for a domain not belonging to those
-   covered by the spider, this middleware will log a debug message similar to
-   this one::
-
-       DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>
-
-   To avoid filling the log with too much noise, it will only print one of
-   these messages for each new domain filtered. So, for example, if another
-   request for ``www.othersite.com`` is filtered, no log message will be
-   printed. But if a request for ``someothersite.com`` is filtered, a message
-   will be printed (but only for the first request filtered).
-
-   If the spider doesn't define an
-   :attr:`~scrapy.Spider.allowed_domains` attribute, or the
-   attribute is empty, the offsite middleware will allow all requests.
-
-   If the request has the :attr:`~scrapy.Request.dont_filter` attribute
-   set, the offsite middleware will allow the request even if its domain is not
-   listed in allowed domains.


RefererMiddleware
-----------------
3 changes: 2 additions & 1 deletion docs/topics/spiders.rst
@@ -75,7 +75,8 @@ scrapy.Spider
An optional list of strings containing domains that this spider is
allowed to crawl. Requests for URLs not belonging to the domain names
specified in this list (or their subdomains) won't be followed if
-:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` is enabled.
+:class:`~scrapy.downloadermiddlewares.offsite.OffsiteMiddleware` is
+enabled.

Let's say your target url is ``https://www.example.com/1.html``,
then add ``'example.com'`` to the list.
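
Concretely, the example in that paragraph corresponds to a spider along these
lines (a hypothetical sketch; only the URL and domain come from the docs):

.. code-block:: python

    import scrapy


    class MySpider(scrapy.Spider):
        name = "my_spider"
        # Covers https://www.example.com/1.html and any other subdomain of
        # example.com; requests elsewhere are filtered by OffsiteMiddleware.
        allowed_domains = ["example.com"]
        start_urls = ["https://www.example.com/1.html"]
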
2 changes: 1 addition & 1 deletion scrapy/VERSION
@@ -1 +1 @@
-2.11.1
+2.11.2
