Commit 809bfac

Merge branch '2.11-compression-bomb' into 2.11

Gallaecio committed Feb 14, 2024
2 parents: 5bcb8fd + 12b10a7
Showing 15 changed files with 613 additions and 226 deletions.
docs/news.rst (19 additions, 0 deletions)

@@ -19,6 +19,16 @@ Highlights:
 Security bug fixes
 ~~~~~~~~~~~~~~~~~~
 
+- :setting:`DOWNLOAD_MAXSIZE` and :setting:`DOWNLOAD_WARNSIZE` now also apply
+  to the decompressed response body. Please see the `7j7m-v7m3-jqm7 security
+  advisory`_ for more information.
+
+.. _7j7m-v7m3-jqm7 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-7j7m-v7m3-jqm7
+
+- Also in relation to the `7j7m-v7m3-jqm7 security advisory`_, the
+  deprecated ``scrapy.downloadermiddlewares.decompression`` module has been
+  removed.
+
 - The ``Authorization`` header is now dropped on redirects to a different
   domain. Please see the `cw9j-q3vf-hrrv security advisory`_ for more
   information.
@@ -2941,13 +2951,22 @@ affect subclasses:

 (:issue:`3884`)
 
 
 .. _release-1.8.4:
 
 Scrapy 1.8.4 (unreleased)
 -------------------------
 
 **Security bug fixes:**
 
+- :setting:`DOWNLOAD_MAXSIZE` and :setting:`DOWNLOAD_WARNSIZE` now also apply
+  to the decompressed response body. Please see the `7j7m-v7m3-jqm7 security
+  advisory`_ for more information.
+
+- Also in relation to the `7j7m-v7m3-jqm7 security advisory`_, use of the
+  ``scrapy.downloadermiddlewares.decompression`` module is discouraged and
+  will trigger a warning.
+
 - The ``Authorization`` header is now dropped on redirects to a different
   domain. Please see the `cw9j-q3vf-hrrv security advisory`_ for more
   information.
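The two settings named in these release notes are ordinary Scrapy settings, so a project can tighten them globally in its settings module. A minimal sketch, assuming a standard project layout; the values are illustrative, not Scrapy's defaults:

# settings.py of a Scrapy project (illustrative values)
DOWNLOAD_MAXSIZE = 100 * 1024 * 1024  # abort any response whose body exceeds
                                      # 100 MiB, before or after decompression
DOWNLOAD_WARNSIZE = 10 * 1024 * 1024  # log a warning above 10 MiB
COMPRESSION_ENABLED = True  # default; keeps HttpCompressionMiddleware active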
docs/topics/request-response.rst (1 addition, 0 deletions)

@@ -731,6 +731,7 @@ Those are:
 * :reqmeta:`download_fail_on_dataloss`
 * :reqmeta:`download_latency`
 * :reqmeta:`download_maxsize`
+* :reqmeta:`download_warnsize`
 * :reqmeta:`download_timeout`
 * ``ftp_password`` (See :setting:`FTP_PASSWORD` for more info)
 * ``ftp_user`` (See :setting:`FTP_USER` for more info)
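The newly documented ``download_warnsize`` key, like ``download_maxsize``, can also be overridden for a single request through ``Request.meta``. A minimal sketch; the spider name, URL, and values are illustrative:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # Override the global size limits for this one request only.
        yield scrapy.Request(
            "https://example.com/large-download",
            meta={
                "download_maxsize": 50 * 1024 * 1024,  # hard limit: 50 MiB
                "download_warnsize": 5 * 1024 * 1024,  # warn above 5 MiB
            },
        )

    def parse(self, response):
        self.logger.info("Downloaded %d bytes", len(response.body))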
docs/topics/settings.rst (19 additions, 17 deletions)

@@ -873,40 +873,42 @@ The amount of time (in secs) that the downloader will wait before timing out.
 Request.meta key.
 
 .. setting:: DOWNLOAD_MAXSIZE
+.. reqmeta:: download_maxsize
 
 DOWNLOAD_MAXSIZE
 ----------------
 
-Default: ``1073741824`` (1024MB)
+Default: ``1073741824`` (1 GiB)
 
-The maximum response size (in bytes) that downloader will download.
+The maximum response body size (in bytes) allowed. Bigger responses are
+aborted and ignored.
 
-If you want to disable it set to 0.
+This applies both before and after compression. If decompressing a response
+body would exceed this limit, decompression is aborted and the response is
+ignored.
 
-.. reqmeta:: download_maxsize
+Use ``0`` to disable this limit.
 
-.. note::
-
-   This size can be set per spider using :attr:`download_maxsize`
-   spider attribute and per-request using :reqmeta:`download_maxsize`
-   Request.meta key.
+This limit can be set per spider using the :attr:`download_maxsize` spider
+attribute and per request using the :reqmeta:`download_maxsize` Request.meta
+key.
 
 .. setting:: DOWNLOAD_WARNSIZE
+.. reqmeta:: download_warnsize
 
 DOWNLOAD_WARNSIZE
 -----------------
 
-Default: ``33554432`` (32MB)
+Default: ``33554432`` (32 MiB)
 
-The response size (in bytes) that downloader will start to warn.
+If the size of a response exceeds this value, before or after compression, a
+warning is logged.
 
-If you want to disable it set to 0.
+Use ``0`` to disable this limit.
 
-.. note::
-
-   This size can be set per spider using :attr:`download_warnsize`
-   spider attribute and per-request using :reqmeta:`download_warnsize`
-   Request.meta key.
+This limit can be set per spider using the :attr:`download_warnsize` spider
+attribute and per request using the :reqmeta:`download_warnsize` Request.meta
+key.
 
 .. setting:: DOWNLOAD_FAIL_ON_DATALOSS
 
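Both updated sections mention per-spider overrides; those are plain class attributes on the spider, as in this minimal sketch (spider name, URL, and values are illustrative):

import scrapy


class SizeCappedSpider(scrapy.Spider):
    name = "size_capped"
    start_urls = ["https://example.com/"]

    # Documented spider attributes: they override the project-wide
    # DOWNLOAD_MAXSIZE and DOWNLOAD_WARNSIZE for this spider only.
    download_maxsize = 200 * 1024 * 1024  # 200 MiB
    download_warnsize = 20 * 1024 * 1024  # 20 MiB

    def parse(self, response):
        yield {"url": response.url, "size": len(response.body)}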
scrapy/downloadermiddlewares/decompression.py (0 additions, 94 deletions)

This file was deleted.
scrapy/downloadermiddlewares/httpcompression.py (65 additions, 35 deletions)

@@ -1,53 +1,78 @@
-import io
 import warnings
-import zlib
+from logging import getLogger
 
-from scrapy.exceptions import NotConfigured
+from scrapy import signals
+from scrapy.exceptions import IgnoreRequest, NotConfigured
 from scrapy.http import Response, TextResponse
 from scrapy.responsetypes import responsetypes
+from scrapy.utils._compression import (
+    _DecompressionMaxSizeExceeded,
+    _inflate,
+    _unbrotli,
+    _unzstd,
+)
 from scrapy.utils.deprecate import ScrapyDeprecationWarning
 from scrapy.utils.gz import gunzip
 
+logger = getLogger(__name__)
+
 ACCEPTED_ENCODINGS = [b"gzip", b"deflate"]
 
 try:
-    import brotli
-
-    ACCEPTED_ENCODINGS.append(b"br")
+    import brotli  # noqa: F401
 except ImportError:
     pass
+else:
+    ACCEPTED_ENCODINGS.append(b"br")
 
 try:
-    import zstandard
-
-    ACCEPTED_ENCODINGS.append(b"zstd")
+    import zstandard  # noqa: F401
 except ImportError:
     pass
+else:
+    ACCEPTED_ENCODINGS.append(b"zstd")
 
 
 class HttpCompressionMiddleware:
     """This middleware allows compressed (gzip, deflate) traffic to be
     sent/received from web sites"""
 
-    def __init__(self, stats=None):
-        self.stats = stats
+    def __init__(self, stats=None, *, crawler=None):
+        if not crawler:
+            self.stats = stats
+            self._max_size = 1073741824
+            self._warn_size = 33554432
+            return
+        self.stats = crawler.stats
+        self._max_size = crawler.settings.getint("DOWNLOAD_MAXSIZE")
+        self._warn_size = crawler.settings.getint("DOWNLOAD_WARNSIZE")
+        crawler.signals.connect(self.open_spider, signals.spider_opened)
 
     @classmethod
     def from_crawler(cls, crawler):
         if not crawler.settings.getbool("COMPRESSION_ENABLED"):
             raise NotConfigured
         try:
-            return cls(stats=crawler.stats)
+            return cls(crawler=crawler)
         except TypeError:
             warnings.warn(
                 "HttpCompressionMiddleware subclasses must either modify "
-                "their '__init__' method to support a 'stats' parameter or "
-                "reimplement the 'from_crawler' method.",
+                "their '__init__' method to support a 'crawler' parameter or "
+                "reimplement their 'from_crawler' method.",
                 ScrapyDeprecationWarning,
             )
-            result = cls()
-            result.stats = crawler.stats
-            return result
+            mw = cls()
+            mw.stats = crawler.stats
+            mw._max_size = crawler.settings.getint("DOWNLOAD_MAXSIZE")
+            mw._warn_size = crawler.settings.getint("DOWNLOAD_WARNSIZE")
+            crawler.signals.connect(mw.open_spider, signals.spider_opened)
+            return mw
+
+    def open_spider(self, spider):
+        if hasattr(spider, "download_maxsize"):
+            self._max_size = spider.download_maxsize
+        if hasattr(spider, "download_warnsize"):
+            self._warn_size = spider.download_warnsize
 
     def process_request(self, request, spider):
         request.headers.setdefault("Accept-Encoding", b", ".join(ACCEPTED_ENCODINGS))
@@ -59,7 +84,24 @@ def process_response(self, request, response, spider):
         content_encoding = response.headers.getlist("Content-Encoding")
         if content_encoding:
             encoding = content_encoding.pop()
-            decoded_body = self._decode(response.body, encoding.lower())
+            max_size = request.meta.get("download_maxsize", self._max_size)
+            warn_size = request.meta.get("download_warnsize", self._warn_size)
+            try:
+                decoded_body = self._decode(
+                    response.body, encoding.lower(), max_size
+                )
+            except _DecompressionMaxSizeExceeded:
+                raise IgnoreRequest(
+                    f"Ignored response {response} because its body "
+                    f"({len(response.body)} B) exceeded DOWNLOAD_MAXSIZE "
+                    f"({max_size} B) during decompression."
+                )
+            if len(response.body) < warn_size <= len(decoded_body):
+                logger.warning(
+                    f"{response} body size after decompression "
+                    f"({len(decoded_body)} B) is larger than the "
+                    f"download warning size ({warn_size} B)."
+                )
             if self.stats:
                 self.stats.inc_value(
                     "httpcompression/response_bytes",
@@ -83,25 +125,13 @@ def process_response(self, request, response, spider):
 
         return response
 
-    def _decode(self, body, encoding):
+    def _decode(self, body, encoding, max_size):
         if encoding == b"gzip" or encoding == b"x-gzip":
-            body = gunzip(body)
-
+            return gunzip(body, max_size=max_size)
         if encoding == b"deflate":
-            try:
-                body = zlib.decompress(body)
-            except zlib.error:
-                # ugly hack to work with raw deflate content that may
-                # be sent by microsoft servers. For more information, see:
-                # http://carsten.codimi.de/gzip.yaws/
-                # http://www.port80software.com/200ok/archive/2005/10/31/868.aspx
-                # http://www.gzip.org/zlib/zlib_faq.html#faq38
-                body = zlib.decompress(body, -15)
+            return _inflate(body, max_size=max_size)
         if encoding == b"br" and b"br" in ACCEPTED_ENCODINGS:
-            body = brotli.decompress(body)
+            return _unbrotli(body, max_size=max_size)
         if encoding == b"zstd" and b"zstd" in ACCEPTED_ENCODINGS:
-            # Using its streaming API since its simple API could handle only cases
-            # where there is content size data embedded in the frame
-            reader = zstandard.ZstdDecompressor().stream_reader(io.BytesIO(body))
-            body = reader.read()
+            return _unzstd(body, max_size=max_size)
         return body
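The size-capped helpers used above (``_inflate``, ``_unbrotli``, ``_unzstd``) live in the new private ``scrapy.utils._compression`` module, which this diff does not show. The underlying technique is incremental decompression that aborts as soon as the output grows past ``max_size``, rather than inflating the whole body into memory first. A sketch of that idea for deflate, using only the standard zlib streaming API; the names are illustrative, not Scrapy's actual implementation:

import zlib


class DecompressionMaxSizeExceeded(Exception):
    """Raised when decompressed output exceeds the configured limit."""


def inflate_capped(data: bytes, *, max_size: int = 0) -> bytes:
    """Decompress zlib/deflate data incrementally, enforcing max_size
    (0 means no limit)."""
    decompressor = zlib.decompressobj()
    output = b""
    pending = data
    while pending:
        # Cap each step's output so a small compressed payload (a
        # "compression bomb") cannot balloon unchecked in memory.
        output += decompressor.decompress(pending, 65536)
        if max_size and len(output) > max_size:
            raise DecompressionMaxSizeExceeded(
                f"decompressed data exceeded {max_size} bytes"
            )
        # Input left over because the output cap was hit.
        pending = decompressor.unconsumed_tail
    output += decompressor.flush()
    if max_size and len(output) > max_size:
        raise DecompressionMaxSizeExceeded(
            f"decompressed data exceeded {max_size} bytes"
        )
    return output

``HttpCompressionMiddleware._decode`` then turns such an exception into ``IgnoreRequest``, as the hunk above shows, so an oversized response is dropped instead of buffered in full.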