Commit

Merge remote-tracking branch '7j7m-v7m3-jqm7/1.8-compression-bomb' into 1.8
Gallaecio committed Feb 14, 2024
2 parents 73e7c0e + 978407c commit 71b8741
Showing 15 changed files with 750 additions and 110 deletions.
10 changes: 10 additions & 0 deletions docs/news.rst
@@ -27,6 +27,16 @@ Scrapy 1.8.4 (unreleased)
.. _ReDoS vulnerability: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
.. _cc65-xxvf-f7r9 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cc65-xxvf-f7r9

- :setting:`DOWNLOAD_MAXSIZE` and :setting:`DOWNLOAD_WARNSIZE` now also apply
  to the decompressed response body. Please see the `7j7m-v7m3-jqm7 security
  advisory`_ for more information.

.. _7j7m-v7m3-jqm7 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-7j7m-v7m3-jqm7

- Also in relation to the `7j7m-v7m3-jqm7 security advisory`_, use of the
  ``scrapy.downloadermiddlewares.decompression`` module is discouraged and
  will trigger a warning.

.. _release-1.8.3:

Scrapy 1.8.3 (2022-07-25)
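For context, a minimal settings.py sketch of the two limits this entry refers to (not part of the commit; the values are arbitrary examples):

    # settings.py (excerpt). As of this release, both limits also bound the
    # size of the *decompressed* response body, not just the downloaded bytes.
    DOWNLOAD_MAXSIZE = 256 * 1024 * 1024   # abort responses above 256 MiB
    DOWNLOAD_WARNSIZE = 16 * 1024 * 1024   # log a warning above 16 MiB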
29 changes: 15 additions & 14 deletions docs/topics/request-response.rst
@@ -318,26 +318,27 @@ are some special keys recognized by Scrapy and its built-in extensions.

Those are:

* :reqmeta:`bindaddress`
* :reqmeta:`cookiejar`
* :reqmeta:`dont_cache`
* :reqmeta:`dont_merge_cookies`
* :reqmeta:`dont_obey_robotstxt`
* :reqmeta:`dont_redirect`
* :reqmeta:`dont_retry`
* :reqmeta:`download_fail_on_dataloss`
* :reqmeta:`download_latency`
* :reqmeta:`download_maxsize`
* :reqmeta:`download_timeout`
* :reqmeta:`download_warnsize`
* ``ftp_password`` (See :setting:`FTP_PASSWORD` for more info)
* ``ftp_user`` (See :setting:`FTP_USER` for more info)
* :reqmeta:`handle_httpstatus_all`
* :reqmeta:`handle_httpstatus_list`
* :reqmeta:`max_retry_times`
* :reqmeta:`proxy`
* :reqmeta:`redirect_reasons`
* :reqmeta:`redirect_urls`
* :reqmeta:`referrer_policy`

.. reqmeta:: bindaddress

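To illustrate the per-request keys listed in the hunk above, a minimal spider sketch (not part of the commit; the spider name, URL, and byte values are made up):

    import scrapy


    class MetaLimitsExampleSpider(scrapy.Spider):
        name = "meta_limits_example"

        def start_requests(self):
            yield scrapy.Request(
                "https://example.com/large-listing",
                callback=self.parse,
                meta={
                    # Abort and ignore the response if its body, before or
                    # after decompression, would exceed 10 MiB.
                    "download_maxsize": 10 * 1024 * 1024,
                    # Log a warning once the body passes 2 MiB.
                    "download_warnsize": 2 * 1024 * 1024,
                },
            )

        def parse(self, response):
            self.logger.info("Got %d bytes from %s", len(response.body), response.url)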
36 changes: 19 additions & 17 deletions docs/topics/settings.rst
@@ -632,42 +632,44 @@ The amount of time (in secs) that the downloader will wait before timing out.
Request.meta key.

.. setting:: DOWNLOAD_MAXSIZE
.. reqmeta:: download_maxsize

DOWNLOAD_MAXSIZE
----------------

Default: ``1073741824`` (1 GiB)

The maximum response body size (in bytes) allowed. Bigger responses are
aborted and ignored.

This applies both before and after compression. If decompressing a response
body would exceed this limit, decompression is aborted and the response is
ignored.

Use ``0`` to disable this limit.

This limit can be set per spider using the :attr:`download_maxsize` spider
attribute and per request using the :reqmeta:`download_maxsize` Request.meta
key.

This feature needs Twisted >= 11.1.

.. setting:: DOWNLOAD_WARNSIZE
.. reqmeta:: download_warnsize

DOWNLOAD_WARNSIZE
-----------------

Default: ``33554432`` (32 MiB)

If the size of a response exceeds this value, before or after compression, a
warning will be logged about it.

Use ``0`` to disable this limit.

This limit can be set per spider using the :attr:`download_warnsize` spider
attribute and per request using the :reqmeta:`download_warnsize` Request.meta
key.

This feature needs Twisted >= 11.1.

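As a usage illustration of the per-spider attributes mentioned in the hunk above, a short sketch (not part of the diff; the spider name, URL, and sizes are invented):

    import scrapy


    class CappedExampleSpider(scrapy.Spider):
        name = "capped_example"
        start_urls = ["https://example.com/feed"]

        # Per-spider overrides of DOWNLOAD_MAXSIZE / DOWNLOAD_WARNSIZE; with
        # this change they also cap the decompressed body size.
        download_maxsize = 50 * 1024 * 1024   # 50 MiB hard limit
        download_warnsize = 5 * 1024 * 1024   # warn above 5 MiB

        def parse(self, response):
            yield {"url": response.url, "size": len(response.body)}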
11 changes: 11 additions & 0 deletions scrapy/downloadermiddlewares/decompression.py
@@ -8,6 +8,7 @@
import tarfile
import logging
from tempfile import mktemp
from warnings import warn

import six

@@ -16,8 +17,18 @@
except ImportError:
    from io import BytesIO

from scrapy.exceptions import ScrapyDeprecationWarning
from scrapy.responsetypes import responsetypes

warn(
    "Use of the scrapy.downloadermiddlewares.decompression module is "
    "discouraged, as it is susceptible to decompression bomb attacks. For "
    "details, see "
    "https://github.com/scrapy/scrapy/security/advisories/GHSA-7j7m-v7m3-jqm7",
    ScrapyDeprecationWarning,
    stacklevel=2,
)

logger = logging.getLogger(__name__)


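The hunk above makes importing the module emit a deprecation warning. A self-contained sketch of that import-time warning pattern, independent of Scrapy (the module and warning names here are invented for illustration):

    # legacy_module.py -- warns as soon as it is imported.
    import warnings


    class ExampleDeprecationWarning(Warning):
        """Stand-in for ScrapyDeprecationWarning in this sketch."""


    warnings.warn(
        "Use of legacy_module is discouraged; see the project's security advisory.",
        ExampleDeprecationWarning,
        stacklevel=2,  # attribute the warning to the importing code
    )

A process importing such a module sees the warning once per run; a warnings filter (for example ``warnings.simplefilter("error")`` in a test suite) can escalate it into an exception.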
119 changes: 93 additions & 26 deletions scrapy/downloadermiddlewares/httpcompression.py
@@ -1,28 +1,75 @@
import warnings
from logging import getLogger

from scrapy import signals
from scrapy.exceptions import IgnoreRequest, NotConfigured
from scrapy.http import Response, TextResponse
from scrapy.responsetypes import responsetypes
from scrapy.utils._compression import (
    _DecompressionMaxSizeExceeded,
    _inflate,
    _unbrotli,
    _unzstd,
)
from scrapy.utils.deprecate import ScrapyDeprecationWarning
from scrapy.utils.gz import gunzip

logger = getLogger(__name__)

ACCEPTED_ENCODINGS = [b"gzip", b"deflate"]

try:
    import brotli  # noqa: F401
except ImportError:
    pass
else:
    ACCEPTED_ENCODINGS.append(b"br")

try:
    import zstandard  # noqa: F401
except ImportError:
    pass
else:
    ACCEPTED_ENCODINGS.append(b"zstd")


class HttpCompressionMiddleware(object):
    """This middleware allows compressed (gzip, deflate) traffic to be
    sent/received from web sites"""

    def __init__(self, crawler=None):
        if not crawler:
            self._max_size = 1073741824
            self._warn_size = 33554432
            return
        self._max_size = crawler.settings.getint("DOWNLOAD_MAXSIZE")
        self._warn_size = crawler.settings.getint("DOWNLOAD_WARNSIZE")
        crawler.signals.connect(self.open_spider, signals.spider_opened)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('COMPRESSION_ENABLED'):
            raise NotConfigured
        try:
            return cls(crawler=crawler)
        except TypeError:
            warnings.warn(
                "HttpCompressionMiddleware subclasses must either modify "
                "their '__init__' method to support a 'crawler' parameter or "
                "reimplement their 'from_crawler' method.",
                ScrapyDeprecationWarning,
            )
            mw = cls()
            mw._max_size = crawler.settings.getint("DOWNLOAD_MAXSIZE")
            mw._warn_size = crawler.settings.getint("DOWNLOAD_WARNSIZE")
            crawler.signals.connect(mw.open_spider, signals.spider_opened)
            return mw

    def open_spider(self, spider):
        if hasattr(spider, "download_maxsize"):
            self._max_size = spider.download_maxsize
        if hasattr(spider, "download_warnsize"):
            self._warn_size = spider.download_warnsize

    def process_request(self, request, spider):
        request.headers.setdefault('Accept-Encoding',
@@ -36,9 +83,36 @@ def process_response(self, request, response, spider):
        content_encoding = response.headers.getlist('Content-Encoding')
        if content_encoding:
            encoding = content_encoding.pop()
            max_size = request.meta.get("download_maxsize", self._max_size)
            warn_size = request.meta.get("download_warnsize", self._warn_size)
            try:
                decoded_body = self._decode(
                    response.body, encoding.lower(), max_size
                )
            except _DecompressionMaxSizeExceeded:
                raise IgnoreRequest(
                    "Ignored response {response} because its body "
                    "({body_size} B) exceeded DOWNLOAD_MAXSIZE "
                    "({max_size} B) during decompression.".format(
                        response=response,
                        body_size=len(response.body),
                        max_size=max_size,
                    )
                )
            if len(response.body) < warn_size <= len(decoded_body):
                logger.warning(
                    "%(response)s body size after decompression "
                    "(%(body_size)s B) is larger than the "
                    "download warning size (%(warn_size)s B).",
                    {
                        "response": response,
                        "body_size": len(decoded_body),
                        "warn_size": warn_size,
                    },
                )
            respcls = responsetypes.from_args(
                headers=response.headers, url=response.url, body=decoded_body
            )
            kwargs = dict(cls=respcls, body=decoded_body)
            if issubclass(respcls, TextResponse):
                # force recalculating the encoding until we make sure the
@@ -50,20 +124,13 @@ def process_response(self, request, response, spider):

        return response

    def _decode(self, body, encoding, max_size):
        if encoding == b"gzip" or encoding == b"x-gzip":
            return gunzip(body, max_size=max_size)
        if encoding == b"deflate":
            return _inflate(body, max_size=max_size)
        if encoding == b"br" and b"br" in ACCEPTED_ENCODINGS:
            return _unbrotli(body, max_size=max_size)
        if encoding == b"zstd" and b"zstd" in ACCEPTED_ENCODINGS:
            return _unzstd(body, max_size=max_size)
        return body
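For what the deprecation message above is asking for, a sketch of a subclass updated to the new signature (the subclass name and the debug logging are illustrative, not part of Scrapy):

    from scrapy.downloadermiddlewares.httpcompression import HttpCompressionMiddleware


    class VerboseCompressionMiddleware(HttpCompressionMiddleware):
        """Hypothetical subclass that accepts the new ``crawler`` argument."""

        def __init__(self, crawler=None):
            # Forwarding ``crawler`` lets the parent read DOWNLOAD_MAXSIZE and
            # DOWNLOAD_WARNSIZE and connect its spider_opened handler, so the
            # deprecated fallback path in from_crawler is never taken.
            super(VerboseCompressionMiddleware, self).__init__(crawler=crawler)

        def process_response(self, request, response, spider):
            spider.logger.debug("Decoding Content-Encoding for %s", response.url)
            return super(VerboseCompressionMiddleware, self).process_response(
                request, response, spider
            )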
45 changes: 37 additions & 8 deletions scrapy/spiders/sitemap.py
@@ -1,12 +1,13 @@
import logging
import re

import six

from scrapy.http.response.xml import XmlResponse
from scrapy.spiders import Request, Spider
from scrapy.utils._compression import _DecompressionMaxSizeExceeded
from scrapy.utils.gz import gunzip, gzip_magic_number
from scrapy.utils.sitemap import Sitemap, sitemap_urls_from_robots

logger = logging.getLogger(__name__)

@@ -18,6 +19,17 @@ class SitemapSpider(Spider):
    sitemap_follow = ['']
    sitemap_alternate_links = False

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SitemapSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider._max_size = getattr(
            spider, "download_maxsize", spider.settings.getint("DOWNLOAD_MAXSIZE")
        )
        spider._warn_size = getattr(
            spider, "download_warnsize", spider.settings.getint("DOWNLOAD_WARNSIZE")
        )
        return spider

    def __init__(self, *a, **kw):
        super(SitemapSpider, self).__init__(*a, **kw)
        self._cbs = []
@@ -70,8 +82,25 @@ def _get_sitemap_body(self, response):
"""
        if isinstance(response, XmlResponse):
            return response.body
        if gzip_magic_number(response):
            uncompressed_size = len(response.body)
            max_size = response.meta.get("download_maxsize", self._max_size)
            warn_size = response.meta.get("download_warnsize", self._warn_size)
            try:
                body = gunzip(response.body, max_size=max_size)
            except _DecompressionMaxSizeExceeded:
                return None
            if uncompressed_size < warn_size <= len(body):
                logger.warning(
                    "%(response)s body size after decompression (%(body_length)s B) "
                    "is larger than the download warning size (%(warn_size)s B).",
                    {
                        "response": response,
                        "body_length": len(body),
                        "warn_size": warn_size,
                    },
                )
            return body
        # actual gzipped sitemap files are decompressed above ;
        # if we are here (response body is not gzipped)
        # and have a response for .xml.gz,
@@ -81,7 +110,7 @@ def _get_sitemap_body(self, response):
        # without actually being a .xml.gz file in the first place,
        # merely XML gzip-compressed on the fly,
        # in other word, here, we have plain XML
        if response.url.endswith('.xml') or response.url.endswith('.xml.gz'):
            return response.body
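To show how the new SitemapSpider.from_crawler picks these limits up, a sketch of a sitemap spider that caps gzipped sitemap decompression (spider name, URL, and sizes are hypothetical):

    from scrapy.spiders import SitemapSpider


    class CappedSitemapSpider(SitemapSpider):
        name = "capped_sitemap_example"
        sitemap_urls = ["https://example.com/sitemap.xml.gz"]

        # from_crawler copies these into _max_size / _warn_size, so a sitemap
        # that would inflate past 20 MiB is skipped rather than parsed.
        download_maxsize = 20 * 1024 * 1024
        download_warnsize = 4 * 1024 * 1024

        def parse(self, response):
            yield {"url": response.url}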

