allow spider to limit response size to make the broad crawl more robust #336

Closed
wants to merge 4 commits into
from

@tpeng
Contributor

tpeng commented Jul 8, 2013

The default download size limit is 32 MB, but the limit can also be set per request via request.meta['download_sizelimit'].

@pablohoffman
Member

pablohoffman commented Jul 8, 2013
A couple of comments:

  • regardless of how appropriate the 32M limit is (and it looks good), we cannot enforce it by default because it would break a lot of spiders that rely on being able to download large responses. The most common case is spiders that start crawling from a data feed
  • I think "DOWNLOAD_MAXSIZE" would be slightly more appropriate than "DOWNLOAD_SIZELIMIT"

Here's a more concrete proposal (feedback welcome!):

  • DOWNLOAD_MAXSIZE
    • responses larger than this size will be dropped (like the current patch does)
    • set to 1G by default
    • equivalent request meta: download_maxsize
  • DOWNLOAD_WARNSIZE
    • responses larger than this will cause a WARNING to be logged
    • set to 32M by default
    • no request meta equivalent
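The proposed semantics can be sketched in plain Python (a hypothetical helper using the names and defaults from the proposal above, not actual Scrapy code):

```python
# Defaults from the proposal: responses over DOWNLOAD_MAXSIZE are dropped,
# responses over DOWNLOAD_WARNSIZE only log a WARNING.
DOWNLOAD_MAXSIZE = 1024 * 1024 * 1024   # 1G
DOWNLOAD_WARNSIZE = 32 * 1024 * 1024    # 32M

def check_response_size(size, maxsize=DOWNLOAD_MAXSIZE, warnsize=DOWNLOAD_WARNSIZE):
    """Return 'drop', 'warn' or 'ok' for a response of the given size in bytes."""
    if size > maxsize:
        return 'drop'
    if size > warnsize:
        return 'warn'
    return 'ok'
```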
use DOWNLOAD_MAXSIZE instead of DOWNLOAD_SIZELIMIT
also introduce DOWNLOAD_WARNSIZE
@tpeng
Contributor

tpeng commented Sep 11, 2013

ping


docs/topics/settings.rst
+
+The maximum response size (in bytes) that downloader will download.
+
+You can also set the maximum size per request in Request's meta attribute. Example::


@pablohoffman

pablohoffman Sep 11, 2013

Member

I would say "For example, to limit the response size to 1 Mb", to avoid confusing it with the default value.


@@ -27,9 +27,15 @@ def __init__(self, settings):
self._pool = HTTPConnectionPool(reactor, persistent=True)
self._contextFactoryClass = load_object(settings['DOWNLOADER_CLIENTCONTEXTFACTORY'])
self._contextFactory = self._contextFactoryClass()
+ self._default_download_maxsize= settings['DOWNLOAD_MAXSIZE']


@pablohoffman

pablohoffman Sep 11, 2013

Member

missing space before "="


@@ -136,11 +136,6 @@ def _test(response):
request = Request(self.getURL('host'), headers={'Host': 'example.com'})
return self.download_request(request, BaseSpider('foo')).addCallback(_test)
- d = self.download_request(request, BaseSpider('foo'))


@pablohoffman

pablohoffman Sep 11, 2013

Member

why is this test removed?



@tpeng

tpeng Sep 13, 2013

Contributor

since there is a return above (line 137)


+ if self._bytes_received > self._maxsize:
+ log.msg("Received %s bytes exceed the download size limit (%s)." % (self._bytes_received, self._maxsize),
+ logLevel=log.ERROR)
+ self._finished.cancel()
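The streaming check in the patch can be sketched outside Twisted like this (a hypothetical stand-in class; the real patch logs at ERROR level and cancels a Deferred rather than setting a flag):

```python
class SizeLimitedReceiver:
    """Hypothetical stand-in for the patch's body-receiving protocol:
    counts bytes as chunks arrive and aborts once maxsize is exceeded."""

    def __init__(self, maxsize):
        self._maxsize = maxsize
        self._bytes_received = 0
        self.cancelled = False

    def data_received(self, chunk):
        if self.cancelled:
            return
        self._bytes_received += len(chunk)
        if self._bytes_received > self._maxsize:
            # the real code calls self._finished.cancel() here,
            # which closes the connection
            self.cancelled = True
```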


@pablohoffman

pablohoffman Sep 11, 2013

Member

does this close the connection? should we? have we considered raising a custom exception? (i.e. ResponseTooLarge)



@tpeng

tpeng Sep 13, 2013

Contributor

it will close the connection and the idea was to still use CancelledError


def download_request(self, request, spider):
"""Return a deferred for the HTTP download"""
+ if 'download_maxsize' not in request.meta:
+ request.meta.update({'download_maxsize': self._default_download_maxsize})
+ # always set to warnsize


@pablohoffman

pablohoffman Sep 13, 2013

Member

what information does this comment add?


+ if content_length > maxsize:
+ log.msg("Content-length %s larger than download size limit (%s)." % (content_length, maxsize),
+ logLevel=log.ERROR)
+ raise defer.CancelledError()
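The header pre-check being reviewed can be sketched as a standalone helper (hypothetical; the real patch logs and raises defer.CancelledError instead of returning a boolean):

```python
def content_length_exceeds(headers, maxsize):
    """Return True when a Content-Length header is present and larger
    than maxsize; responses without the header cannot be pre-checked
    and must rely on the streaming byte count instead."""
    content_length = headers.get('Content-Length')
    if content_length is None:
        return False
    return int(content_length) > maxsize
```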


@pablohoffman

pablohoffman Sep 13, 2013

Member

This raises CancelledError but where does it call:

txresponse._transport._producer.loseConnection()

The canceller (_cancel) is not yet set here.



@tpeng

tpeng Sep 16, 2013

Contributor

called by _cancel in _cb_bodyready



@pablohoffman

pablohoffman Sep 16, 2013

Member

If this raises here, the deferred with _cancel canceller (line 123) is never created, and thus can never be called.



@tpeng

tpeng Oct 16, 2013

Contributor

I see! So do you think adding txresponse._transport._producer.loseConnection() after line 118 is ok?


def download_request(self, request, spider):
"""Return a deferred for the HTTP download"""
+ if 'download_maxsize' not in request.meta:
+ request.meta.update({'download_maxsize': self._default_download_maxsize})


@pablohoffman

pablohoffman Sep 13, 2013

Member

Lines 35 & 36 can be replaced by a single line:

request.meta.setdefault('download_maxsize', self._default_download_maxsize)

But do we need to always populate request.meta instead of doing something like this when it's used:

maxsize = request.meta.get('download_maxsize', self._default_download_maxsize)
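The two alternatives discussed above are equivalent on the read path; the difference is whether meta gets mutated (a plain-dict sketch, with a hypothetical DEFAULT_MAXSIZE standing in for the DOWNLOAD_MAXSIZE setting):

```python
DEFAULT_MAXSIZE = 1024 * 1024 * 1024

# Option 1: populate meta up front, then read it back.
meta1 = {}
meta1.setdefault('download_maxsize', DEFAULT_MAXSIZE)
maxsize1 = meta1['download_maxsize']

# Option 2: leave meta untouched and fall back at the point of use,
# the way download_timeout is handled.
meta2 = {}
maxsize2 = meta2.get('download_maxsize', DEFAULT_MAXSIZE)

# Both reads yield the same limit; only option 1 writes into meta.
assert maxsize1 == maxsize2
```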


@dangra

dangra Sep 13, 2013

Member

+1 to do it like it is done for download_timeout



@tpeng

tpeng Sep 16, 2013

Contributor

yep. it's better!


@@ -27,10 +27,13 @@ def __init__(self, settings):
self._pool = HTTPConnectionPool(reactor, persistent=True)
self._contextFactoryClass = load_object(settings['DOWNLOADER_CLIENTCONTEXTFACTORY'])
self._contextFactory = self._contextFactoryClass()
+ self._default_download_maxsize = settings['DOWNLOAD_MAXSIZE']


@pablohoffman

pablohoffman Sep 16, 2013

Member

both of these should use settings.getint()


@@ -42,7 +45,10 @@ class ScrapyAgent(object):
_Agent = Agent
_ProxyAgent = ProxyAgent
- def __init__(self, contextFactory=None, connectTimeout=10, bindAddress=None, pool=None):
+ def __init__(self, download_maxsize, download_warnsize,


@pablohoffman

pablohoffman Sep 16, 2013

Member

Passing the maxsize/warnsize as initial required positional arguments seems counter-intuitive to me; I would have added them at the end as optional keyword arguments. Even if this API isn't public, there's no need to make it intentionally less future-compatible (remember that warnsize will be going away eventually).


@@ -97,11 +103,25 @@ def _cb_bodyready(self, txresponse, request):
if txresponse.length == 0:
return txresponse, '', None
+ headers = Headers(txresponse.headers.getAllRawHeaders())
+ content_length = headers.get('Content-Length', None)
+ maxsize = request.meta.get('download_maxsize') or self._download_maxsize


@pablohoffman

pablohoffman Sep 16, 2013

Member

any reason not to use the more standard form:

request.meta.get('download_maxsize', self._download_maxsize) ?
@nramirezuy
Member

nramirezuy commented Oct 14, 2013

@tpeng Are you still working on this?


@tpeng
Contributor

tpeng commented Oct 16, 2013

sorry for the late reply, but I got stuck when adding a test case for content-length > maxsize. The problem is that there is no Content-Length set in the response's headers when the URL points to a file in test_download_maxsize in test_downloader_handlers.


@bootstraponline

bootstraponline commented Nov 6, 2014

no content-length set in response's headers

what about len(response.body) for when there's no content-length set?

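The fallback suggested above could be sketched like this (hypothetical helper; it assumes the full body is already in memory, which defeats the streaming limit but would do for a test fixture with no Content-Length header):

```python
def response_size(headers, body):
    """Prefer the Content-Length header; fall back to the actual body
    length when the server did not send one."""
    content_length = headers.get('Content-Length')
    if content_length is not None:
        return int(content_length)
    return len(body)
```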

@tpeng
Contributor

tpeng commented Nov 12, 2014

replaced by #946. closing this one.


@tpeng tpeng closed this Nov 12, 2014
