avoid download large response #946

Merged: 3 commits into scrapy:master, Nov 25, 2014

Conversation

@tpeng (Contributor) commented Nov 12, 2014

introduce DOWNLOAD_MAXSIZE and DOWNLOAD_WARNSIZE in settings and download_maxsize/download_warnsize in spider/request meta, so the downloader stops downloading as soon as the received data exceeds the limit. Also check the Twisted response's length in advance to stop downloading as early as possible.

NB, this PR redoes the changes from #336.
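For illustration, a minimal sketch of how the two settings might look in a project (values are the defaults from this PR's diff; the per-spider attribute and request-meta key are the ones the PR names):

```python
# settings.py -- project-wide size caps (values taken from the PR's defaults)
DOWNLOAD_MAXSIZE = 1024 * 1024 * 1024   # 1 GB: abort the download beyond this
DOWNLOAD_WARNSIZE = 32 * 1024 * 1024    # 32 MB: log a warning beyond this

# Per-spider override: set a `download_maxsize` attribute on the spider class.
# Per-request override: Request(url, meta={"download_maxsize": 1024 * 1024})
```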

@bootstraponline commented Nov 18, 2014

Is there a way to speed up the code review process on this? scrapinghub.com has a 512MB memory limit, so crawls fail with memusage_exceeded after only a few large responses. This PR would solve the problem.

@tpeng (Contributor, author) commented Nov 19, 2014

.. note::

This size can be set per spider using :attr:`download_maxsize`
spider attribute and per-request using :reqmeta:`download_maxsize`

@kmike (Member) Nov 19, 2014

it doesn't look like download_maxsize and download_warnsize spider attributes work.


DOWNLOAD_MAXSIZE
----------------

Default: `1073741824` (1024Mb)

@kmike (Member) Nov 19, 2014

According to Wikipedia, Mb is used as a shortcut for Megabit while MB is used as a shortcut for Megabyte

.. setting:: DOWNLOAD_WARNSIZE

DOWNLOAD_WARNSIZE
----------------

@kmike (Member) Nov 19, 2014

header underline is too short, the display will be broken

@@ -66,6 +66,9 @@

DOWNLOAD_TIMEOUT = 180 # 3mins

DOWNLOAD_MAXSIZE = 1073741824 # 1024m

@kmike (Member) Nov 19, 2014

`1024*1024*1024` could be more clear and easier to change

@@ -66,6 +66,9 @@

DOWNLOAD_TIMEOUT = 180 # 3mins

DOWNLOAD_MAXSIZE = 1073741824 # 1024m
DOWNLOAD_WARNSIZE = 33554432 # 32m

@kmike (Member) Nov 19, 2014

`32*1024*1024`?

@kmike (Member) commented Nov 19, 2014

This feature doesn't work in Twisted < 11.1 or with ftp downloads, right? I think it should be either fixed or documented.

@kmike (Member) commented Nov 19, 2014

It'd be good to add more tests: e.g. how does it work for gzipped responses? Should it limit the size of the response.body available to the client, or the amount of data downloaded? Based on comments in #336 it seems users are more interested in limiting the size of response.body.

What about discarding responses earlier based on Content-Length header (maybe only for non-gzipped responses)?
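The early-discard idea can be sketched in plain Python (illustrative only, not the PR's actual code; header handling is simplified):

```python
def exceeds_maxsize(content_length, maxsize):
    """Decide, before reading the body, whether a response should be
    discarded based on its declared Content-Length.

    content_length: value of the Content-Length header, or None if absent
    maxsize: the configured download_maxsize in bytes
    """
    if content_length is None:
        return False  # size unknown up front: must count bytes as they arrive
    try:
        return int(content_length) > maxsize
    except (TypeError, ValueError):
        return False  # malformed header: fall back to counting received bytes
```

For example, a response declaring a 2 GB body against a 1 GB cap would be rejected before any body bytes are transferred; a response with no Content-Length still has to be counted as it streams in.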

@kmike (Member) commented Nov 19, 2014

To be clear: I haven't checked if the implementation works as intended; @pablohoffman seems to have a better understanding of it.

@tpeng (Contributor, author) commented Nov 19, 2014

@kmike this PR already discards the response early based on the Content-Length header; see https://github.com/scrapy/scrapy/pull/946/files#diff-18150b1d259c93bf10bf1d4e5028d753R210 and http://twistedmatrix.com/documents/10.1.0/api/twisted.web.client.Response.html: txresponse.length is set by Twisted from the Content-Length value.

@kmike (Member) commented Nov 19, 2014

@tpeng that's great! But if there is such a feature it should be tested, including edge cases where Content-Length is set for responses compressed with gzip. There is also an edge case where a large amount of headers is sent; if I'm not mistaken, header size is not included in Content-Length.

@tpeng (Contributor, author) commented Nov 19, 2014

@kmike regarding the edge case of a large amount of headers: this PR only limits the amount of data transferred for the response body, so it does not count the data in the headers. For a compressed response, since it limits the data transferred, the size is counted on the data after compression. I will add some test cases to make this clear.
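The counting behaviour described above can be sketched as follows (a simplification, not the downloader handler's real code; the key point is that the limit applies to bytes on the wire, i.e. after compression):

```python
import gzip

def collect_body(chunks, maxsize):
    """Accumulate raw response chunks, aborting once the bytes received
    on the wire exceed maxsize. For a gzipped response the chunks are
    the compressed bytes, so the cap applies to the compressed size,
    not the decompressed body."""
    received = 0
    parts = []
    for chunk in chunks:
        received += len(chunk)
        if received > maxsize:
            raise ValueError("response exceeded download_maxsize")
        parts.append(chunk)
    return b"".join(parts)

# A highly compressible 1 MB body can pass a 10 KB wire limit:
compressed = gzip.compress(b"a" * (1024 * 1024))
body = collect_body([compressed], 10 * 1024)
assert gzip.decompress(body) == b"a" * (1024 * 1024)
```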

@tpeng tpeng force-pushed the tpeng:limit-response-size branch from 0841f6a to a69f042 Nov 19, 2014
@kmike (Member) commented Nov 25, 2014

@tpeng check the Travis tests: they fail, apparently because of an older Twisted version. The Py33 tests should also be guarded.

@tpeng (Contributor, author) commented Nov 25, 2014

@kmike according to @dangra the failures are not related to this change:

- the py27 builder failed due to an unrelated bug caused by an upgrade of mitmproxy
- the precise builder fails because its Twisted version is not compatible
@kmike (Member) commented Nov 25, 2014

Daniel fixed Travis and somebody (maybe it was me) restarted the build. Now the only builder which passes is py27 :)

@tpeng (Contributor, author) commented Nov 25, 2014

@kmike ok, now I see the failures are different:

for the precise builder:
ImportError: cannot import name GzipEncoderFactory
this change introduced it; we should be able to fix it.

for the py33 builder:
ImportError: No module named 'twisted.python.systemd'
no idea how to fix that one.

@kmike (Member) commented Nov 25, 2014

For py33 the easiest fix is to skip the new imports and not add the new payload/xpayload endpoints to the mockserver on Python 3.
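One way such an import guard could look (a hedged sketch, not necessarily what the PR ended up doing; GzipEncoderFactory lives in twisted.web.server in Twisted 12.3+):

```python
# Guarded import: GzipEncoderFactory is missing on older Twisted releases
# and the mockserver endpoints that need it should then be skipped.
try:
    from twisted.web.server import GzipEncoderFactory
except ImportError:
    GzipEncoderFactory = None  # older Twisted: gzip payload endpoint unavailable

# Later, the gzipped payload endpoint would only be registered when available:
# if GzipEncoderFactory is not None:
#     ... add the xpayload resource wrapped with GzipEncoderFactory() ...
```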

@tpeng tpeng force-pushed the tpeng:limit-response-size branch from d3c25d0 to cd19382 Nov 25, 2014
@tpeng (Contributor, author) commented Nov 25, 2014

all green!

@dangra (Member) commented Nov 25, 2014

lgtm

pablohoffman added a commit that referenced this pull request Nov 25, 2014
avoid download large response
@pablohoffman pablohoffman merged commit dedea72 into scrapy:master Nov 25, 2014
1 check passed: continuous-integration/travis-ci (the Travis CI build passed)
@tpeng tpeng deleted the tpeng:limit-response-size branch Nov 25, 2014