Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] [HttpCompressionMW] Do not decompress gzip binary/octet-stream responses #2065

Merged
merged 3 commits into from Jul 11, 2016

Conversation

@redapple
Copy link
Contributor

@redapple redapple commented Jun 20, 2016

This should fix #2063 and concerns from #951 (comment)

As outlined in #2063 (comment) , the idea here is:

  • if Content-Encoding is said to be "gzip",
  • to process Content-Type: binary/octet-stream (or application/octet-stream) as an opaque file that higher-level layers should handle,
  • even if HTTP specs would require the engine to gunzip the body for the client application.

This is already what happens for Content-Type: application/(x-)gzip (see #660)

This behavior is inline with what browsers do for sitemap files for example,
e.g. Amazon being "famous" for sending back this kind of HTTP headers for https://www.amazon.fr/sitemaps.*.xml.gz files (gzipped-compressed XML sitemap files):

{b'Content-Encoding': [b'gzip'],
 b'Etag': [b'"212a0faf86341f24001a784bf6d8df51"'],
 b'Accept-Ranges': [b'bytes'],
 b'Server': [b'Server'],
 b'Last-Modified': [b'Fri, 08 Apr 2016 21:47:04 GMT'],
 b'Vary': [b'Accept-Encoding'],
 b'Content-Type': [b'binary/octet-stream'],
 b'Date': [b'Mon, 20 Jun 2016 14:31:26 GMT'],
 b'X-Amz-Request-Id': [b'0E43407DAC8EE8DE']}
@redapple
Copy link
Contributor Author

@redapple redapple commented Jun 20, 2016

#2050 should also be merged into this

@codecov-io
Copy link

@codecov-io codecov-io commented Jun 20, 2016

Current coverage is 83.34%

Merging #2065 into master will increase coverage by <.01%

Powered by Codecov. Last updated by d43a357...962eb11

@redapple redapple changed the title [HttpCompressionMW] Do not decompress gzip binary/octet-stream responses [MRG] [HttpCompressionMW] Do not decompress gzip binary/octet-stream responses Jun 22, 2016
@redapple redapple added this to the v1.1.1 milestone Jul 8, 2016
@@ -51,8 +51,12 @@ def gunzip(data):
return output

_is_gzipped_re = re.compile(br'^application/(x-)?gzip\b', re.I)
_is_octetstream_re = re.compile(br'^(application|binary)/octet-stream\b', re.I)

This comment has been minimized.

@kmike

kmike Jul 8, 2016
Member

A small trick (not sure if everyone will like it): it can be written as

_is_octetstream = re.compile(br'^(application|binary)/octet-stream\b', re.I).search

and then used like or (_is_octetstream(ctype) and cenc in ...)

This comment has been minimized.

@redapple

redapple Jul 8, 2016
Author Contributor

maybe @Tethik can comment too

This comment has been minimized.

@Tethik

Tethik Jul 8, 2016
Contributor

Nice trick. It would make the final boolean expression a tiny bit easier to read.

return (_is_gzipped(ctype) is not None or
            (_is_octetstream(ctype) is not None and
             cenc in (b'gzip', b'x-gzip')))

👍

This comment has been minimized.

@kmike

kmike Jul 8, 2016
Member

Ah right, _is_gzipped can also be written this way. I'd even drop 'is not None' checks. From https://docs.python.org/3.5/library/re.html#match-objects:

Match objects always have a boolean value of True

There is an example in stdlib docs:

match = re.search(pattern, string)
if match:
    process(match)

This comment has been minimized.

@Tethik

Tethik Jul 8, 2016
Contributor

Doing is not None explicitly makes it a bool though. If we drop the is not None then we're going to return None or True instead of True or False. I doubt it makes a difference though.

return _is_gzipped(ctype) or
            (_is_octetstream(ctype) and
             cenc in (b'gzip', b'x-gzip')))

If we want to be really pedantic we could probably also drop the parantheses too.

This comment has been minimized.

@kmike

kmike Jul 9, 2016
Member

@redapple The PR looks fine 👍. Could you please clean up and merge it?

2016-07-09 4:25 GMT+05:00 Joakim Uddholm notifications@github.com:

In scrapy/utils/gz.py
#2065 (comment):

@@ -51,8 +51,12 @@ def gunzip(data):
return output

_is_gzipped_re = re.compile(br'^application/(x-)?gzip\b', re.I)
+_is_octetstream_re = re.compile(br'^(application|binary)/octet-stream\b', re.I)

Doing is not None explicitly makes it a bool though. If we drop the is
not None then we're going to return None or True instead of True or
False. I doubt it makes a difference though.

return _is_gzipped(ctype) or
(_is_octetstream(ctype) and
cenc in (b'gzip', b'x-gzip')))

If we want to be really pedantic we could probably also drop the
parantheses too.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/scrapy/scrapy/pull/2065/files/778f1cf84cfc2047fdd43b45f908bf0078a160b6#r70154620,
or mute the thread
https://github.com/notifications/unsubscribe/AAGldbtIqFZwferWdBSBy2TG31Tv_0zEks5qTtxngaJpZM4I5xtb
.

@kmike kmike changed the title [MRG] [HttpCompressionMW] Do not decompress gzip binary/octet-stream responses [MRG+1] [HttpCompressionMW] Do not decompress gzip binary/octet-stream responses Jul 9, 2016
@kmike kmike changed the title [MRG+1] [HttpCompressionMW] Do not decompress gzip binary/octet-stream responses [MRG+2] [HttpCompressionMW] Do not decompress gzip binary/octet-stream responses Jul 9, 2016
@redapple redapple merged commit 8a22a74 into scrapy:master Jul 11, 2016
2 checks passed
2 checks passed
codecov/patch 100% of diff hit (target 100%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
redapple added a commit that referenced this pull request Jul 11, 2016
[backport][1.1] Do not decompress binary/octet-stream responses (PR #2065)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

4 participants