
[MRG] Always decompress Content-Encoding: gzip at HttpCompression stage #2391

Merged
4 commits merged into scrapy:master from redapple:always-gunzip-content-enc-gzip on Mar 7, 2017

Conversation

@redapple (Contributor) commented Nov 9, 2016

Let SitemapSpider handle decoding of .xml.gz files if necessary.

Fixes #2389

The change here is to always decompress responses with Content-Encoding: gzip, whatever the Content-Type header says, contrary to #193 (comment), #660 and #2065.

Therefore, SitemapSpider still has to decode "real" .gz files/content when the HTTP response was not Content-Encoded, relying on the gzip magic number instead of response headers (see the sketch below).

I believe it follows RFC 7231 better:

The "Content-Encoding" header field indicates what content codings
have been applied to the representation, beyond those inherent in the
media type, and thus what decoding mechanisms have to be applied in
order to obtain data in the media type referenced by the Content-Type
header field.
Content-Encoding is primarily used to allow a
representation's data to be compressed without losing the identity of
its underlying media type.

#951 (comment) also hinted at something like this.

I think it can also fix #2162, though I need to test that.
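As an illustration, here is a minimal sketch (not the exact diff of this PR; it assumes scrapy.utils.gz.gunzip and a hypothetical function name) of what "always decompress" means at the HttpCompression stage: only the Content-Encoding header decides, regardless of what Content-Type says.

from scrapy.utils.gz import gunzip

def process_response_sketch(request, response, spider):
    # Decide based on Content-Encoding alone; Content-Type (e.g.
    # application/gzip for a .xml.gz sitemap) is not consulted.
    if response.headers.get('Content-Encoding') == b'gzip':
        response = response.replace(body=gunzip(response.body))
        del response.headers['Content-Encoding']
    return response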

@redapple added the in progress label Nov 9, 2016
@codecov-io commented Nov 9, 2016

Codecov Report

Merging #2391 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2391      +/-   ##
==========================================
+ Coverage   84.15%   84.16%   +0.01%     
==========================================
  Files         162      162              
  Lines        9079     9079              
  Branches     1346     1345       -1     
==========================================
+ Hits         7640     7641       +1     
  Misses       1177     1177              
+ Partials      262      261       -1
Impacted Files                                    Coverage Δ
scrapy/spiders/sitemap.py                         56.66% <100%> (-1.4%)
scrapy/downloadermiddlewares/httpcompression.py   88.88% <100%> (+2.22%)
scrapy/utils/gz.py                                100% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@redapple changed the title from "Always decompress Content-Encoding: gzip at HttpCompression stage" to "[MRG] Always decompress Content-Encoding: gzip at HttpCompression stage" Nov 16, 2016
@redapple added this to the v1.3 milestone Nov 23, 2016
@sibiryakov (Contributor) commented Dec 7, 2016

Should we make this behavior optional, @redapple?

@redapple (Author, Contributor) commented Dec 7, 2016

@sibiryakov, I would prefer that this change not be configurable, handing spiders responses with the correct type sooner rather than later.
#2393 is a good companion to this.



def gzip_magic_number(response):
    # True if the body starts with the 2-byte gzip magic number 0x1f 0x8b.
    return response.body[:2] == b'\x1f\x8b'


@redapple (Author, Contributor) commented Mar 7, 2017

Makes sense @kmike, thanks for the pointer. PR updated.

@kmike (Member) commented Mar 7, 2017

Other than the gzip signature, this PR looks good. The signature also looks fine: 2 bytes is what the gzip format RFC says and what Wikipedia shows, but 3 bytes with a hardcoded compression method seems more robust and is what the MIME sniffing spec recommends.
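For reference, a sketch of the 3-byte check being suggested here (an assumption based on the comment above, not necessarily the committed code): the MIME sniffing spec's gzip signature adds the compression-method byte 0x08 (DEFLATE, the only method the gzip format defines) to the two magic bytes.

def gzip_magic_number(response):
    # 0x1f 0x8b: gzip magic number; 0x08: compression method (DEFLATE).
    return response.body[:3] == b'\x1f\x8b\x08'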

@sibiryakov it seems the only case where HTTP decompression should be optional is when response bodies are not processed at all (no link extraction, no peeking into the response body) and it is fine to store them in the raw compressed form in which they arrive from the transport; that means, for example, giving up the ability to search inside those bodies. To me that sounds like a rare use case (an HTTP cache without any processing? what would those cached results be used for?): if you are downloading a page, you most likely want to use it somehow, and to do that you need to decompress it.

This issue is not about automatically decompressing all gzipped files (Scrapy does not do that); it is about undoing HTTP compression.
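A hypothetical illustration of that distinction (example URLs and bodies, not taken from the PR):

import gzip
from scrapy.http import Response

gzipped_xml = gzip.compress(b'<?xml version="1.0"?><urlset/>')

# HTTP compression: Content-Encoding says gzip, so HttpCompressionMiddleware
# undoes it and the spider sees plain XML.
http_compressed = Response(
    url='http://example.com/sitemap.xml',
    headers={'Content-Type': 'application/xml', 'Content-Encoding': 'gzip'},
    body=gzipped_xml,
)

# A gzipped resource: no Content-Encoding, so the middleware leaves the body
# alone; SitemapSpider spots the gzip magic number and gunzips it itself.
gzipped_resource = Response(
    url='http://example.com/sitemap.xml.gz',
    headers={'Content-Type': 'application/gzip'},
    body=gzipped_xml,
)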

@redapple force-pushed the redapple:always-gunzip-content-enc-gzip branch from 8960225 to 4cacecc Mar 7, 2017
self.assertEqual(response.headers['Content-Type'], b'application/gzip')
assert newresponse is not response
assert newresponse.body.startswith(b'<!DOCTYPE')
assert 'Content-Encoding' not in newresponse.headers


@kmike (Member) commented Mar 7, 2017

pytest magic asserts are disabled in the Scrapy test suite (they break the contracts tests, if I recall correctly), so it is better to keep assertIs / assertEqual.
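For the snippet above, that conversion would look like this (a sketch; these lines are assumed to live inside a unittest.TestCase method):

self.assertEqual(response.headers['Content-Type'], b'application/gzip')
self.assertIsNot(newresponse, response)
self.assertTrue(newresponse.body.startswith(b'<!DOCTYPE'))
self.assertNotIn('Content-Encoding', newresponse.headers)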

@redapple (Author, Contributor) commented Mar 7, 2017

Done.

@kmike merged commit 802ed30 into scrapy:master Mar 7, 2017
1 check was pending: continuous-integration/travis-ci/pr (The Travis CI build is in progress)
@kmike (Member) commented Mar 7, 2017

Thanks @redapple!
