[MRG] Always decompress Content-Encoding: gzip at HttpCompression stage #2391
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2391 +/- ##
==========================================
+ Coverage 84.15% 84.16% +0.01%
==========================================
Files 162 162
Lines 9079 9079
Branches 1346 1345 -1
==========================================
+ Hits 7640 7641 +1
Misses 1177 1177
+ Partials 262 261 -1
Continue to review full report at Codecov.
|
Should we make this behavior optional @redapple ? |
@sibiryakov , I would prefer that we make this change not configurable, handing responses with the correct type sooner than later. |
|
||
|
||
def gzip_magic_number(response): | ||
return response.body[:2] == b'\x1f\x8b' |
kmike
Mar 7, 2017
Member
https://mimesniff.spec.whatwg.org/#matching-an-archive-type-pattern suggests a 3-byte signature
https://mimesniff.spec.whatwg.org/#matching-an-archive-type-pattern suggests a 3-byte signature
Other than gzip signature this PR looks good. Signature also looks fine - 2 bytes is what gzip format rfc's say, and what wikipedia shows, but 3 bytes with hardcoded compression method seems to be more robust and recommended in mime sniffing spec. @sibiryakov it seems the only case HTTP decompression should be made optional is when response bodies are not processed at all (no link extraction, no peeking into response body), and it is fine to store them in a raw compressed form, as they are received from transport - it means e.g. removing an ability to search inside these bodies. For me it sounds like a rare use case (http cache without any processing? what to use these cache results for?) - if you're downloading a page then likely you want to use it somehow, and to do that you need to decompress it. This issue is not about decompressing all gzipped files automatically (scrapy is not doing this), it is about undoing http compression. |
Let SitemapSpider handle decoding of .xml.gz files if necessary
8960225
to
4cacecc
self.assertEqual(response.headers['Content-Type'], b'application/gzip') | ||
assert newresponse is not response | ||
assert newresponse.body.startswith(b'<!DOCTYPE') | ||
assert 'Content-Encoding' not in newresponse.headers |
kmike
Mar 7, 2017
Member
pytest magic asserts are disabled in scrapy testing suite (they break contracts tests if I recall it correctly), so it is better to keep assertIs / assertEqual
pytest magic asserts are disabled in scrapy testing suite (they break contracts tests if I recall it correctly), so it is better to keep assertIs / assertEqual
redapple
Mar 7, 2017
Author
Contributor
Done.
Done.
Thanks @redapple! |
Let SitemapSpider handle decoding of .xml.gz files if necessary.
Fixes #2389
The change here is to always decompress responses with
Content-Encoding: gzip
, whateverContent-Type
says, and contrary to #193 (comment), #660, #2065Therefore, SitemapSpider still has to decode "real" .gz files/content if the HTTP response was not "Content-Encoded", relying on gzip magic number instead of response headers.
I believe it follows RFC 7231 better:
#951 (comment) also hinted at something like this.
I think it can also fix #2162 though I need to test that.