[MRG] Always decompress Content-Encoding: gzip at HttpCompression stage #2391
Let SitemapSpider handle decoding of .xml.gz files if necessary.
Therefore, SitemapSpider still has to decode "real" .gz files/content when the HTTP response was not "Content-Encoded", relying on the gzip magic number instead of the response headers.
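To make that fallback concrete, here is a minimal sketch of the logic (the helper names and the standalone function are illustrative assumptions, not the actual Scrapy code):

```python
import gzip
from io import BytesIO

def looks_gzipped(body):
    # gzip streams start with the magic bytes \x1f\x8b (RFC 1952)
    return body[:2] == b"\x1f\x8b"

def get_sitemap_body(body, http_decoded):
    # If the response carried Content-Encoding: gzip, HttpCompression has
    # already undone the transport compression before the spider sees it.
    if http_decoded:
        return body
    # Otherwise fall back to the gzip magic number to detect a "real"
    # .xml.gz sitemap and decompress it ourselves.
    if looks_gzipped(body):
        return gzip.GzipFile(fileobj=BytesIO(body)).read()
    return body
```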
I believe it follows RFC 7231 better.
#951 (comment) also hinted at something like this.
I think it can also fix #2162, though I need to test that.
```
@@            Coverage Diff             @@
##           master    #2391      +/-   ##
==========================================
+ Coverage   84.15%   84.16%   +0.01%
==========================================
  Files         162      162
  Lines        9079     9079
  Branches     1346     1345         -1
==========================================
+ Hits         7640     7641         +1
  Misses       1177     1177
+ Partials      262      261         -1
```
Other than the gzip signature check, this PR looks good. The signature check also looks fine: 2 bytes is what the gzip format RFCs say and what Wikipedia shows, but checking 3 bytes, with the compression method hardcoded, seems more robust and is what the MIME Sniffing spec recommends.
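For reference, the difference between the two checks is one byte (a rough sketch, not the exact code in this PR):

```python
GZIP_MAGIC = b"\x1f\x8b"               # 2-byte check: magic number only (RFC 1952)
GZIP_MAGIC_DEFLATE = b"\x1f\x8b\x08"   # 3-byte check: magic number + CM=8 (deflate)

def is_gzipped_2(data):
    return data[:2] == GZIP_MAGIC

def is_gzipped_3(data):
    # RFC 1952 only defines compression method 8, and the MIME Sniffing spec
    # matches the full 1F 8B 08 prefix, so the stricter check costs nothing
    # and rejects a few more false positives.
    return data[:3] == GZIP_MAGIC_DEFLATE
```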
@sibiryakov it seems the only case where HTTP decompression should be made optional is when response bodies are not processed at all (no link extraction, no peeking into the response body) and it is fine to store them in the raw compressed form in which they are received from the transport; that means, for example, giving up the ability to search inside those bodies. To me that sounds like a rare use case (an HTTP cache without any processing? what would those cached results be used for?): if you're downloading a page, you most likely want to use it somehow, and to do that you need to decompress it.
This issue is not about decompressing all gzipped files automatically (Scrapy does not do that); it is about undoing HTTP compression.
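To make the distinction concrete, a rough sketch (the helper name and the plain-dict headers are just for illustration, not Scrapy's actual middleware API):

```python
def should_undo_http_compression(headers):
    # Transport-level compression is declared by the server in the headers;
    # only then should the downloader middleware transparently decompress.
    return headers.get("Content-Encoding", "").lower() in ("gzip", "x-gzip")

# A gzip-compressed HTML page: the middleware undoes the encoding and the
# spider sees plain HTML.
assert should_undo_http_compression(
    {"Content-Type": "text/html", "Content-Encoding": "gzip"})

# A .gz file served as-is: nothing to undo at the HTTP layer; whether to
# gunzip it is up to the spider (e.g. SitemapSpider), not the middleware.
assert not should_undo_http_compression({"Content-Type": "application/gzip"})
```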