Another IOError: "Not a gzipped file" #193
Comments
Thanks for reporting this. Could you share a public gzipped sitemap URL I can test this against?
Sure, you can try amazon.com/robots.txt (see the last line).
The httpcompression middleware uncompresses it because [...]. A more reliable approach would be to check the magic numbers, as described in RFC 6713 for gzip and zlib.
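A minimal sketch of what a magic-number check could look like, assuming the raw response body is available as bytes. The function names are hypothetical and are not part of Scrapy. The gzip signature is the two bytes `1f 8b`; for zlib, the first two bytes read as a big-endian 16-bit integer are a multiple of 31 and the low nibble of the first byte is 8 (deflate).

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the payload starts with the gzip signature."""
    return data[:2] == GZIP_MAGIC


def looks_like_zlib(data: bytes) -> bool:
    """Return True if the first two bytes form a plausible zlib header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    # Low nibble of CMF is the compression method (8 = deflate), and the
    # 16-bit big-endian value CMF*256 + FLG must be divisible by 31.
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


# Small demonstration on freshly compressed data (hypothetical sample).
sample = b"<urlset></urlset>"
gz_sample = gzip.compress(sample)
zlib_sample = zlib.compress(sample)
```

A check like this looks only at the payload, so it would not depend on whatever Content-Type or Content-Encoding the server happens to send.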
Using both headers: here is a similar problem in an Apache mailing list thread: http://mail-archives.apache.org/mod_mbox/httpd-dev/200207.mbox/%3C3D2D4E76.4010502@talex.com.pl%3E
Chrome ignores it; the downloaded file is still gzipped on disk.
Firefox ignores it too; the downloaded file is still gzipped on disk.
@pablohoffman, what do you think about following the browsers' behavior and not attempting to uncompress in the httpcompression middleware if [...]
@dangra +1
Added content-type check as per issue #193
Fixed by #660
Hi guys, I still get this error when downloading Amazon sitemap .gz files. I checked the HTTP headers and found that Amazon keeps "Content-Encoding: x-gzip" but has changed "Content-Type" to "application/octet-stream". As a result, the condition "content_encoding and not is_gzipped" in the httpcompression middleware is True and the file is decompressed. Later, sitemap.py sees the ".gz" suffix and calls gunzip again, which aborts the program. I changed sitemap.py slightly to ignore the decompression exception, and now it works.
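The workaround described above can be sketched roughly as follows. This is not the actual change made to sitemap.py; `gunzip_if_needed` is a hypothetical helper that assumes the body is available as bytes and simply falls back to the original body when decompression fails.

```python
import gzip
import zlib


def gunzip_if_needed(body: bytes) -> bytes:
    """Try to gunzip the body; return it unchanged if it is not gzip data.

    Hypothetical helper: it tolerates a body that an earlier step
    (for example the httpcompression middleware) already decompressed.
    """
    try:
        return gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        # OSError covers "Not a gzipped file"; EOFError and zlib.error
        # cover truncated or corrupt streams. In all cases keep the body.
        return body
```

Note that swallowing the exception also hides genuinely corrupt downloads, so this trades strictness for robustness.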
codec option
Take a look at the 'httpcompression' middleware and the 'sitemap' middleware.
If you try to download a gzipped file, the 'httpcompression' middleware will decompress it first.
See it here:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/httpcompression.py#L36
Then 'sitemap' will try to decompress it again and will raise the IOError here:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/sitemap.py#L57
In my view, decompression should happen in only one place, and it is reasonable for that place to be 'httpcompression'.
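The double decompression can be reproduced outside Scrapy with the standard library alone. The sketch below is only an illustration of the failure mode, not the middleware code; `second_gunzip_error` is a hypothetical name, and the two `gzip.decompress` calls stand in for the 'httpcompression' step and the 'sitemap' step respectively.

```python
import gzip


def second_gunzip_error(payload: bytes):
    """Gunzip a compressed payload twice and return the second call's error.

    Returns None if the second call unexpectedly succeeds.
    """
    wire = gzip.compress(payload)   # what the server sends
    once = gzip.decompress(wire)    # stand-in for the 'httpcompression' step
    try:
        gzip.decompress(once)       # stand-in for the 'sitemap' step
    except OSError as exc:          # IOError is an alias of OSError in Python 3
        return exc
    return None


error = second_gunzip_error(b"<urlset></urlset>")
print(type(error).__name__, error)
```

The second call fails because the already-decompressed XML no longer starts with the gzip signature.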