Another IOError: "Not a gzipped file" #193

Closed
vkrest opened this issue Nov 13, 2012 · 10 comments

Comments

@vkrest
Contributor

@vkrest vkrest commented Nov 13, 2012

Take a look at the 'httpcompression' middleware and the 'sitemap' spider.
If you try to download a gzipped file, the 'httpcompression' middleware decompresses it first.
See it here:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/httpcompression.py#L36

Then 'sitemap' tries to decompress it again and raises the IOError here:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/sitemap.py#L57

In my view, decompression should happen in only one place, and 'httpcompression' is the reasonable place for it.
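
The double decompression described above can be reproduced outside Scrapy. This is only a minimal sketch using the standard library's gzip module, not the actual middleware or spider code; the function name gunzip_twice is hypothetical. In Python 3 the error surfaces as an OSError subclass (IOError is an alias of OSError there).

```python
import gzip


def gunzip_twice(payload: bytes) -> bytes:
    """Hypothetical helper: decompress the same payload twice."""
    # First pass: stands in for what the httpcompression middleware does.
    once = gzip.decompress(payload)
    # Second pass: stands in for what the sitemap spider does afterwards.
    # The data is already plain XML here, so this call fails.
    return gzip.decompress(once)


body = gzip.compress(b"<urlset></urlset>")

try:
    gunzip_twice(body)
    failed = False
except OSError:
    # Same family of error as the IOError reported in this issue.
    failed = True
```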

@pablohoffman
Member

@pablohoffman pablohoffman commented Nov 13, 2012

Thanks for reporting this. Could you share a public gzipped sitemap url I can test this against?

@vkrest
Contributor Author

@vkrest vkrest commented Nov 13, 2012

Sure, you can try amazon.com/robots.txt (see the last line).

@dangra
Member

@dangra dangra commented Jan 10, 2013

The httpcompression middleware decompresses the body because Content-Encoding is set,
but the headers also include Content-Type: application/x-gzip, and that header is what scrapy.utils.gz.is_gzipped() checks.

A more reliable approach would be to check the magic numbers described in RFC 6713 for gzip and zlib.

$ HEAD http://www.amazon.com/sitemaps.US_detail_page_sitemap_desktop_index.xml.gz
200 OK
Date: Thu, 10 Jan 2013 18:39:56 GMT
Accept-Ranges: bytes
ETag: "608450686bcf63a017b8965839489376"
Server: Server
Content-Encoding: x-gzip
Content-Length: 8581
Content-Type: application/x-gzip
Last-Modified: Mon, 17 Dec 2012 22:12:00 GMT
Client-Date: Thu, 10 Jan 2013 18:39:54 GMT
Client-Peer: 72.21.194.212:80
Client-Response-Num: 1
X-Amz-Id-2: vpDBqatU9rmJ8pVLWuxnR1hKZmXgx2WMtpGCAnqFEuyomu/IpqKG05zWsJTW9id6
X-Amz-Meta-Md5-Hash: 608450686bcf63a017b8965839489376
X-Amz-Request-Id: EDD0290A690D8D6
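
As a rough illustration of the magic-number idea, the sketch below inspects the first bytes of a body instead of trusting the headers. It is an assumption-laden example, not Scrapy code: the function names are hypothetical, and the byte rules (gzip starts with 0x1f 0x8b; a zlib stream has compression method 8, a window-size nibble of at most 7, and a 16-bit header that is a multiple of 31) follow my reading of RFC 6713.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data: bytes) -> bool:
    """True if the body starts with the two gzip magic bytes."""
    return data[:2] == GZIP_MAGIC


def looks_like_zlib(data: bytes) -> bool:
    """True if the first two bytes form a plausible zlib header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    method_is_deflate = (cmf & 0x0F) == 8   # low nibble: compression method
    window_ok = (cmf >> 4) <= 7             # high nibble: window size
    checksum_ok = ((cmf << 8) | flg) % 31 == 0
    return method_is_deflate and window_ok and checksum_ok
```

With a check like this, a caller could decide whether a body still needs decompressing regardless of what Content-Type and Content-Encoding claim.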
@dangra
Member

@dangra dangra commented Jan 10, 2013

Sending both Content-Type: application/x-gzip and Content-Encoding: gzip is obviously wrong for a body that is gzipped only once, which is the common case and the only one that makes sense.

Here is a similar problem in an Apache mailing list thread: http://mail-archives.apache.org/mod_mbox/httpd-dev/200207.mbox/%3C3D2D4E76.4010502@talex.com.pl%3E

@dangra
Member

@dangra dangra commented Jan 29, 2013

Chrome ignores Content-Encoding if Content-Type: application/x-gzip is set.

The downloaded file is still gzipped on disk:

$ file sitemaps.US_detail_page_sitemap_desktop_index.xml.gz 
sitemaps.US_detail_page_sitemap_desktop_index.xml.gz: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT)
GET /sitemaps.US_detail_page_sitemap_desktop_index.xml.gz HTTP/1.1
Host: www.amazon.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.56 Safari/537.17
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,es-419;q=0.6,es;q=0.4
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3

HTTP/1.1 200 OK
Date: Tue, 29 Jan 2013 16:53:55 GMT
Server: Server
x-amz-id-2: GAtkni2jyU8fGFG932/98hkfs2LjD8pbMMSYRH9VIoPAb3x8MEuo224EimC+Hwu6
x-amz-request-id: 7055E62E0B5F4B82
x-amz-meta-md5-hash: 608450686bcf63a017b8965839489376
Last-Modified: Mon, 17 Dec 2012 22:12:00 GMT
ETag: "608450686bcf63a017b8965839489376"
Accept-Ranges: bytes
Content-Type: application/x-gzip
Content-Length: 8581
Content-Encoding: gzip

............O.W....._.....Uuw..l..l.U.iv..r...|Q....T..I..i...........OW=...w.>...o.~y......o/..?.......?.................._?|z.....?|{v...O..=}..._.~.........._..._~|........?..._...........}x...._=.....wo..../.......#.?.....|...|.o............n... 
@dangra
Member

@dangra dangra commented Jan 29, 2013

Firefox ignores it too; the downloaded file is still gzipped on disk.

GET /sitemaps.US_detail_page_sitemap_desktop_index.xml.gz HTTP/1.1
Host: www.amazon.com
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

HTTP/1.1 200 OK
Date: Tue, 29 Jan 2013 16:57:07 GMT
Server: Server
x-amz-id-2: oC/P7RfZm0Y6HYl9cDa0oSiGnziuB6um9OdJacySkA5VFAcZs/XAm/Cn0xp54dHY
x-amz-request-id: 199896FB7C02370F
x-amz-meta-md5-hash: 608450686bcf63a017b8965839489376
Last-Modified: Mon, 17 Dec 2012 22:12:00 GMT
ETag: "608450686bcf63a017b8965839489376"
Accept-Ranges: bytes
Content-Type: application/x-gzip
Content-Length: 8581
Content-Encoding: gzip

............O.W....._.....Uuw..l..l.U.iv..r...|Q....T..I..i...........OW=...w.>...o.~y......o/..?.......?.................._?|z.....?|{v...O..=}..._.~.........._..._~|........?..._...........
@dangra
Member

@dangra dangra commented Jan 29, 2013

@pablohoffman, what do you think about following the browsers' behavior and not attempting to decompress in the httpcompression middleware when Content-Type is set to application/gzip (or similar)?
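
A minimal sketch of that proposal, assuming headers arrive as a plain dict with canonical key names (Scrapy's real Headers object and the change that eventually landed may look different). The names GZIP_CONTENT_TYPES and should_decode are hypothetical.

```python
# Content types treated as "the payload itself is a gzip file".
# The exact set is an assumption for illustration.
GZIP_CONTENT_TYPES = {"application/gzip", "application/x-gzip"}


def should_decode(headers: dict) -> bool:
    """Decide whether a middleware should undo Content-Encoding.

    Mirrors the browser behavior discussed above: if the server labels
    the payload itself as gzip, leave the body alone.
    """
    content_encoding = headers.get("Content-Encoding", "").strip().lower()
    if not content_encoding:
        return False  # nothing to decode
    # Drop any parameters such as "; charset=utf-8" before comparing.
    content_type = headers.get("Content-Type", "").split(";")[0].strip().lower()
    return content_type not in GZIP_CONTENT_TYPES


# Headers from the Amazon response shown earlier in this thread.
amazon = {"Content-Type": "application/x-gzip", "Content-Encoding": "gzip"}
html = {"Content-Type": "text/html; charset=utf-8", "Content-Encoding": "gzip"}
```

Here should_decode(amazon) is False, so the .gz file would reach the sitemap spider still compressed, while an ordinary gzip-encoded HTML page is still decoded.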

@pablohoffman
Copy link
Member

@pablohoffman pablohoffman commented Aug 8, 2013

@dangra +1

@dangra dangra added the easy label Feb 13, 2014
rubenvereecken added a commit to rubenvereecken/scrapy that referenced this issue Mar 19, 2014
rubenvereecken added a commit to rubenvereecken/scrapy that referenced this issue Mar 20, 2014
rubenvereecken added a commit to rubenvereecken/scrapy that referenced this issue Mar 20, 2014
rubenvereecken added a commit to rubenvereecken/scrapy that referenced this issue Mar 20, 2014
rubenvereecken added a commit to rubenvereecken/scrapy that referenced this issue Mar 21, 2014
dangra added a commit that referenced this issue Mar 24, 2014
Added content-type check as per issue #193
chekunkov added a commit to chekunkov/scrapy that referenced this issue Apr 26, 2014
chekunkov added a commit to chekunkov/scrapy that referenced this issue Apr 26, 2014
@dangra
Member

@dangra dangra commented Jun 23, 2014

fixed by #660

@dangra dangra closed this Jun 23, 2014
@askerlee

@askerlee askerlee commented Jan 18, 2015

Hi guys, I still get this error when downloading Amazon sitemap .gz files. I checked the HTTP headers and found that Amazon still sends "Content-Encoding: x-gzip" but has changed "Content-Type" to "application/octet-stream". As a result, the condition "content_encoding and not is_gzipped" in the httpcompression middleware is True and the file is decompressed. Later, sitemap.py sees the ".gz" suffix and calls gunzip again, causing the program to abort.

I changed sitemap.py slightly to ignore the decompression exception, and now it works.
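
A workaround along those lines might look like the sketch below. It is not the actual sitemap.py patch, just an illustration with the standard library; the helper name gunzip_if_needed is hypothetical, and catching OSError assumes Python 3, where a non-gzip body raises an OSError subclass.

```python
import gzip


def gunzip_if_needed(body: bytes) -> bytes:
    """Hypothetical helper: gunzip the body, or return it unchanged."""
    try:
        return gzip.decompress(body)
    except OSError:
        # The body was not gzip data, most likely because an earlier
        # layer already decompressed it. Pass it through untouched.
        return body


xml = b"<sitemapindex></sitemapindex>"
already_decoded = gunzip_if_needed(xml)            # passed through as-is
still_gzipped = gunzip_if_needed(gzip.compress(xml))  # decompressed once
```

Swallowing the exception hides genuinely corrupt downloads as well, so checking the leading bytes before decompressing is probably the safer long-term choice.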

4 participants