
IOError, 'Not a gzipped file' #2063

Closed
DharmeshPandav opened this issue Jun 19, 2016 · 3 comments
Labels: bug

DharmeshPandav (Contributor) commented Jun 19, 2016

While trying to access a sitemap linked from robots.txt, Scrapy fails with IOError: 'Not a gzipped file'.

Not sure if this issue is related to the following issue(s):

#193 -> closed issue
#660 -> merged pull request addressing #193
#951 -> open issue

The code fails in Python's gzip.py at line 197:

def _read_gzip_header(self):
    magic = self.fileobj.read(2)
    if magic != '\037\213':
        raise IOError, 'Not a gzipped file'

Response headers:

Content-Encoding: gzip
Accept-Ranges: bytes
X-Amz-Request-Id: BFFF010DDE6268DA
Vary: Accept-Encoding
Server: AmazonS3
Last-Modified: Wed, 15 Jun 2016 19:02:20 GMT
Etag: "300bb71d6897cb2a22bba0bd07978c84"
Cache-Control: no-transform
Date: Sun, 19 Jun 2016 10:54:53 GMT
Content-Type: binary/octet-stream

Error Log:

 Traceback (most recent call last):
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\projects\sitemap_spider\sitemap_spider\spiders\mainspider.py", line 31, in _parse_sitemap
    body = self._get_sitemap_body(response)
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\spiders\sitemap.py", line 67, in _get_sitemap_body
    return gunzip(response.body)
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\utils\gz.py", line 37, in gunzip
    chunk = read1(f, 8196)
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\utils\gz.py", line 21, in read1
    return gzf.read(size)
  File "c:\python27\Lib\gzip.py", line 268, in read
    self._read(readsize)
  File "c:\python27\Lib\gzip.py", line 303, in _read
    self._read_gzip_header()
  File "c:\python27\Lib\gzip.py", line 197, in _read_gzip_header
    raise IOError, 'Not a gzipped file'

I downloaded the file manually and was able to extract its content, so it is not as if the file is corrupted.

As an example sitemap URL, you can follow Amazon's robots.txt.
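A minimal sketch of what is going wrong: by the time SitemapSpider calls gunzip(), HttpCompressionMiddleware has typically already decompressed the body (because of Content-Encoding: gzip), so the payload no longer starts with the gzip magic number that gzip.py checks for. Checking those two bytes before decompressing avoids the IOError. (is_gzipped_body and maybe_gunzip below are illustrative helpers, not Scrapy API.)

```python
import gzip
import io


def is_gzipped_body(data):
    """Check the two-byte gzip magic number (0x1f 0x8b) instead of
    trusting the Content-Encoding or Content-Type headers."""
    return data[:2] == b"\x1f\x8b"


def maybe_gunzip(data):
    """Decompress only when the magic number says the payload is gzip;
    otherwise return the bytes unchanged."""
    if is_gzipped_body(data):
        return gzip.GzipFile(fileobj=io.BytesIO(data)).read()
    return data
```

With this, a body that the middleware has already inflated passes through untouched instead of raising 'Not a gzipped file'.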

DharmeshPandav (Contributor, Author) commented Jun 20, 2016

As a workaround for this problem: if one sets COMPRESSION_ENABLED = False in the settings.py file, it works as expected.
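Concretely, the workaround is a one-line change in the project's settings.py (this disables HttpCompressionMiddleware entirely, so the .xml.gz body reaches SitemapSpider still compressed and gunzip() succeeds; note it also disables compression for every other response):

```python
# settings.py
# Workaround for 'Not a gzipped file': stop HttpCompressionMiddleware
# from decompressing responses, so SitemapSpider's own gunzip() sees
# the raw gzip bytes. This applies to ALL responses, not just sitemaps.
COMPRESSION_ENABLED = False
```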

redapple (Contributor) commented Jun 20, 2016

Thanks @DharmeshPandav for reminding us of this issue.
It's the same problem that @juraseg commented on in #951 (comment).
For http://www.amazon.fr/sitemaps.f3053414d236e84.SitemapIndex_0.xml.gz, for example, you get

{b'Accept-Ranges': [b'bytes'],
 b'Content-Encoding': [b'gzip'],
 b'Content-Type': [b'binary/octet-stream'],
 b'Date': [b'Mon, 20 Jun 2016 10:50:16 GMT'],
 b'Etag': [b'"48f5a0d2cfff8d96700c49b5799c6d35"'],
 b'Last-Modified': [b'Wed, 15 Jun 2016 19:29:30 GMT'],
 b'Server': [b'AmazonS3'],
 b'Vary': [b'Accept-Encoding'],
 b'X-Amz-Request-Id': [b'674AEA9C05B9A833']}

from the web server, with HttpCompressionMiddleware decompressing the response based on Content-Encoding being "gzip".

Reading IANA's page on application/octet-stream:

The recommended action for an implementation that receives an
"application/octet-stream" entity is to simply offer to put the data
in a file, with any Content-Transfer-Encoding undone, or perhaps to
use it as input to a user-specified process.

I'd go for treating Content-Type: binary/octet-stream the same way, taking precedence over Content-Encoding (for the same reasons as in #193 (comment)): i.e. not trying to decompress it at the HttpCompressionMiddleware level as a gzipped HTML/XML/JSON... response, but treating it as a (compressed) file for another layer to interpret (here, SitemapSpider).
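The precedence rule proposed above could be sketched like this (not Scrapy's actual code; should_decompress is a hypothetical helper showing the decision only):

```python
# Sketch of the proposed rule: when Content-Type marks the payload as an
# opaque binary file, skip decompression even if Content-Encoding says
# gzip, leaving the compressed bytes for a later layer (SitemapSpider).
OPAQUE_TYPES = {b"application/octet-stream", b"binary/octet-stream"}


def should_decompress(headers):
    """headers: dict of raw header name -> raw value (bytes)."""
    content_type = headers.get(b"Content-Type", b"").split(b";")[0].strip()
    if content_type in OPAQUE_TYPES:
        return False  # opaque file: Content-Type takes precedence
    return headers.get(b"Content-Encoding", b"") == b"gzip"
```

Under this rule, the Amazon S3 response above (binary/octet-stream + gzip) would pass through the middleware untouched, and SitemapSpider's gunzip() would then succeed.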

redapple (Contributor) commented Nov 9, 2016

With #2389, I'm now reconsidering what I wrote earlier :)
