Traceback (most recent call last):
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\projects\sitemap_spider\sitemap_spider\spiders\mainspider.py", line 31, in _parse_sitemap
    body = self._get_sitemap_body(response)
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\spiders\sitemap.py", line 67, in _get_sitemap_body
    return gunzip(response.body)
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\utils\gz.py", line 37, in gunzip
    chunk = read1(f, 8196)
  File "c:\venv\scrapy1.0\lib\site-packages\scrapy\utils\gz.py", line 21, in read1
    return gzf.read(size)
  File "c:\python27\Lib\gzip.py", line 268, in read
    self._read(readsize)
  File "c:\python27\Lib\gzip.py", line 303, in _read
    self._read_gzip_header()
  File "c:\python27\Lib\gzip.py", line 197, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file
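One possible workaround sketch (my assumption of a fix, not Scrapy's actual code; the helper name safe_sitemap_body is hypothetical): check the two gzip magic bytes before decompressing, and fall back to the raw body when the server sent plain XML despite the .gz URL.

```python
import gzip
import io

# Per RFC 1952, every gzip stream begins with the magic bytes 0x1f 0x8b.
GZIP_MAGIC = b"\x1f\x8b"


def safe_sitemap_body(body):
    """Decompress body only if it is really gzipped; otherwise return it
    unchanged instead of raising IOError('Not a gzipped file')."""
    if body[:2] == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body
```

In a SitemapSpider subclass this could be wired in by overriding _get_sitemap_body(response) to return safe_sitemap_body(response.body).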
I downloaded the file manually and was able to extract the content, so the file itself is not corrupted.
As an example sitemap URL, you can follow Amazon's robots.txt.
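To confirm what a downloaded sitemap file actually contains, one quick check is to compare its first two bytes against the gzip magic number from RFC 1952 (is_gzipped is a hypothetical helper; pass whatever local path you saved the file to):

```python
def is_gzipped(path):
    # A gzip file starts with the two magic bytes 0x1f 0x8b (RFC 1952);
    # reading just those is enough to distinguish a real gzip file from
    # plain XML served under a .gz name.
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"
```

If this returns False for a file fetched from a .gz sitemap URL, the server sent uncompressed data, which is exactly what makes gunzip() fail above.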
The recommended action for an implementation that receives an
"application/octet-stream" entity is to simply offer to put the data
in a file, with any Content-Transfer-Encoding undone, or perhaps to
use it as input to a user-specified process.
I'd go for treating Content-Type: binary/octet-stream the same way, taking precedence over Content-Encoding (for the same reasons as in #193 (comment)): i.e. not trying to decompress it at the HttpCompressionMiddleware level as a gzipped HTML/XML/JSON... response, but treating it as a (compressed) file for another layer to interpret (here, SitemapSpider).
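A sketch of that decision rule (not Scrapy's actual HttpCompressionMiddleware code; should_decompress is a hypothetical helper): when Content-Type says the payload is an opaque file, Content-Encoding: gzip is ignored and the raw bytes are passed on for the consumer (here SitemapSpider) to interpret.

```python
# Content types that mark the body as an opaque file rather than a
# compressed representation of HTML/XML/JSON.
OPAQUE_TYPES = (b"application/octet-stream", b"binary/octet-stream")


def should_decompress(content_type, content_encoding):
    """Return True only when the body should be gunzipped at the
    HTTP-compression layer."""
    media_type = content_type.split(b";")[0].strip().lower()
    if media_type in OPAQUE_TYPES:
        return False  # opaque file: let the next layer decide
    return b"gzip" in content_encoding
```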
While trying to access a sitemap from robots.txt, Scrapy fails with IOError: 'Not a gzipped file'.
Not sure if this issue is related to the following issue(s):
#193 -> closed issue
#660 -> merged pull request addressing issue #193
#951 -> open issue
Response Header:
Error Log: