This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

Crawler hangs up on empty response body of robots.txt #416

Open
abbasharoon opened this issue Jan 7, 2018 · 3 comments

Comments

@abbasharoon

Hi,

On some sites, the crawler hangs after throwing the following error:

Error: Couldn't unzip robots.txt response body
0|www      |     at decodeAndReturnResponse (/var/www/node_modules/simplecrawler/lib/crawler.js:630:37)
0|www      |     at Unzip.onError (zlib.js:212:5)
0|www      |     at emitOne (events.js:96:13)
0|www      |     at Unzip.emit (events.js:188:7)
0|www      |     at Zlib._handle.onerror (zlib.js:373:10)

Site on which it happened:

When the above error is thrown, it stops the crawl of the site without firing the complete event. Crawling hangs after 70+ pages have been crawled. I have the following fetch condition:

  const fetchCondition = crawler.addFetchCondition((queueItem, referrerQueueItem) => {
      // hostDomain is the domain being crawled
      return new Urijs(referrerQueueItem.url).domain() === hostDomain;
  });

Kindly let me know if any further info is required.
Thanks

@konstantinblaesi
Contributor

Do you have a snippet to reproduce the error? I was not able to reproduce this with Node 8.9.3 on Fedora 27 (Linux). You might want to look at the value of the underlying error to find the actual reason for the failure.

I was checking the HTTP response headers for https://carnation-inc.com/robots.txt and it says Content-Encoding: gzip, so I wonder why it would fail trying to decode the robots.txt response.
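
For what it's worth, the reported zlib error can be reproduced outside the crawler whenever a response body isn't actually gzip-compressed despite what the headers claim (an empty body also fails to decode, though possibly with a different zlib code). This is only a sketch of the zlib behaviour, not simplecrawler's own code:

  const zlib = require("zlib");

  // Plain text served with Content-Encoding: gzip: zlib cannot find a valid
  // gzip/deflate header and fails with "incorrect header check" (Z_DATA_ERROR),
  // which simplecrawler wraps as "Couldn't unzip robots.txt response body".
  const body = Buffer.from("User-agent: *\nDisallow:");

  zlib.unzip(body, (error, decoded) => {
      if (error) {
          // => Error: incorrect header check  { errno: -3, code: 'Z_DATA_ERROR' }
          console.error(error);
          return;
      }
      console.log(decoded.toString());
  });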

@abbasharoon
Author

Hi @konstantinblaesi

The error logged the following to the console:

{ Error: incorrect header check
1|www      |     at Zlib._handle.onerror (zlib.js:370:17) errno: -3, code: 'Z_DATA_ERROR' }
1|www      | Error: Couldn't unzip robots.txt response body
1|www      |     at decodeAndReturnResponse (/home/vagrant/node/node_modules/simplecrawler/lib/crawler.js:633:27)
1|www      |     at Unzip.onError (zlib.js:212:5)
1|www      |     at emitOne (events.js:96:13)
1|www      |     at Unzip.emit (events.js:188:7)
1|www      |     at Zlib._handle.onerror (zlib.js:373:10)

The robots.txt that causes the error isn't from the site that is originally being crawled; it's from an external site linked from that site. The fetch condition I use checks the referrer and allows directly linked external pages to be fetched.
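
For context, here is roughly what that looks like. This is only a sketch, assuming URI.js (urijs) as the parser and a hostDomain variable holding the crawled domain, as in the snippet in the first comment. Because the condition inspects the referrer rather than the queue item itself, a URL on an external host still passes when the page linking to it is on hostDomain, which is why the crawler ends up requesting that external robots.txt.

  const Crawler = require("simplecrawler");
  const URI = require("urijs");

  const hostDomain = "example.com";              // the domain being crawled (assumed)
  const crawler = new Crawler("http://example.com/");

  // Referrer-based condition, as described above: any URL linked from a page
  // on hostDomain is fetched, even when the URL itself is on another host.
  crawler.addFetchCondition((queueItem, referrerQueueItem) => {
      return new URI(referrerQueueItem.url).domain() === hostDomain;
  });

  // Stricter alternative that never leaves the crawled domain (so no external
  // robots.txt is requested):
  // crawler.addFetchCondition(queueItem =>
  //     new URI(queueItem.url).domain() === hostDomain);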

@sebworks

I think this is the same issue referenced in #414
