This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

Crawler hangs up on empty response body of robots.txt #416

Open
abbasharoon opened this issue Jan 7, 2018 · 3 comments

Comments

@abbasharoon

Hi,

On some sites, the crawler hangs after throwing the following error:

Error: Couldn't unzip robots.txt response body
0|www      |     at decodeAndReturnResponse (/var/www/node_modules/simplecrawler/lib/crawler.js:630:37)
0|www      |     at Unzip.onError (zlib.js:212:5)
0|www      |     at emitOne (events.js:96:13)
0|www      |     at Unzip.emit (events.js:188:7)
0|www      |     at Zlib._handle.onerror (zlib.js:373:10)

Site on which it happened:

When the above error is thrown, it stops the crawl of the site without firing the complete event. Crawling hangs after 70+ pages have been crawled. I have the following fetch condition:

  const fetchCondition = crawler.addFetchCondition((queueItem, referrerQueueItem) => {
      // hostDomain is the domain being crawled
      return new Urijs(referrerQueueItem.url).domain() === hostDomain;
  });

Kindly let me know if any further info is required.
Thanks

@konstantinblaesi
Contributor

Do you have a snippet to reproduce the error? I was not able to reproduce this with Node 8.9.3 on Fedora 27 (Linux). You might want to look at the value of the underlying error to find the actual reason for the failure.

I was checking the HTTP response headers for https://carnation-inc.com/robots.txt and it says Content-Encoding: gzip, so I wonder why it would fail trying to decode the robots.txt response.
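
For what it's worth, the reported zlib error can be reproduced outside the crawler whenever a response body isn't actually gzip-compressed despite what the headers claim (an empty body also fails to decode, though possibly with a different zlib code). This is only a sketch of the zlib behaviour, not simplecrawler's own code:

  const zlib = require("zlib");

  // Plain text served with Content-Encoding: gzip: zlib cannot find a valid
  // gzip/deflate header and fails with "incorrect header check" (Z_DATA_ERROR),
  // which simplecrawler wraps as "Couldn't unzip robots.txt response body".
  const body = Buffer.from("User-agent: *\nDisallow:");

  zlib.unzip(body, (error, decoded) => {
      if (error) {
          // => Error: incorrect header check  { errno: -3, code: 'Z_DATA_ERROR' }
          console.error(error);
          return;
      }
      console.log(decoded.toString());
  });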

@abbasharoon
Author

Hi @konstantinblaesi

The error logged the following to the console:

{ Error: incorrect header check
1|www      |     at Zlib._handle.onerror (zlib.js:370:17) errno: -3, code: 'Z_DATA_ERROR' }
1|www      | Error: Couldn't unzip robots.txt response body
1|www      |     at decodeAndReturnResponse (/home/vagrant/node/node_modules/simplecrawler/lib/crawler.js:633:27)
1|www      |     at Unzip.onError (zlib.js:212:5)
1|www      |     at emitOne (events.js:96:13)
1|www      |     at Unzip.emit (events.js:188:7)
1|www      |     at Zlib._handle.onerror (zlib.js:373:10)

The robots.txt that causes the error isn't from the site that is originally being crawled; it's from an external site linked from that site. The fetch condition I use checks the referrer and allows directly linked external pages to be fetched.
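
For context, here is roughly what that looks like. This is only a sketch, assuming URI.js (urijs) as the parser and a hostDomain variable holding the crawled domain, as in the snippet in the first comment. Because the condition inspects the referrer rather than the queue item itself, a URL on an external host still passes when the page linking to it is on hostDomain, which is why the crawler ends up requesting that external robots.txt.

  const Crawler = require("simplecrawler");
  const URI = require("urijs");

  const hostDomain = "example.com";              // the domain being crawled (assumed)
  const crawler = new Crawler("http://example.com/");

  // Referrer-based condition, as described above: any URL linked from a page
  // on hostDomain is fetched, even when the URL itself is on another host.
  crawler.addFetchCondition((queueItem, referrerQueueItem) => {
      return new URI(referrerQueueItem.url).domain() === hostDomain;
  });

  // Stricter alternative that never leaves the crawled domain (so no external
  // robots.txt is requested):
  // crawler.addFetchCondition(queueItem =>
  //     new URI(queueItem.url).domain() === hostDomain);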

@sebworks

I think this is the same issue referenced in #414
