This repository has been archived by the owner on Mar 7, 2021. It is now read-only.
When the above error is thrown, crawling stops without the complete event ever firing. The crawl hangs after 70+ pages have been crawled. I have the following fetch condition:
const fetchCondition = crawler.addFetchCondition((queueItem, referrerQueueItem) => {
    // hostDomain is the domain being crawled
    return new Urijs(referrerQueueItem.url).domain() === hostDomain;
});
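For readers who want to try the same referrer check without a crawler running, here is a minimal sketch. It is not the code from this issue: it swaps URI.js for Node's built-in `URL` class, and the helper names `isOnHostDomain` and `makeFetchCondition` are made up for illustration. One difference to keep in mind is that `URL#hostname` returns the full host (`www.example.com`), while URI.js's `domain()` returns the registrable domain (`example.com`), so the sketch accepts subdomains explicitly.

```javascript
const { URL } = require("url");

// Hypothetical helper (not part of simplecrawler or URI.js):
// true when `url` is on `hostDomain` or on one of its subdomains.
function isOnHostDomain(url, hostDomain) {
    let hostname;
    try {
        hostname = new URL(url).hostname;
    } catch (err) {
        return false; // unparsable URL: treat as not on the host domain
    }
    return hostname === hostDomain || hostname.endsWith("." + hostDomain);
}

// Hypothetical factory: builds a synchronous fetch condition that only
// allows items whose *referrer* is on the crawled domain. As in the
// snippet above, this lets direct external links through, because the
// page linking to them is internal.
function makeFetchCondition(hostDomain) {
    return (queueItem, referrerQueueItem) =>
        isOnHostDomain(referrerQueueItem.url, hostDomain);
}

module.exports = { isOnHostDomain, makeFetchCondition };
```

Assuming the crawler accepts a synchronous two-argument condition as in the snippet above, it would be registered with `crawler.addFetchCondition(makeFetchCondition(hostDomain))`. Because the check looks at the referrer rather than the queued item, external hosts get requested too.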
Kindly let me know if any further info is required.
Thanks
Do you have a snippet to reproduce the error? I was not able to reproduce this with Node 8.9.3 on Fedora 27 (Linux). You might want to inspect the value of this error to find the actual reason for the failure.
{ Error: incorrect header check
1|www | at Zlib._handle.onerror (zlib.js:370:17) errno: -3, code: 'Z_DATA_ERROR' }
1|www | Error: Couldn't unzip robots.txt response body
1|www | at decodeAndReturnResponse (/home/vagrant/node/node_modules/simplecrawler/lib/crawler.js:633:27)
1|www | at Unzip.onError (zlib.js:212:5)
1|www | at emitOne (events.js:96:13)
1|www | at Unzip.emit (events.js:188:7)
1|www | at Zlib._handle.onerror (zlib.js:373:10)
The robots.txt that causes the error isn't from the site that is originally being crawled. It's from an external site that is linked from that site. The fetch condition I use checks the referrer and allows fetching of direct external links.
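The `Z_DATA_ERROR` in the trace above can be reproduced in isolation with Node's `zlib` module. The sketch below assumes, without confirmation from this thread, that the external server sent a robots.txt body that was not actually compressed (or was compressed in a different format than expected). The helper name `tryUnzip` and the sample robots.txt content are illustrative only.

```javascript
const zlib = require("zlib");

// Hypothetical helper: attempts to unzip a buffer and reports the
// outcome instead of throwing, so the failure mode can be inspected.
function tryUnzip(body) {
    try {
        return { ok: true, text: zlib.unzipSync(body).toString("utf8") };
    } catch (err) {
        return { ok: false, code: err.code, message: err.message };
    }
}

// Made-up robots.txt content for the demonstration.
const plain = Buffer.from("User-agent: *\nDisallow:\n");
const gzipped = zlib.gzipSync(plain);

// A genuinely gzipped body decodes fine.
console.log(tryUnzip(gzipped));

// A plain-text body passed to unzip fails with code 'Z_DATA_ERROR',
// the same error code that appears in the trace above.
console.log(tryUnzip(plain));
```

If this is what happened, the response headers of the failing robots.txt would be worth checking, in particular whether `Content-Encoding` matches what the body really contains.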
Hi,
On some sites, the crawler hangs after throwing the following error:
Site on which it happened: