
Crawler hangs after several rounds of fetchdataerror #214

Closed
tibetty opened this issue Feb 10, 2016 · 11 comments
@tibetty

tibetty commented Feb 10, 2016

var Crawler = require('simplecrawler');

const webSite = 'www.jnto.go.jp';
const protocol = 'http';
const port = 80;
const initialPath = '/eng';

var crawler = new Crawler(webSite);
crawler.initialProtocol = protocol;
crawler.initialPort = port;
crawler.initialPath = initialPath;
crawler.maxDepth = 3;

crawler
.on('fetchcomplete', function(queueItem, responseBuffer, response) {
    console.log('Received %s (%d bytes)', queueItem.url, responseBuffer.length);
})
.on('fetcherror', function(queueItem, response) {
    console.log('Error #%d (%s) happened when retrieve %s', response.statusCode, response.statusMessage, queueItem.url);
})
.on('fetchdataerror', function(queueItem, response) {
    console.log('Data error happened when retrieve %s', queueItem.url);
})
.start();

Again, I ran the above code to test simplecrawler, and it froze after reporting several errors in a row and never recovered.

Error #301 (Moved Permanently) happened when retrieve http://www.jnto.go.jp/weather/eng/
Data error happened when retrieve http://www.jnto.go.jp/eng/pdf/regional/tohoku/fukushima.pdf
Data error happened when retrieve http://www.jnto.go.jp/eng/pdf/regional/chubu/takayama.pdf
Data error happened when retrieve http://www.jnto.go.jp/philippines/pdf/BROCHURE%20Final_philippines_20151217.pdf
Error #403 (Forbidden) happened when retrieve http://www.jnto.go.jp/jpn/member_logins/members_service/
Data error happened when retrieve http://www.jnto.go.jp/jpn/pdf/jnto_50aniv_pamphlet.pdf

This might point to a design/implementation flaw (in our code or in a dependency) worth analyzing, because js-crawler was able to finish crawling the same website (depth = 2), although it returned wrong data for binary resources. I tried to do a little detective work by adding exception handling, but nothing was reported, so it's all up to you, simplecrawler maintainers!
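
For reference, roughly what I tried (a minimal sketch; just process-level handlers, nothing simplecrawler-specific), and nothing was ever printed from it:

// Process-level handlers to surface any silent failures in the crawler
// or its dependencies.
process.on('uncaughtException', function(err) {
    console.error('Uncaught exception:', err.stack || err);
});

process.on('unhandledRejection', function(reason) {
    console.error('Unhandled rejection:', reason);
});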

Regards,
tibetty

@fredrikekelund
Collaborator

Thanks for reporting! I think I know what the problem is - I forgot to "close" the queueItem properly. I'll fix it later today!

@fredrikekelund
Collaborator

@tibetty give it another try and see if it behaves as expected now!

@tibetty
Author

tibetty commented Feb 11, 2016

Yup, the new version correctly emits the 'fetchdataerror' event, but my simple test suite still hangs after a few consecutive errors. I'll stick to this thread until the issue is fully rooted out.

@fredrikekelund
Collaborator

It seems there are about 5,300 pages on that site (according to Google), excluding resources like images and JS. Are you sure you're not just missing, e.g., redirect or timeout events? The crawler might stay silent for a while if you're not listening to all of them.

I ran this for 30 minutes without any problems (even though I got 4 fetchdataerror events). I didn't have time to let it run to completion, but are you sure it's simplecrawler that hangs?

var Crawler = require("simplecrawler");

var crawler = new Crawler("www.jnto.go.jp");
crawler.initialPath = "/eng";
crawler.maxConcurrency = 5;
crawler.timeout = 60 * 1000;


// Watchdog: if nothing has been logged for `freezeAfter` ms, assume the
// crawler has hung and exit.
var freezeAfter = 5 * 60 * 1000,
    lastRunMsAgo = 0;

// Timestamped logger that also resets the watchdog counter.
var log = function () {
    lastRunMsAgo = 0;

    var time = /(\d+:\d+:\d+)/.exec(new Date().toString())[1];
    Array.prototype.unshift.call(arguments, time);
    console.log.apply(console, arguments);
};

var stopScraperInterval = setInterval(function () {
    lastRunMsAgo += 100;

    if (lastRunMsAgo >= freezeAfter) {
        clearInterval(stopScraperInterval);
        log("Stopping crawler because of timeout after", freezeAfter / 1000, "seconds");
        process.exit();
    }
}, 100);

crawler
    .on("fetchcomplete", function (queueItem, responseBuffer, response) {
        log("Done:", queueItem.url);
    })
    .on("fetchdataerror", function (queueItem) {
        log("Data error:", queueItem.url);
    })
    .on("fetch404", function (queueItem, responseBuffer) {
        log("404:", queueItem.url);
    })
    .on("fetcherror", function (queueItem, responseBuffer) {
        log("Error:", queueItem.url);
    })
    .on("fetchtimeout", function (queueItem, timeoutVal) {
        log("Timeout:", queueItem.url);
    })
    .on("fetchredirect", function (queueItem, parsedUrl, response) {
        log("Redirect", queueItem.url, response.headers.location);
    })
    .on("complete", function () {
        clearInterval(stopScraperInterval);
        console.timeEnd("Crawl");
        log("Finished scraping!");
    });

console.time("Crawl");
crawler.start();

@tibetty
Author

tibetty commented Feb 12, 2016

You are right, @fredrikekelund! I intercepted 'fetchstart' last night, and the log showed that simplecrawler worked pretty well, so there might be something wrong with either the upstream dependencies (http/request) or the website itself. Kudos to you!
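
Roughly what I added (a minimal sketch; it assumes the 'fetchstart' handler receives the queue item as its first argument):

// Log every fetch as it starts, to tell a slow-but-working crawler
// apart from a hung one.
crawler.on('fetchstart', function(queueItem) {
    console.log('Fetching:', queueItem.url);
});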

@fredrikekelund
Collaborator

Nice to hear! But just to make sure we cracked this, what problem are you still experiencing (the one you say might lie in the website itself)?

@tmpfs
Contributor

tmpfs commented Feb 14, 2016

I am interested in whether this works correctly with redirects. Using 0.6.1 from the registry, I got it to work with:

crawler.on('fetchredirect', function(item, url) {
  // Manually re-queue the redirect target
  crawler.queue.add(url.protocol, url.host, url.port, url.path);
});

I believe this functionality should be configurable too (followRedirects, maybe). Is the above solution OK? Would the above code break when this issue is fixed?

Edit: Please do not use the above code; it has issues with triggering fetch conditions, and redirects on the same host already work as expected.

@fredrikekelund
Collaborator

@tmpfs did you manage to resolve the issue you were having?

@tmpfs
Contributor

tmpfs commented Feb 16, 2016

@fredrikekelund, yes, I created an isolated test that verifies that 301 and 302 redirects are followed on the same domain:

https://github.com/tmpfs/linkdown/blob/master/test/standalone/redirect.js

I will edit the above comment to note that it should not be used.
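
In outline, the test does something like the following (a minimal sketch, not the actual file; it assumes a local server issuing a 301 on the same host and that 'fetchcomplete' fires for the final URL):

var http = require('http');
var Crawler = require('simplecrawler');

// Local server: /start redirects (301) to /final on the same host.
var server = http.createServer(function(req, res) {
    if (req.url === '/start') {
        res.writeHead(301, { Location: 'http://localhost:8080/final' });
        res.end();
    } else {
        res.writeHead(200, { 'Content-Type': 'text/html' });
        res.end('<p>done</p>');
    }
});

server.listen(8080, function() {
    var crawler = new Crawler('localhost');
    crawler.initialPort = 8080;
    crawler.initialPath = '/start';

    crawler.on('fetchcomplete', function(queueItem) {
        // If same-host redirects are followed, /final should show up here.
        console.log('Fetched:', queueItem.url);
    });

    crawler.on('complete', function() {
        server.close();
    });

    crawler.start();
});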

@fredrikekelund
Collaborator

@tmpfs cool! 👍

@fredrikekelund
Collaborator

@tibetty I'll close this ticket. If you experience any further issues, don't hesitate to open a new one!
