
Crawler hangs after several rounds of fetchdataerror #214

Closed
tibetty opened this issue Feb 10, 2016 · 11 comments
@tibetty

tibetty commented Feb 10, 2016

var Crawler = require('simplecrawler');

const webSite = 'www.jnto.go.jp';
const protocol = 'http';
const port = 80;
const initialPath = '/eng';

var crawler = new Crawler(webSite);
crawler.initialProtocol = protocol;
crawler.initialPort = port;
crawler.initialPath = initialPath;
crawler.maxDepth = 3;

crawler
.on('fetchcomplete', function(queueItem, responseBuffer, response) {
    console.log('Received %s (%d bytes)', queueItem.url, responseBuffer.length);
})
.on('fetcherror', function(queueItem, response) {
    console.log('Error #%d (%s) happened when retrieve %s', response.statusCode, response.statusMessage, queueItem.url);
})
.on('fetchdataerror', function(queueItem, response) {
    console.log('Data error happened when retrieve %s', queueItem.url);
})
.start();

Again, I ran the above code to test simplecrawler, and it froze after reporting several errors in a row and never recovered.

Error #301 (Moved Permanently) happened when retrieve http://www.jnto.go.jp/weather/eng/
Data error happened when retrieve http://www.jnto.go.jp/eng/pdf/regional/tohoku/fukushima.pdf
Data error happened when retrieve http://www.jnto.go.jp/eng/pdf/regional/chubu/takayama.pdf
Data error happened when retrieve http://www.jnto.go.jp/philippines/pdf/BROCHURE%20Final_philippines_20151217.pdf
Error #403 (Forbidden) happened when retrieve http://www.jnto.go.jp/jpn/member_logins/members_service/
Data error happened when retrieve http://www.jnto.go.jp/jpn/pdf/jnto_50aniv_pamphlet.pdf

This might point to a design/implementation flaw (in our code or in a dependency) worth analyzing, because js-crawler was able to finish crawling the same website (depth = 2), although it returned wrong data for binary resources. I tried to do a little detective work by adding exception handling, but nothing was reported, so it's all up to you, simplecrawler maintainers!
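
For reference, roughly what I tried (a minimal sketch; just process-level handlers, nothing simplecrawler-specific), and nothing was ever printed from it:

// Process-level handlers to surface any silent failures in the crawler
// or its dependencies.
process.on('uncaughtException', function(err) {
    console.error('Uncaught exception:', err.stack || err);
});

process.on('unhandledRejection', function(reason) {
    console.error('Unhandled rejection:', reason);
});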

Regards,
tibetty

@fredrikekelund
Collaborator

Thanks for reporting! I think I know what the problem is - I forgot to "close" the queueItem properly. I'll fix it later today!

@fredrikekelund
Collaborator

@tibetty give it another try and see if it behaves as expected now!

@tibetty
Author

tibetty commented Feb 11, 2016

Yup, the new version correctly emits the 'fetchdataerror' event, but my simple test suite still hangs after a few consecutive errors. I'll stick to this thread until the issue is fully rooted out.

@fredrikekelund
Collaborator

It seems there are about 5,300 pages on that site (according to Google), excluding resources like images and JS. Are you sure you're not just missing, e.g., redirect or timeout events? The crawler might stay silent for a while if you're not listening to all of them.

I ran this for 30 minutes without any problems (even though I got 4 fetchdataerror events). I didn't have time to let it run to completion, but are you sure it's simplecrawler that hangs?

var Crawler = require("simplecrawler");

var crawler = new Crawler("www.jnto.go.jp");
crawler.initialPath = "/eng";
crawler.maxConcurrency = 5;
crawler.timeout = 60 * 1000;


// Watchdog: if nothing has been logged for `freezeAfter` ms, assume the
// crawler has hung and exit.
var freezeAfter = 5 * 60 * 1000,
    lastRunMsAgo = 0;

// Timestamped logger that also resets the watchdog counter.
var log = function () {
    lastRunMsAgo = 0;

    var time = /(\d+:\d+:\d+)/.exec(new Date().toString())[1];
    Array.prototype.unshift.call(arguments, time);
    console.log.apply(console, arguments);
};

var stopScraperInterval = setInterval(function () {
    lastRunMsAgo += 100;

    if (lastRunMsAgo >= freezeAfter) {
        clearInterval(stopScraperInterval);
        log("Stopping crawler because of timeout after", freezeAfter / 1000, "seconds");
        process.exit();
    }
}, 100);

crawler
    .on("fetchcomplete", function (queueItem, responseBuffer, response) {
        log("Done:", queueItem.url);
    })
    .on("fetchdataerror", function (queueItem) {
        log("Data error:", queueItem.url);
    })
    .on("fetch404", function (queueItem, responseBuffer) {
        log("404:", queueItem.url);
    })
    .on("fetcherror", function (queueItem, responseBuffer) {
        log("Error:", queueItem.url);
    })
    .on("fetchtimeout", function (queueItem, timeoutVal) {
        log("Timeout:", queueItem.url);
    })
    .on("fetchredirect", function (queueItem, parsedUrl, response) {
        log("Redirect", queueItem.url, response.headers.location);
    })
    .on("complete", function () {
        clearInterval(stopScraperInterval);
        console.timeEnd("Crawl");
        log("Finished scraping!");
    });

console.time("Crawl");
crawler.start();

@tibetty
Author

tibetty commented Feb 12, 2016

You are right, @fredrikekelund! I intercepted 'fetchstart' last night, and the log showed that simplecrawler worked pretty well, so there might be something wrong with either the upstream dependencies (http/request) or the website itself. Kudos to you!
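
Roughly what I added (a minimal sketch; it assumes the 'fetchstart' handler receives the queue item as its first argument):

// Log every fetch as it starts, to tell a slow-but-working crawler
// apart from a hung one.
crawler.on('fetchstart', function(queueItem) {
    console.log('Fetching:', queueItem.url);
});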

@fredrikekelund
Collaborator

Nice to hear! But just to make sure we cracked this, what problem are you still experiencing (the one you say might lie in the website itself)?

@tmpfs
Contributor

tmpfs commented Feb 14, 2016

I am interested in whether this works correctly with redirects. Using 0.6.1 from the registry, I got it to work with:

crawler.on('fetchredirect', function(item, url) {
  // Manually re-queue the redirect target
  crawler.queue.add(url.protocol, url.host, url.port, url.path);
});

I believe this functionality should be configurable too (followRedirects, maybe). Is the above solution OK? Would the above code break when this issue is fixed?

Edit: Please do not use the above code; it has issues with triggering fetch conditions, and redirects on the same host already work as expected.

@fredrikekelund
Collaborator

@tmpfs did you manage to resolve the issue you were having?

@tmpfs
Contributor

tmpfs commented Feb 16, 2016

@fredrikekelund, yes, I created an isolated test that verifies that 301 and 302 redirects are followed on the same domain:

https://github.com/tmpfs/linkdown/blob/master/test/standalone/redirect.js

I will edit the above comment to note that it should not be used.
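
In outline, the test does something like the following (a minimal sketch, not the actual file; it assumes a local server issuing a 301 on the same host and that 'fetchcomplete' fires for the final URL):

var http = require('http');
var Crawler = require('simplecrawler');

// Local server: /start redirects (301) to /final on the same host.
var server = http.createServer(function(req, res) {
    if (req.url === '/start') {
        res.writeHead(301, { Location: 'http://localhost:8080/final' });
        res.end();
    } else {
        res.writeHead(200, { 'Content-Type': 'text/html' });
        res.end('<p>done</p>');
    }
});

server.listen(8080, function() {
    var crawler = new Crawler('localhost');
    crawler.initialPort = 8080;
    crawler.initialPath = '/start';

    crawler.on('fetchcomplete', function(queueItem) {
        // If same-host redirects are followed, /final should show up here.
        console.log('Fetched:', queueItem.url);
    });

    crawler.on('complete', function() {
        server.close();
    });

    crawler.start();
});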

@fredrikekelund
Collaborator

@tmpfs cool! 👍

@fredrikekelund
Collaborator

@tibetty I'll close this ticket. If you experience any further issues, don't hesitate to open a new one!
