
Crawler crashes on faulty href values #385

Closed

konstantinblaesi opened this issue Jul 26, 2017 · 0 comments

konstantinblaesi commented Jul 26, 2017

What happened?

The crawler crashes because of this uncaught exception:

TypeError: invalid input
    at URI.p.href (***\node_modules\urijs\src\URI.js:1183:13)
    at new URI (***\node_modules\urijs\src\URI.js:70:10)
    at URI (***\node_modules\urijs\src\URI.js:46:16)
    at ***\node_modules\simplecrawler\lib\crawler.js:1737:28
    at FetchQueue.oldestUnfetchedItem (***\node_modules\simplecrawler\lib\queue.js:250:13)
    at Crawler.crawl (***\node_modules\simplecrawler\lib\crawler.js:1729:19)
    at ontimeout (timers.js:386:14)
    at tryOnTimeout (timers.js:250:5)
    at Timer.listOnTimeout (timers.js:214:5)

The error happens here:

https://github.com/cgiffard/node-simplecrawler/blob/v1.1.4/lib/crawler.js#L1738

What should have happened?

The crawler should have skipped the faulty URL without crashing.

Steps to reproduce the problem

Crawl any page that has a faulty href such as
<a href="http://">http://</a>

or use this snippet to reproduce the problem:

const uri = require('urijs');

// minimal stand-in for a simplecrawler queue item with a faulty URL
const queueItem = {
    url: "http:///"
};

const url = uri(queueItem.url).normalize();

// rebuilding the host URI throws the "TypeError: invalid input"
// shown in the stack trace above
const host = uri({
    protocol: url.protocol(),
    hostname: url.hostname(),
    port: url.port()
}).href();

Would you filter bad URLs in Crawler.cleanExpandResources()?
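
For illustration, here is a minimal sketch of the kind of guard I have in mind, assuming the host construction from the snippet above can simply be wrapped in a try/catch (safeHost is a hypothetical helper name, not part of the simplecrawler API):

const uri = require('urijs');

// Hypothetical helper: returns the normalized host for a queue item,
// or null when urijs rejects the URL (e.g. "http:///").
function safeHost(queueItem) {
    try {
        const url = uri(queueItem.url).normalize();
        return uri({
            protocol: url.protocol(),
            hostname: url.hostname(),
            port: url.port()
        }).href();
    } catch (error) {
        // urijs throws "TypeError: invalid input" on malformed URLs;
        // returning null lets the caller skip the item instead of crashing
        return null;
    }
}

The crawler could then skip any queue item for which safeHost() returns null, rather than letting the exception bubble up out of the timer callback.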
