
Crawler crashes on faulty href values #385

Closed

konstantinblaesi opened this issue Jul 26, 2017 · 0 comments

konstantinblaesi commented Jul 26, 2017

What happened?

The crawler crashes because of this uncaught exception:

TypeError: invalid input
    at URI.p.href (***\node_modules\urijs\src\URI.js:1183:13)
    at new URI (***\node_modules\urijs\src\URI.js:70:10)
    at URI (***\node_modules\urijs\src\URI.js:46:16)
    at ***\node_modules\simplecrawler\lib\crawler.js:1737:28
    at FetchQueue.oldestUnfetchedItem (***\node_modules\simplecrawler\lib\queue.js:250:13)
    at Crawler.crawl (***\node_modules\simplecrawler\lib\crawler.js:1729:19)
    at ontimeout (timers.js:386:14)
    at tryOnTimeout (timers.js:250:5)
    at Timer.listOnTimeout (timers.js:214:5)

The error happens here:

https://github.com/cgiffard/node-simplecrawler/blob/v1.1.4/lib/crawler.js#L1738

What should have happened?

The crawler should have skipped the faulty URL without crashing.

Steps to reproduce the problem

Crawl any page that has a faulty href such as
<a href="http://">http://</a>

or use this snippet to reproduce the problem:

const uri = require('urijs');

// minimal stand-in for a simplecrawler queue item with a faulty URL
const queueItem = {
    url: "http:///"
};

const url = uri(queueItem.url).normalize();

// rebuilding the host URI throws the "TypeError: invalid input"
// shown in the stack trace above
const host = uri({
    protocol: url.protocol(),
    hostname: url.hostname(),
    port: url.port()
}).href();

Would you filter bad URLs in Crawler.cleanExpandResources()?
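
For illustration, here is a minimal sketch of the kind of guard I have in mind, assuming the host construction from the snippet above can simply be wrapped in a try/catch (safeHost is a hypothetical helper name, not part of the simplecrawler API):

const uri = require('urijs');

// Hypothetical helper: returns the normalized host for a queue item,
// or null when urijs rejects the URL (e.g. "http:///").
function safeHost(queueItem) {
    try {
        const url = uri(queueItem.url).normalize();
        return uri({
            protocol: url.protocol(),
            hostname: url.hostname(),
            port: url.port()
        }).href();
    } catch (error) {
        // urijs throws "TypeError: invalid input" on malformed URLs;
        // returning null lets the caller skip the item instead of crashing
        return null;
    }
}

The crawler could then skip any queue item for which safeHost() returns null, rather than letting the exception bubble up out of the timer callback.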
