This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

Links getting skipped due to escape sequence in href #439

Closed
braj1999 opened this issue Oct 3, 2018 · 4 comments

Comments

@braj1999

braj1999 commented Oct 3, 2018

What happened?

Hrefs that contain escape characters (HTML entities) are being skipped (not followed).

What should have happened?

The crawler should follow links that contain escape characters.

Steps to reproduce the problem

Please crawl this website: https://rust-belt-rust.com/
The links in the header are being skipped. For example, https://rust-belt-rust.com/past/ is not crawled because of the escape sequences in its href.

The problem seems to be in the cleanURL function here https://github.com/simplecrawler/simplecrawler/blob/master/lib/crawler.js#L818
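To make the failure mode concrete, here is a minimal, hypothetical illustration (assuming the site encodes the slashes in its hrefs as the HTML entity `&#x2F;`, which is what the crawler fails to decode):

```javascript
// Hypothetical example: an href as it might appear in the page source,
// with "/" encoded as the HTML entity "&#x2F;".
const rawHref = "https:&#x2F;&#x2F;rust-belt-rust.com&#x2F;past&#x2F;";

// Decoding the entity back to "/" yields the URL the crawler should follow.
const decodedHref = rawHref.replace(/&#x2F;/gi, "/");
console.log(decodedHref); // https://rust-belt-rust.com/past/
```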

@braj1999
Author

@kbychkov Could you please advise us on this?

@kbychkov
Contributor

@braj1999, just curious, is there a reason why these links are encoded as entities?

@braj1999
Author

@kbychkov It's the website owner who decided to encode them; it seems the website isn't following the standard.

@kbychkov
Contributor

I'm not sure we should fix the cleanURL function. I tend to avoid writing patches for every specific case, especially when the problem can be solved another way. In this issue the problem is really about discovering resources: it looks like the crawler.discoverRegex rules don't handle this case well. The default rules are just an example and can easily be redefined. Thus, you could add a new rule for this specific case, or even completely rewrite resource discovery using cheerio, for example.

const Crawler = require("simplecrawler");

const crawler = new Crawler("http://conf2018.rust-belt-rust.com");

crawler.maxDepth = 2;

crawler.discoverRegex.push(string => {
  const result = string.match(/\s(?:href|src)\s*=\s*("|').*?\1/gi);
  return Array.isArray(result)
    ? result.map(item => item.replace(/&#x2F;/gi, "/"))
    : undefined;
});

crawler.on("fetchheaders", queueItem => {
  console.log(queueItem.stateData.code, queueItem.url);
});

crawler.start();
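As a quick sanity check, the pushed discovery function can also be called directly against a snippet of markup, outside the crawler. This is a standalone sketch (the function name and the sample markup here are illustrative, not part of simplecrawler's API):

```javascript
// Standalone sketch of the discovery function pushed onto discoverRegex
// above: it matches href/src attributes and decodes the "&#x2F;" entity
// back to "/".
const discoverEncodedLinks = string => {
  const result = string.match(/\s(?:href|src)\s*=\s*("|').*?\1/gi);
  return Array.isArray(result)
    ? result.map(item => item.replace(/&#x2F;/gi, "/"))
    : undefined;
};

// Illustrative markup resembling the entity-encoded header links.
const html = '<a href="https:&#x2F;&#x2F;rust-belt-rust.com&#x2F;past&#x2F;">Past</a>';
console.log(discoverEncodedLinks(html));
// [ ' href="https://rust-belt-rust.com/past/"' ]
```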
