This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

Links getting skipped due to escape sequence in href #439

Closed
braj1999 opened this issue Oct 3, 2018 · 4 comments

Comments

@braj1999

braj1999 commented Oct 3, 2018

What happened?

Hrefs that contain escape characters (HTML entities) are being skipped (not followed).

What should have happened?

The crawler should follow links that contain escape characters.

Steps to reproduce the problem

Please crawl this website: https://rust-belt-rust.com/
The links in the header are being skipped. For example, https://rust-belt-rust.com/past/ is not crawled because of the escape sequences in its href.

The problem seems to be in the cleanURL function here https://github.com/simplecrawler/simplecrawler/blob/master/lib/crawler.js#L818
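To make the failure mode concrete, here is a minimal, hypothetical illustration (assuming the site encodes the slashes in its hrefs as the HTML entity `&#x2F;`, which is what the crawler fails to decode):

```javascript
// Hypothetical example: an href as it might appear in the page source,
// with "/" encoded as the HTML entity "&#x2F;".
const rawHref = "https:&#x2F;&#x2F;rust-belt-rust.com&#x2F;past&#x2F;";

// Decoding the entity back to "/" yields the URL the crawler should follow.
const decodedHref = rawHref.replace(/&#x2F;/gi, "/");
console.log(decodedHref); // https://rust-belt-rust.com/past/
```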

@braj1999
Author

@kbychkov Could you please advise us on this?

@kbychkov
Contributor

@braj1999, just curious, is there a reason why these links are encoded as entities?

@braj1999
Author

@kbychkov It's the website owner who decided to encode them; it seems the website isn't following the standard.

@kbychkov
Contributor

I'm not sure we should fix the cleanURL function. I tend to avoid writing patches for every specific case, especially when the problem can be solved another way. In this issue the problem is really about discovering resources: it looks like the crawler.discoverRegex rules don't handle this case well. The default rules are just an example and can easily be redefined. Thus, you could add a new rule for this specific case, or even completely rewrite resource discovery using cheerio, for example.

const Crawler = require("simplecrawler");

const crawler = new Crawler("http://conf2018.rust-belt-rust.com");

crawler.maxDepth = 2;

crawler.discoverRegex.push(string => {
  const result = string.match(/\s(?:href|src)\s*=\s*("|').*?\1/gi);
  return Array.isArray(result)
    ? result.map(item => item.replace(/&#x2F;/gi, "/"))
    : undefined;
});

crawler.on("fetchheaders", queueItem => {
  console.log(queueItem.stateData.code, queueItem.url);
});

crawler.start();
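As a quick sanity check, the pushed discovery function can also be called directly against a snippet of markup, outside the crawler. This is a standalone sketch (the function name and the sample markup here are illustrative, not part of simplecrawler's API):

```javascript
// Standalone sketch of the discovery function pushed onto discoverRegex
// above: it matches href/src attributes and decodes the "&#x2F;" entity
// back to "/".
const discoverEncodedLinks = string => {
  const result = string.match(/\s(?:href|src)\s*=\s*("|').*?\1/gi);
  return Array.isArray(result)
    ? result.map(item => item.replace(/&#x2F;/gi, "/"))
    : undefined;
};

// Illustrative markup resembling the entity-encoded header links.
const html = '<a href="https:&#x2F;&#x2F;rust-belt-rust.com&#x2F;past&#x2F;">Past</a>';
console.log(discoverEncodedLinks(html));
// [ ' href="https://rust-belt-rust.com/past/"' ]
```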
