This repository has been archived by the owner on Mar 7, 2021. It is now read-only.
I'm not sure whether we should fix the cleanURL function. I tend to avoid writing patches for every specific case, especially when the problem can be solved another way. The problem in this issue is really about discovering resources: it looks like the crawler.discoverRegex rules don't do the job well here. The default rules are just an example and can easily be redefined, so you could add a new rule for this specific case, or even rewrite resource discovery completely using cheeriojs, for example.
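As a rough illustration of that suggestion, here is a minimal sketch of a custom discovery step that decodes HTML character references in href values before the links are handed to the crawler. It assumes the "escape characters" in the report are entity-encoded characters such as `&#x2F;` for `/`, which the issue does not state explicitly. The helper names `decodeEntities` and `discoverLinks` are hypothetical, and the regex-based extraction is only a stand-in for a real HTML parser such as cheerio.

```javascript
// Decode numeric HTML character references such as "&#x2F;" or "&#47;",
// plus a handful of common named ones. This is not a full entity decoder.
function decodeEntities(text) {
    const named = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };
    return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, function (match, body) {
        if (body[0] === "#") {
            const isHex = body[1] === "x" || body[1] === "X";
            const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
            // Leave out-of-range references untouched.
            return code <= 0x10ffff ? String.fromCodePoint(code) : match;
        }
        const key = body.toLowerCase();
        // Unknown named references are left as they are.
        return Object.prototype.hasOwnProperty.call(named, key) ? named[key] : match;
    });
}

// Hypothetical helper: collect quoted href values from an HTML string
// and decode any character references they contain.
function discoverLinks(html) {
    const links = [];
    const hrefPattern = /\shref\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
    let m;
    while ((m = hrefPattern.exec(html)) !== null) {
        const raw = m[1] !== undefined ? m[1] : m[2];
        links.push(decodeEntities(raw).trim());
    }
    return links;
}

// Possible wiring, assuming the crawler's discoverResources hook is overridden:
// crawler.discoverResources = function (buffer, queueItem) {
//     return discoverLinks(buffer.toString("utf8"));
// };
```

With this approach the decoding lives in the discovery step, so cleanURL does not need a special case for entity-encoded paths.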
What happened?
The hrefs having escape characters in them are getting skipped (not getting followed).
What should have happened?
The crawler must follow the links which have escape characters in them.
Steps to reproduce the problem
Please crawl this website: https://rust-belt-rust.com/
The links in the header are getting skipped. For example, the link https://rust-belt-rust.com/past/ is not getting crawled because of the escape sequence in it.
The problem seems to be in the cleanURL function here: https://github.com/simplecrawler/simplecrawler/blob/master/lib/crawler.js#L818