urlFilter terminates recursive scraping #460

Closed
jkanel opened this issue Sep 18, 2021 · 2 comments

jkanel commented Sep 18, 2021

See this discussion thread for more detail.

ISSUE: When the rootUrl does not match the urlFilter criteria, recursive scraping terminates at the root page.

RECOMMENDED SOLUTION: Do not apply the urlFilter to the rootUrl, or add an option to exclude the rootUrl from scraping.

DESIRED: I'm specifying a rootUrl and would like the scraper to recurse through all hyperlinks. The rootUrl will not be downloaded in this scenario. When the scraper finds a hyperlink ending in .abc it should download the file.

ACTUAL: The rootUrl (see code below) does not meet the urlFilter criteria and the scraper stops with no recursion. The scraper should find a hyperlink to http://trillian.mit.edu/~jc/music/book/SCD/Book45.abc in the rootUrl among other .abc urls, but it does not. Note that when I set the rootUrl equal to an .abc url, e.g. the example above, the file downloads as expected.

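// Assumes website-scraper v4, which is loaded with require (CommonJS).
const scrape = require("website-scraper");

const rootUrl = "http://trillian.mit.edu/~jc/music/book/SCD"

scrape({
  urls: rootUrl,
  recursive: true,
  maxRecursiveDepth: 5,
  // Accept only URLs that end in ".abc".
  urlFilter: function(url) {
    return /\.abc$/.test(url);
  },
  // savePath is not shown in the report; it is defined elsewhere.
  directory: savePath
});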
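
One possible user-side workaround, sketched here under the assumption that the filter really is consulted for the root URL as this report describes: wrap the filter so the starting URL always passes. The names `exemptRoot` and `isAbc` are hypothetical and not part of website-scraper. Note the limits of this sketch: the root page itself would then be accepted by the filter, and intermediate pages that do not end in .abc are still rejected, so at best it reaches links found directly on the root page.

```javascript
// Hypothetical helper, not part of website-scraper: wraps a filter so that
// the starting URL always passes, whatever the wrapped filter returns.
function exemptRoot(rootUrl, filter) {
  // Compare without trailing slashes so ".../SCD" and ".../SCD/" match.
  const normalize = (u) => u.replace(/\/+$/, "");
  const root = normalize(rootUrl);
  return (url) => normalize(url) === root || Boolean(filter(url));
}

const rootUrl = "http://trillian.mit.edu/~jc/music/book/SCD";

// Same criterion as in the report: accept URLs ending in ".abc".
const isAbc = (url) => /\.abc$/.test(url);

const urlFilter = exemptRoot(rootUrl, isAbc);

console.log(urlFilter(rootUrl));                 // root URL is exempt
console.log(urlFilter(rootUrl + "/Book45.abc")); // matches the .abc criterion
console.log(urlFilter("http://trillian.mit.edu/~jc/music/index.html"));
```

The resulting `urlFilter` could be passed in place of the inline function in the options above. Whether that restores recursion depends on how the installed version applies the filter, which this sketch does not verify.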
jkanel changed the title from "urlFilter causes the" to "urlFilter terminates recursive scraping" on Sep 18, 2021
s0ph1e added a commit that referenced this issue Oct 18, 2021
s0ph1e closed this as completed in 461c6ac Oct 18, 2021

s0ph1e commented Oct 18, 2021

The fix for this issue will be released in the next version, most likely 5.0.0.


jkanel commented Oct 18, 2021 via email
