urlFilter terminates recursive scraping #460

jkanel · 2021-09-18T16:31:51Z

See this discussion thread for more detail.

ISSUE: When the rootUrl does not match the urlFilter criteria, recursive scraping terminates at the root page.

RECOMMENDED SOLUTION: Do not apply the urlFilter to the rootUrl; or add an option that excludes the rootUrl itself from being downloaded.

DESIRED: I'm specifying a rootUrl and would like the scraper to recurse through all hyperlinks. The rootUrl will not be downloaded in this scenario. When the scraper finds a hyperlink ending in .abc it should download the file.

ACTUAL: The rootUrl (see code below) does not meet the urlFilter criteria and the scraper stops with no recursion. The scraper should find a hyperlink to http://trillian.mit.edu/~jc/music/book/SCD/Book45.abc in the rootUrl among other .abc urls, but it does not. Note that when I set the rootUrl equal to an .abc url, e.g. the example above, the file downloads as expected.

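// Assumes website-scraper v4, which is loaded with require (CommonJS).
const scrape = require("website-scraper");

const rootUrl = "http://trillian.mit.edu/~jc/music/book/SCD";

scrape({
  urls: [rootUrl],
  recursive: true,
  maxRecursiveDepth: 5,
  // Keep only URLs that end in .abc; the rootUrl itself does not match.
  urlFilter: function (url) {
    return /\.abc$/.test(url);
  },
  // savePath is defined elsewhere in the reporter's code.
  directory: savePath
});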
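
Until a fixed release is available, one possible workaround is to let the root URL through the filter explicitly. The sketch below is plain JavaScript and does not call the library. `exemptRoots` and `abcOnly` are hypothetical names introduced here for illustration, and it assumes the scraper passes the root URL to urlFilter as the same string given in `urls`, apart from a possible trailing slash. That assumption has not been checked against the library's source.

```javascript
// The filter from the report, reduced to a boolean predicate.
const abcOnly = (url) => /\.abc$/.test(url);

// Hypothetical helper (not part of website-scraper): wraps a filter so that
// the listed root URLs always pass. Trailing slashes are ignored when comparing.
function exemptRoots(rootUrls, filter) {
  const stripSlash = (url) => url.replace(/\/+$/, "");
  const roots = new Set([].concat(rootUrls).map(stripSlash));
  return (url) => roots.has(stripSlash(url)) || Boolean(filter(url));
}

const rootUrl = "http://trillian.mit.edu/~jc/music/book/SCD";
const tuneUrl = "http://trillian.mit.edu/~jc/music/book/SCD/Book45.abc";

const urlFilter = exemptRoots(rootUrl, abcOnly);

console.log(abcOnly(rootUrl));   // false: the root is rejected by the original filter
console.log(urlFilter(rootUrl)); // true: the root is exempted
console.log(urlFilter(tuneUrl)); // true: .abc links still pass
console.log(urlFilter("http://trillian.mit.edu/~jc/music/book/")); // false
```

Passing `urlFilter: exemptRoots(rootUrl, abcOnly)` in the options would exempt only the root page. Intermediate pages that do not end in .abc would still be filtered out, so this covers the case described here, where the .abc links sit directly on the root page, but not deeper recursion.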

s0ph1e · 2021-10-18T15:52:30Z

The fix for this issue will be released in the next version, most probably 5.0.0.

jkanel · 2021-10-18T16:35:17Z

Great to hear! Thanks, Sophia.

jkanel changed the title ~~urlFilter causes the~~ urlFilter terminates recursive scraping Sep 18, 2021

s0ph1e added a commit that referenced this issue Oct 18, 2021

Avoid filtering out root urls, closes #460

c963864

s0ph1e closed this as completed in 461c6ac Oct 18, 2021
